Data Setup

Let's read in a new dataset, one that has some messiness to it

load("data/Student_Attributes.rda")
head(stuatt[, 1:4], 7)

  sid school_year male race_ethnicity
1   1        2004    1              B
2   1        2005    1              H
3   1        2006    1              H
4   1        2007    1              H
5   2        2006    0              W
6   2        2007    0              B
7   3        2006    1              H

What's wrong with this?

How can R help correct this?

Identify problems
Enforce business rules for messy data consistently
Build data cleaning into all analysis tasks across the workflow
Analyze inconsistencies and report on them

Strategic Data Project

The Strategic Data Project (SDP), housed at the Center for Education Policy Research at Harvard University, aims to bring high-quality research methods and data analysis to bear on strategic management and policy decisions in the education sector
SDP was formed on two fundamental premises:

Policy and management decisions can directly influence schools' and teachers' ability to improve student achievement
Valid and reliable data analysis significantly improves the quality of decision making

Their focus is on bringing together the right people, assembling the right data, and performing the right analysis because this will improve decisions made by leadership
They are smart folks who have done a lot of the important but unexciting work of systematically identifying how to clean, document, and transparently evaluate datasets

Toolkit - Data Cleaning

The SDP has put together a great tutorial and guided analyses, built around a synthetic data set, that walk through the process of cleaning data
The toolkit was written in Stata; we are porting it to R (you can contribute to this effort if you like) and will walk through just a single lesson of it here (Clean Data Building)
You can get the toolkit lesson that this tutorial is adapted from online
There are five toolkits in addition to a data guide, all incredibly helpful, so we are just touching the tip of the iceberg
Other modules include:

How to identify essential data elements for analyzing student achievement
Clean, check, and build variables in the dataset
Connect relevant datasets from different sources
Analyze datasets
Adopt coding best practices to facilitate shared and replicable data analysis

SDP Task 1 Student Attributes Intro

Drop the first_9th_school_year_reported variable

stuatt$first_9th_year_reported <- NULL

To drop a variable in R we assign NULL to it, another R quirk
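As a minimal sketch on a made-up data frame (the names `toy`, `a`, `b` and `c` are illustrative and not part of the tutorial data), assigning NULL removes a column, while assigning NULL to a name that does not exist silently does nothing, so it is worth checking names() afterwards:

```r
# Toy data frame; the column names are hypothetical
toy <- data.frame(a = 1:3, b = letters[1:3], c = c(TRUE, FALSE, TRUE))

toy$b <- NULL             # drops column b
toy$not_a_column <- NULL  # no such column: no error, nothing changes

names(toy)                # confirm which columns are left
```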

SDP Task 1 - Step 1: Consistent Gender

Is gender unique for each student?

length(unique(stuatt$sid))

[1] 21803

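# unique(x, y) treats y as `incomparables`, not as a second key,
# so we count distinct (sid, male) rows instead
nrow(unique(stuatt[, c("sid", "male")]))

[1] 21807

Nah, we have 21,803 unique students in our dataset, but 21,807 unique combinations of student and gender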

Testing Uniqueness

Below we write a small function that automates the check we did on the last slide
How does this work?

testuniqueness <- function(id, group) {
    # TRUE when each id is paired with exactly one value of group
    length(unique(id)) == nrow(unique(data.frame(id, group)))
}  # Need better varname and some optimization to the speed of this code
testuniqueness(stuatt$sid, stuatt$male)

[1] FALSE

testuniqueness(stuatt$sid, stuatt$race_ethnicity)

[1] FALSE

testuniqueness(stuatt$sid, stuatt$birth_date)

[1] FALSE
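For illustration, here is one way such a check could be written and tried on a toy data frame. It counts distinct id/value rows rather than passing two vectors to unique(), whose second positional argument is `incomparables` and not a second key. The helper name `is_constant_within` and the toy data are assumptions made for this sketch:

```r
# Hypothetical helper: TRUE when each id is paired with exactly one value
is_constant_within <- function(id, value) {
    pairs <- unique(data.frame(id = id, value = value))
    nrow(pairs) == length(unique(id))
}

# Toy data: student 2 has two different values of male
toy <- data.frame(sid = c(1, 1, 2, 2, 3), male = c(1, 1, 0, 1, 0))

is_constant_within(toy$sid, toy$male)      # student 2 is inconsistent
is_constant_within(toy$sid, toy$sid * 10)  # one value per student
```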

Messy...

Where is the data messy?

stuatt[17:21, 1:3]

   sid school_year male
17   7        2004    1
18   7        2005    1
19   7        2006    1
20   7        2007    0
21   7        2008    1

Student 7 has an inconsistently reported gender in our data
We need a business rule to handle fixing this, and a way to implement it
SDP provides the rule, R provides the systematic implementation

Unifying Consistent Gender Values

First we create a variable with the number of unique values gender takes per student
In R to do this we create a summary table of student attributes by collapsing the data set into one row per student using the plyr strategy we learned in Tutorial 3
Then we ask R to tell us how many students have each number of unique gender values

library(plyr)
sturow <- ddply(stuatt, .(sid), summarize, nvals_gender = length(unique(male)))
table(sturow$nvals_gender)


    1     2 
21799     4

So 4 students have more than one unique value for gender
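The same per-student summary can be sketched without plyr. The example below uses base R's tapply() on a toy data frame (the data are made up for illustration) to count distinct values per student, tabulate the counts, and list the students that need attention:

```r
toy <- data.frame(sid  = c(1, 1, 2, 2, 3, 3, 3),
                  male = c(1, 1, 0, 1, 0, 0, 0))

# Number of distinct values of male for each sid
nvals_gender <- tapply(toy$male, toy$sid, function(x) length(unique(x)))

table(nvals_gender)

# Which students have more than one value?
inconsistent <- names(nvals_gender)[nvals_gender > 1]
inconsistent
```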

Fixing the pesky observations

At this point there are a number of business rules we could adopt
We could assign students the most recent value, the most frequent value, or even a random value!
Let's see if replacing it with the most frequent value works
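To make the candidate rules concrete, the sketch below applies "most frequent" and "most recent" to single students' values in base R. `mode_or_dot` is a hypothetical helper written for this example; it imitates the convention seen in this tutorial, where a tie is reported as ".", and is not the eeptools implementation:

```r
# Hypothetical helper: most frequent value, or "." when there is a tie
mode_or_dot <- function(x) {
    counts <- table(x)
    top <- names(counts)[counts == max(counts)]
    if (length(top) == 1) top else "."
}

gender_a <- c(1, 1, 1, 0, 1)  # one clear mode
gender_b <- c(1, 0)           # tie, so no mode

mode_or_dot(gender_a)
mode_or_dot(gender_b)
tail(gender_b, 1)  # most recent value, assuming rows are sorted by year
```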

# A function to find the most frequent value
library(eeptools)
sturow <- ddply(stuatt, .(sid), summarize, nvals_gender = length(unique(male)), 
    gender_mode = statamode(male), gender_recent = tail(male, 1))
head(sturow[7:10, ])

   sid nvals_gender gender_mode gender_recent
7    7            2           1             1
8    8            1           1             1
9    9            1           1             1
10  10            1           1             1

Fixing observations II

Now we have two objects stuatt and sturow and we need to replace some values from stuatt with some values from sturow
merge to the rescue!
Let's merge our two data objects into a temporary data object called tempdf

tempdf <- merge(stuatt, sturow)  # merge() joins on the shared column, sid
head(tempdf[17:21, c(1, 2, 3, 10, 11)])

   sid school_year male nvals_gender gender_mode
17   7        2004    1            2           1
18   7        2005    1            2           1
19   7        2006    1            2           1
20   7        2007    0            2           1
21   7        2008    1            2           1

print(subset(tempdf[, c(1, 2, 3, 10, 11)], sid == 12506))

        sid school_year male nvals_gender gender_mode
50064 12506        2004    1            2           .
50065 12506        2005    0            2           .

The mode gives a valid value for student 7, but not for student 12506, where the two values tie
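As a small illustration of what merge() does here, the toy example below (made-up data) joins a one-row-per-student summary back onto the one-row-per-year data. By default merge() joins on the columns the two data frames share; naming the key with `by` makes that explicit:

```r
# Long data: one row per student-year
long <- data.frame(sid = c(1, 1, 2, 2, 2),
                   school_year = c(2004, 2005, 2004, 2005, 2006),
                   male = c(1, 1, 0, 1, 0))

# Summary data: one row per student
per_student <- data.frame(sid = c(1, 2), gender_mode = c("1", "0"))

# Every row of long picks up its student's summary value
merged <- merge(long, per_student, by = "sid")
merged
```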

Fixing where the mode does not work

print(subset(tempdf[, c(1, 2, 3, 10, 11, 12)], sid == 12506))

        sid school_year male nvals_gender gender_mode gender_recent
50064 12506        2004    1            2           .             0
50065 12506        2005    0            2           .             0

Our next business rule is to assign the most recent value of gender from the gender_recent variable when there is not a value of gender_mode that is valid
This seems like a pretty simple job for recoding our variable!

Recode Gender

Two step process: first we assign tempdf$male to be the same as tempdf$gender_mode
Then, where tempdf$male is now a "." indicating no modal category exists, we replace it with the value of tempdf$gender_recent
Go ahead and try this and use testuniqueness(tempdf$sid, tempdf$male) to check if it worked

Results

tempdf$male <- tempdf$gender_mode
tempdf$male[tempdf$male == "."] <- tempdf$gender_recent[tempdf$male == "."]
# we have to put the filter on both sides of the assignment operator
testuniqueness(tempdf$sid, tempdf$male)

[1] TRUE
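The comment about filtering on both sides is easiest to see on a toy vector (values made up for illustration). With the filter only on the left, R fills the selected positions with the first elements of the replacement vector instead of the matching ones:

```r
male   <- c("1", ".", "0", ".")
recent <- c("1", "0", "0", "1")
no_mode <- male == "."

# Filter on both sides: each "." gets the value from the same row
right <- male
right[no_mode] <- recent[no_mode]

# Filter on the left only: values are taken from the start of recent,
# and R warns that the lengths do not match (suppressed here)
wrong <- male
suppressWarnings(wrong[no_mode] <- recent)

right
wrong
```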

Now let's clean up our workspace, since we created a lot of temporary variables that we don't need

rm(sturow)
stuatt <- tempdf
stuatt$nvals_gender <- NULL
stuatt$gender_mode <- NULL
stuatt$gender_recent <- NULL
# or just run stuatt<-tempdf[,1:9]
rm(tempdf)

Create a consistent race and ethnicity indicator

Let's practice the same procedure on race

A Note About Variable Types

In the SDP Toolkit you are advised to convert the race_ethnicity variable to numeric and add labels to it
This is because Stata and other statistical packages don't have internal data structures that can handle the factor variable type like R can, and rely on numeric coding schemes
Why don't we need to do this in R?
In fact, in R, we should probably recode the male variable as a factor with values M and F
One problem is that our datafile uses 'NA' for Native American and we do have to recode that... why?
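One reason the recode is needed: R reserves NA for missing values, and when text data are read in, the string "NA" is treated as missing by default. A minimal sketch, assuming the raw data arrived as CSV text (the inline data below are made up):

```r
csv_text <- "sid,race_ethnicity\n1,B\n2,NA\n3,H"

# Default: the string "NA" is read as a missing value
default_read <- read.csv(text = csv_text)
is.na(default_read$race_ethnicity)

# Declaring that only empty fields are missing keeps "NA" as a real code
kept_read <- read.csv(text = csv_text, na.strings = "")
as.character(kept_read$race_ethnicity)
```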

Recoding Race

What's wrong with our race variable?

summary(stuatt$race_ethnicity)

    A     B     H   M/O     W  NA's 
 7303 25321 30444  2809 20528  1129

How do we do this?

length(stuatt$race_ethnicity[is.na(stuatt$race_ethnicity)])
stuatt$race_ethnicity[is.na(stuatt$race_ethnicity)] <- "AI"
summary(stuatt$race_ethnicity)

Why doesn't this work?

Correct conversion

length(stuatt$race_ethnicity[is.na(stuatt$race_ethnicity)])

[1] 1129

stuatt$race_ethnicity <- as.character(stuatt$race_ethnicity)
stuatt$race_ethnicity[is.na(stuatt$race_ethnicity)] <- "AI"
stuatt$race_ethnicity <- factor(stuatt$race_ethnicity)
summary(stuatt$race_ethnicity)

    A    AI     B     H   M/O     W 
 7303  1129 25321 30444  2809 20528

Factors are pesky, even though they are useful and keep us from having to remember numeric representations of our data
In fact, if you read the toolkit, you will see this is a big drawback of Stata: you must constantly refer back to the numeric codes to remember which one corresponds to "Hispanic"
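The failure and the fix can be reproduced on a toy factor (made-up values). Assigning a value that is not one of the factor's levels produces NA with a warning; converting to character first, as on this slide, avoids that. Adding the level before assigning is an alternative that this tutorial does not use:

```r
race <- factor(c("A", "B", NA, "H"))

# Fails: "AI" is not an existing level, so the value stays NA
bad <- race
suppressWarnings(bad[is.na(bad)] <- "AI")
sum(is.na(bad))

# Works: go through character, then rebuild the factor
good <- as.character(race)
good[is.na(good)] <- "AI"
good <- factor(good)

# Also works: add the level before assigning
alt <- race
levels(alt) <- c(levels(alt), "AI")
alt[is.na(alt)] <- "AI"
```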

Inconsistency Within Years

Let's consider student 3 in our dataset

stuatt[7:9, c("sid", "school_year", "race_ethnicity")]

  sid school_year race_ethnicity
7   3        2006              H
8   3        2006              B
9   3        2007              B

How is this different from our prior problem?
Since student 3 was recorded twice in the same year and given a different race/ethnicity we now have to figure out some rules for assigning a consistent value

Business Rule

Again, we are implementing a business rule which means we are making some arbitrary decisions about the data
In this case, if a student is recorded as Hispanic in either observation, we will code both values as Hispanic ("H")
If the student is not Hispanic in either observation, we will code the student as multiple ("M/O" in our data)

Let's calculate the number of values per year

nvals <- ddply(stuatt, .(sid, school_year), summarize, nvals_race = length(unique(race_ethnicity)), 
    tmphispanic = length(which(race_ethnicity == "H")))
tempdf <- merge(stuatt, nvals)
# Clean up
rm(nvals)
# Recode race_ethnicity: Hispanic wins if present, otherwise multiple
tempdf$race2 <- tempdf$race_ethnicity
tempdf$race2[tempdf$nvals_race > 1 & tempdf$tmphispanic >= 1] <- "H"
tempdf$race2[tempdf$nvals_race > 1 & tempdf$tmphispanic == 0] <- "M/O"
tempdf$race_ethnicity <- tempdf$race2

# Clean up by removing old variables
tempdf$race2 <- NULL
tempdf$nvals_race <- NULL
tempdf$tmphispanic <- NULL
# Re-sort our result
tempdf <- tempdf[order(tempdf$sid, tempdf$school_year), ]
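For comparison, here is a minimal base R sketch of the same within-year rule on a toy data frame, using ave() to compute per-group values in place instead of ddply() plus merge(). The data and the helper vectors `n_vals` and `n_hisp` are made up for this example, and race is kept as character to sidestep factor levels:

```r
toy <- data.frame(sid = c(3, 3, 3, 8, 8, 8),
                  school_year = c(2006, 2006, 2007, 2005, 2006, 2006),
                  race = c("H", "B", "B", "W", "A", "W"),
                  stringsAsFactors = FALSE)

# One group per student-year
grp <- interaction(toy$sid, toy$school_year, drop = TRUE)

# Distinct values per student-year, and how many of the rows are Hispanic
idx    <- seq_len(nrow(toy))
n_vals <- ave(idx, grp, FUN = function(i) length(unique(toy$race[i])))
n_hisp <- ave(as.integer(toy$race == "H"), grp, FUN = sum)

# Apply the business rule only where a student-year disagrees with itself
toy$race_clean <- toy$race
toy$race_clean[n_vals > 1 & n_hisp > 0]  <- "H"
toy$race_clean[n_vals > 1 & n_hisp == 0] <- "M/O"
toy
```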

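Compare them

The first table shows the recoded values in tempdf; the second shows the same students in the original stuatt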

        sid school_year race_ethnicity
56201     3        2006              H
56202     3        2006              H
81064  8552        2005              W
81065  8552        2006            M/O
81066  8552        2006            M/O
6162  11382        2005              H
6163  11382        2005              H
6164  11382        2006              H

        sid school_year race_ethnicity
7         3        2006              H
8         3        2006              B
34290  8552        2005              W
34291  8552        2006              A
34292  8552        2006              W
45674 11382        2005              H
45675 11382        2005            M/O
45676 11382        2006              H

OK

Merge it back together

stuatt <- tempdf
rm(tempdf)

Break in Case of Emergency

# Stupid hack workaround of ddply bug when running too many of these
# sequentially
ddply_race <- function(x, y, z) {
    NewColName <- "race_ethnicity"
    z <- ddply(x, .(y, z), .fun = function(xx, col) {
        c(nvals_race = length(unique(xx[, col])))
    }, NewColName)
    z$sid <- z$y
    z$school_year <- z$z
    z$y <- NULL
    z$z <- NULL
    return(z)
}

nvals <- ddply_race(stuatt, stuatt$sid, stuatt$school_year)
tempdf <- merge(stuatt, nvals)
tempdf$temp_ishispanic <- NA
tempdf$temp_ishispanic[tempdf$race_ethnicity == "H" & tempdf$nvals_race > 1] <- 1

Inconsistency across years

So we are in the clear, right?
No, our data still has messiness across years:

head(stuatt[, c("sid", "school_year", "race_ethnicity")])

      sid school_year race_ethnicity
1       1        2004              B
2       1        2005              H
3       1        2006              H
4       1        2007              H
44618   2        2006              W
44619   2        2007              B

Students 1 and 2 each have more than one race/ethnicity across years: student 1 is listed as B and then H, student 2 as W and then B

So...

What do we do?

Try it on your own

Remember, this is tough stuff, so feel free to ask for help!

Answer

tempdf <- ddply(stuatt, .(sid), summarize, var_temp = statamode(race_ethnicity), 
    nvals = length(unique(race_ethnicity)), most_recent_year = max(school_year), 
    most_recent_var = tail(race_ethnicity, 1))

tempdf$race2[tempdf$var_temp != "."] <- tempdf$var_temp[tempdf$var_temp != "."]
tempdf$race2[tempdf$var_temp == "."] <- paste(tempdf$most_recent_var[tempdf$var_temp == 
    "."])

tempdf <- merge(stuatt, tempdf)
head(tempdf[, c(1, 2, 4, 14)], 7)

  sid school_year race_ethnicity race2
1   1        2004              B     H
2   1        2005              H     H
3   1        2006              H     H
4   1        2007              H     H
5   2        2006              W     B
6   2        2007              B     B
7   3        2006              H     H

Why do we have to do a paste command?
What other parts of this code are important to remember?
Always filter on both sides
Always use summarize in the ddply call in this situation
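The paste() call matters when the value being assigned is a factor, which is assumed here to be the case for most_recent_var since race_ethnicity is a factor at this point. A toy illustration (made-up values) of what happens when a factor is assigned into a character vector without converting it first:

```r
recent <- factor(c("B", "H", "W"))

# Without conversion the underlying integer codes are stored, not the labels
as_codes <- rep(NA_character_, 3)
as_codes[1:3] <- recent
as_codes

# paste() or as.character() converts to the labels first
as_labels <- rep(NA_character_, 3)
as_labels[1:3] <- paste(recent)
as_labels
```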

A Faster Way

The nice thing about R is we can roll processes together once we understand them
Let's build a function to do this more efficiently

task1 <- function(df, id, year, var) {
    # df is a data frame; id, year and var are column names passed as strings
    require(plyr)
    require(eeptools)  # provides statamode()
    mdf <- ddply(df, id, function(d) {
        data.frame(var_temp = as.character(statamode(d[[var]])),
            nvals = length(unique(d[[var]])),
            most_recent_year = max(d[[year]]),
            most_recent_var = as.character(tail(d[[var]], 1)),
            stringsAsFactors = FALSE)
    })
    # Use the mode where one exists, otherwise the most recent value
    mdf$var2 <- ifelse(mdf$var_temp != ".", mdf$var_temp, mdf$most_recent_var)
    ndf <- merge(df, mdf, by = id)
    return(ndf)
}
# Note data must be sorted by id and year so tail() picks the most recent value
tempdf <- task1(stuatt, "sid", "school_year", "race_ethnicity")
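To see the whole rule end to end without plyr or eeptools, here is a self-contained base R sketch on a toy data frame. The function name `consistent_value`, the helper `mode_or_dot`, and the data are all hypothetical; the logic follows the rule used in this tutorial (the mode where one exists, otherwise the most recent value):

```r
# Hypothetical helper: most frequent value, or "." on a tie
mode_or_dot <- function(x) {
    counts <- table(x)
    top <- names(counts)[counts == max(counts)]
    if (length(top) == 1) top else "."
}

# Hypothetical wrapper: df is a data frame, the rest are column names
consistent_value <- function(df, id, year, var) {
    df <- df[order(df[[id]], df[[year]]), ]  # tail() needs sorted data
    per_id <- lapply(split(df, df[[id]]), function(d) {
        m <- mode_or_dot(as.character(d[[var]]))
        if (m == ".") m <- as.character(tail(d[[var]], 1))
        out <- data.frame(key = d[[id]][1], var2 = m, stringsAsFactors = FALSE)
        names(out)[1] <- id
        out
    })
    merge(df, do.call(rbind, per_id), by = id)
}

toy <- data.frame(sid = c(1, 1, 1, 1, 2, 2),
                  school_year = c(2004, 2005, 2006, 2007, 2006, 2007),
                  race_ethnicity = c("B", "H", "H", "H", "W", "B"),
                  stringsAsFactors = FALSE)

consistent_value(toy, "sid", "school_year", "race_ethnicity")
```

Student 1 has a clear mode, while student 2's two values tie, so the most recent year's value is used instead.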

Other References

The Strategic Data Project Toolkit
UCLA ATS: R FAQ on Data Management
Video Tutorials
The Split-Apply-Combine Strategy for Data Analysis by Hadley Wickham available in the Journal of Statistical Software vol 40, Issue 1, April 2011

Session Info

It is good to include the session info, e.g. this document is produced with knitr version 0.9.6. Here is my session info:

print(sessionInfo(), locale = FALSE)

R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] sandwich_2.2-9  quantreg_4.94   SparseM_0.96    gridExtra_0.9.1
 [5] mgcv_1.7-22     eeptools_0.1    mapproj_1.2-0   maps_2.3-0     
 [9] proto_0.3-10    plyr_1.8        stringr_0.6.2   ggplot2_0.9.3  
[13] lmtest_0.9-30   zoo_1.7-9       knitr_0.9.6    

loaded via a namespace (and not attached):
 [1] codetools_0.2-8    colorspace_1.2-0   dichromat_1.2-4   
 [4] digest_0.6.0       evaluate_0.4.3     formatR_0.7       
 [7] gtable_0.1.2       labeling_0.1       lattice_0.20-10   
[10] MASS_7.3-22        Matrix_1.0-10      munsell_0.4       
[13] nlme_3.1-106       RColorBrewer_1.0-5 reshape2_1.2.2    
[16] scales_0.2.3       tools_2.15.2

Tutorial 4: Cleaning and Merging Data
