DPI R Bootcamp
Jared Knowles
In this lesson we hope to learn:
# Set working directory to the tutorial directory In RStudio can do
# this in 'Tools' tab
setwd("~/GitHub/r_tutorial_ed")
# Load some data
load("data/smalldata.rda")
# Note if we don't assign data to 'df' R just prints contents of
# table
table function is our friendtable(df$grade, df$year)
2000 2001 2002
3 200 100 200
4 100 200 100
5 200 100 200
6 100 200 100
7 200 100 200
8 100 200 100
table(df$year, df$race)
A B H I W
2000 16 370 93 7 414
2001 16 370 93 7 414
2002 16 370 93 7 414
with(df[df$grade == 3, ], {
table(year, race)
})
race
year A B H I W
2000 4 78 22 4 92
2001 1 44 8 2 45
2002 0 74 20 1 105
with specifies a data object to work on, in this case all elements of df where grade==3table is the same command as above, but since we specified the data object in the with statement, we don't need the df$ in front of the variables of interestdf2 <- subset(df, grade == 3)
table(df2$year, df2$race)
A B H I W
2000 4 78 22 4 92
2001 1 44 8 2 45
2002 0 74 20 1 105
rm(df2)
with(df[df$___==___,]...)with(df[df$race=="B",],{table(year,grade)})
grade
year 3 4 5 6 7 8
2000 78 48 87 39 74 44
2001 44 78 48 87 39 74
2002 74 44 78 48 87 39
table(df$year, df$proflvl)
advanced basic below basic proficient
2000 56 313 143 388
2001 229 183 64 424
2002 503 27 3 367
table(df$race, df$proflvl)
advanced basic below basic proficient
A 19 7 3 19
B 160 302 162 486
H 54 76 33 116
I 7 4 1 9
W 548 134 11 549
prop.table(table(df$race, df$proflvl))
advanced basic below basic proficient
A 0.0070370 0.0025926 0.0011111 0.0070370
B 0.0592593 0.1118519 0.0600000 0.1800000
H 0.0200000 0.0281481 0.0122222 0.0429630
I 0.0025926 0.0014815 0.0003704 0.0033333
W 0.2029630 0.0496296 0.0040741 0.2033333
round(prop.table(table(df$race, df$proflvl), 1), digits = 3)
advanced basic below basic proficient
A 0.396 0.146 0.062 0.396
B 0.144 0.272 0.146 0.438
H 0.194 0.272 0.118 0.416
I 0.333 0.190 0.048 0.429
W 0.441 0.108 0.009 0.442
1 tells R we want proportions rowise, a 2 goes columnwiseround tells R to cut off some digits for usaggregate function that can be used and helps us avoid the clustering problems aboveformula (think I want variable X by grouping factor Y) and the statistic we want to compute# Reading Scores by Race
aggregate(readSS ~ race, FUN = mean, data = df)
race readSS
1 A 508.7
2 B 460.2
3 H 473.2
4 I 485.2
5 W 533.2
aggregate can take us a little further, we can use aggregate multiple variables at a timeaggregate(cbind(readSS, mathSS) ~ race, data = df, mean)
race readSS mathSS
1 A 508.7 477.9
2 B 460.2 442.5
3 H 473.2 442.7
4 I 485.2 455.9
5 W 533.2 529.8
formula syntaxhead(aggregate(cbind(readSS, mathSS) ~ race + grade, data = df, mean),
8)
race grade readSS mathSS
1 A 3 397.8 454.8
2 B 3 409.8 371.6
3 H 3 417.7 364.2
4 I 3 407.6 449.3
5 W 3 481.1 450.7
6 A 4 456.0 438.2
7 B 4 426.9 408.1
8 H 4 418.8 404.6
ag <- aggregate(readSS ~ race + grade, data = df, mean)
xtabs(readSS ~ ., data = ag)
grade
race 3 4 5 6 7 8
A 397.8 456.0 479.1 539.5 600.4 605.3
B 409.8 426.9 447.6 470.9 492.3 523.5
H 417.7 418.8 481.2 489.1 500.3 534.2
I 407.6 531.1 547.6 0.0 405.5 518.0
W 481.1 498.5 517.1 546.6 565.2 596.1
ftable(xtabs(readSS ~ ., data = ag))
grade 3 4 5 6 7 8
race
A 397.8 456.0 479.1 539.5 600.4 605.3
B 409.8 426.9 447.6 470.9 492.3 523.5
H 417.7 418.8 481.2 489.1 500.3 534.2
I 407.6 531.1 547.6 0.0 405.5 518.0
W 481.1 498.5 517.1 546.6 565.2 596.1
What is the mean reading score for 6th grade students with disabilities?
481.83
How many points is this from non-disabled students?
29.877
aggregate(cbind(readSS, mathSS) ~ disab + grade, data = df, mean)
disab grade readSS mathSS
1 0 3 449.9 418.3
2 1 3 421.1 376.3
3 0 4 464.0 454.2
4 1 4 438.2 425.1
5 0 5 484.9 470.2
6 1 5 475.1 431.0
7 0 6 511.7 507.9
8 1 6 481.8 476.9
9 0 7 532.0 532.0
10 1 7 516.1 474.3
11 0 8 567.6 567.7
12 1 8 518.8 534.1
aggregate we do:z <- aggregate(readSS ~ dist, FUN = mean, data = df)
z
dist readSS
1 205 496.5
2 402 500.5
3 495 491.6
plyraggregate is cool, but it isn't very flexibleplyr packageplyr is a set of routines/logical structure for transforming, summarizing, reshaping, and reorganizing data objects of one type in R into another type (or the same type)plyr package has a number of utilities to help us split-apply-combine across data types for both input and outputfor loops to iterate over groups of students, because in R for loops are slow, inefficient, and impracticalplyr to the rescue, while not as fast as a compiled language, it is pretty dang good!
ddply has before it combines it back for us when we do the call ddply(df,.(sex,age),"nrow")
plyr has a straightforward syntaxplyr functions are in the format XXply. The two X's specify what the input file we are applying a function to is, and then what way we would like it outputted.plyr d = dataframe, l= list, m=matrix, and a=array. By far the most common usage is ddplyplyr in Tutorial 4 as well library(plyr)
myag<-ddply(df, .(dist,grade),summarize,
mean_read=mean(readSS,na.rm=T),
mean_math=mean(mathSS,na.rm=T),
sd_read=sd(readSS,na.rm=T),
sd_math=sd(mathSS,na.rm=T),
count_read=length(readSS),
count_math=length(mathSS))
summarize tells ddply what we are doing to the data framehead(myag)
dist grade mean_read mean_math sd_read sd_math count_read
1 205 3 451.7 406.1 93.52 72.45 200
2 205 4 438.9 459.9 77.76 79.10 100
3 205 5 487.9 462.6 85.30 75.10 200
4 205 6 514.7 526.8 76.83 66.04 100
5 205 7 530.0 521.5 84.82 74.85 200
6 205 8 575.5 581.2 79.58 83.45 100
count_math
1 200
2 100
3 200
4 100
5 200
6 100
myag<-ddply(df, .(dist,grade),summarize,
mean_read=mean(readSS,na.rm=T),
mean_math=mean(mathSS,na.rm=T),
sd_read=sd(readSS,na.rm=T),
sd_math=sd(mathSS,na.rm=T),
count_read=length(readSS),
count_math=length(mathSS),
count_black=length(race[race=='B']),
per_black=length(race[race=='B'])/length(readSS))
summary(myag[,7:10])
count_read count_math count_black per_black
Min. :100 Min. :100 Min. :36.0 Min. :0.360
1st Qu.:100 1st Qu.:100 1st Qu.:41.2 1st Qu.:0.386
Median :150 Median :150 Median :60.5 Median :0.412
Mean :150 Mean :150 Mean :61.7 Mean :0.411
3rd Qu.:200 3rd Qu.:200 3rd Qu.:82.0 3rd Qu.:0.439
Max. :200 Max. :200 Max. :92.0 Max. :0.480
data.table which is really handyplyr versionWhat district ID has the highest mean score for 4th grade ELL students on reading? Math?
How many students are in these classes?
12 and 7 respectively
myag2<-ddply(df, .(dist,grade,ell),summarize,
mean_read=mean(readSS,na.rm=T),
mean_math=mean(mathSS,na.rm=T),
sd_read=sd(readSS,na.rm=T),
sd_math=sd(mathSS,na.rm=T),
count_read=length(readSS),
count_math=length(mathSS),
count_black=length(race[race=='B']),
per_black=length(race[race=='B'])/length(readSS))
subset(myag2,ell==1&grade==4)
dist grade ell mean_read mean_math sd_read sd_math count_read
4 205 4 1 403.0 392.9 64.52 39.09 16
16 402 4 1 443.1 388.7 79.52 53.28 29
28 495 4 1 408.8 431.9 77.47 70.77 13
count_math count_black per_black
4 16 2 0.12500
16 29 6 0.20690
28 13 1 0.07692
order function to sort datadf.badsort <- order(df$readSS, df$mathSS)
head(df.badsort)
[1] 106 1026 2 56 122 118
df.sort <- df[order(df$readSS, df$mathSS, df$attday), ]
head(df[, c(3, 23, 29, 30)])
stuid attday readSS mathSS
1 149995 180 357.3 387.3
2 13495 180 263.9 302.6
3 106495 160 369.7 365.5
4 45205 168 346.6 344.5
5 142705 156 373.1 441.2
6 14995 157 436.8 463.4
head(df.sort[, c(3, 23, 29, 30)])
stuid attday readSS mathSS
106 106705 160 251.5 277.0
1026 80995 176 263.2 377.8
2 13495 180 263.9 302.6
56 122402 180 264.3 271.7
122 79705 168 266.4 318.7
118 40495 173 266.9 275.0
head(df[with(df, order(-readSS, -attday)), c(3, 23, 29, 30)])
stuid attday readSS mathSS
1631 145205 137 833.2 828.4
1462 107705 180 773.3 746.6
2252 122902 180 744.0 621.6
2341 44902 175 741.7 676.3
1482 134705 180 739.2 705.4
1630 14495 162 738.9 758.2
- denotes we want descending order, R's default is ascending orderM <- matrix(c(1, 2, 2, 2, 3, 6, 4, 5), 4, 2, byrow = FALSE, dimnames = list(NULL,
c("a", "b")))
M[order(M[, "a"], -M[, "b"]), ]
a b
[1,] 1 3
[2,] 2 6
[3,] 2 5
[4,] 2 4
mytab <- table(df$grade, df$year)
mytab[order(mytab[, 1]), ]
2000 2001 2002
4 100 200 100
6 100 200 100
8 100 200 100
3 200 100 200
5 200 100 200
7 200 100 200
mytab[order(mytab[, 2]), ]
2000 2001 2002
3 200 100 200
5 200 100 200
7 200 100 200
4 100 200 100
6 100 200 100
8 100 200 100
# Gives all rows that meet this requirement
df[df$readSS > 800, ]
X school stuid grade schid dist white black hisp indian
1631 1281061 852 145205 8 15 205 1 0 0 0
asian econ female ell disab sch_fay dist_fay luck ability
1631 0 0 1 0 0 0 0 0 108.3
measerr teachq year attday schoolscore district schoolhigh
1631 6.325 155.7 2001 137 227.7 19 0
schoolavg schoollow readSS mathSS proflvl race
1631 1 0 833.2 828.4 advanced W
df$grade[df$mathSS > 800]
[1] 8
# Gives all values of grade that meet this requirement
df$grade[df$black == 1 & df$readSS > 650]
[1] 8 7 8 6 6 7 8 7 8 8 8 4
df$black=1 or black==1?spread indicating whether a district has high or low spread among its student scoresmyag$spread <- NA # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)
summary(myag$spread)
high low
15 3
myag$spread <- NA # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)
myag called schoolperf for mean_math scores with the following coding scheme:| Grade | Score Range | Code |
|---|---|---|
| 3 | >425 | "Hi" |
| 4 | >450 | "Hi" |
| 5 | >475 | "Hi" |
| 6 | >500 | "Hi" |
| 7 | >525 | "Hi" |
| 8 | >575 | "Hi" |
dist?myag$schoolperf <- "lo"
myag$schoolperf[myag$grade == 3 & myag$mean_math > 425] <- "hi"
myag$schoolperf[myag$grade == 4 & myag$mean_math > 450] <- "hi"
myag$schoolperf[myag$grade == 5 & myag$mean_math > 475] <- "hi"
myag$schoolperf[myag$grade == 6 & myag$mean_math > 500] <- "hi"
myag$schoolperf[myag$grade == 7 & myag$mean_math > 525] <- "hi"
myag$schoolperf[myag$grade == 8 & myag$mean_math > 575] <- "hi"
myag$schoolperf <- as.factor(myag$schoolperf)
summary(myag$schoolperf)
hi lo
9 9
table(myag$dist, myag$schoolperf)
hi lo
205 3 3
402 3 3
495 3 3
myag$mean_read[myag$dist == 6 & myag$grade == 3] <- NA
head(myag[, 1:4], 2)
dist grade mean_read mean_math
1 205 3 451.7 406.1
2 205 4 438.9 459.9
myag$mean_read[myag$dist == 6 & myag$grade == 3] <- myag$mean_read[myag$dist ==
6 & myag$grade == 4]
head(myag[, 1:4], 2)
dist grade mean_read mean_math
1 205 3 451.7 406.1
2 205 4 438.9 459.9
myag$mean_read[myag$grade == 3] <- NA
head(myag[order(myag$grade), 1:4])
dist grade mean_read mean_math
1 205 3 NA 406.1
7 402 3 NA 431.9
13 495 3 NA 405.5
2 205 4 438.9 459.9
8 402 4 474.9 432.8
14 495 4 447.8 469.1
mean(myag$mean_math)
[1] 490.7
mean(myag$mean_read)
[1] NA
mean of a vector with NAmean(myag$mean_math, na.rm = T)
[1] 490.7
mean(myag$mean_read, na.rm = T)
[1] 507.5
mean_read of less than 500?length(myag$dist[myag$mean_read < 500])
[1] 10
head(myag$mean_read[myag$mean_read < 500])
[1] NA 438.9 487.9 NA 474.9 472.5
badvar <- myag$mean_read + myag$sd_read
summary(badvar)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
517 566 592 587 614 655 3
myag$sd_read[myag$count_read < 100 & myag$mean_read < 550] <- NA
length(myag$mean_read[myag$mean_read < 550])
[1] 16
length(myag$mean_read[myag$mean_read < 550 & !is.na(myag$mean_read)])
[1] 13
!is.na() ?is.na() is a helpful function to identify TRUE if a value is missing! is the reverse operatordf which has multiple rows per student and myag which has multiple rows per schoolnames(myag)
[1] "dist" "grade" "mean_read" "mean_math"
[5] "sd_read" "sd_math" "count_read" "count_math"
[9] "count_black" "per_black" "spread" "schoolperf"
names(df[, c(2, 3, 4, 6)])
[1] "school" "stuid" "grade" "dist"
dist and grade are in common. Is this ok?year as well?merge we want to consider with ?mergemerge automagically combine the datasimple_merge <- merge(df, myag)
names(simple_merge)
[1] "grade" "dist" "X" "school"
[5] "stuid" "schid" "white" "black"
[9] "hisp" "indian" "asian" "econ"
[13] "female" "ell" "disab" "sch_fay"
[17] "dist_fay" "luck" "ability" "measerr"
[21] "teachq" "year" "attday" "schoolscore"
[25] "district" "schoolhigh" "schoolavg" "schoollow"
[29] "readSS" "mathSS" "proflvl" "race"
[33] "mean_read" "mean_math" "sd_read" "sd_math"
[37] "count_read" "count_math" "count_black" "per_black"
[41] "spread" "schoolperf"
simple_merge(df1,df2,by=c("id1","id2"))simple_merge(df1,df2,by.x=c("id1","id2"),by.y=c("id1_a","id2_a"))notsosimple_merge(df1,df2,all.x=TRUE,all.y=TRUE)x observations (df1), all the y observations (df2) or neither, or bothhead(df[, 1:10], 3)
X school stuid grade schid dist white black hisp indian
1 44 1 149995 3 105 495 0 1 0 0
2 53 1 13495 3 45 495 0 1 0 0
3 116 1 106495 3 45 495 0 1 0 0
head(widedf[, c(1, 28:40)], 3)
stuid readSS.2000 mathSS.2000 proflvl.2000 race.2000 X.2001
1 149995 357.3 387.3 basic B 441000
2 13495 263.9 302.6 below basic B 531000
3 106495 369.7 365.5 basic B 1161000
school.2001 grade.2001 schid.2001 dist.2001 white.2001 black.2001
1 1 4 105 495 0 1
2 1 4 45 495 0 1
3 1 4 45 495 0 1
hisp.2001 indian.2001
1 0 0
2 0 0
3 0 0
reshape is the way to move from wide to longwidedf <- reshape(df, timevar = "year", idvar = "stuid", direction = "wide")
idvar represents the unit we want to represent a single row, in this case each unique student gets a single rowtimevar is the variable that differenaties between two rows with the same student IDtimevar needn't always represent time!direction tells R we are going to move to wide datavarying argument we can tell R explicitly which items we want to move widereshape function works well in both directionslongdf <- reshape(widedf, idvar = "stuid", timevar = "year", varying = names(widedf[,
2:91]), direction = "long", sep = ".")
subset function to get only 4th grade scoresg4 <- subset(df, grade == 4)
dim(g4)
[1] 400 32
g4_b <- df[df$grade == 4, ]
identical(g4, g4_b)
[1] TRUE
Say we are unhappy about attributing the school/grade mean score across years to student-year observations like we did in this lesson. Let's fix it by first aggregating our student data frame to a school/grade/year data frame, and second by merging that new data frame with our student level data.
Sort the student-level data frame on attday and ability in descending order.
Find the highest proportion of black students in any school/grade/year combination.
It is good to include the session info, e.g. this document is produced with knitr version 0.9.6. Here is my session info:
print(sessionInfo(), locale = FALSE)
R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
attached base packages:
[1] grid stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] sandwich_2.2-9 quantreg_4.94 SparseM_0.96 gridExtra_0.9.1
[5] mgcv_1.7-22 eeptools_0.1 mapproj_1.2-0 maps_2.3-0
[9] proto_0.3-10 plyr_1.8 stringr_0.6.2 ggplot2_0.9.3
[13] lmtest_0.9-30 zoo_1.7-9 knitr_0.9.6
loaded via a namespace (and not attached):
[1] colorspace_1.2-0 dichromat_1.2-4 digest_0.6.0
[4] evaluate_0.4.3 formatR_0.7 gtable_0.1.2
[7] labeling_0.1 lattice_0.20-10 MASS_7.3-22
[10] Matrix_1.0-10 munsell_0.4 nlme_3.1-106
[13] RColorBrewer_1.0-5 reshape2_1.2.2 scales_0.2.3
[16] tools_2.15.2
This work (R Tutorial for Education, by Jared E. Knowles), in service of the Wisconsin Department of Public Instruction, is free of known copyright restrictions.