DPI R Bootcamp
Jared Knowles
packages
packages are essentially free and open source add-ons for R
ggplot2 package
lme4 package for mixed effects models
scatterplot3d package (also webGL)
sp
tm
caret
Rserve
fun package
# You can find and install packages within R
install.packages("foo") # Name must be in quotes
install.packages(c("foo", "foo1", "foo2"))
# Packages get updated FREQUENTLY
update.packages() # Gonna update them all
Load an installed package with library(ggplot2) and you're done!
plyr, ggplot2, lme4, sp, knitr
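Before calling library(), it can help to check which packages are already installed. The following is a minimal sketch, assuming the five package names listed above are the ones you want. `requireNamespace()` loads a package's namespace without attaching it and returns FALSE rather than stopping when the package is absent. The names `wanted`, `is_installed` and `missing_pkgs` are made up for this illustration.

```r
# Package names taken from the list above; swap in whatever your project needs.
wanted <- c("plyr", "ggplot2", "lme4", "sp", "knitr")

# requireNamespace() returns TRUE/FALSE instead of raising an error,
# so it is safe to call for packages that may not be installed.
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- wanted[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All packages are available.")
}
```

The install line is left commented out so that running the sketch never changes your library by accident.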
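Many packages include vignettes which walk you through use cases for the package
caret has a great example
List the objects in the workspace with the ls() command
Access an element of an object with mydata$thingIwant or mydata@thingIwant
The $ and @ distinction depends on whether the object is an S3 or an S4 class
If $ doesn't work, use @
wd (the working directory) is both your friend and enemy
Check the current working directory with getwd()
A project usually lives in a folder such as C:/Users/My Documents/My Project/ which is then set as the working directory
Change it with the setwd() command: setwd("PATH/TO/MY PROJECT/")
Relative paths such as data and handouts/data mean: look for the data folder in our current working directory. A path that starts with ~ is resolved against the home directory instead.
Full paths look like C:/Path/To/My/Data or /usr/home/jaredrocks
Missing data is represented as NA or NaN or NULL depending on the context.
a <- c(1, 2, 3) # a is a vector with three elements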
# Ask R for element 4
print(a[4])
## [1] NA
What is the difference between NA and NULL?
a <- c(a, NULL) # Append NULL onto a
print(a)
## [1] 1 2 3
# Notice no change
a <- c(a, NA)
print(a)
## [1] 1 2 3 NA
NA can hold a place, NULL cannot
NaN is even more special: it marks the result of a numeric operation that has no defined real value, such as the square root of a negative number
NaN stands for "Not a Number"
b <- 1
b <- sqrt(-b)
## Warning: NaNs produced
print(b)
## [1] NaN
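Each of the three kinds of missing value has its own test function in base R. The sketch below uses two made-up vectors, `x` and `y`, to show how `is.na()`, `is.nan()` and `is.null()` differ, and how `na.rm = TRUE` changes a summary.

```r
x <- c(1, NA, NaN)

is.na(x)    # TRUE for both NA and NaN
is.nan(x)   # TRUE only for NaN
is.null(x)  # FALSE: x is a vector, not NULL

# NULL has length zero, which is why c(a, NULL) leaves a unchanged
length(NULL)
length(NA)

# Most summaries return NA unless told to skip missing values
y <- c(1, NA, 3)
mean(y)
mean(y, na.rm = TRUE)
```

Note that `is.na()` is the broader test, so code that must tell NaN apart from NA has to call `is.nan()` explicitly.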
pi/0
## [1] Inf
sin(Inf) = NaN
See the ProjectTemplate package for a more detailed philosophy about organizing projects
R can read .csv .dta .sas .spss .dat and even .xls and .xlsx with some care
# Set working directory to the tutorial directory. In RStudio you can do this
# in the 'Tools' tab
setwd("~/GitHub/r_tutorial_ed")
# Load some data
df <- read.csv("data/smalldata.csv")
# Note: if we don't assign the data to 'df', R just prints the contents of the table
str(df[, 27:32])
## 'data.frame': 2700 obs. of 6 variables:
## $ schoolavg: int 1 1 1 1 1 1 1 1 1 1 ...
## $ schoollow: int 0 0 0 0 0 0 0 0 0 0 ...
## $ readSS : num 357 264 370 347 373 ...
## $ mathSS : num 387 303 365 344 441 ...
## $ proflvl : Factor w/ 4 levels "advanced","basic",..: 2 3 2 2 2 4 4 4 3 2 ...
## $ race : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(df)
## [1] 2700 32
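The read step can be tried without the tutorial data. This is a minimal sketch, assuming a small made-up data frame `toy` written to a temporary file; the file name `toy.csv` and the values are hypothetical. It also shows one version difference worth knowing: since R 4.0.0, `read.csv()` returns character columns by default, so `stringsAsFactors = TRUE` is needed to get Factor columns like those in the `str` output.

```r
# Hypothetical toy data standing in for data/smalldata.csv
toy <- data.frame(stuid   = c(205, 206, 207),
                  readSS  = c(357.3, 263.9, 369.7),
                  proflvl = c("basic", "below basic", "basic"))

# Build the path with file.path() so it works on any operating system
csv_path <- file.path(tempdir(), "toy.csv")
write.csv(toy, csv_path, row.names = FALSE)

file.exists(csv_path)  # check that the file is where we expect before reading

# stringsAsFactors = TRUE turns text columns into factors;
# since R 4.0.0 the default is FALSE, which gives character columns.
toy_in <- read.csv(csv_path, stringsAsFactors = TRUE)
str(toy_in)
dim(toy_in)
```

Checking `file.exists()` first gives a clearer failure than the error `read.csv()` raises when the working directory is not what you assumed.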
summary
summary(df[, 1:5])
##        X               school         stuid            grade
##  Min.   :     44   Min.   :   1   Min.   :   205   Min.   :3.00
##  1st Qu.: 108677   1st Qu.: 195   1st Qu.: 44205   1st Qu.:4.00
##  Median : 458596   Median : 436   Median : 88205   Median :5.00
##  Mean   : 557918   Mean   : 460   Mean   : 99229   Mean   :5.44
##  3rd Qu.: 972291   3rd Qu.: 717   3rd Qu.:132205   3rd Qu.:7.00
##  Max.   :1499992   Max.   :1000   Max.   :324953   Max.   :8.00
##      schid
##  Min.   :  6.0
##  1st Qu.: 15.0
##  Median : 55.5
##  Mean   : 52.0
##  3rd Qu.: 75.0
##  Max.   :105.0
names
names(df)
## [1] "X" "school" "stuid" "grade" "schid"
## [6] "dist" "white" "black" "hisp" "indian"
## [11] "asian" "econ" "female" "ell" "disab"
## [16] "sch_fay" "dist_fay" "luck" "ability" "measerr"
## [21] "teachq" "year" "attday" "schoolscore" "district"
## [26] "schoolhigh" "schoolavg" "schoollow" "readSS" "mathSS"
## [31] "proflvl" "race"
attributes and class
names(attributes(df))
## [1] "names" "row.names" "class"
class(df)
## [1] "data.frame"
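The class of an object also decides whether `$` or `@` is the right accessor, as noted earlier. Here is a small sketch with two made-up classes, `myS3` and `MyS4`, both invented for illustration, that shows each accessor at work and how `isS4()` tells the two kinds of object apart.

```r
library(methods)  # attached by default in most sessions; loaded here to be explicit

# S3-style object: a list with a class attribute; elements are reached with $
s3_obj <- structure(list(thingIwant = 42), class = "myS3")
s3_obj$thingIwant

# S4 object: a formal class with declared slots; slots are reached with @
setClass("MyS4", representation(thingIwant = "numeric"))
s4_obj <- new("MyS4", thingIwant = 42)
s4_obj@thingIwant

# isS4() tells you which accessor to expect
isS4(s3_obj)
isS4(s4_obj)

# slotNames() lists what is available behind @
slotNames(s4_obj)
```

A data frame such as `df` is an S3 object, which is why `df$stuid` works throughout this tutorial.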
str, which lists all data elements in an object and their type
library(RODBC) # interface driver for R
# Establish a connection; we can open multiple connections in the same R
# session.
# WARNING: credentials are stored in plain text unless you do some magic
channel <- odbcConnect("Mydatabase.location", uid = "useR", pwd = "secret")
# Get a list of tables in the connection
table_list <- sqlTables(channel, schema = "My_DB")
# Get the column names of a table
colnames(sqlFetch(channel, "My_DB.TABLE_NAME", max = 1))
# Execute a SQL query; any SQL query can be pasted as a string into this
# space
datapull <- sqlQuery(channel, "SELECT DATA1, DATA2, DATA3 FROM My_DB.TABLE_NAME")
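One simple way to keep credentials out of the script is to read them from environment variables. This is only a sketch of that idea, not the approach the tutorial itself prescribes: the helper `get_db_credentials()` and the variable names `MYDB_UID` and `MYDB_PWD` are hypothetical, and the connection call is left as a comment because it needs a real ODBC data source.

```r
# Hypothetical variable names; set them in ~/.Renviron or the shell,
# not in the script itself.
get_db_credentials <- function(uid_var = "MYDB_UID", pwd_var = "MYDB_PWD") {
  uid <- Sys.getenv(uid_var, unset = NA)
  pwd <- Sys.getenv(pwd_var, unset = NA)
  if (is.na(uid) || is.na(pwd)) {
    stop("Set ", uid_var, " and ", pwd_var, " before connecting")
  }
  list(uid = uid, pwd = pwd)
}

# Demonstration only: normally these values come from outside the R session
Sys.setenv(MYDB_UID = "useR", MYDB_PWD = "secret")
cred <- get_db_credentials()

# The connection call would then look like this (not run here):
# channel <- odbcConnect("Mydatabase.location", uid = cred$uid, pwd = cred$pwd)
# ... queries ...
# odbcClose(channel)
```

The script can then be shared or committed without exposing the password, and it fails early with a readable message when the variables are not set.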
random <- sample(unique(df$stuid), 100)
random2 <- sample(unique(df$stuid), 120)
messdf <- df
messdf$readSS[messdf$stuid %in% random] <- NA
messdf$mathSS[messdf$stuid %in% random2] <- NA
The summary function helps identify missing data
summary(messdf[, c("stuid", "readSS", "mathSS")])
##      stuid            readSS        mathSS
##  Min.   :   205   Min.   :252   Min.   :210
##  1st Qu.: 44205   1st Qu.:431   1st Qu.:418
##  Median : 88205   Median :497   Median :480
##  Mean   : 99229   Mean   :497   Mean   :484
##  3rd Qu.:132205   3rd Qu.:564   3rd Qu.:543
##  Max.   :324953   Max.   :833   Max.   :828
##                   NA's   :223   NA's   :288
nrow(messdf[!complete.cases(messdf), ]) # number of rows with missing data
## [1] 494
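When a data frame has many columns, a count of missing values per column is often quicker to read than the full summary. Below is a minimal sketch on a made-up data frame, `toy`, whose values are hypothetical and not taken from the tutorial data.

```r
# Hypothetical toy frame with a few holes punched in it
toy <- data.frame(stuid  = 1:6,
                  readSS = c(400, NA, 510, 480, NA, 455),
                  mathSS = c(390, 420, NA, 470, NA, 440))

# is.na() on a data frame returns a logical matrix; colSums() counts the TRUEs
colSums(is.na(toy))

# Number of rows with at least one missing value
sum(!complete.cases(toy))

# Share of missing values per column
colMeans(is.na(toy))
```

The row count can be smaller than the sum of the column counts, because a single row may be missing more than one value.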
Remove rows with missing data using the na.omit function
cleandf <- na.omit(messdf)
nrow(cleandf)
## [1] 2206
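`na.omit` drops a row when any column is missing, which can discard more data than intended. A hedged alternative, sketched here on a made-up data frame `toy` with a hypothetical `notes` column, is to apply `complete.cases()` only to the columns the analysis needs.

```r
toy <- data.frame(stuid  = 1:5,
                  readSS = c(400, NA, 510, 480, 455),
                  mathSS = c(390, 420, NA, 470, 440),
                  notes  = c(NA, "retest", NA, NA, NA))

# na.omit() drops a row if ANY column is missing, so here every row goes
nrow(na.omit(toy))

# Restrict the check to the columns the analysis actually uses
keep <- complete.cases(toy[, c("readSS", "mathSS")])
cleantoy <- toy[keep, ]
nrow(cleantoy)
```

Which columns count as required is a judgment call for each analysis; the point is only that the choice can be made explicit.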
dim(messdf)
## [1] 2700 32
str(messdf[, 18:26])
## 'data.frame': 2700 obs. of 9 variables:
## $ luck : int 0 1 0 1 0 0 1 0 0 0 ...
## $ ability : num 87.9 97.8 104.5 111.7 81.9 ...
## $ measerr : num 11.13 6.82 -7.86 -17.57 52.98 ...
## $ teachq : num 39.0902 0.0985 39.5389 24.1161 56.6806 ...
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ attday : int 180 180 160 168 156 157 169 180 170 152 ...
## $ schoolscore: num 29.2 56 56 56 56 ...
## $ district : int 3 3 3 3 3 3 3 3 3 3 ...
## $ schoolhigh : int 0 0 0 0 0 0 0 0 0 0 ...
names(messdf)
## [1] "X" "school" "stuid" "grade" "schid"
## [6] "dist" "white" "black" "hisp" "indian"
## [11] "asian" "econ" "female" "ell" "disab"
## [16] "sch_fay" "dist_fay" "luck" "ability" "measerr"
## [21] "teachq" "year" "attday" "schoolscore" "district"
## [26] "schoolhigh" "schoolavg" "schoollow" "readSS" "mathSS"
## [31] "proflvl" "race"
For id variables, it is good to check whether there are multiple rows per id; we do this using length and unique
length(unique(messdf$stuid))
## [1] 1200
length(unique(messdf$schid))
## [1] 6
length(unique(messdf$dist))
## [1] 3
unique(messdf$grade)
## [1] 3 4 5 6 7 8
unique(messdf$econ)
## [1] 0 1
unique(messdf$race)
## [1] B H I W A
## Levels: A B H I W
unique(messdf$disab)
## [1] 0 1
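The length and unique check above can be carried one step further to see how many rows each id has and whether a combination of columns identifies a row. This is a small sketch on a made-up data frame `toy`; the ids and grades are hypothetical.

```r
toy <- data.frame(stuid = c(205, 205, 205, 301, 301, 412),
                  grade = c(3, 4, 5, 3, 4, 3))

n_rows <- nrow(toy)
n_ids  <- length(unique(toy$stuid))
n_rows > n_ids   # TRUE means at least one id repeats

# table() counts rows per id
rows_per_id <- table(toy$stuid)
rows_per_id

# anyDuplicated() returns 0 when a key is unique, so this checks that
# stuid plus grade identifies a single row
anyDuplicated(toy[, c("stuid", "grade")]) == 0
```

In the tutorial data, 2700 rows against 1200 unique `stuid` values likewise indicates multiple rows per student.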
Read in the CSV file from the T drive or the project folder
Read in the R data file from the T drive or the project folder
Read in the sample datafile. Find the readSS (reading scale score) for student 205 in grade 4.
Create a list of two attributes for each district in the df datafile.
Think about your own data warehouse environment. Could R interface with it? How?
It is good to include the session info, e.g. this document is produced with knitr version 0.9.6. Here is my session info:
print(sessionInfo(), locale = FALSE)
## R version 2.15.2 (2012-10-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] sandwich_2.2-9 quantreg_4.94 SparseM_0.96 gridExtra_0.9.1
## [5] mgcv_1.7-22 eeptools_0.1 mapproj_1.2-0 maps_2.3-0
## [9] proto_0.3-10 plyr_1.8 stringr_0.6.2 ggplot2_0.9.3
## [13] lmtest_0.9-30 zoo_1.7-9 knitr_0.9.6
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-0 dichromat_1.2-4 digest_0.6.0
## [4] evaluate_0.4.3 formatR_0.7 gtable_0.1.2
## [7] labeling_0.1 lattice_0.20-10 MASS_7.3-22
## [10] Matrix_1.0-10 munsell_0.4 nlme_3.1-106
## [13] RColorBrewer_1.0-5 reshape2_1.2.2 scales_0.2.3
## [16] tools_2.15.2
This work (R Tutorial for Education, by Jared E. Knowles), in service of the Wisconsin Department of Public Instruction, is free of known copyright restrictions.