A quick note on R `packages`

Packages are essentially free and open source add-ons for R
There are over 3,000 packages available for R that add all sorts of functionality
A few examples (from the mundane to the crazy)
Additional graphics capabilities from the ggplot2 package
Advanced regression techniques from the lme4 package for mixed effects models
3D graphics from the scatterplot3d package (also WebGL)
GIS analytics and mapping functionality with sp
Text mining analytics with tm
Predictive modeling frameworks with caret
Interfaces to other programming languages like Python, Java, C, and C++
A web server: Rserve
And Minesweeper from the fun package

I can haz packages?

# You can find and install packages within R
install.packages("foo")  # Name must be in quotes
install.packages(c("foo", "foo1", "foo2"))
# Packages get updated FREQUENTLY
update.packages()  # Gonna update them all

Note: on Windows Vista and later, R either needs to be run as an administrator to install packages, or you have to change where the packages are installed
Packages are stored in something called the library, which is just a collection of installed packages
Sometimes folks call packages libraries
Loading a package couldn't be easier: type library(ggplot2) and you're done!
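
As a minimal sketch of the install-then-load workflow above, the snippet below loads a package only if it is already installed, and prints where the library lives. The variable names `pkg`, `loaded`, and `lib_dirs` are made up for illustration; `requireNamespace()`, `library()`, and `.libPaths()` are base R.

```r
# Load a package only if it is already installed; otherwise say how to get it.
pkg <- "ggplot2"  # the package named on the slide; any package name works

if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)  # character.only lets us pass a string
  loaded <- TRUE
} else {
  message("Package '", pkg, "' is not installed; try install.packages(\"", pkg, "\")")
  loaded <- FALSE
}

# .libPaths() shows the folder(s) that make up your library
lib_dirs <- .libPaths()
print(lib_dirs)
```

If the first entry of `.libPaths()` is a folder you cannot write to, that is usually the cause of the Windows permission problem mentioned above.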

Finding Packages

Official packages are found on CRAN (the Comprehensive R Archive Network)
Unofficial packages or beta versions of packages are found on R-Forge and GitHub
To find out what packages are out there that do a specific function, try:
Google "doing X in R package"
Look at CRAN Task Views
CRAN Task Views are great for finding a bunch of packages related to a problem you are trying to solve

Some Must Have Packages

plyr
ggplot2
lme4
sp
knitr

These packages come with data, code, functions, and utilities that you can access
If you want to learn about them, check them out in the RStudio Packages tab, which gives an index of their help files and handy links to their help documentation
Some packages also have vignettes which walk you through use cases for the package
caret has a great example
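
If you are not in RStudio, the same information is available from the console. Here is a small sketch using two packages that ship with R (`datasets` and `grid`), chosen only so the example runs without installing anything; the object names `pkg_data` and `pkg_vignettes` are illustrative.

```r
# List the data sets bundled with a package
pkg_data <- data(package = "datasets")
head(pkg_data$results[, c("Item", "Title")])

# List the vignettes a package provides (the result may be empty for some packages)
pkg_vignettes <- vignette(package = "grid")
pkg_vignettes$results[, c("Item", "Title"), drop = FALSE]

# help(package = "grid") opens the help index for a package,
# and browseVignettes() opens a browsable list; both are interactive, so not run here
```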

Data Management

Keeping track of your data in R used to mean running the ls() command
Go ahead, type it.
Now you can look at the Workspace tab in RStudio and have a complete list of the data in R's memory that is accessible to you
All data objects have names
All object names are unique (not strictly so, but let's not violate this)
To reference an item within an object we need to give it an address, as in mydata$thingIwant or mydata@thingIwant
The $ and @ distinction depends on the object: $ picks out named elements of lists and data frames (most S3 objects), while @ picks out the slots of S4 objects
R will stop with an error if you use the wrong one, so just remember: when $ doesn't work, try @
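
To make the two kinds of address concrete, here is a small sketch. The data frame and the `Student` class are toy objects invented for this example, not part of the Bootcamp data.

```r
# A data frame is a list under the hood, so its columns are reached with $
mydata <- data.frame(id = 1:3, score = c(90, 85, 77))
mydata$score

# A toy S4 class (hypothetical, for illustration); its slots are reached with @
setClass("Student", representation(name = "character", score = "numeric"))
s <- new("Student", name = "Ann", score = 90)
s@score

# isS4() tells you which accessor to expect
isS4(mydata)  # FALSE, so use $
isS4(s)       # TRUE, so use @

# ls() lists the names of the objects currently in the workspace
ls()
```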

The Working Directory

The working directory, lovingly denoted by wd is both your friend and enemy
R needs to know where it is in your file system to be able to access data and write output
Check this by typing getwd()
R work is best done in a self-contained directory, like C:/Users/My Documents/My Project/ which is then set as the working directory
How to set the working directory? The setwd() command: setwd("PATH/TO/MY PROJECT/")
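
A minimal sketch of checking and changing the working directory. The folder `my_project` is a hypothetical project directory created inside R's temporary folder so the example is safe to run anywhere.

```r
old_wd <- getwd()  # remember where we started

# Hypothetical project folder, created under tempdir() for this example
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

setwd(project_dir)               # move into the project folder
in_project <- basename(getwd())  # name of the folder we are now in
print(in_project)

setwd(old_wd)                    # go back so the rest of the session is unaffected
```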

Preliminaries

Get and set the working directory
Understand file system paths
Understand relative and full paths
Find out where things are
To start, we'll download the Bootcamp files and open up "Tutorial 2.R"
We'll notice that in the Bootcamp folder we have a number of subfolders for things like data and handouts
When you do an R project, it really helps to put it into a self-contained folder so you can reference everything you need easily within the file system

RStudio Shortcuts

RStudio makes this so much easier!
RStudio can set the working directory to the folder shown in the Files pane in the bottom right, or to the folder of the source file in the top left
The source file is the most likely path, but sometimes we need to access data in a separate folder and store results in another
So RStudio shortcuts can't help there

RStudio Projects

A great feature of RStudio is that it allows us to create a project which allows us to quickly jump from folder to folder, store data, and keep open tasks and scripts
If you have multiple R tasks at once (how lucky!) you have to use projects
Let's walk through creating a Project in RStudio and understand it a little bit
DEMO

Manipulating Project Paths

Paths are strings that tell R where to find files on your computer
R uses both full and relative paths
When we want to call files from different directories or write to different directories, we can use relative paths within the project
Relative paths look like data/ or ./data, which means: look for the data folder in our current working directory
Full paths work differently: they specify the exact location on the hard drive of the machine, like C:/Path/To/My/Data or /usr/home/jaredrocks
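
Here is a short sketch of the difference, assuming a `data` folder under the working directory as in the Bootcamp project. The file does not need to exist for the paths to be built.

```r
# A relative path: file.path() joins the pieces with the right separator
rel_path <- file.path("data", "smalldata.csv")
print(rel_path)

# The full path R would resolve it to is the working directory plus the relative path
full_path <- file.path(getwd(), rel_path)
print(full_path)

# file.exists() is a quick way to check that a path points where you think it does
file.exists(rel_path)
```

Because the relative path never mentions the machine, the same script keeps working when the whole project folder is copied to a collaborator.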

When to use a full path

Never
It breaks the ability to pass a project directory to a coauthor/collaborator
But when you must...
When you have something on a whole different drive or network store
When you don't think you'll migrate your code to a new operating system or a different machine/network environment
When the project is simple

Ground Rules

Get used to plain text input files
R can handle other formats, but your error rate increases as does the tweaking necessary
R reserves a small set of special characters (symbols) that will not be translated correctly if they appear in your data input
These symbols are reserved and will be interpreted in strange ways if you include them in your plain text data file
Most of them are fairly obvious operators, see Paul Murrell's excellent summary

Missing Data Symbols

Missing data has the symbols NA or NaN or NULL depending on the context.
Consider:

a <- c(1, 2, 3)  # a is a vector with three elements
# Ask R for element 4
print(a[4])

## [1] NA

But what is the difference between NA and NULL?

a <- c(a, NULL)  # Append NULL onto a
print(a)

## [1] 1 2 3

# Notice no change
a <- c(a, NA)
print(a)

## [1]  1  2  3 NA

NA can hold a place, NULL cannot

What the heck is Not a Number?

NaN is even more special: it marks the result of an undefined numeric operation, like the square root of a negative number or 0/0
NaN stands for "Not a Number"

b <- 1
b <- sqrt(-b)

## Warning: NaNs produced

print(b)

## [1] NaN

pi/0

## [1] Inf

Inf is a special case as well representing an infinite value
Just for fun sin(Inf) = NaN
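
Since these symbols behave differently, it helps to know how to test for each one. This is a small sketch on a made-up vector `x`.

```r
x <- c(1, NA, NaN, Inf)

na_flags  <- is.na(x)        # TRUE for both NA and NaN
nan_flags <- is.nan(x)       # TRUE only for NaN
inf_flags <- is.infinite(x)  # TRUE only for Inf and -Inf
print(rbind(na_flags, nan_flags, inf_flags))

# NULL is tested on the whole object, not element by element
is.null(NULL)
length(NULL)  # 0, which is why c(a, NULL) leaves a unchanged

# Most summaries return a missing value unless told to skip missing data
mean(c(1, 2, NA))
mean(c(1, 2, NA), na.rm = TRUE)
```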

Beginning Analysis

Now let's set up our analysis project
It is best to keep projects discrete in directories
Create a few subdirectories
Data
Functions / R src
Figures / Plots
Cleaned Data
We've already done this in the Bootcamp folder itself!
See the ProjectTemplate package for a more detailed philosophy about organizing projects
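
The subdirectories can also be created from R itself. The sketch below builds a layout like the one listed above; the root folder `bootcamp_demo` and the exact folder names are assumptions for illustration, and everything is created under R's temporary folder.

```r
# Hypothetical project root, placed in tempdir() so nothing real is touched
project_root <- file.path(tempdir(), "bootcamp_demo")

# Hypothetical folder names for raw data, functions, figures, and cleaned data
subdirs <- c("data", "R", "figure", "cleandata")

for (d in subdirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

created <- list.files(project_root)
print(created)
```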

Organization of a project is key

Separate data from scripts
Separate automatic scripts from interactive scripts
Put figures apart from both of these
Always keep your raw data

Create our Project

Open the "bootcamp" folder on your machine
Underneath this make a "figure" folder to go with the "data" folder
Create an R script in the "bootcamp" folder called "myscriptname.R"
Put all the data files in the "data" folder

Read in Data

Reading in data is one of the trickiest issues for R
This is because R is incredibly flexible and can handle data in almost any form including .csv .dta .sas .spss .dat and even .xls and .xlsx with some care
So we have to carefully specify the data types to R so it can understand what form the data needs to take
Compared to C this is great!

CSV is Our Friend

The easiest file type is .csv, though Excel files can be read as well

# Set the working directory to the tutorial directory
# (in RStudio you can do this in the 'Tools' tab)
setwd("~/GitHub/r_tutorial_ed")
# Load some data
df <- read.csv("data/smalldata.csv")
# Note: if we don't assign the data to 'df', R just prints the contents of the table
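
Specifying data types carefully, as mentioned earlier, can be done at read time. The sketch below writes a tiny made-up CSV to a temporary file (the column names mimic the tutorial data, the values are invented) and reads it back with explicit column types.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_file <- tempfile(fileext = ".csv")
writeLines(c("stuid,readSS,race",
             "205,357.3,B",
             "206,,H",
             "207,370.1,B"), csv_file)

# colClasses states the type of each column up front;
# na.strings says which raw values should be read as NA
toy <- read.csv(csv_file,
                colClasses = c("integer", "numeric", "factor"),
                na.strings = c("", "NA"))
str(toy)
```

Stating the types yourself means a stray character in a numeric column produces an error at read time instead of silently turning the whole column into text.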

Let's Check What We Got

str(df[, 27:32])

## 'data.frame':    2700 obs. of  6 variables:
##  $ schoolavg: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ schoollow: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ readSS   : num  357 264 370 347 373 ...
##  $ mathSS   : num  387 303 365 344 441 ...
##  $ proflvl  : Factor w/ 4 levels "advanced","basic",..: 2 3 2 2 2 4 4 4 3 2 ...
##  $ race     : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...

Always Check Your Data

A few great commands:

dim(df)

## [1] 2700   32

summary

summary(df[, 1:5])

##        X               school         stuid            grade     
##  Min.   :     44   Min.   :   1   Min.   :   205   Min.   :3.00  
##  1st Qu.: 108677   1st Qu.: 195   1st Qu.: 44205   1st Qu.:4.00  
##  Median : 458596   Median : 436   Median : 88205   Median :5.00  
##  Mean   : 557918   Mean   : 460   Mean   : 99229   Mean   :5.44  
##  3rd Qu.: 972291   3rd Qu.: 717   3rd Qu.:132205   3rd Qu.:7.00  
##  Max.   :1499992   Max.   :1000   Max.   :324953   Max.   :8.00  
##      schid      
##  Min.   :  6.0  
##  1st Qu.: 15.0  
##  Median : 55.5  
##  Mean   : 52.0  
##  3rd Qu.: 75.0  
##  Max.   :105.0

Checking your data II

names

names(df)

##  [1] "X"           "school"      "stuid"       "grade"       "schid"      
##  [6] "dist"        "white"       "black"       "hisp"        "indian"     
## [11] "asian"       "econ"        "female"      "ell"         "disab"      
## [16] "sch_fay"     "dist_fay"    "luck"        "ability"     "measerr"    
## [21] "teachq"      "year"        "attday"      "schoolscore" "district"   
## [26] "schoolhigh"  "schoolavg"   "schoollow"   "readSS"      "mathSS"     
## [31] "proflvl"     "race"

attributes and class

names(attributes(df))

## [1] "names"     "row.names" "class"

class(df)

## [1] "data.frame"

str which lists all data elements in an object and their type

Data Warehouses, Oracle, SQL and RODBC

Do you have data in a warehouse?
RODBC can help
You can query the data directly and bring it into R, saving time and hassle
Makes your work reproducible, always start with a clean slate of data
At DPI this can allow us to pull data directly from LDS or other databases using SQL queries

An Example From DPI

The basics of the RODBC package are easy to understand

library(RODBC)  # ODBC database interface for R
# Establish a connection; we can hold multiple connections in the same R session
# WARNING: credentials are stored in plain text unless you do some magic
channel <- odbcConnect("Mydatabase.location", uid = "useR", pwd = "secret")
# Get a list of tables in the connection
table_list <- sqlTables(channel, schema = "My_DB")
# Get the column names of a table
colnames(sqlFetch(channel, "My_DB.TABLE_NAME", max = 1))
# Execute a SQL query; any SQL query can be pasted into this space as a string
datapull <- sqlQuery(channel, "SELECT DATA1, DATA2, DATA3 FROM My_DB.TABLE_NAME")
# Close the connection when finished
odbcClose(channel)

Missing Data

Let's add some missing data to our dataframe so we can see how missing data works

random <- sample(unique(df$stuid), 100)
random2 <- sample(unique(df$stuid), 120)
messdf <- df
messdf$readSS[messdf$stuid %in% random] <- NA
messdf$mathSS[messdf$stuid %in% random2] <- NA

What is this code doing?
How can we find out?
Don't try this at home!

Checking for Missing Data

The summary function helps identify missing data

summary(messdf[, c("stuid", "readSS", "mathSS")])

##      stuid            readSS        mathSS   
##  Min.   :   205   Min.   :252   Min.   :210  
##  1st Qu.: 44205   1st Qu.:431   1st Qu.:418  
##  Median : 88205   Median :497   Median :480  
##  Mean   : 99229   Mean   :497   Mean   :484  
##  3rd Qu.:132205   3rd Qu.:564   3rd Qu.:543  
##  Max.   :324953   Max.   :833   Max.   :828  
##                   NA's   :223   NA's   :288

nrow(messdf[!complete.cases(messdf), ])  # number of rows with missing data

## [1] 494

To get rid of missing data, we can copy our data with all missing cases dropped using the na.omit function

cleandf <- na.omit(messdf)
nrow(cleandf)

## [1] 2206
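
The same checks are easier to follow on a data frame small enough to read by eye. This is a sketch on invented values; only the column names are borrowed from the tutorial data.

```r
toy <- data.frame(stuid  = 1:5,
                  readSS = c(357, NA, 370, 347, NA),
                  mathSS = c(387, 303, NA, 344, NA))

# Missing values per column
na_per_col <- colSums(is.na(toy))
print(na_per_col)

# Rows with at least one missing value (students 2, 3, and 5)
n_incomplete <- sum(!complete.cases(toy))
print(n_incomplete)

# na.omit drops every row that has an NA in any column
clean <- na.omit(toy)
print(clean)
```

Note that `na.omit` removes a row even when only one of its columns is missing, so it can discard more data than you expect.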

Now we have the data

What next?
We need to do some basic diagnostics on our data to understand the look and feel of it before we proceed
Here are a few examples of scripts we could run to understand our data object

dim(messdf)

## [1] 2700   32

str(messdf[, 18:26])

## 'data.frame':    2700 obs. of  9 variables:
##  $ luck       : int  0 1 0 1 0 0 1 0 0 0 ...
##  $ ability    : num  87.9 97.8 104.5 111.7 81.9 ...
##  $ measerr    : num  11.13 6.82 -7.86 -17.57 52.98 ...
##  $ teachq     : num  39.0902 0.0985 39.5389 24.1161 56.6806 ...
##  $ year       : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ attday     : int  180 180 160 168 156 157 169 180 170 152 ...
##  $ schoolscore: num  29.2 56 56 56 56 ...
##  $ district   : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ schoolhigh : int  0 0 0 0 0 0 0 0 0 0 ...

Looking at data structure

names(messdf)

##  [1] "X"           "school"      "stuid"       "grade"       "schid"      
##  [6] "dist"        "white"       "black"       "hisp"        "indian"     
## [11] "asian"       "econ"        "female"      "ell"         "disab"      
## [16] "sch_fay"     "dist_fay"    "luck"        "ability"     "measerr"    
## [21] "teachq"      "year"        "attday"      "schoolscore" "district"   
## [26] "schoolhigh"  "schoolavg"   "schoollow"   "readSS"      "mathSS"     
## [31] "proflvl"     "race"

It looks like we have a number of id variables. This is useful, and it is good to check whether these variables have multiple rows per id. We can do this using length and unique

length(unique(messdf$stuid))

## [1] 1200

length(unique(messdf$schid))

## [1] 6

length(unique(messdf$dist))

## [1] 3
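
Comparing the number of unique ids with the number of rows tells us whether ids repeat. Here is a sketch of that check on a made-up data frame, which also counts how many rows each id has.

```r
toy <- data.frame(stuid = c(1, 1, 2, 3, 3, 3),
                  grade = c(3, 4, 3, 3, 4, 5))

n_ids  <- length(unique(toy$stuid))  # number of distinct students
n_rows <- nrow(toy)                  # number of rows
n_ids < n_rows                       # TRUE means some ids appear more than once

# table() counts the rows for each id
rows_per_id <- table(toy$stuid)
print(rows_per_id)

# duplicated() flags the second and later rows of each id
any(duplicated(toy$stuid))
```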

Checking for Coding

Data is coded using numeric or character representations of attributes - commonly things are coded using a 1 and 0 scheme or an A,B,C scheme
With R we can check how our variables are coded very easily

unique(messdf$grade)

## [1] 3 4 5 6 7 8

unique(messdf$econ)

## [1] 0 1

unique(messdf$race)

## [1] B H I W A
## Levels: A B H I W

unique(messdf$disab)

## [1] 0 1

Which are factors?
Which are not?
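
One way to check, sketched on a small made-up data frame so you can still try it on messdf yourself: ask R for the class of every column at once.

```r
toy <- data.frame(grade = c(3, 4, 5),
                  econ  = c(0, 1, 0),
                  race  = factor(c("B", "H", "W")))

# The class of every column
col_classes <- sapply(toy, class)
print(col_classes)

# TRUE for the columns stored as factors
factor_cols <- sapply(toy, is.factor)
print(factor_cols)

# A factor stores its allowed values as levels
levels(toy$race)
```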

Next Steps

In the next section we will learn to aggregate, explore, reshape, and recode data
Questions?

Exercises

Read in the CSV file from the T drive or the project folder
Read in the R data file from the T drive or the project folder
Read in the sample datafile. Find the readSS (reading scale score) for student 205 in grade 4.
Create a list of two attributes for each district in the df datafile.
Think about your own data warehouse environment. Could R interface with it? How?

Other References

Session Info

It is good to include the session info, e.g. this document is produced with knitr version 0.9.6. Here is my session info:

print(sessionInfo(), locale = FALSE)

## R version 2.15.2 (2012-10-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] sandwich_2.2-9  quantreg_4.94   SparseM_0.96    gridExtra_0.9.1
##  [5] mgcv_1.7-22     eeptools_0.1    mapproj_1.2-0   maps_2.3-0     
##  [9] proto_0.3-10    plyr_1.8        stringr_0.6.2   ggplot2_0.9.3  
## [13] lmtest_0.9-30   zoo_1.7-9       knitr_0.9.6    
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-0   dichromat_1.2-4    digest_0.6.0      
##  [4] evaluate_0.4.3     formatR_0.7        gtable_0.1.2      
##  [7] labeling_0.1       lattice_0.20-10    MASS_7.3-22       
## [10] Matrix_1.0-10      munsell_0.4        nlme_3.1-106      
## [13] RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3      
## [16] tools_2.15.2

Tutorial 2: Getting Data In

Overview

A quick note on R `packages`

I can haz packages?

Finding Packages

Some Must Have Packages

Data Management

The Working Directory

Preliminaries

RStudio Shortcuts

RStudio Projects

Manipulating Project Paths

When to use a full path

Ground Rules

Missing Data Symbols

What the heck is Not a Number?

Beginning Analysis

Organization of a project is key

Create our Project

Read in Data

CSV is Our Friend

Let's Check What We Got

Always Check Your Data

Checking your data II

Data Warehouses, Oracle, SQL and RODBC

An Example From DPI

Missing Data

Checking for Missing Data

Now we have the data

Looking at data structure

Checking for Coding

Next Steps

Exercises

Other References

Session Info

Attribution and License
