Install Packages (2)

Sector 67 machines already have these installed.
Otherwise, copy and paste the code below into the bottom left window in RStudio (the R terminal):

install_new <- function(mypkg) {
  # Compare against the installed package names only
  if (mypkg %in% rownames(installed.packages())) {
    cat("Package already installed\n")
  } else {
    cat("Package not found, so installing with dependencies...\n",
        "Press CTRL C to abort.\n")
    Sys.sleep(5)  # Give the user five seconds to abort
    install.packages(mypkg, repos = "http://cran.wustl.edu/")
  }
}

install_new('plyr')
install_new('lmtest')
install_new('ggplot2')
install_new('gridExtra')
install_new('stringr')
install_new('knitr')
install_new('quantreg')
install_new('zoo')
install_new('xtable')
install_new('lme4')
install_new('caret')

Overview

What is R?
What is RStudio?
How does it work?
What makes the language different?
Why learn it?

R

R is an Open Source (and freely available) environment for statistical computing and graphics
Available for Windows, Mac OS X, and Linux
R is being actively developed, with two major releases per year and dozens of releases of add-on packages
R can be extended with 'packages' that contain data, code, and documentation to add new functionality

Why Use R

R is a common tool among data experts at major universities
No need to go through procurement; R can be installed in any environment on any machine and used with no licensing fees or agreements needed
R source code is very readable to increase transparency of processes
R code is easily borrowed from and shared with others
R is incredibly flexible and can be adapted to specific local needs
R is under very active development, improving rapidly, and widely supported by both professional and academic developers

R Advantages Continued

R is platform agnostic - Linux, Mac, PC, server, desktop, etc.
R can output results in a variety of formats
R can build routines straight out of a database for common and universal reporting

R Can Complement Other Tools

R plays nicely with data from Stata, SPSS, SAS and others
R can check work, produce output, visualize results from other programs
R can do bleeding edge analyses that aren't available in proprietary packages yet
R is becoming more prevalent in undergraduate statistics courses - more and more potential employees are learning it each year

R's Drawbacks

R is based on S, which is close to 40 years old
R only has features that the community contributes
Not the ideal solution to all problems
R is a programming language and not a software package--steeper learning curve
R can be much slower than compiled languages

R Vocabulary

packages: add-on features for R that include data, new functions and methods, and extended capabilities. Think of them as "apps" on your phone. We've already installed several!
terminal: the main window of R, where you enter commands
scripts: where you store commands to be run in the terminal later, like syntax files in SPSS or .do files in Stata
functions: commands that do something to an object in R
dataframe: the main element for statistical purposes, an object with rows and columns that includes numbers, factors, and other data types
workspace: the working memory of R, where all objects are stored
vector: the basic unit of data in R
symbols: used to name and store objects or to designate operations/functions
attributes: determine how functions act on objects

Components of an R Setup

R - R works in the command line of any OS, but also comes with a basic GUI to operate on its own in Windows and Mac
RStudio - a much better way to work in R that allows editing of scripts, operation of R, viewing of the workspace, and R help all on one screen

Self-help

In the spirit of open source, R is very much a self-guided tool
Let's see, type: ?summary
Now type: ??regression
For tricky questions, funky error messages (there are many), and other issues, use Google (add "in R" to the end of your query)
We can also use RSeek - the search engine just for R!
StackOverflow has become a great resource with many questions for many specific packages in R, and a rating system for answers
A number of R Core members contribute there
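
The ? and ?? shortcuts above have long-form equivalents that are handy to know. A minimal sketch using only base R; the search pattern "^read\\." is just an illustration, not something from the slides:

```r
# ?summary is shorthand for help("summary");
# ??regression is shorthand for help.search("regression").
# Both open the help viewer, so they are left commented out here:
# help("summary")
# help.search("regression")

# apropos() lists objects on the search path whose names match a pattern
read_funs <- apropos("^read\\.")
print(read_funs)

# args() shows a function's arguments without opening the help page
args(read.csv)
```

apropos() is useful when you half-remember a function name and want candidates before opening the help page.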

Let's Look at RStudio

RStudio has made R more accessible and easy to use than ever!
Open up Intro to R Programming.R
4 panels, various tabs
Help, plots, file structure
Workspace, history, version control
Working files
The R Console

R As A Calculator

2 + 2  # add numbers

[1] 4

2 * pi  #multiply by a constant

[1] 6.283

7 + runif(1, min = 0, max = 1)  #add a random variable

[1] 7.457

4^4  # powers

[1] 256

sqrt(4^4)  # functions

[1] 16

Arithmetic Operators

In addition to the obvious + - / * and exponentiation ^, there is also integer division %/% and the remainder from integer division %% (known as modulo arithmetic). Note that == tests equality, while a single = assigns.

2 + 2

[1] 4

2/2

[1] 1

2 * 2

[1] 4

2^2

[1] 4

2 == 2

[1] TRUE

23%/%2

[1] 11

23%%2

[1] 1
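
Integer division and the remainder fit together: the quotient times the divisor, plus the remainder, gives back the original number. A small sketch of that relationship (the variable names are made up for illustration):

```r
x <- 23
y <- 2

quotient  <- x %/% y   # 11
remainder <- x %% y    # 1

# The two pieces rebuild the original number
quotient * y + remainder == x   # TRUE

# Both operators are vectorized; here %% flags the even numbers in 1:6
(1:6) %% 2 == 0
```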

Other Key Symbols

<- is the assignment operator; it stores the value on the right under the name on the left

foo <- 3
foo

[1] 3

: is the sequence operator

1:10

 [1]  1  2  3  4  5  6  7  8  9 10

# it increments by one
a <- 100:120
a

 [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
[18] 117 118 119 120

This is handy

Comments in R

# denotes a comment in R
Anything after the # on a line is ignored by R and not evaluated
This is handy for making things reproducible

# Something I want to keep from R
# Like my secret from the R engine
# Maybe intended for a human and not the computer
# Like: Look at this cool plot!

myplot(readSS,mathSS,data=df)

R Advanced Math

R also supports advanced mathematical features and expressions
R can take integrals and derivatives and express complex functions
Easiest of all, R can generate distributions of data very easily
e.g. rnorm(100) or rbinom(100, size = 1, prob = 0.5)
This comes in handy when writing examples and building analyses because it is trivial to generate a synthetic piece of data to use as an example
Go ahead, try typing hist(rnorm(10000)) into RStudio
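
Here is a short sketch of the features listed above, using base R only. The seed, the means, and the probabilities are arbitrary values picked for illustration:

```r
set.seed(42)  # arbitrary seed so the draws are reproducible

heights <- rnorm(100, mean = 170, sd = 10)    # 100 draws from a normal
passed  <- rbinom(100, size = 1, prob = 0.7)  # 100 zero/one draws with P(1) = 0.7

length(heights)
table(passed)

# Symbolic derivative of x^2 with respect to x
dx <- D(expression(x^2), "x")
dx

# Numerical integral of the standard normal density over [-1.96, 1.96]
area <- integrate(dnorm, lower = -1.96, upper = 1.96)
area$value   # close to 0.95
```

Note that rbinom() needs a size and a prob in addition to the number of draws.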

Using the Workspace

To do more we need to learn how to manipulate the 'workspace'.
This includes all the vectors, datasets, and functions stored in memory.
All R objects are stored in the memory of the computer, limiting the available space for calculation to the size of the RAM on your machine.
R makes organizing the workspace easy.

Using the Workspace (2)

x <- 5  #store a variable with <-
x  #print the variable

[1] 5

z <- 3
ls()  #list all variables

[1] "a"   "foo" "x"   "z"

ls.str()  #list and describe variables

a :  int [1:21] 100 101 102 103 104 105 106 107 108 109 ...
foo :  num 3
x :  num 5
z :  num 3

rm(x)  # delete a variable
ls()

[1] "a"   "foo" "z"

R as a Language

R is more than statistical software, it is a computer language
Like any language it has rules (some poorly enforced), and conventions
You will learn more as you go, but we'll go over a few to start

Case sensitivity matters

a <- 3
A <- 4
print(c(a, A))

[1] 3 4

a ≠ A

What happens if I type print(a,A)?

`c` is our friend

So what does c do?

A <- c(3, 4)
print(A)

[1] 3 4

c stands for concatenate and allows vectors to have multiple elements
If you ever need more than one element in a vector, wrap the elements in c, which is one of the functions you will use most
c is important to put any vector together, but remember that objects within a vector must all be of the same type
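
To see the "same type" rule in action, here is a small sketch of what c does when handed mixed types (variable names are made up for illustration):

```r
mixed_num_lgl <- c(1, TRUE, FALSE)   # logical values become 1 and 0
mixed_chr     <- c(1, "a", TRUE)     # everything becomes character

class(mixed_num_lgl)   # "numeric"
class(mixed_chr)       # "character"
mixed_chr              # "1" "a" "TRUE"

# c() also flattens: combining two vectors gives one longer vector
flat <- c(c(1, 2), c(3, 4))
length(flat)           # 4
```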

Language

In language there are a number of ways to say the same thing
The dog chased the cat.
The cat was chased by the dog.
By the dog, the cat was chased.
Some ways are more elegant than others, all convey the same message.

a <- runif(100)  # Generate 100 random numbers
b <- runif(100)  # 100 more
c <- NULL  # Setup for loop (declare variables)
for (i in 1:100) {
    # Loop just like in Java or C
    c[i] <- a[i] * b[i]
}
d <- a * b
identical(c, d)  # Test equality

[1] TRUE

Which is nicer? c or d?

More Language "Bugs" (Features)

R is maddeningly inconsistent in its naming conventions
Some functions are camelCase; others are.dot.separated; others use_underscores
Function results are stored in a variety of ways across function implementations
R has multiple graphics packages that different functions use for default plot construction (base, grid, lattice, and ggplot2)
R has multiple packages and functions to do the same analysis as well, though some standardization has started to occur
Be flexible and be aware of R's flexibility

Objects

Everything in R is an object - even functions
Objects can be manipulated many ways
A common example is applying the summary function to a variety of object types and seeing how it adapts

summary(df[, 28:31])  #summary look at df object

   schoollow         readSS        mathSS           proflvl    
 Min.   :0.000   Min.   :252   Min.   :210   advanced   : 788  
 1st Qu.:0.000   1st Qu.:430   1st Qu.:418   basic      : 523  
 Median :0.000   Median :495   Median :480   below basic: 210  
 Mean   :0.242   Mean   :496   Mean   :483   proficient :1179  
 3rd Qu.:0.000   3rd Qu.:562   3rd Qu.:543                     
 Max.   :1.000   Max.   :833   Max.   :828

summary(df$readSS)  #summary of a single column

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    252     430     495     496     562     833

The $ says to look for object readSS in object df

Graphics too

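library(ggplot2) # Load graphics package
library(eeptools) # Provides theme_dpi()
# labs(title = ...) replaces opts(title = ...), which newer ggplot2 no longer provides
qplot(readSS,mathSS,data=df,geom='point',alpha=I(0.3))+theme_dpi()+
  labs(title='Test Score Relationship')+
  geom_smooth()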

[Figure: Student Test Scores]

Handling Data in R

R handles data differently than many other statistical packages
In R, all elements are objects

length(unique(df$school))

[1] 173

length(unique(df$stuid))

[1] 1200

uniqstu <- length(unique(df$stuid))
uniqstu

[1] 1200

Results of function calls can be stored

Special Operators

The comparison operators <, >, <=, >=, ==, and != are used to compare values across vectors

big <- c(9, 12, 15, 25)
small <- c(9, 3, 4, 2)
# Give us a nice vector of logical values
big > small

[1] FALSE  TRUE  TRUE  TRUE

big = small
# Oops--don't do this, reassigns big to small
print(big)

[1] 9 3 4 2

print(small)

[1] 9 3 4 2

Comparison operators can be tricky, so to keep it straight never use = to assign anything: always use <- for assignment and == to test equality

Special Operators (II)

The best way to evaluate these objects is to use brackets [] to avoid confusion

big <- c(9, 12, 15, 25)
big[big == small]

[1] 9

# Returns values where the logical vector is true
big[big > small]

[1] 12 15 25

big[big < small]  # Returns an empty set

numeric(0)

Special operators (III)

The %in% operator determines whether each value in the left operand can be matched with one of the values in the right operand.

big <- c(9, 12, 15, 25)
small <- c(9, 12, 15, 25, 9, 1, 3)
big[small %in% big]

[1]  9 12 15 25 NA

9, 12, 15, and 25 all appear in big, but small %in% big has seven elements (one per element of small) while big has only four. The fifth TRUE, for the second 9, points past the end of big, and so an NA is returned
What if we reverse this?

big[big %in% small]

[1]  9 12 15 25

No NA
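
The result of %in% is as long as its left operand, so it is safest to index that same vector with it. A short sketch using the vectors from this slide:

```r
big   <- c(9, 12, 15, 25)
small <- c(9, 12, 15, 25, 9, 1, 3)

in_big <- small %in% big
length(in_big)    # 7: one answer per element of small
small[in_big]     # 9 12 15 25 9
small[!in_big]    # 1 3
which(in_big)     # positions 1 2 3 4 5
```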

Special operators (IV)

The logical operators | (or) and & (and) can be used to combine two logical values and produce another logical value as the result. The operator ! (not) negates a logical value. These operators allow complex conditions to be constructed.

foo <- c("a", NA, 4, 9, 8.7)
!is.na(foo)  # Returns TRUE for non-NA

[1]  TRUE FALSE  TRUE  TRUE  TRUE

class(foo)

[1] "character"

a <- foo[!is.na(foo)]
a

[1] "a"   "4"   "9"   "8.7"

class(a)

[1] "character"

Special operators (V)

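The operators | and & combine two logical vectors. The comparison is performed element by element, so the result is also a logical vector. The doubled forms || and && look similar but are meant for single TRUE/FALSE values, such as the condition of an if statement.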

zap <- c(1, 4, 8, 2, 9, 11)
zap[zap > 2 | zap < 8]

[1]  1  4  8  2  9 11

zap[zap > 2 & zap < 8]

[1] 4
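
A small sketch contrasting the element-wise & with the single-value && (the message text is made up for illustration):

```r
zap <- c(1, 4, 8, 2, 9, 11)

in_range <- zap > 2 & zap < 8   # element-wise: one TRUE/FALSE per element
in_range
sum(in_range)                   # TRUE counts as 1, so this counts the matches

# && expects single TRUE/FALSE values, e.g. inside if()
if (length(zap) > 0 && zap[1] == 1) {
  msg <- "zap is non-empty and starts with 1"
} else {
  msg <- "condition not met"
}
msg
```

&& also stops early: if the left side is FALSE, the right side is never evaluated.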

Regular Expressions

R also supports a full suite of regular expressions
This could be material for a full tutorial in a more advanced bootcamp
If you know and use regex, then rest assured you can keep using it in R
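
For a quick taste, here is a sketch of three base R regex functions. The ID strings are invented for illustration:

```r
ids <- c("stu-001", "stu-042", "sch-007", "dist-3")

starts_stu <- grepl("^stu-", ids)          # TRUE TRUE FALSE FALSE
digits     <- sub("^[a-z]+-", "", ids)     # "001" "042" "007" "3"
as.numeric(digits)                         # 1 42 7 3

# value = TRUE returns the matching strings rather than their positions
three_digit <- grep("[0-9]{3}$", ids, value = TRUE)
three_digit                                # "stu-001" "stu-042" "sch-007"
```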

R Data Modes

R allows users to implement a number of different types of data
The three basic data types are numeric data, character data, and logical data
Vectors must be of one consistent type of data, so if you make a vector with multiple types, it generally defaults to being a character vector

Data Modes in R (numeric)

numeric vectors contain, as you would guess, numbers!

is.numeric(A)

[1] TRUE

class(A)

[1] "numeric"

print(A)

[1] 3 4

Data Modes (Character)

character data is known as strings in other software: any characters that have no numeric meaning

b <- c("one", "two", "three")
print(b)

[1] "one"   "two"   "three"

is.numeric(b)

[1] FALSE

Data Modes (Logical)

logical is TRUE or FALSE values, useful for logical testing and programming
We've already seen these returned when we have asked R a question before

c <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
is.numeric(c)

[1] FALSE

is.character(c)

[1] FALSE

is.logical(c)  # Results in a logical value

[1] TRUE

Easier way

Just ask R using the class function

class(A)

[1] "numeric"

class(b)

[1] "character"

class(c)

[1] "logical"

A Note on Vectors

Vectors are collections of consistent data types
numeric can either be double or integer depending on the bytes size
logical
character
complex
raw
All vectors must be consistent among types, but some data objects like data frames can combine multiple vectors of different types
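
While class gives the friendly name, typeof shows the underlying storage type from the list above. A quick sketch:

```r
typeof(1)            # "double"
typeof(1L)           # "integer": the L suffix asks for an integer
typeof(1:3)          # "integer"
is.numeric(1L)       # TRUE: integers and doubles both count as numeric
typeof(TRUE)         # "logical"
typeof("a")          # "character"
typeof(2 + 3i)       # "complex"
typeof(as.raw(255))  # "raw"
```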

Factor

A factor is a very special and sometimes frustrating data type in R

myfac <- factor(c("basic", "proficient", "advanced", "minimal"))
class(myfac)

[1] "factor"

myfac  # What order are the factors in?

[1] basic      proficient advanced   minimal   
Levels: advanced basic minimal proficient

What if we don't like the order these are in? Factor order is important for all kinds of things like plot type, regression output, and more

Ordering the Factor

Ordered factors simply have an additional attribute explaining the order of the levels of a factor
This is a useful shortcut when we want to preserve some of the meaning provided by the order
Think ordinal data

myfac_o <- ordered(myfac, levels = c("minimal", "basic", "proficient", "advanced"))
myfac_o

[1] basic      proficient advanced   minimal   
Levels: minimal < basic < proficient < advanced

summary(myfac_o)

   minimal      basic proficient   advanced 
         1          1          1          1

Reclassifying Factors

Turning factors into other data types can be tricky. All factor levels have an underlying numeric structure.

class(myfac_o)

[1] "ordered" "factor"

unclass(myfac_o)

[1] 2 3 4 1
attr(,"levels")
[1] "minimal"    "basic"      "proficient" "advanced"

defac <- unclass(myfac_o)
defac

[1] 2 3 4 1
attr(,"levels")
[1] "minimal"    "basic"      "proficient" "advanced"

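What is wrong with this? The numbers are only the positions of the levels (minimal is 1, basic is 2, and so on), so they carry meaning only if the levels were put in the right order.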
Be careful! The best way to unpack a factor is to convert it to a character first.

Defactor

# From the eeptools package
defac <- function(x) {
    x <- as.character(x)
    x
}

defac(myfac_o)

[1] "basic"      "proficient" "advanced"   "minimal"

defac <- defac(myfac_o)
defac

[1] "basic"      "proficient" "advanced"   "minimal"

Convert to Numeric?

What if we do want it to be numeric?
The best way to do this is to recode the variable manually--we'll discuss this later
You can try to convert it to numeric directly, but do so at your own risk:

myfac_o

[1] basic      proficient advanced   minimal   
Levels: minimal < basic < proficient < advanced

as.numeric(myfac_o)

[1] 2 3 4 1

If we did not properly specify the order above, this would be wrong!

myfac

[1] basic      proficient advanced   minimal   
Levels: advanced basic minimal proficient

as.numeric(myfac)

[1] 2 4 1 3
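
The same trap shows up when a factor's labels are themselves numbers. A minimal sketch with made-up scores:

```r
scores <- factor(c("10", "20", "10", "35"))

codes  <- as.numeric(scores)                 # 1 2 1 3: the level codes
values <- as.numeric(as.character(scores))   # 10 20 10 35: the actual values

codes
values
```

Going through as.character first is the safe route whenever the labels carry the numbers you want.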

Dates

R has built-in ways to handle dates
See lubridate package for more advanced functionality including mathematical operations on dates

mydate <- as.Date("7/20/2012", format = "%m/%d/%Y")
# Input is a character string and a parser
class(mydate)  # this is date

[1] "Date"

weekdays(mydate)  # what day of the week is it?

[1] "Friday"

mydate + 30  # Operate on dates

[1] "2012-08-19"

More Dates

# We can parse other formats of dates
mydate2 <- as.Date("8-5-1988", format = "%d-%m-%Y")
mydate2

[1] "1988-05-08"


mydate - mydate2

Time difference of 8839 days

# Can add and subtract two date objects

A few notes on dates

R converts all dates to numeric values, like Excel and other languages
The origin date in R is January 1, 1970

as.numeric(mydate)  # days since 1-1-1970

[1] 15541

as.Date(56, origin = "2013-4-29")  # we can set our own origin

[1] "2013-06-24"
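
A few more things dates can do in base R, sketched with the two dates used earlier (the variable names here are new, for illustration):

```r
d1 <- as.Date("2012-07-20")   # ISO format needs no format string
d2 <- as.Date("1988-05-08")

format(d1, "%m/%d/%Y")                          # back to "07/20/2012"
gap_weeks <- as.numeric(difftime(d1, d2, units = "weeks"))
gap_weeks                                       # the gap expressed in weeks
next_months <- seq(d1, by = "month", length.out = 3)
next_months                                     # the 20th of three consecutive months
d1 > d2                                         # dates can be compared: TRUE
```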

Why care so much about classes?

Classes determine what you can and can't do with objects
Different classes have different computational times associated with them, so the choice can affect the optimization of the code
Classes allow you to keep projects/data organized
Because R makes you care

Data Structures in R

R has a number of basic data classes as well as arbitrary specialized object types for various purposes
vectors are the basic data class in R and can be thought of as a single column of data (even a column of length 1)
matrices and arrays are rows and columns of all the same mode data
dataframes are rows and columns where the columns can represent different data types
lists are arbitrary combinations of disparate object types in R

Vectors

Everything is a vector in R, even single numbers
Single objects are "atomic" vectors

print(1)

[1] 1

# The [1] in brackets is the index of the first element printed; this vector has length 1
print("This tutorial is awesome")

[1] "This tutorial is awesome"

# This is a vector of length 1 consisting of a single 'string of
# characters'

Vectors 2

print(LETTERS)

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

# This vector has 26 character elements
print(LETTERS[6])

[1] "F"

# The sixth element of this vector has length 1
length(LETTERS[6])

[1] 1

# The length of that element is a number with length 1

Matrices

Matrices are combinations of vectors of the same length and data type
We can have numeric matrices, character matrices, or logical matrices
Can't mix types

mymat <- matrix(1:36, nrow = 6, ncol = 6)
rownames(mymat) <- LETTERS[1:6]
colnames(mymat) <- LETTERS[7:12]
class(mymat)  # R 4.0 and later also report "array" here

[1] "matrix"

Matrices II

rownames(mymat)

[1] "A" "B" "C" "D" "E" "F"

colnames(mymat)

[1] "G" "H" "I" "J" "K" "L"

mymat

  G  H  I  J  K  L
A 1  7 13 19 25 31
B 2  8 14 20 26 32
C 3  9 15 21 27 33
D 4 10 16 22 28 34
E 5 11 17 23 29 35
F 6 12 18 24 30 36
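
Once a matrix has row and column names, cells can be pulled out by position or by name. A short sketch that rebuilds the same matrix under a new name, m, so it runs on its own:

```r
m <- matrix(1:36, nrow = 6, ncol = 6,
            dimnames = list(LETTERS[1:6], LETTERS[7:12]))

m["B", "H"]        # one cell by name: 8
m[2, ]             # the whole second row
m[, c("G", "L")]   # two columns by name
rowSums(m)["A"]    # total of row A: 96
dim(t(m))          # the transpose is still 6 x 6
```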

Arrays

Arrays are a set of matrices of the same dim and class
Arrays allow dimensions to be named

myarray <- array(1:42, dim = c(7, 3, 2), dimnames = list(c("tiny", "small", 
    "medium", "medium-ish", "large", "big", "huge"), c("slow", "moderate", "fast"), 
    c("boring", "fun")))
class(myarray)

[1] "array"

dim(myarray)

[1] 7 3 2

Arrays II

dimnames(myarray)

[[1]]
[1] "tiny"       "small"      "medium"     "medium-ish" "large"     
[6] "big"        "huge"      

[[2]]
[1] "slow"     "moderate" "fast"    

[[3]]
[1] "boring" "fun"

myarray

, , boring

           slow moderate fast
tiny          1        8   15
small         2        9   16
medium        3       10   17
medium-ish    4       11   18
large         5       12   19
big           6       13   20
huge          7       14   21

, , fun

           slow moderate fast
tiny         22       29   36
small        23       30   37
medium       24       31   38
medium-ish   25       32   39
large        26       33   40
big          27       34   41
huge         28       35   42
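
Named dimensions make array indexing readable. A smaller, made-up array keeps the sketch short:

```r
arr <- array(1:24, dim = c(4, 3, 2),
             dimnames = list(c("tiny", "small", "large", "huge"),
                             c("slow", "moderate", "fast"),
                             c("boring", "fun")))

arr["small", "fast", "fun"]        # a single cell: 22
arr[, , "fun"]                     # the whole "fun" matrix
slice_totals <- apply(arr, 3, sum) # one total per slice
slice_totals
```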

Lists

Lists are arbitrary collections of objects.
The objects do not have to be of the same type, length, or dimensions

myvec <- c(1, 2, 4, 5, 9)
mylist <- list(vec = myvec, mat = mymat, arr = myarray, date = mydate)
class(mylist)

[1] "list"

length(mylist)

[1] 4

names(mylist)

[1] "vec"  "mat"  "arr"  "date"

Print a List

str(mylist)

List of 4
 $ vec : num [1:5] 1 2 4 5 9
 $ mat : int [1:6, 1:6] 1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "A" "B" "C" "D" ...
  .. ..$ : chr [1:6] "G" "H" "I" "J" ...
 $ arr : int [1:7, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : chr [1:7] "tiny" "small" "medium" "medium-ish" ...
  .. ..$ : chr [1:3] "slow" "moderate" "fast"
  .. ..$ : chr [1:2] "boring" "fun"
 $ date: Date[1:1], format: "2012-07-20"

Lists (II)

R has two object classification schemes S3 and S4
For S3 use $ or [[]] to extract elements
For S4 use @ to extract elements

mylist$vec

[1] 1 2 4 5 9

mylist[[2]][1, 3]

[1] 13

Where does the value returned by the second command come from?
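
One way to work through that question is to take the extraction one step at a time. This sketch rebuilds a two-element version of the list so it runs on its own:

```r
mylist <- list(vec = c(1, 2, 4, 5, 9),
               mat = matrix(1:36, nrow = 6))

class(mylist["vec"])     # "list": single brackets keep the list wrapper
class(mylist[["vec"]])   # "numeric": double brackets return the element itself

second <- mylist[[2]]    # step 1: pull out the second element, the matrix
second[1, 3]             # step 2: row 1, column 3 of that matrix: 13
mylist$mat[1, 3]         # the same cell, reached by name
```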

So what?

Matrices, lists, and arrays are useful for storing analysis results, generating reports, and doing analysis on many object types
A useful tip is to use the attributes function to learn about the object

attributes(mylist)

$names
[1] "vec"  "mat"  "arr"  "date"

attributes(myarray)[1:2][2]

$dimnames
$dimnames[[1]]
[1] "tiny"       "small"      "medium"     "medium-ish" "large"     
[6] "big"        "huge"      

$dimnames[[2]]
[1] "slow"     "moderate" "fast"    

$dimnames[[3]]
[1] "boring" "fun"

They also provide simplified ways to get used to operating on dataframes by reducing complexity

Dataframes

Dataframes are combinations of vectors of the same length, but can be of different types

str(df[, 25:32])

'data.frame':   2700 obs. of  8 variables:
 $ district  : int  3 3 3 3 3 3 3 3 3 3 ...
 $ schoolhigh: int  0 0 0 0 0 0 0 0 0 0 ...
 $ schoolavg : int  1 1 1 1 1 1 1 1 1 1 ...
 $ schoollow : int  0 0 0 0 0 0 0 0 0 0 ...
 $ readSS    : num  357 264 370 347 373 ...
 $ mathSS    : num  387 303 365 344 441 ...
 $ proflvl   : Factor w/ 4 levels "advanced","basic",..: 2 3 2 2 2 4 4 4 3 2 ...
 $ race      : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...

Data frames must have consistent dimensions
Dataframes are what we use most commonly as a "dataset" for analysis
Dataframes are what set R apart from other programming languages like C, C++, Python, and Perl.
The dataframe structure is much more complex and much easier to use than any data structure in these languages - though Python is catching up!

Converting Between Types

R has built in functions to allow you to force objects to move between types
These follow the general form as.whatIwant as in as.factor or as.table or as.data.frame
You will use these commands a lot to convert output from various functions into a form you can input into a different function
A good example is converting correlation matrices into dataframes so we can plot them
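
As a sketch of that last example, here is one way to turn a correlation matrix into a data frame. It uses the built-in mtcars data rather than the tutorial dataset, and the column names var1, var2, and correlation are made up:

```r
cors <- cor(mtcars[, c("mpg", "wt", "hp")])   # 3 x 3 correlation matrix
class(cors)

# as.table() then as.data.frame() gives one row per pair of variables
cors_long <- as.data.frame(as.table(cors))
names(cors_long) <- c("var1", "var2", "correlation")
head(cors_long, 3)

as.numeric("3.14") + 1        # character to numeric
as.character(c(1, 2, 3))      # numeric to character
```

The long format, with one row per pair, is the shape most plotting functions expect.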

Summing it Up

Vectors are used to store bits of data
Matrices are combinations of vectors of the same length and type
Matrices are most commonly used in statistical models (in the background), and for computation
Arrays are stacks of matrices and are used in building multiple models or for storing complex data structures
Lists are groups of R objects commonly used to combine function output in useful ways (like store model results and model data together)

Other References for the Previous Section

Books

There are a number of great R books available for learning the beginnings of R and for learning specific statistical techniques
"Discovering Statistics Using R" by Andy Field
"Applied Econometrics with R" by Achim Zeileis and Christian Kleiber
"The Art of R Programming" by Norman Matloff
The R Inferno
The R Book by Michael J. Crawley (new version imminent)
The R Cookbook

Overview

Loading Packages
Understand data types
Load a CSV file
Organize an analysis project
Query a database

A quick note on R `packages`

packages are essentially free and open source add-ons for R
There are thousands of packages available for R (over 3,000 when this was written) that add all sorts of functionality
A few examples (from the mundane to the crazy)
Additional graphics capabilities from the ggplot2 package
Advanced regression techniques from the lme4 package for mixed effects models
3d graphics from the scatterplot3d package (also webGL)
GIS analytics and mapping functionality with sp
Text mining analytics with tm
Predictive modeling frameworks with caret
Interfaces to other programming languages like Python, Java, and C and C++
A web server: Rserve
And Minesweeper from the fun package

I can haz packages?

# You can find and install packages within R
install.packages("foo")  # Name must be in quotes
install.packages(c("foo", "foo1", "foo2"))
# Packages get updated FREQUENTLY
update.packages()  # Gonna update them all

Note, on Windows Vista and later R either needs to be run as an administrator to install packages, or you have to fiddle with where the packages are installed
Packages are stored in something called the library which is just a collection of packages
Sometimes folks call packages libraries
Loading a package couldn't be easier: type library(ggplot2) and you're done!

Finding Packages

Official packages are found on CRAN (Comprehensive R Archive Network)
Unofficial packages or beta versions of packages are found on R-Forge and GitHub
To find out what packages are out there that do a specific function, try:
Google "doing X in R package"
Look at CRAN taskviews
CRAN taskviews are great to find a bunch of packages related to a problem you are trying to solve

The Working Directory

The working directory, lovingly denoted by wd is both your friend and enemy
R needs to know where it is in your file system to be able to access data and write output
Check this by typing getwd()
R work is best done in a self-contained directory, like C:/Users/My Documents/My Project/ which is then set as the working directory
How to set the working directory? The setwd() command: setwd("PATH/TO/MY PROJECT/")
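
A minimal sketch of switching the working directory and switching back. The folder name my_project is hypothetical, and it is created inside R's temporary directory so nothing on your machine is touched:

```r
old_wd <- getwd()                     # remember where we started

project_dir <- file.path(tempdir(), "my_project")   # hypothetical project folder
dir.create(project_dir, showWarnings = FALSE)

setwd(project_dir)
basename(getwd())                     # "my_project"
list.files()                          # lists what is in the project folder

setwd(old_wd)                         # go back when finished
```

file.path() builds the path with the right separators for your operating system.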

Ground Rules

Get used to plain text input files
R can handle other formats, but your error rate increases as does the tweaking necessary
R reserves a limited set of special characters (symbols); avoid them in your data input if you want it to be read correctly
These symbols are reserved and will be interpreted in strange ways if you include them in your plain text data file
Most of them are fairly obvious operators, see Paul Murrell's excellent summary

Missing Data Symbols

Missing data has the symbols NA or NaN or NULL depending on the context.
Consider:

a <- c(1, 2, 3)  # a is a vector with three elements
# Ask R for element 4
print(a[4])

[1] NA

But what is the difference between NA and NULL?

a <- c(a, NULL)  # Append NULL onto a
print(a)

[1] 1 2 3

# Notice no change
a <- c(a, NA)
print(a)

[1]  1  2  3 NA

NA can hold a place, NULL cannot
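
Because NA holds a place, it also flows through calculations. A short sketch of the usual checks (the vector is made up):

```r
scores <- c(1, 2, 3, NA)

length(scores)               # 4: NA holds a place
length(c(1, 2, 3, NULL))     # 3: NULL vanishes
is.na(scores)                # FALSE FALSE FALSE TRUE
mean(scores)                 # NA: one unknown makes the mean unknown
mean(scores, na.rm = TRUE)   # 2: drop the NA first
is.null(NULL)                # TRUE
```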

What the heck is Not a Number?

NaN is even more special: it is what R returns when a numeric calculation has no defined real result, like the square root of a negative number
NaN stands for "Not a Number"

b <- 1
b <- sqrt(-b)

Warning: NaNs produced

print(b)

[1] NaN

pi/0

[1] Inf

Inf is a special case as well representing an infinite value
Just for fun sin(Inf) = NaN
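
A quick sketch of how to test for these special values in base R:

```r
is.nan(0 / 0)                   # TRUE
is.na(0 / 0)                    # TRUE: NaN also counts as missing
is.nan(NA)                      # FALSE: but NA is not NaN
is.finite(c(1, Inf, NA, NaN))   # TRUE FALSE FALSE FALSE

# Imaginary numbers need the complex type, which R also has
root <- sqrt(as.complex(-1))
root                            # 0+1i
```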

Read in Data

Reading in data is one of the trickiest issues for R
This is because R is incredibly flexible and can handle data in almost any form including .csv .dta .sas .spss .dat and even .xls and .xlsx with some care
So we have to carefully specify the data types to R so it can understand what form the data needs to take
Compared to C this is great!

CSV is Our Friend

The easiest file type to read is .csv, though Excel files can be read as well

# Set working directory to the tutorial directory. In RStudio you can do
# this in the 'Tools' tab
setwd("~/GitHub/r_tutorial_ed")
# Load some data
df <- read.csv("data/smalldata.csv")
# Note if we don't assign data to 'df' R just prints contents of table
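
If you don't have the tutorial files handy, this self-contained sketch writes a tiny made-up CSV to a temporary file and reads it back. The column names imitate the tutorial data, but the values are invented:

```r
# Write a tiny, made-up CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("stuid,grade,readSS,proflvl",
             "101,3,357.2,basic",
             "102,4,,proficient",
             "103,3,412.9,advanced"),
           csv_path)

demo <- read.csv(csv_path,
                 stringsAsFactors = FALSE,   # keep text as character
                 na.strings = c("", "NA"))   # treat blanks as missing
str(demo)
sum(is.na(demo$readSS))   # 1: the blank score was read as NA
```

Setting stringsAsFactors and na.strings explicitly is one way to tell R what form the data should take.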

Let's Check What We Got

str(df[, 27:32])

'data.frame':   2700 obs. of  6 variables:
 $ schoolavg: int  1 1 1 1 1 1 1 1 1 1 ...
 $ schoollow: int  0 0 0 0 0 0 0 0 0 0 ...
 $ readSS   : num  357 264 370 347 373 ...
 $ mathSS   : num  387 303 365 344 441 ...
 $ proflvl  : Factor w/ 4 levels "advanced","basic",..: 2 3 2 2 2 4 4 4 3 2 ...
 $ race     : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...

Always Check Your Data

A few great commands:

dim(df)

[1] 2700   32

summary

summary(df[, 1:5])

       X               school         stuid            grade     
 Min.   :     44   Min.   :   1   Min.   :   205   Min.   :3.00  
 1st Qu.: 108677   1st Qu.: 195   1st Qu.: 44205   1st Qu.:4.00  
 Median : 458596   Median : 436   Median : 88205   Median :5.00  
 Mean   : 557918   Mean   : 460   Mean   : 99229   Mean   :5.44  
 3rd Qu.: 972291   3rd Qu.: 717   3rd Qu.:132205   3rd Qu.:7.00  
 Max.   :1499992   Max.   :1000   Max.   :324953   Max.   :8.00  
     schid      
 Min.   :  6.0  
 1st Qu.: 15.0  
 Median : 55.5  
 Mean   : 52.0  
 3rd Qu.: 75.0  
 Max.   :105.0

Checking your data II

names

names(df)

 [1] "X"           "school"      "stuid"       "grade"       "schid"      
 [6] "dist"        "white"       "black"       "hisp"        "indian"     
[11] "asian"       "econ"        "female"      "ell"         "disab"      
[16] "sch_fay"     "dist_fay"    "luck"        "ability"     "measerr"    
[21] "teachq"      "year"        "attday"      "schoolscore" "district"   
[26] "schoolhigh"  "schoolavg"   "schoollow"   "readSS"      "mathSS"     
[31] "proflvl"     "race"

attributes and class

names(attributes(df))

[1] "names"     "row.names" "class"

class(df)

[1] "data.frame"

str which lists all data elements in an object and their type
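
A few more quick checks are worth knowing. This sketch runs them on the built-in iris data, so it works without the tutorial files:

```r
head(iris, 3)                       # first three rows
col_classes <- sapply(iris, class)  # class of every column
col_classes
colSums(is.na(iris))                # missing values per column
nrow(iris)                          # the two pieces of dim()
ncol(iris)
```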

Overview

In this lesson we hope to learn:

Aggregating data
Organizing our data
Manipulating vectors
Dealing with missing data

Again, read in our dataset

# Set working directory to the tutorial directory. In RStudio you can do
# this in the 'Tools' tab
setwd("~/GitHub/r_tutorial_ed")
# Load some data
load("data/smalldata.rda")
# Note: load() restores objects under the names they were saved with,
# so there is nothing to assign here

Aggregation

Sometimes we need to do some basic checking for the number of observations or types of observations in our dataset
To do this quickly and easily - the table function is our friend
Let's look at our observations by year and grade

table(df$grade, df$year)

   
    2000 2001 2002
  3  200  100  200
  4  100  200  100
  5  200  100  200
  6  100  200  100
  7  200  100  200
  8  100  200  100

The first argument gives the rows, the second gives the columns
Ugly, but effective

Aggregation can be more complex

Let's aggregate by race and year

table(df$year, df$race)

      
         A   B   H   I   W
  2000  16 370  93   7 414
  2001  16 370  93   7 414
  2002  16 370  93   7 414

Race is consistent across years, interesting
What if we want to look only at 3rd graders in each year?

More complicated still

with(df[df$grade == 3, ], {
    table(year, race)
})

      race
year     A   B   H   I   W
  2000   4  78  22   4  92
  2001   1  44   8   2  45
  2002   0  74  20   1 105

with specifies a data object to work on, in this case all elements of df where grade==3
table is the same command as above, but since we specified the data object in the with statement, we don't need the df$ in front of the variables of interest

df2 <- subset(df, grade == 3)
table(df2$year, df2$race)

      
         A   B   H   I   W
  2000   4  78  22   4  92
  2001   1  44   8   2  45
  2002   0  74  20   1 105

rm(df2)

Quick question: how can we understand the three types of brackets used in this code: () [] and {}

Tables cont.

This is really powerful for looking at the descriptive dimensions of the data, we can ask questions like:
How many students are at each proficiency level each year?

table(df$year, df$proflvl)

      
       advanced basic below basic proficient
  2000       56   313         143        388
  2001      229   183          64        424
  2002      503    27           3        367

How many students are at each proficiency level by race?

table(df$race, df$proflvl)

   
    advanced basic below basic proficient
  A       19     7           3         19
  B      160   302         162        486
  H       54    76          33        116
  I        7     4           1          9
  W      548   134          11        549
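
Raw counts are hard to compare when the groups differ in size, so shares within each group are often more useful. A minimal sketch using prop.table on made-up data (the toy data frame is an assumption for illustration only):

```r
# Toy data (hypothetical values)
toy <- data.frame(
  race    = c("A", "A", "B", "B", "B", "W", "W", "W", "W", "W"),
  proflvl = c("advanced", "basic", "basic", "basic", "proficient",
              "advanced", "advanced", "proficient", "proficient", "basic")
)

counts <- table(toy$race, toy$proflvl)

# margin = 1 divides each cell by its row total, so every row sums to 1
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

Using margin = 2 instead would give shares within each column.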

Checking Understanding

We have seen how to chain functions together
We have also seen how to examine a dataframe by looking at the observations in it
We are now going to move on to aggregating data so we can look at unique cases when we have more than one observation for each unit

Aggregating Data

One of the most common tasks is computing aggregates of data
R has an aggregate function that can be used and helps us avoid the clustering problems above
This works great for simple aggregation like scale score by race, we just need a formula (think I want variable X by grouping factor Y) and the statistic we want to compute

# Reading Scores by Race
aggregate(readSS ~ race, FUN = mean, data = df)

  race readSS
1    A  508.7
2    B  460.2
3    H  473.2
4    I  485.2
5    W  533.2
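
The formula can take more than one grouping factor on the right-hand side. A small sketch on made-up data (toy and its scores are hypothetical):

```r
# Toy data (hypothetical values)
toy <- data.frame(
  race   = c("A", "A", "B", "B", "B", "B"),
  year   = c(2000, 2001, 2000, 2000, 2001, 2001),
  readSS = c(500, 520, 450, 470, 460, 480)
)

# Mean reading score for every race-by-year combination
by_race_year <- aggregate(readSS ~ race + year, data = toy, FUN = mean)
print(by_race_year)

# The result comes back as a data frame with one row per group
print(class(by_race_year))
```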

Aggregate Isn't Enough

aggregate is cool, but it isn't very flexible
aggregate applies a single function per call, so computing several different statistics on several variables at once quickly gets awkward
There is a better way; the plyr package
plyr is a set of routines/logical structure for transforming, summarizing, reshaping, and reorganizing data objects of one type in R into another type (or the same type)
We will focus here on summarizing and aggregating a data frame, but later in the bootcamp we'll apply functions to lists and turn lists into data frames as well
This is cool!

The Logic of plyr

In R this is known as "split, apply, and combine"
Why? First, we split the data into groups by some factor or logical operator
Then we apply some function or another to that group (i.e. count the unique values of a variable, take the mean of a variable, etc.)
Then we combine the data back together
This has some advantages - unlike other methods, the data does not have to be ordered by our ID variable for this to work
The disadvantage is that this method is computationally expensive, even in R, and requires copying our data frame using up RAM
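
The three steps can be spelled out by hand in base R. This is only a sketch of the idea on made-up data, not how plyr is implemented internally; toy and the column names are assumptions.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  dist   = c(1, 1, 1, 2, 2),
  readSS = c(400, 450, 500, 520, 480)
)

# Split: one small data frame per district
pieces <- split(toy, toy$dist)

# Apply: summarize each piece on its own
summaries <- lapply(pieces, function(piece) {
  data.frame(dist      = piece$dist[1],
             mean_read = mean(piece$readSS),
             n         = nrow(piece))
})

# Combine: stack the per-group results back into one data frame
result <- do.call(rbind, summaries)
print(result)
```

Note that the rows of toy did not need to be sorted by dist for this to work.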

An Aside about Split-Apply-Combine

The plyr package has a number of utilities to help us split-apply-combine across data types for both input and output
In R we could use for loops to iterate over groups of students, but loops written this way tend to be slow and verbose
plyr to the rescue, while not as fast as a compiled language, it is pretty dang good!
And still readable

The logic of plyr

This shows how the dataframe is broken up into pieces and each piece then gets whatever functions, summaries, or transformations we apply to it

How plyr works on dataframes

And this shows the output ddply has before it combines it back for us when we do the call ddply(df,.(sex,age),"nrow")

Using plyr

plyr has a straightforward syntax
All plyr functions are named in the format XXply. The first X specifies the type of the input object we are applying a function to, and the second X specifies the type of output we want back.
In plyr d = dataframe, l = list, and a = array (which includes matrices). By far the most common usage is ddply
From a dataframe, to a dataframe.
We will see more of plyr in Tutorial 4 as well

plyr in Action

library(plyr)
myag <- ddply(df, .(dist, grade), summarize,
              mean_read = mean(readSS, na.rm = TRUE),
              mean_math = mean(mathSS, na.rm = TRUE),
              sd_read = sd(readSS, na.rm = TRUE),
              sd_math = sd(mathSS, na.rm = TRUE),
              count_read = length(readSS),
              count_math = length(mathSS))

This looks complex, but it only has a few components.
The first argument is the dataframe we are working on, the next argument is the level of identification we want to aggregate to
summarize tells ddply what we are doing to the data frame
Then we make a list of new variable names, and how to calculate them on each of the subsets in our large data frame
That's it!

Results

head(myag)

  dist grade mean_read mean_math sd_read sd_math count_read
1  205     3     451.7     406.1   93.52   72.45        200
2  205     4     438.9     459.9   77.76   79.10        100
3  205     5     487.9     462.6   85.30   75.10        200
4  205     6     514.7     526.8   76.83   66.04        100
5  205     7     530.0     521.5   84.82   74.85        200
6  205     8     575.5     581.2   79.58   83.45        100
  count_math
1        200
2        100
3        200
4        100
5        200
6        100

Sorting

A key way to explore data in tabular form is to sort data
Sorting data in R can be dangerous as you can reorder the vectors of a dataframe
We use the order function to sort data

df.badsort <- order(df$readSS, df$mathSS)
head(df.badsort)

[1]  106 1026    2   56  122  118

Why is this wrong? What is R giving us?
Row indices, not the sorted data...

Correct Example

To fix it, we need to tell R to reorder the dataframe by passing the row indices from order, in the order we want, inside the brackets

df.sort <- df[order(df$readSS, df$mathSS, df$attday), ]
head(df[, c(3, 23, 29, 30)])

   stuid attday readSS mathSS
1 149995    180  357.3  387.3
2  13495    180  263.9  302.6
3 106495    160  369.7  365.5
4  45205    168  346.6  344.5
5 142705    156  373.1  441.2
6  14995    157  436.8  463.4

head(df.sort[, c(3, 23, 29, 30)])

      stuid attday readSS mathSS
106  106705    160  251.5  277.0
1026  80995    176  263.2  377.8
2     13495    180  263.9  302.6
56   122402    180  264.3  271.7
122   79705    168  266.4  318.7
118   40495    173  266.9  275.0

Let's clean it up a bit more

head(df[with(df, order(-readSS, -attday)), c(3, 23, 29, 30)])

      stuid attday readSS mathSS
1631 145205    137  833.2  828.4
1462 107705    180  773.3  746.6
2252 122902    180  744.0  621.6
2341  44902    175  741.7  676.3
1482 134705    180  739.2  705.4
1630  14495    162  738.9  758.2

Here we find the highest-performing students. Note that the - denotes that we want descending order
R's default is ascending order, which is easy to reverse this way
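
The whole sorting pattern fits in a few lines on a made-up data frame (toy and its values are hypothetical). The point to notice is that order returns row positions, which only become sorted data once they are used inside the brackets.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  stuid  = c(11, 12, 13, 14),
  readSS = c(300, 250, 300, 410),
  attday = c(170, 180, 160, 175)
)

# order() returns row positions, not the sorted data itself
idx <- order(toy$readSS, toy$attday)
print(idx)

# Using those positions inside [ , ] gives the sorted data frame
toy_sorted <- toy[idx, ]

# A minus sign on a numeric column reverses it to descending order
toy_desc <- toy[order(-toy$readSS, -toy$attday), ]
print(toy_desc)
```

Ties on readSS are broken by the second column, attday.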

About sorting

Sorting works differently on some data types like matrices

M <- matrix(c(1, 2, 2, 2, 3, 6, 4, 5), 4, 2, byrow = FALSE, dimnames = list(NULL, 
    c("a", "b")))
M[order(M[, "a"], -M[, "b"]), ]

     a b
[1,] 1 3
[2,] 2 6
[3,] 2 5
[4,] 2 4

About Sorting

Tables are familiar

mytab <- table(df$grade, df$year)
mytab[order(mytab[, 1]), ]

   
    2000 2001 2002
  4  100  200  100
  6  100  200  100
  8  100  200  100
  3  200  100  200
  5  200  100  200
  7  200  100  200

mytab[order(mytab[, 2]), ]

   
    2000 2001 2002
  3  200  100  200
  5  200  100  200
  7  200  100  200
  4  100  200  100
  6  100  200  100
  8  100  200  100

Filtering Data

Filtering data is an incredibly powerful feature and we have already seen it used to do some interesting things
Filtering data in R is loaded with trouble though, because the filtering arguments must be very carefully specified
Filtering is like a mini-sort, and we've done it already
Always, always, always check your work
And remember, this is the place the NAs do the most damage
Let's look at some examples

Basic Filtering a Column

# Gives all rows that meet this requirement
df[df$readSS > 800, ]

           X school  stuid grade schid dist white black hisp indian
1631 1281061    852 145205     8    15  205     1     0    0      0
     asian econ female ell disab sch_fay dist_fay luck ability
1631     0    0      1   0     0       0        0    0   108.3
     measerr teachq year attday schoolscore district schoolhigh
1631   6.325  155.7 2001    137       227.7       19          0
     schoolavg schoollow readSS mathSS  proflvl race
1631         1         0  833.2  828.4 advanced    W

# Gives all values of grade that meet this requirement
df$grade[df$mathSS > 800]

[1] 8

Before the brackets we specify what we want returned, and within the brackets we present the logical expression to evaluate
Behind the scenes R does a logical test and gets the row numbers that match the logical expression
It then combines them back with the object in front of the brackets to return the values
This seems basic enough, let's filter on multiple dimensions
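
Before adding more conditions, one caveat about the logical test described above: when the filtered column contains NA, the test itself is NA for those rows, and the brackets return all-NA rows for them. A small sketch on made-up data (toy is hypothetical) of the problem and two common ways around it:

```r
# Toy data with missing scores (hypothetical values)
toy <- data.frame(stuid = 1:5, readSS = c(810, NA, 640, 905, NA))

# The test is NA where readSS is NA, so two all-NA rows come back too
with_na <- toy[toy$readSS > 800, ]
print(nrow(with_na))  # 4 rows, only 2 of them real

# which() keeps only the positions where the test is TRUE
no_na <- toy[which(toy$readSS > 800), ]

# Or rule out missing values explicitly in the test
no_na2 <- toy[!is.na(toy$readSS) & toy$readSS > 800, ]
print(no_na)
```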

Multiple filters

df$grade[df$black == 1 & df$readSS > 650]

 [1] 8 7 8 6 6 7 8 7 8 8 8 4

The & operator tells R we want rows where both of these are true
How would we tell R we wanted rows where either were true?
What happens if we type df$black=1 or black==1?
Why won't this work?
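
As a hedged illustration of the operators involved, here is a sketch on made-up data (toy is hypothetical). It uses | for "either" and == for comparison; a single = is assignment, not a test.

```r
# Toy data (hypothetical values)
toy <- data.frame(grade  = c(3, 4, 5, 6),
                  black  = c(1, 0, 1, 0),
                  readSS = c(700, 660, 500, 400))

# & keeps rows where both conditions are TRUE
both <- toy$grade[toy$black == 1 & toy$readSS > 650]

# | keeps rows where at least one condition is TRUE
either <- toy$grade[toy$black == 1 | toy$readSS > 650]

print(both)
print(either)
```

Note that the column has to be written as toy$black inside the brackets; a bare black is not found unless we are inside with() or subset().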

Using filters to assign values

We can also use filters to assign values as well
This is how you recode variables and create new ones
Let's create a variable spread indicating whether a district has high or low spread among its student scores

myag$spread <- NA  # create variable
myag$spread[myag$sd_read < 75] <- "low"
myag$spread[myag$sd_read > 75] <- "high"
myag$spread <- as.factor(myag$spread)
summary(myag$spread)

high  low 
  15    3

How did we define spread in this block of code?
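
One thing to watch for in this recoding pattern is the boundary: a value that satisfies neither filter keeps the initial NA. A sketch on a made-up vector (the values are hypothetical), with ifelse as one possible alternative that assigns every non-missing value:

```r
# Toy standard deviations, including one exactly at the cutoff
sd_read <- c(60, 75, 90)

# Same pattern as above: start with NA, then fill in by filter
spread <- rep(NA_character_, length(sd_read))
spread[sd_read < 75] <- "low"
spread[sd_read > 75] <- "high"
print(spread)  # the value equal to 75 is still NA

# ifelse() puts every non-missing value in one of the two groups
spread2 <- ifelse(sd_read < 75, "low", "high")
print(spread2)
```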

Merging Data

It is unlikely all the data we will want resides in a single dataset and often we have to combine data from several sources
R makes this easy, but that simplicity comes at a cost - it can be easy to make mistakes if you don't specify things carefully
Let's merge attributes about a student's school with the student row data
We might want to do that if we want to evaluate the performance of students in different school climates, and school climate was measured in part by the mean performance

Merging Data II

We have two data objects: df, which has multiple rows per student, and myag, which has one row per district and grade
What are the variables that link these two together?

names(myag)

[1] "dist"       "grade"      "mean_read"  "mean_math"  "sd_read"   
[6] "sd_math"    "count_read" "count_math" "spread"

names(df[, c(2, 3, 4, 6)])

[1] "school" "stuid"  "grade"  "dist"

It looks like dist and grade are in common. Is this ok?
Why might we want to consider re-aggregating with year as well?
For this example we won't just yet

Merge Options

merge has a few options we want to consider; see ?merge
In the simple case we let merge automagically combine the data

simple_merge <- merge(df, myag)
names(simple_merge)

 [1] "grade"       "dist"        "X"           "school"     
 [5] "stuid"       "schid"       "white"       "black"      
 [9] "hisp"        "indian"      "asian"       "econ"       
[13] "female"      "ell"         "disab"       "sch_fay"    
[17] "dist_fay"    "luck"        "ability"     "measerr"    
[21] "teachq"      "year"        "attday"      "schoolscore"
[25] "district"    "schoolhigh"  "schoolavg"   "schoollow"  
[29] "readSS"      "mathSS"      "proflvl"     "race"       
[33] "mean_read"   "mean_math"   "sd_read"     "sd_math"    
[37] "count_read"  "count_math"  "spread"

It looks like it did a good job
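
Checking the names is a start, but two quick checks on the rows are worth adding. This is a sketch on made-up data (students and dist_means are hypothetical stand-ins for df and myag): confirm the key columns are unique in the aggregate table, then confirm the merge did not add or drop student rows.

```r
# Toy stand-ins for the student data and the aggregate data
students <- data.frame(stuid = 1:4,
                       dist  = c(10, 10, 20, 20),
                       grade = c(3, 4, 3, 4))
dist_means <- data.frame(dist      = c(10, 10, 20, 20),
                         grade     = c(3, 4, 3, 4),
                         mean_read = c(450, 470, 500, 520))

# 0 means no duplicated dist/grade combinations in the lookup table
dup_check <- anyDuplicated(dist_means[, c("dist", "grade")])

# merge() joins on the shared column names (dist and grade here)
merged <- merge(students, dist_means)

# Row count should match the student data
print(nrow(merged) == nrow(students))
```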

Merge Options

In complicated cases, merge has some important options we should review
First is the simple sounding 'by' argument:
simple_merge <- merge(df1, df2, by = c("id1", "id2"))
We can also specify simple_merge <- merge(df1, df2, by.x = c("id1", "id2"), by.y = c("id1_a", "id2_a"))
This allows us to have different names for our ID variables
Now, what if we have two different sized objects and not all matches between them?
notsosimple_merge <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
We can tell R whether we want to keep all of the x observations (df1), all the y observations (df2) or neither, or both
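
A small sketch of these options on made-up data (left, right, and their column names are hypothetical). The two tables use different names for the ID and only partly overlap.

```r
# Toy tables with differently named ID columns and partial overlap
left  <- data.frame(id1 = c(1, 2, 3), score = c(10, 20, 30))
right <- data.frame(id1_a = c(2, 3, 4), school = c("a", "b", "c"))

# Default: keep only IDs found in both tables
inner <- merge(left, right, by.x = "id1", by.y = "id1_a")

# all.x = TRUE: keep every row of the first table, filling gaps with NA
left_all <- merge(left, right, by.x = "id1", by.y = "id1_a", all.x = TRUE)

# all = TRUE: keep every row of both tables
full <- merge(left, right, by.x = "id1", by.y = "id1_a", all = TRUE)

print(full)
```

The merged ID column takes its name from the first table.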

Reshaping Data

Reshaping data is a slightly different issue than aggregating data
Let's review the two data layouts, long and wide, starting with long:

head(df[, 1:10], 3)

    X school  stuid grade schid dist white black hisp indian
1  44      1 149995     3   105  495     0     1    0      0
2  53      1  13495     3    45  495     0     1    0      0
3 116      1 106495     3    45  495     0     1    0      0

Now let's look at wide:

head(widedf[, c(1, 28:40)], 3)

   stuid readSS.2000 mathSS.2000 proflvl.2000 race.2000  X.2001
1 149995       357.3       387.3        basic         B  441000
2  13495       263.9       302.6  below basic         B  531000
3 106495       369.7       365.5        basic         B 1161000
  school.2001 grade.2001 schid.2001 dist.2001 white.2001 black.2001
1           1          4        105       495          0          1
2           1          4         45       495          0          1
3           1          4         45       495          0          1
  hisp.2001 indian.2001
1         0           0
2         0           0
3         0           0

How did we reshape this data?

Wide Data v. Long Data

The great debate
Most econometrics, panel, and time series datasets come wide and so these seem familiar
R for most cases prefers long data, including for most graphing and analysis functions
So we have to learn both

The reshape Function

reshape is the way to move between long and wide; here we go from long to wide
The data stays the same, but the shape of it changes
The long data has dimensions: 2700, 32
The wide data has dimensions: 1200, 91
How do we get to these numbers?
The rows in the wide dataframe represent unique students

Deconstructing reshape

widedf <- reshape(df, timevar = "year", idvar = "stuid", direction = "wide")

idvar represents the unit we want to represent a single row, in this case each unique student gets a single row
In this simple case timevar is the variable that differentiates between two rows with the same student ID
Note that timevar needn't always represent time!
direction tells R we are going to move to wide data
As written all data will move, but using the varying argument we can tell R explicitly which items we want to move wide
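
The same call is easier to follow on a tiny made-up example (long and its values are hypothetical): two students, two years, one score.

```r
# Toy long data: one row per student per year (hypothetical values)
long <- data.frame(
  stuid  = c(1, 1, 2, 2),
  year   = c(2000, 2001, 2000, 2001),
  readSS = c(400, 420, 510, 530)
)

# Each student becomes one row; each year becomes its own readSS column
wide <- reshape(long, timevar = "year", idvar = "stuid", direction = "wide")
print(wide)
print(dim(wide))
```

The new column names are built from the variable name and the timevar value, which is where names like readSS.2000 come from.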

What about Wide to Long?

We often need to do this to plot data in R
Luckily the reshape function works well in both directions

longdf <- reshape(widedf, idvar = "stuid", timevar = "year", varying = names(widedf[, 
    2:91]), direction = "long", sep = ".")

If our data is formatted nicely, R can do the guessing and identify the years for us by parsing the dataframe names
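
Here is a sketch of that name parsing on a made-up wide data frame (wide and its values are hypothetical). With sep = ".", reshape splits each column name at the dot into a variable name and a time value.

```r
# Toy wide data: one row per student (hypothetical values)
wide <- data.frame(
  stuid       = c(1, 2),
  readSS.2000 = c(400, 510),
  readSS.2001 = c(420, 530)
)

# "readSS.2000" is split into variable "readSS" and year 2000
long <- reshape(wide, idvar = "stuid", timevar = "year",
                varying = c("readSS.2000", "readSS.2001"),
                direction = "long", sep = ".")

# Sort by student and year to make the result easier to read
long <- long[order(long$stuid, long$year), ]
print(long)
```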

Subsetting Data

We have already seen a lot of subsetting examples above, which is what filtering is, but R provides some great shortcuts to this
Let's look at the subset function to get only 4th grade scores

g4 <- subset(df, grade == 4)
dim(g4)

[1] 400  32

This is equivalent to:

g4_b <- df[df$grade == 4, ]

These two elements are the same:

identical(g4, g4_b)

[1] TRUE
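
The two forms agree here because grade has no missing values. As a hedged aside on made-up data (toy is hypothetical), they can differ when the filtered column contains NA: subset drops rows where the test is NA, while the bracket form returns an all-NA row for each of them.

```r
# Toy data with one missing grade (hypothetical values)
toy <- data.frame(stuid = 1:4, grade = c(4, NA, 4, 5))

via_subset  <- subset(toy, grade == 4)
via_bracket <- toy[toy$grade == 4, ]

print(nrow(via_subset))   # 2
print(nrow(via_bracket))  # 3, including one all-NA row
print(identical(via_subset, via_bracket))
```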

Other References for the Previous Section

Overview

In this lesson we hope to learn:

How to use summary statistics to look at data
How to run basic statistical tests on a dataset
How to use formulas to build a statistical model
Analyze subsets of data

Datasets

In this tutorial we will use a number of datasets of different types:

stulong: student-level assessment and demographics data (simulated and research ready)
midwest_schools.csv: aggregate school level test score averages from a large Midwest state

Reading Data In

We start with the aggregate school level data

load("data/midwest_schools.rda")
head(midsch[, 1:12])

  district_id school_id subject grade n1   ss1 n2   ss2 predicted
1          14       130    math     4 44 433.1 40 463.0     468.7
2          70        20    math     4 18 443.0 20 477.2     476.5
3         112        80    math     4 86 445.4 94 472.6     478.4
4         119        50    math     4 95 427.1 94 460.7     464.1
5         147        60    math     4 27 424.2 27 458.7     461.8
6         147       125    math     4 17 423.5 26 463.1     461.2
  residuals  resid_z  resid_t
1   -5.7446 -0.59190 -0.59171
2    0.7235  0.07456  0.07452
3   -5.7509 -0.59267 -0.59248
4   -3.3586 -0.34606 -0.34591
5   -3.0937 -0.31877 -0.31863
6    1.8530  0.19094  0.19085

What do we have then?

We have unique identifiers for districts and schools
For each district/school combination we have rows of test scores in year 1 and year 2, broken out by test_year (the year of year 1), grade, and subject
How can we use R to ask this?

table(midsch$test_year, midsch$grade)

      
          4    5    6    7    8
  2007 1150 1094  472  638  734
  2008 1204 1146  462  588  692
  2009 1173 1092  434  592  668
  2010 1120 1090  428  610  686
  2011 1126 1060  420  618  688

length(unique(midsch$district_id))

[1] 357

length(unique(midsch$school_id))

[1] 247

What's wrong with this?
More districts than schools? The IDs must be goofed
We need to create a unique school ID
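
One way to build such an ID is to paste the district and school IDs together, assuming school IDs are only unique within a district. This is a sketch on made-up data; toy and the name unique_school are hypothetical.

```r
# Toy data: school_id 20 appears in two different districts
toy <- data.frame(district_id = c(14, 14, 70, 70),
                  school_id   = c(20, 130, 20, 20))

# Counting school_id alone undercounts the schools
n_raw <- length(unique(toy$school_id))

# Combine district and school into a single identifier
toy$unique_school <- paste(toy$district_id, toy$school_id, sep = "_")
n_unique <- length(unique(toy$unique_school))

print(c(n_raw, n_unique))
```

The separator matters: without it, district 1 with school 23 and district 12 with school 3 would both become "123".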

Explore Data Structure (II)

table(midsch$subject, midsch$grade)

      
          4    5    6    7    8
  math 2886 2741 1108 1523 1734
  read 2887 2741 1108 1523 1734

Why don't we want to do table(midsch$district_id,midsch$grade)?
What else do we want to know?

Diagnostic Plots Perhaps

library(ggplot2)
qplot(ss1, ss2, data = midsch, alpha = I(0.07)) + theme_dpi() + geom_smooth() + 
    geom_smooth(method = "lm", se = FALSE, color = "purple")

Frequencies, Crosstabs, and t-tests

Some of the most basic analyses we can implement in R are sometimes the most useful
This is really useful in an education context or for evaluating experiments quickly when we are interested in whether the difference we observe in groups is real, or due to chance

Let's take a simple example of cars

Sometimes we want to compare groups of data to other groups or a fixed value
We use a t-test for this, but only if we believe the data are normally distributed

data(mtcars)  # load the data from R
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

T-test

We can t-test the mpg variable then
Let's test it against an assumption about the population using a one-sided test

mean(mtcars$mpg)

[1] 20.09

t.test(mtcars$mpg, mu = 18, alternative = "greater")


    One Sample t-test

data:  mtcars$mpg 
t = 1.962, df = 31, p-value = 0.02938
alternative hypothesis: true mean is greater than 18 
95 percent confidence interval:
 18.28   Inf 
sample estimates:
mean of x 
    20.09

t.test(mtcars$mpg, mu = 22, alternative = "less")


    One Sample t-test

data:  mtcars$mpg 
t = -1.792, df = 31, p-value = 0.04144
alternative hypothesis: true mean is less than 22 
95 percent confidence interval:
 -Inf 21.9 
sample estimates:
mean of x 
    20.09

What does t.test(mtcars$mpg,mu=18) test?
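
The examples above compare one group to a fixed value. For the other case mentioned earlier, comparing one group to another, t.test also accepts a formula. A minimal sketch using the same built-in mtcars data; splitting by transmission type (am) is just my choice of illustration.

```r
data(mtcars)

# Without an alternative argument the test is two-sided
two_sided <- t.test(mtcars$mpg, mu = 18)
print(two_sided$p.value)

# Formula interface: compare mpg between the two transmission groups
by_am <- t.test(mpg ~ am, data = mtcars)
print(by_am)
```

By default the two-group version does not assume equal variances in the two groups (the Welch test).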

Other References for the Previous Section

Overview

In this lesson we hope to learn:

What is data visualization and why does it matter?
How to draw diagnostic plots in base graphics
Colors
ggplot2
Basic geoms
Layering and faceting plots
Putting it together

Basic Plot

Don't use R base graphics
ggplot2 is pretty much the new standard in R
Create beautiful, clear plots with a nice consistent set of language conventions

library(ggplot2)
qplot(readSS, mathSS, data = df)

Understanding Grammar of Graphics through A Scatterplot

qplot(readSS, mathSS, data = df, alpha = I(0.3)) + theme_dpi()

Geoms

Geoms are the way data is represented; you can think of a geom like a chart type in other software
We have seen a number of examples, and geoms can be combined in unique ways to convey more data

Aesthetics

A geom alone only lets us display a couple of data elements at once; to show more we need to map data to other visual properties
This is what aesthetics are
aesthetics map data to visual cues such as the color, glyph (shape), and size of graph objects
ggplot2 has an extended syntax that makes this obvious

ggplot(df, aes(x = readSS, y = mathSS)) + geom_point()

# Identical to: qplot(readSS,mathSS,data=df)

aes says we are specifying aesthetics, here we specified x and y to make a two dimensional graphic

Examples of Aesthetics

data(mpg)
qplot(displ, cty, data = mpg) + theme_dpi()

qplot(displ, cty, data = mpg, size = cyl) + theme_dpi()

qplot(displ, cty, data = mpg, shape = drv, size = I(3)) + theme_dpi()

qplot(displ, cty, data = mpg, color = class) + theme_dpi()

Thinking about Aesthetics

One concern is discrete v. continuous variables

Aesthetic	Discrete	Continuous
Color	Disparate colors	Sequential or divergent colors
Size	Unique size for each value	mapping to radius of value
Shape	A shape for each value	does not make sense

Another is ordered v. unordered

Aesthetic	Ordered	Unordered
Color	Sequential or divergent colors	Rainbow
Size	Increasing or decreasing radius	does not make sense
Shape	does not make sense	A shape for each value

Layers

Exactly what they sound like, each plot is a simple series of layers
One way to do layers is to break plots up into small multiples (see Tufte)

qplot(readSS, mathSS, data = df) + facet_wrap(~grade) + theme_dpi(base_size = 12) + 
    geom_smooth(method = "lm", se = FALSE, size = I(1.2))

We can also facet across more attributes

qplot(readSS, mathSS, data = df) + facet_grid(ell ~ grade) + theme_dpi(base_size = 12) + 
    geom_smooth(method = "lm", se = FALSE, size = I(1.2))

Visualizing Categorical Data

Visualization is not just limited to continuous data using scatterplots
Sometimes we want to look at the density of data in different categories
Sometimes we want to look at the size of groups compared to an expected group size
What are some other examples?

Structural Plots

library(vcd)
df$proflvl <- factor(df$proflvl, levels = c("advanced", "proficient", "basic", 
    "below basic"))
a <- structable(proflvl ~ race, data = df)
mosaic(a, shade = TRUE)

Another example

library(vcd)
df$proflvl <- factor(df$proflvl, levels = c("advanced", "proficient", "basic", 
    "below basic"))
a <- structable(female ~ race, data = df)
mosaic(a, shade = TRUE)

What are the basic plot types?

What are some advanced plot types?

Scary R Code

library(grid)
p1<-qplot(readSS,..density..,data=df,fill=race,
      position='fill',geom='density')+scale_fill_brewer(
        type='qual',palette=2)

p2<-qplot(readSS,..fill..,data=df,fill=race,
      position='fill',geom='density')+scale_fill_brewer(
        type='qual',palette=2)+ylim(c(0,1))+theme_bw()+
          theme(legend.position='none',
               axis.text.x=element_blank(),
               axis.text.y=element_blank(),
               axis.ticks=element_blank(),
               panel.spacing=unit(0,"lines"))+ylab('')+
                 xlab('')

vp<-viewport(x=unit(.65,"npc"),y=unit(.73,"npc"),
             width=unit(.2,"npc"),height=unit(.2,"npc"))
print(p1)
print(p2,vp=vp)

References for the Previous Section

Overview

In this lesson we hope to learn:

Quick and basic export of results
Writing a basic report
Exporting graphics for use in other documents
Reproducible research

Exporting data

Of course, other times we need to export raw data or results of our analyses
R is very flexible in these cases, and depending on what you want to do, you are probably best served by searching online for your specific case
However, a few functions are indispensable for such work
These are the foreign package, save, write.csv, and write.dta
Note that R can also write SPSS files, SAS files, and files for just about any statistics program (sometimes even Excel, but CSV is preferred for this purpose)
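
A minimal round trip with the two base R functions from that list, write.csv and save. The results data frame and its values are made up, and the files go to temporary paths so nothing is left behind; write.dta lives in the foreign package and is not shown here.

```r
# Toy results to export (hypothetical values)
results <- data.frame(race = c("A", "B"), mean_read = c(500.5, 460.25))

# Temporary file paths, used only for this sketch
csv_path <- tempfile(fileext = ".csv")
rda_path <- tempfile(fileext = ".rda")

# CSV: readable by almost any other program
write.csv(results, csv_path, row.names = FALSE)

# save(): stores the R object itself, restored later with load()
save(results, file = rda_path)

# Read both back to confirm the round trip
from_csv <- read.csv(csv_path)
rm(results)
load(rda_path)  # recreates the object named results
print(from_csv)
```

Setting row.names = FALSE keeps R's row numbers from being written as an extra unnamed column.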
