Statistical Learning and Applied Modeling in Education

Examples and Concerns

Jared Knowles

01-31-2014

Motivation

Of the two modeling cultures, we've tended to focus overwhelmingly on one
Increases in computing power are changing everything
Data is growing, and larger data brings different kinds of problems
Prediction is underused and undervalued, and this undermines inference

The Data Modeling Culture

Starts philosophically with the idea that we have written down a set of X that describe Y with a known functional form that we are testing
The black box between X and Y can be known because the data generating process (DGP) is some functional combination of predictors, parameters, and noise
Model fit is based on goodness of fit and residual tests

The Algorithmic Modeling Culture

Black box is unknowable - we are not modeling nature but seeking to use similar inputs to predict the outputs of the natural process
Model fit measured by prediction accuracy

The Wisconsin Dropout Early Warning System

“To help keep all kids on a path to graduation, we just delivered - with no new funding - a new statewide Dropout Early Warning System, called DEWS, to all districts. DEWS makes it possible to identify kids who may be at risk, and allows districts to intervene as early as middle school.” ~ Tony Evers

DEWS

The Dropout Early Warning System for the Wisconsin Department of Public Instruction
Leverages DPI's administrative records to provide predictions on student high school completion while students are in middle grades
Communicates the results to school staff in all Wisconsin public schools serving students in the middle grades
Comes with an interpretive guide and strategies for success available online
Released in September of 2013, with bi-annual updates in April and August
Serves as a good example of where social science and applied modeling intersect

Why DEWS?

Every child a graduate, college and career ready – Agenda 2017, DPI's strategic plan
DEWS focuses on providing schools and districts an early notice of whether or not a student is likely to complete high school on time
DEWS uses data on historical cohorts of students in Wisconsin to link middle grade student outcome data with the long-term outcome of on time graduation
DEWS provides a relatively accurate assessment of the likelihood of on time graduation for individual students across the state

Graduation and Dropout By the Numbers

9,092 students in 2010-11 did not graduate with their cohort.

Group	Expected	Grads	Rate	Gap vs. White (pct. points)
White	54,468	49,783	91.4%	-
American Indian	1,027	737	71.7%	19.7
Asian	2,517	2,225	88.4%	3
Black	6,889	4,395	63.8%	27.6
Hispanic	4,751	3,420	72.0%	19.4
Total	69,652	60,560	86.9%	-

What is DEWS?

DEWS is an applied statistical model that combines several major features:

Data import, filtering, and cleaning for analysis from the state longitudinal database
A machine learning algorithm to search for the best predictive model
A prediction routine to apply models to current students
An exporting feature to push predictions into the state business intelligence tool, WISEdash for Districts
A display layer available to schools and districts securely for exploring the results
In reality, it resembles software as much as a statistical analysis

Under the Hood of DEWS

DEWS consists of several sub-routines that can be thought of as stages of building a statistical model:

Data acquisition
Data cleaning, normalizing, and standardizing
Model feature and model algorithm search
Model testing
Model selection
New case scoring
Prediction export for reporting

All modules are built in the free and open source statistical computing language, R.

DEWS by the Numbers

Analyzes over 250,000 historical records of student graduation

Provides predictions on over 180,000 students in the state

Produces predictions on students in over 1,000 schools

Selects from over 50 candidate statistical models per grade

Hundreds of users have accessed thousands of individual student reports across nearly every Wisconsin school district

Working on open sourcing the code

Being explored in Michigan, New Jersey, and school districts in Kansas, Montana, and Minnesota

DEWS as an Applied Model

Data and Computing Trends

Available data in education is growing astronomically.
People are talking about things like “data science” and “big data”, even at NSF.
Data sources are shifting from national surveys to administrative records
More data = more problems; more data + more sources = more problems\( ^{2} \)

Increased Computational Power

Increased data size and complexity leads to new problems that increased computational power often helps to solve.

Examples of Challenges and Solutions Posed by Computation

Bigger datasets have highly complex structures to them such as deep hierarchies, cross classifications, and high collinearity
- With 12 regions, 72 counties, 424 districts, 2,200 schools, and tens of thousands of classrooms and hundreds of thousands of students the modeling data structure is complex

Straining our Generalized Linear Models

An increased number of predictors allows us to build models of complex group interactions, but those models can suffer from separation

Parameter estimates of demographic indicators are invalid when the demographic indicator is not observed in each outcome category (the indicator then perfectly predicts the outcome, which is known as complete separation)

Quasi-complete separation can occur when this is nearly the case

Corrections exist to adjust for the fact that maximum likelihood estimates are invalid in this case (Bayesian estimates, Firth bias-correction)

Again, leveraging computation to address a problem of increased data complexity
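A minimal sketch of how one might screen a single indicator for separation before fitting a logistic model. The function name `separation_status`, the `min_cell` threshold of 5, and the toy data are all hypothetical and are not taken from DEWS. The check only flags the problem; it does not apply a Firth or Bayesian correction.

```python
from collections import Counter

def separation_status(indicator, outcome, min_cell=5):
    """Cross-tabulate an indicator against an outcome and flag empty cells.

    Returns "separated" if any indicator-by-outcome cell is empty,
    "sparse" if any cell has fewer than min_cell cases (a crude,
    hypothetical flag for near-separation), and "ok" otherwise.
    """
    cells = Counter(zip(indicator, outcome))
    levels = sorted(set(indicator))
    outcomes = sorted(set(outcome))
    # Counter returns 0 for combinations that never occur
    counts = [cells[(level, y)] for level in levels for y in outcomes]
    if min(counts) == 0:
        return "separated"
    if min(counts) < min_cell:
        return "sparse"
    return "ok"

# Toy data: 3 students in a rare group, 97 students outside it
in_rare_group = [1] * 3 + [0] * 97
none_graduate = [0] * 3 + [0] * 10 + [1] * 87   # no rare-group student graduates
one_graduates = [1, 0, 0] + [0] * 10 + [1] * 87  # one rare-group student graduates

print(separation_status(in_rare_group, none_graduate))
print(separation_status(in_rare_group, one_graduates))
```

A screen like this is cheap to run over every candidate predictor, which matters when rare groups appear across most indicators.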

DEWS

DEWS data has a complicated hierarchical structure
DEWS data has rare cases that have to be addressed (e.g. blind students) across most indicators
Using CPU-intensive techniques can work, but is not limitless – some models are too slow to be developed, modified, evaluated, and implemented
As it is, DEWS takes about 48 hours to build data and models, test them, select the winners, and produce predictions for current students
But in the future… who knows?

Being a Modeling Pluralist

Schools of statistical thoughts are sometimes jokingly likened to religions. This analogy is not perfect - unlike religions, statistical methods have no supernatural content and make essentially no demands on our personal lives. Looking at the comparison from the other direction, it is possible to be agnostic, atheistic, or simply live one's life without religion, but it is not really possible to do statistics without some philosophy. ~ Andrew Gelman

What is a statistical model?

“All models are wrong, some models are useful” ~ George Box
Statistical models are mathematical summaries of correlations and probabilities of known data
Being wrong is a feature of a statistical model; the goal is to explain as much data as possible with as few variables as possible
The most common model in the social sciences is the linear regression model
Sometimes the goal is inference and other times it is prediction

Statistical Modeling

It is useful to remember that in all statistical modeling we are looking at the following relationship:

\[ \hat{Y} = \hat{f}(X) \]

In this case \( \hat{f} \) represents our estimate of the function that links \( X \) and \( Y \). In traditional linear modeling, \( \hat{f} \) takes the form:

\[ \hat{Y} = \hat{\alpha} + \hat{\beta} X \]

However, there exist limitless alternative \( \hat{f} \) which we can explore. Applied modeling techniques help us expand the \( \hat{f} \) space we search within.
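To make the idea of different \( \hat{f} \) concrete, here is a small sketch that fits two forms to the same toy data: a least-squares line and a nearest-neighbour average. The function names `fit_linear` and `fit_knn` and the data are illustrative assumptions, not anything used in DEWS.

```python
def fit_linear(x, y):
    """Least-squares line: returns f_hat(x) = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return lambda new_x: alpha + beta * new_x

def fit_knn(x, y, k=3):
    """A more flexible f_hat: average y over the k nearest training x."""
    pairs = list(zip(x, y))
    def predict(new_x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - new_x))[:k]
        return sum(p[1] for p in nearest) / len(nearest)
    return predict

# Toy data with a curved relationship (y = x squared)
x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 36]

f_linear = fit_linear(x, y)
f_knn = fit_knn(x, y, k=1)

print(f_linear(3))  # the line misses the curvature
print(f_knn(3))     # 1-nearest neighbour reproduces the training value
```

Both functions answer the same question, "what is \( \hat{Y} \) for this \( X \)?", but they make very different assumptions about the shape of the relationship.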

Functional forms

Figure adapted from James et al. 2013 (figure 2.7)

Buyer Beware

A big computer, a complex algorithm and a long time does not equal science. ~ Robert Gentleman

Statistical Learning or Statistical Inference?

The line between statistical learning and statistical inference has always been blurry and unclear. A few questions can help:

Am I interested in accurately predicting unobserved cases based on what I have learned from my sample?

Am I interested in the relationships among the parameters in my sample because of a theory I am testing, or because of how they can explain an outcome I am interested in?

Is the data I am using common and relatively untransformed? Will new data be created regularly that I can fit the same model to and update?

Why the Difference?

Algorithmic Models:

Provide information to users about what to expect given certain data
Serve many goals including prediction of non-observed outcomes, summarizing large datasets, measuring uncertainty
Goals for the model are defined by explicit tradeoffs

Data Models:

Focused on understanding patterns in the current data
Seek to understand how current data extrapolates to a population
Estimate population parameters from sample data about relationships between inputs and outputs

Predicting Dropout

Algorithmic Models:

Data: Regularly collected at specific timepoints, standardized

Many cohorts with common data

Interested in learning which students today are likely to drop out in the future

Want: Confident predictions on likely graduation of new students, used to decide how to allocate resources and services to students

Data Models:

Data: national survey data, unlikely to be collected on future observations

One cohort is followed in the data set

Interested in learning if social and emotional concerns are more important than academic success in predicting graduation

Want: unbiased and precise estimates of parameters and if possible ability to make causal claims

On Prediction

Note that prediction is important in both cases
In data models, making a good prediction is the sign that our theory has explanatory power
In algorithmic models, making a good prediction is a sign that we have approximated the natural process correctly
In both cases, we should care deeply about prediction and think carefully about measuring it

On Nails, Hammers, and Models

The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed. ~ Leo Breiman

Some Vocabulary

Training data: data the model is fit to (the analytical sample)
Test data: data the model predicts, used to evaluate model fit
Bias (error): the amount of error due to simplifying a complex process
Variance (error): the amount \( \hat{f} \) would change if fit to a different training set of data

The Challenge

When using a statistical model to make predictions we have to think clearly about the data we use to build the model, and the data we will be making predictions about
We may build a model with high internal validity for the data at hand, but that data may not be representative of the data the model will apply to
We distinguish the training error (error on the data used to build the model) from the test error (error on the data the model will be applied to)
In inferential statistics we often seek to reduce training error and not concern ourselves with test error
In applied modeling we focus on finding the optimal tradeoff between variance and bias in order to reduce test error
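The gap between training error and test error can be seen in a small simulation. This is a sketch on synthetic data: the seed, the noise level, and the two toy models (`fit_mean` and `fit_1nn`) are assumptions chosen for illustration only.

```python
import random

def mse(model, xs, ys):
    """Mean squared error of a fitted model on a set of cases."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    """Inflexible model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_1nn(xs, ys):
    """Very flexible model: copy the outcome of the nearest training case."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Synthetic data: a linear signal plus noise
rng = random.Random(2014)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + rng.gauss(0, 3) for x in xs]

# Alternate cases between the training set and the test set
train_x, train_y = xs[0::2], ys[0::2]
test_x, test_y = xs[1::2], ys[1::2]

for name, fit in [("mean only", fit_mean), ("1-nearest neighbour", fit_1nn)]:
    model = fit(train_x, train_y)
    print(name,
          "training MSE:", round(mse(model, train_x, train_y), 2),
          "test MSE:", round(mse(model, test_x, test_y), 2))
```

The nearest-neighbour model has zero training error by construction, which says nothing about how it does on the held-out cases. That is why training error alone is a poor guide for choosing a predictive model.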

A Simple Motivating Example

[Figure: time series used in the motivating example]

Forecasting Apple Stock Could be Useful

Fit a model on the earlier part of the data (in blue)

[Figure: model fit to the earlier part of the data, shown in blue]

Forecasts Are Tricky

Fit another model on the middle part of the data (purple)

[Figure: model fit to the middle part of the data, shown in purple]

Evaluating Model Fit

How do we know how well our models fit? A very brief model comparison review:

\( R^2 \) - ratio of explained variation to total variation (generally)
Nested model tests:
- F test and Likelihood ratio tests (restricted and unrestricted model)
Same sample tests:
- AIC, BIC, etc. (different penalties for model parameters)
These don't give us a sense of how the model will do on new data, and they are not easy to explain!

Predicting New Data

Test both models on the full data!

[Figure: predictions from both models compared against the full data]

The Bias - Variance Tradeoff

The purple and blue models are identical except that each was “trained” on different data; the difference between their predictions is variance
Both have less bias on the data they were trained on, but the linear model has a different bias, which is a feature of the flexibility in the model
Less flexible models like linear models will have more bias, but are less variable in response to the data they are trained on
How do we pick the model? We think about which model fits our application best
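The variance idea can be sketched by fitting the same model form to two different training windows and comparing the forecasts. The series below is synthetic (a simple curve standing in for a price series, not the Apple data from the slides), and `fit_trend` is a hypothetical helper.

```python
def fit_trend(times, values):
    """Least-squares straight line through (time, value) pairs."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
             / sum((t - mean_t) ** 2 for t in times))
    intercept = mean_v - slope * mean_t
    return lambda t: intercept + slope * t

# Synthetic series that accelerates over time
times = list(range(30))
series = [0.1 * t * t for t in times]

# Same model form, two different training windows
early_model = fit_trend(times[0:10], series[0:10])
middle_model = fit_trend(times[10:20], series[10:20])

# Forecast the final time point with each model
pred_early = early_model(29)
pred_middle = middle_model(29)
actual = series[29]

print("early-window forecast: ", round(pred_early, 1))
print("middle-window forecast:", round(pred_middle, 1))
print("actual value:          ", round(actual, 1))
```

The two forecasts differ only because the training data differ, which is the variance component. Both fall short of the actual value because a straight line cannot follow the curve, which is the bias component.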

Model fit = Fit to signal + fit to noise

Training data (sample) can lead to model overfit (the blue line)
- Non-linear behaviors can be right around the corner
Training data can lead to bias in future predictions (the purple line)
- Time changes things and the process/logic of updating models is important
We need both methods of estimating \( f \) and methods of evaluating models that can insulate against overfit and reduce bias
This means different measures of model fit to choose among competing models

Bias, Variance, Training, and Test Data

Figure from Hastie, Tibshirani and Friedman (2009). Springer-Verlag (Figure 7.1)


Measuring Fit Differently

Define a metric of accuracy (ROC, AUC, kappa, RMSE, etc.)
Define a strategy to estimate test data accuracy/error
Perform the test, sensitivity checks

Metrics of Model Fit

In the continuous case, Root Mean Square Error (RMSE)
In the discrete case, there are a number of options including kappa, ROC, AUC, and others
ROC: Receiver Operating Characteristic, AUC: Area Under the (ROC) Curve
Many of these metrics can be extended to the multi-class case as well
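As a rough illustration of two of these metrics, the sketch below computes RMSE for a continuous outcome and Cohen's kappa for a discrete one. The function names and the toy predictions are assumptions made for the example.

```python
import math

def rmse(actual, predicted):
    """Root mean square error for continuous outcomes."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def cohen_kappa(actual, predicted):
    """Agreement between predicted and actual classes, corrected for chance."""
    n = len(actual)
    labels = set(actual) | set(predicted)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Agreement expected if predictions were independent of the truth
    expected = sum((actual.count(lab) / n) * (predicted.count(lab) / n)
                   for lab in labels)
    return (observed - expected) / (1 - expected)

# Continuous case: toy scale scores
print(rmse([1, 2, 3], [1, 2, 5]))

# Discrete case: 8 graduates and 2 non-graduates
actual = ["grad"] * 8 + ["non"] * 2
model_pred = ["grad"] * 7 + ["non"] + ["non"] + ["grad"]
everyone_grads = ["grad"] * 10

print(cohen_kappa(actual, model_pred))
print(cohen_kappa(actual, everyone_grads))
```

Kappa is zero for the "everyone graduates" guess even though that guess is right 80% of the time, which is why chance-corrected metrics are useful when one class dominates.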

Confusion Matrix

		Actual
		Non-grad	Graduate
Predicted	Non-grad	a	b
Predicted	Graduate	c	d

Some performance metrics we can use:

Accuracy: \( \frac{(a+d)}{(a+b+c+d)} \)
Precision (positive predictive value) = \( \frac{a}{(a+b)} \)
Sensitivity (recall) = \( \frac{a}{(a+c)} \)
Specificity (true negative rate) = \( \frac{d}{(b+d)} \)
False alarm (1-specificity) = \( \frac{b}{(b+d)} \)
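The formulas above translate directly into code. This sketch uses the same cell labels a, b, c, and d as the confusion matrix, with non-graduates as the class of interest; the counts passed in are made up for illustration.

```python
def confusion_metrics(a, b, c, d):
    """Metrics from a 2x2 confusion matrix.

    a: predicted non-grad, actual non-grad
    b: predicted non-grad, actual graduate
    c: predicted graduate, actual non-grad
    d: predicted graduate, actual graduate
    """
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": a / (a + b),
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "false_alarm": b / (b + d),
    }

# Hypothetical counts for 1,000 students
metrics = confusion_metrics(a=60, b=40, c=40, d=860)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the overall accuracy is high while precision and sensitivity for the non-graduate class are much lower, so the choice of metric changes the story the model tells.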

Confusion Matrix

		Actual
		Non-grad	Graduate
Predicted	Non-grad	a	b
Predicted	Graduate	c	d

Accuracy: \( \frac{(a+d)}{(a+b+c+d)} \)

Accuracy is a good measure if our classes are fairly balanced and we care about overall correctly dividing the data into the groups.

If one group is much larger than another though, this method can be misleading.
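A quick worked example using the 2010-11 totals reported earlier in this deck (69,652 expected graduates, 60,560 actual graduates). The "model" here is a deliberately naive assumption: predict that every student graduates.

```python
expected_grads = 69652
actual_grads = 60560
non_grads = expected_grads - actual_grads

# Naive rule: nobody is ever predicted to be a non-graduate
a, b = 0, 0                      # predicted non-grad row is empty
c, d = non_grads, actual_grads   # everyone lands in the predicted graduate row

accuracy = (a + d) / (a + b + c + d)
sensitivity = a / (a + c)

print(f"non-graduates missed: {non_grads}")
print(f"accuracy: {accuracy:.1%}")
print(f"sensitivity: {sensitivity:.1%}")
```

The naive rule matches the statewide graduation rate in accuracy while identifying none of the students an early warning system exists to find.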

Confusion Matrix

		Actual
		Non-grad	Graduate
Predicted	Non-grad	a	b
Predicted	Graduate	c	d

Precision (positive predictive value) = \( \frac{a}{(a+b)} \)

Of all the cases we predict to be non-graduates, what proportion actually are non-graduates?
If we are interested in the non-graduate class, then this is a very useful metric to understand how good we are at identifying this group. Useful if this class is a rare class.

Confusion Matrix

		Actual
		Non-grad	Graduate
Predicted	Non-grad	a	b
Predicted	Graduate	c	d

Sensitivity (recall) = \( \frac{a}{(a+c)} \)

Of all the non-graduate cases, what percentage do we correctly identify (recall)?
Useful if we are interested in rare-event models where we want to accurately identify rare events, and are less worried about how accurate we are with the modal or common case.

Confusion Matrix

		Actual
		Non-grad	Graduate
Predicted	Non-grad	a	b
Predicted	Graduate	c	d

Specificity (true negative rate) = \( \frac{d}{(b+d)} \)

False alarm (1-specificity) = \( \frac{b}{(b+d)} \)

Of all the actual graduates, what proportion do we correctly predict to graduate?
If we are interested in one class, this metric is either interesting on its own, or as the balancing metric (false alarm) that we seek to hold constant while increasing our sensitivity.

Receiver Operating Characteristic

ROC represents the tradeoff between the fraction of non-graduates identified out of all non-graduates, and the fraction of false non-graduates out of all graduates

Can represent the variation in classification accuracy as the discrimination threshold is varied

Can support decision analysis by allowing a decision to be made explicitly about the balance between false-positives and false-negatives

Excellent for optimizing rare-class identification
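To show how the curve arises from varying the threshold, here is a small sketch that computes ROC points and the area under them from risk scores. The labels, scores, and function names are hypothetical; in practice an established library routine would normally be used.

```python
def roc_points(labels, scores):
    """ROC points as (false alarm rate, sensitivity) pairs.

    labels: 1 for non-graduate (the class of interest), 0 for graduate
    scores: predicted risk, where higher means more likely a non-graduate
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step, flagging more students each time
    for threshold in sorted(set(scores), reverse=True):
        flagged = [lab for lab, s in zip(labels, scores) if s >= threshold]
        true_pos = sum(flagged)
        false_pos = len(flagged) - true_pos
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points

def auc(points):
    """Area under the ROC points using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

points = roc_points(labels, scores)
print(points)
print(auc(points))
```

Each point corresponds to one possible decision rule, so a district can read off how many false alarms it must accept to reach a given share of non-graduates identified.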

Estimating the Test Error

In cases where observations are cheap, a common split is 50% of the sample for training, 25% for validation, and 25% for final testing
When data are not cheap, a number of methods can be used to approximate the test set error
K-fold cross-validation splits the data into K groups (for example, K = 5), uses each group once as a validation set, and fits the model to the other K - 1 groups
- “Overall, five- or tenfold cross-validation are recommended as a good compromise…” Hastie et al. p. 243
Alternatives include bootstraps, leave one out cross-validation, leave-group-out cross validation, out-of-bag estimates
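The fold logic can be sketched in a few lines. This is a minimal illustration with hypothetical names (`k_fold_indices`, `seed`); it only produces the index splits and leaves the model fitting to the caller.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation

splits = list(k_fold_indices(n=23, k=5))
for training, validation in splits:
    print(len(training), "training cases,", len(validation), "validation cases")
```

Every case is used for validation exactly once, so the averaged validation error uses all of the data without ever scoring a case with a model that was fit to it.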

Summary of Methods

Method	Data Loss	External Validity
Hold 1 Cohort Out	Highest	Highest
Random Sample from Multiple Cohorts	High	Higher
Simple Random Sample in Training Data	Moderate	Low
Stratified Sample Within Training Data	Moderate	High
Repeated Fold Cross-Validation	Low	Moderate

The method used for estimating the test error is arguably more important than the selection of the algorithm being tested.
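As one illustration of the first row of the table, this sketch holds out one cohort at a time. The record layout, the `cohort` field name, and the function name are assumptions for the example; holding out only the most recent cohort is the single-split version of the same idea.

```python
def leave_one_cohort_out(records, cohort_key="cohort"):
    """Yield (held_out_cohort, training, testing) for each cohort in turn."""
    cohorts = sorted({record[cohort_key] for record in records})
    for held_out in cohorts:
        training = [r for r in records if r[cohort_key] != held_out]
        testing = [r for r in records if r[cohort_key] == held_out]
        yield held_out, training, testing

# Hypothetical student records from three cohorts
records = [
    {"student": 1, "cohort": 2009, "graduated": 1},
    {"student": 2, "cohort": 2009, "graduated": 0},
    {"student": 3, "cohort": 2010, "graduated": 1},
    {"student": 4, "cohort": 2010, "graduated": 1},
    {"student": 5, "cohort": 2011, "graduated": 0},
    {"student": 6, "cohort": 2011, "graduated": 1},
]

cohort_splits = list(leave_one_cohort_out(records))
for held_out, training, testing in cohort_splits:
    print(held_out, len(training), "training,", len(testing), "testing")
```

Testing on a whole cohort the model never saw mimics how the model is actually used, on next year's students, at the cost of setting aside a full cohort of training data.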

Model Fit: Predicting Dropouts

[Figure: model fit of existing dropout predictors]

Adapted from Bowers and Sprott 2013

Problems

Most EWIs have a low true positive identification rate
EWI literature does not report performance on a test dataset of future students
High performing EWIs have immense data requirements
Alarming false positive rates, and no ability to tune these rates because each EWI relies on a single indicator
But… we have a strong baseline universe to compare to

Evaluating Multiple DEWS Candidate Models Using ROC

[Figure: ROC curves for multiple DEWS candidate models]

AUCs Across Methods

[Figure: AUC values across modeling methods]

Selecting the Best Model

In an applied context we may consider additional criteria in selecting the best model:

Accuracy (on the test data)
Transparency (stakeholder support)
Speed (both for development, and providing on-time prediction)
Support costs (the model lives on)
Data availability
Stability (reduce data reliance)

Data Cleaning

The one line in your methods section that took 80% of the work.

Data Cleaning

Data cleaning decisions are incredibly consequential, yet little formal training is offered in how to make them
Sometimes cleaning takes the form of automatic filters, other times it is in compliance with a business rule
Data cleaning is incredibly important when using administrative data
Data cleaning is high stakes in an applied setting
Knowing the content helps, asking an expert such as a database administrator is even better

Preparing Data

Recoding categorical variables
Centering and scaling
Dealing with missingness

Coding Categorical Data

Key tradeoff is between losing information by reducing categories and need to consolidate sparse groups to produce valid estimates
Consider whether categories can be ordered, since ordering eases data demands for estimation
Solution is to propose collapses of groups that are similar with respect to the DV and which are too small for strong estimates
There may be non-empirical concerns as well such as perception of groups of categories as similar
May disappear in the end after feature / variable selection, but need to be considered
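A minimal sketch of collapsing sparse categories. It is a simplification: it collapses on group size alone, while the point above also asks that collapsed groups be similar with respect to the DV. The `min_count` threshold, the "Other" label, and the program codes are hypothetical.

```python
from collections import Counter

def collapse_sparse(values, min_count=30, other_label="Other"):
    """Recode categories with fewer than min_count cases to a single label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical program codes with two very small groups
codes = ["A"] * 50 + ["B"] * 40 + ["C"] * 3 + ["D"] * 4
collapsed = collapse_sparse(codes, min_count=30)

print(Counter(codes))
print(Counter(collapsed))
```

The recode keeps the well-populated groups intact and pools the rest, which trades some information for estimates that are less likely to be driven by a handful of students.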

Scale and Center the Continuous Variables

Scaling and centering can reduce noise in assessment data (issues like lowest obtainable scale score, grade and year equating)
Can help deal with attendance data that is heavily skewed toward the high end
Converting counts to percents helps keep units (like schools) on a similar scale
Improves the efficiency of many statistical algorithms and MCMC methods
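Centering, scaling, and converting counts to percents are each one-line transformations. The sketch below uses hypothetical scale scores and a hypothetical school count; the function names are illustrative.

```python
import statistics

def center_and_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def to_percent(count, total):
    """Express a count as a percent of its unit's total."""
    return 100.0 * count / total

# Hypothetical assessment scale scores
scores = [400, 450, 500, 550, 600]
z_scores = center_and_scale(scores)
print([round(z, 2) for z in z_scores])

# Hypothetical school: 45 of 300 students chronically absent
print(to_percent(45, 300))
```

After the transformation every continuous predictor has mean 0 and standard deviation 1, which puts predictors measured in very different units on a comparable footing.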

Missing Data Issues

Missing data is acutely important to predictive modeling; consider which data you need to predict outcomes and when it will be available
Data is often missing in administrative records, and is almost never missing at random (students with missing data are more likely to drop out!)
In Wisconsin, each additional year of data longitudinally that a method requires eliminates 10% of students from receiving a prediction due to missing data
Identifying that the outcome is correctly classified and not a default assumption is important - e.g. dropouts

Communication

Use graphics to display your model results to users. How to do that is a subject for another talk.

The Most Accurate Model is Easy to Find!

[Figure not reproduced]

Credits

Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
CPU power graph: http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance/
Example ROC graph from Wikipedia

Contact Info

DEWS Homepage: http://wise.dpi.wi.gov/wisedash_dews
E-mail: jared.knowles@dpi.wi.gov / jeknowles@wisc.edu
GitHub: http://www.github.com/jknowles
Homepage: www.jaredknowles.com
Google+: https://plus.google.com/+JaredKnowles

Further Resources

The Signal and the Noise: Why So Many Predictions Fail — but Some Don't. Nate Silver. (2012). Penguin.

The Black Swan: Second Edition: The Impact of the Highly Improbable (2nd ed. 2010). Nassim Taleb. Random House.

An Introduction to Statistical Learning (2013). Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Springer.

Elements of Statistical Learning (Second Edition, 2011). Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Springer.

An Aside on Unsupervised Models

[Figure: cluster plot]

These are familiar techniques for dimension reduction like cluster analysis, factor analysis, or principal components analysis
Can be useful for starting an analysis, looking for structure

Tradeoffs

[Figure not reproduced]