Statistical Learning and Applied Modeling in Education

Examples and Concerns

Jared Knowles

01-31-2014

Motivation

  • Of the two modeling cultures, we've tended to focus overwhelmingly on one
  • Computation increases are changing everything
  • Data is growing and many problems have different issues
  • Prediction is underused and undervalued, and this undermines inference

The Data Modeling Culture

  • Starts philosophically with the idea that we have written down a set of X that describe Y with a known functional form that we are testing
  • Black box between x and y can be known because the data generating process (DGP) is some functional combination of predictors, parameters, and noise
  • Model fit is based on goodness of fit and residual tests

The Algorithmic Modeling Culture

  • Black box is unknowable - we are not modeling nature but seeking to use similar inputs to predict the outputs of the natural process
  • Model fit measured by prediction accuracy

The Wisconsin Dropout Early Warning System

“To help keep all kids on a path to graduation, we just delivered - with no new funding - a new statewide Dropout Early Warning System, called DEWS, to all districts. DEWS makes it possible to identify kids who may be at risk, and allows districts to intervene as early as middle school.” ~ Tony Evers

DEWS

  • The Dropout Early Warning System for the Wisconsin Department of Public Instruction
  • Leverages DPI's administrative records to provide predictions on student high school completion while students are in middle grades
  • Communicates the results to school staff in all Wisconsin public schools serving students in the middle grades
  • Comes with an interpretive guide and strategies for success available online
  • Released in September of 2013, with bi-annual updates in April and August
  • Serves as a good example of where social science and applied modeling intersect

Why DEWS?

  • Every child a graduate, college and career ready – Agenda 2017, DPI's strategic plan
  • DEWS focuses on providing schools and districts an early notice of whether or not a student is likely to complete high school on time
  • DEWS uses data on historical cohorts of students in Wisconsin to link middle grade student outcome data with the long-term outcome of on time graduation
  • DEWS provides a relatively accurate assessment of the likelihood of on time graduation for individual students across the state

Graduation and Dropout By the Numbers

9,092 students in 2010-11 did not graduate with their cohort.

Group             Expected    Grads    Rate    Difference
White               54,468   49,783   91.4%        -
American Indian      1,027      737   71.7%     19.7
Asian                2,517    2,225   88.4%      3.0
Black                6,889    4,395   63.8%     27.6
Hispanic             4,751    3,420   72.0%     19.4
Total               69,652   60,560   86.9%        -

Difference is the gap, in percentage points, below the White graduation rate.

What is DEWS?

DEWS is an applied statistical model that combines several major features:

  • Data import, filtering, and cleaning for analysis from the state longitudinal database
  • A machine learning algorithm to search for the best predictive model
  • A prediction routine to apply models to current students
  • An exporting feature to push predictions into the state business intelligence tool, WISEdash for Districts
  • A display layer available to schools and districts securely for exploring the results
  • In reality, it resembles software as much as a statistical analysis

Under the Hood of DEWS

DEWS consists of several sub-routines that can be thought of as stages of building a statistical model:

  1. Data acquisition
  2. Data cleaning, normalizing, and standardizing
  3. Model feature and model algorithm search
  4. Model testing
  5. Model selection
  6. New case scoring
  7. Prediction export for reporting

All modules are built in the free and open source statistical computing language, R.

DEWS by the Numbers

  • Analyzes over 250,000 historical records of student graduation
  • Provides predictions on over 180,000 students in the state
  • Produces predictions on students in over 1,000 schools
  • Selects from over 50 candidate statistical models per grade
  • Hundreds of users have accessed thousands of individual student reports across nearly every Wisconsin school district
  • Working on open sourcing the code
  • Being explored in Michigan, New Jersey, and school districts in Kansas, Montana, and Minnesota

DEWS as an Applied Model

Data and Computing Trends

  • Available data in education is growing astronomically.
  • People are talking about things like “data science” and “big data”, even NSF.
  • Data sources are shifting from national surveys to administrative records
  • More data = more problems; more data + more sources = more problems\( ^{2} \)

Increased Computational Power

Increased data size and complexity leads to new problems that increased computational power often helps to solve.

Examples of Challenges and Solutions Posed by Computation

  • Bigger datasets have highly complex structures to them such as deep hierarchies, cross classifications, and high collinearity
    • Methods like HLM have difficulty scaling to 3, 4, or 5 levels that may exist within a statewide data system
    • Cross-nested and cross-classified observations are common in observational data, and difficult to deal with for many approaches
    • Alternative methods like Bayesian mixed effect regression or regression trees are more CPU intensive, but more flexible
    • With 12 regions, 72 counties, 424 districts, 2,200 schools, and tens of thousands of classrooms and hundreds of thousands of students the modeling data structure is complex

Straining our Generalized Linear Models

  • A larger number of predictors lets us build models of complex group interactions, but some of those interactions separate the outcome perfectly
  • Parameter estimates for a demographic indicator are invalid when the indicator is not observed in each outcome category (the indicator and the outcome are perfectly collinear)
  • Quasi-separation can occur when this is nearly the case
  • Corrections exist to adjust for the fact that maximum likelihood estimates are invalid in this case (Bayesian estimates, Firth bias-correction)
  • Again, leveraging computation to address a problem of increased data complexity
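The separation problem can be seen in a small worked example. The sketch below uses a made-up 2x2 table (the counts are illustrative assumptions, not DEWS data) in which a rare group has no observed non-graduates. The raw log odds ratio, which is what a logistic regression on that single indicator would estimate by maximum likelihood, is infinite. Adding 0.5 to every cell (the Haldane-Anscombe correction) gives a finite estimate. It is used here only as a simple stand-in for the Bayesian and Firth-type corrections, not as an implementation of either.

```python
import math

def log_odds_ratio(table, correction=0.0):
    """Log odds ratio of non-graduation for a 2x2 table.

    table = [[a, b], [c, d]]: rows are (in group, not in group),
    columns are (non-graduates, graduates).
    `correction` is added to every cell before the ratio is taken.
    """
    (a, b), (c, d) = table
    a, b, c, d = (cell + correction for cell in (a, b, c, d))
    num, den = a * d, b * c
    if num == 0 and den == 0:
        return math.nan      # both diagonals empty: undefined
    if den == 0:
        return math.inf      # separation in the positive direction
    if num == 0:
        return -math.inf     # separation in the negative direction
    return math.log(num / den)

# Hypothetical counts: the rare group has 0 non-graduates and 12 graduates.
toy_counts = [[0, 12], [90, 900]]

raw = log_odds_ratio(toy_counts)                        # infinite estimate
corrected = log_odds_ratio(toy_counts, correction=0.5)  # finite estimate
print(raw, round(corrected, 3))
```

The corrected estimate is finite but still depends on the size of the constant added, which is one reason the principled corrections are usually preferred in practice.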

DEWS

  • DEWS data has a complicated hierarchical structure
  • DEWS data has rare cases that have to be addressed (e.g. blind students) across most indicators
  • Using CPU-intensive techniques can work, but is not limitless – some models are too slow to be developed, modified, evaluated, and implemented
  • As it is, DEWS takes about 48 hours to build data and models, test them, select the winners, and produce predictions for current students
  • But in the future… who knows?

Being a Modeling Pluralist

Schools of statistical thoughts are sometimes jokingly likened to religions. This analogy is not perfect - unlike religions, statistical methods have no supernatural content and make essentially no demands on our personal lives. Looking at the comparison from the other direction, it is possible to be agnostic, atheistic, or simply live one's life without religion, but it is not really possible to do statistics without some philosophy. ~ Andrew Gelman

What is a statistical model?

  • “All models are wrong, some models are useful” ~ George Box
  • Statistical models are mathematical summaries of correlations and probabilities of known data
  • Being wrong is a feature of a statistical model; the goal is to explain as much data as possible with as few variables as possible
  • The most common in the social sciences is the linear regression model
  • Sometimes the goal is inference and other times it is prediction

Statistical Modeling

It is useful to remember that in all statistical modeling we are looking at the following relationship:

\[ \hat{Y} = \hat{f}(X) \]

In this case \( \hat{f} \) represents our estimate of the function that links \( X \) and \( Y \). In traditional linear modeling we assume \( Y = \alpha + \beta X + \epsilon \), so \( \hat{f} \) takes the form:

\[ \hat{Y} = \hat{\alpha} + \hat{\beta} X \]

However, there exist limitless alternative \( \hat{f} \) which we can explore. Applied modeling techniques help us expand the \( \hat{f} \) space we search within.
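As a minimal sketch of what "searching the \( \hat{f} \) space" can mean, the code below builds two different \( \hat{f} \) from the same toy data: a least-squares line and a k-nearest-neighbor average. The data values and the function names (`fit_linear`, `make_linear_f_hat`, `make_knn_f_hat`) are illustrative assumptions, not part of DEWS.

```python
def fit_linear(x, y):
    """Least-squares estimates (alpha_hat, beta_hat) for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    return mean_y - beta_hat * mean_x, beta_hat

def make_linear_f_hat(x, y):
    """Return f_hat(new_x) = alpha_hat + beta_hat * new_x."""
    alpha_hat, beta_hat = fit_linear(x, y)
    return lambda new_x: alpha_hat + beta_hat * new_x

def make_knn_f_hat(x, y, k=3):
    """Return f_hat(new_x) = mean outcome of the k nearest training points."""
    def f_hat(new_x):
        nearest = sorted(range(len(x)), key=lambda i: abs(x[i] - new_x))[:k]
        return sum(y[i] for i in nearest) / k
    return f_hat

# Toy data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

linear_f_hat = make_linear_f_hat(x, y)
knn_f_hat = make_knn_f_hat(x, y, k=3)
print(linear_f_hat(3.5), knn_f_hat(3.5))
```

Both objects answer the same question (what is \( \hat{Y} \) at a new \( X \)?), but only the first has parameters that can be read as a description of the relationship.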

Functional forms

Figure adapted from James et al. 2013 (figure 2.7)

Buyer Beware

A big computer, a complex algorithm and a long time does not equal science. ~ Robert Gentleman

Statistical Learning or Statistical Inference?

The line between statistical learning and statistical inference has always been blurry and unclear. A few questions can help:

  • Am I interested in accurately predicting cases I have not observed, based on what I have learned in my sample?
  • Am I interested in the relationships among the parameters in my sample because of a theory I am testing, or because of how they can explain an outcome I am interested in?
  • Is the data I am using common and relatively untransformed? Will new data be created regularly that I can fit the same model to and update?

Why the Difference?

Algorithmic Models:

  • Provide information to users about what to expect given certain data
  • Serve many goals including prediction of non-observed outcomes, summarizing large datasets, measuring uncertainty
  • Goals for the model are defined by explicit tradeoffs

Data Models:

  • Focused on understanding patterns in the current data
  • Seek to understand how current data extrapolates to a population
  • Estimates population parameters from sample data about relationships between inputs and outputs

Predicting Dropout

Algorithmic Models:

  • Data: Regularly collected at specific timepoints, standardized
  • Many cohorts with common data
  • Interested in learning which students today are likely to drop out in the future
  • Want: Confident predictions on likely graduation of new students, used to decide how to allocate resources and services to students

Data Models:

  • Data: national survey data, unlikely to be collected on future observations
  • One cohort is followed in the data set
  • Interested in learning if social and emotional concerns are more important than academic success in predicting graduation
  • Want: unbiased and precise estimates of parameters and if possible ability to make causal claims

On Prediction

  • Note that prediction is important in both cases
  • In data models, making a good prediction is the sign that our theory has explanatory power
  • In algorithmic models, making a good prediction is a sign that we have approximated the natural process correctly
  • In both cases, we should care deeply about prediction and think carefully about measuring it

On Nails, Hammers, and Models

The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed. ~ Leo Breiman

Some Vocabulary

  • Training data: data the model is fit to (analytical sample)
  • Test data: data the model predicts, to evaluate model fit
  • Bias (error): the amount of error due to simplifying a complex process
  • Variance (error): the amount \( \hat{f} \) would change if fit to a different training set of data

The Challenge

  • When using a statistical model to make predictions we have to think clearly about the data we use to build the model, and the data we will be making predictions about
  • We may build a model with high internal validity for the data at hand, but that data may not be representative of the data the model will apply to
  • Error on the data used to build the model is the training error; error on the data the model is applied to is the test error
  • In inferential statistics we often seek to reduce training error and not concern ourselves with test error
  • In applied modeling we focus on finding the optimal tradeoff between variance and bias in order to reduce test error
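The gap between training error and test error can be sketched with synthetic data. Everything below is an assumption made for illustration: the sine-plus-noise outcome, the even/odd split, and the two models (a straight line and a one-nearest-neighbor rule). The very flexible nearest-neighbor rule reproduces its training data exactly, so its training error says nothing about how it will do on held-back data.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [i * 0.1 for i in range(100)]
y = [math.sin(xi) + rng.gauss(0, 0.3) for xi in x]  # synthetic outcome

train_idx = range(0, 100, 2)  # even rows: data the model is fit to
test_idx = range(1, 100, 2)   # odd rows: data held back for evaluation
x_train = [x[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
x_test = [x[i] for i in test_idx]
y_test = [y[i] for i in test_idx]

def fit_linear(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return my - slope * mx, slope

def nearest_neighbor(xs, ys, new_x):
    """Predict with the outcome of the single closest training point."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - new_x))
    return ys[i]

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

alpha, beta = fit_linear(x_train, y_train)
linear_train = mse(y_train, [alpha + beta * v for v in x_train])
linear_test = mse(y_test, [alpha + beta * v for v in x_test])
nn_train = mse(y_train, [nearest_neighbor(x_train, y_train, v) for v in x_train])
nn_test = mse(y_test, [nearest_neighbor(x_train, y_train, v) for v in x_test])

print("linear  train/test MSE:", linear_train, linear_test)
print("1-NN    train/test MSE:", nn_train, nn_test)
```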

A Simple Motivating Example

Forecasting Apple Stock Could be Useful

  • Fit a model on the earlier part of the data (in blue)

Forecasts Are Tricky

  • Fit another model on the middle part of the data (purple)

Evaluating Model Fit

How do we know how well our models fit? A very brief model comparison review:

  • \( R^2 \) - ratio of explained variation to total variation (generally)
  • Nested model tests:
    • F test and Likelihood ratio tests (restricted and unrestricted model)
  • Same sample tests:
    • AIC, BIC, etc. (different penalties for model parameters)
  • These don't give us a sense of how the model will do on new data, and they are not easy to explain!

Predicting New Data

  • Test both models on the full data!

The Bias - Variance Tradeoff

  • The purple and blue models are identical except that each was “trained” on different data; the difference between their predictions is variance
  • Both have less bias on the data they were trained on, but the linear model has a different bias - a feature of the flexibility in the model
  • Less flexible models like linear models will have more bias, but are less variable in response to the data they are trained on
  • How do we pick the model? We think about which model fits our application best
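A minimal sketch of variance in this sense: the same model form is fit to two different training windows of one series and then asked for the same forecast. The series here is a synthetic curve (an assumption for illustration, not the stock data from the slides), and a straight line stands in for whichever model is being compared.

```python
def fit_linear(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return my - slope * mx, slope

# Synthetic, smoothly accelerating series (made up for illustration).
t = list(range(120))
series = [0.01 * ti ** 2 for ti in t]

early = slice(0, 40)    # first training window
middle = slice(40, 80)  # second training window

a_early, b_early = fit_linear(t[early], series[early])
a_mid, b_mid = fit_linear(t[middle], series[middle])

target = 100  # a time point outside both training windows
pred_early = a_early + b_early * target
pred_middle = a_mid + b_mid * target
actual = series[target]

# The spread between the two forecasts comes only from the choice of
# training data; the distance of each from `actual` also reflects bias,
# because a straight line cannot follow the curve.
print(pred_early, pred_middle, actual)
```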

Model fit = Fit to signal + fit to noise

  • Training data (sample) can lead to model overfit (the blue line)
    • Non-linear behaviors can be right around the corner
  • Training data can lead to bias in future predictions (the purple line)
    • Time changes things and the process/logic of updating models is important
  • We need both methods of estimating \( \hat{f} \) and methods of evaluating models that can insulate against overfit and reduce bias
  • This means different measures of model fit to choose among competing models

Bias, Variance, Training, and Test Data

Figure from Hastie, Tibshirani and Friedman (2009). Springer-Verlag (Figure 7.1)

Measuring Fit Differently

  1. Define a metric of accuracy (ROC, AUC, kappa, RMSE, etc.)
  2. Define a strategy to estimate test data accuracy/error
  3. Perform the test, sensitivity checks

Metrics of Model Fit

  • In the continuous case, Root Mean Square Error (RMSE)
  • In the discrete case, there are a number of options including kappa, ROC, AUC, and others
  • ROC: Receiver Operating Characteristic, AUC: Area Under the (ROC) Curve
  • Many of these metrics can be extended to the multi-class case as well
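Two of these metrics are short enough to write out. The sketch below computes RMSE for a continuous outcome and Cohen's kappa for a two-class outcome, taking the four cells of a 2x2 table of predicted by actual classes as inputs. The function names and the example counts are assumptions for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean square error for a continuous outcome."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table.

    a: predicted non-grad, actual non-grad   b: predicted non-grad, actual grad
    c: predicted grad, actual non-grad       d: predicted grad, actual grad
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(cohen_kappa(30, 20, 10, 140))  # hypothetical counts
```

Kappa rescales accuracy so that 0 means agreement no better than chance and 1 means perfect agreement, which makes it less sensitive to class imbalance than raw accuracy.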

Confusion Matrix

                        Actual
                        Non-grad   Graduate
Predicted   Non-grad    a          b
            Graduate    c          d

Some performance metrics we can use:

  • Accuracy: \( \frac{(a+d)}{(a+b+c+d)} \)
  • Precision (positive predictive value) = \( \frac{a}{(a+b)} \)
  • Sensitivity (recall) = \( \frac{a}{(a+c)} \)
  • Specificity (true negative rate) = \( \frac{d}{(b+d)} \)
  • False alarm (1-specificity) = \( \frac{b}{(b+d)} \)
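The formulas above translate directly into code. This is a minimal sketch with non-graduates treated as the positive class; the function name `confusion_metrics` and the example counts are assumptions for illustration.

```python
def confusion_metrics(a, b, c, d):
    """Metrics from a 2x2 confusion matrix, non-graduate as the positive class.

    a: predicted non-grad, actual non-grad   b: predicted non-grad, actual grad
    c: predicted grad, actual non-grad       d: predicted grad, actual grad
    """
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": a / (a + b),     # positive predictive value
        "sensitivity": a / (a + c),   # recall
        "specificity": d / (b + d),   # true negative rate
        "false_alarm": b / (b + d),   # 1 - specificity
    }

metrics = confusion_metrics(a=30, b=20, c=10, d=140)  # hypothetical counts
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```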

Confusion Matrix

                        Actual
                        Non-grad   Graduate
Predicted   Non-grad    a          b
            Graduate    c          d

Accuracy: \( \frac{(a+d)}{(a+b+c+d)} \)

Accuracy is a good measure if our classes are fairly balanced and we care about overall correctly dividing the data into the groups.

If one group is much larger than another though, this method can be misleading.
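A short worked example of how accuracy misleads, using the 2010-11 totals reported earlier in this deck (60,560 graduates and 9,092 students who did not graduate with their cohort). The "model" is a deliberately useless rule that predicts every student will graduate; the rule itself is a hypothetical for illustration.

```python
# Cells of the confusion matrix for an "everyone graduates" rule.
a = 0        # predicted non-grad, actual non-grad
b = 0        # predicted non-grad, actual graduate
c = 9_092    # predicted graduate, actual non-grad (every non-grad is missed)
d = 60_560   # predicted graduate, actual graduate

accuracy = (a + d) / (a + b + c + d)
sensitivity = a / (a + c)  # share of actual non-graduates identified

print(f"accuracy:    {accuracy:.3f}")
print(f"sensitivity: {sensitivity:.3f}")
```

The rule is right for roughly 87% of students and still identifies none of the students an early warning system exists to find.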

Confusion Matrix

                        Actual
                        Non-grad   Graduate
Predicted   Non-grad    a          b
            Graduate    c          d

Precision (positive predictive value) = \( \frac{a}{(a+b)} \)

  • Of all the cases we predict to be non-graduates, what proportion actually do not graduate?
  • If we are interested in the non-graduate class, then this is a very useful metric to understand how good we are at identifying this group. Useful if this class is a rare class.

Confusion Matrix

                        Actual
                        Non-grad   Graduate
Predicted   Non-grad    a          b
            Graduate    c          d

Sensitivity (recall) = \( \frac{a}{(a+c)} \)

  • Of all the non-graduate cases, what percentage do we correctly identify (recall)?
  • Useful if we are interested in rare-event models where we want to accurately identify rare events, and are less worried about how accurate we are with the modal or common case.

Confusion Matrix

                        Actual
                        Non-grad   Graduate
Predicted   Non-grad    a          b
            Graduate    c          d

Specificity (true negative rate) = \( \frac{d}{(b+d)} \)

False alarm (1-specificity) = \( \frac{b}{(b+d)} \)

  • Of all the graduate cases, what proportion do we correctly predict to graduate?
  • If we are interested in one class, this metric is either interesting on its own, or as the balancing metric (false alarm) that we seek to hold constant while increasing our sensitivity.

Receiver Operating Characteristic

  • ROC represents the tradeoff between the fraction of non-graduates identified out of all non-graduates, and the fraction of false non-graduates out of all graduates
  • Can represent the variation in classification accuracy as the discrimination threshold is varied
  • Can support decision analysis by allowing a decision to be made explicitly about the balance between false-positives and false-negatives
  • Excellent for optimizing rare-class identification
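The threshold sweep behind an ROC curve can be sketched in a few lines. The scores below are made-up predicted probabilities of non-graduation and the labels use 1 for an actual non-graduate; both, along with the function names, are assumptions for illustration. The area under the curve is taken with the trapezoid rule.

```python
def roc_points(scores, labels):
    """(false alarm rate, sensitivity) at every distinct score threshold.

    A student is flagged as a likely non-graduate when score >= threshold.
    labels: 1 = actual non-graduate, 0 = actual graduate.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nobody is flagged
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        true_pos = sum(flagged)
        false_pos = len(flagged) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores for eight students.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

Each point on the curve is one possible operating threshold, which is what lets the balance between false positives and false negatives be chosen explicitly rather than fixed by the model.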

Estimating the Test Error

  • In cases where observations are cheap, 50% of the sample is for training, 25% for validation, and 25% for final testing
  • When data are not cheap, a number of methods can be used to approximate the test set error
  • K-fold cross-validation splits the data into K groups (for example K = 5), uses each group once as a validation set, and fits the model to the other K - 1 groups
    • “Overall, five- or tenfold cross-validation are recommended as a good compromise…” Hastie et al p. 243
  • Alternatives include bootstraps, leave one out cross-validation, leave-group-out cross validation, out-of-bag estimates
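A minimal sketch of K-fold cross-validation, assuming a generic `fit` / `predict` pair and squared error as the metric. The helper names (`kfold_indices`, `cross_val_mse`) and the toy data are illustrative assumptions, not the DEWS implementation.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_mse(x, y, fit, predict, k=5, seed=0):
    """Average validation MSE over k folds."""
    fold_errors = []
    for held_out in kfold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        errors = [(y[i] - predict(model, x[i])) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

def fit_linear(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return my - slope * mx, slope

def predict_linear(model, new_x):
    intercept, slope = model
    return intercept + slope * new_x

# Toy data that lie exactly on a line, so the estimated test error is ~0.
x = [float(i) for i in range(20)]
y = [2.0 + 3.0 * xi for xi in x]
folds = kfold_indices(len(x), k=5)
cv_error = cross_val_mse(x, y, fit_linear, predict_linear, k=5)
print(cv_error)
```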

Summary of Methods

Method                                    Data Loss   External Validity
Hold 1 Cohort Out                         Highest     Highest
Random Sample from Multiple Cohorts       High        Higher
Simple Random Sample in Training Data     Moderate    Low
Stratified Sample Within Training Data    Moderate    High
Repeated Fold Cross-Validation            Low         Moderate

The method used for estimating the test error is arguably more important than the selection of the algorithm being tested.
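The "hold one cohort out" strategy can be sketched as a split on a cohort field. The record layout, the field name `cohort`, and the choice to hold out the most recent cohort are all assumptions made for illustration.

```python
def hold_one_cohort_out(records, cohort_key="cohort"):
    """Train on every cohort except the most recent; test on the most recent."""
    latest = max(record[cohort_key] for record in records)
    train = [r for r in records if r[cohort_key] != latest]
    test = [r for r in records if r[cohort_key] == latest]
    return train, test

# Hypothetical student records from three graduation cohorts.
records = [
    {"id": 1, "cohort": 2009, "graduated": True},
    {"id": 2, "cohort": 2009, "graduated": False},
    {"id": 3, "cohort": 2010, "graduated": True},
    {"id": 4, "cohort": 2010, "graduated": True},
    {"id": 5, "cohort": 2011, "graduated": False},
    {"id": 6, "cohort": 2011, "graduated": True},
]

train, test = hold_one_cohort_out(records)
print(len(train), len(test))
```

Holding out a whole cohort mimics the real use case, predicting for students the model has never seen from a later year, at the cost of not training on the held-out year at all.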

Model Fit: Predicting Dropouts

Figure adapted from Bowers and Sprott 2013

Problems

  • Most early warning indicators (EWIs) have a low true positive identification rate
  • EWI literature does not report performance on a test dataset of future students
  • High performing EWIs have immense data requirements
  • Alarming false positive rates, and no ability to tune these rates because each relies on a single indicator
  • But… we have a strong baseline universe to compare to

Evaluating Multiple DEWS Candidate Models Using ROC

AUCs Across Methods

Selecting the Best Model

In an applied context we may consider additional criteria in selecting the best model:

  • Accuracy (on the test data)
  • Transparency (stakeholder support)
  • Speed (both for development, and providing on-time prediction)
  • Support costs (the model lives on)
  • Data availability
  • Stability (reduce data reliance)

Data Cleaning

The one line in your methods section that took 80% of the work.

Data Cleaning

  • Data cleaning decisions are incredibly consequential, yet little formal training is offered in how to make them
  • Sometimes cleaning takes the form of automatic filters, other times it is in compliance with a business rule
  • Data cleaning is incredibly important when using administrative data
  • Data cleaning is high stakes in an applied setting
  • Knowing the content helps; asking an expert such as a database administrator is even better

Preparing Data

  1. Recoding categorical variables
  2. Centering and scaling
  3. Dealing with missingness

Coding Categorical Data

  • Key tradeoff is between losing information by reducing categories and need to consolidate sparse groups to produce valid estimates
  • Consider whether categories can be ordered; ordering eases data demands for estimation
  • Solution is to propose collapses of groups that are similar with respect to the DV and which are too small for strong estimates
  • There may be non-empirical concerns as well such as perception of groups of categories as similar
  • May disappear in the end after feature / variable selection, but need to be considered
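One simple way to consolidate sparse groups is to collapse any category below a minimum count into a catch-all level. The sketch below does only that: it looks at group size, not at similarity with respect to the DV, so it is a starting point rather than the full procedure described above. The function name, the threshold, and the category labels are assumptions for illustration.

```python
from collections import Counter

def collapse_sparse(values, min_count, other_label="Other"):
    """Replace categories observed fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical categorical variable with two very small groups.
codes = ["A"] * 50 + ["B"] * 30 + ["C"] * 2 + ["D"] * 1
collapsed = collapse_sparse(codes, min_count=5)
print(Counter(collapsed))
```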

Scale and Center the Continuous Variables

  • Scaling and centering can reduce noise in assessment data (issues like lowest obtainable scale score, grade and year equating)
  • Can help deal with attendance data that is heavily skewed toward the high end
  • Converting counts to percents helps keep units (like schools) on a similar scale
  • Improves the efficiency of many statistical algorithms and MCMC methods
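A minimal sketch of the two transformations mentioned above: centering and scaling a score to mean 0 and standard deviation 1, and converting a count to a percent so that units of different sizes are comparable. The example values and function names are assumptions for illustration. In practice scores would typically be standardized within grade and year.

```python
import statistics

def center_and_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def to_percent(count, total):
    """Express a count as a percent of its unit's total."""
    return 100.0 * count / total

scale_scores = [400, 450, 500, 550, 600]  # hypothetical scale scores
z_scores = center_and_scale(scale_scores)
print(z_scores)

# 15 absences out of 60 enrolled days, as a percent.
print(to_percent(15, 60))
```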

Missing Data Issues

  • Missing data is acutely important in predictive modeling: consider which data you need to predict outcomes, and when that data will be available
  • Data is often missing in administrative records, and is almost never missing at random (students with missing data are more likely to drop out!)
  • In Wisconsin, each additional year of longitudinal data that a method requires eliminates 10% of students from receiving a prediction due to missing data
  • Verifying that the outcome is correctly classified, and not the result of a default assumption, is important (e.g. for dropouts)

Communication

Use graphics to display your model results to users. How to do that is a subject for another talk.

The Most Accurate Model is Easy to Find!

Credits

Contact Info

Further Resources

  • The Signal and the Noise: Why So Many Predictions Fail — but Some Don't. Nate Silver. (2012). Penguin.
  • The Black Swan: Second Edition: The Impact of the Highly Improbable (2nd ed. 2010). Nassim Taleb. Random House.
  • An Introduction to Statistical Learning (2013). Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Springer.
  • The Elements of Statistical Learning (Second Edition, 2009). Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Springer.

An Aside on Unsupervised Models

  • These are familiar techniques for dimension reduction like cluster analysis, factor analysis, or principal components analysis
  • Can be useful for starting an analysis, looking for structure

Tradeoffs
