Applied Modeling Talk

Jared Knowles
October, 2013

Overview

What are statistical models?
What is applied modeling?
What to keep in mind?
Why is this important?

What is a model?

An abstraction from reality
What features does it have?
What purpose does it serve?

747Model

Is this a model?

747Model

What is a statistical model?

“All models are wrong, some models are useful”
Being wrong is a feature of a statistical model, the goal is to explain as much data as possible with as few variables as possible
Statistical models are mathematical summaries of correlations and probabilities
The most common form is linear regression models

Do machines really learn?

Applied modeling goes by many names: statistical learning, machine learning, predictive analytics, and data mining.

The key differences between applied modeling and statistical inference are:

Emphasis on predictive validity
De-emphasis on parameter values
Test and training data
Measures of fit

Goals for This Talk

Introduce the world of statistical models outside of linear models
Demonstrate the specific techniques around building predictive models
Discuss the tradeoffs in applied vs. research models
Show some R code!

Applied Models and Inference

Applied modeling and inferential statistics share many of the same concepts:

Regression estimation
Concerns about representativeness of data and samples
Fear of outliers
Robustness and sensitivity

plot of chunk unnamed-chunk-1

Supervised vs. Unsupervised Learning

A key distinction in statistical learning is that between supervised and unsupervised techniques.

supervised - relationship between inputs and outputs is being explored
unsupervised - the relationship among inputs is being explored, no output

We will focus on supervised learning for the most part in this talk.

plot of chunk clusters

Statistical Modeling

It is useful to remember that in statistical modeling, in the supervised case, we are looking at the following relationship:

\[ \hat{Y} = \hat{f}(X) \]

In this case \( \hat{f} \) represents our estimate of the function that links \( X \) and \( Y \). In traditional linear modeling, \( \hat{f} \) takes the form:

\[ \hat{Y} = \alpha + \beta(X) + \epsilon \]

However, there exist limitless alternative \( \hat{f} \) which we can explore. Applied modeling techniques help us expand the \( \hat{f} \) space we search within.

How do we choose f?

Choosing \( f \) is about tradeoffs, the most obvious is between flexibility and interpretability.

plot of chunk unnamed-chunk-2

Why the Difference?

Applied Models:

Provide information to users about what to expect given certain data
Goals for the model are defined by the needs of the users

Inferential Models:

Seek to learn about the relationships in the data at hand
Focused on understanding patterns in the current data

Some Vocabulary

Training data
Test data
Bias (error)
Variance (error)

Data the model is fit to
Data the model is applied to, but not fit to, to evaluate model fit
Refers to the amount of error due to simplifying a complex process
The amount the \( f \) would change if fit to a different training set of data

The Challenge

When using a statistical model to make predictions we have to think clearly about the data we use to build the model, and the data we will be making predictions about
We may build a model with high internal validity for the data at hand, but that data may not be representative of the data the model will apply to
We call this the training error and the test error
In inferential statistics we often seek to reduce training error and not concern ourselves with test error

A Trivial Example

Consider the following training data:

plot of chunk unnamed-chunk-3

Consider the Test Data

How does our model fit the test data?

plot of chunk unnamed-chunk-4

Consider the Pooled Data

plot of chunk unnamed-chunk-5

What do we learn?

The data was generated from the same function but there was a trend across groups
Predicting from this training model leads to bias in our predictions
The relationship among the groups iteratively shifted, so data earlier in the process understated the effect coming later
These out of sample differences are what make forecasting tricky
Traditional methods rely on sampling to do this
How do we protect ourselves in the case when we can't sample the data we want to predict because it hasn't been generated yet?

When Could this Matter: Stocks?

plot of chunk unnamed-chunk-6

Forecasting Apple Stock Could be Useful

plot of chunk unnamed-chunk-7

Forecasts Are Tricky

plot of chunk unnamed-chunk-8

The Further We Get From The Training Data...

plot of chunk unnamed-chunk-9

Overfit

Training data can lead to model overfit
We need both methods of f and methods of evaluating models that can insulate against overfit
This means different measures of model fit
Extrapolation gets harder
Time changes everything
Non-linear behaviors
Paradigm shifts

Measuring Fit Differently

Classification measures
Mean Squared Error for the test data
Folding, cross validation, and other methods of measuring error

Takeaways

Linear modeling and regression is just one of many choices of modeling data
Model fit within the data may not provide reliable or useful results outside of the data
Different problems call for different statistical modeling techniques
These techniques include different functional forms, different measures of model fit, and different ways of classifying successful models
Models are tools to understand complex relationships

Keep an open mind

K Nearest Neighbors

KNN

Further Resources

An Introduction to Statistical Learning (2013). Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Springer.
Elements of Statistical Learning (Second Edition, 2011). Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Springer