A Gentle Introduction
What machine learning (ML) means in substantive terms
How machine learning differs from classical social statistics
Key terms and concepts
An introduction to \(k\)-Nearest Neighbours algorithms
A broad definition
A more concrete definition
An even more concrete definition
Once deployed, SML algorithms learn the complex patterns linking \(X\)—a set of features (or independent variables)—to a target variable (or outcome), \(Y\).
The goal of SML is to optimize predictions—i.e., to find functions or algorithms that offer substantial predictive power when confronted with new or unseen data.
Examples of SML algorithms include logistic regressions, random forests, ridge regressions and neural networks.
A quick note on terminology
If a target variable is quantitative, we are dealing with a regression problem.
If a target variable is qualitative, we are dealing with a classification problem.
Note: This figure is an adaptation of the diagram depicted here.
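As a minimal illustration of the regression/classification distinction (the toy data and estimator choices here are assumptions, not from the text):

```python
# A quantitative target calls for a regression estimator; a qualitative
# target calls for a classification estimator. Synthetic data throughout.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(905)
X = rng.normal(size=(100, 2))

# Quantitative target -> regression problem:
y_quant = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_quant)

# Qualitative target -> classification problem:
y_qual = (X[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X, y_qual)

print(reg.predict(X[:2]))  # continuous predictions
print(clf.predict(X[:2]))  # discrete class labels (0 or 1)
```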
UML techniques search for a representation of the inputs (or features) that is more useful than \(X\) itself (Molina and Garip 2019).
In UML, there is no observed \(Y\) variable—or target—to supervise the estimation process. Instead, we only have a vector of inputs to work with.
The goal in UML is to develop a lower-dimensional representation of complex data by inductively learning from the interrelationships among inputs.
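A quick sketch of this idea: principal component analysis (PCA) learns a lower-dimensional representation of \(X\) with no target \(Y\) anywhere in the estimation process. The synthetic data below are an illustrative assumption.

```python
# UML in miniature: 10 observed features generated from 2 latent dimensions;
# PCA recovers a compact 2-dimensional representation of the inputs.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(905)
latent = rng.normal(size=(200, 2))                       # hidden structure
X = latent @ rng.normal(size=(2, 10)) + rng.normal(scale=0.05, size=(200, 10))

pca = PCA(n_components=2).fit(X)
X_low = pca.transform(X)                                 # 200 x 2 representation

print(X_low.shape)
print(pca.explained_variance_ratio_.sum())               # most variance retained
```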
As Grimmer and colleagues (2021) note, “machine learning is as much a culture defined by a distinct set of values and tools as it is a set of algorithms.”
This point has, of course, been made elsewhere.
Breiman (2001) famously used the imagery of warring cultures to describe two major traditions—(i) the generative modelling culture and (ii) the predictive modelling culture—that have achieved hegemony within the world of statistical modelling.
The terms generative and predictive (as opposed to data and algorithmic) come from Donoho’s (2017) 50 Years of Data Science.
| Quantity of Interest | Primary Goals | Key Strengths | Key Limitations |
|---|---|---|---|
| Generative (i.e., Classical Statistics) | Inferring relationships between X and Y | Interpretability; emphasis on uncertainty around estimates; explanatory power | Bounded by statistical assumptions; inattention to variance across samples |
| Predictive (i.e., Machine Learning) | Generating accurate predictions of Y | Predictive power; potential to simplify high-dimensional data; relatively unconstrained by statistical assumptions | Inattention to explanatory processes; opaque links between X and Y |
Note: To be sure, the putative strengths and weaknesses of these modelling “cultures” have been hotly debated.
Advances in machine learning can provide empirical leverage to social scientists and sharpen social theory in one fell swoop.
Lundberg, Brand and Jeon (2022), for instance, argue that adopting a machine learning framework can help social scientists:
While ML is often associated with induction, van Loon (2022) argues that SML algorithms can help us deductively resolve predictability hypotheses as well.
Image can be retrieved here.
Bias emerges when we build SML algorithms that fail to sufficiently map the patterns—or pick up the empirical signal—linking \(X\) and \(Y\). Think: underfitting.
Variance arises when our algorithms not only pick up the signal linking \(X\) and \(Y\), but some of the noise in our data as well. Think: overfitting.
When adopting an SML framework, researchers try to strike the optimal balance between bias and variance.
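One way to see this trade-off is to vary a single setting on synthetic data. The example below uses a \(k\)-nearest neighbours regressor (introduced later in this deck); the data and choice of \(k\) values are illustrative assumptions.

```python
# Bias-variance in miniature: k = 1 memorizes the training data (high
# variance, overfitting), a very large k flattens predictions toward the
# mean (high bias, underfitting), and an intermediate k balances the two.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(905)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=905)

results = {}
for k in (1, 10, 200):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
    print(k, results[k])  # (training R-squared, test R-squared)
```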
We can use a training set to fit our algorithm—to find weights (or coefficients), recursively split the feature space to grow decision trees and so on.
Training data should constitute the largest of our three disjoint sets.
We can use a validation set to find the right estimator out of a series of candidate algorithms—or to choose the best-fitting parameterization of a single algorithm.
Often, carving out separate training and validation sets can be costly: data sparsity can give rise to bias.
Thus, when limited to smaller samples, analysts often combine training and validation—say, by recycling training data for model tuning and selection.
We can use a testing set to generate a measure of our model’s predictive accuracy (e.g., the F1 score for classification problems)—or to derive our generalization error.
This subsample is used only once (to report the performance metric); put another way, it cannot be used to train, tune or select our algorithm.
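One hedged way to carve out the three disjoint sets described above is to call train_test_split twice; the 60/20/20 proportions below are an illustrative assumption.

```python
# Two successive splits yield disjoint training, validation, and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Hold out 20% as the test set (touched only once, at the very end):
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=905)

# Split the remainder into training (60% overall) and validation (20%):
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=905)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```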
Unlike conventional approaches to sample partition, \(k\) or \(v\)-fold cross-validation allows us to learn from all our data.
\(k\)-fold cross-validation proceeds as follows: (i) randomly partition the data into \(k\) folds of roughly equal size; (ii) train the algorithm on \(k - 1\) folds and validate it on the held-out fold; (iii) repeat until each fold has served once as the validation set; and (iv) average the resulting \(k\) performance scores.
Stratified \(k\)-fold cross-validation ensures that the distribution of class labels (or for numeric targets, the mean) is relatively constant across folds.
Image can be retrieved here.
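A small sketch of stratified splitting: with a 70/30 class balance, each held-out fold preserves roughly that same balance. The data and split counts below are illustrative assumptions.

```python
# StratifiedKFold keeps the share of each class label (here, 30% ones)
# roughly constant across every fold's held-out slice.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 70 + [1] * 30)
X = np.zeros((100, 1))  # feature values are irrelevant to the splitting itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=905)
fold_means = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_means)  # ~0.30 in every fold
```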
In SML settings, we automatically learn the parameters (e.g., coefficients) of our algorithms during estimation.
Hyperparameters, on the other hand, are chosen by the analyst, guide the entire learning process, and can powerfully shape our algorithm’s predictive performance.
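The parameter/hyperparameter distinction in miniature (the data and values below are illustrative assumptions): a logistic regression's coefficients are learned automatically in .fit(), while its regularization strength C is fixed by the analyst beforehand.

```python
# Parameters vs. hyperparameters: coefficients are estimated from the data;
# C is chosen by the analyst before estimation begins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(905)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = LogisticRegression(C=1.0)  # hyperparameter: set before estimation
model.fit(X, y)
print(model.coef_)                 # parameters: learned during estimation
```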
How can analysts settle on the right hyperparameter value(s) for their algorithm? One popular option is an exhaustive grid search over candidate values, e.g., via GridSearchCV from scikit-learn.
\(k\)-nearest neighbours (KNNs) are simple, non-parametric algorithms that predict values of \(Y\) based on the distance between rows (or observations’ inputs).
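This distance-based logic can be sketched from scratch (Euclidean distance and a majority vote are assumed here; the toy data are illustrative):

```python
# A minimal from-scratch KNN classifier: to predict for a new row, find the
# k closest training rows and take a majority vote over their labels.
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance between x_new and every training row:
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]      # indices of the k closest rows
    votes = y_train[nearest]
    return np.bincount(votes).argmax()   # majority vote among neighbours

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # -> 1
```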
The estimation of KNNs proceeds as follows:
# Importing pandas to wrangle the data, modules from scikit-learn to fit KNN classifier,
# and numpy to iterate over potential values of k (for a grid search):
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
import numpy as np
# Load input data frame:
data = pd.read_csv("https://gattonweb.uky.edu/sheather/book/docs/datasets/MichelinNY.csv",
encoding='latin-1')
# Zero in on X and Y variables:
y = data['InMichelin']
X = data.drop(columns = ['InMichelin', 'Restaurant Name'])
# Perform train-test split:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.2, random_state = 905)
# Initializing KNN classifier with k = 5, fitting model:
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
# Stratified k-fold cross-validation:
skfold = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 905)
# Mean cross-validation score across the ten folds:
cross_val_score(knn, X_train, y_train, cv = skfold).mean()
# Measure of out-of-sample predictive performance:
knn.score(X_test, y_test)
# Creating a grid of potential hyperparameter values (odd numbers from 1 to 13):
k_grid = {'n_neighbors': np.arange(start = 1, stop = 15, step = 2) }
# Setting up a grid search to home in on the best value of k:
grid = GridSearchCV(KNeighborsClassifier(), param_grid = k_grid, cv = skfold)
grid.fit(X_train, y_train)
# Extract best score and hyperparameter value:
print("Best Mean Cross-Validation Score: {:.3f}".format(grid.best_score_))
print("Best Parameters (Value of k): {}".format(grid.best_params_))
print("Test Set Score: {:.3f}".format(grid.score(X_test, y_test)))
Using data from {palmerpenguins}, develop a \(k\)-nearest neighbours regressor or classifier to predict an outcome of interest. Try to report your algorithm’s cross-validation score and out-of-sample performance.
If you don’t remember how to work with the {palmerpenguins} package in Python, return to the Jupyter Notebook hyperlinked here.