Regression is one of the most fundamental concepts in statistics and machine learning. At its core, regression answers a simple question: given some input variables, what output should we expect?
Think of it like this: if you know a team's expected goals (xG), can you predict how many actual goals they'll score? If you know the difference in team ratings, can you predict the probability of a home win? Regression gives us the mathematical tools to answer these questions.
The inputs are the information we have: xG, possession %, form, player ratings, odds, etc. Also called "independent variables," "features," or "predictors."
The output is what we want to predict: goals scored, win probability, goal difference, etc. Also called the "dependent variable," "target," or "response."
Regression finds the relationship between inputs and outputs using historical data. Once we know this relationship, we can use it to make predictions on new, unseen data.
Linear regression assumes a straight-line relationship between inputs and outputs. Despite its simplicity, it's incredibly powerful and forms the foundation for understanding more complex models.
Linear regression finds its coefficients by minimizing the sum of squared residuals (ordinary least squares).
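For a single predictor, the least-squares derivation is the standard one, sketched here for completeness: minimize the squared residuals, set both partial derivatives to zero, and solve.

```latex
\min_{\beta_0,\beta_1} \; S = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
\quad\Longrightarrow\quad
\frac{\partial S}{\partial \beta_0} = 0,\;\;
\frac{\partial S}{\partial \beta_1} = 0
\quad\Longrightarrow\quad
\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```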
Predicting Goals from xG:

Goals = 0.15 + 0.92 × xG

Interpretation: for every 1.0 increase in xG, we expect 0.92 more goals. The intercept (0.15) is the baseline expectation when xG = 0.
Each coefficient tells you the expected change in the output for a one-unit increase in that input, holding the other inputs constant.
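As a minimal, numpy-only sketch with made-up xG and goal numbers:

```python
import numpy as np

# Synthetic example: five matches' xG and goals (made-up numbers)
xg = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
goals = np.array([1, 1, 2, 2, 3])

# Least-squares fit of goals = intercept + slope * xG
slope, intercept = np.polyfit(xg, goals, 1)
print(f"goals ≈ {intercept:.2f} + {slope:.2f} * xG")
# slope: expected extra goals per 1.0 of xG; intercept: baseline at xG = 0
```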
Before choosing a regression model, you need to understand what type of data you're predicting. The distribution of your target variable determines which regression technique is appropriate.
Continuous data: values that can be positive or negative, clustered around a mean (roughly normally distributed).
- Goal difference (-5 to +5)
- xG difference
- Player rating changes
Count data — discrete, non-negative integers representing "how many."
- Goals scored (0, 1, 2, 3...)
- Shots on target
- Corners, fouls
Binary outcomes — yes/no, win/lose, happened/didn't happen.
- Win or not win
- Both teams score (BTTS)
- Over/Under 2.5 goals
Using the wrong distribution leads to poor predictions and invalid confidence intervals. Linear regression assumes errors are normally distributed — if you use it for count data (goals), you might predict negative goals or fractional values that don't make sense.
Logistic regression is the go-to model for binary classification. Despite the name, it's used for classification, not regression of continuous values. It outputs a probability between 0 and 1.
Unlike OLS, logistic regression uses maximum likelihood estimation (MLE): we find the parameters that maximize the probability of observing our data:

L(β) = ∏ᵢ pᵢ^(yᵢ) · (1 − pᵢ)^(1−yᵢ)

Taking the log (for numerical stability):

ℓ(β) = Σᵢ [yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ)]

This is the negative of the cross-entropy loss. We maximize it using gradient descent or Newton-Raphson.
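The relationship between log-likelihood and cross-entropy is easy to verify numerically (the labels and probabilities below are made up):

```python
import numpy as np

# Hypothetical labels and model probabilities (made-up numbers)
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.8, 0.3, 0.6, 0.9, 0.2])

# Log-likelihood: sum of y*log(p) + (1-y)*log(1-p)
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Cross-entropy loss is the negated, averaged version
cross_entropy = -log_lik / len(y)
print(f"log-likelihood: {log_lik:.4f}, cross-entropy: {cross_entropy:.4f}")
```

Maximizing the log-likelihood and minimizing the cross-entropy pick out exactly the same parameters.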
Predicting Home Win:

P(Home Win) = σ(β₀ + β₁·xG_diff + β₂·form_diff), where σ(z) = 1 / (1 + e⁻ᶻ)

If xG_diff = 0.5 and form_diff = 1.0, the model might output P(Home Win) = 0.62.
Coefficients are in log-odds, not probabilities: a coefficient b means a one-unit increase in that feature multiplies the odds of a home win by e^b.
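A tiny numeric illustration, with hypothetical coefficients: exponentiating a coefficient gives the odds ratio, and the sigmoid converts the full log-odds score into a probability.

```python
import numpy as np

# Hypothetical coefficients, for illustration only
intercept, b_xg_diff = -0.1, 0.9

# A one-unit increase in xg_diff multiplies the odds by e^b
odds_ratio = np.exp(b_xg_diff)

# Convert log-odds to probability with the sigmoid
z = intercept + b_xg_diff * 0.5  # a match with xg_diff = 0.5
p = 1 / (1 + np.exp(-z))
print(f"odds ratio: {odds_ratio:.2f}, P(home win): {p:.2f}")
```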
For Home/Draw/Away prediction, we extend to K classes with multinomial (softmax) logistic regression: P(class k) = exp(zₖ) / Σⱼ exp(zⱼ), where zₖ is the linear score for class k.
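The softmax step itself is a one-liner; here is a numpy sketch with made-up logits for Home/Draw/Away:

```python
import numpy as np

# Hypothetical linear scores (logits) for Home, Draw, Away
z = np.array([0.9, 0.1, -0.4])

# Softmax: exponentiate, then normalize so probabilities sum to 1
probs = np.exp(z) / np.exp(z).sum()
print(probs)  # one probability per class, summing to 1
```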
Poisson regression is designed for count data — non-negative integers like goals scored. It's the foundation of many football prediction models, especially for predicting scorelines.
The classic football prediction model (Dixon-Coles) extends basic Poisson:

λ_home = α_home · β_away · γ,  λ_away = α_away · β_home

where α = attack strength, β = defense strength, and γ = home advantage. It also includes a ρ parameter to adjust for correlation between the teams' goals in low-scoring matches.
Independent Poisson Model:
Each team's goals are modeled separately using their attack/defense ratings. This is the basis of many betting models.
Once you have λ_home and λ_away:

P(h, a) = Pois(h; λ_home) · Pois(a; λ_away)

Calculate P for each scoreline (0-0, 1-0, 0-1, ...), then sum the probabilities with h > a, h = a, and h < a to get home-win, draw, and away-win probabilities.
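That summation can be sketched in plain Python (the λ values below are hypothetical):

```python
from math import exp, factorial

def pois_pmf(k, lam):
    # Poisson probability of exactly k goals given expected goals lam
    return exp(-lam) * lam**k / factorial(k)

lam_home, lam_away = 1.6, 1.1  # hypothetical expected goals

home_win = draw = away_win = 0.0
for h in range(10):          # truncate at 9 goals; the tail is negligible
    for a in range(10):
        p = pois_pmf(h, lam_home) * pois_pmf(a, lam_away)
        if h > a:
            home_win += p
        elif h == a:
            draw += p
        else:
            away_win += p

print(f"Home {home_win:.1%}  Draw {draw:.1%}  Away {away_win:.1%}")
```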
Basic Poisson regression assumes home and away goals are independent. In reality, they're often correlated (high-scoring games, defensive games). Advanced models use bivariate Poisson or copulas to model this correlation.
When you have many features (especially correlated ones), linear regression can overfit — the model becomes too tailored to training data and performs poorly on new data. Regularization adds a penalty for complex models.
- Ridge (L2): adds a penalty proportional to the square of the coefficient values (loss + α·Σβ²). Shrinks coefficients toward zero but rarely to exactly zero.
- Lasso (L1): adds a penalty proportional to the absolute value of the coefficients (loss + α·Σ|β|). Can shrink coefficients to exactly zero, performing feature selection.
- Elastic Net: combines the Ridge and Lasso penalties, giving the best of both worlds when features are correlated.
Regularization increases bias but decreases variance. The optimal α balances these — found via cross-validation.
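The shrinkage effect is easy to see with the closed-form Ridge solution, β̂ = (XᵀX + αI)⁻¹Xᵀy, on synthetic data: as α grows, the coefficient norm falls. A numpy-only sketch:

```python
import numpy as np

# Synthetic regression problem with known true coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge(X, y, alpha):
    # Closed-form ridge estimate: solve (X'X + alpha*I) beta = X'y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

for alpha in [0.0, 1.0, 100.0]:
    beta = ridge(X, y, alpha)
    print(alpha, np.round(beta, 3), round(float(np.linalg.norm(beta)), 3))
```

With α = 0 this is plain OLS; larger α values pull every coefficient toward zero, trading a little bias for lower variance.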
- Negative binomial regression: like Poisson but handles overdispersion (variance > mean). More flexible for count data.
- Ordinal regression: for ordered categories (e.g., lose/draw/win, ratings 1-5). Uses cumulative link functions.
- Quantile regression: predicts specific percentiles (median, 90th percentile) rather than the mean.
- Zero-inflated Poisson: for count data with excess zeros. Two-component mixture model.
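A quick way to check whether plain Poisson is adequate, or whether the negative binomial's overdispersion handling is needed, is to compare the sample mean and variance of your counts (the numbers below are made up):

```python
import numpy as np

# Goals scored in eight hypothetical matches (made-up counts)
goals = np.array([0, 0, 0, 1, 1, 2, 5, 6])

mean, var = goals.mean(), goals.var()
print(f"mean = {mean:.2f}, variance = {var:.2f}")
# variance well above the mean suggests overdispersion:
# the Poisson assumption (variance = mean) is violated
```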
For betting applications, also track ROI (return on investment), Closing Line Value (did you beat the closing odds?), and calibration plots (do 70% predictions actually happen 70% of the time?).
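A minimal calibration check can be sketched with numpy alone: bin the predictions and compare the average predicted probability in each bin against the observed frequency (all data below is made up):

```python
import numpy as np

# Hypothetical predicted probabilities and actual outcomes (made-up data)
probs = np.array([0.72, 0.68, 0.71, 0.69, 0.70, 0.73,
                  0.31, 0.28, 0.30, 0.33])
outcomes = np.array([1, 1, 0, 1, 1, 0,
                     0, 1, 0, 0])

# Compare predicted vs. observed frequency in two probability bins
for lo, hi in [(0.25, 0.35), (0.65, 0.75)]:
    mask = (probs >= lo) & (probs < hi)
    print(f"predicted ~{probs[mask].mean():.2f}, "
          f"observed {outcomes[mask].mean():.2f}")
```

A well-calibrated model shows the two columns roughly agreeing in every bin; real calibration plots use many bins and far more matches.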
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import log_loss, roc_auc_score
import numpy as np

# Features (calculated BEFORE each match)
features = ['home_xg_avg_5', 'away_xg_avg_5',
            'home_form_5', 'away_form_5',
            'xg_diff', 'elo_diff']
X = df[features]
y = df['home_win']  # 1 = home win, 0 = not home win

# Time-based split
tscv = TimeSeriesSplit(n_splits=5)

# Logistic regression with regularization
model = LogisticRegression(
    C=1.0,          # inverse of regularization strength
    penalty='l2',   # Ridge regularization
    max_iter=1000
)

# Train and evaluate on each time-ordered fold
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]
    print(f"Log Loss: {log_loss(y_test, probs):.4f}")
    print(f"AUC-ROC: {roc_auc_score(y_test, probs):.4f}")

# Interpret coefficients (from the final fold's fit)
for name, coef in zip(features, model.coef_[0]):
    print(f"{name}: {coef:.3f} (odds ratio: {np.exp(coef):.2f})")
```

```python
import statsmodels.api as sm
from scipy.stats import poisson
import numpy as np

# Prepare data: each row is one team in one match
# Features: attack strength, opponent defense strength
X = df[['attack_rating', 'opp_defense_rating', 'is_home']]
X = sm.add_constant(X)
y = df['goals_scored']

# Fit Poisson regression (GLM with log link)
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())

# Predict expected goals for a match
home_features = [1, 1.2, 0.9, 1]  # const, attack, opp_def, is_home
away_features = [1, 1.0, 1.1, 0]
lambda_home = np.exp(np.dot(home_features, results.params))
lambda_away = np.exp(np.dot(away_features, results.params))
print(f"Expected goals - Home: {lambda_home:.2f}, Away: {lambda_away:.2f}")

# Calculate scoreline probabilities
max_goals = 6
for h in range(max_goals):
    for a in range(max_goals):
        prob = poisson.pmf(h, lambda_home) * poisson.pmf(a, lambda_away)
        if prob > 0.01:  # only show likely scorelines
            print(f"{h}-{a}: {prob:.1%}")
```

Continuous → Linear, Binary → Logistic, Counts → Poisson
Linear: direct effect, Logistic: log-odds, Poisson: log-rate
Ridge, Lasso, or Elastic Net prevent overfitting
Poisson for goals, logistic for outcomes — simple but effective