Random Forest
The wisdom of crowds: combining many decision trees into a powerful, robust ensemble for football prediction.
Machine Learning · Ensemble Methods · Bagging · Classification
What is Random Forest?

Random Forest is an ensemble learning method that builds many decision trees and combines their predictions. The "random" comes from two sources of randomness that make each tree different: random samples and random feature subsets.

Think of it like polling many football experts, each with slightly different information and perspectives. While individual experts might be wrong, their collective wisdom tends to be more accurate than any single expert.

[Diagram: Random Forest, many trees voting together. Input match data is bootstrapped into Samples 1–3; Tree 1 predicts Home (P=0.65), Tree 2 Draw (P=0.45), Tree 3 Home (P=0.58); a majority vote yields Home Win with average P = 0.56. Each tree sees different data (bootstrap) and features (random subsets): diversity → uncorrelated errors → better ensemble.]
1. Bootstrap Sampling

Each tree trains on a random sample of the data, drawn with replacement. On average, ~63% of observations are included and ~37% are left out.

2. Random Features

At each split, only a random subset of features is considered. This decorrelates the trees.

3. Aggregate Predictions

Classification: majority vote. Regression: average predictions. Probabilities: average probabilities.

The Two Sources of Randomness
How Random Forest creates diverse, uncorrelated trees
1. Bootstrap Aggregating (Bagging)
[Diagram: bootstrap sampling with replacement. From original data {A, B, C, D, E} (n=5), Sample 1 might be {A, A, C, D, D} (A and D appear twice, B is missing) and Sample 2 might be {B, C, C, E, E} (A and D missing, C and E twice). The ~37% of observations left out of each sample are the out-of-bag (OOB) set: free validation data.]
The Process
For each tree, randomly sample n observations from training data with replacement. Some observations appear multiple times, others not at all.
Why are ~63% selected?
P(an observation is never selected in n draws) = (1 - 1/n)ⁿ → e⁻¹ ≈ 0.368 as n→∞
So P(selected at least once) ≈ 1 - 0.368 = 0.632
Out-of-Bag (OOB) Samples
The ~37% not used to train a tree can validate it — free cross-validation! OOB error ≈ test error.
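The 63/37 split is easy to verify empirically. A minimal NumPy sketch (the sample size n = 10,000 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # number of training observations

# One bootstrap sample: n index draws with replacement
sample = rng.integers(0, n, size=n)

# Fraction of distinct observations that made it into the sample
in_bag = len(np.unique(sample)) / n
print(f"in-bag:     {in_bag:.3f}")      # ≈ 0.632
print(f"out-of-bag: {1 - in_bag:.3f}")  # ≈ 0.368
```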
2. Random Feature Subsets
[Diagram: random feature selection at each split. From all p = 6 features (xG, Form, H2H, Odds, Rest, Elo), each node considers a fresh random subset of √p ≈ 2: e.g., {xG, Rest} at Tree 1, Node 1; {H2H, Elo} at Tree 1, Node 2; {Form, Odds} at Tree 2, Node 1. A different random subset at EVERY split → decorrelated trees.]
The Process
At every split in every tree, randomly select m features from all p features. Only these m features are considered for the best split.
Classification
m = √p
e.g., 10 features → use 3
Regression
m = p/3
e.g., 10 features → use 3
Why It Helps
Prevents strong features from dominating every tree
Key Insight: If xG is the best predictor, a standard bagged forest would use it at the root of every tree. With random features, some trees must find other patterns, making the ensemble more robust.
The Mathematics
Why averaging trees reduces error
Variance Reduction Through Averaging

The key mathematical insight: averaging reduces variance when errors are uncorrelated.

For a Single Tree
Var(T) = σ²
For B Independent Trees (Average)
Var(T̄) = Var((1/B)Σᵢ Tᵢ) = σ²/B
Variance shrinks in proportion to 1/B: the more trees, the lower the ensemble variance!
For B Correlated Trees (correlation ρ)
Var(T̄) = ρσ² + (1-ρ)σ²/B
As B→∞, variance approaches ρσ² (irreducible correlation term)
Why Random Features Matter: By forcing trees to use different features, we reduce ρ (correlation between trees). Lower ρ → lower ensemble variance → better generalization!
Splitting Criteria

Each tree greedily chooses splits to maximize purity (minimize impurity):

Classification: Gini Impurity
Gini(t) = 1 - Σₖ pₖ²
pₖ = proportion of class k at node t
Gini = 0 → pure node (all same class)
Gini = 0.5 → maximum for two classes with equal proportions
Classification: Entropy
H(t) = -Σₖ pₖ log₂(pₖ)
Information gain = H(parent) - weighted H(children)
Choose split that maximizes information gain
Regression: Variance Reduction
Var(t) = (1/n) Σᵢ (yᵢ - ȳ)²
Choose split that minimizes weighted variance of children: n_L×Var(L) + n_R×Var(R)
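These criteria are simple to compute directly from a node's class proportions; a small sketch:

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum of squared class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits; 0*log2(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]))     # pure node → 0.0
print(gini([0.5, 0.5]))     # balanced binary node → 0.5
print(entropy([0.5, 0.5]))  # balanced binary node → 1.0 bit
```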
Aggregation Methods
Classification (Hard)
ŷ = mode(T₁(x), ..., T_B(x))
Majority vote across trees
Classification (Soft)
P(y=k) = (1/B) Σᵢ Pᵢ(y=k)
Average class probabilities
Regression
ŷ = (1/B) Σᵢ Tᵢ(x)
Simple average of predictions
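To make the three rules concrete, here is a hand-worked sketch with hypothetical probabilities from three trees for one match (class order: Away, Draw, Home):

```python
import numpy as np

# Per-tree class probabilities (hypothetical), columns = [Away, Draw, Home]
tree_probs = np.array([
    [0.20, 0.15, 0.65],  # tree 1 votes Home
    [0.30, 0.45, 0.25],  # tree 2 votes Draw
    [0.22, 0.20, 0.58],  # tree 3 votes Home
])

# Hard vote: mode of each tree's predicted class
votes = tree_probs.argmax(axis=1)
hard = np.bincount(votes, minlength=3).argmax()

# Soft vote: average the probabilities (what sklearn's forest does)
soft_probs = tree_probs.mean(axis=0)

print(hard)                 # 2 → Home (2 of 3 trees agree)
print(soft_probs.round(3))  # [0.24  0.267 0.493]
```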
Key Hyperparameters
The knobs that matter most for tuning
n_estimators

Number of trees in the forest. More trees = more stable predictions, but diminishing returns and slower training.

Typical range: 100 - 1000
✓ Cannot overfit by adding more trees
max_features

Number of features to consider at each split. Lower = more decorrelated trees but weaker individual trees.

Typical: "sqrt" (classification), "log2", or 0.33 (regression)
max_depth

Maximum depth of each tree. None = grow until leaves are pure (or hit min_samples_leaf); setting a value limits tree complexity.

Typical range: None, or 10-30
min_samples_split

Minimum samples required to split a node. Higher = more regularization.

Typical range: 2 - 20
min_samples_leaf

Minimum samples required in a leaf node. Higher = smoother predictions, prevents tiny leaves.

Typical range: 1 - 20
bootstrap

Whether to use bootstrap samples. False = each tree sees all data (less variance reduction).

Default: True (recommended)
Tuning Strategy

Start with defaults (100-500 trees, sqrt features). Increase n_estimators until OOB error stabilizes. Then tune max_features, max_depth, and min_samples_leaf. Use OOB score or cross-validation.
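One way to implement "increase n_estimators until OOB error stabilizes" is sklearn's warm_start flag, which adds trees to an existing forest instead of refitting from scratch. A sketch on synthetic stand-in data (swap in your own match features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for match features
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)

# warm_start=True keeps already-fitted trees, so each fit() call
# only trains the newly added estimators
rf = RandomForestClassifier(warm_start=True, oob_score=True,
                            max_features='sqrt', random_state=0)

for n in [50, 100, 200, 400]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(f"{n:4d} trees  OOB accuracy = {rf.oob_score_:.4f}")
```

Stop adding trees once the OOB score flattens; beyond that point you pay only in training time and memory.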

Feature Importance
Understanding what drives predictions
Mean Decrease in Impurity (MDI)

Default in sklearn. For each feature, sum the weighted impurity decrease across all splits using that feature, across all trees.

Importance(j) = Σ_trees Σ_{nodes splitting on j} (n_t/n) × ΔImpurity
Caveat: Biased toward high-cardinality features (many unique values)
Permutation Importance

Shuffle a feature's values and measure how much accuracy decreases. More reliable than MDI.

Importance(j) = Score_original - Score_shuffled
Benefit: Model-agnostic, far less biased than MDI, and can be computed on held-out data
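In sklearn this is sklearn.inspection.permutation_importance; a sketch on synthetic stand-in data (replace with your match features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for match features
X, y = make_classification(n_samples=1500, n_features=8,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=False, test_size=0.3)  # preserve temporal order

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record
# the mean drop in accuracy
result = permutation_importance(rf, X_test, y_test,
                                n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1][:3]:
    print(f"feature {i}: {result.importances_mean[i]:.4f}")
```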
Example: Football Match Prediction
home_xg_avg_5
0.182
away_xga_avg_5
0.142
odds_implied_home
0.118
home_form_5
0.095
elo_diff
0.082
Random Forest vs Gradient Boosting
Understanding when to use each approach
Aspect | Random Forest | Gradient Boosting
Tree Building | Parallel (independent) | Sequential (each corrects the last)
Tree Depth | Deep trees (low bias) | Shallow trees (high bias)
Error Reduction | Averaging reduces variance | Boosting reduces bias
Overfitting Risk | Low (more trees = better) | Higher (needs early stopping)
Training Speed | Fast (parallelizable) | Slower (sequential)
Tuning Effort | Low (robust defaults) | Higher (many hyperparameters)
Peak Accuracy | Good | Often higher (when tuned)
When to Use Random Forest
Quick baseline model with minimal tuning
When interpretability (feature importance) matters
Parallel training needed (speed on multi-core)
When you want robust, safe predictions
Strengths & Weaknesses
Strengths
Excellent out-of-the-box performance
Handles mixed feature types (numeric, categorical)
Robust to outliers and noisy features
No feature scaling required
Built-in feature importance
Free validation via OOB error
Parallelizable (fast training)
Handles missing values (some implementations)
Weaknesses
Can't extrapolate beyond training data range
Large memory footprint (stores all trees)
Slower prediction than single models
Less interpretable than single decision tree
May underperform GBM on well-tuned tasks
Biased toward features with many levels
Regression predictions bounded by the range of training labels
Application to Football
Practical implementation for match prediction
Use Cases
1X2 Prediction: Multi-class classification
Over/Under: Binary classification
Goal Difference: Regression
Player Performance: Regression/Classification
Feature Ideas
Rolling xG/xGA averages (last 5, 10 games)
Form (points per game)
Head-to-head statistics
Market odds (as features)
Elo/Rating differences
Example: Random Forest for Match Outcome
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import log_loss, accuracy_score
import numpy as np

# df: match-level DataFrame sorted by date, one row per match
# Features (calculated BEFORE each match)
features = [
    'home_xg_avg_5', 'away_xg_avg_5',
    'home_xga_avg_5', 'away_xga_avg_5',
    'home_form_5', 'away_form_5',
    'elo_diff', 'h2h_home_wins', 'h2h_draws'
]

X = df[features]
y = df['result']  # 0=Away, 1=Draw, 2=Home

# Time-based split (don't leak future data!)
tscv = TimeSeriesSplit(n_splits=5)

# Random Forest with good defaults
rf = RandomForestClassifier(
    n_estimators=500,        # Number of trees
    max_features='sqrt',     # √p features per split
    max_depth=None,          # Grow trees fully
    min_samples_leaf=5,      # Prevent tiny leaves
    oob_score=True,          # Get free validation
    n_jobs=-1,               # Use all cores
    random_state=42
)

# Train and evaluate
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    rf.fit(X_train, y_train)
    
    # Predict probabilities
    probs = rf.predict_proba(X_test)
    
    print(f"Fold {fold+1}:")
    print(f"  Log Loss: {log_loss(y_test, probs):.4f}")
    print(f"  Accuracy: {accuracy_score(y_test, rf.predict(X_test)):.4f}")
    print(f"  OOB Score: {rf.oob_score_:.4f}")

# Feature importance
print("\nFeature Importance:")
for name, imp in sorted(zip(features, rf.feature_importances_), 
                        key=lambda x: x[1], reverse=True):
    print(f"  {name}: {imp:.4f}")
Probability Calibration Warning

Random Forest probabilities are often poorly calibrated: averaging pulls them toward the middle, away from 0 and 1. For betting applications, consider using CalibratedClassifierCV with isotonic or sigmoid calibration to get reliable probability estimates.
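A sketch of that fix on synthetic stand-in data (for real, time-ordered match data you would pass a time-aware splitter such as TimeSeriesSplit as cv instead of an integer):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Uncalibrated forest
raw = RandomForestClassifier(n_estimators=100, random_state=0)
raw.fit(X_train, y_train)

# Wrap an unfitted forest; internal CV fits the isotonic calibrator
cal = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method='isotonic', cv=5)
cal.fit(X_train, y_train)

print(f"raw        log loss: {log_loss(y_test, raw.predict_proba(X_test)):.4f}")
print(f"calibrated log loss: {log_loss(y_test, cal.predict_proba(X_test)):.4f}")
```

Lower log loss on held-out data indicates the calibrated probabilities are closer to the true outcome frequencies, which matters directly when comparing them against bookmaker odds.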

Tips for Football Prediction
Use TimeSeriesSplit, never random split
Start with 500 trees, increase if OOB improves
Use OOB score for quick hyperparameter tuning
Calibrate probabilities for betting applications
Check permutation importance, not just MDI
Compare with XGBoost/LightGBM on your data
Key Takeaways
1
Ensemble of independent trees

Bootstrap sampling + random features → decorrelated trees

2
Variance reduction through averaging

More trees = lower variance, can't overfit by adding trees

3
Excellent baseline with minimal tuning

Works well out of the box, robust to noise and outliers

4
Built-in validation and interpretability

OOB error for free CV, feature importance for understanding