Random Forest is an ensemble learning method that builds many decision trees and combines their predictions. The "random" comes from two sources of randomness that make each tree different: random samples and random feature subsets.
Think of it like polling many football experts, each with slightly different information and perspectives. While individual experts might be wrong, their collective wisdom tends to be more accurate than any single expert.
Each tree trains on a bootstrap sample of the data: rows drawn with replacement until the sample matches the original size. On average ~63% of rows appear in a given tree's sample; the remaining ~37% are "out-of-bag" (OOB) for that tree.
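The ~63% figure comes from sampling n rows with replacement: each row is missed with probability (1 - 1/n)^n ≈ 1/e. A quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
# One bootstrap sample: n row indices drawn with replacement
sample = rng.integers(0, n, size=n)
# Fraction of distinct rows that made it into the sample
in_bag = np.unique(sample).size / n
# Theoretical limit: 1 - 1/e ≈ 0.632
print(f"in-bag fraction: {in_bag:.3f}")
```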
At each split, only a random subset of features is considered. This decorrelates the trees.
Classification: majority vote. Regression: average predictions. Probabilities: average probabilities.
The key mathematical insight: averaging reduces variance when errors are uncorrelated.
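This can be seen numerically. In the sketch below (synthetic numbers, not real tree errors), each "tree" has error variance σ² = 1 split into a shared component with pairwise correlation ρ and an independent component; the variance of the average collapses toward ρσ² + (1 − ρ)σ²/T, which is why decorrelating trees matters:

```python
import numpy as np

rng = np.random.default_rng(0)
T, trials, rho = 100, 20_000, 0.3
# Each column is one tree's error: shared part (correlation rho)
# plus an independent part, total variance 1.0
shared = rng.normal(0, np.sqrt(rho), size=(trials, 1))
indep = rng.normal(0, np.sqrt(1 - rho), size=(trials, T))
errors = shared + indep
avg_error = errors.mean(axis=1)  # the ensemble's error
print(f"single-tree variance: {errors.var():.3f}")     # ≈ 1.0
print(f"ensemble variance:    {avg_error.var():.3f}")  # ≈ rho + (1-rho)/T
```

With ρ = 0 the ensemble variance would shrink all the way to 1/T; with ρ = 1 averaging would gain nothing.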
Each tree greedily chooses splits to maximize purity: at every node it picks the split with the largest weighted impurity decrease (Gini impurity by default in sklearn).
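As a minimal sketch, Gini impurity and the weighted impurity decrease of a candidate split can be computed by hand (toy labels using the same 0=Away, 1=Draw, 2=Home encoding as later in this article):

```python
import numpy as np

def gini(y):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(y, mask):
    """Weighted impurity decrease for a boolean split mask."""
    n = len(y)
    left, right = y[mask], y[~mask]
    return gini(y) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

y = np.array([2, 2, 2, 0, 0, 1, 2, 0])
mask = np.array([True, True, True, False, False, False, True, False])
print(f"parent Gini:     {gini(y):.3f}")       # 0.594
print(f"gain from split: {split_gain(y, mask):.3f}")  # 0.406
```

The split isolating all the home wins gets a large gain; the tree keeps choosing such splits until a stopping rule fires.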
- `n_estimators` — Number of trees in the forest. More trees = more stable predictions, but diminishing returns and slower training.
- `max_features` — Number of features to consider at each split. Lower = more decorrelated trees but weaker individual trees.
- `max_depth` — Maximum depth of each tree. `None` = grow until pure or until `min_samples_leaf` is reached. Limits complexity.
- `min_samples_split` — Minimum samples required to split a node. Higher = more regularization.
- `min_samples_leaf` — Minimum samples required in a leaf node. Higher = smoother predictions, prevents tiny leaves.
- `bootstrap` — Whether to use bootstrap samples. `False` = each tree sees all the data (less variance reduction).
Start with defaults (100-500 trees, sqrt features). Increase n_estimators until OOB error stabilizes. Then tune max_features, max_depth, and min_samples_leaf. Use OOB score or cross-validation.
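One way to watch the OOB error stabilize, sketched here on synthetic data (real match features would slot in the same way), is `warm_start=True`, which keeps already-grown trees and only adds new ones at each step:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# warm_start reuses the trees grown so far; each fit only adds trees
rf = RandomForestClassifier(warm_start=True, oob_score=True,
                            max_features='sqrt', random_state=0, n_jobs=-1)
for n in [50, 100, 200, 400]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(f"{n:4d} trees  OOB accuracy: {rf.oob_score_:.4f}")
```

Once the OOB score stops moving between steps, adding trees only costs time, and tuning effort is better spent on `max_features` and the leaf-size parameters.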
**MDI (Mean Decrease in Impurity):** the default `feature_importances_` in sklearn. For each feature, sum the weighted impurity decrease across all splits using that feature, across all trees. Biased toward high-cardinality features.
**Permutation importance:** shuffle a feature's values and measure how much the score decreases. Slower, but more reliable than MDI.
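Permutation importance is available out of the box via `sklearn.inspection.permutation_importance`; a sketch on synthetic data (a fitted model plus a held-out set is all it needs):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0,
                            n_jobs=-1).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data, remeasure the score
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:+.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Measuring on held-out data matters: permuting on the training set rewards features the forest has memorized.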
| Aspect | Random Forest | Gradient Boosting |
|---|---|---|
| Tree Building | Parallel (independent) | Sequential (correcting errors) |
| Tree Depth | Deep trees (low bias) | Shallow trees (high bias) |
| Variance Reduction | Averaging reduces variance | Boosting reduces bias |
| Overfitting Risk | Low (more trees = better) | Higher (needs early stopping) |
| Training Speed | Fast (parallelizable) | Slower (sequential) |
| Tuning Effort | Low (robust defaults) | Higher (many hyperparams) |
| Peak Accuracy | Good | Often higher (when tuned) |
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import log_loss, accuracy_score

# Features (calculated BEFORE each match)
features = [
    'home_xg_avg_5', 'away_xg_avg_5',
    'home_xga_avg_5', 'away_xga_avg_5',
    'home_form_5', 'away_form_5',
    'elo_diff', 'h2h_home_wins', 'h2h_draws',
]
X = df[features]
y = df['result']  # 0=Away, 1=Draw, 2=Home

# Time-based split (don't leak future data!)
tscv = TimeSeriesSplit(n_splits=5)

# Random Forest with good defaults
rf = RandomForestClassifier(
    n_estimators=500,     # Number of trees
    max_features='sqrt',  # √p features per split
    max_depth=None,       # Grow trees fully
    min_samples_leaf=5,   # Prevent tiny leaves
    oob_score=True,       # Free validation from OOB samples
    n_jobs=-1,            # Use all cores
    random_state=42,
)

# Train and evaluate on each chronological fold
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    rf.fit(X_train, y_train)

    # Predict probabilities
    probs = rf.predict_proba(X_test)
    print(f"Fold {fold+1}:")
    print(f"  Log Loss: {log_loss(y_test, probs):.4f}")
    print(f"  Accuracy: {accuracy_score(y_test, rf.predict(X_test)):.4f}")
    print(f"  OOB Score: {rf.oob_score_:.4f}")

# Feature importance (MDI, from the last fold's fit)
print("\nFeature Importance:")
for name, imp in sorted(zip(features, rf.feature_importances_),
                        key=lambda x: x[1], reverse=True):
    print(f"  {name}: {imp:.4f}")
```

Random Forest probabilities are often poorly calibrated: they tend to be pushed toward 0 or 1. For betting applications, consider using `CalibratedClassifierCV` with isotonic or sigmoid calibration to get reliable probability estimates.
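A minimal calibration sketch, on synthetic three-class data standing in for the match features above:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, min_samples_leaf=5,
                            random_state=0, n_jobs=-1)
# Isotonic calibration fits a monotone mapping on internal CV folds
cal = CalibratedClassifierCV(rf, method='isotonic', cv=3).fit(X_tr, y_tr)
rf.fit(X_tr, y_tr)

print(f"raw RF log loss:     {log_loss(y_te, rf.predict_proba(X_te)):.4f}")
print(f"calibrated log loss: {log_loss(y_te, cal.predict_proba(X_te)):.4f}")
```

Isotonic calibration needs a reasonable amount of data per fold; with small datasets, `method='sigmoid'` is the safer choice.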
- Bootstrap sampling + random feature subsets → decorrelated trees
- More trees = lower variance; adding trees doesn't cause overfitting, returns just flatten out
- Works well out of the box, robust to noise and outliers
- OOB error gives free validation; feature importance aids interpretation