A neural network is a computer system inspired by how the human brain works. Just as your brain uses billions of interconnected neurons to process information, recognize faces, and make decisions, artificial neural networks use mathematical functions connected together to learn patterns from data.
Imagine teaching a child to recognize a dog. You show them hundreds of pictures of dogs, pointing out features like four legs, fur, a tail, and floppy ears. Eventually, they learn to recognize dogs they have never seen before. Neural networks learn the same way — by seeing many examples and gradually figuring out the patterns that define each category.
More formally, a neural network is a parametric function — a mathematical formula with adjustable numbers (called parameters) that can be tuned to produce desired outputs. We write this as:

y = f(x; θ)

where x is the input, θ (theta) stands for all the adjustable parameters, and y is the output.
In football analytics, neural networks can learn to predict outcomes like:
- Expected Goals (xG): Given shot distance, angle, and defender pressure, predict the probability of scoring
- Trajectory Prediction: Given a player's past positions, predict where they'll be in 2 seconds
- Match Outcome: Given team statistics, predict home win, draw, or away win probabilities
Every neural network is built from simple units called neurons (also called nodes or units). A single neuron does something remarkably simple — it takes numbers in, does some basic math, and produces a number out.
1. Takes in one or more numbers (like shot distance, angle, defender pressure)
2. Multiplies each input by a weight (importance), then adds them all together
3. Passes the sum through a function to produce the final output
The Math Behind a Neuron
Here's what a single neuron computes, shown in both plain English and mathematical notation. It multiplies each input by its weight, adds the bias, and applies an activation function:

z = (w₁ × x₁) + (w₂ × x₂) + ... + (wₙ × xₙ) + b
y = f(z)

Symbol Definitions
- x₁ … xₙ: the input values (e.g. shot distance, angle, defender pressure)
- w₁ … wₙ: the weights, one per input
- b: the bias
- z: the weighted sum
- f: the activation function (e.g. the sigmoid used below)
- y: the neuron's output
Worked Example: Predicting if a Shot is On Target

Take a shot with distance = 15 m, angle = 25°, and defender pressure = 0.3, and suppose the neuron has weights w₁ = -0.05, w₂ = 0.02, w₃ = -0.80 and bias b = 1.20:

z = (-0.05 × 15) + (0.02 × 25) + (-0.80 × 0.3) + 1.20
z = -0.75 + 0.50 - 0.24 + 1.20
z = 0.71
Passing z through the sigmoid activation gives the final output:

y = 1 / (1 + e^(-0.71))
y = 1 / (1 + 0.49)
y = 1 / 1.49
y ≈ 0.67

So the neuron estimates a 67% chance the shot is on target.
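The worked example can be reproduced in a few lines of Python. The `neuron` helper below is a hypothetical one, and the weights, bias, and feature values are the illustrative numbers from the example, not fitted values:

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))

# Shot features: distance (m), angle (degrees), defender pressure
inputs = [15, 25, 0.3]
weights = [-0.05, 0.02, -0.80]
bias = 1.20

y = neuron(inputs, weights, bias)
print(round(y, 2))  # prints 0.67, matching the worked example
```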
The magic of neural networks lies in their ability to learn the right weights and biases from data. Initially, these start as random numbers. Through training, the network adjusts them to make better predictions.
Weights determine how important each input is:
- Large positive: Input strongly increases output
- Large negative: Input strongly decreases output
- Near zero: Input barely matters
The bias is like a baseline or starting point:
- Shifts the output up or down
- Works regardless of input values
- The neuron's default tendency
Imagine you're a scout evaluating strikers. You might weight finishing ability heavily (w=0.9), pace moderately (w=0.5), and heading lightly (w=0.2). Over time, you adjust these weights based on which strikers actually perform well. That's exactly what neural networks do!
If neurons only computed weighted sums, stacking layers would be pointless — the whole network would just be one linear equation. Activation functions add non-linearity, allowing networks to learn complex, curved patterns.
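Here is a sketch of three common activation functions in Python; each one bends a straight line into a curve, which is what lets stacked layers model non-linear patterns:

```python
import math

def sigmoid(z):
    """Squashes any number into (0, 1) -- useful for probabilities."""
    return 1 / (1 + math.exp(-z))

def relu(z):
    """Passes positives through, zeroes out negatives -- a common default in deep nets."""
    return max(0.0, z)

def tanh(z):
    """Squashes any number into (-1, 1), centered at zero."""
    return math.tanh(z)

# Compare the three on a few sample inputs
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), relu(z), round(tanh(z), 3))
```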
A single neuron can only learn simple patterns. The real power comes from connecting many neurons together in layers. Each layer transforms its inputs, passing results to the next layer.
Input Layer: Receives raw data. Each neuron = one feature (distance, angle, etc.). Just passes data forward.
Hidden Layers: Where the magic happens! Early layers detect simple patterns, later layers combine them into complex concepts. More layers = deeper network = can learn more complex patterns.
Output Layer: Produces the final prediction. For xG: 1 neuron with sigmoid. For match outcome: 3 neurons with softmax.
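To make the layer idea concrete, here is a minimal forward pass through one hidden layer and one output neuron in pure Python. The layer sizes and all weight values are arbitrary illustrative numbers, not a trained model:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases):
    """One dense layer: each output neuron takes a weighted sum of all inputs."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Input layer: 3 shot features (distance, angle, pressure)
x = [15.0, 25.0, 0.3]

# Hidden layer: 2 neurons, each with 3 weights and a bias (made-up values)
hidden = layer(x,
               weights=[[-0.05, 0.02, -0.8], [0.01, -0.03, 0.5]],
               biases=[1.2, -0.1])

# Output layer: 1 neuron reading the 2 hidden activations
xg = layer(hidden, weights=[[0.9, -0.6]], biases=[-0.2])[0]
print(round(xg, 3))  # a probability between 0 and 1
```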
Training is like teaching a student through practice tests. You show examples, check answers, explain mistakes, and repeat. This happens through a four-step cycle:
1. Forward pass: Feed input through the network to get a prediction
2. Loss: Measure how wrong the prediction was
3. Backpropagation: Find how each weight contributed to the error
4. Update: Adjust weights to reduce the error, then repeat!
These four steps repeat thousands of times. Each cycle = one iteration. One pass through entire dataset = one epoch. With each iteration, weights gradually shift toward values that produce accurate predictions.
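The four-step cycle can be sketched end-to-end for a single sigmoid neuron. The toy dataset (shot distance and angle, labeled 1 if on target) and the learning rate are invented for illustration:

```python
import math

# Toy data: (distance, angle) -> 1 if the shot was on target, else 0 (made-up)
data = [([5.0, 30.0], 1), ([25.0, 10.0], 0), ([8.0, 45.0], 1), ([30.0, 5.0], 0)]

w, b = [0.0, 0.0], 0.0   # start from arbitrary (here zero) parameters
lr = 0.01                 # learning rate

for epoch in range(500):
    for x, target in data:
        # 1. Forward pass
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        y = 1 / (1 + math.exp(-z))
        # 2. Loss would be binary cross-entropy; for this pairing of
        #    sigmoid + cross-entropy, backprop gives simply dL/dz = y - target
        dz = y - target
        # 3./4. Update each weight against its gradient, dL/dwi = dz * xi
        w = [wi - lr * dz * xi for wi, xi in zip(w, x)]
        b -= lr * dz

# After training, a close shot should score higher than a distant one
close = 1 / (1 + math.exp(-(w[0] * 5 + w[1] * 30 + b)))
far = 1 / (1 + math.exp(-(w[0] * 30 + w[1] * 5 + b)))
print(round(close, 2), round(far, 2))
```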
The loss function measures how wrong our predictions are. Higher = worse, lower = better. Training tries to minimize this number.
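Two common loss functions, sketched in plain Python: mean squared error for regression targets, and binary cross-entropy for probability outputs like xG:

```python
import math

def mse(predictions, targets):
    """Mean squared error: average squared gap between prediction and truth."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def binary_cross_entropy(predictions, targets):
    """Heavily penalizes confident wrong answers (p near the wrong extreme)."""
    eps = 1e-12  # avoid log(0)
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(predictions, targets)) / len(targets)

# A confident correct prediction costs little; a confident wrong one costs a lot
print(binary_cross_entropy([0.9], [1]))  # small loss
print(binary_cross_entropy([0.9], [0]))  # large loss
```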
We know the error (loss), but have thousands of weights. Which weights caused the error? By how much? Backpropagation answers this using the chain rule from calculus.
Backprop works backwards from output to input. For each weight, it calculates: "If I nudge this weight slightly, how much does the loss change?" This is called the gradient — it tells us the direction and magnitude of change needed.
The Chain Rule
Backprop relies on the chain rule: the effect of a weight on the loss is the product of the effects along the path from that weight to the loss, e.g. ∂L/∂w = ∂L/∂y × ∂y/∂z × ∂z/∂w.
What Backprop Computes
For each weight w, backprop calculates the gradient: ∂L/∂w (how loss changes when we change w). This tells us:
- Positive gradient: Increasing w increases loss → we should decrease w
- Negative gradient: Increasing w decreases loss → we should increase w
- Large gradient: This weight has big impact on the error
- Small gradient: This weight barely affects the error
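The gradient's meaning can be checked numerically: nudge a weight by a tiny amount and see how the loss moves. A sketch with a made-up one-weight model:

```python
def loss(w):
    """Toy loss: squared error of a one-weight model y = w * x against target 6."""
    x, target = 2.0, 6.0
    return (w * x - target) ** 2

def numerical_gradient(f, w, h=1e-6):
    """Finite-difference estimate of df/dw."""
    return (f(w + h) - f(w - h)) / (2 * h)

g = numerical_gradient(loss, 1.0)
print(g)  # close to -16: a negative gradient, so increasing w decreases the loss
```

This matches the rules above: the gradient at w = 1 is negative, so gradient descent will increase w (toward the true value w = 3).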
Now we know how each weight contributes to the error (from backprop). Gradient descent uses this information to update the weights, nudging them in the direction that reduces loss.
Imagine you're blindfolded on a hilly landscape, trying to reach the lowest valley (minimum loss). You can feel the slope beneath your feet (the gradient). Gradient descent says: "Always step in the direction that goes most steeply downhill." Repeat until you reach the bottom!
The Update Rule
w_new = w - η × ∂L/∂w
where η (eta) is the learning rate, which controls the size of each step.
How Backprop and Gradient Descent Work Together
Backprop computes the gradient for every weight; gradient descent then applies the update rule to each one. One backward pass plus one round of updates completes a single training iteration.
Learning Rate Effects
- Too high: Overshoots the minimum; loss oscillates or explodes
- Just right: Converges smoothly to the minimum
- Too low: Converges very slowly, may get stuck
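The three regimes are easy to see on a toy one-dimensional loss L(w) = w², whose gradient is 2w and whose minimum is at w = 0; the rates below are chosen just to make each behavior visible:

```python
def descend(lr, steps=20, w=1.0):
    """Gradient descent on L(w) = w^2, using its gradient 2w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(lr=1.1))    # too high: |w| grows every step, the loss explodes
print(descend(lr=0.3))    # just right: w shrinks smoothly toward 0
print(descend(lr=0.001))  # too low: after 20 steps, w has barely moved
```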
Networks can "memorize" training data instead of learning general patterns. This is called overfitting. Regularization techniques prevent this.
Weight Penalty (L1/L2): Add a penalty for large weights to the loss function. Encourages smaller, more distributed weights.
Dropout: Randomly "turn off" neurons during training. Forces the network not to rely on any single neuron.
Early Stopping: Stop training when validation loss starts increasing, before overfitting occurs.
Batch Normalization: Normalize layer inputs. Stabilizes training and acts as a mild regularizer.
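An L2 weight penalty in code is just an extra term added to the loss; the penalty strength `lam` below is an arbitrary illustrative value:

```python
def l2_penalty(weights, lam=0.01):
    """Sum of squared weights, scaled by the regularization strength lambda."""
    return lam * sum(w ** 2 for w in weights)

def regularized_loss(base_loss, weights, lam=0.01):
    """Total loss = prediction error + penalty for large weights."""
    return base_loss + l2_penalty(weights, lam)

small = regularized_loss(0.5, [0.1, -0.2, 0.05])
large = regularized_loss(0.5, [3.0, -4.0, 5.0])
print(small, large)  # same prediction error, but large weights pay a bigger penalty
```

Because the penalty grows with the weights, gradient descent now has a reason to keep them small unless the data strongly justifies otherwise.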
Expected Goals (xG): Predict the probability of a goal from shot features: distance, angle, body part, assist type, defender positions.
Trajectory Prediction: Given past positions, predict where a player will be in 1-5 seconds. Used for tactical analysis.
Match Outcome: Predict home win, draw, or away win probability based on team form, head-to-head history, and player availability.
- ✓ Neurons compute weighted sums + activation
- ✓ Weights and biases are learned from data
- ✓ Activation functions add non-linearity
- ✓ Layers stack to form deep networks
- ✓ Loss functions measure prediction error
- ✓ Backprop finds how weights affect error
- ✓ Gradient descent updates weights to reduce error
2. Convolutional Neural Networks (CNNs)
3. Recurrent Neural Networks (RNNs & LSTMs)
4. Graph Neural Networks (GNNs)
5. Spatiotemporal GNNs for Football
Neural networks are just functions with learnable parameters. Training adjusts these parameters to minimize prediction error. The magic comes from stacking simple operations (weighted sums + activations) into deep architectures that can learn incredibly complex patterns — like predicting the probability of a goal from dozens of contextual features.