Building a robust football analytics pipeline requires combining data from multiple sources: match results from one provider, player statistics from another, odds from bookmakers, and tracking data from yet another. The challenge? None of them agree on how to name things.
This is the entity matching problem (also called record linkage or data deduplication). Without solving it, you cannot join datasets, track players across seasons, or build features that span multiple data sources.
Name order, transliteration, abbreviations, and character sets all differ.
Abbreviations, nicknames, and formal vs informal naming conventions add further variation.
Players change teams, share similar names (e.g., multiple "Mohamed Salah" across leagues), use nicknames professionally (e.g., "Rodri" vs "Rodrigo Hernández"), or have names with diacritics that get stripped (e.g., "Müller" vs "Muller").
There is no single perfect solution. The best approach combines multiple techniques, using simple methods for easy cases and more sophisticated methods for ambiguous matches. Here is the toolkit:
Fuzzy matching quantifies the similarity between two strings, allowing for typos, abbreviations, and minor variations. The workhorse algorithm is Levenshtein distance.
Levenshtein Distance
The minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.
The distance is typically converted to a 0-100 similarity score, as in libraries like fuzzywuzzy (Python).
A token-sort variant sorts the words alphabetically before comparing, which handles name order variations.
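To see the difference concretely, here is a minimal sketch using rapidfuzz (the same library as the team example below); the names are arbitrary:
from rapidfuzz import fuzz, utils
a, b = "Heung-Min Son", "Son Heung-min"
# Plain Levenshtein-based ratio: word order drags the score down
print(fuzz.ratio(a, b, processor=utils.default_process))
# Token sort ratio: words are sorted first, so reordered names score ~100
print(fuzz.token_sort_ratio(a, b, processor=utils.default_process))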
Jaro-Winkler Similarity
Designed for short strings like names. Gives higher scores to strings that match from the beginning, which is useful for names with common prefixes.
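A minimal sketch of the prefix effect, using rapidfuzz's distance module and the "Rodri"/"Rodrigo" case from above (scores are approximate):
from rapidfuzz.distance import JaroWinkler, Levenshtein
# Jaro-Winkler rewards the shared "rodri" prefix
print(JaroWinkler.similarity("rodrigo", "rodri"))              # ~0.94
# Plain normalized Levenshtein on the same pair is noticeably lower
print(Levenshtein.normalized_similarity("rodrigo", "rodri"))   # ~0.71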
from rapidfuzz import fuzz, process
teams_source_a = ["Manchester United", "Liverpool", "Arsenal"]
query = "Man Utd"
# Find best match
best_match = process.extractOne(query, teams_source_a)
# Returns: ("Manchester United", 73.3, 0)
# Set threshold for automatic matching
if best_match[1] > 80:
    matched_team = best_match[0]
else:
    # Flag for manual review
    pass
Phonetic algorithms encode names by how they sound rather than how they are spelled. This catches variations that fuzzy matching might miss.
Soundex: A classic algorithm encoding a name as a letter plus three digits. Limited but fast.
Double Metaphone: More sophisticated and handles non-English names better. Returns primary and alternate encodings.
Use case: Pre-filter candidates before applying more expensive similarity measures. If phonetic codes do not match, skip the comparison entirely.
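A minimal sketch of that pre-filter, assuming the jellyfish library for the phonetic codes (any Soundex/Metaphone implementation slots in the same way); the surnames are illustrative:
import unicodedata
import jellyfish

def strip_accents(name):
    # Phonetic encoders expect plain ASCII, so drop diacritics first
    return unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")

surnames_a = ["Dembele", "Demichelis", "Fernandez"]
query = "Dembélé"

query_code = jellyfish.metaphone(strip_accents(query))
# Cheap blocking step: only names with a matching phonetic code move on
candidates = [s for s in surnames_a if jellyfish.metaphone(strip_accents(s)) == query_code]
print(candidates)  # expensive similarity measures run only on these survivors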
Modern NLP models encode text into dense vector representations (embeddings) that capture semantic meaning. Similar names cluster together in embedding space, enabling matching even when surface forms differ significantly.
Pre-trained models like all-MiniLM-L6-v2 generate 384-dimensional embeddings. Fast, free, and effective for entity matching.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
names = ["Heung-Min Son", "Son Heung-min", "Harry Kane"]
embeddings = model.encode(names)
# Cosine similarity matrix
sim_matrix = cosine_similarity(embeddings)
# Son variants: ~0.92 similarity
# Son vs Kane: ~0.45 similarity
OpenAI's text-embedding-3-small provides high-quality embeddings via API. Better for multilingual names, but it incurs cost.
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
model="text-embedding-3-small",
input=["손흥민", "Son Heung-min"]
)
# Returns 1536-dimensional embeddings
# High similarity despite different scripts
For large datasets, use approximate nearest neighbor search to find candidates efficiently.
import faiss
import numpy as np
# embeddings_a / query_embedding: float32 numpy arrays (e.g. from model.encode above)
# Build index from source A embeddings
index = faiss.IndexFlatIP(384)  # Inner product (cosine after L2 norm)
faiss.normalize_L2(embeddings_a)
index.add(embeddings_a)
# Query with source B names
faiss.normalize_L2(query_embedding)
distances, indices = index.search(query_embedding, k=5)
# Returns top 5 matches with similarity scores
String similarity alone cannot resolve all ambiguities. Metadata provides crucial disambiguation signals that dramatically improve matching accuracy.
Date of birth: The strongest disambiguator for players. Two players with similar names but different DOBs are by definition different entities.
Nationality: Helps distinguish players with common names. Less reliable than DOB but widely available.
Team: Useful for temporal matching within a season. Less reliable across seasons due to transfers.
Position: A weak signal (positions vary by source) but useful in combination with other features.
Combine signals into a weighted match score:
def match_score(name_sim, dob_match, nationality_match, team_match):
    score = name_sim * 0.4
    if dob_match:
        score += 0.35  # Strong signal
    if nationality_match:
        score += 0.15
    if team_match:
        score += 0.10
    return score
# Threshold: score > 0.75 = auto-match, 0.5-0.75 = manual review
A production entity matching pipeline combines these techniques in stages, from cheap/fast to expensive/accurate (a condensed sketch follows the stage list):
Normalize strings (lowercase, remove diacritics, strip whitespace) and check for exact matches. Handles ~60-70% of cases instantly.
Reduce candidate pairs using phonetic codes or first-letter matching. Comparing all pairs is O(n²) — blocking makes it tractable.
Apply Levenshtein/Jaro-Winkler to blocked candidates. High scores (> 90) = auto-match.
For ambiguous cases, compute embedding similarity. Catches semantic matches that fuzzy methods miss.
Cross-check with DOB, nationality, team. Reject mismatches, boost confident matches.
Low-confidence matches go to a human review queue. Build a feedback loop to improve thresholds over time.
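A condensed, illustrative sketch of these stages (match_name and normalize are hypothetical helpers, the thresholds are illustrative, and the embedding and metadata stages are elided for brevity; match_score from above would slot in before the final decision):
import unicodedata
from rapidfuzz import process, fuzz

def normalize(name):
    # Exact-match prep: strip diacritics, lowercase, collapse whitespace
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_name.lower().split())

def match_name(query, reference_names):
    norm_refs = {normalize(r): r for r in reference_names}
    norm_query = normalize(query)
    # 1. Exact match on normalized strings
    if norm_query in norm_refs:
        return norm_refs[norm_query], 100.0, "exact"
    # 2. Blocking: only compare names sharing the query's first letter
    block = [n for n in norm_refs if n[:1] == norm_query[:1]]
    if not block:
        return None, 0.0, "unmatched"
    # 3. Fuzzy scoring within the block (embeddings and metadata would refine further)
    best, score, _ = process.extractOne(norm_query, block, scorer=fuzz.token_sort_ratio)
    if score > 90:
        return norm_refs[best], score, "auto"
    if score > 75:
        return norm_refs[best], score, "review"  # goes to the manual review queue
    return None, score, "unmatched"

print(match_name("Mohammed Salah", ["Mohamed Salah", "Sadio Mane", "Darwin Nunez"]))
# Expected: ('Mohamed Salah', ~96, 'auto')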
Create canonical IDs for each player/team. Map source-specific names to these IDs.
When you find a match, store the alias. Future matches against that alias become exact matches (see the minimal sketch after this list).
A player's team changes over time. Use DOB as the primary key, not team membership.
Audit your matches periodically. False positives corrupt your data; false negatives lose information.
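A minimal sketch of the canonical-ID and alias-store idea (the IDs, names, and dictionaries here are illustrative, not a specific provider's schema):
# Canonical registry: one stable ID per real-world entity, plus known aliases
canonical = {
    "PLAYER_0001": {"name": "Son Heung-min", "dob": "1992-07-08"},
}
aliases = {
    "heung-min son": "PLAYER_0001",
    "son heung-min": "PLAYER_0001",
    "손흥민": "PLAYER_0001",
}

def resolve(source_name):
    # Confirmed aliases resolve as cheap exact lookups on the normalized name
    return aliases.get(source_name.strip().lower())

player_id = resolve("Heung-Min Son")
if player_id is None:
    # Unknown name: run the full matching pipeline, then write the new alias
    # back into `aliases` once the match is confirmed
    pass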
rapidfuzz: Fast fuzzy string matching. Drop-in replacement for fuzzywuzzy with a 10-100x speedup.
sentence-transformers: Pre-trained embedding models. Start with all-MiniLM-L6-v2 for a good speed/quality balance.
dedupe: Full record linkage library with active learning. Trains on your corrections to improve over time.
faiss: Efficient similarity search for millions of embeddings. Essential for production scale.