Building a robust football analytics pipeline requires combining data from multiple sources: match results from one provider, player statistics from another, odds from bookmakers, and tracking data from yet another. The challenge? None of them agree on how to name things.
This is the entity matching problem (also called record linkage or data deduplication). Without solving it, you cannot join datasets, track players across seasons, or build features that span multiple data sources.
Name order, transliteration, abbreviations, and character sets all differ.
Abbreviations, nicknames, and formal vs informal naming conventions add further variation.
Players change teams, share similar names (e.g., multiple "Mohamed Salah" across leagues), use nicknames professionally (e.g., "Rodri" vs "Rodrigo Hernández"), or have names with diacritics that get stripped (e.g., "Müller" vs "Muller").
There is no single perfect solution. The best approach combines multiple techniques, using simple methods for easy cases and more sophisticated methods for ambiguous matches. Here is the toolkit:
Fuzzy matching quantifies the similarity between two strings, allowing for typos, abbreviations, and minor variations. The workhorse algorithm is Levenshtein distance.
Levenshtein Distance
The minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another.
The distance is typically converted to a 0-100 similarity score, as in libraries like fuzzywuzzy (Python).
A token-sort variant sorts the words alphabetically before comparing, which handles name order variations.
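To see the difference concretely, here is a minimal sketch using rapidfuzz (the same library as the team example below); the names are arbitrary:
from rapidfuzz import fuzz, utils
a, b = "Heung-Min Son", "Son Heung-min"
# Plain Levenshtein-based ratio: word order drags the score down
print(fuzz.ratio(a, b, processor=utils.default_process))
# Token sort ratio: words are sorted first, so reordered names score ~100
print(fuzz.token_sort_ratio(a, b, processor=utils.default_process))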
Jaro-Winkler Similarity
Designed for short strings like names. Gives higher scores to strings that match from the beginning, which is useful for names with common prefixes.
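A minimal sketch of the prefix effect, using rapidfuzz's distance module and the "Rodri"/"Rodrigo" case from above (scores are approximate):
from rapidfuzz.distance import JaroWinkler, Levenshtein
# Jaro-Winkler rewards the shared "rodri" prefix
print(JaroWinkler.similarity("rodrigo", "rodri"))              # ~0.94
# Plain normalized Levenshtein on the same pair is noticeably lower
print(Levenshtein.normalized_similarity("rodrigo", "rodri"))   # ~0.71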
from rapidfuzz import fuzz, process
teams_source_a = ["Manchester United", "Liverpool", "Arsenal"]
query = "Man Utd"
# Find best match
best_match = process.extractOne(query, teams_source_a)
# Returns: ("Manchester United", 73.3, 0)
# Set threshold for automatic matching
if best_match[1] > 80:
    matched_team = best_match[0]
else:
    # Flag for manual review
    pass
Phonetic algorithms encode names by how they sound rather than how they are spelled. This catches variations that fuzzy matching might miss.
Soundex: A classic algorithm encoding a name as a letter plus three digits. Limited but fast.
Double Metaphone: More sophisticated and handles non-English names better. Returns primary and alternate encodings.
Use case: Pre-filter candidates before applying more expensive similarity measures. If phonetic codes do not match, skip the comparison entirely.
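A minimal sketch of that pre-filter, assuming the jellyfish library for the phonetic codes (any Soundex/Metaphone implementation slots in the same way); the surnames are illustrative:
import unicodedata
import jellyfish

def strip_accents(name):
    # Phonetic encoders expect plain ASCII, so drop diacritics first
    return unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")

surnames_a = ["Dembele", "Demichelis", "Fernandez"]
query = "Dembélé"

query_code = jellyfish.metaphone(strip_accents(query))
# Cheap blocking step: only names with a matching phonetic code move on
candidates = [s for s in surnames_a if jellyfish.metaphone(strip_accents(s)) == query_code]
print(candidates)  # expensive similarity measures run only on these survivors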
Modern NLP models encode text into dense vector representations (embeddings) that capture semantic meaning. Similar names cluster together in embedding space, enabling matching even when surface forms differ significantly.
Pre-trained models like all-MiniLM-L6-v2 generate 384-dimensional embeddings. Fast, free, and effective for entity matching.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
names = ["Heung-Min Son", "Son Heung-min", "Harry Kane"]
embeddings = model.encode(names)
# Cosine similarity matrix
sim_matrix = cosine_similarity(embeddings)
# Son variants: ~0.92 similarity
# Son vs Kane: ~0.45 similarity
OpenAI's text-embedding-3-small provides high-quality embeddings via API. Better for multilingual names, but it incurs cost.
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
model="text-embedding-3-small",
input=["손흥민", "Son Heung-min"]
)
# Returns 1536-dimensional embeddings
# High similarity despite different scripts
For large datasets, use approximate nearest neighbor search to find candidates efficiently.
import faiss
import numpy as np
# embeddings_a / query_embedding: float32 numpy arrays (e.g. from model.encode above)
# Build index from source A embeddings
index = faiss.IndexFlatIP(384)  # Inner product (cosine after L2 norm)
faiss.normalize_L2(embeddings_a)
index.add(embeddings_a)
# Query with source B names
faiss.normalize_L2(query_embedding)
distances, indices = index.search(query_embedding, k=5)
# Returns top 5 matches with similarity scores
String similarity alone cannot resolve all ambiguities. Metadata provides crucial disambiguation signals that dramatically improve matching accuracy.
Date of birth: The strongest disambiguator for players. Two players with similar names but different DOBs are by definition different entities.
Nationality: Helps distinguish players with common names. Less reliable than DOB but widely available.
Team: Useful for temporal matching within a season. Less reliable across seasons due to transfers.
Position: A weak signal (positions vary by source) but useful in combination with other features.
Combine signals into a weighted match score:
def match_score(name_sim, dob_match, nationality_match, team_match):
    score = name_sim * 0.4
    if dob_match:
        score += 0.35  # Strong signal
    if nationality_match:
        score += 0.15
    if team_match:
        score += 0.10
    return score
# Threshold: score > 0.75 = auto-match, 0.5-0.75 = manual review
A production entity matching pipeline combines these techniques in stages, from cheap/fast to expensive/accurate (a condensed sketch follows the stage list):
Normalize strings (lowercase, remove diacritics, strip whitespace) and check for exact matches. Handles ~60-70% of cases instantly.
Reduce candidate pairs using phonetic codes or first-letter matching. Comparing all pairs is O(n²) — blocking makes it tractable.
Apply Levenshtein/Jaro-Winkler to blocked candidates. High scores (> 90) = auto-match.
For ambiguous cases, compute embedding similarity. Catches semantic matches that fuzzy methods miss.
Cross-check with DOB, nationality, team. Reject mismatches, boost confident matches.
Low-confidence matches go to a human review queue. Build a feedback loop to improve thresholds over time.
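A condensed, illustrative sketch of these stages (match_name and normalize are hypothetical helpers, the thresholds are illustrative, and the embedding and metadata stages are elided for brevity; match_score from above would slot in before the final decision):
import unicodedata
from rapidfuzz import process, fuzz

def normalize(name):
    # Exact-match prep: strip diacritics, lowercase, collapse whitespace
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_name.lower().split())

def match_name(query, reference_names):
    norm_refs = {normalize(r): r for r in reference_names}
    norm_query = normalize(query)
    # 1. Exact match on normalized strings
    if norm_query in norm_refs:
        return norm_refs[norm_query], 100.0, "exact"
    # 2. Blocking: only compare names sharing the query's first letter
    block = [n for n in norm_refs if n[:1] == norm_query[:1]]
    if not block:
        return None, 0.0, "unmatched"
    # 3. Fuzzy scoring within the block (embeddings and metadata would refine further)
    best, score, _ = process.extractOne(norm_query, block, scorer=fuzz.token_sort_ratio)
    if score > 90:
        return norm_refs[best], score, "auto"
    if score > 75:
        return norm_refs[best], score, "review"  # goes to the manual review queue
    return None, score, "unmatched"

print(match_name("Mohammed Salah", ["Mohamed Salah", "Sadio Mane", "Darwin Nunez"]))
# Expected: ('Mohamed Salah', ~96, 'auto')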
Create canonical IDs for each player/team. Map source-specific names to these IDs.
When you find a match, store the alias. Future matches against that alias become exact matches (see the minimal sketch after this list).
A player's team changes over time. Use DOB as the primary key, not team membership.
Audit your matches periodically. False positives corrupt your data; false negatives lose information.
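A minimal sketch of the canonical-ID and alias-store idea (the IDs, names, and dictionaries here are illustrative, not a specific provider's schema):
# Canonical registry: one stable ID per real-world entity, plus known aliases
canonical = {
    "PLAYER_0001": {"name": "Son Heung-min", "dob": "1992-07-08"},
}
aliases = {
    "heung-min son": "PLAYER_0001",
    "son heung-min": "PLAYER_0001",
    "손흥민": "PLAYER_0001",
}

def resolve(source_name):
    # Confirmed aliases resolve as cheap exact lookups on the normalized name
    return aliases.get(source_name.strip().lower())

player_id = resolve("Heung-Min Son")
if player_id is None:
    # Unknown name: run the full matching pipeline, then write the new alias
    # back into `aliases` once the match is confirmed
    pass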
rapidfuzz: Fast fuzzy string matching. Drop-in replacement for fuzzywuzzy with a 10-100x speedup.
sentence-transformers: Pre-trained embedding models. Start with all-MiniLM-L6-v2 for a good speed/quality balance.
dedupe: Full record linkage library with active learning. Trains on your corrections to improve over time.
faiss: Efficient similarity search for millions of embeddings. Essential for production scale.