Machine Learning · Cricket Analytics

Predicting Cricket, One Ball at a Time

A machine learning system trained on 4 million individual ball outcomes that forecasts T20 match results through ball-by-ball Monte Carlo simulation.

4M+ Balls Analyzed
63 Features Per Ball
55% Ball Accuracy
29% Market Edge

Why Traditional Prediction Fails

a data problem

Traditional Approach

15,000

Historical T20 matches

Predicting match winners directly gives limited training data — not enough to learn nuanced patterns.

Our Approach

4,000,000

Individual ball outcomes

By predicting each ball, we get roughly 267× more training examples (4,000,000 ÷ 15,000). Rich patterns emerge from abundant examples.

The Key Insight

Each ball has an outcome: dot, single, double, boundary, six, or wicket. By simulating entire matches ball-by-ball with Monte Carlo methods, we capture the natural uncertainty of cricket while leveraging abundant training data.
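To make the simulation idea concrete, here is a minimal sketch of a ball-by-ball Monte Carlo loop. It is not the project's code: the fixed outcome probabilities in `PROBS` are made-up placeholders (the real system would get them per ball from the trained classifier, conditioned on match state), and the function names are hypothetical. Extras, threes, and tie-breakers are ignored for brevity.

```python
import random

# Hypothetical, fixed outcome probabilities (they sum to 1.0). In the real
# system these would come from the classifier and change with match state.
OUTCOMES = ["dot", "single", "double", "boundary", "six", "wicket"]
PROBS = [0.35, 0.38, 0.07, 0.11, 0.04, 0.05]
RUNS = {"dot": 0, "single": 1, "double": 2, "boundary": 4, "six": 6, "wicket": 0}


def simulate_innings(rng, target=None, balls=120, wickets=10):
    """Play one innings ball by ball; stop early if a chase target is reached."""
    score, lost = 0, 0
    for _ in range(balls):
        outcome = rng.choices(OUTCOMES, weights=PROBS, k=1)[0]
        if outcome == "wicket":
            lost += 1
            if lost == wickets:  # all out
                break
        else:
            score += RUNS[outcome]
            if target is not None and score >= target:  # chase completed
                break
    return score


def win_probability(n_runs=1000, seed=0):
    """Estimate P(team batting first wins) as the share of simulated wins.

    Ties are counted as "not a win" for the team batting first.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_runs):
        first = simulate_innings(rng)
        second = simulate_innings(rng, target=first + 1)
        if first > second:
            wins += 1
    return wins / n_runs


if __name__ == "__main__":
    print(win_probability(n_runs=1000, seed=42))
```

Because both sides share the same placeholder probabilities here, the estimate should land somewhere near an even split; the interesting behaviour only appears once the probabilities depend on who is batting, who is bowling, and the state of the match.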

How It Works

data to predictions
1. Data: 8,341 matches → 4M balls
2. Features: 63 features per ball
3. Model: XGBoost classifier
4. Simulate: 1,000 Monte Carlo runs
5. Predict: Win probabilities

63 Features Across 6 Categories

12 Match State: Score, wickets, phase
20 Player Stats: Averages, strike rates
10 Momentum: Recent balls, pressure
4 Chase: Target, required rate
5 Matchups: Batter vs bowler
12 Context: Venue, batting order
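As an illustration of what the Chase category might contain, the sketch below derives a few chase-state values from the scoreboard. The feature names and the exact set of four are assumptions; the page only names "target" and "required rate".

```python
def chase_features(target, score, balls_bowled, total_balls=120):
    """Return a few chase-state features for the side batting second.

    Feature names are hypothetical; rates are in runs per over.
    """
    balls_left = total_balls - balls_bowled
    runs_needed = max(target - score, 0)
    overs_bowled = balls_bowled / 6
    overs_left = balls_left / 6
    # Guard against division by zero at the first and last ball.
    current_rate = score / overs_bowled if overs_bowled > 0 else 0.0
    required_rate = runs_needed / overs_left if overs_left > 0 else 0.0
    return {
        "runs_needed": runs_needed,
        "balls_left": balls_left,
        "current_run_rate": current_rate,
        "required_run_rate": required_rate,
    }


if __name__ == "__main__":
    # Chasing 161, at 80 after 10 overs.
    print(chase_features(target=161, score=80, balls_bowled=60))
```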

What Drives the Predictions?

Top 10 features by XGBoost gain importance

Model Leaderboard

four architectures, one task

Tested on T20 World Cup 2024 matches with real betting odds. Lower log loss and Brier are better; market edge is disagreement with the book.

Rank  Model                          Log Loss  Brier  Market Edge  Speed
1     XGBoost · Gradient Boosting    0.655     0.219  29.4%        ~346s
2     MLP · Neural Net               0.707     0.254  27.0%        ~75s
3     LSTM · Recurrent               0.721     0.261  25.8%        ~420s
4     Fine-tuned LLM · Transformer   0.748     0.278  24.1%        ~890s
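For readers unfamiliar with the leaderboard columns, here is a minimal sketch of how such metrics can be computed for two-outcome matches. Log loss and Brier score follow their standard definitions. The page does not define "market edge" beyond "disagreement with the book", so `mean_disagreement` (average absolute gap between model and market probability) is only one plausible reading, and `implied_probability` assumes decimal odds with the bookmaker margin removed by normalisation.

```python
import math


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (1 = team A won)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


def brier_score(probs, outcomes):
    """Mean squared difference between forecast and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def implied_probability(odds_a, odds_b):
    """Decimal odds -> probability for team A, with the margin normalised out."""
    raw_a, raw_b = 1 / odds_a, 1 / odds_b
    return raw_a / (raw_a + raw_b)


def mean_disagreement(model_probs, market_probs):
    """Assumed 'edge' definition: average absolute gap between model and market."""
    gaps = [abs(m - k) for m, k in zip(model_probs, market_probs)]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    model = [0.70, 0.40, 0.55]
    market = [implied_probability(1.8, 2.2), implied_probability(2.5, 1.6), 0.50]
    results = [1, 0, 1]
    print(log_loss(model, results), brier_score(model, results))
    print(mean_disagreement(model, market))
```

A useful reference point: a forecast of 50% on every match scores a log loss of about 0.693 and a Brier score of exactly 0.25.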

XGBoost

Gradient-boosted trees, Optuna-tuned. 444 estimators, max depth 10.

MLP

3-layer feedforward (256→128→64), BatchNorm, focal loss.
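Focal loss down-weights examples the model already classifies confidently, which is commonly used when some classes are rare. A minimal single-example sketch of the standard formula, -(1 - p_t)^γ · log(p_t), follows; the focusing parameter `gamma=2.0` is a conventional default and an assumption here, since the page does not state the value used.

```python
import math


def focal_loss(probs, true_index, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    probs: predicted class probabilities; true_index: index of the true class.
    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = min(max(probs[true_index], 1e-12), 1.0)  # clip to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


if __name__ == "__main__":
    confident = [0.90, 0.05, 0.05]  # easy example, true class 0
    unsure = [0.20, 0.50, 0.30]     # hard example, true class 0
    print(focal_loss(confident, 0), focal_loss(unsure, 0))
```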

LSTM

2-layer with player embeddings over 10-ball windows.
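One way to build 10-ball windows is to pair each ball with the ten deliveries that preceded it. The sketch below assumes exactly that; whether the project pads the first ten balls of an innings or skips them is not stated, and this version skips them.

```python
def sliding_windows(balls, window=10):
    """Yield (previous `window` balls, next ball) pairs from one innings.

    The first `window` balls have no full history and produce no pair.
    """
    for i in range(window, len(balls)):
        yield balls[i - window:i], balls[i]


if __name__ == "__main__":
    innings = ["dot", "single", "dot", "boundary", "single", "dot",
               "wicket", "single", "double", "dot", "six", "single"]
    for history, nxt in sliding_windows(innings):
        print(history, "->", nxt)
```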

Fine-tuned LLM

GPT-style transformer with LoRA on cricket commentary.

Why XGBoost Wins

The tabular nature of cricket statistics — player averages, match state — plays to gradient boosting's strengths. Neural nets like LSTM suit sequential patterns, but the added complexity doesn't translate into better match predictions here.

Tested Against the Market

44 World Cup matches
Log Loss: 0.73 (lower is better)
Brier Score: 0.26 (calibration metric)
Average Edge: 29% (vs betting markets)

Market Disagreement

The model's probabilities differ from the bookmakers' on every match, flagging fixtures where betting markets may have mispriced the outcome.

Calibration

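A Brier score of 0.26 should be read against the 0.25 that a constant 50% forecast scores on a two-outcome match, so the score alone does not establish calibration. The goal is that when the model says 60%, teams win about 60% of the time.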
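The "60% forecasts should win about 60% of the time" idea can be checked directly with a reliability table: bucket the forecasts, then compare each bucket's average forecast with its observed win rate. A minimal sketch, with a hypothetical function name and equal-width bins chosen for simplicity:

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Group forecasts into equal-width bins and compare forecast with win rate.

    Returns one (mean forecast, observed win rate, count) row per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        win_rate = sum(y for _, y in members) / len(members)
        rows.append((round(mean_p, 3), round(win_rate, 3), len(members)))
    return rows


if __name__ == "__main__":
    forecasts = [0.62, 0.58, 0.61, 0.35, 0.30, 0.85]
    results = [1, 0, 1, 0, 1, 1]
    for row in reliability_table(forecasts, results):
        print(row)
```

With only 44 matches, each bin holds a handful of games, so the observed win rates will be noisy.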

About This Project

CricML is a personal project exploring the intersection of machine learning and cricket analytics. It demonstrates production-grade ML engineering with proper temporal data handling, efficient memory management, and rigorous evaluation against real-world betting markets.
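As a rough illustration of what temporal data handling usually means in this setting, the sketch below splits matches by date so that the model is never trained on games played after the ones it is evaluated on. The record layout and the cutoff date are hypothetical.

```python
from datetime import date


def temporal_split(matches, cutoff):
    """Split match records by date: train strictly before `cutoff`, test from it on."""
    train = [m for m in matches if m["date"] < cutoff]
    test = [m for m in matches if m["date"] >= cutoff]
    return train, test


if __name__ == "__main__":
    # Hypothetical records; only the "date" key is used by the split.
    matches = [
        {"id": 1, "date": date(2023, 3, 1)},
        {"id": 2, "date": date(2024, 1, 15)},
        {"id": 3, "date": date(2024, 6, 5)},
    ]
    train, test = temporal_split(matches, cutoff=date(2024, 6, 1))
    print(len(train), len(test))
```

The same rule applies to per-player statistics: an average used as a feature for a given ball should be computed only from deliveries bowled before it.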