Machine Learning · Cricket Analytics

Predicting Cricket,
One Ball at a Time

A machine learning system that learns from more than 4 million individual ball outcomes and forecasts T20 match results with Monte Carlo simulation.

4M+ Balls Analyzed

63 Features Per Ball

55% Ball Accuracy

29% Market Edge

Why Traditional Prediction Fails

a data problem

Traditional Approach

15,000

Historical T20 matches

Predicting match winners directly gives limited training data — not enough to learn nuanced patterns.

Our Approach

4,000,000

Individual ball outcomes

By predicting each ball we get roughly 267× more training examples (4,000,000 balls versus 15,000 matches). Rich patterns emerge from abundant examples.

The Key Insight

Each ball has an outcome: dot, single, double, boundary, six, or wicket. By simulating entire matches ball-by-ball with Monte Carlo methods, we capture the natural uncertainty of cricket while leveraging abundant training data.
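To make the idea concrete, here is a minimal sketch of a ball-by-ball simulation. It assumes constant, made-up outcome probabilities; in the real system the per-ball classifier would supply a fresh distribution for every ball. It also ignores extras such as wides and no-balls. All names are illustrative, not the project's code.

```python
import random

OUTCOMES = ["dot", "single", "double", "boundary", "six", "wicket"]
RUNS = {"dot": 0, "single": 1, "double": 2, "boundary": 4, "six": 6, "wicket": 0}
# Made-up, constant probabilities for illustration; they sum to 1.
PROBS = [0.35, 0.37, 0.07, 0.11, 0.05, 0.05]


def simulate_innings(rng, target=None, max_balls=120, max_wickets=10):
    """Play one innings ball by ball and return the runs scored."""
    runs = wickets = 0
    for _ in range(max_balls):
        outcome = rng.choices(OUTCOMES, weights=PROBS, k=1)[0]
        if outcome == "wicket":
            wickets += 1
            if wickets == max_wickets:
                break  # all out
        else:
            runs += RUNS[outcome]
            if target is not None and runs >= target:
                break  # chase completed
    return runs


def first_innings_win_rate(n_runs=1000, seed=0):
    """Share of simulated matches won by the side batting first (ties count as no win)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_runs):
        first = simulate_innings(rng)
        second = simulate_innings(rng, target=first + 1)
        wins += first > second
    return wins / n_runs


print(first_innings_win_rate())
```

Because both sides draw from the same distribution here, the estimate should land near 50%. Differences between teams only appear once the probabilities depend on match state and players.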

How It Works

data to predictions

Data

8,341 matches → 4M balls

Features

63 features per ball

Model

XGBoost classifier

Simulate

1,000 Monte Carlo runs

Predict

Win probabilities
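A win probability estimated from a finite number of simulations carries sampling noise. The sketch below shows how large that noise is, using the standard binomial standard error. The function name and the 620-out-of-1,000 win count are made up for illustration.

```python
import math


def win_probability(results):
    """results: list of booleans, True where the team won a simulated match."""
    n = len(results)
    p = sum(results) / n
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error of the estimate
    return p, se


# Made-up example: 620 wins in 1,000 simulated matches.
p, se = win_probability([True] * 620 + [False] * 380)
print(f"{p:.3f} +/- {1.96 * se:.3f}")  # prints 0.620 +/- 0.030
```

With 1,000 runs the 95% interval is about three percentage points either side. That is simulation noise only and says nothing about how accurate the underlying ball model is.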

63 Features Across 6 Categories

12 Match State: Score, wickets, phase

20 Player Stats: Averages, strike rates

10 Momentum: Recent balls, pressure

4 Chase: Target, required rate

5 Matchups: Batter vs bowler

12 Context: Venue, batting order
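As an illustration of what one category might contain, here is a sketch of chase-state features computed from the scoreboard. The four feature names are hypothetical and chosen for the example; the page only states that the Chase category covers target and required rate.

```python
def chase_features(target, runs_scored, balls_bowled, total_balls=120):
    """Return chase-state features for the side batting second."""
    balls_remaining = total_balls - balls_bowled
    runs_needed = max(target - runs_scored, 0)
    if balls_remaining > 0:
        required_rate = runs_needed * 6 / balls_remaining  # runs per over
    else:
        required_rate = 0.0 if runs_needed == 0 else float("inf")
    current_rate = runs_scored * 6 / balls_bowled if balls_bowled else 0.0
    return {
        "runs_needed": runs_needed,
        "balls_remaining": balls_remaining,
        "required_run_rate": required_rate,
        "current_run_rate": current_rate,
    }


# Chasing 160, the batting side is 85 after 10 overs (60 balls).
features = chase_features(target=160, runs_scored=85, balls_bowled=60)
print(features)
```

Every value here uses only information available before the next ball is bowled, which matters when features feed a model that is evaluated on future matches.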

What Drives the Predictions?

Top 10 features by XGBoost gain importance

Model Leaderboard

four architectures, one task

Tested on T20 World Cup 2024 matches with real betting odds. Lower log loss and Brier are better; market edge is disagreement with the book.

Rank	Model	Log Loss	Brier	Market Edge	Speed
1	XGBoost · Gradient Boosting	0.655	0.219	29.4%	~346s
2	MLP · Neural Net	0.707	0.254	27.0%	~75s
3	LSTM · Recurrent	0.721	0.261	25.8%	~420s
4	Fine-tuned LLM · Transformer	0.748	0.278	24.1%	~890s
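For readers unfamiliar with the two scores, here is a minimal sketch of how log loss and Brier score are computed for binary win/loss forecasts. The function names are ours for illustration. A useful reference point: a forecast that always says 50% scores ln 2 ≈ 0.693 on log loss and 0.25 on Brier.

```python
import math


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (1 = win, 0 = loss)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# A constant 50% forecast gives the coin-flip baselines.
coin = [0.5, 0.5, 0.5, 0.5]
results = [1, 0, 0, 1]
print(log_loss(coin, results), brier_score(coin, results))
```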

XGBoost

Gradient-boosted trees, Optuna-tuned. 444 estimators, max depth 10.

MLP

3-layer feedforward (256→128→64), BatchNorm, focal loss.
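Focal loss down-weights examples the model already classifies confidently, which can help when common outcomes such as dots and singles dominate rarer ones such as wickets. Below is a sketch of the per-example formula, FL = -(1 - p_t)^gamma * log(p_t). The gamma value of 2.0 is an assumed default; the page does not state what the project uses.

```python
import math


def focal_loss(probs, true_class, gamma=2.0):
    """Focal loss for one example given predicted class probabilities."""
    p_t = probs[true_class]  # probability assigned to the correct class
    return -((1 - p_t) ** gamma) * math.log(p_t)


# Made-up probabilities for two balls, both with true class index 0.
easy = [0.90, 0.05, 0.05]  # correct class predicted confidently
hard = [0.20, 0.50, 0.30]  # correct class predicted poorly
print(focal_loss(easy, 0), focal_loss(hard, 0))
```

Setting gamma to 0 recovers ordinary cross-entropy, so the parameter controls how strongly easy examples are discounted.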

LSTM

2-layer with player embeddings over 10-ball windows.
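A sequence model needs its input cut into fixed-length histories. The sketch below pairs each ball with the 10 balls before it. Whether the project's windows include the current ball, or pad the start of an innings, is not stated, so treat this layout as an assumption.

```python
def ball_windows(balls, window=10):
    """Return (history, next_ball) pairs, where history is the `window` balls before next_ball."""
    return [(balls[i - window:i], balls[i]) for i in range(window, len(balls))]


# Twelve placeholder balls, labelled 0..11, yield two training pairs.
pairs = ball_windows(list(range(12)))
for history, next_ball in pairs:
    print(history, "->", next_ball)
```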

Fine-tuned LLM

GPT-style transformer with LoRA on cricket commentary.

Why XGBoost Wins

The tabular nature of cricket statistics — player averages, match state — plays to gradient boosting's strengths. Neural nets like LSTM suit sequential patterns, but the added complexity doesn't translate into better match predictions here.

Tested Against the Market

44 World Cup matches

0.73

Log Loss

Lower is better

0.26

Brier Score

Calibration metric

29%

Average Edge

vs betting markets

Market Disagreement

The model finds edge opportunities on every match, flagging where betting markets may misprice probabilities.
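One common way to measure disagreement with a bookmaker is to convert decimal odds into implied probabilities, normalise away the bookmaker's margin, and subtract the result from the model's probability. The sketch below assumes that definition; the edge figures on this page may be computed differently. The odds and model probability are made up.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds for all outcomes into probabilities that sum to 1."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)  # exceeds 1 by the bookmaker's margin
    return [r / total for r in raw]


def edge(model_prob, market_prob):
    """Positive when the model rates the outcome as more likely than the market."""
    return model_prob - market_prob


# Made-up two-way market: team A at 1.80, team B at 2.10.
market_a, market_b = implied_probabilities([1.80, 2.10])
print(round(market_a, 4), round(edge(0.60, market_a), 4))
```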

Calibration

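A Brier score of 0.26 sits close to the 0.25 that a constant 50% forecast would score, so on its own it is weak evidence of calibration. A calibrated model is one where, when it says 60%, teams win about 60% of the time; confirming that means comparing predicted probabilities with observed win rates.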
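One way to check calibration directly is a reliability table: group forecasts into probability bins and compare the average forecast in each bin with the observed win rate. This is a sketch with hypothetical names and made-up data. With only 44 matches each bin holds few games, so the observed rates will be noisy.

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Return (mean forecast, observed win rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            win_rate = sum(y for _, y in b) / len(b)
            rows.append((mean_p, win_rate, len(b)))
    return rows


# Made-up forecasts: ten matches at 60%, of which six were won.
for row in reliability_table([0.6] * 10, [1] * 6 + [0] * 4):
    print(row)
```

A well-calibrated model produces rows where the first two numbers are close in every bin.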

About This Project

CricML is a personal project exploring the intersection of machine learning and cricket analytics. It demonstrates production-grade ML engineering with proper temporal data handling, efficient memory management, and rigorous evaluation against real-world betting markets.

