Mustang Analytics
Methodology note

We tested an MLB model against the closing line. Then we borrowed an idea, and the edge got bigger.

The model picks home-team wins a little better than a coin flip and a little better than the standings. The betting market still picks winners better. But when we applied a Bayesian shrinkage technique from an open-source baseball project, the model’s backtested return at a five-percent edge threshold rose from +9.9 to +14.3 percent, a relative gain of about 44 percent. This page shows what we built, what we borrowed, and the numbers behind both.

Built on 4,857 games · Scored against 354 real closing lines · Static write-up

Games used: 4,857 (2024 and 2025 seasons)
Model accuracy: 56.1% on 1,458 held-out games
ROI at 5% edge: +14.3%, up from +9.9% before shrinkage
Vs. closing line: −1.4 pts (market 57.6%, model 56.2%)

What We Built

A model that estimates the chance the home team wins

We pulled 4,857 regular-season games from 2024 and 2025 using the MLB Stats API and attached team-strength features to the matchups: prior-season Wins Above Replacement split into offense and defense, season-to-date win percentage going into the game, the ballpark run factor from Baseball Savant, and both teams’ preseason Steamer WAR projections. The features use information that existed before first pitch, so the model cannot peek at the result it is trying to predict.
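A "going into the game" feature of this kind can be sketched in Python as follows. This is an illustration only, not the production code: the field names (`season`, `home`, `away`, `home_won`) and the 0.5 fallback for a team with no games played are assumptions. The point it shows is the ordering: each game's features are read before that game's result updates the running record.

```python
from collections import defaultdict


def add_pregame_win_pct(games):
    """Attach each team's season-to-date win percentage as of before first pitch.

    `games` must already be sorted by date. Keys are hypothetical:
    'season', 'home', 'away', 'home_won' (bool).
    """
    record = defaultdict(lambda: [0, 0])  # (season, team) -> [wins, games played]
    rows = []
    for g in games:
        feats = {}
        for side in ("home", "away"):
            wins, played = record[(g["season"], g[side])]
            # Assumption: a team with no games yet gets a neutral 0.5.
            feats[f"{side}_win_pct"] = wins / played if played else 0.5
        rows.append({**g, **feats})

        # Update the running records only after the features were read,
        # so a game's own result never reaches its own feature row.
        home_key = (g["season"], g["home"])
        away_key = (g["season"], g["away"])
        record[home_key][1] += 1
        record[away_key][1] += 1
        winner = home_key if g["home_won"] else away_key
        record[winner][0] += 1
    return rows
```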

We added the Steamer projection after testing it on its own. The preseason WAR figure freezes before opening day and joins a game by season alone, so it leaks no outcome information. The difference between the two teams’ projected WAR turned out to be one of the more useful inputs, which suggests the projection systems capture strength the in-season record has not caught up to yet.

The model is a logistic regression, chosen on purpose. A small model with a handful of plain features is easier to audit and harder to fool yourself with than a large one. The largest weights fall on the prior-season WAR terms, which matches what you would expect.

Top features by weight

Feature                               Weight
Away team offensive WAR               −0.28
Away team prior-season WAR            +0.28
Away team last-20 win% (regressed)    +0.25
Home team prior-season WAR            +0.22
Home team last-20 win% (regressed)    +0.21
Home team offensive WAR               −0.17
Steamer projected WAR difference      +0.16

Weights come from the trained model on standardized inputs, using regressed versions of the rolling stats. The offense and defense splits trade off against the combined WAR terms, so individual signs read less cleanly than the totals.

How We Checked It

Train on the past, test on the future

We sorted the games by date, trained on the first 70 percent, and tested on the last 30 percent. The model learns from games that finished before the ones it gets graded on, the same position a real forecaster is in.
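A date-ordered split like this can be sketched in a few lines of Python. The `date` key is an assumed field name, and cutting by game count (which can fall in the middle of a day's slate) is a simplification for illustration.

```python
def chronological_split(games, train_frac=0.70):
    """Sort by date, then cut once: earlier games train, later games test."""
    ordered = sorted(games, key=lambda g: g["date"])
    cut = int(len(ordered) * train_frac)  # no shuffling, so no future leaks into training
    return ordered[:cut], ordered[cut:]
```

With 4,857 games and a 70 percent cut, this yields 3,399 training games and 1,458 test games.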

  • Training set: the first 3,399 games in date order.
  • Test set: the last 1,458 games, June 8 to September 28, 2025.
  • The model picks the winner in 56.1 percent of test games.
  • A baseline that picks the team with the better record gets 54.5 percent.

So the model adds 1.6 points over the obvious baseline. Picking winners and beating a betting market are different jobs, though. The market price folds in what our features capture plus injury news, lineups, and late money. To learn whether the model knows something the market does not, we compared it to the actual price.

The Real Test

Against the closing line

The closing line is the betting price just before first pitch. By the time it closes, the price has absorbed injury news, lineup cards, weather, and late money from well-informed bettors. A real edge shows up against the close or it does not exist.

We bought historical odds snapshots taken twelve minutes before first pitch for 354 test games, removed the bookmaker’s margin to recover the market’s implied win probability, and scored both forecasts side by side.
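One common way to remove the margin is proportional normalization, sketched below in Python. We are presenting it as an assumption for illustration; other de-vigging methods exist, and the function name and inputs are hypothetical.

```python
def implied_probs(home_decimal, away_decimal):
    """Convert two-way decimal odds into margin-free win probabilities.

    Assumption: the bookmaker's margin is spread proportionally across
    both sides, so dividing by the total removes it.
    """
    raw_home = 1.0 / home_decimal
    raw_away = 1.0 / away_decimal
    total = raw_home + raw_away  # exceeds 1.0 by the bookmaker's margin
    return raw_home / total, raw_away / total


# Example: home at 1.80, away at 2.10. The raw inverses sum to about 1.032.
home_p, away_p = implied_probs(1.80, 2.10)
print(round(home_p, 3), round(away_p, 3))  # 0.538 0.462
```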

Measure                          Our model    Market close
Picks the winner                 56.2%        57.6%
Brier score (lower is better)    0.244        0.242
Log loss (lower is better)       0.681        0.676

354 games with real closing prices from the held-out test window. The market leads by 1.4 points on winners, 0.002 on Brier score, and 0.005 on log loss. The model still does not out-predict the market on raw accuracy.
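For readers who want the three measures spelled out, here is a minimal Python sketch using their standard definitions. The function names are ours for illustration; `probs` is the forecast home-win probability and `outcomes` is 1 when the home team won, 0 otherwise.

```python
import math


def accuracy(probs, outcomes):
    """Share of games where the favored side (p >= 0.5 means home) won."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(probs)


def brier_score(probs, outcomes):
    """Mean squared gap between forecast probability and the 0/1 result."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood; punishes confident misses hardest."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

As a reference point, a forecast of 50 percent on every game scores a Brier score of 0.250 and a log loss of about 0.693, so both columns in the table sit only slightly below the coin-flip level.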

The market picks winners better and prices probabilities more accurately. That part has not changed. But accuracy is not the only way to beat a market. A model can be wrong more often and still make money if it is right about which games it is wrong about. That is where the next part comes in.

What We Borrowed

A shrinkage trick from an open-source baseball project

While evaluating a GitHub repository called RPCBaseball, we found a technique worth borrowing. The project, built by proselotis and msc123123, blends simulated win probabilities with observed frequencies using a simple Bayesian formula:

final estimate = (model estimate × sample size + prior × weight) ÷ (sample size + weight)

The idea is straightforward. Early in the season, you have very few games to learn from, so the model’s predictions are noisy. Instead of trusting them fully, you blend each prediction with a league-average prior. As the season goes on and more games pile up, the model earns more trust and the prior fades into the background.

We applied this in two places. First, we regressed the rolling team and pitcher stats toward the league average before feeding them to the model, so a team that is 4–0 does not look like a juggernaut. Second, we applied the same shrinkage to the model’s final output: early-season predictions get pulled toward the league-average home win rate of about 53 percent, and late-season predictions are left alone.
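The formula above translates directly into a small Python function. The prior weights used in the examples (20 and 30) and the sample counts are assumptions chosen for illustration, not the values used in the model or in RPCBaseball.

```python
def shrink(estimate, sample_size, prior, weight):
    """Blend an estimate with a prior; the prior counts as `weight` pseudo-games."""
    return (estimate * sample_size + prior * weight) / (sample_size + weight)


# A 4-0 team: raw win% is 1.000, but four games is very little evidence.
# With a league-average prior of 0.500 and an assumed weight of 20:
print(round(shrink(1.0, 4, 0.5, 20), 3))  # 0.583

# An early-season prediction of 0.65 pulled toward a 0.53 home-win prior,
# assuming 10 games of evidence and a prior weight of 30:
print(round(shrink(0.65, 10, 0.53, 30), 3))  # 0.56
```

As the sample size grows relative to the weight, the output approaches the raw estimate, which is the "prior fades into the background" behavior described above.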

Credit

The Bayesian shrinkage formula was adapted from RPCBaseball, an open-source MLB pitch-level analytics project by proselotis, msc123123, and contributors. The original formula appears in probcalc.R and blends simulated win probabilities with empirical frequencies. We adapted it for pre-game prediction and feature engineering. The RPCBaseball repository is private but was shared with us for evaluation.

The Result

The edge got bigger

Before the shrinkage, betting only on games where the model disagreed with the market by at least five percentage points returned about 10 percent on flat stakes. After the shrinkage, the same threshold returned about 14 percent. The technique did not make the model smarter — its accuracy and Brier score barely moved. What it did was make the model less overconfident, so it stopped betting on marginal games where its edge was illusory.

Model version                     ROI at 0% edge    ROI at 5% edge    ROI at 8% edge
Original model (no shrinkage)     −0.3%             +9.9%             +1.7%
With prediction shrinkage only    −0.1%             +14.3%            +1.6%
With regressed features only      +2.9%             +2.3%             +8.9%
Both combined                     +2.8%             +5.9%             +11.0%

354 games, flat $1 stakes, best available closing decimal prices. “Edge” is the gap between the model’s probability and the market’s implied probability. Positive ROI means the model beat the close.
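A flat-stake ROI calculation of this shape can be sketched in Python as follows. The field names are hypothetical, and the rule for picking a side (bet home when the model is above the market by the threshold, away when it is below by the threshold) is our assumption for illustration rather than a description of the exact backtest code.

```python
def flat_stake_roi(games, threshold):
    """Return profit per $1 staked, betting only when |edge| >= threshold.

    Each game is a dict with hypothetical keys:
      model_prob  - model's home-win probability
      market_prob - margin-free market home-win probability
      home_odds, away_odds - decimal prices at the close
      home_won    - bool
    Returns None when no game clears the threshold.
    """
    profit, bets = 0.0, 0
    for g in games:
        edge = g["model_prob"] - g["market_prob"]
        if edge >= threshold:
            odds, won = g["home_odds"], g["home_won"]
        elif -edge >= threshold:
            odds, won = g["away_odds"], not g["home_won"]
        else:
            continue  # edge too small: no bet
        bets += 1
        # A winning $1 bet returns (odds - 1) in profit; a loss costs the stake.
        profit += (odds - 1.0) if won else -1.0
    return profit / bets if bets else None
```

Raising the threshold shrinks the number of bets, which is why the high-threshold columns rest on the fewest games.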

The clearest gain came from the prediction-level shrinkage alone: +14.3 percent at the five-percent edge, up from +9.9 percent. The combined version, regressed features plus prediction shrinkage, is the most consistent: its returns are positive at every threshold we tested and rise with the threshold (+2.8, +5.9, and +11.0 percent).

The Tempting Part

Why we still do not sell picks

The numbers above look better than they did before, and the improvement came from a principled technique, not from tweaking thresholds until the backtest looked good. We are still not selling picks, for the same reasons as before:

  • The model still loses to the market on raw accuracy and calibration. The shrinkage improved the value-betting edge, not the probability quality.
  • The sample is thin. The backtest covers 354 priced games from one partial season, and the edge filter leaves fewer bets than that. That is not enough to separate a real edge from a favorable run.
  • The headline ROI concentrates at one threshold. The +14.3 percent figure lives at the five-percent edge; at the zero and eight percent thresholds the shrinkage-only model returns −0.1 and +1.6 percent, while the combined model does better at both. That kind of sensitivity to a single parameter is a yellow flag.

We treat a backtest that needs the right slice to look good as a warning, even when the improvement comes from a sound idea borrowed from people who know baseball.

What this is for

This page is a worked example of how we test a predictive claim before a client acts on it. We built the model, gave it real features, tested it out of sample, put it against a benchmark it could lose to, and reported honestly. Then we found a technique in an open-source project, adapted it, measured the gain, and reported that too — including the caveats. A predictive model that has not faced a benchmark it can lose to is an unpriced risk. If your team relies on a demand forecast, a churn score, or a pricing model, we run this same test before the model touches your budget. That discipline is the service we sell.

Notes & Credits

Game data comes from the MLB Stats API, team strength from FanGraphs WAR and Baseball Savant park factors, preseason projections from FanGraphs Steamer, and closing prices from The Odds API historical endpoint. The Bayesian shrinkage technique was adapted from RPCBaseball by proselotis, msc123123, and contributors — an open-source MLB analytics project whose probcalc.R blends simulated and empirical win probabilities using a sample-size-weighted prior. The value-bet figures use best available decimal prices at the close, flat stakes, and no adjustment for betting limits or line movement after the snapshot. This page is informational. It does not offer betting advice, and we do not sell picks.