A factor model for identifying mispriced lines across the entire college basketball season - built on the premise that structural inefficiencies exist in CBB that simply don't exist in major professional markets.
Years ago in a finance class, I learned about the Fama-French five-factor model - the framework that explains stock returns through market risk, size, value, profitability, and investment patterns. It stuck with me, not because of the specific factors, but because of the underlying idea: you can decompose a complex market into a small number of structural factors that explain why prices are what they are, and more importantly, when they're wrong.
The stock market has gotten efficient enough that the premia Fama-French identified are much harder to harvest than they once were. Algorithmic trading, high-frequency market makers, and institutional capital have arbitraged away much of the inefficiency those factors once pointed to. But the core insight - that markets have structural biases you can identify and measure - always felt like it should carry over to less efficient markets.
Sports betting is the obvious parallel. But the major markets - NFL, NBA, etc. - have gotten extremely sharp. The closing line on a Sunday NFL game is about as close to a true probability estimate as a betting market gets. Thousands of sharp bettors, syndicate operations, and the books' own sophisticated models have hammered those lines into near-perfect efficiency. You're not beating that with a logistic regression and some vibes.
College basketball exists in a weird, fascinating space. The betting public knows maybe 30-40 teams by reputation. A 15-0 mid-major that hasn't played a single Quad 1 opponent is essentially a black box - the market has to assign them a number, but it's guessing in a way it never has to guess about the Kansas City Chiefs.
Are there less efficient betting markets out there? Sure - Vietnamese table tennis, Kazakhstani handball, obscure European futsal leagues. But I'm genuinely not interested in those sports, which matters more than you'd think. Building a model you actually care about is what keeps you honest when the results are disappointing. College basketball strikes the right balance: niche enough that structural inefficiencies plausibly exist, mainstream enough that I'm watching obsessively every March regardless.
There are roughly 360 Division I teams playing roughly 5,400 games per season. The books simply cannot devote the same modeling depth to a Tuesday night Southern Conference game that they give to Duke-UNC on Saturday afternoon.
So: I'm building a Fama-French style factor model for college basketball - regular season and tournament alike. Instead of market beta, size, and value explaining stock returns, I have market-derived factors explaining when the line is structurally mispriced and in which direction.
The model doesn't simulate basketball. It doesn't care about Xs and Os or recruiting rankings. It asks one question: given the structural features of this specific game - who's playing, when in the season, how much attention the market pays to these teams, how sharp money has moved - is the line structurally mispriced, and in which direction?
| Tier | Finance Factor | CBB Factor | What It Captures |
|---|---|---|---|
| 1 | Market Risk Premium (β) | MKTVAL | Gap between the KenPom-implied win probability and the market-implied win probability. The central edge estimate. |
| 1 | Momentum (UMD) | DRIFT | Teams whose against-the-spread (ATS) performance diverges from market perception - the market hasn't caught up yet. |
| 1 | Sentiment / Behavioral | PUB | Public betting percentage imbalance. Contrarian signal - heavy public sides get inflated. |
| 2 | Size (SMB) | SZE | Market attention composite. Less-watched teams have softer, less-informed lines. |
| 2 | Information Quality | OPACITY | Schedule legibility. A 15-0 team with zero Quad 1 (Q1) games is a black box the market can't price confidently. |
| 3 | Profitability (RMW) | Style Axes | Pace, 3PT orientation, and rebounding mismatches - signed differentials so the model learns directional effects. |
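
As a rough illustration of the MKTVAL row, here is a minimal sketch of how a model-versus-market probability gap could be computed from American moneyline odds. The function names, the proportional vig-removal method, and the example numbers are all assumptions for illustration, not the pipeline's actual code.

```python
from typing import Tuple


def american_to_implied(odds: int) -> float:
    """Convert American moneyline odds to a raw implied probability (vig included)."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)


def devig(p_home_raw: float, p_away_raw: float) -> Tuple[float, float]:
    """Strip the bookmaker margin by proportional normalization (one of several methods)."""
    total = p_home_raw + p_away_raw
    return p_home_raw / total, p_away_raw / total


def mktval(model_p_home: float, home_odds: int, away_odds: int) -> float:
    """Signed gap: positive means the model rates the home side higher than the market does."""
    p_home, _ = devig(american_to_implied(home_odds), american_to_implied(away_odds))
    return model_p_home - p_home


# Hypothetical game: the model gives the home team 62%, the market prices it -150 / +130.
gap = mktval(0.62, -150, 130)
print(f"MKTVAL = {gap:+.3f}")
```

The sign convention (model minus market, from the home team's perspective) is arbitrary; what matters is that it stays consistent across every game in the dataset.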
Factors are organized into three tiers. Tier 1 runs on every game - these are the universal base signals that always have something to say. Tier 2 provides contextual modifiers that explain when and why Tier 1 errors become systematic. Tier 3 activates only under specific conditions - pace mismatch only fires when the tempo gap exceeds one standard deviation, for instance.
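The Tier 3 gating rule might look something like the sketch below. The one-standard-deviation threshold comes from the description above; measuring it against the spread of team tempos, the function name, and the sample numbers are assumptions.

```python
from statistics import pstdev
from typing import List


def pace_mismatch(home_tempo: float, away_tempo: float,
                  league_tempos: List[float], k: float = 1.0) -> float:
    """Return the signed tempo gap only when it exceeds k standard deviations of
    league tempo; otherwise return 0.0 so the factor stays silent."""
    threshold = k * pstdev(league_tempos)
    gap = home_tempo - away_tempo
    return gap if abs(gap) > threshold else 0.0


# Hypothetical tempo values (possessions per 40 minutes) for a tiny five-team league.
tempos = [64.0, 66.0, 68.0, 70.0, 72.0]
print(pace_mismatch(72.0, 64.0, tempos))  # large gap: factor fires
print(pace_mismatch(68.0, 66.0, tempos))  # small gap: factor stays at zero
```

Returning zero below the threshold means a regularized linear model only sees the factor in the games where it is supposed to matter.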
Sitting above all of this is a regime variable called PHASE, which doesn't act as a standalone factor but instead modulates the coefficients of everything else depending on where we are in the season. Early in the year, team profiles are uncertain and the market hasn't had enough games to calibrate - so the model leans harder on the analytical baseline (KenPom) and trusts the market less. During conference play, the market has seen plenty of each team and is generally well-calibrated, so that balance flips. In the tournament specifically, PHASE amplifies the public-money and opacity signals - these are the conditions where behavioral biases and information gaps tend to produce the most systematic mispricings.
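One common way to let a regime variable modulate coefficients instead of acting as its own factor is to interact it with every base factor. The sketch below assumes three phase labels and a plain dictionary of factor values; both are illustrative choices, not a description of the actual feature pipeline.

```python
from typing import Dict

PHASES = ("early", "conference", "tournament")  # assumed labels


def phase_interactions(factors: Dict[str, float], phase: str) -> Dict[str, float]:
    """Expand each factor into one column per phase; only the active phase is non-zero.

    A linear model fit on these columns learns a separate coefficient for each
    (factor, phase) pair, so the weight on a factor can differ by time of season.
    """
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    return {
        f"{name}_x_{p}": (value if p == phase else 0.0)
        for name, value in factors.items()
        for p in PHASES
    }


# Hypothetical tournament game with two Tier 1 factor values.
row = phase_interactions({"MKTVAL": 0.04, "PUB": -0.10}, "tournament")
print(row)
```

The cost is three times as many coefficients, which is part of why regularization and out-of-sample checks matter for a design like this.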
Style mismatch was originally one unsigned Euclidean distance score summarizing how stylistically different two teams are. The problem: a distance of 0.7 tells you "these teams are different" but nothing about who benefits from the mismatch. A fast team playing a slow team creates a different game than the reverse, and the market prices them differently too. Replacing the unsigned distance with three signed differentials (pace, 3PT orientation, rebounding) lets the model learn directional effects and discover asymmetries the scalar version was hiding.
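To make the difference concrete, here is a small sketch comparing the two encodings. It assumes each team's style is stored as standardized (z-score) values on the three axes; the key names and the sample teams are hypothetical.

```python
import math
from typing import Dict

AXES = ("pace", "three_rate", "reb")  # assumed keys for the three style axes


def unsigned_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Euclidean distance between two style profiles; identical if the teams are swapped."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in AXES))


def signed_differentials(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Per-axis difference (team a minus team b); the sign flips if the teams are swapped."""
    return {f"{k}_diff": a[k] - b[k] for k in AXES}


fast = {"pace": 1.0, "three_rate": 0.0, "reb": 0.0}
slow = {"pace": -1.0, "three_rate": 0.0, "reb": 0.0}

# The scalar cannot tell the two orderings apart...
print(unsigned_distance(fast, slow), unsigned_distance(slow, fast))
# ...while the signed version keeps the direction of the mismatch.
print(signed_differentials(fast, slow)["pace_diff"],
      signed_differentials(slow, fast)["pace_diff"])
```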
The academic literature on sports betting models is mostly a graveyard of overfit backtests. The guardrail stack is designed to be brutal about what actually survives out-of-sample.
| Layer | Method | What It Catches |
|---|---|---|
| 1 | Elastic Net regularization | Weak and noisy factors |
| 2 | Walk-forward temporal CV | Non-persistent patterns and information leakage |
| 3 | Bootstrap stability (1,000 resamples) | Unstable coefficient estimates |
| 4 | Effect size minimum (1.5 percentage points) | Real but too-small-to-matter factors |
| 5 | Distribution drift monitoring | Changing market conditions |
| 6 | Bankroll simulation (1,000 bootstrap seasons) | Overall profitability reality check |
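
For layer 2, a walk-forward split can be sketched as follows: always train on earlier seasons and test on the next one, so no future information leaks into the fit. Splitting at season boundaries, the function name, and the toy season labels are assumptions; the real framework may split at a finer grain.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(seasons: List[int],
                        min_train: int = 1) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs: train on all earlier seasons, test on the next."""
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        train_seasons = set(ordered[:i])
        test_season = ordered[i]
        train_idx = [j for j, s in enumerate(seasons) if s in train_seasons]
        test_idx = [j for j, s in enumerate(seasons) if s == test_season]
        yield train_idx, test_idx


# Toy example: six games labeled by season.
game_seasons = [2021, 2021, 2022, 2022, 2023, 2023]
splits = list(walk_forward_splits(game_seasons))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Unlike shuffled k-fold, every test fold here sits strictly after its training data in time, which is the property that exposes patterns that do not persist.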
The closing line is approximately efficient. The academic literature suggests meaningful predictive edge exists only in narrow windows, under specific conditions. Whether this model finds any of those windows is an open question - and honestly the most interesting part of building it.
The data pipeline is complete, the factor architecture is designed, and the backtesting framework is ready.
Full results, backtesting analysis, and honest post-mortem coming once the 2021-2025 data is complete.