Bringing Fama-French to March Madness

The Idea

Years ago in a finance class, I learned about the Fama-French five-factor model - the framework that explains stock returns through market risk, size, value, profitability, and investment patterns. It stuck with me, not because of the specific factors, but because of the underlying idea: you can decompose a complex market into a small number of structural factors that explain why prices are what they are, and more importantly, when they're wrong.

The stock market has gotten efficient enough that the return premia Fama-French documented are much harder to trade on now. Algorithmic trading, high-frequency market makers, and institutional capital have arbitraged away much of the mispricing that factor investors once exploited. But the core insight - that markets have structural biases you can identify and measure - always felt like it should carry over to less efficient markets.

Sports betting is the obvious parallel. But the major markets - NFL, NBA, etc. - have gotten extremely efficient. The closing line on a Sunday NFL game is about as close to a true probability estimate as a betting market gets. Thousands of sharp bettors, syndicate operations, and the books' own sophisticated models have hammered those lines into near-perfect efficiency. You're not beating that with a logistic regression and some vibes.

College basketball exists in a weird, fascinating space. The betting public knows maybe 30-40 teams by reputation. A 15-0 mid-major that hasn't played a single Quad 1 opponent is essentially a black box - the market has to assign them a number, but it's guessing in a way it never has to guess about the Kansas City Chiefs.

Are there less efficient betting markets out there? Sure - Vietnamese table tennis, Kazakhstani handball, obscure European futsal leagues. But I'm genuinely not interested in those sports, which matters more than you'd think. Building a model you actually care about is what keeps you honest when the results are disappointing. College basketball strikes the right balance: niche enough that structural inefficiencies plausibly exist, mainstream enough that I'm watching obsessively every March regardless.

The inefficiency window

There are 363 Division I teams playing roughly 5,400 games per season. The books simply cannot devote the same modeling depth to a Tuesday night Southern Conference game that they give to Duke-UNC on Saturday afternoon.

So: I'm building a Fama-French-style factor model for college basketball - regular season and tournament alike. Where Fama-French uses market beta, size, and value to explain stock returns, I use market-derived factors to explain when the line is structurally mispriced and in which direction.

The Factor Architecture

The model doesn't simulate basketball. It doesn't care about Xs and Os or recruiting rankings. It asks one question: given the structural features of this specific game - who's playing, when in the season, how much attention the market pays to these teams, how sharp money has moved - is the line structurally mispriced, and in which direction?

Factor mapping - finance to CBB

Tier	Finance Factor	CBB Factor	What It Captures
1	Market Risk Premium (β)	MKTVAL	Gap between KenPom implied probability and market implied probability. The central edge estimate.
1	Momentum (UMD)	DRIFT	Teams whose ATS performance diverges from market perception - the market hasn't caught up yet.
1	Sentiment / Behavioral	PUB	Public betting percentage imbalance. Contrarian signal - heavy public sides get inflated.
2	Size (SMB)	SZE	Market attention composite. Less-watched teams have softer, less-informed lines.
2	Information Quality	OPACITY	Schedule legibility. A 15-0 team with zero Q1 games is a black box the market can't price confidently.
3	Profitability (RMW)	Style Axes	Pace, 3PT orientation, and rebounding mismatches - signed differentials so the model learns directional effects.
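
To make the MKTVAL row concrete, here is a minimal sketch of the gap between a model probability and the market's implied probability. It assumes the market side comes from a two-way American moneyline with the vig removed by simple normalization; the function names and the 0.62 model probability are illustrative, not taken from the actual codebase.

```python
def american_to_prob(odds: int) -> float:
    """Raw implied probability of an American moneyline (vig still included)."""
    if odds < 0:
        return -odds / (-odds + 100.0)
    return 100.0 / (odds + 100.0)


def fair_probs(odds_a: int, odds_b: int) -> tuple[float, float]:
    """Strip the vig by rescaling both raw probabilities to sum to 1.

    Proportional normalization is an assumption; other de-vig methods exist.
    """
    raw_a = american_to_prob(odds_a)
    raw_b = american_to_prob(odds_b)
    total = raw_a + raw_b
    return raw_a / total, raw_b / total


def mktval(model_prob_a: float, odds_a: int, odds_b: int) -> float:
    """Model probability minus vig-free market probability for team A.

    Positive means the model likes team A more than the market does.
    """
    market_prob_a, _ = fair_probs(odds_a, odds_b)
    return model_prob_a - market_prob_a


if __name__ == "__main__":
    # Hypothetical game: team A at -150, team B at +130, model says 62% for A.
    print(f"MKTVAL = {mktval(0.62, -150, 130):+.4f}")
```

If only a point spread is available, the spread would first have to be converted to a win probability, which needs its own assumption about the distribution of final margins.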

Factors are organized into three tiers. Tier 1 runs on every game - these are the universal base signals that always have something to say. Tier 2 provides contextual modifiers that explain when and why Tier 1 errors become systematic. Tier 3 activates only under specific conditions - pace mismatch only fires when the tempo gap exceeds one standard deviation, for instance.
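
The Tier 3 gating can be sketched like this. I'm assuming here that "one standard deviation" means the standard deviation of tempo across all teams, and that the factor is zero until the gap clears that bar; `pace_mismatch` and the sample tempos are hypothetical.

```python
from statistics import pstdev


def pace_mismatch(tempo_a: float, tempo_b: float,
                  league_tempos: list[float],
                  threshold_sd: float = 1.0) -> float:
    """Signed pace gap in standard-deviation units, gated at a threshold.

    Returns 0.0 (factor inactive) unless the absolute tempo gap exceeds
    threshold_sd league standard deviations.
    """
    sd = pstdev(league_tempos)
    gap = tempo_a - tempo_b
    if sd == 0 or abs(gap) <= threshold_sd * sd:
        return 0.0
    return gap / sd


if __name__ == "__main__":
    league = [64.0, 66.0, 68.0, 70.0, 72.0]  # made-up possessions per game
    print(pace_mismatch(68.0, 66.0, league))  # small gap: factor stays off
    print(pace_mismatch(72.0, 64.0, league))  # large gap: factor fires
```

Keeping the sign on the gated value means the factor still says which team is the faster one when it does fire.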

Sitting above all of this is a regime variable called PHASE, which doesn't act as a standalone factor but instead modulates the coefficients of everything else depending on where we are in the season. Early in the year, team profiles are uncertain and the market hasn't had enough games to calibrate - so the model leans harder on the analytical baseline (KenPom) and trusts the market less. During conference play, the market has seen plenty of each team and is generally well-calibrated, so that balance flips. In the tournament specifically, PHASE amplifies the public-money and opacity signals - these are the conditions where behavioral biases and information gaps tend to produce the most systematic mispricings.
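
One common way to let a regime variable modulate coefficients in a linear model is with interaction terms: each factor gets a baseline column plus one extra column per non-baseline phase. The sketch below assumes three phases with "early" as the baseline; the phase names and the `factor:phase` column naming are my illustration, not necessarily how the codebase encodes PHASE.

```python
PHASES = ("early", "conference", "tournament")  # "early" is the baseline


def expand_with_phase(factors: dict[str, float], phase: str) -> dict[str, float]:
    """Add factor-by-phase interaction columns to one game's feature row.

    A fitted model then has an effective coefficient of
    beta[factor] + beta[factor:phase] during a non-baseline phase.
    """
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    row = dict(factors)
    for p in PHASES[1:]:
        for name, value in factors.items():
            row[f"{name}:{p}"] = value if phase == p else 0.0
    return row


if __name__ == "__main__":
    game = {"MKTVAL": 0.04, "PUB": -0.20}  # made-up factor values
    for key, value in expand_with_phase(game, "tournament").items():
        print(f"{key:20s} {value:+.2f}")
```

With this encoding, a regularized fit can shrink any phase adjustment it can't justify to zero, leaving just the baseline effect.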

Key design decision

Style mismatch was originally an unsigned Euclidean distance score - a single number representing how stylistically different two teams are. The problem: a distance of 0.7 tells you "these teams are different" but nothing about who benefits from the mismatch. A fast team playing a slow team creates a different game than the reverse, and the market prices them differently too. Replacing the unsigned distance with three signed differentials (pace, 3PT orientation, rebounding) lets the model learn directional effects and discover asymmetries the scalar version was hiding.
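
A small sketch of the difference, assuming each team's style is stored as standardized scores on the three axes (the axis keys and numbers are made up): swapping the two teams leaves the distance unchanged but flips every signed differential.

```python
import math

AXES = ("pace", "three_pt", "rebounding")  # hypothetical axis keys


def unsigned_distance(team_a: dict[str, float], team_b: dict[str, float]) -> float:
    """Euclidean distance between style profiles; blind to direction."""
    return math.sqrt(sum((team_a[k] - team_b[k]) ** 2 for k in AXES))


def signed_differentials(team_a: dict[str, float], team_b: dict[str, float]) -> dict[str, float]:
    """Per-axis difference, team A minus team B; the sign says who leads."""
    return {k: team_a[k] - team_b[k] for k in AXES}


if __name__ == "__main__":
    fast = {"pace": 1.2, "three_pt": -0.3, "rebounding": 0.5}
    slow = {"pace": -0.8, "three_pt": 0.4, "rebounding": 0.1}
    print(unsigned_distance(fast, slow), unsigned_distance(slow, fast))
    print(signed_differentials(fast, slow))
    print(signed_differentials(slow, fast))
```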

Guardrails

The academic literature on sports betting models is mostly a graveyard of overfit backtests. The guardrail stack is designed to be brutal about what actually survives out-of-sample.

Layer	Method	What It Catches
1	Elastic Net regularization	Weak and noisy factors
2	Walk-forward temporal CV	Non-persistent patterns and information leakage
3	Bootstrap stability (1,000 resamples)	Unstable coefficient estimates
4	Effect size minimum (1.5 ppt)	Real but too-small-to-matter factors
5	Distribution drift monitoring	Changing market conditions
6	Bankroll simulation (1,000 bootstrap seasons)	Overall profitability reality check
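
Layer 2 is the one most worth spelling out, since ordinary shuffled cross-validation lets future games leak into training. Here is a minimal sketch of expanding-window walk-forward splits by season; the two-season minimum and the `season` key on each row are assumptions for illustration.

```python
from typing import Iterator


def walk_forward_splits(seasons: list[int],
                        min_train: int = 2) -> Iterator[tuple[list[int], int]]:
    """Yield (training seasons, test season) pairs in time order.

    Every test season is strictly later than all of its training seasons,
    so nothing from the future can leak into the fit.
    """
    ordered = sorted(set(seasons))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


def split_rows(rows: list[dict], train_seasons: list[int],
               test_season: int) -> tuple[list[dict], list[dict]]:
    """Partition game rows (each with a 'season' key) for one fold."""
    train = [r for r in rows if r["season"] in train_seasons]
    test = [r for r in rows if r["season"] == test_season]
    return train, test


if __name__ == "__main__":
    for train_seasons, test_season in walk_forward_splits([2021, 2022, 2023, 2024, 2025]):
        print(f"train on {train_seasons}, test on {test_season}")
```

Any scaling or standardization would need to be fit inside each fold on the training seasons only, for the same reason.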

The closing line is approximately efficient. The academic literature suggests meaningful predictive edge exists only in narrow windows, under specific conditions. Whether this model finds any of those windows is an open question - and honestly the most interesting part of building it.
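
For the bankroll simulation in Layer 6, a rough sketch: resample the backtested bets with replacement to build 1,000 synthetic seasons and look at the spread of outcomes. This assumes flat one-unit stakes at standard -110 pricing and treats each bet as a simple win or loss; the real staking rules may differ.

```python
import random


def simulate_seasons(bet_results: list[bool], n_seasons: int = 1000,
                     payout: float = 100.0 / 110.0, seed: int = 0) -> list[float]:
    """Bootstrap season-level profit in units from a list of win/loss results.

    Each synthetic season draws len(bet_results) bets with replacement.
    A win pays `payout` units, a loss costs 1 unit.
    """
    rng = random.Random(seed)
    n_bets = len(bet_results)
    profits = []
    for _ in range(n_seasons):
        season = rng.choices(bet_results, k=n_bets)
        profits.append(sum(payout if won else -1.0 for won in season))
    return profits


def summarize(profits: list[float]) -> dict[str, float]:
    """Median, 5th-percentile, and share of profitable simulated seasons."""
    ordered = sorted(profits)
    return {
        "median_units": ordered[len(ordered) // 2],
        "p05_units": ordered[int(0.05 * len(ordered))],
        "share_profitable": sum(p > 0 for p in profits) / len(profits),
    }


if __name__ == "__main__":
    # Made-up backtest record: 55 wins and 45 losses.
    record = [True] * 55 + [False] * 45
    print(summarize(simulate_seasons(record)))
```

The 5th-percentile season is the number to watch: a positive median with a deeply negative tail means the edge may not survive a realistic bankroll.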

What's Built, What's Next

The data pipeline is complete, the factor architecture is designed, and the backtesting framework is ready. Whether the model actually finds an exploitable edge is still an open question.

Factor architecture and theoretical framework
Tiered model design (Tier 1/2/3, regime variable, interaction terms)
Guardrail stack (bootstrap, temporal CV, effect size, bankroll sim)
Google Sheets database - 10 tabs, 363 D1 teams
KenPom data imported (2021-2026, ratings, style, roster continuity)
2024 tournament data seeded with scores and closing spreads
Model codebase v1 - Elastic Net logistic regression with all guardrails
Historical tournament data for 2021-2023 and 2025
Odds API integration for historical line pulls
Feature computation engine
Backtest on 2021-2025 tournament data
Bootstrap stability testing to prune unreliable factors
Live predictions for 2027 tournament

Full results, backtesting analysis, and honest post-mortem coming once the 2021-2025 data is complete.