A deep dive into how we designed the 8-step data pipeline that transforms raw Statcast pitch-level telemetry into a 3D galaxy of pitcher archetypes, covering SV reclassification, K-Means clustering, hitter-vs-cluster matchup matrices, and the Three.js cosmos visualization.
The MLB system is an 8-step sequential pipeline built in Python, orchestrated by run_all.py. Each step reads the previous step's output and writes its own artifacts. The entire pipeline can be resumed from any step with --from N.
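A minimal sketch of how a `--from N` resume flag could work. This is not the actual `run_all.py`: the function names are hypothetical, and the only details taken from the text are the 8 sequential steps and the `--from` flag.

```python
import argparse

TOTAL_STEPS = 8  # the text describes an 8-step sequential pipeline


def steps_to_run(from_step, total=TOTAL_STEPS):
    """Return the 1-based step numbers to execute when resuming at from_step."""
    if not 1 <= from_step <= total:
        raise ValueError(f"--from must be between 1 and {total}, got {from_step}")
    return list(range(from_step, total + 1))


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run pipeline steps in order.")
    # "from" is a Python keyword, so the value is stored as args.from_step.
    parser.add_argument("--from", dest="from_step", type=int, default=1)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    for n in steps_to_run(args.from_step):
        # Hypothetical: a real runner would dispatch to the step's function here.
        print(f"running step {n}")
```

Because each step reads the previous step's artifacts from disk, resuming at step N only requires that the outputs of steps 1 through N-1 already exist.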
pybaseball. Data is chunked by month (Mar-Oct) with retry logic and polite 2s delays. Each season yields ~700K pitches across 59 columns including release speed, spin rate, pitch movement (pfx_x/pfx_z), plate location, and batted ball outcomes. Saved as compressed Parquet files (~150MB/season).
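The monthly chunking with retries and a fixed delay can be sketched as below. The `fetch` argument is a hypothetical callable standing in for the real download call (for example, a thin wrapper around `pybaseball.statcast(start_dt, end_dt)`), and the retry count of 3 is an assumption; only the March-October window and the 2-second delay come from the text.

```python
import calendar
import time
from datetime import date


def month_chunks(season, first_month=3, last_month=10):
    """Yield (start, end) ISO date strings for each month, March through October."""
    for month in range(first_month, last_month + 1):
        last_day = calendar.monthrange(season, month)[1]
        yield (date(season, month, 1).isoformat(),
               date(season, month, last_day).isoformat())


def fetch_with_retry(fetch, start, end, retries=3, delay=2.0, sleep=time.sleep):
    """Call fetch(start, end); on failure, wait and retry up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(start, end)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(delay)


def fetch_season(fetch, season, delay=2.0, sleep=time.sleep):
    """Fetch one season month by month, pausing between requests."""
    chunks = []
    for start, end in month_chunks(season):
        chunks.append(fetch_with_retry(fetch, start, end, delay=delay, sleep=sleep))
        sleep(delay)  # polite pause before the next request
    return chunks
```

Passing `sleep` as a parameter keeps the function testable without waiting on real delays.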
is_sp flag used downstream for role-aware archetype naming.
SV Reclassification: Statcast's "SV" (slurve) classification is inconsistent across seasons. We built a per-pitcher mapping that examines career-average SV velocity and vertical break to reclassify each pitcher's SV as curveball (pfx_z < -0.50), slider (speed > 84 mph), or sweeper (everything else). This ensures clustering stability across the 2015-2026 dataset.
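The rule above can be written as a small function. This is a sketch, not the project's code: the function name is hypothetical, and checking vertical break before velocity is an assumption, since the text does not say which test wins when both apply.

```python
def reclassify_sv(avg_speed_mph, avg_pfx_z):
    """Map a pitcher's career-average SV profile to a replacement pitch label.

    Thresholds come from the text; the order of the checks is an assumption.
    """
    if avg_pfx_z < -0.50:
        return "curveball"
    if avg_speed_mph > 84.0:
        return "slider"
    return "sweeper"


# Hypothetical usage: build the per-pitcher mapping from career averages.
career_sv = {
    "pitcher_a": (79.5, -0.80),  # (avg speed in mph, avg pfx_z)
    "pitcher_b": (86.1, 0.10),
    "pitcher_c": (81.0, 0.05),
}
sv_mapping = {name: reclassify_sv(*profile) for name, profile in career_sv.items()}
```

Computing the mapping once per pitcher, rather than per pitch, is what keeps a pitcher's label the same from one season to the next.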
Separate RHP/LHP Clustering: Rather than clustering all pitchers together, we split by handedness first. This prevents the dominant handedness signal from overwhelming the pitch-mix features. Each hand gets its own StandardScaler, K-Means model, and PCA projection. The X-axis offset (+5/-5) in PCA space creates the visual "galaxy" separation in the Atlas view.
Medoid over Centroid: Archetype representatives are chosen as the geometric medoid (the real pitcher that minimizes total distance to all cluster members), not the mathematical centroid. This means every archetype profile references an actual pitcher's stats, not a phantom average that no real pitcher matches.
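A minimal illustration of medoid selection, assuming Euclidean distance in the scaled feature space (the text does not name the metric). The function name and sample points are hypothetical.

```python
import math


def medoid_index(points):
    """Return the index of the point with the smallest total distance to all members."""
    def total_distance(i):
        return sum(math.dist(points[i], other) for other in points)
    return min(range(len(points)), key=total_distance)


# Five pitchers in a one-feature cluster; the last one is an outlier.
cluster = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
rep = medoid_index(cluster)  # index of a real cluster member
centroid = sum(p[0] for p in cluster) / len(cluster)  # 3.2, which matches no member
```

The pairwise computation is quadratic in cluster size, which is generally acceptable for clusters of a few hundred pitchers.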
Zone Location Entropy: The 13-feature zone location layer captures not just where pitchers throw, but how predictable their patterns are. Shannon entropy across a 9-cell grid (3 lateral × 3 vertical) measures location unpredictability, and platoon shift features capture how much a pitcher adjusts against same-side vs opposite-side batters.
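The entropy feature can be sketched as follows, assuming each pitch has already been binned into one of the 9 grid cells (numbered 0-8 here) and that entropy is measured in bits. Both the cell numbering and the log base are assumptions.

```python
import math
from collections import Counter


def zone_entropy(cells):
    """Shannon entropy in bits of a sequence of grid-cell ids."""
    counts = Counter(cells)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A pitcher who spreads pitches evenly over all 9 cells is maximally unpredictable;
# one who always hits the same cell has zero entropy.
even_spread = list(range(9)) * 10
one_spot = [4] * 90
```

With 9 cells, the maximum possible value is log2(9), about 3.17 bits, so the feature is naturally bounded.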
How we built a 4-phase pipeline that ingests NBA player tracking data, classifies coaching schemes via percentile-rank play type analysis, clusters players into position-specific archetypes using weighted K-Means, computes multi-level synergy scores, and generates spread/total predictions against live betting lines.
The NBA SIM operates as a 4-phase CLI pipeline (python main.py [collect|analyze|scores|predict|all]). Each phase builds on the previous, with all data persisted to a 17-table SQLite database.
PlayerCollector pulls teams, rosters, and season stats from nba_api. GameCollector fetches game results. LineupCollector pulls 2-through-5-man lineup combinations with net rating and possession counts (with minimum possession thresholds: 30 for 5-man, 50 for 4-man, 75 for 3-man, 100 for 2-man). PlayTypeCollector calls SynergyPlayTypes for all 11 play types in both offensive and defensive groupings. BoxScoreCollector ingests per-game player stats with 27 columns (points, rebounds, assists, plus advanced metrics like usage rate, true shooting, offensive/defensive rating, PIE). OddsCollector pulls live spreads and totals from The Odds API across multiple bookmakers.
FeatureEngineer builds training matrices from the value scores and team-level features. A GamePredictor trains models for spread and total predictions. A ModelEvaluator backtests by training on season N-1 and evaluating on season N, measuring spread/total accuracy. The generate_frontend.py script produces a self-contained HTML dashboard that fetches live odds, computes consensus lines across bookmakers, grades matchup edges (A/B/C), and displays today's games with full scheme and archetype context.
Percentile-Rank Scheme Classification: Instead of using raw play type frequencies, we rank each team's values against all 30 teams to compute percentile scores (0-1). This ensures meaningful differentiation regardless of season-level shifts in play style trends. A team running 18% isolation isn't inherently "ISO-Heavy" unless they're in the top percentile of the league.
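One way to compute such percentile scores is sketched below. The exact ranking formula is an assumption: here a team's score is the fraction of other teams with a strictly lower value, so the league minimum maps to 0 and the maximum to 1. Team names and frequencies are made up.

```python
def percentile_ranks(values):
    """Map each team to the fraction of other teams with a strictly lower value."""
    n = len(values)
    ranks = {}
    for team, value in values.items():
        below = sum(1 for other in values.values() if other < value)
        ranks[team] = below / (n - 1) if n > 1 else 0.0
    return ranks


# Hypothetical isolation frequencies for a three-team league.
iso_freq = {"AAA": 0.10, "BBB": 0.18, "CCC": 0.25}
iso_pct = percentile_ranks(iso_freq)
```

In this toy league, the 18% isolation team sits at the 0.5 percentile, so it would not earn an "ISO-Heavy" label even though 18% might look high in isolation.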
Position-Weighted Clustering: Not all stats matter equally for every position. Centers are weighted toward blocks and rebounds; guards toward assists and three-point attempts. The POSITION_FEATURE_WEIGHTS dictionary applies multipliers before StandardScaler normalization, ensuring PCA captures position-relevant variance. The K=4 bias (accepting K=4 over K=3 when silhouette delta < 0.05) prevents oversimplification.
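The K=4 bias could look like the sketch below. It assumes silhouette scores have already been computed for each candidate K (for instance with `sklearn.metrics.silhouette_score`) and that "delta" means K=3's score minus K=4's score. The function name is hypothetical.

```python
def choose_k(silhouette_by_k, tolerance=0.05):
    """Pick the K with the best silhouette, preferring K=4 over a narrow K=3 win."""
    best_k = max(silhouette_by_k, key=silhouette_by_k.get)
    if best_k == 3 and 4 in silhouette_by_k:
        # K=3 only wins outright if it beats K=4 by at least the tolerance.
        if silhouette_by_k[3] - silhouette_by_k[4] < tolerance:
            return 4
    return best_k
```

For example, `choose_k({3: 0.42, 4: 0.40, 5: 0.30})` selects 4 because the gap is under 0.05, while a gap of 0.10 would leave K=3 in place.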
Hungarian Algorithm for Label Assignment: Each archetype label (e.g., "Floor General", "Rim Protector") is defined as a z-score direction vector. After clustering, we build a cost matrix scoring how well each cluster centroid matches each label template, then use the Hungarian algorithm for optimal bipartite matching. This guarantees the most appropriate label assignment without manual intervention.
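To show why an optimal matching matters, here is a standard-library sketch that finds the minimum-cost assignment by exhaustive search over permutations. That is a stand-in, not the Hungarian algorithm itself: it reaches the same optimum but only scales to a handful of clusters, and a real implementation would more likely call `scipy.optimize.linear_sum_assignment`. The two-feature templates and the dot-product similarity are assumptions.

```python
from itertools import permutations


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def assign_labels(centroids, templates):
    """Return one label per centroid, maximizing total centroid-template similarity."""
    labels = list(templates)
    # Cost is negative similarity, so minimizing cost maximizes the match.
    cost = [[-dot(c, templates[label]) for label in labels] for c in centroids]
    best = min(
        permutations(range(len(labels)), len(centroids)),
        key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
    )
    return [labels[j] for j in best]


# Hypothetical z-score directions: (assists, blocks).
templates = {"Floor General": (1.0, 0.0), "Rim Protector": (0.0, 1.0)}
centroids = [(2.0, 1.9), (1.8, 0.1)]
result = assign_labels(centroids, templates)
```

A greedy pass would hand "Floor General" to the first cluster because 2.0 is its best score, leaving the second cluster with a poor "Rim Protector" fit. The optimal matching swaps them for a higher total similarity.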
Bayesian Shrinkage in Synergy Scores: Small-sample lineup data is unreliable. A 5-man lineup with 35 possessions and +20 net rating shouldn't dominate a player's value. We apply Bayesian priors that shrink estimates toward league average, with prior strength proportional to data granularity (100 possessions for 5-man, 30 for 2-man). This balances signal extraction with noise reduction.
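A common form of this shrinkage is a possession-weighted average of the observed rating and the prior mean. The formula below is an assumption about how the priors are applied, and the league-average net rating of 0 is also assumed. The prior strengths are the two values stated in the text.

```python
# Prior strength in possessions, as given in the text (3- and 4-man values not stated).
PRIOR_POSSESSIONS = {5: 100, 2: 30}


def shrink_net_rating(observed, possessions, prior_strength, league_avg=0.0):
    """Blend an observed net rating with the league average, weighted by sample size."""
    weight_total = possessions + prior_strength
    return (possessions * observed + prior_strength * league_avg) / weight_total


# The 5-man lineup from the text: 35 possessions at +20 net rating.
shrunk = shrink_net_rating(20.0, 35, PRIOR_POSSESSIONS[5])
```

Under these assumptions, the +20 lineup is pulled down to roughly +5.2, and it would need several hundred possessions before the observed value dominates the prior.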
Feb 19 slate — 7 picks sized by model confidence. Starting bankroll: 1,000 $PP. Target: 25,000 $PP. Game lines + player props with full rationale from the NBA SIM pipeline.
Post All-Star break opener. 11 games on the Feb 19 slate — the model flagged 3 game lines and 4 player props worth sizing. Unit sizing ($PP) is based on confidence score: A-grade = 5U, B-grade = 3U, D-grade = 1U. Starting bankroll 1,000 $PP with a target of 25,000 $PP.
UNIT KEY: A (90-100) = 50 $PP · B (60-89) = 30 $PP · D (40-59) = 10 $PP
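The unit key can be expressed as a lookup, shown here as a sketch with hypothetical function names. It assumes 1U = 10 $PP (implied by 5U = 50 $PP) and that confidence below 40 means no bet, which the key does not state.

```python
UNIT_PP = 10  # assumed: 1 unit = 10 $PP, inferred from A = 5U = 50 $PP
UNITS_BY_GRADE = {"A": 5, "B": 3, "D": 1}


def grade(confidence):
    """Map a 0-100 confidence score to a grade, or None below the lowest band."""
    if confidence >= 90:
        return "A"
    if confidence >= 60:
        return "B"
    if confidence >= 40:
        return "D"
    return None


def stake_pp(confidence):
    """Return the stake in $PP for a confidence score (0 when no grade applies)."""
    g = grade(confidence)
    return UNITS_BY_GRADE[g] * UNIT_PP if g else 0
```

Applied to the three game-line confidence scores in the summary table (100, 65, 46), this reproduces the listed stakes of 50, 30, and 10 $PP.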
▎ GAME LINES — 3 PICKS
BKN @ CLE — CLE -13.5
O/U 228.0. Brooklyn (DS #29, 15-38) at Cleveland (DS #3, 34-21). BKN Spot-Up Heavy w/ Drop-Coverage (Poor) vs CLE PnR-Heavy (Fast) w/ Drop-Coverage (Good). DS gap: CLE 372 vs BKN 256. Lineup data: CLE's Merrill/Tyson/Mitchell 3-man core is +29.2 NET RTG over 21 games — elite floor. Their Mobley/Allen/Mitchell 5-man is +25.6 NET RTG. Brooklyn has no trending combos that compete. Max unit.
PHX @ SAS — SAS -6.5
O/U 225.5. Phoenix (DS #19, 32-23) at San Antonio (DS #6, 37-16). SAS Trans-Defense (Elite) shuts down PHX's PnR-Heavy sets. Lineup data: Wembanyama's 2-man duos are +29.0 NET RTG over 32 games — the largest sample of any trending combo in the league. PHX has zero lineup combos tracking above +10. Team DS 354 vs 321.
ORL @ SAC — ORL -11.0
O/U 223.5. Orlando (DS #13, 27-23) vs Sacramento (DS #28, 12-44). SAC's Trans-Defense (Poor) vs ORL's Run-and-Gun. Lineup data: SAC has 3 DISASTERCLASS fade combos — Westbrook/Achiuwa/DeRozan/Murray 5-man is -30.7 NET RTG (8 GP), Achiuwa/DeRozan/Sabonis 3-man is -29.3 (11 GP), Achiuwa/Sabonis 2-man is -24.1 (12 GP). Sacramento bleeds points in every lineup combination that gets minutes. Min unit — large spread has juice risk.
▎ PLAYER PROPS — 4 PICKS (sort by: DS ranking)
N. JOKIĆ (DEN vs LAC) — OVER 28.5 PTS
🔮 Versatile Big · Avg 28.7 pts on 70% TS vs LAC's 114 DRTG. Highest DS in the league. LAC runs ISO-Heavy with Drop-Coverage (Avg) — Jokic's post game feasts on drop schemes. Proj: 28.7p / 10.5a / 11.8r. 📈 Trend: consistent production, floor is the line.
D. MITCHELL (CLE vs BKN) — OVER 27.2 PTS
⚡ Scoring Guard · Avg 29.0 pts on 62% TS vs BKN's 117 DRTG — worst defense in the league. Mitchell's scoring-guard archetype thrives in fast PnR vs poor drop coverage. Lineup data: Mitchell's combos with Merrill/Tyson (+29.2 NET, 21 GP) and with Mobley/Allen (+25.6 NET, 6 GP) are both elite — he's the engine. Stacks with CLE -13.5.
V. WEMBANYAMA (SAS vs PHX) — OVER 11.0 REB 📈
🏰 Rim Protector · Avg 11.1 reb, 29 mpg. TRENDING UP over last 5. Lineup data: Wemby's 2-man duos are +29.0 NET RTG over 32 games — biggest sample in the league's hot combos. PHX has no true center to contest boards — Williams (DS 59) is undersized. Wemby's rim-protector archetype dominates the glass against small-ball. Stacks with SAS -6.5. Min unit — rebound props are volatile.
K. LEONARD (LAC vs DEN) — OVER 28.0 PTS 🔥
🧠 Point Forward · Avg 27.9 pts on 62% TS vs DEN's 116 DRTG. HEATING UP over last 5 games. Kawhi's ISO archetype works against Denver's Rim-Protect (Avg) — pulls Jokic to the perimeter. Proj: 28.3p / 3.6a / 6.2r. Trend-based sizing.
| Pick | Side | Conf | $PP | Result |
|---|---|---|---|---|
| BKN @ CLE | CLE -13.5 | 100 A | 50 | — |
| PHX @ SAS | SAS -6.5 | 65 B | 30 | — |
| ORL @ SAC | ORL -11.0 | 46 D | 10 | — |
| Jokić PTS | OVER 28.5 | DS 99 | 30 | — |
| Mitchell PTS | OVER 27.2 | DS 89 | 30 | — |
| Wemby REB | OVER 11.0 | DS 88 | 10 | — |
| Leonard PTS | OVER 28.0 | DS 84 | 30 | — |
| TOTAL RISKED | | | 190 | |
All picks sourced from the NBA SIM pipeline — scheme detection, archetype clustering (K-Means on 16 features), Dynamic Score rankings, and lineup synergy. Lines via The Odds API. Full methodology at nbasim.