We built four LLM-designed quant bots (GPT, Nebula, Grok, Gemini), locked each one's forecast before the run, and graded actual-vs-forecast on a real BTC market tape — then again natively on a real Kalshi sports slate.
The efficient-tape floor held: no bot beat costs on a net basis. Three were gross-positive, but fees turned every one of them negative. After the rake, the directional binary is roughly a coin flip, exactly the lesson the fee math predicts.
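To make the fee math concrete, here is a minimal sketch of the per-contract arithmetic for a binary that pays 1.0 on a win. The flat `fee` parameter and the example prices are assumptions for illustration only; they are not the venue's actual fee schedule or numbers from the run.

```python
def gross_edge(price: float, win_prob: float) -> float:
    """Expected P&L per contract before costs.

    price    -- entry price in [0, 1], i.e. the market-implied probability
    win_prob -- assumed true probability that the contract settles at 1.0
    """
    win_payoff = 1.0 - price   # profit if the contract settles at 1.0
    loss_payoff = -price       # loss if the contract settles at 0.0
    return win_prob * win_payoff + (1.0 - win_prob) * loss_payoff


def net_edge(price: float, win_prob: float, fee: float) -> float:
    """Expected P&L per contract after a flat per-contract cost.

    fee is a hypothetical stand-in for fees plus spread, not a real schedule.
    """
    return gross_edge(price, win_prob) - fee


def breakeven_win_prob(price: float, fee: float) -> float:
    """Win probability needed for net_edge to be exactly zero."""
    # gross_edge simplifies to (win_prob - price), so break-even is price + fee.
    return price + fee


if __name__ == "__main__":
    # Illustrative inputs only: a one-point edge against a two-point cost.
    print(gross_edge(0.50, 0.51))
    print(net_edge(0.50, 0.51, 0.02))
    print(breakeven_win_prob(0.50, 0.02))
```

Under these assumptions, a bot with a small real edge is gross-positive and still net-negative whenever that edge is smaller than the per-contract cost, which is the pattern described above.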
What actually separated them
- Calibration (Brier vs settlement) — the least-gameable read — barely moved across bots; none beat the market-implied line.
- The honest-strategist scoring caught the over-optimist: Gemini, the only bot to forecast a net-positive result, missed (again) and was the most over-confident of the field.
- The differentiator wasn't alpha — it was risk shape: gating turnover and abstaining cut the cost drag, but couldn't manufacture edge that wasn't there.
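Two of the reads above can be sketched in a few lines: the Brier score against settlement, and a turnover gate that abstains when the forecast edge does not clear costs. The function names, the `margin` parameter, and all numbers below are hypothetical, chosen for illustration; they are not the Desk's code or data from the run.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 settlements.

    Lower is better. Always forecasting 0.5 scores exactly 0.25.
    """
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def should_trade(forecast: float, price: float, fee: float,
                 margin: float = 0.0) -> bool:
    """Abstention gate: trade only if the forecast edge clears cost plus margin.

    fee and margin are hypothetical per-contract quantities in price units.
    """
    return abs(forecast - price) > fee + margin


if __name__ == "__main__":
    # Made-up settlements and forecasts, for illustration only.
    outcomes = [1, 0, 1, 0]
    market_line = [0.5, 0.5, 0.5, 0.5]   # market-implied probabilities
    confident_bot = [0.9, 0.9, 0.2, 0.1]  # confident, and wrong twice

    print(brier_score(market_line, outcomes))
    print(brier_score(confident_bot, outcomes))

    # The gate skips a one-point edge against a two-point cost.
    print(should_trade(forecast=0.51, price=0.50, fee=0.02))
```

Note what the gate does and does not do: abstaining sets the cost of a skipped trade to zero, so it reduces drag, but it never changes the forecast itself, so it cannot add edge.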
This is the empirical backbone of the whole Desk: discipline and abstention, not prediction. We show the losers because the losers are the lesson.