Eight gate-green commits in one afternoon: two parked features (pre-commitment triggers, limit-order ladders), the project's hardest parked epic (Breeden-Litzenberger market-implied distribution extraction) verified against synthetic ground truth, and the first night of a Streamlit → FastAPI + React strangler-fig migration — with the old app untouched as the fallback.
The experiment: follow Anthropic's Fable guidance — bigger objectives, not task lists; rework stale instructions; put verification gates where a human used to sit. The test suite grew from 25 to 31 checks and every commit shipped behind it.
The setup
This is night one of a multi-night autonomous run on a personal portfolio-decision tool — a Monte Carlo engine that turns subjective price forecasts (p10/p50/p90 quantiles) into portfolio-level risk numbers, with hard rules and a decision journal. The codebase came in with an unusual asset: a gate — a test suite the agent must pass before any work counts, enforced by a hook on every turn.
Before any code, a two-hour spec session produced a run contract: objectives with "what done looks like" and "how it's verified" per item, not step-by-step instructions. Two stale sections of the project's agent instructions (a long-shipped build order, an outdated description of the test harness) were deleted first — instructions written for prior models anchor newer ones to old patterns.
One scheduling trick did real work: the owner lives in Singapore, so the US market opens at 9:30pm his time. The one step that genuinely needs human eyes — comparing the extracted market distribution against a broker's reference implementation — lands in his evening, not the agent's night. A "collaborative-only" epic became an autonomous one by moving the checkpoint, not by removing it.
The shipped ledger
| Commit | What it does | Verified by |
|---|---|---|
| Pre-commitment triggers | Price levels decided in advance, stored on the decision memo ("if ACME < 120, buy — that's panic pricing"). A CLI walks all memos and reports fired / near-miss / quiet against a spot snapshot. The point: act on the plan you made calmly, not the impulse of the moment. | gate-pinned |
| Trigger ladders | Limit-order ladders generated from the user's own forecast quantiles — the p10 rung is the user's p10. Sizes capped so no rung violates portfolio rules, with one subtle semantic: a rule blocks a rung only if the trade worsens it. Exports broker-paste CSV. | hand-computed fixtures |
| Breeden-Litzenberger pipeline | Extracts the market's implied probability distribution from option chains: filter quotes, invert to implied vol, smooth with a spline in IV-space (Malz recipe), differentiate the call curve twice. Density, CDF, quantiles, per-file cache. | σ-recovery ≤3% |
| Burst prefetch (daemon-lite) | Enumerate stale (ticker, horizon) pairs, drain the queue in parallel, exit. The adaptive back-off daemon stayed out of scope — its load-bearing design question is flagged in the plan as a human decision, and the agent respected that. | dry-run verified |
| FastAPI + React scaffold | Thin API over the untouched engine + a hand-written Vite/React frontend (light, Apple-ish). First screen renders live engine output; forecast editing shipped the same evening with a stateless override model. | API = engine to the cent |
The fun math: reading the market's mind
The market's entire probability distribution is hiding in the option chain — the second derivative reads it out.
A tight butterfly spread (buy a call at K−δ, sell two at K, buy at K+δ) pays out only if the price lands at K. So its market price is the market's probability of landing there — and the butterfly is the discrete second derivative of the call-price curve. Take ∂²C/∂K² across all strikes and you've recovered the market's full implied density, not just one volatility number.
The catch is noise: differentiating twice amplifies every bad quote violently. The practitioner fix (Malz, 2014) is to smooth in implied-vol space — a far better-behaved curve — then reprice and differentiate. The whole pipeline is verified by a test that can run with the market closed: feed it Black-Scholes prices generated from a known σ, and it must recover a lognormal with that σ to within ~3%.
The failure that taught something
The recovery test passed at σ=0.25 and failed hard at σ=0.60 — recovered 0.43, off by 28%. Not a code bug: the strike filter clipped at 1.5× spot, and a 60-vol distribution's own 90th percentile lives beyond that. B-L can only see the density where strikes exist. The fix was architectural — strike banding belongs to the fetch budget, not the math layer — and the same structural fact is why a handful of very-high-vol names are excluded from the first live pass.
The migration: strangler fig, not rewrite
Streamlit
Got the tool to v1 fast, but reruns the whole script on every widget touch — state management is a permanent fight, direct-manipulation UI is near impossible, and a background refresh daemon needs lock-file hacks. It stays fully working as the demo fallback.
FastAPI + React
The engine was already UI-free, so the backend is a thin serialisation layer — and the gate now asserts the API equals direct engine computation to the cent. The frontend is hand-written Vite + React; edits are stateless overrides posted to pure endpoints, so no math is ever duplicated in TypeScript.
The first React screen — live numbers, a working you-vs-market distribution chart, and bulk forecast entry (ACME 80/100/140, paste ten lines at once) — landed about four hours after the migration decision. The console caught a real bug on first render (duplicate React keys from a ticker held in two custody accounts), which is the kind of thing that earns a framework its keep.
The playbook, generalised
Contract, not task list
Each objective states what done looks like and how it's verified. The agent picks the path — including reordering stages when the situation changes.
The gate is the autonomy
Every feature extends the test gate in the same commit. Unsupervised hours are only safe because "done" is machine-checkable.
De-stale the instructions first
Prior-model docs anchor new models to old shapes. Audit, keep what's physics (broker quirks, brand style), delete what's history.
Move checkpoints, don't remove them
Timezone arbitrage turned a "human-in-the-loop" epic autonomous: the irreducible visual check lands in the owner's evening.
Strangle, don't rewrite
The new stack grows beside the old one. The fallback is always demoable, so a migration can't strand you.
What would change this
Tonight's live market session is the real exam — synthetic recovery proves the math, not the data path. The morning report tells.
Guardrails that held
What an overnight agent must not do
- No package installs. A pre-tool hook blocks every supply-chain install; the owner runs them personally. It fired twice today and was respected both times.
- Read-only brokerage, always. Live-account API access is read-only at the connection level, and anything touching order flow is out of scope by contract.
- Market-implied ≠ forecast. Risk-neutral densities are insurance-demand-inflated. The framing travels inside API payloads so no UI can silently present the market's view as advice.
- Destructive ops escalate. A dirty working tree from a forgotten session was diff-archived before any revert — and the "rubbish" turned out to contain a reusable demo-mode feature.
- Sanitized for publication: all portfolio values, holdings, and identifying details removed or invented; tickers shown are examples only.
- The live IB fetch layer is untested against a real market session at time of writing — that's tonight's checkpoint.
- Single data point, one codebase, one (unusually well-harnessed) project. Generalise with care.
- Built during one session on 10 June 2026 with Claude Fable 5 in Claude Code.