# Historical Validation Plan — arb (Phase 0 → fast viability decision)

**Status:** proposed by auditor, awaiting operator approval before any dev work.
**Generated:** 2026-04-25, Melbourne.
**Goal:** answer "would this strategy have worked recently?" in hours, not weeks.

---

## Context (why this shift)

Live paper trading after ~2.5 hours has shown:

- 3,025 opportunities scanned across 4 directions (BTC/ETH × kraken/coinbase)
- 0 qualifying
- Best gross spread observed: 0.036%
- Required gross spread to clear fees + slippage + threshold at current config: ~1.25%
- Gap: ~35×

Waiting 14 days for live data to confirm what is already obvious is too slow. Historical replay can deliver a definitive answer for the same and adjacent markets in an afternoon.

---

## 1. Historical analysis strategy

### Data tiers

| Tier | Data type | Source | Cost | Accuracy | Speed-to-answer |
|---|---|---|---|---|---|
| A | L2 orderbook (10+ levels), tick-level | Tardis.dev / Kaiko | $50–$200 for 7d × 2 venues × 2 pairs | Highest (real fillable size) | 1 day to set up |
| B | **L1 best-bid/ask, tick-level** | Tardis.dev / Kaiko, OR self-captured JSONL | $20–$80 (or free if self-captured ≥7d) | High (matches what live engine sees) | Hours |
| C | Trade tape (executed prices) | Free from Kraken/Coinbase REST | Free | Medium (no resting bid/ask, only executions) | Hours |
| D | OHLC candles + spread proxy | Free from any data API | Free | Low (smooths out the seconds-long arb windows where opportunities actually exist) | Minutes |

### Recommendation: Tier B, 7 days, from Tardis.dev for the first run

Why:

- Matches exactly what the live strategy sees. The current strategy is L1-only — it doesn't walk the book — so L2 buys nothing until the strategy is upgraded.
- 7 days is enough. If a 7-day historical replay shows 0 qualifying at the same fee assumptions the live data uses, that confirms the live signal with high confidence. If the 7 days show at least one qualifying day, extend to 30.
- Tardis tick-level L1 for kraken+coinbase, BTC/USD + ETH/USD, 7 days ≈ $30–$60 USD.

### Skip

- **Tier A initially** — the strategy is rejecting on net_edge (fees), not on size-fillability. L2 only matters once we know fees aren't the killer.
- **Tier C and D** — trade tape lacks resting quotes, and candles smooth out the windows where arb actually exists. Both produce false negatives.

### Free fallback

Replay our own self-captured `engine/data/ticks/` first to validate the replay infra, then add Tardis once budget is confirmed.

### Self-capture audit (already on this VPS)

- `engine/data/ticks/coinbase_2026-04-25.jsonl`: 17,186 lines, 2.9 MB, ~2h coverage
- `engine/data/ticks/kraken_2026-04-25.jsonl`: smaller, ~200 KB (kraken emits ~14× fewer updates than coinbase — normal)
- Format: `{exchange, symbol, bid, ask, bidSize, askSize, ts, receivedAt}` JSONL — clean, replay-ready
- Daily volume estimate: ~35 MB/day for coinbase plus a fraction of that for kraken → 7 days × 2 venues ≈ 250 MB. Trivial.

---

## 2. Replay/backtest system design

The engine has clean separation: `Feeds → OrderBook → Strategy → Risk → Simulator → Ledger`. Replay swaps the feed layer only.
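Because replay quality depends entirely on the captured input, the self-capture schema above is worth validating line-by-line before it ever reaches an adapter. A minimal sketch — the `Tick` type and `parseTick` helper are hypothetical illustrations, not existing engine code; only the field names come from the audit above:

```typescript
// Mirrors the self-captured JSONL schema: one quote per line.
interface Tick {
  exchange: string;
  symbol: string;
  bid: number;
  ask: number;
  bidSize: number;
  askSize: number;
  ts: number;         // exchange timestamp (ms)
  receivedAt: number; // local receive timestamp (ms)
}

// Hypothetical validator: rejects malformed lines before replay, so a
// corrupt capture can't silently skew the backtest.
function parseTick(line: string): Tick {
  const t = JSON.parse(line);
  const required = ["exchange", "symbol", "bid", "ask", "bidSize", "askSize", "ts", "receivedAt"];
  for (const k of required) {
    if (!(k in t)) throw new Error(`missing field: ${k}`);
  }
  // Within a single venue the book must not be crossed or non-positive.
  if (!(t.bid > 0 && t.ask > 0 && t.ask >= t.bid)) {
    throw new Error("crossed or non-positive quote");
  }
  return t as Tick;
}
```

A replay loader would stream each JSONL file through this validator, then merge the two venues' streams by timestamp before feeding the engine.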
```
┌──────────────────┐                                   ┌──────────────────┐
│ Historical files │ ──┐                               │ Live WS feeds    │ (existing)
│ (JSONL / CSV)    │   │   ┌───────────────────────┐   │ Kraken+Coinbase  │
└──────────────────┘   ├──▶│ FeedAdapter interface │◀──┤                  │
                       │   └───────────────────────┘   └──────────────────┘
┌──────────────────┐   │              │
│ Replay clock     │ ──┘              ▼
│ (returns latest  │       ┌────────────────────┐
│ replayed ts)     │       │ OrderBook          │  (existing, untouched)
└──────────────────┘       └────────────────────┘
                                      │
                                      ▼
                           ┌────────────────────┐
                           │ Strategy + Risk    │  (existing, untouched)
                           │ + Simulator        │
                           └────────────────────┘
                                      │
                                      ▼
                           ┌────────────────────┐
                           │ replay-{run}.db    │  (separate file, no
                           │ (same schema)      │  contamination of live)
                           └────────────────────┘
                                      │
                                      ▼
                           ┌────────────────────┐
                           │ /opportunities/    │  (already built — point at
                           │ stats endpoint     │  replay DB via flag)
                           └────────────────────┘
```

### Five new components, all small

1. **`FeedAdapter` interface** + extract the live `feeds/kraken.ts` and `feeds/coinbase.ts` to implement it.
2. **`HistoricalFeedAdapter`** — reads JSONL, emits quotes in timestamp order, calls the same handlers as live.
3. **`ReplayClock`** — replaces `nowUtcMs()` with the timestamp of the most recently replayed quote. Critical: without this, `max_feed_age_ms: 1000` rejects every replayed quote because "now" is 2026-04-25 and the data is from 2026-04-18.
4. **`replay` CLI runner** — `npm run replay -- --from=... --to=... --pairs=... --venues=... --fees-override=... --slippage-override=... --min-edge-pct-override=... --out=replay-7d-baseline.db`
5. **What-if sweep** — the same runner with a fee/edge matrix → a table of "if fees were X, would we have had Y qualifying/day."
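Components 1–3 could look like the following sketch. The `Quote` shape and handler signature are assumptions for illustration; only the `FeedAdapter`, `HistoricalFeedAdapter`, `ReplayClock`, and `nowUtcMs()` names come from the plan:

```typescript
// Assumed quote shape; the real engine's type may differ.
type Quote = { exchange: string; symbol: string; bid: number; ask: number; ts: number };
type QuoteHandler = (q: Quote) => void;

// Component 1: the seam between live WS feeds and historical files.
interface FeedAdapter {
  start(onQuote: QuoteHandler): void;
}

// Component 3: "now" becomes the ts of the most recently replayed quote,
// so max_feed_age_ms checks pass against week-old data.
class ReplayClock {
  private lastTs = 0;
  advance(ts: number): void { this.lastTs = Math.max(this.lastTs, ts); }
  nowUtcMs(): number { return this.lastTs; }
}

// Component 2: emits historical quotes in timestamp order through the
// same handler the live adapters call, advancing the clock as it goes.
class HistoricalFeedAdapter implements FeedAdapter {
  constructor(private quotes: Quote[], private clock: ReplayClock) {}
  start(onQuote: QuoteHandler): void {
    for (const q of [...this.quotes].sort((a, b) => a.ts - b.ts)) {
      this.clock.advance(q.ts);
      onQuote(q);
    }
  }
}
```

Because both adapters implement the same interface, the OrderBook, Strategy, Risk, and Simulator layers never know whether they are fed live or replayed quotes — which is exactly what makes the hash-equality check below meaningful.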
### Output (already exists via `/opportunities/stats` — just point it at the replay DB)

- net_edge distribution (p50/p90/p99/max per direction)
- best opportunities (top-100 by net_edge)
- % qualifying vs % rejected, by reason
- frequency (per hour, per day)

### Critical correctness check

Before trusting the replay, the same captured tick file replayed through the new adapter must produce the **same `opportunities` rows** (same primary keys, same numeric values) that the live engine produced from those quotes. Hash-equality test. Without this, the replay isn't trustworthy.

---

## 3. Multi-market

Two separate axes:

**Adding pairs (cheap, no new code):**

- Add to `symbols:` in config — already supported. We only need data for that pair.

**Adding exchanges (one-time per exchange, ~half day each):**

- Each exchange needs a `FeedAdapter` (live WS) and a normalized symbol mapping.
- The replay layer is venue-agnostic: as long as Tardis (or self-capture) gives `{exchange, symbol, bid, ask, ...}` in the standard schema, replay accepts it.

**Fast viability sweep:**

```
buy data → replay → stats → decision   (per pair × per exchange-pair)
```

At 5 venues and 5 currency pairs: 5 venues give 10 venue-pairs, × 5 currency pairs = 50 cells. With Tardis data and the replay infra, the full sweep runs in minutes.

---

## 4. Execution sequence — 6 steps, sent one at a time

These are NOT sent all at once. Brief description of each; the auditor sends Step 1 only after operator approval.

| # | Step | Sends a prompt to dev? | Read-only audit by auditor? | What it answers |
|---|---|---|---|---|
| 1 | Audit current capture: format, retention, gaps. Confirm replay can use it. | No | Yes | Do we have usable free data? How much per day? |
| 2 | Build replay infrastructure (FeedAdapter interface, HistoricalFeedAdapter, ReplayClock, runner CLI, separate replay DB output). Tests prove byte-identical opportunity rows when replaying captured ticks. | **Yes** | — | Replay engine works and matches live |
| 3 | Run replay over our self-captured ticks. Verify the match. Run `/opportunities/stats` against the replay DB. | No | Yes | Replay is trustworthy |
| 4 | Decision point: buy 7-day Tardis L1 data for kraken+coinbase BTC/USD+ETH/USD ($30–$60). Auditor writes the ingestion prompt. **Operator authorizes spend.** | **Yes** (+ operator authorization) | — | Real 7-day historical answer |
| 5 | Fee-tier sweep: replay the same data with fee overrides matching maker/VIP tiers. Decision matrix. | **Yes** (small) | Auditor runs | Is the answer fee-tier-dependent or structural? |
| 6 | Multi-market expansion: 1–2 more venues (Binance, OKX) + 2–3 more pairs (SOL, XRP, DOGE). Only if Step 5 isn't a clean "no." | **Yes** | Auditor runs | Does the strategy work on a different market? |

---

## 5. Decision framework

Per `(currency_pair, exchange_A → exchange_B)` cell, after a 7-day replay at the actual fee tier you'd trade at:

| Outcome | Criteria | Decision |
|---|---|---|
| **Viable** | ≥3 qualifying opps/day on rolling 7-day avg AND median qualifying `net_edge_usd` > $0.05 AND ≥80% of qualifying have `qty ≥ min_size` (no thin-book illusion) AND best gross spread ≥ `(fees + slippage + min_edge_pct + 50% safety margin)` | Proceed to Phase 1 design for that cell |
| **Marginal** | 1–2 qualifying/day OR qualifying clusters in specific hours/sessions OR direction asymmetry suggests a real edge | Buy 30 days of L2, retest. Consider strategy tweaks (limit orders / passive) before going live |
| **Not viable** | 0 qualifying in 7 days AT YOUR ACTUAL FEE TIER AND best gross spread < (fees + slippage) | **Stop.** Move to the next market. Do not retest unless market structure changes. |
| **Stop-testing** | Tier B replay shows 0 qualifying AND the fee sweep shows even 0% fees would produce <1 qualifying/day | Strategy is structurally dead for this market. Don't burn more time. |

**Tie-breaker for "do we expand to a new market":** worth testing only if `expected_gross_spread_pct > 2 × round_trip_fees_pct`. If not, skip.

---

## What happens next

The auditor executes Step 1 (read-only audit, already partially complete) and brings back the result plus the **Step 2 prompt** for operator review before sending to dev. If the operator wants to adjust anything (different data source, different time range, skip a step, change decision thresholds), say so before Step 1 finishes.
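For the Step 5 decision matrix, the framework above can be mechanized per cell. A sketch only: the metric field names are invented for illustration, and the "Marginal" clustering/asymmetry criteria are collapsed into a fallthrough; the numeric thresholds come straight from the table:

```typescript
// Invented metric shape for one (currency_pair, venue-pair) cell.
interface CellMetrics {
  qualifyingPerDay: number;         // rolling 7-day average
  medianNetEdgeUsd: number;         // median net_edge_usd over qualifying opps
  pctQualifyingAtMinSize: number;   // share of qualifying with qty >= min_size
  grossSpreadOkWithMargin: boolean; // best gross >= fees+slip+min_edge+50% margin
  zeroFeeQualifyingPerDay: number;  // from the Step 5 fee sweep at 0% fees
}

type Outcome = "viable" | "marginal" | "not-viable" | "stop-testing";

function classify(m: CellMetrics): Outcome {
  if (m.qualifyingPerDay >= 3 && m.medianNetEdgeUsd > 0.05 &&
      m.pctQualifyingAtMinSize >= 0.8 && m.grossSpreadOkWithMargin) {
    return "viable";
  }
  if (m.qualifyingPerDay === 0) {
    // Structurally dead if even free trading would not qualify daily.
    return m.zeroFeeQualifyingPerDay < 1 ? "stop-testing" : "not-viable";
  }
  return "marginal"; // 1–2/day, clustering, or asymmetry: retest on more data
}

// Tie-breaker for expanding to a new market.
const worthTesting = (expectedGrossSpreadPct: number, roundTripFeesPct: number) =>
  expectedGrossSpreadPct > 2 * roundTripFeesPct;
```

Encoding the table this way keeps the 50-cell sweep honest: every cell gets the same mechanical verdict, and threshold changes are a one-line diff reviewed by the operator rather than ad-hoc judgment per market.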