# Historical Validation Plan — arb (Phase 0 → fast viability decision)

**Status:** proposed by auditor, awaiting operator approval before any dev work.
**Generated:** 2026-04-25, Melbourne.
**Goal:** answer "would this strategy have worked recently?" in hours, not weeks.

---

## Context (why this shift)

Live paper trading after ~2.5 hours has shown:

- 3,025 opportunities scanned across 4 directions (BTC/ETH × kraken/coinbase)
- 0 qualifying
- Best gross spread observed: 0.036%
- Required gross spread to clear fees + slippage + threshold at current config: ~1.25%
- Gap: ~35×

Waiting 14 days for live data to confirm what is already obvious is too slow. Historical replay can deliver a definitive answer for the same and adjacent markets in an afternoon.

---

## 1. Historical analysis strategy

### Data tiers

| Tier | Data type | Source | Cost | Accuracy | Speed-to-answer |
|---|---|---|---|---|---|
| A | L2 orderbook (10+ levels), tick-level | Tardis.dev / Kaiko | $50–$200 for 7d × 2 venues × 2 pairs | Highest (real fillable size) | 1 day to set up |
| B | **L1 best-bid/ask, tick-level** | Tardis.dev / Kaiko, OR self-captured JSONL | $20–$80 (or free if self-captured ≥7d) | High (matches what live engine sees) | Hours |
| C | Trade tape (executed prices) | Free from Kraken/Coinbase REST | Free | Medium (no resting bid/ask, only executions) | Hours |
| D | OHLC candles + spread proxy | Free from any data API | Free | Low (smooths out the seconds-long arb windows where opportunities actually exist) | Minutes |

### Recommendation: Tier B, 7 days, from Tardis.dev for the first run

Why:

- Matches exactly what the live strategy sees. The current strategy is L1-only — it doesn't walk the book — so L2 buys nothing until the strategy is upgraded.
- 7 days is enough. If a 7-day historical replay shows 0 qualifying at the same fee assumptions the live data uses, that confirms the live signal with high confidence. If the 7 days show at least one qualifying day, extend to 30.
- Tardis tick-level L1 for kraken+coinbase, BTC/USD + ETH/USD, 7 days ≈ $30–$60 USD.

### Skip

- **Tier A initially** — the strategy is rejecting on net_edge (fees), not on size-fillability. L2 only matters once we know fees aren't the killer.
- **Tier C and D** — trade tape lacks resting quotes, and candles smooth out the windows where arb actually exists. Both produce false negatives.

### Free fallback

Replay our own self-captured `engine/data/ticks/` first to validate the replay infra, then add Tardis once budget is confirmed.

### Self-capture audit (already on this VPS)

- `engine/data/ticks/coinbase_2026-04-25.jsonl`: 17,186 lines, 2.9 MB, ~2h coverage
- `engine/data/ticks/kraken_2026-04-25.jsonl`: smaller, ~200 KB (kraken emits ~14× fewer updates than coinbase — normal)
- Format: `{exchange, symbol, bid, ask, bidSize, askSize, ts, receivedAt}` JSONL — clean, replay-ready
- Daily volume estimate: ~35 MB/day for coinbase plus a fraction of that for kraken → 7 days × 2 venues ≈ 250 MB. Trivial.

---

## 2. Replay/backtest system design

The engine has clean separation: `Feeds → OrderBook → Strategy → Risk → Simulator → Ledger`. Replay swaps the feed layer only.
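Because replay quality depends entirely on the captured input, the self-capture schema above is worth validating line-by-line before it ever reaches an adapter. A minimal sketch — the `Tick` type and `parseTick` helper are hypothetical illustrations, not existing engine code; only the field names come from the audit above:

```typescript
// Mirrors the self-captured JSONL schema: one quote per line.
interface Tick {
  exchange: string;
  symbol: string;
  bid: number;
  ask: number;
  bidSize: number;
  askSize: number;
  ts: number;         // exchange timestamp (ms)
  receivedAt: number; // local receive timestamp (ms)
}

// Hypothetical validator: rejects malformed lines before replay, so a
// corrupt capture can't silently skew the backtest.
function parseTick(line: string): Tick {
  const t = JSON.parse(line);
  const required = ["exchange", "symbol", "bid", "ask", "bidSize", "askSize", "ts", "receivedAt"];
  for (const k of required) {
    if (!(k in t)) throw new Error(`missing field: ${k}`);
  }
  // Within a single venue the book must not be crossed or non-positive.
  if (!(t.bid > 0 && t.ask > 0 && t.ask >= t.bid)) {
    throw new Error("crossed or non-positive quote");
  }
  return t as Tick;
}
```

A replay loader would stream each JSONL file through this validator, then merge the two venues' streams by timestamp before feeding the engine.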
```
┌──────────────────┐                                   ┌──────────────────┐
│ Historical files │ ──┐                               │ Live WS feeds    │ (existing)
│ (JSONL / CSV)    │   │   ┌───────────────────────┐   │ Kraken+Coinbase  │
└──────────────────┘   ├──▶│ FeedAdapter interface │◀──┤                  │
                       │   └───────────────────────┘   └──────────────────┘
┌──────────────────┐   │              │
│ Replay clock     │ ──┘              ▼
│ (returns latest  │       ┌────────────────────┐
│ replayed ts)     │       │ OrderBook          │  (existing, untouched)
└──────────────────┘       └────────────────────┘
                                      │
                                      ▼
                           ┌────────────────────┐
                           │ Strategy + Risk    │  (existing, untouched)
                           │ + Simulator        │
                           └────────────────────┘
                                      │
                                      ▼
                           ┌────────────────────┐
                           │ replay-{run}.db    │  (separate file, no
                           │ (same schema)      │  contamination of live)
                           └────────────────────┘
                                      │
                                      ▼
                           ┌────────────────────┐
                           │ /opportunities/    │  (already built — point at
                           │ stats endpoint     │  replay DB via flag)
                           └────────────────────┘
```

### Five new components, all small

1. **`FeedAdapter` interface** + extract the live `feeds/kraken.ts` and `feeds/coinbase.ts` to implement it.
2. **`HistoricalFeedAdapter`** — reads JSONL, emits quotes in timestamp order, calls the same handlers as live.
3. **`ReplayClock`** — replaces `nowUtcMs()` with the timestamp of the most recently replayed quote. Critical: without this, `max_feed_age_ms: 1000` rejects every replayed quote because "now" is 2026-04-25 and the data is from 2026-04-18.
4. **`replay` CLI runner** — `npm run replay -- --from=... --to=... --pairs=... --venues=... --fees-override=... --slippage-override=... --min-edge-pct-override=... --out=replay-7d-baseline.db`
5. **What-if sweep** — the same runner with a fee/edge matrix → a table of "if fees were X, would we have had Y qualifying/day."
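Components 1–3 could look like the following sketch. The `Quote` shape and handler signature are assumptions for illustration; only the `FeedAdapter`, `HistoricalFeedAdapter`, `ReplayClock`, and `nowUtcMs()` names come from the plan:

```typescript
// Assumed quote shape; the real engine's type may differ.
type Quote = { exchange: string; symbol: string; bid: number; ask: number; ts: number };
type QuoteHandler = (q: Quote) => void;

// Component 1: the seam between live WS feeds and historical files.
interface FeedAdapter {
  start(onQuote: QuoteHandler): void;
}

// Component 3: "now" becomes the ts of the most recently replayed quote,
// so max_feed_age_ms checks pass against week-old data.
class ReplayClock {
  private lastTs = 0;
  advance(ts: number): void { this.lastTs = Math.max(this.lastTs, ts); }
  nowUtcMs(): number { return this.lastTs; }
}

// Component 2: emits historical quotes in timestamp order through the
// same handler the live adapters call, advancing the clock as it goes.
class HistoricalFeedAdapter implements FeedAdapter {
  constructor(private quotes: Quote[], private clock: ReplayClock) {}
  start(onQuote: QuoteHandler): void {
    for (const q of [...this.quotes].sort((a, b) => a.ts - b.ts)) {
      this.clock.advance(q.ts);
      onQuote(q);
    }
  }
}
```

Because both adapters implement the same interface, the OrderBook, Strategy, Risk, and Simulator layers never know whether they are fed live or replayed quotes — which is exactly what makes the hash-equality check below meaningful.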
### Output (already exists via `/opportunities/stats` — just point it at the replay DB)

- net_edge distribution (p50/p90/p99/max per direction)
- best opportunities (top-100 by net_edge)
- % qualifying vs % rejected, by reason
- frequency (per hour, per day)

### Critical correctness check

Before trusting the replay, the same captured tick file replayed through the new adapter must produce the **same `opportunities` rows** (same primary keys, same numeric values) that the live engine produced from those quotes. Hash-equality test. Without this, the replay isn't trustworthy.

---

## 3. Multi-market

Two separate axes:

**Adding pairs (cheap, no new code):**

- Add to `symbols:` in config — already supported. We only need data for that pair.

**Adding exchanges (one-time per exchange, ~half day each):**

- Each exchange needs a `FeedAdapter` (live WS) and a normalized symbol mapping.
- The replay layer is venue-agnostic: as long as Tardis (or self-capture) gives `{exchange, symbol, bid, ask, ...}` in the standard schema, replay accepts it.

**Fast viability sweep:**

```
buy data → replay → stats → decision   (per pair × per exchange-pair)
```

At 5 venues and 5 currency pairs: 5 venues give 10 venue-pairs, × 5 currency pairs = 50 cells. With Tardis data and the replay infra, the full sweep runs in minutes.

---

## 4. Execution sequence — 6 steps, sent one at a time

These are NOT sent all at once. Brief description of each; the auditor sends Step 1 only after operator approval.

| # | Step | Sends a prompt to dev? | Read-only audit by auditor? | What it answers |
|---|---|---|---|---|
| 1 | Audit current capture: format, retention, gaps. Confirm replay can use it. | No | Yes | Do we have usable free data? How much per day? |
| 2 | Build replay infrastructure (FeedAdapter interface, HistoricalFeedAdapter, ReplayClock, runner CLI, separate replay DB output). Tests prove byte-identical opportunity rows when replaying captured ticks. | **Yes** | — | Replay engine works and matches live |
| 3 | Run replay over our self-captured ticks. Verify the match. Run `/opportunities/stats` against the replay DB. | No | Yes | Replay is trustworthy |
| 4 | Decision point: buy 7-day Tardis L1 data for kraken+coinbase BTC/USD+ETH/USD ($30–$60). Auditor writes the ingestion prompt. **Operator authorizes spend.** | **Yes** (+ operator authorization) | — | Real 7-day historical answer |
| 5 | Fee-tier sweep: replay the same data with fee overrides matching maker/VIP tiers. Decision matrix. | **Yes** (small) | Auditor runs | Is the answer fee-tier-dependent or structural? |
| 6 | Multi-market expansion: 1–2 more venues (Binance, OKX) + 2–3 more pairs (SOL, XRP, DOGE). Only if Step 5 isn't a clean "no." | **Yes** | Auditor runs | Does the strategy work on a different market? |

---

## 5. Decision framework

Per `(currency_pair, exchange_A → exchange_B)` cell, after a 7-day replay at the actual fee tier you'd trade at:

| Outcome | Criteria | Decision |
|---|---|---|
| **Viable** | ≥3 qualifying opps/day on rolling 7-day avg AND median qualifying `net_edge_usd` > $0.05 AND ≥80% of qualifying have `qty ≥ min_size` (no thin-book illusion) AND best gross spread ≥ `(fees + slippage + min_edge_pct + 50% safety margin)` | Proceed to Phase 1 design for that cell |
| **Marginal** | 1–2 qualifying/day OR qualifying clusters in specific hours/sessions OR direction asymmetry suggests a real edge | Buy 30 days of L2, retest. Consider strategy tweaks (limit orders / passive) before going live |
| **Not viable** | 0 qualifying in 7 days AT YOUR ACTUAL FEE TIER AND best gross spread < (fees + slippage) | **Stop.** Move to the next market. Do not retest unless market structure changes. |
| **Stop-testing** | Tier B replay shows 0 qualifying AND the fee sweep shows even 0% fees would produce <1 qualifying/day | Strategy is structurally dead for this market. Don't burn more time. |

**Tie-breaker for "do we expand to a new market":** worth testing only if `expected_gross_spread_pct > 2 × round_trip_fees_pct`. If not, skip.

---

## What happens next

The auditor executes Step 1 (read-only audit, already partially complete) and brings back the result plus the **Step 2 prompt** for operator review before sending to dev. If the operator wants to adjust anything (different data source, different time range, skip a step, change decision thresholds), say so before Step 1 finishes.
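For the Step 5 decision matrix, the framework above can be mechanized per cell. A sketch only: the metric field names are invented for illustration, and the "Marginal" clustering/asymmetry criteria are collapsed into a fallthrough; the numeric thresholds come straight from the table:

```typescript
// Invented metric shape for one (currency_pair, venue-pair) cell.
interface CellMetrics {
  qualifyingPerDay: number;         // rolling 7-day average
  medianNetEdgeUsd: number;         // median net_edge_usd over qualifying opps
  pctQualifyingAtMinSize: number;   // share of qualifying with qty >= min_size
  grossSpreadOkWithMargin: boolean; // best gross >= fees+slip+min_edge+50% margin
  zeroFeeQualifyingPerDay: number;  // from the Step 5 fee sweep at 0% fees
}

type Outcome = "viable" | "marginal" | "not-viable" | "stop-testing";

function classify(m: CellMetrics): Outcome {
  if (m.qualifyingPerDay >= 3 && m.medianNetEdgeUsd > 0.05 &&
      m.pctQualifyingAtMinSize >= 0.8 && m.grossSpreadOkWithMargin) {
    return "viable";
  }
  if (m.qualifyingPerDay === 0) {
    // Structurally dead if even free trading would not qualify daily.
    return m.zeroFeeQualifyingPerDay < 1 ? "stop-testing" : "not-viable";
  }
  return "marginal"; // 1–2/day, clustering, or asymmetry: retest on more data
}

// Tie-breaker for expanding to a new market.
const worthTesting = (expectedGrossSpreadPct: number, roundTripFeesPct: number) =>
  expectedGrossSpreadPct > 2 * roundTripFeesPct;
```

Encoding the table this way keeps the 50-cell sweep honest: every cell gets the same mechanical verdict, and threshold changes are a one-line diff reviewed by the operator rather than ad-hoc judgment per market.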