What happened

On 2026-04-30, between 07:00 and 19:00 UTC, the public /performance/landing endpoint returned wildly inconsistent numbers, including 7D and 30D windows that showed 0 trades alongside a PnL of -$277. The engine itself continued trading correctly throughout and no user accounts were affected, but the public-facing landing page was unusable for roughly 12 hours.

The root cause was a chain of three independent bugs that compounded.

Bug 1: Missing DATABASE_URL

An env-var manipulation on Render removed several variables that were missing from the snapshot file used for the restore. SHADOW_DB_DSN was preserved, but the engine code path that builds the connection string prefers DATABASE_URL when set and falls back to localhost otherwise. As a result, the engine started up and looped on connection retries against a non-existent local database for roughly 10 minutes.

Fix: restored DATABASE_URL via the Render API.
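The failure mode can be sketched as a fallback chain. This is an illustrative reconstruction, not the real engine code: the function name, database name, and port are assumptions; only the precedence (DATABASE_URL first, localhost otherwise) comes from the incident.

```rust
use std::env;

// Hypothetical sketch of the engine's DSN resolution order.
// DATABASE_URL wins when set; SHADOW_DB_DSN is NOT consulted on this
// path, so preserving it did not help.
fn resolve_db_dsn() -> String {
    match env::var("DATABASE_URL") {
        Ok(dsn) => dsn,
        // With DATABASE_URL deleted, this branch fired and the engine
        // retried forever against a database that does not exist on the
        // host. (DB name/port here are made up for the sketch.)
        Err(_) => "postgres://localhost:5432/killbot".to_string(),
    }
}
```

A silent localhost fallback is exactly what made this hard to spot: the engine looked "up" while connecting to nothing. Failing fast with a clear error when DATABASE_URL is absent would have surfaced the problem in the first minute of logs.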

Bug 2: SQL deserialization panic

After fixing Bug 1, the engine started up but immediately panicked on a Postgres deserialization error. The panic was in restore_performance_metrics_from_db(), reading column 2 (cum_trades) of a window-function query. SUM(BIGINT) OVER (ORDER BY day) returns NUMERIC in PostgreSQL, not BIGINT, and tokio-postgres rejects the implicit conversion.

Fix: added an explicit ::BIGINT cast to every SUM(...) OVER (...) result.
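The shape of the fix, shown on a hypothetical version of the restore query (table and column names here are illustrative, not the real schema):

```sql
-- Before: SUM over a BIGINT column yields NUMERIC, which tokio-postgres
-- refuses to hand back as a Rust i64 -- hence the startup panic.
SUM(n_trades) OVER (ORDER BY day) AS cum_trades

-- After: the explicit cast restores the BIGINT the Rust side expects.
SUM(n_trades) OVER (ORDER BY day)::BIGINT AS cum_trades
```

The same applies to AVG and other aggregates that widen integer inputs to NUMERIC; any query read by a typed deserializer should pin its output types with casts.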

Bug 3: Stale tracker JSON overwriting DB-restored gauges

Even after fixes 1 and 2, the public landing page returned wildly wrong numbers. On startup, the OrderTracker reloads its full history from order_tracker_paper.json on Render's persistent disk. That file was stale (107 trades vs. 432 in the DB), and the tracker's update_prometheus_stats was overwriting the DB-restored gauge values on every tick.

Fix: introduced a 60-second background refresh loop that re-runs the DB query and re-populates the gauges. The tracker no longer writes to the public landing gauges directly; the DB is now the single source of truth for the public landing endpoint.
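A minimal sketch of that refresh loop, under stated assumptions: an AtomicI64 stands in for the real Prometheus gauge, fetch_totals stands in for the restore query against Postgres, and all names are illustrative rather than taken from the engine.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;
use std::{thread, time::Duration};

// Spawn a background thread that periodically re-reads the DB total and
// overwrites the gauge. Because only this loop writes the gauge, a
// stale in-memory tracker can no longer clobber it.
fn spawn_gauge_refresher(
    gauge_trades: Arc<AtomicI64>,
    fetch_totals: impl Fn() -> i64 + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        // DB is the source of truth: every tick re-reads and overwrites.
        gauge_trades.store(fetch_totals(), Ordering::Relaxed);
        thread::sleep(Duration::from_secs(60));
    })
}
```

The key design point is ownership: exactly one writer (the DB-backed loop) per public gauge. The tracker keeps its own internal metrics, but they never touch the landing endpoint's gauges.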

Detection

A user noticed the wrong numbers on the live homepage at ~14:30 UTC, by which point the API had already been returning bad data for several hours. Investigation began with the Render logs.

Lessons

  1. Snapshot all env vars before any manipulation. We had only a partial snapshot. We now have scripts/snapshot-env.sh, which dumps all variables before any change.

  2. Health checks should reflect application liveness, not just process liveness. The Render health check returned 200 throughout because the metrics-server thread was alive even while the main DB connect loop was failing. We are adding an engine_db_connected Prometheus gauge as a real liveness signal.

  3. The DB is the source of truth for public data, always. The tracker's in-memory state was being trusted for the landing page numbers. That was wrong: tracker JSON files on persistent disk can drift, get corrupted, or be partially rolled back. Public-facing data now goes directly through the DB.

  4. Test SQL queries against non-trivial fixtures locally. Bug 2 would have been caught by any non-empty closed_trades table during local testing. We are adding a CI integration test that runs the restore query against a seeded test DB.
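Lesson 2 can be as small as a shared flag that only the DB connect loop writes and the health endpoint reads. A sketch, with an AtomicBool standing in for the engine_db_connected gauge; the function names and status codes below are illustrative, not the engine's actual handlers.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for the engine_db_connected Prometheus gauge.
static DB_CONNECTED: AtomicBool = AtomicBool::new(false);

// The DB connect loop flips this on every successful connect and
// clears it on every failure.
fn record_db_status(connected: bool) {
    DB_CONNECTED.store(connected, Ordering::Relaxed);
}

// Health handler: report unhealthy unless the engine actually has a DB
// connection, instead of "the metrics thread is alive, so 200".
fn health_status() -> u16 {
    if DB_CONNECTED.load(Ordering::Relaxed) { 200 } else { 503 }
}
```

With this wiring, the 07:00 outage would have flipped the health check to 503 within the first retry loop instead of reporting green for hours.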

Engine deploy failed: missing DATABASE_URL · postmortem · Killbot