Rebuilding a continental dispatch engine without taking it offline.
Eight thousand drivers, 220 dispatchers, fourteen years of accreted logic. We replaced the brain of the operation while the freight kept moving.
01 The problem
Northwind’s dispatch system was written in 2011 against a Microsoft SQL Server schema that had been migrated, in place, three times. By 2024 it was the single point of failure for a US$1.3B revenue business. A dispatcher action took an average of 1.4 seconds to resolve; during the morning surge between 5:30 and 7:00 ET, that climbed past nine seconds and the system would silently stop fanning out changes to driver tablets.
The previous attempt at a rewrite had been killed eighteen months prior, after a vendor delivered a Kubernetes-shaped artifact that nobody on the Northwind side felt confident operating. Leadership’s instinct — correctly, in our view — was to be extremely suspicious of anything that looked like a greenfield rebuild. We agreed to a constraint: at no point during the engagement could the dispatch system be unavailable for more than thirty seconds.
02 Architecture & decisions
We split the dispatch domain into three services along the seams that already existed in the operations team’s vocabulary: load planning (assigning freight to a tractor), routing (computing the actual sequence of stops), and execution (the stream of events from tablets and ELDs). Each got its own Postgres instance and a single Rust process. Everything talks over Kafka with a small, hand-written schema registry — we tried Protobuf and Avro and ultimately preferred a versioned JSON contract for grep-ability during incidents.
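To illustrate what a small, hand-written registry for a versioned JSON contract might look like, here is a minimal sketch using only the Rust standard library. It assumes each contract is identified by an event name plus an integer version and lists the top-level fields a message must carry. The event name `route.assigned`, the `revoked` field, and the `Registry` type are hypothetical; JSON parsing is left out, so the message is represented by the set of field names already extracted from it.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical registry: (event name, version) -> required top-level fields.
struct Registry {
    schemas: BTreeMap<(String, u32), BTreeSet<String>>,
}

/// Helper to build an owned set of field names from string literals.
fn fields(names: &[&str]) -> BTreeSet<String> {
    names.iter().map(|n| n.to_string()).collect()
}

impl Registry {
    fn new() -> Self {
        Registry { schemas: BTreeMap::new() }
    }

    /// Register the required fields for one version of one event.
    fn register(&mut self, event: &str, version: u32, required: &[&str]) {
        self.schemas.insert((event.to_string(), version), fields(required));
    }

    /// Check that every required field for (event, version) is present.
    /// Unknown versions are rejected rather than guessed at.
    fn validate(&self, event: &str, version: u32, present: &BTreeSet<String>) -> Result<(), String> {
        let required = self
            .schemas
            .get(&(event.to_string(), version))
            .ok_or_else(|| format!("unknown schema {} v{}", event, version))?;
        let missing: Vec<&str> = required.difference(present).map(|s| s.as_str()).collect();
        if missing.is_empty() {
            Ok(())
        } else {
            Err(format!("{} v{} missing: {}", event, version, missing.join(", ")))
        }
    }
}

fn main() {
    let mut registry = Registry::new();
    registry.register("route.assigned", 1, &["load_id", "sequence_no", "route"]);
    registry.register("route.assigned", 2, &["load_id", "sequence_no", "route", "revoked"]);

    // A message written by a v1 producer.
    let msg = fields(&["load_id", "sequence_no", "route"]);
    println!("{:?}", registry.validate("route.assigned", 1, &msg));
    println!("{:?}", registry.validate("route.assigned", 2, &msg));
}
```

Keeping the registry as plain data keeps the contract as easy to read during an incident as the JSON messages themselves, which is the property the versioned JSON choice was made for.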
The migration ran for nine months in shadow mode. Writes went to both the old system and the new; reads came from old, with the new system’s output continuously diffed against it. We caught roughly forty behavioural quirks the legacy code had developed over fourteen years — rules nobody could explain — and reproduced the ones the dispatchers depended on, then disabled the ones nobody could justify. The cutover itself was a three-line config change deployed on a Sunday at 0400 ET.
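The core of a shadow-mode diff can be sketched in a few lines. The example below is an illustration under stated assumptions, not the harness used on the engagement: it assumes both systems' output can be reduced to a map from load id to assigned tractor id, and the `Assignments` and `Mismatch` types are hypothetical. Legacy is treated as the source of truth, and every disagreement is reported rather than resolved.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified view of a dispatch decision: load id -> tractor id.
type Assignments = BTreeMap<u64, String>;

/// One disagreement between the legacy output and the shadow (new) output.
#[derive(Debug, PartialEq)]
enum Mismatch {
    /// Legacy produced an assignment that the shadow system did not.
    MissingInShadow { load_id: u64, legacy: String },
    /// The shadow system produced an assignment that legacy did not.
    MissingInLegacy { load_id: u64, shadow: String },
    /// Both produced an assignment, but the values differ.
    Different { load_id: u64, legacy: String, shadow: String },
}

/// Compare legacy output against shadow output, in ascending load id order
/// within each pass (legacy keys first, then shadow-only keys).
fn diff(legacy: &Assignments, shadow: &Assignments) -> Vec<Mismatch> {
    let mut out = Vec::new();
    for (load_id, l) in legacy {
        match shadow.get(load_id) {
            None => out.push(Mismatch::MissingInShadow {
                load_id: *load_id,
                legacy: l.clone(),
            }),
            Some(s) if s != l => out.push(Mismatch::Different {
                load_id: *load_id,
                legacy: l.clone(),
                shadow: s.clone(),
            }),
            Some(_) => {} // both systems agree
        }
    }
    for (load_id, s) in shadow {
        if !legacy.contains_key(load_id) {
            out.push(Mismatch::MissingInLegacy {
                load_id: *load_id,
                shadow: s.clone(),
            });
        }
    }
    out
}

fn main() {
    let legacy: Assignments = vec![(1, "T-100".to_string()), (2, "T-200".to_string())]
        .into_iter()
        .collect();
    let shadow: Assignments = vec![
        (1, "T-100".to_string()),
        (2, "T-201".to_string()),
        (3, "T-300".to_string()),
    ]
    .into_iter()
    .collect();

    for m in diff(&legacy, &shadow) {
        println!("{:?}", m);
    }
}
```

Each reported mismatch is either a bug in the new system or a legacy quirk that needs a decision: reproduce it or retire it.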
The tradeoff we’ll defend: we kept Postgres per-service and avoided distributed transactions entirely. Every state change in routing emits an event that load planning consumes from Kafka and applies idempotently, including the corrective compensation if a route is later revoked. This costs us roughly forty milliseconds of eventual consistency on the dispatcher’s screen. We tested with the operations team whether that delay was acceptable; it is.
The thing we wish we’d done sooner: written the diff-against-legacy harness in week one, not week six. It was the single tool that made everyone calm during cutover.
// load_planning::apply_route_event
pub async fn apply(evt: RouteEvent, db: &Pg) -> Result<()> {
    // natural key: (load_id, sequence_no) is unique in route_assignments
    let existing = db.fetch_one(
        /* sql */ r#"SELECT applied_at FROM route_assignments
                     WHERE load_id = $1 AND sequence_no = $2"#,
        &[&evt.load_id, &evt.sequence_no],
    ).await.optional()?;

    if existing.is_some() { return Ok(()); } // already applied -- safe replay

    db.execute(/* sql */ r#"
        INSERT INTO route_assignments (load_id, sequence_no, route, applied_at)
        VALUES ($1, $2, $3, now())
    "#, &[&evt.load_id, &evt.sequence_no, &evt.route]).await?;

    audit::record(&evt).await?;
    Ok(())
}
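The handler above covers safe replay of an assignment. The compensation path for a revoked route can be sketched the same way. The following is an in-memory model, not the production code. It assumes a revocation arrives as its own event with its own `sequence_no`, and that events for a given load arrive in order, for example because the topic is partitioned by `load_id`. The `Change` enum and `LoadPlanning` struct are hypothetical stand-ins for the real event shape and the Postgres tables.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical event payload: a route is either assigned or revoked.
#[derive(Clone, Debug)]
enum Change {
    Assigned { route: String },
    Revoked,
}

#[derive(Clone, Debug)]
struct RouteEvent {
    load_id: u64,
    sequence_no: u64,
    change: Change,
}

/// In-memory stand-in for load planning's state.
#[derive(Default)]
struct LoadPlanning {
    /// Natural keys (load_id, sequence_no) of events already applied.
    applied: BTreeSet<(u64, u64)>,
    /// Currently active route per load.
    active_route: BTreeMap<u64, String>,
}

impl LoadPlanning {
    /// Apply an event at most once. Returns true if state changed,
    /// false if the event was a replay and was ignored.
    fn apply(&mut self, evt: &RouteEvent) -> bool {
        // `insert` returns false when the key was already present.
        if !self.applied.insert((evt.load_id, evt.sequence_no)) {
            return false;
        }
        match &evt.change {
            Change::Assigned { route } => {
                self.active_route.insert(evt.load_id, route.clone());
            }
            // Compensation: undo the effect of the earlier assignment.
            Change::Revoked => {
                self.active_route.remove(&evt.load_id);
            }
        }
        true
    }
}

fn main() {
    let mut lp = LoadPlanning::default();
    let assign = RouteEvent {
        load_id: 7,
        sequence_no: 1,
        change: Change::Assigned { route: "I-80 W".to_string() },
    };
    let revoke = RouteEvent { load_id: 7, sequence_no: 2, change: Change::Revoked };

    lp.apply(&assign);
    lp.apply(&revoke);
    // A late redelivery of the original assignment must not resurrect the route.
    lp.apply(&assign);
    println!("{:?}", lp.active_route.get(&7));
}
```

Because the applied-key check runs before any state change, redelivering an old assignment after its revocation is a no-op, so compensation is safe under at-least-once delivery.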
03 The product
04 Outcome
Twelve months on, the system runs on four engineers from Northwind’s side — the previous monolith took eleven. The on-call burden, measured by paging events per quarter, is down by two-thirds. We’re still on a small monthly retainer for architecture review; everything else is theirs to operate.