Case study 001 — Logistics

Rebuilding a continental dispatch engine without taking it offline.

Eight thousand drivers, 220 dispatchers, fourteen years of accreted logic. We replaced the brain of the operation while the freight kept moving.

Client: Northwind Freight Inc.
Engagement: Embedded build · 11 months
Team: 4 engineers, 1 PM
Year: 2024–2025
Domain: Freight · LTL dispatch
Stack: Rust, Postgres, Kafka, React
Role: Architecture, build, on-call rotation
Status: Shipped · in production

Fig 1.0 — Dispatch console, primary view. Real-time load assignment across the eastern corridor.

01 The problem

Northwind’s dispatch system was written in 2011 against a Microsoft SQL Server schema that had been migrated, in place, three times. By 2024 it was the single point of failure for a US$1.3B revenue business. A dispatcher action took an average of 1.4 seconds to resolve; during the morning surge between 5:30 and 7:00 ET, that climbed past nine seconds and the system would silently stop fanning out changes to driver tablets.

The previous attempt at a rewrite had been killed eighteen months prior, after a vendor delivered a Kubernetes-shaped artifact that nobody on the Northwind side felt confident operating. Leadership’s instinct, rightly in our view, was to be extremely suspicious of anything that looked like a greenfield rebuild. We agreed to a constraint: at no point during the engagement could the dispatch system be unavailable for more than thirty seconds.

02 Architecture & decisions

We split the dispatch domain into three services along the seams that already existed in the operations team’s vocabulary: load planning (assigning freight to a tractor), routing (computing the actual sequence of stops), and execution (the stream of events from tablets and ELDs). Each got its own Postgres instance and a single Rust process. Everything talks over Kafka with a small, hand-written schema registry — we tried Protobuf and Avro and ultimately preferred a versioned JSON contract for grep-ability during incidents.
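To make the versioned-contract idea concrete, here is a minimal sketch of what a small hand-written registry could look like. It is illustrative only and not Northwind's code: the `Registry` type, the `route.assigned` subject, and the field names are all assumptions, and the payload is modelled as an already-decoded map rather than raw JSON.

```rust
use std::collections::{BTreeMap, HashMap};

/// A contract is identified by subject and version, e.g. ("route.assigned", 2).
/// Subject names here are hypothetical.
type ContractKey = (String, u32);

#[derive(Debug, PartialEq)]
enum ContractError {
    UnknownContract { subject: String, version: u32 },
    MissingField(String),
}

#[derive(Default)]
struct Registry {
    /// Required field names for each registered (subject, version).
    required: HashMap<ContractKey, Vec<String>>,
}

impl Registry {
    fn register(&mut self, subject: &str, version: u32, fields: &[&str]) {
        self.required.insert(
            (subject.to_string(), version),
            fields.iter().map(|f| f.to_string()).collect(),
        );
    }

    /// Checks that a decoded message carries every field its contract requires.
    /// Extra fields are allowed, so a newer producer does not break this check.
    fn validate(
        &self,
        subject: &str,
        version: u32,
        payload: &BTreeMap<String, String>,
    ) -> Result<(), ContractError> {
        let fields = self
            .required
            .get(&(subject.to_string(), version))
            .ok_or_else(|| ContractError::UnknownContract {
                subject: subject.to_string(),
                version,
            })?;
        for field in fields {
            if !payload.contains_key(field) {
                return Err(ContractError::MissingField(field.clone()));
            }
        }
        Ok(())
    }
}
```

Keeping the contract as a plain list of required fields per version is one way to keep messages readable in logs, which is the property the JSON choice was made for.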

The migration ran for nine months in shadow mode. Writes went to both the old system and the new; reads came from old, with the new system’s output continuously diffed against it. We caught roughly forty behavioural quirks the legacy code had developed over fourteen years — rules nobody could explain — and reproduced the ones the dispatchers depended on, then disabled the ones nobody could justify. The cutover itself was a three-line config change deployed on a Sunday at 0400 ET.
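The shadow-mode arrangement described above can be sketched roughly as follows. This is a simplified, in-memory illustration under our own assumptions: the `DispatchBackend` trait, the `ShadowRunner` type, and the two toy backends are hypothetical names, and commands and results are plain strings rather than real dispatch types.

```rust
/// Anything that can resolve a dispatcher command into an outcome.
trait DispatchBackend {
    fn handle(&mut self, command: &str) -> String;
}

/// One disagreement between the legacy system and the new one.
#[derive(Debug, PartialEq)]
struct Diff {
    command: String,
    legacy: String,
    candidate: String,
}

struct ShadowRunner<L: DispatchBackend, N: DispatchBackend> {
    legacy: L,
    candidate: N,
    diffs: Vec<Diff>,
}

impl<L: DispatchBackend, N: DispatchBackend> ShadowRunner<L, N> {
    fn new(legacy: L, candidate: N) -> Self {
        Self { legacy, candidate, diffs: Vec::new() }
    }

    /// Sends the command to both systems, records any disagreement,
    /// and always answers with the legacy result.
    fn handle(&mut self, command: &str) -> String {
        let legacy_out = self.legacy.handle(command);
        let candidate_out = self.candidate.handle(command);
        if legacy_out != candidate_out {
            self.diffs.push(Diff {
                command: command.to_string(),
                legacy: legacy_out.clone(),
                candidate: candidate_out,
            });
        }
        legacy_out
    }
}

/// Toy stand-ins for the two systems (hypothetical behaviour).
struct Legacy;
struct Candidate;

impl DispatchBackend for Legacy {
    fn handle(&mut self, command: &str) -> String {
        // Invented legacy quirk: output is upper-cased.
        command.to_uppercase()
    }
}

impl DispatchBackend for Candidate {
    fn handle(&mut self, command: &str) -> String {
        command.to_string()
    }
}
```

The caller only ever sees the legacy answer, so a bug in the candidate shows up as a row in `diffs` rather than as a change in behaviour.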

Fig 2.1 — Service topology after migration

The tradeoff we’ll defend: we kept Postgres per-service and avoided distributed transactions entirely. Every state change in routing emits an event that load planning consumes from Kafka and applies idempotently, including the compensating change if a route is later revoked. This costs us roughly forty milliseconds of eventual-consistency lag on the dispatcher’s screen. We confirmed with the operations team that this was acceptable; it is.

The thing we wish we’d done sooner: written the diff-against-legacy harness in week one, not week six. It was the single tool that made everyone calm during cutover.

Fig 2.2 — The idempotent apply pattern, simplified

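// load_planning::apply_route_event
// Simplified listing: `Pg`, `RouteEvent`, and `audit` are defined elsewhere,
// and the surrounding Postgres transaction is not shown.
pub async fn apply(evt: RouteEvent, db: &Pg) -> Result<()> {
    // Natural key: (load_id, sequence_no) is unique in route_assignments.
    let existing = db.fetch_one(
        /* sql */ r#"SELECT applied_at FROM route_assignments
                          WHERE load_id = $1 AND sequence_no = $2"#,
        &[&evt.load_id, &evt.sequence_no],
    ).await.optional()?;

    // Already applied: a replayed event is a no-op.
    if existing.is_some() { return Ok(()); }

    // If two consumers race past the check above, the unique constraint
    // rejects the second insert and the error propagates via `?`.
    db.execute(/* sql */ r#"
        INSERT INTO route_assignments (load_id, sequence_no, route, applied_at)
        VALUES ($1, $2, $3, now())
    "#, &[&evt.load_id, &evt.sequence_no, &evt.route]).await?;

    // Audit trail for the newly applied event.
    audit::record(&evt).await?;
    Ok(())
}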

A single unique constraint, no distributed transaction. The audit write is in the same Postgres transaction as the insert.
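Fig 2.2 covers the assignment path only. The compensating case, where a route is later revoked, could be sketched along the same lines. This is an in-memory illustration under stated assumptions, not the production code: the `Assigned` and `Revoked` variants and the `LoadPlan` type are hypothetical, a `HashSet` stands in for the unique constraint, and events for a given load are assumed to arrive in order.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical event shapes; the real RouteEvent is not shown in the source.
#[derive(Clone, Debug)]
enum RouteEvent {
    Assigned { load_id: u64, sequence_no: u64, route: String },
    Revoked { load_id: u64, sequence_no: u64 },
}

impl RouteEvent {
    /// The natural key used for deduplication.
    fn key(&self) -> (u64, u64) {
        match self {
            RouteEvent::Assigned { load_id, sequence_no, .. }
            | RouteEvent::Revoked { load_id, sequence_no } => (*load_id, *sequence_no),
        }
    }
}

#[derive(Default)]
struct LoadPlan {
    /// Keys of events already applied; plays the role of the unique constraint.
    applied: HashSet<(u64, u64)>,
    /// Current route per load.
    routes: HashMap<u64, String>,
}

impl LoadPlan {
    /// Returns true if the event changed state, false if it was a replay.
    fn apply(&mut self, evt: &RouteEvent) -> bool {
        // `insert` returns false when the key was already present.
        if !self.applied.insert(evt.key()) {
            return false; // already applied: safe replay
        }
        match evt {
            RouteEvent::Assigned { load_id, route, .. } => {
                self.routes.insert(*load_id, route.clone());
            }
            RouteEvent::Revoked { load_id, .. } => {
                // Compensation: undo the assignment for this load.
                self.routes.remove(load_id);
            }
        }
        true
    }
}
```

Because the revocation carries its own sequence number, replaying either event after the fact leaves the plan unchanged.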

03The product

01 The dispatcher’s primary view. We resisted the urge to redesign it. The existing keyboard shortcuts and column ordering were the result of fourteen years of expert use and we preserved them exactly. The “new” system felt like the old one on day one, which is what we wanted.

02 Route detail with live driver state. The thin column on the right is the event log: every state change visible, copyable, with the source service named. The operations team made us add this in week four and it ended up being the single most-used surface for debugging in production.

03 The field supervisor view, for when a regional manager is standing in a yard. We deliberately built this as a tightly scoped second client rather than a responsive port of the desktop. Different ergonomic context, different software.

04 The shadow-mode diff dashboard we ran for nine months. Every disagreement between old and new dispatch was a row here; we triaged them daily. By month six we were closing roughly two per week. By the time we cut over, the diff was empty for fourteen consecutive days.
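The "empty for fourteen consecutive days" condition is easy to express as a check. The sketch below is an assumption about how such a gate might be written, not a description of the dashboard itself; `ready_for_cutover` and its parameters are hypothetical names.

```rust
/// Returns true when the most recent `required_days` entries are all zero.
/// `daily_open_diffs` is ordered oldest to newest, one count per day.
fn ready_for_cutover(daily_open_diffs: &[u32], required_days: usize) -> bool {
    required_days > 0
        && daily_open_diffs.len() >= required_days
        && daily_open_diffs[daily_open_diffs.len() - required_days..]
            .iter()
            .all(|&open| open == 0)
}
```

A history shorter than the required window never passes, so the check cannot succeed on too little data.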

04 Outcome

Dispatcher action · p95

800ms → 120ms

Measured server-side at the API gateway. Morning surge p99 dropped from 9.2s to 410ms.

Cutover downtime

Three-line config change at 0400 ET on a Sunday. No customer-visible impact recorded.

Annual infra cost

−38%

Mostly from retiring two legacy Windows server farms and consolidating onto t4g instances.

Twelve months on, the system runs on four engineers from Northwind’s side — the previous monolith took eleven. The on-call burden, measured by paging events per quarter, is down by two-thirds. We’re still on a small monthly retainer for architecture review; everything else is theirs to operate.
