Six data sources, three Excel masters of truth, zero confidence in any number.
The operator had a TMS, a fleet system, two customs platforms, a fuel-card feed, and a homegrown invoicing tool. Six sources, no shared keys, no shared schema. Their analytics team rebuilt the same joins every Monday in three different Excel files — and they regularly disagreed.
Executives had stopped asking for KPIs because the numbers kept changing between meetings. The CFO told us bluntly: "I don't trust any single dashboard in this company."
We instrumented the existing pipelines for two weeks before drawing a single arrow.
Most data-platform engagements start with whiteboards. We started with a passive observability layer — measuring ingestion latency, error rates, and schema drift on the existing pipelines. After two weeks we had a real picture: 47% of ingestion jobs were failing silently, 22% of records had at least one quality issue, and the worst offender (the customs feed) was the one nobody wanted to touch.
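The two checks that surfaced most of those findings can be sketched in a few lines. This is a minimal illustration, not the actual tooling: the column names (`shipment_id`, `hs_code`), the `status`/`rows_written` job-log fields, and both function names are hypothetical, and plain pandas stands in for whatever the observability layer really ran on.

```python
import pandas as pd

def schema_drift(reference_cols, batch_df):
    """Report columns added or dropped relative to a reference schema."""
    current = set(batch_df.columns)
    ref = set(reference_cols)
    return {"added": sorted(current - ref), "dropped": sorted(ref - current)}

def silent_failure_rate(job_log_df):
    """Share of jobs that reported 'success' but wrote zero rows --
    the classic silent ingestion failure."""
    silent = job_log_df["status"].eq("success") & job_log_df["rows_written"].eq(0)
    return silent.mean()

# Example: a customs-feed batch that renamed a column without warning
ref_schema = ["shipment_id", "hs_code", "declared_value"]
batch = pd.DataFrame(
    {"shipment_id": [1], "hs_code_v2": ["8471"], "declared_value": [100.0]}
)
print(schema_drift(ref_schema, batch))
```

Running checks like these on a schedule, against the pipelines as they are, is what turned "we suspect the feeds are flaky" into the 47% and 22% figures above.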
We then designed the new architecture against the actual failure modes, not against a textbook lambda diagram.
"They didn't come in with a Databricks reference architecture. They came in with our actual error logs."
Bronze → silver → gold on Delta, with Great Expectations gates between layers.
Standard medallion architecture, executed boringly well. The bronze layer ingests raw data with full provenance. The silver layer applies business rules and conforms keys across systems. The gold layer holds the analytics-ready aggregates.
The non-standard part was the gates: every layer-to-layer transition runs Great Expectations checks. If a batch's quality falls below a defined threshold, the batch is held rather than allowed to poison downstream tables. Slack alerts reach the responsible team within 5 minutes; severe issues page the on-call.
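The gate pattern is simple enough to sketch. In this illustration plain pandas checks stand in for a Great Expectations suite, and the column names, the 98% threshold, and the quarantine shape are all assumptions for the example, not the production configuration:

```python
import pandas as pd

# Hypothetical threshold: minimum fraction of rows that must pass all checks
QUALITY_THRESHOLD = 0.98

def run_checks(batch: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows passing every check (stand-in for a GE suite)."""
    ok = batch["shipment_id"].notna()          # key must be present
    ok &= batch["declared_value"].ge(0)        # no negative customs values
    return ok

def gate(batch: pd.DataFrame) -> dict:
    """Promote the batch to the next layer only if quality clears the threshold."""
    passed = run_checks(batch)
    pass_rate = passed.mean()
    if pass_rate < QUALITY_THRESHOLD:
        # Hold the batch: quarantine it instead of writing downstream, then alert.
        return {
            "promoted": False,
            "pass_rate": float(pass_rate),
            "quarantined_rows": int((~passed).sum()),
        }
    return {"promoted": True, "pass_rate": float(pass_rate), "clean_batch": batch[passed]}
```

The design choice worth noting is that the gate is batch-level, not row-level: a few bad rows get quarantined and reported, but a batch that is mostly bad never touches the silver tables at all.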
5 weeks. 5 million rows ingested per month. The CFO finally trusted a number.
Within 5 weeks the new pipeline was the source of truth for 4 of the 6 source systems. Data freshness improved from 5 days (a weekly batch) to 1 hour. 90% of ingestion was fully automated; the remaining 10% (the customs feed) had a documented manual escape hatch with a clear SLA.
The most visible change was cultural: in the third executive review meeting after go-live, two department heads pulled up the same dashboard and got the same number. That hadn't happened in two years.
Numbers that survived go-live.
- Data freshness: 5 days → 1 hour (−97%)
- Ingestion automated: 53% → 90% (+37pt)
- Records with quality issues: 22% → 1.4% (−94%)
- Time to debug a wrong number: ~3 days → 20 min (−99%)
- Conflicting executive numbers: weekly → 0