Case Study
Observability Platform
Increasing reliability and empowering product teams while lowering cost at Ahold Delhaize.
Read the Full Story
We co-authored a blog post with Ahold Delhaize detailing the observability transformation at Albert Heijn, from fragmented tooling to a unified platform serving 1,500+ engineers.
Overview
Evil8 collaborated with Ahold Delhaize to build a unified observability platform for Albert Heijn, Etos, and Gall&Gall. Over two years, the platform replaced a fragmented landscape of observability tools with a single, self-service solution serving 1,800+ engineers.
Evil8 deployed a Staff Engineer to lead the design and development of the platform, and an Engineering Manager to transform the observability team into a high-performing platform team practising continuous delivery. What started as a tooling consolidation became a grassroots DevOps transformation: engineering teams took ownership of their own observability for the first time.
The Challenge
Albert Heijn's IT organisation had grown through decades of expansion, mergers and acquisitions, leaving behind a heterogeneous landscape where each department followed its own IT strategy. Observability was fragmented across:
- Dedicated monitoring teams maintaining expensive ELK stacks, with engineers requesting dashboards through ticketing systems
- Department-specific Grafana instances with Thanos, inaccessible to other teams
- Legacy systems like Nagios requiring significant maintenance effort
- Azure-managed solutions (Data Explorer, Monitor, Log Analytics) adopted by teams lacking alternatives
- SaaS solutions like Dynatrace
- Many teams with no observability at all
This fragmentation created two major problems: high operational overhead from maintaining and licensing disjointed infrastructure, and a high mean time to recovery (MTTR) because end-to-end observability across the service chain was missing.
The Solution
We designed a multi-tenant, self-hosted "SaaS"-like platform built on the LGTM stack (Loki, Grafana, Tempo, Mimir) with a focus on self-service, clear ownership boundaries, and ease of adoption.
- Self-service by design: APIs, UIs and documentation instead of tickets, DMs or pull requests. Engineers are automatically onboarded through existing IAM integration.
- Clear ownership boundaries: The platform team owns the platform. Engineering teams own their observability: dashboards, alerts and on-call rotations.
- Broad integration: OpenTelemetry, Prometheus, Syslog, Azure Diagnostics and VM-based collection ensure any team can adopt the platform regardless of their stack.
- Single source of truth: All existing data sources were added to a central Grafana instance, so teams could migrate incrementally.
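To illustrate what multi-tenancy on the LGTM stack can look like from a client's point of view: Loki, Mimir and Tempo identify the tenant of each request through the `X-Scope-OrgID` HTTP header. The sketch below builds (but does not send) a Loki push request for a single tenant. The gateway URL, the tenant name and the `build_loki_push` helper are hypothetical; the case study does not describe how the platform maps teams to tenants or where authentication is enforced.

```python
import json
import time
from typing import Dict, Optional, Tuple

# Hypothetical gateway URL; the real platform endpoint is not given in the case study.
LOKI_PUSH_URL = "https://observability.example.internal/loki/api/v1/push"


def build_loki_push(
    tenant_id: str,
    labels: Dict[str, str],
    line: str,
    ts_ns: Optional[int] = None,
) -> Tuple[str, Dict[str, str], bytes]:
    """Return (url, headers, body) for a Loki push scoped to one tenant."""
    if not tenant_id:
        raise ValueError("tenant_id must not be empty")
    if ts_ns is None:
        ts_ns = time.time_ns()

    payload = {
        "streams": [
            {
                "stream": labels,                # label set identifying the stream
                "values": [[str(ts_ns), line]],  # [nanosecond timestamp as string, log line]
            }
        ]
    }
    headers = {
        "Content-Type": "application/json",
        "X-Scope-OrgID": tenant_id,  # tenant header used by Loki, Mimir and Tempo
    }
    return LOKI_PUSH_URL, headers, json.dumps(payload).encode("utf-8")


# Example with a hypothetical tenant name.
url, headers, body = build_loki_push("team-checkout", {"app": "cart"}, "order placed")
```

In a self-service setup the tenant header would more likely be set by a gateway or collector based on the caller's identity than by each application, but that is an assumption rather than something the case study states.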
Timeline
- Aug 2023 — Analysis and requirements gathering. Initial platform design drafted.
- Oct 2023 — Alpha release: multi-tenant Grafana with IAM integration and centralised Alertmanager. CI/CD pipeline with ephemeral environments and automated testing in place.
- Dec 2023 — Beta release: Loki, Mimir and Tempo added. Support channels consolidated, documentation prioritised.
- Q1 2024 — General availability. Kubernetes clusters auto-instrumented. Deprecation plan for legacy Alertmanager instances started.
- Q3 2024 — A department of ~500 engineers migrated hundreds of applications in a two-day on-site sprint, executed by the engineering teams themselves.
- 2025 — All legacy Alertmanager and Grafana instances turned off. Elastic clusters decommissioned.
The Results
- 75% team adoption after 1 year of GA
- 70% reduction in observability costs
- 50% MTTR reduction in one department
- 80M active time series (3x replicated)
- 8,000 req/s peak ingestion at 300 MB/s
- 1,800+ engineers supported
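As a rough back-of-the-envelope reading of the ingestion figures, the snippet below derives the average payload size per request. It assumes that the 8,000 req/s and 300 MB/s figures describe the same peak moment and that 1 MB means 10^6 bytes; neither assumption is confirmed by the source.

```python
# Figures taken from the results above; the pairing and the unit are assumptions.
PEAK_REQUESTS_PER_S = 8_000
PEAK_BYTES_PER_S = 300 * 10**6  # 300 MB/s, assuming 1 MB = 10^6 bytes

# Average payload size of a single ingestion request at peak.
avg_request_bytes = PEAK_BYTES_PER_S / PEAK_REQUESTS_PER_S

print(f"{avg_request_bytes / 1000:.1f} kB per request")  # prints "37.5 kB per request"
```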
Beyond the numbers, teams started using the platform in ways we didn't anticipate: networking teams monitoring physical and virtual appliances, distribution centres tracking operational processes, and engineering teams collecting signals from point-of-sale machines to detect failures in real time.