Case Study

Observability Platform

Increasing reliability and empowering product teams while lowering cost at Ahold Delhaize.

Staff+ Engineering, Engineering Management

Read the Full Story

We co-authored a blog post with Ahold Delhaize detailing the observability transformation at Albert Heijn, from fragmented tooling to a unified platform serving 1,500+ engineers.

Read on Albert Heijn Technology Blog →

Overview

Evil8 collaborated with Ahold Delhaize to build a unified observability platform for Albert Heijn, Etos, and Gall&Gall. Over two years, the platform replaced a fragmented landscape of observability tools with a single, self-service solution serving 1,800+ engineers.

Evil8 deployed a Staff Engineer to lead the design and development of the platform, and an Engineering Manager to transform the observability team into a high-performing platform team practising continuous delivery. What started as a tooling consolidation became a grassroots DevOps transformation: engineering teams took ownership of their own observability for the first time.

The Challenge

Albert Heijn's IT organisation had grown through decades of expansion, mergers and acquisitions, leaving behind a heterogeneous landscape where each department followed its own IT strategy. Observability was fragmented across:

  • Dedicated monitoring teams maintaining expensive ELK stacks, with engineers requesting dashboards through ticketing systems
  • Department-specific Grafana instances with Thanos, inaccessible to other teams
  • Legacy systems like Nagios requiring significant maintenance effort
  • Azure-managed solutions (Data Explorer, Monitor, Log Analytics) adopted by teams lacking alternatives
  • SaaS solutions like Dynatrace
  • Many teams with no observability at all

This fragmentation created two major problems: high operational overhead from maintaining disjointed infrastructure and licensing, and high mean-time-to-recovery (MTTR) due to missing end-to-end chain observability.

The Solution

We designed a multi-tenant, self-hosted "SaaS"-like platform built on the LGTM stack (Loki, Grafana, Tempo, Mimir) with a focus on self-service, clear ownership boundaries, and ease of adoption.

  • Self-service by design: APIs, UIs and documentation instead of tickets, DMs or pull requests. Engineers are automatically onboarded through existing IAM integration.
  • Clear ownership boundaries: The platform team owns the platform. Engineering teams own their observability: dashboards, alerts and on-call rotations.
  • Broad integration: OpenTelemetry, Prometheus, Syslog, Azure Diagnostics and VM-based collection ensure any team can adopt the platform regardless of their stack.
  • Single source of truth: All existing data sources were added to a central Grafana instance, allowing teams to migrate incrementally.
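
To make the multi-tenancy idea concrete, here is a minimal sketch, assuming each team maps to one tenant ID that is sent in the `X-Scope-OrgID` header, which Loki, Mimir and Tempo use to separate tenants. The group names, the mapping and the `tenant_headers` helper are hypothetical; the source does not describe how tenants are actually derived from the IAM integration.

```python
# Hypothetical mapping from IAM group to tenant ID (illustrative names only,
# not taken from the actual platform).
IAM_GROUP_TO_TENANT = {
    "team-checkout": "checkout",
    "team-logistics": "logistics",
}


def tenant_headers(iam_group: str) -> dict[str, str]:
    """Return the HTTP headers that scope a request to the caller's tenant."""
    tenant = IAM_GROUP_TO_TENANT.get(iam_group)
    if tenant is None:
        # Unknown groups are rejected rather than given a default tenant.
        raise PermissionError(f"no tenant mapped for IAM group {iam_group!r}")
    # X-Scope-OrgID is the tenant header read by Loki, Mimir and Tempo.
    return {"X-Scope-OrgID": tenant}


if __name__ == "__main__":
    print(tenant_headers("team-checkout"))
```

A collector or gateway would attach these headers to every write and query, so a team only ever sees its own logs, metrics and traces.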

Timeline

  • Aug 2023 — Analysis and requirements gathering. Initial platform design drafted.
  • Oct 2023 — Alpha release: multi-tenant Grafana with IAM integration and centralised Alertmanager. CI/CD pipeline with ephemeral environments and automated testing in place.
  • Dec 2023 — Beta release: Loki, Mimir and Tempo added. Support channels consolidated, documentation prioritised.
  • Q1 2024 — General availability. Kubernetes clusters auto-instrumented. Alertmanager deprecation plan started.
  • Q3 2024 — A department of ~500 engineers migrated hundreds of applications in a two-day on-site sprint, executed by the engineering teams themselves.
  • 2025 — All legacy Alertmanager and Grafana instances turned off. Elastic clusters decommissioned.

The Results

75% team adoption after 1 year of GA
70% reduction in observability costs
50% MTTR reduction in one department
80M active time series (3x replicated)
8,000 req/s peak ingestion at 300 MB/s
1,800+ engineers supported
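
As a rough reading of the ingestion figures, the sketch below estimates the average request size. It assumes the 8,000 req/s and 300 MB/s peaks coincide and that MB means 10^6 bytes; the source states neither, so treat the result as an order-of-magnitude estimate only.

```python
def avg_payload_bytes(bytes_per_second: float, requests_per_second: float) -> float:
    """Average bytes per request, given throughput and request rate."""
    return bytes_per_second / requests_per_second


# Figures from the results above; decimal megabytes are an assumption.
PEAK_BYTES_PER_SECOND = 300 * 10**6
PEAK_REQUESTS_PER_SECOND = 8_000

if __name__ == "__main__":
    size = avg_payload_bytes(PEAK_BYTES_PER_SECOND, PEAK_REQUESTS_PER_SECOND)
    print(f"{size / 1000:.1f} kB per request")  # prints "37.5 kB per request"
```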

Beyond the numbers, teams started using the platform in ways we didn't anticipate: networking teams monitoring physical and virtual appliances, distribution centres tracking operational processes, and engineering teams collecting signals from point-of-sale machines to detect failures in real time.

Let's Talk

Facing similar observability challenges? We'd love to hear from you.
