Summary
Between 2026-02-11 14:30 and 2026-02-12 07:00 UTC we experienced intermittent degradation of the Dashboard and SDK APIs and delays in event delivery to integrations (including custom webhooks). The incident was caused by an unintended network configuration change during planned maintenance, which destabilized network interfaces on a subset of our infrastructure.
Timeline (UTC)
- 14:30 — network instability begins on a limited subset of infrastructure;
- 15:10 — event sending latency begins increasing;
- 16:20 — service instances in the affected area begin restarting; 5xx errors increase for part of Dashboard traffic;
- 17:31 — peak impact: SDK downtime begins; we put temporary restrictions in place to stabilize the platform and prevent cascading failures;
- 17:51 — SDK recovers;
- 18:00 — event processing resumes, but with elevated delay;
- 19:00 — Dashboard recovers;
- 20:00 — event delivery returns to normal levels;
- 20:40–22:10 — intermittent Dashboard degradation;
- 2026-02-12 05:30–06:30 — second wave: a broad latency spike as the same triggering change propagated more widely;
- 2026-02-12 06:40 — triggering change fully reverted; services stabilize.
Customer impact
Event delivery to integrations
- 2026-02-11 15:10–17:31 UTC — events were delivered with ~5–10 min delay;
- 2026-02-11 17:31–18:00 UTC — event delivery paused;
- 2026-02-11 18:00–20:00 UTC — events were delivered with ~15–20 min delay, then returned to normal.
Dashboard API
- 2026-02-11 14:30–16:20 UTC — high latency for a portion of traffic;
- 2026-02-11 16:20–17:31 UTC — elevated 5xx errors for a portion of traffic;
- 2026-02-11 17:31–19:00 UTC — Dashboard was unavailable;
- 2026-02-11 19:00–22:10 UTC — intermittent latency, then stabilized;
- 2026-02-12 05:30–06:30 UTC — high latency affecting a larger share of traffic.
SDK API
- 2026-02-11 17:31–17:51 UTC — full downtime;
- 2026-02-12 05:30–06:30 UTC — high latency.
Root cause
During planned maintenance, a DevOps engineer executed a bash script intended to add a host route and restart the network interface. The script contained a bug that caused it to run in an infinite loop, sequentially restarting network interfaces across nodes. These repeated interface restarts introduced intermittent network disruption and latency in communication between internal services.
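The report does not include the script itself; the sketch below is a hypothetical illustration of how an unguarded retry loop of this shape can fail to terminate (the commented commands and the `add_route`/`retry_route` names are stand-ins, not the actual maintenance script). A bounded retry with an explicit failure path avoids the spin:

```shell
# Hypothetical failure mode (NOT the actual script): `ip route add` exits
# non-zero when the route already exists, so "retry until success" never
# terminates, and restarting the interface on every pass drops traffic:
#
#   while ! ip route add 10.0.0.0/24 via 10.0.0.1; do
#       systemctl restart networking
#   done
#
# Bounded variant: cap the retries and surface the failure instead of looping.
add_route() { false; }   # stub that always fails, standing in for the real command

retry_route() {
    attempts=0
    until add_route; do
        attempts=$((attempts + 1))
        if [ "$attempts" -ge 3 ]; then
            echo "giving up after $attempts attempts" >&2
            return 1
        fi
    done
}

retry_route || echo "route change failed; aborting this maintenance step"
```

With a retry cap, a persistent failure aborts the maintenance step once instead of restarting interfaces indefinitely.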
This had a particularly strong impact on our Postgres connection pool:
- network restarts repeatedly dropped existing database connections, forcing the service to open new ones;
- the old connections were not immediately closed on the Postgres side (they were subject to a 20-minute server-side lifetime/timeout);
- as a result, dead/stale connections accumulated quickly and eventually exhausted the Postgres connection limit (max_connections);
- once the limit was reached, the service could not open new connections, which caused errors and degraded availability.
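The exhaustion dynamic above is simple arithmetic, sketched below with assumed numbers (the pool size and max_connections values are illustrative; only the 20-minute server-side timeout comes from this report): each interface restart orphans the current pool on the server while the client opens a fresh set, so server-side slots grow by roughly one pool per restart until the limit trips.

```shell
# Toy model of connection-slot accumulation; all sizes are assumptions.
pool_size=50          # connections the service re-opens after each drop (assumed)
max_connections=300   # Postgres server limit (assumed)

server_side=$pool_size   # slots held on the Postgres side
blips=0                  # interface restarts so far
while [ "$server_side" -le "$max_connections" ]; do
    blips=$((blips + 1))
    # Each restart: old connections linger server-side until the 20-minute
    # timeout, while the client opens a fresh pool on top of them.
    server_side=$((server_side + pool_size))
done
echo "limit exceeded after $blips interface restarts ($server_side connections held)"
```

Because the loop restarted interfaces far faster than once per 20 minutes, stale slots accumulated much quicker than the server-side timeout could reclaim them.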