Summary
Between 2026-02-11 14:30 and 2026-02-12 07:00 UTC we experienced intermittent degradation of the Dashboard and SDK APIs and delays in event delivery to integrations (including custom webhooks). The incident was caused by an unintended network configuration change during planned maintenance, which destabilized network interfaces on a subset of our infrastructure.
Timeline (UTC)
- 14:30 — network instability begins on a limited subset of infrastructure;
- 15:10 — event sending latency begins increasing;
- 16:20 — service instances in the affected area begin restarting; 5xx errors increase for part of Dashboard traffic;
- 17:31 — peak impact: SDK downtime begins; we put temporary restrictions in place to stabilize the platform and prevent cascading failures;
- 17:51 — SDK recovers;
- 18:00 — event processing resumes, but with elevated delay;
- 19:00 — Dashboard recovers;
- 20:00 — event delivery returns to normal levels;
- 20:40–22:10 — intermittent Dashboard degradation;
- 2026-02-12 05:30–06:30 — second wave: a broad latency spike as the same triggering change propagated more widely;
- 2026-02-12 06:40 — triggering change fully reverted; services stabilize.
Customer impact
Event delivery to integrations
- 2026-02-11 15:10–17:31 UTC — events were delivered with ~5–10 min delay;
- 2026-02-11 17:31–18:00 UTC — event delivery paused;
- 2026-02-11 18:00–20:00 UTC — events were delivered with ~15–20 min delay, then returned to normal.
Dashboard API
- 2026-02-11 14:30–16:20 UTC — high latency for a portion of traffic;
- 2026-02-11 16:20–17:31 UTC — elevated 5xx errors for a portion of traffic;
- 2026-02-11 17:31–19:00 UTC — Dashboard was unavailable;
- 2026-02-11 19:00–22:10 UTC — intermittent latency, then stabilized;
- 2026-02-12 05:30–06:30 UTC — high latency affecting a larger share of traffic.
SDK API
- 2026-02-11 17:31–17:51 UTC — full downtime;
- 2026-02-12 05:30–06:30 UTC — high latency.
Root cause
During planned maintenance, a DevOps engineer executed a bash script intended to add a host route and restart the network interface. The script contained a bug that caused it to run in an infinite loop, sequentially restarting network interfaces across nodes. These repeated interface restarts introduced intermittent network disruption and latency in communication between internal services.
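The report does not include the script itself; the sketch below is a hypothetical illustration of how an unguarded retry loop of this shape can fail to terminate (the commented commands and the `add_route`/`retry_route` names are stand-ins, not the actual maintenance script). A bounded retry with an explicit failure path avoids the spin:

```shell
# Hypothetical failure mode (NOT the actual script): `ip route add` exits
# non-zero when the route already exists, so "retry until success" never
# terminates, and restarting the interface on every pass drops traffic:
#
#   while ! ip route add 10.0.0.0/24 via 10.0.0.1; do
#       systemctl restart networking
#   done
#
# Bounded variant: cap the retries and surface the failure instead of looping.
add_route() { false; }   # stub that always fails, standing in for the real command

retry_route() {
    attempts=0
    until add_route; do
        attempts=$((attempts + 1))
        if [ "$attempts" -ge 3 ]; then
            echo "giving up after $attempts attempts" >&2
            return 1
        fi
    done
}

retry_route || echo "route change failed; aborting this maintenance step"
```

With a retry cap, a persistent failure aborts the maintenance step once instead of restarting interfaces indefinitely.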
This had a particularly strong impact on our Postgres connection pool:
- network restarts repeatedly dropped existing database connections, forcing the service to open new ones;
- the old connections were not immediately closed on the Postgres side (they were subject to a 20-minute server-side lifetime/timeout);
- as a result, dead/stale connections accumulated quickly and eventually exhausted the Postgres connection limit (max_connections);
- once the limit was reached, the service could not open new connections, which caused errors and degraded availability.
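The exhaustion dynamic above is simple arithmetic, sketched below with assumed numbers (the pool size and max_connections values are illustrative; only the 20-minute server-side timeout comes from this report): each interface restart orphans the current pool on the server while the client opens a fresh set, so server-side slots grow by roughly one pool per restart until the limit trips.

```shell
# Toy model of connection-slot accumulation; all sizes are assumptions.
pool_size=50          # connections the service re-opens after each drop (assumed)
max_connections=300   # Postgres server limit (assumed)

server_side=$pool_size   # slots held on the Postgres side
blips=0                  # interface restarts so far
while [ "$server_side" -le "$max_connections" ]; do
    blips=$((blips + 1))
    # Each restart: old connections linger server-side until the 20-minute
    # timeout, while the client opens a fresh pool on top of them.
    server_side=$((server_side + pool_size))
done
echo "limit exceeded after $blips interface restarts ($server_side connections held)"
```

Because the loop restarted interfaces far faster than once per 20 minutes, stale slots accumulated much quicker than the server-side timeout could reclaim them.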