We're experiencing an elevated level of API errors and are currently looking into the issue.

Incident Report for Adapty

Postmortem

Postmortem — Adapty SDK API elevated errors (2026-06-17)

Summary

On 2026-06-17 between 16:06 and 19:40 UTC, the Adapty SDK API returned elevated 5xx errors and increased latency for mobile apps, in two bursts (around 16:06 and around 18:07 UTC). A coordination service that the API relies on to serialize per-user updates reached its connection limit during a routine deployment, and requests that depend on it began to stall. This in turn exhausted the database connection pool, briefly cascading into errors across SDK endpoints and delaying some third-party integration deliveries. We mitigated by relieving load on the coordination service and raising its connection ceiling; the issue recurred about an hour later before being stabilized.

Impact

  • Duration: Approximately 3h34m (16:06 → 19:40 UTC), with two acute windows: ~16:06–16:53 and ~18:07–18:34 UTC.
  • Surfaces affected: SDK endpoints (profile updates, attribution, receipt/purchase validation for the App Store and Play Store), third-party integration deliveries, and parts of the Dashboard.
  • Visible signal: Elevated 5xx rates and latency on SDK API requests; some integration deliveries were delayed.
  • Recovery: SDK clients that automatically retry recovered most requests on retry. Some integration deliveries and attribution events that do not retry may have been delayed and, in rare cases, may require replay.

Timeline (UTC)

Time Event
~16:06 Elevated SDK API latency and 5xx errors begin during a routine deployment.
~16:10 A coordination service reaches its connection ceiling, causing dependent requests to fail.
~16:42 First burst largely subsides after load is relieved.
~17:08 Connection ceiling raised to add headroom.
~18:07 A second, shorter burst occurs and is mitigated.
19:40 Full recovery confirmed across affected services.

Root cause and contributing factors

Root cause: A service used to coordinate per-user updates on high-traffic SDK endpoints reached its maximum connection limit during a deployment, when a surge of reconnecting clients combined with in-flight request backlog. Requests waiting on that coordination layer stalled, which then exhausted the database connection pool and produced cascading errors.

Contributing factors:

  1. The coordination service's connection limit was shared across all SDK request paths, and client connections were not bounded on the application side — so a reconnection surge could saturate it.
  2. There was no early-warning alert below the service's terminal connection threshold; the first signal appeared only once the system was already near saturation.
  3. Raising the connection ceiling added headroom but did not address the underlying surge behavior, which is why a second burst occurred.

What went well

  • Protective limits behaved as designed and prevented data inconsistency under load.
  • Once identified, mitigations were applied quickly from a known recovery procedure.
  • Automatic SDK client retries recovered the majority of affected requests.

What we will do

Prevent — stop the same cause from recurring

  • Bound client connections to the coordination service on the application side so a reconnection surge cannot saturate it.
  • Distribute coordination load across multiple nodes to remove the single-node connection ceiling.

Detect — catch the same cause faster next time

  • Add early-warning alerting on coordination-service connections and connection rate, well before saturation.
  • Add dedicated alerting on database connection-pool utilization.

Mitigate — limit blast radius if the same cause recurs

  • Stagger deployment rollouts to avoid synchronized client reconnection surges.
  • Make the affected request path degrade gracefully instead of stalling when the coordination service is unavailable.

We apologize for the disruption to your applications during this window. If your integration is missing data from this period, please reach out to your usual Adapty contact and reference the 2026-06-17 incident.

Posted Jun 17, 2026 - 21:16 UTC

Resolved

This incident has been resolved.
Posted Jun 17, 2026 - 16:59 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 17, 2026 - 16:30 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Jun 17, 2026 - 16:25 UTC

Update

We are continuing to investigate this issue.
Posted Jun 17, 2026 - 16:13 UTC

Investigating

We are currently investigating this issue.
Posted Jun 17, 2026 - 16:11 UTC
This incident affected: API and Dashboard.