Elevated API Errors

Incident Report for Adapty

Postmortem

Postmortem — Adapty SDK API and Dashboard errors (2026-06-18)

Summary

On 2026-06-18 between 06:20 and 08:34 UTC, the Adapty SDK API and Dashboard returned 5xx errors at an elevated rate for a little over two hours. The cause was a change to an internal coordination service (used to safely serialize concurrent operations): its new backing store could not handle production load and began rejecting connections. Requests that depend on this service failed until the change was reverted, after which traffic recovered.

Impact

  • Duration: About 2 hours 14 minutes (06:20 → 08:34 UTC); the most acute window was ~07:10 → 08:17 UTC.
  • Surfaces affected: SDK endpoints (purchase and receipt validation for the App Store, Play Store, and Paddle), the Dashboard (publishing Flows), and some third-party integration deliveries.
  • Visible signal: Elevated 5xx error rates and increased latency on SDK endpoints; Dashboard Flow publishing returned errors.
  • Recovery: SDK clients that automatically retry recovered most requests. Some integration deliveries or operations that do not retry may need to be re-run; if you believe data from this window is missing, contact your Adapty representative.
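For callers that do not retry today, the retry behavior described above can be sketched roughly as follows. This is an illustration only, not Adapty SDK code: `TransientServerError`, `call_with_retry`, and all default values are hypothetical names and numbers chosen for the example. It assumes the wrapped operation is safe to repeat (idempotent).

```python
import random
import time


class TransientServerError(Exception):
    """Hypothetical error type standing in for an HTTP 5xx response."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Run operation(), retrying on TransientServerError.

    Waits use capped exponential backoff with full jitter, so many clients
    recovering at once do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientServerError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Backoff ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the wait can be replaced in tests; production callers would leave it at the default.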

Timeline (UTC)

Time     Event
~06:20   Automated monitoring begins reporting elevated SDK latency and 5xx errors.
~07:11   Impact on Dashboard Flow publishing confirmed; investigation underway.
~07:21   Root cause identified; remediation begins.
~08:00   Change reverted; error rates begin falling.
~08:17   Traffic stabilized.
~08:34   Recovery confirmed; incident closed.

Root cause and contributing factors

Root cause: An internal coordination service that serializes concurrent operations was switched to a new backing store that was not provisioned for production load. Under real traffic it exhausted its connection and memory limits and began resetting connections, so any request needing to coordinate an operation failed.

Contributing factors:

  1. The change reached full production traffic without a gradual/canary rollout or an instant rollback control to bound its blast radius.
  2. The new backing store was under-provisioned for production load.
  3. A failure in this coordination service caused requests to fail outright rather than falling back gracefully, so a single dependency issue affected many surfaces at once.

What went well

  • Once engaged, the team identified the root cause within minutes.
  • The fix was straightforward and fast to apply (revert the change).
  • The coordination service failed safely — there is no evidence of inconsistent or duplicated data.
  • SDK clients with automatic retry recovered most requests.

What we will do

Prevent

  • Roll out changes to shared coordination infrastructure gradually (canary), with an instant rollback control.
  • Require a kill-switch for any change exercised against shared production infrastructure.
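One common way to combine a gradual rollout with a kill-switch is deterministic percentage bucketing. The sketch below shows the general idea only; it is not a description of Adapty's tooling, and `in_canary`, `choose_backend`, the "old"/"new" labels, and the 100-bucket scheme are all assumptions made for the example.

```python
import hashlib


def in_canary(request_key: str, rollout_percent: int) -> bool:
    """Map a key to one of 100 stable buckets.

    Keys in the lowest `rollout_percent` buckets take the new path, so the
    same key always gets the same answer for a given percentage.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


def choose_backend(request_key: str, rollout_percent: int,
                   kill_switch_on: bool) -> str:
    """Return "new" or "old" (hypothetical labels for the two backing stores)."""
    if kill_switch_on:
        # The kill-switch overrides the rollout percentage entirely.
        return "old"
    return "new" if in_canary(request_key, rollout_percent) else "old"
```

With this shape, raising `rollout_percent` in small steps bounds how much traffic a bad change can reach, and flipping `kill_switch_on` returns all traffic to the old path without a deploy.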

Detect

  • Add targeted early-warning alerts on the coordination service's saturation (connections/memory) so issues are caught before they affect customer requests.
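As a rough illustration of what a saturation alert checks, the sketch below compares connection and memory usage against warning and paging thresholds. The `Saturation` fields, the function name, and the 70% / 85% thresholds are assumptions for the example, not the values or metrics we will actually use.

```python
from dataclasses import dataclass


@dataclass
class Saturation:
    """Hypothetical snapshot of the backing store's resource usage."""
    connections_used: int
    connections_max: int
    memory_used_bytes: int
    memory_max_bytes: int


def saturation_alerts(s: Saturation, warn_at: float = 0.70,
                      page_at: float = 0.85) -> list[str]:
    """Return one alert string per resource that crossed a threshold."""
    alerts = []
    for name, used, limit in (
        ("connections", s.connections_used, s.connections_max),
        ("memory", s.memory_used_bytes, s.memory_max_bytes),
    ):
        ratio = used / limit
        if ratio >= page_at:
            alerts.append(f"PAGE {name} at {ratio:.0%}")
        elif ratio >= warn_at:
            alerts.append(f"WARN {name} at {ratio:.0%}")
    return alerts
```

The point of the two levels is to leave headroom: a warning fires while the service is still accepting connections, well before it reaches the limits that caused resets in this incident.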

Mitigate

  • Make the coordination service fail gracefully (fall back / degrade) instead of failing requests outright.
  • Right-size the new backing store before any further production exposure.
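To make "fail gracefully" concrete, the sketch below shows one possible shape: try the primary coordination backend and, only if it is unavailable, fall back to a weaker lock and flag the result as degraded. Everything here is hypothetical (`CoordinationUnavailable`, `LocalLocks`, `run_serialized`), and the in-process fallback serializes callers within a single process only, so it would suit only operations that tolerate that weaker guarantee. Which operations qualify is a design decision this sketch does not settle.

```python
import threading
from contextlib import ExitStack, contextmanager


class CoordinationUnavailable(Exception):
    """Hypothetical error: the coordination backend rejected the connection."""


class LocalLocks:
    """Fallback that serializes callers within this process only."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    @contextmanager
    def lock(self, key):
        with self._guard:  # protect the dict of per-key locks
            key_lock = self._locks.setdefault(key, threading.Lock())
        with key_lock:
            yield


def run_serialized(key, operation, primary, fallback):
    """Run operation() under primary.lock(key), or fallback.lock(key) if the
    primary is unavailable. Returns (result, degraded)."""
    with ExitStack() as stack:
        try:
            stack.enter_context(primary.lock(key))
            degraded = False
        except CoordinationUnavailable:
            # Only lock acquisition is covered by this fallback; errors
            # raised by operation() itself still propagate to the caller.
            stack.enter_context(fallback.lock(key))
            degraded = True
        return operation(), degraded
```

Returning the `degraded` flag lets callers log or meter how often the fallback is used, which doubles as an early signal that the primary backend is unhealthy.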

We apologize for the disruption to your applications during this window. If your integration is missing data from this period, please reach out to your usual Adapty contact and reference the 2026-06-18 incident.

Posted Jun 18, 2026 - 14:24 UTC

Resolved

This incident has been resolved.
Posted Jun 18, 2026 - 08:27 UTC

Update

We are continuing to monitor for any further issues.
Posted Jun 18, 2026 - 08:18 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 18, 2026 - 08:17 UTC

Identified

We're experiencing an elevated level of API errors and are currently looking into the issue.
Posted Jun 18, 2026 - 07:21 UTC
This incident affected: API and Dashboard.