SDK HTTP API partially unavailable due to storage node clock desynchronization

Incident Report for Adapty

Postmortem

Postmortem — SDK API & Dashboard 5xx errors (2026-05-19)

Summary

On 2026-05-19 between 16:21 and 16:55 UTC, the Adapty SDK API and Dashboard returned elevated 5xx errors for approximately 34 minutes. The primary key-value store in the SDK request path entered a protective "stop-writes" state after its cluster nodes' clocks drifted past the allowed skew threshold. The drift accumulated because outbound NTP traffic from the affected hosts had been blocked at the network firewall layer, leaving the nodes unable to synchronize with upstream time sources. A clock step-correction applied at 16:43 UTC brought cluster skew back under 1 second within about one minute, the database resumed accepting writes by 16:46 UTC, and all customer-visible errors cleared by 16:55 UTC.

Impact

  • Duration: Approximately 34 minutes (16:21 → 16:55 UTC); the most acute window was 16:21 → 16:46 UTC.
  • Surfaces affected: SDK endpoints (profile updates, attribution, receipt validation, App Store / Play Store / Stripe / Paddle webhook deliveries), the Dashboard, and 3rd-party integration deliveries.
  • Visible signal: Elevated 5xx rates at the edge for SDK endpoints; the mobile-API readiness probe briefly reported DOWN between 16:43 and 16:45 UTC.
  • Recovery: SDK clients with automatic retries recovered most failed requests. Some webhook deliveries and attribution events that are not retried by their sender may require backfill or replay.

Timeline (UTC)

Time           Event
~16:00         Database cluster clock skew crosses the configured stop-writes threshold; the protective alert enters its evaluation delay window.
16:21          Stop-writes condition activates; the database begins rejecting writes; SDK 5xx errors begin at the edge.
16:22–16:32    5xx error rate climbs across SDK endpoints.
16:43          NTP step-correction applied to the affected hosts; cluster clock skew drops from ~38 seconds to under 1 second within ~1 minute.
16:46          The database resumes accepting writes; SDK 5xx rates begin returning to baseline.
16:55          All customer-visible errors cleared; incident fully resolved.

Root cause and contributing factors

Root cause: The SDK request path depends on a clustered key-value store that protects data consistency by halting writes when cluster nodes' clocks disagree by more than a configured threshold. On 2026-05-19, the nodes' clocks had been drifting for an extended period and crossed that threshold around 16:00 UTC, which caused the cluster to halt writes and the SDK API to return 5xx.
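
The protective behavior described above can be illustrated with a minimal sketch. It is not the database's actual implementation: the `NodeClock` type, the function names, and the 30-second threshold are hypothetical, chosen only so that the ~38-second skew observed during the incident falls above the limit.

```python
from dataclasses import dataclass

# Hypothetical threshold for illustration; the real configured value is not stated here.
HYPOTHETICAL_STOP_WRITES_THRESHOLD_S = 30.0


@dataclass(frozen=True)
class NodeClock:
    name: str
    offset_seconds: float  # node clock minus a common reference, in seconds


def max_skew_seconds(nodes):
    """Largest disagreement between any two node clocks in the cluster."""
    offsets = [n.offset_seconds for n in nodes]
    return max(offsets) - min(offsets)


def should_stop_writes(nodes, threshold_seconds=HYPOTHETICAL_STOP_WRITES_THRESHOLD_S):
    """True when the clocks disagree by more than the configured threshold."""
    return max_skew_seconds(nodes) > threshold_seconds


if __name__ == "__main__":
    drifted = [NodeClock("node-a", 0.0), NodeClock("node-b", 38.0), NodeClock("node-c", 1.5)]
    print(should_stop_writes(drifted))  # True: 38 s of skew exceeds the 30 s example limit
```

The key property is that the decision depends on disagreement between nodes, not on how far any single node is from true time, which is why a block on time synchronization can go unnoticed until the nodes diverge from each other.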

Contributing factors:

  1. Outbound NTP egress was blocked at the network firewall layer for the affected hosts, so the time daemon on each node could not reach upstream NTP servers. With the block in place, clock drift accumulated silently over many hours.
  2. No early-warning alert on clock drift. The only alert in place fired at the protective stop-writes threshold — when the incident was already in progress, not while it was approaching. There was no graduated alert at a lower skew value that would have provided hours of lead time.
  3. No direct NTP-synchronization health check at the host level. Time-sync failure was only observable as a downstream effect on cluster skew, not as a primary signal.
  4. Protective stop-writes worked as designed. The database correctly refused potentially inconsistent writes, which is preferable to silent data corruption; however, customers see this as 5xx errors rather than as graceful degradation.

What went well

  • The protective stop-writes mechanism behaved as intended and prevented data inconsistency in the affected database.
  • Once identified, the fix was surgical and fast: a single clock step-correction brought cluster skew from ~38 seconds to under 1 second in ~1 minute, and the database resumed accepting writes by 16:46 UTC.
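
For context on why a step-correction is the fast option, the arithmetic below estimates how long gradual slewing would take to remove the same offset. This is a back-of-the-envelope sketch: the 500 ppm slew limit is an assumption (a figure commonly associated with ntpd-style slewing), and the time daemon actually used on the affected hosts is not stated in this report.

```python
def slew_duration_seconds(offset_seconds: float, max_slew_ppm: float) -> float:
    """Time needed to remove an offset by slewing at a fixed maximum rate."""
    return offset_seconds / (max_slew_ppm * 1e-6)


OBSERVED_SKEW_SECONDS = 38.0   # from the timeline above
ASSUMED_MAX_SLEW_PPM = 500.0   # assumption, not taken from the affected hosts' configuration

hours = slew_duration_seconds(OBSERVED_SKEW_SECONDS, ASSUMED_MAX_SLEW_PPM) / 3600
print(f"{hours:.1f} hours")  # 21.1 hours
```

Under that assumption, slewing away 38 seconds would take most of a day, whereas the step-correction closed the gap in about a minute.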

What we will do

Prevent — stop the same cause from recurring

  • Audit firewall egress rules on every host in the affected tier; confirm outbound NTP (UDP/123) is explicitly allowed to documented upstream time sources; add an active probe that verifies egress and alerts if it regresses.
  • Document the time-synchronization topology for the affected database tier and remove ambiguity about expected behavior.
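
An active egress probe of the kind described above could look roughly like the following. This is a sketch under stated assumptions, not the probe we will deploy: it sends a single SNTP client request over UDP/123 using only the Python standard library, the server name is a placeholder, and the offset it reports ignores network delay.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 (NTP epoch) and 1970-01-01
NTP_PACKET_SIZE = 48


def build_request() -> bytes:
    # First byte encodes LI=0, VN=4, Mode=3 (client); all other fields are zero.
    return bytes([0x23]) + bytes(NTP_PACKET_SIZE - 1)


def parse_response(packet: bytes, received_at: float):
    """Return (stratum, approximate offset in seconds) from a server reply."""
    if len(packet) < NTP_PACKET_SIZE:
        raise ValueError("short NTP packet")
    if packet[0] & 0x07 != 4:
        raise ValueError("not an NTP server-mode reply")
    stratum = packet[1]
    # Transmit timestamp: 32-bit seconds and 32-bit fraction at bytes 40..47.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    server_time = seconds - NTP_EPOCH_OFFSET + fraction / 2**32
    return stratum, server_time - received_at


def probe(server: str, timeout: float = 2.0):
    """Send one query over UDP/123; a timeout (an OSError) suggests blocked egress."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (server, 123))
        packet, _ = sock.recvfrom(512)
    return parse_response(packet, time.time())


# Example call (placeholder hostname, not a real upstream source):
# stratum, offset = probe("time.example.internal")
```

Run periodically from each host, a probe like this turns "NTP egress is blocked" into a primary signal: a timeout raises an alert directly instead of waiting for drift to show up in cluster skew.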

Detect — catch the same cause faster next time

  • Add an early-warning alert on cluster clock skew at a fraction of the protective threshold, with a multi-minute evaluation window, so drift is visible hours before it can cause stop-writes.
  • Add a direct NTP-synchronization health alert (time-daemon offset, stratum, last-sync-age) on every host in the affected tier, independent of the cluster skew metric.
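
The two detection items above can be sketched as follows. The class and function names, the warning fraction, the window length, and the per-host limits are all hypothetical values chosen for illustration; only the general shape (a graduated skew alert with an evaluation window, plus a host-level check on offset, stratum, and last-sync-age) comes from the plan above.

```python
from collections import deque


class SkewEarlyWarning:
    """Fires when cluster skew stays above a fraction of the stop-writes threshold."""

    def __init__(self, stop_writes_threshold_s, warn_fraction, window_samples):
        self.warn_level_s = stop_writes_threshold_s * warn_fraction
        self.samples = deque(maxlen=window_samples)

    def observe(self, skew_seconds: float) -> bool:
        """Record one sample; True only when every sample in a full window is high."""
        self.samples.append(skew_seconds)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.warn_level_s for s in self.samples)


def host_sync_problems(offset_s, stratum, last_sync_age_s,
                       max_offset_s=0.1, max_age_s=3600.0):
    """Return the list of failing host-level time-sync checks (empty means healthy)."""
    problems = []
    if abs(offset_s) > max_offset_s:
        problems.append("offset")
    if stratum == 0 or stratum >= 16:  # treated here as "not synchronized"
        problems.append("stratum")
    if last_sync_age_s > max_age_s:
        problems.append("last-sync-age")
    return problems


if __name__ == "__main__":
    alert = SkewEarlyWarning(stop_writes_threshold_s=30.0, warn_fraction=0.25, window_samples=3)
    for skew in [8.0, 9.0, 10.0]:
        firing = alert.observe(skew)
    print(firing)                                  # True
    print(host_sync_problems(38.0, 16, 90000.0))   # ['offset', 'stratum', 'last-sync-age']
```

Requiring a full window of high samples keeps a single noisy reading from paging anyone, while the host-level check is independent of the cluster metric and would flag a host that has simply stopped syncing.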

Mitigate — limit blast radius if the same cause recurs

  • Review whether the affected SDK endpoints can degrade more gracefully when the database is in stop-writes mode (read-only path, cached fallback, queue-and-retry), rather than returning immediate 5xx.
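
As one illustration of the queue-and-retry option under review, the sketch below buffers writes while the store refuses them and replays them in order once it recovers. It is a simplified, in-memory example: `StopWritesError`, `QueueingWriter`, and the queue limit are hypothetical names and values, and a real design would also need durable queuing and idempotent writes.

```python
from collections import deque


class StopWritesError(Exception):
    """Hypothetical error raised by the store client while writes are halted."""


class QueueingWriter:
    def __init__(self, store_write, max_queued=10_000):
        self._store_write = store_write  # callable(key, value) that may raise StopWritesError
        self._pending = deque()
        self._max_queued = max_queued

    def drain(self) -> int:
        """Replay queued writes in order; stop at the first failure."""
        replayed = 0
        while self._pending:
            key, value = self._pending[0]
            try:
                self._store_write(key, value)
            except StopWritesError:
                break
            self._pending.popleft()
            replayed += 1
        return replayed

    def write(self, key, value) -> str:
        """Return "written" or "queued"; raise only when the queue is full."""
        if self._pending:
            self.drain()  # older writes go first so ordering is preserved
        if not self._pending:
            try:
                self._store_write(key, value)
                return "written"
            except StopWritesError:
                pass
        if len(self._pending) >= self._max_queued:
            raise StopWritesError("write queue full")
        self._pending.append((key, value))
        return "queued"
```

With a wrapper like this, an endpoint could acknowledge a request as accepted while the store is in stop-writes mode instead of returning an immediate 5xx, at the cost of delayed visibility for the queued data.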

We apologize for the disruption to your applications during this window. If your integration is missing data from this period, please reach out to your usual Adapty contact and reference the 2026-05-19 incident.

Posted May 19, 2026 - 17:39 UTC

Resolved

All SDK HTTP API metrics have returned to normal. While tightening the security of our services, we accidentally blocked clock synchronization traffic, which our distributed storage depends on.
Posted May 19, 2026 - 16:47 UTC

Monitoring

We have applied the fix. Storage nodes are back in sync, and response times and HTTP status codes are returning to normal.
Posted May 19, 2026 - 16:45 UTC

Investigating

We are currently investigating this issue.
Posted May 19, 2026 - 16:38 UTC
This incident affected: API.