The issue started on August 10 at 08:13 UTC and lasted until 20:09 UTC. It affected only the "trial started" event on some Android applications: instead of that event, we sent a "subscription started" event.
We rolled out an update to support the Billing 5 API for all customers. We tested this update for almost a month, and during the last 2 weeks it was also stage-tested on some production apps that are part of our beta program. During testing, we checked the integrations, and everything was fine. After releasing to production, we monitored the metrics, database load, unexpected errors, and the integrations too.
We checked the number of events generated and sent to the integrations, but we didn't check the ratio of different event types. In reality, even if we had, we might have missed it, because not all apps were affected, which is also why we didn't encounter it during testing.
The problem was that we rely on Google RTDN (Real-time developer notifications) to receive updates about subscription events. When a trial or subscription is activated in the app, the Adapty SDK sends a request to our backend, and this is how we learn about transactions. At the same time, Google or Apple also sends an event to our webhook with a similar payload but less data (no prices, no country, no paywall, etc.). Typically, we would receive the request from our SDK first, and then the one from Google. It turned out that this is not always the case on Android: for some apps, webhook events arrive before SDK requests. With the old logic, before Billing 5, this was not a problem, because the information about the trial was inside the payload. In Billing 5, Google removed this information from the payload, so we have to rely on data from the SDK to process the event correctly. As a result, when the webhook fired before the SDK request, we didn't know the purchase was a trial.
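The ordering problem can be illustrated with a minimal sketch. This is not our backend code: the names `SdkTransaction`, `sdk_transactions`, `handle_sdk_request`, and `handle_webhook` are hypothetical, and the in-memory dictionary stands in for real storage. It only shows how a webhook that no longer carries trial information ends up labeled as a subscription start when it arrives before the SDK request.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SdkTransaction:
    """Hypothetical shape of what the SDK reports to the backend."""
    purchase_token: str
    is_trial: bool


# Hypothetical in-memory store of transactions already reported by the SDK.
sdk_transactions: Dict[str, SdkTransaction] = {}


def handle_sdk_request(tx: SdkTransaction) -> None:
    """Remember the transaction details sent by the SDK."""
    sdk_transactions[tx.purchase_token] = tx


def handle_webhook(purchase_token: str) -> str:
    """Pick the event name for a store webhook.

    Assumption: the webhook itself carries no trial flag, so the only
    source of that information is a previously stored SDK request.
    """
    tx = sdk_transactions.get(purchase_token)
    if tx is not None and tx.is_trial:
        return "trial_started"
    # With no SDK data yet, the trial cannot be recognized.
    return "subscription_started"
```

If `handle_sdk_request` runs first for a trial purchase, `handle_webhook` returns `"trial_started"`. If the webhook arrives first, the lookup finds nothing and the same purchase is reported as `"subscription_started"`, which matches what happened during the incident.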
We don't know why some webhooks are sent before the SDK request and some after. We should have thought about this in advance, but we overlooked it.
We're sorry this happened. We have been mitigating the causes of past problems and aligning our processes to prevent cases like this, but unfortunately, this time we failed.
To prevent this from happening again, in the coming weeks we will implement a dashboard that displays the number of generated events by type, platform, app, and destination. We will also check all the places where requests from our SDK and Apple/Google webhooks can interfere with each other.
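As a rough illustration of the kind of check such a dashboard could support, the sketch below compares the share of each event type on one platform between two periods. The function names, the `(event_type, platform)` tuple layout, and the 50% threshold are all assumptions made for this example, not a description of the planned implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical event record: (event_type, platform).
Event = Tuple[str, str]


def event_type_shares(events: List[Event], platform: str) -> Dict[str, float]:
    """Return the fraction of each event type among one platform's events."""
    counts = Counter(etype for etype, plat in events if plat == platform)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {etype: n / total for etype, n in counts.items()}


def share_dropped(
    baseline: Dict[str, float],
    current: Dict[str, float],
    event_type: str,
    max_relative_drop: float = 0.5,  # assumed threshold, tune per app
) -> bool:
    """Flag an event type whose share fell by more than the allowed fraction."""
    before = baseline.get(event_type, 0.0)
    after = current.get(event_type, 0.0)
    if before == 0.0:
        return False  # nothing to compare against
    return after < before * (1 - max_relative_drop)
```

A check like this looks at the mix of events rather than the total, so it can react when "trial started" events are replaced by "subscription started" events while the overall count stays the same. Because only some apps were affected, it would likely need to run per app to be useful.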
No data was lost during this issue, and app users were not affected in any way. It affected only the analytics dashboard and integrations. We will fix the names and payloads of the corrupted events in the near future.
Posted Aug 10, 2023 - 08:00 UTC