This outage was caused by Cloudflare WAF Managed Rules migration. We use them to block unwanted requests to our API. Those are mostly bots. We also use rate-limiting rules. This update should’ve been invisible to the customers. But for some reason, the number of requests to our load balancer nodes increased almost 10x. This caused all of them to be 100% occupied and they started to drop most of the new connections. After introducing new load balancer nodes to the cluster, we were able to handle incoming requests. Over time the number of requests has gone back to normal.
We contacted Cloudflare for the details on what could be the reason for this behavior, and we’ll provide updates once we hear back from them.
To prevent this from happening in the future, we will introduce rate limiting not only on the Cloudflare level but also inside our infrastructure.