Delayed webhook delivery
Incident Report for Elastic Path
Postmortem

What happened?

On April 30th, 2019, we were made aware that integration events were not being sent. [How did we notice this?] Checking the message count for the integrations.event queue showed around 50,000 unprocessed events.

The normal fix for this issue is to restart the delegator, which forces a reconnect so that all unprocessed messages are picked up. However, the integrations delegator then attempted to process all 50,000+ messages at the same time instead of processing them in batches. It consumed messages until it ran out of resources, crashed, restarted, and consumed again until it crashed, in a repeating cycle.
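As an illustration of learning 3 below, here is a minimal sketch of how a consumer could bound the number of messages it holds in flight, assuming a Python consumer built on the pika client (the connection details and prefetch value are placeholders, not our actual delegator code):

```python
import pika

# Placeholder connection details; the real broker host and credentials differ.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Cap the number of unacknowledged messages delivered to this consumer at once,
# so a 50k backlog is drained in small batches rather than pulled into memory
# all at once.
channel.basic_qos(prefetch_count=50)
```

With a prefetch limit in place, the broker stops delivering new messages once the consumer already holds that many unacknowledged, so a restart against a large backlog no longer exhausts the delegator's resources.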

Since the delegator acknowledges a message before it has finished processing it, any message that had been acked but not fully processed when the delegator crashed was not re-queued by Rabbit and was lost for good. It is not possible to identify how many of those queued events had integrations attached or how many integration jobs were lost.
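A minimal sketch of the safer acknowledgment pattern described in learnings 2 and 4, again assuming a pika-based Python consumer; the handler and queue wiring are illustrative only:

```python
import pika

# Placeholder connection; the real broker details differ.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()


def handle_integration_event(body):
    # Placeholder for the real integration-event processing logic.
    print("processing event:", body)


def on_message(channel, method, properties, body):
    try:
        handle_integration_event(body)
    except Exception:
        # Processing failed: return the message to the queue instead of losing it.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        return
    # Acknowledge only after processing has finished, so a crash mid-processing
    # leaves the message in Rabbit to be redelivered.
    channel.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(
    queue="integrations.event",
    on_message_callback=on_message,
    auto_ack=False,  # manual acknowledgment, not ack-on-delivery
)
channel.start_consuming()
```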

What was the root cause?

The integrations delegator stopped consuming messages from Rabbit, even though the consumer count reported by Rabbit never dipped below 1.

What steps did we take to identify and isolate the issue?

Checking the unprocessed message count in Rabbit showed that messages were stacking up, which confirmed the issue.

How long did it take for us to triangulate it, and is there anything we could do to shorten that time?

  • Around 30 hours elapsed from when messages stopped being processed until the queue was empty.
  • We currently monitor only the number of consumers on a queue, expecting that a disconnected consumer would show up as a drop in that count; it did not. An additional monitor on unacked or unprocessed messages in a queue would have let us spot this much faster (a rough sketch of such a check follows this list).
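A rough sketch of the kind of queue-depth check that would have caught this sooner, using the RabbitMQ management HTTP API; the endpoint, credentials, and thresholds are placeholders, and a real check would feed a metrics/alerting system rather than print:

```python
import requests

# Placeholder management-API endpoint (vhost "/" is URL-encoded as %2F) and credentials.
QUEUE_API = "http://localhost:15672/api/queues/%2F/integrations.event"
AUTH = ("guest", "guest")

# Illustrative thresholds, not tuned values.
MAX_READY = 1_000
MAX_UNACKED = 100


def check_queue_depth():
    """Flag a backlog of messages waiting in, or stuck unacked on, the queue."""
    stats = requests.get(QUEUE_API, auth=AUTH, timeout=10).json()
    ready = stats["messages_ready"]
    unacked = stats["messages_unacknowledged"]
    if ready > MAX_READY or unacked > MAX_UNACKED:
        print(f"ALERT: integrations.event backlog ready={ready} unacked={unacked}")
    return ready, unacked


if __name__ == "__main__":
    check_queue_depth()
```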

Which customers/services bore the brunt of the outage?

  • Integrations and all customers that use that functionality.

How did we fix it?

  • Restarting the integrations delegator.

What did we learn? How will those learnings advise our process, product, and strategy?

1. Monitoring the queue consumer count alone does not give us full visibility into whether the integrations delegator/consumer apps are processing messages. We should also monitor the number of messages in a queue.
2. The integrations delegator should not acknowledge a message until it has finished processing it.
3. We should be able to limit how many messages are processed by an app at once.
4. The integrations consumer should not use auto-ack for its messages and should instead acknowledge them when it has finished processing them.
5. If the integrations delegator/consumer cannot reconnect to the queue automatically when it loses its connection, we should use health checks to have Kubernetes restart it automatically rather than relying on a human to do it (a liveness-check sketch follows this list).
6. The Compose/RabbitMQ/queue-consumer setup underpins critical components of our customers' systems, and it does not reach the levels of reliability we need.
7. We do not have enough visibility into what is happening inside these services. We should be able to identify which events were triggered and when.
8. Being able to replay past events (with payloads) would be very useful in situations like this.
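One possible shape for the health check in learning 5, sketched under the assumption that the delegator touches a heartbeat file whenever it processes a message or completes an idle poll (the file path, threshold, and probe wiring are all illustrative): a Kubernetes exec liveness probe runs this script, and a non-zero exit triggers an automatic restart of the pod.

```python
#!/usr/bin/env python3
"""Illustrative liveness check for the integrations delegator.

Assumes the consumer touches HEARTBEAT_FILE after each processed message or
idle poll. If the heartbeat is stale, exit non-zero so a Kubernetes exec
liveness probe restarts the pod instead of waiting for a human.
"""
import os
import sys
import time

HEARTBEAT_FILE = "/tmp/delegator-heartbeat"  # placeholder path
MAX_AGE_SECONDS = 120                        # placeholder staleness threshold


def main():
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except OSError:
        sys.exit(1)  # no heartbeat yet: treat as unhealthy
    sys.exit(0 if age <= MAX_AGE_SECONDS else 1)


if __name__ == "__main__":
    main()
```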

Posted Jun 12, 2019 - 08:00 UTC

Resolved
We have identified an issue that caused the delivery of some webhooks to be delayed between April 29th and 30th. Processing the delayed events is now complete. If you believe a required event was not executed, please contact support.
Posted Apr 29, 2019 - 07:00 UTC