Increased webhook error rates
Incident Report for Elastic Path
Postmortem

On August 21st, our monitoring detected that the service responsible for sending webhooks was being restarted more often than expected. An engineer investigated and determined that the service was receiving invalid webhook configurations. Each invalid configuration caused the app to crash, after which the scheduler restarted it. The engineer raised a ticket, which was treated as a priority but not as critical, because the infrastructure was designed to handle this gracefully.

The restarts continued until the ticket was picked up the following day. By then a backlog of webhooks had built up, because the restarts meant the application was running less often than usual, and webhook delivery had slowed. Our engineers also identified that if the app was restarted after a job had been retrieved from the queue but before the webhook had been sent, the job would not be re-queued automatically. Similarly, a webhook might be sent without its log entry being written. At this point we raised an issue on our status page.
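
For illustration only, the loss window described above and one common way to close it can be sketched as follows. This is not the actual service code: the `AckQueue` class, the `process_one` function, and the in-memory queue are all hypothetical stand-ins, and the sketch assumes a queue that supports explicit acknowledgement. A job is only removed for good once the webhook has been sent and the log entry written, so a restart in between leaves the job available to be re-queued.

```python
from collections import deque


class AckQueue:
    """Hypothetical in-memory stand-in for a job queue with explicit acks."""

    def __init__(self, jobs):
        self.pending = deque(jobs)
        self.in_flight = {}  # job id -> job, retrieved but not yet acknowledged

    def reserve(self):
        # Hand out a job, but keep a copy until the worker acknowledges it.
        job = self.pending.popleft()
        self.in_flight[job["id"]] = job
        return job

    def ack(self, job_id):
        # Only at this point is the job permanently removed.
        del self.in_flight[job_id]

    def requeue_unacked(self):
        # Run when a worker restarts: anything reserved but never
        # acknowledged goes back onto the queue.
        self.pending.extend(self.in_flight.values())
        self.in_flight.clear()


def process_one(queue, send, write_log):
    job = queue.reserve()
    send(job)              # a crash here leaves the job in in_flight
    write_log(job)         # a crash here also leaves it there, so it is re-sent
    queue.ack(job["id"])   # acknowledge only after both steps succeed
```

The trade-off in this kind of design is that delivery becomes at-least-once: a crash after the send but before the acknowledgement leads to the same webhook being sent again, so receivers may see duplicates.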

The actions we decided to take based on this were:

  1. Improve error handling so that the app no longer crashes when it receives an invalid configuration
  2. Patch the issue that meant messages could be dropped
  3. Improve logging of invalid configs so we can identify why they are invalid and improve validation at webhook setup
  4. Reduce the time it takes to time out failed webhooks (when the receiving endpoint doesn't respond). This currently takes far too long, and processing batches of webhooks that all hit long timeouts slows delivery down considerably. We want to address this by lowering the timeout and processing more jobs concurrently.
  5. Improve the data we log for each webhook sent and how the logs are written. This will make the logs more useful and more reliable.
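
As a rough sketch of how actions 1, 3 and 4 could fit together, the example below validates each configuration up front, records why a configuration was rejected instead of raising, and delivers a batch concurrently with a short per-request timeout. Everything in it is an assumption made for illustration: the function names, the shape of a configuration (a dict with a `url` key), the validation rules, and the values of `TIMEOUT_SECONDS` and `MAX_WORKERS` are not taken from the real system. The `send` argument is a caller-supplied function so the sketch does not depend on any particular HTTP client.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

TIMEOUT_SECONDS = 5  # assumed value; the real timeout is not stated here
MAX_WORKERS = 8      # assumed value


def validate_config(config):
    """Return a list of problems; an empty list means the config looks usable."""
    url = config.get("url") if isinstance(config, dict) else None
    if not isinstance(url, str):
        return ["missing or non-string 'url'"]
    try:
        parsed = urlparse(url)
    except ValueError as exc:  # urlparse rejects some malformed inputs
        return [f"unparseable url {url!r}: {exc}"]
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return [f"unsupported or malformed url: {url!r}"]
    return []


def deliver(config, send):
    """Deliver one webhook; always return a result dict instead of raising."""
    problems = validate_config(config)
    if problems:
        # Keep the reasons so invalid configs can be logged and investigated.
        return {"config": config, "status": "invalid", "detail": problems}
    try:
        send(config["url"], timeout=TIMEOUT_SECONDS)
    except Exception as exc:  # timeouts and connection errors end up here
        return {"config": config, "status": "failed", "detail": [repr(exc)]}
    return {"config": config, "status": "sent", "detail": []}


def deliver_batch(configs, send):
    # One slow endpoint now blocks a single worker thread, not the whole batch.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(lambda c: deliver(c, send), configs))
```

Because `deliver` returns a result for every configuration, a caller can write one log entry per webhook whether it was sent, failed, or was rejected as invalid, and a single bad configuration no longer stops the rest of the batch.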

#1, #2 and #3 have already been rolled out to production. Our longer-term plan for the webhooks system is to simplify how it works internally, so that there are fewer potential points of failure, and to improve reliability and visibility for end users.

Posted Aug 29, 2019 - 09:03 UTC

Resolved
We believe this issue is now resolved, but we will continue to monitor. Work to patch the underlying issue and prevent a recurrence is ongoing, and a postmortem will follow.
Posted Aug 22, 2019 - 15:15 UTC
Monitoring
We have rolled out the patch to mitigate the issue and are continuing to work on a full solution to prevent this from happening in the future.
Posted Aug 22, 2019 - 11:30 UTC
Identified
We've identified the cause of the issue and are rolling out a patch.
Posted Aug 22, 2019 - 09:31 UTC
Investigating
We're investigating an issue which is causing increased error rates when sending webhooks.
Posted Aug 22, 2019 - 09:25 UTC
This incident affected: EU (EU Webhooks).