On August 21st, our monitoring showed that the service responsible for sending webhooks was restarting more often than expected. An engineer investigated and determined that the service was receiving invalid webhook configurations. These caused the app to crash, after which the scheduler restarted it. The engineer raised a ticket, which was treated as a priority but not as critical, because the infrastructure was designed to handle this gracefully.
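To illustrate the general idea of rejecting a bad configuration instead of letting it crash the worker, here is a minimal sketch. It is not our production code: the function names are hypothetical, and the notion of "invalid" used here (a missing or malformed `url` field) is an assumption made purely for the example.

```python
from urllib.parse import urlsplit


def is_valid_config(config):
    """Return True if the config has a usable http(s) URL.

    Assumption: a config is a dict with a "url" key. The real
    configuration format is not described in this post.
    """
    url = config.get("url") if isinstance(config, dict) else None
    if not isinstance(url, str):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # for example an unterminated IPv6 literal.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def partition_configs(configs):
    """Split configs into (valid, rejected) without raising."""
    valid, rejected = [], []
    for config in configs:
        (valid if is_valid_config(config) else rejected).append(config)
    return valid, rejected
```

With a check like this, a rejected configuration can be logged and skipped while the remaining webhooks continue to be delivered.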
This continued until the ticket was picked up the following day. By then, a backlog of webhooks had built up: because of the restarts, the application was running less of the time than usual, which slowed webhook delivery. Our engineers also identified that if the app was restarted after a job had been retrieved from the queue but before the webhook had been sent, the job would not be re-queued automatically. Similarly, a webhook might be sent without its log entry being written. At this point we raised an issue on our status page.
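The lost-job scenario above can be sketched with a small example. This is an illustration under stated assumptions, not a description of our actual queue: `InMemoryQueue`, `reserve`, `ack`, `requeue_in_flight` and `process_one` are hypothetical names. The point it shows is that a job acknowledged only after the webhook has been sent survives a restart that happens in between.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in for a job queue with explicit acknowledgement."""

    def __init__(self, jobs):
        self.pending = deque(jobs)
        self.in_flight = {}

    def reserve(self):
        """Hand out the next job, keeping a copy until it is acknowledged."""
        if not self.pending:
            return None
        job = self.pending.popleft()
        self.in_flight[job["id"]] = job
        return job

    def ack(self, job_id):
        """Drop the job for good once delivery has succeeded."""
        self.in_flight.pop(job_id, None)

    def requeue_in_flight(self):
        """On worker start-up, put back every job that was never acknowledged."""
        for job in reversed(list(self.in_flight.values())):
            self.pending.appendleft(job)
        self.in_flight.clear()


def process_one(queue, send):
    """Send one webhook; return False if the queue is empty."""
    job = queue.reserve()
    if job is None:
        return False
    send(job)              # a crash here leaves the job in in_flight
    queue.ack(job["id"])   # acknowledge only after the send succeeded
    return True
```

Note the trade-off: if the worker restarts after `send` returns but before `ack` runs, the webhook is delivered a second time on recovery. This pattern gives at-least-once delivery, so receivers would need to tolerate duplicates.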
The actions we decided to take based on this were:
#1, #2 and #3 have already been rolled out to production. Our longer-term plan for the webhooks system is to simplify how it works internally, so that there are fewer potential points of failure, and to improve reliability and visibility for end users.