On 30 April 2019, we were made aware that integration events were not being sent [How did we notice this?]. Checking the message count for the integrations.event queue showed around 50,000 unprocessed events.
The normal fix for this issue is to restart the delegator, which forces a reconnect so that unprocessed messages are picked up again. However, the integrations delegator then attempted to process all 50k+ messages at the same time instead of processing them in batches. It consumed messages until it ran out of resources, crashed, restarted, and began consuming again until it crashed, in a loop.
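Our consumers' exact stack isn't shown here, but as an illustration of how that consumption could be bounded, RabbitMQ's prefetch count (basic.qos) caps how many delivered-but-unacknowledged messages the broker will hand a consumer at once. A minimal sketch in Python with the pika client; the host name and the prefetch value of 50 are placeholders:

```python
import pika

def handle_event(ch, method, properties, body):
    # Placeholder for real event handling.
    print("processing", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

# Host name is a placeholder.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()

# basic.qos caps delivered-but-unacknowledged messages on this channel
# at 50, so a restarted consumer works through a 50k backlog in bounded
# chunks instead of pulling everything into memory at once. The cap
# only applies when manual acknowledgements are used (auto_ack=False).
channel.basic_qos(prefetch_count=50)

channel.basic_consume(
    queue="integrations.event",
    on_message_callback=handle_event,
    auto_ack=False,
)
channel.start_consuming()
```

The exact prefetch value is a throughput/memory trade-off; any bounded value would have prevented the crash loop described above.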
Because the delegator acknowledges a message before it has finished processing it, every message that had been acked but not fully processed when the delegator crashed was not re-queued by Rabbit and was lost for good. It is not possible to identify how many of those queued events had integrations attached, or how many integration jobs were lost.
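The acknowledgement half of the fix (lessons 2 and 4 below) is to ack only after processing finishes, so a crash leaves in-flight messages unacked and Rabbit redelivers them. A hedged sketch of such a handler, again assuming pika; process_integration_event is a hypothetical stand-in for the real work:

```python
def process_integration_event(body):
    ...  # hypothetical: dispatch the event to its integration jobs

def handle_event(ch, method, properties, body):
    try:
        process_integration_event(body)
    except Exception:
        # Processing failed: reject and requeue so Rabbit redelivers
        # the event later instead of it being silently dropped.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        raise
    # Ack only once processing has succeeded. If the process crashes
    # before this line, the message stays unacked and Rabbit returns
    # it to the queue, so a crash no longer loses events.
    ch.basic_ack(delivery_tag=method.delivery_tag)
```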
The integrations delegator stopped consuming messages from Rabbit, but the consumer count in Rabbit never dipped below 1, so from that metric alone everything looked healthy. Checking the unprocessed message count in Rabbit showed that messages were stacking up, and this was the issue.
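That message count is what lesson 1 below says we should be monitoring. RabbitMQ's management plugin exposes per-queue counters over HTTP (GET /api/queues/&lt;vhost&gt;/&lt;queue&gt;), so a check could look something like the sketch below; it assumes the requests library, and the host, credentials, and threshold are placeholders:

```python
import requests  # assumes the requests library and the management plugin

# %2F is the URL-encoded default vhost; host and credentials are placeholders.
RABBIT_API = "http://rabbitmq:15672/api/queues/%2F/integrations.event"
BACKLOG_THRESHOLD = 1000  # arbitrary placeholder threshold

def check_queue_health():
    stats = requests.get(RABBIT_API, auth=("guest", "guest"), timeout=5).json()
    consumers = stats["consumers"]
    backlog = stats["messages"]  # ready + unacknowledged
    # A healthy-looking consumer count can hide a stalled consumer,
    # so alert on the backlog as well.
    if consumers < 1 or backlog > BACKLOG_THRESHOLD:
        raise RuntimeError(
            f"integrations.event unhealthy: {consumers} consumers, "
            f"{backlog} unprocessed messages"
        )

if __name__ == "__main__":
    check_queue_health()
```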
1. Monitoring the queue consumer count alone does not give us full visibility into whether the integrations delegator/consumer apps are processing messages. We should also monitor the number of messages in a queue.
2. The integrations delegator should not acknowledge a message until it has finished processing it.
3. We should be able to limit how many messages an app processes at once, for example with a prefetch cap like the one sketched above.
4. The integrations consumer should not use auto-ack for its messages and should instead acknowledge them when it has finished processing them.
5. If the integrations delegator/consumer cannot reconnect to the queue automatically when it loses its connection, we should use health checks to have Kubernetes restart it automatically (see the sketch after this list). We shouldn't be relying on a human to do it.
6. Our Compose/RabbitMQ/queue-consumer setup is used for critical components of our customers' systems, and it does not reach the levels of reliability we need.
7. We do not have enough visibility into what is happening inside these services. We should be able to identify what events were triggered when.
8. Being able to replay past events (with payloads) would be very useful in situations like this (see the second sketch after this list).
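For lesson 5, one possible shape of the fix (a sketch under assumptions, not how our apps are currently built) is a small health endpoint inside the consumer process that reports whether the broker connection is still open; a Kubernetes livenessProbe pointed at it (e.g. an httpGet probe on port 8080) would then restart the pod after repeated failures. Python standard library only; `connection` stands in for the consumer's pika connection:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    # Set by the consumer at startup; stands in for the real pika connection.
    connection = None

    def do_GET(self):
        # Return 200 only while the broker connection is open. A
        # Kubernetes livenessProbe polling this path will restart the
        # pod once it fails repeatedly, with no human in the loop.
        healthy = self.connection is not None and self.connection.is_open
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(b"ok" if healthy else b"disconnected")

def serve_health(port=8080):
    # Run the health server on a daemon thread beside the consumer loop.
    server = HTTPServer(("", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
```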
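For lesson 8, a purely illustrative sketch of what replayability could look like: append every event payload to durable storage before processing it, then replay from that archive after an incident. SQLite, the schema, and the assumption of JSON payloads are all placeholders:

```python
import json
import sqlite3
import time

# Append-only archive of every consumed event; file name and schema
# are illustrative.
db = sqlite3.connect("event_archive.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS events ("
    " id INTEGER PRIMARY KEY, received_at REAL, payload TEXT)"
)

def archive_event(body: bytes):
    # Record the event before any processing, so a later crash cannot
    # lose it; replaying is then just a SELECT over this table.
    db.execute(
        "INSERT INTO events (received_at, payload) VALUES (?, ?)",
        (time.time(), body.decode("utf-8")),
    )
    db.commit()

def replay_events(since: float, handler):
    # Re-run a handler over archived payloads (assumed here to be
    # JSON), e.g. to regenerate integration jobs after an incident.
    for (payload,) in db.execute(
        "SELECT payload FROM events WHERE received_at >= ? ORDER BY id",
        (since,),
    ):
        handler(json.loads(payload))
```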