Summary of Impact
On 2024-02-09 around 01:00 CET, the processing of OCPI push messages became severely delayed. All customers subscribed to OCPI sessions received their messages late.
1. Around 01:45: Sending of OCPI push messages slowed down, so messages were produced at a higher rate than the OCPI Publisher could process them, leading to a linear increase in queued messages.
2. 08:11: A customer reported the issue. GFX alarms had picked up the problem but were not yet monitored outside business hours.
3. 08:24: Issue identified as being caused by a slow receiver.
4. 09:13: Workaround implemented.
5. 09:35: All queued messages processed, system back to normal operating state.
Lessons Learned for System and Process Improvements:
Current system design is critically vulnerable to delayed message processing. This has happened before (2023-07-05).
Permanent improvement of the system design was discussed with the Integrations team.
Quick fix: adding time-to-live (TTL) to OCPI publish messages that have only limited time value (e.g. session updates/patch session).
Discussed with the Integrations team that another level of escalation alerting is needed.
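
The TTL quick fix can be illustrated with a minimal in-memory sketch. It is not the actual OCPI Publisher code: the names PushMessage, TtlQueue, ttl_seconds and the message kinds are hypothetical, and a real deployment would more likely rely on the TTL support of whatever message broker is in use. The idea is that messages with limited time value are dropped once they expire, so a slow receiver cannot build up an unbounded backlog of stale updates.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PushMessage:
    """Hypothetical push message; field names are assumptions for illustration."""
    kind: str                      # e.g. "session_patch" (assumed label)
    payload: dict
    ttl_seconds: Optional[float]   # None means the message never expires
    created_at: float = field(default_factory=time.monotonic)

    def is_expired(self, now: float) -> bool:
        # A message expires once its age exceeds its TTL.
        return self.ttl_seconds is not None and now - self.created_at > self.ttl_seconds


class TtlQueue:
    """FIFO queue that discards expired messages when they reach the head."""

    def __init__(self) -> None:
        self._items: deque = deque()
        self.dropped = 0  # number of messages discarded because their TTL elapsed

    def publish(self, message: PushMessage) -> None:
        self._items.append(message)

    def next_message(self, now: Optional[float] = None) -> Optional[PushMessage]:
        # Skip expired messages and return the first one that is still valid.
        now = time.monotonic() if now is None else now
        while self._items:
            message = self._items.popleft()
            if message.is_expired(now):
                self.dropped += 1
                continue
            return message
        return None
```

In this sketch a session update would be published with a short TTL (for example 60 seconds), while messages that must always be delivered would use ttl_seconds=None. Tracking the dropped counter gives an additional signal that could feed the alerting discussed above.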
