Summary of Impact
On 2024-02-02 at around 11:50 am, our core system stopped processing messages because the CPO service ran out of memory. As a result, most GreenFlux CPO functionality stopped working.
Detailed Timeline of Events
All listed events happened between 2024-02-02 and 2024-02-05. Times refer to CET.
- 2024-02-02: 11:48 - Errors started on GreenFlux production.
- 2024-02-02: 12:05 - Internal investigation started.
- 2024-02-02: 12:21 - Errors stopped on GreenFlux production.
- 2024-02-05: 09:30 - Errors started on GreenFlux production.
- 2024-02-05: 10:40 - Errors stopped on GreenFlux production.
- 2024-02-05: 17:03 - Root cause identified: chargers with too many auth list items were causing out-of-memory exceptions in the CPO service.
- 2024-02-05: 18:52 - Hotfix deployed to reduce memory usage.
Mitigation and Resolution
- We deployed a hotfix that optimizes a query so that the CPO service uses less memory for chargers with a very large number of auth list items.
- We removed the auth list items added for the locations and chargers involved.
- We will cap the number of auth list items that a location or charger can have at 1000.
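
A cap like this could be enforced at the point where auth list items are added to a location or charger. The following is a minimal sketch in Python, not the actual CPO service code: the function name `add_auth_list_items`, the exception `AuthListLimitExceeded`, and the choice to reject the whole update rather than truncate it are all assumptions made for illustration. Only the limit of 1000 comes from this report.

```python
MAX_AUTH_LIST_ITEMS = 1000  # limit stated in the mitigation list


class AuthListLimitExceeded(Exception):
    """Hypothetical error raised when an update would exceed the cap."""


def add_auth_list_items(existing, new_items, limit=MAX_AUTH_LIST_ITEMS):
    """Return the merged auth list, or raise if it would exceed the limit.

    Items already present are skipped before counting, so re-sending
    known items does not count against the cap. The input lists are
    not modified.
    """
    merged = list(existing)
    seen = set(existing)
    for item in new_items:
        if item not in seen:
            seen.add(item)
            merged.append(item)
    if len(merged) > limit:
        raise AuthListLimitExceeded(
            f"auth list would hold {len(merged)} items; limit is {limit}"
        )
    return merged
```

Rejecting the update before anything is stored keeps an oversized list from ever reaching the code path that loads it into memory. Truncating the list instead would also bound memory use, but it would silently drop authorizations.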
