Date of incident: 2025-04-23
Date reported internally: 2025-04-23
Date reported to customer: 2025-04-23
Summary of Impact
One of GreenFlux's databases reached maximum utilization, causing most queries to time out. As a result, most essential services could no longer reliably establish or maintain a connection to this database.
Vital platform capabilities (authorization, starting and stopping transactions, processing meter values, real-time payments, CDR calculation, etc.) stopped functioning.
Detailed Timeline of Events
All times are in CEST (UTC+2).
- 16:50: Monitoring triggered a high-priority alert for an excessive number of active messages on the service bus.
- 17:12: First ticket received from a customer.
- 17:18: Internal chat was created. The database on GFX PROD was observed to have reached 100% utilization.
- 17:27: Message posted on status page to inform customers.
- 17:41: A specific service was stopped to relieve the high message load.
- 18:27: Vital platform services (mostly CPO domain) were back to normal operation and the backlog of messages had been processed.
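For reference, the elapsed time between the first alert and each later milestone can be derived directly from the timestamps above. The following is a minimal sketch; the milestone labels are names chosen here for illustration and are not part of the original report.

```python
from datetime import datetime

# Timestamps copied from the timeline above (CEST, all on 2025-04-23).
# The dictionary keys are illustrative labels, not official terms.
MILESTONES = {
    "alert": "16:50",
    "first_ticket": "17:12",
    "internal_chat": "17:18",
    "status_page": "17:27",
    "service_stopped": "17:41",
    "recovered": "18:27",
}


def minutes_since_alert(label: str) -> int:
    """Return whole minutes elapsed between the first alert and a milestone."""
    fmt = "%H:%M"
    start = datetime.strptime(MILESTONES["alert"], fmt)
    end = datetime.strptime(MILESTONES[label], fmt)
    return int((end - start).total_seconds() // 60)


for label in MILESTONES:
    print(f"{label}: +{minutes_since_alert(label)} min")
```

By this calculation, customers were informed via the status page 37 minutes after the first alert, and normal operation was restored after 97 minutes.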
Confirmed Root Cause of the Incident
CDR export attempts (triggered from the EV Portal) caused a query to consume all available resources on the GreenFlux database.
Mitigation and Resolution
The impacted queries were identified and killed manually. Subsequent manual actions supported system restoration.
Lessons Learned for System and Process Improvements
The export service needs improvements to prevent this situation from recurring.
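The report does not specify which improvements are planned. Purely as an illustration of one common way to keep an export from monopolizing a database, the sketch below reads rows in bounded batches (keyset pagination) instead of issuing one unbounded query. The `cdr` table, its columns, the batch size, and the use of SQLite are all assumptions made so the example is runnable; they do not describe the actual export service or production database.

```python
import sqlite3
from typing import Iterator, List, Tuple

BATCH_SIZE = 500  # hypothetical cap on rows read per query


def export_cdrs(
    conn: sqlite3.Connection, batch_size: int = BATCH_SIZE
) -> Iterator[List[Tuple[int, float]]]:
    """Yield rows in bounded batches, paginating on the primary key."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, total_kwh FROM cdr WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Resume after the last id seen, so each query stays small.
        last_id = rows[-1][0]


# Demo against an in-memory database with a hypothetical `cdr` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdr (id INTEGER PRIMARY KEY, total_kwh REAL)")
conn.executemany(
    "INSERT INTO cdr (id, total_kwh) VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1, 1201)],
)
batches = list(export_cdrs(conn))
print([len(b) for b in batches])  # [500, 500, 200]
```

Each query in this sketch touches at most `batch_size` rows, so a large export becomes many short queries rather than one long-running one. Whether this approach fits the real export service depends on details not covered in this report.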
