2024-03-04 - OCPI SessionService outage – EV Portal

Summary of Impact

On 2024-03-04 around 15:24 Greenflux began migrating the session service to a different Service plan.

As a result, the number of queued transaction started/updated/stopped messages spiked.

Detailed Timeline of Events

All listed events happened on 2024-03-04. Times refer to CET.

Around 15:24: The Integrations team removed the session service from the current Service plan and triggered its deployment to the new Service plan.

15:48: First alerts were triggered.

15:56: Investigation shows that the issue is related to one OCPI subscription.

17:07: The number of queued messages starts decreasing, and the service scales out from 1 instance to 10 instances to speed up processing.

17:47: All queued messages have been processed; the system is back to its normal operating state and scales back to 4 instances.
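
For reference, the durations implied by the timeline can be worked out with a short script. This is only an illustrative sketch: the event labels (migration_started, first_alert, and so on) are informal names chosen here, not terms from any monitoring tool, and the start time is approximate because the timeline says "around 15:24".

```python
from datetime import datetime

FMT = "%H:%M"

# Timestamps copied from the timeline (CET, same day).
# The keys are informal labels introduced for this sketch.
events = {
    "migration_started": "15:24",  # approximate ("around 15:24")
    "first_alert": "15:48",
    "cause_narrowed": "15:56",
    "queue_draining": "17:07",
    "recovered": "17:47",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between(events["migration_started"], events["first_alert"])
time_to_recover = minutes_between(events["migration_started"], events["recovered"])
drain_time = minutes_between(events["queue_draining"], events["recovered"])

print(f"Start of migration to first alert: {time_to_detect} min")
print(f"Start of migration to recovery:    {time_to_recover} min")
print(f"Queue drain after scale-out:       {drain_time} min")
```

Under these assumptions, roughly 24 minutes passed before the first alert, the queue took about 40 minutes to drain once it started decreasing, and the total time from the start of the migration to recovery was about 143 minutes.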

Confirmed Root Cause of the Incident

The migration of the session service is the root cause. The issue occurred in the release pipeline.

The same migration did not cause any issue when it was performed in the test environment.

Mitigation and Resolution

The session service function app was created manually through the Azure portal with system-assigned identity enabled, which bypassed the ARM error, and the deployment was then run again.

Lessons Learned for System and Process Improvements

Even a technical release that does not involve a feature change needs to be communicated.

Use a different pipeline instead of the classic release pipeline.