Some customers in the Frankfurt region might experience failures loading the application
Incident Report for

Executive Summary

On Thursday, December 19th, customers in the Frankfurt region experienced sporadic unavailability of the user interface over a 45-minute period. Customers in other regions experienced unavailability for up to 10 minutes. Throughout the incident, log ingestion, alerting, and APIs continued to operate without issues.

As part of an ongoing engineering project to decentralize API management across our services, we moved our session management component to a different service in order to support future refactoring efforts.
On Wednesday, December 18th at 18:52 UTC the new change was deployed to production.

Unfortunately, that deployment also carried a bug: the function that calculates whether a session should be cached was passed its parameters in the wrong order, producing a negative cache time. Effectively, this forced the session management component to access the database on every request, driving database load to the verge of unavailability.
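The report does not include the actual code, but this class of bug is easy to picture. The following is a hypothetical sketch (the function and variable names are assumptions, not the production code) of how swapping two arguments yields a negative cache time that silently disables caching:

```python
# Hypothetical illustration (not the actual production code): a cache-TTL
# helper whose arguments are swapped at one call site, yielding a negative TTL.

def cache_ttl_seconds(expires_at: int, now: int) -> int:
    """Return how long (in seconds) a session may stay cached."""
    return expires_at - now

now = 1_576_800_000          # current Unix time (illustrative value)
expires_at = now + 3600      # session valid for one more hour

correct = cache_ttl_seconds(expires_at, now)   # 3600: cache for an hour
swapped = cache_ttl_seconds(now, expires_at)   # -3600: arguments reversed

# A negative TTL means "already expired", so the session is never cached
# and every request falls through to the database.
cache_session = swapped > 0
```

Because a negative TTL is still a perfectly valid integer, no exception is raised; the system degrades quietly under load rather than failing loudly at deploy time.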

Since the deployment occurred while the system was under low usage, the issue went unnoticed until the EU morning.
After correlating the issue with the deployment, we immediately reverted the new code.
Unfortunately, the revert revealed a second bug. When we make changes like this one, we always put backward compatibility in place so that we can revert in case of an issue. While code had indeed been added to support the revert, it contained a logical bug that caused all requests to be denied until a full revert was completed.
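The report does not show the faulty compatibility code, but a hypothetical sketch (all names are assumptions) shows how a single inverted condition in a backward-compatibility path can deny every request:

```python
# Hypothetical illustration (not the actual code): a backward-compatibility
# shim meant to accept sessions from both the old and the new service, with
# an inverted condition that instead rejects every valid request.

def is_valid_session(session: dict) -> bool:
    # Accept a session issued by either the old or the new service.
    return session.get("source") in ("legacy", "refactored")

def handle_request_buggy(session: dict) -> str:
    # Bug: the condition is inverted, so *valid* sessions are denied.
    if is_valid_session(session):
        return "denied"
    return "ok"

def handle_request_fixed(session: dict) -> str:
    if is_valid_session(session):
        return "ok"
    return "denied"
```

The lesson is that rollback paths are code too: if they are never exercised before an emergency, a bug like this surfaces at the worst possible moment.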

Corrective actions:

  • Add an integration test that checks the validity of the most critical caches
  • Rehearse rollbacks of sensitive deployments such as this one
  • Educate all engineers about this outage and the programmatic lessons learned
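The first corrective action could take many forms; one minimal sketch (the names and TTL values are assumptions, not the actual test suite) is a test asserting that every critical cache has a positive TTL, so a swapped-argument bug producing a negative TTL fails in CI rather than in production:

```python
# Hypothetical sketch of an integration test guarding the most critical
# caches: any non-positive TTL would disable caching entirely, so it must
# fail the build.

CRITICAL_CACHE_TTLS = {
    "session": 3600,
    "user_profile": 600,
    "api_token": 300,
}

def invalid_cache_ttls(ttls: dict) -> dict:
    """Return the cache entries whose TTL would prevent them from being stored."""
    return {name: ttl for name, ttl in ttls.items() if ttl <= 0}

def test_all_critical_cache_ttls_are_positive():
    bad = invalid_cache_ttls(CRITICAL_CACHE_TTLS)
    assert not bad, f"caches that would never be stored: {bad}"
```

In a real suite the TTLs would be read from the deployed configuration rather than hard-coded, so the test exercises the same values production uses.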

Chain of events, all times in UTC:

Dec 19th, 7:39 am - The on-call production engineer gets alerts on high response time in the application load balancer
Dec 19th, 7:45 am - The production engineer recognizes a steep increase that started on Dec 18th at 17:00.
Dec 19th, 7:49 am - The on-call application engineer is called to assist.
Dec 19th, 8:52 am - The status page is updated to reflect slow user interface performance for some EU customers.
Dec 19th, 8:58 am - The production engineer tries to restart related components.
Dec 19th, 9:10 am - The on-call backend engineer is called to assist. The DB and relevant micro-services are tested and found valid.
Dec 19th, 9:15 am - We realize the problem also occurs in the US on a smaller scale.
Dec 19th, 10:00 am - The user interface becomes almost completely unavailable in the EU.
Dec 19th, 10:24 am - We detect a suspicious deployment from the time the degradation started and revert it.
Dec 19th, 10:33 am - Due to unhandled dependencies in the revert process, the application becomes unavailable in the US.
Dec 19th, 10:45 am - The revert of the suspicious deployment completes and the issue is resolved. The status page notice is removed.

We remain highly committed to the SLA of our service, and this incident is no different: we will take every measure to make sure this does not happen again, and that everyone has learned, and will continue to learn, from this event and its corrective actions.

Roi Rav-Hon, Core Team Leader,

Posted Dec 22, 2019 - 15:02 IST

This incident has been resolved.
Posted Dec 19, 2019 - 12:47 IST
We are currently investigating a degradation in the presentation layer (Kibana) that may affect customers with accounts in multiple regions. Logs are being processed correctly. We will update as soon as possible.
Posted Dec 19, 2019 - 12:00 IST
This incident affected: User Interface.