Logz.io web application is experiencing elevated error rates

Incident Report for logz.io

Postmortem

App Unavailability - Root Cause Analysis

As an observability platform, Logz.io considers login availability a key performance indicator. Our users rely on it to monitor and troubleshoot their systems and to ensure their performance and uptime. We share our users' commitment to stable and reliable services.

In order to provide a highly available login service, we use a globally distributed database cluster based on synchronous replication. This architecture guards against regional failures, so that users in a failed region can still log in through the rest of the globally distributed network. We are constantly working to make our login availability as resilient as possible by testing it under different setups and distribution strategies and applying upgrades as needed.

Event Description:
A deadlock in a run-of-the-mill database migration triggered a chain of events which resulted in app unavailability for 45 minutes for some of our customers.

Factors Leading to Event:
Logz.io uses flywaydb as its mechanism for performing database migrations.
A standard procedure that has been run hundreds of times over the course of Logz.io's history hit a rare edge-case bug, which resulted in a deadlock and the crash of one of the distributed authentication database nodes.
The crashed node’s automatic attempt to rejoin the cluster failed and led to the loss of the entire cluster.

Impact:
The Logz.io app was unavailable for 45 minutes.
There were no delays in ingestion, and alerts continued to function without interruption.

Chronology of Events (UTC):
Dec 9, 04:36 pm, Team is alerted that login to the app is not available.

Dec 9, 04:45 pm, Status page updated to notify customers.

Dec 9, 04:45 pm, Manual cluster restart procedure initiated.

Dec 9, 05:00 pm, Database recovered in all regions but one; status page updated.

Dec 9, 05:10 pm, Restart procedure executed again to include the missing region.

Dec 9, 05:25 pm, Incident over.

Corrective Actions:
User login is a critical component of our service, and Logz.io treats it as a top priority. The actions below combine short-term measures that keep our current Galera-based solution stable with a parallel search for a top-tier managed-service alternative.

The immediate-term corrective measures are:

Enforce that deployments which include a database migration are rolled out to a single region at a time
Revise our manual authentication database restart procedure so that it does not require a second restart round
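
As an illustration of the first measure, a region-by-region rollout can be sketched as below. This is a minimal sketch, not our deployment tooling: the function name and the `run_migration` and `is_healthy` hooks are hypothetical placeholders for whatever performs the deployment and the post-migration health check.

```python
from typing import Callable, Iterable, List


def migrate_one_region_at_a_time(
    regions: Iterable[str],
    run_migration: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Run a migration region by region, stopping at the first unhealthy region.

    Returns the regions that were migrated and then passed the health check.
    Both callables are hypothetical hooks supplied by the caller.
    """
    completed: List[str] = []
    for region in regions:
        # Hypothetical hook: deploy and migrate exactly one region.
        run_migration(region)
        # Hypothetical hook: verify the region before touching the next one.
        if not is_healthy(region):
            # Stop so the remaining regions are left untouched.
            break
        completed.append(region)
    return completed
```

The point of the design is that a failed migration affects at most one region, while the regions not yet reached are never migrated.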

The short-term corrective measures are:

Upgrade to the latest version of flywaydb to address lock timeout issues
Disable the auto-rejoin of a crashed node in the authentication cluster
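
The general idea behind a lock timeout is that a caller waits a bounded time for a lock and then fails loudly, rather than blocking forever. The sketch below shows that idea with an in-process Python lock. It is an analogy only and an assumption on our part: it is not how flywaydb or the database cluster implements locking, and `migration_lock` and `MigrationLockTimeout` are hypothetical names.

```python
import threading
from contextlib import contextmanager


class MigrationLockTimeout(Exception):
    """Raised when the lock cannot be acquired within the allowed time."""


@contextmanager
def migration_lock(lock: threading.Lock, timeout_seconds: float):
    """Hold `lock` for the duration of the block, waiting at most `timeout_seconds`."""
    # acquire() returns False if the timeout expires before the lock is free.
    acquired = lock.acquire(timeout=timeout_seconds)
    if not acquired:
        raise MigrationLockTimeout(
            f"could not acquire migration lock within {timeout_seconds} seconds"
        )
    try:
        yield
    finally:
        # Always release, even if the guarded block raises.
        lock.release()
```

With a bounded wait, a migration that cannot obtain its lock aborts with an error that can be retried or investigated, instead of holding the process in a wait that never ends.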

Long term:

Offload the management of the globally distributed, in-sync database cluster to a cloud-based solution such as AWS DynamoDB, which would provide a fully managed, multi-region, multi-master database with low latency and a much tighter SLA

Conclusions:
Logz.io is committed to providing stable service to its customers.

In this case, our login service database failed due to an extreme edge case caused by inconsistent database migration locks. This caused our globally distributed database cluster to fall out of sync and eventually fail. We are taking all necessary measures to address this edge case and, in the process, to reduce how much the distributed elements of the cluster depend on each other.

We deeply apologize for any inconvenience we have caused.

Posted Dec 13, 2020 - 14:04 IST

Resolved

This incident has been resolved.

Posted Dec 09, 2020 - 19:11 IST

Update

We are continuing to investigate this issue.

Posted Dec 09, 2020 - 19:10 IST

Investigating

We are currently investigating this issue.

Posted Dec 09, 2020 - 18:51 IST