As an observability system, Logz.io considers login availability a key performance indicator. A community of users relies on its availability to monitor and troubleshoot their systems and ensure their performance and uptime. We share with our users a commitment to deliver stable and reliable services.
To provide a highly available login service, we use a globally distributed database cluster based on synchronous replication. This architecture helps guard against regional failures, so that users in a failed region can still log in to the system through the globally distributed network. We are constantly working to make our login availability as resilient as possible by testing it under different setups and distribution strategies and applying upgrades as needed.
Event Description:
A deadlock in a run-of-the-mill database migration triggered a chain of events which resulted in app unavailability for 45 minutes for some of our customers.
Factors Leading to Event:
Logz.io uses flywaydb as its mechanism for performing database migrations.
A standard procedure, which has been run hundreds of times over the course of Logz.io's history, hit a rare edge-case bug that resulted in a deadlock and a crash of one of the distributed authentication database nodes.
The crashed node’s automatic attempt to rejoin the cluster failed and led to the loss of the entire cluster.
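As a general illustration of how a migration runner can contain a deadlock, the sketch below retries a failed step a bounded number of times and then gives up, so the failure surfaces as an aborted migration rather than a stuck one. It is not the fix applied in this incident and does not reflect the internals of flywaydb or of the database cluster; `DeadlockError`, `run_with_deadlock_retry`, and `make_flaky_step` are hypothetical names, and a real database driver would raise its own error type.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver-specific deadlock error."""


def run_with_deadlock_retry(step, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run step(); retry with exponential backoff if it raises DeadlockError.

    Returns (result, attempts). Re-raises the last DeadlockError once
    max_attempts is exhausted, so the caller can abort the migration.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step(), attempt
        except DeadlockError:
            if attempt == max_attempts:
                raise
            # Back off before retrying: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_step(failures):
    """Build a fake migration step that deadlocks `failures` times, then succeeds."""
    state = {"calls": 0}

    def step():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise DeadlockError("simulated deadlock")
        return "migrated"

    return step
```

For example, `run_with_deadlock_retry(make_flaky_step(2))` succeeds on the third attempt, while a step that keeps deadlocking raises `DeadlockError` after the last attempt. Bounding the retries is a deliberate choice here: the runner fails loudly instead of holding locks indefinitely.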
Impact:
The Logz.io app was unavailable for 45 minutes.
There were no delays in ingestion, and alerts continued to function without interruption.
Chronology of Events (UTC):
Dec 9, 04:36 pm, Team is alerted that login to the App is not available.
Dec 9, 04:45 pm, Notification posted on the status page.
Dec 9, 04:45 pm, Manual cluster restart procedure initiated.
Dec 9, 05:00 pm, Database recovered in all regions but one; status page updated.
Dec 9, 05:10 pm, The restart procedure had to be executed again to include the missing region.
Dec 9, 05:25 pm, Incident over.
Corrective Actions:
Because login is a critical component of our service, Logz.io is treating the handling of user login as a top priority. The actions taken combine short-term measures that enable our current Galera-based solution to hold with a parallel search for a top-tier managed-service alternative.
The immediate-term corrective measures are:
The short-term corrective measures:
Long term:
Offload the management of the globally distributed, in-sync database cluster to a cloud-based solution such as AWS DynamoDB, which would provide a fully managed, multi-region, multi-master database with low latency and a much tighter SLA.
Conclusions:
Logz.io is committed to providing stable service to its customers.
In this case, our login service database failed due to an extreme edge case caused by inconsistent database migration locks. This caused our globally distributed database cluster to fall out of sync and eventually fail. We are taking all necessary measures to address this edge case and, in the process, to reduce the reliance of the distributed elements of the cluster setup on each other.
We deeply apologize for any inconvenience we have caused.