As an observability platform, Logz.io considers login availability a key performance indicator. A community of users relies on the service to monitor and troubleshoot their systems and to ensure their performance and uptime, and we share that commitment to delivering stable, reliable service.
To provide a highly available login service, we use a globally distributed database cluster based on synchronous replication. This architecture guards against local regional failures: if one region goes down, its users can still log in through the remaining nodes of the globally distributed cluster.
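To make the trade-off concrete, here is a deliberately simplified sketch of how a synchronously replicated commit behaves (illustration only, not our actual implementation; it also ignores quorum handling, which comes up below): a write is acknowledged only after every region certifies it, which is what makes a severed region so disruptive.

```python
# Conceptual sketch of synchronous multi-region replication.
# Region names and behavior are simplified assumptions, not Logz.io's code.

REGIONS = ["virginia", "frankfurt", "sydney"]  # hypothetical topology


class RegionUnreachable(Exception):
    """Raised when a region cannot acknowledge a write in time."""


def replicate(txn: dict, region: str, reachable: set) -> None:
    # In a real cluster this ships the write set over the network and
    # waits for the remote nodes to certify it before returning.
    if region not in reachable:
        raise RegionUnreachable(region)


def commit(txn: dict, reachable: set) -> None:
    # Synchronous replication: the commit succeeds only after every
    # region acknowledges, so a severed region disrupts all writes.
    for region in REGIONS:
        replicate(txn, region, reachable)


commit({"user": "alice"}, reachable=set(REGIONS))  # succeeds
try:
    commit({"user": "bob"}, reachable={"virginia", "frankfurt"})
except RegionUnreachable as exc:
    print(f"commit blocked: {exc} is unreachable")  # the severed region
```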
We are constantly working to make our login availability as resilient as possible, testing it under different setups and distribution strategies and applying upgrades as needed. Despite our best efforts, on July 29th network issues prevented Logz.io users from logging into their accounts. Below is a record of what happened:
Chronology of Events:
At 05:23 UTC, a connection severed by networking issues caused the Sydney nodes to fall out of sync with the rest of the globally distributed database cluster.
Due to a bug in that version of the cluster software, the cluster got stuck in a recovery loop it could never complete: the Sydney nodes kept timing out, and the timeouts cascaded until every node in the cluster was doing the same.
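For context, Galera-style clusters decide whether a partition may keep serving traffic by checking for a quorum: the side holding a strict majority of nodes remains the "Primary" component, while the minority side refuses writes. A minimal sketch of that rule (the node counts are illustrative assumptions, not our real topology):

```python
# Minimal sketch of Galera-style quorum ("primary component") logic.
# Node counts are illustrative assumptions, not our real topology.

def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    # A partition may keep serving only with a strict majority of nodes.
    return reachable_nodes > total_nodes / 2

TOTAL = 9  # e.g. three nodes in each of three regions (hypothetical)

# Sydney's three nodes severed from the other six:
print(has_quorum(6, TOTAL))  # True  -> the majority side should stay Primary
print(has_quorum(3, TOTAL))  # False -> the Sydney side becomes non-Primary

# In this incident, the software bug kept the majority side from ever
# settling back into the Primary state, so no one could serve logins.
```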
At 05:24 UTC, the production team was alerted to login unavailability. We saw all of the cluster nodes repeatedly disconnecting from and reconnecting to each other without fully recovering, never reaching the state required for the cluster to operate.
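A health probe for this failure mode can be as simple as polling the cluster's wsrep status variables. Below is a sketch using the pymysql driver; the host, credentials, and alert hook are placeholders, not our actual monitoring stack:

```python
# Sketch of a Galera health probe built on real wsrep status variables.
# Host and credentials are placeholders; the alert hook is hypothetical.
import pymysql

def galera_healthy(host: str) -> bool:
    conn = pymysql.connect(host=host, user="monitor", password="...")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW STATUS LIKE 'wsrep_cluster_status'")
            cluster_status = cur.fetchone()[1]  # healthy value: 'Primary'
            cur.execute("SHOW STATUS LIKE 'wsrep_local_state_comment'")
            local_state = cur.fetchone()[1]     # healthy value: 'Synced'
        return cluster_status == "Primary" and local_state == "Synced"
    finally:
        conn.close()

if not galera_healthy("login-db.internal.example"):
    print("ALERT: login database cluster is degraded")  # page the on-call
```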
At 05:35 UTC, under the load of the sync crash loop and the constant disconnections, all of the nodes were killed by the out-of-memory (OOM) killer and entered a failed state, requiring manual intervention from the team to forcibly restart the cluster as a new one.
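For readers unfamiliar with that step: in a MariaDB/Galera deployment, forcing a new cluster typically means marking the most advanced node as safe to bootstrap and starting it as the first node of a fresh cluster, after which the other nodes rejoin via state transfer. A rough sketch of that flow, assuming a stock systemd MariaDB install (paths and commands may differ per distribution):

```python
# Rough sketch of the manual "start a new cluster" recovery flow on a
# systemd MariaDB/Galera node. Run only on the node with the most
# advanced replication position; paths assume a stock install.
import subprocess

GRASTATE = "/var/lib/mysql/grastate.dat"

def mark_safe_to_bootstrap() -> None:
    # Galera (>= 3.19) refuses to bootstrap from a node whose grastate.dat
    # does not carry safe_to_bootstrap: 1, to protect against data loss.
    with open(GRASTATE) as f:
        lines = f.readlines()
    with open(GRASTATE, "w") as f:
        for line in lines:
            if line.startswith("safe_to_bootstrap"):
                line = "safe_to_bootstrap: 1\n"
            f.write(line)

def bootstrap_new_cluster() -> None:
    # galera_new_cluster starts this node as the first node of a brand-new
    # cluster; the remaining nodes are then started normally and rejoin
    # via state transfer.
    subprocess.run(["galera_new_cluster"], check=True)

mark_safe_to_bootstrap()
bootstrap_new_cluster()
```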
At 06:00 UTC, the cluster was recovered and login was restored, starting with the Virginia region, followed by Frankfurt and Sydney.
During all this time, users who were already logged in could continue working normally; all alerting and data ingestion operated as usual, and no data was lost.
Corrective Actions:
Because user login is a critical component of our service, Logz.io is treating it as a top priority. The actions taken combine short-term measures to keep our current Galera-based solution stable while we evaluate a top-tier managed-service alternative.
The short-term corrective measures:
Long term:
Offload the management of our globally distributed, synchronously replicated database cluster to a cloud-based solution such as AWS DynamoDB, which provides a fully managed, multi-region, multi-master database with low latency and a much tighter SLA; see the sketch below.
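As a rough sense of what that could look like, here is a sketch of reading and writing login records against a DynamoDB global table via boto3. The table name, key schema, and regions are hypothetical; cross-region replication is handled entirely by the managed service:

```python
# Sketch of login-record access against a DynamoDB global table via boto3.
# Table name, key schema, and regions are hypothetical; cross-region
# replication is handled entirely by the managed service.
import boto3

def table(region: str):
    return boto3.resource("dynamodb", region_name=region).Table("user_logins")

def record_login(region: str, user_id: str, session_token: str) -> None:
    table(region).put_item(Item={"user_id": user_id, "session_token": session_token})

def lookup_login(region: str, user_id: str):
    # Each region reads from its local replica, so a failure elsewhere
    # does not block logins here; cross-region propagation is typically
    # sub-second but not instantaneous.
    return table(region).get_item(Key={"user_id": user_id}).get("Item")

record_login("us-east-1", "user-123", "token-abc")
print(lookup_login("eu-central-1", "user-123"))
```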
Conclusions:
Logz.io is committed to providing stable service to its customers.
In this case, our login service database failed due to an extreme edge case triggered by networking issues, which caused our globally distributed database cluster to fall out of sync and eventually fail. We are taking all the necessary measures to handle this edge case and, in the process, to reduce how much the distributed elements of the cluster setup depend on one another.
We deeply apologize for any inconvenience we have caused.