As an observability platform, Logz.io considers login availability a key performance indicator. A community of users relies on the service to monitor and troubleshoot their systems and to ensure their performance and uptime, and we share that commitment to delivering stable, reliable service.
To provide a highly available login service, we use a globally distributed database cluster based on synchronous replication. This architecture guards against local regional failures: if one region goes down, its users can still log in through the remaining nodes of the globally distributed cluster.
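To make the trade-off concrete, here is a deliberately simplified sketch of how a synchronously replicated commit behaves (illustration only, not our actual implementation; it also ignores quorum handling, which comes up below): a write is acknowledged only after every region certifies it, which is what makes a severed region so disruptive.

```python
# Conceptual sketch of synchronous multi-region replication.
# Region names and behavior are simplified assumptions, not Logz.io's code.

REGIONS = ["virginia", "frankfurt", "sydney"]  # hypothetical topology


class RegionUnreachable(Exception):
    """Raised when a region cannot acknowledge a write in time."""


def replicate(txn: dict, region: str, reachable: set) -> None:
    # In a real cluster this ships the write set over the network and
    # waits for the remote nodes to certify it before returning.
    if region not in reachable:
        raise RegionUnreachable(region)


def commit(txn: dict, reachable: set) -> None:
    # Synchronous replication: the commit succeeds only after every
    # region acknowledges, so a severed region disrupts all writes.
    for region in REGIONS:
        replicate(txn, region, reachable)


commit({"user": "alice"}, reachable=set(REGIONS))  # succeeds
try:
    commit({"user": "bob"}, reachable={"virginia", "frankfurt"})
except RegionUnreachable as exc:
    print(f"commit blocked: {exc} is unreachable")  # the severed region
```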
We are constantly working to make our login availability as resilient as possible, testing it under different setups and distribution strategies and applying upgrades as needed. Despite our best efforts, on July 29th network issues prevented Logz.io users from logging into their accounts. Below is a record of what happened:
Chronology of Events:
At 05:23 UTC, a connection severed by networking issues caused the Sydney nodes to fall out of sync with the rest of the globally distributed database cluster.
Due to a bug in that version of the cluster software, the cluster got stuck in a recovery loop it could never complete: the Sydney nodes kept timing out, and the timeouts cascaded until every node in the cluster was doing the same.
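For context, Galera-style clusters decide whether a partition may keep serving traffic by checking for a quorum: the side holding a strict majority of nodes remains the "Primary" component, while the minority side refuses writes. A minimal sketch of that rule (the node counts are illustrative assumptions, not our real topology):

```python
# Minimal sketch of Galera-style quorum ("primary component") logic.
# Node counts are illustrative assumptions, not our real topology.

def has_quorum(reachable_nodes: int, total_nodes: int) -> bool:
    # A partition may keep serving only with a strict majority of nodes.
    return reachable_nodes > total_nodes / 2

TOTAL = 9  # e.g. three nodes in each of three regions (hypothetical)

# Sydney's three nodes severed from the other six:
print(has_quorum(6, TOTAL))  # True  -> the majority side should stay Primary
print(has_quorum(3, TOTAL))  # False -> the Sydney side becomes non-Primary

# In this incident, the software bug kept the majority side from ever
# settling back into the Primary state, so no one could serve logins.
```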
At 05:24 UTC, the production team was alerted to login unavailability. We saw all of the cluster nodes repeatedly disconnecting from and reconnecting to each other without fully recovering, never reaching the state required for the cluster to operate.
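A health probe for this failure mode can be as simple as polling the cluster's wsrep status variables. Below is a sketch using the pymysql driver; the host, credentials, and alert hook are placeholders, not our actual monitoring stack:

```python
# Sketch of a Galera health probe built on real wsrep status variables.
# Host and credentials are placeholders; the alert hook is hypothetical.
import pymysql

def galera_healthy(host: str) -> bool:
    conn = pymysql.connect(host=host, user="monitor", password="...")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW STATUS LIKE 'wsrep_cluster_status'")
            cluster_status = cur.fetchone()[1]  # healthy value: 'Primary'
            cur.execute("SHOW STATUS LIKE 'wsrep_local_state_comment'")
            local_state = cur.fetchone()[1]     # healthy value: 'Synced'
        return cluster_status == "Primary" and local_state == "Synced"
    finally:
        conn.close()

if not galera_healthy("login-db.internal.example"):
    print("ALERT: login database cluster is degraded")  # page the on-call
```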
At 05:35 UTC, under the load of the sync crash loop and the constant disconnections, all of the nodes were killed by the out-of-memory (OOM) killer and entered a failed state, requiring manual intervention from the team to forcibly restart the cluster as a new one.
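For readers unfamiliar with that step: in a MariaDB/Galera deployment, forcing a new cluster typically means marking the most advanced node as safe to bootstrap and starting it as the first node of a fresh cluster, after which the other nodes rejoin via state transfer. A rough sketch of that flow, assuming a stock systemd MariaDB install (paths and commands may differ per distribution):

```python
# Rough sketch of the manual "start a new cluster" recovery flow on a
# systemd MariaDB/Galera node. Run only on the node with the most
# advanced replication position; paths assume a stock install.
import subprocess

GRASTATE = "/var/lib/mysql/grastate.dat"

def mark_safe_to_bootstrap() -> None:
    # Galera (>= 3.19) refuses to bootstrap from a node whose grastate.dat
    # does not carry safe_to_bootstrap: 1, to protect against data loss.
    with open(GRASTATE) as f:
        lines = f.readlines()
    with open(GRASTATE, "w") as f:
        for line in lines:
            if line.startswith("safe_to_bootstrap"):
                line = "safe_to_bootstrap: 1\n"
            f.write(line)

def bootstrap_new_cluster() -> None:
    # galera_new_cluster starts this node as the first node of a brand-new
    # cluster; the remaining nodes are then started normally and rejoin
    # via state transfer.
    subprocess.run(["galera_new_cluster"], check=True)

mark_safe_to_bootstrap()
bootstrap_new_cluster()
```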
At 06:00 UTC, the cluster was recovered and login was restored, starting with the Virginia region, followed by Frankfurt and Sydney.
During all this time, users who were already logged in could continue working normally; all alerting and data ingestion operated as usual, and no data was lost.
Corrective Actions:
Because user login is a critical component of our service, Logz.io is treating it as a top priority. The actions taken combine short-term measures to keep our current Galera-based solution stable while we evaluate a top-tier managed-service alternative.
The short-term corrective measures:
Long term:
Offload the management of our globally distributed, synchronously replicated database cluster to a cloud-based solution such as AWS DynamoDB, which provides a fully managed, multi-region, multi-master database with low latency and a much tighter SLA; see the sketch below.
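As a rough sense of what that could look like, here is a sketch of reading and writing login records against a DynamoDB global table via boto3. The table name, key schema, and regions are hypothetical; cross-region replication is handled entirely by the managed service:

```python
# Sketch of login-record access against a DynamoDB global table via boto3.
# Table name, key schema, and regions are hypothetical; cross-region
# replication is handled entirely by the managed service.
import boto3

def table(region: str):
    return boto3.resource("dynamodb", region_name=region).Table("user_logins")

def record_login(region: str, user_id: str, session_token: str) -> None:
    table(region).put_item(Item={"user_id": user_id, "session_token": session_token})

def lookup_login(region: str, user_id: str):
    # Each region reads from its local replica, so a failure elsewhere
    # does not block logins here; cross-region propagation is typically
    # sub-second but not instantaneous.
    return table(region).get_item(Key={"user_id": user_id}).get("Item")

record_login("us-east-1", "user-123", "token-abc")
print(lookup_login("eu-central-1", "user-123"))
```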
Conclusions:
Logz.io is committed to providing stable service to its customers.
In this case, our login service database failed due to an extreme edge case triggered by networking issues, which caused our globally distributed database cluster to fall out of sync and eventually fail. We are taking all the necessary measures to handle this edge case and, in the process, to reduce how much the distributed elements of the cluster setup depend on one another.
We deeply apologize for any inconvenience we have caused.