Root Cause Analysis: Log Data and Kibana Unavailability in EU Region

Impact

Customers in the Frankfurt region could not use the service for a significant amount of time.

Executive Summary

Event Description: An entire cluster became non-operational, and after it was recovered, Kibana remained unavailable because of a human error.

Factors Leading to Event: Due to external network layer issues, one of our production Elasticsearch clusters stopped working entirely.
A bit later, a human error (see details below) further exacerbated the problem by causing data corruption in our database.
The data corruption prevented Kibana from restarting once the cluster was recovered.

Chain of events

All times are in UTC

Dec 3rd, 3:10 pm: One of our clusters experienced network issues, causing node quorum to be lost and data to become unavailable. This affected a small portion of our customers in that region who were running on that environment.

Dec 3rd, 3:15 pm: The Production team worked to recover the cluster. Client nodes were restarted, but network errors were still very frequent.

Dec 3rd, 3:23 pm: Network layer errors became less frequent.

Dec 3rd, 3:52 pm: Shortly after that, on a totally unrelated track, a developer mistakenly ran the platform’s integration tests against production data. These tests included a failed database migration, which affected production. The effect of this would surface later, when trying to recover the damaged cluster.

Dec 3rd, 4:28 pm: Node-quorum regained. Affected cluster started recovery.

Dec 3rd, 4:35 pm: At this point, it was already evident that the Kibana application was not available, and the Production team directed its efforts towards this problem.

Dec 3rd, 4:40 pm: Log ingestion back to normal. Lag decreasing gradually. Kibana is unavailable.

Dec 3rd, 5:10 pm: The root cause was identified as a failed migration in the database schema, which prevented the services that had been restarted due to the cluster failure from starting up.

Dec 3rd, 5:17 pm: The data corruption in the database was fixed.

Dec 3rd, 5:29 pm: The Kibana-related services managed to start and Kibana became available again.

Dec 3rd, 5:30 pm: The cluster became healthy and the ingestion lag was gone.

Corrective Actions

Immediate Term:

Recover the cluster and handle the latencies
Fix the data corruption and recover the failed service

Short Term:

Work with AWS to investigate these networking issues
Automate a solution for fast recovery after quorum loss
Provide better visibility into services failing to restart
Disconnect all ties between staging and Production databases
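
As an illustration of the last item, the following is a minimal sketch of a guard that a test suite could call before running any test or migration. It is not the mechanism Logz.io uses; the function name, the allowlisted hosts, and the database URLs are all hypothetical. The idea is to fail closed: tests may only run against hosts that are explicitly listed as safe.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts that integration tests may touch.
# Anything not listed here is treated as potentially production.
ALLOWED_TEST_HOSTS = {"localhost", "127.0.0.1", "staging-db.internal"}


def assert_safe_test_target(db_url: str) -> None:
    """Raise RuntimeError unless the database host is explicitly allowlisted."""
    host = (urlparse(db_url).hostname or "").lower()
    if host not in ALLOWED_TEST_HOSTS:
        raise RuntimeError(
            f"Refusing to run integration tests against host {host!r}: "
            "not in the test allowlist"
        )


if __name__ == "__main__":
    # Passes silently: the host is on the allowlist.
    assert_safe_test_target("postgresql://user:pw@localhost:5432/app_test")
    # Raises RuntimeError: the host is not on the allowlist.
    try:
        assert_safe_test_target("postgresql://user:pw@db.prod.example.com:5432/app")
    except RuntimeError as err:
        print(err)
```

An allowlist is used here rather than a blocklist of production names, so that a new or renamed production host is rejected by default.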

Long Term:

Improve the isolation between the various service layers of the Production platform

Corrective Actions

Logz.io is committed to providing a stable service to its customers and aims to automate solutions to any issue.
The network layer issues exposed a new weakness in our system and caused unfortunate damage to our customers. A human error caused another incident at the same time and did additional damage. We’re going to invest significant resources to make sure these do not occur again.
We are committed to creating a solution that will minimize the damage in future network incidents and secure our database from human errors.

Posted Dec 05, 2019 - 15:01 IST

Resolved

This incident has been resolved.

Posted Dec 03, 2019 - 19:36 IST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Dec 03, 2019 - 19:02 IST

Identified

The issue has been identified and a fix is being implemented.

Posted Dec 03, 2019 - 18:29 IST

Investigating

We are currently investigating this issue.

Posted Dec 03, 2019 - 17:19 IST