December 3rd, 2019
Customers in the Frankfurt region could not use the service for a significant amount of time.
Event Description: An entire cluster became non-operational, and once it was recovered, Kibana remained unavailable due to a human error.
Factors Leading to Event: Due to external network layer issues, one of our production Elasticsearch clusters stopped working entirely.
A bit later, a human error (see details below) further exacerbated the problem by causing data corruption in our database.
The data corruption prevented Kibana from restarting once the cluster was recovered.
All times are in UTC.
Dec 3rd, 3:10 pm: One of our clusters experienced network issues, causing node-quorum to be lost and data to become unavailable. This affected a small portion of our customers in that region who were running on that environment.
Dec 3rd, 3:15 pm: The Production team worked to recover the cluster; client nodes were restarted, but network errors were still very frequent.
Dec 3rd, 3:23 pm: Network layer errors became less frequent.
Dec 3rd, 3:52 pm: On a completely unrelated track, a developer mistakenly ran the platform’s integration tests against production data. These tests included a migration that failed and corrupted the production database schema. The effect of this corruption would surface later, when trying to recover the damaged cluster.
Dec 3rd, 4:28 pm: Node-quorum regained. Affected cluster started recovery.
Dec 3rd, 4:35 pm: At this point, it was already evident that the Kibana application was not available, and the Production team directed its efforts toward this problem.
Dec 3rd, 4:40 pm: Log ingestion returned to normal and lag decreased gradually, but Kibana was still unavailable.
Dec 3rd, 5:10 pm: The root cause was identified as a failed migration in the database schema, which prevented the services that had been restarted due to the cluster failure from starting up.
Dec 3rd, 5:17 pm: The data corruption in the database was fixed.
Dec 3rd, 5:29 pm: The Kibana-related services managed to start and Kibana became available again.
Dec 3rd, 5:30 pm: The cluster became healthy and the latency was gone.
Logz.io is committed to providing a stable service to its customers and aims to automate solutions to any issue.
The network layer issues exposed a previously unknown weakness in our system and caused unfortunate damage to our customers, while a simultaneous human error caused additional damage. We are going to invest significant resources to make sure these do not occur again.
We are committed to creating a solution that will minimize the damage of future network incidents and secure our database from human error.
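One common way to guard against the kind of human error described above is to make the test harness refuse to run against anything that looks like production. The sketch below illustrates the idea; the environment variable names (TEST_DB_HOST, APP_ENV) and the marker strings are hypothetical, not Logz.io's actual configuration.

```python
import os

# Hypothetical markers identifying production-like targets.
PRODUCTION_MARKERS = ("prod", "production")


def assert_safe_test_target(db_host: str, app_env: str) -> None:
    """Raise if the integration-test suite is pointed at a production-like target."""
    env = app_env.strip().lower()
    host = db_host.strip().lower()
    if env in PRODUCTION_MARKERS or any(m in host for m in PRODUCTION_MARKERS):
        raise RuntimeError(
            f"Refusing to run integration tests against {db_host!r} (env={app_env!r})"
        )


def run_integration_tests() -> None:
    # Guard runs before any migrations or tests touch the database.
    assert_safe_test_target(
        db_host=os.environ.get("TEST_DB_HOST", ""),
        app_env=os.environ.get("APP_ENV", "dev"),
    )
    # ... the actual test run would start here ...
```

A check like this turns "a developer mistakenly ran tests against production" from a data-corruption incident into a single failed command.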