Customers in Frankfurt who share a specific cluster experienced issues accessing their data and indexing new data. A very small subset of accounts also experienced login issues.
The outage impacted search, API, alerts, and ingestion of new data into the system (not collection).
While adding resources to improve service, a production engineer accidentally terminated all master nodes of a single cluster in the Frankfurt region. The termination immediately rendered the entire cluster non-operational for both search and indexing. It also caused the remaining (non-master) nodes in the same cluster to accumulate extremely high load.
Because of this high load, the well-rehearsed emergency procedure of launching replacement instances was not as smooth as intended. Restoring the master nodes while the other nodes in the cluster were starved for resources succeeded only sporadically, and the engineers had to go through several iterations of the procedure to return all master nodes to an operational state.
As a result of the above inconsistency, some customer data became corrupted and had to be restored from backup.
As part of our restoration procedure, our priorities were:
1. Restore the cluster to operational mode
2. Restore login for all accounts
3. Restore fresh logs
4. Fill all gaps in the data
All times are in UTC
13:03 - A production engineer accidentally terminated all master nodes of a single cluster
13:04 - An alert triggered, the issues were identified, and a global emergency production team began working on a remediation plan
13:09 - The emergency procedure to re-provision the terminated nodes started; according to this procedure, two effort tracks were launched in parallel:
1. Restore the cluster using its existing resources
2. Rebuild the cluster from scratch and restore all data, in case Option 1 failed
13:09 - Replacement machines were in a provisioning process
13:17 - Two new nodes completed provisioning (the minimum required to form a cluster)
13:31 - Indexing was suspended so as not to overload the newly formed cluster
13:36 - After repeated out-of-memory (OOM) failures, we decided to quadruple the size of all master nodes
13:47 - Cluster formed again
13:56 - After the cluster crashed due to OOM again, we discovered that Puppet (the configuration management system we use) had a configuration issue that prevented it from running after the instance size increase; as a result, the JVM options were never updated to take advantage of the larger machines
14:05 - Option 1 succeeded
14:09 - New cluster stability verified; we tried to resume indexing
14:14 - We identified issues in the Elasticsearch logs and worked to resolve them
14:19 - We noticed a significant load on the cluster and decided to stop indexing again
14:20 - We started deploying cluster settings to support easier indexing
14:32 - Cluster crashed again due to OOM
14:47 - Cluster formed again; we continued with settings to ease the restore
15:06 - We noticed that a small number of accounts were missing their Kibana index (where account visualizations and dashboards are stored); this prevented those accounts from logging in
15:22 - Deployed configuration to accelerate Elasticsearch backup recovery
15:24 - New cluster stability verified; the settings changes to ease the restore (started at 14:20) finished applying, and the cluster was stable
15:25 - Started to apply the first stage in the data-recovery procedure for all accounts
15:37 - Backup configuration (started at 15:22) deployed; all account Kibana indices started to restore
15:40 - Slowly started to resume indexing on the cluster while closely watching the metrics
15:43 - Cluster responded well; we increased the indexing rate
15:46 - Kibana index restore completed and the application became available, allowing all customers to log in
15:49 - Prepared the restoration of missing data for all accounts, while keeping all resources focused on indexing recent data
17:18 - All data gaps until 00:00 UTC of the same day were restored
17:50 - We identified a small subset of accounts with a configuration issue preventing them from seeing logs, which we fixed immediately
18:39 - All customers had recent logs from 13:09 UTC (the beginning of the incident) onward
18:40 - Started restoring the missing data (00:00 UTC - 13:09 UTC) for a subset of accounts.
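The 14:20 and 15:22 entries refer to cluster settings deployed to ease indexing and accelerate backup recovery. The report does not list the specific settings, but Elasticsearch exposes real dynamic settings for exactly this purpose via the cluster settings API; a hypothetical sketch of such a change might look like:

```json
PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "200mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}
```

`transient` settings revert on a full cluster restart, which makes them a reasonable fit for temporary recovery tuning; the actual settings and values Logz.io used are not stated in the report.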
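A note on the 13:56 entry above: the original postmortem does not publish the exact JVM settings involved, but the failure mode it describes is well known. Elasticsearch's heap size is fixed at startup by the `-Xms`/`-Xmx` JVM options, and the common rule of thumb is to set the heap to roughly 50% of RAM, capped below ~32 GB to preserve compressed object pointers. If the configuration management run that regenerates those options fails, a quadrupled instance keeps its old, small heap. The sketch below illustrates this with hypothetical machine sizes (the numbers are ours, not Logz.io's):

```python
def recommended_heap_gb(ram_gb: int) -> int:
    """Common Elasticsearch rule of thumb: heap is ~50% of RAM,
    capped at 31 GB to keep compressed object pointers."""
    return min(ram_gb // 2, 31)

# Illustrative instance sizes only (not from the incident report):
old_ram_gb, new_ram_gb = 16, 64  # a quadrupled master node

# Heap baked into jvm.options by the last successful Puppet run:
stale_heap = recommended_heap_gb(old_ram_gb)
# Heap the larger machine should have received:
fresh_heap = recommended_heap_gb(new_ram_gb)

print(stale_heap, fresh_heap)  # → 8 31
```

With the stale value, the resized node would still start the JVM with an 8 GB heap and leave most of the new RAM unused, which is consistent with the cluster continuing to hit OOM after the instances were quadrupled.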
At Logz.io, we are fully committed to delivering your data on time, and we know you trust us to have your data ready when you need it the most.
We know that, in this case, even though no log data was lost, we didn't live up to the high standards you expect from us, and we deeply apologize for that. We took many lessons from this incident; many are already being applied, and the rest will follow. Our engineers are working around the clock to improve the tools, the automation, the procedures, and the infrastructure so these kinds of incidents won't happen again. We are continuously working, as stated in the Remediation section, to remove all manual processes and eliminate the incidence of human error. At the same time, we are also working on improving our disaster recovery processes to further reduce downtime in extreme situations.
Roi Rav-Hon, Core Team Leader, Logz.io