Some customers in the Frankfurt region might experience elevated error rates when searching
Incident Report for logz.io
Postmortem

Postmortem: Logz.io Partial Outage for a Subset of the Frankfurt Region

Impact

Customers in Frankfurt who share a specific cluster experienced issues accessing their data and indexing new data. A very small subset of accounts also experienced login issues.
The outage impacted search, API, alerts, and ingestion of new data into the system (not collection).

Executive Summary

In the process of adding more resources to give customers a better service, a production engineer accidentally terminated all master nodes of a single cluster in the Frankfurt region. The termination immediately rendered the entire cluster non-operational for both search and indexing. It also caused other resources in the same cluster (non-master nodes) to accumulate extremely high load.

Because of this high load, the well-rehearsed emergency procedure of launching replacement instances did not go as smoothly as intended. Restoring the master nodes while the other nodes in the cluster were in a resource-shortage state succeeded only sporadically, and the engineers had to go through several iterations of the procedure in order to return all master nodes to an operational state.

As a result of the above inconsistency, some customer data became corrupted and had to be restored from backup.

As part of our restoration procedure our priorities were:
1. Restore the cluster to operational mode
2. Restore login for all accounts
3. Restore fresh logs
4. Restore all gaps in the data.

Chain of events

All times are in UTC

13:03 - A production engineer accidentally terminated all master nodes of a single cluster
13:04 - An alert triggered, issues were identified, and a global emergency production team began working on a remediation plan
13:09 - The emergency procedure to re-provision the terminated nodes started; according to this procedure, two effort tracks were launched in parallel:
1. Restore the cluster based on existing resources
2. Rebuild the cluster from scratch and restore all data to be prepared for the case where Option 1 fails
13:09 - Replacement machines were in a provisioning process
13:17 - 2 new nodes completed provisioning (which is the minimum required to form a cluster)
13:31 - Indexing was suspended so as not to overload the newly formed cluster
13:36 - After repeated failures due to OOM, we decided to quadruple the instance size of all master nodes
13:47 - Cluster formed again
13:56 - After the cluster crashed due to OOM again, we discovered that Puppet (the configuration management system we use) had a configuration issue which prevented it from running after the instance size increase; as a result, the JVM options did not take advantage of the larger machines
14:05 - Option 1 succeeded
14:09 - New cluster stability verified; we tried to resume indexing
14:14 - We identified issues in the Elasticsearch logs and worked to resolve them all
14:19 - We noticed a significant load on the cluster and decided to stop indexing again
14:20 - We started deploying cluster settings to support easier indexing
14:32 - Cluster crashed again due to OOM
14:47 - Cluster formed again; we continued with settings to ease the restore
15:06 - We noticed that a very small number of accounts were missing their Kibana index (where an account's visualizations and dashboards are stored); this would prevent those accounts from logging in
15:22 - Deployed configuration to accelerate Elasticsearch backup recovery
15:24 - New cluster stability verified. The cluster finished accepting our commands (started at 14:20) to ease the restore and was stable
15:25 - Started to apply the first stage in the data-recovery procedure for all accounts
15:37 - Backup configuration (started at 15:22) deployed; all account Kibana indices started to restore
15:40 - Slowly started to resume indexing on the cluster while closely watching the metrics
15:43 - Cluster responded well, increasing the indexing rate
15:46 - Application available; Kibana index restore completed, allowing all customers to log in
15:49 - Prepared the missing data restoration for all accounts, while keeping all resources on indexing recent data
17:18 - All data gaps until 00:00 UTC of the same day were restored
17:50 - We identified a small subset of accounts with a configuration issue preventing them from seeing logs; we fixed it immediately
18:39 - All customers then had recent logs since 13:09 UTC (the beginning of the incident)
18:40 - Started restore of all missing data for a subset of accounts, 00:00 UTC - 13:09 UTC.
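
The 13:17 entry notes that two nodes are the minimum required to form a cluster. That is consistent with the usual majority rule for master-eligible nodes if the cluster ran three dedicated masters, which the report does not state. A minimal Python sketch under that assumption (the function names are hypothetical, not part of any Logz.io tooling):

```python
def master_quorum(master_eligible_nodes: int) -> int:
    """Smallest number of master-eligible nodes that forms a majority."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_form_cluster(provisioned: int, master_eligible_nodes: int) -> bool:
    """True once enough master-eligible nodes are up to elect a master."""
    return provisioned >= master_quorum(master_eligible_nodes)


# Assumed layout: three dedicated master nodes (not stated in the report).
print(master_quorum(3))        # 2
print(can_form_cluster(1, 3))  # False
print(can_form_cluster(2, 3))  # True
```

Under this assumption, the cluster could first form as soon as the second replacement node finished provisioning, which matches the timeline.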

Remediation

Immediate Term

  • Verify backups configuration integrity and increase the snapshot backup frequency
  • Fix launch issue which prevented our configuration management system from running properly to configure JVM options
  • Automatic configuration fix for the issue that prevented customers from seeing logs even after they were present
  • Automation around bulk cluster configuration changes
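
The 13:56 timeline entry describes the failure mode behind the second item: the instances grew, but the JVM options stayed at their old values. One way to avoid a stale heap setting is to derive it from the machine's memory at launch. The sketch below is an illustration only, not the Puppet configuration used at Logz.io; it assumes the commonly cited Elasticsearch guidance of giving the heap about half of RAM while staying below roughly 32 GB, and the function names are hypothetical.

```python
# Stay under ~32 GB so the JVM can keep using compressed object pointers.
COMPRESSED_OOPS_LIMIT_MB = 31 * 1024


def heap_size_mb(instance_ram_mb: int) -> int:
    """Half of RAM, capped just below the compressed-oops threshold."""
    if instance_ram_mb <= 0:
        raise ValueError("instance RAM must be positive")
    return min(instance_ram_mb // 2, COMPRESSED_OOPS_LIMIT_MB)


def jvm_heap_options(instance_ram_mb: int) -> list[str]:
    """Render matching -Xms/-Xmx flags so the heap size is fixed at startup."""
    heap = heap_size_mb(instance_ram_mb)
    return [f"-Xms{heap}m", f"-Xmx{heap}m"]


# Quadrupling a hypothetical 8 GB instance to 32 GB quadruples the heap.
print(jvm_heap_options(8 * 1024))   # ['-Xms4096m', '-Xmx4096m']
print(jvm_heap_options(32 * 1024))  # ['-Xms16384m', '-Xmx16384m']
```

Computing the flags from the instance size, rather than hard-coding them per instance type, means a resize cannot leave the heap at its previous value as long as the configuration run itself succeeds.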

Short Term

  • Finish migrating to fully Terraform-managed environments, to take human error out of the equation
  • Create an automated 1-click cluster restore mechanism for master node failure (which exists today only for data node failure)
  • Recreate this outage in lab conditions, and improve the automation and procedure around those failures
  • Automatically produce a list of accounts affected by cluster integrity issues, so we can communicate to you better and faster
  • Conduct “War Days” to simulate edge case issues for all components used in Logz.io
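
As a rough illustration of the affected-accounts item, the sketch below groups unhealthy indices by account. It assumes index health is available as the green/yellow/red status Elasticsearch reports, and it uses a hypothetical index naming convention of `<account_id>-<yyyy.mm.dd>`; the report does not describe how Logz.io names indices, and the sample data is invented.

```python
from collections import defaultdict


def account_of(index_name: str) -> str:
    """Extract the account id, assuming the hypothetical '<account_id>-<date>' naming."""
    return index_name.rsplit("-", 1)[0]


def affected_accounts(index_health: dict[str, str]) -> dict[str, list[str]]:
    """Map each account to its indices whose health is not green."""
    affected: dict[str, list[str]] = defaultdict(list)
    for index_name, health in sorted(index_health.items()):
        if health != "green":
            affected[account_of(index_name)].append(index_name)
    return dict(affected)


# Sample data invented for illustration only.
sample = {
    "acct42-2019.06.10": "red",
    "acct42-2019.06.09": "green",
    "acct7-2019.06.10": "yellow",
    "acct9-2019.06.10": "green",
}
print(affected_accounts(sample))
# {'acct42': ['acct42-2019.06.10'], 'acct7': ['acct7-2019.06.10']}
```

Whether yellow indices should count as affected is a policy choice; this sketch flags anything that is not green.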

Conclusion

At Logz.io, we are fully committed to getting your data in time, and we know you trust us to have your data ready for you when you need it the most.
We know that, in this case, even though not a single log went missing from the data, we didn’t live up to the high standards you expect from us, and we deeply apologize for that. We had many takeaways from this incident; many are already being applied, and the rest will follow. Our engineers are working around the clock to improve the tools, the automation, the procedures, and the infrastructure so these kinds of incidents won't happen again. We are continuously working, as stated in the Remediation section, to remove all manual processes and eliminate the incidence of human error. At the same time, we are also working on improving our disaster recovery processes to further reduce downtime in extreme situations.

Roi Rav-Hon, Core Team Leader, Logz.io

Posted 4 months ago. Jun 11, 2019 - 22:58 IDT

Resolved
Log ingestion for all customers is caught up.
Posted 4 months ago. Jun 10, 2019 - 21:41 IDT
Update
We are continuing to catch up on recent logs for all customers. We will post any further updates here.
Posted 4 months ago. Jun 10, 2019 - 20:13 IDT
Update
All accounts should have stable login and search.
We are restoring recent logs as a top priority.
Posted 4 months ago. Jun 10, 2019 - 19:11 IDT
Update
Some accounts might experience degraded UI performance (including search errors and intermittent login and logout issues). We are working hard on restoring regular service as fast as we can.
Posted 4 months ago. Jun 10, 2019 - 17:38 IDT
Monitoring
Search availability on both Kibana and the API should work as usual for all customers.
Some customers might still experience delayed logs; we are catching up as fast as we can.
Posted 4 months ago. Jun 10, 2019 - 17:11 IDT
Identified
A subset of customers in our Frankfurt region might experience elevated error rates when searching in Kibana or the API. This might also affect alerting accuracy and cause log ingestion delays.
Our engineers have identified the issue and are working hard to resolve it.
Posted 4 months ago. Jun 10, 2019 - 16:08 IDT
This incident affected: Indexing, User Interface, Alerting, and API.