Delays in Ingestion
Incident Report for logz.io
Postmortem

Executive Summary
Event Description: Accounts located in the US East and EU Central regions experienced ingestion delays for some of their data, so fresh data was not visible on Logz.io.
Factors Leading to Event: Because of overallocation of AWS EC2 capacity, required resources hit EC2 limits and failed to launch. This also caused a networking issue, which limited data ingestion capacity.
Impact: For about 3.5 hours, some fresh data was not ingested. For all customers in the US and the vast majority of customers in the EU, the resulting latency was recovered 3 hours later. For a small subset of customers in the EU, it was recovered 24 hours later. Customers who used non-resilient data shipping mechanisms may have experienced data loss during that time. Some data was recovered and appended to the next day's index. Customers who are missing any data should reach out to the Logz.io support team.
Time (Aug 28, IDT) Event
12:50 Production on-call engineer started to get alerts regarding failure of the data ingestion pipeline
13:03 Issue was escalated to a senior engineer, who started investigating. A large group of Operations and Software Development engineers joined the investigation
14:00 Networking issue detected and work on a resolution started
14:40 All defective nodes were removed, and networking issues were resolved
15:12 First “fresh” resources launched. Fresh data coming in
16:12 Issue resolved. Started adding the required resources to handle the accumulated lag.
19:15 Lag has been consumed for 99% of the accounts
21:00 Lag consumed for all the accounts
Corrective Actions
Immediate Term: Work with AWS to make sure that there is a 24x7 open communication channel to handle such incidents, along with the ability to manually “override” any capacity constraints. Review EC2 limits across our cloud operations.
Medium Term: Improve our automation capabilities to launch needed resources faster in a 100% automated manner. Further deepen our monitoring to detect these capacity issues earlier and with more precision.
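
As an illustration only, the kind of capacity check described in the corrective actions could be sketched as follows. This is not Logz.io's actual monitoring: the `QuotaUsage` type, the `quotas_near_limit` function, the 80% threshold, and the sample figures are all hypothetical, and a real check would read usage and limits from the cloud provider rather than from hard-coded values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaUsage:
    """Hypothetical record of one capacity limit and its current usage."""
    name: str    # illustrative label, e.g. region plus resource type
    in_use: int  # units currently consumed
    limit: int   # account limit for this resource


def quotas_near_limit(usages, threshold=0.8):
    """Return (name, utilization) pairs at or above the threshold, highest first."""
    flagged = []
    for usage in usages:
        if usage.limit <= 0:
            # A zero or missing limit cannot be evaluated; treat it as fully used.
            flagged.append((usage.name, 1.0))
            continue
        utilization = usage.in_use / usage.limit
        if utilization >= threshold:
            flagged.append((usage.name, utilization))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample numbers, not data from this incident.
    sample = [
        QuotaUsage("us-east-1/instances", 950, 1000),
        QuotaUsage("eu-central-1/instances", 400, 1000),
    ]
    for name, utilization in quotas_near_limit(sample):
        print(f"{name}: {utilization:.0%} of limit in use")
```

Alerting on utilization well before it reaches 100% leaves time to request a limit increase before new resources fail to launch.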

Conclusions

Logz.io is committed to providing a stable service to its customers and we are fully committed to getting your data in time with minimal delays. 

Logz.io views this incident very seriously due to its impact in terms of breadth and time. 

Thanks to our robust disaster recovery procedures, even in an event of this magnitude we were able to start ingesting fresh data within a couple of hours.

We have since been working with AWS around the clock to ensure that such a failure will not happen again.

Posted Aug 30, 2021 - 21:01 IDT

Resolved
This incident has been resolved.
Posted Aug 28, 2021 - 21:27 IDT
Update
Metrics indexing has been fully resumed in all regions. A small number of customers are experiencing minor latency in eu-central.
Posted Aug 28, 2021 - 20:48 IDT
Update
99% of our customers are back to normal operation
Posted Aug 28, 2021 - 19:12 IDT
Update
We are continuing to monitor for any further issues.
Posted Aug 28, 2021 - 19:03 IDT
Update
We are continuing to monitor for any further issues.
Posted Aug 28, 2021 - 17:20 IDT
Update
We are continuing to monitor for any further issues.
Posted Aug 28, 2021 - 17:15 IDT
Update
We are continuing to monitor for any further issues.
Posted Aug 28, 2021 - 17:13 IDT
Update
Latency recovery is continuing to improve. Adding resources as required.
Posted Aug 28, 2021 - 17:03 IDT
Update
Latencies are starting to improve. Fresh logs are coming in. We are continuing to monitor closely.
Posted Aug 28, 2021 - 16:17 IDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 28, 2021 - 15:50 IDT
Update
We are continuing to investigate this issue.
Posted Aug 28, 2021 - 15:18 IDT
Update
We have identified the source of the problem and are working to resolve it.
Posted Aug 28, 2021 - 14:29 IDT
Investigating
We are currently investigating an elevated level of errors and alert notifications.
Posted Aug 28, 2021 - 13:27 IDT
This incident affected: AWS N. Virginia (us-east-1) (Log indexing, Metric indexing) and AWS Frankfurt (eu-central-1) (Log indexing, Metric indexing).