Delays in Ingestion

Incident Report for logz.io

Postmortem

Executive Summary
Event Description	Accounts located in the US East and EU Central regions experienced ingestion delays for some of their data, so fresh data was not visible in Logz.io.
Factors Leading to Event	Because AWS EC2 capacity was overallocated, required resources hit EC2 limits and failed to launch. This also caused a networking issue, which limited data ingestion capacity.
Impact	There was a 3.5 hour period where some fresh data was not ingested. For all customers in the US and for the vast majority of customers in the EU, this latency was recovered 3 hours later. For a small subset of customers in the EU, it was recovered 24 hours later. Customers who used non-resilient data shipping mechanisms may have experienced data loss during that time. Some data was recovered and appended to the next day's index. Customers who are missing any data should reach out to the Logz.io support team.

Time (Aug 28, IDT)	Event
12:50	Production on-call engineer started to get alerts regarding failure of the data ingestion pipeline
13:03	Issue was escalated to a senior engineer, who started investigating. A large group of Operations and Software Development engineers joined the investigation.
14:00	Networking issue detected and work on a resolution started
14:40	All defective nodes were removed, and networking issues were resolved
15:12	First “fresh” resources are launched. Fresh data coming in
16:12	Issue resolved. Started adding the required resources to handle the accumulated lag.
19:15	Lag has been consumed for 99% of the accounts
21:00	Lag consumed for all the accounts
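
For reference, the elapsed time between the first alert and each later entry in the timeline above can be worked out with a short script. This is only an illustrative sketch: the times are copied from the table, while the shortened event labels and the function name `minutes_since_first_alert` are hypothetical and chosen here for readability.

```python
from datetime import datetime

# Times copied from the timeline table (Aug 28, IDT); labels are abbreviated.
TIMELINE = [
    ("12:50", "first alerts"),
    ("13:03", "escalated"),
    ("14:00", "networking issue detected"),
    ("14:40", "defective nodes removed"),
    ("15:12", "first fresh resources launched"),
    ("16:12", "issue resolved"),
    ("19:15", "lag consumed for 99% of accounts"),
    ("21:00", "lag consumed for all accounts"),
]


def minutes_since_first_alert(timeline):
    """Map each event label to whole minutes elapsed since the first entry."""
    fmt = "%H:%M"
    start = datetime.strptime(timeline[0][0], fmt)
    return {
        label: int((datetime.strptime(clock, fmt) - start).total_seconds() // 60)
        for clock, label in timeline
    }


elapsed = minutes_since_first_alert(TIMELINE)
for label, minutes in elapsed.items():
    print(f"{minutes // 60}h{minutes % 60:02d}m  {label}")
```

All entries fall on the same day, so no date handling is needed; a timeline that crossed midnight would require full timestamps.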

Corrective Actions
Immediate Term	Work with AWS to make sure that there is a 24x7 open communication channel to handle such incidents, along with the ability to manually “override” any capacity constraints. Review EC2 limits across our cloud operations.

Medium Term	Improve our automation capabilities to launch needed resources faster in a 100% automated manner. Further deepen our monitoring to detect these capacity issues earlier and with more precision.
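
As a rough illustration of the kind of capacity check the corrective actions describe, the sketch below flags resources whose remaining headroom under a limit drops below a threshold. It is a minimal example under stated assumptions and not the monitoring Logz.io runs: the `QuotaUsage` type, the `low_headroom` function, the 20% threshold, the resource labels, and all numbers are hypothetical. In practice the limit and usage figures would come from the cloud provider's own quota and usage data.

```python
from dataclasses import dataclass


@dataclass
class QuotaUsage:
    name: str    # hypothetical label for a limited resource
    limit: int   # the account limit for that resource
    in_use: int  # how much of the limit is currently allocated


def low_headroom(usages, min_free_ratio=0.2):
    """Return (name, free_ratio) for quotas with less free capacity than min_free_ratio."""
    flagged = []
    for usage in usages:
        # Treat a zero limit as having no headroom at all.
        free_ratio = (usage.limit - usage.in_use) / usage.limit if usage.limit else 0.0
        if free_ratio < min_free_ratio:
            flagged.append((usage.name, round(free_ratio, 2)))
    return flagged


# Made-up numbers, for illustration only.
sample = [
    QuotaUsage("us-east/instances", limit=1000, in_use=950),
    QuotaUsage("eu-central/instances", limit=800, in_use=400),
]
flagged = low_headroom(sample)
print(flagged)
```

Alerting on remaining headroom, rather than waiting for launch failures, is one way such limits could be noticed before new resources are needed.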

Conclusions

Logz.io is committed to providing a stable service to its customers, and we are fully committed to ingesting your data on time with minimal delays.

Logz.io views this incident very seriously due to its impact in terms of breadth and time.

Thanks to our robust disaster recovery procedures, even in an event of this magnitude we were able to start ingesting fresh data within a couple of hours.

We have since been working with AWS around the clock to ensure that such a failure will not happen again.

Posted Aug 30, 2021 - 21:01 IDT

Resolved

This incident has been resolved.

Posted Aug 28, 2021 - 21:27 IDT

Update

Metrics indexing has been fully resumed in all regions. A small number of customers are experiencing minor latency in eu-central.

Posted Aug 28, 2021 - 20:48 IDT

Update

99% of our customers are back to normal operation

Posted Aug 28, 2021 - 19:12 IDT

Update

We are continuing to monitor for any further issues.

Posted Aug 28, 2021 - 19:03 IDT

Update

We are continuing to monitor for any further issues.

Posted Aug 28, 2021 - 17:20 IDT

Update

We are continuing to monitor for any further issues.

Posted Aug 28, 2021 - 17:15 IDT

Update

We are continuing to monitor for any further issues.

Posted Aug 28, 2021 - 17:13 IDT

Update

Latency recovery is continuing to improve. Adding resources as required.

Posted Aug 28, 2021 - 17:03 IDT

Update

Latencies are starting to improve. Fresh logs are coming in. We are continuing to monitor closely.

Posted Aug 28, 2021 - 16:17 IDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Aug 28, 2021 - 15:50 IDT

Update

We are continuing to investigate this issue.

Posted Aug 28, 2021 - 15:18 IDT

Update

We have identified the source of the problem and are working to resolve it.

Posted Aug 28, 2021 - 14:29 IDT

Investigating

We are currently investigating an elevated level of errors and alert notifications.

Posted Aug 28, 2021 - 13:27 IDT

This incident affected: AWS Frankfurt (eu-central-1) (Logs Ingestion).