Executive Summary | |
---|---|
Event Description | Accounts located in the US East and EU Central regions experienced ingestion delays for some of their data, so fresh data was not visible in Logz.io. |
Factors Leading to Event | Because AWS EC2 capacity was overallocated, required resources hit EC2 limits and failed to launch. This also caused a networking issue that limited data ingestion capacity. |
Impact | There was a 3.5 hour period where some fresh data was not ingested. For all customers in the US and for the vast majority of customers in the EU, this latency was recovered 3 hours later. For a small subset of customers in the EU, it was recovered 24 hours later. Customers who used non-resilient data shipping mechanisms may have experienced data loss during that time. Some data was recovered and appended to the next day's index. Customers who are missing any data should reach out to the Logz.io support team. |

Time (Aug 28, IDT) | Event |
---|---|
12:50 | Production on-call engineer began receiving alerts about failures in the data ingestion pipeline |
13:03 | Issue was escalated to a senior engineer, who started investigating. A large group of Operations and Software Development engineers joined the investigation |
14:00 | Networking issue detected and work on a resolution started |
14:40 | All defective nodes were removed, and networking issues were resolved |
15:12 | First “fresh” resources were launched. Fresh data coming in |
16:12 | Issue resolved. Started adding the required resources to handle the accumulated lag. |
19:15 | Lag has been consumed for 99% of the accounts |
21:00 | Lag consumed for all the accounts |

Corrective Actions | |
---|---|
Immediate Term | Work with AWS to make sure that there is a 24x7 open communication channel to handle such incidents, along with the ability to manually “override” any capacity constraints. Review EC2 limits across our cloud operations. |
Medium Term | Improve our automation capabilities to launch needed resources faster in a 100% automated manner. Further deepen our monitoring to detect these capacity issues earlier and with more precision. |

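
As an illustration of the kind of check that reviewing EC2 limits and detecting capacity issues earlier might involve, the following is a minimal sketch and not a description of Logz.io's actual tooling. It compares projected usage against a limit and flags regions that are close to, or over, that limit. The `QuotaReading` type, the `headroom_alerts` function, the 80% warning threshold, and the sample region names and numbers are all hypothetical; in practice the limit and usage values would come from the cloud provider's quota and inventory APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuotaReading:
    """One hypothetical limit/usage sample for a resource in a region."""
    region: str
    resource: str
    limit: int
    in_use: int


def headroom_alerts(readings: List[QuotaReading],
                    planned_launches: int = 0,
                    warn_ratio: float = 0.8) -> List[str]:
    """Return alert messages for readings whose projected usage nears the limit.

    projected = in_use + planned_launches
    CRITICAL when projected usage exceeds the limit (launches would fail),
    WARNING when it reaches warn_ratio of the limit (assumed threshold).
    """
    alerts = []
    for r in readings:
        projected = r.in_use + planned_launches
        if projected > r.limit:
            alerts.append(
                f"CRITICAL {r.region} {r.resource}: "
                f"projected {projected} exceeds limit {r.limit}"
            )
        elif projected >= warn_ratio * r.limit:
            alerts.append(
                f"WARNING {r.region} {r.resource}: "
                f"projected {projected} of limit {r.limit}"
            )
    return alerts


if __name__ == "__main__":
    # Illustrative numbers only; not taken from the incident.
    sample = [
        QuotaReading("us-east-1", "instances", limit=100, in_use=75),
        QuotaReading("eu-central-1", "instances", limit=100, in_use=95),
    ]
    for message in headroom_alerts(sample, planned_launches=10):
        print(message)
```

Including `planned_launches` in the projection is a deliberate choice in this sketch: the failure described above occurred when new resources were needed, so a check that looks only at current usage could miss the case where a recovery or scale-out would itself hit the limit.
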
Logz.io is committed to providing a stable service to its customers and we are fully committed to getting your data in time with minimal delays.
Logz.io views this incident very seriously due to its impact in terms of breadth and time.
Thanks to our robust disaster recovery procedures, even in an event of this magnitude we were able to start ingesting fresh data within a couple of hours.
We have since been working with AWS around the clock to ensure that such a failure will not happen again.