A subset of customers in the eu-central-1 region are experiencing UI issues
Incident Report for logz.io
Postmortem

App Unavailability - Root Cause Analysis

As an Observability system, [Logz.io](http://logz.io) considers login availability a key performance indicator. A community of users relies on its availability to monitor and troubleshoot their systems and to ensure their performance and uptime. We share with our users a commitment to delivering stable and reliable services.

Event Description:
The Kibana App lost connectivity to a microservice that acts as a gateway to our backend. As a result, some of our accounts in the Frankfurt region could not access Kibana.

Factors Leading to Event:
The Logz.io microservices architecture uses a Gateway that provides extensive route customization and enables features such as monitoring, tracing, and connection timeout management.
All healthy discovered services are registered in the Gateway to produce a valid route to a service version.
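
The registration step described above can be illustrated with a minimal sketch. This is not the actual Gateway code: the `Instance` record, `build_routes`, and `has_valid_route` are hypothetical names, and the rule that a route is valid when at least one healthy instance backs it is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical record for one discovered service instance.
    service: str
    version: str
    address: str
    healthy: bool


def build_routes(discovered):
    """Map (service, version) to the addresses of its healthy instances."""
    routes = {}
    for inst in discovered:
        if not inst.healthy:
            continue  # unhealthy instances are never registered
        routes.setdefault((inst.service, inst.version), []).append(inst.address)
    return routes


def has_valid_route(routes, service, version):
    """Assumed rule: a route is valid only if at least one address backs it."""
    return bool(routes.get((service, version)))
```

Under this sketch, a service version whose instances never make it into the routing table has no valid route, even if the instances themselves are running.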

A bug in the service registration process prevented the Gateway from producing valid routes to our backend services, so connectivity between our App and its backend services was lost.
As a result, our Kibana App was not available to the accounts that were routed to the affected infrastructure.

Impact:
Customers could not use the user interface for 75 minutes. No logs were lost, and alerts continued to trigger as required.

Chronology of Events (UTC):
Jan 4th, 2:45 pm, A Logz.io service was deployed in the Frankfurt region.

Jan 4th, 2:47 pm, A production engineer was paged because the Kibana App was unavailable.

Jan 4th, 2:50 pm, The Status Page was updated.

Jan 4th, 2:55 pm, The issue was escalated and all hands were brought on deck.

Jan 4th, 3:00 pm, As part of our recovery process, the recently deployed service was reverted.

Jan 4th, 3:55 pm, The team continued its investigation. Once the bug in the registration process was identified, we concluded that we needed to revert to an earlier version.

Jan 4th, 4:10 pm, Kibana App availability was restored.

Jan 4th, 4:15 pm, The Status Page notice was removed.

Corrective Actions:
Because Logz.io is a critical component of our service, we treat App availability as a top priority. The corrective actions combine immediate-term measures that we have already taken to make sure such cases do not recur with some short-term ones.

The immediate-term corrective measures are:

  • The bug in the services registration has been resolved.

The short-term corrective measures are:

  • Ensure that the deployment process considers a service healthy only when it is not just up and discoverable but also accessible through the Gateway.
  • Force canary deployments on all Gateway upstream services.
  • Increase visibility and refine our alerts so that we detect such connection failures and address them with the right urgency.
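
The first measure above can be sketched as a deployment gate that passes only when every check succeeds. This is an illustration under stated assumptions, not the actual deployment tooling: `deployment_gate`, `GateResult`, and the check names are hypothetical, and the checks are passed in as plain callables so that the sketch stays independent of any particular discovery or gateway client.

```python
from typing import NamedTuple


class GateResult(NamedTuple):
    healthy: bool
    failed_checks: list


def deployment_gate(checks):
    """Run each named check; the deployment is healthy only if all pass."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    return GateResult(healthy=not failed, failed_checks=failed)


# Hypothetical usage: the service is up and discoverable,
# but the Gateway cannot route to it, so the gate fails.
result = deployment_gate({
    "up": lambda: True,
    "discoverable": lambda: True,
    "reachable_through_gateway": lambda: False,
})
```

In a gate of this shape, a deployment that is up and discoverable but unreachable through the Gateway is reported as unhealthy, which is the condition the first measure is meant to catch.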

Conclusions:
Logz.io is committed to providing a stable service to its customers and to delivering your data on time with minimal delay.

Unfortunately, our services encountered an edge-case bug that led to a degradation of performance. We continue to work on applying the required measures so that such cases do not occur again.

We deeply apologize for any inconvenience we have caused.

Posted Jan 18, 2021 - 12:11 IST

Resolved
This incident has been resolved.
Posted Jan 04, 2021 - 18:09 IST
Investigating
We are currently investigating this issue.
Posted Jan 04, 2021 - 17:10 IST
This incident affected: AWS Frankfurt (eu-central-1) (Web Application).