As an observability system, [Logz.io](http://logz.io) considers login availability a key performance indicator. A community of users relies on its availability to monitor and troubleshoot their systems and to ensure their performance and uptime. We share with our users a commitment to deliver stable and reliable services.
The Kibana App lost connectivity to a microservice that acts as a gateway to our backend, which left some of our accounts in the Frankfurt region unable to access Kibana.
Factors Leading to Event:
The Logz.io microservices architecture uses a Gateway to provide extensive route customization and to enable features such as monitoring, tracing, and connection timeout management.
All healthy discovered services are registered in the Gateway to produce a valid route to a service version.
A bug in the service registration process prevented the Gateway from producing valid routes to our backend services, so connectivity between our App and its backend services was lost.
As a result, our Kibana App was not available to the accounts that were routed to the affected infrastructure.
Customers could not use the user interface for 75 minutes. No logs were lost, and alerts continued to trigger as required.
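The registration mechanism described above can be sketched in a few lines. This is a minimal illustration only, not the Logz.io implementation: the `ServiceInstance` and `Gateway` classes, their fields, and the addresses are all hypothetical names chosen for the example. It shows a gateway that builds its routing table from the healthy discovered services, and how every route lookup fails when registration produces no valid entries.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ServiceInstance:
    """A discovered service instance (hypothetical shape, for illustration)."""
    name: str
    version: str
    address: str
    healthy: bool


class Gateway:
    """Toy gateway holding a routing table keyed by (service name, version)."""

    def __init__(self) -> None:
        self.routes: Dict[Tuple[str, str], List[str]] = {}

    def register(self, discovered: List[ServiceInstance]) -> None:
        """Rebuild the routing table from the healthy discovered instances."""
        table: Dict[Tuple[str, str], List[str]] = {}
        for instance in discovered:
            if instance.healthy:
                key = (instance.name, instance.version)
                table.setdefault(key, []).append(instance.address)
        self.routes = table

    def resolve(self, name: str, version: str) -> str:
        """Return an address for the service version, or raise LookupError."""
        addresses = self.routes.get((name, version))
        if not addresses:
            raise LookupError(f"no valid route to {name} {version}")
        return addresses[0]


gateway = Gateway()
gateway.register([
    ServiceInstance("backend", "v1", "10.0.0.1:8080", healthy=True),
    ServiceInstance("backend", "v1", "10.0.0.2:8080", healthy=False),
])
route = gateway.resolve("backend", "v1")  # only the healthy instance is routable

# If registration yields no entries, every lookup fails. From the app's side
# this looks like lost connectivity to its backend services.
gateway.register([])
try:
    gateway.resolve("backend", "v1")
    lookup_failed = False
except LookupError:
    lookup_failed = True
```

In this sketch the routing table is replaced wholesale on each registration pass, so a faulty pass leaves the gateway with no routes at all even though the backend instances themselves are still running.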
Chronology of Events (UTC):
Jan 4th, 2:45 pm, A Logz.io service was deployed in the Frankfurt region.
Jan 4th, 2:47 pm, A production engineer was paged because the Kibana App was not available.
Jan 4th, 2:50 pm, Status Page is updated.
Jan 4th, 2:55 pm, The issue is escalated, bringing all hands on deck.
Jan 4th, 3:00 pm, As part of our recovery process, the recently deployed service was reverted.
Jan 4th, 3:55 pm, The team continued its investigation. Once the bug in the registration process was identified, we concluded that we needed to revert to an earlier version.
Jan 4th, 4:10 pm, Kibana App availability is restored.
Jan 4th, 4:15 pm, Status Page is removed.
As Logz.io is a critical component of our service, we consider App availability a top priority. The actions taken combine immediate-term measures, which we have already applied to make sure such cases do not repeat, with some short-term ones.
The immediate-term corrective measures are:
The short-term corrective measures:
Logz.io is committed to providing a stable service to its customers and we are fully committed to getting your data in time with minimal delays.
Unfortunately, our services encountered an edge-case bug that led to a degradation of performance. We continue to work on applying the required measures so that such cases do not occur again.
We deeply apologize for any inconvenience we have caused.