Logz.io User interface is not accessible

Incident Report for logz.io

Postmortem

At 17:25 UTC, during an upgrade of one of our Puppet modules, an incorrect configuration was applied to the new module. As a result, some critical Kubernetes supporting services received the new configuration, restarted, and failed to start. This caused some of the Kubernetes pods to fail, which in turn caused downtime of the Logz.io front-end application as well as a few other supporting microservices.

The on-call team immediately escalated the issue across the SRE and R&D teams, who applied a new configuration and restarted the relevant pods.

At 17:54 UTC, the system was back up.

The issue was caused by a misconfiguration made by one of the engineers. The incorrect configuration was applied in one of the Puppet modules, which led to a series of events that resulted in the loss of several Kubernetes pods.

Immediate

A stricter pull request review process is applied immediately to all Puppet configurations
Increase the availability level of several critical services such as etcd and others
Roll out Docker changes more slowly, in a small environment first

Longer term

Improve monitoring across the different components to discover health issues earlier
Make changes to the relevant staging environment to make sure it is completely aligned with the production environment, so these issues can be discovered earlier in the process.

Posted Dec 09, 2017 - 20:15 IST

Resolved

All components are fully functional. No logs have been lost.

Posted Dec 06, 2017 - 20:05 IST

Monitoring

We've applied a fix and the UI is now accessible.
We're still monitoring the system to make sure everything is functioning as expected.

Posted Dec 06, 2017 - 19:54 IST

Identified

The issue has been identified and we're applying a fix.

Posted Dec 06, 2017 - 19:41 IST

Update

We're still investigating an issue that makes the Logz.io user interface inaccessible. We expect this problem to be resolved shortly.

Posted Dec 06, 2017 - 19:34 IST

Investigating

We are currently investigating an issue in the presentation layer (Kibana). Logs are being processed correctly. We will update as soon as possible.

Posted Dec 06, 2017 - 19:29 IST