Logz.io user interface is not accessible
Incident Report for logz.io
Postmortem

At 17:25 UTC, during an upgrade of one of our Puppet modules, an incorrect configuration was applied to the new module. As a result, some critical Kubernetes supporting services received the new configuration, restarted, and failed to start. This caused some of the Kubernetes pods to fail, which in turn caused downtime of the Logz.io front-end application as well as a few other supporting microservices.

The on-call team immediately escalated the issue across the SRE and R&D teams, who applied a new configuration and restarted the relevant pods.

At 17:54 UTC, the system was back up.

The issue was caused by a misconfiguration made by one of the engineers in one of the Puppet modules, which led to a series of events that resulted in the loss of several Kubernetes pods.

Immediate

  • A stricter pull request process is applied immediately to all Puppet configurations
  • Increase the availability level of several critical services such as etcd and others
  • Roll out Docker changes more slowly, starting in a small environment

Longer term

  • Improve monitoring across the different components to discover health issues earlier
  • Make changes to the relevant staging environment so that it is completely aligned with the production environment, allowing these issues to be discovered earlier in the process
Posted Dec 09, 2017 - 20:15 IST

Resolved
All components are fully functional. No logs have been lost.
Posted Dec 06, 2017 - 20:05 IST
Monitoring
We've applied a fix and the UI is now accessible.
We're still monitoring the system to make sure everything is functioning as expected.
Posted Dec 06, 2017 - 19:54 IST
Identified
The issue has been identified and we're applying a fix.
Posted Dec 06, 2017 - 19:41 IST
Update
We're still investigating an issue with the Logz.io user interface, which is not accessible. We expect this problem to be resolved shortly.
Posted Dec 06, 2017 - 19:34 IST
Investigating
We are currently investigating an issue in the presentation layer (Kibana). Logs are being processed correctly. We will update as soon as possible.
Posted Dec 06, 2017 - 19:29 IST