At 17:25 UTC, during an upgrade of one of the Puppet modules, an incorrect configuration was applied to the new module. As a result, some critical Kubernetes supporting services received the new configuration, restarted, and failed to start. This caused some of the Kubernetes pods to fail, which in turn caused downtime of the Logz.io front-end application as well as a few other supporting microservices.
The on-call team immediately escalated the issue across the SRE and R&D teams, who applied a new configuration and restarted the relevant pods.
At 17:54 UTC, the system was back up.
The issue was caused by a misconfiguration made by one of the engineers in one of the Puppet modules. This led to a series of events that resulted in the loss of several Kubernetes pods.
Immediate
- A stricter pull request review is applied immediately to all Puppet configurations
- Increase the availability level of several critical services, such as etcd
- Roll out Docker changes more slowly, starting in a small environment
Longer term