We're seeing a higher error rate on login attempts to the logz.io Webapp
Incident Report for logz.io
Postmortem

At 10:21 AM GMT, 11/1/17, our monitoring systems reported a delay in user login. Logins took as long as 15 seconds, and timeout frequency rose above the normal level. The on-call team investigated the issue and identified heavy load on one of the databases. Working with the dev team, the on-call team determined that code deployed the previous day, combined with specific data ingestion, triggered the excessive load on the database and the downstream login latency.

At 11:10 AM, development applied a code change that reduced the load on the database and brought timeouts close to the normal level; however, the database load and slow logins persisted. At 12:08 PM, once development completed additional configuration changes, our monitoring reported normal login times.

We've fixed this bug and added unit tests and system tests to make sure this event will not occur again. We're also moving the user database to a separate database instance to make sure other services cannot slow down the login process. We expect this change to be completed by the end of this month (November).

Posted Nov 03, 2017 - 14:52 IST

Resolved
This incident has been resolved.
Posted Nov 01, 2017 - 14:54 IST
Update
We're still monitoring the system to make sure the error rate remains low. We currently see a slight delay in the login time.
Posted Nov 01, 2017 - 14:08 IST
Update
We're seeing the error rate drop close to the normal rate. We're monitoring the system and will update on the progress.
Posted Nov 01, 2017 - 13:03 IST
Identified
We seem to have identified the root cause of this issue and are working to apply a fix.
Posted Nov 01, 2017 - 12:44 IST
Investigating
We're aware of this issue and are working to resolve it.
Posted Nov 01, 2017 - 12:41 IST