At 10:21 AM GMT on November 1, 2017, our monitoring systems reported delays in user login. Logins took as long as 15 seconds, and the frequency of timeouts rose above the normal level. The on-call team investigated and identified excessive load on one of the databases. Working with the dev team, they determined that code deployed the previous day, combined with specific data ingestion, had triggered the excessive database load and the resulting login latency.
At 11:10 AM, development applied a code change that reduced the load on the database and brought timeouts close to the normal level. However, the database was still under elevated load and logins were still slow. At 12:08 PM, once development had completed additional configuration changes, our monitoring reported normal login times.
We've fixed this bug and added unit and system tests to make sure this event will not occur again. We're also moving the user database to its own database instance so that other services cannot slow down the login process. We expect this change to be completed by the end of this month (November).