Hi — we had a number of heavy load errors reported this morning.
As soon as we saw the notifications about them, we manually increased the capacity of the affected servers by 50% to ease the burden on the system.
Shortly afterwards we were able to reset the running servers to flush out any lingering issues.
We believe these actions fixed most ongoing instances of the heavy load errors, so we then turned our attention to finding their cause.
We spent hours digging into server logs, load balancer logs, DNS settings and a host of other technical minutiae to investigate what caused the issue in the first place.
Our in-depth investigations took quite a long time, but we believe we’ve now identified and fixed the cause of the problem.
What caused the issue?
Basically, we made some back-end changes to help us deal with increased traffic, and those changes introduced a subtle bug.
Recent traffic has increased even beyond our seasonal expectations, and a few isolated heavy load errors were spotted towards the end of last week. In response, we set up a system to automatically add extra servers at times of heavy load.
This system went live yesterday afternoon and I spent yesterday evening testing it to make sure it was behaving as it should be. The system passed the tests with apparent flying colours and I went to bed.
However, there was a flaw in one of the tests that meant that I missed that newly added servers were not handling web traffic. For those who want the gory details, the new servers I’d tested all started working *because* I’d logged on to test them. This meant that any new server I checked seemed to be working just fine, but only because I’d logged on to check in the first place. Talk about Schrödinger’s bug!
So when I went to bed, there were lots of (manually checked) servers all working happily away. Overnight, however, some of those were shut down and replaced with new servers that were not working. This led to the situation late this morning where a reduced number of servers in the whole group was able to handle traffic and, not surprisingly, they started to complain occasionally about the workload.
What have we done to fix the issue?
Once we’d identified the problem, we were able to fix things so that new servers launch in a state where they can handle requests. This means the site should now handle the increased traffic properly, as yesterday’s auto-scaling work intended. I’ve been monitoring new instances as they are created and have confirmed (by multiple methods!) that they are working as they should be.
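For the curious, here is a minimal sketch of the kind of "hands-off" check that avoids the trap described above: it asks a server for a page over plain HTTP, the way a visitor's browser would, and never logs on to it. This is an illustration rather than our actual tooling. The function name and the example address are made up, and it assumes each server answers ordinary HTTP requests directly.

```python
import http.client
import urllib.error
import urllib.request

# Talk to the server directly and ignore any proxy settings on the checking
# machine, so the result reflects this one server and nothing in between.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def serves_web_traffic(url: str, timeout: float = 5.0) -> bool:
    """Return True if an unauthenticated HTTP GET to `url` gets a 2xx reply.

    Hypothetical helper: it only ever makes a web request, so the act of
    checking cannot "wake up" the server the way logging on to it did.
    """
    try:
        with _DIRECT.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or an HTTP error status such as 503:
        # from a visitor's point of view this server is not handling traffic.
        return False
```

Run against a freshly launched server that nobody has logged on to (for example `serves_web_traffic("http://10.0.0.12/")`, with a made-up address), a check like this would have returned `False` for the broken servers, whereas my manual test could not.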
What else are we doing in response to the issue?
Our investigations today have highlighted a few things we can do to improve the performance of the servers in question, so we’ll be looking at those over the next few days.
We also need to have a think about how best to be notified about this sort of intermittent error. We were alerted to this morning’s issue by a few members of our brilliant community. Thank you so much for that! It would have been good to be warned automatically before they had to see it, though. We have a series of automated alerts that tell us when parts of the site are completely down or consistently unresponsive, but those aren’t so great for sporadic errors. When we’ve had other intermittent issues in the past, we’ve written custom monitors for them that fix the problem, notify us, or both. We need to look into doing the same for this error, should it rear its head again.
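To give a flavour of what a monitor for sporadic errors might look like, here is a rough sketch that keeps a rolling window of recent requests and raises a flag when the share of failures creeps above a small threshold, even though most requests are still succeeding. We haven't written this monitor yet. The class name, window size and threshold are all assumptions made for illustration.

```python
from collections import deque


class SporadicErrorMonitor:
    """Track the most recent request outcomes and flag a raised error rate."""

    def __init__(self, window: int = 200, threshold: float = 0.02,
                 min_samples: int = 50) -> None:
        # Only the last `window` outcomes are kept; True means "request failed".
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold      # alert when the failure share exceeds this
        self.min_samples = min_samples  # avoid alerting on a handful of requests

    def record(self, failed: bool) -> None:
        """Note the outcome of one request."""
        self.outcomes.append(failed)

    def error_rate(self) -> float:
        """Share of failed requests within the current window."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        """True once enough requests are seen and too many of them failed."""
        return (len(self.outcomes) >= self.min_samples
                and self.error_rate() > self.threshold)
```

The idea is to call `record()` for every request (or for a sample of them) and send a notification whenever `should_alert()` returns `True`. An "is the site up?" check would stay quiet if a few requests in every hundred failed, but a monitor along these lines would notice.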
And of course we’ll keep monitoring the servers and the community to make sure we’ve completely fixed the problem. I’m pretty confident we have, but then I was pretty confident in the auto-scaling system when I’d finished testing it last night, so a bit of extra vigilance is always a good idea.
Thank you so much to those members of our community who reported seeing the errors. Your help really is appreciated. I’d also like to offer my sincere apologies for not catching the bug last night that led to the issue this morning, and my apologies to anyone who saw and was inconvenienced by the error. I hope this post makes the cause of this unusual issue clear, and I hope you can be reassured by how quickly we responded and by our commitment to fixing it as soon as we were aware of it.
And finally, thank you so much for bearing with us when we have to fix things.