Images back up, still waiting for full explanation

Hello — I’m really sorry about the images being down — they’re back up now.

Our image service is hosted with Amazon and I’ve been on the phone with one of their engineers who was trying to fix it.

They’re currently having some issues with their infrastructure, and that was the cause of the image service outage. He couldn’t tell me more than that for the time being, as apparently all of their top-tier engineers are busy working to fix whatever the underlying issue is.

He was, however, able to get things working in our case, so the images are back up now.

Whilst I was on the phone to the engineer I made a start on an alternative architecture for the image service that I could deploy if something like this happens again. I’m going to carry on with that work now, so that we have something we can get in place relatively quickly if there’s another issue with their internal systems that we can’t do anything about. Not that the engineer said there would be, but it’s better to be safe than sorry.
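To give a rough flavour of what I mean by an alternative, here’s a minimal sketch of the sort of fallback I’m considering: try the usual image origin first and quietly switch to a standby copy if it doesn’t respond. Everything in it (the hostnames, the function name) is a hypothetical placeholder rather than the final design.

```python
# Minimal sketch of a fallback fetch for images. The hostnames below are
# hypothetical placeholders, not our real infrastructure.
import urllib.request
import urllib.error

PRIMARY = "https://images.example.com"           # placeholder: current image service
FALLBACK = "https://images-standby.example.com"  # placeholder: standby copy

def fetch_image(path, timeout=5):
    """Try the primary image origin first; fall back to the standby if it fails."""
    last_error = None
    for base in (PRIMARY, FALLBACK):
        try:
            with urllib.request.urlopen(f"{base}/{path}", timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as err:
            last_error = err  # this origin is unreachable; try the next one
    raise RuntimeError(f"All image origins failed for {path}") from last_error
```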

I’ll let you know here when Amazon get back to me with more information about the issues they were having.

In the meantime, we’re up again — and again, my apologies for the outage.

Doug.

Image issues

Hi — just a quick note to let you know that we’re trying to fix the image issues as we speak.

I’ll post more fully once we’re sorted.

I’m really sorry about the inconvenience.

Doug.

Outage due to database issues this evening

We had a pretty serious outage this evening.
We were working constantly to identify and fix the error as soon as it reared its head. Once we’d located the root cause in the database, I got on the phone with an engineer from our database service provider, who helped me fix it.
Basically, the database was running low on available memory. It isn’t something we’ve seen before, so it took a while and some help from a database support engineer to identify and come up with a solution. We now know how to go in and clear out some memory if this should happen again, and also what to keep an eye on to stop it becoming a serious issue in the future. We’re also looking at upgrading the database so that we’ll have more leeway with memory generally.
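For the curious, the “keep an eye on it” part really just means a little watchdog along these lines. This is only a sketch: the get_free_memory_mb helper is a placeholder for however our provider exposes the figure, and the threshold and interval are illustrative rather than our actual settings.

```python
# Sketch of a free-memory watchdog for the database. get_free_memory_mb() is a
# placeholder for however the provider exposes the metric; the threshold and
# interval are illustrative, not our real settings.
import time

THRESHOLD_MB = 512
CHECK_INTERVAL_SECS = 60

def get_free_memory_mb():
    """Placeholder: fetch the database's current freeable memory in MB."""
    raise NotImplementedError("wire this up to the provider's metrics")

def watch():
    while True:
        free_mb = get_free_memory_mb()
        if free_mb < THRESHOLD_MB:
            # In the real monitor this would page us rather than print.
            print(f"WARNING: database freeable memory down to {free_mb} MB")
        time.sleep(CHECK_INTERVAL_SECS)
```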
We hit a tipping point with the low memory tonight, which is why we saw a sudden rash of issues all across the site. However, going back through the database stats, it looks like it’s been struggling with memory periodically for a while. This would definitely have been a contributing factor to the occasional heavy load errors that have been reported, so hopefully the work we’ve done tonight on the database (along with the other optimisation work we’ve been doing over the last few weeks) will eliminate those.
Obviously, it’s something we’ll be keeping a very close eye on.
I hope that explains the issues we’ve seen this evening and our efforts to resolve them as quickly as possible. I’m really sorry to everyone who was affected by the outage and I hope this post offers some assurance that we’ve identified and fixed the cause of it.
Thank you so much for your patience whilst we fixed the issue,
Doug.

Ongoing outage

Hi — just to let you know that we’re doing everything we can to fix the outage we’re having at the moment.

We’ll let you know as soon as there’s any more news.

We’re really sorry for the outage and we’re doing our level best to fix it.

Doug.

Performance issues around 13:00

Hello — we saw some issues early this afternoon.

I’m really sorry: they were my fault. I pushed some updates to one of our key servers; they had tested fine locally and seemed innocuous enough.

However, once they were live they created a steadily escalating set of issues that also rather obscured their immediate cause.

I’ve rolled those updates back now and everything has been behaving perfectly for the last 30-40 minutes, so I think I can get on with investigating exactly why the updates created the issues they did.

Again, I’m really sorry if you were one of the people affected by the problems.

Thanks,

Doug.

A few ‘heavy load’ errors in the last 10 minutes or so

Hi — we’ve seen a few heavy load notifications in the last 10 minutes.

We’re currently investigating the precise chain of events leading to them, but we’re pretty sure that it was a cascade of server failures caused by an over-sensitive health check in one of our load balancers. This is a new load balancer, put in place to handle the increased traffic we’re seeing, so we’re still tuning its configuration.

We’ve reduced the sensitivity of the health check and think that should resolve the issue while we investigate in more depth.
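For anyone wondering what “reducing the sensitivity” means in practice: a health check normally only marks a server unhealthy after several consecutive failed probes, so raising that threshold (and lengthening the probe timeout) stops a single slow response from knocking a perfectly good server out of rotation. Here’s a rough sketch of the idea in plain code, with illustrative numbers rather than our actual load balancer settings.

```python
# Rough sketch of consecutive-failure health checking. The threshold, timeout
# and /health path are illustrative, not our real load balancer settings.
import urllib.request
import urllib.error

UNHEALTHY_THRESHOLD = 5   # failures must be consecutive before a server is pulled
PROBE_TIMEOUT_SECS = 10   # a longer timeout tolerates the odd slow response

def probe(server_url, consecutive_failures):
    """Probe a server once; return (still_healthy, updated_failure_count)."""
    try:
        urllib.request.urlopen(f"{server_url}/health", timeout=PROBE_TIMEOUT_SECS)
        return True, 0  # any success resets the failure count
    except (urllib.error.URLError, TimeoutError):
        failures = consecutive_failures + 1
        return failures < UNHEALTHY_THRESHOLD, failures
```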

My apologies if you’re one of the people who saw the heavy load message and my thanks for your patience as we fine-tune the system to handle the greater demand we’re seeing across the site.

Thank you,

Doug.

Possible disruptions this afternoon

Hello — just a heads-up that we’re making some backend changes this afternoon that *might* affect the service.

The plan we have laid out should mean that no disruption occurs, but there’s always a chance of misadventure.

We’ve got a quick rollback plan ready, too, in case of any issues, so hopefully any disruption would only last a minute or two.

Thanks,

Doug.

 

UPDATE (15:46):

So far so good. The changes have been live for a few hours with no issues spotted by our monitoring systems or reported by users.

Heavy load notifications this morning

Hi — we had a number of heavy load errors reported this morning.
As soon as we saw the notifications about them, we manually increased the capacity of the affected servers by 50% to ease the burden on the system.
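For those interested in the mechanics, the bump amounted to something like the snippet below. It’s only a sketch, assuming the web servers sit in something like an AWS Auto Scaling group; the group name and numbers are placeholders, not our actual setup.

```python
# Sketch of a manual capacity bump, assuming an AWS Auto Scaling group.
# The group name is a hypothetical placeholder.
import math
import boto3

GROUP = "web-servers"  # placeholder group name
asg = boto3.client("autoscaling")

def bump_capacity(factor=1.5):
    """Raise the group's desired capacity by the given factor (50% by default)."""
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[GROUP]
    )["AutoScalingGroups"][0]
    new_desired = min(math.ceil(group["DesiredCapacity"] * factor), group["MaxSize"])
    asg.set_desired_capacity(
        AutoScalingGroupName=GROUP,
        DesiredCapacity=new_desired,
        HonorCooldown=False,  # we wanted the extra servers immediately
    )
```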
Shortly thereafter we were able to reset the running servers to flush any lingering issues out of them.
We believe that these actions should have fixed most ongoing instances of the heavy load errors, so we then turned our attention to finding their cause.
We spent hours digging into server logs, load balancer logs, DNS settings and a host of other technical minutiae to investigate what caused the issue in the first place.
Our in-depth investigations took quite a long time, but we believe we’ve now identified and fixed the cause of the problem.

What caused the issue?

Basically, we made some back-end changes to help us deal with increased traffic, and those changes introduced a subtle bug.
Recent traffic has increased even beyond our seasonal expectations, and we spotted a few isolated incidents of a heavy load error towards the end of last week. In response to that, we set up a system to automatically add extra servers to help out at times of heavy load.
This system went live yesterday afternoon and I spent yesterday evening testing it to make sure it was behaving as it should be. The system passed the tests with apparent flying colours and I went to bed.
However, there was a flaw in one of the tests that meant I missed that the newly added servers were not handling web traffic. For those who want the gory details: the new servers I’d tested all started working *because* I’d logged on to test them. This meant that any new server I checked seemed to be working just fine, but only because I’d logged on to check it in the first place. Talk about Schrödinger’s bug!
So when I went to bed, there were lots of (manually checked) servers all working happily away. Overnight, however, some of those were shut down and replaced with new servers that were not working. This led to the situation late this morning where only a reduced number of the servers in the group were able to handle traffic and, not surprisingly, those started to complain occasionally about the workload.

What have we done to fix the issue?

Once we’d identified the problem we were able to fix things so that new servers launch in a state where they can handle requests. This means the site will handle the increased traffic properly, as intended by yesterday’s auto-scaling work. I’ve been monitoring new instances as they are created and have confirmed (by multiple methods!) that they are working as they should be.
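As an example of one of those methods, the important change was to check each new server from the outside, over plain HTTP, rather than by logging on to it (which is exactly what masked the bug last night). Here’s a minimal sketch, with a placeholder port and path:

```python
# Minimal sketch: check that a newly launched server actually answers web
# requests, probed from outside rather than by logging on to the box.
# The port and path are placeholders.
import urllib.request
import urllib.error

def serves_traffic(host, port=80, path="/", timeout=5):
    """Return True only if the server answers a real web request."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}{path}", timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        return False
```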

What else are we doing in response to the issue?

Our investigations today have highlighted a few things we can do to improve the performance of the servers in question so we’ll be looking at those for the next few days.
We also need to have a think about how best to be notified about this sort of intermittent error. We were alerted to this morning’s issue by a few members of our brilliant community — thank you so much for that! It would have been better to be warned about it automatically before anyone had to see it, though. We have a series of automated alerts that tell us when parts of the site are completely down or consistently unresponsive, but those aren’t so great for sporadic errors. When we’ve had other intermittent issues in the past, we’ve written custom monitors for them that fix the problem, notify us, or both. We need to look into doing the same for this error, should it rear its head again.
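To make that concrete, the sort of custom monitor I have in mind is only a few lines: watch the recent log output for heavy load responses and shout if more than a handful turn up in a short window. This is a sketch only; the log path, match string and notify hook are all hypothetical placeholders.

```python
# Sketch of a sporadic-error monitor: tail the application log and alert when
# "heavy load" errors cluster. Path, match string and notify() are placeholders.
import os
import time

LOG_PATH = "/var/log/app/requests.log"  # placeholder
MATCH = "heavy load"                    # placeholder for however the error is logged
THRESHOLD = 5                           # errors per interval before we alert
INTERVAL_SECS = 300

def notify(message):
    """Placeholder: page whoever is on call / post to our internal channel."""
    print(message)

def monitor():
    with open(LOG_PATH) as log:
        log.seek(0, os.SEEK_END)        # only watch lines written from now on
        while True:
            time.sleep(INTERVAL_SECS)
            hits = sum(1 for line in log if MATCH in line)
            if hits >= THRESHOLD:
                notify(f"Saw {hits} heavy load errors in the last {INTERVAL_SECS}s")
```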
And of course we’ll keep monitoring the servers and the community to make sure we’ve completely fixed the problem. I’m pretty confident we have but of course I was pretty confident with the auto-scaling system when I’d finished testing it last night, so that bit of extra vigilance is always a good idea.

Thank you so much to those members of our community who reported seeing the errors; your help really is appreciated. I’d also like to offer my sincere apologies for not catching the bug last night that led to the issue this morning, and to anyone who saw and was inconvenienced by the error. I hope this post makes the cause of this unusual issue clear, and I further hope that you can be reassured by the immediacy of our response and our commitment to fixing it as soon as we were aware of it.

And finally, thank you so much for bearing with us when we have to fix things.
Doug.

Outage at 04:50 this morning

Hello — we had an outage this morning for around 10-15 minutes.

This was caused by one of our routers running out of memory and was fixed pretty much as soon as the alerts came in.

We’re really sorry to anyone who was inconvenienced and would like to thank you for bearing with us whilst we sorted the issue.

Thanks again,

Doug.

Report on outage from 9pm today

Hello — just to let you know that the issues we saw earlier were related to the image service.

Initially just image uploads weren’t working, but serving images subsequently began to struggle too, so we decided to put the site into maintenance mode whilst we fixed the issue.

We fixed the issues with the image service within around 20 minutes and are now going through the various servers and services to make sure everything is behaving as it should be.

Thanks once again for your patience whilst we dealt with this and our sincere apologies for any inconvenience caused by the outage.

Thanks,

Doug.