Maintenance mode for around 10 minutes between 1 and 2 am

Hi — just letting you know that we’re running some updates to the stack that will require us to go into maintenance mode briefly, hopefully for no more than 10 minutes.

These updates will be taking place sometime in the next hour.

Thanks so much for your patience whilst we undertake this essential work!

Cheers,

Doug.

UPDATE: We’re good for now. We were in maintenance mode for around the 10 minutes predicted. Thanks again for your patience whilst we ran the updates!

Mea culpa, outage at around 13:45

Hello — I’m really sorry, that was entirely my fault.

We were just down for between 30 and 40 minutes there.

I was working on background improvements to the image service and accidentally, and *really* stupidly, deleted one of our main load balancers: the one that routes requests to item pages, amongst other key areas. I rebuilt it as quickly as I could and we’re back online now, of course.

I’m so sorry to everyone who was inconvenienced by that.

I am now looking at ways that I can make operating in the depths of our infrastructure safer in the future. I think I’ve come up with a way to ring-fence the areas I want to work on first, so that nothing I do can have repercussions for other areas of the overall site.
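
As an illustration of the sort of safeguard I mean (a sketch only, assuming an AWS-style setup via boto3; the ARN below is a placeholder, not one of our real resources), enabling deletion protection on the critical load balancers would stop an accidental delete from going through:

```python
# Sketch: turn on deletion protection for the load balancers we can't afford to lose.
# Assumes an AWS-style setup via boto3; the ARN is a placeholder.
import boto3

elb = boto3.client("elbv2")

CRITICAL_LOAD_BALANCERS = [
    "arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/app/item-pages/0123456789abcdef",
]

for arn in CRITICAL_LOAD_BALANCERS:
    # With this attribute set, any delete request is rejected until it is explicitly
    # switched off again, so a slip of the finger can't take the load balancer down.
    elb.modify_load_balancer_attributes(
        LoadBalancerArn=arn,
        Attributes=[{"Key": "deletion_protection.enabled", "Value": "true"}],
    )
```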

Once again, you have my sincere apologies and my assurance that I’m looking very hard into making sure something like this doesn’t happen again,

Doug.

5 minute outage at 16:10

Hello — sorry, we just had a 5 minute outage due to an issue with our routing service.

It was noticed pretty much immediately and fixed pretty much immediately after that; however, the fix did take 5 or so minutes to take effect.

We’re really sorry if you were caught by the outage. Thank you very much for bearing with us whilst we fixed it.

Thanks again,

Doug.

Images back up, still waiting for full explanation

Hello — I’m really sorry about the images being down — they’re back up, now.

Our image service is hosted with Amazon and I’ve been on the phone with one of their engineers who was trying to fix it.

They’re currently having some issues with their infrastructure and that was the cause of the image service outage. He couldn’t tell me more than that for the time being, as apparently all of their top-tier engineers were working to fix whatever issue they were having.

He was, however, able to get things working in our case so the images are back up, now.

Whilst I was on the phone to the engineer I made a start on creating an alternative architecture for the image service that I could deploy in case something like this happens again. I’m going to carry on with that work, now, so that we have something we can get in place relatively quickly in case there’s another issue with their internal systems that we couldn’t do anything about. Not that the engineer said there would be, but it’s better to be safe than sorry.
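
To give a flavour of the fallback I have in mind (a sketch only; the origins below are placeholders rather than our real infrastructure), the idea is simply to try the primary image host first and quietly fall back to a secondary copy if it can’t be reached:

```python
# Sketch of the fallback idea: prefer the primary image origin, tolerate its outage.
# Both host names are placeholders, not our real infrastructure.
import requests

PRIMARY_ORIGIN = "https://images-primary.example.com"
SECONDARY_ORIGIN = "https://images-fallback.example.com"

def fetch_image(path: str, timeout: float = 2.0) -> bytes:
    """Fetch an image, falling back to the secondary origin if the primary is down."""
    last_error = None
    for origin in (PRIMARY_ORIGIN, SECONDARY_ORIGIN):
        try:
            response = requests.get(f"{origin}/{path.lstrip('/')}", timeout=timeout)
            if response.ok:
                return response.content
        except requests.RequestException as exc:
            last_error = exc  # note the failure and try the next origin
    raise RuntimeError(f"Image {path!r} unavailable from all origins") from last_error
```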

I’ll let you know here when Amazon get back to me with more information about the issues they were having.

In the meantime, we’re up again — and again, my apologies for the outage.

Doug.

Image issues

Hi — just a quick note to let you know that we’re trying to fix the image issues as we speak.

I’ll post more fully once we’re sorted.

I’m really sorry about the inconvenience.

Doug.

Outage due to database issues this evening

We had a pretty serious outage this evening.

We were working constantly to identify and fix the error as soon as it reared its head. Once we’d located the root cause in the database, I was on the phone with the engineer from our database service provider and he helped me fix it.

Basically, the database was running low on available memory. It isn’t something we’ve seen before, so it took a while, and some help from the provider’s support engineer, to identify the problem and come up with a solution. We now know to go in and clear out some memory if this should happen again, and also what to keep an eye on to stop it becoming a serious issue in the future. We’re also looking at upgrading the database so that we’ll have more leeway with memory generally.

We hit a tipping point with the low memory tonight, which is why we saw a sudden rash of issues all across the site. However, going back through the database stats, it looks like it’s been struggling with memory periodically for a while. This would definitely have been a contributing factor to the occasional heavy load errors that have been reported, so hopefully the work we’ve done tonight on the database (along with the other optimisation work we’ve been doing over the last few weeks) will eliminate those.
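
For anyone curious about the “what to keep an eye on” part, the gist is a simple watchdog along these lines (a sketch only; the metric lookup, threshold and alerting are placeholders for whatever our provider actually exposes):

```python
# Sketch of the memory watchdog we have in mind. The metric lookup, threshold and
# alerting below are placeholders for whatever our database provider actually exposes.
import time

LOW_MEMORY_THRESHOLD_MB = 512   # placeholder: warn well before we hit another tipping point
CHECK_INTERVAL_SECONDS = 60

def get_available_memory_mb() -> float:
    """Placeholder: read the database's available memory from the provider's metrics."""
    raise NotImplementedError

def alert(message: str) -> None:
    """Placeholder: page whoever is on call."""
    print(f"ALERT: {message}")

def watch() -> None:
    while True:
        available = get_available_memory_mb()
        if available < LOW_MEMORY_THRESHOLD_MB:
            alert(f"Database memory running low: {available:.0f} MB available")
        time.sleep(CHECK_INTERVAL_SECONDS)
```
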
Obviously, it’s something we’ll be keeping a very close eye on.

I hope that explains the issues we’ve seen this evening and our efforts to resolve them as quickly as possible. I’m really sorry to everyone who was affected by the outage and I hope this post offers some assurance that we’ve identified and fixed the cause of it.

Thank you so much for your patience whilst we fixed the issue,

Doug.

Ongoing outage

Hi — just to let you know that we’re doing everything we can to fix the outage we’re having at the moment.

We’ll let you know as soon as there’s any more news.

We’re really sorry for the outage and we’re doing our level best to fix it.

Doug.

Performance issues around 13:00

Hello — we saw some issues early this afternoon.

I’m really sorry, they were my fault. I pushed some updates to one of our key servers that had tested fine locally and seemed innocuous enough.

However, once they were live, they created a steadily escalating set of issues that also obscured their immediate cause.

I’ve rolled those updates back, now, and everything has been behaving perfectly for the last 30-40 mins, so I think I can get on with investigating exactly why the updates created the issues they did.

Again, I’m really sorry if you were one of the people affected by the problems.

Thanks,

Doug.

A few ‘heavy load’ errors in the last 10 minutes or so

Hi — we’ve seen a few heavy load notifications in the last 10 minutes.

We’re currently investigating the precise chain of events leading to them, but we’re pretty sure it was a cascade of server failures caused by an over-sensitive health check in one of our load balancers. This is a new load balancer, put in place to handle the increased traffic we’re seeing, so we’re still tuning its configuration.

We’ve reduced the sensitivity of the health check and think that should resolve the issue while we investigate in more depth.
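
For the technically minded, “reducing the sensitivity” just means giving each server more time, and more chances, to pass its check before the load balancer pulls it out of rotation. A rough sketch of that kind of change, assuming an AWS-style target group via boto3 (the ARN and the numbers are placeholders, not our exact settings):

```python
# Sketch: loosen the health check so one slow response doesn't pull a healthy
# server out of rotation. AWS-style target group via boto3 assumed; the ARN and
# numbers are placeholders.
import boto3

elb = boto3.client("elbv2")

elb.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/web-servers/0123456789abcdef",
    HealthCheckIntervalSeconds=30,   # check less often
    HealthCheckTimeoutSeconds=10,    # allow slower responses before counting a failure
    UnhealthyThresholdCount=5,       # require several consecutive failures before removal
    HealthyThresholdCount=2,         # let recovered servers back into rotation quickly
)
```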

My apologies if you’re one of the people who saw the heavy load message and my thanks for your patience as we fine-tune the system to handle the greater demand we’re seeing across the site.

Thank you,

Doug.

Possible disruptions this afternoon

Hello — just a heads-up that we’re making some backend changes this afternoon that *might* affect the service.

The plan we have laid out should mean that no disruption occurs, but there’s always a chance of misadventure.

We’ve got a quick rollback plan ready, too, in case of any issues, so hopefully any disruption would only last a minute or two.

Thanks,

Doug.

 

15:46, UPDATE:

So far so good. The changes have been live for a few hours now, with no issues spotted by our monitoring systems or reported by users.