Category Archives: maintenance

Maintenance mode this morning around 02:00

Hi — we have some significant backend updates to push this morning which will require us to go briefly into maintenance mode.

This shouldn’t take more than 10 minutes or so. We’ll let you know here if that estimate changes.

Thanks so much for bearing with us whilst we run these.

Doug.

15 minute outage just now

Hello — we just had an outage for around 15 minutes.

The cause was me running an intensive background system update that, with the site being busy, tipped one of our key servers over.

I’m really sorry about that.

I got the server back up in about 10 minutes and things were back to normal 5 minutes after that. Naturally, I’ll be keeping an eye on things to make sure we’re all good.

Once again, I’m sorry about that. I didn’t realise just how intensive the update was. I’ll reschedule it to run in the wee hours when we’re quieter.

Thanks,

Doug.

Image issues this morning

Hi — just letting you know that we’re on with fixing the images issue we’re experiencing. I’ll update here when we know more.

Thanks so much for bearing with us.

Doug.

UPDATE: The server that runs our image cache fell over. I’ve rebuilt and replaced it and the system seems to be working again now, although obviously I’ll keep monitoring it. We’re now investigating what caused the old server to fall over.

UPDATE 2: We’ve had notification that the original server had hardware degradation issues, which explains things. Whilst very rare, these sorts of things will happen from time to time in complex systems. The silver lining in this incident is that we were able to get a whole new server in place and running within 30 minutes of the first alert. My sincere apologies to anyone who was inconvenienced by the outage and my thanks for everybody’s patience whilst we resolved this.

Doug.

5 minute outage just now

Hi — sorry about the brief outage just now. It was due to a bad code merge making it into a deploy. Ordinarily this can’t happen but I mistakenly thought the changes I’d made didn’t affect any production code and took a shortcut.

I’ve rolled the changes back and am fixing the issue now.

Again, my apologies for that.

Doug.

Outages this afternoon

Hello — we had a few outages this afternoon, varying in length from 5 to 40 minutes.

As you can imagine, we were madly scrambling to find and fix the errors and we’re happy that we have now done this.

It turns out that we had an intermittent issue with our search index: once we fixed this, everything started working again.

I’m going to keep monitoring the site this evening to make sure all is and stays well.

My first job for tomorrow morning is to write a script that will monitor for this sort of error and fix it as soon as it occurs in the future. I don’t anticipate that it will occur very often, because it’s not one we’ve seen before, but I want to be able to tell you all with confidence that this particular issue won’t take the site down again.

I apologise to everyone who was inconvenienced by this afternoon’s outages.

I’m also sincerely grateful to you all for your patience with us whilst we investigated and fixed it.

Thank you,

Doug.

Maintenance mode at 5am today

Hi — we’ve got some work to deploy that requires we rebuild our search indices this morning.

This will require us to go into maintenance mode for a while. We’re going to try to keep it to under half an hour.

The indices will be rebuilding for around an hour or so after that, which means that, for a short while, not all items will appear in all pages.

Thanks for bearing with us whilst we undertake this work and I hope you’re all doing well.

Doug.

Maintenance mode for around 10 minutes between 1 and 2 am

Hi — just letting you know that we’re running some updates to the stack that will require us to go into maintenance mode briefly for hopefully no more than 10 minutes.

These updates will be taking place sometime in the next hour.

Thanks so much for your patience whilst we undertake this essential work!

Cheers,

Doug.

UPDATE: We’re good for now. We were in maintenance mode for around the 10 minutes predicted. Thanks again for your patience whilst we ran the updates!

Mea culpa, outage at around 13:45

Hello — I’m really sorry, that was entirely my fault.

We were just down for between 30 and 40 minutes, there.

I was working on background improvements to the image service and accidentally, and *really* stupidly, deleted one of our main load balancers: the one that routes requests to item pages, amongst other key areas. I rebuilt it as quickly as I could and we’re back online now, of course.

I’m so sorry to everyone who was inconvenienced by that.

I am now looking at ways that I can make operating in the depths of our infrastructure safer in the future. I think I’ve come up with a way to ring-fence the areas I want to look at, first, so that nothing I do can have repercussions for other areas of the overall site.

Once again, you have my sincere apologies and my assurance that I’m looking very hard into making sure something like this doesn’t happen again.

Doug.

5 minute outage at 16:10

Hello — sorry, we just had a 5 minute outage due to an issue with our routing service.

It was noticed pretty much immediately and fixed pretty much immediately after that; however, the fix did take 5 or so minutes to take effect.

We’re really sorry if you were caught by the outage. Thank you very much for bearing with us whilst we fixed it.

Thanks again,

Doug.

Images back up, still waiting for full explanation

Hello — I’m really sorry about the images being down — they’re back up, now.

Our image service is hosted with Amazon and I’ve been on the phone with one of their engineers who was trying to fix it.

They’re currently having some issues with their infrastructure and that was the cause of the image service outage. He couldn’t tell me more than that for the time being as apparently all of their top-tier engineers were working to fix whatever issue they are having.

He was, however, able to get things working in our case so the images are back up, now.

Whilst I was on the phone to the engineer, I made a start on creating an alternative architecture for the image service that I could deploy in case something like this happens again. I’m going to carry on with that work now, so that we have something we can get in place relatively quickly if there is another issue with their internal systems that we can’t do anything about. Not that the engineer said there would be, but it’s better to be safe than sorry.

I’ll let you know here when Amazon get back to me with more information about the issues they were having.

In the meantime, we’re up again — and again, my apologies for the outage.

Doug.