Category Archives: maintenance

Performance issues around 13:00

Hello — we saw some issues early this afternoon.

I’m really sorry, they were my fault. I pushed some updates to one of our key servers that had tested fine locally and seemed innocuous enough.

However, once they were live they created a steadily escalating set of issues that also rather obscured their immediate cause.

I’ve now rolled those updates back, and everything has been behaving perfectly for the last 30-40 minutes, so I think I can get on with investigating exactly why the updates created the issues they did.

Again, I’m really sorry if you were one of the people affected by the problems.

Thanks,

Doug.

A few ‘heavy load’ errors in the last 10 minutes or so

Hi — we’ve seen a few heavy load notifications in the last 10 minutes.

We’re currently investigating the precise chain of events leading to them but we’re pretty sure that it was a cascade of server failures caused by an over-sensitive health check in one of our load balancers. This is a new load balancer put in place to handle the increased traffic we’re seeing, so we’re still tuning its configuration.

We’ve reduced the sensitivity of the health check and think that should resolve the issue while we investigate in more depth.
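
For anyone curious what “reducing the sensitivity” means in practice, here’s a rough sketch of the idea. This isn’t our real configuration (the endpoint, timings and thresholds below are made up for illustration): the essential change is that a server now has to fail several probes in a row, rather than just one slow response under load, before it’s treated as unhealthy.

```python
# A minimal sketch (not our actual configuration) of a less twitchy health check:
# a server is only marked unhealthy after several consecutive failed probes,
# so a single slow response under heavy load doesn't pull it out of the pool.
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://backend.example.internal/health"  # hypothetical endpoint
PROBE_TIMEOUT = 5        # seconds to wait for a reply before counting a failure
UNHEALTHY_THRESHOLD = 3  # consecutive failures required before acting
PROBE_INTERVAL = 10      # seconds between probes


def probe_once(url: str, timeout: float) -> bool:
    """Return True if the backend answers its health endpoint in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def monitor() -> None:
    consecutive_failures = 0
    while True:
        if probe_once(HEALTH_URL, PROBE_TIMEOUT):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= UNHEALTHY_THRESHOLD:
                # placeholder for the real action: remove the server from the pool
                print("backend marked unhealthy; removing from pool")
                consecutive_failures = 0
        time.sleep(PROBE_INTERVAL)


if __name__ == "__main__":
    monitor()
```

The trade-off is that a genuinely broken server takes a little longer to be taken out of rotation, which is generally a much better failure mode than yanking healthy-but-busy servers out of the pool and piling their traffic onto the rest.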

My apologies if you’re one of the people who saw the heavy load message and my thanks for your patience as we fine-tune the system to handle the greater demand we’re seeing across the site.

Thank you,

Doug.

Possible disruptions this afternoon

Hello — just a heads-up that we’re making some backend changes this afternoon that *might* affect the service.

The plan we have laid out should mean no disruption occurs, but there’s always a chance of misadventure.

We’ve got a quick rollback plan ready, too, so if anything does go wrong the disruption should hopefully only last a minute or two.

Thanks,

Doug.

 

UPDATE: 15:46

So far so good. The updates have been running for a few hours with no issues spotted or reported by our monitoring systems or users.

Heavy load notifications this morning

Hi — we had a number of heavy load errors reported this morning.
As soon as we saw the notifications about them we manually increased the capacity of the affected servers by 50% to ease the burden on the system.
Shortly thereafter we were able to reset the running servers to flush any lingering issues out of them.
We believe these actions should have fixed most ongoing instances of the heavy load errors, so we then turned our attention to finding their cause.
We spent hours digging into server logs, load balancer logs, DNS settings and a host of other technical minutiae to investigate what caused the issue in the first place.
Our in-depth investigations took quite a long time, but we believe we’ve now identified and fixed the cause of the problem.

What caused the issue?

Basically, we made some back-end changes to help us deal with increased traffic, and those changes introduced a subtle bug.
Recent traffic has increased even beyond our seasonal expectations, and we spotted a few isolated incidents of a heavy load error towards the end of last week. In response, we set up a system to automatically add extra servers to help at times of heavy load.
This system went live yesterday afternoon and I spent yesterday evening testing it to make sure it was behaving as it should be. The system passed the tests with apparent flying colours and I went to bed.
However, a flaw in one of the tests meant that I missed that newly added servers were not handling web traffic. For those who want the gory details, the new servers I’d tested all started working *because* I’d logged on to test them. This meant that any new server I checked seemed to be working just fine, but only because I’d logged on to check in the first place. Talk about Schrödinger’s bug!
So when I went to bed, there were lots of (manually checked) servers all working happily away. Overnight, however, some of those were shut down and replaced with new servers that were not working. This led to the situation late this morning where only a reduced number of servers in the group were able to handle traffic and, not surprisingly, they started to complain occasionally about the workload.
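
To make the lesson concrete, here’s the shape of the check that would have caught this. It’s purely an illustration (the addresses and port are made up, and it isn’t our actual test harness): probe each new server the same way real traffic reaches it, with a plain web request from outside, rather than by logging on to it, so the act of checking can’t accidentally nudge the server into working.

```python
# Illustrative sketch only: check new servers over HTTP, the way real traffic
# arrives, instead of logging on to them (which is what masked the bug).
# The addresses and port are hypothetical.
import urllib.error
import urllib.request

NEW_SERVERS = ["10.0.1.21", "10.0.1.22"]  # freshly launched servers (made-up addresses)
PORT = 80


def serves_traffic(host: str, port: int = PORT, timeout: float = 5.0) -> bool:
    """True if the server answers an ordinary web request, no login involved."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/", timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    for host in NEW_SERVERS:
        print(f"{host}: {'OK' if serves_traffic(host) else 'NOT serving traffic'}")
```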

What have we done to fix the issue?

Once we’d identified the problem we were able to fix things so that the servers were launching in a state able to handle requests. This means that the site will be handling the increased traffic properly, as per the intention of the auto-scaling work yesterday. I’ve been monitoring new instances as they are created and have confirmed (by multiple methods!) that they are working as they should be.

What else are we doing in response to the issue?

Our investigations today have highlighted a few things we can do to improve the performance of the servers in question so we’ll be looking at those for the next few days.
We also need to have a think about how best to be notified about this sort of intermittent error. We were alerted to this morning’s issue by a few members of our brilliant community — thank you so much for that! It would be good to have been automatically warned about this before they had to see it, though. We have a series of automated alerts that tell us when parts of the site are completely down or consistently unresponsive but those aren’t so great for sporadic errors. When we’ve had other intermittent issues in the past, we’ve written custom monitors for them that fix the problem or notify us or both. We need to look into doing the same for this error, should it rear its head again.
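
To give a flavour of what such a custom monitor might look like, here’s a sketch (just an illustration, not something we actually run; the log path, error text and thresholds are all hypothetical): count the sporadic errors over a sliding window and only raise the alarm when they start to cluster.

```python
# Purely illustrative sketch of an intermittent-error monitor: follow a log,
# count occurrences of the error over a sliding time window, and alert only
# when they cluster. Log path, error text and thresholds are hypothetical.
import time
from collections import deque

LOG_PATH = "/var/log/app/errors.log"  # hypothetical log file
ERROR_MARKER = "heavy load"           # text we're watching for
WINDOW_SECONDS = 600                  # look at the last 10 minutes
ALERT_THRESHOLD = 5                   # this many errors in the window triggers an alert


def follow(path):
    """Yield new lines as they're appended to a file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(1)


def monitor():
    recent = deque()  # timestamps of recent matching errors
    for line in follow(LOG_PATH):
        now = time.time()
        if ERROR_MARKER in line:
            recent.append(now)
        # drop anything older than the window
        while recent and now - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        if len(recent) >= ALERT_THRESHOLD:
            print(f"ALERT: {len(recent)} '{ERROR_MARKER}' errors in the last "
                  f"{WINDOW_SECONDS // 60} minutes")
            recent.clear()  # don't repeat the same alert straight away


if __name__ == "__main__":
    monitor()
```

In practice the alert would notify someone or even trigger the fix automatically, as some of our older custom monitors do.
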
And of course we’ll keep monitoring the servers and the community to make sure we’ve completely fixed the problem. I’m pretty confident we have but of course I was pretty confident with the auto-scaling system when I’d finished testing it last night, so that bit of extra vigilance is always a good idea.

Thank you so much to those members of our community who reported seeing the errors; your help really is appreciated. Also, I’d like to offer my sincere apologies for not catching the bug last night that led to the issue this morning, and my apologies to anyone who saw and was inconvenienced by the error. I hope this post makes the cause of this unusual issue clear, and I hope you can be reassured by how quickly we responded and by our commitment to fixing problems as soon as we’re aware of them.

And finally, thank you so much for bearing with us when we have to fix things.
Doug.

Outage at 04:50 this morning

Hello — we had an outage this morning for around 10-15 minutes.

This was caused by one of our routers running out of memory and was fixed pretty much as soon as the alerts came in.

We’re really sorry to anyone who was inconvenienced and would like to thank you for bearing with us whilst we sorted the issue.

Thanks again,

Doug.

Report on outage from 9pm today

Hello — just to let you know that the issues we saw earlier were related to the image service.

Initially only image uploads were failing, but image serving subsequently began to struggle as well, so we decided to put the site into maintenance mode whilst we fixed the issue.

We fixed the issues with the image services within around 20 minutes and are now going through the various servers and services to make sure everything is behaving as it should be.

Thanks once again for your patience whilst we dealt with this and our sincere apologies for any inconvenience caused by the outage.

Thanks,

Doug.

Folksy having some issues around 9pm

Hi — just to let you know that we’re investigating the outage we’re currently experiencing.

We hope to have the site back up as soon as possible.

In the meantime thanks for your patience and for bearing with us.

Doug.

Facebook sign-in being updated

Hi — we’re currently updating our Facebook sign-in component. Hopefully this will not take very long at all.

In the meantime, we’ve posted a note on the sign-in page explaining that we’re currently working on the Facebook sign-in component and telling users how they can still log in by resetting their password (even if they’ve only ever logged in through Facebook before).

We’re working to get the component up and running and back in place as quickly as possible.

Thanks,

Doug.

Maintenance mode at 01:30 am

Hi — we’ve a few updates to push tonight/this morning at 1:30.

They will require us to go into maintenance mode, hopefully only for around 10 minutes.

Our apologies for any inconvenience caused; we shall be seeking to limit the downtime as much as possible.

Thanks ever so much for bearing with us whilst we push this vital work,

Doug.

UPDATE: 01:41

It looks like it’s taking a little while longer to run some of the database updates, sorry, so it could be up to half an hour or so until we’re back up. Thanks for your patience!

Some Folksy monthly bills generated a week late

Hello — my apologies to those Folksy users who received their bill for last month 7 days late.

The process that generates the bills and notifications got into trouble and fell over.

Normally I would have spotted this but I was away for a few days. I still *should* have spotted it, though, and am really sorry I didn’t.

Thank you so much to those sellers who brought the issue to our attention! We really do have an amazing community, here, and things like this remind us that we’re not just *providing* a service but that we’re collaborating with you all in it. Teamwork like that really is appreciated — thank you!

And my apologies again to those users who had to wait a week for their bill; I really hope it hasn’t inconvenienced anybody too much.

Thanks,

Doug.
