Heavy load notifications this morning

Hi — we had a number of heavy load errors reported this morning.
As soon as we saw the notifications, we manually increased the capacity of the affected servers by 50% to ease the burden on the system.
Shortly thereafter we were able to reset the running servers to flush out any lingering issues.
We believe these actions resolved most ongoing instances of the heavy load errors, so we then turned our attention to finding their cause.
We spent hours digging into server logs, load balancer logs, DNS settings and a host of other technical minutiae to investigate what caused the issue in the first place.
The investigation took quite a long time, but we believe we've now identified and fixed the cause of the problem.

What caused the issue?

Basically, we made some back-end changes to help us deal with increased traffic, and those changes introduced a subtle bug.
Recent traffic has increased even beyond our seasonal expectations, and a few isolated incidents of a heavy load error were spotted towards the end of last week. In response, we set up a system to automatically add extra servers at times of heavy load.
This system went live yesterday afternoon and I spent yesterday evening testing it to make sure it was behaving as it should be. The system passed the tests with apparent flying colours and I went to bed.
However, there was a flaw in one of the tests, which meant I missed that newly added servers were not handling web traffic. For those who want the gory details, the new servers I’d tested all started working *because* I’d logged on to test them. This meant that any new server I checked seemed to be working just fine, but only because I’d logged on to check in the first place. Talk about Schrödinger’s bug!
So when I went to bed, there were lots of (manually checked) servers all working happily away. Overnight, however, some of those were shut down and replaced with new servers that were not working. This led to the situation late this morning where a reduced number of servers in the whole group was able to handle traffic and, not surprisingly, they started to complain occasionally about the workload.
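The lesson from the flawed test is to probe the actual traffic path rather than relying on an interactive login. Here's a minimal sketch of that idea — the hostnames, port and path are hypothetical, not our real setup:

```python
import http.client

def serves_traffic(host, port=80, path="/", timeout=5):
    """Return True only if the server answers a real HTTP request.

    Logging in (e.g. over SSH) proves the box is up, but not that it
    is serving web traffic -- the bug described above slipped through
    exactly that gap, so the check exercises HTTP directly.
    """
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", path)
        ok = conn.getresponse().status == 200
        conn.close()
        return ok
    except OSError:
        # Connection refused or timed out: not handling traffic.
        return False
```

A check like this can be run against every freshly launched server without anyone having to log on to it.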

What have we done to fix the issue?

Once we’d identified the problem, we fixed things so that new servers launch in a state ready to handle requests. This means the site will be handling the increased traffic properly, as intended by yesterday’s auto-scaling work. I’ve been monitoring new instances as they are created and have confirmed (by multiple methods!) that they are working as they should be.

What else are we doing in response to the issue?

Our investigations today have highlighted a few things we can do to improve the performance of the servers in question, so we’ll be looking at those over the next few days.
We also need to have a think about how best to be notified about this sort of intermittent error. We were alerted to this morning’s issue by a few members of our brilliant community — thank you so much for that! It would be good to have been automatically warned about this before they had to see it, though. We have a series of automated alerts that tell us when parts of the site are completely down or consistently unresponsive but those aren’t so great for sporadic errors. When we’ve had other intermittent issues in the past, we’ve written custom monitors for them that fix the problem or notify us or both. We need to look into doing the same for this error, should it rear its head again.
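For anyone curious what a monitor for sporadic errors might look like, here's a minimal sketch — the threshold, window and alert hook are made up for illustration. The point is that it only fires when occasional errors cluster inside a time window, which is exactly the case a plain up/down check misses:

```python
import time
from collections import deque

class SporadicErrorMonitor:
    """Alert when more than `threshold` errors occur within `window` seconds."""

    def __init__(self, threshold=5, window=300, alert=print):
        self.threshold = threshold
        self.window = window
        self.alert = alert
        self.events = deque()  # timestamps of recent errors

    def record_error(self, now=None):
        now = time.time() if now is None else now
        self.events.append(now)
        # Drop errors that have aged out of the sliding window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) > self.threshold:
            self.alert(f"{len(self.events)} errors in the last {self.window}s")
```

In practice the `alert` hook would page us or, as with some of our earlier custom monitors, trigger an automatic fix.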
And of course we’ll keep monitoring the servers and the community to make sure we’ve completely fixed the problem. I’m pretty confident we have but of course I was pretty confident with the auto-scaling system when I’d finished testing it last night, so that bit of extra vigilance is always a good idea.

Thank you so much to those members of our community who reported seeing the errors; your help really is appreciated. I’d also like to offer my sincere apologies for not catching the bug last night that led to the issue this morning, and to anyone who saw and was inconvenienced by the error. I hope this post makes the cause of this unusual issue clear, and that you can be reassured by how quickly we responded and by our commitment to fixing it as soon as we were aware of it.

And finally, thank you so much for bearing with us when we have to fix things.

Outage at 04:50 this morning

Hello — we had an outage this morning for around 10-15 minutes.

This was caused by one of our routers running out of memory and was fixed pretty much as soon as the alerts came in.

We’re really sorry to anyone who was inconvenienced, and thank you for bearing with us whilst we sorted the issue.

Thanks again,


Report on outage from 9pm today

Hello — just to let you know that the issues we saw earlier were related to the image service.

Initially only image uploads were failing, but image serving subsequently began to struggle, so we decided to put the site into maintenance mode whilst we fixed the issue.

We fixed the issues with the image service within around 20 minutes and are now going through the various servers and services to make sure everything is behaving as it should be.

Thanks once again for your patience whilst we dealt with this and our sincere apologies for any inconvenience caused by the outage.



Folksy having some issues around 9pm

Hi — just to let you know that we’re investigating the outage we’re currently experiencing.

We hope to have the site back up as soon as possible.

In the meantime thanks for your patience and for bearing with us.


Facebook sign in being updated

Hi — we’re currently updating our Facebook sign-in component. Hopefully this will not take very long at all.

In the meantime, we’ve posted a note on the sign-in page explaining that we’re currently working on the Facebook sign-in component and telling users how they can still log in by resetting their password (even if they’ve only ever logged in through Facebook before).

We’re working to get the component up and running and back in place as quickly as possible.



Maintenance mode at 01:30 am

Hi — we’ve a few updates to push tonight/this morning at 1:30.

They will require us to go into maintenance mode, hopefully only for around 10 minutes.

Our apologies for any inconvenience caused; we shall be seeking to limit the downtime as much as possible.

Thanks ever so much for bearing with us whilst we push this vital work,


UPDATE: 01:41

Sorry, it looks like some of the database updates are taking a little longer to run, so it could be up to half an hour until we’re back up. Thanks for your patience!

Some Folksy monthly bills generated a week late

Hello — my apologies to those Folksy users who received their bill for last month 7 days late.

The process that generates the bills and notifications got into trouble and fell over.

Normally I would have spotted this but I was away for a few days. I still *should* have spotted it, though, and am really sorry I didn’t.

Thank you so much to those sellers who brought the issue to our attention! We really do have an amazing community here, and things like this remind us that we’re not just *providing* a service but collaborating with you all in it. Teamwork like that really is appreciated — thank you!

And my apologies again to those users who had to wait a week for their bill; I really hope it hasn’t inconvenienced anybody too much.



Typeform data breach, Folksy data not affected

We have been notified that Typeform – a company we use to send out surveys, as well as some competitions and offers – had a data breach on Friday 29 June.

Folksy.com has not been affected and any personal details stored on Folksy are safe.

Four of our Typeform documents were affected. Typeform has informed us that an external attacker managed to get unauthorised access to respondent data on those four forms and downloaded it. The good news is that Typeform responded immediately and fixed the source of the breach to prevent any further intrusion.

This affected a small number of people who had entered data into those Typeform forms. We have contacted each of those people by email.

If you are one of the people we’ve emailed about this issue:

You do not need to do anything. But we do recommend that you watch out for potential phishing scams and spam emails.

What we’re doing

Typeform have assured us they have identified the source of the breach, fixed that security vulnerability, and initiated a comprehensive review of their security, with significant measures to prevent this type of situation from happening again. In future, to reduce the chance of similar incidents, we will remove all survey data within two months of any survey.
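As a sketch of that retention rule — the field names and exact window are illustrative, not our actual schema — a periodic purge of survey responses older than roughly two months could look like:

```python
from datetime import datetime, timedelta

# "Within two months" -- 62 days covers the longest pair of months.
RETENTION = timedelta(days=62)

def purge_expired(responses, now=None):
    """Keep only survey responses still inside the retention window.

    `responses` is a list of (submitted_at, data) tuples; in a real
    system this would be a delete against the survey-data store.
    """
    now = now or datetime.utcnow()
    return [(ts, data) for ts, data in responses if now - ts <= RETENTION]
```

Run on a schedule, a purge like this means a future breach of the survey tool could only ever expose a couple of months of responses.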

If this affected you at all, we are sorry. Please do contact our support team if you have any questions.

Monthly fee notifications going out twice

Hi — our monthly fees generation task ran twice concurrently this month, sorry. This happened because the task was moved to a new, dedicated machine but also ran from the old location.

This means that some people will have received two notification emails, with at least one person reporting that the second email had no fees data in it.

It also meant that some people who paid their fees were seeing them as still unpaid in their dashboard for a brief period early this morning.

Nobody paid twice — that’s not possible.

We’ve now fixed the issue and everything is back to normal: those who paid can see that they have paid successfully, and those who haven’t yet paid can do so normally.
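One standard guard against this class of double run is to make the job idempotent per billing period: each machine must claim the period in a store shared by all of them before generating anything. A rough sketch — the table and period naming are hypothetical, and SQLite stands in for whatever shared database the machines would really use:

```python
import sqlite3

def claim_billing_run(db, period):
    """Try to claim `period` (e.g. "2018-07") in a shared store.

    The PRIMARY KEY on `period` means exactly one claimant can insert
    it; every other machine gets an IntegrityError and skips. Moving
    the task to a new machine then can't cause a double run, even if
    the old schedule still fires.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS billing_runs (period TEXT PRIMARY KEY)"
    )
    try:
        with db:
            db.execute("INSERT INTO billing_runs (period) VALUES (?)", (period,))
        return True   # we own this run
    except sqlite3.IntegrityError:
        return False  # another machine already claimed it
```

The task would call this first and exit immediately when it returns False.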

We’re really sorry for any confusion the extra email will have caused anybody affected, and for any alarm caused when paid bills were showing as unpaid in the seller dashboard.

I hope the fact that we were able to fix things quickly helps.

Thanks so much to those sellers who let us know about the issue so quickly, and thank you everyone for your patience whilst we sorted it,


Issues with shop statistics

Hi — we’re having an issue where shop statistics are not updating until later in the day, so people checking yesterday’s stats in the morning are seeing zeroes.

This is due to an issue we’re currently investigating, where the tasks that collate the data for the shop stats pages are falling over.

The delay happens because we have to launch the task manually during office hours (it’s quite a long-running task), once we’ve confirmed that it didn’t run the night before.
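That manual "did it run last night?" check is itself easy to script; a rough sketch, assuming the task touches a marker file on success (the path and 24-hour interval are illustrative):

```python
import os
from datetime import datetime, timedelta

def ran_overnight(marker_path, now=None):
    """True if the stats-collation task updated its marker file
    within the last 24 hours.

    The task would touch `marker_path` on successful completion;
    a morning check (or cron job) can then decide whether the
    long-running task needs to be launched by hand.
    """
    now = now or datetime.now()
    try:
        mtime = datetime.fromtimestamp(os.path.getmtime(marker_path))
    except OSError:
        return False  # marker missing: the task never completed
    return now - mtime < timedelta(hours=24)
```

Wiring this into an alert would at least remove the "once we've confirmed" step while we hunt down the underlying bug.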

As I said, investigation is ongoing — it’s proving to be a subtle bug!

In the meantime, we’re sorry about the delay in your stats updating. Please do be assured that we’re doing our level best to resolve the issue as expeditiously as we can.