Fail whale solution for Twitter #failwhale @twitter
Many years ago, while I was working for a large ISP, we had occasional outages caused by buggy software or by running over capacity. Mail was the biggest problem, with many people, even on the dial-up network, POP-ing their mail every 5 minutes. We had rolled our own mail server and were constantly fixing our infrastructure to give our customers the highest quality. Yet we still had the occasional outage, which flooded our helpdesk with calls from people whose mail client had popped up an error message.
To solve this we implemented two systems. The first was dynamic rate limiting: depending on capacity, if you POP your mail more often than every X minutes, we simply tell you that you have no mail on some of those requests. Previously we would send an error message telling the user that their mail client was configured to POP mail more often than recommended; obviously the helpdesk was less than happy about this.
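A minimal sketch of how such a limiter could look, in Python. This is not the code we ran at the ISP; the class name, the `load` figure (0.0 to 1.0) and the way the interval stretches with load are all assumptions made up for illustration.

```python
import time


class DynamicPollLimiter:
    """Decides whether a POP request gets a real mailbox check or a quiet 'no mail'."""

    def __init__(self, base_interval=300.0, clock=time.monotonic):
        # base_interval: seconds between real checks at zero load (assumed value).
        self.base_interval = base_interval
        self.clock = clock
        self.last_real_check = {}  # user -> time of the last real mailbox check

    def min_interval(self, load):
        # Hypothetical policy: stretch the interval up to 4x as load goes from 0.0 to 1.0.
        load = min(max(load, 0.0), 1.0)
        return self.base_interval * (1.0 + 3.0 * load)

    def should_check_mailbox(self, user, load):
        now = self.clock()
        last = self.last_real_check.get(user)
        if last is not None and now - last < self.min_interval(load):
            # Too soon: the caller answers "no new mail" instead of an error.
            return False
        self.last_real_check[user] = now
        return True
```

The point of the design is that a throttled request never produces an error the mail client would show to the user; it just looks like an empty mailbox.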
Secondly, we created what we called “Lying POP3”. It simply tells the user they have no mail while we are experiencing an outage. This meant that during temporary outages, which might last from 5 to 30 minutes, we didn’t need a status page telling people that we were down: their experience was not that we were down. Obviously we had the advantage that the majority of our users weren’t tech savvy and that mail wasn’t perceived as an instant technology. And in the case of real outages or upgrades, we obviously updated our users and helpdesk within this buffer or ahead of time.
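Roughly, the idea looks like this in Python. It is only a sketch: `fetch_mailbox_stats` is a hypothetical callable standing in for the real mail store, and the only protocol detail used is that a POP3 `STAT` reply of `+OK 0 0` means zero messages and zero bytes.

```python
def stat_response(fetch_mailbox_stats):
    """Build the reply to a POP3 STAT command, lying if the backend is down."""
    try:
        # fetch_mailbox_stats is a hypothetical callable returning (count, octets).
        count, octets = fetch_mailbox_stats()
    except Exception:
        # Backend unavailable: report an empty mailbox instead of "-ERR",
        # so the mail client shows no error pop-up.
        return "+OK 0 0\r\n"
    return f"+OK {count} {octets}\r\n"
```

The mail is still there when the backend comes back, so the user simply gets it on a later poll.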
So what am I trying to tell Twitter?
Experience is the key to service! You are down when I perceive you as down, you are up when I perceive you as up.
So how would I do it?
I’d do exactly the same for the API as I did with mail: dynamically rate limit, and lie for a short period of time about tweets from the twips a user follows. For tweets a user sends, I would have a FIFO queue that takes all the tweets sent and processes them once processing is possible again; this would naturally have the same rate limiting as normal. For the website I would use a cached copy of one page of tweets, preferably with a failwhale notice at the top so people know there is an outage and not to expect updates, but their experience is that they can still see tweets.
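To make the queue and the cached page concrete, here is a small Python sketch. Nothing in it is Twitter's actual API: `send` and `fetch` are hypothetical callables that raise `ConnectionError` during an outage, and the rate limiting on the queue is left out to keep it short.

```python
from collections import deque


class TweetOutbox:
    """FIFO queue that holds outgoing tweets until the backend accepts them."""

    def __init__(self, send):
        self.send = send        # hypothetical: delivers one tweet, raises ConnectionError when down
        self.pending = deque()

    def submit(self, tweet):
        # Always accept the tweet, then try to drain the queue in order.
        self.pending.append(tweet)
        return self.flush()

    def flush(self):
        delivered = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                break  # still down: keep the remaining tweets, in order
            self.pending.popleft()
            delivered += 1
        return delivered


class CachedTimeline:
    """Serves the last good page of tweets when a fresh one cannot be fetched."""

    def __init__(self, fetch):
        self.fetch = fetch      # hypothetical: returns one page of tweets, raises ConnectionError when down
        self.cached_page = []

    def read(self):
        # Returns (tweets, is_stale); is_stale is the cue to show the failwhale notice.
        try:
            self.cached_page = list(self.fetch())
            return self.cached_page, False
        except ConnectionError:
            return self.cached_page, True
```

From the user's side a tweet is always accepted and a page of tweets is always shown; the only visible sign of trouble is the notice at the top.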
In the terms of the book “Authenticity”, my experience of Twitter is:
“Fake-fake: is not true to itself; is not what it says it is“