Did you feel it? Amazon Web Services was down yesterday, for something like four hours, from 1 PM ET to 5 PM ET or so. I wasn't using any cloud solutions as my primary tools at the time, but I did feel occasional hiccups in email and other web services that were peripherally affected. Bottom line: I was able to work, though not as smoothly as usual. People relying on an Amazon Web Services-based app were dead in the water.
And that's a crucial point to consider as we move more and more of what we do onto someone else's hardware. I don't know what caused Amazon's outage, but fixing it was clearly out of our control. All we can do in this circumstance is wait. There are all sorts of existential arguments one can make about slowing down, tidying one's office, talking to other people and not being heads-down online … but the problem is that deadlines loom, stress levels rise and we're just … frustrated.
Autodesk was one of many vendors that tried to put the best possible spin on the situation:
Earlier today, many of Autodesk’s cloud services were unavailable for several hours because of a major outage with one of our web service providers. While we are happy to report that all our cloud services as well as those from many other impacted companies are once again fully functional, we sincerely regret the significant inconvenience and frustration that resulted from this incident. We recognize the trust that our customers place in Autodesk to deliver reliable and dependable products, and are working with our web service providers to prevent similar incidents in the future. We thank you for your patience and encourage any customers with concerns to contact Autodesk Customer Support.
I’m inferring that Autodesk was also frustrated, since customers will blame it for the outage even though it, too, couldn’t do much about it.
My point: a four-hour outage isn't, in the greater scheme of things, a terrible delay on an engineering project. It is, though, if you're simulating urgent heart surgery or trying to figure out whether a jetliner is safe to put back in the air for a 6 PM flight. And it's a problem if you lose what you were working on just before the outage started.
As you think through your computing strategy, make sure you factor in outages like this, rare though they are. Can you work locally if you have to? Do you have to, or is getting coffee and waiting it out a reasonable alternative for you? Do the savings from outsourcing that hardware outweigh the uncertainty? Yes, outages like this don't happen often, but no one can guarantee that they won't ever happen.
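For teams that build on cloud services, the "can you work locally" question can be baked into the software itself. Here is a minimal sketch of the idea: try the cloud first and, if the call fails, save the work locally so nothing is lost during an outage. The `cloud_put` parameter stands in for a hypothetical uploader (a thin wrapper around whatever cloud API you use); it is an illustration of the fallback pattern, not any vendor's actual API.

```python
import json
import tempfile
from pathlib import Path

def save_with_fallback(key: str, data: dict, cloud_put, local_dir: Path) -> str:
    """Try the cloud first; on any failure, write a local copy so no work is lost.

    Returns "cloud" or "local" to indicate where the data ended up.
    """
    try:
        cloud_put(key, data)  # hypothetical uploader supplied by the caller
        return "cloud"
    except Exception:
        # Outage, timeout, or DNS failure: keep a local copy to sync later.
        local_dir.mkdir(parents=True, exist_ok=True)
        (local_dir / f"{key}.json").write_text(json.dumps(data))
        return "local"

def failing_put(key, data):
    raise ConnectionError("service unavailable")  # simulates an outage

where = save_with_fallback("report-42", {"status": "draft"},
                           failing_put, Path(tempfile.mkdtemp()))
print(where)  # prints "local"
```

A real version would also need a sync step to push the local copies back to the cloud once service returns, but even this much means an outage costs you availability, not data.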
Bottom line: Do you trust Amazon's IT wizards to get things back up as quickly as possible? I do, since this affects their customers' perception of the service, and their paychecks and stock valuation, far more than it does mine. But if I were doing something truly time-sensitive, I'd want to babysit it all the way, on hardware under my control.
*Update: A TechCrunch article says that only one Amazon S3 data center was affected while 13 others remained up. It also indicates that the last similar problem occurred in August 2015. It sounds like the writers really had to search to find more examples, reinforcing the point that this is truly rare and that uptime is actually quite solid. It also has suggestions about redundancy. Worth a read.