The recent Amazon outage has caused a great deal of outrage among its clients, who argue that enough is enough with the downtime and data losses that seem to occur every six to eight months in the us-east region. While the loss of data is never a good thing, I don't think this is Amazon's fault. Blame lies squarely with the clients who are unable or unwilling to take the necessary precautions to ensure their systems can remain operational when an entire geographic region is under siege from Mother Nature herself. Amazon's cloud offerings will only allow for minimal downtime if they are configured correctly and, from what I've seen, most companies are using them wrong.
In the middle of last year I started managing my employer's Amazon Web Services servers. At first there was quite a learning curve, as a lot of what I was reading in books and online went from being just theory to practical application. But within a few weeks I was comfortable enough with the ins and outs of Amazon's "cloud" systems to begin analyzing the system's performance and eking out every bit of value from the virtual infrastructure. Later in the year, Amazon rolled out a "Free Tier" for new customers. I signed up and moved a lot of my websites over to AWS in the hope of lowering my annual operational cost. Unfortunately, I actually spent more money and had far greater downtime on AWS than anywhere else over the last five years. But why is this?
It's no secret that I've been sorely disappointed in the I/O performance found on Amazon's Elastic Compute Cloud (EC2) platform but, as this is the key technology that we use at work, it's my responsibility to eke out every last ounce of performance from the systems. One of my biggest issues has been the relatively poor performance of EC2 instances running as dedicated MySQL servers. The reason for using dedicated EC2 instances for MySQL rather than RDS instances is a topic for another day, so let's examine a usage scenario that I see becoming a reality in the next six months.
Over the last few days this site has been seeing far more comments posted, but not much difference in the number of people coming to the site. After examining some of the keywords people are using to get here, I'm starting to see why. However, one particular search made my day...
Over the last twelve months, how often have we seen cloud services go down? Amazon's AWS problems in its US East data center managed to take tens of thousands of services and websites offline. Salesforce.com has seen issues at least once a quarter. Microsoft's Office 365 has recently had problems. As the Internet moves from IPv4 to IPv6 we'll likely see a number of other cloud providers go down, even if only for a few hours. Anyone who's worked in IT is likely painfully aware of how often our servers have gone down, be they local or otherwise, and how difficult it can be to keep them running for years at a time. If websites and services are supposed to be more failure resistant by being hosted in the ephemeral place that is the cloud, how can organizations look reliable when their offerings completely fall off the planet when a cloud provider fails?