The recent Amazon outage has caused a great deal of outrage among its clients, who argue that enough is enough with the downtime and data losses that seem to occur every six to eight months in the US-East region. While the loss of data is never a good thing, I don't think this is Amazon's fault. Blame lies squarely on the clients who are unable or unwilling to take the necessary precautions to ensure their systems can remain operational when an entire geographic region is under siege from Mother Nature herself. Amazon's cloud offerings will only allow for minimal downtime if they are configured correctly and, from what I've seen, most companies are using them wrong.
Amazon Is a Cloud Provider, Not a VPS
This is what I feel most people[1] misunderstand about Amazon. A cloud provider is supposed to provide services where the data is "somewhere" and, if one segment of the system fails, another segment picks up the slack and takes over. This ensures a minimal amount of downtime. A setup like this is very, very difficult to build. It's also very costly. Companies that want 0% downtime and 0% data loss will be forced to replicate their systems across geographic regions and implement very complicated distributed DNS entries and proxies. For most organizations, 15 to 20 minutes of downtime, while not great, is acceptable[2] if an entire geographic region has failed and systems somewhere else on the planet need to be brought online.
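To make the "another segment picks up the slack" idea concrete, here is a minimal sketch of the decision a failover layer has to make: walk a priority-ordered list of regions and send traffic to the first one that passes a health check. The Region class, the endpoint URLs, and both function names are made up for illustration. A real deployment would push the result into DNS or a proxy rather than just returning it.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Region:
    name: str      # e.g. "us-east-1"
    endpoint: str  # hypothetical health-check URL for the stack in that region


def http_health_check(region: Region, timeout: float = 3.0) -> bool:
    """Treat a region as healthy only if its endpoint answers HTTP 200 in time."""
    try:
        with urllib.request.urlopen(region.endpoint, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP error, or a bad URL.
        return False


def pick_active_region(
    regions: Sequence[Region],
    is_healthy: Callable[[Region], bool] = http_health_check,
) -> Optional[Region]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in regions:
        if is_healthy(region):
            return region
    return None
```

Passing the health check in as a parameter keeps the selection logic testable without a network. The 15 to 20 minutes of downtime mentioned above would come from everything this sketch leaves out: noticing the failure, bringing the standby systems online, and waiting for DNS changes to propagate.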
That said, it seems that many people use Amazon only as a Virtual Private Server provider ... a role for which Amazon may not be the most cost-effective answer. When servers and databases go down, popular web services disappear. Data is lost. Backups are old. In some cases, backups were only stored on ephemeral storage and lost when the virtual servers lost power. It's an absolute embarrassment, but we see stories like this with every Amazon outage. Which brings me to my last gripe ...
Why the hell are big companies relying solely on Amazon's US-East region? This region has been the most unstable, least reliable region for over five years. Every six months we hear about outages, data loss, network issues, or something else that gets in the way. People who insist on putting their entire virtual infrastructure in this region without considering having backup AMIs of live servers in other regions deserve the downtime. That's just the way it is.
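Keeping a backup AMI outside the primary region does not take much code. The sketch below assumes the boto3 library and its EC2 copy_image call, which is invoked on a client for the destination region. The function names, the image ID, and the region names are placeholders, and credentials, tagging, and waiting for the copy to finish are all left out.

```python
def make_ec2_client(region_name):
    """Build an EC2 client for one region (assumes boto3 is installed and configured)."""
    import boto3  # imported lazily so the rest of the sketch has no hard dependency

    return boto3.client("ec2", region_name=region_name)


def copy_ami_to_region(dest_ec2_client, source_image_id, source_region, name):
    """Ask the destination region to pull a copy of an AMI from the source region.

    Returns the ID of the new image in the destination region. The copy runs
    asynchronously, so the image is not usable the moment this returns.
    """
    response = dest_ec2_client.copy_image(
        Name=name,
        SourceImageId=source_image_id,
        SourceRegion=source_region,
    )
    return response["ImageId"]


# Hypothetical usage: keep a copy of a US-East image in US-West.
#   west = make_ec2_client("us-west-2")
#   backup_id = copy_ami_to_region(west, "ami-0123456789abcdef0", "us-east-1", "web-backup")
```

Run on a schedule, something like this means a region-wide failure costs the time it takes to launch from the copy, not the time it takes to rebuild servers from scratch.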
Nothing humans make is 100% reliable. Offloading the most difficult aspects of our network architecture to a third party with the expectation that they'll take care of everything is not only unrealistic, it's immature. It's irresponsible.
Being upset and frustrated with the failures of US-East is understandable but, when no 'Plan B' has been organized, it's not really Amazon's fault when a huge lightning storm rolls through a region and knocks out a number of their services.