What IT Decision-Makers Should Learn from the AWS Outage

An AWS outage shut down thousands of businesses on the east coast of the United States. We show how companies can arm themselves against such failures. […]

The API problems that Amazon Web Services (AWS) experienced in its US-East-1 region at the beginning of December showed many Americans and Canadians dramatically how much everyone depends on AWS. Even consumers who had never heard of AWS were suddenly affected: Disney+ and Netflix no longer worked, the Roomba robot vacuum cleaner stopped working, and the smart lamp simply stayed dark.

It hit companies that rely on AWS for their IT operations even harder. The same goes for companies that had to realize that, although they have no business relationship with AWS themselves, many of the services they use (such as Trello, Smartsheet and Slack) no longer worked because those services are built on AWS.

But what lessons should we learn from the incident? At home, the answer is still simple: we should stop relying on so many IoT devices. Do dishwashers, Christmas lights, refrigerators and toothbrushes really have to depend on the cloud? The situation is different in business. The idea that the IT department will go back to running all its servers itself will remain wishful thinking. A simple comparison between then and now shows how absurd this idea is.

And no matter what C-level management wants, IT can’t make something work that is out of its control. After all, there is a reason why much or all of IT now runs in the cloud: as a rule, it costs less than operating the corresponding services on-premises. Moreover, downtime in the cloud has so far certainly been lower than that of in-house IT.

But what can be done to prevent problems such as the AWS failure? Could switching to a multi-cloud configuration be the solution? In theory, maybe, but it would require at least two public cloud providers and possibly a dedicated data center. That would be very, very expensive. In addition, multi-cloud will not work as a safety net against failures such as the one at AWS. Lydia Leong, Gartner Distinguished VP Analyst, is convinced of this.

As she puts it: “Multi-cloud failover requires full portability between two providers. This is a huge burden for the developers. The basic compute runtime (whether VMs or containers) is not the problem, so ‘I can move my containers’ solutions from OpenShift, Anthos or others do not really help.” According to Leong, the problem is all the differentiators: the different network architectures and functions, the different storage capabilities, the proprietary PaaS functions, the completely different security functions, and so on.

But enough lamenting. Leong is convinced that it is quite possible to keep a company running even if the primary cloud fails. She offers two tips:

Companies should run their active applications in at least two, preferably three availability zones (AZs) in each region they use. Admittedly, three AZs are much harder to achieve than two, but this is still easier than trying to build a multi-cloud failover solution.
Furthermore, active applications should run in at least two, preferably three regions. Again, two is much easier than three, but if an application is truly mission-critical, it may be worth the effort. If this is not feasible, fast and fully automated regional failover could be an option, provided that the company is willing to pay for it.
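
To make the two tips more concrete, here is a minimal Python sketch of the decision logic behind an automated regional failover. It is not taken from Leong or from AWS: the `Region` class, the `pick_active_region` function, the region and zone names, and the health probes are all hypothetical and only illustrate the idea. A real setup would plug in actual health checks and switch traffic through DNS or a load balancer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Region:
    """One deployment region with a health probe per availability zone."""
    name: str
    # Maps a (hypothetical) AZ name to a callable that returns True when healthy.
    zone_probes: Dict[str, Callable[[], bool]]

    def healthy_zones(self) -> List[str]:
        """Return the names of the zones whose probe currently succeeds."""
        healthy = []
        for zone, probe in self.zone_probes.items():
            try:
                if probe():
                    healthy.append(zone)
            except Exception:
                # A probe that raises is treated as an unhealthy zone.
                pass
        return healthy


def pick_active_region(regions: List[Region],
                       min_healthy_zones: int = 1) -> Optional[Region]:
    """Return the first region, in priority order, with enough healthy zones.

    Returns None if no region qualifies.
    """
    for region in regions:
        if len(region.healthy_zones()) >= min_healthy_zones:
            return region
    return None


if __name__ == "__main__":
    def up() -> bool:
        return True

    def down() -> bool:
        return False

    # Assumed scenario: the whole primary region is down, the secondary is partly up.
    primary = Region("region-a", {"a1": down, "a2": down, "a3": down})
    secondary = Region("region-b", {"b1": up, "b2": down, "b3": up})

    active = pick_active_region([primary, secondary])
    print(active.name if active else "no healthy region")
```

In this assumed scenario the sketch falls back to the secondary region because two of its three zones still respond. Raising `min_healthy_zones` would make the check stricter, at the cost of failing over (or giving up) sooner.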

*Steven writes for our US sister publication Computerworld. He was already covering business and technology when 300 bps was still considered high speed.