Catch all(…) For high-availability in the cloud

Prerequisite: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html

In this post I would like to expose some of the pain points that you may be experiencing as you transition your Microsoft SQL based applications into the cloud. (If I didn’t cover them all, I would certainly like to hear more!)

You may already know that SIOS Technology Corp. recently announced DataKeeper Cloud Edition, which enables native Microsoft SQL Failover Clustering in an EC2 environment within a single Virtual Private Cloud (VPC) and between Availability Zones (AZs).

Is it enough?

Well… it all depends. Let’s go through the logical chain. If you deploy your cluster in a single Availability Zone, it will provide some (I would say false) sense of security, since the only thing you gain is a saving of $30 per month on the transfer fees between AZs, and even that saving becomes questionable in case of a disaster. Think of an AZ as a datacenter: why stay within one when Amazon AWS provides all the instruments to deploy your solution across geographies? So the next step is a cluster across AZs. I would say that’s a pretty strong solution, considering that the likelihood of both locations going down is much lower. However, we know of occurrences (now almost once a year) when a disaster affects an entire side of the country.

What is the next step in high-availability in the cloud?

Well… it becomes inevitable that Regions look very attractive. But why?

At times HA is not the most critical component of a deployed solution; that depends on the company’s education in the HA space, the solution requirements, its size, and therefore the budget associated with it. However, Mother Nature and our surrounding environment in general give us clues that things are not going to be as stable as they used to be. While the stability of technology (hardware and software) certainly improves, environmental conditions are going through significant changes that introduce disasters, and from what I can tell things will only get worse before (if) they get better.

Hurricane Sandy was a powerful tropical cyclone that devastated portions of the Caribbean and the Mid-Atlantic and Northeastern US. It was definitely an eye opener for many organizations that lost services for quite some time, until power and other conditions returned to normal. Surprisingly, even very well established organizations did not have true HA in place, and their applications and services (even trading) were interrupted for 2-3 days. As a result, many of them, realizing that such things will happen again, decided to turn to cloud providers like Amazon to take advantage of their services. While some may still be exploring whether to go from private to public, many are transitioning full application services to EC2. It certainly seems like a good idea, and it can potentially not only provide HA but also save cost. But how do you achieve true HA in EC2?

The answer is to deploy the solution with HA across geographic regions. Cross-Region HA protects the services deployed in a cloud environment from disasters that typically affect a particular continent or a portion of one. HA between AZs is a good start. However, EC2 as well as other cloud providers have made deployment across geographic regions very easy to achieve, although the cost of such a solution will certainly be higher compared to HA across AZs. This is where I believe most are still a bit stunned and confused, trying to take it a step at a time: starting with AZs and expanding from there.

So what is the fundamental difference between AZs and Regions?

When it comes to HA, I would say it is network bandwidth. When the solution is deployed within a single VPC across two AZs, traffic never hits the internet, so no VPN is required, and the connection between AZs is typically sufficient (Amazon refers to it as “low latency”, and you go figure out what that is exactly ;) ). Once you go across Regions, the bandwidth itself is questionable (it’s there, but what is it?), and for security purposes a VPN would be recommended as well (overhead on top of overhead). Gladly, there are solutions like DataKeeper Cloud Edition that support asynchronous replication, but my question is at an even higher level.
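To put a rough number on why the link between sites matters, here is a back-of-the-envelope sketch in Python. It assumes a simplified model that is mine and not taken from any product documentation: a single writer that waits for the remote site to acknowledge every commit, so each commit costs one local write plus one network round trip. The function name and the round-trip times are hypothetical, chosen only for illustration; measure your own links before drawing conclusions.

```python
def max_serial_commits_per_sec(rtt_ms: float, local_write_ms: float = 1.0) -> float:
    """Upper bound on commits per second for ONE serial writer when every
    commit waits for a remote acknowledgement (simplified model)."""
    if rtt_ms < 0 or local_write_ms <= 0:
        raise ValueError("rtt_ms must be >= 0 and local_write_ms must be > 0")
    # Each commit takes one local write plus one network round trip.
    return 1000.0 / (local_write_ms + rtt_ms)


if __name__ == "__main__":
    # Hypothetical round-trip times, NOT measured or published values.
    for label, rtt_ms in [("between AZs", 2.0), ("between Regions", 80.0)]:
        rate = max_serial_commits_per_sec(rtt_ms)
        print(f"{label}: at most {rate:.0f} commits/sec per serial writer")
```

Under this model the ceiling drops quickly as the round trip grows, which is one reason asynchronous replication becomes attractive once you leave the Region.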

What should you protect in traditionally decoupled application and back-end (and more) stacks when you go across Regions? Infrastructure, application, back-end, all of it? And how? The answer is ALL, and synchronously, so that the pieces of the stack move together.

Cross Regional Application Resource Chaos

  1. An HA solution between AZs is OK. Why? Because if a software or hardware failure in one AZ causes only some of the application’s resources to fail over, the low latency connection between the AZs will ensure that, even with resources now located in different AZs, the connection is sufficient to provide a smooth application experience. However, if such a split of resources occurs in a cross-Region HA setup, the user experience is likely to degrade significantly.
  2. As more components (VPN, AD, etc.) are added to the configuration, there is a higher potential for things to get out of sync. The VPN instance (once added) becomes a single point of failure. To address this gap, it is recommended to deploy the VPN instance in two (or more) AZs, with a VPN “observer” monitoring the VPN instance and failing the resource over to a different AZ in case of failure (observers cannot cross Regions, and therefore an observer cannot “watch” instances between Regions). In summary, the solution will have 2 AZs in a single VPC for each Region, which once again brings us back to point (1): coordination of the resource fail-over is very important.
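The coordination rule from the two points above can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not how DataKeeper or any AWS service actually works: locations are hypothetical "region/az" labels, and `Resource`, `region_of`, and `plan_failover` are names made up for this example. A fail-over inside a Region moves only what was in the failed AZ, while a fail-over to another Region moves the whole stack so it stays together.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    location: str  # hypothetical label in the form "region/az"


def region_of(location: str) -> str:
    """Return the region part of a "region/az" label."""
    return location.split("/")[0]


def plan_failover(resources, failed_location, standby_location):
    """Return {resource name: new location} for a single fail-over event.

    Same Region: relocate only the resources that lived in the failed AZ,
    since the link between AZs can carry a split stack.
    Different Region: relocate every resource so the stack is never split
    across Regions.
    """
    same_region = region_of(failed_location) == region_of(standby_location)
    plan = {}
    for res in resources:
        if res.location == standby_location:
            continue  # already where it needs to be
        if res.location == failed_location or not same_region:
            plan[res.name] = standby_location
    return plan


if __name__ == "__main__":
    stack = [
        Resource("vpn", "region-a/az-1"),
        Resource("app", "region-a/az-1"),
        Resource("sql", "region-a/az-2"),
    ]
    print(plan_failover(stack, "region-a/az-1", "region-a/az-2"))
    print(plan_failover(stack, "region-a/az-1", "region-b/az-1"))
```

The design choice here is deliberately blunt: treating the stack as one unit for cross-Region moves trades a longer fail-over for a predictable user experience afterwards.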

What else is there?

Besides the Cross Regional Application Resource Chaos (mentioned above), configuring the instances, infrastructure component resources, and connectivity is not a trivial task. Automation of the deployment would be nice!

In my opinion it calls for a solution that provides automated deployment AND HA of the entire deployed environment, i.e. infrastructure (like VPN), application, and back-end.

How do you solve it today? Drop me a line.
