Switching from one cloud provider to another is like opening Pandora’s box of infrastructure challenges, egress costs, and pricing commitments.
If you’re using multiple cloud providers, sooner or later you’ll realize the value of building an open-source, portable infrastructure you can smoothly move to another vendor.
I wrote a guide to help you navigate this part of the cloud landscape:
- Should I switch to another cloud provider at all?
- Protect your business against occasional SLA breaches
- And if it’s still not working, switch to another cloud service provider
Should I switch to another cloud provider at all?
If you don’t have a good reason to switch – you probably shouldn’t. Here’s how to tell if your cloud provider is living up to the promised service level.
Pay attention to downtimes
Even the most prominent vendors like AWS or Azure experience outages once in a while. But if you spot several significant ones on the public status pages, it suggests that the provider is overlooking some aspect of its infrastructure.
Get concerned if you see major incidents that take down large portions of regional or global services and, with them, customer applications. These are usually hard to miss because they get reported all over the internet.
Note all the API errors
Some providers struggle to keep up with their own scale. One side effect is a rise in IaaS control plane API errors (HTTP 5xx responses). Another warning sign is growing control plane API latency: calls still succeed, but each operation takes longer to complete. Reliable cloud providers should be able to give you historical latency numbers.
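If you record your own control plane calls, both signals are easy to compute. The snippet below is a minimal sketch, assuming you log each call's HTTP status and latency yourself; the `ApiCall` record and both function names are hypothetical and not part of any provider's SDK.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class ApiCall:
    """One recorded control plane API call (hypothetical record shape)."""
    status: int        # HTTP status code returned by the provider
    latency_ms: float  # wall-clock duration of the call in milliseconds


def error_rate_5xx(calls: list[ApiCall]) -> float:
    """Return the fraction of calls that came back with a 5xx status."""
    if not calls:
        return 0.0
    failures = sum(1 for c in calls if 500 <= c.status <= 599)
    return failures / len(calls)


def p95_latency_ms(calls: list[ApiCall]) -> float:
    """Return the 95th percentile latency (needs at least two samples)."""
    latencies = [c.latency_ms for c in calls]
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    # "inclusive" keeps the result within the observed min and max.
    return quantiles(latencies, n=20, method="inclusive")[-1]


if __name__ == "__main__":
    sample = [ApiCall(200, 100.0)] * 9 + [ApiCall(503, 900.0)]
    print(f"5xx rate: {error_rate_5xx(sample):.1%}")
    print(f"p95 latency: {p95_latency_ms(sample):.0f} ms")
```

Tracking these two numbers week over week gives you your own baseline to compare against whatever the provider publishes.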
Check vendor capacity
If you see “out of capacity” errors for compute and other resources, it’s a sign that specific regions are under-resourced. Now, this isn’t such a major problem under regular circumstances. But in busier seasons (like Q4 for e-commerce businesses), it may lead to deeper resource crunches.
Protect your business against occasional SLA breaches
First things first, you need solid monitoring in place. SLA issues often fly under the radar, especially if teams don’t actively monitor status feeds or record cloud API error rates independently.
So the first step to understanding whether the vendor is doing a good job is having the means for discovering SLA problems.
You need to have a fine-tuned monitoring and alerting process for your own components, as well as upstream dependencies of your cloud provider.
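To make "discovering SLA problems" concrete, you can compare the downtime you measured against what the SLA allows. This is a rough sketch under stated assumptions: the 99.95% target and the 30-day month are illustrative values, not any specific vendor's terms, and the function names are made up for this example.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Measured availability over a period, as a percentage."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(total_minutes: float, sla_target_percent: float) -> float:
    """Downtime budget that a given SLA target leaves for the period."""
    return total_minutes * (1.0 - sla_target_percent / 100.0)


def breaches_sla(total_minutes: float, downtime_minutes: float,
                 sla_target_percent: float) -> bool:
    """True when measured availability falls below the SLA target."""
    return availability_percent(total_minutes, downtime_minutes) < sla_target_percent


if __name__ == "__main__":
    month = 30 * 24 * 60   # minutes in a 30-day month
    target = 99.95         # assumed SLA target, for illustration only
    print(f"budget: {allowed_downtime_minutes(month, target):.1f} min")
    print(f"breached after 30 min down: {breaches_sla(month, 30, target)}")
```

With these assumed numbers the monthly budget is about 21.6 minutes, so a single 30-minute outage already counts as a breach.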
Here’s an example of what you can do
Use multiple “availability zone” capabilities whenever they’re available and take advantage of cross-region backup, restore, and replication for cloud services if they offer it.
Most cloud providers allow KMS keys to be replicated across multiple regions. The same goes for database backups. Google Cloud Platform even offers a globally replicated database called Cloud Spanner to achieve high availability across many regions.
Once you’ve established that your cloud provider’s infrastructure has a persistent issue, it’s time to make a decision about high availability and Disaster Recovery.
Should you be going active-active across regions or other cloud providers? This is a fairly complex topic, and we find most customers aren’t thinking down this path yet.
Still, when a vendor breaches an SLA, it’s good practice on their part to offer customers billing credits automatically – without customers having to ask for them on their own.
And if it’s still not working, switch to another cloud service provider
Here are a few best practices to make the process less painful:
1. Go open source
Moving to true open-source standards sets you up for a smoother transition. If your team is tied into a proprietary database like DynamoDB, transitioning to another vendor will be extremely difficult.
Sure, a few vendors provide alternative solutions without requiring any application rewrite or change. But using open source is just smarter.
2. Use containers
Take advantage of containers for application deployment. They’re lightweight and easily transportable across cloud environments. Once it’s time to say goodbye, you won’t be forced to waste time applying countless changes.
Let’s say that your team has modernized its stack for containers. Now you need to choose a container orchestration technology. The clear winner here is Kubernetes, and every major cloud provider offers a managed Kubernetes service. If you start by adopting a proprietary solution like Amazon ECS, moving to GCP or Azure will be much harder.
3. Automate DevOps
An automated DevOps process that takes advantage of industry best practices for CI/CD (Continuous Integration and Continuous Delivery) is a must-have. Once you set it up, releasing code, and even infrastructure, becomes very straightforward. If you’re using Kubernetes, there are automation solutions on the market made specifically for it.
Embrace Infrastructure as Code (IaC) as well. Your team will be able to write down their infrastructure requirements and implement them using standard software techniques. Several vendors provide IaC options, the most popular one being Terraform by HashiCorp.
4. Find feature parity
Make sure that your new cloud provider has feature parity for your use cases. There are many functional differences between cloud providers, and some vendors have well-defined specializations. Still, the “table stakes” features are available across the major hyperscalers (AWS, Azure, Google Cloud, and Oracle Cloud Infrastructure).
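A parity check can start as a plain list comparison. The sketch below assumes you maintain the feature lists by hand; the provider names and feature labels are placeholders for illustration, not a real capability catalog.

```python
def parity_gaps(required: set[str],
                catalog: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each candidate provider, return the required features it lacks."""
    return {provider: required - offered for provider, offered in catalog.items()}


if __name__ == "__main__":
    # Hypothetical requirements and offerings, filled in by your own research.
    required = {"managed-kubernetes", "object-storage", "managed-postgres"}
    catalog = {
        "provider-a": {"managed-kubernetes", "object-storage", "managed-postgres"},
        "provider-b": {"managed-kubernetes", "object-storage"},
    }
    for provider, gaps in parity_gaps(required, catalog).items():
        print(provider, "missing:", sorted(gaps) or "nothing")
```

An empty set for a provider means it covers everything on your list; anything else is a gap to resolve before you commit.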
5. Get your data out the smart way
Once you select your new vendor, prepare to face the next challenge: getting your data out of the original cloud provider’s servers. Moving large volumes of data can be very costly, so the operation needs some planning.
You don’t want to end up like NASA, which planned to hold some 247 petabytes of data in AWS but overlooked the mounting egress costs. Vendors typically charge around $0.09 per GB to move data out, while importing data is usually free.
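To get a feel for the scale, here is a back-of-the-envelope estimate. It is a simplification that applies one flat per-GB rate to the whole volume and uses decimal units (1 PB = 1,000,000 GB); real price lists are usually tiered, so treat the result as an upper-end ballpark rather than a quote.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB


def egress_cost_usd(petabytes: float, rate_per_gb: float = 0.09) -> float:
    """Estimate egress cost at a single flat rate (ignores tiered pricing)."""
    return petabytes * GB_PER_PB * rate_per_gb


if __name__ == "__main__":
    cost = egress_cost_usd(247)
    print(f"Moving 247 PB out at $0.09/GB: about ${cost:,.0f}")
```

At the flat rate, 247 petabytes works out to roughly $22.2 million, which is why the exit path deserves a line in the budget before the data goes in.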
6. Use a load balancing solution
Using a global load balancing solution is a smart move. It helps to make the transition between cloud providers practically seamless. Your end-users won’t even realize that a transition is taking place since the flow of traffic will switch over in real time.
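One common way to do this is a gradual, weight-based traffic shift. The following is a toy sketch of the idea, not the API of any particular load balancer: the backend names `old_cloud` and `new_cloud`, the linear ramp, and both function names are assumptions made for illustration.

```python
import random


def shift_schedule(steps: int) -> list[dict[str, float]]:
    """Linear ramp of traffic weights from all-old to all-new in `steps` moves."""
    return [
        {"old_cloud": 1 - i / steps, "new_cloud": i / steps}
        for i in range(steps + 1)
    ]


def pick_backend(weights: dict[str, float], rng: random.Random) -> str:
    """Route one request to a backend, chosen in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the demo is repeatable
    for stage in shift_schedule(4):
        hits = sum(pick_backend(stage, rng) == "new_cloud" for _ in range(1000))
        print(f"new_cloud weight {stage['new_cloud']:.2f}: {hits}/1000 requests")
```

Stepping through the schedule slowly, and pausing whenever error rates rise, gives you a natural checkpoint at every stage.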
7. Wait until your commitments expire
Make sure that you’re not leaving money on the table because you committed to resources via Reserved Instances or Savings Plans. Plan your transition around the expiry of such plans. Always leave some time in case things go wrong and you need a quick rollback.
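A quick calculation shows what leaving early would cost and when the cutover should happen. This sketch assumes each commitment can be approximated as a fixed hourly spend until its end date, which is a simplification (upfront payments, for example, are ignored); the `Commitment` record, the sample figures, and the 14-day buffer are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Commitment:
    """A committed-spend plan, simplified to an hourly rate and an end date."""
    name: str
    hourly_usd: float
    end: date


def stranded_spend_usd(commitments: list[Commitment], leave_on: date) -> float:
    """Committed spend you would still owe if you left on `leave_on`."""
    total = 0.0
    for c in commitments:
        remaining_days = max((c.end - leave_on).days, 0)
        total += remaining_days * 24 * c.hourly_usd
    return total


def cutover_deadline(commitment_end: date, rollback_buffer_days: int) -> date:
    """Latest cutover date that still leaves a rollback window before expiry."""
    return commitment_end - timedelta(days=rollback_buffer_days)


if __name__ == "__main__":
    plans = [Commitment("compute-plan", 1.0, date(2025, 1, 31))]
    print(stranded_spend_usd(plans, date(2025, 1, 1)))
    print(cutover_deadline(date(2025, 1, 31), rollback_buffer_days=14))
```

Running it for a few candidate dates makes the trade-off between leaving sooner and paying for unused commitments explicit.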
8. Have a rollback plan
Practice the migration rollback in pre-production first. That way, things will go much more smoothly during the actual transition.
9. Choose your new provider wisely
Evaluate the vendor based on their historical track record. You can also use sources like Gartner’s IaaS Magic Quadrant, which evaluates major cloud providers and includes reliability in its assessment.
Check if the provider is investing in the infrastructure. Are they opening more regions? Are they handling capacity crunches by adding more capacity per region?
And always ask about billing credit policies. Can your vendor give examples of outages where they have actively refunded customers for the associated losses?
10. Build a Disaster Recovery/downtime avoidance strategy
No cloud provider is flawless, but many have a strong operational foundation and proven track records at scale. Still, it’s a good idea to develop a solid Disaster Recovery and downtime avoidance strategy.
At CAST AI, we think of multiple cloud usage as an “A” AND “B” proposition. By having active-active fault tolerance across two or more cloud providers, we make sure that our customers aren’t impacted by a single cloud going down. All of this needs specialized tooling that we provide.