How Branch saved millions of dollars on its cloud bill while maintaining reliability

“Partnering with CAST AI has been a big success for Branch, saving us several millions of dollars per year in AWS Cloud compute costs for our Kubernetes clusters while maintaining our reliability SLAs.”

Company size

~500 employees

Industry

Mobile Marketing

Headquarters

Palo Alto, CA

Cloud services used

Amazon EKS, Kops, EC2

About Branch

Branch provides the industry’s leading mobile linking and measurement platforms, offering solutions that unify user experience and attribution across devices and channels. By integrating Branch technology into core marketing channels including apps, web, email, social media, search, and paid ads, leading brands are delivering better user experiences and driving more value from their marketing campaigns than ever before.

Branch powers mobile links, attribution, and measurement for more than 3 billion monthly users and is working on exciting new ways to improve discovery in the mobile ecosystem. Since 2014, Branch has been selected by over 100,000 apps, including Adobe, BuzzFeed, Yelp, and many more. Branch is headquartered in Palo Alto, California with offices around the globe and found on the web at https://branch.io/

Branch is growing rapidly, and its need for cloud computing resources increases by a double-digit percentage every year. This is driven by client growth, which includes both new clients and the increased traffic Branch’s solution drives for existing clients.

Mark Weiler, the Senior VP of Engineering at Branch, kindly shared with us his experience working with CAST AI.

Spot instances were a challenge to deploy with reliability

Branch is a late-stage startup with a very high-volume event processing system (tens of billions of daily events), coupled with overarching business goals to continually improve profitability. For the Branch Engineering team, this translates into a need for a cost-efficient production environment on Amazon Web Services, Branch’s preferred cloud provider.

One of the main tools for deploying a highly cost efficient solution into AWS Cloud is to optimize the reservation type (see chart below) of EC2 compute instances running in Kubernetes clusters. Several years ago, Branch encountered significant challenges deploying Spot instances due to capacity shortages and the inflexible and cumbersome failover mechanisms, which resulted in an outage and constant DevOps tuning.

Spot instances were very attractive because they provided the highest discount on price per compute hour (almost 3x the discount of Savings Plans), but the risk of downtime and the time investment needed to manage them made continued use impractical. Instead, Branch reverted to Savings Plans, which still provided some discount along with the guarantee of ownership, but required better capacity planning and upfront payment.

This motivated Branch to look for a solution that would automate spot instance usage, offer real-time cost visibility, and provide for the ability to provision the most cost-effective cloud resources to free the company from upfront commitment of Reserved Instances and Savings Plans and reduce overall cost of EC2 compute resources.

Cloud Provider Compute (EC2) Reservation Type Comparison

Metric	On Demand	Savings Plan	Reserved	Spot Instances
What are you buying	Compute instance (EC2) hourly for only what you use	Compute savings plan, hourly commit for 1 or 3 years for any instance family (e.g. C4) or “general compute” hours	Compute Reserved Instance, hourly commit for 1 or 3 years for a specific instance type (e.g. C4.xlarge)	Compute instance (EC2) hourly for only what you use
Cloud provider commitment	Guaranteed the buyer owns it 100% and won’t be taken away	Guaranteed the buyer owns it 100% for the term	Guaranteed the buyer owns it 100% for the term	None, instances can be revoked by Cloud provider with 2 minute notification
Buyer commitment	None	1 or 3 year term Longer term	1 or 3 year term Longer term	None
Payment terms	Pay as you go, no upfront fee	All up front, partial upfront, or pay as you go. More up front gives a larger % discount	All up front, partial upfront, or pay as you go. More up front gives a larger % discount	Pay as you go, no upfront fee
Cost savings	None, paying full price	20-35% discount (depending on terms & availability)	35-65% discount (depending on terms & availability)	65-90% discount (depending on terms & availability)
Use cases	Maximum price for maximum flexibility. This is the lazy option and you should avoid it if you can	Flexibility to buy “General EC2 compute” hours, where you don’t have a good forecast of exact types you want to use. Common for stateful compute operations/jobs	Inflexible (tied to specific instance type) but best discount for buying up front. Requires 1+ year good forecast. Common for stateful compute operations/jobs	Stateless compute workloads, or jobs where interruption and restart is ok.

CAST AI was able to solve the complexity of spot reliability and Kubernetes instance cost efficiency

Branch partnered with CAST AI to unlock the ability to safely utilize Spot instances within their Kubernetes compute clusters and transition away from upfront reservations via Savings Plans.

Prices vary by AWS data center and over time, but generally speaking Savings Plans give roughly a 25% discount off OnDemand and Spot instances give roughly a 70% discount off OnDemand. On average, Spot instances therefore net about 45 percentage points more discount than Savings Plans, without having to pay up front for a full year’s worth of EC2 compute.
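To make the arithmetic behind this comparison concrete, here is a minimal sketch in Python. It uses only the approximate discount figures quoted above (25% and 70%). The 100-unit OnDemand baseline and all names are illustrative assumptions, not Branch's actual spend.

```python
# Illustrative only: the baseline of 100 cost units is an assumption,
# and the discounts are the rough averages quoted in the text.
ON_DEMAND_COST = 100.0
SAVINGS_PLAN_DISCOUNT_PCT = 25
SPOT_DISCOUNT_PCT = 70


def discounted_cost(base_cost: float, discount_pct: int) -> float:
    """Return the cost after applying a percentage discount to base_cost."""
    return base_cost * (100 - discount_pct) / 100


savings_plan_cost = discounted_cost(ON_DEMAND_COST, SAVINGS_PLAN_DISCOUNT_PCT)
spot_cost = discounted_cost(ON_DEMAND_COST, SPOT_DISCOUNT_PCT)

# Difference between the two discounts, in percentage points of OnDemand price.
extra_discount_points = SPOT_DISCOUNT_PCT - SAVINGS_PLAN_DISCOUNT_PCT

# Saving measured against what was already being paid under Savings Plans.
saving_vs_savings_plan = 1 - spot_cost / savings_plan_cost

print(f"Savings Plan cost: {savings_plan_cost:.2f}")
print(f"Spot cost: {spot_cost:.2f}")
print(f"Extra discount: {extra_discount_points} percentage points")
print(f"Saving relative to Savings Plan spend: {saving_vs_savings_plan:.0%}")
```

Note the two ways of stating the gain: 45 percentage points of the OnDemand price, or a 60% reduction relative to Savings Plan spend for the same workload, assuming every hour runs on Spot.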

CAST AI’s automated fallback handles Spot reclamation by dynamically spinning up equivalent compute instances of the most cost-efficient available instance types, then migrating the workloads before the old Spot instances are reclaimed. This allowed Branch to deploy Spot instances safely for all stateless compute workloads in its Kubernetes clusters, with zero incidents due to Spot reclamation by Amazon since deploying CAST AI over 10 months ago.

Additionally, CAST AI provided the most cost efficient instance type selection when moving from OnDemand to Spot, and seamlessly auto scaled the cluster based on resource load. CAST AI has also helped to optimize the choice of spot instance types based on the current market price for each EC2 instance type, allowing Branch to utilize the most cost efficient spot instances for their workload.

The result for Branch was to eliminate the upfront spend of several million dollars per year on Savings Plans for their stateless EC2 compute workloads, while saving millions of dollars of Cloud OpEx spend (over 25% of EC2 compute costs) by leveraging Spot instances safely via the CAST AI solution.

Cost analysis showed immediate savings potential

Branch was searching for a cloud cost optimization platform capable of addressing multiple cost-related goals: safely deploying Spot instances, reducing upfront reservation payments, improving real-time cost visibility, and automating the provisioning of the best-suited and most cost-efficient EC2 compute resources.

Branch assessed multiple cloud automation solutions available on the market and concluded that none of them matched all of its essential requirements. Most tools on the market focus on reporting and at best provide recommendations for what your DevOps team could do manually to optimize your EC2 workloads. When Branch discovered CAST AI, it was clearly differentiated by its focus on cost optimization of running Kubernetes cluster workloads. It also provided the reporting and visibility needed to understand the automated decisions and actions the platform was taking, which built the necessary trust.

During the initial POC, Branch ran the CAST AI available savings report, which estimated within minutes that Branch could save 59.2% on the first Kubernetes cluster tested.

This level of savings made CAST AI the top infrastructure project for Branch to test and deploy in production. Branch used a controlled rollout, starting with smaller clusters to prove out the technology. After the Branch Infrastructure team established confidence in both the operational management of the solution and the reliability of the Spot failover, they deployed it across all Kubernetes clusters.

One blocker to full deployment, identified early, was that Branch was running earlier versions of Kops Kubernetes clusters. Branch was an early adopter of Kubernetes and had not yet migrated all clusters to Amazon EKS. The team at CAST AI was extremely nimble and backported their solution from EKS to support earlier versions of Kubernetes Kops within 30 days. This was another very positive sign that the engineering team at CAST AI shared Branch’s culture and mindset of operating with urgency and being very responsive to customers.

Onboarding was easy, and we saw immediate cost savings as we rolled out across each of our clusters

Prior to deploying CAST AI, Branch used a combination of 1 year Reserved Instances and 1 year Savings Plans for “General compute” to reduce its AWS cloud costs, while keeping OnDemand to under 5% of EC2 compute costs.

The downside of Reserved Instances and Savings Plans is that you have to guarantee a fixed amount of EC2 compute spend up front for a 1 or 3 year period, and you don’t always know whether you will need it throughout that period. Additionally, to get the biggest discount, you have to buy this amount of compute for 1 year up front, which requires both accurate forecasting by Engineering and cash flow management for multi-million dollar investments by the Finance team. Our goal was to minimize the use of OnDemand, the most expensive type of EC2 compute.

When a Savings Plan or a Reserved Instance expires after 1 year, all of the previously covered EC2 instances immediately transition to OnDemand, the most expensive reservation type of EC2. To avoid a long period running on OnDemand, we needed a plan to convert a large volume of OnDemand EC2 over to Spot quickly on the calendar day the Savings Plan expired.

The team at CAST AI helped us plan a quick cutover at Savings Plan expiry by pre-configuring the clusters and tagging our stateless workloads as “Spot Eligible” in advance. As a result, the CAST AI solution managed the transition from OnDemand to Spot in an automated manner, within hours of each Savings Plan expiring.

We were amazed how we were able to make an automated transition to more cost efficient Kubernetes nodes so quickly, at scale and without incident.

Built trust with real-time visibility into costs and cluster health

As Branch rolled out CAST AI across each of our clusters and we gained more trust in the solution, we wanted to make sure we had full visibility into it. We used the CAST AI endpoint to scrape all metrics from CAST AI into our Prometheus monitoring solution, then built real-time Grafana dashboards to monitor operational health and gain full visibility into the automated optimizations CAST AI performs on our clusters.

This real-time observability was critical to building both understanding and confidence in the solution across the Engineering and Infrastructure teams at Branch, and allowed us to roll it out much faster across the organization. Since we can alert on these metrics, our operations team can quickly detect and correct any anomaly.

Since AWS billing data is delayed by up to 24 hours, catching expensive usage spikes is challenging. CAST AI helps our teams flag issues in real time, figure out the root cause, and bring everything back under control before the situation snowballs into a significant financial problem.
Herman Ng, VP Finance & Operations at Branch

Automating instance selection across wide range of types

One unexpected bonus of deploying CAST AI was the variety of Spot instance types we were able to take advantage of, and the automatic selection of the most cost-efficient ones. It’s challenging for anyone to keep up to date on the myriad AWS instance types available, and to monitor the fluctuating market price of all these types on the Spot market.

In the past, when we were manually configuring our Spot instances, we would select nodes that appeared optimal based on the knowledge and experience of our Infrastructure team doing the configuration. With CAST AI, the automated system constantly monitors the entire AWS Spot Marketplace for the most appropriate Spot Instance type based on the cost adjusted resource needs of the workload, and selects those instances for you.

Engineers may miss out on these offers because they tend to order from families they know. CAST AI, on the other hand, approaches the search pragmatically – always focusing on finding the best match based on price and relative performance.

That’s why the platform may suggest an unlikely candidate, for example the extra-fast m5zn.large that comes with extra network bandwidth, SSD disks, and a powerful CPU. An engineer looking beyond the instance families they know would likely consider this machine too expensive. However, CAST AI’s programmatic approach to instance type selection based on price performance showed that it was a bargain at the current Spot market price, and selected this instance for the workload.
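The idea of choosing by current price rather than by familiarity can be sketched in a few lines of Python. This is not CAST AI's algorithm, only a simplified illustration: it keeps the offers that satisfy a workload's CPU and memory needs and picks the cheapest one. The `SpotOffer` type, the `pick_cheapest_fit` function, and the prices are all hypothetical and do not reflect real Spot market data.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class SpotOffer:
    instance_type: str
    vcpus: int
    memory_gib: float
    spot_price_per_hour: float  # hypothetical price, not real market data


def pick_cheapest_fit(
    offers: Iterable[SpotOffer],
    need_vcpus: int,
    need_memory_gib: float,
    blocked: FrozenSet[str] = frozenset(),
) -> Optional[SpotOffer]:
    """Return the cheapest offer that fits the workload, or None if none fit."""
    candidates = [
        o
        for o in offers
        if o.instance_type not in blocked
        and o.vcpus >= need_vcpus
        and o.memory_gib >= need_memory_gib
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.spot_price_per_hour)


# Made-up snapshot of Spot prices for three instance types.
offers = [
    SpotOffer("c5.large", vcpus=2, memory_gib=4, spot_price_per_hour=0.030),
    SpotOffer("m5.large", vcpus=2, memory_gib=8, spot_price_per_hour=0.040),
    SpotOffer("m5zn.large", vcpus=2, memory_gib=8, spot_price_per_hour=0.035),
]

choice = pick_cheapest_fit(offers, need_vcpus=2, need_memory_gib=6)
print(choice.instance_type if choice else "no fit")
```

In this made-up snapshot the less familiar m5zn.large wins because it is the cheapest type that meets the memory requirement. A real selector would also weigh relative performance and interruption risk, which this sketch omits.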

The entire cloud optimization effort is completely automated, while also providing a high degree of configuration via setting requests and limits, blocking specific instance types, tagging deployments as spot eligible, and setting optimization frequency intervals. The end result is our team sets the constraints and tuning parameters, and then CAST AI takes over instance selection, provisioning, killing instances or node groups, and fallback in the case of node reclamation by AWS.

Graceful failover and optimization during high demand

Looking back at this past holiday season, our AWS data centers experienced higher than normal demand, creating a shortage of Spot instances and forcing a transition to OnDemand where the Spot market dried up. Thankfully, CAST AI has a fallback feature that gracefully falls back to alternative Spot instance types (if available) and, if there are no equivalent Spot instance types, ultimately falls back to OnDemand, which ensured we experienced no downtime.

While this came at a higher cost during the “low spot inventory” period, it ensured our primary goal of no downtime was met. In the past, when Branch experienced such Spot drought scenarios, our SysOps and DevOps teams would spend an enormous amount of time gracefully failing over and then switching back to Spot instances when they became available again, or worse, would shy away from Spot altogether.

Now with CAST AI, not only did we gracefully fall back to other equivalent Spot instance types or to OnDemand, but any instance that had fallen back to OnDemand would periodically and automatically retry to acquire an equivalent Spot instance. This meant that, with no effort from Branch, we navigated the Spot drought safely and, once it had passed, returned to optimal cost efficiency with Spot instance deployments.
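The fallback order described here (equivalent Spot types first, then OnDemand, then periodic retries to get back onto Spot) can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not CAST AI's implementation: `provision_with_fallback`, `retry_spot`, and the availability check are hypothetical names, and Spot availability is simulated with a plain set.

```python
from typing import Callable, Sequence, Tuple

Placement = Tuple[str, str]  # (lifecycle, instance_type)


def provision_with_fallback(
    spot_types: Sequence[str],
    on_demand_type: str,
    spot_available: Callable[[str], bool],
) -> Placement:
    """Try each equivalent Spot type in order, then fall back to OnDemand."""
    for instance_type in spot_types:
        if spot_available(instance_type):
            return ("spot", instance_type)
    return ("on-demand", on_demand_type)


def retry_spot(
    current: Placement,
    spot_types: Sequence[str],
    on_demand_type: str,
    spot_available: Callable[[str], bool],
) -> Placement:
    """Called periodically: move an OnDemand fallback back to Spot if possible."""
    lifecycle, _ = current
    if lifecycle == "spot":
        return current  # already on Spot, nothing to do
    return provision_with_fallback(spot_types, on_demand_type, spot_available)


# Simulated Spot market: during the drought no Spot type is available.
available_spot = set()
equivalents = ["m5.large", "m5a.large", "m5zn.large"]


def is_available(instance_type: str) -> bool:
    return instance_type in available_spot


placement = provision_with_fallback(equivalents, "m5.large", is_available)
print(placement)  # falls back to OnDemand during the drought

# Later the drought ends for one equivalent type and the retry succeeds.
available_spot.add("m5a.large")
placement_after_retry = retry_spot(placement, equivalents, "m5.large", is_available)
print(placement_after_retry)
```

A real system would run the retry on a timer and drain workloads before replacing a node. The sketch only shows the decision order.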

Continuous high level of support makes this a true partnership

Any time you give an infrastructure partner the ability to optimize your cloud deployments, you need a lot of trust, not only in their solution but especially in their team. While the CAST AI solution doesn’t require a lot of support, I can confidently say their team was extremely responsive any time we raised an issue or question, and they promptly either fixed the issue or communicated expectations on timeline.

Our Branch Engineering team, and especially our Infrastructure team, operates primarily in Slack so that we can address production issues in real time. The team at CAST AI has the same mindset: production data center issues require an immediate 24/7 response. They reflect this by offering Slack-based support to our engineering and infrastructure teams, with the same responsiveness we have internally, which makes us view the team at CAST AI as an extension of our own.

The level of support we receive from the team at CAST AI gives our team very high confidence that we will be able to continue to invest in our partnership with them for a long time.

A great solution that saved us millions of dollars and time

The CAST AI Kubernetes Cost Optimization solution has been a big success for Branch, saving us several millions of dollars per year in AWS Cloud compute costs for our Kubernetes clusters while maintaining our reliability SLAs. The modest amount of effort by our team makes this one of the highest ROI cloud cost savings initiatives we’ve done at Branch.