Bonus content: Cloud bill optimization checklist
This is part 1 of our series on cloud cost optimization.
Before we go, let’s see where you stand. Can you win the bin(go)? Note how many of these issues you’ve had to deal with.
If you haven’t crossed a single tile – please share your secret with us. Otherwise, read on and see how you can lose this bingo and win in DevOps.
Any cost optimization you’ll do begins with understanding your cloud bill. But how do you get started when a typical cloud bill looks like this?
The cloud may come with a high price tag, here’s why
You’re not a terrible DevOps engineer if your cloud bill usually exceeds your planned budget. Flexera’s 2020 State of the Cloud Report revealed that organizations go over their public cloud budgets by an average of 23%.
This isn’t surprising at all. After all, you don’t know how much you’ll need to pay until you get your cloud bill at the end of the month. And forecasting these expenses is no walk in the park.
But how exactly does this happen? Why is it so hard to control cloud bills?
In the early days of IT, people were divided into those who controlled costs and those who wrote code. The latter often had no direct impact on the costs of running that code. Of course, software efficiency matters, but let’s put that aside for a moment. The traditional software delivery setup was clear on roles and responsibilities – it was the system administration group (later called DevOps or SRE) who took care of infrastructure deployment and ongoing operational costs.
In this setup, it was typical to release software once a month or every couple of months. The traditional way of delivering software to production wasn’t driving innovation. Enter the service team – a group of engineers who could write code, test it automatically, deploy it, and be held responsible for 24/7 operations.
Since software engineers were in the business of software innovation, they needed access to resources. Thanks to the cloud, they could order the infrastructure they needed through Infrastructure-as-a-Service platforms and then automate deployments through Infrastructure-as-Code. This led to massively faster innovation cycles. In times of economic boom, this type of rapid innovation, infrastructure deployment, and potential associated waste were acceptable to many companies.
But the pendulum has swung back. We are entering an era where CFOs, controllers, and financial departments need to know more about the costs incurred by cloud resources used by various departments. It’s a hard nut to crack. Is the team ordering just enough, too much, or too little? Are cloud costs optimized? How can I forecast cloud costs based on projected business metrics and growth?
Finance departments need to answer these questions for every single team/group/department that uses cloud services. And in an ideal scenario, the engineering service team should be able to answer some of them as well.
Why optimize cloud costs? or, The dangers of over-provisioning
We all want to pay less for cloud services. But cost optimization is about more than that. Here are three problems your team might face if you leave your cloud bill to fate.
If you don’t have a firm grasp over your cloud expenses, you might end up wasting resources. This can happen to teams that move an application to the cloud and make assumptions about the resources needed to support the software based on how it ran on-premises, leading to potential over-provisioning.
Planning cloud expenses without any real utilization visibility will likely have you drowning in the sea of unused CPU in the best case. Even worse, your application could be crashing due to insufficient resources if under-provisioned. There’s no real safety net to catch you when you make an error on either side.
It’s so easy to spin up an instance for a project and then forget to shut it down. Many teams deal with orphaned instances that have no ownership but still continue to generate costs. Shadow IT projects similarly produce poorly accounted for resources in the cloud.
Some teams have a tendency to overprovision resources just to add some extra capacity and create a “just in case” buffer. The limited visibility or control of these resources might snowball into a huge problem at the end of the month.
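To make the waste visible, you can put a rough number on it. The sketch below is only an illustration: the `Instance` fields, the 10% utilization threshold, and all figures are assumptions, and it treats average CPU utilization as a stand-in for how much of an instance you actually use.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    avg_cpu_util: float   # average CPU utilization, 0.0 to 1.0 (hypothetical data)
    monthly_cost: float   # USD per month (hypothetical data)


def idle_cost(instance: Instance) -> float:
    """Rough share of the monthly cost that pays for unused CPU."""
    return instance.monthly_cost * (1.0 - instance.avg_cpu_util)


def find_suspects(instances, util_threshold=0.10):
    """Names of instances idle enough to look orphaned or oversized."""
    return [i.name for i in instances if i.avg_cpu_util < util_threshold]


fleet = [
    Instance("api-1", 0.55, 280.0),
    Instance("batch-old", 0.02, 560.0),
    Instance("staging-db", 0.08, 140.0),
]

total_idle = sum(idle_cost(i) for i in fleet)
print(f"Estimated idle spend: ${total_idle:.2f}")
print("Review candidates:", find_suspects(fleet))
```

A real check would also look at memory, network, and disk before calling anything wasted, but even this crude view tells you which instances to ask about first.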
Deciding on a path with limited information
Recently, many companies have been migrating from the cloud back to on-premises, citing better cost control as their primary reason. But migration generates costs as well – from planning and data egress to re-aligning your service continuity and disaster recovery plan.
A migration from cloud to on-premises affects your ability to innovate and quickly respond to change, coming at a steep price. Consider this scenario:
You’re paying way too much for the cloud. Managing and estimating these costs is hard. So, you move back to data centers and can finally make sense of your infrastructure expenses. You’re killing it! But you’re also killing innovation in the process.
This is a pain many engineers have felt first-hand. Once on-premises, teams can no longer be flexible and experiment because the infrastructure or DevOps team plans the capacity months ahead and blocks certain expansion scenarios.
5 cloud bill issues you’re bound to encounter, sooner or later
Cloud bills are hard to understand
They’re long, complicated, hard to unpack and comprehend. The example we shared above speaks volumes.
Every service in your bill has its own billing metric. For example, in Amazon Simple Storage Service (S3), some charges are based on the number of requests, while others are based on GB. That’s why making sense of the cloud bill is such an overwhelming task.
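Here is a small sketch of why mixed billing metrics make a bill hard to read at a glance. The two rates are assumptions picked for illustration, not quoted prices, so check your provider’s current price list before reusing them.

```python
# Assumed rates for illustration only; not an official price list.
RATE_PER_GB_MONTH = 0.023      # USD per GB stored per month (assumption)
RATE_PER_1K_REQUESTS = 0.005   # USD per 1,000 requests (assumption)


def storage_cost(gb_stored: float) -> float:
    """Charge driven by data volume."""
    return gb_stored * RATE_PER_GB_MONTH


def request_cost(request_count: int) -> float:
    """Charge driven by the number of requests."""
    return request_count / 1000 * RATE_PER_1K_REQUESTS


def bucket_cost(gb_stored: float, request_count: int) -> float:
    return storage_cost(gb_stored) + request_cost(request_count)


# Two buckets holding the same amount of data but with very different traffic.
archive = bucket_cost(gb_stored=500, request_count=10_000)
hot = bucket_cost(gb_stored=500, request_count=20_000_000)
print(f"archive bucket: ${archive:.2f}, hot bucket: ${hot:.2f}")
```

Same storage footprint, very different bill. Looking at GB alone would hide most of the second bucket’s cost.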
To make sense of your usage and costs, you need to look into various areas in your CSP console. Just take a look at AWS Billing and Cost Management Dashboard. To get a more granular view, you need to check the Cost Explorer and then group and report on costs by certain attributes – for example, group resources by region or service. This is time-consuming and heavily relies on human intervention.
Now imagine doing that for more than one team or department using the same cloud service.
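If you export the line items, the grouping itself is easy to script. The sketch below assumes a simplified, hypothetical row shape (`service`, `region`, `cost`); real exports carry many more columns under different names.

```python
from collections import defaultdict

# Hypothetical line items, loosely shaped like rows from an exported cost report.
line_items = [
    {"service": "compute", "region": "us-east-1", "cost": 412.50},
    {"service": "storage", "region": "us-east-1", "cost": 96.25},
    {"service": "compute", "region": "eu-west-1", "cost": 230.00},
    {"service": "storage", "region": "eu-west-1", "cost": 41.75},
]


def group_costs(items, key):
    """Sum the cost of all line items that share the same value for `key`."""
    totals = defaultdict(float)
    for item in items:
        totals[item[key]] += item["cost"]
    return dict(totals)


by_service = group_costs(line_items, "service")
by_region = group_costs(line_items, "region")
print("By service:", by_service)
print("By region:", by_region)
```

The hard part in practice is not the summing. It is getting a clean, complete export and agreeing on which attribute answers the question you are asking.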
Billing for multiple clouds is even harder
Multiply the bill problems above by the number of clouds you use. Let’s say that you’re using AWS. You’re now used to it; its costs are manageable. But what if another department unexpectedly starts using Microsoft Azure in a shadow IT project?
Just compare the cloud bill from AWS and Azure; they’re worlds apart.
Multiple teams working in one financial account
Step into the shoes of a CFO and their finance team. You’re dealing with several departments that contribute to the cloud bill. How can one make sense of it? Who is using which resources?
Cloud service providers offer mechanisms that allow categorizing spending by accounts, organizations, or projects to make sure that a team or department keeps within its spending parameters:
AWS
- Organizations – This feature helps to centrally manage and govern your environment when scaling AWS resources. You can create new AWS accounts and allocate resources, organize your workflows by grouping accounts, apply budgets and policies to accounts or groups, and use a single payment method for simpler billing.
- Tagging resources – You can tag resources directly, but don’t expect them to show up in your bills. It’s your job to break down data by tags. You can do that by writing reports in the Cost Explorer or downloading the data from S3 and using it directly, which is definitely not a trivial task. There are entire companies busy developing tools for expressing and representing bills, like CloudHealth by VMware.
Microsoft Azure
- Resource groups – A resource group is a container that consists of resources that you want to manage as a group. Azure recommends bringing together resources that share the same lifecycle to deploy, update, and delete them as a group.
- Tagging is also available as an option for Azure customers.
Google Cloud Platform
- Projects – In GCP, a project includes a set of users, enabled APIs, billing settings, and authentication and monitoring settings for those APIs. You can create multiple projects and use them to organize your cloud resources into logical groups to help with understanding their cost.
- Google Cloud supports tags for billing as well – they are called labels. Some GCP resources haven’t implemented the ability to label yet, but that gap will likely close soon.
Each CSP approaches costs and billing differently. This only adds to the already complex task of manually keeping track of your resources, usage, and associated costs.
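One way to cope with the differences is to map every provider’s records onto a single shape before doing any analysis. The sketch below is an assumption-heavy illustration: the field names (`account`, `resource_group`, `project`, `tags`, `labels`) are simplified stand-ins, not the real export schemas, and the `team` key is a hypothetical tagging convention.

```python
from collections import defaultdict

# Hypothetical, simplified records; real exports use different column names.
aws_rows = [{"account": "111111111111", "tags": {"team": "payments"}, "cost": 120.0}]
azure_rows = [
    {"resource_group": "rg-search", "tags": {"team": "search"}, "cost": 80.0},
    {"resource_group": "rg-misc", "tags": {}, "cost": 15.0},
]
gcp_rows = [{"project": "payments-prod", "labels": {"team": "payments"}, "cost": 60.0}]


def normalize(provider, row):
    """Map a provider-specific row onto one common shape."""
    tags = row.get("tags") or row.get("labels") or {}
    scope = row.get("account") or row.get("resource_group") or row.get("project")
    return {
        "provider": provider,
        "scope": scope,
        "team": tags.get("team", "untagged"),
        "cost": row["cost"],
    }


def cost_by_team(rows):
    totals = defaultdict(float)
    for row in rows:
        totals[row["team"]] += row["cost"]
    return dict(totals)


unified = (
    [normalize("aws", r) for r in aws_rows]
    + [normalize("azure", r) for r in azure_rows]
    + [normalize("gcp", r) for r in gcp_rows]
)
team_totals = cost_by_team(unified)
print(team_totals)
```

Keeping an explicit `untagged` bucket is a deliberate choice here: it shows how much spend nobody has claimed yet.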
Budgeting for the cloud
Each cloud provider offers budgeting tools that help CFOs and their financial teams to restrict resources that can be used in a project in line with its budget.
But cloud budgets tend to overrun. Case in point? Milkie Way, a Silicon Valley startup, burned through $72k while testing Firebase + Cloud Run and almost went bankrupt. The trouble in that case was that Google evaluated the budget at the end of the day, while the team blew through it within hours.
In a similar incident, a team of software developers at Adobe incurred $80k a day in unplanned cloud costs, with a final bill that surpassed half a million dollars. For an enterprise, this could easily become a multi-million dollar cloud bill.
What causes cloud budgets to overrun?
- Discovering costly requirements after formal discovery is finished.
- Not knowing your system requirements upfront and having wrong assumptions about how the system features will work and scale.
- Lack of autoscaling design for applications.
- Faulty provisioning logic in IaC that spins out of control.
- The use of serverless (functions) without thought to parallel scale.
- Lack of budget attention, aka nobody watching.
- Poorly configured notifications and alerts.
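Several of these causes come down to finding out too late. A check that runs every hour and projects the current burn rate forward can flag a problem long before a daily budget evaluation would. The sketch below is a minimal illustration: the function names, the 720-hour month, and the sample figures are assumptions, and a straight-line projection is the simplest possible model.

```python
def projected_month_end(spend_to_date, hours_elapsed, hours_in_month=720):
    """Extrapolate the current average burn rate to the end of the month."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    return spend_to_date / hours_elapsed * hours_in_month


def alert_level(spend_to_date, hours_elapsed, budget, hours_in_month=720):
    """Classify the budget state: 'exceeded', 'forecast_breach', or 'ok'."""
    if spend_to_date >= budget:
        return "exceeded"
    if projected_month_end(spend_to_date, hours_elapsed, hours_in_month) > budget:
        return "forecast_breach"
    return "ok"


# Hypothetical monthly budget of $7,000, checked 24 hours into the month.
print(alert_level(spend_to_date=100, hours_elapsed=24, budget=7000))
print(alert_level(spend_to_date=500, hours_elapsed=24, budget=7000))
print(alert_level(spend_to_date=7200, hours_elapsed=24, budget=7000))
```

An alert only helps if someone, or something, acts on it, so pair a check like this with a clear owner and an agreed response.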
Approach to consider: Correlating costs with business value at Netflix
Driving costs down isn’t something you do at the expense of supporting your key business goals. Netflix is a good example here. The company uses the total number of active streams (how many people are currently watching content) to measure its business value. By correlating this KPI to cloud costs, Netflix can ensure that spending growth doesn’t outpace the growth in active streams.
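The same idea works with any business KPI. The sketch below is not Netflix’s method, only a minimal illustration of comparing cost growth with KPI growth; the function names and all numbers are made up.

```python
def cost_per_unit(cloud_cost, kpi_units):
    """Unit cost, for example dollars per active stream."""
    return cloud_cost / kpi_units


def growth(previous, current):
    """Relative growth between two periods (0.25 means +25%)."""
    return (current - previous) / previous


def spend_outpaces_value(prev_cost, curr_cost, prev_kpi, curr_kpi):
    """True when cloud spend grew faster than the business KPI."""
    return growth(prev_cost, curr_cost) > growth(prev_kpi, curr_kpi)


# Hypothetical month-over-month figures.
prev_cost, curr_cost = 100_000, 115_000      # cloud spend in USD
prev_kpi, curr_kpi = 2_000_000, 2_500_000    # value of the chosen KPI

print(f"Unit cost before: ${cost_per_unit(prev_cost, prev_kpi):.4f}")
print(f"Unit cost after:  ${cost_per_unit(curr_cost, curr_kpi):.4f}")
print("Spend outpaces value:", spend_outpaces_value(prev_cost, curr_cost, prev_kpi, curr_kpi))
```

In this made-up case the bill went up, but the cost per unit of business value went down, which is the healthier signal to watch.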
Cloud cost forecasting
Cloud bills are hard to forecast because they fluctuate depending on usage.
Forecasting should still be on top of your mind. Having a good understanding of your future resource requirements helps to keep a rein on costs.
You could also get a lower price for services. The catch is that you need to commit to a certain level of spending, which can mean a serious case of vendor lock-in lasting up to a few years.
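Whether a commitment pays off depends on how much of it you actually use. The sketch below works through the break-even point with assumed hourly rates; it ignores upfront payments and other pricing details, so treat it as a way to reason about the trade-off, not a pricing model.

```python
def break_even_utilization(on_demand_hourly, committed_hourly):
    """Share of committed hours you must use for the commitment to pay off."""
    return committed_hourly / on_demand_hourly


def commitment_savings(on_demand_hourly, committed_hourly, hours_used, hours_committed):
    """Savings versus pure on-demand; a negative result means the commitment cost more.

    Hours beyond the commitment are billed on demand in both scenarios,
    so only the hours covered by the commitment are compared.
    """
    covered_hours = min(hours_used, hours_committed)
    on_demand_total = on_demand_hourly * covered_hours
    committed_total = committed_hourly * hours_committed  # paid whether used or not
    return on_demand_total - committed_total


# Assumed rates: $0.10/hour on demand, $0.06/hour with a one-year commitment.
YEAR_HOURS = 8760
threshold = break_even_utilization(0.10, 0.06)
low_usage = commitment_savings(0.10, 0.06, hours_used=4000, hours_committed=YEAR_HOURS)
high_usage = commitment_savings(0.10, 0.06, hours_used=8000, hours_committed=YEAR_HOURS)
print(f"Break-even utilization: {threshold:.0%}")
print(f"Savings at 4,000 hours: ${low_usage:.2f}")
print(f"Savings at 8,000 hours: ${high_usage:.2f}")
```

With these assumed rates, using less than about 60% of the committed hours makes the discount more expensive than paying on demand. That is why a usage forecast should come before any commitment.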
To help in forecasting, CSPs offer various tools:
AWS
- AWS Budgets – The tool allows setting custom budgets to track your cost and usage for various use cases. It includes alerts when the actual or forecasted cost exceeds your budget threshold and when your actual RI and Savings Plan utilization or coverage drops below the threshold you set for it.
- AWS Cost Explorer – The AWS Cost Explorer helps to visualize, understand, and manage costs and usage over time. You can create custom reports, analyze your data at a high level, and detect cost drivers or anomalies.
- AWS Cost & Usage Report – This report includes a comprehensive set of cost and usage data with additional metadata about AWS pricing, services, reserved instances, and savings plans at different levels of granularity. It itemizes usage at the account or organization level, and you can organize the costs further using Cost Allocation tags and Cost Categories.
Microsoft Azure
- Cloud cost management – This dashboard helps to track and manage costs across both Azure and AWS cloud services. It includes options for cost analysis and optimization. You can expand it with Microsoft Power BI connectors and Azure Cost Management and Billing APIs.
- Pricing calculator – A handy tool for checking the costs of different Azure configurations. You can estimate costs by products or using ready-made scenarios like Advanced analytics on Big Data or CI/CD for Containers.
Google Cloud Platform
- Google Cloud Billing – The cost forecast feature enables users to check how their costs are trending and how much they’re projected to spend in a given month. You can use it to forecast the end-of-month costs for a specific spending group, from the entire billing account down to one SKU in a single project. You can also export your entire billing data set to Google BigQuery and use tools like Google Data Studio for custom analysis.
Once you have the right data about your cloud spend, you can try different techniques of cloud cost forecasting:
- Analyze usage reports – Forecasting is impossible if you don’t have clear visibility of your spend. Monitor your resource usage reports on a regular basis and set up email and other alerts. Some CSPs have tools that allow forecasting of how much you’re likely to spend during the next few months and get recommendations on reserved instances or savings plans.
- Model your cloud costs – To calculate the total cost of ownership, analyze the pricing models of CSPs and plan capacity requirements over time so that you can project costs accurately. Measure application or workload-specific costs and create an application-level cost plan. Aggregate all of this data in one location to understand your costs and trends better.
- Identify peak resource usage scenarios – You can do this by using periodic analytics and then running reports over your usage data. Consider other sources of data like seasonal customer demand patterns. Do their patterns correlate with your peak resource usages? If so, you can identify these in advance.
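As a starting point for the techniques above, you can fit a simple trend line to past monthly totals. The sketch below uses ordinary least squares on made-up figures; it is one basic approach that ignores seasonality, so treat the result as a baseline and not as a prediction.

```python
def linear_forecast(history, periods_ahead=1):
    """Fit a least-squares line to past totals and extrapolate it forward."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two data points")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


def peak_period(history):
    """Index of the period with the highest spend."""
    return max(range(len(history)), key=lambda i: history[i])


# Hypothetical monthly cloud spend in USD, oldest first.
monthly_spend = [1000, 1100, 1200, 1300]
next_month = linear_forecast(monthly_spend)
print(f"Forecast for next month: ${next_month:.2f}")
print("Peak month index so far:", peak_period(monthly_spend))
```

If your usage follows seasonal demand, compare the forecast against the same period of the previous year before you trust the straight line.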
Is automation possible?
You’re probably asking yourself these questions right now:
What about third-party solutions? Aren’t cost management and optimization tools here to help make sense of my cloud bill faster?
Third-party cost reporting solutions provide better visibility, but they won’t help you take automated action.
Some third-party tools can be helpful in getting a big picture view of your cloud costs. At worst, they report on your usage so that you know where the costs are coming from. At best, they give you static recommendations that need to be executed by human beings.
A simple checklist will help you do most of what a dedicated tool can accomplish at no additional cost. You can get one delivered to your email here:
To save and optimize costs in real time, you should use an automated AIOps platform that does the heavy lifting for you. CAST AI includes this feature - take a look at our use cases to check if it’s the right fit for you.
Stay tuned for the next part of this series, where we’re going to talk about what to consider when choosing the right cloud compute resources for your workloads.