Software engineers typically don’t care about infrastructure costs because they’ve never had to.
But when you leave the team to its own devices, costs can quickly spiral out of control – provisioning cloud resources is just that easy.
I know what you’re thinking: convincing the team that cloud costs matter is hard. Even when you give people direct recommendations, getting engineers to act on them is the top challenge for 40% of FinOps practitioners.
So, what can you do? Speaking from my experience leading a dev team at CAST AI, here are a few tips on how we made it work.
Explain the WHY behind cost optimization
Here’s what I often hear happens: the cloud operations team rolls out an analysis and optimization tool that goes through the infrastructure and produces a set of recommendations, which ultimately lands on engineering’s desk to be implemented.
When you ask an engineer to do something extra (and care about it), you need to explain the value of that action to the team or even the entire company.
You need to have a narrative about cloud costs in place to get devs thinking about FinOps. We’ve seen Spotify successfully build a gamification-driven culture around cloud cost savings.
Here’s the narrative that worked for my team:
Optimizing cloud costs isn’t just about reducing spend – it’s about removing waste through smart money management and good hygiene practices.
We’re not looking to prevent our company from reaping the benefits of agility and speed the cloud brings.
But we need to show that we’re smart about using our resources. We need to strike the right balance between performance and cost, and commit to reducing and ultimately eliminating wasteful spending. That’s why we monitor, track, and optimize our infra costs.
Equip your team with processes for identifying costs
Build visibility into cost drivers with tagging
There’s no point in asking your team to cut their spending if you have no idea how much money goes into which service, or which team is using it.
That’s why the first step is building visibility into your cost drivers – both for you and for your entire team.
Develop a tagging system for your team and ask people to tag resources within the next month. By the end of the month, you’ll be able to allocate costs to teams easily.
Which resources should you be tagging? This guide from AWS is a great starting point.
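To make tagging stick, it helps to automate the check instead of relying on people’s memory. Here’s a minimal sketch (assuming AWS and boto3) that finds running EC2 instances missing a team tag and applies a placeholder tag set – the tag keys and values are just illustrations, and your own scheme will differ:

```python
# A minimal sketch (assumed setup: AWS credentials configured, EC2 as the
# example service). It finds running instances with no "team" tag and applies
# placeholder tags so unowned spend shows up as its own line in cost reports.
# The tag keys and values are illustrative - adapt them to your own scheme.
import boto3

DEFAULT_TAGS = {"team": "unassigned", "env": "unknown"}

ec2 = boto3.client("ec2", region_name="us-east-1")

def tag_unowned_instances() -> list[str]:
    """Tag running EC2 instances that are missing a 'team' tag."""
    paginator = ec2.get_paginator("describe_instances")
    untagged = []
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tag_keys = {t["Key"] for t in instance.get("Tags", [])}
                if "team" not in tag_keys:
                    untagged.append(instance["InstanceId"])
    if untagged:
        ec2.create_tags(
            Resources=untagged,
            Tags=[{"Key": k, "Value": v} for k, v in DEFAULT_TAGS.items()],
        )
    return untagged

if __name__ == "__main__":
    print("Instances tagged as unassigned:", tag_unowned_instances())
```

Once the tags are in place, remember to activate them as cost allocation tags in the AWS Billing console so they actually show up in Cost Explorer and your cost reports.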
Airbnb came up with a great approach to attribution. The company gave teams all the key information they needed to make tradeoffs between cost and other business drivers, with the goal of keeping overall spend within a set growth threshold.
By adding this level of visibility into cost drivers, Airbnb incentivized engineers to come up with architectural design changes that would be more cost-effective.
Invest in monitoring for better accountability
Adding alerts and monitoring to your toolkit isn’t just about accountability – it helps you avoid disasters.
A team at Adobe once generated an unplanned cloud bill of over $500k because someone left a computing job running on Azure. A single alert would have been enough to prevent this.
Humans make mistakes – and in the cloud, one simple error can translate into thousands of dollars. One company’s employee made a keystroke error that spun up an AWS instance much larger than required. Then the job that was supposed to end on Friday wasn’t turned off and ran all weekend, resulting in $300,000 in charges that the company could easily have prevented.
Typically, billing data arrives in cost tracking tools like AWS Cost Explorer with some delay. If you only check your cloud bill at the end of the month, you might be in for a nasty surprise – and troubleshooting cost issues after they happen is hard (and costly).
It makes sense to invest in real-time monitoring and alert mechanisms that notify you as soon as your cloud spending for a specific service passes the threshold you set for it. The alternative is to periodically dig through your infrastructure billing and try to catch issues yourself.
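If you go the alerting route on AWS, it can be as simple as a per-service budget with a notification threshold. Here’s a minimal sketch using boto3 and AWS Budgets – the account ID, budget amount, service filter, and email address are placeholders, not recommendations:

```python
# A minimal sketch of a per-service spend alert with AWS Budgets via boto3.
# The account ID, budget amount, service filter, and email are placeholders.
import boto3

# The Budgets API is served from the us-east-1 endpoint.
budgets = boto3.client("budgets", region_name="us-east-1")

def create_service_budget(account_id: str, monthly_limit_usd: str, email: str) -> None:
    """Create a monthly EC2 cost budget and email an alert at 80% of actual spend."""
    budgets.create_budget(
        AccountId=account_id,
        Budget={
            "BudgetName": "ec2-monthly-budget",
            "BudgetLimit": {"Amount": monthly_limit_usd, "Unit": "USD"},
            # Service name as it appears in Cost Explorer
            "CostFilters": {"Service": ["Amazon Elastic Compute Cloud - Compute"]},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 80.0,  # percent of the budget limit
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
            }
        ],
    )

if __name__ == "__main__":
    create_service_budget("123456789012", "5000", "team-alerts@example.com")
```

An alert like this won’t stop a runaway job on its own, but it turns a month-end surprise into a same-day conversation.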
Manual checks rarely keep up, though – in the cloud world, things move quickly. One team racked up a $72k bill in just a few hours while testing a service.
You can’t afford to have people dedicate their time to monitoring the cloud manually, and you likely don’t have the time to do it yourself. This is where real-time monitoring and alerting tools can help. But automation is what takes it all to the next level.
Implement automation to make everyone’s life easier
Asking your people to pay more attention to cloud expenses is a big ask if you don’t follow up with the right tooling. Tagging, monitoring, and alerting help a lot, but they still leave plenty of manual work. And you can bet that your engineers aren’t going to be happy about that – this isn’t the job they signed up for.
Implement an automation solution to take all of these tasks off their plate.
An automation tool takes these tasks over and constantly looks for cost optimization opportunities. Instead of producing a long list of recommendations, it acts on them on its own (within the limits you set for it).
While tagging or monitoring can be done manually, the same can’t be said for picking the most cost-effective instances or replacing spot instances the moment they get interrupted so your workloads keep running.
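To give a feel for what such a tool does under the hood, here’s a minimal sketch of one building block: comparing current spot prices across a few instance types that can run the same workload. This is only an illustration of the kind of decision automation makes continuously – not how any particular tool (CAST AI included) implements it – and the candidate types and region are assumptions:

```python
# A minimal sketch of one automation building block: finding the cheapest current
# spot price among a few instance types that can run the same workload.
# The candidate types and region are illustrative assumptions.
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Roughly interchangeable 4 vCPU / 16 GiB general-purpose types
CANDIDATE_TYPES = ["m5.xlarge", "m5a.xlarge", "m6i.xlarge"]

def cheapest_spot_candidate() -> tuple[str, str, float]:
    """Return (instance_type, availability_zone, price) with the lowest recent spot price."""
    history = ec2.describe_spot_price_history(
        InstanceTypes=CANDIDATE_TYPES,
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    )
    prices = [
        (p["InstanceType"], p["AvailabilityZone"], float(p["SpotPrice"]))
        for p in history["SpotPriceHistory"]
    ]
    if not prices:
        raise RuntimeError("No spot price data returned for the candidate types")
    return min(prices, key=lambda entry: entry[2])

if __name__ == "__main__":
    instance_type, zone, price = cheapest_spot_candidate()
    print(f"Cheapest option right now: {instance_type} in {zone} at ${price:.4f}/hour")
```

A real automation layer runs comparisons like this non-stop, across far more dimensions (on-demand vs. spot, instance families, availability zones, interruption risk), and acts on the result without waiting for a human.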
If you use Kubernetes, here’s an example of an automation flow for optimizing cluster costs: