How PlayPlay Optimizes Compute Resources During Spikes And Saves 40% On Its Cloud Bill
→ Automated selection of the most cost-efficient VMs and full Spot VM automation
→ 40% cloud cost savings on average
→ Boosted DevOps team productivity and increased innovation
Company size
260+ employees
Industry
SaaS
Headquarters
Paris, France
Cloud services used
Google Kubernetes Engine (GKE)
Company
PlayPlay is the video creation platform that empowers Marketing and Communication teams to transform any message into engaging video stories. With powerful and intuitive products, the best AI technologies, and a focus on enterprise storytelling, they’ve enabled over 3,000 companies to make video their main form of communication.
Challenge
PlayPlay was looking for a solution to scale its infrastructure up and down to match usage spikes, select the most optimal compute instances, and automate the entire Spot VM lifecycle. The ideal solution would avoid overprovisioning while still delivering optimal service to customers.
Solution
PlayPlay implemented CAST AI to gain greater visibility into its cloud spend, optimize costs using the platform’s autoscaler and automated instance selection capabilities, and automate Spot VMs with fallback management.
Results
PlayPlay overcame the key challenges using several CAST AI features: the platform’s autoscaler, full automation of Spot Virtual Machines, and resource rebalancing to achieve an optimal configuration in a matter of minutes. Here’s how each of these features works.
Smooth autoscaling that handles usage spikes
The screenshots below show how CAST AI handles the production cluster's spiky usage pattern with rapid scaling. Bin packing is built into the CAST AI autoscaler so that every scaling event is handled cost-efficiently, and the platform scales nodes back down once the spike passes, lowering costs.
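To illustrate the idea behind bin packing (this is a classic first-fit-decreasing sketch, not CAST AI's actual algorithm), consider assigning pod CPU requests to the fewest nodes that can hold them:

```python
def pack(pod_cpus, node_capacity):
    """Assign pod CPU requests (vCPUs) to the fewest nodes that fit them,
    using the first-fit-decreasing heuristic: place the biggest pods first."""
    nodes = []  # remaining free capacity per node
    for cpu in sorted(pod_cpus, reverse=True):
        for i, free in enumerate(nodes):
            if cpu <= free:
                nodes[i] -= cpu  # pod fits on an existing node
                break
        else:
            nodes.append(node_capacity - cpu)  # open a new node
    return len(nodes)

# Ten pods requesting 14 vCPUs in total fit on two 8-vCPU nodes.
print(pack([4, 3, 3, 1, 1, 0.5, 0.5, 0.5, 0.25, 0.25], node_capacity=8))  # → 2
```

A scheduler that leaves pods scattered across nodes after a spike keeps more nodes alive than this lower bound; compacting toward it is what lets the autoscaler remove nodes and cut cost.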
CPU usage over the period of 30 days
Memory usage over the period of 30 days
CPU usage over the period of 7 days
The same bin packing benefits are evident in a 7-day usage pattern. CAST AI provides the capacity needed for the workloads without wasting resources.
Average current vs. optimal cost
Thanks to the CAST AI autoscaler, PlayPlay’s production cluster cost reached an optimal state. This result was achieved by picking the cheapest instance type, maximizing the Spot VM adoption for the stateless workloads, and bin packing the cluster.
Rebalancing
Rebalancing allows clusters to reach the most optimal and up-to-date state with a single click. During this process, CAST AI automatically replaces suboptimal nodes with new ones that are more cost-efficient and run the most up-to-date configuration.
PlayPlay uses the rebalancing feature in its dev cluster every night. The following table shows the savings achieved from the nightly rebalancing.
The savings were achieved by eliminating the wasted CPU capacity and compacting the cluster to fewer nodes. The Scheduled Rebalancing feature allowed PlayPlay to keep the cluster cost continuously at an optimal level.
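The arithmetic behind this kind of saving can be sketched as follows (the node sizes, utilization, and per-vCPU price below are hypothetical, not PlayPlay's real figures):

```python
def rebalancing_saving(nodes, price_per_cpu_hour, hours=730):
    """Estimate the monthly saving from compacting underutilized nodes.

    nodes: list of (cpu_capacity, cpu_requested) tuples, one per node.
    Assumes a perfectly packed cluster would provision only what
    workloads actually request."""
    provisioned = sum(cap for cap, _ in nodes)
    requested = sum(req for _, req in nodes)
    wasted_cpus = provisioned - requested
    return wasted_cpus * price_per_cpu_hour * hours

# Five 16-vCPU nodes whose workloads request only 8 vCPUs each:
# 40 idle vCPUs could be released by compacting onto fewer nodes.
saving = rebalancing_saving([(16, 8)] * 5, price_per_cpu_hour=0.03)
print(round(saving))  # → 876 ($ per month)
```

Running this nightly, as PlayPlay does with Scheduled Rebalancing, keeps the gap between provisioned and requested capacity from accumulating.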
Rebalancing in action
The screenshot below illustrates one rebalancing operation in which CAST AI replaced 13 nodes without impacting service availability, saving PlayPlay $1,430 per month.
Spot Virtual Machines automation
To optimize costs, PlayPlay runs its production cluster on Spot VMs. CAST AI handles interruptions automatically, with zero impact on service availability.
The CAST AI prediction model can proactively identify a VM family that may soon be interrupted and react immediately by provisioning another Spot VM from a different family. If no Spot VM is available, CAST AI automatically falls back to on-demand instances without requiring any manual intervention.
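The spot-with-fallback pattern described above can be sketched like this (function names and families are illustrative, not CAST AI's API):

```python
class CapacityError(Exception):
    """Raised when the cloud has no capacity for the requested instance."""

def provision(request_spot, request_on_demand, families):
    """Try spot capacity across instance families; fall back to on-demand.

    Returns (instance, is_spot). The request_* callables are expected to
    raise CapacityError when the requested capacity is unavailable."""
    for family in families:
        try:
            return request_spot(family), True
        except CapacityError:
            continue  # family reclaimed or out of capacity; try the next one
    # No spot capacity anywhere: fall back to on-demand automatically.
    return request_on_demand(families[0]), False

# Demo with fake provisioners: the first spot family is unavailable,
# the second succeeds, so no on-demand fallback is needed.
def fake_spot(family):
    if family == "n2d":
        raise CapacityError
    return f"spot:{family}"

instance, is_spot = provision(fake_spot, lambda f: f"ondemand:{f}", ["n2d", "e2"])
print(instance, is_spot)  # → spot:e2 True
```

The key property is that the caller always gets capacity: the fallback path is taken automatically, which is what removes the manual spot lifecycle management described earlier.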
CAST AI is an all-in-one solution that delivers all the optimization and reporting features we need:
- Cost visibility and monitoring via detailed dashboards,
- Ability to optimize costs using the autoscaler and automated instance selection,
- Spot VM automation with fallback management.
On top of that, it was simple to implement and easy to integrate into our infrastructure.
Clément Hémidy
Senior DevOps Engineer at PlayPlay
Managing peak usage periods in Kubernetes
How large is your cloud infrastructure and what goals does it need to meet to support your business operations?
We run our applications on the Google Cloud Platform and use 1600 CPUs every day.
Our infrastructure supports several aspects of our business operation:
- A web application for our SaaS product,
- Compute workloads required for media transformation and video rendering,
- Storage of media required for video creation (300 TB).
We need an infrastructure that delivers high availability (99.8% targeted), resilience, and scalability, as we need to provision a significant amount of resources in a very short time frame when customers request video rendering.
What challenges did you face in managing your spiky workloads?
Our biggest challenge is managing the scalability of our infrastructure in accordance with changing usage and workloads. Our workloads require a lot of resources in short time periods, so we need to be very reactive when providing new resources. This means we need to use and understand many metrics in order to provision quickly and efficiently.
If we fail to manage these spikes, they impact end users by extending the time it takes to create a video. Since we provide our product to large B2B companies, our application needs excellent performance and availability.
Each time a customer wants to render a video, whether it is one minute or 10 minutes long, we need to provision enough resources in the cloud to do it quickly. So, we need to have enough CPUs available. The challenge is anticipating all the needs because our customers are spread across Europe and the US.
The objective is to strike a balance between performance and cost to avoid overprovisioning while simultaneously delivering optimal service to customers.
Jérémy Fridman
Head Of Information Security at PlayPlay
What problem did you hope to solve around cost optimization when searching for solutions?
Our main goal was to choose compute instances at the best price and appropriate configuration to support our application and its real-time demands while optimizing the cost.
At first, we relied on Google Cloud Platform’s native autoscaler, followed by some manual adjustments. However, this approach simply wasn’t sustainable, as it was impossible to manage our infrastructure in real time, and the costs left room for optimization.
Clément Hémidy
Senior DevOps Engineer at PlayPlay
Which cost optimization measures have you implemented before within your Kubernetes environment?
We were using Spot VMs, but manual management was time-consuming and difficult to track and report. Since our optimizations were based on manual work, they were not in line with real application demand and traffic.
Automation was the solution
How did CAST AI help you solve your key challenges?
We were looking for a solution to optimize costs in a cloud environment and started a POC with CAST AI.
What attracted us to the platform were the following key features:
- Cost visibility and monitoring via detailed dashboards,
- Ability to optimize costs using the autoscaler and automated instance selection,
- Spot VM automation with fallback management.
It quickly turned out that CAST AI is an all-in-one solution that delivers all the optimization and reporting features we need. It was also simple to implement and integrate into our infrastructure.
How has CAST AI automation improved your ability to scale resources in response to workload spikes?
Positive impacts were directly visible once we integrated CAST AI. The platform optimizes our costs by selecting the most cost-efficient compute instances. Using the CAST AI autoscaler didn’t impact our availability.
A boost to engineer productivity
How does CAST AI impact your team’s engineering workload?
Since everything is managed automatically by CAST AI without any manual intervention by engineers, our team has saved a lot of time.
Given that we’re a startup in a really challenging market, the ability to reallocate resources to other, more innovative topics is essential to us. Our DevOps team is small, so their time is precious.
Jérémy Fridman
Head Of Information Security at PlayPlay
How was the support throughout the POC and the implementation of CAST AI?
CAST AI is really easy to implement, and we have good internal infrastructure and orchestration skills, so the POC was a quick success. The CAST AI support team provided really good follow-up throughout the POC.
How has CAST AI impacted your security?
Security is a big pillar of our strategy. The first added value CAST AI brings is compliance, as we’re targeting ISO 27001 certification, so compliance with standards and best practices at the infrastructure level is really important for us.
Secondly, we’re using vulnerability management to easily identify vulnerabilities in our infrastructure, qualify them, and plan their remediation.
What next steps are you planning to take to extend your use of CAST AI?
We’re investigating a few new features to see how we could benefit from them. But overall, we’re really excited about the tool’s future and the possibility of collaborating with you, and we feel empowered to impact the roadmap.
CAST AI is a good “companion” for using Kubernetes in a cloud environment. Its reporting and monitoring interface is great and easy to use. It provides a good solution to optimize costs, is easy to integrate, and delivers quick results. CAST AI is the orchestrator of our orchestrators.
Clément Hémidy
Senior DevOps Engineer at PlayPlay