Company
ShareChat is India’s largest homegrown social media company and the country’s largest Google Cloud Platform customer. ShareChat counts 325+ million monthly active users across its brands, ShareChat and Moj.
Challenge
ShareChat runs over 90% of its infrastructure on Kubernetes, automatically scaling to handle almost 7 billion web requests daily while delivering a great user experience.
ShareChat was looking for a third-party solution that would help optimize its massive Kubernetes deployment, streamline autoscaling capabilities, and improve resource usage.
Solution
ShareChat leveraged Cast AI’s rebalancing feature to optimize its cloud infrastructure, replacing inefficient nodes and automating workload distribution across clusters – leading to immediate cloud cost savings.
The flexibility offered by Cast AI was critical in addressing one of ShareChat’s biggest challenges: optimizing Kubernetes node pools. By allowing dynamic provisioning and the use of various machine types, Cast AI enabled ShareChat to meet fluctuating demands efficiently.
Additionally, Cast AI’s automation raised ShareChat’s utilization of Committed Use Discounts (CUDs) to almost 98% and cut capacity planning efforts from twice weekly to once every two months.
Results
By implementing Cast AI, ShareChat runs a well-optimized infrastructure with an extremely low level of cloud waste. This comes at no added effort for the SRE team, as automation dramatically reduces engineering toil and makes the team more efficient.
For the infrastructure and DevOps team, management effort such as manual rightsizing, node pool creation, and upgrades is drastically reduced with Cast AI, so the team can focus on building products that improve developer productivity.
Any modern company that is running on Kubernetes should try Cast AI, no matter the scale of their infrastructure. The platform combines many functionalities for which one would need separate products.
Jenson C S,
Senior Engineering Manager at ShareChat
Rebalancing
ShareChat uses a feature that delivers instant cloud cost savings: rebalancing, in which Cast AI replaces suboptimal nodes with better-suited ones and moves workloads automatically, helping clusters quickly reach an optimal state.
Rebalancing history showing the level of savings achieved without any manual intervention
We use rebalancing extensively. That way our infrastructure stays optimized, thanks to the right configurations and choice of the right machine types, saving us considerable resources. We are already witnessing significant savings and expect to achieve more than a million in annual savings.
Jenson C S,
Senior Engineering Manager at ShareChat
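The mechanics can be pictured with a small, hypothetical sketch: given the aggregate CPU requests on a cluster and a catalog of machine types with their prices, a rebalancing pass swaps the current nodes for the cheapest set of nodes that still fits the requests. The instance names, prices, and greedy heuristic below are illustrative assumptions, not Cast AI’s actual algorithm or API.

```python
# Illustrative sketch only: a simplified view of what a rebalancing pass does.
# Instance names, prices, and the greedy selection are hypothetical examples.

from dataclasses import dataclass


@dataclass
class NodeShape:
    name: str
    vcpu: int
    hourly_usd: float  # illustrative on-demand price


CATALOG = [
    NodeShape("e2-standard-8", 8, 0.27),
    NodeShape("n2d-standard-16", 16, 0.68),
    NodeShape("e2-standard-32", 32, 1.07),
]


def cheapest_fit(requested_vcpu: int) -> tuple[list[NodeShape], float]:
    """Greedily add the node shape with the lowest cost per vCPU
    until the aggregate pod CPU requests fit."""
    best = min(CATALOG, key=lambda s: s.hourly_usd / s.vcpu)
    chosen, remaining = [], requested_vcpu
    while remaining > 0:
        chosen.append(best)
        remaining -= best.vcpu
    return chosen, sum(s.hourly_usd for s in chosen)


# Current (suboptimal) fleet vs. a rebalanced plan for the same requests.
current_cost = 6 * 1.07             # e.g. six underutilized 32-vCPU nodes
plan, plan_cost = cheapest_fit(96)  # pods actually request ~96 vCPU
print(f"current ${current_cost:.2f}/h -> rebalanced ${plan_cost:.2f}/h "
      f"on {len(plan)} x {plan[0].name}")
```

In this toy example the rebalanced plan consolidates the same requested capacity onto fewer, better-utilized nodes, which is the kind of waste the screenshots above quantify.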
Bringing flexibility to Kubernetes node pools
A key challenge for ShareChat’s DevOps team was manually optimizing node pools. Kubernetes provides functionality like autoscaling, but dynamic, flexible node pool provisioning is missing. With constantly changing applications and varied developer requests, a predefined node pool size can cause availability issues if not configured properly.
“We don’t want to stick with one machine type, size, or family; we need different types to run and scale our platform. For example, we wanted to be able to choose the percentage of nodes that would run on on-demand and spot instances for different types of workloads. Cast AI offers us this flexibility,” said Jenson C S.
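As a rough illustration of that idea, the hypothetical sketch below splits a node request between spot and on-demand capacity according to a per-workload percentage. The workload classes and percentages are invented for illustration and are not ShareChat’s real settings or Cast AI’s configuration format.

```python
# Hypothetical illustration of a per-workload spot/on-demand split.
import math

SPOT_SHARE = {          # fraction of nodes allowed on spot, per workload class
    "stateless-web": 0.70,
    "ml-batch": 1.00,
    "stateful-core": 0.00,
}


def plan_nodes(workload: str, nodes_needed: int) -> dict:
    """Split a node request into spot and on-demand counts."""
    spot = math.floor(nodes_needed * SPOT_SHARE.get(workload, 0.0))
    return {"spot": spot, "on_demand": nodes_needed - spot}


for workload in SPOT_SHARE:
    print(workload, plan_nodes(workload, 20))
```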
Optimizing Committed Use Discounts utilization
ShareChat used resource-based Committed Use Discounts to optimize the costs for steady workloads. To benefit from the discounts, team members spent a lot of time figuring out which percentage of a given cluster could run on discounted resources. This is where Cast AI’s automation made a difference.
We had to tweak the setup a lot to get to the sweet spot. This became inefficient because you cannot drive synergies between different clusters of various sizes if they leverage different families at the same time. Another question was running it cost-efficiently at night when we have much lower traffic. In some of our commitments, we wasted up to 35% of resources that were already paid for.
Obviously, I could just set a static percentage across all clusters and configure a cron job to change that percentage as per our traffic pattern through the day. To be honest, that would never scale well because, at the end of the day, it’s still a static number calculated manually. This is why we decided to work with Cast AI to develop a solution that would dynamically set those percentages in real time because Cast AI has a bird’s eye view of our inventory at a given point.
Now, I don’t have to do anything manually and we’re close to 98% commitment utilization. I used to do capacity planning twice a week for CUD management – now I do that once every two months.
Abhiroop Soni,
Staff Engineer – DevOps at ShareChat
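The effect of the dynamic approach can be pictured with a hypothetical sketch: a static percentage tuned for peak traffic lets commitment utilization sag whenever demand falls, while a demand-aware split keeps the committed capacity full. All numbers below are invented for illustration and do not reflect ShareChat’s actual commitments or Cast AI’s implementation.

```python
# Hypothetical sketch: static vs. demand-aware routing of workloads onto
# committed (CUD-covered) capacity. All figures are invented examples.

COMMITTED_VCPU = 4000   # vCPUs covered by resource-based CUDs
STATIC_SHARE = 0.40     # a fixed split tuned for peak traffic


def utilization(demand: int, share: float) -> float:
    """Fraction of the commitment used when `share` of demand runs on CUDs."""
    return min(demand * share, COMMITTED_VCPU) / COMMITTED_VCPU


def dynamic_share(demand: int) -> float:
    """Route as much demand as the commitment can absorb, and no more."""
    return min(1.0, COMMITTED_VCPU / demand) if demand else 0.0


for label, demand in [("peak", 10000), ("evening", 6000), ("night", 4500)]:
    print(f"{label:>7}: static {utilization(demand, STATIC_SHARE):.0%} vs "
          f"dynamic {utilization(demand, dynamic_share(demand)):.0%}")
```

In this toy scenario the static split only keeps the commitment full at peak, echoing the up-to-35% waste described above, while the demand-aware split stays near full utilization around the clock.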
The image below shows the period before optimization, when resource utilization would often fail to reach the line marking committed capacity:
After implementing Cast AI’s solution, capacity utilization increased to around 98%:
Get results like ShareChat – book a demo with Cast AI
Reducing the engineering workload
The ShareChat team no longer needs to carry out quota management internally since Cast AI has the flexibility to choose the next available quota. Before implementing Cast AI, ShareChat found this part of its cloud management hectic and time-consuming. “There’s no babysitting with the quotas anymore. This task is completely gone from our team schedule,” added Jenson C S.
Another benefit of implementing Cast AI was the increase in the speed of internal support. “We have an internal support tool. When I compare it before and after Cast AI, at least 10 tickets per week for node pool management and customization are gone. The main advantage is now developers no longer need to wait for infrastructure provisioning – we don’t want to be a blocker for shipping features, so that’s great,” said Jenson C S.
For the infrastructure and DevOps team, Kubernetes management effort such as manual rightsizing, node pool creation, and upgrades is drastically reduced, so the DevOps team can focus on building products that improve developer productivity.
Jenson C S,
Senior Engineering Manager at ShareChat
The image below shows how Cast AI’s autoscaler reduces the gap between the requested and provisioned CPUs in a large cluster that experiences traffic spikes:

Outstanding support and partnership with a long-term outlook
While developing its relationship with Cast AI, ShareChat has become the platform’s design partner for several features, leading to many improvements and the delivery of a stable solution.
“The improvements of Cast AI over the last couple of months were amazing. The support team from Cast AI are really helpful and friendly. They’re always available and ready to improve their product,” said Jenson C S.
Overall, it was a wonderful experience with Cast AI. Personally, I’d recommend Cast AI to anyone – it’s one of the best vendors we’ve interacted with in terms of relationship and support. The platform was able to show its value in under one month.
Jenson C S,
Senior Engineering Manager at ShareChat