Many modern applications include batch processing and regularly run high-volume, repetitive data jobs. If you use the cloud and cloud-native technologies like Kubernetes, this area offers a great opportunity for cost optimization through approaches like event-driven autoscaling and spot instance automation.
4 tactics for cost-efficient and resilient batch processing
1. Event-driven autoscaling for cost reduction
The Kubernetes ecosystem has an open-source solution for event-driven autoscaling called KEDA (Kubernetes Event-Driven Autoscaling), which helps provide a more responsive and dynamic scaling solution for batch jobs.
For example, KEDA can scale the number of job replicas based on the number of messages waiting in a Kafka queue, adjusting capacity in near real time to match workload demand.
With KEDA, you can deploy a function (or a microservice) to your Kubernetes cluster and specify how you want it to scale in response to events. This allows your applications to be more reactive and scalable. It also slashes costs by allowing your applications to scale down to zero when they are not in use.
What makes event-driven autoscaling worth it?
Using an event-driven model offers significant advantages over traditional CI/CD scaling approaches.
First of all, it reduces the time it takes to respond to changes in demand by shortening the feedback loop. Rather than waiting for the entire CI/CD cycle to adjust the scale, KEDA can react almost instantly to workload changes. This opens the door to more efficient resource usage and improved responsiveness to fluctuating demand.
But that’s not everything. KEDA also allows the number of pods for a given batch process to scale to zero, which translates into significant cost savings.
How does this work?
KEDA enables Kubernetes deployments to scale based on event sources such as the length of a Kafka queue, a RabbitMQ queue, or an Azure Service Bus queue, among many others. It extends Kubernetes to support a wide range of event-driven scaling sources without requiring changes to your application.
KEDA works by defining custom metrics for a Kubernetes deployment and then automatically scaling the deployment up and down based on those metrics. The scaling is event-driven, which means that it responds to external events, such as messages being added to a queue.
Example of event-driven autoscaling with KEDA
Here’s a simple example of how a KEDA-scaled job might work with a Kafka queue:
- Suppose you have a batch job that processes messages from a Kafka queue. The job is designed to process one message at a time and runs inside a Kubernetes pod.
- You configure KEDA to monitor the length of the Kafka queue. KEDA is set up to create one pod for every 10 messages in the queue.
- When the queue has 50 messages, KEDA scales up the number of pods to 5. Each pod processes one message at a time, so now your system can process five messages simultaneously.
- As messages are processed and removed from the queue, KEDA scales down the number of pods. If the queue goes down to 20 messages, it scales back to two pods.
- If the queue is empty, KEDA scales down to zero pods, meaning no resources are used when there’s no work to do.
The above is a simplified example of a real-world scenario.
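The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the ratio described above, not KEDA's API: the function name `desired_pods` and the `max_pods` cap are assumptions made for the sketch, and KEDA itself also applies settings such as polling intervals and cooldown periods that are ignored here.

```python
import math


def desired_pods(queue_length: int, messages_per_pod: int = 10, max_pods: int = 100) -> int:
    """Return the pod count for a given queue length (simplified model)."""
    if queue_length <= 0:
        return 0  # scale to zero when there is no work
    # One pod per `messages_per_pod` messages, rounded up and capped.
    return min(max_pods, math.ceil(queue_length / messages_per_pod))


for length in (50, 20, 0):
    print(length, "messages ->", desired_pods(length), "pods")
```

Rounding up means a queue holding a single message still gets one pod, while the cap keeps a sudden burst from requesting more pods than the cluster can reasonably host.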
When designing an event-driven scaling solution, consider factors such as the time it takes to start up a new pod, the processing time of a message, and the cost of running pods.
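A rough way to weigh these factors against each other is a back-of-the-envelope estimate like the one below. It is a sketch under simplifying assumptions: every pod starts at the same moment, each message takes the same time, and pods are billed for the whole run. The function name `estimate_run` and all input figures are hypothetical.

```python
import math


def estimate_run(messages: int, pods: int, startup_s: float,
                 per_message_s: float, price_per_pod_hour: float):
    """Estimate wall-clock seconds and cost to drain a queue with a fixed pod count."""
    if pods <= 0:
        raise ValueError("pods must be positive")
    # Each pod handles one message at a time, so work proceeds in rounds.
    rounds = math.ceil(messages / pods)
    wall_clock_s = startup_s + rounds * per_message_s
    cost = pods * (wall_clock_s / 3600) * price_per_pod_hour
    return wall_clock_s, cost


# Hypothetical inputs: 50 messages, 30 s pod startup, 60 s per message, $0.10 per pod-hour.
for pods in (1, 5):
    seconds, cost = estimate_run(50, pods, 30, 60, 0.10)
    print(f"{pods} pod(s): {seconds:.0f} s, ${cost:.4f}")
```

With these made-up numbers, five pods finish much sooner than one but cost slightly more, because the startup time is paid once per pod. When startup time is large relative to processing time, scaling out aggressively buys less than it seems.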
2. Spot instances for cost efficiency and better performance
By running batch processing jobs on spot instances, you can save up to 90% compared with on-demand prices.
Spot instances are spare cloud capacity that providers sell at significantly lower prices, presenting a substantial opportunity for cost savings. The trade-off is that the provider can reclaim them at short notice.
Choosing higher-capability instances, such as compute-optimized, IO-optimized, or network-optimized ones, may result in faster batch job execution at a fraction of the usual costs.
We have successfully put this strategy into practice for our own operations and those of our clients. So far, we’ve seen superior performance and cost-effectiveness without compromising the quality of batch processing jobs.
A good spot instance automation solution should deliver the following features to help reduce the costs of running batch jobs in the cloud:
- Automated provisioning and termination – it rapidly assesses a workload’s needs, locates the best match among available spot instances, and provisions these resources. When there are no more jobs to be done, the tool automatically shuts down instances. You don’t want to spend money on resources that don’t provide value to your organization, even if they’re as cheap as spot instances.
- Falling back to on-demand during spot drought – there may be times during the year when there is a scarcity of spot instances. To reduce the risk of downtime, automation solutions can move workloads from spot instances to on-demand instances as needed. This reduces the risk of interruption and ensures that all workloads have a place to execute even when no spot resources are available.
- Partial use of spot instances – it is critical that you select an automation tool that is not a black box and allows you to configure it. CAST AI offers all the features listed here, allowing you to run only a fraction of workloads on spot instances without modifying manifest files.
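The fallback and partial-use behaviors in the list above can be illustrated with a small placement sketch. It is a hypothetical model, not how any particular automation product works: `plan_capacity`, `spot_percent`, and `spot_available` are names invented for this example.

```python
def plan_capacity(replicas: int, spot_percent: int, spot_available: int) -> dict:
    """Split replicas between spot and on-demand capacity.

    spot_percent   -- share of replicas that should run on spot (0-100)
    spot_available -- how many spot slots the market can supply right now
    """
    if not 0 <= spot_percent <= 100:
        raise ValueError("spot_percent must be between 0 and 100")
    wanted_spot = replicas * spot_percent // 100
    # During a spot drought, whatever cannot be placed on spot falls back to on-demand.
    spot = min(wanted_spot, spot_available)
    return {"spot": spot, "on_demand": replicas - spot}


print(plan_capacity(10, 70, spot_available=100))  # plenty of spot capacity
print(plan_capacity(10, 70, spot_available=4))    # partial drought
print(plan_capacity(10, 70, spot_available=0))    # full fallback to on-demand
```

Every replica always gets a place to run. Only the mix between cheap and guaranteed capacity changes as spot availability shifts.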
3. Reentrant batch processing and automated testing
Reentrant jobs can be safely paused, resumed, or restarted without unintended side effects or errors. This is crucial for robust and reliable processing, particularly in a distributed, cloud-native environment riddled with interruptions.
To achieve reentrancy, leverage the built-in checkpointing functionality of your chosen system.
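For example, Kafka tracks each consumer group's progress through committed offsets, which act as a checkpoint that a restarted job can resume from. When offsets are committed only after processing, Kafka provides “at least once” delivery semantics: a message is guaranteed to be delivered at least once, but it may be delivered more than once. Your processing logic therefore has to tolerate duplicates, and designing for that is what keeps batch jobs robust under interruption.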
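Because at-least-once delivery can hand the same message to a job twice, a common defense is to make the handler idempotent. The sketch below shows one simple way to do that by remembering processed message IDs. It uses an in-memory set purely for illustration (a real job would need durable storage), involves no Kafka client code, and all names in it are hypothetical.

```python
class IdempotentProcessor:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.processed_ids = set()  # assumption: stands in for a durable store
        self.total = 0

    def handle(self, message_id: str, amount: int) -> bool:
        """Process a message; return False if it was a duplicate and skipped."""
        if message_id in self.processed_ids:
            return False
        self.total += amount
        self.processed_ids.add(message_id)
        return True


processor = IdempotentProcessor()
# "m1" arrives twice, as can happen when a consumer restarts before committing its offset.
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]
for message_id, amount in deliveries:
    processor.handle(message_id, amount)

print(processor.total)
```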
But relying solely on Kafka’s built-in features isn’t enough. Consider implementing comprehensive automated testing to verify that all batch processing tasks are reentrant.
This includes creating tests that deliberately interrupt and resume tasks and validate that they can be successfully completed even when interruptions occur. Automated testing will help you discover and resolve issues that could affect reentrancy, leading to more reliable batch job processing and overall system resilience.
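A minimal sketch of such a test follows. The job keeps a checkpoint of the next item to process, the test forces an interruption partway through, and a second run must finish the work with the same result as an uninterrupted run. The job, the `Interrupted` exception, and the `fail_at` hook are all hypothetical stand-ins for your own batch logic and fault injection.

```python
class Interrupted(Exception):
    """Simulates a pod being evicted in the middle of a run."""


def run_job(items, checkpoint, results, fail_at=None):
    """Process items starting from the checkpoint; optionally fail at an index."""
    for index in range(checkpoint["next"], len(items)):
        if index == fail_at:
            raise Interrupted(f"interrupted before item {index}")
        results[index] = items[index] * 2  # keyed write, so repeating it is harmless
        checkpoint["next"] = index + 1     # record progress only after the work is done


def test_job_is_reentrant():
    items = [1, 2, 3, 4, 5]
    expected = {i: value * 2 for i, value in enumerate(items)}

    checkpoint, results = {"next": 0}, {}
    try:
        run_job(items, checkpoint, results, fail_at=3)
    except Interrupted:
        pass
    assert checkpoint["next"] == 3          # progress survived the interruption

    run_job(items, checkpoint, results)     # resume from the checkpoint
    assert results == expected              # same outcome as an uninterrupted run


test_job_is_reentrant()
```

The ordering inside `run_job` matters: the checkpoint advances only after the result is written, so a crash between the two steps causes a repeat of one item rather than a skipped one.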
4. “Thin” messaging for efficient message delivery
In the context of messaging systems like Kafka, “thin” messaging refers to a strategy where the messages being sent to consumers from the Kafka queue are small and lightweight.
Rather than including the entire data payload, these thin messages typically include a reference or a pointer to the actual data. The actual data, in turn, is usually stored in a scalable and durable storage service, often referred to as object storage.
Because every message points to the same stored object, the data is consistent from a transactional perspective: all consumers see a single, trustworthy version of the truth regardless of when or where they access it.
This strategy can be particularly beneficial in systems where the data payloads are large or where there are many consumers that need to access the data. By keeping the messages thin, the system can ensure fast, efficient delivery of messages to consumers. Then, each consumer can retrieve the full data payload as needed, spreading out the load on the object storage system.
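The pattern can be sketched with in-memory stand-ins. Below, a dictionary plays the role of object storage and a list plays the role of the Kafka queue. Neither is a real client, and the `payload_key` field name is an assumption chosen for the example.

```python
import json
import uuid

object_store = {}  # stand-in for an object storage bucket
queue = []         # stand-in for the message queue


def publish(payload: bytes) -> str:
    """Store the payload, then enqueue a thin message that only references it."""
    key = f"payloads/{uuid.uuid4()}"
    object_store[key] = payload
    queue.append(json.dumps({"payload_key": key, "size": len(payload)}))
    return key


def consume() -> bytes:
    """Take the next thin message and fetch the full payload it points to."""
    message = json.loads(queue.pop(0))
    return object_store[message["payload_key"]]


big_payload = b"x" * 1_000_000
publish(big_payload)
thin_size = len(queue[0])  # size of the message actually travelling through the queue
restored = consume()

print(f"payload: {len(big_payload)} bytes, message: {thin_size} bytes")
```

The payload is written to storage before the message is enqueued, so a consumer never receives a pointer to data that does not exist yet.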
This also reduces network traffic and can increase the overall throughput of the messaging system. Adopting a “thin” messaging strategy in the design of batch jobs is a smart move, letting your batch jobs scale in phases.
One set of consumers can be tasked with preparing the input for batch processes based on relational data pointers. Further stages in the batch process can then produce results that are written to reference systems.
Batch processing gets easier with automation
From a cost optimization perspective, adopting KEDA and spot instances can result in substantial savings, with potential cost reductions of up to 90% for complex batch job processes.
This approach allows the system to scale down to zero when workload demand is low, maximizing cost efficiency, while maintaining the ability to scale the system seamlessly as workload demands increase.
Both approaches require time-consuming configuration and constant monitoring – unless you implement a solution that does these jobs for you. Automation platforms designed with Kubernetes in mind easily handle autoscaling and spot instances to help you run batch jobs efficiently.