The Hackathon Fix That Cut Our Storage Costs by 93%

For the second year running, Cast AI hosted an internal hackathon during our Vilnius team gathering. The goal: innovate on bold new ideas and potential moonshots, or improve our existing product and developer experience. This blog post showcases a smart storage solution developed at the latest hackathon.

Mike Norgate

Every 15 seconds, Cast AI collects a complete snapshot of thousands of Kubernetes clusters around the world. That’s millions of snapshots per day, each containing detailed information about pods, nodes, volumes, and dozens of other resources.

The problem? This adds up to over a petabyte of data stored every month, costing upwards of $25,000 a month. To cap it off, most of that data is duplicated from one snapshot to the next. Most clusters don’t change that much in a 15-second window – maybe a couple of new pods or a tweak to a deployment’s resources – but even a small change means storing an entire snapshot.

In this article, I show you how we changed the way we store, access, and use snapshots, achieving a 93% reduction in storage and 82% faster processing. 

Why Snapshots Matter

Cluster snapshots are the foundation of everything Cast AI does. Think of them as a time machine for your Kubernetes infrastructure – capturing every detail of every cluster at any point in time.

This data powers critical features across our platform:

  • Cost Optimization – Our node and workload autoscaling teams use snapshots to monitor workloads and their usage patterns, identifying opportunities to optimize compute within clusters. Without this historical data, we’d be making decisions blind.
  • Customer Reporting – When customers ask “how much did I spend last month?” or “which namespace is consuming the most resources?”, snapshots provide the answers. Every usage report, cost breakdown, and efficiency metric in our UI starts here.
  • Machine Learning – Our ML team trains models on historical snapshot data to predict Spot Instance interruptions and optimize spot reliability. More data means smarter, more accurate predictions.
  • Customer Support – When something goes wrong, our CS team uses historical snapshots to troubleshoot issues and help customers debug problems. Being able to see exactly what a cluster looked like at the time of an incident is invaluable.

The bottom line: snapshots aren’t just nice to have – they’re mission-critical infrastructure that powers everything we do.

The Cost of Keeping it Simple

Our original snapshot system worked like this:

  1. The agent installed in the customer’s cluster collects changes every 15 seconds
  2. Changes are folded into the previous snapshot to generate a new one
  3. The new snapshot is uploaded to cloud storage
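
Step 2, the fold, can be sketched in a few lines of Python. This is an illustration only: the `apply_changes` name, the change-event fields (`op`, `kind`, `name`, `object`) and the dict-of-dicts snapshot layout are assumptions made for the example, not Cast AI's actual data model.

```python
import copy
import json


def apply_changes(previous: dict, changes: list[dict]) -> dict:
    """Fold a batch of change events into a copy of the previous snapshot."""
    snapshot = copy.deepcopy(previous)
    for change in changes:
        resources = snapshot.setdefault(change["kind"], {})
        if change["op"] == "delete":
            resources.pop(change["name"], None)
        else:  # "upsert": add a new resource or overwrite an existing one
            resources[change["name"]] = change["object"]
    return snapshot


# Hypothetical example data.
previous = {"pods": {"api-0": {"cpu": "100m"}, "api-1": {"cpu": "100m"}}}
changes = [
    {"op": "upsert", "kind": "pods", "name": "api-0", "object": {"cpu": "200m"}},
    {"op": "delete", "kind": "pods", "name": "api-1"},
]
current = apply_changes(previous, changes)

# In this model, every cycle serializes the *whole* snapshot for upload,
# no matter how few resources actually changed.
payload = json.dumps(current).encode()
```

The point of the sketch is the last line: the upload size tracks the size of the cluster, not the size of the change.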

Seems simple enough, and it is. This system has worked reliably for years, but as the number of clusters we manage grew, cracks started to appear.

The Storage Problem

We’re collecting millions of snapshots per day. The number of clusters managed by Cast AI keeps growing, and so does the size of each snapshot file as we take on larger and larger clusters. A single snapshot can be anywhere from a few hundred kilobytes to 100+ megabytes.

Over time, we took steps to reduce the storage requirements – limiting the data we collected, adjusting retention periods – but we were still spending hundreds of thousands per year on storage alone.

The Access Problem

When a snapshot is generated, it gets processed by over 15 individual services, each interested in different parts. This meant each service had to download the entire snapshot, deserialize its content, and only then could it start looking for the information it cared about. As the size of snapshots grew, this increased both the time and the compute resources required.

A service that only needed to check pod status would still download 50 MB of data, parse the entire JSON structure, then extract the one field it needed. Multiply that by 15 services processing every snapshot, and you can see the problem.

The Hackathon That Changed Everything

During our 2025 hackathon, a team of engineers gathered around a whiteboard with a simple question: “How can we do this better?”

For years, we’d been optimizing our snapshot system – tweaking compression, pruning unnecessary data, adjusting retention periods. But we were still burning through storage costs, and query times kept getting worse as clusters grew.

The breakthrough came from a different question: “Why are we storing so much duplicate data?”

Between snapshots taken 15 seconds apart, most clusters are nearly identical. Maybe a pod scaled up. Maybe a deployment’s resource request changed. But 99% of the cluster? Exactly the same as it was 15 seconds ago.

This led to a complete redesign based on three core ideas:

  1. Store complete “base” snapshots only periodically (hourly instead of every 15 seconds)
  2. Between bases, store only the actual differences
  3. Use smart compression and file structures to optimize both storage and access
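
The first two ideas can be sketched as a base plus a chain of deltas. This is a rough model under stated assumptions: the `SnapshotChain` class, its `changed`/`removed` delta shape, and the flat name-to-value state are hypothetical and chosen only to keep the example short.

```python
from dataclasses import dataclass, field

SNAPSHOT_INTERVAL_S = 15
BASE_INTERVAL_S = 3600
# One full base per hour, followed by one delta per remaining 15 s interval.
DELTAS_PER_BASE = BASE_INTERVAL_S // SNAPSHOT_INTERVAL_S - 1


@dataclass
class SnapshotChain:
    base: dict                                   # full state at the start of the hour
    deltas: list = field(default_factory=list)   # one entry per later interval

    def append(self, changed: dict, removed: list) -> None:
        self.deltas.append({"changed": changed, "removed": removed})

    def state_at(self, index: int) -> dict:
        """Rebuild the snapshot `index` intervals after the base."""
        state = dict(self.base)
        for delta in self.deltas[:index]:
            state.update(delta["changed"])
            for name in delta["removed"]:
                state.pop(name, None)
        return state


chain = SnapshotChain(base={"pod-a": "100m", "pod-b": "250m"})
chain.append(changed={"pod-a": "200m"}, removed=[])
chain.append(changed={"pod-c": "50m"}, removed=["pod-b"])
latest = chain.state_at(2)
```

In a scheme like this, reading a snapshot means replaying every delta since the last base, so the base interval bounds the replay cost: with hourly bases and 15-second snapshots, at most 239 deltas.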

Internally, this became known as Snapshots V2.

Six months later, it went into production.

Three Innovations, One Solution

Snapshots V2 is built around three key innovations:

1. Custom Binary Format with Selective Loading

The biggest change is the creation of a custom binary format that takes advantage of cloud storage features. At the heart of this format is an index that describes where each section lives within the file.

Think of it like a book with a detailed table of contents. Instead of reading the entire book to find one chapter, you can jump directly to the page you need.

This index allows services to use HTTP range requests to download only the parts they actually care about. Need pod information? Download just that section. Need nodes and services? Download only those. Need everything? Download it all.
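
To make the idea concrete, here is a minimal sketch of an indexed file and a reader that fetches one section. The layout (a 4-byte length prefix, a JSON index, then the section bytes) and the function names are assumptions for illustration; the post does not describe the real on-disk format. `range_read` stands in for an HTTP GET with a `Range` header against cloud storage.

```python
import json
import struct


def build_file(sections: dict[str, bytes]) -> bytes:
    """Lay out the file as: [4-byte index length][JSON index][section bytes...]."""
    index, offset = {}, 0
    for name, body in sections.items():
        index[name] = {"offset": offset, "length": len(body)}
        offset += len(body)
    index_bytes = json.dumps(index).encode()
    return struct.pack(">I", len(index_bytes)) + index_bytes + b"".join(sections.values())


def range_read(blob: bytes, start: int, length: int) -> bytes:
    """Stand-in for an HTTP GET with `Range: bytes=<start>-<start+length-1>`."""
    return blob[start:start + length]


def read_section(blob: bytes, name: str) -> bytes:
    # First small read: how long is the index?
    (index_len,) = struct.unpack(">I", range_read(blob, 0, 4))
    # Second read: the index itself.
    index = json.loads(range_read(blob, 4, index_len))
    entry = index[name]
    data_start = 4 + index_len  # section offsets are relative to the end of the index
    # Third read: only the bytes of the requested section.
    return range_read(blob, data_start + entry["offset"], entry["length"])


blob = build_file({
    "pods": b'[{"name": "api-0"}]',
    "nodes": b'[{"name": "node-1"}]',
})
pods = read_section(blob, "pods")
```

A service that wants only `pods` never touches the `nodes` bytes; the cost is a couple of small extra reads for the index.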

The choice now lives with each service team. Services that only need pods no longer waste time and bandwidth downloading node data, volumes, and everything else they don’t care about.

2. Dictionary-Based Differential Compression

Here’s the clever part: when a pod’s CPU request changes from 100m to 200m, the JSON representation is almost identical – maybe 500 bytes total, with only three characters different.

Traditional compression would treat this as a completely new 500-byte resource. But what if we could say, “it’s the same as before, except these three characters changed”?

That’s where Zstandard (ZSTD) dictionary compression comes in. We use the old version of a resource as a “dictionary” to compress the new version. The algorithm can then reference patterns from the dictionary rather than re-encode them. The result? A 500-byte resource change becomes a 20-byte patch – a 96% reduction.
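
The same principle can be tried with nothing but the Python standard library. Note the substitution: this sketch uses zlib's preset-dictionary support (`zdict`) as a stand-in for Zstandard dictionaries, so the exact sizes will differ from the figures above. The pod document is made up for the example.

```python
import json
import zlib


def compress_with_dict(new: bytes, old: bytes) -> bytes:
    """Compress `new`, letting the compressor reference bytes from `old`."""
    comp = zlib.compressobj(level=9, zdict=old)
    return comp.compress(new) + comp.flush()


def decompress_with_dict(patch: bytes, old: bytes) -> bytes:
    """Reverse of compress_with_dict; needs the same `old` bytes."""
    decomp = zlib.decompressobj(zdict=old)
    return decomp.decompress(patch) + decomp.flush()


# Hypothetical pod resource; only the CPU request changes between versions.
pod = {
    "kind": "Pod",
    "metadata": {"name": "api-0", "namespace": "prod",
                 "labels": {"app": "api", "tier": "backend"}},
    "spec": {"containers": [{
        "name": "api",
        "image": "registry.example.com/api:1.4.2",
        "resources": {"requests": {"cpu": "100m", "memory": "256Mi"}},
    }]},
}
old_json = json.dumps(pod).encode()
new_json = old_json.replace(b'"100m"', b'"200m"')

patch = compress_with_dict(new_json, old_json)
plain = zlib.compress(new_json, 9)  # same data, no dictionary

restored = decompress_with_dict(patch, old_json)
```

Comparing `len(patch)` with `len(plain)` shows the effect: with the old version as a dictionary, the compressor mostly emits back-references instead of re-encoding the document. The trade-off is that the reader must have the exact old version available to decompress the new one.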

This technique, similar in spirit to the delta compression Git uses when packing objects, lets us represent entire snapshot deltas in kilobytes instead of megabytes. And because ZSTD is designed for very fast decompression, reading those deltas back stays cheap.

3. Lazy Loading and Smart Memory Management

To reduce the compute that services spend processing snapshots, we needed to stop them from doing unnecessary work. We’d already made it possible to download only the sections you care about, but if you’re only looking for one specific pod, why parse all 10,000?

By using lazy loading techniques and better memory management, we only deserialize what you actually access. Looking for one pod? We’ll process exactly one pod. Want to iterate through all pods? We’ll process them one at a time as you access them, not all upfront.
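
A minimal sketch of that behavior, assuming each resource is kept as raw JSON bytes until someone asks for it. The `LazySection` class is hypothetical, and caching decoded entries is a design choice of this example rather than something the post specifies.

```python
import json


class LazySection:
    """Holds raw JSON per resource and decodes an entry only when accessed."""

    def __init__(self, raw_items: dict[str, bytes]):
        self._raw = raw_items
        self._decoded: dict[str, dict] = {}
        self.decode_count = 0  # instrumentation for this example only

    def get(self, name: str) -> dict:
        if name not in self._decoded:
            self._decoded[name] = json.loads(self._raw[name])
            self.decode_count += 1
        return self._decoded[name]

    def __iter__(self):
        # Generator: each resource is decoded as the caller reaches it.
        for name in self._raw:
            yield self.get(name)


raw = {
    f"pod-{i}": json.dumps({"name": f"pod-{i}", "phase": "Running"}).encode()
    for i in range(10_000)
}
pods = LazySection(raw)
pod = pods.get("pod-42")        # decodes exactly one of the 10,000 entries
decoded_after_lookup = pods.decode_count
```

Iterating with `for p in pods:` and breaking early decodes only the entries visited so far, which is the "one at a time as you access them" behavior described above.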

We also introduced arena allocation – a technique that stores all the raw JSON data for a section in a single contiguous memory block rather than thousands of individual allocations. This cuts memory allocations by 50% and significantly improves cache locality, making iteration faster.
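
The layout idea behind an arena can be shown in Python, with a caveat: Python does not expose control over its allocator, so this sketch illustrates the single-buffer-plus-offsets structure only, not the allocation savings quoted above. The `Arena` class and its handle scheme are hypothetical.

```python
import json


class Arena:
    """Stores many raw JSON records back to back in one contiguous buffer."""

    def __init__(self):
        self._buf = bytearray()
        self._spans: list[tuple[int, int]] = []  # (offset, length) per record

    def add(self, record: bytes) -> int:
        # Add all records before handing out views: a bytearray cannot be
        # resized while a memoryview of it is still alive.
        self._spans.append((len(self._buf), len(record)))
        self._buf += record
        return len(self._spans) - 1  # handle used to look the record up later

    def view(self, handle: int) -> memoryview:
        offset, length = self._spans[handle]
        # Slicing a memoryview does not copy the underlying bytes.
        return memoryview(self._buf)[offset:offset + length]


arena = Arena()
handles = [arena.add(json.dumps({"name": f"pod-{i}"}).encode()) for i in range(3)]

# json.loads does not accept a memoryview, so this example copies the one
# record it decodes; the other records stay untouched in the shared buffer.
second = json.loads(arena.view(handles[1]).tobytes())
```

Because neighboring records sit next to each other in memory, walking them in order touches one buffer sequentially instead of chasing thousands of separate objects.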

The Numbers Don’t Lie

After several months in production, the results exceeded our expectations:

Storage Efficiency

93% reduction in storage required

What used to consume over a petabyte per month now fits in a fraction of that space.

Over $300,000 saved in cloud storage costs per year

This is real money that can be invested back into product development instead of paying cloud providers for duplicate data.

Service Efficiency

82% decrease in p90 snapshot processing time

What took 30 seconds now takes roughly five. This means faster insights for customers and more responsive autoscaling decisions. When every second counts in responding to traffic spikes, this matters.

88% decrease in CPU usage in snapshot processing services

Less CPU means lower infrastructure costs and more headroom for growth. We can now handle 5× more clusters with the same infrastructure that was struggling before.

91% reduction in network ingress from cloud storage

Downloading only what you need instead of entire snapshots eliminated millions of unnecessary API calls and removed a major bottleneck for services running under load.

The infrastructure efficiency gains alone will save us significant additional compute costs annually – on top of the $300K in storage savings.

Wrap Up

We started this post talking about snapshots every 15 seconds. That hasn’t changed – we’re still capturing cluster state with the same frequency and fidelity our customers depend on.

What changed is everything underneath: how we store it, how we access it, how efficiently we use it, and what we can build on top of it.

93% less storage. 82% faster processing. Over $300,000 saved annually. Zero breaking changes for existing services.

Not bad for a hackathon project.
