As generative AI applications scale in production, many teams are relying on powerful models like GPT-4o-mini to deliver fast, high-quality outputs across chat, summarization, and retrieval-augmented generation (RAG) workloads. While GPT-4o-mini offers impressive performance and ease of integration, the cost of frequent inference, especially at scale, can quickly add up.
For teams running Kubernetes-based LLM in cloud environments, there is increasing interest in evaluating high-performing open-source alternatives, such as Alibaba’s Qwen2.5-14B. If users host it themselves, it could provide comparable results at a significantly lower cost.
This article explores how a simple switch in your model enables your infrastructure to support flexible LLM choices. It also introduces AI Enabler, an automated solution for deploying and testing LLMs and dynamically routing requests for cost and performance optimization.
We ran some benchmark tests to identify cost and performance results and found some intriguing data.
Setup
For an accurate cost comparison, we benchmarked Qwen2.5:14B (no quantization) using vLLM—served via Cast AI’s AI Enabler, on 2× L4 Spot GPUs. We batch-processed 1000 requests under a realistic load.
Benchmark summary
- 1000 requests served successfully
- Throughput: 800 tokens/sec
- Median Time to First Token: 34s (batch load; much lower in real-time traffic)
- Total throughput: ~2.88 million tokens/hour
- Infra cost: $0.46/hour (GCP Spot VMs, asia-northeast3 region)
Cost comparison
To ensure fairness, we used hourly throughput. Since Qwen2.5:14B runs at ~800 tokens/sec, that gives us ~2.88M tokens/hour.
Here’s how the costs stack up for 1 million and 2.8 million tokens:

At full capacity, Qwen2.5:14B is 2.3x cheaper than GPT-4o-mini for the same workload, while at lower usage, GPT-4o-mini is slightly more cost-effective.
If you’re building GenAI apps and expect volume, hosted OSS models are significantly cheaper,plus you get privacy, control, and no rate limits.
Want to try it yourself?
Setting up Qwen2.5:14B in your cloud with AI Enabler takes just a few minutes. Here’s how to get started.
Prerequisites
- A Kubernetes cluster (in GKE, EKS, or AKS with GPU support)
- An account with Cast AI (if you don’t have one yet, create it here:Cast AI – Console)
Step by step
- Navigate to the Cast AI console > AI Enabler > Model Deployments and click on “Install AI Enabler”
- Follow the installation instructions. You will be given a script to run in your cluster. This will install the Cast AI proxy component in your cluster, the necessary GPU drivers, and all resources to deploy a model.
- Select the Qwen2.5:14B model from the model catalog and click on Deploy.
Your model is now being deployed in your cluster. To access it, use the hosted proxy at http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1.
Other cost optimization steps you can take
- Allow your model to scale to 0 automatically when it is not used to reduce idle time and save on cost.
- Configure a fallback model for the proxy to use when Qwen is unavailable.
- Allow AI Enabler to handle the model 0 to N autoscaling so that your team can focus on building great solutions instead of managing infrastructure.
👉 Spin up Qwen2.5:14B in your VPC
Conclusion
Our benchmark showed that at full capacity, Qwen2.5:14B is 2.3 times less expensive than GPT-4o-mini for the same workload. At the same time, GPT-4o-mini is marginally more cost-effective at lower utilization.
If you want to run LLMs in a Kubernetes cluster without breaking the bank, use Cast AI. Our platform includes a module where you can test and deploy the most optimal LLM model for performance, cost, and security.
LLM optimization for AIOps
Test and deploy the most optimal LLM model for performance, cost and security.



