Qwen2.5:14B vs. GPT-4o-Mini: Which One is Cheaper at Scale?

This article explores how switching from GPT-4o-mini to Qwen2.5:14B can reduce GenAI costs at scale. It also introduces AI Enabler—a tool for deploying, testing, and routing LLMs automatically to optimize performance and cost.

Ioana Apetrei Avatar
Qwen2.5:14B vs GPT-4o-mini – Cost & Performance at Scale

As generative AI applications scale in production, many teams are relying on powerful models like GPT-4o-mini to deliver fast, high-quality outputs across chat, summarization, and retrieval-augmented generation (RAG) workloads. While GPT-4o-mini offers impressive performance and ease of integration, the cost of frequent inference, especially at scale, can quickly add up. 

For teams running Kubernetes-based LLM in cloud environments, there is increasing interest in evaluating high-performing open-source alternatives, such as Alibaba’s Qwen2.5-14B. If users host it themselves, it could provide comparable results at a significantly lower cost. 

This article explores how a simple switch in your model enables your infrastructure to support flexible LLM choices. It also introduces AI Enabler, an automated solution for deploying and testing LLMs and dynamically routing requests for cost and performance optimization.

We ran some benchmark tests to identify cost and performance results and found some intriguing data.

Setup

For an accurate cost comparison, we benchmarked Qwen2.5:14B (no quantization) using vLLM—served via Cast AI’s AI Enabler, on 2× L4 Spot GPUs. We batch-processed 1000 requests under a realistic load.

Benchmark summary

  • 1000 requests served successfully
  • Throughput: 800 tokens/sec
  • Median Time to First Token: 34s (batch load; much lower in real-time traffic)
  • Total throughput: ~2.88 million tokens/hour
  • Infra cost: $0.46/hour (GCP Spot VMs, asia-northeast3 region)

Cost comparison

To ensure fairness, we used hourly throughput. Since Qwen2.5:14B runs at ~800 tokens/sec, that gives us ~2.88M tokens/hour.

Here’s how the costs stack up for 1 million and 2.8 million tokens:

Want to try it yourself?

Setting up Qwen2.5:14B in your cloud with AI Enabler takes just a few minutes. Here’s how to get started.

Prerequisites

  • A Kubernetes cluster (in GKE, EKS, or AKS with GPU support)
  • An account with Cast AI (if you don’t have one yet, create it here:Cast AI – Console)

Step by step

  1. Navigate to the Cast AI console > AI Enabler > Model Deployments and click on “Install AI Enabler”
  2. Follow the installation instructions. You will be given a script to run in your cluster. This will install the Cast AI proxy component in your cluster, the necessary GPU drivers, and all resources to deploy a model.
  3. Select the Qwen2.5:14B model from the model catalog and click on Deploy.

Your model is now being deployed in your cluster. To access it, use the hosted proxy at http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1.

Other cost optimization steps you can take

  1. Allow your model to scale to 0 automatically when it is not used to reduce idle time and save on cost.
  2. Configure a fallback model for the proxy to use when Qwen is unavailable.
  3. Allow AI Enabler to handle the model 0 to N autoscaling so that your team can focus on building great solutions instead of managing infrastructure.
👉 Spin up Qwen2.5:14B in your VPC

Conclusion

Our benchmark showed that at full capacity, Qwen2.5:14B is 2.3 times less expensive than GPT-4o-mini for the same workload. At the same time, GPT-4o-mini is marginally more cost-effective at lower utilization.

If you want to run LLMs in a Kubernetes cluster without breaking the bank, use Cast AI. Our platform includes a module where you can test and deploy the most optimal LLM model for performance, cost, and security.

Cast AIBlogQwen2.5:14B vs. GPT-4o-Mini: Which One is Cheaper at Scale?