When a cost spike hits on a Tuesday afternoon and your manager wants to know why, you are not answering one question. You are switching between three tools, running kubectl describe in two terminal windows, and cross-referencing timestamps against workload event logs that live somewhere else entirely. By the time you have assembled the answer, the investigation has taken longer than the incident itself.
That context-assembly overhead (three tabs, two CLI windows, one cost dashboard) runs before every incident investigation, every capacity question, and every developer escalation. You have already paid it six times this month.
OpsPilot is Cast AI’s AI assistant for Kubernetes operations, available directly in the Cast AI console, connected to the same real-time data pipeline the platform uses for autoscaling and cost optimization: cluster state, workload events, cost data, and audit logs. You type the question. OpsPilot assembles the answer.
Kubernetes Troubleshooting Without the Context Tax
OpsPilot returns structured answers in seconds, where manual investigation typically takes 15 to 30 minutes. That gap matters most during active incidents, when every minute of context-assembly is a minute the incident continues to run.
Incident investigation is where fragmented tooling hurts the most. When something breaks, you are simultaneously running kubectl get events, checking your observability dashboard, and comparing timestamps across systems, all while someone is waiting for an answer. The longer that process takes, the longer the incident runs.
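For comparison, the manual version of that loop usually looks something like the sequence below (workload and namespace names are illustrative):

```
# Manual incident triage: the sequence OpsPilot collapses into one question
kubectl get events -n production --sort-by=.lastTimestamp
kubectl describe deployment payments-api -n production
kubectl top pods -n production --sort-by=memory
```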
OpsPilot pulls from Cast AI’s data pipeline, covering cluster state, workload events, cost attribution, and audit logs: everything the Cast AI agent observes in your cluster. Ask what is wrong with a deployment and get a structured response that crosses event logs, resource utilization, and recent configuration changes in a single view.
Example: type “Why is payments-api crashing?”
```
Deployment: payments-api | Namespace: production
Condition: CrashLoopBackOff, 14 restarts in the last 2 hours
Root cause signal: OOMKilled, container memory limit 256Mi,
  peak RSS 312Mi at v2.3.9 rollout (03:14 UTC)
Suggested next: Raise memory limit or roll back to v2.3.8
  (starting points for investigation; not automated actions)
```

OpsPilot suggests a next step you can validate and apply. For example:
```
kubectl set resources deployment payments-api --limits=memory=512Mi -n production
```

(Update your deployment manifest instead if you manage resources via GitOps.)
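If you do apply it with kubectl, a quick check that the new limit landed and the rollout settled might look like this (a sketch; adjust names and namespaces to your cluster):

```
# Confirm the updated memory limit and watch the rollout stabilize
kubectl rollout status deployment/payments-api -n production
kubectl get deployment payments-api -n production \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'
```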
OpsPilot works across all clusters connected to your Cast AI account. Ask cluster-specific questions by prefixing with the cluster name, or ask across all clusters at once.
Instead of running kubectl get events and correlating timestamps by hand, the incident investigation starts with the context already assembled.
Kubernetes Cost Optimization, Answered on the Same Day
Cost questions are as disruptive as incidents. When spending in the payments namespace jumps on a Tuesday, you need the same context-assembly work: which workloads changed, which pods scaled, which namespace-level cost attribution matches the spike. Without a unified view, that means pulling from a cost dashboard, cross-referencing with kubectl output, and hoping the timestamps align.
OpsPilot draws on Cast AI’s cost attribution data, the same data the platform uses for rightsizing recommendations and autoscaler decisions. Ask which workloads drove the most spend increase this week, or why costs spiked on Tuesday, and get a workload-level breakdown correlated with workload events and resource utilization.
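Reconstructing that picture by hand usually means stitching together rollout history, autoscaler state, and scaling events yourself; a rough sketch of that manual pass, with illustrative names:

```
# Manually correlating a cost spike with what changed and what scaled
kubectl rollout history deployment/payments-api -n payments
kubectl get hpa -n payments
kubectl get events -n payments --field-selector reason=ScalingReplicaSet --sort-by=.lastTimestamp
```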

The 2025 Kubernetes Cost Benchmark found that clusters use, on average, only 10% of CPU and 23% of memory allocated to them. That waste is distributed unevenly. A handful of over-provisioned workloads typically drives most of it, but finding them requires per-namespace investigation that only happens when someone specifically looks. OpsPilot surfaces those patterns when you ask, not after a weekly FinOps cycle.
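To spot-check that pattern yourself, the manual route is to compare requested resources against live usage per namespace; a minimal sketch (OpsPilot surfaces the same comparison when asked):

```
# Compare live usage against requests to find over-provisioned workloads
kubectl top pods -n payments --sort-by=memory
kubectl get pods -n payments -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
```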
The same principle applies when you hit a knowledge gap mid-incident. Instead of switching tabs for docs, you ask OpsPilot directly. Ask about PodDisruptionBudgets and Cast AI autoscaling, and get the recommended configuration returned inline: not a link to a doc page, but the actual policy settings with defaults and recommended thresholds for your workload type.
Example: ask “What’s the recommended Cast AI policy configuration for workloads with unpredictable memory usage?”
```
Cast AI recommends: memory.target_utilization: 0.6 | fallback_to_requests: true
For workloads with high memory variance (>20% peak-to-median), a lower target
helps avoid thrashing. Your current config uses 0.85. Consider reducing to 0.65
for this workload class.
```
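For the PodDisruptionBudget side of that question, the inline answer pairs naturally with a concrete manifest. The sketch below is illustrative only; the names and the minAvailable value are assumptions, not Cast AI’s recommendation for your workloads:

```
# Illustrative PodDisruptionBudget so node consolidation never drops the
# workload below two ready replicas (values are examples, not recommendations)
cat <<'EOF' | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payments-api
EOF
```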
Whether you are debugging a spike, tracking down waste, or confirming a policy setting mid-incident, OpsPilot handles operational, cost, and knowledge questions from a single interface.
Available Now, for All Cast AI Customers
OpsPilot is available now in the Cast AI console, included free for all Cast AI customers. Open the OpsPilot tab and start asking questions: no additional cost, no configuration needed.
Access to OpsPilot respects your existing Cast AI RBAC settings. Members can only query data from clusters and namespaces they already have access to.
Not sure where to start? Try one of these:
- “Which workloads in my cluster are OOM-killing this week?”
- “What’s driving the cost increase in my payments namespace?”
- “What’s the recommended rightsizing policy for a stateful deployment?”
If you are not yet a Cast AI customer, OpsPilot ships as part of the broader Cast AI platform, which automatically handles autoscaling, workload rightsizing, spot instance management, and cost monitoring. OpsPilot answers the investigation questions your team still asks by hand, so that when Cast AI’s automation acts, you understand why, and when something new breaks, you have context before the first kubectl command.



