Release notes

See what’s new in the product.


  • September 2024: Pod Pinner Release, Enhanced Runtime Security, and New Workload Autoscaler Features

    Autoscaler
    Cost monitoring
    GPU
    Metrics
    Node configuration
    Pod Pinner
    Security
    Terraform
    Workload rightsizing

    Major Features and Improvements

    Pod Pinner: General Availability Release

    We’re excited to announce that Pod Pinner has graduated from beta and is now generally available. This CAST AI in-cluster component enhances the alignment between CAST AI Autoscaler decisions and actual pod placement, leading to improved resource utilization and potential cost savings.

    Pod Pinner is enabled by default for eligible clusters. Some customers may not see this feature immediately; if you’re interested in using Pod Pinner, please contact our support team for activation.

    For detailed installation instructions, configuration options, and best practices, refer to our Pod Pinner documentation.

    Custom Deny Lists for Runtime Security

    We’ve introduced custom deny lists to expand our Runtime Security capabilities. This feature allows you to create and manage lists of potentially harmful elements such as IP addresses, process names, or file hashes.

    Key features:

    • Create and manage custom deny lists via API
    • Integrate deny lists into security rules using CEL expressions
    • Support for various entry types, including IPv4, SHA256 hashes, and process names

    New API endpoints:

    • Create a deny list: POST /v1/security/runtime/list
    • Retrieve all deny lists: GET /v1/security/runtime/list
    • Get a specific deny list: GET /v1/security/runtime/list/{id}
    • Add items to a deny list: POST /v1/security/runtime/list/{id}/add
    • Remove items from a deny list: POST /v1/security/runtime/list/{id}/remove
    • Get entries of a deny list: GET /v1/security/runtime/list/{id}/entries
    • Delete a deny list: POST /v1/security/runtime/list/delete

    This addition provides greater flexibility in defining and enforcing security policies in your Kubernetes environment. For usage details and examples, refer to our Runtime Security documentation.
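
    The endpoints listed above can be collected into a small routing table. This is only a minimal sketch: the operation names and the deny_list_route helper are hypothetical, and request bodies are left out because these notes do not describe them.

```python
# Method and path for each deny list operation, copied from the list above.
# The dictionary keys are hypothetical labels chosen for this sketch.
DENY_LIST_ROUTES = {
    "create": ("POST", "/v1/security/runtime/list"),
    "list_all": ("GET", "/v1/security/runtime/list"),
    "get": ("GET", "/v1/security/runtime/list/{id}"),
    "add_items": ("POST", "/v1/security/runtime/list/{id}/add"),
    "remove_items": ("POST", "/v1/security/runtime/list/{id}/remove"),
    "entries": ("GET", "/v1/security/runtime/list/{id}/entries"),
    "delete": ("POST", "/v1/security/runtime/list/delete"),
}


def deny_list_route(operation, list_id=None):
    """Return (method, path) for an operation, filling in {id} when needed."""
    method, template = DENY_LIST_ROUTES[operation]
    if "{id}" in template:
        if not list_id:
            raise ValueError(f"operation {operation!r} requires a list_id")
        return method, template.replace("{id}", list_id)
    return method, template
```

    Note that, as listed, deletion is a POST to a fixed path rather than a DELETE on the list ID, so the list to delete is presumably identified in the request body.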

    Cloud Provider Integrations

    GPU Support for AKS Clusters (Preview)

    Our GPU support has expanded to include Azure Kubernetes Service (AKS) clusters, bringing feature parity across major cloud providers. This addition allows for autoscaling of GPU-attached nodes in AKS environments.

    Key features:

    • Support for NVIDIA GPUs in AKS clusters
    • Autoscaling capabilities for workloads requesting GPU resources
    • Integration with default and custom node templates
    • Updated Terraform modules to support GPU configurations in AKS

    This feature is currently in preview and available upon request. If you’re interested in using GPU support for your AKS clusters, please contact our support team for enablement.

    For more information on GPU support across cloud providers, see our GPU documentation.

    Optimization and Cost Management

    Cost Comparison Report Overhaul

    We’ve significantly improved our cost comparison reports, providing deeper visibility into your Kubernetes spending over time. This overhaul includes several new features:

    • Memory dimension added alongside CPU metrics for a more comprehensive resource analysis
    • Workload analysis showing Workload Autoscaler impact, helping you understand cost optimizations
    • Growth rate insights for cluster costs vs. size, allowing you to track cost efficiency as your cluster grows

    These enhancements will help you gain more detailed insights into your Kubernetes spending patterns and optimization opportunities.

    For more information on using these new features, please refer to our Cost Comparison Report documentation.

    Workload Startup Ignore Period: Full Feature Availability

    The feature to ignore initial resource usage during workload startup is now fully available across our platform. This enhancement is particularly beneficial for applications with high initial resource demands, such as Java applications.

    Users can now configure scaling policies to disregard resource metrics for a specified period after startup, ensuring more accurate autoscaling decisions. This feature is accessible via:

    • CAST AI Console
    • Terraform
    • API

    For details on implementation and best practices, please refer to our Workload Autoscaler documentation.
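
    To illustrate why ignoring the startup period matters, here is a minimal sketch of the idea. It is not the Workload Autoscaler implementation; the Sample type, the function names, and the sample values are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    seconds_since_start: float  # age of the container when the sample was taken
    cpu_cores: float            # observed CPU usage at that moment


def samples_after_startup(samples, ignore_period_seconds):
    """Drop samples recorded during the startup ignore period."""
    return [s for s in samples if s.seconds_since_start >= ignore_period_seconds]


def peak_cpu(samples):
    """Highest observed CPU usage, or 0.0 when there are no samples."""
    return max((s.cpu_cores for s in samples), default=0.0)


# A JVM-style workload: heavy CPU use while starting, modest use afterwards.
samples = [Sample(10, 2.0), Sample(60, 1.5), Sample(200, 0.4), Sample(400, 0.5)]
steady = samples_after_startup(samples, ignore_period_seconds=120)
```

    With the startup samples included, the peak is dominated by the first two minutes; with them excluded, the peak reflects steady-state usage.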

    Custom Look-back Period for Workload Autoscaler

    We’ve enhanced our Workload Autoscaler with a custom look-back period feature. This allows you to specify the historical timeframe the autoscaler uses to analyze resource usage and generate scaling recommendations.

    Key points:

    • Set custom look-back periods for CPU and memory separately
    • Available through annotations, the API, and Terraform, with UI support coming soon

    This feature provides greater flexibility in optimizing your autoscaling policies to match your specific application needs and usage patterns. For more details on configuring custom look-back periods, see our Workload Autoscaler documentation.
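
    As a rough illustration of separate look-back periods for CPU and memory, the sketch below filters usage samples by age before they would feed a recommendation. The within_look_back helper, the window lengths, and the sample data are hypothetical and are not taken from the Workload Autoscaler.

```python
from datetime import datetime, timedelta


def within_look_back(samples, now, look_back):
    """Keep (timestamp, value) samples no older than the look-back period."""
    cutoff = now - look_back
    return [(ts, value) for ts, value in samples if ts >= cutoff]


now = datetime(2024, 9, 30, 12, 0)
cpu_samples = [(now - timedelta(days=d), v) for d, v in [(10, 0.9), (5, 0.6), (1, 0.3)]]
memory_samples = [(now - timedelta(days=d), v) for d, v in [(10, 512), (5, 768), (1, 640)]]

# Hypothetical settings: a short window for CPU, a longer one for memory.
recent_cpu = within_look_back(cpu_samples, now, timedelta(days=3))
recent_memory = within_look_back(memory_samples, now, timedelta(days=7))
```

    A shorter CPU window reacts faster to recent changes in load, while a longer memory window keeps older peaks in view.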

    Enhanced Cluster Efficiency Reporting

    We’ve improved the granularity of our cluster efficiency reports by reducing the minimum time step from 1 day to 1 hour. This change affects the /v1/cost-reports/clusters/{clusterId}/efficiency API endpoint.

    Key benefits:

    • More accurate representation of cluster efficiency over time
    • Better alignment between efficiency reports and dashboard data
    • Improved visibility for clusters with frequent resource changes

    For more details on using the efficiency report API, see our Cost Reports API documentation.

    Node Configuration

    Improved EKS Node Configuration with Instance Profile ARN Suggestions

    We’ve enhanced the EKS node configuration experience by adding suggestions for Instance Profile ARNs. This feature simplifies the setup of CAST AI-provisioned nodes in your EKS clusters.

    Key benefits:

    • Automated suggestions for Instance Profile ARNs
    • Reduced need to switch between CAST AI and AWS consoles

    Added Ability to Define MaxPods per Node Using Custom Formula

    We’ve enhanced the EKS node configuration to allow users to select a formula for calculating the maximum number of pods per specific AWS EC2 node.

    Key benefits:

    • Supports customer configurations with various Container Network Interfaces (CNIs)
    • Allows for non-default max pods per node, providing greater flexibility in cluster management
    • Enables more precise control over pod density on nodes

    This feature enhances CAST AI’s ability to adapt to diverse customer environments and networking setups. For more information on using this feature, please refer to our EKS Node Configuration documentation.
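
    These notes do not list the formulas you can select. As one well-known example of such a formula, the default limit used with the AWS VPC CNI can be sketched as follows; the function name is an assumption, and other CNIs may call for a different calculation.

```python
def vpc_cni_max_pods(enis, ipv4_per_eni):
    """Default pod limit when every pod gets a secondary IP from an ENI.

    Each ENI keeps one address for itself, and 2 is added for pods that use
    host networking (such as aws-node and kube-proxy) and need no ENI address.
    """
    return enis * (ipv4_per_eni - 1) + 2


# m5.large supports 3 ENIs with 10 IPv4 addresses each.
m5_large_limit = vpc_cni_max_pods(enis=3, ipv4_per_eni=10)
```

    A cluster running a CNI that does not draw pod addresses from ENIs is not bound by this limit, which is the kind of case a selectable formula addresses.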

    Security and Compliance

    SSH Protocol Detection in Runtime Security

    We’ve enhanced our Runtime Security capabilities by implementing SSH protocol detection. This feature helps identify potentially unauthorized or unusual SSH connections to pods, which are generally discouraged in Kubernetes environments.

    Key benefits:

    • Improved visibility into SSH usage within your clusters
    • Enhanced security signaling for potentially risky connections

    This addition strengthens your ability to monitor and secure your Kubernetes workloads. For more information on Runtime Security features, see our documentation.

    Expanded Detection of Hacking Tools

    We’ve enhanced our KSPM capabilities by adding detection rules for a wider range of hacking and penetration testing tools. This update improves our ability to identify potential security threats in your Kubernetes environments.

    For more information on our security capabilities, refer to our Runtime Security documentation.

    Security Improvement for GPU Metrics Exporter

    We’ve improved the security of the GPU Metrics Exporter by moving sensitive information (API Key, Cluster ID) from ConfigMap to Secrets. This change enhances the protection of your credentials. When updating to the latest version, a job will automatically migrate existing data to the new secure format. For details on updating, see our GPU Metrics Exporter documentation.

    API and Metrics Improvements

    New Cost Comparison API Endpoint

    We’ve introduced a new API endpoint for retrieving cluster-level cost comparison data between two periods of time: GET /v1/cost-reports/organization/cost-comparison.

    This endpoint allows you to compare costs and savings between two time periods, providing resource cost breakdowns and savings calculations.

    For more details on parameters and response format, see our API documentation.
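
    The kind of calculation behind a two-period comparison can be sketched as below. This is an assumption for illustration only: the compare_costs helper and its field names are hypothetical and do not describe the endpoint’s actual response or its savings formula.

```python
def compare_costs(previous_cost, current_cost):
    """Absolute and relative savings of the current period versus the previous one."""
    savings = previous_cost - current_cost
    # Guard against division by zero when the earlier period had no cost.
    savings_percent = (savings / previous_cost * 100) if previous_cost else 0.0
    return {"savings": savings, "savings_percent": savings_percent}


result = compare_costs(previous_cost=1000.0, current_cost=750.0)
```

    A negative savings value in this sketch would mean the later period cost more than the earlier one.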

    Enhanced Node Pricing API

    We’ve updated our Node Pricing API to provide a more detailed breakdown of node costs. This improvement offers greater transparency and flexibility in understanding your cluster’s pricing structure.

    Key updates:

    • Detailed base price breakdown for nodes, including CPU, RAM, and GPU prices
    • All price components are now exposed directly in the API response
    • Available for new pricing data; historical data breakdowns not included at this time

    This update allows for more accurate cost analysis and simplifies integration with external tools. To access this enhanced pricing data, use the endpoint: /v1/pricing/clusters/{clusterId}/nodes/{nodeId}.

    For more details on using the updated Node Pricing API, refer to our API documentation.

    Expanded Node Cost Metrics

    We’ve expanded our node metrics endpoint (/v1/metrics/nodes) to include detailed cost information. New metrics include hourly costs for CPU and RAM, as well as overprovisioning percentages for these resources.

    For more information on using these metrics, refer to our metrics documentation or check out our API Reference.

    Updated Image Scanning API

    We’ve updated the /v1/security/insights/images API endpoint to include an image scan status filter. This improvement allows for more efficient querying of scanned images and supports multiple status selections in a single query. For details, see our API documentation.
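
    One common way to pass several values for a single filter is to repeat the query parameter. The sketch below assumes that convention; the parameter name status and the example values are hypothetical, so check the API documentation for the real names and encoding.

```python
from urllib.parse import urlencode

IMAGES_PATH = "/v1/security/insights/images"  # endpoint named in these notes


def images_query(statuses):
    """Build a request path that repeats the (hypothetical) status parameter."""
    # doseq=True expands the list into status=a&status=b.
    query = urlencode({"status": list(statuses)}, doseq=True)
    return f"{IMAGES_PATH}?{query}"


path = images_query(["scanned", "failed"])
```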

    User Interface Improvements

    Enhanced Grouping Options in Runtime Security

    We’ve improved the user experience in the Runtime Security section of the CAST AI console by introducing flexible grouping options. This allows users to organize and analyze security data more effectively.

    Key features:

    • New Group by dropdown menu in the Runtime Security interface
    • Additional grouping parameters including Anomaly, Cluster, Namespace, Resource, and Workload

    Terraform and Agent Updates

    We’ve released an updated version of our Terraform provider. As always, the latest changes are detailed in the changelog. The updated provider and modules are ready for use in your infrastructure as code projects in Terraform’s registry.

    We have released a new version of the CAST AI agent. The complete list of changes is here. To update the agent in your cluster, please follow these steps.

  • August 2024: Storage Cost Reporting, StatefulSet Optimization, and Enhanced Cloud Provider Support

    Autoscaler
    Cost monitoring
    Metrics
    Node configuration
    Notifications
    Organization
    Rebalancer
    Security
    Terraform
    Workload rightsizing

    Major Features and Improvements

    Storage Cost Reporting

    CAST AI now offers comprehensive storage cost and utilization reporting. This enhancement delivers crucial insights into your Kubernetes cluster’s storage expenses and usage patterns, enabling more effective resource management and cost optimization.

    Key features:

    • View provisioned and claimed storage metrics in the cluster dashboard
    • Track storage costs alongside compute costs in cluster and workload reports
    • Identify storage over-provisioning in the Efficiency tab

    Currently supported for GCP persistent block storage. We’re actively working on expanding support to other cloud providers and storage types.

    StatefulSet Support in Workload Optimizer

    We’re excited to introduce support for StatefulSets in our Workload Optimizer. This feature is currently available to select customers.

    Key aspects:

    • StatefulSets are now supported for both Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling (VPA)
    • New default scaling policy for StatefulSets with deferred apply type to minimize disruptions
    • Compatible with existing autoscaling modes and policies

    This enhancement expands our optimization capabilities to a broader range of workloads, allowing for more comprehensive cluster management.

    Azure Preview Instances and Customer-Specific Instance Constraints

    We’ve expanded our support for Azure instances and added new customization options:

    • Azure Preview Instances: CAST AI now supports Azure Preview Instances, allowing customers to access early hardware at reduced prices and optimize their clusters for additional savings.
    • Customer-Specific Instance Constraint: A new “Customer-specific” constraint has been added to Node Templates for Azure Kubernetes Service (AKS) clusters. This opt-in feature allows users to include instances exclusively available to their account, such as preview instances or other specially arranged hardware.

    These additions provide more flexibility in instance selection and potential cost savings for Azure users. The Customer-Specific constraint is disabled by default and must be explicitly enabled to use these exclusive instances.

    Optimization and Cost Management

    Enhanced Node Selection for Scheduled Cluster Rebalancing

    We’ve introduced new options for node selection in cluster rebalancing schedules, giving you more control over how your clusters optimize resource usage and costs. You can now choose from three node selection algorithms:

    This flexibility allows you to tailor your rebalancing strategy to your specific needs, whether you’re prioritizing cost efficiency, resource utilization, or a balance of both. The new options are beneficial for addressing scenarios with uneven resource utilization or when dealing with nodes that have similar pricing but different usage patterns. The feature is available via API and in the CAST AI console.

    Remember that the ‘Least utilized’ option may not be optimal for spot instances, where price optimization is typically prioritized over utilization.

    Cost Anomaly Indicators in Cluster Reports

    We’ve added indicators for cost anomalies directly in the cluster cost reports. When you receive an anomaly notification, you’ll see a clear marker on the cost graph, pinpointing exactly when the anomaly occurred. This gives you at-a-glance visibility into cost anomalies, helping you quickly identify and investigate unusual spending patterns in your clusters.

    Extended Time Range for Cluster Reports

    We’ve expanded the time range for cluster and organization-level reports fetched via API from 33 days to 3 months. This enhancement allows for more comprehensive long-term analysis and trend identification across various report types, including network costs and efficiency reports. Data granularity adjusts automatically for optimal performance with longer time ranges.

    Improved AWS Spot Instance Interruption Handling

    We’ve optimized our response to AWS spot instance interruptions by removing the 90-second delay before initiating node replacement. This change aligns our AWS spot handling with other cloud providers, allowing for faster node replacement and improving the chances of graceful workload shutdown during spot interruptions.

    New Rebalancing Mode: Aggressive Mode

    We’re excited to introduce Aggressive Mode in our rebalancer, allowing you to forcefully roll problematic nodes that would normally be skipped. This feature is perfect for customers who need to ensure strict compliance, even if it means accepting some operational risk. Aggressive Mode makes sure that all nodes in the cluster are updated, maintaining alignment with compliance requirements.

    Node Configuration

    Prioritized Instance Families in Node Templates

    We’ve introduced support for prioritizing instance families in node templates, catering to customers prioritizing performance over cost. This new feature allows you to define a specific order of instance families based on your performance testing results. The system will attempt to schedule workloads on the highest-priority instance family first. If capacity is full or unavailable, it will move to the next in the defined order, ensuring your workloads run on the most suitable instances for optimal performance, regardless of price.
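
    The fallback behavior described above can be sketched as a simple ordered search. The function name, the capacity map, and the instance families used here are illustrative assumptions, not the autoscaler’s actual logic or data model.

```python
def pick_instance_family(priority_order, available_capacity):
    """Return the first family in priority order that still has capacity."""
    for family in priority_order:
        if available_capacity.get(family, 0) > 0:
            return family
    return None  # nothing in the prioritized list can be provisioned


# Hypothetical priorities from performance testing, highest priority first.
priorities = ["c7g", "c6g", "m6g"]
capacity = {"c7g": 0, "c6g": 4, "m6g": 10}
chosen = pick_instance_family(priorities, capacity)
```

    Price plays no part in the choice here: the order of the list alone decides which family is tried first.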

    Windows Server 2022 Support for AKS Clusters

    We’ve expanded our support for Windows workloads in Azure Kubernetes Service (AKS) clusters to include Windows Server 2022. This enhancement allows for greater flexibility in running Windows containers within your Kubernetes environments.

    Workload Optimization

    Container-Level Constraints for Workload Optimization

    Our Workload Optimizer now supports setting constraints at the individual container level within a pod. This enhancement allows for more precise resource allocation for multi-container workloads, especially beneficial for applications with diverse container resource needs. By tailoring constraints to each container, you can achieve better resource utilization and potentially reduce costs while maintaining optimal performance across your applications.

    Workload-Level Scaling Policy Editing

    We’ve added the ability to edit scaling policies directly from the workload view. An edit button now appears next to each workload’s scaling policy, allowing for quick adjustments without navigating away from the workload context.

    This streamlines the process of fine-tuning your autoscaling settings, making it easier to optimize individual workloads. Note that read-only policies do not have this edit option, maintaining proper access control.

    API Update: Optional Startup Period Setting

    The updateworkloadscalingpolicy API endpoint now includes an optional periodSeconds field, which defines the duration of increased resource usage during startup. When set, recommendations will adjust to ignore startup spikes, ensuring more accurate resource recommendations. If not specified, recommendations will be based on regular execution patterns.

    Infrastructure as Code

    Terraform Support for Workload Autoscaler

    We’re excited to introduce full Terraform support for Workload Autoscaler, making managing and automating your scaling configurations easier than ever. With this update, you can use Terraform to create, update, and delete scaling policies directly. This includes setting up critical parameters like thresholds, enabling auto-enable settings, and managing the entire policy lifecycle through Terraform.

    Additionally, we’ve updated our AKS, GKE, and EKS Terraform modules to include these Workload Autoscaler resources and an option to install the workload autoscaler. These enhancements simplify the deployment and management of scaling policies across your clusters, allowing you to move workload autoscaling management to an organizational level.

    Security and Compliance

    Enhanced Performance for Image Vulnerability Scanning

    We’ve improved the performance of our image vulnerability scanning and reporting processes. This optimization especially benefits organizations with large numbers of container images, providing faster and more efficient vulnerability assessments. The enhancement includes pre-calculation of vulnerability counters, resulting in quicker image list loading and more responsive image overviews. Users will experience noticeably faster access to critical security information, enabling more timely decision-making around container security.

    Expanded CA Certificate Support Across CAST AI Agents

    We’ve strengthened our security capabilities by expanding private CA certificate support across the CAST AI platform. Previously available only for the CAST AI agent, this crucial feature is now extended to our workload autoscaler, cluster-controller, spot-handler, and evictor components. This advancement ensures that SSL connections within these critical components can seamlessly integrate with private CA certificates, providing enhanced security for customers with stringent compliance requirements.

    New Event Tab in Runtime Security

    We’ve added an Event tab to the Runtime Security section, mirroring the functionality found in the anomalies view. This new tab allows users to easily review security events and helps streamline the process of creating custom security rules.

    Improved RuntimeSecurityAPI Endpoint

    We’ve updated the RuntimeSecurityAPI to help you retrieve and analyze security events. You can now filter events by date and group results by multiple fields. Paging and sorting are also supported for efficient data handling. For more details, see our API documentation.

    API and Metrics Improvements

    New Storage Metric: cluster_claimed_disk_bytes

    We’ve added a new storage metric, cluster_claimed_disk_bytes, to our API. This metric is now available in the GetClusterResourceUsage and GetClustersSummary endpoints, allowing you to monitor claimed disk storage across your clusters more effectively.

    Improved Namespace Network Cost Reporting

    We’ve updated our NamespaceReport API to provide data for network costs. This update introduces a new endpoint specifically designed for retrieving such data: getClusterNamespaceDataTransferCost.

    Improved ReportMetricsAPI

    We’ve updated our ReportMetricsAPI to now allow for filtering by cluster. See our API documentation for details.

    New Discounts API

    We’ve introduced a new Discounts API to expose discount-related functionalities. This API allows you to access and manage discount features directly through our platform.

    New GET Endpoint for Individual Commitments

    We’ve added a new GET endpoint for retrieving individual commitments by ID. This update allows users to view the details of a specific commitment, improving the ability to verify and manage commitments more conveniently.

    New Workload Network Metrics for Prometheus Endpoints

    We’ve added workload network metrics to Prometheus endpoints, providing organizational-level metrics, including:

    • workload_network_bytes
    • workload_to_workload_traffic_bytes

    This means that more network-related metrics from egressd are now available, improving visibility and monitoring for network traffic within workloads. For more information, refer to our updated API documentation.

    User Interface Improvements

    Notifications for New Instance Types

    We’ve introduced a new notification system to inform you about the latest instance types available across AWS, Azure, and Google Cloud Platform. This feature ensures you’re always aware of new opportunities to optimize your cloud resources:

    • In-product notifications when new instance families become available in your active regions
    • Webhook support for easy integration with your existing notification systems
    • Tailored notifications based on your organization’s cloud usage and regions

    This feature helps you stay at the forefront of cloud technology, enabling better performance and cost optimization for your Kubernetes clusters.

    Improved Side Menu Navigation

    We’ve enhanced the side menu navigation to make it more intuitive and user-friendly. Now, when you hover over any menu icon in the collapsed state, a dropdown card will appear, showing all available options for that section.

    Enhanced Billing Report Date Selection

    We’ve improved the date selection process in the Billing Report to make it easier for you to access commonly needed information. A new “Last Month” option has been added to the date picker, allowing you to quickly view your billing data for the previous calendar month.

    Self-Service SSO Setup for OpenID Providers

    We’ve expanded our Single Sign-On (SSO) capabilities to include a self-service setup for OpenID providers. This new feature allows users to configure SSO for OpenID in the CAST AI console, similar to the existing process for Azure and Okta integrations. See our SSO documentation for instructions.

    Component Updates

    Autoscaler Update: Instance Type Selection Based on PV Storage Type

    The autoscaler now selects instance types based on the persistent volume (PV) storage type, ensuring that only compatible instance types are chosen. This update also improves grouping to handle conflicting storage types required by different pods. To ensure proper scheduling, the use of pod pinning or taints may be required to match pods with nodes that support the necessary storage type.

    Autoscaler Support for Native Sidecar Containers

    We’ve added support for native sidecar containers in our autoscaler, aligning with Kubernetes v1.28’s introduction of this feature. This update ensures that autoscaling can now handle pods with native sidecar containers.

    Support for topology.kubernetes.io/zone in Pod Anti-Affinity Rules

    We’ve introduced support for Pod Anti-Affinity on the topology.kubernetes.io/zone key, addressing a critical need for customers requiring zonal distribution of their workloads. This feature ensures that your pods are correctly scheduled across different availability zones, adhering to your anti-affinity rules and preventing unnecessary node creation in the wrong zones.
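
    For reference, a zonal anti-affinity rule in a pod spec looks roughly like the fragment below, written here as a Python dictionary that mirrors the Kubernetes manifest. The app: web label is a placeholder chosen for this example.

```python
import json

# Pod spec fragment: no two pods labelled app=web may share an availability zone.
pod_spec_fragment = {
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": "web"}},
                    "topologyKey": "topology.kubernetes.io/zone",
                }
            ]
        }
    }
}

# Serialize for pasting into a manifest or sending through a Kubernetes client.
manifest_json = json.dumps(pod_spec_fragment, indent=2)
```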

    Enhanced Pod Topology Spread with matchLabelKeys Support

    We’ve expanded our support for Kubernetes pod topology spread constraints to include the matchLabelKeys feature introduced in Kubernetes 1.27. This addition allows for more flexible and precise control over pod distribution across your cluster. By specifying label keys without values, you can now maintain optimal pod spread during redeployments and updates, ensuring better application resilience and resource utilization. This feature is particularly useful for complex applications that require careful balancing of pods across nodes or zones.
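
    A spread constraint using matchLabelKeys might look like the sketch below, again as a Python dictionary mirroring the manifest. The app: web label and the choice of pod-template-hash as the key are assumptions for illustration.

```python
# Spread pods evenly across zones, counting each rollout revision separately.
topology_spread_constraints = [
    {
        "maxSkew": 1,
        "topologyKey": "topology.kubernetes.io/zone",
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": {"app": "web"}},
        # Only the key is listed; the value is read from the incoming pod's labels.
        "matchLabelKeys": ["pod-template-hash"],
    }
]

pod_spec_fragment = {"topologySpreadConstraints": topology_spread_constraints}
```

    Because a Deployment gives each revision its own pod-template-hash value, pods from the old revision do not skew the spread calculation for the new one during a rollout.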

    Terraform and Agent Updates

    We’ve released an updated version of our Terraform provider. As always, the latest changes are detailed in the changelog. The updated provider and modules are ready for use in your infrastructure as code projects in Terraform’s registry.

    We have released a new version of the CAST AI agent. The complete list of changes is here. To update the agent in your cluster, please follow these steps.

  • July 2024: Horizontal Pod Autoscaling, Enhanced GPU Monitoring, and Improved Kubernetes Security Features

    Commitments
    Cost monitoring
    GPU
    Metrics
    Node configuration
    Node templates
    Organization
    Security
    Terraform
    Workload rightsizing

    Major Features and Improvements

    Horizontal Pod Autoscaling (HPA)

    Workload autoscaling got a significant upgrade with our new Horizontal Pod Autoscaler (HPA). Working independently or in tandem with our Vertical Pod Autoscaler, HPA dynamically adjusts pod counts based on CPU usage, ensuring your applications stay responsive and cost-effective. Configure it easily through our UI, API, or Kubernetes annotations to keep your workloads perfectly balanced. Just ensure you’re running the latest versions of our autoscaler and agent to take full advantage of this powerful new feature. Get started in our docs!

    Billing Report Enhancements

    We’ve significantly improved our billing system to ensure accuracy and consistency across all reporting channels. The Billing Report now uses 10-minute aggregation intervals, aligning with other reporting services as of July 1st. These changes streamline our billing processes, eliminating manual reconciliations and providing you with more accurate, consistent financial data across all platforms.
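
    Aggregating into 10-minute intervals means every sample is assigned to the interval that contains its timestamp. The sketch below shows that bucketing in general terms; the function names and sample values are hypothetical, and the quantity actually billed is not specified in these notes.

```python
from collections import defaultdict
from datetime import datetime


def bucket_start(ts, minutes=10):
    """Round a timestamp down to the start of its aggregation interval."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def aggregate(samples, minutes=10):
    """Sum (timestamp, value) samples per interval."""
    totals = defaultdict(float)
    for ts, value in samples:
        totals[bucket_start(ts, minutes)] += value
    return dict(totals)


samples = [
    (datetime(2024, 7, 1, 12, 3), 1.0),
    (datetime(2024, 7, 1, 12, 9), 2.0),
    (datetime(2024, 7, 1, 12, 10), 4.0),
]
totals = aggregate(samples)
```

    Reports that share the same interval boundaries produce the same totals, which is what makes figures comparable across reporting channels.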

    Streamlined Support Access with CAST AI Observer

    We have streamlined our support process by automatically adding the newly renamed CAST AI Observer account (formerly the POC account) to all newly created organizations with view-only permissions. This account allows our support and engineering teams to assist you more efficiently, analyze workloads proactively, and resolve incidents faster – all without requiring manual intervention from your side. Don’t worry; you retain complete control: the account’s purpose is transparently displayed in the UI, and you can remove it at any time if you prefer.

    Cloud Provider-Specific Updates

    EKS: Multiple Load Balancer Support

    We’ve enhanced our EKS node configuration to support multiple load balancers per node. The UI, API, and Terraform providers now accept an array of target groups, replacing the previous single target group field. This update allows for more flexible and complex load balancing setups, catering to scenarios where nodes must simultaneously be associated with multiple load balancers. The old single target group field has been deprecated but remains functional for backward compatibility.

    EKS: Flexible AMI Selection

    We’ve enhanced the flexibility of node configurations for Amazon EKS clusters. In both our API and Terraform provider, you can now specify an AMI family instead of an exact AMI when setting up your node templates. This improvement allows you to choose from supported image families like Amazon Linux 2, Amazon Linux 2023, or Bottlerocket, giving you more control over your node operating systems while benefiting from automatic updates within your chosen family.

    GKE: Local SSD Enhancements

    We’ve expanded our local SSD support for GKE clusters across our platform. The new useEphemeralStorageLocalSsd boolean parameter is now available in our Node configuration API, allowing you to easily enable or disable local SSD-backed ephemeral storage for your GKE nodes. We’ve also integrated this option into our node configuration UI, making it even more accessible.

    Kubernetes Security and Compliance

    New Workloads View for Enhanced Visibility

    Our team has expanded our Kubernetes Security section with a new Workloads view, providing an overview of all workloads across your clusters from a security perspective. This new view lets you easily search, filter, and sort your workloads by cluster, namespace, and labels. This addition complements our existing security features, offering you a more holistic approach to workload security management.

    Enhanced Malware Detection

    We’ve improved our runtime security capabilities with a new built-in rule to detect executions of known hacking tools. The Hacking tool executed rule complements our existing detections, providing broader coverage against potential security threats in your clusters. This enhancement is part of our Kubernetes Runtime Security feature, which is currently in closed preview. If you’re interested in gaining early access to these advanced security capabilities, please contact our support team for availability and enrollment details.

    Reporting and Cost Optimization

    Persistent Volume (PV) Storage Dimension Added

    You’ll now find PV usage (in GiB) and associated costs integrated throughout our API reports and metrics, from cluster-level to organization-wide. This new dimension provides a more comprehensive view of your resource utilization, enabling better-informed decisions for storage optimization alongside compute resources.

    New GPU Utilization Tab

    We’ve introduced a dedicated GPU utilization tab in the Workloads view, offering deeper insights into your GPU-enabled workloads. This new tab provides a comprehensive overview of GPU usage, including average utilization rates and associated costs. These insights are also available per workload and as a summary in the Dashboard. You can quickly identify top GPU consumers and potential optimization opportunities, enhancing your ability to manage and optimize GPU resources effectively.

    Expanded GPU Metrics Collection

    We’ve enhanced our GPU monitoring capabilities by adding four new metrics to our data collection: FP64, FP32, and FP16 pipeline activity, as well as integer pipeline activity. These additional data points, collected via our gpu-metrics-exporter, provide a more comprehensive view of GPU utilization across different computation types. Read about all of the metrics the exporter collects here.

    Cluster Rebalancer: Enhanced Flexibility in Savings Thresholds

    We improved the Cluster Rebalancer to give you more control over when rebalancing operations occur. While you can still set a target savings threshold to trigger rebalancing, we’ve now added an option to bypass this requirement. By unchecking the Execute only if there are guaranteed minimum savings box, you can ensure rebalancing proceeds even if the projected savings don’t meet the specified threshold.
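    The decision described above can be sketched as a small predicate. This is not the Rebalancer's actual implementation; the function and parameter names are hypothetical, and treating "meets the threshold" as greater-than-or-equal is an assumption.

```python
def should_rebalance(projected_savings_pct: float,
                     min_savings_pct: float,
                     require_min_savings: bool) -> bool:
    """Return True if a rebalancing plan should be executed.

    `require_min_savings` mirrors the "Execute only if there are
    guaranteed minimum savings" checkbox. All names are hypothetical.
    """
    if not require_min_savings:
        # Box unchecked: proceed regardless of projected savings.
        return True
    # Box checked: proceed only when the threshold is met (assumed >=).
    return projected_savings_pct >= min_savings_pct


print(should_rebalance(5.0, 10.0, require_min_savings=True))
print(should_rebalance(5.0, 10.0, require_min_savings=False))
```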

    UI/UX Improvements

    Billing Report Re-re-location

    Following user feedback on the Billing Report’s recent relocation to the Organization Profile, we’ve enhanced accessibility by restoring access from the Optimization section. Now, you can conveniently access the Billing Report from both the Organization Profile and the Optimization section.

    Cluster Dashboard Enhancements

    We’ve upgraded the Cluster Dashboard to provide you with more crucial information. The last reconciliation timestamp is now in the top-right corner, giving you immediate insight into the freshness of your cluster data. Additionally, we’ve added the Kubernetes version to the cluster details section, allowing you to verify your cluster’s compatibility with the latest features quickly.

    Improved Image Scan Error Reporting

    We’ve enhanced the visibility of our image-scanning process in the Kubernetes Security section. When an image scan fails, you’ll now see a detailed error message explaining the reason for the failure. This improvement allows you to quickly identify and address issues that may be preventing successful scans, such as invalid characters in image names or network connectivity problems.

    Enhanced Resource Filtering in Kubernetes Security

    We’ve added a labels filter to the Kubernetes Security Best Practices drawer in the resources tab, allowing for more granular resource filtering.

    Improved Sorting in Commitments UI

    We’ve enhanced the commitments view to prioritize active commitments. While maintaining the existing sort by start and end dates, active commitments now appear at the top of the list. This change provides a clearer overview of your current resource allocations, allowing for quicker access to the most relevant commitment information.
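    One way to picture the new ordering is a composite sort key: active status first, then start and end dates. The Commitment class below is a hypothetical stand-in, and ascending date order is an assumption, since the note does not state the direction.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Commitment:
    """Hypothetical stand-in for a commitment row in the UI."""
    name: str
    start: date
    end: date
    active: bool


def sort_commitments(items: List[Commitment]) -> List[Commitment]:
    # False sorts before True, so `not c.active` places active
    # commitments first; ties fall back to start date, then end date.
    return sorted(items, key=lambda c: (not c.active, c.start, c.end))


commitments = [
    Commitment("expired", date(2023, 1, 1), date(2023, 12, 31), False),
    Commitment("b", date(2024, 6, 1), date(2025, 5, 31), True),
    Commitment("a", date(2024, 1, 1), date(2024, 12, 31), True),
]
ordered_names = [c.name for c in sort_commitments(commitments)]
print(ordered_names)
```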

    API and Metrics Improvements

    Workload Optimizer Metrics Tracking

    We’ve introduced a new Prometheus metric to track the Workload Optimization status across your clusters. The workload_optimization_enabled label is now available through our metrics API, allowing you to monitor the number of workloads with Optimizer enabled versus those without.
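    As a sketch of how the label could be consumed, the snippet below counts workloads by label value in a Prometheus text-format payload. The metric name example_workload_info and the "true"/"false" label values are assumptions; only the workload_optimization_enabled label name comes from the note above.

```python
import re
from collections import Counter

# Hypothetical scrape output; the metric name and label values are assumed.
SAMPLE = '''\
# Illustrative payload, not real API output.
example_workload_info{workload="api",workload_optimization_enabled="true"} 1
example_workload_info{workload="db",workload_optimization_enabled="false"} 1
example_workload_info{workload="web",workload_optimization_enabled="true"} 1
'''

LABEL_RE = re.compile(r'workload_optimization_enabled="(true|false)"')


def count_by_optimizer_status(exposition: str) -> Counter:
    """Count samples per value of the workload_optimization_enabled label."""
    counts = Counter()
    for line in exposition.splitlines():
        if line.startswith("#"):
            continue  # skip comment lines in the text format
        match = LABEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = count_by_optimizer_status(SAMPLE)
print("enabled:", counts["true"], "disabled:", counts["false"])
```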

    Component Updates

    Autoscaler: Legacy Agent Version Limitations

    Users running castai-workload-autoscaler versions below 0.5.0 will notice restricted functionality in the autoscaler configuration UI. The mode selection option between deferred and immediate is now hidden for these older versions. We strongly recommend upgrading your agent to the latest version to access the full range of autoscaling features and options.
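    The version gate above amounts to a numeric comparison against 0.5.0. A minimal sketch, assuming plain MAJOR.MINOR.PATCH version strings (pre-release suffixes are not handled) and hypothetical function names:

```python
from typing import Tuple

MIN_VERSION_FOR_MODE_SELECTION = (0, 5, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse 'v0.5.0' or '0.5.0' into a tuple of integers."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def mode_selection_visible(agent_version: str) -> bool:
    """True if the deferred/immediate mode selector would be shown."""
    # Tuples compare element by element, so 0.10.1 > 0.5.0 as expected,
    # unlike a plain string comparison.
    return parse_version(agent_version) >= MIN_VERSION_FOR_MODE_SELECTION


for version in ("0.4.9", "0.5.0", "0.10.1"):
    print(version, mode_selection_visible(version))
```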

    Terraform and Agent Updates

    We’ve released an updated version of our Terraform provider. As always, the latest changes are detailed in the changelog. The updated provider and modules are ready for use in your infrastructure as code projects in Terraform’s registry.

    We have released a new version of the CAST AI agent. The complete list of changes is here. To update the agent in your cluster, please follow these steps.

  • Burstable instance support, enhanced security features, and expanded metrics scraping

    Autoscaler
    Commitments
    Evictor
    GPU
    Metrics
    Node configuration
    Organization
    Security
    Spot Instance
    Terraform
    Optimization and Cost Management

    Verbose Evictor config error message

    • We’ve improved the error messaging when updating Advanced Evictor configurations. Now, if multiple eviction settings are incorrectly specified, the system will return a 400 error with a clear explanation of what needs to be changed. This update aims to improve the user experience when configuring advanced eviction rules.

    Improved Commitment Update Flexibility

    • The team has refined the Commitments update process to allow partial updates. Now, when modifying a commitment, only the fields you specify will be changed, while all other settings remain intact. This update prevents unintended changes to critical parameters, providing a more user-friendly and safer update experience.
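    The partial-update behavior can be pictured as a shallow merge in which only the supplied fields overwrite stored values. This is a sketch of the semantics, not the API's implementation, and the field names are hypothetical.

```python
def apply_partial_update(current: dict, changes: dict) -> dict:
    """Return a copy of `current` with only the fields in `changes` replaced.

    Shallow merge: fields absent from `changes` keep their stored values.
    """
    updated = dict(current)
    updated.update(changes)
    return updated


# Hypothetical commitment fields, for illustration only.
stored = {"name": "ri-1", "allowed_usage": 1.0, "prioritization": True}
result = apply_partial_update(stored, {"allowed_usage": 0.5})
print(result)
```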
    Infrastructure as Code

    Improved Terraform Support for Autoscaler Policies

    • We’ve upgraded our Terraform module to replace the autoscaler_policies_json with a new autoscaler_settings object. This change allows for more granular control and state management of individual policy settings. Users of the previous JSON method will receive deprecation warnings.
    Cloud Provider Integrations

    AKS: Migration from Reserved Instances to Commitments

    • A seamless migration path for AKS users from Reserved Instances (RIs) to our new Commitments model has been implemented. This transition enhances our abstraction layer for managing reserved capacity across cloud providers. Existing RIs will be automatically converted to Commitments and, by default, assigned to all clusters. We’ve ensured backward compatibility, so any changes made using the old API will be reflected in the new Commitments logic, maintaining consistency for users of our Terraform provider.

    Automatic ingestion of new regions and instance families for GCP and AWS

    • New regions and instance families for GCP and AWS will be ingested automatically from now on. This ensures our platform stays current with the latest cloud offerings, giving you immediate access to new instance types and regions as they’re released by the providers.

    GCP Node Config: Local SSD Support

    • Support for the ephemeral-storage-local-ssd parameter in GCP node configurations has been added. This allows the automatic configuration of RAID 0 partitions from NVMe local disks on GKE node pools, which is particularly beneficial for high-performance instances. This feature simplifies the use of local SSDs in your GCP clusters by eliminating time-consuming manual configuration.
    gcloud container node-pools create POOL_NAME \
        --cluster=CLUSTER_NAME \
        --ephemeral-storage-local-ssd count=NUMBER_OF_DISKS \
        --machine-type=MACHINE_TYPE

    Support for Amazon Linux 2023 AMIs in EKS

    • We’ve added support for Amazon Linux 2023 AMIs in EKS clusters. This update enables customers to use the latest Amazon Linux distribution for their EKS worker nodes, providing access to recent security patches and performance improvements.

    GP2 Volume Type Support for EKS Node Configurations

    • EKS node configuration options have been expanded to include support for gp2 volume types. This addition gives customers more flexibility in choosing cost-effective storage solutions for their EKS worker nodes.

    Enhanced Spot Instance Interruption Prediction for AWS

    • Our machine learning-based prediction model is now the default option for spot instance interruption handling in new AWS node templates. This CAST AI ML model offers improved accuracy over AWS rebalance recommendations, helping to minimize unexpected downtime and optimize cost savings from spot instances.

    Support for GCP Z3 Instance Family

    • Added support for GCP’s new Z3 instance family, designed for storage-optimized workloads.
    Security and Compliance

    Kubernetes Security Posture Management (KSPM) with CAST AI

    • We’ve expanded our security offerings with major new KSPM (Kubernetes Security Posture Management) features. Building on our existing core capabilities, we’ve now added attack path detection and runtime anomaly detection via a powerful rule-fed engine. To help users navigate these new features, we’ve implemented a guided tour of our security capabilities in the UI for those enabling security for the first time. Be sure to check out our Security documentation, try out our new Security APIs, and take the guided tour in the console to discover how CAST AI can strengthen your cluster’s security posture!

    Multiple Email Domain Support for SSO

    • Our SSO configuration now supports multiple email domains. Users can add additional email domains, separated by commas, both in the CAST AI console and via our API. This feature has also been integrated into our Terraform provider, allowing for more flexible and comprehensive SSO setups that accommodate diverse organizational structures.
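    If you assemble the comma-separated domain list programmatically, a small normalization step can help before submitting it. The helper below is a client-side sketch; lowercasing and de-duplication are assumptions for illustration, not documented behavior of the console or API.

```python
from typing import List


def parse_email_domains(raw: str) -> List[str]:
    """Split a comma-separated domain string into a clean, ordered list.

    Lowercasing and de-duplication are assumptions made for this sketch.
    """
    domains: List[str] = []
    for part in raw.split(","):
        domain = part.strip().lower()
        if domain and domain not in domains:
            domains.append(domain)
    return domains


domains = parse_email_domains("example.com, Example.org,,example.com")
print(",".join(domains))
```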
    API and Metrics Improvements

    New API Endpoint for Individual Runtime Security Rules

    • Added a new API endpoint /v1/security/runtime/rules/{id} that allows fetching of individual runtime security rules. Put the new endpoint into action here.

    New API Endpoint: Workloads GPU Summary Report

    • Introduced a new API endpoint /v1/cost-reports/workloads/gpu-summary that provides a detailed GPU utilization summary for workloads. This endpoint accepts a filter in the request body and returns detailed GPU usage data, including idle time, utilization rates, and memory allocation across clusters. Try out the endpoint here.

    Enhanced Organization Report API with Storage Metrics

    • Updated the /api/v1/cost-reports/organization/clusters/report endpoint to include storage metrics. The API now provides average storage cost, requested storage, and provisioned storage data.

    Enhanced Prometheus Metrics with Cluster Storage Data

    • Expanded our Prometheus-compatible API /v1/metrics/prom to include cluster-level storage metrics. New metrics include provisioned storage bytes, requested storage bytes, and hourly storage cost. This addition enables more comprehensive monitoring and analysis of cluster storage utilization and costs through your preferred Prometheus-compatible tools. Take a look at the updated endpoint here or learn about scraping metrics in our dedicated guide.

    Expanded Public Metrics for Workloads and Allocation Groups

    • Introduced new public metrics endpoints providing granular data on workloads and allocation groups. These include workload pod counts, average CPU/memory requests, and hourly costs. For allocation groups, we’ve added metrics on resource requests and associated costs. This expansion allows for a more detailed analysis of resource utilization and cost trends.

    Enhanced Sorting for Allocation Group Workload Efficiency API

    • Updated the allocation group workload efficiency endpoint /v1/cost-reports/allocation-groups/{groupId}/workload-efficiency with new sorting capabilities. Users can now sort results by specifying a sort field and order in their API requests.
    User Interface Improvements

    CPU Usage report relocation

    • We’ve moved the CPU Usage Report from the Optimization section of the CAST AI console to the Organization Profile for easier access and improved consistency. The report is now called the Billing report. This change aligns it with other organizational-level insights. Users logging in for the first time after this update will receive a UI notification guiding them to the report’s new location.

    Burstable instance support

    Customizable Table Columns in the Console

    • Introduced flexible table columns across our console, enhancing user experience and data visibility. Users can now resize columns, add or remove them, reorder their placement, and freeze important columns. This update allows for personalized, more efficient data viewing, which is especially beneficial for tables with numerous fields or on smaller screens.

    Cluster List Page Enhancements

    • We’ve revamped the cluster list page for improved at-a-glance information and easier management. A new feature column now displays icons for active services like cost monitoring, autoscaling, and security. We’ve also streamlined the actions column, making it dynamic based on which CAST AI features are enabled.
    Terraform and Agent Updates
    • We’ve released an updated version of our Terraform provider. As always, the latest changes are detailed in the changelog. The updated provider and modules are ready for use in your infrastructure as code projects in Terraform’s registry.
    • We have released a new version of the CAST AI agent. The complete list of changes is here. To update the agent in your cluster, please follow these steps.
  • Cost anomaly detection, network bandwidth-specific instance selection, and expanded commitments support

    Autoscaler
    Commitments
    Cost monitoring
    Metrics
    Node configuration
    Node templates
    Notifications
    Spot Instance
    Terraform

    Cost Optimization and Anomaly Detection

    • Allocation Groups now generate reports dramatically faster by pre-recording metrics during snapshot processing. Read more in the documentation. We’ve also added node filtering by label to the AllocationGroupAPI while we were at it.
    • We’ve introduced a default warning notification for clusters with detected cost anomalies. Using machine learning, we analyze the last 5 days of cluster costs, considering seasonality, and trigger a notification when the cost deviates from the expected amount, in either direction, beyond the 1.5x threshold.
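    To make the threshold idea concrete, here is a deliberately simplified check. It swaps the machine learning model and seasonality handling for a plain five-day average, and it reads "falls below" as dropping under the expected amount divided by 1.5. Both are assumptions for illustration, not how the product computes anomalies.

```python
from statistics import mean
from typing import Sequence


def is_cost_anomaly(last_five_days: Sequence[float],
                    todays_cost: float,
                    factor: float = 1.5) -> bool:
    """Flag a cost that is far above or far below the expected amount.

    The expected amount is a plain average here (a simplification; the
    real feature uses an ML model that accounts for seasonality).
    """
    if len(last_five_days) != 5:
        raise ValueError("expected exactly five daily cost values")
    expected = mean(last_five_days)
    too_high = todays_cost > expected * factor
    too_low = todays_cost < expected / factor  # assumed lower bound
    return too_high or too_low


history = [100.0, 100.0, 100.0, 100.0, 100.0]
print(is_cost_anomaly(history, 160.0))
print(is_cost_anomaly(history, 120.0))
```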

    Node Configuration and Management

    • API and Terraform now support AZ(s) constraint per Node template. You can find the update in the Terraform registry or on our GitHub.
    • Support has been added for GCP’s N4 instances. N4 VMs support only the NVMe disk interface and can use the Hyperdisk Balanced block storage.
    • You can now set a custom drain timeout value in the UI for each Node configuration during rebalancing. If not set, it defaults to 20 minutes for on-demand nodes and 1 minute for spot nodes. This addition helps align with your application requirements.
    • When onboarding a new cluster, you can now apply recommended node constraints to any Node template with just a click of a button. These recommendations are based on your cluster size and ensure optimal performance from the very start.
    • You can now specify a Load Balancer target group for your EKS nodes directly from the Node configuration UI. Under the “Advanced configuration” section, you’ll find a new section where you can input the ARN and port of the desired target group.
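    The drain timeout fallback described above can be sketched as follows. The 20-minute and 1-minute defaults come from the note; the function name and the lifecycle keys are hypothetical.

```python
from datetime import timedelta
from typing import Optional

# Defaults stated in the release note.
DEFAULT_DRAIN_TIMEOUT = {
    "on-demand": timedelta(minutes=20),
    "spot": timedelta(minutes=1),
}


def effective_drain_timeout(lifecycle: str,
                            custom: Optional[timedelta] = None) -> timedelta:
    """Return the custom timeout if set, otherwise the lifecycle default."""
    if custom is not None:
        return custom
    return DEFAULT_DRAIN_TIMEOUT[lifecycle]


print(effective_drain_timeout("on-demand"))
print(effective_drain_timeout("spot", custom=timedelta(minutes=5)))
```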

    Autoscaling

    • Autoscaling using the customer’s purchased capacity (Reserved Instances and Committed Use Discounts) is now supported and replaces the previously supported Reserved Instance management flow, which was Azure-specific. The Commitments feature set is CAST AI’s generic approach to utilizing Reserved Instances (AWS, Azure) and Committed Use Discounts (GCP) capacity in autoscaling. It is now available for Azure and GCP customers. Please check the documentation for more details on how to set up and manage your commitments with CAST AI to optimize costs and performance while autoscaling your workloads.
    • The CAST AI Autoscaler now automatically selects instance types based on pod network bandwidth requests from the workload configuration. By specifying the required bandwidth using the scheduling.cast.ai/network-bandwidth annotation, you can ensure pods are placed on instances with sufficient throughput, streamlining the process for network-intensive applications. Read more in our documentation.

    Spot Instance Management

    • CAST AI’s Spot Instance Interruption Prediction feature has been re-enabled after being temporarily disabled due to excessive interruptions caused by AWS rebalance recommendations. The CAST AI ML team has improved the prediction model’s accuracy for identifying instances at risk of interruption, further reducing the likelihood of unexpected terminations while still leveraging the cost savings of spot instances.

    Reporting and Insights

    • Workload Network and Efficiency Reports now support pagination, which improves performance and usability when accessing workload-level insights, particularly for customers with extensive deployments.
    • We have introduced the ability to export the cluster security best practice compliance report in CSV format with the click of a button.

    User Interface Enhancements

    • A new UI element now notifies users when their connected EKS cluster runs an outdated Kubernetes version. If the version is older than the current cutoff, users are notified about a 6x increase in control plane management costs resulting from Amazon’s pricing changes for legacy EKS control plane versions.
    • You can now customize console table views with dynamic column controls. Resize name column widths to tailor the view to your needs and focus on what matters most to you.
    • The CPU Usage Report has been redesigned for clarity and moved to the Organization Profile for easier access. A UI notification will guide users to the new location on their first login post-update. These changes ensure better consistency and improve user experience when analyzing CPU usage and costs.
    • The security settings page now streamlines enabling CAST AI security features for your clusters. Instantly enable security with a toggle or script, and manage security features through the updated table and feature management drawer.

    Terraform and Agent Updates

    • A new Terraform resource has been introduced to manage RBAC patches for CASTware components automatically. This allows customers to handle RBAC updates independently while also eliminating the need for manual Helm updates across clusters.
    • Our Terraform provider now supports the management of Azure Reserved Instances and GCP committed use discounts. Automate the entire commitment lifecycle, from creation to deletion, while the Autoscaler optimizes utilization across your clusters based on region, allowed percentage, and prioritization. Read more about how CAST AI handles commitments.
    • We’ve released an updated version of our Terraform provider and updated some Terraform examples. As always, the latest changes are detailed in the changelog. The updated provider and modules are ready for use in your infrastructure as code projects in Terraform’s registry.
    • We have released a new version of the CAST AI agent. You can find the complete list of changes here. To update the agent in your cluster, please follow these steps.

    Monitoring and Observability

    • We’ve updated our Grafana dashboard example to provide more comprehensive insights into cluster performance and resource utilization. Based on customer feedback, we added new widgets using existing and newly collected metrics.

    API Updates

    • New organization-level scraping metrics endpoints have been introduced for nodes and workloads, simplifying metrics collection across your clusters. The new endpoints, v1/metrics/workloads and v1/metrics/nodes, support cluster filtering and replace the deprecated cluster-level endpoints.
    • Our WorkloadReportAPI now includes a real-time GPU utilization endpoint for workloads, including workload name, type, and average GPU utilization. The updated endpoint is available here.
    • The RuntimeSecurityAPI’s /v1/security/runtime/anomalies GET request now supports filtering anomalies by their type instead of ID. This change allows you to retrieve anomalies based on more intuitive and meaningful criteria, making it easier to investigate and analyze specific types of anomalies in your system.

    Bugfixes

    • The Cluster Proportional Vertical Pod Autoscaler (CPVPA) is now marked as disposable, so it no longer prevents node eviction.
  • AI Optimizer launch

    AI Optimizer
    Autoscaler
    Commitments
    Evictor
    Node configuration
    Workload rightsizing

    AI Optimizer

    • We have launched AI Optimizer Beta! Our new AI Optimizer service automatically identifies the LLM that offers the best performance and lowest inference costs, and deploys it on CAST AI-optimized Kubernetes clusters. Read more in our press release and join the beta!

    Workload Autoscaling

    • The Workload Autoscaler now also supports scaling policies. Scaling policies allow you to manage all your workloads centrally, with the ability to apply the same settings to multiple workloads simultaneously. Read more in the documentation.
    • The Workload Autoscaler now also supports two distinct scaling modes: immediate and deferred. When immediate mode is selected, recommendations are applied as soon as they meet the thresholds, forcing a pod restart. Meanwhile, in deferred mode, recommendations will be applied when external factors trigger a pod restart. Read more here.
    • Quality of life improvements: Date range selector, ‘Controlled by’ filter, and more.
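    The difference between the two scaling modes described above can be summarized in a few lines. This is a sketch of the behavior as worded in the note, with hypothetical names; how thresholds interact with deferred mode is not stated, so the sketch ignores them in that branch.

```python
def should_apply_now(mode: str,
                     meets_thresholds: bool,
                     pod_restarting: bool) -> bool:
    """Decide whether a recommendation is applied at this moment.

    immediate: apply once thresholds are met (this forces a pod restart).
    deferred:  apply only when something else is already restarting the pod.
    """
    if mode == "immediate":
        return meets_thresholds
    if mode == "deferred":
        return pod_restarting
    raise ValueError(f"unknown scaling mode: {mode!r}")


print(should_apply_now("immediate", meets_thresholds=True, pod_restarting=False))
print(should_apply_now("deferred", meets_thresholds=True, pod_restarting=False))
```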

    Cluster autoscaling

    • In the Autoscaler settings, a master switch has been implemented to allow users to enable or disable all Autoscaler policies in one go.
    • GCP customers who have resource-based Committed Use discounts can now upload them to CAST AI to fully utilize them during cluster scaling events. Read more about how this new feature works in the documentation.
    • Also, GCP customers who use Sole Tenant node groups can now instruct the Autoscaler to use them first before requesting VMs on multi-tenant hosts. This feature is currently available only in API and Terraform. Read more here.
    • Added Minimum node disk size as a configurable property to the Node configuration.
    • From now on, when onboarding an AKS cluster, users must manually control the autoscaling of legacy node pools, preferably turning it off as they onboard the cluster to CAST AI. This change aligns the behavior with other supported cloud providers.
    • Drain timeout during Rebalancing is now a configurable property that is set in the Node template.
    • Enhanced Evictor by adding support for StatefulSet eviction when explicitly specified by the user. Check the documentation to learn how to use this feature without causing downtime.
    • Improved API key management: now, any change to a user role (or user removal) will trigger the deletion of API keys created by that user.

    Cloud Native Security

    • Added Security dashboard – a new page that provides an overview of an organization’s security posture based on the data from CAST AI-managed clusters where kvisor is installed.
    • Improved performance of the Vulnerability and Best Practices pages in the Security product.
    • We have released a new version of our Terraform provider and updated some Terraform examples. You can find the list of changes here. The provider and modules are already in Terraform’s registry.
    • We have released a new version of the CAST AI agent. You can find the complete list of changes here. To update the agent in your cluster, please follow these steps.
  • GCP single-tier discount support, partial node template matching and new Security features

    Autoscaler
    Cost monitoring
    Evictor
    Node configuration
    Rebalancer
    Security
    • The security product has been enhanced with two major features: attack paths and base image recommendations. Both features become automatically available once the CAST AI security agent is installed:
      • Attack Path: CAST AI can now illustrate the potential path an intruder might take from the internet to exploit misconfigurations and vulnerabilities in the connected Kubernetes cluster.
      • Base Image Recommendation: CAST AI now not only reports vulnerabilities found in scanned images but also recommends less vulnerable base image versions that customers can adopt to address flagged issues.
    • Partial node template matching is now released. This feature allows a workload to match a Node template by specifying at least one of the template’s terms rather than all of them; the template is then considered when provisioning a node for the workload. Read more in the documentation.
    • GCP customers leveraging single-tier discounts now have the capability to integrate their pricing data with CAST AI. This integration will be applied across the product. If you are interested in utilizing this feature, please contact our support team for assistance.
    • A single namespace compute cost report is now available.
    • Improvements have been made to the one-click migration from Karpenter setups. Now, instead of failing when unsupported objects are encountered, the process skips them and informs the user about partially created Node templates and Node configurations.
    • The frequency at which CAST AI updates the image used to build AKS nodes has been increased. Previously, images were updated every 30 days (or upon a Kubernetes version update). Now, CAST AI checks for new image versions in the cluster’s node pools and updates the images used to build AKS nodes accordingly.
    • The Rebalancer has been enhanced to exclude pods marked for deletion, which are still running in the cluster, from capacity considerations.
    • Various quality-of-life improvements have been released:
      • Allocation group filters now support and/or conditions.
      • Rebalance and Audit log screens have additional visual cues to help with readability. Rebalancer error messages will now also show additional information about the root cause.
      • Improved the Node deletion policy page to better indicate the status of the Evictor component.
      • Improved performance of Security best practices pages.
      • Workload type and ID filters were added to the Workload optimization screen.
      • If subnets and security groups were deleted when a cluster was offboarded from CAST AI, they will be indicated in the Node configuration upon re-onboarding.
    • We have released a new version of our Terraform provider and updated some Terraform examples. You can find the list of changes here. The provider and modules are already in Terraform’s registry.
    • We have released a new version of the CAST AI agent. You can find the complete list of changes here. To update the agent in your cluster, please follow these steps.
  • Private Cluster Optimization & Workload Optimization Improvements

    Evictor
    Node configuration
    Rebalancer
    Workload rightsizing
    • Now, private EKS clusters—those without direct access to the internet—can be fully optimized by CAST AI via private link setup.
    • Support for Kubelet config and Init script has been added to the AKS cluster’s Node configuration.
    • We’ve improved the user experience by collapsing Rebalancing into 2 phases instead of 3. This was done because the deletion of the node now happens as soon as the drain is completed, instead of waiting for all nodes to drain.
    • Evictor and Rebalancer now support well-known safe-to-evict annotations. Check the documentation for more details.
    • One-click migration from Karpenter now supports v0.32+ objects.
    • Self-Service SSO now supports OIDC connections.
    • CAST AI’s node template name is now added as a tag on AWS nodes, a label on GCP VMs, and a tag on Azure Virtual Machine Scale Sets.
    • Workload optimization settings now support constraints—the minimum and maximum values for resources, dictating that the workload autoscaler cannot scale CPU/Memory above the max or below the min limits.
    • Introduced an event log for Workload Optimization, providing users with increased visibility into actions taken. Additionally, Workload Optimization now supports Argo rollouts.
    • Added the ability to filter by label to the Security Best Practice report.
    • The Image Security page now displays the image scan status.
    • Minor quality-of-life improvements have been made to make cluster and node list tables easier to use.
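    For the workload optimization constraints mentioned in the list above, the effect of minimum and maximum values can be sketched as a clamp on the recommended resource amount. The function name and units are hypothetical; this illustrates the rule, not the autoscaler's code.

```python
from typing import Optional


def clamp_recommendation(recommended: float,
                         minimum: Optional[float] = None,
                         maximum: Optional[float] = None) -> float:
    """Keep a recommended CPU or memory value within optional bounds."""
    if minimum is not None and maximum is not None and minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    value = recommended
    if minimum is not None:
        value = max(value, minimum)  # never scale below the min
    if maximum is not None:
        value = min(value, maximum)  # never scale above the max
    return value


# Example with CPU expressed in cores (an assumed unit).
print(clamp_recommendation(0.1, minimum=0.25, maximum=2.0))
print(clamp_recommendation(3.5, minimum=0.25, maximum=2.0))
```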
  • Pod Pinner beta, Node OS updates, Organization-level allocation reports, GPU dimension in Cost Monitoring

    Cost monitoring
    Evictor
    GPU
    Node configuration
    Organization
    Pod Pinner
    Reservations
    Security
    Workload rightsizing
    • Pod Pinner has been released in beta with limited availability. It addresses the misalignment between the actions of the CAST AI Autoscaler and the Kubernetes cluster scheduler. For example, while the CAST AI Autoscaler efficiently binpacks pods and creates nodes in the cluster in a cost-optimized manner, the Kubernetes cluster scheduler determines the actual placement of pods on nodes. Read more in the documentation. Customers willing to test this functionality should engage the support team.
    • Ubuntu SNAP-based AMIs for EKS are now supported.
    • We have updated the Advanced Evictor targeting logic so pods in the namespace can be targeted without the need to match labels. Check the documentation.
    • AKS users can now choose a compatible OS disk type in the node configuration.
    • For GKE clusters, we have added support for Init script and Kubelet configuration. Custom instances with attached GPUs are now also supported.
    • We have released Organization-level Allocation group reporting, where users can create Allocation groups that span multiple clusters in the organization. Read more here.
    • A GPU cost dimension was introduced to all reports in the Cost monitoring suite.
    • Data about allocated and requested GPUs are now presented in the Cluster list and dashboard.
    • The Cluster list, Cluster dashboard, and Cluster efficiency screens will now display the amount of used computing resources in addition to allocated and requested ones.
    • Workload screens in the Workload Optimization menu now show labels and annotations. If a workload is managed via annotations set in the manifest file, it can’t be managed via the Workload optimization interface.
    • We updated the Azure Reserved Instance upload flow to support only the native Azure RI export document format. Support for a custom CAST AI format was removed.
    • A new Security product feature, Node OS updates, has been released. According to best security practices, the node OS should be patched frequently. From now on, users can identify nodes that are out of compliance and set a Rebalancer schedule that will target nodes for replacement based on their age. All details are available here.
    • We introduced a new image scanning status, “Pending,” which indicates either that an image is queued for scanning or that kVisor encountered an error and the user needs to take action.
    • To unify the user journey, cluster-level Security best practice and Image security screens were removed. These reports are available from the organizational level menu.
    • We have released a new version of our Terraform provider and updated some Terraform examples. You can find the list of changes here. The provider and modules are already in Terraform’s registry.
    • We have released a new version of the CAST AI agent. You can find the complete list of changes here. To update the agent in your cluster, please follow these steps.
  • Windows node support on AKS, Self-service SSO, GCP resource-based CUDs usable in autoscaling, the launch of CAST AI EU instance

    Autoscaler
    Rebalancer
    Reservations
    Workload rightsizing
    • We have launched support for Windows nodes in AKS clusters. Clusters running Windows 2019, as well as Linux machines, can now be fully managed by CAST AI.

    • Support for GCP custom instances with extended memory settings has been implemented.

    • We have launched a self-service SSO functionality for enterprises using Azure AD and Okta OIDC. More details and a setup guide are available in the documentation.

    • GCP resource-based CUDs can now be uploaded and assigned to clusters via APIs. This is an early release. Customers interested in using this feature should contact the support team.

    • From now on, when CAST AI initiates a drain as part of a spot interruption management event, it will label the node with the key autoscaling.cast.ai/draining, and the value will represent the reason. An additional taint autoscaling.cast.ai/draining=true will also be added. All details can be found in the documentation.

    • The CAST AI EU instance is now live. Customers who require all data to be local in the EU can onboard clusters to the CAST AI EU instance.

    • Topology spread constraints that use the kubernetes.io/hostname key are now supported. A full list of supported labels can be found here.

    • We’ve added an export functionality to the CPU usage report.

    • The cluster dashboard now shows used resources in addition to allocatable and requested resources.

    • Updated the audit log event that is triggered when an instance type or family gets blacklisted. It now includes the reason for blacklisting and the expiration date.

    • We’ve updated the Security Best Practice Report to use CIS Benchmarks for GKE 1.4.

    • We’ve introduced workload autoscaler management by workload annotations. Check the documentation for more details.

    • The workload autoscaler user interface has been updated.

    • We have released a new version of our Terraform provider and updated some Terraform examples. You can find the list of changes here. The provider and modules are already in Terraform’s registry.

    • We have released a new version of the CAST AI agent. You can find the complete list of changes here. To update the agent in your cluster, please follow these steps.
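The draining label and taint introduced in this release can be checked from a node object. The following is a minimal sketch, assuming the node is available as a plain dictionary shaped like a Kubernetes Node object (`metadata.labels`, `spec.taints`). Only the key `autoscaling.cast.ai/draining` and the taint value `true` come from the note above; the helper name `draining_reason` is hypothetical.

```python
from typing import Optional

# Key named in the release note; used for both the label and the taint.
DRAINING_KEY = "autoscaling.cast.ai/draining"


def draining_reason(node: dict) -> Optional[str]:
    """Return the drain reason if the node carries the draining label and taint.

    `node` is assumed to be a dict shaped like a Kubernetes Node object.
    Returns None when the node is not marked as draining.
    """
    labels = node.get("metadata", {}).get("labels") or {}
    taints = node.get("spec", {}).get("taints") or []

    # The note says the taint is added as autoscaling.cast.ai/draining=true.
    has_taint = any(
        t.get("key") == DRAINING_KEY and t.get("value") == "true" for t in taints
    )
    if DRAINING_KEY in labels and has_taint:
        # The label value represents the reason for the drain.
        return labels[DRAINING_KEY]
    return None
```

A monitoring script could call `draining_reason` on each node it lists and report the returned reason, if any.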

  • One-click migration from Karpenter, Workload optimization settings, Security exceptions and more

    Autoscaler
    Organization
    Security
    Workload rightsizing
    • We have implemented functionality that enables customers to migrate from Karpenter to CAST AI with just a few clicks. CAST AI now recognizes whether Karpenter is installed, its version, and configuration objects (provisioners and AWS node templates). Users can migrate these objects to CAST AI Node templates and Node configurations with a single click. Check the documentation.
    • We have introduced configurable settings to the Workload Optimization feature, allowing users to adjust overhead, recommendation percentile, and set a threshold trigger for applying recommendations to better fit their use case. Refer to the documentation for more details.
    • The available savings report has been enhanced to identify potential savings achievable by right-sizing workloads’ CPU and MEM requests.
    • Improved visibility into pending user invitations: Once a user is invited to an organization, they will appear in the organization’s member list with the status ‘Invite pending.’
    • The Autoscaler now supports bare metal instances on EKS clusters.
    • The Cluster dashboard has been enhanced to differentiate between CAST AI and non-CAST AI provisioned nodes, OS, and overprovisioning levels.
    • A new role, ‘Analyst’, has been introduced to the platform. Users who only need to work with the cost monitoring feature set can view all reports and create Allocation groups without needing Member-level edit access rights.
    • In Cost Monitoring, the Workload Network cost tab now provides granular information about the amount of network traffic associated with the workload, destination workload, and associated costs. Refer to the documentation for more details.
    • In the Security domain, users can now exclude image repositories and resources from scanning for vulnerabilities or against the best practice framework.
    • Starting now, an outdated CAST AI agent version (earlier than v0.49.1) will be detected, and users will be informed before attempting to use advanced evictor configuration.
    • Workloads that have at least one pod on a node are now visible in the NodeList’s detailed node view.
    • We have released a new version of our Terraform provider and updated some Terraform examples. You can find the list of changes here. The provider and modules are already in Terraform’s registry.
    • We have released a new version of the CAST AI agent. You can find the complete list of changes here. To update the agent in your cluster, please follow these steps.
  • Workload Autoscaler Enters Beta

    Workload rightsizing
    • We are excited to announce that Workload Autoscaler is now available in Beta mode. This feature automatically scales your workload requests up or down based on demand, balancing performance and cost. Join our early adopters today and experience the benefits of intelligent autoscaling firsthand. For more details on Workload Autoscaler, please refer to the documentation.
  • New Image Security Featureset

    Cost monitoring
    Metrics
    Node templates
    Security
    • We are thrilled to introduce our new Image Security featureset. This feature empowers you to monitor all images running within your clusters, identify problematic ones, obtain detailed information about vulnerabilities, and prioritize your tasks. You can assess vulnerabilities within your organization or limit the assessment to your team’s scope. To experience Image Security in action, simply install CAST AI kVisor (Security Agent) on your cluster. For more details, please refer to the documentation.
    • When handling AWS rebalancing recommendations for spot nodes, CAST AI will now greylist only the impacted zone, rather than affecting all zones in the cluster.
    • We have added new scrapable node and workload level Prometheus metrics. Please consult the documentation for detailed guidance on how to use them.
    • In cases where instances are unavailable in the cloud, the Rebalancer will dynamically replace planned instances with the best alternatives. As a result, the rebalancing completion screen will now display actually provisioned instances, planned instances, and achieved savings. Previously, only planned instances were shown.
    • The Network Cost feature is now available for AKS clusters. Additionally, you can use Cost Allocation groups to track network costs in addition to compute costs.
    • A daily vulnerability report is now available in the Notifications view.
    • The Notifications webhook can now be utilized to send security-related notifications as well.
    • Improved the Node Template’s inventory table to provide information about the availability of GPUs and their respective counts for specific instance types. This enhancement enables users to easily identify and select the instance type that best suits their GPU requirements.
    • Released a new API designed for CAST AI partners. This API is designed to streamline the onboarding experience for both partners and their customers.
    • We have released a new version of our Terraform provider and updated some Terraform examples. You can find the list of changes here. The provider and modules are already in Terraform’s registry.
    • We have released a new version of the CAST AI agent. You can find the complete list of changes here. To update the agent in your cluster, please follow these steps.
  • Autoscaling using ARM nodes in AKS, Graceful node eviction during rebalancing & Enhanced Workload efficiency report

    ARM
    Autoscaler
    Cost monitoring
    Evictor
    Rebalancer
    • Users can now select in the rebalancing setup if they want CAST AI to evict pods gracefully. Graceful eviction means that CAST AI will not forcefully drain nodes that fail to drain in time. Instead, they will get cordoned and annotated so the user can take corrective action and adjust pod disruption budgets. All rebalancing settings can be found here.
    • During rebalancing, CAST AI will delete nodes as soon as they have been drained. Previously, all nodes had to be drained before the node deletion phase would start.
    • Autoscaling using ARM nodes is now supported for AKS customers. The user has to have quotas for ARM nodes and be in the supported Azure region. That’s why we ask users to engage with our support team before enabling this feature.
    • The workload efficiency report has been upgraded and now provides information about funds wasted due to poorly set requests. On top of that, we have added the ability to take patching commands from the console and apply them to your workload, so resources are adjusted based on CAST AI recommendations.
    • GCP node configuration now supports the boot disk type selection. Check the documentation for more details.
    • Node Templates now support the “NoExecute” taint effect.
    • The advanced Evictor configuration now enables more granular protection and targeting of pods. Read more in our documentation.
    • We have further reduced the permissions levels required in the customer’s cloud account to run CAST AI. For more details, refer to our documentation.
    • The cluster dashboard now displays CAST AI autoscaler-generated events to highlight why pods are pending instead of relying on standard Kubernetes events. This change helps pinpoint the exact reason a pod is not scheduled.
    • Bottlerocket images are now supported in EKS clusters. This improvement is available in API and Terraform only for now.
    • We have released a new version of our Terraform provider and updated some Terraform examples. You can find the list of changes here. The provider and modules are already in Terraform’s registry.
    • We have released a new version of the CAST AI agent. You can find the complete list of changes here. To update the agent in your cluster, please follow these steps.
  • GPU Support on GKE & Launch of Network Cost Monitoring

    Cost monitoring
    Evictor
    Node configuration
    Node templates
    Rebalancer
    • CAST AI now supports autoscaling with GPU-attached nodes on GKE. This feature can be enabled and managed through the node template menu. The Autoscaler responds to pending pods that require GPU resources, upscaling the cluster as necessary. By using a manifest file, a workload can request specific GPU models and more. For detailed guidance, check our documentation.
    • Added support for Advanced Evictor configuration. Using node or pod selectors, users can control which workloads should be targeted by or protected from Evictor. Read more in our documentation.
    • For EKS clusters, Node configuration now supports the use of a customer-provided KMS key for the encryption of EBS volumes.
    • The rollout of the Default node template has been completed, and the feature is now generally available. Further details can be found in our documentation.
    • We’ve enhanced our Cost monitoring suite to report on network costs. This feature is available to EKS and GKE customers but requires egressd to be installed on the cluster. Once it is installed, users can view network costs and traffic quantities, aggregated by cluster, namespace, or individual workload. Check the documentation for more details.
    • We’ve updated the Rebalancer screens to showcase both projected and actual savings.
    • When setting up a schedule for Scheduled rebalancing, users can now set a value for Guaranteed minimum savings. This will ensure that rebalancing terminates after the node creation phase if minimum savings can’t be achieved. This setting safeguards the cluster from unnecessary rebalancing if planned nodes are not available from the cloud provider. All configuration options are listed in the documentation.
    • Introduced a new Notification status labeled ‘Obsolete’ and enhanced our filtering capabilities.
    • Terraform support has been added for the Reserved instance management feature for AKS clusters.
    • Clusters managed through Terraform can now be distinguished in the cluster dashboard via the ‘managed by’ field. Disconnection of such clusters through the UI is no longer permissible; it must be carried out via Terraform or API.
    • The audit log events for node addition or removal now display the node ID from the cloud provider and an exhaustive list of labels.
    • Deprecated Features:
      • CAST AI no longer offers optimization for kOps clusters. While this change won’t affect current users, this feature will be unavailable to newcomers.
    • We have released a new version of our Terraform provider and updated some Terraform examples; the list of changes can be found here. The provider, together with modules, can be found in the registry.
    • Released a new version of the CAST AI agent; the list of changes can be found here. To update the agent in your cluster, please follow these steps.
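The Guaranteed minimum savings setting for Scheduled rebalancing can be pictured as a comparison made once the replacement nodes have been created. This is only a sketch under assumptions: the function name, the percentage-based threshold, and the hourly-cost inputs are illustrative and are not taken from the Rebalancer's actual interface.

```python
def should_continue_rebalancing(
    current_hourly_cost: float,
    achieved_hourly_cost: float,
    min_savings_percent: float,
) -> bool:
    """Decide whether rebalancing should proceed past the node creation phase.

    current_hourly_cost:  cost of the nodes selected for replacement (assumed input).
    achieved_hourly_cost: cost of the nodes that were actually created (assumed input).
    min_savings_percent:  the guaranteed minimum savings threshold, in percent.
    """
    if current_hourly_cost <= 0:
        raise ValueError("current_hourly_cost must be positive")

    savings_percent = (
        (current_hourly_cost - achieved_hourly_cost) / current_hourly_cost * 100
    )
    # Stop when the nodes actually obtained do not deliver the required savings,
    # for example because the planned instance types were unavailable.
    return savings_percent >= min_savings_percent
```

For instance, with a current cost of 10.0 per hour, an achieved cost of 9.5, and a 20% threshold, the function returns `False`, so the rebalancing would stop before any nodes are drained.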
  • Default node template, Scheduled rebalancing, Organizational level security reporting, Cluster efficiency report and much more

    Autoscaler
    Cost monitoring
    Rebalancer
    Reservations
    • We have launched an Organizational level view of Security reporting, enabling customers to see their organization’s compliance posture through the Best practices report, which is now generated across all clusters. Additionally, the organizational level Image security report flags vulnerable images and affected clusters.
    • We have created a new CAST AI Audit Logs receiver using OpenTelemetry. This component can read audit logs and seamlessly send them to the customer’s central logging system. The best part is that it’s open-sourced and available on GitHub.
    • A beta version of the Reservations feature is now available for Azure customers. With this update, customers can upload their Reservations of Virtual Machine Instances, allowing CAST AI to prioritize them during upscaling decisions. For more details, please refer to the documentation.
    • We are replacing Default Autoscaler settings with the Default Node template capability. This update offers greater flexibility in setting up the behavior of the Autoscaler when custom node templates are not used. Now users can create ‘spot-only’ clusters without using tainted nodes, set inventory limits using various constraints, apply custom labels and taints, and more. The full configuration list can be found here. This change is being gradually rolled out to a limited set of customers first.
    • Deprecated Features:
      • The Cluster Headroom feature has been deprecated as it had very low adoption. The speed of CAST AI Autoscaler made extra headroom capacity wasteful in the vast majority of use cases.
      • We have also deprecated the AWS reliability score feature from spot instance configuration options. This feature was reliant on old AWS behavior that did not translate into favorable performance. After introducing the interruption prediction model, this feature is no longer necessary.
    • Launched user interface for Scheduled rebalancing. Customers can now set up schedules and run rebalancing automatically using the UI. The Rebalancer’s log also indicates when a scheduled rebalancing was executed and the achieved savings.
    • We have added an updated Cluster efficiency tab to the Cost Monitoring suite, providing data on the current and historical state of CPU/MEM overprovisioning in the cluster, as well as the cost of each provisioned/requested CPU or GiB of RAM.
    • The Audit log improvements:
      • The Audit log now offers improved usability with added filtering options and an hour picker.
      • The ‘Unscheduled pods policy applied’ event now includes details about the node template utilized by the pending pods.
      • The ‘Node was added’ and ‘Node was deleted’ events now include the cloud provider ID as well as label details.
    • For GKE clusters, we have added support for regional GCP volumes. Previously, we only supported single availability zone volumes.
    • We have implemented multiple improvements to the Nodelist, including changes in the presentation of CPU/RAM data and the ability to see annotations, taints, and IP addresses in addition to the labels applied to the nodes.
    • We have released a new version of our Terraform provider and updated some Terraform examples; the list of changes can be found here. The provider, together with modules, can be found in the registry.
    • Released a new version of the CAST AI agent; the list of changes can be found here. To update the agent in your cluster, please follow these steps.
  • New Autoscaler Engine, Predictive Rebalancing, and ARM Node Support for GKE Clusters

    ARM
    Autoscaler
    Evictor
    Rebalancer
    Spot Instance
    • We’ve released a new version of the CAST AI Autoscaler engine, which introduces a host of improvements. Now, CAST AI can consider an even more diverse set of instance types. This update not only improves GPU support but also enhances node distribution across zones and subnets for our EKS and AKS customers. The new engine significantly bolsters the Rebalancer, enabling customers to achieve greater savings and overcome previous limitations when rebalancing multi-zone clusters. We’re now offering comprehensive support for both NodeAffinity and NodeSelectors, which includes the integration of Affinities like NotIn. This update ensures that customers can easily utilize their preferred labelling method, or even both, with minimal friction. There’s no need for customers to take any action to benefit from the updated engine—it works straight out of the box.
    • CAST AI now fully supports optimization of GKE clusters running ARM nodes. For more information, please refer to our documentation.
    • On the cluster dashboard, users can now view both the count of unscheduled pods and the reasons for pods remaining in the pending state.
    • The AWS Node configuration now cross-references the cloud provider for configured subnets and security groups, allowing users to easily select them.
    • We’ve launched a beta version of our predictive rebalancing feature set for AWS customers. To proactively manage spot interruptions, users can now select one of two interruption prediction models. CAST AI will handle notifications about upcoming interruptions, identifying impacted nodes and rebalancing them in advance. More information is available in our documentation.
    • We’ve improved our Notifications functionality. Now, notifications older than 24 hours will automatically expire, and those followed by a successful operation will automatically resolve.
    • We’ve enhanced Evictor logging. The logs now record the pods present on the node before the drain operation as well as those that remained on the node if draining failed.
    • The user interface for Node templates now supports the setup of custom taints. Previously, this functionality was only accessible via the API / Terraform.
    • In addition to the Azure container offering launched last month, CAST AI is now also available as a SaaS offering on the Azure Marketplace.
    • We have released a new version of our Terraform provider and updated some Terraform examples; the list of changes can be found here. The provider, together with modules, can be found in the registry.
    • Released a new version of the CAST AI agent; the list of changes can be found here. To update the agent in your cluster, please follow these steps.
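To illustrate what combined NodeSelector and NodeAffinity support (including operators such as NotIn) means when matching pods to nodes, the sketch below evaluates a nodeSelector map and a list of match expressions against a node's labels. It follows standard Kubernetes semantics for the `In` and `NotIn` operators only. The function name and the sample labels are illustrative, and this is not the Autoscaler's implementation.

```python
from typing import Dict, List, Optional


def node_matches(
    labels: Dict[str, str],
    node_selector: Optional[Dict[str, str]] = None,
    match_expressions: Optional[List[dict]] = None,
) -> bool:
    """Return True if node labels satisfy both the nodeSelector and the expressions."""
    # nodeSelector: every key must be present with exactly the given value.
    for key, value in (node_selector or {}).items():
        if labels.get(key) != value:
            return False

    # Affinity match expressions: all listed expressions must hold.
    for expr in match_expressions or []:
        key = expr["key"]
        operator = expr["operator"]
        values = expr.get("values", [])
        if operator == "In":
            # The label must exist and its value must be one of the listed values.
            if labels.get(key) not in values:
                return False
        elif operator == "NotIn":
            # A missing label also satisfies NotIn in Kubernetes.
            if key in labels and labels[key] in values:
                return False
        else:
            raise ValueError(f"operator not covered by this sketch: {operator}")
    return True
```

Using both mechanisms together, a pod can pin itself to an architecture with a nodeSelector while excluding a zone with a `NotIn` expression.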
  • Redesigned CAST AI Console, Azure Kubernetes Marketplace Offering, and More Updates

    ARM
    Evictor
    Rebalancer
    Spot Instance
    Workload rightsizing
    • We’ve launched a new design for the CAST AI console, making it more modern, sharp, and user-friendly.
    • The available savings report for EKS clusters now provides recommendations for optimal configurations using Graviton (ARM) nodes, even if the current configuration doesn’t include such nodes. This change allows users to simulate how their cluster would appear if migrated to nodes with ARM processors.
    • The available savings report for clusters in read-only mode will no longer display the full recommended optimal configuration. This feature is now exclusively available for clusters managed by CAST AI.
    • Modified the workload efficiency calculation to account for deliberately under-provisioned workloads. Efficiency is now capped at 100%. For instance, heavily memory under-provisioned workloads will show 100% efficiency, but they risk OOM kills.
    • Updated the workload efficiency calculation for multi-container pods – depending on the request size, pods will contribute proportionally to the overall efficiency score.
    • Fixed a bug that prevented the evictor from continuing optimization of a cluster when it encountered a faulty pod.
    • We’ve enhanced the detection of the Metrics server on a cluster to prevent instances where the workload efficiency page isn’t displayed even though the Metrics server is installed.
    • Created APIs for the upcoming AWS Spot rebalance recommendation handling and preventive rebalancing features. Users interested in testing these features should contact our support team.
    • We’ve created APIs and implemented Terraform changes for the scheduled rebalancing feature that will allow partial or full rebalance of a cluster based on a cron type schedule. Users interested in testing this feature should contact our support team.
    • For AKS users, we’ve created APIs for the upcoming Reserved Instance management feature. Users interested in testing this feature should contact our support team.
    • Added support for Google’s newly introduced G2 VMs.
    • AKS Govcloud and Azure CNI Overlay networking are now supported.
    • We’ve modified the partial rebalancing savings calculation logic. Previously, during partial rebalancing, savings were calculated based on total cluster cost. Now, they are calculated based solely on the selected nodes.
    • Enhanced the spot diversity feature. It now includes a user-defined safeguard: a permitted percentage price increase that the Autoscaler should adhere to but not surpass when making autoscaling decisions that enhance spot instance family diversity on the cluster.
    • Now, in non-aggressive mode, the Evictor will no longer target nodes that have running jobs.
    • We’ve released the CAST AI agent as an Azure container offering on the Azure Kubernetes Marketplace.
    • We’ve included the ability to override the problematic pod check during the rebalancing process, even if normally such a pod would prevent a node from being rebalanced. For more details, please refer to our documentation.
    • Fixed several bugs in the Security report and kVisor agent.
    • We have released a new version of our Terraform provider and updated some Terraform examples; the list of changes can be found here. The provider, together with modules, can be found in the registry.
    • Released a new version of the CAST AI agent; the list of changes can be found here. To update the agent in your cluster, please follow these steps.
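The capped workload efficiency calculation from this release can be sketched as follows. The notes do not give the formula, so this assumes that efficiency is usage divided by requests, capped at 100%, and that a multi-container pod's score is the request-weighted average of its containers' scores. All names are illustrative.

```python
from typing import Sequence, Tuple


def capped_efficiency(used: float, requested: float) -> float:
    """Efficiency in percent, capped at 100 for under-provisioned workloads."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return min(used / requested, 1.0) * 100.0


def weighted_efficiency(containers: Sequence[Tuple[float, float]]) -> float:
    """Request-weighted efficiency over (used, requested) pairs.

    Containers with larger requests contribute proportionally more
    to the overall score (assumed interpretation of the release note).
    """
    total_requested = sum(requested for _, requested in containers)
    if total_requested <= 0:
        raise ValueError("total requested resources must be positive")
    weighted_sum = sum(
        capped_efficiency(used, requested) * requested
        for used, requested in containers
    )
    return weighted_sum / total_requested
```

Under these assumptions, a container using twice its request scores 100% rather than 200%, which is why a heavily under-provisioned workload can look fully efficient while still risking OOM kills.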
  • CAST AI is now listed on AWS marketplace

    ARM
    Node configuration
    Node templates
    Rebalancer
    Terraform
    • CAST AI can now be purchased via the AWS marketplace, offering seamless integration and an easy procurement process.
    • We have released Scheduled rebalancing APIs and updated the Terraform provider allowing users to run rebalancing on a schedule and limit the scope, e.g., rebalance a full cluster or only spot nodes.
    • GCP Custom instances are supported in the Node templates UI.
    • Autoscaling with storage-optimized nodes for AKS users: we have added support for autoscaling using storage-optimized nodes when the workload requests ephemeral storage in the pod definition (i.e., via nodeSelector and toleration). Read more in our docs.
    • Node template UI now supports multiple custom labels.
    • We have added support for preferred affinity, enabling pods to be scheduled on templated nodes; if those are not available, nodes added by the default autoscaler will be used.
    • Node configuration UI for GKE clusters now has a setting to pass Network tag values as well as to set a value for the MaxPods per node.
    • In the EKS Cluster Node Template UI, users can now select between ARM and/or x86_64 architectures.
    • For EKS clusters, the Node configuration UI now allows the selection of EBS disk type and specification of required IOPS values.
    • Workload Efficiency Improvements. Previously, under-provisioned workloads were shown as equally inefficient as over-provisioned ones. As under-provisioning might be a deliberate customer strategy to save resources while accepting risk, such workloads are now capped at 100% efficiency.
    • We have released a new version of our Terraform provider and updated some Terraform examples; the list of changes can be found here. The provider, together with modules, can be found in the registry.
  • Release of Spot diversity feature and pod affinity support

    ARM
    Cost monitoring
    Node configuration
    Node templates
    Spot Instance
    • Implemented pod affinity support for well-known Kubernetes (and some cloud provider-specific) labels. Check the documentation.
    • We have released the Spot diversity feature for community feedback (available across all supported cloud providers). When turned on, the Autoscaler will try to balance between the most diverse and cheapest instance types. By using a wider array of instance types, the overall node interruption rate in a cluster will be lowered, increasing the uptime of your workloads. API and Terraform support is already available; UI support will follow. More in our documentation.
    • In the Workload-level reporting of Cost monitoring, we have optimized the workload grouping algorithm for large clusters with a significant number of workloads that use repetitive naming patterns. Handling of short-lived workloads was also improved.
    • Updated Viewer role permissions, so viewers in an organization can now generate rebalancing plans.
    • The Available savings report now recommends ARM nodes if the cluster’s current config has at least one ARM node.
    • AWS customers can now use nodes with ARM or x86_64 processors when creating Node templates. More in the documentation. This change is currently implemented in API and Terraform; UI changes will follow.
    • Multiple arbitrary labels are now supported in Node templates. Check our docs. This change is currently implemented in API and Terraform; UI changes will follow.
    • The unscheduled pods policy audit log event was expanded to include more data about the trigger, giving customers more insight.
    • We have updated Terraform examples. Released a new version of our Terraform provider; the list of changes can be found here. The provider, together with modules, can be found in the registry.
    • Released a new version of the CAST AI agent; the list of changes can be found here. To update the agent in your cluster, please follow these steps.
  • Beta launch of Workload rightsizing recommendations, ARM processor support in EKS

    ARM
    Evictor
    Node configuration
    Node templates
    RedHat Openshift
    Terraform
    • Not sure what CPU and MEM requests to set on your application containers? We are here to help! In our Cost monitoring suite, we have launched the Workload rightsizing feature. Users can now access the Efficiency tab where the list view displays the workloads and their respective requested and used resources (calculated in resource hours). Every workload can be opened to see more data on resource requests and usage. On top of that, CAST AI now provides CPU and RAM rightsizing recommendations for every container. The entire feature set is currently available in beta; we are actively gathering feedback and making incremental improvements.

    • CAST AI agent image and Helm chart are now RedHat certified and available in the partner software catalog.

    • The Available savings report for RedHat Openshift clusters was adjusted to identify master and infra-worker nodes as not optimizable.

    • AWS customers who are using ARM nodes (Graviton processors) can now connect and autoscale such clusters using CAST AI. To initiate autoscaling, the workload must have a nodeSelector or affinity with the label kubernetes.io/arch: "arm64". Check the documentation for an example.

    • In GKE clusters, the MaxPods value for CAST AI-added nodes can now be passed via the Node configuration API. GCP Network tags can now be set via the same API as well. UI updates to follow.

    • We added node affinity support to Node templates; previously, only nodeSelector was supported. More details can be found in the documentation. In addition, the Node templates API now supports multiple arbitrary taints.

    • Removed permissions to create Security Groups from the CAST AI ‘user’ in AWS. These permissions are no longer required, as the creation of the Security Group was transferred to the onboarding script. The currently required permissions can be checked here.

    • Added Node template support to Terraform. It is available from v4.5.0 of eks-module and 2.1.1 of the CAST AI Terraform provider.

    • Reworked how Evictor policy is managed via Terraform. Until the user sets Evictor policies, the Evictor installation is not modified in the cluster; users can still manage Evictor via Helm if they choose. Policy defaults will always be the same and are no longer synced to Evictor Helm changes. When the Evictor policy is modified in the UI, the changes will be synced to Evictor as usual.

    • Updated metrics collection so widgets in the cluster dashboard take into account pods that are scheduled on the node but can’t start due to containers failing to initialize properly.

    • Node configuration API for EKS clusters now supports additional storage volume parameters: volume type, IOPS, and throughput. In addition, the API now supports the specification of the IMDS version to be used on CAST AI provisioned nodes. Documentation.
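The nodeSelector requirement for ARM autoscaling mentioned in this release can be sketched as a pod manifest built in Python. Only the label `kubernetes.io/arch: "arm64"` comes from the note above. The function name, pod name, and image are placeholders, and the image is assumed to be built for arm64.

```python
import json


def arm64_pod_manifest(name: str, image: str) -> dict:
    """Build a minimal Pod manifest that requests an arm64 node via nodeSelector."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Label from the release note; signals that an ARM node is required.
            "nodeSelector": {"kubernetes.io/arch": "arm64"},
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; replace with a real arm64-compatible image.
    manifest = arm64_pod_manifest("demo-app", "example.com/demo-app:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied with standard Kubernetes tooling, since JSON is accepted wherever YAML manifests are.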

  • Deeper security insights, improved AKS image management, and OpenShift support

    Autoscaler
    Node configuration
    Node templates
    Security
    • CAST AI kVisor security agent has been released. Customers can now get even deeper security insights into their Kubernetes clusters by enabling the CAST AI kVisor agent. There is no need to wait until vulnerabilities and best practices reports are refreshed as kVisor assesses public and private container images for vulnerabilities once they appear in your cluster. It also provides a more thorough analysis of your cluster configuration based on CIS Kubernetes benchmarks and DevOps best practices.

    • Improved the way CAST AI manages AKS images used when creating new nodes. When onboarding an AKS cluster to CAST AI-managed mode, we will create and store in the customer’s gallery the image to be used when creating CAST AI-managed nodes. This new solution drastically reduces the time required to create and join new nodes to a cluster. Rebalancing execution times are also reduced.

    • Added ‘Read only’ support for Red Hat OpenShift Service on AWS (ROSA) clusters. Now customers running ROSA clusters can experience all the CAST AI reporting features for free.

    • The CPU usage report was completely reworked. It now provides running CPU usage data as well as billable CPU counts on a daily basis. Billable CPU count is the foundation of CAST AI billing, and customers can now see the current as well as the forecasted end-of-month numbers.

    • Improved Autoscaling. Previously, if a required instance type was unavailable due to an ‘Insufficient Capacity Error’ received from the Cloud provider, it took CAST AI a considerable amount of time to find an alternative or initiate the creation of Spot Fallback. Now, CAST AI will choose the next best option straight away from the ‘candidate list’ without waiting for the next snapshot.

    • Reworked workload grouping algorithm, so Workload cost reports for clusters with a huge amount of workloads load faster.

    • Adjusted Autoscaler settings, Node template, and Node configuration UX by making minor changes to the user interface elements.

    • Set the default disk-to-CPU ratio to 0 (instead of 5). By default, nodes added by CAST AI now use a ratio of 1 CPU to 0 GiB for the root volume, so the default disk size is 100 GiB. Users can change this setting in the Node configuration properties.

    • Users can now trigger Cluster reconcile for any cluster that was onboarded to CAST AI-managed mode. This functionality aligns the actual cluster state with the state in the CAST AI central system, so any issues related to credentials or inconsistencies in cluster state can be resolved (or flagged).

    • Users can also now retrieve our credentials onboarding script when the previously onboarded cluster is in the ‘Failed’ or other states. Re-running this script would update CAST AI components and solve any IAM issues (e.g., missing permissions).

    • We have updated Terraform examples. Released a new version of our Terraform provider; the list of changes can be found here. The provider, together with modules, can be found in the registry.

    • Released a new version of the CAST AI agent; the list of changes can be found here. To update the agent in your cluster, please follow these steps.

    • Fixed various bugs and introduced minor UI/UX improvements.
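
    The capacity handling described in this release (choosing the next best option from the candidate list after an insufficient capacity error) can be pictured as a walk over a ranked list. The following is a minimal sketch, not CAST AI's implementation: the `Candidate` record, the `pick_next_best` helper, and the instance names and prices are all hypothetical, and the list is assumed to be ranked from best to worst already.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Candidate:
    instance_type: str   # hypothetical example name
    hourly_price: float  # illustrative price, not real pricing data


def pick_next_best(candidates: Iterable[Candidate],
                   unavailable: Set[str]) -> Optional[Candidate]:
    """Return the first ranked candidate that has not hit a capacity error."""
    for candidate in candidates:
        if candidate.instance_type not in unavailable:
            return candidate
    # Nothing left in the list; a caller could start a Spot Fallback here.
    return None


# Candidates are assumed to be ranked from best to worst.
ranked = [
    Candidate("m5.xlarge", 0.07),
    Candidate("m5a.xlarge", 0.08),
    Candidate("m4.xlarge", 0.09),
]

# The first choice returned an insufficient capacity error, so skip it.
choice = pick_next_best(ranked, unavailable={"m5.xlarge"})
print(choice.instance_type)
```

    The point of the sketch is that the alternative is taken from the list that already exists, so no new snapshot is needed before the next attempt.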

  • Node template and Configuration support for AKS and GKE clusters

    Node configuration
    Node templates
    • Released Node template and Node configuration functionality for GKE and AKS clusters. Customers can now create required node pools and apply specific configuration parameters on CAST AI-created nodes. Over upcoming sprints, support for advanced configuration parameters will be added.

    • Enriched AKS cluster onboarding, so now the script returns detailed information about encountered errors, even if they are retriable. 

    • AWS Node configuration – added ability to pass kubelet configuration parameters in the JSON format.

    • Exposed the Kubernetes version in the cluster dashboard.

    • Removed the ability to set disk size in the Autoscaler policies page and transferred it to Node configuration. The default ratio for calculating disk size based on CPU is 1CPU:5GiB.

    • Added the ability for customers to modify helm values in our Terraform modules. Released a new version of our Terraform provider; the list of changes can be found here. The provider, together with modules, can be found in the registry.

    • Expanded our inventory to support all AKS regions. 

    • Released a new version of the CAST AI agent; the list of changes can be found here. To update the agent in your cluster, please follow these steps.

    • Fixed various bugs and introduced minor UI/UX improvements.

  • New user onboarding experience and many other improvements

    ARM
    Evictor
    Node templates
    Rebalancer
    • We have reworked the user onboarding flow to improve the experience. Now, users can explore a demo cluster available immediately after registration, providing a guided tour through CAST AI features.

    • When users connect an EKS cluster with GPU-attached nodes, the Savings report now displays the GPU count as a separate dimension.

    • Users now have the ability to swap between Nodes and Workloads view when preparing to rebalance the cluster. It makes it easier to identify problematic workloads in one go.

    • The minimum node count figure is now exposed in the Rebalancing plan screen so that users can configure the minimum desired number of nodes in the post-rebalanced cluster state. That way, customers have more control over the rebalancing outcome to align with their goal for high availability / compute resource distribution.

    • For EKS users, we have added the support for autoscaling using Storage optimized nodes when the workload requests ephemeral storage in the pod definition (i.e., via nodeSelector and toleration, read more in our docs).

    • Node templates now support the Fallback feature, so users who create node pools using templates consisting of spot nodes can benefit from CAST AI’s ability to guarantee capacity even when spot nodes are temporarily not available.

    • Fixed a bug that caused Evictor not to shut down when the Empty node policy is turned off.

    • Added ARM node support to the CAST AI provisioner for EKS and GKE clusters. ARM nodes can now be added to the cluster via the API; autoscaling support is coming next.

    • Made Evictor more cluster context-aware, so when it is used in the ‘aggressive mode,’ it will not remove single replica pods in big batches, to avoid downtime.

    • Improved Autoscaler logic: when the autoscaler has a choice of AZ (pods don’t require specific zone via selector or attached volumes), it will choose the zone where there’s less provisioned capacity (CPU) and fewer blacklisted instances. Both factors are taken into account – heavy underprovisioning will win against slightly higher blacklist count, and vice versa.

    • Fixed bugs and released several user experience improvements in the Security report.

    • Released a new version of the CAST AI agent; the list of changes can be found here. To update the agent in your cluster, please follow these steps.

    • Released a new version of our Terraform provider; the list of changes can be found here. The provider, together with modules, can be found in the registry.
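
    The zone selection logic described in this release weighs two factors: provisioned CPU capacity and the number of blacklisted instances per zone. One way to picture such a trade-off is a combined penalty per zone. This is only a sketch under stated assumptions: the `Zone` record, the `pick_zone` helper, the normalization, and the `blacklist_weight` parameter are hypothetical and do not describe the actual CAST AI scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Zone:
    name: str
    provisioned_cpu: float  # CPUs already provisioned in this zone
    blacklisted: int        # instance types currently blacklisted in this zone


def pick_zone(zones: List[Zone], blacklist_weight: float = 1.0) -> Zone:
    """Pick the zone with the lowest combined, normalized penalty."""
    # Avoid division by zero when nothing is provisioned or blacklisted.
    total_cpu = sum(z.provisioned_cpu for z in zones) or 1.0
    total_blacklisted = sum(z.blacklisted for z in zones) or 1

    def penalty(zone: Zone) -> float:
        cpu_share = zone.provisioned_cpu / total_cpu
        blacklist_share = zone.blacklisted / total_blacklisted
        # Hypothetical weighting: both shares count, neither one is absolute.
        return cpu_share + blacklist_weight * blacklist_share

    return min(zones, key=penalty)


# Heavy underprovisioning outweighs a slightly higher blacklist count.
print(pick_zone([Zone("zone-a", 10, 3), Zone("zone-b", 90, 2)]).name)

# A much higher blacklist count outweighs a slightly lower CPU share.
print(pick_zone([Zone("zone-c", 48, 9), Zone("zone-d", 52, 1)]).name)
```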

  • Notifications are here!

    Autoscaler
    Evictor
    Node configuration
    Notifications
    • We have launched the Notifications feature to inform customers via UI or webhook about various issues, upgrades, new releases etc., affecting their clusters. Currently, the feature supports a single scenario: customers will be informed if CAST AI credentials are invalidated. In the upcoming weeks, more scenarios will be added. A detailed guide about how to set up a notification webhook can be found here.

    • Improved the performance of the binpacking algorithm (Evictor) to ensure that it is capable of quickly downscaling large / volatile clusters. Instead of targeting and draining nodes one node at a time, Evictor will validate that affected pods are reschedulable and then target multiple nodes in parallel in the same cycle.

    • Workload cost report now supports filtering by labels so customers can easily find cost information of specific workloads based on the labels applied. Furthermore, users can now see the cost over time information for every workload.

    • Made the taint an optional setting when creating a Node template; users can also now specify a custom nodeSelector. These improvements enable more flexible use of Node templates based on the customer’s use case.

    • Improved the CAST AI autoscaler algorithm with multiple optimizations. CAST AI now considers additional cost-efficiency scenarios before satisfying pending pods. For example, CAST AI used to prefer bigger nodes, which was not always the cheapest option. Now the algorithm also considers a combination of smaller nodes; this approach contributes to more cost savings and additional stability when handling spot instances.

    • Updates to Node configuration feature:

      • Added ability to specify containerd or Docker as a default container runtime engine to be installed in CAST AI provisioned nodes.

      • Created functionality to provide a set of values that will be overwritten in the Docker daemon configuration (available values).

      • Added support for Node configuration functionality to CAST AI Terraform modules.

    • The CAST AI cluster controller is now more resilient and will restart on failure instead of failing silently.

    • Re-arranged UI elements in the console and made various changes to the onboarding flow for a better user experience.

    • Added support for AWS ap-southeast-3 (Jakarta) and me-central-1 (UAE) regions.

    • Introduced multiple stability and performance improvements to the platform.

    • Released a new version of the CAST AI agent; the list of changes can be found here. To update the agent in your cluster, please follow these steps.

    • Released a new version of our Terraform provider; the list of changes can be found here. The provider, together with modules, can be found in the registry.
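
    The autoscaler change in this release (also considering a combination of smaller nodes instead of preferring one bigger node) can be illustrated with a simple cost comparison. This is a simplified sketch, not the CAST AI algorithm: it looks at CPU only, it only compares groups of identical nodes, and the `InstanceType` record, the `cheapest_option` helper, and all names and prices are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InstanceType:
    name: str            # hypothetical example name
    cpus: int
    hourly_price: float  # illustrative price, not real pricing data


def cheapest_option(types: List[InstanceType],
                    pending_cpus: int) -> Tuple[InstanceType, int, float]:
    """Return (type, node count, hourly cost) of the cheapest homogeneous option."""
    options = []
    for instance_type in types:
        # Number of nodes of this type needed to cover the pending CPU requests.
        count = math.ceil(pending_cpus / instance_type.cpus)
        options.append((instance_type, count, count * instance_type.hourly_price))
    return min(options, key=lambda option: option[2])


types = [
    InstanceType("large-16cpu", cpus=16, hourly_price=0.80),
    InstanceType("small-4cpu", cpus=4, hourly_price=0.20),
]

# Six pending CPUs fit on one large node or on two small nodes.
best_type, node_count, hourly_cost = cheapest_option(types, pending_cpus=6)
print(best_type.name, node_count, hourly_cost)
```

    With these made-up prices, two small nodes cost less than one large node, which is the kind of scenario the release note refers to.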

  • Launch of the Free Kubernetes Security report

    Node configuration
    Security
    • We have released a Free Kubernetes Security Report that contains:

      • Overview page – gives an overview of the historical trends in vulnerabilities and best practices configuration within the cluster and how those vulnerabilities are distributed across cluster resources.

      • Best practices page – gives insights into the cluster’s alignment with security and DevOps best practices. The insights are provided in the form of checks with a short description of the issue and remediation guidelines. This release only covers insights based on the read-only data collected by the CAST AI agent.

      • Vulnerabilities page – provides a list of vulnerable objects with detailed information about vulnerabilities found and information about available fixes. This release covers only vulnerability assessment of the images downloaded from public repositories.

    • Cost over time graphs in the Available savings report now react to the chosen configuration preference (i.e., Spot only, Spot-friendly, or on-demand only configuration settings).

    • Our recently released Node configuration functionality for EKS now also supports kubelet configuration parameters and the ability to pass user data in the form of a bash script.

    • GCP custom instances can now be provisioned with a lower CPU to RAM ratio of 1 : 0.5 , instead of 1 : 1 as it was previously.

    • The latest version of the CAST AI agent is v0.32.1; the list of changes can be found here. To update the agent in your cluster, please follow these steps.

    • Released a new version of our Terraform provider (v0.26.0); the list of changes can be found here. The provider, together with modules, can be found in the registry.

  • Launch of GPU support, Node configurations, and templates for EKS clusters

    Autoscaler
    GPU
    Node configuration
    Node templates
    • CAST AI can now autoscale workloads that require GPU-attached nodes. Currently, this feature supports Nvidia GPU-attached EKS nodes. To use this functionality, the workload needs to have defined GPU limits and toleration nvidia.com/gpu. For more information, please refer to documentation.

    • We released a new feature called Node configuration! It allows users to define configuration settings for each CAST AI provisioned node. This feature is currently enabled for EKS clusters only.

    • Also for EKS clusters, we have released a new feature called Node template, which allows the creation of node pools. Node pools can be used to run specific workloads on a pre-defined list of nodes only. Cost-wise, this behaviour leads to a sub-optimal state of the cluster, but it gives users more control.

    • Another new feature! The Cost comparison report captures the state of the cluster (i.e. number of requested CPUs and cost) prior to the enablement of CAST AI optimization and extrapolates the savings by comparing the cluster’s historical versus current state. The report clearly shows the value of the CAST AI node selector and bin packer algorithms.

    • We launched a CAST AI offering in the GCP marketplace, so Google customers can purchase CAST AI directly from the well-known GCP platform.

    • For GKE clusters, the CAST AI autoscaler can now be instructed to scale the cluster with instances that have locally attached SSD disks. To do it, a workload has to have a node selector and toleration for label scheduling.cast.ai/storage-optimized defined in the spec. For more details, please refer to the documentation.

    • We have introduced temporary taints to prevent pods from being scheduled until the node creation phase is completely finished during rebalancing.

    • Revamped the design of the cluster list, added more details about CPU and memory resources used by individual clusters, as well as the organization as a whole.

    • We uplifted the design of the Autoscaler policies page.
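
    The GPU requirement mentioned in this release (a workload with defined GPU limits and a toleration for nvidia.com/gpu) could look roughly like the manifest built below. This is a hedged sketch: the `gpu_pod_manifest` helper, the pod name, and the image are hypothetical, and the toleration shape (`operator: Exists`) is an assumption; the documentation referenced above remains the authority on the exact fields.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a pod manifest with a GPU limit and an nvidia.com/gpu toleration."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                # GPU limit: the workload declares how many GPUs it needs.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
            # Assumption: tolerate any taint with this key, whatever its value or effect.
            "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}],
        },
    }


# Hypothetical name and image, used only for illustration.
manifest = gpu_pod_manifest("cuda-job", "example.com/cuda-job:latest")
print(json.dumps(manifest, indent=2))
```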

  • Cost Report launch

    Cost monitoring
    Evictor
    • The Cost Reporting solution is now live. The report displays the compute costs of the cluster and cost allocation per workload or namespace. Customers can quickly assess the compute costs associated with an application, service, or team. Additional reporting dimensions like Cost per CPU were introduced to help customers analyze the cost. Further enhancements are coming in Q3.

    • Changes in user interface settings in the Available savings or Cost reports now persist when switching to another cluster.

    • The cluster list can now be filtered based on the cluster name or status.

    • The CAST AI agent adjusts the resources it needs to operate based on the size of the cluster. We have improved the memory size scaling logic to consider the CPU count and the node count.

    • Users can now set a custom replica count for the CAST AI agent, by simply scaling the deployment. At a point in time, only one replica will be running, and others will be in a passive mode. However, if an active replica crashes, another replica will become active, thus ensuring service availability.

    • Evictor now respects Pod Disruption Budgets (PDB) and won’t try to evict the pod if it would violate the PDB.

    • Exposed the blacklist information via API. Instances that can’t be used for autoscaling are now visible via the API. Instances affected by an insufficient capacity error in the cloud service provider are not visible yet; this improvement is in the works.

    • During the installation of our agent, a customer-managed secret can now be specified in CAST AI Helm charts; check the documentation for guidance.

    • Added support for kOps versions 1.21 and 1.22.

    • The latest version of the CAST AI agent is v0.31.0; the list of changes can be found here. To update the agent in your cluster, follow these steps.

    • Released a new version of our Terraform provider (v0.24.2); the list of changes can be found here. The provider, together with modules, can be found in the registry.
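
    The Pod Disruption Budget behaviour described in this release comes down to a check before each eviction. A minimal sketch, assuming a PDB expressed as an integer `minAvailable`; the `eviction_allowed` helper is hypothetical and ignores percentage values and `maxUnavailable`:

```python
def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """True if evicting one pod still leaves at least `min_available` healthy pods."""
    return healthy_pods - 1 >= min_available


# Three healthy replicas with minAvailable=2: one pod can be evicted.
print(eviction_allowed(healthy_pods=3, min_available=2))

# Two healthy replicas with minAvailable=2: evicting would violate the PDB.
print(eviction_allowed(healthy_pods=2, min_available=2))
```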

  • Partial rebalancing, pod topology spread constraint support, and Terraform module for AKS clusters

    Autoscaler
    Evictor
    Rebalancer
    • Introduced the partial rebalancing capability. Instead of rebalancing the whole cluster, customers can now select specific nodes to rebalance (replace with more optimal configuration). The user experience was reworked to focus on the nodes instead of the workloads.

    • Implemented temporary tainting on the new nodes created during rebalancing, so no workloads can land on them before the node creation phase is finished.

    • When autoscaling GKE clusters, CAST AI can now assess the workloads and pick custom instances if it is more beneficial than using standard instances. This is an optional policy setting customers can enable.

    • Added a search bar to the cluster list.

    • Improved the calculation for system memory overhead on GKE and EKS nodes, to ensure that pods always fit in the provisioned nodes.

    • Added support for scheduling.cast.ai/compute-optimized: "true" label. If this node selector is used, CAST AI will provision compute optimized nodes. The list of all supported labels can be found in the documentation.

    • Made the CAST AI agent more robust by ensuring that health check fails if it can’t deliver snapshots for a period of time.

    • Updated Evictor to not evict static pods.

    • Added automatic handling of CPU quota errors during autoscaling for GCP and Azure customers.

    • Added support for workloads on AKS that use persistent volumes and the topology label topology.disk.csi.azure.com/zone.

    • Improved the logic of recognizing not ready/unreachable nodes and removing them from the cluster.

    • The autoscaler now supports pod topology spread constraints on the topology.kubernetes.io/zone label. More information can be found in the documentation.

    • Removed ability to delete control plane nodes on kOps clusters directly from the node list.

    • If a node is cordoned, its status in the node list will change to “SchedulingDisabled”.

    • The latest version of the CAST AI agent is v0.28; the list of changes can be found here. To update the agent in your cluster, follow these steps.

    • Released a new version of our Terraform provider (v0.24.0) and module to support AKS cluster onboarding. Provider and the module can be found in the registry.
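
    Two of the scheduling features in this release, the scheduling.cast.ai/compute-optimized: "true" node selector and pod topology spread constraints on topology.kubernetes.io/zone, can be combined in one pod spec. The sketch below builds such a fragment; the `spread_across_zones` helper, the `app` label, and the chosen `maxSkew` and `whenUnsatisfiable` values are illustrative assumptions, while the field names are the standard Kubernetes ones.

```python
import json


def spread_across_zones(app_label: str, max_skew: int = 1) -> dict:
    """Pod spec fragment: compute-optimized nodes, pods spread evenly across zones."""
    return {
        # Label taken from the release note above.
        "nodeSelector": {"scheduling.cast.ai/compute-optimized": "true"},
        "topologySpreadConstraints": [{
            "maxSkew": max_skew,
            "topologyKey": "topology.kubernetes.io/zone",
            # Assumption: keep pods pending rather than violate the spread.
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {"matchLabels": {"app": app_label}},
        }],
    }


spec_fragment = spread_across_zones("checkout")  # hypothetical app label
print(json.dumps(spec_fragment, indent=2))
```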

  • Evictor as a policy and many more improvements

    Autoscaler
    Evictor
    • Evictor – you can now install our algorithm that continuously bin packs pods into the smallest number of nodes via Autoscaler policy (in UI and Terraform). Previously, users had to follow a documented guide and install it via the command line. Please note: if you have Evictor already installed and configured it will continue to run, even though in the Autoscaler page it might indicate that it can’t be enabled. In order to correct this you would need to remove current Evictor installation manually and enable it from the Autoscaler page.

    • ‘Cost per CPU’ reporting dimension was added to Available Savings report, as well as Node list and Rebalancing screens. This cost is calculated by dividing the compute cost by the number of provisioned CPUs. It is also exposed as a scrapeable metric for the whole cluster or per-instance life cycle: spot, on-demand, fallback. A full list of currently available metrics and the setup guide are available here.

    • We reacted to community feedback and improved the user interface and experience of our Cost report. The user interface of the Autoscaler policies page was also uplifted.

    • Added functionality that allows users to remove a team member from the organization.

    • Added the concept of a “Project” to the CAST AI console. Previously, if users had clusters with the same name across different GCP projects (or Azure Resource groups, AWS accounts) in the CAST AI console, there was no way to differentiate between these clusters. Now each cluster record also indicates the name of the GCP project / Azure Resource group / AWS account ID.

    • Node list view now displays the total node CPU and Memory capacity instead of allocatable values. We have also fixed an issue that prevented the status of a cordoned node from being accurately presented in the node list.

    • We have updated the calculation formula for the root volume that is added to each CAST AI provisioned node. Before the change, a node received either a 100 GiB disk as a minimum or a larger disk based on the CPU-to-storage (GiB) ratio, which could not be lower than 1 CPU : 25 GiB. Now, a 100 GiB disk is the base, and we add storage on top of it based on the CPU-to-storage ratio, which can be as low as 1 CPU : 0 GiB.

    • For GKE clusters, the logic of the ‘Node constraints’ setting in the Autoscaler policy is now more flexible. We have removed CPU to RAM ratio validations so users can choose more flexible configurations.

    • For AWS EKS users who are using AWS CNI, we have improved the autoscaler logic, so when considering what node to add, autoscaler would react to an error if the target subnet is full and choose another available subnet (if any).

    • For AWS EKS users who are using the CSI driver for Amazon EBS, we have added support for the topology.ebs.csi.aws.com/zone label in the autoscaler, so the new node is created in the correct zone, respecting the specification of the storage class.

    • We have optimized the permissions required to onboard and run CAST AI in EKS clusters; more information can be found in the documentation.

    • If an EKS cluster is being onboarded using Terraform we now allow passing AWS credentials via environment variables to the agent helm chart.

    • Added Terraform examples for onboarding EKS cluster using custom IAM policies, creating EKS cluster with NAT Gateway and Application load balancer. All examples can be found here.

    • The latest version of the CAST AI agent is v0.27; the list of changes can be found here. To update the agent in your cluster, follow these steps.

    • The latest version of our Terraform provider is v0.23; the list of changes can be found here. The provider and the modules can be found in the registry.
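
    The root volume formula change in this release can be expressed as two small functions. This is a sketch of one reading of the text above, not the actual implementation: the function names are hypothetical, and the old rule is interpreted as "the larger of 100 GiB and CPU count times the ratio".

```python
BASE_DISK_GIB = 100


def old_root_volume_gib(cpus: int, gib_per_cpu: int) -> int:
    """Old rule: 100 GiB minimum, or a larger disk derived from the ratio."""
    if gib_per_cpu < 25:
        raise ValueError("the old rule did not allow less than 25 GiB per CPU")
    return max(BASE_DISK_GIB, cpus * gib_per_cpu)


def new_root_volume_gib(cpus: int, gib_per_cpu: int) -> int:
    """New rule: 100 GiB base plus extra storage derived from the ratio."""
    if gib_per_cpu < 0:
        raise ValueError("the ratio cannot be negative")
    return BASE_DISK_GIB + cpus * gib_per_cpu


# An 8-CPU node under the old rule at the lowest allowed ratio (1 CPU : 25 GiB).
print(old_root_volume_gib(cpus=8, gib_per_cpu=25))

# The same node under the new rule at 1 CPU : 0 GiB and at 1 CPU : 5 GiB.
print(new_root_volume_gib(cpus=8, gib_per_cpu=0))
print(new_root_volume_gib(cpus=8, gib_per_cpu=5))
```

    Under this reading, the new rule lets large nodes keep a small disk (a ratio of 0 always yields 100 GiB), which the old minimum ratio did not allow.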

  • Terraform support for GKE, Cost report, AWS Cross account role support, and more

    Cost monitoring
    Rebalancer
    Terraform
    • Implemented an additional way to onboard an EKS cluster. Now users can delegate access to CAST AI using the AWS cross-account IAM role.

    • Added scrapable metrics for Fallback nodes to display the requested and provisioned CPU and RAM resources. A full list of currently available metrics and the setup guide are available here.

    • Released the Cost report for a public preview. This report allows customers to track historical cost data of the cluster to understand how the cost fluctuated over the time period, what was the normalized cost per provisioned CPU, what is the forecasted cost at the end of the month, and more.

    • We have released a new version of our Terraform provider (v0.17.0) and modules to support GKE cluster onboarding. Provider and the modules can be found in the registry.

    • Updated our External clusters API so EKS customers can check and if necessary, update security groups.

    • Added an ability for a GKE cluster, which was paused using GCP console, to automatically get back to the ‘Ready’ state after being resumed.

    • Introduced a new node status called “Detached,” so nodes that are no longer part of the K8s cluster but are still running in the customer’s cloud account can be identified for removal.

    • Cluster dashboard will now display spot, on-demand, and Fallback node counts separately.

    • If AWS custom tags were added into the cluster config, they will be replicated to the underlying volume attached to the EC2 instance.

    • Optimized performance of the Evictor in very large clusters and, as a consequence, users can bring costs down faster.

    • Exposed more details about the error when encountered during rebalancing operation.

    • Enhanced ‘Unscheduled pods policy applied’ audit log event to display the trigger (i.e. pods that caused autoscaling), as well as the filters that Autoscaler was working with to pick up a new node.

    • Also in the audit log, the event that indicates the Node addition failure is now followed by a rollback event.

    • Implemented enhanced JSON viewer, making it much easier to read JSON output when presented in the console.

    • If the cluster is in the ‘Warning’ state, it will now display a reason for it.

    • The latest version of the CAST AI agent is now v0.25.1. To update the agent in your cluster, follow these steps.

  • GCP network tags, ssh key support, and audit log improvements

    Autoscaler
    Evictor
    • Added support for GCP network tags (a concept used in the GCP world to manage network routing). Users can pass the tag 'network-tag.gcp.cast.ai/{network-tag-name}' as a label and newly created nodes will be tagged accordingly.

    • Now if CAST AI fails to add a node due to reasons outside of our control (e.g., the customer’s quota is too low) a specific event called “Adding node failed” will occur in the audit log to provide additional context about the failed operation.

    • Added support for ssh public keys. Using the updated cluster configuration API, users can set the public key (base64 encoded) or, in the case of AWS, also use the AWS key pair ID ("key-0123456789EXAMPLE") and connect to CAST AI provisioned nodes.

    • Spot nodes can now be interrupted directly from the node list. The interrupted node will change its status and eventually be removed from the cluster while the new spot node is provisioned instead. The whole process takes a few minutes to complete.

    • Added numerical values of requested and allocatable resources in CPU and GiB to the detailed node view in the Node list.

    • We have added an additional sheet that lists all workloads and their CPU & RAM requests in the Excel extract of the Available savings report.

    • If Evictor fails the leader election process, it will now restart automatically. Previously, users may have encountered a situation where the leader election process has failed, causing Evictor to fail silently.

    • “Unscheduled pods policy applied” audit log event JSON now has more context about what information was considered and led to the addition of a specific node, i.e. which nodes were skipped and why, which workload triggered the autoscaling event, what were the node constraints, etc. This feature greatly improves transparency into the decision-making process employed by CAST AI.

    • Introduced label selectors in the mutating webhook configuration, so customers can control the webhook in a much more flexible manner. Previously, users could set which pods should be scheduled on the on-demand nodes using regular expression values (namespaces). Now, they can use label selectors to force (or ignore) some pods to run on spot nodes (ability to force some pods to run on on-demand nodes based on a namespace remains).

    • Added another scrapable cluster metric – the hourly compute cost per pricing type (spot, on-demand, fallback). Check the documentation for more details.

    • Introduced a separate screen for the dashboard of the disconnected cluster.

    • Uplifted the design of console menu items.
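
    The GCP network tag convention from this release encodes the tag name in the label key, after the 'network-tag.gcp.cast.ai/' prefix. The sketch below shows how such labels map to tag names; the `network_tags_from_labels` helper, the example labels, and the label value "true" are assumptions made for illustration only.

```python
from typing import Dict, List

NETWORK_TAG_PREFIX = "network-tag.gcp.cast.ai/"


def network_tags_from_labels(labels: Dict[str, str]) -> List[str]:
    """Collect the tag names encoded in `network-tag.gcp.cast.ai/<name>` label keys."""
    return sorted(
        key[len(NETWORK_TAG_PREFIX):]
        for key in labels
        # Skip unrelated labels and a bare prefix with no tag name after it.
        if key.startswith(NETWORK_TAG_PREFIX) and len(key) > len(NETWORK_TAG_PREFIX)
    )


# Hypothetical labels; the value "true" is an assumption, only the key matters here.
labels = {
    "app": "web",
    "network-tag.gcp.cast.ai/allow-http": "true",
    "network-tag.gcp.cast.ai/internal": "true",
}
print(network_tags_from_labels(labels))
```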

  • Node list filtering, more Prometheus metrics and additional details about connected clusters

    Autoscaler
    Cost monitoring
    • Released a node list filtering and search capability that allows users to filter large node lists conveniently based on specific search criteria.

    • Cluster dashboard and a more detailed node list are now available as soon as a cluster is connected to CAST AI. It is no longer necessary to connect a cluster into the ‘managed’ mode in order to access these features.

    • In the Available Savings report, compute resources can now be viewed in a more detailed mode where they are broken into categories based on instance lifecycle type: spot, on-demand, fall-back (a temporary on-demand instance while spot is not available).

    • Introduced universal autoscaling.cast.ai/removal-disabled label and annotation that will be respected during Rebalancing or Evictor operations. Nodes or workloads marked this way will not be subject to migration. This label also replaces previously used beta.evictor.cast.ai/eviction-disabled which will be deprecated shortly. More information about Evictor overrides can be found in the documentation.

    • The Autoscaler now supports the topology.gke.io/zone label.

    • More Prometheus metrics. We have exposed for scraping all metrics visible in the cluster dashboard. A full list of currently available metrics and the setup guide can be found here.

    • Released a new version of our Terraform module for connecting EKS clusters to CAST AI. The module now supports cluster-wide tagging as well as the ability to configure Autoscaler policies.

    • To ensure that kOps nodes always have resources to run OS and kubelet we have implemented support for the system overhead settings.

    • Available savings report now has PDF export functionality.

  • Further enhancements of the Rebalancing feature

    Rebalancer
    • The Rebalancing feature received the following improvements:

      • Temporary on-demand nodes (aka Spot fallback nodes) will be considered during rebalancing plan generation if the ‘Spot fallback’ feature is turned on in the Autoscaler.

      • Applied various improvements to reduce the amount of time taken to create and execute Rebalancing plans. This performance enhancement is especially noticeable on large clusters.

      • Introduced a way to protect specific workloads and nodes from migration activity during rebalancing. Users can annotate pods or label nodes with 'autoscaling.cast.ai/removal-disabled' to ensure that they are not considered for migration.

      • Users can now generate new rebalancing plans even if the current plan is still relevant. Generating a new plan moves the previously active plan into the obsolete state.

    • Latest version of the CAST AI agent is v0.22.8. To update the agent in your cluster, follow these steps.

    • Uplifted our signup and login pages for a better user experience.

    • Bug fixes and other performance improvements.
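
    The migration protection introduced in this release relies on the 'autoscaling.cast.ai/removal-disabled' key, set as an annotation on pods or as a label on nodes. A check for it could be sketched as follows; the `is_protected` helper and the expected value "true" are assumptions, since the release note does not state the value.

```python
REMOVAL_DISABLED = "autoscaling.cast.ai/removal-disabled"


def is_protected(metadata: dict) -> bool:
    """True if the object carries the removal-disabled key as a label or annotation."""
    labels = metadata.get("labels") or {}
    annotations = metadata.get("annotations") or {}
    # Assumption: the key must be set to the string "true" to take effect.
    return (labels.get(REMOVAL_DISABLED) == "true"
            or annotations.get(REMOVAL_DISABLED) == "true")


# Hypothetical object metadata, shaped like Kubernetes `metadata` sections.
annotated_pod = {"name": "db-0", "annotations": {REMOVAL_DISABLED: "true"}}
labeled_node = {"name": "node-1", "labels": {REMOVAL_DISABLED: "true"}}
plain_pod = {"name": "web-0", "labels": {"app": "web"}}

for metadata in (annotated_pod, labeled_node, plain_pod):
    print(metadata["name"], is_protected(metadata))
```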

  • Scoped autoscaler, improved Available savings report, and more

    Autoscaler
    Rebalancer
    • Released Scoped autoscaler, a mode of operations where the CAST AI autoscaler co-exists with another autoscaler on the same cluster and manages only specific workloads. To restrict the scope of the autoscaler, workloads have to be modified as described in the documentation.

    • Improved the Available Savings report by adding additional interactive settings that enable customers to simulate further optimization of the cluster by using spot instances more aggressively or operating the cluster on a schedule. Automated capability to stop external clusters on schedule when they’re not in use is in development and coming soon.

    • The Available savings report can now be exported to the Excel format.

    • Released the following Rebalancer improvements:

      • Issues preventing workloads from being migrated to new nodes can now be seen in detail from the workloads screen.

      • Each rebalancing plan now has a visible generation date.

      • Rebalancing plans now become obsolete after 1 hour and move to the archive with the status ‘Obsolete’.

      • In case the rebalancing plan execution failed, a technical error message is now visible in the logs.

      • In case the Rebalancer fails during plan generation, an error will be displayed to the customer on a separate screen. Rebalancing operations will not progress further.

      • Added automatic handling for insufficient capacity error, i.e. when the originally planned node type is no longer available, CAST AI will choose the next best alternative and proceed.

      • Updated the Rebalancer documentation.

    • We have released a new version of our Terraform provider (v0.10.0). The provider now supports cluster-wide configuration changes (e.g., the addition of subnet, security group). Documentation on Terraform registry was updated as well.

    • Evictor now has an aggressive mode where it can evict pods even if they have just a single replica. Check the documentation for more details.

    • Nodes can now be manually deleted from the Node list using the ‘Delete node’ button. During this operation, nodes are drained and then deleted from the cluster.

    • Released the new version v0.22.6 of the CAST AI agent, where we have improved how spot instances are identified on GKE. To update the agent in your cluster, follow these steps.

  • Terraform provider update and improved nodelist

    Autoscaler
    Rebalancer
    Terraform
    • We have released an updated version of the Terraform provider (v0.8.1), which now supports EKS clusters. Release and example projects can be found on GitHub.

    • In our UI menu “Policies” page is now called “Autoscaler”. We have started the work on improving the experience of setting up and controlling the autoscaler, more changes will come.

    • Released cluster dashboard that displays key metrics related to each cluster.

    • Implemented the following Node list improvements:

      • The list is sorted in descending order by date and can now be sorted by most of the columns.

      • Ability to view labels attached to each node.

      • Spot fall-back nodes are now identified with an icon.

    • Improved error handling in Rebalancer, providing screens with more details about the encountered error and possible remediation.

    • A new version of the CAST AI agent, v0.22.5, is now available. To update the agent in your cluster, follow these steps.

    • Fixed a bug in GCP custom spot instances pricing.

    • Fixed a bug in the Available savings report where sometimes workloads that are already running on spot instances would be suggested to be run on on-demand nodes.

    • Added records of spot fallback events to the audit log.

    • Evictor now has a setting to run in a more “aggressive” mode, in which it also evicts pods with a single replica. Check the documentation for more details.

    • Improved performance of our console UI and fixed various small bugs.

  • Spot fallback, enhanced cluster node list & private cluster support

    Autoscaler
    Spot Instance
    • Have you ever experienced a Spot instance drought, when the instances you need are temporarily unavailable and your workloads become unschedulable? The Spot fallback feature guarantees capacity by temporarily moving impacted workloads onto on-demand nodes. After a period of time, CAST AI checks for Spot availability and moves the workloads back to Spot instances. This feature is available on the Policies page under the Spot instance section and supports EKS, kOps, and GKE clusters.

    • Added support for private kOps clusters that do not have the K8s API IP exposed to the internet. The CAST AI agent now supports a “call-home mechanism” for private IP K8s clusters.

    • Node list went through a major upgrade and now contains much more detailed information about individual nodes in the cluster.

    • Autoscaler can now be instructed to scale the cluster with instances that have locally attached SSD disks, when the storage-optimized label is used in a workload spec. For details, please refer to the documentation.

    • Minor improvements to UI and bug fixes.
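    The Spot fallback behaviour described above can be sketched as a small state transition per workload. This is an illustration only, not CAST AI's implementation: the `Workload` type, the `reconcile` function, and the 30-minute re-check period are hypothetical, since the notes only say that availability is checked “after a period of time”.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Workload:
    name: str
    lifecycle: str = "spot"                    # "spot" or "on-demand"
    fallback_since: Optional[datetime] = None  # set while running on fallback capacity


def reconcile(w: Workload, spot_available: bool, now: datetime,
              recheck_after: timedelta = timedelta(minutes=30)) -> Workload:
    """Apply one step of the (assumed) fallback logic to a single workload."""
    if w.lifecycle == "spot" and not spot_available:
        # Spot drought: keep the workload schedulable on on-demand capacity.
        return Workload(w.name, "on-demand", fallback_since=now)
    on_fallback = w.lifecycle == "on-demand" and w.fallback_since is not None
    if on_fallback and spot_available and now - w.fallback_since >= recheck_after:
        # Spot capacity is back and the wait period has elapsed: move back.
        return Workload(w.name, "spot")
    return w
```

    Workloads that were placed on on-demand nodes for other reasons (no `fallback_since`) are left untouched in this sketch, so only fallback placements are ever moved back to Spot.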

  • Release of Rebalancer & cluster cost graph

    Rebalancer
    • We have launched a new feature that we call Rebalancer. It allows users to automatically migrate clusters from their current state to the most optimal configuration. The migration is performed in three distinct phases: 1) during preparation, the user can inspect all impacted workloads; 2) the user then gets a migration plan showing which nodes will be added and removed and what cost impact can be expected; 3) finally, the migration plan is executed by adding new nodes, migrating workloads, and deleting obsolete nodes.

    • The Available savings report is now enhanced with a graph that displays point-in-time actual and optimal cluster costs as well as other dimensions (CPU, memory, node count).

    • For kOps clusters, we no longer consider master nodes in our available savings report recommendations.
    • Added support for kOps version 1.20.

    • For AWS/kOps clusters, we previously deployed a Lambda function per cluster. This is no longer the case: from now on, a single Lambda function is deployed per account.

    • Implemented handling of cases where a customer has removed some permissions (or the cluster itself) in their cloud provider account. In such a scenario, the cluster is displayed with the status “Failed” in our console and the user has two options: remove the cluster from the console or fix the error in their cloud provider’s account.

    • Fixed various reported bugs and implemented other UI improvements.

  • AKS support is now available

    Autoscaler
    • Microsoft Azure users can now connect their AKS clusters to CAST AI and see how much they could save by using the CAST AI optimization engine. It’s completely free and safe as our agent operates in read-only mode. Try it out now.
    • Cluster onboarding flow is now fully automated and no longer requires manual entry of credentials.
    • Users can now generate read-only API access keys.
    • The cluster headroom policy can now be based on instance lifecycle type. Until now, users could configure one set of headroom values for the cluster. Now they can set headroom values for on-demand and spot nodes separately.
    • Added support for the following AWS regions: ap-northeast-3 Asia Pacific (Osaka), ap-east-1 Asia Pacific (Hong Kong), af-south-1 Africa (Cape Town), and me-south-1 Middle East (Bahrain). We now support all AWS (and GCP) regions.
  • Introduction of roles and improved cluster onboarding flow

    Organization
    • Organizational roles have been released. Every organization now has Owner, Member, or Viewer (read-only) roles that can be managed in our console.

    • Cluster headroom and Node constraints policies are now independent and can be set separately.

    • Improved cluster onboarding flow. Customers are no longer required to enter the access key and secret details; the onboarding script now takes care of them.

    • Customers can now set an annotation at the pod level that prevents Evictor from removing the node that hosts the pod. More details about annotations and labels used by Evictor can be found in our documentation.

    • The node deletion policy now removes nodes that are marked by Evictor immediately, ignoring the time delay set for empty nodes in the “Node deletion” policy. That way, customers can avoid paying for nodes that were marked as unschedulable.

    • Customers using AWS GovCloud regions (AWS GovCloud (US-East) and AWS GovCloud (US-West)) are now able to connect their clusters and check possible savings.

    • CPU hrs report is now available in the console. The report presents the total amount of CPU hours accumulated across all of the nodes in the organization.

    • GKE clusters running shielded nodes are now also fully supported in our platform.

    • Improved our inventory to support a wider range of instance types.

    • Delivered multiple Autoscaler improvements.

    • Minor UI improvements and bug fixes.

  • External GKE cluster optimization, Cluster metrics, and enhanced optimization policies

    Autoscaler
    • GKE cluster optimization. Customers running unshielded GKE clusters can now onboard them into CAST AI and benefit from all cost optimization policies.

    • Cluster metrics endpoint. We have released the first version of the metrics endpoint, which provides visibility into the metrics captured by CAST AI. The initial description of the metrics and a setup guide can be found on GitHub. We will continue expanding the list of exposed metrics, so stay tuned.

    • Implemented the Node Root Volume policy, which allows the root volume size to be configured based on the CPU count. This way, nodes with a high CPU count can have a larger root disk allocated upon node creation.

    • We have enhanced the Spot policy for EKS and kOps, so customers can instruct CAST AI to provision the least interrupted spot instances, the most cost-effective ones, or simply keep the default balanced approach. This cluster-wide policy can also be overridden at the deployment level.

    • CAST AI agent v0.20.0 was released. The agent now supports auto-discovery of GKE clusters, so users are no longer required to enter any cluster details manually.

    • Cluster headroom and Node constraints policies are now separated and can be used simultaneously.

    • We made it easier for users to set correct node CPU and Memory constraints that adhere to supported ratios.

    • Bug fixes and small interface improvements.
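    The Node Root Volume policy above sizes the root disk from the CPU count. The notes do not state the formula, so the following is only a sketch of one plausible rule, a fixed base plus a per-CPU increment; the function name and both default values are assumptions made for illustration.

```python
def root_volume_gib(cpu_count: int, base_gib: int = 100, gib_per_cpu: int = 5) -> int:
    """Return a root volume size (GiB) that grows linearly with the node's CPU count.

    base_gib and gib_per_cpu are hypothetical defaults, not CAST AI's values.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return base_gib + cpu_count * gib_per_cpu
```

    With these assumed defaults, a 4-CPU node would get 120 GiB and a 64-CPU node 420 GiB, which matches the intent that larger nodes receive larger root disks at creation time.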

  • Empty node time to live and new CAST agent version

    Autoscaler
    • Implemented a new feature that allows users to set the time for how long an empty node should be kept alive before deletion. This “empty node time-to-live” setting makes node deletion policy less aggressive in case users do not want to delete empty nodes right away. Read more about this feature in our docs.

    • CAST AI agent v0.19.2 was released. We removed managed fields and sensitive environment variables from objects and introduced compression of the delta content sent by the agent. Ensure that you always update to the latest version of our agent. Check GitHub for more details.

    • Quality of life improvements:

      • Improved UX for connecting GKE clusters

      • Savings estimator now displays totals of nodes in current and optimized configurations

      • Savings estimator now displays the status of all Cost optimization policies

      • Spot instance recommendations for workloads can now be exported to .csv

      • Users can now inspect the content of the YAML file in the “connect your cluster” screen before deploying it to the cluster

      • Improved UX for scenarios when Add-ons are not installed or can’t be found

    • Enhancement of our Audit log has continued, making it more detailed and useful.

    • Rolled out various bug fixes and small improvements.
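    The “empty node time-to-live” setting described above amounts to a simple age check before deletion. A minimal sketch, assuming the timer starts when the node becomes empty; the function and parameter names are hypothetical and do not come from the CAST AI API.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_delete(empty_since: Optional[datetime], now: datetime, ttl: timedelta) -> bool:
    """Return True when a node has been empty for at least `ttl`.

    `empty_since` is None while the node still runs pods.
    """
    if empty_since is None:
        return False
    return now - empty_since >= ttl
```

    Setting `ttl` to zero reproduces the immediate-deletion behaviour, while a longer value keeps empty nodes around in case new pods arrive shortly afterwards.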

  • Higher variety of SPOT instances, specification of CPU and RAM per node, Audit log improvements

    Autoscaler
    • Our Savings Estimator and Autoscaler can now target a wider variety of instance types when recommending Spot instances. This improvement allows customers to unlock more savings by using instance families that previously would not have been considered.

    • From now on users can rename the organization after the initial creation.

    • Audit log is now much more detailed and available for EKS and kOps clusters (previously this feature was available only on CAST AI created clusters).

    • We introduced an annotation and a label that protect a node from being considered for eviction and deletion. You can read more about them in our documentation.

    • During the migration to CAST AI-selected nodes, customers might want to specify minimum and maximum values of CPU and RAM for nodes to be added to a cluster. Users can now easily set these parameters in our Unscheduled pods policy and limit the pool of nodes that CAST AI considers. As before, the other option is to use the Cluster headroom settings.

    • We have added support for kOps 1.11, 1.15, and 1.17.

    • Removed IAM permission to create new roles from our credentials script.

    • Implemented another quality of life improvement – clusters can now be sorted based on the name, region or status.

    • Fixed bugs and made minor improvements to UI.

  • Organizations, Cost analyzer for GKE clusters and Cost optimization functionality for kOps

    Autoscaler
    Organization
    • CAST AI now supports Organizations! Multiple team members from a company can now join CAST AI, create an organization in our console, and collaboratively manage K8s clusters.

    • GCP customers can connect GKE clusters to CAST AI and see how much they could save by using the CAST AI optimization engine. As always, this is completely free and safe, as our agent operates in read-only mode. Try it out now. Functionality to optimize GKE clusters using CAST AI is currently in development.

    • Users running kOps clusters on AWS can now fully benefit from CAST AI cost analysis and optimization functionality. Connect your kOps cluster now to see how much you can save, and realize those savings by turning on AI-driven optimization policies.

    • Connected AWS (EKS and kOps) clusters can now be paused and resumed as easily as CAST AI-created clusters. Functionality to pause and resume on a pre-set schedule is coming soon as well.

    • The Node list is now accessible as soon as a cluster is connected; customers no longer need to onboard the cluster to access this functionality.

    • Additional Control plane nodes can now be added to CAST AI created clusters.

    • Clusters that were onboarded to CAST AI can now be disconnected via the UI. Customers have the option to delete or keep the nodes created by CAST AI.

    • We have reacted to user feedback and made minor adjustments to the UI as well as fixed bugs.

  • Release of Add-ons and more agile CAST AI agent

    • We have released the Add-ons management functionality for CAST AI clusters. Now CAST AI clusters will be created faster without any add-ons pre-installed. Afterward, users will be able to choose the add-ons they wish to use. The Add-ons feature is available in the cluster dashboard, try it out! 

    • We increased the frequency of communication between the agent deployed on the client’s cluster and CAST AI and reduced the amount of data the agent sends via the network. Now CAST AI can react in as little as 15 seconds and scale the cluster as required.

    • We have applied minor improvements and fixes to increase the accuracy of our Available savings report.

    • Improved experience for selecting and managing your subscription.

    • Created a guide on how to disconnect your EKS cluster from CAST AI.

    • Last but not least, we fixed some bugs and made small improvements to the UI.

  • Release of Cost optimization functionality for EKS clusters

    Cost monitoring
  • Save a lot by pausing and resuming your clusters on schedule

    • Save costs by stopping your clusters when they’re idle! We have launched a “Cluster schedule” functionality to pause and resume clusters based on the user-defined schedule. Find this feature in your cluster dashboard or check the documentation.

    • The node autoscaler policy now supports GCP Preemptible Instances.

    • We introduced additional validations in GCP credentials onboarding.

    • As always our team took care of bug fixes, performance optimizations, and small UI improvements.

  • Release of CAST AI agent and “Savings” feature

    • Launched an agent to connect the EKS cluster (that was not created by CAST AI) to our console. Users can now connect clusters in read-only mode and use the “Savings” feature to analyze proposed optimizations and their impact on the cloud bill.

    • Revamped dashboard UI.

    • Node interruptions are now visible in the log data via the Audit log UI.

    • Canada East (Montréal) is now a supported region in our cluster creation flow.

    • Fixed minor bugs.

  • Improved GCP credentials creation & Launch of CAST CLI

    • We have simplified the user credentials creation process for GCP.

    • You can now control your clusters using our own Command Line Interface (CLI).

    • Improved handling of Kubernetes nodes and load balancers, so the status of the nodes is tracked, and load balancers are removed when appropriate.

    • Improved the Unschedulable Pods policy to peek into the future and consider nodes that are being created.

    • Now users can process subscription payments without leaving our console.
    • Improved structure of our documentation; check it at docs.cast.ai.

    • Updated UI elements in our console and, as always, our team shipped some bug fixes.

    • Launched the status page so our customers can check the health of our platform.

  • Master Node Configuration & General Improvements

    • You can now add or remove additional master nodes on a live cluster. Convert a single non-high-availability control plane to 1, 3, or 5 nodes and vice versa.
    • The newly updated and easier to understand policy is now included as part of our Unschedulable Pods policy configuration. Read more in our documentation.
    • DigitalOcean cluster deletion is improved by better handling of dependency timing.
    • Various other small improvements.

     

  • New Upgrades & Visible Improvements

    • We upgraded Kubernetes to version 1.19 and bumped Cilium up to version 1.9. Take it for a spin here.
    • If you’re creating a new cluster with Azure as one of the providers, it will now use non-burstable Azure instance types.
    • Get more control if you see the need: interrupt and add a Spot Instance Node right from your Node list.
    • And, as always, we’ve shipped some bug fixes and performance improvements.
  • CAST AI welcomes the beloved Developer Cloud!

    • You asked, we’ve delivered: DigitalOcean is now part of our ever-growing list of supported cloud service providers. Starting now, you can stretch your Kubernetes clusters across DO, AWS, GCP, and Azure. Sign up here!
  • Support for Spot/Preemptible Instances added

    • Spot instances, if applied correctly, can yield up to 60-80% cloud savings and are really useful for stateless workloads (think microservices). So, starting now, if you want to, we can tell our optimization engine to start purchasing Spot (Preemptible on GCP) instances for you. And if these instances are interrupted by the cloud provider, we automatically replace them! GCP & Azure instances will follow very shortly. Read more in our documentation.
  • Support for WireGuard

    • If you want to use WireGuard as an alternative to Cloud VPN, you can now! Read more in our documentation.
  • CAST AI joined Cloud Native Computing Foundation

    • We’ve joined CNCF as full members. You’ll see more of us talking about true Multi-Cloud in CNCF events from now on!
  • Additional changes to CAST AI console

    • Create your API authentication tokens in the console
    • CAST AI API is moved to a more intuitive domain – api.cast.ai
  • New Terraform provider

  • New documentation hub

    • Access CAST AI documentation at docs.cast.ai. We’ve reworked it so you can find what you need more easily
  • Free AWS and Google Cloud credentials

    • You can claim your free credentials for AWS and Google Cloud in our Slack community. Try out our product for free for a limited time!
  • Improved Azure cloud credentials

    • Improvements in how Azure cloud credentials are created
  • A new cloud region in South America East

    • You’ve asked, we’ve delivered: choose São Paulo (South America East) to set up your clusters
  • Lots of additional changes in CAST AI console

    • You can now see Virtual Machine types and CPU/RAM usage in your Nodes dashboard
    • Easily copy your DNS records by accessing Global Server Load Balancer link from your cluster info widget
    • We’ve updated links to CAST AI documentation, API, and your cloud credentials
    • A new sign-up flow for easier setup
    • Initial costs are now visible when you are creating a cluster
    • Audit log tracks what actions are being performed on your cluster
  • We’ve made some changes in your cluster screen

    • CAST CSI (storage drivers) now support cloud native storage snapshots
    • We’ve increased security of your K8s clusters
    • You can now scale your apps more easily with the KEDA add-on, installed with pod autoscaler policies
    • Now, when the autoscaler scales down cluster nodes, RAM is taken into account more
    • CPU policy acts as a guardrail, limiting the minimum & maximum CPU cores per cluster
    • Prometheus in your cluster was moved to Control-Plane (Master) node
