A Quick Guide to Data and Security in CAST AI


When we say security is at the core of CAST AI, we mean it. Several of our founders have previously built a company focusing on application security and attack mitigation. Our CTO, Leon Kuperman, was the VP of Security Products at Oracle.

Unsurprisingly, this in-depth knowledge of cybersecurity led not only to an exceedingly safe product but also to CAST AI becoming ISO-certified and achieving SOC 2 Type II certification.

To make automated cost optimization of your Kubernetes clusters possible, we require only minimal access to your data and follow a clearly defined list of permissions. Keep reading for a detailed overview of our security measures at every step of the way.

Our general security commitments at CAST AI

All our security commitments are captured in Service Level Agreements (SLAs), other agreements, and the descriptions of our service offerings. These commitments are standardized and include, among others, the following:

  • We always encrypt customer data at rest and in transit,
  • Our platform is housed in state-of-the-art cloud environments that are SOC 2 Type II-compliant,
  • We continuously monitor and test our platform for any security vulnerabilities or unexpected changes,
  • The platform’s security model allows the segregation of responsibilities and of functional access within the application,
  • We provide access to customer data only on a need-to-know basis and audit employee access to ensure that access levels are never out of date,
  • We have run a security bug bounty program that lets security researchers around the world discover and report bugs – and, at the same time, helps ensure a high level of security across CAST AI services,
  • Our services are available at least 99.9% of the time (measured over the course of each calendar month).

Part 1: Analysis in a read-only mode

To deliver meaningful results at the first stage, our platform requires minimal cluster access. We generally follow the principle of least privilege. Our read-only agent doesn’t have access to anything that would allow it to change your cluster configuration or access sensitive data. 

Here’s what the CAST AI agent can see:

  • Main resources like nodes, pods, deployments, etc., that are required for running the Savings report,
  • Environment variables on pods, deployments, statefulsets, and daemonsets.

The CAST AI agent doesn’t have any access to elements like secrets or config maps. Environment variables whose names suggest sensitive values (passwords, tokens, keys, secrets) are removed before the resources are sent for analysis.
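To illustrate the idea, here is a minimal sketch of name-based filtering in Python. It is illustrative only and not the agent’s actual implementation; the keyword list and function name are assumptions.

```python
import re

# Illustrative sketch only – not the CAST AI agent's actual code.
# Drops environment variables whose names look sensitive before a
# resource snapshot is sent anywhere for analysis.
SENSITIVE_NAME = re.compile(r"password|passwd|token|key|secret", re.IGNORECASE)

def strip_sensitive_env(env_vars: dict) -> dict:
    """Return a copy of the env vars with sensitive-looking entries removed."""
    return {name: value for name, value in env_vars.items()
            if not SENSITIVE_NAME.search(name)}

if __name__ == "__main__":
    pod_env = {"LOG_LEVEL": "info", "DB_PASSWORD": "hunter2", "API_TOKEN": "abc123"}
    print(strip_sensitive_env(pod_env))  # {'LOG_LEVEL': 'info'}
```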

Read more about this here: How does the read-only CAST AI agent work and what data can it read?

Part 2: Automated optimization and security

Once the analysis is completed, users can either implement its recommendations manually or turn the automated optimization on.

CAST AI analyzes the setup and starts evicting pods, shrinking nodes, or adding Spot Instances (if you turn this option on) to achieve the required performance at the lowest possible cost.

Here’s an example from the perspective of Amazon AWS (the same security principles apply to Google Cloud Platform and Microsoft Azure).

By running the phase 2 onboarding script, you create a dedicated AWS user that CAST AI uses to request and manage AWS resources on your behalf.

To do that, the agent creates a user account with the following permissions: 

  • AmazonEC2ReadOnlyAccess (AWS managed policy) – used to fetch details about Virtual Machines,
  • IAMReadOnlyAccess (AWS managed policy) – used to fetch required data from IAM,
  • CastEKSPolicy (managed policy) – used for creating and removing Virtual Machines when managing cluster nodes,
  • CastEKSRestrictedAccess (inline policy) – CAST AI policy for the Cluster Pause / Resume functionality.

You can validate these policies by listing what is attached to that IAM user and combining the results – for example, with the AWS SDK, as sketched below.
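Here is one way to do that check with the AWS SDK for Python (boto3). The user name is a placeholder – use the name created by your own onboarding script.

```python
import boto3

# Sketch: list the managed and inline policies attached to the IAM user
# created during onboarding. USER_NAME is a placeholder, not a real default.
USER_NAME = "cast-ai-user"

iam = boto3.client("iam")

# Managed policies (e.g. AmazonEC2ReadOnlyAccess, IAMReadOnlyAccess, CastEKSPolicy)
for policy in iam.list_attached_user_policies(UserName=USER_NAME)["AttachedPolicies"]:
    print("attached:", policy["PolicyName"], policy["PolicyArn"])

# Inline policies (e.g. CastEKSRestrictedAccess)
for name in iam.list_user_policies(UserName=USER_NAME)["PolicyNames"]:
    print("inline:", name)
```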

All these permissions are scoped to a single cluster. This means that CAST AI doesn’t have access to the resources of any other clusters in your AWS account.

How CAST AI handles sensitive data

CAST AI doesn’t access any sensitive user data, no matter which resources you’re using in your Kubernetes cluster. All the platform knows is how much storage, memory, and CPU your cluster needs to run most efficiently.

Note: You can remove the CAST AI agent and all of its resources any time you want.
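For illustration, assuming the agent runs in a dedicated castai-agent namespace (confirm the name against your own install manifest), removal can look like this with the Kubernetes Python client:

```python
from kubernetes import client, config

# Sketch only: remove the agent by deleting its namespace and everything in it.
# The namespace name is an assumption – check your install manifest.
config.load_kube_config()  # uses your local kubeconfig / current context
client.CoreV1Api().delete_namespace(name="castai-agent")
```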

Here’s how CAST AI handles user data:

  • In CAST AI services, no data is deleted – instead, an entity is marked as deleted (see the sketch after this list),
  • Where possible, the platform uses foreign keys for internal data integrity,
  • If one service needs access to data from another service, it either makes a service call (synchronous or asynchronous) or uses a service that owns a copy of the data (a view), created by listening to the events that the golden data source publishes,
  • Each service runs its own isolated subset of platform data, which makes it easier to evolve its entity model and test changes.
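To make the first point concrete, here is a minimal sketch of that soft-delete pattern. It is illustrative only – the entity and field names are hypothetical, not CAST AI’s actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Illustrative soft-delete pattern: entities are never physically removed,
# they are only flagged with a deletion timestamp. All names are hypothetical.
@dataclass
class ClusterRecord:
    cluster_id: str
    name: str
    deleted_at: Optional[datetime] = None

def soft_delete(record: ClusterRecord) -> None:
    """Mark the record as deleted instead of removing it."""
    record.deleted_at = datetime.now(timezone.utc)

def active(records: List[ClusterRecord]) -> List[ClusterRecord]:
    """Reads filter out soft-deleted entities instead of relying on physical deletes."""
    return [r for r in records if r.deleted_at is None]
```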

Data back-up and recovery

We carry out nightly database backups using native Google Cloud services; the backups are stored encrypted in regional GCP object storage. Only authorized team members have access to these production backup storage locations.

We store database backups for 30 days. Backup files are regularly tested to ensure their integrity, and backup processes alert both on failure and on success. The logs of backup notifications are kept for up to 6 months.

Failed backups are monitored through Stackdriver, and the on-call engineer is notified about them using OpsGenie and Slack (and an email is sent to our security team).

Disaster recovery and business continuity

We have a documented Disaster Recovery plan, which is reviewed and updated at least annually. We also run a GKE instance spanning 3 availability zones in the Google data center in Ashburn (North Virginia).

Cloud SQL has a failover instance that is constantly on hot standby. In the event of a catastrophic failure of the primary data center, failover to this instance is automated and managed by Google.

Have a question we didn’t cover? Send us a message and someone on our team will answer!
