A quick guide to data and security in CAST AI

CAST AI
· 6 min read

When we say security is at the core of CAST AI, we mean it. Several of our founders have previously built a company focusing on application security and attack mitigation. Our CTO, Leon Kuperman, was the VP of Security Products at Oracle.

Unsurprisingly, this in-depth knowledge of cybersecurity has led not only to an exceedingly safe product but also to CAST AI getting ISO certified and being well on its way to obtaining SOC 2 certification.

To make automated cost optimization of your Kubernetes clusters possible, we require only minimal access to your data and follow a clearly defined list of permissions. Keep on reading to get a detailed overview of our security measures at every step of the way.

Our general security commitments at CAST AI

All our security commitments are captured in Service Level Agreements (SLAs), other agreements, and the descriptions of our service offers. These commitments are standardized and include, among others, the following:

  • We always encrypt customer data at rest and in transit,
  • Our platform is housed in state-of-the-art cloud environments that are SOC 2 Type II-compliant,
  • We continuously monitor and test our platform for any security vulnerabilities or unexpected changes,
  • The platform’s security model supports segregation of responsibilities and functional access within the application,
  • Our company provides access to customer data only on a need-to-know basis and audits employee access to ensure that access levels are never out of date,
  • We launched a security bug bounty program that lets security researchers around the world discover and report bugs while helping ensure a high level of security across CAST AI services,
  • Our services are available at least 99.9% of the time (measured over the course of each calendar month).
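
To put the availability commitment in perspective: 99.9% uptime leaves a downtime budget of roughly 43 minutes per month. The short Python sketch below works that out, assuming a 30-day month for illustration (the exact budget shifts slightly with the length of the calendar month):

```python
# Downtime budget implied by a 99.9% monthly availability commitment.
# A 30-day month is an illustrative assumption; real months vary from 28 to 31 days.
SLA = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

downtime_budget = (1 - SLA) * MINUTES_PER_MONTH
print(f"Allowed downtime: {downtime_budget:.1f} minutes per month")  # ~43.2 minutes
```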

Part 1: Analysis in read-only mode

To deliver meaningful results at the first stage, our platform requires minimal cluster access. We generally follow the principle of least privilege. Our read-only agent doesn’t have access to anything that would allow it to change your cluster configuration or access sensitive data. 

Here’s what the CAST AI agent can see:

  • Main resources like nodes, pods, deployments, etc., that are required for running the Savings report,
  • Environment variables on pods, deployments, StatefulSets, and DaemonSets.

The CAST AI agent doesn’t have any access to resources like Secrets or ConfigMaps. Environment variables whose names suggest sensitive content (passwords, tokens, keys, secrets) are removed before the resources are sent for analysis.
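
Conceptually, the scrubbing step boils down to dropping any environment variable whose name matches a deny-list of sensitive keywords before the resource specs leave your cluster. Here is a minimal Python sketch of that idea; the keyword list and function name are illustrative assumptions, not the agent's actual implementation:

```python
# Illustrative name-based scrubbing of environment variables.
# The keyword list is an assumption; the agent's real deny-list may differ.
SENSITIVE_KEYWORDS = ("password", "passwd", "token", "key", "secret")

def scrub_env(env_vars: dict) -> dict:
    """Drop variables whose names look sensitive before specs are sent for analysis."""
    return {
        name: value
        for name, value in env_vars.items()
        if not any(keyword in name.lower() for keyword in SENSITIVE_KEYWORDS)
    }

# Example: only clearly non-sensitive variables survive.
print(scrub_env({"LOG_LEVEL": "info", "DB_PASSWORD": "hunter2", "API_TOKEN": "abc"}))
# -> {'LOG_LEVEL': 'info'}
```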

Read more about this here: How does the read-only CAST AI agent work and what data can it read?

Part 2: Automated optimization and security

Once the analysis is completed, users can either implement its recommendations manually or turn the automated optimization on.

CAST AI analyzes the setup and starts evicting pods, shrinking nodes, or adding Spot Instances (if you turn this option on) to achieve the required performance at the lowest possible cost.

Here’s how it works, using an example from the perspective of AWS (the same security principles apply to Google Cloud Platform and Microsoft Azure).

If you’d like to implement the cloud cost savings automatically, the agent will need your credentials (Secret Access Key and Access Key ID). They’re used to call the public AWS API so that CAST AI can create, orchestrate and optimize clusters for you. 
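
In practice, supplying those credentials looks much like opening any programmatic AWS session. The snippet below is only a sketch of that call pattern with boto3; the region and the read-only call are arbitrary examples, not the exact calls CAST AI makes:

```python
# Sketch of calling the public AWS API with the credentials provided to CAST AI.
# Region and API call are illustrative only.
import boto3

session = boto3.Session(
    aws_access_key_id="AKIA...",  # Access Key ID shared with CAST AI
    aws_secret_access_key="...",  # Secret Access Key shared with CAST AI
    region_name="us-east-1",
)

ec2 = session.client("ec2")
# A simple read-only call against the EC2 API:
reservations = ec2.describe_instances()["Reservations"]
print(f"{len(reservations)} reservations visible")
```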

To do that, the agent creates a user account with the following permissions: 

  • AmazonEC2ReadOnlyAccess [1],
  • IAMReadOnlyAccess [2],
  • Manage EC2 instances (create or terminate instances on demand) within the specified cluster, restricted to its VPC (subnets, security groups, NAT gateways),
  • Manage autoscaling groups in the specified cluster,
  • Manage EKS Node Groups [3] in the specified cluster.

All these permissions are scoped to a single cluster. This means that CAST AI doesn’t have access to the resources of any other clusters in your AWS account. It also creates a Lambda function [4] (one per cluster) to handle Spot Instance interruptions in case you decide to use them and get even more savings.
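
One common way to express this kind of single-cluster scoping is an IAM policy whose actions are limited to resources carrying a tag that identifies the cluster. The sketch below shows the general shape of such a policy attached with boto3; the tag key, user name, policy name, and action list are assumptions made for illustration, not CAST AI's actual policy:

```python
# Sketch of a cluster-scoped inline IAM policy for the CAST AI user.
# Tag key, user name, policy name, and actions are illustrative assumptions.
import json
import boto3

CLUSTER_ID = "my-cluster-1234"

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:StartInstances", "ec2:StopInstances", "ec2:TerminateInstances"],
            "Resource": "*",
            # Only allow these actions on instances tagged with this cluster's ID.
            "Condition": {"StringEquals": {"aws:ResourceTag/cluster-id": CLUSTER_ID}},
        }
    ],
}

iam = boto3.client("iam")
iam.put_user_policy(
    UserName="cast-ai-agent",
    PolicyName=f"cast-ai-scope-{CLUSTER_ID}",
    PolicyDocument=json.dumps(policy_document),
)
```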

References:

[1] arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess

  This policy provides read-only access to Amazon EC2 via the AWS Management Console.

[2] arn:aws:iam::aws:policy/IAMReadOnlyAccess

  This policy provides read-only access to IAM via the AWS Management Console.

[3] arn:aws:iam::aws:policy/AmazonEKSClusterPolicy

  This policy provides Kubernetes the permissions it requires to manage resources on your behalf. Kubernetes requires ec2:CreateTags permissions to place identifying information on EC2 resources, including but not limited to instances, security groups, and elastic network interfaces.

[4] arn:aws:iam::aws:policy/service-role/AWSLambdaRole

  The default policy for the AWS Lambda service role.

How CAST AI handles sensitive data

CAST AI doesn’t access any sensitive user data, no matter which resources you’re using in your Kubernetes cluster. All the platform knows is how much storage, memory, and CPU your cluster needs to run most efficiently.

Note: You can remove the CAST AI agent and all of its resources any time you want.

Here’s how CAST AI handles user data:

  • In CAST AI services, no data is deleted; instead, an entity is marked as deleted (see the sketch after this list),
  • Where possible, the platform uses foreign keys for internal data integrity,
  • If one service needs data owned by another service, it either makes a service call (synchronous or asynchronous) or keeps its own copy of the data (a view) built by listening to the events published by the golden data source,
  • Each service operates on its own isolated subset of platform data, which makes it easier to evolve its entity model and test changes.
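
The soft-delete pattern mentioned in the first point is simple: the row stays in place and a timestamp (or flag) marks it as deleted, so normal reads just filter it out. Here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are assumptions, not CAST AI's actual schema:

```python
# Minimal soft-delete sketch with the standard-library sqlite3 module.
# Table and column names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters (id TEXT PRIMARY KEY, name TEXT, deleted_at TEXT)")
conn.execute("INSERT INTO clusters VALUES ('c1', 'prod-cluster', NULL)")

# "Deleting" a cluster only stamps deleted_at; the row itself is never removed.
conn.execute("UPDATE clusters SET deleted_at = datetime('now') WHERE id = ?", ("c1",))

# Normal reads exclude soft-deleted entities.
active = conn.execute("SELECT id FROM clusters WHERE deleted_at IS NULL").fetchall()
print(active)  # []
```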

Data back-up and recovery

We carry out nightly database backups using native Google Cloud services; the backups are stored encrypted in regional GCP object storage. Only authorized team members have access to these production backup storage locations.

We store database backups for 30 days. Backup files are regularly tested to ensure their integrity, and the backup processes send alerts both on failure and on success. The logs of backup notifications are kept for up to 6 months.

Failed backups are monitored through Stackdriver, and the on-call engineer is notified about them using OpsGenie and Slack (and an email is sent to our security team).
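
A much-simplified version of that monitoring loop is easy to picture: check the age of the newest successful backup and page the on-call engineer if it exceeds the expected window. The sketch below only illustrates the idea; the 24-hour threshold and the notify callback are assumptions, and the real pipeline runs on Stackdriver, OpsGenie, Slack, and email rather than a script like this:

```python
# Illustrative backup-freshness check; not CAST AI's actual monitoring pipeline.
# The threshold and the notification hook are assumptions for the example.
from datetime import datetime, timedelta, timezone

MAX_BACKUP_AGE = timedelta(hours=24)  # nightly backups should never be older than this

def check_backup_freshness(last_success: datetime, notify) -> None:
    """Alert if the newest successful backup is older than the allowed window."""
    age = datetime.now(timezone.utc) - last_success
    if age > MAX_BACKUP_AGE:
        notify(f"Backup is {age} old, paging the on-call engineer")
    else:
        notify(f"Backup OK ({age} old)")

# Example usage with a stand-in notifier:
check_backup_freshness(
    last_success=datetime.now(timezone.utc) - timedelta(hours=6),
    notify=print,
)
```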

Disaster recovery and business continuity

We have a documented Disaster Recovery plan, which is reviewed and updated at least annually. We also run a GKE cluster spread across three availability zones in Google’s data center region in Ashburn (Northern Virginia).

Our Cloud SQL database has a failover instance that is constantly on hot standby. In the event of a catastrophic failure of the primary data center, failover to this instance is automated and managed by Google.

Have a question we didn’t cover? Ask it in the chat and someone on our team will answer!
