Implementing Cloud Governance as Code using Cloud Custodian
Alok
Posted on December 9, 2021
In today's rapidly scaling cloud infrastructure, it is hard to keep every resource compliant. Every organization has a set of policies for detecting violations and taking remediation actions on its cloud resources. This is generally done by writing multiple custom scripts and using third-party tools and integrations. Many development teams know how hard it is to write, manage, and keep track of custom scripts. This is where we can leverage Cloud Custodian DSL policies to manage our cloud resources with ease.
What is cloud governance?
Cloud governance is a framework which defines how developers can create policies to control costs, minimize security risks, improve efficiency and accelerate deployment.
What are other tools that provide governance as code?
AWS Config
AWS Config records and monitors the configuration data of AWS resources, and we can build rules to help us enforce compliance. It can be set up across multiple accounts and regions. It also provides predefined AWS managed rules that we can use, or we can write our own custom rules. We can also take remediation actions on matches. For a custom rule, we need to write our own Lambda function.
However, we can use Cloud Custodian to set up AWS Config rules and custom rules, with multi-account and multi-region support through c7n-org. It can also provision the AWS Lambda function automatically.
Azure Policy
Azure Policy enforces organizational standards across Azure resources. It provides an aggregated view to evaluate the overall state of the environment, with the ability to drill down to per-resource, per-policy granularity. For example, users can be restricted to creating only A and B series virtual machines. We can turn on built-in policies or create custom policies for all resources. It can also take automatic remediation actions on non-compliant resources.
Azure Policy is reliable and efficient for building a custom validation layer on deployments to prevent deviation from customer defined rules. Cloud Custodian and Azure Policy have significant overlap in scenarios they can accomplish with regard to compliance implementations. When reviewing your requirements, we recommend first identifying the requirements that can be implemented via Azure Policy. Custodian can then be used to implement the remaining requirements. Custodian is also frequently used to add a second layer of protection or mitigation actions to requirements covered by Azure Policy. This way we can ensure that policy is configured correctly.
So far, we have seen what cloud governance is and which other tools are available in the market. Let's now see what Cloud Custodian offers for cloud governance.
What is Cloud Custodian?
Cloud Custodian is a CNCF sandbox project for governing public cloud resources in real time. It helps us write governance as code the same way we write infrastructure as code. It detects non-compliant resources and takes action to remediate them. Custodian is a cloud-native tool and can be used with multiple cloud providers (AWS, Azure, GCP, etc.).
We can use Cloud Custodian for the following:
- Compliance and security as code - We can write policies as code in a simple YAML DSL.
- Cost savings - Removing unwanted resources and implementing on/off-hours policies can save costs.
- Operational efficiency - Adding governance as code reduces the friction of innovating securely in the cloud and increases developer efficiency.
How does it work?
When we run a Cloud Custodian command, it takes the resources, filters, and actions as input and translates them into API calls for the relevant cloud provider (e.g., the AWS Boto3 API). There is no need to worry about custom scripts or AWS CLI commands. We get clean, readable policies and numerous common filters and actions that are built into Cloud Custodian. If we need custom filters, we can always use JMESPath to write them.
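To make the idea of a key-based filter concrete, here is a minimal Python sketch of how a key such as State.Name could be resolved against a resource description. It uses a plain dotted-path lookup as a simplified stand-in for JMESPath and is not Custodian's actual implementation; the function name `lookup` and the sample instance are hypothetical.

```python
from typing import Any


def lookup(resource: dict, path: str) -> Any:
    """Walk a dotted path such as "State.Name" through nested dicts.

    Returns None when any segment is missing, similar to how a path
    expression yields no value for an absent key.
    """
    current: Any = resource
    for segment in path.split("."):
        if not isinstance(current, dict) or segment not in current:
            return None
        current = current[segment]
    return current


# Hypothetical, trimmed-down EC2 instance description.
instance = {"InstanceId": "i-0123456789abcdef0", "State": {"Name": "running"}}

print(lookup(instance, "State.Name"))  # running
print(lookup(instance, "Tags.Owner"))  # None
```

A filter then only has to compare the looked-up value with the expected one, for example `lookup(instance, "State.Name") == "running"`.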
There can be situations where we need to run a policy periodically or in response to events. For this, Cloud Custodian automatically provisions a Lambda function and a CloudWatch event rule. CloudWatch event rules can be scheduled (e.g., every 10 minutes) or triggered in response to API calls recorded by CloudTrail, EC2 instance state events, etc.
How to install and set up Cloud Custodian?
We can install Cloud Custodian with the Python pip command:
python3 -m venv custodian
source custodian/bin/activate
pip install c7n # This includes AWS support
pip install c7n_azure # Install Azure package
pip install c7n_gcp # Install GCP Package
Using the Cloud Custodian Docker image
docker run -it \
  -v $(pwd)/output:/opt/custodian/output \
  -v $(pwd)/policy.yml:/opt/custodian/policy.yml \
  --env-file <(env | grep "^AWS\|^AZURE\|^GOOGLE\|^KUBECONFIG") \
  cloudcustodian/c7n run -v --cache-period 0 -s /opt/custodian/output /opt/custodian/policy.yml
Note: The access key, secret key, default region, and KUBECONFIG are read from environment variables, and the user must have the IAM roles and permissions required by the policies defined in the policy YAML file. Another option is to mount the credentials file or directory inside the container.
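The variable selection done by the grep in the Docker command can also be expressed in Python, which may be handy when the env file is generated from a script. This is only a sketch: `select_env` and `to_env_file` are hypothetical helper names, and the prefixes are copied from the command above.

```python
# Prefixes taken from the grep pattern in the docker command.
PREFIXES = ("AWS", "AZURE", "GOOGLE", "KUBECONFIG")


def select_env(environ: dict) -> dict:
    """Keep only the variables whose names start with one of PREFIXES."""
    return {k: v for k, v in environ.items() if k.startswith(PREFIXES)}


def to_env_file(selected: dict) -> str:
    """Render KEY=value lines, one variable per line."""
    return "".join(f"{k}={v}\n" for k, v in sorted(selected.items()))


# Hypothetical environment; a real script would pass dict(os.environ).
sample = {
    "AWS_DEFAULT_REGION": "us-east-1",
    "HOME": "/root",
    "KUBECONFIG": "/kube/config",
}
print(to_env_file(select_env(sample)), end="")
# AWS_DEFAULT_REGION=us-east-1
# KUBECONFIG=/kube/config
```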
Cloud Custodian policy.yaml explained
A Cloud Custodian policy is a simple YAML file that includes resources, filters, and actions.
Resources: Custodian can target several cloud providers (AWS, GCP, Azure), and each provider has its own resource types (e.g., ec2, s3 bucket).
Filters: Filters are how Custodian targets a specific subset of resources, for example based on a date or a tag. We can write custom filters using JMESPath expressions.
Actions: An action is the actual decision made on resources that match the filters. It can be as simple as sending a report to the owner stating that the resource does not comply with the cloud governance rule, or it can delete the resource.
Both filters and actions can combine as many rules as needed.
policies:
  - name: first-policy
    resource: name-of-cloud-resource
    description: Description of policy
    filters:
      - (some filter that will select a subset of resource)
      - (more filters)
    actions:
      - (an action to trigger on filtered resource)
      - (more actions)
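The resource, filter, and action flow of a policy can be pictured with a small in-memory Python sketch. It assumes that all filters in the list must match (logical AND) and that every action runs on each matched resource. The `run_policy` function and the sample volumes are hypothetical and do not call any cloud API.

```python
from typing import Callable, Iterable, List

Resource = dict
Filter = Callable[[Resource], bool]
Action = Callable[[Resource], None]


def run_policy(
    resources: Iterable[Resource],
    filters: List[Filter],
    actions: List[Action],
) -> List[Resource]:
    """Keep resources that pass every filter, then run each action on them."""
    matched = [r for r in resources if all(f(r) for f in filters)]
    for resource in matched:
        for action in actions:
            action(resource)
    return matched


# Hypothetical resources; a real run would fetch these from the provider.
volumes = [
    {"id": "vol-1", "attached": False, "size": 100},
    {"id": "vol-2", "attached": True, "size": 50},
    {"id": "vol-3", "attached": False, "size": 8},
]
deleted: List[str] = []

matched = run_policy(
    volumes,
    filters=[lambda v: not v["attached"], lambda v: v["size"] >= 50],
    actions=[lambda v: deleted.append(v["id"])],  # stands in for a delete call
)
print(deleted)  # ['vol-1']
```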
Cloud Custodian sample policy
Although the official docs cover most AWS policy examples, we have picked a few policies that can be used from day one for cost savings and compliance.
ebs-snapshots-month-old.yml
One of the most common issues organizations face is cleaning up old AMIs, snapshots, and volumes that have been lying in the environment for more than a year and adding to the bill. Eventually we end up writing multiple custom scripts to deal with the situation.
The simple policy below deletes snapshots that are older than 30 days.
policies:
  - name: ebs-snapshots-month-old
    resource: ebs-snapshot
    filters:
      - type: age
        days: 30
        op: ge
    actions:
      - delete
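The age check in this policy boils down to a date comparison. The Python sketch below illustrates the idea of "age of at least 30 days" (days: 30 with op: ge) on hypothetical snapshot records; it is an approximation for illustration, not Custodian's own code, and it uses a fixed reference date so the result is reproducible.

```python
from datetime import datetime, timedelta, timezone


def is_at_least_days_old(start_time: datetime, days: int, now: datetime) -> bool:
    """True when the age (now - start_time) is >= `days`, i.e. the `ge` operator."""
    return now - start_time >= timedelta(days=days)


# Fixed "now" for a reproducible example; a real run would use the current time.
now = datetime(2021, 12, 9, tzinfo=timezone.utc)

# Hypothetical snapshots, keyed like the EC2 API (SnapshotId, StartTime).
snapshots = [
    {"SnapshotId": "snap-old", "StartTime": datetime(2021, 10, 1, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-new", "StartTime": datetime(2021, 12, 1, tzinfo=timezone.utc)},
]

stale = [
    s["SnapshotId"]
    for s in snapshots
    if is_at_least_days_old(s["StartTime"], 30, now)
]
print(stale)  # ['snap-old']
```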
Here is an example of how we can run the Cloud Custodian policy.
custodian run -v -s /tmp/output /tmp/ebs-snapshots-month-old.yml
Every time we run the Custodian command, it creates or appends to files in a directory named after the policy (policies.name) inside the output directory passed with the -s option (e.g., /tmp/output/ebs-snapshots-month-old/custodian-run.log).
- custodian-run.log : All console logs are stored here
- resources.json : List of filtered resources
- metadata.json : Metadata about the filtered resources
- action-* : List of resources on which the action was taken
- $HOME/.cache/cloud-custodian.cache : All cloud API call results are cached here. The default cache period is 15 minutes.
To get a report of the filtered resources, we can run the command below. By default it produces the report in CSV format, but we can change this by passing --format json.
custodian report -s /tmp/output/ --format csv ebs-snapshots-month-old.yml
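Since resources.json is plain JSON, the output directory can also be post-processed from a script. The sketch below assumes the layout described above (output directory, then policy name, then resources.json) and that the file holds a JSON list; `load_matched_resources` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_matched_resources(output_dir: str, policy_name: str) -> list:
    """Read <output_dir>/<policy_name>/resources.json, or [] if it is absent."""
    path = Path(output_dir) / policy_name / "resources.json"
    if not path.exists():
        return []
    with path.open() as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Paths match the run command used earlier in this section.
    resources = load_matched_resources("/tmp/output", "ebs-snapshots-month-old")
    print(f"{len(resources)} snapshot(s) matched the policy")
```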
only-approved-ami.yml
Stop running EC2 instances whose AMI is not in the trusted AMI list.
policies:
  - name: only-approved-ami
    resource: ec2
    comment: |
      Stop running EC2 instances that are using invalid AMIs
    filters:
      - "State.Name": running
      - type: value
        key: ImageId
        op: not-in
        value:
          - ami-04db49c0fb2215364 # Amazon Linux 2 AMI (HVM)
          - ami-06a0b4e3b7eb7a300 # Red Hat Enterprise Linux 8 (HVM)
          - ami-0b3acf3edf2397475 # SUSE Linux Enterprise Server 15 SP2 (HVM)
          - ami-0c1a7f89451184c8b # Ubuntu Server 20.04 LTS (HVM)
    actions:
      - stop
security-group-check.yml
Another common issue arises when developers allow SSH traffic from anywhere while creating a POC VM, or when we open port 22 to everyone during testing and forget to remove the rule. The policy below takes care of this by automatically removing SSH access from everywhere and adding only the VPN IP range to the security group.
policies:
  - name: sg-remove-permission
    resource: security-group
    filters:
      - or:
          - type: ingress
            IpProtocol: "-1"
            Ports: [22]
            Cidr: "0.0.0.0/0"
          - type: ingress
            IpProtocol: "-1"
            Ports: [22]
            CidrV6: "::/0"
    actions:
      - type: set-permissions
        remove-ingress: matched
        add-ingress:
          - IpPermissions:
              - IpProtocol: TCP
                FromPort: 22
                ToPort: 22
                IpRanges:
                  - Description: VPN1 Access
                    CidrIp: "10.10.0.0/16"
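To reason about which rules count as "SSH open to everyone", it can help to spell the check out. The Python sketch below works on ingress rules in the shape the EC2 API returns (IpProtocol, FromPort, ToPort, IpRanges, Ipv6Ranges). It is a simplified illustration and not Custodian's ingress filter; `exposes_ssh_to_world` and the sample rules are hypothetical.

```python
def exposes_ssh_to_world(permission: dict) -> bool:
    """True when an ingress rule covers port 22 and allows any IPv4/IPv6 source."""
    protocol = permission.get("IpProtocol")
    if protocol == "-1":
        covers_ssh = True  # "-1" means all protocols and all ports
    elif protocol == "tcp":
        covers_ssh = permission.get("FromPort", 0) <= 22 <= permission.get("ToPort", 65535)
    else:
        covers_ssh = False
    open_v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in permission.get("IpRanges", []))
    open_v6 = any(r.get("CidrIpv6") == "::/0" for r in permission.get("Ipv6Ranges", []))
    return covers_ssh and (open_v4 or open_v6)


rules = [
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "10.10.0.0/16"}]},
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]
print([exposes_ssh_to_world(r) for r in rules])  # [True, False, False]
```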
Support for Kubernetes resources
We can now manage Kubernetes resources such as Deployments, Pods, DaemonSets, and Volumes. Below are some kinds of policies that we can write with Cloud Custodian:
- Delete POC and untagged resources
- Update labels and patch Kubernetes resources
- Call webhooks based on findings
kubernetes-delete-poc-resource.yml
policies:
  - name: delete-poc-namespace
    resource: k8s.namespace
    filters:
      - type: value
        key: 'metadata.name'
        op: regex
        value: '^.*poc.*$'
    actions:
      - delete
  - name: delete-poc-deployments
    resource: k8s.deployment
    filters:
      - type: value
        key: 'metadata.name'
        op: regex
        value: '^.*poc.*$'
    actions:
      - delete
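Because these policies delete whatever the pattern matches, it is worth checking the regular expression against some names first. The Python sketch below applies the same pattern with the standard `re` module as an approximation of the regex operator; the resource names are hypothetical.

```python
import re

# Same pattern as in the policies above.
POC_PATTERN = re.compile(r"^.*poc.*$")

names = ["poc-payments", "team-poc-2", "production", "epoch-service"]
matches = [n for n in names if POC_PATTERN.match(n)]
print(matches)  # ['poc-payments', 'team-poc-2', 'epoch-service']
```

Note that the pattern matches any name containing "poc", including "epoch-service". If POC resources follow a naming convention, a tighter pattern such as '^poc-' (an assumed convention) would avoid such accidental matches.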
Note: Cloud Custodian's Kubernetes support is still a work in progress. We can check the status of the plugin here.
In which modes can we run Cloud Custodian?
- pull - The default mode, in which the policy is run manually. It is best scheduled through a CI/CD tool's cron.
- periodic - Provisions a cloud resource (e.g., AWS Lambda with a CloudWatch cron) as per the policy and executes on schedule.
- Custom modes per cloud provider - Execute when a matching event occurs.
Integrate Cloud Custodian with Jenkins CI
For simplicity, we use the Cloud Custodian Docker image and inject the credentials as environment variables.
Note: The secret file should define the keys in upper case, along with the default region. For Kubernetes, the KUBECONFIG file should be mounted inside the container.
export AWS_ACCESS_KEY_ID=<YOUR_AWS_ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<YOUR_AWS_SECRET_ACCESS_KEY>
export AWS_DEFAULT_REGION=<YOUR_DEFAULT_REGION>
pipeline {
    agent { label 'worker1' }
    stages {
        stage('cloudcustodian-non-prod') {
            steps {
                dir("non-prod") {
                    // The closure brace must be on the same line as withCredentials(...)
                    withCredentials([file(credentialsId: 'secretfile', variable: 'var_secretfile')]) {
                        sh '''
                            # "." is the POSIX form of "source"; the sh step may not run bash
                            . $var_secretfile > /dev/null 2>&1
                            env | grep "^AWS\\|^AZURE\\|^GOOGLE\\|^KUBECONFIG" > envfile
                            for files in $(ls | egrep '.yml|.yaml')
                            do
                                docker run --rm -t \
                                    -v $(pwd)/output:/opt/custodian/output \
                                    -v $(pwd):/opt/custodian/ \
                                    --env-file envfile \
                                    cloudcustodian/c7n run -v -s /opt/custodian/output /opt/custodian/$files
                            done
                        '''
                    }
                }
            }
        }
        stage("cloudcustodian-prod") {
            steps {
                dir("prod") {
                    withCredentials([file(credentialsId: 'secretfile', variable: 'var_secretfile')]) {
                        sh '''
                            . $var_secretfile > /dev/null 2>&1
                            env | grep "^AWS\\|^AZURE\\|^GOOGLE\\|^KUBECONFIG" > envfile
                            for files in $(ls | egrep '.yml|.yaml')
                            do
                                docker run --rm -t \
                                    -v $(pwd)/output:/opt/custodian/output \
                                    -v $(pwd):/opt/custodian/ \
                                    --env-file envfile \
                                    cloudcustodian/c7n run -v -s /opt/custodian/output /opt/custodian/$files
                            done
                        '''
                    }
                }
            }
        }
    }
}
Tools and Features
Cloud Custodian has a number of add-on tools that have been developed by the community.
Multi Region and Multi Account support
We can use the c7n-org plugin to configure multiple AWS, Azure, and GCP accounts and run policies against them in parallel. The --region all flag can be used to run the same policy across all regions.
Notification
The c7n-mailer plugin provides a lot of flexibility for alert notifications. We can use webhooks, email, queue services, Datadog, Slack, and Splunk for alerts.
Auto-resource-tagging
The c7n_trailcreator script processes CloudTrail records to create a SQLite database of resources and their creators, and then uses that database to tag the resources with their creator's name.
Logging and Reporting
It provides reporting in JSON and CSV format. We can also collect these metrics in cloud-native logging services and generate dashboards from them. We can store the logs locally, in S3, or in CloudWatch. A consistent logging format makes it easy to troubleshoot policies.
Custodian Dry run
In a dry run (--dryrun), the actions part of the policy is ignored. It shows which resources would be impacted by the policy. It is always best practice to do a dry run before running the policy for real.
Custodian Cache
When we execute a policy, Custodian fetches data from the cloud and stores it locally for 15 minutes. The cache is used to minimize API calls. We can change the cache period with the --cache-period option; setting it to 0, as in the Docker command earlier, bypasses the cache.
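The effect of a cache period can be illustrated with a small time-based cache in Python. This is a generic sketch of the idea and not Custodian's cache implementation; the `TTLCache` class and `fake_describe` function are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Keep fetched values for `period_minutes`; a period of 0 disables caching."""

    def __init__(self, period_minutes: float, clock: Callable[[], float] = time.time):
        self.period_seconds = period_minutes * 60
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        entry = self._store.get(key)
        if entry is not None and self.clock() - entry[0] < self.period_seconds:
            return entry[1]  # still fresh, so no API call is made
        value = fetch()
        self._store[key] = (self.clock(), value)
        return value


calls = []


def fake_describe():
    """Stands in for a cloud API call and records each invocation."""
    calls.append(1)
    return ["snap-1"]


cache = TTLCache(period_minutes=15)
cache.get_or_fetch("ebs-snapshot", fake_describe)
cache.get_or_fetch("ebs-snapshot", fake_describe)
print(len(calls))  # 1
```

With `TTLCache(period_minutes=0)` every lookup calls `fake_describe` again, which is the behaviour we want when results must always be fresh.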
Editor integration
It can be integrated with Visual Studio Code for auto-completion and suggestions.
Custodian schema
We can use the custodian schema command to find out which resource types, actions, and filters are available in Cloud Custodian.
custodian schema #Shows all resource available in custodian
custodian schema aws #Shows aws resource available in custodian
custodian schema aws.ec2 #Shows aws ec2 action and filters
custodian schema aws.ec2.actions #Shows aws ec2 actions only
custodian schema aws.ec2.actions.stop #Shows ec2 stop sample policy and schema
How is Cloud Custodian better than other tools?
- Simplicity and consistency in writing policies across multiple cloud platforms and Kubernetes.
- Multi-account and multi-region support using c7n-org.
- Support for a wide range of notification channels using c7n-mailer.
- Custodian's Terraform provider enables writing and evaluating Custodian policies against Terraform IaC modules.
- Custodian has deep integration with AWS Config. It can deploy any config rule that is supported by AWS Config, and it can automatically provision the AWS Lambda function for a custom AWS Config policy.
- We can implement custom policies in Python if needed, as it supports all rules available through the cloud providers' SDKs.
- Cloud Custodian is an open source CNCF Sandbox project.
Cloud Custodian Limitations
- No default dashboard (it supports the AWS native dashboard, and we can also send metrics output to Elasticsearch, Grafana, etc. and build dashboards there).
- Cloud Custodian cannot act as a custom validation layer before deployment. It can only run periodically or in response to events.
- Cloud Custodian does not have any built-in policies. We need to write all policies ourselves. However, it has a lot of good example policies (AWS, Azure, GCP) that we can use as a reference.
Conclusion
Cloud Custodian enables us to define rules and remediation as one policy to facilitate a well-managed cloud infrastructure. We can also use it to write policies for managing Kubernetes resources such as Deployments and Pods. Compared to other cloud-based governance tools, it provides a very simple DSL for writing policies and is consistent across cloud platforms. Custodian reduces the friction of innovating securely in the cloud and also increases efficiency.
We can use Cloud Custodian to optimize our cloud costs by implementing off-hours and cleanup policies. It also includes plugins for multi-account and multi-region support and for a wide range of notification tools (Slack, SMTP, SQS, Datadog, webhooks, etc.). We can find a list of Cloud Custodian plugins here.
That’s a wrap folks :) Hope the article was informative and you enjoyed reading it. I’d love to hear your thoughts and experience - let’s connect and start a conversation on LinkedIn.