Optimizing AWS Infrastructure Deployment: Terraform, Sentinel, and CI/CD Best Practices
Nikolai Main
Posted on October 6, 2024
This project follows on from a previous post where I built AWS infrastructure solely in the AWS console. In this post I cover the following topics:
- Centralized Terraform state management
- Terraform code validation with Sentinel
- CI/CD Pipeline deployment
- AWS Infrastructure
Less focus is placed on the actual application design, which may be covered in a later post.
Project Overview
In my initial project, I spent about an hour building the infrastructure and quickly realized how easy it is to make even minor mistakes that can lead to system failures. This often resulted in spending an additional 10 minutes here and there, sifting through each component to identify the error.
Recognizing this challenge in a relatively small project made me acutely aware of the potential headaches that could arise when managing larger systems.
To address this issue, I turned to Terraform. I dedicated a similar amount of time — approximately 1-2 hours — to define my infrastructure. However, the benefits were substantial: instead of spending 1-2 hours each time I needed to deploy, I can now get my entire infrastructure up and running in about 10 minutes, with a comparable teardown time.
This improvement saves at least 50 minutes on every deployment. Additionally, I can confidently assert that my application and infrastructure are secure, thanks to the comprehensive scans conducted prior to deployment:
- Infrastructure Validation: My infrastructure is validated and checked with Sentinel in my cloud workspace. If any misconfigurations—such as poor naming and tagging conventions, overly permissive IAM policies, or insecure VPC designs—are present in my Infrastructure as Code (IaC), the run will fail, and I will be notified of the necessary changes.
- Application Security Scans: For my application, I utilize GitLab's built-in suite of security tools to scan for code and dependency vulnerabilities, as well as exposed secrets. If GitLab isn’t an option, there are several other security scanning tools available, such as CodeQL, SonarQube, and Trivy. Once the application image is built, it undergoes an additional scan with Trivy to ensure its security.
Infrastructure Overview
Frontend Infrastructure (Repo 1)
- ECR (Elastic Container Registry)
- ECS (Elastic Container Service)
- Application Load Balancer
Backend Infrastructure (Repo 2)
- VPC (Virtual Private Cloud)
- RDS (Relational Database Service)
- API Gateway
- AWS Lambda
- Secrets Manager
Security Checks and Scans
Pipeline Scans
- Secret Detection
- SAST (Static Application Security Testing) Scanning
- Dependency Scanning
- SCA (Software Composition Analysis) Scanning
Sentinel Scans
- Appropriate IAM Permissions
- General Configuration Checks
- VPC Traffic Flows
Deployment Workflow Overview
Workflow 1 (Backend Configuration)
- Backend code is pushed to GitLab.
- Terraform run triggered in cloud workspace.
- Sentinel policies check the code for misconfigurations:
- VPC: Naming conventions and private subnet configuration
- Security Groups: Only allowing traffic over necessary ports
- Lambda: IAM permissions and general configuration
- RDS: Checks for encryption, public accessibility, and default credentials
- Secrets Manager: Checks for secret rotation and read replicas
- Upon validation, the infrastructure can be applied. Note the relevant outputs.
- The RDS endpoint and secret name are needed for the Lambda function to work in this project. (I later came back to this and retrieved those outputs dynamically from within the Lambda function.)
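The dynamic retrieval mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual Lambda code: the secret name, region, and JSON field names (`host`, `username`, `password`) are assumptions based on the usual Secrets Manager RDS secret shape.

```python
import json


def parse_db_secret(secret_string):
    """Parse a Secrets Manager RDS secret payload into connection settings."""
    secret = json.loads(secret_string)
    return {
        "host": secret["host"],
        "user": secret["username"],
        "password": secret["password"],
    }


def get_db_config(secret_name, region="eu-west-2"):
    """Fetch the secret at runtime instead of baking outputs into the deploy.

    Assumes the Lambda execution role allows secretsmanager:GetSecretValue.
    """
    import boto3  # bundled in the AWS Lambda Python runtime

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return parse_db_secret(response["SecretString"])
```

Resolving the endpoint and credentials at runtime like this removes the hard dependency on passing backend workspace outputs into the frontend deploy.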
Example Sentinel Policy - VPC Checks
import "tfplan/v2" as tfplan
import "tfrun" as run
import "strings"

// Define variables
messages = []
resource = "VPC"

// Define main function
checks = func() {
	// Skip checks on destroy runs
	if run.is_destroy == true {
		return true
	}

	// Retrieve resource info
	vpc = filter tfplan.resource_changes as _, rc {
		rc.mode is "managed" and
		rc.type is "aws_vpc"
	}
	subnet = filter tfplan.resource_changes as _, rc {
		rc.mode is "managed" and
		rc.type is "aws_subnet"
	}

	// Check that the resources exist
	if length(vpc) == 0 {
		append(messages, "No VPC found.")
	}
	if length(subnet) == 0 {
		append(messages, "No subnets found.")
	}

	// Iterate over subnets
	for subnet as address, s {
		// Check the number of available addresses
		if int(strings.split(s.change.after.cidr_block, "/")[1]) < 24 {
			append(messages, address + " CIDR prefix length must be at least /24.")
		}
		if strings.has_prefix(address, "aws_subnet.private") {
			// Check the subnet CIDR block
			if s.change.after.cidr_block == "0.0.0.0/0" {
				append(messages, "Subnet not private. Edit CIDR block.")
			}
			// Check whether the subnet assigns a public IP on launch
			if s.change.after.map_public_ip_on_launch == true {
				append(messages, "Subnet not private. Public IP enabled.")
			}
		}
	}

	// Run VPC checks
	for vpc as address, v {
		// Retrieve the VPC tags, defaulting to an empty map if unset
		vpc_tags = v.change.after.tags else {}
		// Check VPC name/tags
		if length(vpc_tags) == 0 {
			append(messages, address + " has no tags. VPC must follow proper naming conventions.")
		} else if (vpc_tags["Name"] else "") == "main-vpc" {
			append(messages, "VPC must follow proper naming conventions. Current name: " + vpc_tags["Name"])
		}
	}

	// Check whether any error messages have been produced.
	// If messages is empty, the policy returns true and passes.
	if length(messages) != 0 {
		print(resource + " misconfigurations:")
		counter = 1
		for messages as message {
			print(string(counter) + ". " + message)
			counter += 1
		}
		return false
	}
	return true
}

// Main rule
main = rule {
	checks()
}
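For completeness, a policy like this is attached to the workspace through a policy set. A minimal `sentinel.hcl` might look like the following; the file name `vpc-checks.sentinel` and the hard-mandatory enforcement level are assumptions for illustration, not necessarily the project's actual configuration:

```hcl
policy "vpc-checks" {
  source            = "./vpc-checks.sentinel"
  enforcement_level = "hard-mandatory"
}
```

With hard-mandatory enforcement, a failed check blocks the apply outright; advisory or soft-mandatory levels would allow overrides.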
Workflow 2 (Frontend Configuration)
- Application code is developed on a local machine and pushed to GitLab.
- The pipeline is triggered (more details below).
- Scan the application code.
- Build the image, scan it, and push it to ECR.
- Retrieve the relevant outputs from the backend infrastructure.
- Create a TF_vars file and push it back to GitLab.
- A second Terraform workspace is triggered by the tagged push to the repo.
- A similar plan > Sentinel scan > apply process takes place.
GitLab Pipeline
Stage 1: Test - SAST, Dependency, Secrets, etc.
image: docker:latest

services:
  - docker:dind

variables:
  DOCKER_HOST: tcp://docker:2375/
  DOCKER_DRIVER: overlay2
  REPO_NAME: gitlab-cicd

# Declaring the required GitLab scans.
include:
  - template: Jobs/Dependency-Scanning.gitlab-ci.yml
  - template: Jobs/SAST.gitlab-ci.yml
  - template: Jobs/Secret-Detection.gitlab-ci.yml

# All included templates run during the 'test' stage.
stages:
  - test
  - build-image
  - fetch-terraform-outputs
  - update-terraform
Stage 2: Build, Scan, Push
build:
  stage: build-image
  before_script:
    - apk add --no-cache aws-cli curl
  script:
    # Build the Docker image
    - echo "Building Docker image..."
    - docker build -t $REPO_NAME:latest .
    # Scan the Docker image with Trivy
    - echo "Running Trivy scan on Docker image"
    - curl -sSL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -
    - export PATH=$PATH:$(pwd)/bin
    - trivy image --exit-code 0 --severity HIGH,CRITICAL $REPO_NAME:latest || true
    - trivy image --format json --output trivy-results.json $REPO_NAME:latest
    # Retrieve ECR repo credentials
    - echo "Logging in to Amazon ECR..."
    - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
    # Push the Docker image to ECR
    - echo "Pushing Docker image to ECR..."
    - TIMESTAMP=$(date +%Y%m%d%H%M%S)
    - IMAGE_TAG="$REPO_NAME:$TIMESTAMP"
    - docker tag $REPO_NAME:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_TAG
    - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_TAG
    - echo "TF_VAR_image_uri=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_TAG" >> build.env
  artifacts:
    paths:
      - build.env
Stage 3: Fetch TF outputs
fetch-terraform-outputs:
  stage: fetch-terraform-outputs
  image: alpine:latest
  script:
    - apk add --no-cache curl jq
    - echo "Creating variables for specific outputs..."
    # Retrieve outputs via the Terraform Cloud API and save them as
    # environment variables to be passed to the next stage.
    - |
      curl -s -X GET \
        "https://app.terraform.io/api/v2/workspaces/${HCP_WORKSPACE_ID}/current-state-version-outputs" \
        -H "Authorization: Bearer ${HCP_TOKEN}" \
        -H 'Content-Type: application/vnd.api+json' |
      jq -r '.data[] | select(.attributes.name | test("public_subnet_ids|alb-sg-id|container-sg-id|vpc_id")) |
        if .attributes.name == "public_subnet_ids" then
          "PUBLIC_SUBNET_IDS=\(.attributes.value)"
        elif .attributes.name == "alb-sg-id" then
          "ALB_SG_ID=\(.attributes.value)"
        elif .attributes.name == "container-sg-id" then
          "CONTAINER_SG_ID=\(.attributes.value)"
        elif .attributes.name == "vpc_id" then
          "VPC_ID=\(.attributes.value)"
        else
          empty
        end' > terraform_outputs.env
  artifacts:
    reports:
      dotenv: terraform_outputs.env
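The jq transformation above is easier to verify locally than inside a pipeline run. The sketch below reproduces the same name-to-variable mapping in Python against a fabricated API response; the output values shown are examples only, not real workspace data.

```python
import json

# A fabricated Terraform Cloud state-version-outputs response (example values only).
sample_response = json.dumps({
    "data": [
        {"attributes": {"name": "vpc_id", "value": "vpc-0abc"}},
        {"attributes": {"name": "alb-sg-id", "value": "sg-0alb"}},
        {"attributes": {"name": "irrelevant_output", "value": "ignored"}},
    ]
})

# Map Terraform output names to dotenv variable names, as the jq filter does.
NAME_MAP = {
    "public_subnet_ids": "PUBLIC_SUBNET_IDS",
    "alb-sg-id": "ALB_SG_ID",
    "container-sg-id": "CONTAINER_SG_ID",
    "vpc_id": "VPC_ID",
}


def outputs_to_dotenv(response_body):
    """Convert the API response into KEY=value lines for a dotenv artifact."""
    lines = []
    for item in json.loads(response_body)["data"]:
        attrs = item["attributes"]
        if attrs["name"] in NAME_MAP:
            lines.append(NAME_MAP[attrs["name"]] + "=" + str(attrs["value"]))
    return "\n".join(lines)


print(outputs_to_dotenv(sample_response))
```

Outputs not named in the mapping are dropped, mirroring the `else empty` branch in the jq filter.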
Stage 4: Update main.tf
update-terraform:
  stage: update-terraform
  image: alpine:latest
  dependencies:
    - build
    - fetch-terraform-outputs
  before_script:
    - apk add --no-cache git
    - git config --global user.email "${USER_EMAIL}"
    - git config --global user.name "${USER_NAME}"
  script:
    - echo "Contents of current directory:"
    - ls -la
    - echo "Contents of build.env:"
    - cat build.env || echo "build.env not found"
    - echo "Contents of terraform_outputs.env:"
    - cat terraform_outputs.env || echo "terraform_outputs.env not found"
    - export $(cat build.env | xargs)
    - export $(cat terraform_outputs.env | xargs)
    - echo "Cloning repository..."
    - git clone https://<username>:${GITLAB_PAT}@gitlab.com/<project_id>/<repo.git> || exit 1
    - cd Test
    # Create the TF_vars file
    - echo "Creating/Updating TF_vars file..."
    - |
      cat << EOF > terraform.tfvars
      image_uri = "${TF_VAR_image_uri}"
      public_subnet_ids = ${PUBLIC_SUBNET_IDS}
      alb_sg_id = "${ALB_SG_ID}"
      container_sg_id = "${CONTAINER_SG_ID}"
      vpc_id = "${VPC_ID}"
      EOF
    # Commit and push TF_vars to the repo.
    # '[ci skip]' tells GitLab not to run the pipeline again on this push.
    - git add terraform.tfvars
    - git commit -m "Update image URI and Terraform outputs in TF_vars [ci skip]" || echo "No changes to commit"
    # Create a tag so Terraform Cloud is triggered only by pushes from this pipeline.
    - TAG_NAME="$(date +%Y.%m.%d-%H%M%S)"
    - echo "Creating a new tag $TAG_NAME"
    - git tag -a $TAG_NAME -m "Release version $TAG_NAME [ci skip]"
    - git push origin HEAD:main --tags || exit 1
Final Notes
In conclusion, I now have an end-to-end deployment solution that ensures my application is both secure and robust. This streamlined process has significantly reduced my mean time to deployment, allowing me to reallocate time and resources to other areas.
By identifying potential issues much earlier in the deployment process, I can mitigate risks that previously led to delays and unnecessary costs. This proactive approach not only enhances the overall efficiency of the development cycle but also improves the quality of each release.
Looking ahead, I plan to add further security testing, introduce dedicated testing and production environments, and integrate a monitoring tool such as Grafana.