Scaled Virus Scanner using AWS Fargate, ClamAV, S3, and SQS with Terraform
Joseph Sutton
Posted on May 27, 2022
Welcome back for more shenanigans!
Some time ago, a team I was on ran into the Lambda deployment (and runtime) size limits: our single Lambda function plus a ClamAV layer with pre-built binaries and virus definitions pushed us over the limit. If the files you need to scan are small, a Lambda layer with ClamAV's binaries and definitions will probably work great for you; it didn't in our case. We needed our solution to scale to files as large as 512MB.
TL;DR: GitHub repo.
Since we were already using SQS and EC2 for other things, why not use SQS along with S3 and Fargate? We ran a spike comparing an EC2 consumer against a Fargate one, and Fargate won out on long-term maintainability.
Note that I won't be implementing a cluster policy in this article (I might save that for another time); however, things should be set up in a way that translates relatively well if you add one later.
NOTE: This assumes you have your AWS credentials set up already via aws configure. If you plan on using a different profile, make sure you reflect that in the main.tf file below where profile = "default" is set.
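For example, to create and use a named profile (the profile name here is just a placeholder):

# Create a named profile locally...
aws configure --profile clamav-admin
# ...then point the provider at it in terraform/main.tf:
#   profile = "clamav-admin"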
Anyhow, let's get to it. First, let's set up our main configuration in terraform/main.tf:
terraform {
required_version = ">= 1.0.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 3.29"
}
}
backend "s3" {
encrypt = true
bucket = "tf-clamav-state"
dynamodb_table = "tf-dynamodb-lock"
region = "us-east-1"
key = "terraform.tfstate"
}
}
provider "aws" {
profile = "default"
region = "us-east-1"
}
This uses a remote state (didn't I write about that once?), so we need a script to set up the S3 bucket and DynamoDB table for our state and lock status, respectively. We can do that with a bash script, terraform/tf-setup.sh:
#!/bin/bash
# Create S3 Bucket
MY_ARN=$(aws iam get-user --query User.Arn --output text 2>/dev/null)
aws s3 mb "s3://tf-clamav-state" --region "us-east-1"
sed -e "s/RESOURCE/arn:aws:s3:::tf-clamav-state/g" -e "s/KEY/terraform.tfstate/g" -e "s|ARN|${MY_ARN}|g" "$(dirname "$0")/templates/s3_policy.json" > new-policy.json
aws s3api put-bucket-policy --bucket "tf-clamav-state" --policy file://new-policy.json
aws s3api put-bucket-versioning --bucket "tf-clamav-state" --versioning-configuration Status=Enabled
rm new-policy.json
# Create DynamoDB Table
aws dynamodb create-table \
--table-name "tf-dynamodb-lock" \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1 \
--region "us-east-1"
This does require an S3 policy template, s3_policy, in terraform/templates/s3_policy.json:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "RESOURCE",
"Principal": {
"AWS": "ARN"
}
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "RESOURCE/KEY",
"Principal": {
"AWS": "ARN"
}
}
]
}
Now we can run the tf-setup.sh script (don't forget to chmod +x it) via cd terraform && ./tf-setup.sh.
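Spelled out:

cd terraform
chmod +x tf-setup.sh
./tf-setup.sh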
Now that we have our remote state established, let's scaffold out our infrastructure in Terraform. First up, we need our buckets (one for quarantined files and one for clean files), our SQS queue, and an event notification on the quarantine bucket for when an object is created. We can set this up in terraform/logistics.tf:
provider "aws" {
region = "us-east-1"
alias = "east"
}
data "aws_caller_identity" "current" {}
resource "aws_s3_bucket" "quarantine_bucket" {
provider = aws.east
bucket = "clamav-quarantine-bucket"
acl = "private"
cors_rule {
allowed_headers = ["Authorization"]
allowed_methods = ["GET", "POST"]
allowed_origins = ["*"]
max_age_seconds = 3000
}
lifecycle_rule {
enabled = true
# Anything in the bucket remaining is a virus, so
# we'll just delete it after a week.
expiration {
days = 7
}
}
}
resource "aws_s3_bucket" "clean_bucket" {
provider = aws.east
bucket = "clamav-clean-bucket"
acl = "private"
cors_rule {
allowed_headers = ["Authorization"]
allowed_methods = ["GET", "POST"]
allowed_origins = ["*"]
max_age_seconds = 3000
}
}
data "template_file" "event_queue_policy" {
template = file("templates/event_queue_policy.tpl.json")
vars = {
bucketArn = aws_s3_bucket.quarantine_bucket.arn
}
}
resource "aws_sqs_queue" "clamav_event_queue" {
name = "s3_clamav_event_queue"
policy = data.template_file.event_queue_policy.rendered
}
resource "aws_s3_bucket_notification" "bucket_notification" {
bucket = aws_s3_bucket.quarantine_bucket.id
queue {
queue_arn = aws_sqs_queue.clamav_event_queue.arn
events = ["s3:ObjectCreated:*"]
}
depends_on = [
aws_sqs_queue.clamav_event_queue
]
}
resource "aws_cloudwatch_log_group" "clamav_fargate_log_group" {
name = "/aws/ecs/clamav_fargate"
}
If you read the clamav_event_queue block above, there's an event queue policy referenced - let's not forget to create it in terraform/templates/event_queue_policy.tpl.json:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": "*",
"Action": [
"sqs:SendMessage",
"sqs:ReceiveMessage"
],
"Resource": "arn:aws:sqs:*:*:s3_clamav_event_queue",
"Condition": {
"ArnEquals": {
"aws:SourceArn": "${bucketArn}"
}
}
}
]
}
Since this is a "security" thing, we need to make sure it's isolated within its own VPC. I'm no networking guru, so most of this configuration came from this Stack Overflow answer. We'll do that in terraform/vpc.tf:
# Networking for Fargate
# Note: 10.0.0.0 and 10.0.2.0 are private IPs
# Required via https://stackoverflow.com/a/66802973/1002357
# """
# > Launch tasks in a private subnet that has a VPC routing table configured to route outbound
# > traffic via a NAT gateway in a public subnet. This way the NAT gateway can open a connection
# > to ECR on behalf of the task.
# """
# If this networking configuration isn't here, this error happens in the ECS Task's "Stopped reason":
# ResourceInitializationError: unable to pull secrets or registry auth: pull command failed: : signal: killed
resource "aws_vpc" "clamav_vpc" {
cidr_block = "10.0.0.0/16"
}
resource "aws_subnet" "private" {
vpc_id = aws_vpc.clamav_vpc.id
cidr_block = "10.0.2.0/24"
}
resource "aws_subnet" "public" {
vpc_id = aws_vpc.clamav_vpc.id
cidr_block = "10.0.1.0/24"
}
resource "aws_route_table" "private" {
vpc_id = aws_vpc.clamav_vpc.id
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.clamav_vpc.id
}
resource "aws_route_table_association" "public_subnet" {
subnet_id = aws_subnet.public.id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "private_subnet" {
subnet_id = aws_subnet.private.id
route_table_id = aws_route_table.private.id
}
resource "aws_eip" "nat" {
vpc = true
}
resource "aws_internet_gateway" "igw" {
vpc_id = aws_vpc.clamav_vpc.id
}
resource "aws_nat_gateway" "ngw" {
subnet_id = aws_subnet.public.id
allocation_id = aws_eip.nat.id
depends_on = [aws_internet_gateway.igw]
}
resource "aws_route" "public_igw" {
route_table_id = aws_route_table.public.id
destination_cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.igw.id
}
resource "aws_route" "private_ngw" {
route_table_id = aws_route_table.private.id
destination_cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.ngw.id
}
resource "aws_security_group" "egress-all" {
name = "egress_all"
description = "Allow all outbound traffic"
vpc_id = aws_vpc.clamav_vpc.id
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
Now that we have the networking configuration done, we can go ahead and implement the ECS / Fargate configuration:
resource "aws_iam_role" "ecs_task_execution_role" {
name = "clamav_fargate_execution_role"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "ecs-tasks.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_role" "ecs_task_role" {
name = "clamav_fargate_task_role"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "ecs-tasks.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_role_policy_attachment" "ecs_task_execution_policy_attachment" {
role = aws_iam_role.ecs_task_execution_role.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
resource "aws_iam_role_policy_attachment" "s3_task" {
role = aws_iam_role.ecs_task_role.name
policy_arn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
}
resource "aws_iam_role_policy_attachment" "sqs_task" {
role = aws_iam_role.ecs_task_role.name
policy_arn = "arn:aws:iam::aws:policy/AmazonSQSFullAccess"
}
resource "aws_ecs_cluster" "cluster" {
name = "clamav_fargate_cluster"
capacity_providers = ["FARGATE"]
}
data "template_file" "task_consumer_east" {
template = file("./templates/clamav_container_definition.json")
vars = {
aws_account_id = data.aws_caller_identity.current.account_id
}
}
resource "aws_ecs_task_definition" "definition" {
family = "clamav_fargate_task_definition"
task_role_arn = aws_iam_role.ecs_task_role.arn
execution_role_arn = aws_iam_role.ecs_task_execution_role.arn
network_mode = "awsvpc"
cpu = "512"
memory = "2048"
requires_compatibilities = ["FARGATE"]
container_definitions = data.template_file.task_consumer_east.rendered
depends_on = [
aws_iam_role.ecs_task_role,
aws_iam_role.ecs_task_execution_role
]
}
resource "aws_ecs_service" "clamav_service" {
name = "clamav_service"
cluster = aws_ecs_cluster.cluster.id
task_definition = aws_ecs_task_definition.definition.arn
desired_count = 1
launch_type = "FARGATE"
network_configuration {
assign_public_ip = false
subnets = [
aws_subnet.private.id
]
security_groups = [
aws_security_group.egress-all.id
]
}
}
The container_definitions rendered from the template_file holds the log configuration and the environment variables. That configuration is found in terraform/templates/clamav_container_definition.json:
[
{
"image": "${aws_account_id}.dkr.ecr.us-east-1.amazonaws.com/fargate-images:latest",
"name": "clamav",
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-region": "us-east-1",
"awslogs-group": "/aws/ecs/clamav_fargate",
"awslogs-stream-prefix": "project"
}
},
"environment": [
{
"name": "VIRUS_SCAN_QUEUE_URL",
"value": "https://sqs.us-east-1.amazonaws.com/${aws_account_id}/s3_clamav_event_queue"
},
{
"name": "QUARANTINE_BUCKET",
"value": "clamav-quarantine-bucket"
},
{
"name": "CLEAN_BUCKET",
"value": "clamav-clean-bucket"
}
]
}
]
Since we're using Fargate, we'll need a Dockerfile and an ECR repository (as depicted in the clamav_container_definition.json file above). Let's get the ECR repository configured in terraform/ecr.tf:
resource "aws_ecr_repository" "image_repository" {
name = "fargate-images"
}
data "template_file" "repo_policy_file" {
template = file("./templates/ecr_policy.tpl.json")
vars = {
numberOfImages = 5
}
}
# keep the last 5 images
resource "aws_ecr_lifecycle_policy" "repo_policy" {
repository = aws_ecr_repository.image_repository.name
policy = data.template_file.repo_policy_file.rendered
}
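I'm not showing the repo's templates/ecr_policy.tpl.json here, but a standard ECR lifecycle policy that consumes the numberOfImages variable would look roughly like this (a sketch; check the GitHub repo for the real file):

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep only the most recent images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": ${numberOfImages}
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}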
You can play around with the number of images; it's completely up to you. It's just there to keep a few versions of the Docker image that contains the consumer. Now, for the actual Dockerfile:
FROM ubuntu
WORKDIR /home/clamav
RUN echo "Prepping ClamAV"
RUN apt update -y
RUN apt install curl sudo procps -y
RUN curl -sL https://deb.nodesource.com/setup_14.x | sudo -E bash -
RUN apt install -y nodejs
RUN npm init -y
RUN npm i aws-sdk tmp sqs-consumer --save
RUN DEBIAN_FRONTEND=noninteractive sh -c 'apt install -y awscli'
RUN apt install -y clamav clamav-daemon
RUN mkdir /var/run/clamav && \
chown clamav:clamav /var/run/clamav && \
chmod 750 /var/run/clamav
RUN freshclam
COPY ./src/clamd.conf /etc/clamav/clamd.conf
COPY ./src/consumer.js ./consumer.js
RUN npm install
ADD ./src/run.sh ./run.sh
CMD ["bash", "./run.sh"]
This basically installs Node.js, initializes a Node project, and installs the bare essentials we need for the consumer: aws-sdk, sqs-consumer, and tmp (for writing each file to disk so it can be scanned). The first file to create is src/clamd.conf, the ClamAV configuration for the daemon that will be listening:
LocalSocket /tmp/clamd.socket
LocalSocketMode 660
Now for the SQS consumer in src/consumer.js:
const { SQS, S3 } = require('aws-sdk');
const { Consumer } = require('sqs-consumer');
const tmp = require('tmp');
const fs = require('fs');
const util = require('util');
const { exec } = require('child_process');
const execPromise = util.promisify(exec);
const s3 = new S3();
// One shared SQS client: the consumer polls with it, and we also use it below to delete messages.
const sqs = new SQS();
const app = Consumer.create({
queueUrl: process.env.VIRUS_SCAN_QUEUE_URL,
handleMessage: async (message) => {
console.log('message', message);
const parsedBody = JSON.parse(message.Body);
const documentKey = parsedBody.Records[0].s3.object.key;
const { Body: fileData } = await s3.getObject({
Bucket: process.env.QUARANTINE_BUCKET,
Key: documentKey
}).promise();
const inputFile = tmp.fileSync({
mode: 0o644,
tmpdir: process.env.TMP_PATH,
});
fs.writeSync(inputFile.fd, Buffer.from(fileData));
fs.closeSync(inputFile.fd);
try {
await execPromise(`clamdscan ${inputFile.name}`);
await s3.putObject({
Body: fileData,
Bucket: process.env.CLEAN_BUCKET,
Key: documentKey,
Tagging: 'virus-scan=clean',
}).promise();
await s3.deleteObject({
Bucket: process.env.QUARANTINE_BUCKET,
Key: documentKey,
}).promise();
} catch (e) {
if (e.code === 1) {
await s3.putObjectTagging({
Bucket: process.env.QUARANTINE_BUCKET,
Key: documentKey,
Tagging: {
TagSet: [
{
Key: 'virus-scan',
Value: 'dirty',
},
],
},
}).promise();
}
} finally {
await sqs.deleteMessage({
QueueUrl: process.env.VIRUS_SCAN_QUEUE_URL,
ReceiptHandle: message.ReceiptHandle
}).promise();
}
},
sqs
});
app.on('error', (err) => {
console.error('err', err.message);
});
app.on('processing_error', (err) => {
console.error('processing error', err.message);
});
app.on('timeout_error', (err) => {
console.error('timeout error', err.message);
});
app.start();
The consumer long-polls the queue; for each message it receives, it does the following:
1) Pulls the file from the quarantine bucket using the metadata in the SQS message (while the message is in flight)
2) Writes it to /tmp
3) Scans it with clamdscan (via the ClamAV daemon, clamd, which already has the virus definitions loaded)
4) If it's clean, puts the file in the clean bucket with a virus-scan=clean tag, removes it from the quarantine bucket, and deletes the SQS message
5) If it's dirty, tags the file with virus-scan=dirty, keeps it in the quarantine bucket, and deletes the SQS message
For now, this consumer handles only one message at a time; it can easily be configured to handle more (see the sketch below), since the ClamAV daemon is far more efficient than a one-shot clamscan run - the virus definitions stay loaded in memory.
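A sketch of what that could look like with sqs-consumer's batchSize option (the handler body is unchanged from above):

const { SQS } = require('aws-sdk');
const { Consumer } = require('sqs-consumer');

const app = Consumer.create({
  queueUrl: process.env.VIRUS_SCAN_QUEUE_URL,
  // Pull up to 10 messages per poll (the SQS maximum); handleMessage still runs once per message.
  batchSize: 10,
  handleMessage: async (message) => {
    // ...same download/scan/tag logic as above...
  },
  sqs: new SQS()
});

app.start();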
Now, the last file mentioned in the Dockerfile is the bash script that runs the updater, the daemon, and the consumer, src/run.sh:
echo "Starting Services"
service clamav-freshclam start
service clamav-daemon start
echo "Services started. Running worker."
node consumer.js
Cool, let's stand it all up: run terraform init to initialize the S3 backend, then terraform plan and terraform apply. Once the apply finishes (you'll have to confirm by typing yes), you should be good to go.
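One thing terraform apply won't do for you is put an image into the new ECR repository, so the Fargate task has nothing to pull until we build and push one. A rough sketch, assuming us-east-1, the fargate-images repository name from ecr.tf, and that you're in the directory containing the Dockerfile:

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REPO="${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/fargate-images"

# Authenticate Docker against ECR, then build, tag, and push the consumer image.
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin "${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com"
docker build -t fargate-images .
docker tag fargate-images:latest "${REPO}:latest"
docker push "${REPO}:latest"

The ECS service will keep retrying the task until the image is available, so pushing after the apply works fine.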
Now, to test it with a script, test-virus.sh:
#!/bin/bash
aws s3 cp fixtures/test-virus.txt s3://clamav-quarantine-bucket
aws s3 cp fixtures/test-file.txt s3://clamav-quarantine-bucket
sleep 30
VIRUS_TEST=$(aws s3api get-object-tagging --key test-virus.txt --bucket clamav-quarantine-bucket --output text)
CLEAN_TEST=$(aws s3api get-object-tagging --key test-file.txt --bucket clamav-clean-bucket --output text)
echo "Dirty tag: ${VIRUS_TEST}"
echo "Clean tag: ${CLEAN_TEST}"
Running the test script, here's the output we get:
Dirty tag: TAGSET virus-scan dirty
Clean tag: TAGSET virus-scan clean
There we go. Hopefully y'all learned something! I had a lot of fun with this, and although I felt like I rushed it in a few areas, I look forward to your comments to see what all I missed (or to answer questions).
Y'all take care now.