Scaled Virus Scanner using AWS Fargate, ClamAV, S3, and SQS with Terraform

Joseph Sutton

Posted on May 27, 2022

Welcome back for more shenanigans!

Some time ago, a team I was on ran into the Lambda deployment (and runtime) size limits: our lone Lambda function plus a ClamAV layer with pre-built binaries and virus definitions simply didn't fit. If you only need to scan smaller files, a Lambda layer with the ClamAV binaries and definitions would probably work great for you; it didn't for us, though. We needed to scale our solution to handle files up to 512MB.

TL;DR: GitHub repo.

[AWS flow diagram]

Since we were already using SQS and EC2 for other things, why not use them along with S3 and Fargate? We ran a spike to compare an EC2 consumer against a Fargate one, and Fargate won on long-term maintainability.

Note that I won't be implementing a cluster policy in this article (yet); I might save that for another time. However, things should be set up so they translate relatively well in that regard.

NOTE: This assumes you have your AWS credentials set up already via aws configure. If you plan on using a different profile, make sure you reflect that in the main.tf file below where profile = "default" is set.
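For example, if you keep your credentials under a named profile (my-other-profile here is a hypothetical name), the provider block would change to:

provider "aws" {
  profile = "my-other-profile" # hypothetical named profile from ~/.aws/credentials
  region  = "us-east-1"
}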

Anyhow, let's get to it. First, let's set up our main configuration in terraform/main.tf:



terraform {
  required_version = ">= 1.0.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.29"
    }
  }

  backend "s3" {
    encrypt        = true
    bucket         = "tf-clamav-state"
    dynamodb_table = "tf-dynamodb-lock"
    region         = "us-east-1"
    key            = "terraform.tfstate"
  }
}

# TODO: Make note about aws credentials and different profiles
provider "aws" {
  profile = "default"
  region  = "us-east-1"
}



This has a remote state (didn't I write about that once?), so we need a script to set up the S3 bucket and DynamoDB table for our state and lock, respectively. We can do that in a bash script, terraform/tf-setup.sh:



#!/bin/bash

# Create S3 Bucket
MY_ARN=$(aws iam get-user --query User.Arn --output text 2>/dev/null)
aws s3 mb "s3://tf-clamav-state" --region "us-east-1"
sed -e "s/RESOURCE/arn:aws:s3:::tf-clamav-state/g" -e "s/KEY/terraform.tfstate/g" -e "s|ARN|${MY_ARN}|g" "$(dirname "$0")/templates/s3_policy.json" > new-policy.json
aws s3api put-bucket-policy --bucket "tf-clamav-state" --policy file://new-policy.json
aws s3api put-bucket-versioning --bucket "tf-clamav-state" --versioning-configuration Status=Enabled
rm new-policy.json

# Create DynamoDB Table
aws dynamodb create-table \
  --table-name "tf-dynamodb-lock" \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1 \
  --region "us-east-1"



This requires an S3 bucket policy template in terraform/templates/s3_policy.json:



{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "RESOURCE",
      "Principal": {
        "AWS": "ARN"
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "RESOURCE/KEY",
      "Principal": {
        "AWS": "ARN"
      }
    }
  ]
}



Now we can run the tf-setup.sh script (don't forget to chmod +x it) via cd terraform && ./tf-setup.sh.

Now that we have our remote state established, let's scaffold out our infrastructure in Terraform. First up, we need our buckets (one for quarantined files and one for clean files), our SQS queue, and an event notification configured on the quarantine bucket for when an object is created. We can set this up via terraform/logistics.tf:



provider "aws" {
  region = "us-east-1"
  alias  = "east"
}

data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "quarantine_bucket" {
  provider = aws.east
  bucket   = "clamav-quarantine-bucket"
  acl      = "private"

  cors_rule {
    allowed_headers = ["Authorization"]
    allowed_methods = ["GET", "POST"]
    allowed_origins = ["*"]
    max_age_seconds = 3000
  }

  lifecycle_rule {
    enabled = true

    # Anything in the bucket remaining is a virus, so
    # we'll just delete it after a week.
    expiration {
      days = 7
    }
  }
}


resource "aws_s3_bucket" "clean_bucket" {
  provider = aws.east
  bucket   = "clamav-clean-bucket"
  acl      = "private"

  cors_rule {
    allowed_headers = ["Authorization"]
    allowed_methods = ["GET", "POST"]
    allowed_origins = ["*"]
    max_age_seconds = 3000
  }
}


data "template_file" "event_queue_policy" {
  template = file("templates/event_queue_policy.tpl.json")

  vars = {
    bucketArn = aws_s3_bucket.quarantine_bucket.arn
  }
}

resource "aws_sqs_queue" "clamav_event_queue" {
  name = "s3_clamav_event_queue"

  policy = data.template_file.event_queue_policy.rendered
}

resource "aws_s3_bucket_notification" "bucket_notification" {
  bucket = aws_s3_bucket.quarantine_bucket.id

  queue {
    queue_arn = aws_sqs_queue.clamav_event_queue.arn
    events    = ["s3:ObjectCreated:*"]
  }

  depends_on = [
    aws_sqs_queue.clamav_event_queue
  ]
}

resource "aws_cloudwatch_log_group" "clamav_fargate_log_group" {
  name = "/aws/ecs/clamav_fargate"
}



If you read the clamav_event_queue block above, it renders an event queue policy template; let's not forget to create that in terraform/templates/event_queue_policy.tpl.json:



{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "sqs:SendMessage",
        "sqs:ReceiveMessage"
      ],
      "Resource": "arn:aws:sqs:*:*:s3_clamav_event_queue",
      "Condition": {
        "ArnEquals": {
          "aws:SourceArn": "${bucketArn}"
        }
      }
    }
  ]
}



Since this is a "security" thing, we need to make sure it's isolated within its own VPC. I'm no networking guru, so most of the information here came from this Stack Overflow answer. We'll do that in terraform/vpc.tf:



# Networking for Fargate
# Note: 10.0.0.0 and 10.0.2.0 are private IPs
# Required via https://stackoverflow.com/a/66802973/1002357
# """
# > Launch tasks in a private subnet that has a VPC routing table configured to route outbound 
# > traffic via a NAT gateway in a public subnet. This way the NAT gateway can open a connection 
# > to ECR on behalf of the task.
# """
# If this networking configuration isn't here, this error happens in the ECS Task's "Stopped reason":
# ResourceInitializationError: unable to pull secrets or registry auth: pull command failed: : signal: killed
resource "aws_vpc" "clamav_vpc" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.clamav_vpc.id
  cidr_block = "10.0.2.0/24"
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.clamav_vpc.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.clamav_vpc.id
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.clamav_vpc.id
}

resource "aws_route_table_association" "public_subnet" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private_subnet" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}

resource "aws_eip" "nat" {
  vpc = true
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.clamav_vpc.id
}

resource "aws_nat_gateway" "ngw" {
  subnet_id     = aws_subnet.public.id
  allocation_id = aws_eip.nat.id

  depends_on = [aws_internet_gateway.igw]
}

resource "aws_route" "public_igw" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.igw.id
}

resource "aws_route" "private_ngw" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.ngw.id
}


resource "aws_security_group" "egress-all" {
  name        = "egress_all"
  description = "Allow all outbound traffic"
  vpc_id      = aws_vpc.clamav_vpc.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}



Now that we have the networking configuration done, we can go ahead and implement the ECS / Fargate configuration (say, in terraform/ecs.tf):



resource "aws_iam_role" "ecs_task_execution_role" {
  name = "clamav_fargate_execution_role"

  assume_role_policy = <<EOF
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Action": "sts:AssumeRole",
     "Principal": {
       "Service": "ecs-tasks.amazonaws.com"
     },
     "Effect": "Allow",
     "Sid": ""
   }
 ]
}
EOF
}

resource "aws_iam_role" "ecs_task_role" {
  name = "clamav_fargate_task_role"

  assume_role_policy = <<EOF
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Action": "sts:AssumeRole",
     "Principal": {
       "Service": "ecs-tasks.amazonaws.com"
     },
     "Effect": "Allow",
     "Sid": ""
   }
 ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "ecs_task_execution_policy_attachment" {
  role       = aws_iam_role.ecs_task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_iam_role_policy_attachment" "s3_task" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
}

resource "aws_iam_role_policy_attachment" "sqs_task" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSQSFullAccess"
}

resource "aws_ecs_cluster" "cluster" {
  name = "clamav_fargate_cluster"

  capacity_providers = ["FARGATE"]
}

data "template_file" "task_consumer_east" {
  template = file("./templates/clamav_container_definition.json")

  vars = {
    aws_account_id = data.aws_caller_identity.current.account_id
  }
}

resource "aws_ecs_task_definition" "definition" {
  family                   = "clamav_fargate_task_definition"
  task_role_arn            = aws_iam_role.ecs_task_role.arn
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "2048"
  requires_compatibilities = ["FARGATE"]

  container_definitions = data.template_file.task_consumer_east.rendered

  depends_on = [
    aws_iam_role.ecs_task_role,
    aws_iam_role.ecs_task_execution_role
  ]
}

resource "aws_ecs_service" "clamav_service" {
  name            = "clamav_service"
  cluster         = aws_ecs_cluster.cluster.id
  task_definition = aws_ecs_task_definition.definition.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    assign_public_ip = false

    subnets = [
      aws_subnet.private.id
    ]

    security_groups = [
      aws_security_group.egress-all.id
    ]
  }
}



The container_definitions rendered from the template_file holds the log configuration and environment variables. That template lives in terraform/templates/clamav_container_definition.json:



[
  {
    "image": "${aws_account_id}.dkr.ecr.us-east-1.amazonaws.com/fargate-images:latest",
    "name": "clamav",
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-region": "us-east-1",
        "awslogs-group": "/aws/ecs/clamav_fargate",
        "awslogs-stream-prefix": "project"
      }
    },
    "environment": [
      {
        "name": "VIRUS_SCAN_QUEUE_URL",
        "value": "https://sqs.us-east-1.amazonaws.com/${aws_account_id}/s3_clamav_event_queue"
      },
      {
        "name": "QUARANTINE_BUCKET",
        "value": "clamav-quarantine-bucket"
      },
      {
        "name": "CLEAN_BUCKET",
        "value": "clamav-clean-bucket"
      }
    ]
  }
]



Since we're using Fargate, we'll need a Dockerfile and an ECR repository (the one referenced in the clamav_container_definition.json file above). Let's get the ECR repository configured in terraform/ecr.tf:



resource "aws_ecr_repository" "image_repository" {
  name = "fargate-images"
}

data "template_file" "repo_policy_file" {
  template = file("./templates/ecr_policy.tpl.json")

  vars = {
    numberOfImages = 5
  }
}

# keep the last 5 images
resource "aws_ecr_lifecycle_policy" "repo_policy" {
  repository = aws_ecr_repository.image_repository.name
  policy     = data.template_file.repo_policy_file.rendered
}



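One thing the snippet above references but doesn't show is the lifecycle policy template itself. A minimal terraform/templates/ecr_policy.tpl.json that expires everything beyond the last ${numberOfImages} images would look something like this (my sketch, using the standard ECR lifecycle rule format):

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep only the last ${numberOfImages} images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": ${numberOfImages}
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}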
You can play around with the number of images; it's completely up to you. It's just there to version the Docker image that contains the consumer. Now, for the actual Dockerfile:



FROM ubuntu

WORKDIR /home/clamav

RUN echo "Prepping ClamAV"

# Base tooling
RUN apt update -y
RUN apt install curl sudo procps -y

# Node.js for the SQS consumer
RUN curl -sL https://deb.nodesource.com/setup_14.x | sudo -E bash -
RUN apt install -y nodejs
RUN npm init -y

RUN npm i aws-sdk tmp sqs-consumer --save
RUN DEBIAN_FRONTEND=noninteractive sh -c 'apt install -y awscli'

# ClamAV itself (clamdscan talks to the clamd daemon)
RUN apt install -y clamav clamav-daemon

# clamd needs a writable runtime directory for its socket/pid
RUN mkdir /var/run/clamav && \
  chown clamav:clamav /var/run/clamav && \
  chmod 750 /var/run/clamav

# Bake the virus definitions into the image; freshclam keeps them updated at runtime
RUN freshclam

COPY ./src/clamd.conf /etc/clamav/clamd.conf
COPY ./src/consumer.js ./consumer.js
RUN npm install
ADD ./src/run.sh ./run.sh

CMD ["bash", "./run.sh"]



This installs Node, initializes a Node project, and installs the bare essentials we need for the consumer: aws-sdk, sqs-consumer, and tmp (for writing each file to disk so it can be scanned). The first file to create is src/clamd.conf, the ClamAV configuration for the daemon that will be listening:



LocalSocket /tmp/clamd.socket
LocalSocketMode 660



Now for the SQS consumer in src/consumer.js:



const { SQS, S3 } = require('aws-sdk');
const { Consumer } = require('sqs-consumer');
const tmp = require('tmp');
const fs = require('fs');
const util = require('util');
const { exec } = require('child_process');

const execPromise = util.promisify(exec);

const s3 = new S3();
// One SQS client, shared by sqs-consumer (for polling) and by us (for explicit deletes).
const sqs = new SQS();

const app = Consumer.create({
  queueUrl: process.env.VIRUS_SCAN_QUEUE_URL,
  handleMessage: async (message) => {
    console.log('message', message);
    const parsedBody = JSON.parse(message.Body);
    const documentKey = parsedBody.Records[0].s3.object.key;

    const { Body: fileData } = await s3.getObject({
      Bucket: process.env.QUARANTINE_BUCKET,
      Key: documentKey
    }).promise();

    // tmp falls back to the OS temp dir when TMP_PATH isn't set.
    const inputFile = tmp.fileSync({
      mode: 0o644,
      tmpdir: process.env.TMP_PATH,
    });
    fs.writeSync(inputFile.fd, Buffer.from(fileData));
    fs.closeSync(inputFile.fd);

    try {
      await execPromise(`clamdscan ${inputFile.name}`);

      await s3.putObject({
        Body: fileData,
        Bucket: process.env.CLEAN_BUCKET,
        Key: documentKey,
        Tagging: 'virus-scan=clean',
      }).promise();

      await s3.deleteObject({
        Bucket: process.env.QUARANTINE_BUCKET,
        Key: documentKey,
      }).promise();

    } catch (e) {
      // clamdscan exits with code 1 when it finds a virus.
      if (e.code === 1) {
        await s3.putObjectTagging({
          Bucket: process.env.QUARANTINE_BUCKET,
          Key: documentKey,
          Tagging: {
            TagSet: [
              {
                Key: 'virus-scan',
                Value: 'dirty',
              },
            ],
          },
        }).promise();
      }
    } finally {
      // Clean or dirty, we've handled the file, so remove the message from the queue.
      await sqs.deleteMessage({
        QueueUrl: process.env.VIRUS_SCAN_QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle
      }).promise();
    }
  },
  sqs,
});

app.on('error', (err) => {
  console.error('err', err.message);
});

app.on('processing_error', (err) => {
  console.error('processing error', err.message);
});

app.on('timeout_error', (err) => {
 console.error('timeout error', err.message);
});

app.start();



This does the following on a roughly 10-second polling interval:

1) Pulls the file from the quarantine bucket using the object key in the S3 event from the SQS message body (an example event is shown below)
2) Writes it to a temp file
3) Scans it with clamdscan (via the ClamAV daemon, clamd, which already has the virus definitions loaded)
4) If it's clean, puts the file in the clean bucket tagged virus-scan = clean, removes it from the quarantine bucket, and deletes the SQS message
5) If it's dirty, tags the file virus-scan = dirty, keeps it in the quarantine bucket, and deletes the SQS message
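For reference, the SQS message body the consumer parses is just the standard S3 event notification. Trimmed down, it looks roughly like this, which is why the code reads parsedBody.Records[0].s3.object.key:

{
  "Records": [
    {
      "eventSource": "aws:s3",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": { "name": "clamav-quarantine-bucket" },
        "object": { "key": "test-virus.txt" }
      }
    }
  ]
}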

For now, this consumer handles only one message at a time; that's easy to scale up, since the ClamAV daemon keeps the virus definitions loaded and individual scans are cheap.
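As a sketch of what "more messages" could look like: sqs-consumer exposes a batchSize option (SQS allows up to 10 messages per receive), so bumping throughput could be as simple as this, with the handler left untouched:

const app = Consumer.create({
  queueUrl: process.env.VIRUS_SCAN_QUEUE_URL,
  // Pull up to 10 messages per poll (the SQS maximum) instead of one.
  batchSize: 10,
  handleMessage: async (message) => {
    // ... same scan/tag/delete logic as above ...
  },
  sqs,
});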

Now, the last file mentioned in the Dockerfile is the bash script that runs the updater, daemon, and consumer in src/run.sh:



echo "Starting Services"
service clamav-freshclam start
service clamav-daemon start

echo "Services started. Running worker."

node consumer.js



Cool, let's stand it all up: terraform init (to wire up the S3 backend), then terraform plan and terraform apply. Once that finishes (you'll have to confirm the apply by typing yes), you should be good to go.
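One gotcha: the ECS service can only start the task once the image it references actually exists in ECR, so after the apply creates the fargate-images repository, build and push the Docker image. Something along these lines (assuming us-east-1 and the default AWS CLI profile):

# Grab the account ID so we can address the ECR registry.
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REPO="${AWS_ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/fargate-images"

# Authenticate Docker against ECR, then build, tag, and push.
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin "${AWS_ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com"
docker build -t fargate-images .
docker tag fargate-images:latest "${REPO}:latest"
docker push "${REPO}:latest"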

Now, to test it with a script, test-virus.sh:



#!/bin/bash

aws s3 cp fixtures/test-virus.txt s3://clamav-quarantine-bucket
aws s3 cp fixtures/test-file.txt s3://clamav-quarantine-bucket

sleep 30

VIRUS_TEST=$(aws s3api get-object-tagging --key test-virus.txt --bucket clamav-quarantine-bucket --output text)
CLEAN_TEST=$(aws s3api get-object-tagging --key test-file.txt --bucket clamav-clean-bucket --output text)

echo "Dirty tag: ${VIRUS_TEST}"
echo "Clean tag: ${CLEAN_TEST}"


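A quick note on those fixtures, since they aren't shown above: test-file.txt can be any harmless text, and test-virus.txt is presumably the standard EICAR test string, which ClamAV (like every scanner) flags as malicious even though it's not an actual virus. Creating them might look like:

mkdir -p fixtures
echo "just a harmless text file" > fixtures/test-file.txt
# The EICAR test string -- the industry-standard fake "virus" signature.
printf '%s' 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' > fixtures/test-virus.txt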

Running that, here's the output we get:



Dirty tag: TAGSET       virus-scan      dirty
Clean tag: TAGSET       virus-scan      clean



There we go. Hopefully y'all learned something! I had a lot of fun with this, and although I felt like I rushed it in a few areas, I look forward to your comments to see what all I missed (or to answer questions).

Y'all take care now.
