Get direct traffic to ECS Fargate Containers with LetsEncrypt, CDK and AWS Lambda (probably a bad idea)
Thomas Rooney
Posted on March 18, 2022
With a small amount of code, you can expose your running Fargate Tasks to the internet directly on an individual subdomain. For example, ${taskid}.eu-west-2.browser.reflow.io.
This is almost always a bad idea. However, if you really need to do it, this guide should help.
Why you shouldn't do this
AWS provides multiple battle-tested patterns to operate containerized workloads. These patterns are much easier to configure, less likely to break, and will work for the vast majority of customer use-cases. For example:
- Application Load Balanced Fargate Service: A Fargate service running on an ECS cluster fronted by an application load balancer. Benefits:
  - Supports health-checks implicitly, diverting traffic away from unhealthy instances before re-creating them.
  - On a deployment via CodePipeline, managed services monitor the stability of newly provisioned services before gradually moving traffic onto them.
  - Traffic is automatically distributed across different Availability Zones to provide data-center resilience.
- Queue Processing Fargate Service: A Fargate service auto-scaled to handle jobs in an SQS Queue. Benefits:
  - Can automatically retry jobs upon a failure.
  - Will scale up/down based on asynchronous workloads.
  - Can handle long-lived jobs.
Why we do this
At reflow, we run web browsers to record and execute end-to-end tests on. To record tests, we need to be able to create a websocket with a server to hold transient state.
ECS with Fargate is a great choice for this, as it removes the risk and operational overhead of running servers, but without the complexity introduced by Kubernetes.
Our first design used the Application Load Balanced Fargate Service pattern, but we ran into the following problems:
- We wanted to run untrusted customer code on our servers, which requires both isolating customer workloads from each other, and zero privilege servers. Unfortunately it is non-trivial to do any customer-level physical isolation with ALB-fronted ECS Services.
- We wanted a multi-region architecture, but the cost of keeping one instance and a NAT Gateway warm in every region is significant for a bootstrapped startup. There didn't appear to be any way to allow clusters to scale to zero when not in use.
- We have a feature whereby multiple customers in the same team could share a single browser instance to collaborate on recording tests. However, this didn't work in our cloud instances, because we couldn't guarantee that multiple users in the same team would reach the same server.
Using the following pattern, we have:
- Clusters which scale to zero when not in use. This means we don't need to pay a warm-instance fee for going multi-region.
- All customer workloads physically isolated from each other, and the ability to share transient server state between multiple users in the same team.
- The ability to bake expected customer ids into the server process's environment variables, simplifying authentication
- No need to run a NAT Gateway in each availability zone.
With the main negatives:
- DNS propagation delays mean that the first time a customer uses a recording instance, they have to wait approximately 1 minute before the server is DNS-available.
- More moving parts to monitor.
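The DNS delay above means a client cannot assume a freshly created hostname resolves straight away. As a rough sketch of one way to wait for it (the helper name `waitForDns` and its parameters are hypothetical, and the resolver and sleep functions are injected rather than taken from a specific runtime), a client could poll until the record appears:

```typescript
// Hypothetical helper: poll until a hostname resolves or the attempts run out.
// `resolve` and `sleep` are injected so the sketch stays runtime-agnostic; in
// Node they could be dns.promises.resolve4 and a setTimeout-based delay.
type Resolver = (hostname: string) => Promise<string[]>;
type Sleeper = (ms: number) => Promise<void>;

async function waitForDns(
  hostname: string,
  resolve: Resolver,
  sleep: Sleeper,
  maxAttempts = 30,
  delayMs = 5000,
): Promise<string | undefined> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const addresses = await resolve(hostname);
      if (addresses.length > 0) {
        return addresses[0];
      }
    } catch {
      // Not found yet, or a transient resolver error: fall through and retry.
    }
    if (attempt < maxAttempts) {
      await sleep(delayMs);
    }
  }
  // The caller decides what to do if the record never appeared.
  return undefined;
}
```

The default of 30 attempts at 5 seconds is an assumption chosen to comfortably cover a delay of about a minute; it is not a value taken from our production code.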
Logical Components
- LetsEncrypt SSL wildcard certificates, on a relevant domain. E.g. *.eu-west-2.browser.reflow.io
- An AWS Lambda task to renew the above certificate automatically, and alert us if there are any problems. This is scheduled to run monthly.
- An AWS Lambda task to automatically create and destroy DNS records for every task registered in our cluster.
- An ECS Cluster and Task Configuration to run our service on demand, and automatically register a public IP address.
- Extra application logic to register the task hostnames and get them to the client. We use AppSync/DynamoDB to get these to web-clients; when a server boots it saves to a DynamoDB record that a client can read. Finally we bake a TeamId into the server's environment variables, a Cognito custom attribute that must be signed within a client JWT for all web requests.
The first three components are completely generic, so I'll describe them here. Once configured, any tasks created in the target ECS Cluster will be exposed via DNS. The last two are domain-specific, so are unlikely to be useful or relevant for anyone else, but feel free to reach out if you have questions.
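To give a flavour of the TeamId check mentioned in the component list, here is a minimal sketch. It assumes the JWT signature and expiry have already been verified elsewhere, and that the token is a Cognito ID token in which custom attributes appear with a `custom:` prefix. The function name and the `VerifiedClaims` type are hypothetical, not our production code.

```typescript
// Hypothetical shape of already-verified JWT claims. Signature and expiry
// checks are assumed to have happened before this function is called.
type VerifiedClaims = Record<string, unknown>;

// Cognito exposes custom attributes in ID tokens with a "custom:" prefix.
const TEAM_CLAIM = 'custom:TeamId';

// Returns true only when the token carries a non-empty TeamId that matches
// the team id baked into this server's environment at task launch.
function isAuthorizedForTeam(claims: VerifiedClaims, expectedTeamId: string): boolean {
  const claimed = claims[TEAM_CLAIM];
  return typeof claimed === 'string' && claimed.length > 0 && claimed === expectedTeamId;
}
```

Because each task serves exactly one team, the check reduces to a single string comparison rather than a lookup against a permissions store.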
LetsEncrypt SSL Certificates
CDK
We use CDK to manage all our infrastructure. The following is a Construct we use to manage our LetsEncrypt Lambda function.
It creates:
- An S3 bucket to hold our certificates in
- An SNS topic to notify us when our certificates renew
- A Lambda function to actually renew everything. We have a wrapper, ReamplifyLambdaFunction, that allows us to pre-compile our code outside of CDK, but this can just as well be a NodejsFunction.
It references:
- The hosted zone where we will place our task instances' DNS records, and the associated domain to suffix our tasks with.
- A workspace parameter, e.g. dev/prod. This allows us to provision multiple instances of this construct within one AWS account.
- An email address to send renewal notifications to.
- The region/account to store everything in.
import { Construct } from 'constructs';
import { Duration, RemovalPolicy, StackProps, Tags } from 'aws-cdk-lib';
import { BlockPublicAccess, Bucket, BucketEncryption, ObjectOwnership } from 'aws-cdk-lib/aws-s3';
import { Topic } from 'aws-cdk-lib/aws-sns';
import { EmailSubscription } from 'aws-cdk-lib/aws-sns-subscriptions';
import { ReamplifyLambdaFunction } from './reamplifyLambdaFunction';
import { PolicyStatement } from 'aws-cdk-lib/aws-iam';
import { IHostedZone } from 'aws-cdk-lib/aws-route53';
import { Rule, Schedule } from 'aws-cdk-lib/aws-events';
import { LambdaFunction } from 'aws-cdk-lib/aws-events-targets';
interface CertbotProps {
adminNotificationEmail: string;
hostedZone: IHostedZone;
domain: string;
workspace: string;
env: {
region: string;
account: string;
};
}
export class Certbot extends Construct {
public readonly certBucket: Bucket;
constructor(scope: Construct, id: string, props: StackProps & CertbotProps) {
super(scope, id);
Tags.of(this).add('construct', 'Certbot');
const certBucket = new Bucket(this, 'bucket', {
bucketName: `certs.${props.env.region}.${props.workspace}.reflow.io`,
objectOwnership: ObjectOwnership.BUCKET_OWNER_PREFERRED,
removalPolicy: RemovalPolicy.DESTROY,
autoDeleteObjects: true,
versioned: true,
lifecycleRules: [
{
enabled: true,
abortIncompleteMultipartUploadAfter: Duration.days(1),
},
],
encryption: BucketEncryption.S3_MANAGED,
enforceSSL: true,
blockPublicAccess: BlockPublicAccess.BLOCK_ALL,
});
this.certBucket = certBucket;
const topic = new Topic(this, 'CertAdminTopic');
topic.addSubscription(new EmailSubscription(props.adminNotificationEmail));
const fn = new ReamplifyLambdaFunction(this, 'LambdaFn', {
workspace: props.workspace,
lambdaConfig: 'deploy/browserCerts.ts',
timeout: Duration.minutes(15),
environment: {
NOTIFY_EMAIL: props.adminNotificationEmail,
CERTIFICATES: JSON.stringify([
{
domains: [`*.${props.domain}`],
zoneId: props.hostedZone.hostedZoneId,
certStorageBucketName: certBucket.bucketName,
certStoragePrefix: 'browser/',
successSnsTopicArn: topic.topicArn,
failureSnsTopicArn: topic.topicArn,
},
]),
},
});
fn.addToRolePolicy(
new PolicyStatement({
actions: ['route53:ListHostedZones'],
resources: ['*'],
})
);
fn.addToRolePolicy(
new PolicyStatement({
actions: ['route53:GetChange', 'route53:ChangeResourceRecordSets'],
resources: ['arn:aws:route53:::change/*'].concat(props.hostedZone.hostedZoneArn),
})
);
fn.addToRolePolicy(
new PolicyStatement({
actions: ['ssm:GetParameter', 'ssm:PutParameter'],
resources: ['*'],
})
);
certBucket.grantWrite(fn);
topic.grantPublish(fn);
new Rule(this, 'trigger', {
schedule: Schedule.cron({ minute: '32', hour: '17', day: '3', month: '*', year: '*' }),
targets: [new LambdaFunction(fn)],
});
}
}
AWS Lambda Function
Dependencies:
- acme-client: 4.2.3
This leans very heavily on acme-client to do all the heavy lifting, with a scattering of logic to:
- Maintain SSM parameters so that a single LetsEncrypt account is reused rather than a new one being created on every run, while still allowing the first run in a new environment to work without any prerequisites.
- Answer the LetsEncrypt challenges with DNS records to prove we own the given domain.
- Store the resultant certificates in S3.
- Notify an admin that the certificate has been issued (or not, if there was a failure).
import AWS from 'aws-sdk';
import acme from 'acme-client';
const route53 = new AWS.Route53();
const s3 = new AWS.S3();
const sns = new AWS.SNS();
export function assertEnv(key: string): string {
if (process.env[key] !== undefined) {
console.log('env', key, 'resolved by process.env as', process.env[key]!);
return process.env[key]!;
}
throw new Error(`expected environment variable ${key}`);
}
export const assertEnvOrSSM = async (key: string, shouldThrow = true): Promise<string> => {
const workspace = assertEnv('workspace');
if (process.env[key] !== undefined) {
// Log only that the value was found: this path can carry the LetsEncrypt account key.
console.log('env', key, 'resolved by process.env');
return Promise.resolve(process.env[key]!);
} else {
const SSMLocation = `/${workspace}/${key}`;
console.log('env', key, 'resolving via SSM at', SSMLocation);
const SSM = new AWS.SSM();
try {
const ssmResponse = await SSM.getParameter({
Name: SSMLocation,
}).promise();
if (!ssmResponse.Parameter || !ssmResponse.Parameter.Value) {
throw new Error(`env ${key} missing`);
}
// Avoid logging the value itself: this path is used for the LetsEncrypt account private key.
console.log('env', key, 'resolved by SSM');
process.env[key] = ssmResponse.Parameter.Value;
return ssmResponse.Parameter.Value;
} catch (e) {
console.error(`SSM.getParameter({Name: ${SSMLocation}}):`, e);
if (shouldThrow) {
throw e;
}
return '';
}
}
};
export const writeSSM = async (key: string, value: string): Promise<void> => {
const workspace = assertEnv('workspace');
const SSMLocation = `/${workspace}/${key}`;
// Avoid logging the value itself: this is used to store the LetsEncrypt account private key.
console.log('env', key, 'writing to SSM at', SSMLocation);
const SSM = new AWS.SSM();
await SSM.putParameter({
Name: SSMLocation,
Value: value,
Overwrite: true,
DataType: 'text',
Tier: 'Standard',
Type: 'String',
}).promise();
};
async function getOrCreateAccountPrivateKey() {
let accountKey = await assertEnvOrSSM('LETSENCRYPT_ACCOUNT_KEY', false);
if (accountKey) {
return accountKey;
}
console.log('Generating Account Key');
accountKey = (await acme.forge.createPrivateKey()).toString();
await writeSSM('LETSENCRYPT_ACCOUNT_KEY', accountKey);
return accountKey;
}
export const handler = async function (event) {
const maintainerEmail = assertEnv('NOTIFY_EMAIL');
const accountURL = await assertEnvOrSSM('LETSENCRYPT_ACCOUNT_URL', false);
const certificates = JSON.parse(assertEnv('CERTIFICATES'));
const accountPrivateKey = await getOrCreateAccountPrivateKey();
acme.setLogger(console.log);
const client = new acme.Client({
directoryUrl: acme.directory.letsencrypt.production,
accountKey: accountPrivateKey,
accountUrl: accountURL ? accountURL : undefined,
});
const certificateRuns = certificates.map(async (certificate) => {
const { domains, zoneId, certStorageBucketName, certStoragePrefix, successSnsTopicArn, failureSnsTopicArn } =
certificate;
try {
const [certificateKey, certificateCsr] = await acme.forge.createCsr({
commonName: domains[0],
altNames: domains.slice(1),
});
const certificate = await client.auto({
csr: certificateCsr,
email: maintainerEmail,
termsOfServiceAgreed: true,
challengeCreateFn: async (authz, challenge, keyAuthorization) => {
console.log(authz, challenge, keyAuthorization);
const dnsRecord = `_acme-challenge.${authz.identifier.value}`;
if (challenge.type !== 'dns-01') {
throw new Error('Only DNS-01 challenges are supported');
}
const changeReq = {
ChangeBatch: {
Changes: [
{
Action: 'UPSERT',
ResourceRecordSet: {
Name: dnsRecord,
ResourceRecords: [
{
Value: '"' + keyAuthorization + '"',
},
],
TTL: 60,
Type: 'TXT',
},
},
],
},
HostedZoneId: zoneId,
};
console.log('Sending create request', JSON.stringify(changeReq));
const response = await route53.changeResourceRecordSets(changeReq).promise();
const changeId = response.ChangeInfo.Id;
console.log(`Create request sent for ${dnsRecord} (Change id ${changeId}); waiting for it to complete`);
const waitRequest = route53.waitFor('resourceRecordSetsChanged', { Id: changeId });
const waitResponse = await waitRequest.promise();
console.log(
`Create request complete for ${dnsRecord}: (Change id ${waitResponse.ChangeInfo.Id}) ${waitResponse.ChangeInfo.Status}`
);
},
challengeRemoveFn: async (authz, challenge, keyAuthorization) => {
const dnsRecord = `_acme-challenge.${authz.identifier.value}`;
const deleteReq = {
ChangeBatch: {
Changes: [
{
Action: 'DELETE',
ResourceRecordSet: {
Name: dnsRecord,
ResourceRecords: [
{
Value: '"' + keyAuthorization + '"',
},
],
TTL: 60,
Type: 'TXT',
},
},
],
},
HostedZoneId: zoneId,
};
console.log('Sending delete request', JSON.stringify(deleteReq));
const response = await route53.changeResourceRecordSets(deleteReq).promise();
const changeId = response.ChangeInfo.Id;
console.log(`Delete request sent for ${dnsRecord} (Change id ${changeId}); waiting for it to complete`);
const waitRequest = route53.waitFor('resourceRecordSetsChanged', { Id: changeId });
const waitResponse = await waitRequest.promise();
console.log(
`Delete request complete for ${dnsRecord}: (Change id ${waitResponse.ChangeInfo.Id}) ${waitResponse.ChangeInfo.Status}`
);
},
challengePriority: ['dns-01'],
});
// Write private key & certificate to S3
const certKeyWritingPromise = s3
.putObject({
Body: certificateKey.toString(),
Bucket: certStorageBucketName,
Key: certStoragePrefix + 'key.pem',
ServerSideEncryption: 'AES256',
})
.promise();
const certChainWritingPromise = s3
.putObject({
Body: certificate,
Bucket: certStorageBucketName,
Key: certStoragePrefix + 'cert.pem',
})
.promise();
await Promise.all([certKeyWritingPromise, certChainWritingPromise]);
console.log('Completed with certificate for ', domains);
// after client.auto, an account should be available
if (!accountURL) {
await writeSSM('LETSENCRYPT_ACCOUNT_URL', client.getAccountUrl());
}
if (successSnsTopicArn) {
await sns
.publish({
TopicArn: successSnsTopicArn,
Message: `Certificate for ${JSON.stringify(domains)} issued`,
Subject: 'Certificate Issue Success',
})
.promise();
}
} catch (err) {
console.log('Error ', err);
if (failureSnsTopicArn) {
await sns
.publish({
TopicArn: failureSnsTopicArn,
Message: `Certificate for ${JSON.stringify(domains)} issue failure\n${err}`,
Subject: 'Certificate Issue Failure',
})
.promise();
}
throw err;
}
});
await Promise.all(certificateRuns);
};
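The create and remove callbacks above build nearly identical Route 53 requests. As a minimal sketch (the helper name `buildChallengeChange` and the `TxtChangeRequest` type are hypothetical and not part of the function above), the request shape could come from one pure function, which is also easy to unit test without touching AWS:

```typescript
type ChallengeAction = 'UPSERT' | 'DELETE';

// Mirrors the request shape passed to route53.changeResourceRecordSets above.
interface TxtChangeRequest {
  HostedZoneId: string;
  ChangeBatch: {
    Changes: Array<{
      Action: ChallengeAction;
      ResourceRecordSet: {
        Name: string;
        Type: 'TXT';
        TTL: number;
        ResourceRecords: Array<{ Value: string }>;
      };
    }>;
  };
}

// Builds the TXT record change for a DNS-01 challenge.
// `identifier` is the domain name from the ACME authorization.
function buildChallengeChange(
  action: ChallengeAction,
  identifier: string,
  keyAuthorization: string,
  zoneId: string,
): TxtChangeRequest {
  return {
    HostedZoneId: zoneId,
    ChangeBatch: {
      Changes: [
        {
          Action: action,
          ResourceRecordSet: {
            Name: `_acme-challenge.${identifier}`,
            Type: 'TXT',
            TTL: 60,
            // Route 53 expects TXT values wrapped in double quotes.
            ResourceRecords: [{ Value: `"${keyAuthorization}"` }],
          },
        },
      ],
    },
  };
}
```

With this in place, each callback would reduce to something like `buildChallengeChange('UPSERT', authz.identifier.value, keyAuthorization, zoneId)` followed by the existing send-and-wait logic. For a wildcard certificate, the identifier in the ACME authorization should be the base domain without the leading `*.`, so the record lands on `_acme-challenge.` plus that base domain.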
Automatic DNS Records
CDK
This references:
- A clusterArn to collect ECS EventStream events for any task state changes in the cluster
- The serviceDiscoveryTLD (in our case browser.${props.env.region}.reflow.io) to suffix DNS records
- The Route 53 hosted zone to create records in
import { Rule } from 'aws-cdk-lib/aws-events';
import { LambdaFunction } from 'aws-cdk-lib/aws-events-targets';
import { PolicyStatement } from 'aws-cdk-lib/aws-iam';
// ...
const eventRule = new Rule(this, 'ECSChangeRule', {
eventPattern: {
source: ['aws.ecs'],
detailType: ['ECS Task State Change'],
detail: {
clusterArn: [cluster.clusterArn],
},
},
});
const ecsChangeFn = new ReamplifyLambdaFunction(this, 'ECSStreamLambda', {
...props,
lambdaConfig: 'stream/ecsChangeStream.ts',
unreservedConcurrency: true,
memorySize: 128,
environment: {
DOMAIN_PREFIX: props.serviceDiscoveryTLD,
HOSTED_ZONE_ID: props.hostedZone.hostedZoneId,
},
});
eventRule.addTarget(new LambdaFunction(ecsChangeFn));
ecsChangeFn.addToRolePolicy(
new PolicyStatement({
actions: ['route53:GetChange', 'route53:ChangeResourceRecordSets', 'route53:ListResourceRecordSets'],
resources: ['arn:aws:route53:::change/*'].concat(props.hostedZone.hostedZoneArn),
})
);
ecsChangeFn.addToRolePolicy(
new PolicyStatement({
actions: ['ec2:DescribeNetworkInterfaces'],
resources: ['*'],
})
);
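To make the event rule above a little more concrete, the following sketch expresses the same filter as a plain function. The interface lists only the fields the rule and our handler look at, and the function name is hypothetical; EventBridge does this matching for us, so this is purely illustrative.

```typescript
// Only the fields relevant to the rule and the DNS handler are modelled here.
interface TaskStateChangeEvent {
  source: string;
  'detail-type': string;
  detail: {
    clusterArn?: string;
    taskArn?: string;
    lastStatus?: string;
    desiredStatus?: string;
  };
}

// Equivalent of the eventPattern: every listed field must match.
function matchesEcsChangeRule(event: TaskStateChangeEvent, clusterArn: string): boolean {
  return (
    event.source === 'aws.ecs' &&
    event['detail-type'] === 'ECS Task State Change' &&
    event.detail.clusterArn === clusterArn
  );
}
```

Note that the pattern does not filter on status, so the Lambda is invoked for every state transition of every task in the cluster. That is why the handler itself has to decide whether a given event should create or remove a record.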
AWS Lambda
This function:
- Does some sanity checks on whether the event should affect DNS records
- If the task is both currently RUNNING and desired RUNNING:
  - Looks up the public IP of the task.
  - Upserts an A record pointing at the task's public IP, on ${taskId}.${DOMAIN_PREFIX}
- Else:
  - Deletes the A record associated with the task.
import type { EventBridgeHandler } from 'aws-lambda';
import AWS from 'aws-sdk';
import { Task } from 'aws-sdk/clients/ecs';
export function assertEnv(key: string): string {
if (process.env[key] !== undefined) {
console.log('env', key, 'resolved by process.env as', process.env[key]!);
return process.env[key]!;
}
throw new Error(`expected environment variable ${key}`);
}
const ec2 = new AWS.EC2();
const route53 = new AWS.Route53();
const DOMAIN_PREFIX = assertEnv('DOMAIN_PREFIX');
const HOSTED_ZONE_ID = assertEnv('HOSTED_ZONE_ID');
export const handler: EventBridgeHandler<string, Task, unknown> = async (event) => {
console.log('event', JSON.stringify(event));
const task = event.detail;
const clusterArn = task.clusterArn;
const lastStatus = task.lastStatus;
const desiredStatus = task.desiredStatus;
if (!clusterArn) {
return;
}
if (!lastStatus) {
return;
}
if (!desiredStatus) {
return;
}
const taskArn = task.taskArn;
if (!taskArn) {
return;
}
const taskId = taskArn.split('/').pop();
if (!taskId) {
return;
}
const clusterName = clusterArn.split(':cluster/')[1];
if (!clusterName) {
return;
}
const containerDomain = `${taskId}.${DOMAIN_PREFIX}`;
if (lastStatus === 'RUNNING' && desiredStatus === 'RUNNING') {
const eniId = getEniId(task);
if (!eniId) {
return;
}
const taskPublicIp = await fetchEniPublicIp(eniId);
if (!taskPublicIp) {
return;
}
const recordSet = createRecordSet(containerDomain, taskPublicIp);
await updateDnsRecord(clusterName, HOSTED_ZONE_ID, recordSet);
console.log(`DNS record update finished for ${taskId} (${taskPublicIp})`);
} else {
const recordSet = await route53
.listResourceRecordSets({
HostedZoneId: HOSTED_ZONE_ID,
StartRecordName: containerDomain,
StartRecordType: 'A',
})
.promise();
console.log('listRecordSets', JSON.stringify(recordSet));
const found = recordSet.ResourceRecordSets.find((record) => record.Name === containerDomain + '.');
// Guard the first element too, in case the record set has no resource records.
if (found && found.ResourceRecords?.[0]?.Value) {
await route53
.changeResourceRecordSets({
HostedZoneId: HOSTED_ZONE_ID,
ChangeBatch: {
Changes: [
{
Action: 'DELETE',
ResourceRecordSet: {
Name: containerDomain,
Type: 'A',
ResourceRecords: [
{
Value: found.ResourceRecords[0].Value,
},
],
TTL: found.TTL,
},
},
],
},
})
.promise();
}
}
};
// Extracts the network interface id from the task's ENI attachment, if any.
function getEniId(task: Task): string | undefined {
const eniAttachment = task.attachments?.find(function (attachment) {
return attachment.type === 'eni';
});
if (!eniAttachment) {
return undefined;
}
const networkInterfaceIdDetail = eniAttachment.details?.find((detail) => detail.name === 'networkInterfaceId');
if (!networkInterfaceIdDetail) {
return undefined;
}
return networkInterfaceIdDetail.value;
}
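// Looks up the public IP associated with a network interface, if one exists.
async function fetchEniPublicIp(eniId: string): Promise<string | undefined> {
const data = await ec2
.describeNetworkInterfaces({
NetworkInterfaceIds: [eniId],
})
.promise();
console.log(data);
// Guard each array element, since either list may be empty.
return data.NetworkInterfaces?.[0]?.PrivateIpAddresses?.[0]?.Association?.PublicIp;
}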
function createRecordSet(domain: string, publicIp: string) {
return {
Action: 'UPSERT',
ResourceRecordSet: {
Name: domain,
Type: 'A',
TTL: 60,
ResourceRecords: [
{
Value: publicIp,
},
],
},
};
}
async function updateDnsRecord(clusterName: string, hostedZoneId: string, changeRecordSet: AWS.Route53.Change) {
let param = {
ChangeBatch: {
Comment: `Auto generated Record for ECS Fargate cluster ${clusterName}`,
Changes: [changeRecordSet],
},
HostedZoneId: hostedZoneId,
};
await route53.changeResourceRecordSets(param).promise();
}
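The branching in the handler above can also be pulled out into a pure function, which makes the upsert-or-delete decision testable without any AWS calls. This is only a sketch: `planDnsChange` and the `DnsPlan` type are hypothetical names, and the example ARN in the usage note is made up.

```typescript
type DnsPlan = { action: 'UPSERT' | 'DELETE'; hostname: string } | undefined;

// Decides what should happen to the task's A record for a given state change.
// Returns undefined when the task id cannot be derived from the ARN.
function planDnsChange(
  taskArn: string,
  lastStatus: string,
  desiredStatus: string,
  domainPrefix: string,
): DnsPlan {
  // The task id is the last path segment of the task ARN.
  const taskId = taskArn.split('/').pop();
  if (!taskId) {
    return undefined;
  }
  const hostname = `${taskId}.${domainPrefix}`;
  // Only a task that is running and meant to keep running gets a record.
  const action = lastStatus === 'RUNNING' && desiredStatus === 'RUNNING' ? 'UPSERT' : 'DELETE';
  return { action, hostname };
}
```

For example, calling it with a task ARN ending in `/0123456789abcdef`, both statuses set to `RUNNING`, and the prefix `eu-west-2.browser.reflow.io` would plan an upsert on `0123456789abcdef.eu-west-2.browser.reflow.io`. Any other combination of statuses plans a delete, matching the handler's else branch.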
Running this in Production
This has been in production for two months now, and whilst it's not perfect, it's working well for us.
Things we unnecessarily worried about:
- We were worried that DNS records would accumulate, as some error conditions would result in them not being removed. Many thousands of DNS records later, we haven't seen this become an issue.
- We were worried about Route53 throttling our DNS change requests. Whilst we've seen this happen a few times, our Lambdas automatically retry and the change eventually gets through.
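We rely on the Lambda being retried when a change request is throttled. If you would rather retry inside a single invocation, a minimal sketch of exponential backoff might look like the following. It is not what we run in production: the helper names are hypothetical, the sleep function is injected, and the error codes checked in `isThrottle` are assumptions about what the SDK surfaces.

```typescript
type Sleep = (ms: number) => Promise<void>;

// Assumption: throttling errors carry one of these codes on a `code` property.
function isThrottle(err: unknown): boolean {
  const code = (err as { code?: unknown } | null)?.code;
  return code === 'Throttling' || code === 'ThrottlingException';
}

// Retries `operation` on throttling errors, doubling the delay each time.
// Any other error, or exhausting the attempts, rethrows the last error.
async function withBackoff<T>(
  operation: () => Promise<T>,
  sleep: Sleep,
  maxAttempts = 5,
  baseDelayMs = 200,
): Promise<T> {
  let attempt = 0;
  for (;;) {
    attempt++;
    try {
      return await operation();
    } catch (err) {
      if (!isThrottle(err) || attempt >= maxAttempts) {
        throw err;
      }
      // Delays grow as base, 2x base, 4x base, and so on.
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

A call such as `route53.changeResourceRecordSets(param).promise()` could then be wrapped as the `operation` argument. Keeping the total delay well below the Lambda timeout matters here, since a long backoff inside the function is billed time.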
Negatives:
- We have seen some flakiness in our E2E tests where sometimes a browser will not use the new DNS records until a refresh, even when waiting beyond the TTL. We had to automate around this.
- Server orchestration logic is a lot more complex when you are managing individual ECS Tasks.