Crash Course on AWS CDK and Serverless with REST API and Data Lake Analytical Querying
Taavi Rehemägi
Posted on July 16, 2020
Original article here: https://dashbird.io/blog/crash-course-aws-cdk-serverless-rest-api-data-lake-analytical-querying/
This is a hands-on course on how to deploy a fully Serverless web app using the AWS CDK. You will learn how to:
- Structure **CDK Stacks **to deploy an application from end-to-end
- Deploy a REST API integrated with AWS Lambda for dynamic requests processing
- Store data in a fast and cost-effective way with DynamoDB
- Use DynamoDB streams as a source for Lambda in an event-driven architecture
- Ingest and manipulate loads of data streams with Kinesis Firehose
- Deploy and query a Data Lake with Athena, S3 and Glue
- Monitor your entire application health in a single place using Dashbird
You can use the resources declared in this demo application as a starting point to mix and adapt to your own architectures later, which should save you quite some time.
The App and Architecture
The demo app is a public blog where anyone can read, publish and like posts. It's available on this link. Go ahead and publish something in the top-left corner (yellow button) and also "like” articles already published. Check out the codebase on this repo.
Frontend
Backend
Data Lake and Analytical Querying
What is the AWS CDK
CDK stands for Cloud Development Kit. Think of it as CloudFormation (CF) in your preferred language (Python, Typescript, C#, etc). Roughly speaking, it works like this:
- You declare cloud resources using classes provided by the CDK libraries. Example:
from aws_cdk import aws_s3
my_bucket = aws_s3.Bucket(self, 'MyBucket')
- Run
cdk deploy
- CDK translates this to a CloudFormation template and deploy it on AWS for you
In case you would like to dig deeper, AWS also has a workshop that will get your basics started. I also strongly recommend reading the official CDK documentation.
Advantages using the AWS CDK
- Use languages that are more expressive than YAML or JSON, for instance
- Less - much less! - verbose than CloudFormation templates
- Easier to apply reusability and inheritance principles to infrastructure code
- Better integration with IDEs for code completion, IntelliSense, etc
- Possible to test your infra code, just as any other software
- Portable: since it's just a wrapper around CF, we can easily port it to JSON or YAML
Disadvantages of using the AWS CDK
Although released as a stable project by the AWS team, many parts (a lot of the good ones) are still experimental and APIs may change in backwards incompatible ways.
It's under constant development. During the preparation of this course, I had to upgrade my libraries three times.
Documentation is still lacking at some parts and you will need to look at the CDK code, occasionally, to understand how to declare certain things.
Deploy it yourself
Although we have provided an online demo, you can also deploy this app in your own AWS account:
- Clone the repo:
git clone git@github.com:byrro/serverless-website-demo.git sls-demo; cd sls-demo
- Setup your virtual environment:
virtualenv -p /urs/bin/python3.8 .env; source .env/bin/activate; pip install -r requirements
- Specify an AWS account ID:
export AWS_ACCOUNT_ID=1234567890
** - Deploy all three stacks:
cdk deploy sls-blog; cdk deploy sls-blog-api; cdk deploy sls-blog-analytical
** You can also hard code your Account ID in the CDK project, as I'll show in a minute;
When starting a new project from scratch, you would run cdk init --language [python|typescript|...]
. This is not necessary for this demo, since the project is already created.
Monitoring
Deploying this architecture in the cloud and blindly believing it will work flawlessly is not reasonable. We want to be the first one to know when something is not right to act upon it as quickly as possible.
In this project, I used Dashbird for its ease of use and seamless integration. Instead of having to deploy an agent inside my code, Dashbird plugs into my Stacks through a CloudFormation template that I can deploy with the effort of one click. It not only monitors Lambda function errors, but also other resources that we're using, such as DynamoDB tables. They even suggest insights for architectural improvements cross-referenced against industry best practices.
Finally, Dashbird offers a free-forever plan. It's a no-brainer to try it out by registering for free.
How a CDK project is structured
A CDK project creates an "Application”. This app may have one or more "Stacks”. A Stack is a group of cloud resources (Lambda functions, S3 buckets, etc) that are instantiated using CDK classes. It's also possible to have multiple applications in a single CDK project.
App Object
Creating a CDK app is as simple as:
app = core.App()
When you run
cdk init --language [language]
, an initial application with basic boilerplate code is created for you in the project root, underapp.py
.
The next thing we need is an environment, which is composed of an AWS Account ID and Region:
env = env = core.Environment(
account=1234567890,
region='us-east-1',
)
Declaring an environment is not required (CDK can infer from your AWS credentials), but is a good practice. Most of us work with multiple AWS accounts. It's easy to mess around with several projects, accounts, credentials. When we explicitly set the environment in the CDK app, it's locked and prevents mistaken deployments.
Now we declare our stacks:
from my_project.my_project_stack import MyStack
my_stack = MyStack(
app,
'my-stack',
env=env,
)
This is how we instantiate our stacks for deployment. In the next section we'll see how to declare those stacks.
Stack Object
The Stack object is where we declare our AWS resources. It inherits from the core.Stack
CDK class and accepts a scope - which is our app
object - a string identifier and an environment.
class MyStack(core.Stack):
def __init__(
self,
scope: core.Construct,
id: str,
env: core.Environment,
**kwargs,
) -> None:
super().__init__(scope, id, **kwargs)
# Declare AWS resources here
Declaring AWS resources
To declare AWS resources, we need a specific library for each service. Here's a list of all Python libraries and their Typescript counterparts. Other flavors are Java and .NET.
Let's see how a basic REST API would be declared (typing expressions were removed for readability purposes):
from aws_cdk import aws_apigateway, aws_lambda
class MyStack(core.Stack):
def __init__(self, scope, id, env):
super().__init__(scope, id, **kwargs)
my_lambda = aws_lambda.Function(
self,
'MyLambda',
runtime=aws_lambda.Runtime.PYTHON_3_8,
code=aws_lambda.Code.asset('my_lambda_folder),
handler='my_lambda.handler',
)
aws_apigateway.LambdaRestApi(
self,
'sls-blog-rest-api-gateway',
handler=my_lambda,
)
We first declare a Lambda function my_lambda
. We point its code to the my_lambda_folder
. Inside this folder, there should be a my_lambda.py
file, containing a function called handler
. This handler function should accept Lambda invocations normally (an event
and context
objects).
Next a LambdaRestApi
is declared, using my_lambda
as the handler (not to confuse with the Lambda's handler function). This will create a new API Gateway REST API integrated with my_lambda
using an AWS_PROXY
integration type. All HTTP requests will be routed to the Lambda function.
Our Project App & Stacks
This project comprises one application with three Stacks. They're all declared in the [app.py](https://github.com/byrro/serverless-website-demo/blob/master/app.py)
and [sls_website_stack.py](https://github.com/byrro/serverless-website-demo/blob/master/sls_website/sls_website_stack.py)
files.
Below we'll walk through all Stacks in a high level. I encourage you to inspect the stacks file to learn how these resources are declared and also integrated. For example: a Kinesis Firehose is created in one Stack and referenced in another to include its name as an environment variable for the Lambda function that will interact with it.
Except for the frontend static Stack - which is small - you will notice that resources are initialized with a None
(null) value in the beginning. The reason is that, even though the CDK is generally more succinct than CloudFormation, it can still be lengthy enough to clutter the view of the entire Stack. Having each resource declared first in one line, I can provide a short summary of everything that's in the Stack and then instantiate the CDK classes in other methods.
class SlsBlogApiStack(core.Stack):
def __init__(self, scope, id, env, static_stack):
super().__init__(scope, id, **kwargs)
self.static_stack = static_stack
# SQS Queues
self.queue_ddb_streams_dlq = None # Dead-letter-queue for DDB streams
# DynamoDB Tables
self.ddb_table_blog = None # Single-table for all blog content
# DynamoDB Event Sources
self.ddb_source_blog = None # Blog table streams source
# DynamoDB Indexes
self.ddb_gsi_latest = None # GSI ordering articles by timestamp
# Lambda Functions
self.lambda_blog = None # Serves requests to the blog public API
self.lambda_stream_reader = None # Processes DynamoDB streams
# Continues with other resources...
Notice it takes another Stack object (static_stack
) as an argument to its initialization. In the [app.py](https://github.com/byrro/serverless-website-demo/blob/master/app.py)
file, you can see that the SlsBlogApiStack
is initialized passing the SlsBlogStack
as an argument.
We use it to reference the CloudFront distribution domain (d1qmte5oc6ndq5.cloudfront.net) in the Lambda environment variables. This variable can be used to customize the HTTP response header Access-Control-Allow-Origin
to comply with CORS standards. This illustrates one way to easily integrate and reference information from one Stack into another within a CDK project.
At the end of the initialization, another method is called to instantiate the CDK classes for each resource and configure their parameters.
self.create_cdk_resources()
Next we'll walk through each of our project's Stacks.
Stack 1: static website
Our focus is on the Serverless backend, so the frontend here is terribly rough and simple. It's stored in an S3 Bucket and distributed through a CloudFront CDN.
CDK has a helpful class called BucketDeployment. It takes the contents of a directory and sync to an S3 bucket. In this case, we stored the frontend code in the website_static
folder.
aws_s3_deployment.BucketDeployment(
self,
'SlsBlogStaticS3Deployment',
sources=[aws_s3_deployment.Source.asset('website_static')],
destination_bucket=static_bucket,
distribution=cdn,
)
Stack 2: API/backend
Our backend consists of an API Gateway (REST) connected to a monolithic Lambda function. Microservices receives a lot of press, but you probably shouldn't always break your applications in several functions. A Monolith is just fine - and sometimes recommended -, really.
This API & Lambda support a single endpoint (with GET and POST methods) with a queryString "action”, which takes three parameters:
-
get-latest-articles
: populates the latest blog posts -
like-article
: triggered when someone likes an article -
publish-article
: post a new blog article
Here's the power of the CDK model. We can create a REST API with 10 lines of code:
rest_api_blog = aws_apigateway.LambdaRestApi(
self,
'sls-blog-rest-api-gateway',
handler=lambda_blog, # Previously declared Lambda function
deploy_options=aws_apigateway.StageOptions(
stage_name='api',
throttling_rate_limit=lambda_param_max_concurrency,
logging_level=aws_apigateway.MethodLoggingLevel('INFO'),
),
)
One nice little thing is that Lambda memory is used as a cache for the latest articles. We load the cache container outside the Lambda handler function. It remains in memory even after an invocation ends and is available for subsequent requests. Learn more here about how to use Lambda as a cache mechanism.
MAX_CACHE_AGE: int = 120 # In seconds
CACHE_LATEST_ARTICLES: Dict[str, Union[int, list]] = {
'last_update': time.time(),
'articles': [],
}
All the data is stored in DynamoDB (DDB) using a single-table design, in on-demand mode. The site only shows the latest blog articles and items get auto-deleted by DDB after a few days by setting a time-to-live
attribute.
ddb_attr_time_to_live = 'time-to-live'
ddb_param_max_parallel_streams = 5
ddb_table_blog = aws_dynamodb.Table(
self,
'sls-blog-dynamo-table',
partition_key=aws_dynamodb.Attribute(
name='id',
type=aws_dynamodb.AttributeType.STRING,
),
billing_mode=aws_dynamodb.BillingMode.PAY_PER_REQUEST,
point_in_time_recovery=True,
removal_policy=core.RemovalPolicy.DESTROY,
time_to_live_attribute=self.ddb_attr_time_to_live,
stream=aws_dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
)
The DDB table also has a GSI (Global Secondary Index) that makes it easier to retrieve articles ordered by date for the site:
self.ddb_table_blog.add_global_secondary_index(
index_name='latest-blogs',
partition_key=aws_dynamodb.Attribute(
name='item-type',
type=aws_dynamodb.AttributeType.STRING,
),
sort_key=aws_dynamodb.Attribute(
name='publish-timestamp',
type=aws_dynamodb.AttributeType.NUMBER,
),
projection_type=aws_dynamodb.ProjectionType.ALL,
)
Modifications to DDB items generate streams that are processed by a second Lambda function. These streams are then repackaged and sent to a Kinesis Firehose stream processor.
DDB doesn't provide the flexibility that SQL databases offer and many choose Aurora Serverless, for example. Although Aurora is a great service, personally I prefer DDB for its simplicity and reliable, consistent performance. But sometimes we do need to run analytical queries, those with aggregations and on-the-fly filters. For that, we'll be using Athena (more in the next Stack).
Stack 3: analytical querying
A Kinesis Firehose Stream is responsible for batching data inserted/modified in DDB, converting them to Apache Parquet format and storing in dedicated S3 buckets. From S3, we create a Data Lake with AWS Glue (used to declare our data schemas) and Athena (used to query the data).
Athena is extremely powerful. We can use SQL SELECT statements (with some limitations) to query terabytes of data and pay on-demand ($0.005 per GB of data scanned). Using Parquet not only improves query speed, but also reduces cost by minimizing the amount of data Athena needs to scan for each query.
Queries that are impossible or expensive/slow in DynamoDB, such as aggregations and JOINs, are fast and cheap in Athena. The two services combine each other in a perfect way, so that your application has optimized transactional storage and flexible analytical querying capabilities.
We can use Athena to query all articles ever published and cross-reference with likes and HTTP metadata (source IP address, country, device type, etc). Even articles that were already expired by DynamoDB TTL (time-to-live) would continue to be available in the Data Lake.
For example, which countries are liking the most articles? In the AWS Console, we get something like this:
Queries can also be executed programmatically with Athena API or AWS SDKs (e.g. Python's boto3) to integrate anywhere we need this data.
Athena also supports JOINs. Here's an example joining articles and HTTP metadata to analyze the most popular authors among readers of a particular country:
Deployment
CDK can deploy one Stack at a time. Since we have three, it's necessary to specify which one when running the cdk deploy
command. We do that by passing in the Stack ID as a CLI argument. For example, the following command will deploy the SlsBlogApiStack (id: sls-blog-api
):
cdk deploy sls-blog-api
Since all Stacks involve some type of permission granting, CDK asks for confirmation before deploying those resources. You can review the permissions requested and hit y
when it's good to go.
Wrapping up
We've covered how to structure CDK apps and add a bunch of AWS Resources to deploy with a simple cdk deploy
command. If you're new to the CDK - and as suggested early in the article -, it's strongly recommended to follow AWS CDK workshop and documentation.
Keep an eye on future publications as well, as Dashbird is releasing other examples and tutorials to reap the most out of AWS serverless services with the power of infra automation with CDK or else.
Posted on July 16, 2020
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 21, 2024