Josh Vincent
Posted on March 16, 2021
I was recently working on a project that required hourly market data for the top 10 companies by market cap, and we found it very difficult to find any APIs that could solve this for us. So we put this together using Amplify, Puppeteer, Lambda, DynamoDB, and some scraping of public data off the web, in the hope that it will help someone else out.
We discovered we could run a Lambda function every hour to get the top 10 companies by market cap and insert the data into a database, which we would fetch later (another blog post to come...).
This is how we did it.
Here is an image of what the architecture looks like.
- A 1-hour timer fires a CloudWatch Events rule
- The CloudWatch rule invokes our first Lambda function
- The Lambda function pushes a message to an SQS queue for each row it reads from the table
- SQS triggers a series of Lambda invocations that write the data to our DynamoDB table
Creating our new project
In VS Code run the following
amplify init
Create a name for your project and then accept all defaults
Note: It is recommended to run this command from the root of your app directory
$ Enter a name for the project MarketData
$ Enter a name for the environment dev
$ Choose your default editor: Visual Studio Code
$ Choose the type of app that you're building javascript
Please tell us about your project
$ What javascript framework are you using none
$ Source Directory Path: src
$ Distribution Directory Path: dist
$ Build Command: npm run-script build
$ Start Command: npm run-script start
Using default provider awscloudformation
$ Do you want to use an AWS profile? Yes
$ Please choose the profile you want to use marketdata
If you don't have your credentials saved in ~/.aws/credentials, you can just answer No to the last question and enter your access keys instead.
Creating our first Lambda function
Next, we create our first Lambda function:
amplify add function
Follow the prompts which should look like this
$ amplify add function
$ Select which capability you want to add: Lambda function (serverless function)
$ Provide a friendly name for your resource to be used as a label for this category in the project: marketdataPuppeteer
$ Provide the AWS Lambda function name: marketdataPuppeteer
$ Choose the runtime that you want to use: NodeJS
$ Choose the function template that you want to use: Hello World
$ Do you want to access other resources in this project from your Lambda function? Yes
$ Select the category
You can access the following resource attributes as environment variables from your Lambda function
ENV
REGION
$ Do you want to invoke this function on a recurring schedule? Yes
$ At which interval should the function be invoked: Hourly
$ Enter the rate in hours: 1
$ Do you want to configure Lambda layers for this function? No
$ Do you want to edit the local lambda function now? (Y/n) y
Set up the dependencies
Next, cd into the function you just created.
cd amplify/backend/function/marketdataPuppeteer/src/
Now install the following.
npm install chrome-aws-lambda --save-prod
npm install puppeteer-core --save-prod
Great, now open the index.js file inside the amplify/backend/function/marketdataPuppeteer/src/ folder.
Add the required dependencies
//amplify/backend/function/marketdataPuppeteer/src/index.js
var AWS = require("aws-sdk");
var SQS = new AWS.SQS({ region: "ap-southeast-2" });
const chromium = require("chrome-aws-lambda");
You will also need to add the URL of your SQS queue, which we will set up shortly.
var QUEUE_URL =
"https://sqs.ap-southeast-2.amazonaws.com/012345678/Market-Data-Que";
Here is our full function. Note that we define the delay helper ourselves, await all the SQS sends before returning, and let the finally block handle closing the browser:
//amplify/backend/function/marketdataPuppeteer/src/index.js
var AWS = require("aws-sdk");
var SQS = new AWS.SQS({ region: "ap-southeast-2" });
const chromium = require("chrome-aws-lambda");
var QUEUE_URL =
  "https://sqs.ap-southeast-2.amazonaws.com/012345678/Market-Data-Que";

// Promise-based sleep helper used to give the page time to render
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

exports.handler = async (event, context, callback) => {
  let result = null;
  let browser = null;
  try {
    browser = await chromium.puppeteer.launch({
      args: chromium.args,
      defaultViewport: chromium.defaultViewport,
      executablePath: await chromium.executablePath,
      headless: chromium.headless,
      ignoreHTTPSErrors: true,
    });
    let page = await browser.newPage();
    await page.goto("https://www.marketindex.com.au/asx-listed-companies");
    await delay(5000); // Wait 5 seconds for the page to load

    const selector = "tbody tr"; // We are finding the table here
    const rows = await page.$$eval(selector, (tableRows) =>
      // Loop through each row in the table
      tableRows.map((tableRow) => {
        const tableDataElements = [...tableRow.getElementsByTagName("td")]; // Get each table data element
        return tableDataElements.map(
          (tableData) => tableData.textContent.trim() // Return the text content of each cell
        );
      })
    );

    // All the rows now exist in `rows`; console.log(rows[0][2]) prints the
    // first row's third column. Build an object for each of the top 10 rows,
    // send it to the queue, and wait for every send to finish.
    await Promise.all(
      rows.slice(0, 10).map((company) => {
        const item = {
          symbol: company[2],
          name: company[3],
          price: company[4],
          daily_change: company[5],
          yearly_change: company[6],
          marketcap: company[7],
        };
        console.log(item); // The logs should now show a list of these objects
        const params = {
          MessageBody: JSON.stringify(item),
          QueueUrl: QUEUE_URL,
        };
        return SQS.sendMessage(params)
          .promise()
          .then((result) => console.log("Successfully sent message", result))
          .catch((error) => console.log("Error failed to send message", error));
      })
    );
  } catch (error) {
    return callback(error);
  } finally {
    if (browser !== null) {
      await browser.close();
    }
  }
  return callback(null, result);
};
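To make the column indexing concrete, here is a minimal sketch of the shape `$$eval` returns and how `slice(0, 10).map(...)` turns each row into a message body. The sample cell values below are made up for illustration; the real ones come from the scraped table:

```javascript
// Hypothetical sample of what page.$$eval returns: one array of cell
// strings per table row, in the column order used on the page.
const rows = [
  ["1", "", "CBA", "Commonwealth Bank", "$85.94", "+0.48%", "+45.10%", "$152.43 B"],
  ["2", "", "BHP", "BHP Group Ltd", "$48.10", "-1.02%", "+38.55%", "$142.79 B"],
];

// Same mapping as the handler: pick out the columns we care about by index.
const messages = rows.slice(0, 10).map((company) => ({
  symbol: company[2],
  name: company[3],
  price: company[4],
  daily_change: company[5],
  yearly_change: company[6],
  marketcap: company[7],
}));

// Each element of `messages` is what gets JSON.stringify-ed into MessageBody.
console.log(JSON.stringify(messages[0]));
```

If the site ever reorders its columns, these indexes are the only thing you need to update.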
cd back to the project root and push the backend to AWS by running
amplify push --y
You should see something like this
$ amplify push
✔ Successfully pulled backend environment dev from the cloud.
Current Environment: dev
| Category | Resource name | Operation | Provider plugin |
| -------- | ------------------- | --------- | ----------------- |
| Function | marketdataPuppeteer | Create | awscloudformation |
$ Are you sure you want to continue? (Y/n) y
This pushes your code to the AWS Amplify backend and creates your Lambda function.
Open your console in AWS. You should see something that looks like this
Creating the SQS Queue
Now that we have our Lambda function set up, you might have noticed the QUEUE_URL points to a queue we haven't created yet.
As of the time of posting, Amplify doesn't have built-in functionality to create SQS queues, so we will do it via the console.
Give your queue a name, as we have here with Market-Data-Que.
Keep all the defaults and create the queue.
Now copy the URL from the queue's endpoint address, go back to your function, and replace the QUEUE_URL value with your own.
Run another amplify push to update our recent changes.
amplify push --y
Because this function runs a little longer than normal (that is just how Puppeteer works), we have increased both the timeout and the memory for it. In the Lambda console:
- Click the configuration tab
- Click Edit
- Increase Memory to 512 MB
- Timeout to 2 min 30 seconds (increase this if you need to)
- Click View the marketdataLambdaRole12313-dev-role role on the IAM console.
- We want to allow Lambda to contact SQS, so add the AmazonSQSFullAccess managed policy to the role. (Update this to the queue's ARN later to lock down access.)
Attach the policy and return to your Lambda function.
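If you do want to lock access down right away instead of using the broad AmazonSQSFullAccess managed policy, an inline policy like this sketch grants only what the function needs. The account ID and queue name below are placeholders; substitute your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:ap-southeast-2:012345678:Market-Data-Que"
    }
  ]
}
```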
Now we are ready to test!
Open up the Test tab, create a demo event using the hello-world template, and invoke the function.
You should now see something that looks like this
If you then visit your SQS Queue you should see a heap of messages like this.
Okay, great. So far we have a Lambda function opening the page in Puppeteer, extracting the data, and pushing it to an SQS queue.
Now we want to create a new function that uses the SQS queue as a trigger to insert our entries into a DynamoDB table. Let's set it up.
Creating the storage table
First, let's create the storage with amplify add storage
$ amplify add storage
$ Please select from one of the below mentioned services: NoSQL Database
Welcome to the NoSQL DynamoDB database wizard
This wizard asks you a series of questions to help determine how to set up your NoSQL database table.
$ Please provide a friendly name for your resource that will be used to label this category in the project: marketdata
$ Please provide table name: marketdata
You can now add columns to the table.
$ What would you like to name this column: id
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: symbol
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: name
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: price
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: daily_change
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: yearly_change
$ Please choose the data type: string
$ Would you like to add another column? Yes
$ What would you like to name this column: marketcap
$ Please choose the data type: string
$ Would you like to add another column? No
$ Please choose partition key for the table: id
$ Do you want to add a sort key to your table? Yes
$ Please choose sort key for the table: symbol
You can optionally add global secondary indexes for this table. These are useful when you run queries defined in a different column than the primary key.
To learn more about indexes, see:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html#HowItWorks.CoreComponents.SecondaryIndexes
$ Do you want to add global secondary indexes to your table? No
$ Do you want to add a Lambda Trigger for your Table? No
Do another amplify push and watch the DynamoDB table get created.
amplify push --y
Now open the console and go to DynamoDB; you should see your table there.
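As a taste of fetching this data back later (the subject of the future post), here is a hedged sketch of the params you would hand to a DocumentClient scan. The table name "marketdata-dev" is an assumption; the Amplify CLI suffixes the table name with your environment name:

```javascript
// Sketch only: params for reading the table back with
// new AWS.DynamoDB.DocumentClient().scan(params).promise().
// "marketdata-dev" and the "FMG" filter value are illustrative.
const params = {
  TableName: "marketdata-dev",
  // Optionally narrow to one ticker; symbol is our sort key
  FilterExpression: "symbol = :s",
  ExpressionAttributeValues: { ":s": "FMG" },
};
console.log(params);
```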
Now that the table is created, we can create another Lambda function that will pull messages from the SQS queue and insert the data into our newly created table.
Creating Lambda function 2
We create another function with amplify add function
amplify add function
$ Select which capability you want to add: Lambda function (serverless function)
$ Provide a friendly name for your resource to be used as a label for this category in the project: addMarketDataToDB
$ Provide the AWS Lambda function name: addMarketDataToDB
$ Choose the runtime that you want to use: NodeJS
$ Choose the function template that you want to use: Hello World
$ Do you want to access other resources in this project from your Lambda function? Yes
$ Select the category storage
Storage category has a resource called marketdata
$ Select the operations you want to permit for marketdata create
You can access the following resource attributes as environment variables from your Lambda function
ENV
REGION
STORAGE_MARKETDATA_ARN
STORAGE_MARKETDATA_NAME
$ Do you want to invoke this function on a recurring schedule? No
$ Do you want to configure Lambda layers for this function? No
$ Do you want to edit the local lambda function now? (Y/n) y
cd into the function again
cd amplify/backend/function/addMarketDataToDB/src
Add the following dependencies to the top of your index.js
const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();
Here is the full function. Note we read the table name from the STORAGE_MARKETDATA_NAME environment variable that the Amplify CLI sets, rather than hard-coding it:
/* Amplify Params - DO NOT EDIT
ENV
REGION
STORAGE_MARKETDATA_ARN
STORAGE_MARKETDATA_NAME
Amplify Params - DO NOT EDIT */
const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();
exports.handler = async (event, context) => {
  // With an SQS batch size of 1 there is only ever one record per invocation
  const { body } = event.Records[0];
  const parsedBody = JSON.parse(body);
  const timestamp = new Date().toISOString();
  var params = {
    TableName: process.env.STORAGE_MARKETDATA_NAME, // table name env var set by the Amplify CLI
    Item: {
      id: context.awsRequestId, // use the request id of the invoked Lambda function from context
      timestamp: timestamp,
      ...parsedBody,
    },
  };
  console.log(params);
  await docClient
    .put(params)
    .promise()
    .then((data) => console.log("Success!", data))
    .catch((err) => console.log("error!", err));
};
/* SQS event looks like this.
{
"Records": [
{
"messageId": "19dd0b57-b21e-4ac1-bd88-01bbb068cb78",
"receiptHandle": "MessageReceiptHandle",
"body": "{\n \"symbol\": \"FMG\",\n \"name\": \"FMG Fortescue Metals Group Ltd\",\n \"price\": \"$20.63\",\n \"daily_change\": \"+1.23%\",\n \"yearly_change\": \"+113.78%\",\n \"marketcap\": \"$63.52 B\"\n }",
"attributes": {
"ApproximateReceiveCount": "1",
"SentTimestamp": "1523232000000",
"SenderId": "123456789012",
"ApproximateFirstReceiveTimestamp": "1523232000001"
},
"messageAttributes": {},
"md5OfBody": "{{{md5_of_body}}}",
"eventSource": "aws:sqs",
"eventSourceARN": "arn:aws:sqs:ap-southeast-2:123456789012:MyQueue",
"awsRegion": "ap-southeast-2"
}
]
}
*/
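Because the message body arrives as a JSON string, the handler's JSON.parse plus object spread is what reconstructs the item. Here is a minimal sketch using the sample body from the event above; the id value is a stand-in for context.awsRequestId:

```javascript
// Sample SQS message body, exactly as our first function stringified it
const body =
  '{\n "symbol": "FMG",\n "name": "FMG Fortescue Metals Group Ltd",\n "price": "$20.63",\n "daily_change": "+1.23%",\n "yearly_change": "+113.78%",\n "marketcap": "$63.52 B"\n}';

// Turn the JSON string back into an object
const parsedBody = JSON.parse(body);

// Same shape the handler passes to docClient.put()
const item = {
  id: "19dd0b57-b21e-4ac1-bd88-01bbb068cb78", // stand-in for context.awsRequestId
  timestamp: new Date().toISOString(),
  ...parsedBody, // spreads symbol, name, price, etc. into the item
};

console.log(item);
```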
Update our cloud backend with another push:
amplify push --y
Wait for the addMarketDataToDB function to appear in the console.
Once it has appeared, go to the Test tab and insert the following example SQS event.
{
"Records": [
{
"messageId": "19dd0b57-b21e-4ac1-bd88-01bbb068cb78",
"receiptHandle": "MessageReceiptHandle",
"body": "{\n \"symbol\": \"FMG\",\n \"name\": \"FMG Fortescue Metals Group Ltd\",\n \"price\": \"$20.63\",\n \"daily_change\": \"+1.23%\",\n \"yearly_change\": \"+113.78%\",\n \"marketcap\": \"$63.52 B\"\n }",
"attributes": {
"ApproximateReceiveCount": "1",
"SentTimestamp": "1523232000000",
"SenderId": "123456789012",
"ApproximateFirstReceiveTimestamp": "1523232000001"
},
"messageAttributes": {},
"md5OfBody": "{{{md5_of_body}}}",
"eventSource": "aws:sqs",
"eventSourceARN": "arn:aws:sqs:ap-southeast-2:123456789012:MyQueue",
"awsRegion": "ap-southeast-2"
}
]
}
If this is working, you should now be able to see an entry in your DynamoDB table like this.
Next, we need to configure the lambda function to get the messages from the SQS queue.
Go back to your lambda function in the console.
- Click Triggers
- Find SQS
- Enter the name of your SQS queue
- Make sure to make the batch size 1
- Click create
Once you have created this trigger, go to your SQS queue and you will notice the message count decreasing: Lambda is now polling the SQS queue and inserting the data into the table.
Now you have a fully automated web scraper running every hour. To test the whole process, just wait for an hour, or trigger the first Lambda function we created!