Lambda to scrape data using TypeScript & Serverless
metacollective
Posted on February 12, 2022
In this blog post we are going to do the following -
- Write a Lambda function in Node.js/TypeScript to extract the following data from a website
- Title of the page
- Every image on the page
- Store the extracted data in AWS S3
We will use the following node packages for this project -
- serverless (this must be installed globally): helps us write & deploy Lambda functions
- cheerio: parses the content of a web page into a jQuery-like object
- axios: promise-based HTTP client for the browser and Node.js
- exceljs: to read, manipulate and write spreadsheets
- aws-sdk: to upload the extracted data to S3
- serverless-offline: to run Lambda functions locally
Step 1: Install serverless globally
npm install -g serverless
Step 2: Create a new TypeScript-based project from the serverless template library like this
sls create --template aws-nodejs-typescript
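The template can also generate the project into its own folder if you pass a path (the folder name below is just an example); cd in and install the template's dependencies before moving on:
sls create --template aws-nodejs-typescript --path scrapeContent
cd scrapeContent
npm install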
Step 3: Install the required node packages for this lambda project
npm install axios exceljs cheerio aws-sdk
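serverless-offline is used as a plugin in the next step, so it also needs to be installed; adding it as a dev dependency works:
npm install --save-dev serverless-offline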
Step 4: Add serverless-offline to the plugins list in serverless.ts
plugins: ['serverless-webpack', 'serverless-offline']
Step 5: Add the S3 bucket name as an environment variable in serverless.ts like this
environment: {
  AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
  AWS_BUCKET_NAME: 'YOUR BUCKET NAME'
}
Step 6: Define your function in serverless.ts like this
import type { AWS } from '@serverless/typescript';

const serverlessConfiguration: AWS = {
  service: 'scrapeContent',
  frameworkVersion: '2',
  custom: {
    webpack: {
      webpackConfig: './webpack.config.js',
      includeModules: true
    }
  },
  // Add the serverless-webpack plugin
  plugins: ['serverless-webpack', 'serverless-offline'],
  provider: {
    name: 'aws',
    runtime: 'nodejs14.x',
    apiGateway: {
      minimumCompressionSize: 1024,
    },
    environment: {
      AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
      AWS_BUCKET_NAME: 'scrape-data-at-56'
    },
  },
  functions: {
    scrapeContent: {
      handler: 'handler.scrapeContent',
      events: [
        {
          http: {
            method: 'get',
            path: 'scrapeContent',
          }
        }
      ]
    }
  }
}

module.exports = serverlessConfiguration;
Step 7: In your handler.ts file define your function to do the following
- Receive the URL to scrape from the query string
- Make a GET request to that URL using axios
- Parse the response body using cheerio
- Extract the title and image URLs from the parsed response, storing everything in a JSON file and the image URLs in an Excel file
- Upload the extracted data to S3
import { APIGatewayEvent } from "aws-lambda";
import "source-map-support/register";
import axios from "axios";
import * as cheerio from "cheerio";
import { badRequest, okResponse, errorResponse } from "./src/utils/responses";
import { scrape } from "./src/interface/scrape";
import { excel } from "./src/utils/excel";
import { getS3SignedUrl, uploadToS3 } from "./src/utils/awsWrapper";
export const scrapeContent = async (event: APIGatewayEvent, _context) => {
  try {
    if (!event.queryStringParameters?.url) {
      return badRequest;
    }

    //load page
    const $ = cheerio.load((await axios.get(event.queryStringParameters?.url)).data);

    //extract title and all images on page
    const scrapeData = {} as scrape;
    scrapeData.images = [];
    scrapeData.url = event.queryStringParameters?.url;
    scrapeData.dateOfExtraction = new Date();
    scrapeData.title = $("title").text();
    $("img").each((_i, image) => {
      scrapeData.images.push({
        url: $(image).attr("src"),
        alt: $(image).attr("alt"),
      });
    });

    //add this data to an excel sheet and upload it to S3
    const excelSheet = await saveDataAsExcel(scrapeData);
    const objectKey = `${scrapeData.title.toLocaleLowerCase().replace(/ /g, '_')}_${new Date().getTime()}`;
    await uploadToS3({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.xlsx`,
      ContentType: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
      Body: await excelSheet.workbook.xlsx.writeBuffer()
    });

    //get a signed url with an expiry date so the xlsx can be downloaded
    scrapeData.xlsxUrl = await getS3SignedUrl({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.xlsx`,
      Expires: 3600 //this is 60 minutes, change as per your requirements
    });

    //upload the scraped data as JSON to S3
    await uploadToS3({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.json`,
      ContentType: 'application/json',
      Body: JSON.stringify(scrapeData)
    });

    return okResponse(scrapeData);
  } catch (error) {
    return errorResponse(error);
  }
};
/**
 * Builds an excel workbook containing the scraped data
 * @param scrapeData
 * @returns excel
 */
async function saveDataAsExcel(scrapeData: scrape) {
  const workbook: excel = new excel({ headerRowFillColor: '046917', defaultFillColor: 'FFFFFF' });
  const worksheet = await workbook.addWorkSheet({ title: 'Scraped data' });
  workbook.addHeaderRow(worksheet, [
    "Title",
    "URL",
    "Date of extraction",
    "Image URL",
    "Image ALT Text"
  ]);
  workbook.addRow(
    worksheet,
    [
      scrapeData.title,
      scrapeData.url,
      scrapeData.dateOfExtraction.toDateString()
    ],
    { bold: false, fillColor: "ffffff" }
  );
  for (const image of scrapeData.images) {
    workbook.addRow(
      worksheet,
      [
        '', '', '',
        image.url,
        image.alt
      ],
      { bold: false, fillColor: "ffffff" }
    );
  }
  return workbook;
}
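The handler above imports a few small helper modules (src/interface/scrape, src/utils/responses, src/utils/awsWrapper and src/utils/excel) that live in the linked project rather than in this post. If you are following along, minimal sketches like the ones below, inferred purely from how the handler uses them, are one way to fill them in.
A possible shape for the scrape interface and the API Gateway response helpers:
// src/interface/scrape.ts (sketch)
export interface scrape {
  url: string;
  title: string;
  dateOfExtraction: Date;
  images: { url?: string; alt?: string }[];
  xlsxUrl?: string;
}

// src/utils/responses.ts (sketch)
export const badRequest = {
  statusCode: 400,
  body: JSON.stringify({ message: 'Missing required query string parameter: url' })
};

export const okResponse = (data: unknown) => ({
  statusCode: 200,
  body: JSON.stringify(data)
});

export const errorResponse = (error: any) => ({
  statusCode: 500,
  body: JSON.stringify({ message: error?.message ?? 'Internal server error' })
});
The S3 wrapper only needs two thin promise helpers around the aws-sdk v2 client:
// src/utils/awsWrapper.ts (sketch)
import { S3 } from 'aws-sdk';

const s3 = new S3();

//upload an object and resolve when the upload completes
export const uploadToS3 = (params: S3.PutObjectRequest) =>
  s3.upload(params).promise();

//generate a time-limited download link for an object
export const getS3SignedUrl = (params: { Bucket: string; Key: string; Expires: number }) =>
  s3.getSignedUrlPromise('getObject', params);
And the excel helper can be a small class wrapping exceljs; the only hard requirement from the handler is that it exposes a workbook property so that workbook.xlsx.writeBuffer() works:
// src/utils/excel.ts (sketch)
import * as ExcelJS from 'exceljs';

interface excelOptions {
  headerRowFillColor: string;
  defaultFillColor: string;
}

interface rowOptions {
  bold: boolean;
  fillColor: string;
}

export class excel {
  workbook: ExcelJS.Workbook;
  private options: excelOptions;

  constructor(options: excelOptions) {
    this.workbook = new ExcelJS.Workbook();
    this.options = options;
  }

  async addWorkSheet({ title }: { title: string }) {
    return this.workbook.addWorksheet(title);
  }

  addHeaderRow(worksheet: ExcelJS.Worksheet, headers: string[]) {
    return this.addRow(worksheet, headers, {
      bold: true,
      fillColor: this.options.headerRowFillColor
    });
  }

  addRow(worksheet: ExcelJS.Worksheet, values: (string | undefined)[], options: rowOptions) {
    const row = worksheet.addRow(values);
    row.font = { bold: options.bold };
    row.eachCell((cell) => {
      cell.fill = {
        type: 'pattern',
        pattern: 'solid',
        //exceljs expects an 8-digit ARGB value, so prefix full opacity
        fgColor: { argb: `FF${options.fillColor}` }
      };
    });
    return row;
  }
}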
Step 8: Set your AWS access key and secret access key in your environment like this
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_ACCESS_SECRET_KEY
Step 9: You are now ready to run this function on your machine like this
sls offline --stage local
Now you should be able to access your function from your machine like this http://localhost:3000/local/scrapeContent?url=ANY_URL_YOU_WISH_TO_SCRAPE
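For example, with serverless-offline running on its default port (the target URL here is just a placeholder):
curl "http://localhost:3000/local/scrapeContent?url=https://example.com"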
Step 10: If you wish to deploy this Lambda function to your AWS account, you can do it like this -
sls deploy
You can check out this Lambda function from here.