Lambda to scrape data using TypeScript & Serverless
metacollective
Posted on February 12, 2022
In this blog post we are going to do the following -
- Write a Lambda function in Node.js/TypeScript to extract the following data from a website
- Title of the page
- Every image on the page
- Store the extracted data in AWS S3
We will use the following node packages for this project -
- serverless (this must be installed globally): helps us write & deploy Lambda functions
- cheerio: parses the content of a web page into a jQuery-like object
- axios: promise-based HTTP client for the browser and Node.js
- exceljs: to read, manipulate and write spreadsheets
- aws-sdk: to upload the extracted data to S3
- serverless-offline: to run Lambda functions locally
Step 1: Install serverless globally
npm install -g serverless
Step 2: Create a new TypeScript-based project from the serverless template library like this
sls create --template aws-nodejs-typescript
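The template can also generate the project into its own folder if you pass a path (the folder name below is just an example); cd in and install the template's dependencies before moving on:
sls create --template aws-nodejs-typescript --path scrapeContent
cd scrapeContent
npm install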
Step 3: Install the required node packages for this lambda project
npm install axios exceljs cheerio aws-sdk
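serverless-offline is used as a plugin in the next step, so it also needs to be installed; adding it as a dev dependency works:
npm install --save-dev serverless-offline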
Step 4: Add serverless-offline to the plugins list in serverless.ts
plugins: ['serverless-webpack', 'serverless-offline']
Step 5: Add the S3 bucket name as an environment variable in serverless.ts like this
environment: {
  AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
  AWS_BUCKET_NAME: 'YOUR BUCKET NAME'
}
Step 6: Define your function in serverless.ts like this
import type { AWS } from '@serverless/typescript';

const serverlessConfiguration: AWS = {
  service: 'scrapeContent',
  frameworkVersion: '2',
  custom: {
    webpack: {
      webpackConfig: './webpack.config.js',
      includeModules: true
    }
  },
  // Add the serverless-webpack plugin
  plugins: ['serverless-webpack', 'serverless-offline'],
  provider: {
    name: 'aws',
    runtime: 'nodejs14.x',
    apiGateway: {
      minimumCompressionSize: 1024,
    },
    environment: {
      AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
      AWS_BUCKET_NAME: 'scrape-data-at-56'
    },
  },
  functions: {
    scrapeContent: {
      handler: 'handler.scrapeContent',
      events: [
        {
          http: {
            method: 'get',
            path: 'scrapeContent',
          }
        }
      ]
    }
  }
}

module.exports = serverlessConfiguration;
Step 7: In your handler.ts file define your function to do the following
- Receive the URL to scrape from the query string
- Make a GET request to that URL using axios
- Parse the response body using cheerio
- Extract the title and image URLs from the parsed response, storing everything in a JSON file and the image URLs in an Excel file
- Upload the extracted data to S3
import { APIGatewayEvent } from "aws-lambda";
import "source-map-support/register";
import axios from "axios";
import * as cheerio from "cheerio";
import { badRequest, okResponse, errorResponse } from "./src/utils/responses";
import { scrape } from "./src/interface/scrape";
import { excel } from "./src/utils/excel";
import { getS3SignedUrl, uploadToS3 } from "./src/utils/awsWrapper";
export const scrapeContent = async (event: APIGatewayEvent, _context) => {
  try {
    if (!event.queryStringParameters?.url) {
      return badRequest;
    }

    //load page
    const $ = cheerio.load((await axios.get(event.queryStringParameters?.url)).data);

    //extract title and all images on page
    const scrapeData = {} as scrape;
    scrapeData.images = [];
    scrapeData.url = event.queryStringParameters?.url;
    scrapeData.dateOfExtraction = new Date();
    scrapeData.title = $("title").text();
    $("img").each((_i, image) => {
      scrapeData.images.push({
        url: $(image).attr("src"),
        alt: $(image).attr("alt"),
      });
    });

    //add this data to an excel sheet and upload it to S3
    const excelSheet = await saveDataAsExcel(scrapeData);
    const objectKey = `${scrapeData.title.toLocaleLowerCase().replace(/ /g, '_')}_${new Date().getTime()}`;
    await uploadToS3({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.xlsx`,
      ContentType: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
      Body: await excelSheet.workbook.xlsx.writeBuffer()
    });

    //get a signed url with an expiry date so the xlsx can be downloaded
    scrapeData.xlsxUrl = await getS3SignedUrl({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.xlsx`,
      Expires: 3600 //this is 60 minutes, change as per your requirements
    });

    //upload the scraped data as JSON to S3
    await uploadToS3({
      Bucket: process.env.AWS_BUCKET_NAME,
      Key: `${objectKey}.json`,
      ContentType: 'application/json',
      Body: JSON.stringify(scrapeData)
    });

    return okResponse(scrapeData);
  } catch (error) {
    return errorResponse(error);
  }
};
/**
 * Builds an excel workbook containing the scraped data
 * @param scrapeData
 * @returns excel
 */
async function saveDataAsExcel(scrapeData: scrape) {
  const workbook: excel = new excel({ headerRowFillColor: '046917', defaultFillColor: 'FFFFFF' });
  const worksheet = await workbook.addWorkSheet({ title: 'Scraped data' });
  workbook.addHeaderRow(worksheet, [
    "Title",
    "URL",
    "Date of extraction",
    "Image URL",
    "Image ALT Text"
  ]);
  workbook.addRow(
    worksheet,
    [
      scrapeData.title,
      scrapeData.url,
      scrapeData.dateOfExtraction.toDateString()
    ],
    { bold: false, fillColor: "ffffff" }
  );
  for (const image of scrapeData.images) {
    workbook.addRow(
      worksheet,
      [
        '', '', '',
        image.url,
        image.alt
      ],
      { bold: false, fillColor: "ffffff" }
    );
  }
  return workbook;
}
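The handler above imports a few small helper modules (src/interface/scrape, src/utils/responses, src/utils/awsWrapper and src/utils/excel) that live in the linked project rather than in this post. If you are following along, minimal sketches like the ones below, inferred purely from how the handler uses them, are one way to fill them in.
A possible shape for the scrape interface and the API Gateway response helpers:
// src/interface/scrape.ts (sketch)
export interface scrape {
  url: string;
  title: string;
  dateOfExtraction: Date;
  images: { url?: string; alt?: string }[];
  xlsxUrl?: string;
}

// src/utils/responses.ts (sketch)
export const badRequest = {
  statusCode: 400,
  body: JSON.stringify({ message: 'Missing required query string parameter: url' })
};

export const okResponse = (data: unknown) => ({
  statusCode: 200,
  body: JSON.stringify(data)
});

export const errorResponse = (error: any) => ({
  statusCode: 500,
  body: JSON.stringify({ message: error?.message ?? 'Internal server error' })
});
The S3 wrapper only needs two thin promise helpers around the aws-sdk v2 client:
// src/utils/awsWrapper.ts (sketch)
import { S3 } from 'aws-sdk';

const s3 = new S3();

//upload an object and resolve when the upload completes
export const uploadToS3 = (params: S3.PutObjectRequest) =>
  s3.upload(params).promise();

//generate a time-limited download link for an object
export const getS3SignedUrl = (params: { Bucket: string; Key: string; Expires: number }) =>
  s3.getSignedUrlPromise('getObject', params);
And the excel helper can be a small class wrapping exceljs; the only hard requirement from the handler is that it exposes a workbook property so that workbook.xlsx.writeBuffer() works:
// src/utils/excel.ts (sketch)
import * as ExcelJS from 'exceljs';

interface excelOptions {
  headerRowFillColor: string;
  defaultFillColor: string;
}

interface rowOptions {
  bold: boolean;
  fillColor: string;
}

export class excel {
  workbook: ExcelJS.Workbook;
  private options: excelOptions;

  constructor(options: excelOptions) {
    this.workbook = new ExcelJS.Workbook();
    this.options = options;
  }

  async addWorkSheet({ title }: { title: string }) {
    return this.workbook.addWorksheet(title);
  }

  addHeaderRow(worksheet: ExcelJS.Worksheet, headers: string[]) {
    return this.addRow(worksheet, headers, {
      bold: true,
      fillColor: this.options.headerRowFillColor
    });
  }

  addRow(worksheet: ExcelJS.Worksheet, values: (string | undefined)[], options: rowOptions) {
    const row = worksheet.addRow(values);
    row.font = { bold: options.bold };
    row.eachCell((cell) => {
      cell.fill = {
        type: 'pattern',
        pattern: 'solid',
        //exceljs expects an 8-digit ARGB value, so prefix full opacity
        fgColor: { argb: `FF${options.fillColor}` }
      };
    });
    return row;
  }
}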
Step 8: Set your AWS access key and secret access key in your environment like this
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_ACCESS_SECRET_KEY
Step 9: You are now ready to run this function on your machine like this
sls offline --stage local
Now you should be able to access your function from your machine like this http://localhost:3000/local/scrapeContent?url=ANY_URL_YOU_WISH_TO_SCRAPE
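For example, with serverless-offline running on its default port (the target URL here is just a placeholder):
curl "http://localhost:3000/local/scrapeContent?url=https://example.com"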
Step 10: If you wish to deploy this Lambda function to your AWS account, you can do it like this -
sls deploy
You can check out this Lambda function from here.