AWS Scheduled Scrape using Python


Posted on August 26, 2022


I hate reading the news. What I hate even more is opening a news site and being overwhelmed by all the stories, so I decided to build a scraper that emails me a couple of top news items every day.

The GitHub repo for the project is here.

Step 1:

The first thing I did was learn about cron jobs.

For now, I set up a cron job that runs every minute and logs its output, just so I can debug the program.

Here's the crontab entry for that:

*/1 * * * * /home/dhruv/bin/python ~/Desktop/projects/newsScraper/scraper.py >> ~/Desktop/projects/newsScraper/cron.log 2>&1
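For reference, here's how the five schedule fields break down:

# ┌───────── minute (*/1 = every minute)
# │ ┌─────── hour
# │ │ ┌───── day of month
# │ │ │ ┌─── month
# │ │ │ │ ┌─ day of week (0-6, Sunday = 0)
# │ │ │ │ │
# */1 * * * *  <command to run>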

The scheduler syntax is a little cryptic at first, but it's worth figuring out. The trailing 2>&1 redirects stderr into the same log file as stdout, so errors show up in cron.log too.

Next up: set up a scraper and print out the top stories.

Step 2:

I decided to pick a handful of Economic Times industry subsections and grab all their top news.

For now, the output gives me the link and the title of each story.

import random
import requests
from bs4 import BeautifulSoup  # web scraping

# section name -> section URL
urls_dict = {
    'telecom': 'https://economictimes.indiatimes.com/industry/telecom',
    'transport': 'https://economictimes.indiatimes.com/industry/transportation',
    'services': 'https://economictimes.indiatimes.com/industry/services',
    'biotech': 'https://economictimes.indiatimes.com/industry/healthcare/biotech',
    'svs': 'https://economictimes.indiatimes.com/industry/indl-goods/svs',
    'energy': 'https://economictimes.indiatimes.com/industry/energy',
    'consumer_products': 'https://economictimes.indiatimes.com/industry/cons-products',
    'finance': 'https://economictimes.indiatimes.com/industry/banking/finance',
    'automobiles': 'https://economictimes.indiatimes.com/industry/auto'
}

# pick one section at random each run
todays_url = random.choice(list(urls_dict.values()))
response = requests.get(todays_url)
soup = BeautifulSoup(response.content, 'html.parser')

# the headlines live in a <ul class="list1">
headline_data = soup.find("ul", class_="list1")

url = 'https://economictimes.indiatimes.com'
for i, news in enumerate(headline_data.find_all("li")):
    # hrefs are relative, so prepend the site root
    link = '%s%s' % (url, news.a.get('href'))
    print(i + 1, link, news.text, end=" \n")

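One thing worth guarding against (my addition, not part of the original run): if ET ever changes its markup, soup.find returns None and the loop dies with an AttributeError. A quick check makes the failure obvious in the cron log:

if headline_data is None:
    raise RuntimeError('no <ul class="list1"> found at ' + todays_url)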

Step 3:

Let's prettify this so I feel like a pro.

import random
import requests
from bs4 import BeautifulSoup

urls_dict = {
    'telecom': 'https://economictimes.indiatimes.com/industry/telecom',
    'transport': 'https://economictimes.indiatimes.com/industry/transportation',
    'services': 'https://economictimes.indiatimes.com/industry/services',
    'biotech': 'https://economictimes.indiatimes.com/industry/healthcare/biotech',
    'svs': 'https://economictimes.indiatimes.com/industry/indl-goods/svs',
    'energy': 'https://economictimes.indiatimes.com/industry/energy',
    'consumer_products': 'https://economictimes.indiatimes.com/industry/cons-products',
    'finance': 'https://economictimes.indiatimes.com/industry/banking/finance',
    'automobiles': 'https://economictimes.indiatimes.com/industry/auto'
}


def extract_news():
    todays_url = random.choice(list(urls_dict.values()))
    response = requests.get(todays_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    headline_data = soup.find("ul", class_="list1")

    # build a numbered list of HTML links for the email body
    email_body = ''

    url = 'https://economictimes.indiatimes.com'
    for i, news in enumerate(headline_data.find_all("li")):
        link = '%s%s' % (url, news.a.get('href'))
        email_body += str(i + 1) + '. ' + '<a href="' + link + '">' + news.text + '</a>' + '\n\n\n' + '<br />'

    return email_body
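A small style note, purely a suggestion: an f-string would make that concatenation chain much easier to read:

email_body += f'{i + 1}. <a href="{link}">{news.text}</a>\n\n\n<br />'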

Step 4:

I want to introduce some more randomness into what I read, so I add everything I scrape to a list, shuffle it, and send myself only 5 of the news items:

import itertools  # for islice below


def extract_news():
    todays_url = random.choice(list(urls_dict.values()))
    response = requests.get(todays_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    headline_data = soup.find("ul", class_="list1")

    email_body = ''

    email_body += "Good Morning kiddo. Today we read Economic Times. Here's what's happening today: <br />\n <br />\n"

    all_news = []

    url = 'https://economictimes.indiatimes.com'
    for i, news in enumerate(headline_data.find_all("li")):
        body = ''
        link = '%s%s' % (url, news.a.get('href'))
        body += '<a href="' + link + '">' \
                + news.text + '</a>' + '<br />\n' + '<br />\n'
        # collect every headline first
        all_news.append(body)

    # shuffle the list
    random.shuffle(all_news)

    n = 5
    # iterate over the first n elements of the shuffled list
    for i in itertools.islice(all_news, n):
        email_body += '- ' + i

    email_body += '<br>---------------------------------<br>'
    email_body += "<br><br>That's all for today. Byeeee"

    return email_body
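Side note: random.sample can do the shuffle-and-slice in one step (the min() guards against there being fewer than n headlines, since random.sample raises a ValueError in that case):

for item in random.sample(all_news, min(n, len(all_news))):
    email_body += '- ' + item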

Step 5:

Now it's time to send mail over SMTP.

Google has changed its policy, so I went here and reconfigured my account settings:

https://support.google.com/accounts/answer/6010255?hl=en-GB&visit_id=637970887869842501-2539226343&p=less-secure-apps&rd=1

After that was done, I stored my password in a .env file.
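A minimal .env, assuming the key name password matches what os.getenv('password') reads below:

password=super-secret-password

Then the mailer itself: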

import os
from dotenv import load_dotenv
import smtplib

# email body
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
# system date and time manipulation
import datetime

now = datetime.datetime.now()
load_dotenv()


def send_mail(news_body):
    SERVER = 'smtp.gmail.com'
    PORT = 587
    FROM = 'homesanct@gmail.com'
    TO = 'dhrvmohapatra@gmail.com'
    PASSWORD = os.getenv('password')

    msg = MIMEMultipart()
    msg['Subject'] = 'Good Morning Champ' + ' ' + str(now.day) + '-' + str(now.month) + '-' + str(
        now.year)
    msg['From'] = FROM
    msg['To'] = TO
    msg.attach(MIMEText(news_body, 'html'))

    print('initializing server')

    server = smtplib.SMTP(SERVER, PORT)
    server.set_debuglevel(1)  # print the whole SMTP conversation for debugging
    server.ehlo()
    server.starttls()  # upgrade the connection to TLS before logging in
    server.login(FROM, PASSWORD)
    server.sendmail(FROM, TO, msg.as_string())

    print('Email Sent...')

    server.quit()

Step 6:

I finished up with the classic Python main-guard:

if __name__ == "__main__":
    data = extract_news()
    send_mail(data)

Step 7:

Last but not least, I had to set up the proper cron job.

I moved the project, so the paths changed a little, but now I'll get a random sample of news from Economic Times at 6:55 am every Monday and Thursday!

55 6 * * 1,4 /home/dhruv/Desktop/projects/toolbox/newsScraper/venv/bin/python ~/Desktop/projects/toolbox/newsScraper/newsReader01.py

I also wrote scripts for Times of India and Reuters, but it would be redundant to add them here.

Now comes the slightly complex part. I don't want to keep my laptop on every day just so I can get a ruddy email, so I decided to send this script to the cloud.

After a bit of research, I found that AWS Lambda was the most efficient way to run this, so I spent a while understanding it.

Next, it came time to upload the script...

and everything crashed

like a hundred times

before I finally figured it out. Anyways,

here are the steps.

Step 8:

First, sign into AWS.

Then open up Lambda.

Then press Create function.


Fill in a function name and choose a Python runtime,


and create the function.

Step 9:

Once the function is created, scroll down to the Code source window and edit our preexisting code a little to fit the AWS handler template:

import json
import random
import requests
from bs4 import BeautifulSoup
import os
import smtplib
import itertools

# email body
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
# system date and time manipulation
import datetime

now = datetime.datetime.now()

def lambda_handler(event, context):

    urls_dict = {
        'telecom': 'https://economictimes.indiatimes.com/industry/telecom',
        'transport': 'https://economictimes.indiatimes.com/industry/transportation',
        'services': 'https://economictimes.indiatimes.com/industry/services',
        'biotech': 'https://economictimes.indiatimes.com/industry/healthcare/biotech',
        'svs': 'https://economictimes.indiatimes.com/industry/indl-goods/svs',
        'energy': 'https://economictimes.indiatimes.com/industry/energy',
        'consumer_products': 'https://economictimes.indiatimes.com/industry/cons-products',
        'finance': 'https://economictimes.indiatimes.com/industry/banking/finance',
        'automobiles': 'https://economictimes.indiatimes.com/industry/auto'
    }


    def extract_news():
        todays_url = random.choice(list(urls_dict.values()))
        response = requests.get(todays_url)
        content = response.content
        soup = BeautifulSoup(content, 'html.parser')
        headline_data = soup.find("ul", class_="list1")

        email_body = ''

        email_body += "Good Morning kiddo. Today we read Economic Times: <br />\n <br />\n"

        all_news = []

        url = 'https://economictimes.indiatimes.com'
        for i, news in enumerate(headline_data.find_all("li")):
            body = ''
            link = '%s%s' % (url, news.a.get('href'))
            body += '<a href="' + link + '">' \
                    + news.text + '</a>' + '<br />\n' + '<br />\n'
            # add items to a list
            all_news.append(body)

        # shuffle the list
        random.shuffle(all_news)

        n = 3
        # take the first n items of the shuffled list
        for i in itertools.islice(all_news, n):
            email_body += '- ' + i

        email_body += '<br>---------------------------------<br>'
        email_body += "<br><br>That's all for today. Byeeee"

        return email_body

    def send_mail(news_body):
        SERVER = 'smtp.gmail.com'
        PORT = 587
        FROM = 'homesanct@gmail.com'
        TO = 'dhrvmohapatra@gmail.com'
        PASSWORD = os.environ.get('password')

        msg = MIMEMultipart()
        msg['Subject'] = 'Economic Times' + ' ' + str(now.day) + '-' + str(now.month) + '-' + str(
            now.year)
        msg['From'] = FROM
        msg['To'] = TO
        msg.attach(MIMEText(news_body, 'html'))

        print('initializing server')

        server = smtplib.SMTP(SERVER, PORT)
        server.set_debuglevel(1)
        server.ehlo()
        server.starttls()
        server.login(FROM, PASSWORD)
        server.sendmail(FROM, TO, msg.as_string())

        print('Email Sent...')

        server.quit()

    news_body = extract_news()
    send_mail(news_body)
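One small, optional addition I'd suggest (it also gives the json import at the top something to do): return a value from the handler, so test invocations in the console show a result instead of null. At the end of lambda_handler:

    return {'statusCode': 200, 'body': json.dumps('email sent')}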

Save and press Test to try out the code.

A Configure test event window pops up.

Give your new event a name and save it.

Run the function again and it errors out,

cuz there are two things missing.

Step 10:

The first thing missing is the password in our environment variables.

We add that by going to the Configuration tab and adding an environment variable there.


The next thing missing is all the packages needed for the scrape. requests and BeautifulSoup don't just live on the AWS cloud, so we need to bundle them with our project.

This one took a while to figure out as well.

Here's my solution.

Step 11:

I went to the directory where I'd written the project locally and made a directory named packages.

Still in the project directory, I opened a terminal and ran:

$ pip install -t packages requests
$ pip install -t packages beautifulsoup4

Then I copied the code from the AWS code source into a new file inside the packages directory and called it lambda_function.py.

Now we're ready.

Inside the packages directory, I pressed Ctrl+A to select everything and compressed it into a zip. (Important: zip the contents of the directory, not the directory itself, so that lambda_function.py sits at the root of the archive.)
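If you'd rather do this from the terminal, something like this works (deployment.zip is just my name for it, AWS doesn't care):

$ cd packages
$ zip -r ../deployment.zip .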

Back in the AWS Lambda console, there's an Upload from option right above the editor. Upload this zip there and the bundled code appears in the function.


Step 12:

Now when I pressed the Test button, I got a message in my inbox.
WOOHOO

But there was one last thing left:

I had to automate this task.

So over to Amazon EventBridge we go.

Here, in the Rules submenu, I created a new rule.
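For the schedule itself, EventBridge cron expressions have six fields (minute, hour, day-of-month, month, day-of-week, year) and run in UTC; since I'm on IST (UTC+5:30), my 6:55 am Monday/Thursday schedule works out to something like:

cron(25 1 ? * MON,THU *)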


Now I came back to the AWS Lambda console and saw that the EventBridge trigger had been added to my function. Sweet sauce.

Time for the final step

Step 13:

Press Deploy.

A final note for the curious: the logs of all the function's runs can be seen in the Amazon CloudWatch console.


Aaaaand... that was it. I need to learn about Docker and friends, since I've heard uploading container images is much smoother, so maybe I'll spend some time going down that hole.

Also, hopefully reading the news will be as fun as creating this project was ✌️
