dhxmo
Posted on August 26, 2022
I hate reading the news. What I hate more than anything is opening a news site and being overwhelmed by all the stories, so I decided to build a scraper to send me a couple of top news items every day
The GitHub repo for the project is here
Step 1:
First thing I did was learn about cron jobs
so for now I set up a cron job that'll log every minute, just so I can debug the program
here's the command for that:
*/1 * * * * /home/dhruv/bin/python ~/Desktop/projects/newsScraper/scraper.py >> ~/Desktop/projects/newsScraper/cron.log 2>&1
the schedule syntax is a little cryptic at first, but you gotta figure it out to use this tool.
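for reference, here's how I read the five schedule fields (my own annotation, not part of the command itself):

# minute   hour   day-of-month   month   day-of-week
#  */1      *          *           *          *        -> run every minute
# the >> appends the script's output to cron.log, and 2>&1 sends errors to the same file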
next up is to set up a scraper and output a text file of the top stories.
Step 2:
I decided I'll pick all the subsections of markets and get all their top news.
for now the output gives me the link to each news item and its title
import requests, random
from bs4 import BeautifulSoup  # web scraping

content = ''

urls_dict = {
    'telecom': 'https://economictimes.indiatimes.com/industry/telecom',
    'transport': 'https://economictimes.indiatimes.com/industry/transportation',
    'services': 'https://economictimes.indiatimes.com/industry/services',
    'biotech': 'https://economictimes.indiatimes.com/industry/healthcare/biotech',
    'svs': 'https://economictimes.indiatimes.com/industry/indl-goods/svs',
    'energy': 'https://economictimes.indiatimes.com/industry/energy',
    'consumer_products': 'https://economictimes.indiatimes.com/industry/cons-products',
    'finance': 'https://economictimes.indiatimes.com/industry/banking/finance',
    'automobiles': 'https://economictimes.indiatimes.com/industry/auto'
}

# pick one section at random and fetch its page
todays_url = random.choice(list(urls_dict.values()))
response = requests.get(todays_url)
content = response.content
soup = BeautifulSoup(content, 'html.parser')

# the headlines live in a <ul class="list1"> on each section page
headline_data = soup.find("ul", class_="list1")

url = 'https://economictimes.indiatimes.com'
for i, news in enumerate(headline_data.find_all("li")):
    # hrefs are relative, so prepend the site root
    link = '%s%s' % (url, news.a.get('href'))
    print(i + 1, link, news.text, end=" \n")
Step 3:
let's prettify this so I feel like a pro
import random
import requests
from bs4 import BeautifulSoup

# email content placeholder
content = ''

urls_dict = {
    'telecom': 'https://economictimes.indiatimes.com/industry/telecom',
    'transport': 'https://economictimes.indiatimes.com/industry/transportation',
    'services': 'https://economictimes.indiatimes.com/industry/services',
    'biotech': 'https://economictimes.indiatimes.com/industry/healthcare/biotech',
    'svs': 'https://economictimes.indiatimes.com/industry/indl-goods/svs',
    'energy': 'https://economictimes.indiatimes.com/industry/energy',
    'consumer_products': 'https://economictimes.indiatimes.com/industry/cons-products',
    'finance': 'https://economictimes.indiatimes.com/industry/banking/finance',
    'automobiles': 'https://economictimes.indiatimes.com/industry/auto'
}

def extract_news():
    todays_url = random.choice(list(urls_dict.values()))
    response = requests.get(todays_url)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    headline_data = soup.find("ul", class_="list1")

    email_body = ''
    url = 'https://economictimes.indiatimes.com'
    for i, news in enumerate(headline_data.find_all("li")):
        link = '%s%s' % (url, news.a.get('href'))
        email_body += str(i + 1) + '. ' + '<a href="' + link + '">' + news.text + '</a>' + '\n\n\n' + '<br />'
    return email_body
Step 4:
I want to introduce some more randomness into what I read, so I decided to collect the items into a list, shuffle it, and send myself only 5 of the news items:
import itertools  # for islice below

def extract_news():
    todays_url = random.choice(list(urls_dict.values()))
    response = requests.get(todays_url)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    headline_data = soup.find("ul", class_="list1")

    email_body = ''
    email_body += "Good Morning kiddo. Today we read Economic Times. Here's what's happening today: <br />\n <br />\n"

    all_news = []
    url = 'https://economictimes.indiatimes.com'
    for i, news in enumerate(headline_data.find_all("li")):
        body = ''
        link = '%s%s' % (url, news.a.get('href'))
        body += '<a href="' + link + '">' \
            + news.text + '</a>' + '<br />\n' + '<br />\n'
        # add items to a list
        all_news.append(body)

    # shuffle the list
    random.shuffle(all_news)

    n = 5
    # iterate over the first 5 elements of the randomized list
    for i in itertools.islice(all_news, n):
        email_body += '- ' + i

    email_body += '<br>---------------------------------<br>'
    email_body += "<br><br>That's all for today. Byeeee"
    return email_body
Step 5:
now it's time to send mail over SMTP
Google has changed its policy, so I went here and reconfigured my account settings
after that's done, I stored my password in a .env file
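the .env file itself is just a single line (the value below is a placeholder; yours would be whatever password Google's settings give you, presumably an app password if you're on Gmail):

password=your-app-password-here

with that in place, here's the mailer: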
import os
from dotenv import load_dotenv
import smtplib
# email body
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
# system date and time manipulation
import datetime

now = datetime.datetime.now()
load_dotenv()

def send_mail(news_body):
    SERVER = 'smtp.gmail.com'
    PORT = 587
    FROM = 'homesanct@gmail.com'
    TO = 'dhrvmohapatra@gmail.com'
    PASSWORD = os.getenv('password')

    msg = MIMEMultipart()
    msg['Subject'] = 'Good Morning Champ' + ' ' + str(now.day) + '-' + str(now.month) + '-' + str(now.year)
    msg['From'] = FROM
    msg['To'] = TO
    msg.attach(MIMEText(news_body, 'html'))

    print('initializing server')
    server = smtplib.SMTP(SERVER, PORT)
    server.set_debuglevel(1)
    server.ehlo()
    server.starttls()
    server.login(FROM, PASSWORD)
    server.sendmail(FROM, TO, msg.as_string())
    print('Email Sent...')
    server.quit()
Step 6:
I finished up with the classic Pythonic main guard
if __name__ == "__main__":
    data = extract_news()
    send_mail(data)
Step 7:
last but not least, I had to set up the proper cron job
I changed up the project location so things moved around a little, but now I'll get a random sample of news from the Economic Times at 6:55 am on Monday and Thursday!
55 6 * * 1,4 /home/dhruv/Desktop/projects/toolbox/newsScraper/venv/bin/python ~/Desktop/projects/toolbox/newsScraper/newsReader01.py
I also wrote scripts for Times Of India and Reuters, but that would be redundant to add here.
Now come the slightly more complex parts. I don't want to keep my laptop on every day just so I can get a ruddy email, so I decided to send this script to the cloud.
After a bit of research, AWS Lambda looked like the simplest way to run this on a schedule without a machine of my own, so I spent a while understanding it.
next, it came time to upload the script...
and everything crashed
like a hundred times
before I finally figured it out. anyways,
here are the steps.
Step 8:
First, sign in to AWS
then open up Lambda
then press Create Function
fill in a function name, choose a Python runtime
and create the function
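(side note: this can also be done from the terminal with the AWS CLI; something along these lines should work, though the function name, runtime version, and role ARN below are placeholders, and it expects a deployment zip up front, which we only build a few steps later, so the console route is simpler here)

$ aws lambda create-function \
    --function-name newsScraper \
    --runtime python3.9 \
    --handler lambda_function.lambda_handler \
    --role arn:aws:iam::123456789012:role/my-lambda-role \
    --zip-file fileb://newsScraper.zip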
Step 9:
once the function is created, scroll down to the Code source window and edit our preexisting code a little to fit the AWS handler template
import json
import random
import requests
from bs4 import BeautifulSoup
import os
import smtplib
import itertools
# email body
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
# system date and time manipulation
import datetime

now = datetime.datetime.now()

def lambda_handler(event, context):
    # email content placeholder
    content = ''

    urls_dict = {
        'telecom': 'https://economictimes.indiatimes.com/industry/telecom',
        'transport': 'https://economictimes.indiatimes.com/industry/transportation',
        'services': 'https://economictimes.indiatimes.com/industry/services',
        'biotech': 'https://economictimes.indiatimes.com/industry/healthcare/biotech',
        'svs': 'https://economictimes.indiatimes.com/industry/indl-goods/svs',
        'energy': 'https://economictimes.indiatimes.com/industry/energy',
        'consumer_products': 'https://economictimes.indiatimes.com/industry/cons-products',
        'finance': 'https://economictimes.indiatimes.com/industry/banking/finance',
        'automobiles': 'https://economictimes.indiatimes.com/industry/auto'
    }

    def extract_news():
        todays_url = random.choice(list(urls_dict.values()))
        response = requests.get(todays_url)
        content = response.content
        soup = BeautifulSoup(content, 'html.parser')
        headline_data = soup.find("ul", class_="list1")

        email_body = ''
        email_body += 'Good Morning kiddo. Today we read Economic Times: <br />\n <br />\n'

        all_news = []
        url = 'https://economictimes.indiatimes.com'
        for i, news in enumerate(headline_data.find_all("li")):
            body = ''
            link = '%s%s' % (url, news.a.get('href'))
            body += '<a href="' + link + '">' \
                + news.text + '</a>' + '<br />\n' + '<br />\n'
            # add items to a list
            all_news.append(body)

        # shuffle the list
        random.shuffle(all_news)

        n = 3
        # iterate over the first n elements of the randomized list
        for i in itertools.islice(all_news, n):
            email_body += '- ' + i

        email_body += '<br>---------------------------------<br>'
        email_body += "<br><br>That's all for today. Byeeee"
        return email_body

    def send_mail(news_body):
        SERVER = 'smtp.gmail.com'
        PORT = 587
        FROM = 'homesanct@gmail.com'
        TO = 'dhrvmohapatra@gmail.com'
        PASSWORD = os.environ.get('password')

        msg = MIMEMultipart()
        msg['Subject'] = 'Economic Times' + ' ' + str(now.day) + '-' + str(now.month) + '-' + str(now.year)
        msg['From'] = FROM
        msg['To'] = TO
        msg.attach(MIMEText(news_body, 'html'))

        print('initializing server')
        server = smtplib.SMTP(SERVER, PORT)
        server.set_debuglevel(1)
        server.ehlo()
        server.starttls()
        server.login(FROM, PASSWORD)
        server.sendmail(FROM, TO, msg.as_string())
        print('Email Sent...')
        server.quit()

    news_body = extract_news()
    send_mail(news_body)
save and press Test to try out the code.
a Configure test event window pops up.
give your new event a name and save it.
run the function again and it errors out
cuz there are two things missing
Step 10:
first thing missing is the password in our environment variables
we add that by going to the configuration tab and adding an environment variable here
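(the CLI equivalent, if you're curious; the function name and value here are placeholders)

$ aws lambda update-function-configuration \
    --function-name newsScraper \
    --environment "Variables={password=your-app-password-here}"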
the next thing missing is all the packages needed for the scrape. requests and BeautifulSoup don't just live on the AWS cloud, so we need to bundle them with our project.
this one took a while to figure out as well.
here's my solution
Step 11:
I went to the directory where I had written my project locally and made a directory named packages
still in the project directory, I opened up my terminal and ran these commands
$ pip install -t packages requests
$ pip install -t packages beautifulsoup4
now I copied the code from the AWS code source, made a new file inside the 'packages' directory, and called it lambda_function.py
now we are ready.
inside the packages directory, I hit Ctrl+A to select everything and compressed it into a zip file.
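the terminal equivalent would be something like this, run from the project directory (the zip name is arbitrary):

$ cd packages
$ zip -r ../newsScraper.zip .

the important bit is that lambda_function.py and the installed packages sit at the root of the zip, not nested inside a packages/ folder.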
Back in the AWS Lambda console, there is an Upload from option (right above the editor). Upload this zip file there.
Step 12:
Now when I pressed the Test button, I got a message in my inbox
WOOHOO
but there was one last thing left.
I had to automate this task
So over to the Amazon EventBridge we go
here in the Rules submenu, I created a new rule
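if you want the same 6:55 am IST Monday/Thursday schedule as the local cron job, the rule's schedule expression would look something like this (EventBridge cron takes six fields and runs in UTC, so 6:55 IST becomes 1:25 UTC):

cron(25 1 ? * MON,THU *)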
now I came back to the AWS Lambda console and saw that the EventBridge trigger had been added to my function. Sweet sauce.
Time for the final step
Step 13:
Press Deploy.
A final note for the curious: the logs for all your functions can be seen in the Amazon CloudWatch console
aaaaand... that was it. I need to learn about Docker and stuff cuz I've heard uploading container images is much smoother, so maybe I'll spend some time going down that hole.
Also, hopefully reading the news will be as fun as creating this project was ✌️