Necessity is the mother of invention
Urvir
Posted on September 12, 2020
During this lockdown my mother came to me asking for her Gujarati newspaper. Since newspapers weren't allowed to be distributed in our residential complex, I tried to find an e-paper for her. The paper was available for free on the newspaper's website, but the catch was that every page was stored as a separate PDF file.
I am a big fan of Python's approach to solving practical problems and of its ever-growing list of modules and libraries for anything under the sun.
I took an experimental approach and combined different methods to get the result I needed.
I used BeautifulSoup to scrape the data and PyPDF2 to read and merge the files, with tips from Stack Overflow :)
Below is the code. It gets me a single merged PDF in a directory named with today's date. My mother is able to run it from her smartphone using Pydroid3.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from datetime import datetime
from PyPDF2 import PdfFileMerger
import requests
import os

# Work inside a directory named after today's date, e.g. 12-09-2020.
today = datetime.today().strftime('%d-%m-%Y')
if not os.path.exists(today):
    os.mkdir(today)
os.chdir(os.path.join(os.getcwd(), today))

# Fetch the e-paper index page and collect every link on it.
req = Request("http://www.newspapaersomething.com/frmEPShow.aspx")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
links = [link.get('href') for link in soup.findAll('a')]

# The first 15 anchors on the page are site navigation, not e-paper pages.
del links[0:15]

# Download each page PDF, keeping the file name from the end of the URL.
for url in links:
    r = requests.get(url, allow_redirects=True)
    name = url.split("/")[-1]
    with open(name, 'wb') as f:
        f.write(r.content)

def mergeIntoOnePDF(path):
    # Merge every page PDF in path into a single merged_full.pdf,
    # in sorted filename order so the pages stay in sequence.
    # Skip the output file itself in case the script is re-run.
    pdf_files = sorted(fileName for fileName in os.listdir(path)
                       if fileName.endswith('.pdf') and fileName != "merged_full.pdf")
    merger = PdfFileMerger()
    for filename in pdf_files:
        merger.append(os.path.join(path, filename))
    merger.write(os.path.join(path, "merged_full.pdf"))
    merger.close()

mergeIntoOnePDF(os.getcwd())
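One fragile spot worth flagging: the del links[0:15] slice assumes the first 15 anchors on the page are navigation links, so it breaks as soon as the site layout changes. Here is a small sketch of a more defensive filter, assuming the page anchors link directly to the per-page PDF files (the URL is the same placeholder as above):

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request("http://www.newspapaersomething.com/frmEPShow.aspx")
soup = BeautifulSoup(urlopen(req), "lxml")

# Keep only anchors whose href ends in .pdf, regardless of where
# they sit on the page.
links = [a["href"] for a in soup.find_all("a", href=True)
         if a["href"].lower().endswith(".pdf")]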
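A note for anyone picking this up later: the PyPDF2 project has since been renamed to pypdf, and the old PdfFileMerger/PdfFileReader classes are deprecated there. A minimal sketch of the same merge step against the newer API, assuming the same one-PDF-per-page directory layout:

from pypdf import PdfWriter
import os

writer = PdfWriter()
# PdfWriter.append() accepts a file path directly; exclude the
# output file so a re-run doesn't merge it into itself.
for filename in sorted(f for f in os.listdir(".")
                       if f.endswith(".pdf") and f != "merged_full.pdf"):
    writer.append(filename)
writer.write("merged_full.pdf")
writer.close()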