Wayne Smallman
Posted on September 1, 2020
While the Under Cloud has an extension for Google Chrome that allows us to save a selection of text from a web page, what’s been lacking is the option to automate the saving of the entire page. Obvious though it is, saving a web page is no trivial task, and it’s something I’ve been 7 parts preparing for, 2 parts avoiding, and 1 part dreading for ages!
Yet, here we are — at last — and the Under Cloud now supports the saving of web pages via Newspaper3k, a versatile package written in Python. I am stretching the definition of now, since I’m still running tests in the staging environment, but it’s almost complete and should be on production within the week.
The documentation for Newspaper is sparse, and code samples were (are) few and far between. Worse, I had no idea how I would make Python talk to Node — the API is the obvious choice here, but I had no understanding of Python, the types of data it supported, or how I would get that data out of it.
I’m writing this from the perspective of someone on the other side of the learning curve, having walked the long route to get here, but — given the time constraints I’m up against — would have preferred a path less cluttered with obstacles. So this article is from present me for the attention of past me.
Alternatives to Newspaper3k
There are powerful services out there, such as DiffBot, but these are cost-prohibitive at this stage in the life of the Under Cloud, and (to be honest, and in spite of what I said a few paragraphs ago) I'd prefer to figure these things out myself before delegating them, so I at least have a good technical understanding of what's going on. However, there are open-source alternatives, such as BeautifulSoup.
Newspaper3k versus BeautifulSoup
I imagine some are wondering why I chose Newspaper3k instead of BeautifulSoup:
- Newspaper appears to be focused on general-purpose page scraping;
- while BeautifulSoup, with its wealth of options for parsing the DOM, is geared more towards data science.
You need to know the specific parts of a web page to get the most from BeautifulSoup. I could be wrong, so I look forward to someone stepping in with more information!
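To make the difference concrete, here's a minimal sketch of the BeautifulSoup approach, assuming the content we're after lives in ordinary h1 and p elements (the selectors are hypothetical, and they're exactly the page-specific knowledge Newspaper spares us):

import requests
from bs4 import BeautifulSoup

# With BeautifulSoup, we have to tell it where to look in the DOM.
response = requests.get("https://www.bbc.co.uk/sport/football/53944598")
soup = BeautifulSoup(response.text, "html.parser")

title = soup.find("h1")
paragraphs = [p.get_text() for p in soup.find_all("p")]

print(title.get_text() if title else None)

Newspaper, by contrast, works out the title, authors, and body of an article for itself.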
Scraping a web page with Newspaper3k
I'm going to make a few assumptions:
- you have an understanding of both Vue and Node;
- and don’t need me to go through the whole process of installing and configuring either;
- or instantiating a new project;
- you have Python installed, along with the Newspaper3k package;
- I’ll be providing concise examples of the code, rather than the complete versions.
As an aside, I don’t like scraping as a description of what we’re doing here, given the horrible connotations attached to it. Please don’t use this article to create nefarious garbage for the purposes of plagiarising the work of others.
Python
Although the Under Cloud is written in JavaScript (or ECMAScript, to give it its standardised name), the first thing I had to do was learn some Python to create the script that would act as a bridge between the backend written in Node and Newspaper written in Python:
import os
import sys
import json
from datetime import datetime
from newspaper import Article

# Here, the `url` value should be something like: https://www.bbc.co.uk/sport/football/53944598
url = sys.argv[1]

template_for_exceptions = "An exception of type {0} occurred. Arguments:\n{1!r}"

def get_web_page(url):
    try:
        if url and len(url) > 0:
            article = Article(url, keep_article_html=True)
            article.download()
            article.parse()

            # json.dumps can't serialise a datetime, so the publication date
            # is either formatted as a string or passed through as a null.
            dataForBookmarkAsJSON = json.dumps({
                'publicationDate': article.publish_date if article.publish_date is None else article.publish_date.strftime("%Y-%m-%d %H:%M:%S"),
                'title': article.title,
                'note': article.article_html,
                'authors': article.authors
            })

            try:
                # Write the JSON object to stdout, where Node is listening.
                sys.stdout.write(dataForBookmarkAsJSON)
                sys.stdout.flush()
                os._exit(0)
            except Exception as ex:
                message_for_exception = template_for_exceptions.format(type(ex).__name__, ex.args)
                # Send errors to stderr, where Node listens for problems.
                print(message_for_exception, file=sys.stderr)
                sys.exit(1)
    except Exception as ex:
        message_for_exception = template_for_exceptions.format(type(ex).__name__, ex.args)
        print(message_for_exception, file=sys.stderr)
        sys.exit(1)

if __name__ == '__main__':
    get_web_page(url)
A few things to point out here, such as the article.publish_date value, which is either a datetime that I format as a string, or a null that I handle when populating the JSON object. Yes, I could have done that upstream in Node, but I took the moment to learn a few things about, and in, Python.
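That null check could also be pulled out into a helper for clarity. A sketch (format_publish_date is my name for it, not something Newspaper provides):

def format_publish_date(publish_date):
    # Newspaper3k's publish_date is either a datetime or None; json.dumps
    # can't serialise a datetime, so format it (or pass the null through).
    return publish_date.strftime("%Y-%m-%d %H:%M:%S") if publish_date else None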
Vue
At the frontend, I’m using a component with the following method:
getWebPage () {
  this.$axios.get(`/newspaper`, {
    params: {
      // Params.
    }
  })
  .then(function (response) {
    // Handle the response.
  })
  .catch(function (error) {
    // Handle the error.
  })
}
Node
At the backend, I have the route:
router.get('/newspaper', async (req, res) => {
  // Hand the query string parameters, including the URL, to the controller.
  const getNewspaper = await controllerNewspaper.getWebPage(req.query)

  res.json(getNewspaper)
})
… and in the controller, I have:
services.getWebPage = async (params) => {
  let { spawn } = require('child_process')

  // PYTHON_VERSION is the command that invokes Python (python3, for example),
  // and PYTHON_PATH is the directory containing the script.
  let processForPython = spawn(process.env.PYTHON_VERSION, [
    `${process.env.PYTHON_PATH}/get_web_page.py`,
    params.url
  ])

  let dataForBookmarkStream = []

  return new Promise((resolve, reject) => {
    // stdout arrives in chunks, so gather them up for joining later.
    processForPython.stdout.on('data', (response) => {
      dataForBookmarkStream.push(response)
    })

    processForPython.stderr.on('data', (error) => {
      reject({
        error: `An error occurred while attempting to parse the web page: ${error.toString()}`
      })
    })

    processForPython.on('exit', (code) => {
      switch (code) {
        case 0:
          if (dataForBookmarkStream && dataForBookmarkStream.length > 0) {
            let dataForBookmark

            try {
              // Stitch the chunks back together into a single string; join()
              // would insert commas between the chunks and corrupt the JSON.
              dataForBookmark = JSON.parse(Buffer.concat(dataForBookmarkStream).toString())
            } catch (exception) {
              return reject({
                error: "JSON object supplied by Newspaper is invalid."
              })
            }

            if (typeof dataForBookmark === 'object') {
              try {
                const paramsForBookmark = new URLSearchParams()

                paramsForBookmark.append('userID', params.userID)
                // Additional parameters, using dataForBookmark...

                instanceOfAxios.post('/assets', paramsForBookmark)
                  .then(function (response) {
                    resolve(response)
                  })
                  .catch(function (error) {
                    reject(error)
                  })
              } catch (exception) {
                reject({
                  error: "An error occurred while attempting to save the web page."
                })
              }
            }
          } else {
            reject({
              error: "Newspaper returned an empty response."
            })
          }
          break
        case 1:
          reject({
            error: "Web page couldn't be saved."
          })
          break
      }
    })
  }).catch(error => {
    return {
      error: "Web page couldn't be saved."
    }
  })
}
Yeah, it’s a lot to take in, so let’s look at some specifics…
First, figure out how Python is invoked on your machine (python3, for example) and create an equivalent environment variable for process.env.PYTHON_VERSION.
Second, figure out the path to the directory containing the Python script and create an equivalent environment variable for process.env.PYTHON_PATH.
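If you're unsure what to use for the former, Python itself can tell you. A quick check from the interpreter:

import sys

# Prints the full path of the running interpreter, e.g. /usr/local/bin/python3,
# which is a candidate value for PYTHON_VERSION.
print(sys.executable)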
Then, a word on maxBuffer: it's an option for exec and execFile rather than spawn, which streams its output instead. I did attempt a version of the code that relied on maxBuffer alone, but some web pages were too big, at which point the JSON object failed to parse and then everything went to crap. Streaming the output in chunks, as the code above does, sidesteps the problem.
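Before wiring everything into Node, it's worth proving the bridge script in isolation. A minimal smoke test, assuming get_web_page.py sits in the current directory:

import json
import subprocess
import sys

# Run the bridge script exactly as Node will, and confirm that its stdout
# parses as JSON.
result = subprocess.run(
    [sys.executable, "get_web_page.py", "https://www.bbc.co.uk/sport/football/53944598"],
    capture_output=True,
    text=True
)

bookmark = json.loads(result.stdout)
print(bookmark["title"])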
Once the Python script is called, it begins to stream the JSON object to processForPython.stdout.on('data'), which I'm grabbing in chunks via the dataForBookmarkStream variable.
Assuming the process was a success, we hit the switch block in processForPython.on('exit') and exit when the code is 0. Here's where we convert the buffered chunks in dataForBookmarkStream into something useful, using:
dataForBookmark = JSON.parse(Buffer.concat(dataForBookmarkStream).toString())
… before sending the data via the API to somewhere else in the application. Buffer.concat stitches the chunks back together byte for byte, whereas joining the array would insert commas between the chunks and corrupt the JSON.
Do we have some Node and Python people shaking their collective heads wearing an avuncular expression with a hint of disappointment? If so, share and let’s learn what could be improved!
Our brains aren't hard drives, and how we remember things and make connections between them is personal — the Under Cloud is the missing link in the evolution of doing research.