Creating a searchable knowledge-base from several PDF files
Brian Onang'o
Posted on March 1, 2022
Part 1 - Exploring the Power of Bash
In the current project we are going to work with several isolated PDF files, converting them into a searchable knowledge base. We are going to see how to:
- Download a file from a website using a script.
- Scrape an HTML table for links.
- Batch download several files in parallel without running out of memory.
- Combine PDF files using a script.
- Batch rename files.
- Create a website from PDF files.
Before we begin, let us give a little background for the project. We have quarterly Bible study guides that have faithfully been produced every year since 1886. All of these have been digitized and are available for all years since 1888 (also available here). Like in all areas, the faithful student is he that compares the current studies with what has been done in the past. This comparison will help in understanding the subject under consideration more deeply and in seeing if there are any errors that have been introduced in the new, or if there were errors in the old. And Christ said that "every scribe instructed concerning the kingdom of heaven is like a householder who brings out of his treasure things new and old."
To be able to make use of the old guides, the student currently has to go to the index of titles, check for the guides relevant to his subject of study, download each of the relevant PDF files individually, and then read through them one by one. It would be far more desirable to have a searchable knowledge base containing all these guides, which the student can query for whatever he needs at once, without having to manage 500+ individual files. It is such a system that we will be building.
In the first step we are going to:
1. Download a pdf file from a website using a script
The first lesson available is found at the following link: http://documents.adventistarchives.org/SSQ/SS18880101-01.pdf
Downloading this single file is pretty straightforward.
wget http://documents.adventistarchives.org/SSQ/SS18880101-01.pdf
But there are over 500 files and we cannot possibly copy and paste the links one by one from the site and download them individually, so we are going to:
2. Scrape the page for links to PDF files
All the PDF files we need live in the directory served at `http://documents.adventistarchives.org/SSQ/`. But trying to access that directory does not give a listing of the files in it; it redirects to a different page instead. If the directory listing were available we could simply use recursive `wget` to download all those files. Since it is not, we have to use some other method.
We already know that all the files we are looking for have `http://documents.adventistarchives.org/SSQ/` in their URL, so that is going to form the basis for our pattern. The final command to extract both the URL of each PDF file and the lesson title for that PDF is:
echo -n '' > files.txt && curl -s https://www.adventistarchives.org/sabbathschoollessons | grep -o 'http://documents.adventistarchives.org/SSQ/[^.]*.pdf\"[^\>]*[^\<]*' | while read line ; do echo "$line" | sed -e 's/\"[^\>]*./ /' | sed -e 's/ \+/ /g' >> files.txt; done
`grep -o` prints only the matching parts of each line. The result is saved in `files.txt`; each line contains the PDF URL followed by the lesson title for that PDF.
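For example, using the first lesson's URL from earlier (the title shown is just a placeholder), a line looks roughly like this:

```
http://documents.adventistarchives.org/SSQ/SS18880101-01.pdf Lesson title for the first quarter of 1888
```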
To see the number of files that we will be dealing with, we can run `wc -l files.txt`. There are 516 entries in total.
3. Downloading all 510 files
In reality there are a little fewer than 516 files, because the list contains duplicates: it has an entry for every quarter, while some PDF files cover two quarters. To remove the duplicated URLs, we use sort:
sort -u -o files.txt files.txt
To download the files, we need to get the URL for each file from `files.txt`. We can do this with:
cat files.txt |while read line ; do echo "$line" | cut -f 1 -d " " |xargs wget; done
But this downloads one file at a time and is therefore very slow and inefficient.
To download the files in parallel, we use wget with the `-b` option so that it forks itself into the background for each download. We extract the URLs into their own file and sort it with `-u` to remove the duplicated URLs; sorting `files.txt` itself does not remove them, because the duplicated URLs carry different lesson titles and so the full lines are not duplicates:
cut -d\ -f1 files.txt > urls && sort -u -o urls urls && cat urls |while read line ; do echo "$line" | cut -f 1 -d " " |xargs wget -b; done
We are probably very lucky to download all 510 files in about 5 seconds without running out of our 1 GB of memory. It would be interesting to know how this would perform in Node.
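If the fork-per-file approach ever did exhaust memory, a bounded alternative would be to cap the number of concurrent downloads. A minimal sketch using `xargs -P` (the concurrency of 8 is an arbitrary choice, not something measured in this project):

```bash
# Download at most 8 files at a time instead of forking one backgrounded wget per URL
cut -d' ' -f1 files.txt | sort -u | xargs -P 8 -n 1 wget -q
```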
4. Combine 510 pdfs into 1
The first obvious step in making our knowledge base is probably combining all the PDFs into one. This will result in a single big PDF (over 100 MB in size) which, provided it can be opened, is much more useful for study than 510 separate PDFs. For this we are going to use `ghostscript`:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=allLessons.pdf *.pdf
Ghostscript quickly runs us out of memory. A better alternative for this task could be `pdfunite`:
pdfunite *.pdf allLessons.pdf
`pdfunite` gets us a bigger file, but it also runs us out of memory. Using NitroPro on Windows, on a system with 8 GB of RAM, gives us a file of 708.4 MB. It would be interesting to see whether this is also possible on a single-core, 1 GB RAM system like the Linux one we are trying to force to work for us.
Let's try pdfunite on 10 files at a time. We will use GNU `parallel` and let it do what it can to manage the memory for us. This process takes about 20 minutes and outputs the files 1.pdf ... 51.pdf, which we need to combine further.
ls -lha ./ |grep SS.*pdf | grep -o '[^ ]*$' |parallel -N10 pdfunite {} {#}.pdf
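In GNU parallel, `{}` expands to the (up to) ten file names in each batch and `{#}` to the job's sequence number, which is where the names 1.pdf ... 51.pdf come from. As a quick sanity check on the number of chunks produced (this assumes no other digit-named PDFs sit in the directory):

```bash
# Expect roughly 51 chunk files: about 510 inputs merged 10 at a time
ls [0-9]*.pdf | wc -l
```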
5. Creating a site where the pdfs can be downloaded
We are going to do this using GitHub Pages. First we create a CNAME record in our DNS pointing to GitHub Pages:
| CNAME | TTL | Points to |
|---|---|---|
| sslpdfs | 900 | gospelsounders.github.io. |
Then we add a CNAME file with the contents `sslpdfs.gospelsounders.org`, commit and push the repo to GitHub, and configure the GitHub Pages settings in the GitHub repo.
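A minimal sketch of creating and committing that file (the commit message here is only illustrative):

```bash
# GitHub Pages reads the custom domain from a CNAME file at the repository root
echo "sslpdfs.gospelsounders.org" > CNAME
git add CNAME
git commit -m "Add custom domain for GitHub Pages"
```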
GIT_SSH_COMMAND='ssh -i /tmp/gitKey -o IdentitiesOnly=yes' git push origin master
Next we should create the index of the files that we have and put it in README.md. The following bash command does this for us:
curl -s https://www.adventistarchives.org/sabbathschoollessons |grep -o ^[\<]td.* |sed -e 's/<td[^\>]*>//g' | sed -e 's/<sup>.*<\/sup>//g' | sed -e 's/<\/td>//g' |sed -e 's/\<a.*href="[^h][^t][^t][^p][^:][^>]*>.*<\/a><//' | sed -e 's/<a[^>]*><\/a>//g' |sed -e 's/<[^a][^>]*>//g' |grep -A 2 '^[0-9]\{4\}' | grep -v -- "^--$" |parallel -N3 echo {} | sed -e 's/\(<a[^<]*\)\(<.*\)/\1/g'| sed -e 's/<a.* href="\([^"]*\)"[^>]*>\(.*\)/[\2](\1)/' | sed -e 's/ */ /g' | sed -e 's/http:\/\/documents.adventistarchives.org\/SSQ\///' | sed -e 's/ \]/\]/g'
It produces a list in the following format:
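The sample below is only illustrative: the title is a placeholder and only the URL of the first lesson from earlier is real, but the shape of each line (year, quarter, markdown link) follows from the sed patterns above.

```
1888 1st [Lesson title](SS18880101-01.pdf)
...
```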
But we would desire to have all the quarters of a year in a single row in a table having the following header:
| Year | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4 |
|---|---|---|---|---|
With the header already in the markdown file (README.md), the following command creates the table rows for us and appends them to the markdown file.
curl -s https://www.adventistarchives.org/sabbathschoollessons |grep -o ^[\<]td.* |sed -e 's/<td[^\>]*>//g' | sed -e 's/<sup>.*<\/sup>//g' | sed -e 's/<\/td>//g' |sed -e 's/\<a.*href="[^h][^t][^t][^p][^:][^>]*>.*<\/a><//' | sed -e 's/<a[^>]*><\/a>//g' |sed -e 's/<[^a][^>]*>//g' |grep -A 2 '^[0-9]\{4\}' | grep -v -- "^--$" |parallel -N3 echo {} | sed -e 's/\(<a[^<]*\)\(<.*\)/\1/g'| sed -e 's/<a.* href="\([^"]*\)"[^>]*>\(.*\)/[\2](\1)/' | sed -e 's/ */ /g' | sed -e 's/http:\/\/documents.adventistarchives.org\/SSQ\///' | sed -e 's/ \]/\]/g' | parallel -N4 echo {} | sed -e 's/ [0-9]\{4\} [1-4][a-z]\{2\} / \|/g' |sed -e 's/ 1st /\| /' >> README.md
6. Continue combining the 51 files
We first have to add a leading zero to the first 9 files so that they sort correctly alongside 10.pdf ... 51.pdf, then run pdfunite again:
rename.ul "" 0 ?.pdf
ls -lha ./ |grep " [0-9][0-9]\.pdf" | grep -o '[^ ]*$' |parallel -N10 pdfunite {} {#}-2.pdf
ls -lha ./ |grep " [0-9]-2\.pdf" | grep -o '[^ ]*$' |parallel -N6 pdfunite {} allLessons.pdf
The last step still breaks because of limited memory. So we will upload the file that we created earlier with NitroPro to our server instead.
scp All\ Lessons.pdf user@server:/tmp/pdfs/allLessons.pdf
`allLessons.pdf` is larger than 100 MB (708.4 MB), so we can't push it to GitHub. We need to find a way of reducing its size, so let's try removing the images from the PDFs and see the result.
gs -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE -o allLessonsSmall.pdf "allLessons.pdf"
We are almost done, as this gives us a file that is 68.5 MB. But the text is now transparent, appearing as white on a white background.
Using Evince, the default PDF viewer on Ubuntu, the transparent-text PDF can be read by selecting all the text with CTRL+A, which makes it visible. But this trick does not seem to work in all PDF viewers, including Foxit Reader and the Adobe PDF readers.
We will therefore convert the transparent-text pdf to html and change the color in the generated css file.
For this exercise we have chosen to use a DigitalOcean trial account to launch a 4-core, 8 GB RAM, 160 GB SSD server. The following command executes unbelievably fast (2m8.701s):
time pdf2htmlEX --split-pages 1 --process-nontext 0 --process-outline 0 --process-annotation 0 --process-form 0 --embed-css 0 --embed-font 0 --embed-image 0 --embed-javascript 0 --embed-outline 0 --dest-dir html allLessonsSmall.pdf
The result is 26,672 files with a total size of 223 MB, as you can check by running `ls html | wc -l` and `du -h html`. All the text is still set to be transparent, as seen in the generated CSS file.
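The relevant rule looks roughly like this (a reconstruction: the `.fc0` class name and the `transparent` value are inferred from the sed fix in step 11 below, not copied from the actual output):

```css
/* Reconstructed pdf2htmlEX fill-colour rule; the real file contains many more classes */
.fc0 { color: transparent; }
```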
This is changed to black as follows:
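Presumably with a substitution along these lines (the same one used in step 11; the `html/*.css` path is an assumption about where pdf2htmlEX wrote the stylesheet):

```bash
# Turn the transparent fill colour black in the generated stylesheet(s)
sed -i 's/color:transparent/color:black/' html/*.css
```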
7. Converting all the pdfs to html
We will now convert the PDFs to HTML files, which will allow us to search their contents within GitHub. This is also an easy task using our free-trial 4-core server:
ls -lha ./ |grep SS.*pdf | grep -o '[^ ]*$' |parallel -N10 echo {} | sed 's/ /\n/g' | parallel -N1 pdf2htmlEX --split-pages 1 --dest-dir htmls {}
8. Adding to git, committing and pushing to GitHub
For some reason, these commands add all the files at once instead of only the batch selected by head and tail:
ls ./|head -n $((1*18135/5))|tail -n $((18135/5))|xargs git add
for file in $(ls ./|head -n $((1*18135/5))|tail -n $((18135/5)) ); do git add $file ; done
So we move the htmls folder out of our repository, recreate it, copy the files back in batches, add each batch to git, and commit. We run the following command 5 times, incrementing the multiplier (the 1 in this command) each time.
for file in $(ls ../htmls|head -n $((1*18135/5))|tail -n $((18135/5)) ); do cp "../htmls/$file" htmls/ ; done && git add . && git commit -m "added batch of files"
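A sketch that automates those five runs instead of editing the multiplier by hand (it assumes the same file count and batch size as the command above):

```bash
# Copy, add and commit the HTML files in five batches of 18135/5 = 3627 files each
total=18135
for i in 1 2 3 4 5; do
  for file in $(ls ../htmls | head -n $((i*total/5)) | tail -n $((total/5))); do
    cp "../htmls/$file" htmls/
  done
  git add . && git commit -m "added batch $i of files"
done
```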
9. Editing the index to add links to html files
sed -i '/SS.*pdf/s/(\([^)]*\))/(htmls\/\1.html) \\| [⇩](\1)/g' README.md
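For a README line containing a link such as `[Lesson title](SS18880101-01.pdf)` (the title is a placeholder), this rewrites the entry into a link to the HTML version plus a download arrow for the original PDF:

```
Before: [Lesson title](SS18880101-01.pdf)
After:  [Lesson title](htmls/SS18880101-01.pdf.html) \| [⇩](SS18880101-01.pdf)
```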
10. Converting all the pdfs to text
In the docs folder, run:
ls -lha ./ |grep SS.*pdf | grep -o '[^ ]*$' |parallel -N10 echo {} | sed 's/ /\n/g' | parallel -N1 pdftotext -layout {}
ls -lha ./ |grep "[0-9].txt" | grep -o '[^ ]*$' |parallel -N10 echo {} | sed 's/ /\n/g' | parallel -N1 mv {} texts/
11. Remove transparency from htmls
The HTML files contain transparent text which we wish to change to black. In docs/htmls, run:
sed -i '/fc0{color:transparent;/s/transparent/black/' *.html
sed -i '/fc0{color:transparent;/s/transparent/black/' *.css
12. Add text files to index
sed -i '/SS.*pdf/s/(\([^(]*\).pdf)/(\1.pdf) \\| [txt](texts\/\1.txt)/g' README.md