CCONLEY-FI
Posted on February 17, 2024
To get more out of an Android device, I put together a short scriptlet for web scraping. For finding specific file types such as PDFs, EPUBs, or JPGs, the combination of Javascriptlets (bookmarklet-style snippets run from the browser), Termux, and browser extensions works well. This guide walks through setting up the necessary tools and writing scripts that automate the search for these file types directly from an Android device, with practical examples along the way.
Initial Setup with Termux
Termux is the backbone of this setup, providing a Linux environment on Android. After installing Termux from F-Droid (the source the Termux project recommends) or from the Google Play Store, the following commands prepare the environment for scripting:
pkg update && pkg upgrade
pkg install python
pkg install git
pip install requests beautifulsoup4
These steps ensure that the Termux environment is ready for advanced operations, including web scraping tasks.
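Before moving on, it can help to confirm that Python can see the packages the scraping script in this guide imports. The check below is only a minimal sketch: the helper name `missing_modules` is hypothetical, and the module list assumes the `requests` and `bs4` imports used later on.

```python
import importlib.util

def missing_modules(names):
    """Return the names from `names` that Python cannot locate."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]

# Module names assumed from the scraping script later in this guide.
missing = missing_modules(["requests", "bs4"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported missing, running `pip install requests beautifulsoup4` inside Termux should install both.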
Enhancing Capabilities with Browser Extensions
To support the web scraping process, install browser extensions on a browser that allows them on Android, such as Kiwi or Firefox (Fenix). An extension like Tampermonkey or Mobile Dev Tools lets the user manage and run Javascriptlets easily, which makes it practical to automate web tasks directly from the browser.
Crafting Javascriptlets for File Search
Javascriptlets can be designed to start searches for specific file types across the web. Here's a concise script that finds PDFs using Google's filetype: search operator:
javascript:(function() {
  var query = encodeURIComponent('filetype:pdf');
  var url = `https://www.google.com/search?q=${query}`;
  window.open(url);
})();
Adapting this script to search for EPUBs or JPGs is as straightforward as changing filetype:pdf to filetype:epub or filetype:jpg in the script.
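The same URL construction can be sketched in Python, which is handy once the work moves into Termux. This is an illustration rather than part of the original Javascriptlet: the function name `build_search_url` and the optional `keywords` parameter are assumptions added here, and the script only builds the URL without fetching anything.

```python
from urllib.parse import urlencode

def build_search_url(file_type, keywords=""):
    """Build a Google search URL restricted to one file type."""
    # Combine the optional keywords with the filetype: operator.
    query = f"{keywords} filetype:{file_type}".strip()
    # urlencode percent-encodes the query (":" becomes %3A, spaces become +).
    return "https://www.google.com/search?" + urlencode({"q": query})

# Print one search URL per file type mentioned in this guide.
for ext in ("pdf", "epub", "jpg"):
    print(build_search_url(ext, "android termux"))
```

Passing keywords narrows the search to a topic; leaving them out reproduces the bare filetype: query from the Javascriptlet.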
Advanced Web Scraping with Termux
For more nuanced scraping tasks, such as parsing search results to extract specific URLs or directly downloading files, Python scripts executed within Termux are exceptionally useful. Tools such as Beautiful Soup can parse HTML content to find and list downloadable links. Here's an example script that searches for downloadable PDF links on a webpage:
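import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def find_downloads(url):
    # Fetch the page; the timeout keeps a stalled request from hanging the script.
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, 'html.parser')
    # Only consider <a> tags that actually carry an href attribute.
    links = soup.find_all('a', href=True)
    for link in links:
        if link['href'].lower().endswith('.pdf'):
            # Resolve relative links against the page URL before printing.
            print(urljoin(url, link['href']))

if __name__ == "__main__":
    target_url = 'https://example.com'
    find_downloads(target_url)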
This script could be easily modified to search for .epub or .jpg files by replacing the '.pdf' argument of .endswith('.pdf') with the desired file extension.
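Rather than editing the extension by hand each time, the filter can accept several extensions at once. The sketch below swaps Beautiful Soup for Python's built-in `html.parser` so it runs without extra packages; the names `LinkCollector` and `filter_links` and the sample HTML are illustrative assumptions, not part of the script above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def filter_links(html, base_url, extensions=(".pdf", ".epub", ".jpg")):
    """Return absolute links whose path ends with one of `extensions`."""
    collector = LinkCollector()
    collector.feed(html)
    matches = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        # Compare only the path so a query string like ?v=2 does not hide the extension.
        path = urlparse(absolute).path.lower()
        if path.endswith(tuple(extensions)):
            matches.append(absolute)
    return matches

# Made-up HTML used only to demonstrate the filter.
sample = ('<a href="/docs/Guide.PDF?v=2">Guide</a> '
          '<a href="about.html">About</a> '
          '<a href="https://example.com/img/photo.jpg">Photo</a>')
print(filter_links(sample, "https://example.com/"))
```

Matching on the lowercased URL path means links such as Guide.PDF or file.pdf?v=2 are still caught, which a plain endswith check on the raw href would miss.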
Automating and Scheduling with Termux
To run scripts repeatedly for ongoing data collection, Termux supports scheduling through cron jobs. In Termux, cron is typically provided by the cronie package (pkg install cronie), and the crond daemon has to be running for jobs to fire. Cron then runs scripts at the specified intervals, so data collection continues without manual intervention:
echo "0 * * * * python /path/to/find_downloads.py" | crontab -
This command sets the find_downloads.py script to run at the top of every hour. Note that piping into crontab - replaces any existing crontab, so include existing entries in the input if they should be kept.
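When a script runs every hour, it will usually rediscover the same links on each pass. One way to handle that, sketched here under the assumption that results are kept in a plain text file with one URL per line, is to record only links that have not been seen before. The function name `record_new_links` and the log file layout are hypothetical.

```python
from pathlib import Path

def record_new_links(links, log_path):
    """Append links not already in the log file and return the new ones."""
    log_file = Path(log_path)
    seen = set()
    if log_file.exists():
        # Each line of the log holds one previously recorded URL.
        seen = set(log_file.read_text(encoding="utf-8").splitlines())
    new_links = []
    for link in links:
        if link not in seen:
            seen.add(link)
            new_links.append(link)
    if new_links:
        # Append so that earlier runs' results are preserved.
        with log_file.open("a", encoding="utf-8") as handle:
            for link in new_links:
                handle.write(link + "\n")
    return new_links
```

A scheduled script could pass the links it finds to this function and act only on the returned list, for example by downloading just the new files.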
Conclusion
By combining Javascriptlets for starting web searches with Termux for scripting and scheduling, users can automate the search for and collection of specific file types like PDFs, EPUBs, and JPGs on their Android devices. This approach makes targeted data collection more accessible and widens the range of projects that can be carried out directly from a mobile device. Use your own creativity to develop other use cases. Keep in mind that you should always respect a site's terms of use and any applicable laws.