Multi-language Web Novel Scraper in Python

reinaldoassis

Reinaldo Assis

Posted on May 24, 2024

Introduction

TL;DR Today we will code a web scraper that scrapes multiple web novel sources, downloads chapters, and combines them into a .epub file.

I like reading a web novel called Overgeared, which has more than 2000 chapters! I used to rely on a website that downloaded the chapters for me and combined them into a .epub file that I could send to my Kindle. Unfortunately, the site went down and I was left with no choice but to code my own solution.

Overview: Here's the breakdown of the process we have to follow to achieve our goal:

  1. Find sources
  2. Identify ways to scrape each source
  3. Write a Scraper Interface
  4. Write a subclass for each source and implement the methods
  5. Combine it all together and be happy reading my lil novel

Finding Sources

Since I'm currently improving my French and studying Italian, I'd like to have sources in languages other than English. Fortunately, Overgeared has translations in multiple languages, so this step wasn't hard. Here are the sources I'll be using:

Inspecting the structure of each source

Let's start by taking a look at the French source, since it poses an interesting challenge:

<ul class="lcp_catlist" id="lcp_instance_0">
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-16/">OG Chapitre 16</a></li>
   <li><a href="https://xiaowaz.fr/articles/overgeared-chapitre-1/">Overgeared Chapitre 1</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-2/">OG Chapitre 2</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-3/">OG Chapitre 3</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-4/">OG Chapitre 4</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-5/">OG Chapitre 5</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-6/">OG Chapitre 6</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-7/">OG Chapitre 7</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-8/">OG Chapitre 8</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-9/">OG Chapitre 9</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-10/">OG Chapitre 10</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-11/">OG Chapitre 11</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-12/">OG Chapitre 12</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-13/">OG Chapitre 13</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-14-rattrapage/">OG Chapitre 14 [Rattrapage]</a></li>
</ul>


If you look at the links for each chapter, you will see that some of them differ, which means there's no pattern we can iterate on such as ".../og-chapitre-{chapter number we want}" (this link would work for some chapters but not all of them). So we have to come up with a strategy. There are several options, but I think the most straightforward one is to divide the scraping process into two steps:

  1. First, we request the web novel's home page and extract all of the available chapter links into an array.
  2. Second, we iterate over that array of links, extracting the text of each chapter and combining everything into a book.
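
To make step 1 concrete, here is a minimal sketch of how the chapter links could be collected from the list shown earlier. It is not my actual implementation: it uses the standard library's HTMLParser instead of BeautifulSoup so it runs without extra dependencies, and ChapterLinkParser and index_by_number are names made up for this example. search_available_chapters is the method name mentioned later in this post, but this body is only a guess at what it could do.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

# A shortened copy of the French chapter list shown above.
SAMPLE_INDEX_HTML = """
<ul class="lcp_catlist" id="lcp_instance_0">
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-16/">OG Chapitre 16</a></li>
   <li><a href="https://xiaowaz.fr/articles/overgeared-chapitre-1/">Overgeared Chapitre 1</a></li>
   <li><a href="https://xiaowaz.fr/articles/og-chapitre-14-rattrapage/">OG Chapitre 14 [Rattrapage]</a></li>
</ul>
"""


class ChapterLinkParser(HTMLParser):
    """Collects (title, url) pairs for every <a> inside <ul class="lcp_catlist">."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._in_list = False
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and "lcp_catlist" in (attrs.get("class") or "").split():
            self._in_list = True
        elif tag == "a" and self._in_list:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a link we are currently reading.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "ul":
            self._in_list = False


def search_available_chapters(index_html: str) -> List[Tuple[str, str]]:
    """Step 1: extract every chapter link from the index page's HTML."""
    parser = ChapterLinkParser()
    parser.feed(index_html)
    return parser.links


def index_by_number(links: List[Tuple[str, str]]) -> Dict[int, str]:
    """Map chapter number -> url, reading the number from the link title."""
    by_number: Dict[int, str] = {}
    for title, url in links:
        match = re.search(r"Chapitre\s+(\d+)", title)
        if match:
            by_number[int(match.group(1))] = url
    return by_number
```

Since the chapter number is read from the link title rather than from the URL, irregular links such as ".../og-chapitre-14-rattrapage/" still end up under chapter 14. Step 2 is then a plain loop over the numbers we want.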

Generality

Another fun aspect we have to take into consideration is generality. In practice, it is nearly impossible to write a single piece of code that can scrape any page we want. Instead, it's easier to write an interface class and implement the scraping methods in a subclass for each source. For example, the French and English scrapers would both have the same scraper method, but with different implementations.
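
As a rough illustration of that idea (a simplified sketch, not my real classes), Python's abc module can express the interface. NovelScraper, EnglishScraper and FrenchScraper are hypothetical names, the scraper bodies return placeholder strings instead of downloading anything, and treating the end of the range as inclusive is an assumption.

```python
from abc import ABC, abstractmethod
from typing import List


class NovelScraper(ABC):
    """Hypothetical base class: one subclass per source/language."""

    language: str = ""

    @abstractmethod
    def scraper(self, chapter_number: str) -> str:
        """Return the text of one chapter. Each source implements this itself."""

    def get_multiple_chapters(self, start: int, end: int) -> List[str]:
        # Shared logic lives in the base class and relies on the per-source
        # scraper(). The range is treated as inclusive (an assumption).
        return [self.scraper(str(n)) for n in range(start, end + 1)]


class EnglishScraper(NovelScraper):
    language = "en"

    def scraper(self, chapter_number: str) -> str:
        # A real implementation would request and parse the page here.
        return f"[en] chapter {chapter_number}"


class FrenchScraper(NovelScraper):
    language = "fr"

    def scraper(self, chapter_number: str) -> str:
        # Placeholder as well; the French site needs a different strategy.
        return f"[fr] chapitre {chapter_number}"
```

Calling code only needs to know about NovelScraper. Trying to instantiate the base class directly raises a TypeError, which is a handy reminder that a new source must implement scraper.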

The scraper Interface I wrote is quite big, but here are the main methods to get the idea across:


    def scraper(self, chapter_number: str, override_link: bool = False) -> chapter:
        """Responsible for requesting the html page and extracting the wanted text from it.

        :param str chapter_number: the number of the chapter to be requested and scraped.
        """
        pass

    def create_book_from_chapters(self, book_name: str, out: str, chapters: List[chapter]):
        """Turns an array of chapters into an ebook."""
        pass

    def get_multiple_chapters(self, start: int, end: int) -> List[chapter]:
        pass

    def get_multiple_chapters_from_list(self, chapters: List[chapter]) -> List[chapter]:
        """In case the links from the source are not standard, this function can be used in
        conjunction with search_available_chapters, the chapters list containing the links will be
        downloaded and stored in a new list that can be used with create_book_from_chapters function."""
        pass

  • The ideal flow is:
    • [user inputs the chapter range to download, e.g. 100-200]
    • [user selects a language, i.e. FR/EN/IT]
    • [the program calls start(range)]

The start(range: str) function is responsible for calling the functions that are needed. For example, if the scraper for the selected language can't predict the URL of each chapter, it should use search_available_chapters to compile a list of links to download chapters from.
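
Here is a small sketch of the decision start has to make. It is an illustration under assumptions, not my actual code: parse_range and plan_downloads are made-up helpers, accepting a single number such as "150" is an assumption, and "available" stands for a chapter number to URL mapping built from whatever search_available_chapters returns.

```python
from typing import Dict, List, Optional, Tuple


def parse_range(chapter_range: str) -> Tuple[int, int]:
    """Turn "100-200" into (100, 200); a single number "150" becomes (150, 150)."""
    first, _, last = chapter_range.partition("-")
    start = int(first)
    end = int(last) if last else start
    if start > end:
        raise ValueError(f"invalid range: {chapter_range!r}")
    return start, end


def plan_downloads(
    chapter_range: str,
    partial_link: Optional[str] = None,
    available: Optional[Dict[int, str]] = None,
) -> List[str]:
    """Return the URLs to download for the requested range.

    If the source has predictable URLs, pass partial_link and the chapter
    number is simply appended. Otherwise pass the number -> url mapping
    that was compiled from the source's index page.
    """
    start, end = parse_range(chapter_range)
    numbers = range(start, end + 1)
    if partial_link is not None:
        return [f"{partial_link}{n}" for n in numbers]
    if available is None:
        raise ValueError("need either partial_link or available")
    # Chapters missing from the index are skipped instead of guessed.
    return [available[n] for n in numbers if n in available]
```

Both branches produce the same kind of output, a list of URLs, so the download loop that follows does not need to care which kind of source it is dealing with.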

In short, for each language we want to add, we create a class that implements the methods of our interface. Here are some methods of the English implementation:

    def scraper(self, chapter_number: str) -> chapter:
        url = self._partial_link + chapter_number
        try:
            response = requests.get(url).text
        except requests.RequestException:
            click.echo("[ERROR] Error while getting request.")
            click.echo(f"[ERROR] Request {url} has failed.")
            raise self.FAILED_REQUEST

        soup = BeautifulSoup(response, 'html.parser')

        # The chapter body lives in <div id="chr-content">
        chr_content_div = soup.find('div', id='chr-content')

        # Stays empty if the div is missing
        text = ""

        if chr_content_div:
            # Re-wrap the plain text of each paragraph in <p> tags for the ebook
            for p in chr_content_div.find_all('p'):
                text += f"<p>{p.get_text()}</p>\n\n"

        # split() without arguments yields 0 words for an empty chapter
        ch = chapter(text, f"Chapter {chapter_number}", "en", url, len(text.split()), chapter_number)

        return ch
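
The chapter class itself isn't shown in this post. As a rough idea of what such a container could look like, here is a minimal sketch. The field order follows the constructor call above and title, word_count and number are attribute names used elsewhere in the post, but the other field names and the from_paragraphs helper are assumptions made for this example (the real class also carries an epub_info attribute, which is left out here).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chapter:
    """Hypothetical stand-in for the chapter class used by the scrapers."""

    text: str        # chapter body as HTML paragraphs
    title: str
    language: str
    link: str
    word_count: int
    number: str

    @classmethod
    def from_paragraphs(
        cls, paragraphs: List[str], number: str, language: str, link: str
    ) -> "Chapter":
        # Wrap each paragraph in <p> tags, like the scraper above does.
        text = "".join(f"<p>{p}</p>\n\n" for p in paragraphs)
        # Count words on the plain paragraphs, before any markup is added.
        word_count = sum(len(p.split()) for p in paragraphs)
        return cls(text, f"Chapter {number}", language, link, word_count, number)
```

Counting words per paragraph before adding markup keeps the page estimate independent of how the HTML is assembled, and an empty chapter correctly counts as zero words.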

    def create_book_from_chapters(self, book_name: str, out: str, chapters: List[chapter]):
        book = epub.EpubBook()
        book.set_identifier('overgeared')
        book.set_title('Overgeared Novel')
        book.set_language('en')
        book.add_author('Park Saenal')

        # Cover image (the with block makes sure the file gets closed)
        with open(self._cover_image_path, "rb") as cover_file:
            book.set_cover("cover.jpg", cover_file.read())

        total_words = sum(ch.word_count for ch in chapters)

        # Information page, with a page estimate of roughly 250 words per page
        info_page = epub.EpubHtml(title="Information", file_name="info.xhtml", lang='en')
        info_content = """
        <h1>Information</h1>
        <p>This is a collection of chapters compiled by me but all credits go to the author (Park Saenal) and translator (rainbowturtle).</p>
        <p>This book contains approximately {num_pages} pages.</p>
        """.format(num_pages=round(total_words/250))
        info_page.content = info_content
        book.add_item(info_page)
        book.toc.append((info_page, "Information"))

        for ch in chapters:
            book.add_item(ch.epub_info)
            book.toc.append((ch.epub_info, ch.title))

        # Define CSS
        style = 'body { font-family: Times, Times New Roman, serif; text-align: justify; }'
        nav_css = epub.EpubItem(uid="style_nav", file_name="style/nav.css", media_type="text/css", content=style)
        book.add_item(nav_css)

        book.add_item(epub.EpubNcx())
        book.add_item(epub.EpubNav())

        # Reading order: info page, table of contents, then the chapters
        book.spine = [info_page, 'nav', *(ch.epub_info for ch in chapters)]

        # chapters[-1] reads the last chapter without removing it from the list
        epub.write_epub(f'Overgeared {chapters[0].number} to {chapters[-1].number}.epub', book)

Combining it all together

All that is left is to combine everything we've made. My plan is to use this module as a sort of "plugin" for my personal CLI (where I have a bunch of tools I've coded for myself). As such, you can see two important functions here: extension_name (tells the main program this extension's name) and start (called when the extension is selected).

import click


def extension_name():
    return "Overgeared Ebook Novel"

def start():
    click.echo("Module: og.py")
    click.echo("Module Version: 0.0.1")
    click.echo("Created in: 14.05.24")
    click.echo("")
    # click.prompt already appends ": " to the prompt text
    lg = click.prompt("Language code")
    ch = click.prompt("Chapter to download")

    if lg == "en":
        en = OG_Novel_Downloader_EN()
        en.start(ch)
        # en.sanity_check(ch, verbose=True)

    elif lg == "fr":
        fr = OG_Novel_Downloader_FR()
        fr.start(ch)
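
If more languages get added (Italian is next on my list), the chain of if statements could be swapped for a lookup table. This is only a sketch of that alternative, not what my module does today: EnglishDownloader and FrenchDownloader are tiny stand-ins for OG_Novel_Downloader_EN and OG_Novel_Downloader_FR, and their start methods return a string instead of downloading anything.

```python
from typing import Dict


class EnglishDownloader:
    """Stand-in for OG_Novel_Downloader_EN (hypothetical, does no real work)."""

    def start(self, chapter_range: str) -> str:
        return f"en:{chapter_range}"


class FrenchDownloader:
    """Stand-in for OG_Novel_Downloader_FR (hypothetical, does no real work)."""

    def start(self, chapter_range: str) -> str:
        return f"fr:{chapter_range}"


# Adding a language becomes a one-line change here.
DOWNLOADERS: Dict[str, type] = {
    "en": EnglishDownloader,
    "fr": FrenchDownloader,
}


def run(language_code: str, chapter_range: str) -> str:
    """Pick the downloader for a language code and hand it the chapter range."""
    key = language_code.strip().lower()
    try:
        downloader_cls = DOWNLOADERS[key]
    except KeyError:
        supported = ", ".join(sorted(DOWNLOADERS))
        raise ValueError(
            f"unsupported language {language_code!r}; choose one of: {supported}"
        ) from None
    return downloader_cls().start(chapter_range)
```

A side benefit is that an unknown language code now produces an explicit error instead of silently doing nothing.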

Usage

Now all we have to do is enjoy reading our novel in multiple languages 🥳.

Running the scraper in the terminal

Conclusion

I still struggle a bit with the concept of interfaces and reusable code. In hindsight, I think my code could look a lot cleaner (maybe I'll refactor it?). Unfortunately, I won't be able to provide you guys with the source code this time: I don't have permission to scrape these sites, much less to share code that does. Still, I hope this post was of some help, and feel free to contact me if you have any suggestions or questions (my contacts can be found below ;).

That's it for today, thank you for reading so far!

About the Author

Computer Engineering Student making cool projects for fun (:

You can find more about me on my website (I’ll put the link here when available), TikTok, YouTube and Instagram.

