Organizing EBook Files with Python 🐍

Brian Via

Posted on October 13, 2022

If you want to jump to the code snippets, click here

Full python file available here on GitHub

The Problem

Over the years, I've accumulated many hundreds of ebooks. Some I bought as digital copies from places like Gumroad; others are free online, like Software Engineering at Google.

However, as I got busier, keeping a clean file structure that let me find what I was looking for became harder and harder. The result was a hodgepodge of files without naming conventions and directories upon directories labeled "unorganized". I tried to manually sift through hundreds of files, renaming each one to the format I wanted and placing it inside a single directory per book. It just became too much.

This was basically me when looking for anything:

[Image: Charlie's murder board from It's Always Sunny in Philadelphia]

What I Wanted the End to Look Like

Books Directory/
  title-author/
    title-author.epub
  title2-author2/
    title2-author2.pdf

Since I want to self-host a Calibre instance with all of my files, ingesting them via one directory per book seemed like the best system. It was also necessary for reading the files locally from my desktops via my NAS.

Why Python? 🐍

First and foremost, I decided to go with Python as the language of choice for this project for a couple of reasons.

While the project does involve renaming files, which could easily be done with Bash, the logic was going to be a little complicated in terms of passing data to and fro between functions, which means Bash would get a little tough to read IMO.

Python also has a rich ecosystem of ebook parsing libraries, and it handles things like file renaming, extensions, and environment variables fairly easily on Linux machines, which is what my NAS box runs. My first language is TypeScript/JavaScript, so I could've used something like Bash + Google's zx, but this felt like a good chance to get some experience with Python, which I'd never really used. Luckily, VS Code's IntelliSense (with some Python plugins) and Python's relatively simple syntax made it quite easy to get from A → B in terms of putting all the pieces together.

The Individual Pieces of The Book Sorting Program

This is how I broke down the individual parts of this sorting program:

  1. Gather all files from my unorganized directory
  2. Parse metadata from any ebooks in my library.
    1. EPUB files
    2. PDF files
  3. Organize books into their new location (<title-author>/<title-author>.<ext>)
  4. (Optional) Reading library paths for input, outputs and any issue files

Gathering All Unorganized Files

Grabbing all my files and putting them into one flat directory wasn't too bad. The path to that directory is my BOOKSORT_INPUT_PATH variable. Currently it's read from the environment, but it could be refactored to come from CLI args, or just hard-coded defaults.
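
As a rough sketch of the CLI-args refactor mentioned above, something like the following could work. The function name resolve_paths and the --input/--output/--issues flags are hypothetical and not part of the original script; only the environment variable names come from this post.

```python
import argparse
import os


def resolve_paths(argv=None):
    """Resolve the input/output/issues paths: CLI flag first, then env var."""
    parser = argparse.ArgumentParser(description="Sort ebooks into per-book folders")
    # Flag names are made up for this sketch; each one defaults to the
    # matching BOOKSORT_* environment variable (None if it is unset).
    parser.add_argument("--input", default=os.environ.get("BOOKSORT_INPUT_PATH"))
    parser.add_argument("--output", default=os.environ.get("BOOKSORT_OUTPUT_PATH"))
    parser.add_argument("--issues", default=os.environ.get("BOOKSORT_ISSUES_PATH"))
    args = parser.parse_args(argv)

    # Fail early with a readable message if a path was given in neither place.
    missing = [name for name in ("input", "output", "issues") if not getattr(args, name)]
    if missing:
        parser.error("missing path(s): " + ", ".join(missing))
    return args.input, args.output, args.issues
```

Calling resolve_paths() with no arguments reads sys.argv, so the script could be run with flags, with environment variables, or with a mix of both.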

import os

# Returns all .pdf and .epub files under a directory (searched recursively)
def getAllFiles(path: str):
    files = []
    for root, _dirs, names in os.walk(path):
        for name in names:
            if name.endswith((".pdf", ".epub")):
                files.append(os.path.join(root, name))
    print(files)
    return files

This chunk is hopefully straightforward. Given a directory, it walks the directory tree and adds every file path ending in .pdf or .epub to a list (Python's array type), then returns it. This gives us the files to look up metadata for, and then eventually sort.

The array will look like this:

['/full/path/to/book/book.epub','/full/path/to/book2/book2.pdf',...]

This array gets returned from the function, so we can iterate over the list of book files to sort.
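
One thing to be aware of: endswith(".pdf") is case-sensitive, so a file named Book.PDF would be skipped. A possible variation using pathlib that matches extensions case-insensitively is sketched below. The names get_all_files and BOOK_EXTENSIONS are my own and not from the original script.

```python
from pathlib import Path

# Extensions we treat as ebooks, compared in lowercase.
BOOK_EXTENSIONS = {".pdf", ".epub"}


def get_all_files(path):
    """Return sorted paths of all .pdf/.epub files under `path`, recursively."""
    return sorted(
        str(p)
        for p in Path(path).rglob("*")
        if p.is_file() and p.suffix.lower() in BOOK_EXTENSIONS
    )
```

Sorting the result is optional; it just makes runs easier to compare.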

Parsing Metadata

Parsing the metadata was relatively straightforward: find EPUB and PDF parsing libraries, wire them in, and grab the correct fields.

For epub files we're using epub_meta. Make sure you install it with pip install epub_meta or pip3 install epub_meta.

For pdf files we're using pdfx. This also needs an install with pip install pdfx or pip3 install pdfx.

For all the files in our array, we're going to pass them to their respective parsing functions like so:

for file in files:
    TitleAndAuthorString = ""
    if file.endswith(".epub"):
        TitleAndAuthorString = getEpubTitleAndAuthorPath(file)
    elif file.endswith(".pdf"):
        TitleAndAuthorString = getPdfTitleAndAuthorPath(file)

EPUB Files

import epub_meta

# Returns the title and author of an epub file in the format "Title - Author"
def getEpubTitleAndAuthorPath(filepath: str):
    try:
        print("INFO: Getting metadata for: " + filepath)
        data = epub_meta.get_epub_metadata(filepath)
        title = data['title'] or "Unknown"
        authors = ", ".join(data['authors']) or "Unknown"
        print("INFO: Got metadata for " + filepath + ": " + title + " - " + authors)
        return title + " - " + authors
    except epub_meta.EPubException as e:
        # The file could not be read as an EPUB; None tells the caller to set it aside
        print(e)
        return None

epub_meta lets us grab the metadata with this line:

data = epub_meta.get_epub_metadata(filepath)

and then specific fields like this:

  • title = data['title'] or "Unknown"
  • authors = ", ".join(data['authors']) or "Unknown" (in this case, we're joining with a comma in case there is more than one author)

Both of these will fall back to "Unknown" if the field comes back empty for some reason.
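
To make the fallback concrete, here is a small sketch that works on a plain dict standing in for the metadata, so it runs without epub_meta installed. The helper name title_and_authors and the sample values are made up for illustration. It relies on the fact that an empty string is falsy in Python, so `or "Unknown"` kicks in.

```python
def title_and_authors(data):
    """Build "Title - Author" from a metadata dict, falling back to "Unknown"."""
    title = data.get("title") or "Unknown"
    # `data.get("authors") or []` guards against a missing key or None,
    # either of which would otherwise make join() raise a TypeError.
    authors = ", ".join(data.get("authors") or []) or "Unknown"
    return title + " - " + authors


# Placeholder values, not real books.
print(title_and_authors({"title": "Some Title", "authors": ["A. Author", "B. Author"]}))
# prints: Some Title - A. Author, B. Author
print(title_and_authors({"title": "", "authors": []}))
# prints: Unknown - Unknown
```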

PDF Files

import pdfx

# Returns the title and author of a PDF file in the format "Title - Author"
def getPdfTitleAndAuthorPath(filepath: str):
    try:
        print("INFO: Getting metadata for: " + filepath)
        pdf = pdfx.PDFx(filepath)
        metadata = pdf.get_metadata()
        title = metadata.get("Title") or "Unknown"
        authors = metadata.get("Author") or "Unknown"
        print("INFO: Got metadata for " + filepath + ": " + title + " - " + authors)
        return title + " - " + authors
    except (
        pdfx.exceptions.PDFInvalidError,
        pdfx.exceptions.PDFExtractionError,
        pdfx.exceptions.FileNotFoundError,
    ) as e:
        # Return None and let the calling loop move the file to the issues folder.
        # Moving it here as well would make the loop's own os.rename fail,
        # because the file would already be gone by then.
        print(e)
        print("ERROR: Could not read metadata for " + filepath)
        return None

pdfx lets us read metadata in a similar fashion.

After creating the PDF object and parsing its metadata with these two lines:

pdf = pdfx.PDFx(filepath)
metadata = pdf.get_metadata()

we can read from the metadata with the .get(<fieldName>) method:

title = metadata.get("Title") or "Unknown"
authors = metadata.get("Author") or "Unknown"
# The Author field comes back from pdfx as a single string, so no need to join here.
...
return(title + " - " + authors)

We'll also create a function to return the file extension for proper renaming later.

# Returns the file extension of a file
def getFileExtension(file):
    return os.path.splitext(file)[1]
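
The snippets in this post also call getFileName(), whose definition isn't shown here. A minimal guess at what it does, assuming it simply returns the last path component, could look like this (shown next to getFileExtension so the block runs on its own):

```python
import os


# Returns the file extension of a file, including the leading dot
def getFileExtension(file):
    return os.path.splitext(file)[1]


# Hypothetical definition: assumed to return just the file name without its directory
def getFileName(file):
    return os.path.basename(file)


print(getFileExtension("/full/path/to/book/book.epub"))  # prints: .epub
print(getFileName("/full/path/to/book/book.epub"))       # prints: book.epub
```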

Organizing the Files to their Final Locations

Lastly, we do some os.makedirs and os.rename magic to move things around, creating the needed directories if they don't already exist.

# This runs inside the `for file in files:` loop
extension = getFileExtension(file) # grab this so we can rename easily.

if TitleAndAuthorString and "Unknown" not in TitleAndAuthorString:
    # My desired file output path is <BooksDir>/<Title> - <Author>/<Title> - <Author>.{pdf,epub,etc}
    if not os.path.exists(outputPath + "/" + TitleAndAuthorString):
        os.makedirs(outputPath + "/" + TitleAndAuthorString)
    print("SUCCESS: Moving " + TitleAndAuthorString)
    os.rename(file, outputPath + "/" + TitleAndAuthorString + "/" + TitleAndAuthorString + extension)
else:
    # There was an issue parsing the file, let's just move it to an `issues` folder to be manually looked at later
    print("WARN: Moving " + getFileName(file) + " to issues folder")
    os.rename(file, issuesPath + "/" + getFileName(file))
    continue

os.makedirs(...) creates the directory if needed

os.rename(...) takes the path of the existing file, followed by the destination path for the file. So in this case the destination is outputPath + "/" + TitleAndAuthorString + "/" + TitleAndAuthorString + extension
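
Two things can trip up this step: a title containing a character like "/" turns into an unintended path, and os.rename raises an OSError when the source and destination are on different filesystems. The sketch below is one possible way to guard against both. The helpers safe_name and move_book are hypothetical and not part of the original script, and I haven't confirmed that either case is what bites the script in practice.

```python
import os
import re
import shutil


def safe_name(name):
    """Replace characters that are invalid or awkward in file names with "_"."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()


def move_book(src, output_path, title_and_author):
    """Move `src` to <output_path>/<name>/<name><ext> and return the new path."""
    name = safe_name(title_and_author)
    extension = os.path.splitext(src)[1]
    book_dir = os.path.join(output_path, name)
    os.makedirs(book_dir, exist_ok=True)  # no separate os.path.exists() check needed
    destination = os.path.join(book_dir, name + extension)
    # shutil.move falls back to copy-and-delete when src and destination are
    # on different filesystems, a case where os.rename raises OSError.
    shutil.move(src, destination)
    return destination
```

The character list in safe_name is deliberately conservative; on Linux only "/" is strictly forbidden, but the others cause trouble once files are shared with Windows or macOS machines.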

Putting it All Together

def main():
    # os.environ.get returns None instead of raising KeyError when a variable is unset,
    # so the hard-coded defaults actually get used
    inputPath = os.environ.get("BOOKSORT_INPUT_PATH") or "/Users/bvia/Development/Personal/booksort/issues"
    outputPath = os.environ.get("BOOKSORT_OUTPUT_PATH") or "/Users/bvia/Development/Personal/booksort/outputs"
    issuesPath = os.environ.get("BOOKSORT_ISSUES_PATH") or "/Users/bvia/Development/Personal/booksort/issues"
    sort_books(inputPath, outputPath, issuesPath)

Give it a whirl with a call to main() at the end of the file and you're off.
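
The sort_books function that main() calls isn't shown in this post. Below is a rough sketch of how the pieces above might be glued together; it is my reconstruction, not the script's actual code. In particular, the extra parsers argument (a dict mapping an extension to a function returning "Title - Author" or None) is an assumption I added so the sketch runs without epub_meta or pdfx installed.

```python
import os


def sort_books(input_path, output_path, issues_path, parsers):
    """Move each book into its own folder; unparseable ones go to issues_path.

    `parsers` maps an extension (".epub", ".pdf") to a function that takes a
    file path and returns "Title - Author" or None. This parameter is an
    assumption of this sketch, not part of the original script.
    """
    os.makedirs(issues_path, exist_ok=True)
    for root, _dirs, names in os.walk(input_path):
        for file_name in names:
            source = os.path.join(root, file_name)
            extension = os.path.splitext(file_name)[1]
            parser = parsers.get(extension)
            if parser is None:
                continue  # not a file type we know how to handle
            label = parser(source)
            if label and "Unknown" not in label:
                book_dir = os.path.join(output_path, label)
                os.makedirs(book_dir, exist_ok=True)
                os.rename(source, os.path.join(book_dir, label + extension))
            else:
                # Metadata was missing or unreadable: set aside for manual review
                os.rename(source, os.path.join(issues_path, file_name))
```

With the real libraries installed, the call might look like sort_books(inputPath, outputPath, issuesPath, {".epub": getEpubTitleAndAuthorPath, ".pdf": getPdfTitleAndAuthorPath}). The sketch assumes the output and issues directories are not inside the input directory.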

This script isn't perfect. Sometimes the rename fails to write to the specified path for reasons I can't figure out, but it has saved me many hours of manual organization, and isn't that one of our favorite parts of programming after all?

Thanks!

Thanks for reading. I hope this made Python more accessible if you haven't used it before, or that you learned a new use case for it. Once again, the full script file is here: https://github.com/BrianVia/booksort/blob/main/book-sort.py

I'm Brian, a full-stack software engineer at Clearcover.

If you want to check out my self-hosted blog for more, it's here.

You can follow me on Twitter and GitHub as well!
