Rendering PDFs from URLs and HTML Input Using Express.js

Lead in

Project History

The original project described in this article lived in a single file with no clear structure. It worked, but the code was hard to maintain and extend. The project's purpose is to create PDF attachments for orders efficiently and reliably in a large hospital with 10,000 employees that generates 400 orders per day.

My Contribution

My task was to split the original code into separate modules for the model, view, and controller. The goal was more readable code that is easy to maintain and extend, with a clearer separation of responsibilities among the parts of the application.

Single Browser Instance Management with browser.js

To give every part of the application the same browser instance, I created the browser.js file. It serves as the single point of access to the browser for the entire application, which brings several benefits, chiefly lower resource usage.

Efficient Resource Utilization with a Single Browser Instance

1. Minimizing RAM Consumption

Each browser instance needs its own share of RAM.
With a single instance, that cost is paid only once, which reduces the overall memory load.

2. Reducing Browser Initialization Costs

Launching a browser can be a time-consuming operation.
With a single instance, the launch cost is paid once instead of repeatedly.

const puppeteer = require("puppeteer-core");
const trace = require('./trace');


// The single browser instance shared by the whole application
let browser;

// Launch the browser on the first call and reuse it on later calls
async function createInstance() {
  trace.log("browser.createInstance()");
  if (browser) {
    return browser;
  }
  try {
    browser = await puppeteer.launch({
      executablePath: '/usr/bin/google-chrome',
      args: ['--no-sandbox'],
    });
  } catch (e) {
    trace.error("err:" + e.message);
  }
  return browser;
}

// Close the given browser instance, if there is one
async function close(instance) {
  trace.log("browser.close()");
  try {
    if (instance) {
      await instance.close();
      trace.log("Browser closed!");
    }
  } catch (e) {
    trace.error("Error while closing the browser: " + e.message);
  }
}

// On Ctrl+C, close the shared browser before the process exits
process.on("SIGINT", async () => {
  trace.log("Shutting down the server and the browser instance!");
  await close(browser);
  process.exit();
});

module.exports = {
  createInstance,
  close
}
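A shared instance only saves resources if the browser is really launched once. One way to guard against two callers starting a launch at the same moment is to cache the promise returned by the launch call rather than the finished browser. The following is a minimal sketch of that idea, not the project's code: `once`, `getBrowser`, and `fakeLaunch` are hypothetical names, and `fakeLaunch` merely stands in for `puppeteer.launch()` so the snippet runs on its own.

```javascript
// Hypothetical helper, not part of urlPdfizer: wraps an async factory so it
// runs at most once and every caller shares the same pending promise.
function once(factory) {
  let pending = null;
  return function getInstance() {
    if (!pending) {
      pending = factory().catch((err) => {
        // Forget a failed launch so that a later call can try again
        pending = null;
        throw err;
      });
    }
    return pending;
  };
}

// Stand-in for puppeteer.launch(), used here only so the sketch is runnable
let launchCount = 0;
async function fakeLaunch() {
  launchCount += 1;
  return { id: launchCount };
}

const getBrowser = once(fakeLaunch);

async function demo() {
  // Two concurrent callers still trigger only one launch
  const [a, b] = await Promise.all([getBrowser(), getBrowser()]);
  return { same: a === b, launchCount };
}
```

In a real module, the factory passed to `once` would be the function that calls `puppeteer.launch()` with the options shown above.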

The most important functions

Efficient PDF Generation from HTML Input

One crucial aspect of the urlPdfizer project is the ability to process HTML input and generate a corresponding PDF output. This functionality is encapsulated in the generatePdfFromHtml function.

const generatePdfFromHtml = async (req, res) => {
  console.log("generatePdfFromHtml()");

  let page;
  try {
    // Extract HTML content from the request (assumes the body was parsed as a string)
    const htmlContent = req.body;

    // Create a new page instance using the shared browser instance
    page = await browser.newPage();

    // Set the HTML content of the page and wait for the DOMContentLoaded event
    await page.setContent(htmlContent, { waitUntil: 'domcontentloaded' });

    // Generate PDF from the HTML content with specific formatting options
    const pdfBuffer = await page.pdf({
      format: 'A4',
      printBackground: true,
      displayHeaderFooter: true,
      headerTemplate: 'PDF',
      footerTemplate: 'PDF',
    });

    // Set response headers for the generated PDF
    res.set("Content-Disposition", "inline; filename=page.pdf");
    res.set("Content-Type", "application/pdf");

    // Send the generated PDF as the response
    res.send(pdfBuffer);
  } catch (e) {
    // Log the error and report it to the client
    console.error("Error: " + e.message);
    res.status(500).json({ message: "Error when generating PDF from HTML: " + e.message });
  } finally {
    // Always close the page, even after an error, so pages do not pile up
    if (page) {
      await page.close();
    }
  }
};

Understanding the Code

HTML Content Extraction:
- The function starts by extracting the HTML content from the request body, assuming it to be the main input for PDF generation.
Browser Page Initialization:
- Utilizing the shared browser instance created through browser.js, a new page is instantiated for processing the HTML.
Setting HTML Content:
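- The HTML content is set on the page, and the function waits for the DOMContentLoaded event, which fires once the HTML has been parsed, without waiting for images and other subresources to finish loading.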
PDF Generation:
- The page is then used to generate a PDF, incorporating specific formatting options such as A4 size, background printing, and header/footer templates.
Response Configuration:
- The generated PDF is sent in the response with a Content-Type of application/pdf and an inline Content-Disposition, so the browser displays it directly and suggests page.pdf as the file name when it is saved.
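The handler assumes that `req.body` already holds the HTML as a string. In Express this is only the case when a text body parser (such as `express.text()`) handles the request; otherwise `req.body` may be undefined or a parsed object. A small guard before `page.setContent()` could reject unusable input early. The sketch below illustrates the idea; `checkHtmlBody` is a hypothetical helper, not a function from the project.

```javascript
// Hypothetical helper, not part of urlPdfizer: checks that the request body
// is usable HTML input before it is handed to page.setContent().
function checkHtmlBody(body) {
  if (typeof body !== "string") {
    return { ok: false, reason: "Request body must be a string of HTML" };
  }
  if (body.trim().length === 0) {
    return { ok: false, reason: "Request body is empty" };
  }
  return { ok: true, html: body };
}

// Possible use at the top of the handler (sketch):
//   const check = checkHtmlBody(req.body);
//   if (!check.ok) return res.status(400).json({ message: check.reason });
```

Answering with status 400 in this case would separate bad input from genuine rendering failures, which the handler reports as 500.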

PDF Generation from URL with Retry Mechanism

The generatePdfFromUrl function plays a crucial role in the urlPdfizer project by allowing the generation of PDFs from a specified URL. This function incorporates a retry mechanism to handle potential navigation issues.

async function generatePdfFromUrl(browser, url) {
  trace.log('pdfA4Ctl.generatePdfFromUrl()');
  trace.log(`url:${url}`);
  const maxRetries = 3;
  let retries = 0;

  while (retries < maxRetries) {
    let page;
    try {
      // Create a new page instance within the provided browser
      page = await browser.newPage();
      page.setDefaultNavigationTimeout(60000);

      // Navigate to the specified URL, waiting for DOMContentLoaded event
      await page.goto(url, { waitUntil: ["domcontentloaded"] });

      // Generate PDF from the page with specific formatting options
      const pdfBuffer = await page.pdf({
        format: 'A4',
        printBackground: true,
        displayHeaderFooter: true,
        headerTemplate: 'PDF',
        footerTemplate: 'PDF',
      });

      // Return the generated PDF buffer (the page is closed in finally)
      return pdfBuffer;
    } catch (e) {
      // Any failure in this attempt (navigation or PDF generation) triggers a retry
      retries++;
      trace.error(`Attempt ${retries} of ${maxRetries} failed: ${e.message}`);

      // If maximum retries reached, throw an error
      if (retries === maxRetries) {
        throw new Error("Unable to reach the source HTML after multiple attempts!");
      }
    } finally {
      // Close the page after every attempt so failed attempts do not leak pages
      if (page) {
        await page.close().catch((e) => trace.error("Error closing page: " + e.message));
      }
    }
  }
}

Understanding the Code

Browser Page Initialization:
- The function starts by creating a new page instance within the provided browser for PDF generation.
Setting Navigation Timeout:
- The default navigation timeout for the page is set to 60,000 milliseconds (60 seconds).
Navigating to the URL:
- The function navigates to the specified URL, waiting for the DOMContentLoaded event before proceeding.
PDF Generation:
- The page is then used to generate a PDF, incorporating specific formatting options such as A4 size, background printing, and header/footer templates.
Page Closure:
- After PDF generation, the page is closed to optimize resource usage.
Retry Mechanism:
- If any step of an attempt fails (page creation, navigation, or PDF generation), the function tries again, up to 3 attempts in total. After the third failed attempt, an error is thrown.
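Both functions pass the same formatting options to `page.pdf()`, so those options could live in one place. The sketch below assumes a hypothetical helper named `buildPdfOptions`. Puppeteer treats `headerTemplate` and `footerTemplate` as HTML and fills in elements that carry classes such as `title`, `pageNumber`, and `totalPages`. The margin and font-size values are illustrative assumptions, added because headers and footers typically need some page margin and an explicit font size to be readable.

```javascript
// Hypothetical helper, not part of urlPdfizer: one place for the options
// that both PDF functions pass to page.pdf().
function buildPdfOptions(overrides = {}) {
  return {
    format: "A4",
    printBackground: true,
    displayHeaderFooter: true,
    // Templates are HTML; Puppeteer injects values into elements with the
    // classes "title", "pageNumber" and "totalPages".
    headerTemplate:
      '<div style="font-size:9px; width:100%; text-align:center;">' +
      '<span class="title"></span></div>',
    footerTemplate:
      '<div style="font-size:9px; width:100%; text-align:center;">' +
      '<span class="pageNumber"></span> / <span class="totalPages"></span></div>',
    // Assumed margins that leave room for the header and footer
    margin: { top: "20mm", bottom: "20mm", left: "10mm", right: "10mm" },
    // Caller-supplied values win over the defaults above
    ...overrides,
  };
}

// Usage sketch: const pdfBuffer = await page.pdf(buildPdfOptions());
```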

The retry loop makes generatePdfFromUrl more robust against transient network issues when generating PDFs from a given URL within the urlPdfizer project.
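The loop above retries immediately after a failure. When the cause is a short outage, pausing between attempts may give the source a chance to recover. The following is a minimal sketch of a generic retry helper with a growing delay; `withRetries`, `sleep`, and the default values are assumptions for illustration, not part of the project.

```javascript
// Hypothetical helpers, not part of urlPdfizer.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls task() until it succeeds or maxRetries attempts have failed,
// doubling the wait between attempts (delayMs, 2 * delayMs, ...).
async function withRetries(task, { maxRetries = 3, delayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await task(attempt);
    } catch (e) {
      lastError = e;
      // Wait before the next attempt, but not after the final one
      if (attempt < maxRetries) {
        await sleep(delayMs * 2 ** (attempt - 1));
      }
    }
  }
  const detail = lastError ? lastError.message : "no attempts were made";
  throw new Error(`Failed after ${maxRetries} attempts: ${detail}`);
}

// Usage sketch, where renderOnce is an assumed function that performs a
// single attempt (new page, goto, pdf, close):
//   const pdf = await withRetries(() => renderOnce(browser, url));
```

Keeping the retry policy in a helper like this would leave the PDF function responsible for a single attempt only, and the final error would carry the message of the last underlying failure.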

Conclusion

In conclusion, the urlPdfizer project underwent significant improvements to enhance its maintainability and extensibility. The initial implementation, residing in a single file, posed challenges in code management. My contribution focused on restructuring the code into separate modules for the model, view, and controller, fostering a clearer separation of responsibilities.

The introduction of the browser.js file played a pivotal role in optimizing resource usage by managing a single browser instance for the entire application.

The generatePdfFromHtml function showcased the project's capability to process HTML input and generate corresponding PDF output.

The generatePdfFromUrl function addressed the challenge of generating PDFs from a specified URL, incorporating a retry mechanism to handle potential navigation issues. This makes the project more resilient to adverse network conditions, since a transient failure no longer aborts PDF generation on the first attempt.

The modular structure, shared browser instance, and the functionality for HTML-to-PDF conversion and URL-based PDF generation collectively contribute to a more organized and robust urlPdfizer project.

Thank you for exploring the project's evolution and functionalities. Feel free to explore the GitHub repository for further details and updates.

Author: Marek Tremel

Contact: tremelmarek@gmail.com

Creation Date: 2024

GitHub repo: urlPdfizer

Developed for: kzcr
