Rendering PDFs from URLs and HTML Input Using Express.js
markuss23
Posted on February 3, 2024
Lead in
Project History
The original project described in this article was implemented in a single file without a clear structure. It worked, but maintaining and extending the code was difficult. The project's main purpose is to create PDF attachments for orders efficiently and reliably in a large hospital with 10,000 employees that generates 400 orders per day.
My Contribution
My task involved dividing the original code into separate modules for the model, view, and controller. The goal was to create more readable code that is easily maintainable and allows for straightforward extensibility. I introduced a clearer separation of responsibilities among the different parts of the application.
Single Browser Instance Management with browser.js
To manage communication with the browser efficiently and to provide a single instance for all parts of the application, I created the browser.js file. This file serves as a single point of access to the browser for the entire application, which brings several benefits, including better resource usage.
Efficient Resource Utilization with a Single Browser Instance
1. Minimizing RAM Consumption
- Each browser instance needs its own RAM to operate.
- By using a single instance, we minimize the need for repeated initialization, reducing overall memory load.
2. Reducing Browser Initialization Costs
- Browser initialization can be a time-consuming operation.
- With a single instance, we minimize the costs associated with repeatedly launching the browser.
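const puppeteer = require("puppeteer-core");
const trace = require('./trace');

// Module-level reference to the shared instance, so the SIGINT handler can close it.
let sharedBrowser;

// Launch Chrome and remember the instance for the rest of the application.
async function createInstance() {
  trace.log("browser.createInstance()");
  try {
    sharedBrowser = await puppeteer.launch({
      executablePath: '/usr/bin/google-chrome',
      args: ['--no-sandbox'],
    });
  } catch (e) {
    trace.error("err:" + e.message);
  }
  // Undefined if the launch failed; callers should check before use.
  return sharedBrowser;
}

// Close the given browser instance, logging any failure instead of throwing.
async function close(browser) {
  trace.log("browser.close()");
  try {
    if (browser) {
      await browser.close();
      trace.log("Browser closed!");
    }
  } catch (e) {
    trace.error("Error while closing the browser:" + e.message);
  }
}

// On Ctrl+C, close the shared browser before the process exits.
process.on("SIGINT", async () => {
  trace.log("Shutting down the server and the browser instance!");
  await close(sharedBrowser);
  process.exit();
});

module.exports = {
  createInstance,
  close
};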
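One gap worth noting: if two requests call createInstance() while the browser is still starting, each call launches its own Chrome process. Below is a minimal sketch of one way to avoid that, assuming the launch call is passed in as a function. The names `createBrowserProvider`, `getBrowser`, and `launch` are hypothetical and not part of urlPdfizer.

```javascript
// Sketch: lazily create one shared browser and reuse it for every caller.
// `launch` is a hypothetical injected function, e.g. () => puppeteer.launch({...}).
function createBrowserProvider(launch) {
  let browserPromise = null;

  return function getBrowser() {
    if (!browserPromise) {
      // Store the promise, not the resolved browser, so callers that
      // arrive while Chrome is still starting share the same launch.
      browserPromise = launch().catch((err) => {
        browserPromise = null; // allow a later call to try again
        throw err;
      });
    }
    return browserPromise;
  };
}
```

In browser.js this could be wired as `createBrowserProvider(() => puppeteer.launch({ executablePath: '/usr/bin/google-chrome', args: ['--no-sandbox'] }))`, with every module calling `await getBrowser()` instead of holding its own reference.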
The most important functions
Efficient PDF Generation from HTML Input
One crucial aspect of the urlPdfizer project is the ability to process HTML input and generate a corresponding PDF output. This functionality is encapsulated in the generatePdfFromHtml function.
const generatePdfFromHtml = async (req, res) => {
  console.log("generatePdfFromHtml()");
  let page;
  try {
    // Extract HTML content from the request (the body is expected to be an HTML string)
    const htmlContent = req.body;
    // Create a new page instance using the shared browser instance
    page = await browser.newPage();
    // Set the HTML content of the page, waiting for the DOMContentLoaded event
    await page.setContent(htmlContent, { waitUntil: 'domcontentloaded' });
    // Generate PDF from the HTML content with specific formatting options
    const pdfBuffer = await page.pdf({
      format: 'A4',
      printBackground: true,
      displayHeaderFooter: true,
      headerTemplate: 'PDF',
      footerTemplate: 'PDF',
    });
    // Set response headers for the generated PDF
    res.set("Content-Disposition", "inline; filename=page.pdf");
    res.set("Content-Type", "application/pdf");
    // Send the generated PDF as the response
    res.send(pdfBuffer);
  } catch (e) {
    // Handle errors gracefully and provide a meaningful response
    console.error("Error: " + e.message);
    res.status(500).json({ message: "Error when generating PDF from HTML: " + e.message });
  } finally {
    // Close the page even when rendering fails, so open pages do not
    // accumulate in the shared browser and cause timeouts in later calls
    if (page) {
      await page.close().catch((e) => console.error("Error closing page: " + e.message));
    }
  }
};
Understanding the Code
- HTML Content Extraction: The function starts by reading the HTML content from the request body, which is the main input for PDF generation. This assumes the Express app is configured to parse the body as text.
- Browser Page Initialization: Using the shared browser instance created through browser.js, a new page is opened for processing the HTML.
- Setting HTML Content: The HTML content is set on the page, and the function waits for the DOMContentLoaded event. That event fires once the HTML has been parsed; it does not wait for images or stylesheets to finish loading.
- PDF Generation: The page is then used to generate a PDF with specific formatting options: A4 size, background printing, and header/footer templates.
- Response Configuration: The generated PDF is sent with Content-Type: application/pdf and Content-Disposition: inline, so browsers display it directly instead of forcing a download.
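The open, render, close sequence described above can be factored into a small helper, so that a page is closed even when rendering throws. This is a minimal sketch, not the project's code: `withPage` is a hypothetical name, and it only assumes that `browser.newPage()` and `page.close()` behave as they do in Puppeteer.

```javascript
// Sketch: run `work` with a fresh page and always close the page afterwards.
// `browser` is anything with an async newPage() method, such as a Puppeteer Browser.
async function withPage(browser, work) {
  const page = await browser.newPage();
  try {
    return await work(page);
  } finally {
    // Runs on success and on failure, so a failed render does not leak a tab.
    await page.close();
  }
}

// Possible use inside a handler (assumes a real Puppeteer browser):
//   const pdfBuffer = await withPage(browser, async (page) => {
//     await page.setContent(htmlContent, { waitUntil: 'domcontentloaded' });
//     return page.pdf({ format: 'A4', printBackground: true });
//   });
```

Keeping the cleanup in one place means each handler only describes what to render, and a leaked page can no longer slowly exhaust the shared browser.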
PDF Generation from URL with Retry Mechanism
The generatePdfFromUrl function allows the urlPdfizer project to generate PDFs from a specified URL. It incorporates a retry mechanism to handle potential navigation issues.
async function generatePdfFromUrl(browser, url) {
  trace.log('pdfA4Ctl.generatePdfFromUrl()');
  trace.log(`url:${url}`);
  const maxRetries = 3;
  let retries = 0;
  while (retries < maxRetries) {
    let page;
    try {
      // Create a new page instance within the provided browser
      page = await browser.newPage();
      page.setDefaultNavigationTimeout(60000);
      // Navigate to the specified URL, waiting for DOMContentLoaded event
      await page.goto(url, { waitUntil: ["domcontentloaded"] });
      // Generate PDF from the page with specific formatting options
      const pdfBuffer = await page.pdf({
        format: 'A4',
        printBackground: true,
        displayHeaderFooter: true,
        headerTemplate: 'PDF',
        footerTemplate: 'PDF',
      });
      // Return the generated PDF buffer (the page is closed in finally)
      return pdfBuffer;
    } catch (e) {
      // Handle navigation errors by retrying
      trace.error("Error navigating to the URL, retrying... " + e.message);
      retries++;
      // If maximum retries reached, throw an error
      if (retries === maxRetries) {
        throw new Error("Unable to reach the source HTML after multiple attempts!");
      }
    } finally {
      // Close the page after every attempt, successful or not
      if (page) {
        await page.close().catch(() => {});
      }
    }
  }
}
Understanding the Code
- Browser Page Initialization: The function starts by creating a new page instance within the provided browser for PDF generation.
- Setting Navigation Timeout: The default navigation timeout for the page is set to 60,000 milliseconds (60 seconds).
- Navigating to the URL: The function navigates to the specified URL, waiting for the DOMContentLoaded event before proceeding.
- PDF Generation: The page is then used to generate a PDF with specific formatting options: A4 size, background printing, and header/footer templates.
- Page Closure: After PDF generation, the page is closed to free its resources.
- Retry Mechanism: If an attempt fails, the function tries again, up to 3 attempts in total. The catch block wraps the whole attempt, so PDF generation errors are retried as well as navigation errors. After the third failed attempt, an error is thrown.
This retry logic makes PDF generation from a URL more robust against transient network issues within the urlPdfizer project.
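The loop above retries immediately, which may not help when the source server needs a moment to recover. Below is a minimal sketch of a generic retry helper that pauses between attempts with exponential backoff. It is a swapped-in variant, not the project's code, and `retryWithBackoff`, `maxAttempts`, and `baseDelayMs` are hypothetical names.

```javascript
// Sketch: retry an async operation with a growing pause between attempts.
// `operation` receives the attempt number (starting at 1).
async function retryWithBackoff(operation, { maxAttempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait 1x, 2x, 4x ... the base delay before the next attempt.
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Keep the original failure attached so logs show the real cause.
  throw new Error(`Operation failed after ${maxAttempts} attempts`, { cause: lastError });
}
```

A single attempt of the URL rendering (open page, navigate, print, close) could be passed in as `operation`, which keeps the retry policy separate from the Puppeteer calls.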
Conclusion
In conclusion, the urlPdfizer project was restructured to improve its maintainability and extensibility. The initial implementation lived in a single file, which made the code hard to manage. My contribution focused on splitting the code into separate modules for the model, view, and controller, giving each part a clearer responsibility.
The browser.js file reduces resource usage by managing a single browser instance for the entire application.
The generatePdfFromHtml function processes HTML input and generates the corresponding PDF output.
The generatePdfFromUrl function generates PDFs from a specified URL and includes a retry mechanism for navigation issues. Retrying up to three times improves the chance of a successful PDF under unreliable network conditions, although it cannot guarantee one.
Together, the modular structure, the shared browser instance, and the HTML-to-PDF and URL-to-PDF functions make urlPdfizer a more organized and robust project.
Thank you for exploring the project's evolution and functionalities. Feel free to explore the GitHub repository for further details and updates.
Author: Marek Tremel
Contact: tremelmarek@gmail.com
Creation Date: 2024
GitHub repo: urlPdfizer
Developed for: kzcr