Web scraping LinkedIn jobs using Puppeteer and RxJS
llorenspujol
Posted on October 30, 2023
Web scraping may seem like a simple task, but there are many challenges to overcome. In this blog, we will dive into how to scrape LinkedIn to extract job listings. To do this, we will use Puppeteer and RxJS. The goal is to achieve web scraping in a declarative, modular, and scalable manner.
What is web scraping?
Web scraping is a data extraction technique used to gather information from websites. It involves the automated process of retrieving specific data from web pages, such as text, images, links, and more, and then storing or processing that data for various purposes.
Puppeteer
Puppeteer is a JavaScript library that allows you to control web browsers like Chrome, which makes it well suited for web scraping. It enables us to script and monitor tasks such as navigating to a specific webpage and extracting the data we need. It's an ideal tool for web scraping because, since it drives a real browser, it can handle obstacles such as websites that require JavaScript to run in order to render or display their data.
RxJS
RxJS is a library for reactive programming in JavaScript. It provides a set of tools and abstractions for working with asynchronous data streams. We will use RxJS in this example because it offers the following advantages:
- Declarative asynchronous code
- Improved error handling
- Enhanced retry logic
- Simplified code adaptation
- A wide range of operators to assist us throughout the process
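To make these advantages concrete, here is a minimal sketch (not taken from the post's code; fetchJobsPage is a hypothetical async step) of how a flaky asynchronous call can be retried and recovered declaratively in a single pipe:
import { defer, from, of } from 'rxjs';
import { retry, catchError } from 'rxjs/operators';
// Hypothetical async step standing in for any Puppeteer/network call.
declare function fetchJobsPage(pageNumber: number): Promise<string[]>;
const jobsPage$ = defer(() => from(fetchJobsPage(0))).pipe(
  retry(3),                 // re-run the whole async step up to 3 times on failure
  catchError(() => of([]))  // fall back to an empty result instead of crashing
);
jobsPage$.subscribe(jobs => console.log(`Got ${jobs.length} jobs`));
The same ideas (defer to make the Promise lazy, plus retry operators) reappear later when handling LinkedIn's rate limits.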
1. Puppeteer initialization
The code snippet below initializes a Puppeteer browser instance in a non-headless mode and subsequently creates a new web page. This represents the most fundamental and straightforward initialization process for Puppeteer:
import puppeteer from 'puppeteer';

(async () => {
console.log('Launching Chrome...');
const browser = await puppeteer.launch({
headless: false,
// devtools: true,
// slowMo: 250, // slow down puppeteer script so that it's easier to follow visually
args: [
'--disable-gpu',
'--disable-dev-shm-usage',
'--disable-setuid-sandbox',
'--no-first-run',
'--no-sandbox',
'--no-zygote',
'--single-process',
],
});
const page = await browser.newPage()
/**
* 1. Go to the LinkedIn jobs URL
* 2. Get the jobs
* 3. Repeat step 1 with other search parameters
*/
})();
Some of the snippets in this blog may omit parts for clarity. You can find the complete code in this repository.
2. Go to the LinkedIn jobs list and extract the job offers data
This is the core part of this blog, where we dive into the process of accessing LinkedIn's job listings, parsing the HTML content, and retrieving the offers data in JSON format.
2.1. Construct the URL for navigating to LinkedIn job offers page
To access LinkedIn's job listings, we need to construct a URL using the function urlQueryPage:
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
`https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`
In this case, I have already done the research beforehand to find this URL. The objective is to find a URL that we can parameterize with our desired search parameters.
For this example, our search parameters will be searchText, pageNumber, and optionally locationText.
Example URLs:
- https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Angular&start=0
- https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=React&location=Barcelona&start=0
- https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&start=0
2.2. Navigate to the URL and extract the job offers
With our target URL identified, we can proceed with the two primary actions required:
- Navigating to the job listings URL: directing our web scraping tool to the URL where the job listings are hosted.
- Extracting the job offers data and converting it to JSON: once on the job listings page, we employ web scraping techniques to extract the jobs data and return it in JSON format.
export interface ScraperSearchParams {
searchText: string;
locationText: string;
pageNumber: number;
}
/** main function */
export function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
return defer(() => fromPromise(navigateToJobsPage(page, searchParams)))
.pipe(switchMap(() => getJobsFromLinkedinPage(page)));
}
/* Utility functions */
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
`https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`
function navigateToJobsPage(page: Page, searchParams: ScraperSearchParams): Promise<Response | null> {
return page.goto(urlQueryPage(searchParams), { waitUntil: 'networkidle0' });
}
export const stacks = ['angularjs', 'kubernetes', 'javascript', 'jenkins', 'html', /* ... */];
export function getJobsFromLinkedinPage(page: Page): Observable<JobInterface[]> {
return defer(() => fromPromise(page.evaluate((pageEvalData) => {
const collection: HTMLCollection = document.body.children;
const results: JobInterface[] = [];
for (let i = 0; i < collection.length; i++) {
try {
const item = collection.item(i)!;
const title = item.getElementsByClassName('base-search-card__title')[0].textContent!.trim();
const imgSrc = item.getElementsByTagName('img')[0].getAttribute('data-delayed-url') || '';
const remoteOk: boolean = !!title.match(/remote|No office location/gi);
const url = (
(item.getElementsByClassName('base-card__full-link')[0] as HTMLLinkElement)
|| (item.getElementsByClassName('base-search-card--link')[0] as HTMLLinkElement)
).href;
const companyNameAndLinkContainer = item.getElementsByClassName('base-search-card__subtitle')[0];
const companyUrl: string | undefined = companyNameAndLinkContainer?.getElementsByTagName('a')[0]?.href;
const companyName = companyNameAndLinkContainer.textContent!.trim();
const companyLocation = item.getElementsByClassName('job-search-card__location')[0].textContent!.trim();
const toDate = (dateString: string) => {
const [year, month, day] = dateString.split('-')
return new Date(parseFloat(year), parseFloat(month) - 1, parseFloat(day) )
}
const dateTime = (
item.getElementsByClassName('job-search-card__listdate')[0]
|| item.getElementsByClassName('job-search-card__listdate--new')[0] // less than a day. TODO: Improve precision on this case.
).getAttribute('datetime');
const postedDate = toDate(dateTime as string).toISOString();
/**
* Calculate minimum and maximum salary
*
* Salary HTML example to parse:
* <span class="job-result-card__salary-info">$65,000.00 - $90,000.00</span>
*/
let currency: SalaryCurrency = ''
let salaryMin = -1;
let salaryMax = -1;
const salaryCurrencyMap: any = {
['€']: 'EUR',
['$']: 'USD',
['£']: 'GBP',
}
const salaryInfoElem = item.getElementsByClassName('job-search-card__salary-info')[0]
if (salaryInfoElem) {
const salaryInfo: string = salaryInfoElem.textContent!.trim();
if (salaryInfo.startsWith('€') || salaryInfo.startsWith('$') || salaryInfo.startsWith('£')) {
const coinSymbol = salaryInfo.charAt(0);
currency = salaryCurrencyMap[coinSymbol] || coinSymbol;
}
const matches = salaryInfo.match(/([0-9]|,|\.)+/g)
if (matches && matches[0]) {
// values are in US format, so we need to remove ALL the commas
salaryMin = parseFloat(matches[0].replace(/,/g, ''));
}
if (matches && matches[1]) {
// values are in US format, so we need to remove ALL the commas
salaryMax = parseFloat(matches[1].replace(/,/g, ''));
}
}
// Calculate tags
let stackRequired: string[] = [];
title.split(' ').concat(url.split('-')).forEach(word => {
if (!!word) {
const wordLowerCase = word.toLowerCase();
if (pageEvalData.stacks.includes(wordLowerCase)) {
stackRequired.push(wordLowerCase)
}
}
})
// Define the uniq function here. Remember that page.evaluate executes inside the browser, so we cannot easily import functions from other contexts
const uniq = (_array) => _array.filter((item, pos) => _array.indexOf(item) == pos);
stackRequired = uniq(stackRequired)
const result: JobInterface = {
id: item!.children[0].getAttribute('data-entity-urn') as string,
city: companyLocation,
url: url,
companyUrl: companyUrl || '',
img: imgSrc,
date: new Date().toISOString(),
postedDate: postedDate,
title: title,
company: companyName,
location: companyLocation,
salaryCurrency: currency,
salaryMax: salaryMax,
salaryMin: salaryMin,
countryCode: '',
countryText: '',
descriptionHtml: '',
remoteOk: remoteOk,
stackRequired: stackRequired
};
console.log('result', result);
results.push(result);
} catch (e) {
console.error(`Something went wrong retrieving LinkedIn page item: ${i} on url: ${window.location}`, e.stack);
}
}
return results;
}, {stacks})) as Observable<JobInterface[]>)
}
The code above extracts the information for all jobs on the page. It may not be the most aesthetically pleasing code, but it gets the job done: parsing this kind of HTML always involves many fallbacks and checks.
In a standard programming context, it's generally advisable to decompose code into smaller, isolated functions to enhance readability and maintainability. However, code executed within page.evaluate in Puppeteer is constrained: it runs in the Puppeteer (Chrome) instance, not in our Node.js environment. Consequently, all code must be self-contained within the page.evaluate call. The only exception is variables (like stacks in our case), which can be passed as arguments to page.evaluate, as long as they don't contain functions or complex objects that cannot be serialized.
In this case, the only challenging part to scrape is the salary information, as it involves converting a text format like '$65,000.00 - $90,000.00' into separate salaryMin and salaryMax values.
Additionally, we've encapsulated the entire code within a try/catch block to correctly handle errors. While we currently log errors to the console, it's advisable to consider implementing a mechanism to store these error logs on disk. This practice becomes particularly important because web pages often undergo changes, necessitating frequent updates to the HTML parsing code.
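Since that try/catch runs inside page.evaluate (the browser context), its console.error output stays in the browser. One possible way to persist those errors, sketched below as an assumption rather than as part of the original code, is to listen for console events on the Node.js side and append them to a log file:
import * as fs from 'fs';
import { Page } from 'puppeteer';
// Forward browser-side console errors (like the ones emitted inside page.evaluate) to a file on disk.
function logPageErrorsToDisk(page: Page, logFile = 'scraper-errors.log'): void {
  page.on('console', msg => {
    if (msg.type() === 'error') {
      fs.appendFileSync(logFile, `${new Date().toISOString()} ${msg.text()}\n`);
    }
  });
}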
Lastly, it's important to note that we always use the defer and fromPromise operators to convert Promises into Observables:
defer(() => fromPromise(myPromise()));
This approach is a recommended best practice that works reliably in all scenarios. Promises are eager, whereas Observables are lazy and only initiate when someone subscribes to them. The defer operator allows us to make a Promise lazy. See this link for more information about it.
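As a small illustration of that eager-versus-lazy difference (an assumed example, not taken from the repository):
import { defer, from } from 'rxjs';
import { Page } from 'puppeteer';
declare const page: Page; // assume an already-created Puppeteer page
// Eager: the navigation starts as soon as this line runs, even if nobody awaits it.
const eagerNavigation = page.goto('https://example.com');
// Lazy: nothing happens until someone subscribes, and the Promise is re-created
// for every new subscription, which is exactly what retry logic needs.
const lazyNavigation$ = defer(() => from(page.goto('https://example.com')));
lazyNavigation$.subscribe(); // the navigation only starts here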
3. Add an asynchronous loop to iterate through all pages
In the previous step, we learned how to obtain all job offers data from a LinkedIn page. Now, we want to use that code as many times as possible to gather as much data as we can. To achieve this, we first need to iterate through all available pages:
function getJobsFromAllPages(page: Page, initSearchParams: ScraperSearchParams): Observable<ScraperResult> {
const getJobs$ = (searchParams: ScraperSearchParams) => goToLinkedinJobsPageAndExtractJobs(page, searchParams).pipe(
map((jobs): ScraperResult => ({jobs, searchParams} as ScraperResult)),
catchError(error => {
console.error(error);
return of({jobs: [], searchParams: searchParams})
})
);
return getJobs$(initSearchParams).pipe(
expand(({jobs, searchParams}) => {
console.log(`Linkedin - Query: ${searchParams.searchText}, Location: ${searchParams.locationText}, Page: ${searchParams.pageNumber}, nJobs: ${jobs.length}, url: ${urlQueryPage(searchParams)}`);
if (jobs.length === 0) {
return EMPTY;
} else {
return getJobs$({...searchParams, pageNumber: searchParams.pageNumber + 1});
}
})
);
}
The code above increments the page number until we reach a page where there are no jobs. To perform this loop in RxJS, we use the expand operator, which recursively projects each source value to an Observable that is merged into the output Observable. Its functionality is well explained here.
In RxJS, we cannot use a for loop as we would with async/await; we have to use another technique instead, such as the expand operator or a recursive function. While this might initially appear to be a limitation, in an asynchronous context this approach proves more advantageous in many situations.
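As a stripped-down sketch of the same looping pattern outside the scraping context (fetchPage here is a made-up stand-in for any paginated async source), expand keeps re-subscribing with the next page number until we return EMPTY:
import { of, EMPTY } from 'rxjs';
import { expand } from 'rxjs/operators';
// Fake paginated source: pages 0-2 have one item, page 3 is empty.
const fetchPage = (pageNumber: number) => of({ pageNumber, items: pageNumber < 3 ? ['job'] : [] });
fetchPage(0).pipe(
  expand(({ pageNumber, items }) => items.length === 0 ? EMPTY : fetchPage(pageNumber + 1))
).subscribe(({ pageNumber, items }) => console.log(`page ${pageNumber}: ${items.length} items`));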
So, what would the equivalent code using Promises look like? Here's an example:
export async function getJobsFromAllPages(
page: Page,
searchParams: ScraperSearchParams
): Promise<ScraperResult> {
const results: ScraperResult = { jobs: [], searchParams };
try {
while (true) {
const jobs = await getJobsFromLinkedinPage(page, searchParams);
console.log(
`Linkedin - Query: ${searchParams.searchText}, Location: ${
searchParams.locationText
}, Page: ${searchParams.pageNumber}, nJobs: ${
jobs.length
}, url: ${urlQueryPage(searchParams)}`
);
results.jobs.push(...jobs);
if (jobs.length === 0) {
break;
}
searchParams.pageNumber++;
}
} catch (error) {
console.error('Error:', error);
results.jobs = []; // Clear the jobs in case of an error.
}
return results;
}
This code is nearly equivalent to the Observable-based one, with one critical difference: it only resolves once all pages have been processed. In contrast, the Observable implementation emits after each page. Creating a stream is crucial in this case because we want to handle the jobs as soon as they become available.
Certainly, we could introduce our logic following the line:
const jobs = await getJobsFromLinkedinPage(page, searchParams);
/* Handle the jobs here */
...but this would unnecessarily couple our scraping code with the part that handles the jobs data. Handling the jobs data may involve some transformations, API calls, and finally, saving the data into a database.
In this example, we clearly see one of the many benefits Observables offer over Promises.
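For instance, a consumer of this stream might look like the hedged sketch below (saveJobsToDb is a hypothetical persistence step; getJobsFromAllPages and JobInterface come from the code above): each ScraperResult is persisted as soon as its page has been scraped, while the crawl keeps running.
import { Page } from 'puppeteer';
import { concatMap } from 'rxjs/operators';
declare const page: Page;                                           // an already-created Puppeteer page
declare function saveJobsToDb(jobs: JobInterface[]): Promise<void>; // hypothetical persistence step
getJobsFromAllPages(page, { searchText: 'Angular', locationText: 'Barcelona', pageNumber: 0 })
  .pipe(concatMap(({ jobs }) => saveJobsToDb(jobs))) // persist each page's jobs as it arrives
  .subscribe({
    complete: () => console.log('All pages processed'),
    error: err => console.error('Scraping stream failed', err),
  });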
4. Add another asynchronous loop on top to iterate through all specified search params
Now that we know how to iterate through all pages for a given set of search parameters, we can move on to the final step: creating a loop to iterate through different search parameters.
To accomplish this, we will first define the data structure in which we will store these search parameters and name it searchParamsList:
const searchParamsList: { searchText: string; locationText: string }[] = [
{ searchText: 'Angular', locationText: 'Barcelona' },
{ searchText: 'Angular', locationText: 'Madrid' },
// ...
{ searchText: 'React', locationText: 'Barcelona' },
{ searchText: 'React', locationText: 'Madrid' },
// ...
];
To iterate through the searchParamsList array, we essentially need to convert it from an Array to an Observable using the fromArray operator. Then we use the concatMap operator to sequentially process each searchText and locationText pair. The power of RxJS here is that, if we ever want to switch from sequential to parallel processing, we just need to replace concatMap with mergeMap (see the sketch after the snippet below). In this case, that is not recommended because we would exceed LinkedIn's rate limits, but it's something to consider in other scenarios.
/**
* Creates a new page and scrapes LinkedIn job offers data for each pair of searchText and locationText, recursively retrieving data until there are no more pages.
* @param browser A Puppeteer instance
* @returns An Observable that emits scraped job offers data as ScraperResult
*/
export function getJobsFromLinkedin(browser: Browser): Observable<ScraperResult> {
// Create a new page
const createPage = defer(() => fromPromise(browser.newPage()));
// Iterate through search parameters and scrape jobs
const scrapeJobs = (page: Page): Observable<ScraperResult> =>
fromArray(searchParamsList).pipe(
concatMap(({ searchText, locationText }) =>
getJobsFromAllPages(page, { searchText, locationText, pageNumber: 0 })
)
)
// Compose sequentially previous steps
return createPage.pipe(switchMap(page => scrapeJobs(page)));
}
This code will iterate through various search parameters, one at a time, and retrieve jobs for each combination of technology and location.
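For completeness, here is a sketch of the parallel variation mentioned above (an assumed modification, not part of the original code): swap concatMap for mergeMap with a concurrency cap, and give each concurrent search its own page, since a single Puppeteer page can only load one URL at a time.
import { Browser } from 'puppeteer';
import { Observable, defer, from } from 'rxjs';
import { mergeMap, switchMap } from 'rxjs/operators';
// searchParamsList, fromArray, getJobsFromAllPages and ScraperResult come from the code above.
export function getJobsFromLinkedinInParallel(browser: Browser): Observable<ScraperResult> {
  return fromArray(searchParamsList).pipe(
    mergeMap(({ searchText, locationText }) =>
      defer(() => from(browser.newPage())).pipe( // one page per concurrent search
        switchMap(page => getJobsFromAllPages(page, { searchText, locationText, pageNumber: 0 }))
      ),
      2 // at most 2 search-parameter pairs in flight at once (pages are not closed here, for brevity)
    )
  );
}
Against LinkedIn this would quickly trigger the rate limiting discussed below, so it is shown only to illustrate how local the change is.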
🎉 Congratulations! You are now capable of scraping LinkedIn job offers! 🎉
However, LinkedIn, like many other websites, uses techniques to prevent web scraping. Let's see how to deal with them 👇.
Common errors when scraping LinkedIn
If we run the code as it is now, we will quickly encounter numerous errors from LinkedIn, making it difficult to successfully scrape a significant amount of information. There are two common errors that we need to address:
1. 429 status code response
This response can occur during scraping, and it means that we are making too many requests in a short period of time. If we encounter this error, we need to reduce the rate at which requests are being made until it disappears.
2. LinkedIn Authwall
From time to time, LinkedIn may redirect us to an authwall instead of the desired page. When the authwall appears, the only option is to wait a bit longer before making the next request.
How to overcome the 429 status code response and the LinkedIn authwall
These errors have to be handled in the function goToLinkedinJobsPageAndExtractJobs: we extend that function with the error handling and retry logic, while the HTML-scraping code stays in the separate function getJobsFromLinkedinPage. The code looks like this:
const AUTHWALL_PATH = 'linkedin.com/authwall';
const STATUS_TOO_MANY_REQUESTS = 429;
const JOB_SEARCH_SELECTOR = '.job-search-card';
function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
return defer(() => fromPromise(page.setExtraHTTPHeaders({'accept-language': 'en-US,en;q=0.9'})))
.pipe(
switchMap(() => navigateToLinkedinJobsPage(page, searchParams)),
tap(response => checkResponseStatus(response)),
switchMap(() => throwErrorIfAuthwall(page)),
switchMap(() => waitForJobSearchCard(page)),
switchMap(() => getJobsFromLinkedinPage(page)),
retryWhen(retryStrategyByCondition({
maxRetryAttempts: 4,
retryConditionFn: error => error.retry === true
})),
map(jobs => Array.isArray(jobs) ? jobs : []),
take(1)
);
}
/**
* Navigate to the LinkedIn search page, using the provided search parameters.
*/
function navigateToLinkedinJobsPage(page: Page, searchParams: ScraperSearchParams) {
return defer(() => fromPromise(page.goto(urlQueryPage(searchParams), {waitUntil: 'networkidle0'})));
}
/**
* Check the HTTP response status and throw an error if too many requests have been made.
*/
function checkResponseStatus(response: any) {
const status = response?.status();
if (status === STATUS_TOO_MANY_REQUESTS) {
throw {message: 'Status 429 (Too many requests)', retry: true, status: STATUS_TOO_MANY_REQUESTS};
}
}
/**
* Check if the current page is an authwall and throw an error if it is.
*/
function throwErrorIfAuthwall(page: Page) {
return getPageLocationOperator(page).pipe(tap(locationHref => {
if (locationHref.includes(AUTHWALL_PATH)) {
console.error('Authwall error');
throw {message: `Linkedin authwall! locationHref: ${locationHref}`, retry: true};
}
}));
}
/**
* Wait for the job search card to be visible on the page, and handle timeouts or authwalls.
*/
function waitForJobSearchCard(page: Page) {
return defer(() => fromPromise(page.waitForSelector(JOB_SEARCH_SELECTOR, {visible: true, timeout: 5000}))).pipe(
catchError(error => throwErrorIfAuthwall(page).pipe(tap(() => {throw error})))
);
}
In this code, we address the previously mentioned errors: the 429 response and the authwall. Overcoming these errors is essential for successfully scraping LinkedIn.
To handle the errors, the code employs a custom retry strategy implemented by the retryStrategyByCondition function:
export const retryStrategyByCondition = ({maxRetryAttempts = 3, scalingDuration = 1000, retryConditionFn = (error) => true}: {
maxRetryAttempts?: number,
scalingDuration?: number,
retryConditionFn?: (error) => boolean
} = {}) => (attempts: Observable<any>) => {
return attempts.pipe(
mergeMap((error, i) => {
const retryAttempt = i + 1;
if (
retryAttempt > maxRetryAttempts ||
!retryConditionFn(error)
) {
return throwError(error);
}
console.log(
`Attempt ${retryAttempt}: retrying in ${retryAttempt *
scalingDuration}ms`
);
// retry after 1s, 2s, etc...
return timer(retryAttempt * scalingDuration);
}),
finalize(() => console.log('retryStrategyByCondition - finalized'))
);
};
This strategy essentially increases the time between retries after each failure. This way, we ensure that we wait long enough for LinkedIn to allow us to make requests again.
Note: It's important to be aware that LinkedIn can blacklist our IP address, and simply waiting longer might not be an effective solution. To mitigate this potential problem and reduce errors, a recommended practice is to rotate IPs from time to time.
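One hedged way to approach IP rotation (the proxy addresses below are placeholders, and any real setup depends on your proxy provider) is to launch Chrome with the --proxy-server argument and pick a different proxy on each launch or retry cycle:
import puppeteer from 'puppeteer';
const PROXIES = ['http://proxy-1.example.com:3128', 'http://proxy-2.example.com:3128']; // hypothetical proxies
// Launch a browser routed through a different proxy on each attempt.
function launchWithProxy(attempt: number) {
  const proxy = PROXIES[attempt % PROXIES.length];
  return puppeteer.launch({
    headless: false,
    args: [`--proxy-server=${proxy}`, '--no-sandbox'],
  });
}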
Final Words
Web scraping can frequently violate the terms of service of a website. Always review and respect a website's robots.txt file and its Terms of Service. In this case, this code should be used ONLY for teaching and hobby purposes. LinkedIn specifically prohibits any data extraction from its website; you can read more here.
I recommend web scraping for learning, teaching, and hobby projects. Always remember to consistently respect the scraped site, avoid launching too many requests, and ensure the data is used respectfully.
All the updated code is in this repository; don't hesitate to give it a star if it helped! 🙏⭐
You can find me on Twitter, LinkedIn or GitHub.
Also, thanks to aleix10kst for helping with the blog review process 🙌.
Safe Scraping!🕷
NOTE: This is a cross-post; the original blog post can be found at: https://gironajs.com/en/blog/web-scraping-linkedin-jobs-using-puppeteer-and-rxjs