Recommending a flexible Node.js multifunctional crawler library: x-crawl

CoderHXL

Posted on March 20, 2024

x-crawl

x-crawl is a flexible Node.js multifunctional crawler library. Its flexible usage and many built-in features help you crawl pages, APIs, and files quickly, safely, and reliably.
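Before the full example below, here is a minimal sketch of the API: create an instance with default options and crawl a single page. The page handle in the result matches the example later in this post; the browser handle is an assumption based on x-crawl's Puppeteer integration, so check the documentation if it differs.

import xCrawl from 'x-crawl'

// Create a crawler instance with default options
const crawler = xCrawl()

// crawlPage drives a headless browser and resolves with a page handle
crawler.crawlPage('https://www.example.com').then(async (res) => {
  const { browser, page } = res.data
  console.log(await page.title())
  await browser.close() // release the browser so the process can exit
})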

If you like x-crawl, you can give the x-crawl repository a star on GitHub to support it. Thank you!

Features

  • 🔥 Asynchronous / Synchronous - Switch between asynchronous and synchronous crawling modes by changing a single mode property (see the configuration sketch after this list).
  • ⚙️ Multiple uses - Supports crawling dynamic pages, static pages, interface data, and files, as well as polling operations.
  • ⚒️ Page control - When crawling dynamic pages, supports automated operations such as keyboard input and event handling.
  • 🖋️ Flexible writing style - The same crawling API accepts multiple configuration shapes, each with its own strengths.
  • ⏱️ Interval crawling - Choose no interval, a fixed interval, or a random interval to create or avoid bursts of concurrent requests.
  • 🔄 Failure retry - Avoids crawl failures caused by transient problems; the number of retries is configurable.
  • ➡️ Proxy rotation - Automatically rotates proxies on failure, with configurable error counts and HTTP status codes.
  • 👀 Device fingerprinting - Zero-config or custom fingerprints help avoid being identified and tracked across locations.
  • 🚀 Priority queue - Give a single crawl target a priority so it is crawled ahead of other targets.
  • 🧾 Crawl logs - Logs each crawl, with colored output in the terminal.
  • 🦾 TypeScript - Written in TypeScript, with complete types via generics.
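To give a rough feel for how these options fit together, here is a hedged configuration sketch. The mode, maxRetry, and intervalTime options come straight from the feature list and the example below; the priority and proxy shapes are assumptions based on the feature descriptions, so verify the exact option names against the x-crawl documentation.

import xCrawl from 'x-crawl'

// Instance-level options: crawling mode, retry count, random interval
const crawler = xCrawl({
  mode: 'async', // or 'sync' to crawl targets one at a time
  maxRetry: 3, // retry each failed target up to 3 times
  intervalTime: { max: 2000, min: 1000 } // wait 1-2s between targets
})

// Target-level options: priority and proxy rotation
// (shapes assumed from the feature list; check the docs)
crawler.crawlPage({
  targets: [
    { url: 'https://www.example.com/a', priority: 9 }, // crawled first
    { url: 'https://www.example.com/b', priority: 1 }
  ],
  proxy: {
    urls: ['http://localhost:7890', 'http://localhost:7891'], // hypothetical proxies
    switchByErrorCount: 2, // rotate after 2 failures on a proxy (assumed name)
    switchByHttpStatus: [401, 403] // rotate on these statuses (assumed name)
  }
})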

Example

As an example, let's automatically collect photos of experiences and Plus homes from around the world every day:

// 1. Import the module (ESM/CJS)
import xCrawl from 'x-crawl'

// 2. Create a crawler instance
const myXCrawl = xCrawl({ maxRetry: 3, intervalTime: { max: 2000, min: 1000 } })

// 3. Set the crawling task
/*
  Call the startPolling API to start polling;
  the callback runs once per day
*/
myXCrawl.startPolling({ d: 1 }, async (count, stopPolling) => {
  // Call the crawlPage API to crawl the page
  const pageResults = await myXCrawl.crawlPage({
    targets: [
      'https://www.airbnb.cn/s/*/experiences',
      'https://www.airbnb.cn/s/plus_homes'
    ],
    viewport: { width: 1920, height: 1080 }
  })

  // Collect image URLs from the crawled page results
  const imgUrls = []
  for (const item of pageResults) {
    const { id } = item
    const { page } = item.data
    const elSelector = id === 1 ? '.i9cqrtb' : '.c4mnd7m'

    // Wait for the target element to appear
    await page.waitForSelector(elSelector)

    // Get the image URLs on the page
    const urls = await page.$$eval(`${elSelector} picture img`, (imgEls) =>
      imgEls.map((item) => item.src)
    )
    imgUrls.push(...urls.slice(0, 6))

    // Close the page
    await page.close()
  }

  // Call the crawlFile API to download the images
  await myXCrawl.crawlFile({ targets: imgUrls, storeDirs: './upload' })
})

Running result: the crawled images are saved to the local ./upload directory.

Note: Please do not crawl sites indiscriminately; check a site's robots.txt before crawling. The class names on the target site may change over time; this example only demonstrates how to use x-crawl.
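If you want to honor robots.txt programmatically, the sketch below shows one way to do it with Node's built-in fetch. The canCrawl helper is hypothetical and not part of x-crawl; it only parses the simplest Disallow rules under User-agent: * and ignores Allow rules and wildcards.

// Hypothetical helper: checks the simplest robots.txt Disallow rules
async function canCrawl(targetUrl) {
  const { origin, pathname } = new URL(targetUrl)
  const res = await fetch(`${origin}/robots.txt`)
  if (!res.ok) return true // no robots.txt: assume crawling is allowed

  let appliesToAll = false
  for (const line of (await res.text()).split('\n')) {
    const [rawKey, ...rest] = line.split(':')
    const key = rawKey.trim().toLowerCase()
    const value = rest.join(':').trim()
    if (key === 'user-agent') appliesToAll = value === '*'
    if (appliesToAll && key === 'disallow' && value && pathname.startsWith(value)) {
      return false // the path is disallowed for all user agents
    }
  }
  return true
}

// Usage: filter targets before handing them to the crawler
console.log(await canCrawl('https://www.example.com/some/path'))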

More

For more details, see the x-crawl repository: https://github.com/coder-hxl/x-crawl
