Integration with Puppeteer

Call SDK

To integrate a Puppeteer spider with Crawlab, follow the guide on integration with Node.js spiders. The only SDK method you need to call is 'crawlab.saveItem'.
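As a rough illustration of the call pattern, the sketch below saves a list of scraped records one by one. The helper name `saveItems` is hypothetical, and `sdk` stands for whatever object exposes `saveItem` (in a real spider, the Crawlab Node.js SDK module described in the Node.js integration guide). It assumes `saveItem` accepts a plain object and may return a promise.

```javascript
// Sketch: push scraped records to Crawlab through the SDK's saveItem method.
// `sdk` is any object exposing saveItem(item); saveItems is a hypothetical helper.
async function saveItems(sdk, items) {
    let saved = 0;
    for (const item of items) {
        // Await each call so a failed save surfaces as a rejected promise.
        await sdk.saveItem(item);
        saved += 1;
    }
    // Number of records handed to the SDK.
    return saved;
}
```

In a Puppeteer spider, `items` would typically be the array of plain objects extracted from the page, and `sdk` the imported Crawlab SDK.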

Avoid memory leaks

Because Puppeteer runs the spider through Chromium, the browser process may keep running after the spider itself has stopped. To avoid this, run the spider with the 'dumb-init' tool. When creating a Puppeteer spider, enter the following in 'execute command'.

dumb-init -- <command>

'<command>' is the actual execution command, for example 'node spider.js'. A typical 'execute command' therefore looks like this.

dumb-init -- node spider.js

For Docker deployments, 'dumb-init' is built in, so you can use it directly. For direct deployments, you need to install it yourself.
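Independently of 'dumb-init', the spider code can reduce the chance of a leftover browser by closing it in a `finally` block. The following is a minimal sketch of that pattern, not part of Crawlab itself: `withBrowser` is a hypothetical helper, `launch` is any function that returns a Puppeteer-style browser, and `task` is the scraping logic.

```javascript
// Sketch: always close the browser, even when the scraping task throws.
// `launch` returns a Puppeteer-like browser object that has a close() method.
async function withBrowser(launch, task) {
    const browser = await launch();
    try {
        // Run the scraping logic and pass its result back to the caller.
        return await task(browser);
    } finally {
        // Runs on success and on error, so Chromium is asked to shut down.
        await browser.close();
    }
}
```

This only covers normal exits and thrown errors inside the Node.js process; it complements rather than replaces running the spider under 'dumb-init'.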

The right way to start Puppeteer

Puppeteer relies on Chromium as its engine, so it needs the correct path to the Chromium executable. We recommend using 'puppeteer-chromium-resolver' to start Puppeteer. If Node.js is pre-installed, or if you install Node.js from the interface, 'puppeteer-chromium-resolver' is already built in.

Here is an example of starting Puppeteer.

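...
    // Resolve a usable Chromium build; PCR is the function exported by
    // 'puppeteer-chromium-resolver'.
    const pcr = await PCR({
        folderName: '.chromium-browser-snapshots',
        hosts: ["https://storage.googleapis.com", "https://npm.taobao.org/mirrors"],
        retry: 3
    });

    // Launch Puppeteer with the Chromium executable that was resolved above.
    const browser = await pcr.puppeteer.launch({
        headless: true,
        args: ['--no-sandbox'],
        executablePath: pcr.executablePath
    }).catch(function (error) {
        // A launch failure is only logged here, so browser is undefined afterwards.
        console.log(error);
    });

    const page = await browser.newPage();
...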
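Because the example logs a launch error and then carries on, a failed launch leaves `browser` undefined and the next call fails with a less helpful error. One possible alternative, sketched here under assumptions, is to let the error propagate to the caller. `launchBrowser` is a hypothetical helper, and `PCR` is passed in as a parameter (the function exported by 'puppeteer-chromium-resolver') so the helper itself has no hard dependency.

```javascript
// Sketch: resolve Chromium, then launch; any error rejects the returned promise.
// `PCR` is the resolver function; `launchOptions` are extra Puppeteer launch options.
async function launchBrowser(PCR, launchOptions = {}) {
    const pcr = await PCR({
        folderName: '.chromium-browser-snapshots',
        retry: 3
    });
    // No .catch here: the caller decides how to handle a failed launch.
    return pcr.puppeteer.launch({
        headless: true,
        args: ['--no-sandbox'],
        ...launchOptions,
        // Set last so the resolved Chromium path always wins.
        executablePath: pcr.executablePath
    });
}
```

The caller can then wrap `await launchBrowser(PCR)` in its own `try`/`catch` and stop the spider cleanly if Chromium cannot be started.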

For a complete example, see the JD mask product-grabbing spider on GitHub.

© 2020 Crawlab, made by Crawlab-Team, all rights reserved. File last modified: 2020-06-09 14:09:46
