Integration with Puppeteer
Calling the SDK
For integration with Puppeteer, please refer to the Node.js spider integration. The only method you need to call is 'crawlab.saveItem'.
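As a minimal sketch, saving a scraped item could look like the following; the package name 'crawlab-sdk' and the item fields are assumptions for illustration, so check the Node.js spider integration page for the exact import.
// Assumption: the Node.js SDK is installed as 'crawlab-sdk'; adjust the
// require to match the package name in the Node.js spider integration docs.
const crawlab = require('crawlab-sdk');

// Save one result item; the field names here are only illustrative.
crawlab.saveItem({
    title: 'Example product',
    price: '9.99',
    url: 'https://example.com/item/1'
});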
Avoid memory leaks
Because Puppeteer uses Chromium to run the spider, the browser process may not be terminated when the spider exits, which causes memory leaks. To avoid this, run the spider with the 'dumb-init' tool. When creating a Puppeteer spider, enter the following as the 'execute command'.
dumb-init -- <command>
For example:
dumb-init -- node spider.js
For Docker users, 'dumb-init' is built into the image, so you can use it directly. If you deploy Crawlab directly (without Docker), you need to download and install 'dumb-init' yourself.
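As a sketch, on a 64-bit Linux node you could fetch a release binary from the Yelp/dumb-init GitHub releases page; the version and architecture below are assumptions, so pick the ones that match your host.
wget -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64
chmod +x /usr/local/bin/dumb-init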
The right way to start Puppeteer
Puppeteer relies on Chromium as its engine, so you need to know the correct path to the Chromium executable. We recommend using 'puppeteer-chromium-resolver' to start Puppeteer. If you pre-install Node.js or install Node.js from the web interface, 'puppeteer-chromium-resolver' is already built in.
Here is an example of starting Puppeteer.
...
// Import the resolver (this require belongs at the top of the spider file)
const PCR = require('puppeteer-chromium-resolver');

// Resolve a local Chromium build, downloading it from the listed hosts if needed
const pcr = await PCR({
    folderName: '.chromium-browser-snapshots',
    hosts: ["https://storage.googleapis.com", "https://npm.taobao.org/mirrors"],
    retry: 3
});

// Launch Puppeteer against the resolved Chromium executable
const browser = await pcr.puppeteer.launch({
    headless: true,
    args: ['--no-sandbox'],
    executablePath: pcr.executablePath
}).catch(function (error) {
    console.log(error);
});

const page = await browser.newPage();
...
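Building on the snippet above, here is a minimal sketch of how the launched page might be used in a spider, assuming the 'crawlab-sdk' import from earlier; the URL, selector, and field name are placeholders only.
// Illustrative only: the URL, selector, and field below are placeholders.
await page.goto('https://example.com/products', { waitUntil: 'networkidle2' });

// Extract one field from the page; adapt the selector to the real site.
const title = await page.$eval('h1', el => el.textContent.trim());

// Hand the result to Crawlab so it appears in the task's results.
crawlab.saveItem({ title });

// Always close the browser so Chromium processes do not linger.
await browser.close();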
For a complete example, please refer to the JD mask product scraping spider on GitHub.