Introduction

If you already know what Crawlab is and what it is used for, you can head straight to Quick Start or Installation to install and start using Crawlab.

If you are not familiar with Crawlab, you can read the sections below to learn more about it.

What is Crawlab?

Crawlab is a powerful Web Crawler Management Platform (WCMP) that can run web crawlers and spiders developed in various programming languages, including Python, Go, Node.js, Java and C#, as well as frameworks such as Scrapy, Colly, Selenium and Puppeteer. It is used for running, managing and monitoring web crawlers, particularly in production environments where traceability, scalability and stability are major concerns.

Background and History

The Crawlab project has been under continuous development since its initial release in March 2019 and has gone through a number of major releases. It was originally designed to solve the management problem of coordinating and executing a large number of spiders. With many improvements and newly added features, Crawlab has become increasingly popular in developer communities, particularly among web crawler engineers.

Change Logs

Who can use Crawlab?

  • Web Crawler Engineers. By integrating your web crawler programs into Crawlab, you can focus on the crawling and parsing logic instead of spending time writing common modules such as task queues, storage, logging and notifications.
  • Operations Engineers. The main benefit of Crawlab for operations engineers is convenient deployment, of both crawler programs and Crawlab itself. Crawlab supports easy installation with Docker and Kubernetes.
  • Data Analysts. Data analysts who can code (e.g. in Python) can develop web crawler programs (e.g. with Scrapy) and upload them into Crawlab, then leave the rest of the dirty work to Crawlab, which will automatically collect the data for them (see the sketch after this list).
  • Others. Technically, everyone can enjoy the convenience and ease of automation provided by Crawlab. Although Crawlab is best at running web crawler tasks, it can also be used for other types of tasks such as data processing and automation.
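As a concrete illustration of the Data Analysts scenario above, below is a minimal Scrapy spider of the kind you might upload to Crawlab. The spider itself is ordinary Scrapy code; the target site, spider name and CSS selectors are made up for illustration and are not part of Crawlab.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal Scrapy spider; the target site and selectors are illustrative only."""

    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block; with Crawlab's framework integration,
        # yielded items can be saved as task results (see Data Storage Integration below).
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow pagination links, if any.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Once uploaded, Crawlab schedules and runs the spider as a task and keeps its logs and results, so no extra orchestration code is needed.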

Main Features

| Category | Feature | Description |
| --- | --- | --- |
| Node | Node Management | Register, manage and control multiple nodes in the distributed system |
| Spider | Spider Deployment | Auto-deploy spiders to multiple nodes and auto-sync spider files, including scripts and programs |
| Spider | Spider Code Editing | Update and edit script code with the online editor on the go |
| Spider | Spider Stats | Spider crawling statistics such as average running time and results count |
| Spider | Framework Integration | Integrate spider frameworks such as Scrapy |
| Spider | Data Storage Integration | Automatically save results data in the database without additional configuration (see the sketch below) |
| Spider | Git Integration | Version control through embedded or external remote Git repos |
| Task | Task Scheduling | Assign and schedule crawling tasks to multiple nodes in the distributed system |
| Task | Task Logging | Automatically save task logs, which can be viewed in the frontend UI |
| Task | Task Stats | Visually display task stats, including results count and running time |
| User | User Management | Create, update and delete user accounts |
| Other | Dependency Management | Search and install dependencies such as Python and Node.js packages |
| Other | Notification | Automatic email or mobile notifications when tasks are triggered or complete |
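To give a sense of the Data Storage Integration feature above: for frameworks such as Scrapy, results yielded by the spider can be stored without extra configuration, while plain scripts can save results explicitly through the Crawlab SDK. The sketch below assumes the Python SDK exposes a save_item helper; check the SDK documentation for the exact package name and API.

```python
# A minimal sketch of saving results from a plain Python script running in Crawlab.
# Assumption: the Crawlab Python SDK provides a save_item() helper; verify the
# exact import path and signature in the SDK documentation.
from crawlab import save_item

results = [
    {"title": "Example page", "url": "https://example.com"},
]

for item in results:
    # Each saved item becomes a result row of the current task in Crawlab's database.
    save_item(item)
```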