Introduction
If you already know what Crawlab is and what it is used for, you can head straight to Quick Start or Installation to install and start using Crawlab.
If you are not familiar with Crawlab, read the sections below to learn more about it.
What is Crawlab?
Crawlab is a powerful Web Crawler Management Platform (WCMP) that can run web crawlers and spiders developed in various programming languages, including Python, Go, Node.js, Java and C#, as well as with frameworks such as Scrapy, Colly, Selenium and Puppeteer. It is used for running, managing and monitoring web crawlers, particularly in production environments where traceability, scalability and stability are the main concerns.
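To give a concrete sense of the kind of program Crawlab manages, below is a minimal Scrapy spider sketch. The spider name, start URL and parsed fields are purely illustrative; a spider written like this can be uploaded to Crawlab and executed as a task.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal example spider of the kind Crawlab can deploy and run."""

    name = "quotes"  # illustrative spider name
    start_urls = ["https://quotes.toscrape.com"]  # illustrative start URL

    def parse(self, response):
        # Yield one item per quote block; Crawlab runs the spider as a task
        # and can persist yielded items through its data storage integration.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```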
Background and History
The Crawlab project has been under continuous development since it was first released in March 2019, and it has gone through a number of major releases. It was initially designed to solve the management problems that arise when a large number of spiders need to be coordinated and executed. With many improvements and newly added features, Crawlab has become increasingly popular in developer communities, particularly among web crawler engineers.
Who can use Crawlab?
- Web Crawler Engineers. By integrating web crawler programs into Crawlab, you can focus on the crawling and parsing logic instead of spending time writing common modules such as task queues, storage, logging and notification.
- Operation Engineers. The main benefit of Crawlab for operation engineers is convenient deployment, for both crawler programs and Crawlab itself. Crawlab supports easy installation with Docker and Kubernetes.
- Data Analysts. Data analysts who can code (e.g. in Python) can develop web crawler programs (e.g. with Scrapy) and upload them to Crawlab, then leave the rest of the dirty work to Crawlab, which will automatically collect the data.
- Others. Technically, everyone can enjoy the convenience and ease of automation provided by Crawlab. Although Crawlab excels at running web crawler tasks, it can also be used for other types of tasks such as data processing and automation, as sketched below.
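As a sketch of such a non-crawler task, the hypothetical data-processing script below could be scheduled by Crawlab like any spider; its console output would be captured as the task log (see Task Logging under Main Features). The file path and column name are assumptions for illustration only.

```python
"""Hypothetical data-processing task: count rows per category in a CSV file."""

import csv
import sys


def main(path: str) -> None:
    # Aggregate a simple count per category from a CSV file produced elsewhere.
    counts: dict[str, int] = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["category"]] = counts.get(row["category"], 0) + 1
    for category, count in sorted(counts.items()):
        # Printed lines appear in Crawlab's task log view.
        print(f"{category}: {count}")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "data.csv")
```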
Main Features
Category | Feature | Description |
---|---|---|
Node | Node Management | Register, manage and control multiple nodes in the distributed system |
Spider | Spider Deployment | Auto-deploy spiders to multiple nodes and auto-sync spider files, including scripts and programs |
 | Spider Code Editing | Update and edit spider code with the online editor on the go |
 | Spider Stats | Spider crawling statistics such as average running time and results count |
 | Framework Integration | Integrate spider frameworks such as Scrapy |
 | Data Storage Integration | Automatically save results data to the database without additional configuration |
 | Git Integration | Version control through embedded or external remote Git repos |
Task | Task Scheduling | Assign and schedule crawling tasks to multiple nodes in the distributed system |
 | Task Logging | Automatically save task logs, which can be viewed in the frontend UI |
 | Task Stats | Visually display task stats, including results count and running time |
User | User Management | Create, update and delete user accounts |
Other | Dependency Management | Search and install Python and Node.js package dependencies |
 | Notification | Automatic email or mobile notifications when tasks are triggered or completed |