Node
What is a Node?
A node is a Crawlab instance that performs specific functions within your distributed web crawling system. In simple terms, a node is a server running Crawlab software that can execute crawling tasks or provide management capabilities.
Nodes are the building blocks of Crawlab's distributed architecture, allowing you to scale your web crawling operations across multiple machines to increase throughput and resilience.
Types of Nodes
Crawlab uses a master-worker architecture with two distinct node types:
Master Node
The Master Node serves as the control center of your Crawlab system. It:
- Manages and coordinates all nodes in the system
- Assigns tasks to Worker Nodes and itself
- Deploys and distributes spider files across the system
- Provides APIs for the frontend application
- Handles communication between nodes
- Monitors system health and performance
There must be exactly ONE Master Node in a Crawlab cluster. This node is crucial as it orchestrates the entire system.
Worker Node
Worker Nodes focus on executing crawling tasks assigned by the Master Node. They:
- Run crawling tasks as directed
- Report task status and results back to the Master Node
- Can be scaled horizontally to increase crawling capacity
Adding more Worker Nodes allows you to:
- Crawl more websites simultaneously
- Distribute load across multiple machines
- Improve fault tolerance
- Overcome rate limiting by distributing requests across different IP addresses
A Crawlab cluster can have zero or more Worker Nodes. The system can function with just a Master Node, which then runs tasks itself, but adding Worker Nodes allows for greater scalability.
System Architecture
Topology
Communication Flow
- The Master Node assigns tasks to Worker Nodes
- Worker Nodes execute their assigned tasks
- Worker Nodes report task status and results back to the Master Node
- The Master Node aggregates and stores results
Worker Nodes connect to the Master Node over gRPC (default port 9666), register themselves with the Master Node on startup, and are continuously health-monitored through a heartbeat mechanism. Spider files are also synced from the Master Node to Worker Nodes over this gRPC connection.
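The register-then-heartbeat pattern described above can be illustrated with a toy model. This is only a sketch of the general technique, not Crawlab's implementation: the `NodeRegistry` class, the 5-second interval, and the threshold of 3 missed heartbeats are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed values for illustration; Crawlab's real interval and threshold may differ.
HEARTBEAT_INTERVAL_S = 5.0
MAX_MISSED_HEARTBEATS = 3


@dataclass
class NodeRegistry:
    """Toy model of a master tracking the last heartbeat of each node."""

    last_seen: dict[str, float] = field(default_factory=dict)

    def register(self, node_key: str, now: float) -> None:
        # A worker registers once on startup; treat that as its first heartbeat.
        self.last_seen[node_key] = now

    def heartbeat(self, node_key: str, now: float) -> None:
        if node_key not in self.last_seen:
            raise KeyError(f"unregistered node: {node_key}")
        self.last_seen[node_key] = now

    def is_online(self, node_key: str, now: float) -> bool:
        # Offline once the silence exceeds MAX_MISSED_HEARTBEATS intervals.
        silence = now - self.last_seen[node_key]
        return silence <= HEARTBEAT_INTERVAL_S * MAX_MISSED_HEARTBEATS


registry = NodeRegistry()
registry.register("worker-1", now=0.0)
registry.heartbeat("worker-1", now=5.0)
print(registry.is_online("worker-1", now=12.0))  # True: 7 s of silence
print(registry.is_online("worker-1", now=25.0))  # False: 20 s of silence
```

Timestamps are passed in explicitly so the behavior is easy to follow; a real monitor would read a monotonic clock instead.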
v0.7 uses gRPC for all node-to-node communication, which is a breaking change from v0.6.x. v0.7 and v0.6.x nodes cannot interoperate in the same cluster. If you are upgrading from v0.6.x, follow the Migration Guide.
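Before upgrading, it can help to check that every node is on the same release series as the Master Node. The sketch below assumes you have already collected each node's version string by some means (the `cluster` dictionary is hypothetical data), and it assumes that nodes sharing the same major.minor series can interoperate. The text above only guarantees that v0.7 and v0.6.x cannot.

```python
from __future__ import annotations


def release_series(version: str) -> tuple[int, int]:
    """Return (major, minor) from a string such as 'v0.7.0' or '0.6.3'."""
    parts = version.strip().lstrip("vV").split(".")
    return int(parts[0]), int(parts[1])


def find_mismatched_nodes(master_version: str, node_versions: dict[str, str]) -> list[str]:
    """List the nodes whose release series differs from the master's."""
    expected = release_series(master_version)
    return sorted(
        name
        for name, version in node_versions.items()
        if release_series(version) != expected
    )


# Hypothetical inventory of node versions, gathered outside this script.
cluster = {"worker-1": "v0.7.0", "worker-2": "v0.6.3"}
print(find_mismatched_nodes("v0.7.0", cluster))  # ['worker-2']
```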
Node Management
Viewing Node Status
On the Nodes page of the Crawlab UI, you can view all registered nodes and their current status (online/offline). This helps you monitor the health of your crawling infrastructure.
Enabling and Disabling Nodes
You can temporarily remove a node from the task scheduling pool without removing it from the system:
- Navigate to the Nodes page
- Toggle the Enabled switch for the desired node
- Alternatively, you can change this setting in the node detail page
Disabled nodes will not receive new tasks but will continue to run any currently executing tasks.
Configuring Maximum Concurrent Tasks
To control how many tasks a node can run simultaneously:
- Navigate to the node detail page
- Adjust the Max Runners setting
This setting helps you optimize resource usage based on each node's capabilities. By default, this is set to unlimited.
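Taken together, the Enabled switch and the Max Runners setting decide whether a node can take a new task. The following is a minimal sketch of that rule, not Crawlab's scheduler: the `Node` class and its field names are hypothetical, and representing "unlimited" as `None` is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    enabled: bool = True
    max_runners: Optional[int] = None  # None models the "unlimited" default (assumed)
    running_tasks: int = 0


def can_accept_task(node: Node) -> bool:
    """A node takes new tasks only if it is enabled and below its limit."""
    if not node.enabled:
        # Disabled nodes keep running current tasks but receive no new ones.
        return False
    if node.max_runners is None:
        return True
    return node.running_tasks < node.max_runners


def schedulable(nodes: List[Node]) -> List[str]:
    return [node.name for node in nodes if can_accept_task(node)]


nodes = [
    Node("master"),
    Node("worker-1", enabled=False, running_tasks=2),
    Node("worker-2", max_runners=2, running_tasks=2),
    Node("worker-3", max_runners=4, running_tasks=1),
]
print(schedulable(nodes))  # ['master', 'worker-3']
```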
For production environments, it's recommended to set Max Runners based on:
- Available CPU cores
- Available memory
- Network bandwidth limitations
- Target website constraints
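As a rough starting point, the factors above can be combined by taking the most restrictive one. The heuristic below is an illustrative assumption, not an official Crawlab formula: the defaults of 2 runners per core and 0.5 GB per task are placeholders that you should replace with measurements from your own spiders, and network bandwidth is not modeled.

```python
from typing import Optional


def suggest_max_runners(
    cpu_cores: int,
    memory_gb: float,
    memory_per_task_gb: float = 0.5,   # assumed average footprint of one task
    runners_per_core: int = 2,         # assumed; crawling is often I/O-bound
    site_limit: Optional[int] = None,  # cap imposed by the target website, if any
) -> int:
    """Suggest a Max Runners value as the tightest of the given constraints."""
    by_cpu = cpu_cores * runners_per_core
    by_memory = int(memory_gb // memory_per_task_gb)
    limit = min(by_cpu, by_memory)
    if site_limit is not None:
        limit = min(limit, site_limit)
    return max(1, limit)  # never suggest fewer than one runner


print(suggest_max_runners(cpu_cores=2, memory_gb=2.0))                # 4
print(suggest_max_runners(cpu_cores=4, memory_gb=2.0))                # 4 (memory-bound)
print(suggest_max_runners(cpu_cores=4, memory_gb=8.0, site_limit=2))  # 2
```

Treat the result as an initial value and adjust it after observing real CPU, memory, and error rates on the node.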
Node Deployment
Hardware Recommendations
| Node Type | CPU | Memory | Disk Space |
|---|---|---|---|
| Master Node | 2+ cores | 4GB+ | 20GB+ |
| Worker Node | 2+ cores | 2GB+ | 10GB+ |
Actual requirements will vary based on your specific workload and the complexity of your spiders.
Adding a New Node
To expand your Crawlab cluster by adding Worker Nodes:
- Install Crawlab on the new server
- Set the node type to "Worker" in the configuration (`CRAWLAB_NODE_MASTER: "N"`)
- Point it at the Master Node by setting `CRAWLAB_MASTER_HOST` to the Master Node's address, so it can connect over gRPC (default port `9666`)
- Start the Crawlab service. The Worker Node will register with the Master Node and begin sending heartbeats
Worker Nodes no longer need direct access to the Master Node's database; they communicate with the Master Node exclusively over gRPC.
For detailed instructions, refer to Set up Worker Nodes in the Multi-Node Deployment section.
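A small pre-flight check can catch the two most likely configuration mistakes before the service starts. The sketch below only inspects the two variables named in the steps above; the function name `check_worker_env` and the example address are hypothetical, and the script is not part of Crawlab.

```python
from typing import List, Mapping


def check_worker_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in a Worker Node's environment."""
    problems = []
    if env.get("CRAWLAB_NODE_MASTER") != "N":
        problems.append('CRAWLAB_NODE_MASTER should be "N" on a Worker Node')
    if not env.get("CRAWLAB_MASTER_HOST", "").strip():
        problems.append("CRAWLAB_MASTER_HOST is not set")
    return problems


# Example with a hypothetical master address.
good_env = {"CRAWLAB_NODE_MASTER": "N", "CRAWLAB_MASTER_HOST": "10.0.0.5"}
bad_env = {"CRAWLAB_NODE_MASTER": "Y"}
print(check_worker_env(good_env))  # []
print(len(check_worker_env(bad_env)))  # 2
```

To check the real environment on the new server, pass `os.environ` instead of a literal dictionary.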
Troubleshooting
Common Node Issues
- Node shows as offline
  - Check if the Crawlab service is running
  - Verify the Worker Node can reach the Master Node's gRPC port (default `9666`)
  - Confirm heartbeats are being received (a node is marked offline after repeated missed heartbeats)
- Node not receiving tasks
  - Check if the node is enabled
  - Verify the node has not reached its Max Runners limit
  - Check log files for potential errors
- Communication issues between nodes
  - Verify firewall settings allow gRPC traffic on the Master Node's port (default `9666`)
  - Confirm `CRAWLAB_MASTER_HOST` on each Worker Node points to the correct Master Node address
  - Ensure consistent Crawlab versions across all nodes. v0.7 nodes cannot communicate with v0.6.x nodes (see the Migration Guide)
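To test reachability of the Master Node's gRPC port from a Worker Node, a plain TCP connection attempt is often enough to rule out firewall or routing problems. The helper below is a generic sketch using Python's standard library. The function name `can_reach` is hypothetical, and a successful connection only shows that the port is open, not that the gRPC service behind it is healthy.

```python
import socket
import sys


def can_reach(host: str, port: int = 9666, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Pass the Master Node's address as the first argument; defaults to localhost.
    master_host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    status = "reachable" if can_reach(master_host) else "NOT reachable"
    print(f"{master_host}:9666 is {status}")
```

Run it on the Worker Node with the same address you set in `CRAWLAB_MASTER_HOST`.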
Best Practices
- Start small: Begin with a single Master Node and add Worker Nodes as needed
- Monitor resource usage: Adjust Max Runners based on actual performance
- Regular maintenance: Update all nodes simultaneously to avoid version conflicts
- Geographic distribution: For global crawling, consider placing Worker Nodes in different regions
- Backup the Master Node: As it's critical to the system, ensure proper backup procedures
While you can run multiple Crawlab instances (nodes) on a single physical server, it's generally NOT recommended. A single instance per server is typically more efficient.