Node

What is a Node?

A node is a Crawlab instance that performs specific functions within your distributed web crawling system. In simple terms, a node is a server running Crawlab software that can execute crawling tasks or provide management capabilities.

Nodes are the building blocks of Crawlab's distributed architecture, allowing you to scale your web crawling operations across multiple machines to increase throughput and resilience.

Types of Nodes

Crawlab uses a master-worker architecture with two distinct node types:

Master Node

The Master Node serves as the control center of your Crawlab system. It:

Manages and coordinates all nodes in the system
Assigns tasks to Worker Nodes and itself
Deploys and distributes spider files across the system
Provides APIs for the frontend application
Handles communication between nodes
Monitors system health and performance

info

There must be exactly ONE Master Node in a Crawlab cluster. This node is crucial as it orchestrates the entire system.

Worker Node

Worker Nodes focus on executing crawling tasks assigned by the Master Node. They:

Run crawling tasks as directed
Report task status and results back to the Master Node
Can be scaled horizontally to increase crawling capacity

tip

Adding more Worker Nodes allows you to:

Crawl more websites simultaneously
Distribute load across multiple machines
Improve fault tolerance
Overcome rate limiting by distributing requests across different IP addresses

info

A Crawlab cluster can have zero or more Worker Nodes. The system can function with just a Master Node, but adding Worker Nodes provides greater scalability.

System Architecture

Topology

Communication Flow

The Master Node assigns tasks to Worker Nodes
Worker Nodes execute their assigned tasks
Worker Nodes report task status and results back to the Master Node
The Master Node aggregates and stores results

Worker Nodes connect to the Master Node over gRPC (default port 9666) and register themselves on startup. The Master Node then monitors their health continuously through a heartbeat mechanism. Spider files are also synced from the Master Node to Worker Nodes over this gRPC connection.
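The register-then-heartbeat flow described above can be pictured with a small Python sketch. This is not Crawlab's implementation: the `NodeRegistry` class, the 5-second interval, and the threshold of 3 missed heartbeats are all assumptions chosen for illustration, and the real values may differ.

```python
from dataclasses import dataclass, field
from typing import Dict

# Assumed values for illustration only; Crawlab's real interval
# and missed-heartbeat threshold may differ.
HEARTBEAT_INTERVAL_SECONDS = 5.0
MAX_MISSED_HEARTBEATS = 3


@dataclass
class NodeRegistry:
    """Hypothetical helper that tracks when each node was last heard from."""

    last_seen: Dict[str, float] = field(default_factory=dict)

    def register(self, node_key: str, now: float) -> None:
        # A Worker Node registers once when it starts up.
        self.last_seen[node_key] = now

    def heartbeat(self, node_key: str, now: float) -> None:
        # Each heartbeat refreshes the node's last-seen timestamp.
        if node_key not in self.last_seen:
            raise KeyError(f"unregistered node: {node_key}")
        self.last_seen[node_key] = now

    def is_online(self, node_key: str, now: float) -> bool:
        # The node counts as offline once the allowed number of
        # heartbeat intervals has passed without a heartbeat.
        deadline = HEARTBEAT_INTERVAL_SECONDS * MAX_MISSED_HEARTBEATS
        return (now - self.last_seen[node_key]) <= deadline
```

Timestamps are passed in explicitly so the logic can be exercised without waiting in real time; a real monitor would typically read them from a monotonic clock.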

warning

v0.7 uses gRPC for all node-to-node communication, which is a breaking change from v0.6.x. v0.7 and v0.6.x nodes cannot interoperate in the same cluster. If you are upgrading from v0.6.x, follow the Migration Guide.

Node Management

Viewing Node Status

On the Nodes page of the Crawlab UI, you can view all registered nodes and their current status (online/offline). This helps you monitor the health of your crawling infrastructure.

Enabling and Disabling Nodes

You can temporarily remove a node from the task scheduling pool without removing it from the system:

Navigate to the Nodes page
Toggle the Enabled switch for the desired node
Alternatively, change this setting on the node detail page

Disabled nodes will not receive new tasks but will continue to run any currently executing tasks.

Configuring Maximum Concurrent Tasks

To control how many tasks a node can run simultaneously:

Navigate to the node detail page
Adjust the Max Runners setting

This setting helps you match resource usage to each node's capabilities. By default, Max Runners is unlimited.
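To make the interaction between the Enabled switch and Max Runners concrete, here is a minimal Python sketch of how a scheduler might decide which nodes can take a new task. The `Node` class and both functions are hypothetical, and using `None` to model the "unlimited" default is an assumption, not Crawlab's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of a node as seen by a task scheduler."""

    name: str
    online: bool = True
    enabled: bool = True
    max_runners: Optional[int] = None  # None models "unlimited" (assumption)
    running_tasks: int = 0


def can_accept_task(node: Node) -> bool:
    # Offline or disabled nodes receive no new tasks; tasks already
    # running on a disabled node are not affected by this check.
    if not (node.online and node.enabled):
        return False
    # With no limit configured, the node can always take another task.
    if node.max_runners is None:
        return True
    # Otherwise the node must still have a free runner slot.
    return node.running_tasks < node.max_runners


def eligible_nodes(nodes: List[Node]) -> List[str]:
    # Names of all nodes that could be handed a new task right now.
    return [n.name for n in nodes if can_accept_task(n)]
```

For example, a node with `max_runners=2` and `running_tasks=2` is skipped until one of its tasks finishes, while a disabled node is skipped regardless of how many runners it has free.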

tip

For production environments, set Max Runners based on:

Available CPU cores
Available memory
Network bandwidth limitations
Target website constraints

Node Deployment

Hardware Recommendations

Node Type	CPU	Memory	Disk Space
Master Node	2+ cores	4GB+	20GB+
Worker Node	2+ cores	2GB+	10GB+

Actual requirements will vary based on your specific workload and the complexity of your spiders.

Adding a New Node

To expand your Crawlab cluster by adding Worker Nodes:

Install Crawlab on the new server
Set the node type to "Worker" in the configuration (CRAWLAB_NODE_MASTER: "N")
Point it at the Master Node by setting CRAWLAB_MASTER_HOST to the Master Node's address, so it can connect over gRPC (default port 9666)
Start the Crawlab service; the Worker Node will then register with the Master Node and begin sending heartbeats

Worker Nodes no longer need direct access to the Master Node's database; they communicate with the Master Node exclusively over gRPC.
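Before starting a new Worker Node, it can help to sanity-check the two settings named in the steps above. The following Python sketch does that against a mapping of environment variables. The `check_worker_env` function is a hypothetical helper, not part of Crawlab, and it only checks that the values are present and look plausible.

```python
from typing import List, Mapping


def check_worker_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in a Worker Node's settings.

    Hypothetical helper: it inspects only the two variables named in
    this guide and does not contact the Master Node.
    """
    problems: List[str] = []
    # A Worker Node must not be configured as the master.
    if env.get("CRAWLAB_NODE_MASTER") != "N":
        problems.append('CRAWLAB_NODE_MASTER should be "N" on a Worker Node')
    # The worker needs the Master Node's address to connect over gRPC.
    if not env.get("CRAWLAB_MASTER_HOST", "").strip():
        problems.append("CRAWLAB_MASTER_HOST is not set")
    return problems
```

Calling `check_worker_env(os.environ)` on the new server (after `import os`) returns an empty list when both settings are in place.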

For detailed instructions, refer to Set up Worker Nodes in the Multi-Node Deployment section.

Troubleshooting

Common Node Issues

Node shows as offline
- Check if the Crawlab service is running
- Verify the Worker Node can reach the Master Node's gRPC port (default 9666)
- Confirm heartbeats are being received (a node is marked offline after repeated missed heartbeats)
Node not receiving tasks
- Check if the node is enabled
- Verify the node has not reached its Max Runners limit
- Check log files for potential errors
Communication issues between nodes
- Verify firewall settings allow gRPC traffic on the Master Node's port (default 9666)
- Confirm CRAWLAB_MASTER_HOST on each Worker Node points to the correct Master Node address
- Ensure consistent Crawlab versions across all nodes: v0.7 nodes cannot communicate with v0.6.x nodes (see the Migration Guide)
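The port and firewall checks above can be scripted. The Python sketch below tests whether a TCP connection to the Master Node's gRPC port can be opened from a Worker Node. The `can_reach` function is a hypothetical helper, and it confirms TCP reachability only: a successful connection does not prove that the gRPC service behind the port is healthy.

```python
import socket


def can_reach(host: str, port: int = 9666, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    9666 is the default gRPC port mentioned in this guide; pass a
    different port if your Master Node is configured otherwise.
    """
    try:
        # create_connection resolves the host and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Run it on the Worker Node with the same address you set in CRAWLAB_MASTER_HOST. A `False` result usually points to a firewall rule, a wrong address, or a Master Node that is not running.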

Best Practices

Start small: Begin with a single Master Node and add Worker Nodes as needed
Monitor resource usage: Adjust Max Runners based on actual performance
Regular maintenance: Update all nodes simultaneously to avoid version conflicts
Geographic distribution: For global crawling, consider placing Worker Nodes in different regions
Back up the Master Node: It is critical to the system, so put proper backup procedures in place

info

While you can run multiple Crawlab instances (nodes) on a single physical server, it's generally NOT recommended. A single instance per server is typically more efficient.
