As described in the [Deploy Spider](../Spider/Deploy.md) chapter, spiders are deployed to worker nodes automatically. The following diagram shows the architecture of Crawlab's spider deployment.
As shown in the figure above, the life cycle of automatic spider deployment is as follows (see `services/spider.go#InitSpiderService` for the source code):
- Every 5 seconds, the master node reads the spider information from the spider directory and updates it in the database (this step does not involve any file upload);
- Every 60 seconds, the master node fetches all spider information from the database, packages each spider into a zip file, uploads it to MongoDB GridFS, and writes the file ID into the `file_id` field of the `spiders` collection in MongoDB;
- The master node publishes a message (a `file.upload` event containing the file ID) to the worker nodes via Redis `PubSub`, notifying them to fetch the spider file;
- Upon receiving the message, the worker node downloads the zip file from MongoDB GridFS, decompresses it, and stores it locally.
All spiders are periodically deployed to the worker nodes in this way.
GridFS is MongoDB's specification for storing files that exceed the 16 MB BSON document size limit. Crawlab uses GridFS as an intermediate medium for spider file storage, which allows worker nodes to actively fetch files and deploy them locally. This bypasses traditional transport methods such as RPC, message queues, and HTTP, which would require more complex and cumbersome configuration and handling.
When Crawlab saves files to GridFS, two collections are generated: `files.files` and `files.chunks`. The former stores file metadata and the latter stores the file contents in chunks; the `file_id` field in `spiders` refers to the `_id` of `files.files`.
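Assuming a locally running MongoDB and the official Go driver, the GridFS round trip might look like the sketch below. A bucket named `files` is what produces the `files.files` and `files.chunks` collections; the connection URI, database name, and filename are placeholders, and this is not Crawlab's actual code:

```go
package main

import (
	"bytes"
	"context"
	"fmt"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/gridfs"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()
	// Connect to a local MongoDB instance (address is an assumption).
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)

	// A bucket named "files" yields the files.files / files.chunks collections.
	bucket, err := gridfs.NewBucket(
		client.Database("crawlab_demo"),
		options.GridFSBucket().SetName("files"),
	)
	if err != nil {
		panic(err)
	}

	// Upload: what the master node does with the packed spider zip.
	fileID, err := bucket.UploadFromStream("spider.zip", bytes.NewReader([]byte("zip bytes")))
	if err != nil {
		panic(err)
	}
	// This ObjectID is what would be written to the file_id field in spiders.
	fmt.Println("stored file_id:", fileID.Hex())

	// Download: what a worker node does after receiving the file.upload event.
	var buf bytes.Buffer
	if _, err := bucket.DownloadToStream(fileID, &buf); err != nil {
		panic(err)
	}
	fmt.Println("downloaded", buf.Len(), "bytes")
}
```

Running this requires a reachable MongoDB instance; the `_id` returned by `UploadFromStream` lands in `files.files`, while the payload is split across `files.chunks`.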
Reference: https://docs.mongodb.com/manual/core/gridfs/