Schedule
Spiders often need to run crawling tasks periodically. This is what a schedule is for.
A schedule in Crawlab is similar to crontab in Linux. It is a long-running job that runs spider tasks periodically.
Tips
If you would like to configure a web crawler that automatically runs crawling tasks every day, week or month, you should probably set up a schedule. A schedule is the right way to automate this, especially for spiders that crawl incremental content.
Create Schedule
- Navigate to the Schedules page.
- Click the New Schedule button on the top left.
- Enter basic info including Name, Cron Expression and Spider.
- Click Confirm.
The created schedule is enabled by default. Once created, an enabled schedule triggers tasks at the times specified by its cron expression.
Tips
You can check whether the schedule module works in Crawlab by creating a new schedule with the Cron Expression * * * * *, which means "every minute", and then checking whether a task is triggered when the next minute starts.
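To know when to look for the triggered task, it can help to compute the start of the next minute. The following is a minimal sketch using only the Python standard library; the function name next_minute_start is hypothetical and is not part of Crawlab.

```python
from datetime import datetime, timedelta


def next_minute_start(now: datetime) -> datetime:
    """Return the start of the minute that follows `now`."""
    # Drop seconds and microseconds, then step forward one minute.
    return now.replace(second=0, microsecond=0) + timedelta(minutes=1)


if __name__ == "__main__":
    now = datetime.now()
    print("Now:                ", now)
    print("Expect a task around", next_minute_start(now))
```

A schedule with the expression * * * * * would be expected to trigger a task at roughly the printed time, assuming the clock of the machine running Crawlab agrees with the local one.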
Enable/Disable Schedule
You can enable or disable a schedule by toggling the switch of the Enabled attribute on the Schedules page or the schedule detail page.
Cron Expression
A cron expression is a simple, standard format for describing the periodicity of tasks. It is the same format as the one used in Linux crontab.
* * * * * Command_to_execute
| | | | |
| | | | Day of the Week ( 0 - 6 ) ( Sunday = 0 )
| | | |
| | | Month ( 1 - 12 )
| | |
| | Day of Month ( 1 - 31 )
| |
| Hour ( 0 - 23 )
|
Min ( 0 - 59 )
- The asterisk (*) operator specifies all possible values for a field, e.g. every hour or every day.
- The comma (,) operator specifies a list of values, for example: "1,3,4,7,8".
- The dash (-) operator specifies a range of values, for example: "1-6", which is equivalent to "1,2,3,4,5,6".
- The slash (/) operator can be used to skip a given number of values. For example, "*/3" in the hour field is equivalent to "0,3,6,9,12,15,18,21": "*" specifies every hour, but "/3" means that only the first, fourth, seventh, and so on of the values given by "*" are used.
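The four operators above can be illustrated by expanding a single cron field into the explicit values it matches. This is a minimal sketch for illustration only, not Crawlab's own parser; the function expand_field and its parameters are hypothetical, and it handles only the asterisk, comma, dash and slash forms described above.

```python
from typing import List


def expand_field(field: str, low: int, high: int) -> List[int]:
    """Expand one cron field into the sorted list of values it matches.

    `low` and `high` are the inclusive bounds of the field,
    e.g. 0 and 23 for the hour field.
    """
    values = set()
    # The comma operator separates independent parts of a list.
    for part in field.split(","):
        # The slash operator appends an optional step to a part.
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step < 1:
            raise ValueError(f"step must be positive: {part!r}")

        if base == "*":
            # The asterisk covers every value of the field.
            start, end = low, high
        elif "-" in base:
            # The dash operator gives an inclusive range.
            start_text, end_text = base.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            # A single value.
            start = end = int(base)

        if start < low or end > high or start > end:
            raise ValueError(f"value out of range in {part!r}")
        values.update(range(start, end + 1, step))
    return sorted(values)


if __name__ == "__main__":
    print(expand_field("*/3", 0, 23))       # hour field
    print(expand_field("1-6", 0, 6))        # day of the week field
    print(expand_field("1,3,4,7,8", 0, 59)) # minute field
```

Expanding "*/3" over the hour range 0 to 23 yields the same list as the example in the slash bullet above.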
Note
Cron expressions in Crawlab use the same format as Linux crontab. That is to say, the smallest unit is the minute. This differs from some crontab-style scheduling frameworks whose smallest unit is the second.
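A quick way to catch an expression written for a second-based framework is to count its fields before pasting it into Crawlab. The sketch below is only a field-count check under that assumption; the helper name is_five_field_cron is hypothetical, and it does not validate the values inside each field.

```python
def is_five_field_cron(expression: str) -> bool:
    """Return True if the expression has exactly five whitespace-separated fields."""
    # Minute, hour, day of month, month, day of the week.
    return len(expression.split()) == 5


if __name__ == "__main__":
    print(is_five_field_cron("* * * * *"))    # five fields, minute resolution
    print(is_five_field_cron("0 * * * * *"))  # six fields, leading seconds field
```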
Tips
If you are not sure about your cron expression, you can go to https://crontab.guru to validate it.