Crawlab AI SDK

Crawlab AI provides an SDK for users to integrate Crawlab AI into their own applications. The SDK is currently available in Python.

Installation

You can install the SDK using the following command:

pip install crawlab-ai

Authentication

To use the SDK, you need to obtain an API token from the Crawlab AI platform. The API token is used to authenticate the requests made by the SDK. You can obtain it by signing up for an account on the Crawlab official site; make sure you have a valid license to use the API.

Once you have obtained the API token, you can set it as an environment variable in your application:

export CRAWLAB_TOKEN=<your-api-token>

Or, you can set it with CLI:

crawlab-ai config -t <your-api-token>
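
If you prefer to set the token from within your Python code instead of the shell, here is a minimal sketch (this assumes the SDK reads the CRAWLAB_TOKEN environment variable, as used in the export command above):

import os

# Assumption: the SDK picks up the token from the CRAWLAB_TOKEN environment variable
os.environ["CRAWLAB_TOKEN"] = "<your-api-token>"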

Usage as CLI

The SDK can be used as a command-line interface (CLI) tool, which provides a set of commands to interact with the Crawlab AI platform.

Crawl a Webpage

You can crawl a webpage using the SDK as CLI as follows:

crawlab-ai crawl [-h] [-t {article,list}] [-o OUTPUT] <url>

The -t option specifies the type of content to extract, which can be either article or list. The -o option specifies the output path for the extracted content. For the article type, the output is in JSON format, while for the list type, it is in CSV format. If the output path is not specified, the content will be printed to the console.

For example, to extract an article from a webpage, you can run the following command:

crawlab-ai crawl -t article https://www.example.com

To extract a list of items from a webpage, you can run the following command:

crawlab-ai crawl -t list -o output.csv https://www.example.com

Code Generation

You can generate code snippets for extracting content from a webpage using the SDK as CLI as follows:

crawlab-ai codegen [-h] [-t {article,list}] [-o OUTPUT] <url>

The -t option specifies the type of content to extract, which can be either article or list. The -o option specifies the output path for the generated code. If the output path is not specified, the code will be printed to the console.

For example, to generate code for extracting an article from a webpage, you can run the following command:

crawlab-ai codegen -t article https://www.example.com

To generate code for extracting a list of items from a webpage, you can run the following command:

crawlab-ai codegen -t list -o output.py https://www.example.com
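
The generated file is plain Python, so, assuming the generated script is self-contained, you can run it directly:

python output.py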

Usage as Python Library

You can also use the SDK as a Python library in your application.

Extracting List Content

Here is an example of how to use the SDK to extract list content from a webpage:

from crawlab_ai import read_list

# Define the URL
url = "https://example.com"

# Get the data without specifying fields
df = read_list(url=url)
print(df)

# You can also specify fields
fields = ["title", "content"]
df = read_list(url=url, fields=fields)

# You can also return a list of dictionaries instead of a DataFrame
data = read_list(url=url, as_dataframe=False)
print(data)
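
Since read_list returns a DataFrame by default, you can persist the result with the standard pandas API. A minimal sketch, assuming the returned object is a regular pandas DataFrame (the field names are the same illustrative ones used above):

from crawlab_ai import read_list

# Extract the list; the fields here are only examples
df = read_list(url="https://example.com", fields=["title", "content"])

# Save the extracted rows using the standard pandas API
df.to_csv("output.csv", index=False)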

Extracting Article Content

Here is an example of how to use the SDK to extract article content from a webpage:

from crawlab_ai import read_article

# Define the URL
url = "https://example.com"

# Get the data
data = read_article(url=url)
print(data)
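
The CLI writes articles as JSON, so here is a reasonable sketch for persisting the result, assuming the returned data is a JSON-serializable dictionary:

import json

from crawlab_ai import read_article

# Extract the article content
data = read_article(url="https://example.com")

# Persist it as JSON; assumes the result is JSON-serializable
with open("article.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)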

Scrapy Integration

You can also integrate the SDK with Scrapy to extract content from webpages.

You can start by creating a Scrapy spider that inherits from ScrapyListSpider:

from crawlab_ai import ScrapyListSpider

class MySpider(ScrapyListSpider):
    name = "my_spider"
    start_urls = ["https://example.com"]

And that's it! You don't have to write any additional code to extract content from the webpage. The SDK will take care of the rest.

If you would like to specify the fields to extract, you can do so by overriding the fields attribute in your spider:

from crawlab_ai import ScrapyListSpider

class MySpider(ScrapyListSpider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    fields = ["title", "content"]

Then, you can run the spider using the following command:

scrapy crawl my_spider
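
As with any Scrapy spider, you can export the scraped items with Scrapy's built-in feed exports, assuming the spider yields its extracted rows as regular items:

scrapy crawl my_spider -o items.csv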