Selenium Spider Integration
Introduction to Selenium
Selenium is a tool primarily used for web application testing, but it can also be used to write web scrapers. Unlike traditional HTTP request libraries (such as Requests), Selenium allows you to simulate browser behavior and automate the browser to gather data. This is particularly useful for scraping dynamic web pages that require JavaScript rendering.
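To make the difference concrete, here is a minimal sketch (separate from the 36kr spider below; the URL is a placeholder) that launches a headless Chrome, loads a page, lets its JavaScript run, and reads the rendered result:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# start a headless Chrome (no visible window)
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# load the page; the browser executes its JavaScript, so dynamically
# rendered content is present in the DOM, unlike a plain HTTP response
driver.get('https://example.com')  # placeholder URL
print(driver.title)                # title of the fully rendered page

driver.quit()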
Integrating a Selenium Spider into Crawlab
Below, we explain how to integrate a Selenium spider into Crawlab and display the scraping results in the Crawlab frontend interface, using the 36kr (36氪) news site as an example.
Creating the Spider
In the Crawlab spider list, create a spider named "36kr" with the execution command python main.py.
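Note that the node running this spider needs the selenium package and the Crawlab SDK installed, plus a Chrome browser it can drive (recent Selenium versions can fetch a matching driver automatically); the exact setup depends on how your Crawlab nodes are deployed.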
Editing the Spider File
Create and open the main.py file and enter the following content:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from crawlab import save_item

# create web driver with chrome
chrome_options = Options()
chrome_options.add_argument('--headless')  # run Chrome without a visible UI
chrome_options.add_argument('--no-sandbox')  # required when Chrome runs as root (e.g. in Docker)
chrome_options.add_argument('--disable-dev-shm-usage')  # avoid crashes from the small /dev/shm in containers
browser = webdriver.Chrome(options=chrome_options)

# navigate to the news list page
browser.get('https://36kr.com/information/web_news/')

# get article items
items = browser.find_elements(by=By.CSS_SELECTOR, value='.information-flow-list > .information-flow-item')

# iterate through items
for item in items:
    # extract fields
    el_title = item.find_element(by=By.CSS_SELECTOR, value='.article-item-title')
    title = el_title.text
    url = el_title.get_attribute('href')
    topic = item.find_element(by=By.CSS_SELECTOR, value='.kr-flow-bar-motif > a').text
    description = item.find_element(by=By.CSS_SELECTOR, value='.article-item-description').text
    try:
        pic_url = item.find_element(by=By.CSS_SELECTOR, value='.article-item-pic > img').get_attribute('src')
    except NoSuchElementException:
        # some articles have no picture
        pic_url = None

    # save to Crawlab
    save_item({
        'title': title,
        'url': url,
        'topic': topic,
        'description': description,
        'pic_url': pic_url,
    })

# close the browser and free its resources
browser.quit()
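Because the 36kr list is rendered by JavaScript, the article items may not yet be in the DOM when browser.get() returns. If the spider intermittently finds zero items, one option (a sketch, not part of the original script) is to replace the find_elements call with an explicit wait, adding the two imports at the top of main.py:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one article item to appear,
# then return the list of matching elements
items = WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, '.information-flow-list > .information-flow-item')
    )
)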
In this code, we define the chrome_options for the Chrome browser and include the following important parameters:

chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

Note
These parameters are crucial; without them, the script may not run correctly in Crawlab. --headless runs Chrome without a display (Crawlab nodes have no GUI), --no-sandbox is required when Chrome runs as root inside a Docker container, and --disable-dev-shm-usage works around the limited /dev/shm size in containers.
Finally, we use the save_item method from the Crawlab SDK to save the scraping results.
Running the Spider
Run the "36kr" spider in Crawlab to obtain the scraping results.