Web Scraping with Python and Selenium: Tutorial and Examples

What Is Web Scraping?

Web scraping is the process of extracting data from web pages. It usually involves a program or script that retrieves information presented on websites, which can then be used in many areas and for many purposes: comparing competitors' prices and services, analyzing consumer preferences, monitoring news and events, and much more. To go deeper into web scraping with Selenium and CapMonster Cloud, check out our comprehensive guide.

The steps of web scraping include:

  • Defining objectives: The first step is to understand what information needs to be extracted and from which web resources.
  • Analyzing the structure of the target web page: Examine the HTML code of the pages to understand where and how the desired information is stored, identifying elements such as tags, IDs, classes, etc.
  • Developing a script to retrieve the data: Write code (e.g., in Python using the Selenium browser-automation library) that visits web pages, retrieves the necessary data, and saves it in a structured form.
  • Processing the data: Extracted data often requires transformation for further use, such as removing duplicates, correcting formats, or filtering out unnecessary entries.
  • Saving the data: Save the extracted and processed data in a convenient format, e.g., CSV, JSON, or a database (see the sketch after this list).
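As an illustration of the last step, here is a minimal sketch of saving scraped records to a CSV file with Python's built-in csv module (the field names and values below are placeholders, not data from a real scrape):

import csv

# Placeholder records; in practice these would come from your scraper
products = [
    {"name": "Asus VivoBook", "price": "$295.99"},
    {"name": "Acer Aspire", "price": "$306.99"},
]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()        # write the column headers
    writer.writerows(products)  # one row per scraped record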

It is also important to keep track of changes on the target site so that you can update the script in a timely manner if necessary. 

How to Work With Python and Selenium

Why are Python and Selenium handy for web scraping?

The Python programming language and Selenium library are often used for web scraping for the following reasons:

  • Ease of use: Python is easy to learn and has many libraries for web scraping.
  • Action emulation and access to dynamic content: Selenium can automate user actions in a web browser, including scrolling pages and clicking the buttons required to load data.
  • Bypassing anti-bot mechanisms: Some sites have special security mechanisms, and emulating real user actions helps to get past them.
  • Broad support: A large community and many resources make these tools easy to work with.

Installing and Importing Tools

Let's look at an example of a simple web-scraping script. We'll use the page https://webscraper.io/test-sites/e-commerce/allinone/product/123, where we will look up the price listed in the product card for the 128 GB HDD option.

  • If you don't have Python installed on your computer yet, go to the official Python website and download the appropriate version for your operating system (Windows, macOS, Linux). In the terminal of your development environment, you can check the Python version with the following command: 

python --version

  • Next, you need to install Selenium by executing the command: 

pip install selenium
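You can then verify the installation, for example by printing the library version from the command line (a quick sanity check, not a required step):

python -c "import selenium; print(selenium.__version__)"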

  • Create a new file and import the required libraries:

import time
from selenium import webdriver

  • You also need to import the By class; we will use it to define strategies for locating elements on the web page in Selenium:

from selenium.webdriver.common.by import By

Chrome driver settings

ChromeDriver options (ChromeOptions) are needed to configure the behavior and startup settings of the Chrome browser when automating with Selenium. Here are some of these options: 

--headless: launches Chrome without a graphical interface
--disable-gpu: disables GPU usage; often used with --headless to prevent graphics-rendering errors
--disable-popup-blocking: disables pop-up blocking
--disable-extensions: disables all browser extensions
--incognito: starts the browser in incognito mode, which does not save history, cookies, or other session data
--window-size=width,height: sets the size of the browser window
--user-agent=<user-agent>: sets a custom user-agent string to emulate different devices or browsers

A full list of Chrome command-line switches is available in the Chromium documentation.
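For instance, here is a sketch of a headless setup that combines several of these flags (useful for running on servers without a display; adjust the flags to your needs):

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')               # no visible browser window
chrome_options.add_argument('--disable-gpu')            # often paired with --headless
chrome_options.add_argument('--window-size=1920,1080')  # fixed viewport size

driver = webdriver.Chrome(options=chrome_options)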

  • In our example, incognito mode is enabled. Initialize the Chrome driver and navigate to the product card page:

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--incognito')
driver = webdriver.Chrome(options=chrome_options)

driver.get('https://webscraper.io/test-sites/e-commerce/allinone/product/123')

Searching for HTML elements on a page

In Selenium, elements on a web page are located with the find_element method, combined with locator strategies from the By class. These strategies let you find elements by various criteria: element ID, class name, tag name, attribute name, link text, XPath, or CSS selector.

Example of an ID search:

element = driver.find_element(By.ID, 'element_id')

By class name: 

element = driver.find_element(By.CLASS_NAME, 'class_name')

By tag name:

element = driver.find_element(By.TAG_NAME, 'tag_name')

By attribute name: 

element = driver.find_element(By.NAME, 'element_name')

By link text:

element = driver.find_element(By.LINK_TEXT, 'link_text')

By CSS selector:

element = driver.find_element(By.CSS_SELECTOR, 'css_selector')

By XPath: 

element = driver.find_element(By.XPATH, '//tag[@attribute="value"]')

If you need to find multiple elements, use the find_elements method instead of find_element; it returns a list of all matching elements.
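For example, a short sketch that collects the text of every link on the page with find_elements (the a tag here is just an illustration):

links = driver.find_elements(By.TAG_NAME, 'a')  # returns a list (possibly empty)
for link in links:
    print(link.text)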

  • In our example, we use the browser's developer tools to inspect the page and locate the "128" button element and the element containing the price.

 


Simulation of user actions

In Selenium, simulation of actions on a web page is performed using methods provided by the WebElement object. These methods allow you to interact with elements on the page the way a user does: click, enter text, etc. 

 

Here are a few such methods: 

 

element.click() - clicks the element
element.send_keys('text to input') - enters text into an element (input field)
element.location_once_scrolled_into_view - scrolls to the element
element.submit() - submits a form
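As a brief illustration, here is a hypothetical sketch of filling in and submitting a search form (the field name q is an assumption, not an element from our example page):

search_box = driver.find_element(By.NAME, 'q')  # hypothetical input field
search_box.send_keys('128 GB HDD')              # type the query text
search_box.submit()                             # submit the enclosing form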

 

  • Let's return to our example. We use XPath to find the desired button, then click it and locate the element with the price:

 

button_128 = driver.find_element(By.XPATH, "//button[@value='128']")
button_128.click()

# Wait a while for the price to load
time.sleep(3)

price_element = driver.find_element(By.XPATH, "//h4[@class='price float-end pull-right']")

 

  • Now all that remains is to print the price of the product with the required HDD capacity to the console:

 

price_text = price_element.text
print("Item price:", price_text)

driver.quit()

 

So here's what the full code looks like:

 

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--incognito')
driver = webdriver.Chrome(options=chrome_options)

driver.get('https://webscraper.io/test-sites/e-commerce/allinone/product/123')

button_128 = driver.find_element(By.XPATH, "//button[@value='128']")
button_128.click()

time.sleep(3)

price_element = driver.find_element(By.XPATH, "//h4[@class='price float-end pull-right']")

price_text = price_element.text
print("Item price:", price_text)

driver.quit()
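A note on waiting: the fixed time.sleep(3) is simple but brittle. Selenium's explicit waits are a more reliable alternative; a minimal sketch that waits for the price element instead of sleeping:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the price element to appear in the DOM
price_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//h4[@class='price float-end pull-right']"))
)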

Pop-up banners and windows

It is very common for websites to display banners and pop-ups that can interfere with script execution. In such cases, you can configure ChromeDriver settings to disable these elements. Here are some of them:

--disable-popup-blocking: disables pop-up blocking
--disable-infobars: disables information bars
--disable-notifications: disables all notifications

 

On the website used in our example, a cookie banner appears when the page loads. You can also remove it in other ways, one of which is a small script that hides the notification. Start by inspecting the banner in the browser's developer tools to find its ID (here, cookieBanner).

Now we can hide this banner by adding the following to our script:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--incognito')
driver = webdriver.Chrome(options=chrome_options)

driver.get('https://webscraper.io/test-sites/e-commerce/allinone/product/123')

# Script for hiding the cookie banner element
script = """
var banner = document.getElementById('cookieBanner');
if (banner) {
    banner.style.display = 'none';
}
"""

# Execute the script to hide the banner
driver.execute_script(script)

button_128 = driver.find_element(By.XPATH, "//button[@value='128']")
button_128.click()

time.sleep(3)

price_element = driver.find_element(By.XPATH, "//h4[@class='price float-end pull-right']")

price_text = price_element.text
print("Item price:", price_text)

driver.quit()
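Alternatively, if the banner exposes an accept button, you can simply click it before continuing. A sketch with a hypothetical selector that would need to be adapted to the actual markup:

# Hypothetical selector for the banner's accept button
accept_button = driver.find_element(By.CSS_SELECTOR, '#cookieBanner .btn')
accept_button.click()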

 

 

Another effective way is to use Chrome extensions that automatically accept or block cookie banners. You can install such an extension in the browser and load it through ChromeOptions: download a suitable extension in .crx format and use it in your script:

 

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Path to the .crx extension file
extension_path = '/path/to/extension.crx'

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--incognito')

# Add the extension
chrome_options.add_extension(extension_path)

driver = webdriver.Chrome(options=chrome_options)

driver.get('https://webscraper.io/test-sites/e-commerce/allinone/product/123')

button_128 = driver.find_element(By.XPATH, "//button[@value='128']")
button_128.click()

time.sleep(3)

price_element = driver.find_element(By.XPATH, "//h4[@class='price float-end pull-right']")

price_text = price_element.text
print("Item price:", price_text)

driver.quit()

 

 

This approach will save you from having to manually interact with such elements when the page loads. 

How to Solve CAPTCHA When Web Scraping

When extracting data, you will often run into captchas designed to protect sites against bots. The most effective way to overcome them is to use specialized services that recognize different types of captchas. One such tool is CapMonster Cloud, which can automatically solve even complex captchas in a short time. The service offers both browser extensions (for Chrome and Firefox) and API methods (described in the documentation) that you can integrate into your code to obtain tokens, bypass the protection, and let your script continue its work.

Sample code for automatic CAPTCHA solving and web scraping

This script solves the captcha on a page, then extracts the title of that page and prints it to the console:

 

import asyncio
from selenium import webdriver
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from capmonstercloudclient import CapMonsterClient, ClientOptions
from capmonstercloudclient.requests import RecaptchaV2ProxylessRequest


async def solve_captcha(api_key, page_url, site_key):
    client_options = ClientOptions(api_key=api_key)
    cap_monster_client = CapMonsterClient(options=client_options)

    recaptcha2request = RecaptchaV2ProxylessRequest(websiteUrl=page_url, websiteKey=site_key)
    responses = await cap_monster_client.solve_captcha(recaptcha2request)
    return responses['gRecaptchaResponse']


async def parse_site_title(driver: WebDriver, url: str) -> str:
    driver.get(url)
    driver.implicitly_wait(10)

    title_element = driver.find_element(By.TAG_NAME, 'title')
    title = title_element.get_attribute('textContent')
    return title


async def main():
    api_key = 'YOUR_CAPMONSTER_API_KEY'
    page_url = 'https://lessons.zennolab.com/captchas/recaptcha/v2_simple.php?level=low'
    site_key = '6Lcf7CMUAAAAAKzapHq7Hu32FmtLHipEUWDFAQPY'

    options = Options()
    driver = webdriver.Chrome(options=options)

    captcha_response = await solve_captcha(api_key, page_url, site_key)
    print("Captcha solution:", captcha_response)
    # To submit the token to a protected form, it would typically be injected
    # into the hidden g-recaptcha-response field via driver.execute_script()

    site_title = await parse_site_title(driver, page_url)
    print("Page title:", site_title)

    driver.quit()


if __name__ == "__main__":
    asyncio.run(main())

 

 

Conclusion

Using Python and Selenium for web scraping opens up many opportunities for automating the collection of data from websites, while integration with CapMonster Cloud takes care of captchas and greatly simplifies the process. These tools not only speed up your work but also help ensure that the data you collect is accurate and reliable. You can use them to gather data from a wide variety of sources, from online stores to news sites, and apply it to analysis, research, or your own projects. You don't need advanced programming knowledge either: modern tools make web scraping accessible even to beginners. So if simplicity, convenience, and saving time when working with data matter to you, Python, Selenium, and CapMonster Cloud are a great combination. You can find out more about how to use Selenium in our comprehensive guide.

 

Note: We'd like to remind you that this product is intended for automating tests on your own websites and on websites to which you have legal access.