Web Scraping in Python: Efficient Data Collection Automation
Web scraping is a method of gathering data from websites. It allows you to extract information for analysis, price monitoring, news tracking, and many other purposes. The tools that perform web scraping are called web scrapers or parsers. The most convenient and popular language for writing them is Python, although almost any language can be used. Users choose Python for several reasons: its simple syntax, a wide range of handy parsing libraries, and active support and regular updates.
Basic HTML Elements
- Tags: An HTML document consists of various tags that define the structure and content. For example:
  - <html>: The root element of an HTML document.
  - <head>: Contains metadata like the page title (<title>) and links to styles.
  - <body>: The main part of the document, containing the visible content of the page.
- Elements: Inside tags, there can be elements like:
  - <h1>, <h2>, ..., <h6>: Headings of various levels.
  - <p>: A paragraph of text.
  - <a>: A link.
  - <img>: An image.
  - <div>, <span>: Containers for grouping other elements.
- Attributes: Tags can have attributes that provide additional information about the element. For example:
  - <a href="https://example.com">: The href attribute specifies the URL of the link.
  - <img src="image.jpg" alt="image description">: The src attribute specifies the path to the image, and alt provides alternative text.
Choosing a Website
Before selecting appropriate tools to write a scraper, you need to study the target website and determine if it contains dynamic content. To do this, load the page, open the Network tab in Developer Tools, and check if Fetch/XHR requests are being made (these technologies allow web pages to dynamically update content based on data received from the server):
- If Fetch/XHR requests are not being made: You can use libraries like BeautifulSoup and requests.
- If the website contains dynamic content: You should use tools like Selenium or Playwright to load the web pages.
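In addition to checking DevTools, you can make a quick programmatic check: fetch the raw HTML with requests and see whether the data you need is already present. This is only a rough sketch; the URL and the marker string are placeholders for your own page and data:

import requests

url = 'https://quotes.toscrape.com/'  # placeholder: the page you want to check
response = requests.get(url, timeout=10)

# If the text you expect appears in the raw HTML, the content is most likely static
if 'Albert Einstein' in response.text:
    print('Data is present in the initial HTML - requests + BeautifulSoup should be enough')
else:
    print('Data is probably rendered by JavaScript - consider Selenium or Playwright')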
As a test static page, we'll use https://quotes.toscrape.com/ and write a simple scraper that extracts the first three quotes and their authors.
Choosing Tools to Write the Scraper and Installation
For our purpose, the BeautifulSoup and requests libraries are sufficient. Create a new file in your editor or IDE and install the libraries with the command:
pip install beautifulsoup4 requests
BeautifulSoup facilitates searching and extracting data:
By Tags:

title_tag = soup.title
print(title_tag)  # <title>Page Title</title>

By Text: To extract text from a tag, use the .get_text() method:

header_text = soup.h1.get_text()
print(header_text)  # Header

By Classes, IDs, and Attributes:

elements = soup.find_all(class_='my-class')
element = soup.find(id='my-id')
links = soup.find_all('a', href=True)

For More Complex Queries: You can use CSS selectors with the .select() method:

headers = soup.select('h1')
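Here is a small self-contained sketch that ties these methods together; the HTML string is made up purely for illustration:

from bs4 import BeautifulSoup

html = '''
<html>
  <head><title>Page Title</title></head>
  <body>
    <h1>Header</h1>
    <p class="my-class">First paragraph</p>
    <a id="my-id" href="https://example.com">Example link</a>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

print(soup.title)                               # <title>Page Title</title>
print(soup.h1.get_text())                       # Header
print(soup.find(class_='my-class').get_text())  # First paragraph
print(soup.find(id='my-id')['href'])            # https://example.com
print(soup.select('a[href]'))                   # all <a> tags that have an href attribute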
Finding Elements on the Page and Writing the Scraper
Let’s return to the target page, find the necessary elements, and start writing the code.
Open the newly created file and import the previously installed libraries:
import requests
from bs4 import BeautifulSoup
Specify the URL of the required page, set the User-Agent header to simulate browser activity, and send a GET request to the page:
url = 'https://quotes.toscrape.com/'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)
Check if the request was successful:
if response.status_code == 200:
    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(response.text, 'html.parser')
In a separate browser window with the target page open, locate all blocks with quotes, iterate over the first three quote blocks, and extract the quote text in your code:
Tip: For quick searching and analysis of the necessary information, hover your cursor over the desired element, right-click, and select "Inspect." This will open Developer Tools, where you can view the element in the HTML structure of the page and explore additional elements it may contain. Highlighting the selected elements makes the task easier.
quotes = soup.select('.quote')
for quote in quotes[:3]:
    text = quote.select_one('.text').get_text(strip=True)
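Putting the steps above together (and extracting the author via the .author class, which the full version later in the article also uses), a minimal sketch of the static scraper looks like this:

import requests
from bs4 import BeautifulSoup

url = 'https://quotes.toscrape.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find all quote blocks and take the first three
    quotes = soup.select('.quote')
    for quote in quotes[:3]:
        text = quote.select_one('.text').get_text(strip=True)
        author = quote.select_one('.author').get_text(strip=True)
        print(f'{text} - {author}')
else:
    print(f'Request failed with status code {response.status_code}')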
Example Scraper for a Dynamic Website
As an example of scraping a dynamic website, let's use https://parsemachine.com. Specifically, we'll work with the test page https://parsemachine.com/sandbox/catalog/, which displays cards for 12 products. We'll try to extract the name of each product and its link. Since the site is dynamic, we'll use Playwright. This browser automation tool can locate and extract elements on web pages using CSS and XPath selectors, text matching, and ARIA roles, and it supports combining selectors for precise targeting.
First, create a new project, then install Playwright and the Chromium browser with the following commands:
pip install playwright
playwright install chromium
Next, find the elements you are interested in using the Developer Tools. Import Playwright, launch the browser, and navigate to the desired page:
from playwright.sync_api import sync_playwright

url = 'https://parsemachine.com/sandbox/catalog/'

def scrape_with_playwright():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(url)
We’ll look for all product cards, iterate over each card, and extract the link to the product page:
product_cards = page.query_selector_all('.card.product-card')
for card in product_cards:
    title_tag = card.query_selector('.card-title .title')
    title = title_tag.inner_text() if title_tag else 'No title available'
Extract the product page link:
product_link = title_tag.get_attribute('href') if title_tag else 'No link available'
If the link is relative, add the base URL:
if product_link and not product_link.startswith('http'):
    product_link = f'https://parsemachine.com{product_link}'
Finally, close the browser and call the function to run the script:
browser.close()

scrape_with_playwright()
Here’s the complete code:
from playwright.sync_api import sync_playwright

# URL of the target page
url = 'https://parsemachine.com/sandbox/catalog/'

def scrape_with_playwright():
    with sync_playwright() as p:
        # Launch Chromium browser
        browser = p.chromium.launch(headless=False)  # Change to True for headless mode
        # Open a new tab
        page = browser.new_page()
        # Navigate to the target page
        page.goto(url)
        # Find all product cards
        product_cards = page.query_selector_all('.card.product-card')
        for card in product_cards:
            # Extract the product title
            title_tag = card.query_selector('.card-title .title')
            title = title_tag.inner_text() if title_tag else 'No title available'
            # Extract the product link
            product_link = title_tag.get_attribute('href') if title_tag else 'No link available'
            # If the link is relative, add the base URL
            if product_link and not product_link.startswith('http'):
                product_link = f'https://parsemachine.com{product_link}'
            # Print product information
            print(f'Title: {title}, Link: {product_link}')
        browser.close()

scrape_with_playwright()
Run the script, and it will output the required information – the names of each product listed on the page and their respective links.
How to Save Extracted Information?
To save the extracted information, it's essential to understand a bit about the formats used for storing results:
CSV – One of the most popular formats for storing tabular data. It is a text file where each line corresponds to a single record, and values are separated by commas. Advantages of this format include: support by most data processing applications, including Excel; ease of reading and editing using text editors. Disadvantages: limited capabilities for storing complex data structures (e.g., nested data); issues with escaping commas and special characters.
JSON – A text-based data interchange format that is convenient for representing structured data. It is widely used in web development. Pros: supports nested and hierarchical data structures; well-supported by most programming languages; easily readable by both humans and machines. JSON is suitable for storing data that may need to be transmitted via APIs. Cons: JSON files can be larger compared to CSV; processing can be slower due to its more complex structure.
XLS – Designed for Excel spreadsheets, it stores data about cells, formatting, and formulas. It is commonly used for reports and tabular data viewed in spreadsheet applications. To work with XLS in Python, you need third-party libraries, such as pandas. This format allows you to store data in a readable and presentable way. The main drawback is the need for additional libraries, which can increase load and processing time.
XML – A markup language used for storing and transmitting data. It supports nested structures and attributes. Pros: structured; allows for storing complex data structures; well-supported by various standards and systems. Cons: XML files can be bulky and complex to process; processing XML may be slow due to its structure.
Databases are used for storing large volumes of structured data. Examples include MySQL, PostgreSQL, MongoDB, SQLite. Pros: supports large volumes of data and quick access; easy to organize and relate data; supports transactions and data recovery. Cons: requires extra effort for setup and maintenance.
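To make the comparison concrete, here is a small sketch that saves the same records to CSV and to JSON using only the standard library; the sample records are invented for illustration:

import csv
import json

records = [
    {'title': 'Sample product 1', 'link': 'https://example.com/1'},
    {'title': 'Sample product 2', 'link': 'https://example.com/2'},
]

# CSV: flat, tabular, easy to open in Excel
with open('records.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()
    writer.writerows(records)

# JSON: keeps structure, convenient for APIs and nested data
with open('records.json', 'w', encoding='utf-8') as f:
    json.dump(records, f, ensure_ascii=False, indent=2)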
For our scrapers, we will choose the CSV format: the extracted data is tabular (quote text and author, product names and links), the volume is small, and there are no nested structures. We will add the csv import to our quote code, create a writer object, and write the quote data (the quotes themselves and their authors):
with open('quotes.csv', 'w', newline='', encoding='utf-8') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['Quote', 'Author'])
    for quote in quotes[:3]:
        text = quote.select_one('.text').get_text(strip=True)
        author = quote.select_one('.author').get_text(strip=True)
        csvwriter.writerow([text, author])
We will also add additional console outputs and error handling:
print("Data successfully written to quotes.csv") except requests.RequestException as e: print(f'Error requesting the page: {e}') except Exception as e: print(f'An error occurred: {e}')
Here’s the complete updated code:
import requests
from bs4 import BeautifulSoup
import csv

# URL of the target page
url = 'https://quotes.toscrape.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.6613.85 Safari/537.36'
}

try:
    # Send a GET request to the page with the user agent
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # Check for HTTP errors

    # Create a BeautifulSoup object for parsing HTML
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all blocks with quotes
    quotes = soup.select('.quote')

    # Open CSV file for writing
    with open('quotes.csv', 'w', newline='', encoding='utf-8') as csvfile:
        # Create a writer object
        csvwriter = csv.writer(csvfile)
        # Write headers
        csvwriter.writerow(['Quote', 'Author'])

        # Write quote data
        for quote in quotes[:3]:
            # Extract the quote text
            text = quote.select_one('.text').get_text(strip=True)
            # Extract the author's name
            author = quote.select_one('.author').get_text(strip=True)
            # Write to CSV file
            csvwriter.writerow([text, author])

    print("Data successfully written to quotes.csv")

except requests.RequestException as e:
    print(f'Error requesting the page: {e}')
except Exception as e:
    print(f'An error occurred: {e}')
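To check the result, the saved file can be read back with the same csv module, for example:

import csv

with open('quotes.csv', 'r', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)  # ['Quote', 'Author'] for the header, then one row per quote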
We will perform similar actions with the second scraper:
from playwright.sync_api import sync_playwright
import csv

# URL of the target page
url = 'https://parsemachine.com/sandbox/catalog/'

def scrape_with_playwright():
    try:
        with sync_playwright() as p:
            # Launch Chromium browser
            browser = p.chromium.launch(headless=False)  # Change to True for headless mode
            try:
                # Open a new tab
                page = browser.new_page()
                # Navigate to the target page
                page.goto(url)

                # Find all product cards
                product_cards = page.query_selector_all('.card.product-card')

                # Open CSV file for writing
                with open('products.csv', 'w', newline='', encoding='utf-8') as csvfile:
                    # Create a writer object
                    csvwriter = csv.writer(csvfile)
                    # Write headers
                    csvwriter.writerow(['Title', 'Link'])

                    # Extract data from product cards and write to CSV
                    for card in product_cards:
                        # Extract product title
                        title_tag = card.query_selector('.card-title .title')
                        title = title_tag.inner_text() if title_tag else 'No title available'
                        # Link to the product page
                        product_link = title_tag.get_attribute('href') if title_tag else 'No link available'
                        # If the link is relative, add the base URL
                        if product_link and not product_link.startswith('http'):
                            product_link = f'https://parsemachine.com{product_link}'
                        # Write data to CSV file
                        csvwriter.writerow([title, product_link])
                        # Print product information
                        print(f'Title: {title}, Link: {product_link}')

                print("Data successfully written to products.csv")
            except Exception as e:
                print(f'Error working with Playwright: {e}')
            finally:
                # Close the browser
                browser.close()
                print("Browser closed.")
    except Exception as e:
        print(f'Error launching Playwright: {e}')

scrape_with_playwright()
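If the product cards are rendered with a noticeable delay, it can help to explicitly wait for them before querying. A small addition to the script above using Playwright's wait_for_selector (the 10-second timeout is an arbitrary choice):

# Wait up to 10 seconds for the first product card to appear before collecting them
page.wait_for_selector('.card.product-card', timeout=10000)
product_cards = page.query_selector_all('.card.product-card')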
Obstacles in Web Scraping
Changing or Complex Website Structure: One of the most common obstacles in web scraping is a changing website structure and code obfuscation. Even minor changes in the HTML markup or page layout can break scraping scripts, which may require frequent code updates to adapt to the changes.
Request Limits: Many websites have limits on the number of requests that can be sent within a certain timeframe. If your requests exceed the set limits, your IP address may be temporarily blocked.
IP Blocking: Websites can block IP addresses that they identify as suspicious or overly active, which can be a significant obstacle for scraping. In this case, quality proxy servers may be needed to bypass such blocks.
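With the requests library, a proxy can be passed per request via the proxies argument; the address and credentials below are placeholders:

import requests

proxies = {
    'http': 'http://user:password@proxy.example.com:8080',   # placeholder proxy address
    'https': 'http://user:password@proxy.example.com:8080',
}

response = requests.get('https://quotes.toscrape.com/', proxies=proxies, timeout=10)
print(response.status_code)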
CAPTCHA: Many web resources implement protective measures in the form of CAPTCHAs to prevent automated actions. CAPTCHA requires manual input or the use of specialized services to bypass.
One of the best services today is CapMonster Cloud – its API allows for easy integration into code for bypassing CAPTCHA and continuing scraper operation. It is easy to connect, provides quick solutions for various types of CAPTCHAs with minimal errors – it supports reCAPTCHA, hCaptcha, DataDome, Amazon CAPTCHA, and others. CapMonster Cloud can be considered an optimal choice as an auxiliary tool and an important part of the web scraping process.
Recommendations for Successful Scraping
- Use Proxy and User-Agent Rotation: Rotating proxies and User-Agent strings makes your requests look as if they come from different devices and browsers, which helps avoid IP blocking and request limits.
- Add Error Handling and Retries: A web page may be temporarily unavailable, or a request may fail. A retry mechanism and error handling make your script resilient to such situations and prevent the scraping process from being interrupted (a sketch combining several of these recommendations follows after this list).
- Review the website’s robots.txt file: This file contains guidelines for bots about which parts of the site can and cannot be crawled. Adhering to these recommendations helps avoid legal issues and conflicts with site owners.
- Introduce Random Delays Between Requests: Random pauses between requests make your activity look less automated and reduce the likelihood of blocking.
These recommendations will help your script mimic the behavior of a real user, thus reducing the chance of detection.
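As an illustration of several of these recommendations at once, here is a rough sketch of a request helper with User-Agent rotation, random delays, and retries; the User-Agent strings, timeouts, and delay ranges are arbitrary examples:

import random
import time
import requests

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
]

def fetch(url, retries=3):
    for attempt in range(1, retries + 1):
        try:
            headers = {'User-Agent': random.choice(USER_AGENTS)}  # rotate the User-Agent
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print(f'Attempt {attempt} failed: {e}')
            time.sleep(random.uniform(2, 5))  # random delay before retrying
    return None

# Random pause between requests to different pages
response = fetch('https://quotes.toscrape.com/')
time.sleep(random.uniform(1, 3))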
Conclusion
In summary, web scraping in Python is one of the most popular methods for efficiently collecting data from various websites. We discussed how to choose the right tools for web scraping, how to install the necessary libraries, and how to write code that extracts data and saves the results in convenient formats. With the step-by-step approach described in this article, even a novice developer can master the basic techniques of web scraping and create their first scraper scripts. Web scraping opens up significant opportunities for data analysis, information gathering, market monitoring, and many other tasks. It's essential to continue learning new tools and techniques to stay relevant in this ever-evolving field.
By using libraries and tools such as BeautifulSoup, requests, Selenium, Playwright, and others mentioned in this guide, you can extract information from both static and dynamic sites. When working with web scraping, it's important to consider legal and ethical aspects, as well as be prepared to bypass various obstacles such as CAPTCHA or dynamic content loading.
Each of the discussed tools and approaches has its advantages and limitations. The choice of the right tool depends on the specifics of the task, the complexity of the web pages, and the volume of data. For effective web scraping, understanding the characteristics of the web pages you are working with is crucial.
We hope that these instructions will help you better understand the web scraping process and provide the necessary foundational knowledge to create your own scrapers. Good luck with your data collection and analysis projects!
Note: We'd like to remind you that the product is used to automate testing on your own websites and on websites to which you have legal access.