Error 403 and Other Problems in Web Scraping: Why They Occur and How to Avoid Them
Web scraping automates data collection from websites and other online sources. However, scrapers often run into obstacles, one of the most common being the "403 Forbidden" error. This status code means the server understood the request but refuses to serve the requested resource. To scrape effectively, you need to understand why this error occurs and how to avoid it. In this article, we explore what the 403 error is, why it occurs, how to avoid it, what other obstacles arise when collecting data, and how to deal with them.
Why does a server block access to data? During web scraping, a 403 error is usually triggered by a website's protection mechanisms against unauthorized access or resource abuse. Let's take a closer look at the reasons behind this error and the ways to resolve it.
- IP Address Restriction: Websites may restrict access based on IP addresses. If too many requests come from a single IP address, the server may block it to prevent overloads and protect against potential attacks.
- Headless Mode: Using a headless browser in automation tools like Selenium can also lead to errors. Some websites can detect that requests come from a browser running in headless mode, that is, without a graphical interface, and they may also notice the absence of user interaction (e.g., clicks, scrolling). Either signal points to automated access, which sites might consider suspicious. If you still need to use headless mode, configure the browser to mimic a regular browser with a graphical interface.
- Missing Required Cookies: Some websites require specific cookies or sessions for content access.
- Incorrect User-Agent: Many sites check the User-Agent header for information about the browser and device. If you do not specify this header, provide an incorrect one, or fail to rotate it during large-scale requests, the server may deny access.
To ensure seamless data collection, consider several effective methods to prevent access blocks:
- Use of Quality Proxy Servers: Periodically changing IP addresses helps avoid blocks. It is important to use reliable proxies to avoid blacklists.
- Avoiding Overly Frequent Requests: Reducing the request rate and introducing delays between requests can help avoid blocks. If you write your scraper in Python, the time module can be used to pause between requests:
import time
time.sleep(5) # 5-second delay between requests
- Emulating a Real Browser: Configure the automated browser so that it resembles one operated by a person. In Selenium, this is done through browser options:
from selenium import webdriver
options = webdriver.ChromeOptions()
# Optional: remove this line to run the browser with a graphical interface
options.add_argument("--headless")
# Screen size emulation
options.add_argument("--window-size=1920,1080")
# This flag helps to hide signs of automation
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
- Correctly Handling Cookies: If a website requires authentication, it is essential to correctly save and use cookies. You can pass cookies along with requests using the requests library:
import requests
# A Session stores the cookies set by the server
session = requests.Session()
response = session.get('https://example.com')
# The second request automatically sends the cookies received above
response2 = session.get('https://example.com/another-page')
- Setting the Right User-Agent: Using realistic User-Agent headers can help bypass blocks. It’s best to use those commonly used by popular browsers (e.g., Chrome, Firefox):
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.6533.122 Safari/537.36"
}
You can also rotate User-Agent headers with Python's random module. Create a list of different User-Agent strings in advance and refresh it periodically so that the strings do not become outdated.
Example code to select a random User-Agent from a pre-created list using random:
import random
import requests
# List of User-Agents
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.6533.122 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15",
]
# Select random User-Agent
random_user_agent = random.choice(user_agents)
headers = {
"User-Agent": random_user_agent
}
# Example request using random User-Agent
response = requests.get("https://example.com", headers=headers)
# Print response status code and used User-Agent
print(f"Status Code: {response.status_code}")
print(f"Used User-Agent: {random_user_agent}")
The same approach extends beyond the User-Agent: with random you can also pick a different proxy from a pool for each request, add random delays between requests, and rotate other request attributes to mimic a variety of users and devices.
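The rotation of User-Agent strings, proxies, and delays described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the proxy URLs are placeholders, the function names build_request_kwargs and polite_get are hypothetical, and the session argument is expected to be a requests.Session (or any object with a compatible get method).

```python
import random
import time

# Hypothetical pools: replace with your own User-Agent strings and proxy URLs.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.6533.122 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15",
]
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]


def build_request_kwargs(rng=random):
    """Pick a random User-Agent and a random proxy for one request."""
    proxy = rng.choice(PROXIES)
    return {
        "headers": {"User-Agent": rng.choice(USER_AGENTS)},
        # requests expects a mapping from URL scheme to proxy URL.
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,
    }


def polite_get(session, url, min_delay=2.0, max_delay=6.0, rng=random, sleep=time.sleep):
    """Wait a random interval, then fetch url with a freshly chosen identity."""
    sleep(rng.uniform(min_delay, max_delay))
    return session.get(url, **build_request_kwargs(rng))
```

With the requests library installed, a call might look like polite_get(requests.Session(), "https://example.com"). The delay bounds of 2 to 6 seconds are arbitrary and should be tuned to the target site.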
In addition to error 403, scrapers often encounter other errors:
- 401 Unauthorized: The request lacks valid credentials. Solution: authenticate, for example with a username and password.
- 500 Internal Server Error: A problem on the server side. Solution: retry the request later or notify the site administrator.
- 429 Too Many Requests: The client has sent too many requests in a given period. Solution: reduce the request frequency, wait before retrying, and use proxies if needed.
- Obfuscated HTML Structure: During web scraping, you may encounter obfuscated HTML code where classes, IDs, and other elements have unclear or dynamically generated names. Solution: use robust XPath or CSS selectors, search for elements by text content, and use special libraries such as lxml for parsing and processing HTML. In complex cases, TensorFlow or PyTorch can be used to create machine learning models that recognize patterns and classify obfuscated elements based on large data volumes.
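For the 429 and 500 responses listed above, "retry later" is often implemented as a retry loop with growing pauses. The sketch below is one possible way to do it, not a prescribed method: the names retry_delay and get_with_retries are hypothetical, the set of retryable status codes is an assumption you may want to extend, and only the numeric form of the Retry-After header is handled (a date value falls back to exponential backoff).

```python
import time

# Assumption: only the two retry-worthy codes discussed above; extend as needed.
RETRYABLE_STATUSES = {429, 500}


def retry_delay(status_code, headers, attempt, base_delay=1.0, max_delay=60.0):
    """Return the seconds to wait before retrying, or None if no retry is warranted."""
    if status_code not in RETRYABLE_STATUSES:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.strip().isdigit():
        # The server stated the wait in seconds; honor it, up to the cap.
        return min(float(retry_after), max_delay)
    # Otherwise double the wait on every attempt: 1s, 2s, 4s, ...
    return min(base_delay * (2 ** attempt), max_delay)


def get_with_retries(session, url, max_attempts=4, sleep=time.sleep, **kwargs):
    """Call session.get until the status is not retryable or attempts run out."""
    for attempt in range(max_attempts):
        response = session.get(url, **kwargs)
        delay = retry_delay(response.status_code, response.headers, attempt)
        if delay is None or attempt == max_attempts - 1:
            return response
        sleep(delay)
```

Here session is assumed to be a requests.Session. A 403 is deliberately not retried, since repeating the same blocked request rarely helps and may make the block worse.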
Another common obstacle is CAPTCHA, a protection system that frequently appears for similar reasons. But don’t worry, many services help bypass such limitations effectively, and one of the best is CapMonster Cloud. This convenient cloud tool provides an API for automatic CAPTCHA solving, greatly simplifying the work. Here are the steps to integrate CapMonster Cloud into your scraper code in Python:
- Register and Get an API Key: To use CapMonster Cloud, you need to register on the service and obtain an API key for authenticating requests to the service.
- Install Necessary Libraries: CapMonster Cloud provides its own client libraries for different languages. Here's how to install the official library for Python:
pip install capmonstercloudclient
- Create a Task and Get the Result: Create a task, send it to the server, and receive a response:
import asyncio
from capmonstercloudclient import CapMonsterClient, ClientOptions
from capmonstercloudclient.requests import RecaptchaV2ProxylessRequest

async def solve_captcha(api_key, page_url, site_key):
    # Configure the client with your API key
    client_options = ClientOptions(api_key=api_key)
    cap_monster_client = CapMonsterClient(options=client_options)
    # Describe the reCAPTCHA v2 task: the page URL and the site key found on that page
    recaptcha2request = RecaptchaV2ProxylessRequest(websiteUrl=page_url, websiteKey=site_key)
    # Send the task and wait for the solution
    responses = await cap_monster_client.solve_captcha(recaptcha2request)
    return responses['gRecaptchaResponse']

async def main():
    api_key = 'YOUR_CAPMONSTER_API_KEY'
    page_url = 'https://lessons.zennolab.com/captchas/recaptcha/v2_simple.php?level=low'
    site_key = '6Lcf7CMUAAAAAKzapHq7Hu32FmtLHipEUWDFAQPY'
    captcha_response = await solve_captcha(api_key, page_url, site_key)
    print("CAPTCHA Solution:", captcha_response)

if __name__ == "__main__":
    asyncio.run(main())
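The code above stops at printing the token. With reCAPTCHA v2, the token is normally submitted together with the page's form in the g-recaptcha-response field. A minimal sketch of that step, assuming a hypothetical helper build_form_payload and made-up form fields such as username:

```python
from urllib.parse import urlencode


def build_form_payload(captcha_token, extra_fields=None):
    """Combine the solved token with the other fields the page's form submits."""
    payload = dict(extra_fields or {})
    # reCAPTCHA v2 widgets submit the token in this form field.
    payload["g-recaptcha-response"] = captcha_token
    return payload


# Placeholder token and a hypothetical form field, for illustration only.
payload = build_form_payload("TOKEN_FROM_CAPMONSTER", {"username": "demo"})
print(urlencode(payload))
```

With requests, the payload could then be sent as session.post(form_url, data=payload), where form_url is the form's action URL on the target page. The actual field names and URL depend on the site and have to be read from its HTML.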
Before using each of the tools mentioned in this article, we recommend checking their documentation. Here are some useful links to resources where you can find more detailed information and answers to possible questions:
- Selenium WebDriver
- Python libraries: time, random, requests
- CapMonster Cloud: website, documentation, CapMonster Cloud API
Web scraping is effective even with large data volumes, but frequent errors can complicate the process. Understanding the reasons for errors like 403 and applying the correct bypass methods — setting User-Agent, using proxies, and CAPTCHA-solving services — will make your work more efficient. Following proven methods will reduce the risk of blocks and simplify data collection, while a careful approach to the task will ensure a positive experience with web resources.
Note: We'd like to remind you that the product is used for automating testing on your own websites and on websites to which you have legal access.