Causes of the 403 error
Why does the server block access to data? In web scraping, the 403 error occurs due to website protection mechanisms against unauthorized access or resource abuse. Let’s take a closer look at the causes of this error and ways to fix it.
IP address restriction: websites may restrict access by IP address. If too many requests come from a single IP address, the server may block it to prevent overload and protect against potential attacks.
Headless mode: using a headless browser in automation tools such as Selenium can also lead to errors. Some websites are able to detect that requests come from a browser in headless mode, where there is no user interaction (for example, clicks or page scrolling). This may indicate automated access, which websites can consider potentially suspicious activity. However, if you still need this mode, configure the browser to imitate a real browser with a graphical interface.
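Missing required headers and cookies: some websites require specific cookies, an established session, or headers that a regular browser sends (for example, Accept-Language or Referer) before they serve content.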
Incorrect User-Agent: many websites check the User-Agent header containing browser and device information. If you do not set this header, set it incorrectly, or fail to rotate it during large-scale requests, the server may deny access.
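Whatever the cause, the scraper sees the block as a response with status code 403, so it can detect it and slow down instead of retrying immediately. The following is a minimal sketch, not a definitive implementation: the function names `backoff_delay` and `fetch_with_backoff`, the retry limit, and the delay values are illustrative assumptions, and `session` is assumed to be a `requests.Session` (or any object with a compatible `get` method).

```python
import time


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Return an exponentially growing delay in seconds, never above `cap`."""
    return min(cap, base * (2 ** attempt))


def fetch_with_backoff(session, url, max_attempts=4, sleep=time.sleep):
    """Request `url`, pausing longer after each 403 before trying again.

    `session` is assumed to behave like requests.Session; `sleep` can be
    replaced to skip real waiting, for example in tests.
    """
    response = None
    for attempt in range(max_attempts):
        response = session.get(url, timeout=10)
        if response.status_code != 403:
            return response
        # 403 Forbidden: wait before the next attempt, unless this was the last one
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return response  # still 403 after all attempts
```

With the requests library this could be called as `fetch_with_backoff(requests.Session(), "https://example.com")`. Waiting alone will not lift a block caused by missing headers or a detected headless browser, so it only helps with rate-based restrictions.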
How to bypass the 403 error in web scraping
To keep data collection running smoothly, consider several methods that help prevent a website from blocking access:
- Using high-quality proxy servers: periodically changing the IP address helps avoid blocks. Use reliable proxies, since addresses from low-quality pools are often already blacklisted.
- Avoiding too many requests: reducing the request frequency and adding delays between requests can help prevent blocking. If your scraper is written in Python, the standard time module can add these delays:
import time
time.sleep(5) # 5-second delay between requests
- Browser emulation: launch options can make an automated browser look more like a regular one, for example in Selenium:
from selenium import webdriver
options = webdriver.ChromeOptions()
# Run without a visible window; remove this line if a graphical browser is required
options.add_argument("--headless")
# Emulate a typical desktop screen size
options.add_argument("--window-size=1920,1080")
# Helps hide automation signs such as the navigator.webdriver property
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
- Handling cookies: if the website requires authentication, cookies must be stored and sent back correctly. With the requests library, a Session object keeps the cookies set by the server and sends them with later requests:
import requests
session = requests.Session()
# the first response may set cookies, which the session stores
response = session.get('https://example.com')
# stored cookies are sent automatically with subsequent requests
response2 = session.get('https://example.com/another-page')
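When a site expects particular headers or a cookie obtained elsewhere (for example, copied from a logged-in browser), they can be attached to the session once so that every request carries them. This is a minimal sketch under assumptions: the helper name `prepare_session` is hypothetical, and the header and cookie values in the usage note are placeholders, not values any real site requires.

```python
def prepare_session(session, headers, cookies):
    """Attach default headers and cookies to a requests.Session-like object.

    Both `session.headers` and `session.cookies` are assumed to support
    `update()` with a dict, as they do on requests.Session.
    """
    session.headers.update(headers)  # sent with every request from this session
    session.cookies.update(cookies)  # added to the session's cookie jar
    return session
```

A possible call, with placeholder values: `prepare_session(requests.Session(), {"Accept-Language": "en-US,en;q=0.9"}, {"sessionid": "value-from-browser"})`. Setting these on the session rather than on each call keeps individual requests consistent with one another.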
- Setting a proper User-Agent: realistic User-Agent strings can help avoid blocking. It is best to use the ones sent by popular browsers (for example, Chrome or Firefox):
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
}
- Rotating the User-Agent: the Python random module can pick a different User-Agent for each request. Prepare a list of User-Agent strings from different browsers and update it periodically.
Example code for selecting a random User-Agent from a predefined list:
import random
import requests
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.6533.122 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15",
]
random_user_agent = random.choice(user_agents)
headers = {
    "User-Agent": random_user_agent
}
response = requests.get("https://example.com", headers=headers)
print(f"Status code: {response.status_code}")
print(f"User-Agent used: {random_user_agent}")
In addition to changing the User-Agent, the random module can be used to pick a random proxy from a pool for each request, add random delays between requests, and rotate other request elements to simulate the behavior of different users and devices.
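The proxy and delay rotation described above can be sketched with the same random module. The helper names (`pick_proxy`, `random_delay`, `build_request_kwargs`), the delay range, and the timeout are illustrative assumptions; the returned dictionary uses the `headers`, `proxies`, and `timeout` keyword arguments that `requests.get` accepts.

```python
import random


def pick_proxy(proxy_pool):
    """Choose one proxy URL and return it in the mapping format requests expects."""
    proxy_url = random.choice(proxy_pool)
    return {"http": proxy_url, "https": proxy_url}


def random_delay(min_seconds=2.0, max_seconds=6.0):
    """Return a random pause length so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def build_request_kwargs(user_agents, proxy_pool):
    """Combine a random User-Agent and a random proxy into request arguments."""
    return {
        "headers": {"User-Agent": random.choice(user_agents)},
        "proxies": pick_proxy(proxy_pool),
        "timeout": 10,
    }
```

A request could then be sent with `requests.get(url, **build_request_kwargs(user_agents, proxy_pool))`, followed by `time.sleep(random_delay())` before the next one. The proxy pool is a list of proxy URLs you supply yourself, for example `"http://proxy1.example.com:8080"` (a placeholder address).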