TDM 20200: Project 3 - Web Scraping Introduction 3
Project Objectives
This project covers advanced web scraping techniques including rate limiting, handling dynamic content, working with headers and cookies, error handling, and ethical considerations. These skills are essential for scraping real-world websites responsibly and effectively.
|
If AI is used in any case, such as for debugging, research, etc., we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click on “Create Link” and add the shareable link as part of your citation. The project template in the Examples Book now has a “Link to AI Chat History” section; please include this in all your projects. If you did not use any AI tools, you may write “None”. We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be directly applied or copy-pasted into your projects. Please refer to the-examples-book.com/projects/spring2026/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty. |
Questions
Question 1 (2 points)
Let’s start by learning about rate limiting with simple delays. First, these are the imports used.
import requests
from lxml import html
import time
(You can read more about lxml at lxml.de/ (official site) and lxml.de/lxmlhtml.html (HTML-specific documentation); the site's menu links to more detailed documentation.)
More information on the time module: docs.python.org/3/library/time.html |
We start with a function that retrieves a website. It takes in url, which is the website to be scraped, and delay, which is the number of seconds to wait after each request. We will get into that more soon.
def scrape_with_delay(url, delay=1):
"""Scrape a URL with a delay between requests."""
response = requests.get(url)
time.sleep(delay) # Wait before next request
return response
As we can see from the comment, requests.get(url) sends the HTTP GET request (asking for data from a source, such as a website here), and the server eventually sends a response back containing information such as the status code, metadata, and HTML content. This information gets put into a Response object by Python, stored here in the variable 'response'.
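To make the Response object more concrete, here is a small, hedged sketch (using the same practice site as the rest of this project) that prints a few commonly used attributes:
import requests

# Illustrative only: inspect a few attributes of the Response object
response = requests.get("https://quotes.toscrape.com/")
print(response.status_code)               # e.g., 200 on success
print(response.headers["Content-Type"])   # metadata sent back by the server
print(response.text[:100])                # first 100 characters of the raw HTML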
Below, we create a list to store the page URLs to scrape.
# Example: Scraping multiple pages with delays
base_url = "https://quotes.toscrape.com"
pages_to_scrape = [
f"{base_url}/page/1/",
f"{base_url}/page/2/",
f"{base_url}/page/3/"
]
After this, the URLs will look like: quotes.toscrape.com/page/1/, quotes.toscrape.com/page/2/, and quotes.toscrape.com/page/3/
Now we move onto the main data collection loop.
# List to store all quotes from all pages
all_quotes = []
# Loop over each page in the list
for page_url in pages_to_scrape:
print(f"Scraping {page_url}...")
# HTTP request to page_url
response = scrape_with_delay(page_url, delay=2)
tree = html.fromstring(response.text)
# Extract quotes
quotes = tree.xpath('//span[@class="text"]/text()')
all_quotes.extend(quotes)
print(f" Found {len(quotes)} quotes")
print(f"\nTotal quotes collected: {len(all_quotes)}")
- One website is processed per iteration of the for loop.
- We use the function we built earlier to send the HTTP request to page_url. The number of seconds to wait in between is specified by 'delay=' (here, 2 seconds).
- html.fromstring comes from the lxml library. It takes in raw HTML and returns an Element object (here, 'tree'). This means the parsing process not only reads the raw string, but also processes information such as tags and attributes, converting the HTML string into a tree structure.
- .xpath: This function also comes from the lxml library, and it is used to search for and locate XML/HTML elements. XPath is a query language for addressing specific parts of a document.
It is natural to ask what the expression '//span[@class="text"]/text()' represents. Let’s take a closer look.
- //: Search any part of the document.
- span: Focus on <span> elements.
- [@class="text"]: Only keep elements whose class attribute is exactly "text".
- /text(): Get the text inside each element we found.
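To see the expression in isolation, here is a small, self-contained sketch that runs the same XPath against a hard-coded HTML snippet (the snippet itself is made up for illustration):
from lxml import html

# A tiny, made-up HTML fragment mimicking the structure of the quotes site
snippet = """
<div class="quote">
  <span class="text">“An example quote.”</span>
  <span class="other">not selected</span>
</div>
"""
tree = html.fromstring(snippet)
print(tree.xpath('//span[@class="text"]/text()'))   # ['“An example quote.”']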
Now, implement a scraper that:
1. Scrapes multiple pages from quotes.toscrape.com/
2. Adds a 2-second delay between each request,
3. Tracks how long the scraping takes,
4. Calculates the average time per request.
1.1. Write a scraper with rate limiting (2-second delays).
1.2. Scrape at least 5 pages.
1.3. Track total scraping time.
1.4. Calculate and display average time per request.
1.5. Explain why rate limiting is important.
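As a hint for the timing requirements (1.3 and 1.4), here is a minimal, hedged sketch of one way to track total and average request time with time.perf_counter; it assumes the scrape_with_delay function defined earlier, and the page list is just a placeholder for whatever pages you choose:
import time

start = time.perf_counter()
# Placeholder loop: swap in your own rate-limited scraping calls
pages = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]
for page_url in pages:
    scrape_with_delay(page_url, delay=2)   # defined earlier in this question
total_time = time.perf_counter() - start
print(f"Total time: {total_time:.2f} s")
print(f"Average per request: {total_time / len(pages):.2f} s")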
Question 2 (2 points)
When a server is overloaded or you’re making requests too quickly, you might receive HTTP error codes like 429 (Too Many Requests) or 503 (Service Unavailable). Exponential backoff is a strategy where you wait progressively longer between retries when you encounter errors.
import random
As the name suggests, the random module lets us generate random numbers.
We now create a function that incorporates automatic retries when requesting a webpage.
def scrape_with_exponential_backoff(url, max_retries=5, initial_delay=1):
"""
Scrape a URL with exponential backoff retry logic.
If we get an error, wait longer before retrying:
- First retry: wait 1 second
- Second retry: wait 2 seconds
- Third retry: wait 4 seconds
- etc. (exponential: 2^retry_number)
"""
for attempt in range(max_retries):
try:
response = requests.get(url, timeout=10)
# Check for rate limiting errors
if response.status_code == 429:
# Use the exponential backoff delay formula (see docstring)
wait_time = initial_delay * (2 ** attempt)
# Just for us to see what is going on:
print(f"Rate limited! Waiting {wait_time} seconds before retry {attempt + 1}...")
# Give some wait time, based on our calculated delay time
time.sleep(wait_time)
continue
# Check for server errors
if response.status_code >= 500:
wait_time = initial_delay * (2 ** attempt)
print(f"Server error {response.status_code}! Waiting {wait_time} seconds before retry {attempt + 1}...")
time.sleep(wait_time)
continue
response.raise_for_status() # Raises exception for 4xx/5xx errors
return response
except requests.exceptions.RequestException as e:
if attempt == max_retries - 1:
# Last attempt failed
raise Exception(f"Failed after {max_retries} attempts: {e}")
wait_time = initial_delay * (2 ** attempt)
            # Add some randomness: random.uniform(a, b) generates a random number between a and b, giving us extra delay time
extra_delay = random.uniform(0, 0.1 * wait_time)
            # The resulting number is added to the sleep time, to prevent numerous clients from retrying simultaneously
total_wait = wait_time + extra_delay
print(f"Error: {e}")
print(f"Waiting {total_wait:.2f} seconds before retry {attempt + 1}...")
time.sleep(total_wait)
raise Exception("Max retries exceeded")
- response = requests.get(url, timeout=10): Retrieve the page’s data; the timeout=10 parameter makes the program raise an error instead of waiting indefinitely if it takes longer than 10 seconds to receive a response.
- HTTP status codes are used here. For instance, response.status_code == 429 checks whether we have made too many requests.
|
There are several good online references with lists of the different status codes and details on each (for example, the MDN HTTP status code reference). |
You might also notice the try/except used here; in Python, we use it to handle errors gracefully. The try block runs code that might fail, and the except block handles any error that occurs.
except requests.exceptions.RequestException as e: Our code uses this to catch all requests-library-related errors, such as connection errors and timeouts. 'e' is simply the variable we assign the exception instance to; it holds the specific error message and related details.
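As a stripped-down illustration of this try/except pattern (separate from the retry function above), a single hedged request might look like this:
import requests

try:
    resp = requests.get("https://quotes.toscrape.com/", timeout=10)
    resp.raise_for_status()   # turns 4xx/5xx status codes into exceptions
    print("Success:", resp.status_code)
except requests.exceptions.RequestException as e:
    # Catches connection errors, timeouts, and the HTTP errors raised above
    print("Request failed:", e)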
# Example usage
url = "https://quotes.toscrape.com/page/1/"
response = scrape_with_exponential_backoff(url)
tree = html.fromstring(response.text)
quotes = tree.xpath('//span[@class="text"]/text()')
print(f"Successfully scraped {len(quotes)} quotes")
|
Exponential backoff means the wait time doubles with each retry: 1s, 2s, 4s, 8s, etc. This gives the server time to recover. Adding "jitter" (randomness; our 'extra_delay') prevents multiple clients from retrying at exactly the same time (the "thundering herd" problem). |
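To see the numbers this formula produces, here is a quick sketch that just prints the backoff schedule with jitter (no requests are made):
import random

initial_delay = 1
for attempt in range(5):
    wait_time = initial_delay * (2 ** attempt)        # 1, 2, 4, 8, 16 seconds
    jitter = random.uniform(0, 0.1 * wait_time)       # up to 10% extra
    print(f"Attempt {attempt + 1}: wait ~{wait_time + jitter:.2f} s")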
Now, implement a scraper with exponential backoff that:
1. Handles 429 (Too Many Requests) errors,
2. Handles 503 (Service Unavailable) errors,
3. Implements exponential backoff with jitter,
4. Logs retry attempts and wait times,
5. Scrapes multiple pages, handling errors gracefully.
2.1. Implement exponential backoff function.
2.2. Handle HTTP error codes 429 and 503.
2.3. Add jitter to avoid synchronized retries.
2.4. Test your function (you can simulate errors by making requests too quickly).
2.5. Scrape multiple pages using your robust scraper.
Question 3 (2 points)
Some websites check the "User-Agent" header to identify what type of browser or program is making the request. Some sites block requests that don’t have a proper User-Agent or that look like bots. We can customize our request headers to appear more like a regular browser:
# Common User-Agent strings - each identifies a web browser
user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36']
User agent strings contain information such as the browser name, version, and operating system in a single line of text, which websites use to learn about the client making the request.
We recommend reading more about it online if you’re interested (some sites, such as 51degrees.com/blog/understanding-user-agent-string, have good summaries).
First, we create a function to send an HTTP request. It takes two parameters: url, specifying the page to scrape, and an optional user_agent. If we don’t specify one, a default is used.
def scrape_with_headers(url, user_agent=None):
"""Scrape with custom headers."""
headers = {'User-Agent': user_agent or user_agents[0],
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',}
response = requests.get(url, headers=headers)
return response
- 'User-Agent': user_agent or user_agents[0]: We use the given user_agent if one is passed in; otherwise we use the first entry in user_agents.
- 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8': This line informs the server which content types are acceptable. Here we accept HTML, XHTML, XML, and any other type (the */* part denotes any type and any subtype).
Also note that quality values are included, as in q=0.9 and q=0.8, where a higher number denotes higher preference (priority). The range is 0.0 to 1.0. In our code, */* has a quality value of 0.8, so it has lower priority than text/html (1.0), application/xhtml+xml (1.0), and application/xml (0.9).
- 'Accept-Language': 'en-US,en;q=0.5': To provide language preferences.
As above, q denotes the quality value, here marking a secondary preference. In our code, we accept US English as the primary preference and other English variants with lower priority.
For further information: developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Accept-Language
- 'Accept-Encoding': 'gzip, deflate': Tells the server which content compression methods the client supports; the server can then compress the response data with one of them.
'gzip' and 'deflate' are popular compression formats; we can accept responses in either.
Specifying 'identity' means no compression should be applied, and the wildcard '*' matches any encoding, even ones not listed explicitly.
You can read more about the other encodings here: developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Accept-Encoding
- 'Connection': 'keep-alive': This keeps the TCP connection open after the request. It can also improve performance, since multiple requests can be sent over the same connection without opening a new one every time.
By default, HTTP/1.0 closes the connection after each request.
For further information: developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Keep-Alive
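One way to confirm which headers were actually sent is to inspect response.request.headers on the returned Response object; a quick hedged check using the function above might look like:
# Inspect the headers that were actually sent with the request
response = scrape_with_headers("https://quotes.toscrape.com/")
for name, value in response.request.headers.items():
    print(f"{name}: {value}")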
Below is an example of scraping with custom headers
# Page to scrape
url = "https://quotes.toscrape.com/"
# Call our function
response = scrape_with_headers(url)
# HTML text conversion into tree structure
tree = html.fromstring(response.text)
# Get quote text from the page
quotes = tree.xpath('//span[@class="text"]/text()')
print(f"Scraped {len(quotes)} quotes with custom headers")
|
The User-Agent header tells the server what browser/client is making the request. Some websites block requests without a User-Agent or with suspicious User-Agents. Using a realistic User-Agent string can help avoid blocks. However, always respect robots.txt and terms of service regardless of headers. |
Some websites also use cookies for session management. We can handle cookies using a session object:
# Create a session to maintain cookies across requests
session = requests.Session()
# Set headers for the session
session.headers.update({
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
# First request - cookies are automatically saved
url1 = "https://quotes.toscrape.com/"
response1 = session.get(url1)
print(f"First request cookies: {session.cookies}")
# Subsequent requests automatically include cookies
url2 = "https://quotes.toscrape.com/page/2/"
response2 = session.get(url2)
print(f"Second request cookies: {session.cookies}")
- requests.Session(): Creates a Session object from the requests library. This lets us maintain parameters such as cookies and headers across multiple HTTP requests.
A session can remember cookies, share headers, and reuse connections. Without it, cookies and headers would not be remembered between individual requests.get() calls.
Documentation: requests.readthedocs.io/en/latest/user/advanced/
- session.headers.update(): session.headers holds the HTTP headers used for every request made by the session; .update() lets us add to or change these defaults.
User-Agent is the header name, and we set its value to 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36', assigning a browser user agent to the session. It gets applied to all requests made by the session.
- response1 = session.get(url1): session.get() sends an HTTP GET request through the session. It works like requests.get(), but reuses the session’s headers, cookies, and connection settings.
In our code, we first send the HTTP GET request to url1. The session supplies the headers, keeps any cookies the server sets (session.cookies), and maintains the connection settings. The returned Response object (response1) contains the HTML content, HTTP status code, and other metadata.
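Since task 3.3 below asks for User-Agent rotation, here is one possible hedged sketch that combines a session with a randomly chosen User-Agent per request; it assumes the user_agents list defined earlier in this question:
import random
import requests

session = requests.Session()
pages = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 4)]
for page_url in pages:
    ua = random.choice(user_agents)                       # rotate the User-Agent
    resp = session.get(page_url, headers={"User-Agent": ua})
    print(f"{page_url} -> {resp.status_code} (UA starts with: {ua[:30]}...)")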
Now, create a scraper that:
1. Uses custom User-Agent headers,
2. Maintains a session across multiple requests,
3. Rotates between different User-Agent strings for different requests,
4. Scrapes multiple pages while maintaining the session.
3.1. Implement a scraper with custom User-Agent headers.
3.2. Use a session object to maintain cookies.
3.3. Rotate User-Agent strings across requests.
3.4. Scrape multiple pages using the session.
3.5. Explain why custom headers might be necessary.
Question 4 (2 points)
In this question, we explore some rules that should be followed before scraping websites, and how to incorporate them into our code. More specifically, we need to follow the instructions given in a file called robots.txt on each website. So in Question 4, we aim to create a function that checks the robots.txt rules and adheres to them by only accessing the allowed paths.
We have two new imports:
First Import:
from urllib.robotparser import RobotFileParser
The RobotFileParser class becomes available to us with this import statement. This is Python’s built-in parser for robots.txt, a standard that websites use to tell scrapers which parts of the site they can and cannot access.
robots.txt is an implementation of the Robots Exclusion Protocol, which defines a set of rules for web crawlers and web robots on which sections of a website they may or may not visit. The text file is located at the root of a website.
A couple of notes: if this file cannot be found, the assumption is that there are no restrictions on accessing the site. Also, each subdomain of a website needs its own 'robots.txt' file.
Second Import:
from urllib.parse import urljoin, urlparse
urllib.parse is a Python module that provides various functions for working with and manipulating URLs. For instance:
1) Break a URL, which starts as one string, into different components such as the addressing scheme, network location, path, etc.
This is where urlparse comes in. urlparse breaks a single string down into six components and returns ParseResult(scheme, netloc, path, params, query, fragment). As you can tell from its fields, this gives us the ability to easily access specific parts of the URL.
Here is a simple example. If we had base_url = "https://parts.toscrape.com/page/1/", then parsed_url = urlparse(base_url) gives us:
scheme = "https", netloc = "parts.toscrape.com", and path = "/page/1/". params, query, and fragment are "".
The last three are empty because the URL simply does not contain them. When they exist, the 'query' appears after '?' and the 'fragment' after '#'. Within a query, '=' joins a key-value pair (key=value) and '&' separates multiple parameters.
2) Convert a relative URL into an absolute URL. urljoin is used here (see the short sketch below).
urljoin takes two arguments, 'base' and 'url'; given a relative URL, it combines it with the 'base' URL to create the absolute version.
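Here is a quick sketch of both functions on a made-up URL (the query string and fragment are invented purely to show where each component ends up):
from urllib.parse import urljoin, urlparse

# Made-up URL with a query string and fragment, just for illustration
parsed = urlparse("https://example.com/catalogue/page/2/?sort=price&order=asc#top")
print(parsed.scheme)    # 'https'
print(parsed.netloc)    # 'example.com'
print(parsed.path)      # '/catalogue/page/2/'
print(parsed.query)     # 'sort=price&order=asc'
print(parsed.fragment)  # 'top'

# urljoin combines a base URL with a relative path to make an absolute URL
print(urljoin("https://example.com/catalogue/", "page/2/"))      # https://example.com/catalogue/page/2/
print(urljoin("https://example.com/catalogue/", "/robots.txt"))  # https://example.com/robots.txt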
|
The official Python documentation for urllib.parse and urllib.robotparser can be very helpful. |
Now, let’s get started on creating a function that checks a website’s 'robots.txt' rules.
It takes two parameters: 'base_url', the website to scrape, and 'user_agent' (default '*'), which specifies which crawler to check the rules for. As in our other uses of '*', the default means the rules that apply to all bots.
def check_robots_txt(base_url, user_agent='*'):
"""Check robots.txt for a given URL."""
robots_url = urljoin(base_url, '/robots.txt')
try:
# Send HTTP GET request to '/robots.txt'
# Because we have 'timeout=5', exception occurs if we do not get a response back in 5 seconds
response = requests.get(robots_url, timeout=5)
# 200 denotes successful request to server
if response.status_code == 200:
rp = RobotFileParser()
rp.set_url(robots_url)
rp.read()
# Check if we can access a specific path
# Break URL down
parsed_url = urlparse(base_url)
# Extract the path part. If there is no path, the default is '/'
path = parsed_url.path or '/'
# Check permission
can_fetch = rp.can_fetch(user_agent, base_url)
print(f"robots.txt found at: {robots_url}")
print(f"Can fetch {base_url}? {can_fetch}")
# Check if the site specified a delay, and if so we show it
crawl_delay = rp.crawl_delay(user_agent)
if crawl_delay:
print(f"Crawl delay specified: {crawl_delay} seconds")
# Return parsed rules and permission result
return rp, can_fetch
# Missing robots.txt case:
else:
print(f"robots.txt not found or inaccessible (status: {response.status_code})")
return None, True # If no robots.txt, generally we assume scraping is allowed
except Exception as e:
print(f"Error checking robots.txt: {e}")
# Returns True as robots.txt could not be checked, and proceed with scraping. But, we take into consideration that it was only assumed.
return None, True
Please see below for more explanation!
- rp = RobotFileParser(): We initialize a RobotFileParser instance. Once initialized, the object can download 'robots.txt' and parse its content.
As mentioned above, this is how crawlers know what they can and cannot access, through the 'Disallow' and 'Allow' directives. 'robots.txt' generally has the structure:
User-agent: *
Disallow:
Allow:
'Disallow' is an instruction not to access a path or directory; it is usually followed by a path that specifies the location. For example, 'Disallow: /admin/' tells crawlers not to visit 'https://example.com/admin/'. An empty 'Disallow' is permission to access everything; 'Allow' can be used to permit access to a specific path, although the empty 'Disallow' is a broader permission. (A short offline example of these directives follows this list.)
- rp.set_url(robots_url): The natural step after creating the instance. This tells the parser the URL of the robots.txt file.
- rp.read(): This fetches the robots.txt file so the parser can use it.
- rp.can_fetch(): Determines whether access is permitted. It returns True if permission is granted, otherwise False.
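As mentioned in the list above, here is a small offline sketch that feeds a made-up robots.txt directly to RobotFileParser via its parse() method, so you can see Disallow and Allow in action without any network request (the rules themselves are invented for illustration):
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, just to illustrate the directives
rules = """
User-agent: *
Disallow: /admin/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("*", "https://example.com/admin/secret.html"))   # False
print(rp.can_fetch("*", "https://example.com/page/1/"))             # True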
# Example: Check robots.txt
base_url = "https://quotes.toscrape.com"
# Call the function
rp, allowed = check_robots_txt(base_url)
# allowed == True
if allowed:
print("\nProceeding with scraping...")
# Your scraping code here
else:
print("\nScraping not allowed by robots.txt!")
|
Always check robots.txt before scraping a website, and respect the rules and crawl delays it specifies. |
Now, create a function that:
1. Checks robots.txt before scraping,
2. Respects crawl delays specified in robots.txt,
3. Only scrapes paths that are allowed,
4. Implements a scraper that checks multiple websites' robots.txt files.
Test with:
- quotes.toscrape.com/
- books.toscrape.com/
- www.google.com/ (to see a more complex robots.txt)
4.1. Write a function to check and parse robots.txt.
4.2. Respect crawl delays from robots.txt.
4.3. Check if specific paths are allowed before scraping.
4.4. Test your function on multiple websites.
4.5. Explain why respecting robots.txt is important.
Question 5 (2 points)
So far we have talked about a few key aspects of web scraping to create a more robust scraper. These include:
- rate limiting
- error handling
- crawl delays
- user agent headers
- robots.txt checking
We can now combine all these aspects into a single scraper class!
|
Normally, when you get to this point, we would move the creation of the class to a Python file instead of keeping it in the notebook, but we need to keep it in the notebook for the sake of the assignment. So please keep in mind that all of the code we go through in this problem will eventually be in one cell, contained within the same class. |
We will start by also adding logging to track what’s happening. First, import the logging module:
import logging
# Set up logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
This creates a logger object that we can use to cleanly log information regarding our scraper’s activity instead of using print statements and feeding to stdout. We set level = logging.INFO to set the level of information we want in our output. There are six levels in order of increasing severity:
- NOTSET=0
- DEBUG=10
- INFO=20
- WARNING=30
- ERROR=40
- CRITICAL=50
By setting it to INFO, we will only see messages at the INFO level and above (we won’t see any DEBUG messages, for example). Sometimes you may want a more sensitive logger, such as DEBUG, which will log all messages; or you may only want to see errors, in which case you would set the level to ERROR.
We can then specify the format of the log messages. Here, we have %(asctime)s - %(levelname)s - %(message)s, which means:
- %(asctime)s: The timestamp of the log message
- %(levelname)s: The level of the log message
- %(message)s: The message of the log message
This is a pretty standard format, but you can customize it to your liking.
We can then specify the name of the logger. Here, we use __name__, which is the name of the module. This is useful if we want to log messages from different modules. If you wanted to make other loggers with different names, you would simply change __name__ to the name of the logger you want (e.g., 'api_logger') and could load that logger in different parts of your code.
You can read more about logging in the official Python documentation (docs.python.org/3/library/logging.html) and in the many online tutorials that explain it well.
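With the configuration above in place, using the logger is just a matter of calling the level-named methods; a quick hedged demonstration (the messages are made up):
logger.debug("This will NOT appear, because the level is set to INFO")
logger.info("Starting the scraper")                # shown at level INFO
logger.warning("robots.txt could not be loaded")   # shown at level WARNING
logger.error("Request failed after 3 retries")     # shown at level ERROR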
Let’s get started on making our scraper class! We will start by defining the class and its constructor:
class RobustScraper:
"""Web scraper with rate limiting, error handling, and robots.txt checking."""
def __init__(self, base_url, user_agent=None, default_delay=1):
"""Initialize the scraper."""
self.base_url = base_url
self.session = requests.Session()
self.default_delay = default_delay
self.robots_parser = None
# Set headers
headers = {
'User-Agent': user_agent or 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
self.session.headers.update(headers)
Here we are going to take in the following parameters:
- base_url: The base URL of the website to scrape.
- user_agent: The user agent to use for the scraper. If not provided, we will use a default user agent.
- default_delay: The default delay between requests.
We also initialize the session object and the robots parser which will be used to check the robots.txt file.
We then set the headers for the session, allowing the user to pass in a user agent or fall back to a default, and accepting HTML, XHTML, and XML content.
Now that we have the headers set and the session object initialized, we are just about ready to start scraping. But, we also need to check the robots.txt file like we did last question to make sure we are not violating any rules set by the website before scraping.
We will do this by creating a private method called _check_robots_txt.
|
Technically Python doesn’t have private methods in the same way other languages like Java do, but in Python we prefix the methods with an underscore to indicate that a method is not meant to be called by outside code - it is only for internal use. |
The logic we will use here is similar to what we did last question. We will get the robots.txt file, parse it, and save it to our class’s robots_parser object if one exists. If not, we will log a warning and assume no rules are set.
class RobustScraper:
...
def _check_robots_txt(self):
"""Check and parse robots.txt."""
robots_url = urljoin(self.base_url, '/robots.txt') # join base url with robots.txt
try:
response = self.session.get(robots_url, timeout=5) # fetch the file
if response.status_code == 200: # if it exists, parse the robots.txt file
self.robots_parser = RobotFileParser()
self.robots_parser.set_url(robots_url)
self.robots_parser.read()
logger.info(f"Loaded robots.txt from {robots_url}")
except Exception as e: # if it does not exist, log a warning, assume no rules are set
logger.warning(f"Could not load robots.txt: {e}")
Now we can add this method to our constructor:
def __init__(self, base_url, user_agent=None, default_delay=1):
"""Initialize the scraper."""
...
self.session.headers.update(headers)
self._check_robots_txt() # check robots.txt
Note that this does not check for permissions yet; we will add that when we start scraping.
|
When creating a class like this, we tend to compartmentalize our code into methods with descriptive names and a clear purpose. This way, if someone wants to glance through the code, they can easily understand logic and purpose without having to read through the entire class and each line. This makes our code more readable and easier to maintain. |
Now we will add the core functionality of our scraper. We will start by adding the fetch method which will be used to fetch a URL using the aspects we covered earlier like exponential backoff and error handling (this method will only be used to fetch the content of the page, not to parse the data; that will be done in another method we will create later).
class RobustScraper:
...
def fetch(self, url, max_retries=3, initial_delay=1):
"""Fetch a URL with exponential backoff and error handling."""
...
First things first, we need to check whether the URL is allowed to be scraped according to the robots.txt file. Let’s make a quick private method called _can_fetch, which returns True if the URL is allowed to be scraped and False otherwise:
class RobustScraper:
...
def _can_fetch(self, url):
"""Check if URL can be fetched according to robots.txt."""
if self.robots_parser:
return self.robots_parser.can_fetch(self.session.headers['User-Agent'], url)
return True
def fetch(self, url, max_retries=3, initial_delay=1):
"""Fetch a URL with exponential backoff and error handling."""
        if not self._can_fetch(url):
logger.warning(f"robots.txt disallows: {url}")
return None
Nice! Now if the URL is allowed to be scraped, we can proceed to fetch the URL.
Recall that even though we are allowed to scrape the URL, we still need to respect the crawl delay specified in the robots.txt file. We will do this by creating a private method called _get_crawl_delay which will return the crawl delay if it exists, otherwise the default delay.
class RobustScraper:
...
def _get_crawl_delay(self):
"""Get crawl delay from robots.txt."""
if self.robots_parser:
delay = self.robots_parser.crawl_delay(self.session.headers['User-Agent'])
if delay:
return delay
return self.default_delay
def fetch(self, url, max_retries=3, initial_delay=1):
"""Fetch a URL with exponential backoff and error handling."""
if not self._can_fetch(url):
logger.warning(f"robots.txt disallows: {url}")
return None
...
for attempt in range(max_retries):
try:
# Wait between requests
delay = self._get_crawl_delay()
time.sleep(delay)
response = self.session.get(url, timeout=10)
...
Now all that’s left for us to do here is to handle the rate limiting and server errors! Please use the logic from the previous question to handle the rate limiting and server errors, but this time use our logger object to log the messages (logger.info for success, logger.warning for warnings, and logger.error for errors).
class RobustScraper:
...
def fetch(self, url, max_retries=3, initial_delay=1):
for attempt in range(max_retries):
try:
# Wait between requests
delay = self._get_crawl_delay()
time.sleep(delay)
response = self.session.get(url, timeout=10)
# Handle rate limiting - warning
if response.status_code == 429:
...
# Handle server errors - warning
if response.status_code >= 500:
...
response.raise_for_status()
logger.info(f"Successfully fetched {url}")
return response
except requests.exceptions.RequestException as e:
if attempt == max_retries - 1: # error
...
... # retry - warning
return None
We now have functionality for fetching a URL while respecting the website’s rules and handling errors gracefully. Now let’s create a method to parse the data we just fetched.
This will be the main driver method called scrape_quotes that will be used to control the flow of our scraper and will be the entry point for the user to scrape the data they want.
Let’s create the scrape_quotes method, which will parse the data we just fetched:
class RobustScraper:
...
def scrape_quotes(self, url):
"""Scrape quotes from a page."""
response = self.fetch(url)
if not response:
return []
...
Now we can finally parse the data we just fetched.
class RobustScraper:
def scrape_quotes(self, url):
"""Scrape quotes from a page."""
        ...
        tree = html.fromstring(response.text)   # parse the fetched HTML into an element tree
        quotes = []                              # list to collect the scraped quotes
quote_containers = tree.xpath('//div[@class="quote"]')
for container in quote_containers:
text = container.xpath('.//span[@class="text"]/text()')
author = container.xpath('.//small[@class="author"]/text()')
if text and author:
quotes.append({
'text': text[0],
'author': author[0]
})
logger.info(f"Scraped {len(quotes)} quotes from {url}")
return quotes
Our class should be finished now; let’s test it out!
scraper = RobustScraper("https://quotes.toscrape.com/")
quotes = scraper.scrape_quotes("https://quotes.toscrape.com/page/1/")
print(f"\nScraped {len(quotes)} quotes:")
for quote in quotes[:3]:
print(f" - {quote['text'][:50]}... - {quote['author']}")
Now, create your own robust scraper class that:
1. Checks robots.txt before scraping,
2. Implements exponential backoff,
3. Handles errors gracefully,
4. Includes logging,
5. Respects crawl delays,
6. Uses custom headers.
Use it to scrape multiple pages from a website of your choice.
5.1. Create a RobustScraper class with all advanced features.
5.2. Implement robots.txt checking.
5.3. Add logging to track scraping activity.
5.4. Test your scraper on multiple pages.
5.5. Demonstrate error handling and retry logic.
5.6. Explain the ethical considerations of web scraping.
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- firstname_lastname_project3.ipynb
|
It is necessary to document your work, with comments about each solution. All of your work needs to be your own, with citations for any source that you used. Please make sure that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template. Please take the time to double check your work; see the Examples Book for instructions on how to do this. You will not receive full credit if your submitted notebook does not display properly in Gradescope. |