How to scrape tables from websites using Pandas read_html() function
If you’re a data scientist or analyst, you know how important it is to be able to extract data from websites. Fortunately, Python makes this easy with the Pandas library. In this blog post, we’ll look at how to scrape tables from websites in Python using Pandas pd.read_html().
The Pandas read_html() function is an efficient and convenient way to convert an HTML table into a Pandas DataFrame. This tool can be beneficial for quickly importing tables from various websites without having to figure out how to scrape the HTML. Utilizing Pandas read_html is simple and works well with uncomplicated tables, such as those on Wikipedia pages. However, there can be some challenges when using this function on websites that do not welcome scraping, and the resulting tables sometimes need extra formatting work.
The pd.read_html() function takes a URL as an argument and returns a list of DataFrame objects, each of which contains the data from a single table. To use it, you will need to import the Pandas library.
import pandas as pd
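Note that read_html() relies on an HTML parser under the hood, such as lxml, or BeautifulSoup4 together with html5lib. If the function raises an ImportError about a missing parser, install one first:
pip install lxml html5lib beautifulsoup4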
Once you have imported Pandas, you can use the read_html() function to extract tables from a website. For example, if you wanted to extract tables from the Wikipedia page for the 2020 Summer Olympics, you could use the following code:
tables = pd.read_html("https://en.wikipedia.org/wiki/2020_Summer_Olympics")
or you can do the same in two lines using the following code:
url = 'https://en.wikipedia.org/wiki/2020_Summer_Olympics'
tables = pd.read_html(url)
Please note that this code extracts all tables available from the URL. You can check the number of tables captured from the page using:
print(f'Total tables: {len(tables)}')
Let's see how many tables are on the Wikipedia page for the 2020 Summer Olympics. The print statement above tells us that there are nineteen tables on the page. Now, you may be wondering how to scrape a specific table from the page. You will need to inspect the scraped tables using an index. Tip: tables is the list of DataFrames returned by Pandas, and you can get the first DataFrame in the list using tables[0].
This data comes from the table called "Games of the XXXII Olympiad".
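To figure out which index a particular table was given, you can preview every DataFrame in the list. Here is a minimal sketch that prints each table's position and its first few column names as a quick fingerprint:
# Print each table's index and a short preview to identify it
for i, table in enumerate(tables):
    print(f'Table {i} columns: {list(table.columns)[:5]}')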
If you want to extract the table called "2020 Summer Olympics medal table", you can inspect all scraped tables and identify that it was assigned index 7, i.e. tables[7]. However, this is not an effective way of scraping the table from the page. Instead, you can make the table selection easier by using the match parameter to select a subset of tables. We can use the caption "2020 Summer Olympics medal table" to select the table.
url = 'https://en.wikipedia.org/wiki/2020_Summer_Olympics'
medal_table = pd.read_html(url, match='2020 Summer Olympics medal table')
print(medal_table)
This gives us the data we wanted from the table called "2020 Summer Olympics medal table".
Now you can save the table as a CSV file using the following code:
# read_html() always returns a list, even when match narrows it to one table
df = medal_table[0]
df.to_csv('medal_table.csv')
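By default, to_csv() also writes the DataFrame index as the first column. If you don't want that, pass index=False:
# Skip the DataFrame index when writing the CSV file
df.to_csv('medal_table.csv', index=False)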
Full Solution
import pandas as pd

url = 'https://en.wikipedia.org/wiki/2020_Summer_Olympics'
tables = pd.read_html(url)

# prints number of tables found on the page
print(f'Total tables: {len(tables)}')

# check first which index was assigned to the table you are after
# in this case "2020 Summer Olympics medal table" has index 7
# alternatively you can use the match parameter to get one table
df = tables[7]

# save to csv file
df.to_csv('Olympics2020_medal_table.csv')
Pandas read_html() and error 403
This function works well on Wikipedia pages, but not on websites that don't welcome scraping. On those, you are likely to run into a 403 error: "HTTPError: HTTP Error 403: Forbidden". Here is how to get around this problem: use the requests library together with Pandas.
For example, if you try to scrape the table called "2020 FFIEC Census Report - Summary Census Demographic Information" from the FFIEC website using only Pandas, you will get an error 403.
pd.read_html(url)  # raises HTTPError: HTTP Error 403: Forbidden
You can get around a 403 error by using the requests library in conjunction with the pandas read_html() function.
import requests
import pandas as pd

url = 'https://www.ffiec.gov/census/report.aspx?year=2020&state=01&report=demographic&msa=11500'

# get the html first with the get function from the requests library
html = requests.get(url).content

# pass the html into the read_html() function
df_list = pd.read_html(html)

# index -1 will get you the last table available on the page
df = df_list[-1]

# save to csv file
df.to_csv('my data.csv')
print(df)
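Some websites return a 403 even for plain requests calls because they block the default requests user agent. In that case, sending a browser-like User-Agent header usually helps. This is a sketch, and the header value below is just an illustrative example:
import requests
import pandas as pd

url = 'https://www.ffiec.gov/census/report.aspx?year=2020&state=01&report=demographic&msa=11500'

# Pretend to be a regular browser (example User-Agent string)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
html = requests.get(url, headers=headers).content
df_list = pd.read_html(html)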
Screenshot of partial table printout
Conclusion
This demonstrates that using pd.read_html() is a straightforward way of scraping data from webpages. Pandas uses BeautifulSoup4 and html5lib under the hood, which makes it a great fit for simple data scraping projects. However, if you want to be able to scrape any website on the web, you will need to learn much more about scraping. Hehe, learning never stops, just like web development.