Session 26 – Python: Web Scraping w/ BeautifulSoup
26.1 Introduction
The following guide was adapted from here.
BeautifulSoup is a library that lets the user easily scrape data from web pages. More technically, it parses HTML and XML code. Always make sure your target webpage allows scraping before mining that website. Not all websites allow this activity, as it can be taxing on their servers, and access (especially repeated access) is costly to the provider.
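One common way to check whether a site permits scraping is its robots.txt file. Python's standard library can parse these rules; the sketch below parses an inline robots.txt string (the rules shown are invented for illustration) instead of fetching one from a live site:

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt rules, supplied inline for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# check which paths a generic crawler ("*") may fetch
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/data.html"))  # False
```

For a real site you would point `RobotFileParser` at the site's own robots.txt (via `set_url` and `read`) rather than an inline string.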
26.2 Installation
Install in terminal with the following commands:
$pip install beautifulsoup4
$python -m pip install requests
Note for macOS use: $pip3 install beautifulsoup4
26.3 Getting started
Open the Python version in which BeautifulSoup was installed and change to your working directory for the class.
# import requests
import requests
# get webpage data
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
# if page provides output then we successfully downloaded the page into Python
page
# you can even get the status code of the webpage
page.status_code
# you can print the HTML content that was downloaded
page.content
# now that we have downloaded the webpage we can use BeautifulSoup to parse it, the actual scraping of the content saved above
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup
# and we can view the html in a structured format
print(soup.prettify())
# we can call the HTML as a list with soup
list(soup.children)
# we can look up the element types involved
[type(item) for item in list(soup.children)]
# We only want to look at the Tags in this case
html = list(soup.children)[2]
html
# Reformat html tags as a list
list(html.children)
# We can now further select the html tag from this new list
body = list(html.children)[3]
# And make that into a list! We are making so many lists
list(body.children)
# we can now select the 2nd item in the list
p = list(body.children)[1]
# and finally we can extract only the text from that element without the html code that flanks it
p.get_text()
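The same children-indexing steps can be reproduced without a network connection by parsing an inline HTML string. The markup below is a made-up stand-in for the downloaded page, not the dataquest example itself:

```python
from bs4 import BeautifulSoup

# a tiny stand-in document (hypothetical markup, not the dataquest page)
html_doc = "<html><head><title>A page</title></head><body><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# find() jumps straight to the body tag, skipping the manual list indexing
body = soup.find('body')
# the 2nd child of body is the second <p> tag
p = list(body.children)[1]
print(p.get_text())  # Second
```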
This is the key to web scraping: finding the part of the HTML or XML code that you are interested in and mining it.
26.6 Use CSS Selectors for positional mining
# finds all p tags inside of div tag
soup.select("div p")
# find all p tags with a class of outer-text
soup.select("p.outer-text")
# find all p tags with an id of first
soup.select("p#first")
# find all p tags with class outer-text inside of body tag
soup.select("body p.outer-text")
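Each of these selectors can be tried against an inline snippet. The markup below is invented to exercise all four patterns:

```python
from bs4 import BeautifulSoup

# hypothetical markup exercising each selector above
html_doc = """
<body>
  <div><p id="first" class="outer-text">A</p></div>
  <p class="outer-text">B</p>
  <p>C</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

print(len(soup.select("div p")))                            # only "A" is inside a div: 1
print([p.get_text() for p in soup.select("p.outer-text")])  # ['A', 'B']
print(soup.select("p#first")[0].get_text())                 # A
print(len(soup.select("body p.outer-text")))                # both are inside body: 2
```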
26.7 Weather Page Example
We can mine specific data from a weather website:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
img = tonight.find("img")
desc = img['title']
print(desc)
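Since the live forecast page can change or be unreachable, the same find / find_all calls can be rehearsed on a stand-in snippet whose structure merely imitates the page's tombstone-container markup (all values below are invented):

```python
from bs4 import BeautifulSoup

# invented snippet imitating the forecast page's structure
snippet = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Clear</p>
    <p class="temp">Low: 50 F</p>
    <img title="Tonight: Clear, with a low around 50.">
  </div>
</div>
"""
soup = BeautifulSoup(snippet, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
tonight = seven_day.find_all(class_="tombstone-container")[0]

print(tonight.find(class_="period-name").get_text())  # Tonight
print(tonight.find(class_="temp").get_text())         # Low: 50 F
print(tonight.find("img")["title"])                   # the img tag's title attribute
```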
We can extract all the information instead of just pieces:
# chunks of info
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods
# all info
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)
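The select-plus-list-comprehension pattern can also be tested offline. Here is a sketch against an invented two-entry snippet with the same class names:

```python
from bs4 import BeautifulSoup

# invented two-entry snippet reusing the forecast page's class names
snippet = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Tonight</p><p class="temp">Low: 50 F</p></div>
  <div class="tombstone-container"><p class="period-name">Saturday</p><p class="temp">High: 65 F</p></div>
</div>
"""
soup = BeautifulSoup(snippet, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")

# one list comprehension per field, each walking all matching tags
periods = [pt.get_text() for pt in seven_day.select(".tombstone-container .period-name")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(periods)  # ['Tonight', 'Saturday']
print(temps)    # ['Low: 50 F', 'High: 65 F']
```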
We can then create a table with that scraped data:
Install Pandas: $ pip install pandas
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})
weather
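A minimal sketch of the DataFrame step, using invented lists in place of the scraped values so it runs without a network connection:

```python
import pandas as pd

# invented stand-ins for the scraped lists
periods = ["Tonight", "Saturday"]
short_descs = ["Clear", "Sunny"]
temps = ["Low: 50 F", "High: 65 F"]
descs = ["Tonight: Clear.", "Saturday: Sunny."]

# each dict key becomes a column; the lists must all be the same length
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})
print(weather.shape)  # (2, 4)
```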