Session 26 – Python: Web Scraping w/ BeautifulSoup
26.1 Introduction
The following guide was adapted from here.
BeautifulSoup is a library that lets the user easily scrape data from web pages. More technically, it parses HTML and XML code. Always make sure your target webpage allows scraping before mining that website. Not all websites allow this activity, as it can be taxing on their servers, and access (especially repeated access) is costly to the provider.
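One common way to check whether a site permits scraping is its robots.txt file. Python's standard library can parse these rules; the sketch below parses an inline robots.txt string (the rules shown are invented for illustration) instead of fetching one from a live site:

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt rules, supplied inline for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# check which paths a generic crawler ("*") may fetch
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/data.html"))  # False
```

For a real site you would point `RobotFileParser` at the site's own robots.txt (via `set_url` and `read`) rather than an inline string.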
26.2 Installation
Install in terminal with the following commands:
$pip install beautifulsoup4
$python -m pip install requests
Note for macOS use: $pip3 install beautifulsoup4
26.3 Getting started
Open the Python version in which BeautifulSoup was installed and change to your working directory for the class.
# import requests
import requests
# get webpage data
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
# if page provides output then we successfully downloaded the page into Python
page
# you can even get the status code of the webpage
page.status_code
# you can print the HTML content that was downloaded
page.content
# now that we have downloaded the webpage we can use BeautifulSoup to parse it, the actual scraping of the content saved above
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup
# and we can view the html in a structured format
print(soup.prettify())
# we can call the HTML as a list with soup
list(soup.children)
# we can look up the element types involved
[type(item) for item in list(soup.children)]
# We only want to look at the Tags in this case
html = list(soup.children)[2]
html
# Reformat html tags as a list
list(html.children)
# We can now further select the html tag from this new list
body = list(html.children)[3]
# And make that into a list! We are making so many lists
list(body.children)
# we can now select the 2nd item in the list
p = list(body.children)[1]
# and finally we can extract only the text from that element without the html code that flanks it
p.get_text()
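The same children-indexing steps can be reproduced without a network connection by parsing an inline HTML string. The markup below is a made-up stand-in for the downloaded page, not the dataquest example itself:

```python
from bs4 import BeautifulSoup

# a tiny stand-in document (hypothetical markup, not the dataquest page)
html_doc = "<html><head><title>A page</title></head><body><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# find() jumps straight to the body tag, skipping the manual list indexing
body = soup.find('body')
# the 2nd child of body is the second <p> tag
p = list(body.children)[1]
print(p.get_text())  # Second
```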
This is the key to web scraping: finding the part of the HTML or XML code that you are interested in and mining it.
26.6 Use CSS Selectors for positional mining
# finds all p tags inside of div tag
soup.select("div p")
# find all p tags with a class of outer-text
soup.select("p.outer-text")
# find all p tags with an id of first
soup.select("p#first")
# find all p tags with class outer-text inside of body tag
soup.select("body p.outer-text")
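Each of these selectors can be tried against an inline snippet. The markup below is invented to exercise all four patterns:

```python
from bs4 import BeautifulSoup

# hypothetical markup exercising each selector above
html_doc = """
<body>
  <div><p id="first" class="outer-text">A</p></div>
  <p class="outer-text">B</p>
  <p>C</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

print(len(soup.select("div p")))                            # only "A" is inside a div: 1
print([p.get_text() for p in soup.select("p.outer-text")])  # ['A', 'B']
print(soup.select("p#first")[0].get_text())                 # A
print(len(soup.select("body p.outer-text")))                # both are inside body: 2
```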
26.7 Weather Page Example
We can mine specific data from a weather website:
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
img = tonight.find("img")
desc = img['title']
print(desc)
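Since the live forecast page can change or be unreachable, the same find / find_all calls can be rehearsed on a stand-in snippet whose structure merely imitates the page's tombstone-container markup (all values below are invented):

```python
from bs4 import BeautifulSoup

# invented snippet imitating the forecast page's structure
snippet = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Clear</p>
    <p class="temp">Low: 50 F</p>
    <img title="Tonight: Clear, with a low around 50.">
  </div>
</div>
"""
soup = BeautifulSoup(snippet, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
tonight = seven_day.find_all(class_="tombstone-container")[0]

print(tonight.find(class_="period-name").get_text())  # Tonight
print(tonight.find(class_="temp").get_text())         # Low: 50 F
print(tonight.find("img")["title"])                   # the img tag's title attribute
```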
We can extract all the information instead of just pieces:
# chunks of info
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods
# all info
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)
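The select-plus-list-comprehension pattern can also be tested offline. Here is a sketch against an invented two-entry snippet with the same class names:

```python
from bs4 import BeautifulSoup

# invented two-entry snippet reusing the forecast page's class names
snippet = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Tonight</p><p class="temp">Low: 50 F</p></div>
  <div class="tombstone-container"><p class="period-name">Saturday</p><p class="temp">High: 65 F</p></div>
</div>
"""
soup = BeautifulSoup(snippet, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")

# one list comprehension per field, each walking all matching tags
periods = [pt.get_text() for pt in seven_day.select(".tombstone-container .period-name")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(periods)  # ['Tonight', 'Saturday']
print(temps)    # ['Low: 50 F', 'High: 65 F']
```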
We can then create a table with that scraped data:
Install Pandas: $ pip install pandas
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})
weather
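A minimal sketch of the DataFrame step, using invented lists in place of the scraped values so it runs without a network connection:

```python
import pandas as pd

# invented stand-ins for the scraped lists
periods = ["Tonight", "Saturday"]
short_descs = ["Clear", "Sunny"]
temps = ["Low: 50 F", "High: 65 F"]
descs = ["Tonight: Clear.", "Saturday: Sunny."]

# each dict key becomes a column; the lists must all be the same length
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs,
})
print(weather.shape)  # (2, 4)
```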