Data Collection

Using APIs and extracting data from HTML pages
Author
Affiliation

Department of Biostatistics, Johns Hopkins

Published

November 7, 2024

Pre-lecture activities

Important

In advance of class, please install the following R packages

  1. jsonlite (https://jeroen.cran.dev/jsonlite)
  2. httr2 (https://httr2.r-lib.org)
  3. rvest (https://rvest.tidyverse.org) (should be installed already with the tidyverse)
install.packages("jsonlite")
install.packages("httr2")

In addition, please read through

Lecture

Acknowledgements

Material for this lecture was borrowed and adapted from

Learning objectives

At the end of this lesson you will be able to:

  • Describe the JSON file format
  • Convert JSON data into data frames in R
  • Explain what an API is and state four types of API architectures
  • Use the GitHub API and make authenticated requests
  • Use a range of rvest functions to scrape data from HTML pages
  • Recognize various HTML elements on a page (text, links, images, lists, etc.)

Slides

Class activity

For the rest of the time in class, we will practice using rvest to learn how to scrape data from HTML pages. To do this, we will load the following packages and extract data from the FIFA Women’s World Cup HTML page.

library(tidyverse)
library(rvest) # should be installed with the tidyverse
library(xml2)
Objectives of the activity
  • Practice a range of rvest functions to scrape data from the web
  • Understand various HTML elements on the page (text, links, images, lists, etc.)
Tip

If any tables are nested within specific sections, you may need to target them individually using CSS selectors.
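As a minimal sketch of the tip above, the code below parses a small inline HTML snippet (a made-up stand-in, not the Wikipedia page) and uses a CSS selector to pull out only the table nested inside one section. The `#results` id and the `layout` and `wikitable` class names are assumptions chosen for illustration; inspect the real page to find the selectors that apply there.

```r
library(rvest)

# Hypothetical HTML: a layout table plus a data table nested in a section
toy_page <- read_html('<div>
  <table class="layout"><tr><td>navigation</td></tr></table>
  <div id="results">
    <table class="wikitable">
      <tr><th>Year</th><th>Champion</th></tr>
      <tr><td>1991</td><td>United States</td></tr>
      <tr><td>1995</td><td>Norway</td></tr>
    </table>
  </div>
</div>')

# Target only the table inside the #results section, then parse it
results_tbl <- toy_page %>%
  html_element("#results table.wikitable") %>%
  html_table()

results_tbl
```

Selecting by CSS selector avoids depending on the position of a table in the list returned by html_table(), which can shift whenever the page is edited.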

Part 1: Extracting tables

  • Use html_table() to extract the table listing the champion and the corresponding runner-up for each FIFA Women’s World Cup year. This is the first table under “Results”.
  • Show the first few rows with head().
# This is the URL we want to scrape data from
url <- "https://en.wikipedia.org/wiki/FIFA_Women%27s_World_Cup"

# Cache the page in a local HTML file to avoid downloading it again each time this Quarto file is compiled.
html_worldcup <- here::here("lectures","04-data-collection", "world-cup.html")

if(!file.exists(html_worldcup)){
  page <- read_html(url)
  write_html(page, html_worldcup)
} else {
  page <- read_html(html_worldcup)
}
# Extract the table and show the first few rows
page %>%
  html_table() %>%
  .[[5]] %>% 
  head()
# A tibble: 6 × 13
  X1    X2     X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13  
  <chr> <chr>  <chr> <lgl> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <lgl> <chr>
1 Ed.   Year   Hosts NA    Final Final Final NA    Thir… Thir… Thir… NA    No. …
2 Ed.   Year   Hosts NA    Cham… Score Runn… NA    Thir… Score Four… NA    No. …
3 1     1991   China NA    Unit… 2–1   Norw… NA    Swed… 4–0   Germ… NA    12   
4 2     1995   Swed… NA    Norw… 2–0   Germ… NA    Unit… 2–0   China NA    12   
5 3     1999   Unit… NA    Unit… 0–0 … China NA    Braz… 0–0[… Norw… NA    16   
6 4     2003[… Unit… NA    Germ… 2–1 … Swed… NA    Unit… 3–1   Cana… NA    16   
  • Use html_table() to extract a table containing the number of goals scored by country. This is the second table under “Top goalscorers”.
# Extract the table and show the first few rows
page %>%
  html_table() %>%
  .[[9]] %>% 
  head()
# A tibble: 6 × 3
   Rank Country       `Goals scored`
  <int> <chr>                  <int>
1     1 United States            142
2     2 Germany                  129
3     3 Norway                   100
4     4 Sweden                    83
5     5 Brazil                    71
6     6 England                   56
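The first table above comes back with generic column names (X1, X2, ...) and its two header rows stored as data. One possible clean-up is sketched below on a small hand-made tibble that mimics that shape. The `raw_tbl` object and its values are illustrative stand-ins, not the scraped data, and the real table would likely need extra steps (for example, removing footnote markers such as the one in "2003[…" before converting years to integers).

```r
library(dplyr)

# Hypothetical stand-in for raw html_table() output with two header rows
raw_tbl <- tibble(
  X1 = c("Ed.", "Ed.", "1", "2"),
  X2 = c("Year", "Year", "1991", "1995"),
  X3 = c("Final", "Champions", "United States", "Norway"),
  X4 = c("Final", "Runners-up", "Norway", "Germany")
)

# Use the second header row as column names, then drop both header rows
clean_tbl <- raw_tbl %>%
  setNames(as.character(unlist(raw_tbl[2, ]))) %>%
  slice(-(1:2)) %>%
  mutate(Year = as.integer(Year))

clean_tbl
```

The second header row is used because it carries the more specific labels (Champions, Runners-up) rather than the spanning "Final" label.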

Part 2: Extract text

  • Use html_elements() to select the introductory paragraphs of the Wikipedia page (e.g., the first three <p> tags in the main content), where key information about the World Cup is provided.
  • Use html_text2() to retrieve the text.
page %>%
  html_elements("p") %>%   # Select all <p> elements on the page
  .[1:3] %>%               # Keep the first three; adjust as needed
  html_text2() 
[1] ""
[2] "The FIFA Women's World Cup is an international association football competition contested by the senior women's national teams of the members of Fédération Internationale de Football Association (FIFA), the sport's international governing body. The competition has been held every four years and one year after the men's FIFA World Cup since 1991, when the inaugural tournament, then called the FIFA Women's World Championship, was held in China. Under the tournament's current format, national teams vie for the remaining 31 slots in a three-year qualification phase. The host nation's team is automatically entered as the first slot. The tournament, called the World Cup Finals, is contested at venues within the host nation(s) over about one month."
[3] "The nine FIFA Women's World Cup tournaments have been won by five national teams. The United States have won four times. The other winners are Germany, with two titles, and Japan, Norway, and Spain with one title each."
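Note that the first element returned above is an empty string, because the first <p> tag on the page has no text. The sketch below shows one way to drop empty paragraphs, and also how html_attr() can pull link targets out of the same elements. It runs on a made-up inline HTML snippet rather than the Wikipedia page, so the content and the link path are assumptions for illustration.

```r
library(rvest)

# Hypothetical HTML with one empty paragraph and one link
toy_page <- read_html('<div id="content">
  <p class="empty"></p>
  <p>First paragraph with a <a href="/wiki/FIFA">link</a>.</p>
  <p>Second paragraph.</p>
</div>')

# Extract the text of every paragraph
paragraphs <- toy_page %>%
  html_elements("p") %>%
  html_text2()

# Keep only the paragraphs that contain text
non_empty <- paragraphs[nzchar(paragraphs)]

# Extract the href attribute of links found inside paragraphs
links <- toy_page %>%
  html_elements("p a") %>%
  html_attr("href")

non_empty
links
```

Filtering on nzchar() is more robust than hard-coding an index such as .[2:3], since the number of empty paragraphs can change between pages.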

Post-lecture

Other good R packages to know about

Additional practice

Here are some additional practice questions to help you think about the material discussed.

Questions
  1. Using the GitHub API, access your repository information and find out how many open GitHub issues you have.
  2. Pick another API that we have not discussed here and use httr2 to retrieve data from it.
  3. Look up how many open issues there are for the dplyr package in the tidyverse.
  4. Practice requesting data from the openFDA API, which returns JSON files. This API aims to create easy access to public data, to create a new level of openness and accountability, to ensure the privacy and security of public FDA data, and ultimately to educate the public and save lives. See data definitions for all included data.
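As a possible starting point for the GitHub questions, here is a minimal sketch of building a request with httr2. It assumes the public endpoint `https://api.github.com/repos/{owner}/{repo}` and the `open_issues_count` field of its JSON response, which, as far as I know, counts open pull requests as well as issues. The `count_open_issues()` helper and the user agent string are hypothetical names introduced for illustration. The request is only built here, not sent, so the code runs without internet access.

```r
library(httr2)

# Build (but do not send) a request for the tidyverse/dplyr repository
req <- request("https://api.github.com") |>
  req_url_path_append("repos", "tidyverse", "dplyr") |>
  req_headers(Accept = "application/vnd.github+json") |>
  req_user_agent("data-collection-lecture (hypothetical)")

# Inspect the URL that would be requested
req$url

# Hypothetical helper: send the request and read one field from the JSON body
count_open_issues <- function(req) {
  resp <- req_perform(req)
  resp_body_json(resp)$open_issues_count
}

# count_open_issues(req)  # requires internet access; not run here
```

The native pipe `|>` requires R 4.1 or later. Unauthenticated requests to the GitHub API are rate limited, so for repeated use you would add a token to the request.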