Data Collection
Pre-lecture activities
In advance of class, please install the following R packages:
- jsonlite (https://jeroen.cran.dev/jsonlite)
- httr2 (https://httr2.r-lib.org)
- rvest (https://rvest.tidyverse.org) (should be installed already with the tidyverse)
install.packages("jsonlite")
install.packages("httr2")
In addition, please read through
Lecture
Acknowledgements
Material for this lecture was borrowed and adapted from
Learning objectives
At the end of this lesson you will be able to:
- Describe the JSON file format
- Convert data in JSON format into data frames in R
- Explain what an API is and state four types of API architectures
- Use the GitHub API and make authenticated requests
- Use a range of rvest functions to scrape data from HTML pages
- Recognize various HTML elements on the page (text, links, images, lists, etc.)
Slides
Class activity
For the rest of the time in class, we will practice using rvest to scrape data from HTML pages. To do this, we will load the following packages and extract data from the FIFA Women’s World Cup HTML page.
library(tidyverse)
library(rvest) # should be installed with the tidyverse
library(xml2)
- Practice a range of rvest functions to scrape data from the web
- Understand various HTML elements on the page (text, links, images, lists, etc.)
If any tables are nested within specific sections, you may need to target them individually using CSS selectors.
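As a minimal sketch of that idea, the example below builds a small in-memory page with rvest's minimal_html() and pulls out only the table inside one section. The HTML, the section ids (#results, #records), and the column names are made up for illustration and are not taken from the Wikipedia page.

```r
library(rvest)

# Hypothetical page with two sections, each holding one table
toy_page <- minimal_html('
  <div id="results">
    <table>
      <tr><th>Year</th><th>Host</th></tr>
      <tr><td>1</td><td>A</td></tr>
      <tr><td>2</td><td>B</td></tr>
    </table>
  </div>
  <div id="records">
    <table>
      <tr><th>Team</th><th>Goals</th></tr>
      <tr><td>C</td><td>3</td></tr>
    </table>
  </div>
')

# html_table() on the whole page returns a list with every table
all_tables <- html_table(toy_page)
length(all_tables)

# A CSS selector narrows the search to the table inside #results
results_tbl <- toy_page %>%
  html_element("#results table") %>%
  html_table()
results_tbl
```

Selecting by section avoids relying on a table's position in the list, which can shift whenever the page is edited.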
Part 1: Extracting tables
- Use html_table() to extract a table containing the FIFA Women’s World Cup winner and the corresponding runner-up for each World Cup year. This is the first table under “Results”.
- Show the first few rows with head().
# This is the URL we want to scrape data from
url <- "https://en.wikipedia.org/wiki/FIFA_Women%27s_World_Cup"
# This is a local HTML file to avoid scraping data from it each time I compile this quarto file.
html_worldcup <- here::here("lectures", "04-data-collection", "world-cup.html")
if (!file.exists(html_worldcup)) {
  page <- read_html(url)
  write_html(page, html_worldcup)
} else {
  page <- read_html(html_worldcup)
}
# Extract the table and show the first few rows
page %>%
  html_table() %>%
  .[[5]] %>%
  head()
# A tibble: 6 × 13
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
<chr> <chr> <chr> <lgl> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <lgl> <chr>
1 Ed. Year Hosts NA Final Final Final NA Thir… Thir… Thir… NA No. …
2 Ed. Year Hosts NA Cham… Score Runn… NA Thir… Score Four… NA No. …
3 1 1991 China NA Unit… 2–1 Norw… NA Swed… 4–0 Germ… NA 12
4 2 1995 Swed… NA Norw… 2–0 Germ… NA Unit… 2–0 China NA 12
5 3 1999 Unit… NA Unit… 0–0 … China NA Braz… 0–0[… Norw… NA 16
6 4 2003[… Unit… NA Germ… 2–1 … Swed… NA Unit… 3–1 Cana… NA 16
- Use html_table() to extract a table containing the number of goals scored by country. This is the second table under “Top goalscorers”.
# Extract the table and show the first few rows
page %>%
  html_table() %>%
  .[[9]] %>%
  head()
# A tibble: 6 × 3
Rank Country `Goals scored`
<int> <chr> <int>
1 1 United States 142
2 2 Germany 129
3 3 Norway 100
4 4 Sweden 83
5 5 Brazil 71
6 6 England 56
Part 2: Extract text
- Use html_elements() to select the introductory paragraph(s) of text (e.g., the first three <p> tags under the main content) from the introductory section of the Wikipedia page, where key information about the World Cup is provided.
- Use html_text2() to retrieve the text.
page %>%
  html_elements("p") %>% # Select every paragraph (<p>) element on the page
  .[1:3] %>% # Keep the first three; adjust based on the number of paragraphs you want
  html_text2()
[1] ""
[2] "The FIFA Women's World Cup is an international association football competition contested by the senior women's national teams of the members of Fédération Internationale de Football Association (FIFA), the sport's international governing body. The competition has been held every four years and one year after the men's FIFA World Cup since 1991, when the inaugural tournament, then called the FIFA Women's World Championship, was held in China. Under the tournament's current format, national teams vie for the remaining 31 slots in a three-year qualification phase. The host nation's team is automatically entered as the first slot. The tournament, called the World Cup Finals, is contested at venues within the host nation(s) over about one month."
[3] "The nine FIFA Women's World Cup tournaments have been won by five national teams. The United States have won four times. The other winners are Germany, with two titles, and Japan, Norway, and Spain with one title each."
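The first element of the output above is an empty string, because the page contains an empty paragraph before the introduction. One possible way to skip such paragraphs is sketched below on a small made-up page built with minimal_html(); the HTML here is hypothetical and only mimics that situation.

```r
library(rvest)

# Hypothetical page: an empty paragraph followed by two real ones
toy_page <- minimal_html("<p></p><p>First paragraph.</p><p>Second paragraph.</p>")

paragraphs <- toy_page %>%
  html_elements("p") %>%
  html_text2()

# Keep only paragraphs that contain something other than whitespace
non_empty <- paragraphs[nzchar(trimws(paragraphs))]
head(non_empty, 2)
```

Filtering before indexing means the first few elements are always paragraphs with actual text.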
Part 3: Extract external links
Here, we will extract a list of external links related to the FIFA Women’s World Cup from the “References” section at the bottom.
- Use html_elements() to select all the references.
- Use html_elements() to select all links (<a> tags) in the references section.
- Use html_attr() to extract the URLs (href) for these links.
- Display a list of all unique URLs (hint: use the unique() function).
page %>%
  html_elements(".mw-references-columns") %>%
  html_elements("li") %>%
  html_elements("a") %>%
  html_attr("href") %>%
  unique()
[1] "#cite_ref-1"
[2] "https://web.archive.org/web/20150706151332/http://www.ussoccer.com/stories/2015/07/05/21/19/150705-wnt-v-jpn-game-story"
[3] "http://www.ussoccer.com/stories/2015/07/05/21/19/150705-wnt-v-jpn-game-story"
[4] "#cite_ref-FIFAformat_2-0"
[5] "#cite_ref-FIFAformat_2-1"
[6] "#cite_ref-FIFAformat_2-2"
[7] "https://web.archive.org/web/20141209162112/http://resources.fifa.com/mm/document/tournament/competition/02/07/47/91/regulationsfwwccanada2015e_neutral.pdf"
[8] "http://resources.fifa.com/mm/document/tournament/competition/02/07/47/91/regulationsfwwccanada2015e_neutral.pdf"
[9] "#cite_ref-3"
[10] "https://www.rsssf.org/tablesm/mondo-women70.html"
[11] "/wiki/RSSSF"
[12] "https://web.archive.org/web/20220728155233/https://www.rsssf.org/tablesm/mondo-women70.html"
[13] "#cite_ref-4"
[14] "https://www.bbc.co.uk/news/business-46149887"
[15] "/wiki/British_Broadcasting_Cooperation"
[16] "https://web.archive.org/web/20181207214128/https://www.bbc.com/news/business-46149887"
[17] "#cite_ref-5"
[18] "https://www.rsssf.org/tablesm/mundo-women71.html"
[19] "https://web.archive.org/web/20220728155235/https://www.rsssf.org/tablesm/mundo-women71.html"
[20] "#cite_ref-6"
[21] "https://www.theguardian.com/football/blog/2015/jun/04/womens-world-cup-unofficial-record-breaking"
[22] "/wiki/The_Guardian"
[23] "https://web.archive.org/web/20150605044132/https://www.theguardian.com/football/blog/2015/jun/04/womens-world-cup-unofficial-record-breaking"
[24] "#cite_ref-7"
[25] "https://www.rsssf.org/tablesm/mundialito-women.html"
[26] "https://web.archive.org/web/20220803221248/https://www.rsssf.org/tablesm/mundialito-women.html"
[27] "#cite_ref-8"
[28] "http://www.the-afc.com/competitions/afc-womens-asian-cup/latest/news/foundation-of-asian-brilliance"
[29] "https://web.archive.org/web/20190703193448/http://www.the-afc.com/competitions/afc-womens-asian-cup/latest/news/foundation-of-asian-brilliance"
[30] "#cite_ref-9"
[31] "https://web.archive.org/web/20190608195632/https://www.fifa.com/womensworldcup/news/ellen-wille-mother-norwegian-women-football-1462830"
[32] "https://www.fifa.com/womensworldcup/news/ellen-wille-mother-norwegian-women-football-1462830"
[33] "#cite_ref-10"
[34] "http://www.fifamuseum.com/stories/blog/a-green-and-gold-shirt-steeped-in-history-500107/"
[35] "https://web.archive.org/web/20190629110059/http://www.fifamuseum.com/stories/blog/a-green-and-gold-shirt-steeped-in-history-500107/"
[36] "#cite_ref-11"
[37] "https://web.archive.org/web/20181213173348/https://www.fifa.com/womensworldcup/news/when-akers-and-usa-got-the-party-started"
[38] "https://www.fifa.com/womensworldcup/news/when-akers-and-usa-got-the-party-started"
[39] "#cite_ref-12"
[40] "https://web.archive.org/web/20190524061602/https://www.fifa.com/womensworldcup/news/fifa-women-world-cup-sweden-1995-501999"
[41] "https://www.fifa.com/womensworldcup/news/fifa-women-world-cup-sweden-1995-501999"
[42] "#cite_ref-13"
[43] "http://www.sportsnetwork.com/default.asp?c=sportsnetwork&page=SOC-WWC/STAT/WWC-HISTORY.htm"
[44] "/wiki/Wikipedia:Link_rot"
[45] "#cite_ref-14"
[46] "https://www.usatoday.com/sports/soccer/world/2003-05-03-womens-cup-sars_x.htm"
[47] "https://web.archive.org/web/20090212124055/https://www.usatoday.com/sports/soccer/world/2003-05-03-womens-cup-sars_x.htm"
[48] "#cite_ref-15"
[49] "https://www.cbc.ca/sports/soccer/canada-gets-2015-women-s-world-cup-of-soccer-1.988843"
[50] "/wiki/Canadian_Broadcasting_Company"
[51] "https://web.archive.org/web/20110304143611/http://www.cbc.ca/sports/soccer/story/2011/03/03/sp-womens-world-cup.html"
[52] "#cite_ref-16"
[53] "https://web.archive.org/web/20160821094856/http://uk.reuters.com/article/uk-soccer-women-japan-idUKKBN0NM3D120150501"
[54] "http://uk.reuters.com/article/uk-soccer-women-japan-idUKKBN0NM3D120150501"
[55] "#cite_ref-17"
[56] "http://www.huffingtonpost.com/2015/06/17/christie-rampone-oldest-player_n_7599882.html"
[57] "https://web.archive.org/web/20150617061147/http://www.huffingtonpost.com/2015/06/17/christie-rampone-oldest-player_n_7599882.html"
[58] "#cite_ref-18"
[59] "https://web.archive.org/web/20150320194817/http://www.fifa.com/womensworldcup/news/y=2015/m=3/news=france-to-host-the-fifa-women-s-world-cup-in-2019-2567761.html"
[60] "https://www.fifa.com/womensworldcup/news/y=2015/m=3/news=france-to-host-the-fifa-women-s-world-cup-in-2019-2567761.html"
[61] "#cite_ref-19"
[62] "https://www.fifamuseum.com/en/blog-stories/blog/the-fifa-women-s-world-cuptm-original-trophy-is-back-at-the-fifa-museu-2620371/"
[63] "#cite_ref-20"
[64] "https://web.archive.org/web/20170319113103/http://www.fifa.com/marketinghighlights/canada2015/marketing-higlights/the-brand/the-official-womens-world-cup-trophy.html"
[65] "https://www.fifa.com/marketinghighlights/canada2015/marketing-higlights/the-brand/the-official-womens-world-cup-trophy.html"
[66] "#cite_ref-21"
[67] "https://thejewelerblog.wordpress.com/2015/07/06/womens-world-cup-trophy-is-made-of-gold-glad-sterling-silver-mens-version-is-18-karat-gold/"
[68] "https://web.archive.org/web/20181013211741/https://thejewelerblog.wordpress.com/2015/07/06/womens-world-cup-trophy-is-made-of-gold-glad-sterling-silver-mens-version-is-18-karat-gold/"
[69] "#cite_ref-22"
[70] "https://web.archive.org/web/20191222125210/https://www.fifa.com/clubworldcup/news/fifa-world-champions-badge-honours-real-madrid-s-impeccable-year-2494968"
[71] "/wiki/FIFA"
[72] "https://www.fifa.com/clubworldcup/news/fifa-world-champions-badge-honours-real-madrid-s-impeccable-year-2494968"
[73] "#cite_ref-23"
[74] "https://digitalhub.fifa.com/m/1816849eda4db6/original/jaeq2lvmczqjofxccj3u-pdf.pdf"
[75] "https://web.archive.org/web/20210831030705/https://digitalhub.fifa.com/m/1816849eda4db6/original/jaeq2lvmczqjofxccj3u-pdf.pdf"
[76] "#cite_ref-2015_keyfacts_24-0"
[77] "#cite_ref-2015_keyfacts_24-1"
[78] "https://web.archive.org/web/20150711210112/http://www.fifa.com/womensworldcup/news/y=2015/m=7/news=key-figures-from-the-fifa-women-s-world-cup-canada-2015tm-2661648.html"
[79] "https://www.fifa.com/womensworldcup/news/y=2015/m=7/news=key-figures-from-the-fifa-women-s-world-cup-canada-2015tm-2661648.html"
[80] "#cite_ref-25"
[81] "https://www.nytimes.com/2003/05/27/sports/soccer-us-replaces-china-as-host-of-soccer-s-women-s-world-cup.html"
[82] "/wiki/The_New_York_Times"
[83] "https://web.archive.org/web/20180908054348/https://www.nytimes.com/2003/05/27/sports/soccer-us-replaces-china-as-host-of-soccer-s-women-s-world-cup.html"
[84] "#cite_ref-26"
[85] "https://web.archive.org/web/20020228035519/http://sportsillustrated.cnn.com/soccer/world/1999/womens_worldcup/news/1999/07/10/brazil_norway/"
[86] "http://sportsillustrated.cnn.com/soccer/world/1999/womens_worldcup/news/1999/07/10/brazil_norway/"
[87] "#cite_ref-ussf_070815_29-0"
[88] "#cite_ref-ussf_070815_29-1"
[89] "https://web.archive.org/web/20150709161003/http://www.ussoccer.com/stories/2015/07/08/16/59/150708-wnt-victory-breaks-tv-records"
[90] "http://www.ussoccer.com/stories/2015/07/08/16/59/150708-wnt-victory-breaks-tv-records"
[91] "#cite_ref-30"
[92] "https://www.sbnation.com/2015/7/6/8900299/more-americans-watched-the-womens-world-cup-final-than-the-nba-finals"
[93] "https://web.archive.org/web/20150707222336/https://www.sbnation.com/2015/7/6/8900299/more-americans-watched-the-womens-world-cup-final-than-the-nba-finals"
[94] "#cite_ref-31"
[95] "https://web.archive.org/web/20151218034246/http://www.fifa.com/womensworldcup/news/y=2015/m=12/news=record-breaking-fifa-women-s-world-cup-tops-750-million-tv-viewers-2745963.html"
[96] "https://www.fifa.com/womensworldcup/news/y=2015/m=12/news=record-breaking-fifa-women-s-world-cup-tops-750-million-tv-viewers-2745963.html"
[97] "#cite_ref-32"
[98] "https://www.nbcsports.com/northwest/world-cup/equal-pay-womens-world-cup-players-seriously"
[99] "https://web.archive.org/web/20230124175112/https://www.nbcsports.com/northwest/world-cup/equal-pay-womens-world-cup-players-seriously"
[100] "#cite_ref-33"
[101] "https://pluralist.com/uswnt-equal-pay-controversy/"
[102] "https://web.archive.org/web/20190703120647/https://pluralist.com/uswnt-equal-pay-controversy/"
[103] "#cite_ref-34"
[104] "https://www.nytimes.com/2018/06/12/sports/fifa-revenue.html"
[105] "https://ghostarchive.org/archive/20220102/https://www.nytimes.com/2018/06/12/sports/fifa-revenue.html"
[106] "#cite_ref-35"
[107] "https://boxscorenews.com/fifa-calls-womens-world-cup-broadcast-offers-disappointing-p168538-199.htm"
[108] "https://web.archive.org/web/20230518034444/https://boxscorenews.com/fifa-calls-womens-world-cup-broadcast-offers-disappointing-p168538-199.htm"
[109] "#cite_ref-ath-2023-broadcast_36-0"
[110] "#cite_ref-ath-2023-broadcast_36-1"
[111] "https://theathletic.com/4575427/2023/06/04/womens-world-cup-commercial-tv-rights/?source=twitterhq"
[112] "/wiki/The_Athletic"
[113] "#cite_ref-37"
[114] "https://digitalhub.fifa.com/m/6bd2fa3c769ee09c/original/lnpeuvaoc1v5tih9rf7p-pdf.pdf"
[115] "https://web.archive.org/web/20210727190318/https://digitalhub.fifa.com/m/6bd2fa3c769ee09c/original/lnpeuvaoc1v5tih9rf7p-pdf.pdf"
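The list above mixes external URLs with in-page anchors (starting with "#") and internal Wikipedia paths (starting with "/wiki/"). A minimal sketch of keeping only the external links, assuming that "external" means an href starting with http:// or https://, is shown below on a few values copied from that output.

```r
# A few hrefs copied from the output above
hrefs <- c(
  "#cite_ref-1",
  "https://www.rsssf.org/tablesm/mondo-women70.html",
  "/wiki/RSSSF",
  "https://www.bbc.co.uk/news/business-46149887"
)

# External links start with http:// or https://
external <- hrefs[grepl("^https?://", hrefs)]
external
```

The same grepl() filter can be added as an extra step after unique() in the pipeline.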
Part 4: Extract links from images
Here we will extract the URLs for images on the page, particularly images related to tournament trophies or logos.
- Use html_elements() to select <img> tags.
- Use html_attr() to get the image source (src) URLs.
- Filter for images that contain keywords like “FIFA”, “World Cup”, or “Logo” in their src using the grepl() function.
page %>%
  html_elements("img") %>%
  html_attr("src") %>%
  tibble(url = .) %>%
  filter(grepl("FIFA|World_Cup|Logo", url))
# A tibble: 5 × 1
url
<chr>
1 //upload.wikimedia.org/wikipedia/commons/thumb/a/aa/FIFA_logo_without_slogan.…
2 //upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Womens_World_Cup_countrie…
3 //upload.wikimedia.org/wikipedia/commons/thumb/a/aa/FIFA_logo_without_slogan.…
4 //upload.wikimedia.org/wikipedia/commons/thumb/2/26/World_Map_FIFA.svg/100px-…
5 //upload.wikimedia.org/wikipedia/commons/thumb/2/26/World_Map_FIFA.svg/200px-…
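Two details of this output are worth noting: the pattern is case-sensitive, and the src values are protocol-relative (they start with "//"). The sketch below shows one way to handle both in base R. The src values are hypothetical stand-ins in the same shape as the output above, and prepending "https://" is an assumption about how the URLs will be used.

```r
# Hypothetical src values in the same shape as the output above
src <- c(
  "//upload.wikimedia.org/example/FIFA_emblem.png",
  "//upload.wikimedia.org/example/tournament_logo.png",
  "//upload.wikimedia.org/example/photo.jpg"
)

# ignore.case = TRUE lets the keyword "Logo" also match "logo"
keep <- grepl("FIFA|World_Cup|Logo", src, ignore.case = TRUE)
logo_src <- src[keep]

# Protocol-relative URLs begin with "//"; prepend a scheme so they work on their own
logo_urls <- sub("^//", "https://", logo_src)
logo_urls
```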
Post-lecture
Other good R packages to know about
- googlesheets4 to interact with Google Sheets in R
- googledrive to interact with files on your Google Drive
Additional practice
Here are some additional practice questions to help you think about the material discussed.
- Using the GitHub API, access your repository information and find out how many open GitHub issues you have.
- Pick another API that we have not discussed here and use httr to retrieve data from it.
- Look up how many open issues there are in the dplyr package in the tidyverse.
- Practice requesting data from the openFDA API, which returns JSON files. This API aims to provide easy access to public data, to create a new level of openness and accountability, to ensure the privacy and security of public FDA data, and ultimately to educate the public and save lives. See data definitions for all included data.
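As a possible starting point for the dplyr question, here is a minimal sketch using httr2 and jsonlite from the pre-lecture packages. It assumes the repository lives at tidyverse/dplyr on GitHub and that the repository endpoint of the GitHub REST API reports an open_issues_count field, which also counts open pull requests. The request is built but not sent, and the JSON string is a made-up stand-in with a placeholder value, not a real response.

```r
library(httr2)
library(jsonlite)

# Build (but do not send) a request for the dplyr repository metadata
req <- request("https://api.github.com/repos/tidyverse/dplyr")

# To send it (requires internet access), uncomment:
# repo <- resp_body_json(req_perform(req))
# repo$open_issues_count

# Offline stand-in: a made-up response body with the same field name.
# The value 0 is a placeholder, not the real count.
fake_body <- '{"name": "dplyr", "open_issues_count": 0}'
repo <- fromJSON(fake_body)
repo$open_issues_count
```

Unauthenticated requests to the GitHub API are rate limited, so it helps to save the response to an object and reuse it while exploring.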