Pre-lecture materials
Read ahead
Acknowledgements
Material for this lecture was borrowed and adapted from
Install new packages
Before we begin, you will need to install these packages
install.packages("jsonlite")
install.packages("httr")
Now we load a few R packages
library(tidyverse)
library(jsonlite)
library(httr)
Learning objectives
Motivation
Today, we are going to talk about getting data from APIs and examples of common data formats.
First, let’s have a bit of a philosophical discussion about data.
“Raw” vs “Clean” data
As data analysts, this is what we wish data looked like whenever we start a project.
However, the reality is that data is rarely in that form. It comes in all types of “raw” formats that need to be transformed into a “clean” format.
For example, in the field of genomics, raw data looks something like this:
Or if you are interested in analyzing data from Twitter:
Or data from Electronic Healthcare Records (EHRs):
We all have our scary spreadsheet tales. Here is Jenny Bryan from Posit and UBC actually asking for some of those spreadsheet tales on twitter.
For example, this is an actual spreadsheet from Enron in 2001:
What do we mean by “raw” data?
From https://simplystatistics.org/posts/2016-07-20-relativity-raw-data/ raw data is defined as data…
…if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the rawest version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.
Where do data live?
Data lives anywhere and everywhere. Data might be stored simply in a .csv or .txt file. Data might be stored in an Excel or Google Spreadsheet. Data might be stored in large databases that require users to write special functions to interact with them to extract the data they are interested in.
For example, you may have heard of the terms MySQL or MongoDB.
From Wikipedia, MySQL is defined as an open-source relational database management system (RDBMS). Its name is a combination of “My”, the name of co-founder Michael Widenius’s daughter, and “SQL”, the abbreviation for Structured Query Language.
From Wikipedia, MongoDB is defined as “a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemata.”
So after reading that, we get the sense that there are multiple ways large databases can be structured, data can be formatted and interacted with. In addition, we see that database programs (e.g. MySQL and MongoDB) can also interact with each other.
We will learn more about JSON today, and we will learn about SQL more formally in a later lecture.
Best practices on sharing data
A great article in PeerJ was written titled How to share data for collaboration, in which the authors describe a set of guidelines for sharing data:
We highlight the need to provide raw data to the statistician, the importance of consistent formatting, and the necessity of including all essential experimental information and pre-processing steps carried out to the statistician. With these guidelines we hope to avoid errors and delays in data analysis.
It’s a great paper that describes the information you should pass to a statistician to facilitate the most efficient and timely analysis.
Specifically:
- The raw data (or the rawest form of the data to which you have access)
- You should not have modified, removed, or summarized any data, or run any software on the data
- e.g. strange binary file your measurement machine spits out
- e.g. complicated JSON file you scraped from the Twitter Application Programming Interface (API)
- e.g. hand-entered numbers you collected looking through a microscope
- A clean data set
- This may or may not involve transforming data into a tidy dataset, but possibly yes
- A code book describing each variable and its values in the clean or tidy data set.
- More detailed information about the measurements in the data set (e.g. units, experimental design, summary choices made)
- Doesn’t quite fit into the column names in the spreadsheet
- Often reported in a .md, .txt, or Word file.
- An explicit and exact recipe you used to go from 1 -> 2,3
Getting data
JSON files
JSON (or JavaScript Object Notation) is a file format that stores information in a human-readable, organized, and easy-to-access manner.
For example, here is what a JSON file looks like:
var stephanie = {
"age" : "33",
"hometown" : "Baltimore, MD",
"gender" : "female",
"cars" : {
"car1" : "Hyundai Elantra",
"car2" : "Toyota Rav4",
"car3" : "Honda CR-V"
} }
Some features of JSON objects:
- JSON objects are surrounded by curly braces {}
- JSON objects are written in key/value pairs
- Keys must be strings, and values must be a valid JSON data type (string, number, object, array, boolean)
- Keys and values are separated by a colon
- Each key/value pair is separated by a comma
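These rules mean a JSON object maps naturally onto an R named list. As a small sketch (assuming the jsonlite package is installed), we can parse the JSON object above, minus the JavaScript `var stephanie =` wrapper, which is not itself valid JSON:

```r
library(jsonlite)

# The JSON object from above, as a plain JSON string (the JavaScript
# variable assignment is dropped because it is not valid JSON)
json_txt <- '{
  "age" : "33",
  "hometown" : "Baltimore, MD",
  "gender" : "female",
  "cars" : {
    "car1" : "Hyundai Elantra",
    "car2" : "Toyota Rav4",
    "car3" : "Honda CR-V"
  }
}'

# fromJSON() converts key/value pairs into a named list
stephanie <- fromJSON(json_txt)
stephanie$hometown     # "Baltimore, MD"
stephanie$cars$car2    # "Toyota Rav4"
```

Nested JSON objects become nested lists, so values can be pulled out with the usual `$` syntax.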
Overview of APIs
From AWS, API stands for Application Programming Interface.
- “Application” = any software with a distinct function
- “Interface” = a contract of service between two applications. This contract defines how the two communicate with each other using requests and responses.
The API documentation contains information on how developers are to structure those requests and responses.
How do APIs work?
To understand how APIs work, two terms that are important are
- client. This is the application sending the request.
- server. This is the application sending the response.
For example, in the case of a weather app on your phone, the weather bureau’s weather database is the server, and the mobile app is the client.
Four types of API architectures
There are four different ways that APIs can work depending on when and why they were created.
SOAP APIs. These APIs use Simple Object Access Protocol. Client and server exchange messages using XML. This is a less flexible API that was more popular in the past.
RPC APIs. These APIs are called Remote Procedure Calls. The client completes a function (or procedure) on the server, and the server sends the output back to the client.
Websocket APIs. The WebSocket API is another modern web API development that uses JSON objects to pass data. A WebSocket API supports two-way communication between client apps and the server. The server can send callback messages to connected clients, making it more efficient than a REST API.
REST APIs. REST stands for Representational State Transfer (and are the most popular and flexible APIs). The client sends requests to the server as data. The server uses this client input to start internal functions and returns output data back to the client. REST defines a set of functions like GET, PUT, DELETE, etc. that clients can use to access server data. Clients and servers exchange data using HTTP.
The main feature of a REST API is statelessness (i.e. servers do not save client data between requests). Client requests to the server are similar to URLs you type in your browser to visit a website. The response from the server is plain data, without the typical graphical rendering of a web page.
How to use an API?
The basic steps to using an API are:
- Obtain an API key. This is done by creating a verified account with the API provider.
- Set up an HTTP API client. This tool allows you to structure API requests easily using the API key received. Here, we will use the GET() function from the httr package.
- If you don’t have an API client, you can try to structure the request yourself in your browser by referring to the API documentation.
- Once you are comfortable with the new API syntax, you can start using it in your code.
Where can I find new APIs?
New web APIs can be found on API marketplaces and API directories, such as:
- Rapid API – One of the largest global API markets (10k+ public APIs). Users can test APIs directly on the platform before committing to a purchase.
- Public REST APIs – Groups REST APIs into categories, making it easier to browse and find the right one to meet your needs.
- APIForThat and APIList – Both these websites have lists of 500+ web APIs, along with in-depth information on how to use them.
GitHub API
The GitHub REST API may be of interest when studying online communities, working methods, organizational structures, communication and discussions, etc. with a focus on (open-source) software development.
Many projects that are hosted on GitHub are open-source projects with a transparent development process and communications. For private projects, which can also be hosted on GitHub, understandably only a few aggregate data points are available.
Let’s say we want to use the GitHub REST API to find out how many of my GitHub repositories have open issues.
Access the API from R
There are packages for many programming languages that provide convenient access for communicating with the GitHub API, but there are no such packages (that I’m aware of) for accessing the API from R.
This means we can only access the API directly, e.g. by using the jsonlite package to fetch the data and convert it to an R list or data.frame.
Specifically, we will use the jsonlite::fromJSON() function to convert from a JSON object to a data frame.
The JSON file is located at https://api.github.com/users/stephaniehicks/repos
github_url <- "https://api.github.com/users/stephaniehicks/repos"

library(jsonlite)
library(tidyverse)

jsonData <- as_tibble(fromJSON(github_url))
glimpse(jsonData)
Rows: 30
Columns: 79
$ id <int> 160194123, 132884754, 225501707, 63822882,…
$ node_id <chr> "MDEwOlJlcG9zaXRvcnkxNjAxOTQxMjM=", "MDEwO…
$ name <chr> "2018-bioinfosummer-scrnaseq", "advdatasci…
$ full_name <chr> "stephaniehicks/2018-bioinfosummer-scrnase…
$ private <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ owner <df[,18]> <data.frame[26 x 18]>
$ html_url <chr> "https://github.com/stephaniehicks/201…
$ description <chr> NA, NA, "A curated list of bioinformatics …
$ fork <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FAL…
$ url <chr> "https://api.github.com/repos/stephaniehic…
$ forks_url <chr> "https://api.github.com/repos/stephaniehic…
$ keys_url <chr> "https://api.github.com/repos/stephaniehic…
$ collaborators_url <chr> "https://api.github.com/repos/stephaniehic…
$ teams_url <chr> "https://api.github.com/repos/stephaniehic…
$ hooks_url <chr> "https://api.github.com/repos/stephaniehic…
$ issue_events_url <chr> "https://api.github.com/repos/stephaniehic…
$ events_url <chr> "https://api.github.com/repos/stephaniehic…
$ assignees_url <chr> "https://api.github.com/repos/stephaniehic…
$ branches_url <chr> "https://api.github.com/repos/stephaniehic…
$ tags_url <chr> "https://api.github.com/repos/stephaniehic…
$ blobs_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_tags_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_refs_url <chr> "https://api.github.com/repos/stephaniehic…
$ trees_url <chr> "https://api.github.com/repos/stephaniehic…
$ statuses_url <chr> "https://api.github.com/repos/stephaniehic…
$ languages_url <chr> "https://api.github.com/repos/stephaniehic…
$ stargazers_url <chr> "https://api.github.com/repos/stephaniehic…
$ contributors_url <chr> "https://api.github.com/repos/stephaniehic…
$ subscribers_url <chr> "https://api.github.com/repos/stephaniehic…
$ subscription_url <chr> "https://api.github.com/repos/stephaniehic…
$ commits_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_commits_url <chr> "https://api.github.com/repos/stephaniehic…
$ comments_url <chr> "https://api.github.com/repos/stephaniehic…
$ issue_comment_url <chr> "https://api.github.com/repos/stephaniehic…
$ contents_url <chr> "https://api.github.com/repos/stephaniehic…
$ compare_url <chr> "https://api.github.com/repos/stephaniehic…
$ merges_url <chr> "https://api.github.com/repos/stephaniehic…
$ archive_url <chr> "https://api.github.com/repos/stephaniehic…
$ downloads_url <chr> "https://api.github.com/repos/stephaniehic…
$ issues_url <chr> "https://api.github.com/repos/stephaniehic…
$ pulls_url <chr> "https://api.github.com/repos/stephaniehic…
$ milestones_url <chr> "https://api.github.com/repos/stephaniehic…
$ notifications_url <chr> "https://api.github.com/repos/stephaniehic…
$ labels_url <chr> "https://api.github.com/repos/stephaniehic…
$ releases_url <chr> "https://api.github.com/repos/stephaniehic…
$ deployments_url <chr> "https://api.github.com/repos/stephaniehic…
$ created_at <chr> "2018-12-03T13:20:45Z", "2018-05-10T10:22:…
$ updated_at <chr> "2019-08-08T02:18:17Z", "2018-05-10T10:22:…
$ pushed_at <chr> "2018-12-05T17:07:09Z", "2017-12-18T17:18:…
$ git_url <chr> "git://github.com/stephaniehicks/2018-bioi…
$ ssh_url <chr> "git@github.com:stephaniehicks/2018-bioinf…
$ clone_url <chr> "https://github.com/stephaniehicks/2018-bi…
$ svn_url <chr> "https://github.com/stephaniehicks/2018-bi…
$ homepage <chr> NA, NA, NA, NA, "", NA, NA, NA, NA, NA, NA…
$ size <int> 60296, 172353, 121, 675, 26688, 20, 92401,…
$ stargazers_count <int> 4, 0, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0, 0,…
$ watchers_count <int> 4, 0, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0, 0,…
$ language <chr> "TeX", "HTML", NA, NA, "R", "R", "Jupyter …
$ has_issues <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRU…
$ has_projects <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_downloads <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_wiki <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_pages <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ has_discussions <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ forks_count <int> 4, 0, 0, 1, 1, 0, 2, 0, 0, 1, 4, 1, 1, 0, …
$ mirror_url <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ archived <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ open_issues_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ license <df[,5]> <data.frame[26 x 5]>
$ allow_forking <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ is_template <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ web_commit_signoff_required <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ topics <list> <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>…
$ visibility <chr> "public", "public", "public", "public", "p…
$ forks <int> 4, 0, 0, 1, 1, 0, 2, 0, 0, 1, 4, 1, 1, 0,…
$ open_issues <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ watchers <int> 4, 0, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0, 0,…
$ default_branch <chr> "master", "master", "master", "master", "m…
The fromJSON() function has now converted the JSON file into a data frame.
However, from here, we see that there are only 30 rows (or 30 repositories). If you look on my github page, you can see there are more than 30 repositories.
I have 85 public repositories as of today!
Solution: You should explicitly specify in your request how many items you would like to receive from the server’s pagination engine, using the formula for the GitHub pagination API:
?page=1&per_page=<numberOfItemsYouSpecify>
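For example, to request up to 100 repositories in a single page, we could append those query parameters to the repos URL with base R string functions (a sketch; per_page can be at most 100 on GitHub):

```r
github_url <- "https://api.github.com/users/stephaniehicks/repos"
per_page   <- 100   # GitHub returns at most 100 items per page
page       <- 1

# paste0() glues the pieces into the paginated request URL
paged_url <- paste0(github_url, "?page=", page, "&per_page=", per_page)
paged_url
# "https://api.github.com/users/stephaniehicks/repos?page=1&per_page=100"
```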
You can read more about pagination here:
Using API keys
Authenticating with the GitHub API via an API key allows you to send many more requests to the API.
API access keys for the GitHub API are called personal access tokens (PAT) and the documentation explains how to generate a PAT once you have logged into your GitHub account.
Assuming you have created and stored an API key in the .Renviron
file in your home directory, you can fetch it with the Sys.getenv()
function.
github_key <- Sys.getenv("GITHUB_API_KEY")
We will use this in a little bit.
Access the API with httr and GET()
There is a set of basic HTTP verbs that allow you to access a set of endpoints.
The basic request patterns are:
- Retrieve a single item (GET)
- Retrieve a list of items (GET)
- Create an item (POST)
- Update an item (PUT)
- Delete an item (DELETE)
Here, we will use the GET() function from the httr package (i.e. tools to work with URLs and HTTP) to retrieve a single JSON file.
We will also make this an authenticated HTTP request to the GitHub API using authenticate() from the httr package.
Next we extract the contents from the raw JSON output using the content() function from the httr package. If you use the argument as = 'text', it extracts the contents as a character vector.
response <- GET('https://api.github.com/users/stephaniehicks',
                authenticate('stephaniehicks', github_key))

account_details <- fromJSON(httr::content(response, as = 'text'))
account_details
$login
[1] "stephaniehicks"
$id
[1] 1452065
$node_id
[1] "MDQ6VXNlcjE0NTIwNjU="
$avatar_url
[1] "https://avatars.githubusercontent.com/u/1452065?v=4"
$gravatar_id
[1] ""
$url
[1] "https://api.github.com/users/stephaniehicks"
$html_url
[1] "https://github.com/stephaniehicks"
$followers_url
[1] "https://api.github.com/users/stephaniehicks/followers"
$following_url
[1] "https://api.github.com/users/stephaniehicks/following{/other_user}"
$gists_url
[1] "https://api.github.com/users/stephaniehicks/gists{/gist_id}"
$starred_url
[1] "https://api.github.com/users/stephaniehicks/starred{/owner}{/repo}"
$subscriptions_url
[1] "https://api.github.com/users/stephaniehicks/subscriptions"
$organizations_url
[1] "https://api.github.com/users/stephaniehicks/orgs"
$repos_url
[1] "https://api.github.com/users/stephaniehicks/repos"
$events_url
[1] "https://api.github.com/users/stephaniehicks/events{/privacy}"
$received_events_url
[1] "https://api.github.com/users/stephaniehicks/received_events"
$type
[1] "User"
$site_admin
[1] FALSE
$name
[1] "Stephanie Hicks"
$company
[1] "Johns Hopkins"
$blog
[1] "http://www.stephaniehicks.com"
$location
[1] "Baltimore, MD"
$email
NULL
$hireable
NULL
$bio
[1] "Associate Prof at Johns Hopkins Biostatistics"
$twitter_username
[1] "stephaniehicks"
$public_repos
[1] 85
$public_gists
[1] 8
$followers
[1] 233
$following
[1] 16
$created_at
[1] "2012-02-19T21:18:27Z"
$updated_at
[1] "2022-11-29T18:35:36Z"
Next, let’s perform the same request we did above about my 85 repositories, but instead of reading in the JSON file from the web, we use an authenticated GET()
response:
response <- GET('https://api.github.com/users/stephaniehicks/repos?page=1&per_page=1000',
                authenticate('stephaniehicks', github_key))

repo_details <- as_tibble(fromJSON(httr::content(response, as = 'text')))
repo_details
# A tibble: 85 × 80
id node_id name full_…¹ private owner…² html_…³ descr…⁴ fork url
<int> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <lgl> <chr>
1 160194123 MDEwOlJl… 2018… stepha… FALSE stepha… https:… <NA> FALSE http…
2 132884754 MDEwOlJl… advd… stepha… FALSE stepha… https:… <NA> TRUE http…
3 225501707 MDEwOlJl… Awes… stepha… FALSE stepha… https:… A cura… TRUE http…
4 63822882 MDEwOlJl… awes… stepha… FALSE stepha… https:… List o… TRUE http…
5 16586187 MDEwOlJl… Back… stepha… FALSE stepha… https:… Gene e… FALSE http…
6 287533539 MDEwOlJl… benc… stepha… FALSE stepha… https:… <NA> FALSE http…
7 168789632 MDEwOlJl… benc… stepha… FALSE stepha… https:… Benchm… FALSE http…
8 140320609 MDEwOlJl… benc… stepha… FALSE stepha… https:… Reposi… FALSE http…
9 178318764 MDEwOlJl… benc… stepha… FALSE stepha… https:… Data a… FALSE http…
10 313126225 MDEwOlJl… bioc… stepha… FALSE stepha… https:… <NA> FALSE http…
# … with 75 more rows, 87 more variables: owner$id <int>, $node_id <chr>,
# $avatar_url <chr>, $gravatar_id <chr>, $url <chr>, $html_url <chr>,
# $followers_url <chr>, $following_url <chr>, $gists_url <chr>,
# $starred_url <chr>, $subscriptions_url <chr>, $organizations_url <chr>,
# $repos_url <chr>, $events_url <chr>, $received_events_url <chr>,
# $type <chr>, $site_admin <lgl>, forks_url <chr>, keys_url <chr>,
# collaborators_url <chr>, teams_url <chr>, hooks_url <chr>, …
A bit of EDA fun
Let’s have a bit of fun and explore some questions:
- How many are private repos?
- How many have forks, and how many forks do they have?
table(repo_details$private)
FALSE
85
table(repo_details$forks)
0 1 2 3 4 5 6 8 11
56 11 3 2 7 1 2 2 1
What’s the most popular language?
table(repo_details$language)
HTML JavaScript Jupyter Notebook Makefile
20 4 1 1
Perl R Ruby Shell
1 28 3 2
TeX
5
To find out how many of my repos have open issues, we can just create a table:
# how many repos have open issues?
table(repo_details$open_issues_count)
0 1 2 3
79 3 2 1
Whew! Not as many as I thought.
Other examples with GitHub API
Finally, I will leave you with a few other examples of using GitHub API:
COVID Act Now API
Next, we will demonstrate how to request data from the API at COVID Act Now, which returns CSV or JSON files.
This API provides access to COVID data tracking US states, counties, and metros, including data and metrics for cases, vaccinations, tests, hospitalizations, and deaths. See data definitions for all included data.
Register for an API Key
First, you need to register for an API key here
You should also store the API key in your .Renviron
like above for the GitHub API key.
Building the URL for GET
First, we will request time series COVID data for one county in the US (Baltimore City), defined by its FIPS code (24510).
The URL we want is the following
https://api.covidactnow.org/v2/county/24510.timeseries.json?apiKey=<your_API_key_here>
Let’s build up the URL.
- The first part is the base URL: https://api.covidactnow.org/v2/. This part of the URL will be the same for all our calls to this API.
- The county/ portion of the URL indicates that we only want COVID data for a single county. By looking at the COVID Act Now API documentation, I can see that states is an alternative option for this part of the URL.
- 24510 is the unique identifier for Baltimore City. If I want to get the same data but for a different county, I just have to change this number.
- .timeseries provides the API with more information about the data I am requesting, and .json tells the API to format the data as JSON (which we will convert to a data frame).
- Everything after apiKey= is my authorization token, which tells the COVID Act Now servers that I am allowed to ask for this data.
Now that we have dissected the anatomy of an API request, you can see how easy it is to build one!
Basically anybody with an internet connection, an authorization token, and who knows the grammar of the API can access it. Most APIs are published with extensive documentation to help you understand the available options and parameters.
Calling an API with GET
Let’s join the URL together for one county (later I will show how to loop through multiple counties)
## extract my API key from `.Renviron`
covid_key <- Sys.getenv("COVID_ACT_NOW_API_KEY")

## build the URL
base     <- 'https://api.covidactnow.org/v2/county/'
county   <- '24510'
info_key <- '.timeseries.json?apiKey='

## put it all together
API_URL <- paste0(base, county, info_key, covid_key)
Now we have the entire URL stored in a simple R object called API_URL.
We can now use the URL to call the API, and we will store the returned data in an object called raw_data:
raw_data <- GET(API_URL)
raw_data
Response [https://api.covidactnow.org/v2/county/24510.timeseries.json?apiKey=b228d58133a04e3186e5ce081f73bbf5]
Date: 2022-12-06 18:29
Status: 200
Content-Type: application/json
Size: 1.34 MB
Next, we can inspect the object and we see that it is a list.
str(raw_data)
List of 10
$ url : chr "https://api.covidactnow.org/v2/county/24510.timeseries.json?apiKey=b228d58133a04e3186e5ce081f73bbf5"
$ status_code: int 200
$ headers :List of 16
..$ content-type : chr "application/json"
..$ x-amz-id-2 : chr "1Zbs9YrvpaiBiFMTsb0imkgtLHOELu8G0JopXchPmJIwWN/qJoo6W3zgvweFYN1fU0VU1NY2tCicyiSxMuq/hw=="
..$ x-amz-request-id : chr "HAJVVPHDJ8D03R8K"
..$ date : chr "Tue, 06 Dec 2022 18:29:11 GMT"
..$ access-control-allow-origin : chr "*"
..$ access-control-allow-methods: chr "GET"
..$ last-modified : chr "Tue, 06 Dec 2022 15:23:54 GMT"
..$ etag : chr "W/\"10d0f86efbf0c1846bf70a19394e8d29\""
..$ x-amz-version-id : chr "6dSBQTzjCLxCzOiu3TC3P51hA3gYmcTs"
..$ server : chr "AmazonS3"
..$ content-encoding : chr "gzip"
..$ vary : chr "Accept-Encoding"
..$ x-cache : chr "Miss from cloudfront"
..$ via : chr "1.1 c7705692ed008dad7e46e32f966aa3fe.cloudfront.net (CloudFront)"
..$ x-amz-cf-pop : chr "JFK50-P8"
..$ x-amz-cf-id : chr "0E0pZr1-jowcMvOTp00B64tKiNeVNfh1_RSuB_ZGBra-XjH47tNBmw=="
..- attr(*, "class")= chr [1:2] "insensitive" "list"
$ all_headers:List of 1
..$ :List of 3
.. ..$ status : int 200
.. ..$ version: chr "HTTP/2"
.. ..$ headers:List of 16
.. .. ..$ content-type : chr "application/json"
.. .. ..$ x-amz-id-2 : chr "1Zbs9YrvpaiBiFMTsb0imkgtLHOELu8G0JopXchPmJIwWN/qJoo6W3zgvweFYN1fU0VU1NY2tCicyiSxMuq/hw=="
.. .. ..$ x-amz-request-id : chr "HAJVVPHDJ8D03R8K"
.. .. ..$ date : chr "Tue, 06 Dec 2022 18:29:11 GMT"
.. .. ..$ access-control-allow-origin : chr "*"
.. .. ..$ access-control-allow-methods: chr "GET"
.. .. ..$ last-modified : chr "Tue, 06 Dec 2022 15:23:54 GMT"
.. .. ..$ etag : chr "W/\"10d0f86efbf0c1846bf70a19394e8d29\""
.. .. ..$ x-amz-version-id : chr "6dSBQTzjCLxCzOiu3TC3P51hA3gYmcTs"
.. .. ..$ server : chr "AmazonS3"
.. .. ..$ content-encoding : chr "gzip"
.. .. ..$ vary : chr "Accept-Encoding"
.. .. ..$ x-cache : chr "Miss from cloudfront"
.. .. ..$ via : chr "1.1 c7705692ed008dad7e46e32f966aa3fe.cloudfront.net (CloudFront)"
.. .. ..$ x-amz-cf-pop : chr "JFK50-P8"
.. .. ..$ x-amz-cf-id : chr "0E0pZr1-jowcMvOTp00B64tKiNeVNfh1_RSuB_ZGBra-XjH47tNBmw=="
.. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
$ cookies :'data.frame': 0 obs. of 7 variables:
..$ domain : logi(0)
..$ flag : logi(0)
..$ path : logi(0)
..$ secure : logi(0)
..$ expiration: 'POSIXct' num(0)
..$ name : logi(0)
..$ value : logi(0)
$ content : raw [1:1338019] 7b 22 66 69 ...
$ date : POSIXct[1:1], format: "2022-12-06 18:29:11"
$ times : Named num [1:6] 0 0.0353 0.0479 0.0669 0.294 ...
..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
$ request :List of 7
..$ method : chr "GET"
..$ url : chr "https://api.covidactnow.org/v2/county/24510.timeseries.json?apiKey=b228d58133a04e3186e5ce081f73bbf5"
..$ headers : Named chr "application/json, text/xml, application/xml, */*"
.. ..- attr(*, "names")= chr "Accept"
..$ fields : NULL
..$ options :List of 2
.. ..$ useragent: chr "libcurl/7.84.0 r-curl/4.3.3 httr/1.4.4"
.. ..$ httpget : logi TRUE
..$ auth_token: NULL
..$ output : list()
.. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
..- attr(*, "class")= chr "request"
$ handle :Class 'curl_handle' <externalptr>
- attr(*, "class")= chr "response"
One of the elements is content, and we can inspect it:
str(raw_data$content)
raw [1:1338019] 7b 22 66 69 ...
We see the actual data have been stored as raw vectors (i.e. raw bytes), which need to be converted to character vectors. This is not in a usable format yet.
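To see what these raw bytes are on a small scale, here is a toy example (base R only, not the API data itself) round-tripping a few characters through charToRaw() and rawToChar():

```r
# charToRaw() turns a character string into raw bytes;
# rawToChar() reverses the conversion
bytes <- charToRaw('{"fi')
bytes                 # 7b 22 66 69 -- the same bytes that start raw_data$content
rawToChar(bytes)      # '{"fi', the first characters of the JSON payload
```

So the bytes 7b 22 66 69 printed above are simply the opening characters of the JSON text, stored byte by byte.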
Converting JSON to a data.frame
There is a function in base R, rawToChar(), that converts raw bytes to characters:

covid_data <- fromJSON(rawToChar(raw_data$content), flatten = TRUE)
This converts the raw data into a list.
Now that it is in a list format, you can see that it actually contains several data frames!
You can use this data right away if you are already familiar with lists in R, or you can extract the data frames into separate objects, like this:
ts_df <- covid_data$actualsTimeseries
The data frame that we have just created contains many different variables and a lot of information. Below, you can see the first ten rows of a selection of some interesting variables in our data:
head(ts_df[ , c("cases", "deaths", "newCases", "newDeaths", "date")], n=10)
cases deaths newCases newDeaths date
1 NA NA NA NA 2020-03-11
2 NA NA NA NA 2020-03-12
3 NA NA NA NA 2020-03-13
4 NA NA NA NA 2020-03-14
5 1 0 NA NA 2020-03-15
6 1 0 0 0 2020-03-16
7 3 0 2 0 2020-03-17
8 7 0 4 0 2020-03-18
9 7 0 0 0 2020-03-19
10 11 0 4 0 2020-03-20
Looping multiple API calls
Now that we’ve seen how to make an API call for one county, let’s create a simple loop to make several calls at a time. We’ll use a for loop, which you can read more about here.
First, we’ll create a vector with the ID code for each county we want to get data for:
# Baltimore City County, Montgomery County, Baltimore County
counties <- c('24510', '24031', '24005')
We will loop through each element and adjust the API_URL
accordingly.
temp_list <- vector("list", length = 3)
ts_df <- NULL

for (i in 1:length(counties)) {
  # Build the API URL with the new county code
  API_URL <- paste0(base, counties[i], info_key, covid_key)

  # Store the raw and processed API results in temporary objects
  temp_raw  <- GET(API_URL)
  temp_list <- fromJSON(rawToChar(temp_raw$content), flatten = TRUE)

  # Add the most recent results to your data frame
  ts_df <- rbind(ts_df, temp_list$actualsTimeseries)
}
dim(ts_df)
[1] 3007 29
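The accumulation pattern in this loop (start with NULL, then rbind() each new result) is general. Here is a network-free sketch of the same pattern, where fetch_county() is a hypothetical stand-in for the GET() + fromJSON() steps above:

```r
# Hypothetical stand-in for GET() + fromJSON(): returns a one-row
# data frame per county so the accumulation pattern is visible offline
fetch_county <- function(fips) {
  data.frame(fips = fips, cases = NA_integer_)
}

counties <- c('24510', '24031', '24005')

ts_df <- NULL
for (fips in counties) {
  # append each county's rows to the growing data frame
  ts_df <- rbind(ts_df, fetch_county(fips))
}

dim(ts_df)   # 3 rows (one per county), 2 columns
```

Growing a data frame with rbind() inside a loop is fine for a handful of calls; for many counties, collecting results in a list and combining once at the end (e.g. with do.call(rbind, ...)) scales better.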
Post-lecture materials
Other good R packages to know about
- googlesheets4 to interact with Google Sheets in R
- googledrive to interact with files on your Google Drive
Final Questions
Here are some post-lecture questions to help you think about the material discussed.