Pre-lecture materials
Read ahead
Acknowledgements
Material for this lecture was borrowed and adapted from
Install new packages
Before we begin, you will need to install these packages
install.packages("jsonlite")
install.packages("httr")
Now we load a few R packages
library(tidyverse)
library(jsonlite)
library(httr)
Learning objectives
Motivation
Today, we are going to talk about getting data from APIs and examples of common data formats.
First, let’s have a bit of a philosophical discussion about data.
“Raw” vs “Clean” data
As data analysts, this is what we wish data looked like whenever we start a project.
However, the reality is that data is rarely in that form. It comes in all types of “raw” formats that need to be transformed into a “clean” format.
For example, in the field of genomics, raw data looks something like this:
Or if you are interested in analyzing data from Twitter:
Or data from Electronic Healthcare Records (EHRs):
We all have our scary spreadsheet tales. Here is Jenny Bryan from Posit and UBC actually asking for some of those spreadsheet tales on twitter.
For example, this is an actual spreadsheet from Enron in 2001:
What do we mean by “raw” data?
From https://simplystatistics.org/posts/2016-07-20-relativity-raw-data/ raw data is defined as data…
…if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the rawest version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.
Where do data live?
Data lives anywhere and everywhere. Data might be stored simply in a .csv or .txt file. Data might be stored in an Excel or Google Spreadsheet. Data might be stored in large databases that require users to write special functions to interact with them to extract the data they are interested in.
For example, you may have heard of the terms MySQL or MongoDB.
From Wikipedia, MySQL is defined as an open-source relational database management system (RDBMS). Its name is a combination of “My”, the name of co-founder Michael Widenius’s daughter, and “SQL”, the abbreviation for Structured Query Language.
From Wikipedia, MongoDB is defined as “a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemata.”
So after reading that, we get the sense that there are multiple ways large databases can be structured, data can be formatted and interacted with. In addition, we see that database programs (e.g. MySQL and MongoDB) can also interact with each other.
We will learn more about JSON today, and we will learn about SQL more formally in a later lecture.
Best practices on sharing data
A great article in PeerJ was written titled How to share data for collaboration, in which the authors describe a set of guidelines for sharing data:
We highlight the need to provide raw data to the statistician, the importance of consistent formatting, and the necessity of including all essential experimental information and pre-processing steps carried out to the statistician. With these guidelines we hope to avoid errors and delays in data analysis.
It’s a great paper that describes the information you should pass to a statistician to facilitate the most efficient and timely analysis.
Specifically:
- The raw data (or the rawest form of the data to which you have access)
- You should not have modified, removed, or summarized any data, or run any software on the data
- e.g. strange binary file your measurement machine spits out
- e.g. complicated JSON file you scraped from the Twitter Application Programming Interface (API)
- e.g. hand-entered numbers you collected looking through a microscope
- A clean data set
- This may or may not involve transforming data into a tidy dataset, but possibly yes
- A code book describing each variable and its values in the clean or tidy data set.
- More detailed information about the measurements in the data set (e.g. units, experimental design, summary choices made)
- Doesn’t quite fit into the column names in the spreadsheet
- Often reported in a .md, .txt, or Word file.
- An explicit and exact recipe you used to go from 1 -> 2,3
Getting data
JSON files
JSON (or JavaScript Object Notation) is a file format that stores information in a human-readable, organized, and easy-to-access manner.
For example, here is what a JSON file looks like:
var stephanie = {
"age" : "33",
"hometown" : "Baltimore, MD",
"gender" : "female",
"cars" : {
"car1" : "Hyundai Elantra",
"car2" : "Toyota Rav4",
"car3" : "Honda CR-V"
} }
Some features of JSON objects:
- JSON objects are surrounded by curly braces {}
- JSON objects are written in key/value pairs
- Keys must be strings, and values must be a valid JSON data type (string, number, object, array, boolean)
- Keys and values are separated by a colon
- Each key/value pair is separated by a comma
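These rules mean a JSON object maps naturally onto an R named list. As a small sketch (assuming the jsonlite package is installed), we can parse the JSON object above, minus the JavaScript `var stephanie =` wrapper, which is not itself valid JSON:

```r
library(jsonlite)

# The JSON object from above, as a plain JSON string (the JavaScript
# variable assignment is dropped because it is not valid JSON)
json_txt <- '{
  "age" : "33",
  "hometown" : "Baltimore, MD",
  "gender" : "female",
  "cars" : {
    "car1" : "Hyundai Elantra",
    "car2" : "Toyota Rav4",
    "car3" : "Honda CR-V"
  }
}'

# fromJSON() converts key/value pairs into a named list
stephanie <- fromJSON(json_txt)
stephanie$hometown     # "Baltimore, MD"
stephanie$cars$car2    # "Toyota Rav4"
```

Nested JSON objects become nested lists, so values can be pulled out with the usual `$` syntax.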
Overview of APIs
From AWS, API stands for Application Programming Interface.
- “Application” = any software with a distinct function
- “Interface” = a contract of service between two applications. This contract defines how the two communicate with each other using requests and responses.
The API documentation contains information on how developers are to structure those requests and responses.
How do APIs work?
To understand how APIs work, two terms that are important are
- client. This is the application sending the request.
- server. This is the application sending the response.
For example, in the case of a weather app on your phone, the weather bureau’s weather database is the server, and the mobile app is the client.
Four types of API architectures
There are four different ways that APIs can work depending on when and why they were created.
SOAP APIs. These APIs use Simple Object Access Protocol. Client and server exchange messages using XML. This is a less flexible API that was more popular in the past.
RPC APIs. These APIs are called Remote Procedure Calls. The client completes a function (or procedure) on the server, and the server sends the output back to the client.
Websocket APIs. The WebSocket API is another modern web API development that uses JSON objects to pass data. A WebSocket API supports two-way communication between client apps and the server. The server can send callback messages to connected clients, making it more efficient than a REST API.
REST APIs. REST stands for Representational State Transfer (and are the most popular and flexible APIs). The client sends requests to the server as data. The server uses this client input to start internal functions and returns output data back to the client. REST defines a set of functions like GET, PUT, DELETE, etc. that clients can use to access server data. Clients and servers exchange data using HTTP.
The main feature of a REST API is statelessness (i.e. servers do not save client data between requests). Client requests to the server are similar to URLs you type in your browser to visit a website. The response from the server is plain data, without the typical graphical rendering of a web page.
How to use an API?
The basic steps to using an API are:
- Obtain an API key. This is done by creating a verified account with the API provider.
- Set up an HTTP API client. This tool allows you to structure API requests easily using the API key received. Here, we will use the GET() function from the httr package.
- If you don’t have an API client, you can try to structure the request yourself in your browser by referring to the API documentation.
- Once you are comfortable with the new API syntax, you can start using it in your code.
Where can I find new APIs?
New web APIs can be found on API marketplaces and API directories, such as:
- Rapid API – One of the largest global API markets (10k+ public APIs). Users can test APIs directly on the platform before committing to a purchase.
- Public REST APIs – Groups REST APIs into categories, making it easier to browse and find the right one to meet your needs.
- APIForThat and APIList – Both these websites have lists of 500+ web APIs, along with in-depth information on how to use them.
GitHub API
The GitHub REST API may be of interest when studying online communities, working methods, organizational structures, communication and discussions, etc. with a focus on (open-source) software development.
Many projects that are hosted on GitHub are open-source projects with a transparent development process and communications. For private projects, which can also be hosted on GitHub, understandably only a few aggregate data points are available.
Let’s say we want to use the GitHub REST API to find out how many of my GitHub repositories have open issues.
Access the API from R
There are packages for many programming languages that provide convenient access for communicating with the GitHub API, but there are no such packages (that I’m aware of) for accessing the API from R.
This means we can only access the API directly, e.g. by using the jsonlite package to fetch the data and convert it to an R list or data.frame.
Specifically, we will use the jsonlite::fromJSON() function to convert from a JSON object to a data frame.
The JSON file is located at https://api.github.com/users/stephaniehicks/repos
github_url <- "https://api.github.com/users/stephaniehicks/repos"

library(jsonlite)
library(tidyverse)

jsonData <- as_tibble(fromJSON(github_url))
glimpse(jsonData)
Rows: 30
Columns: 79
$ id <int> 160194123, 132884754, 225501707, 63822882,…
$ node_id <chr> "MDEwOlJlcG9zaXRvcnkxNjAxOTQxMjM=", "MDEwO…
$ name <chr> "2018-bioinfosummer-scrnaseq", "advdatasci…
$ full_name <chr> "stephaniehicks/2018-bioinfosummer-scrnase…
$ private <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ owner <df[,18]> <data.frame[26 x 18]>
$ html_url <chr> "https://github.com/stephaniehicks/201…
$ description <chr> NA, NA, "A curated list of bioinformatics …
$ fork <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FAL…
$ url <chr> "https://api.github.com/repos/stephaniehic…
$ forks_url <chr> "https://api.github.com/repos/stephaniehic…
$ keys_url <chr> "https://api.github.com/repos/stephaniehic…
$ collaborators_url <chr> "https://api.github.com/repos/stephaniehic…
$ teams_url <chr> "https://api.github.com/repos/stephaniehic…
$ hooks_url <chr> "https://api.github.com/repos/stephaniehic…
$ issue_events_url <chr> "https://api.github.com/repos/stephaniehic…
$ events_url <chr> "https://api.github.com/repos/stephaniehic…
$ assignees_url <chr> "https://api.github.com/repos/stephaniehic…
$ branches_url <chr> "https://api.github.com/repos/stephaniehic…
$ tags_url <chr> "https://api.github.com/repos/stephaniehic…
$ blobs_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_tags_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_refs_url <chr> "https://api.github.com/repos/stephaniehic…
$ trees_url <chr> "https://api.github.com/repos/stephaniehic…
$ statuses_url <chr> "https://api.github.com/repos/stephaniehic…
$ languages_url <chr> "https://api.github.com/repos/stephaniehic…
$ stargazers_url <chr> "https://api.github.com/repos/stephaniehic…
$ contributors_url <chr> "https://api.github.com/repos/stephaniehic…
$ subscribers_url <chr> "https://api.github.com/repos/stephaniehic…
$ subscription_url <chr> "https://api.github.com/repos/stephaniehic…
$ commits_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_commits_url <chr> "https://api.github.com/repos/stephaniehic…
$ comments_url <chr> "https://api.github.com/repos/stephaniehic…
$ issue_comment_url <chr> "https://api.github.com/repos/stephaniehic…
$ contents_url <chr> "https://api.github.com/repos/stephaniehic…
$ compare_url <chr> "https://api.github.com/repos/stephaniehic…
$ merges_url <chr> "https://api.github.com/repos/stephaniehic…
$ archive_url <chr> "https://api.github.com/repos/stephaniehic…
$ downloads_url <chr> "https://api.github.com/repos/stephaniehic…
$ issues_url <chr> "https://api.github.com/repos/stephaniehic…
$ pulls_url <chr> "https://api.github.com/repos/stephaniehic…
$ milestones_url <chr> "https://api.github.com/repos/stephaniehic…
$ notifications_url <chr> "https://api.github.com/repos/stephaniehic…
$ labels_url <chr> "https://api.github.com/repos/stephaniehic…
$ releases_url <chr> "https://api.github.com/repos/stephaniehic…
$ deployments_url <chr> "https://api.github.com/repos/stephaniehic…
$ created_at <chr> "2018-12-03T13:20:45Z", "2018-05-10T10:22:…
$ updated_at <chr> "2019-08-08T02:18:17Z", "2018-05-10T10:22:…
$ pushed_at <chr> "2018-12-05T17:07:09Z", "2017-12-18T17:18:…
$ git_url <chr> "git://github.com/stephaniehicks/2018-bioi…
$ ssh_url <chr> "git@github.com:stephaniehicks/2018-bioinf…
$ clone_url <chr> "https://github.com/stephaniehicks/2018-bi…
$ svn_url <chr> "https://github.com/stephaniehicks/2018-bi…
$ homepage <chr> NA, NA, NA, NA, "", NA, NA, NA, NA, NA, NA…
$ size <int> 60296, 172353, 121, 675, 26688, 20, 92401,…
$ stargazers_count <int> 4, 0, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0, 0,…
$ watchers_count <int> 4, 0, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0, 0,…
$ language <chr> "TeX", "HTML", NA, NA, "R", "R", "Jupyter …
$ has_issues <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRU…
$ has_projects <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_downloads <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_wiki <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_pages <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ has_discussions <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ forks_count <int> 4, 0, 0, 1, 1, 0, 2, 0, 0, 1, 4, 1, 1, 0, …
$ mirror_url <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ archived <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ open_issues_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ license <df[,5]> <data.frame[26 x 5]>
$ allow_forking <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ is_template <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ web_commit_signoff_required <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ topics <list> <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>…
$ visibility <chr> "public", "public", "public", "public", "p…
$ forks <int> 4, 0, 0, 1, 1, 0, 2, 0, 0, 1, 4, 1, 1, 0,…
$ open_issues <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ watchers <int> 4, 0, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0, 0,…
$ default_branch <chr> "master", "master", "master", "master", "m…
The fromJSON() function has now converted the JSON file into a data frame.
However, from here, we see that there are only 30 rows (or 30 repositories). If you look on my github page, you can see there are more than 30 repositories.
I have 85 public repositories as of today!
Solution: You should explicitly specify in your request how many items you would like to receive from the server’s pagination engine, using the formula for the GitHub pagination API:
?page=1&per_page=<numberOfItemsYouSpecify>
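For example, to request up to 100 repositories in a single page, we could append those query parameters to the repos URL with base R string functions (a sketch; per_page can be at most 100 on GitHub):

```r
github_url <- "https://api.github.com/users/stephaniehicks/repos"
per_page   <- 100   # GitHub returns at most 100 items per page
page       <- 1

# paste0() glues the pieces into the paginated request URL
paged_url <- paste0(github_url, "?page=", page, "&per_page=", per_page)
paged_url
# "https://api.github.com/users/stephaniehicks/repos?page=1&per_page=100"
```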
You can read more about pagination here:
Using API keys
Authenticating with the GitHub API via an API key allows you to send many more requests to the API.
API access keys for the GitHub API are called personal access tokens (PAT) and the documentation explains how to generate a PAT once you have logged into your GitHub account.
Assuming you have created and stored an API key in the .Renviron
file in your home directory, you can fetch it with the Sys.getenv()
function.
github_key <- Sys.getenv("GITHUB_API_KEY")
We will use this in a little bit.
Access the API with httr and GET()
There is a set of basic HTTP verbs that allow you to access a set of endpoints.
The basic request patterns are:
- Retrieve a single item (GET)
- Retrieve a list of items (GET)
- Create an item (POST)
- Update an item (PUT)
- Delete an item (DELETE)
Here, we will use the GET() function from the httr package (i.e. tools to work with URLs and HTTP) to retrieve a single JSON file.
We will also make this an authenticated HTTP request to the GitHub API using authenticate() from the httr package.
Next we extract the contents from the raw JSON output using the content() function from the httr package. If you use the argument as = 'text', it extracts the contents as a character vector.
response <- GET('https://api.github.com/users/stephaniehicks',
                authenticate('stephaniehicks', github_key))

account_details <- fromJSON(httr::content(response, as = 'text'))
account_details
$login
[1] "stephaniehicks"
$id
[1] 1452065
$node_id
[1] "MDQ6VXNlcjE0NTIwNjU="
$avatar_url
[1] "https://avatars.githubusercontent.com/u/1452065?v=4"
$gravatar_id
[1] ""
$url
[1] "https://api.github.com/users/stephaniehicks"
$html_url
[1] "https://github.com/stephaniehicks"
$followers_url
[1] "https://api.github.com/users/stephaniehicks/followers"
$following_url
[1] "https://api.github.com/users/stephaniehicks/following{/other_user}"
$gists_url
[1] "https://api.github.com/users/stephaniehicks/gists{/gist_id}"
$starred_url
[1] "https://api.github.com/users/stephaniehicks/starred{/owner}{/repo}"
$subscriptions_url
[1] "https://api.github.com/users/stephaniehicks/subscriptions"
$organizations_url
[1] "https://api.github.com/users/stephaniehicks/orgs"
$repos_url
[1] "https://api.github.com/users/stephaniehicks/repos"
$events_url
[1] "https://api.github.com/users/stephaniehicks/events{/privacy}"
$received_events_url
[1] "https://api.github.com/users/stephaniehicks/received_events"
$type
[1] "User"
$site_admin
[1] FALSE
$name
[1] "Stephanie Hicks"
$company
[1] "Johns Hopkins"
$blog
[1] "http://www.stephaniehicks.com"
$location
[1] "Baltimore, MD"
$email
NULL
$hireable
NULL
$bio
[1] "Associate Prof at Johns Hopkins Biostatistics"
$twitter_username
[1] "stephaniehicks"
$public_repos
[1] 85
$public_gists
[1] 8
$followers
[1] 233
$following
[1] 16
$created_at
[1] "2012-02-19T21:18:27Z"
$updated_at
[1] "2022-11-29T18:35:36Z"
Next, let’s perform the same request we did above about my 85 repositories, but instead of reading in the JSON file from the web, we use an authenticated GET()
response:
response <- GET('https://api.github.com/users/stephaniehicks/repos?page=1&per_page=1000',
                authenticate('stephaniehicks', github_key))

repo_details <- as_tibble(fromJSON(httr::content(response, as = 'text')))
repo_details
# A tibble: 85 × 80
id node_id name full_…¹ private owner…² html_…³ descr…⁴ fork url
<int> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <lgl> <chr>
1 160194123 MDEwOlJl… 2018… stepha… FALSE stepha… https:… <NA> FALSE http…
2 132884754 MDEwOlJl… advd… stepha… FALSE stepha… https:… <NA> TRUE http…
3 225501707 MDEwOlJl… Awes… stepha… FALSE stepha… https:… A cura… TRUE http…
4 63822882 MDEwOlJl… awes… stepha… FALSE stepha… https:… List o… TRUE http…
5 16586187 MDEwOlJl… Back… stepha… FALSE stepha… https:… Gene e… FALSE http…
6 287533539 MDEwOlJl… benc… stepha… FALSE stepha… https:… <NA> FALSE http…
7 168789632 MDEwOlJl… benc… stepha… FALSE stepha… https:… Benchm… FALSE http…
8 140320609 MDEwOlJl… benc… stepha… FALSE stepha… https:… Reposi… FALSE http…
9 178318764 MDEwOlJl… benc… stepha… FALSE stepha… https:… Data a… FALSE http…
10 313126225 MDEwOlJl… bioc… stepha… FALSE stepha… https:… <NA> FALSE http…
# … with 75 more rows, 87 more variables: owner$id <int>, $node_id <chr>,
# $avatar_url <chr>, $gravatar_id <chr>, $url <chr>, $html_url <chr>,
# $followers_url <chr>, $following_url <chr>, $gists_url <chr>,
# $starred_url <chr>, $subscriptions_url <chr>, $organizations_url <chr>,
# $repos_url <chr>, $events_url <chr>, $received_events_url <chr>,
# $type <chr>, $site_admin <lgl>, forks_url <chr>, keys_url <chr>,
# collaborators_url <chr>, teams_url <chr>, hooks_url <chr>, …
A bit of EDA fun
Let’s have a bit of fun and explore some questions:
- How many are private repos?
- How many have forks, and how many forks do they have?
table(repo_details$private)
FALSE
85
table(repo_details$forks)
0 1 2 3 4 5 6 8 11
56 11 3 2 7 1 2 2 1
What’s the most popular language?
table(repo_details$language)
HTML JavaScript Jupyter Notebook Makefile
20 4 1 1
Perl R Ruby Shell
1 28 3 2
TeX
5
To find out how many of my repos have open issues, we can just create a table:
# how many repos have open issues?
table(repo_details$open_issues_count)
0 1 2 3
79 3 2 1
Whew! Not as many as I thought.
Other examples with GitHub API
Finally, I will leave you with a few other examples of using GitHub API:
COVID Act Now API
Next, we will demonstrate how to request data from the API at COVID Act Now, which returns CSV or JSON files.
This API provides access to COVID data tracking US states, counties, and metros, including data and metrics for cases, vaccinations, tests, hospitalizations, and deaths. See data definitions for all included data.
Register for an API Key
First, you need to register for an API key here
You should also store the API key in your .Renviron
like above for the GitHub API key.
Building the URL for GET
First, we will request time series COVID data for one county in the US (Baltimore City), defined by its FIPS code (24510).
The URL we want is the following
https://api.covidactnow.org/v2/county/24510.timeseries.json?apiKey=<your_API_key_here>
Let’s build up the URL.
- The first part is the base URL: https://api.covidactnow.org/v2/. This part of the URL will be the same for all our calls to this API.
- The county/ portion of the URL indicates that we only want COVID data for a single county. By looking at the COVID Act Now API documentation, I can see that states is an alternative option for this part of the URL.
- 24510 is the unique identifier for Baltimore City. If I want to get the same data but for a different county, I just have to change this number.
- .timeseries provides the API with more information about the data I am requesting, and .json tells the API to format the data as JSON (which we will convert to a data frame).
- Everything after apiKey= is my authorization token, which tells the COVID Act Now servers that I am allowed to ask for this data.
Now that we have dissected the anatomy of an API request, you can see how easy it is to build one!
Basically anybody with an internet connection, an authorization token, and who knows the grammar of the API can access it. Most APIs are published with extensive documentation to help you understand the available options and parameters.
Calling an API with GET
Let’s join the URL together for one county (later I will show how to loop through multiple counties)
## extract my API key from `.Renviron`
covid_key <- Sys.getenv("COVID_ACT_NOW_API_KEY")

## build the URL
base     <- 'https://api.covidactnow.org/v2/county/'
county   <- '24510'
info_key <- '.timeseries.json?apiKey='

## put it all together
API_URL <- paste0(base, county, info_key, covid_key)
Now we have the entire URL stored in a simple R object called API_URL.
We can now use the URL to call the API, and we will store the returned data in an object called raw_data:
raw_data <- GET(API_URL)
raw_data
Response [https://api.covidactnow.org/v2/county/24510.timeseries.json?apiKey=b228d58133a04e3186e5ce081f73bbf5]
Date: 2022-12-06 18:29
Status: 200
Content-Type: application/json
Size: 1.34 MB
Next, we can inspect the object and we see that it is a list.
str(raw_data)
List of 10
$ url : chr "https://api.covidactnow.org/v2/county/24510.timeseries.json?apiKey=b228d58133a04e3186e5ce081f73bbf5"
$ status_code: int 200
$ headers :List of 16
..$ content-type : chr "application/json"
..$ x-amz-id-2 : chr "1Zbs9YrvpaiBiFMTsb0imkgtLHOELu8G0JopXchPmJIwWN/qJoo6W3zgvweFYN1fU0VU1NY2tCicyiSxMuq/hw=="
..$ x-amz-request-id : chr "HAJVVPHDJ8D03R8K"
..$ date : chr "Tue, 06 Dec 2022 18:29:11 GMT"
..$ access-control-allow-origin : chr "*"
..$ access-control-allow-methods: chr "GET"
..$ last-modified : chr "Tue, 06 Dec 2022 15:23:54 GMT"
..$ etag : chr "W/\"10d0f86efbf0c1846bf70a19394e8d29\""
..$ x-amz-version-id : chr "6dSBQTzjCLxCzOiu3TC3P51hA3gYmcTs"
..$ server : chr "AmazonS3"
..$ content-encoding : chr "gzip"
..$ vary : chr "Accept-Encoding"
..$ x-cache : chr "Miss from cloudfront"
..$ via : chr "1.1 c7705692ed008dad7e46e32f966aa3fe.cloudfront.net (CloudFront)"
..$ x-amz-cf-pop : chr "JFK50-P8"
..$ x-amz-cf-id : chr "0E0pZr1-jowcMvOTp00B64tKiNeVNfh1_RSuB_ZGBra-XjH47tNBmw=="
..- attr(*, "class")= chr [1:2] "insensitive" "list"
$ all_headers:List of 1
..$ :List of 3
.. ..$ status : int 200
.. ..$ version: chr "HTTP/2"
.. ..$ headers:List of 16
.. .. ..$ content-type : chr "application/json"
.. .. ..$ x-amz-id-2 : chr "1Zbs9YrvpaiBiFMTsb0imkgtLHOELu8G0JopXchPmJIwWN/qJoo6W3zgvweFYN1fU0VU1NY2tCicyiSxMuq/hw=="
.. .. ..$ x-amz-request-id : chr "HAJVVPHDJ8D03R8K"
.. .. ..$ date : chr "Tue, 06 Dec 2022 18:29:11 GMT"
.. .. ..$ access-control-allow-origin : chr "*"
.. .. ..$ access-control-allow-methods: chr "GET"
.. .. ..$ last-modified : chr "Tue, 06 Dec 2022 15:23:54 GMT"
.. .. ..$ etag : chr "W/\"10d0f86efbf0c1846bf70a19394e8d29\""
.. .. ..$ x-amz-version-id : chr "6dSBQTzjCLxCzOiu3TC3P51hA3gYmcTs"
.. .. ..$ server : chr "AmazonS3"
.. .. ..$ content-encoding : chr "gzip"
.. .. ..$ vary : chr "Accept-Encoding"
.. .. ..$ x-cache : chr "Miss from cloudfront"
.. .. ..$ via : chr "1.1 c7705692ed008dad7e46e32f966aa3fe.cloudfront.net (CloudFront)"
.. .. ..$ x-amz-cf-pop : chr "JFK50-P8"
.. .. ..$ x-amz-cf-id : chr "0E0pZr1-jowcMvOTp00B64tKiNeVNfh1_RSuB_ZGBra-XjH47tNBmw=="
.. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
$ cookies :'data.frame': 0 obs. of 7 variables:
..$ domain : logi(0)
..$ flag : logi(0)
..$ path : logi(0)
..$ secure : logi(0)
..$ expiration: 'POSIXct' num(0)
..$ name : logi(0)
..$ value : logi(0)
$ content : raw [1:1338019] 7b 22 66 69 ...
$ date : POSIXct[1:1], format: "2022-12-06 18:29:11"
$ times : Named num [1:6] 0 0.0353 0.0479 0.0669 0.294 ...
..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
$ request :List of 7
..$ method : chr "GET"
..$ url : chr "https://api.covidactnow.org/v2/county/24510.timeseries.json?apiKey=b228d58133a04e3186e5ce081f73bbf5"
..$ headers : Named chr "application/json, text/xml, application/xml, */*"
.. ..- attr(*, "names")= chr "Accept"
..$ fields : NULL
..$ options :List of 2
.. ..$ useragent: chr "libcurl/7.84.0 r-curl/4.3.3 httr/1.4.4"
.. ..$ httpget : logi TRUE
..$ auth_token: NULL
..$ output : list()
.. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
..- attr(*, "class")= chr "request"
$ handle :Class 'curl_handle' <externalptr>
- attr(*, "class")= chr "response"
One of the elements is content, and we can inspect it:
str(raw_data$content)
raw [1:1338019] 7b 22 66 69 ...
We see the actual data have been stored as raw vectors (i.e. raw bytes), which need to be converted to character vectors. This is not in a usable format yet.
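To see what these raw bytes are on a small scale, here is a toy example (base R only, not the API data itself) round-tripping a few characters through charToRaw() and rawToChar():

```r
# charToRaw() turns a character string into raw bytes;
# rawToChar() reverses the conversion
bytes <- charToRaw('{"fi')
bytes                 # 7b 22 66 69 -- the same bytes that start raw_data$content
rawToChar(bytes)      # '{"fi', the first characters of the JSON payload
```

So the bytes 7b 22 66 69 printed above are simply the opening characters of the JSON text, stored byte by byte.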
Converting JSON to a data.frame
There is a function in base R, rawToChar(), that converts raw bytes to characters:

covid_data <- fromJSON(rawToChar(raw_data$content), flatten = TRUE)
This converts the raw data into a list.
Now that it is in a list format, you can see that it actually contains several data frames!
You can use this data right away if you are already familiar with lists in R, or you can extract the data frames into separate objects, like this:
ts_df <- covid_data$actualsTimeseries
The data frame that we have just created contains many different variables and a lot of information. Below, you can see the first ten rows of a selection of some interesting variables in our data:
head(ts_df[ , c("cases", "deaths", "newCases", "newDeaths", "date")], n=10)
cases deaths newCases newDeaths date
1 NA NA NA NA 2020-03-11
2 NA NA NA NA 2020-03-12
3 NA NA NA NA 2020-03-13
4 NA NA NA NA 2020-03-14
5 1 0 NA NA 2020-03-15
6 1 0 0 0 2020-03-16
7 3 0 2 0 2020-03-17
8 7 0 4 0 2020-03-18
9 7 0 0 0 2020-03-19
10 11 0 4 0 2020-03-20
Looping multiple API calls
Now that we’ve seen how to make an API call for one county, let’s create a simple loop to make several calls at a time. We’ll use a for loop, which you can read more about here.
First, we’ll create a vector with the ID code for each county we want to get data for:
# Baltimore City County, Montgomery County, Baltimore County
counties <- c('24510', '24031', '24005')
We will loop through each element and adjust the API_URL
accordingly.
temp_list <- vector("list", length = 3)
ts_df <- NULL

for (i in 1:length(counties)) {
  # Build the API URL with the new county code
  API_URL <- paste0(base, counties[i], info_key, covid_key)

  # Store the raw and processed API results in temporary objects
  temp_raw  <- GET(API_URL)
  temp_list <- fromJSON(rawToChar(temp_raw$content), flatten = TRUE)

  # Add the most recent results to your data frame
  ts_df <- rbind(ts_df, temp_list$actualsTimeseries)
}
dim(ts_df)
[1] 3007 29
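The accumulation pattern in this loop (start with NULL, then rbind() each new result) is general. Here is a network-free sketch of the same pattern, where fetch_county() is a hypothetical stand-in for the GET() + fromJSON() steps above:

```r
# Hypothetical stand-in for GET() + fromJSON(): returns a one-row
# data frame per county so the accumulation pattern is visible offline
fetch_county <- function(fips) {
  data.frame(fips = fips, cases = NA_integer_)
}

counties <- c('24510', '24031', '24005')

ts_df <- NULL
for (fips in counties) {
  # append each county's rows to the growing data frame
  ts_df <- rbind(ts_df, fetch_county(fips))
}

dim(ts_df)   # 3 rows (one per county), 2 columns
```

Growing a data frame with rbind() inside a loop is fine for a handful of calls; for many counties, collecting results in a list and combining once at the end (e.g. with do.call(rbind, ...)) scales better.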
Post-lecture materials
Other good R packages to know about
- googlesheets4 to interact with Google Sheets in R
- googledrive to interact with files on your Google Drive
Final Questions
Here are some post-lecture questions to help you think about the material discussed.