install.packages("jsonlite")
install.packages("httr")
Pre-lecture materials
Read ahead
Before class, you can prepare by reading the following materials:
Acknowledgements
Material for this lecture was borrowed and adapted from
Install new packages
Before we begin, you will need to install these packages:
install.packages("jsonlite")
install.packages("httr")
Now we load a few R packages:
library(tidyverse)
library(jsonlite)
library(httr)
Learning objectives
At the end of this lesson you will:
- Describe the difference between “raw” and “clean” data
- Explain what JSON files are and how to convert them into data frames in R
- Describe some best practices for sharing data with collaborators
- Know what API means and state four types of API architectures
- Practice with two APIs: the GitHub API and the openFDA API
Motivation
Today, we are going to talk about getting data from APIs and examples of common data formats.
First, let’s have a bit of a philosophical discussion about data.
“Raw” vs “Clean” data
As data analysts, this is what we wish data looked like whenever we start a project.
However, in reality, data is rarely in that form; it comes in all types of “raw” formats that need to be transformed into a “clean” format.
For example, in the field of genomics, raw data looks something like this:
Or if you are interested in analyzing data from Twitter:
Or data from Electronic Healthcare Records (EHRs):
We all have our scary spreadsheet tales. Here is Jenny Bryan from Posit and UBC actually asking for some of those spreadsheet tales on Twitter.
For example, this is an actual spreadsheet from Enron in 2001:
What do we mean by “raw” data?
From https://simplystatistics.org/posts/2016-07-20-relativity-raw-data/ raw data is defined as data…
…if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the rawest version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.
Where do data live?
Data lives anywhere and everywhere. Data might be stored simply in a .csv or .txt file. Data might be stored in an Excel or Google Spreadsheet. Data might be stored in large databases that require users to write special functions to interact with to extract the data they are interested in.
For example, you may have heard of the terms mySQL or MongoDB.
From Wikipedia, MySQL is defined as an open-source relational database management system (RDBMS). Its name is a combination of “My”, the name of co-founder Michael Widenius’s daughter, and “SQL”, the abbreviation for Structured Query Language.
From Wikipedia, MongoDB is defined as “a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemata.”
So after reading that, we get the sense that there are multiple ways large databases can be structured and multiple ways data can be formatted and interacted with. In addition, we see that database programs (e.g. MySQL and MongoDB) can also interact with each other.
We will learn more about JSON today and learn about SQL more formally in a later lecture.
Best practices on sharing data
A great article titled How to share data for collaboration was published in PeerJ, in which the authors describe a set of guidelines for sharing data:
We highlight the need to provide raw data to the statistician, the importance of consistent formatting, and the necessity of including all essential experimental information and pre-processing steps carried out to the statistician. With these guidelines we hope to avoid errors and delays in data analysis.
It’s a great paper that describes the information you should pass to a statistician to facilitate the most efficient and timely analysis.
Specifically:
- The raw data (or the rawest form of the data to which you have access)
  - Should not have modified, removed, or summarized any data; ran no software on data
  - e.g. a strange binary file your measurement machine spits out
  - e.g. a complicated JSON file you scraped from the Twitter Application Programming Interface (API)
  - e.g. hand-entered numbers you collected looking through a microscope
- A clean data set
  - This may or may not be transforming data into a tidy dataset, but possibly yes
- A code book describing each variable and its values in the clean or tidy data set
  - More detailed information about the measurements in the data set (e.g. units, experimental design, summary choices made)
  - Doesn’t quite fit into the column names in the spreadsheet
  - Often reported in a .md, .txt, or Word file
- An explicit and exact recipe you used to go from 1 -> 2,3
Getting data
JSON files
JSON (or JavaScript Object Notation) is a file format that stores information in a human-readable, organized, and easy-to-access manner.
For example, here is what a JSON file looks like:
var stephanie = {
"job-title" : "Associate Professor",
"hometown" : "Baltimore, MD",
"pronouns": "she/her",
"states-lived" : {
"state1" : "Louisiana",
"state2" : "Texas",
"state3" : "Massachusetts",
"state4" : "Maryland"
} }
Some features about JSON objects:
- JSON objects are surrounded by curly braces {}
- JSON objects are written in key/value pairs
- Keys must be strings, and values must be a valid JSON data type (string, number, object, array, boolean)
- Keys and values are separated by a colon
- Each key/value pair is separated by a comma
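To see these features in action from R, here is a minimal sketch that parses a JSON string (modeled on the example above) with jsonlite::fromJSON():
library(jsonlite)

## a JSON object as a string: quoted string keys, valid JSON values
json_str <- '{
  "job-title": "Associate Professor",
  "hometown": "Baltimore, MD",
  "states-lived": {"state1": "Louisiana", "state2": "Texas"}
}'

stephanie <- fromJSON(json_str)
stephanie$hometown   # "Baltimore, MD"
str(stephanie)       # nested JSON objects become nested R lists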
Overview of APIs
From AWS, API stands for Application Programming Interface.
- “Application” = any software with a distinct function
- “Interface” = a contract of service between two applications. This contract defines how the two communicate with each other using requests and responses.
The API documentation contains information on how developers are to structure those requests and responses.
The purpose of APIs is to enable two software components to communicate with each other using a set of definitions and protocols.
For example, the weather bureau’s software system contains daily weather data. The weather app on your phone “talks” to this system via APIs and shows you daily weather updates on your phone.
How do APIs work?
To understand how APIs work, two terms that are important are
- client. This is the application sending the request.
- server. This is the application sending the response.
So in the weather example, the bureau’s weather database is the server, and the mobile app is the client.
Four types of API architectures
There are four different ways that APIs can work depending on when and why they were created.
SOAP APIs. These APIs use Simple Object Access Protocol. Client and server exchange messages using XML. This is a less flexible API that was more popular in the past.
RPC APIs. These APIs are called Remote Procedure Calls. The client completes a function (or procedure) on the server, and the server sends the output back to the client.
Websocket APIs. The Websocket API is another modern web API development that uses JSON objects to pass data. A WebSocket API supports two-way communication between client apps and the server. The server can send callback messages to connected clients, making it more efficient than a REST API.
REST APIs. REST stands for Representational State Transfer (these are the most popular and flexible APIs). The client sends requests to the server as data. The server uses this client input to start internal functions and returns output data back to the client. REST defines a set of functions like GET, PUT, and DELETE that clients can use to access server data. Clients and servers exchange data using HTTP.
The main feature of a REST API is statelessness (i.e. servers do not save client data between requests). Client requests to the server are similar to URLs you type in your browser to visit a website. The response from the server is plain data, without the typical graphical rendering of a web page.
How to use an API?
The basic steps to using an API are:
- Obtain an API key. This is done by creating a verified account with the API provider.
- Set up an HTTP API client. This tool allows you to structure API requests easily using the API key received. Here, we will use the GET() function from the httr package.
- If you don’t have an API client, you can try to structure the request yourself in your browser by referring to the API documentation.
- Once you are comfortable with the new API syntax, you can start using it in your code.
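Putting these steps together, the general pattern in R looks something like this. This is only a sketch: MY_API_KEY and the endpoint URL are hypothetical placeholders, and the real examples below fill them in.
library(httr)
library(jsonlite)

## 1. fetch the API key you stored (assumes it lives in .Renviron)
api_key <- Sys.getenv("MY_API_KEY")                  # hypothetical key name

## 2. structure and send the request with an HTTP API client
response <- GET("https://api.example.com/endpoint",  # hypothetical endpoint
                query = list(api_key = api_key))

## 3. check the status, then parse the JSON body into R objects
status_code(response)
parsed <- fromJSON(content(response, as = "text"))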
Where can I find new APIs?
New web APIs can be found on API marketplaces and API directories, such as:
- Rapid API – One of the largest global API markets (10k+ public APIs). Users can test APIs directly on the platform before committing to a purchase.
- Public REST APIs – Groups REST APIs into categories, making it easier to browse and find the right one to meet your needs.
- APIForThat and APIList – Both these websites have lists of 500+ web APIs, along with in-depth information on how to use them.
GitHub API
The GitHub REST API may be of interest when studying online communities, working methods, organizational structures, communication and discussions, etc. with a focus on (open-source) software development.
Many projects that are hosted on GitHub are open-source projects with a transparent development process and communications. For private projects, which can also be hosted on GitHub, understandably only limited aggregate data are available.
Let’s say we want to use the GitHub REST API to find out how many of my GitHub repositories have open issues.
The API can be used for free and you can send up to 60 requests per hour if you are not authenticated (i.e. if you don’t provide an API key).
For serious data collection, this is not much, so it is recommended to sign up on GitHub and generate a personal access token that acts as an API key.
This token can then be used to authenticate your API requests. Your quota is then 5000 requests per hour.
Access the API from R
There are packages for many programming languages that provide convenient access to the GitHub API, but there are no such packages (that I’m aware of) for accessing the API from R.
This means we can only access the API directly, e.g. by using the jsonlite package to fetch the data and convert it to an R list or data.frame.
Specifically, we will use the jsonlite::fromJSON() function to convert from a JSON object to a data frame.
The JSON file is located at https://api.github.com/users/stephaniehicks/repos
= "https://api.github.com/users/stephaniehicks/repos"
github_url
library(jsonlite)
library(tidyverse)
<- as_tibble(fromJSON(github_url))
jsonData glimpse(jsonData)
Rows: 30
Columns: 79
$ id <int> 160194123, 132884754, 647539937, 225501707…
$ node_id <chr> "MDEwOlJlcG9zaXRvcnkxNjAxOTQxMjM=", "MDEwO…
$ name <chr> "2018-bioinfosummer-scrnaseq", "advdatasci…
$ full_name <chr> "stephaniehicks/2018-bioinfosummer-scrnase…
$ private <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ owner <df[,18]> <data.frame[26 x 18]>
$ html_url <chr> "https://github.com/stephaniehicks/201…
$ description <chr> NA, NA, "Repo to share code for the atlas-…
$ fork <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FAL…
$ url <chr> "https://api.github.com/repos/stephaniehic…
$ forks_url <chr> "https://api.github.com/repos/stephaniehic…
$ keys_url <chr> "https://api.github.com/repos/stephaniehic…
$ collaborators_url <chr> "https://api.github.com/repos/stephaniehic…
$ teams_url <chr> "https://api.github.com/repos/stephaniehic…
$ hooks_url <chr> "https://api.github.com/repos/stephaniehic…
$ issue_events_url <chr> "https://api.github.com/repos/stephaniehic…
$ events_url <chr> "https://api.github.com/repos/stephaniehic…
$ assignees_url <chr> "https://api.github.com/repos/stephaniehic…
$ branches_url <chr> "https://api.github.com/repos/stephaniehic…
$ tags_url <chr> "https://api.github.com/repos/stephaniehic…
$ blobs_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_tags_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_refs_url <chr> "https://api.github.com/repos/stephaniehic…
$ trees_url <chr> "https://api.github.com/repos/stephaniehic…
$ statuses_url <chr> "https://api.github.com/repos/stephaniehic…
$ languages_url <chr> "https://api.github.com/repos/stephaniehic…
$ stargazers_url <chr> "https://api.github.com/repos/stephaniehic…
$ contributors_url <chr> "https://api.github.com/repos/stephaniehic…
$ subscribers_url <chr> "https://api.github.com/repos/stephaniehic…
$ subscription_url <chr> "https://api.github.com/repos/stephaniehic…
$ commits_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_commits_url <chr> "https://api.github.com/repos/stephaniehic…
$ comments_url <chr> "https://api.github.com/repos/stephaniehic…
$ issue_comment_url <chr> "https://api.github.com/repos/stephaniehic…
$ contents_url <chr> "https://api.github.com/repos/stephaniehic…
$ compare_url <chr> "https://api.github.com/repos/stephaniehic…
$ merges_url <chr> "https://api.github.com/repos/stephaniehic…
$ archive_url <chr> "https://api.github.com/repos/stephaniehic…
$ downloads_url <chr> "https://api.github.com/repos/stephaniehic…
$ issues_url <chr> "https://api.github.com/repos/stephaniehic…
$ pulls_url <chr> "https://api.github.com/repos/stephaniehic…
$ milestones_url <chr> "https://api.github.com/repos/stephaniehic…
$ notifications_url <chr> "https://api.github.com/repos/stephaniehic…
$ labels_url <chr> "https://api.github.com/repos/stephaniehic…
$ releases_url <chr> "https://api.github.com/repos/stephaniehic…
$ deployments_url <chr> "https://api.github.com/repos/stephaniehic…
$ created_at <chr> "2018-12-03T13:20:45Z", "2018-05-10T10:22:…
$ updated_at <chr> "2019-08-08T02:18:17Z", "2018-05-10T10:22:…
$ pushed_at <chr> "2018-12-05T17:07:09Z", "2017-12-18T17:18:…
$ git_url <chr> "git://github.com/stephaniehicks/2018-bioi…
$ ssh_url <chr> "git@github.com:stephaniehicks/2018-bioinf…
$ clone_url <chr> "https://github.com/stephaniehicks/2018-bi…
$ svn_url <chr> "https://github.com/stephaniehicks/2018-bi…
$ homepage <chr> NA, NA, NA, NA, NA, "", NA, NA, NA, NA, NA…
$ size <int> 60296, 172353, 8858, 121, 675, 26688, 20, …
$ stargazers_count <int> 4, 0, 1, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0,…
$ watchers_count <int> 4, 0, 1, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0,…
$ language <chr> "TeX", "HTML", "R", NA, NA, "R", "R", "Jup…
$ has_issues <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRU…
$ has_projects <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_downloads <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_wiki <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_pages <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ has_discussions <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ forks_count <int> 4, 0, 0, 0, 1, 1, 0, 2, 0, 0, 0, 4, 1, 1, …
$ mirror_url <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ archived <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ open_issues_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ license <df[,5]> <data.frame[26 x 5]>
$ allow_forking <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ is_template <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ web_commit_signoff_required <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ topics <list> <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>…
$ visibility <chr> "public", "public", "public", "public", "p…
$ forks <int> 4, 0, 0, 0, 1, 1, 0, 2, 0, 0, 0, 4, 1, 1,…
$ open_issues <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ watchers <int> 4, 0, 1, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0,…
$ default_branch <chr> "master", "master", "main", "master", "mas…
The function fromJSON() has now converted the JSON file into a data frame.
However, from here we see that there are only 30 rows (or 30 repositories). If you look on my GitHub page, you can see there are more than 30 repositories.
What’s happening is called pagination.
At a high-level, the API is limiting the amount of items a user gets and splitting it into pages.
Formally, pagination is the process of splitting the contents or a section of a website into discrete pages. Users tend to get lost when there is a bunch of data, and with pagination they can concentrate on a particular amount of content at a time. A hierarchical, paginated structure also improves the readability of the content.
In this use case, the GitHub API splits the result into 30 items per response by default.
Solution: you should explicitly specify in your request how many items you would like to receive from the server’s pagination engine, using the GitHub pagination API query string:
?page=1&per_page=<numberOfItemsYouSpecify>
You can read more about pagination in the GitHub API documentation.
Here we can visit the paginated URL directly and see there are more than 30 repos. Let’s read it into R.
= "https://api.github.com/users/stephaniehicks/repos?page=1&per_page=1000"
github_url
<- as_tibble(fromJSON(github_url))
jsonDataAll dim(jsonDataAll)
[1] 90 79
We now get all the public repositories! yay!
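Note that the GitHub API caps per_page at 100 items, so our request above worked only because all 90 repositories fit in a single capped page. For a user with more repositories, you would walk through the pages until an empty one comes back. A minimal sketch of that loop (remember that unauthenticated requests are limited to 60 per hour):
library(jsonlite)
library(tidyverse)

all_pages <- list()
page <- 1
repeat {
  url <- paste0("https://api.github.com/users/stephaniehicks/repos",
                "?page=", page, "&per_page=100")
  res <- fromJSON(url, flatten = TRUE)   # flatten nested objects into columns
  if (length(res) == 0) break            # an empty JSON array means no more pages
  all_pages[[page]] <- as_tibble(res)
  page <- page + 1
}
repos <- bind_rows(all_pages)
dim(repos)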
Using API keys
Authenticating with the GitHub API via an API key allows you to send many more requests to the API.
API access keys for the GitHub API are called personal access tokens (PAT) and the documentation explains how to generate a PAT once you have logged into your GitHub account.
First, please be careful with your PATs and never publish them.
If you want guidance on where you should store them, I like this post:
Personally, I keep mine in my .Renviron file, which looks something like this on the inside:
GITHUB_API_KEY = <add my GitHub API key here>
CENSUS_API_KEY = <add my tidycensus API key here>
OPENFDA_API_KEY = <add my openFDA API key here>
If you do not have an .Renviron file in your home directory, you can make one:
cd ~
touch .Renviron
Assuming you have created and stored an API key in the .Renviron file in your home directory, you can fetch it with the Sys.getenv() function.
github_key <- Sys.getenv("GITHUB_API_KEY")
We will use this in a little bit.
Access API with httr and GET
There is a set of basic HTTP verbs that allow you to access a set of endpoints.
The basic request patterns are:
- Retrieve a single item (GET)
- Retrieve a list of items (GET)
- Create an item (POST)
- Update an item (PUT)
- Delete an item (DELETE)
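Each of these verbs has a matching function in httr. Here is a minimal sketch using httpbin.org, a public request-echo service, purely as a stand-in server:
library(httr)

## each basic request pattern maps onto an httr verb function
r_get    <- GET("https://httpbin.org/get")                       # retrieve
r_post   <- POST("https://httpbin.org/post", body = list(x = 1)) # create
r_put    <- PUT("https://httpbin.org/put", body = list(x = 2))   # update
r_delete <- DELETE("https://httpbin.org/delete")                 # delete

status_code(r_get)   # 200 if the request succeeded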
Here, we will use the GET() function from the httr package (i.e. tools to work with URLs and HTTP) to retrieve a single JSON file.
We will also make this an authenticated HTTP request to the GitHub API using authenticate() from the httr package.
Let’s start by using the GitHub API to learn information about myself (Stephanie Hicks):
<- Sys.getenv("GITHUB_API_KEY")
github_key <- GET('https://api.github.com/user',
response authenticate(user = 'stephaniehicks',
password = github_key))
response
Response [https://api.github.com/user]
Date: 2023-11-24 10:21
Status: 200
Content-Type: application/json; charset=utf-8
Size: 1.76 kB
{
"login": "stephaniehicks",
"id": 1452065,
"node_id": "MDQ6VXNlcjE0NTIwNjU=",
"avatar_url": "https://avatars.githubusercontent.com/u/1452065?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/stephaniehicks",
"html_url": "https://github.com/stephaniehicks",
"followers_url": "https://api.github.com/users/stephaniehicks/followers",
"following_url": "https://api.github.com/users/stephaniehicks/following{/ot...
...
We see the response we got is a JSON file.
Next we extract / retrieve the contents from the raw JSON output using the content() function from the httr package. If you use the argument as = 'text', it extracts the contents as a character vector.
account_details <- fromJSON(httr::content(response, as = 'text'))
account_details[1:30]
$login
[1] "stephaniehicks"
$id
[1] 1452065
$node_id
[1] "MDQ6VXNlcjE0NTIwNjU="
$avatar_url
[1] "https://avatars.githubusercontent.com/u/1452065?v=4"
$gravatar_id
[1] ""
$url
[1] "https://api.github.com/users/stephaniehicks"
$html_url
[1] "https://github.com/stephaniehicks"
$followers_url
[1] "https://api.github.com/users/stephaniehicks/followers"
$following_url
[1] "https://api.github.com/users/stephaniehicks/following{/other_user}"
$gists_url
[1] "https://api.github.com/users/stephaniehicks/gists{/gist_id}"
$starred_url
[1] "https://api.github.com/users/stephaniehicks/starred{/owner}{/repo}"
$subscriptions_url
[1] "https://api.github.com/users/stephaniehicks/subscriptions"
$organizations_url
[1] "https://api.github.com/users/stephaniehicks/orgs"
$repos_url
[1] "https://api.github.com/users/stephaniehicks/repos"
$events_url
[1] "https://api.github.com/users/stephaniehicks/events{/privacy}"
$received_events_url
[1] "https://api.github.com/users/stephaniehicks/received_events"
$type
[1] "User"
$site_admin
[1] FALSE
$name
[1] "Stephanie Hicks"
$company
[1] "Johns Hopkins"
$blog
[1] "http://www.stephaniehicks.com"
$location
[1] "Baltimore, MD"
$email
NULL
$hireable
NULL
$bio
[1] "Associate Prof at Johns Hopkins Biostatistics"
$twitter_username
[1] "stephaniehicks"
$public_repos
[1] 90
$public_gists
[1] 8
$followers
[1] 275
$following
[1] 18
Next, let’s perform the same request we did above about my repositories, but instead of reading in the JSON file from the web, we use an authenticated GET() response:
response <- GET('https://api.github.com/users/stephaniehicks/repos?page=1&per_page=1000',
                authenticate('stephaniehicks', github_key))
repo_details <- as_tibble(fromJSON(httr::content(response, as = 'text')))
repo_details
# A tibble: 90 × 80
id node_id name full_name private owner$login html_url description fork
<int> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <lgl>
1 1.60e8 MDEwOl… 2018… stephani… FALSE stephanieh… https:/… <NA> FALSE
2 1.33e8 MDEwOl… advd… stephani… FALSE stephanieh… https:/… <NA> TRUE
3 6.48e8 R_kgDO… atla… stephani… FALSE stephanieh… https:/… Repo to sh… FALSE
4 2.26e8 MDEwOl… Awes… stephani… FALSE stephanieh… https:/… A curated … TRUE
5 6.38e7 MDEwOl… awes… stephani… FALSE stephanieh… https:/… List of so… TRUE
6 1.66e7 MDEwOl… Back… stephani… FALSE stephanieh… https:/… Gene expre… FALSE
7 2.88e8 MDEwOl… benc… stephani… FALSE stephanieh… https:/… <NA> FALSE
8 1.69e8 MDEwOl… benc… stephani… FALSE stephanieh… https:/… Benchmarki… FALSE
9 1.40e8 MDEwOl… benc… stephani… FALSE stephanieh… https:/… Repository… FALSE
10 1.78e8 MDEwOl… benc… stephani… FALSE stephanieh… https:/… Data and B… FALSE
# ℹ 80 more rows
# ℹ 88 more variables: owner$id <int>, $node_id <chr>, $avatar_url <chr>,
# $gravatar_id <chr>, $url <chr>, $html_url <chr>, $followers_url <chr>,
# $following_url <chr>, $gists_url <chr>, $starred_url <chr>,
# $subscriptions_url <chr>, $organizations_url <chr>, $repos_url <chr>,
# $events_url <chr>, $received_events_url <chr>, $type <chr>,
# $site_admin <lgl>, url <chr>, forks_url <chr>, keys_url <chr>, …
A bit of EDA fun
Let’s have a bit of fun and explore some questions:
- How many repos have been forked, and how many forks does each have?
table(repo_details$forks)
0 1 2 3 4 5 6 7 8 9 11 22
61 10 4 2 6 1 1 1 1 1 1 1
What’s the most popular language?
table(repo_details$language)
CSS HTML JavaScript Jupyter Notebook
1 20 7 1
Makefile Perl R Ruby
1 1 29 3
Shell TeX
2 5
To find out how many repos I have with open issues, we can just create a table:
# how many repos have open issues?
table(repo_details$open_issues_count)
0 1 2
83 6 1
Whew! Not as many as I thought.
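To connect back to our original question (how many of my repositories have open issues?), we can also count this directly; a one-line sketch:
# count repositories with at least one open issue
sum(repo_details$open_issues_count > 0)
This returns 7 here (the 6 + 1 non-zero entries in the table above).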
GET with query parameters
You can use the query argument to specify details about the response.
Let’s look at how many open issues there are in the dplyr package in the tidyverse:
<- GET("https://api.github.com/repos/tidyverse/dplyr/issues",
req query = list(state = "open", per_page = 100, page = 1))
<- as_tibble(fromJSON(httr::content(req, as = 'text')))
dplyr_details dplyr_details
# A tibble: 50 × 30
url repository_url labels_url comments_url events_url html_url id
<chr> <chr> <chr> <chr> <chr> <chr> <int>
1 https://ap… https://api.g… https://a… https://api… https://a… https:/… 2.01e9
2 https://ap… https://api.g… https://a… https://api… https://a… https:/… 2.00e9
3 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.99e9
4 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.98e9
5 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.98e9
6 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.96e9
7 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.95e9
8 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.94e9
9 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.92e9
10 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.87e9
# ℹ 40 more rows
# ℹ 23 more variables: node_id <chr>, number <int>, title <chr>,
# user <df[,18]>, labels <list>, state <chr>, locked <lgl>,
# assignee <df[,18]>, assignees <list>, milestone <lgl>, comments <int>,
# created_at <chr>, updated_at <chr>, closed_at <lgl>,
# author_association <chr>, active_lock_reason <chr>, body <chr>,
# reactions <df[,10]>, timeline_url <chr>, performed_via_github_app <lgl>, …
Other examples with GitHub API
Finally, I will leave you with a few other examples of using the GitHub API:
openFDA API
Next, we will demonstrate how to request data from the openFDA API, which returns JSON files.
This API exists to create easy access to public data, to create a new level of openness and accountability, to ensure the privacy and security of public FDA data, and ultimately to educate the public and save lives. See the data definitions for all included data.
Register for an API Key
First, you need to register for an API key here
You should also store the API key in your .Renviron file, as above for the GitHub API key.
Building the URL for GET
First, we will request a summarized set of counts around food recalls either voluntary by a firm or mandated by the FDA.
The URL we want is the following:
https://api.fda.gov/food/enforcement.json?api_key=<your_API_key_here>&count=voluntary_mandated.exact
Let’s build up the URL.
- The first part is the base URL: https://api.fda.gov/food/enforcement.json. This part of the URL will be the same for all our calls to the food enforcement API (but is different if you want to investigate e.g. patient responses to drugs).
- Next, ?api_key=<your_API_key_here> is how I use my authorization token, which tells the openFDA servers that I am allowed to ask for this data.
- Finally, we want to return a set of summarized counts for a specific field (&count=voluntary_mandated.exact).
Now that we have dissected the anatomy of an API request, you can see how easy it is to build one!
Basically anybody with an internet connection, an authorization token, and knowledge of the grammar of the API can access it. Most APIs are published with extensive documentation to help you understand the available options and parameters.
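As an aside, instead of pasting the URL together by hand (as we do next), you can let httr assemble and encode the query string for you. A sketch of the equivalent request, assuming your key is stored in .Renviron as described above:
library(httr)

## equivalent request: httr builds ?api_key=...&count=... for us
res <- GET("https://api.fda.gov/food/enforcement.json",
           query = list(api_key = Sys.getenv("OPENFDA_API_KEY"),
                        count = "voluntary_mandated.exact"))
status_code(res)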
Calling an API with GET
Let’s join the URL together:
## extract my API key from `.Renviron`
openFDA_key <- Sys.getenv("OPENFDA_API_KEY")

## build the URL
base  <- 'https://api.fda.gov/food/enforcement.json?api_key='
query <- '&count=voluntary_mandated.exact'

## put it all together
API_URL <- paste0(base, openFDA_key, query)
Now we have the entire URL stored in a simple R object called API_URL.
We can now use the URL to call the API, and we will store the returned data in an object called raw_data:
raw_data <- GET(API_URL)
raw_data
Response [https://api.fda.gov/food/enforcement.json?api_key=<your_API_key_here>&count=voluntary_mandated.exact]
Date: 2023-11-24 10:21
Status: 200
Content-Type: application/json; charset=utf-8
Size: 711 B
{
"meta": {
"disclaimer": "Do not rely on openFDA to make decisions regarding medical...
"terms": "https://open.fda.gov/terms/",
"license": "https://open.fda.gov/license/",
"last_updated": "2023-11-22"
},
"results": [
{
"term": "Voluntary: Firm initiated",
...
We can see the status element of the list. Traditionally, a status of “200” means that the API call was successful, and other codes are used to indicate errors. You can troubleshoot those error codes using the API documentation.
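httr also provides helpers for checking the status programmatically; a small sketch:
status_code(raw_data)           # the numeric status code, e.g. 200
http_status(raw_data)$message   # a human-readable summary of the status
stop_for_status(raw_data)       # throws an R error for 4xx/5xx responses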
Next, we can inspect the object and we see that it is a list.
str(raw_data)
List of 10
$ url        : chr "https://api.fda.gov/food/enforcement.json?api_key=<your_API_key_here>&count=voluntary_mandated.exact"
$ status_code: int 200
$ headers :List of 22
..$ date : chr "Fri, 24 Nov 2023 10:21:51 GMT"
..$ content-type : chr "application/json; charset=utf-8"
..$ vary : chr "Accept-Encoding"
..$ access-control-allow-credentials: chr "true"
..$ access-control-allow-origin : chr "*"
..$ age : chr "0"
..$ cache-control : chr "no-cache, no-store, must-revalidate"
..$ content-security-policy : chr "default-src 'none'"
..$ etag : chr "W/\"2c7-D4yYChONTPyF2qQYL1Qita83Uu4\""
..$ strict-transport-security : chr "max-age=31536000;"
..$ strict-transport-security : chr "max-age=31536000; preload"
..$ vary : chr "Accept-Encoding"
..$ via : chr "https/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])"
..$ x-api-umbrella-request-id : chr "cc03cq16o9mdgfjlqf30"
..$ x-cache : chr "MISS"
..$ x-content-type-options : chr "nosniff"
..$ x-ratelimit-limit : chr "240"
..$ x-ratelimit-remaining : chr "239"
..$ x-vcap-request-id : chr "d6a63001-2511-441a-4dab-58c1ab363d6d"
..$ x-xss-protection : chr "1; mode=block"
..$ x-frame-options : chr "deny"
..$ content-encoding : chr "gzip"
..- attr(*, "class")= chr [1:2] "insensitive" "list"
$ all_headers:List of 1
..$ :List of 3
.. ..$ status : int 200
.. ..$ version: chr "HTTP/2"
.. ..$ headers:List of 22
.. .. ..$ date : chr "Fri, 24 Nov 2023 10:21:51 GMT"
.. .. ..$ content-type : chr "application/json; charset=utf-8"
.. .. ..$ vary : chr "Accept-Encoding"
.. .. ..$ access-control-allow-credentials: chr "true"
.. .. ..$ access-control-allow-origin : chr "*"
.. .. ..$ age : chr "0"
.. .. ..$ cache-control : chr "no-cache, no-store, must-revalidate"
.. .. ..$ content-security-policy : chr "default-src 'none'"
.. .. ..$ etag : chr "W/\"2c7-D4yYChONTPyF2qQYL1Qita83Uu4\""
.. .. ..$ strict-transport-security : chr "max-age=31536000;"
.. .. ..$ strict-transport-security : chr "max-age=31536000; preload"
.. .. ..$ vary : chr "Accept-Encoding"
.. .. ..$ via : chr "https/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])"
.. .. ..$ x-api-umbrella-request-id : chr "cc03cq16o9mdgfjlqf30"
.. .. ..$ x-cache : chr "MISS"
.. .. ..$ x-content-type-options : chr "nosniff"
.. .. ..$ x-ratelimit-limit : chr "240"
.. .. ..$ x-ratelimit-remaining : chr "239"
.. .. ..$ x-vcap-request-id : chr "d6a63001-2511-441a-4dab-58c1ab363d6d"
.. .. ..$ x-xss-protection : chr "1; mode=block"
.. .. ..$ x-frame-options : chr "deny"
.. .. ..$ content-encoding : chr "gzip"
.. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
$ cookies :'data.frame': 0 obs. of 7 variables:
..$ domain : logi(0)
..$ flag : logi(0)
..$ path : logi(0)
..$ secure : logi(0)
..$ expiration: 'POSIXct' num(0)
..$ name : logi(0)
..$ value : logi(0)
$ content : raw [1:711] 7b 0a 20 20 ...
$ date : POSIXct[1:1], format: "2023-11-24 10:21:51"
$ times : Named num [1:6] 0 0.062 0.137 0.218 0.386 ...
..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
$ request :List of 7
..$ method : chr "GET"
..$ url     : chr "https://api.fda.gov/food/enforcement.json?api_key=<your_API_key_here>&count=voluntary_mandated.exact"
..$ headers : Named chr "application/json, text/xml, application/xml, */*"
.. ..- attr(*, "names")= chr "Accept"
..$ fields : NULL
..$ options :List of 2
.. ..$ useragent: chr "libcurl/8.1.2 r-curl/5.1.0 httr/1.4.7"
.. ..$ httpget : logi TRUE
..$ auth_token: NULL
..$ output : list()
.. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
..- attr(*, "class")= chr "request"
$ handle :Class 'curl_handle' <externalptr>
- attr(*, "class")= chr "response"
One of the elements is content, and we can inspect that:
str(raw_data$content)
raw [1:711] 7b 0a 20 20 ...
We see the actual data have been stored as raw vectors (or raw bytes), which need to be converted to character vectors. This is not in a usable format yet.
Converting JSON to a data.frame
There is a function in base R, rawToChar(), that converts raw bytes to characters:
openFDA_data <- fromJSON(rawToChar(raw_data$content), flatten = TRUE)
This converts the raw data into a list.
We can also do this with httr::content() (as above) and just define the encoding for the character set.
openFDA_data <- fromJSON(httr::content(raw_data,
                                       as = 'text',
                                       encoding = "UTF-8"))
str(openFDA_data)
List of 2
$ meta :List of 4
..$ disclaimer : chr "Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data"| __truncated__
..$ terms : chr "https://open.fda.gov/terms/"
..$ license : chr "https://open.fda.gov/license/"
..$ last_updated: chr "2023-11-22"
$ results:'data.frame': 4 obs. of 2 variables:
..$ term : chr [1:4] "Voluntary: Firm initiated" "Voluntary: Firm Initiated" "FDA Mandated" ""
..$ count: int [1:4] 24217 563 397 6
Now that it is in a list format, you can see that it contains the query metadata and a results data frame!
You can use this data right away if you are already familiar with lists in R, or you can extract the results data frame into a separate object, like this:
ts_df <- openFDA_data$results
ts_df
term count
1 Voluntary: Firm initiated 24217
2 Voluntary: Firm Initiated 563
3 FDA Mandated 397
4 6
We could wrangle and visualize the data from here.
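For example, here is a quick bar chart of the counts (a sketch; ggplot2 is loaded with the tidyverse):
## bar chart of food enforcement reports by recall type
ggplot(ts_df, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "recall type", y = "number of recalls")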
Post-lecture materials
Other good R packages to know about
- googlesheets4 to interact with Google Sheets in R
- googledrive to interact with files on your Google Drive
Final Questions
Here are some post-lecture questions to help you think about the material discussed.
- Using the GitHub API, access your repository information and count how many of your repositories have open GitHub issues.
- Pick another API that we have not discussed here and use httr to retrieve data from it.