install.packages("jsonlite")
install.packages("httr")
Pre-lecture materials
Read ahead
Before class, you can prepare by reading the following materials:
Acknowledgements
Material for this lecture was borrowed and adapted from
Install new packages
Before we begin, you will need to install these packages:
install.packages("jsonlite")
install.packages("httr")
Now we load a few R packages:
library(tidyverse)
library(jsonlite)
library(httr)
Learning objectives
At the end of this lesson you will:
- Describe the difference between “raw” and “clean” data
- Explain what JSON files are and how to convert them into data frames in R
- Describe some best practices for sharing data with collaborators
- Know what API means and state four types of API architectures
- Practice with two APIs: the GitHub API and the openFDA API
Motivation
Today, we are going to talk about getting data from APIs and examples of common data formats.
First, let’s have a bit of a philosophical discussion about data.
“Raw” vs “Clean” data
As data analysts, this is what we wish data looked like whenever we start a project.
However, in reality, data is rarely in that form; it comes in all types of “raw” formats that need to be transformed into a “clean” format.
For example, in the field of genomics, raw data looks something like this:
Or if you are interested in analyzing data from Twitter:
Or data from Electronic Healthcare Records (EHRs):
We all have our scary spreadsheet tales. Here is Jenny Bryan from Posit and UBC actually asking for some of those spreadsheet tales on Twitter.
For example, this is an actual spreadsheet from Enron in 2001:
What do we mean by “raw” data?
From https://simplystatistics.org/posts/2016-07-20-relativity-raw-data/ raw data is defined as data…
…if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the rawest version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.
Where do data live?
Data lives anywhere and everywhere. Data might be stored simply in a .csv or .txt file. Data might be stored in an Excel or Google Spreadsheet. Data might be stored in large databases that require users to write special functions to interact with to extract the data they are interested in.
For example, you may have heard of the terms mySQL or MongoDB.
From Wikipedia, MySQL is defined as an open-source relational database management system (RDBMS). Its name is a combination of “My”, the name of co-founder Michael Widenius’s daughter, and “SQL”, the abbreviation for Structured Query Language.
From Wikipedia, MongoDB is defined as “a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemata.”
So after reading that, we get the sense that there are multiple ways large databases can be structured and multiple ways data can be formatted and interacted with. In addition, we see that database programs (e.g. MySQL and MongoDB) can also interact with each other.
We will learn more about JSON today and learn about SQL more formally in a later lecture.
Best practices on sharing data
A great article titled How to share data for collaboration was published in PeerJ, in which the authors describe a set of guidelines for sharing data:
We highlight the need to provide raw data to the statistician, the importance of consistent formatting, and the necessity of including all essential experimental information and pre-processing steps carried out to the statistician. With these guidelines we hope to avoid errors and delays in data analysis.
It’s a great paper that describes the information you should pass to a statistician to facilitate the most efficient and timely analysis.
Specifically:
- The raw data (or the rawest form of the data to which you have access)
  - Should not have modified, removed, or summarized any data; ran no software on data
  - e.g. a strange binary file your measurement machine spits out
  - e.g. a complicated JSON file you scraped from the Twitter Application Programming Interface (API)
  - e.g. hand-entered numbers you collected looking through a microscope
- A clean data set
  - This may or may not be transforming data into a tidy dataset, but possibly yes
- A code book describing each variable and its values in the clean or tidy data set
  - More detailed information about the measurements in the data set (e.g. units, experimental design, summary choices made)
  - Doesn’t quite fit into the column names in the spreadsheet
  - Often reported in a .md, .txt, or Word file
- An explicit and exact recipe you used to go from 1 -> 2,3
Getting data
JSON files
JSON (or JavaScript Object Notation) is a file format that stores information in a human-readable, organized, and easy-to-access manner.
For example, here is what a JSON file looks like:
var stephanie = {
"job-title" : "Associate Professor",
"hometown" : "Baltimore, MD",
"pronouns": "she/her",
"states-lived" : {
"state1" : "Louisiana",
"state2" : "Texas",
"state3" : "Massachusetts",
"state4" : "Maryland"
} }
Some features about JSON objects:
- JSON objects are surrounded by curly braces {}
- JSON objects are written in key/value pairs
- Keys must be strings, and values must be a valid JSON data type (string, number, object, array, boolean)
- Keys and values are separated by a colon
- Each key/value pair is separated by a comma
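To see these features in action from R, here is a minimal sketch that parses a JSON string (modeled on the example above) with jsonlite::fromJSON():
library(jsonlite)

## a JSON object as a string: quoted string keys, valid JSON values
json_str <- '{
  "job-title": "Associate Professor",
  "hometown": "Baltimore, MD",
  "states-lived": {"state1": "Louisiana", "state2": "Texas"}
}'

stephanie <- fromJSON(json_str)
stephanie$hometown   # "Baltimore, MD"
str(stephanie)       # nested JSON objects become nested R lists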
Overview of APIs
From AWS, API stands for Application Programming Interface.
- “Application” = any software with a distinct function
- “Interface” = a contract of service between two applications. This contract defines how the two communicate with each other using requests and responses.
The API documentation contains information on how developers are to structure those requests and responses.
The purpose of APIs is to enable two software components to communicate with each other using a set of definitions and protocols.
For example, the weather bureau’s software system contains daily weather data. The weather app on your phone “talks” to this system via APIs and shows you daily weather updates on your phone.
How do APIs work?
To understand how APIs work, two terms that are important are
- client. This is the application sending the request.
- server. This is the application sending the response.
So in the weather example, the bureau’s weather database is the server, and the mobile app is the client.
Four types of API architectures
There are four different ways that APIs can work depending on when and why they were created.
SOAP APIs. These APIs use Simple Object Access Protocol. Client and server exchange messages using XML. This is a less flexible API that was more popular in the past.
RPC APIs. These APIs are called Remote Procedure Calls. The client completes a function (or procedure) on the server, and the server sends the output back to the client.
Websocket APIs. The Websocket API is another modern web API development that uses JSON objects to pass data. A WebSocket API supports two-way communication between client apps and the server. The server can send callback messages to connected clients, making it more efficient than a REST API.
REST APIs. REST stands for Representational State Transfer (these are the most popular and flexible APIs). The client sends requests to the server as data. The server uses this client input to start internal functions and returns output data back to the client. REST defines a set of functions like GET, PUT, and DELETE that clients can use to access server data. Clients and servers exchange data using HTTP.
The main feature of a REST API is statelessness (i.e. servers do not save client data between requests). Client requests to the server are similar to URLs you type in your browser to visit a website. The response from the server is plain data, without the typical graphical rendering of a web page.
How to use an API?
The basic steps to using an API are:
- Obtain an API key. This is done by creating a verified account with the API provider.
- Set up an HTTP API client. This tool allows you to structure API requests easily using the API key received. Here, we will use the GET() function from the httr package.
- If you don’t have an API client, you can try to structure the request yourself in your browser by referring to the API documentation.
- Once you are comfortable with the new API syntax, you can start using it in your code.
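Putting these steps together, the general pattern in R looks something like this. This is only a sketch: MY_API_KEY and the endpoint URL are hypothetical placeholders, and the real examples below fill them in.
library(httr)
library(jsonlite)

## 1. fetch the API key you stored (assumes it lives in .Renviron)
api_key <- Sys.getenv("MY_API_KEY")                  # hypothetical key name

## 2. structure and send the request with an HTTP API client
response <- GET("https://api.example.com/endpoint",  # hypothetical endpoint
                query = list(api_key = api_key))

## 3. check the status, then parse the JSON body into R objects
status_code(response)
parsed <- fromJSON(content(response, as = "text"))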
Where can I find new APIs?
New web APIs can be found on API marketplaces and API directories, such as:
- Rapid API – One of the largest global API markets (10k+ public APIs). Users can test APIs directly on the platform before committing to a purchase.
- Public REST APIs – Groups REST APIs into categories, making it easier to browse and find the right one to meet your needs.
- APIForThat and APIList – Both these websites have lists of 500+ web APIs, along with in-depth information on how to use them.
GitHub API
The GitHub REST API may be of interest when studying online communities, working methods, organizational structures, communication and discussions, etc. with a focus on (open-source) software development.
Many projects that are hosted on GitHub are open-source projects with a transparent development process and communications. For private projects, which can also be hosted on GitHub, understandably only limited aggregate data are available.
Let’s say we want to use the GitHub REST API to find out how many of my GitHub repositories have open issues.
The API can be used for free and you can send up to 60 requests per hour if you are not authenticated (i.e. if you don’t provide an API key).
For serious data collection, this is not much, so it is recommended to sign up on GitHub and generate a personal access token that acts as an API key.
This token can then be used to authenticate your API requests. Your quota is then 5000 requests per hour.
Access the API from R
There are packages for many programming languages that provide convenient access to the GitHub API, but there are no such packages (that I’m aware of) for accessing the API from R.
This means we can only access the API directly, e.g. by using the jsonlite package to fetch the data and convert it to an R list or data.frame.
Specifically, we will use the jsonlite::fromJSON() function to convert from a JSON object to a data frame.
The JSON file is located at https://api.github.com/users/stephaniehicks/repos
= "https://api.github.com/users/stephaniehicks/repos"
github_url
library(jsonlite)
library(tidyverse)
<- as_tibble(fromJSON(github_url))
jsonData glimpse(jsonData)
Rows: 30
Columns: 79
$ id <int> 160194123, 132884754, 647539937, 225501707…
$ node_id <chr> "MDEwOlJlcG9zaXRvcnkxNjAxOTQxMjM=", "MDEwO…
$ name <chr> "2018-bioinfosummer-scrnaseq", "advdatasci…
$ full_name <chr> "stephaniehicks/2018-bioinfosummer-scrnase…
$ private <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ owner <df[,18]> <data.frame[26 x 18]>
$ html_url <chr> "https://github.com/stephaniehicks/201…
$ description <chr> NA, NA, "Repo to share code for the atlas-…
$ fork <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FAL…
$ url <chr> "https://api.github.com/repos/stephaniehic…
$ forks_url <chr> "https://api.github.com/repos/stephaniehic…
$ keys_url <chr> "https://api.github.com/repos/stephaniehic…
$ collaborators_url <chr> "https://api.github.com/repos/stephaniehic…
$ teams_url <chr> "https://api.github.com/repos/stephaniehic…
$ hooks_url <chr> "https://api.github.com/repos/stephaniehic…
$ issue_events_url <chr> "https://api.github.com/repos/stephaniehic…
$ events_url <chr> "https://api.github.com/repos/stephaniehic…
$ assignees_url <chr> "https://api.github.com/repos/stephaniehic…
$ branches_url <chr> "https://api.github.com/repos/stephaniehic…
$ tags_url <chr> "https://api.github.com/repos/stephaniehic…
$ blobs_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_tags_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_refs_url <chr> "https://api.github.com/repos/stephaniehic…
$ trees_url <chr> "https://api.github.com/repos/stephaniehic…
$ statuses_url <chr> "https://api.github.com/repos/stephaniehic…
$ languages_url <chr> "https://api.github.com/repos/stephaniehic…
$ stargazers_url <chr> "https://api.github.com/repos/stephaniehic…
$ contributors_url <chr> "https://api.github.com/repos/stephaniehic…
$ subscribers_url <chr> "https://api.github.com/repos/stephaniehic…
$ subscription_url <chr> "https://api.github.com/repos/stephaniehic…
$ commits_url <chr> "https://api.github.com/repos/stephaniehic…
$ git_commits_url <chr> "https://api.github.com/repos/stephaniehic…
$ comments_url <chr> "https://api.github.com/repos/stephaniehic…
$ issue_comment_url <chr> "https://api.github.com/repos/stephaniehic…
$ contents_url <chr> "https://api.github.com/repos/stephaniehic…
$ compare_url <chr> "https://api.github.com/repos/stephaniehic…
$ merges_url <chr> "https://api.github.com/repos/stephaniehic…
$ archive_url <chr> "https://api.github.com/repos/stephaniehic…
$ downloads_url <chr> "https://api.github.com/repos/stephaniehic…
$ issues_url <chr> "https://api.github.com/repos/stephaniehic…
$ pulls_url <chr> "https://api.github.com/repos/stephaniehic…
$ milestones_url <chr> "https://api.github.com/repos/stephaniehic…
$ notifications_url <chr> "https://api.github.com/repos/stephaniehic…
$ labels_url <chr> "https://api.github.com/repos/stephaniehic…
$ releases_url <chr> "https://api.github.com/repos/stephaniehic…
$ deployments_url <chr> "https://api.github.com/repos/stephaniehic…
$ created_at <chr> "2018-12-03T13:20:45Z", "2018-05-10T10:22:…
$ updated_at <chr> "2019-08-08T02:18:17Z", "2018-05-10T10:22:…
$ pushed_at <chr> "2018-12-05T17:07:09Z", "2017-12-18T17:18:…
$ git_url <chr> "git://github.com/stephaniehicks/2018-bioi…
$ ssh_url <chr> "git@github.com:stephaniehicks/2018-bioinf…
$ clone_url <chr> "https://github.com/stephaniehicks/2018-bi…
$ svn_url <chr> "https://github.com/stephaniehicks/2018-bi…
$ homepage <chr> NA, NA, NA, NA, NA, "", NA, NA, NA, NA, NA…
$ size <int> 60296, 172353, 8858, 121, 675, 26688, 20, …
$ stargazers_count <int> 4, 0, 1, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0,…
$ watchers_count <int> 4, 0, 1, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0,…
$ language <chr> "TeX", "HTML", "R", NA, NA, "R", "R", "Jup…
$ has_issues <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRU…
$ has_projects <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_downloads <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_wiki <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ has_pages <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ has_discussions <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ forks_count <int> 4, 0, 0, 0, 1, 1, 0, 2, 0, 0, 0, 4, 1, 1, …
$ mirror_url <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ archived <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ disabled <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ open_issues_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ license <df[,5]> <data.frame[26 x 5]>
$ allow_forking <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ is_template <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ web_commit_signoff_required <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ topics <list> <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>…
$ visibility <chr> "public", "public", "public", "public", "p…
$ forks <int> 4, 0, 0, 0, 1, 1, 0, 2, 0, 0, 0, 4, 1, 1,…
$ open_issues <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ watchers <int> 4, 0, 1, 1, 0, 0, 1, 8, 0, 1, 0, 14, 3, 0,…
$ default_branch <chr> "master", "master", "main", "master", "mas…
The function fromJSON() has now converted the JSON file into a data frame.
However, from here we see that there are only 30 rows (or 30 repositories). If you look on my GitHub page, you can see there are more than 30 repositories.
What’s happening is called pagination.
At a high-level, the API is limiting the amount of items a user gets and splitting it into pages.
Formally, pagination is the process of splitting the contents or a section of a website into discrete pages. Users tend to get lost when there is a bunch of data, and with pagination they can concentrate on a particular amount of content at a time. A hierarchical, paginated structure also improves the readability of the content.
In this use case, the GitHub API splits the result into 30 items per response by default.
Solution: you should explicitly specify in your request how many items you would like to receive from the server’s pagination engine, using the GitHub pagination API query string:
?page=1&per_page=<numberOfItemsYouSpecify>
You can read more about pagination in the GitHub API documentation.
Here we can visit the paginated URL directly and see there are more than 30 repos. Let’s read it into R.
= "https://api.github.com/users/stephaniehicks/repos?page=1&per_page=1000"
github_url
<- as_tibble(fromJSON(github_url))
jsonDataAll dim(jsonDataAll)
[1] 90 79
We now get all the public repositories! yay!
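Note that the GitHub API caps per_page at 100 items, so our request above worked only because all 90 repositories fit in a single capped page. For a user with more repositories, you would walk through the pages until an empty one comes back. A minimal sketch of that loop (remember that unauthenticated requests are limited to 60 per hour):
library(jsonlite)
library(tidyverse)

all_pages <- list()
page <- 1
repeat {
  url <- paste0("https://api.github.com/users/stephaniehicks/repos",
                "?page=", page, "&per_page=100")
  res <- fromJSON(url, flatten = TRUE)   # flatten nested objects into columns
  if (length(res) == 0) break            # an empty JSON array means no more pages
  all_pages[[page]] <- as_tibble(res)
  page <- page + 1
}
repos <- bind_rows(all_pages)
dim(repos)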
Using API keys
Authenticating with the GitHub API via an API key allows you to send many more requests to the API.
API access keys for the GitHub API are called personal access tokens (PAT) and the documentation explains how to generate a PAT once you have logged into your GitHub account.
First, please be careful with your PATs and never publish them.
If you want guidance on where you should store them, I like this post:
Personally, I keep mine in my .Renviron file, which looks something like this on the inside:
GITHUB_API_KEY = <add my GitHub API key here>
CENSUS_API_KEY = <add my tidycensus API key here>
OPENFDA_API_KEY = <add my openFDA API key here>
If you do not have an .Renviron file in your home directory, you can make one:
cd ~
touch .Renviron
Assuming you have created and stored an API key in the .Renviron file in your home directory, you can fetch it with the Sys.getenv() function.
github_key <- Sys.getenv("GITHUB_API_KEY")
We will use this in a little bit.
Access API with httr and GET
There is a set of basic HTTP verbs that allow you to access a set of endpoints.
The basic request patterns are:
- Retrieve a single item (GET)
- Retrieve a list of items (GET)
- Create an item (POST)
- Update an item (PUT)
- Delete an item (DELETE)
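Each of these verbs has a matching function in httr. Here is a minimal sketch using httpbin.org, a public request-echo service, purely as a stand-in server:
library(httr)

## each basic request pattern maps onto an httr verb function
r_get    <- GET("https://httpbin.org/get")                       # retrieve
r_post   <- POST("https://httpbin.org/post", body = list(x = 1)) # create
r_put    <- PUT("https://httpbin.org/put", body = list(x = 2))   # update
r_delete <- DELETE("https://httpbin.org/delete")                 # delete

status_code(r_get)   # 200 if the request succeeded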
Here, we will use the GET() function from the httr package (i.e. tools to work with URLs and HTTP) to retrieve a single JSON file.
We will also make this an authenticated HTTP request to the GitHub API using authenticate() from the httr package.
Let’s start by using the GitHub API to learn information about myself (Stephanie Hicks):
<- Sys.getenv("GITHUB_API_KEY")
github_key <- GET('https://api.github.com/user',
response authenticate(user = 'stephaniehicks',
password = github_key))
response
Response [https://api.github.com/user]
Date: 2023-11-24 10:21
Status: 200
Content-Type: application/json; charset=utf-8
Size: 1.76 kB
{
"login": "stephaniehicks",
"id": 1452065,
"node_id": "MDQ6VXNlcjE0NTIwNjU=",
"avatar_url": "https://avatars.githubusercontent.com/u/1452065?v=4",
"gravatar_id": "",
"url": "https://api.github.com/users/stephaniehicks",
"html_url": "https://github.com/stephaniehicks",
"followers_url": "https://api.github.com/users/stephaniehicks/followers",
"following_url": "https://api.github.com/users/stephaniehicks/following{/ot...
...
We see the response we got is a JSON file.
Next we extract / retrieve the contents from the raw JSON output using the content() function from the httr package. If you use the argument as = 'text', it extracts the contents as a character vector.
account_details <- fromJSON(httr::content(response, as = 'text'))
account_details[1:30]
$login
[1] "stephaniehicks"
$id
[1] 1452065
$node_id
[1] "MDQ6VXNlcjE0NTIwNjU="
$avatar_url
[1] "https://avatars.githubusercontent.com/u/1452065?v=4"
$gravatar_id
[1] ""
$url
[1] "https://api.github.com/users/stephaniehicks"
$html_url
[1] "https://github.com/stephaniehicks"
$followers_url
[1] "https://api.github.com/users/stephaniehicks/followers"
$following_url
[1] "https://api.github.com/users/stephaniehicks/following{/other_user}"
$gists_url
[1] "https://api.github.com/users/stephaniehicks/gists{/gist_id}"
$starred_url
[1] "https://api.github.com/users/stephaniehicks/starred{/owner}{/repo}"
$subscriptions_url
[1] "https://api.github.com/users/stephaniehicks/subscriptions"
$organizations_url
[1] "https://api.github.com/users/stephaniehicks/orgs"
$repos_url
[1] "https://api.github.com/users/stephaniehicks/repos"
$events_url
[1] "https://api.github.com/users/stephaniehicks/events{/privacy}"
$received_events_url
[1] "https://api.github.com/users/stephaniehicks/received_events"
$type
[1] "User"
$site_admin
[1] FALSE
$name
[1] "Stephanie Hicks"
$company
[1] "Johns Hopkins"
$blog
[1] "http://www.stephaniehicks.com"
$location
[1] "Baltimore, MD"
$email
NULL
$hireable
NULL
$bio
[1] "Associate Prof at Johns Hopkins Biostatistics"
$twitter_username
[1] "stephaniehicks"
$public_repos
[1] 90
$public_gists
[1] 8
$followers
[1] 275
$following
[1] 18
Next, let’s perform the same request we did above about my repositories, but instead of reading in the JSON file from the web, we use an authenticated GET() response:
response <- GET('https://api.github.com/users/stephaniehicks/repos?page=1&per_page=1000',
                authenticate('stephaniehicks', github_key))
repo_details <- as_tibble(fromJSON(httr::content(response, as = 'text')))
repo_details
# A tibble: 90 × 80
id node_id name full_name private owner$login html_url description fork
<int> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <lgl>
1 1.60e8 MDEwOl… 2018… stephani… FALSE stephanieh… https:/… <NA> FALSE
2 1.33e8 MDEwOl… advd… stephani… FALSE stephanieh… https:/… <NA> TRUE
3 6.48e8 R_kgDO… atla… stephani… FALSE stephanieh… https:/… Repo to sh… FALSE
4 2.26e8 MDEwOl… Awes… stephani… FALSE stephanieh… https:/… A curated … TRUE
5 6.38e7 MDEwOl… awes… stephani… FALSE stephanieh… https:/… List of so… TRUE
6 1.66e7 MDEwOl… Back… stephani… FALSE stephanieh… https:/… Gene expre… FALSE
7 2.88e8 MDEwOl… benc… stephani… FALSE stephanieh… https:/… <NA> FALSE
8 1.69e8 MDEwOl… benc… stephani… FALSE stephanieh… https:/… Benchmarki… FALSE
9 1.40e8 MDEwOl… benc… stephani… FALSE stephanieh… https:/… Repository… FALSE
10 1.78e8 MDEwOl… benc… stephani… FALSE stephanieh… https:/… Data and B… FALSE
# ℹ 80 more rows
# ℹ 88 more variables: owner$id <int>, $node_id <chr>, $avatar_url <chr>,
# $gravatar_id <chr>, $url <chr>, $html_url <chr>, $followers_url <chr>,
# $following_url <chr>, $gists_url <chr>, $starred_url <chr>,
# $subscriptions_url <chr>, $organizations_url <chr>, $repos_url <chr>,
# $events_url <chr>, $received_events_url <chr>, $type <chr>,
# $site_admin <lgl>, url <chr>, forks_url <chr>, keys_url <chr>, …
A bit of EDA fun
Let’s have a bit of fun and explore some questions:
- How many repos have been forked, and how many forks does each have?
table(repo_details$forks)
0 1 2 3 4 5 6 7 8 9 11 22
61 10 4 2 6 1 1 1 1 1 1 1
What’s the most popular language?
table(repo_details$language)
CSS HTML JavaScript Jupyter Notebook
1 20 7 1
Makefile Perl R Ruby
1 1 29 3
Shell TeX
2 5
To find out how many repos I have with open issues, we can just create a table:
# how many repos have open issues?
table(repo_details$open_issues_count)
0 1 2
83 6 1
Whew! Not as many as I thought.
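To connect back to our original question (how many of my repositories have open issues?), we can also count this directly; a one-line sketch:
# count repositories with at least one open issue
sum(repo_details$open_issues_count > 0)
This returns 7 here (the 6 + 1 non-zero entries in the table above).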
GET with query parameters
You can use the query argument to specify details about the response.
Let’s look at how many open issues there are in the dplyr package in the tidyverse:
<- GET("https://api.github.com/repos/tidyverse/dplyr/issues",
req query = list(state = "open", per_page = 100, page = 1))
<- as_tibble(fromJSON(httr::content(req, as = 'text')))
dplyr_details dplyr_details
# A tibble: 50 × 30
url repository_url labels_url comments_url events_url html_url id
<chr> <chr> <chr> <chr> <chr> <chr> <int>
1 https://ap… https://api.g… https://a… https://api… https://a… https:/… 2.01e9
2 https://ap… https://api.g… https://a… https://api… https://a… https:/… 2.00e9
3 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.99e9
4 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.98e9
5 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.98e9
6 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.96e9
7 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.95e9
8 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.94e9
9 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.92e9
10 https://ap… https://api.g… https://a… https://api… https://a… https:/… 1.87e9
# ℹ 40 more rows
# ℹ 23 more variables: node_id <chr>, number <int>, title <chr>,
# user <df[,18]>, labels <list>, state <chr>, locked <lgl>,
# assignee <df[,18]>, assignees <list>, milestone <lgl>, comments <int>,
# created_at <chr>, updated_at <chr>, closed_at <lgl>,
# author_association <chr>, active_lock_reason <chr>, body <chr>,
# reactions <df[,10]>, timeline_url <chr>, performed_via_github_app <lgl>, …
Other examples with GitHub API
Finally, I will leave you with a few other examples of using the GitHub API:
openFDA API
Next, we will demonstrate how to request data from the openFDA API, which returns JSON files.
This API exists to create easy access to public data, to create a new level of openness and accountability, to ensure the privacy and security of public FDA data, and ultimately to educate the public and save lives. See the data definitions for all included data.
Register for an API Key
First, you need to register for an API key here
You should also store the API key in your .Renviron file, as above for the GitHub API key.
Building the URL for GET
First, we will request a summarized set of counts around food recalls either voluntary by a firm or mandated by the FDA.
The URL we want is the following:
https://api.fda.gov/food/enforcement.json?api_key=<your_API_key_here>&count=voluntary_mandated.exact
Let’s build up the URL.
- The first part is the base URL: https://api.fda.gov/food/enforcement.json. This part of the URL will be the same for all our calls to the food enforcement API (but is different if you want to investigate e.g. patient responses to drugs).
- Next, ?api_key=<your_API_key_here> is how I use my authorization token, which tells the openFDA servers that I am allowed to ask for this data.
- Finally, we want to return a set of summarized counts for a specific field (&count=voluntary_mandated.exact).
Now that we have dissected the anatomy of an API request, you can see how easy it is to build one!
Basically anybody with an internet connection, an authorization token, and knowledge of the grammar of the API can access it. Most APIs are published with extensive documentation to help you understand the available options and parameters.
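As an aside, instead of pasting the URL together by hand (as we do next), you can let httr assemble and encode the query string for you. A sketch of the equivalent request, assuming your key is stored in .Renviron as described above:
library(httr)

## equivalent request: httr builds ?api_key=...&count=... for us
res <- GET("https://api.fda.gov/food/enforcement.json",
           query = list(api_key = Sys.getenv("OPENFDA_API_KEY"),
                        count = "voluntary_mandated.exact"))
status_code(res)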
Calling an API with GET
Let’s join the URL together:
## extract my API key from `.Renviron`
openFDA_key <- Sys.getenv("OPENFDA_API_KEY")

## build the URL
base  <- 'https://api.fda.gov/food/enforcement.json?api_key='
query <- '&count=voluntary_mandated.exact'

## put it all together
API_URL <- paste0(base, openFDA_key, query)
Now we have the entire URL stored in a simple R object called API_URL.
We can now use the URL to call the API, and we will store the returned data in an object called raw_data:
raw_data <- GET(API_URL)
raw_data
Response [https://api.fda.gov/food/enforcement.json?api_key=<your_API_key_here>&count=voluntary_mandated.exact]
Date: 2023-11-24 10:21
Status: 200
Content-Type: application/json; charset=utf-8
Size: 711 B
{
"meta": {
"disclaimer": "Do not rely on openFDA to make decisions regarding medical...
"terms": "https://open.fda.gov/terms/",
"license": "https://open.fda.gov/license/",
"last_updated": "2023-11-22"
},
"results": [
{
"term": "Voluntary: Firm initiated",
...
We can see the status element of the list. Traditionally, a status of “200” means that the API call was successful, and other codes are used to indicate errors. You can troubleshoot those error codes using the API documentation.
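httr also provides helpers for checking the status programmatically; a small sketch:
status_code(raw_data)           # the numeric status code, e.g. 200
http_status(raw_data)$message   # a human-readable summary of the status
stop_for_status(raw_data)       # throws an R error for 4xx/5xx responses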
Next, we can inspect the object and we see that it is a list.
str(raw_data)
List of 10
$ url        : chr "https://api.fda.gov/food/enforcement.json?api_key=<your_API_key_here>&count=voluntary_mandated.exact"
$ status_code: int 200
$ headers :List of 22
..$ date : chr "Fri, 24 Nov 2023 10:21:51 GMT"
..$ content-type : chr "application/json; charset=utf-8"
..$ vary : chr "Accept-Encoding"
..$ access-control-allow-credentials: chr "true"
..$ access-control-allow-origin : chr "*"
..$ age : chr "0"
..$ cache-control : chr "no-cache, no-store, must-revalidate"
..$ content-security-policy : chr "default-src 'none'"
..$ etag : chr "W/\"2c7-D4yYChONTPyF2qQYL1Qita83Uu4\""
..$ strict-transport-security : chr "max-age=31536000;"
..$ strict-transport-security : chr "max-age=31536000; preload"
..$ vary : chr "Accept-Encoding"
..$ via : chr "https/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])"
..$ x-api-umbrella-request-id : chr "cc03cq16o9mdgfjlqf30"
..$ x-cache : chr "MISS"
..$ x-content-type-options : chr "nosniff"
..$ x-ratelimit-limit : chr "240"
..$ x-ratelimit-remaining : chr "239"
..$ x-vcap-request-id : chr "d6a63001-2511-441a-4dab-58c1ab363d6d"
..$ x-xss-protection : chr "1; mode=block"
..$ x-frame-options : chr "deny"
..$ content-encoding : chr "gzip"
..- attr(*, "class")= chr [1:2] "insensitive" "list"
$ all_headers:List of 1
..$ :List of 3
.. ..$ status : int 200
.. ..$ version: chr "HTTP/2"
.. ..$ headers:List of 22
.. .. ..$ date : chr "Fri, 24 Nov 2023 10:21:51 GMT"
.. .. ..$ content-type : chr "application/json; charset=utf-8"
.. .. ..$ vary : chr "Accept-Encoding"
.. .. ..$ access-control-allow-credentials: chr "true"
.. .. ..$ access-control-allow-origin : chr "*"
.. .. ..$ age : chr "0"
.. .. ..$ cache-control : chr "no-cache, no-store, must-revalidate"
.. .. ..$ content-security-policy : chr "default-src 'none'"
.. .. ..$ etag : chr "W/\"2c7-D4yYChONTPyF2qQYL1Qita83Uu4\""
.. .. ..$ strict-transport-security : chr "max-age=31536000;"
.. .. ..$ strict-transport-security : chr "max-age=31536000; preload"
.. .. ..$ vary : chr "Accept-Encoding"
.. .. ..$ via : chr "https/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])"
.. .. ..$ x-api-umbrella-request-id : chr "cc03cq16o9mdgfjlqf30"
.. .. ..$ x-cache : chr "MISS"
.. .. ..$ x-content-type-options : chr "nosniff"
.. .. ..$ x-ratelimit-limit : chr "240"
.. .. ..$ x-ratelimit-remaining : chr "239"
.. .. ..$ x-vcap-request-id : chr "d6a63001-2511-441a-4dab-58c1ab363d6d"
.. .. ..$ x-xss-protection : chr "1; mode=block"
.. .. ..$ x-frame-options : chr "deny"
.. .. ..$ content-encoding : chr "gzip"
.. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
$ cookies :'data.frame': 0 obs. of 7 variables:
..$ domain : logi(0)
..$ flag : logi(0)
..$ path : logi(0)
..$ secure : logi(0)
..$ expiration: 'POSIXct' num(0)
..$ name : logi(0)
..$ value : logi(0)
$ content : raw [1:711] 7b 0a 20 20 ...
$ date : POSIXct[1:1], format: "2023-11-24 10:21:51"
$ times : Named num [1:6] 0 0.062 0.137 0.218 0.386 ...
..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
$ request :List of 7
..$ method : chr "GET"
..$ url     : chr "https://api.fda.gov/food/enforcement.json?api_key=<your_API_key_here>&count=voluntary_mandated.exact"
..$ headers : Named chr "application/json, text/xml, application/xml, */*"
.. ..- attr(*, "names")= chr "Accept"
..$ fields : NULL
..$ options :List of 2
.. ..$ useragent: chr "libcurl/8.1.2 r-curl/5.1.0 httr/1.4.7"
.. ..$ httpget : logi TRUE
..$ auth_token: NULL
..$ output : list()
.. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
..- attr(*, "class")= chr "request"
$ handle :Class 'curl_handle' <externalptr>
- attr(*, "class")= chr "response"
One of the elements is content, and we can inspect that:
str(raw_data$content)
raw [1:711] 7b 0a 20 20 ...
We see the actual data have been stored as raw vectors (or raw bytes), which need to be converted to character vectors. This is not in a usable format yet.
Converting JSON to a data.frame
There is a function in base R, rawToChar(), that converts raw bytes to characters:
openFDA_data <- fromJSON(rawToChar(raw_data$content), flatten = TRUE)
This converts the raw data into a list.
We can also do this with httr::content() (as above) and just define the encoding for the character set.
openFDA_data <- fromJSON(httr::content(raw_data,
                                       as = 'text',
                                       encoding = "UTF-8"))
str(openFDA_data)
List of 2
$ meta :List of 4
..$ disclaimer : chr "Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data"| __truncated__
..$ terms : chr "https://open.fda.gov/terms/"
..$ license : chr "https://open.fda.gov/license/"
..$ last_updated: chr "2023-11-22"
$ results:'data.frame': 4 obs. of 2 variables:
..$ term : chr [1:4] "Voluntary: Firm initiated" "Voluntary: Firm Initiated" "FDA Mandated" ""
..$ count: int [1:4] 24217 563 397 6
Now that it is in a list format, you can see that it contains the query metadata and a results data frame!
You can use this data right away if you are already familiar with lists in R, or you can extract the results data frame into a separate object, like this:
ts_df <- openFDA_data$results
ts_df
term count
1 Voluntary: Firm initiated 24217
2 Voluntary: Firm Initiated 563
3 FDA Mandated 397
4 6
We could wrangle and visualize the data from here.
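For example, here is a quick bar chart of the counts (a sketch; ggplot2 is loaded with the tidyverse):
## bar chart of food enforcement reports by recall type
ggplot(ts_df, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "recall type", y = "number of recalls")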
Post-lecture materials
Other good R packages to know about
- googlesheets4 to interact with Google Sheets in R
- googledrive to interact with files on your Google Drive
Final Questions
Here are some post-lecture questions to help you think about the material discussed.
- Using the GitHub API, access your repository information and count how many of your repositories have open GitHub issues.
- Pick another API that we have not discussed here and use httr to retrieve data from it.