Introduction to tidy data and how to convert between wide and long data with the tidyr R package.
“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
Before class, you can prepare by reading the following materials:
Material for this lecture was borrowed and adopted from
At the end of this lesson you will:
As we learned in the last lesson, one unifying concept of the tidyverse is the notion of tidy data. As defined by Hadley Wickham in his 2014 paper published in the Journal of Statistical Software, a tidy dataset has the following properties:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
[Source: Artwork by Allison Horst]
The purpose of defining tidy data is to highlight the fact that most data do not start out life as tidy. In fact, much of the work of data analysis may involve simply making the data tidy (at least this has been our experience). Once a dataset is tidy, it can be used as input into a variety of other functions that may transform, model, or visualize the data.
As a quick example, consider the following data illustrating religion and income survey data with the number of respondees with income range in column name. This is in a classic table format:
# A tibble: 18 × 11
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76 137
2 Atheist 12 27 37 52 35 70
3 Buddhist 27 21 30 34 33 58
4 Catholic 418 617 732 670 638 1116
5 Don’t kn… 15 14 15 11 10 35
6 Evangeli… 575 869 1064 982 881 1486
7 Hindu 1 9 7 9 11 34
8 Historic… 228 244 236 238 197 223
9 Jehovah'… 20 27 24 24 21 30
10 Jewish 19 19 25 25 30 95
11 Mainline… 289 495 619 655 651 1107
12 Mormon 29 40 48 51 56 112
13 Muslim 6 7 9 10 9 23
14 Orthodox 13 17 23 32 32 47
15 Other Ch… 9 7 11 13 13 14
16 Other Fa… 20 33 40 46 49 63
17 Other Wo… 5 2 3 4 2 7
18 Unaffili… 217 299 374 365 341 528
# … with 4 more variables: $75-100k <dbl>, $100-150k <dbl>,
# >150k <dbl>, Don't know/refused <dbl>
While this format is canonical and is useful for quickly observing the relationship between multiple variables, it is not tidy. This format violates the tidy form because there are variables in the columns. In this case the variables are religion, income bracket, and the number of respondents, which is the third variable, is presented inside the table.
Converting this data to tidy format would give us
library(tidyverse)
relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
mutate(religion = factor(religion), income = factor(income))
# A tibble: 180 × 3
religion income respondents
<fct> <fct> <dbl>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
3 Agnostic $20-30k 60
4 Agnostic $30-40k 81
5 Agnostic $40-50k 76
6 Agnostic $50-75k 137
7 Agnostic $75-100k 122
8 Agnostic $100-150k 109
9 Agnostic >150k 84
10 Agnostic Don't know/refused 96
# … with 170 more rows
Some of these functions you have seen before, others might be new to you. Let’s talk about each one in the context of the Tidyverse package.
There are a number of R packages that take advantage of the tidy data form and can be used to do interesting things with data. Many (but not all) of these packages are written by Hadley Wickham and the collection of packages is sometimes referred to as the “tidyverse” because of their dependence on and presumption of tidy data. “Tidyverse” packages include:
ggplot2: a plotting system based on the grammar of graphics
magrittr: defines the %>%
operator for chaining functions together in a series of operations on data
dplyr: a suite of (fast) functions for working with data frames
tidyr: easily tidy data with pivot_wider()
and pivot_longer()
functions (also separate()
and unite()
)
We will be using these packages quite a bit this week.
The “tidyverse” package can be used to install all of the packages in the tidyverse at once. For example, instead of starting an R script with this:
You can start with this:
In the code above, let’s talk about what we did using the pivot_longer()
function. We will also talk about pivot_wider()
.
pivot_longer()
and pivot_wider()
The tidyr
package includes functions to transfer a data frame between long and wide.
In the section above, we showed an example that used pivot_longer()
to convert data into a tidy format.
The key problem with the tidyness of the data is that the income variables are not in their own columns, but rather are embedded in the structure of the columns.
To fix this, you can use the pivot_longer()
function to gather values spread across several columns into a single column, with the column names gathered into an income
column. When gathering, exclude any columns that you do not want “gathered” (religion
in this case) by including the column names with a the minus sign in the pivot_longer()
function. For example:
# Gather everything EXCEPT religion to tidy data
relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "respondents")
# A tibble: 180 × 3
religion income respondents
<chr> <chr> <dbl>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
3 Agnostic $20-30k 60
4 Agnostic $30-40k 81
5 Agnostic $40-50k 76
6 Agnostic $50-75k 137
7 Agnostic $75-100k 122
8 Agnostic $100-150k 109
9 Agnostic >150k 84
10 Agnostic Don't know/refused 96
# … with 170 more rows
Even if your data is in a tidy format, pivot_longer()
is occasionally useful for pulling data together to take advantage of faceting, or plotting separate plots based on a grouping variable. We will talk more about that in a future lecture.
The pivot_wider()
function is less commonly needed to tidy data. It can, however, be useful for creating summary tables. For example, you use the summarize()
function in dplyr
to summarize the total number of respondents per income category.
relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "respondents") %>%
mutate(religion = factor(religion), income = factor(income)) %>%
group_by(income) %>%
summarize(total_respondents = sum(respondents)) %>%
pivot_wider(names_from = "income",
values_from = "total_respondents") %>%
knitr::kable()
<$10k | >150k | $10-20k | $100-150k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | Don’t know/refused |
---|---|---|---|---|---|---|---|---|---|
1930 | 2608 | 2781 | 3197 | 3357 | 3302 | 3085 | 5185 | 3990 | 6121 |
Notice in this example how pivot_wider()
has been used at the very end of the code sequence to convert the summarized data into a shape that offers a better tabular presentation for a report. In the pivot_wider()
call, you first specify the name of the column to use for the new column names (income
in this example) and then specify the column to use for the cell values (total_respondents
here).
Example: Let’s try another dataset. This data contain an excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 1,694 more rows
If we wanted to make lifeExp
, pop
and gdpPercap
(all measurements that we observe) go from a wide table into a long table, what would we do?
# try it yourself
One more! Try using pivot_longer()
to convert the the following data that contains made-up revenues for three companies by quarter for years 2006 to 2009.
Afterward, use group_by()
and summarize()
to calculate the average revenue for each company across all years and all quarters.
Bonus: Calculate a mean revenue for each company AND each year (averaged across all 4 quarters).
df <- tibble(
"company" = rep(1:3, each=4),
"year" = rep(2006:2009, 3),
"Q1" = sample(x = 0:100, size = 12),
"Q2" = sample(x = 0:100, size = 12),
"Q3" = sample(x = 0:100, size = 12),
"Q4" = sample(x = 0:100, size = 12),
)
df
# A tibble: 12 × 6
company year Q1 Q2 Q3 Q4
<int> <int> <int> <int> <int> <int>
1 1 2006 100 90 24 86
2 1 2007 23 29 30 67
3 1 2008 42 39 53 77
4 1 2009 98 98 60 87
5 2 2006 8 97 17 73
6 2 2007 15 83 18 8
7 2 2008 7 12 38 72
8 2 2009 22 49 99 82
9 3 2006 65 28 39 22
10 3 2007 20 33 14 56
11 3 2008 56 78 72 44
12 3 2009 36 67 91 42
# try it yourself
separate()
and unite()
The same tidyr
package also contains two useful functions:
unite()
: combine contents of two or more columns into a single columnseparate()
: separate contents of a column into two or more columnsFirst, we combine the first three columns into one new column using unite()
.
gapminder %>%
unite(col="country_continent_year", country:year, sep="_")
# A tibble: 1,704 × 4
country_continent_year lifeExp pop gdpPercap
<chr> <dbl> <int> <dbl>
1 Afghanistan_Asia_1952 28.8 8425333 779.
2 Afghanistan_Asia_1957 30.3 9240934 821.
3 Afghanistan_Asia_1962 32.0 10267083 853.
4 Afghanistan_Asia_1967 34.0 11537966 836.
5 Afghanistan_Asia_1972 36.1 13079460 740.
6 Afghanistan_Asia_1977 38.4 14880372 786.
7 Afghanistan_Asia_1982 39.9 12881816 978.
8 Afghanistan_Asia_1987 40.8 13867957 852.
9 Afghanistan_Asia_1992 41.7 16317921 649.
10 Afghanistan_Asia_1997 41.8 22227415 635.
# … with 1,694 more rows
Next, we show how to separate the columns into three separate columns using separate()
using the col
, into
and sep
arguments.
gapminder %>%
unite(col="country_continent_year", country:year, sep="_") %>%
separate(col="country_continent_year", into=c("country", "continent", "year"), sep="_")
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <chr> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# … with 1,694 more rows
Here are some post-lecture questions to help you think about the material discussed.
Questions:
Using prose, describe how the variables and observations are organised in a tidy dataset versus an non-tidy dataset.
What do the extra and fill arguments do in separate()
? Experiment with the various options for the following two toy datasets.
Both unite()
and separate()
have a remove argument. What does it do? Why would you set it to FALSE?
Compare and contrast separate()
and extract()
. Why are there three variations of separation (by position, by separator, and with groups), but only one unite()
?
Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hicks (2021, Sept. 9). Statistical Computing: Tidy data and the Tidyverse. Retrieved from https://stephaniehicks.com/jhustatcomputing2021/posts/2021-09-07-tidy-data-and-the-tidyverse/
BibTeX citation
@misc{hicks2021tidy, author = {Hicks, Stephanie}, title = {Statistical Computing: Tidy data and the Tidyverse}, url = {https://stephaniehicks.com/jhustatcomputing2021/posts/2021-09-07-tidy-data-and-the-tidyverse/}, year = {2021} }