<- function(){
foo return("This is foo.")
}class(foo)
[1] "function"
foo
function ()
{
return("This is foo.")
}
foo()
[1] "This is foo."
purrr
November 5, 2024
If you have ever heard the phrase
“R is a functional language.”
you might have asked yourself what does this means exactly? Generally, this means that R lends itself nice to a particular style of programming, namely a functional style of programming (will explain more below), which is often very helpful to the types of problems you encounter when doing a data analysis.
A functional style of programming is contrast to a the formal definition of a functional language (or functional programming, which can be complementary to object-oriented programming), which are languages that use functions to create conditional expressions to perform specific computations.
From this resource some differences are:
Basic elements: The fundamental elements of object-oriented languages are objects and methods, while the elements of functional programming are functions and variables.
States: Object-oriented languages can change objects within the program, which means it has states or current modifications that affect the result of inputs. Functional languages do not use imperative programming, so they do not keep track of current states.
Parallel programming: This type of programming involves multiple computational processes occurring at the same time. Object-oriented languages have little support for parallel programming, but functional languages have extensive support for it.
Order: In object-oriented programming, computations occur in a specific order. In functional programming, computations can occur in any order.
Iterative data: Object-oriented programming uses loops, meaning repeated execution, for iterative data. Functional programming uses recursion for iterative data, meaning it attempts to solve problems using simpler versions of the same problem.
A traditional weakness of functional languages are poorer performance and sometimes unpredictable memory usage, but these have been much reduced in recent years.
There are many definitions for precisely what makes a language functional, but there are two common threads and/or characteristics.
At it is core, functional programming treats functions equally as other data structures, called first class functions.
In R, this means that you can do many of the things with a function that you can do with a vector: you can assign them to variables, store them in lists, pass them as arguments to other functions, create them inside functions, and even return them as the result of a function.
foo
):[1] "function"
function ()
{
return("This is foo.")
}
[1] "This is foo."
list
:foo_list <- list(
fun_1 = function() return("foo_1"),
fun_2 = function() return("foo_2")
)
str(foo_list)
List of 2
$ fun_1:function ()
$ fun_2:function ()
[1] "foo_1"
[1] "foo_2"
[1] "foo_1"
[1] "foo_2"
function ()
{
return("This is foo_2.")
}
<environment: 0x12824dbf8>
[1] "This is foo_2."
The bottom line, you can manipulate functions as the same way as you can to a vector or a matrix.
A function is pure, if it satisfies two properties:
The output only depends on the inputs, i.e. if you call it again with the same inputs, you get the same outputs. This excludes functions like runif()
, read.csv()
, or Sys.time()
that can return different values.
The function has no side-effects, like changing the value of a global variable, writing to disk, or displaying to the screen. This excludes functions like print()
, write.csv()
and <-
.
Pure functions are much easier to reason about, but obviously have significant downsides: imagine doing a data analysis where you could not generate random numbers or read files from disk.
To be clear, R is not formally a functional programming language as it does not require pure functions to be used when writing code.
So you might be asking yourself, why are we talking about this then?
The formal definition of a functional programming language introduces a new style of programming, namely a functional style of programming.
The key idea of a functional style is this programming style encourages programmers to write a big function as many smaller isolated functions, where each function addresses one specific task.
You can always adopt a functional style for certain parts of your code! For example, this style of writing code motivates more humanly readable code, and recyclable code.
At a high-level, a functional style is the concept of decomposing a big problem into smaller components, then solving each piece with a function or combination of functions.
In this lecture, we will focus on one type of functional technique, namely functionals, which are functions that take another function as an argument and returns a vector as output.
Functionals allow you to take a function that solves the problem for a single input and generalize it to handle any number of inputs. Once you learn about them, you will find yourself using them all the time in data analysis.
The chances are that you have already used a functional. You might have used for-loop replacements like base R’s lapply()
, apply()
, and tapply()
or maybe you have used a mathematical functional like integrate()
or optim()
.
One of the most common use of functionals is an alternative to for
loops.
For loops have a bad rap in R because many people believe they are slow, but the real downside of for loops is that they’re very flexible: a loop conveys that you’re iterating, but not what should be done with the results.
Typically it is not the for loop itself that is slow, but what you are doing inside of it. A common culprit of slow loops is modifying a data structure, where each modification generates a copy.
If you’re an experienced for loop user, switching to functionals is typically a pattern matching exercise. You look at the for loop and find a functional that matches the basic form. If one does not exist, do not try and torture an existing functional to fit the form you need. Instead, just leave it as a for loop! (Or once you have repeated the same loop two or more times, maybe think about writing your own functional).
Just as it is better to use while
than repeat
, and it’s better to use for
than while
, it is better to use a functional than for
.
Each functional is tailored for a specific task, so when you recognize the functional you immediately know why it’s being used.
purrr
: the functional programming toolkitThe R package purrr
as one important component of the tidyverse
, provides a interface to manipulate vectors in the functional style.
purrr
enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.
purrr
cheatsheet
It is very difficult, if not impossible, to remember all functions that a package offers as well as their use cases.
Hence, purrr
developers offer a nice compact cheatsheet with visualizations at https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf.
Similar cheat sheets are available for other tidyverse
packages.
The most popular function in purrr
is map()
which iterates over the supplied data structure and apply a function during the iterations. Beside the map()
function,purrr
also offers a series of useful functions to manipulate the list
data frame.
map
familyThe most fundamental functional in the purrr
package is the map(.x, .f)
function. It takes a vector (.x
) and a function (.f
), calls the function once for each element of the vector, and returns the results in a list. In other words, map(1:3, f)
is equivalent to list(f(1), f(2), f(3))
.
library(purrr)
# we create a function called "triple"
triple <- function(x) x * 3
# using for loop to iterate over a vector
loop_ret <- list()
for(i in 1:3){
loop_ret[i] <- triple(i)
}
loop_ret
[[1]]
[1] 3
[[2]]
[1] 6
[[3]]
[1] 9
# map implementation to iterate over a vector
map_eg1 <- map(.x = 1:3, .f = triple)
map_eg2 <- map(.x = 1:3, .f = function(x) triple(x)) # create an inline anonymous function
map_eg3 <- map(.x = 1:3, .f = ~triple(.x)) # same as above, but special purrr syntax with a "twiddle"
identical(loop_ret,map_eg1)
[1] TRUE
[1] TRUE
[1] TRUE
Or, graphically this is what map()
is doing:
map
relate to functional programming in base R?
map()
returns a list, which makes it the most general of the map family because you can put anything in a list.
The base equivalent to map(.x, .f)
is lapply(X, FUN)
.
Because the arguments include functions (.f
) besides data (.x
), map()
functions are considered as a convenient interface to implement functional programming.
map
variantsSometimes it is inconvenient to return a list when a simpler data structure would do, so there are four more specific variants of map
that make it really a family of functions (of syntax map_*()
).
map_lgl()
map_int()
map_dbl()
map_chr()
For example, purrr
uses the convention that suffixes, like _dbl()
, refer to the output. Each returns an atomic vector of the specified type:
mpg cyl disp hp drat wt qsec vs
"double" "double" "double" "double" "double" "double" "double" "double"
am gear carb
"double" "double" "double"
mpg cyl disp hp drat wt qsec vs am gear carb
TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# map_int() always returns a integer vector
n_unique <- function(x) length(unique(x))
map_int(.x = mtcars, .f = n_unique)
mpg cyl disp hp drat wt qsec vs am gear carb
25 3 27 22 22 29 30 2 2 3 6
mpg cyl disp hp drat wt qsec
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
vs am gear carb
0.437500 0.406250 3.687500 2.812500
All map_*()
functions can take any type of vector as input. The examples above rely on two facts:
mtcars
is a data.frame
. In R, data.frame
is a special case of list
, where each column as one item of the list. Don’t confuse with each row as an item.map
functions always return an output vector the same length as the input, which implies that each call to .f
must return a single value. If it does not, you will get an error:Error in `map_dbl()`:
ℹ In index: 1.
Caused by error:
! Result must be length 1, not 2.
This is similar to the error you will get if .f
returns the wrong type of result:
...
It is often convenient to pass along additional arguments to the function that you are calling.
For example, you might want to pass na.rm = TRUE
along to mean()
. One way to do that is with an anonymous function:
But because the map functions pass ...
along, there is a simpler form available:
This is easiest to understand with a picture: any arguments that come after f
in the call to map()
are inserted after the data in individual calls to f()
:
It’s important to note that these arguments are not decomposed; or said another way, map()
is only vectorised over its first argument.
If an argument after f
is a vector, it will be passed along as is:
map
Before we go on to explore more map variants, let’s take a quick look at how you tend to use multiple purrr
functions to solve a moderately realistic problem: fitting a model to each subgroup and extracting a coefficient of the model.
For this toy example, I will break the mtcars
data set down into groups defined by the number of cylinders, using the base split
function:
[1] 3
List of 3
$ 4:'data.frame': 11 obs. of 11 variables:
..$ mpg : num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
..$ cyl : num [1:11] 4 4 4 4 4 4 4 4 4 4 ...
..$ disp: num [1:11] 108 146.7 140.8 78.7 75.7 ...
..$ hp : num [1:11] 93 62 95 66 52 65 97 66 91 113 ...
..$ drat: num [1:11] 3.85 3.69 3.92 4.08 4.93 4.22 3.7 4.08 4.43 3.77 ...
..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
..$ qsec: num [1:11] 18.6 20 22.9 19.5 18.5 ...
..$ vs : num [1:11] 1 1 1 1 1 1 1 1 0 1 ...
..$ am : num [1:11] 1 0 0 1 1 1 0 1 1 1 ...
..$ gear: num [1:11] 4 4 4 4 4 4 3 4 5 5 ...
..$ carb: num [1:11] 1 2 2 1 2 1 1 1 2 2 ...
$ 6:'data.frame': 7 obs. of 11 variables:
..$ mpg : num [1:7] 21 21 21.4 18.1 19.2 17.8 19.7
..$ cyl : num [1:7] 6 6 6 6 6 6 6
..$ disp: num [1:7] 160 160 258 225 168 ...
..$ hp : num [1:7] 110 110 110 105 123 123 175
..$ drat: num [1:7] 3.9 3.9 3.08 2.76 3.92 3.92 3.62
..$ wt : num [1:7] 2.62 2.88 3.21 3.46 3.44 ...
..$ qsec: num [1:7] 16.5 17 19.4 20.2 18.3 ...
..$ vs : num [1:7] 0 0 1 1 1 1 0
..$ am : num [1:7] 1 1 0 0 0 0 1
..$ gear: num [1:7] 4 4 3 3 4 4 5
..$ carb: num [1:7] 4 4 1 1 4 4 6
$ 8:'data.frame': 14 obs. of 11 variables:
..$ mpg : num [1:14] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 ...
..$ cyl : num [1:14] 8 8 8 8 8 8 8 8 8 8 ...
..$ disp: num [1:14] 360 360 276 276 276 ...
..$ hp : num [1:14] 175 245 180 180 180 205 215 230 150 150 ...
..$ drat: num [1:14] 3.15 3.21 3.07 3.07 3.07 2.93 3 3.23 2.76 3.15 ...
..$ wt : num [1:14] 3.44 3.57 4.07 3.73 3.78 ...
..$ qsec: num [1:14] 17 15.8 17.4 17.6 18 ...
..$ vs : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
..$ am : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
..$ gear: num [1:14] 3 3 3 3 3 3 3 3 3 3 ...
..$ carb: num [1:14] 2 4 3 3 3 4 4 4 2 2 ...
This creates a list of three data frames: the cars with 4, 6, and 8 cylinders respectively.
First, imagine we want to fit a linear model to understand how the miles per gallon (mpg
) associated with the weight (wt
). We can do this for all observations in mtcars
using:
Call:
lm(formula = mpg ~ wt, data = mtcars)
Coefficients:
(Intercept) wt
37.285 -5.344
The following code shows how you might do that with purrr
, which returns a list with output from each lm fit for each cylinder:
$`4`
Call:
lm(formula = mpg ~ wt, data = .x)
Coefficients:
(Intercept) wt
39.571 -5.647
$`6`
Call:
lm(formula = mpg ~ wt, data = .x)
Coefficients:
(Intercept) wt
28.41 -2.78
$`8`
Call:
lm(formula = mpg ~ wt, data = .x)
Coefficients:
(Intercept) wt
23.868 -2.192
Let’s say we wanted to extract the second coefficient (i.e. the slope). Using all the observations in mtcars
(i.e. ignoring cyl
), it would be something like this:
(Intercept) wt
37.285126 -5.344472
wt
-5.344472
How would we do this with the map()
family functions if we wanted to stratify the analysis for each cyl
?
Hint: you can use two map
functions (e.g. map()
and map_dbl(2)
where you can extract a specific element by a specific name or position).
Or, of course, you could use a for loop:
slopes <- double(length(by_cyl))
for (i in seq_along(by_cyl)) {
model <- lm(mpg ~ wt, data = by_cyl[[i]])
slopes[[i]] <- coef(model)[[2]]
}
slopes
[1] -5.647025 -2.780106 -2.192438
It’s interesting to note that as you move from purrr to base apply functions to for loops you tend to do more and more in each iteration.
In purrr we iterate 3 times (map()
, map()
, map_dbl()
), and with a for loop we iterate once. I prefer more, but simpler, steps because I think it makes the code easier to understand and later modify.
The map
family include functions that organize the output in different data structures, whose names follow the pattern map_*
. As we’ve seen, the map
function return a list. The following functions will return a vector of a specific kind, e.g. map_lgl
returns a vector of logical variables, map_chr
returns a vector of strings.
It is also possible to return the the results as data frames by
map_dfr
) ormap_dfc
) 4 6 8
26.66364 19.74286 15.10000
# A tibble: 3 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 26.7 4 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
2 19.7 6 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
3 15.1 8 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5
# A tibble: 11 × 3
`4` `6` `8`
<dbl> <dbl> <dbl>
1 26.7 19.7 15.1
2 4 6 8
3 105. 183. 353.
4 82.6 122. 209.
5 4.07 3.59 3.23
6 2.29 3.12 4.00
7 19.1 18.0 16.8
8 0.909 0.571 0
9 0.727 0.429 0.143
10 4.09 3.86 3.29
11 1.55 3.43 3.5
There are 23 primary variants of map()
. So far, we have learned about five (map()
, map_lgl()
, map_int()
, map_dbl()
and map_chr()
). That means that you have got 18 (!!) more to learn. That sounds like a lot, but fortunately the design of purrr
means that you only need to learn five new ideas:
modify()
map2()
.imap()
walk()
.pmap()
.The map family of functions has orthogonal input and outputs, meaning that we can organise all the family into a matrix, with inputs in the rows and outputs in the columns. Once you have mastered the idea in a row, you can combine it with any column; once you have mastered the idea in a column, you can combine it with any row. That relationship is summarised in the following table:
List | Atomic | Same type | Nothing | |
---|---|---|---|---|
One argument | map() |
map_lgl() , … |
modify() |
walk() |
Two arguments | map2() |
map2_lgl() , … |
modify2() |
walk2() |
One argument + index | imap() |
imap_lgl() , … |
imodify() |
iwalk() |
N arguments | pmap() |
pmap_lgl() , … |
— | pwalk() |
modify()
Imagine you wanted to double every column in a data frame. You might first try using map()
, but map()
always returns a list:
If you want to keep the output as a data frame, you can use modify()
, which always returns the same type of output as the input:
map2()
and friendsmap()
is vectorised over a single argument, .x
.
This means it only varies .x
when calling .f
, and all other arguments are passed along unchanged, thus making it poorly suited for some problems.
For example, how would you find a weighted mean when you have a list of observations and a list of weights? Imagine we have the following data:
You can use map_dbl()
to compute the unweighted means:
[1] NA 0.4464825 0.4806867 0.5177861 0.4752695 0.6147369 0.4525804
[8] 0.5614233
But passing ws
as an additional argument does not work because arguments after .f
are not transformed:
Error in map_dbl(x. = xs, .f = weighted.mean, w = ws): argument ".x" is missing, with no default
We need a new tool: a map2()
, which is vectorised over two arguments. This means both .x
and .y
are varied in each call to .f
:
[1] NA 0.4548610 0.4449436 0.5327727 0.4077519 0.5728543 0.3940515
[8] 0.5399974
The arguments to map2()
are slightly different to the arguments to map()
as two vectors come before the function, rather than one. Additional arguments still go afterwards:
walk()
and friendsMost functions are called for the value that they return, so it makes sense to capture and store the value with a map()
function.
But some functions are called primarily for their side-effects (e.g. cat()
, write.csv()
, or ggsave()
) and it does not make sense to capture their results.
Let’s consider the example of saving a dataset. In this case, map
will force an output, e.g. NULL
. One can consider using walk
instead. The function walk
(and walk2
for more than two inputs) behaves exactly the same as map
but does not output anything.