Functional Programming with purrr

Introduction to tools to work with functions and vectors in R
module 3
week 6
functions
functional
programming
purrr
Author
Affiliation

Department of Biostatistics, Johns Hopkins

Published

November 29, 2022

Pre-lecture materials

Read ahead

Prerequisites

Before starting you must install the additional package:

  • purrr - this provides a consistent functional programming interface to work with functions and vectors

You can do this by calling

install.packages("purrr")

or by using the “Install Packages…” option from the “Tools” menu in RStudio.

Acknowledgements

Material for this lecture was borrowed and adapted from

Learning objectives

At the end of this lesson you will:

  • Be familiar with the concept of functional programming
  • Get comfortable with the major functions in purrr, e.g. the map family, reduce
  • Write your loops with map functions instead of the for loop

Functional Programming

The characteristics

At its core, functional programming treats functions like any other data structure. Functions with this property are called first-class functions.

In R, this means that you can do many of the things with a function that you can do with a vector: you can assign them to variables, store them in lists, pass them as arguments to other functions, create them inside functions, and even return them as the result of a function.

What do you mean?

  • Assign a function to a variable
foo <- function(){
  return("This is foo.")
}
class(foo)
[1] "function"
  • Store functions in a list
foo_list <- list( 
  fun_1 = function() return("foo_1"),
  fun_2 = function() return("foo_2")
)

str(foo_list)
List of 2
 $ fun_1:function ()  
  ..- attr(*, "srcref")= 'srcref' int [1:8] 2 11 2 36 11 36 2 2
  .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f925303de48> 
 $ fun_2:function ()  
  ..- attr(*, "srcref")= 'srcref' int [1:8] 3 11 3 36 11 36 3 3
  .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7f925303de48> 
  • Pass functions as arguments to other functions
shell <- function(f) f()
shell(foo_list$fun_1)
[1] "foo_1"
shell(foo_list$fun_2)
[1] "foo_2"
  • Create functions inside of functions & return them as the result of a function
foo_wrap <- function(){
  foo_2 <- function(){
    return("This is foo_2.")
  }
  return(foo_2)
}

foo_wrap()
function(){
    return("This is foo_2.")
  }
<environment: 0x7f92410bf898>
(foo_wrap())()
[1] "This is foo_2."

The bottom line: you can manipulate functions in the same way as you can a vector or a matrix.

Why is functional programming important?

Functional programming encourages a particular style of programming, the functional style. Broadly speaking, this style encourages programmers to write one big function as many smaller, isolated functions, where each function addresses one specific task.

As a by-product, the functional style leads to code that is more human-readable and reusable. In the two pipelines below (the function names are placeholders), only the regression step differs.

"data_set.csv" |> 
  import_data_from_file() |> 
  data_cleaning() |> 
  run_regression() |>
  model_diagnostics() |>
  model_visualization()

"data_set2.csv" |> 
  import_data_from_file() |> 
  data_cleaning() |> 
  run_different_regression() |>
  model_diagnostics() |>
  model_visualization()
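
The function names in the pipelines above are placeholders. A minimal runnable sketch of the same idea, using the built-in mtcars data, might look like the following. The helper names clean_data, fit_model and summarize_fit are hypothetical and chosen only for illustration.

```r
# Each helper does one small task; all helper names are hypothetical.
clean_data <- function(dat) dat[complete.cases(dat), ]
fit_model <- function(dat) lm(mpg ~ wt, data = dat)
summarize_fit <- function(fit) round(coef(fit), 2)

# The analysis reads as a sequence of steps
coefs <- mtcars |>
  clean_data() |>
  fit_model() |>
  summarize_fit()

coefs
```

Swapping fit_model for a different modelling function changes one step and leaves the rest of the pipeline untouched.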
Pipe operators

R provides pipe operators to make code more readable, e.g. |> from base R and %>% from the package magrittr. These operators work like a pipe: the output of the previous function (the left-hand side of the operator) is passed as input to the following function (the right-hand side of the operator). The pipe operator |> was introduced in R 4.1.0 and requires no loading of additional packages, unlike %>%.

A keyboard shortcut to type a pipe operator in RStudio is shift+cmd+m on Mac or shift+ctrl+m on Windows.
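
As a small illustration of how the base pipe rewrites a nested call, the sketch below computes the same quantity both ways. The vector x is made up for the example.

```r
x <- c(1, 4, 9)

# The native pipe passes the left-hand side as the first argument
# of the call on the right-hand side.
nested <- round(mean(sqrt(x)), 1)
piped <- x |> sqrt() |> mean() |> round(1)

identical(nested, piped)
```

The piped version reads left to right in the order the operations happen, while the nested version has to be read from the inside out.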

purrr: the functional programming toolkit

The R package purrr, an important component of the tidyverse, provides an interface to manipulate vectors in the functional style.

purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.

purrr cheatsheet

It is very difficult, if not impossible, to remember all the functions that a package offers as well as their use cases. Hence, the purrr developers offer a compact cheatsheet with visualizations at https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf. Similar cheatsheets are available for other tidyverse packages.

The most popular function in purrr is map(), which iterates over the supplied data structure and applies a function at each iteration. Besides map, purrr also offers a series of useful functions for manipulating lists.

The map family

The map family of functions provides a convenient way to iterate through vectors or lists and apply functions during this iteration. Depending on the dimension of the input and the format of the output, there are many different variants of the basic map function.

How does map relate to functional programming

Because their arguments include a function (.f) in addition to data (.x), map functions are a convenient way to put functional programming into practice: the function is passed around like any other value.

map as a for loop

library(purrr)

triple <- function(x) x * 3

# for loop
loop_ret <- list()
for(i in 1:3){
  loop_ret[i] <- triple(i)
}

# map implementation
map_eg1 <- map(.x = 1:3, .f = triple)
map_eg2 <- map(.x = 1:3, .f = ~triple(.x))
map_eg3 <- map(.x = 1:3, .f = function(x) triple(x))

identical(loop_ret,map_eg1)
[1] TRUE
identical(loop_ret,map_eg2)
[1] TRUE
identical(loop_ret,map_eg3)
[1] TRUE
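
For comparison, base R's lapply plays the same role as map in this example. The sketch below is an addition for illustration, not part of the original example: it also preallocates the loop result and fills it with [[, a common idiom when the number of iterations is known in advance.

```r
library(purrr)

triple <- function(x) x * 3

# Preallocate the result, then fill one element per iteration
loop_ret <- vector("list", 3)
for (i in seq_len(3)) {
  loop_ret[[i]] <- triple(i)
}

base_ret <- lapply(1:3, triple)  # base R counterpart of map()
map_ret <- map(1:3, triple)

identical(loop_ret, base_ret)
identical(base_ret, map_ret)
```

All three approaches produce the same list. The map and lapply versions state the intent (apply triple to each element) without the bookkeeping of an index variable.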

map with a data frame

tmp_dat <- data.frame(
  x = 1:5,
  y = 6:10
)

tmp_dat |> 
  map(.f = mean)
$x
[1] 3

$y
[1] 8
# Alternatively
# map(.x = tmp_dat, .f = mean)
data.frame vs list

A data.frame is a special case of a list in which each column is one item of the list. Do not mistake each row for an item.

class(tmp_dat)
[1] "data.frame"
typeof(tmp_dat)
[1] "list"
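
One way to see this is to ask a data frame for its length, which counts columns, and to check what map visits. This is a minimal sketch that rebuilds the same small data frame so that it runs on its own.

```r
library(purrr)

tmp_dat <- data.frame(x = 1:5, y = 6:10)

# A data frame has one list element per column
length(tmp_dat)   # number of columns, not rows
names(tmp_dat)

# map functions therefore visit columns; here each column reports its length
col_lengths <- map_int(tmp_dat, length)
col_lengths
```

Each column reports a length of 5 (the number of rows), which confirms that the function was applied column by column.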

Extra arguments for functions

tmp_dat2 <- as.list(tmp_dat)
tmp_dat2$y[6] <- NA
str(tmp_dat2)
List of 2
 $ x: int [1:5] 1 2 3 4 5
 $ y: int [1:6] 6 7 8 9 10 NA
tmp_dat2 |> map(.f = mean) # No extra arguments
$x
[1] 3

$y
[1] NA
tmp_dat2 |> 
  map(.f = mean, na.rm = TRUE) # With extra arguments
$x
[1] 3

$y
[1] 8
tmp_dat2 |> 
  map(.f = function(x, remove_na) mean(x, na.rm = remove_na),
      remove_na = TRUE)
$x
[1] 3

$y
[1] 8
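
Extra arguments can also be fixed inside an anonymous function instead of being passed through map. The sketch below compares three spellings on a made-up list named vals. The \(v) shorthand for function(v) is available from R 4.1.0.

```r
library(purrr)

vals <- list(x = c(1, 2, 3), y = c(6, 8, NA))

# Three equivalent ways to pass na.rm = TRUE to mean()
via_dots <- map(vals, mean, na.rm = TRUE)
via_lambda <- map(vals, \(v) mean(v, na.rm = TRUE))
via_formula <- map(vals, ~ mean(.x, na.rm = TRUE))

identical(via_dots, via_lambda)
identical(via_dots, via_formula)
```

Writing the argument inside the anonymous function makes it explicit which function receives na.rm, which can help when the call to map grows longer.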

Stratified analysis with map

We use the mtcars dataset from the package datasets to demonstrate.

library(datasets)
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
unique(mtcars$cyl) # different numbers of cylinders
[1] 6 4 8

We are interested in the average miles per gallon for vehicles with different numbers of cylinders.

# Split mtcars into one data frame per number of cylinders
str_dat <- mtcars |> split(mtcars$cyl)
length(str_dat)
[1] 3
str(str_dat)
List of 3
 $ 4:'data.frame':  11 obs. of  11 variables:
  ..$ mpg : num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
  ..$ cyl : num [1:11] 4 4 4 4 4 4 4 4 4 4 ...
  ..$ disp: num [1:11] 108 146.7 140.8 78.7 75.7 ...
  ..$ hp  : num [1:11] 93 62 95 66 52 65 97 66 91 113 ...
  ..$ drat: num [1:11] 3.85 3.69 3.92 4.08 4.93 4.22 3.7 4.08 4.43 3.77 ...
  ..$ wt  : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
  ..$ qsec: num [1:11] 18.6 20 22.9 19.5 18.5 ...
  ..$ vs  : num [1:11] 1 1 1 1 1 1 1 1 0 1 ...
  ..$ am  : num [1:11] 1 0 0 1 1 1 0 1 1 1 ...
  ..$ gear: num [1:11] 4 4 4 4 4 4 3 4 5 5 ...
  ..$ carb: num [1:11] 1 2 2 1 2 1 1 1 2 2 ...
 $ 6:'data.frame':  7 obs. of  11 variables:
  ..$ mpg : num [1:7] 21 21 21.4 18.1 19.2 17.8 19.7
  ..$ cyl : num [1:7] 6 6 6 6 6 6 6
  ..$ disp: num [1:7] 160 160 258 225 168 ...
  ..$ hp  : num [1:7] 110 110 110 105 123 123 175
  ..$ drat: num [1:7] 3.9 3.9 3.08 2.76 3.92 3.92 3.62
  ..$ wt  : num [1:7] 2.62 2.88 3.21 3.46 3.44 ...
  ..$ qsec: num [1:7] 16.5 17 19.4 20.2 18.3 ...
  ..$ vs  : num [1:7] 0 0 1 1 1 1 0
  ..$ am  : num [1:7] 1 1 0 0 0 0 1
  ..$ gear: num [1:7] 4 4 3 3 4 4 5
  ..$ carb: num [1:7] 4 4 1 1 4 4 6
 $ 8:'data.frame':  14 obs. of  11 variables:
  ..$ mpg : num [1:14] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 ...
  ..$ cyl : num [1:14] 8 8 8 8 8 8 8 8 8 8 ...
  ..$ disp: num [1:14] 360 360 276 276 276 ...
  ..$ hp  : num [1:14] 175 245 180 180 180 205 215 230 150 150 ...
  ..$ drat: num [1:14] 3.15 3.21 3.07 3.07 3.07 2.93 3 3.23 2.76 3.15 ...
  ..$ wt  : num [1:14] 3.44 3.57 4.07 3.73 3.78 ...
  ..$ qsec: num [1:14] 17 15.8 17.4 17.6 18 ...
  ..$ vs  : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ am  : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ gear: num [1:14] 3 3 3 3 3 3 3 3 3 3 ...
  ..$ carb: num [1:14] 2 4 3 3 3 4 4 4 2 2 ...
str_dat |> 
  map(.f = ~mean(.x$mpg))
$`4`
[1] 26.66364

$`6`
[1] 19.74286

$`8`
[1] 15.1
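
The same split-then-map pattern extends from summary statistics to models. As a sketch, the code below fits a separate linear regression of mpg on wt within each cylinder group. The choice of model is an assumption made for illustration only.

```r
library(purrr)

# Fit mpg ~ wt separately within each cylinder group
fits <- mtcars |>
  split(mtcars$cyl) |>
  map(\(df) lm(mpg ~ wt, data = df))

# Pull the wt coefficient out of each fitted model
slopes <- fits |> map(\(fit) coef(fit)[["wt"]])
str(slopes)
```

The result is a named list with one slope per stratum, keyed by the number of cylinders.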

Vectors and data frames as the output

The map family includes functions that organize the output in different data structures; their names follow the pattern map_*. As we have seen, map returns a list. The variants below return an atomic vector of a specific type, e.g. map_lgl returns a logical vector, map_chr returns a character vector, and map_dbl returns a vector of doubles. It is also possible to return the results as a data frame by row binding (map_dfr) or column binding (map_dfc).

str_dat |> 
  map_dbl(.f = ~mean(.x$mpg)) # returns a vector of doubles
       4        6        8 
26.66364 19.74286 15.10000 
str_dat |> 
  map_dfr(.f = ~colMeans(.x)) # return a data frame by row binding
# A tibble: 3 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  26.7     4  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
2  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
3  15.1     8  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5 
str_dat |> 
  map_dfc(.f = ~colMeans(.x)) # return a data frame by col binding
# A tibble: 11 × 3
       `4`     `6`     `8`
     <dbl>   <dbl>   <dbl>
 1  26.7    19.7    15.1  
 2   4       6       8    
 3 105.    183.    353.   
 4  82.6   122.    209.   
 5   4.07    3.59    3.23 
 6   2.29    3.12    4.00 
 7  19.1    18.0    16.8  
 8   0.909   0.571   0    
 9   0.727   0.429   0.143
10   4.09    3.86    3.29 
11   1.55    3.43    3.5  
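
The typed variants can be tried on a small made-up list. The sketch below also shows that a typed variant raises an error when the function returns a value of the wrong type, which helps surface mistakes early.

```r
library(purrr)

dat <- list(a = 1:3, b = c(2.5, 3.5), c = letters[1:4])

map_int(dat, length)      # integer vector: 3, 2, 4
map_lgl(dat, is.numeric)  # logical vector: TRUE, TRUE, FALSE
map_chr(dat, class)       # character vector: "integer", "numeric", "character"

# class() returns a string, so map_int() refuses the result;
# try() keeps the script running so the failure can be inspected.
res <- try(map_int(dat, class), silent = TRUE)
inherits(res, "try-error")
```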

Multiple Input

Some operations require a pair of variables as input. While this can still be managed with map, purrr provides better options, specifically map2 and pmap.

map_avg <- map_dbl(.x = mtcars, .f = mean)

map2_avg <- map2_dbl(.x = mtcars,
                     .y = list(weight = 1/nrow(mtcars)),
                     .f = ~sum(.x*.y))
identical(map_avg, map2_avg)
[1] TRUE
pmap_avg <- pmap_dbl(list(x = mtcars,
                          y = list(weight = 1/(2*nrow(mtcars))),
                          z = list(weight2 = 2)),
                     .f = ~sum(..1*..2*..3))
identical(map_avg, pmap_avg)
[1] TRUE
# Use element names in pmap
mtcars$weight <- 1/2
mtcars$weight2 <-  2
pmap_eg2 <- pmap_dbl(mtcars,
                     .f = function(mpg, weight, weight2, ...){
                       mpg * weight * weight2
                     })

identical(pmap_eg2, mtcars$mpg)
[1] TRUE
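
A smaller example may make the mechanics easier to follow. In this sketch all vectors and argument names are made up: map2 walks through two inputs in parallel, and pmap matches the names of a list to the arguments of the function.

```r
library(purrr)

x <- c(1, 2, 3)
w <- c(10, 20, 30)

# map2 variants: iterate over two inputs in parallel
prod_xw <- map2_dbl(x, w, \(xi, wi) xi * wi)
prod_xw    # 10 40 90

# pmap variants: iterate over any number of inputs held in a list;
# list names are matched to the function's argument names
args <- list(base = c(2, 3, 4), power = c(2, 2, 3), offset = c(0, 1, 0))
powered <- pmap_dbl(args, \(base, power, offset) base^power + offset)
powered    # 4 10 64
```

Matching by name means the order of the elements in args does not matter, which tends to be less error-prone than positional placeholders such as ..1 and ..2.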

No output

Some operations do not need to return anything during the iteration, e.g. saving a dataset. In this case, map still returns a list, here filled with NULL. Consider using walk instead: it calls the function in the same way as map, but it is meant for side effects and returns its input invisibly, so nothing is printed.

tmp_fldr <- tempdir()


map2(.x = str_dat,
     .y = 1:length(str_dat),
     .f = ~saveRDS(.x, 
                   file = paste0(tmp_fldr, "/",.y, ".rds"))
)
$`4`
NULL

$`6`
NULL

$`8`
NULL
# No output
walk2(.x = str_dat,
      .y = (1:length(str_dat)),
      .f = ~saveRDS(.x, 
                    file = paste0(tmp_fldr, "/",.y, ".rds"))
)
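
Because walk2 returns its first input invisibly, it can sit in the middle of a pipeline. The sketch below saves each group to a file and then continues with the original data. The folder name walk_demo and the file naming scheme are assumptions made for this example.

```r
library(purrr)

out_dir <- file.path(tempdir(), "walk_demo")  # hypothetical folder name
dir.create(out_dir, showWarnings = FALSE)

groups <- split(mtcars, mtcars$cyl)
paths <- file.path(out_dir, paste0("cyl_", names(groups), ".rds"))

# walk2() calls saveRDS() for its side effect and passes `groups` through,
# so the next step still receives the list of data frames
n_rows <- groups |>
  walk2(paths, \(dat, path) saveRDS(dat, file = path)) |>
  map_int(nrow)

file.exists(paths)
n_rows
```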

Other functions in purrr

reduce and accumulate

purrr also provides a function to summarize a list with a chosen operator or function, namely reduce. Its variant accumulate also keeps the intermediate results of this reduction process.

mtcars$weight <- 1/(2*nrow(mtcars))
mtcars$weight2 <-  2
reduce_eg <- 
  pmap_dbl(mtcars,
           .f = function(mpg, weight, weight2, ...){
             mpg * weight * weight2
           }) |> 
  reduce(`+`)

pmap_dbl(mtcars,
           .f = function(mpg, weight, weight2, ...){
             mpg * weight * weight2
           })|>
  head() |> # Only show the first 6 operations
  accumulate(`+`)
[1] 0.656250 1.312500 2.025000 2.693750 3.278125 3.843750
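
On a short integer vector the two functions are easy to check by hand. This is a minimal sketch. The .init argument supplies a starting value, which also defines the result when the input is empty.

```r
library(purrr)

x <- 1:5

total <- reduce(x, `+`)           # ((((1 + 2) + 3) + 4) + 5) = 15
running <- accumulate(x, `+`)     # running totals: 1 3 6 10 15

# Start the reduction from 100 instead of the first element
shifted <- reduce(x, `+`, .init = 100)   # 115

# With an empty input, the starting value is returned
empty <- reduce(integer(0), `+`, .init = 0)

total
running
shifted
empty
```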

Working with list

Let’s move to the purrr cheatsheet at https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf.

Summary

  • Introduction to functional programming.
  • The R package purrr provides a nice interface to functional programming and list manipulation.
  • The function map and its variants map_* provide a neat way to iterate over a list or vector, with the output in different data structures.
  • The functions map2 and pmap allow more than one list as input.
  • The function walk and its variants (walk2, iwalk, pwalk) are called for their side effects and return their input invisibly.
  • The functions reduce and accumulate help to summarize a list with a preferred operator or function.

Post-lecture materials

Questions
  1. What do imap and iwalk do? In this lecture note, can you find the one example that could be rewritten with imap or iwalk? Hint: see the sub-section named No output

  2. Is there any function in base R that provides a nice interface for functional programming? Hint: ?with, ?within

  3. Can you write a section of code to demonstrate the central limit theorem primarily using the purrr package and/or base R?

Additional Resources