Introduction to using Python in R and the reticulate package
Before class, you can prepare by reading the following materials:
Material for this lecture was borrowed and adapted from
At the end of this lesson you will be able to use the reticulate package to work interoperably between Python and R.
For this lesson, we will be using the reticulate R package, which provides a set of tools for interoperability between Python and R. The package includes facilities for:
Calling Python from R in a variety of ways including R Markdown, sourcing Python scripts, importing Python modules, and using Python interactively within an R session.
Translation between R and Python objects (for example, between R and Pandas data frames, or between R matrices and NumPy arrays).
[Source: Rstudio]
Installing Python: If you would like recommendations on installing Python, I like this resource: https://py-pkgs.org/02-setup#installing-python
What’s happening under the hood?: reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability.
If you are an R developer who uses Python for some of your work, or a member of a data science team that uses both languages, reticulate can make your life better!
Let’s try it out. Before we get started, you will need to install the package, if you have not already:
install.packages("reticulate")
We will also load the here and tidyverse packages for our lesson:
By default, reticulate uses the version of Python found on your PATH:
Sys.which("python3.9")
python3.9
""
The use_python() function enables you to specify an alternate version, for example:
use_python("/usr/<new>/<path>/local/bin/python")
For example, I can define the path explicitly:
use_python("/Users/shicks/opt/miniconda3/bin/python3.9", required = TRUE)
There are a variety of ways to integrate Python code into your R projects:
Python in R Markdown: A new Python language engine for R Markdown that supports bi-directional communication between R and Python (R chunks can access Python objects and vice-versa).
Importing Python modules: The import() function enables you to import any Python module and call its functions directly from R.
Sourcing Python scripts: The source_python() function enables you to source a Python script the same way you would source() an R script (Python functions and objects defined within the script become directly available to the R session).
Python REPL: The repl_python() function creates an interactive Python console within R. Objects you create within Python are available to your R session (and vice-versa).
Below I will focus on introducing the first and last one. However, before we do that, let’s introduce a bit about python basics.
Python is a high-level, object-oriented programming language useful to know for anyone analyzing data. The most important thing to know before learning Python is that everything in Python is an object. There is no separate compilation step, no need to declare the type of a variable before using it, and no need to allocate memory for variables. The syntax is easy to learn and easy to read.
There is a large scientific community contributing to Python. Some of the most widely used libraries in Python are numpy, scipy, pandas, and matplotlib.
There are two modes you can write Python code in: interactive mode or script mode. If you open up a UNIX command window and have a command-line interface, you can simply type python (or python3) in the shell:
python3
and the interactive mode will open up. Code you type in the interactive mode is evaluated immediately by the Python interpreter.
Another way to pass code to Python is to store code in a file ending in .py, and execute the file in the script mode using
python3 myscript.py
To check what version of Python you are using, type the following in the shell:
python3 --version
Everything in Python is an object. Think of an object as a data structure that contains both data and functions. Variables, functions, and modules are all objects. We can operate on these objects with what are called operators (e.g. addition, subtraction, concatenation or other operations), define and apply functions, test conditions with conditional statements (e.g. if, else statements), or iterate over the objects.
Not all objects are required to have attributes and methods to operate on the objects in Python, but everything is an object (i.e. all objects can be assigned to a variable or passed as an argument to a function). A user can work with built-in defined classes of objects or can create new classes of objects. Using these objects, a user can perform operations on the objects by modifying / interacting with them.
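To make the "everything is an object" idea concrete, here is a minimal sketch. The function names shout and apply_twice are made up for this illustration; the point is only that a function can be assigned to a variable and passed as an argument just like a number or a string.

```python
import math

def shout(text):
    """Return text in upper case with an exclamation mark."""
    return text.upper() + "!"

def apply_twice(func, value):
    """Call func on value, then call func again on the result."""
    return func(func(value))

alias = shout                       # a function assigned to a new name
result = apply_twice(alias, "go")   # a function passed as an argument

print(type(42).__name__)      # int
print(type(shout).__name__)   # function
print(type(math).__name__)    # module
print(result)                 # GO!!
```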
Variable names are case sensitive, can contain letters, numbers and underscores, cannot begin with a number, cannot contain illegal characters (such as spaces or hyphens), and cannot be one of Python's reserved keywords. In Python 3.7 and later there are 35 of them:
“False, None, True, and, as, assert, async, await, break, class, continue, def, del, elif, else, except, finally, for, from, global, if, import, in, is, lambda, nonlocal, not, or, pass, raise, return, try, while, with, yield”
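One way to check whether a name is usable in the interpreter you are running is the standard-library keyword module together with the string method isidentifier(). The helper is_valid_name below is a hypothetical name used only for this sketch. Note that in Python 3 print is an ordinary function rather than a reserved word, so it passes the check (although reusing it as a variable name would hide the built-in).

```python
import keyword

def is_valid_name(name):
    """Return True if name is a legal, non-reserved Python identifier."""
    return name.isidentifier() and not keyword.iskeyword(name)

# keyword.kwlist holds the reserved words of the running interpreter
print(len(keyword.kwlist))

for candidate in ["total_2", "2nd_total", "class", "print", "my-var"]:
    print(candidate, is_valid_name(candidate))
```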
Numeric operators: +, -, *, /, ** (exponent), % (modulus if applied to integers)
String and list operators: + and *
Assignment operator: =
Augmented assignment operators: += (or -=) can be used like n += x, which is equal to n = n + x
Comparison operators: == (equal), != (not equal), >, <, >= (greater than or equal to), <= (less than or equal to)
Logical operators: and, or, and not, e.g. x > 1 and x <= 5
2 ** 3
8
x = 3
x > 1 and x <= 5
True
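As a quick illustration of these operators, the following sketch runs each kind once. The floor-division operator // and the chained comparison 1 < x <= 5 are not mentioned above; they are included here as extras.

```python
n = 7
print(n / 2)      # true division: 3.5
print(n // 2)     # floor division: 3
print(n % 2)      # remainder: 1
print(2 ** 3)     # exponent: 8

n += 3            # same as n = n + 3
print(n)          # 10

print("ab" + "cd")    # concatenation: abcd
print("ab" * 3)       # repetition: ababab
print([1, 2] + [3])   # lists concatenate too: [1, 2, 3]

x = 3
print(x > 1 and x <= 5)   # True
print(1 < x <= 5)         # chained comparison, also True
print(not x == 3)         # False
```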
If % is applied to strings, this operator is the format operator. It tells Python how to format a list of values in a string. For example,
%d says to format the value as an integer
%g says to format the value as a float
%s says to format the value as a string
print('In %d days, I have eaten %g %s.' % (5, 3.5, 'crabs'))
In 5 days, I have eaten 3.5 crabs.
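The % operator is the oldest of several ways to format strings in Python. As a sketch for comparison, the same sentence can be built with the str.format() method or with an f-string (Python 3.6 and later); all three produce the same text here.

```python
days, amount, food = 5, 3.5, "crabs"

old_style = "In %d days, I have eaten %g %s." % (days, amount, food)
with_format = "In {} days, I have eaten {} {}.".format(days, amount, food)
f_string = f"In {days} days, I have eaten {amount} {food}."

print(old_style)
print(old_style == with_format == f_string)   # True
```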
Python contains a small list of very useful built-in functions. All other functions need to be defined by the user or imported from modules. For a more detailed list of the built-in functions in Python, see Built-in Python Functions.
The first function we will discuss, type(), reports the type of any object, which is very useful when handling multiple data types (remember, everything in Python is an object). Here are some of the main types you will encounter:
integer (int)
floating-point (float)
string (str)
list (list)
dictionary (dict)
tuple (tuple)
function (function)
module (module)
boolean (bool): e.g. True, False
enumerate (enumerate)
If we asked for the type of a string “Let’s go Ravens!”
type("Let's go Ravens!")
<class 'str'>
This would return the str type.
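As a small sketch, type() can be applied to one example of each of the types listed above. The function greet is a throwaway defined only for this illustration.

```python
import math

def greet():
    return "hi"

examples = [3, 2.5, "Ravens", [1, 2], {"a": 1}, (1, 2),
            greet, math, True, enumerate(["a", "b"])]

# type(obj).__name__ gives the short name of the type
type_names = [type(obj).__name__ for obj in examples]
print(type_names)
```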
You have also seen how to use the print() function. The print() function will accept an argument and print the argument to the screen. Print can be used in two ways:
print("Let's go Ravens!")
[1] "Let's go Ravens!"
New functions can be defined using the def keyword.
def new_world():
    return 'Hello world!'
print(new_world())
Hello world!
The first line of the function (the header) must start with def, followed by the name of the function (which can contain underscores), parentheses (with any arguments inside them) and a colon. When the function is called, arguments passed by keyword can be given in any order.
The rest of the function (the body) must be indented, by convention with four spaces. If you define a function in the interactive mode, the interpreter will print an ellipsis (...) to let you know the function is not complete. To complete the function, enter an empty line (not necessary in a script).
To return a value from a function, use return. The function will immediately terminate and not run any code written past this point.
def squared(x):
    """ Return the square of a
        value """
    return x ** 2
print(squared(4))
16
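The points about return and argument order can be seen in a short sketch. The function describe and its threshold parameter are hypothetical, chosen only to show an early return, a default value, and keyword arguments.

```python
def describe(value, threshold=10):
    """Return 'large' if value exceeds threshold, otherwise 'small'."""
    if value > threshold:
        return "large"   # the function stops here when this branch runs
    return "small"

print(describe(16))                       # large
print(describe(threshold=20, value=16))   # small; keyword arguments in any order
print(describe.__doc__)                   # the docstring is stored on the function
```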
Note: Python has its own version of R's ... argument, written *args (example from docs.python.org):
def concat(*args, sep="/"):
    return sep.join(args)
concat("a", "b", "c")
'a/b/c'
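Building on the concat() example, the sketch below shows a few more things *args allows: overriding the keyword-only sep argument, unpacking an existing list with *, and collecting named arguments with **kwargs. The summarize function is a made-up helper for illustration.

```python
def concat(*args, sep="/"):
    return sep.join(args)

def summarize(*args, **kwargs):
    """Return the number of positional arguments and the sorted keyword names."""
    return len(args), sorted(kwargs)

print(concat("earth", "mars", "venus"))            # earth/mars/venus
print(concat("earth", "mars", "venus", sep="."))   # earth.mars.venus

parts = ["a", "b", "c"]
print(concat(*parts))                # the * unpacks the list: a/b/c

print(summarize(1, 2, x=3, y=4))     # (2, ['x', 'y'])
```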
Iterative loops can be written with the for, while and break statements.
Defining a for loop is similar to defining a new function. The header ends with a colon and the body is indented. The function range(n) takes an integer n and creates a sequence of values from 0 to n - 1. for loops are not just for counters; they can iterate through many types of objects such as strings, lists and dictionaries.
for i in range(3):
    print('Baby shark, doo doo doo doo doo doo!')
Baby shark, doo doo doo doo doo doo!
Baby shark, doo doo doo doo doo doo!
Baby shark, doo doo doo doo doo doo!
print('Baby shark!')
Baby shark!
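The sketch below illustrates looping over a string, a list and a dictionary, followed by a while loop that is ended with break. The variable names and values are invented for the example.

```python
for letter in "abc":                  # iterate over the characters of a string
    print(letter)

teams = ["Ravens", "Orioles"]
for index, team in enumerate(teams):  # enumerate pairs each item with its position
    print(index, team)

counts = {"a": 1, "b": 2}
for key, value in counts.items():     # iterate over key-value pairs
    print(key, value)

# A while loop runs until its condition is false or a break is reached
n = 0
total = 0
while True:
    n += 1
    total += n
    if total > 10:
        break
print(n, total)   # 5 15
```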
The function len() returns the number of items in an object, for example the number of characters in a string:
x = 'Baby shark!'
len(x)
11
For strings, lists and dictionaries, there is a set of methods you can use to manipulate the objects. In general, the notation for methods is the dot notation. The syntax is the name of the object followed by a dot (or period) followed by the name of the method.
x = "Hello Baltimore!"
x.split()
['Hello', 'Baltimore!']
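A few more methods in dot notation, as a sketch, one group each for strings, lists and dictionaries. All of the methods used are standard built-in ones; the variable names are arbitrary.

```python
x = "Hello Baltimore!"
words = x.split()
print(x.upper())                       # HELLO BALTIMORE!
print(x.replace("Hello", "Goodbye"))   # Goodbye Baltimore!
print("-".join(words))                 # Hello-Baltimore!

numbers = [3, 1, 2]
numbers.append(4)    # add an item at the end (modifies the list in place)
numbers.sort()       # sort in place
print(numbers)       # [1, 2, 3, 4]

ages = {"a": 1}
print(ages.get("b", 0))     # default value when the key is missing: 0
print(list(ages.keys()))    # ['a']
```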
We have already seen lists. Python has other data structures built in.
Sets: {"a", "a", "a", "b"} (unique elements)
Tuples: (1, 2, 3) (a lot like lists but not mutable, i.e. you need to create a new one to modify it)
Dictionaries: key-value pairs, for example
d = {"a" : 1, "b" : 2}
d['a']
1
d['b']
2
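The following sketch shows the behavior described for each structure: a set drops duplicates, a tuple refuses in-place modification, and a dictionary maps keys to values. Variable names are illustrative only.

```python
letters = {"a", "a", "a", "b"}
print(len(letters))              # duplicates are dropped: 2
print(letters == {"a", "b"})     # True

point = (1, 2, 3)
try:
    point[0] = 10                # tuples cannot be modified in place
except TypeError as error:
    print("tuples are immutable:", error)
new_point = (10,) + point[1:]    # build a new tuple instead
print(new_point)                 # (10, 2, 3)

lookup = {"a": 1, "b": 2}
lookup["c"] = 3                  # dictionaries are mutable
print(lookup["a"], "c" in lookup)   # 1 True
```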
More about data structures can be found in the Python docs
The reticulate package includes a Python engine for R Markdown with the following features:
Run Python chunks in a single Python session embedded within your R session (shared variables/state between Python chunks)
Printing of Python output, including graphical output from matplotlib.
Access to objects created within Python chunks from R using the py object (e.g. py$x would access an x variable created within Python from R).
Access to objects created within R chunks from Python using the r object (e.g. r.x would access an x variable created within R from Python).
Built in conversion for many Python object types is provided, including NumPy arrays and Pandas data frames.
As an example, you can use Pandas to read and manipulate data then easily plot the Pandas data frame using ggplot2:
Let’s first create a flights.csv dataset in R:
if (!file.exists(here("data", "flights.csv"))) {
  readr::write_csv(nycflights13::flights,
                   file = here("data", "flights.csv"))
}
Use Python to read in the file and do some data wrangling
import pandas
flights_path = "/Users/shicks/Documents/github/teaching/jhustatcomputing2021/data/flights.csv"
flights = pandas.read_csv(flights_path)
flights = flights[flights['dest'] == "ORD"]
flights = flights[['carrier', 'dep_delay', 'arr_delay']]
flights = flights.dropna()
flights
carrier dep_delay arr_delay
5 UA -4.0 12.0
9 AA -2.0 8.0
25 MQ 8.0 32.0
38 AA -1.0 14.0
57 AA -4.0 4.0
... ... ... ...
336645 AA -12.0 -37.0
336669 UA -7.0 -13.0
336675 MQ -7.0 -11.0
336696 B6 -5.0 -23.0
336709 AA -13.0 -38.0
[16566 rows x 3 columns]
head(py$flights)
carrier dep_delay arr_delay
5 UA -4 12
9 AA -2 8
25 MQ 8 32
38 AA -1 14
57 AA -4 4
70 UA 9 20
py$flights_path
[1] "/Users/shicks/Documents/github/teaching/jhustatcomputing2021/data/flights.csv"
Next, we can use R to visualize the Pandas DataFrame. It is available in R as py$flights, where it is converted to an R data frame.
ggplot(py$flights, aes(x = carrier, y = arr_delay)) +
  geom_point() +
  geom_jitter()
Note that the reticulate Python engine is enabled by default within R Markdown whenever reticulate is installed.
Use R to read and manipulate data
library(tidyverse)
flights <- read_csv(here("data","flights.csv")) %>%
  filter(dest == "ORD") %>%
  select(carrier, dep_delay, arr_delay) %>%
  na.omit()
flights
# A tibble: 16,566 × 3
carrier dep_delay arr_delay
<chr> <dbl> <dbl>
1 UA -4 12
2 AA -2 8
3 MQ 8 32
4 AA -1 14
5 AA -4 4
6 UA 9 20
7 UA 2 21
8 AA -6 -12
9 MQ 39 49
10 B6 -2 15
# … with 16,556 more rows
If you recall, we can access objects created within R chunks from Python using the r object (e.g. r.x would access an x variable created within R from Python). We can then ask for the first ten rows using the head() method in Python.
r.flights.head(10)
carrier dep_delay arr_delay
0 UA -4.0 12.0
1 AA -2.0 8.0
2 MQ 8.0 32.0
3 AA -1.0 14.0
4 AA -4.0 4.0
5 UA 9.0 20.0
6 UA 2.0 21.0
7 AA -6.0 -12.0
8 MQ 39.0 49.0
9 B6 -2.0 15.0
You can use the import() function to import any Python module and call it from R. For example, this code imports the Python os module and calls its listdir() function:
os <- import("os")
os$listdir(".")
[1] "python-for-r-users_files" "python-for-r-users.Rmd"
[3] "python-for-r-users.html"
Functions and other data within Python modules and classes can be accessed via the $ operator (analogous to the way you would interact with an R list, environment, or reference class).
Imported Python modules support code completion and inline help:
[Source: Rstudio]
Similarly, we can import the pandas library:
year month day dep_time sched_dep_time dep_delay arr_time
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
sched_arr_time arr_delay carrier flight tailnum origin dest
1 819 11 UA 1545 N14228 EWR IAH
2 830 20 UA 1714 N24211 LGA IAH
3 850 33 AA 1141 N619AA JFK MIA
4 1022 -18 B6 725 N804JB JFK BQN
5 837 -25 DL 461 N668DN LGA ATL
6 728 12 UA 1696 N39463 EWR ORD
air_time distance hour minute time_hour
1 227 1400 5 15 2013-01-01T10:00:00Z
2 227 1416 5 29 2013-01-01T10:00:00Z
3 160 1089 5 40 2013-01-01T10:00:00Z
4 183 1576 5 45 2013-01-01T10:00:00Z
5 116 762 6 0 2013-01-01T11:00:00Z
6 150 719 5 58 2013-01-01T10:00:00Z
class(test)
[1] "data.frame"
or the scikit-learn python library:
skl_lr <- import("sklearn.linear_model")
source_python("secret_functions.py")
subject_1 <- read_subject("secret_data.csv")
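The source_python() call above makes the functions defined in a Python script available in R. The contents of secret_functions.py are not shown in this lesson, so the following is a purely hypothetical sketch of what such a script could contain: a read_subject() function, here assumed to read a CSV file with the standard-library csv module. The demonstration file and its columns are invented.

```python
import csv
import os
import tempfile

def read_subject(path):
    """Read a CSV file and return its rows as a list of dictionaries.

    Hypothetical stand-in: the real read_subject() is not shown in the lesson.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))

# Small demonstration with a throwaway file (made-up columns and values)
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("id,score\n1,0.5\n2,0.7\n")
    rows = read_subject(demo_path)

print(rows)
```

Any function defined at the top level of the sourced script, like read_subject() here, becomes callable from the R session by the same name.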
If you want to work with Python interactively you can call the repl_python() function, which provides a Python REPL embedded within your R session.
Objects created within the Python REPL can be accessed from R using the py object exported from reticulate. For example:
[Source: Rstudio]
That is, objects persist in R after exiting the Python REPL. Typing x = 4 in the REPL makes py$x equal to 4 in R after you exit the REPL.
Enter exit within the Python REPL to return to the R prompt.
Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hicks (2021, Oct. 14). Statistical Computing: Python for R users. Retrieved from https://stephaniehicks.com/jhustatcomputing2021/posts/2021-10-14-python-for-r-users/
BibTeX citation
@misc{hicks2021python,
  author = {Hicks, Stephanie},
  title = {Statistical Computing: Python for R users},
  url = {https://stephaniehicks.com/jhustatcomputing2021/posts/2021-10-14-python-for-r-users/},
  year = {2021}
}