Let’s dig into the R programming language and the RStudio integrated developer environment
There are only two kinds of languages: the ones people complain about and the ones nobody uses. —Bjarne Stroustrup
Before class, you can prepare by reading the following materials:
Material for this lecture was borrowed and adopted from
At the end of this lesson you will:
Below is a very quick introduction to R, to get you set up and running. We’ll go deeper into R and coding later.
Like every programming language, R has its advantages and disadvantages. If you search the internet, you will quickly discover lots of folks with opinions about R. Some of the features that are useful to know are:
.docx
, .pdf
, .html
, .ppt
files with integrated R code output and graphics.While we use R in this course, it is not the only option to analyze data. Maybe the most similar to R, and widely used, is Python, which is also free. There is also commercial software that can be used to analyze data (e.g., Matlab, Mathematica, Tableau, SAS, SPSS). Other more general programming languages are suitable for certain types of analyses as well (e.g., C, Fortran, Perl, Java, Julia).
Depending on your future needs or jobs, you might have to learn one or several of those additional languages. The good news is that even though those languages are all different, they all share general ways of thinking and structuring code. So once you understand a specific concept (e.g., variables, loops, branching statements or functions), it applies to all those languages. Thus, learning a new programming language is much easier once you already know one. And R is a good one to get started with.
With the skills gained in this course, hopefully you will find R a fun and useful programming langauge for your future projects.
[Source: Artwork by Allison Horst]
Today R runs on almost any standard computing platform and operating system. Its open source nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R has been reported to be running on modern tablets, phones, PDAs, and game consoles.
One nice feature that R shares with many popular open source projects is frequent releases. These days there is a major annual release, typically in October, where major new features are incorporated and released to the public. Throughout the year, smaller-scale bugfix releases will be made as needed. The frequent releases and regular release cycle indicates active development of the software and ensures that bugs will be addressed in a timely manner. Of course, while the core developers control the primary source tree for R, many people around the world make contributions in the form of new feature, bug fixes, or both.
Another key advantage that R has over many other statistical packages (even today) is its sophisticated graphics capabilities. R’s ability to create “publication quality” graphics has existed since the very beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues. R’s base graphics system allows for very fine control over essentially every aspect of a plot or graph. Other newer graphics systems, like lattice and ggplot2 allow for complex and sophisticated visualizations of high-dimensional data.
R has maintained the original S philosophy (see box below), which is that it provides a language that is both useful for interactive work, but contains a powerful programming language for developing new tools. This allows the user, who takes existing tools and applies them to data, to slowly but surely become a developer who is creating new tools.
For a great discussion on an overview and history of R and the S programming language, read through this chapter from Roger D. Peng.
Finally, one of the joys of using R has nothing to do with the language itself, but rather with the active and vibrant user community. In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things. R is that platform and thousands of people around the world have come together to make contributions to R, to develop packages, and help each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly active for over a decade now and there is considerable activity on web sites like Stack Overflow, Twitter #rstats and Reddit.
A major advantage that R has over many other statistical packages and is that it’s free in the sense of free software (it’s also free in the sense of free beer). The copyright for the primary source code for R is held by the R Foundation and is published under the GNU General Public License version 2.0.
According to the Free Software Foundation, with free software, you are granted the following four freedoms
The freedom to run the program, for any purpose (freedom 0).
The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.
The freedom to redistribute copies so you can help your neighbor (freedom 2).
The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.
You can visit the Free Software Foundation’s web site to learn a lot more about free software. The Free Software Foundation was founded by Richard Stallman in 1985 and Stallman’s personal web site is an interesting read if you happen to have some spare time.
The primary R system is available from the Comprehensive R Archive Network, also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.
The R system is divided into 2 conceptual parts:
R functionality is divided into a number of packages.
The “base” R system contains, among other things, the base
package which is required to run R and contains the most fundamental functions.
The other packages contained in the “base” system include utils
, stats
, datasets
, graphics
, grDevices
, grid
, methods
, tools
, parallel
, compiler
, splines
, tcltk
, stats4
.
There are also “Recommended” packages: boot
, class
, cluster
, codetools
, foreign
, KernSmooth
, lattice
, mgcv
, nlme
, rpart
, survival
, MASS
, spatial
, nnet
, Matrix
.
When you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:
There are over 10,000 packages on CRAN that have been developed by users and programmers around the world.
There are also many packages associated with the Bioconductor project.
People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.
Questions:
No programming language or statistical analysis system is perfect. R certainly has a number of drawbacks. For starters, R is essentially based on almost 50 year old technology, going back to the original S system developed at Bell Labs. There was originally little built in support for dynamic or 3-D graphics (but things have improved greatly since the “old days”).
Another commonly cited limitation of R is that objects must generally be stored in physical memory (though this is increasingly not true anymore). This is in part due to the scoping rules of the language, but R generally is more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and also in a number of packages developed by contributors. Also, computing power and capacity has continued to grow over time and amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time.
At a higher level one “limitation” of R is that its functionality is based on consumer demand and (voluntary) user contributions. If no one feels like implementing your favorite method, then it’s your job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased. When I first started using R, there was very little in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some of those communities have adopted R and we are seeing more code being written for those kinds of applications.
If R is the engine and bare bones of your car, then RStudio is like the rest of the car. The engine is super critical part of your car. But in order to make things properly functional, you need to have a steering wheel, comfy seats, a radio, rear and side view mirrors, storage, and seatbelts. — Nicholas Tierney
[Source]
The RStudio layout has the following features:
The R console is the bit where you can run your code. This is where the R code in your Rmarkdown document gets sent to run (we’ll learn about these files later).
The file/plot/pkg viewer is a handy browser for your current files, like Finder, or File Explorer, plots are where your plots appear, you can view packages, see the help files. And the environment / history pane contains the list of things you have created, and the past commands that you have run.
Installing R and RStudio should be fairly straightforward. However, a great set of detailed instructions is in Rafael Irizarry’s dsbook
If things don’t work, ask for help in the courseplus discussion board.
I personally only have experience with Mac, but everything should work on all the standard operating systems (Windows, Mac, and even Linux).
To first get set up, I highly recommend changing the following setting
Tools > Global Options (or Cmd + ,
on macOS)
Under the General tab:
This means that you won’t save the objects and other things that you create in your R session and reload them. This is important for two reasons
Your “history” is the commands that you have entered into R.
Additionally, not saving your history means that you won’t be relying on things that you typed in the last session, which is a good habit to get into!
As we discussed, most of the functionality and features in R come in the form of add-on packages. There are tens of thousands of packages available, some big, some small, some well documented, some not. We will be using many different packages in this course. Of course, you are free to install and use any package you come across for any of the assignments.
The “official” place for packages is the CRAN website. If you are interested in packages on a specific topic, the CRAN task views provide curated descriptions of packages sorted by topic.
To install an R package from CRAN, one can simply call the install.packages()
function and pass the name of the package as an argument. For example, to install the ggplot2
package from CRAN: open RStudio,go to the R prompt (the >
symbol) in the lower-left corner and type
install.packages("ggplot2")
and the appropriate version of the package will be installed.
Often, a package needs other packages to work (called dependencies), and they are installed automatically. It usually does not matter if you use a single or double quotation mark around the name of the package.
Questions:
ggplot2
package, what other packages were installed?GGplot2
?It could be that you already have all packages required by ggplot2
installed. In that case, you will not see any other packages installed. To see which of the packages above ggplot2
needs (and thus installs if it is not present), type into the R console:
tools::package_dependencies("ggplot2")
In RStudio, you can also install (and update/remove) packages by clicking on the ‘Packages’ tab in the bottom right window.
It is very common these days for packages to be developed on GitHub. It is possible to install packages from GitHub directly. Those usually contain the latest version of the package, with features that might not be available yet on the CRAN website. Sometimes, in early development stages, a package is only on GitHub until the developer(s) feel it is good enough for CRAN submission. So installing from GitHub gives you the latest. The downside is that packages under development can often be buggy and not working right. To install packages from GitHub, you need to install the remotes
package and then use the following function
remotes::install_github()
We will not do that now, but it is quite likely that at one point later in this course we will.
You only need to install a package once, unless you upgrade/re-install R. Once installed, you still need to load the package before you can use it. That has to happen every time you start a new R session. You do that using the library()
command. For instance to load the ggplot2
package, type
You may or may not see a short message on the screen. Some packages show messages when you load them, and others do not.
This was a quick overview of R packages. We will use a lot of them, so you will get used to them rather quickly.
While one can use R and do pretty much every task, including all the ones we cover in this class, without using RStudio, RStudio is very useful, has lots of features that make your R coding life easier and has become pretty much the default integrated development environment (IDE) for R. Since RStudio has lots of features, it takes time to learn them. A good resource to learn more about RStudio are the R Studio Essentials collection of videos.
For more information on setting up and getting started with R, RStudio, and R packages, read the Getting Started chapter in the dsbook
:
This chapter gives some tips, shortcuts, and ideas that might be of interest even to those of you who already have R and/or RStudio experience.
Here are some post-lecture questions to help you think about the material discussed.
Questions:
If a software company asks you, as a requirement for using their software, to sign a license that restricts you from using their software to commit illegal activities, is this consistent with the “Four Freedoms” of Free Software?
What is an R package and what is it used for?
What function in R can be used to install packages from CRAN?
What is a limitation of the current R system?
R for Data Science by Wickham & Grolemund (2017). Covers most of the basics of using R for data analysis.
Advanced R by Wickham (2014). Covers a number of areas including object-oriented, programming, functional programming, profiling and other advanced topics.
Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hicks (2021, Aug. 31). Statistical Computing: Introduction to R and RStudio. Retrieved from https://stephaniehicks.com/jhustatcomputing2021/posts/2021-08-31-introduction-to-r-and-rstudio/
BibTeX citation
@misc{hicks2021introduction, author = {Hicks, Stephanie}, title = {Statistical Computing: Introduction to R and RStudio}, url = {https://stephaniehicks.com/jhustatcomputing2021/posts/2021-08-31-introduction-to-r-and-rstudio/}, year = {2021} }