1 Introduction to R
There are only two kinds of languages: the ones people complain about and the ones nobody uses. —Bjarne Stroustrup
1.1 Acknowledgements
Material for this chapter was borrowed and adapted from
- https://rdpeng.github.io/Biostat776/lecture-introduction-and-overview.html
- https://rafalab.github.io/dsbook
- https://rmd4sci.njtierney.com
- https://andreashandel.github.io/MADAcourse
- An overview and history of R from Roger Peng
- Installing R and RStudio from Rafael Irizarry
- Getting Started in R and RStudio from Rafael Irizarry
- R for Data Science by Wickham & Grolemund (2017). Covers most of the basics of using R for data analysis.
- Advanced R by Wickham (2014). Covers a number of areas including object-oriented programming, functional programming, profiling and other advanced topics.
- RStudio IDE cheatsheet
1.2 Learning objectives
At the end of this lesson you will:
- Learn about (some of) the history of R.
- Identify some of the strengths and weaknesses of R.
- Install R and RStudio on your computer.
- Know how to install and load R packages.
1.3 Overview and history of R
Below is a very quick introduction to R, to get you set up and running.
Like every programming language, R has its advantages and disadvantages. If you search the internet, you will quickly discover lots of folks with opinions about R. Some of the features that are useful to know are:
- R is open-source, freely accessible, and cross-platform (multiple OS).
- R is a “high-level” programming language, relatively easy to learn.
- While “low-level” programming languages (e.g. Fortran, C, etc.) often have more efficient code, they can also be harder to learn because they are designed to be close to machine language.
- In contrast, high-level languages deal more with variables, objects, functions, loops, and other abstract CS concepts with a focus on usability over optimal program efficiency.
- R integrates easily with document preparation systems like LaTeX, but R files can also be used to create .docx, .pdf, .html, and .ppt files with integrated R code output and graphics.
- The R Community is very dynamic, helpful and welcoming.
- Check out #rstats or #rtistry on Twitter, the TidyTuesday podcast and community activity in the R4DS Online Learning Community, and the r/rstats subreddit.
- If you are looking for more local resources, check out R-Ladies Baltimore.
- Through R packages, it is easy to get lots of state-of-the-art algorithms.
- Documentation and help files for R are generally good.
1.3.1 Basic Features of R
Today R runs on almost any standard computing platform and operating system.
Its open source nature means that anyone is free to adapt the software to whatever platform they choose.
One nice feature that R shares with many popular open source projects is frequent releases.
- These days there is a major annual release, typically in the spring, where major new features are incorporated and released to the public.
- Throughout the year, smaller-scale bugfix releases will be made as needed.
Another key advantage that R has over many other packages (even today) is its sophisticated graphics capabilities. R’s ability to create “publication quality” graphics has existed since the very beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues.
Finally, one of the joys of using R has nothing to do with the language itself, but rather with the active and vibrant user community. In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things.
R is that platform and thousands of people around the world have come together to make contributions to R, to develop packages, and help each other use R for all kinds of applications.
The R-help and R-devel mailing lists have been highly active for over a decade now and there is considerable activity on web sites like Stack Overflow, Twitter #rstats, #rtistry, and Reddit.
1.3.2 Free Software
A major advantage that R has over many other statistical packages is that it’s free in the sense of free software (it’s also free in the sense of free beer). The copyright for the primary source code for R is held by the R Foundation and is published under the GNU General Public License version 2.0.
According to the Free Software Foundation, with free software, you are granted the following four freedoms:
- The freedom to run the program, for any purpose (freedom 0).
- The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.
- The freedom to redistribute copies so you can help your neighbor (freedom 2).
- The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.
You can visit the Free Software Foundation’s web site to learn a lot more about free software. The Free Software Foundation was founded by Richard Stallman in 1985 and Stallman’s personal web site is an interesting read if you happen to have some spare time.
1.3.3 Design of the R System
The primary R system is available from the Comprehensive R Archive Network, also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.
The R system is divided into 2 conceptual parts:
- The “base” R system that you download from CRAN.
- Everything else.
R functionality is divided into a number of packages.
Packages are installable pieces of software that contain functions for you to use in your own data analyses.
The “base” R system contains, among other things, the base package, which is required to run R and contains the most fundamental functions.
The other packages contained in the “base” system include utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, and stats4.
There are also “Recommended” packages that ship with R, such as MASS, lattice, Matrix, and survival.
When you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:
- There are over 10,000 packages on CRAN that have been developed by users and programmers around the world.
- There are also many packages associated with the Bioconductor project.
- People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.
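To see this split on your own machine, here is a minimal sketch that lists the packages your R installation marks as base or recommended. It relies on the priority argument of installed.packages(); the variable names are only for illustration.

```r
# List installed packages by the priority that R assigns to them.
base_pkgs <- unname(installed.packages(priority = "base")[, "Package"])
recommended_pkgs <- unname(installed.packages(priority = "recommended")[, "Package"])

print(base_pkgs)         # packages in the "base" system, e.g. stats and utils
print(recommended_pkgs)  # "Recommended" packages present in this installation
```

Anything that shows up in installed.packages() without one of these priorities was added later, for example from CRAN or Bioconductor.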
1.4 Using R and RStudio
If R is the engine and bare bones of your car, then RStudio is like the rest of the car. The engine is super critical part of your car. But in order to make things properly functional, you need to have a steering wheel, comfy seats, a radio, rear and side view mirrors, storage, and seatbelts. — Nicholas Tierney
The RStudio layout has the following features:
- On the upper left, something called an R Markdown script
- On the lower left, the R console
- On the lower right, the view for files, plots, packages, help, and viewer.
- On the upper right, the environment / history pane
The R console is the bit where you can run your code. This is where the R code in your R Markdown document gets sent to run (we’ll learn about these files later).
The file/plot/package viewer on the lower right is a handy browser for your current files (like Finder or File Explorer). It is also where your plots appear, where you can view your installed packages, and where you can read the help files. The environment / history pane contains the list of things you have created and the past commands that you have run.
1.4.1 Installing R and RStudio
- If you have not already, install R first. If you already have R installed, make sure it is a fairly recent version, version 4.0 or newer. If yours is older, I suggest you update (install a new R version).
- Once you have R installed, install the free version of RStudio Desktop. Again, make sure it’s a recent version.
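If you already have R installed and are not sure which version it is, a quick check from the R console might look like the sketch below. The 4.0 threshold comes from the recommendation above; the variable name is_recent is just for illustration.

```r
# Print the version of R that is currently running.
print(R.version.string)

# Compare against the minimum version suggested above (4.0).
is_recent <- getRversion() >= "4.0.0"
if (is_recent) {
  message("Your R version is recent enough.")
} else {
  message("Consider installing a newer version of R.")
}
```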
Installing R and RStudio should be fairly straightforward. If you need more help, a great set of detailed instructions is in Rafael Irizarry’s dsbook (https://rafalab.github.io/dsbook).
I personally only have experience with Mac, but everything should work on all the standard operating systems (Windows, Mac, and even Linux).
1.4.2 RStudio default options
To get set up, I highly recommend changing the following settings under Tools > Global Options (or Cmd + , on macOS).
Under the General tab:
- For Workspace: uncheck “Restore .RData into workspace at startup” and set “Save workspace to .RData on exit” to “Never”.
- For History: uncheck “Always save history (even when not saving .RData)” and uncheck “Remove duplicate entries in history”.
This means that you won’t save the objects and other things that you create in your R session and reload them. This is important for two reasons:
- Reproducibility: you don’t want to have objects from last week cluttering your session
- Privacy: you don’t want to save private data or other things to your session. You only want to read these in.
Your “history” is the commands that you have entered into R.
Additionally, not saving your history means that you won’t be relying on things that you typed in the last session, which is a good habit to get into!
1.4.3 Installing and loading R packages
As we discussed, most of the functionality and features in R come in the form of add-on packages. There are tens of thousands of packages available, some big, some small, some well documented, some not.
The “official” place for packages is the CRAN website. If you are interested in packages on a specific topic, the CRAN task views provide curated descriptions of packages sorted by topic.
To install an R package from CRAN, you can call the install.packages() function and pass the name of the package as an argument. For example, to install the ggplot2 package from CRAN, open RStudio, go to the R prompt (the > symbol) in the lower-left corner, and type
install.packages("ggplot2")
and the appropriate version of the package will be installed.
Often, a package needs other packages to work (called dependencies), and they are installed automatically. It usually does not matter if you use a single or double quotation mark around the name of the package.
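Because a package only needs to be installed once, it can be handy to skip the installation when the package is already there. The sketch below shows one way to do that with requireNamespace(). The helper install_if_missing() is a hypothetical name and not part of R, and the demonstration uses stats, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper (not part of R): install a CRAN package only when missing.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  install.packages(pkg)       # needs an internet connection
  invisible(TRUE)
}

# "stats" ships with R, so this call does not install anything.
installed_now <- install_if_missing("stats")
print(installed_now)
```

Calling install_if_missing("ggplot2") would install ggplot2 and its dependencies the first time and do nothing on later calls.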
- As you installed the ggplot2 package, what other packages were installed?
- What happens if you try to install GGplot2?
It could be that you already have all packages required by ggplot2 installed. In that case, you will not see any other packages installed. To see which packages ggplot2 needs (and thus installs if they are not present), type into the R console:
tools::package_dependencies("ggplot2")
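By default, tools::package_dependencies() looks packages up in the CRAN package list, which requires an internet connection. As a sketch, assuming you only care about packages already installed on your machine, you can pass the result of installed.packages() as the db argument instead. The example uses stats, which ships with R, so it runs even if ggplot2 is not installed; the variable names are illustrative.

```r
# Use the locally installed packages as the lookup database (no network needed).
local_db <- installed.packages()

# Direct dependencies (Depends, Imports, LinkingTo) of the stats package.
deps <- tools::package_dependencies("stats", db = local_db)
print(deps)

# recursive = TRUE also follows the dependencies of the dependencies.
all_deps <- tools::package_dependencies("stats", db = local_db, recursive = TRUE)
print(all_deps)
```

Replacing "stats" with "ggplot2" gives the corresponding lists for ggplot2 once it is installed.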
In RStudio, you can also install (and update/remove) packages by clicking on the ‘Packages’ tab in the bottom right window.
It is very common these days for packages to be developed on GitHub, and it is possible to install packages from GitHub directly. Those usually contain the latest version of the package, with features that might not be available yet on the CRAN website. Sometimes, in early development stages, a package is only on GitHub until the developer(s) feel it is good enough for CRAN submission. So installing from GitHub gives you the latest version. The downside is that packages under development can be buggy and may not work properly. To install packages from GitHub, you need to install the remotes package and then use the following function
remotes::install_github()
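As an illustration, remotes::install_github() takes the repository in "user/repo" form. The sketch below wraps the steps in a hypothetical helper, install_ggplot2_dev(), so that nothing is downloaded until you call it. It assumes you want the development version of ggplot2, which lives in the tidyverse/ggplot2 repository.

```r
# Hypothetical helper: install the development version of ggplot2 from GitHub.
install_ggplot2_dev <- function() {
  # remotes itself comes from CRAN and only needs to be installed once.
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  # The argument is the GitHub repository in "user/repo" form.
  remotes::install_github("tidyverse/ggplot2")
}

# install_ggplot2_dev()  # uncomment to run; needs an internet connection
```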
You only need to install a package once, unless you upgrade/re-install R. Once installed, you still need to load the package before you can use it. That has to happen every time you start a new R session. You do that using the library() command. For instance, to load the ggplot2 package, type
library('ggplot2')
You may or may not see a short message on the screen. Some packages show messages when you load them, and others do not.
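To make the difference between installed and loaded concrete, here is a small sketch using splines and tools, two packages that ship with R but are not attached when a session starts. It also shows the pkg::function form, which reaches a function without calling library() first.

```r
# splines ships with R, so it is installed but not attached by default.
library(splines)

# After library(), the package appears on the search path.
print("package:splines" %in% search())

# Without library(), a function can still be reached with the pkg::function form.
print(tools::file_ext("report.html"))
```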
This was a quick overview of R packages. We will use a lot of them, so you will get used to them rather quickly.
1.4.4 Getting started in RStudio
You can do pretty much every task in R, including all the ones we cover in this class, without RStudio. Still, RStudio is very useful, has lots of features that make your R coding life easier, and has become pretty much the default integrated development environment (IDE) for R. Since RStudio has lots of features, it takes time to learn them. A good resource for learning more about RStudio is the RStudio Essentials collection of videos.
For more information on setting up and getting started with R, RStudio, and R packages, read the Getting Started chapter in the dsbook.
This chapter gives some tips, shortcuts, and ideas that might be of interest even to those of you who already have R and/or RStudio experience.
1.5 Session Info
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.0
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.2 compiler_4.3.1 fastmap_1.1.1 cli_3.6.1
[5] tools_4.3.1 htmltools_0.5.6.1 rstudioapi_0.15.0 yaml_2.3.7
[9] rmarkdown_2.25 knitr_1.44 jsonlite_1.8.7 xfun_0.40
[13] digest_0.6.33 rlang_1.1.1 evaluate_0.22