There are only two kinds of languages: the ones people complain about and the ones nobody uses. —Bjarne Stroustrup
Pre-lecture materials
Read ahead
Acknowledgements
Material for this lecture was borrowed and adopted from
Learning objectives
Overview and history of R
Below is a very quick introduction to R, to get you set up and running. We’ll go deeper into R and coding later.
tl;dr (R in a nutshell)
Like every programming language, R has its advantages and disadvantages. If you search the internet, you will quickly discover lots of folks with opinions about R. Some of the features that are useful to know are:
- R is open-source, freely accessible, and cross-platform (multiple OS).
- R is a “high-level” programming language, relatively easy to learn.
- While “low-level” programming languages (e.g., Fortran, C) often produce more efficient code, they can also be harder to learn because they are designed to be close to machine language.
- In contrast, high-level languages deal more with variables, objects, functions, loops, and other abstract CS concepts with a focus on usability over optimal program efficiency.
- R is great for statistics, data analysis, websites, web apps, data visualizations, and so much more!
- R integrates easily with document preparation systems like \(\LaTeX\), but R files can also be used to create .docx, .pdf, .html, and .ppt files with integrated R code output and graphics.
- The R Community is very dynamic, helpful and welcoming.
- Check out #rstats or #rtistry on Twitter, the TidyTuesday podcast and community activity in the R4DS Online Learning Community, and the r/rstats subreddit.
- If you are looking for more local resources, check out R-Ladies Baltimore.
- Through R packages, it is easy to get lots of state-of-the-art algorithms.
- Documentation and help files for R are generally good.
While we use R in this course, it is not the only option to analyze data. Maybe the most similar to R, and widely used, is Python, which is also free. There is also commercial software that can be used to analyze data (e.g., Matlab, Mathematica, Tableau, SAS, SPSS). Other more general programming languages are suitable for certain types of analyses as well (e.g., C, Fortran, Perl, Java, Julia).
Depending on your future needs or jobs, you might have to learn one or several of those additional languages. The good news is that even though those languages are all different, they all share general ways of thinking and structuring code. So once you understand a specific concept (e.g., variables, loops, branching statements or functions), it applies to all those languages. Thus, learning a new programming language is much easier once you already know one. And R is a good one to get started with.
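To make those shared concepts concrete, here is a small sketch of a variable, a function, a loop, and a branching statement in R. The names and the 70-degree cutoff are made up for illustration; the same structure would look familiar in most other languages.

```r
# A variable: a name bound to a value (here, a numeric vector)
temperatures_c <- c(12.5, 18.0, 25.3)

# A function: reusable logic with an input and an output
to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

# A loop with a branching statement inside it
labels <- character(length(temperatures_c))
for (i in seq_along(temperatures_c)) {
  if (to_fahrenheit(temperatures_c[i]) > 70) {
    labels[i] <- "warm"
  } else {
    labels[i] <- "cool"
  }
}

labels
```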
With the skills gained in this course, hopefully you will find R a fun and useful programming language for your future projects.
[Source: Artwork by Allison Horst]
Basic Features of R
Today R runs on almost any standard computing platform and operating system. Its open source nature means that anyone is free to adapt the software to whatever platform they choose. Indeed, R has been reported to be running on modern tablets, phones, PDAs, and game consoles.
One nice feature that R shares with many popular open source projects is frequent releases. These days there is a major annual release, typically in the spring, where major new features are incorporated and released to the public. Throughout the year, smaller-scale bugfix releases are made as needed. The frequent releases and regular release cycle indicate active development of the software and ensure that bugs are addressed in a timely manner. Of course, while the core developers control the primary source tree for R, many people around the world make contributions in the form of new features, bug fixes, or both.
Another key advantage that R has over many other statistical packages (even today) is its sophisticated graphics capabilities. R’s ability to create “publication quality” graphics has existed since the very beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues. R’s base graphics system allows for very fine control over essentially every aspect of a plot or graph. Other newer graphics systems, like lattice and ggplot2, allow for complex and sophisticated visualizations of high-dimensional data.
R has maintained the original S philosophy (see box below), which is that it provides a language that is both useful for interactive work and powerful enough for developing new tools. This allows the user, who takes existing tools and applies them to data, to slowly but surely become a developer who is creating new tools.
Finally, one of the joys of using R has nothing to do with the language itself, but rather with the active and vibrant user community. In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things. R is that platform and thousands of people around the world have come together to make contributions to R, to develop packages, and help each other use R for all kinds of applications. The R-help and R-devel mailing lists have been highly active for over a decade now and there is considerable activity on web sites like Stack Overflow, Twitter #rstats, #rtistry, and Reddit.
Free Software
A major advantage that R has over many other statistical packages is that it’s free in the sense of free software (it’s also free in the sense of free beer). The copyright for the primary source code for R is held by the R Foundation and is published under the GNU General Public License version 2.0.
According to the Free Software Foundation, with free software, you are granted the following four freedoms:
- The freedom to run the program, for any purpose (freedom 0).
- The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the source code is a precondition for this.
- The freedom to redistribute copies so you can help your neighbor (freedom 2).
- The freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3). Access to the source code is a precondition for this.
Design of the R System
The primary R system is available from the Comprehensive R Archive Network, also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R.
The R system is divided into 2 conceptual parts:
- The “base” R system that you download from CRAN.
- Everything else.
R functionality is divided into a number of packages.
- The “base” R system contains, among other things, the base package, which is required to run R and contains the most fundamental functions.
- The other packages contained in the “base” system include utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.
- There are also “Recommended” packages: boot, class, cluster, codetools, foreign, KernSmooth, lattice, mgcv, nlme, rpart, survival, MASS, spatial, nnet, Matrix.
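One way to check these lists against your own installation is to filter the installed packages by priority. This is a small sketch using installed.packages() from the utils package; the variable names are just for illustration, and the recommended list may come back empty on a minimal installation that did not include those packages.

```r
# installed.packages() returns a matrix with one row per installed package;
# the `priority` argument keeps only rows with that "Priority" field.
base_pkgs <- unname(installed.packages(priority = "base")[, "Package"])
recommended_pkgs <- unname(installed.packages(priority = "recommended")[, "Package"])

sort(base_pkgs)
sort(recommended_pkgs)
```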
When you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:
- There are over 10,000 packages on CRAN that have been developed by users and programmers around the world.
- There are also many packages associated with the Bioconductor project.
- People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.
Limitations of R
No programming language or statistical analysis system is perfect. R certainly has a number of drawbacks. For starters, R is essentially based on almost 50-year-old technology, going back to the original S system developed at Bell Labs. There was originally little built-in support for dynamic or 3-D graphics (but things have improved greatly since the “old days”).
Another commonly cited limitation of R is that objects must generally be stored in physical memory (though this is increasingly not true anymore). This is in part due to the scoping rules of the language, but R generally is more of a memory hog than other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and in a number of packages developed by contributors. Also, computing power and capacity have continued to grow over time, and the amount of physical memory that can be installed on even a consumer-level laptop is substantial. While we will likely never have enough physical memory on a computer to handle the increasingly large datasets that are being generated, the situation has gotten quite a bit easier over time.
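To get a feel for how much memory an object occupies, base R provides object.size(). The sketch below measures a vector of one million numbers; the vector is just an illustrative stand-in for a real dataset.

```r
# A numeric vector with one million elements (each stored as a double)
x <- numeric(1e6)

# object.size() estimates the memory used to store the object, in bytes
size_bytes <- object.size(x)

# format() can report the same size in larger units
format(size_bytes, units = "MB")

# Remove the object once it is no longer needed so its memory can be reclaimed
rm(x)
```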
At a higher level, one “limitation” of R is that its functionality is based on consumer demand and (voluntary) user contributions. If no one feels like implementing your favorite method, then it’s your job to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect the interests of the R user community. As the community has ballooned in size over the past 10 years, the capabilities have similarly increased. When I first started using R, there was very little in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some of those communities have adopted R and we are seeing more code being written for those kinds of applications.
Using R and RStudio
If R is the engine and bare bones of your car, then RStudio is like the rest of the car. The engine is super critical part of your car. But in order to make things properly functional, you need to have a steering wheel, comfy seats, a radio, rear and side view mirrors, storage, and seatbelts. — Nicholas Tierney
The RStudio layout has the following features:
- On the upper left, something called an R Markdown script
- On the lower left, the R console
- On the lower right, the view for files, plots, packages, help, and viewer.
- On the upper right, the environment / history pane
The R console is the bit where you can run your code. This is where the R code in your Rmarkdown document gets sent to run (we’ll learn about these files later).
The file/plot/pkg viewer is a handy browser for your current files, like Finder or File Explorer. It is also where your plots appear, and where you can view packages and read the help files. The environment / history pane contains the list of things you have created and the past commands that you have run.
Installing R and RStudio
- If you have not already, install R first. If you already have R installed, make sure it is a fairly recent version, version 4.0 or newer. If yours is older, I suggest you update (install a new R version).
- Once you have R installed, install the free version of RStudio Desktop. Again, make sure it’s a recent version.
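If you are not sure which version of R you have, you can ask R itself. A quick sketch, run from the R console; the variable name is only for illustration:

```r
# The full version string of the running R installation
R.version.string

# getRversion() returns a version object that can be compared to a string
is_recent <- getRversion() >= "4.0.0"
is_recent
```

If the last line shows FALSE, your installation is older than 4.0 and is worth updating.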
If things don’t work, ask for help in the Courseplus discussion board.
I personally only have experience with Mac, but everything should work on all the standard operating systems (Windows, Mac, and even Linux).
RStudio default options
To get set up, I highly recommend changing the following settings under Tools > Global Options (or Cmd + , on macOS).
Under the General tab:
- For workspace
- Uncheck “Restore .RData into workspace at startup”
- Set “Save workspace to .RData on exit” to “Never”
- For History
- Uncheck “Always save history (even when not saving .RData)”
- Uncheck “Remove duplicate entries in history”
This means that you won’t save the objects and other things that you create in your R session and reload them. This is important for two reasons:
- Reproducibility: you don’t want to have objects from last week cluttering your session
- Privacy: you don’t want to save private data or other things to your session. You only want to read these in.
Your “history” is the commands that you have entered into R.
Additionally, not saving your history means that you won’t be relying on things that you typed in the last session, which is a good habit to get into!
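With the workspace no longer saved automatically, anything worth keeping has to be written to disk and read back in explicitly. Here is a minimal sketch of that habit using base R’s saveRDS() and readRDS(); the data frame and file name are made up for illustration, and a temporary directory stands in for a real project folder.

```r
# A small example object created during a session (hypothetical data)
scores <- data.frame(id = 1:3, score = c(90, 85, 77))

# Save it to a file on purpose, rather than relying on .RData
path <- file.path(tempdir(), "scores.rds")
saveRDS(scores, path)

# In a later (fresh) session, read it back in explicitly
rm(scores)
scores <- readRDS(path)
scores
```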
Installing and loading R packages
As we discussed, most of the functionality and features in R come in the form of add-on packages. There are tens of thousands of packages available, some big, some small, some well documented, some not. We will be using many different packages in this course. Of course, you are free to install and use any package you come across for any of the assignments.
The “official” place for packages is the CRAN website. If you are interested in packages on a specific topic, the CRAN task views provide curated descriptions of packages sorted by topic.
To install an R package from CRAN, one can simply call the install.packages() function and pass the name of the package as an argument. For example, to install the ggplot2 package from CRAN: open RStudio, go to the R prompt (the > symbol) in the lower-left corner and type
install.packages("ggplot2")
and the appropriate version of the package will be installed.
Often, a package needs other packages to work (called dependencies), and they are installed automatically. It usually does not matter if you use a single or double quotation mark around the name of the package.
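In a script that you or others will run more than once, it can be handy to install a package only when it is missing. The helper below is a hypothetical convenience function (it is not part of R or of any package) built on requireNamespace(), which returns FALSE without an error when a package is not installed.

```r
# Hypothetical helper: install a CRAN package only if it is not yet installed
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers an installation
install_if_missing("stats")

# For the package used in this section, one would call:
# install_if_missing("ggplot2")
```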
It could be that you already have all packages required by ggplot2 installed. In that case, you will not see any other packages installed. To see which packages ggplot2 needs (and thus installs if they are not present), type into the R console:
tools::package_dependencies("ggplot2")
In RStudio, you can also install (and update/remove) packages by clicking on the ‘Packages’ tab in the bottom right window.
It is very common these days for packages to be developed on GitHub, and it is possible to install packages from GitHub directly. Those usually contain the latest version of the package, with features that might not be available yet on the CRAN website. Sometimes, in early development stages, a package is only on GitHub until the developer(s) feel it is good enough for CRAN submission. So installing from GitHub gives you the latest. The downside is that packages under development can be buggy and may not work correctly. To install packages from GitHub, you need to install the remotes package and then use the following function
remotes::install_github()
We will not do that now, but it is quite likely that at one point later in this course we will.
You only need to install a package once, unless you upgrade/re-install R. Once installed, you still need to load the package before you can use it. That has to happen every time you start a new R session. You do that using the library() command. For instance, to load the ggplot2 package, type
library('ggplot2')
You may or may not see a short message on the screen. Some packages show messages when you load them, and others do not.
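To try loading without installing anything, the sketch below uses splines, one of the “base” system packages that is not loaded when R starts. It also shows the `::` form, which calls a single function from a package without loading the whole package.

```r
# Load a package that ships with R but is not attached at startup
library("splines")

# search() lists what is currently attached; the package now appears there
"package:splines" %in% search()

# Call one function from a package without attaching it, using `::`
tools::toTitleCase("hello world")
```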
This was a quick overview of R packages. We will use a lot of them, so you will get used to them rather quickly.
Getting started in RStudio
One can use R for pretty much every task, including all the ones we cover in this class, without using RStudio. Still, RStudio is very useful, has lots of features that make your R coding life easier, and has become pretty much the default integrated development environment (IDE) for R. Since RStudio has lots of features, it takes time to learn them. A good resource to learn more about RStudio is the RStudio Essentials collection of videos.
Post-lecture materials
Final Questions
Here are some post-lecture questions to help you think about the material discussed.
Additional Resources
rtistry
[‘Water Colours’ from Danielle Navarro https://art.djnavarro.net]