A brief intro to the why and what of machine learning
Before class, you can prepare by reading the following materials:
Material for this lecture was borrowed and adapted from
At the end of this lesson you will:
Today we are going to be talking about machine learning. This is one of the hottest areas in all of statistics/data science. Machine learning is in the news a ton lately. Some examples include:
This is such a big idea that universities are investing major dollars to hire faculty in AI and machine learning.
On a more personal note, you might be interested in AI and machine learning because it is one of the most in-demand skills for data scientists right now. If you get really good at it, you can make a lot of money.
The other really exciting reason to focus on AI and ML right now is that there is a lot of room for statistical science. Some of the biggest open problems include:
There are a ton more, including how to do EDA for machine learning, how to understand potential confounders and bias, and how to think about the predictive value of a positive.
Ok so machine learning is super hot right now, but what is machine learning really? You may have learned about the central dogma of statistics: you sample from a population
[Source]
Then you try to guess what will happen in the population from the sample.
[Source]
For prediction, we have a similar sampling problem
[Source]
But now we are trying to build a rule that can be used to predict a single observation’s value of some characteristic using the others.
[Source]
We can make this more concrete with a little mathematical notation.
This section borrowed from Rafa Irizarry’s excellent Data Science Book
In machine learning, data comes in the form of:
Here, we will use \(Y\) to denote the outcome and \(X_1, \dots, X_p\) to denote features. Note that features are sometimes referred to as predictors or covariates. We consider all these to be synonyms.
Goal: we want to build an algorithm that takes feature values as input and returns a prediction for the outcome when we do not know the outcome.
The machine learning approach is to train an algorithm using a dataset for which we do know the outcome to identify patterns in the training data, and then apply this algorithm in the future to make a prediction when we do not know the outcome.
Prediction problems can be divided into categorical and continuous outcomes.
For categorical outcomes, \(Y\) can be any one of \(K\) classes. The number of classes can vary greatly across applications. For example, in the digit reader data, \(K=10\) with the classes being the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. In speech recognition, the outcomes are all possible words or phrases we are trying to detect. Spam detection has two outcomes: spam or not spam.
In this lesson, we denote the \(K\) categories with indexes \(k=1,\dots,K\). However, for binary data we will use \(k=0,1\) for mathematical convenience.
The general set-up is as follows. We have a series of features and an unknown outcome we want to predict:
outcome | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 |
---|---|---|---|---|---|
? | X_1 | X_2 | X_3 | X_4 | X_5 |
To build a model that provides a prediction for any set of values \(X_1=x_1, X_2=x_2, \dots X_5=x_5\), we collect data for which we know the outcome:
outcome | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 |
---|---|---|---|---|---|
Y_1 | X_1,1 | X_1,2 | X_1,3 | X_1,4 | X_1,5 |
Y_2 | X_2,1 | X_2,2 | X_2,3 | X_2,4 | X_2,5 |
Y_3 | X_3,1 | X_3,2 | X_3,3 | X_3,4 | X_3,5 |
Y_4 | X_4,1 | X_4,2 | X_4,3 | X_4,4 | X_4,5 |
Y_5 | X_5,1 | X_5,2 | X_5,3 | X_5,4 | X_5,5 |
Y_6 | X_6,1 | X_6,2 | X_6,3 | X_6,4 | X_6,5 |
Y_7 | X_7,1 | X_7,2 | X_7,3 | X_7,4 | X_7,5 |
Y_8 | X_8,1 | X_8,2 | X_8,3 | X_8,4 | X_8,5 |
Y_9 | X_9,1 | X_9,2 | X_9,3 | X_9,4 | X_9,5 |
Y_10 | X_10,1 | X_10,2 | X_10,3 | X_10,4 | X_10,5 |
We use the notation \(\hat{Y}\) to denote the prediction. We use the term actual outcome to denote what we ended up observing. So we want the prediction \(\hat{Y}\) to match the actual outcome.
The central problem in machine learning can thus be written very simply as minimizing a distance metric. Let \(\hat{Y} = f(\vec{X})\); then our goal is to minimize the distance between our estimated function of the predictors and the actual value:
\[d(Y, f(\vec{X}))\]
\(d(\cdot)\) could be something as simple as the mean squared distance or something much more complex. The bulk of machine learning research in theoretical computer science and statistics departments focuses on defining different values of \(d\) and \(f\). We will talk a bit more about this in the next lesson.
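As a toy illustration (not from the original lesson), two of the simplest choices of \(d\) can be written as one-line R functions, where `y` is the observed outcome and `y_hat` is the prediction:

# mean squared error: a natural distance for continuous outcomes
mse <- function(y, y_hat) mean((y - y_hat)^2)

# accuracy: a natural score for categorical outcomes
# (we maximize this, i.e. minimize 1 - accuracy)
accuracy <- function(y, y_hat) mean(y == y_hat)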
A machine learning problem consists of a few different parts, and it's important to consider each one. To solve a (standard) machine learning problem you need:
While each of these components is a technical problem, there has been a ton of work addressing those technical details. The most pressing open issue in machine learning is recognizing that although these are technical steps, they are not objective steps.
In other words, how you choose the data, algorithm, metric, and definition of “good” says what you value and can dramatically change the results. A couple of cases where this was a big deal are:
Try out the moralmachine to see how this looks in practice.
Quick,Draw! is an online game where you are given an object to draw (like a cello, axe, airplane, etc.) and then you have to draw it with your finger. Then a pre-trained deep learning algorithm is applied to guess what kind of a drawing you have made. You can try it out here.
One interesting thing about this project and something to keep in mind if you are thinking about ways to get cool data is the exchange that Google is making here. They are giving you a fun game to play for free and in return you are giving them a ton of free data. This is the same exchange made by other successful startups:
Before going any further, we load a few R packages we will need
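(The exact setup chunk isn't shown here; a minimal set of packages consistent with the code in this lesson would be the following. `rjson` and `LaF` are called below with the `::` prefix, so they only need to be installed.)

library(here)       # locate files within the project
library(dplyr)      # data manipulation
library(tibble)     # tibbles
library(ggplot2)    # plotting
library(tidytext)   # unnest_tokens() for the text example
library(Hmisc)      # cut2() for binning the drawing grid
library(splines)    # ns() for the overfitting example
library(modelr)     # add_predictions()
library(caret)      # createDataPartition(), train(), confusionMatrix()
library(rpart.plot) # rpart.plot() to visualize the fitted tree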
The main steps in a machine learning problem are:
We will use the Quick, Draw! dataset to discuss a few different parts of the ML process.
This is the most commonly missed step when developing a machine learning algorithm. ML can very easily be turned into an engineering problem: just dump the outcome and the features into a black box algorithm and voilà!
But this kind of thinking can lead to major problems. In general good ML questions:
For example, there is a famous case where Google predicted flu outbreaks based on search results. But the way people searched (due to changes in the algorithm, changes in culture, or other unexplained reasons) led to variation in the search terms people were using. This led the algorithm's predictions to become wildly inaccurate over time.
This is just one example of a spurious correlation, which is one of the big worries. In general all of the reasons for skepticism apply here.
In our Quick, Draw! example, there are a ton of interesting analyses you could do with these data, but we are going to start with something simple. Can we predict from the drawing data what kind of object was drawn? To make matters even simpler, we are going to look at just a couple of types of images: axes and clouds.
One important part of any machine learning problem is defining what success looks like. This choice very much depends on the application and what you are trying to do.
For example, when we talk about the goal in ML we are usually talking about the error rate we want to minimize and how small we want to make it. Consider for each observation we have an outcome \(y\) and a set of features \(\vec{x}\). Our goal is to create a function \(\hat{y} = \hat{f}(\vec{x})\) such that the distance, \(d(y,\hat{f}(\vec{x}))\), between the observed and the predicted \(y\) is minimized.
The two most common distances that show up in machine learning (and the ones you will always be using if you don’t change the defaults!) are:
Here, we are going to use simple accuracy and say that anything better than guessing is “good enough”.
But in general there are a number of other potential error measures:
Here are a few examples of how they might be relevant.
In general, you need to spend a good amount of time thinking about what the goal is and what the tradeoffs between the various types of errors are, and then build that into your model.
Here, we will focus on one somewhat unique issue that arises often in ML problems - the data are often huge and not sampled randomly.
Questions:
We know that a 5% random sample is better than a 5% non-random sample in measurable ways (e.g. bias, uncertainty assessment).
But is an 80% non-random sample “better” than a 5% random sample in measurable terms? 90%? 95%? 99%? (Jeremy Wu 2012)
“Which one should we trust more: a 1% survey with 60% response rate or a non-probabilistic dataset covering 80% of the population?” (Keiding and Louis, 2016, Journal of the Royal Statistical Society, Series B)
There is a pretty detailed, but really good paper addressing this question by Xiao-Li Meng.
The surprising answer is that it depends! If there is correlation between the outcome value and the sampling probability even huge data sets can actually be very small in “effective sample size”. So it is worth thinking very hard about potential correlation between the outcome and the (designed or de facto) sampling scheme.
OK back to our example. Google has released some of the data from the Quick, Draw! challenge. You can either get the data in raw form or you can get some pre-processed data.
I downloaded the pre-processed data sets for clouds and axes. These data are available from Google Cloud Platform.
All these things make the data easier to manage and to represent in a plot. This already represents a lot of work, but even so we still have some more pre-processing to do.
First, we are going to load some of the data into R, it comes in ndjson (new line delimited JSON) format and there are lots of drawings.
if(!file.exists(here("data", "axe.ndjson"))){
file_url_axe <- paste0("https://storage.googleapis.com/quickdraw_dataset/full/simplified/axe.ndjson")
download.file(file_url_axe,
destfile=here("data", "axe.ndjson"))
file_url_cloud <- paste0("https://storage.googleapis.com/quickdraw_dataset/full/simplified/cloud.ndjson")
download.file(file_url_cloud,
destfile=here("data", "cloud.ndjson"))
}
list.files(here("data"))
[1] "2016-07-19.csv.bz2" "axe.ndjson"
[3] "bmi_pm25_no2_sim.csv" "chicago.rds"
[5] "Chinook.sqlite" "chopped.RDS"
[7] "cloud.ndjson" "flights.csv"
[9] "maacs_sim.csv" "nycflights13"
[11] "storms_2004.csv.gz" "team_standings.csv"
[13] "tuesdata_rainfall.RDS" "tuesdata_temperature.RDS"
Next, we are going to read in 100 drawings of each class using the sample_lines()
function from the LaF
package.
?LaF::sample_lines
We will start with the axes:
set.seed(123)
axes_json = LaF::sample_lines(here("data","axe.ndjson"), n = 100)
length(axes_json)
[1] 100
And repeat for clouds:
set.seed(123)
clouds_json = LaF::sample_lines(here("data","cloud.ndjson"), n = 100)
length(clouds_json)
[1] 100
Before we talk about data exploration and processing, it is important to look at your data and think about what you find in it.
Important: I want to point out that you should do this exploration only in the training set. However, in this example, we are going to split the data at a later point.
If you want to know more about this concept, read about data leakage.
OK the data are not in a format we can do anything with yet. Each line is a json object:
axes_json[[1]]
[1] "{\"word\":\"axe\",\"countrycode\":\"US\",\"timestamp\":\"2017-01-23 21:25:30.06067 UTC\",\"recognized\":true,\"key_id\":\"4842320119726080\",\"drawing\":[[[69,74,75,73,70,79,98,105,111,110,70],[76,92,118,239,252,255,252,208,133,73,66]],[[70,57,20,0,28,91,107,114,115,140,134,123,116,112],[66,74,87,0,17,31,45,59,78,95,75,55,50,37]],[[45,50,48],[20,51,74]]]}"
[Source]
The next thing I did was google “quick draw data ndjson rstats”. I found a tutorial and lifted some code for processing ndjson data into data frames.
parse_drawing <- function(list){
lapply(list$drawing, function(z) {tibble(x=z[[1]], y=z[[2]])}) %>%
bind_rows(.id = "line") %>%
mutate(drawing=list$key_id, row_id=row_number())
}
What is this code doing?

- `drawing` is a named element that we are going to work with in each of the JSON objects
- `x` and `y` contain the coordinates for the locations
- `bind_rows()` from `dplyr` is similar to `do.call(rbind)` in base R, binding many data frames into one. Here the `.id` argument acts as a data frame identifier. You can think of `line` as a single stroke (or line) made when you draw something on a page.
- the `drawing` column contains an ID for each drawing (`key_id`), and `row_number()` creates a row tracker for each drawing

OK let's try getting our first axe out:
first_axe <-
rjson::fromJSON(axes_json[[1]]) %>%
parse_drawing()
first_axe
# A tibble: 28 × 5
line x y drawing row_id
<chr> <dbl> <dbl> <chr> <int>
1 1 69 76 4842320119726080 1
2 1 74 92 4842320119726080 2
3 1 75 118 4842320119726080 3
4 1 73 239 4842320119726080 4
5 1 70 252 4842320119726080 5
6 1 79 255 4842320119726080 6
7 1 98 252 4842320119726080 7
8 1 105 208 4842320119726080 8
9 1 111 133 4842320119726080 9
10 1 110 73 4842320119726080 10
# … with 18 more rows
Ok this doesn’t look like much, but we could plot it to see if it looks like an axe.
first_axe %>%
ggplot(aes(x=x, y=y)) +
geom_point() +
scale_x_continuous(limits=c(0, 255))+
scale_y_reverse(limits=c(255, 0))+
theme_minimal()
This sort of looks OK, but maybe a better way to look at it is to actually draw the lines?
Hint: check out line
in the first_axe
data frame and the geom_path()
geom in ggplot2
?geom_path
Putting it all together
first_axe %>%
ggplot(aes(x=x, y=y)) +
geom_path(aes(group = line), lwd=1)+
scale_x_continuous(limits=c(0, 255))+
scale_y_reverse(limits=c(255, 0))+
theme_minimal()
Hey that sort of looks like an axe!
Questions:

- How many `line`s are there in this drawing?

### try it out
OK let’s see another one.
rjson::fromJSON(axes_json[[2]]) %>%
parse_drawing() %>%
ggplot(aes(x=x, y=y)) +
geom_path(aes(group = line), lwd=1)+
scale_x_continuous(limits=c(0, 255))+
scale_y_reverse(limits=c(255, 0))+
theme_minimal()
Pro tip: If we were doing this for real, I would make plots for a large sample of these drawings to understand the variation (and look for mislabeled drawings, messed-up observations, etc.).
Next let’s look at a cloud
rjson::fromJSON(clouds_json[[1]]) %>%
parse_drawing() %>%
ggplot(aes(x=x, y=y)) +
geom_path(aes(group = line), lwd=1)+
scale_x_continuous(limits=c(0, 255))+
scale_y_reverse(limits=c(255, 0))+
theme_minimal()
Yup, looks like a cloud! You can see how the x and y axes align in the top left corner.
One of the key issues in building a model is feature engineering. Feature engineering is the step in a machine learning problem where we construct the covariates (or features, \(\vec{x}\)) that we will feed into the prediction algorithms.
Feature engineering is particularly important for “unstructured” data. This is:
“information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.”
For example, we can take a pile of text (unstructured data) like this Emily Dickinson quote from the tidy text tutorial:
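The quote itself enters as a character vector with one element per line of the poem (the chunk isn't shown above; the lines below are copied from the tibble printed next):

text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")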
Convert this vector of character strings into a tibble where each row is a single character string:
text_df <-
tibble(text = text) %>%
mutate(line = row_number())
text_df
# A tibble: 4 × 2
text line
<chr> <int>
1 Because I could not stop for Death - 1
2 He kindly stopped for me - 2
3 The Carriage held but just Ourselves - 3
4 and Immortality 4
And turn it into something like counts of each word:
text_df %>%
unnest_tokens(word, text) %>%
count(word)
# A tibble: 19 × 2
word n
<chr> <int>
1 and 1
2 because 1
3 but 1
4 carriage 1
5 could 1
6 death 1
7 for 2
8 he 1
9 held 1
10 i 1
11 immortality 1
12 just 1
13 kindly 1
14 me 1
15 not 1
16 ourselves 1
17 stop 1
18 stopped 1
19 the 1
Now we have a table of the frequency of words in this text. But what if we repeated this process for all the text in one of Emily's books to determine the most commonly used words by this author? Then we could pick a second book from a different author and create a table of the most commonly used words by that author.
To see an example, let’s check out this blog by Julia Silge:
https://juliasilge.com/blog/tidy-text-classification/
The first thing was to engineer the features from two books:
Let’s build a supervised machine learning model that learns the difference between text from the two authors:
We could then use these features (where an observation is a single line of text) to build a machine learning model that learns the difference between text from Pride and Prejudice and text from The War of the Worlds.
Here are the features that are most important in the model built by Julia:
[Source]
Then, if we give the machine learning algorithm a new line of text it has not seen before, it could predict whether it came from Jane Austen or H.G. Wells.
If the data are continuous, you may consider other types of feature engineering:
Back to our data above, e.g.
rjson::fromJSON(axes_json[[1]]) %>%
parse_drawing()
# A tibble: 28 × 5
line x y drawing row_id
<chr> <dbl> <dbl> <chr> <int>
1 1 69 76 4842320119726080 1
2 1 74 92 4842320119726080 2
3 1 75 118 4842320119726080 3
4 1 73 239 4842320119726080 4
5 1 70 252 4842320119726080 5
6 1 79 255 4842320119726080 6
7 1 98 252 4842320119726080 7
8 1 105 208 4842320119726080 8
9 1 111 133 4842320119726080 9
10 1 110 73 4842320119726080 10
# … with 18 more rows
A bunch of data processing has been done for us, but the data are not quite ready to be fed into an algorithm yet.
To do that, we would need a data frame with each row equal to one drawing and each column equal to one feature for that drawing, with an extra column for the drawing output (e.g. cloud
or axe
).
To do this, we need to think about creating a standardized grid for storing our data. However, the choice of grid is a decision left up to us (you can think of this as feature engineering).
Another thing we might want is for our data to be of a manageable size (again the choice of how we do this is another decision left up to us – more feature engineering).
Let’s start by creating a regular grid of 256 x
and y
values.
grid_dat = as_tibble(expand.grid(x = 1:256,y=1:256))
dim(grid_dat)
[1] 65536 2
head(grid_dat)
# A tibble: 6 × 2
x y
<int> <int>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
Now we could make each x
, y
value be a grid point with a join (this is overkill)
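(The join itself isn't shown above; a minimal sketch consistent with the output below, joining on the shared `x` and `y` columns:)

grid_axe <- left_join(grid_dat, first_axe)
grid_axe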
# A tibble: 65,537 × 5
x y line drawing row_id
<dbl> <dbl> <chr> <chr> <int>
1 1 1 <NA> <NA> NA
2 2 1 <NA> <NA> NA
3 3 1 <NA> <NA> NA
4 4 1 <NA> <NA> NA
5 5 1 <NA> <NA> NA
6 6 1 <NA> <NA> NA
7 7 1 <NA> <NA> NA
8 8 1 <NA> <NA> NA
9 9 1 <NA> <NA> NA
10 10 1 <NA> <NA> NA
# … with 65,527 more rows
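(Again the code isn't shown; counting how many grid points matched a drawn point would look like:)

grid_axe %>%
  count(is.na(line))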
# A tibble: 2 × 2
`is.na(line)` n
<lgl> <int>
1 FALSE 27
2 TRUE 65510
We see most of the lines are NA. Let’s add an indicator of whether a particular value is NA or not.
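(A sketch of this step, consistent with the output below and with the loop used later in this lesson:)

grid_axe <- grid_axe %>%
  mutate(pixel = ifelse(is.na(line), 0, 1))

which(grid_axe$pixel == 1)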
[1] 4124 4909 7771 9328 11371 12660 12850 13947 14962 16710 16711
[12] 18543 18737 18746 19079 19270 19828 22037 23371 24205 30028 33904
[23] 53098 61002 64327 64355 65104
grid_axe[which(grid_axe$pixel == 1),]
# A tibble: 27 × 6
x y line drawing row_id pixel
<dbl> <dbl> <chr> <chr> <int> <dbl>
1 28 17 2 4842320119726080 16 1
2 45 20 3 4842320119726080 26 1
3 91 31 2 4842320119726080 17 1
4 112 37 2 4842320119726080 25 1
5 107 45 2 4842320119726080 18 1
6 116 50 2 4842320119726080 24 1
7 50 51 3 4842320119726080 27 1
8 123 55 2 4842320119726080 23 1
9 114 59 2 4842320119726080 19 1
10 70 66 1 4842320119726080 11 1
# … with 17 more rows
Let’s try subsampling this down to a smaller image.
We will use Hmisc::cut2()
to cut a numeric variable into intervals. It’s similar to cut()
in base R, but left endpoints are inclusive and labels are of the form [lower, upper)
, except that the last interval is [lower, upper].
However, we will use levels.mean=TRUE
to make the new categorical vector have a levels attribute that is the group means of grid_axe$x
instead of interval endpoint labels.
grid_axe$xgroup = Hmisc::cut2(grid_axe$x, g=16, levels.mean=TRUE) # g is number of quantile groups
grid_axe
# A tibble: 65,537 × 7
x y line drawing row_id pixel xgroup
<dbl> <dbl> <chr> <chr> <int> <dbl> <fct>
1 1 1 <NA> <NA> NA 0 " 9.000"
2 2 1 <NA> <NA> NA 0 " 9.000"
3 3 1 <NA> <NA> NA 0 " 9.000"
4 4 1 <NA> <NA> NA 0 " 9.000"
5 5 1 <NA> <NA> NA 0 " 9.000"
6 6 1 <NA> <NA> NA 0 " 9.000"
7 7 1 <NA> <NA> NA 0 " 9.000"
8 8 1 <NA> <NA> NA 0 " 9.000"
9 9 1 <NA> <NA> NA 0 " 9.000"
10 10 1 <NA> <NA> NA 0 " 9.000"
# … with 65,527 more rows
table(grid_axe$xgroup)
9.000 25.500 41.500 57.500 72.999 88.500 104.500 120.500
4352 4096 4096 4096 3841 4096 4096 4096
136.500 152.500 168.500 184.500 200.500 216.500 232.500 248.500
4096 4096 4096 4096 4096 4096 4096 4096
Let’s do the same for y
grid_axe$ygroup = Hmisc::cut2(grid_axe$y,g=16,levels.mean=TRUE)
grid_axe
# A tibble: 65,537 × 8
x y line drawing row_id pixel xgroup ygroup
<dbl> <dbl> <chr> <chr> <int> <dbl> <fct> <fct>
1 1 1 <NA> <NA> NA 0 " 9.000" " 9.000"
2 2 1 <NA> <NA> NA 0 " 9.000" " 9.000"
3 3 1 <NA> <NA> NA 0 " 9.000" " 9.000"
4 4 1 <NA> <NA> NA 0 " 9.000" " 9.000"
5 5 1 <NA> <NA> NA 0 " 9.000" " 9.000"
6 6 1 <NA> <NA> NA 0 " 9.000" " 9.000"
7 7 1 <NA> <NA> NA 0 " 9.000" " 9.000"
8 8 1 <NA> <NA> NA 0 " 9.000" " 9.000"
9 9 1 <NA> <NA> NA 0 " 9.000" " 9.000"
10 10 1 <NA> <NA> NA 0 " 9.000" " 9.000"
# … with 65,527 more rows
Now we can convert these to numbers so we can use them later
grid_axe <-
grid_axe %>%
mutate(xgroup = as.numeric(as.character(xgroup)) - 7.5) %>%
mutate(ygroup = as.numeric(as.character(ygroup)) - 7.5)
table(grid_axe$xgroup)
1.5 18 34 50 65.499 81 97 113 129 145
4352 4096 4096 4096 3841 4096 4096 4096 4096 4096
161 177 193 209 225 241
4096 4096 4096 4096 4096 4096
grid_axe
# A tibble: 65,537 × 8
x y line drawing row_id pixel xgroup ygroup
<dbl> <dbl> <chr> <chr> <int> <dbl> <dbl> <dbl>
1 1 1 <NA> <NA> NA 0 1.5 1.5
2 2 1 <NA> <NA> NA 0 1.5 1.5
3 3 1 <NA> <NA> NA 0 1.5 1.5
4 4 1 <NA> <NA> NA 0 1.5 1.5
5 5 1 <NA> <NA> NA 0 1.5 1.5
6 6 1 <NA> <NA> NA 0 1.5 1.5
7 7 1 <NA> <NA> NA 0 1.5 1.5
8 8 1 <NA> <NA> NA 0 1.5 1.5
9 9 1 <NA> <NA> NA 0 1.5 1.5
10 10 1 <NA> <NA> NA 0 1.5 1.5
# … with 65,527 more rows
Now we can average within groups of pixels to get a smaller image
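(The averaging code isn't shown; a sketch consistent with the output below and with the loop later on. Note that `summarise()` leaves the result grouped by `xgroup`, which matches the printed output.)

small_axe <- grid_axe %>%
  group_by(xgroup, ygroup) %>%
  summarise(pixel = mean(pixel))

small_axe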
# A tibble: 256 × 3
# Groups: xgroup [16]
xgroup ygroup pixel
<dbl> <dbl> <dbl>
1 1.5 1.5 0
2 1.5 18 0
3 1.5 34 0
4 1.5 50 0
5 1.5 65.5 0
6 1.5 81 0
7 1.5 97 0
8 1.5 113 0
9 1.5 129 0
10 1.5 145 0
# … with 246 more rows
Remember this was our original axe
first_axe %>%
ggplot(aes(x=x, y=y)) +
geom_point() +
scale_x_continuous(limits=c(0, 255))+
scale_y_reverse(limits=c(255, 0))+
theme_minimal()
Now we can look at the small version - it looks similar - whew! :)
small_axe %>%
filter(pixel > 0) %>%
ggplot(aes(x = xgroup, y = ygroup)) +
geom_point() +
scale_x_continuous(limits=c(0, 255))+
scale_y_reverse(limits=c(255, 0))+
theme_minimal()
Now let’s do this for all axes and clouds
img_dat = tibble(pixel=NA, type=NA, drawing=NA, pixel_number=NA)
#First axes
for(i in 1:100){
tmp_draw = rjson::fromJSON(axes_json[[i]]) %>% parse_drawing()
grid_draw = left_join(grid_dat,tmp_draw) %>%
mutate(pixel = ifelse(is.na(line),0,1))
grid_draw$xgroup = cut2(grid_draw$x,g=16,levels.mean=TRUE)
grid_draw$ygroup = cut2(grid_draw$y,g=16,levels.mean=TRUE)
small_draw = grid_draw %>%
mutate(xgroup = as.numeric(as.character(xgroup)) - 7.5) %>%
mutate(ygroup = as.numeric(as.character(ygroup)) - 7.5) %>%
group_by(xgroup,ygroup) %>%
summarise(pixel=mean(pixel)) %>% ungroup() %>%
select(pixel) %>%
mutate(type="axe",drawing=i,pixel_number=row_number())
img_dat = img_dat %>% bind_rows(small_draw)
}
#Then clouds
for(i in 1:100){
tmp_draw = rjson::fromJSON(clouds_json[[i]]) %>% parse_drawing()
grid_draw = left_join(grid_dat,tmp_draw) %>%
mutate(pixel = ifelse(is.na(line),0,1))
grid_draw$xgroup = cut2(grid_draw$x,g=16,levels.mean=TRUE)
grid_draw$ygroup = cut2(grid_draw$y,g=16,levels.mean=TRUE)
small_draw = grid_draw %>%
mutate(xgroup = as.numeric(as.character(xgroup)) - 7.5) %>%
mutate(ygroup = as.numeric(as.character(ygroup)) - 7.5) %>%
group_by(xgroup,ygroup) %>%
summarise(pixel=mean(pixel)) %>% ungroup() %>%
select(pixel) %>%
mutate(type="cloud",drawing=i,pixel_number=row_number())
img_dat = img_dat %>% bind_rows(small_draw)
}
Now let’s look at this new data frame
img_dat = img_dat[-1,]
img_dat
# A tibble: 51,200 × 4
pixel type drawing pixel_number
<dbl> <chr> <int> <int>
1 0 axe 1 1
2 0 axe 1 2
3 0 axe 1 3
4 0 axe 1 4
5 0 axe 1 5
6 0 axe 1 6
7 0 axe 1 7
8 0 axe 1 8
9 0 axe 1 9
10 0 axe 1 10
# … with 51,190 more rows
We can use pivot_wider()
and voilà, we finally have a processed data set!
library(tidyr)
img_final <-
img_dat %>%
pivot_wider(names_from = pixel_number, values_from = pixel)
names(img_final) = c("type", "drawing", paste0("pixel", 1:256))
img_final
# A tibble: 200 × 258
type drawing pixel1 pixel2 pixel3 pixel4 pixel5 pixel6
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 axe 1 0 0 0 0 0 0
2 axe 2 0 0.00368 0.00368 0 0 0
3 axe 3 0 0.00368 0 0 0 0.00368
4 axe 4 0.00391 0 0.00391 0 0 0
5 axe 5 0 0 0 0.00368 0 0
6 axe 6 0 0 0.00391 0 0 0
7 axe 7 0 0 0.00781 0 0 0
8 axe 8 0 0 0.00391 0 0 0
9 axe 9 0.00781 0 0.00391 0 0 0
10 axe 10 0.0117 0 0 0.00391 0 0
# … with 190 more rows, and 250 more variables: pixel7 <dbl>,
# pixel8 <dbl>, pixel9 <dbl>, pixel10 <dbl>, pixel11 <dbl>,
# pixel12 <dbl>, pixel13 <dbl>, pixel14 <dbl>, pixel15 <dbl>,
# pixel16 <dbl>, pixel17 <dbl>, pixel18 <dbl>, pixel19 <dbl>,
# pixel20 <dbl>, pixel21 <dbl>, pixel22 <dbl>, pixel23 <dbl>,
# pixel24 <dbl>, pixel25 <dbl>, pixel26 <dbl>, pixel27 <dbl>,
# pixel28 <dbl>, pixel29 <dbl>, pixel30 <dbl>, pixel31 <dbl>, …
Now that we have our data processed, an important step is to break the data up into a training, testing, and validation set.
In general, people use these words in different ways:
I actually like this proposal to call them “training, tuning, and testing” sets, so let’s use that.
But the reason for this splitting is that we want to avoid being overly optimistic or “overfitting” on the training data. That would prevent us from predicting well on new samples.
Here is a funny example from XKCD to illustrate overfitting
The basic idea is that if you keep adding predictors, the model will appear to "predict well" on the data you have, regardless of how well we expect it to do on future data.
The key thing to keep in mind is that there are two types of variation in any data set, the “signal” and the “noise”. Using math notation, imagine that the “real” model for a data set is:
\[y_i = \underbrace{g(\vec{x}_i)}_{signal} + \underbrace{e_i}_{noise}\]
Let’s use a concrete, simple example:
\[y_i = \underbrace{x^2}_{signal} + \underbrace{e_i}_{noise}\]
Imagine we want to “learn” a model of the form:
\[y_i = \sum_{k=1}^K b_k g_k(x_i) + e_i\]
Then the model will underfit if it is not flexible enough to capture the signal, fit well if it captures the signal but not the noise, and overfit if it is so flexible that it also captures the noise.
Let’s simulate from the example above to give a better idea:
dat = tibble(x = rnorm(30),y=rnorm(30,mean=x^2))
ggplot(dat,aes(x=x,y=y)) +
geom_point() +
theme_minimal()
Now let’s fit three models to this data. One that underfits, one that fits, and one that overfits.
# To make plotting easier later
dat = dat %>%
arrange(x)
lm_under = lm(y ~ x,data=dat)
# using natural splines to model a non-linear relationship with
# piecewise cubic polynomials
# can specify df (degrees of freedom) and then ns() chooses (df - 1)
# interior knots at certain quantiles of x
lm_fits = lm(y ~ ns(x,df=2),data=dat)
lm_over = lm(y ~ ns(x,df=10),data=dat)
# add_predictions() is from the modelr R package
dat_pred = dat %>%
add_predictions(lm_fits,"fits") %>%
add_predictions(lm_under,"under") %>%
add_predictions(lm_over,"over")
dat_pred = dat_pred %>%
pivot_longer(-c(x,y),
names_to = "fit_type",
values_to = "fit")
dat_pred %>%
ggplot(aes(x=x,y=y)) +
geom_point() +
geom_line(aes(x=x, y=fit,
group=fit_type,
color=fit_type)) +
theme_minimal()
You can kind of see that the blue line misses the signal, the red line fits pretty well, and the green line seems to capture a little too much of the noise.
If we look at the errors of each approach (here: the square root of the summed squared differences between \(Y\) and \(\hat{Y}\)) we get:
dat_pred %>%
mutate(res2 = (y - fit)^2) %>%
group_by(fit_type) %>%
summarise(rmse = sqrt(sum(res2)))
# A tibble: 3 × 2
fit_type rmse
<chr> <dbl>
1 fits 6.10
2 over 4.58
3 under 9.12
It looks like the overfitting approach was best, but we can probably guess that won’t work on a new data set:
dat2 = tibble(x = rnorm(30), y = rnorm(30,mean=x^2))
dat2_pred = dat2 %>%
add_predictions(lm_fits,"fits") %>%
add_predictions(lm_under,"under") %>%
add_predictions(lm_over,"over")
dat2_pred <-
dat2_pred %>%
pivot_longer(-c(x,y), names_to = "fit_type", values_to = "fit")
dat2_pred %>%
ggplot(aes(x=x,y=y)) +
geom_point() +
theme_minimal() +
geom_line(aes(x=x, y=fit,
group=fit_type,
color=fit_type))
The best model will be the one that captures the signal (which remains fixed) and not the noise (which changes from sample to sample).
dat2_pred %>%
mutate(res2 = (y - fit)^2) %>%
group_by(fit_type) %>%
summarise(rmse = sqrt(sum(res2)))
# A tibble: 3 × 2
fit_type rmse
<chr> <dbl>
1 fits 5.45
2 over 6.03
3 under 7.86
Overfitting is related to another general concept - the bias variance trade-off.
In general, the more predictors you have in a model the lower the bias but the higher the variance. This is called the “bias-variance trade-off”.
To see an example of this, let's fit these models in 100 simulated data sets and see what the models predict for an x value of 0 (the prediction should equal 0).
over = under = fits = rep(0,100)
ex_dat = tibble(x=0)
for(i in 1:100){
new_dat = tibble(x = rnorm(30),y=rnorm(30,mean=x^2))
lm_under = lm(y ~ x,data=new_dat)
lm_fits = lm(y ~ ns(x,df=2),data=new_dat)
lm_over = lm(y ~ ns(x,df=10),data=new_dat)
over[i] = predict(lm_over,ex_dat)
under[i] = predict(lm_under,ex_dat)
fits[i] = predict(lm_fits,ex_dat)
}
results = tibble(over,under,fits) %>%
pivot_longer(names_to = "type",
values_to = "prediction",
c(over,under,fits))
The results
show that when we fit the exact right model we do best (no surprise there!).
When the model is too complex we get low bias (values predicted close to zero on average) but high variance.
When the model is not complex enough we get high bias (values predicted away from zero) but low variance.
results %>%
ggplot(aes(y=prediction,group=type,fill=type)) +
geom_boxplot() +
theme_minimal()
In general, you will not know the “true” model, so the goal is to try to pick a model that gives a happy medium on the bias-variance trade-off (of course depending on your goals).
Imagine we have broken the data into three components: a training set, a tuning set, and a testing set.
Now we need to “fit” the model. Let’s briefly talk about what this means.
A machine learning model has two parts: an algorithm and a set of parameters.
The algorithm would be something like regression models with splines:
\[y_i = \sum_{k=1}^K b_k g_k(x_i) + e_i\]
And the parameters would be the choices of \(b_k\), \(g_k\) and \(K\). These parameters are “fit” by estimating them from the training data (or fixing them in advance).
Within the training set you are choosing between algorithms and parameters, but like we saw above, if you use the whole training set you may end up overfitting to the noise in the data. So when selecting algorithms and parameters, you need some way to make sure you don’t just pick the algorithm that is most overfit.
The typical way people do this is by cross-validation (figure borrowed from Rafa’s Data Science Book).
\(K\)-fold cross validation just splits up the training set into \(K\) pieces. You build the model on part of the training data and apply it to the rest. This gives you a better evaluation of the out of sample error - so will allow you to rank models in a better way.
You can also use something called the bootstrap. But you need to adjust for the fact that the training and testing sets are random samples.
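For example, with the caret package (used below), you could request \(K\)-fold cross-validation instead of the default bootstrap through `trainControl()`; this is a hypothetical sketch, not the resampling actually used in this lesson:

# hypothetical: 5-fold cross-validation for the resampling done inside train()
cv_control <- trainControl(method = "cv", number = 5)
# later: train(type ~ ., data = train_dat, method = "rpart", trControl = cv_control)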
Here we are going to simply use training and testing using the createDataPartition()
function in the caret package with the argument p
being the percentages of data that goes into training:
train_set = createDataPartition(y = img_final$type, p = 0.5,
list=FALSE)
train_dat = img_final[train_set,]
test_dat = img_final[-train_set,]
:::keyideas
Important!!!!: We leave the test set alone until the very end!
and for those in the back
it’s worth repeating: We leave the test set alone until the very end!
:)
:::
A lot of machine learning is considering variations on the equation:
\[d(y,f(\vec{x}))\]
where the thing we are varying is the choice of distance metric \(d()\), which we have already discussed (e.g. mean squared error, accuracy, PPV).
We have talked a lot less about varying \(f()\) which is the most common topic for many machine learning books. Here, we will briefly review a couple of the key algorithms.
The first thing to keep in mind is that, with well-engineered features, simple algorithms often do almost as well as more advanced algorithms.
The other thing to keep in mind with these types of algorithms is that there are often important tradeoffs.
The important tradeoffs are:
There are a few key ideas you should know about that define most regression models you will run into in practice.
Today we will briefly demonstrate trees.
Classification and regression trees are an extremely popular approach to prediction. The basic algorithm for a classification tree is the following:

1. Start with all variables in one group
2. Find the variable/split that best separates the outcomes
3. Divide the data into two groups ("leaves") on that split ("node")
4. Within each split, find the best variable/split that separates the outcomes
5. Continue until the groups are too small or sufficiently "pure"
This is an example tree:
The big question is how to define “best separates the outcomes” in Step 4 of the general algorithm.
For continuous data, you might minimize the residual sum of squares in each group.
For binary data, you might measure misclassification or information gain.
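As a toy illustration (not from the original lesson), the misclassification error of a single candidate split on a binary outcome could be computed like this, where `y` is the vector of class labels and `x` is one feature:

# misclassification error if we split x at `cutpoint` and predict the majority class in each leaf
split_misclass <- function(y, x, cutpoint){
  left  <- y[x <  cutpoint]
  right <- y[x >= cutpoint]
  n_wrong <- function(g) if(length(g) == 0) 0 else sum(g != names(which.max(table(g))))
  (n_wrong(left) + n_wrong(right)) / length(y)
}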
Based on these plots of a few of the features:
ggplot(train_dat, aes(x=type,y=pixel1)) +
geom_violin() +
theme_minimal() +
scale_y_log10()
ggplot(train_dat, aes(x=type,y=pixel2)) +
geom_violin() +
theme_minimal() +
scale_y_log10()
We should be able to do ok for this model fitting. We can fit models using the caret package. The caret package simplifies a lot of model fitting for machine learning. We can use the train
command to do this in R.
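(The fitting code isn't shown above; a sketch consistent with the output below would be the following. If your version of caret complains about a character outcome, convert `type` to a factor first.)

# fit a CART model (method = "rpart") using all remaining columns as features
mod <- train(type ~ ., data = train_dat, method = "rpart")
mod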
CART
100 samples
257 predictors
2 classes: 'axe', 'cloud'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.14 0.8385178 0.6751338
0.26 0.7900434 0.5798719
0.54 0.6274736 0.2868931
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was cp = 0.14.
Here, you can see we have reasonable accuracy. This accuracy is estimated by bootstrapping the training set only.
We can look at the final model fit after model selection using the finalModel element of the fitted object.
mod$finalModel
n= 100
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 100 50 axe (0.50000000 0.50000000)
2) pixel246< 0.001953125 73 23 axe (0.68493151 0.31506849)
4) pixel243< 0.001953125 58 9 axe (0.84482759 0.15517241) *
5) pixel243>=0.001953125 15 1 cloud (0.06666667 0.93333333) *
3) pixel246>=0.001953125 27 0 cloud (0.00000000 1.00000000) *
You can use the rpart.plot
package to visualize what is actually going on here.
rpart.plot(mod$finalModel)
Interesting! I wonder what pixel 246 is?? Can you guess? If not, no worries. Hang tight.
The last step is model evaluation. A good model evaluation includes the following components:

1. Evaluation of the model's accuracy on a held-out test set
2. Exploratory analysis to understand what the model is using to make predictions
3. Evaluation of the implications of the model's predictions, including potential biases
why is this important?
A while back, Amazon developed an AI algorithm for predicting whom they should hire. They did a good job of evaluating criterion 1 above: they knew they could accurately predict within their training set.
However, they missed criterion 2 above: they did not do an exploratory analysis to identify what their model was using to make predictions.
They also did not do a careful job of evaluating the implications of their model in terms of bias.
This led to some major problems. (worth a peek)
Before evaluating our model in the test set, we want to understand what is going on with our prediction. This is an active and open area of research: the interpretation of results from black box machine learning algorithms.
We can do this in a couple of ways. One approach that has seen some traction is local interpretable model-agnostic explanations (LIME). This approach fits very simple local models to approximate the complicated model in a local neighborhood.
Then for each prediction you can see how much the features are positively or negatively correlated with the complex model near that prediction. To do this, you can use the lime package.
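A hypothetical sketch with the `lime` package (not run in this lesson; the exact arguments may differ for your data):

# hypothetical: explain a couple of training-set predictions with lime
library(lime)
explainer <- lime(select(train_dat, -type), mod)
explanation <- explain(select(train_dat, -type)[1:2, ], explainer,
                       n_labels = 1, n_features = 5)
plot_features(explanation)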
Or you can start looking at the data for individual features.
img_final %>%
ggplot(aes(x=type,y=pixel246)) +
geom_violin() +
theme_minimal() +
scale_y_log10()
We can also look at where this pixel would be in the image:
expand.grid(x=1:16,y=1:16)[246,]
x y
246 6 16
And plot it
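(The plotting code isn't shown; a minimal sketch that marks this pixel on the 16 x 16 grid, with the y axis reversed as before:)

pixel_246 <- expand.grid(x = 1:16, y = 1:16)[246, ]
ggplot(pixel_246, aes(x = x, y = y)) +
  geom_point(size = 5) +
  scale_x_continuous(limits = c(1, 16)) +
  scale_y_reverse(limits = c(16, 1)) +
  theme_minimal()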
We can also figure out which of the images are misclassified and look at them
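(The code that builds `missed_imgs` isn't shown above; a sketch consistent with how it is used below, assuming we inspect misclassifications on the training data so the test set stays untouched:)

train_pred <- predict(mod, train_dat)
missed_imgs <- train_dat %>%
  filter(train_pred != type) %>%
  select(type, drawing)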
Let’s look at one of the missed images
missed_imgs = missed_imgs %>%
filter(type=="axe")
rjson::fromJSON(axes_json[[missed_imgs$drawing[1]]]) %>%
parse_drawing() %>%
ggplot(aes(x=x, y=y)) +
geom_path(aes(group = line), lwd=1) +
scale_x_continuous(limits=c(0, 255)) +
scale_y_reverse(limits=c(255, 0)) +
theme_minimal()
ha! makes sense now
The last step (!!) is to apply the predictions in the test set. You only do this once, but it gives you the best estimate of the out of sample error rate you’d see in practice.
confusionMatrix(factor(test_dat$type), predict(mod,test_dat))
Confusion Matrix and Statistics
Reference
Prediction axe cloud
axe 45 5
cloud 13 37
Accuracy : 0.82
95% CI : (0.7305, 0.8897)
No Information Rate : 0.58
P-Value [Acc > NIR] : 2.857e-07
Kappa : 0.64
Mcnemar's Test P-Value : 0.09896
Sensitivity : 0.7759
Specificity : 0.8810
Pos Pred Value : 0.9000
Neg Pred Value : 0.7400
Prevalence : 0.5800
Detection Rate : 0.4500
Detection Prevalence : 0.5000
Balanced Accuracy : 0.8284
'Positive' Class : axe
This accuracy is usually slightly lower than the accuracy in the training data.
Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Hicks (2021, Oct. 19). Statistical Computing: Introduction to machine learning. Retrieved from https://stephaniehicks.com/jhustatcomputing2021/posts/2021-10-19-introduction-to-machine-learning/
BibTeX citation
@misc{hicks2021introduction,
  author = {Hicks, Stephanie},
  title = {Statistical Computing: Introduction to machine learning},
  url = {https://stephaniehicks.com/jhustatcomputing2021/posts/2021-10-19-introduction-to-machine-learning/},
  year = {2021}
}