In this tutorial we’ll discuss the tidyverse in greater detail, and then introduce the pipe operator: %>%.
Both feature prominently throughout this course.
Let’s begin by loading all the packages that we will need.
library(tidyverse)
library(tidyquant)
1. The tidyverse is a collection of packages that is popular among data scientists who use R.
2. These packages focus on data manipulation, visualization, and analysis.
3. The reason for the popularity of the tidyverse is that there is a common focus on usability and uniformity.
4. Packages in the tidyverse all interact nicely with one another.
5. This is different from much of the R ecosystem, because of the ad hoc nature of R package development.
6. If you want to do data analysis with R, I highly recommend focusing your efforts on learning the tidyverse.
7. The tidyverse is the foundation of this training.
8. When we run library(tidyverse) we are actually loading all the core tidyverse packages at once.
9. To learn more, check out www.tidyverse.org.
Let’s briefly discuss the packages that we’ll be using in this course series. All are part of the tidyverse except tidyquant, which is built to work with it.
dplyr - the main data manipulation tool that we’ll be using everywhere. The main verbs are select(), mutate(), filter(), and group_by().
tidyquant - querying online financial data sources and performing calculations on them; we’ll mainly use tq_get().
ggplot2 - creating beautiful visualizations.
lubridate - makes working with dates easy.
1. Let’s read in DIA prices from 2018.
df_dia <- tq_get("DIA", get = "stock.prices", from = "2018-01-01", to = "2019-01-01")
df_dia
# A tibble: 251 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-01-02 248. 248. 247. 248. 4454900 241.
2 2018-01-03 248. 249. 248. 249. 5528600 242.
3 2018-01-04 250. 251. 250. 251. 4932900 244.
4 2018-01-05 251. 253. 251. 253. 3349200 246.
5 2018-01-08 253. 253. 252. 253. 3847500 246.
6 2018-01-09 253. 254. 253. 254. 5031400 247.
7 2018-01-10 253. 254. 252. 254. 2357600 247.
8 2018-01-11 254 256. 254. 256. 2760400 249.
9 2018-01-12 257. 258. 256. 258. 3678400 251.
10 2018-01-16 260. 261. 257. 258. 8102900 251.
# … with 241 more rows
Observations:
Notice that there are 251 rows in df_dia, but only 10 of them are printed.
This is the default behavior of tibbles, which are just the tidyverse version of a data.frame.
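To see this printing behavior in isolation, here is a minimal sketch that uses a small made-up tibble instead of the DIA data. The name df_demo and its values are hypothetical and exist only for illustration; the n argument of print() is one way to ask a tibble for more rows.

```r
library(tibble)

# Hypothetical 25-row tibble; the values are made up for illustration.
df_demo <- tibble(id = 1:25, value = (1:25) * 2)

# With default options, a tibble with more than 20 rows prints only the first 10.
df_demo

# Ask for more rows explicitly with the n argument of print().
print(df_demo, n = 15)

# A tibble is still a data.frame, so data.frame tools continue to work on it.
is.data.frame(df_demo)
```

If you ever need to see every row, print(df_demo, n = Inf) should do it, although for large tables the short default is usually what you want.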
1. The pipe operator is the following symbol: %>%.
2. It is a quintessential part of the tidyverse.
3. The pipe operator can be a little mysterious at first, but becoming familiar with it will pay huge dividends in the next few tutorials, and as you progress in your use of the tidyverse.
4. In RStudio, the keyboard shortcut for creating it is Ctrl + Shift + M (Cmd + Shift + M on a Mac).
5. Practice doing this a few times, just to get used to it.
# %>% %>% %>% %>%
6. Notice that the # symbol at the beginning of a line of code turns it into a comment.
The best way to get acquainted with the pipe operator is to see it in action.
1. The following code calculates the average of the close prices in df_dia:
mean(df_dia$close)
[1] 250.5277
2. This can be rewritten with the pipe operator as follows:
df_dia$close %>% mean()
[1] 250.5277
1. The following code rounds the first five close prices to the nearest dollar:
round(df_dia$close[1:5])
[1] 248 249 251 253 253
2. This can be rewritten with the pipe as follows:
df_dia$close[1:5] %>% round()
[1] 248 249 251 253 253
1. Suppose we want to calculate the average close price, and then round that average. This can be done as follows:
round(mean(df_dia$close))
[1] 251
2. In order to recreate this with pipes, we will need to use two of them as follows:
df_dia$close %>% mean() %>% round()
[1] 251
Summarizing Remarks:
Notice that we executed the above calculation by composing the mean() function with the round() function.
In R, pretty much everything is a function.
As your data analysis becomes more complicated, there will be a lot of function composition going on.
Code with a lot of function composition can get confusing, especially if you use standard notation, e.g. h(g(f(x))).
Pipes help keep your code organized, as we will see shortly.
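To make the h(g(f(x))) comparison concrete, here is a small sketch that composes three functions both ways. The vector x is made up for illustration, and sum(), sqrt(), and round() are stand-ins for f, g, and h; magrittr is loaded directly so the snippet runs on its own, though %>% is also available after library(tidyverse).

```r
library(magrittr)  # provides %>%

# Hypothetical vector, made up for illustration.
x <- c(1.5, 2.25, 3.75, 4.5)

# Standard notation: read from the inside out, h(g(f(x))).
nested <- round(sqrt(sum(x)))

# Piped notation: read from left to right, one step per pipe.
piped <- x %>% sum() %>% sqrt() %>% round()

# Both spellings perform the same computation.
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is the main readability argument for it.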
Code Challenge: Find the maximum close price in df_dia and then round it to the nearest dollar. Try this first without pipes, and then with pipes.
1. So far we have used pipes in conjunction with one or two simple built-in functions like mean() and round().
2. Pipes really become valuable when combined with multiple tidyverse function calls.
3. In this section we will use %>% along with select() from dplyr.
4. The following code uses normal function syntax to access the date, close, and adjusted columns from df_dia:
select(df_dia, date, close, adjusted)
# A tibble: 251 x 3
date close adjusted
<date> <dbl> <dbl>
1 2018-01-02 248. 241.
2 2018-01-03 249. 242.
3 2018-01-04 251. 244.
4 2018-01-05 253. 246.
5 2018-01-08 253. 246.
6 2018-01-09 254. 247.
7 2018-01-10 254. 247.
8 2018-01-11 256. 249.
9 2018-01-12 258. 251.
10 2018-01-16 258. 251.
# … with 241 more rows
5. We can rewrite this with the pipe operator as follows:
df_dia %>% select(date, close, adjusted)
# A tibble: 251 x 3
date close adjusted
<date> <dbl> <dbl>
1 2018-01-02 248. 241.
2 2018-01-03 249. 242.
3 2018-01-04 251. 244.
4 2018-01-05 253. 246.
5 2018-01-08 253. 246.
6 2018-01-09 254. 247.
7 2018-01-10 254. 247.
8 2018-01-11 256. 249.
9 2018-01-12 258. 251.
10 2018-01-16 258. 251.
# … with 241 more rows
Observations:
The select() function takes a data.frame as its first argument.
The next arguments for select() are the names of the columns we want to select.
So the pattern for piping with select() is: data_frame %>% select(column_names).
The general pattern for piping is: first_argument %>% function(other_arguments).
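The general pattern is not specific to select(). As a rough illustration, the sketch below applies it to round(), whose first argument is the number to round and whose second argument is the number of digits. The value x is made up for illustration, and magrittr is loaded directly so the snippet runs on its own.

```r
library(magrittr)  # provides %>%

# Hypothetical value, made up for illustration.
x <- 3.14159

# Normal function syntax: function(first_argument, other_arguments)
a <- round(x, digits = 2)

# Piped syntax: first_argument %>% function(other_arguments)
b <- x %>% round(digits = 2)

# The pipe simply places x into the first argument slot of round().
identical(a, b)
```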
Coding Challenge: Use the pipe operator to select the date, high, and low columns from df_dia.
1. In this section, we are going to use pipes to add a column to a data.frame.
2. The following code separates out the date, high, and low columns and assigns them to a new data.frame called df_range.
df_range <- df_dia %>% select(date, high, low)
df_range
# A tibble: 251 x 3
date high low
<date> <dbl> <dbl>
1 2018-01-02 248. 247.
2 2018-01-03 249. 248.
3 2018-01-04 251. 250.
4 2018-01-05 253. 251.
5 2018-01-08 253. 252.
6 2018-01-09 254. 253.
7 2018-01-10 254. 252.
8 2018-01-11 256. 254.
9 2018-01-12 258. 256.
10 2018-01-16 261. 257.
# … with 241 more rows
3. Using standard function syntax, we can add an intraday_range column to df_range by using the mutate() function as follows:
mutate(df_range, intraday_range = high - low)
# A tibble: 251 x 4
date high low intraday_range
<date> <dbl> <dbl> <dbl>
1 2018-01-02 248. 247. 1.24
2 2018-01-03 249. 248. 1.15
3 2018-01-04 251. 250. 1.30
4 2018-01-05 253. 251. 1.88
5 2018-01-08 253. 252. 0.75
6 2018-01-09 254. 253. 1.57
7 2018-01-10 254. 252. 1.48
8 2018-01-11 256. 254. 1.82
9 2018-01-12 258. 256. 1.49
10 2018-01-16 261. 257. 3.84
# … with 241 more rows
4. Rewriting this using the pipe operator looks like:
df_range %>% mutate(intraday_range = high - low)
# A tibble: 251 x 4
date high low intraday_range
<date> <dbl> <dbl> <dbl>
1 2018-01-02 248. 247. 1.24
2 2018-01-03 249. 248. 1.15
3 2018-01-04 251. 250. 1.30
4 2018-01-05 253. 251. 1.88
5 2018-01-08 253. 252. 0.75
6 2018-01-09 254. 253. 1.57
7 2018-01-10 254. 252. 1.48
8 2018-01-11 256. 254. 1.82
9 2018-01-12 258. 256. 1.49
10 2018-01-16 261. 257. 3.84
# … with 241 more rows
5. Actually, the intermediate df_range variable is unnecessary. We could accomplish the same thing by starting with df_dia and applying %>% twice.
df_dia %>%
  select(date, high, low) %>%
  mutate(intraday_range = high - low)
# A tibble: 251 x 4
date high low intraday_range
<date> <dbl> <dbl> <dbl>
1 2018-01-02 248. 247. 1.24
2 2018-01-03 249. 248. 1.15
3 2018-01-04 251. 250. 1.30
4 2018-01-05 253. 251. 1.88
5 2018-01-08 253. 252. 0.75
6 2018-01-09 254. 253. 1.57
7 2018-01-10 254. 252. 1.48
8 2018-01-11 256. 254. 1.82
9 2018-01-12 258. 256. 1.49
10 2018-01-16 261. 257. 3.84
# … with 241 more rows
Summarizing Remarks:
Piping functions together in the above way is idiomatic in the tidyverse.
Most tidyverse functions take a data.frame as their first argument, and then return a data.frame as their output.
This is what allows piping of multiple tidyverse functions to work.
In the above code, notice that for each %>%, if you run all the code that comes before it, what you get is a data.frame.
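One way to check the last remark for yourself is to run a chain one step at a time and inspect each intermediate result. The sketch below does this on a tiny made-up table; df_prices, step_1, step_2, and the price values are hypothetical and are not real DIA data.

```r
library(dplyr)  # provides %>%, tibble(), select(), and mutate()

# Hypothetical prices, made up for illustration (not real DIA data).
df_prices <- tibble(
  date = as.Date(c("2018-01-02", "2018-01-03", "2018-01-04")),
  high = c(101, 103, 102),
  low  = c(99, 100, 101)
)

# Run the chain one pipe at a time, saving each intermediate result.
step_1 <- df_prices %>% select(high, low)
step_2 <- step_1 %>% mutate(intraday_range = high - low)

# Each step returns a data.frame, which is why the next pipe can accept it.
is.data.frame(step_1)
is.data.frame(step_2)
```

Saving intermediate steps like this is also a handy debugging habit when a longer chain does not return what you expect.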
Code Challenge: Starting from df_dia, try rewriting the above code without the pipe operator, and without intermediate variables. This will involve applying select() and mutate() to df_dia using normal function syntax.
Discussion Question: Imagine a computation that involves 5 or 6 different tidyverse functions chained together. Do you think you would prefer coding up the pipe or non-pipe version?
NOTE: Pipes can be confusing at first, but they are an integral part of the tidyverse. If you’re still feeling a little foggy, I would recommend reviewing this tutorial.
R4DS - Chapter 18 - Pipes