Tutorial 6

Tidyverse and the Pipe Operator



In this tutorial we’ll discuss the tidyverse in greater detail, and then introduce the pipe operator: %>%.

Both feature prominently throughout this course.


Load Packages

Let’s begin by loading all the packages that we will need.

library(tidyverse)
library(tidyquant)


What is the Tidyverse?

1. The tidyverse is a collection of packages that is popular among data scientists who use R.

2. These packages focus on data manipulation, visualization, and analysis.

3. The tidyverse is popular because its packages share a common focus on usability and uniformity.

4. Packages in the tidyverse all interact nicely with one another.

5. This is different from much of the R ecosystem, because of the ad hoc nature of R package development.

6. If you want to do data analysis with R, I highly recommend focusing your efforts on learning the tidyverse.

7. The tidyverse is the foundation of this training.

8. When we run library(tidyverse) we are actually loading all the core tidyverse packages at once.

9. To learn more, check out www.tidyverse.org.


Reading-In Data

1. Let’s read in DIA prices from 2018.

df_dia <- tq_get("DIA", get = "stock.prices", from = "2018-01-01",
                 to = "2019-01-01")

df_dia
# A tibble: 251 x 7
   date        open  high   low close  volume adjusted
   <date>     <dbl> <dbl> <dbl> <dbl>   <dbl>    <dbl>
 1 2018-01-02  248.  248.  247.  248. 4454900     241.
 2 2018-01-03  248.  249.  248.  249. 5528600     242.
 3 2018-01-04  250.  251.  250.  251. 4932900     244.
 4 2018-01-05  251.  253.  251.  253. 3349200     246.
 5 2018-01-08  253.  253.  252.  253. 3847500     246.
 6 2018-01-09  253.  254.  253.  254. 5031400     247.
 7 2018-01-10  253.  254.  252.  254. 2357600     247.
 8 2018-01-11  254   256.  254.  256. 2760400     249.
 9 2018-01-12  257.  258.  256.  258. 3678400     251.
10 2018-01-16  260.  261.  257.  258. 8102900     251.
# … with 241 more rows

Observations:

  1. Notice that there are 251 rows in df_dia but only 10 of them printed.

  2. This is the default behavior of tibbles, which are just the tidyverse version of a data.frame.
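
To see this printing behavior on a smaller scale, here is a minimal sketch using a made-up tibble called df_demo (a hypothetical stand-in, not part of the DIA data). It assumes the dplyr package is installed, which makes tibble() available. The n argument to print() controls how many rows are shown.

```r
library(dplyr)  # makes tibble() available; assumed to be installed

# Hypothetical 25-row tibble, made up purely for illustration.
df_demo <- tibble(id = 1:25, value = id * 2)

# Printing a tibble shows only the first 10 rows by default.
df_demo

# Ask for more rows explicitly with the n argument.
print(df_demo, n = 15)

# A tibble is still a data.frame underneath.
is.data.frame(df_demo)
```

The same print(..., n = ...) call should work on df_dia if you want to see more than 10 rows of prices.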


The Pipe Operator

1. The pipe operator is the following symbol %>%.

2. It is a quintessential part of the tidyverse.

3. The pipe operator can be a little mysterious at first, but becoming familiar with it will pay huge dividends in the next few tutorials, and as you progress in your use of the tidyverse.

4. In RStudio, the keyboard shortcut for creating it is Ctrl + Shift + M (Cmd + Shift + M on a Mac).

5. Practice doing this a few times, just to get used to it.

# %>% %>% %>% %>% 

6. Notice that the # symbol at the beginning of a line of code turns it into a comment.


Rewriting Code Using the Pipe

The best way to get acquainted with the pipe operator is to see it in action.


average of a vector

1. The following code calculates the average of the close prices in df_dia:

mean(df_dia$close)
[1] 250.5277

2. This can be rewritten with the pipe operator as follows:

df_dia$close %>% mean()
[1] 250.5277


rounding elements of a vector

1. The following code rounds the first five close prices to the nearest dollar:

round(df_dia$close[1:5])
[1] 248 249 251 253 253

2. This can be rewritten with the pipe as follows:

df_dia$close[1:5] %>% round()
[1] 248 249 251 253 253


combining averaging and rounding

1. Suppose we want to calculate the average close price, and then round that average. This can be done as follows:

round(mean(df_dia$close))
[1] 251

2. In order to recreate this with pipes, we will need to use two of them as follows:

df_dia$close %>% mean() %>% round()
[1] 251

Summarizing Remarks:

  1. Notice that we executed the above calculation by composing together the mean() function with the round() function.

  2. In R, pretty much everything is a function.

  3. As your data analysis becomes more complicated, there will be a lot of function composition going on.

  4. Code with a lot of function composition can get confusing, especially if you use standard notation, e.g. h(g(f(x))).

  5. Pipes help keep your code organized, as we will see shortly.
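
The contrast between h(g(f(x))) and its piped form can be sketched with three built-in functions. This is only an illustration: the vector x is made up, and the choice of sum(), sqrt(), and round() is arbitrary. It assumes dplyr is installed so that %>% is available.

```r
library(dplyr)  # provides %>%; assumed to be installed

# Hypothetical vector, made up for illustration.
x <- c(1.2, 3.4, 5.6)

# Standard notation: read from the inside out.
nested <- round(sqrt(sum(x)))

# Piped notation: read from left to right, in the order the steps run.
piped <- x %>% sum() %>% sqrt() %>% round()

# Both versions perform the same computation.
identical(nested, piped)
```

In the piped version the functions appear in the order they are applied, which is the main readability benefit as chains get longer.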


Code Challenge: Find the maximum close price in df_dia and then round it to the nearest dollar. Try this first without pipes, and then with pipes.


Selecting Columns with the Pipe

1. So far we have used pipes in conjunction with one or two simple built-in functions like mean() and round().

2. Pipes really become valuable when combined with multiple tidyverse function calls.

3. In this section we will use %>% along with select() from dplyr.

4. The following code uses normal function syntax to access the date, close, and adjusted columns from df_dia:

select(df_dia, date, close, adjusted)
# A tibble: 251 x 3
   date       close adjusted
   <date>     <dbl>    <dbl>
 1 2018-01-02  248.     241.
 2 2018-01-03  249.     242.
 3 2018-01-04  251.     244.
 4 2018-01-05  253.     246.
 5 2018-01-08  253.     246.
 6 2018-01-09  254.     247.
 7 2018-01-10  254.     247.
 8 2018-01-11  256.     249.
 9 2018-01-12  258.     251.
10 2018-01-16  258.     251.
# … with 241 more rows

5. We can rewrite this with the pipe operator as follows:

df_dia %>% select(date, close, adjusted)
# A tibble: 251 x 3
   date       close adjusted
   <date>     <dbl>    <dbl>
 1 2018-01-02  248.     241.
 2 2018-01-03  249.     242.
 3 2018-01-04  251.     244.
 4 2018-01-05  253.     246.
 5 2018-01-08  253.     246.
 6 2018-01-09  254.     247.
 7 2018-01-10  254.     247.
 8 2018-01-11  256.     249.
 9 2018-01-12  258.     251.
10 2018-01-16  258.     251.
# … with 241 more rows

Observations:

  1. The select() function takes a data.frame as its first argument.

  2. The next arguments for select() are the names of the columns we want to select.

  3. So the pattern for piping with select() is: data_frame %>% select(column_names).

  4. The general pattern for piping is: first_argument %>% function(other_arguments).
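
The general pattern applies to any function, not just select(). As a minimal sketch, assuming a made-up vector called prices (hypothetical values, not taken from df_dia), the pipe places the left-hand side into the first argument of round(), and digits is passed along as one of the other arguments:

```r
library(dplyr)  # provides %>%; assumed to be installed

# Hypothetical prices, made up for illustration.
prices <- c(247.84, 249.12, 250.66)

# Piped form: first_argument %>% function(other_arguments)
piped <- prices %>% round(digits = 1)

# Standard form: function(first_argument, other_arguments)
standard <- round(prices, digits = 1)

# The two forms are interchangeable.
identical(piped, standard)
```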


Code Challenge: Use the pipe operator to select the date, high, and low columns from df_dia.


Mutating with the Pipe

1. In this section, we are going to use pipes to add a column to a data.frame.

2. The following code separates out the date, high, and low columns and assigns them to a new data.frame called df_range.

df_range <- df_dia %>% select(date, high, low)
df_range
# A tibble: 251 x 3
   date        high   low
   <date>     <dbl> <dbl>
 1 2018-01-02  248.  247.
 2 2018-01-03  249.  248.
 3 2018-01-04  251.  250.
 4 2018-01-05  253.  251.
 5 2018-01-08  253.  252.
 6 2018-01-09  254.  253.
 7 2018-01-10  254.  252.
 8 2018-01-11  256.  254.
 9 2018-01-12  258.  256.
10 2018-01-16  261.  257.
# … with 241 more rows

3. Using standard function syntax, we can add an intraday_range column to df_range by using the mutate() function as follows:

mutate(df_range, intraday_range = high - low)
# A tibble: 251 x 4
   date        high   low intraday_range
   <date>     <dbl> <dbl>          <dbl>
 1 2018-01-02  248.  247.           1.24
 2 2018-01-03  249.  248.           1.15
 3 2018-01-04  251.  250.           1.30
 4 2018-01-05  253.  251.           1.88
 5 2018-01-08  253.  252.           0.75
 6 2018-01-09  254.  253.           1.57
 7 2018-01-10  254.  252.           1.48
 8 2018-01-11  256.  254.           1.82
 9 2018-01-12  258.  256.           1.49
10 2018-01-16  261.  257.           3.84
# … with 241 more rows

4. Rewriting this using the pipe operator looks like:

df_range %>% mutate(intraday_range = high - low)
# A tibble: 251 x 4
   date        high   low intraday_range
   <date>     <dbl> <dbl>          <dbl>
 1 2018-01-02  248.  247.           1.24
 2 2018-01-03  249.  248.           1.15
 3 2018-01-04  251.  250.           1.30
 4 2018-01-05  253.  251.           1.88
 5 2018-01-08  253.  252.           0.75
 6 2018-01-09  254.  253.           1.57
 7 2018-01-10  254.  252.           1.48
 8 2018-01-11  256.  254.           1.82
 9 2018-01-12  258.  256.           1.49
10 2018-01-16  261.  257.           3.84
# … with 241 more rows

5. Actually, the intermediate df_range variable is unnecessary. We could accomplish the same thing by starting with df_dia and applying %>% twice.

df_dia %>% 
    select(date, high, low) %>% 
    mutate(intraday_range = high - low)
# A tibble: 251 x 4
   date        high   low intraday_range
   <date>     <dbl> <dbl>          <dbl>
 1 2018-01-02  248.  247.           1.24
 2 2018-01-03  249.  248.           1.15
 3 2018-01-04  251.  250.           1.30
 4 2018-01-05  253.  251.           1.88
 5 2018-01-08  253.  252.           0.75
 6 2018-01-09  254.  253.           1.57
 7 2018-01-10  254.  252.           1.48
 8 2018-01-11  256.  254.           1.82
 9 2018-01-12  258.  256.           1.49
10 2018-01-16  261.  257.           3.84
# … with 241 more rows

Summarizing Remarks:

  1. Piping together functions in the above way is idiomatic of the tidyverse.

  2. Most tidyverse functions take a data.frame as their first argument, and then return a data.frame as their output.

  3. This is what allows piping of multiple tidyverse functions to work.

  4. In the above code, notice that for each %>%, if you run all the code that comes before it, what you get is a data.frame.
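
One way to check the last remark is to stop the chain after each step and inspect what comes out. The sketch below does this on a tiny made-up tibble, df_small (hypothetical values, not real DIA prices), and assumes dplyr is installed. The intermediate variables step_1 and step_2 exist only so each stage can be examined.

```r
library(dplyr)  # provides tibble(), select(), mutate(), and %>%

# Hypothetical three-row price table, made up for illustration.
df_small <- tibble(
  date  = as.Date(c("2018-01-02", "2018-01-03", "2018-01-04")),
  high  = c(10.5, 11.0, 12.25),
  low   = c(10.0, 10.25, 11.5),
  close = c(10.25, 10.75, 12.0)
)

# Everything before the second pipe: still a data.frame.
step_1 <- df_small %>% select(date, high, low)
is.data.frame(step_1)

# Feeding that data.frame into mutate() returns another data.frame.
step_2 <- step_1 %>% mutate(intraday_range = high - low)
is.data.frame(step_2)

step_2
```

Because every stage hands a data.frame to the next, the stages can be joined directly with %>% and the intermediate names dropped.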


Code Challenge: Starting from df_dia, try rewriting the above code without the pipe operator, and without intermediate variables. This will involve applying select() and mutate() to df_dia using normal function syntax.

Discussion Question: Imagine a computation that involves 5 or 6 different tidyverse functions chained together. Do you think you would prefer coding up the pipe or non-pipe version?


NOTE: Pipes can be confusing at first, but they are an integral part of the tidyverse. If you’re still feeling a little foggy, I would recommend reviewing this tutorial.


Further Reading

R4DS - Chapter 18 - Pipes