Tutorial 7

Filtering with dplyr



In our previous discussion of data subsetting we saw, how to access one or more rows of a data.frame based on their position within the data.frame.

As it turns out, positional row access has limited application.

A far more useful technique is to access all the rows of data.frame the meet certain criteria.

This is known as filtering. Our main tool will be filter() from dplyr.


Loading Packages

1. Let’s load the packages we will need for this tutorial.

library(tidyverse)
library(tidyquant)


Reading-In Data

1. Let’s use tq_get() to download price data for SPY, IWM, QQQ, and DIA for the month of December 2018.

symbols <- c("SPY", "IWM", "QQQ", "DIA")
df_etf <- tq_get(symbols, get = "stock.prices"
                 , from = "2018-12-01", to = "2019-01-01")
df_etf
# A tibble: 76 x 8
   symbol date        open  high   low close    volume adjusted
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
 1 SPY    2018-12-03  280.  280.  278.  279. 103176300     276.
 2 SPY    2018-12-04  278.  279.  270.  270. 177986000     268.
 3 SPY    2018-12-06  266.  270.  262.  270. 204185400     267.
 4 SPY    2018-12-07  269.  271.  263.  264. 161018900     261.
 5 SPY    2018-12-10  263.  265.  259.  264. 151445900     261.
 6 SPY    2018-12-11  268.  268.  262.  264. 121504400     261.
 7 SPY    2018-12-12  267.  269   265.  265.  97976700     263.
 8 SPY    2018-12-13  267.  267.  264.  265.  96662700     263.
 9 SPY    2018-12-14  263.  264.  260.  260. 116961100     258.
10 SPY    2018-12-17  259.  261.  254.  255. 165492300     253.
# … with 66 more rows

Observations:

  1. Notice in our tq_get() call that the first argument is a vector of four symbols, rather than just a single symbol.

  2. This is a feature of many R functions - they know how to behave for a single input, or for a vector of multiple inputs.


Filtering for One Symbol

1. Suppose we want to grab only the rows for IWM.

2. In order to do this we would use the filter() function from dplyr as follows:

filter(df_etf, symbol == "IWM")
# A tibble: 19 x 8
   symbol date        open  high   low close   volume adjusted
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
 1 IWM    2018-12-03  154.  154.  152.  154. 23379800     153.
 2 IWM    2018-12-04  154.  154.  147.  148. 41077900     147.
 3 IWM    2018-12-06  145.  147.  143.  147. 37581200     146.
 4 IWM    2018-12-07  147.  148.  143.  144. 28525100     143.
 5 IWM    2018-12-10  144.  145.  141.  144. 30776600     143.
 6 IWM    2018-12-11  146.  146.  142.  143. 29229300     143.
 7 IWM    2018-12-12  145.  147.  145.  145. 27894800     144.
 8 IWM    2018-12-13  145.  146.  143.  143. 23205400     142.
 9 IWM    2018-12-14  142.  143.  140.  141. 24037200     140.
10 IWM    2018-12-17  140.  141.  136.  137. 36468500     137.
11 IWM    2018-12-18  138.  139.  136.  137. 42277200     136.
12 IWM    2018-12-19  137.  139.  133.  134  47676600     134.
13 IWM    2018-12-20  134.  134.  130.  132. 57908100     131.
14 IWM    2018-12-21  132.  133.  128.  128. 59404800     128.
15 IWM    2018-12-24  128.  128.  126.  126. 28575700     126.
16 IWM    2018-12-26  126.  132.  126.  132. 40182700     132.
17 IWM    2018-12-27  130.  132.  128.  132. 39527900     132.
18 IWM    2018-12-28  132.  135.  132.  133. 35994400     133.
19 IWM    2018-12-31  134.  134.  132.  134. 29173400     134.

Observations:

  1. When checking for equality in filter() we must use the double equals ==.

  2. Our filtering doesn’t affect the original data.frame.

3. As you may have guessed, we can rewrite above code with a %>%:

df_etf %>% filter(symbol == "IWM") 
# A tibble: 19 x 8
   symbol date        open  high   low close   volume adjusted
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
 1 IWM    2018-12-03  154.  154.  152.  154. 23379800     153.
 2 IWM    2018-12-04  154.  154.  147.  148. 41077900     147.
 3 IWM    2018-12-06  145.  147.  143.  147. 37581200     146.
 4 IWM    2018-12-07  147.  148.  143.  144. 28525100     143.
 5 IWM    2018-12-10  144.  145.  141.  144. 30776600     143.
 6 IWM    2018-12-11  146.  146.  142.  143. 29229300     143.
 7 IWM    2018-12-12  145.  147.  145.  145. 27894800     144.
 8 IWM    2018-12-13  145.  146.  143.  143. 23205400     142.
 9 IWM    2018-12-14  142.  143.  140.  141. 24037200     140.
10 IWM    2018-12-17  140.  141.  136.  137. 36468500     137.
11 IWM    2018-12-18  138.  139.  136.  137. 42277200     136.
12 IWM    2018-12-19  137.  139.  133.  134  47676600     134.
13 IWM    2018-12-20  134.  134.  130.  132. 57908100     131.
14 IWM    2018-12-21  132.  133.  128.  128. 59404800     128.
15 IWM    2018-12-24  128.  128.  126.  126. 28575700     126.
16 IWM    2018-12-26  126.  132.  126.  132. 40182700     132.
17 IWM    2018-12-27  130.  132.  128.  132. 39527900     132.
18 IWM    2018-12-28  132.  135.  132.  133. 35994400     133.
19 IWM    2018-12-31  134.  134.  132.  134. 29173400     134.

Code Challenge: Grab all the SPY prices from df_etf - make sure to use %>% .


Filtering for One Date

1. Suppose we want to grab the prices for the December 2018 regular option expiration which was 12/21/2018.

2. We can again do this with the filter() function:

df_etf %>% filter(date == "2018-12-21")
# A tibble: 4 x 8
  symbol date        open  high   low close    volume adjusted
  <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 SPY    2018-12-21  247.  250.  240.  241. 255345600     240.
2 IWM    2018-12-21  132.  133.  128.  128.  59404800     128.
3 QQQ    2018-12-21  153.  154.  147.  148. 141129400     147.
4 DIA    2018-12-21  229.  232.  224.  224.  10242700     223.

Code Challenge: Grab the prices from New Years Eve - make sure to use %>%.


Filtering for Two Symbols

1. What if we want grab all the prices for both QQQ and DIA?

2. We can do so by using the %in% operator in our filter() condition:

df_etf %>% 
    filter(symbol %in% c("QQQ", "DIA"))
# A tibble: 38 x 8
   symbol date        open  high   low close   volume adjusted
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
 1 QQQ    2018-12-03  173.  173.  170.  172. 50771700     172.
 2 QQQ    2018-12-04  171.  172.  166.  166. 70594700     165.
 3 QQQ    2018-12-06  162.  167.  162.  167. 71715500     166.
 4 QQQ    2018-12-07  166.  167.  161.  161. 80432200     161.
 5 QQQ    2018-12-10  161.  164.  159.  163. 73960800     162.
 6 QQQ    2018-12-11  166.  166.  162.  164. 59058300     163.
 7 QQQ    2018-12-12  166   168.  165.  165. 53780000     164.
 8 QQQ    2018-12-13  166.  167.  164.  165. 46378500     164.
 9 QQQ    2018-12-14  163.  164.  161.  161. 56547400     160.
10 QQQ    2018-12-17  160.  162.  156.  157. 74834100     157.
# … with 28 more rows

Code Challenge: Grab all prices for SPY, IWM, and QQQ.


Filtering for a Date Range

1. Suppose that we are only concerned with prices after 12/15/2018.

2. We can isolate these prices as follows:

df_etf %>% filter(date > "2018-12-15")
# A tibble: 40 x 8
   symbol date        open  high   low close    volume adjusted
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
 1 SPY    2018-12-17  259.  261.  254.  255. 165492300     253.
 2 SPY    2018-12-18  257.  258.  253.  255. 134515100     252.
 3 SPY    2018-12-19  255.  259.  249.  251. 214992800     249.
 4 SPY    2018-12-20  250.  252.  245.  247. 252053400     245.
 5 SPY    2018-12-21  247.  250.  240.  241. 255345600     240.
 6 SPY    2018-12-24  239.  241.  234.  234. 147311600     233.
 7 SPY    2018-12-26  236.  246.  234.  246. 218485400     245.
 8 SPY    2018-12-27  243.  248.  239.  248. 186267300     247.
 9 SPY    2018-12-28  250.  251.  246.  248. 153100200     247.
10 SPY    2018-12-31  250.  250.  247.  250. 144299400     249.
# … with 30 more rows

Code Challenge: Grab all prices from before 12/7/2018.


Filtering by Multiple Criteria

1. Suppose we want all prices for SPY and IWM that come after Christmas.

2. This is achieved easily by piping together two filters.

df_etf %>% 
    filter(symbol %in% c("SPY", "IWM")) %>% 
    filter(date > "2018-12-25")
# A tibble: 8 x 8
  symbol date        open  high   low close    volume adjusted
  <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 SPY    2018-12-26  236.  246.  234.  246. 218485400     245.
2 SPY    2018-12-27  243.  248.  239.  248. 186267300     247.
3 SPY    2018-12-28  250.  251.  246.  248. 153100200     247.
4 SPY    2018-12-31  250.  250.  247.  250. 144299400     249.
5 IWM    2018-12-26  126.  132.  126.  132.  40182700     132.
6 IWM    2018-12-27  130.  132.  128.  132.  39527900     132.
7 IWM    2018-12-28  132.  135.  132.  133.  35994400     133.
8 IWM    2018-12-31  134.  134.  132.  134.  29173400     134.

Code Challenge: Isolate the price of QQQ on 12/31/2018 using two filter() calls piped together.


Further Reading

R4DS - 5.2 - Filter rows with filter()