Tutorial 5

Subsetting Data - Index, Slice, Select



Over the course of a data analysis, you will be working with a variety of vectors and data.frames.

You will often need to access some subset of these data structures. We will call this data subsetting.

In this tutorial we explore various ways to access subsets of data, including vector indexing, vector slicing, and data.frame column selection.


Loading Packages

1. Let’s begin by loading the packages that we will need.

library(tidyverse)
library(tidyquant)


Reading-In Data

1. Next, let’s read-in QQQ price data from December 2018.

df_qqq <- tq_get("QQQ", get = "stock.prices",
                 from = "2018-12-01", to = "2019-01-01")

head(df_qqq)
# A tibble: 6 x 7
  date        open  high   low close   volume adjusted
  <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 2018-12-03  173.  173.  170.  172. 50771700     172.
2 2018-12-04  171.  172.  166.  166. 70594700     165.
3 2018-12-06  162.  167.  162.  167. 71715500     166.
4 2018-12-07  166.  167.  161.  161. 80432200     161.
5 2018-12-10  161.  164.  159.  163. 73960800     162.
6 2018-12-11  166.  166.  162.  164. 59058300     163.


Vector Indexing

1. Accessing a single element of an atomic vector is called vector indexing.

2. Vector indexing is one of the simplest forms of subsetting.

3. To perform vector indexing we use double square brackets [[.

4. This code accesses the first element of the date column.

df_qqq$date[[1]]
[1] "2018-12-03"

5. Now the fifth element.

df_qqq$date[[5]]
[1] "2018-12-10"
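
The indexing steps above can also be tried without downloading any data. The following is a minimal sketch that assumes a hand-typed vector, dates, standing in for df_qqq$date; its values are the first five dates printed earlier, and the names dates, first_date and fifth_date are made up for illustration.

```r
# Hypothetical stand-in for df_qqq$date: the first five dates shown above.
dates <- as.Date(c("2018-12-03", "2018-12-04", "2018-12-06",
                   "2018-12-07", "2018-12-10"))

# Double square brackets pull out exactly one element by position.
first_date <- dates[[1]]
fifth_date <- dates[[5]]

print(first_date)
print(fifth_date)

# Whatever position we ask for, the result has length one.
print(length(first_date))
```

Because [[ always returns a single element, it is a good way to signal to a reader of your code that you expect exactly one value.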

Code Challenge: Without typing any numbers, write a line of code that grabs the last element of the date column of df_qqq.


Vector Slicing

1. Accessing multiple elements of a vector is called vector slicing.

2. To perform slicing on a vector, we use single square brackets [.

3. The following code grabs the first three elements of the date column.

ix <- c(1, 2, 3)
df_qqq$date[ix]
[1] "2018-12-03" "2018-12-04" "2018-12-06"

4. In the above code, the indexing variable ix was just for illustrative purposes; we can feed a vector such as c(1, 2, 3, 4, 5) directly into the square brackets.

df_qqq$date[c(1, 2, 3, 4, 5)]
[1] "2018-12-03" "2018-12-04" "2018-12-06" "2018-12-07" "2018-12-10"

5. R has a convenient shorthand for creating sequences of consecutive integers. Try typing the following:

1:3
[1] 1 2 3

6. This leads to the following pattern that you often see in R.

df_qqq$date[1:3]
[1] "2018-12-03" "2018-12-04" "2018-12-06"

7. The elements that you access don’t have to be consecutive, or in order.

ix <- c(9, 13, 1)
df_qqq$date[ix]
[1] "2018-12-14" "2018-12-20" "2018-12-03"
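
To see how slicing with [ differs from indexing with [[, here is a small sketch. It assumes a hand-typed vector, dates, standing in for df_qqq$date (the first five dates from the data), and the names ix, picked and result are made up for illustration.

```r
# Hypothetical stand-in for df_qqq$date: the first five dates in the data.
dates <- as.Date(c("2018-12-03", "2018-12-04", "2018-12-06",
                   "2018-12-07", "2018-12-10"))

# Single brackets accept a whole vector of positions, in any order.
ix <- c(5, 1, 3)
picked <- dates[ix]
print(picked)

# The slice has one element per requested position.
print(length(picked) == length(ix))

# Double brackets are meant for a single position. Asking for several
# at once raises an error, which we capture here instead of stopping.
result <- tryCatch(dates[[ix]], error = function(e) e)
print(inherits(result, "error"))
```

A reasonable rule of thumb: reach for [[ when you want one element and [ when you want several.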

Code Challenge: Grab the 18th to the 15th elements of date - in that order: 18th, 17th, 16th, 15th.


Accessing a Single data.frame Column

1. We already know how to access a single data.frame column using $.

2. This is itself a form of subsetting.

df_qqq$close
 [1] 172.33 165.72 166.89 161.38 163.07 163.61 165.05 165.10 161.08 157.43
[11] 158.42 154.53 152.29 147.57 143.50 152.46 153.05 152.97 154.26
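
As a small illustration of column access, the sketch below assumes a hand-built three-row data.frame, df_demo, standing in for df_qqq; its values are copied from the first three rows of the data. It also shows that a column name in double square brackets is another way to pull out one column.

```r
# Hypothetical three-row stand-in for df_qqq (values from the first rows).
df_demo <- data.frame(
  date  = as.Date(c("2018-12-03", "2018-12-04", "2018-12-06")),
  close = c(172.33, 165.72, 166.89)
)

# The dollar sign pulls out one column by name.
close_prices <- df_demo$close
print(close_prices)

# Double square brackets with the column name give the same result.
print(identical(close_prices, df_demo[["close"]]))

# The extracted column can then be indexed as before.
print(close_prices[[2]])
```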


Accessing Multiple data.frame Columns

1. To access multiple columns of a data.frame we will use the select() function from the dplyr package, which is part of the tidyverse.

2. The following code grabs the date and close price columns.

select(df_qqq, date, close)
# A tibble: 19 x 2
   date       close
   <date>     <dbl>
 1 2018-12-03  172.
 2 2018-12-04  166.
 3 2018-12-06  167.
 4 2018-12-07  161.
 5 2018-12-10  163.
 6 2018-12-11  164.
 7 2018-12-12  165.
 8 2018-12-13  165.
 9 2018-12-14  161.
10 2018-12-17  157.
11 2018-12-18  158.
12 2018-12-19  155.
13 2018-12-20  152.
14 2018-12-21  148.
15 2018-12-24  144.
16 2018-12-26  152.
17 2018-12-27  153.
18 2018-12-28  153.
19 2018-12-31  154.

3. We can also use select() to access a single column.

select(df_qqq, adjusted)
# A tibble: 19 x 1
   adjusted
      <dbl>
 1     172.
 2     165.
 3     166.
 4     161.
 5     162.
 6     163.
 7     164.
 8     164.
 9     160.
10     157.
11     158.
12     154.
13     152.
14     147.
15     143.
16     152.
17     153.
18     153.
19     154.
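
The select() calls above can be tried on a smaller table as well. This is a minimal sketch that assumes a hand-built three-row tibble, df_demo, standing in for df_qqq, with values copied from the first three rows of the data; the names two_cols and swapped are made up for illustration.

```r
library(dplyr)

# Hypothetical three-row stand-in for df_qqq (values from the first rows).
df_demo <- tibble(
  date   = as.Date(c("2018-12-03", "2018-12-04", "2018-12-06")),
  close  = c(172.33, 165.72, 166.89),
  volume = c(50771700, 70594700, 71715500)
)

# Keep only the columns we name; every row is retained.
two_cols <- select(df_demo, date, close)
print(two_cols)

# Columns come back in the order they are listed in the call.
swapped <- select(df_demo, close, date)
print(names(swapped))
```

Listing columns in a different order is a simple way to rearrange a table for display.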

Code Challenge: We have two ways of accessing the adjusted column: df_qqq$adjusted and select(df_qqq, adjusted). What is a fundamental difference between the two? Try typing both and observe the difference in what is printed.


Accessing Rows

1. We can also access individual rows of a data.frame using square brackets.

2. I don’t use this technique that much, but it’s worth mentioning.

3. The following code grabs the entire second row of df_qqq.

df_qqq[2, ]
# A tibble: 1 x 7
  date        open  high   low close   volume adjusted
  <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 2018-12-04  171.  172.  166.  166. 70594700     165.

4. Notice that what we get back is still a data.frame (here, a one-row tibble), not a vector.
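
The row access above can be checked on a small table. The sketch below assumes a hand-built three-row data.frame, df_demo, standing in for df_qqq (values from the first three rows of the data); it uses a base data.frame rather than a tibble, so the printed output will look slightly different.

```r
# Hypothetical three-row stand-in for df_qqq (values from the first rows).
df_demo <- data.frame(
  date  = as.Date(c("2018-12-03", "2018-12-04", "2018-12-06")),
  close = c(172.33, 165.72, 166.89)
)

# Row position goes before the comma; leaving the spot after the comma
# empty means "keep every column".
second_row <- df_demo[2, ]
print(second_row)

# The result is a data.frame with one row and all of the columns.
print(is.data.frame(second_row))
print(dim(second_row))
```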

Code Challenge: Think about how we went from vector indexing (one element) to vector slicing (multiple elements). Based on that, how can you access rows 5 through 9 of df_qqq?


Further Reading

R4DS - 5.4 - Select columns with select()