Over the course of a data analysis, you will be working with a variety of vectors and data.frames.
You will often need to access some subset of these data structures. We will call this data subsetting.
In this tutorial we explore various ways to access subsets of data, including vector indexing, vector slicing, and data.frame column selection.
1. Let’s begin by loading the packages that we will need.
library(tidyverse)
library(tidyquant)
2. Next, let’s read in QQQ price data from December 2018.
df_qqq <- tq_get("QQQ", get = "stock.prices", from = "2018-12-01", to = "2019-01-01")
head(df_qqq)
# A tibble: 6 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 173. 173. 170. 172. 50771700 172.
2 2018-12-04 171. 172. 166. 166. 70594700 165.
3 2018-12-06 162. 167. 162. 167. 71715500 166.
4 2018-12-07 166. 167. 161. 161. 80432200 161.
5 2018-12-10 161. 164. 159. 163. 73960800 162.
6 2018-12-11 166. 166. 162. 164. 59058300 163.
1. Accessing a single element of an atomic vector is called vector indexing.
2. Vector indexing is one of the simplest forms of subsetting.
3. To perform vector indexing we use double square brackets [[.
4. This code accesses the first element of the date column.
df_qqq$date[[1]]
[1] "2018-12-03"
5. Now the fifth element.
df_qqq$date[[5]]
[1] "2018-12-10"
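As a minimal sketch of the same idea on data that needs no download, the block below types in the first five QQQ closing prices by hand. The name close_px is an illustrative assumption and is not part of the tutorial's own code. It also shows what [[ does when it is asked for anything other than exactly one existing element.

```r
# Hand-typed copy of the first five QQQ closing prices (illustrative data).
close_px <- c(172.33, 165.72, 166.89, 161.38, 163.07)

# [[ returns exactly one element.
first_close <- close_px[[1]]
fifth_close <- close_px[[5]]

# Asking [[ for two elements of an atomic vector is an error.
two_at_once <- tryCatch(close_px[[c(1, 2)]], error = function(e) "error")

# Asking [[ for a position past the end is also an error ...
past_end_double <- tryCatch(close_px[[10]], error = function(e) "error")
# ... whereas single brackets quietly return NA.
past_end_single <- close_px[10]

print(first_close)
print(fifth_close)
```

The strictness is the point: when you expect one value, [[ fails loudly instead of handing back something of the wrong length.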
Code Challenge: Without typing any numbers, write a line of code that grabs the last element of the date column of df_qqq.
1. Accessing multiple elements of a vector is called vector slicing.
2. To perform slicing on a vector, we use single square brackets [.
3. The following code grabs the first three elements of the date column.
ix <- c(1, 2, 3)
df_qqq$date[ix]
[1] "2018-12-03" "2018-12-04" "2018-12-06"
4. In the above code, the indexing variable ix was just for illustration; we can feed an index vector such as c(1, 2, 3) directly into the square brackets. Here we grab the first five elements.
df_qqq$date[c(1, 2, 3, 4, 5)]
[1] "2018-12-03" "2018-12-04" "2018-12-06" "2018-12-07" "2018-12-10"
5. R has a convenient shorthand for creating sequences of consecutive integers. Try typing the following:
1:3
[1] 1 2 3
6. This leads to the following pattern that you often see in R.
df_qqq$date[1:3]
[1] "2018-12-03" "2018-12-04" "2018-12-06"
7. The elements that you access don’t have to be consecutive, or in order.
ix <- c(9, 13, 1)
df_qqq$date[ix]
[1] "2018-12-14" "2018-12-20" "2018-12-03"
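The slicing patterns above can be sketched on a small hand-typed vector, so the code runs without downloading anything. The name close_px and the particular positions chosen are illustrative assumptions.

```r
# Hand-typed copy of the first five QQQ closing prices (illustrative data).
close_px <- c(172.33, 165.72, 166.89, 161.38, 163.07)

first_three <- close_px[1:3]      # consecutive positions via the : shorthand
picked      <- close_px[c(5, 1)]  # any positions, in any order
repeated    <- close_px[c(2, 2)]  # a position may appear more than once

# The result is a vector with one element per requested position.
print(length(first_three))
print(length(picked))
```

In each case the index vector decides both which elements come back and the order they come back in.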
Code Challenge: Grab the 18th to the 15th elements of date, in that order: 18th, 17th, 16th, 15th.
data.frame Column
1. We already know how to access a single data.frame column using $.
2. This is itself a form of subsetting.
df_qqq$close
[1] 172.33 165.72 166.89 161.38 163.07 163.61 165.05 165.10 161.08 157.43
[11] 158.42 154.53 152.29 147.57 143.50 152.46 153.05 152.97 154.26
data.frame Columns
1. To access multiple columns of a data.frame we will use the select() function from the dplyr package, which is part of the tidyverse.
2. The following code grabs the date and close price columns.
select(df_qqq, date, close)
# A tibble: 19 x 2
date close
<date> <dbl>
1 2018-12-03 172.
2 2018-12-04 166.
3 2018-12-06 167.
4 2018-12-07 161.
5 2018-12-10 163.
6 2018-12-11 164.
7 2018-12-12 165.
8 2018-12-13 165.
9 2018-12-14 161.
10 2018-12-17 157.
11 2018-12-18 158.
12 2018-12-19 155.
13 2018-12-20 152.
14 2018-12-21 148.
15 2018-12-24 144.
16 2018-12-26 152.
17 2018-12-27 153.
18 2018-12-28 153.
19 2018-12-31 154.
3. We can use select() to access a single column.
select(df_qqq, adjusted)
# A tibble: 19 x 1
adjusted
<dbl>
1 172.
2 165.
3 166.
4 161.
5 162.
6 163.
7 164.
8 164.
9 160.
10 157.
11 158.
12 154.
13 152.
14 147.
15 143.
16 152.
17 153.
18 153.
19 154.
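Here is a small sketch of select() on a hand-made table, assuming the dplyr package is installed. The tibble df_small is a hypothetical stand-in for df_qqq, holding only the first three rows of a few columns. The sketch shows two behaviors that the examples above do not: select() returns columns in the order you list them, and a leading minus sign drops a column.

```r
library(dplyr)

# Small hand-made stand-in for df_qqq (illustrative values only).
df_small <- tibble(
  date   = as.Date(c("2018-12-03", "2018-12-04", "2018-12-06")),
  close  = c(172.33, 165.72, 166.89),
  volume = c(50771700, 70594700, 71715500)
)

# Columns come back in the order they are listed in select().
close_then_date <- select(df_small, close, date)

# A leading minus drops a column and keeps the rest.
without_volume <- select(df_small, -volume)

print(names(close_then_date))
print(names(without_volume))
```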
Code Challenge: We have two ways of accessing the adjusted column: df_qqq$adjusted and select(df_qqq, adjusted). What is a fundamental difference between the two? Try typing both and observe the difference in what is printed.
1. We can also access individual rows of a data.frame using square brackets.
2. I don’t use this technique that much, but it’s worth mentioning.
3. The following code grabs the entire second row of df_qqq.
df_qqq[2, ]
# A tibble: 1 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-04 171. 172. 166. 166. 70594700 165.
4. Notice that what we get back is a data.frame.
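The bracket syntax for rows can be sketched in base R alone. The data.frame df_small below is a hypothetical stand-in for df_qqq built from the first three dates and closing prices. A plain data.frame is used here to avoid any package dependency; the same row syntax applies to the tibble that tq_get() returns.

```r
# Small hand-made stand-in for df_qqq (illustrative values only).
df_small <- data.frame(
  date  = as.Date(c("2018-12-03", "2018-12-04", "2018-12-06")),
  close = c(172.33, 165.72, 166.89)
)

# Row index goes before the comma; the empty slot after it means "all columns".
second_row <- df_small[2, ]

print(is.data.frame(second_row))
print(dim(second_row))
print(second_row$close)
```

Because the result is still a data.frame, you can keep working with it using $ or select().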
Code Challenge: Think about how we went from vector indexing (one element) to vector slicing (multiple elements). Based on that, how can you access rows 5 through 9 of df_qqq?
R4DS - 5.4 - Select columns with select()