In this tutorial we introduce the data.frame
structure and show how perform do basic manipulations with it.
1. Let’s begin by loading the packages that we will need.
library(tidyverse)
library(tidyquant)
1. The data.frame
structure is going to be our primary work horse.
2. A data.frame
is a convenient way to store rectangular data that consists of rows and columns.
3. Technically, we will use the tibble
data structure from the tidyverse.
4. You can think of a tibble
and a data.frame
as the same thing.
1. The data that goes into a data.frame
usually comes from an external source.
2. In this course, we will typically get our data from Yahoo Finance using the tq_get()
function from the tidyquant package.
3. Let’s read-in a week’s worth of SPY data using tq_get()
:
df_spy <- tq_get("SPY", get = "stock.prices", from = "2018-12-01"
, to = "2018-12-11")
df_spy
# A tibble: 5 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276.
2 2018-12-04 278. 279. 270. 270. 177986000 268.
3 2018-12-06 266. 270. 262. 270. 204185400 267.
4 2018-12-07 269. 271. 263. 264. 161018900 261.
5 2018-12-10 263. 265. 259. 264. 151445900 261.
1. The first thing to notice is that when df_spy
is printed, the data types of the columns are shown just below the column names.
df_spy
# A tibble: 5 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276.
2 2018-12-04 278. 279. 270. 270. 177986000 268.
3 2018-12-06 266. 270. 262. 270. 204185400 267.
4 2018-12-07 269. 271. 263. 264. 161018900 261.
5 2018-12-10 263. 265. 259. 264. 151445900 261.
2. We can check the number of rows using the nrow()
function functions.
nrow(df_spy)
[1] 5
3. We can check the number of columns using the ncol()
function.
ncol(df_spy)
[1] 7
1. The columns of a data.frame
are atomic vectors.
2. To access a particular column, we can use the $
.
3. The following code accesses the close
column of df_spy
.
df_spy$close
[1] 279.30 270.25 269.84 263.57 264.07
4. This code accesses the date
column of df_spy
.
df_spy$date
[1] "2018-12-03" "2018-12-04" "2018-12-06" "2018-12-07" "2018-12-10"
5. We can find the size of a vector
by applying the length()
function to it.
length(df_spy$close)
[1] 5
Code Challenge: Use the length()
function to find the size of the adjusted
column of df_spy
.
1. Recall that the columns of a data.frame
are vectors
.
2. As such, we can perform component-wise calculation with columns, like you would with the vectors from linear algebra.
3. Notice how the division by 100 gets broadcasted to all the vector components.
df_spy$close / 100
[1] 2.7930 2.7025 2.6984 2.6357 2.6407
4. In the following code, the addition of 100 gets broadcasted similarly.
df_spy$close + 100
[1] 379.30 370.25 369.84 363.57 364.07
5. When we add/subtract two data.frame
columns, the operation is performed component-wise.
6. The following code calculates the intraday range of SPY
for each day in df_spy
, which is the difference between the high
price and the low
price.
df_spy$high - df_spy$low
[1] 2.889984 8.950012 7.529999 8.589996 6.540009
Code Challenge: Calculate the difference between the close
and open
for each day in the data set.
1. Data wrangling often involves adding or removing columns from a data.frame
.
2. There are a variety of ways to do this.
3. In this section, we will explore the base R method.
4. In the next section we’ll use the the tidyverse to accomplish the same thing.
5. The following code saves the intraday range values in a new column of df_spy
. The new column is called range
.
df_spy$range <- df_spy$high - df_spy$low
df_spy
# A tibble: 5 x 8
date open high low close volume adjusted range
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276. 2.89
2 2018-12-04 278. 279. 270. 270. 177986000 268. 8.95
3 2018-12-06 266. 270. 262. 270. 204185400 267. 7.53
4 2018-12-07 269. 271. 263. 264. 161018900 261. 8.59
5 2018-12-10 263. 265. 259. 264. 151445900 261. 6.54
6. Notice how above code works:
range
column exists, and then assigned a value to it.7. We can easily remove a column by assigning a NULL
value to it.
df_spy$range <- NULL
df_spy
# A tibble: 5 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276.
2 2018-12-04 278. 279. 270. 270. 177986000 268.
3 2018-12-06 266. 270. 262. 270. 204185400 267.
4 2018-12-07 269. 271. 263. 264. 161018900 261.
5 2018-12-10 263. 265. 259. 264. 151445900 261.
Code Challenge: Add a symbol
column to df_spy
, and then assign the value of “SPY”
to it.
1. The dplyr package is part of the tidyverse and houses many functions for manipulating data frames
2. The mutate()
function in dplyr add columns to a data.frame
.
3. The following code adds the range
column as we did in the previous section.
mutate(df_spy, range = high - low)
# A tibble: 5 x 8
date open high low close volume adjusted range
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276. 2.89
2 2018-12-04 278. 279. 270. 270. 177986000 268. 8.95
3 2018-12-06 266. 270. 262. 270. 204185400 267. 7.53
4 2018-12-07 269. 271. 263. 264. 161018900 261. 8.59
5 2018-12-10 263. 265. 259. 264. 151445900 261. 6.54
4. However, if we take a look at df_spy
, we see that the column wasn’t added.
df_spy
# A tibble: 5 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276.
2 2018-12-04 278. 279. 270. 270. 177986000 268.
3 2018-12-06 266. 270. 262. 270. 204185400 267.
4 2018-12-07 269. 271. 263. 264. 161018900 261.
5 2018-12-10 263. 265. 259. 264. 151445900 261.
5. The previous mutate()
call simply printed the modified data.frame
to the screen.
6. In order for the mutate()
to take hold, we need to reassign the output to df_spy
.
df_spy <- mutate(df_spy, range = high - low)
df_spy
# A tibble: 5 x 8
date open high low close volume adjusted range
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276. 2.89
2 2018-12-04 278. 279. 270. 270. 177986000 268. 8.95
3 2018-12-06 266. 270. 262. 270. 204185400 267. 7.53
4 2018-12-07 269. 271. 263. 264. 161018900 261. 8.59
5 2018-12-10 263. 265. 259. 264. 151445900 261. 6.54
7. To remove a column we use the select()
function in dplyr.
select(df_spy, -range)
# A tibble: 5 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276.
2 2018-12-04 278. 279. 270. 270. 177986000 268.
3 2018-12-06 266. 270. 262. 270. 204185400 267.
4 2018-12-07 269. 271. 263. 264. 161018900 261.
5 2018-12-10 263. 265. 259. 264. 151445900 261.
Code Challenge: Verify for yourself that the removal of range
didn’t actually take hold in the above code. Modify the code to remove the column.
R4DS - 5.4 - Select columns with select()
R4DS - 5.5 - Add new variables with mutate()
R4DS - 10.1 - Introduction
R4DS - 10.2 - Creating tibbles
R4DS - 10.3 - tibble
vs data.frame