Tutorial 3

Data Frames - The Analysis Workhorse



In this tutorial we introduce the data.frame structure and show how perform do basic manipulations with it.


Loading Packages

1. Let’s begin by loading the packages that we will need.

library(tidyverse)
library(tidyquant)


Data Frames

1. The data.frame structure is going to be our primary work horse.

2. A data.frame is a convenient way to store rectangular data that consists of rows and columns.

3. Technically, we will use the tibble data structure from the tidyverse.

4. You can think of a tibble and a data.frame as the same thing.


Reading-In Data

1. The data that goes into a data.frame usually comes from an external source.

2. In this course, we will typically get our data from Yahoo Finance using the tq_get() function from the tidyquant package.

3. Let’s read-in a week’s worth of SPY data using tq_get():

df_spy <- tq_get("SPY", get = "stock.prices", from = "2018-12-01"
                 , to = "2018-12-11")
df_spy
# A tibble: 5 x 7
  date        open  high   low close    volume adjusted
  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 2018-12-03  280.  280.  278.  279. 103176300     276.
2 2018-12-04  278.  279.  270.  270. 177986000     268.
3 2018-12-06  266.  270.  262.  270. 204185400     267.
4 2018-12-07  269.  271.  263.  264. 161018900     261.
5 2018-12-10  263.  265.  259.  264. 151445900     261.


Exploring a Data Fram

1. The first thing to notice is that when df_spy is printed, the data types of the columns are shown just below the column names.

df_spy
# A tibble: 5 x 7
  date        open  high   low close    volume adjusted
  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 2018-12-03  280.  280.  278.  279. 103176300     276.
2 2018-12-04  278.  279.  270.  270. 177986000     268.
3 2018-12-06  266.  270.  262.  270. 204185400     267.
4 2018-12-07  269.  271.  263.  264. 161018900     261.
5 2018-12-10  263.  265.  259.  264. 151445900     261.

2. We can check the number of rows using the nrow() function functions.

nrow(df_spy)
[1] 5

3. We can check the number of columns using the ncol() function.

ncol(df_spy)
[1] 7


Data Frame Columns

1. The columns of a data.frame are atomic vectors.

2. To access a particular column, we can use the $.

3. The following code accesses the close column of df_spy.

df_spy$close
[1] 279.30 270.25 269.84 263.57 264.07

4. This code accesses the date column of df_spy.

df_spy$date
[1] "2018-12-03" "2018-12-04" "2018-12-06" "2018-12-07" "2018-12-10"

5. We can find the size of a vector by applying the length() function to it.

length(df_spy$close)
[1] 5

Code Challenge: Use the length() function to find the size of the adjusted column of df_spy.


Component-wise Column Operations

1. Recall that the columns of a data.frame are vectors.

2. As such, we can perform component-wise calculation with columns, like you would with the vectors from linear algebra.

3. Notice how the division by 100 gets broadcasted to all the vector components.

df_spy$close / 100
[1] 2.7930 2.7025 2.6984 2.6357 2.6407

4. In the following code, the addition of 100 gets broadcasted similarly.

df_spy$close + 100
[1] 379.30 370.25 369.84 363.57 364.07

5. When we add/subtract two data.frame columns, the operation is performed component-wise.

6. The following code calculates the intraday range of SPY for each day in df_spy, which is the difference between the high price and the low price.

df_spy$high - df_spy$low
[1] 2.889984 8.950012 7.529999 8.589996 6.540009

Code Challenge: Calculate the difference between the close and open for each day in the data set.


Adding/Removing Columns (base)

1. Data wrangling often involves adding or removing columns from a data.frame.

2. There are a variety of ways to do this.

3. In this section, we will explore the base R method.

4. In the next section we’ll use the the tidyverse to accomplish the same thing.

5. The following code saves the intraday range values in a new column of df_spy. The new column is called range.

df_spy$range <- df_spy$high - df_spy$low
df_spy
# A tibble: 5 x 8
  date        open  high   low close    volume adjusted range
  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl> <dbl>
1 2018-12-03  280.  280.  278.  279. 103176300     276.  2.89
2 2018-12-04  278.  279.  270.  270. 177986000     268.  8.95
3 2018-12-06  266.  270.  262.  270. 204185400     267.  7.53
4 2018-12-07  269.  271.  263.  264. 161018900     261.  8.59
5 2018-12-10  263.  265.  259.  264. 151445900     261.  6.54

6. Notice how above code works:

  • We act as if the range column exists, and then assigned a value to it.
  • R realizes that it doesn’t exist, and then creates it for us.

7. We can easily remove a column by assigning a NULL value to it.

df_spy$range <- NULL
df_spy
# A tibble: 5 x 7
  date        open  high   low close    volume adjusted
  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 2018-12-03  280.  280.  278.  279. 103176300     276.
2 2018-12-04  278.  279.  270.  270. 177986000     268.
3 2018-12-06  266.  270.  262.  270. 204185400     267.
4 2018-12-07  269.  271.  263.  264. 161018900     261.
5 2018-12-10  263.  265.  259.  264. 151445900     261.

Code Challenge: Add a symbol column to df_spy, and then assign the value of “SPY” to it.


Adding/Removing Columns (tidyverse)

1. The dplyr package is part of the tidyverse and houses many functions for manipulating data frames

2. The mutate() function in dplyr add columns to a data.frame.

3. The following code adds the range column as we did in the previous section.

mutate(df_spy, range = high - low)
# A tibble: 5 x 8
  date        open  high   low close    volume adjusted range
  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl> <dbl>
1 2018-12-03  280.  280.  278.  279. 103176300     276.  2.89
2 2018-12-04  278.  279.  270.  270. 177986000     268.  8.95
3 2018-12-06  266.  270.  262.  270. 204185400     267.  7.53
4 2018-12-07  269.  271.  263.  264. 161018900     261.  8.59
5 2018-12-10  263.  265.  259.  264. 151445900     261.  6.54

4. However, if we take a look at df_spy, we see that the column wasn’t added.

df_spy
# A tibble: 5 x 7
  date        open  high   low close    volume adjusted
  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 2018-12-03  280.  280.  278.  279. 103176300     276.
2 2018-12-04  278.  279.  270.  270. 177986000     268.
3 2018-12-06  266.  270.  262.  270. 204185400     267.
4 2018-12-07  269.  271.  263.  264. 161018900     261.
5 2018-12-10  263.  265.  259.  264. 151445900     261.

5. The previous mutate() call simply printed the modified data.frame to the screen.

6. In order for the mutate() to take hold, we need to reassign the output to df_spy.

df_spy <- mutate(df_spy, range = high - low)
df_spy
# A tibble: 5 x 8
  date        open  high   low close    volume adjusted range
  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl> <dbl>
1 2018-12-03  280.  280.  278.  279. 103176300     276.  2.89
2 2018-12-04  278.  279.  270.  270. 177986000     268.  8.95
3 2018-12-06  266.  270.  262.  270. 204185400     267.  7.53
4 2018-12-07  269.  271.  263.  264. 161018900     261.  8.59
5 2018-12-10  263.  265.  259.  264. 151445900     261.  6.54

7. To remove a column we use the select() function in dplyr.

select(df_spy, -range)
# A tibble: 5 x 7
  date        open  high   low close    volume adjusted
  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 2018-12-03  280.  280.  278.  279. 103176300     276.
2 2018-12-04  278.  279.  270.  270. 177986000     268.
3 2018-12-06  266.  270.  262.  270. 204185400     267.
4 2018-12-07  269.  271.  263.  264. 161018900     261.
5 2018-12-10  263.  265.  259.  264. 151445900     261.

Code Challenge: Verify for yourself that the removal of range didn’t actually take hold in the above code. Modify the code to remove the column.


Further Reading

R4DS - 5.4 - Select columns with select()

R4DS - 5.5 - Add new variables with mutate()

R4DS - 10.1 - Introduction

R4DS - 10.2 - Creating tibbles

R4DS - 10.3 - tibble vs data.frame