In this tutorial, we move towards doing tasks that are useful in a finance context. Specifically, we calculate daily returns for SPY during the month of December 2018.
This simple returns calculation will serve as a window to various idioms and patterns in base R and tidyverse programming.
In order to acocomplish this we will need to discuss two preliminary programming topics:
conditional statements: if-else
iterating with for
loops
1. As usual, let’s begin by loading the packages that we will need.
library(tidyverse)
library(tidyquant)
if
-else
1. Conditional statements are ubiquitous in programming and data analysis.
2. Here is a toy example to demonstrate the syntax of if
-else
statements in R.
payoff <- NA
strike <- 50
upx <- 45
if ((upx - strike) > 0){
payoff <- upx - strike
} else {
payoff <- 0
}
payoff
[1] 0
Code Challenge: Modify the code above by setting upx
to 55 and rerun it. What is the resulting value of payoff
?
1. Let’s use tq_get()
to grab the December 2018 SPY prices; we will use these prices to calculate our daily returns.
df_spy <- tq_get("SPY", get = "stock.prices", from = "2018-12-01"
, to = "2019-01-01")
df_spy %>% head()
# A tibble: 6 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276.
2 2018-12-04 278. 279. 270. 270. 177986000 268.
3 2018-12-06 266. 270. 262. 270. 204185400 267.
4 2018-12-07 269. 271. 263. 264. 161018900 261.
5 2018-12-10 263. 265. 259. 264. 151445900 261.
6 2018-12-11 268. 268. 262. 264. 121504400 261.
for
Loops1. A for
loop allows you to repeatedly run a block of code a fixed number of times.
2. A pattern that occurs often is iterating through a vector of consecutive integers.
3. The following code iterates through the numbers 1 through 5 and prints them to the screen:
for (ix in 1:5){
print(ix)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
4. A pattern that is common in data analysis is iterating through the column of a data frame
.
5. The following code iterates through df_spy
row by row and to prints the date
value for each row.
for (ix in 1:nrow(df_spy)){
print(df_spy$date[ix])
}
[1] "2018-12-03"
[1] "2018-12-04"
[1] "2018-12-06"
[1] "2018-12-07"
[1] "2018-12-10"
[1] "2018-12-11"
[1] "2018-12-12"
[1] "2018-12-13"
[1] "2018-12-14"
[1] "2018-12-17"
[1] "2018-12-18"
[1] "2018-12-19"
[1] "2018-12-20"
[1] "2018-12-21"
[1] "2018-12-24"
[1] "2018-12-26"
[1] "2018-12-27"
[1] "2018-12-28"
[1] "2018-12-31"
Code Challenge: Write a for
loop that prints the first five dates in df_spy
.
for
loop1. We can now combine our knowledge of conditional execution and iteration to calculate the daily returns of SPY during the month of December 2018.
2. Our code will encapsulate the following two principles about returns:
On the first day in our data set, the return will be set to 0
.
On all subsequent days the simple return is defined as (curr_price / prev_price) - 1
.
3. Let’s begin with a couple of preliminary steps.
4. First, we will sort our data frame
by date
since our calculation requires the prices to be in chronological order. This is done with the arrange()
function as follows:
df_spy <- df_spy %>% arrange(date)
df_spy %>% head()
# A tibble: 6 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276.
2 2018-12-04 278. 279. 270. 270. 177986000 268.
3 2018-12-06 266. 270. 262. 270. 204185400 267.
4 2018-12-07 269. 271. 263. 264. 161018900 261.
5 2018-12-10 263. 265. 259. 264. 151445900 261.
6 2018-12-11 268. 268. 262. 264. 121504400 261.
5. Second, let’s add a column called ret
and initialize it with NA
values. In R, an NA
value represents missing data.
df_spy <- df_spy %>% mutate(ret = NA)
df_spy %>% head()
# A tibble: 6 x 8
date open high low close volume adjusted ret
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 2018-12-03 280. 280. 278. 279. 103176300 276. NA
2 2018-12-04 278. 279. 270. 270. 177986000 268. NA
3 2018-12-06 266. 270. 262. 270. 204185400 267. NA
4 2018-12-07 269. 271. 263. 264. 161018900 261. NA
5 2018-12-10 263. 265. 259. 264. 151445900 261. NA
6 2018-12-11 268. 268. 262. 264. 121504400 261. NA
6. Now we are ready to complete the task of calculating the daily returns.
7. The following code iterates through the df_spy
, and performs a different calculation depending on whether or not it’s the first date.
for (ix in 1:nrow(df_spy)){
if (ix == 1){
# for the first date just set to zero
df_spy$ret[ix] <- 0
} else {
# grabbing prices from data frame
curr_price <- df_spy$close[ix]
prev_price <- df_spy$close[ix - 1]
df_spy$ret[ix] <- (curr_price / prev_price) - 1
}
}
df_spy %>% head()
# A tibble: 6 x 8
date open high low close volume adjusted ret
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-12-03 280. 280. 278. 279. 103176300 276. 0
2 2018-12-04 278. 279. 270. 270. 177986000 268. -0.0324
3 2018-12-06 266. 270. 262. 270. 204185400 267. -0.00152
4 2018-12-07 269. 271. 263. 264. 161018900 261. -0.0232
5 2018-12-10 263. 265. 259. 264. 151445900 261. 0.00190
6 2018-12-11 268. 268. 262. 264. 121504400 261. 0.000227
Coding Challenge: Calculate the one month return over the entirety of December 2018. (Bonus points if you can do it with using division. Hint: the prod()
function calculates the product of an array of numbers.)
1. While our for
loop solution above works, it is not an ideal. Moreover, it’s not idiomatic of data analysis programming.
2. By utilizing a few tidyverse functions we can write code that is easier to read.
3. Our key ingredient will be the lag()
function in dplyr.
4. The lag()
function take in a vector, and returns a vector of the same length with all the entries pushed down one spot.
5. Let’s observe the effect that lag()
has when applied to the first 4 element of the date
column of df_spy
.
df_spy$date[1:4]
[1] "2018-12-03" "2018-12-04" "2018-12-06" "2018-12-07"
lag(df_spy$date[1:4])
[1] NA "2018-12-03" "2018-12-04" "2018-12-06"
6. Let’s apply it similarly to the close
column:
df_spy$close[1:4]
[1] 279.30 270.25 269.84 263.57
df_spy$close[1:4] %>% lag()
[1] NA 279.30 270.25 269.84
7. The following code exploits the lag()
and calculates all the daily returns in a single line of code (rather than a for
loop). This type of vectorized code is more idiomatic data analysis in R.
(df_spy$close / lag(df_spy$close)) - 1
[1] NA -0.0324023931 -0.0015171286 -0.0232359513 0.0018970292
[6] 0.0002272049 0.0050353461 -0.0003390191 -0.0184647628 -0.0196183821
[11] -0.0010964873 -0.0149757212 -0.0162779475 -0.0261763202 -0.0264229376
[16] 0.0505248664 0.0076773664 -0.0012899867 0.0087588214
8. Using base R techniques, let’s create a new column in df_spy
called ret_vec
and populate it with the daily returns.
df_spy$ret_vec <- (df_spy$close / lag(df_spy$close)) - 1
df_spy %>%
select(date, close, ret, ret_vec) %>%
head()
# A tibble: 6 x 4
date close ret ret_vec
<date> <dbl> <dbl> <dbl>
1 2018-12-03 279. 0 NA
2 2018-12-04 270. -0.0324 -0.0324
3 2018-12-06 270. -0.00152 -0.00152
4 2018-12-07 264. -0.0232 -0.0232
5 2018-12-10 264. 0.00190 0.00190
6 2018-12-11 264. 0.000227 0.000227
1. The vectorized base solution is an improvement in many ways, but it is a little hard to read. There are multiple $
signs and multiple references to df_spy
. This style of code gets especially confusing as your analysis grows in complexity.
2. Combining tidyverse functions with the pipe
allow us affect similar vectorized code in a succinct and readable fashion.
3. The following code utilizes tidyverse idioms to add a returns column to df_spy
. We will call our new return calculation ret_tidy
.
df_spy <-
df_spy %>% mutate(ret_tidy = (close / lag(close)) - 1)
4. Notice that ret
, ret_vec
, and ret_tidy
all match.
df_spy %>%
select(date, ret, ret_vec, ret_tidy) %>%
head()
# A tibble: 6 x 4
date ret ret_vec ret_tidy
<date> <dbl> <dbl> <dbl>
1 2018-12-03 0 NA NA
2 2018-12-04 -0.0324 -0.0324 -0.0324
3 2018-12-06 -0.00152 -0.00152 -0.00152
4 2018-12-07 -0.0232 -0.0232 -0.0232
5 2018-12-10 0.00190 0.00190 0.00190
6 2018-12-11 0.000227 0.000227 0.000227
Code Challenge: Apply tidyverse idioms add a column to df_spy
called ret_adj
. This column will contain daily returns using the adjusted
prices.
R4DS - 5.3 - Arrange rows with arrange()
R4DS - 5.4 - Select columns with select()
R4DS - 5.5 - Add new variables with mutate()
R4DS - 21.1 - Introduction
R4DS - 21.2 - For loops
R4DS - 21.3 - For loop variations