The real power of data analysis with the tidyverse comes into focus when we start utilizing the group_by()
function from the dplyr package.
Talking about group_by()
in the abstract can be confusing; I think it’s best to see grouping in action by doing meaningful calculations.
The purpose of this tutorial is to introduce group_by()
by completing the following financial analysis: calculating monthly returns and volatilities for several assets.
1. Let’s begin by loading the packages that we will need.
library(tidyverse)
library(tidyquant)
1. Next, let’s use tq_get()
to grab the December 2018 price for SPY, IWM, QQQ, and DIA.
symbols <- c("SPY", "IWM", "QQQ", "DIA")
df_etf <- tq_get(symbols, get = "stock.prices"
, from = "2018-12-01", to = "2019-01-01")
df_etf
# A tibble: 76 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SPY 2018-12-03 280. 280. 278. 279. 103176300 276.
2 SPY 2018-12-04 278. 279. 270. 270. 177986000 268.
3 SPY 2018-12-06 266. 270. 262. 270. 204185400 267.
4 SPY 2018-12-07 269. 271. 263. 264. 161018900 261.
5 SPY 2018-12-10 263. 265. 259. 264. 151445900 261.
6 SPY 2018-12-11 268. 268. 262. 264. 121504400 261.
7 SPY 2018-12-12 267. 269 265. 265. 97976700 263.
8 SPY 2018-12-13 267. 267. 264. 265. 96662700 263.
9 SPY 2018-12-14 263. 264. 260. 260. 116961100 258.
10 SPY 2018-12-17 259. 261. 254. 255. 165492300 253.
# … with 66 more rows
Code Challenge: Use filter()
to isolate all the QQQ prices in df_etf
.
Now that we have our data, let’s clarify our analysis objective:
Our data set df_etf
consists of one month of prices for four different ETFs.
We want to calculate the monthly return and the monthly volatility for each ETF.
So ultimately we will have 8 different calculations: 2 calcs for each of the four ETFs.
1. In the course of an analysis, I often find the need repeat the same calculation for many different sets of similar data.
2. To tackle this type of problem, I first execute the calculation for one set of data, and then use that experience to see how I can apply it multiple data sets.
3. Let’s take this approach now, and first calculate our two monthly statistics (return and volatility) for SPY only.
4. We can build this up as a series of tidyverse function calls connected together with the pipe
.
5. We will proceed as follows:
lag()
(a) First, let’s isolate the columns and rows we need using select()
and filter()
.
df_etf %>%
select(symbol, date, close) %>%
filter(symbol == "SPY") %>%
head()
# A tibble: 6 x 3
symbol date close
<chr> <date> <dbl>
1 SPY 2018-12-03 279.
2 SPY 2018-12-04 270.
3 SPY 2018-12-06 270.
4 SPY 2018-12-07 264.
5 SPY 2018-12-10 264.
6 SPY 2018-12-11 264.
(b) Next, let’s add a daily_ret
column using the mutate()
and lag()
functions.
df_etf %>%
select(symbol, date, close) %>%
filter(symbol == "SPY") %>%
mutate(daily_ret = (close / lag(close)) - 1) %>%
head()
# A tibble: 6 x 4
symbol date close daily_ret
<chr> <date> <dbl> <dbl>
1 SPY 2018-12-03 279. NA
2 SPY 2018-12-04 270. -0.0324
3 SPY 2018-12-06 270. -0.00152
4 SPY 2018-12-07 264. -0.0232
5 SPY 2018-12-10 264. 0.00190
6 SPY 2018-12-11 264. 0.000227
(c) and (d) Finally, we calculate the monthly values from the daily ones using the summarize()
.
df_etf %>%
select(symbol, date, close) %>%
filter(symbol == "SPY") %>%
mutate(daily_ret = (close / lag(close)) - 1) %>%
summarize(
monthly_ret = prod(1 + daily_ret, na.rm = TRUE) - 1
, monthly_vol = sd(daily_ret, na.rm = TRUE) * sqrt(252)
)
# A tibble: 1 x 2
monthly_ret monthly_vol
<dbl> <dbl>
1 -0.105 0.305
Code Challenge: Copy/paste the above code, and alter so that it uses the adjusted
price for the calculations rather than the close
price.
1. One complication that arises when we move to the full df_etf
is that we cannot simply apply our mutate()
and summarize()
statements to whole data.frame
.
2. One possible work around is to use for
loop with a variety of if
statements, but that would be painful and unidiomatic.
3. This is where group_by()
comes in handy.
4. The following code groups df_etf
using the symbol
column.
group_by(df_etf, symbol)
# A tibble: 76 x 8
# Groups: symbol [4]
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SPY 2018-12-03 280. 280. 278. 279. 103176300 276.
2 SPY 2018-12-04 278. 279. 270. 270. 177986000 268.
3 SPY 2018-12-06 266. 270. 262. 270. 204185400 267.
4 SPY 2018-12-07 269. 271. 263. 264. 161018900 261.
5 SPY 2018-12-10 263. 265. 259. 264. 151445900 261.
6 SPY 2018-12-11 268. 268. 262. 264. 121504400 261.
7 SPY 2018-12-12 267. 269 265. 265. 97976700 263.
8 SPY 2018-12-13 267. 267. 264. 265. 96662700 263.
9 SPY 2018-12-14 263. 264. 260. 260. 116961100 258.
10 SPY 2018-12-17 259. 261. 254. 255. 165492300 253.
# … with 66 more rows
Observations:
At first glance it doesn’t seem like much has changed.
But notice the # Groups: symbol[4]
in the second line of the print out.
Behind the scenes the result is a grouped data.frame
rather than just a plain data.frame
.
However, nothing of real interest happens until we apply mutate()
or summarize()
to a grouped data.frame
.
Code Challenge: Rewrite the group_by()
statement above using the pipe operator.
1. In the case of a single underlying, we can calculate the daily_ret
by a straight-forward application of mutate()
.
2. In the case of the entire data set df_etf
, which contains four underlyings, the situation is more complicated.
3. By first applying group_by(symbol)
to df_etf
, when we use the mutate()
function it will be applied within each group.
4. The following code demonstrates this:
# calculating returns with a grouped mutate
df_daily <-
df_etf %>%
group_by(symbol) %>%
mutate(daily_ret = (close / lag(close)) - 1) %>%
select(symbol, date, close, daily_ret)
df_daily %>% head()
# A tibble: 6 x 4
# Groups: symbol [1]
symbol date close daily_ret
<chr> <date> <dbl> <dbl>
1 SPY 2018-12-03 279. NA
2 SPY 2018-12-04 270. -0.0324
3 SPY 2018-12-06 270. -0.00152
4 SPY 2018-12-07 264. -0.0232
5 SPY 2018-12-10 264. 0.00190
6 SPY 2018-12-11 264. 0.000227
Observation:
df_daily
in the Viewer
, we can see that the calculation was applied correctly.
Code Challenge: Isolate the data from he first date
in df_daily
and take a look at the value of daily_ret
. Are you concerned?
1. In df_daily
we now have the daily_ret
for each of the symbols.
2. These daily_rets
are key ingredients for our desired monthly statistics.
3. Our monthly statistics are simply summarizing calculations on the daily_ret
column.
4. Since df_daily
is a grouped data.frame
, when we apply summarize()
to it the calculations will be calculated within each group.
To calculate monthly returns, we first group_by(symbol)
and then apply the summarize()
function.
df_daily %>%
group_by(symbol) %>%
summarize(month_ret = prod(1 + daily_ret, na.rm = TRUE) - 1)
# A tibble: 4 x 2
symbol month_ret
<chr> <dbl>
1 DIA -0.0987
2 IWM -0.131
3 QQQ -0.105
4 SPY -0.105
Similarly, to calculate monthly volatility, we first group_by(symbol)
and then apply the summarize()
function.
df_daily %>%
group_by(symbol) %>%
summarize(vol = sd(daily_ret, na.rm = TRUE) * sqrt(252))
# A tibble: 4 x 2
symbol vol
<chr> <dbl>
1 DIA 0.303
2 IWM 0.312
3 QQQ 0.382
4 SPY 0.305
Code Challenge: Calculate the average daily return for each ETF for the month of December 2018.
1. We’ve accomplished what we set out to do in a series of three steps:
(group_by() + mutate()
) starting with df_etf
calculate daily returns for each symbol and stored them into a new data.frame
called df_daily
.
(group_by() + summarize()
) starting with df_daily
, calculate monthly returns for each symbol using the daily_ret
column.
(group_by() + summarize()
) starting with df_daily
, calculate monthly volatility for each symbol using the daily_ret
column.
2. Breaking calculations down into step can be helpful to clarify your thoughts, but as you grow more comfortable with the tidyverse you may want to just do the calculation all at once.
3. Here’s how to perform our calculations in a single block of code:
df_etf %>%
select(symbol, date, close) %>%
group_by(symbol) %>%
mutate(daily_ret = (close / lag(close)) - 1) %>%
summarize(
monthly_ret = prod(1 + daily_ret, na.rm = TRUE) - 1
, monthly_vol = sd(daily_ret, na.rm = TRUE) * sqrt(252)
)
# A tibble: 4 x 3
symbol monthly_ret monthly_vol
<chr> <dbl> <dbl>
1 DIA -0.0987 0.303
2 IWM -0.131 0.312
3 QQQ -0.105 0.382
4 SPY -0.105 0.305
4. The code above is idiomatic of data wrangling in the tidyverse.
5. As a rule of thumb, code blocks involving more than ten pipes
should be broken up in smaller code blocks.
Code Challenge: Modify the code above to add maximum daily absolute return to the final report that is generated.
R4DS - 5.6 - Grouped summaries with summarize()
R4DS - 5.7 - Grouped mutates (and filters)
R for Finance - 5.4 - Returns
R for Finance - 5.5 - Volatility