Tutorial 9

Grouped Summarize and Mutate



The real power of data analysis with the tidyverse comes into focus when we start utilizing the group_by() function from the dplyr package.

Talking about group_by() in the abstract can be confusing; I think it’s best to see grouping in action by doing meaningful calculations.

The purpose of this tutorial is to introduce group_by() by completing the following financial analysis: calculating monthly returns and volatilities for several assets.


Loading Packages

1. Let’s begin by loading the packages that we will need.

library(tidyverse)
library(tidyquant)


Reading Data

1. Next, let’s use tq_get() to grab the December 2018 price for SPY, IWM, QQQ, and DIA.

symbols <- c("SPY", "IWM", "QQQ", "DIA")
df_etf <- tq_get(symbols, get = "stock.prices"
                 , from = "2018-12-01", to = "2019-01-01")
df_etf
# A tibble: 76 x 8
   symbol date        open  high   low close    volume adjusted
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
 1 SPY    2018-12-03  280.  280.  278.  279. 103176300     276.
 2 SPY    2018-12-04  278.  279.  270.  270. 177986000     268.
 3 SPY    2018-12-06  266.  270.  262.  270. 204185400     267.
 4 SPY    2018-12-07  269.  271.  263.  264. 161018900     261.
 5 SPY    2018-12-10  263.  265.  259.  264. 151445900     261.
 6 SPY    2018-12-11  268.  268.  262.  264. 121504400     261.
 7 SPY    2018-12-12  267.  269   265.  265.  97976700     263.
 8 SPY    2018-12-13  267.  267.  264.  265.  96662700     263.
 9 SPY    2018-12-14  263.  264.  260.  260. 116961100     258.
10 SPY    2018-12-17  259.  261.  254.  255. 165492300     253.
# … with 66 more rows

Code Challenge: Use filter() to isolate all the QQQ prices in df_etf.


Analysis Objective

Now that we have our data, let’s clarify our analysis objective:

  1. Our data set df_etf consists of one month of prices for four different ETFs.

  2. We want to calculate the monthly return and the monthly volatility for each ETF.

  3. So ultimately we will have 8 different calculations: 2 calcs for each of the four ETFs.


Trying One Symbol

1. In the course of an analysis, I often find the need repeat the same calculation for many different sets of similar data.

2. To tackle this type of problem, I first execute the calculation for one set of data, and then use that experience to see how I can apply it multiple data sets.

3. Let’s take this approach now, and first calculate our two monthly statistics (return and volatility) for SPY only.

4. We can build this up as a series of tidyverse function calls connected together with the pipe.

5. We will proceed as follows:

  1. isolate the SPY rows
  2. calculate a daily returns using lag()
  3. calculate 1-month return from daily returns
  4. calculate volatility from daily returns

(a) First, let’s isolate the columns and rows we need using select() and filter().

df_etf %>%
    select(symbol, date, close) %>% 
    filter(symbol == "SPY") %>% 
    head()
# A tibble: 6 x 3
  symbol date       close
  <chr>  <date>     <dbl>
1 SPY    2018-12-03  279.
2 SPY    2018-12-04  270.
3 SPY    2018-12-06  270.
4 SPY    2018-12-07  264.
5 SPY    2018-12-10  264.
6 SPY    2018-12-11  264.

(b) Next, let’s add a daily_ret column using the mutate() and lag() functions.

df_etf %>% 
    select(symbol, date, close) %>% 
    filter(symbol == "SPY") %>% 
    mutate(daily_ret = (close / lag(close)) - 1) %>% 
    head()
# A tibble: 6 x 4
  symbol date       close  daily_ret
  <chr>  <date>     <dbl>      <dbl>
1 SPY    2018-12-03  279.  NA       
2 SPY    2018-12-04  270.  -0.0324  
3 SPY    2018-12-06  270.  -0.00152 
4 SPY    2018-12-07  264.  -0.0232  
5 SPY    2018-12-10  264.   0.00190 
6 SPY    2018-12-11  264.   0.000227

(c) and (d) Finally, we calculate the monthly values from the daily ones using the summarize().

df_etf %>% 
    select(symbol, date, close) %>% 
    filter(symbol == "SPY") %>% 
    mutate(daily_ret = (close / lag(close)) - 1) %>% 
    summarize(
        monthly_ret = prod(1 + daily_ret, na.rm = TRUE) - 1
        , monthly_vol = sd(daily_ret, na.rm = TRUE) * sqrt(252)
    )
# A tibble: 1 x 2
  monthly_ret monthly_vol
        <dbl>       <dbl>
1      -0.105       0.305

Code Challenge: Copy/paste the above code, and alter so that it uses the adjusted price for the calculations rather than the close price.


Grouping by Symbol

1. One complication that arises when we move to the full df_etf is that we cannot simply apply our mutate() and summarize() statements to whole data.frame.

2. One possible work around is to use for loop with a variety of if statements, but that would be painful and unidiomatic.

3. This is where group_by() comes in handy.

4. The following code groups df_etf using the symbol column.

group_by(df_etf, symbol) 
# A tibble: 76 x 8
# Groups:   symbol [4]
   symbol date        open  high   low close    volume adjusted
   <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
 1 SPY    2018-12-03  280.  280.  278.  279. 103176300     276.
 2 SPY    2018-12-04  278.  279.  270.  270. 177986000     268.
 3 SPY    2018-12-06  266.  270.  262.  270. 204185400     267.
 4 SPY    2018-12-07  269.  271.  263.  264. 161018900     261.
 5 SPY    2018-12-10  263.  265.  259.  264. 151445900     261.
 6 SPY    2018-12-11  268.  268.  262.  264. 121504400     261.
 7 SPY    2018-12-12  267.  269   265.  265.  97976700     263.
 8 SPY    2018-12-13  267.  267.  264.  265.  96662700     263.
 9 SPY    2018-12-14  263.  264.  260.  260. 116961100     258.
10 SPY    2018-12-17  259.  261.  254.  255. 165492300     253.
# … with 66 more rows

Observations:

  1. At first glance it doesn’t seem like much has changed.

  2. But notice the # Groups: symbol[4] in the second line of the print out.

  3. Behind the scenes the result is a grouped data.frame rather than just a plain data.frame.

  4. However, nothing of real interest happens until we apply mutate() or summarize() to a grouped data.frame.


Code Challenge: Rewrite the group_by() statement above using the pipe operator.


Daily Returns by Symbol

1. In the case of a single underlying, we can calculate the daily_ret by a straight-forward application of mutate().

2. In the case of the entire data set df_etf, which contains four underlyings, the situation is more complicated.

3. By first applying group_by(symbol) to df_etf, when we use the mutate() function it will be applied within each group.

4. The following code demonstrates this:

# calculating returns with a grouped mutate
df_daily <-
    df_etf %>% 
        group_by(symbol) %>% 
        mutate(daily_ret = (close / lag(close)) - 1) %>% 
        select(symbol, date, close, daily_ret)

df_daily %>% head()
# A tibble: 6 x 4
# Groups:   symbol [1]
  symbol date       close  daily_ret
  <chr>  <date>     <dbl>      <dbl>
1 SPY    2018-12-03  279.  NA       
2 SPY    2018-12-04  270.  -0.0324  
3 SPY    2018-12-06  270.  -0.00152 
4 SPY    2018-12-07  264.  -0.0232  
5 SPY    2018-12-10  264.   0.00190 
6 SPY    2018-12-11  264.   0.000227

Observation:

  1. If we examine df_daily in the Viewer, we can see that the calculation was applied correctly.


Code Challenge: Isolate the data from he first date in df_daily and take a look at the value of daily_ret. Are you concerned?


Monthly Statistics by Symbol

1. In df_daily we now have the daily_ret for each of the symbols.

2. These daily_rets are key ingredients for our desired monthly statistics.

3. Our monthly statistics are simply summarizing calculations on the daily_ret column.

4. Since df_daily is a grouped data.frame, when we apply summarize() to it the calculations will be calculated within each group.


monthly returns

To calculate monthly returns, we first group_by(symbol) and then apply the summarize() function.

df_daily %>% 
    group_by(symbol) %>% 
    summarize(month_ret = prod(1 + daily_ret, na.rm = TRUE) - 1) 
# A tibble: 4 x 2
  symbol month_ret
  <chr>      <dbl>
1 DIA      -0.0987
2 IWM      -0.131 
3 QQQ      -0.105 
4 SPY      -0.105 


monthly volatility

Similarly, to calculate monthly volatility, we first group_by(symbol) and then apply the summarize() function.

df_daily %>% 
    group_by(symbol) %>% 
    summarize(vol = sd(daily_ret, na.rm = TRUE) * sqrt(252)) 
# A tibble: 4 x 2
  symbol   vol
  <chr>  <dbl>
1 DIA    0.303
2 IWM    0.312
3 QQQ    0.382
4 SPY    0.305

Code Challenge: Calculate the average daily return for each ETF for the month of December 2018.


All At Once

1. We’ve accomplished what we set out to do in a series of three steps:

  1. (group_by() + mutate()) starting with df_etf calculate daily returns for each symbol and stored them into a new data.frame called df_daily.

  2. (group_by() + summarize()) starting with df_daily, calculate monthly returns for each symbol using the daily_ret column.

  3. (group_by() + summarize()) starting with df_daily, calculate monthly volatility for each symbol using the daily_ret column.

2. Breaking calculations down into step can be helpful to clarify your thoughts, but as you grow more comfortable with the tidyverse you may want to just do the calculation all at once.

3. Here’s how to perform our calculations in a single block of code:

df_etf %>% 
    select(symbol, date, close) %>% 
    group_by(symbol) %>% 
    mutate(daily_ret = (close / lag(close)) - 1) %>% 
    summarize(
        monthly_ret = prod(1 + daily_ret, na.rm = TRUE) - 1
        , monthly_vol = sd(daily_ret, na.rm = TRUE) * sqrt(252)
    )
# A tibble: 4 x 3
  symbol monthly_ret monthly_vol
  <chr>        <dbl>       <dbl>
1 DIA        -0.0987       0.303
2 IWM        -0.131        0.312
3 QQQ        -0.105        0.382
4 SPY        -0.105        0.305

4. The code above is idiomatic of data wrangling in the tidyverse.

5. As a rule of thumb, code blocks involving more than ten pipes should be broken up in smaller code blocks.


Code Challenge: Modify the code above to add maximum daily absolute return to the final report that is generated.


Further Reading

R4DS - 5.6 - Grouped summaries with summarize()

R4DS - 5.7 - Grouped mutates (and filters)

R for Finance - 5.4 - Returns

R for Finance - 5.5 - Volatility