Calculate daily mean of a data frame in R

I have a data frame in R that contains readings taken every five minutes over a couple of months. I want to calculate the daily mean of Var3 (data frame shown below) and add it to the data frame as Var4.
Here is my df:
>df
timestamp Var1 Var2 Var3
1 2018-07-20 13:50:00 32.0358 28.1 3.6
2 2018-07-20 13:55:00 32.0358 28.0 2.5
3 2018-07-20 14:00:00 32.0358 28.1 2.2
I found this solution by searching the forum, but it raises an error.
Here is the solution I am applying:
aggregate(ts(df$var3[, 2], freq = 288), 1, mean)
This is the error I am getting:
Error in df$var3[, 2] : incorrect number of dimensions
I think this should work for my data frame too, but I have not been able to get rid of the error. Please help.

Here's an approach with dplyr and lubridate.
library(dplyr)
library(lubridate)
df %>%
group_by(Day = day(ymd_hms(timestamp))) %>%
mutate(Var4 = mean(Var3))
## A tibble: 1,000 x 6
## Groups: Day [5]
# timestamp Var1 Var2 Var3 Day Var4
# <dttm> <dbl> <dbl> <dbl> <int> <dbl>
# 1 2018-07-20 13:55:30 32.2 22.9 2.35 20 2.99
# 2 2018-07-20 14:00:30 37.7 24.8 2.99 20 2.99
# 3 2018-07-20 14:05:30 38.7 29.6 3.47 20 2.99
# 4 2018-07-20 14:10:30 30.4 24.2 3.02 20 2.99
# 5 2018-07-20 14:15:30 32.0 28.4 2.95 20 2.99
## … with 995 more rows
Sample Data
df <- data.frame(timestamp = ymd_hms("2018-07-20 13:50:30") + 60*5 * 1:1000,
Var1 = runif(100,30,40),
Var2 = runif(100,20,30),
Var3 = runif(100,2,4))
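Two side notes, with a base R sketch. The original error arises because df$Var3 is a plain vector, so indexing it with [, 2] has nothing to index (and column names are case-sensitive, so df$var3 does not match Var3). Also, day() returns the day of the month, so with data spanning several months the 20th of July and the 20th of August would land in the same group. The sketch below groups on the calendar date instead. The simulated readings and the use of UTC are assumptions made only for illustration.

```r
# Simulated five-minute readings (illustrative data, not the asker's)
set.seed(1)
timestamps <- as.POSIXct("2018-07-20 13:50:00", tz = "UTC") + 300 * 0:999
df <- data.frame(timestamp = timestamps, Var3 = runif(1000, 2, 4))

# Group on the calendar date rather than the day of the month
df$Date <- as.Date(df$timestamp, tz = "UTC")

# ave() returns one value per row: the mean of that row's group
df$Var4 <- ave(df$Var3, df$Date, FUN = mean)

head(df)
```

Every row keeps its original reading and gains the mean of its own day, which mirrors what group_by() followed by mutate() does in the dplyr answer.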

Related

What's the easiest way to summarize by group and for the whole sample?

Suppose I have data that looks like this:
Date time price minute FOMC Daily.Return
<date> <time> <dbl> <dbl> <fct> <dbl>
1 2005-01-03 16:00:00 120. 960 FALSE -1.24
2 2005-01-04 16:00:00 119. 960 FALSE -1.44
3 2005-01-05 16:00:00 118. 960 FALSE -0.354
4 2005-01-06 16:00:01 119. 960 FALSE 0.245
5 2005-01-07 15:59:00 119. 959 FALSE -0.328
6 2005-01-10 16:00:00 119. 960 FALSE 0.506
7 2005-01-11 16:00:00 118. 960 FALSE -0.279
8 2005-01-12 16:00:01 119. 960 FALSE 0.329
9 2005-01-13 16:00:00 118. 960 FALSE -0.787
10 2005-01-14 16:00:00 118. 960 FALSE 0.372
I want to summarize Daily.Return per group using the FOMC variable which is either TRUE or FALSE. That is easy with dplyr. I get the following:
daily.SPY %>% group_by(FOMC) %>%
summarise(Mean = 100 * mean(Daily.Return),
Median = 100 * median(Daily.Return),
Vol = 100 * sqrt(252) * sd(Daily.Return/100))
As expected, I get the following tibble back:
FOMC Mean Median Vol
<fct> <dbl> <dbl> <dbl>
1 FALSE 0.00551 5.24 14.9
2 TRUE 20.8 1.20 17.6
However, I would like to have a third row which would perform the same computations without grouping. It would compute the average, median and standard deviation for the whole sample, without conditioning on the group. What's the easiest way to do that within tidyverse? Thanks!
One option is to just row bind a duplicate of the whole data where you mutate() the FOMC variable to "ALL" so that you end up with that as a separate group when you group_by() and summarise().
library(tidyverse)
set.seed(1)
daily.SPY <- tibble(
FOMC = factor(rep(c(T, F), each = 25)),
Daily.Return = c(cumsum(rnorm(25)), cumsum(rnorm(25)))
)
daily.SPY %>%
bind_rows(., mutate(., FOMC = "ALL")) %>%
group_by(FOMC) %>%
summarise(Mean = 100 * mean(Daily.Return),
Median = 100 * median(Daily.Return),
Vol = 100 * sqrt(252) * sd(Daily.Return/100))
#> # A tibble: 3 x 4
#> FOMC Mean Median Vol
#> <chr> <dbl> <dbl> <dbl>
#> 1 ALL 58.4 -6.57 32.3
#> 2 FALSE -80.3 -53.6 13.9
#> 3 TRUE 197. 151. 30.5
Created on 2022-01-11 by the reprex package (v2.0.1)
You can make a function for summarizing data:
summarize_returns = function(data) {
data %>%
summarise(
Mean = 100 * mean(Daily.Return),
Median = 100 * median(Daily.Return),
Vol = 100 * sqrt(252) * sd(Daily.Return / 100),
.groups = "drop"
)
}
Then you can combine the two summaries using dplyr::bind_rows():
data %>%
group_by(FOMC) %>%
summarize_returns() %>%
bind_rows(
data %>% summarize_returns() %>% mutate(FOMC = "Total")
)
# A tibble: 3 x 4
FOMC Mean Median Vol
<chr> <dbl> <dbl> <dbl>
1 FALSE -13.6 -13.3 15.5
2 TRUE 14.4 8.79 16.6
3 Total 0.992 -1.08 16.2
My data:
library(tidyverse)
set.seed(123)
data = tibble(
FOMC = as.character(sample(c(TRUE, FALSE), 100, replace = TRUE)),
Daily.Return = rnorm(100)
)
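For readers without the tidyverse, the same idea (treat the whole sample as one more group) can be sketched in base R. The simulated returns and the helper name summarise_returns are hypothetical and exist only for this illustration.

```r
# Simulated data (illustrative only)
set.seed(1)
returns <- data.frame(
  FOMC = rep(c("TRUE", "FALSE"), each = 25),
  Daily.Return = rnorm(50)
)

# Hypothetical helper: the three statistics from the question
summarise_returns <- function(x) {
  c(Mean   = 100 * mean(x),
    Median = 100 * median(x),
    Vol    = 100 * sqrt(252) * sd(x / 100))
}

# One vector per FOMC level, plus the full sample as an extra "ALL" group
groups <- c(split(returns$Daily.Return, returns$FOMC),
            list(ALL = returns$Daily.Return))

# One row per group, one column per statistic
summary_tbl <- t(sapply(groups, summarise_returns))
summary_tbl
```

Because the "ALL" entry is just another element of the list, the same function is applied to it without any special casing.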

How do I use dplyr to correlate each column in a for loop?

I have a dataframe of 19 stocks, including the S&P500 (SPX), throughout time. I want to correlate each one of these stocks with the S&P for each month (Jan-Dec), making 18 x 12 = 216 different correlations, and store these in a list called stockList.
> tokens
# A tibble: 366 x 21
Month Date SPX TZERO .....(16 more columns of stocks)...... MPS
<dbl> <dttm> <dbl> <dbl> <dbl>
1 2020-01-02 3245.50 0.95 176.72
...
12 2020-12-31 3733.42 2.90 .....(16 more columns of stocks)..... 360.73
Here's where my error pops up, by using the index [i], or [[i]], in the cor() function
stockList <- list()
for(i in 1:18) {
stockList[[i]] <- tokens %>%
group_by(Month) %>%
summarize(correlation = cor(SPX, tokens[i+3], use = 'complete.obs'))
}
Error in summarise_impl(.data, dots) :
Evaluation error: incompatible dimensions.
How do I use column indexing in the cor() function when trying to summarize? Is there an alternative way?
First, to recreate data like yours:
library(tidyquant)
# Get gamestop, apple, and S&P 500 index prices
prices <- tq_get(c("GME", "AAPL", "^GSPC"),
get = "stock.prices",
from = "2020-01-01",
to = "2020-12-31")
library(tidyverse)
prices_wide <- prices %>%
select(date, close, symbol) %>%
pivot_wider(names_from = symbol, values_from = close) %>%
mutate(Month = lubridate::month(date)) %>%
select(Month, Date = date, GME, AAPL, SPX = `^GSPC`)
This should look like your data:
> prices_wide
# A tibble: 252 x 5
Month Date GME AAPL SPX
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2020-01-02 6.31 75.1 3258.
2 1 2020-01-03 5.88 74.4 3235.
3 1 2020-01-06 5.85 74.9 3246.
4 1 2020-01-07 5.52 74.6 3237.
5 1 2020-01-08 5.72 75.8 3253.
6 1 2020-01-09 5.55 77.4 3275.
7 1 2020-01-10 5.43 77.6 3265.
8 1 2020-01-13 5.43 79.2 3288.
9 1 2020-01-14 4.71 78.2 3283.
10 1 2020-01-15 4.61 77.8 3289.
# … with 242 more rows
Then I put that data in longer "tidy" format where each row has the stock value and the SPX value so I can compare them:
prices_wide %>%
# I want every row to have month, date, and SPX
pivot_longer(cols = -c(Month, Date, SPX),
names_to = "symbol",
values_to = "price") %>%
group_by(Month, symbol) %>%
summarize(correlation = cor(price, SPX)) %>%
ungroup()
# A tibble: 24 x 3
Month symbol correlation
<dbl> <chr> <dbl>
1 1 AAPL 0.709
2 1 GME -0.324
3 2 AAPL 0.980
4 2 GME 0.874
5 3 AAPL 0.985
6 3 GME -0.177
7 4 AAPL 0.956
8 4 GME 0.873
9 5 AAPL 0.792
10 5 GME -0.435
# … with 14 more rows
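As for why the original loop fails: inside summarize(), SPX refers to the rows of the current month only, whereas tokens[i+3] is the full, ungrouped column (and a one-column tibble at that), so cor() receives inputs of different lengths. If a loop over columns is still preferred, a base R sketch along the following lines subsets both series from the same monthly chunk. The data and the stock names AAA and BBB are made up for illustration.

```r
# Simulated prices for three months (illustrative only)
set.seed(42)
n <- 60
prices <- data.frame(
  Month = rep(1:3, each = 20),
  SPX   = cumsum(rnorm(n)) + 100
)
prices$AAA <- prices$SPX + rnorm(n)   # hypothetical stock tracking the index
prices$BBB <- cumsum(rnorm(n)) + 50   # hypothetical unrelated stock

# Every column except the grouping variable and the benchmark
stock_cols <- setdiff(names(prices), c("Month", "SPX"))

# Split once by month so SPX and the stock come from the same rows
by_month <- split(prices, prices$Month)

# Rows are months, columns are stocks
cor_by_month <- sapply(stock_cols, function(col) {
  sapply(by_month, function(d) cor(d$SPX, d[[col]], use = "complete.obs"))
})
cor_by_month
```

Selecting columns by name with d[[col]] also avoids the fragile i + 3 position arithmetic.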

Disaggregate daily time series into hourly values using R

I'm working with a dataset that contains daily data of water flow. The data goes from 1-10-1998 to 30-03-2020 and looks like this:
Date QA
1998-10-01 315
1998-10-02 245
1998-10-03 179
1998-10-04 186
1998-10-05 262
1998-10-06 199
1998-10-07 319
(...)
The class(Date) is "Date" and the class(QA) is "numeric".
My goal is to turn this daily data into hourly data. For this I used the function 'td' from the package 'tempdisagg' of R:
library(tempdisagg)
td(QA~1,to="hour",method="denton-cholette")
My problem is in the definition of QA as a time series variable. When I define it as 'ts' and apply the function to disaggregate the data, the following error appears:
QA_ts <- ts(QA, start = decimal_date(as.Date("1998-10-01")), frequency = 365)
td(QA_ts ~ 1, to = "hour",method="denton-cholette")
Error in td(QA_ts ~ 1, to = "hour",method="denton-cholette") :
use a time series class other than 'ts' to deal with 'hour'
And when I define QA as another format such as "xts" or "msts" I get the following error:
newQA <- xts(QA,Date)
td(newQA ~1, to="hour",method="denton-cholette")
Error in seq.Date(lf[1], lf.end, by = to) : 'to' must be a "Date" object
I think I'm doing something wrong when defining QA as time series but I can't solve this issue.
Can anybody help me out?
thanks,
Date needs to be of class POSIXct, rather than Date, to convert to hourly frequency. Here is a reproducible example:
x <- structure(list(time = structure(c(10227, 10258, 10286, 10317,
10347, 10378, 10408), class = "Date"), value = c(315, 245, 179,
186, 262, 199, 319)), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
Disaggregate to days:
library(tempdisagg)
m0 <- td(x ~ 1, to = "day", method = "fast")
#> Loading required namespace: tsbox
predict(m0)
#> # A tibble: 212 x 2
#> time value
#> <date> <dbl>
#> 1 1998-01-01 10.4
#> 2 1998-01-02 10.3
#> 3 1998-01-03 10.3
#> 4 1998-01-04 10.3
#> 5 1998-01-05 10.3
#> 6 1998-01-06 10.3
#> 7 1998-01-07 10.3
#> 8 1998-01-08 10.3
#> 9 1998-01-09 10.3
#> 10 1998-01-10 10.3
#> # … with 202 more rows
If you want to disaggregate to hours, time needs to be POSIXct:
x$time <- as.POSIXct(x$time)
m1 <- td(x ~ 1, to = "hour", method = "fast")
predict(m1)
#> # A tibble: 5,087 x 2
#> time value
#> <dttm> <dbl>
#> 1 1998-01-01 01:00:00 0.431
#> 2 1998-01-01 02:00:00 0.431
#> 3 1998-01-01 03:00:00 0.431
#> 4 1998-01-01 04:00:00 0.431
#> 5 1998-01-01 05:00:00 0.431
#> 6 1998-01-01 06:00:00 0.431
#> 7 1998-01-01 07:00:00 0.431
#> 8 1998-01-01 08:00:00 0.431
#> 9 1998-01-01 09:00:00 0.431
#> 10 1998-01-01 10:00:00 0.431
#> # … with 5,077 more rows
Here is a slightly more complex example for hourly disaggregation.
This post explains conversion to high-frequency in more detail.
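To make the class change concrete for the asker's daily series, here is a base R sketch that converts the Date column to POSIXct and builds an hourly index. The even split of each daily value across 24 hours is a deliberately naive assumption, used only so that the hourly values sum back to the daily total. It is not what td() with "denton-cholette" or "fast" would produce, and the use of UTC is likewise an assumption.

```r
# First week of the series from the question
daily <- data.frame(
  Date = as.Date("1998-10-01") + 0:6,
  QA   = c(315, 245, 179, 186, 262, 199, 319)
)

# Date -> POSIXct at midnight UTC
daily$time <- as.POSIXct(format(daily$Date), tz = "UTC")

# Hourly index: 24 stamps per day, one hour (3600 s) apart
hourly <- data.frame(
  time = rep(daily$time, each = 24) + 3600 * rep(0:23, times = nrow(daily)),
  QA   = rep(daily$QA / 24, each = 24)   # naive uniform split (assumption)
)

head(hourly)
```

With the time column stored as POSIXct, a data frame shaped like daily[, c("time", "QA")] has the form that td() accepts in the example above.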

Group by weekly data and summarize by month in R with dplyr

I have a dataset of weekly mortgage rate data.
The data looks very simple:
library(tibble)
library(lubridate)
df <- tibble(
Date = as_date(c("2/7/2008 ", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"), format = "%m/%d/%Y"),
Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)
I am trying to group it and summarize by month.
This blogpost and this answer are not what I want, because they just add the month column.
They give me the output:
month Date summary_variable
2008-02-01 2008-02-07 5.67
2008-02-01 2008-02-14 5.72
2008-02-01 2008-02-21 6.04
2008-02-01 2008-02-28 6.24
My desired output (ideally the last day of the month):
Month Average rate
2/28/2008 6
3/31/2008 6.1
4/30/2008 5.9
In the output above I put random numbers, not real calculations.
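We can map each date to the last day of its month and use that as the grouping column, then take the mean per group. In the code below, df1, DATE and MORTGAGE30US are the names used for the full dataset, with DATE stored as month/day/year text; as.yearmon() drops the day, and as.Date(..., 1) converts back to the month's last day.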
library(dplyr)
library(lubridate)
library(zoo)
df1 %>%
group_by(Month = as.Date(as.yearmon(mdy(DATE)), 1)) %>%
summarise(Average_rate = mean(MORTGAGE30US))
Output:
# A tibble: 151 x 2
# Month Average_rate
# <date> <dbl>
# 1 2008-02-29 5.92
# 2 2008-03-31 5.97
# 3 2008-04-30 5.92
# 4 2008-05-31 6.04
# 5 2008-06-30 6.32
# 6 2008-07-31 6.43
# 7 2008-08-31 6.48
# 8 2008-09-30 6.04
# 9 2008-10-31 6.2
#10 2008-11-30 6.09
# … with 141 more rows
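Applied to the five rows from the question (where Date is already a Date and the rate column is called Rate), the same grouping can be sketched in base R without zoo. The helper month_end is a hypothetical name introduced here.

```r
df <- data.frame(
  Date = as.Date(c("2008-02-07", "2008-02-14", "2008-02-21",
                   "2008-02-28", "2008-03-06")),
  Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)

# Hypothetical helper: first day of the next month, minus one day
month_end <- function(d) {
  first <- as.Date(format(d, "%Y-%m-01"))
  ends <- lapply(seq_along(first), function(i) {
    seq(first[i], by = "month", length.out = 2)[2] - 1
  })
  do.call(c, ends)
}

df$Month <- month_end(df$Date)

# Mean rate per month-end date
means <- tapply(df$Rate, format(df$Month), mean)
monthly <- data.frame(
  Month        = as.Date(names(means)),
  Average_rate = as.numeric(means)
)
monthly
```

Since 2008 is a leap year, February's rows are labelled 2008-02-29, in line with the first row of the output above.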

Trouble using object in dataframe after a pipe (decomposition of a msts object)

I am doing a time series decomposition and want to save the resulting objects in a dataframe. It works if I store the results in an object and use it to make the dataframe afterwards:
# needed packages
library(tidyverse)
library(forecast)
# some "time series"
vec <- 1:1000 + rnorm(1000)
# store pipe results
pipe_out <-
# do decomposition
decompose(msts(vec, start= c(2001, 1, 1), seasonal.periods= c(7, 365.25))) %>%
# relevant data
.$seasonal
# make a dataframe with the stored seasonal data
data.frame(ts= pipe_out)
But doing the same as a one-liner fails:
decompose(msts(vec, start= c(2001, 1, 1), seasonal.periods= c(7, 365.25))) %>%
data.frame(ts= .$seasonal)
I get the error
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ‘"decomposed.ts"’ to a data.frame
I thought that the pipe simply passes along whatever the previous step produced, which saves us from storing it in an object. If so, shouldn't both snippets produce the very same output?
EDIT (from comments)
The first code works but it is a bad solution because if one wants to extract all the vectors of the decomposed time series one would need to do it in multiple steps. Something like the following would be better:
decompose(msts(vec, start= c(2001, 1, 1),
seasonal.periods= c(7, 365.25))) %>%
data.frame(seasonal= .$seasonal, x=.$x, trend=.$trend, random=.$random)
It's unclear from your example whether you want to extract $x or $seasonal. Either way, you can extract part of a list either with the `[[`() function in base or the alias extract2() in magrittr, as you prefer. You should then use the . when you create a data.frame in the last step.
Cleaning up the code a bit to be consistent with the piping, the following works:
library(magrittr)
library(tidyverse)
library(forecast)
vec <- 1:1000 + rnorm(1000)
vec %>%
msts(start = c(2001, 1, 1), seasonal.periods= c(7, 365.25)) %>%
decompose %>%
`[[`("seasonal") %>%
# extract2("seasonal") %>% # Another option, uncomment if preferred
data.frame(ts = .) %>%
head # Just for the reprex, remove as required
#> ts
#> 1 -1.17332998
#> 2 0.07393265
#> 3 0.37631946
#> 4 0.30640395
#> 5 1.04279779
#> 6 0.20470768
Created on 2019-11-28 by the reprex package (v0.3.0)
Edit based on comment:
To do what you mention in the comments, you need to use curly brackets (see e.g. here for an explanation why). Hence, the following works:
library(magrittr)
library(tidyverse)
library(forecast)
vec <- 1:1000 + rnorm(1000)
vec %>%
msts(start= c(2001, 1, 1), seasonal.periods = c(7, 365.25)) %>%
decompose %>%
{data.frame(seasonal = .$seasonal,
trend = .$trend)} %>%
head
#> seasonal trend
#> 1 -0.4332034 NA
#> 2 -0.6185832 NA
#> 3 -0.5899566 NA
#> 4 0.7640938 NA
#> 5 -0.4374417 NA
#> 6 -0.8739449 NA
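The reason the braces matter: when the dot appears only inside a nested expression such as .$seasonal, magrittr still inserts the left-hand side as the first argument, so the failing one-liner is effectively data.frame(., ts = .$seasonal). The base R sketch below spells out both calls without a pipe. It uses a plain ts with a weekly period instead of msts() so that it runs without the forecast package, which is a simplification.

```r
set.seed(1)
vec <- 1:1000 + rnorm(1000)

# Plain ts with period 7 (simplification: no msts, single seasonality)
dec <- decompose(ts(vec, frequency = 7))

# What the failing one-liner amounts to: the whole decomposed.ts object
# is passed as the first argument and cannot be coerced
res_fail <- tryCatch(
  data.frame(dec, ts = dec$seasonal),
  error = function(e) e
)

# What the braces version amounts to: only the named components are passed
res_ok <- data.frame(seasonal = dec$seasonal, trend = dec$trend)

conditionMessage(res_fail)
head(res_ok)
```

Wrapping the right-hand side in curly braces turns off the automatic first-argument insertion, which is why only the second form succeeds in a pipe.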
However, for your specific use case, it may be clearer and easier to use magrittr::extract and then simply bind_cols:
vec %>%
msts(start= c(2001, 1, 1), seasonal.periods = c(7, 365.25)) %>%
decompose %>%
magrittr::extract(c("seasonal", "trend")) %>%
bind_cols %>%
head
#> # A tibble: 6 x 2
#> seasonal trend
#> <dbl> <dbl>
#> 1 -0.433 NA
#> 2 -0.619 NA
#> 3 -0.590 NA
#> 4 0.764 NA
#> 5 -0.437 NA
#> 6 -0.874 NA
Created on 2019-11-29 by the reprex package (v0.3.0)
With daily data, decompose() does not work well because it will only handle the annual seasonality and will give relatively poor estimates of it. If the data involve human behaviour, it will probably have both weekly and annual seasonal patterns.
Also, msts objects are not great for daily data either because they don't store the dates explicitly.
I suggest you use tsibble objects with an STL decomposition instead. Here is an example using your data.
library(tidyverse)
library(tsibble)
library(feasts)
mydata <- tsibble(
day = as.Date(seq(as.Date("2001-01-01"), length=1000, by=1)),
vec = 1:1000 + rnorm(1000)
)
#> Using `day` as index variable.
mydata
#> # A tsibble: 1,000 x 2 [1D]
#> day vec
#> <date> <dbl>
#> 1 2001-01-01 0.161
#> 2 2001-01-02 2.61
#> 3 2001-01-03 1.37
#> 4 2001-01-04 3.15
#> 5 2001-01-05 4.43
#> 6 2001-01-06 7.35
#> 7 2001-01-07 7.10
#> 8 2001-01-08 10.0
#> 9 2001-01-09 9.16
#> 10 2001-01-10 10.2
#> # … with 990 more rows
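# Compute a decomposition
# (in more recent feasts/fabletools releases the equivalent call is
#  mydata %>% model(STL(vec)) %>% components())
mydata %>% STL(vec)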
#> # A dable: 1,000 x 7 [1D]
#> # STL Decomposition: vec = trend + season_year + season_week + remainder
#> day vec trend season_year season_week remainder season_adjust
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2001-01-01 0.161 14.7 -14.6 0.295 -0.193 14.5
#> 2 2001-01-02 2.61 15.6 -14.2 0.0865 1.04 16.7
#> 3 2001-01-03 1.37 16.6 -15.5 0.0365 0.240 16.9
#> 4 2001-01-04 3.15 17.6 -13.0 -0.0680 -1.34 16.3
#> 5 2001-01-05 4.43 18.6 -13.4 -0.0361 -0.700 17.9
#> 6 2001-01-06 7.35 19.5 -12.4 -0.122 0.358 19.9
#> 7 2001-01-07 7.10 20.5 -13.4 -0.181 0.170 20.7
#> 8 2001-01-08 10.0 21.4 -12.7 0.282 1.10 22.5
#> 9 2001-01-09 9.16 22.2 -13.8 0.0773 0.642 22.9
#> 10 2001-01-10 10.2 22.9 -12.7 0.0323 -0.0492 22.9
#> # … with 990 more rows
Created on 2019-11-30 by the reprex package (v0.3.0)
The output is a dable (decomposition table) which behaves like a dataframe most of the time. So you can extract the trend column, or either of the seasonal component columns in the usual way.
