I need to calculate the rolling 14-day average for a large data set. The data set is private, although I can share a small snippet.
The data set comes from an instrument in the field which does not operate every day. For instance, a snippet of the data frame would look like so:
Date, Value
2022-01-28, 196.00000
2022-01-31, 104.00000
2022-02-01, 0.00000
2022-02-02, 98.00000
2022-02-03, 0.00000
2022-02-07, 139.92308
2022-02-08, 114.50000
2022-02-09, 121.64286
2022-02-10, 96.50000
2022-02-11, 151.63636
2022-02-14, 85.87500
2022-02-15, 98.90000
2022-02-18, 209.40000
2022-02-21, 172.18182
2022-02-22, 0.00000
2022-02-23, 0.00000
2022-02-28, 264.00000
2022-03-01, 131.75000
2022-03-03, 119.33333
2022-03-04, 88.80000
2022-03-07, 152.16667
2022-03-08, 24.50000
I currently produce the plot with the following code:
library(zoo)
library(tidyverse)
ggplot(data = df_days, aes(x = Date, y = Value)) +
  geom_line(color = "black", lwd = 0.5) +
  geom_point(lwd = 0.5) +
  geom_line(y = rollmean(df_days$Value, 14, na.pad = TRUE), color = "red", lwd = 0.8)
I realised that I'm actually taking the 14-point average, i.e. the average of 14 data points. Is there a way to take the 14-day average, based upon the dates themselves?
1) Using the input from the question, shown reproducibly in the Note at the end, we calculate the number of points to use at each date, w, and then use rollapplyr with that vector of widths.
library(zoo)
within(DF, {
  # number of dates falling in the 14-day window ending at each row
  w <- seq_along(Date) - findInterval(Date - 14, Date)
  # trailing mean over that variable-width window
  mean14 <- rollapplyr(Value, w, mean)
})
giving the following, where mean14 is the mean and w is the number of points used to calculate it. This is constructed so that if there were no missing dates it would give the same result as rollapplyr(DF$Value, 14, mean, partial = TRUE), but if there are missing dates it uses fewer points, based on the number of dates in the 14-day window. (Note that using a different number of points for each mean can affect the variance.)
Date Value mean14 w
1 2022-01-28 196.0000 196.00000 1
2 2022-01-31 104.0000 150.00000 2
3 2022-02-01 0.0000 100.00000 3
4 2022-02-02 98.0000 99.50000 4
5 2022-02-03 0.0000 79.60000 5
6 2022-02-07 139.9231 89.65385 6
7 2022-02-08 114.5000 93.20330 7
8 2022-02-09 121.6429 96.75824 8
9 2022-02-10 96.5000 96.72955 9
10 2022-02-11 151.6364 91.80026 9
11 2022-02-14 85.8750 89.78637 9
12 2022-02-15 98.9000 100.77526 9
13 2022-02-18 209.4000 127.29716 8
14 2022-02-21 172.1818 131.32951 8
15 2022-02-22 0.0000 117.01700 8
16 2022-02-23 0.0000 101.81165 8
17 2022-02-28 264.0000 124.08030 6
18 2022-03-01 131.7500 129.55530 6
19 2022-03-03 119.3333 128.09502 7
20 2022-03-04 88.8000 110.86645 7
21 2022-03-07 152.1667 108.00714 7
22 2022-03-08 24.5000 111.50714 7
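As a quick sanity check (an illustrative sketch, not part of the original answer; it assumes DF from the Note at the end), w can be verified directly against the dates:
# count the dates falling in the 14-day window ending at each row
w_check <- sapply(seq_along(DF$Date), function(i)
  sum(DF$Date > DF$Date[i] - 14 & DF$Date <= DF$Date[i]))
all(w_check == seq_along(DF$Date) - findInterval(DF$Date - 14, DF$Date))
#> [1] TRUE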
2) Another approach is to add the missing dates, fill in Value for those dates with NA, and then use rollapplyr with a fixed width of 14.
# insert a row (with Value = NA) for every date between the first and last
m <- merge(DF, data.frame(Date = seq(min(DF$Date), max(DF$Date), 1)), all = TRUE)
na.omit(transform(m,
  mean14 = rollapplyr(Value, 14, mean, na.rm = TRUE, partial = TRUE)))
3) A variation of the above is to use zoo objects. Note that fortify.zoo(zz) can be used to create a data frame from a zoo object.
library(zoo)
z <- read.zoo(DF)

# analogous to (1): variable window widths
tt <- time(z)
w <- seq_along(tt) - findInterval(tt - 14, tt)
zz <- rollapplyr(z, w, mean)

# analogous to (2): fill in the missing dates with NA
m <- merge(z, zoo(, seq(start(z), end(z), 1)))
zz <- na.omit(rollapply(m, 14, mean, na.rm = TRUE))
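For instance (a small sketch of the fortify.zoo hint above; the column names are chosen here for illustration), to get back to a data frame suitable for ggplot:
df14 <- fortify.zoo(zz)          # column 1 is the time index, column 2 the series
names(df14) <- c("Date", "mean14")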
Note
Lines <- "Date, Value
2022-01-28, 196.00000
2022-01-31, 104.00000
2022-02-01, 0.00000
2022-02-02, 98.00000
2022-02-03, 0.00000
2022-02-07, 139.92308
2022-02-08, 114.50000
2022-02-09, 121.64286
2022-02-10, 96.50000
2022-02-11, 151.63636
2022-02-14, 85.87500
2022-02-15, 98.90000
2022-02-18, 209.40000
2022-02-21, 172.18182
2022-02-22, 0.00000
2022-02-23, 0.00000
2022-02-28, 264.00000
2022-03-01, 131.75000
2022-03-03, 119.33333
2022-03-04, 88.80000
2022-03-07, 152.16667
2022-03-08, 24.50000"
DF <- read.csv(text = Lines)
DF$Date <- as.Date(DF$Date)
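To tie this back to the plot in the question, here is a minimal sketch (assuming DF above; the layers mirror the question's ggplot code, with the rolling mean mapped inside aes):
library(ggplot2)
library(zoo)

DF$w <- seq_along(DF$Date) - findInterval(DF$Date - 14, DF$Date)
DF$mean14 <- rollapplyr(DF$Value, DF$w, mean)

ggplot(DF, aes(x = Date, y = Value)) +
  geom_line(color = "black", lwd = 0.5) +
  geom_point() +
  geom_line(aes(y = mean14), color = "red", lwd = 0.8)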
There may be more elegant solutions, but you can fill in the missing dates with NA:
df$Date <- as.Date(df$Date)
library(dplyr)
library(tidyr)
df %>% complete(Date = seq(min(Date), max(Date), 1), fill = list(Value = NA))
Output:
# A tibble: 40 × 2
# Date Value
# <date> <dbl>
# 1 2022-01-28 196
# 2 2022-01-29 NA
# 3 2022-01-30 NA
# 4 2022-01-31 104
# 5 2022-02-01 0
# 6 2022-02-02 98
# ...
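Once the dates are consecutive, a 14-row window is a 14-day window, so the question's rolling mean can be computed directly. A sketch building on the code above (note that a window containing only NAs yields NaN):
library(dplyr)
library(tidyr)
library(zoo)

df %>%
  complete(Date = seq(min(Date), max(Date), 1), fill = list(Value = NA)) %>%
  mutate(mean14 = rollapplyr(Value, 14, mean, na.rm = TRUE, partial = TRUE))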
Related
I would like to calculate the mean of every 5 rows in my df. Here is my df:
Time                  value
03/06/2021 06:15:00   NA
03/06/2021 06:16:00   NA
03/06/2021 06:17:00   20
03/06/2021 06:18:00   22
03/06/2021 06:19:00   25
03/06/2021 06:20:00   NA
03/06/2021 06:21:00   31
03/06/2021 06:22:00   23
03/06/2021 06:23:00   19
03/06/2021 06:24:00   25
03/06/2021 06:25:00   34
03/06/2021 06:26:00   42
03/06/2021 06:27:00   NA
03/06/2021 06:28:00   19
03/06/2021 06:29:00   17
03/06/2021 06:30:00   25
I already have a loop that works well to calculate the mean for each package of 5 rows. My problem is with my mean function. The problem is:
- if I set na.rm = FALSE, the mean is NA as soon as there is an NA in a package of 5 values;
- if I set na.rm = TRUE, the result gives averages that are shifted so as to still take 5 values. I would like the NAs not to interfere with the average: when there is an NA in a package of 5 values, the average should be computed over the remaining 4 values.
How can I do this? Thanks for your help!
You can solve your problem by introducing a dummy variable that groups your observations in sets of five and then calculating the mean within each group. Here's an MWE, based on the tidyverse, that assumes your data is in a data.frame named df.
library(tidyverse)
df %>%
  mutate(Group = 1 + floor((row_number() - 1) / 5)) %>%  # rows 1-5 -> 1, 6-10 -> 2, ...
  group_by(Group) %>%
  summarise(Mean = mean(value, na.rm = TRUE), .groups = "drop")
# A tibble: 4 × 2
Group Mean
<dbl> <dbl>
1 1 22.3
2 2 24.5
3 3 28
4 4 25
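The grouping formula simply maps row numbers to consecutive blocks of five; checking the idea outside the pipeline (illustrative only):
rn <- 1:16
1 + floor((rn - 1) / 5)
#> [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4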
A solution based on purrr::map_dfr (note that, as written, this computes a sliding window of six rows rather than disjoint packages of five):
library(purrr)
df <- data.frame(
stringsAsFactors = FALSE,
time = c("03/06/2021 06:15:00","03/06/2021 06:16:00",
"03/06/2021 06:17:00",
"03/06/2021 06:18:00","03/06/2021 06:19:00",
"03/06/2021 06:20:00","03/06/2021 06:21:00",
"03/06/2021 06:22:00","03/06/2021 06:23:00",
"03/06/2021 06:24:00","03/06/2021 06:25:00",
"03/06/2021 06:26:00",
"03/06/2021 06:27:00","03/06/2021 06:28:00",
"03/06/2021 06:29:00","03/06/2021 06:30:00"),
value = c(NA,NA,20L,22L,
25L,NA,31L,23L,19L,25L,34L,42L,NA,19L,17L,
25L)
)
map_dfr(1:(nrow(df) - 5),
        ~ data.frame(Group = .x, Mean = mean(df$value[.x:(.x + 5)], na.rm = TRUE)))
#> Group Mean
#> 1 1 22.33333
#> 2 2 24.50000
#> 3 3 24.20000
#> 4 4 24.00000
#> 5 5 24.60000
#> 6 6 26.40000
#> 7 7 29.00000
#> 8 8 28.60000
#> 9 9 27.80000
#> 10 10 27.40000
#> 11 11 27.40000
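If disjoint packages of five are wanted instead of a sliding window, a variation on the same idea (a sketch using the df above; the last package may hold fewer than five rows) reproduces the grouped result shown earlier:
library(purrr)

starts <- seq(1, nrow(df), by = 5)
map_dfr(seq_along(starts), function(i) {
  ix <- starts[i]:min(starts[i] + 4, nrow(df))
  data.frame(Group = i, Mean = mean(df$value[ix], na.rm = TRUE))
})
#>   Group     Mean
#> 1     1 22.33333
#> 2     2 24.50000
#> 3     3 28.00000
#> 4     4 25.00000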
If you want to take the average of every 5 minutes, you may use lubridate's floor_date/ceiling_date functions to round the times.
library(dplyr)
library(lubridate)
df %>%
  mutate(time = mdy_hms(time),
         time = floor_date(time, '5 mins')) %>%
  group_by(time) %>%
  summarise(value = mean(value, na.rm = TRUE))
# time value
# <dttm> <dbl>
#1 2021-03-06 06:15:00 22.3
#2 2021-03-06 06:20:00 24.5
#3 2021-03-06 06:25:00 28
#4 2021-03-06 06:30:00 25
I have created a ggplot with a date x axis, but I would like to show the values from another variable instead of the dates.
library(tidyverse)
library(lubridate)
df <- read_rds("https://github.com/johnsnow09/covid19-df_stack-code/blob/main/vaccine_milestones.rds?raw=true")
df
Updated.On cr_bin days_to_next_10cr_vacc
<date> <fct> <drtn>
1 2021-04-11 10 Cr 85 days
2 2021-05-27 20 Cr 46 days
3 2021-06-24 30 Cr 28 days
4 2021-07-18 40 Cr 24 days
5 2021-08-06 50 Cr 19 days
6 2021-08-25 60 Cr 19 days
7 2021-09-07 70 Cr 13 days
8 2021-09-18 80 Cr 11 days
9 2021-10-02 90 Cr 14 days
df %>%
  ggplot(aes(x = Updated.On, y = days_to_next_10cr_vacc)) +
  geom_col() +
  scale_x_date(aes(labels = cr_bin))
Also tried: scale_x_date(aes(labels = c("10","20","30","40","50","60","70","80","90")))
In the plot I would like the x axis to display the values from cr_bin instead of dates: 10 Cr, 20 Cr, 30 Cr ... up to 90 Cr.
I have tried the above code but I am not sure what else to use in place of labels to get the desired result.
You need to set breaks along with the labels. I'm using unique, just in case there might be duplicate rows.
Also note the conversion of difftime to integer.
library(tidyverse)
library(lubridate)
df <- read_rds("https://github.com/johnsnow09/covid19-df_stack-code/blob/main/vaccine_milestones.rds?raw=true")
df %>%
  ggplot(aes(x = Updated.On, y = as.integer(days_to_next_10cr_vacc))) +
  geom_col() +
  scale_x_date(breaks = unique(df$Updated.On), labels = unique(df$cr_bin))
Created on 2021-10-21 by the reprex package (v2.0.1)
I am trying to fit linear models to a time series where each regression begins at midnight and uses all data until 06:00 the following morning (covering a total of 30 hours). I want to do this for every day in the time series, and it also needs to be applied by a grouping factor. What I ultimately need is the regression coefficients added to the data frame for the day on which the regression started. I am familiar with rolling and window regressions and with applying functions across groups using dplyr. Where I am struggling is how to code that the regression must start at midnight each day. If I were to use a window function, after the first day it would be shifted ahead six hours from midnight, and I am not sure how to shift the window back. It seems I need to specify a window size and a lag/lead at each iteration, but I can't visualize how to implement that. Any insight is appreciated.
Here is some sample data. I would like to model dv ~ datetime, by = grp:
df <- dplyr::arrange(
  data.frame(datetime = seq(as.POSIXct("2020-09-19 00:00:00"),
                            as.POSIXct("2020-09-30 00:00:00"), "hour"),
             grp = rep(c('a', 'b', 'c'), 265),
             dv = rnorm(795)),
  grp, datetime)
We assume that we want each regression to cover 30 rows (except for any stub at the end) and that we should move forward by 24 hours for each regression so that there is one regression per date within grp.
library(dplyr)
library(zoo)

ans <- df %>%
  group_by(grp) %>%
  group_modify(~ {
    # regress over windows of 30 rows (30 hours), starting a new window every 24 rows
    r <- rollapplyr(1:nrow(.), 30, by = 24,
                    function(ix) coef(lm(dv ~ datetime, ., subset = ix)),
                    align = "left", partial = TRUE)
    data.frame(date = head(unique(as.Date(.$datetime)), nrow(r)),
               coef1 = r[, 1], coef2 = r[, 2])
  }) %>%
  ungroup
giving:
> ans
# A tibble: 36 x 4
grp date coef1 coef2
<chr> <date> <dbl> <dbl>
1 a 2020-09-19 -7698. 0.00000481
2 a 2020-09-20 -2048. 0.00000128
3 a 2020-09-21 -82.0 0.0000000514
4 a 2020-09-22 963. -0.000000602
5 a 2020-09-23 2323. -0.00000145
6 a 2020-09-24 5886. -0.00000368
7 a 2020-09-25 7212. -0.00000450
8 a 2020-09-26 -17448. 0.0000109
9 a 2020-09-27 1704. -0.00000106
10 a 2020-09-28 15731. -0.00000982
# ... with 26 more rows
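To see how the 30-row windows stepped by 24 line up with calendar dates, here is an illustrative check for one group of 265 hourly rows (not part of the original answer):
starts <- seq(1, 265, by = 24)                 # one window start per date
data.frame(start_row = starts,
           end_row   = pmin(starts + 29, 265), # 30 rows, except the stub at the end
           date      = as.Date("2020-09-19") + seq_along(starts) - 1)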
old
After re-reading the question I replaced this with the above.
Within each group, create g, which groups the values since the last 6 am row, and let width be the number of rows since the most recent 6 am row. Then run rollapplyr using the width vector to define the widths to regress over.
library(dplyr)
library(zoo)
ans <- df %>%
  group_by(grp) %>%
  group_modify(~ {
    # g increments at every 06:00 row; width counts the rows since that row
    g <- cumsum(format(.$datetime, "%H") == "06")
    width <- 1:nrow(.) - match(g, g) + 1
    r <- rollapplyr(1:nrow(.), width,
                    function(ix) coef(lm(dv ~ datetime, ., subset = ix)),
                    partial = TRUE, fill = NA)
    mutate(., coef1 = r[, 1], coef2 = r[, 2])
  }) %>%
  ungroup
giving:
> ans
# A tibble: 795 x 5
grp datetime dv coef1 coef2
<chr> <dttm> <dbl> <dbl> <dbl>
1 a 2020-09-19 00:00:00 -0.560 -0.560 NA
2 a 2020-09-19 01:00:00 -0.506 -24071. 0.0000150
3 a 2020-09-19 02:00:00 -1.76 265870. -0.000166
4 a 2020-09-19 03:00:00 0.0705 -28577. 0.0000179
5 a 2020-09-19 04:00:00 1.95 -248499. 0.000155
6 a 2020-09-19 05:00:00 0.845 -205918. 0.000129
7 a 2020-09-19 06:00:00 0.461 0.461 NA
8 a 2020-09-19 07:00:00 0.359 45375. -0.0000284
9 a 2020-09-19 08:00:00 -1.40 412619. -0.000258
10 a 2020-09-19 09:00:00 -0.446 198902. -0.000124
# ... with 785 more rows
Note
Input used
set.seed(123)
df <- dplyr::arrange(
  data.frame(datetime = seq(as.POSIXct("2020-09-19 00:00:00"),
                            as.POSIXct("2020-09-30 00:00:00"), "hour"),
             grp = rep(c('a', 'b', 'c'), 265),
             dv = rnorm(795)),
  grp, datetime)
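To see how the variable widths behave, an illustrative trace on one group of the input above: width restarts at 1 on every 06:00 row and otherwise counts the rows since the most recent one.
sub <- df[df$grp == "a", ]
g <- cumsum(format(sub$datetime, "%H") == "06")
width <- seq_along(g) - match(g, g) + 1
head(data.frame(hour = format(sub$datetime, "%H:%M"), width), 8)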
I need to accumulate the time across consecutive records until the time difference between two consecutive records is greater than 13 seconds.
For example, in the data frame created with the code shown below, the column test holds the time difference between each date and the next. What I need are time "events" delimited by the rows where test > 13 seconds.
# Create a vector of dates with a random time difference in seconds between records
dates <- seq(as.POSIXct("2020-01-01 00:00:02"), as.POSIXct("2020-01-02 00:00:02"), by = "2 sec")
dates <- dates + sample(15, length(dates), replace = T)
# Create a data.frame
data <- data.frame(id = 1:length(dates), dates = dates)
# Create a test field with the time difference between each date and the next
data$test <- c(diff(data$dates, lag = 1), 0)
# Delete the zero and negative time
data <- data[data$test > 0, ]
head(data)
What I want is something like this:
To get to your desired result we need to define 'blocks' of observations. A new block starts wherever test is greater than 13.
We start by identifying the split points, and then, using the rle function, we assign an ID to each block.
Then we can filter out the split points and summarize the remaining blocks: once with the sum of seconds, once with the minimum of the event dates.
split_point <- data$test <= 13
# Find continuous blocks
block_str <- rle(split_point)
# Create block IDs
data$block <- rep(seq_along(block_str$lengths), block_str$lengths)
data <- data[split_point, ] # Remove split points
# Summarize: total seconds per block and the block's start time
final_df <- aggregate(test ~ block, data = data, FUN = sum)
dtevent <- aggregate(dates ~ block, data = data, FUN = min)
# Join the two summaries
final_df$DatetimeEvent <- dtevent$dates
head(final_df)
#> block test DatetimeEvent
#> 1 1 101 2020-01-01 00:00:09
#> 2 3 105 2020-01-01 00:01:11
#> 3 5 277 2020-01-01 00:02:26
#> 4 7 46 2020-01-01 00:04:58
#> 5 9 27 2020-01-01 00:05:30
#> 6 11 194 2020-01-01 00:05:44
Created on 2020-04-02 by the reprex package (v0.3.0)
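As an aside, how rle turns the split points into block IDs is easiest to see on a toy vector (illustrative, independent of the random data):
x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
r <- rle(x)                                 # runs: TRUE x2, FALSE x1, TRUE x1, FALSE x2
rep(seq_along(r$lengths), r$lengths)
#> [1] 1 1 2 3 4 4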
Using dplyr for convenience's sake:
library(dplyr)
final_df <- data %>%
  mutate(split_point = test <= 13,
         block = with(rle(split_point), rep(seq_along(lengths), lengths))) %>%
  group_by(block) %>%
  filter(split_point) %>%
  summarise(DateTimeEvent = min(dates), TotalTime = sum(test))
final_df
#> # A tibble: 1,110 x 3
#> block DateTimeEvent TotalTime
#> <int> <dttm> <drtn>
#> 1 1 2020-01-01 00:00:06 260 secs
#> 2 3 2020-01-01 00:02:28 170 secs
#> 3 5 2020-01-01 00:04:11 528 secs
#> 4 7 2020-01-01 00:09:07 89 secs
#> 5 9 2020-01-01 00:10:07 37 secs
#> 6 11 2020-01-01 00:10:39 135 secs
#> 7 13 2020-01-01 00:11:56 50 secs
#> 8 15 2020-01-01 00:12:32 124 secs
#> 9 17 2020-01-01 00:13:52 98 secs
#> 10 19 2020-01-01 00:14:47 83 secs
#> # … with 1,100 more rows
Created on 2020-04-02 by the reprex package (v0.3.0)
(results are different because reprex recreates the data each time)
I have a df like the following, spanning roughly 30 years up to 2015. I want to cut every month into three intervals, days 1-10, 11-20, and 21-31, and average the ten (or fewer) values in each, so that each month yields three data points. How can I do it?
1993-01-29 28.92189
1993-02-01 29.12760
1993-02-02 29.18927
1993-02-03 29.49786
1993-02-04 29.62128
1993-02-05 29.60068
1993-02-08 29.60068
1993-02-09 29.39498
------
------
2015-08-18 209.92999
2015-08-19 208.28000
2015-08-20 204.01000
2015-08-21 197.63001
2015-08-24 189.55000
2015-08-25 187.23000
2015-08-26 194.67999
2015-08-27 199.16000
2015-08-28 199.24000
The tryCatch is there to handle the problem at the start of the data, where the first month may have too few observations for the fixed indices. I will provide more info when I have time.
library(xts)
dates <- seq(as.Date("1993-01-29"), as.Date("2015-08-25"), "days")
sample <- rnorm(length(dates))
# split the daily series into one xts object per month
tmpxts <- split.xts(xts(x = sample, order.by = dates), f = "months")
mxts <- lapply(tmpxts, function(x) {
  # mean of days 1-10, 11-20 and 21-end; fall back to a single mean for short months
  tmp <- data.frame(val = tryCatch(c(mean(x[1:10]), mean(x[11:20]), mean(x[21:length(x)])),
                                   error = function(e) matrix(mean(x), 1)))
  row.names(tmp) <- tryCatch(index(x[c(1, 11, 21)]), error = function(e) index(x[1]))
  tmp
})
do.call(rbind, mxts)
This is a base solution that builds cuts from an increasing sequence that cycles through years, months, and your cut points at the 1st, 11th, and 21st of the month. The default for the base cut functions is to include the breaks as the right side of intervals, but your specification requires cuts at 1, 11, and 21 (to leave 10 and 20 in the lower intervals), so I used right=TRUE:
tapply(dat$V2,
       cut.Date(dat$V1,
                breaks = as.Date(
                  apply(expand.grid(c(1, 11, 21), 1:12, 1993:2015), 1,
                        function(x) paste(rev(x), collapse = "-"))),
                right = TRUE),
       FUN = mean)
1993-01-01 1993-01-11 1993-01-21 1993-02-01 1993-02-11 1993-02-21 1993-03-01
NA NA 29.02475 29.48412 NA NA NA
[... many empty intervals snipped ...]
And the bottom of results included:
2015-07-21 2015-08-01 2015-08-11 2015-08-21 2015-09-01 2015-09-11 2015-09-21
NA NA 204.96250 193.97200 NA NA NA
2015-10-01 2015-10-11 2015-10-21 2015-11-01 2015-11-11 2015-11-21 2015-12-01
NA NA NA NA NA NA NA
2015-12-11
NA
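To inspect the break dates that the expand.grid call generates (an illustrative peek, using the same construction as above):
breaks <- as.Date(apply(expand.grid(c(1, 11, 21), 1:12, 1993:2015), 1,
                        function(x) paste(rev(x), collapse = "-")))
head(breaks)
#> [1] "1993-01-01" "1993-01-11" "1993-01-21" "1993-02-01" "1993-02-11" "1993-02-21"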
The code below cuts each month separately into thirds, based on the number of days in each month.
library(dplyr)
library(lubridate)
library(ggplot2)
# Fake data
df = data.frame(date=seq.Date(as.Date("2013-01-01"),
as.Date("2013-03-31"), by="day"))
set.seed(394)
df$value = rnorm(nrow(df), sqrt(1:nrow(df)), 2)
# Cut months into thirds
df = df %>%
# Create a new column to group by Year-Month
mutate(yr_mon = paste0(year(date) , "_", month(date, label=TRUE, abbr=TRUE))) %>%
group_by(yr_mon) %>%
# Cut each month into thirds
mutate(cutMonth = cut(day(date),
breaks=c(0, round(1/3*n()), round(2/3*n()), n()),
labels=c("1st third","2nd third","3rd third")),
# Add yr_mon to cutMonth so that we have a unique group label for
# each third of each month
cutMonth = paste0(yr_mon, "\n", cutMonth)) %>%
ungroup() %>%
# Turn cutMonth into a factor with correct date ordering
mutate(cutMonth = factor(cutMonth, levels=unique(cutMonth)))
And here is the result:
# Show number of observations in each group
as.data.frame(table(df$cutMonth))
Var1 Freq
1 2013_Jan\n1st third 10
2 2013_Jan\n2nd third 11
3 2013_Jan\n3rd third 10
4 2013_Feb\n1st third 9
5 2013_Feb\n2nd third 10
6 2013_Feb\n3rd third 9
7 2013_Mar\n1st third 10
8 2013_Mar\n2nd third 11
9 2013_Mar\n3rd third 10
# Plot means by group (just to visualize the result of the date grouping operations)
ggplot(df, aes(cutMonth, value)) +
stat_summary(fun.y=mean, geom='point', size=4, colour="red") +
coord_cartesian(ylim=c(-0.2,10.2)) +
theme(axis.text.x = element_text(size=14))