Compute average over 20 second intervals and group by another column - r

I'm working with a large dataset of different variables collected during the dives of elephant seals. I would like to analyze my data on a fine scale (20-second intervals). I want to bin my data into 20-second intervals; essentially I just want the mean of every 20 seconds of data, so I can run further analysis on these intervals. However, I need to group my data by dive number so that I'm not binning information from separate dives.
There are three methods I've tried so far:
period.apply(), but I cannot group with this function.
split() to subset my data by dive number, but I can't seem to find a way to then calculate the mean of different columns over 20-second intervals within these subsets.
The openair package, using timeAverage(), but I continue to get an error (see code below).
Below is what the data looks like, and the code I've tried. I would like the means of Depth, MSA, rate_s, and HR for each 20-second window, grouped by diveNum and, ideally, also D_phase.
> head(seal_dives)
datetime seal_ID Depth MSA D_phase diveNum rate_s HR
1 2018-04-06 14:47:51 Congaree 4.5 0.20154042 D 1 NA 115.3846
2 2018-04-06 14:47:51 Congaree 4.5 0.20154042 D 1 NA 117.6471
3 2018-04-06 14:47:52 Congaree 4.5 0.11496760 D 1 NA 115.3846
4 2018-04-06 14:47:52 Congaree 4.5 0.11496760 D 1 NA 122.4490
5 2018-04-06 14:47:53 Congaree 4.5 0.05935992 D 1 NA 113.2075
6 2018-04-06 14:47:53 Congaree 4.5 0.05935992 D 1 NA 113.2075
#openair package using timeaverage, results in error message
> library(openair)
> seal_20<-timeAverage(
seal_dives,
avg.time = "20 sec",
data.thresh = 0,
statistic = "mean",
type = c("diveNum","D_phase"),
percentile = NA,
start.date = NA,
end.date = NA,
vector.ws = FALSE,
fill = FALSE
)
Error in checkPrep(mydata, vars, type = "default", remove.calm = FALSE, :
Can't find the variable(s) date
#converting to a time series and using period.apply(), but I can't find a way to group by dive number, or to use split() and then convert to a time series.
#create a time series data class from our data frame
> seal_dives$datetime<-as.POSIXct(seal_dives$datetime,tz="GMT")
> seal_xts <- xts(seal_dives, order.by=seal_dives[,1])
> seal_20<-period.apply(seal_xts$Depth, endpoints(seal_xts$datetime, "seconds", 20), mean)
#split data by dive # but don't know how to do averages over 20 seconds
> seal_split<-split(seal_dives, seal_dives$diveNum)
Maybe there is a magical way to do this that I haven't found on the internet yet, or maybe I'm just doing something wrong in one of my methods.

You can use the floor_date() function from lubridate to bin the data into 20-second intervals. Group by that along with diveNum and D_phase, then take the mean of the other columns using across().
library(dplyr)
library(lubridate)
result <- seal_dives %>%
group_by(diveNum, D_phase, datetime = floor_date(datetime, '20 sec')) %>%
summarise(across(c(Depth, MSA, rate_s, HR), mean, na.rm = TRUE), .groups = 'drop')
result
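As an aside, the openair error above occurs because timeAverage() expects the timestamp column to be named date, so renaming datetime to date before that call should clear it. Note also that recent dplyr versions (1.1+) prefer an anonymous function inside across() to passing na.rm = TRUE through the dots; a minimal sketch of the same pipeline in that style, assuming datetime is already POSIXct:
library(dplyr)
library(lubridate)
seal_20 <- seal_dives %>%
group_by(diveNum, D_phase, interval = floor_date(datetime, '20 sec')) %>%
summarise(across(c(Depth, MSA, rate_s, HR), \(x) mean(x, na.rm = TRUE)), .groups = 'drop')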

Related

Simple Moving Average Column-Wise in R

So I clean revenue data every quarter, and I need to do a two-quarter moving average to predict the next five years of quarterly revenue for each individual product (I know this will just end up being the same average for now). Attached here is the data frame: Revenue Df
Right now I have the data in wide format. I created the empty forecasting columns by having the user enter a start and end date for the forecast; the code then creates a column for every quarter between them. How can I fill these forecast columns using a moving average? I also converted the data to long format, and still could not figure out how to fill the forecast. Also, I know 9-30-2020 shows up in the forecast; we want to replace that with the actuals even if the user inputs that date for the forecast.
for(i in ncol(Revenue_df)){
if(i<3)
{Revenue_df[,i]<- Revenue_df[,i]}
else{
Revenue_df[,i]<-(Revenue_df[,i-1]+Revenue_df[,i-2])/2
}
}
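As an aside, that loop runs only once because ncol(Revenue_df) is a single number, not a sequence, and even looping from column 3 would pull the character Product column into the arithmetic. A sketch of the intended loop, assuming (hypothetically) that the forecast columns start at position 5, after Product and the three actuals:
for (i in 5:ncol(Revenue_df)) {
# two-quarter moving average of the two columns to the left
Revenue_df[, i] <- (Revenue_df[, i - 1] + Revenue_df[, i - 2]) / 2
}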
Product<- c("a","b","c","d","e")
Revenue.3_30_2020<- c(50,40,30,20,10)
Revenue.6_30_2020<- c(50,45,28,19,17)
Revenue.9_30_2020<- c(25,20,22,17,24)
revenue<- data.frame(Product,Revenue.3_30_2020,Revenue.6_30_2020,Revenue.9_30_2020)
forecast.sequence<- c("2020-09-30","2020-12-31","2021-03-31","2021-06-30","2021-09-30","2021-12-31","2022-03-31",
"2022-06-30","2022-09-30","2022-12-31","2023-03-31","2023-06-30","2023-09-30","2023-12-31","2024-03-31",
"2024-06-30","2024-09-30","2024-12-31")
forecast.sequence.amount<- paste("FC.Amount.",forecast.sequence)
revenue[,forecast.sequence.amount]<-NA
I tried this code and it did not work; any suggestions? Also attached is the code for the sample data frame shown in the picture. Sorry for the bad format, this is my second time asking a question on here.
This seems a bit simple for a product forecast. You might want to look at the forecast and fable packages for forecast functions that can account for trends and seasonality; these would, however, require more than two data points. Anyway, taking your problem as given, the following code seems to do what you describe.
EDIT
I've made the forecast calculation a function to make it more straightforward to use.
library(tidyverse)
product<- c("a","b","c","d","e")
Revenue.3_30_2020<- c(50,40,30,20,10)
Revenue.6_30_2020<- c(50,45,28,19,17)
Revenue.9_30_2020<- c(25,20,22,17,24)
revenue<- data.frame( Product = product, Revenue.3_30_2020,Revenue.6_30_2020,Revenue.9_30_2020)
rev_frcst <- function(revenue, frcst_end, frcst_prefix) {
#
# Arguments:
# revenue = data frame with
# Product containing product name
# columns with the format "prefix.m_day_year" containing product quantities for past quarters
# frcst_end = end date for quarterly forecast
# frcst_prefix = string containing prefix for forecast
#
# convert revenue to long format
#
rev_long <- revenue %>% pivot_longer(cols = -Product, names_to = "Quarter", values_to = "Revenue") %>%
mutate(quarter_end = as.Date(str_remove(Quarter,"Revenue."), "%m_%d_%Y"))
num_revenue <- nrow(rev_long)/length(unique(revenue$Product)) # quarters of actuals per product
#
# generate forecast dates
#
forecast.sequence <- seq( max(rev_long$quarter_end),
as.Date(frcst_end),
by = "quarter")[-1]
#
# Add forecast rows to data
#
rev_long <- rev_long %>%
bind_rows(expand_grid(Product=unique(revenue$Product), quarter_end = forecast.sequence) %>%
mutate(Quarter = paste(frcst_prefix, quarter_end)))
#
# Define moving average function
#
mov_avg <- function(num_frcst, x) {
# x holds the last two actuals; append zeros as placeholders for the forecasts
y <- c(x, numeric(num_frcst))
# note the precedence: 1:num_frcst + 2 means 3:(num_frcst + 2)
for(i in 1:num_frcst + 2) {
y[i] <- .5*(y[i-1] + y[i-2]) }
# return only the forecast positions
y[1:num_frcst + 2]
}
#
# Calculate forecast
#
rev_long_2 <- rev_long %>% group_by(Product) %>%
mutate(forecast = c(Revenue[1:num_revenue],
mov_avg(num_frcst =length(forecast.sequence),
x = Revenue[1:2 + num_revenue - 2]))) %>%
arrange(Product, quarter_end)
}
#
# call rev_frcst to calculate forecast
#
rev_forecast <- rev_frcst(revenue=revenue,
frcst_end = "2024-12-31",
frcst_prefix = "FC.Amount.")
which gives
Product Quarter Revenue quarter_end forecast
<chr> <chr> <dbl> <date> <dbl>
1 a Revenue.3_30_2020 50 2020-03-30 50
2 a Revenue.6_30_2020 50 2020-06-30 50
3 a Revenue.9_30_2020 25 2020-09-30 25
4 a FC.Amount. 2020-12-30 NA 2020-12-30 37.5
5 a FC.Amount. 2021-03-30 NA 2021-03-30 31.2
6 a FC.Amount. 2021-06-30 NA 2021-06-30 34.4
7 a FC.Amount. 2021-09-30 NA 2021-09-30 32.8
8 a FC.Amount. 2021-12-30 NA 2021-12-30 33.6
9 a FC.Amount. 2022-03-30 NA 2022-03-30 33.2
10 a FC.Amount. 2022-06-30 NA 2022-06-30 33.4
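If you need the result back in the original wide layout, a minimal sketch with tidyr's pivot_wider() (the column choice here is illustrative; tidyr and dplyr are already attached via library(tidyverse) above):
rev_wide <- rev_forecast %>%
select(Product, quarter_end, forecast) %>%
pivot_wider(names_from = quarter_end, values_from = forecast)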

R filtering/selecting data by POSIXct time and a condition

I have measured temperature at a high time resolution of 10 minutes on different urban tree species, whose reactions are to be compared, so I am particularly researching periods of heat. The task that I fail to do on my dataset is to choose complete days based on a maximum value: e.g., days where there is at least one measurement above 30 °C should be subset from my data frame in their entirety.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether they should pick a day or not when producing a new data frame. Whenever the value is above 30 °C at any time of a day, I want to include that day, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a data frame analogous to the structure of my data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame where all the days with at least one entry marked "heat" are taken?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how could I tell it to take all of the day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions on a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable which specifies the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 contains "heat" at least once:
library(dplyr)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x){
rr <- Measurings %>% filter(Time2 == x) # subset for date x
# keep this day's rows only if heat30 contains "heat" at least once
if(any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows()
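For reference, the same day-level selection can be written as a grouped filter; a minimal sketch, assuming dplyr is attached (as.Date is given the data's time zone to avoid day shifts):
heat_days <- Measurings %>%
group_by(Day = as.Date(Time, tz = "CET")) %>%
filter(any(heat30 == "heat")) %>%
ungroup()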
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example, as every day will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset, but the code should do the job on the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
time_df <- as_data_frame(str_split(Measurings$Time, " ", simplify = T))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 observations varying between 20 and 35 for temperature across roughly 40 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
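On that note, a one-line sketch: run set.seed() before building the example data so the sample() draws are reproducible:
set.seed(42) # any fixed integer works; reruns now produce identical data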

Create 10,000 date data.frames with fake years based on 365 days window

Here my time period range:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'
I need to create 10,000 data.frames with different fake years of 365 days each. This means that each of the 10,000 data.frames needs a different start and end of year.
In total df has 14,965 days which, divided by 365 days, gives 41 years. In other words, df needs to be grouped 10,000 different ways into 41 years (of 365 days each).
The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc., and the remaining dates at the end of df need to be recycled back to the starting one.
The grouped fake years need to appear in a third column of the data.frames.
I would put all the data.frames into a list, but I don't know how to create the function which generates 10,000 different year-start dates and subsequently groups each data.frame with a 365-day window 41 times.
Can anyone help me?
@gringer gave a good answer, but it solved only 90% of the problem:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
What I need is 10,000 columns with 14,965 rows made of dates taken from df, which need to be recycled on reaching the end of df.
I tried changing length.out = 14965, but R does not recycle the dates.
Another option could be to change length.out = 1 and then add the remaining df rows to each column while maintaining the same order:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=1, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
How can I add the remaining df rows to each col?
The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:
> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"
When used in combination with replicate and sample, I think this will give what you want in a list:
> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
[1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
[6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"
[[2]]
[1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
[6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"
Without the simplify=FALSE argument, it produces an array of integers (i.e. R's internal representation of dates), which is a bit trickier to convert back to dates. A slightly more convoluted way to do this that produces Date output is to use data.frame on the unsimplified replicate result. Here's an example that will produce a 10,000-column data frame with 365 dates in each column (takes about 5 s to generate on my computer):
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
1 2 3 4 5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:
df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";
This makes the downstream code easier; the alternative would be a double seq for each result column, with additional calculations for the start/end and if statements to deal with boundary cases.
Now, instead of doing date arithmetic, the result columns subset from the original data frame (where the arithmetic is already done), starting with one date in the first half of the frame and choosing the next 14,965 values. I'm using nrow(df)/2 rather than a hard-coded length to keep the code generic:
dates.df <-
as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
function(startPos){
df$date[startPos:(startPos+nrow(df)/2-1)];
}));
colnames(dates.df) <- 1:10000;
> dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
1 2 3 4 5
1 1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2 1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3 1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4 1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5 1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27
This takes a bit less time now, presumably because the date values have been pre-calculated.
Try this one, using subsetting instead:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
date_vec <- seq.Date(from=start_day, to=end_day, by="day")
Now, I create a vector long enough so that I can use easy subsetting later on:
date_vec2 <- rep(date_vec,2)
Now, create the random start dates for 100 instances (replace this with 10000 for your application):
random_starts <- sample(1:14965, 100)
Now, create a list of dates by simply subsetting date_vec2 with your desired length:
dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100
date_df[1:5,1:5]
1 2 3 4 5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
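As a variation, the doubled vector can be avoided with modular indexing; a minimal sketch, where n is the number of days in date_vec:
n <- length(date_vec)
dates <- lapply(random_starts, function(x) date_vec[((x:(x + n - 1) - 1) %% n) + 1])
This wraps each 14,965-day sequence back to the start of date_vec without allocating a second copy.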

Rolling average by time period rather than observation in R

I have a dataset with dates occurring randomly. For example:
10/21/15, 11/21/15, 11/22/15, 11/28/15,11/30/15, 12/12/15...etc
I am looking to create a rolling average by time period, NOT at the observation level. For instance, for a moving average of the last 7 days, I would not want to look at the last 7 rows, but rather the last 7 days.
For a tiny example:
dates = c('2015-08-07', '2015-08-08','2015-08-09','2015-09-09','2015-10-10')
value = c(5,10,5,3,2)
df=data.frame(dates, value)
df$desired = c(NA,5,7.5, NA,NA)
I am obviously looking to do this for a much larger dataset, but I hope you get the idea. If I were to use 7 days, for example, this is the result I would expect.
Notice that I don't include the current observation's value in the rolling average, only the previous ones. I want a rolling average by time period, not by observation row number.
I tried looking at rollmean and dplyr but I couldn't figure it out. I don't really care how it happens, though.
Thanks!
try this:
rollavgbyperiod <- function(i, window){
# mean of the values whose dates fall within `window` days before dates[i]
startdate <- dates[i] - window
enddate <- dates[i] - 1 # exclude the current observation
interval <- seq(startdate, enddate, 1)
tmp <- value[dates %in% interval]
return(mean(tmp))
}
dates <- as.Date(dates)
window <- 7
res <- sapply(1:length(dates),function(m) rollavgbyperiod(m,window))
res[is.nan(res)] <- NA
> data.frame(dates,value,res)
dates value res
1 2015-08-07 5 NA
2 2015-08-08 10 5.0
3 2015-08-09 5 7.5
4 2015-09-09 3 NA
5 2015-10-10 2 NA
I suggest the runner package in this case. What is needed here is mean_run() with a k = 7 window, lagged by 1 period. A simple one-liner:
library(runner)
dates = c('2015-08-07', '2015-08-08','2015-08-09','2015-09-09','2015-10-10')
value = c(5, 10, 5, 3, 2)
mean_run(x = value, k = 7, lag = 1, idx = as.Date(dates))
#[1] NA 5.0 7.5 NA NA
Check the package and function documentation for details.
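A minimal sketch of attaching the result as a new column, reusing the example's vectors (the data frame and column name are illustrative):
library(dplyr)
library(runner)
df <- data.frame(dates = as.Date(dates), value = value)
df <- df %>% mutate(roll7 = mean_run(value, k = 7, lag = 1, idx = dates))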

Interpolating monthly time series to daily [duplicate]

I have a data.frame that contains quarterly observations. I now want to interpolate monthly values (cubic preferred; linear is fine). The intermediate goal is to create a data.frame with DATE as the index and missing values for all the monthly observations.
Googling suggested that I should create an empty data.frame for the whole time range and then merge it, but whatever I have tried so far gave me errors. Here's my procedure; since I'm new to R, I'm open to any suggestions for changes.
> str(ger)
'data.frame': 93 obs. of 2 variables:
$ DATE : Date, format: "1991-01-01" "1991-04-01" "1991-07-01" "1991-10-01" ...
$ VALUE: num 470780 468834 466332 472949 480359 ...
> head(ger)
DATE VALUE
1 1991-01-01 470780.3
2 1991-04-01 468834.0
3 1991-07-01 466331.6
4 1991-10-01 472949.0
5 1992-01-01 480359.2
6 1992-04-01 476744.5
emptyIndex <- seq(ger[1, 'DATE'], tail(ger[, 'DATE'], 1), by='1 month')
gerMonthly <- data.frame(DATE = emptyIndex, VALUE = NA)
merge(ger, gerMonthly, by='DATE', all.y = T)
This is the closest I got, but it gives me an undesired column format; surely there is a cleaner way to get what I want? Finally, given that format, what would be the cleanest way to get the interpolated time series?
DATE VALUE.x VALUE.y
1 1991-01-01 470780.3 NA
2 1991-02-01 NA NA
3 1991-03-01 NA NA
4 1991-04-01 468834.0 NA
5 1991-05-01 NA NA
6 1991-06-01 NA NA
I'm not quite clear on your comment about the undesired column format, but if you're trying to get the interpolated values using cubic interpolation, you might consider something like the code below.
ger <- data.frame(DATE = as.Date(c("1991-01-01", "1991-04-01", "1991-07-01", "1991-10-01", "1992-01-01")),
VALUE = c(470780, 468834, 466332, 472949, 480359))
DateSeq <- seq(ger$DATE[1],tail(ger$DATE,1),by="1 month")
gerMonthly <- data.frame(DATE=DateSeq, Interp.Value=spline(ger, method="natural", xout=DateSeq)$y)
merge(ger, gerMonthly, by='DATE', all.y = T)
The DATE column needs to be in Date format so the interpolation can work with numeric values.
I've usually used "natural" cubic splines but other options are available.
This format shows both the input values and the results so that you can check that the interpolation looks reasonable but you can use gerMonthly if you just want the interpolated results.
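If linear interpolation is enough, the same pattern works with base R's approx() in place of spline(); a minimal sketch:
gerLinear <- data.frame(DATE = DateSeq,
Interp.Value = approx(ger$DATE, ger$VALUE, xout = DateSeq)$y)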
Reading your code has really helped me. To obtain the interpolated values, I did this:
library(zoo) # provides na.approx() and na.spline()
# the last line of your code, named for convenience
merged_data <- merge(ger, gerMonthly, by = 'DATE', all.y = T)
# declare the desired variable as a monthly time series
# (adjust start, end, and frequency to your data's range)
monthly_data <- ts(merged_data$VALUE.x, start = c(1991, 1), end = c(1998, 10), frequency = 12)
# interpolate: na.approx linearly interpolates NA values in a time series,
# na.spline uses cubic spline interpolation
na.approx(monthly_data) # or:
na.spline(monthly_data)
