How to recursively compute average over time in R - r

Consider the following dataset:
period<-c(1,2,3,4,5)
x<-c(3,6,7,4,6)
cumulative_average<-c((3)/1,(3+6)/2,(3+6+7)/3,(3+6+7+4)/4,(3+6+7+4+6)/5)
df_test<-data.frame(period,value=x,cum_average=cumulative_average)
df_test
period value cum_average
1 3 3
2 6 4.5
3 7 5.3
4 4 5.0
5 6 5.2
Assume that the 5 observations in the 'value' column (the vector x) represent the values taken by a variable in periods 1 to 5, respectively. How can I produce the column 'cum_average'?
I believe that this could be done using zoo::timeAverage, but when I try to load the package on my relatively old machine I run into a conflict and cannot use it.
Any help would be much appreciated!
Solution
new_df <- df_test %>% mutate(avgT = cumsum(value)/period)
did the trick.
Thank you so much for your answers!

Maybe you are looking for this. You can first compute the cumulative sum, as mentioned by @tmfmnk, and then, if the mean is required, divide by the row number, which tracks the number of observations. Here is the code using dplyr:
library(dplyr)
#Code
newdf <- df_test %>% mutate(AvgTime=cumsum(x)/row_number())
Output:
period x AvgTime
1 1 3 3.000000
2 2 6 4.500000
3 3 7 5.333333
4 4 4 5.000000
5 5 6 5.200000
If only cumulative sum is needed:
#Code2
newdf <- df_test %>% mutate(CumTime=cumsum(x))
Output:
period x CumTime
1 1 3 3
2 2 6 9
3 3 7 16
4 4 4 20
5 5 6 26
Or only base R:
#Base R
df_test$Cumsum <- cumsum(df_test$x)
Output:
period x Cumsum
1 1 3 3
2 2 6 9
3 3 7 16
4 4 4 20
5 5 6 26

Using standard R:
period<-c(1,2,3,4,5)
value<-c(3,6,7,4,6)
recursive_average<-cumsum(value) / (1:length(value))
df_test<-data.frame(value, recursive_average)
df_test
value recursive_average
1 3 3.000000
2 6 4.500000
3 7 5.333333
4 4 5.000000
5 6 5.200000
If your period vector is the one you wish to divide by when calculating the average, simply replace 1:length(value) with period.
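Since the question asks for a recursive computation, the running mean can also be updated one observation at a time from the previous mean, using mean_n = mean_(n-1) + (value_n - mean_(n-1)) / n. The following is a minimal sketch of that update (the names running_mean and m are only for illustration, not from the answers above); it should agree with cumsum(value) / seq_along(value):

```r
value <- c(3, 6, 7, 4, 6)

running_mean <- numeric(length(value))
m <- 0  # mean of the observations seen so far
for (n in seq_along(value)) {
  # Recursive update: shift the previous mean towards the new observation
  m <- m + (value[n] - m) / n
  running_mean[n] <- m
}

data.frame(value, running_mean)
```

The loop never stores the cumulative sum, which can be convenient when observations arrive one at a time.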

We can use cummean
library(dplyr)
df_test %>%
mutate(AvgTime=cummean(value))
-output
# period value AvgTime
#1 1 3 3.000000
#2 2 6 4.500000
#3 3 7 5.333333
#4 4 4 5.000000
#5 5 6 5.200000
data
df_test <- structure(list(period = c(1, 2, 3, 4, 5), value = c(3, 6, 7,
4, 6)), class = "data.frame", row.names = c(NA, -5L))

Related

aggregate on multiple columns - keeping the original column names and structure

Please consider the following example, which makes use of aggregate twice.
library(dplyr)
set.seed(5)
x <- data.frame(
name = sample(c('NM01', 'NM02', 'NM03', 'NM04', 'NM05'), 400, replace = TRUE),
strand = sample(c('+', '-'), 400, replace = TRUE),
value = sample(6, 400, replace = TRUE)
)
x_agg_hist <- aggregate( x$value,
by = list(strand = x$strand,
transcript = x$name
),
function(v) hist( v,
breaks = seq(0.5, 6.5),
plot= FALSE
)$counts
)
y <- data.frame(
name = c('NM01', 'NM02', 'NM03', 'NM04', 'NM05'),
value = runif(5)
)
x_agg_hist$value <- y$value[match(x_agg_hist$transcript, y$name)]
x_agg_hist$division <- ifelse(x_agg_hist$value > 0.5, 1, 2) %>% as.factor()
x_agg_hist
strand transcript x.1 x.2 x.3 x.4 x.5 x.6 value division
1 - NM01 6 9 8 5 5 8 0.5661267 1
2 + NM01 4 2 8 8 8 6 0.5661267 1
3 - NM02 8 4 6 5 3 11 0.1178577 2
4 + NM02 7 6 9 8 7 7 0.1178577 2
5 - NM03 4 5 10 4 6 3 0.2572855 2
6 + NM03 6 10 5 9 5 9 0.2572855 2
7 - NM04 7 4 5 7 4 9 0.9678125 1
8 + NM04 4 3 4 10 8 9 0.9678125 1
9 - NM05 4 6 10 5 5 5 0.8891210 1
10 + NM05 11 13 5 8 12 8 0.8891210 1
So far, everything is fine. Specifically, I notice that I can select the columns of the histograms created by aggregate "collectively" using
x_agg_hist$x
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 6 9 8 5 5 8
[2,] 4 2 8 8 8 6
[3,] 8 4 6 5 3 11
[4,] 7 6 9 8 7 7
[5,] 4 5 10 4 6 3
[6,] 6 10 5 9 5 9
Next, I would like to sum the histograms by 'division' and 'strand' (and normalise by the number of observations in each group).
x_agg_hist_agg_sum <- aggregate( x_agg_hist$x,
by = list(division = x_agg_hist$division,
strand = x_agg_hist$strand
),
function(v) sum(v)/length(v)
)
Note that using x_agg_hist$x to select all the columns of the histograms seems a lot more convenient than what has been proposed here (Aggregate / summarize multiple variables per group (e.g. sum, mean)).
This still works as expected.
x_agg_hist_agg_sum
division strand V1 V2 V3 V4 V5 V6
1 1 - 5.666667 6.333333 7.666667 5.666667 4.666667 7.333333
2 2 - 6.000000 4.500000 8.000000 4.500000 4.500000 7.000000
3 1 + 6.333333 6.000000 5.666667 8.666667 9.333333 7.666667
4 2 + 6.500000 8.000000 7.000000 8.500000 6.000000 8.000000
However, now aggregate has renamed the columns of the (summed) histograms in a way that does not allow selecting them collectively any more. Therefore, I was wondering if it was possible to tell aggregate to keep the original column names and structure or if there is any other method that can do so. (Of course I know that I can use x_agg_hist_agg_sum[, -c(1, 2)], but with my real data (after a lot of further processing) this would at least be a lot more difficult.)
Cheers,
mce1
I would suggest using dplyr for such long chained operations. It has a lot of benefits.
You can do all the transformation, manipulation, and reshaping in a single pipe without creating intermediate variables like x_agg_hist and x_agg_hist_agg_sum, so you don't have to remember or manage them.
The first few steps of your code can be translated as:
library(dplyr)
x %>%
group_by(strand, name) %>%
summarise(res = hist(value, breaks = seq(0.5, 6.5),plot= FALSE)$counts) %>%
left_join(y, by = 'name') %>%
mutate(division = factor(ifelse(value > 0.5, 1, 2))) %>%
ungroup
Use pivot_wider to cast the data into wide format which will maintain the names of the data.
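If staying with aggregate is preferred, one possible workaround (a sketch, not part of the answer above) is to let aggregate flatten the histogram columns and then bundle them back into a single matrix column, so that they can again be selected collectively with $x. The smaller data set and the rule used to assign division below are made up for illustration, and the names by_cols, flat, key_names and result are hypothetical:

```r
set.seed(5)
x <- data.frame(
  name   = sample(c("NM01", "NM02", "NM03"), 200, replace = TRUE),
  strand = sample(c("+", "-"), 200, replace = TRUE),
  value  = sample(6, 200, replace = TRUE)
)

# First aggregation: one histogram (6 counts) per strand/transcript,
# stored by aggregate as a matrix column named "x"
x_agg_hist <- aggregate(
  x$value,
  by = list(strand = x$strand, transcript = x$name),
  function(v) hist(v, breaks = seq(0.5, 6.5), plot = FALSE)$counts
)

# Hypothetical division rule, standing in for the lookup used in the question
x_agg_hist$division <- factor(ifelse(x_agg_hist$transcript == "NM01", 1, 2))

# Second aggregation: aggregate splits the matrix into columns V1..V6
by_cols <- list(division = x_agg_hist$division, strand = x_agg_hist$strand)
flat <- aggregate(x_agg_hist$x, by = by_cols, FUN = mean)

# Bundle the flattened columns back into one matrix column called "x"
key_names <- names(by_cols)
result <- flat[key_names]
result$x <- as.matrix(flat[setdiff(names(flat), key_names)])

result$x  # all histogram columns, selected collectively
```

Selecting the value columns by excluding the grouping names avoids hard-coding positions such as -c(1, 2).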

Rolling sum in specified range

For df I want to take the rolling sum of the Value column over the last 10 seconds, with Time given in seconds. The data frame is very large, so using tidyr::complete is not an option (millions of data points, at millisecond level). I would prefer a dplyr solution, but I think it may also be possible with a data.table join; I just can't make it work.
df = data.frame(Row=c(1,2,3,4,5,6,7),Value=c(4,7,2,6,3,8,3),Time=c(10021,10023,10027,10035,10055,10058,10092))
The solution would add a column (Sum.10S) that holds the rolling sum over the past 10 seconds:
df$Sum.10S=c(4,11,13,8,3,11,3)
Define a function sum10 which sums the last 10 seconds and use it with rollapplyr. It avoids explicit looping and runs about 10x faster than explicit looping using the data in the question.
library(zoo)
sum10 <- function(x) {
if (is.null(dim(x))) x <- t(x)
tt <- x[, "Time"]
sum(x[tt >= tail(tt, 1) - 10, "Value"])
}
transform(df, S10 = rollapplyr(df, 10, sum10, by.column = FALSE, partial = TRUE))
giving:
Row Value Time S10
1 1 4 10021 4
2 2 7 10023 11
3 3 2 10027 13
4 4 6 10035 8
5 5 3 10055 3
6 6 8 10058 11
7 7 3 10092 3
Well I wasn't fast enough to get the first answer in. But this solution is simpler, and doesn't require an external library.
df = data.frame(Row=c(1,2,3,4,5,6,7),Value=c(4,7,2,6,3,8,3),Time=c(10021,10023,10027,10035,10055,10058,10092))
df$SumR<-NA
for(i in 1:nrow(df)){
df$SumR[i]<-sum(df$Value[which(df$Time<=df$Time[i] & df$Time>=df$Time[i]-10)])
}
Row Value Time SumR
1 1 4 10021 4
2 2 7 10023 11
3 3 2 10027 13
4 4 6 10035 8
5 5 3 10055 3
6 6 8 10058 11
7 7 3 10092 3
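For millions of rows, a loop that rescans the whole Time column for every row may become slow. A possible vectorised alternative, sketched here under the assumption that Time is sorted in ascending order, combines a cumulative sum with findInterval to locate the window boundaries. The names cs, hi and lo are introduced only for illustration:

```r
df <- data.frame(Row = 1:7,
                 Value = c(4, 7, 2, 6, 3, 8, 3),
                 Time = c(10021, 10023, 10027, 10035, 10055, 10058, 10092))

# Assumption: rows are ordered by Time
stopifnot(!is.unsorted(df$Time))

# cs[k + 1] is the sum of the first k values
cs <- c(0, cumsum(df$Value))

# Number of rows with Time <= current Time (end of the window)
hi <- findInterval(df$Time, df$Time)

# Number of rows with Time < current Time - 10 (rows before the window)
lo <- findInterval(df$Time - 10, df$Time, left.open = TRUE)

# Window sum = sum up to the end of the window minus sum before it
df$Sum.10S <- cs[hi + 1] - cs[lo + 1]
df
```

With non-integer values, differences of large cumulative sums can pick up small floating-point errors, so it may be worth comparing against the loop on a sample.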

How to calculate value for an observation by group?

I have a data frame like so:
mydf <- data.frame(group=c(rep("a", 4),rep("b", 4), rep("c", 4)), score=sample(1:10, 12, replace=TRUE))
mydf
group score
1 a 10
2 a 9
3 a 2
4 a 3
5 b 1
6 b 10
7 b 1
8 b 10
9 c 3
10 c 7
11 c 1
12 c 3
I can calculate the mean of each group like so:
> by(mydf[,c("score")], mydf$group, mean)
mydf$group: a
[1] 6
-------------------------------------------------------------------
mydf$group: b
[1] 5.5
-------------------------------------------------------------------
mydf$group: c
[1] 3.5
But what I wish to do is create a new column, say called residual, which contains the residual from the mean of the group. It seems like there should be some way to use one of the apply functions to do this, but for some reason I can't see it.
I would want my end result to look like so:
mydf
group score residual
1 a 10 4
2 a 9 3
3 a 2 -4
4 a 3 -3
5 b 1 -4.5
6 b 10 4.5
7 b 1 -4.5
8 b 10 4.5
9 c 3 -.5
10 c 7 3.5
11 c 1 -2.5
12 c 3 -.5
Any ideas or pointers in the right direction are appreciated.
How about:
mydf$score - tapply(mydf$score, mydf$group, mean)[as.character(mydf$group)]
tapply works the same as by but with nicer output. The [as.character(mydf$group)] subsets and replicates tapply's output so that it aligns with mydf$group.
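To make the lookup step visible, here is the same idea broken into intermediate values, using the scores printed in the question. The names group_means and expanded are only for illustration:

```r
mydf <- data.frame(group = rep(c("a", "b", "c"), each = 4),
                   score = c(10, 9, 2, 3, 1, 10, 1, 10, 3, 7, 1, 3))

# One mean per group, named by group: a = 6, b = 5.5, c = 3.5
group_means <- tapply(mydf$score, mydf$group, mean)

# Indexing by name repeats each group's mean once per row of that group
expanded <- group_means[as.character(mydf$group)]

# Residual = score minus the mean of its own group
mydf$residual <- mydf$score - as.vector(expanded)
mydf
```

Because the lookup is by name, it does not depend on the rows being sorted by group.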
library(dplyr)
mydf %>% group_by(group) %>% mutate(residual = score - mean(score))
I take the data, I group by group, then I add a column (using mutate) which is the difference between the variable score and the mean of that variable in each group.
library(hash)
mydf <- data.frame(group=c(rep("a", 4),rep("b", 4), rep("c", 4)), score=sample(1:10, 12, replace=TRUE))
byResult <- by(mydf[,c("score")], mydf$group, mean)
h <- hash(keys= names(byResult), values =byResult)
residualsVar <- apply(mydf,1,function(row){
# residual = score - group mean (row[1] is the group, row[2] the score)
as.numeric(row[2])-as.vector(values(h,row[1]))
})
df <- cbind(mydf,residualsVar)

Using ddply to aggregate over irregular time periods in longitudinal data

I'm looking for help adapting two existing scripts.
I am working with a longitudinal dataset, and aggregating a key variable over time periods. I have a variable for both weeks and months. I'm able to aggregate over both weeks and months - but my goal is to aggregate over weeks for the first six weeks, and then move over to aggregating by months after 6 weeks+.
Aggregating by weeks and months is easy enough...
df.summary_week <- ddply(df, .(weeks), summarise,
var.mean = mean(var,na.rm=T))
Which yields something like:
weeks var.mean
1 3.99
2 5.44
3 6.7
4 8.100
5 2.765
6 2.765
7 3.765
8 4.765
9 1.765
10 4.765
11 1.765
And then aggregating by month would yield something similar:
df.summary_months <- ddply(df, .(months), summarise,
var.mean = mean(var,na.rm=T))
months var.mean
1 5.00
2 3.001
3 4.7
4 7.100
My initial idea was to simply subset the two datasets with cut points and then bind them together, but I don't know how to do that when the 1-month aggregation starts at 6 weeks rather than 8.
Thoughts, R wizards?
Basic example data.
dat <- data.frame(var=1:24,weeks=1:24,months=rep(1:6,each=4))
The means for the first 6 groups should be just 1:6; after that, the means are taken over subsequent 4-week periods (e.g. mean(7:10) = 8.5, etc.).
Make a suitable group identifier going from weeks to months:
dat$grp <- findInterval(dat$weeks,seq(7,max(dat$weeks),4)) + 6
dat$grp <- ifelse(dat$grp==6,dat$weeks,dat$grp)
#[1] 1 2 3 4 5 6 7 7 7 7 8 8 8 8 9 9 9 9 10 10 10 10 11 11
Group the data:
library(plyr)
ddply(dat, .(grp), summarise, var.mean = mean(var,na.rm=T))
grp var.mean
1 1 1.0
2 2 2.0
3 3 3.0
4 4 4.0
5 5 5.0
6 6 6.0
7 7 8.5
8 8 12.5
9 9 16.5
10 10 20.5
11 11 23.5
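To see how the group identifier is built, here is a sketch of the same idea with the weekly and 4-week parts written in a single ifelse, summarised with base R's aggregate instead of ddply. The name breaks is introduced only for illustration:

```r
dat <- data.frame(var = 1:24, weeks = 1:24)

# Start weeks of the 4-week blocks that follow week 6: 7, 11, 15, 19, 23
breaks <- seq(7, max(dat$weeks), by = 4)

# Weeks 1 to 6 keep their own group; later weeks get 6 + block number
dat$grp <- ifelse(dat$weeks <= 6,
                  dat$weeks,
                  6 + findInterval(dat$weeks, breaks))

# Mean of var within each group
res <- aggregate(var ~ grp, data = dat, FUN = mean)
res
```

The last group covers only weeks 23 and 24 here, because the example data stop before that block is complete.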
How about just creating a new grouping column?
set.seed(1618)
dat <- data.frame(week = sample(1:26, 200, replace = TRUE),
value = rpois(200, 2))
dat <- within(dat, {
idx <- cut(week, c(0, 6, seq(10, max(week), by = 4)))
})
# head(dat)
# week value idx
# 1 6 1 (0,6]
# 2 16 2 (14,18]
# 3 9 1 (6,10]
# 4 13 2 (10,14]
# 5 8 2 (6,10]
# 6 16 2 (14,18]
library(plyr)
ddply(dat, .(idx), summarise,
mean = mean(value, na.rm = TRUE))
# idx mean
# 1 (0,6] 1.870968
# 2 (6,10] 2.259259
# 3 (10,14] 2.171429
# 4 (14,18] 1.931034
# 5 (18,22] 1.560000
# 6 (22,26] 1.954545
# checking a couple values
mean(dat[dat$week %in% 1:6, 'value'])
# [1] 1.870968
mean(dat[dat$week %in% 7:10, 'value'])
# [1] 2.259259
mean(dat[dat$week %in% 23:26, 'value'])
# [1] 1.954545

Calculate difference between values in consecutive rows by group

This is my df (data.frame):
group value
1 10
1 20
1 25
2 5
2 10
2 15
I need to calculate the difference between values in consecutive rows by group.
So, I need the following result:
group value diff
1 10 NA # because there is no previous value
1 20 10 # value[2] - value[1]
1 25 5 # value[3] - value[2]
2 5 NA # because the group has changed
2 10 5 # value[5] - value[4]
2 15 5 # value[6] - value[5]
I can handle this problem using ddply, but it takes too much time because I have a lot of groups in my df (over 1,000,000).
Are there any other efficient approaches to handle this problem?
The package data.table can do this fairly quickly, using the shift function.
require(data.table)
df <- data.table(group = rep(c(1, 2), each = 3), value = c(10,20,25,5,10,15))
#setDT(df) #if df is already a data frame
df[ , diff := value - shift(value), by = group]
# group value diff
#1: 1 10 NA
#2: 1 20 10
#3: 1 25 5
#4: 2 5 NA
#5: 2 10 5
#6: 2 15 5
setDF(df) #if you want to convert back to old data.frame syntax
Or using the lag function in dplyr:
library(dplyr)
df %>%
group_by(group) %>%
mutate(Diff = value - lag(value))
# group value Diff
# <int> <int> <int>
# 1 1 10 NA
# 2 1 20 10
# 3 1 25 5
# 4 2 5 NA
# 5 2 10 5
# 6 2 15 5
For alternatives pre-data.table::shift and pre-dplyr::lag, see edits.
You can use the base function ave() for this
df <- data.frame(group=rep(c(1,2),each=3),value=c(10,20,25,5,10,15))
df$diff <- ave(df$value, factor(df$group), FUN=function(x) c(NA,diff(x)))
which returns
group value diff
1 1 10 NA
2 1 20 10
3 1 25 5
4 2 5 NA
5 2 10 5
6 2 15 5
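Try this with tapply (this assumes the rows are already sorted by group, since tapply returns its results in group order):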
df$diff<-as.vector(unlist(tapply(df$value,df$group,FUN=function(x){ return (c(NA,diff(x)))})))
Since dplyr 1.1.0, you can shorten the dplyr version with inline temporary grouping with .by:
mutate(df, diff = value - lag(value), .by = group)
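With over a million groups, a fully vectorised base R variant that avoids calling a function once per group may also be worth trying. This is a sketch that assumes the rows of each group are contiguous (for example, the data are sorted by group); the names d and first_in_group are only for illustration:

```r
df <- data.frame(group = rep(c(1, 2), each = 3),
                 value = c(10, 20, 25, 5, 10, 15))

# Difference to the previous row, ignoring groups for now
d <- c(NA, diff(df$value))

# TRUE where a row starts a new group (its group differs from the row above)
first_in_group <- c(TRUE, df$group[-1] != df$group[-nrow(df)])

# The first row of each group has no previous value within its group
d[first_in_group] <- NA

df$diff <- d
df
```

If the groups are not contiguous, the data would need to be ordered by group first.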
