I'm looking for help adapting two existing scripts.
I am working with a longitudinal dataset, aggregating a key variable over time periods. I have variables for both weeks and months, and I can aggregate over either. My goal, though, is to aggregate by week for the first six weeks and then switch to aggregating by month from week 6 onward.
Aggregating by weeks and months is easy enough...
library(plyr)
df.summary_week <- ddply(df, .(weeks), summarise,
                         var.mean = mean(var, na.rm = TRUE))
Which yields something like:
weeks var.mean
1 3.99
2 5.44
3 6.7
4 8.100
5 2.765
6 2.765
7 3.765
8 4.765
9 1.765
10 4.765
11 1.765
And then aggregating by month would yield something similar:
df.summary_months <- ddply(df, .(months), summarise,
var.mean = mean(var,na.rm=T))
months var.mean
1 5.00
2 3.001
3 4.7
4 7.100
My initial idea was to simply subset the two datasets with cut points and then bind them together, but I don't know how to do that when the 1-month aggregation starts at 6 weeks rather than 8.
Thoughts, R wizards?
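For reference, the subset-and-bind idea might look like the sketch below (assuming df has the var, weeks, and months columns described above). Note that the period labels of the two pieces overlap, which is the bookkeeping problem the answers below solve by building a single group identifier instead.
library(plyr)
# weekly means for the first six weeks
early <- ddply(subset(df, weeks <= 6), .(weeks), summarise,
               var.mean = mean(var, na.rm = TRUE))
names(early)[1] <- "period"
# monthly means for everything after week 6
late <- ddply(subset(df, weeks > 6), .(months), summarise,
              var.mean = mean(var, na.rm = TRUE))
names(late)[1] <- "period"
df.summary <- rbind(early, late)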
Basic example data.
dat <- data.frame(var=1:24,weeks=1:24,months=rep(1:6,each=4))
The means for the first 6 groups should be just 1:6; after that, each mean covers a 4-week period, e.g. mean(7:10) = 8.5, and so on.
Make a suitable group identifier going from weeks to months:
dat$grp <- findInterval(dat$weeks, seq(7, max(dat$weeks), 4)) + 6  # 4-week bins from week 7 on
dat$grp <- ifelse(dat$grp == 6, dat$weeks, dat$grp)                # weeks 1-6 keep their own group
#[1] 1 2 3 4 5 6 7 7 7 7 8 8 8 8 9 9 9 9 10 10 10 10 11 11
Group the data:
ddply(dat, .(grp), summarise, var.mean = mean(var,na.rm=T))
grp var.mean
1 1 1.0
2 2 2.0
3 3 3.0
4 4 4.0
5 5 5.0
6 6 6.0
7 7 8.5
8 8 12.5
9 9 16.5
10 10 20.5
11 11 23.5
How about just creating a new grouping column?
set.seed(1618)
dat <- data.frame(week = sample(1:26, 200, replace = TRUE),
value = rpois(200, 2))
dat <- within(dat, {
idx <- cut(week, c(0, 6, seq(10, max(week), by = 4)))
})
# head(dat)
# week value idx
# 1 6 1 (0,6]
# 2 16 2 (14,18]
# 3 9 1 (6,10]
# 4 13 2 (10,14]
# 5 8 2 (6,10]
# 6 16 2 (14,18]
library(plyr)
ddply(dat, .(idx), summarise,
mean = mean(value, na.rm = TRUE))
# idx mean
# 1 (0,6] 1.870968
# 2 (6,10] 2.259259
# 3 (10,14] 2.171429
# 4 (14,18] 1.931034
# 5 (18,22] 1.560000
# 6 (22,26] 1.954545
# checking a couple values
mean(dat[dat$week %in% 1:6, 'value'])
# [1] 1.870968
mean(dat[dat$week %in% 7:10, 'value'])
# [1] 2.259259
mean(dat[dat$week %in% 23:26, 'value'])
# [1] 1.954545
Related
Consider the follow dataset
period <- c(1, 2, 3, 4, 5)
value <- c(3, 6, 7, 4, 6)
cum_average <- c(3/1, (3+6)/2, (3+6+7)/3, (3+6+7+4)/4, (3+6+7+4+6)/5)
df_test <- data.frame(period, value, cum_average)
df_test
  period value cum_average
1      1     3    3.000000
2      2     6    4.500000
3      3     7    5.333333
4      4     4    5.000000
5      5     6    5.200000
Assume that the 5 observations in the 'value' column represent the values taken by a variable in periods 1 to 5, respectively. How can I produce the 'cum_average' column?
I believe this could be done using zoo::timeAverage, but when I try to load the package on my relatively old machine I run into some conflicts and cannot use it.
Any help would be much appreciated!
Solution
library(dplyr)
new_df <- df_test %>% mutate(avgT = cumsum(value) / period)
did the trick.
Thank you so much for your answers!
Maybe you are looking for this. You can first compute the cumulative sum, as mentioned by @tmfmnk, and then divide by row_number(), which tracks how many observations have been seen so far, if the mean is required. Here is the code using dplyr:
library(dplyr)
#Code
newdf <- df_test %>% mutate(AvgTime = cumsum(value) / row_number())
Output:
period value AvgTime
1 1 3 3.000000
2 2 6 4.500000
3 3 7 5.333333
4 4 4 5.000000
5 5 6 5.200000
If only the cumulative sum is needed:
#Code2
newdf <- df_test %>% mutate(CumTime = cumsum(value))
Output:
period value CumTime
1 1 3 3
2 2 6 9
3 3 7 16
4 4 4 20
5 5 6 26
Or only base R:
#Base R
df_test$Cumsum <- cumsum(df_test$value)
Output:
period value Cumsum
1 1 3 3
2 2 6 9
3 3 7 16
4 4 4 20
5 5 6 26
Using standard R:
period<-c(1,2,3,4,5)
value<-c(3,6,7,4,6)
recursive_average<-cumsum(value) / (1:length(value))
df_test<-data.frame(value, recursive_average)
df_test
value recursive_average
1 3 3.000000
2 6 4.500000
3 7 5.333333
4 4 5.000000
5 6 5.200000
If your period vector is the vector you wish to use to calculate the average, simply replace 1:length(value) with period, as shown below.
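A minimal sketch (with period equal to 1:5 this reproduces the averages above):
value <- c(3, 6, 7, 4, 6)
period <- c(1, 2, 3, 4, 5)
recursive_average <- cumsum(value) / period  # divide running totals by the period counter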
We can use cummean
library(dplyr)
df_test %>%
mutate(AvgTime=cummean(value))
Output:
# period value AvgTime
#1 1 3 3.000000
#2 2 6 4.500000
#3 3 7 5.333333
#4 4 4 5.000000
#5 5 6 5.200000
data
df_test <- structure(list(period = c(1, 2, 3, 4, 5), value = c(3, 6, 7,
4, 6)), class = "data.frame", row.names = c(NA, -5L))
I have a dataset with > 900,000 rows with many duplicates:
> sum(duplicated(df$colB))
[1] 904515
So when I try to quantile cut into ten equally large subsets, I get an error
> df$colC <- cut(df$colB, quantile(df$colB,c(0:10)/10), labels=FALSE,
+ include.lowest=TRUE)
Error in cut.default(df$colB, quantile(df$colB, :
'breaks' are not unique
Using unique(quantile(df$colB,c(0:10)/10)) doesn't give equally sized subsets. There must be an easy solution to make quantile cuts that also considers the number of rows, in addition to the values in colB. A loop would probably take forever, as I have a high number of rows. Any ideas?
Dummy dataset:
set.seed(10)
B <- round(runif(100, 0, 0.4), digits = 2) # gives 63 duplicates
df <- data.frame(colB = B)
There might be a neater solution than this, but this will do it:
df$colC <- ceiling((1:nrow(df))*10/nrow(df))[rank(df$colB, ties.method = 'first')]
table(df$colC)
#>
#> 1 2 3 4 5 6 7 8 9 10
#> 10 10 10 10 10 10 10 10 10 10
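As an aside (not part of the original answer), dplyr's ntile() implements the same rank-based bucketing in one call:
library(dplyr)
df$colC <- ntile(df$colB, 10)  # rank-based deciles; ties are split so bucket sizes differ by at most one
table(df$colC)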
It might be hard to imagine, but there must be a value (or a run of values) in df$colB repeated so often that quantile returns the same number for two (or more) of the requested quantiles.
A contrived example:
set.seed(42)
vec <- c(rep(10, 20), sample(100, size = 80))
brks <- quantile(vec, (0:10)/10)
brks
# 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# 2.0 10.0 10.0 14.7 25.6 36.5 47.4 58.9 72.4 88.1 100.0
The cut function requires that there be no repeated values in its breaks= argument. It should be informative to look at the quantiles of your data to confirm this.
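For the question's data, that check might look like this (a sketch, using the dummy colB from above):
brks <- quantile(df$colB, (0:10)/10)  # the candidate breaks
any(duplicated(brks))                 # TRUE means cut() will refuse these breaks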
One way around this is to use .bincode, which does not enforce unique breaks.
cut(vec, brks, include.lowest = TRUE)
# Error in cut.default(vec, brks, include.lowest = TRUE) :
# 'breaks' are not unique
.bincode(vec, brks, include.lowest = TRUE)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 8 4 9 4
# [26] 10 6 4 8 10 6 4 5 1 6 5 5 1 5 9 7 6 10 5 6 4 4 9 1 9
# [51] 8 10 1 7 10 9 8 1 8 1 7 9 7 4 8 7 6 1 6 9 5 8 6 10 6
# [76] 9 1 5 3 10 6 5 9 4 5 7 10 7 8 9 4 5 7 3 8 4 10 7 8 10
(Note that there is no "2" in the return values with this data, because brks[2] is the same as brks[3], so appears to be ignored.)
One side-effect of this is that you don't get the factor labels by default, which might be useful.
labels <- sprintf("(%0.01f-%0.01f]", brks[-length(brks)], brks[-1])  # pair each break with the next
substr(labels[1], 1, 1) <- "["                                       # first interval is closed on the left
labels
# [1] "[2.0-10.0]" "(10.0-10.0]" "(10.0-14.7]" "(14.7-25.6]"
# [5] "(25.6-36.5]" "(36.5-47.4]" "(47.4-58.9]" "(58.9-72.4]"
# [9] "(72.4-88.1]" "(100.0-100.0]"
head(labels[ .bincode(vec, brks, include.lowest = TRUE) ])
# [1] "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]"
(The %0.01f format is the part to customize if you want different precision in the labels.)
In R, I have a dataframe with columns 'A', 'B', 'C', 'D', each with 100 rows.
I need to iterate over the columns and, for every row, compute the sum of the current and previous row of that column, storing the result in new columns ('AA', 'AB', etc.):
A B C D
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
to
A B C D AA AB AC AD
1 2 3 4 NA NA NA NA
2 3 4 5 3 5 7 9
3 4 5 6 5 7 9 11
4 5 6 7 7 9 11 13
5 6 7 8 9 11 13 15
6 7 8 9 11 13 15 17
Can someone explain how to create a function/loop that lets me choose which columns to iterate over (selected columns, not necessarily all) and which new columns to set?
A base one-liner:
cbind(df, setNames(df + df[c(NA, 1:(nrow(df)-1)), ], paste0("A", names(df))))
If your data is large, this one might be the fastest because it operates on the whole data.frame at once rather than column by column.
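Unpacked, the one-liner does this, step by step:
shifted <- df[c(NA, 1:(nrow(df) - 1)), ]   # shift all rows down by one; row 1 becomes all NA
new_cols <- df + shifted                   # element-wise: current row + previous row
names(new_cols) <- paste0("A", names(df))  # AA, AB, AC, AD
result <- cbind(df, new_cols)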
A dplyr solution using mutate() with across().
library(dplyr)
df %>%
mutate(across(A:D,
~ .x + lag(.x),
.names = "A{col}"))
# A B C D AA AB AC AD
# 1 1 2 3 4 NA NA NA NA
# 2 2 3 4 5 3 5 7 9
# 3 3 4 5 6 5 7 9 11
# 4 4 5 6 7 7 9 11 13
# 5 5 6 7 8 9 11 13 15
# 6 6 7 8 9 11 13 15 17
If you want to sum the previous 3 rows, the second argument of across(), i.e. .fns, should be
~ .x + lag(.x) + lag(.x, 2)
which is equivalent to the use of rollsum() in zoo:
~ zoo::rollsum(.x, k = 3, fill = NA, align = 'right')
Benchmark
A benchmark test with the microbenchmark package on a data.frame with 10,000 rows and 100 columns, evaluating each expression 10 times.
# Unit: milliseconds
# expr min lq mean median uq max neval
# darren_base 18.58418 20.88498 35.51341 33.64953 39.31909 80.24725 10
# darren_dplyr_lag 39.49278 40.27038 47.26449 42.89170 43.20267 76.72435 10
# arg0naut91_dplyr_rollsum 436.22503 482.03199 524.54800 516.81706 534.94317 677.64242 10
# Grothendieck_rollsumr 3423.92097 3611.01573 3650.16656 3622.50895 3689.26404 4060.98054 10
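A minimal sketch of such a setup (the test data here is an assumption, and only the first two benchmarked expressions are reproduced):
library(microbenchmark)
library(dplyr)
big <- as.data.frame(matrix(rnorm(10000 * 100), nrow = 10000))
microbenchmark(
  darren_base = cbind(big, setNames(big + big[c(NA, 1:(nrow(big) - 1)), ],
                                    paste0("A", names(big)))),
  darren_dplyr_lag = mutate(big, across(everything(), ~ .x + lag(.x),
                                        .names = "A{col}")),
  times = 10
)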
You can use dplyr's across (and set optional names) with rolling sum (as implemented e.g. in zoo):
library(dplyr)
library(zoo)
df %>%
mutate(
across(
A:D,
~ rollsum(., k = 2, fill = NA, align = 'right'),
.names = 'A{col}'
)
)
Output:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
With A:D we've specified the range of column names we want to apply the function to. The .names argument above assumes you want to paste the prefix A onto each column name ({col}).
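If you prefer a different naming scheme, only the glue spec changes; for instance, suffix-style names (a usage sketch):
df %>%
  mutate(
    across(
      A:D,
      ~ rollsum(., k = 2, fill = NA, align = 'right'),
      .names = '{col}_sum2'  # yields A_sum2, B_sum2, C_sum2, D_sum2
    )
  )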
Here's a data.table solution. As you asked, it allows you to select which columns to apply it to, rather than applying it to all columns.
library(data.table)
x <- data.table(A=1:6, B=2:7, C=3:8, D=4:9)
selected_cols <- c('A','B','D')
new_cols <- paste0("A",selected_cols)
x[, (new_cols) := lapply(.SD, function(col) col+shift(col, 1)), .SDcols = selected_cols]
x[]
NB This is 2 or 3 times faster than the fastest other answer.
This is a naive approach with nested for loops. Beware: it is very slow if you iterate over hundreds of thousands of rows.
i <- 1
n <- 5
df <- data.frame(A=i:(i+n), B=(i+1):(i+n+1), C=(i+2):(i+n+2), D=(i+3):(i+n+3))
for (col in colnames(df)) {
for (ind in 1:nrow(df)) {
if (ind-1==0) {next}
s <- sum(df[c(ind-1, ind), col])
df[ind, paste0('S', col)] <- s
}
}
This is a cumsum method:
na.df <- data.frame(matrix(NA, 2, ncol(df)))  # two all-NA padding rows
colnames(na.df) <- colnames(df)
cs1 <- cumsum(df)                             # column-wise cumulative sums
cs2 <- rbind(cs1[-1:-2,], na.df)              # cumulative sums shifted up by two rows
sum.diff <- cs2 - cs1                         # cs[i+2] - cs[i] = x[i+1] + x[i+2]
# row 1 is NA, row 2 is cs1[2,] = x1 + x2, the rest are the shifted differences
cbind(df, rbind(na.df[1,], cs1[2,], sum.diff[1:(nrow(sum.diff)-2),]))
Benchmark:
# Unit: milliseconds
# expr min lq mean median uq max neval
# darrentsai.rbind 11.5623 12.28025 23.38038 16.78240 20.83420 91.9135 100
# darrentsai.rbind.rev1 8.8267 9.10945 15.63652 9.54215 14.25090 62.6949 100
# pseudopsin.dt 7.2696 7.52080 20.26473 12.61465 17.61465 69.0110 100
# ivan866.cumsum 25.3706 30.98860 43.11623 33.78775 37.36950 91.6032 100
I believe most of the time in the cumsum method is wasted on data.frame allocations. If adapted to a data.table backend, it could be the fastest.
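A hedged sketch of that adaptation (my own translation, not benchmarked here): the window-2 sum is the difference of cumulative sums two rows apart.
library(data.table)
dt <- as.data.table(df)
roll2 <- function(col) {
  cs <- cumsum(col)
  out <- cs - shift(cs, n = 2, fill = 0)  # cs[i] - cs[i-2] = col[i] + col[i-1]
  out[1] <- NA                            # no previous row for the first entry
  out
}
cols <- names(dt)
new_cols <- paste0("S", cols)
dt[, (new_cols) := lapply(.SD, roll2), .SDcols = cols]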
Specify the columns we want. We show several different ways to do that. Then use rollsumr to get the desired columns, set the column names and cbind DF with it.
library(zoo)
# jx <- names(DF) # if all columns wanted
# jx <- sapply(DF, is.numeric) # if all numeric columns
# jx <- c("A", "B", "C", "D") # specify columns by name
jx <- 1:4 # specify columns by position
r <- rollsumr(DF[jx], 2, fill = NA)
colnames(r) <- paste0("A", colnames(r))
cbind(DF, r)
giving:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
Note
The input in reproducible form:
DF <- structure(list(A = 1:6, B = 2:7, C = 3:8, D = 4:9),
class = "data.frame", row.names = c(NA, -6L))
For df I want to take the rolling sum of the Value column over the last 10 seconds, with Time given in seconds. The dataframe is very large, so using tidyr::complete is not an option (millions of data points at millisecond resolution). I would prefer a dplyr solution, but I think it may be possible with a data.table join; I just can't make it work.
df = data.frame(Row=c(1,2,3,4,5,6,7),Value=c(4,7,2,6,3,8,3),Time=c(10021,10023,10027,10035,10055,10058,10092))
The solution would add a column (Sum.10S) containing the rolling sum over the past 10 seconds:
df$Sum.10S=c(4,11,13,8,3,11,3)
Define a function sum10 which sums the values within the last 10 seconds, and use it with rollapplyr. This avoids explicit looping and runs about 10x faster than the explicit loop on the data in the question. (The window width of 10 rows assumes at most 10 observations fall within any 10-second span; widen it for denser data.)
library(zoo)
sum10 <- function(x) {
  if (is.null(dim(x))) x <- t(x)           # a width-1 window arrives as a plain vector
  tt <- x[, "Time"]
  sum(x[tt >= tail(tt, 1) - 10, "Value"])  # sum Values within 10 seconds of the window's last Time
}
transform(df, S10 = rollapplyr(df, 10, sum10, by.column = FALSE, partial = TRUE))
giving:
Row Value Time S10
1 1 4 10021 4
2 2 7 10023 11
3 3 2 10027 13
4 4 6 10035 8
5 5 3 10055 3
6 6 8 10058 11
7 7 3 10092 3
Well, I wasn't fast enough to get the first answer in, but this solution is simpler and doesn't require an external library.
df = data.frame(Row=c(1,2,3,4,5,6,7),Value=c(4,7,2,6,3,8,3),Time=c(10021,10023,10027,10035,10055,10058,10092))
df$SumR<-NA
for(i in 1:nrow(df)){
df$SumR[i]<-sum(df$Value[which(df$Time<=df$Time[i] & df$Time>=df$Time[i]-10)])
}
Row Value Time SumR
1 1 4 10021 4
2 2 7 10023 11
3 3 2 10027 13
4 4 6 10035 8
5 5 3 10055 3
6 6 8 10058 11
7 7 3 10092 3
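For the data.table join the question alludes to, a non-equi self-join is one option (a sketch, not from the original answers):
library(data.table)
dt <- data.table(df)
dt[, c("win.start", "win.end") := .(Time - 10, Time)]
# for each row, sum Value over all rows whose Time falls in [Time - 10, Time]
res <- dt[dt, on = .(Time >= win.start, Time <= win.end),
          .(Sum.10S = sum(Value)), by = .EACHI]
df$Sum.10S <- res$Sum.10S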
This is my df (data.frame):
group value
1 10
1 20
1 25
2 5
2 10
2 15
I need to calculate the difference between values in consecutive rows, by group.
So I need this result:
group value diff
1 10 NA # because there is no previous value
1 20 10 # value[2] - value[1]
1 25 5 # value[3] - value[2]
2 5 NA # because the group changed
2 10 5 # value[5] - value[4]
2 15 5 # value[6] - value[5]
I can handle this problem using ddply, but it takes too much time, because I have a lot of groups in my df (over 1,000,000 groups).
Are there any other effective approaches to handle this problem?
The package data.table can do this fairly quickly, using the shift function.
require(data.table)
df <- data.table(group = rep(c(1, 2), each = 3), value = c(10,20,25,5,10,15))
#setDT(df) #if df is already a data frame
df[ , diff := value - shift(value), by = group]
# group value diff
#1: 1 10 NA
#2: 1 20 10
#3: 1 25 5
#4: 2 5 NA
#5: 2 10 5
#6: 2 15 5
setDF(df) #if you want to convert back to old data.frame syntax
Or using the lag function in dplyr:
library(dplyr)
df %>%
group_by(group) %>%
mutate(Diff = value - lag(value))
# group value Diff
# <int> <int> <int>
# 1 1 10 NA
# 2 1 20 10
# 3 1 25 5
# 4 2 5 NA
# 5 2 10 5
# 6 2 15 5
For alternatives that predate data.table::shift and dplyr::lag, see the earlier revisions of this answer.
You can use the base function ave() for this
df <- data.frame(group=rep(c(1,2),each=3),value=c(10,20,25,5,10,15))
df$diff <- ave(df$value, factor(df$group), FUN=function(x) c(NA,diff(x)))
which returns
group value diff
1 1 10 NA
2 1 20 10
3 1 25 5
4 2 5 NA
5 2 10 5
6 2 15 5
Try this with tapply (note this assumes df is sorted by group, since tapply returns groups in the order of their levels):
df$diff <- as.vector(unlist(tapply(df$value, df$group, FUN = function(x) c(NA, diff(x)))))
Since dplyr 1.1.0, you can shorten the dplyr version using inline temporary grouping via .by:
mutate(df, diff = value - lag(value), .by = group)