Rolling weighted mean across two factor levels or time points

I would like to create a rolling two-quarter average for alpha, bravo and charlie (and many other variables). My research keeps pointing me to the zoo and lubridate packages, but the examples always seem to roll within a single variable or grouping.
set.seed(123)
dates <- c("Q4'15", "Q1'16", "Q2'16","Q3'16", "Q4'16", "Q1'17", "Q2'17" ,"Q3'17", "Q4'17","Q1'18")
df <- data.frame(dates = sample(dates, 100, replace = TRUE, prob=rep(c(.03,.07,.03,.08, .05),2)),
alpha = rnorm(100, 5), bravo = rnorm(100, 10), charlie = rnorm(100, 15))
I'm looking for something like
x <- df %>% mutate_if(is.numeric, funs(rollmean(., 2, align='right', fill=NA)))
Desired result: a weighted average across "Q4'15" & "Q1'16", "Q1'16" & "Q2'16", etc for each column of data (alpha, bravo, charlie). Not looking for the average of the paired quarterly averages.
Here is what the averages would be for the Q4'15 & Q1'16 time point:
df %>% filter(dates %in% c("Q4'15", "Q1'16")) %>% select(-dates) %>% summarise_all(mean)

I like data.table for this, and I have a solution for you but there may be a more elegant one. Here is what I have:
Data
Now as data.table:
R> suppressMessages(library(data.table))
R> set.seed(123)
R> datesvec <- c("Q4'15", "Q1'16", "Q2'16","Q3'16", "Q4'16",
+ "Q1'17", "Q2'17" ,"Q3'17", "Q4'17","Q1'18")
R> df <- data.table(dates = sample(datesvec, 100, replace = TRUE,
+ prob=rep(c(.03,.07,.03,.08, .05),2)),
+ alpha = rnorm(100, 5),
+ bravo = rnorm(100, 10),
+ charlie = rnorm(100, 15))
R> df[ , ind := which(datesvec==dates), by=dates]
R> setkey(df, ind) # optional but may as well
R> head(df)
dates alpha bravo charlie ind
1: Q4'15 5.37964 11.05271 14.4789 1
2: Q4'15 7.05008 10.36896 15.0892 1
3: Q4'15 4.29080 12.12845 13.6047 1
4: Q4'15 5.00576 8.93667 13.3325 1
5: Q4'15 3.53936 9.81707 13.6360 1
6: Q1'16 3.45125 10.56299 16.0808 2
R>
The key here is that we need to restore / maintain the temporal ordering of your quarters, which your data representation does not have.
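A side note of my own: the same ordering index can also be computed in a single vectorised step with match(), without the by= grouping used above; the result is identical.
df[, ind := match(dates, datesvec)]   # same ind column as the which()/by= version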
Average by quarter
This is easy with data.table:
R> ndf <- df[ ,
+ .(qtr=head(dates,1), # label of quarter
+ sa=sum(alpha), # sum of a in quarter
+ sb=sum(bravo), # sum of b in quarter
+ sc=sum(charlie), # sum of c in quarter
+ n=.N), # number of observations
+ by=ind]
R> ndf
ind qtr sa sb sc n
1: 1 Q4'15 25.2656 52.3039 70.1413 5
2: 2 Q1'16 65.8562 132.6650 192.7921 13
3: 3 Q2'16 10.3422 17.8061 31.3404 2
4: 4 Q3'16 84.6664 168.1914 256.9010 17
5: 5 Q4'16 41.3268 87.8253 139.5873 9
6: 6 Q1'17 42.6196 85.4059 134.8205 9
7: 7 Q2'17 76.5190 162.0784 241.2597 16
8: 8 Q3'17 42.8254 83.2483 127.2600 8
9: 9 Q4'17 68.1357 133.5794 198.1920 13
10: 10 Q1'18 37.0685 78.4107 120.2808 8
R>
Lag those averages once
R> ndf[, `:=`(psa=shift(sa), # previous sum of a
+ psb=shift(sb), # previous sum of b
+ psc=shift(sc), # previous sum of c
+ pn=shift(n))] # previous nb of obs
R> ndf
ind qtr sa sb sc n psa psb psc pn
1: 1 Q4'15 25.2656 52.3039 70.1413 5 NA NA NA NA
2: 2 Q1'16 65.8562 132.6650 192.7921 13 25.2656 52.3039 70.1413 5
3: 3 Q2'16 10.3422 17.8061 31.3404 2 65.8562 132.6650 192.7921 13
4: 4 Q3'16 84.6664 168.1914 256.9010 17 10.3422 17.8061 31.3404 2
5: 5 Q4'16 41.3268 87.8253 139.5873 9 84.6664 168.1914 256.9010 17
6: 6 Q1'17 42.6196 85.4059 134.8205 9 41.3268 87.8253 139.5873 9
7: 7 Q2'17 76.5190 162.0784 241.2597 16 42.6196 85.4059 134.8205 9
8: 8 Q3'17 42.8254 83.2483 127.2600 8 76.5190 162.0784 241.2597 16
9: 9 Q4'17 68.1357 133.5794 198.1920 13 42.8254 83.2483 127.2600 8
10: 10 Q1'18 37.0685 78.4107 120.2808 8 68.1357 133.5794 198.1920 13
R>
Average over current and previous quarter
R> ndf[is.finite(psa), # where we have valid data
+ `:=`(ra=(sa+psa)/(n+pn), # total sum / total n == avg
+ rb=(sb+psb)/(n+pn),
+ rc=(sc+psc)/(n+pn))]
R> ndf[,c(1:2, 11:13)]
ind qtr ra rb rc
1: 1 Q4'15 NA NA NA
2: 2 Q1'16 5.06233 10.27605 14.6074
3: 3 Q2'16 5.07989 10.03141 14.9422
4: 4 Q3'16 5.00045 9.78935 15.1706
5: 5 Q4'16 4.84589 9.84680 15.2496
6: 6 Q1'17 4.66369 9.62395 15.2449
7: 7 Q2'17 4.76554 9.89937 15.0432
8: 8 Q3'17 4.97268 10.22195 15.3550
9: 9 Q4'17 5.28386 10.32513 15.4977
10: 10 Q1'18 5.00972 10.09476 15.1654
R>
This takes advantage of the fact that the total sum over two quarters divided by the total number of observations is the same as the mean over all observations in those two quarters, i.e. the pooled average you asked for rather than the average of the two quarterly averages. (And this reflects an edit following an earlier thinko of mine.)
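A quick numeric illustration of that identity, using two toy vectors standing in for two quarters (not the question's data):
a <- c(1, 2, 3)                                # "quarter 1" observations
b <- c(10, 20)                                 # "quarter 2" observations
(sum(a) + sum(b)) / (length(a) + length(b))    # pooled mean: 7.2
mean(c(a, b))                                  # same value: 7.2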
Spot check
We can use the selection feature of data.table to compute two of those rows by hand; I picked the ones for indices <1,2> and <4,5> here:
R> df[ ind <= 2, .(a=mean(alpha), b=mean(bravo), c=mean(charlie))]
a b c
1: 5.06233 10.276 14.6074
R> df[ ind == 4 | ind == 5, .(a=mean(alpha), b=mean(bravo), c=mean(charlie))]
a b c
1: 4.84589 9.8468 15.2496
R>
This pans out fine, and the approach should scale easily to millions of rows thanks to data.table.
PS: All in One
As you mentioned pipes etc., you can write all this with chained data.table operations. Not my preferred style, but possible. The following creates the exact same output without ever creating an ndf temporary as above:
## All in one
df[ , ind := which(datesvec==dates), by=dates][
,
.(qtr=head(dates,1), # label of quarter
sa=sum(alpha), # sum of a in quarter
sb=sum(bravo), # sum of b in quarter
sc=sum(charlie), # sum of c in quarter
n=.N), # number of observations
by=ind][
,
`:=`(psa=shift(sa), # previous sum of a
psb=shift(sb), # previous sum of b
psc=shift(sc), # previous sum of c
pn=shift(n))][
is.finite(psa), # where we have valid data
`:=`(ra=(sa+psa)/(n+pn), # total sum / total n == avg
rb=(sb+psb)/(n+pn),
rc=(sc+psc)/(n+pn))][
,c(1:2, 11:13)][]
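Since the question mentions pipes and zoo, here is a rough sketch of how the same pooled two-quarter mean could be written with dplyr and zoo; this is my own addition rather than part of the answer above, and it assumes dplyr >= 1.0 (for across()) plus the df and datesvec objects defined earlier.
library(dplyr)
library(zoo)

df %>%
  mutate(ind = match(dates, datesvec)) %>%                      # restore quarter order
  group_by(ind, dates) %>%
  summarise(across(c(alpha, bravo, charlie), sum),              # per-quarter sums
            n = n(), .groups = "drop") %>%                      # per-quarter counts
  arrange(ind) %>%
  mutate(across(c(alpha, bravo, charlie),                       # pooled 2-quarter means
                ~ rollsumr(.x, k = 2, fill = NA) / rollsumr(n, k = 2, fill = NA)))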

Related

Calculate the number of individuals with more than i days of life with data.table in R

Here is my simplified data.table:
Individual   time_alive (days)
ID1          1
ID2          5
ID3          7
ID4          5
I need to calculate the number of individuals alive on each day.
I achieved this with a loop:
for (i in c(-1:600)) {
  y <- summarise(DT, time_alive > i)
  Alive[i + 2, ] <- length(y[y == TRUE])
}
However, this takes far too long with a data.frame of more than 2B observations.
I wanted to try an alternative with data.table but I am stuck at only 1 day number of alive calculation:
DT[,.N,time_alive> i][time_alive==TRUE,2]
Here, i cannot be replaced by a vector, only by a single number. I want to calculate the number of individuals with more than i days of life, without doing a loop.
My expected result for the simplified data would be:
Day   Number of individuals alive
1     4
2     3
3     3
4     3
5     3
6     1
7     1
8     0
Best solution in one line with data.table, which is much faster than looping:
DT[, .(Day = seq_len(1 + max(time_alive)))][DT[,.(time_alive)], .(.N), on = .(Day <= time_alive), by = Day]
# @r2evans' suggestion about making it a one-liner;
# this replaces res = data.table('day' = 1:max(DT$time_alive))
DT[, .(day = seq_len(1 + max(time_alive)))][
# my original solution
DT, .(.N) ,on = .(day <= time_alive),by = day, allow.cartesian = T]
# or
DT[,time_alive > TARGET_NUMBER, by = individual]
I have two solutions based on what you have provided. One or both of them should be what you're looking for. See below for details and explanation.
# load in data
DT = data.table('individual' = 1:4, 'time_alive' = c(1,5,7,5))
# set your target number
TARGET_NUMBER = 5
# group by individual,
# then check if the number of days they were alive is greater than your target
# this answers "i want to calculate the number of
# individual with more than "i" days of life
DT[,time_alive > TARGET_NUMBER, by = individual]
individual V1
1: 1 FALSE
2: 2 FALSE
3: 3 TRUE
4: 4 FALSE
# if the result you want is the table you created, that is a little different:
# create a table with days ranging from 1 to the maximum survivor
res = data.table('day' = 1:max(DT$time_alive))
day
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
# use joins
# join by time alive being greater than or equal to the day
# group by the specific day, and count how many observations we have
# allow.cartesian because the mapping isn't one-to-one
res[DT, .(.N) ,on = .(day <= time_alive),by = day, allow.cartesian = T]
day N
1: 1 4
2: 2 3
3: 3 3
4: 4 3
5: 5 3
6: 6 1
7: 7 1
I would approach the problem in a different way.
If you run data.frame(Alive = cumsum(rev(table(c(1,5,7,5))))) (or, in your general case, data.frame(Alive = cumsum(rev(table(DT$time_alive))))), you will have the information you need, with the one caveat that if any day had no deaths, you will end up with gaps in the data.
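One way to avoid that gap caveat (a small sketch of my own, assuming time_alive holds whole days starting at 1) is to tabulate over the full range of days before taking the reversed cumulative sum:
ta     <- c(1, 5, 7, 5)                        # the time_alive values
deaths <- tabulate(ta, nbins = max(ta))        # deaths per day, zeros filled in
data.frame(Day   = seq_len(max(ta)),
           Alive = rev(cumsum(rev(deaths))))   # 4 3 3 3 3 1 1, matching the expected table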
data.table
library(data.table)
DT[, .(Day = seq_len(max(time_alive) + 1))
][, Number := rowSums(outer(Day, DT$time_alive, `<=`))]
# Day Number
# <int> <num>
# 1: 1 4
# 2: 2 3
# 3: 3 3
# 4: 4 3
# 5: 5 3
# 6: 6 1
# 7: 7 1
# 8: 8 0
(I'm assuming that DT will have no more than 1 row per Individual.)
Data
DT <- setDT(structure(list(Individual = c("ID1", "ID2", "ID3", "ID4"), time_alive = c(1L, 5L, 7L, 5L)), class = c("data.table", "data.frame"), row.names = c(NA, -4L)))

Calculate rolling performance in R

I have the following dataframe,
FECHA Pxlast
1 2010-12-31 332.636
2 2011-01-07 334.327
3 2011-01-14 341.771
4 2011-01-21 331.241
5 2011-01-28 333.252
I have to calculate a new column called "Rolling 4 weeks"; the values are based on the following idea: for example, for index 5 it will be pxlast[5]/pxlast[5 - number of weeks] - 1.
That gives the performance over 4 weeks; in this example pxlast[5] = 333.252 and pxlast[5-4] = 332.636, so I divide them and subtract 1, and the result is -0,384.
OK, I can do it using a "for" loop, but while reading about functions that might do this properly I found chart.RollingPerformance from the PerformanceAnalytics package.
It applies a rolling calculation of whatever function is passed in the FUN parameter, for example "mean", which computes the mean over the chosen width, but I don't know how to calculate the performance correctly.
Here is the output dataframe for example.
FECHA Pxlast Rolling4W
1 2010-12-31 332.636 NA
2 2011-01-07 334.327 NA
3 2011-01-14 341.771 NA
4 2011-01-21 331.241 NA
5 2011-01-28 333.252 -0,384
The NA values are there because we are calculating performance over a width of 4 periods (weeks).
Is there any function to do it without loops?
require(data.table)
d <- data.table(x = 1:5, y=c(1, 2, 4, 7, 11))
d[, z := shift(y, 4)]
d[, z := y/z - 1]
d
# x y z
# 1: 1 1 NA
# 2: 2 2 NA
# 3: 3 4 NA
# 4: 4 7 NA
# 5: 5 11 10
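Applied to the question's data (a sketch of my own, assuming the data frame is called df and has the FECHA / Pxlast columns shown above), the same shift() idea would be:
library(data.table)
setDT(df)                                         # convert to a data.table in place
df[, Rolling4W := Pxlast / shift(Pxlast, 4) - 1]  # price over the price 4 rows earlier, minus 1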

How do I search for conditions between rows of a data.table defined by a time lag?

I have a dataset of prescription records. Each row is a prescription for a single drug on a particular day. I have divided the drugs into two groups, with partial overlap. I would like to identify where prescriptions have been issued from both drug groups within 3 days of each other, excluding cases where the same drug has been issued from both group 1 and group 2, and to identify the date of the later drug of each pair.
An example of my data:
library(data.table)
set.seed(10)
DT <- data.table(day = sample(c(1:31), 30, replace = TRUE),
drug_group = sample(c(1, 2), 30, replace = TRUE),
drug_1 = sample(c("A", "B", "C"), 30, replace = TRUE),
drug_2 = sample(c("A", "D", "E"), 30, replace = TRUE))
DT[drug_group == 1, drug := drug_1]
DT[drug_group == 2, drug := drug_2]
DT[, c("drug_1", "drug_2") := NULL]
setkey(DT, day)
so the following:
day drug_group drug
1: 2 1 B
2: 3 1 C
3: 4 1 B
4: 7 2 E
5: 8 1 A
6: 9 2 A
7: 9 2 D
8: 9 1 C
9: 10 1 A
10: 12 1 A
...
24: 22 2 D
25: 22 2 E
26: 24 1 A
27: 25 1 A
28: 26 2 D
29: 26 1 C
30: 27 1 C
I would like to obtain a result like this:
day interaction_present
1: 1 FALSE
2: 2 FALSE
3: 3 FALSE
4: 4 FALSE
5: 5 FALSE
6: 6 FALSE
...
26: 26 TRUE
29: 29 FALSE
30: 30 FALSE
I am pretty sure I could do this by looping over each row in turn, but I have been admonished repeatedly for using loops instead of vectorising, so I wondered whether this type of task is feasible without a loop. I have looked at using the data.table shift() function to set up lags, but I am wary of creating too many new columns since my actual data.table is over a million rows.
Sorry if this is a trivial issue or if it has been asked before but I have been stuck on it all afternoon and I am giving up for the day!
If I am understanding your question correctly, the following should work. The logic decisions within the function, as well as the time-lag start and end variables, can easily be changed.
timelagadj <- function(i) {
  ## this should be changed depending on what you mean by "within"
  ## 3 days. This currently goes i-3, i+3,
  ## but if i-3 or i+3 doesn't exist in dayDT$day
  ## then we pull the value one above/below i-3/i+3 respectively
  start <- max(dayDT$day[i] - 3, dayDT$day[1])
  start <- ifelse(dayDT$day[findInterval(start, dayDT$day)] < start,
                  findInterval(start, dayDT$day) + 1,
                  findInterval(start, dayDT$day))
  end <- min(dayDT$day[i] + 3, dayDT$day[nrow(dayDT)])
  end <- findInterval(end, dayDT$day)
  ## now we pull the relevant group IDs and drug IDs
  gIDs <- dayDT$groupID[start:end]
  dIDs <- dayDT$drugID[start:end]
  ## here we unlist the paste made before
  ## to group by days
  gIDs <- unlist(strsplit(gIDs, "_"))
  dIDs <- unlist(strsplit(dIDs, "_"))
  ## now we can apply our logic rule based
  ## on the criteria you mentioned
  if (length(unique(gIDs)) > 1) {
    tmp <- unique(data.frame(gIDs, dIDs))
    if (length(unique(tmp$gIDs)) != length(unique(tmp$dIDs))) TRUE else FALSE
  } else FALSE
}
dayDT <- DT[,list("drugID"=paste(drug,collapse="_"),"groupID"=paste(drug_group,collapse="_")),by=day]
res <- sapply(1:nrow(dayDT),function(m) timelagadj(m))
res <- dayDT[,list(day,"interaction_present"=res)]
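To spot-check a single day (a small usage sketch of my own, using DT as defined in the question), you can print the ±3-day window the function examines around day 26, which the expected output flags as TRUE:
DT[abs(day - 26) <= 3]   # prescriptions within 3 days of day 26
res[day == 26]           # the corresponding flag in the result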

Subsetting with variable selection range

I have to make a set of selections that vary by day on this dataset (dat), which is composed of species (sp), day (day, in POSIXct) and area (ar):
sp day ar
A 1-Jan-00 2
B 1-Jan-00 6
C 2-Jan-00 2
A 2-Jan-00 1
D 2-Jan-00 4
E 2-Jan-00 12
F 3-Jan-00 8
A 4-Jan-00 3
G 4-Jan-00 2
B 4-Jan-00 1
I need to subset where species "A" occurs. However, the areas to be selected will vary by day, given by this matrix (dat.ar):
day ar.select
1-Jan-00 (1,6)
2-Jan-00 (1,12)
3-Jan-00 (4,8)
4-Jan-00 (3,12)
More specifically, for the areas where species "A" occurs: on 1-Jan-00 I need only areas 1 and 6; for 2-Jan-00, areas 1 and 12; and so on.
As an example, the desired output on this example for this selection is given below:
sp day ar
A 2-Jan-00 1
A 4-Jan-00 3
I haven't had much success writing a for loop, as I am still learning the semantics of R. I have a rough idea of what must be done, but I am still struggling with the language. Here is a sketch of where I think this should go:
dat1 = with(dat,sapply(day[sp=="A" & dat.ar$day.s[i] ],
function(x) ar == (ar[sp=="A" & day == x]==dat.ar$ar.select[j])
final=dat[rowSums(dat1) > 0, ]
I believe I have to write a for loop that goes through dat.ar, specifying the areas to be selected in dat. But despite my efforts, I haven't gotten anywhere near a working loop. I am not even sure whether combining sapply and a for loop is the right way to go about this.
In case someone wishes to reproduce the problem:
sp=c("A","B","C","A","D","E","F","A","G","B")
day=c("1-Jan-00", "1-Jan-00", "2-Jan-00", "2-Jan-00", "2-Jan-00",
"2-Jan-00", "3-Jan-00", "4-Jan-00", "4-Jan-00", "4-Jan-00")
day=as.POSIXct(day, format="%d-%b-%y")
ar=c(2,6,2,1,4,12,8,3,2,1)
dat= as.data.frame(cbind(sp, day, ar))
day.s=c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00")
day.s=as.POSIXct(day.s, format="%d-%b-%y")
a.s=c(1,1,4,3)
a.e=c(6,12,8,12)
ar.select=paste(a.s, a.e, sep=",")
dat.ar=cbind(day.s, ar.select)
Any help is much appreciated.
You could merge your table of conditions with the original dataset and then filter conditionally. Consider a1 and a2 to be like your sp and day values, and obs to be like your ar value.
library(data.table)
dataset <- data.table(
a1 = c("A","B","C","B","A","A","A","A"),
a2 = c("P","Q","Q","Q","R","R","P","Q"),
obs = c(3,2,3,4,2,4,8,0)
)
constraints <- data.table(
a1 = c("A","B","C","A","B","C","A","B","C"),
a2 = c("P","P","P","Q","Q","Q","R","R","R"),
lower = c(1,2,3,4,3,2,3,2,5),
upper = c(6,4,5,7,5,6,5,3,7)
)
checkingdataset <- merge(dataset,constraints, by = c("a1","a2"), all.x = TRUE)
checkingdataset[obs <= upper & obs >= lower, obs.keep := TRUE]
# a1 a2 obs lower upper obs.keep
#1: A P 3 1 6 TRUE
#2: A P 8 1 6 NA
#3: A Q 0 4 7 NA
#4: A R 2 3 5 NA
#5: A R 4 3 5 TRUE
#6: B Q 2 3 5 NA
#7: B Q 4 3 5 TRUE
#8: C Q 3 2 6 TRUE
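To keep only the rows that satisfy their constraint (a small follow-up of my own to the sketch above; note that data.table drops the NA rows in a logical subset):
checkingdataset[obs.keep == TRUE, .(a1, a2, obs)]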
First, I would not use as.data.frame(cbind(...)) to make your data.frames. Second, I would create dat.ar in much the same structure that you have created dat. Third, I would then just use merge to get the result you are looking for.
dat <- data.frame(sp=c("A","B","C","A","D","E","F","A","G","B"),
day=c("1-Jan-00", "1-Jan-00", "2-Jan-00", "2-Jan-00",
"2-Jan-00", "2-Jan-00", "3-Jan-00", "4-Jan-00",
"4-Jan-00", "4-Jan-00"),
ar=c(2,6,2,1,4,12,8,3,2,1))
dat$day <- as.POSIXct(dat$day, format="%d-%b-%y")
day.s <- c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00")
day.s <- as.POSIXct(day.s, format="%d-%b-%y")
a.s <- c(1,1,4,3)
a.e <- c(6,12,8,12)
ar.select <- paste(a.s, a.e, sep=",")
dat.ar <- data.frame(sp = "A", day = day.s, ar = ar.select)
dat.ar <- cbind(dat.ar[-3],
read.csv(text = as.character(dat.ar$ar), header = FALSE))
library(reshape2)
dat.ar <- melt(dat.ar, id.vars=1:2, value.name="ar")
dat.ar
# sp day variable ar
# 1 A 2000-01-01 V1 1
# 2 A 2000-01-02 V1 1
# 3 A 2000-01-03 V1 4
# 4 A 2000-01-04 V1 3
# 5 A 2000-01-01 V2 6
# 6 A 2000-01-02 V2 12
# 7 A 2000-01-03 V2 8
# 8 A 2000-01-04 V2 12
merge(dat, dat.ar)
# sp day ar variable
# 1 A 2000-01-02 1 V1
# 2 A 2000-01-04 3 V1
Of course, I would just suggest that you make your dat.ar object in a more friendly manner to begin with. Why paste values together if you are going to separate them out later anyway? ;)
dat.ar <- data.frame(sp = "A",
day = c("1-Jan-00", "2-Jan-00", "3-Jan-00", "4-jan-00"),
a.s = c(1,1,4,3), a.e = c(6,12,8,12))
dat.ar$day <- as.POSIXct(dat.ar$day, format="%d-%b-%y")
library(reshape2)
dat.ar <- melt(dat.ar, id.vars=1:2, value.name="ar")
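With this friendlier dat.ar, the same merge as before gives the desired rows (a brief usage sketch of my own; the helper variable column from melt can simply be dropped afterwards):
out <- merge(dat, dat.ar)      # same join as above
out[, c("sp", "day", "ar")]    # drop the melt helper column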

aggregate data frame by equal buckets

I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but probably I don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x)
p v
1 10.002904 4
2 10.132200 2
3 10.026105 6
4 10.001146 2
5 9.990267 2
6 10.115907 6
7 10.199895 9
8 9.949996 8
9 10.165848 8
10 9.953283 6
11 10.072947 10
12 10.020379 2
13 10.084002 3
14 9.949108 8
15 10.065247 6
16 9.801699 3
17 10.014612 8
18 9.954638 5
19 9.958256 9
20 10.031041 7
I would like to reduce x to a smaller data.frame where each line contains the weighted average of p, weighted by v, corresponding to an amount of n units of v. Something of this sort:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
f v * p v
1 0 10.039369 98
2 1 9.952049 94
3 2 10.015058 104
4 3 9.938271 103
5 4 9.967244 100
6 5 9.995071 69
First question: I was wondering whether there is a better (more efficient) approach than the code above. The second, more important question is how to correct the code above in order to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately, as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row (the 18th) contains 5 units of v and is fully included in the next bucket. What I would like to achieve instead is to attribute 2 units of row 18 to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p value v times. The way I understand it, the place where the cumulative sum crosses 100 (see below)
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
# Paste these two lines in R to install IRanges
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
require(IRanges)
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# we need "pos" new rows of "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
# grp res cv
# 1: 0 10.037747 100
# 2: 1 9.994648 114
Running on @geektrader's data:
# grp res cv
# 1: 0 9.999680 100
# 2: 1 10.040139 200
# 3: 2 9.976425 300
# 4: 3 10.026622 400
# 5: 4 10.068623 500
# 6: 5 9.982733 562
Here's a benchmark on relatively big data:
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
benchmark(out <- FN1(x), replications=10)
# test replications elapsed relative user.self
# 1 out <- FN1(x) 10 13.817 1 12.586
It takes about 1.4 seconds on 1e5 rows.
If you are looking for precise bucketing, I am assuming the value of p is the same for the two "split" parts of v,
i.e. in your example, the value of p for the 2 units of row 18 that go into the first bucket is 9.954638.
With that assumption, you can do the following for datasets that are not super large.
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector with each value of p repeated v times for each row; the result is combined into a single vector using unlist.
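As an aside (my own note, not part of the answer): rep() is vectorised over both arguments, so the same expansion can be written more directly:
z <- rep(x$p, times = x$v)   # each p repeated v times, equivalent to the mapply/unlist above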
After this, aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
Group.1 x
1 0 9.999680
2 1 10.040139
3 2 9.976425
4 3 10.026622
5 4 10.068623
6 5 9.982733
