R: find data frame index of multiple conditions

Given two data frames s and q with five observations each:
set.seed(8)
s <- data.frame(id = sample(c('Z','X'), 5, T),
                t0 = sample(1:10, 5, T),
                t1 = sample(11:30, 5, T))
q <- data.frame(id = sample(c('Z','X'), 5, T),
                t0 = sample(1:10, 5, T),
                t1 = sample(11:30, 5, T))
> s
id t0 t1
1 Z 8 20
2 Z 3 12
3 X 10 19
4 X 8 21
5 Z 7 13
> q
id t0 t1
1 X 3 30
2 Z 5 12
3 Z 7 23
4 Z 3 21
5 X 7 27
The midpoint for the observations between the variables t0 and t1 is (e.g. for s data):
s$t0+(s$t1-s$t0)/2
To find the index of the (first) observation in s whose midpoint is closest to, say, the first observation in q I can do:
i <- which.min(abs(s$t0 + (s$t1 - s$t0)/2 - (q$t0[1] + (q$t1[1] - q$t0[1])/2)))
s[i,]
gives:
id t0 t1
3 X 10 19
But I cannot figure out how to find the same index in the original data s when I also want to condition on the id variable (pseudo-code: which.min(....) & s$id == q$id[1]; in this case the midpoint is sought only among rows whose id is 'X'). This SO question is close but not spot on.
Again: I need an index into the original 5-row data set.

Set the values passed to which.min to infinity wherever your condition is not met:
val <- abs(s$t0 + (s$t1 - s$t0)/2 - (q$t0[1] + (q$t1[1] - q$t0[1])/2))
val[s$id != q$id[1]] <- Inf
i <- which.min(val)
By the way, you can simplify the expression on the first line to:
val <- abs((s$t0+s$t1)/2-(q$t0[1]+q$t1[1])/2)
or even
val <- abs(s$t0+s$t1-q$t0[1]-q$t1[1])/2
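If you need the nearest match for every row of q (not just the first), the same idea can be applied row by row; a minimal hedged sketch (my addition, nearest is just an illustrative name):
# for each row of q, the index of the closest same-id row of s
nearest <- sapply(seq_len(nrow(q)), function(j) {
  val <- abs(s$t0 + s$t1 - q$t0[j] - q$t1[j]) / 2
  val[s$id != q$id[j]] <- Inf
  which.min(val)
})
s[nearest, ]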


Combining/summing two positions in a vector of integers in R

I have a simple vector of integers in R. I would like to randomly select n positions in the vector and "merge" them (i.e. sum) in the vector. This process could happen multiple times, i.e. in a vector of 100, 5 merging/summing events could occur, with 2, 3, 2, 4, and 2 vector positions being merged in each event, respectively. For instance:
#An example original vector of length 10:
ex.have<-c(1,1,30,16,2,2,2,1,1,9)
#For simplicity assume some process randomly combines the
#first two [1,1] and last three [1,1,9] positions in the vector.
ex.want<-c(2,30,16,2,2,2,11)
#Here, there were two merging events of 2 and 3 vector positions, respectively
#EDIT: the merged positions do not need to be consecutive.
#They could be randomly selected from any position.
But in addition I also need to record how many vector positions were "merged," (including the value 1 if the position in the vector was not merged) - terming them indices. Since the first two were merged and the last three were merged in the example above, the indices data would look like:
ex.indices<-c(2,1,1,1,1,1,3)
Finally, I need to put it all in a matrix, so the final data in the example above would be a 2-column matrix with the integers in one column and the indices in another:
ex.final<-matrix(c(2,30,16,2,2,2,11,2,1,1,1,1,1,3),ncol=2,nrow=7)
At the moment I am seeking assistance even on the simplest step: combining positions in the vector. I have tried multiple variations on the sample and split functions, but am hitting a dead end. For instance, sum(sample(ex.have, 2)) will sum two randomly selected positions (or sum(sample(ex.have, rpois(1, 2))) will add some randomness to the n values), but I am unsure how to leverage this to achieve the desired dataset. An exhaustive search has led to multiple articles on combining vectors, but not positions in vectors, so I apologize if this is a duplicate. Any advice on how to approach any of this would be much appreciated.
Here is a function I designed to perform the task you described.
The vec_merge function takes the following arguments:
x: an integer vector.
event_perc: The percentage of positions involved in merge events. This is a number between 0 and 1 (although 1 is probably too large). The number of events is calculated as the length of x multiplied by event_perc.
sample_n: The merge sample sizes. This is an integer vector whose values are all at least 2.
vec_merge <- function(x, event_perc = 0.2, sample_n = c(2, 3)){
  # Check if event_perc makes sense
  if (event_perc > 1 | event_perc <= 0){
    stop("event_perc should be between 0 and 1.")
  }
  # Check if sample_n makes sense
  if (any(sample_n < 2)){
    stop("sample_n should be at least 2.")
  }
  # Determine the number of events
  n <- round(length(x) * event_perc)
  # Determine the sample size of each event
  sample_vec <- sample(sample_n, size = n, replace = TRUE)
  names(sample_vec) <- paste0("S", 1:n)
  # Check if the sum of sample_vec is larger than the length of x
  # If yes, stop the function and print a message
  if (length(x) < sum(sample_vec)){
    stop("Too many samples. Decrease event_perc or sample_n.")
  }
  # Determine the number of positions that will not be merged
  n2 <- length(x) - sum(sample_vec)
  # Create a vector of 1s, one per unmerged position
  non_merge_vec <- rep(1, n2)
  names(non_merge_vec) <- paste0("N", 1:n2)
  # Combine sample_vec and non_merge_vec, then randomly shuffle the result
  combine_vec <- c(sample_vec, non_merge_vec)
  combine_vec2 <- sample(combine_vec, size = length(combine_vec))
  # Expand the vector so each position of x gets a group label
  expand_list <- list(lengths = combine_vec2, values = names(combine_vec2))
  expand_vec <- inverse.rle(expand_list)
  # Create a data frame with x and expand_vec
  dat <- data.frame(number = x,
                    group = factor(expand_vec, levels = unique(expand_vec)))
  dat$index <- 1
  dat2 <- aggregate(cbind(dat$number, dat$index),
                    by = list(group = dat$group),
                    FUN = sum)
  # Convert dat2 to a matrix, dropping the group column
  dat2$group <- NULL
  mat <- as.matrix(dat2)
  return(mat)
}
Here is a test for the function. I applied the function to the sequence from 1 to 10. As you can see, in this example, 4 and 5 are merged, and 8 and 9 are also merged.
set.seed(123)
vec_merge(1:10)
# number index
# [1,] 1 1
# [2,] 2 1
# [3,] 3 1
# [4,] 9 2
# [5,] 6 1
# [6,] 7 1
# [7,] 17 2
# [8,] 10 1
I suppose you could write a function like the following:
fun <- function(vec = have, events = merge_events, include_orig = TRUE) {
  if (sum(events) > length(vec)) stop("Too many events to merge")
  # Create "groups" for the events
  merge_events_seq <- rep(seq_along(events), events)
  # Create "groups" for the rest of the data
  remainder <- sequence(length(vec) - sum(events)) + length(events)
  # Combine both groups and shuffle them so that the
  # positions being combined are not necessarily consecutive
  inds <- sample(c(merge_events_seq, remainder))
  # Aggregate using `data.table`
  temp <- data.table(values = vec, groups = inds)[
    , list(count = length(values),
           total = sum(values),
           pos = toString(.I),
           original = toString(values)), groups][, groups := NULL]
  # Drop the other columns if required. Return the output.
  if (isTRUE(include_orig)) temp[] else temp[, c("original", "pos") := NULL][]
}
The function returns four columns:
The count of values that were included in a particular sum (your ex.indices).
The total after summing relevant values (your ex.want).
The positions of the original values from the input vector.
The original values themselves, in case you want to verify it later.
The last two columns can be dropped from the result by setting include_orig = FALSE. The function will also produce an error if the number of elements you're trying to merge exceeds the length of the input (ex.have) vector.
Here's some sample data:
library(data.table)
set.seed(1) ## So you can recreate these examples with the same results
have <- sample(20, 10, TRUE)
have
## [1] 4 7 1 2 11 14 18 19 1 10
merge_events <- c(2, 3)
fun(have, merge_events)
## count total pos original
## 1: 1 4 1 4
## 2: 1 7 2 7
## 3: 2 2 3, 9 1, 1
## 4: 1 2 4 2
## 5: 3 40 5, 8, 10 11, 19, 10
## 6: 1 14 6 14
## 7: 1 18 7 18
fun(events = c(3, 4))
## count total pos original
## 1: 4 39 1, 4, 6, 8 4, 2, 14, 19
## 2: 3 36 2, 5, 7 7, 11, 18
## 3: 1 1 3 1
## 4: 1 1 9 1
## 5: 1 10 10 10
fun(events = c(6, 4, 3))
## Error: Too many events to merge
input <- sample(30, 20, TRUE)
input
## [1] 6 10 10 6 15 20 28 20 26 12 25 23 6 25 8 12 25 23 24 6
fun(input, events = c(4, 7, 2, 3))
## count total pos original
## 1: 7 92 1, 3, 4, 5, 11, 19, 20 6, 10, 6, 15, 25, 24, 6
## 2: 1 10 2 10
## 3: 3 71 6, 9, 14 20, 26, 25
## 4: 4 69 7, 12, 13, 16 28, 23, 6, 12
## 5: 2 45 8, 17 20, 25
## 6: 1 12 10 12
## 7: 1 8 15 8
## 8: 1 23 18 23
# Verification
input[c(1, 3, 4, 5, 11, 19, 20)]
## [1] 6 10 6 15 25 24 6
sum(.Last.value)
## [1] 92

How to read out max values for a number of variables and form the mean using spatially close values in R?

Ahoy,
below is a df similar to the one I have to work with but way smaller:
(I left out a lot of rows to make it easier on the eyes.)
x y variable values
1 1 5 a 9
2 2 5 a 2
3 3 5 a 9
4 4 5 a 8
5 5 5 a 4
...
22 2 1 a 7
23 3 1 a 9
24 4 1 a 7
25 5 1 a 10
26 1 5 b 7
27 2 5 b 8
...
48 3 1 b 8
49 4 1 b 7
50 5 1 b 2
The df above is created by an fluorescence plate reader which scans light intensity within an area by dividing it in into 25 sectors (5x5) and measuring each sector individually giving one value each. The order of measurements is upper left corner sector first and lower right corner sector last. To make it more graphical:
01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
Filling in the values from the df given above (+coordinates), it would look like this:
(5) ___ 9 - 2 - 9 - 8 - 4
(4) ___ 7 - 7 - 2 - 5 - 3
(3) ___ 5 - 4 - 7 - 8 - 9
(2) ___ 6 - 6 - 3 - 5 - 9
(1) ___ 4 - 7 - 9 - 7 - 10
(y^,x>) (1) (2) (3) (4) (5)
What I need is to read out the max value for each variable and calculate the mean of this value and the (up to) 9 fields surrounding it. In the area/"variable" above ("a"), the highest value of a sector is 10 in the lower right corner, which is surrounded by the values 5, 9 and 7. Hence the result I am looking for for variable "a" is 7.75 ((5+9+7+10)/4).
I imagine the code to resemble something like this (I am aware that this is not how you write r, but I don't know better):
mean(max value within variable,value at x(of max value within variable)-1,y(of max value within variable)),value at x(of max value within variable)-1,y(of max value within variable)+1) .....
The next challenge is that the instrument will perform scans of 96 areas (= "variables"). And ideally I need a solution that automatically gives me this special mean value for every variable without me having to write almost identical code 96 times.
I know this is asking a bit much but I have been working on it for a while and I just cant come up with a solution or even a good way of googling it.
Thank you very much for any help!
Tim,
Ps: Using this R code creates a random version of the df I present above:
df <- data.frame(x = c(1:5), y = rep(c(5:1), each=5),variable = rep(c("a", "b"), each=25 ), values = floor(runif(50, min=1, max=10)))
This updated answer will provide the mean value of the maximum value and its up to 9 surrounding values, within each variable group.
library(dplyr)
# Create the function
get.means <- function(df){
  # Get a data frame of rows with the maximum value
  max.rows <- df[df$values == max(df$values), ]
  # Create an empty data frame
  means.df <- data.frame(variable = character(), x = integer(), y = integer(),
                         value = numeric(), mean = numeric(), stringsAsFactors = FALSE)
  # Create an iterator for the data frame
  iterator <- 1
  # Loop through each row of the maximum value data frame
  for(i in c(1:nrow(max.rows))){
    # Get the x value for the current row
    x <- max.rows$x[i]
    # Get the y value for the current row
    y <- max.rows$y[i]
    # Set the range of x values to process based on the x coordinate
    if(x == 1){
      x.range <- c(1, 2)
    } else if(x == 5){
      x.range <- c(4, 5)
    } else{
      x.range <- c(x-1, x, x+1)
    }
    # Set the range of y values to process based on the y coordinate
    if(y == 1){
      y.range <- c(1, 2)
    } else if(y == 5){
      y.range <- c(4, 5)
    } else{
      y.range <- c(y-1, y, y+1)
    }
    # Get a matrix of the values from the original data frame
    # that fall in both the current y and x ranges
    vals <- as.matrix(df[(df$y %in% y.range) & (df$x %in% x.range), 'values'])
    # Get the mean of the values
    mean.val <- mean(vals)
    # Insert the current variable value into the data frame for the new row
    means.df[iterator, 'variable'] <- as.character(max.rows$variable[i])
    # Insert the current x, y, value, and mean values for the new row
    means.df[iterator, c('x','y','value', 'mean')] <- c(x, y, max.rows$values[i], mean.val)
    # Increment the iterator
    iterator <- iterator + 1
  }
  # Return the final data frame
  return(means.df)
}
# Create a test data frame
df <- data.frame(x = c(1:5), y = rep(c(5:1), each=5),variable = rep(c("a", "b"), each=25 ), values = floor(runif(50, min=1, max=10)))
# Get the means for each max value within the variable grouping
df1 <- df %>%
group_by(variable) %>%
do(get.means(.))
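As a cross-check, here is a hedged base-R sketch (my addition, not part of the answer above) that reshapes each variable's 25 values into a 5x5 matrix and averages the 3x3 window around its maximum; it assumes a complete 5x5 grid per variable and uses only the first maximum if there are ties:
neigh_mean <- function(d) {
  m <- matrix(NA_real_, nrow = 5, ncol = 5)
  m[cbind(d$y, d$x)] <- d$values                  # row index = y, column index = x
  idx <- which(m == max(m), arr.ind = TRUE)[1, ]  # position of the (first) maximum
  rows <- max(1, idx[1] - 1):min(5, idx[1] + 1)   # clip the 3x3 window at the grid edges
  cols <- max(1, idx[2] - 1):min(5, idx[2] + 1)
  mean(m[rows, cols])
}
# one neighbourhood mean per variable
sapply(split(df, df$variable), neigh_mean)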

How to pass a multivariate vector valued function (with variable length output) to aggregate

I have a data frame in R that I want to aggregate. The summary function that I want to apply to each subset is a custom function that takes several variables (columns) as input, and returns a vector or list of variable length. As an output, I would like to have a data frame with a column of the grouping variable, and a single other column containing the output vector (of varying length).
To give a mock example, suppose I have the following dataframe:
df <- data.frame(particle = c(rep("X",5), rep("Y",3), rep("Z",4)),
                 time = c(1:5, 1:3, 1:4),
                 state = c(c("A","A","B","C","A"), c("A","B","B"), c("B","C","A","A")),
                 energy = round(runif(12, 0, 10)))
> df
particle time state energy
1 X 1 A 9
2 X 2 A 8
3 X 3 B 7
4 X 4 C 5
5 X 5 A 0
6 Y 1 A 1
7 Y 2 B 7
8 Y 3 B 7
9 Z 1 B 3
10 Z 2 C 9
11 Z 3 A 5
12 Z 4 A 6
I would like to obtain for each particle a list of the energy they had every time they changed state. The output I'm looking for is something like this:
particle energy
1 X c(9,7,5,0)
2 Y c(1,7)
3 Z c(3,9,5)
To do so, I would define a function like the following:
myfun <- function(state, energy){
  tempstate <- state[1]
  energyvec <- energy[1]
  for(i in 2:length(state)){
    if(state[i] != tempstate){
      energyvec <- c(energyvec, energy[i])
      tempstate <- state[i]
    }
  }
  return(energyvec)
}
And try to pass it to aggregate somehow
The two data structures I tried for this are data.frame and data.table.
In data.frame, using a custom function that returns a vector seems to give the correct output format I am looking for, that is where the output column is really a list, and each row contains a list with the output of the function. However, I can't seem to pass several columns to the function when aggregating this way.
With a data.table, the aggregation is easier to do when considering a function of several variables. However, I can't seem to obtain the output I'm looking for. Indeed,
dt <- data.table(df)
dt[, myfun(state, energy), by = particle]
only returns the first element of energyvec (instead of a vector), and
dt <- data.table(df)
dt[, as.list(myfun(state, energy)), by = particle]
doesn't work as the outputs don't all have the same length.
Is there an alternative way to go to accomplish this?
Thank you very much in advance for all your help!
Here's a tidyverse approach:
library(tidyverse)
df <- data.frame(particle = c(rep("X",5), rep("Y",3), rep("Z",4)),
                 time = c(1:5, 1:3, 1:4),
                 state = c(c("A","A","B","C","A"), c("A","B","B"), c("B","C","A","A")),
                 energy = round(runif(12, 0, 10)))
# Hard-code energy to make this reproducible
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
df %>%
  group_by(particle) %>%
  mutate(
    changed_state = coalesce(state != lag(state, 1), TRUE)
  ) %>%
  filter(changed_state) %>%
  summarise(
    string = toString(energy)
  )
#> # A tibble: 3 x 2
#> particle string
#> <fct> <chr>
#> 1 X 9, 7, 5, 0
#> 2 Y 1, 7
#> 3 Z 3, 9, 5
I'd run each line of the pipe individually. Basically, create a changed_state variable by checking whether the current state differs from the previous state lag(state, 1). Since we only care when this happens, we filter where this is TRUE (a more verbose version would be filter(changed_state == TRUE)). The toString function collapses the rows of energy as desired, and we are already "grouped" by particle.
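If you would rather have the energies as an actual list-column of numeric vectors than a collapsed string, a small hedged variation (my addition, using the same df and libraries as above) is to summarise into a list:
df %>%
  group_by(particle) %>%
  mutate(changed_state = coalesce(state != lag(state, 1), TRUE)) %>%
  filter(changed_state) %>%
  summarise(energy = list(energy))   # one numeric vector per particle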
data.table approach
sample data
#stolen from JasonAizkalns's answer
df <- data.frame(particle = c(rep("X",5), rep("Y",3), rep("Z",4)),
                 time = c(1:5, 1:3, 1:4),
                 state = c(c("A","A","B","C","A"), c("A","B","B"), c("B","C","A","A")),
                 energy = round(runif(12, 0, 10)))
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
code
library( data.table )
#create data.table
dt <- as.data.table(df)
#use `uniqlist` to get rownumbers where the value of `state` changes,
# then get these rows into a subset
result <- dt[ data.table:::uniqlist(dt[, c("particle", "state")]), ]
#split the resulting `energy`-column by the contents of the `particle`-column
l <- split( result$energy, result$particle)
# $X
# [1] 9 7 5 0
#
# $Y
# [1] 1 7
#
# $Z
# [1] 3 9 5
#create final output
data.table( particle = names(l), energy = l )
# particle energy
# 1: X 9,7,5,0
# 2: Y 1,7
# 3: Z 3,9,5
Another possible data.table approach
library(data.table)
setDT(DF)[, .(energy=.(.SD[, first(energy), by=.(rleid(state))]$V1)), by=.(particle)]
output:
particle energy
1: X 9,4,6,9
2: Y 2,9
3: Z 7,6,1
data:
set.seed(0L)
DF <- data.frame(particle = c(rep("X",5), rep("Y",3), rep("Z",4)),
                 time = c(1:5, 1:3, 1:4),
                 state = c(c("A","A","B","C","A"), c("A","B","B"), c("B","C","A","A")),
                 energy = round(runif(12, 0, 10)))
DF
# particle time state energy
# 1 X 1 A 9
# 2 X 2 A 3
# 3 X 3 B 4
# 4 X 4 C 6
# 5 X 5 A 9
# 6 Y 1 A 2
# 7 Y 2 B 9
# 8 Y 3 B 9
# 9 Z 1 B 7
# 10 Z 2 C 6
# 11 Z 3 A 1
# 12 Z 4 A 2
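For readers unpacking the one-liner above, a hedged step-by-step equivalent (my annotation, using the same DF) might look like this:
library(data.table)
dt <- as.data.table(DF)
dt[, run := rleid(state), by = particle]            # new run id whenever state changes
runs <- dt[, first(energy), by = .(particle, run)]  # energy at the start of each run
runs[, .(energy = .(V1)), by = particle]            # wrap each particle's energies in a list column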

Finding paired values from the rows throughout the dataframe in R

I am trying to polish my R skills and have sort of hit my limit.
The issue I am trying to solve is as follows.
Suppose my dataframe is as below:
n = c(2, 15, 31 , 33)
n2 = c( 10 , 9, 10 , 40)
n3 = c( 11 , 10 , 11 , 42)
df = data.frame(n , n2 , n3)
> df
n n2 n3
1 2 10 11
2 15 9 10
3 31 10 11
4 33 40 42
If I would like to go through each row, generate a random pair (e.g. 2, 10), and then go through each of the remaining rows to find a repeated pair, how can I print out the common pairs and their number of occurrences?
In the above example , the only pair that repeat is 10 and 11 at rows 1 and 3.
So far I have thought about the pseudo code as follows
for(each row in the dataframe)
{
for (each of the values in the row)
{
for every pair
}
find a repeated pair
if found store in a dataframe
}
and I plan to find the random pairs using the combn function.
But I am a little lost on how to iterate through the dataframe rows.
Pls help.
Thanks a lot!
I think this is what you want. Instead of thinking about selecting every combination of two values for each row, we'll get every combination of two column numbers - which will be the same for every row. Then we use plyr::count as a convenience function to count rows with the same values for an entire data frame at once. This way we can loop over the combinations of column indices rather than over rows. I use apply, but you could write it as a for loop instead.
pairs = combn(ncol(df), m = 2)
result = apply(pairs, MAR = 2, FUN = function(p) {
  plyr::count(df[p])
})
names(result) = apply(pairs, MAR = 2, FUN = paste, collapse = "_")
The result is a list where each item is a data frame with the two value columns and a freq column giving the number of rows in the original data in which each value pair occurred.
result
# $`1_2`
# n n2 freq
# 1 2 10 1
# 2 15 9 1
# 3 31 10 1
# 4 33 40 1
#
# $`1_3`
# n n3 freq
# 1 2 11 1
# 2 15 10 1
# 3 31 11 1
# 4 33 42 1
#
# $`2_3`
# n2 n3 freq
# 1 9 10 1
# 2 10 11 2
# 3 40 42 1
If you want to omit the values that aren't repeated, we can just subset them out:
lapply(result, subset, freq > 1)
# $`1_2`
# [1] n n2 freq
# <0 rows> (or 0-length row.names)
#
# $`1_3`
# [1] n n3 freq
# <0 rows> (or 0-length row.names)
#
# $`2_3`
# n2 n3 freq
# 2 10 11 2
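If you only want the pairs that actually repeat, a small hedged follow-up (my addition) drops the empty data frames:
Filter(function(d) nrow(d) > 0, lapply(result, subset, freq > 1))
# $`2_3`
#   n2 n3 freq
# 2 10 11    2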
Slightly different method
n = c(2, 15, 31 , 15) # changed the dataset to have some common pairs in n and n2 too
n2 = c( 10 , 9, 10 , 9)
n3 = c( 11 , 10 , 11 , 42)
df = data.frame(n , n2 , n3)
library(dplyr)
library(rlang)
library(utils)
cols <- colnames(df) # define the columns that you want to do the pair checking for
combinations <- as.data.frame(combn(cols, 2), stringsAsFactors = FALSE)
# picks up all combinations of columns
# iterate over each pair of columns
all_combs <- lapply(names(combinations), function(x){
  df %>%
    group_by(!! sym(combinations[[x]][1]), !! sym(combinations[[x]][2])) %>%
    filter(n() > 1)
  # groups by the two columns and keeps the pairs that occur more than once;
  # add a distinct() call if you don't want them repeated
})
# all_combs is in a list format; use rbind to convert it into a dataframe
all_combs_df <- do.call("rbind", all_combs)
all_combs_df
the output is this
n n2 n3
<dbl> <dbl> <dbl>
1 15. 9. 10.
2 15. 9. 42.
3 2. 10. 11.
4 31. 10. 11.
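A small hedged tweak (my addition, not part of the answer above, assuming dplyr and rlang are loaded as shown): tag each result with the column pair it came from, so rows from different pairs can be told apart after they are combined:
all_combs <- lapply(combinations, function(p) {
  df %>%
    group_by(!! sym(p[1]), !! sym(p[2])) %>%
    filter(n() > 1) %>%
    ungroup() %>%
    mutate(pair = paste(p, collapse = "_"))   # e.g. "n2_n3"
})
bind_rows(all_combs)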

aggregate data frame by equal buckets

I would like to aggregate an R data.frame by equal amounts of the cumulative sum of one of the variables in the data.frame. I googled quite a lot, but probably I don't know the correct terminology to find anything useful.
Suppose I have this data.frame:
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> head(x, 20)
p v
1 10.002904 4
2 10.132200 2
3 10.026105 6
4 10.001146 2
5 9.990267 2
6 10.115907 6
7 10.199895 9
8 9.949996 8
9 10.165848 8
10 9.953283 6
11 10.072947 10
12 10.020379 2
13 10.084002 3
14 9.949108 8
15 10.065247 6
16 9.801699 3
17 10.014612 8
18 9.954638 5
19 9.958256 9
20 10.031041 7
I would like to reduce the x to a smaller data.frame where each line contains the weighted average of p, weighted by v, corresponding to an amount of n units of v. Something of this sort:
> n <- 100
> cum.v <- cumsum(x$v)
> f <- cum.v %/% n
> x.agg <- aggregate(cbind(v*p, v) ~ f, data=x, FUN=sum)
> x.agg$'v * p' <- x.agg$'v * p' / x.agg$v
> x.agg
f v * p v
1 0 10.039369 98
2 1 9.952049 94
3 2 10.015058 104
4 3 9.938271 103
5 4 9.967244 100
6 5 9.995071 69
First question, I was wondering if there is a better (more efficient) approach to the code above. The second, more important, question is how to correct the code above in order to obtain more precise bucketing. Namely, each row in x.agg should contain exactly 100 units of v, not just approximately as is the case above. For example, the first row contains the aggregate of the first 17 rows of x, which correspond to 98 units of v. The next row (18th) contains 5 units of v and is fully included in the next bucket. What I would like to achieve instead is to attribute 2 units of row 18 to the first bucket and the remaining 3 units to the following one.
Thanks in advance for any help provided.
Here's another method that does this without repeating each p v times. The way I understand it, the place where the cumulative sum crosses 100 (see below)
18 9.954638 5 98
19 9.958256 9 107
should be changed to:
18 9.954638 5 98
19.1 9.958256 2 100 # ---> 2 units will be considered with previous group
19.2 9.958256 7 107 # ----> remaining 7 units will be split for next group
The code:
n <- 100
# get cumulative sum, an id column (for retrace) and current group id
x <- transform(x, cv = cumsum(x$v), id = seq_len(nrow(x)), grp = cumsum(x$v) %/% n)
# Paste these two lines in R to install IRanges
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
require(IRanges)
ir1 <- successiveIRanges(x$v)
ir2 <- IRanges(seq(n, max(x$cv), by=n), width=1)
o <- findOverlaps(ir1, ir2)
# gets position where multiple of n(=100) occurs
# (where we'll have to do something about it)
pos <- queryHits(o)
# how much do the values differ from multiple of 100?
val <- start(ir2)[subjectHits(o)] - start(ir1)[queryHits(o)] + 1
# we need "pos" new rows of "pos" indices
x1 <- x[pos, ]
x1$v <- val # corresponding values
# reduce the group by 1, so that multiples of 100 will
# belong to the previous row
x1$grp <- x1$grp - 1
# subtract val in the original data x
x$v[pos] <- x$v[pos] - val
# bind and order them
x <- rbind(x1,x)
x <- x[with(x, order(id)), ]
# remove unnecessary entries
x <- x[!(duplicated(x$id) & x$v == 0), ]
x$cv <- cumsum(x$v) # updated cumsum
x$id <- NULL
require(data.table)
x.dt <- data.table(x, key="grp")
x.dt[, list(res = sum(p*v)/sum(v), cv = tail(cv, 1)), by=grp]
Running on your data:
# grp res cv
# 1: 0 10.037747 100
# 2: 1 9.994648 114
Running on #geektrader's data:
# grp res cv
# 1: 0 9.999680 100
# 2: 1 10.040139 200
# 3: 2 9.976425 300
# 4: 3 10.026622 400
# 5: 4 10.068623 500
# 6: 5 9.982733 562
Here's a benchmark on relatively big data:
set.seed(12345)
x <- data.frame(cbind(p=rnorm(1e5, 10, 0.1), v=round(runif(1e5, 1, 10))))
require(rbenchmark)
benchmark(out <- FN1(x), replications=10)
# test replications elapsed relative user.self
# 1 out <- FN1(x) 10 13.817 1 12.586
It takes about 1.4 seconds on 1e5 rows.
If you are looking for precise bucketing, I am assuming the value of p is the same for the 2 "split" units of v,
i.e. in your example, the value of p for the 2 units of row 18 that go into the first bucket is 9.954638.
With the above assumption, you can do the following for not-super-large datasets.
> set.seed(12345)
> x <- data.frame(cbind(p=rnorm(100, 10, 0.1), v=round(runif(100, 1, 10))))
> z <- unlist(mapply(function(x,y) rep(x,y), x$p, x$v, SIMPLIFY=T))
This creates a vector in which each value of p is repeated v times for each row; the results are combined into a single vector using unlist.
After this, aggregation is trivial using the aggregate function:
> aggregate(z, by=list((1:length(z)-0.5)%/%100), FUN=mean)
Group.1 x
1 0 9.999680
2 1 10.040139
3 2 9.976425
4 3 10.026622
5 4 10.068623
6 5 9.982733
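As a hedged simplification (my addition): rep() is already vectorised over its times argument, so the mapply/unlist step can be written more directly; the bucket id below is equivalent to the (1:length(z) - 0.5) %/% 100 used above.
# repeat each p value v times, then average within blocks of 100 units
z <- rep(x$p, x$v)
aggregate(z, by = list(Group.1 = (seq_along(z) - 1) %/% 100), FUN = mean)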
