I am working with the R programming language.
I simulated this dataset which contains 1000 coin flips - then I calculated the number of "2 Flip Sequences":
Coin <- c('H', 'T')
Results = sample(Coin,1000, replace = TRUE)
My_Data = data.frame(id = 1:1000, Results)
Pairs = data.frame(first = head(My_Data$Results, -1), second = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second Freq
1 H H 255
2 T H 245
3 H T 246
4 T T 253
I am curious - is it possible to extend the above code for "3 Flip Sequences"?
For example - I tried modifying parts of the code to see how the results change (and hoped to stumble across the correct way to write this code):
# First Attempt
Pairs = data.frame(first = head(My_Data$Results, -1), second = head(My_Data$Results, -1) , third = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second third Freq
1 H H H 255
2 T H H 245
3 H T H 0
4 T T H 0
5 H H T 0
6 T H T 0
7 H T T 246
8 T T T 253
# Second Attempt
Pairs = data.frame(first = head(My_Data$Results, -1), second = tail(My_Data$Results, -1) , third = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second third Freq
1 H H H 255
2 T H H 0
3 H T H 0
4 T T H 245
5 H H T 246
6 T H T 0
7 H T T 0
8 T T T 253
I am not sure which of these options, if either, is correct.
In general, I am looking to understand the logic as to how I can adapt the above code for an "arbitrary number of coin flips" (e.g. "4 flip sequences", "5 flip sequences", etc.)
Also, this might not be the most efficient way to calculate these frequencies - I would also be interested in learning about other ways that might be more efficient ( e.g. as the overall size of the data increases).
Thanks!
It might be helpful to work with strings.
coin <- c("H", "T")
results <- sample(coin, 1000, replace = TRUE)
Then to get sequence counts (assuming overlapping sequences also count) for triples, we could do something like:
triples <- table(
sapply(
1:(length(results) - 2),
function(i) sprintf(
"%s%s%s",
results[i],
results[i + 1],
results[i + 2]
)
)
)
which gives me something like:
HHH HHT HTH HTT THH THT TTH TTT
132 129 138 115 129 124 116 114
This idea could be generalized fairly easily, for example:
n_sequences <- function(n, results) {
helper <- function(i, n) if (n < 1) "" else sprintf(
"%s%s",
helper(i, n - 1),
results[i + n - 1]
)
result <- data.frame(
table(
sapply(
1:(length(results) - n + 1),
function(i) helper(i, n)
)
)
)
colnames(result) <- c("Sequence", "Frequency")
result
}
For example:
n_sequences(5, results)
Gives me something like:
Sequence Frequency
1 HHHHH 34
2 HHHHT 31
3 HHHTH 36
4 HHHTT 31
5 HHTHH 35
6 HHTHT 36
7 HHTTH 20
8 HHTTT 37
9 HTHHH 35
10 HTHHT 34
11 HTHTH 41
12 HTHTT 27
13 HTTHH 27
14 HTTHT 24
15 HTTTH 34
16 HTTTT 30
17 THHHH 31
18 THHHT 36
19 THHTH 36
20 THHTT 26
21 THTHH 34
22 THTHT 32
23 THTTH 31
24 THTTT 27
25 TTHHH 32
26 TTHHT 28
27 TTHTH 25
28 TTHTT 31
29 TTTHH 33
30 TTTHT 31
31 TTTTH 30
32 TTTTT 20
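To connect this back to the data-frame layout in the question: both attempts there reuse the same offset for two of the three columns, which is why half of the combinations come out as 0. Each column needs its own offset. Below is a minimal sketch of that idea for any window length; `count_sequences` and the `flip1`, `flip2`, ... column names are made up for illustration, and overlapping windows are assumed to count.

```r
# Count overlapping n-flip sequences, one column per position in the window.
# count_sequences and the flipK column names are hypothetical, not from the question.
count_sequences <- function(results, n) {
  m <- length(results) - n + 1L            # number of overlapping windows
  stopifnot(n >= 1L, m >= 1L)
  # Column k holds the flip at offset k - 1 from the start of each window.
  cols <- lapply(seq_len(n), function(k) results[k:(k + m - 1L)])
  names(cols) <- paste0("flip", seq_len(n))
  as.data.frame(table(as.data.frame(cols)))
}

flips <- c("H", "H", "T", "H", "H", "T")
triples <- count_sequences(flips, 3)
triples[triples$Freq > 0, ]
```

For n = 3 the three columns are equivalent to `head(x, -2)`, `x[2:(length(x) - 1)]` and `tail(x, -2)`. The table has 2^n rows, so for long sequences the string-based version above is likely the more practical choice.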
You could first cut the row indices at 3 + 1 breaks and split the results along the resulting levels. The interaction of the pieces can then be tabulated to get the result.
My_Data$cut3 <- cut(seq_len(nrow(My_Data)), seq.int(1, nrow(My_Data), length.out=3 + 1), include.lowest=TRUE)
(res <- interaction(split(My_Data$Results, My_Data$cut3)) |> table() |> as.data.frame())
# Var1 Freq
# 1 H.H.H 51
# 2 T.H.H 58
# 3 H.T.H 43
# 4 T.T.H 49
# 5 H.H.T 38
# 6 T.H.T 51
# 7 H.T.T 64
# 8 T.T.T 46
To get the desired output, we can strsplit Var1.
strsplit(as.character(res$Var1), '\\.') |> do.call(what=rbind) |>
cbind.data.frame(res$Freq) |> setNames(c('first', 'second', 'third', 'Freq'))
# first second third Freq
# 1 H H H 51
# 2 T H H 58
# 3 H T H 43
# 4 T T H 49
# 5 H H T 38
# 6 T H T 51
# 7 H T T 64
# 8 T T T 46
Note that the number of rows in your data should be divisible by 3.
Edit
To generalize, we may write a small function.
f <- \(x, n) {
ct <- cut(seq_len(nrow(x)), seq.int(1L, nrow(x), length.out=n + 1L), include.lowest=TRUE)
res <- interaction(split(x$Results, ct)) |> table() |> as.data.frame()
strsplit(as.character(res$Var1), '\\.') |> do.call(what=rbind) |>
cbind.data.frame(res$Freq) |> setNames(c(LETTERS[seq_len(n)], 'Freq'))
}
f(My_Data, 4)
# A B C D Freq
# 1 H H H H 13
# 2 T H H H 25
# 3 H T H H 18
# 4 T T H H 17
# 5 H H T H 18
# 6 T H T H 15
# 7 H T T H 21
# 8 T T T H 24
# 9 H H H T 26
# 10 T H H T 15
# 11 H T H T 16
# 12 T T H T 18
# 13 H H T T 22
# 14 T H T T 18
# 15 H T T T 10
# 16 T T T T 24
Data:
set.seed(42)
My_Data <- data.frame(id=1:1200, Results=sample(c('H', 'T'), 1200, replace=TRUE))
A slightly generalized solution with tidyverse tools. Change the sets variable for longer or shorter sequences.
coin <- c("H", "T")
sets <- 4
rolls <- 10000
results <- sample(coin, sets * rolls, replace = TRUE)
named_results <- purrr::map_chr(
0:(rolls - 1),
~ paste0(results[(sets * .x + 1):(sets * .x + sets)],
collapse = ""
)
)
dplyr::count(tibble::tibble(x = named_results), x)
with output
# A tibble: 16 x 2
x n
<chr> <int>
1 HHHH 629
2 HHHT 627
3 HHTH 638
4 HHTT 599
5 HTHH 602
6 HTHT 633
7 HTTH 596
8 HTTT 661
9 THHH 631
10 THHT 589
11 THTH 633
12 THTT 647
13 TTHH 660
14 TTHT 637
15 TTTH 623
16 TTTT 595
sets = 8 would give something like
# A tibble: 256 x 2
x n
<chr> <int>
1 HHHHHHHH 37
2 HHHHHHHT 36
3 HHHHHHTH 43
4 HHHHHHTT 35
5 HHHHHTHH 38
6 HHHHHTHT 27
7 HHHHHTTH 32
8 HHHHHTTT 28
9 HHHHTHHH 33
10 HHHHTHHT 38
# ... with 246 more rows
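For comparison, the same non-overlapping blocks can be counted in base R by reshaping the flips into a matrix with one block per row. This is only a sketch: `block_counts` is a hypothetical helper name, and it assumes that a trailing incomplete block should simply be dropped.

```r
# Count non-overlapping blocks of n consecutive flips (base R only).
# block_counts is a hypothetical name; incomplete trailing blocks are dropped.
block_counts <- function(results, n) {
  usable <- (length(results) %/% n) * n
  blocks <- matrix(results[seq_len(usable)], ncol = n, byrow = TRUE)
  labels <- apply(blocks, 1, paste0, collapse = "")  # one string per block
  table(labels)
}

counts <- block_counts(c("H", "T", "H", "T", "T", "T", "H"), 2)
counts
# labels
# HT TT 
#  2  1 
```

Only combinations that actually occur appear in the result, which is also how `dplyr::count` behaves above.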
I have a dataset that has data from all football (Soccer) players in the top 5 leagues and I am trying to build a scout function that retrieves a shortlist of players that are in the top 85th percentile of the chosen metrics.
I tried experimenting with the function with a simple argument to see if it was working:
scout(Total_Big_5_new,"Nutmegs")
but it returns this error:
the condition has length > 1
In addition: Warning message:
In percentile(database$metric) : NAs introduced by coercion
The code for the scout function is here:
scout <- function(database, ...) {
l <- list(...)
l2 <- list()
j <- 1
for(metric in l){
if(metric %in% colnames(database)){
l2[[j]] <- percentile(database[[metric]])
j <- j + 1
}else{
print(paste("The stat", metric, "is not recorded"))
}
}
i <- 1
k <- 1
shortlist <- list()
for (player in database){
compared <- select(database, unlist(l))
if (all(compared) > all(unlist(l2))){
shortlist[[i]] <- player
i <- i + 1
}
}
return(shortlist)
}
and the percentile function:
percentile <- function(metric, value = 0.85) {
answer <- unname(quantile(metric, c(value)))
return(as.numeric(paste(answer)))
}
Edit:
For example, say I make a dataframe with random data
df <- as_tibble(data.frame(
Player = c(LETTERS[1:13]),
Goals = c(sample(1:45, 13, replace=FALSE)),
Assists = c(sample(1:31, 13, replace=FALSE)),
Nutmegs = c(sample(1:28, 13, replace = FALSE)),
Dribbles = c(sample(43:208, 13, replace = FALSE))
))
Which returns this df:
Player Goals Assists Nutmegs Dribbles
<chr> <int> <int> <int> <int>
1 A 23 16 1 125
2 B 7 2 19 195
3 C 21 4 28 142
4 D 28 19 23 112
5 E 8 27 26 152
6 F 17 23 16 45
7 G 30 6 25 206
8 H 26 24 8 136
9 I 18 3 27 99
10 J 31 25 7 198
11 K 4 21 13 82
12 L 1 13 22 66
13 M 43 7 4 194
In this data frame, my percentile function would return 25.4 for Goals at the 0.65 level, as seen below:
percentile(df$Goals, 0.65) = 25.4
The aim of the scout function that I am creating is to retrieve the names of the players that exceed that value. For example:
scout(df,"Goals")
should return players: D, G, H, J and M
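One possible shape for such a function is sketched below, under a few assumptions that are not in the post: the data has a `Player` column as in the example, the percentile level is passed as an extra `value` argument, and a player must exceed the threshold on every requested metric. The comparison is done column-wise on vectors instead of looping over players, which avoids the "condition has length > 1" error.

```r
# Sketch only: `value` argument and "must exceed every metric" rule are assumptions.
percentile <- function(metric, value = 0.85) {
  unname(quantile(metric, value, na.rm = TRUE))
}

scout <- function(database, ..., value = 0.85) {
  metrics <- c(...)
  known <- metrics %in% colnames(database)
  for (m in metrics[!known]) message("The stat ", m, " is not recorded")
  keep <- rep(TRUE, nrow(database))
  for (m in metrics[known]) {
    x <- database[[m]]
    # Vectorised comparison: one TRUE/FALSE per player.
    keep <- keep & !is.na(x) & x > percentile(x, value)
  }
  database$Player[keep]
}

df <- data.frame(
  Player = LETTERS[1:13],
  Goals = c(23, 7, 21, 28, 8, 17, 30, 26, 18, 31, 4, 1, 43),
  stringsAsFactors = FALSE
)
shortlist <- scout(df, "Goals", value = 0.65)
shortlist
# [1] "D" "G" "H" "J" "M"
```

With the default `value = 0.85` the threshold for Goals is higher, so fewer players are returned.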
I have a big database and I'm trying to create a new column starting from an existing one doing the difference between elements in consecutive cells ( same column, different row):
existing_column new_column
A A-B
B B-C
C C-D
D D-E
... ...
Z Z-NULL
The way I'm doing it is to duplicate the existing column into a dummy one, remove its first element, add NULL as its last element, and then subtract the dummy column from the existing one. Is there a better way? Thank you
exist <-c("A","B","C","D","E")
db<-data.frame(exist)
dummy<-exist[-1]
dummy[length(dummy)+1]<-"NULL"
new_col<-paste(exist,"-",dummy)
new_col
db<-data.frame(exist,new_col)
db
Does this work?
library(dplyr)
df <- data.frame(existing_column = LETTERS)
df %>% mutate(new_column = paste(existing_column, lead(existing_column, default = 'NULL'), sep = '-'))
existing_column new_column
1 A A-B
2 B B-C
3 C C-D
4 D D-E
5 E E-F
6 F F-G
7 G G-H
8 H H-I
9 I I-J
10 J J-K
11 K K-L
12 L L-M
13 M M-N
14 N N-O
15 O O-P
16 P P-Q
17 Q Q-R
18 R R-S
19 S S-T
20 T T-U
21 U U-V
22 V V-W
23 W W-X
24 X X-Y
25 Y Y-Z
26 Z Z-NULL
Try the code below
transform(
df,
new_column = paste(existing_column, c(existing_column[-1], NA), sep = "-")
)
which gives
existing_column new_column
1 A A-B
2 B B-C
3 C C-D
4 D D-E
5 E E-F
6 F F-G
7 G G-H
8 H H-I
9 I I-J
10 J J-K
11 K K-L
12 L L-M
13 M M-N
14 N N-O
15 O O-P
16 P P-Q
17 Q Q-R
18 R R-S
19 S S-T
20 T T-U
21 U U-V
22 V V-W
23 W W-X
24 X X-Y
25 Y Y-Z
26 Z Z-NA
If you are working with numeric data just represented as characters in your example, you can use mutate() and lead()
df<-data.frame(old_col=sample(1:10))
df%>%mutate(new_col=old_col-lead(old_col, default = 0))
old_col new_col
1 10 4
2 6 -3
3 9 8
4 1 -1
5 2 -5
6 7 3
7 4 1
8 3 -5
9 8 3
10 5 5
In case you need a fast data.table version:
dt[, new_column:=paste(exist, shift(exist, type="lead"), sep="-")]
Edit: Turns out it isn't much faster:
df = data.table(exist = rep(letters, 80000))
m = microbenchmark::microbenchmark(
  a = df %>% mutate(new_column = paste(exist, lead(exist, default = 'NULL'), sep = '-')),
  b = transform(
    df,
    new_column = paste(exist, c(exist[-1], NA), sep = "-")
  ),
  d = df[, new_column := paste(exist, shift(exist, type="lead"), sep="-")]
)
m
Unit: milliseconds
expr min lq mean median uq max neval
a 292.2430 309.6150 342.0191 323.9778 361.0937 603.8449 100
b 349.4509 383.3391 475.0177 423.8864 472.0276 2136.2970 100
d 294.6786 302.8530 332.3989 315.6228 340.9642 641.8345 100
I am attempting to write a simulation that involves randomly re-assigning items to categories with some restrictions.
Let's say I have a collection of pebbles 1 to N distributed across buckets A through J:
set.seed(100)
df1 <- data.frame(pebble = 1:100,
bucket = sample(LETTERS[1:10], 100, T),
stringsAsFactors = F)
head(df1)
#> pebble bucket
#> 1 1 D
#> 2 2 C
#> 3 3 F
#> 4 4 A
#> 5 5 E
#> 6 6 E
I want to randomly re-assign pebbles to buckets. Without restrictions I could do it like so:
random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
colSums(table(random.permutation.df1))
#> A B C D E F G H I J
#> 4 7 13 14 12 11 11 10 9 9
colSums(table(df1))
#> A B C D E F G H I J
#> 4 7 13 14 12 11 11 10 9 9
Importantly this re-assigns pebbles while ensuring that each bucket retains the same number (because we are sampling without replacement).
However, I have a set of restrictions such that certain pebbles cannot be assigned to certain buckets. I encode the restrictions in df2:
df2 <- data.frame(pebble = sample(1:100, 10),
bucket = sample(LETTERS[1:10], 10, T),
stringsAsFactors = F)
df2
#> pebble bucket
#> 1 33 I
#> 2 39 I
#> 3 5 A
#> 4 36 C
#> 5 55 J
#> 6 66 A
#> 7 92 J
#> 8 95 H
#> 9 2 C
#> 10 49 I
The logic here is that pebbles 33 and 39 cannot be placed in bucket I, or pebble 5 in bucket A, etc. I would like to permute which pebbles are in which bucket subject to these restrictions.
So far, I've thought of tackling it in a loop as below, but this does not result in buckets retaining the same number of pebbles:
perms <- character(0)
cnt <- 1
for (p in df1$pebble) {
perms[cnt] <- sample(df1$bucket[!df1$bucket %in% df2$bucket[df2$pebble==p]], 1)
cnt <- cnt + 1
}
table(perms)
#> perms
#> A B C D E F G H I J
#> 6 7 12 22 15 1 14 7 7 9
I then tried sampling positions, and then removing that position from the available buckets and the available remaining positions. This is also not working, and I suspect it is because I am sampling my way into branches of the tree that do not yield solutions.
set.seed(42)
perms <- character(0)
cnt <- 1
ids <- 1:nrow(df1)
bckts <- df1$bucket
for (p in df1$pebble) {
id <- sample(ids[!bckts %in% df2$bucket[df2$pebble==p]], 1)
perms[cnt] <- bckts[id]
bckts <- bckts[-id]
ids <- ids[ids!=id]
cnt <- cnt + 1
}
table(perms)
#> perms
#> A B C D E F G J
#> 1 1 4 1 2 1 2 2
Any thoughts or advice much appreciated (and apologies for the length).
EDIT:
I foolishly forgot to clarify that I was previously solving this by just resampling until I got a draw that didn't violate any of the conditions in df2, but I now have many conditions such that this would make my code take too long to run. I am still up for trying to force it if I could figure out a way to make forcing it faster.
I have a solution. I managed to write it in base R, but the data.table version is easier to understand and write:
random.permutation.df2 <- data.frame(pebble = df1$pebble, bucket = rep(NA,length(df1$pebble)))
for(bucket in unique(df1$bucket)){
N <- length( random.permutation.df2$bucket[is.na(random.permutation.df2$bucket) &
!random.permutation.df2$pebble %in% df2$pebble[df2$bucket == bucket] ] )
random.permutation.df2$bucket[is.na(random.permutation.df2$bucket) &
!random.permutation.df2$pebble %in% df2$pebble[df2$bucket == bucket] ] <-
sample(c(rep(bucket,sum(df1$bucket == bucket)),rep(NA,N-sum(df1$bucket == bucket))))
}
The idea is to sample, for each bucket, from the authorised pebbles: those that are not excluded by df2 and have not already been assigned. You then sample a vector of the right length, made up of the current bucket's value and NAs (left for the following buckets), and voilà.
Now, in a form that is easier to read, with data.table:
library(data.table)
random.permutation.df2 <- setDT(random.permutation.df2)
df2 <- setDT(df2)
for( bucketi in unique(df1$bucket)){
random.permutation.df2[is.na(bucket) & !pebble %in% df2[bucket == bucketi, pebble],
bucket := sample(c(rep(bucketi,sum(df1$bucket == bucketi)),rep(NA,.N-sum(df1$bucket == bucketi))))]
}
It satisfies both conditions. The bucket counts are preserved:
> colSums(table(df1))
A B C D E F G H I J
4 7 13 14 12 11 11 10 9 9
> colSums(table(random.permutation.df2))
A B C D E F G H I J
4 7 13 14 12 11 11 10 9 9
And to verify that there isn't any contradiction with df2:
> df2
pebble bucket
1: 37 D
2: 95 H
3: 90 C
4: 80 C
5: 31 D
6: 84 G
7: 76 I
8: 57 H
9: 7 E
10: 39 A
> random.permutation.df2[pebble %in% df2$pebble,.(pebble,bucket)]
pebble bucket
1: 7 D
2: 31 H
3: 37 J
4: 39 F
5: 57 B
6: 76 E
7: 80 F
8: 84 B
9: 90 H
10: 95 D
Here is a brute-force approach in which one simply keeps trying until a valid solution is found:
set.seed(123)
df1 <- data.frame(pebble = 1:100,
bucket = sample(LETTERS[1:10], 100, T),
stringsAsFactors = F)
df2 <- data.frame(pebble = sample(1:100, 10),
bucket = sample(LETTERS[1:10], 10, T),
stringsAsFactors = F)
random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
The random permutation does not satisfy the condition, so try new ones:
merge(random.permutation.df1, df2)
#> pebble bucket
#> 1 60 J
while(TRUE) {
random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
if(nrow(merge(random.permutation.df1, df2)) == 0)
break;
}
The new permutation satisfies the condition:
merge(random.permutation.df1, df2)
#> [1] pebble bucket
#> <0 rows> (or 0-length row.names)
colSums(table(random.permutation.df1))
#> A B C D E F G H I J
#> 7 12 11 9 14 7 11 11 11 7
colSums(table(df1))
#> A B C D E F G H I J
#> 7 12 11 9 14 7 11 11 11 7
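When there are too many restrictions for plain resampling to be practical, one alternative is to repair a random permutation instead of discarding it: every forbidden assignment is swapped with a randomly chosen partner such that both pebbles end up in allowed buckets. The sketch below is my own illustration of that idea, not code from the question; `repair_permutation` is a hypothetical name, and the result is not guaranteed to be uniformly distributed over all valid permutations. Swapping keeps the bucket counts unchanged, and each swap removes at least one violation, so the loop terminates.

```r
# Sketch: start from a random permutation and fix violations by swapping.
# repair_permutation is a hypothetical helper; sampling is not exactly uniform.
repair_permutation <- function(df1, df2) {
  bucket <- sample(df1$bucket)
  forbidden <- paste(df2$pebble, df2$bucket)
  is_forbidden <- function(pebble, bucket) paste(pebble, bucket) %in% forbidden
  repeat {
    bad <- which(is_forbidden(df1$pebble, bucket))
    if (length(bad) == 0L) break
    i <- bad[1L]
    # Partner j is valid if pebble i may take j's bucket and pebble j may take i's.
    ok <- !is_forbidden(df1$pebble[i], bucket) & !is_forbidden(df1$pebble, bucket[i])
    candidates <- which(ok)
    if (length(candidates) == 0L) stop("no valid swap partner for pebble ", df1$pebble[i])
    j <- candidates[sample.int(length(candidates), 1L)]
    bucket[c(i, j)] <- bucket[c(j, i)]
  }
  data.frame(pebble = df1$pebble, bucket = bucket)
}

set.seed(123)
df1 <- data.frame(pebble = 1:100,
                  bucket = sample(LETTERS[1:10], 100, TRUE),
                  stringsAsFactors = FALSE)
df2 <- data.frame(pebble = sample(1:100, 10),
                  bucket = sample(LETTERS[1:10], 10, TRUE),
                  stringsAsFactors = FALSE)
repaired <- repair_permutation(df1, df2)
nrow(merge(repaired, df2))  # 0 rows means no restriction is violated
```

If exact uniformity over valid permutations matters for the simulation, the rejection loop above remains the safer choice.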
The seq function in R would give me a sequence from x to y with a constant step m:
seq(x, y, m)
E.g. seq(1,9,2) = c(1,3,5,7,9).
What would be the most elegant way to get a sequence from x to y with alternating steps m1 and m2, such that something like "seq(x, y, c(m1, m2))" would give me c(x, x + m1, (x + m1) + m2, (x + m1 + m2) + m1, ..., y), each time adding one of the steps (not necessarily reaching up to y, of course, as in seq)?
Example: x = 1; y = 19; m1 = 2; m2 = 4 and I get c(1,3,7,9,13,15,19).
I arrived at the solution by:
1. Use cumsum on the vector c(from, rep(by, times)), where times = ceiling((to-from)/sum(by)).
2. Truncate the sequence by !(seq > to).
seq_alt <- function(from, to, by) {
seq <- cumsum(c(from,rep(by,ceiling((to-from)/sum(by)))))
return(seq[! seq > to])
}
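A quick usage check of the function above, restated here so the snippet runs on its own (the local variable is renamed to `out` to avoid shadowing `seq`; it is assumed that `to >= from` and that all steps are positive). Because `by` is just a vector, the same function also handles more than two alternating steps.

```r
# Same approach as above: over-generate with cumsum, then truncate at `to`.
# Assumes to >= from and positive steps.
seq_alt <- function(from, to, by) {
  steps <- rep(by, ceiling((to - from) / sum(by)))
  out <- cumsum(c(from, steps))
  out[out <= to]
}

seq_alt(1, 19, c(2, 4))
# [1]  1  3  7  9 13 15 19
seq_alt(3, 56, c(7, 10))
# [1]  3 10 20 27 37 44 54
seq_alt(1, 20, c(2, 4, 1))
# [1]  1  3  7  8 10 14 15 17
```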
You can generate the first n terms of this sequence with:
x = 1; m1 = 2; m2 = 4
n <- 0:10 # first 11 terms
x + ceiling(n/2)*m1 + ceiling((n-1)/2)*m2
# [1] 1 3 7 9 13 15 19 21 25 27 31
Here is another idea,
fun1 <- function(x, y, j, z){
if(j >= y) {return(x)}else{
s1 <- seq(x, y, j+z)
s2 <- seq(x+j, y, j+z)
return(sort(c(s1, s2)))
}
}
fun1(1, 19, 2, 4)
#[1] 1 3 7 9 13 15 19
fun1(1, 40, 4, 3)
#[1] 1 5 8 12 15 19 22 26 29 33 36 40
fun1(3, 56, 7, 10)
#[1] 3 10 20 27 37 44 54
fun1(1, 2, 2, 4)
#[1] 1
Here is an alternative that uses diffinv. This method over-allocates the values, so as a stopping rule I keep only the elements that are less than or equal to the stopping value.
seqAlt <- function(start, stop, by1, by2) {
out <- diffinv(rep(c(by1, by2), ceiling(stop / (by1 + by2))), xi=start)
return(out[out <= stop])
}
seqAlt(1, 19, 2, 4)
[1] 1 3 7 9 13 15 19
You could use Reduce with accumulate = TRUE to iteratively add either 2 or 4:
Reduce(`+`, rep(c(2,4), 10), init = 1, accumulate = TRUE)
# [1] 1 3 7 9 13 15 19 21 25 27 31 33 37 39 43 45 49 51 55 57 61
The number of times you repeat c(2,4) determines the sequence length; since it is 10 above, there are 20 steps, and the sequence has 21 elements including the initial value.
The purrr package has an accumulate wrapper, if you prefer the syntax:
purrr::accumulate(rep(c(2,4), 10), `+`, .init = 1)
## [1] 1 3 7 9 13 15 19 21 25 27 31 33 37 39 43 45 49 51 55 57 61
A perfect example of recycling vectors in R:
# 1.
x = 1; y = 19; m1 = 2; m2 = 4
(x:y)[c(TRUE, rep(FALSE, m1-1), TRUE, rep(FALSE,m2-1))]
# [1] 1 3 7 9 13 15 19
# 2.
x = 3; y = 56; m1 = 7; m2 = 10
(x:y)[c(TRUE, rep(FALSE, m1-1), TRUE, rep(FALSE,m2-1))]
# [1] 3 10 20 27 37 44 54
I have a data.frame with two columns:
category quantity
a 20
b 30
c 100
d 10
e 1
f 23
g 3
h 200
I need to write a function with two parameters, dataframe and bin_size, which runs a cumsum over the quantity column, splits a row whenever the cumsum exceeds the bin_size, and adds a running bin number as an additional column.
For example, calling
function(dataframe, 50)
on the above example should give me:
category quantity cumsum bin_nbr
a 20 20 1
b 30 50 1
c 50 50 2
c 50 50 3
d 10 10 4
e 1 11 4
f 23 34 4
g 3 37 4
h 13 50 4
h 50 50 5
h 50 50 6
h 50 50 7
h 37 37 8
Explanation:
row a + b sum up to 50 --> bin_nbr 1
row c is 100 -> split into 2 rows of 50 -> bin_nbr 2, bin_nbr 3
row d,e,f,g sum up to 37 -> bin_nbr 4
I need another 13 from row h to fill bin_nbr 4 up to 50
The remaining quantity from h will be split into 4 bins -> bin_nbr 5, 6, 7, 8
I couldn't think of a clean way to do this with apply/data.table etc., since you have an inter-row dependency and a data frame that changes size. You can probably do it in an iterative/recursive manner, but I felt it would be quicker to just write the loop. One challenge is that it is difficult to know the final size of your object in advance, so this is likely to be slow. If performance is an issue in this application, you can mitigate the problem somewhat by switching from a data frame to a matrix (the code should work fine, except for the transform bits).
fun <- function(df, binsize){
df$cumsum <- cumsum(df$quantity)
df$bin <- 1
i <- 1
repeat {
if((extra <- (df[i, "cumsum"] - binsize)) > 0) { # Bin finished halfway through
top <- if(i > 1L) df[1L:(i - 1L), ] else df[0L, ]
mid <- transform(df[i, ], quantity=quantity-extra, cumsum=cumsum-extra)
bot <- transform(df[i, ], quantity=extra, cumsum=extra, bin=bin + 1L)
end <- if(i >= nrow(df)) df[0L, ] else df[(i + 1L):nrow(df), ]
end <- transform(end, cumsum=cumsum(end$quantity) + extra, bin=bin + 1L)
df <- rbind(top, mid, bot, end)
} else if (extra == 0 && nrow(df) > i) { # Bin finished cleanly
df[(i + 1L):nrow(df), ]$cumsum <- df[(i + 1L):nrow(df), ]$cumsum - binsize
df[(i + 1L):nrow(df), ]$bin <- df[(i + 1L):nrow(df), ]$bin + 1L
}
if(nrow(df) < (i <- i + 1)) break
}
rownames(df) <- seq(len=nrow(df))
df
}
fun(df, binsize)
# category quantity cumsum bin
# 1 a 20 20 1
# 2 b 30 50 1
# 3 c 50 50 2
# 4 c 50 50 3
# 5 d 10 10 4
# 6 e 1 11 4
# 7 f 23 34 4
# 8 g 3 37 4
# 9 h 13 50 4
# 10 h 50 50 5
# 11 h 50 50 6
# 12 h 50 50 7
# 13 h 37 37 8
Another solution with a loop:
DF <- read.table(text="category quantity
a 20
b 30
c 100
d 10
e 1
f 23
g 3
h 200", header=TRUE)
bin_size <- 50
n_bin <- ceiling(sum(DF$quantity)/bin_size)
DF$bin <- findInterval(cumsum(DF$quantity)-1, c(0, seq_len(n_bin)*bin_size))
DF$cumsum <- cumsum(DF$quantity)
result <- lapply(seq_along(DF[,1]), function(i, df) {
if (i==1) {
d <- df[i, "bin"]
} else {
d <- df[i, "bin"]-df[i-1, "bin"]
}
if (d > 1) {
res <- data.frame(
category = df[i, "category"],
bin_nbr = df[i, "bin"]-seq_len(d+1)+1
)
res[,"quantity"] <- bin_size
if (i!=1) {
res[nrow(res),"quantity"] <- df[i-1, "bin"]*bin_size-df[i-1, "cumsum"]
} else {
res[nrow(res),"quantity"] <- 0
}
res[1,"quantity"] <- df[i, "quantity"]-sum(res[-1,"quantity"])
return(res[res$quantity > 0,])
} else {
return(data.frame(
category = df[i, "category"],
quantity = df[i, "quantity"],
bin_nbr = df[i, "bin"]
))
}
}, df=DF)
res <- do.call(rbind, result)
res <- res[order(res$category, res$bin_nbr),]
library(plyr)
res <- ddply(res, .(bin_nbr), transform, cumsum=cumsum(quantity))
res
# category quantity bin_nbr cumsum
# 1 a 20 1 20
# 2 b 30 1 50
# 3 c 50 2 50
# 4 c 50 3 50
# 5 d 10 4 10
# 6 e 1 4 11
# 7 f 23 4 34
# 8 g 3 4 37
# 9 h 13 4 50
# 10 h 50 5 50
# 11 h 50 6 50
# 12 h 50 7 50
# 13 h 37 8 37
This amounts to merging the bin boundaries with the data which gives this loop-free solution:
library(zoo)
fun <- function(DF, binsize = 50) {
nr <- nrow(DF)
DF2 <- data.frame(cumsum = seq(0, sum(DF$quantity), binsize) + binsize, bin_nbr = 1:nr)
DF.cs <- transform(DF, cumsum = cumsum(DF$quantity))
m <- na.locf(merge(DF.cs, DF2, all = TRUE), fromLast = TRUE)
m$bin_nbr <- as.numeric(m$bin_nbr)
cs <- as.numeric(m$cumsum)
m$quantity <- c(cs[1], diff(cs))
m$cumsum <- ave(m$quantity, m$bin_nbr, FUN = cumsum)
na.omit(m)[c("category", "quantity", "cumsum", "bin_nbr")]
}
giving:
> fun(DF)
category quantity cumsum bin_nbr
1 a 20 20 1
2 b 30 50 1
3 c 50 50 2
4 c 50 50 3
5 d 10 10 4
6 e 1 11 4
7 f 23 34 4
8 g 3 37 4
9 h 13 50 4
10 h 50 50 5
11 h 50 50 6
12 h 50 50 7
13 h 37 37 8
Note: For purposes of reproducing the result above this is the input we used:
Lines <- "category quantity
a 20
b 30
c 100
d 10
e 1
f 23
g 3
h 200
"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
REVISION An error in the code was corrected.
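The same "merge the bin boundaries with the data" idea can also be sketched in base R without zoo: collect every cut point (the end of each row and every multiple of the bin size), and each pair of neighbouring cut points becomes one output row. This is an illustration under the assumption that the quantities are positive whole numbers; `split_into_bins` is a hypothetical name.

```r
# Sketch: every row end and every bin boundary is a cut point; each gap
# between neighbouring cut points becomes one output row.
# Assumes positive, whole-number quantities. split_into_bins is a made-up name.
split_into_bins <- function(df, bin_size) {
  ends <- cumsum(df$quantity)
  total <- ends[length(ends)]
  boundaries <- if (total >= bin_size) seq(bin_size, total, by = bin_size) else numeric(0)
  cuts <- sort(unique(c(ends, boundaries)))
  starts <- c(0, cuts[-length(cuts)])
  source_row <- findInterval(starts, ends) + 1L   # original row each piece comes from
  out <- data.frame(
    category = df$category[source_row],
    quantity = cuts - starts,
    bin_nbr = starts %/% bin_size + 1
  )
  out$cumsum <- ave(out$quantity, out$bin_nbr, FUN = cumsum)
  out[c("category", "quantity", "cumsum", "bin_nbr")]
}

DF <- data.frame(
  category = letters[1:8],
  quantity = c(20, 30, 100, 10, 1, 23, 3, 200),
  stringsAsFactors = FALSE
)
binned <- split_into_bins(DF, 50)
binned
```

On the example input this gives the same 13 rows as the outputs shown above.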