Create grouping based on cumulative sum and another group - r

This question is nearly identical to:
Create new group based on cumulative sum and group
However, when I apply the accepted solution to my data, it doesn't have the expected result.
In a nutshell, I have data with two variables: domain and value. Domain is a grouping variable with multiple observations, and value is a continuous value that I would like to accumulate by domain in order to create a new group variable, newgroup. There are three main rules:
I accumulate only within each domain. If I reach the end of the domain, then the accumulation is reset.
If the accumulated sum is at least 1.0 then the observations whose values added up to at least 1.0 are assigned to a different value for group1. Please note that this rule can be satisfied by a single observation.
If the last group in a domain has an accumulated sum less than 1.0, then merge it with the second-to-last group within the same domain. This is reflected in the variable group2.
The data below has been simplified. The data will usually consist of 10^5 - 10^6 rows, so a vectorized solution would be ideal.
Example Data
domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
domain value
1 1.0
1 0.0
1 2.0
1 2.5
1 0.1
2 0.1
2 0.5
2 0.0
2 0.2
2 0.6
2 0.0
2 0.0
2 0.1
Desired Output
cumsum_val <- c(1,0,2,2.5,0.1,0.1,0.6,0.6,0.8,1.4,0,0,0.1)
group1 <- c(1,2,2,3,4,5,5,5,5,5,6,6,6)
group2 <- c(1,2,2,3,3,4,4,4,4,4,4,4,4) #Satisfies Rule #3
df_want <- data.frame(domain,value,cumsum_val,group1,group2)
domain value cumsum_val group1 group2
1 1.0 1.0 1 1
1 0.0 0.0 2 2
1 2.0 2.0 2 2
1 2.5 2.5 3 3
1 0.1 0.1 4 3
2 0.1 0.1 5 4
2 0.5 0.6 5 4
2 0.0 0.6 5 4
2 0.2 0.8 5 4
2 0.6 1.4 5 4
2 0.0 0.0 6 4
2 0.0 0.0 6 4
2 0.1 0.1 6 4
I used the following code:
sum0 <- function(x, y) { if (x + y >= 1.0) 0 else x + y }
is_start <- function(x) head(c(TRUE, Reduce(sum0, init=0, x, acc = TRUE)[-1] == 0), -1)
cumsum(ave(df_raw$value, df_raw$domain, FUN = is_start))
## 1 2 3 4 5 6 6 6 6 6 7 8 9
but the last line does not produce the same values as group1 above. Generating group1 output is what is mainly causing me issues. Can someone help me understand the function is_start and how that is supposed to produce the groupings?
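To see why is_start over-splits, it may help to look at the running sum it relies on. The sketch below (base R only; reset_cumsum and the starts_* names are hypothetical helpers, not from the linked answer) uses domain 1 of the example. Because sum0 zeroes the running sum at a reset, a row whose value is 0 is indistinguishable from a reset, so the row after it is treated as a new group. Keeping the total that triggered the reset, and testing the previous row against the threshold, avoids that ambiguity.

```r
value <- c(1, 0, 2, 2.5, 0.1)   # domain 1 of the example data

# Approach from the question: the running sum is set to 0 at every reset,
# so "== 0" cannot tell a reset apart from a row whose value is simply 0.
sum0 <- function(x, y) if (x + y >= 1.0) 0 else x + y
running0 <- Reduce(sum0, value, init = 0, accumulate = TRUE)[-1]
starts_old <- c(TRUE, head(running0, -1) == 0)

# Alternative sketch: keep the total that reached the threshold, and restart
# from the current value on the following row.
reset_cumsum <- function(v, threshold = 1) {
  Reduce(function(acc, x) if (acc >= threshold) x else acc + x, v, accumulate = TRUE)
}
running <- reset_cumsum(value)
starts_new <- c(TRUE, head(running, -1) >= 1)

cumsum(starts_old)  # every row becomes its own group
cumsum(starts_new)  # rows 2 and 3 share a group, as in the desired group1
```

Here running reproduces the cumsum_val column from the desired output for domain 1, which is a quick way to check the helper against the rules.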
EDIT
akrun provided some working code in the comments for the simplified example above. However, there are still some situations where it doesn't work. For example,
domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
The output is shown below, with new coming from akrun's code; group1 and group2 are the desired groupings based on rules #2 and #3. The discrepancy between new and group2 occurs mainly in the first 3 rows.
domain value new group1 group2
1 1.0 1 1 1
1 0.0 2 2 2
1 1.0 3 2 2
1 0.0 4 3 3
1 2.0 4 3 3
1 2.5 5 4 4
1 0.1 5 5 4
2 0.1 6 6 5
2 0.5 6 6 5
2 0.0 6 6 5
2 0.2 6 6 5
2 0.6 6 6 5
2 0.0 6 7 5
2 0.0 6 7 5
2 0.1 6 7 5
EDIT 2
I have updated with a working answer.

This works! It uses a combination of purrr's accumulate (similar to cumsum but more versatile) and cumsum with appropriate use of group_by to get what you're looking for. I've added comments to indicate what each part is doing. I'll note that next_group2 is a bit of a misnomer--it's more of a not_next_group2, but hopefully the rest is clear.
library(tidyverse)
domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
## Modified from: https://stackoverflow.com/questions/49076769/dplyr-r-cumulative-sum-with-reset
sum_reset_at = function(val_col, threshold, include.equals = TRUE) {
  if (include.equals) {
    purrr::accumulate({{val_col}}, ~if_else(.x>=threshold , .y, .x+.y))
  } else {
    purrr::accumulate({{val_col}}, ~if_else(.x>threshold , .y, .x+.y))
  }
}
df_raw %>%
group_by(domain) %>%
mutate(cumsum_val = sum_reset_at(value, 1)) %>%
mutate(next_group1 = ifelse(lag(cumsum_val) >= 1 | row_number() == 1, 1, 0)) %>% ## binary interpretation of whether there should be a new group
ungroup %>%
mutate(group1 = cumsum(next_group1)) %>% ## generate new groups
group_by(domain, group1) %>%
mutate(next_group2 = ifelse(max(cumsum_val) < 1 & row_number() == 1, 1, 0)) %>% ## similar to above, but grouped by your new group1; we ask it only to transition at the first value of the group that doesn't reach 1
ungroup %>%
mutate(group2 = cumsum(next_group1 - next_group2)) %>% ## cancel out the next_group1 binary if it meets the conditions of next_group2
select(-starts_with("next_"))
And as specified, this produces:
# A tibble: 13 x 5
domain value cumsum_val group1 group2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 1 0 0 2 2
3 1 2 2 2 2
4 1 2.5 2.5 3 3
5 1 0.1 0.1 4 3
6 2 0.1 0.1 5 4
7 2 0.5 0.6 5 4
8 2 0 0.6 5 4
9 2 0.2 0.8 5 4
10 2 0.6 1.4 5 4
11 2 0 0 6 4
12 2 0 0 6 4
13 2 0.1 0.1 6 4
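For readers who prefer to avoid the tidyverse dependency, the same three rules can be sketched in base R with Reduce and ave. This is only an illustration under two assumptions: rows of the same domain are contiguous, and the function name assign_groups and its internals are hypothetical, not part of the answer above. Note that Reduce still loops at the R level, so on 10^5 to 10^6 rows it is unlikely to be faster than the purrr version.

```r
assign_groups <- function(domain, value, threshold = 1) {
  # Rule 1 and 2: running sum within a domain, restarting after the threshold is reached
  reset_cumsum <- function(v) {
    Reduce(function(acc, x) if (acc >= threshold) x else acc + x, v, accumulate = TRUE)
  }
  cumsum_val <- ave(value, domain, FUN = reset_cumsum)

  # A group starts on the first row of a domain (Inf forces a start)
  # or right after a row whose running sum reached the threshold.
  prev <- ave(cumsum_val, domain, FUN = function(x) c(Inf, head(x, -1)))
  start <- prev >= threshold
  group1 <- cumsum(start)

  # Rule 3: a last group that never reaches the threshold is merged into the
  # previous group of the same domain (if the domain has one).
  grp_max   <- ave(cumsum_val, group1, FUN = max)
  is_last   <- group1 == ave(group1, domain, FUN = max)
  is_first  <- group1 == ave(group1, domain, FUN = min)
  short_tail <- grp_max < threshold & is_last & !is_first
  group2 <- cumsum(start & !short_tail)

  data.frame(domain, value, cumsum_val, group1, group2)
}

domain <- c(rep(1, 5), rep(2, 8))
value  <- c(1, 0, 2, 2.5, 0.1, 0.1, 0.5, 0, 0.2, 0.6, 0, 0, 0.1)
res <- assign_groups(domain, value)
res
```

Unlike the pipeline above, this sketch only merges a short trailing group when the domain has an earlier group to merge into, which is one possible reading of rule 3.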

The solution below is adapted from Group vector on conditional sum.
Helper Rcpp Function
library(Rcpp)
cppFunction('
IntegerVector CreateGroup(NumericVector x, double cutoff) {
  IntegerVector groupVec (x.size());
  int group = 1;
  double runSum = 0;
  for (int i = 0; i < x.size(); i++) {
    runSum += x[i];
    groupVec[i] = group;
    // once the running sum reaches the cutoff, the next row starts a new group
    if (runSum >= cutoff) {
      group++;
      runSum = 0;
    }
  }
  return groupVec;
}
')
Main Function
domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
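library(dplyr)
df_raw %>%
  group_by(domain) %>%
  mutate(group1 = CreateGroup(value,1),
         group1 = ifelse(group1==max(group1) & last(value) < 1,
                         max(group1)-1,group1)) %>%
  ungroup() %>%
  mutate(group2 = data.table::rleid(domain, group1)) ## domain is included so equal group1 values in adjacent domains stay separate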
domain value group1 group2
1 1.0 1 1
1 0.0 2 2
1 1.0 2 2
1 0.0 3 3
1 2.0 3 3
1 2.5 4 4
1 0.1 4 4
2 0.1 1 5
2 0.5 1 5
2 0.0 1 5
2 0.2 1 5
2 0.6 1 5
2 0.0 1 5
2 0.0 1 5
2 0.1 1 5
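If compiling C++ is not an option, the loop in CreateGroup translates directly to plain R. The sketch below is a hypothetical R port (the name create_group is mine, not from the answer); it will be slower than the Rcpp version on large inputs but is convenient for checking the logic.

```r
# Plain-R port of the CreateGroup loop: assign the current group id to each
# element, and move to the next group once the running sum reaches the cutoff.
create_group <- function(x, cutoff) {
  group_vec <- integer(length(x))
  group <- 1L
  run_sum <- 0
  for (i in seq_along(x)) {
    run_sum <- run_sum + x[i]
    group_vec[i] <- group
    if (run_sum >= cutoff) {
      group <- group + 1L
      run_sum <- 0
    }
  }
  group_vec
}

g <- create_group(c(1, 0, 1, 0, 2, 2.5, 0.1), 1)
g
# 1 2 2 3 3 4 5
```

This is the per-domain result before the trailing group is merged, so the final 5 corresponds to the row that the main function folds into group 4.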

Related

Filter a group of a data.frame based on multiple conditions

I am looking for an elegant way to filter the values of a specific group of big data.frame based on multiple conditions.
My data frame looks like this.
data=data.frame(group=c("A","B","C","A","B","C","A","B","C"),
time= c(rep(1,3),rep(2,3), rep(3,3)),
value=c(0.2,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
For time point 1 only, I would like to keep just the values that are bigger than 0.1 but smaller than 1, and drop the rest. All other time points should be left untouched.
I want my data.frame to look like this.
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Any help is highly appreciated.
With dplyr you can do
library(dplyr)
data %>% filter(!(time == 1 & (value <= 0.1 | value >= 1)))
# group time value
# 1 A 1 0.2
# 2 A 2 0.1
# 3 B 2 10.0
# 4 C 2 20.0
# 5 A 3 10.0
# 6 B 3 20.0
# 7 C 3 30.0
Or if you have too much free time and you decided to avoid dplyr:
ind <- with(data, time == 1 & value > 0.1 & value < 1)
ind <- ifelse(data$time == 1 & data$value > 0.1 & data$value < 1, TRUE, FALSE)
# the two lines above do the same thing; either one is enough
data$ind <- ind
data <- data[!(data$time == 1 & !ind), ]
data$ind <- NULL
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Another simple option would be to use subset twice and then append the results in a row wise manner.
rbind(
subset(data, time == 1 & value > 0.1 & value < 1),
subset(data, time != 1)
)
# group time value
# 1 A 1 0.2
# 4 A 2 0.1
# 5 B 2 10.0
# 6 C 2 20.0
# 7 A 3 10.0
# 8 B 3 20.0
# 9 C 3 30.0
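The answers above all express the same condition in different forms. By De Morgan's laws, "drop rows where time is 1 and the value is outside (0.1, 1)" is the same as "keep rows where time is not 1, or the value is inside (0.1, 1)". A minimal base R sketch of that single-step form (the name kept is just for illustration):

```r
data <- data.frame(group = c("A","B","C","A","B","C","A","B","C"),
                   time  = c(rep(1, 3), rep(2, 3), rep(3, 3)),
                   value = c(0.2, 1, 1, 0.1, 10, 20, 10, 20, 30))

# Keep every row from other time points, plus time-1 rows with 0.1 < value < 1.
kept <- data[data$time != 1 | (data$value > 0.1 & data$value < 1), ]
kept
```

Unlike the rbind approach, this keeps the original row order even when time point 1 is not at the top of the data.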

Retaining the last value of a column

I have the following data frame,
Input
For all observations where Month > tenor, the last value of the rate column should be retained for each account for the remaining months. Eg:- Customer 1 has tenor = 5, so for all months greater than 5, the last rate value is retained.
I am using the following code
df$rate <- ifelse(df$Month > df$tenor,tail(df$rate, n=1),df$rate)
But here, the last value is NA so it does not work
Expected output is
Output
this will work, but please have a reproducible example. Others want to help you, not do your homework.
df <- data.frame(
customer = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
Month = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10),
tenor = c(5,5,5,5,5,5,5,5,5,5,3,3,3,3,3,3,3,3,3,3),
rate = c(0.2,0.3,0.4,0.5,0.6,NA,NA,NA,NA,NA,0.1,0.2,0.3,NA,NA,NA,NA,NA,NA,NA)
)
fn <- function(cus, mon, ten, rat){
if (mon > ten && is.na(rat)){
return(dplyr::filter(df, customer == cus, Month == ten, tenor == ten)$rate)
}
return(rat)
}
df2 <- dplyr::mutate(df,
newrate = Vectorize(fn)(customer, Month, tenor, rate)
)
One option is:
library(dplyr)
library(tidyr)
df %>%
group_by(customer) %>%
fill(rate, .direction = "down") %>%
ungroup()
# A tibble: 20 x 4
customer Month tenor rate
<dbl> <dbl> <dbl> <dbl>
1 1 1 5 0.2
2 1 2 5 0.3
3 1 3 5 0.4
4 1 4 5 0.5
5 1 5 5 0.6
6 1 6 5 0.6
7 1 7 5 0.6
8 1 8 5 0.6
9 1 9 5 0.6
10 1 10 5 0.6
11 2 1 3 0.1
12 2 2 3 0.2
13 2 3 3 0.3
14 2 4 3 0.3
15 2 5 3 0.3
16 2 6 3 0.3
17 2 7 3 0.3
18 2 8 3 0.3
19 2 9 3 0.3
20 2 10 3 0.3
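For comparison, the same "carry the last observed value forward within each customer" idea can be sketched in base R without tidyr. The helper name locf is hypothetical; it assumes rows are already ordered by Month within each customer, as in the example data.

```r
df <- data.frame(
  customer = rep(c(1, 2), each = 10),
  Month    = rep(1:10, times = 2),
  tenor    = rep(c(5, 3), each = 10),
  rate     = c(0.2, 0.3, 0.4, 0.5, 0.6, rep(NA, 5),
               0.1, 0.2, 0.3, rep(NA, 7))
)

# Last observation carried forward: idx counts how many non-NA values have been
# seen so far, and is used to look up the most recent one (NA if none yet).
locf <- function(x) {
  idx <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[idx + 1]
}

df$rate <- ave(df$rate, df$customer, FUN = locf)
df$rate
```

Like fill(), this fills every NA after an observed value, not only rows where Month > tenor; with the example data the two conditions coincide.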
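I can't replicate your data frame, so this is a guess right now.
I think dplyr together with tidyr should be the solution. Assuming the account column is called customer, something like:
library(dplyr)
library(tidyr)
df %>%
group_by(customer) %>%
mutate(rate = replace_na(rate, last(rate[!is.na(rate)])))
should work.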

Assigning Subgroups in Data R

I know how to do basic stuff in R, but I am still a newbie. I am also probably asking a pretty redundant question (but I don't know how to enter it into google so that I find the right hits).
I have been getting hits like the below:
Assign value to group based on condition in column
R - Group by variable and then assign a unique ID
I want to assign subgroups into groups, and create a new column out of them.
I have data like the following:
dataframe:
ID SubID Values
1 15 0.5
1 15 0.2
2 13 0.1
2 13 0
1 14 0.3
1 14 0.3
2 10 0.2
2 10 1.6
6 31 0.7
6 31 1.0
new dataframe:
ID SubID Values groups
1 15 0.5 2
1 15 0.2 2
2 13 0.1 2
2 13 0 2
1 14 0.3 1
1 14 0.3 1
2 10 0.2 1
2 10 1.6 1
6 31 0.7 1
6 31 1.0 1
I have tried the following in R, but I am not getting the desired results:
newdataframe$groups <- dataframe %>% group_indices(,dataframe$ID, dataframe$SubID)
newdataframe<- dataframe %>% group_by(ID, SubID) %>% mutate(groups=group_indices(,dataframe$ID, dataframe$SubID))
I am not sure how to frame the question in R. I want to group by ID and SubID, then number the subgroups within each ID, resetting the count for each ID.
Any help would be really appreciated.
Here is an alternative approach which uses the rleid() function from the data.table package. rleid() generates a run-length type id column.
According to the expected result, the OP expects SubID to be numbered by order of value and not by order of appearance. Therefore, we need to call arrange().
library(dplyr)
df %>%
group_by(ID) %>%
arrange(SubID) %>%
mutate(groups = data.table::rleid(SubID))
ID SubID Values groups
<int> <int> <dbl> <int>
1 2 10 0.2 1
2 2 10 1.6 1
3 2 13 0.1 2
4 2 13 0 2
5 1 14 0.3 1
6 1 14 0.3 1
7 1 15 0.5 2
8 1 15 0.2 2
9 6 31 0.7 1
10 6 31 1 1
Note that the row order has changed.
BTW: With data.table, the code is less verbose and the original row order is maintained:
library(data.table)
setDT(df)[order(ID, SubID), groups := rleid(SubID), by = ID][]
ID SubID Values groups
1: 1 15 0.5 2
2: 1 15 0.2 2
3: 2 13 0.1 2
4: 2 13 0.0 2
5: 1 14 0.3 1
6: 1 14 0.3 1
7: 2 10 0.2 1
8: 2 10 1.6 1
9: 6 31 0.7 1
10: 6 31 1.0 1
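To make clear what a run-length id is, here is a small base R sketch built on rle(). The name rleid_base is hypothetical and this is not how data.table implements rleid(); it only illustrates the idea for a single vector.

```r
# Give each run of consecutive equal values its own id: 1 for the first run,
# 2 for the second, and so on.
rleid_base <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

rleid_base(c(10, 10, 13, 13))
# 1 1 2 2
rleid_base(c(1, 1, 2, 2, 1))
# 1 1 2 2 3
```

The second call shows why sorting by SubID matters in the answers above: a value that reappears after a different one starts a new run and gets a new id.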
There are multiple ways to do this one way would be to group_by ID and create a unique number for each SubID by converting it to factor and then to integer.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(groups = as.integer(factor(SubID)))
# ID SubID Values groups
# <int> <int> <dbl> <int>
# 1 1 15 0.5 2
# 2 1 15 0.2 2
# 3 2 13 0.1 2
# 4 2 13 0 2
# 5 1 14 0.3 1
# 6 1 14 0.3 1
# 7 2 10 0.2 1
# 8 2 10 1.6 1
# 9 6 31 0.7 1
#10 6 31 1 1
In base R, we can use ave with similar logic
df$groups <- with(df, ave(SubID, ID, FUN = factor))
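The factor trick works because factor levels are sorted, so the integer code of each SubID is its rank among the distinct SubID values of that ID. A slightly more explicit base R sketch of the same idea, using match() against the sorted unique values (the data frame is retyped from the question):

```r
df <- data.frame(
  ID     = c(1, 1, 2, 2, 1, 1, 2, 2, 6, 6),
  SubID  = c(15, 15, 13, 13, 14, 14, 10, 10, 31, 31),
  Values = c(0.5, 0.2, 0.1, 0, 0.3, 0.3, 0.2, 1.6, 0.7, 1.0)
)

# Within each ID, number the distinct SubID values in increasing order.
df$groups <- ave(df$SubID, df$ID, FUN = function(x) match(x, sort(unique(x))))
df
```

The original row order is preserved, since ave() writes the result back into the positions the values came from.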

Error: incompatible size when mutating in dplyr

I have trouble with the mutate function in dplyr. The error says:
Error: incompatible size (0), expecting 5 (the group size) or 1
There are some previous posts and I tried some of the solutions but no luck for my case.
group-factorial-data-with-multiple-factors-error-incompatible-size-0-expe
r-dplyr-using-mutate-with-na-omit-causes-error-incompatible-size-d
grouped-operations-that-result-in-length-not-equal-to-1-or-length-of-group-in-dp
Here is what I tried,
ff <- c(seq(0,0.2,0.1),seq(0,-0.2,-0.1))
flip <- c(c(0,0,1,1,1,1),c(1,1,0,0,0,0))
df <- data.frame(ff,flip,group=gl(2,6))
> df
ff flip group
1 0.0 0 1
2 0.1 0 1
3 0.2 1 1
4 0.0 1 1
5 -0.1 1 1
6 -0.2 1 1
7 0.0 1 2
8 0.1 1 2
9 0.2 0 2
10 0.0 0 2
11 -0.1 0 2
12 -0.2 0 2
I want to add new columns called c1 and c2 based on some conditions, as follows:
dff <- df%>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
spread(direc,flip)%>%
arrange(group,group)%>%
mutate(c1=ff[head(which(forward>0),1)],c2=ff[tail(which(backward>0),1)])
Error: incompatible size (0), expecting 5 (the group size) or 1
I also tried adding do:
do(data.frame(., c1=ff[head(which(.$forward>0),1)],c2=ff[tail(which(.$backward>0),1)]))
Error in data.frame(., c1 = ff[head(which(.$forward > 0), 1)], c2 = ff[tail(which(.$backward > :
arguments imply differing number of rows: 5, 1, 0
but when I only mutate c1 column everything seems to be working. Why?
Just expanding on @allistaire's comment.
Your specified conditions are the cause of the error, specifically tail(which(backward>0),1).
The given code can be simplified by getting rid of the spread().
You can try:
dff <- df%>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
arrange(group)%>%
mutate(c1=ff[head(which(direc=="forward" & flip > 0),1)])
It seems like you are looking to identify the inflection points where the direction changes, for each group. In this scenario, please clarify exactly how flip is related. Alternatively, if you change flip <- c(c(0,0,1,1,1,1),c(1,1,0,0,0,0)) to flip <- c(c(0,0,1,1,1,1),c(1,1,0,1,1,1)) so that flip marks the change in direction of ff, you can use
dff <- df%>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
arrange(group)%>%
mutate(c1=ff[head(which(direc=="forward" & flip > 0),1)]) %>%
mutate(c2=ff[tail(which(direc=="backward"& flip >0),1)])
which gives:
Source: local data frame [12 x 6]
Groups: group [2]
ff flip group direc c1 c2
<dbl> <dbl> <fctr> <chr> <dbl> <dbl>
1 0.0 0 1 forward 0.2 -0.2
2 0.1 0 1 forward 0.2 -0.2
3 0.2 1 1 forward 0.2 -0.2
4 0.0 1 1 backward 0.2 -0.2
5 -0.1 1 1 backward 0.2 -0.2
6 -0.2 1 1 backward 0.2 -0.2
7 0.0 1 2 forward 0.0 -0.2
8 0.1 1 2 forward 0.0 -0.2
9 0.2 0 2 forward 0.0 -0.2
10 0.0 1 2 backward 0.0 -0.2
11 -0.1 1 2 backward 0.0 -0.2
12 -0.2 1 2 backward 0.0 -0.2
It might be informative to step through the pipe to see what is going on.
df %>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
spread(direc,flip)%>%
arrange(group,group)
# Source: local data frame [10 x 4]
# Groups: group [2]
# ff group backward forward
# <dbl> <fctr> <dbl> <dbl>
# 1 -0.2 1 1 NA
# 2 -0.1 1 1 NA
# 3 0.0 1 1 0
# 4 0.1 1 NA 0
# 5 0.2 1 NA 1
# 6 -0.2 2 0 NA
# 7 -0.1 2 0 NA
# 8 0.0 2 0 1
# 9 0.1 2 NA 1
# 10 0.2 2 NA 0
BTW: Why arrange(group,group)? Doubling the order variable is pointless.
Looking here, you'll see that in group 2 none of the backward values is greater than 0. When you run something like which(FALSE) you get integer(0), and indexing ff with integer(0) returns a zero-length vector. dplyr needs the right-hand side of a mutate expression to have either length 1 or the same length as the number of rows in the group, which is exactly what the error message says.
Instead of your mutate, I'll show it with a slight modification: return the number of indices returned by the which call for c2:
df %>%
group_by(group)%>%
mutate(flip=as.numeric(flip),direc=ifelse(c(0,diff(ff))<0,"backward","forward"))%>%
spread(direc,flip)%>%
arrange(group,group)%>%
mutate(
c1 = ff[head(which(forward>0),1)],
c2len = length(which(backward > 0))
)
# Source: local data frame [10 x 6]
# Groups: group [2]
# ff group backward forward c1 c2len
# <dbl> <fctr> <dbl> <dbl> <dbl> <int>
# 1 -0.2 1 1 NA 0.2 3
# 2 -0.1 1 1 NA 0.2 3
# 3 0.0 1 1 0 0.2 3
# 4 0.1 1 NA 0 0.2 3
# 5 0.2 1 NA 1 0.2 3
# 6 -0.2 2 0 NA 0.0 0
# 7 -0.1 2 0 NA 0.0 0
# 8 0.0 2 0 1 0.0 0
# 9 0.1 2 NA 1 0.0 0
# 10 0.2 2 NA 0 0.0 0
In order to meaningfully index on ff, you need something other than integer(0) in your returns.
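The zero-length result can be reproduced outside dplyr, which also suggests one way to guard against it. The sketch below uses the group 2 values from the table above; last_or_na is a hypothetical helper that returns NA when no element satisfies the condition, so the result always has length 1.

```r
ff       <- c(-0.2, -0.1, 0, 0.1, 0.2)
backward <- c(0, 0, 0, NA, NA)        # group 2: nothing is greater than 0

idx <- tail(which(backward > 0), 1)
length(idx)       # 0, i.e. integer(0)
length(ff[idx])   # 0 as well, which is what mutate() rejects

# Guard: fall back to NA when the condition never holds.
last_or_na <- function(x, cond) {
  i <- tail(which(cond), 1)
  if (length(i) == 0) NA_real_ else x[i]
}

last_or_na(ff, backward > 0)               # NA for group 2
last_or_na(ff, c(1, 1, 1, NA, NA) > 0)     # 0 for group 1
```

Whether NA is the right fallback depends on what c2 is meant to represent when a group has no backward flips, which the question leaves open.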

How to manipulate a data.frame by factor with dplyr

df <- data.frame(a=factor(c(1,1,2,2,3,3) ), b=c(1,1, 10,10, 20,20) )
a b
1 1 1
2 1 1
3 2 10
4 2 10
5 3 20
6 3 20
I want to split the data frame by column a, calculate b/sum(b) in each group, and put the result in column c. With plyr I can do:
fun <- function(x){
x$c=x$b/sum(x$b)
x
}
ddply(df, .(a), fun )
and have
a b c
1 1 1 0.5
2 1 1 0.5
3 2 10 0.5
4 2 10 0.5
5 3 20 0.5
6 3 20 0.5
but how can I do it with dplyr?
df %.% group_by(a) %.% do(fun)
returns a list instead of a data.frame.
df %>%
group_by(a) %>%
mutate(c=b/sum(b))
a b c
1 1 1 0.5
2 1 1 0.5
3 2 10 0.5
4 2 10 0.5
5 3 20 0.5
6 3 20 0.5
Just to mention a base R solution: you can use transform (the base R equivalent of mutate) together with the ave function, which splits a vector by group and applies a function to each piece.
> transform(df, c=ave(b,a, FUN= function(b) b/sum(b)))
a b c
1 1 1 0.5
2 1 1 0.5
3 2 10 0.5
4 2 10 0.5
5 3 20 0.5
6 3 20 0.5
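As a small variation on the ave approach, the anonymous function can be replaced by prop.table, which for a plain numeric vector returns x / sum(x). A minimal sketch using the data frame from the question:

```r
df <- data.frame(a = factor(c(1, 1, 2, 2, 3, 3)), b = c(1, 1, 10, 10, 20, 20))

# Share of b within each level of a.
df$c <- ave(df$b, df$a, FUN = prop.table)
df
```

Every value of c is 0.5 here because each group holds two equal values of b.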

Resources