Creating groups based on running totals against a value - r

I have data in which each row is a unique value of one variable, Y. Another variable, Z, tells me how many people are in each Y. I want to create groups of roughly 45 people from Y and Z: whenever the running total of Z reaches or exceeds 45, one group is complete and the code moves on to create the next group.
My data looks something like this
ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
13 A M 1
14 A N 1
15 A O 2
16 A P 0
17 A Q 1
18 A R 2
What I want is something like this
ID X Y Z CumSum Group
1 A A 1 1 1
2 A B 5 6 1
3 A C 2 8 1
4 A D 42 50 1
5 A E 10 10 2
6 A F 2 12 2
7 A G 0 12 2
8 A H 3 15 2
9 A I 0 15 2
10 A J 8 23 2
11 A K 19 42 2
12 A L 3 45 2
13 A M 1 1 3
14 A N 1 2 3
15 A O 2 4 3
16 A P 0 4 3
17 A Q 1 5 3
18 A R 2 7 3
Please let me know how I can achieve this with R.
EDIT: I extended the minimum reproducible example for more clarity
EDIT 2: I have one extra question on this topic. What if the variable X, which is always A right now, also changes? For example, it could be B for a while and then become C. How can I prevent the code from generating groups that span two categories of X? For example, if Group = 3, how can I make sure that group 3 does not contain rows from both category A and category B?

A function for this is available in the MESS-package...
library(MESS)
library(data.table)
DT[, Group := MESS::cumsumbinning(Z, 50)][, Cumsum := cumsum(Z), by = .(Group)][]
output
ID X Y Z Group Cumsum
1: 1 A A 1 1 1
2: 2 A B 5 1 6
3: 3 A C 2 1 8
4: 4 A D 42 1 50
5: 5 A E 10 2 10
6: 6 A F 2 2 12
7: 7 A G 0 2 12
8: 8 A H 3 2 15
9: 9 A I 0 2 15
10: 10 A J 8 2 23
11: 11 A K 19 2 42
12: 12 A L 3 2 45
sample data
DT <- fread("ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3")

Define Accum, which adds x to acc, resetting to x if acc is already 45 or more. Use Reduce to apply it to Z, which gives the cumulative sum column. The values greater than or equal to 45 are the group ends, so a cumsum that starts from the end and runs backwards toward the beginning (cumsumr) gives every row of a group the same value, and that value is unique to the group. Finally, renumber these ids so that they start from 1. We run this with the input in the Note at the end, which repeats the last line twice more so that 3 groups can be shown. No packages are used.
Accum <- function(acc, x) if (acc < 45) acc + x else x
applyAccum <- function(x) Reduce(Accum, x, accumulate = TRUE)
cumsumr <- function(x) rev(cumsum(rev(x))) # reverse cumsum
GroupNo <- function(x) {
y <- cumsumr(x >= 45)
max(y) - y + 1
}
# group by X: ID differs from row to row, so grouping by ID would not accumulate
transform(transform(DF, Cumsum = ave(Z, X, FUN = applyAccum)),
  Group = ave(Cumsum, X, FUN = GroupNo))
giving:
ID X Y Z Cumsum Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
13 12 A L 3 3 3
14 12 A L 3 6 3
Note
The input in reproducible form:
Lines <- "ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
12 A L 3
12 A L 3"
DF <- read.table(text = Lines, as.is = TRUE, header = TRUE)
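To see why the reverse cumulative sum yields group ids, here is a minimal sketch on a short vector of running totals. The vector `running` is made up for illustration and is not taken from the data above; `cumsumr` is the same helper as in the answer.

```r
cumsumr <- function(x) rev(cumsum(rev(x)))  # reverse cumsum, as in the answer

running <- c(1, 6, 8, 50, 10, 45, 3)  # illustrative running totals with resets
ends <- running >= 45                  # TRUE marks the last row of a complete group
y <- cumsumr(ends)                     # number of group ends at or after each row
group <- max(y) - y + 1                # flip so that the first group is 1

print(data.frame(running, ends, y, group))
```

Rows 1 to 4 share y = 2, rows 5 and 6 share y = 1, and the trailing incomplete group has y = 0, so the flipped ids come out as 1, 1, 1, 1, 2, 2, 3.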

One tidyverse possibility could be:
library(dplyr)
library(purrr)  # for accumulate()

df %>%
  mutate(Cumsum = accumulate(Z, ~ if_else(.x >= 45, .y, .x + .y)),
         Group = cumsum(Cumsum >= 45),
         Group = if_else(Group > lag(Group, default = first(Group)), lag(Group), Group) + 1)
ID X Y Z Cumsum Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2

Not a pretty solution, but functional.
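library(dplyr)  # for lag()

df$Group <- 0
group <- 1
while (df$Group[nrow(df)] == 0) {
  # running total over the rows that have no group yet
  df$ww[df$Group == 0] <- cumsum(df$Z[df$Group == 0])
  # a row joins the current group if the previous total is still below 45
  # (so a total of exactly 45 closes the group) or if it is the first ungrouped row
  df$Group[df$Group == 0 & (lag(df$ww) < 45 | is.na(lag(df$ww)) | lag(df$Group != 0))] <- group
  group <- group + 1
}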
df
ID X Y Z ww Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
OK, yeah, @tmfmnk's solution is vastly better:
Unit: milliseconds
expr min lq mean median uq max neval
tm 2.224536 2.805771 6.76661 3.221449 3.990778 303.7623 100
iod 19.198391 22.294222 30.17730 25.765792 35.768616 110.2062 100

Or using data.table:
library(data.table)
n <- 45L
# reset only after the running total has reached n, so the row that crosses n stays in its group
DT[, cs := Reduce(function(tot, z) if (tot >= n) z else tot + z, Z, accumulate=TRUE)][,
  Group := .GRP, by=cumsum(c(1L, diff(cs))<0L)]
output:
ID X Y Z cs Group
1: 1 A A 1 1 1
2: 2 A B 5 6 1
3: 3 A C 2 8 1
4: 4 A D 42 50 1
5: 5 A E 10 10 2
6: 6 A F 2 12 2
7: 7 A G 0 12 2
8: 8 A H 3 15 2
9: 9 A I 0 15 2
10: 10 A J 8 23 2
11: 11 A K 19 42 2
12: 12 A L 3 45 2
13: 13 A M 1 1 3
14: 14 A N 1 2 3
15: 15 A O 2 4 3
16: 16 A P 0 4 3
17: 17 A Q 1 5 3
18: 18 A R 2 7 3
data:
library(data.table)
DT <- fread("ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
13 A M 1
14 A N 1
15 A O 2
16 A P 0
17 A Q 1
18 A R 2")
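For the follow-up in EDIT 2, one possible sketch in base R is to compute the group ids separately within each value of X, so that no group can contain rows from two categories. The helper `bin_by_total`, the data frame `DF2`, and the `GroupId` column are hypothetical names introduced here for illustration; they do not come from any of the answers.

```r
# Hypothetical helper: a group closes as soon as its running total reaches `limit`
bin_by_total <- function(z, limit = 45) {
  group <- integer(length(z))
  total <- 0
  current <- 1L
  for (i in seq_along(z)) {
    total <- total + z[i]
    group[i] <- current
    if (total >= limit) {  # threshold reached: the next row starts a new group
      total <- 0
      current <- current + 1L
    }
  }
  group
}

# Illustrative data in which X changes from A to B partway through
DF2 <- data.frame(
  X = c("A", "A", "A", "A", "B", "B", "B"),
  Z = c(1, 5, 42, 10, 30, 20, 5)
)

# ave() applies the helper within each X, so numbering restarts per category
DF2$Group <- ave(DF2$Z, DF2$X, FUN = bin_by_total)
# a label that is unique across the whole table, if one is needed
DF2$GroupId <- paste(DF2$X, DF2$Group, sep = "_")
print(DF2)
```

The loop is slower than the vectorised answers on large inputs, but it states the rule directly and is easy to adapt, for example to close a group before the row that would exceed the limit.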

Related

Sum values incrementally for panel data

I have a very basic question as I am relatively new to R. I was wondering how to add a value in a particular column to the previous one for each cross-sectional unit in my data separately. My data looks like this:
firm date value
A 1 10
A 2 15
A 3 20
A 4 0
B 1 0
B 2 1
B 3 5
B 4 10
C 1 3
C 2 2
C 3 10
C 4 1
D 1 7
D 2 3
D 3 6
D 4 9
And I want to achieve the data below. So I want to sum values for each cross-sectional unit incrementally.
firm date value cumulative value
A 1 10 10
A 2 15 25
A 3 20 45
A 4 0 45
B 1 0 0
B 2 1 1
B 3 5 6
B 4 10 16
C 1 3 3
C 2 2 5
C 3 10 15
C 4 1 16
D 1 7 7
D 2 3 10
D 3 6 16
D 4 9 25
Below is a reproducible example code. I tried lag() but couldn't figure out how to repeat it for each firm.
firm <- c("A","A","A","A","B","B","B","B","C","C","C", "C","D","D","D","D")
date <- c("1","2","3","4","1","2","3","4","1","2","3","4", "1", "2", "3", "4")
value <- c(10, 15, 20, 0, 0, 1, 5, 10, 3, 2, 10, 1, 7, 3, 6, 9)
data <- data.frame(firm = firm, date = date, value = value)
Does this work?
library(dplyr)
df %>% group_by(firm) %>% mutate(cumulative_value = cumsum(value))
# A tibble: 16 x 4
# Groups: firm [4]
firm date value cumulative_value
<chr> <int> <int> <int>
1 A 1 10 10
2 A 2 15 25
3 A 3 20 45
4 A 4 0 45
5 B 1 0 0
6 B 2 1 1
7 B 3 5 6
8 B 4 10 16
9 C 1 3 3
10 C 2 2 5
11 C 3 10 15
12 C 4 1 16
13 D 1 7 7
14 D 2 3 10
15 D 3 6 16
16 D 4 9 25
Using base R with ave
data$cumulative_value <- with(data, ave(value, firm, FUN = cumsum))
-output
> data
firm date value cumulative_value
1 A 1 10 10
2 A 2 15 25
3 A 3 20 45
4 A 4 0 45
5 B 1 0 0
6 B 2 1 1
7 B 3 5 6
8 B 4 10 16
9 C 1 3 3
10 C 2 2 5
11 C 3 10 15
12 C 4 1 16
13 D 1 7 7
14 D 2 3 10
15 D 3 6 16
16 D 4 9 25
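One point worth keeping in mind with either approach: cumsum follows the current row order, so panel data that is not already sorted should be ordered by firm and date first. A minimal sketch, using a small unsorted data frame made up for illustration (`panel` is a hypothetical name):

```r
# Illustrative panel in arbitrary row order
panel <- data.frame(
  firm  = c("A", "B", "A", "B"),
  date  = c(2, 1, 1, 2),
  value = c(15, 0, 10, 1)
)

# sort within firm by date, then accumulate per firm
panel <- panel[order(panel$firm, panel$date), ]
panel$cumulative_value <- ave(panel$value, panel$firm, FUN = cumsum)
print(panel)
```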

Summing up consecutive values in groups [duplicate]

This question already has answers here:
Calculate cumulative sum (cumsum) by group
(5 answers)
Closed 2 years ago.
I'd like to sum up consecutive values in one column by group. Without a long explanation, I have a df like this:
set.seed(1)
gr <- c(rep('A',3),rep('B',2),rep('C',5),rep('D',3))
vals <- floor(runif(length(gr), min=0, max=10))
idx <- c(seq(1:3),seq(1:2),seq(1:5),seq(1:3))
df <- data.frame(gr,vals,idx)
gr vals idx
1 A 2 1
2 A 3 2
3 A 5 3
4 B 9 1
5 B 2 2
6 C 8 1
7 C 9 2
8 C 6 3
9 C 6 4
10 C 0 5
11 D 2 1
12 D 1 2
13 D 6 3
And I'm looking for this one:
gr vals idx
1 A 2 1
2 A 5 2
3 A 10 3
4 B 9 1
5 B 11 2
6 C 8 1
7 C 17 2
8 C 23 3
9 C 29 4
10 C 29 5
11 D 2 1
12 D 3 2
13 D 9 3
For example, in group C we have 8+9=17 (the first and second elements of the group), and the second value is replaced by that sum. Then 17+6=23 (the previous sum plus the third element), the third element is replaced by the new result, and so on...
I was looking for some solution here but it isn't what I'm looking for.
Ok, I think I got it
df %>%
group_by(gr) %>%
mutate(nvals = cumsum(vals))
gr vals idx nvals
1 A 2 1 2
2 A 3 2 5
3 A 5 3 10
4 B 9 1 9
5 B 2 2 11
6 C 8 1 8
7 C 9 2 17
8 C 6 3 23
9 C 6 4 29
10 C 0 5 29
11 D 2 1 2
12 D 1 2 3
13 D 6 3 9
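The same result can be sketched in base R without dplyr by splitting the values by group, accumulating each piece, and putting the pieces back in the original row order. The values below are typed in from the printed table above rather than regenerated with set.seed.

```r
gr   <- c(rep("A", 3), rep("B", 2), rep("C", 5), rep("D", 3))
vals <- c(2, 3, 5, 9, 2, 8, 9, 6, 6, 0, 2, 1, 6)  # copied from the table above

# split by group, take the cumulative sum of each piece, then reassemble
nvals <- unsplit(lapply(split(vals, gr), cumsum), gr)
print(data.frame(gr, vals, nvals))
```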

Determining the number of times the values 0, 1, or NA occur in each column of a data frame in R

I have 15 columns and, for each column, I want to count how many values are 0, 1, or NA.
my dataset
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
NA,1.0,0.0,0.0,NA,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,NA,NA,NA,NA
1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,NA,0.0,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0
1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0
0.0,1.0,1.0,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
NA,NA,1.0,NA,NA,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
0.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,NA,NA,NA,NA,NA
I want output to be like:
A B C D E F G H I J K L M N O
0 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
1 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
NA 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
We can loop over the columns of the dataset and apply table with useNA="always"
sapply(df1, table, useNA="always")
If a column contains only one of the values, say 1, the tables will not all have the same entries, so convert each column to a factor with the levels specified as 0 and 1
sapply(df1, function(x) table(factor(x, levels = 0:1), useNA = "always"))
# A B C D E F G H I J K L M N O
#0 4 3 8 7 17 15 14 11 14 12 12 10 8 11 9
#1 19 21 17 17 6 9 10 12 8 11 8 10 12 9 11
#<NA> 2 1 0 1 2 1 1 2 3 2 5 5 5 5 5
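As a cross-check, the same counts can be sketched with colSums on logical comparisons, which always returns one number per column regardless of which values are present. The small data frame `small` is made up for illustration and is not the dataset from the question.

```r
# Illustrative data: column B never contains a 0
small <- data.frame(A = c(0, 1, NA, 1), B = c(1, 1, 1, NA))

# table-based counts with fixed levels, as in the answer
by_table <- sapply(small, function(x) table(factor(x, levels = 0:1), useNA = "always"))

# the same counts from logical comparisons
by_sums <- rbind(
  `0`  = colSums(small == 0, na.rm = TRUE),
  `1`  = colSums(small == 1, na.rm = TRUE),
  `NA` = colSums(is.na(small))
)
print(by_sums)
```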

Change variable value-- repeated IDs

I've this data set
id <- c(0,0,1,1,2,2,3,3,4,4)
gender <- c("m","m","f","f","f","f","m","m","m","m")
x1 <-c(1,1,1,1,2,2,3,3,10,10)
x2 <- c(3,7,5,6,9,15,10,15,12,20)
alldata <- data.frame(id,gender,x1,x2)
which looks like:
id gender x1 x2
0 m 1 3
0 m 1 7
1 f 1 5
1 f 1 6
2 f 2 9
2 f 2 15
3 m 3 10
3 m 3 15
4 m 10 12
4 m 10 20
Notice that for each unique id x1 are similar, but x2 are different. I need to sort data by id and x2 (from smallest to largest)
and then for each unique id I need to set x1(for the second record) = x2 (for the first record).
The data would look like:
id gender x1 x2
0 m 1 3
0 m 3 7
1 f 1 5
1 f 5 6
2 f 2 9
2 f 9 15
3 m 3 10
3 m 10 15
4 m 10 12
4 m 12 20
I found this easier using data.table
> library(data.table)
> dt = data.table(alldata)
> setkey(dt, id, x2) #sort the data
This next line says: within each ID for x1, take the first value of x1, then every remaining value take from x2 as needed.
> dt[,x1 := c(x1[1], x2)[1:.N],keyby=id]
> dt
id gender x1 x2
1: 0 m 1 3
2: 0 m 3 7
3: 1 f 1 5
4: 1 f 5 6
5: 2 f 2 9
6: 2 f 9 15
7: 3 m 3 10
8: 3 m 10 15
9: 4 m 10 12
10: 4 m 12 20
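The same idea (keep the first x1 of each id, then take the previous row's x2) can be sketched in base R with a one-row shift. Unlike approaches that assume exactly two rows per id, this should also work for ids with more rows; the names `first_in_id` and `prev_x2` are introduced here for illustration.

```r
alldata <- data.frame(
  id     = c(0, 0, 1, 1, 2, 2, 3, 3, 4, 4),
  gender = c("m", "m", "f", "f", "f", "f", "m", "m", "m", "m"),
  x1     = c(1, 1, 1, 1, 2, 2, 3, 3, 10, 10),
  x2     = c(3, 7, 5, 6, 9, 15, 10, 15, 12, 20)
)

# sort by id and x2, then shift x2 down by one row
alldata <- alldata[order(alldata$id, alldata$x2), ]
first_in_id <- !duplicated(alldata$id)          # TRUE on the first row of each id
prev_x2 <- c(NA, head(alldata$x2, -1))          # x2 of the previous row
alldata$x1 <- ifelse(first_in_id, alldata$x1, prev_x2)
print(alldata)
```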
Here's another possible solution using the seq command to select every other record:
alldata <- alldata[order(alldata$id, alldata$x2),]
alldata$x1[seq(2, length(alldata$x1), 2)] <- alldata$x2[seq(1, length(alldata$x2) - 1, 2)]
Here is a dplyr solution.
library(dplyr)
arrange(alldata,id,x2) %>%
group_by(id) %>%
mutate(x1= c(first(x1), first(x2)))
Source: local data frame [10 x 4]
Groups: id
id gender x1 x2
1 0 m 1 3
2 0 m 3 7
3 1 f 1 5
4 1 f 5 6
5 2 f 2 9
6 2 f 9 15
7 3 m 3 10
8 3 m 10 15
9 4 m 10 12
10 4 m 12 20
`rownames<-`(do.call(rbind,by(alldata,alldata$id,function(g) { o <- order(g$x2); g$x1[o[2]] <- g$x2[o[1]]; g; })),NULL);
## id gender x1 x2
## 1 0 m 1 3
## 2 0 m 3 7
## 3 1 f 1 5
## 4 1 f 5 6
## 5 2 f 2 9
## 6 2 f 9 15
## 7 3 m 3 10
## 8 3 m 10 15
## 9 4 m 10 12
## 10 4 m 12 20

add new row in R based on a factor variable

I have a data.frame with multiple entries per unique id. I need to determine which rows exceed a predefined time limit of 60 seconds. I have already added a column that contains the term "toolong" for every row whose time needs to be split. For each such row, I want to create a new row directly below it that preserves all the information of the "parent row", except that the action column is changed to "l" and the time column to the original time - 60. The parent row also keeps all its information, except that the action column is changed to "for" and the time to 60 seconds. The original data has 32 columns in total, so all the contents of the row other than action and time must be preserved.
Example:
id <- c(1,1,1,1,2,2,2,2)
resting <- c("f","f","toolong","f","f","f","toolong","f")
action <- c("h","h","l","d","h","h","l","d")
time <- c(90,12,120,14,90,12,110,14)
other <- c(1,2,3,4,5,6,5,4)
dat <- data.frame(id,resting,action,time,other) # no cbind(), which would turn every column into character
How I would like it to look:
id2 resting2 action2 time2 other2
1 1 f h 90 1
2 1 f h 12 2
3 1 toolong for 60 3
4 1 toolong l 60 3
5 1 f d 14 4
6 2 f h 90 5
7 2 f h 12 6
8 2 toolong for 60 5
9 2 toolong l 50 5
10 2 f d 14 4
Thanks, Tim
First, repeat the rows that are toolong...
R>rowID <- rep(1:8, times = 1 + (resting == "toolong")) # "toolong" rows appear twice
R>dat2 <- dat[rowID,]
R>dat2
id resting action time other
1 1 f h 90 1
2 1 f h 12 2
3 1 toolong l 120 3
3.1 1 toolong l 120 3
4 1 f d 14 4
5 2 f h 90 5
6 2 f h 12 6
7 2 toolong l 110 5
7.1 2 toolong l 110 5
8 2 f d 14 4
Then, for the duplicated rows, subtract 60 seconds for each previous copy of the record...
R>dups <- unlist(tapply(duplicated(rowID), rowID,cumsum))
R>dat2$time <- dat2$time - 60*dups
R>dat2[dat2$resting == "toolong", "time"] <- pmin(60, dat2[dat2$resting == "toolong", "time"] )
R>dat2
id resting action time other
1 1 f h 90 1
2 1 f h 12 2
3 1 toolong l 60 3
3.1 1 toolong l 60 3
4 1 f d 14 4
5 2 f h 90 5
6 2 f h 12 6
7 2 toolong l 60 5
7.1 2 toolong l 50 5
8 2 f d 14 4
dat2 <- rbind(dat, dat[ dat$resting=="toolong" , ])
dat2 <- dat2[order(rownames(dat2)), ]
dat2[duplicated(dat2), "action"] <- "l"
names(dat2) <- paste0(names(dat2), "2")
dat2
#-------
id2 resting2 action2 time2 other2
1 1 f h 90 1
2 1 f h 12 2
3 1 toolong l 120 3
31 1 toolong l 120 3
4 1 f d 14 4
5 2 f h 90 5
6 2 f h 12 6
7 2 toolong l 110 5
71 2 toolong l 110 5
8 2 f d 14 4
The other way for constructing duplicate rownames to be used as a selection vector was with mapply and adding 1 to a logical vector. That may have some advantage as the periods in the rownames are a better "dupe" indicator.
dat[ unlist(mapply( rep, rownames(dat), 1+(dat$resting=="toolong"))) , ]
id resting action time other
1 1 f h 90 1
2 1 f h 12 2
3 1 toolong l 120 3
3.1 1 toolong l 120 3
4 1 f d 14 4
5 2 f h 90 5
6 2 f h 12 6
7 2 toolong l 110 5
7.1 2 toolong l 110 5
8 2 f d 14 4
Addressing the comment and the modifications to the question:
dat2$action2 <- as.character(dat2$action2)
dat2[ dat2$resting2=="toolong" & !duplicated(dat2) , "action2"] <- "for"
dat2
id2 resting2 action2 time2 other2
1 1 f h 90 1
2 1 f h 12 2
3 1 toolong for 120 3
31 1 toolong l 120 3
4 1 f d 14 4
5 2 f h 90 5
6 2 f h 12 6
7 2 toolong for 110 5
71 2 toolong l 110 5
8 2 f d 14 4
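Putting the pieces together, here is a minimal base R sketch that produces the layout asked for in the question: the parent row gets action "for" and time 60, and the copy below it gets action "l" and the remaining time. It assumes numeric time values, character columns, and at most one split per row (time no greater than 120); the names `limit`, `too_long`, `idx`, `is_copy`, and `is_parent` are introduced here for illustration.

```r
dat <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2, 2, 2),
  resting = c("f", "f", "toolong", "f", "f", "f", "toolong", "f"),
  action  = c("h", "h", "l", "d", "h", "h", "l", "d"),
  time    = c(90, 12, 120, 14, 90, 12, 110, 14),
  other   = c(1, 2, 3, 4, 5, 6, 5, 4),
  stringsAsFactors = FALSE
)

limit <- 60
too_long <- dat$resting == "toolong"

# repeat each "toolong" row twice, keeping every column
idx <- rep(seq_len(nrow(dat)), times = 1 + too_long)
out <- dat[idx, ]

is_copy   <- duplicated(idx)             # the second copy of a split row
is_parent <- too_long[idx] & !is_copy    # the first copy of a split row

out$time[is_copy]     <- out$time[is_copy] - limit  # remainder after the limit
out$time[is_parent]   <- limit
out$action[is_parent] <- "for"
out$action[is_copy]   <- "l"
rownames(out) <- NULL
print(out)
```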

Resources