I have the following data frame:
id = c("A","A","A","A","A","A","B","B","B","B","B","B","C","C","C","C","C","C")
month = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6)
amount = c(0,0,10,0,0,0,0,10,0,10,0,0,0,0,0,10,10,0)
df <- data.frame(id, month, amount)
What I need to do (by ID) is:
Calculate (by way of a negative number) the difference in months between each zero-"amount" row and the next non-zero "amount" row; when the amount becomes non-zero, the time = 0. THEN, once an "amount" has exceeded zero in the sequence, the calculation (by way of a positive number) looks back and gives the difference in months between each subsequent zero-"amount" row and that last non-zero row.
The solution would look like:
solution = c(-2,-1,0,1,2,3,-1,0,1,0,1,2,-3,-2,-1,0,0,1)
As you can probably tell, it's pretty tough to search for this multi-layered problem. Ideally the answer would use data.table since I'm dealing with millions of rows, but dplyr would also suit my needs.
Any help appreciated.
S.
library(data.table)
DT <- as.data.table(df)  # work on a data.table copy of df
DT[, g := rleid(id, amount != 0)]  # run id over (id, zero vs non-zero amount)
DT[, g_id := g - g[1L], by=id]     # run counter within each id, starting at 0
DT[, v :=
     if (g_id[1L] == 0L)           # leading zero run: count down to the first non-zero
       -(.N:1)
     else if (g_id[1L] %% 2 == 0)  # later zero runs: count up from the last non-zero
       1:.N
     else                          # the non-zero runs themselves
       0L
   , by=.(id, g_id)]
all.equal(DT$v, solution) # TRUE
To see how it works:
id month amount g g_id v
1: A 1 0 1 0 -2
2: A 2 0 1 0 -1
3: A 3 10 2 1 0
4: A 4 0 3 2 1
5: A 5 0 3 2 2
6: A 6 0 3 2 3
7: B 1 0 4 0 -1
8: B 2 10 5 1 0
9: B 3 0 6 2 1
10: B 4 10 7 3 0
11: B 5 0 8 4 1
12: B 6 0 8 4 2
13: C 1 0 9 0 -3
14: C 2 0 9 0 -2
15: C 3 0 9 0 -1
16: C 4 10 10 1 0
17: C 5 10 10 1 0
18: C 6 0 11 2 1
You can drop the extra columns with DT[, c("g", "g_id") := NULL].
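If rleid() is unfamiliar, here is a minimal sketch (on made-up vectors) of how it forms the run groups used above: a new id starts whenever either argument changes.
library(data.table)
rleid(c("A", "A", "A", "B"), c(FALSE, TRUE, TRUE, FALSE))
# [1] 1 2 2 3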
With tidyr and dplyr
library(dplyr)
library(tidyr)
df_new <- df %>%
  group_by(id) %>%
  # identify non-zero instances
  mutate(temp = ifelse(amount != 0, month, NA)) %>%
  # fill down first
  fill(temp, .direction = "down") %>%
  # fill up after
  fill(temp, .direction = "up") %>%
  # calculate difference
  mutate(solution = month - temp) %>%
  # remove temp
  select(-temp)
Result
# id month amount solution
# <fctr> <dbl> <dbl> <dbl>
# 1 A 1 0 -2
# 2 A 2 0 -1
# 3 A 3 10 0
# 4 A 4 0 1
# 5 A 5 0 2
# 6 A 6 0 3
# 7 B 1 0 -1
# 8 B 2 10 0
# 9 B 3 0 1
# 10 B 4 10 0
# 11 B 5 0 1
# 12 B 6 0 2
# 13 C 1 0 -3
# 14 C 2 0 -2
# 15 C 3 0 -1
# 16 C 4 10 0
# 17 C 5 10 0
# 18 C 6 0 1
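If your tidyr version is 1.0.0 or later, the two fill() calls can likely be collapsed into one with .direction = "downup"; a sketch of the same pipeline, assuming the df and libraries above:
df %>%
  group_by(id) %>%
  mutate(temp = ifelse(amount != 0, month, NA)) %>%
  fill(temp, .direction = "downup") %>%  # fill down, then up, in one call
  mutate(solution = month - temp) %>%
  select(-temp)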
Related
Say I have the following data.table:
library(data.table)
DT <- data.table(
  ID = rep(c(1,2,3), 4),
  day = c(rep(1,3), rep(2,3), rep(3,3), rep(4,3)),
  Status = c(rep('A',3), 'A','B','B', 'A','C','B', 'A','D','C')
)
What I would like to achieve is, for each ID (sorted by day), to find the number of days it takes to hit a specific Status. So if I do this for Status C, the result would be:
0 for ID 1 (since it never reaches status C), 3 for ID 2, and 4 for ID 3.
The only approach that came to mind was to write a function with nested for loops, but I am sure there are much better/faster/more efficient ways.
Appreciate any help.
A possible data.table approach adding one column for the number of days to reach each status (0 if never reached):
library(data.table)
## status id's
status_ids <- unique(DT$Status)
status_cols <- paste("status", status_ids, sep = "_")
## add one column for each status id
setorder(DT, ID, day)
DT[, (status_cols) := lapply(status_ids,
  \(s) if (any(Status == s)) min(day[Status == s]) else 0), by = "ID"]
DT
#> ID day Status status_A status_B status_C status_D
#> 1: 1 1 A 1 0 0 0
#> 2: 1 2 A 1 0 0 0
#> 3: 1 3 A 1 0 0 0
#> 4: 1 4 A 1 0 0 0
#> 5: 2 1 A 1 2 3 4
#> 6: 2 2 B 1 2 3 4
#> 7: 2 3 C 1 2 3 4
#> 8: 2 4 D 1 2 3 4
#> 9: 3 1 A 1 2 4 0
#> 10: 3 2 B 1 2 4 0
#> 11: 3 3 B 1 2 4 0
#> 12: 3 4 C 1 2 4 0
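If you would rather have a single summary row per ID than a repeated column, a hedged variation of the same idea:
## one row per ID; 0 when the status is never reached
DT[, lapply(setNames(status_ids, status_cols),
            \(s) if (any(Status == s)) min(day[Status == s]) else 0),
   by = "ID"]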
You can split by ID and return the day of the first match.
sapply(split(DT[,2:3], DT$ID), \(x) x$day[match("C", x$Status)])
# 1 2 3
#NA 3 4
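If you need 0 instead of NA for IDs that never reach the status (to match the expected output), one possible follow-up:
res <- sapply(split(DT[, 2:3], DT$ID), \(x) x$day[match("C", x$Status)])
res[is.na(res)] <- 0  # treat "never reached" as 0
res
# 1 2 3
# 0 3 4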
Does this work:
library(dplyr)
library(tidyr)  # for replace_na()
DT %>% left_join(
  DT %>% group_by(ID) %>% summarise(col = row_number()[Status == 'C'])
) %>% replace_na(list(col = 0))
`summarise()` has grouped output by 'ID'. You can override using the
`.groups` argument.
Joining, by = "ID"
ID day Status col
1: 1 1 A 0
2: 2 1 A 3
3: 3 1 A 4
4: 1 2 A 0
5: 2 2 B 3
6: 3 2 B 4
7: 1 3 A 0
8: 2 3 C 3
9: 3 3 B 4
10: 1 4 A 0
11: 2 4 D 3
12: 3 4 C 4
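A pure data.table variant of the same idea, as a hedged sketch (writing to a hypothetical col column):
setorder(DT, ID, day)
DT[, col := {
  d <- day[match("C", Status)]  # day of the first C within the ID, NA if absent
  fifelse(is.na(d), 0, d)
}, by = ID]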
I have a data set. Each ID has a sequence of 0s and 1s, a column of dates, and another column, num.day. I would like to find the start and end of the longest run of 1s for each ID, then calculate the date gap between start and end, and after that add the end date's corresponding num.day.
For example, in the following table for ID = 1, the longest run starts with date 4 (one record above) and ends with 9, so the gap is 5. Then adding the end date's corresponding num.day gives day_gap = (9-4) + 3 = 8. If one ID has multiple runs of the same longest length, take the max day_gap of these runs for that ID.
Here is the code to create the dummy table
library(data.table)
ID=c(rep(1,10),rep(2,10),rep(3,10))
set.seed(1)
fill=sample(c(0,1),length(ID),replace=TRUE)
dat=data.table(ID,fill)[,date:=seq(.N),by="ID"][date==1,fill:=0]
set.seed(1)
dat$num.days=sample(1:10,nrow(dat),replace=TRUE)
Here is my solution with dplyr:
library(dplyr)
dat %>%
  group_by(ID) %>%
  mutate(group = cumsum(fill != lag(fill, default = fill[[1]])) + 1) %>%
  ungroup() %>%
  filter(fill == 1) %>%
  group_by(ID, group) %>%
  mutate(fill_group_no_max = max(row_number())) %>%
  ungroup() %>%
  group_by(ID) %>%
  filter(fill_group_no_max == max(fill_group_no_max),
         group == max(group)) %>%
  summarise(dategap = max(date) - min(date) + last(num.days)) %>%
  ungroup()
Update: Sorry! I forgot the summarise part
An option using base::rle in data.table:
dat[, {
  r <- rle(fill)
  i <- which.max(r$lengths * r$values)   # longest run of 1s (zero runs score 0)
  .(day_gap = r$lengths[i] +             # length of that run ...
      num.days[sum(r$lengths[1L:i])])    # ... plus num.days at the run's last row
}, ID]
output:
ID day_gap
1: 1 11
2: 2 12
3: 3 8
data:
library(data.table)
ID = c(rep(1,10),rep(2,10),rep(3,10))
set.seed(1)
fill = sample(c(0,1),length(ID),replace=TRUE)
dat = data.table(ID,fill)[,date:=seq(.N),by="ID"][date==1,fill:=0]
set.seed(1)
dat$num.days=sample(1:10,nrow(dat),replace=TRUE)
dat
dat:
ID fill date num.days
1: 1 0 1 3
2: 1 0 2 4
3: 1 1 3 6
4: 1 1 4 10
5: 1 0 5 3
6: 1 1 6 9
7: 1 1 7 10
8: 1 1 8 7
9: 1 1 9 7
10: 1 0 10 1
11: 2 0 1 3
12: 2 0 2 2
13: 2 1 3 7
14: 2 0 4 4
15: 2 1 5 8
16: 2 0 6 5
17: 2 1 7 8
18: 2 1 8 10
19: 2 0 9 4
20: 2 1 10 8
21: 3 0 1 10
22: 3 0 2 3
23: 3 1 3 7
24: 3 0 4 2
25: 3 0 5 3
26: 3 0 6 4
27: 3 0 7 1
28: 3 0 8 4
29: 3 1 9 9
30: 3 0 10 4
ID fill date num.days
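To unpack the rle() step, here are the intermediate values for ID 1 (a sketch, assuming dat as above):
r <- rle(dat[ID == 1, fill])
r$lengths                        # 2 2 1 4 1
r$values                         # 0 1 0 1 0
which.max(r$lengths * r$values)  # 4: index of the longest run of 1s
sum(r$lengths[1:4])              # 9: row where that run ends
# so day_gap = 4 + num.days[9] = 4 + 7 = 11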
I have the following dataframe (df):
  N  A  B  T  Required col (window = 3)
  1  1  0  1
  2  3  0  3
  3  4  0  4
  4  2  1  1  4
  5  6  0  0  2
  6  4  1  1  0
  7  7  1  1  1
  8  8  1  1  1
  9  1  0  0  1
I would like to add the required column as follows:
Insert in the current row the previous row's value of A or B.
If, in the last 3 (window) rows, the content of column A equals column T most of the time, choose A; otherwise choose B. (There can be more columns, in which case the column that equals T the most times is chosen.)
What is the most efficient way to do this for a big data.table?
I changed the column named T to be named TC to avoid confusion with T as an abbreviation for TRUE
library(tidyverse)
library(data.table)
df[, newcol := {
  equal <- A == TC
  map_dbl(1:.N, ~ if (.x <= 3) NA                         # no full window yet
          else if (sum(equal[.x - 1:3]) > 3/2) A[.x - 1]  # A matched TC most often
          else B[.x - 1])                                 # otherwise take B
}]
df
# N A B TC newcol
# 1: 1 1 0 1 NA
# 2: 2 3 0 3 NA
# 3: 3 4 0 4 NA
# 4: 4 2 1 1 4
# 5: 5 6 0 0 2
# 6: 6 4 1 1 0
# 7: 7 7 1 1 1
# 8: 8 8 1 1 1
# 9: 9 1 0 0 1
This works too, but it's less clear, and likely less efficient
df[, newcol := shift(A == TC, 1:3) %>%
     pmap_lgl(~ sum(...) > 3/2) %>%
     ifelse(shift(A), shift(B))]
data:
df <- fread("
N A B TC
1 1 0 1
2 3 0 3
3 4 0 4
4 2 1 1
5 6 0 0
6 4 1 1
7 7 1 1
8 8 1 1
9 1 0 0
")
Probably much less efficient than the answer by Ryan, but without additional packages.
A<-c(1,3,4,2,6,4,7,8,1)
B<-c(0,0,0,1,0,1,1,1,0)
TC<-c(1,3,4,1,0,1,1,1,0)
req<-rep(NA,9)
df<-data.frame(A,B,TC,req)
window<-3
for(i in window:(length(req)-1)){
equal <- sum(df$A[(i-window+1):i]==df$TC[(i-window+1):i])
if(equal > window/2){
df$req[i+1]<-df$A[i]
}else{
df$req[i+1]<-df$B[i]
}
}
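For large tables, a vectorised data.table variant built on frollsum() (data.table >= 1.12.0) may be worth benchmarking; a sketch under the same window = 3 assumption, writing to a hypothetical newcol2:
library(data.table)
# majority vote of A == TC over the previous 3 rows, then A or B from the previous row
df[, newcol2 := fifelse(shift(frollsum(as.numeric(A == TC), 3) > 3/2),
                        shift(A), shift(B))]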
Given the following example data table:
library(data.table)
DT <- fread("grp y exclude
a 1 0
a 2 0
a 3 0
a 4 1
a 5 0
a 7 1
a 8 0
a 9 0
a 10 0
b 1 0
b 2 0
b 3 0
b 4 1
b 5 0
b 6 1
b 7 1
b 8 0
b 9 0
b 10 0
c 5 1
d 1 0")
I want to select
1. by group grp
2. all rows that have y==5
3. and up to two rows before and after each row from 2. within the grouping,
4. but only those rows that have exclude==0.
Assuming each group has at most one row with y==5, this yields the desired result for 1.-3.:
idx <- -2:2 # 2 rows before match, the matching row itself, and two rows after match
(row_numbers <- DT[, .I[{
  x <- rep(which(y==5), each=length(idx)) + idx
  x[x > 0 & x <= .N]
}], by=grp]$V1)
# [1] 3 4 5 6 7 12 13 14 15 16 20
DT[row_numbers]
# grp y exclude
# 1: a 3 0
# 2: a 4 1
# 3: a 5 0 # y==5 + two rows before and two rows after
# 4: a 7 1
# 5: a 8 0
# 6: b 3 0
# 7: b 4 1
# 8: b 5 0 # y==5 + two rows before and two rows after
# 9: b 6 1
# 10: b 7 1
# 11: c 5 1 # y==5 + nothing, because the group has only 1 element
However, how do I incorporate 4. so that I get
# grp y exclude
# 1: a 2 0
# 2: a 3 0
# 3: a 5 0
# 4: a 8 0
# 5: a 9 0
# 6: b 2 0
# 7: b 3 0
# 8: b 5 0
# 9: b 8 0
# 10: b 9 0
# 11: c 5 1
? Feels like I'm close, but I guess I've looked too long at head()s and which()es now, so I'd be thankful for some fresh ideas.
A bit more simplified:
DT[DT[, rn := .I][exclude==0 | y==5][, rn[abs(.I - .I[y==5]) <= 2], by=grp]$V1]
# grp y exclude rn
#1: a 2 0 2
#2: a 3 0 3
#3: a 5 0 5
#4: a 8 0 7
#5: a 9 0 8
#6: b 2 0 11
#7: b 3 0 12
#8: b 5 0 14
#9: b 8 0 17
#10: b 9 0 18
#11: c 5 1 20
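Note that the one-liner adds rn to DT by reference; assuming you assigned the result to res, both helper columns can be dropped afterwards:
res[, rn := NULL]  # drop the helper from the result
DT[, rn := NULL]   # and from DT itself, since := modified it in place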
You are very close. This should do it:
row_numbers <- DT[exclude==0 | y==5, .I[{
  x <- rep(which(y==5), each=length(idx)) + idx
  x[x > 0 & x <= .N]
}], by=grp]$V1
DT[row_numbers]
I have code that needs to be flexible, so I cannot hard-code column names when grouping. Instead I want to hard-code column numbers for the grouping, since those are easy to specify as ranges (columns 1 through X or so, rather than listing the names of columns 1, 2, ..., X).
Example data set:
set.seed(007)
DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE), Q =sample(0:5, 20, TRUE))
DF
X Y Z Q
1 1 1 3 4
2 2 0 1 2
3 3 0 5 4
4 4 0 5 2
5 5 0 5 5
6 6 1 0 1
7 7 0 3 0
8 8 1 2 4
9 9 0 5 5
10 10 0 2 5
11 11 0 4 3
12 12 0 1 4
13 13 1 1 4
14 14 0 1 3
15 15 0 2 4
16 16 0 5 2
17 17 1 2 0
18 18 0 4 1
19 19 1 5 2
20 20 0 2 1
A grouping (by Z and Q) that finds the X that maximizes Y, and returns both:
DF <- data.table(DF)
DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
Result:
Z Q Y X
1: 3 4 1 1
2: 1 2 0 2
3: 5 4 0 3
4: 5 2 1 19
5: 5 5 0 5
6: 0 1 1 6
7: 3 0 0 7
8: 2 4 1 8
9: 2 5 0 10
10: 4 3 0 11
11: 1 4 1 13
12: 1 3 0 14
13: 2 0 1 17
14: 4 1 0 18
15: 2 1 0 20
I want to do this purely using column numbers, because of the nature of my code. Additionally, if there were another column, I would potentially want to group by that extra column as well, and I might also want to return another argmax in the first part.
Maybe just pick off names(DF) with column numbers, combined with eval(parse(...))?
useColNums <- function(data, a, b) {
  n <- names(data)
  y <- n[a[1]]
  x <- n[a[2]]
  groupby <- sprintf("list(%s)", paste(n[b], collapse=","))
  argmax <- sprintf("list(%1$s=max(%1$s),%2$s=%2$s[which.max(%1$s)])", y, x)
  data[, eval(parse(text=argmax)), by=eval(parse(text=groupby))]
}
x <- useColNums(DF, 2:1, 3:4)
y <- DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
identical(x, y)
# [1] TRUE
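Since by= also accepts a character vector of column names, you can likely avoid parse() by mapping the numbers to names and indexing .SD instead; a sketch with a hypothetical helper argmax_by():
argmax_by <- function(dt, a, b) {
  n <- names(dt)
  # a[1]: column to maximise; a[2]: column reported at the argmax; b: grouping columns
  dt[, setNames(.(max(.SD[[n[a[1]]]]),
                  .SD[[n[a[2]]]][which.max(.SD[[n[a[1]]]])]),
                n[a]),
     by = c(n[b])]
}
identical(argmax_by(DF, 2:1, 3:4), y)
# [1] TRUE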
Did you find an answer that works for you? Something like this is possible, but it is not pretty, which may mean it is hard to maintain:
DF[, list(Y=max(eval(as.symbol(colnames(DF)[2]))),
          X=eval(as.symbol(colnames(DF)[1]))[which.max(eval(as.symbol(colnames(DF)[2])))]),
   by=list(Z=eval(as.symbol(colnames(DF)[3])),
           Q=eval(as.symbol(colnames(DF)[4])))]
Now you could put those as.symbol(colnames()) into a function and make this easier to read:
cn <- function(dt, col) { as.symbol(colnames(dt)[col]) }
DF[, list(Y=max(eval(cn(DF,2))),
          X=eval(cn(DF,1))[which.max(eval(cn(DF,2)))]),
   by=list(Z=eval(cn(DF,3)), Q=eval(cn(DF,4)))]
Does this solve that problem of grouping by column numbers for you?
You could use a combination of grep with your code:
set.seed(007)
DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE), Q=sample(0:5, 20, TRUE))
DF <- data.table(DF)
DF[, list(Y=max(Y), X=X[which.max(Y)]),
   by=c(names(DF)[grep("Q", colnames(DF))], names(DF)[grep("Z", colnames(DF))])]
Q Z Y X
1: 4 3 1 1
2: 2 1 0 2
3: 4 5 0 3
4: 2 5 1 19
5: 5 5 0 5
6: 1 0 1 6
7: 0 3 0 7
8: 4 2 1 8
9: 5 2 0 10
10: 3 4 0 11
11: 4 1 1 13
12: 3 1 0 14
13: 0 2 1 17
14: 1 4 0 18
15: 1 2 0 20
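Since grep() can return matching names directly with value = TRUE, the by= expression can likely be shortened; a sketch (note this groups in column order, Z then Q):
DF[, list(Y = max(Y), X = X[which.max(Y)]),
   by = grep("Q|Z", names(DF), value = TRUE)]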