data.table: Select n specific rows before & after other rows meeting a condition - r

Given the following example data table:
library(data.table)
DT <- fread("grp y exclude
a 1 0
a 2 0
a 3 0
a 4 1
a 5 0
a 7 1
a 8 0
a 9 0
a 10 0
b 1 0
b 2 0
b 3 0
b 4 1
b 5 0
b 6 1
b 7 1
b 8 0
b 9 0
b 10 0
c 5 1
d 1 0")
I want to select
by group grp
all rows that have y==5
and up to two rows before and after each row from 2 within the grouping.
but 3. only those rows that have exclude==0.
Assuming each group has max one row with y==5, this would yield the desired result for 1.-3.:
idx <- -2:2 # 2 rows before match, the matching row itself, and two rows after match
(row_numbers <- DT[,.I[{
x <- rep(which(y==5),each=length(idx))+idx
x[x>0 & x<=.N]
}], by=grp]$V1)
# [1] 3 4 5 6 7 12 13 14 15 16 20
DT[row_numbers]
# grp y exclude
# 1: a 3 0
# 2: a 4 1
# 3: a 5 0 # y==5 + two rows before and two rows after
# 4: a 7 1
# 5: a 8 0
# 6: b 3 0
# 7: b 4 1
# 8: b 5 0 # y==5 + two rows before and two rows after
# 9: b 6 1
# 10: b 7 1
# 11: c 5 1 # y==5 + nothing, because the group has only 1 element
However, how do I incorporate 4. so that I get
# grp y exclude
# 1: a 2 0
# 2: a 3 0
# 3: a 5 0
# 4: a 8 0
# 5: a 9 0
# 6: b 2 0
# 7: b 3 0
# 8: b 5 0
# 9: b 8 0
# 10: b 9 0
# 11: c 5 1
? Feels like I'm close, but I guess I looked too long at heads and whiches, now, so I'd be thankful for some fresh ideas.

A bit more simplified:
DT[DT[, rn := .I][exclude==0 | y==5][, rn[abs(.I - .I[y==5]) <= 2], by=grp]$V1]
# grp y exclude rn
#1: a 2 0 2
#2: a 3 0 3
#3: a 5 0 5
#4: a 8 0 7
#5: a 9 0 8
#6: b 2 0 11
#7: b 3 0 12
#8: b 5 0 14
#9: b 8 0 17
#10: b 9 0 18
#11: c 5 1 20

You are very close. This should do it:
row_numbers <- DT[exclude==0 | y==5, .I[{
x <- rep(which(y==5), each=length(idx)) + idx
x[x>0 & x<=.N]
}], by=grp]$V1
DT[row_numbers]

Related

R Configure Data With Data.Table

data=data.frame("Student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"Grade"=c(5,6,7,3,4,5,4,5,6,8,9,10,2,3,4),
"Pass"=c(NA,0,1,0,1,1,0,1,0,0,NA,NA,0,0,0),
"NEWPass"=c(0,0,1,0,1,1,0,1,1,0,0,0,0,0,0),
"GradeNEWPass"=c(7,7,7,4,4,4,5,5,5,10,10,10,4,4,4),
"GradeBeforeNEWPass"=c(6,6,6,3,3,3,4,4,4,10,10,10,4,4,4))
I have a data.frame called data. It has column names Student, Grade and Pass. I wish to do this:
NEWPass: Take Pass and for every Student fill in NA values with the previous value. If the first value is 'NA' than put a zero. Then this should be a running maximum.
GradeNEWPass: Take the lowest value of Grade that a Student got a one in NEWPass. If a Student did not get a one in NEWPass, this equals to the maximum grade.
GradeBeforeNEWPass: Take the value of Grade BEFORE a Student got a one in NEWPass. If a Student did not get a one in NEWPass, this equals to the maximum grade.
__
Attempts:
setDT(data)[, NEWPassTry := cummax(Pass), by = Student]
data$GradeNEWPass = data$NEWPassTry * data$Grade
data[, GradeNEWPass := min(GradeNEWPass), by = Student]
Not pretty, admittedly, but your logic includes words like "if any ... for a student", so it's a group-wise conditional, not a row-wise conditional.
library(magrittr) # just for %>% for breakout, not required
mydata %>%
.[, NEWPass2 :=
cummax(fifelse(seq_len(.N) == 1 & is.na(Pass), 0,
zoo::na.locf(Pass, na.rm = FALSE))), by = .(Student) ] %>%
.[, GradeNEWPass2 :=
if (any(NEWPass2 > 0)) min(Grade[ NEWPass2 > 0 ]) else max(Grade),
by = .(Student) ] %>%
.[, GradeBeforeNEWPass2 :=
if (NEWPass2[1] == 0 && any(NEWPass2 > 0)) Grade[ which(NEWPass2 > 0)[1] - 1 ] else max(Grade),
by = .(Student) ]
# Student Grade Pass NEWPass GradeNEWPass GradeBeforeNEWPass NEWPass2 GradeNEWPass2 GradeBeforeNEWPass2
# 1: 1 5 NA 0 7 6 0 7 6
# 2: 1 6 0 0 7 6 0 7 6
# 3: 1 7 1 1 7 6 1 7 6
# 4: 2 3 0 0 4 3 0 4 3
# 5: 2 4 1 1 4 3 1 4 3
# 6: 2 5 1 1 4 3 1 4 3
# 7: 3 4 0 0 5 4 0 5 4
# 8: 3 5 1 1 5 4 1 5 4
# 9: 3 6 0 1 5 4 1 5 4
# 10: 4 8 0 0 10 10 0 10 10
# 11: 4 9 NA 0 10 10 0 10 10
# 12: 4 10 NA 0 10 10 0 10 10
# 13: 5 2 0 0 4 4 0 4 4
# 14: 5 3 0 0 4 4 0 4 4
# 15: 5 4 0 0 4 4 0 4 4
I'm using magrittr::%>% solely to break it out into stages of computation, it is not required.
We can use data.table methods. Grouped by 'Student', create an index ('i1') where the 'Pass' is 1 and not an NA, then get the first position of 1 with which and head ('i2'), while calculating the max of 'Grade' ('mx'), then create the three columns based on the indexes ('v1' - get the cumulative maximum of the binary, 'v2' - if there are any 1s, then subset the 'Grade' with the index 'i2' or else return 'mx', similarly 'v3'- the index is subtracted 1 to get the 'Grade' value
library(data.table)
setDT(data)[, c('NEWPass1', 'GradeNEWPass1', 'GradeBeforeNEWPass1') :={
i1 <- Pass == 1 & !is.na(Pass)
i2 <- head(which(i1), 1)
mx <- max(Grade, na.rm = TRUE)
v1 <- cummax(+(i1))
v2 <- if(any(i1)) Grade[i2] else mx
v3 <- if(any(i1)) Grade[max(1, i2-1)] else mx
.(v1, v2, v3)}, Student]
data
# Student Grade Pass NEWPass GradeNEWPass GradeBeforeNEWPass NEWPass1 GradeNEWPass1 GradeBeforeNEWPass1
# 1: 1 5 NA 0 7 6 0 7 6
# 2: 1 6 0 0 7 6 0 7 6
# 3: 1 7 1 1 7 6 1 7 6
# 4: 2 3 0 0 4 3 0 4 3
# 5: 2 4 1 1 4 3 1 4 3
# 6: 2 5 1 1 4 3 1 4 3
# 7: 3 4 0 0 5 4 0 5 4
# 8: 3 5 1 1 5 4 1 5 4
# 9: 3 6 0 1 5 4 1 5 4
#10: 4 8 0 0 10 10 0 10 10
#11: 4 9 NA 0 10 10 0 10 10
#12: 4 10 NA 0 10 10 0 10 10
#13: 5 2 0 0 4 4 0 4 4
#14: 5 3 0 0 4 4 0 4 4
#15: 5 4 0 0 4 4 0 4 4

How to efficiently find last preceding row with nonzero value in R data.table

Introduction
I am trying to find the optimal way how to find the last preceding row with nonzero value in a given column and return a value of different column on that row. I want to do it in R data.table and i am looking for maximum efficiency of that operation.
Example
let's have a data table like so:
set.seed(123)
DT = data.table(x=rep(c("b","a","c"),each=6),
y=rep(1:6, 3),
z = rbinom(18, 1, 0.3))
That gives us the following data table:
x y z
1: b 1 0
2: b 2 1
3: b 3 0
4: b 4 1
5: b 5 1
6: b 6 0
7: a 1 0
8: a 2 1
9: a 3 0
10: a 4 0
11: a 5 1
12: a 6 0
13: c 1 0
14: c 2 0
15: c 3 0
16: c 4 1
17: c 5 0
18: c 6 0
Now, the table is for each value in column x ordered by the column y. For each group given by the values in column x, I would like to create a column which would give me for each row the value of y from the row with last nonzero value of z.
Right now I am using lapply for each y and grouping by x which gives the desired result:
DT[, list(y,
z,
output = lapply(y, function(x) max(y[z != 0 & y <= x]))
),
by = 'x']
The question
Can i make my code from the example more efficient?
You might try using nafill:
# create a dummy column that is only populated for nonzero z (and hence NA elsewhere)
DT[z != 0, y_copy := y]
# nafill on this column using LOCF strategy by group:
DT[ , ans := nafill(y_copy, type = 'locf'), by = x][]
# x y z y_copy ans
# 1: b 1 0 NA NA
# 2: b 2 1 2 2
# 3: b 3 0 NA 2
# 4: b 4 1 4 4
# 5: b 5 1 5 5
# 6: b 6 0 NA 5
# 7: a 1 0 NA NA
# 8: a 2 1 2 2
# 9: a 3 0 NA 2
# 10: a 4 0 NA 2
# 11: a 5 1 5 5
# 12: a 6 0 NA 5
# 13: c 1 0 NA NA
# 14: c 2 0 NA NA
# 15: c 3 0 NA NA
# 16: c 4 1 4 4
# 17: c 5 0 NA 4
# 18: c 6 0 NA 4
For now, nafill is a development only feature (data.table 1.12.3+) but 1.12.4 should be on CRAN in the next week or two. For the moment, you can install this with install.packages('data.table', type = 'source', repos = 'http://Rdatatable.github.io/data.table')
If you don't want to create y_copy, you could do this inline with is.na<-:
DT[ , ans := nafill(`is.na<-`(y, z == 0), type = 'locf'), by = x]
This will be inefficient because z==0 is calculated repeatedly by group (instead of as a single vector); you could do this in the first step then:
DT[ , z_zero := z == 0]
But this means another dummy column (with less storage than y_copy if y is numeric, character, or complex)
Another option using rolling join:
DT[, output:= DT[z==1][.SD, on=.(x, y), roll=Inf, x.y]]
output:
x y z output
1: b 1 0 NA
2: b 2 1 2
3: b 3 0 2
4: b 4 1 4
5: b 5 1 5
6: b 6 0 5
7: a 1 0 NA
8: a 2 1 2
9: a 3 0 2
10: a 4 0 2
11: a 5 1 5
12: a 6 0 5
13: c 1 0 NA
14: c 2 0 NA
15: c 3 0 NA
16: c 4 1 4
17: c 5 0 4
18: c 6 0 4
An option with non-equi join
library(data.table)
library(zoo)
DT[DT[z!=0, .(y1 = y, x)], output := y1, on = .(x, y <= y1),
mult = 'last'][, output := na.locf0(output), x]
DT
# x y z output
# 1: b 1 0 NA
# 2: b 2 1 2
# 3: b 3 0 2
# 4: b 4 1 4
# 5: b 5 1 5
# 6: b 6 0 5
# 7: a 1 0 NA
# 8: a 2 1 2
# 9: a 3 0 2
#10: a 4 0 2
#11: a 5 1 5
#12: a 6 0 5
#13: c 1 0 NA
#14: c 2 0 NA
#15: c 3 0 NA
#16: c 4 1 4
#17: c 5 0 4
#18: c 6 0 4

How to refer to multiple previous rows in R data.table

I have a question regarding data.table in R
i have a dataset like this
data <- data.table(a=c(1:7,12,32,13),b=c(1,5,6,7,8,3,2,5,1,4))
a b
1: 1 1
2: 2 5
3: 3 6
4: 4 7
5: 5 8
6: 6 3
7: 7 2
8: 12 5
9: 32 1
10: 13 4
Now i want to generate a third column c, which gonna compare the value of each row of a, to all previous values of b and check if there is any value of b is bigger than a. For e.g, at row 5, a=5, and previous value of b is 1,5,6,7. so 6 and 7 is bigger than 5, therefore value of c should be 1, otherwise it would be 0.
The result should be like this
a b c
1: 1 1 NA
2: 2 5 0
3: 3 6 1
4: 4 7 1
5: 5 8 1
6: 6 3 1
7: 7 2 1
8: 12 5 0
9: 32 1 0
10: 13 4 0
I tried with a for loop but it takes a very long time. I also tried shift but i can not refer to multiple previous rows with shift. Anyone has any recommendation?
library(data.table)
data <- data.table(a=c(1:7,12,32,13),b=c(1,5,6,7,8,3,2,5,1,4))
data[,c:= a <= shift(cummax(b))]
This is a base R solution (see the dplyr solution below):
data$c = NA
data$c[2:nrow(data)] <- sapply(2:nrow(data), function(x) { data$c[x] <- any(data$a[x] < data$b[1:(x-1)]) } )
## a b c
## 1: 1 1 NA
## 2: 2 5 0
## 3: 3 6 1
## 4: 4 7 1
## 5: 5 8 1
## 6: 6 3 1
## 7: 7 2 1
## 8: 12 5 0
## 9: 32 1 0
## 10: 13 4 0
EDIT
Here is a simpler solution using dplyr
library(dplyr)
### Given the cumulative max and comparing to 'a', set see to 1/0.
data %>% mutate(c = ifelse(a < lag(cummax(b)), 1, 0))
## a b c
## 1 1 1 NA
## 2 2 5 0
## 3 3 6 1
## 4 4 7 1
## 5 5 8 1
## 6 6 3 1
## 7 7 2 1
## 8 12 5 0
## 9 32 1 0
## 10 13 4 0
### Using 'shift' with dplyr
data %>% mutate(c = ifelse(a <= shift(cummax(b)), 1, 0))

using column numbers for grouping in data table rather than names in R

I have code that needs to be flexible, and I cannot hard code in column names when I do grouping. As such, I want to hard code column numbers to do grouping, since these are easy to specify over range changes. (Column 1 through X or so, rather than using the names of cols 1,2,..X)
Example data set:
set.seed(007)
DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE), Q =sample(0:5, 20, TRUE))
DF
X Y Z Q
1 1 1 3 4
2 2 0 1 2
3 3 0 5 4
4 4 0 5 2
5 5 0 5 5
6 6 1 0 1
7 7 0 3 0
8 8 1 2 4
9 9 0 5 5
10 10 0 2 5
11 11 0 4 3
12 12 0 1 4
13 13 1 1 4
14 14 0 1 3
15 15 0 2 4
16 16 0 5 2
17 17 1 2 0
18 18 0 4 1
19 19 1 5 2
20 20 0 2 1
A grouping (by Z and Q) that finds the X that maximizes Y, and returns both:
DF =data.table(DF)
DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
Result:
DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
Z Q Y X
1: 3 4 1 1
2: 1 2 0 2
3: 5 4 0 3
4: 5 2 1 19
5: 5 5 0 5
6: 0 1 1 6
7: 3 0 0 7
8: 2 4 1 8
9: 2 5 0 10
10: 4 3 0 11
11: 1 4 1 13
12: 1 3 0 14
13: 2 0 1 17
14: 4 1 0 18
15: 2 1 0 20
I want to do this purely using column numbers, because of the nature of my code. Additionally, If there were another column, I would potentially want to group by that extra column. And I would also want to potentially return another argmax in the first part.
Maybe just pick off names(DF) with column numbers, combined with eval(parse(...))?
useColNums <- function(data, a, b) {
n <- names(data)
y <- n[a[1]]
x <- n[a[2]]
groupby <- sprintf("list(%s)", paste(n[b], collapse=","))
argmax <- sprintf("list(%1$s=max(%1$s),%2$s=%2$s[which.max(%1$s)])", y, x)
data[, eval(parse(text=argmax)), by=eval(parse(text=groupby))]
}
x <- useColNums(DF, 2:1, 3:4)
y <- DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
identical(x, y)
# [1] TRUE
Did you find an answer that works for you? Something like this is possible, but it is not pretty, which may mean it is hard to maintain:
DF[, list(Y=max(eval(as.symbol(colnames(DF)[2]))),
X=eval(as.symbol(colnames(DF)[1]))[which.max(eval(as.symbol(colnames(DF)[2])))]),
by=list(Z=eval(as.symbol(colnames(DF)[3])),
Q=eval(as.symbol(colnames(DF)[4])))]
Now you could put those as.symbol(colnames()) into a function and make this easier to read:
cn <- function( dt, col ) { as.symbol(colnames(dt)[col]) }
DF[, list(Y=max(eval(cn(DF,2))),
X=eval(cn(DF,1))[which.max(eval(cn(DF,2)))]),
by=list(Z=eval(cn(DF,3)), Q=eval(cn(DF,4)))]
Does this solve that problem of grouping by column numbers for you?
You could use a combination of grep with your code:
> set.seed(007)
> DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE), Q =sample(0:5, 20, TRUE))
> DF = data.table(DF)
> coly <- na
> DF[, list(Y=max(Y),X=X[which.max(Y)]), by=c(col1 <- names(DF)[grep("Q", colnames(DF))], names(DF)[grep("Z", colnames(DF))])]
Q Z Y X
1: 4 3 1 1
2: 2 1 0 2
3: 4 5 0 3
4: 2 5 1 19
5: 5 5 0 5
6: 1 0 1 6
7: 0 3 0 7
8: 4 2 1 8
9: 5 2 0 10
10: 3 4 0 11
11: 4 1 1 13
12: 3 1 0 14
13: 0 2 1 17
14: 1 4 0 18
15: 1 2 0 20

Multiply rows in 2 columns by grouping flagged rows in R

I have a dataframe DF, which has the following data; around 300000 rows
<DF
A B C
1 2 0
2 5 0
4 5 2
4 7 0
7 8 0
9 7 -2
2 5 0
4 7 0
5 1 2
4 7 0
7 8 0
9 7 -2
2 5 0
4 7 0
5 1 2
I want to perform a mathematical operation on the data set with the following logic
Select all rows till first occurrence of 2 in C(ignoring -2 in the middle)
Compute avg of all these of (A*B) and add it column D (Implying all these rows in column D will have the same value)
Select all rows from first occurrence of 2 till the second occurrence
Compute avg of (A*B) for these rows and add it column D
... Do the same till
Select all rows from last occurrence of 2 till the second last occurrence
Compute avg of (A*B) for these rows and add it column D
The result should look like
<Result
A B C D
1 2 0 6
2 5 0 6
4 5 2 34.16667
4 7 0 34.16667
7 8 0 34.16667
9 7 -2 34.16667
2 5 0 34.16667
4 7 0 34.16667
5 1 2 27.85714
4 7 0 27.85714
7 8 0 27.85714
9 7 -2 27.85714
2 5 0 27.85714
4 7 0 27.85714
5 1 2 NA
How to implement this logic in R? Thanks in advance!
Here is an option with data.table. Convert the 'data.frame' to 'data.table' (setDT(DF)), grouped by the cumulative sum of logical vector (C==2), we get the mean of A * B and multiply with the the values generated by NA^(.N==1) (.N==1 returns a logical vector of TRUE/FALSE for number of rows that are equal to 1 or not and using NA^ converts this to NA/1) so that all groups that have only one element returns NA and others have the mean(A*B).
library(data.table)
setDT(DF)[, D := NA^(.N==1)*mean(A*B) , .(grp = cumsum(C==2))]
DF
# A B C D
# 1: 1 2 0 6.00000
# 2: 2 5 0 6.00000
# 3: 4 5 2 34.16667
# 4: 4 7 0 34.16667
# 5: 7 8 0 34.16667
# 6: 9 7 -2 34.16667
# 7: 2 5 0 34.16667
# 8: 4 7 0 34.16667
# 9: 5 1 2 31.66667
#10: 4 7 0 31.66667
#11: 7 8 0 31.66667
#12: 9 7 -2 31.66667
#13: 2 5 0 31.66667
#14: 4 7 0 31.66667
#15: 5 1 2 NA
Using dplyr,
library(dplyr)
df <- df %>%
mutate(ind = cumsum(C == 2)) %>%
group_by(ind) %>%
mutate(D = mean(A*B), D = replace(D, n() == 1, NA))

Resources