How to refer to multiple previous rows in R data.table

I have a question regarding data.table in R.
I have a dataset like this:
data <- data.table(a=c(1:7,12,32,13),b=c(1,5,6,7,8,3,2,5,1,4))
a b
1: 1 1
2: 2 5
3: 3 6
4: 4 7
5: 5 8
6: 6 3
7: 7 2
8: 12 5
9: 32 1
10: 13 4
Now I want to generate a third column c, which compares the value of a in each row to all previous values of b and checks whether any of those values of b is bigger than a. For example, at row 5, a=5 and the previous values of b are 1, 5, 6, 7. Since 6 and 7 are bigger than 5, the value of c should be 1; otherwise it would be 0.
The result should look like this:
a b c
1: 1 1 NA
2: 2 5 0
3: 3 6 1
4: 4 7 1
5: 5 8 1
6: 6 3 1
7: 7 2 1
8: 12 5 0
9: 32 1 0
10: 13 4 0
I tried a for loop, but it takes a very long time. I also tried shift, but I cannot refer to multiple previous rows with shift. Does anyone have a recommendation?

library(data.table)
data <- data.table(a=c(1:7,12,32,13),b=c(1,5,6,7,8,3,2,5,1,4))
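# cummax(b) is the running maximum of b; shift() lags it by one row,
# so each a is compared only with earlier values of b.
# Strict '<' matches "bigger than a"; as.integer() gives 1/0 instead of TRUE/FALSE.
data[, c := as.integer(a < shift(cummax(b)))]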
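To see why a lagged running maximum is enough: some earlier b exceeds a exactly when the largest earlier b exceeds a. The sketch below checks this on the question's data against a brute-force comparison. The names `fast` and `slow` are only illustrative, and the strict `<` is an assumption based on the "bigger than" wording in the question.

```r
library(data.table)

data <- data.table(a = c(1:7, 12, 32, 13), b = c(1, 5, 6, 7, 8, 3, 2, 5, 1, 4))

# Fast version: compare a with the running maximum of b, lagged by one row
fast <- data[, as.integer(a < shift(cummax(b)))]

# Brute-force reference: for each row, look at every earlier value of b
slow <- c(NA_integer_, sapply(2:nrow(data), function(i) {
  as.integer(any(data$b[seq_len(i - 1)] > data$a[i]))
}))

identical(fast, slow)
```

The first row has no earlier rows, so both versions leave it as NA.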

This is a base R solution (see the dplyr solution below):
data$c <- NA_integer_
# For each row from the second on, check whether any earlier b exceeds the current a
data$c[2:nrow(data)] <- sapply(2:nrow(data), function(x) as.integer(any(data$a[x] < data$b[1:(x-1)])))
## a b c
## 1: 1 1 NA
## 2: 2 5 0
## 3: 3 6 1
## 4: 4 7 1
## 5: 5 8 1
## 6: 6 3 1
## 7: 7 2 1
## 8: 12 5 0
## 9: 32 1 0
## 10: 13 4 0
EDIT
Here is a simpler solution using dplyr
library(dplyr)
### Take the cumulative max of 'b', lag it, compare to 'a', and set 'c' to 1/0.
data %>% mutate(c = ifelse(a < lag(cummax(b)), 1, 0))
## a b c
## 1 1 1 NA
## 2 2 5 0
## 3 3 6 1
## 4 4 7 1
## 5 5 8 1
## 6 6 3 1
## 7 7 2 1
## 8 12 5 0
## 9 32 1 0
## 10 13 4 0
### Using 'shift' (from data.table) with dplyr
data %>% mutate(c = ifelse(a < shift(cummax(b)), 1, 0))

Related

Fill Missing Values

data=data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,3))
library(dplyr);library(tidyverse)
data$timeWANTattempt=data$timeHAVE
data <- data %>%
group_by(student) %>%
fill(timeWANTattempt)+3
I have 'timeHAVE' and I want to replace missing times with the previous time +3. I show my dplyr attempt but it does not work. I seek a data.table solution. Thank you.
You can try:
data %>%
group_by(student) %>%
mutate(n_na = cumsum(is.na(timeHAVE))) %>%
mutate(timeHAVE = ifelse(is.na(timeHAVE), timeHAVE[n_na == 0 & lead(n_na) == 1] + 3*n_na, timeHAVE))
student timeHAVE timeWANT n_na
<dbl> <dbl> <dbl> <int>
1 1 1 1 0
2 1 4 4 0
3 1 7 7 0
4 1 10 10 0
5 2 2 2 0
6 2 5 5 0
7 2 8 8 1
8 2 11 11 1
9 3 6 6 0
10 3 9 9 1
11 3 12 12 2
12 3 15 15 3
13 4 3 3 0
I included the little helper n_na, which counts the NAs in a row. The second mutate then multiplies the number of NAs by three and adds this to the last non-NA element before the NAs.
Here's an approach using 'locf' filling
setDT(data)
data[ , by = student, timeWANT := {
# carry previous observations forward whenever missing
locf_fill = nafill(timeHAVE, 'locf')
# every next NA, the amount shifted goes up by another 3
na_shift = cumsum(idx <- is.na(timeHAVE))
# add the shift, but only where the original data was missing
locf_fill[idx] = locf_fill[idx] + 3*na_shift[idx]
# return the full vector
locf_fill
}]
Be warned that this won't work if a given student can have more than one non-consecutive set of NA values in timeHAVE.
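If several separate runs of NA per student do occur, one possible variation is to restart the counter at the beginning of each run instead of using a single cumsum. This is only a sketch on made-up data: the table `toy` and the column `filled` are hypothetical, and it assumes each run of NAs is preceded by an observed value within the same student.

```r
library(data.table)

# Hypothetical data: one student with two separate runs of NA
toy <- data.table(student  = c(1, 1, 1, 1, 1, 1),
                  timeHAVE = c(1, NA, 7, NA, NA, 16))

toy[, filled := {
  idx <- is.na(timeHAVE)
  # position within each consecutive run (restarts at 1 for every new run)
  k <- rowid(rleid(idx))
  # carry the last observed value forward
  out <- nafill(timeHAVE, "locf")
  # add 3 per step, but only where the original value was missing
  out[idx] <- out[idx] + 3 * k[idx]
  out
}, by = student]

toy
```

Here rleid labels each consecutive run and rowid numbers the rows inside a run, so the second run starts again at +3 from the value that precedes it.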
Another data.table option without grouping:
setDT(data)[, w := fifelse(is.na(timeHAVE) & student==shift(student),
nafill(timeHAVE, "locf") + 3L * rowid(rleid(timeHAVE)),
timeHAVE)]
output:
student timeHAVE timeWANT w
1: 1 1 1 1
2: 1 4 4 4
3: 1 7 7 7
4: 1 10 10 10
5: 2 2 2 2
6: 2 5 5 5
7: 2 NA 8 8
8: 2 11 11 11
9: 3 6 6 6
10: 3 NA 9 9
11: 3 NA 12 12
12: 3 NA 15 15
13: 4 NA NA NA
14: 4 3 3 3
data with student=4 having NA for the first timeHAVE:
data = data.frame("student"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4),
"timeHAVE"=c(1,4,7,10,2,5,NA,11,6,NA,NA,NA,NA,3),
"timeWANT"=c(1,4,7,10,2,5,8,11,6,9,12,15,NA,3))

data.table: Select n specific rows before & after other rows meeting a condition

Given the following example data table:
library(data.table)
DT <- fread("grp y exclude
a 1 0
a 2 0
a 3 0
a 4 1
a 5 0
a 7 1
a 8 0
a 9 0
a 10 0
b 1 0
b 2 0
b 3 0
b 4 1
b 5 0
b 6 1
b 7 1
b 8 0
b 9 0
b 10 0
c 5 1
d 1 0")
I want to select
1. by group grp
2. all rows that have y==5
3. and up to two rows before and after each row from 2. within the grouping,
4. but for 3., only those rows that have exclude==0.
Assuming each group has max one row with y==5, this would yield the desired result for 1.-3.:
idx <- -2:2 # 2 rows before match, the matching row itself, and two rows after match
(row_numbers <- DT[,.I[{
x <- rep(which(y==5),each=length(idx))+idx
x[x>0 & x<=.N]
}], by=grp]$V1)
# [1] 3 4 5 6 7 12 13 14 15 16 20
DT[row_numbers]
# grp y exclude
# 1: a 3 0
# 2: a 4 1
# 3: a 5 0 # y==5 + two rows before and two rows after
# 4: a 7 1
# 5: a 8 0
# 6: b 3 0
# 7: b 4 1
# 8: b 5 0 # y==5 + two rows before and two rows after
# 9: b 6 1
# 10: b 7 1
# 11: c 5 1 # y==5 + nothing, because the group has only 1 element
However, how do I incorporate 4. so that I get
# grp y exclude
# 1: a 2 0
# 2: a 3 0
# 3: a 5 0
# 4: a 8 0
# 5: a 9 0
# 6: b 2 0
# 7: b 3 0
# 8: b 5 0
# 9: b 8 0
# 10: b 9 0
# 11: c 5 1
? Feels like I'm close, but I guess I looked too long at heads and whiches, now, so I'd be thankful for some fresh ideas.
A bit more simplified:
DT[DT[, rn := .I][exclude==0 | y==5][, rn[abs(.I - .I[y==5]) <= 2], by=grp]$V1]
# grp y exclude rn
#1: a 2 0 2
#2: a 3 0 3
#3: a 5 0 5
#4: a 8 0 7
#5: a 9 0 8
#6: b 2 0 11
#7: b 3 0 12
#8: b 5 0 14
#9: b 8 0 17
#10: b 9 0 18
#11: c 5 1 20
You are very close. This should do it:
row_numbers <- DT[exclude==0 | y==5, .I[{
x <- rep(which(y==5), each=length(idx)) + idx
x[x>0 & x<=.N]
}], by=grp]$V1
DT[row_numbers]
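The approaches above assume at most one row with y==5 per group. If a group can contain several matches, one way to build the windows is to add every offset to every match position with outer() and drop duplicates. This is a sketch on a made-up table (`toy`, with a window of one row on each side), not part of the original answers.

```r
library(data.table)

# Hypothetical data: group "a" has two rows with y == 5
toy <- data.table(grp = rep("a", 6), y = c(5, 1, 2, 3, 5, 4))
idx <- -1:1  # one row before, the match itself, one row after

rows <- toy[, {
  hits <- which(y == 5)
  # every match position plus every offset, de-duplicated and sorted
  x <- unique(sort(as.vector(outer(hits, idx, `+`))))
  # keep only positions that exist inside the group
  .I[x[x > 0 & x <= .N]]
}, by = grp]$V1

toy[rows]
```

Overlapping windows are merged by unique(), so a row that sits near two matches is returned only once.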

Shifting row values by lag value in another column

I have a rather large dataset and I am interested in "marching" values forward through time based on values from another column. For example, if I have a Value = 3 at Time = 0 and a DesiredShift = 2, I want the 3 to shift down two rows to be at Time = 2. Here is a reproducible example.
Build reproducible fake data
library(data.table)
set.seed(1)
rowsPerID <- 8
dat <- CJ(1:2, 1:rowsPerID)
setnames(dat, c("ID","Time"))
dat[, Value := rpois(.N, 4)]
dat[, Shift := sample(0:2, size=.N, replace=TRUE)]
Fake Data
# ID Time Value Shift
# 1: 1 1 3 2
# 2: 1 2 3 2
# 3: 1 3 4 1
# 4: 1 4 7 2
# 5: 1 5 2 2
# 6: 1 6 7 0
# 7: 1 7 7 1
# 8: 1 8 5 0
# 9: 2 1 5 0
# 10: 2 2 1 1
# 11: 2 3 2 0
# 12: 2 4 2 1
# 13: 2 5 5 2
# 14: 2 6 3 1
# 15: 2 7 5 1
# 16: 2 8 4 1
I want each Value to shift forward according to the Shift column. So the
DesiredOutput column for row 3 will be equal to 3 since the value at Time=1 is
Value = 3 and Shift = 2.
Row 4 shows 3+4=7 since 3 shifts down 2 and 4 shifts down 1.
I would like to be able to do this by ID group and hopefully take advantage
of data.table since speed is of interest for this problem.
Desired Result
# ID Time Value Shift DesiredOutput
# 1: 1 1 3 2 NA
# 2: 1 2 3 2 NA
# 3: 1 3 4 1 3
# 4: 1 4 7 2 3+4 = 7
# 5: 1 5 2 2 NA
# 6: 1 6 7 0 7+7 = 14
# 7: 1 7 7 1 2
# 8: 1 8 5 0 7+5 = 12
# 9: 2 1 5 0 5
# 10: 2 2 1 1 NA
# 11: 2 3 2 0 1+2 = 3
# 12: 2 4 2 1 NA
# 13: 2 5 5 2 2
# 14: 2 6 3 1 NA
# 15: 2 7 5 1 3+5=8
# 16: 2 8 4 1 5
I was hoping to get this working using the data.table::shift function, but I am unsure how to make this work using multiple lag parameters.
Try this:
dat[, TargetIndex:= .I + Shift]
toMerge = dat[, list(Out = sum(Value)), by='TargetIndex']
dat[, TargetIndex:= .I]
# dat = merge(dat, toMerge, by='TargetIndex', all=TRUE)
dat[toMerge, on='TargetIndex', DesiredOutput:= i.Out]
> dat
# ID Time Value Shift TargetIndex DesiredOutput
# 1: 1 1 3 2 1 NA
# 2: 1 2 3 2 2 NA
# 3: 1 3 4 1 3 3
# 4: 1 4 7 2 4 7
# 5: 1 5 2 2 5 NA
# 6: 1 6 7 0 6 14
# 7: 1 7 7 1 7 2
# 8: 1 8 5 0 8 12
# 9: 2 1 5 0 9 5
# 10: 2 2 1 1 10 NA
# 11: 2 3 2 0 11 3
# 12: 2 4 2 1 12 NA
# 13: 2 5 5 2 13 2
# 14: 2 6 3 1 14 NA
# 15: 2 7 5 1 15 8
# 16: 2 8 4 1 16 5
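Note that the row-index approach above does not look at ID, so a value near the end of one ID could land in the next ID's rows. A possible variation, assuming Time is a consecutive integer within each ID, is to compute the arrival time per ID and join on both columns. The table `toy` and the name `arrivals` are hypothetical.

```r
library(data.table)

# Hypothetical small example with two IDs
toy <- data.table(ID    = c(1L, 1L, 1L, 1L, 2L, 2L),
                  Time  = c(1L, 2L, 3L, 4L, 1L, 2L),
                  Value = c(3, 3, 4, 7, 5, 1),
                  Shift = c(2L, 2L, 1L, 2L, 0L, 1L))

# Sum the values that arrive at each (ID, Time) after shifting
arrivals <- toy[, .(Out = sum(Value)), by = .(ID, Time = Time + Shift)]

# Update join: arrivals that fall outside an ID's time range match nothing
toy[arrivals, on = .(ID, Time), DesiredOutput := i.Out]

toy
```

Rows that receive no shifted value keep NA, which matches the layout of the desired result in the question.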

using column numbers for grouping in data table rather than names in R

I have code that needs to be flexible, and I cannot hard code in column names when I do grouping. As such, I want to hard code column numbers to do grouping, since these are easy to specify over range changes. (Column 1 through X or so, rather than using the names of cols 1,2,..X)
Example data set:
set.seed(007)
DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE), Q =sample(0:5, 20, TRUE))
DF
X Y Z Q
1 1 1 3 4
2 2 0 1 2
3 3 0 5 4
4 4 0 5 2
5 5 0 5 5
6 6 1 0 1
7 7 0 3 0
8 8 1 2 4
9 9 0 5 5
10 10 0 2 5
11 11 0 4 3
12 12 0 1 4
13 13 1 1 4
14 14 0 1 3
15 15 0 2 4
16 16 0 5 2
17 17 1 2 0
18 18 0 4 1
19 19 1 5 2
20 20 0 2 1
A grouping (by Z and Q) that finds the X that maximizes Y, and returns both:
DF =data.table(DF)
DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
Result:
DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
Z Q Y X
1: 3 4 1 1
2: 1 2 0 2
3: 5 4 0 3
4: 5 2 1 19
5: 5 5 0 5
6: 0 1 1 6
7: 3 0 0 7
8: 2 4 1 8
9: 2 5 0 10
10: 4 3 0 11
11: 1 4 1 13
12: 1 3 0 14
13: 2 0 1 17
14: 4 1 0 18
15: 2 1 0 20
I want to do this purely using column numbers, because of the nature of my code. Additionally, If there were another column, I would potentially want to group by that extra column. And I would also want to potentially return another argmax in the first part.
Maybe just pick off names(DF) with column numbers, combined with eval(parse(...))?
useColNums <- function(data, a, b) {
n <- names(data)
y <- n[a[1]]
x <- n[a[2]]
groupby <- sprintf("list(%s)", paste(n[b], collapse=","))
argmax <- sprintf("list(%1$s=max(%1$s),%2$s=%2$s[which.max(%1$s)])", y, x)
data[, eval(parse(text=argmax)), by=eval(parse(text=groupby))]
}
x <- useColNums(DF, 2:1, 3:4)
y <- DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
identical(x, y)
# [1] TRUE
Did you find an answer that works for you? Something like this is possible, but it is not pretty, which may mean it is hard to maintain:
DF[, list(Y=max(eval(as.symbol(colnames(DF)[2]))),
X=eval(as.symbol(colnames(DF)[1]))[which.max(eval(as.symbol(colnames(DF)[2])))]),
by=list(Z=eval(as.symbol(colnames(DF)[3])),
Q=eval(as.symbol(colnames(DF)[4])))]
Now you could put those as.symbol(colnames()) into a function and make this easier to read:
cn <- function( dt, col ) { as.symbol(colnames(dt)[col]) }
DF[, list(Y=max(eval(cn(DF,2))),
X=eval(cn(DF,1))[which.max(eval(cn(DF,2)))]),
by=list(Z=eval(cn(DF,3)), Q=eval(cn(DF,4)))]
Does this solve that problem of grouping by column numbers for you?
You could use a combination of grep with your code:
> set.seed(007)
> DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE), Q =sample(0:5, 20, TRUE))
> DF = data.table(DF)
> DF[, list(Y=max(Y),X=X[which.max(Y)]), by=c(names(DF)[grep("Q", colnames(DF))], names(DF)[grep("Z", colnames(DF))])]
Q Z Y X
1: 4 3 1 1
2: 2 1 0 2
3: 4 5 0 3
4: 2 5 1 19
5: 5 5 0 5
6: 1 0 1 6
7: 0 3 0 7
8: 4 2 1 8
9: 5 2 0 10
10: 3 4 0 11
11: 4 1 1 13
12: 3 1 0 14
13: 0 2 1 17
14: 1 4 0 18
15: 1 2 0 20
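Since `by` accepts a character vector of column names, another option is to translate the column numbers into names once and then use `.SDcols` for the value columns, which avoids eval(parse(...)). This is a sketch on a small made-up table; `toy`, `grp_cols`, `x_col` and `y_col` are hypothetical names.

```r
library(data.table)

# Hypothetical data with the same column layout as in the question
toy <- data.table(X = 1:6,
                  Y = c(0, 1, 0, 1, 1, 0),
                  Z = c(1, 1, 2, 2, 3, 3),
                  Q = c(5, 5, 5, 5, 6, 6))

grp_cols <- names(toy)[3:4]  # group by columns 3 and 4
x_col    <- names(toy)[1]    # column whose argmax value is returned
y_col    <- names(toy)[2]    # column to maximise

res <- toy[, {
  i <- which.max(.SD[[y_col]])
  list(Y = .SD[[y_col]][i], X = .SD[[x_col]][i])
}, by = grp_cols, .SDcols = c(x_col, y_col)]

res
```

Changing the grouping to a different range of columns then only requires changing the indices used to build `grp_cols`. The output columns are still named Y and X here; they would need renaming if the selected columns differ.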

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove the rows of every id that occurs fewer than 5 times. The variable "id" is the grouping variable, and a group should be deleted when it has fewer than 5 rows. In DT, I need to determine which groups have fewer than 5 members (groups "1" and "4") and then remove those rows, leaving:
x y v id
1: a 3 5 2
2: a 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: b 3 2 3
8: b 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
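A commonly used alternative in data.table is to return .SD only for groups that are large enough. The sketch below rebuilds the question's table with explicit rep() calls instead of relying on recycling; `kept` is a hypothetical name. Note that this form puts the grouping column id first in the result.

```r
library(data.table)

DT <- data.table(x  = rep(c("a", "b", "c"), each = 6),
                 y  = rep(c(1, 3, 6), times = 6),
                 v  = rep(1:9, times = 2),
                 id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))

# For each id, return the group's rows only if it has at least 5 of them;
# groups that fail the test return NULL and are dropped
kept <- DT[, if (.N >= 5L) .SD, by = id]

kept
```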
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)
