> tempDT <- data.table(colA = c("E","E","A","A","E","A","E")
+ , lags = c(NA,1,1,2,3,1,2))
> tempDT
colA lags
1: E NA
2: E 1
3: A 1
4: A 2
5: E 3
6: A 1
7: E 2
I have column colA, and I need to find the lag between the current row and the previous row whose colA == "E".
Note: if we could find the row reference of the previous row whose colA == "E", then we could calculate the lags. However, I don't know how to achieve that.
1) Define lastEpos which, given i, returns the position of the last E among the first i rows, and apply it to each row number:
lastEpos <- function(i) tail(which(tempDT$colA[1:i] == "E"), 1)
tempDT[, lags := .I - shift(sapply(.I, lastEpos))]
Here are a few variations:
2) i-1 In this variation lastEpos returns the position of the last E among the first i-1 rows rather than the first i:
lastEpos <- function(i) tail(c(NA, which(tempDT$colA[seq_len(i-1)] == "E")), 1)
tempDT[, lags := .I - sapply(.I, lastEpos)]
3) Position Similar to (2), but uses Position:
lastEpos <- function(i) Position(c, tempDT$colA[seq_len(i-1)] == "E", right = TRUE)
tempDT[, lags := .I - sapply(.I, lastEpos)]
4) rollapply
library(zoo)
w <- lapply(1:nrow(tempDT), function(i) -rev(seq_len(i-1)))
tempDT[, lags := .I - rollapply(colA == "E", w, Position, f = c, right = TRUE)]
5) sqldf
library(sqldf)
sqldf("select a.colA, a.rowid - b.rowid lags
from tempDT a left join tempDT b
on b.rowid < a.rowid and b.colA = 'E'
group by a.rowid")
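A further possible variation, sketched here under the assumption that an E occurs at or before the first row (as in the example): record the row number at each E, carry it forward with cummax, and subtract the value carried up to the previous row.
# row number of the most recent E, carried forward; shift() restricts it to
# strictly earlier rows. Rows before the first E would need extra NA handling.
tempDT[, lags := .I - shift(cummax(ifelse(colA == "E", .I, 0L)))]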
Let's suppose the following data.table:
library(data.table)
a = runif(40)
b = c(rep(NA,5), runif(5), rep(NA,3), runif(3), NA, runif(3),
      rep(NA,3), runif(7), rep(NA,4), runif(3), NA, NA, runif(1))
c = rep(1:4, each = 10)
DT = data.table(a, b, c)
I want to eliminate the rows with the leading NA values in b for every unique value of c (the first NAs when c==1, when c==2, ...), but not the rows with NAs that come after.
I can do it by using a loop:
for (i in unique(DT$c)) {
  first_NA = which(DT$c == i)[1]
  last_NA  = which(!is.na(DT[, b]) & DT$c == i)[1] - 1
  DT = DT[-c(first_NA:last_NA)]
}
But I wonder if there is a simpler way of doing this for the whole data.table using groups (by in data.table or group_by in dplyr), without just applying it column by column.
Thank you!
You can filter out the leading NA values in b with
DT[, .SD[cumsum( !is.na(b) ) != 0], by = .(c)]
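An equivalent sketch using .I to select row indices instead of subsetting .SD (same cumsum logic, typically a bit lighter on large tables):
DT[DT[, .I[cumsum(!is.na(b)) != 0], by = .(c)]$V1]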
You have to mark those rows and then keep the ones that are not marked.
# mark values
DT <- DT[, by=c,
         flag := is.na(b)                   # b is NA
                 & cumsum(!is.na(b)) == 0   # and no non-NA value has appeared yet in this group
]
# discard marked
DT <- DT[(!flag)]
# remove flag
DT[, flag:=NULL]
or in one line
DT[, by=c, flag := is.na(b) & cumsum(!is.na(b)) == 0][(!flag)][, flag:=NULL]
I am trying to apply a filter to a group in a data.table only if a certain value exists in that group. If it doesn't exist, the filter is not applicable and all the rows of the group are retained. Similar to this.
I am looking for a data.table version of this answer, if possible, but with some additional criteria.
Firstly, I tried the following:
test <- data.table(grp=c(1,1,1,10,10,10,12,12), c=c("a", "b", "c", "b", "c", "c","a","b"))
test[test[, .I[c=="a" | all(c!="a")], by = grp]$V1]
Suggestions to improve are welcome.
The additional criterion I am trying to incorporate is to check whether grp belongs to another list. If it belongs to the list, the filter is applicable:
lst <- c("1", "8")
test[test[, .I[(c=="a" & grp %in% lst) | all(c!="a")], by = grp]$V1]
Here, the filter should apply only to grp value 1 and not to 12, as 12 does not exist in lst. But instead of returning all rows with grp value 12, the code above drops them entirely. That is obviously wrong, and I would like to know how to incorporate the condition.
Expected result:
grp c
1: 1 a
2: 10 b
3: 10 c
4: 10 c
5: 12 a
6: 12 b
For grp=1, it exists in lst and hence the filter is applied.
For grp=10, no filter is needed as there is not a single row with c=="a".
For grp=12, the filter would be applicable BUT, as 12 doesn't belong to lst, it is not applied.
Thanks
Here is one way using the same logic. In addition to the OP's logic, add an OR (|) condition to return all the rows of groups that are not included in the 'lst' object:
test[test[, all(c != 'a') | (c == 'a' & .BY %in% lst) |
            !.BY %in% lst, by = grp]$V1]
-output
# grp c
#1: 1 a
#2: 10 b
#3: 10 c
#4: 10 c
#5: 12 a
#6: 12 b
Or we can use an if/else condition
test[test[, .I[if (!.BY %in% lst) TRUE else
               (c == "a" & grp %in% lst) | all(c != "a")], by = grp]$V1]
Here's a solution that uses a helper column:
> test <- data.table(grp=c(1,1,1,10,10,10,12,12), c=c("a", "b", "c", "b", "c", "c","a","b"))
> lst <- c(1, 8)
> dtFiltered <- test[, filtera := !all(c != "a") & (grp %in% lst), by = grp][!filtera | c == "a"][, filtera := NULL]
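Note that := modifies test by reference, so test itself also keeps the filtera helper column after the chain above. If that side effect matters, here is a sketch of a variant that works on a copy instead:
dtFiltered <- copy(test)[, filtera := !all(c != "a") & (grp %in% lst), by = grp][
                !filtera | c == "a"][, filtera := NULL]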
How can I compare each value in a data.table column against the values of the same column in the next two rows? The following example illustrates the problem and the desired result.
library(data.table)
dt <- data.table(a = c(2, 3, 2, 4))
result <- data.table(a = c(2, 3, 2, 4), b = c(T, F, NA, NA))
We can use shift to create two lead vectors based on 'a' by specifying n = 1:2, loop through them with lapply, check whether each is equal to 'a', Reduce them to a single logical vector with |, and assign it to the 'b' column:
dt[, b := Reduce(`|`, lapply(shift(a, 1:2, type = 'lead'), `==`, a))]
dt
# a b
#1: 2 TRUE
#2: 3 FALSE
#3: 2 NA
#4: 4 NA
As @Mike H. suggested, if we are only comparing against the next two values, writing the comparisons out individually may be easier to understand:
dt[, b := (shift(a, 1, type = 'lead') == a) | (shift(a, 2, type = 'lead') ==a)]
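For example, to check against the next k rows instead of exactly two (k is just an illustrative parameter, not part of the question), the Reduce/shift pattern above generalizes directly:
k <- 2  # number of following rows to compare against
dt[, b := Reduce(`|`, lapply(shift(a, seq_len(k), type = 'lead'), `==`, a))]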
You could do a rolling join on row number:
dt[, r := .I]
dt[head(1:.N, -2), found :=
dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, .N, by=.EACHI]$N > 0L]
a r found
1: 2 1 TRUE
2: 3 2 FALSE
3: 2 3 NA
4: 4 4 NA
To see how it works, replace .N with x.r:
dt[head(1:.N, -2), dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, x.r, by=.EACHI]]
a r x.r
1: 2 2 3
2: 3 3 NA
The idea is that we look for the nearest match on a starting from r+1, giving up after rolling one more row ahead.
I have a data.table that looks like this:
dt <- data.table(a = 1, b = 1, c = 1)
I need column b to be treated as an integer vector of variable length, so I can append additional elements to it. For instance, I want to append the value 2 to b in the first row. I tried
dt[a == 1, b := c(b, 2)]
but that doesn't work. It gives me a warning:
Warning message:
In `[.data.table`(dt, a == 1, `:=`(b, c(b, 2))) :
Supplied 2 items to be assigned to 1 items of column 'b' (1 unused)
What's the right syntax for this?
dt <- data.table(a = 1, b = 1:3, c = 1)
dt[, b := .(lapply(b, c, 2))][]
# a b c
#1: 1 1,2 1
#2: 1 2,2 1
#3: 1 3,2 1
If a conversion to a list column is required first (i.e. when b is not already a list and you are subsetting or grouping with by), run dt[, b := .(as.list(b))] before the above.
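A short usage sketch for the OP's original one-row table, following that note: convert b to a list column first, then append inside a subset assignment.
dt <- data.table(a = 1, b = 1, c = 1)
dt[, b := .(as.list(b))]             # make b a list column
dt[a == 1, b := .(lapply(b, c, 2))]  # append 2 where a == 1
dt
#    a   b c
# 1: 1 1,2 1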
I am wondering if there is an elegant data.table (v1.9.4) way to do the following:
group a DT by two variables and then compute some function on the grouped tables (.SDs) for all entries in .SD but one, with that excluded entry rolling through .SD, and put the result back in DT. The result is thus (potentially) unique for each entry in the .SDs (and hence DT). You can think of it as computing some value over a peer group of an entry in DT, where that peer group is determined by the two grouping variables (same properties as the entry in DT) but excludes the entry itself.
I accomplished this with loops around a simple := in data.table's j, but was wondering if there is a pure data.table solution. I could imagine that something like .SD[i != id, :=, by = 1:nrow(.SD)] inside DT[] could do the trick, but:
Using := in the j of .SD is reserved for future use as a (tortuously) flexible way to update DT by reference by group
The solution I have (computing sum() over the group determined by b and c, excluding the current ID) is:
DT <- data.table(ID = c("a","a","b","b","c","c"),
b = c(1, 2, 1, 2, 1, 2),
c = c("x", "x", "y", "z", "y", "x"),
Var1 = 1:6)
for (id2 in unique(DT$ID)) {
  for (b2 in unique(DT$b)) {
    c2 <- DT[ID == id2 & b == b2, c]
    DT[ID == id2 & b == b2,
       Var1_sum := sum(DT[ID != id2 & b == b2 & c == c2, Var1], na.rm = TRUE)]
  }
}
DT
ID b c Var1 Var1_sum
1: a 1 x 1 0
2: a 2 x 2 6
3: b 1 y 3 5
4: b 2 z 4 0
5: c 1 y 5 3
6: c 2 x 6 2
Do we need that future feature := in .SD's j for this?
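Not a direct answer to the := question, but for the specific sum() case a grouped, loop-free sketch is possible, using the fact that a leave-one-out sum is just the group total minus the row's own value:
# reproduces Var1_sum from the loop above (sum() only; other functions need a different approach)
DT[, Var1_sum := sum(Var1, na.rm = TRUE) - Var1, by = .(b, c)]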