I am wondering if there is an elegant data.table (v1.9.4) way to do the following:
Group a DT by two variables, then for each entry in a group (.SD) compute some function over all the other entries in that group, i.e. every entry but the current one, and put the result back in DT. The excluded entry rolls through .SD, so the result is (potentially) unique for each entry in .SD (and hence in DT). You can think of it as computing a value for the peer group of an entry in DT, where the peer group is determined by the two grouping variables (same properties as the entry) but excludes the entry itself.
I accomplished this with loops around a simple := in data.table's j, but was wondering if there is a pure data.table solution. I could imagine that something like .SD[i != id , := , by=1:nrow(.SD)] inside DT[] would do the trick, but it fails with:
Using := in the j of .SD is reserved for future use as a (tortuously) flexible way to update DT by reference by group
The solution I have is (compute sum() for group determined by b and c except rolling ID):
library(data.table)
DT <- data.table(ID = c("a","a","b","b","c","c"),
                 b = c(1, 2, 1, 2, 1, 2),
                 c = c("x", "x", "y", "z", "y", "x"),
                 Var1 = 1:6)
for (id2 in unique(DT$ID)) {
  for (b2 in unique(DT$b)) {
    # value of c for the current (ID, b) entry
    c2 <- DT[ID == id2 & b == b2, c]
    # sum Var1 over the peer group (same b and c), excluding the current ID
    DT[ID == id2 & b == b2,
       Var1_sum := sum(DT[ID != id2 & b == b2 & c == c2, Var1], na.rm=TRUE)]
  }
}
DT
ID b c Var1 Var1_sum
1: a 1 x 1 0
2: a 2 x 2 6
3: b 1 y 3 5
4: b 2 z 4 0
5: c 1 y 5 3
6: c 2 x 6 2
Do we need that future feature := in .SD's j for this?
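For the sum() case specifically, the future := in .SD's j does not seem to be needed: a leave-one-out sum is the group total minus the entry's own contribution. The following is a minimal sketch of that idea, not a general answer. It assumes Var1 contains no NAs (the loop above uses na.rm=TRUE), and grp_sum and own_sum are hypothetical helper column names introduced only for illustration.

```r
library(data.table)

DT <- data.table(ID   = c("a", "a", "b", "b", "c", "c"),
                 b    = c(1, 2, 1, 2, 1, 2),
                 c    = c("x", "x", "y", "z", "y", "x"),
                 Var1 = 1:6)

# Total of Var1 within each (b, c) peer group
DT[, grp_sum := sum(Var1), by = .(b, c)]
# Contribution of the entry's own ID within that peer group
DT[, own_sum := sum(Var1), by = .(ID, b, c)]
# Peer-group sum excluding the entry's own ID
DT[, Var1_sum := grp_sum - own_sum]
# Drop the hypothetical helper columns again
DT[, c("grp_sum", "own_sum") := NULL]
```

This only works because sum() can be decomposed into a total minus a part. For a function without that property (median(), say), the excluded entry would still have to be dropped explicitly for each row.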
> tempDT <- data.table(colA = c("E","E","A","A","E","A","E")
+ , lags = c(NA,1,1,2,3,1,2))
> tempDT
colA lags
1: E NA
2: E 1
3: A 1
4: A 2
5: E 3
6: A 1
7: E 2
I have a column colA and need to find the lag between the current row and the previous row whose colA == "E".
Note: if we could find the row reference for the previous row whose colA == "E", then we could calculate the lags. However, I don't know how to achieve it.
1) Define lastEpos which given i returns the position of the last E among the first i rows and apply that to each row number:
lastEpos <- function(i) tail(which(tempDT$colA[1:i] == "E"), 1)
tempDT[, lags := .I - shift(sapply(.I, lastEpos))]
Here are a few variations:
2) i-1 In this variation lastEpos returns the position of the last E among the first i-1 rows rather than the first i, so shift is no longer needed:
lastEpos <- function(i) tail(c(NA, which(tempDT$colA[seq_len(i-1)] == "E")), 1)
tempDT[, lags := .I - sapply(.I, lastEpos)]
3) Position Similar to (2) but uses Position:
lastEpos <- function(i) Position(c, tempDT$colA[seq_len(i-1)] == "E", right = TRUE)
tempDT[, lags := .I - sapply(.I, lastEpos)]
4) rollapply
library(zoo)
w <- lapply(1:nrow(tempDT), function(i) -rev(seq_len(i-1)))
tempDT[, lags := .I - rollapply(colA == "E", w, Position, f = c, right = TRUE)]
5) sqldf
library(sqldf)
sqldf("select a.colA, a.rowid - max(b.rowid) lags
from tempDT a left join tempDT b
on b.rowid < a.rowid and b.colA = 'E'
group by a.rowid")
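The variations above call lastEpos once per row. As a further sketch (not one of the original answers), the same "position of the previous E" can be obtained in a vectorized way with cummax. The names idx, last_e and prev_e are hypothetical helpers introduced here for illustration.

```r
library(data.table)

tempDT <- data.table(colA = c("E", "E", "A", "A", "E", "A", "E"))

idx <- seq_len(nrow(tempDT))
# Row number where colA is "E", 0 elsewhere; cummax carries the last E forward
last_e <- cummax(ifelse(tempDT$colA == "E", idx, 0L))
# Only rows strictly before the current one count, so lag by one position
prev_e <- shift(last_e)
# A value of 0 means no "E" has been seen yet
prev_e[which(prev_e == 0L)] <- NA_integer_
tempDT[, lags := idx - prev_e]
```

The explicit replacement of 0 by NA matters when the table does not start with an "E" row; without it those rows would get their own row number as the lag.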
How can I compare each value in a column of a data.table against the values in the next two positions of the same column? The following example illustrates the problem and the desired result.
library(data.table)
dt <- data.table(a = c(2, 3, 2, 4))
result <- data.table(a = c(2, 3, 2, 4), b = c(TRUE, FALSE, NA, NA))
We can use shift to create two lead columns based on 'a' by specifying n = 1:2. Loop through these columns with lapply, check whether each is equal to 'a', Reduce the results to a single logical vector with |, and assign it to column 'b':
dt[, b := Reduce(`|`, lapply(shift(a, 1:2, type = 'lead'), `==`, a))]
dt
# a b
#1: 2 TRUE
#2: 3 FALSE
#3: 2 NA
#4: 4 NA
As @Mike H. suggested, if we are only comparing against the next two values, writing out each comparison individually may be easier to understand:
dt[, b := (shift(a, 1, type = 'lead') == a) | (shift(a, 2, type = 'lead') == a)]
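To see why the last two rows come out as NA rather than FALSE, it can help to look at the intermediate pieces. The sketch below repeats the Reduce approach outside of j; the names leads and eq are hypothetical and used only for illustration.

```r
library(data.table)

dt <- data.table(a = c(2, 3, 2, 4))

# List of two vectors: the next value and the value after that
leads <- shift(dt$a, 1:2, type = "lead")
# Compare each lead vector with a
eq <- lapply(leads, `==`, dt$a)
# eq[[1]] is FALSE FALSE FALSE NA
# eq[[2]] is TRUE  FALSE NA    NA

# FALSE | NA is NA, while TRUE | NA is TRUE
b <- Reduce(`|`, eq)
# b is TRUE FALSE NA NA
```

Row 3 is NA because one comparison is FALSE and the other runs past the end of the column; a TRUE in either comparison would have won over the NA.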
You could do a rolling join on row number:
dt[, r := .I]
dt[head(1:.N, -2), found :=
dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, .N, by=.EACHI]$N > 0L]
a r found
1: 2 1 TRUE
2: 3 2 FALSE
3: 2 3 NA
4: 4 4 NA
To see how it works, replace .N with x.r:
dt[head(1:.N, -2), dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, x.r, by=.EACHI]]
a r x.r
1: 2 2 3
2: 3 3 NA
The idea is that, for each row, we look for the nearest match on a starting from r+1, giving up after rolling one more row ahead.
I have a data.table that looks like this:
dt <- data.table(a = 1, b = 1, c = 1)
I need column b to be treated as an integer vector of variable length, so I can append additional elements to it. For instance, I want to add 2 to column b in the first row. I tried
dt[a == 1, b := c(b, 2)]
but that doesn't work. It gives me a warning:
Warning message:
In `[.data.table`(dt, a == 1, `:=`(b, c(b, 2))) :
Supplied 2 items to be assigned to 1 items of column 'b' (1 unused)
What's the right syntax for this?
dt <- data.table(a = 1, b = 1:3, c = 1)
dt[, b := .(lapply(b, c, 2))][]
# a b c
#1: 1 1,2 1
#2: 1 2,2 1
#3: 1 3,2 1
If b needs to be converted to a list first (i.e. when it is not already a list column and you are subsetting with i or grouping with by), run dt[, b := .(as.list(b))] before the above.
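Putting the two steps together for the subsetting case from the question, a minimal sketch could look like the following. The second row (a = 2, b = 5) is made-up data added here only to show that rows outside the subset are left untouched.

```r
library(data.table)

# Second row is hypothetical, added to show that it stays unchanged
dt <- data.table(a = c(1, 2), b = c(1, 5), c = 1)

# Step 1: turn b into a list column, one element per row
dt[, b := .(as.list(b))]

# Step 2: append 2 to b only where a == 1
dt[a == 1, b := .(lapply(b, c, 2))]

dt$b[[1]]  # vector holding 1 and 2
dt$b[[2]]  # still just 5
```

Without step 1, the subset assignment would try to fit a list into a numeric column, which is the situation the note above warns about.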
I often want to process one row of a data.table at a time. I've been using
d[, j, by=rownames(d)]
but this doesn't always seem to work (sometimes getting an error message about by appearing to evaluate to column names), and in any case isn't a very clean expression of what I'm trying to do.
Let me give a specific example.
d = data.table(a=c(1,2),b=c(3,4))
f = function(x,y) x[1]+y[1] #expects length 1 vectors x and y and adds them
d[, id := 1:.N]
d[, f(a,b), by=id]
d[, id := NULL]
The situation is that I have a function f that is not vectorized. I've decorated d with an id column so I can process one row at a time. I'm looking for a better way to do this.
Here's another example, without a function f:
d[, id := 1:.N]  # re-create the row id that was removed above
d[, list(a=a,b=b,s=a:b), by = id]
d[, id := NULL]
This seems to do the job, following the example I found at https://arelbundock.com/posts/datatable_rowwise/
# Your example data frame and function
d = data.table(a = c(1, 2), b = c(3, 4))
f <- function(x, y) {x[1] + y[1]}
# Try the Map() function for row-wise operations
d[, z := Map(f, a, b)]
produces
#> a b z
#> 1: 1 3 4
#> 2: 2 4 6
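Note that Map() returns a list, so z above is a list column even though it prints like a numeric one. As a sketch of two alternatives, mapply() simplifies the result to an atomic vector, and grouping by row number processes one row at a time without a temporary id column. The result name res and the output column s are illustrative choices, not part of the original answer.

```r
library(data.table)

d <- data.table(a = c(1, 2), b = c(3, 4))
f <- function(x, y) x[1] + y[1]  # not vectorized: uses only the first elements

# mapply() simplifies to an atomic vector, so z is a plain numeric column
d[, z := mapply(f, a, b)]

# Grouping by row number calls f once per row, with no id column needed
res <- d[, .(s = f(a, b)), by = 1:nrow(d)]
```

The mapply() form is usually the lighter option when f returns a single value per row; the by = 1:nrow(d) form also handles the case where each row expands to several output rows, as in the s = a:b example.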
How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
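For reference, the explicit with = FALSE form mentioned above looks like this. It is a small sketch only; cols and sel are hypothetical variable names. With with = FALSE, j is treated as a vector of column positions (or names) rather than as an expression evaluated inside the table.

```r
library(data.table)

dt <- data.table(a = 1, b = 2, c = 3)

cols <- 2:3                      # column positions held in a variable
sel  <- dt[, cols, with = FALSE] # j is interpreted as column positions
```

This form is also handy in current versions when the positions are stored in a variable, as here.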
It's a bit verbose, but I've gotten used to using the hidden .SD variable.
b <- data.table(a = 1, b = 2, c = 3, d = 4)
b[, .SD, .SDcols = 1:2]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use the .. prefix to select columns named in a variable from the calling scope:
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
@Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to exclude just one column from the printed output. Building on the example above, you can exclude the second column like this:
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[, .SD, .SDcols = -2]       # drop the second column by position
dt[, .SD, .SDcols = c(1, 3)]  # equivalent: keep the first and third columns
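If the column to exclude is known by name rather than by position, one possible sketch is to build the list of columns to keep with setdiff() and pass it to .SDcols. The names drop_cols, keep_cols and res are hypothetical, chosen only for this example.

```r
library(data.table)

dt <- data.table(a = 1:2, b = 2:3, c = 3:4)

drop_cols <- "b"                            # column(s) to leave out
keep_cols <- setdiff(names(dt), drop_cols)  # everything else, in original order
res <- dt[, .SD, .SDcols = keep_cols]
```

Selecting by name keeps working if columns are later reordered, which a positional index like -2 does not.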