Vector as entry in `data.table` - r

I have a data.table that looks like this:
dt <- data.table(a = 1, b = 1, c = 1)
I need column b to be treated as an integer vector of variable length, so I can append additional elements to it. For instance, I want to add 2 to column b in the first row. I tried
dt[a == 1, b := c(b, 2)]
but that doesn't work. It gives me a warning:
Warning message:
In `[.data.table`(dt, a == 1, `:=`(b, c(b, 2))) :
Supplied 2 items to be assigned to 1 items of column 'b' (1 unused)
What's the right syntax for this?

dt <- data.table(a = 1, b = 1:3, c = 1)
dt[, b := .(lapply(b, c, 2))][]
# a b c
#1: 1 1,2 1
#2: 1 2,2 1
#3: 1 3,2 1
If b is not already a list column and you are subsetting rows or grouping with by, convert it to a list first by running dt[, b := .(as.list(b))] before the line above.
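Applied to the one-row table from the question, the two steps might look like the sketch below. It assumes b starts as a plain numeric column, and it uses an anonymous function instead of passing c directly, only to avoid any confusion with the column that is also named c.

```r
library(data.table)

dt <- data.table(a = 1, b = 1, c = 1)

# Step 1: turn b into a list column so each row can hold a vector
dt[, b := .(as.list(b))]

# Step 2: append 2 to the vector stored in b, only where a == 1
dt[a == 1, b := .(lapply(b, function(x) c(x, 2)))]

# dt$b[[1]] is now the vector c(1, 2)
```

The conversion has to happen in a separate, unsubsetted call because an assignment restricted to some rows cannot change the type of the column.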

Related

Evaluating same column data.table in r

How can I compare each value in a data.table column against the values in the next two positions of the same column? The following example illustrates the problem and the desired result.
library(data.table)
dt <- data.table(a = c(2, 3, 2, 4))
result <- data.table(a = c(2, 3, 2, 4), b = c(T, F, NA, NA))
We can use shift to create two lead columns based on 'a' by specifying n = 1:2. Loop through those columns with lapply, check whether each is equal to 'a', Reduce the results to a single logical vector with |, and assign it to column 'b':
dt[, b := Reduce(`|`, lapply(shift(a, 1:2, type = 'lead'), `==`, a))]
dt
# a b
#1: 2 TRUE
#2: 3 FALSE
#3: 2 NA
#4: 4 NA
As @Mike H. suggested, if we are comparing only against the next two values, writing out each comparison individually may be easier to understand:
dt[, b := (shift(a, 1, type = 'lead') == a) | (shift(a, 2, type = 'lead') == a)]
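To make the shift/Reduce approach easier to follow, here is a small sketch that runs the same steps on a plain vector outside of a data.table. The intermediate names leads and hits are made up for illustration.

```r
library(data.table)

a <- c(2, 3, 2, 4)

# shift() with n = 1:2 returns a list of two vectors:
# a moved up by one position and a moved up by two, padded with NA
leads <- shift(a, 1:2, type = "lead")

# Compare each lead vector with a, element by element
hits <- lapply(leads, `==`, a)

# Combine the two logical vectors with OR
b <- Reduce(`|`, hits)
# b is TRUE, FALSE, NA, NA
```

The last two entries are NA because FALSE | NA and NA | NA are both NA in R, which is exactly the desired result in the question.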
You could do a rolling join on row number:
dt[, r := .I]
dt[head(1:.N, -2), found :=
dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, .N, by=.EACHI]$N > 0L]
a r found
1: 2 1 TRUE
2: 3 2 FALSE
3: 2 3 NA
4: 4 4 NA
To see how it works, replace .N with x.r:
dt[head(1:.N, -2), dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, x.r, by=.EACHI]]
a r x.r
1: 2 2 3
2: 3 3 NA
The idea is that we look for the nearest a match starting from r+1 and giving up after rolling one more ahead.

R data.table join with roll

dd = data.table(a = c(1,1), b = c(1,2), v = c(1, NA))
dd
# a b v
# 1: 1 1 1
# 2: 1 2 NA
setkey(dd, a,b)
dd[.(1,2), roll = TRUE, rollends = c(TRUE, TRUE)]
# a b v
# 1: 1 2 NA
What have I missed here? Why isn't v carried forward?
A rolling join does not roll here because you are matching an existing row exactly: (1, 2). Rolling only happens when there is no match on the actual values, and in your case there is an exact match. See the example below, where I modified dd so that there is no match on .(1, 2).
library(data.table)
dd = data.table(a = c(1,1), b = c(1,3), v = c(1, NA))
setkey(dd, a, b)  # the keyed join below needs the key on the modified table too
dd[.(1,2), roll = TRUE, rollends = c(TRUE, TRUE)]
# a b v
#1: 1 2 1
See the description of the roll argument in ?data.table:
When i is a data.table and its row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then:
+Inf (or TRUE) rolls the prevailing value in x forward. It is also known as last observation carried forward (LOCF)
...
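A minimal sketch of the three cases, using on= instead of a key. The table x and its values are invented for illustration; only roll and on are data.table arguments.

```r
library(data.table)

x <- data.table(a = c(1, 1), b = c(1, 3), v = c(10, 30))

# b = 2 falls in the gap between 1 and 3: the prevailing value rolls forward (LOCF)
gap_forward <- x[.(a = 1, b = 2), on = .(a, b), roll = TRUE]

# b = 3 matches a row exactly: no rolling takes place, that row's v is returned
exact <- x[.(a = 1, b = 3), on = .(a, b), roll = TRUE]

# roll = -Inf rolls the next observation backward instead
gap_backward <- x[.(a = 1, b = 2), on = .(a, b), roll = -Inf]
```

Here gap_forward$v is 10, exact$v is 30 and gap_backward$v is 30. If the exactly matched row held NA in v, as in the question, the NA would be returned unchanged, because rolling acts on the join column and not on the values.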

Grouping and applying function to .SD but one rolling entry

I am wondering if there is an elegant data.table (v1.9.4) way to do the following:
Group DT by two variables and then, for each row, compute some function over all the other rows of its group (.SD), that is, every entry in .SD except the row itself, and write the result back to DT. The excluded row rolls through .SD, so the result is (potentially) unique for each row of .SD and hence of DT. You can think of it as computing a value for the peer group of each entry in DT, where the peer group consists of the rows that share the entry's values of the two grouping variables, excluding the entry itself.
I accomplished this with loops around a simple := in data.table's j, but was wondering if there is a pure data.table solution. I could imagine that something like .SD[i != id , := , by=1:nrow(.SD)] inside DT[] would do the trick, but I get:
Using := in the j of .SD is reserved for future use as a (tortuously) flexible way to update DT by reference by group
The solution I have is (compute sum() for group determined by b and c except rolling ID):
DT <- data.table(ID = c("a","a","b","b","c","c"),
                 b = c(1, 2, 1, 2, 1, 2),
                 c = c("x", "x", "y", "z", "y", "x"),
                 Var1 = 1:6)
for (id2 in unique(DT$ID)) {
  for (b2 in unique(DT$b)) {
    # value of c for the current (ID, b) row
    c2 <- DT[ID == id2 & b == b2, c]
    # sum Var1 over the peers: same b and c, different ID
    DT[ID == id2 & b == b2,
       Var1_sum := sum(DT[ID != id2 & b == b2 & c == c2, Var1], na.rm = TRUE)]
  }
}
DT
ID b c Var1 Var1_sum
1: a 1 x 1 0
2: a 2 x 2 6
3: b 1 y 3 5
4: b 2 z 4 0
5: c 1 y 5 3
6: c 2 x 6 2
Do we need that future feature := in .SD's j for this?
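For the specific case of sum(), one possible way to avoid both the loops and := inside .SD is to compute the total of each (b, c) group and subtract each ID's own contribution. This is only a sketch: the helper columns grp_total and own_total are invented names, and the trick relies on sum() being decomposable, so it would not carry over to arbitrary functions.

```r
library(data.table)

DT <- data.table(ID = c("a", "a", "b", "b", "c", "c"),
                 b = c(1, 2, 1, 2, 1, 2),
                 c = c("x", "x", "y", "z", "y", "x"),
                 Var1 = 1:6)

# Total of Var1 within each peer group defined by b and c
DT[, grp_total := sum(Var1, na.rm = TRUE), by = .(b, c)]

# Contribution of each ID to its own peer group
DT[, own_total := sum(Var1, na.rm = TRUE), by = .(ID, b, c)]

# Peer sum = group total minus the ID's own rows
DT[, Var1_sum := grp_total - own_total]

# Drop the helper columns
DT[, c("grp_total", "own_total") := NULL]
```

On the example data this gives the same Var1_sum column as the loop above: 0, 6, 5, 0, 3, 2.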

Reference `data.table` column by name

Suppose there is:
DT = data.table(a=1, b=2, "a+b"=8)
and there is a variable col = "a+b" referencing the third column of DT.
How can I perform an operation on that column by reference? Let's say I want to multiply col by 2, so in the above example the result should be 8*2=16, not (1+2)*2=6.
For example, this obviously doesn't work:
DT[, c:=as.name(col)*2]
It sounds like you're looking for get:
DT = data.table(a=1, b=2, "a+b"=8)
col = "a+b"
DT[, get(col) * 2]
# [1] 16
DT[, c := get(col) * 2]
DT
# a b a+b c
# 1: 1 2 8 16
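If the goal is to overwrite the referenced column itself rather than create a new one, wrapping the variable in parentheses on the left-hand side of := should work. The sketch below also shows [[ as a way to read the column by name without get; the variable name doubled is invented for illustration.

```r
library(data.table)

DT <- data.table(a = 1, b = 2, "a+b" = 8)
col <- "a+b"

# Read the column by name without evaluating "a+b" as an expression
doubled <- DT[[col]] * 2

# (col) on the left-hand side means "the column whose name is stored in col"
DT[, (col) := get(col) * 2]
```

After this, doubled is 16 and the "a+b" column of DT holds 16, while a and b are untouched.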

Select multiple columns in data.table by their numeric indices

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
It's a bit verbose, but I've gotten used to using the hidden .SD variable.
b <- data.table(a = 1, b = 2, c = 3, d = 4)
b[, .SD, .SDcols = c(1:2)]
It's a bit of a hassle, but as far as I know you don't lose any other data.table features, so you should still be able to use other important functionality such as joins.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use the .. prefix to select the columns named in a variable:
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
@Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to exclude just one column when printing. Building on the example above, you can exclude the second column like this:
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]
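When the positions are stored in a variable rather than typed inline, the options mentioned in the answers above can be combined as in this sketch. The variable name idx is invented for illustration, and the .. prefix needs v1.10.2 or later.

```r
library(data.table)

dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
idx <- 2:3  # positions of the columns to keep

# .. prefix: look idx up in the calling scope instead of among the columns
by_prefix <- dt[, ..idx]

# with = FALSE: treat j as a vector of column positions or names
by_with <- dt[, idx, with = FALSE]

# .SDcols also accepts numeric positions
by_sdcols <- dt[, .SD, .SDcols = idx]
```

All three return a data.table containing columns b and c.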
