Evaluating same column data.table in r - r

How can I evaluate a column of a data.table against values of the same column, comparing each value with the values in the next two positions? The following example illustrates the problem and the desired result.
library(data.table)
dt <- data.table(a = c(2, 3, 2, 4))
result <- data.table(a = c(2, 3, 2, 4), b = c(T, F, NA, NA))

We can use shift to create two lead vectors from 'a' by specifying n = 1:2. Loop through them with lapply, check whether each is equal to 'a', Reduce the results to a single logical vector with |, and assign it to column 'b':
dt[, b := Reduce(`|`, lapply(shift(a, 1:2, type = 'lead'), `==`, a))]
dt
# a b
#1: 2 TRUE
#2: 3 FALSE
#3: 2 NA
#4: 4 NA
As @Mike H. suggested, if we are comparing only against the next two values, writing the comparisons out individually may be easier to understand:
dt[, b := (shift(a, 1, type = 'lead') == a) | (shift(a, 2, type = 'lead') == a)]
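To make the intermediate pieces visible, here is a small sketch that runs the same steps on a plain vector outside of a data.table. The variable names leads, eq and b are only illustrative. Note that FALSE | NA evaluates to NA, which is why the third row is NA rather than FALSE.

```r
library(data.table)

a <- c(2, 3, 2, 4)

# shift() with n = 1:2 returns a list of two lead vectors:
# leads[[1]] is 3 2 4 NA and leads[[2]] is 2 4 NA NA
leads <- shift(a, 1:2, type = "lead")

# compare each lead vector with the original values
eq <- lapply(leads, `==`, a)

# combine the comparisons element-wise with OR
b <- Reduce(`|`, eq)
b
```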

You could do a rolling join on row number:
dt[, r := .I]
dt[head(1:.N, -2), found :=
dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, .N, by=.EACHI]$N > 0L]
a r found
1: 2 1 TRUE
2: 3 2 FALSE
3: 2 3 NA
4: 4 4 NA
To see how it works, replace .N with x.r:
dt[head(1:.N, -2), dt[.SD[, .(a = a, r = r + 1L)], on=.(a, r), roll=-1, x.r, by=.EACHI]]
a r x.r
1: 2 2 3
2: 3 3 NA
The idea is that we look for the nearest a match starting from r+1 and giving up after rolling one more ahead.
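As a cross-check of the rolling join result, the same logic can be written as a plain loop in base R. This is only a sketch for verification, not a replacement for the join; the names n and found are illustrative.

```r
a <- c(2, 3, 2, 4)
n <- length(a)

# start with NA everywhere; the last two rows have no full look-ahead window
found <- rep(NA, n)

# for every row that has two following rows, test whether
# either of the next two values equals the current value
for (i in seq_len(max(n - 2L, 0L))) {
  found[i] <- any(a[i + 1:2] == a[i])
}
found
```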

Related

Nice way to group data in a `data.table` when the new column name is given as a character vector

In other words, my question is about the j argument to data.table when the name of the new column is a character vector. For example:
dt <- data.table(x = c(1, 1, 2, 2, 3, 3), y = rnorm(6))
agg_col_name <- 'avg'
grouped_dt <- dt[, .(z = mean(y)), by = x]
setnames(grouped_dt, 'z', agg_col_name)
> grouped_dt
x avg
1: 1 -0.2554987
2: 2 -0.4245852
3: 3 -0.4881073
There should be a more elegant way to do the last two statements as one, yes?
Perhaps this is a question about how to create suitable list for the j argument.
Although this is probably not what you are looking for, you could use setNames inside j, wrapping it around .(z = mean(y)).
library(data.table)
dt[, setNames(.(z = mean(y)), agg_col_name), by = x]
Or use setnames after doing the summary:
setnames(dt[, mean(y), by = x], 'V1', agg_col_name)[]
Output
x avg
1: 1 0.5626526
2: 2 0.3549653
3: 3 -0.2861405
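Because the examples above use rnorm, their printed numbers differ between runs. The following sketch repeats the setNames approach with fixed y values (chosen here purely for illustration) so that the result can be checked by hand.

```r
library(data.table)

# fixed values instead of rnorm(6), so the group means are known
dt <- data.table(x = c(1, 1, 2, 2, 3, 3), y = c(1, 3, 2, 4, 5, 7))
agg_col_name <- "avg"

# setNames() renames the single-element list returned by j
res <- dt[, setNames(list(mean(y)), agg_col_name), by = x]
res
```

The group means are 2, 3 and 6, and the aggregated column takes its name from agg_col_name.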
However, as mentioned in the comments, it is easier to do with the dev version of data.table. You can see more about the development of this feature at programming on data.table #4304 (https://github.com/Rdatatable/data.table/pull/4304).
# Latest development version:
data.table::update.dev.pkg()
library(data.table)
dt[, .(z = mean(y)), by = x, env = list(z=agg_col_name)]
# x avg
#1: 1 -0.1640783
#2: 2 0.5375794
#3: 3 0.1539785

Vector as entry in `data.table`

I have a data.table that looks like this:
dt <- data.table(a = 1, b = 1, c = 1)
I need column b to be treated as an integer vector of variable length, so I can append additional elements to it. For instance, I want to add 2 to column b in the first row. I tried
dt[a == 1, b := c(b, 2)]
but that doesn't work. It gives me a warning:
Warning message:
In `[.data.table`(dt, a == 1, `:=`(b, c(b, 2))) :
Supplied 2 items to be assigned to 1 items of column 'b' (1 unused)
What's the right syntax for this?
dt <- data.table(a = 1, b = 1:3, c = 1)
dt[, b := .(lapply(b, c, 2))][]
# a b c
#1: 1 1,2 1
#2: 1 2,2 1
#3: 1 3,2 1
If b is not already a list column (for example, when you are subsetting with i or grouping with by), convert it first by running dt[, b := .(as.list(b))] before the above.
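For the case in the question, where only the rows matching a condition should be extended, a minimal sketch could look like this. It assumes b is created as a list column from the start; the two-row table is made up for illustration.

```r
library(data.table)

# b is a list column: each cell holds an integer vector
dt <- data.table(a = 1:2, b = list(1L, 5L), c = 1)

# append 2L to the vector stored in b, only where a == 1;
# the outer list() tells := that the right-hand side is one list column
dt[a == 1, b := list(lapply(b, c, 2L))]
dt
```

After this, the first row of b holds the vector 1, 2 while the second row is left unchanged.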

R - Selecting columns from data table with for loop issue [duplicate]

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
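As a short sketch of the older idiom mentioned above: with = FALSE tells data.table to treat j as a vector of column positions or names rather than as an expression. It is still accepted in current versions, so the same call works on both sides of the 1.9.8 change.

```r
library(data.table)

dt <- data.table(a = 1, b = 2, c = 3)

# explicit form that was required before v1.9.8
sel_idx  <- dt[, 2:3, with = FALSE]
sel_name <- dt[, c("a", "b"), with = FALSE]
```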
It's a bit verbose, but I've gotten used to using the hidden .SD variable.
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
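From v1.10.2 onwards, you can also use the .. prefix, which looks up a variable in the calling scope instead of treating it as a column name: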
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
@Tom, thank you very much for pointing out this solution. It works great for me.
I was looking for a way to exclude just one column from printing. Using the example above, to exclude the second column you can do something like this:
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]
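If you would rather exclude a column by name than by position, one possible sketch combines setdiff with the .. prefix shown earlier. The variable name keep is only illustrative.

```r
library(data.table)

dt <- data.table(a = 1:2, b = 2:3, c = 3:4)

# build the vector of names to keep, then select it with the .. prefix
keep <- setdiff(names(dt), "b")
res <- dt[, ..keep]
res
```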

Grouping and applying function to .SD but one rolling entry

I am wondering if there is an elegant data.table (v1.9.4) way to do the following:
group a DT by two variables and then compute some function on the grouped tables (.SDs) for all entries in .SD but one and that one should be rolling through .SD and putting the result back in DT. The result is thus (potentially) unique for each entry in the .SDs (and hence DT). You can think of it as computing some value for a peer group of an entry in DT and that peer group is determined by the two grouping variables (same properties as the entry in DT) but the entry itself.
I accomplished this with loops around a simple := in data.table's j, but was wondering if there is a pure data.table solution. I could imagine something like .SD[i != id , := , by=1:nrow(.SD)] inside DT[] could do the trick but:
Using := in the j of .SD is reserved for future use as a (tortuously) flexible way to update DT by reference by group
The solution I have is (compute sum() for group determined by b and c except rolling ID):
DT <- data.table(ID = c("a","a","b","b","c","c"),
b = c(1, 2, 1, 2, 1, 2),
c = c("x", "x", "y", "z", "y", "x"),
Var1 = 1:6)
for (id2 in unique(DT$ID)) {
for (b2 in unique(DT$b)) {
c2 <- DT[ID==id2 & b==b2, c]
DT[ID == id2 & b == b2,
Var1_sum := sum(DT[ID != id2 & b == b2 & c == c2, Var1], na.rm=TRUE)]
}
}
DT
ID b c Var1 Var1_sum
1: a 1 x 1 0
2: a 2 x 2 6
3: b 1 y 3 5
4: b 2 z 4 0
5: c 1 y 5 3
6: c 2 x 6 2
Do we need that future feature := in .SD's j for this?
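For a function like sum(), one possible sketch avoids both the loops and := inside .SD: compute the total per (b, c) peer group, compute the total contributed by each ID within that group, and subtract. This relies on sum being invertible by subtraction, so it is an assumption that it does not carry over to arbitrary functions. The helper columns grp_total and own_total are hypothetical names introduced here.

```r
library(data.table)

DT <- data.table(ID = c("a", "a", "b", "b", "c", "c"),
                 b = c(1, 2, 1, 2, 1, 2),
                 c = c("x", "x", "y", "z", "y", "x"),
                 Var1 = 1:6)

# total of Var1 over the whole peer group (same b and c)
DT[, grp_total := sum(Var1, na.rm = TRUE), by = .(b, c)]

# the part of that total that comes from the row's own ID
DT[, own_total := sum(Var1, na.rm = TRUE), by = .(ID, b, c)]

# peers only: group total minus own contribution
DT[, Var1_sum := grp_total - own_total]

# drop the helper columns
DT[, c("grp_total", "own_total") := NULL]
DT
```

On the example data this gives 0, 6, 5, 0, 3, 2 for Var1_sum, matching the loop output shown above.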
