Reference `data.table` column by name - r

Suppose there is:
DT = data.table(a=1, b=2, "a+b"=8)
and there is variable col="a+b" referencing the third column of DT
How to perform an operation on that column by reference? Let's say I want multiply col by 2, so in the above example the result should be 8*2=16, not (1+2)*2=6
For example, this obviously doesn't work:
DT[, c:=as.name(col)*2]

It sounds like you're looking for get:
DT = data.table(a=1, b=2, "a+b"=8)
col = "a+b"
DT[, get(col) * 2]
# [1] 16
DT[, c := get(col) * 2]
DT
# a b a+b c
# 1: 1 2 8 16

Related

Non-zero Values of Data.Table Column

I would like to extract the non-zero values of a specific column a of my data.table. Here is an example
set.seed(42)
DT <- data.table(
id = c("b","b","b","a","a","c"),
a = sample(c(0,1), 6, replace=TRUE),
b = 7:12,
c = 13:18
)
col <- "a"
If DT is a data.frame, I can do
x <- DT[,col] # I can do DT[,..col] to translate this line
x[x>0] # here is where I am stuck
Since DT is a data.table, this code fails. The error message is: "i is invalid type (matrix)".
I tried as.vector(x) but without success.
Any hint appreciated. This seems to be a beginner question. However, searching SO and the introduction vignette for data.table did not turn up a solution.
We can either use .SDcols to specify the column
DT[DT[, .SD[[1]] > 0, .SDcols = col]]
or with get
DT[DT[ ,get(col) > 0]]
DT[get(col) > 0][[col]]
#[1] 1 1
Or another option is [[
DT[DT[[col]] > 0]
# id a b c
#1: a 1 11 17
#2: c 1 12 18
Or to get only the column
DT[DT[[col]] >0][[col]]
#[1] 1 1
you can use filter:
DT %>% filter(column_name > 0)

R - Selecting columns from data table with for loop issue [duplicate]

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
It's a bit verbose, but i've gotten used to using the hidden .SD variable.
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use ..
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
#Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to just exclude one column from printing and from the example above. To exclude the second column you can do something like this
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]

data.table in r : subset using column index

DT - data.table with column "A"(column index==1), "B"(column index 2), "C" and etc
for example next code makes subset DT1 which consists rows where A==2:
DT1 <- DT[A==2, ]
BUT How can I make subsets like DT1 using only column index??
for example, code like next not works :
DT1 <- DT[.SD==2, .SDcols = 1]
It is not recommended to use column index instead of column names as it makes your code difficult to understand and agile for any changes that could happen to your data. (See, for example, the first paragraph of the first question in the package FAQ.) However, you can subset with column index as follows:
DT = data.table(A = 1:5, B = 2:6, C = 3:7)
DT[DT[[1]] == 2]
# A B C
#1: 2 3 4
We can get the row index with .I and use that to subset the DT
DT[DT[, .I[.SD==2], .SDcols = 1]]
# A B C
#1: 2 3 4
data
DT <- data.table(A = 1:5, B = 2:6, C = 3:7)

Getting one row per group in R data.table

I often want to process one row of a data.table at a time. I've been using
d[, j, by=rownames(d)]
but this doesn't always seem to work (sometimes getting an error message about by appearing to evaluate to column names), and in any case isn't a very clean expression of what I'm trying to do.
Let me give a specific example.
d = data.table(a=c(1,2),b=c(3,4))
f = function(x,y) x[1]+y[1] #expects length 1 vectors x and y and adds them
d[, id := 1:.N]
d[, f(a,b), by=id]
d[, id := NULL]
The situation is that I have a function f that is not vectorized. I've decorated d with an id column so I can process one row at a time. I'm looking for a better way to do this.
Here's another example, without a function f:
d[, list(a=a,b=b,s=a:b), by = id]
d[, id := NULL]
This seems to do the job, following the example I found at https://arelbundock.com/posts/datatable_rowwise/
# Your example data frame and function
d = data.table(a = c(1, 2), b = c(3, 4))
f <- function(x, y) {x[1] + y[1]}
# Try the Map() function for row-wise operations
d[, z := Map(f, a, b)]
produces
#> a b z
#> 1: 1 3 4
#> 2: 2 4 6

Select multiple columns in data.table by their numeric indices

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
It's a bit verbose, but i've gotten used to using the hidden .SD variable.
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use ..
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
#Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to just exclude one column from printing and from the example above. To exclude the second column you can do something like this
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]

Resources