R - Selecting columns from data table with for loop issue [duplicate] - r

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3

For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.

It's a bit verbose, but i've gotten used to using the hidden .SD variable.
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.

If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4

From v1.10.2 onwards, you can also use ..
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]

#Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to just exclude one column from printing and from the example above. To exclude the second column you can do something like this
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]

Related

Non-zero Values of Data.Table Column

I would like to extract the non-zero values of a specific column a of my data.table. Here is an example
set.seed(42)
DT <- data.table(
id = c("b","b","b","a","a","c"),
a = sample(c(0,1), 6, replace=TRUE),
b = 7:12,
c = 13:18
)
col <- "a"
If DT is a data.frame, I can do
x <- DT[,col] # I can do DT[,..col] to translate this line
x[x>0] # here is where I am stuck
Since DT is a data.table, this code fails. The error message is: "i is invalid type (matrix)".
I tried as.vector(x) but without success.
Any hint appreciated. This seems to be a beginner question. However, searching SO and the introduction vignette for data.table did not turn up a solution.
We can either use .SDcols to specify the column
DT[DT[, .SD[[1]] > 0, .SDcols = col]]
or with get
DT[DT[ ,get(col) > 0]]
DT[get(col) > 0][[col]]
#[1] 1 1
Or another option is [[
DT[DT[[col]] > 0]
# id a b c
#1: a 1 11 17
#2: c 1 12 18
Or to get only the column
DT[DT[[col]] >0][[col]]
#[1] 1 1
you can use filter:
DT %>% filter(column_name > 0)

data.table in r : subset using column index

DT - data.table with column "A"(column index==1), "B"(column index 2), "C" and etc
for example next code makes subset DT1 which consists rows where A==2:
DT1 <- DT[A==2, ]
BUT How can I make subsets like DT1 using only column index??
for example, code like next not works :
DT1 <- DT[.SD==2, .SDcols = 1]
It is not recommended to use column index instead of column names as it makes your code difficult to understand and agile for any changes that could happen to your data. (See, for example, the first paragraph of the first question in the package FAQ.) However, you can subset with column index as follows:
DT = data.table(A = 1:5, B = 2:6, C = 3:7)
DT[DT[[1]] == 2]
# A B C
#1: 2 3 4
We can get the row index with .I and use that to subset the DT
DT[DT[, .I[.SD==2], .SDcols = 1]]
# A B C
#1: 2 3 4
data
DT <- data.table(A = 1:5, B = 2:6, C = 3:7)

How to apply a operation on a subset of columns in data.table in R [duplicate]

How do you refer to variables in a data.table if the variable names are stored in a character vector? For instance, this works for a data.frame:
df <- data.frame(col1 = 1:3)
colname <- "col1"
df[colname] <- 4:6
df
# col1
# 1 4
# 2 5
# 3 6
How can I perform this same operation for a data.table, either with or without := notation? The obvious thing of dt[ , list(colname)] doesn't work (nor did I expect it to).
Two ways to programmatically select variable(s):
with = FALSE:
DT = data.table(col1 = 1:3)
colname = "col1"
DT[, colname, with = FALSE]
# col1
# 1: 1
# 2: 2
# 3: 3
'dot dot' (..) prefix:
DT[, ..colname]
# col1
# 1: 1
# 2: 2
# 3: 3
For further description of the 'dot dot' (..) notation, see New Features in 1.10.2 (it is currently not described in help text).
To assign to variable(s), wrap the LHS of := in parentheses:
DT[, (colname) := 4:6]
# col1
# 1: 4
# 2: 5
# 3: 6
The latter is known as a column plonk, because you replace the whole column vector by reference. If a subset i was present, it would subassign by reference. The parens around (colname) is a shorthand introduced in version v1.9.4 on CRAN Oct 2014. Here is the news item:
Using with = FALSE with := is now deprecated in all cases, given that wrapping
the LHS of := with parentheses has been preferred for some time.
colVar = "col1"
DT[, (colVar) := 1] # please change to this
DT[, c("col1", "col2") := 1] # no change
DT[, 2:4 := 1] # no change
DT[, c("col1","col2") := list(sum(a), mean(b))] # no change
DT[, `:=`(...), by = ...] # no change
See also Details section in ?`:=`:
DT[i, (colnamevector) := value]
# [...] The parens are enough to stop the LHS being a symbol
And to answer further question in comment, here's one way (as usual there are many ways) :
DT[, colname := cumsum(get(colname)), with = FALSE]
# col1
# 1: 4
# 2: 9
# 3: 15
or, you might find it easier to read, write and debug just to eval a paste, similar to constructing a dynamic SQL statement to send to a server :
expr = paste0("DT[,",colname,":=cumsum(",colname,")]")
expr
# [1] "DT[,col1:=cumsum(col1)]"
eval(parse(text=expr))
# col1
# 1: 4
# 2: 13
# 3: 28
If you do that a lot, you can define a helper function EVAL :
EVAL = function(...)eval(parse(text=paste0(...)),envir=parent.frame(2))
EVAL("DT[,",colname,":=cumsum(",colname,")]")
# col1
# 1: 4
# 2: 17
# 3: 45
Now that data.table 1.8.2 automatically optimizes j for efficiency, it may be preferable to use the eval method. The get() in j prevents some optimizations, for example.
Or, there is set(). A low overhead, functional form of :=, which would be fine here. See ?set.
set(DT, j = colname, value = cumsum(DT[[colname]]))
DT
# col1
# 1: 4
# 2: 21
# 3: 66
*This is not an answer really, but I don't have enough street cred to post comments :/
Anyway, for anyone who might be looking to actually create a new column in a data table with a name stored in a variable, I've got the following to work. I have no clue as to it's performance. Any suggestions for improvement? Is it safe to assume a nameless new column will always be given the name V1?
colname <- as.name("users")
# Google Analytics query is run with chosen metric and resulting data is assigned to DT
DT2 <- DT[, sum(eval(colname, .SD)), by = country]
setnames(DT2, "V1", as.character(colname))
Notice I can reference it just fine in the sum() but can't seem to get it to assign in the same step. BTW, the reason I need to do this is colname will be based on user input in a Shiny app.
Retrieve multiple columns from data.table via variable or function:
library(data.table)
x <- data.table(this=1:2,that=1:2,whatever=1:2)
# === explicit call
x[, .(that, whatever)]
x[, c('that', 'whatever')]
# === indirect via variable
# ... direct assignment
mycols <- c('that','whatever')
# ... same as result of a function call
mycols <- grep('a', colnames(x), value=TRUE)
x[, ..mycols]
x[, .SD, .SDcols=mycols]
# === direct 1-liner usage
x[, .SD, .SDcols=c('that','whatever')]
x[, .SD, .SDcols=grep('a', colnames(x), value=TRUE)]
which all yield
that whatever
1: 1 1
2: 2 2
I find the .SDcols way the most elegant.
With development version 1.14.3, data.table has gained a new interface for programming on data.table, see item 10 in New Features. It uses the new env = parameter.
library(data.table) # development version 1.14.3 used
dt <- data.table(col1 = 1:3)
colname <- "col1"
dt[, cn := cn + 3L, env = list(cn = colname)][]
col1
<int>
1: 4
2: 5
3: 6
For multiple columns and a function applied on column values.
When updating the values from a function, the RHS must be a list object, so using a loop on .SD with lapply will do the trick.
The example below converts integer columns to numeric columns
a1 <- data.table(a=1:5, b=6:10, c1=letters[1:5])
sapply(a1, class) # show classes of columns
# a b c1
# "integer" "integer" "character"
# column name character vector
nm <- c("a", "b")
# Convert columns a and b to numeric type
a1[, j = (nm) := lapply(.SD, as.numeric ), .SDcols = nm ]
sapply(a1, class)
# a b c1
# "numeric" "numeric" "character"
You could try this:
colname <- as.name("COL_NAME")
DT2 <- DT[, list(COL_SUM=sum(eval(colname, .SD))), by = c(group)]

data.table: How to fast stack(DT) operation, and return data.table instead of returned data.frame

DT <- data.table(a = c(1, 3), b = c(5, 2))
DT1 <- stack(DT)
> stack(DT1)
values ind
1 1 a
2 3 a
3 5 b
4 2 b
Now, it is changed to data.frame. Sure, I can use setDT(DT1) to change it back to data.table,
> DT1
values ind
1: 1 a
2: 3 a
3: 5 b
4: 2 b
I would like to now, is there other ways in data.table to perform **stack** operation (or something like stack function with more efficiency) and directly return data.table instead of data.frame?
Thank you.
You can use gather from the package tidyr
library(tidyr)
DT <- data.table(a = c(1, 3), b = c(5, 2))
DT1 <- gather(DT, ind, values, a:b)
The second argument is the name of the new "key" variable, the third argument is the name of the new "value" variable, the last argument is the columns you want to gather (all in this case). Also, gather automatically calls a faster version of melt for data.tables.

Select multiple columns in data.table by their numeric indices

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
It's a bit verbose, but i've gotten used to using the hidden .SD variable.
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use ..
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
#Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to just exclude one column from printing and from the example above. To exclude the second column you can do something like this
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]

Resources