data.table in R: subset using column index

DT is a data.table with columns "A" (column index 1), "B" (column index 2), "C", and so on.
For example, the following code creates the subset DT1, which consists of the rows where A==2:
DT1 <- DT[A==2, ]
But how can I make a subset like DT1 using only the column index?
For example, the following code does not work:
DT1 <- DT[.SD==2, .SDcols = 1]

Using column indexes instead of column names is not recommended, because it makes your code harder to understand and fragile to any changes that could happen to your data. (See, for example, the first paragraph of the first question in the package FAQ.) However, you can subset with a column index as follows:
DT = data.table(A = 1:5, B = 2:6, C = 3:7)
DT[DT[[1]] == 2]
# A B C
#1: 2 3 4
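When the column position is held in a variable, the same idea can be written in a couple of ways. The following is a minimal sketch rather than the only approach; idx and col_name are hypothetical names introduced for illustration.

```r
library(data.table)

DT <- data.table(A = 1:5, B = 2:6, C = 3:7)

idx <- 1L                    # hypothetical: position of the column to filter on
col_name <- names(DT)[idx]   # translate the position into a column name

# [[ ]] extracts the column as a plain vector, so the comparison
# yields a logical vector that is used as the row filter
by_position <- DT[DT[[idx]] == 2]

# get() looks the column up by name inside the data.table's scope
by_name <- DT[get(col_name) == 2]
```

Translating the position into a name once, near the top of a script, keeps the rest of the code readable even when the input is only known by position.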

We can get the row index with .I and use that to subset DT:
DT[DT[, .I[.SD==2], .SDcols = 1]]
# A B C
#1: 2 3 4
data
DT <- data.table(A = 1:5, B = 2:6, C = 3:7)
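To see why the .I approach works, it can help to evaluate the inner call on its own. This is a small sketch; rows and rows_alt are hypothetical variable names used only for illustration.

```r
library(data.table)

DT <- data.table(A = 1:5, B = 2:6, C = 3:7)

# Inner call: .SD holds only the first column and .I holds the row numbers,
# so .I[.SD == 2] keeps the row numbers where that column equals 2
rows <- DT[, .I[.SD == 2], .SDcols = 1]

# Outer call: an integer vector in i selects rows by position
res <- DT[rows]

# which() on the extracted column should give the same row numbers
rows_alt <- which(DT[[1]] == 2)
```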

Related

What is the function of `with` parameter when selecting dataframe subset

I encounter this code in one of the Kaggle Notebook:
corrplot.mixed(corr = cor(videos[,c("category_id","views","likes",
"dislikes","comment_count"),with=F]))
videos is a data.frame
"category_id","views","likes","dislikes","comment_count" are columns in the videos data.frame
Would like to understand what is the function of the with parameter when selecting dataframe subset?
As mentioned by @user20650, videos might be a data.table. In that case, though, your code should work even without with = F.
Consider this example :
library(data.table)
dt <- data.table(a = 1:5, b = 5:1, c = 1:5)
To subset columns a and b using a character vector, you could do:
dt[, c('a', 'b'), with = F]
# a b
#1: 1 5
#2: 2 4
#3: 3 3
#4: 4 2
#5: 5 1
However, as mentioned, this would work the same without with = F:
dt[, c('a', 'b')]
with = F is helpful when you have a vector of column names stored in a variable.
cols <- c('a', 'b')
dt[, cols] ##Error
dt[, cols, with = F] ##Works
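For comparison, here is a short sketch of three ways to select columns whose names are stored in a variable. It assumes a reasonably recent data.table (the .. prefix needs v1.10.2 or later); the result names by_with, by_prefix and by_sd are made up for this example.

```r
library(data.table)

dt <- data.table(a = 1:5, b = 5:1, c = 1:5)
cols <- c("a", "b")

# with = FALSE: treat j as column names or positions, not as an expression
by_with <- dt[, cols, with = FALSE]

# .. prefix: look cols up in the calling scope instead of among the columns
by_prefix <- dt[, ..cols]

# .SDcols: restrict .SD to the named columns and return it
by_sd <- dt[, .SD, .SDcols = cols]
```

All three should return a data.table with columns a and b. Writing FALSE in full rather than F is slightly safer, since F is an ordinary variable that can be reassigned.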

How to construct an empty data.table with the column names of an existing data.table?

I would like to create an empty data.table in R with column names from another existing data.table.
Somehow I could not find a solution for that.
I would like to do something like this:
require(data.table)
dt1 <- data.table(fn = c("A","B","C"), x = c(1,2,3), y = c(2,3,4), a = 1, b = 2, c = 3)
dt2 <- data.table(names=colnames(dt1)) # Gives 6 rows instead of 6 cols
How can this be achieved?
Thanks!
You can also take your existing dt1, drop all of its rows, and keep the result as dt2:
dt2 <- dt1[0,]
dt2
# Empty data.table (0 rows and 6 cols): fn,x,y,a,b,c
It isn't precisely what you asked for, but it is one solution.
Another option could be:
dt2 <- setnames(data.table(matrix(nrow = 0, ncol = length(dt1))), names(dt1))
# Empty data.table (0 rows and 6 cols): fn,x,y,a,b,c
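The two options print the same summary, but they may differ in the column types they carry. The sketch below inspects the classes of each result; types_kept and types_matrix are hypothetical names, and the remark about the matrix version is my expectation rather than something stated in the answers.

```r
library(data.table)

dt1 <- data.table(fn = c("A", "B", "C"), x = c(1, 2, 3), y = c(2, 3, 4),
                  a = 1, b = 2, c = 3)

# Zero-row subset: keeps the column names and the column types of dt1
dt2 <- dt1[0]
types_kept <- sapply(dt2, class)

# Empty-matrix route: keeps the names only; an empty matrix is logical,
# so the columns are expected to be logical rather than character/numeric
dt3 <- setnames(data.table(matrix(nrow = 0, ncol = length(dt1))), names(dt1))
types_matrix <- sapply(dt3, class)
```

If rows will later be appended with rbind() or rbindlist(), starting from the zero-row subset avoids having to think about type coercion.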

R - Selecting columns from data table with for loop issue [duplicate]

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
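The examples above pass the positions as literals. When the positions are stored in a variable, a bare variable name in j is evaluated as an ordinary expression, so one of the explicit forms is needed. A minimal sketch, assuming the positions live in a hypothetical variable idx:

```r
library(data.table)

dt <- data.table(a = 1, b = 2, c = 3)
idx <- 2:3   # hypothetical: column positions computed elsewhere

# with = FALSE tells data.table to treat idx as a column selection
sel_with <- dt[, idx, with = FALSE]

# .SDcols accepts positions as well as names
sel_sd <- dt[, .SD, .SDcols = idx]
```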
It's a bit verbose, but I've gotten used to using the hidden .SD variable.
b <- data.table(a = 1, b = 2, c = 3, d = 4)
b[, .SD, .SDcols = 1:2]
It's a bit of a hassle, but I don't think you lose out on other data.table features, so you should still be able to use other important functionality such as joining tables.
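One reason the .SD form is worth the extra typing is that it combines naturally with lapply() to compute over the selected columns. A small sketch with made-up data (tbl and sums are hypothetical names):

```r
library(data.table)

tbl <- data.table(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# .SDcols picks columns 1 and 2 by position; lapply() then applies
# sum() to each of them and returns a one-row data.table
sums <- tbl[, lapply(.SD, sum), .SDcols = 1:2]
```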
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
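Because .() is just list(), its elements can also be named or computed on the fly. A brief sketch; the column name b_plus_c is invented for this example:

```r
library(data.table)

dt <- data.table(a = 1:2, b = 2:3, c = 3:4)

# Select b unchanged and add a derived column in the same call
res <- dt[, .(b, b_plus_c = b + c)]
```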
From v1.10.2 onwards, you can also use the .. prefix on a variable that holds the column names:
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
@Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to exclude just one column from the printed output. Building on the example above, to exclude the second column you can do something like this:
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]
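As a shorter alternative, negation can be applied directly in j. This sketch assumes data.table >= 1.9.8, where a literal name or position in j is treated as a column selection; drop_by_name and drop_by_pos are hypothetical names.

```r
library(data.table)

dt <- data.table(a = 1:2, b = 2:3, c = 3:4)

# Drop a column by name
drop_by_name <- dt[, !"b"]

# Drop a column by position
drop_by_pos <- dt[, -2]
```

Neither call modifies dt; each returns a new data.table without column b.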

Create column names based on "by" argument the data.table way

Say I have the following data.table
dt <- data.table(var = c("a", "b"), val = c(1, 2))
Now I want to add two new columns to dt, named a and b, with the respective values (1, 2). I can do this with a loop, but I want to do it the data.table way.
The result would be a data.table like this:
dt.res <- data.table(var = c("a", "b"), val = c(1, 2), #old vars
a = c(1, NA), b = c(NA, 2)) # newly created vars
So far I came up with something like this
dt[, c(xx) := val, by = var]
where xx would be a data.table-command similar to .N which addresses the value of the by-group.
Thanks for the help!
Appendix: The for-loop way
The non-data.table-way with a for-loop instead of a by-argument would look something like this:
for (varname in dt$var){
dt[var == varname, c(varname) := val]
}
Based on the example shown, we can use dcast from data.table to convert the long format to wide, and then join the result with the original dataset on the 'val' column.
library(data.table)#v1.9.6+
dt[dcast(dt, val~var, value.var='val'), on='val']
# var val a b
#1: a 1 1 NA
#2: b 2 NA 2
Or, as @CathG mentioned in the comments, for previous versions either use merge or set the key column and then join.
merge(dt, dcast.data.table(dt, val~var, value.var='val'))
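Splitting the one-liner into two steps may make the approach easier to follow. In this sketch, wide and res are hypothetical names for the intermediate and final tables.

```r
library(data.table)

dt <- data.table(var = c("a", "b"), val = c(1, 2))

# Step 1: long to wide. Each distinct value of var becomes a column,
# filled with val where the pair exists and NA elsewhere
wide <- dcast(dt, val ~ var, value.var = "val")

# Step 2: join the wide table back to the original on val
res <- dt[wide, on = "val"]
```

Note that joining on val relies on val being unique per row, as it is in the example; with repeated values the join would match more than one row.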

