I would like to extract the non-zero values of a specific column a of my data.table. Here is an example:
library(data.table)
set.seed(42)
DT <- data.table(
id = c("b","b","b","a","a","c"),
a = sample(c(0,1), 6, replace=TRUE),
b = 7:12,
c = 13:18
)
col <- "a"
If DT is a data.frame, I can do
x <- DT[,col] # I can do DT[,..col] to translate this line
x[x>0] # here is where I am stuck
Since DT is a data.table, this code fails. The error message is: "i is invalid type (matrix)".
I tried as.vector(x) but without success.
Any hint appreciated. This seems to be a beginner question. However, searching SO and the introduction vignette for data.table did not turn up a solution.
We can either use .SDcols to specify the column
DT[DT[, .SD[[1]] > 0, .SDcols = col]]
or with get
DT[DT[ ,get(col) > 0]]
DT[get(col) > 0][[col]]
#[1] 1 1
Or another option is [[
DT[DT[[col]] > 0]
# id a b c
#1: a 1 11 17
#2: c 1 12 18
Or to get only the column
DT[DT[[col]] >0][[col]]
#[1] 1 1
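To see why the data.frame idiom fails: DT[, ..col] is still a one-column data.table, so x > 0 yields a logical matrix rather than a vector. A minimal sketch, assuming a fixed column a instead of the question's random sample (the values below are made up for reproducibility):

```r
library(data.table)

# Hypothetical fixed data (not the question's random sample), so the result is reproducible
DT <- data.table(id = c("b", "b", "b", "a", "a", "c"),
                 a  = c(0, 1, 0, 0, 1, 1),
                 b  = 7:12)
col <- "a"

sub <- DT[, ..col]   # still a data.table with one column
v   <- DT[[col]]     # a plain numeric vector

nonzero <- v[v > 0]  # ordinary vector subsetting works on v
nonzero
```

Pulling the column out with [[ first avoids the matrix-valued index altogether.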
You can also use filter() from dplyr:
library(dplyr)
DT %>% filter(a > 0) # replace a with the name of your column
I encountered this code in a Kaggle notebook:
corrplot.mixed(corr = cor(videos[,c("category_id","views","likes",
"dislikes","comment_count"),with=F]))
videos is a data.frame
"category_id","views","likes","dislikes","comment_count" are columns in the videos data.frame
I would like to understand what the with parameter does when selecting a subset of a data frame.
As mentioned by @user20650, videos might be a data.table. In that case, though, your code should work even without with = F.
Consider this example :
library(data.table)
dt <- data.table(a = 1:5, b = 5:1, c = 1:5)
To subset column a and b using character vector you could do
dt[, c('a', 'b'), with = F]
# a b
#1: 1 5
#2: 2 4
#3: 3 3
#4: 4 2
#5: 5 1
However, as mentioned this would work the same without with = F.
dt[, c('a', 'b')]
with = F is helpful when you have a vector of column names stored in a variable.
cols <- c('a', 'b')
dt[, cols] ##Error
dt[, cols, with = F] ##Works
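A short sketch comparing with = FALSE to two other ways of selecting columns whose names are stored in a variable. It assumes a data.table version recent enough to support the .. prefix; the result names (by_with, by_prefix, by_sd) are only illustrative.

```r
library(data.table)

dt <- data.table(a = 1:5, b = 5:1, c = 1:5)
cols <- c("a", "b")

by_with   <- dt[, cols, with = FALSE]    # treat cols as a vector of column names
by_prefix <- dt[, ..cols]                # ".." looks cols up outside the table
by_sd     <- dt[, .SD, .SDcols = cols]   # .SD restricted to the named columns
```

All three should return the same two-column data.table.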
How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8 (for which numerical column selection required the use of with = FALSE), see this previous version of this answer. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
It's a bit verbose, but I've gotten used to using the hidden .SD variable.
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but you don't lose out on other data.table features (I don't think), so you should still be able to use other important functions like join tables etc.
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use ..
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
@Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to exclude just one column from the output. Building on the example above, you can exclude the second column like this:
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]
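If the column to drop is known by name rather than by position, one possible sketch is to build the list of names to keep first. The variable names drop and keep are only illustrative:

```r
library(data.table)

dt <- data.table(a = 1:2, b = 2:3, c = 3:4)

drop <- "b"                          # hypothetical column to leave out
keep <- setdiff(names(dt), drop)     # every other column name

by_name  <- dt[, ..keep]             # select by the remaining names
by_index <- dt[, .SD, .SDcols = -2]  # same result, selected by position
```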
I have a data.table with 3 columns, one of which
contains a key:value list of varying length.
I wish to rearrange the table so that each row holds only one key, filtered on the value.
For example, suppose I wish to keep all rows for which the value is <= 2, so that each key is on its own row:
input_tbl <-
data.table::data.table(a=c("AA"),b=c("{\"ha:llo\":1,\"wor:ld\":2,\"doog:bye\":3}"),
c=c(1))
the wanted table then should be
tbl_output <- data.table::data.table(a=c("AA",
"AA"),b=c("ha:llo","wor:ld"), c=c(1,1), s=c(1,2))
I had tried the following function:
data_table_clean <- function(dt){
dt[ ,"b" := data.table::tstrsplit(b, ',', fixed = T),by=c(a, c)]
dt[,c('b', 's'):= data.table::tstrsplit(b, ':', fixed=TRUE)]
return(dt[s <=2,])
}
this produces the following error
"Error in eval(expr, envir, enclos) : object 'a' not found"
Any suggestions are welcome, of course.
The keys are actually of the form :
input2_tbl <-
data.table::data.table(a=c("AA"),b=c("{\"99:1d:3u:7y:89:67\":1,\"99:1D:34:YY:T6:Y6\":2,\"ll:5Y:UY:56:R5:R6\":3}"),
c=c(1))
and accordingly the output table should be:
tbl2_output <- data.table::data.table(a=c("AA",
"AA"),b=c("99:1d:3u:7y:89:67","99:1D:34:YY:T6:Y6"),
c=c(1,1), s=c(1,2))
Thank you!
update
data_table_clean <- function(dt){
res <- dt[, data.table::tstrsplit(unlist(strsplit(gsub('[{}"]', '', b),',', fixed=TRUE)), ":(?=[^:]+$)", perl=TRUE),
by = .(a, c)][V2 > -100]
data.table::setnames(res, 3:4, c("b", "s"))
res
}
when running this I get the following error:
Error in .subset(x, j) : invalid subscript type 'list'
One option would be to extract the characters that we need in the final output. We use str_extract to do that after grouping by 'a', 'c'. The output is a list, which we unlist, get the non-numeric and numeric into two columns and then subset the rows with the condition s<3.
library(stringr)
library(data.table)
input_tbl[, {
tmp <- unlist(str_extract_all(b, "[A-Za-z]+:[A-Za-z]+|\\d+"))
list(b=tmp[c(TRUE, FALSE)], s=tmp[c(FALSE, TRUE)])
}, by = .(a,c)][s<3]
# a c b s
#1: AA 1 ha:llo 1
#2: AA 1 wor:ld 2
Or, if we are using strsplit/tstrsplit, grouped by 'a', 'c', we remove the curly brackets and quotes ([{}"]) with gsub, split by , (strsplit), unlist the output, and then use tstrsplit to split by a : that is followed by a digit. The subset part is similar to the above.
res <- input_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '',
b), ',', fixed=TRUE)), ":(?=\\d)", perl=TRUE) ,.(a,c)][V2<3]
setnames(res, 3:4, c("b", "s"))
res
# a c b s
#1: AA 1 ha:llo 1
#2: AA 1 wor:ld 2
Update
For the updated dataset, we can do the tstrsplit on the last delimiter (:)
res1 <- input2_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '',
b),',', fixed=TRUE)), ":(?=[^:]+$)", perl=TRUE) ,
by = .(a, c)][V2 < 3]
setnames(res1, 3:4, c("b", "s"))
res1
# a c b s
# 1: AA 1 99:1d:3u:7y:89:67 1
# 2: AA 1 99:1D:34:YY:T6:Y6 2
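To illustrate what the pattern ":(?=[^:]+$)" does on its own, here is a small base R sketch. The input strings are made up for the illustration:

```r
# Hypothetical "key:value" strings whose keys themselves contain colons
pairs <- c("99:1d:3u:7y:89:67:1", "ha:llo:2")

# ":(?=[^:]+$)" matches a colon only if no further colon follows it before the end
parts <- strsplit(pairs, ":(?=[^:]+$)", perl = TRUE)

keys   <- vapply(parts, `[`, character(1), 1)
values <- as.numeric(vapply(parts, `[`, character(1), 2))
```

The split pieces are character strings, so converting the value part with as.numeric before a numeric comparison is the safer choice.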
Since it seems like you are working with a JSON object, why not use something that parses the JSON, for example, the "jsonlite" package?
With that, you can make a simple function, that looks like this:
myFun <- function(invec) {
require(jsonlite)
x <- fromJSON(invec)
list(b = names(x), s = unlist(x))
}
Now, applied to your dataset, you would get:
input_tbl[, myFun(b), by = .(a, c)]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
# 3: AA 1 doog:bye 3
And, for the subsetting:
input_tbl[, myFun(b), by = .(a, c)][s <= 2]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
You can probably also even rewrite the myFun function to add a "threshold" argument that lets you subset within the function itself.
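A possible sketch of that variant follows. The function name myFunThreshold and the default of Inf are assumptions for illustration, not part of the answer above:

```r
library(data.table)
library(jsonlite)

# Hypothetical variant of myFun with a `threshold` argument (name and default are assumptions)
myFunThreshold <- function(invec, threshold = Inf) {
  x <- fromJSON(invec)                 # named list: key -> value
  s <- unlist(x, use.names = FALSE)    # values as a plain vector
  keep <- s <= threshold               # entries at or below the threshold
  list(b = names(x)[keep], s = s[keep])
}

input_tbl <- data.table(a = "AA",
                        b = "{\"ha:llo\":1,\"wor:ld\":2,\"doog:bye\":3}",
                        c = 1)

out <- input_tbl[, myFunThreshold(b, threshold = 2), by = .(a, c)]
```

With the default threshold = Inf, nothing is filtered out, so the function behaves like the original myFun.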
I have a dataset of size 900,000 x 500, but the following toy dataset is enough for the question.
library(data.table)
df1 <- data.table(x = c(1,2,4,0), y = c(0,0,10,15), z = c(1,1,1,0))
I would like to do the following:
For columns y and z:
select the rows where the value is 0
replace those values with max + 1, where max is computed over the entire column
I am very new to data.table. Looking at examples of questions here at stackoverflow, I couldn't find a similar question, except this:
How to replace NA values in a table *for selected columns*? data.frame, data.table
My own attempt is as follows, but this does not work:
for (col in c("x", "y")) df1[(get(col)) == 0, (col) := max(col) + 1)
Obviously, I haven't gotten accustomed to data.table, so I'm banging my head against the wall at the moment...
If anybody could provide a dplyr solution in addition to data.table, I would be thankful.
We can use set and assign the rows where the value is 0 with the max of that column +1.
for(j in c("y", "z")){
set(df1, i= which(!df1[[j]]), j=j, value= max(df1[[j]])+1)
}
df1
# x y z
#1: 1 16 1
#2: 2 16 1
#3: 4 10 1
#4: 0 15 2
NOTE: The set method will be very efficient as the overhead of [.data.table is avoided
Or, less efficiently, we can specify the columns of interest in .SDcols, loop through them with lapply(.SD, ...), replace the values selected by the logical index, and assign (:=) the output back to the columns.
df1[, c('y', 'z') := lapply(.SD, function(x)
replace(x, !x, max(x)+1)), .SDcols= y:z]
The dplyr version is pretty simple (I think)
> library(dplyr)
# indented for clarity
> mutate(df1,
y= ifelse(y>0, y, max(y)+1),
z= ifelse(z>0, z, max(z)+1))
x y z
1 1 16 1
2 2 16 1
3 4 10 1
4 0 15 2
EDIT
As noted by David Arenburg in the comments, this is helpful for the toy example but not for the data mentioned, which has 500 columns. He suggests something similar to:
df1 %>% mutate_each(funs(ifelse(. > 0, ., max(.) + 1)), -1)
where -1 specifies all but the first column
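mutate_each() and funs() have since been deprecated in dplyr. A sketch of the same idea with across(), assuming dplyr 1.0.0 or later and a plain data.frame copy of the toy data:

```r
library(dplyr)

df1 <- data.frame(x = c(1, 2, 4, 0), y = c(0, 0, 10, 15), z = c(1, 1, 1, 0))

# across(-1, ...) applies the function to every column except the first
out <- df1 %>%
  mutate(across(-1, ~ ifelse(.x > 0, .x, max(.x) + 1)))
```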
As an alternative, ifelse(test, yes, no) might be useful, along these lines:
library(data.table)
dt <- data.table(x = c(1,2,4,0), y = c(0,0,10,15), z = c(1,1,1,0))
print(dt)
dt[, y := ifelse(!y, max(y) + 1, y)]
print(dt)