This is in data.table 1.9.4.
context
I'm wrapping up an ML training operation in a function call, and I want to get the levels of a column of a data.table that has been passed in. I noticed that this requires that the column argument be dequoted using get():
A minimal example to demonstrate failing approaches:
library(data.table)
test.table <- data.table(col1 = rep(c(0,1), times = 10), col2 = 1:20)
col.id <- "col1"
> str(test.table[,levels(col.id),with=FALSE])
Classes ‘data.table’ and 'data.frame': 0 obs. of 0 variables
- attr(*, ".internal.selfref")=<externalptr>
> str(test.table[,levels(factor(col.id)), with=FALSE])
Classes ‘data.table’ and 'data.frame': 20 obs. of 1 variable:
$ col1: num 0 1 0 1 0 1 0 1 0 1 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(test.table[,levels(as.factor(col.id)), with=FALSE])
Classes ‘data.table’ and 'data.frame': 20 obs. of 1 variable:
$ col1: num 0 1 0 1 0 1 0 1 0 1 ...
- attr(*, ".internal.selfref")=<externalptr>
> levels(test.table[,factor(col.id), with=FALSE])
NULL
> levels(test.table[,as.factor(col.id), with=FALSE])
NULL
And yet, test.table[,col.id, with = FALSE] is a valid way to access the column.
Here are some things that work:
> test.table[,levels(as.factor(get(col.id)))]
[1] "0" "1"
> test.table[,levels(factor(get(col.id)))]
[1] "0" "1"
> levels(test.table[,factor(get(col.id))])
[1] "0" "1"
Why is this? Is it intended?
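A short sketch of what appears to be happening (my reading, not an official statement of data.table's design): with with=FALSE, j is evaluated as an ordinary R expression and its value is used as a vector of column names, so levels() is applied to the string "col1" itself rather than to the column. The [[ alternative at the end is an addition that is not in the original post.

```r
library(data.table)

test.table <- data.table(col1 = rep(c(0, 1), times = 10), col2 = 1:20)
col.id <- "col1"

# levels() of a plain character string is NULL, so no column name is produced
levels(col.id)

# factor("col1") has the single level "col1", so this only yields the column name
levels(factor(col.id))

# get() looks the name up among the columns, so levels() sees the actual data
test.table[, levels(factor(get(col.id)))]

# [[ extracts the column by name without needing get()
levels(factor(test.table[[col.id]]))
```

Inside a wrapper function, the [[ form avoids any ambiguity between a column name and a variable of the same name in the calling scope.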
I have a data frame, dt, and I am trying to convert the class of each of its columns based on col_type.
See the example below for more detail.
> dt
id <- c(1,2,3,4)
a <- c(1,4,5,6)
b <- as.character(c(0,1,1,4))
c <- as.character(c(0,1,1,0))
d <- c(0,1,1,0)
dt <- data.frame(id,a,b,c,d, stringsAsFactors = FALSE)
> str(dt)
'data.frame': 4 obs. of 5 variables:
$ id: num 1 2 3 4
$ a : num 1 4 5 6
$ b : chr "0" "1" "1" "4"
$ c : chr "0" "1" "1" "0"
$ d : num 0 1 1 0
Now, I am trying to convert the class of each column based on the data frame below.
> var
var <- c("id","a","b","c","d")
type <- c("character","numeric","numeric","integer","character")
col_type <- data.frame(var,type, stringsAsFactors = FALSE)
> col_type
var type
1 id character
2 a numeric
3 b numeric
4 c integer
5 d character
I want to convert id to the class given for it in the col_type data frame, and likewise for all the other columns.
My Attempts:
setDT(dt)
for(i in 1:ncol(dt)){
if(colnames(dt)[i]%in%col_type$var){
a <- col_type[col_type$var==paste0(intersect(colnames(dt)[i],col_type$var)),]
dt[,col_type$var[i]:=eval(parse(text = paste0("as.",col_type$type[i],"(",col_type$var[i],")")))]
}
}
Note: my solution works, but it is really slow, and I am wondering whether I can do it more efficiently and cleanly.
Suggestions will be appreciated.
I would read the data in with the colClasses argument derived from the col_type table:
library(data.table)
library(magrittr)
setDT(col_type)
res = capture.output(fwrite(dt)) %>% paste(collapse="\n") %>%
fread(colClasses = col_type[, setNames(type, var)])
str(res)
Classes ‘data.table’ and 'data.frame': 4 obs. of 5 variables:
$ id: chr "1" "2" "3" "4"
$ a : num 1 4 5 6
$ b : num 0 1 1 4
$ c : int 0 1 1 0
$ d : chr "0" "1" "1" "0"
- attr(*, ".internal.selfref")=<externalptr>
If you can do this when the data is read in initially, it simplifies to...
res = fread("file.csv", colClasses = col_type[, setNames(type, var)])
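To make the one-liner above reproducible, here is a small sketch that first writes the example data to a temporary file. The temporary file stands in for "file.csv" and is an assumption for illustration only; base setNames() is used so that col_type can remain a plain data.frame.

```r
library(data.table)

dt <- data.frame(id = c(1, 2, 3, 4), a = c(1, 4, 5, 6),
                 b = c("0", "1", "1", "4"), c = c("0", "1", "1", "0"),
                 d = c(0, 1, 1, 0), stringsAsFactors = FALSE)
col_type <- data.frame(
  var  = c("id", "a", "b", "c", "d"),
  type = c("character", "numeric", "numeric", "integer", "character"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for "file.csv"
csv_path <- tempfile(fileext = ".csv")
fwrite(dt, csv_path)

# A named character vector maps each column name to its target class
res <- fread(csv_path, colClasses = setNames(col_type$type, col_type$var))
unlink(csv_path)
str(res)
```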
It's straightforward to do all of this without data.table.
If somehow the data is never read into R (received as RDS?), there's:
setDT(dt)
res = dt[, Map(as, .SD, col_type$type), .SDcols=col_type$var]
str(res)
Classes ‘data.table’ and 'data.frame': 4 obs. of 5 variables:
$ id: chr "1" "2" "3" "4"
$ a : num 1 4 5 6
$ b : num 0 1 1 4
$ c : int 0 1 1 0
$ d : chr "0" "1" "1" "0"
- attr(*, ".internal.selfref")=<externalptr>
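Since the question was about speed, another option worth trying is to convert the columns in place with data.table's set(), which avoids both eval(parse()) and copying the whole table. This is only a sketch under the assumption that every entry in col_type$type is a class name that as() understands; it is not taken from the answer above.

```r
library(data.table)

dt <- data.table(id = c(1, 2, 3, 4), a = c(1, 4, 5, 6),
                 b = c("0", "1", "1", "4"), c = c("0", "1", "1", "0"),
                 d = c(0, 1, 1, 0))
col_type <- data.frame(
  var  = c("id", "a", "b", "c", "d"),
  type = c("character", "numeric", "numeric", "integer", "character"),
  stringsAsFactors = FALSE
)

# Replace each column by reference with its converted version
for (k in seq_len(nrow(col_type))) {
  v <- col_type$var[k]
  set(dt, j = v, value = as(dt[[v]], col_type$type[k]))
}
str(dt)
```

Looking columns up by name (v) rather than by position means the rows of col_type do not have to be in the same order as the columns of dt.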
Consider base R's get() inside Map: get() retrieves a function from its name given as a string, here one of the as.* functions. Then bind the resulting list of vectors into a data frame.
vec_list <- Map(function(v, t) get(paste0("as.", t))(dt[[v]]), col_type$var, col_type$type)
dt_new <- data.frame(vec_list, stringsAsFactors = FALSE)
str(dt_new)
# 'data.frame': 4 obs. of 5 variables:
# $ id: chr "1" "2" "3" "4"
# $ a : num 1 4 5 6
# $ b : num 0 1 1 4
# $ c : int 0 1 1 0
# $ d : chr "0" "1" "1" "0"
Possibly wrap get() in tryCatch if conversions can potentially fail.
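A minimal sketch of the tryCatch idea, assuming you want to keep the original column when the conversion function cannot be found. safe_convert is a hypothetical helper name, and "nosuchtype" is a deliberately invalid type used to trigger the error path.

```r
dt <- data.frame(id = c(1, 2, 3, 4), a = c(1, 4, 5, 6),
                 b = c("0", "1", "1", "4"), stringsAsFactors = FALSE)
col_type <- data.frame(var  = c("id", "a", "b"),
                       type = c("character", "numeric", "nosuchtype"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: convert column v to type t, falling back to the
# unconverted column if as.<t> does not exist or the call raises an error
safe_convert <- function(v, t) {
  tryCatch(
    get(paste0("as.", t), mode = "function")(dt[[v]]),
    error = function(e) {
      message("Leaving column '", v, "' unchanged: ", conditionMessage(e))
      dt[[v]]
    }
  )
}

vec_list <- Map(safe_convert, col_type$var, col_type$type)
dt_new <- data.frame(vec_list, stringsAsFactors = FALSE)
str(dt_new)
```

Note that conversions such as as.numeric("abc") only raise a warning and return NA, so they would not reach the error handler; catching those would need a warning handler as well.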
If you have a data.frame with numeric columns, the conversion works without problems, as explained here.
library(zoo)
dtf=data.frame(matrix(rep(5,10),ncol=2))
#str(dtf)
dtfz <- zoo(dtf)
class(dtfz)
#[1] "zoo"
str(as.data.frame(dtfz))
#'data.frame': 5 obs. of 2 variables:
# $ X1: num 5 5 5 5 5
# $ X2: num 5 5 5 5 5
But if you have a data.frame with text columns, everything is converted to factors, even when setting stringsAsFactors = FALSE:
dtf=data.frame(matrix(rep("d",10),ncol=2),stringsAsFactors = FALSE)
#str(dtf)
dtfz <- zoo(dtf)
#class(dtfz)
#dtfz
All the following convert the strings to factors:
str(as.data.frame(dtfz))
str(as.data.frame(dtfz,stringsAsFactors = FALSE))
str(data.frame(dtfz))
str(data.frame(dtfz,stringsAsFactors = FALSE))
str(as.data.frame(dtfz, check.names=FALSE, row.names=NULL,stringsAsFactors = FALSE))
#'data.frame': 5 obs. of 2 variables:
# $ X1: Factor w/ 1 level "d": 1 1 1 1 1
# $ X2: Factor w/ 1 level "d": 1 1 1 1 1
How to avoid this behaviour when the data.frame has many text columns?
I found a solution based on a comment by @thelatemail. It works with the current version of zoo (Sept/2017). As @G. Grothendieck commented, future versions of zoo will respect the stringsAsFactors = FALSE argument.
str(base::as.data.frame(coredata(dtfz),stringsAsFactors = FALSE))
#'data.frame': 5 obs. of 2 variables:
# $ X1: chr "d" "d" "d" "d" ...
# $ X2: chr "d" "d" "d" "d" ...
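If the index of the zoo object should survive the conversion as well, one possible extension is to attach it as an ordinary column. This is a sketch rather than part of the original solution; res and the column name index are hypothetical names chosen for illustration.

```r
library(zoo)

dtf <- data.frame(matrix(rep("d", 10), ncol = 2), stringsAsFactors = FALSE)
dtfz <- zoo(dtf)

# coredata() returns the underlying character matrix without the index
res <- as.data.frame(coredata(dtfz), stringsAsFactors = FALSE)

# index() returns the ordering index, stored here as a regular column
res$index <- index(dtfz)
str(res)
```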
Suppose I have the following list:
a=list()
a[1]<-c("1")
a[2]<-c("3")
a[[1]][2]<-c("a")
a[[2]][2]<-c("b")
List of 2
$ : chr [1:2] "1" "a"
$ : chr [1:2] "3" "b"
[[1]]
[1] "1" "a"
[[2]]
[1] "3" "b"
How can I convert that list into a data frame like the one below?
This is what the result should look like:
table<-data.frame(col1=c("1a","3b"))
col1
1a
3b
'data.frame': 2 obs. of 1 variable:
$ col1: Factor w/ 2 levels "1a","3b": 1 2
You can use do.call as well; however, I do feel the answer in the comment is better than this:
df <- setNames(data.frame(do.call("paste0",data.frame(do.call("rbind",a)))),"col1")
You can always read about do.call from documentation:
do.call constructs and executes a function call from a name or a
function and a list of arguments to be passed to it.
Output:
> df
col1
1 1a
2 3b
> str(df)
'data.frame': 2 obs. of 1 variable:
$ col1: Factor w/ 2 levels "1a","3b": 1 2
Try this out and let me know in case of any queries.
a=list()
a[1]<-c("1")
a[2]<-c("3")
a[[1]][2]<-c("a")
a[[2]][2]<-c("b")
b <- t(data.frame(a))
data.frame(col1=paste0(b[,1],b[,2]))
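For comparison, a shorter route is to collapse each list element directly with paste0(collapse = ""). This is a sketch, not necessarily what the comment mentioned above suggested; res is a hypothetical name, used to avoid masking base R's table().

```r
a <- list(c("1", "a"), c("3", "b"))

# Collapse each character vector into one string; vapply() guarantees
# exactly one string per list element
res <- data.frame(col1 = vapply(a, paste0, character(1), collapse = ""),
                  stringsAsFactors = FALSE)
str(res)
```

Set stringsAsFactors = TRUE instead if the factor column shown in the question is what you need.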
I have a dataframe. I want to inspect the class of each column.
x1 = rep(1:4, times=5)
x2 = factor(rep(letters[1:4], times=5))
xdat = data.frame(x1, x2)
> class(xdat)
[1] "data.frame"
> class(xdat$x1)
[1] "integer"
> class(xdat$x2)
[1] "factor"
However, suppose I have many columns and want to use apply() to check them all at once. It does not work:
apply(xdat, 2, class)
x1 x2
"character" "character"
Why can't I use apply() to see the data type of each column? What should I do instead?
Thanks!
You could use
sapply(xdat, class)
# x1 x2
# "integer" "factor"
apply() first coerces the data frame to a matrix, and a matrix can hold only a single type. If any column is non-numeric (character or factor), every column becomes 'character'. To see this, check
str(apply(xdat, 2, I))
#chr [1:20, 1:2] "1" "2" "3" "4" "1" "2" "3" "4" "1" ...
#- attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:2] "x1" "x2"
Now, if we check
str(lapply(xdat, I))
#List of 2
#$ x1:Class 'AsIs' int [1:20] 1 2 3 4 1 2 3 4 1 2 ...
#$ x2: Factor w/ 4 levels "a","b","c","d": 1 2 3 4 1 2 3 4 1 2 ...
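One caveat with sapply(): when a column has more than one class, the result is a list rather than a character vector. The sketch below adds a hypothetical date-time column x3 to show this, and uses vapply() to keep only the first class of each column.

```r
xdat <- data.frame(
  x1 = rep(1:4, times = 5),
  x2 = factor(rep(letters[1:4], times = 5)),
  x3 = as.POSIXct("2020-01-01", tz = "UTC") + 0:19  # hypothetical extra column
)

# x3 has two classes ("POSIXct" and "POSIXt"), so sapply() cannot simplify
all_classes <- sapply(xdat, class)
is.list(all_classes)

# vapply() with character(1) always returns a named character vector
first_class <- vapply(xdat, function(col) class(col)[1], character(1))
first_class
```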
I've got a list with different types in it. They are arranged in matrix form:
tmp <- list('a', 1, 'b', 2, 'c', 3)
dim(tmp) <- c(2,3)
tmp
[,1] [,2] [,3]
[1,] "a" "b" "c"
[2,] 1 2 3
That's the form I get it out of another more complex function.
Now I want to transpose it and convert to a data.frame. So I do the following:
data <- as.data.frame(t(tmp))
data
V1 V2
1 a 1
2 b 2
3 c 3
This looks great. But it's got the wrong structure:
str(data)
'data.frame': 3 obs. of 2 variables:
$ V1:List of 3
..$ : chr "a"
..$ : chr "b"
..$ : chr "c"
$ V2:List of 3
..$ : num 1
..$ : num 2
..$ : num 3
So how do I get rid of the extra level of lists?
This should do the trick:
df <- data.frame(lapply(data.frame(t(tmp)), unlist), stringsAsFactors=FALSE)
str(df)
# 'data.frame': 3 obs. of 2 variables:
# $ X1: chr "a" "b" "c"
# $ X2: num 1 2 3
The inner data.frame() call converts the matrix into a two-column data.frame, with one "character" column and one "numeric" column.**
lapply(..., unlist) strips away the extra list() layer.
The outer data.frame() call converts the resulting list into the data.frame you're after.
** (OK, that intermediate "character" column is really of class "factor", but it ends up making no difference in the final result. If you like, you could force it to have class "character" by adding stringsAsFactors=FALSE to the inner data.frame() call as well, but I don't think neglecting to do so would ever make a difference...)
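An alternative sketch that skips the transpose entirely: since each row of tmp holds what should become one column, you can unlist the rows one at a time. The names cols, V1 and V2 are chosen here for illustration.

```r
tmp <- list('a', 1, 'b', 2, 'c', 3)
dim(tmp) <- c(2, 3)

# Each row of the list-matrix becomes one atomic vector
cols <- lapply(seq_len(nrow(tmp)), function(i) unlist(tmp[i, ]))
names(cols) <- c("V1", "V2")

df <- data.frame(cols, stringsAsFactors = FALSE)
str(df)
```

Because every row is unlisted separately, the character row and the numeric row keep their own types.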
Or this :
as.data.frame(matrix(unlist(tmp),ncol=2,byrow=TRUE))
You can inspect the result (note that unlist() coerces every element to character, so the numeric column loses its type):
str(as.data.frame(matrix(unlist(tmp),ncol=2,byrow=TRUE)))
'data.frame': 3 obs. of 2 variables:
$ V1: Factor w/ 3 levels "a","b","c": 1 2 3
$ V2: Factor w/ 3 levels "1","2","3": 1 2 3