Elegant way to get the colclasses of a data.frame - r

I currently use the following function to list the classes of a data.frame:
sapply(names(iris),function(x) class(iris[,x]))
There must be a more elegant way to do this...

Since data.frames are already lists, sapply(iris, class) will just work. sapply won't be able to simplify to a vector for classes that extend other classes, so you could do something to take the first class, paste the classes together, etc.

EDIT If you just want to LOOK at the classes, consider using str:
str(iris) # Show "summary" of data.frame or any other object
#'data.frame':   150 obs. of  5 variables:
# $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
But to expand on @JoshuaUlrish's excellent answer, a data.frame with time or ordered factor columns would cause pain with the sapply solution:
d <- data.frame(ID=1, time=Sys.time(), factor=ordered(42))
# This doesn't return a character vector anymore
sapply(d, class)
#$ID
#[1] "numeric"
#
#$time
#[1] "POSIXct" "POSIXt"
#
#$factor
#[1] "ordered" "factor"
# Alternative 1: Get the first class
sapply(d, function(x) class(x)[[1]])
# ID time factor
#"numeric" "POSIXct" "ordered"
# Alternative 2: Paste classes together
sapply(d, function(x) paste(class(x), collapse='/'))
# ID time factor
# "numeric" "POSIXct/POSIXt" "ordered/factor"
Note that none of these solutions are perfect. Getting only the first (or last) class can return something quite meaningless. Pasting makes using the compound class harder. Sometimes you might just want to detect when this happens, so an error would be preferable (and I love vapply ;-):
# Alternative 3: Fail if there are multiple-class columns
vapply(d, class, character(1))
#Error in vapply(d, class, character(1)) : values must be length 1,
# but FUN(X[[2]]) result is length 2
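If an error is too drastic and you only want to know which columns carry more than one class, something along these lines might do. This is just a sketch: the column name grade and the helper variables (class_list, multi, is_factor) are made up for the example and are not from the answer above.

```r
# Example data with a POSIXct column and an ordered factor column
d <- data.frame(ID = 1, time = Sys.time(), grade = ordered(42))

# Keep the full class vector of every column as a list (no simplification)
class_list <- lapply(d, class)

# Flag the columns whose class vector has more than one entry
multi <- lengths(class_list) > 1
names(d)[multi]

# If the real question is "is this column a factor of any kind?",
# inherits() avoids looking at the class vector by hand
is_factor <- vapply(d, inherits, logical(1), what = "factor")
```

Here multi flags time and grade, and is_factor is TRUE only for grade, since an ordered factor still inherits from "factor".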

Related

Remove dataframes from list of dataframes using loop

I want to drop elements from a list so that only the elements with a certain number of columns remain.
This is a dummy example of what I'm trying to do:
#1: define the list
tables = list(mtcars,iris)
for(k in 1:length(tables)) {
# 2: be sure that each element is shaped as dataframe and not matrix
tables[[k]] = as.data.frame(tables[[k]])
# 3: remove elements that have more or less than 5 columns
if(ncol(tables[[k]]) != 5) {
tables <- tables[-k]
}
}
Another option I tried:
#1: define the list
tables = list(mtcars,iris)
for(k in 1:length(tables)) {
# 2: be sure that each element is shaped as dataframe
tables[[k]] = as.data.frame(tables[[k]])
# 3: remove elements that have more or less than 5 columns
if(ncol(tables[[k]]) != 5) {
tables[[-k]] <- NULL
}
}
I'm getting
Error in tables[[k]] : subscript out of bounds.
Is there an alternative and correct approach?
We can use Filter
Filter(function(x) ncol(x)==5, tables)
Or with sapply to create a logical index and subset the list
tables[sapply(tables, ncol)==5]
Or as @Sotos commented
tables[lengths(tables)==5]
lengths returns the length of each list element; comparing that to 5 gives a logical vector, which is then used to subset the list. The length of a data.frame is the number of columns it has.
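As for why the loop in the question fails: on the first pass mtcars (11 columns) is removed, so the list shrinks to one element, but the loop still goes on to k = 2 and tables[[2]] no longer exists. Computing the column counts once and subsetting in a single step avoids that. A small sketch (n_cols and kept are names I made up for the example):

```r
tables <- list(mtcars, iris)

# Column count of every element, computed before anything is removed
n_cols <- vapply(tables, ncol, integer(1))

# Subset once, so no index shifts while iterating
kept <- tables[n_cols == 5]

# lengths() gives the same counts here because every element is a data.frame
same_counts <- identical(lengths(tables), n_cols)
```

Note that lengths only matches ncol for data.frames. For a matrix, length is the number of cells, so ncol is the safer choice if the list might contain matrices.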
For a tidyverse option, you can use purrr::keep. You define a predicate function: if it returns TRUE the list element is kept, and if it returns FALSE the element is removed. Here I've done that with the formula shorthand.
library(purrr)
tables <- list(mtcars, iris)
result <- purrr::keep(tables, ~ ncol(.x) == 5)
str(result)
#> List of 1
#> $ :'data.frame': 150 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

using eval, parse and as.character in data.table in R

I am having some trouble running a combination of eval, parse and as.character for a data.table. I basically want to convert a given column of the data table to as.character output of the same column.
library(data.table)
options(datatable.WhenJisSymbolThenCallingScope=TRUE)
# an option that I heard may solve the problem
iris2 <- data.table(iris)
VARS <- colnames(iris)
j <- 1
iris2[,eval(parse(text = paste0(VARS[j])))] # this works fine
iris2[,eval(parse(text = paste0(VARS[j]))) := as.character(eval(parse(text = paste0(VARS[j]))))]
#but this fails
From the looks of it, it appears the eval and parse functions work fine but when it comes to updating the column with := it seems to break. Could someone tell me what the issue is?
We can use data.table's own methods to transform the variables. Specify 'VARS', or a subset such as 'VARS[j]', in .SDcols, apply the function over the columns of .SD with lapply (this also works for several columns at once), and assign (:=) the result to the columns named in 'VARS[j]':
iris2[, VARS[j] := lapply(.SD, as.character) , .SDcols = VARS[j]]
str(iris2)
#Classes ‘data.table’ and 'data.frame': 150 obs. of 5 variables:
#$ Sepal.Length: chr "5.1" "4.9" "4.7" "4.6" ...
#$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1
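The key point is that the left-hand side of := should give column names as a character vector, not column values, so eval(parse(...)) is not needed there. Two other spellings that should also work, assuming a reasonably recent data.table (a sketch only; using VARS[2] in the second call is just my choice for the example):

```r
library(data.table)

iris2 <- as.data.table(iris)
VARS <- colnames(iris)
j <- 1

# Parentheses tell data.table to evaluate VARS[j] and use the result as the
# column name; get() looks the column up by name on the right-hand side
iris2[, (VARS[j]) := as.character(get(VARS[j]))]

# set() does the same assignment by reference without going through `[`
set(iris2, j = VARS[2], value = as.character(iris2[[VARS[2]]]))

col_types <- vapply(iris2, function(x) class(x)[1], character(1))
```

After this the first two columns are character and the rest are unchanged.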

Data frame changes from numeric to character

I open my csv file and check the class of each column:
mydataP<-read.csv("Energy_protein2.csv", stringsAsFactors=F)
apply(mydataP, 2, function(i) class(i))
#[1] "numeric"
I add a column and check the class of the data:
mydataP[ ,"ID"] <-rep(c("KOH1", "KOH2", "KOH3", "KON1", "KON2", "KON3", "WTH1", "WTH2", "WTH3","WTN1", "WTN2", "WTN3"), each=2)
apply(mydataP, 2, function(i) class(i))
Here it changes to "character"
as.numeric(as.factor(mydataP))
#Error in sort.list(y) : 'x' must be atomic for 'sort.list'
#Have you called 'sort' on a list?
as.numeric(as.character(mydataP))
I get a vector with 117 NA
I have no idea what to do now. As soon as I touch the frame it changes to character. Can somebody help me? Thanks
That happens because apply converts your data.frame to a matrix, and a matrix can hold only one type of data. Once a character column is present, every column is coerced to character.
Try this instead:
sapply(mydataP, class)
This is the reason you should normally try to avoid using apply on data.frames.
This behavior is documented in the help file (?apply):
If X is not an array but an object of a class with a non-null dim
value (such as a data frame), apply attempts to coerce it to an array
via as.matrix if it is two-dimensional (e.g., a data frame) or via
as.array.
Here's a reproducible example with the built-in iris dataset:
> apply(iris, 2, function(i) class(i))
#Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# "character" "character" "character" "character" "character"
> sapply(iris, class)
#Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# "numeric" "numeric" "numeric" "numeric" "factor"
> str(iris)
#'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
As you can see, apply converts all columns to the same class.
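You can see the coercion directly by calling as.matrix yourself, which is what apply does internally according to the help text quoted above. A quick sketch with iris (m, m_num and col_classes are just names for the example):

```r
# With the factor column included, the whole matrix becomes character
m <- as.matrix(iris)
typeof(m)

# With only the four numeric columns, the matrix stays numeric
m_num <- as.matrix(iris[1:4])
typeof(m_num)

# Iterating over the data.frame as a list keeps each column's own class
col_classes <- vapply(iris, function(x) class(x)[1], character(1))
```

So the data itself is not changed by adding a character column. Only the matrix that apply builds from it is character, and the classes reported by sapply or vapply are the real ones.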

Test whether data is numeric or Factor/Ordinal

I'm working with a large dataset and want to get some basic information about my variables, first of all whether they are numeric or factor/ordinal.
I'm writing a function that checks, one variable at a time, whether it is numeric or a factor.
To make the for loop work I'm using dataset[i] to get to the variable I want.
object<-function(dataset){
n=ncol(dataset)
for(i in 1:n){
variable_name<-names(dataset[i])
factor<-is.factor(dataset[i])
ordered<-is.ordered(dataset[i])
numeric<-is.numeric(dataset[i])
print(list(variable_name,factor,ordered,numeric))
}
}
My problem is that is.numeric() does not seem to work with dataset[i]: all the results become "FALSE". It only works with dataset$.
Do you have any idea how to solve this?
Try str(dataset) to get summary information on an object, but to solve your problem you need to completely extract your data with double square brackets. Single square bracket subsetting keeps the output as a sub-list (or data.frame) rather than extracting the contents:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
is.numeric(iris[1])
[1] FALSE
class(iris[1])
[1] "data.frame"
is.numeric(iris[[1]])
[1] TRUE
Assuming that dataset is something like a data.frame, you can do the following (and avoid the loop):
names = colnames(dataset) # sapply(dataset, names) would look for names inside each column
types = sapply(dataset, class)
Then types gives you either numeric or factor. You can then simply do something like this:
is_factor = types == 'factor'
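Putting both answers together, the function from the question could be rewritten without a loop roughly like this. It is only a sketch: describe_columns and the example data.frame d are hypothetical names of mine, and it returns a data.frame instead of printing lists.

```r
# One row per column, with the three checks from the question
describe_columns <- function(dataset) {
  data.frame(
    variable   = names(dataset),
    is_factor  = vapply(dataset, is.factor,  logical(1)),
    is_ordered = vapply(dataset, is.ordered, logical(1)),
    is_numeric = vapply(dataset, is.numeric, logical(1)),
    row.names  = NULL
  )
}

# Small example with one numeric, one factor and one ordered factor column
d <- data.frame(
  x = 1:3,
  f = factor(c("a", "b", "a")),
  o = ordered(c("lo", "hi", "lo"))
)
info <- describe_columns(d)
```

vapply hands each column to the predicate as a vector, which is the same as using dataset[[i]] in a loop, so is.numeric sees the contents rather than a one-column data.frame.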

Select specific columns by string label in R frame

I want to exclude the column labeled "fldname" from a frame frm in R. If we know the index i of the column, we can use frm[-i] to exclude the ith column. Is there a simple way to do the same by specifying the label string, or a list of label strings, of the columns I want to exclude?
I worked out a solution (corrected by Fhnuzoag):
frm[names (frm)[names (frm) != c("fldname1","fldname2")]] # first attempt: != recycles the shorter vector, so this is unreliable
frm[names (frm)[!names (frm) %in% c("fldname1","fldname2")]] # corrected version
That is, get the list of wanted strings and use them as the index. Above, "fldname1" and "fldname2" are the unwanted fields.
Is there a simpler solution built into the language syntax?
Yes, use a combination of negation ! and %in%. For example, using iris:
x <- iris[, !names(iris) %in% c("Sepal.Width", "Sepal.Length")]
str(x)
'data.frame': 150 obs. of 3 variables:
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
I think, no. Usually I do frm[, setdiff(names(frm), excludelist)].
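To see that the two approaches agree, and why != should not be used for this, here is a short sketch with iris (drop, kept_in, kept_setdiff and not_equal are names chosen for the example):

```r
drop <- c("Sepal.Width", "Sepal.Length")

# Negated %in% and setdiff() pick the same column names
kept_in      <- names(iris)[!names(iris) %in% drop]
kept_setdiff <- setdiff(names(iris), drop)
x <- iris[kept_setdiff]

# != compares element by element and recycles `drop`, so with this ordering
# every comparison is TRUE and no column would be excluded at all
not_equal <- suppressWarnings(names(iris) != drop)
```

setdiff also removes duplicates, which does not matter for ordinary data.frames because column names are normally unique.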
