I have a large dataset and want some basic information about my variables, first of all whether each one is numeric or a factor/ordinal.
I'm writing a function that checks, one variable at a time, whether it is numeric or a factor.
To make the for loop work I'm using dataset[i] to get at the variable I want.
object<-function(dataset){
n=ncol(dataset)
for(i in 1:n){
variable_name<-names(dataset[i])
factor<-is.factor(dataset[i])
ordered<-is.ordered(dataset[i])
numeric<-is.numeric(dataset[i])
print(list(variable_name,factor,ordered,numeric))
}
}
My problem is that is.numeric() does not seem to work with dataset[i]: every result is FALSE. It only works with dataset$.
Do you have any idea how to solve this?
Try str(dataset) to get summary information on an object, but to solve your problem you need to completely extract your data with double square brackets. Single square bracket subsetting keeps the output as a sub-list (or data.frame) rather than extracting the contents:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
is.numeric(iris[1])
[1] FALSE
class(iris[1])
[1] "data.frame"
is.numeric(iris[[1]])
[1] TRUE
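Applied to the function in the question, the fix might look like the sketch below. The name describe_columns and the choice to return a data frame instead of printing are my own assumptions, not part of the original code.

```r
# Sketch: report, for each column, whether it is a factor, ordered, or numeric.
# `describe_columns` is a hypothetical name chosen for this example.
describe_columns <- function(dataset) {
  n <- length(dataset)
  out <- data.frame(
    variable = names(dataset),
    factor   = logical(n),
    ordered  = logical(n),
    numeric  = logical(n),
    stringsAsFactors = FALSE
  )
  for (i in seq_along(dataset)) {
    column <- dataset[[i]]          # [[ ]] extracts the column vector itself
    out$factor[i]  <- is.factor(column)
    out$ordered[i] <- is.ordered(column)
    out$numeric[i] <- is.numeric(column)
  }
  out
}

describe_columns(iris)
```

Using seq_along(dataset) instead of 1:ncol(dataset) also avoids a surprise when the data frame has no columns.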
Assuming that dataset is something like a data.frame, you can do the following (and avoid the loop):
names = colnames(dataset) # column names of the data.frame
types = sapply(dataset, class)
Then types gives you either numeric or factor. You can then simply do something like this:
is_factor = types == 'factor'
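One caveat: if any column has more than one class (an ordered factor has class c("ordered", "factor")), sapply returns a list rather than a character vector, and comparing it against 'factor' is no longer straightforward. A sketch that sidesteps this by asking each question directly, using iris as a stand-in for your data:

```r
# Sketch: one logical vector per question, named by column.
dataset <- iris                                   # stand-in for your data
is_num <- vapply(dataset, is.numeric, logical(1))
is_fac <- vapply(dataset, is.factor,  logical(1))

names(dataset)[is_num]   # names of the numeric columns
names(dataset)[is_fac]   # names of the factor columns
```

vapply is used here because it guarantees a logical vector with one entry per column.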
In R, is there a way to list the other possible classes a variable can have?
For example, the iris data has num and Factor columns.
Irrespective of any particular dataset, can we see all the classes a variable can take?
>str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
It sounds as though you want to know whether there is a way to enumerate all the different available classes in R.
We have to be clear in our nomenclature here. In R, a "class" is simply an attribute assigned to an object. This attribute can be read to decide what methods are available to use on the object. This is further complicated by there being three distinct object-oriented systems available in base R, all of which allow us to define our own classes.
At its simplest, we can use the S3 system to arbitrarily define classes like this:
class(iris$Species) <- "NewClass"
Giving us
class(iris$Species)
#> [1] "NewClass"
and
str(iris)
#> 'data.frame': 150 obs. of 5 variables:
#> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species : 'NewClass' int 1 1 1 1 1 1 1 1 1 1 ...
#> ..- attr(*, "levels")= chr [1:3] "setosa" "versicolor" "virginica"
Since you (and package writers) are free to create new classes and their associated methods, the number of available classes is essentially infinite and cannot be enumerated. For example, we can change the print method for "NewClass" to something pretty useless:
print.NewClass <- function(x, ...) print("Have a nice day!")
So now when we try to look at Species in the console we get:
iris$Species
[1] "Have a nice day!"
However, under the hood, there is a different concept called type, and there are only a finite number of these types available in R, determined by their SEXPTYPE representation in the underlying C code.
data("iris")
typeof(iris$Species)
#> [1] "integer"
The currently available types in R are "logical", "integer", "double", "complex", "character", "raw", "list", "NULL", "closure", "special", "builtin", "environment", "S4", "symbol", "pairlist", "promise", "language", "char", "...", "any", "expression", "externalptr", "bytecode" and "weakref", as listed in the help file for ?typeof. Some of these are never used directly by end users.
You will note that “factor” is not a basic type at all, but actually a class.
There is also the concept of the "mode" or "storage mode" of an object, but this is effectively determined by its type.
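To see how class, type, mode and storage mode relate, here is a small sketch that tabulates them for a few objects. The objects chosen are arbitrary examples of mine.

```r
# Compare class, type, mode and storage mode for a few objects.
f <- factor(c("a", "b", "a"))
objs <- list(factor = f, double = 1.5, integer = 2L, fun = mean)

info <- data.frame(
  class        = vapply(objs, function(x) class(x)[1], character(1)),
  typeof       = vapply(objs, typeof, character(1)),
  mode         = vapply(objs, mode, character(1)),
  storage.mode = vapply(objs, storage.mode, character(1))
)
info
```

The factor row is the interesting one: its class is "factor" while its type is "integer", which is why typeof(iris$Species) reports "integer".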
My guess is that you are looking for a particular data type to suit some requirement, and wanted to know if there are any built in classes in R that fit the bill. The answer is that it is very easy to create your own class if you need to. It is unlikely that you will need to think about types or modes at a beginner or intermediate level, and I'm guessing there are many advanced users out there who are a whizz at writing and maintaining classes, but know (or care) little about the underlying types and modes.
I'm trying to use leveneTest from the "car" library in R with the iris dataset.
The code I have is:
library(tidyverse)
library(car)
iris %>% group_by (Species) %>% leveneTest( Sepal.Length )
From there I'm getting the following error:
Error in leveneTest.default(., Sepal.Length) :
. is not a numeric variable
I don't know how to fix this, since the data types seem to be of the right type:
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
For levene test, you need to specify a grouping factor, for example:
leveneTest(Sepal.Length ~ Species,data=iris)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 6.3527 0.002259 **
147
This tests whether the variances are homogeneous across groups. It doesn't quite make sense to group the data and then run leveneTest within each group. If you intend to do something else, you can elaborate or comment.
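As the error message shows, the pipe hands the whole grouped data frame to leveneTest as its first argument, where a numeric vector or a formula is expected. For intuition about what the test computes: with center = median it is equivalent to a one-way ANOVA on the absolute deviations from each group's median. The base-R sketch below relies on that equivalence; levene_by_hand is a hypothetical helper of mine, not part of the car package.

```r
# Sketch: Levene's test (median-centred) as an ANOVA on absolute deviations.
# `levene_by_hand` is a hypothetical helper, not part of the car package.
levene_by_hand <- function(y, group) {
  group <- as.factor(group)
  dev <- abs(y - ave(y, group, FUN = median))  # |y - group median|
  anova(lm(dev ~ group))
}

res <- levene_by_hand(iris$Sepal.Length, iris$Species)
res
```

The resulting table should agree with the leveneTest output above (2 and 147 degrees of freedom), which is a useful sanity check if the package is not available.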
Try it this way:
with(iris, leveneTest(Sepal.Length, Species))
Or maybe you are looking for a solution like this, which runs the test for each of the four numeric columns:
map(iris[, 1:4], ~ leveneTest(.x, iris$Species))
I want to reduce a list to only those elements that have a certain number of columns.
This is a dummy example of what I'm trying to do:
#1: define the list
tables = list(mtcars,iris)
for(k in 1:length(tables)) {
# 2: be sure that each element is shaped as dataframe and not matrix
tables[[k]] = as.data.frame(tables[[k]])
# 3: remove elements that have more or less than 5 columns
if(ncol(tables[[k]]) != 5) {
tables <- tables[-k]
}
}
another option I tried:
#1: define the list
tables = list(mtcars,iris)
for(k in 1:length(tables)) {
# 2: be sure that each element is shaped as dataframe
tables[[k]] = as.data.frame(tables[[k]])
# 3: remove elements that have more or less than 5 columns
if(ncol(tables[[k]]) != 5) {
tables[[-k]] <- NULL
}
}
I'm getting
Error in tables[[k]] : subscript out of bounds.
Is there an alternative and correct approach?
We can use Filter
Filter(function(x) ncol(x)==5, tables)
Or with sapply to create a logical index and subset the list
tables[sapply(tables, ncol)==5]
Or, as @Sotos commented,
tables[lengths(tables)==5]
lengths returns the length of each list element; comparing it with 5 gives a logical vector, which is then used to subset the list. The length of a data.frame is the number of columns it has.
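As for why the loops in the question fail: once mtcars is removed at k = 1, the list has only one element left, so tables[[2]] no longer exists on the next iteration. If you do want to keep an explicit loop, one sketch is to iterate backwards so that deletions never shift the elements still to be visited:

```r
tables <- list(mtcars, iris)

# Iterate from the last element to the first so deletions do not
# shift the positions of elements still to be visited.
for (k in rev(seq_along(tables))) {
  if (ncol(tables[[k]]) != 5) {
    tables[[k]] <- NULL   # assigning NULL removes the k-th element
  }
}

length(tables)
```

The vectorised versions above are still shorter and harder to get wrong.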
For a tidyverse option you can use purrr::keep for this. You define a predicate function; where it returns TRUE the list element is kept, and where it returns FALSE the element is removed. Here I've written the predicate with the formula shorthand.
library(purrr)
tables <- list(mtcars, iris)
result <- purrr::keep(tables, ~ ncol(.x) == 5)
str(result)
#> List of 1
#> $ :'data.frame': 150 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
I want to exclude the column labeled "fldname" from a data frame frm in R. If we know the index i of the column, we can use frm[-i] to exclude the ith column. Is there a simple way to do the same by specifying the column label, or a list of labels, that I want to exclude?
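I worked out a solution (the second line is the version corrected by Fhnuzoag; the first, using !=, compares element-wise with recycling rather than testing membership):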
frm[names (frm)[names (frm) != c("fldname1","fldname2")]]
frm[names (frm)[!names (frm) %in% c("fldname1","fldname2")]]
That is, get the list of wanted names and use them as the index. Above, "fldname1" and "fldname2" are the unwanted fields.
Is there a simpler solution built into the language syntax?
Yes, use a combination of negation ! and %in%. For example, using iris:
x <- iris[, !names(iris) %in% c("Sepal.Width", "Sepal.Length")]
str(x)
'data.frame': 150 obs. of 3 variables:
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
I don't think so. Usually I do frm[, setdiff(names(frm), excludelist)], where excludelist is a character vector of the names to drop.
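The setdiff idea can be wrapped in a small helper. This is only a sketch: drop_cols is a hypothetical name, and drop = FALSE is my addition so the result stays a data frame even when a single column remains.

```r
# `drop_cols` is a hypothetical helper; names not present in `frm` are ignored.
drop_cols <- function(frm, exclude) {
  frm[, setdiff(names(frm), exclude), drop = FALSE]
}

x <- drop_cols(iris, c("Sepal.Width", "Sepal.Length"))
names(x)

# Base R also offers subset() with unquoted names for interactive use:
y <- subset(iris, select = -c(Sepal.Width, Sepal.Length))
```

Both calls leave the original data frame untouched and return a new one.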
I currently use the following function to list the classes of a data.frame:
sapply(names(iris),function(x) class(iris[,x]))
There must be a more elegant way to do this...
Since data.frames are already lists, sapply(iris, class) will just work. sapply won't be able to simplify to a vector for classes that extend other classes, so you could do something to take the first class, paste the classes together, etc.
EDIT If you just want to LOOK at the classes, consider using str:
str(iris) # Show "summary" of data.frame or any other object
#'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
But to expand on @JoshuaUlrish's excellent answer, a data.frame with time or ordered factor columns would cause pain with the sapply solution:
d <- data.frame(ID=1, time=Sys.time(), factor=ordered(42))
# This doesn't return a character vector anymore
sapply(d, class)
#$ID
#[1] "numeric"
#
#$time
#[1] "POSIXct" "POSIXt"
#
#$factor
#[1] "ordered" "factor"
# Alternative 1: Get the first class
sapply(d, function(x) class(x)[[1]])
# ID time factor
#"numeric" "POSIXct" "ordered"
# Alternative 2: Paste classes together
sapply(d, function(x) paste(class(x), collapse='/'))
# ID time factor
# "numeric" "POSIXct/POSIXt" "ordered/factor"
Note that none of these solutions are perfect. Getting only the first (or last) class can return something quite meaningless. Pasting makes using the compound class harder. Sometimes you might just want to detect when this happens, so an error would be preferable (and I love vapply ;-):
# Alternative 3: Fail if there are multiple-class columns
vapply(d, class, character(1))
#Error in vapply(d, class, character(1)) : values must be length 1,
# but FUN(X[[2]]) result is length 2
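If the real goal is to test whether each column belongs to a given class, rather than to list every class, inherits() may be a simpler route because it looks at the whole class vector. A sketch using the same example data frame:

```r
d <- data.frame(ID = 1, time = Sys.time(), factor = ordered(42))

# inherits() checks the whole class vector, so multi-class columns are fine.
is_fac  <- vapply(d, inherits, logical(1), what = "factor")
is_time <- vapply(d, inherits, logical(1), what = "POSIXct")

is_fac
is_time
```

Here the ordered column counts as a factor and the time column as POSIXct, with no need to pick or paste class names.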