Is there a way to list the other possible classes of a variable? - r

In R, is there a way to list the other possible classes of a variable?
For example, the iris data has num and Factor.
So, irrespective of the data set, can we see all the classes a variable can take?
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

It sounds as though you want to know whether there is a way to enumerate all the different available classes in R.
We have to be clear in our nomenclature here. In R, a "class" is simply an attribute assigned to an object. This attribute can be read to decide what methods are available to use on the object. This is further complicated by there being three distinct object-oriented systems available in base R, all of which allow us to define our own classes.
At its simplest, we can use the S3 system to arbitrarily define classes like this:
class(iris$Species) <- "NewClass"
Giving us
class(iris$Species)
#> [1] "NewClass"
and
str(iris)
#> 'data.frame': 150 obs. of 5 variables:
#> $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species : 'NewClass' int 1 1 1 1 1 1 1 1 1 1 ...
#> ..- attr(*, "levels")= chr [1:3] "setosa" "versicolor" "virginica"
Since you (and package writers) are free to create new classes and their associated methods, the number of available classes is essentially infinite and cannot be enumerated. For example, we can change the print method for "NewClass" to something pretty useless:
print.NewClass <- function(x, ...) print("Have a nice day!")
So now when we try to look at Species in the console we get:
iris$Species
#> [1] "Have a nice day!"
However, under the hood, there is a different concept called type, and there are only a finite number of these types available in R, determined by their SEXPTYPE representation in the underlying C code.
data("iris")
typeof(iris$Species)
#> [1] "integer"
The currently available types in R are "logical", "integer", "double", "complex", "character", "raw", "list", "NULL", "closure", "special", "builtin", "environment", "S4", "symbol", "pairlist", "promise", "language", "char", "...", "any", "expression", "externalptr", "bytecode" and "weakref", as listed in the help file for ?typeof. Some of these are never used directly by end users.
You will note that “factor” is not a basic type at all, but actually a class.
There is also the concept of the "mode" or “storage mode” of an object, but this is effectively determined by its type.
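To make the difference between class, type and mode concrete, here is a small sketch in base R. The element names in objs are only illustrative labels, and datasets::iris is used so that any earlier change to a local copy of iris does not affect the result.

```r
# A few built-in objects to inspect (the names are illustrative only).
objs <- list(
  numeric_col = datasets::iris$Sepal.Length,
  factor_col  = datasets::iris$Species,
  a_function  = mean,
  a_list      = list(1, "a")
)

# class() reports the class attribute (or an implicit class),
# typeof() the internal type, and mode() a coarser grouping of types.
overview <- data.frame(
  class  = sapply(objs, function(x) class(x)[1]),
  typeof = sapply(objs, typeof),
  mode   = sapply(objs, mode)
)
print(overview)
```

The factor column is the interesting row: its class is "factor", but its type is "integer" and its mode is "numeric".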
My guess is that you are looking for a particular data type to suit some requirement, and wanted to know if there are any built in classes in R that fit the bill. The answer is that it is very easy to create your own class if you need to. It is unlikely that you will need to think about types or modes at a beginner or intermediate level, and I'm guessing there are many advanced users out there who are a whizz at writing and maintaining classes, but know (or care) little about the underlying types and modes.

Related

Error using Levene test in R after a group by [error: is not a numeric variable]

I'm trying to use leveneTest from the "car" library in R with the iris dataset.
The code I have is:
library(tidyverse)
library(car)
iris %>% group_by (Species) %>% leveneTest( Sepal.Length )
From there I'm getting the following error:
Error in leveneTest.default(., Sepal.Length) :
. is not a numeric variable
I don't know how to fix this, since the columns seem to be of the right type:
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
For Levene's test, you need to specify a grouping factor, for example:
leveneTest(Sepal.Length ~ Species,data=iris)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value   Pr(>F)
group   2  6.3527 0.002259 **
      147
This tests whether the variances are homogeneous across groups. It doesn't make sense to group the data first and then run leveneTest within each group, because the test compares the groups with each other. If you intend to do something else, please elaborate or comment.
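To show why the grouping factor is essential, here is a rough base-R sketch of what the test computes. It assumes the median-centred variant shown in the output above, which amounts to a one-way ANOVA on the absolute deviations from each group's median. The variable names are illustrative, and this is not meant as a replacement for car::leveneTest.

```r
iris_df <- datasets::iris

# Median of Sepal.Length within each species, repeated for every row.
group_median <- ave(iris_df$Sepal.Length, iris_df$Species, FUN = median)

# Absolute deviation of each observation from its own group's median.
abs_dev <- abs(iris_df$Sepal.Length - group_median)

# One-way ANOVA on the deviations, with Species as the grouping factor.
fit <- anova(lm(abs_dev ~ iris_df$Species))
f_value <- fit[1, "F value"]
p_value <- fit[1, "Pr(>F)"]
print(fit)
```

Under that assumption, the F value and degrees of freedom should agree with the leveneTest output above. Without a grouping factor there is nothing for the ANOVA to compare.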
Try doing it this way:
with(iris, leveneTest(Sepal.Length, Species))
Or maybe you are looking for a solution like this:
map(iris[, 1:4], ~ leveneTest(.x, iris$Species))

Remove dataframes from list of dataframes using loop

I want to remove parts from a list to reduce the list to the elements of it that have a certain number of columns.
This is a dummy example of what I'm trying to do:
#1: define the list
tables = list(mtcars,iris)
for(k in 1:length(tables)) {
# 2: be sure that each element is shaped as dataframe and not matrix
tables[[k]] = as.data.frame(tables[[k]])
# 3: remove elements that have more or less than 5 columns
if(ncol(tables[[k]]) != 5) {
tables <- tables[-k]
}
}
another option I tried:
#1: define the list
tables = list(mtcars,iris)
for(k in 1:length(tables)) {
# 2: be sure that each element is shaped as dataframe
tables[[k]] = as.data.frame(tables[[k]])
# 3: remove elements that have more or less than 5 columns
if(ncol(tables[[k]]) != 5) {
tables[[-k]] <- NULL
}
}
I'm getting
Error in tables[[k]] : subscript out of bounds.
Is there an alternative and correct approach?
We can use Filter
Filter(function(x) ncol(x)==5, tables)
Or with sapply to create a logical index and subset the list
tables[sapply(tables, ncol)==5]
Or, as @Sotos commented:
tables[lengths(tables)==5]
lengths returns the length of each list element; comparing it with 5 gives a logical vector, which is then used to subset the list. The length of a data.frame is the number of columns it has.
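The following sketch spells out the intermediate steps of the lengths approach and notes, in a comment, why the loop in the question runs out of bounds. The names n_cols, keep and result are just illustrative.

```r
tables <- list(datasets::mtcars, datasets::iris)

# In the loop from the question, 1:length(tables) is evaluated once, as 1:2.
# As soon as one element is removed the list has a single element left,
# so tables[[2]] no longer exists when k reaches 2.

# Vectorised alternative: no index can go stale because nothing is
# removed while iterating.
n_cols <- lengths(tables)   # columns per data frame: 11 and 5
keep   <- n_cols == 5       # FALSE TRUE
result <- tables[keep]      # list containing only iris

length(result)
```

Subsetting with a logical vector builds a new list in one step, which is why it avoids the shifting-index problem of removing elements inside a loop.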
For a tidyverse option, you can use purrr::keep. You define a predicate function: if it returns TRUE the list element is kept, if FALSE it is removed. Here I've done that with the formula shorthand.
library(purrr)
tables <- list(mtcars, iris)
result <- purrr::keep(tables, ~ ncol(.x) == 5)
str(result)
#> List of 1
#> $ :'data.frame': 150 obs. of 5 variables:
#> ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Convenient way to access variables label after importing Stata data with haven

In R, some packages (e.g. haven) attach a label attribute to variables, which gives the substantive name of the variable. For example, gdppc may have the label GDP per capita.
This is extremely useful, especially when importing data from Stata. However, I still struggle to know how to use this in my workflow.
How to quickly browse the variable and the variable label? Right now I have to do attributes(df$var), but this is hardly convenient to get a glimpse (a la names(df))
How to use these labels in plots? Again, I can use attr(df$var, "label") to access the string label. However, it seems cumbersome.
Is there any official way to use these labels in a workflow? I can certainly write a custom function that wraps around the attr, but it may break in the future when packages implement the label attribute differently. Thus, ideally I'd want an official way supported by haven (or other major packages).
A solution with purrr package from tidyverse:
df %>% map_chr(~attributes(.)$label)
Using sapply in a simple function to return a variable list as in Stata's Variable Window:
library(dplyr)
makeVlist <- function(dta) {
labels <- sapply(dta, function(x) attr(x, "label"))
tibble(name = names(labels),
label = labels)
}
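As a base-R variation on the same idea, the sketch below also copes with columns that have no label, which would otherwise make sapply return a list. The data frame df, its labels and the helper get_label are all made up for illustration; with real data the labels would come from the import step.

```r
# Toy data with hand-made "label" attributes (hypothetical values).
df <- data.frame(gdppc = c(1200, 3400), pop = c(10, 20), id = 1:2)
attr(df$gdppc, "label") <- "GDP per capita"
attr(df$pop, "label")   <- "Population"

# Return the label of a column, or NA if the column has none.
get_label <- function(x) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else as.character(lab)
}

# Named character vector: one label per column, like names(df).
var_labels <- vapply(df, get_label, character(1))
print(var_labels)
```

The result can be indexed by column name, for example var_labels[["gdppc"]], which is convenient for axis titles in plots.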
This is one of the innovations addressed in rio (full disclosure: I wrote this package). Basically, it provides various ways of importing variable labels, including haven's way of doing things and foreign's. Here's a trivial example:
Start by making a reproducible example:
> library("rio")
> export(iris, "iris.dta")
Import using foreign::read.dta() (via rio::import()):
> str(import("iris.dta", haven = FALSE))
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "datalabel")= chr ""
- attr(*, "time.stamp")= chr "15 Jan 2016 20:05"
- attr(*, "formats")= chr "" "" "" "" ...
- attr(*, "types")= int 255 255 255 255 253
- attr(*, "val.labels")= chr "" "" "" "" ...
- attr(*, "var.labels")= chr "" "" "" "" ...
- attr(*, "version")= int -7
- attr(*, "label.table")=List of 1
..$ Species: Named int 1 2 3
.. ..- attr(*, "names")= chr "setosa" "versicolor" "virginica"
Read in using haven::read_dta() with its native variable attributes, which are stored at the variable level rather than the data.frame level:
> str(import("iris.dta", haven = TRUE, column.labels = TRUE))
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species :Class 'labelled' atomic [1:150] 1 1 1 1 1 1 1 1 1 1 ...
.. ..- attr(*, "labels")= Named int [1:3] 1 2 3
.. .. ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"
Read in using haven::read_dta() using an alternative that we (the rio developers) have found more convenient:
> str(import("iris.dta", haven = TRUE))
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "var.labels")=List of 5
..$ Sepal.Length: NULL
..$ Sepal.Width : NULL
..$ Petal.Length: NULL
..$ Petal.Width : NULL
..$ Species : NULL
- attr(*, "label.table")=List of 5
..$ Sepal.Length: NULL
..$ Sepal.Width : NULL
..$ Petal.Length: NULL
..$ Petal.Width : NULL
..$ Species : Named int 1 2 3
.. ..- attr(*, "names")= chr "setosa" "versicolor" "virginica"
By moving the attributes to the level of the data.frame, they're much easier to access using attr(data, "var.labels"), etc. rather than digging through each variable's attributes.
Note: the values of attributes will be NULL because I'm just writing a native R dataset to a local file in order to make this reproducible.
A simple solution with the labelled package (used alongside the tidyverse):
descriptions <- var_label(data_raw) %>%
as_tibble() %>%
gather(key = variable, value = description)
Use the haven package to convert a labelled variable to a factor:
haven::as_factor(df$var, levels="labels")
The purpose of the labelled package is to provide convenient functions to manipulate variable and value labels as imported with haven.
In addition, the functions lookfor and describe from the questionr package are also useful to display variable and value labels.

Writing to the global environment from a function in R

I'm new to R and have some trouble understanding how to handle local and global environments. I checked the post on local and global variables, but couldn't figure it out.
If, for example, I would like to make several plots using a function and save them like this:
PlottingFunction <- function(type) {
type <<- mydata %>%
filter(typeVariable==type) %>%
qplot(a,b)
}
lapply(ListOfTypes, PlottingFunction)
This didn't yield the desired result. I tried using the assign() function, but couldn't get it to work either.
I want to save the graphs in the global environment so I can combine them using gridExtra. This might not be the best way to do that, but I think it might be useful to understand this issue nevertheless.
You don't need to assign the plot to a global variable. All plots can be saved in one list.
For this example, I use the iris data set.
library(gridExtra)
library(ggplot2)
library(dplyr)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The modified function without assignment:
PlottingFunction <- function(type) {
iris %>%
filter(Species == type) %>%
qplot(Sepal.Length, Sepal.Width, data = .)
}
One figure per Species is created
species <- unique(iris$Species)
# [1] setosa versicolor virginica
# Levels: setosa versicolor virginica
l <- lapply(species, PlottingFunction)
Now, the function do.call can be used to call grid.arrange with the plot objects in the list l.
do.call(grid.arrange, l)
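If separately named objects are still wanted, one possible sketch is to name the list elements and copy them into an environment. To keep the example free of plotting packages, it stores data subsets instead of plots. The helper subset_for and the environment target_env are illustrative names, and a fresh environment stands in for the global one.

```r
# Return the rows of iris that belong to one species.
subset_for <- function(type) {
  iris_df <- datasets::iris
  iris_df[iris_df$Species == type, ]
}

species <- levels(datasets::iris$Species)

# One list element per species, named after the species.
subsets <- lapply(species, subset_for)
names(subsets) <- species

# Copy every list element into an environment as a separate object.
# Passing globalenv() instead would create the objects in the workspace.
target_env <- new.env()
list2env(subsets, envir = target_env)

ls(target_env)
```

A named list is usually easier to work with than loose global objects, because it can be passed straight to lapply or do.call.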

Test whether data is numeric or Factor/Ordinal

I'm working with a large dataset and want to get some basic information about my variables, first of all whether they are numeric or factor/ordinal.
I'm working with a function, and want, one variable at a time, investigate if it is numeric or a factor.
To make the for loop work I'm using dataset[i] to get to the variable I want.
object<-function(dataset){
n=ncol(dataset)
for(i in 1:n){
variable_name<-names(dataset[i])
factor<-is.factor(dataset[i])
ordered<-is.ordered(dataset[i])
numeric<-is.numeric(dataset[i])
print(list(variable_name,factor,ordered,numeric))
}
}
My problem is that is.numeric() does not seem to work with dataset[i]: all the results come out as FALSE. It only works with dataset$.
Do you have any idea how to solve this?
Try str(dataset) to get summary information on an object, but to solve your problem you need to completely extract your data with double square brackets. Single square bracket subsetting keeps the output as a sub-list (or data.frame) rather than extracting the contents:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
is.numeric(iris[1])
[1] FALSE
class(iris[1])
[1] "data.frame"
is.numeric(iris[[1]])
[1] TRUE
Assuming that dataset is something like a data.frame, you can do the following (and avoid the loop):
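names = colnames(dataset)
types = sapply(dataset, class)
Then types gives you the class of each column, for example numeric or factor. Note that an ordered factor has two classes, "ordered" and "factor", so sapply returns a list if the data contains one. For the simple case you can then do something like this: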
is_factor = types == 'factor'
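Putting both answers together, a loop-free version of the function from the question might look like the sketch below. The function name describe_columns is made up for illustration. vapply passes each column itself (as [[ would) to the predicate, so is.numeric sees a vector rather than a one-column data.frame.

```r
# Summarise, per column, whether it is numeric, a factor, or an ordered factor.
describe_columns <- function(dataset) {
  data.frame(
    variable  = names(dataset),
    numeric   = vapply(dataset, is.numeric, logical(1)),
    factor    = vapply(dataset, is.factor, logical(1)),
    ordered   = vapply(dataset, is.ordered, logical(1)),
    row.names = NULL
  )
}

column_info <- describe_columns(datasets::iris)
print(column_info)
```

Because each predicate returns a single TRUE or FALSE per column, this also handles ordered factors, which report TRUE for both factor and ordered.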

Resources