Getting different results for 'class()' method - r

Here's the smallest piece of code that shows how I get different results from class() when it is called directly on a column versus when it is called through apply.
The data.frame looks like this.
> df
A B C
1 rlm 4.047317e-03 0.0040111713
2 rlm -6.474359e-02 -0.0657461598
3 rlm 1.464302e-01 0.1451224214
4 rlm 3.508878e-01 0.3477540761
5 lm 2.701757e-01 0.2769367280
6 lm 2.580785e-03 0.0025815525
7 rlm 1.638077e-05 0.0000160895
> str(df)
'data.frame': 7 obs. of 3 variables:
$ A: chr "rlm" "rlm" "rlm" "rlm" ...
$ B: num 0.00405 -0.06474 0.14643 0.35089 0.27018 ...
$ C: num 0.00401 -0.06575 0.14512 0.34775 0.27694 ...
> class(df$A)
[1] "character"
> class(df$B)
[1] "numeric"
> apply(df, 2, class)
A B C
"character" "character" "character"
So, when called directly, the class of B is 'numeric', but when called through apply it is reported as 'character'.
Am I missing anything here?

apply coerces a data.frame to a matrix (via as.matrix) before applying the function. Since every element of a matrix must have the same type, you end up with a character matrix (any number can be written as a string, but an arbitrary string cannot be turned into a number). The reason for this design is probably that apply can also work by row, which would be messy with data.frames, since your function would then have to operate on a list.
For what you want, check out the lapply and sapply functions: a data.frame is basically a list in which each element is one of the columns.
> x <- data.frame(a = "Entry", b = 5, stringsAsFactors = TRUE)
> sapply(x, class)
a b
"factor" "numeric"
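The coercion described in this answer can be observed by calling as.matrix yourself. This is only a minimal sketch; the data frame `d` is a made-up stand-in, not the asker's data.

```r
# Hypothetical data frame with one character and one numeric column
d <- data.frame(A = c("rlm", "lm"), B = c(0.5, 1.5), stringsAsFactors = FALSE)

# apply() starts by calling as.matrix(d), which must pick a single type
m <- as.matrix(d)
matrix_type <- typeof(m)

# every column handed to class() is therefore already a character vector
via_apply <- apply(d, 2, class)

# sapply() walks the data frame as a list of columns, so the types survive
via_sapply <- sapply(d, class)
```

Here `via_apply` reports "character" for both columns, while `via_sapply` reports "character" for A and "numeric" for B.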

I get the same result. I think it might be the same behavior you see in this example:
number_m <- matrix(1:6)
mode(number_m) # "numeric"
number_m[2,1] <- "b"
mode(number_m) # "character"
number_m
Assigning a character value to one element of a matrix or vector changes the type of all the elements.
I get the correct result by looping over the columns:
df <- read.table(header=TRUE, text="
A B C
1 rlm 4.047317e-03 0.0040111713
2 rlm -6.474359e-02 -0.0657461598
3 rlm 1.464302e-01 0.1451224214
4 rlm 3.508878e-01 0.3477540761
5 lm 2.701757e-01 0.2769367280
6 lm 2.580785e-03 0.0025815525
7 rlm 1.638077e-05 0.0000160895")
sapply(1:3, function(i) class(df[,i]))
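A variant of the same idea, sketched under the assumption that one class name per column is enough: vapply over the columns themselves instead of over their indices. The data frame `df2` is a small hypothetical stand-in for the asker's data.

```r
# Hypothetical stand-in for the asker's data frame
df2 <- data.frame(A = c("rlm", "lm"), B = c(0.1, 0.2), C = c(1L, 2L),
                  stringsAsFactors = FALSE)

# class() may return several strings, so keep only the first one;
# character(1) makes vapply() fail loudly if that assumption breaks
col_classes <- vapply(df2, function(col) class(col)[1], character(1))
```

Unlike sapply, vapply always returns a character vector here, even for a data frame with zero columns.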

Related

R: Why am I not getting type or class "factor" after converting columns to factor?

I have the following setup.
df <- data.frame(aa = rnorm(1000), bb = rnorm(1000))
apply(df, 2, typeof)
# aa bb
#"double" "double"
apply(df, 2, class)
# aa bb
#"numeric" "numeric"
Then I try to convert one of the columns to "factor". But as you can see below, neither the type nor the class comes back as "factor". Am I doing anything wrong?
df[, 1] <- as.factor(df[, 1])
apply(df, 2, typeof)
# aa bb
#"character" "character"
apply(df, 2, class)
# aa bb
#"character" "character"
Sorry, I felt my original answer was badly written. Why did I put that "matrix of factors" at the very beginning? Here is a better try.
From ?apply:
If ‘X’ is not an array but an object of a class with a non-null
‘dim’ value (such as a data frame), ‘apply’ attempts to coerce it
to an array via ‘as.matrix’ if it is two-dimensional (e.g., a data
frame) or via ‘as.array’.
So a data frame is converted to a matrix by as.matrix, before FUN is applied row-wise or column-wise.
From ?as.matrix:
‘as.matrix’ is a generic function. The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.
The default method for ‘as.matrix’ calls ‘as.vector(x)’, and hence
e.g. coerces factors to character vectors.
I am not a native English speaker and I can't read the following (which looks rather important!). Can someone clarify it?
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying ‘as.vector’ to factors and ‘format’ to other non-character columns.
From ?as.vector:
Note that factors are _not_ vectors; ‘is.vector’ returns ‘FALSE’
and ‘as.vector’ converts a factor to a character vector for ‘mode
= "any"’.
Simply put, as long as you have a factor column in a data frame, as.matrix gives you a character matrix.
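The quoted rule from ?as.matrix can be checked on two tiny data frames. This is a sketch with made-up data (`num_only` and `with_fac` are hypothetical names), meant only to illustrate the two branches of the rule.

```r
# Only numeric columns: the usual coercion hierarchy applies
num_only <- data.frame(x = 1:2, y = c(0.5, 1.5))
type_num <- typeof(as.matrix(num_only))   # integer and double promote to double

# One factor column is enough to force a character matrix
with_fac <- data.frame(x = 1:2, f = factor(c("a", "b")))
type_fac <- typeof(as.matrix(with_fac))
```

`type_num` is "double" and `type_fac` is "character", which is why apply(df, 2, class) reports "character" for every column once a factor is present.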
I believe this apply-with-data-frame problem has been raised many times, and the above just adds another duplicate answer. Really sorry. I failed to read the OP's question carefully. What struck me at first was that R cannot build a true matrix of factors.
f <- factor(letters[1:4])
matrix(f, 2, 2)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "d"
## a sneaky way to get a matrix of factors: set the `dim` attribute
dim(f) <- c(2, 2)
f
# [,1] [,2]
#[1,] a c
#[2,] b d
#Levels: a b c d
is.matrix(f)
#[1] TRUE
class(f)
#[1] "factor" ## not a true matrix with "matrix" class
While this is interesting, it is less relevant to the OP's question.
Sorry again for making a mess here. So bad!!
So would sapply help? I have many columns that need to be converted to factor.
Use lapply, actually. sapply would simplify the result to an array, which is a matrix in the 2D case. Here is an example:
dat <- head(trees)
sapply(dat, as.factor)
# Girth Height Volume
#[1,] "8.3" "70" "10.3"
#[2,] "8.6" "65" "10.3"
#[3,] "8.8" "63" "10.2"
#[4,] "10.5" "72" "16.4"
#[5,] "10.7" "81" "18.8"
#[6,] "10.8" "83" "19.7"
new_dat <- data.frame(lapply(dat, as.factor))
str(new_dat)
#'data.frame': 6 obs. of 3 variables:
# $ Girth : Factor w/ 6 levels "8.3","8.6","8.8",..: 1 2 3 4 5 6
# $ Height: Factor w/ 6 levels "63","65","70",..: 3 2 1 4 5 6
# $ Volume: Factor w/ 5 levels "10.2","10.3",..: 2 2 1 3 4 5
sapply(new_dat, class)
# Girth Height Volume
#"factor" "factor" "factor"
apply(new_dat, 2, class)
# Girth Height Volume
#"character" "character" "character"
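If only some of the columns should become factors, the same lapply idea can be applied to a subset and assigned back in place. A minimal sketch, assuming the columns are picked by name; the choice in `cols` is hypothetical.

```r
dat <- head(trees)                 # built-in dataset, as in the example above
cols <- c("Girth", "Height")       # hypothetical choice of columns to convert

# assigning into dat[cols] keeps dat a data frame and leaves Volume untouched
dat[cols] <- lapply(dat[cols], as.factor)

classes <- sapply(dat, class)
```

After this, `classes` shows "factor" for Girth and Height and "numeric" for Volume.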
Regarding typeof, factors are actually stored as integers.
sapply(new_dat, typeof)
# Girth Height Volume
#"integer" "integer" "integer"
When you dput a factor you can see this. For example:
dput(new_dat[[1]])
#structure(1:6, .Label = c("8.3", "8.6", "8.8", "10.5", "10.7",
#"10.8"), class = "factor")
The real values are 1:6. Character levels are just an attribute.
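One practical consequence of this storage scheme, shown as a short sketch with made-up values: converting a factor straight to a number returns the integer codes, not the values printed as levels.

```r
f <- factor(c("8.3", "10.5", "8.6"))

# the levels of a character-built factor are sorted as strings,
# so "10.5" comes first
lev <- levels(f)

codes  <- as.integer(f)                # the stored codes
values <- as.numeric(as.character(f))  # the numbers the labels stand for
```

Going through as.character first is the usual way to recover the original numbers.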

Convert matrix from character to factor

I am trying to convert a basic matrix from one type to another. This seems like a really basic question, but surprisingly I have not seen an answer to it.
Here's a simple example:
> btest <- matrix(LETTERS[1:9], ncol = 3)
> ctest <- apply(btest, 2, as.factor)
> class(ctest[1,1])
[1] "character"
The only examples I could find on stack overflow dealt with data.frame columns, which seems more straightforward...
dtest <- as.data.frame(btest, stringsAsFactors = FALSE)
dtest[] <- lapply(dtest[colnames(dtest)], as.factor)
dtest
V1 V2 V3
1 A D G
2 B E H
3 C F I
class(dtest[1,1])
[1] "factor"
Is there a straightforward way to change a matrix from character to factor and specify the levels as well?
A matrix holds only one data type. A factor is a compound type: an integer vector of codes plus a character vector of levels stored as an attribute. A matrix cannot carry both at once, so a list is the appropriate structure for holding factors, and a data.frame is a kind of list.
The help page ?matrix says this about the data argument:
an optional data vector (including a list or expression
vector). Non-atomic classed R objects are coerced by as.vector and all
attributes discarded.
The attributes of a factor are shown below.
attributes(factor(letters[1:4]))
$levels
[1] "a" "b" "c" "d"
$class
[1] "factor"
These attributes are removed using as.vector during matrix formation.
attributes(as.vector(factor(letters[1:4])))
NULL
In R, a matrix is mostly just a vector with a dim attribute of length 2 (see ?matrix). Its class is matrix, but it usually isn't stored as an attribute, unlike with list-based objects.
Thus, you can reconstruct a factor matrix with structure:
btest <- matrix(LETTERS[1:9], ncol = 3)
btest_fac <- structure(factor(btest), dim = dim(btest), class = c('matrix', 'factor'))
btest_fac
#> [,1] [,2] [,3]
#> [1,] A D G
#> [2,] B E H
#> [3,] C F I
#> Levels: A B C D E F G H I
str(btest_fac)
#> matrix [1:3, 1:3] A B C D ...
#> - attr(*, "levels")= chr [1:9] "A" "B" "C" "D" ...
class(btest_fac)
#> [1] "matrix" "factor"
However, while this is possible, it's not very useful, as functions will deal with it unpredictably, e.g. apply will coerce it to integer. You could define your own class and appropriate methods for it, but that would be a lot more work.
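For the "specify the levels as well" part of the question, a less fragile route is to stay with a data frame of factors. This is only a sketch; the shared level set `lev` is an assumption made for illustration.

```r
btest <- matrix(LETTERS[1:9], ncol = 3)
lev <- LETTERS[1:9]                # assumed: every column shares these levels

# one factor per column, all with the same explicit levels
dtest <- as.data.frame(btest, stringsAsFactors = FALSE)
dtest[] <- lapply(dtest, factor, levels = lev)

str(dtest)
```

Each column is then an ordinary factor, so modelling and plotting functions treat it as expected.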

R: Applying function to DataFrame

I have following code:
library(Ecdat)
data(Fair)
Fair[1:5,]
x1 = function(x){
mu = mean(x)
l1 = list(s1=table(x),std=sd(x))
return(list(l1,mu))
}
mylist <- as.list(Fair$occupation,
Fair$education)
x1(mylist)
What I wanted is that x1 outputs the result for the items selected in mylist. However, I get In mean.default(x) : argument is not numeric or logical: returning NA.
You need to use lapply if you're passing a list to a function:
output<-lapply(mylist,FUN=x1)
This will process your function x1 for each element in mylist and return a list of results to output.
Here mylist is not created correctly, and a list is not needed anyway, since a data.frame is already a list of columns of equal length. So just subset the columns of interest and apply the function:
lapply(Fair[c("occupation", "education")], x1)
In the OP's code, as.list simply creates a list of length 601 in which each component holds a single value.
str(mylist)
#List of 601
#$ : int 7
#$ : int 6
#$ : int 1
#...
#...
Another problem in the code is that the 2nd argument is never used. A simple example:
as.list(1:3, 1:2)
#[[1]]
#[1] 1
#[[2]]
#[1] 2
#[[3]]
#[1] 3
The second argument is ignored entirely. It could have been written as
list(1:3, 1:2)
#[[1]]
#[1] 1 2 3
#[[2]]
#[1] 1 2
But for data.frame columns we don't need to call list explicitly, since a data.frame is already a list of vectors of equal length.
Regarding the error in the OP's post: mean works on vectors, not on a list or data.frame.
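Both points can be reproduced without installing Ecdat. The sketch below uses a small made-up data frame, `fair_like`, as a stand-in for Fair; only the column names are borrowed from the question.

```r
# Hypothetical stand-in for the Fair data
fair_like <- data.frame(occupation = c(7, 6, 1, 6),
                        education  = c(16, 14, 12, 14))

# the OP's function, unchanged in behaviour
x1 <- function(x) {
  mu <- mean(x)
  l1 <- list(s1 = table(x), std = sd(x))
  list(l1, mu)
}

# mean() on a list warns and returns NA, which is the message the OP saw
na_result <- suppressWarnings(mean(list(1, 2)))

# lapply() hands x1 one column (a plain vector) at a time
res <- lapply(fair_like[c("occupation", "education")], x1)
```

`res` is a list with one entry per column, and each entry holds the table, the standard deviation and the mean for that column.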

R: operations between vectors inside of lists and vectors outside

Suppose I have a list of 3 elements, and each element is itself a list of 2 elements: the first a 4-dimensional vector and the second, say, a character string. The following code produces a list exactly as I just described it:
x <- NULL
for(i in 1:3){
set.seed(i); a <- list(sample(1:4, 4, replace = T), LETTERS[i])
x <- c(x, list(a))
}
Its structure is therefore of the following type (the exact values may change, since I used the sample function):
> str(x)
List of 3
$ :List of 2
..$ : int [1:4] 2 2 3 4
..$ : chr "A"
$ :List of 2
..$ : int [1:4] 1 3 3 1
..$ : chr "B"
$ :List of 2
..$ : int [1:4] 1 4 2 2
..$ : chr "C"
Now, I have another 4-dimensional vector, say y:
y <- 1:4
Finally I want to create a matrix resulting from the operation (say sum) between y and each 4-dimensional vector stored in the list. For the given example, this matrix would give the following result:
[,1] [,2] [,3]
[1,] 3 2 2
[2,] 4 5 6
[3,] 6 6 5
[4,] 8 5 6
Question: How can I create the above matrix in a simple and elegant way? I was searching for some solution that could use some apply function or that could use directly the sum function in some way that I'm not aware of.
Try this:
# you can also simply write: sapply(x, function(x) x[[1]]) + y
foo <- function(x) x[[1]]
sapply(x, foo) + y
The function foo extracts the vector inside each list element;
sapply returns those vectors as the columns of a matrix;
finally, the recycling rule adds y to each column.
Update 1
Well, since @Frank mentioned it, I might add a little explanation. The '[[' operator in R (note the quotes!) is a function taking two arguments. The first is a vector-like object, such as an atomic vector or a list; the second is the index you want to extract. For example:
a <- 1:4
a[[2]] # 2
'[['(a, 2) # 2
Though my original answer is easier to digest, it is not the most efficient, because for each list element two function calls are made to extract the vector (foo, then '[['). If we pass '[[' directly, as in sapply(x, '[[', 1), only one function call is made. We therefore save time by reducing function call overhead, which can be noticeable when the function is small and does not do much work.
Operators in R are essentially functions. +, *, etc. are arithmetic operators, and you can look them up with ?'+'. Similarly, you can look up ?'[['. Don't worry too much if you can't follow this at the moment. Sooner or later you will get to it.
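Passing '[[' straight to sapply can be sketched as follows. To keep the result reproducible, the list is written out by hand using the values shown in the question's str(x) output rather than regenerated with sample.

```r
# The question's list, typed in by hand instead of sampled
x <- list(list(c(2L, 2L, 3L, 4L), "A"),
          list(c(1L, 3L, 3L, 1L), "B"),
          list(c(1L, 4L, 2L, 2L), "C"))
y <- 1:4

# "[[" is called as `[[`(element, 1) for every element of x;
# the extracted vectors become the columns of a 4 x 3 matrix
res <- sapply(x, "[[", 1) + y
res
```

Because y has as many elements as the matrix has rows, recycling adds it to each column in turn, which gives the matrix requested in the question.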
Update 2
I don't understand how it actually does the job. When I simply ask for [[1]] at the console, I get the first element of the list (both the integer vector and the char value), not just the vector. I guess the remainder should be the magics of the sapply function.
Ah, if you have difficulty understanding sapply (or similarly lapply), consider the following. I will start with lapply.
output <- lapply(x, foo) is doing something like:
output <- vector("list", length = length(x))
for (i in seq_along(x)) output[[i]] <- foo(x[[i]])
So lapply returns a list:
> output
[[1]]
[1] 2 2 3 4
[[2]]
[1] 1 4 4 3
[[3]]
[1] 3 1 1 1
Yes, lapply loops through the elements of x, applies the function foo to each, and returns the results in another list.
sapply follows a similar idea but returns a vector/matrix. You may think of sapply as collapsing the result of lapply into a vector/matrix.
Of course, this part of the explanation is only meant to make things understandable. lapply and sapply are not really implemented as R loops; they are more efficient.
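The "collapsing" view can be checked on a hand-written copy of the list (values taken from the question's str(x) output). This sketch mirrors the simplification step with cbind; it is an illustration, not a description of how sapply is implemented internally.

```r
x <- list(list(c(2L, 2L, 3L, 4L), "A"),
          list(c(1L, 3L, 3L, 1L), "B"),
          list(c(1L, 4L, 2L, 2L), "C"))
foo <- function(el) el[[1]]

as_list   <- lapply(x, foo)          # a list of three integer vectors
as_matrix <- sapply(x, foo)          # the same vectors as matrix columns

# binding the lapply() result column-wise gives the same values
bound <- do.call(cbind, as_list)
```

When the results have different lengths, sapply cannot build a matrix and falls back to returning a list, much like lapply.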

Converting from a character to a numeric data frame

I have a character data frame in R which has NaNs in it. I need to remove any row with a NaN and then convert it to a numeric data frame.
If I just do as.numeric on the data frame, I run into the following
Error: (list) object cannot be coerced to type 'double'
1:
0:
As @thijs van den bergh points out,
dat <- data.frame(x=c("NaN","2"),y=c("NaN","3"),stringsAsFactors=FALSE)
dat <- as.data.frame(sapply(dat, as.numeric)) #<- sapply is here
dat[complete.cases(dat), ]
# x y
#2 2 3
is one way to do this.
Your error comes from trying to make a data.frame numeric. The sapply option I show is instead making each column vector numeric.
Note that data.frames are not numeric or character, but rather are a list which can be all numeric columns, all character columns, or a mix of these or other types (e.g.: Date/logical).
dat <- data.frame(x=c("NaN","2"),y=c("NaN","3"),stringsAsFactors=FALSE)
is.list(dat)
# [1] TRUE
The example data just has two character columns:
> str(dat)
'data.frame': 2 obs. of 2 variables:
$ x: chr "NaN" "2"
$ y: chr "NaN" "3"
...which you could add a numeric column to like so:
> dat$num.example <- c(6.2,3.8)
> dat
x y num.example
1 NaN NaN 6.2
2 2 3 3.8
> str(dat)
'data.frame': 2 obs. of 3 variables:
$ x : chr "NaN" "2"
$ y : chr "NaN" "3"
$ num.example: num 6.2 3.8
So, when you try to call as.numeric on the whole data.frame, R cannot do it, because it does not know how to convert a list object that may hold several types. user1317221_G's answer uses the sapply function (see ?sapply), which can be used to apply a function to the individual items of an object. You could alternatively use lapply (see ?lapply), which is a very similar function (read more on the *apply functions here - R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate )
I.e., in this case you can apply the as.numeric function to each column of your data.frame, like so:
data.frame(lapply(dat,as.numeric))
The lapply call is wrapped in a data.frame to make sure the output is a data.frame and not a list. That is, running:
lapply(dat,as.numeric)
will give you:
> lapply(dat,as.numeric)
$x
[1] NaN 2
$y
[1] NaN 3
$num.example
[1] 6.2 3.8
While:
data.frame(lapply(dat,as.numeric))
will give you:
> data.frame(lapply(dat,as.numeric))
x y num.example
1 NaN NaN 6.2
2 2 3 3.8
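Putting the pieces together for the original question, one possible sketch converts the columns in place and then drops the rows containing NaN. It reuses the example data from the first answer; the name `clean` is introduced here for illustration.

```r
dat <- data.frame(x = c("NaN", "2"), y = c("NaN", "3"),
                  stringsAsFactors = FALSE)

# dat[] <- ... replaces the columns but keeps dat a data frame
dat[] <- lapply(dat, as.numeric)

# complete.cases() treats NaN as missing, so the first row is dropped
clean <- dat[complete.cases(dat), ]
clean
```

Assigning through dat[] also avoids a pitfall of the sapply route: on a one-row data frame, sapply returns a plain vector, and as.data.frame would then turn it into a single column.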
