Convert matrix from character to factor - r

I am trying to convert a basic matrix from one type to another. This seems like a really basic question, but surprisingly I have not seen an answer to it.
Here's a simple example:
> btest <- matrix(LETTERS[1:9], ncol = 3)
> ctest <- apply(btest, 2, as.factor)
> class(ctest[1,1])
[1] "character"
The only examples I could find on stack overflow dealt with data.frame columns, which seems more straightforward...
dtest <- as.data.frame(btest, stringsAsFactors = F)
dtest[] <- lapply(dtest[colnames(dtest)], as.factor)
dtest
V1 V2 V3
1 A D G
2 B E H
3 C F I
class(dtest[1,1])
[1] "factor"
Is there a straightforward way to change a matrix from character to factor and specify the levels as well?

A matrix holds only one data type. A factor is a compound type, made up of integer codes plus a character levels attribute and a class, and a matrix cannot hold those together. A list is the appropriate data structure for factors, and a data.frame is a kind of list.
The help documentation of matrix ?matrix states that
an optional data vector (including a list or expression
vector). Non-atomic classed R objects are coerced by as.vector and all
attributes discarded.
The attributes of a factor are shown below.
attributes(factor(letters[1:4]))
$levels
[1] "a" "b" "c" "d"
$class
[1] "factor"
These attributes are removed using as.vector during matrix formation.
attributes(as.vector(factor(letters[1:4])))
NULL
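Since a matrix drops the factor attributes, one way to get factor columns with explicit levels is to go through a data frame. This is only a minimal sketch: it assumes that one shared level set (here called lev, a name made up for this example) should apply to every column.

```r
# Assumed example data: the 3x3 character matrix from the question
btest <- matrix(LETTERS[1:9], ncol = 3)

# Hypothetical shared level set for all columns
lev <- LETTERS[1:9]

# A data frame is a list of columns, so each column can keep its factor class
dtest <- as.data.frame(btest, stringsAsFactors = FALSE)
dtest[] <- lapply(dtest, factor, levels = lev)

sapply(dtest, class)   # "factor" for V1, V2 and V3
levels(dtest$V1)       # all nine letters, not just the ones present in V1
```

Passing levels through lapply means every column shares the same coding, which is usually what is wanted when the matrix is conceptually one variable.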

In R, a matrix is mostly just a vector with a dim attribute of length 2 (see ?matrix). Its class is matrix, but it usually isn't stored as an attribute, unlike with list-based objects.
Thus, you can reconstruct a factor matrix with structure:
btest <- matrix(LETTERS[1:9], ncol = 3)
btest_fac <- structure(factor(btest), dim = dim(btest), class = c('matrix', 'factor'))
btest_fac
#> [,1] [,2] [,3]
#> [1,] A D G
#> [2,] B E H
#> [3,] C F I
#> Levels: A B C D E F G H I
str(btest_fac)
#> matrix [1:3, 1:3] A B C D ...
#> - attr(*, "levels")= chr [1:9] "A" "B" "C" "D" ...
class(btest_fac)
#> [1] "matrix" "factor"
However, while this is possible, it's not very useful, as functions will deal with it unpredictably, e.g. apply will coerce it to integer. You could define your own class and appropriate methods for it, but that would be a lot more work.

Related

Why does indexing with a single character index work on a data frame but not a matrix?

In data frames, [-indexing can be performed using a single character. E.g. mtcars["mpg"].
On the other hand, trying the same on a matrix, results in NA, e.g.
m = cbind(A = 1:5, B = 1:5)
m["A"]
# NA
...implying that this is somehow an invalid way to subset a matrix.
Is this normal R behavior? If so, where is it documented?
cbind() creates a matrix, by default. mtcars is a data frame.
class(cbind(A = 1:5, B = 1:5))
# [1] "matrix" "array"
class(mtcars)
# [1] "data.frame"
Because data frames are built as lists of columns, dataframe["column_name"], using one argument in [, defaults to treating the data frame as a list, allowing you to select columns, mostly the same as dataframe[, "column_name"].
A matrix has no such list underpinnings, so if you use [ with one argument, it doesn't assume you want columns. Use matrix[, "column_name"] to select columns from a matrix.
cbind is a bad way to create data frames from scratch. You can specify cbind.data.frame(A = 1:5, B = 1:5), but it's simpler and clearer to use data.frame(A = 1:5, B = 1:5). However, if you are adding multiple columns to an existing data frame then cbind(my_data_frame, A = 1:5, B = 1:5) is fine, and will result in a data frame as long as one of the arguments is already a data frame.
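To make the difference concrete, here is a small sketch comparing the three constructions mentioned above (the object names m, d and d2 are just for illustration):

```r
m  <- cbind(A = 1:5, B = 1:5)        # matrix
d  <- data.frame(A = 1:5, B = 1:5)   # data frame
d2 <- cbind(d, C = 6:10)             # still a data frame, because d is one

is.matrix(m)        # TRUE
is.data.frame(d2)   # TRUE

m[, "A"]   # column of the matrix: needs the comma
d["A"]     # one-column data frame: list-style indexing
m["A"]     # NA: the matrix has no names attribute to match against
```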
This behaviour is documented in ?"[", section "Matrices and arrays":
Matrices and arrays are vectors with a dimension attribute and so
all the vector forms of indexing can be used with a single index.
It means that if you use just a single index, the object to subset is treated as an object without dimensions and so if the index is a character vector, the method will look for the names attribute, which is absent in this case (try names(m) on your matrix to check this). What you did in the question is totally equivalent to (c(1:5, 1:5))["A"]. If you use a double index instead, the method will search for the dimnames attribute to subset. Even if confusing, a matrix may have both names and dimnames. Consider this:
m<-matrix(c(1:5,1:5), ncol = 2, dimnames = list(LETTERS[1:5], LETTERS[1:2]))
names(m)<-LETTERS[1:10]
#check whether the attributes are set
str(m)
# int [1:5, 1:2] 1 2 3 4 5 1 2 3 4 5
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:5] "A" "B" "C" "D" ...
# ..$ : chr [1:2] "A" "B"
# - attr(*, "names")= chr [1:10] "A" "B" "C" "D" ...
We have set rownames, colnames and names. Let's subset it:
#a column
m[,"A"]
#A B C D E
#1 2 3 4 5
#a row
m["A",]
# A B
#1 1
#an element
m["A"]
#A
#1
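There are two cases here. A matrix is an atomic vector with dimensions:
m = cbind(A = 1:5, B = 11:15)
typeof(m)
"integer"
whereas a data frame is a list:
typeof(mtcars)
"list"
So the two are subset differently. The matrix case needs a comma to select a column: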
cbind(A = 1:5, B = 11:15)[,"A"]
[1] 1 2 3 4 5

R: Why am I not getting type or class "factor" after converting columns to factor?

I have the following setup.
df <- data.frame(aa = rnorm(1000), bb = rnorm(1000))
apply(df, 2, typeof)
# aa bb
#"double" "double"
apply(df, 2, class)
# aa bb
#"numeric" "numeric"
Then I try to convert one of the columns to "factor". But as you can see below, I am not getting any "factor" type or classes. Am I doing anything wrong ?
df[, 1] <- as.factor(df[, 1])
apply(df, 2, typeof)
# aa bb
#"character" "character"
apply(df, 2, class)
# aa bb
#"character" "character"
Sorry I felt my original answer badly written. Why did I put that "matrix of factors" in the very beginning? Here is a better try.
From ?apply:
If ‘X’ is not an array but an object of a class with a non-null
‘dim’ value (such as a data frame), ‘apply’ attempts to coerce it
to an array via ‘as.matrix’ if it is two-dimensional (e.g., a data
frame) or via ‘as.array’.
So a data frame is converted to a matrix by as.matrix, before FUN is applied row-wise or column-wise.
From ?as.matrix:
‘as.matrix’ is a generic function. The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.
The default method for ‘as.matrix’ calls ‘as.vector(x)’, and hence
e.g. coerces factors to character vectors.
I am not a native English speaker and I can't read the following (which looks rather important!). Can someone clarify it?
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying ‘as.vector’ to factors and ‘format’ to other non-character columns.
From ?as.vector:
Note that factors are _not_ vectors; ‘is.vector’ returns ‘FALSE’
and ‘as.vector’ converts a factor to a character vector for ‘mode
= "any"’.
Simply put, as long as you have a factor column in a data frame, as.matrix gives you a character matrix.
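A small sketch of that conversion, using a made-up two-column data frame. It shows what apply actually sees, and that asking the columns directly gives the answer OP expected:

```r
# Hypothetical data: one factor column, one numeric column
df <- data.frame(aa = factor(c("x", "y", "x")), bb = c(1.5, 2, 3.25))

typeof(as.matrix(df))   # "character": this matrix is what apply() works on
apply(df, 2, class)     # "character" for both columns

sapply(df, class)       # aa "factor", bb "numeric": columns inspected as-is
```

sapply (or lapply) iterates over the data frame as a list, so no as.matrix step is involved.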
I believe this apply-with-data-frame problem has been raised many times, and the above just adds another duplicate answer. Really sorry. I failed to read OP's question carefully. What struck me at first is that R cannot build a true matrix of factors.
f <- factor(letters[1:4])
matrix(f, 2, 2)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "d"
## a sneaky way to get a matrix of factors by setting `dim` attribute
dim(f) <- c(2, 2)
f
# [,1] [,2]
#[1,] a c
#[2,] b d
#Levels: a b c d
is.matrix(f)
#[1] TRUE
class(f)
#[1] "factor" ## not a true matrix with "matrix" class
While this is interesting, it is less relevant to OP's question.
Sorry again for making a mess here. So bad!!
So if I do sapply would it help? Because I have many columns that need to be converted to factor.
Use lapply actually. sapply would simplify the result to an array, which is a matrix in 2D case. Here is an example:
dat <- head(trees)
sapply(dat, as.factor)
# Girth Height Volume
#[1,] "8.3" "70" "10.3"
#[2,] "8.6" "65" "10.3"
#[3,] "8.8" "63" "10.2"
#[4,] "10.5" "72" "16.4"
#[5,] "10.7" "81" "18.8"
#[6,] "10.8" "83" "19.7"
new_dat <- data.frame(lapply(dat, as.factor))
str(new_dat)
#'data.frame': 6 obs. of 3 variables:
# $ Girth : Factor w/ 6 levels "8.3","8.6","8.8",..: 1 2 3 4 5 6
# $ Height: Factor w/ 6 levels "63","65","70",..: 3 2 1 4 5 6
# $ Volume: Factor w/ 5 levels "10.2","10.3",..: 2 2 1 3 4 5
sapply(new_dat, class)
# Girth Height Volume
#"factor" "factor" "factor"
apply(new_dat, 2, class)
# Girth Height Volume
#"character" "character" "character"
Regarding typeof, factors are actually stored as integers.
sapply(new_dat, typeof)
# Girth Height Volume
#"integer" "integer" "integer"
When you dput a factor you can see this. For example:
dput(new_dat[[1]])
#structure(1:6, .Label = c("8.3", "8.6", "8.8", "10.5", "10.7",
#"10.8"), class = "factor")
The real values are 1:6. Character levels are just an attribute.
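As a small illustration of that split between codes and labels (the factor below is made up for the example), the labels can be rebuilt from the codes, and the original numbers recovered by going through character:

```r
f <- factor(c("8.3", "10.5", "8.3"), levels = c("8.3", "10.5"))

codes  <- as.integer(f)                  # 1 2 1: positions in levels(f)
labels <- levels(f)[codes]               # same as as.character(f)
values <- as.numeric(as.character(f))    # 8.3 10.5 8.3: the original numbers
```

Calling as.numeric(f) directly would return the codes, not the original values.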

Dataframe within dataframe?

Consider this example:
df <- data.frame(id=1:10,var1=LETTERS[1:10],var2=LETTERS[6:15])
fun.split <- function(x) tolower(as.character(x))
df$new.letters <- apply(df[ ,2:3],2,fun.split)
df$new.letters.var1
#NULL
colnames(df)
# [1] "id" "var1" "var2" "new.letters"
df$new.letters
# var1 var2
# [1,] "a" "f"
# [2,] "b" "g"
# [3,] "c" "h"
# [4,] "d" "i"
# [5,] "e" "j"
# [6,] "f" "k"
# [7,] "g" "l"
# [8,] "h" "m"
# [9,] "i" "n"
# [10,] "j" "o"
Would be someone so kind and explain what is going on here? A new dataframe within dataframe?
I expected this:
colnames(df)
# id var1 var2 new.letters.var1 new.letters.var2
The reason is that you assigned the 2-column matrix returned by apply to a single new column. So the result is a matrix stored in a single column. You can convert it back to a normal data.frame with
do.call(data.frame, df)
A more straightforward method is to assign 2 columns. I use lapply instead of apply because the columns may be of different classes: apply returns a matrix, so with mixed classes the columns all become 'character'. lapply returns a list and preserves the class of each column.
df[paste0('new.letters.', names(df)[2:3])] <- lapply(df[2:3], fun.split)
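Both routes can be compared side by side. This is a sketch on a shortened version of the question's data (3 rows instead of 10). The column names in the second route are built with a "." separator, which is an assumption made here to match the names OP expected.

```r
df <- data.frame(id = 1:3, var1 = LETTERS[1:3], var2 = LETTERS[4:6])
fun.split <- function(x) tolower(as.character(x))

# Route 1: assign the matrix to one column, then flatten afterwards
nested <- df
nested$new.letters <- apply(df[, 2:3], 2, fun.split)
ncol(nested)                       # 4: the matrix counts as one column
flat <- do.call(data.frame, nested)
names(flat)                        # 5 names, matrix columns prefixed with "new.letters."

# Route 2: assign two ordinary columns directly
direct <- df
direct[paste0("new.letters.", names(df)[2:3])] <- lapply(df[2:3], fun.split)
names(direct)                      # same 5 names
```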
@akrun solved 90% of my problem. But I had data.frames buried within data.frames, buried within data.frames and so on, without knowing the depth to which this was happening.
In this case, I thought sharing my recursive solution might be helpful to others searching this thread as I was:
unnest_dataframes <- function(x) {
  y <- do.call(data.frame, x)
  # Keep the result of the recursive call; otherwise only one level is unpacked
  if("data.frame" %in% sapply(y, class)) y <- unnest_dataframes(y)
  y
}
new_data <- unnest_dataframes(df)
Although this itself sometimes has problems and it can be helpful to separate all columns of class "data.frame" from the original data set then cbind() it back together like so:
# Find all columns that are data.frame
# Assuming your data frame is stored in variable 'y'
data.frame.cols <- unname(sapply(y, is.data.frame))
z <- y[, !data.frame.cols]
# All columns of class "data.frame"
dfs <- y[, data.frame.cols]
# Recursively unnest each of these columns
unnest_dataframes <- function(x) {
  y <- do.call(data.frame, x)
  if("data.frame" %in% sapply(y, class)) {
    # Keep the result of the recursive call
    y <- unnest_dataframes(y)
  } else {
    cat('Nested data.frames successfully unpacked\n')
  }
  y
}
df2 <- unnest_dataframes(dfs)
# Combine with original data
all_columns <- cbind(z, df2)
In this case R doesn't behave the way one would expect, but maybe if we dig deeper we can work out why. What is a data frame? As Norman Matloff says in his book (chapter 5):
a data frame is a list, with the components of that list being
equal-length vectors
The following code might be useful to understand.
class(df$new.letters)
[1] "matrix"
str(df)
'data.frame': 10 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ var1 : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
$ var2 : Factor w/ 10 levels "F","G","H","I",..: 1 2 3 4 5 6 7 8 9 10
$ new.letters: chr [1:10, 1:2] "a" "b" "c" "d" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "var1" "var2"
Maybe the reason why it looks strange is in the print methods. Consider this:
colnames(df$new.letters)
[1] "var1" "var2"
Maybe there is something in the print methods that combines the sub-names of objects and displays them all.
For example here the vectors that constitute the df are:
names(df)
[1] "id" "var1" "var2" "new.letters"
but in this case the vector new.letters also has a dim attribute (in fact it is a matrix) whose dimensions have the names var1 and var2 too. See this code:
attributes(df$new.letters)
$dim
[1] 10 2
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "var1" "var2"
but when we print we see all of them as if they were separate vectors (and so columns of the data.frame!).
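A quick way to confirm that the data frame really has four components, one of which is a matrix, is to inspect it without printing. This sketch uses a 3-row version of the question's data:

```r
df <- data.frame(id = 1:3, var1 = LETTERS[1:3], var2 = LETTERS[4:6])
df$new.letters <- apply(df[, 2:3], 2, tolower)

ncol(df)                   # 4, even though printing shows 5 columns
is.matrix(df$new.letters)  # TRUE
dim(df$new.letters)        # 3 2
colnames(df$new.letters)   # "var1" "var2"
```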
Edit: Print methods
Just for curiosity in order to improve this question I looked inside the methods of the print functions:
methods(print)
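The previous code produces a very long list of methods for the generic function print. Data frames have their own method, print.data.frame, which formats the data frame, converts it to a character matrix and prints that; this is where a matrix column gets expanded into several displayed columns. Another method that deals with list-like objects is listof: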
getS3method("print", "listof")
function (x, ...)
{
nn <- names(x)
ll <- length(x)
if (length(nn) != ll)
nn <- paste("Component", seq.int(ll))
for (i in seq_len(ll)) {
cat(nn[i], ":\n")
print(x[[i]], ...)
cat("\n")
}
invisible(x)
}
<bytecode: 0x101afe1c8>
<environment: namespace:base>
Maybe I am wrong, but it seems to me that this code may contain useful information about why that happens, specifically the check if (length(nn) != ll).

Can we get factor matrices in R?

It seems not possible to get matrices of factor in R. Is it true? If yes, why? If not, how should I do?
f <- factor(sample(letters[1:5], 20, rep=TRUE), letters[1:5])
m <- matrix(f,4,5)
is.factor(m) # fail.
m <- factor(m,letters[1:5])
is.factor(m) # oh, yes?
is.matrix(m) # nope. fail.
dim(f) <- c(4,5) # aha?
is.factor(f) # yes..
is.matrix(f) # yes!
# but then I get a strange behavior
cbind(f,f) # is not a factor anymore
head(f,2) # doesn't give the first 2 rows but the first 2 elements of f
# should I worry about it?
In this case, it may walk like a duck and even quack like a duck, but f from:
f <- factor(sample(letters[1:5], 20, rep=TRUE), letters[1:5])
dim(f) <- c(4,5)
really isn't a matrix, even though is.matrix() claims that it strictly is one. To be a matrix as far as is.matrix() is concerned, f only needs to be a vector and have a dim attribute. By adding the attribute to f you pass the test. As you have seen, however, once you start using f as a matrix, it quickly loses the features that make it a factor (you end up working with the levels or the dimensions get lost).
There are really only matrices and arrays for the atomic vector types:
logical,
integer,
real,
complex,
string (or character), and
raw
plus, as @hadley reminds me, you can also have list matrices and arrays (by setting the dim attribute on a list object). See, for example, the Matrices & Arrays section of Hadley's book, Advanced R.
Anything outside those types would be coerced to some lower type via as.vector(). This happens in matrix(f, nrow = 3) not because f is atomic according to is.atomic() (which returns TRUE for f because it is internally stored as an integer and typeof(f) returns "integer"), but because it has a class attribute. This sets the OBJECT bit on the internal representation of f and anything that has a class is supposed to be coerced to one of the atomic types via as.vector():
matrix <- function(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
dimnames = NULL) {
if (is.object(data) || !is.atomic(data))
data <- as.vector(data)
....
Adding dimensions via dim<-() is a quick way to create an array without duplicating the object, but this bypasses some of the checks and balances that R would do if you coerced f to a matrix via the other methods
matrix(f, nrow = 3) # or
as.matrix(f)
This gets found out when you try to use basic functions that work on matrices or use method dispatch. Note that after assigning dimensions to f, f still is of class "factor":
> class(f)
[1] "factor"
which explains the head() behaviour; you are not getting the head.matrix behaviour because f is not a matrix, at least as far as the S3 mechanism is concerned:
> debug(head.matrix)
> head(f) # we don't enter the debugger
[1] d c a d b d
Levels: a b c d e
> undebug(head.matrix)
and the head.default method calls [ for which there is a factor method, and hence the observed behaviour:
> debugonce(`[.factor`)
> head(f)
debugging in: `[.factor`(x, seq_len(n))
debug: {
y <- NextMethod("[")
attr(y, "contrasts") <- attr(x, "contrasts")
attr(y, "levels") <- attr(x, "levels")
class(y) <- oldClass(x)
lev <- levels(x)
if (drop)
factor(y, exclude = if (anyNA(levels(x)))
NULL
else NA)
else y
}
....
The cbind() behaviour can be explained from the documented behaviour (from ?cbind, emphasis mine):
The functions cbind and rbind are S3 generic, ...
....
In the default method, all the vectors/matrices must be atomic
(see vector) or lists. Expressions are not allowed. Language
objects (such as formulae and calls) and pairlists will be coerced
to lists: other objects (such as names and external pointers) will
be included as elements in a list result. Any classes the inputs
might have are discarded (in particular, factors are replaced by
their internal codes).
Again, the fact that f is of class "factor" is defeating you because the default cbind method will get called and it will strip the levels information and return the internal integer codes as you observed.
In many respects, you have to ignore or at least not fully trust what the is.foo functions tell you, because they are just using simple tests to say whether something is or is not a foo object. is.matrix() and is.atomic() are clearly wrong when it comes to f (with dimensions) from a particular point of view. They are also right in terms of their implementation or at least their behaviour can be understood from the implementation; I think is.atomic(f) is not correct, but if by "if is of an atomic type" R Core mean "type" to be the thing returned by typeof(f) then is.atomic() is right. A more strict test is is.vector(), which f fails:
> is.vector(f)
[1] FALSE
because it has attributes beyond a names attribute:
> attributes(f)
$levels
[1] "a" "b" "c" "d" "e"
$class
[1] "factor"
$dim
[1] 4 5
As to how should you get a factor matrix, well you can't, at least if you want it to retain the factor information (the labels for the levels). One solution would be to use a character matrix, which would retain the labels:
> fl <- levels(f)
> fm <- matrix(f, ncol = 5)
> fm
[,1] [,2] [,3] [,4] [,5]
[1,] "c" "a" "a" "c" "b"
[2,] "d" "b" "d" "b" "a"
[3,] "e" "e" "e" "c" "e"
[4,] "a" "b" "b" "a" "e"
and we store the levels of f for future use in case we lose some elements of the matrix along the way.
Or work with the internal integer representation:
> (fm2 <- matrix(unclass(f), ncol = 5))
[,1] [,2] [,3] [,4] [,5]
[1,] 3 1 1 3 2
[2,] 4 2 4 2 1
[3,] 5 5 5 3 5
[4,] 1 2 2 1 5
and you can always get back to the levels/labels again via:
> fm2[] <- fl[fm2]
> fm2
[,1] [,2] [,3] [,4] [,5]
[1,] "c" "a" "a" "c" "b"
[2,] "d" "b" "d" "b" "a"
[3,] "e" "e" "e" "c" "e"
[4,] "a" "b" "b" "a" "e"
Using a data frame would seem to be not ideal for this as each component of the data frame would be treated as a separate factor whereas you seem to want to treat the array as a single factor with one set of levels.
If you really wanted a factor matrix, you would most likely need to create your own S3 class to do this, plus all the methods to go with it. For example, you might store the factor matrix as a character matrix but with class "factorMatrix", where you stored the levels alongside the factor matrix as an extra attribute, say. Then you would need to write [.factorMatrix, which would grab the levels, then use the default [ method on the matrix, and then add the levels attribute back on again. You could write cbind and head methods as well. The list of required methods would grow quickly, however, but a simple implementation may suit, and if you make your objects have class c("factorMatrix", "matrix") (i.e., inherit from the "matrix" class), you'll pick up all the properties/methods of the "matrix" class (which will drop the levels and other attributes), so you can at least work with the objects and see where you need to add new methods to fill out the behaviour of the class.
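What such a class might look like, very roughly. This is only a sketch under stated assumptions: the constructor name factorMatrix, the "levels" attribute and the two-index-only `[` method are all made up for illustration, and a real implementation would need many more methods (print, cbind, rbind, head, single-index subsetting and so on).

```r
# Hypothetical constructor: a character matrix carrying its levels as an attribute
factorMatrix <- function(x, levels = sort(unique(as.vector(x)))) {
  m <- matrix(as.character(x), nrow = nrow(x), ncol = ncol(x))
  structure(m, levels = levels, class = c("factorMatrix", "matrix"))
}

# Hypothetical `[` method: only handles the two-index form x[i, j]
`[.factorMatrix` <- function(x, i, j, drop = FALSE) {
  lev <- attr(x, "levels")
  # Subset the underlying plain matrix; this drops the levels attribute
  out <- unclass(x)[i, j, drop = drop]
  if (is.matrix(out)) {
    # Put the levels and class back on
    structure(out, levels = lev, class = c("factorMatrix", "matrix"))
  } else {
    # Dimensions were dropped, so return an ordinary factor
    factor(out, levels = lev)
  }
}

fm  <- factorMatrix(matrix(c("a", "b", "c", "a"), nrow = 2))
sub <- fm[1, ]           # first row, kept as a 1 x 2 factorMatrix
attr(sub, "levels")      # "a" "b" "c": levels survive even though "b" is absent
```

The default of drop = FALSE is a design choice in this sketch, so that subsetting stays inside the class unless the caller asks otherwise.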
Unfortunately factor support is not completely universal in R, so many R functions default to treating factors as their internal storage type, which is integer:
> typeof(factor(letters[1:3]))
[1] "integer"
This is what happens with cbind: it doesn't know how to handle factors, but it does know what to do with integers, so it treats your factor like an integer. matrix is similar in that it discards the factor class, although it goes through as.vector and so ends up with a character matrix of the labels. head is actually the opposite. It does know how to handle a factor, but it never bothers to check that your factor is also a matrix, so it just treats it like a normal dimensionless factor vector.
Your best bet to operate as if you had factors with your matrix is to coerce it to character. Once you are done with your operations, you can restore it back to factor form. You could also do this with the integer form, but then you risk weird stuff (you could for example do matrix multiplication on an integer matrix, but that makes no sense for factors).
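A minimal sketch of that round trip, on made-up data: keep the levels aside, work on a character matrix, then rebuild the factor at the end.

```r
f   <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))
lev <- levels(f)                        # keep the levels for later

m  <- matrix(as.character(f), nrow = 2) # character matrix of labels
m2 <- rbind(m, c("c", "b"))             # matrix operations work as usual

f2 <- factor(m2, levels = lev)          # back to a factor (column-major order)
levels(f2)                              # "a" "b" "c"
```

The result f2 is a plain factor without dimensions; dim(m2) can be kept alongside it if the shape is needed again.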
Note that if you add class "matrix" to your factor some (but not all) things start working:
f <- factor(letters[1:9])
dim(f) <- c(3, 3)
class(f) <- c("factor", "matrix")
head(f, 2)
Produces:
[,1] [,2] [,3]
[1,] a d g
[2,] b e h
Levels: a b c d e f g h i
This doesn't fix rbind, etc.

Extract the factor's values positions in level

I'm returning to R after some time, and the following has me stumped:
I'd like to build a list of the positions factor values have in the factor levels list.
Example:
> data = c("a", "b", "a","a","c")
> fdata = factor(data)
> fdata
[1] a b a a c
Levels: a b c
> fdata$lvl_idx <- ????
Such that:
> fdata$lvl_idx
[1] 1 2 1 1 3
Appreciate any hints or tips.
If you convert a factor to integer, you get the position in the levels:
as.integer(fdata)
## [1] 1 2 1 1 3
In certain situations, this is counter-intuitive:
f <- factor(2:4)
f
## [1] 2 3 4
## Levels: 2 3 4
as.integer(f)
## [1] 1 2 3
The same thing happens if a factor is silently coerced to integer, for example by using it as a vector index:
LETTERS[2:4]
## [1] "B" "C" "D"
LETTERS[f]
## [1] "A" "B" "C"
Converting to character before converting to integer gives the expected values. See ?factor for details.
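For example, with the same f as above:

```r
f <- factor(2:4)

as.integer(f)                          # 1 2 3: the level positions
as.integer(as.character(f))            # 2 3 4: the original values
LETTERS[as.integer(as.character(f))]   # "B" "C" "D"
```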
The solution provided years ago by Matthew Lundberg is not robust. It could be that the as.integer() function was defined for a specific S3 type of factors. Imagine someone would create a new factor class to keep operators like >=.
as.myfactor <- function(x, ...) {
structure(as.factor(x), class = c("myfactor", "factor"))
}
# and that someone would create an S3 method for integers - it should
# only remove the operators, which makes sense...
as.integer.myfactor <- function(x, ...) {
as.integer(gsub("(<|=|>)+", "", as.character(x)))
}
Now as.integer() no longer returns the level index; it just removes the operators:
f <- as.myfactor(">=2")
as.integer(f)
#> [1] 2
A more robust way to find the index of a given level, whatever the factor's class, is which():
f <- factor(2:4)
which(levels(f) == 2)
#> [1] 1
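Note that which() as used here returns the position of a single level. To get the level position of every element, as asked in the question, one option that does not depend on as.integer() is match() against the levels. A sketch using the question's data:

```r
fdata <- factor(c("a", "b", "a", "a", "c"))

# Position of each element's label within levels(fdata)
lvl_idx <- match(as.character(fdata), levels(fdata))
lvl_idx                        # 1 2 1 1 3

# Position of one particular level
which(levels(fdata) == "c")    # 3
```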
