How to get levels for each factor variable in R

I understand R assigns values to a factor vector alphabetically. In the following example:
x <- as.factor(c("A","B","C","A","A","A","A","A","A","B","C","B","C","B","C","B","C"))
str(x)
This prints
Factor w/ 3 levels "A","B","C": 1 2 3 1 1 1 1 1 1 2 ...
Since I have only three levels, it is easy to understand the level-value association, i.e., A = 1, B = 2, and so on.
In a scenario where I have hundreds of factors, is there an easier way to print a table that displays all the factor levels along with their values, like this:
Levels Values
A 1
B 2
C 3

Why do you want to know the underlying numeric values that R assigns to each factor level? I ask because this generally wouldn't be an important thing to keep track of. Can you say more about what you're trying to accomplish? We may be able to provide additional advice if we know more about the underlying problem you're trying to solve. For now, below are examples of how to do what you ask that also show why the results might not be what you expect.
Do all the columns in your data frame have different combinations of the same underlying categories? If not, what you're asking for could give unexpected and undesirable results. Below are a couple of examples, based on a fake data frame with 3 factor columns, two of which are upper case letters and one of which is lower case letters.
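# Fake data
set.seed(2)
x = c("C","A","B","C","A","A","A","A","A","A","B","C","B","C","B","C","B","C")
# stringsAsFactors=TRUE is needed in R >= 4.0.0, where character columns
# are no longer converted to factors by default
dat = data.frame(x=x,
                 y=sample(LETTERS[1:5], length(x), replace=TRUE),
                 z=sample(letters[1:3], length(x), replace=TRUE),
                 w=rnorm(length(x)),
                 stringsAsFactors=TRUE)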
Note that the numeric codes assigned to each factor level are not unique across columns. The lower case letters and the upper case letters can both have factor codes 1 through 3.
# Return a list with factor levels and numeric codes for each factor column
lapply(dat[ , sapply(dat, is.factor), drop=FALSE], function(v) {
  # levels(v) lists every level (used or not); its position is the integer code
  data.frame(Levels=levels(v),
             Values=seq_along(levels(v)))
})
$x
Levels Values
1 A 1
2 B 2
3 C 3
$y
Levels Values
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
$z
Levels Values
1 a 1
2 b 2
3 c 3
Another potential complication is whether the order of the factor levels is the same for different columns. As an example, let's change the factor order for one of the upper case columns. This creates a new issue: the same letter can have a different code value in different columns, and the same code can be assigned to different letters. For example, A has code 1 in column x and code 5 in column y. Furthermore, code 1 is assigned to E in column y, rather than to A.
dat$y = factor(dat$y, levels = LETTERS[5:1])
# Return a list with factor levels and numeric codes for each factor column
lapply(dat[ , sapply(dat, is.factor), drop=FALSE], function(v) {
  # levels(v) lists every level (used or not); its position is the integer code
  data.frame(Levels=levels(v),
             Values=seq_along(levels(v)))
})
$x
Levels Values
1 A 1
2 B 2
3 C 3
$y
Levels Values
1 E 1
2 D 2
3 C 3
4 B 4
5 A 5
$z
Levels Values
1 a 1
2 b 2
3 c 3
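If a single table is easier to scan than a list when there are hundreds of factor columns, the same idea can be sketched as one long data frame with a column identifying where each level came from. The helper name factor_codes and the demo data frame below are hypothetical, made up for illustration only.

```r
# Hypothetical helper: one long table of level/code pairs for every factor column
factor_codes <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  out <- lapply(names(df)[is_fac], function(nm) {
    lv <- levels(df[[nm]])
    # The position of a level in levels() is its integer code
    data.frame(Column = nm, Level = lv, Value = seq_along(lv),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

# Small made-up data frame: two factor columns and one numeric column
demo <- data.frame(
  x = factor(c("C", "A", "B")),
  y = factor(c("A", "E", "D"), levels = LETTERS[5:1]),
  w = c(0.1, 0.2, 0.3)
)
codes <- factor_codes(demo)
print(codes)
```

Because the Column name travels with each row, the same letter having different codes in different columns stays visible rather than being hidden.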

Related

How can I compare two factors with different levels?

Is it possible to compare two factors of same length, but different levels? For example, if we have these 2 factor variables:
A <- factor(1:5)
str(A)
Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
B <- factor(c(1:3,6,6))
str(B)
Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4
If I try to compare them using, for example, the == operator:
mean(A == B)
I get the following error:
Error in Ops.factor(A, B) : level sets of factors are different
Convert to character then compare:
# data
A <- factor(1:5)
B <- factor(c(1:3,6,6))
str(A)
# Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
str(B)
# Factor w/ 4 levels "1","2","3","6": 1 2 3 4 4
mean(A == B)
# Error in Ops.factor(A, B) : level sets of factors are different
mean(as.character(A) == as.character(B))
# [1] 0.6
Or another approach would be
mean(levels(A)[A] == levels(B)[B])
which is 2 times slower on a 1e8 dataset.
Converting to character as in @zx8754's answer is the easiest solution to this problem, and probably the one you'd want to use almost always. Another option, though, is to correct the 2 variables so that they have the same levels. You might want to do this if you want to keep these variables as factors for some reason and don't want to clog up your code with repeated calls to as.character.
A <- factor(1:5)
B <- factor(c(1:3,6,6))
mean(A == B)
Error in Ops.factor(A, B) : level sets of factors are different
We can take the union of the levels of both factors to get all levels in either factor, and then remake the factors using that union as the levels. Now, even though the 2 factors have different values, they share the same levels and you can compare them:
C = factor(A, levels = union(levels(A), levels(B)))
D = factor(B, levels = union(levels(A), levels(B)))
mean(C==D)
[1] 0.6
As you can see, the values are unchanged, but the levels are now identical.
C
[1] 1 2 3 4 5
Levels: 1 2 3 4 5 6
D
[1] 1 2 3 6 6
Levels: 1 2 3 4 5 6
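If this comes up repeatedly, the union-of-levels step can be wrapped in a small function. This is only a sketch; the name align_levels is hypothetical and not part of base R.

```r
# Hypothetical helper: rebuild both factors on the union of their levels
align_levels <- function(f, g) {
  lv <- union(levels(f), levels(g))
  list(f = factor(f, levels = lv), g = factor(g, levels = lv))
}

A <- factor(1:5)
B <- factor(c(1:3, 6, 6))
aligned <- align_levels(A, B)

# Both factors now share one level set, so == no longer errors
identical(levels(aligned$f), levels(aligned$g))
mean(aligned$f == aligned$g)
```

The values themselves are untouched; only the level sets are widened so that the comparison is allowed.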

Find frequencies of unique values of one vector in a different vector

If I have a vector of observed values X and a vector of reference values Y, how do I use R to find the frequencies of each value of Y in X?
# create X and Y
X = c(1,2,4,5,1,4)
Y = 1:6
# desired output
Y X
1 2
2 1
3 0
4 2
5 1
6 0
I know how to find the frequencies of values of X, or what values of Y are in X, but this is proving (embarrassingly) difficult. I apologise if this has been asked before, but I am struggling to find similar questions.
I have tried
# 'count' in the "plyr" package
count(X , "unique(Y)" )
...but this returns:
unique.Y. freq
1 1
2 1
3 1
4 1
5 1
6 1
Thanks!
We convert 'X' to a factor, specifying the levels as the unique elements of 'Y'. (In this case 'Y' has only 6 unique elements, but if it contained duplicates, use levels = unique(Y).) We then tabulate 'Y' against the transformed 'X' and take the colSums.
colSums(table(Y,factor(X, levels=Y)))
# 1 2 3 4 5 6
# 2 1 0 2 1 0
Or, as @docendodiscmus mentioned, we can apply table directly to the transformed 'X' to get the output (using this example)
table(factor(X, levels = Y))
Or use xtabs. By default, it sums the left-hand side within each group. Here, we convert 'Y' to a logical vector (all TRUE, since no element is 0) so that the sum gives the frequency directly.
xtabs(as.logical(Y)~factor(X, levels=Y))
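To get the two-column layout shown in the question rather than a named table, the counts can be placed in a data frame. This is a minimal sketch building on the table(factor(...)) approach; the variable names counts and result are just for illustration.

```r
X <- c(1, 2, 4, 5, 1, 4)
Y <- 1:6

# Count occurrences of each reference value; values of Y absent from X get 0
counts <- table(factor(X, levels = unique(Y)))

# Arrange as the Y / X table from the question
result <- data.frame(Y = unique(Y), X = as.vector(counts))
print(result)
```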

How to retain/reassign the factor levels for a data.frame in R?

I have a large dataset that I am using to train a machine learning algorithm in R. After all the data preprocessing, I have a data frame that contains factors and numeric values. I split this dataset into a training set and a test set, and save them to file with write.csv().
When I read back the test.csv and train.csv files it may happen that some of the factors have lost levels. This makes some of the algorithms fail when creating design matrices.
Here is a detailed example. Assume that originally I had a dataset with 12 rows that I split into a training set of 8 rows and a test set of 4 rows. I save the 8 training rows to train.csv and the 4 rows to test.csv. Note that factor2 has levels (a,b,c,d) in train.csv:
factor1 factor2 value
1 1 a 1
2 2 b 0
3 3 c 1
4 4 d 1
5 2 a 0
6 4 c 1
7 3 b 0
8 1 a 1
but only (a,b,c) in test.csv:
factor1 factor2
1 4 a
2 2 b
3 4 c
4 1 a
The same goes for factor1: level 3 is missing in the test set.
When I read back the file test.csv, R assumes that factor1 has levels (1,2,4) and factor2 has levels (a,b,c). I would like to find a way to tell R the actual levels.
The solution I thought of is to save the levels at the beginning, from the original dataset with 12 rows, and then reassign them after reading train.csv and test.csv.
I would like to avoid using the save() method from R, because the datasets that I am creating may go to other languages/packages.
Thanks!
In R, subsetting should keep all factor levels in a vector. Here let's imagine a is our data, column a is our categorical variable, and b is our response:
a <- data.frame(a = c("a", "b", "c"), b = c(1, 2, 3), stringsAsFactors = TRUE)  # needed in R >= 4.0.0 for column a to be a factor
z <- a[1:2, ]
z$a
[1] a b
Levels: a b c
If you are losing factor levels when subsetting into train and test sets, you need a better way of subsetting.
If your problem is writing a .csv, you probably want to include the missing levels as rows with NA in the response column. There are many ways to do this; here's a merge:
merge(data.frame(a = levels(z$a)), z, all=TRUE)
a b
1 a 1
2 b 2
3 c NA
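Edit: From your example, using the first data as dat and the second as dat2:
dat2$factor1 <- factor(dat2$factor1, levels = levels(dat$factor1))
dat2$factor2 <- factor(dat2$factor2, levels = levels(dat$factor2))
would be the easiest way. Note that assigning with levels(dat2$factor1) <- levels(dat$factor1) is not equivalent: it relabels the existing levels by position, so with levels (1,2,4) in dat2 the value 4 would silently become 3.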
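The asker's own idea, saving the full level sets once and reapplying them after reading the CSV back, can be sketched without save(), so that the files stay readable from other tools. The file layout (a second CSV with columns named column and level) and the variable names are assumptions made for this example, not an established convention.

```r
# Made-up stand-in for the original 12-row dataset (factor columns only)
full <- data.frame(
  factor1 = factor(c(1, 2, 3, 4, 2, 4, 3, 1, 4, 2, 4, 1)),
  factor2 = factor(c("a", "b", "c", "d", "a", "c", "b", "a", "a", "b", "c", "a"))
)
test <- full[9:12, ]  # the test rows lack level 3 and level d

# Write the test rows, plus one plain CSV holding every factor's full level set
test_file <- tempfile(fileext = ".csv")
lev_file <- tempfile(fileext = ".csv")
write.csv(test, test_file, row.names = FALSE)
fac_cols <- names(full)[vapply(full, is.factor, logical(1))]
lev <- do.call(rbind, lapply(fac_cols, function(nm) {
  data.frame(column = nm, level = levels(full[[nm]]))
}))
write.csv(lev, lev_file, row.names = FALSE)

# Read both files back and rebuild each factor with its saved levels
test2 <- read.csv(test_file, stringsAsFactors = FALSE)
lev2 <- read.csv(lev_file, colClasses = "character")
for (nm in unique(lev2$column)) {
  # factor(x, levels = ...) matches values to levels by label, not by position
  test2[[nm]] <- factor(test2[[nm]], levels = lev2$level[lev2$column == nm])
}
levels(test2$factor1)
levels(test2$factor2)
```

Rebuilding with factor(x, levels = ...) keeps every value attached to its own label while restoring the levels that happen to be absent from the file.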

I want to do data.frame1[1,] <- data.frame[1,], but I ran into trouble

Consider this sample data
df1<-data.frame(c(1,2,1),c(3,3,2),c(2,5,8))
df2<-data.frame("a","a","a")
The result that I want is
> df1
1 2 3
1 a a a
2 2 3 5
3 1 2 8
but after I do this: df1[1,] <- df2[1,]
> df1
1 2 3
1 1 1 1
2 2 3 5
3 1 2 8
Why? What should I do to get the result I want?
All values within a single column of a data frame must have the same type. The key thing here is that the values in df2 are factors, not characters (because stringsAsFactors = TRUE, which was the default before R 4.0.0). Factors have an underlying integer representation, so when you combine a factor and a numeric in the same vector, the factor is reduced to its integer codes. The first level of a factor corresponds to 1, which is why a became 1.
Regarding the factor vs character type conversion, note the following:
c("a", 2, 3)
## "a" "2" "3"
c(factor("a"), 2, 3)
## 1 2 3
@Chaconne's answer gives a good explanation. If you really want to do what you say you want, you can do this:
df1<-data.frame(v1=c(1,2,1),v2=c(3,3,2),v3=c(2,5,8))
df2<-data.frame("a","a","a",stringsAsFactors=FALSE)
df1[1,] <- df2[1,]
but it will convert ("coerce") all of your data to character type, which is probably not what you want ...
Perhaps you want names(df1) <- df2[1,] ?
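The coercion described above can be checked by comparing column classes before and after the assignment. This is a small sketch; the column names v1 to v3 on df2 and the variables before and after are added here for illustration.

```r
df1 <- data.frame(v1 = c(1, 2, 1), v2 = c(3, 3, 2), v3 = c(2, 5, 8))
df2 <- data.frame(v1 = "a", v2 = "a", v3 = "a", stringsAsFactors = FALSE)

before <- sapply(df1, class)  # all "numeric"
df1[1, ] <- df2[1, ]
after <- sapply(df1, class)   # all "character": each whole column was coerced

print(before)
print(after)
print(df1$v1)                 # the remaining numbers are now strings as well
```

After the assignment, arithmetic on these columns no longer works without converting back, which is why this is rarely what you want.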

How will factor levels be ordered with regard to the original values?

If I create a factor from a vector of numerical values, will the factor categories be ordered automatically by the values that are now considered categories?
i.e. [1,4,7,3,2] -> categories = {1,2,3,4,7}
Short answer: yes.
Long answer: it depends. R will sort the unique values and assign the categories in that order if you convert a vector using the function factor() without supplying any extra arguments:
> x <- c(3,1,4,5,1,4)
> factor(x)
[1] 3 1 4 5 1 4
Levels: 1 3 4 5
It won't however when you use the argument levels:
> factor(x, levels=unique(x))
[1] 3 1 4 5 1 4
Levels: 3 1 4 5
In this case, it takes the order of the levels as the order in which it assigns categories.
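One related detail worth checking is that the default sort happens on the original values, before they are turned into level labels, so numeric input is ordered numerically while the same values stored as strings are ordered as text. The sketch below uses a made-up vector num to show this, along with how to recover the original numbers from the factor.

```r
num <- c(10, 2, 1, 2)

levels(factor(num))                # numeric sort: "1" "2" "10"
levels(factor(as.character(num)))  # character sort: "1" "10" "2"

f <- factor(num)
as.integer(f)                # internal codes (positions in levels): 3 2 1 2
as.numeric(as.character(f))  # original values: 10 2 1 2
```

Calling as.numeric() directly on the factor would return the codes rather than the values, so going through as.character() first is the safe route.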
