I have a data set in which one column is a factor with the levels "a", "b", "c" and "NotPerformed". How can I change all the "NotPerformed" values to NA?
Set the level to NA:
x <- factor(c("a", "b", "c", "NotPerformed"))
x
## [1] a b c NotPerformed
## Levels: a b c NotPerformed
levels(x)[levels(x)=='NotPerformed'] <- NA
x
## [1] a b c <NA>
## Levels: a b c
Note that the factor level is removed.
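Applied to a column of a data frame, the same assignment might look like the sketch below. The data frame df and its column name result are made up for illustration; only base R is used.

```r
# Hypothetical data: `df` and the column name `result` are assumptions.
df <- data.frame(
  result = factor(c("a", "b", "NotPerformed", "c", "NotPerformed"))
)

# Relabel the unwanted level as NA; the level itself disappears.
levels(df$result)[levels(df$result) == "NotPerformed"] <- NA

levels(df$result)      # "a" "b" "c"
sum(is.na(df$result))  # 2
```

The rows are kept; only their value in that column becomes NA.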
I have revised my old answer to reflect what you can do as of September 2016. The dplyr package now provides recode_factor(), which does the job.
x <- factor(c("a", "b", "c", "NotPerformed"))
# [1] a b c NotPerformed
# Levels: a b c NotPerformed
library(dplyr)
recode_factor(x, NotPerformed = NA_character_)
# [1] a b c <NA>
# Levels: a b c
Or simply use the inbuilt exclude option, which works regardless of whether the initial variable is a character or factor.
x <- c("a", "b", "c", "NotPerformed")
factor(x, exclude = "NotPerformed")
# [1] a b c <NA>
# Levels: a b c
factor(factor(x), exclude = "NotPerformed")
# [1] a b c <NA>
# Levels: a b c
Set one of the levels to NA through tidyverse Pipeline, %>%.
This may serve better as a comment, but I do not have enough reputation.
In my case, the income variable is int with values of c(1:7, 9). Among the levels, "9" represents "Do not wish to answer".
library(dplyr)
library(forcats)
## convert every integer column to a factor, then turn the level "9" into NA
New_data <- data %>% mutate_if(is.integer, as.factor) %>%
  mutate(income = fct_recode(income, NULL = "9"))
I also tried recode(), but it did not work.
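For comparison, here is a base-R sketch of the same idea that needs no packages. The data frame data and its income values are invented for illustration; the approach is just the levels() assignment shown earlier in this thread, not something fct_recode() does internally.

```r
# Hypothetical survey data: 9 stands for "Do not wish to answer".
data <- data.frame(income = c(1L, 3L, 9L, 7L, 9L, 2L))

# Convert the integer column to a factor, then relabel level "9" as NA.
data$income <- factor(data$income)
levels(data$income)[levels(data$income) == "9"] <- NA

levels(data$income)      # "1" "2" "3" "7"
sum(is.na(data$income))  # 2
```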
I know that adding new levels to an object of class factor is pretty straightforward. However, when I put the factor level to-be-added in the first position in the list, the actual values in the object (vector) change.
Here is what I am talking about:
test <- factor(c("a", "a", "a", "b", "c", "a", "c", "b"))
test
#[1] a a a b c a c b
#Levels: a b c
levels(test)
#[1] "a" "b" "c"
## Works OK
levels(test) <- c(levels(test), "d")
test
#[1] a a a b c a c b
#Levels: a b c d
levels(test) <- c("d", levels(test))
## The values have changed
test
#[1] d d d a b d b a
#Levels: d a b c
I'm just curious why the position of the new level in the list changes not only the levels but also the values of the factor itself.
The levels of a factor are the character strings associated with an underlying integer-valued variable (an enumeration).
If we examine the underlying structure of this variable:
test <- factor(c("a", "a", "a", "b", "c", "a", "c", "b"))
we see:
str(test)
## Factor w/ 3 levels "a","b","c": 1 1 1 2 3 1 3 2
Assigning to levels() attaches labels to the integer codes in order: levels(test) <- c("d","a","b","c") makes the correspondence 1 <-> "d", 2 <-> "a", 3 <-> "b", 4 <-> "c". Thus the elements with an underlying code of 1 (the first through third and sixth elements of the vector) now carry the label "d".
A safer way to add a new level would be:
test <- factor(test,levels=c("d","a","b","c"))
test
## [1] a a a b c a c b
## Levels: d a b c
str(test)
## Factor w/ 4 levels "d","a","b","c": 2 2 2 3 4 2 4 3
This changes the order of the levels (which matters for plotting and parameterizing statistical models), but it uses the character values when assigning integer values ...
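The difference between the two approaches can be checked by looking at the integer codes and the labels side by side. This is a small sketch; the variable names relabelled and rebuilt are only for illustration.

```r
test <- factor(c("a", "a", "a", "b", "c", "a", "c", "b"))
as.integer(test)           # 1 1 1 2 3 1 3 2

# Assigning to levels(): codes stay the same, labels are reattached in order.
relabelled <- test
levels(relabelled) <- c("d", levels(test))
as.integer(relabelled)     # 1 1 1 2 3 1 3 2
as.character(relabelled)   # "d" "d" "d" "a" "b" "d" "b" "a"

# Rebuilding with factor(): values are matched by label, codes are recomputed.
rebuilt <- factor(test, levels = c("d", levels(test)))
as.integer(rebuilt)        # 2 2 2 3 4 2 4 3
as.character(rebuilt)      # "a" "a" "a" "b" "c" "a" "c" "b"
```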
I have a data frame with factor variables
> a <- c("a", "b", "c")
> b <- c("c", "b", "a")
> df <- as.data.frame(cbind(a,b))
> df$a <- as.factor(df$a)
> df$b <- as.factor(df$b)
> df
a b
1 a c
2 b b
3 c a
I create new logical variable based on the similarity of var a and var b.
> df$result <- isTRUE(df$a == df$b)
But I get the result:
> df
a b result
1 a c FALSE
2 b b FALSE
3 c a FALSE
When I expected
> df
a b result
1 a c FALSE
2 b b TRUE
3 c a FALSE
(I'm using factors to replicate my real data)
What am I doing wrong? How can I achieve my goal of identifying similar variables? Thanks
Just do
df$result <- with(df, a==b)
df
# a b result
#1 a c FALSE
#2 b b TRUE
#3 c a FALSE
The a==b already returns a logical vector and we don't need isTRUE to wrap it.
As @Frank mentioned in the comments, it is safer to compare character columns, because comparing factors whose level sets differ raises an error. We can either convert the factors to character before comparing
with(df, as.character(a)==as.character(b))
or give both columns the same levels
Un1 <- union(levels(df$a), levels(df$b))
df[] <- lapply(df, factor, levels=Un1)
with(df, a==b)
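To see why the original attempt gave FALSE in every row, and what goes wrong when the level sets differ, here is a short sketch. The vectors p and q are made up for illustration.

```r
a <- factor(c("a", "b", "c"))
b <- factor(c("c", "b", "a"))

a == b           # FALSE  TRUE FALSE
# isTRUE() is TRUE only for a single TRUE value, so a length-3 vector
# gives one FALSE, which is then recycled down the whole column.
isTRUE(a == b)   # FALSE

# Factors with different level sets cannot be compared directly.
p <- factor(c("a", "b"))
q <- factor(c("a", "z"))
res <- tryCatch(p == q, error = function(e) e)
inherits(res, "error")                 # TRUE

# Converting to character first avoids the problem.
as.character(p) == as.character(q)     # TRUE FALSE
```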
I have a dataframe, for example like this:
df <- data.frame(ID=c(8, 2, 5, 1, 4), value=c("a", "b", "c", "d", "e"))
ID value
1 8 a
2 2 b
3 5 c
4 1 d
5 4 e
I know how to select rows with a given value in the "ID" column. But how do I get rows by their "ID" values in a specified order?
Example: How do I extract "value" for the rows with ID 4, 2 and 5, in that order? The result I want to get is "e", "b", "c".
Using %in% gives me the results in the wrong order:
df[df$ID %in% c(4, 2, 5), "value"]
[1] b c e
Levels: a b c d e
I found a workaround using rownames, but I feel like there must be a better solution to this.
# workaround
rownames(df) <- df$ID
df[as.character(c(4, 2, 5)), "value"]
[1] e b c
Levels: a b c d e
Any suggestions?
Thank you!
You can use merge and then order by a newly introduced rank column:
dat = merge(df,data.frame(ID=c(4,2,5),v=1:3))
dat[order(dat$v),"value"]
[1] e b c
Or as a one-liner:
with(merge(df,data.frame(ID=c(4,2,5),v=1:3)),value[order(v)])
sapply(c(4,2,5), function(x) df[df$ID==x,"value"])
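Another option, not mentioned in the answers above, is base R's match(), which returns the row position of each requested ID in the order requested. This is a minimal sketch; the variable name wanted is only for illustration, and it assumes each ID occurs once in df.

```r
df <- data.frame(ID = c(8, 2, 5, 1, 4),
                 value = c("a", "b", "c", "d", "e"))
wanted <- c(4, 2, 5)

# match() gives the row index of each wanted ID, in the order of `wanted`.
idx <- match(wanted, df$ID)              # 5 2 3
result <- as.character(df$value[idx])    # "e" "b" "c"
result
```

An ID that is absent from df would yield NA in idx, so it may be worth checking anyNA(idx) before subsetting.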
I'm quite confused on when to use
factor(educ) or factor(agegroup)
in R. Is it used for ordered categorical data, or can I also use it on simple categorical data with no hierarchy?
I know this is so basic. I really need some clarification.
I don't really see a clear question here, so perhaps a simple example would suffice as an answer.
Imagine we have the following data.
set1 <- c("AA", "B", "BA", "CC", "CA", "AA", "BA", "CC", "CC")
We want to factor this data.
f.set1 <- factor(set1)
Let's look at the output. Note that R has just alphabetized the levels, but does not say that this implies hierarchy (see the "levels" line).
f.set1
# [1] AA B BA CC CA AA BA CC CC
# Levels: AA B BA CA CC
is.ordered(f.set1)
# [1] FALSE
However, using as.numeric on the factored data might fool you into thinking it is hierarchical. Note that "5" comes before "4" in the output below, and note also the alphabetized output of table(f.set1) (which also happens if you simply run table(set1)).
as.numeric(f.set1)
# [1] 1 2 3 5 4 1 3 5 5
table(f.set1)
# f.set1
# AA B BA CA CC
# 2 1 2 1 3
Let's now compare this with what happens when we use the ordered argument along with the levels argument. Using levels plus ordered = TRUE tells us that this categorical data is hierarchical, in the order specified by levels (not alphabetically or in the order that we've entered the data).
o.set1 <- factor(set1,
levels = c("CA", "BA", "AA", "CC", "B"),
ordered = TRUE)
Even viewing the output shows us hierarchy now.
o.set1
# [1] AA B BA CC CA AA BA CC CC
# Levels: CA < BA < AA < CC < B
is.ordered(o.set1)
# [1] TRUE
As do the functions as.numeric and table.
as.numeric(o.set1)
# [1] 3 5 2 4 1 3 2 4 4
table(o.set1)
# o.set1
# CA BA AA CC B
# 1 2 2 3 1
So, to summarize, factor() by itself essentially creates a non-hierarchical factor whose levels are sorted; factor() with the levels and ordered = TRUE arguments creates hierarchical categories.
Alternatively, use ordered() if you want to create ordered factors directly. The order of the categories still needs to be specified:
ordered(set1, levels = c("CA", "BA", "AA", "CC", "B"))
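One practical consequence of the ordering is that comparison operators and max() become meaningful. The sketch below uses an invented educ vector (the category names are assumptions, not from the question) to contrast the two kinds of factor.

```r
# Hypothetical education data; the category names are made up.
educ <- c("primary", "tertiary", "secondary", "primary")

educ_o <- factor(educ,
                 levels = c("primary", "secondary", "tertiary"),
                 ordered = TRUE)

# Ordered factor: comparisons follow the order given in `levels`.
educ_o < "tertiary"          # TRUE FALSE  TRUE  TRUE
as.character(max(educ_o))    # "tertiary"

# Unordered factor: `<` is not meaningful, so R warns and returns NA.
educ_f <- factor(educ)
suppressWarnings(educ_f < "tertiary")   # NA NA NA NA
```

So plain factor() is fine for categories with no hierarchy, and the ordered version is worth using when "less than" has a meaning for the data.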
You can flag a factor as ordered by creating it with ordered(x) or with factor(x, ordered=TRUE). The "Details" section of ?factor explains that:
Ordered factors differ from factors only in their class, but
methods and the model-fitting functions treat the two classes
quite differently.
You can confirm the first part of that quote (that they differ only in their class) by comparing the attributes of these two objects:
f <- factor(letters[3:1], levels=letters[3:1])
of <- ordered(letters[3:1], levels=letters[3:1])
attributes(f)
# $levels
# [1] "c" "b" "a"
#
# $class
# [1] "factor"
attributes(of)
# $levels
# [1] "c" "b" "a"
#
# $class
# [1] "ordered" "factor"
Various factor-handling R functions (the "methods and model-fitting functions" of the second part of that quote) will then use is.ordered() to test for the presence of that "ordered" class indicator, taking it as a directive to treat an ordered factor differently than an unordered one. Here are a couple of examples:
## The print method for factors. (Type 'print.factor' to see the function's code)
print(f)
# [1] c b a
# Levels: c b a
print(of)
# [1] c b a
# Levels: c < b < a
## The contrasts function. (Type 'contrasts' to see the function's code.)
contrasts(of)
# .L .Q
# [1,] -7.071068e-01 0.4082483
# [2,] 4.350720e-18 -0.8164966
# [3,] 7.071068e-01 0.4082483
contrasts(f)
# b a
# c 0 0
# b 1 0
# a 0 1
Let's say I have a data frame like this:
df <- data.frame(a=letters[1:26],1:26)
And I would like to "re" factor a, b, and c as "a".
How do I do that?
One option is the recode() function in package car:
require(car)
df <- data.frame(a=letters[1:26],1:26)
df2 <- within(df, a <- recode(a, 'c("a","b","c")="a"'))
> head(df2)
a X1.26
1 a 1
2 a 2
3 a 3
4 d 4
5 e 5
6 f 6
Example where a is not so simple and we recode several levels into one.
set.seed(123)
df3 <- data.frame(a = sample(letters[1:5], 100, replace = TRUE),
b = 1:100)
with(df3, head(a))
with(df3, table(a))
the last lines giving:
> with(df3, head(a))
[1] b d c e e a
Levels: a b c d e
> with(df3, table(a))
a
a b c d e
19 20 21 22 18
Now let's combine levels a and e into level Z using recode():
df4 <- within(df3, a <- recode(a, 'c("a","e")="Z"'))
with(df4, head(a))
with(df4, table(a))
which gives:
> with(df4, head(a))
[1] b d c Z Z Z
Levels: b c d Z
> with(df4, table(a))
a
b c d Z
20 21 22 37
Doing this without spelling out the levels to merge:
## Select the levels you want (here 'a' and 'e')
lev.want <- with(df3, levels(a)[c(1,5)])
## now paste together
lev.want <- paste(lev.want, collapse = "','")
## then bolt on the extra bit
codes <- paste("c('", lev.want, "')='Z'", sep = "")
## then use within recode()
df5 <- within(df3, a <- recode(a, codes))
with(df5, table(a))
Which gives us the same as df4 above:
> with(df5, table(a))
a
b c d Z
20 21 22 37
Has anyone tried using this simple method? It requires no special packages, just an understanding of how R treats factors.
Say you want to rename some of the levels of a factor. First, get the levels:
data <- data.frame(a=letters[1:26],1:26, stringsAsFactors=TRUE)
lalpha <- levels(data$a)
In this example we want to know the indices of the levels 'e' and 'w':
ind <- c(which(lalpha == 'e'), which(lalpha == 'w'))
Now we can use these indices to replace the levels of the factor 'a':
levels(data$a)[ind] <- 'X'
If you now look at the factor a in the data frame, there will be an X wherever there was an e or a w.
I leave it to you to try the result.
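For readers who want to check the result without running it step by step, here is a self-contained sketch of the same approach. It assumes stringsAsFactors = TRUE so that the column is a factor in current versions of R, and it names the second column n for convenience.

```r
data <- data.frame(a = letters[1:26], n = 1:26, stringsAsFactors = TRUE)

# Positions of the levels to rename.
lalpha <- levels(data$a)
ind <- which(lalpha %in% c("e", "w"))

# Giving two levels the same label merges them into one.
levels(data$a)[ind] <- "X"

nlevels(data$a)                    # 25
as.character(data$a[c(5, 23)])     # "X" "X"
sum(data$a == "X")                 # 2
```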
You could do something like:
df$a[df$a %in% c("a","b","c")] <- "a"
UPDATE: More complicated factors.
Data <- data.frame(a=sample(c("Less than $50,000","$50,000-$99,999",
"$100,000-$249,999", "$250,000-$500,000"),20,TRUE),n=1:20)
rows <- Data$a %in% c("$50,000-$99,999", "$100,000-$249,999")
Data$a[rows] <- "$250,000-$500,000"
There are two ways.
If you don't want to drop the unused levels, i.e. "b" and "c", Joshua's solution is probably best.
If you want to drop the unused levels, then
df$a<-factor(ifelse(df$a%in%c("a","b","c"),"a",as.character(df$a)))
or
levels(df$a)<-ifelse(levels(df$a)%in%c("a","b","c"),"a",levels(df$a))
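The point about unused levels can be seen in a short sketch. It assumes the column is a factor (hence stringsAsFactors = TRUE) and uses base R's droplevels(), which is an alternative to the two lines above rather than part of them.

```r
df <- data.frame(a = letters[1:26], n = 1:26, stringsAsFactors = TRUE)

# Overwrite the values only: "b" and "c" remain as empty levels.
df$a[df$a %in% c("a", "b", "c")] <- "a"
nlevels(df$a)            # 26
sum(df$a == "a")         # 3

# droplevels() removes levels that no longer occur in the data.
df$a <- droplevels(df$a)
nlevels(df$a)            # 24
```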
This is a simplified version of the chosen answer:
I've found that the easiest way to deal with this is to look at the factor levels, note the positions of the ones to be replaced, and overwrite them.
df <- data.frame(a=letters[1:26],1:26, stringsAsFactors=TRUE)
levels(df$a)
#  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
# [16] "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
levels(df$a)[c(1,2)] <- "c"
summary(df$a)
# c d e f g h i j k l m n o p q r s t u v w x y z
# 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1