Grouping/recoding factors in the same data.frame - r

Let's say I have a data frame like this:
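# stringsAsFactors=TRUE is needed in R >= 4.0 for a to be a factor
df <- data.frame(a=letters[1:26],1:26, stringsAsFactors=TRUE)
I would like to recode the levels "a", "b", and "c" of the factor a into a single level "a".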
How do I do that?

One option is the recode() function in package car:
require(car)
df <- data.frame(a=letters[1:26],1:26)
df2 <- within(df, a <- recode(a, 'c("a","b","c")="a"'))
> head(df2)
a X1.26
1 a 1
2 a 2
3 a 3
4 d 4
5 e 5
6 f 6
Here is an example where a is less trivial and several levels are recoded into one:
set.seed(123)
df3 <- data.frame(a = sample(letters[1:5], 100, replace = TRUE),
b = 1:100, stringsAsFactors = TRUE)
with(df3, head(a))
with(df3, table(a))
The last two lines give:
> with(df3, head(a))
[1] b d c e e a
Levels: a b c d e
> with(df3, table(a))
a
a b c d e
19 20 21 22 18
Now let's combine levels a and e into level Z using recode():
df4 <- within(df3, a <- recode(a, 'c("a","e")="Z"'))
with(df4, head(a))
with(df4, table(a))
which gives:
> with(df4, head(a))
[1] b d c Z Z Z
Levels: b c d Z
> with(df4, table(a))
a
b c d Z
20 21 22 37
To do this without spelling out the levels to merge, build the recode string programmatically:
## Select the levels you want (here 'a' and 'e')
lev.want <- with(df3, levels(a)[c(1,5)])
## now paste together
lev.want <- paste(lev.want, collapse = "','")
## then bolt on the extra bit
codes <- paste("c('", lev.want, "')='Z'", sep = "")
## then use within recode()
df5 <- within(df3, a <- recode(a, codes))
with(df5, table(a))
Which gives us the same as df4 above:
> with(df5, table(a))
a
b c d Z
20 21 22 37
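For comparison, the same merge can be sketched in base R without building a recode string, by assigning to the matching levels directly. This is only a minimal sketch, not a replacement for recode(); the names f and g are made up for illustration, and the factor is built explicitly so the result does not depend on the stringsAsFactors default.

```r
set.seed(123)
# Build the factor explicitly with all five levels.
f <- factor(sample(letters[1:5], 100, replace = TRUE), levels = letters[1:5])

# Select the levels to merge by position (here 'a' and 'e').
lev.want <- levels(f)[c(1, 5)]

# Work on a copy and overwrite every matching level with the new label.
g <- f
levels(g)[levels(g) %in% lev.want] <- "Z"

levels(g)   # four levels remain, one of them "Z"
table(g)    # the counts for 'a' and 'e' are pooled under "Z"
```

The position of "Z" among the levels may differ from what recode() produces, so compare the counts rather than the level order.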

Has anyone tried using this simple method? It requires no special packages, just an understanding of how R treats factors.
Say you want to rename some levels of a factor. First get the levels:
data <- data.frame(a=letters[1:26],1:26, stringsAsFactors=TRUE)
lalpha <- levels(data$a)
In this example we want the indices of the levels 'e' and 'w':
ind <- c(which(lalpha == 'e'), which(lalpha == 'w'))
Now we can use these indices to replace the levels of the factor 'a':
levels(data$a)[ind] <- 'X'
If you now look at the factor a in the data frame, there is an X wherever there was an e or a w.
I leave it to you to try the result.
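For anyone who wants to check without running it by hand, here is a small self-contained sketch of the steps above. It assumes the column is built with factor() explicitly (so it behaves the same in any R version), and the column name n is just a placeholder.

```r
# Build the factor explicitly rather than relying on stringsAsFactors.
data <- data.frame(a = factor(letters), n = 1:26)

# Positions of the levels to rename.
lalpha <- levels(data$a)
ind <- which(lalpha %in% c("e", "w"))

# Assigning the same label to both positions merges them into one level.
levels(data$a)[ind] <- "X"

nlevels(data$a)                  # one level fewer than before
as.character(data$a[c(5, 23)])   # the former 'e' and 'w' rows
table(data$a)[["X"]]             # number of rows now labelled "X"
```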

You could do something like the following (for a factor, the replacement value must already be one of its levels):
df$a[df$a %in% c("a","b","c")] <- "a"
UPDATE: More complicated factors.
Data <- data.frame(a=sample(c("Less than $50,000","$50,000-$99,999",
"$100,000-$249,999", "$250,000-$500,000"),20,TRUE),n=1:20)
rows <- Data$a %in% c("$50,000-$99,999", "$100,000-$249,999")
Data$a[rows] <- "$250,000-$500,000"

There are two ways.
If you don't want to drop the unused levels, i.e. "b" and "c", Joshua's solution is probably best.
If you want to drop the unused levels, then use
df$a<-factor(ifelse(df$a%in%c("a","b","c"),"a",as.character(df$a)))
or
levels(df$a)<-ifelse(levels(df$a)%in%c("a","b","c"),"a",levels(df$a))
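To make the difference concrete, here is a small sketch comparing the three approaches on a toy factor. The vector f and the names keep, rebuilt, and merged are made up for illustration.

```r
f <- factor(letters[1:5])

# Assigning values in place: the values change, the level set does not.
keep <- f
keep[keep %in% c("a", "b", "c")] <- "a"

# Rebuilding the factor from characters: unused levels disappear.
rebuilt <- factor(ifelse(f %in% c("a", "b", "c"), "a", as.character(f)))

# Overwriting the levels: duplicated labels are merged into one level.
merged <- f
levels(merged) <- ifelse(levels(merged) %in% c("a", "b", "c"), "a", levels(merged))

levels(keep)     # still contains the now-empty levels "b" and "c"
levels(rebuilt)  # only the levels that actually occur
levels(merged)   # same level set as rebuilt
```

All three give the same values; they differ only in which levels are kept.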

This is a simplified version of the chosen answer:
I've found that the easiest way to deal with this is to print the factor levels, note the positions of the ones to change, and overwrite those positions.
df <- data.frame(a=letters[1:26],1:26, stringsAsFactors=TRUE)
levels(df$a)
> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
"p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
levels(df$a)[c(1,2)] <- "c"
summary(df$a)
> c d e f g h i j k l m n o p q r s t u v w x y z
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Related

Is there a way to remove rows permanently in R data frames?

I'm looking for a function that lets me remove rows permanently from an R data frame.
For example:
> df<-data.frame(AA=LETTERS[1:5],
NN=c(NA, 12, 21, NA, 11), stringsAsFactors=TRUE)
> df
# AA NN
#1 A NA
#2 B 12
#3 C 21
#4 D NA
#5 E 11
When I use complete.cases, R only drops the rows from the data frame (df); it does not give me a df with updated levels and row names, as shown below:
> df<-df[complete.cases(df),]
> df
# AA NN
#2 B 12
#3 C 21
#5 E 11
> levels(df$AA)
#[1] "A" "B" "C" "D" "E"
What I want is the following df:
>df
# AA NN
#1 B 12
#2 C 21
#3 E 11
> levels(df$AA)
#[1] "B" "C" "E"
Is there a way to do this in R?
Thanks!
We can wrap the subset in droplevels to remove the unused levels, then reset the row names:
df <- droplevels(df[complete.cases(df),])
levels(df$AA)
#[1] "B" "C" "E"
row.names(df) <- NULL
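A quick self-contained check of both steps, as a sketch: the intermediate names sub and clean are made up here, and AA is built with factor() so the example does not depend on the stringsAsFactors default.

```r
df <- data.frame(AA = factor(LETTERS[1:5]), NN = c(NA, 12, 21, NA, 11))

# Subsetting alone keeps all original levels and the original row names.
sub <- df[complete.cases(df), ]
levels(sub$AA)
rownames(sub)

# droplevels() removes the unused levels; NULL resets row names to 1..n.
clean <- droplevels(sub)
row.names(clean) <- NULL
levels(clean$AA)
rownames(clean)
```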

Combine a list of data frames into one preserving row names

I know the basics of combining a list of data frames into one, as has been answered before. However, I am interested in smart ways to maintain row names. Suppose I have a list of data frames with similar structure that I keep in a named list.
library(plyr)
library(dplyr)
library(data.table)
a = data.frame(x=1:3, row.names = letters[1:3])
b = data.frame(x=4:6, row.names = letters[4:6])
c = data.frame(x=7:9, row.names = letters[7:9])
l = list(A=a, B=b, C=c)
When I use do.call, the list names are combined with the row names:
> rownames(do.call("rbind", l))
[1] "A.a" "A.b" "A.c" "B.d" "B.e" "B.f" "C.g" "C.h" "C.i"
When I use any of rbind.fill, bind_rows or rbindlist the row names are replaced by a numeric range:
> rownames(rbind.fill(l))
> rownames(bind_rows(l))
> rownames(rbindlist(l))
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
When I remove the names from the list, do.call produces the desired output:
> names(l) = NULL
> rownames(do.call("rbind", l))
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
So is there a function that I'm missing that provides some finer control over the row names? I do need the names for a different context so removing them is sub-optimal.
To preserve rownames, you can simply do:
do.call(rbind, unname(l))
# x
#a 1
#b 2
#c 3
#d 4
#e 5
#f 6
#g 7
#h 8
#i 9
Or, as you pointed out, by setting the names of l to NULL, which can also be done with:
do.call(rbind, setNames(l, NULL))
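Since the list names are still needed elsewhere, one option is to keep them in a column instead of in the row names. Here is a base R sketch of that idea; the column name source is an arbitrary choice for illustration, and l itself is left untouched because unname() returns a copy.

```r
a <- data.frame(x = 1:3, row.names = letters[1:3])
b <- data.frame(x = 4:6, row.names = letters[4:6])
c <- data.frame(x = 7:9, row.names = letters[7:9])
l <- list(A = a, B = b, C = c)

# Bind without the list names so the original row names survive.
combined <- do.call(rbind, unname(l))

# Record which list element each row came from.
combined$source <- rep(names(l), times = vapply(l, nrow, integer(1)))

rownames(combined)
combined$source
names(l)   # the list still has its names
```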
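We can use add_rownames from the dplyr package before binding (in current dplyr, add_rownames is deprecated in favor of tibble::rownames_to_column, and rbind_all has been replaced by bind_rows):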
rbind_all(lapply(l, add_rownames))
# Source: local data frame [9 x 2]
#
# rowname x
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
# 5 e 5
# 6 f 6
# 7 g 7
# 8 h 8
# 9 i 9
Why not just use rbind:
rbind(l$A, l$B, l$C)
Here is another solution that I have just found; it works well (and efficiently) when you have a large list and therefore big data frames.
df <- data.table::rbindlist(l)
# add a column with the rownames
df[,Col := unlist(lapply(l, rownames))]
df <- df %>% dplyr::select(Col, everything())
> df
Col x
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
See ?data.table::rbindlist for more information about rbindlist.

Sorting elements of a tie

Given this table
df <- data.frame(col1 = c(letters[3:5], "b","a"),
col2 = c(2:3, 1,1,1), stringsAsFactors = TRUE)
How can I tell R to return "a"?
That is, from the three characters with a value of 1 (a tie for the lowest value), I want to select only the first in alphabetical order.
I think you want order
with(df, col1[order(col2, col1)][1])
# [1] a
# Levels: a b c d e
or
as.character(with(df, col1[order(col2, col1)][1]))
# [1] "a"
You can sort the whole data frame by col2, breaking ties by col1, with
df[with(df, order(col2, col1)),]
# col1 col2
# 5 a 1
# 4 b 1
# 3 e 1
# 1 c 2
# 2 d 3
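One caveat: when col1 is a factor, order() sorts by level order, which matches alphabetical order only if the levels are in their default sorted order. Here is a sketch of the same tie-break on a plain character column, which avoids that dependence; the names tied, first, and first_by_order are made up for illustration.

```r
df <- data.frame(col1 = c("c", "d", "e", "b", "a"),
                 col2 = c(2, 3, 1, 1, 1),
                 stringsAsFactors = FALSE)

# Values of col1 in the rows tied for the lowest col2.
tied <- df$col1[df$col2 == min(df$col2)]

# Alphabetically first among the tied values.
first <- sort(tied)[1]

# Same answer via a two-key ordering: col2 first, then col1.
first_by_order <- df$col1[order(df$col2, df$col1)][1]

first
first_by_order
```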
Try:
> min(as.character(df[df$col2==min(df$col2),1]))
[1] "a"
For explanation:
# first find col1 list in rows with minimum of df$col2
> xx = df[df$col2==min(df$col2),1]
> xx
[1] e b a
Levels: a b c d e
# Now find the minimum amongst these after converting factor to character:
> min(as.character(xx))
[1] "a"
>

R how to change one of the levels to NA

I have a data set and one of its columns has the factor levels "a", "b", "c", and "NotPerformed". How can I change all the "NotPerformed" values to NA?
Set the level to NA:
x <- factor(c("a", "b", "c", "NotPerformed"))
x
## [1] a b c NotPerformed
## Levels: a b c NotPerformed
levels(x)[levels(x)=='NotPerformed'] <- NA
x
## [1] a b c <NA>
## Levels: a b c
Note that the factor level is removed.
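This differs from assigning NA to the matching values, which leaves the level itself in place. A short sketch of the contrast, where y and z are just illustrative copies of x:

```r
x <- factor(c("a", "b", "c", "NotPerformed", "a"))

# Assigning NA to the values: the level "NotPerformed" stays defined.
y <- x
y[y == "NotPerformed"] <- NA
"NotPerformed" %in% levels(y)

# Setting the level itself to NA: values become NA and the level is gone.
z <- x
levels(z)[levels(z) == "NotPerformed"] <- NA
"NotPerformed" %in% levels(z)

# droplevels() brings the first variant to the same level set.
levels(droplevels(y))
```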
I have revised my old answer to reflect what you can do as of September 2016. With the development of the dplyr package, you can now use recode_factor() to do the job.
x <- factor(c("a", "b", "c", "NotPerformed"))
# [1] a b c NotPerformed
# Levels: a b c NotPerformed
library(dplyr)
recode_factor(x, NotPerformed = NA_character_)
# [1] a b c <NA>
# Levels: a b c
Or simply use the built-in exclude argument of factor(), which works regardless of whether the initial variable is a character or a factor.
x <- c("a", "b", "c", "NotPerformed")
factor(x, exclude = "NotPerformed")
## [1] a b c <NA>
## Levels: a b c
factor(factor(x), exclude = "NotPerformed")
## [1] a b c <NA>
## Levels: a b c
Set one of the levels to NA in a tidyverse pipeline with %>%.
This may serve better as a comment, but I do not have enough reputation.
In my case, the income variable is an int with values c(1:7, 9). Among the levels, "9" represents "Do not wish to answer".
## when all int columns should be fctr
library(dplyr)
library(forcats)
New_data <- data %>% mutate_if(is.integer, as.factor) %>%
mutate(income = fct_recode(income, NULL = "9"))
I also tried recode(), but it did not work.

R loop over columns to calculate the number of rows that have levels in a different subset

> x <- data.table( C1=c('a','b','c','d') )
> y <- data.table( C1=c('a','b','b','a') )
> f="C1"
> x[ C1 %in% unique(y$C1),]
C1
1: a
2: b
so I can see that the levels of y$C1 cover 2 rows for x$C1.
> y[ C1 %in% unique(x$C1),]
C1
1: a
2: b
3: b
4: a
so I can see that the levels of x$C1 cover 4 rows for y$C1.
This works, but I would like to use a variable for the column name so that I can build a loop when there are many columns.
The following does not work:
> y[ f %in% unique(x$C1),]
Empty data.table (0 rows) of 1 col: C1
This works:
y[ get(f) %in% unique(x$C1),]
The reason is that f itself is just the string "C1", not the column:
f
[1] "C1"
class(f)
[1] "character"
You need to refer to the column object C1 in the data.table itself.
Below is an illustration of how get works:
a <- 1:10
b <- "a"
print(b)
[1] "a"
print(get(b))
[1] 1 2 3 4 5 6 7 8 9 10
You could also use:
f <- quote(C1)
y[ eval(f) %in% unique(x$C1),]
# C1
#1: a
#2: b
#3: b
#4: a
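For the loop over many columns, here is a minimal sketch that uses [[ with the column name, which works the same way on a data.table and on a plain data.frame. Plain data frames are used so the example runs without extra packages; the second column C2 and the names cols and covered are made up for illustration.

```r
x <- data.frame(C1 = c("a", "b", "c", "d"), C2 = c("p", "q", "r", "s"),
                stringsAsFactors = FALSE)
y <- data.frame(C1 = c("a", "b", "b", "a"), C2 = c("p", "z", "z", "q"),
                stringsAsFactors = FALSE)

# Columns present in both tables.
cols <- intersect(names(x), names(y))

# For each column name f, count the rows of y whose value occurs in x.
covered <- vapply(cols,
                  function(f) sum(y[[f]] %in% unique(x[[f]])),
                  integer(1))
covered
```

Inside a data.table call, get(f) plays the same role as y[[f]] does here.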
