convert only some factors into a different factor - r

I'm trying to build a factor column that relates to two other factor columns with completely different factor levels. Here's example data.
set.seed(1234)
a<-sample(LETTERS[1:10],50,replace=TRUE)
b<-sample(letters[11:20],50,replace=TRUE)
df<-data.frame(a,b)
df$a<-as.factor(df$a)
df$b<-as.factor(df$b)
The rule I want creates a new column, c, whose value depends on the value of column a.
If a row in column a equals "F", that row in column c should take whatever the entry is for column b; otherwise it keeps the entry from column a. The code I'm trying:
dfn<-dim(df)[1]
for (i in 1:dfn){
df$c[i]<-ifelse(df$a[i]=="F",df$b[i],df$a[i])
}
df
It only spits out the integer index of the factor level for column b and not the actual entry. What have I done wrong?

I think you'll need to do a little finagling of character values. This seems to do it.
w <- df$a == "F"
df$c <- factor(replace(as.character(df$a), w, as.character(df$b)[w]))
Here is a quick look at the new column,
factor(replace(as.character(df$a), w, as.character(df$b)[w]))
# [1] B G G G I G A C G s G k C J C I C C B C D D B A C I n J I A
# [31] E C D p B H C C J I l G D G D p G E C H
# Levels: A B C D E G H I J k l n p s
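For context on why the loop in the question produces numbers: ifelse() builds its result from the test vector and does not carry over the factor class of its yes/no arguments, so factors come back as their underlying integer codes. A minimal sketch with two made-up factors (f, g and pick_f are illustrative names, not from the question):

```r
# Two small factors with unrelated level sets (illustrative data)
f <- factor(c("x", "y", "z"))
g <- factor(c("p", "q", "r"))
pick_f <- c(TRUE, FALSE, TRUE)

# On factors, ifelse() returns the integer codes, not the labels
codes <- ifelse(pick_f, f, g)

# Converting to character first keeps the labels
chosen <- ifelse(pick_f, as.character(f), as.character(g))

# Wrap in factor() to get a factor back
result <- factor(chosen)
```

This is the same idea as the replace() call above: do the selection on character values, then convert back to a factor once at the end.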

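As in my previous comment, a solution with dplyr (note that c is then a character column; wrap the ifelse() in factor() if you need a factor):
library(dplyr)
df %>% mutate(c = ifelse(a == "F", as.character(b), as.character(a)))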

If you plan on doing anything involving combinations of the columns as factors, for example, comparisons, you should refactor to the same set of levels.
u<-union(levels(df$a),levels(df$b))
df$a<-factor(df$a,u)
df$b<-factor(df$b,u)
df$c<-df$a
ind<-df$a=="F"
df$c[ind]<-df$b[ind]
By taking this precaution, you can sensibly do
> sum(df$c==df$b)
[1] 6
> sum(df$a=="F")
[1] 6
otherwise the first line will fail.
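To see the failure and the fix in isolation, here is a small sketch with two made-up factors (x and y are illustrative, not the question's data). Comparing factors with different level sets raises an error, which is captured here with tryCatch() so the script keeps running:

```r
# Illustrative factors with different level sets
x <- factor(c("A", "B", "F"))
y <- factor(c("k", "B", "F"))

# Comparing them directly is an error; keep the message instead of stopping
msg <- tryCatch(x == y, error = function(e) conditionMessage(e))

# Re-level both factors on the union of their levels, then compare
u <- union(levels(x), levels(y))
x2 <- factor(x, levels = u)
y2 <- factor(y, levels = u)
same <- x2 == y2
```

After re-levelling, same is an ordinary logical vector that can be passed to sum() as in the answer.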

Related

Add extra names with previous data

I'm trying to combine these data with each other, but I get "N/A" in the final output whenever I enter new names that didn't exist in the first vector. How can I handle this so that all the data is shown without any "N/A"?
I think you just want to merge/append two factors. An easy approach would be to convert them to character, append them, and make the result a factor again.
Just a simple example with letters
p <- as.factor(LETTERS[3:8])
q <- as.factor(LETTERS[1:5])
as.factor(c(as.character(p), as.character(q)))
# [1] C D E F G H A B C D E
# Levels: A B C D E F G H

R convert data table to vector in reverse order

What is the fastest way to convert the data.table:
1: A B C
2: D E F
3: G H I
into the vector: G H I D E F A B C
I use:
X <- X[order(nrow(X):1),]
X <- melt(t(X))$value
But my feeling is that this can be optimized :-)
Thank you
One option is to reverse the index, transpose to a matrix and concatenate
c(t(X[.N:1]))
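The same reverse, transpose, concatenate idea can be tried without data.table. The sketch below uses a plain data frame as a stand-in (an assumption, since the question does not show how X was built, and the column names V1 to V3 are made up); nrow(X):1 plays the role of .N:1:

```r
# Stand-in for the question's table, built as a plain data frame
X <- data.frame(V1 = c("A", "D", "G"),
                V2 = c("B", "E", "H"),
                V3 = c("C", "F", "I"),
                stringsAsFactors = FALSE)

# Reverse the rows, transpose to a matrix, and flatten column by column;
# after t(), each column is one of the original rows
v <- c(t(X[nrow(X):1, ]))
```

Because c() reads a matrix column by column, transposing first is what turns the result into a row-wise reading of the reversed table.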

Collapsing factor level for all the factor variable in dataframe based on the count

I would like to keep only the top 2 factor levels by frequency and group all other levels into Other. I tried this but it doesn't work.
df=data.frame(a=as.factor(c(rep('D',3),rep('B',5),rep('C',2))),
b=as.factor(c(rep('A',5),rep('B',5))),
c=as.factor(c(rep('A',3),rep('B',5),rep('C',2))))
myfun=function(x){
if(is.factor(x)){
levels(x)[!levels(x) %in% names(sort(table(x),decreasing = T)[1:2])]='Others'
}
}
df=as.data.frame(lapply(df, myfun))
Expected Output
a b c
D A A
D A A
D A A
B A B
B A B
B B B
B B B
B B B
others B others
others B others
This might get a bit messy, but here is one approach via base R,
fun1 <- function(x){levels(x) <-
c(names(sort(table(x), decreasing = TRUE)[1:2]),
rep('others', length(levels(x))-2));
return(x)}
However, the function above only works if the levels are first re-ordered by frequency; as the OP states in a comment, the correct version is,
fun1 <- function(x) {
  x <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
  levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
                 rep('others', length(levels(x)) - 2))
  return(x)
}
This is now easy thanks to fct_lump() from the forcats package (load it with library(forcats)).
fct_lump(df$a, n = 2)
# [1] D D D B B B B B Other Other
# Levels: B D Other
The argument n controls the number of most common levels to be preserved, lumping together the others.
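To apply the same top-two lumping to every factor column of the data frame, as the question asks, a base R sketch could look like the following. The helper name lump_top and its arguments n and other are made up for illustration. Note that the helper returns x explicitly, which the function in the question does not do:

```r
# Example data from the question
df <- data.frame(a = factor(c(rep("D", 3), rep("B", 5), rep("C", 2))),
                 b = factor(c(rep("A", 5), rep("B", 5))),
                 c = factor(c(rep("A", 3), rep("B", 5), rep("C", 2))))

# Hypothetical helper: keep the n most frequent levels, relabel the rest
lump_top <- function(x, n = 2, other = "others") {
  if (!is.factor(x)) return(x)                  # leave non-factors untouched
  by_freq <- names(sort(table(x), decreasing = TRUE))
  keep <- by_freq[seq_len(min(n, nlevels(x)))]  # guard against fewer than n levels
  levels(x)[!levels(x) %in% keep] <- other      # merge the remaining levels
  x                                             # return the modified factor
}

# df[] keeps the data frame structure while replacing every column
df[] <- lapply(df, lump_top)
```

Assigning into levels(x) by logical index avoids having to re-order the levels by frequency first, because each level is matched by name rather than by position.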

Loop with column binding

I am self-taught useR so please bear with me.
I have something similar to the following dataset:
individual value
a 0.917741317
a 0.689673689
a 0.846208486
b 0.439198006
b 0.366260159
b 0.689985484
c 0.703381117
c 0.29467743
c 0.252435687
d 0.298108973
d 0.42951805
d 0.011187204
e 0.078516181
e 0.498118235
e 0.003877632
I would like to create a matrix with the values for a in column 1, the values for b in column 2, etc. [I also add a 1 at the bottom of every column for later algebra operations]
I have tried so far:
for (i in unique(df$individual)) {
values <- subset(df$value, df$individual == i)
m <- cbind(c(values[1:3],1))
}
I get a (4,1) matrix with the values of the last individual only. What is missing to make each iteration add a column, so that I end up with as many columns as individuals?
This operation is called "reshaping". There is a base function, but I find it easier with the reshape2 package:
DF <- read.table(text="individual value
a 0.917741317
a 0.689673689
a 0.846208486
b 0.439198006
b 0.366260159
b 0.689985484
c 0.703381117
c 0.29467743
c 0.252435687
d 0.298108973
d 0.42951805
d 0.011187204
e 0.078516181
e 0.498118235
e 0.003877632", header=TRUE)
DF$id <- 1:3
library(reshape2)
DF2 <- dcast(DF, id ~ individual)
DF2[,-1]
# a b c d e
#1 0.9177413 0.4391980 0.7033811 0.2981090 0.078516181
#2 0.6896737 0.3662602 0.2946774 0.4295180 0.498118235
#3 0.8462085 0.6899855 0.2524357 0.0111872 0.003877632
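For the exact matrix the question asks for, including the extra row of 1s, a base R sketch follows. The loop in the question overwrites m on every pass, so only the last individual survives; starting from NULL and binding onto the previous m fixes that. The second version, using split() and sapply(), is offered as one possible loop-free alternative:

```r
# Data from the question
df <- data.frame(
  individual = rep(c("a", "b", "c", "d", "e"), each = 3),
  value = c(0.917741317, 0.689673689, 0.846208486,
            0.439198006, 0.366260159, 0.689985484,
            0.703381117, 0.29467743,  0.252435687,
            0.298108973, 0.42951805,  0.011187204,
            0.078516181, 0.498118235, 0.003877632),
  stringsAsFactors = FALSE
)

# Loop version: start empty and append one column per individual
m <- NULL
for (i in unique(df$individual)) {
  values <- df$value[df$individual == i]
  m <- cbind(m, c(values, 1))      # keep earlier columns, add a new one
}
colnames(m) <- unique(df$individual)

# Loop-free version: split values by individual, append 1 to each group
m2 <- sapply(split(df$value, df$individual), function(v) c(v, 1))
```

Both give a 4 x 5 matrix with one column per individual and a final row of 1s.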

rbindlist for factors with missing levels

I have several data.tables that I would like to rbindlist. The tables contain factors with (possibly missing) levels. Then rbindlist(...) behaves differently from do.call(rbind, ...):
dt1 <- data.table(x=factor(c("a", "b"), levels=letters))
rbindlist(list(dt1, dt1))[,x]
## [1] a b a b
## Levels: a b
do.call(rbind, list(dt1, dt1))[,x]
## [1] a b a b
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
If I want to keep the levels, do I have to resort to rbind or is there a data.table way?
I guess rbindlist is faster because it doesn't do the checking of do.call(rbind.data.frame,...)
Why not set the levels after binding?
Dt <- rbindlist(list(dt1, dt1))
setattr(Dt$x,"levels",letters) ## set attribute without a copy
From ?setattr:
setattr() is useful in many situations to set attributes by reference and can be used on any object or part of an object, not just data.tables.
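One caveat worth keeping in mind: overwriting the levels of a factor relabels its existing integer codes by position. That is safe in the example above because the bound factor's levels (a, b) are the first two entries of letters, but it would mislabel the data otherwise. A base R sketch of the difference, using a made-up factor x and levels<- rather than setattr() (the positional relabelling is the point being illustrated, not the by-reference update):

```r
# Illustrative factor whose levels are NOT the first entries of letters
x <- factor(c("b", "c"))            # levels: "b" "c", codes 1 and 2

# Overwriting the levels relabels by position: "b" becomes "a", "c" becomes "b"
relabelled <- x
levels(relabelled) <- letters

# factor() with explicit levels matches by value, so the data are preserved
releveled <- factor(x, levels = letters)
```

When the existing levels are not guaranteed to be a leading subset of the target levels, the value-matching form is the safer choice, at the cost of a copy.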
Thanks for pointing out this problem. As of version 1.8.11 it has been fixed:
dt1 <- data.table(x=factor(c("a", "b"), levels=letters))
rbindlist(list(dt1, dt1))[,x]
#[1] a b a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
