rbindlist for factors with missing levels - r

I have several data.tables that I would like to combine with rbindlist. The tables contain factors with (possibly missing) levels. In that case rbindlist(...) behaves differently from do.call(rbind, ...):
dt1 <- data.table(x=factor(c("a", "b"), levels=letters))
rbindlist(list(dt1, dt1))[,x]
## [1] a b a b
## Levels: a b
do.call(rbind, list(dt1, dt1))[,x]
## [1] a b a b
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
If I want to keep the levels, do I have to resort to rbind or is there a data.table way?

I guess rbindlist is faster because it skips the checking that do.call(rbind.data.frame, ...) does.
Why not set the levels after binding?
Dt <- rbindlist(list(dt1, dt1))
setattr(Dt$x,"levels",letters) ## set attribute without a copy
from the ?setattr:
setattr() is useful in many situations to set attributes by reference and can be used on any object or part of an object, not just data.tables.
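One caveat worth sketching: overwriting the levels attribute relabels the existing integer codes by position. It works in the example above only because "a" and "b" happen to be the first two entries of letters. The base R sketch below uses hypothetical values "b" and "c" (not from the question) and `attr<-` as a stand-in for setattr to show the difference from re-leveling with factor(), which matches by label.

```r
# Hypothetical factor whose unused levels were already dropped.
x <- factor(c("b", "c"))              # levels: "b", "c"; codes: 1, 2

# Overwriting the attribute keeps the codes, so 1 and 2 now mean "a" and "b".
y <- x
attr(y, "levels") <- letters

# Re-leveling with factor() matches by label, so the values are preserved.
z <- factor(x, levels = letters)

print(as.character(y))
print(as.character(z))
```

If the retained levels are not a prefix of the full level set, the factor() form is the safer choice, at the cost of a copy.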

Thanks for pointing out this problem. As of version 1.8.11 it has been fixed:
dt1 <- data.table(x=factor(c("a", "b"), levels=letters))
rbindlist(list(dt1, dt1))[,x]
#[1] a b a b
#Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

Related

R convert data table to vector in reverse order

What is the fastest way to convert the data.table:
1: A B C
2: D E F
3: G H I
into the vector: G H I D E F A B C
I use:
X <- X[order(nrow(X):1),]
X <- melt(t(X))$value
But my feeling is, that this can be optimized :-)
Thank you
One option is to reverse the index, transpose to a matrix and concatenate
c(t(X[.N:1]))
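To see why this works, here is a small base R sketch of the same idea. It assumes a plain character matrix instead of a data.table, so the rows are reversed with nrow(X):1 rather than .N:1; the sample values mirror the table in the question.

```r
# Hypothetical sample data mirroring the 3 x 3 table in the question.
X <- matrix(c("A", "B", "C",
              "D", "E", "F",
              "G", "H", "I"), nrow = 3, byrow = TRUE)

# Reverse the row order, transpose so each original row becomes a column,
# then flatten. c() reads a matrix column by column.
v <- c(t(X[nrow(X):1, ]))
print(v)
```

The transpose is what turns the column-wise flattening of c() into a row-wise one.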

Is there R function to combine many data frames

I have a sample of 2 datasets (c and d). I have combined them using the combine command
c <- data.frame(x=c("a","b"),y=c("c","d"))
d <- data.frame(x=c("f","g"),y=c("h","e"))
library(gdata)
combine(c,d)
x y source
1 a c c
2 b d c
3 f h d
4 g e d
Now suppose I have 100 data frames like c, d, e, f, ... and so on (with the same columns). Is there a quick way to combine all of them? Otherwise I need to call the command below
combine(c,d,e,f........)
df <- read.csv(file.choose())
combine(df)
The above is very time consuming. Is there an alternative way to combine all the data frames easily?
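You can list all the files you want to read in a directory like this (the pattern is a regular expression, so the dot is escaped and the match is anchored at the end of the name):
listoffiles <- list.files(pattern = "\\.csv$")
Then loop over the files and assign each one to a variable whose name starts with df_.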
for(i in 1:length(listoffiles)) {
assign(paste0("df_", i), read.csv2(listoffiles[i]))
}
Then collect those objects from your global environment.
Using the search pattern "df_" returns them as a list of data.frames.
dflist <- mget(ls(.GlobalEnv, pattern = "df_"), envir = .GlobalEnv)
Then use rbindlist from data.table to combine your data.frames.
> data.table::rbindlist(dflist)
x y
1: a c
2: b d
3: f h
4: g e
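As a possible variation, the files can be read straight into a list, which skips the assign/mget round trip through the global environment. The sketch below is base R only and makes some assumptions not in the answer: it writes two small comma-separated files to a temporary directory (so it uses read.csv rather than read.csv2) and binds with do.call(rbind, ...) instead of rbindlist.

```r
# Write two small hypothetical CSV files into a temporary directory.
tmp <- tempfile("csvdemo")
dir.create(tmp)
write.csv(data.frame(x = c("a", "b"), y = c("c", "d")),
          file.path(tmp, "c.csv"), row.names = FALSE)
write.csv(data.frame(x = c("f", "g"), y = c("h", "e")),
          file.path(tmp, "d.csv"), row.names = FALSE)

# List the files, read each one into a list element, then bind once.
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
dflist <- lapply(files, read.csv, stringsAsFactors = FALSE)
names(dflist) <- basename(files)

combined <- do.call(rbind, dflist)
rownames(combined) <- NULL      # drop the "c.csv.1"-style row names
print(combined)
```

The same dflist could be passed to data.table::rbindlist if data.table is available.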
If I understand the question correctly, the OP has the names of the data frames in a character vector, but the data frames themselves are separate objects in the global environment. In that case I would suggest the following.
Let this be the data and the character vector:
c <- data.frame(x=c("a","b"),y=c("c","d"), stringsAsFactors = FALSE)
d <- data.frame(x=c("f","g"),y=c("h","e"), stringsAsFactors = FALSE)
e <- data.frame(x=c("x","y"),y=c("o","p"), stringsAsFactors = FALSE)
df_names <- c("c", "d","e")
Then dplyr::bind_rows with c(mget(...)) should do the job.
library(dplyr)
bind_rows(c(mget(df_names)), .id = "source")
  source x y
1 c a c
2 c b d
3 d f h
4 d g e
5 e x o
6 e y p

Collapsing factor level for all the factor variable in dataframe based on the count

I would like to keep only the top 2 factor levels based on the frequency and group all other factors into Other. I tried this but it doesn't help.
df=data.frame(a=as.factor(c(rep('D',3),rep('B',5),rep('C',2))),
b=as.factor(c(rep('A',5),rep('B',5))),
c=as.factor(c(rep('A',3),rep('B',5),rep('C',2))))
myfun=function(x){
if(is.factor(x)){
levels(x)[!levels(x) %in% names(sort(table(x),decreasing = T)[1:2])]='Others'
}
}
df=as.data.frame(lapply(df, myfun))
Expected Output
a b c
D A A
D A A
D A A
B A B
B A B
B B B
B B B
B B B
others B others
others B others
This might get a bit messy, but here is one approach via base R,
fun1 <- function(x){levels(x) <-
c(names(sort(table(x), decreasing = TRUE)[1:2]),
rep('others', length(levels(x))-2));
return(x)}
However, the function above only works if the levels are first re-ordered by frequency. As the OP states in a comment, the correct version is,
fun1 <- function(x){ x=factor(x,
levels = names(sort(table(x), decreasing = TRUE)));
levels(x) <- c(names(sort(table(x), decreasing = TRUE)[1:2]),
rep('others', length(levels(x))-2));
return(x) }
This is now easy thanks to fct_lump() from the forcats package.
fct_lump(df$a, n = 2)
# [1] D D D B B B B B Other Other
# Levels: B D Other
The argument n controls the number of most common levels to be preserved, lumping together the others.
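Since the question asks for every factor column in the data frame, here is a base R sketch that applies the same idea column by column. The helper name lump_top and its arguments are made up for illustration; it relies on the fact that assigning duplicated labels to levels() merges the corresponding levels.

```r
df <- data.frame(a = factor(c(rep("D", 3), rep("B", 5), rep("C", 2))),
                 b = factor(c(rep("A", 5), rep("B", 5))),
                 c = factor(c(rep("A", 3), rep("B", 5), rep("C", 2))))

# Hypothetical helper: keep the n most frequent levels, relabel the rest.
lump_top <- function(x, n = 2, other = "Other") {
  if (!is.factor(x)) return(x)             # leave non-factor columns alone
  by_freq <- names(sort(table(x), decreasing = TRUE))
  keep <- by_freq[seq_len(min(n, nlevels(x)))]
  lv <- levels(x)
  lv[!lv %in% keep] <- other               # duplicated labels merge levels
  levels(x) <- lv
  x
}

# df[] keeps the data.frame structure while replacing every column.
df[] <- lapply(df, lump_top)
print(df)
```

Columns with two or fewer levels, such as b here, pass through unchanged.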

convert only some factors into a different factor

I'm trying to build a factor column that relates to two other factor columns with completely different factor levels. Here's example data.
set.seed(1234)
a<-sample(LETTERS[1:10],50,replace=TRUE)
b<-sample(letters[11:20],50,replace=TRUE)
df<-data.frame(a,b)
df$a<-as.factor(df$a)
df$b<-as.factor(df$b)
The rule I want to make creates a new column, c, that bases its factor level value on the value of column a.
If any row in column a equals "F", that row in column c will equal whatever the entry is for column b. The code I'm trying:
dfn<-dim(df)[1]
for (i in 1:dfn){
df$c[i]<-ifelse(df$a[i]=="F",df$b[i],df$a[i])
}
df
This only spits out the numbered index of the factor level and not the actual entry. What have I done wrong?
I think you'll need to do a little finagling of character values. This seems to do it.
w <- df$a == "F"
df$c <- factor(replace(as.character(df$a), w, as.character(df$b)[w]))
Here is a quick look at the new column,
factor(replace(as.character(df$a), w, as.character(df$b)[w]))
# [1] B G G G I G A C G s G k C J C I C C B C D D B A C I n J I A
# [31] E C D p B H C C J I l G D G D p G E C H
# Levels: A B C D E G H I J k l n p s
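The reason the character conversion is needed can be seen in a tiny sketch: ifelse() builds its result from the test vector and drops the factor attributes of its yes/no arguments, so only the underlying integer codes survive. The two-element factors below are hypothetical stand-ins for columns a and b.

```r
# Hypothetical two-row stand-ins for the columns in the question.
a <- factor(c("F", "G"))
b <- factor(c("k", "s"))

# Passing factors directly: the result holds integer codes, not labels.
codes <- ifelse(a == "F", b, a)

# Converting to character first keeps the labels.
labels <- ifelse(a == "F", as.character(b), as.character(a))

print(codes)
print(labels)
```

Wrapping the character result in factor() then gives a proper factor column.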
As in my previous comment, a solution with dplyr:
df %>% mutate(c = ifelse(a == "F", as.character(b), as.character(a)))
If you plan on doing anything involving combinations of the columns as factors, for example, comparisons, you should refactor to the same set of levels.
u<-union(levels(df$a),levels(df$b))
df$a<-factor(df$a,u)
df$b<-factor(df$b,u)
df$c<-df$a
ind<-df$a=="F"
df$c[ind]<-df$b[ind]
By taking this precaution, you can sensibly do
> sum(df$c==df$b)
[1] 6
> sum(df$a=="F")
[1] 6
Otherwise the first line will fail, because factors with different level sets cannot be compared.

Sorting rows alphabetically

My data looks like,
A B C D
B C A D
X Y M Z
O M L P
How can I sort the rows to get something like
A B C D
A B C D
M X Y Z
L M O P
Thanks,
t(apply(DF, 1, sort))
The t() function is necessary because row operations with the apply family of functions return the results in column-major order.
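A short sketch with the sample data makes the column-major point concrete. It assumes DF is a character matrix built from the rows in the question.

```r
# The four rows from the question, stored as a character matrix.
DF <- matrix(c("A", "B", "C", "D",
               "B", "C", "A", "D",
               "X", "Y", "M", "Z",
               "O", "M", "L", "P"), nrow = 4, byrow = TRUE)

# apply() over rows returns each sorted row as a COLUMN of the result.
by_col <- apply(DF, 1, sort)

# Transposing puts the sorted rows back into rows.
sorted <- t(by_col)
print(sorted)
```

Without the t(), the sorted version of row 4 would sit in column 4 of the result.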
What did you try? This is really straight-forward and easy to solve with a simple loop.
> s <- x
> for(i in 1:NROW(x)) {
+ s[i,] <- sort(s[i,])
+ }
> s
V1 V2 V3 V4
1 A B C D
2 A B C D
3 M X Y Z
4 L M O P
No plyr answer yet?!
foo <- matrix(sample(LETTERS,10^2,T),10,10)
library("plyr")
aaply(foo,1,sort)
Exactly the same as DWin's answer except that you don't need t()
Another fast base R option from Martin Morgan in Fastest way to select i-th highest value from row and assign to new column is
matrix(a[order(row(a), a, method="radix")], ncol=ncol(a), byrow=TRUE)
Timings can be found here
