Merge list of uneven dataframes by rownames - r

I have several dataframes in a list, which I want to merge into one big dataframe. The actual list contains several thousands of these dataframes, so I am looking for a preferably efficient solution.
The list looks similar to this:
v <- data.frame(answer = c(1,1,1))
rownames(v) <- c("A","B","C")
w <- data.frame(answer = c(1,0,0))
rownames(w) <- c("A","B","D")
x <- data.frame(answer = c(1,1,1))
rownames(x) <- c("A","B","C")
y <- data.frame(answer = c(0,0,0))
rownames(y) <- c("A","C","D")
z <- data.frame(answer = c(0,0,0,1))
rownames(z) <- c("A","B","C","D")
l <- list(v,w,x,y,z)
names(l) <- c("V","W","X","Y","Z")
The final output should look like this:
  V W X Y Z
A 1 1 1 0 0
B 1 0 1 NA 0
C 1 NA 1 0 0
D NA 0 NA 0 1
What I have tried already (feel free to ignore this part if you already have a working solution):
df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=T),stringsAsFactors=FALSE)
and
df <- do.call(rbind.data.frame, l)
and
df <- rbindlist(l) (from library("data.table"))
Those all lose the information contained in the rownames and only seem to work if all dataframes have the same length and the same order.
The only one that kinda worked with my actual data was something along the lines of:
df <- suppressWarnings(Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "answer", all = TRUE), l))
but I am not able to make it work with my example list, and even when it worked it was extremely inefficient and took ages once the list got longer.

Here is a base R solution using merge and Reduce:
df <- Reduce(
  function(x, y) merge(x, y, by = "id", all = TRUE),
  lapply(l, function(x) { x$id <- rownames(x); x }))
colnames(df) <- c("id", names(l))
# id V W X Y Z
#1 A 1 1 1 0 0
#2 B 1 0 1 NA 0
#3 C 1 NA 1 0 0
#4 D NA 0 NA 0 1
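If you want the row labels back as actual rownames (as in the desired output) rather than as an id column, a small follow-up on the merged result should do it:
rownames(df) <- df$id
df$id <- NULL  # drop the helper column used for the join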

We create a row names column and then do the join. We loop through the list with map, create a row names column with rownames_to_column, reduce to a single dataset with a full_join by the row names, and rename the columns if needed.
library(tidyverse)
l %>%
  map(~ .x %>% rownames_to_column('rn')) %>%
  reduce(full_join, by = 'rn') %>%
  rename_at(2:6, ~ names(l))
# rn V W X Y Z
#1 A 1 1 1 0 0
#2 B 1 0 1 NA 0
#3 C 1 NA 1 0 0
#4 D NA 0 NA 0 1
Or another option is to bind_rows and then spread
l %>%
  map(rownames_to_column, 'rn') %>%
  bind_rows(.id = 'grp') %>%
  spread(grp, answer)
# rn V W X Y Z
#1 A 1 1 1 0 0
#2 B 1 0 1 NA 0
#3 C 1 NA 1 0 0
#4 D NA 0 NA 0 1
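Note that spread() is superseded in more recent tidyr releases; if you are on tidyr >= 1.0.0, the same reshape can be written with pivot_wider() -- a sketch using the same objects as above:
l %>%
  map(rownames_to_column, 'rn') %>%
  bind_rows(.id = 'grp') %>%
  pivot_wider(names_from = grp, values_from = answer)  # one column per list element, NA where missing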

One way of doing this, similar to the approach that already sort of worked for you, is to first store the rownames as a variable, rename the columns of your data frames to match their names in the list, and then merge.
df_l <- l %>% Map(setNames, ., names(.)) %>%
  map(~ mutate(., r = rownames(.))) %>%
  Reduce(function(dtf1, dtf2) full_join(dtf1, dtf2, by = "r"), .)
rownames(df_l) <- df_l$r
df_l$r <- NULL
To be honest, I'm not sure it is efficient though, and like you said it will probably take long as the list grows.
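Since the list has thousands of elements, the repeated merges themselves can become the bottleneck. One alternative worth benchmarking (a sketch only, not tested at that scale) is to stack everything once with data.table::rbindlist() and then reshape with dcast(), using the example list l from above:
library(data.table)
# one long table: a row-name column plus the answer, with the list name as grp
dt <- rbindlist(lapply(l, function(x) data.table(rn = rownames(x), answer = x$answer)),
                idcol = "grp")
# wide format: one column per list element, NA where a row name is absent
res <- dcast(dt, rn ~ grp, value.var = "answer")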

Related

row contains multiple values, select for only one value in R

I have a data frame (df) of
A AA B C D
23 1 1,0,0 0,1,0 0
10 0 0,0,0 1 1,1
I would like the following df2
A AA B C D
23 1 1 1 0
10 0 0 1 1
I really don't have any idea how to even begin coding this. But my shot in the dark is
df2 <- df1 %>% filter_at(vars(B, C, D), any_var(. !=0))
I get the following error: Error in is.data.frame(x): object 'B' not found
Any help in this matter would be greatly appreciated as I am new to all of this, thanks.
I hope this is what you are looking for:
library(dplyr)
library(tidyr)
df %>%
  separate_rows(D) %>%
  separate_rows(AA, B, C) %>%
  group_by(A) %>%
  summarise(across(AA:D, ~ max(.x)))
# A tibble: 2 x 5
A AA B C D
<int> <int> <chr> <chr> <chr>
1 10 0 0 1 1
2 23 1 1 1 0
df$B<-as.character(df$B)
df$B[df$B == "1,0,0"] <- "1"
df$B[df$B == "0,0,0"] <- "0"
df$B<- as.factor(df$B)
df$C<-as.character(df$C)
df$C[df$C == "0,1,0"] <- "1"
df$C<- as.factor(df$C)
df$D<-as.character(df$D)
df$D[df$D == "1,1"] <- "1"
df$D<- as.factor(df$D)
Ok, I think I figured it out. This may not be the most efficient way to code for this, but it does get the job done and changes the values in df to reflect what I wanted for df2.
If I needed to keep the original df and create a new df2 then I would just put this code first.
df2<-df
Here is a way with base R. Explanation: strsplit splits at the comma and max finds the maximum value. The function sapply does the vectorization.
df1 <- read.table(text="
A AA B C D
23 1 1,0,0 0,1,0 0
10 0 0,0,0 1 1,1",
header = TRUE)
f <- function(x) {
  if (is.character(x)) {
    as.numeric(sapply(strsplit(x, ","), max))
  } else {
    x
  }
}
df2 <- sapply(df1, f)
df2
If the type of the non-numeric cells needs to be kept, just remove as.numeric(). To be precise: in the above, max compares alphabetically at the character level, not numerically.
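If a genuinely numeric comparison is wanted instead (so that, say, "10,2" yields 10 rather than "2"), a small variation of f converts before taking the maximum -- a sketch:
f_num <- function(x) {
  if (is.character(x)) {
    # split on the comma, convert each piece to numeric, then take the max
    sapply(strsplit(x, ","), function(v) max(as.numeric(v)))
  } else {
    x
  }
}
df2 <- sapply(df1, f_num)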
As an alternative, one can also do this with regular expressions.
And here is a version with regular expressions; I am myself curious how to improve this:
f <- function(x) {
  x <- gsub("0,", "", x)
  x <- gsub(",0", "", x)
  x <- gsub(",[1-9].*", "", x)
  as.numeric(x)
}
df2 <- sapply(df1, f)
df2
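If the cells can only ever contain 0s and 1s, as in the example, the regular expression can be reduced to a simple presence check for "1" -- a sketch, again guarded so that numeric columns pass through untouched:
f_bin <- function(x) if (is.character(x)) as.numeric(grepl("1", x)) else x  # 1 if any "1" occurs in the cell
df2 <- sapply(df1, f_bin)
df2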

Need to subset the main dataframe using the column information on another file

Please help: I have a main dataset which I want to subset by column, and the column information is in another file. In the present case, I want to create 3 dataframes from the main file; the required columns are in ColData (c(XX,CE.02), c(YY,CE.03,CE.01), c(ZZ,CE.05)).
XX <- c(1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0)
YY <- c(0,1,0,1,0,0,1,0,0,0,0,0,0,1,0,0)
ZZ <- c(1,0,1,1,0,0,0,1,0,1,0,0,1,0,1,1)
AL.01 <- c(NA,0,0,NA,NA,0,NA,0,0,3,0,0,0,3,0,0)
AL.02 <- c(NA,0,0,NA,NA,0,NA,0,0,4,0,0,0,2,0,0)
AL.03 <- c(NA,0,0,NA,NA,0,NA,0,0,3,0,0,0,3,0,0)
CE.01 <- c(NA,0,0,NA,NA,0,NA,0,0,3,0,0,0,3,0,0)
CE.02 <- c(NA,0,0,NA,NA,0,NA,0,0,3,0,0,0,2,0,0)
CE.03 <- c(NA,0,0,NA,NA,0,NA,0,0,3,0,0,0,2,0,0)
CE.04 <- c(NA,0,0,NA,NA,0,NA,0,0,3,0,0,0,1,0,0)
CE.05 <- c(NA,0,0,NA,NA,0,NA,0,0,3,0,0,0,1,0,0)
RCAQA <- c('XX','YY','ZZ')
QuestionID1 <- c('CE.02','CE.03','CE.05')
QuestionID2 <- c('','CE.01','')
MainData <- data.frame(XX,YY,ZZ,AL.01,AL.02,AL.03,CE.01,CE.02,CE.03,CE.04,CE.05)
ColData <- data.frame(RCAQA,QuestionID1,QuestionID2)
MainData
ColData
Required Output Dataframe 1 c(XX,CE.02)
Required Output Dataframe 2 c(YY,CE.03,CE.01)
Required Output Dataframe 3 c(ZZ,CE.05)
We can use asplit to split ColData by row and use lapply to select columns from MainData. We use intersect to get the common columns. This would give you a list of dataframes.
lapply(asplit(ColData, 1), function(x) MainData[intersect(names(MainData), x)])
#[[1]]
# XX CE.02
#1 1 NA
#2 0 0
#3 0 0
#4 1 NA
#5 0 NA
#...
#[[2]]
# YY CE.03 CE.01
#1 0 NA NA
#2 1 0 0
#3 0 0 0
#4 1 NA NA
#5 0 NA NA
#6 0 0 0
#7 1 NA NA
#...
#[[3]]
# ZZ CE.05
#1 1 NA
#2 0 0
#3 1 0
#4 1 NA
#5 0 NA
#6 0 0
#...
Using dplyr you can do this as:
library(dplyr)
ColData %>%
  group_split(row_number(), .keep = FALSE) %>%
  purrr::map(~ MainData %>% select(any_of(unlist(.x))))
This is how I would go about df1, df2, df3:
df1 <- MainData %>% select(any_of(as.character(unlist(ColData[1, ]))))
df2 <- MainData %>% select(any_of(as.character(unlist(ColData[2, ]))))
df3 <- MainData %>% select(any_of(as.character(unlist(ColData[3, ]))))
A base R solution
dfs <- apply(ColData, 1L, function(i, df) df[, i[i != ""]], MainData)
df1 <- dfs[[1L]]
df2 <- dfs[[2L]]
df3 <- dfs[[3L]]
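If it helps downstream, the pieces can also be kept in a named list instead of being unpacked into df1/df2/df3 -- for example, using the dfs list produced by the apply() call above:
names(dfs) <- ColData$RCAQA  # label each subset by its group name
dfs[["XX"]]                  # the XX / CE.02 subset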

Removing rows having only zeros [duplicate]

This question already has answers here:
How to remove rows with 0 values using R
(2 answers)
Closed 2 years ago.
I want to remove all the rows containing only zeros or NAs. In the code below I am selecting the numeric variables and then filtering out the all-zero rows. The problem is that it does not return the character variables along with the numeric ones in the final output.
df <- read.table(header = TRUE, text =
"x y z
a 1 2
b 0 3
c 1 NA
d 0 NA
")
df %>% select_if(is.numeric) %>% filter(rowSums(., na.rm = T)!=0)
You can use filter_if :
library(dplyr)
df %>% filter_if(is.numeric, any_vars(. != 0 & !is.na(.)))
# x y z
#1 a 1 2
#2 b 0 3
#3 c 1 NA
Or using base R :
cols <- sapply(df, is.numeric)
df[rowSums(!is.na(df[cols]) & df[cols] != 0) > 0, ]
Another dplyr option could be:
df %>%
  rowwise() %>%
  filter(any(across(where(is.numeric)) != 0, na.rm = TRUE))
x y z
<fct> <int> <int>
1 a 1 2
2 b 0 3
3 c 1 NA
Following the suggestions written in this new doc page after the release of dplyr version 1.0.0, you can create a helper function to substitute the superseded functions filter_if and any_vars.
Previously, filter() was paired with the all_vars() and any_vars()
helpers. Now, across() is equivalent to all_vars(), and there’s no
direct replacement for any_vars(). However you can make a simple
helper yourself
From now on, this way should be the reference method for this kind of filtering step.
rowAny <- function(x) {rowSums(x != 0 & !is.na(x)) > 0}
df %>% filter(rowAny(across(where(is.numeric))))
# x y z
# 1 a 1 2
# 2 b 0 3
# 3 c 1 NA
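As a side note (not part of the original answer): dplyr later added if_any() -- introduced around version 1.0.4, if I remember correctly -- which expresses the same row-wise "any" condition directly, without a helper:
df %>% filter(if_any(where(is.numeric), ~ !is.na(.x) & .x != 0))  # keep rows with at least one non-zero, non-NA numeric value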
You could simply do
df[rowSums(suppressWarnings(sapply(df, as.double)), na.rm=TRUE) > 0, ]
# x y z
# 1 a 1 2
# 2 b 0 3
# 3 c 1 NA

Assign values to matrix based on condition [duplicate]

This question already has answers here:
How do I get a contingency table?
(6 answers)
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Closed 4 years ago.
I want to create a matrix with 3 columns and many rows assigning 1 or 0 if the condition is satisfied.
I have data stored in 3 variables
df1 <- data.frame(names=c("A","B","C","D","E","F"))
df2 <- data.frame(names=c("A","B","C","F"))
df3 <- data.frame(names=c("E","F","H"))
output will be
df1 df2 df3
A 1 1 0
B 1 1 0
C 1 1 0
D 1 0 0
E 1 0 1
F 1 1 1
H 0 0 1
In the first row, if A is present in a dataset then I assign 1 under that dataset's column, and 0 if A is not present in that dataset.
Here is what I have tried
DF <- rbind(df1,df2,df3)
for (i in DF) {
for (j in 1:length(df1$names)) {
if(i == df1$names[j]){
A3 <-data.frame(paste0("",i),paste0(1),paste0(0),paste0(0))
names(A3) <- NULL
}
else{
A3 <-data.frame(paste0("",i),paste0(0),paste0(0),paste0(0))
}
}
}
I have written this code only for df1, but it's very slow because I have more than 1500 rows in my original data set. What would be the fastest way to do it?
Add a grouping variable to each dataframe:
df1 <- data.frame(names=c("A","B","C","D","E","F"),group="df1")
df2 <- data.frame(names=c("A","B","C","F"),group="df2")
df3 <- data.frame(names=c("E","F","H"),group="df3")
DF <- rbind(df1,df2,df3)
Then do this:
res <- table(DF)
> res
group
names df1 df2 df3
A 1 1 0
B 1 1 0
C 1 1 0
D 1 0 0
E 1 0 1
F 1 1 1
H 0 0 1
Or if you want a dataframe:
library(reshape2)
dcast(names~group, data=DF,fun.aggregate = length)
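A tidyverse equivalent of the same contingency-table idea (a sketch, assuming the DF with the group column built above) counts the name/group pairs and then spreads them to wide format:
library(dplyr)
library(tidyr)
DF %>%
  count(names, group) %>%                                          # frequency per name and group
  pivot_wider(names_from = group, values_from = n, values_fill = 0) # 0 where a name never occurs in a group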
When using the idcol parameter of rbindlist in the data.table package, there is no need to create a grouping column for each dataframe separately:
library(data.table) # I used v1.9.5 for this
DT <- rbindlist(list(df1, df2, df3), idcol="id")
dcast(DT[, .N , by=.(id,names)], names ~ id, fill=0)
which gives:
names 1 2 3
1: A 1 1 0
2: B 1 1 0
3: C 1 1 0
4: D 1 0 0
5: E 1 0 1
6: F 1 1 1
7: H 0 0 1
The %in% operator lets you check whether a string is present in a vector of strings. It is also vectorised, so it is quite quick:
x <- c(LETTERS[c(1:6, 8)])
df <- data.frame(x = x,
                 df1 = as.numeric(x %in% df1$names),
                 df2 = as.numeric(x %in% df2$names),
                 df3 = as.numeric(x %in% df3$names))
df
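If there are many data frames rather than three, the same %in% idea can be looped over a named list instead of spelling out each column by hand -- a sketch using the objects above:
dfl <- list(df1 = df1, df2 = df2, df3 = df3)
# all names seen anywhere, in sorted order
x <- sort(unique(unlist(lapply(dfl, function(d) as.character(d$names)))))
# one indicator column per data frame, named after the list elements
res <- data.frame(x = x, sapply(dfl, function(d) as.numeric(x %in% d$names)))
res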
If speed is crucial, the {data.table} package gives a little speed boost with the %chin% operator:
library(data.table)
x=c(LETTERS[c(1:6,8)])
dt=data.table(x=x,df1=as.numeric(x %chin% as.character(df1$names)),
df2=as.numeric(x %chin% as.character(df2$names)),
df3=as.numeric(x %chin% as.character(df3$names)))
dt
The code below is slightly more general than the other answers. Also, I think it's useful to know how to dynamically create commands...
I use the data frames as you prepared them:
df1 <- data.frame( names = c( "A", "B", "C", "D", "E", "F") )
df2 <- data.frame( names = c( "A", "B", "C", "F") )
df3 <- data.frame( names = c( "E", "F", "H") )
DF <- rbind( df1, df2, df3 )
nDF <- unique( DF ) #we don't want to duplicate tests.
Then the main loop is just like this:
n_ <- 3
for (ii in 1:n_) {
  nDF[paste0("df", ii)] <- as.logical(NA)  # dynamically creates a new variable in your data frame
  cmnd <- paste0("nDF$names %in% df", ii, "$names")  # dynamically builds the appropriate command (here: test whether nDF$names %in% df1$names, etc.)
  nDF[paste0("df", ii)] <- eval(parse(text = cmnd))  # evaluates the dynamically created command and saves the result into the new variable
}
Should be relatively fast. But if you don't have duplicates in your data, then heroka's suggestion to this question is probably the way to go.

R: By group, test if for each value of one variable, that value exists in another variable

I have a data frame structured something like:
a <- c(1,1,1,2,2,2,3,3,3,3,4,4)
b <- c(1,2,3,1,2,3,1,2,3,4,1,2)
c <- c(NA, NA, 2, NA, 1, 1, NA, NA, 1, 1, NA, NA)
df <- data.frame(a,b,c)
Where a and b uniquely identify an observation. I want to create a new variable, d, which indicates if each observation's value for b is present at least once in c as grouped by a. Such that d would be:
[1] 0 1 0 1 0 0 1 0 0 0 0 0
I can write a for loop which will do the trick,
attach(df)
for (i in unique(a)) {
  for (j in b[a == i]) {
    df$d[a == i & b == j] <- ifelse(j %in% c[a == i], 1, 0)
  }
}
But surely in R there must be a cleaner/faster way of achieving the same result?
Using data.table:
library(data.table)
setDT(df) #convert df to a data.table without copying
# +() is code golf for as.integer
df[ , d := +(b %in% c), by = a]
# a b c d
# 1: 1 1 NA 0
# 2: 1 2 NA 1
# 3: 1 3 2 0
# 4: 2 1 NA 1
# 5: 2 2 1 0
# 6: 2 3 1 0
# 7: 3 1 NA 1
# 8: 3 2 NA 0
# 9: 3 3 1 0
# 10: 3 4 1 0
# 11: 4 1 NA 0
# 12: 4 2 NA 0
Adding the dplyr version for those of that persuasion. All credit due to #akrun.
library(dplyr)
df %>% group_by(a) %>% mutate(d = +(b %in% c))
And for posterity, a base R version as well (via #thelatemail below)
df <- df[order(df$a, df$b), ]
df$d <- unlist(by(df, df$a, FUN = function(x) (x$b %in% x$c) + 0L ))
The above answer by MichaelChirico apparently works well and is correct. I rarely use data.table so I don't understand the syntax. This is a way to get the same results without data.table.
invisible(lapply(unique(df$a), function(x) {
  df$d[df$a == x] <<- 0L + (df$b[df$a == x] %in% df$c[df$a == x])
}))
This code gets all of the unique levels of a and then modifies the data.frame for that level of a using the logic you request. The <<- is necessary because df will otherwise be modified just in the scope of the apply and not in .GlobalEnv. With <<- it finds the parent environment where df is defined and sets df there.
Also, note the slightly different version of the + "trick", using a leading 0, which makes it clearer to the reader that the resulting vector is an integer, because it must be cast that way for the addition to work. The L after the 0 indicates that 0 is an integer and not a double. Note that the notation used by MichaelChirico for this casting gives the same results (a column of class integer).
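As a further side note, the <<- can be avoided entirely by computing the per-group values and stitching them back together in the original order with split()/unsplit() -- a sketch on the same df:
# same per-group logic, returned as a plain column instead of assigned via <<-
df$d <- unsplit(lapply(split(df, df$a), function(x) +(x$b %in% x$c)), df$a)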
