I have many data.frames, for example:
df1 = data.frame(names=c('a','b','c','c','d'),data1=c(1,2,3,4,5))
df2 = data.frame(names=c('a','e','e','c','c','d'),data2=c(1,2,3,4,5,6))
df3 = data.frame(names=c('c','e'),data3=c(1,2))
and I need to merge these data.frames without deleting the duplicated names:
> result
names data1 data2 data3
1 'a' 1 1 NA
2 'b' 2 NA NA
3 'c' 3 4 1
4 'c' 4 5 NA
5 'd' 5 6 NA
6 'e' NA 2 2
7 'e' NA 3 NA
I can't find a function like merge with an option to handle duplicated names. Thank you for your help.
To clarify my problem: the data come from a biological experiment in which each sample has a different number of replicates. I need to merge all the experiments and produce this table. I can't generate a unique identifier for the replicates.
First define a function, run.seq, which numbers the duplicates of each name in order of appearance. From the desired output, it appears that the ith duplicate of each name in one data frame should be matched with the ith duplicate of that name in the others. Then create a list of the data frames and add a run.seq column to each component. Finally, use Reduce to merge them all on names and run.seq, and drop the run.seq column.
run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))
L <- list(df1, df2, df3)
L2 <- lapply(L, function(x) cbind(x, run.seq = run.seq(x$names)))
out <- Reduce(function(...) merge(..., all = TRUE), L2)[-2]
The last line gives:
> out
names data1 data2 data3
1 a 1 1 NA
2 b 2 NA NA
3 c 3 4 1
4 c 4 5 NA
5 d 5 6 NA
6 e NA 2 2
7 e NA 3 NA
EDIT: Revised run.seq so that input need not be sorted.
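To see what run.seq contributes on its own, here is a small sketch that applies it to the names column of df2 from the question and to a short unsorted vector (the second vector is made up for illustration). Each value is the occurrence number of that name so far, which is what lets merge pair the ith 'c' of one data frame with the ith 'c' of another.

```r
# run.seq as defined above: number each occurrence within its group of equal names
run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))

names2 <- c('a','e','e','c','c','d')   # names column of df2
seq2 <- run.seq(names2)
seq2
# [1] 1 1 2 1 2 1

# hypothetical unsorted input: the second 'c' still gets number 2
seq3 <- run.seq(c('c','a','c'))
seq3
# [1] 1 1 2
```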
See other questions:
How to join data frames in R (inner, outer, left, right)
recombining-a-list-of-data-frames-into-a-single-data-frame
...
Examples:
library(reshape)
out <- merge_recurse(L)
or
library(plyr)
out<-join(df1, df2, type="full")
out<-join(out, df3, type="full")
* The join calls can be run in a loop over the list of data frames.
or
library(plyr)
out<-ldply(L)
I think there is just not enough information in your example data frames to do this. Which 'c' in dataframe 1 should be paired with which 'c' in data frame 2? We cannot tell, so R can't either. I suspect you will have to add another variable to each of your dataframes that uniquely identifies these duplicate cases.
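The ambiguity can be seen directly with a plain merge on names. This is only a sketch using df1 and df2 from the question: since nothing says which 'c' goes with which, merge pairs every 'c' in df1 with every 'c' in df2.

```r
df1 <- data.frame(names = c('a','b','c','c','d'), data1 = c(1,2,3,4,5))
df2 <- data.frame(names = c('a','e','e','c','c','d'), data2 = c(1,2,3,4,5,6))

# Full outer merge on names only
m <- merge(df1, df2, by = "names", all = TRUE)

# Two 'c' rows in df1 times two 'c' rows in df2 gives four 'c' rows
c_rows <- m[m$names == "c", ]
nrow(c_rows)
# [1] 4
```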
How can I simply "paste" two data frames next to each other, filling the missing rows with NAs (e.g. because I want to make a "kable" or something similar)?
df1 <- data.frame(a = c(1,2,3),
b = c(3,4,5))
df2 <- data.frame(a = c(4,5),
b = c(5,6))
# The desired "merge"
a b a b
1 3 4 5
2 4 5 6
3 5 NA NA
Thanks to Ronak Shah, I found an easy answer in the answers to this post: How to cbind or rbind different lengths vectors without repeating the elements of the shorter vectors?
Without having to hack anything together, one can use cbind.na from the qpcR package:
df1 <- data.frame(a = c(1,2,3),
b = c(3,4,5))
df2 <- data.frame(a = c(4,5),
b = c(5,6))
comb <- qpcR:::cbind.na(df1, df2)
As this answer is 4 years old, I wonder if there are more "modern" solutions in popular packages like the tidyverse et al.
In base R you could do:
nr <- max(nrow(df1), nrow(df2))
cbind(df1[1:nr, ], df2[1:nr, ])
# a b a b
# 1 1 3 4 5
# 2 2 4 5 6
# 3 3 5 NA NA
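If more than two data frames are involved, the same indexing trick could be wrapped in a small helper. This is only a sketch; cbind_fill is a made-up name, not a function from base R or any package.

```r
# Hypothetical helper: pad every data frame to the longest one, then cbind
cbind_fill <- function(...) {
  dfs <- list(...)
  nr <- max(vapply(dfs, nrow, integer(1)))
  # Indexing past the last row yields rows of NA
  padded <- lapply(dfs, function(d) d[seq_len(nr), , drop = FALSE])
  out <- do.call(cbind, padded)
  rownames(out) <- NULL   # reset the "NA" row names created by the padding
  out
}

df1 <- data.frame(a = c(1,2,3), b = c(3,4,5))
df2 <- data.frame(a = c(4,5), b = c(5,6))
res <- cbind_fill(df1, df2)
res
```

The duplicated column names (a, b, a, b) are kept as they are, so columns of the result are easiest to address by position.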
I have a vector of variable names and several single-column matrices.
I want to create a new matrix by matching/merging the row names of the single-column matrices.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several single-column matrices
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_2) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but to obtain exactly the desired output (rows ordered as in Complete_names) I used:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5
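An alternative that stays with matrices is to look up each matrix's rows by name with match. This is a sketch, assuming the second matrix was meant to have the row names A, B, C, as the desired output implies.

```r
Complete_names <- c("D","C","A","B")
Matrix_1 <- matrix(c(1,2,3), 3, 1, dimnames = list(c("D","C","B"), NULL))
Matrix_2 <- matrix(c(4,5,6), 3, 1, dimnames = list(c("A","B","C"), NULL))

mats <- list(Matrix_1, Matrix_2)
# match() returns NA for names a matrix does not have, and indexing by NA yields NA
out <- sapply(mats, function(m) m[match(Complete_names, rownames(m)), 1])
rownames(out) <- Complete_names
out
#   [,1] [,2]
# D    1   NA
# C    2    6
# A   NA    4
# B    3    5
```

This keeps the row order of Complete_names without a sorting step, and more matrices can be added to the list.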
I am in the process of reformatting a few data frames and was wondering if there is a more efficient way to add named columns to data frames, rather than the below:
colnames(df) <- c("c1", "c2")
to rename the current columns and:
df$c3 <- ""
to create a new column.
Is there a way to do this in a quicker manner? I'm trying to add dozens of named columns and this seems like an inefficient way of going through the process.
Use your method in a shorter way:
cols_2_add=c("a","b","c","f")
df[,cols_2_add]=""
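Put together with the renaming step from the question, a complete run might look like the sketch below. The data frame and the new column names are made up for illustration, and NA_character_ is used instead of "" so that the new columns hold missing values rather than empty strings.

```r
# Example data frame (hypothetical)
df <- data.frame(x = 1:3, y = 4:6)

# Rename the existing columns in one step
colnames(df) <- c("c1", "c2")

# Add several named columns in one assignment
new_cols <- c("c3", "c4", "c5")
df[new_cols] <- NA_character_

names(df)
# [1] "c1" "c2" "c3" "c4" "c5"
```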
Additional columns can also be added using merge: merge the existing data frame with one that contains the desired columns. A similar approach with zero-row columns is helpful if you want to create columns of different types.
For example:
# Existing dataframe
df <- data.frame(x=1:3, y=4:6)
# Use merge to add the desired columns a, b, c, d and e
merge(df, data.frame(a="", b="", c="", d="", e=""))
# Result
# x y a b c d e
#1 1 4
#2 2 5
#3 3 6
# Desired columns of different types
library(dplyr)
bind_rows(df, data.frame(a=character(), b=numeric(), c=double(), d=integer(),
e=as.Date(character()), stringsAsFactors = FALSE))
# x y a b c d e
#1 1 4 <NA> NA NA NA <NA>
#2 2 5 <NA> NA NA NA <NA>
#3 3 6 <NA> NA NA NA <NA>
A simple loop can help here
name_list <- c('a1','b1','c1','d1')
# example df
df <- data.frame(a = runif(3))
# this adds a new column
for(i in name_list)
{
df[[i]] <- runif(3)
}
# output
a a1 b1 c1 d1
1 0.09227574 0.08225444 0.4889347 0.2232167 0.8718206
2 0.94361151 0.58554887 0.7095412 0.2886408 0.9803941
3 0.22934864 0.73160433 0.6781607 0.7598064 0.4663031
# in case of data.table, a for loop with set() provides a faster version:
library(data.table)
# example df
df <- data.table(a = runif(3))
for(i in name_list)
set(df, j=i, value = runif(3))
Take this example data frame
temp <- data.frame('a' = 1:3, 'b' = 4:6, 'd' = 7:9)
I want to subset this data frame using a vector of column names, but if the vector contains any columns that don't exist in the data frame I want them still to be returned but as NA.
So if my vector was
colVec <- c('a', 'b', 'c')
I would want to run something along the lines of
subset(temp, select = colVec)
to get
a b c
1 4 NA
2 5 NA
3 6 NA
You can do this in two steps -- limiting to the requested columns that are in your data frame and then adding the requested columns that are not in your data frame. You can use intersect and setdiff to get these two sets of column names:
temp2 <- temp[intersect(colVec, names(temp))]
temp2[setdiff(colVec, names(temp))] <- NA
temp2
# a b c
# 1 1 4 NA
# 2 2 5 NA
# 3 3 6 NA
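If the columns should come back in the order given in the vector (for example when a missing column sits in the middle), the two steps can be wrapped in a small function. This is a sketch; select_or_na is a made-up name.

```r
# Hypothetical helper: return the requested columns in the requested order,
# creating all-NA columns for names the data frame does not have
select_or_na <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) df[missing_cols] <- NA
  df[cols]
}

temp <- data.frame('a' = 1:3, 'b' = 4:6, 'd' = 7:9)
res <- select_or_na(temp, c('a', 'c', 'b'))
res
#   a  c b
# 1 1 NA 4
# 2 2 NA 5
# 3 3 NA 6
```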
While merging 3 data.frames using the plyr library, I encountered some variables with the same name but different values in the different data.frames.
How does do.call(rbind.fill, list) treat this: by taking an arithmetic or a geometric average?
From the help page for rbind.fill:
Combine data.frames by row, filling in missing columns. rbinds a list of data frames
filling missing columns with NA.
So I'd expect it to fill columns that do not match with NA. It is also not necessary to use do.call() here.
library(plyr)
dat1 <- data.frame(a = 1:2, b = 4:5)
dat2 <- data.frame(b = 3:2, c = 8:9)
dat3 <- data.frame(a = 5:6, c = 1:2)
rbind.fill(dat1, dat2, dat3)
a b c
1 1 4 NA
2 2 5 NA
3 NA 3 8
4 NA 2 9
5 5 NA 1
6 6 NA 2
Are you expecting something different?
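To make the behaviour concrete without plyr, here is a rough base R sketch of the same idea. rbind_fill_base is a made-up name and is not how plyr implements rbind.fill; it only illustrates that rows are stacked and missing columns are filled with NA, with no averaging of values that share a column name.

```r
# Hypothetical base R imitation of the row-binding behaviour described above
rbind_fill_base <- function(...) {
  dfs <- list(...)
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    missing_cols <- setdiff(all_cols, names(d))
    if (length(missing_cols) > 0) d[missing_cols] <- NA  # fill absent columns
    d[all_cols]                                          # same column order everywhere
  })
  do.call(rbind, padded)
}

dat1 <- data.frame(a = 1:2, b = 4:5)
dat2 <- data.frame(b = 3:2, c = 8:9)
dat3 <- data.frame(a = 5:6, c = 1:2)
out <- rbind_fill_base(dat1, dat2, dat3)
out
```

Every input row appears once in the result, so the b values 4, 5 from dat1 and 3, 2 from dat2 all survive unchanged.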