Merge data frame based on column names in r - r

I have 4 data frames all with the same number of columns and identical column names.
The order of the columns is different.
I want to combine all 4 data frames together and match them with the column name.

Working Azure ML - This was the best option I found to automate this merge.
df <- maml.mapInputPort(1)
df2 <- maml.mapInputPort(2)
if (length(df2.toAdd <- setdiff (names(df), names(df2))))
df2[, c(df2.toAdd) := NA]
if (length(df.toAdd <- setdiff (names(df2), names(df))))
df[, c(df.toAdd) := NA]
df3 <- rbind(df, df2, use.names=TRUE)
maml.mapOutputPort("df3");

Suppose your 4 data frames are named df1, df2, df3 and df4, since the number of columns and the column names are identical, then why not this:
cl <- sort(colnames(df1))
mrg <- rbind(df1[,cl], df2[,cl], df3[,cl], df4[,cl])
If you want to have them in a specific order of columns, for example the order of columns in df2, then you can do this:
mrg <- mrg[,colnames(df2)]

Related

How to replace several variables with several variables from another dataframe in R using a loop?

I would like to replace multiple variables with variables from a second dataframe in R.
df1$var1 <- df2$var1
df1$var2 <- df2$var2
# and so on ...
As you can see the variable names are the same in both dataframes, however, numeric values are slightly different whereas the correct version is in df2 but needs to be in df1. I need to do this for many, many variables in a complex data set and wonder whether someone could help with a more efficient way to code this (possibly without using column references).
Here some example data:
# dataframe 1
var1 <- c(1:10)
var2 <- c(1:10)
df1 <- data.frame(var1,var2)
# dataframe 2
var1 <- c(11:20)
var2 <- c(11:20)
df2 <- data.frame(var1,var2)
# assigning correct values
df1$var1 <- df2$var1
df1$var2 <- df2$var2
As Parfait has said, the current post seems a bit too simplified to give any immediate help but I will try and summarize what you may need for something like this to work.
If the assumption is that df1 and df2 have the same number of rows AND that their orders are already matching, then you can achieve this really easily by the following subset notation:
df1[,c({column names df1}), drop = FALSE] <- df2[, c({column names df2}), drop = FALSE]
Lets say that df1 has columns a, b, and c and you want to replace b and c with two columns of df1 whose columns are x, y, z.
df1[,c("b","c"), drop = FALSE] <- df2[, c("y", "z"), drop = FALSE]
Here we are replacing b with y and c with z. The drop argument is just for added protection against subsetting a data.frame to ensure you don't get a vector.
If you do NOT know the order is correct or one data frame may have a differing size than the other BUT there is a unique identifier between the two data.frames - then I would personally use a function that is designed for merging two data frames. Depending on your preference you can use merge from base or use *_join functions from the dplyr package (my preference).
library(dplyr)
#assuming a and x are unique identifiers that can be matched.
new_df <- left_join(df1, df2, by = c("a"="x"))

Problems when trying to join multiple dataframes in R

I have three data frames: df1, df2, df3 with the same number of columns and rows, in the same order.Their column names are exactly the same except for the last three columns (42:43) which are specific to each df (e.g.: col41df1, cold42df1, col43df1...col41df2, col42df2, col43df2...col41df3, col42df3, col43df3...).
I wanted to join the three data frames so that the columns that are specific to each would be appended at the end and I would end up with a data frame with 49 columns, rather than 43.I managed that with:
df_merged <- df1 %>%
left_join(df2)%>%
left_join(df3)
However, something goes wrong during the join because df_merged appears to have 6 NA values while none of the original data frames I joined had any.
Help please?
Thanks!
Since the rows are in the same order across all 3 dataframes, there's no need to use a join. Instead, simply grab the 3 columns you want from the second and third dataframes and attach them to the first, as such:
df_merged <- cbind(df1, df2[, c(41:43)], df3[, c(42:43)])
Here is an example:
df1 <- data.frame(id=c(1,2,3), value=c(5,10,25))
df2 <- data.frame(id=c(1,2,3), value=c(3,6,9), morevalues=c(4,5,9))
library(dplyr)
merged_df <- data.frame(df1, df2[,c(2:3)])
merged_df

merge datasets with conditions in r

I have a data set which looks like the following:
The 'X19' is the row number of another data set. How can I merge these two data sets such that 'FNUMM' will be added to each row appears in 'X19'?
Thanks.
This is a merge where one of the keys is the rownames of one dataset. You can do this:
cbind(df1, df2[, "FNUMM"][match(rownames(df1), df2$X19)])
Here is a reproducible example
df1 <- data.frame(ID=c(1L,1L,1L,1L,2L,2L,3L,3L),
var=c(1:8), Smoke=c('No','No','Yes','No','No','No','Yes','No'))
df2 <- data.frame(X19=c(2,5,8), FNUMM=c('a','b','c'))
cbind(df1, df2[, "FNUMM"][match(rownames(df1), df2$X19)])
Try merge(df1, df2, by = 'X19'), where df1 and df2 are your two data frames.

how to search for column names in a dataframe

I have the following data frame df2 and a vector n. How can I create a new data frame where df2 column names are same as vector n
df2 <- data.frame(x1=c(1,6,3),x2=c(4,3,1),x3=c(5,4,6),x4=c(7,6,7))
n<-c("x1","x4")
Any of these would work:
df2[n]
df2[, n] # see note below for caveat
subset(df2, select = n)
Note that in the second one if n can be of length one, i.e. one column, then it returns a vector rather than a data frame and if you want it to always return a data frame you would need instead:
df2[, n, drop = FALSE]
df3 <- subset(df2, select=c("x1", "x4"))
df3
hope it helps

Use a function to modify multiple dataframes

I have a function to deduplicate a data frame so that each person (indexed by PatID) is represented once by the latest record (largest RecID):
dedupit <- function(x) {
x <- x[order(x$PatID, -x$RecID),]
x <- x[ !duplicated(x$PatID), ]
return(x)
}
It can deduplicate and replace a dataframe if I do:
df <- dedupit(df)
But I have multiple data frames that need deduplication. Rather than write the above code for each individual data frame, I would like to apply a the dedupit function across multiple dataframes at once so that it replaces the unduplicated dataframe with the duplicated version.
I was able to make a list of the dataframes and lapply the function across each element in the list with:
listofdifs <- list(df1, df2, ....)
listofdfs <- lapply(trial, function(x) dedupit(x))
Though, it only modifies the elements of the list and does not replace the unduplicated dataframes. How do I apply this function to modify and replace multiple dataframes?
Does it work?
Name your dataframes when creating the list, so you can recover them afterwards
list.df <- list(df1 = df1, df2 = df2, df3 = df3)
list2env(lapply(list.df, dedupit), .GlobalEnv)
As a result your dataframes df1, df2, df3 will be the deduplicate version.
unlist a list of dataframes

Resources