Merge Multiple Data Frames by Row Names - r

I'm trying to merge multiple data frames by row names.
I know how to do it with two:
x = data.frame(a = c(1,2,3), row.names = letters[1:3])
y = data.frame(b = c(1,2,3), row.names = letters[1:3])
merge(x,y, by = "row.names")
But when I try using the reshape package's merge_all() I'm getting an error.
z = data.frame(c = c(1,2,3), row.names = letters[1:3])
l = list(x,y,z)
merge_all(l, by = "row.names")
Error in -ncol(df) : invalid argument to unary operator
What's the best way to do this?

Merging by row.names has an awkward side effect: merge() moves the row names into a new column called Row.names, which gets in the way of subsequent merges.
To avoid that, you can put the row names into an ordinary column first (generally a better idea anyway, since row names are limited and hard to manipulate). One way of doing that with the data from the question (not the most efficient way; for faster and easier handling of rectangular data I recommend getting to know data.table):
Reduce(merge, lapply(l, function(x) data.frame(x, rn = row.names(x))))
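A minimal sketch of the same idea end to end, assuming the x, y, z data frames from the question. It adds one step that is not in the answer above (moving the rn key column back into the row names), and it names the key explicitly with by = "rn" rather than relying on merge's default of joining on all shared column names:

```r
# Data from the question
x <- data.frame(a = c(1, 2, 3), row.names = letters[1:3])
y <- data.frame(b = c(1, 2, 3), row.names = letters[1:3])
z <- data.frame(c = c(1, 2, 3), row.names = letters[1:3])
l <- list(x, y, z)

# Copy the row names into an ordinary key column called rn
with_rn <- lapply(l, function(d) {
  data.frame(rn = row.names(d), d, stringsAsFactors = FALSE)
})

# Fold merge() over the list, always joining on the rn column
merged <- Reduce(function(left, right) merge(left, right, by = "rn"), with_rn)

# Optional: restore the row names and drop the helper column
rownames(merged) <- merged$rn
merged$rn <- NULL
merged
```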

Maybe there is a faster version using do.call or *apply, but this works in your case:
x = data.frame(X = c(1,2,3), row.names = letters[1:3])
y = data.frame(Y = c(1,2,3), row.names = letters[1:3])
z = data.frame(Z = c(1,2,3), row.names = letters[1:3])
merge.all <- function(x, ..., by = "row.names") {
  L <- list(...)
  for (i in seq_along(L)) {
    x <- merge(x, L[[i]], by = by)
    rownames(x) <- x$Row.names
    x$Row.names <- NULL
  }
  return(x)
}
merge.all(x,y,z)
It may be important to define every parameter you want to forward to merge (like by) explicitly in merge.all, since everything passed through ... is treated as a data frame to merge.

As an alternative to Reduce and merge:
If you put all the data frames into a list, you can then use grep and cbind to get the data frames with the desired row names.
## set up the data
> x <- data.frame(x1 = c(2,4,6), row.names = letters[1:3])
> y <- data.frame(x2 = c(3,6,9), row.names = letters[1:3])
> z <- data.frame(x3 = c(1,2,3), row.names = letters[1:3])
> a <- data.frame(x4 = c(4,6,8), row.names = letters[4:6])
> lst <- list(a, x, y, z)
## combine all the data frames with row names = letters[1:3]
> gg <- grep(paste(letters[1:3], collapse = ""),
sapply(lapply(lst, rownames), paste, collapse = ""))
> do.call(cbind, lst[gg])
## x1 x2 x3
## a 2 3 1
## b 4 6 2
## c 6 9 3
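The grep call above matches on the row names pasted into a single string, so it would also pick up a data frame whose row names merely contain "abc" as a substring. A hedged variation, assuming you want only the data frames whose row names are exactly letters[1:3], compares the row names directly with identical() instead:

```r
# Same data as above
x <- data.frame(x1 = c(2, 4, 6), row.names = letters[1:3])
y <- data.frame(x2 = c(3, 6, 9), row.names = letters[1:3])
z <- data.frame(x3 = c(1, 2, 3), row.names = letters[1:3])
a <- data.frame(x4 = c(4, 6, 8), row.names = letters[4:6])
lst <- list(a, x, y, z)

# TRUE for each data frame whose row names are exactly a, b, c
keep <- vapply(lst, function(d) identical(rownames(d), letters[1:3]), logical(1))

# Bind the matching data frames side by side
combined <- do.call(cbind, lst[keep])
combined
```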

Related

Bind r data.frames that contain column(s) of nested data.frames

After importing multiple .json files using jsonlite I was looking for ways to bind the resulting data.frames which contained one or more columns which themselves were nested data.frames.
I came across the following post https://r.789695.n4.nabble.com/data-frame-with-nested-data-frame-td3162660.html, which helped highlight the problem.
## Create nested data.frames
dat1 <- data.frame(x = 1)
dat1$y <- data.frame(y1 = "a", y2 = "A", stringsAsFactors = FALSE)
dat2 <- data.frame(x = 2)
dat2$y <- data.frame(y1 = "b", stringsAsFactors = FALSE)
None of these work
rbind(dat1, dat2)
dplyr::bind_rows(dat1, dat2)
data.table::rbindlist(list(dat1, dat2))
I've discovered a few workarounds which I'll post below in case they help others.
This could be done without additional packages, too. The data frames need to be partly unlisted within a list and then merged using Reduce.
Reduce(function(...) merge(..., all=TRUE), Map(unlist, list(dat1, dat2), recursive=FALSE))
# x y.y1 y.y2
# 1 1 a A
# 2 2 b <NA>
This also works with more than two nested data frames.
dat3 <- data.frame(x=2, y=data.frame(y1="c", y2="C", z="CC", stringsAsFactors=FALSE))
Reduce(function(...) merge(..., all=TRUE), Map(unlist, list(dat1, dat2, dat3), recursive=FALSE))
# x y.y1 y.y2 y.z
# 1 1 a A <NA>
# 2 2 b <NA> <NA>
# 3 2 c C CC
Data
dat1 <- structure(list(x = 1, y = structure(list(y1 = "a", y2 = "A"), class = "data.frame",
row.names = c(NA, -1L))), row.names = c(NA, -1L),
class = "data.frame")
dat2 <- structure(list(x = 2, y = structure(list(y1 = "b"), class = "data.frame",
row.names = c(NA, -1L))), row.names = c(NA, -1L),
class = "data.frame")
Flatten the data first (for base rbind data.frames need to have identical column names)
dplyr::bind_rows(
jsonlite::flatten(dat1),
jsonlite::flatten(dat2)
)
Put the data.frames into a list before binding (all approaches now work)
dat1$y <- list(dat1$y)
dat2$y <- list(dat2$y)
rbind(dat1, dat2)
dplyr::bind_rows(dat1, dat2)
data.table::rbindlist(list(dat1, dat2))
Use the tidyverse to nest the data.frames
tib1 <- tidyr::nest(dat1, y = c(y))
tib2 <- tidyr::nest(dat2, y = c(y))
tib3 <- dplyr::bind_rows(tib1, tib2)
tidyr::unnest(tib3, c(y))

Append columns to list of dataframes using lapply and mapply

I have a list of dataframes that I want to manipulate individually. It looks like this:
df_list <- list(A1 = data.frame(v1 = 1:10,
v2 = 11:20),
A2 = data.frame(v1 = 21:30,
v2 = 31:40))
df_list
Using lapply allows me to run a function over the list of dataframes like this:
library(tidyverse)
some_func <- function(lizt, comp = 2){
  lizt <- lapply(lizt, function(x){
    x <- x %>%
      mutate(IMPORTANT_v3 = v2 + comp)
    return(x)
  })
}
df_list_1 <- some_func(df_list)
df_list_1
So far so good, but I need to run the function multiple times with different arguments. Using mapply for that returns:
df_list_2 <- mapply(some_func,
                    comp = c(2, 3, 4),
                    MoreArgs = list(
                      lizt = df_list
                    ),
                    SIMPLIFY = FALSE
)
df_list_2
This creates a new list of dataframes for each argument fed to the function in mapply giving me 3 lists of 2 dataframes. This is good but the output I'm looking for is to append a new column to each original dataframe for each argument in the mapply that would look like this:
desired_df_list <- list(A1 = data.frame(v1 = 1:10,
v2 = 11:20,
IMPORTANT_v3 = 13:22,
IMPORTANT_v4 = 14:23,
IMPORTANT_v5 = 15:24),
A2 = data.frame(v1 = 21:30,
v2 = 31:40,
IMPORTANT_v3 = 33:42,
IMPORTANT_v4 = 34:43,
IMPORTANT_v5 = 35:44))
desired_df_list
How can I wrangle the output of lists of lists of dataframes to isolate and append only the desired new columns (IMPORTANT_v3) to the original dataframe? Also open to other options such as mutating multiple columns inside the lapply using mapply but I haven't figured out how to code that as yet.
Thanks!
Solved like this:
library(pracma)  # provides movavg()
main_func <- function(lizt, comp = c(2:4)){
  lizt <- lapply(lizt, function(x){
    # one weighted moving average of v2 per window size in comp
    df <- mapply(movavg,
                 n = comp,
                 type = "w",
                 MoreArgs = list(x$v2),
                 SIMPLIFY = TRUE
    )
    colnames(df) <- paste0("IMPORTANT_v", 1:ncol(df))
    print(df)  # debug output
    print(x)   # debug output
    x <- cbind(x, df)
    return(x)
  })
}
desired_df_list_complete <- main_func(df_list)
desired_df_list_complete
This example uses movavg from the pracma package.
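For the simpler v2 + comp calculation in the question, a base R sketch along the following lines should reproduce desired_df_list. The helper name add_comp_cols is hypothetical, and the naming rule (column suffix = comp + 1) is an assumption read off the desired output:

```r
df_list <- list(A1 = data.frame(v1 = 1:10, v2 = 11:20),
                A2 = data.frame(v1 = 21:30, v2 = 31:40))

# Hypothetical helper: append one IMPORTANT_v* column per value in comp
add_comp_cols <- function(lizt, comp = 2:4) {
  lapply(lizt, function(d) {
    for (k in comp) {
      # comp = 2 gives IMPORTANT_v3, comp = 3 gives IMPORTANT_v4, and so on
      d[[paste0("IMPORTANT_v", k + 1)]] <- d$v2 + k
    }
    d
  })
}

result <- add_comp_cols(df_list)
result$A1
```

Looping over comp inside the lapply keeps one data frame per list element, so there is no list of lists to untangle afterwards.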

r merge multiple data tables with list generated dynamically

This code works fine but requires knowledge of the data table names ahead of time to construct list(x,y,z)
library(data.table)
x <- data.table(i = c("a","b","c"), j = 1:3)
y <- data.table(i = c("b","c","d"), k = 4:6)
z <- data.table(i = c("c","d","a"), l = 7:9)
Reduce(function(...) merge(..., all = TRUE, by = "i"), list(x, y, z))
But I have a script that generates the data tables (the names are constructed dynamically) and creates a character vector as follows:
dtList <- c("x", "y", "z")
I want to use dtList in the Reduce code. I have tried a variety of things. None of these work
list(dtList)
as.vector(dtList, mode = "list")
Here's the code I came up with following JRR's comments that seems to work for my particular setup. dtNameList is actually read in from somewhere else in my full code but for this example, I just created a dummy version of it.
library(data.table)
dtList <- list()
dtNameList <- c("x", "y", "z")
for (k in seq_along(dtNameList)){
  dt <- data.table(i = c("a","b","c"), j = 1:3)
  # assign(k, dt)
  # store each table in the list under its dynamically generated name
  dtList[[dtNameList[k]]] <- dt
}
Reduce(function(...) merge(..., all = TRUE, by = "i"), dtList)
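If the tables already exist as separate objects and only their names are available as a character vector, base R's mget() can look them up by name and return a named list. A minimal sketch, using plain data frames so it runs without data.table (the assumption being that the same Reduce/merge call carries over to data.tables):

```r
x <- data.frame(i = c("a", "b", "c"), j = 1:3, stringsAsFactors = FALSE)
y <- data.frame(i = c("b", "c", "d"), k = 4:6, stringsAsFactors = FALSE)
z <- data.frame(i = c("c", "d", "a"), l = 7:9, stringsAsFactors = FALSE)

dtList <- c("x", "y", "z")

# Look up each object by name; the result is a list named x, y, z
tables <- mget(dtList)

# Full outer join of all tables on the key column i
merged <- Reduce(function(...) merge(..., all = TRUE, by = "i"), tables)
merged
```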

Map() and dplyr joins

I have two lists, both of which contain similar datasets corresponding to different years. I wish to merge the datasets in both lists, element by element. When I use mapply, alongside dplyr::full_join, in the instance where the variable names don't match and I need to use the by argument, R is unable to perform the join.
library(dplyr)
set.seed(100)
first_list <- list(data.frame(x = 1:3, y = rnorm(3)),
data.frame(x = 4:6, y = rnorm(3)))
second_list <- list(data.frame(z = 1:3, w = rnorm(3)),
data.frame(z = 4:6, w = rnorm(3)))
Map(full_join, by = c("x" = "z"), first_list, second_list)
#Error: 'z' column not found in rhs, cannot join
However,
Map(function(x, y) full_join(x, y, by = c("x" = "z")), first_list, second_list)
works successfully. I am curious about this behaviour and wonder if anyone could provide some explanation.
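Map is a wrapper around mapply, so every argument passed through ... is vectorized over, including by: each call receives one element of by rather than the whole named vector. Arguments that should reach the function unchanged belong in MoreArgs instead (see ?mapply):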
test1 <- Map(full_join, first_list, second_list, MoreArgs=list(by = c("x" = "z")))
test2 <- Map(function(x, y) full_join(x, y, by = c("x" = "z")), first_list, second_list)
all.equal(test1, test2)
# [1] TRUE
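The difference can be seen without dplyr. The sketch below uses a hypothetical helper, show_by, that simply returns whatever it receives as by, so the two ways of passing the argument can be compared:

```r
# Hypothetical helper: ignores a and b, returns the by argument as received
show_by <- function(a, b, by) by

# by passed through ...: Map iterates over it like any other argument
via_dots <- Map(show_by, 1:2, 3:4, by = c("x" = "z"))

# by passed through MoreArgs: handed to every call unchanged
via_more <- Map(show_by, 1:2, 3:4, MoreArgs = list(by = c("x" = "z")))

names(via_dots[[1]])  # NULL: the "x" name is lost when the element is extracted
names(via_more[[1]])  # "x": the named vector arrives intact
```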

Want to loop through the columns of dataframes in a list

I would like to loop through a list of dataframes and change the column names (I want each of the columns to have the same name)
Does anyone have a solution using the following data?
df <- data.frame(x = 1:10, y = 2:11, z = 3:12)
df2 <- data.frame(x = 1:10, y = 2:11, z = 3:12)
df3 <- data.frame(x = 1:10, y = 2:11, z = 3:12)
x <- list(df, df2, df3)
Either using a for loop or apply? Would actually love to see both if possible
Thanks,
Ben
Both hrbrmstr and David Arenburg's answers are perfect.
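The referenced answers are not reproduced here. For reference, a minimal sketch of both approaches, assuming a hypothetical vector of target names new_names (the question does not say what the columns should be called):

```r
df  <- data.frame(x = 1:10, y = 2:11, z = 3:12)
df2 <- data.frame(x = 1:10, y = 2:11, z = 3:12)
df3 <- data.frame(x = 1:10, y = 2:11, z = 3:12)
x <- list(df, df2, df3)

new_names <- c("a", "b", "c")  # hypothetical target column names

# for loop: overwrite the names of each data frame in a copy of the list
renamed_loop <- x
for (i in seq_along(renamed_loop)) {
  names(renamed_loop[[i]]) <- new_names
}

# lapply: setNames() returns each data frame with the new names applied
renamed_apply <- lapply(x, setNames, new_names)
```

Both versions leave the original list x untouched and return a renamed copy.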
