Combine multiple data sets with different column names [duplicate] - r

I have several data frames that I want to combine by row. In the resulting single data frame, I want to create a new variable identifying which data set the observation came from.
# original data frames
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))
# desired, combined data frame
df3 <- data.frame(x = c(1, 3, 5, 7), y = c(2, 4, 6, 8),
source = c("df1", "df1", "df2", "df2"))
# x y source
# 1 2 df1
# 3 4 df1
# 5 6 df2
# 7 8 df2
How can I achieve this?
Thanks in advance!

It's not exactly what you asked for, but it's pretty close. Put your objects in a named list and use do.call(rbind, ...):
> do.call(rbind, list(df1 = df1, df2 = df2))
x y
df1.1 1 2
df1.2 3 4
df2.1 5 6
df2.2 7 8
Notice that the row names now reflect the source data.frames.
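If a real column is wanted rather than row names, the prefix can be recovered from the row names afterwards. The following is a minimal sketch, not part of the original answer; it assumes the list names do not themselves end in a dot followed by digits, since that suffix is what gets stripped.

```r
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))

combined <- do.call(rbind, list(df1 = df1, df2 = df2))

# Row names look like "df1.1", "df2.2"; drop the trailing ".<number>"
# to get back the list name (assumption: names do not end in ".<digits>").
combined$source <- sub("\\.\\d+$", "", rownames(combined))

# Reset the row names to the default 1..n sequence.
rownames(combined) <- NULL
combined
```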
Update: Use cbind and rbind
Another option is to make a basic function like the following:
AppendMe <- function(dfNames) {
do.call(rbind, lapply(dfNames, function(x) {
cbind(get(x), source = x)
}))
}
This function then takes a character vector of the data.frame names that you want to "stack", as follows:
> AppendMe(c("df1", "df2"))
x y source
1 1 2 df1
2 3 4 df1
3 5 6 df2
4 7 8 df2
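One thing to keep in mind is that get(x) searches from inside the function, so AppendMe may not find data frames that exist only inside another function. A hedged variant is sketched below; the name AppendFromEnv and its envir parameter are illustrative and not from the original answer.

```r
# Hypothetical variant of AppendMe: look the names up in the caller's
# environment (or any environment passed in) instead of relying on get().
AppendFromEnv <- function(dfNames, envir = parent.frame()) {
  dfs <- mget(dfNames, envir = envir)          # named list of data frames
  pieces <- Map(function(d, nm) cbind(d, source = nm), dfs, names(dfs))
  do.call(rbind, unname(pieces))               # unname() keeps row names plain
}

# Usage: the data frames here are local to f(), not global.
f <- function() {
  a <- data.frame(x = 1)
  b <- data.frame(x = 2:3)
  AppendFromEnv(c("a", "b"))
}
res <- f()
res
```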
Update 2: Use combine from the "gdata" package
> library(gdata)
> combine(df1, df2)
x y source
1 1 2 df1
2 3 4 df1
3 5 6 df2
4 7 8 df2
Update 3: Use rbindlist from "data.table"
Another approach is rbindlist from "data.table" with its idcol argument:
> rbindlist(mget(ls(pattern = "df\\d+")), idcol = TRUE)
.id x y
1: df1 1 2
2: df1 3 4
3: df2 5 6
4: df2 7 8
Update 4: use map_df from "purrr"
Similar to rbindlist, you can also use map_df from "purrr" with I or c as the function to apply to each list element.
> mget(ls(pattern = "df\\d+")) %>% map_df(I, .id = "src")
Source: local data frame [4 x 3]
src x y
(chr) (int) (int)
1 df1 1 2
2 df1 3 4
3 df2 5 6
4 df2 7 8

Another approach using dplyr:
df1 <- data.frame(x = c(1,3), y = c(2,4))
df2 <- data.frame(x = c(5,7), y = c(6,8))
df3 <- dplyr::bind_rows(list(df1=df1, df2=df2), .id = 'source')
df3
Source: local data frame [4 x 3]
source x y
(chr) (dbl) (dbl)
1 df1 1 2
2 df1 3 4
3 df2 5 6
4 df2 7 8

I'm not sure if such a function already exists, but this seems to do the trick:
bindAndSource <- function(df1, df2) {
df1$source <- as.character(match.call())[[2]]
df2$source <- as.character(match.call())[[3]]
rbind(df1, df2)
}
results:
bindAndSource(df1, df2)
x y source
1 1 2 df1
2 3 4 df1
3 5 6 df2
4 7 8 df2
Caveat: This will not work in *apply-like calls.

A blend of the other two answers:
df1 <- data.frame(x = 1:3,y = 1:3)
df2 <- data.frame(x = 4:6,y = 4:6)
> foo <- function(...){
args <- list(...)
result <- do.call(rbind,args)
result$source <- rep(as.character(match.call()[-1]),times = sapply(args,nrow))
result
}
> foo(df1,df2,df1)
x y source
1 1 1 df1
2 2 2 df1
3 3 3 df1
4 4 4 df2
5 5 5 df2
6 6 6 df2
7 1 1 df1
8 2 2 df1
9 3 3 df1
If you want to avoid the match.call business, you can always limit yourself to naming the function arguments (i.e. df1 = df1, df2 = df2) and using names(args) to access the names.
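The named-argument alternative mentioned above could look roughly like this. It is a sketch only; the function name bind_named is made up for illustration, and it assumes every argument is passed with a name.

```r
# Hypothetical helper: callers must name each argument, e.g. df1 = df1.
bind_named <- function(...) {
  args <- list(...)
  # Refuse unnamed arguments, since the names become the source labels.
  stopifnot(!is.null(names(args)), all(nzchar(names(args))))
  result <- do.call(rbind, unname(args))
  # Repeat each name once per row of the corresponding data frame.
  result$source <- rep(names(args), times = vapply(args, nrow, integer(1)))
  result
}

df1 <- data.frame(x = 1:3, y = 1:3)
df2 <- data.frame(x = 4:6, y = 4:6)
res <- bind_named(df1 = df1, df2 = df2)
res
```

Because the labels come from the argument names rather than from match.call, this version also works when the data frames are produced by expressions or passed through other functions.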

Another workaround for this one is using ldply in the plyr package...
df1 <- data.frame(x = c(1,3), y = c(2,4))
df2 <- data.frame(x = c(5,7), y = c(6,8))
library(plyr)
df_list <- list(df1 = df1, df2 = df2)  # avoid shadowing base::list
df3 <- ldply(df_list)
df3
.id x y
df1 1 2
df1 3 4
df2 5 6
df2 7 8

Even though there are already some great answers here, I just wanted to add the one I have been using. It is base R, so it might be less limiting if you want to use it in a package, and it is a little faster than some of the other base R solutions.
dfs <- list(df1 = data.frame("x"=c(1,2), "y"=2),
df2 = data.frame("x"=c(2,4), "y"=4),
df3 = data.frame("x"=2, "y"=c(4,5,7)))
> microbenchmark(cbind(do.call(rbind,dfs),
rep(names(dfs), vapply(dfs, nrow, numeric(1)))), times = 1001)
Unit: microseconds
min lq mean median uq max neval
393.541 409.083 454.9913 433.422 453.657 6157.649 1001
The first part, do.call(rbind, dfs), binds the rows of the data frames into a single data frame. vapply(dfs, nrow, numeric(1)) returns the number of rows in each data frame, and rep(names(dfs), ...) uses those counts to repeat each data frame's name once per row. cbind then attaches that vector as a new column.
This is similar to a previously posted solution, but about 2x faster.
> microbenchmark(do.call(rbind,
lapply(names(dfs), function(x) cbind(dfs[[x]], source = x))),
times = 1001)
Unit: microseconds
min lq mean median uq max neval
844.558 870.071 1034.182 896.464 1210.533 8867.858 1001
I am not 100% certain, but I believe the speedup is due to making a single call to cbind rather than one per data frame.
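To make the intermediate values of the single-cbind approach visible, here is a small sketch on toy data. The list names a and b and the column name source are chosen for illustration; the original call leaves the new column unnamed.

```r
dfs <- list(a = data.frame(x = 1:2), b = data.frame(x = 3:5))

# Number of rows per data frame, as a named integer vector.
n_rows <- vapply(dfs, nrow, integer(1))

# Repeat each list name once per row of its data frame.
src <- rep(names(dfs), times = n_rows)

# One rbind for the rows, one cbind for the label column.
combined <- cbind(do.call(rbind, dfs), source = src)
combined
```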

Here is one option using Map. First, I create a named list of data frames. Then, I cbind each name to its data frame. Next, unname drops the list names so that rbind does not prepend them to the row names. Finally, rbind combines all the data frames.
# original data frames
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))
df.list <- Hmisc::llist(df1, df2)
do.call(rbind, unname(Map(cbind, source = names(df.list), df.list)))
Output
source x y
1 df1 1 2
2 df1 3 4
3 df2 5 6
4 df2 7 8
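If Hmisc is not available, the same idea should work with a plain named list. This is a sketch under that assumption; the only change from the code above is how df.list is built.

```r
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))

# Name the list elements by hand instead of using Hmisc::llist.
df.list <- list(df1 = df1, df2 = df2)

# Map pairs each name with its data frame; cbind recycles the name per row.
res <- do.call(rbind, unname(Map(cbind, source = names(df.list), df.list)))
res
```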

Related

Update existing data.frame with values from another one if missing

I'm looking for (1) the name of and (2) a cleaner method in R (base and data.table preferred) for the following operation.
Input
> d1
id x y
1 1 1 NA
2 2 NA 3
3 3 4 NA
> d2
id x y z
1 4 NA 30 a
2 3 20 2 b
3 2 14 NA c
4 1 15 97 d
(note that the actual data.frames have hundreds of columns)
Expected output:
> d1
id x y z
1 1 1 97 d
2 2 14 3 c
3 3 4 2 b
Data and current solution:
d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])
for (col in setdiff(names(d1), "id")) {
# If missing look in d2
missing <- is.na(d1[[col]])
d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
}
for (col in setdiff(names(d2), names(d1))) {
# If column missing then add
d1[[col]] <- d2[match(d1$id, d2$id), col]
}
PS:
This question has likely been asked before, but I lack the vocabulary to search for it.
Assuming you are working with 2 data.frames, here is a base solution
#expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#make sure that d2 also has the same columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#align rows and columns to match those in d1
mask <- d2[match(d1$id, d2$id), names(d)]
#replace NAs with the corresponding values in mask
replace(d, is.na(d), mask[is.na(d)])
If you don't mind, we can rewrite your question into a general matrix-coalesce question (i.e. any number of matrices, columns, rows), which does not seem to have been asked before.
edit:
Another base R solution is a hack of coalesce1a from How to implement coalesce efficiently in R
coalesce.mat <- function(...) {
ans <- ..1
for (elt in list(...)[-1]) {
rn <- match(ans$id, elt$id)
ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
}
ans
}
allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat,
lapply(list(d1, d2), function(x) {
x[, setdiff(allcols, names(x))] <- NA
x
}))
edit:
a possible data.table solution using coalesce1a from How to implement coalesce efficiently in R by Martin Morgan.
coalesce1a <- function(...) {
ans <- ..1
for (elt in list(...)[-1]) {
i <- which(is.na(ans))
ans[i] <- elt[i]
}
ans
}
library(data.table)
setDT(d1)
setDT(d2)
#melt into long formats and full outer join the 2
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"), by=c("id","variable"), all=TRUE)
#perform a coalesce on vectors
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]
#pivot into original format and subset to those in d1
dcast.data.table(mdt, id ~ variable, value.var="value")[
d1, .SD, on=.(id)]
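To see what coalesce1a does on its own, here is a small check on plain vectors. The helper is copied from the answer above; the vectors x_d1 and x_d2 are illustrative and correspond to the x columns of d1 and d2 once both are ordered by id 1, 2, 3.

```r
coalesce1a <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    i <- which(is.na(ans))   # positions still missing
    ans[i] <- elt[i]         # fill them from the next vector
  }
  ans
}

x_d1 <- c(1, NA, 4)      # x from d1, ids 1..3
x_d2 <- c(15, 14, 20)    # x from d2, reordered to ids 1..3

# Only the NA in position 2 is replaced; existing values win.
x_filled <- coalesce1a(x_d1, x_d2)
x_filled
```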
Here is a possibility using dplyr::left_join:
library(dplyr)
left_join(d1, d2, by = "id") %>%
mutate(
x = ifelse(!is.na(x.x), x.x, x.y),
y = ifelse(!is.na(y.x), y.x, y.y)) %>%
select(id, x, y, z)
# id x y z
#1 1 1 97 d
#2 2 14 3 c
#3 3 4 2 b
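Since the question mentions hundreds of columns, writing one ifelse per column by hand may not scale. A base R sketch of the same join-then-fill idea that loops over the shared columns is given below; it is an illustration rather than the answerer's code, and the ".x"/".y" suffixes are simply the ones merge is told to use here.

```r
d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97),
                 z = letters[1:4])

# Left join on id; shared columns get the suffixes .x (d1) and .y (d2).
m <- merge(d1, d2, by = "id", all.x = TRUE, suffixes = c(".x", ".y"))

# Columns present in both inputs, apart from the key.
common <- setdiff(intersect(names(d1), names(d2)), "id")

for (col in common) {
  left  <- m[[paste0(col, ".x")]]
  right <- m[[paste0(col, ".y")]]
  # Keep d1's value when present, otherwise fall back to d2's.
  m[[col]] <- ifelse(is.na(left), right, left)
}

# Keep d1's columns first, followed by columns that only d2 has.
res <- m[, c(names(d1), setdiff(names(d2), names(d1)))]
res
```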
We can use data.table with coalesce from dplyr. First, create vectors of the column names that both datasets share ('nm1', excluding 'id') and of those that appear only in the second ('nm2'). Then convert the first dataset to a data.table (setDT(d1)), join on the 'id' column, and assign (:=) the coalesced pairs of shared columns (the second dataset's copies carry the i. prefix) together with the new columns, which updates the first dataset by reference.
library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1),
mget(paste0("i.", nm1))), mget(nm2)), on = .(id)]
d1
# id x y z
#1: 1 1 97 d
#2: 2 14 3 c
#3: 3 4 2 b
