Combine 2 tables with common variables but no common observations - r

I would like to combine two data sets (tables) that share some, but not all, variables and have no observations in common. In other words, I want to append one data set to the other, keep the columns of both, and fill the empty fields of the resulting table with NA.
I tried the following function:
matchcol = function(x,y){
y = y[,match(colnames(x),colnames(y))]
colnames(y)=colnames(x)
return(y)
}
sum =matchcol(dataset1,dataset2)
data = rbind(dataset1,dataset2)
But I get: "Error: NA columns indexes not supported."
What can I do? What should I change in my code?
Thanks!

rbind requires both data frames to have the same column names; bind_rows from the dplyr package does not. Try this:
library(dplyr)
data <- bind_rows(dataset1, dataset2)
Example:
dataset1 <- data.frame(a= 1:5,b=6:10)
dataset2 <- data.frame(a= 11:15,c=16:20)
data <- bind_rows(dataset1,dataset2)
# a b c
# 1 1 6 NA
# 2 2 7 NA
# 3 3 8 NA
# 4 4 9 NA
# 5 5 10 NA
# 6 11 NA 16
# 7 12 NA 17
# 8 13 NA 18
# 9 14 NA 19
# 10 15 NA 20
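For comparison, the same stacking can be sketched in base R without dplyr. The helper name rbind_fill_base below is hypothetical (it is not part of any package); it pads each data frame with NA columns for the variables it lacks and then calls rbind. It is only a minimal sketch that assumes both inputs are plain data frames.

```r
# Hypothetical helper (not from any package): pad each data frame with NA
# columns for every variable it lacks, then stack the two with base rbind().
rbind_fill_base <- function(x, y) {
  all_cols <- union(names(x), names(y))
  for (col in setdiff(all_cols, names(x))) x[[col]] <- NA
  for (col in setdiff(all_cols, names(y))) y[[col]] <- NA
  # select the columns in the same order on both sides before binding
  rbind(x[all_cols], y[all_cols])
}

dataset1 <- data.frame(a = 1:5, b = 6:10)
dataset2 <- data.frame(a = 11:15, c = 16:20)
combined <- rbind_fill_base(dataset1, dataset2)
```

This also shows why the function in the question fails: match() returns NA for every column of x that is absent from y, and NA cannot be used as a column index.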

If I understand your question right, it looks like dplyr::full_join is good for that:
library(dplyr)
dataset1 <- data.frame(Var_A = 1:10, Var_B = 100:109)
dataset2 <- data.frame(Var_A = 11:20, Var_C = 200:209)
dataset_new <- full_join(dataset1, dataset2)
dataset_new
This automatically joins the two datasets by their common column names and adds all other columns from both datasets. Empty fields are filled with NA.
Does that work for you?
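One point worth checking before choosing between the two answers: full_join and bind_rows only give the same rows when the key values do not overlap. The sketch below uses made-up data (d1 and d2 are illustrative names) in which one Var_A value appears in both tables.

```r
library(dplyr)

# Illustrative data: Var_A == 3 occurs in both tables
d1 <- data.frame(Var_A = 1:3, Var_B = c(10, 20, 30))
d2 <- data.frame(Var_A = 3:5, Var_C = c(7, 8, 9))

# full_join merges the two rows that share Var_A == 3 into one row
joined <- full_join(d1, d2, by = "Var_A")

# bind_rows keeps them as two separate rows
stacked <- bind_rows(d1, d2)
```

Passing by explicitly also silences the message that full_join prints when it has to guess the join columns.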

Related

Create multiple columns in dataframe from a function or vector in R

I want to create multiple columns in a dataframe that each calculate a different value based on values from an existing column.
Say I have the following dataframe:
date <- c('1','2','3','4','5')
close <- c('10','20','15','13','19')
test_df <- data.frame(date,close)
I want to create a new column that does the following operation with dplyr:
test_df %>%
mutate(logret = log(close / lag(close, n=1)))
However I would like to create a new column for multiple values of n such that I have columns:
logret1 for n=1,
logret2 for n=2,
logret3 for n=3
etc...
I've used the function seq(from=1, to=5, by=1) as an example to get a vector of numbers to replace n with. I've tried to create a for loop around the mutate function:
seq2 <- seq(from=1, to=5, by=1)
for (number in seq2){
new_df <- test_df %>%
mutate(logret = log(close/lag(close, n=seq2)))
}
However I get the error:
Error: Problem with `mutate()` input `logret`. x `n` must be a nonnegative integer scalar, not a double vector of length 5. i Input `logret` is `log(close2/lag(close2, n = seq2))`.
I realise I can't pass in a vector for n, however I am stuck on how to proceed.
Any help would be much appreciated, Thanks.
You can use purrr's map_dfc to add new columns:
library(dplyr)
library(purrr)
n <- 3
bind_cols(test_df, map_dfc(1:n, ~test_df %>%
transmute(!!paste0('logret', .x) := log(close / lag(close, n=.x)))))
# date close logret1 logret2 logret3
#1 1 10 NA NA NA
#2 2 20 0.6931472 NA NA
#3 3 15 -0.2876821 0.4054651 NA
#4 4 13 -0.1431008 -0.4307829 0.26236426
#5 5 19 0.3794896 0.2363888 -0.05129329
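Data:
test_df <- data.frame(date,close)
# convert the character columns to numeric types; as.is = TRUE avoids creating factors
test_df <- type.convert(test_df, as.is = TRUE)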
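The loop from the question can also be repaired directly. The sketch below assumes close is numeric (the data frame is rebuilt here with numbers for that reason) and makes two changes: it passes the loop variable, not the whole vector, to n, and it builds a different column name on each iteration so that earlier results are not overwritten.

```r
library(dplyr)

# Illustrative copy of the question's data, stored as numbers
test_df <- data.frame(date = 1:5, close = c(10, 20, 15, 13, 19))

new_df <- test_df
for (number in 1:3) {
  # one new column per lag: logret1, logret2, logret3
  new_df[[paste0("logret", number)]] <-
    log(new_df$close / dplyr::lag(new_df$close, n = number))
}
```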
You can use data.table, an R package that provides an enhanced version of data.frame. This guide is a good place to get started: https://www.machinelearningplus.com/data-manipulation/datatable-in-r-complete-guide/
library(data.table)
# Create the data.table (the question stores the values as strings, so convert them)
test_dt <- data.table(date = as.integer(date), close = as.numeric(close))
# Define the new column names
logret_cols <- paste0('logret', 1:3)
# Create the new columns; shift() is data.table's own lag function,
# so the code does not depend on dplyr::lag being loaded
test_dt[, (logret_cols) := lapply(1:3, function(n) log(close / shift(close, n = n)))]
test_dt
# date close logret1 logret2 logret3
#1: 1 10 NA NA NA
#2: 2 20 0.6931472 NA NA
#3: 3 15 -0.2876821 0.4054651 NA
#4: 4 13 -0.1431008 -0.4307829 0.26236426
#5: 5 19 0.3794896 0.2363888 -0.05129329
data.table handles memory efficiently. If you work with large amounts of data, take a look at these benchmarks:
https://h2oai.github.io/db-benchmark/
EDIT
You can also mix data.table and purrr. Here is an example using purrr::map():
library(purrr)
test_dt[, (logret_cols) := map(1:3, ~log(close / shift(close, n = .x)))]
test_dt
# date close logret1 logret2 logret3
#1: 1 10 NA NA NA
#2: 2 20 0.6931472 NA NA
#3: 3 15 -0.2876821 0.4054651 NA
#4: 4 13 -0.1431008 -0.4307829 0.26236426
#5: 5 19 0.3794896 0.2363888 -0.05129329
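A slightly shorter variant is possible because data.table's shift() accepts a vector for n and then returns a list with one lagged vector per value. The sketch below rebuilds the question's data with numeric columns; apart from that assumption it uses only documented data.table functions.

```r
library(data.table)

# Illustrative data matching the question, stored as numbers
test_dt <- data.table(date = 1:5, close = c(10, 20, 15, 13, 19))
logret_cols <- paste0("logret", 1:3)

# shift(close, n = 1:3) returns a list of three lagged copies of close;
# each one is turned into a log return and assigned by reference
test_dt[, (logret_cols) := lapply(shift(close, n = 1:3),
                                  function(lagged) log(close / lagged))]
```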

merging on multiple columns R

I'm surprised if this isn't a duplicate, but I couldn't find the answer anywhere else.
I have two data frames, data1 and data2, that differ in one column, but the rest of the columns are the same. I would like to merge them on a unique identifying column, id. However, in the event an ID from data2 does not have a match in data1, I want the entry in data2 to be appended at the bottom, similar to plyr::rbind.fill() rather than renaming all the corresponding columns in data2 as column1.x and column1.y. I realize this isn't the clearest explanation, maybe I shouldn't be working on a Saturday. Here is code to create the two dataframes, and the desired output:
spp1 <- c('A','B','C')
spp2 <- c('B','C','D')
trait.1 <- rep(1.1,length(spp1))
trait.2 <- rep(2.0,length(spp2))
id_1 <- c(1,2,3)
id_2 <- c(2,9,7)
data1 <- data.frame(spp1,trait.1,id_1)
data2 <- data.frame(spp2,trait.2,id_2)
colnames(data1) <- c('spp','trait.1','id')
colnames(data2) <- c('spp','trait.2','id')
Desired output:
spp trait.1 trait.2 id
1 A 1.1 NA 1
2 B 1.1 2 2
3 C 1.1 NA 3
4 C NA 2 9
5 D NA 2 7
Try this:
library(dplyr)
full_join(data1, data2, by = c("id", "spp"))
Output:
spp trait.1 id trait.2
1 A 1.1 1 NA
2 B 1.1 2 2
3 C 1.1 3 NA
4 C NA 9 2
5 D NA 7 2
Alternatively, base R's merge would also work:
merge(data1, data2, by = c("id", "spp"), all = TRUE)
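To see why both id and spp belong in by, it may help to compare the two calls side by side. The sketch below rebuilds the question's data frames and uses only base R; by_id and by_both are illustrative names.

```r
# Data from the question
data1 <- data.frame(spp = c("A", "B", "C"), trait.1 = 1.1, id = c(1, 2, 3))
data2 <- data.frame(spp = c("B", "C", "D"), trait.2 = 2.0, id = c(2, 9, 7))

# Joining on id alone: spp exists in both inputs, so merge() keeps
# both copies and renames them spp.x and spp.y
by_id <- merge(data1, data2, by = "id", all = TRUE)

# Joining on id and spp: spp becomes a key and appears only once
by_both <- merge(data1, data2, by = c("id", "spp"), all = TRUE)
```

Note that merge() places the key columns first, so the column order differs from the desired output in the question even though the content is the same.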

How to merge multiple data.frames and sum and average columns at the same time in R

I have more than 20 data.frames with the same columns but differing numbers of rows. My goal is to merge the data.frames by the column "Name" (which is a list of five names), and while merging I would like rows with the same name to sum column A, sum column B, and take the mean of column C.
Here is what I am currently doing.
First I will just merge 2 data.frames at a time.
DF <- merge(x=abc, y=def, by = "Name", all = T)
Merged DF will look like such
Name A.x B.x C.x A.y B.y C.y
name1,name2,name3,name4,name5 11 24 7 NA NA NA
name1,name3,name4,name6,name7 4 8 12 3 4 8
name1,name2,name5,name6,name7 12 4 5 NA NA NA
name3,name4,name5,name6,name7 NA NA NA 15 3 28
I then add these ifelse statements to deal with the NAs and the non-unique rows. For the non-unique rows, A and B are summed and C is averaged.
DF$A <- ifelse(is.na(DF$A.x), DF$A.y,
ifelse(is.na(DF$A.y), DF$A.x,
ifelse((!is.na(DF$A.x)) & (!is.na(DF$A.y)), DF$A.x + DF$A.y, 1)))
DF$B <- ifelse(is.na(DF$B.x), DF$B.y,
ifelse(is.na(DF$B.y), DF$B.x,
ifelse((!is.na(DF$B.x)) & (!is.na(DF$B.y)), DF$B.x + DF$B.y, 1)))
DF$C <- ifelse(is.na(DF$C.x), DF$C.y,
ifelse(is.na(DF$C.y), DF$C.x,
ifelse((!is.na(DF$C.x)) & (!is.na(DF$C.y)), (DF$C.x + DF$C.y)/2, 1)))
DF will now look like such
Name A.x B.x C.x A.y B.y C.y A B C
name1,name2,name3,name4,name5 11 24 7 NA NA NA 11 24 7
name1,name3,name4,name6,name7 4 8 12 3 4 8 7 12 10
name1,name2,name5,name6,name7 12 4 5 NA NA NA 12 4 5
name3,name4,name5,name6,name7 NA NA NA 15 3 28 15 3 28
I then keep just the Name column and the last three columns
merge1 <- DF[c(1,8,9,10)]
Then I do the same process for the next two data.frames and call the result merge2. Then I merge merge1 and merge2.
total1 <- merge(x = merge1, y = merge2, by = "Name", all = TRUE)
I continue to merge two data frames at a time, then merge the totals data.frames together two at a time as well. I get the end result that I want, but it is a time-consuming process and not very efficient.
Another way I think I could do it is to rbind all the data.frames and then, whenever a row in the Name column has the same list of names as another row, collapse them into one row: add column A, add column B, and take the mean of column C. But I am not sure how to do that either.
Here is an example of what I would like with rbind
Name A B C
name1,name2,name3,name4,name5 11 24 7
name1,name3,name4,name6,name7 4 8 12
name1,name2,name5,name6,name7 12 4 5
name3,name4,name5,name6,name7 15 3 28
name1,name3,name4,name6,name7 3 4 8
The end result would look like such
Name A B C
name1,name2,name3,name4,name5 11 24 7
name1,name3,name4,name6,name7 7 12 10
name1,name2,name5,name6,name7 12 4 5
name3,name4,name5,name6,name7 15 3 28
Again, I am sure there are more efficient ways to complete what I want than what I am currently doing so any help would be greatly appreciated.
I think your second approach is the way to go, and you can do that with data.table or dplyr.
Here are a few steps using data.table. First, if your data frames are abc, def, ..., do:
DF <- do.call(rbind, list(abc,def,...))
Now you can convert the result into a data.table:
library(data.table)
DT <- data.table(DF)
and then aggregate by group (the grouping column is Name, as in the question):
DTres <- DT[,.(A=sum(A, na.rm=TRUE), B=sum(B, na.rm=TRUE), C=mean(C, na.rm=TRUE)), by=Name]
Check the data.table vignettes to get a better idea of how the package works.
We can use dplyr:
library(dplyr)
bind_rows(abc, def, ...) %>%
group_by(Name) %>%
summarise(A = sum(A, na.rm = TRUE),
B = sum(B, na.rm = TRUE),
C = mean(C, na.rm = TRUE))
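With more than 20 data frames it is usually easier to keep them in a list, since bind_rows accepts a list directly. The sketch below uses two small made-up data frames (the short names n1, n2, n3 stand in for the longer Name strings in the question) to show the whole pipeline end to end.

```r
library(dplyr)

# Illustrative stand-ins for the question's data frames
abc <- data.frame(Name = c("n1", "n2"), A = c(11, 4), B = c(24, 8), C = c(7, 12))
def <- data.frame(Name = c("n2", "n3"), A = c(3, 15), B = c(4, 3), C = c(8, 28))

# Collect every data frame in one list; more can be appended as they arrive
frames <- list(abc, def)

result <- bind_rows(frames) %>%
  group_by(Name) %>%
  summarise(A = sum(A, na.rm = TRUE),
            B = sum(B, na.rm = TRUE),
            C = mean(C, na.rm = TRUE))
```

One difference from the pairwise approach in the question is worth noting: here the mean of C is taken over all rows of a group at once, whereas averaging two tables at a time and then averaging the results weights the rows unevenly once a name appears in three or more tables.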

Rbind with new columns and data.table

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original one (which need to be included). Is there an equivalent of rbind.fill for data.table?
library(data.table)
aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)
dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))
dt.11 <- rbind(dt.1, dt.1) # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2) # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2) # What I need, doesn't work either
I need to start rbinding before I have all tables, so no way to know what future new columns will be called. Missing data can be filled with NA.
Since v1.9.2, data.table's rbind function has had a fill argument. From the ?rbind.data.table documentation:
If TRUE fills missing columns with NAs. By default FALSE. When
TRUE, use.names has to be TRUE, and all items of the input list has to
have non-null column names.
Thus you can do (prior to approx v1.9.6):
data.table::rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5
UPDATE for v1.9.6:
This now works directly:
rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5
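When several tables have already accumulated, rbindlist() takes the same fill argument and binds a whole list in one call. The sketch below rebuilds dt.1 and dt.2 from the question with named columns; all_tables is an illustrative name.

```r
library(data.table)

dt.1 <- data.table(aa = c(1, 2, 3), bb = c(2, 3, 4))
dt.2 <- data.table(aa = c(1, 2, 3), bb = c(2, 3, 4), cc = c(3, 4, 5))

# fill = TRUE matches columns by name and pads missing ones with NA
all_tables <- rbindlist(list(dt.1, dt.2), fill = TRUE)
```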
Here is an approach that adds the missing columns to each table before binding:
rbind.missing <- function(A, B) {
cols.A <- names(A)
cols.B <- names(B)
missing.A <- setdiff(cols.B,cols.A)
# check and define missing columns in A
if(length(missing.A) > 0L){
# .. means "look up one level"
class.missing.A <- lapply(B[, ..missing.A], class)
nas.A <- lapply(class.missing.A, as, object = NA)
A[,c(missing.A) := nas.A]
}
# check and define missing columns in B
missing.B <- setdiff(names(A), cols.B)
if(length(missing.B) > 0L){
class.missing.B <- lapply(A[, ..missing.B], class)
nas.B <- lapply(class.missing.B, as, object = NA)
B[,c(missing.B) := nas.B]
}
# reorder so they are the same
setcolorder(B, names(A))
rbind(A, B)
}
rbind.missing(dt.1,dt.2)
## aa bb cc
## 1: 1 2 NA
## 2: 2 3 NA
## 3: 3 4 NA
## 4: 1 2 3
## 5: 2 3 4
## 6: 3 4 5
This will not be efficient for many, or large data.tables, as it only works two at a time.
The answers are great, but it looks like some of the functions suggested here, such as plyr::rbind.fill and gtools::smartbind, already do this; they worked perfectly for me.
The basic concept is to add missing columns in both directions: from the running master table to the newTable and back the other way.
As @mnel pointed out in the comments, simply assigning an NA is a problem, because that will make the whole column of class logical.
One solution is to force all columns to a single type (e.g. as.numeric(NA)), but that is too restrictive.
Instead, we need to analyze each new column for its class. We can then use as(NA, cc) (cc being the class) as the vector that we will assign to a new column. We wrap this in an lapply statement on the RHS and use eval(columnName) on the LHS to assign.
We can then wrap this in a function and use S3 methods so that we can simply call
rbindFill(A, B)
Below is the function.
rbindFill.data.table <- function(master, newTable) {
# Append newTable to master
# assign to Master
#-----------------#
# identify columns missing
colMisng <- setdiff(names(newTable), names(master))
# if there are no columns missing, move on to next part
if (!identical(colMisng, character(0))) {
# identify class of each
colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))
# assign to each column value of NA with appropriate class
master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
}
# assign to newTable
#-----------------#
# identify columns missing
colMisng <- setdiff(names(master), names(newTable))
# if there are no columns missing, move on to next part
if (!identical(colMisng, character(0))) {
# identify class of each
colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))
# assign to each column value of NA with appropriate class
newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
}
# reorder columns to avoid warning about ordering
#-----------------#
# after the two steps above both tables hold the same set of columns,
# so newTable can simply take master's column order
setcolorder(newTable, names(master))
# rbind them!
#-----------------#
rbind(master, newTable)
}
# implement generic function
rbindFill <- function(x, y, ...) UseMethod("rbindFill")
Example Usage:
# Sample Data:
#--------------------------------------------------#
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
#--------------------------------------------------#
# Four iterations of calling rbindFill
master <- rbindFill(A, B)
master <- rbindFill(master, A2)
master <- rbindFill(master, C)
# Results:
master
# a b c d m n f
# 1: 1 1 1 NA NA NA NA
# 2: 2 2 2 NA NA NA NA
# 3: 3 3 3 NA NA NA NA
# 4: NA 1 1 1 A NA NA
# 5: NA 2 2 2 B NA NA
# 6: NA 3 3 3 C NA NA
# 7: 6 6 6 NA NA NA NA
# 8: 7 7 7 NA NA NA NA
# 9: 8 8 8 NA NA NA NA
# 10: 9 9 9 NA NA NA NA
# 11: NA NA 7 NA NA 0.86 TRUE
# 12: NA NA 8 NA NA -1.15 FALSE
# 13: NA NA 9 NA NA 1.10 TRUE
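The class issue that motivates as(NA, cc) in the function above can be checked in isolation. The sketch below uses only base R and the methods package; the column names d and m are borrowed from the sample data, and typed_na is an illustrative name.

```r
library(methods)  # provides as(); attached by default in most R sessions

# A bare NA is logical, so a column filled with it would be logical too
plain_class <- class(NA)

# as(NA, cc) yields a missing value of the requested class instead
typed_na <- lapply(c(d = "integer", m = "character"),
                   function(cc) as(NA, cc))
typed_classes <- vapply(typed_na, class, character(1))
```

Base R also provides the constants NA_integer_, NA_real_ and NA_character_ for the same purpose when the class is known in advance.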
Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0] which has the structure of the second data.table. This avoids the risk of introducing bugs in user-written functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.
Also, using rbindlist() should be much faster when dealing with data.tables.
Define the tables (same as mnel's code above):
library(data.table)
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
Insert the missing variables in table A (note the use of A2[0]):
A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)
Insert the missing columns in table A2:
A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)
Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed, not sure if rbindlist() binds across column names or column positions):
setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL
Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand...
DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))
DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL
The result looks the same as mnel's output (except for the random numbers and the column order).
PS1: The original author does not say what to do if there are matching variables -- do we really want to do a rbind() or are we thinking of a merge()?
PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.

How does plyr merge two columns of different data.frames with same names but different values

While merging 3 data.frames using the plyr library, I encounter some columns that have the same name but different values in each data.frame.
How does the do.call(rbind.fill,list) treat this problem: by arithmetic or geometric average?
From the help page for rbind.fill:
Combine data.frames by row, filling in missing columns. rbinds a list of data frames
filling missing columns with NA.
So I'd expect it to fill columns that do not match with NA. It is also not necessary to use do.call() here.
library(plyr)
dat1 <- data.frame(a = 1:2, b = 4:5)
dat2 <- data.frame(b = 3:2, c = 8:9)
dat3 <- data.frame(a = 5:6, c = 1:2)
rbind.fill(dat1, dat2, dat3)
a b c
1 1 4 NA
2 2 5 NA
3 NA 3 8
4 NA 2 9
5 5 NA 1
6 6 NA 2
Are you expecting something different?
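To make the answer to the averaging question concrete: rbind.fill stacks rows and never combines values, so any average has to be requested explicitly afterwards. The sketch below uses made-up data frames with an id column (an assumption, since the question does not show its data) and base R's aggregate for the averaging step.

```r
library(plyr)

# Illustrative data: the same ids and column name, different values
dat1 <- data.frame(id = 1:2, value = c(10, 20))
dat2 <- data.frame(id = 1:2, value = c(30, 40))

# rbind.fill keeps every row; nothing is averaged
stacked <- rbind.fill(dat1, dat2)

# an arithmetic mean per id has to be computed as a separate step
averaged <- aggregate(value ~ id, data = stacked, FUN = mean)
```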
