Combining list of data.tables based on data.table name - r

This is an extension of a question I posted a number of years ago.
In summary, I have an indeterminate (in advance) number of data.tables in a list, some of which have duplicate names. I want to combine all data.tables with the same name using rbindlist, and fill any columns that don't have corresponding values with NA, i.e. use the fill = TRUE argument of rbindlist.
I then want to return a list with the data.tables and name them based on their original names.
I'll give an example:
library(data.table)
# Create data for list A
A_1 <- data.table(A = 1:2,B = 2:3)
A_2 <- data.table(C = 100:102,D = 300:302,E = 1:3)
A <- list(AA = A_1,BB = A_2)
# Create data for list B
B_1 <- data.table(A = 2:4,B = 1:3,F = 4:6)
B_2 <- data.table(C = 10:12,D = 20:22,G = 1:3,I = 10:12)
B <- list(AA = B_1,BB = B_2)
# Create data for list C
C_1 <- data.table(I = 1:2,J = 3:4)
C_2 <- data.table(C = 1:3,D = 4:6)
C <- list(AA = C_1,BB = C_2)
# Create list DT which is a combination of lists A, B and C
DT <- c(A,B,C)
I can combine them easily enough using the following code.
library(purrr)
pmap(.l = list(y = unique(names(DT))),
.f = function(y) rbindlist(DT[names(DT) %in% y],fill = T)) %>%
set_names(unique(names(DT)))
However, in practice, DT is formed by another function which I'd strongly prefer to simply chain (%>%) the result from that function output, and then jump straight into the pmap approach above without first 'creating' DT.
I've tried the following:
# As a substitute for the function code creating DT, I'll simply put DT itself below as
# a representation of the output
DT %>% pmap(.l = list(y = unique(names(.))),
.f = function(y) rbindlist(.[names(.) %in% y],fill = T))
This doesn't work and gives an error suggesting that there are unused elements of the function. Do you know why it doesn't work?
Note that I haven't then used 'set_names' to name the output from the pmap function, as the unique names of DT would have been 'forgotten' after creating the pmap output. I'm not sure how to use chains to do that (or if it's possible without including that output in the data.table output itself).
Let me know if this is unclear and I can try to explain it another way.
Any help would be greatly appreciated!
Thanks
Phil

We could wrap the call in braces ({}), which stops %>% from also inserting DT as the first argument of pmap:
library(purrr) # version 1.0.0
library(dplyr)
DT %>% {
unq <- unique(names(.))
pmap(.l = list(y = unq),
.f = function(y) rbindlist(.[names(.) %in% y],fill = TRUE)) %>%
setNames(unq)
}
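As to why the original attempt fails: as far as I understand magrittr, %>% still inserts the left-hand side as the first argument when the dot appears only inside nested calls such as names(.), so pmap receives DT as an extra, unused argument. Braces switch that insertion off. A minimal sketch of the rule, using a toy vector x that is purely illustrative:

```r
library(magrittr)

x <- c(b = 2, a = 1)

# The dot appears only inside a nested call, so x is ALSO inserted as the
# first argument: this is equivalent to paste(x, names(x)).
nested <- x %>% paste(names(.))

# Braces suppress the first-argument insertion: this is paste(names(x)).
braced <- x %>% { paste(names(.)) }

print(nested)
print(braced)
```

The same mechanism is at work in the braces version of the pmap call above.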
Or, instead of looping over the names and binding, split the list by name, which returns a named list, and then bind each group:
DT %>%
split(names(.)) %>%
map(list_rbind)
Output:
$AA
A B F I J
1: 1 2 NA NA NA
2: 2 3 NA NA NA
3: 2 1 4 NA NA
4: 3 2 5 NA NA
5: 4 3 6 NA NA
6: NA NA NA 1 3
7: NA NA NA 2 4
$BB
C D E G I
1: 100 300 1 NA NA
2: 101 301 2 NA NA
3: 102 302 3 NA NA
4: 10 20 NA 1 10
5: 11 21 NA 2 11
6: 12 22 NA 3 12
7: 1 4 NA NA NA
8: 2 5 NA NA NA
9: 3 6 NA NA NA
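Since the question asked specifically for rbindlist with fill = TRUE, the same split-then-bind idea can also be sketched with data.table alone. This assumes DT is a named list of data.tables as in the question; the small tables below are only stand-ins for the real input.

```r
library(data.table)

# Stand-in for the list returned by the upstream function
DT <- list(
  AA = data.table(A = 1:2, B = 2:3),
  BB = data.table(C = 100:102, D = 300:302),
  AA = data.table(A = 2:4, F = 4:6),
  BB = data.table(C = 10:12, G = 1:3)
)

# split() groups the list elements by name and returns a named list,
# so the names survive without a separate set_names() step
combined <- lapply(split(DT, names(DT)), rbindlist, fill = TRUE)

print(combined)
```

In a chain this becomes DT %>% split(names(.)) %>% lapply(rbindlist, fill = TRUE).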

Related

Convert nested list with different names to data.frame filling NA and adding column

I need a base R solution to convert nested list with different names to a data.frame
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z=list('k')))
convert(mylist)
## returns a data.frame:
##
## a b z
## 1 2 <NULL>
## 3 NA <NULL>
## NA 5 <NULL>
## 9 NA <chr [1]>
I know this could be easily done with dplyr::bind_rows or data.table::rbindlist with fill = TRUE (not ideal though since it fills character column with NULL, not NA), but I do really need a solution in base R. To simplify the problem, it is also fine with a 2-level nested list that has no 3rd level lists such as
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))
convert(mylist)
## returns a data.frame:
##
## a b z
## 1 2 NA
## 3 NA NA
## NA 5 NA
## 9 NA k
I have tried something like
convert <- function(L) as.data.frame(do.call(rbind, L))
This neither fills in NA nor adds the additional column z.
Update
mylist here is just a simplified example. In reality I could not assume the names of the sublist elements (a, b and z in the example), nor the sublists lengths (2, 1, 1, 2 in the example).
Here are the assumptions for expected data.frame and the input mylist:
The column number of the expected data.frame is determined by the maximum length of the sublists which could vary from 1 to several hundreds. There is no explicit source of information about the length of each sublist (which names will appear or disappear in which sublist is unknown)
max(sapply(mylist, length)) <= 1000 ## ==> TRUE
The row number of the expected data.frame is determined by the length of mylist which could vary from 1 to several thousands
dplyr::between(length(mylist), 0, 10000) ## ==> TRUE
No explicit information for the names of the sublist elements and their orders, therefore the column names and order of the expected data.frame can only be determined intrinsically from mylist
Each sublist contains elements in types of numeric, character or list. To simplify the problem, consider only numeric and character.
A shorter solution in base R would be
make_df <- function(a = NA, b = NA, z = NA) {
data.frame(a = unlist(a), b = unlist(b), z = unlist(z))
}
do.call(rbind, lapply(mylist, function(x) do.call(make_df, x)))
#> a b z
#> 1 1 2 <NA>
#> 2 3 NA <NA>
#> 3 NA 5 <NA>
#> 4 9 NA k
Update
A more general solution using the same method, but which does not require specific names would be:
build_data_frame <- function(obj) {
nms <- unique(unlist(lapply(obj, names)))
frmls <- as.list(setNames(rep(NA, length(nms)), nms))
dflst <- setNames(lapply(nms, function(x) call("unlist", as.symbol(x))), nms)
make_df <- as.function(c(frmls, call("do.call", "data.frame", dflst)))
do.call(rbind, lapply(obj, function(x) do.call(make_df, x)))
}
This allows
build_data_frame(mylist)
#> a b z
#> 1 1 2 <NA>
#> 2 3 NA <NA>
#> 3 NA 5 <NA>
#> 4 9 NA k
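For the simplified case only (every element a length-one numeric or character value), the idea of deriving the columns from the list itself can also be sketched more directly: collect the union of names, pad each sublist with NA, and bind the rows. The names nms, rows and res are illustrative, and this is not a drop-in replacement for the function above.

```r
mylist <- list(list(a = 1, b = 2), list(a = 3), list(b = 5), list(a = 9, z = "k"))

# All column names, in order of first appearance
nms <- unique(unlist(lapply(mylist, names)))

rows <- lapply(mylist, function(x) {
  row <- setNames(x[nms], nms)     # names absent from x come back as NULL
  row[lengths(row) == 0L] <- NA    # replace those NULLs with NA
  as.data.frame(row, stringsAsFactors = FALSE)
})

res <- do.call(rbind, rows)
print(res)
```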
We can try the base R code below
subset(
Reduce(
function(...) {
merge(..., all = TRUE)
},
Map(
function(k, x) cbind(id = k, list2DF(x)),
seq_along(mylist), mylist
)
),
select = -id
)
which gives
a b z
1 1 2 NA
2 3 NA NA
3 NA 5 NA
4 9 NA k
You can do something like the following:
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))
convert <- function(mylist){
col_names <- NULL
# get all the unique names and create the df
for(i in 1:length(mylist)){
col_names <- c(col_names, names(mylist[[i]]))
}
col_names <- unique(col_names)
df <- data.frame(matrix(ncol=length(col_names),
nrow=length(mylist)))
colnames(df) <- col_names
# join data to row in df
for(i in 1:length(mylist)){
for(j in 1:length(mylist[[i]])){
df[i, names(mylist[[i]])[j]] <- mylist[[i]][names(mylist[[i]])[j]]
}
}
return(df)
}
df <- convert(mylist)
> df
a b z
1 1 2 <NA>
2 3 NA <NA>
3 NA 5 <NA>
4 9 NA k
I've got a solution. Note this only uses the pipe, and could be exchanged for native pipe, etc.
mylist %>%
#' first, ensure that the 2nd level is flat,
lapply(. %>% lapply(FUN = unlist, recursive = FALSE)) %>%
#' replace missing vars with `NA`
lapply(function(x, vars) {
x[vars[!vars %in% names(x)]]<-NA
x
}, vars = {.} %>% unlist() %>% names() %>% unique()) %>%
do.call(what = rbind) %>%
#' do nothing
identity()
Wrapping the dot in braces, {.}, makes the chain of unlist, names and unique evaluate on the input. Without the braces, . %>% unlist() %>% names() only defines a function and does not evaluate it on the input.

"summarize" multiple incomplete columns to 1 summary column [duplicate]

I have some columns in R and for each row there will only ever be a value in one of them; the rest will be NA. I want to combine these into one column with the non-NA value. Does anyone know of an easy way of doing this? For example, I could have the following:
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA),
'z' = c(NA,NA,NA,4,5))
So I would have
'a' 'x' 'y' 'z'
A 1 NA NA
B 2 NA NA
C NA 3 NA
D NA NA 4
E NA NA 5
And I would like to get
'a' 'mycol'
A 1
B 2
C 3
D 4
E 5
The names of the columns containing NA change depending on code earlier in the query, so I won't be able to call the column names explicitly, but I have the names of the columns that contain NAs stored as a vector, e.g. in this example cols <- c('x','y','z'), so I could call the columns using data[, cols].
Any help would be appreciated.
Thanks
A dplyr::coalesce based solution could be as:
data %>% mutate(mycol = coalesce(x,y,z)) %>%
select(a, mycol)
# a mycol
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
# 5 E 5
Data
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA),
'z' = c(NA,NA,NA,4,5))
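Because the question says the column names are only available in a vector, the coalesce call can be driven by that vector instead of naming x, y and z. A minimal sketch, assuming cols holds the relevant column names as described in the question:

```r
library(dplyr)

data <- data.frame(a = c("A", "B", "C", "D", "E"),
                   x = c(1, 2, NA, NA, NA),
                   y = c(NA, NA, 3, NA, NA),
                   z = c(NA, NA, NA, 4, 5))
cols <- c("x", "y", "z")

# Pass the selected columns to coalesce() as an unnamed list of vectors
data$mycol <- do.call(coalesce, unname(as.list(data[cols])))

res <- data[c("a", "mycol")]
print(res)
```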
You can use unlist to turn the columns into one vector. Afterwards, na.omit can be used to remove the NAs.
cbind(data[1], mycol = na.omit(unlist(data[-1])))
a mycol
x1 A 1
x2 B 2
y3 C 3
z4 D 4
z5 E 5
Here's a more general (but even simpler) solution which extends to all column types (factors, characters etc.) with non-ordered NA's. The strategy is simply to merge the non-NA values of other columns into your merged column using is.na for indexing:
data$mycol = data$x # your new merged column. Start with x
data$mycol[!is.na(data$y)] = data$y[!is.na(data$y)] # merge with y
data$mycol[!is.na(data$z)] = data$z[!is.na(data$z)] # merge with z
> data
a x y z mycol
1 A 1 NA NA 1
2 B 2 NA NA 2
3 C NA 3 NA 3
4 D NA NA 4 4
5 E NA NA 5 5
Note that this will overwrite existing values in mycol if there are several non-NA values in the same row. If you have a lot of columns you could automate this by looping over colnames(data).
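The loop mentioned above could look roughly like this. It is a sketch that assumes the columns to merge are listed in a character vector cols, as in the question; later columns overwrite earlier ones where both are non-NA.

```r
data <- data.frame(a = c("A", "B", "C", "D", "E"),
                   x = c(1, 2, NA, NA, NA),
                   y = c(NA, NA, 3, NA, NA),
                   z = c(NA, NA, NA, 4, 5))
cols <- c("x", "y", "z")

# Start with the first column, then merge in the others one at a time
data$mycol <- data[[cols[1]]]
for (col in cols[-1]) {
  has_value <- !is.na(data[[col]])
  data$mycol[has_value] <- data[[col]][has_value]
}

print(data)
```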
I would use rowSums() with the na.rm = TRUE argument:
cbind.data.frame(a=data$a, mycol = rowSums(data[, -1], na.rm = TRUE))
which gives:
> cbind.data.frame(a=data$a, mycol = rowSums(data[, -1], na.rm = TRUE))
a mycol
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
You have to call the method directly (cbind.data.frame) as the first argument above is not a data frame.
Something like this ?
data.frame(a=data$a, mycol=apply(data[,-1],1,sum,na.rm=TRUE))
gives :
a mycol
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
max works too, and it also works on string vectors.
cbind(data[1], mycol=apply(data[-1], 1, max, na.rm=T))
One possibility using dplyr and tidyr could be:
data %>%
gather(variables, mycol, -1, na.rm = TRUE) %>%
select(-variables)
a mycol
1 A 1
2 B 2
8 C 3
14 D 4
15 E 5
Here it transforms the data from wide to long format, excluding the first column from this operation and removing the NAs.
In a related link (suppress NAs in paste()) I present a version of paste with a na.rm option (with the unfortunate name of paste5).
With this the code becomes
cols <- c("x", "y", "z")
cbind.data.frame(a = data$a, mycol = paste5(data[, cols], na.rm = TRUE))
The output of paste5 is a character, which works if you have character data otherwise you'll need to coerce to the type you want.
Though this is not the OP's case, some people seem to like the approach based on sums, so how about also considering the mean and the mode to make the answer more general? This answer matches the title, which is what many people will search for.
data <- data.frame('a' = c('A','B','C','D','E'),
'x' = c(1,2,NA,NA,9),
'y' = c(NA,6,3,NA,5),
'z' = c(NA,NA,NA,4,5))
splitdf<-split(data[,c(2:4)], seq(nrow(data[,c(2:4)])))
data$mean<-unlist(lapply(splitdf, function(x) mean(unlist(x), na.rm=T) ) )
data$mode<-unlist(lapply(splitdf, function(x) {
tab <- tabulate(match(x, na.omit(unique(unlist(x) ))));
paste(na.omit(unique(unlist(x) ))[tab == max(tab) ], collapse = ", " )}) )
data
a x y z mean mode
1 A 1 NA NA 1.000000 1
2 B 2 6 NA 4.000000 2, 6
3 C NA 3 NA 3.000000 3
4 D NA NA 4 4.000000 4
5 E 9 5 5 6.333333 5
If you want to stick with base,
data <- data.frame('a' = c('A','B','C','D','E'),'x' = c(1,2,NA,NA,NA),'y' = c(NA,NA,3,NA,NA),'z' = c(NA,NA,NA,4,5))
data[is.na(data)]<-","
data$mycol<-paste0(data$x,data$y,data$z)
data$mycol <- gsub(',','',data$mycol)

How to optimize this for loop for bigger data in r?

I have some reproducible data (my original dataset contains about 2,000,000 rows). At that size my for loop becomes inefficient and takes a long time to run. Is there a more efficient way to process this data? My code with reproducible data is below.
#----Reproducible data example--------------------#
#Upload first data set#
words1<-c("How","did","Quebec","nationalists","see","their","province","as","a","nation","in","the","1960s")
words2<-c("Why","does","volicty","effect","time",'?',NA,NA,NA,NA,NA,NA,NA)
words3<-c("How","do","I","wash","a","car",NA,NA,NA,NA,NA,NA,NA)
library<-c("The","the","How","see","as","a","for","then","than","example")
embedding1<-c(.5,.6,.7,.8,.9,.3,.46,.48,.53,.42)
embedding2<-c(.1,.5,.4,.8,.9,.3,.98,.73,.48,.56)
df <- data.frame(words1,words2,words3)
names(df)<-c("words1","words2","words3")
#--------Upload 2nd dataset-------#
df2 <- data.frame(library,embedding1, embedding2)
names(df2)<-c("library","embedding1","embedding2")
df2$meanembedding=rowMeans(df2[c("embedding1","embedding2")],na.rm=T)
df2<-df2[,-c(2,3)]
#-----Find columns--------#
l=ncol(df)
names<-names(df)
head(names)
classes<-sapply(df[,c(1:l)],class)
head(classes)
#------Combine and match library to training data------#
require(gridExtra)
List = list()
for( name in names){
df1<-df[,name]
df1<-as.data.frame(df1)
x_train2<-merge(x= df1, y = df2,
by.x = "df1", by.y = 'library',all.x=T, sort=F)
x_train2<-x_train2[,-1]
x_train2<-as.data.frame(x_train2)
names(x_train2) <- name
List[[length(List)+1]] = x_train2
}
A better approach would be to use lapply:
myList2 <- lapply(names(df), function(x){
y <- merge(x = df[, x, drop = FALSE],
y = df2,
by.x = x,
by.y = 'library',
all.x = T,
sort = F)[, -1, drop = FALSE]
names(y) <- x
return(y)
})
We loop over the vector names(df), subset and merge on the fly, using [drop = FALSE] to prevent the simplification from a one-column-data.frame to a vector, and overwrite the column name. The output is a list.
Post script: You technically do not need the drop = FALSE if you use df[x] instead of df[, x], as @RuiBarradas pointed out. But I think it is helpful to know about the drop = FALSE option in cases where you need to subset both rows and columns.
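If the goal is only to look up one value per word, a plain match() lookup may be worth trying instead of merge, since it avoids building a joined table for each column. This is a different technique from the merge above, sketched with small stand-in data (words and lookup are illustrative names). Unlike merge with sort = FALSE, whose row order is unspecified, match() keeps the original row order; when a word occurs several times in the lookup table it returns only the first match.

```r
# Stand-ins for df and df2 from the question
words <- data.frame(words1 = c("How", "did", "the"),
                    words2 = c("Why", "a", NA),
                    stringsAsFactors = FALSE)
lookup <- data.frame(library = c("The", "the", "How", "a"),
                     meanembedding = c(0.3, 0.55, 0.55, 0.3),
                     stringsAsFactors = FALSE)

# For each column, find the position of every word in the lookup table
# and pull the corresponding embedding; unmatched words and NA give NA
res <- as.data.frame(lapply(words, function(col) {
  lookup$meanembedding[match(col, lookup$library)]
}))

print(res)
```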
When joining large volumes of data, give data.table a try:
library( data.table )
dt <- as.data.table( df )
dt2 <- as.data.table ( df2 )
lapply( names(dt), function(x) {
on_expr <- parse( text = paste0( "c( library = \"", x, "\")" ) )
dt2[dt, on = eval( on_expr )][,2]
})
# [[1]]
# meanembedding
# 1: 0.55
# 2: NA
# 3: NA
# 4: NA
# 5: 0.80
# 6: NA
# 7: NA
# 8: 0.90
# 9: 0.30
# 10: NA
# 11: NA
# 12: 0.55
# 13: NA
#
# [[2]]
# meanembedding
# 1: NA
# 2: NA
# 3: NA
# 4: NA
# 5: NA
# 6: NA
# 7: NA
# 8: NA
# 9: NA
# 10: NA
# 11: NA
# 12: NA
# 13: NA
#
# [[3]]
# meanembedding
# 1: 0.55
# 2: NA
# 3: NA
# 4: NA
# 5: 0.30
# 6: NA
# 7: NA
# 8: NA
# 9: NA
# 10: NA
# 11: NA
# 12: NA
# 13: NA

Retaining by variables in R data.table by-without-by

I'd like to retain the by variables in a by-without-by operation using data.table.
I have a by-without-by that used to work (ca. 2 years ago), and now with the latest version of data.table I think the behavior must have changed.
Here's a reproducible example:
library(data.table)
dt <- data.table( by1 = letters[1:3], by2 = LETTERS[1:3], x = runif(3) )
by <- c("by1","by2")
allPermutationsOfByvars <- do.call(CJ, sapply(dt[,by,with=FALSE], unique, simplify=FALSE)) ## CJ() to form index
setkeyv(dt, by)
dt[ allPermutationsOfByvars, list( x = x ) ]
Which produces:
> dt[ allPermutationsOfByvars, list( x = x ) ]
x
1: 0.9880997
2: NA
3: NA
4: NA
5: 0.4650647
6: NA
7: NA
8: NA
9: 0.4899873
I could just do:
> cbind( allPermutationsOfByvars, dt[ allPermutationsOfByvars, list( x = x ) ] )
by1 by2 x
1: a A 0.9880997
2: a B NA
3: a C NA
4: b A NA
5: b B 0.4650647
6: b C NA
7: c A NA
8: c B NA
9: c C 0.4899873
Which indeed works, but is inelegant and possibly inefficient.
Is there an argument I'm missing or a clever stratagem to retain the by variables?
Add by = .EACHI to get the "by-without-by" aka by-EACH-element-of-I:
dt[allPermutationsOfByvars, x, by = .EACHI]
And this is how I'd have done the initial part:
allPermutationsOfByvars = dt[, do.call(CJ, unique(setDT(mget(by))))]
Finally, the on argument is usually the better choice now (vs setkey).
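A minimal sketch of the on variant, assuming the same structure as the question but with fixed x values instead of runif so the result is reproducible; by_cols and grid are illustrative names:

```r
library(data.table)

dt <- data.table(by1 = letters[1:3], by2 = LETTERS[1:3], x = c(0.1, 0.2, 0.3))
by_cols <- c("by1", "by2")

# All combinations of the unique values of the grouping columns
grid <- do.call(CJ, lapply(dt[, ..by_cols], unique))

# Join on the grouping columns without setting a key;
# by = .EACHI keeps those columns in the result
res <- dt[grid, on = by_cols, .(x = x), by = .EACHI]
print(res)
```

Here dt is left unkeyed, so its row order is not changed as a side effect.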

Rbind with new columns and data.table

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original one (which need to be included). Is there an equivalent of rbind.fill for data.table?
library(data.table)
aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)
dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))
dt.11 <- rbind(dt.1, dt.1) # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2) # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2) # What I need, doesn't work either
I need to start rbinding before I have all tables, so no way to know what future new columns will be called. Missing data can be filled with NA.
Since v1.9.2, data.table's rbind function has had a fill argument. From the ?rbind.data.table documentation:
If TRUE fills missing columns with NAs. By default FALSE. When
TRUE, use.names has to be TRUE, and all items of the input list has to
have non-null column names.
Thus you can do (prior to approx v1.9.6):
data.table::rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5
UPDATE for v1.9.6:
This now works directly:
rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5
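When there are many tables, collecting them in a list and binding once with rbindlist is a common alternative to repeated rbind calls. A minimal sketch using the tables from the question (built directly rather than via cbind; tables is an illustrative name):

```r
library(data.table)

dt.1 <- data.table(aa = c(1, 2, 3), bb = c(2, 3, 4))
dt.2 <- data.table(aa = c(1, 2, 3), bb = c(2, 3, 4), cc = c(3, 4, 5))

# Collect tables as they arrive, then bind once; columns missing
# from a table are filled with NA
tables <- list(dt.1, dt.2)
res <- rbindlist(tables, fill = TRUE)
print(res)
```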
Here is an approach that adds the missing columns to each table before binding:
rbind.missing <- function(A, B) {
cols.A <- names(A)
cols.B <- names(B)
missing.A <- setdiff(cols.B,cols.A)
# check and define missing columns in A
if(length(missing.A) > 0L){
# .. means "look up one level"
class.missing.A <- lapply(B[, ..missing.A], class)
nas.A <- lapply(class.missing.A, as, object = NA)
A[,c(missing.A) := nas.A]
}
# check and define missing columns in B
missing.B <- setdiff(names(A), cols.B)
if(length(missing.B) > 0L){
class.missing.B <- lapply(A[, ..missing.B], class)
nas.B <- lapply(class.missing.B, as, object = NA)
B[,c(missing.B) := nas.B]
}
# reorder so they are the same
setcolorder(B, names(A))
rbind(A, B)
}
rbind.missing(dt.1,dt.2)
## aa bb cc
## 1: 1 2 NA
## 2: 2 3 NA
## 3: 3 4 NA
## 4: 1 2 3
## 5: 2 3 4
## 6: 3 4 5
This will not be efficient for many, or large data.tables, as it only works two at a time.
The answers are great, but some of the functions suggested here, such as plyr::rbind.fill and gtools::smartbind, worked perfectly for me.
The basic concept is to add missing columns in both directions: from the running master table to newTable, and back the other way.
As @mnel pointed out in the comments, simply assigning NA is a problem, because that makes the whole column of class logical.
One solution is to force all columns to a single type (i.e. as.numeric(NA)), but that is too restrictive.
Instead, we need to analyze each new column for its class. We can then use as(NA, cc) (cc being the class) as the vector that we assign to the new column. We wrap this in an lapply statement on the RHS and use eval(columnName) on the LHS to assign.
We can then wrap this in a function and use S3 methods so that we can simply call
rbindFill(A, B)
Below is the function.
rbindFill.data.table <- function(master, newTable) {
# Append newTable to master
# assign to Master
#-----------------#
# identify columns missing
colMisng <- setdiff(names(newTable), names(master))
# if there are no columns missing, move on to next part
if (!identical(colMisng, character(0))) {
# identify class of each
colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))
# assign to each column value of NA with appropriate class
master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
}
# assign to newTable
#-----------------#
# identify columns missing
colMisng <- setdiff(names(master), names(newTable))
# if there are no columns missing, move on to next part
if (!identical(colMisng, character(0))) {
# identify class of each
colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))
# assign to each column value of NA with appropriate class
newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
}
# reorder columns to avoid warning about ordering
#-----------------#
# colOrderingByOtherCol() is not defined in this post; both tables now have
# the same set of columns, so names(master) is used directly instead
setcolorder(newTable, names(master))
# rbind them!
#-----------------#
rbind(master, newTable)
}
# implement generic function
rbindFill <- function(x, y, ...) UseMethod("rbindFill")
Example Usage:
# Sample Data:
#--------------------------------------------------#
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
#--------------------------------------------------#
# Four iterations of calling rbindFill
master <- rbindFill(A, B)
master <- rbindFill(master, A2)
master <- rbindFill(master, C)
# Results:
master
# a b c d m n f
# 1: 1 1 1 NA NA NA NA
# 2: 2 2 2 NA NA NA NA
# 3: 3 3 3 NA NA NA NA
# 4: NA 1 1 1 A NA NA
# 5: NA 2 2 2 B NA NA
# 6: NA 3 3 3 C NA NA
# 7: 6 6 6 NA NA NA NA
# 8: 7 7 7 NA NA NA NA
# 9: 8 8 8 NA NA NA NA
# 10: 9 9 9 NA NA NA NA
# 11: NA NA 7 NA NA 0.86 TRUE
# 12: NA NA 8 NA NA -1.15 FALSE
# 13: NA NA 9 NA NA 1.10 TRUE
Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0] which has the structure of the second data.table. This avoids the risk of introducing bugs in user-written functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.
Also, using rbindlist() should be much faster when dealing with data.tables.
Define the tables (same as mnel's code above):
library(data.table)
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
Insert the missing variables in table A (note the use of A2[0]):
A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)
Insert the missing columns in table A2:
A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)
Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed, not sure if rbindlist() binds across column names or column positions):
setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL
Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand...
DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))
DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL
The result looks the same as mnel's output (except for the random numbers and the column order).
PS1: The original author does not say what to do if there are matching variables -- do we really want to do a rbind() or are we thinking of a merge()?
PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.
