Full outer join of multiple dataframes stored as elements of a list using data.table

I'm trying to do a full outer join of multiple dataframes stored as elements of a list using data.table. I have successfully done this using the merge_recurse() function of the reshape package, but it is very slow with larger datasets, and I'd like to speed up the merge by using data.table. I'm not sure the best way for data.table to handle the list structure with multiple dataframes. I'm also not sure if I've written the Reduce() function correctly on unique keys to do a full outer join on multiple dataframes.
Here's a small example:
#Libraries
library("reshape")
library("data.table")
#Specify list of multiple dataframes
filelist <- list(data.frame(x=c(1,1,1,2,2,2,3,3,3), y=c(1,2,3,1,2,3,1,2,3), a=1:9),
data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,1), b=seq(from=0, by=5, length.out=9)),
data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,2), c=seq(from=0, by=10, length.out=9)))
#Merge with merge_recurse()
listMerged <- merge_recurse(filelist, by=c("x","y"))
#Attempt with data.table
ids <- lapply(filelist, function(x) x[,c("x","y")])
unique_keys <- unique(do.call("rbind", ids))
dt <- data.table(filelist)
setkey(dt, c("x","y")) #error here
Reduce(function(x, y) x[y[J(unique_keys)]], filelist)
Here's my expected output:
> listMerged
x y a b c
1 1 1 1 0 0
2 1 2 2 5 10
3 1 3 3 10 20
4 2 1 4 15 30
5 2 2 5 20 40
6 2 3 6 25 50
7 3 1 7 30 60
8 3 2 8 35 70
9 3 3 9 NA NA
10 4 1 NA 40 NA
11 4 2 NA NA 80
Here are my resources:
Suggestion to use Reduce() function on data.table (see last comment of answer)
Suggestion to use "unique keys" to do full outer join in data.table

This worked for me:
library("reshape")
library("data.table")
##
filelist <- list(
data.frame(
x=c(1,1,1,2,2,2,3,3,3),
y=c(1,2,3,1,2,3,1,2,3),
a=1:9),
data.frame(
x=c(1,1,1,2,2,2,3,3,4),
y=c(1,2,3,1,2,3,1,2,1),
b=seq(from=0, by=5, length.out=9)),
data.frame(
x=c(1,1,1,2,2,2,3,3,4),
y=c(1,2,3,1,2,3,1,2,2),
c=seq(from=0, by=10, length.out=9)))
##
## I used copy so that this would
## not modify 'filelist'
dtList <- copy(filelist)
lapply(dtList,setDT)
lapply(dtList,function(x){
setkeyv(x,cols=c("x","y"))
})
##
> Reduce(function(x,y){
merge(x,y,all=TRUE,allow.cartesian=TRUE)
},dtList)
x y a b c
1: 1 1 1 0 0
2: 1 2 2 5 10
3: 1 3 3 10 20
4: 2 1 4 15 30
5: 2 2 5 20 40
6: 2 3 6 25 50
7: 3 1 7 30 60
8: 3 2 8 35 70
9: 3 3 9 NA NA
10: 4 1 NA 40 NA
11: 4 2 NA NA 80
Also I noticed a couple of problems in your code. dt <- data.table(filelist) resulted in
> dt
filelist
1: <data.frame>
2: <data.frame>
3: <data.frame>
which is most likely the cause of the error in setkey(dt, c("x","y")) that you pointed out above. Also, did this work for you?
Reduce(function(x, y) x[y[J(unique_keys)]], filelist)
I'm just curious, because I was getting an error when I tried to run it (using dtList instead of filelist)
Error in eval(expr, envir, enclos) : could not find function "J"
which I believe has to do with the changes implemented since version 1.8.8 of data.table, explained by @Arun in this answer.
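For reference, here is a minimal sketch of the same Reduce() plus merge() idea that does not rely on keys at all, assuming every table shares the join columns x and y. The toy values are made up for illustration and are smaller than the example in the question; as.data.table() is used instead of setDT() so the original list is not touched.

```r
library(data.table)

# Toy inputs (hypothetical values, smaller than the example in the question)
filelist <- list(
  data.frame(x = c(1, 1, 2), y = c(1, 2, 1), a = 1:3),
  data.frame(x = c(1, 2, 3), y = c(1, 1, 1), b = c(10, 20, 30)),
  data.frame(x = c(1, 3),    y = c(2, 1),    c = c(100, 200))
)

# as.data.table() returns new objects, so 'filelist' itself is left untouched
dtList <- lapply(filelist, as.data.table)

# Fold the list pairwise; all = TRUE keeps unmatched rows from both sides
merged <- Reduce(
  function(left, right) merge(left, right, by = c("x", "y"), all = TRUE),
  dtList
)
print(merged)
```

Passing by= explicitly means the join columns do not depend on whatever key each table happens to carry.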

Related

Appending csvs with different column quantities and spellings

Nothing too complicated, it would just be useful to use rbindlist on a large number of csvs where the column names change a little over time (minor spelling changes), the column orders remain the same, and at some point, two additional columns are added to the csvs (which I don't really need).
library(data.table)
csv1 <- data.table("apple" = 1:3, "orange" = 2:4, "dragonfruit" = 13:15)
csv2 <- data.table("appole" = 7:9, "orangina" = 6:8, "dragonificfruit" = 2:4, "pear" = 1:3)
l <- list(csv1, csv2)
When I run
csv_append <- rbindlist(l, fill=TRUE) #which also forces use.names=TRUE
it gives me a data.table with 7 columns
apple orange dragonfruit appole orangina dragonificfruit pear
1: 1 2 13 NA NA NA NA
2: 2 3 14 NA NA NA NA
3: 3 4 15 NA NA NA NA
4: NA NA NA 7 6 2 1
5: NA NA NA 8 7 3 2
6: NA NA NA 9 8 4 3
as opposed to what I want, which is:
V1 V2 V3 V4
1: 1 2 13 NA
2: 2 3 14 NA
3: 3 4 15 NA
4: 7 6 2 1
5: 8 7 3 2
6: 9 8 4 3
which I can use, even though I have to go through the extra step later of renaming the columns back to standard variable names.
If I instead try the default fill=FALSE and use.names=FALSE, it throws an error:
Error in rbindlist(l) :
Item 2 has 4 columns, inconsistent with item 1 which has 3 columns. To fill missing columns use fill=TRUE.
Is there a simple way to manage this, either by forcing fill=TRUE and use.names=FALSE somehow or by omitting the additional columns in the csvs that have them by specifying a vector of columns to append?
If we only need the first 3 columns, drop the rest and bind as usual:
rbindlist(lapply(l, function(i) i[, 1:3]))
# apple orange dragonfruit
# 1: 1 2 13
# 2: 2 3 14
# 3: 3 4 15
# 4: 7 6 2
# 5: 8 7 3
# 6: 9 8 4
Another option, from the comments: read the files directly with fread, keeping only the first 3 columns via select, then bind (filenames is assumed here to be a character vector of csv paths):
rbindlist(lapply(filenames, fread, select = c(1:3)))
Here is an option with name matching using phonetic from stringdist. Extract the column names from the list of data.table ('nmlist'), unlist, group using phonetic, get the first element, relist it to the same list structure as 'nmlist', use Map to change the column names of the list of data.table, and then apply rbindlist
library(stringdist)
library(data.table)
nmlist <- lapply(l, names)
nm1 <- unlist(nmlist)
rbindlist(Map(setnames, l, relist(ave(nm1, phonetic(nm1),
FUN = function(x) x[1]), skeleton = nmlist)), fill = TRUE)
Output:
# apple orange dragonfruit pear
#1: 1 2 13 NA
#2: 2 3 14 NA
#3: 3 4 15 NA
#4: 7 6 2 1
#5: 8 7 3 2
#6: 9 8 4 3
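If the column order is reliable but the spellings are not, one more way to sketch the positional approach is to keep the first three columns and rename them by position before binding. The vector std_names below is a hypothetical set of standard names, not something given in the question.

```r
library(data.table)

csv1 <- data.table(apple = 1:3, orange = 2:4, dragonfruit = 13:15)
csv2 <- data.table(appole = 7:9, orangina = 6:8, dragonificfruit = 2:4, pear = 1:3)
l <- list(csv1, csv2)

# Hypothetical standard names for the columns we want to keep, in order
std_names <- c("apple", "orange", "dragonfruit")

aligned <- lapply(l, function(dt) {
  # Subsetting with with = FALSE returns a new table, so the inputs keep their names
  out <- dt[, seq_along(std_names), with = FALSE]
  setnames(out, std_names)  # rename by position
  out
})

combined <- rbindlist(aligned)
print(combined)
```

This skips the later renaming step, at the cost of trusting that the first three columns always mean the same thing.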

How do we avoid for-loops when we want to conditionally add columns by reference? (condition to be evaluated separately in each row)

I have a data.table with many numbered columns. As a simpler example, I have this:
dat <- data.table(cbind(col1=sample(1:5,10,replace=T),
col2=sample(1:5,10,replace=T),
col3=sample(1:5,10,replace=T),
col4=sample(1:5,10,replace=T)),
oneMoreCol='a')
I want to create a new column as follows: In each row, we add the values in columns from among col1-col4 if the value is not NA or 1.
My current code for this has two for-loops which is clearly not the way to do it:
for(i in 1:nrow(dat)){
dat[i,'sumCol':={temp=0;
for(j in 1:4){if(!is.na(dat[i,paste0('col',j),with=F])&
dat[i,paste0('col',j),with=F]!=1
){temp=temp+dat[i,paste0('col',j),with=F]}};
temp}]}
I would appreciate any advice on how to remove these for-loops. My code is running on a bigger data.table and it takes a long time to run.
A possible solution:
dat[, sumCol := rowSums(.SD * (.SD != 1), na.rm = TRUE), .SDcols = col1:col4]
which gives:
> dat
col1 col2 col3 col4 oneMoreCol sumCol
1: 4 5 5 3 a 17
2: 4 5 NA 5 a 14
3: 2 3 4 3 a 12
4: 1 2 3 4 a 9
5: 4 3 NA 5 a 12
6: 2 2 1 4 a 8
7: NA 2 NA 5 a 7
8: 4 2 2 4 a 12
9: 4 1 5 4 a 13
10: 2 1 5 1 a 7
Used data:
set.seed(20200618)
dat <- data.table(cbind(col1=sample(c(NA, 1:5),10,replace=T),
col2=sample(1:5,10,replace=T),
col3=sample(c(1:5,NA),10,replace=T),
col4=sample(1:5,10,replace=T)),
oneMoreCol='a')
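To see why the masking trick works, here is a small deterministic sketch with hand-picked values (not the sampled data above). Multiplying by (.SD != 1) turns every 1 into 0, while NA stays NA and is then dropped by na.rm = TRUE. The column names are passed as a character vector, which is an assumption made here for clarity rather than something the answer requires.

```r
library(data.table)

# Hand-picked values so the result can be checked by eye
dat <- data.table(
  col1 = c(4, 1, NA),
  col2 = c(5, 2, 2),
  col3 = c(NA, 3, 1),
  col4 = c(3, 4, 5),
  oneMoreCol = "a"
)

cols <- paste0("col", 1:4)

# (.SD != 1) is TRUE/FALSE/NA per cell; multiplying zeroes out the 1s,
# and na.rm = TRUE makes rowSums skip the NA cells
dat[, sumCol := rowSums(.SD * (.SD != 1), na.rm = TRUE), .SDcols = cols]
print(dat)
```

Row 1 sums 4 + 5 + 3, row 2 sums 2 + 3 + 4 (the 1 is masked), and row 3 sums 2 + 5.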

Data.table: rbind a list of data tables with unequal columns [duplicate]

I have a list of data tables that are of unequal lengths. Some of the data tables have 35 columns and others have 36.
I have this line of code, but it generates an error
> lst <- unlist(full_data.lst, recursive = FALSE)
> model_dat <- do.call("rbind", lst)
Error in rbindlist(l, use.names, fill, idcol) :
Item 1362 has 35 columns, inconsistent with item 1 which has 36 columns. If instead you need to fill missing columns, use set argument 'fill' to TRUE.
Any suggestions on how I can modify that so that it works properly?
Here's a minimal example of what you are trying to do.
No need to use any other package to do this. Just set fill=TRUE in rbindlist.
You can do this:
df1 <- data.table(m1 = c(1,2,3))
df2 <- data.table(m1 = c(1,2,3), m2=c(3,4,5))
df3 <- rbindlist(list(df1, df2), fill=T)
print(df3)
m1 m2
1: 1 NA
2: 2 NA
3: 3 NA
4: 1 3
5: 2 4
6: 3 5
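Applied to a list like the one in the question, a minimal sketch could look like this. The list names and the idcol label "source" are made up for illustration; idcol is optional and only records which list element each row came from.

```r
library(data.table)

# Hypothetical stand-in for the flattened list of tables from the question
lst <- list(
  first  = data.table(m1 = 1:3),
  second = data.table(m1 = 1:2, m2 = c(3, 4))
)

# Instead of do.call("rbind", lst): call rbindlist directly with fill = TRUE
model_dat <- rbindlist(lst, fill = TRUE, idcol = "source")
print(model_dat)
```

Rows from tables that lack a column get NA in that column.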
If I understood your question correctly, I see only two options for appending your data tables.
Option A: Drop the extra variable from the dataset that has it.
table$column_Name <- NULL
Option B: Create the variable, filled with missing values, in each incomplete dataset (the table itself, not the enclosing list).
table$column_Name <- NA
And then call rbind.
Try to use rbind.fill from package plyr:
Input data, 3 dataframes with different number of columns
df1<-data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4,5))
df2<-data.frame(a=c(1,2,3,4,5,6),b=c(1,2,3,4,5,6),c=c(1,2,3,4,5,6))
df3<-data.frame(a=c(1,2,3),d=c(1,2,3))
full_data.lst<-list(df1,df2,df3)
The solution
library("plyr")
rbind.fill(full_data.lst)
a b c d
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 1 1 1 NA
7 2 2 2 NA
8 3 3 3 NA
9 4 4 4 NA
10 5 5 5 NA
11 6 6 6 NA
12 1 NA NA 1
13 2 NA NA 2
14 3 NA NA 3

How do I select all data.table columns that are in a second data.table

I have several data.tables that have the same columns, and one that has some extra columns. I want to rbind them all, but only on the common columns.
With data.frames I could simply do
rbind(df1[,names(df2)],df2,df3,...)
I can of course write out all the column names in the form
list(col1,col2,col3,col4)
but this is neither elegant nor feasible if one has 1,000 variables.
I am sure there is a way and I am just not getting there. Any help would be appreciated.
Maybe you can try:
DT1 <- data.table(Col1=1:5, Col2=6:10, Col3=2:6)
DT2 <- data.table(Col1=1:4, Col3=2:5)
DT3 <- data.table(Col1=1:7, Col3=1:7)
lst1 <- mget(ls(pattern="DT\\d+"))
ColstoRbind <- Reduce(`intersect`,lapply(lst1, colnames))
# .. "looks up one level"
res <- rbindlist(lapply(lst1, function(x) x[, ..ColstoRbind]))
res
# Col1 Col3
# 1: 1 2
# 2: 2 3
# 3: 3 4
# 4: 4 5
# 5: 5 6
# 6: 1 2
# 7: 2 3
# 8: 3 4
# 9: 4 5
#10: 1 1
#11: 2 2
#12: 3 3
#13: 4 4
#14: 5 5
#15: 6 6
#16: 7 7
Update
As @Arun suggested in the comments, this might be better
rbindlist(lapply(lst1, function(x) {
if(length(setdiff(colnames(x), ColstoRbind))>0) {
x[, ..ColstoRbind]
}
else x}))
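The same idea can be sketched without mget(ls(...)), assuming the tables are already collected in a list (the name tables and the toy values are hypothetical). Using with = FALSE is an alternative to the .. prefix for selecting columns by a character vector.

```r
library(data.table)

DT1 <- data.table(Col1 = 1:2, Col2 = 3:4, Col3 = 5:6)
DT2 <- data.table(Col1 = 3L, Col3 = 7L)
tables <- list(DT1, DT2)

# Column names present in every table
common <- Reduce(intersect, lapply(tables, names))

# Keep only the shared columns from each table, then stack them
res <- rbindlist(lapply(tables, function(dt) dt[, common, with = FALSE]))
print(res)
```

Building the list explicitly avoids picking up unrelated objects whose names happen to match the ls() pattern.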

Skip NA values using "FUN=first"

There's probably a really simple explanation as to what I'm doing wrong, but I've been working on this for quite some time today and I still cannot get it to work. I thought this would be a walk in the park; however, my code isn't working as expected.
So for this example, let's say I have a data frame as follows.
df
Row# user columnB
1 1 NA
2 1 NA
3 1 NA
4 1 31
5 2 NA
6 2 NA
7 2 15
8 3 18
9 3 16
10 3 NA
Basically, I would like to create a new column that uses the first (as well as last) function (within the TTR library package) to obtain the first non-NA value for each user. So my desired data frame would be this.
df
Row# user columnB firstValue
1 1 NA 31
2 1 NA 31
3 1 NA 31
4 1 31 31
5 2 NA 15
6 2 NA 15
7 2 15 15
8 3 18 18
9 3 16 18
10 3 NA 18
I've looked around mainly using google, but I couldn't really find my exact answer.
Here's some of my code that I've tried, but I didn't get the results that I wanted (note, I'm bringing this from memory, so there are quite a few more variations of these, but these are the general forms that I've been trying).
df$firstValue<-ave(df$columnB,df$user,FUN=first,na.rm=True)
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){x,first,na.rm=True})
df$firstValue<-ave(df$columnB,df$user,FUN=function(x){first(x,na.rm=True)})
df$firstValue<-by(df,df$user,FUN=function(x){x,first,na.rm=True})
Failed, these just give the first value of each group, which would be NA.
Again, these are just a few examples from the top of my head, I played around with na.rm, using na.exclude, na.omit, na.action(na.omit), etc...
Any help would be greatly appreciated. Thanks.
A data.table solution
require(data.table)
DT <- data.table(df, key="user")
DT[, firstValue := na.omit(columnB)[1], by=user]
Here is a solution with plyr :
ddply(df, .(user), transform, firstValue=na.omit(columnB)[1])
Which gives :
Row user columnB firstValue
1 1 1 NA 31
2 2 1 NA 31
3 3 1 NA 31
4 4 1 31 31
5 5 2 NA 15
6 6 2 NA 15
7 7 2 15 15
8 8 3 18 18
9 9 3 16 18
If you want to capture the last value, you can do :
ddply(df, .(user), transform, lastValue=tail(na.omit(columnB),1))
Using data.table
library (data.table)
DT <- data.table(df, key="user")
DT <- setnames(DT[unique(DT[!is.na(columnB), list(columnB), by="user"])], "columnB.1", "first")
Using a very small helper function
finite <- function(x) x[is.finite(x)]
here is a one-liner using only standard R functions:
df <- cbind(df, firstValue = unlist(sapply(unique(df[,1]), function(user) rep(finite(df[df[,1] == user,2])[1], sum(df[,1] == user))))
For a better overview, here is the one-liner unfolded into a "multi-liner":
# for each user, find the first finite (in this case non-NA) value of the second column and replicate it as many times as the user has rows
# then, the results of all users are joined into one vector (unlist) and appended to the data frame as column
df <- cbind(
df,
firstValue = unlist(
sapply(
unique(df[,1]),
function(user) {
rep(
finite(df[df[,1] == user,2])[1],
sum(df[,1] == user)
)
}
)
)
)
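Pulling the approaches above together, here is a small sketch that fills in both the first and the last non-NA value per user. The helper names first_non_na and last_non_na are made up for illustration; both return NA when a group has no non-missing values.

```r
library(data.table)

df <- data.frame(
  user    = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  columnB = c(NA, NA, NA, 31, NA, NA, 15, 18, 16, NA)
)

# Hypothetical helpers: indexing [1] on an empty vector yields NA
first_non_na <- function(x) x[!is.na(x)][1]
last_non_na  <- function(x) rev(x[!is.na(x)])[1]

# Base R: ave() recycles the single value across each user's rows
df$firstValue <- ave(df$columnB, df$user, FUN = first_non_na)

# data.table: same idea, adding both columns by reference per group
DT <- as.data.table(df)
DT[, lastValue := last_non_na(columnB), by = user]
print(DT)
```

The original ave() attempts failed because first() itself has no notion of skipping NA; dropping the NAs before taking the first element is what makes the difference.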
