How do I select all data.table columns that are in a second data.table?

I have several data.tables that have the same columns, and one that has some extra columns. I want to rbind them all, but only on the common columns.
With data.frames I could simply do
rbind(df1[,names(df2)],df2,df3,...)
I could of course write out all the column names in the form
list(col1,col2,col3,col4)
but this is neither elegant nor feasible if one has 1,000 variables.
I am sure there is a way, but I am not getting there; any help would be appreciated.

Maybe you can try:
DT1 <- data.table(Col1=1:5, Col2=6:10, Col3=2:6)
DT2 <- data.table(Col1=1:4, Col3=2:5)
DT3 <- data.table(Col1=1:7, Col3=1:7)
lst1 <- mget(ls(pattern="DT\\d+"))
ColstoRbind <- Reduce(intersect, lapply(lst1, colnames))
# the '..' prefix "looks up one level", so ColstoRbind is read from the
# calling scope as a vector of column names rather than as a column of x
res <- rbindlist(lapply(lst1, function(x) x[, ..ColstoRbind]))
res
# Col1 Col3
# 1: 1 2
# 2: 2 3
# 3: 3 4
# 4: 4 5
# 5: 5 6
# 6: 1 2
# 7: 2 3
# 8: 3 4
# 9: 4 5
#10: 1 1
#11: 2 2
#12: 3 3
#13: 4 4
#14: 5 5
#15: 6 6
#16: 7 7
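The same subsetting can be written with .SD and .SDcols; a small sketch of an equivalent alternative, assuming the same lst1 and ColstoRbind as above:
# .SD restricted to the common columns gives the same result as ..ColstoRbind
res2 <- rbindlist(lapply(lst1, function(x) x[, .SD, .SDcols = ColstoRbind]))
identical(res, res2)  # should be TRUE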
Update
As @Arun suggested in the comments, this might be better, as it avoids taking a copy of tables that already contain only the common columns:
rbindlist(lapply(lst1, function(x) {
  if (length(setdiff(colnames(x), ColstoRbind)) > 0) {
    x[, ..ColstoRbind]
  } else {
    x
  }
}))

Appending csvs with different column quantities and spellings

Nothing too complicated: it would just be useful to use rbindlist on a large number of csvs where the column names change a little over time (minor spelling changes), the column orders remain the same, and at some point two additional columns are added to the csvs (which I don't really need).
library(data.table)
csv1 <- data.table("apple" = 1:3, "orange" = 2:4, "dragonfruit" = 13:15)
csv2 <- data.table("appole" = 7:9, "orangina" = 6:8, "dragonificfruit" = 2:4, "pear" = 1:3)
l <- list(csv1, csv2)
When I run
csv_append <- rbindlist(l, fill=TRUE) #which also forces use.names=TRUE
it gives me a data.table with 7 columns
apple orange dragonfruit appole orangina dragonificfruit pear
1: 1 2 13 NA NA NA NA
2: 2 3 14 NA NA NA NA
3: 3 4 15 NA NA NA NA
4: NA NA NA 7 6 2 1
5: NA NA NA 8 7 3 2
6: NA NA NA 9 8 4 3
as opposed to what I want, which is:
V1 V2 V3 V4
1: 1 2 13 NA
2: 2 3 14 NA
3: 3 4 15 NA
4: 7 6 2 1
5: 8 7 3 2
6: 9 8 4 3
which I can use, even though I have to go through the extra step later of renaming the columns back to standard variable names.
If I instead try the default fill=FALSE and use.names=FALSE, it throws an error:
Error in rbindlist(l) :
Item 2 has 4 columns, inconsistent with item 1 which has 3 columns. To fill missing columns use fill=TRUE.
Is there a simple way to manage this, either by forcing fill=TRUE and use.names=FALSE somehow or by omitting the additional columns in the csvs that have them by specifying a vector of columns to append?
If we only need the first three columns, we can drop the rest and bind as usual:
rbindlist(lapply(l, function(i) i[, 1:3]))
# apple orange dragonfruit
# 1: 1 2 13
# 2: 2 3 14
# 3: 3 4 15
# 4: 7 6 2
# 5: 8 7 3
# 6: 9 8 4
Another option, from the comments: read the files directly with fread, using select to keep only the first three columns, then bind:
rbindlist(lapply(filenames, fread, select = c(1:3)))
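If the tables are already in memory, a positional rename produces the V1..Vn layout described in the question; a sketch, assuming the same list l as above (setnames renames by reference, here applied to the copies made by the column subset):
common <- lapply(l, function(x) {
  x <- x[, 1:3]                 # subsetting returns a copy, so l is untouched
  setnames(x, paste0("V", 1:3)) # positional names make spellings irrelevant
  x
})
rbindlist(common)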
Here is an option that matches names with phonetic from stringdist. Extract the column names from the list of data.tables ('nmlist') and unlist them, group the names by phonetic code and take the first element of each group, relist the result to the same structure as 'nmlist', use Map with setnames to change the column names of each data.table, and then apply rbindlist:
library(stringdist)
library(data.table)
nmlist <- lapply(l, names)
nm1 <- unlist(nmlist)
rbindlist(Map(setnames, l,
              relist(ave(nm1, phonetic(nm1), FUN = function(x) x[1]),
                     skeleton = nmlist)),
          fill = TRUE)
Output:
# apple orange dragonfruit pear
#1: 1 2 13 NA
#2: 2 3 14 NA
#3: 3 4 15 NA
#4: 7 6 2 1
#5: 8 7 3 2
#6: 9 8 4 3
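Note that this matching is heuristic: phonetic() encodes names with the Soundex algorithm, so any two names that happen to share a code will be collapsed, related or not. It is worth inspecting the relisted names before binding.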

How do we avoid for-loops when we want to conditionally add columns by reference? (condition to be evaluated separately in each row)

I have a data.table with many numbered columns. As a simpler example, I have this:
dat <- data.table(cbind(col1=sample(1:5,10,replace=T),
                        col2=sample(1:5,10,replace=T),
                        col3=sample(1:5,10,replace=T),
                        col4=sample(1:5,10,replace=T)),
                  oneMoreCol='a')
I want to create a new column as follows: in each row, add up the values of col1-col4 that are neither NA nor 1.
My current code for this has two for-loops which is clearly not the way to do it:
for (i in 1:nrow(dat)) {
  dat[i, 'sumCol' := {
    temp = 0
    for (j in 1:4) {
      if (!is.na(dat[i, paste0('col', j), with = F]) &
          dat[i, paste0('col', j), with = F] != 1) {
        temp = temp + dat[i, paste0('col', j), with = F]
      }
    }
    temp
  }]
}
I would appreciate any advice on how to remove these for-loops. My code is running on a bigger data.table and it takes a long time to run.
A possible solution:
dat[, sumCol := rowSums(.SD * (.SD != 1), na.rm = TRUE), .SDcols = col1:col4]
Multiplying by the logical matrix .SD != 1 zeroes out the entries equal to 1, and na.rm = TRUE makes rowSums skip the NAs. This gives:
> dat
col1 col2 col3 col4 oneMoreCol sumCol
1: 4 5 5 3 a 17
2: 4 5 NA 5 a 14
3: 2 3 4 3 a 12
4: 1 2 3 4 a 9
5: 4 3 NA 5 a 12
6: 2 2 1 4 a 8
7: NA 2 NA 5 a 7
8: 4 2 2 4 a 12
9: 4 1 5 4 a 13
10: 2 1 5 1 a 7
Used data:
set.seed(20200618)
dat <- data.table(cbind(col1=sample(c(NA, 1:5),10,replace=T),
col2=sample(1:5,10,replace=T),
col3=sample(c(1:5,NA),10,replace=T),
col4=sample(1:5,10,replace=T)),
oneMoreCol='a')
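For very wide tables, a columnwise variant avoids materializing the intermediate .SD * (.SD != 1) matrix; a sketch, assuming the same dat and a data.table version with fifelse (1.12.4 or later):
# treat NA and 1 as zero in each column, then add the columns up
dat[, sumCol2 := Reduce(`+`, lapply(.SD, function(v)
  fifelse(is.na(v) | v == 1L, 0L, v))), .SDcols = col1:col4]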

rbindlist only elements that meet a condition

I have a large list. Some of the elements are strings and some of the elements are data.tables. I would like to create a big data.table, but only rbind the elements that are data.tables.
I know how to do it in a for loop, but I am looking for something more efficient as my data are big and I need something quick.
Thank you!
library(data.table)
DT1 = data.table(
  ID = c("b","b","b","a","a","c"),
  a = 1:6
)
DT2 = data.table(
  ID = c("b","b","b","a","a","c"),
  a = 11:16
)
list <- list(DT1, DT2, "string")
I am looking for a result similar to the output of the following, but since I have many entries I cannot do it like this:
rbind(DT1, DT2)
Filter the data.tables and rbind:
library(data.table)
rbindlist(Filter(is.data.table, list_df))
# ID a
# 1: b 1
# 2: b 2
# 3: b 3
# 4: a 4
# 5: a 5
# 6: c 6
# 7: b 11
# 8: b 12
# 9: b 13
#10: a 14
#11: a 15
#12: c 16
data
list_df <- list(DT1,DT2,"string")
We can use keep from purrr with bind_rows:
library(tidyverse)
keep(list, is.data.table) %>%
  bind_rows()
# ID a
# 1: b 1
# 2: b 2
# 3: b 3
# 4: a 4
# 5: a 5
# 6: c 6
# 7: b 11
# 8: b 12
# 9: b 13
#10: a 14
#11: a 15
#12: c 16
Or using rbindlist with keep
rbindlist(keep(list, is.data.table))
Using sapply() to generate a logical vector to subset your list
rbindlist(list[sapply(list, is.data.table)])
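A minor variant of the same idea, as a sketch: vapply makes the logical index type-safe, and using a name like lst avoids masking base::list:
lst <- list(DT1, DT2, "string")
rbindlist(lst[vapply(lst, is.data.table, logical(1))])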

How to replace a certain value in one data.table with values of another data.table of same dimension

Given two data.table:
dt1 <- data.table(id = c(1,-99,2,2,-99), a = c(2,1,-99,-99,3), b = c(5,3,3,2,5), c = c(-99,-99,-99,2,5))
dt2 <- data.table(id = c(2,3,1,4,3),a = c(6,4,3,2,6), b = c(3,7,8,8,3), c = c(2,2,4,3,2))
> dt1
id a b c
1: 1 2 5 -99
2: -99 1 3 -99
3: 2 -99 3 -99
4: 2 -99 2 2
5: -99 3 5 5
> dt2
id a b c
1: 2 6 3 2
2: 3 4 7 2
3: 1 3 8 4
4: 4 2 8 3
5: 3 6 3 2
How can one replace the -99 values of dt1 with the corresponding values of dt2?
The wanted result is dt3:
> dt3
id a b c
1: 1 2 5 2
2: 3 1 3 2
3: 2 3 3 4
4: 2 2 2 2
5: 3 3 5 5
You can do the following:
dt3 <- as.data.frame(dt1)
dt2 <- as.data.frame(dt2)
dt3[dt3 == -99] <- dt2[dt3 == -99]
dt3
# id a b c
# 1 1 2 5 2
# 2 3 1 3 2
# 3 2 3 3 4
# 4 2 2 2 2
# 5 3 3 5 5
If your data is all of the same type (as in your example), then transforming the tables to matrices is a lot faster and more transparent:
dt1a <- as.matrix(dt1) ## convert to matrix
dt2a <- as.matrix(dt2)
# make a matrix of the same shape to access the right entries
missing_idx <- dt1a == -99
dt1a[missing_idx] <- dt2a[missing_idx] ## replace the flagged entries
This is a vectorized operation, so it should be fast.
Note: If you do this make sure the two data sources match exactly in shape and order of rows/columns. If they don't then you need to join by the relevant keys and pick the correct columns.
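A hypothetical sketch of that alignment, assuming both tables shared a unique id column named 'key' (the example data has no such column; its rows simply correspond by position):
# 'key' is a hypothetical unique id present in both tables
dt2_aligned <- dt2[dt1[, .(key)], on = "key"]  # reorder dt2's rows to match dt1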
EDIT: The conversion to matrix may be unnecessary. See kath's answer for a terser solution.
A simple way is to use the setDF function to convert to data.frames and use data frame subsetting methods, then restore to data.table at the end.
# Change to data.frame
setDF(dt1)
setDF(dt2)
# Perform assignment
dt1[dt1==-99] = dt2[dt1==-99]
# Restore back to data.table
setDT(dt1)
setDT(dt2)
dt1
# id a b c
# 1 1 2 5 2
# 2 3 1 3 2
# 3 2 3 3 4
# 4 2 2 2 2
# 5 3 3 5 5
This simple trick works efficiently, provided the conversions are assigned back:
dt1 <- as.matrix(dt1)
dt2 <- as.matrix(dt2)
index.replace <- dt1 == -99
dt1[index.replace] <- dt2[index.replace]
dt1 <- as.data.table(dt1)
dt2 <- as.data.table(dt2)
A simple double loop also works; with a data.table, use set() for element-wise access and assignment, since dt1[i, j] does not index columns by position:
for (i in seq_len(nrow(dt1))) {
  for (j in seq_len(ncol(dt1))) {
    if (dt1[[j]][i] == -99) set(dt1, i, j, dt2[[j]][i])
  }
}

Full outer join of multiple dataframes stored as elements of a list using data.table

I'm trying to do a full outer join of multiple dataframes stored as elements of a list using data.table. I have successfully done this using the merge_recurse() function of the reshape package, but it is very slow with larger datasets, and I'd like to speed the merge up by using data.table. I'm not sure of the best way for data.table to handle the list structure with multiple dataframes. I'm also not sure whether I've written the Reduce() function correctly on unique keys to do a full outer join on multiple dataframes.
Here's a small example:
#Libraries
library("reshape")
library("data.table")
#Specify list of multiple dataframes
filelist <- list(data.frame(x=c(1,1,1,2,2,2,3,3,3), y=c(1,2,3,1,2,3,1,2,3), a=1:9),
                 data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,1), b=seq(from=0, by=5, length.out=9)),
                 data.frame(x=c(1,1,1,2,2,2,3,3,4), y=c(1,2,3,1,2,3,1,2,2), c=seq(from=0, by=10, length.out=9)))
#Merge with merge_recurse()
listMerged <- merge_recurse(filelist, by=c("x","y"))
#Attempt with data.table
ids <- lapply(filelist, function(x) x[,c("x","y")])
unique_keys <- unique(do.call("rbind", ids))
dt <- data.table(filelist)
setkey(dt, c("x","y")) #error here
Reduce(function(x, y) x[y[J(unique_keys)]], filelist)
Here's my expected output:
> listMerged
x y a b c
1 1 1 1 0 0
2 1 2 2 5 10
3 1 3 3 10 20
4 2 1 4 15 30
5 2 2 5 20 40
6 2 3 6 25 50
7 3 1 7 30 60
8 3 2 8 35 70
9 3 3 9 NA NA
10 4 1 NA 40 NA
11 4 2 NA NA 80
Here are my resources:
Suggestion to use Reduce() function on data.table (see last comment of answer)
Suggestion to use "unique keys" to do full outer join in data.table
This worked for me:
library("reshape")
library("data.table")
##
filelist <- list(
data.frame(
x=c(1,1,1,2,2,2,3,3,3),
y=c(1,2,3,1,2,3,1,2,3),
a=1:9),
data.frame(
x=c(1,1,1,2,2,2,3,3,4),
y=c(1,2,3,1,2,3,1,2,1),
b=seq(from=0, by=5, length.out=9)),
data.frame(
x=c(1,1,1,2,2,2,3,3,4),
y=c(1,2,3,1,2,3,1,2,2),
c=seq(from=0, by=10, length.out=9)))
##
## I used copy so that this would
## not modify 'filelist'
dtList <- copy(filelist)
lapply(dtList,setDT)
lapply(dtList, function(x) {
  setkeyv(x, cols = c("x", "y"))
})
##
> Reduce(function(x, y) {
    merge(x, y, all = TRUE, allow.cartesian = TRUE)
  }, dtList)
x y a b c
1: 1 1 1 0 0
2: 1 2 2 5 10
3: 1 3 3 10 20
4: 2 1 4 15 30
5: 2 2 5 20 40
6: 2 3 6 25 50
7: 3 1 7 30 60
8: 3 2 8 35 70
9: 3 3 9 NA NA
10: 4 1 NA 40 NA
11: 4 2 NA NA 80
Also I noticed a couple of problems in your code. dt <- data.table(filelist) resulted in
> dt
filelist
1: <data.frame>
2: <data.frame>
3: <data.frame>
which is most likely the cause of the error in setkey(dt, c("x","y")) that you pointed out above. Also, did this work for you?
Reduce(function(x, y) x[y[J(unique_keys)]], filelist)
I'm just curious, because I was getting an error when I tried to run it (using dtList instead of filelist)
Error in eval(expr, envir, enclos) : could not find function "J"
which I believe has to do with the changes implemented since version 1.8.8 of data.table, explained by @Arun in this answer.
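For completeness, the asker's unique-keys idea can also be made to work once every element is a data.table; a sketch folding joins over dtList, assuming data.table 1.9.6+ for on= (the column order will differ from the merge() result):
# every (x, y) pair observed anywhere, used as the join spine
unique_keys <- unique(rbindlist(lapply(dtList, function(d) d[, .(x, y)])))
# right-join the spine into each table in turn, accumulating the value columns
Reduce(function(acc, d) d[acc, on = c("x", "y")], dtList, init = unique_keys)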
