R: Merge data.table and fill in NAs

R: Merge data.table and fill in NAs - r

Suppose 3 data tables:
dt1<-data.table(Type=c("a","b"),x=1:2)
dt2<-data.table(Type=c("a","b"),y=3:4)
dt3<-data.table(Type=c("c","d"),z=3:4)
I want to merge them into 1 data table, so I do this:
dt4<-merge(dt1,dt2,by="Type") # No error, produces what I want
dt5<-merge(dt4,dt3,by="Type") # Produces empty data.table (0 rows) of 4 cols: Type,x,y,z
Is there a way to make dt5 like this instead?:
> dt5
Type x y z
1: a 1 3 NA
2: b 2 4 NA
3: c NA NA 3
4: d NA NA 4

While you explore the all argument to merge, I'll also offer you an alternative that might want to consider:
Reduce(function(x, y) merge(x, y, by = "Type", all = TRUE), list(dt1, dt2, dt3))
# Type x y z
# 1: a 1 3 NA
# 2: b 2 4 NA
# 3: c NA NA 3
# 4: d NA NA 4

If you know in advance the unique values you have in your Type column you can use J and then join tables the data.table way. You should set the key for each table so data.table knows what to join on, like this...
# setkeys
setkey( dt1 , Type )
setkey( dt2 , Type )
setkey( dt3 , Type )
# Join
dt1[ dt2[ dt3[ J( letters[1:4] ) , ] ] ]
# Type x y z
#1: a 1 3 NA
#2: b 2 4 NA
#3: c NA NA 3
#4: d NA NA 4
This shows off data.table's compound queries (i.e. dt1[dt2[dt3[...]]] ) which are wicked!
If you don't know in advance the unique values for the key column you can make a list of your tables and use lapply to quickly run through them getting the unique values to make your J expression...
# A simple way to get the unique values to make 'J',
# assuming they are in the first column.
ll <- list( dt1 , dt2 , dt3 )
vals <- unique( unlist( lapply( ll , `[` , 1 ) ) )
#[1] "a" "b" "c" "d"
Then use it like before, i.e. dt1[ dt2[ dt3[ J( vals ) , ] ] ].

Related

Retaining by variables in R data.table by-without-by

I'd like to retain the by variables in a by-without-by operation using data.table.
I have a by-without-by that used to work (ca. 2 years ago), and now with the latest version of data.table I think the behavior must have changed.
Here's a reproducible example:
library(data.table)
dt <- data.table( by1 = letters[1:3], by2 = LETTERS[1:3], x = runif(3) )
by <- c("by1","by2")
allPermutationsOfByvars <- do.call(CJ, sapply(dt[,by,with=FALSE], unique, simplify=FALSE)) ## CJ() to form index
setkeyv(dt, by)
dt[ allPermutationsOfByvars, list( x = x ) ]
Which produces:
> dt[ allPermutationsOfByvars, list( x = x ) ]
x
1: 0.9880997
2: NA
3: NA
4: NA
5: 0.4650647
6: NA
7: NA
8: NA
9: 0.4899873
I could just do:
> cbind( allPermutationsOfByvars, dt[ allPermutationsOfByvars, list( x = x ) ] )
by1 by2 x
1: a A 0.9880997
2: a B NA
3: a C NA
4: b A NA
5: b B 0.4650647
6: b C NA
7: c A NA
8: c B NA
9: c C 0.4899873
Which indeed works, but is inelegant and possibly inefficient.
Is there an argument I'm missing or a clever stratagem to retain the by variables?

Add by = .EACHI to get the "by-without-by" aka by-EACH-element-of-I:
dt[allPermutationsOfByvars, x, by = .EACHI]
And this is how I'd have done the initial part:
allPermutationsOfByvars = dt[, do.call(CJ, unique(setDT(mget(by))))]
Finally, the on argument is usually the better choice now (vs setkey).

Inline ifelse assignment in data.table

Let the following data set be given:
library('data.table')
set.seed(1234)
DT <- data.table(x = LETTERS[1:10], y =sample(10))
my.rows <- sample(1:dim(DT)[1], 3)
I want to add a new column to the data set such that, whenever the rows of the data set match the row numbers given by my.rows the entry is populated with, say, true, or false otherwise.
I have got DT[my.rows, z:= "true"], which gives
head(DT)
x y z
1: A 2 NA
2: B 6 NA
3: C 5 true
4: D 8 NA
5: E 9 true
6: F 4 NA
but I do not know how to automatically populate the else condition as well, at the same time. I guess I should make use of some sort of inline ifelse but I am lacking the correct syntax.

We can compare the 'my.rows' with the sequence of row using %in% to create a logical vector and assign (:=) it to create 'z' column.
DT[, z:= 1:.N %in% my.rows ]
Or another option would be to create 'z' as a column of 'FALSE', using 'my.rows' as 'i', we assign the elements in 'z' that correspond to 'i' as 'TRUE'.
DT[, z:= FALSE][my.rows, z:= TRUE]

DT <- cbind(DT,z = ifelse(DT[, .I] %in% my.rows , T, NA))
> DT
# x y z
# 1: A 2 NA
# 2: B 6 NA
# 3: C 5 TRUE
# 4: D 8 NA
# 5: E 9 TRUE
# 6: F 4 NA
# 7: G 1 TRUE
# 8: H 7 NA
# 9: I 10 NA
#10: J 3 NA

rbindlist data.tables with different number of columns

I am wondering how do I rbindlist data tables with different number of columns, and filling up empty rows with NAs like rbind.fill
DT1 <- data.table(A = 1:3)
DT2 <- data.table(A =4:5, B = letters[4:5])
l <- list(DT1, DT2)
rbindlist(l)
# Error in rbindlist(l) :
# Item 2 has 2 columns, inconsistent with item 1 which has 1 columns
What I want to get is
A B
1: 1 NA
2: 2 NA
3: 3 NA
4: 4 d
5: 5 e

This feature is now implemented in commit 1266 of v1.9.3. From NEWS:
o 'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented
entirely in C. Closes #5249
-> use.names by default is FALSE for backwards compatibility (doesn't bind by
names by default)
-> rbind(...) now just calls rbindlist() internally, except that 'use.names'
is TRUE by default, for compatibility with base (and backwards compatibility).
-> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
-> At least one item of the input list has to have non-null column names.
-> Duplicate columns are bound in the order of occurrence, like base.
-> Attributes that might exist in individual items would be lost in the bound result.
-> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
-> And incredibly fast ;).
-> Documentation updated in much detail. Closes DR #5158.
Check this post for benchmarks.
Examples:
1) Using fill argument of rbindlist:
DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=2, z=-1)
rbindlist(list(DT1, DT2), fill=TRUE)
# x y z
# 1: 1 2 NA
# 2: NA 2 -1
Note that when fill=TRUE, use.names should be TRUE.
2) Binding tables with duplicate names appropriately:
DT1 <- data.table(x=1, x=2, y=1, y=2)
DT2 <- data.table(y=3, y=-1, y=-2)
rbindlist(list(DT1, DT2), fill=TRUE)
# x x y y y
# 1: 1 2 1 2 NA
# 2: NA NA 3 -1 -2
3) It's not limited to just data.tables, but works on data.frames and lists as well:
DT1 <- data.table(x=1, y=2)
DT2 <- data.frame(y=2, z=-1)
DT3 <- list(z=10)
rbindlist(list(DT1,DT2,DT3), fill=TRUE)
# x y z
# 1: 1 2 NA
# 2: NA 2 -1
# 3: NA NA 10
4) If you would like to bind just by names, you can set just use.names=TRUE, but not fill:
DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=1, x=2)
rbindlist(list(DT1,DT2), use.names=TRUE, fill=FALSE)
# x y
# 1: 1 2
# 2: 2 1
DT1 <- data.table(x=1, y=2)
DT2 <- data.table(z=2, y=1)
# returns error when fill=FALSE but can't be bound without fill=TRUE
rbindlist(list(DT1, DT2), use.names=TRUE, fill=FALSE)
# Error in rbindlist(list(DT1, DT2), use.names = TRUE, fill = FALSE) :
# Answer requires 3 columns whereas one or more item(s) in the input
# list has only 2 columns. ...
5) The default is the same for backwards compatibility (use.names=FALSE, fill=FALSE):
DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=1, x=2)
rbindlist(list(DT1, DT2))
# x y
# 1: 1 2
# 2: 1 2
HTH

Rbind with new columns and data.table

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original one (which need to be included). Is there an equivalent of rbind.fill for data.table?
library(data.table)
aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)
dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))
dt.11 <- rbind(dt.1, dt.1) # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2) # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2) # What I need, doesn't work either
I need to start rbinding before I have all tables, so no way to know what future new columns will be called. Missing data can be filled with NA.

Since v1.9.2, data.table's rbind function gained fill argument. From ?rbind.data.table documentation:
If TRUE fills missing columns with NAs. By default FALSE. When
TRUE, use.names has to be TRUE, and all items of the input list has to
have non-null column names.
Thus you can do (prior to approx v1.9.6):
data.table::rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5
UPDATE for v1.9.6:
This now works directly:
rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5

Here is an approach that will update the missing columns in
rbind.missing <- function(A, B) {
cols.A <- names(A)
cols.B <- names(B)
missing.A <- setdiff(cols.B,cols.A)
# check and define missing columns in A
if(length(missing.A) > 0L){
# .. means "look up one level"
class.missing.A <- lapply(B[, ..missing.A], class)
nas.A <- lapply(class.missing.A, as, object = NA)
A[,c(missing.A) := nas.A]
}
# check and define missing columns in B
missing.B <- setdiff(names(A), cols.B)
if(length(missing.B) > 0L){
class.missing.B <- lapply(A[, ..missing.B], class)
nas.B <- lapply(class.missing.B, as, object = NA)
B[,c(missing.B) := nas.B]
}
# reorder so they are the same
setcolorder(B, names(A))
rbind(A, B)
}
rbind.missing(dt.1,dt.2)
## aa bb cc
## 1: 1 2 NA
## 2: 2 3 NA
## 3: 3 4 NA
## 4: 1 2 3
## 5: 2 3 4
## 6: 3 4 5
This will not be efficient for many, or large data.tables, as it only works two at a time.

The answers are awesome, but looks like, there are some functions suggested here such as plyr::rbind.fill and gtools::smartbind which seemed to work perfectly for me.

the basic concept is to add missing columns in both directions: from the running master table
to the newTable and back the other way.
As #menl pointed out in the comments, simply assigning an NA is a problem, because that will
make the whole column of class logical.
One solution is to force all columns of a single type (ie as.numeric(NA)), but that is too restrictive.
Instead, we need to analyze each new column for its class. We can then use as(NA, cc) _(cc being the class)
as the vector that we will assign to a new column. We wrap this in an lapply statement on the RHS and use eval(columnName)
on the LHS to assign.
We can then wrap this in a function and use S3 methods so that we can simply call
rbindFill(A, B)
Below is the function.
rbindFill.data.table <- function(master, newTable) {
# Append newTable to master
# assign to Master
#-----------------#
# identify columns missing
colMisng <- setdiff(names(newTable), names(master))
# if there are no columns missing, move on to next part
if (!identical(colMisng, character(0))) {
# identify class of each
colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))
# assign to each column value of NA with appropriate class
master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
}
# assign to newTable
#-----------------#
# identify columns missing
colMisng <- setdiff(names(master), names(newTable))
# if there are no columns missing, move on to next part
if (!identical(colMisng, character(0))) {
# identify class of each
colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))
# assign to each column value of NA with appropriate class
newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
}
# reorder columns to avoid warning about ordering
#-----------------#
colOrdering <- colOrderingByOtherCol(newTable, names(master))
setcolorder(newTable, colOrdering)
# rbind them!
#-----------------#
rbind(master, newTable)
}
# implement generic function
rbindFill <- function(x, y, ...) UseMethod("rbindFill")
Example Usage:
# Sample Data:
#--------------------------------------------------#
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
#--------------------------------------------------#
# Four iterations of calling rbindFill
master <- rbindFill(A, B)
master <- rbindFill(master, A2)
master <- rbindFill(master, C)
# Results:
master
# a b c d m n f
# 1: 1 1 1 NA NA NA NA
# 2: 2 2 2 NA NA NA NA
# 3: 3 3 3 NA NA NA NA
# 4: NA 1 1 1 A NA NA
# 5: NA 2 2 2 B NA NA
# 6: NA 3 3 3 C NA NA
# 7: 6 6 6 NA NA NA NA
# 8: 7 7 7 NA NA NA NA
# 9: 8 8 8 NA NA NA NA
# 10: 9 9 9 NA NA NA NA
# 11: NA NA 7 NA NA 0.86 TRUE
# 12: NA NA 8 NA NA -1.15 FALSE
# 13: NA NA 9 NA NA 1.10 TRUE

Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0] which has the structure of the second data.table. This saves the possibility to introduce bugs in user functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.
Also, using rbindlist() should be much faster when dealing with data.tables.
Define the tables (same as mnel's code above):
library(data.table)
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
Insert the missing variables in table A: (note the use of A2[0]
A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)
Insert the missing columns in table A2:
A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)
Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed, not sure if rbindlist() binds across column names or column positions):
setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL
Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand...
DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))
DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL
The result looks the same as mnels' output (except for the random numbers and the column order).
PS1: The original author does not say what to do if there are matching variables -- do we really want to do a rbind() or are we thinking of a merge()?
PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.

Remove constant columns with or without NAs

I am trying to get many lm models work in a function and I need to automatically drop constant columns from my data.table. Thus, I want to keep only columns with two or more unique values, excluding NA from the count.
I tried several methods found on SO, but I am still not able to drop columns that have two values: a constant and NAs.
My reproducible code:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
> df
x y z d
1: 1 1 NA 2
2: 2 1 NA 2
3: 3 NA NA 2
4: NA NA NA 2
5: 5 NA NA 2
My intention is to drop columns y, z, and d since they are constant, including y that only have one unique value when NAs are omitted.
I tried this:
same <- sapply(df, function(.col){ all(is.na(.col)) || all(.col[1L] == .col)})
df1 <- df[ , !same, with = FALSE]
> df1
x y
1: 1 1
2: 2 1
3: 3 NA
4: NA NA
5: 5 NA
As seen, 'y' is still there ...
Any help?

Because you have a data.table, you may use uniqueN and its na.rm argument:
df[ , lapply(.SD, function(v) if(uniqueN(v, na.rm = TRUE) > 1) v)]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
A base alternative could be Filter(function(x) length(unique(x[!is.na(x)])) > 1, df)

There is simple solution with function Filter in base r. It will help.
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
# Select only columns for which SD is not 0
> Filter(function(x) sd(x, na.rm = TRUE) != 0, df)
x
1: 1
2: 2
3: 3
4: NA
5: 5
Note: Don't forget to use na.rm = TRUE.

Check if the variance is zero:
df[, sapply(df, var, na.rm = TRUE) != 0, with = FALSE]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5

Here is an option:
df[,which(df[,
unlist(
sapply(.SD,function(x) length(unique(x[!is.na(x)])) >1))]),
with=FALSE]
x
1: 1
2: 2
3: 3
4: NA
5: 5
For each column of the data.table we count the number of unique values different of NA. We keep only column that have more than one value.

If you really mean DROPing those columns, here is a solution:
library(data.table)
dt <- data.table(x=c(1,2,3,NA,5),
y=c(1,1,NA,NA,NA),
z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
for (col in names(copy(dt))){
v = var(dt[[col]], na.rm = TRUE)
if (v == 0 | is.na(v)) dt[, (col) := NULL]
}

Just change
all(is.na(.col)) || all(.col[1L] == .col)
to
all(is.na(.col) | .col[1L] == .col)
Final code:
same <- sapply( df, function(.col){ all( is.na(.col) | .col[1L] == .col ) } )
df1 <- df[,!same, with=F]
Result:
x
1: 1
2: 2
3: 3
4: NA
5: 5

For removing constant columns,
Numeric Columns:-
constant_col = [const for const in df.columns if df[const].std() == 0]
print (len(constant_col))
print (constant_col)
Categorical Columns:-
constant_col = [const for const in df.columns if len(df[const].unique()) == 1]
print (len(constant_col))
print (constant_col)
Then you drop the columns using the drop method

library(janitor)
df %>%
remove_constant(na.rm = TRUE)
x
1: 1
2: 2
3: 3
4: NA
5: 5

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R: Merge data.table and fill in NAs - r

While you explore the all argument to merge, I'll also offer you an alternative that might want to consider: Reduce(function(x, y) merge(x, y, by = "Type", all = TRUE), list(dt1, dt2, dt3)) # Type x y z # 1: a 1 3 NA # 2: b 2 4 NA # 3: c NA NA 3 # 4: d NA NA 4

Related

Retaining by variables in R data.table by-without-by

Inline ifelse assignment in data.table

rbindlist data.tables with different number of columns

Rbind with new columns and data.table

Remove constant columns with or without NAs

Categories

Resources