Rbind with new columns and data.table - r

I need to add many large tables to an existing table, so I use rbind with the excellent package data.table. But some of the later tables have more columns than the original one (which need to be included). Is there an equivalent of rbind.fill for data.table?
library(data.table)
aa <- c(1,2,3)
bb <- c(2,3,4)
cc <- c(3,4,5)
dt.1 <- data.table(cbind(aa, bb))
dt.2 <- data.table(cbind(aa, bb, cc))
dt.11 <- rbind(dt.1, dt.1) # Works, but not what I need
dt.12 <- rbind(dt.1, dt.2) # What I need, doesn't work
dt.12 <- rbind.fill(dt.1, dt.2) # What I need, doesn't work either
I need to start rbinding before I have all tables, so no way to know what future new columns will be called. Missing data can be filled with NA.

Since v1.9.2, data.table's rbind function gained fill argument. From ?rbind.data.table documentation:
If TRUE fills missing columns with NAs. By default FALSE. When
TRUE, use.names has to be TRUE, and all items of the input list has to
have non-null column names.
Thus you can do (prior to approx v1.9.6):
data.table::rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5
UPDATE for v1.9.6:
This now works directly:
rbind(dt.1, dt.2, fill=TRUE)
# aa bb cc
# 1: 1 2 NA
# 2: 2 3 NA
# 3: 3 4 NA
# 4: 1 2 3
# 5: 2 3 4
# 6: 3 4 5

Here is an approach that will update the missing columns in
rbind.missing <- function(A, B) {
cols.A <- names(A)
cols.B <- names(B)
missing.A <- setdiff(cols.B,cols.A)
# check and define missing columns in A
if(length(missing.A) > 0L){
# .. means "look up one level"
class.missing.A <- lapply(B[, ..missing.A], class)
nas.A <- lapply(class.missing.A, as, object = NA)
A[,c(missing.A) := nas.A]
}
# check and define missing columns in B
missing.B <- setdiff(names(A), cols.B)
if(length(missing.B) > 0L){
class.missing.B <- lapply(A[, ..missing.B], class)
nas.B <- lapply(class.missing.B, as, object = NA)
B[,c(missing.B) := nas.B]
}
# reorder so they are the same
setcolorder(B, names(A))
rbind(A, B)
}
rbind.missing(dt.1,dt.2)
## aa bb cc
## 1: 1 2 NA
## 2: 2 3 NA
## 3: 3 4 NA
## 4: 1 2 3
## 5: 2 3 4
## 6: 3 4 5
This will not be efficient for many, or large data.tables, as it only works two at a time.

The answers are awesome, but looks like, there are some functions suggested here such as plyr::rbind.fill and gtools::smartbind which seemed to work perfectly for me.

the basic concept is to add missing columns in both directions: from the running master table
to the newTable and back the other way.
As #menl pointed out in the comments, simply assigning an NA is a problem, because that will
make the whole column of class logical.
One solution is to force all columns of a single type (ie as.numeric(NA)), but that is too restrictive.
Instead, we need to analyze each new column for its class. We can then use as(NA, cc) _(cc being the class)
as the vector that we will assign to a new column. We wrap this in an lapply statement on the RHS and use eval(columnName)
on the LHS to assign.
We can then wrap this in a function and use S3 methods so that we can simply call
rbindFill(A, B)
Below is the function.
rbindFill.data.table <- function(master, newTable) {
# Append newTable to master
# assign to Master
#-----------------#
# identify columns missing
colMisng <- setdiff(names(newTable), names(master))
# if there are no columns missing, move on to next part
if (!identical(colMisng, character(0))) {
# identify class of each
colMisng.cls <- sapply(colMisng, function(x) class(newTable[[x]]))
# assign to each column value of NA with appropriate class
master[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
}
# assign to newTable
#-----------------#
# identify columns missing
colMisng <- setdiff(names(master), names(newTable))
# if there are no columns missing, move on to next part
if (!identical(colMisng, character(0))) {
# identify class of each
colMisng.cls <- sapply(colMisng, function(x) class(master[[x]]))
# assign to each column value of NA with appropriate class
newTable[ , eval(colMisng) := lapply(colMisng.cls, function(cc) as(NA, cc))]
}
# reorder columns to avoid warning about ordering
#-----------------#
colOrdering <- colOrderingByOtherCol(newTable, names(master))
setcolorder(newTable, colOrdering)
# rbind them!
#-----------------#
rbind(master, newTable)
}
# implement generic function
rbindFill <- function(x, y, ...) UseMethod("rbindFill")
Example Usage:
# Sample Data:
#--------------------------------------------------#
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
#--------------------------------------------------#
# Four iterations of calling rbindFill
master <- rbindFill(A, B)
master <- rbindFill(master, A2)
master <- rbindFill(master, C)
# Results:
master
# a b c d m n f
# 1: 1 1 1 NA NA NA NA
# 2: 2 2 2 NA NA NA NA
# 3: 3 3 3 NA NA NA NA
# 4: NA 1 1 1 A NA NA
# 5: NA 2 2 2 B NA NA
# 6: NA 3 3 3 C NA NA
# 7: 6 6 6 NA NA NA NA
# 8: 7 7 7 NA NA NA NA
# 9: 8 8 8 NA NA NA NA
# 10: 9 9 9 NA NA NA NA
# 11: NA NA 7 NA NA 0.86 TRUE
# 12: NA NA 8 NA NA -1.15 FALSE
# 13: NA NA 9 NA NA 1.10 TRUE

Yet another way to insert the missing columns (with the correct type and NAs) is to merge() the first data.table A with an empty data.table A2[0] which has the structure of the second data.table. This saves the possibility to introduce bugs in user functions (I know merge() is more reliable than my own code ;)). Using mnel's tables from above, do something like the code below.
Also, using rbindlist() should be much faster when dealing with data.tables.
Define the tables (same as mnel's code above):
library(data.table)
A <- data.table(a=1:3, b=1:3, c=1:3)
A2 <- data.table(a=6:9, b=6:9, c=6:9)
B <- data.table(b=1:3, c=1:3, d=1:3, m=LETTERS[1:3])
C <- data.table(n=round(rnorm(3), 2), f=c(T, F, T), c=7:9)
Insert the missing variables in table A: (note the use of A2[0]
A <- merge(x=A, y=A2[0], by=intersect(names(A),names(A2)), all=TRUE)
Insert the missing columns in table A2:
A2 <- merge(x=A[0], y=A2, by=intersect(names(A),names(A2)), all=TRUE)
Now A and A2 should have the same columns, with the same types. Set the column order to match, just in case (possibly not needed, not sure if rbindlist() binds across column names or column positions):
setcolorder(A2, names(A))
DT.ALL <- rbindlist(l=list(A,A2))
DT.ALL
Repeat for the other tables... Maybe it would be better to put this into a function rather than repeat by hand...
DT.ALL <- merge(x=DT.ALL, y=B[0], by=intersect(names(DT.ALL), names(B)), all=TRUE)
B <- merge(x=DT.ALL[0], y=B, by=intersect(names(DT.ALL), names(B)), all=TRUE)
setcolorder(B, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, B))
DT.ALL <- merge(x=DT.ALL, y=C[0], by=intersect(names(DT.ALL), names(C)), all=TRUE)
C <- merge(x=DT.ALL[0], y=C, by=intersect(names(DT.ALL), names(C)), all=TRUE)
setcolorder(C, names(DT.ALL))
DT.ALL <- rbindlist(l=list(DT.ALL, C))
DT.ALL
The result looks the same as mnels' output (except for the random numbers and the column order).
PS1: The original author does not say what to do if there are matching variables -- do we really want to do a rbind() or are we thinking of a merge()?
PS2: (Since I do not have enough reputation to comment) The gist of the question seems a duplicate of this question. Also important for the benchmarking of data.table vs. plyr with large datasets.

Related

Comparing two columns in R based on matching values in a common column

So, I have two datasets, such that all the columns in one dataset is present in the other, along with some extra columns. What I want to do is to create a new dataset of the differences between the entries in common columns, on the basis of matching a common identifier column, present in both the datasets. How can I do that in R?
If it were a single column, I could have used the sqldf function as
sqldf("select a.v1 - b.v1 from ds1 a left join ds2 b on a.identifier=b.identifier")
But there are 900 common columns between both the datasets.
You can do this by simply joining the frames on identifier, and then subtracting the one frame from the other.
Here is an example of the approach using data.table
# load the library
library(data.table)
# do a left join on identifier
merged = setDT(ds2)[setDT(ds1), on="identifier"]
# get common column names, and remove "identifier" from that vector
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]
# column-bind the identifier column with a substraction of the two sub-frames
cbind(
merged[,.(identifier)],
setnames(merged[,.SD, .SDcols = paste0("i.",cols)] - merged[,.SD, .SDcols = cols],paste0("diff_",cols))
)
Here is the same approach using dplyr:
library(dplyr)
merged = left_join(ds1,ds2, by="identifier")
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]
bind_cols(
select(merged,identifier),
(select(merged, all_of(paste0(cols, ".x"))) - select(merged, all_of(paste0(cols, ".y")))) %>%
rename_with(~paste0("diff_",cols), everything())
)
Output (same under either approach):
identifier diff_v1 diff_v2 diff_v3
<char> <num> <num> <num>
1: O 0.5028498 0.7573174 -1.00630610
2: S -2.5631238 -0.7041228 1.33877932
3: N NA NA NA
4: C NA NA NA
5: J NA NA NA
6: R NA NA NA
7: K NA NA NA
8: E NA NA NA
9: X -0.1830764 0.2924459 -0.01860763
10: Y NA NA NA
11: W NA NA NA
12: T -0.4912840 -2.8126285 -1.33661044
13: I NA NA NA
14: L NA NA NA
15: U NA NA NA
16: M -0.3130889 1.1590316 -0.44551660
17: P NA NA NA
18: H NA NA NA
19: B NA NA NA
20: G -2.2817049 2.4156583 -0.34393988
21: Z NA NA NA
22: A -0.1654816 -0.8807393 -1.08534789
23: F NA NA NA
24: V NA NA NA
25: D 1.4653655 0.2604109 -0.17733840
26: Q NA NA NA
identifier diff_v1 diff_v2 diff_v3
Input:
set.seed(123)
ds1 = data.frame(identifier = sample(LETTERS),v1 = rnorm(26),v2 = rnorm(26),v3 = rnorm(26))
ds2 = data.frame(identifier = sample(LETTERS,8),v1 = rnorm(8),v2 = rnorm(8),v3 = rnorm(8))
Using S1 and S2 as an example (see Note at end) we find the common column names using intersect. In the example data in the Note at the end there is only one numeric common column that is not ID but the same code should work even if there are many.
Then using Filter extract the names of the numeric common columns. We have assumed that if a common column in S1 is numeric then it is also numeric in S2 so we only have to check S1. The Filter line could be omitted if we knew that all common columns were numeric.
Next ensure that the ID column is excluded using setdiff. If ID is non numeric the Filter line would have already removed it in which case we could omit the setdiff line.
Now construct the select clause. sprintf creates a character vector of the elements of the select clause and toString collapses it to a comma separated string giving the final select string. Finally run the SQL statement. Note that fn$sqldf turns on string interpolation in the SQL statement and $sel inserts the contents of the sel variable into the SQL string.
library (sqldf)
nms <- intersect(names(S1), names(S2))
nms <- names(Filter(is.numeric, S1[nms]))
nms <- setdiff(nms, "ID")
sel <- toString(sprintf("a.[%s] - b.[%s] as [%s]", nms, nms, nms))
fn$sqldf("select ID, $sel  
from S1 a  
left join S2 b using(ID)")
## ID extra
## 1 1 0
## 2 2 0
## 3 3 0
## 4 4 0
## 5 5 0
## 6 6 0
Pipe
The nms<- lines above could alternately be written in terms of pipes:
nms <- names(S1) |>
intersect(names(S2)) |>
subset(S1, select = _) |> 
Filter(f = is.numeric) |>
names() |>
setdiff("ID")
Note
The data frame sleep comes with R. S1 and S2 are used as an example.
S1 <- head(sleep)
S2 <- S1[-2]
S1
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
S2
## extra ID
## 1 0.7 1
## 2 -1.6 2
## 3 -0.2 3
## 4 -1.2 4
## 5 -0.1 5
## 6 3.4 6

Fast way to insert values in column of data frame in R

I am currently trying to find unique elements between two columns of a data frame and write these to a new final data frame.
This is my code, which works perfectly fine, and creates a result which matches my expectation.
set.seed(42)
df <- data.frame(a = sample(1:15, 10),
b=sample(1:15, 10))
unique_to_a <- df$a[!(df$a %in% df$b)]
unique_to_b <- df$b[!(df$b %in% df$a)]
n <- max(c(unique_to_a, unique_to_b))
out <- data.frame(A=rep(NA,n), B=rep(NA,n))
for (element in unique_to_a){
out[element, "A"] = element
}
for (element in unique_to_b){
out[element, "B"] = element
}
out
The problem is, that it is very slow, because the real data contains 100.000s of rows. I am quite sure it is because of the repeated indexing I am doing in the for loop, and I am sure there is a quicker, vectorized way, but I dont see it...
Any ideas on how to speed up the operation is much appreciated.
Cheers!
Didn't compare the speed but at least this is more concise:
elements <- with(df, list(setdiff(a, b), setdiff(b, a)))
data.frame(sapply(elements, \(x) replace(rep(NA, max(unlist(elements))), x, x)))
# X1 X2
# 1 NA NA
# 2 NA NA
# 3 NA 3
# 4 NA NA
# 5 NA NA
# 6 NA NA
# 7 NA NA
# 8 NA NA
# 9 NA NA
# 10 NA NA
# 11 11 NA

Combine/match/merge vectors by row names

I have a vector of variable names and several matrices with single rows.
I want to create a new matrix. The new matrix is created by match/merge the row names of the matrices with single rows.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several matrices with single rows
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_1) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but just to exactly obtain the desired output I want:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5

How to access variables of datatable when passed as a parameter to a function?

I'm new to R and I have searched over google for solution to the below problem.
I have
DT = data.table(y=c("a",NA,NA), y_1=c(NA,3,6), y_2=c(1,NA,3), y_3=c(1,1,1)).
I want to create a function passing the datatable and the column that needs to be changed.
fun <- function(dt, var)
{
dt[,(var) := ifelse(!(is.na(get(var))), get(paste0(var,"1")),
ifelse(!(is.na(get(paste0(var,"1")), get(paste0(var,"2")...))]
return(dt)
}
I want to replace the values in y variable which are NA's with the values in y_1 if they are not NULL or else replace with y_2 and so on. Like this I want to create a function which can accept different variables but with the same ending.
Update: Uwe, Thanks for pointing that previous question. I found it pretty useful. But, my requirement is slightly different. I need the same update for other variables as well where the values are NA. For example, I need to do it for (x,x_1,x_2,x_3...),(z,z_1,z_2,z_3..) and some other variables apart from y. Is there a way use lapply or function to do that.
Thanks in advance.
The OP is looking for a variant of the locf method (last observation carried forward) which is implemented as zoo::na.locf() for instance. While na.locf() usually is applied on a vector or a column of a data.frame, the OP is looking for a variant which is applied on each row of a data.table but restricted to a specific subset of columns. So, the function is being named na.locl() (last observation carried left).
In addition, the data.table is to be updated in place, e.g., without copying. The columns are named in a specific manner, e.g., x, x_1, x_2, x_3, etc. So, x is kind of a base name for the subset of columns.
The function below will look in each row of a specific subset of columns of a given data.table for the first non-NA column and copies this value to column x.
The implementation is based on this solution. It includes some plausibilty checks.
na.locl <- function(var, dt) {
checkmate::assert_data_table(dt)
checkmate::assert_string(var)
checkmate::assert_choice(var, names(dt))
ans_val = rep_len(NA_real_, nrow(dt))
selected_cols <- unlist(lapply(
var, function(x) stringr::str_subset(names(dt), paste0("^", x, "(_\\d*)?$"))))
for(col in selected_cols) {
i = is.na(ans_val) & (!is.na(dt[[col]]))
ans_val[i] = dt[[col]][i]
}
set(DT, , var, ans_val)
return(invisible(NULL))
}
In addition, the OP has requested to repeat this for other variables. This can be accomplished using lapply() with the na.locl() function. To demonstrate this, sample data are required.
library(data.table)
DT0 <- data.table(y=c("a",NA,NA,NA), y_1=c(NA,3,NA,NA), y_2=c(1,NA,3,NA), y_3=c(1,1,1,NA))
DT <- cbind(DT0, setnames(copy(DT0), stringr::str_replace(names(DT0), "^y", "x")))
DT <- cbind(DT, setnames(copy(DT0), stringr::str_replace(names(DT0), "^y", "zzz")))
DT
# y y_1 y_2 y_3 x x_1 x_2 x_3 zzz zzz_1 zzz_2 zzz_3
#1: a NA 1 1 a NA 1 1 a NA 1 1
#2: NA 3 NA 1 NA 3 NA 1 NA 3 NA 1
#3: NA NA 3 1 NA NA 3 1 NA NA 3 1
#4: NA NA NA NA NA NA NA NA NA NA NA NA
y, x, and zzz are NAexcept for row 1. After applying the function on DT,
dummy <- lapply(c("x", "y", "zzz"), na.locl, dt = DT)
DT
# y y_1 y_2 y_3 x x_1 x_2 x_3 zzz zzz_1 zzz_2 zzz_3
#1: a NA 1 1 a NA 1 1 a NA 1 1
#2: 3 3 NA 1 3 3 NA 1 3 3 NA 1
#3: 3 NA 3 1 3 NA 3 1 3 NA 3 1
#4: NA NA NA NA NA NA NA NA NA NA NA NA
the missing values in columns y, x, and zzz have been replaced by the next non-NA value to the right if available within the subset of columns. Thus, row 4 is all NA as no non-NA (that's three negations in a row) is available in each of the column subsets.

how to programatically assign a bunch of variables within a data frame

Given a character vector vars and a list of data frames d, I want to
make sure every data frame in d has all the columns named in vars.
Let's say some of the columns are missing in one data frame, then I create
those columns in the data frame and fill them with NAs.
However when I do this by using assign I get some strange results:
> vars <- c('y','z')
> b <- data.frame(a=1:3, b=3:1)
> b
a b
1 1 3
2 2 2
3 3 1
> within(b, for (v in vars) assign(v, NA))
a b z y v
1 1 3 NA NA z
2 2 2 NA NA z
3 3 1 NA NA z
You can see that I managed to create columns z and y using this method,
but there's also an extra volumn v which I don't know where it came from.
Here's a simple way that's in the spirit of your original code.
for(v in vars) { b[[v]] <- NA }
The reason you're getting the extra v in your version is that any variable that is created in the call to within gets added to that data frame, and the for loop creates that variable. If you remove it at the end it will go away. However, watch out for the case that your vars variable include v.
within(b, {for (v in vars) assign(v, NA); rm(v) })
You might also make vars include all the variables and just get the ones you want to keep with setdiff.
vars <- c('a','b','y','z')
b <- data.frame(a=1:3, b=3:1)
for(v in setdiff(vars, names(b))) { b[[v]] <- NA }
Try this:
missingCols <- setdiff(vars, names(b))
naColumn <- function(x)rep(NA, nrow(b))
cbind(b, sapply(missingCols, naColumn, USE.NAMES=TRUE))
a b y z
1 1 3 NA NA
2 2 2 NA NA
3 3 1 NA NA
You could also try:
list2env(split(rep(NA,2*nrow(b)),vars),envir=.GlobalEnv)
cbind(b,mget(vars))
# a b y z
# 1 1 3 NA NA
# 2 2 2 NA NA
# 3 3 1 NA NA
or
cbind(b,mget(setdiff(vars,names(b))))
Here's using data.table:
require(data.table) ## 1.9.2+
setDT(b) ## convert data.frame to data.table
set(b, j=vars, value=NA_integer_)
# a b y z
# 1: 1 3 NA NA
# 2: 2 2 NA NA
# 3: 3 1 NA NA
Note all set* functions in data.table (and := operator) operates by reference, meaning there was no (unnecessary) copy being made here.
In case you'd like to work with data.frame, you can just convert this back to a data.frame. In v1.9.3 (currently development version), there's a function setDF that's implemented to get back to a data.frame from data.table by reference (as opposed to the traditional as.data.frame(.) function that'll result in a copy).
Putting it all together (if you want a data.frame at the end)
## 1.9.3
setDF(set(setDT(b), j=vars, value=NA_integer_))
# a b y z
# 1 1 3 NA NA
# 2 2 2 NA NA
# 3 3 1 NA NA
Once again, no (deep) copies were made.

Resources