Inconsistent data.table assignment by reference behaviour

When assigning by reference in one data.table using a column from a second data.table, the results are inconsistent. When there are no matches on the key columns of the two data.tables, the assignment expression y := y appears to be ignored entirely: the y column is not created at all, not even filled with NAs.
library(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
print(dt1[dt2, y := y])
## id x # Would have also expected column: y
## 1: 1 3 # NA
## 2: 2 4 # NA
However, when there is a partial match, the non-matching rows get a placeholder NA.
dt2[, id := 2:3]
print(dt1[dt2, y := y])
## id x y
## 1: 1 3 NA # <-- placeholder NA here
## 2: 2 4 5
This wreaks havoc on later code that assumes a y column exists in all cases; otherwise I keep having to write cumbersome additional checks to handle both cases.
Is there an elegant way around this inconsistency?

With this recent commit, issue #759 is now fixed in v1.9.7. It works as expected when nomatch=NA (the current default).
require(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
dt1[dt2, y := y][]
# id x y
# 1: 1 3 NA
# 2: 2 4 NA

Using merge works:
> dt3 <- merge(dt1, dt2, by='id', all.x=TRUE)
> dt3
id x y
1: 1 3 NA
2: 2 4 NA
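For versions without that fix, one hedged workaround (a sketch reusing the question's objects) is to create the y column up front so it always exists, and then let the update join fill in whatever matches:
library(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
dt1[, y := NA_integer_]  # pre-allocate y so the column exists even with zero matches
dt1[dt2, y := i.y]       # i.y is y from dt2; unmatched rows simply keep NA
dt1[]
##    id x  y
## 1:  1 3 NA
## 2:  2 4 NA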

Related

Merging a sum by reference with data.table

Let's say I have two data.tables, dt_a and dt_b, defined as below.
library(data.table)
set.seed(20201111L)
dt_a <- data.table(
  foo = c("a", "b", "c")
)
dt_b <- data.table(
  bar = sample(c("a", "b", "c"), 10L, replace = TRUE),
  value = runif(10L)
)
dt_b[]
## bar value
## 1: c 0.4904536
## 2: c 0.9067509
## 3: b 0.1831664
## 4: c 0.0203943
## 5: c 0.8707686
## 6: a 0.4224133
## 7: a 0.6025349
## 8: b 0.4916672
## 9: a 0.4566726
## 10: b 0.8841110
I want to left join dt_b onto dt_a by reference, summing over the multiple matches. One way to do so is to first create a summary of dt_b (thus solving the multiple-match issue) and merge it afterwards.
dt_b_summary <- dt_b[, .(value=sum(value)), bar]
dt_a[dt_b_summary, value_good:=value, on=c(foo="bar")]
dt_a[]
## foo value_good
## 1: a 1.481621
## 2: b 1.558945
## 3: c 2.288367
However, this allocates memory for the intermediate object dt_b_summary, which is inefficient.
I would like to get the same result by joining directly on dt_b and summing over the multiple matches. I'm looking for something like the code below, but it doesn't work.
dt_a[dt_b, value_bad:=sum(value), on=c(foo="bar")]
dt_a[]
## foo value_good value_bad
## 1: a 1.481621 5.328933
## 2: b 1.558945 5.328933
## 3: c 2.288367 5.328933
Does anyone know if something like this is possible?
We can use .EACHI with by
library(data.table)
dt_b[dt_a, .(value = sum(value)), on = .(bar = foo), by = .EACHI]
# bar value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
If we want to update the original object 'dt_a'
dt_a[, value := dt_b[.SD, sum(value), on = .(bar = foo), by = .EACHI]$V1]
dt_a
# foo value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
For multiple columns
dt_b$value1 <- dt_b$value
nm1 <- c('value', 'value1')
dt_a[, (nm1) := dt_b[.SD, lapply(.SD, sum),
                     on = .(bar = foo), by = .EACHI][, .SD, .SDcols = nm1]]
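Another hedged option (a sketch reusing the question's objects; the column name value_inline is made up for illustration): build the summary inline in i of the update join, so no named intermediate object is left behind. The aggregate is still computed, just anonymously.
dt_a[dt_b[, .(value = sum(value)), by = bar], value_inline := i.value, on = c(foo = "bar")]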

Variable as name in aggregate list of data.table

I'm aggregating an R/data.table (v1.12.2) and I need to use a variable as the name of the aggregated column. E.g.:
library(data.table)
DT <- data.table(x= 1:5, y= c('A', 'A', 'B', 'B', 'B'))
aggname <- 'max_x' ## 'max_x' should be the name of the aggregated column
DT2 <- DT[, list(aggname= max(x)), by= y]
DT2
y aggname   <-- This should be 'max_x', not 'aggname'!
1: A 2
2: B 5
I can rename the column(s) afterwards with something like:
setnames(DT2, 'aggname', aggname)
DT2
y max_x
1: A 2
2: B 5
But I would have to check that the string 'aggname' doesn't create duplicate names first. Is there any better way of doing it?
We can use setNames on the list column
DT[, setNames(.(max(x)), aggname), by = y]
# y max_x
#1: A 2
#2: B 5
aggname2 <- 'min_x'
DT[, setNames(.(max(x), min(x)), c(aggname, aggname2)), by = y]
# y max_x min_x
#1: A 2 1
#2: B 5 3
Or another option is lst from dplyr
library(dplyr)
DT[, lst(!! aggname := max(x)), by = y]
# y max_x
#1: A 2
#2: B 5
DT[, lst(!! aggname := max(x), !! aggname2 := min(x)), by = y]
# y max_x min_x
#1: A 2 1
#2: B 5 3
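Another hedged alternative, assuming it is acceptable to modify DT by reference: assign the aggregate with the standard dynamic-name idiom (aggname) := and then collapse to unique rows.
library(data.table)
DT <- data.table(x = 1:5, y = c('A', 'A', 'B', 'B', 'B'))
aggname <- 'max_x'
DT[, (aggname) := max(x), by = y]                 # adds a max_x column by reference
DT2 <- unique(DT[, c('y', aggname), with = FALSE])
DT2
#   y max_x
#1: A     2
#2: B     5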

Accessing column name within the SD construct

I have a data table in R that looks like this
DT = data.table(a = c(1,2,3,4,5), a_mean = c(1,1,2,2,2), b = c(6,7,8,9,10), b_mean = c(3,2,1,1,2))
I want to create two more columns a_final and b_final defined as a_final = (a - a_mean) and b_final = (b - b_mean). In my real life use case, there can be a large number of such column pairs and I want a scalable solution in the spirit of R's data tables.
I tried something along the lines of
DT[,paste0(c('a','b'),'_final') := lapply(.SD, function(x) ((x-get(paste0(colnames(.SD),'_mean'))))), .SDcols = c('a','b')]
but this doesn't quite work. Any idea of how I can access the column name of the column being processed within the lapply statement?
We can create a character vector of column names, subset those columns from the original data.table, get their corresponding "mean" columns, subtract, and add the result as new columns.
library(data.table)
cols <- unique(sub('_.*', '', names(DT))) # Thanks to @Sotos
#OR just
#cols <- c('a', 'b')
DT[, paste0(cols, '_final')] <- DT[, cols, with = FALSE] -
  DT[, paste0(cols, "_mean"), with = FALSE]
DT
# a a_mean b b_mean a_final b_final
#1: 1 1 6 3 0 3
#2: 2 1 7 2 1 5
#3: 3 2 8 1 1 7
#4: 4 2 9 1 2 8
#5: 5 2 10 2 3 8
Another option is using mget with Map:
cols <- c('a', 'b')
DT[, paste0(cols,'_final') := Map(`-`, mget(cols), mget(paste0(cols,"_mean")))]
Relying on the .SD construct you could do something along the lines of:
cols <- c('a', 'b')
DT[, paste0(cols, "_final") :=
     DT[, .SD, .SDcols = cols] -
     DT[, .SD, .SDcols = paste0(cols, "_mean")]]
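To literally access the column name inside the .SD construct, as the title asks, one hedged sketch is to iterate over .SD together with names(.SD) via Map; this assumes get() can resolve the sibling *_mean columns from inside the anonymous function, which it should, since the function is created in the j environment where the columns are visible:
cols <- c('a', 'b')
DT[, paste0(cols, '_final') := Map(function(x, nm) x - get(paste0(nm, '_mean')),
                                   .SD, names(.SD)),
   .SDcols = cols]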

Update existing data.frame with values from another one if missing

I'm looking for (1) the name of and (2) a (cleaner) method for the following operation in R (base and data.table preferred).
Input
> d1
id x y
1 1 1 NA
2 2 NA 3
3 3 4 NA
> d2
id x y z
1 4 NA 30 a
2 3 20 2 b
3 2 14 NA c
4 1 15 97 d
(note that the actual data.frames have hundreds of columns)
Expected output:
> d1
id x y z
1 1 1 97 d
2 2 14 3 c
3 3 4 2 b
Data and current solution:
d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])
for (col in setdiff(names(d1), "id")) {
  # If missing look in d2
  missing <- is.na(d1[[col]])
  d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
}
for (col in setdiff(names(d2), names(d1))) {
  # If column missing then add
  d1[[col]] <- d2[match(d1$id, d2$id), col]
}
PS:
Likely this question has been asked before, but I'm lacking the vocabulary to search for it.
Assuming you are working with 2 data.frames, here is a base solution
# expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE],
           by="id", all.x=TRUE, all.y=FALSE)
# make sure that d2 also has the same columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE],
            by="id", all.x=TRUE, all.y=FALSE)
#align rows and columns to match those in d1
mask <- d2[match(d1$id, d2$id), names(d)]
# replace NAs with the corresponding values from mask
replace(d, is.na(d), mask[is.na(d)])
If you don't mind, we can rewrite your question into a general matrix-coalesce question (i.e., any number of matrices, columns, and rows), which seems like it has not been asked before.
edit:
Another base R solution is a hack of coalesce1a from How to implement coalesce efficiently in R
coalesce.mat <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    rn <- match(ans$id, elt$id)
    ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
  }
  ans
}
allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat,
        lapply(list(d1, d2), function(x) {
          x[, setdiff(allcols, names(x))] <- NA
          x
        }))
edit:
a possible data.table solution using coalesce1a from How to implement coalesce efficiently in R by Martin Morgan.
coalesce1a <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    i <- which(is.na(ans))
    ans[i] <- elt[i]
  }
  ans
}
setDT(d1)
setDT(d2)
#melt into long formats and full outer join the 2
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"), by=c("id","variable"), all=TRUE)
#perform a coalesce on vectors
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]
#pivot into original format and subset to those in d1
dcast.data.table(mdt, id ~ variable, value.var="value")[
  d1, .SD, on=.(id)]
Here is a possibility using dplyr::left_join:
left_join(d1, d2, by = "id") %>%
  mutate(
    x = ifelse(!is.na(x.x), x.x, x.y),
    y = ifelse(!is.na(y.x), y.x, y.y)) %>%
  select(id, x, y, z)
# id x y z
#1 1 1 97 d
#2 2 14 3 c
#3 3 4 2 b
We can use data.table with coalesce from dplyr. Create vectors of the column names that are common to both datasets ('nm1') and of those present only in the second ('nm2'). Convert the first dataset to a data.table (setDT(d1)), join on the 'id' column, and assign (:=) the coalesced common columns (the second dataset's versions carry the prefix i.) together with the new columns, updating the values in the first dataset.
library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1),
                                   mget(paste0("i.", nm1))), mget(nm2)), on = .(id)]
d1
# id x y z
#1: 1 1 97 d
#2: 2 14 3 c
#3: 3 4 2 b
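A hedged variant of the same join-update that stays within data.table is to swap dplyr::coalesce for data.table's own fcoalesce() (available in recent data.table versions); a sketch assuming the original d1/d2 from the question and the same nm1/nm2 vectors as above:
setDT(d1)[d2, c(nm1, nm2) := c(Map(fcoalesce, mget(nm1), mget(paste0("i.", nm1))),
                               mget(nm2)), on = .(id)]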

How to reference a variable in a for loop?

I am looping through different data.tables and through the variables in each data.table, but I'm having trouble referencing the variables inside the for loop.
dt1 <- data.table(a1 = c(1,2,3), a2 = c(4,5,2))
dt2 <- data.table(a1 = c(1,43,1), a2 = c(52,4,1))
For each data.table, I want to find the average of each variable over the observations where that variable != 1. Below is my attempt, which doesn't work:
dtname = 'dt'
ind = c('1', '2')
for (d in ind) {
  df <- get(paste0('dt', d, sep=''))
  for (v in ind) {
    varname <- paste0('a', v, sep='')
    df1 <- df %>%
      filter(varname != 1) %>%
      summarise(varname = mean(varname))
    print(df1)
  }
}
The desired output is to take and print the average of a1 = c(2, 3) in dt1, the average of a2 = c(4, 5, 2) in dt1, the average of a1 = c(43) in dt2, and the average of a2 = c(52, 4) in dt2.
What am I doing wrong here? In general, how should I reference a variable inside of a for loop (varname) that is pieced together by using the looping index (v) and something else?
For a purely data.table way, I would combine the different data.tables and compute the averages:
# Concatenate the data.tables:
all_dt <- rbind("dt1" = dt1, "dt2" = dt2, idcol = "origin")
all_dt
# origin a1 a2
# 1: dt1 1 4
# 2: dt1 2 5
# 3: dt1 3 2
# 4: dt2 1 52
# 5: dt2 43 4
# 6: dt2 1 1
# Melt so that "a1" and "a2" are labels in a group column:
all_dt <- melt(all_dt, id.vars="origin")
all_dt
# origin variable value
# 1: dt1 a1 1
# 2: dt1 a1 2
# 3: dt1 a1 3
# 4: dt2 a1 1
# 5: dt2 a1 43
# 6: dt2 a1 1
# 7: dt1 a2 4
# 8: dt1 a2 5
# 9: dt1 a2 2
# 10: dt2 a2 52
# 11: dt2 a2 4
# 12: dt2 a2 1
# Compute averages by each data.table and column group, ignoring 1s:
all_dt[value != 1, .(mean = mean(value)), by = .(origin, variable)]
# origin variable mean
# 1: dt1 a1 2.500000
# 2: dt2 a1 43.000000
# 3: dt1 a2 3.666667
# 4: dt2 a2 28.000000
I figured out a solution based on the comments of @Amar and @Scott Richie:
for (d in ind) {
  df <- get(paste0('dt', d, sep=''))
  for (v in ind) {
    varname <- paste0('a', v, sep='')
    df1 <- df[eval(as.name(varname)) != 1,
              .(mean = mean(eval(as.name(varname))))]
    print(df1)
  }
}
Thanks EVERYONE!
Would go for a vectorised approach. You are using R!
One possible way:
require(dplyr)
dt1[dt1==1] <- NA #replace 1 with NA
dt1 %>% summarise_all(mean, na.rm = TRUE) #mean of all columns.
a1 a2
1 2.5 3.666667
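A hedged extension of the same idea to both tables at once (a sketch assuming dplyr >= 1.0 for across()):
library(dplyr)
lapply(list(dt1 = dt1, dt2 = dt2), function(df) {
  df[df == 1] <- NA    # treat 1s as missing, as above
  df %>% summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
})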
It is not very clear what you are trying to do, but if you want to replace all of the rows in the data frame with the mean of that data frame's columns, I would suggest using a data.frame type instead, as it is easier to index. Here is code that should work:
dt1 <- data.frame(a1 = c(1,2,3), a2 = c(4,5,2))
dt2 <- data.frame(a1 = c(1,43,1), a2 = c(52,4,1))
dtname = 'dt'
ind = c('1', '2')
for (d in ind){
  df <- get(paste0('dt', d, sep=''))
  for (i in 1:nrow(df)){
    for (j in 1:ncol(df)){
      if (df[i,j] != 1){
        df[,j] <- mean(df[,j])
      }
    }
    print(df)
  }
}
The reason your code was not working before is that the variables were being treated like strings, not actual variables. You can see this by printing the data type of varname:
dtname = 'dt'
ind = c('1', '2')
for (d in ind) {
  df <- get(paste0('dt', d, sep=''))
  for (v in ind) {
    varname <- paste0('a', v, sep='')
    print(class(varname))
  }
}
Which just returns "character"
Another solution using variable names and the data frame type would be to index the df as follows:
df[[varname]]
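For example, the original loop can be made to work with that kind of indexing; a minimal hedged sketch (base R aside from the data.tables themselves):
for (d in ind) {
  df <- get(paste0('dt', d))
  for (v in ind) {
    varname <- paste0('a', v)
    vals <- df[[varname]]    # pull the column by its constructed name
    cat(varname, 'in dt', d, ':', mean(vals[vals != 1]), '\n')
  }
}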
Here are two helpful links for this kind of operation:
* link 1: How to find the mean of a column
* link 2: Data frames
