Efficiently appending data to an existing data.table by joining and coalescing - r

I have two data.tables: dt_main is the main one, to which I need to append information of only one column from dt_add:
dt_main <- data.table(id = c(1:5)
, name = c("a", "b", NA,NA,NA)
, stuff = c(11:15))
dt_add <- data.table(id = c(4:5)
, name = c("aaa", "bbb"))
I got the job done correctly by first joining and then coalescing:
dt_main_final <- dt_add[dt_main, on = "id"]
dt_main_final[, name := fcoalesce(name, i.name)][, i.name:=NULL]
The output is as expected:
id name stuff
1: 1 a 11
2: 2 b 12
3: 3 <NA> 13
4: 4 aaa 14
5: 5 bbb 15
I wonder whether there is a more direct way to do this. Any suggestions? Thanks.
PS: I also tried melting and then dcasting:
dt <- dt_add[dt_main, on = "id"]
setnames(dt, "i.name", "name")
dt_melt <- melt(dt
, measure.vars = patterns("name")
)
dt_main_final <- dcast(dt_melt
, id + stuff ~ variable
, fun.aggregate = fcoalesce
, value.var = "value")
I got the error:
Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.
Any ideas for this one also?

We can do the join and coalesce in one step:
dt_main[dt_add, name := fcoalesce(name, i.name), on = .(id)]
dt_main
# id name stuff
# <int> <char> <int>
# 1: 1 a 11
# 2: 2 b 12
# 3: 3 <NA> 13
# 4: 4 aaa 14
# 5: 5 bbb 15
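Because := updates dt_main by reference, the one-step version overwrites the original table. If the untouched dt_main is still needed, a minimal sketch (assuming a deep copy is affordable for the table's size) is to run the same update join on copy(dt_main). The data below is the question's, re-entered so the snippet runs on its own.

```r
library(data.table)

dt_main <- data.table(id = 1:5, name = c("a", "b", NA, NA, NA), stuff = 11:15)
dt_add  <- data.table(id = 4:5, name = c("aaa", "bbb"))

# Update join on a deep copy, so dt_main itself is left unchanged.
dt_main_final <- copy(dt_main)
dt_main_final[dt_add, on = .(id), name := fcoalesce(name, i.name)]

dt_main_final$name
```

Inside the join, name refers to the column of the table being updated and i.name to the matching column of dt_add, so existing non-missing names take precedence.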

Related

data.table - Selecting by comparing list of columns to list of values

I'm trying to write a function which takes a data.table, a list of columns and a list of values and selects rows such that each column is filtered by the respective value.
So, given the following data.table:
> set.seed(1)
> dt = data.table(sample(1:5, 10, replace = TRUE),
sample(1:5, 10, replace = TRUE),
sample(1:5, 10, replace = TRUE))
> dt
V1 V2 V3
1: 1 5 5
2: 4 5 2
3: 1 2 2
4: 2 2 1
5: 5 1 4
6: 3 5 1
7: 2 5 4
8: 3 1 3
9: 3 1 2
10: 1 5 2
A call to filterDT(dt, c("V1", "V3"), c(1, 2)) would select the rows where V1 == 1 and V3 == 2 (rows 3 and 10 above).
My best thought was to use .SD and .SDcols to stand in for the desired columns and then do a comparison within i (from dt[i,j,by]):
> filterDT <- function(dt, columns, values) {
dt[.SD == values, , .SDcols = columns]
}
> filterDT(dt, c("V1", "V3"), c(1, 2))
Empty data.table (0 rows and 3 cols): V1,V2,V3
Unfortunately, this doesn't work, even if only filtering by one column.
I've noticed all examples of .SD I've found online use it in j, which tells me I'm probably doing something very wrong.
Any suggestions?
Assuming that the 'values' to be filtered on correspond, in order, to the selected 'columns', we can compare each column to its value with Map and then combine the resulting logical vectors with Reduce and &:
dt[dt[ , Reduce(`&`, Map(`==`, .SD, values)) , .SDcols = columns]]
As a function
filterDT <- function(dt, columns, values) {
dt[dt[ , Reduce(`&`, Map(`==`, .SD, values)) , .SDcols = columns]]
}
filterDT(dt, c("V1", "V3"), c(1, 2))
# V1 V2 V3
#1: 1 2 2
#2: 1 5 2
Or another option is setkey
setkeyv(dt, c("V1", "V3"))
dt[.(1, 2)]
# V1 V2 V3
#1: 1 2 2
#2: 1 5 2
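To see what the Map/Reduce answer computes step by step, here is a small sketch. It assumes the table printed in the question, typed in by hand so the result does not depend on the random number generator; masks and keep are illustrative names, not part of the original answer.

```r
library(data.table)

# The table from the question, entered by hand.
dt <- data.table(V1 = c(1, 4, 1, 2, 5, 3, 2, 3, 3, 1),
                 V2 = c(5, 5, 2, 2, 1, 5, 5, 1, 1, 5),
                 V3 = c(5, 2, 2, 1, 4, 1, 4, 3, 2, 2))
columns <- c("V1", "V3")
values  <- c(1, 2)

# One logical vector per selected column: V1 == 1 and V3 == 2.
masks <- Map(`==`, dt[, columns, with = FALSE], values)

# Combine them element-wise with &, giving one TRUE/FALSE per row.
keep <- Reduce(`&`, masks)

which(keep)
dt[keep]
```

The row indices in keep can be passed straight to i, which is what the nested dt[dt[...]] call in the answer does.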
I think you should be able to write a function that joins using an arbitrary number of columns:
#' Filter a data.table on an arbitrary number of columns
#'
#' @param dt data.table to filter
#' @param ... named columns to filter on and their values
filter_dt <- function(dt, ...) {
filter_criteria <- as.data.table(list(...))
dt[filter_criteria, on = names(filter_criteria), nomatch=0]
}
# A few examples:
filter_dt(dt, V1=1, V3=2)
filter_dt(dt, V1=2, V2=2, V3=5)
filter_dt(dt, V1=c(5,4,4), V3=c(1,2,5))
Basically the function constructs a new data.table from the arguments supplied to ..., each argument becoming a column in the new data.table filter_criteria. This is then supplied to the i argument of dt with the column names of filter_criteria used as the columns in the join.
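As a rough illustration of that mechanism, the sketch below prints the intermediate filter_criteria table for one of the calls above. It assumes the table printed in the question, typed in by hand; the filter_dt body is repeated from the answer so the snippet is self-contained.

```r
library(data.table)

# The table from the question, entered by hand.
dt <- data.table(V1 = c(1, 4, 1, 2, 5, 3, 2, 3, 3, 1),
                 V2 = c(5, 5, 2, 2, 1, 5, 5, 1, 1, 5),
                 V3 = c(5, 2, 2, 1, 4, 1, 4, 3, 2, 2))

filter_dt <- function(dt, ...) {
  filter_criteria <- as.data.table(list(...))
  dt[filter_criteria, on = names(filter_criteria), nomatch = 0]
}

# Each row of the criteria table is one (V1, V3) combination to look for.
filter_criteria <- as.data.table(list(V1 = c(5, 4, 4), V3 = c(1, 2, 5)))
filter_criteria

# Only combinations that occur in dt produce rows, because nomatch = 0.
res <- filter_dt(dt, V1 = c(5, 4, 4), V3 = c(1, 2, 5))
res
```

Note that vector arguments are matched row by row (5 with 1, 4 with 2, 4 with 5), not as all combinations of the supplied values.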

How to make a fuzzy join in R using more than one variable on each side

I would like to join the two data frames :
a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))
with a condition like (x>start)&(x<end) in order to get such a result:
# x y
#1 1 a
#2 3 <NA>
#3 5 b
I don't want to build a potentially large cartesian product and then keep only the few rows matching the condition, and I'd like a solution using the tidyverse (I am not interested in a solution using SQL, which would be a confession of failure). I thought of the 'fuzzyjoin' package, but I cannot find examples that fit my need: the function applied for the condition takes only two arguments. I also tried to put 'start' and 'end' into a single argument with data.frame(z=I(purrr::map2(b$start,b$end,list)),y=b$y)
# z y
#1 0, 2 a
#2 4, 6 b
but although the data looks fine, fuzzy_left_join doesn't accept it.
I am looking for solutions that work in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric, with arbitrary conditions).
UPDATE
I also want to be able to express conditions like (x==start+1)|(x==end+1), giving here:
# x y
#1 1 a
#2 3 a
#3 5 b
For this case you don't need multi_by or multi_match_fun; this works:
library(fuzzyjoin)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
# x start end y
# 1 1 0 2 a
# 2 3 NA NA <NA>
# 3 5 4 6 b
I eventually read the code of fuzzy_join and found a way to do what I want even without proper documentation. fuzzy_left_join doesn't work, but there is the following way (not really pretty, and it actually does a cartesian product):
g <- function(x,y) (x>y[,"start"])&(x<y[,"end"])
fuzzy_join(a,b, multi_by = list(x="x",y=c("start","end"))
, multi_match_fun = g, mode = "left") %>% select(x,y)
A data.table approach could be:
library(data.table)
name1 <- setdiff(names(setDT(b)), names(setDT(a)))
#perform left outer join and then select required columns
a[b, (name1) := mget(name1), on = .(x > start, x < end)][, .(x, y)]
which gives
x y
1: 1 a
2: 3 <NA>
3: 5 b
Sample data:
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"))
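For comparison, the same non-equi join can be sketched as a plain lookup that returns a new table instead of adding columns to a by reference. This is only one possible variant; a_dt, b_dt and res are names made up for the example.

```r
library(data.table)

a_dt <- data.table(x = c(1, 3, 5))
b_dt <- data.table(start = c(0, 4), end = c(2, 6), y = c("a", "b"))

# For every row of a_dt, look up the rows of b_dt with start < x and end > x.
# Rows of a_dt without a match are kept with y = NA (the default nomatch = NA).
res <- b_dt[a_dt, on = .(start < x, end > x), .(x = i.x, y)]
res
```

The i. prefix picks the column from the table passed in i (a_dt here), which is needed because the join columns of b_dt are reported with the matched x values.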
Update: In case you want to join both data frames on the condition (x == start + 1) | (x == end + 1), you can try
library(data.table)
DT1 <- as.data.table(a)
DT2 <- as.data.table(b)
#Perform 1st join on "x = start+1" and then another on "x = end+1". Finally row-bind both results.
DT <- rbindlist(list(DT1[DT2[, start_temp := start+1], on = c(x = "start_temp"), .(x, y), nomatch = 0],
DT1[DT2[, end_temp := end+1], on = c(x = "end_temp"), .(x, y), nomatch = 0]))
DT
# x y
#1: 1 a
#2: 5 b
#3: 3 a
A possible answer that illustrates what I am trying to do: extending dplyr in some way. I would be happy to know of ways to improve this solution, or of problems I didn't see.
The solution avoids the cartesian product, but it copies both one of the input data frames and the result into lists of data frames. I didn't include the final selection of columns x and y, which is easy to code.
my_left_join <- function(.DATA1,.DATA2,.WHERE)
{
call = as.list(match.call())
df1 <- .DATA1
df1$._row_ <- 1:nrow(df1)
dfl1 <- replyr::replyr_split(df1,"._row_")
eval(substitute(
dfl2 <- mapply(function(.x)
{filter(.DATA2,with(.x,WHERE)) %>%
mutate(._row_=.x$._row_)}
, dfl1, SIMPLIFY=FALSE)
,list(WHERE=call$.WHERE)))
df2 <- replyr::replyr_bind_rows(dfl2)
left_join(df1,df2,by="._row_") %>% select(-._row_)
}
my_left_join(a,b,(x>start)&(x<end))
# x start end y
#1 1 0 2 a
#2 3 NA NA <NA>
#3 5 4 6 b
my_left_join(a,b,(x==(start+1))|(x==(end+1)))
# x start end y
#1 1 0 2 a
#2 3 0 2 a
#3 5 4 6 b
You can try a GenomicRanges solution
library(GenomicRanges)
# setup GRanges objects
a_gr <- GRanges(1, IRanges(a$x,a$x))
b_gr <- GRanges(1, IRanges(b$start, b$end))
# find overlaps between the two data sets
res <- as.data.frame(findOverlaps(a_gr,b_gr))
# create the expected output
a$y <- NA
a$y[res$queryHits] <- as.character(b$y)[res$subjectHits]
a
x y
1 1 a
2 3 <NA>
3 5 b

R Data Table Return Row Numbers of Matches

I am using the data.table package to work through the House Prices data set from Kaggle.
When I retrieve matches with the data.table syntax, the row numbers are not returned with the data.
combined_df[is.na(GarageArea), garage_num_vars, with = FALSE]
GarageYrBlt GarageCars GarageArea
1: 1923 NA NA
How can I get the actual row number with that observation? I have seen many solutions using .I and using which = TRUE but how would I add the which = TRUE argument to my current syntax?
In addition to adding a row-number column as suggested in the comment, you can also use the which argument in this way:
DT <- data.table(val = c(1, 2, 3, NA, 4))
# > DT
# val
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 4
x <- DT[is.na(val), which = TRUE]
cbind(rownum = x, DT[x])
# rownum val
# 1: 4 NA
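The .I route mentioned in the question can be sketched in a similar way: record the row numbers first, then filter. This assumes the same toy DT as above; rownum is just an illustrative column name.

```r
library(data.table)

DT <- data.table(val = c(1, 2, 3, NA, 4))

# .I holds the row numbers of DT; keep them alongside val, then filter.
res <- DT[, .(rownum = .I, val)][is.na(val)]
res
```

Unlike which = TRUE, this carries the original row number along as an ordinary column, so it survives further filtering or sorting.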

Naming Aggregate variable(s) in data.table by reference in R

I would like to know if it's possible to name an aggregate variable by a dynamic reference at the time of aggregation in data.table.
Please note that I know I can rename the variable after aggregation by reference and that is not what I'm asking here!
Let's say I've got a data.table DT with three variables var1, var2, and var3.
> DT
var1 var2 var3
1: 1 1 A
2: 3 0 A
3: 2 2 B
4: 1 0 A
5: 0 2 C
I would like to dynamically name the aggregate variable, based on the names stored in a vector OR a string variable
var_string <- c('agg_var1', 'agg_var2')
# the following doesn't work
DT_agg <- DT[, .( (var_string[1]) = sum(var1 + var2)), by = .( (var_string[2]) = var3)]
#this is the output I want
> DT_agg
agg_var2 agg_var1
1: A 6
2: B 4
3: C 2
The code above doesn't work; it gives an error like this:
Error: unexpected '=' in "DT_agg <- DT[, .( (var_string[1]) = sum(v1 + v2)), by = .( (var_string[2]) = var3)="
I'm only interested in whether it's possible to do this at the same time as the aggregation, rather than renaming the columns afterwards, which I already know how to do.
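One possible approach, offered as a sketch rather than an established idiom: since R's parser does not allow a computed name on the left of = inside .(), the named list() calls can be built as language objects and spliced into the data.table call before it is evaluated. The DT below re-creates the printed table; j_expr and by_expr are names made up for the example.

```r
library(data.table)

DT <- data.table(var1 = c(1, 3, 2, 1, 0),
                 var2 = c(1, 0, 2, 0, 2),
                 var3 = c("A", "A", "B", "A", "C"))
var_string <- c("agg_var1", "agg_var2")

# Build list(agg_var1 = sum(var1 + var2)) and list(agg_var2 = var3) as calls,
# taking the argument names from var_string.
j_expr  <- as.call(c(as.name("list"),
                     setNames(list(quote(sum(var1 + var2))), var_string[1])))
by_expr <- as.call(c(as.name("list"),
                     setNames(list(quote(var3)), var_string[2])))

# Splice both calls into DT[, j, by = ...] and evaluate the finished call.
DT_agg <- eval(substitute(DT[, J, by = B], list(J = j_expr, B = by_expr)))
DT_agg
```

The evaluated call is the same as typing DT[, list(agg_var1 = sum(var1 + var2)), by = list(agg_var2 = var3)] by hand, so the names are set during the aggregation rather than afterwards.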

R Dynamically build "list" in data.table (or ddply)

My aggregation needs vary among columns / data.frames. I would like to pass the "list" argument to the data.table dynamically.
As a minimal example:
require(data.table)
type <- c(rep("hello", 3), rep("bye", 3), rep("ok",3))
a <- (rep(1:3, 3))
b <- runif(9)
c <- runif(9)
df <- data.frame(cbind(type, a, b, c), stringsAsFactors=F)
DT <-data.table(df)
This call:
DT[, list(suma = sum(as.numeric(a)), meanb = mean(as.numeric(b)), minc = min(as.numeric(c))), by= type]
will have result similar to this:
type suma meanb minc
1: hello 6 0.1332210 0.4265579
2: bye 6 0.5680839 0.2993667
3: ok 6 0.5694532 0.2069026
Future data.frames will have more columns that I will want to summarize differently. But for the sake of working with this small example: Is there a way to pass the list programmatically?
I naïvely tried:
# create a different list
mylist <- "list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))"
# new call
DT[, mylist, by=type]
With the following error:
1: hello
2: bye
3: ok
mylist
1: list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
2: list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
3: list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
Any hints appreciated! Best regards!
PS sorry about these as.numeric(), I could not quite figure out why, but I needed them for the example to run.
Minor edit inserted columns / before data.frame in initial sentence to clarify my needs.
This is explained in FAQ 1.6. What you are looking for is quote and eval, something like:
mycall <- quote(list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c))))
DT[, eval(mycall)]
After a bit of head-banging, here is a very ugly way of constructing the call for ddply using .()
myplyrcall <- .(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
do.call(ddply,c(.data = quote(DF), .variables = 'type',.fun = quote(summarise),myplyrcall))
You could also use as.quoted, which has an as.quoted.character method, to construct the call using paste0:
myplc <-as.quoted(c("lengtha" = "length(as.numeric(a))", "maxb" = "max(as.numeric(b))", "meanc" = "mean(as.numeric(c))"))
This can be used with data.table as well!
dtcall <- as.quoted(mylist)[[1]]
DT[,eval(dtcall), by = type]
data.table all the way.
Another way is to use .SDcols to group together the columns on which you'd like to perform the same operation. Say columns a, d, e should be summed by type, whereas b, g should have their mean taken and c, f their median; then:
# constructing an example data.table:
set.seed(45)
dt <- data.table(type=rep(c("hello","bye","ok"), each=3), a=sample(9),
b = rnorm(9), c=runif(9), d=sample(9), e=sample(9),
f = runif(9), g=rnorm(9))
# type a b c d e f g
# 1: hello 6 -2.5566166 0.7485015 9 6 0.5661358 -2.2066521
# 2: hello 3 1.1773119 0.6559926 3 3 0.4586280 -0.8376586
# 3: hello 2 -0.1015588 0.2164430 1 7 0.9299597 1.7216593
# 4: bye 8 -0.2260640 0.3924327 8 2 0.1271187 0.4360063
# 5: bye 7 -1.0720503 0.3256450 7 8 0.5774691 0.7571990
# 6: bye 5 -0.7131021 0.4855804 6 9 0.2687791 1.5398858
# 7: ok 1 -0.4680549 0.8476840 2 4 0.5633317 1.5393945
# 8: ok 4 0.4183264 0.4402595 4 1 0.7592801 2.1829996
# 9: ok 9 -1.4817436 0.5080116 5 5 0.2357030 -0.9953758
# 1) set key
setkey(dt, "type")
# 2) group col-ids by similar operations
id1 <- which(names(dt) %in% c("a", "d", "e"))
id2 <- which(names(dt) %in% c("b","g"))
id3 <- which(names(dt) %in% c("c","f"))
# 3) now use these ids with the .SDcols parameter
dt1 <- dt[, lapply(.SD, sum), by="type", .SDcols=id1]
dt2 <- dt[, lapply(.SD, mean), by="type", .SDcols=id2]
dt3 <- dt[, lapply(.SD, median), by="type", .SDcols=id3]
# 4) merge them.
dt1[dt2[dt3]]
# type a d e b g c f
# 1: bye 20 21 19 -0.6704055 0.9110304 0.3924327 0.2687791
# 2: hello 11 13 16 -0.4936211 -0.4408838 0.6559926 0.5661358
# 3: ok 14 11 10 -0.5104907 0.9090061 0.5080116 0.5633317
If/when you have many columns, writing out a list like the one you have might be cumbersome.
Another method (supporting the use of paste or paste0 to build the expression):
expr <- parse(text=mylist)
DT[, eval( expr ), by=type]
#-------
type lengtha maxb meanc
1: hello 3 0.8265407 0.5244094
2: bye 3 0.4955301 0.6289475
3: ok 3 0.9527455 0.5600915
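To make the paste0 remark concrete, here is a minimal sketch that assembles the expression string from a named character vector and then evaluates it as above. The spec vector and the fixed numbers in DT are assumptions made for the example (fixed values replace runif so the result is reproducible).

```r
library(data.table)

DT <- data.table(type = rep(c("hello", "bye", "ok"), each = 3),
                 a = rep(1:3, 3),
                 b = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
                 c = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1))

# Hypothetical spec: output column name -> expression text.
spec <- c(lengtha = "length(a)", maxb = "max(b)", meanc = "mean(c)")

# Assemble "list(lengtha = length(a), maxb = max(b), meanc = mean(c))".
mylist <- paste0("list(", paste0(names(spec), " = ", spec, collapse = ", "), ")")

# Parse the string once, then evaluate it per group.
expr <- parse(text = mylist)
res <- DT[, eval(expr), by = type]
res
```

Building code from strings is flexible but easy to get wrong, so the spec should come from a trusted source.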
I find it worrisome that eval is apparently part of the answer. From your question it is not clear to me whether and why you really want to do what you claim to want. So I demonstrate here that you can also use a function:
fun <- function(a,b,c) {
list(lengtha = length(as.numeric(a)),
maxb = max(as.numeric(b)),
meanc = mean(as.numeric(c)))
}
DT[, fun(a,b,c), by=type]
type lengtha maxb meanc
1: hello 3 0.8792184 0.3745643
2: bye 3 0.8718397 0.4519999
3: ok 3 0.8900764 0.4511536
