R: data.table left outer join function not updating - r

Based on this previous post, I built leftOuterJoin, a function that updates a data.table X according to another data.table Y. The function is defined as follows:
leftOuterJoin <- function(X, Y, onCol) {
.colsY <- names(Y)
X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
}
The function works 99% of the time as intended, e.g.:
X <- data.table(id = 1:5, L = letters[1:5])
id L
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
id L N
1: 3 <NA> 10
2: 4 g NA
3: 5 h 12
leftOuterJoin(X, Y, "id")
X
id L N
1: 1 a NA
2: 2 b NA
3: 3 <NA> 10
4: 4 g NA
5: 5 h 12
However, for some reason unknown to me, it just stops working with some data tables (I have no reproducible example at hand). There is no error, but the data table is not updated. When I step through with the debug function, everything seems to work: X is updated inside the function, but the real data.table isn't. If I run the same statement outside the function, it works. Maybe it is related to the scope of the function? I am really struggling with this problem.
Spec: R v3.5.1 and data.table v1.11.4.
EDIT
Based on the comments I figured out that the problem is related to the data.table pointer. You can reproduce the problem with this code:
> save(X, file = "X.RData")
> load("X.RData")
> leftOuterJoin(X, Y, "id")
> X
id L
1: 1 a
2: 2 b
3: 3 <NA>
4: 4 g
5: 5 h
Notice that X is updated but not the way we want it. However, if we use setDT() it works properly:
> load("X.RData")
> setDT(X)
> leftOuterJoin(X, Y, "id")
> X
id L N
1: 1 a NA
2: 2 b NA
3: 3 <NA> 10
4: 4 g NA
5: 5 h 12
Is there a way to set up leftOuterJoin() such that it will not be necessary to run setDT() every time some data is loaded?
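One possible workaround, sketched here rather than taken from the thread: := can only add a column by reference when the data.table still has spare over-allocated column slots, and an object restored with load() has lost them. data.table then reallocates, and the new column ends up only in the function's local X. Having the function return X and reassigning the result in the caller sidesteps this. The invisible(X) return value and the reassignment are assumptions added for illustration; they are not part of the original function.

```r
library(data.table)

# Variant of leftOuterJoin() that also returns X, so the caller can
# reassign the result instead of relying only on update-by-reference.
leftOuterJoin <- function(X, Y, onCol) {
  .colsY <- names(Y)
  X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
  invisible(X)
}

X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))

# Round-trip through disk, as in the reproduction above
f <- tempfile(fileext = ".RData")
save(X, file = f)
load(f)

# Reassigning keeps the new column N even without a prior setDT(X)
X <- leftOuterJoin(X, Y, "id")
names(X)
```

The trade-off is that callers must remember the X <- part; calling setDT(X) right after load() remains the alternative when pure by-reference semantics are wanted.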

Related

Referencing columns in .SDcols using for loop

What I'm trying to achieve is this: say I have a data table DT with (say) 4 columns. I want to get the unique length (the number of distinct rows) for every combination of 2 columns.
DT <- data.table(a = 1:10, b = c(1,1,1,2,2,3,4,4,5,5), c = letters[1:10], d = c(3,3,5,2,4,2,5,1,1,5))
> DT
a b c d
1: 1 1 a 3
2: 2 1 b 3
3: 3 1 c 5
4: 4 2 d 2
5: 5 2 e 4
6: 6 3 f 2
7: 7 4 g 5
8: 8 4 h 1
9: 9 5 i 1
10: 10 5 j 5
I tried the following code :
cols <- colnames(DT)
for(i in 1:(length(cols)-1)) {
for (j in i+1:length(cols)) {
print(unique(DT[,.SD, .SDcols = c(cols[i],cols[j])]))
}
}
Here, basically 'i' goes from first column to second last whereas 'j' is the combining column with 'i'. So the combinations I get are : ab, ac, ad, bc, bd, cd.
But it gives me the following error
Error in [.data.table(DT, , .SD, .SDcols = c(cols[i], cols[j])) :
.SDcols missing at the following indices: [2]
If someone can explain why this is and a way around it, I'll be really grateful. Thanks.
This is due to operator precedence: : is evaluated before +, so i+1:length(cols) means i + (1:length(cols)) and runs past the last column:
1+1:length(cols)
[1] 2 3 4 5
> (1+1):length(cols)
[1] 2 3 4
The correct loop is:
for(i in 1:(length(cols)-1)) {
for (j in (i+1):length(cols)) {
print(unique(DT[,.SD, .SDcols = c(cols[i],cols[j])]))
}
}
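As an alternative to the nested loops, the column pairs can be generated directly with base R's combn(), which avoids the index arithmetic altogether. This is a sketch, not part of the original answer; it uses data.table's uniqueN() to count distinct rows per pair, and the names pairs and counts are made up for the example.

```r
library(data.table)

DT <- data.table(a = 1:10, b = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5),
                 c = letters[1:10], d = c(3, 3, 5, 2, 4, 2, 5, 1, 1, 5))

# All unordered pairs of column names: ab, ac, ad, bc, bd, cd
pairs <- combn(names(DT), 2, simplify = FALSE)

# Number of distinct rows for each pair of columns
counts <- sapply(pairs, function(p) uniqueN(DT, by = p))
names(counts) <- sapply(pairs, paste, collapse = "")
counts
```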

Create a list of dataframes/data.tables which created from function with argument in R?

I am impressed by how efficient R code can be when using functions and loops.
I will first give a simplified example of the question, and then explain my actual problem (where the code is probably not reproducible).
Suppose I have several vectors that differ in content and length, like:
tables_vector_1 <- c(1,2,3)
tables_vector_2 <- c(1:10)
And I have a function to create data.tables from the vector, like:
create_dt <- function(tables_vector, i){
DT <- data.table(id = 1:i, name = c("a","b","c"))
return(DT)
}
I am wondering whether there is a way to write a loop or function that creates all (or some) of the data.tables for a vector by running the function defined above.
(probably like)
for (i in seq_along(tables_vector)) {
create_dt(tables_vector, i)
}
And then combine the results in a list, same as the result if you run:
list(create_dt(tables_vector_1,1),create_dt(tables_vector_1,2),create_dt(tables_vector_1,3))
I have tried to use lapply(list(1:3),create_dt,tables_vector = tables_vector_1, i), but it fails, since I don't know how to specify the i argument correctly in lapply().
Here is why this problem arises:
In the real situation, I have created a function to import data.table from the database:
import_data <- function(tables_vector,i){
end <- Sys.time()
start <- end - 7200
con <- dbConnect("PostgreSQL", dbname="db", host = "host", user=db_user, password=db_password)
query <- sprintf("SELECT %s.timeutc, %s.scal AS %s FROM %s WHERE timeutc BETWEEN '%s' AND '%s' AND mode='General';",
tables_vector[i],tables_vector[i],tables_vector[i], tables_vector[i],start,end)
rs <- dbSendQuery(con, query)
df <- fetch(rs, n = -1)
dbClearResult(rs)
dbDisconnect(con)
return(as.data.table(df))
}
And I have tens of vectors which are defined by groups (e.g. vector1 contains channels for purpose 1, vector2 contains channels for purpose 2).
Since they are created for different analysis purposes, I cannot simply combine them in one vector.
Moreover, some vectors contain 7 or 8 channels, so it is quite annoying to list them by repeating the function call one by one.
How about something like this:
tables_vector_1 <- c(1,2,3)
tables_vector_2 <- c(1:10)
create_dt <- function(tables_vector, i){
DT <- data.table(id = 1:i, name = letters[1:i])
return(DT)
}
make_list <- function(x){
lapply(seq_along(x), function(i)create_dt(x, i))
}
make_list(tables_vector_1)
[[1]]
id name
1: 1 a
[[2]]
id name
1: 1 a
2: 2 b
[[3]]
id name
1: 1 a
2: 2 b
3: 3 c
make_list(tables_vector_2)
[[1]]
id name
1: 1 a
[[2]]
id name
1: 1 a
2: 2 b
[[3]]
id name
1: 1 a
2: 2 b
3: 3 c
[[4]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
[[5]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
[[6]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
[[7]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
[[8]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
[[9]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
9: 9 i
[[10]]
id name
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
6: 6 f
7: 7 g
8: 8 h
9: 9 i
10: 10 j
Note, I changed the create_dt() function so it did not produce a warning, but the mechanics should still work as intended.
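Since the question mentions tens of vectors grouped by purpose, the same idea can be applied one level up. The sketch below assumes the vectors are collected in a named list; the name groups and its element names are hypothetical.

```r
library(data.table)

create_dt <- function(tables_vector, i) {
  data.table(id = 1:i, name = letters[1:i])
}
make_list <- function(x) lapply(seq_along(x), function(i) create_dt(x, i))

# Hypothetical grouping of the vectors by purpose
groups <- list(purpose1 = c(1, 2, 3), purpose2 = 1:10)

# One list of data.tables per group, keeping the group names
all_tables <- lapply(groups, make_list)
lengths(all_tables)
```

A single table is then reachable as, for example, all_tables$purpose2[[4]].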

collapse package: sum over two vectors but keep empty intersections

I would like to aggregate a vector/matrix y by two variables a and b via the fsum function of the collapse package. fsum does not return values for empty intersections. Is there a way to keep empty intersections using the collapse package? I know that I could e.g. work through cross-joins and data.table, but as my function input is a vector and speed really matters, I would like to avoid converting the input matrix to a data.table and then converting the output back to a matrix/vector (for a solution with data.table, see e.g. here: data.table calculate sums by two variables and add observations for "empty" groups).
Here is an example:
library(collapse)
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- matrix(rnorm(10), 10, 1)
fsum(x = y, g = data.frame(a = a, b = b))
#> fsum(x = y, g = data.frame(a = a, b = b))
# [,1]
#1.1 -0.40955189
#1.2 -0.05710677
#2.2 0.50360797
#2.3 -1.28459935
#3.1 0.04672617
#3.2 -0.69095384
#3.3 -0.23570656
#4.1 0.80418951
#5.2 1.08576936
What I would like to get: the regular output above, but keeping the empty intersections of (a, b), e.g. (a = 1, b = 3), and assigning them a missing value or zero:
# a b y
#1: 1 1 -0.7702614
#2: 1 2 -0.2992151
#3: 1 3 NA
#4: 2 1 NA
#5: 2 2 -0.4115108
#6: 2 3 0.4356833
#.................
As an addition: base::aggregate() has a function argument drop = FALSE that achieves this:
aggregate(y, data.frame(a, b), sum, drop = FALSE)
a b V1
#1 1 1 -0.7702614
#2 2 1 NA
#3 3 1 -1.2375384
#4 4 1 -0.2894616
#5 5 1 NA
#6 1 2 -0.2992151
#7 2 2 -0.4115108
#8 3 2 -0.8919211
#9 4 2 NA
#10 5 2 0.2522234
#11 1 3 NA
#12 2 3 0.4356833
#13 3 3 -0.2242679
#14 4 3 NA
#15 5 3 NA
Nevertheless, in my experience both data.table and collapse are significantly faster, but collapse has the advantage that it also works with matrix objects (which do not need to be converted to data.tables).
Is there a way to achieve this via collapse?
Yes, you can do that with fsum; however, other functions like fmedian will warn about that. To do it, you need to create factors and interact them using :, like so:
library(collapse)
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- matrix(rnorm(10), 10, 1)
fsum(x = y, g = qF(a):qF(b))
# [,1]
# 1:1 -0.7702614
# 1:2 -0.2992151
# 1:3 NA
# 2:1 NA
# 2:2 -0.4115108
# 2:3 0.4356833
# 3:1 -1.2375384
# 3:2 -0.8919211
# 3:3 -0.2242679
# 4:1 -0.2894616
# 4:2 NA
# 4:3 NA
# 5:1 NA
# 5:2 0.2522234
# 5:3 NA
For the earlier example you gave, I'd also like to note that the expensive call to data.frame is not necessary; fsum(x = y, g = list(a = a, b = b)) is sufficient.
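The reason the factor interaction keeps empty intersections can be seen in base R alone: applying : to two factors yields a factor whose levels are every combination of the input levels, whether observed or not. The sketch below uses small hand-made values and tapply() in place of fsum, purely to illustrate that mechanism.

```r
# Small hand-made example; the values are illustrative only
a <- c(1, 1, 2, 3)
b <- c(1, 2, 2, 1)
y <- c(10, 20, 30, 40)

# `:` on two factors builds their interaction with every level combination
g <- factor(a):factor(b)
levels(g)

# Groups with no observations are kept and come back as NA
sums <- tapply(y, g, sum)
sums
```

Here the combinations 2:1 and 3:2 never occur in the data, yet both appear in the result as NA.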

Set value of data frame new field equal to another field based on condition on a third field in R

I want to add a field to a given data frame and set it equal to an existing field in the same data frame, based on a condition on a different (existing) field.
I know this works:
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
df$test[is.even(df$a)] <- as.character(df[is.even(df$a), "b"])
> df
a b test
1 1 A NA
2 2 B B
3 3 C NA
4 4 D D
5 5 E NA
6 6 F F
But I have this feeling it can be done a lot better than this.
Using data.table it's quite easy (is.even is the helper defined in the question):
library(data.table)
dt = data.table(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
dt[is.even(a), test := b]
> dt
a b test
1: 1 A NA
2: 2 B B
3: 3 C NA
4: 4 D D
5: 5 E NA
6: 6 F F
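For completeness, a base R sketch of the same assignment without data.table, using a vectorised ifelse(). It reuses the is.even helper and the data from the question; choosing ifelse() is a suggestion made here, not something from the thread.

```r
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                 b = c("A", "B", "C", "D", "E", "F"))

# Copy b where a is even, NA elsewhere; as.character() guards against
# b being a factor (the data.frame default before R 4.0)
df$test <- ifelse(is.even(df$a), as.character(df$b), NA_character_)
df
```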

Removing data point outside of a specific quantile

I would like to remove data points above 97.5% and below 2.5%. I created the following parsimonious data set to explain the issue:
y <- data.table(a = rep(c("b","d"), each = 2, times = 3), c=rep(c("e","f"),
each = 3, times = 2), seq(1,6))
I created the following script to accomplish the task:
require(data.table)
y[, trimErr := ifelse(y$V3 < quantile(y$V3, 0.95) & y$V3 > quantile(y$V3, 0.05),y$V3, NA),
by = list(a,c)]
I then got 4 warning messages; I will only provide the first one:
Warning messages:
1: In `[.data.table`(y, , `:=`(trimErr, ifelse(y$V3 < quantile(y$V3, :
RHS 1 is length 12 (greater than the size (3) of group 1). The last 9 element(s) will be discarded.
Can you please explain what the warning means and how I can modify my code?
Could you also suggest better code to remove the top and bottom 2.5% of the data? Thanks in advance.
You're grouping by a and c, but passing in a vector that is the length of the entire data.table, instead of just the data for each group.
You don't need the y$ inside the [.data.table call; bare column names are evaluated within each group:
y[, trimErr:=ifelse(V3 < quantile(V3, 0.95) & V3 > quantile(V3, 0.05),V3, NA),
by=list(a,c)]
y
# a c V3 trimErr
# 1: b e 1 NA
# 2: b e 2 2
# 3: d e 3 NA
# 4: d f 4 NA
# 5: b f 5 5
# 6: b f 6 NA
# 7: d e 1 NA
# 8: d e 2 2
# 9: b e 3 NA
#10: b f 4 NA
#11: d f 5 5
#12: d f 6 NA
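The code above keeps the question's 0.05 and 0.95 cut-offs, while the stated goal was to drop values below 2.5% and above 97.5%. A minimal sketch with those thresholds follows. It computes both quantiles once per group and uses replace() so the column type stays the same in every group. V3 is written out explicitly here and is assumed to match the recycled, unnamed third column of the original data.

```r
library(data.table)

# Same toy data as in the question, with the third column named V3
y <- data.table(a = rep(c("b", "d"), each = 2, times = 3),
                c = rep(c("e", "f"), each = 3, times = 2),
                V3 = rep(1:6, times = 2))

# Within each (a, c) group, blank out values at or beyond the
# 2.5% and 97.5% quantiles; replace() keeps the column type stable
y[, trimErr := {
  q <- quantile(V3, c(0.025, 0.975))
  replace(V3, V3 <= q[1] | V3 >= q[2], NA)
}, by = .(a, c)]
y
```

With only three values per group, any symmetric trimming leaves just the middle value, so the result coincides with the output shown above.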
