Creating a new data table for each row of an existing data table in R while avoiding a memory vector issue

Suppose I have two data tables:
library(data.table)
A=data.table(w=1:3,d=5:7)
B=data.table(K=2:4,m=9:11)
> A
w d
1: 1 5
2: 2 6
3: 3 7
> B
K m
1: 2 9
2: 3 10
3: 4 11
I want to do the following expansion, where I have a new B for each row of A:
C=A[,B[],by=names(A)]
w d K m
1: 1 5 2 9
2: 1 5 3 10
3: 1 5 4 11
4: 2 6 2 9
5: 2 6 3 10
6: 2 6 4 11
7: 3 7 2 9
8: 3 7 3 10
9: 3 7 4 11
However, when I do it with my real data, I get this error:
Error in `[.data.table`(A, , B[], by = names(A)) :
negative length vectors are not allowed
It turns out this is a memory error. However, I think there should be a way to do this without loops. Memory is not an issue on my server (up to 50 GB of RAM), and the resulting data table would certainly be smaller than that.
Does anyone know an efficient way to do this?

A hacky way to handle this might be to add an identical helper column to each table and then to allow cartesian joins:
library(data.table)
A = data.table(w = 1:3, d = 5:7)
B = data.table(K = 2:4, m = 9:11)
A[, j := 1]
B[, j := 1]
C = A[B, on = 'j', allow.cartesian = TRUE]
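If the helper column is unwanted, the same expansion can be sketched with plain row indices. This is only an illustration, not a tested fix for the original error: the names ia and ib are made up here, and it does not get around any limit on the total number of rows. It only avoids the helper column and the grouping step.

```r
library(data.table)

A <- data.table(w = 1:3, d = 5:7)
B <- data.table(K = 2:4, m = 9:11)

# Repeat each row of A nrow(B) times, and cycle through B nrow(A) times
ia <- rep(seq_len(nrow(A)), each = nrow(B))
ib <- rep(seq_len(nrow(B)), times = nrow(A))

# Bind the two expanded tables side by side: one copy of B per row of A
C <- cbind(A[ia], B[ib])
```

The rows come out in the same order as the expected output in the question (all of B for the first row of A, then all of B for the second, and so on).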

Related

Replace Inf/-Inf values from vector of variable names, with values from similarly named vector of variables (substr/grep/gsub)

I'm currently stumped trying to write some efficient code. I have a vector of variables (med.vars) that were transformed by the in-year global median. Sometimes the global median is 0, which creates Inf/-Inf values that I would like to replace with the pre-transformed variable value (vars). I can't figure out how to do this efficiently with some kind of data.table dat[, := lapply(.SD), .SDcols = med.vars] call or a for loop with get(), noquote(), etc.
dat<-data.table(v1=c(2,10,7),v2=c(5,6,5),v3=c(10,15,20),v1.med=c(1,Inf,5),v2.med=c(5,6,5),v3.med=c(-Inf,2,3))
vars<-c("v1","v2","v3")
med.vars<-c("v1.med","v2.med","v3.med")
v1 v2 v3 v1.med v2.med v3.med
1: 2 5 10 1 5 -Inf
2: 10 6 15 Inf 6 2
3: 7 5 20 5 5 3
In reality these vectors are 50+ vars I pull from names(dat) with grep() and use gsub(".med","",med.vars) to create the second vector of pre-transformed variable names.
I would like to efficiently perform
dat[v1.med==Inf | v1.med==-Inf, v1.med:=v1]
dat[v3.med==Inf | v3.med==-Inf, v3.med:=v3]
for each element, med.vars[i], and its corresponding element, vars[i] such that the resulting data.table is:
v1 v2 v3 v1.med v2.med v3.med
1: 2 5 10 1 5 -10
2: 10 6 15 10 6 2
3: 7 5 20 5 5 3
Thank you for your time
OP mentions efficiency, so maybe move to long form. Then the standard syntax can be used:
DT = melt(dat, meas=list(vars, med.vars), value.name=c("var", "med"))
DT[!is.finite(med), med := sign(med)*var]
variable var med
1: 1 2 1
2: 1 10 10
3: 1 7 5
4: 2 5 5
5: 2 6 6
6: 2 5 5
7: 3 10 -10
8: 3 15 2
9: 3 20 3
As these are corresponding columns, we can make use of Map
dat[, (med.vars) := Map(function(x, y) ifelse(is.finite(y), y, x * sign(y)),
                        .SD[, vars, with = FALSE],
                        .SD[, med.vars, with = FALSE])]
dat
# v1 v2 v3 v1.med v2.med v3.med
#1: 2 5 10 1 5 -10
#2: 10 6 15 10 6 2
#3: 7 5 20 5 5 3
Another option is set(), looping through the columns with a for loop:
for(j in seq_along(vars)) {
  i1 <- !is.finite(dat[[med.vars[j]]])
  v1 <- dat[[vars[j]]]
  v2 <- dat[[med.vars[j]]]
  set(dat, i = which(i1), j = med.vars[j], value = sign(v2[i1]) * v1[i1])
}
This can also be done in base R (on a data.frame)
i1 <- !sapply(dat[med.vars], is.finite)
dat[med.vars][i1] <- dat[vars][i1] * sign(dat[med.vars][i1])
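One side note on building the two name vectors: in gsub(".med", "", med.vars) the unescaped dot matches any character, so anchoring and escaping the pattern is a little safer. A minimal sketch combining that with the set() loop above, assuming every column ending in ".med" has a counterpart without the suffix (the names k and bad are just for illustration):

```r
library(data.table)

dat <- data.table(v1 = c(2, 10, 7), v2 = c(5, 6, 5), v3 = c(10, 15, 20),
                  v1.med = c(1, Inf, 5), v2.med = c(5, 6, 5),
                  v3.med = c(-Inf, 2, 3))

# Escape the dot and anchor at the end so only a literal ".med" suffix matches
med.vars <- grep("\\.med$", names(dat), value = TRUE)
vars <- sub("\\.med$", "", med.vars)

for (k in seq_along(vars)) {
  bad <- which(is.infinite(dat[[med.vars[k]]]))
  if (length(bad)) {
    # Keep the sign of the infinite value, take the magnitude from the raw column
    set(dat, i = bad, j = med.vars[k],
        value = sign(dat[[med.vars[k]]][bad]) * dat[[vars[k]]][bad])
  }
}
```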

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow it displays just 2 rows and drops the other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
it works as expected.
Can anyone help with sorting using column name vector?
You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
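For what it's worth, DT[order(keycol)] returns two rows because order() is sorting the two-element character vector c("x", "y") itself, which gives 1 2, so only rows 1 and 2 are selected. A small sketch of that, plus setorderv() applied to a copy with a mixed sort direction (the names idx and sorted, and the choice of descending y, are only for illustration):

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3),
                 y = rep(c(1, 3, 6), times = 3),
                 v = 1:9)
keycol <- c("x", "y")

# order() sees only the character vector, not the columns it names
idx <- order(keycol)

# copy() leaves DT untouched; order = c(1L, -1L) sorts x ascending, y descending
sorted <- setorderv(copy(DT), keycol, order = c(1L, -1L))
```

Working on a copy is only useful when the original row order still matters; otherwise calling setorderv(DT, keycol) directly avoids the extra memory.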

Renaming multiple columns in R data.table

This is related to Henrik's question "Assign multiple columns using := in data.table, by group".
But what if I want to create a new data.table with given column names instead of assigning new columns to an existing one?
f <- function(x){list(head(x,2),tail(x,2))}
dt <- data.table(group=sample(c('a','b'),10,replace = TRUE),val=1:10)
> dt
group val
1: b 1
2: b 2
3: a 3
4: b 4
5: a 5
6: b 6
7: a 7
8: a 8
9: b 9
10: b 10
I want to get a new data.table with predefined column names by calling the function f:
dt[,c('head','tail')=f(val),by=group]
I wish to get this:
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
But it gives me an error. What I can do is create the table then change the column names, but that seems cumbersome:
> dt2 <- dt[,f(val),by=group]
> dt2
group V1 V2
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
> colnames(dt2)[-1] <- c('head','tail')
> dt2
group head tail
1: a 1 8
2: a 3 10
3: b 2 6
4: b 5 9
Is it something I can do with one call?
From running your code as-is, this is the error I get:
dt[,c('head','tail')=f(val),by=group]
# Error: unexpected '=' in "dt[,c('head','tail')="
The problem is using = instead of := for assignment.
On to your problem of wanting a new data.table:
dt2 <- dt[, setNames(f(val), c('head', 'tail')), by = group]
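A reproducible sketch of that call, with a fixed group column instead of sample() so the result is deterministic (the group values below are made up for the example). It also shows that naming the list elements directly in j gives the same result, assuming you are free to write the list yourself rather than go through f:

```r
library(data.table)

dt <- data.table(group = c("b", "b", "a", "b", "a", "b", "a", "a", "b", "b"),
                 val = 1:10)
f <- function(x) list(head(x, 2), tail(x, 2))

# Name the list returned by f() before data.table builds the result columns
dt2 <- dt[, setNames(f(val), c("head", "tail")), by = group]

# Equivalent: build a named list directly in j
dt3 <- dt[, .(head = head(val, 2), tail = tail(val, 2)), by = group]
```

Groups appear in order of first appearance, so "b" comes before "a" here; add keyby = group instead of by = group if a sorted result is wanted.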

Update a data.table based on another data table

I want to update the columns in an old data.table based on a new data.table only when the value is not NA.
DT_old = data.table(x=rep(c("a","b","c")), y=c(1,3,6), v=1:3, l=c(1,1,1))
DT_old
x y v l
1: a 1 1 1
2: b 3 2 1
3: c 6 3 1
DT_new = data.table(x=rep(c("b","c",'d')), y=c(9,6,10), v=c(2,NA,10), z=c(9,9,9))
DT_new
x y v z
1: b 9 2 9
2: c 6 NA 9
3: d 10 10 9
I want the output to be
x y v z
1: b 9 2 9
2: c 6 3 9
3: d 10 10 9
4: a 1 1 NA
Currently I am merging the two data.table and going through each column and replacing the NA in the new data.table
DT_merged <- merge(DT_new, DT_old, all=TRUE, by='x')
DT_merged
x y.x v.x z y.y v.y l
1: a NA NA NA 1 1 1
2: b 9 2 9 3 2 1
3: c 6 NA 9 6 3 1
4: d 10 10 9 NA NA NA
DT_merged[is.na(y.x), y.x := y.y]
DT_merged[is.na(v.x), v.x := v.y]
DT_merged = DT_merged[, list(x=x, y=y.x, v=v.x, z=z)]
Is there a better way to do the above?
Here's how I would approach this. First, I will expand DT_new to cover the unique values of the x columns of both tables, using a binary join
res <- setkey(DT_new, x)[unique(c(x, DT_old$x))]
res
# x y v z
# 1: b 9 2 9
# 2: c 6 NA 9
# 3: d 10 10 9
# 4: a NA NA NA
Then, I will update the two columns in res by reference using another binary join
setkey(res, x)[DT_old, `:=`(y = i.y, v = i.v)]
res
# x y v z
# 1: a 1 1 NA
# 2: b 3 2 9
# 3: c 6 3 9
# 4: d 10 10 9
Following the comments section, it seems that you are trying to join each column by its own condition. There is no simple way of doing such a thing in R or any other language AFAIK. Thus, your own solution could be a good option by itself.
Though, here are some other alternatives, mainly taken from a similar question I myself asked not long ago
Using two ifelse statements
setkey(res, x)[DT_old, `:=`(y = ifelse(is.na(y), i.y, y),
                            v = ifelse(is.na(v), i.v, v))]
Two separate conditional joins
setkey(res, x) ; setkey(DT_old, x) ## old data set needs to be keyed too now
res[is.na(y), y := DT_old[.SD, y]]
res[is.na(v), v := DT_old[.SD, v]]
Both will give you what you need.
P.S.
If you don't want warnings, you need to define the corresponding column classes correctly, e.g. v column in DT_new should be defined as v= c(2L, NA_integer_, 10L)
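Another way to write the question's merge-then-fill approach is with fcoalesce(), which returns the first non-missing value across its arguments. This is a sketch under some assumptions: a data.table version that provides fcoalesce(), and columns of matching type in both tables (v is defined as double in DT_old here, and the l column is left out for brevity).

```r
library(data.table)

DT_old <- data.table(x = c("a", "b", "c"), y = c(1, 3, 6), v = c(1, 2, 3))
DT_new <- data.table(x = c("b", "c", "d"), y = c(9, 6, 10),
                     v = c(2, NA, 10), z = c(9, 9, 9))

# Full outer join; clashing columns get the default .x (new) and .y (old) suffixes
res <- merge(DT_new, DT_old, by = "x", all = TRUE)

# Prefer the new value, fall back to the old one where the new one is NA
res[, `:=`(y = fcoalesce(y.x, y.y), v = fcoalesce(v.x, v.y))]

# Drop the suffixed helper columns by reference
res[, c("y.x", "v.x", "y.y", "v.y") := NULL]
```

The result is sorted by x (a, b, c, d) rather than in the order shown in the question, because merge() sorts on the join column.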

Add marginals to data table?

What is the right way to add marginal sums to a data table?
What I do right now:
> (a <- data.table(x=c(1,2,1,2,2,3,3),y=c(10,10,20,20,30,30,40),z=1:7,key=c("x")))
x y z
1: 1 10 1
2: 1 20 3
3: 2 10 2
4: 2 20 4
5: 2 30 5
6: 3 30 6
7: 3 40 7
> (a <- a[a[,sum(z),by=x]])
x y z V1
1: 1 10 1 4
2: 1 20 3 4
3: 2 10 2 11
4: 2 20 4 11
5: 2 30 5 11
6: 3 30 6 13
7: 3 40 7 13
> setnames(a,"V1","x.z")
> setkeyv(a,"y")
> (a <- a[a[,sum(z),by=y]])
y x z x.z V1
1: 10 1 1 4 3
2: 10 2 2 11 3
3: 20 1 3 4 7
4: 20 2 4 11 7
5: 30 2 5 11 11
6: 30 3 6 13 11
7: 40 3 7 13 7
> setnames(a,"V1","y.z")
I am pretty sure this is not The Right Way.
What is?
One alternative is this one:
> a[,Sum:=sum(z), by="x"]
> a
x y z Sum
1: 1 10 1 4
2: 1 20 3 4
3: 2 10 2 11
4: 2 20 4 11
5: 2 30 5 11
6: 3 30 6 13
7: 3 40 7 13
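Applied to the question's table, the same grouped := can produce both marginals without any joins or renaming. A minimal sketch, reusing the x.z and y.z column names from the question:

```r
library(data.table)

a <- data.table(x = c(1, 2, 1, 2, 2, 3, 3),
                y = c(10, 10, 20, 20, 30, 30, 40),
                z = 1:7)

# Each sum is computed once per group and recycled to every row of that group
a[, x.z := sum(z), by = x]
a[, y.z := sum(z), by = y]
```

No key is needed for this, and the row order of a is left unchanged.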
Edit: Some more explanation on := usage:
The := operator enables add/update by reference. With this, you can:
add new columns or update existing columns by reference
DT[, x2 := x+1] # add one new column
DT[, `:=`(x2 = x+1, y2 = y+1)] # adding more than 1 col
DT[, x := x+1] # modify existing column
add or update certain rows of new or existing columns by reference
DT[x == 1L, y := NA] # modify 'y' just where expression in 'i' matches
DT[x == 1L, `:=`(y = NA, z=NA)] # same but for multiple columns
DT[x == 1L, newcol := 5L] # matched rows for 'newcol' will be 5, all other 'NA'
add or update cols while grouping, by reference - by default, the computed result is recycled within each group.
DT[, zsum := sum(z), by=x]
Here, sum(z) returns 1 value for each group in x. The result is then recycled for length of that group and is added/updated by reference to zsum.
add or update during a by-without-by operation. That is, when you perform a data.table join and you want to add/update column while joining:
X <- data.table(x=rep(1:3, each=2), y=1:6, key="x")
Y <- data.table(x=1:3, y=c(3L, 1L, 2L), key="x")
X[Y, y.gt := y > i.y]
Finally, you can also remove columns by reference (i.e. instantly, even if it's a 20GB table):
DT[, x := NULL] # just 1 column
DT[, c("x","y") := NULL] # 1 or more columns
toRemove = c("x","y")
DT[, (toRemove) := NULL] # wrap with brackets to lookup variable
Hope this helps clarify the usage of :=. Also check out ?set. It is similar to :=, but with the limitation that it cannot be combined with joins. In exchange, for the operations it does support, it is faster than := inside a for loop, because it avoids the overhead of calling [.data.table.
It can be quite handy in some scenarios. See this post for a nice usage.
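As a rough illustration of set() inside a loop (not taken from the linked post; the table and the choice of 0 as a fill value are made up for the example), here is one way to replace NA values column by column:

```r
library(data.table)

DT <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

for (col in names(DT)) {
  # Rows of this column that are NA
  na_rows <- which(is.na(DT[[col]]))
  if (length(na_rows)) {
    # Update those rows in place, without going through [.data.table
    set(DT, i = na_rows, j = col, value = 0)
  }
}
```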
