I'm trying to improve the efficiency of the following simple data.table code by combining it into a single call, so that by = "group" is not computed repeatedly.
#data
library(data.table)
DT <- data.table(group = c(rep("a", 40), rep("b", 40)),
                 other = rnorm(80),
                 num = 1:80)
#reduce this to one "by" call
DT[, c1 := ifelse(num <= 7, NA, num), by = "group"]
DT[, sprintf("c%d", 2:10) := shift(c1, 1:9, type = 'lag'), by = "group"]
DT[, d1 := shift(c10, 1, type = 'lag'), by = "group"]
DT[, sprintf("d%d", 2:10) := shift(d1, 1:9, type = 'lag'), by = "group"]
DT[, e1 := shift(d10, 1, type = 'lag'), by = "group"]
DT[, sprintf("e%d", 2:10) := shift(e1, 1:9, type = 'lag'), by = "group"]
Something like this (not valid syntax, just to sketch the idea):
DT[, .(c1 := ifelse(num <= 7, NA, num),
       sprintf("c%d", 2:10) := shift(c1, 1:9, type = 'lag'),
       d1 := shift(c10, 1, type = 'lag'),
       sprintf("d%d", 2:10) := shift(d1, 1:9, type = 'lag'),
       e1 := shift(d10, 1, type = 'lag'),
       sprintf("e%d", 2:10) := shift(e1, 1:9, type = 'lag')), by = "group"]
Edit:
This is similar to, but slightly different from, this question, as the variables created here are not independent of one another.
Any suggestions?
Thanks
Here is an option:
ix <- 2L:10L
m <- 1L:9L
DT[, c("c1", sprintf("c%d", ix),
       "d1", sprintf("d%d", ix),
       "e1", sprintf("e%d", ix)) := {
  c1 = replace(num, num <= 7L, NA_integer_)
  lc = shift(c1, m)          # c2..c10
  d1 = shift(lc[[9L]])       # d1 is the lag of c10
  ld = shift(d1, m)          # d2..d10
  e1 = shift(ld[[9L]])       # e1 is the lag of d10
  c(list(c1), lc, list(d1), ld, list(e1), shift(e1, m))
}, by = group]
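After this call, DT should contain the same 30 new columns as the step-by-step version; a quick check (a sketch, assuming the fresh example DT):
ncol(DT)  # 33: group, other, num plus the 30 new columns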
You can write a function:
f <- function(num) {
  c1 <- ifelse(num <= 7, NA, num)
  cl <- shift(c1, 1:9, type = 'lag')
  names(cl) <- sprintf("c%d", 2:10)
  d1 <- shift(cl[["c10"]], 1, type = 'lag')
  dl <- shift(d1, 1:9, type = 'lag')
  names(dl) <- sprintf("d%d", 2:10)
  e1 <- shift(dl[["d10"]], 1, type = 'lag')
  el <- shift(e1, 1:9, type = 'lag')
  names(el) <- sprintf("e%d", 2:10)
  c(list(c1 = c1), cl, list(d1 = d1), dl, list(e1 = e1), el) # list of desired columns
}
x <- DT[, f(num), by = group] # apply it by group
DT <- cbind(DT, x[, -'group']) # add to initial data
Maybe this will be faster. The function could probably be written more cleanly. Make sure that the function returns a list with your desired column names.
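As a quick sanity check (a sketch, assuming the f defined above and the fresh example data), the returned list should have 30 correctly named elements:
out <- f(DT[group == "a", num])   # run on one group's num vector
length(out)                       # should be 30
names(out)[c(1:2, 11, 21)]        # "c1" "c2" "d1" "e1"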
You can call by once using the fact that (1) every column in the j argument of a data.table becomes a column in the returned data.table, and that (2) curly braces can be used for intermediate calculations in j.
Because the default value of the argument type in the shift function is lag,
I did not specify it.
Note that the last line in the curly braces, lst, is the only object returned.
DT[, {
  nms = paste0(rep(c("c", "d", "e"), each = 10), 1:10)
  lst = setNames(vector("list", 30), nms)
  lst[["c1"]] = ifelse(num <= 7, NA, num)
  lst[sprintf("c%d", 2:10)] = shift(lst[["c1"]], 1:9)
  lst[["d1"]] = shift(lst[["c10"]], 1)
  lst[sprintf("d%d", 2:10)] = shift(lst[["d1"]], 1:9)
  lst[["e1"]] = shift(lst[["d10"]], 1)
  lst[sprintf("e%d", 2:10)] = shift(lst[["e1"]], 1:9)
  lst
}, by = group]
The output contains the group column plus 30 new columns: c1, ..., c10, d1, ..., d10, and e1, ..., e10.
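If you prefer to add the columns to DT by reference instead of returning a new table, the same j block works with := (a sketch; the names must then be defined outside j so they can be used on the left-hand side):
nms <- paste0(rep(c("c", "d", "e"), each = 10), 1:10)
DT[, (nms) := {
  lst = setNames(vector("list", 30), nms)
  lst[["c1"]] = ifelse(num <= 7, NA, num)
  lst[sprintf("c%d", 2:10)] = shift(lst[["c1"]], 1:9)
  lst[["d1"]] = shift(lst[["c10"]], 1)
  lst[sprintf("d%d", 2:10)] = shift(lst[["d1"]], 1:9)
  lst[["e1"]] = shift(lst[["d10"]], 1)
  lst[sprintf("e%d", 2:10)] = shift(lst[["e1"]], 1:9)
  lst
}, by = group]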
I have a fairly large data.table that comes from an SQL table.
Columns that contain missing values in SQL come through as NULLs in the data.table, so these columns are actually list columns holding both values and NULLs.
I would like an efficient way to replace the NULLs with NAs and then convert each column from a list to a regular atomic column.
Here is an example that reproduces my case:
library(data.table)
n = 10^6
l1 = as.list(rnorm(n, 10, 25))
l2 = as.list(rnorm(n, 0, 200))
l3 = as.list(rnorm(n))
df = data.table(a = runif(n),
                b = l1,
                c = l2,
                d = rnorm(n, 88, 0.5),
                e = l3)
# create an index vector to set to NULL
id1 = sample(1:n, 0.26*n)
id2 = sample(1:n, 0.60*n)
id3 = sample(1:n, 0.09*n)
# set to NULL
df$b[id1] = list(NULL)
df$c[id2] = list(NULL)
df$e[id3] = list(NULL)
This is what I have done, but it takes too long:
type = data.frame(type = sapply(df, class))
col = names(df)[which(type$type == "list")]
# ----------- FIRST WAY -----------------------------------------------------------------------
system.time(
  df[, (col) := lapply(.SD, function(i) unlist(lapply(i, function(x) ifelse(is.null(x), NA, x)))), .SDcols = col]
)
# ----------- SECOND WAY (a little bit faster) ---------------------------
system.time(
  for (i in col) {
    df[, eval(i) := unlist(lapply(get(i), function(x) ifelse(is.null(x), NA, x)))]
  }
)
Why is the first solution slower than the second? Does anybody have a better way?
We may use set here:
library(data.table)
df1 <- copy(df)
system.time({
  for (nm in col) {
    i1 <- which(lengths(df1[[nm]]) == 0)
    set(df1, i = i1, j = nm, value = list(NA))
    df1[[nm]] <- unlist(df1[[nm]])
  }
})
# user system elapsed
# 0.158 0.004 0.161
Compared with the OP's second method:
system.time(
  for (i in col) {
    df[, eval(i) := unlist(lapply(get(i), function(x) ifelse(is.null(x), NA, x)))]
  }
)
# user system elapsed
# 5.618 0.157 5.756
Checking the output:
> all.equal(df, df1)
[1] TRUE
One solution based on the lengths function:
cols = which(sapply(df, is.list))
df[, (cols) := lapply(.SD, \(x) { x[lengths(x) == 0L] = NA; as.numeric(x) }), .SDcols = cols]
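A quick check of the result (a sketch, assuming the freshly created example data): the former list columns should now be numeric, with NAs exactly where the NULLs were:
sapply(df[, ..cols], class)        # all "numeric"
sum(is.na(df$b)) == length(id1)    # TRUE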
How can I efficiently calculate distances between (almost) consecutive rows of a large-ish data.table (~4m rows)? I've outlined my current approach below, but it is very slow. My actual data has up to a few hundred columns. I need to calculate lags and leads for future use, so I create these and use them to calculate distances.
library(data.table)
library(proxy)
set_shift_col <- function(df, shift_dir, shift_num, data_cols, byvars = NULL) {
  df[, (paste0(data_cols, "_", shift_dir, shift_num)) := shift(.SD, shift_num, fill = NA, type = shift_dir), byvars, .SDcols = data_cols]
}
set_shift_dist <- function(dt, shift_dir, shift_num, data_cols) {
  stopifnot(shift_dir %in% c("lag", "lead"))
  shift_str <- paste0(shift_dir, shift_num)
  dt[, (paste0("dist_", shift_str)) := as.numeric(
    proxy::dist(
      rbindlist(list(
        .SD[, data_cols, with = FALSE],
        .SD[, paste0(data_cols, "_", shift_str), with = FALSE]
      ), use.names = FALSE),
      method = "cosine")
  ), 1:nrow(dt)]
}
n <- 10000
test_data <- data.table(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
cols <- c("a", "b", "c", "d")
set_shift_col(test_data, "lag", 1, cols)
set_shift_col(test_data, "lag", 2, cols)
set_shift_col(test_data, "lead", 1, cols)
set_shift_col(test_data, "lead", 2, cols)
set_shift_dist(test_data, "lag", 1, cols)
I'm sure this is a very inefficient approach, any suggestions would be appreciated!
You aren't using the vectorisation available in proxy::dist: rather than calling it once per row, you can get all the distances you need from a single call.
Try this replacement function and compare the speed:
set_shift_dist2 <- function(dt, shift_dir, shift_num, data_cols) {
  stopifnot(shift_dir %in% c("lag", "lead"))
  shift_str <- paste0(shift_dir, shift_num)
  dt[, (paste0("dist2_", shift_str)) := proxy::dist(
    .SD[, data_cols, with = FALSE],
    .SD[, paste0(data_cols, "_", shift_str), with = FALSE],
    method = "cosine",
    pairwise = TRUE
  )]
}
You could also do it in one go, without storing shifted copies of the data in the table:
test_data[, dist_lag1 := proxy::dist(
  .SD,
  as.data.table(shift(.SD, 1)),
  pairwise = TRUE,
  method = 'cosine'
), .SDcols = c('a', 'b', 'c', 'd')]
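As a rough comparison (a sketch; timings vary by machine), the vectorised version should agree with the row-wise original and beat it by orders of magnitude:
system.time(set_shift_dist(test_data, "lag", 1, cols))   # one dist call per row
system.time(set_shift_dist2(test_data, "lag", 1, cols))  # a single vectorised call
all.equal(test_data$dist_lag1, test_data$dist2_lag1)     # should be TRUE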
I want to apply a function to a portion of a table.
With data.frame, no problem:
df <- data.frame(name = paste("a", 1:10, sep = "-"),
                 x = 1:10,
                 y = rep(1:5),
                 z = rep(1:2, each = 5))
df[2:5, -1] <- scale(df[2:5, -1], center = c(1,2,3), scale = c(4,5,6))
But data.table complains:
dt <- data.table(name = paste("a", 1:10, sep = "-"),
                 x = 1:10,
                 y = rep(1:5),
                 z = rep(1:2, each = 5))
dt[2:5, -1] <- scale(dt[2:5, -1], center = c(1,2,3), scale = c(4,5,6))
Error in `[<-.data.table`(`*tmp*`, 2:5, -1, value = c(0.25, 0.5, 0.75, :
  Item 1 of column numbers in j is -1 which is outside range [1,ncol=4]. Use column names
  instead in j to add new columns.
What is the correct way in data.table? Thanks!
data.table needs a little more work to apply scale:
library(data.table)
cols <- names(dt)[-1]
# coerce the integer columns to double first, so := can assign scaled values
dt[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
dt[2:5, (cols) := Map(scale, .SD, c(1,2,3), c(4,5,6)), .SDcols = cols]
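A quick check (a sketch) that this matches the data.frame result from the question, comparing the numeric columns:
all.equal(as.data.frame(dt[, -1]), df[, -1], check.attributes = FALSE)  # should be TRUE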
How can I avoid the for loop in the following code to speed up the computation? (The real data is about 1e6 times larger.)
id = rep(1:5, 20)
v = 1:100
df = data.frame(groupid = id, value = v)
df = dplyr::arrange(df, groupid)
bkt = rep(seq(0, 100, length.out = 4), 5)
id = rep(1:5, each = 4)
bktpts = data.frame(groupid = id, value = bkt)
for (i in 1:5) {
  df[df$groupid == i, "bin"] = cut(df[df$groupid == i, "value"],
                                   bktpts[bktpts$groupid == i, "value"],
                                   include.lowest = TRUE, labels = F)
}
I'm not sure why your bktpts is formatted the way it is, but here is a data.table solution that should be (at least a bit) faster than your for-loop.
library(data.table)
setDT(df)[setDT(bktpts)[, `:=`(id = seq_len(.N),
                               value_next = shift(value, type = "lead", fill = 99999999)),
                        by = .(groupid)],
          bin := i.id,
          on = .(groupid, value >= value, value < value_next)][]
Another way:
library(data.table)
setDT(df); setDT(bktpts)
bktpts[, b := rowid(groupid) - 1L]  # 0-based boundary index within each group
# rolling join: roll each value backward to the next boundary at or above it
df[, b := bktpts[copy(.SD), on = .(groupid, value), roll = -Inf, x.b]]
# check result
df[, any(b != bin)]
# [1] FALSE
See ?data.table for how rolling joins work.
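For intuition, here is a tiny standalone sketch of roll = -Inf: each value in i is matched to the next boundary in x at or above it, and x.b pulls the matching index:
lookup <- data.table(grp = "a", value = c(0, 50, 100), b = 0:2)
pts <- data.table(grp = "a", value = c(10, 60, 100))
lookup[pts, on = .(grp, value), roll = -Inf, x.b]
# [1] 1 2 2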
I came up with another data.table answer:
library(data.table) # load package
# set to data.table
setDT(df)
setDT(bktpts)
# Make a join
df[bktpts[, list(.(value)), by = groupid], bks := V1, on = "groupid"]
# define the bins:
df[, bin := cut(value, bks[[1]], include.lowest = TRUE, labels = FALSE), by = groupid]
# remove the unneeded bks column
df[, bks := NULL]
Explaining the code:
bktpts[, list(.(value)), by = groupid] is a new table that has, in a list, all the values of value for each groupid. If you run it alone, you'll see where we're going.
bks := V1 assigns to variable bks in df whatever exists in V1, which is the name of the list column in the previous table. Of course on = "groupid" is the variable on which we make the join.
The code defining the bins needs little explanation, except for the bks[[1]] bit. It needs [[ in order to access the list values and produce a vector, as required by the cut function.
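A tiny illustration of the [ versus [[ difference (a sketch):
lst <- list(c(0, 33, 67, 100))
lst[1]    # a list of length 1 -- cut() would reject this
lst[[1]]  # the numeric vector itself, which cut() accepts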
EDIT TO ADD:
All the data.table commands can be chained into a single (rather unintelligible) call:
df[bktpts[, list(.(value)), by = groupid],
   bks := V1,
   on = "groupid"
][, bin := cut(value, bks[[1]], include.lowest = TRUE, labels = FALSE),
  by = groupid
][, bks := NULL]
I'm starting with the table dt below and want to subset its columns by the list keys:
library(data.table)
set.seed(123)
randomchar <- function(n, w) {
  chararray <- replicate(w, sample(c(letters, LETTERS), n, replace = TRUE))
  apply(chararray, 1, paste0, collapse = "")
}
dt <- data.table(x = randomchar(1000, 3),
                 y = randomchar(1000, 3),
                 z = randomchar(1000, 3),
                 key = c("x", "y", "z"))
keys <- with(dt, list(x = sample(x, 501),
                      y = sample(y, 500),
                      z = sample(z, 721)))
I can get the result I want by using a loop:
desired <- copy(dt)
for (i in seq_along(keys)) {
  keyname <- names(keys)[i]
  desired <- desired[get(keyname) %in% keys[[i]]]
}
desired
The question is: is there a more data.table-idiomatic way to do this subset?
I tried using CJ: dt[CJ(keys)], but it takes a very long time.
What about building a mask and filtering dt on it:
dt[Reduce(`&`, Map(function(key, col) col %in% key, keys, dt)),]
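A quick check (a sketch) that this matches the loop result from the question:
res <- dt[Reduce(`&`, Map(function(key, col) col %in% key, keys, dt))]
all.equal(res, desired)  # should be TRUE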