Related
Consider the following code snippet:
library(data.table)
library(nanotime)
library(testthat)  # for expect_equal

foo <- function(dt, num) {
  expect_equal(class(num), "numeric")
  col <- paste("b", num, sep = "_")
  col2 <- paste("b", num + 1, sep = "_")
  condition <- dt$a > 0

  # approach 1: in-place := on the subset, columns fetched with get()
  st <- nanotime(Sys.time())
  dt[condition, a := a - get(col)]
  dt[condition, a := a - get(col2)]
  et <- nanotime(Sys.time())
  diff <- (et - st) / 1e9
  message(diff)

  # approach 2: compute outside the table with dt[[col]], then assign the subset
  st <- nanotime(Sys.time())
  tmp <- dt$a - dt[[col]]
  tmp <- tmp - dt[[col2]]
  dt[condition, a := tmp[condition]]
  et <- nanotime(Sys.time())
  diff <- (et - st) / 1e9
  message(diff)

  # approach 3: full-table computation into a temporary column, then assign the subset
  st <- nanotime(Sys.time())
  dt[, tmp := a - get(col)]
  dt[, tmp := a - get(col2)]
  dt[condition, a := tmp]
  et <- nanotime(Sys.time())
  diff <- (et - st) / 1e9
  message(diff)
}
dt <- data.table(c = 0, d = 0, e = 0, f = 0, g = 0, h = 0, i = 0, a = -15000:15000, b_1 = 1L, b_2 = 1L)
foo(dt, 1)
Output
0.002342
0.001131
0.002389
Queries
1. Is get(col) slower than dt[[col]]?
2. Is vectorization, however complex the computation, better performed over the entire data rather than over a subset?
3. Should a series of computations be performed outside the data.table and assigned to a column at the end, rather than calculated in place?
If I take all of the internals and run them with microbenchmark, it will execute each expression (in random order) and report useful statistics. I'll pre-compute col, col2, and condition first:
microbenchmark::microbenchmark(
  a = {
    dt[condition, a := a - get(col)]
    dt[condition, a := a - get(col2)]
  },
  b = {
    tmp <- dt$a - dt[[col]]
    tmp <- tmp - dt[[col2]]
    dt[condition, a := tmp[condition]]
  },
  b2 = {
    tmp <- dt$a - dt[[col]]
    tmp <- tmp - dt[[col2]]
    set(dt, i = which(condition), j = "a", value = tmp[condition])
  },
  c = {
    dt[, tmp := a - get(col)]
    dt[, tmp := a - get(col2)]
    dt[condition, a := tmp]
  },
  times = 1000
)
# Unit: microseconds
# expr min lq mean median uq max neval
# a 2871.501 2965.951 3429.7118 3058.701 3640.551 9190.800 1000
# b 660.601 679.701 788.5797 696.451 805.801 3675.201 1000
# b2 166.001 176.801 251.9144 180.201 187.501 39527.302 1000
# c 1391.001 1502.901 1633.2692 1530.150 1664.101 3638.701 1000
It appears there is a clear winner among your original set of candidates: dt[[col]] outshines them all. Edit: however, as @jangorecki (a significant contributor to the data.table source) commented, data.table::set is even faster.
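To isolate question 1 (get(col) versus dt[[col]]) from the rest of the work, a minimal sketch along these lines could be timed on its own (a toy table and column name are assumed here, not the original dt):
library(data.table)
library(microbenchmark)
dt_small <- data.table(a = -15000:15000, b_1 = 1L)
col <- "b_1"
microbenchmark(
  get_in_frame   = dt_small[, a - get(col)],      # get() is resolved inside [.data.table
  double_bracket = dt_small$a - dt_small[[col]],  # plain vector arithmetic outside the table
  times = 100
)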
Questions 2 and 3 aren't really tested well here; the answer depends on the amount of subsetting and on how "expensive" the calculations are. In this case the calculations are rather trivial, so I would not expect much difference.
You're always balancing readability and maintainability against speed and efficiency. In some of my speed-sensitive work (vectors of length 2-4M), I tend to do everything in a raw vector, but that decision involves several factors, not just data.table. Once you start running into significant copies of data (R does that a lot) or need data.table's big strengths (grouping, non-equi joins, etc.), doing it in-table becomes much faster and, more importantly for me, more maintainable and readable.
I have the following data.table:
DT <- data.table(A = c(rep("aa",2),rep("bb",2)),
B = c(rep("H",2),rep("Na",2)),
Low = c(0,3,1,1),
High = c(8,10,9,8),
Time =c("0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10"),
Intensity = c("0,0,0,0,561464,0,0,0,0,0,0","0,0,0,6548,5464,5616,0,0,0,68716,0","5658,12,6548,6541,8,5646854,54565,56465,546,65,0","0,561464,0,0,0,0,0,0,0,0,0")
)
and use this code to extract the highest number of consecutive Intensity values above a certain value. For a more detailed explanation of how this calculation works, please see Reading and counting of consecutive points:
newCols <- do.call(rbind, Map(function(u, v, x, y) {
  u1 <- as.numeric(u)
  v1 <- as.numeric(v)
  lb <- which.min(abs(x - u1))
  ub <- which.min(abs(y - u1))
  v3 <- as.numeric(v[(lb+1):(ub-1)])
  i3 = with(rle(v3 > min(as.numeric(v[c(lb, ub)]))),
            pmax(max(lengths[values]), 0))
  data.frame(Consec.Points.base = i3)
},
strsplit(DT$Time, ","), strsplit(DT$Intensity, ","), DT$Low, DT$High))
DT <- cbind(DT, newCols)
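The core of that calculation is rle() on a logical vector; a tiny standalone illustration of the counting idea (toy numbers, not the question's data):
v <- c(0, 5, 7, 0, 3, 4, 6, 0)
r <- rle(v > 2)               # runs of TRUE/FALSE
max(r$lengths[r$values])      # longest run above the threshold: 3 (the run 3, 4, 6)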
I was wondering how, instead of getting the length of Consec.Points.base, it would be possible to extract the actual points (Time and Intensity) as two vectors?
Thanks a lot in advance!
I think this answers your question, but let me know if I made a mistake, or something needs more thought/clarification.
DT <- data.table(A = c(rep("aa",2),rep("bb",2)),
B = c(rep("H",2),rep("Na",2)),
Low = c(0,3,1,1),
High = c(8,10,9,8),
Time =c("0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10"),
Intensity = c("0,0,0,0,561464,0,0,0,0,0,0","0,0,0,6548,5464,5616,0,0,0,68716,0","5658,12,6548,6541,8,5646854,54565,56465,546,65,0","0,561464,0,0,0,0,0,0,0,0,0")
)
# unique identifier
DT[, i := .I]
# re-structure
DT2 <- DT[, .(Time = as.numeric(strsplit(Time, ",")[[1]]),
Intensity = as.numeric(strsplit(Intensity, ",")[[1]])), by = i]
DT2 <- merge(DT2, DT[, .(i,A,B,Low,High)], by="i")
DT2 <- DT2[between(Time, Low, High, incbounds = FALSE),]
DT2[, IntensityGood := Intensity != min(Intensity), by=i]
# encode each part of sequence with its own value, if not FALSE
encoder <- function(x){
  rle.response <- rle(x)
  v2 <- rep(0, length(rle.response$values))
  v2[rle.response$values != FALSE] <- which(rle.response$values != FALSE)
  rep(v2, rle.response$lengths)
}
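# for example, encoder(c(TRUE, TRUE, FALSE, TRUE)) returns c(1, 1, 0, 3):
# each run of TRUE values is labelled with its run index, FALSE runs stay 0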
DT2[, encodeI := encoder(IntensityGood), by = i]
# remove groups which are all 0 - these are easily handled separately
DT3 <- DT2[, test := all(encodeI==0), by=i][test==FALSE,][, test:=NULL]
# get count - can infer missing are 0
count <- DT3[encodeI!=0, .(max(table(encodeI))), by = i]
# get sequence
findMaxDt <- DT3[encodeI != 0, .N, by=.(i, encodeI)]
DT3 <- merge(DT3, findMaxDt, by=c("i", "encodeI"))
DT3 <- DT3[, Best := N==max(N), by=i]
DT3[Best==TRUE, .(list(Intensity)), by=i]
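Since both vectors were asked for, the same pattern should also return Time; a hedged sketch using the same DT3:
DT3[Best == TRUE, .(Time = list(Time), Intensity = list(Intensity)), by = i]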
I've been reading about how to improve code in R, taking a look at some of the answers here and also reading a bit of The R Inferno. Now I have this problem, and the loop I created seems to be taking forever (15 hours and counting).
k <- NROW(unique(df$EndStation.Id))
l <- NROW(unique(df$StartStation.Id))
m1 <- as.matrix(df[,c("Duration","StartStation.Id","EndStation.Id")])
g <- function(m){
  for (i in 1:l){
    for (j in 1:k){
      duration <- m[(m[,2]==i & m[,3]==j),1]
      if (NROW(duration)<=1) {
        m[(m[,2]==i & m[,3]==j),1] <- NA
        next
      }
      duration <- duration/median(duration)
      m[(m[,2]==i & m[,3]==j),1] <- duration
    }
  }
  return(m)
}
answer <- g(m1)
The number of stations (Start and End) is 750 for both, and the duration vector size can vary a lot, from 1 or 2 up to 80. Is this loop improvable, or should I give up and try to get access to a faster computer?
Best regards,
Fernando
The code is a bit hard to read, but I think this is what you want to do:
library(data.table)
## generate a data table
dt <- setDT(df[,c("Duration","StartStation.Id","EndStation.Id")])
## calculate the duration
dt[, Duration := Duration / median(Duration), by = .(StartStation.Id, EndStation.Id)]
## replace the result with NA when the vector length == 1
dt[, N := .N, by = .(StartStation.Id, EndStation.Id)][
N == 1, Duration := NA
][, N := NULL]
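A quick way to sanity-check this on a small made-up df (column names assumed to match the question's):
library(data.table)
set.seed(1)
df <- data.frame(StartStation.Id = sample(1:3, 20, replace = TRUE),
                 EndStation.Id   = sample(1:3, 20, replace = TRUE),
                 Duration        = runif(20, 1, 100))
dt <- setDT(df[, c("Duration", "StartStation.Id", "EndStation.Id")])
dt[, Duration := Duration / median(Duration), by = .(StartStation.Id, EndStation.Id)]
dt[, N := .N, by = .(StartStation.Id, EndStation.Id)][N == 1, Duration := NA][, N := NULL]
dt[order(StartStation.Id, EndStation.Id)]  # inspect: NA where a pair occurs only once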
If I understand your function correctly, you want to divide the duration between two stations by its median duration and, if there is only one entry for the pair of stations, set it to NA.
Here is a base solution (it's a bit clunky, I haven't finished my first cup of coffee):
##Some sample data
df <- data.frame(StartStation.Id=sample(LETTERS[1:10], 100, replace =T),
EndStation.Id=sample(LETTERS[11:20], 100, replace =T),
Duration=runif(100, 0.1,100))
res <- tapply(df$Duration, paste0(df$StartStation.Id, df$EndStation.Id), function(x) x/median(x))
res <- data.frame(StartStation.Id=sapply(strsplit(rep(names(res), sapply(res, length)), ""), "[", 1),
EndStation.Id=sapply(strsplit(rep(names(res), sapply(res, length)), ""), "[", 2),
durn=unlist(res))
res[res$durn==1,] <- NA
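One caveat with the last line: durn == 1 also matches any observation that happens to equal its group median exactly, not only single-entry pairs. A hedged base-R alternative keyed on group size instead (same sample df), which also keeps the result aligned with the rows of df, could be:
res2 <- ave(df$Duration,
            paste(df$StartStation.Id, df$EndStation.Id),
            FUN = function(x) if (length(x) == 1) NA else x / median(x))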
Using a database with a numeric range defined by two columns, start and end, I am trying to look up the factor code associated with each numeric value in a separate vector, identityCodes.
database <- data.frame(start = seq(1, 150000000, 1000),
end = seq(1000, 150000000, 1000),
code = paste0(sample(LETTERS, 15000, replace = TRUE),
sample(LETTERS, 15000, replace = TRUE)))
identityCodes <- sample(1:15000000, 1000)
I've come up with a method for finding the corresponding codes using a for loop and subsetting:
fun <- function (x, y) {
  z <- rep(NA, length(x))
  for (i in 1:length(x)){
    z[i] <- as.character(y[y["start"] <= x[i] & y["end"] >= x[i], "code"])
  }
  return(z)
}
a <- fun(identityCodes, database)
But the method is slow, especially if I am to scale it:
system.time(fun(identityCodes, database))
user system elapsed
15.36 0.00 15.50
How can I identify the factors associated with each identityCodes faster? Is there a better way to go about this than using a for loop and subsetting?
Here's my attempt using data.table. Very fast - even though I am sure I am not leveraging it efficiently.
Given function:
# method 1
system.time(result1 <- fun(identityCodes, database))
user system elapsed
8.99 0.00 8.98
Using data.table
# method 2
require(data.table)
# x: a data.frame with columns start, end, code
# y: a vector with lookup codes
dt_comb <- function(x, y) {
  # convert x to a data.table and set 'start' and 'end' as keys
  DT <- setDT(x)
  setkey(DT, start, end)
  # create a lookup data.table where start and end are the identityCodes
  DT2 <- data.table(start = y, end = y)
  # overlap join where DT2 start & end are within DT start and end
  res <- foverlaps(DT2, DT[, .(start, end)], type = "within")
  # store i as row number and key (for sorting later)
  res[, i := seq_len(nrow(res))]
  setkey(res, i)
  # merge the joined table to the original to get codes
  final <- merge(res, DT, by = c("start", "end"))[order(i), .(code)]
  # export as character the codes
  as.character(final[[1]])
}
system.time(result2 <- dt_comb(x=database, y=identityCodes))
user system elapsed
0.08 0.00 0.08
identical(result1, result2)
[1] TRUE
Edit: trimmed a couple of lines from the function.
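For readers unfamiliar with foverlaps(), a tiny standalone illustration of a "within" overlap join (toy tables, not the question's data):
library(data.table)
ranges <- data.table(start = c(1, 11, 21), end = c(10, 20, 30), code = c("A", "B", "C"))
setkey(ranges, start, end)
points <- data.table(start = c(5, 25), end = c(5, 25))  # zero-width intervals for point lookups
foverlaps(points, ranges, type = "within")              # each point matches its enclosing range ("A" and "C")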
This is about 45% faster on my machine:
result = lapply(identityCodes, function(x) {
data.frame(identityCode=x,
code=database[database$start <= x & database$end >= x, "code"])
})
result = do.call(rbind, result)
Here's a sample of the output:
identityCode code
1 6836845 OK
2 14100352 RB
3 2313115 NK
4 8440671 XN
5 11349271 TI
6 14467193 VL
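If only the code vector is needed (without assembling a data.frame per element), a hedged variant along these lines may be simpler (same database and identityCodes as above):
codes <- vapply(identityCodes, function(x) {
  as.character(database$code[database$start <= x & database$end >= x])
}, character(1))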
I'm looking for a faster solution to the problem below. I'll illustrate the problem with a small example and then provide code to simulate large data, as that's the point of this question. My actual problem size is a list of length 1 million.
Say, I've two lists as shown below:
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1,2,3,55,90), c(37,38,95))
Properties of x and y:
Each element of the list x always sums up to 100.
Each element of y will always be sorted and will always be between 1 and 100.
The problem:
Now, what I'd like is this. Taking x[[1]] and y[[1]], I'd like to find the count of numbers in y[[1]] that are 1) <= 82 and 2) > 82 and <= 100. That would be c(4, 1), because the numbers <= 82 are c(1,2,3,55) and the only number between 83 and 100 is 90. Similarly for x[[2]] and y[[2]], the answer is c(0, 2, 1). That is, the answer should be:
[[1]]
[1] 4 1
[[2]]
[1] 0 2 1
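For a single pair, the expected counts are just a tabulation of y[[i]] over the cumulative breakpoints of x[[i]], which gives a quick correctness check (not a fast solution):
as.vector(table(cut(y[[1]], breaks = c(0, cumsum(x[[1]])))))  # 4 1
as.vector(table(cut(y[[2]], breaks = c(0, cumsum(x[[2]])))))  # 0 2 1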
Let me know if this is still unclear.
Simulated data with 1 million entries
set.seed(1)
N <- 100
n <- 1e6
len <- sample(2:3, n, TRUE)
x <- lapply(seq_len(n), function(ix) {
  probs <- sample(100:1000, len[ix])
  probs <- probs/sum(probs)
  oo <- round(N * probs)
  if (sum(oo) != 100) {
    oo[1] <- oo[1] + (100 - sum(oo))
  }
  oo
})
require(data.table)
ss <- sample(1:10, n, TRUE)
dt <- data.table(val=sample(1:N, sum(ss), TRUE), grp=rep(seq_len(n), ss))
setkey(dt, grp, val)
y <- dt[, list(list(val)),by=grp]$V1
What I've done so far:
Using mapply (slow):
I thought of using rank with ties.method = "first" and mapply (the obvious choice with two lists), and tried this:
tt1 <- mapply(y, x, FUN=function(a,b) {
tt <- rank(c(a, cumsum(b)), ties="first")[-(1:length(a))]; c(tt[1]-1, diff(tt)-1)
})
Although this works just fine, it takes a lot of time on 1M entries. I think the overhead of computing rank and diff that many times adds to it. This takes 241 seconds!
Therefore, I decided to try and overcome the usage of rank and diff by using data.table and sorting with a "group" column. I came up with a longer but much faster solution shown below:
Using data.table (faster):
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
tt2 <-rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
setkey(tt2, grp, val)
xdt.pos <- which(tt2$type == "x")
tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
tt2[, val := xdt$val]
# time consuming step
tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
tt2 <- tt2[, list(list(V1)),by=grp]$V1
This takes 26 seconds, so it's about 9 times faster. I'm wondering if it's possible to get much more of a speedup, as I'll have to compute this recursively on 5-10 such lists of 1 million elements each. Thank you.
Here's another data.table approach. Edit: I added a (dirty?) hack that speeds this up and makes it ~2x faster than the OP's data.table solution.
# compile the data.table's, set appropriate keys
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl))
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl))
# hack #0, set key but prevent sorting, since we know data is already sorted
setattr(ydt, 'sorted', c('grp', 'val'))
# by setting the key in y to val and in x to cumval we can
# leverage the rolling joins
setattr(xdt, 'sorted', c('grp', 'cumval')) # hack #1 set key, but prevent sorting
vals = xdt[, cumval.copy := cumval][ydt, roll = -Inf]
# hack #2, same deal as above
# we know that the order of cumval and cumval.copy is the same
# so let's convince data.table in that
setattr(vals, 'sorted', c('grp', 'cumval.copy'))
# compute the counts and fill in the missing 0's
# for when there is no y in the appropriate x interval
tt2 = vals[, .N, keyby = list(grp, cumval.copy)][xdt][is.na(N), N := 0L]
# convert to list
tt2 = tt2[order(grp, cumval.copy), list(list(N)), by = grp]$V1
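To see what the roll = -Inf join above does, here is a tiny standalone illustration (hypothetical values, a single group):
library(data.table)
X <- data.table(grp = 1L, cumval = c(82, 100), bucket = 1:2)
Y <- data.table(grp = 1L, val = c(1, 55, 90))
setkey(X, grp, cumval)
setkey(Y, grp, val)
X[Y, roll = -Inf]  # each val is matched to the first boundary >= it: buckets 1, 1, 2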
This is about 25% faster but outputs a matrix rather than a list. You may be able to use apply/sapply to make it work with a list (saving as a list was slowing it down); a rough sketch of that conversion follows the code.
c <- matrix(0, length(x), 100)
for (j in 1:length(x)) {
  a <- -1
  b <- 0
  for (i in 1:length(x[[j]])) {
    a <- b
    b <- b + x[[j]][i]
    c[j, i] <- sum((a <= y[[j]]) * (y[[j]] <= b))
  }
}
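As a follow-up to the apply/sapply remark above, one hedged way to reshape the matrix back into a list shaped like x (same c and x as in the code above):
res_list <- lapply(seq_along(x), function(j) c[j, seq_along(x[[j]])])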