Efficiently counting numbers falling within each range of numbers - r

I'm looking for a faster solution to the problem below. I'll illustrate it with a small example and then provide code to simulate data at scale, since that's the point of this question. My actual problem size is a list of length 1 million.
Say I have two lists as shown below:
x <- list(c(82, 18), c(35, 50, 15))
y <- list(c(1,2,3,55,90), c(37,38,95))
Properties of x and y:
Each element of the list x always sums up to 100.
Each element of y will always be sorted and will be always between 1 and 100.
The problem:
Now, what I'd like is this. Taking x[[1]] and y[[1]], I'd like to find the count of numbers in y[[1]] that are 1) <= 82 and 2) > 82 and <= 100. That would be c(4, 1), because the numbers <= 82 are c(1,2,3,55) and the only number between 83 and 100 is 90. Similarly for x[[2]] and y[[2]], the breaks are the cumulative sums c(35, 85, 100), giving c(0, 2, 1). That is, the answer should be:
[[1]]
[1] 4 1
[[2]]
[1] 0 2 1
Let me know if this is still unclear.
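For reference, here is a minimal sketch (a hypothetical helper, not one of the timed approaches below) showing how the expected counts for a single pair can be computed with findInterval, by binning each y value at the cumulative break points of x:
count_one <- function(xv, yv) {
  # bin 1 holds y <= cumsum(xv)[1], bin 2 the next interval, and so on
  bins <- findInterval(yv, cumsum(xv), left.open = TRUE) + 1L
  tabulate(bins, nbins = length(xv))
}
count_one(c(82, 18), c(1, 2, 3, 55, 90))  # 4 1
count_one(c(35, 50, 15), c(37, 38, 95))   # 0 2 1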
Simulated data with 1 million entries
set.seed(1)
N <- 100
n <- 1e6
len <- sample(2:3, n, TRUE)
x <- lapply(seq_len(n), function(ix) {
  probs <- sample(100:1000, len[ix])
  probs <- probs/sum(probs)
  oo <- round(N * probs)
  if (sum(oo) != 100) {
    oo[1] <- oo[1] + (100 - sum(oo))
  }
  oo
})
require(data.table)
ss <- sample(1:10, n, TRUE)
dt <- data.table(val=sample(1:N, sum(ss), TRUE), grp=rep(seq_len(n), ss))
setkey(dt, grp, val)
y <- dt[, list(list(val)),by=grp]$V1
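Optionally, two quick sanity checks confirm that the properties stated above hold for the simulated data:
stopifnot(all(vapply(x, sum, numeric(1)) == 100))    # each element of x sums to 100
stopifnot(!any(vapply(y, is.unsorted, logical(1))))  # each element of y is sorted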
What I've done so far:
Using mapply (slow):
I thought of using rank with ties.method="first" together with mapply (the obvious choice with two lists) and tried this:
tt1 <- mapply(y, x, FUN=function(a, b) {
  tt <- rank(c(a, cumsum(b)), ties.method="first")[-(1:length(a))]
  c(tt[1]-1, diff(tt)-1)
})
Although this works just fine, it takes a lot of time on 1M entries. I think the overhead of computing rank and diff that many times adds to it. This takes 241 seconds!
Therefore, I decided to try and overcome the usage of rank and diff by using data.table and sorting with a "group" column. I came up with a longer but much faster solution shown below:
Using data.table (faster):
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl), type = "x")
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl), type = "y")
tt2 <-rbindlist(list(ydt, xdt[, list(cumval, grp, type)]))
setkey(tt2, grp, val)
xdt.pos <- which(tt2$type == "x")
tt2[, type.x := 0L][xdt.pos, type.x := xdt.pos]
tt2 <- tt2[xdt.pos][tt2[, .N, by = grp][, N := cumsum(c(0, head(N, -1)))]][, sub := type.x - N]
tt2[, val := xdt$val]
# time consuming step
tt2 <- tt2[, c(sub[1]-1, sub[2:.N] - sub[1:(.N-1)] - 1), by = grp]
tt2 <- tt2[, list(list(V1)),by=grp]$V1
This takes 26 seconds. So it's about 9 times faster. I'm wondering if it's possible to get much more speedup as I'll have to recursively compute this on 5-10 such 1 million elements. Thank you.
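(Optional) A quick way to check that the two approaches agree; the element types can differ between them, so a numeric comparison is safer than identical():
all(mapply(function(a, b) isTRUE(all.equal(as.numeric(a), as.numeric(b))), tt1, tt2))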

Here's another data.table approach. Edit: I added a (dirty?) hack that speeds this up and makes it ~2x faster than the OP's data.table solution.
# compile the data.table's, set appropriate keys
xl <- sapply(x, length)
yl <- sapply(y, length)
xdt <- data.table(val=unlist(x, use.names=FALSE), grp=rep(seq_along(xl), xl))
xdt[, cumval := cumsum(val), by=grp]
ydt <- data.table(val=unlist(y, use.names=FALSE), grp=rep(seq_along(yl), yl))
# hack #0, set key but prevent sorting, since we know data is already sorted
setattr(ydt, 'sorted', c('grp', 'val'))
# by setting the key in y to val and in x to cumval we can
# leverage the rolling joins
setattr(xdt, 'sorted', c('grp', 'cumval')) # hack #1 set key, but prevent sorting
vals = xdt[, cumval.copy := cumval][ydt, roll = -Inf]
# hack #2, same deal as above
# we know that the order of cumval and cumval.copy is the same
# so let's convince data.table in that
setattr(vals, 'sorted', c('grp', 'cumval.copy'))
# compute the counts and fill in the missing 0's
# for when there is no y in the appropriate x interval
tt2 = vals[, .N, keyby = list(grp, cumval.copy)][xdt][is.na(N), N := 0L]
# convert to list
tt2 = tt2[order(grp, cumval.copy), list(list(N)), by = grp]$V1
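To make the roll = -Inf step more concrete, here is a tiny sketch on the first pair from the question (hypothetical toy tables; setkey is used for simplicity instead of the sorted-attribute hack):
library(data.table)
# x[[1]] = c(82, 18) gives cumulative breaks c(82, 100); y[[1]] = c(1, 2, 3, 55, 90)
dtx <- data.table(grp = 1L, cumval = c(82L, 100L))
dty <- data.table(grp = 1L, val = c(1L, 2L, 3L, 55L, 90L))
setkey(dtx, grp, cumval)
setkey(dty, grp, val)
# roll = -Inf matches each y value to the next cumval at or above it:
# 1, 2, 3, 55 -> 82 and 90 -> 100; tallying matches per cumval then gives c(4, 1)
dtx[, cumval.copy := cumval][dty, roll = -Inf]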

This is about 25% faster but outputs a matrix rather than a list. You may be able to use apply/sapply to make it work with a list (saving as a list was slowing it down); see the sketch after the loop.
cnt <- matrix(0, length(x), 100)
for (j in seq_along(x)) {
  a <- 0
  b <- 0
  for (i in seq_along(x[[j]])) {
    a <- b
    b <- b + x[[j]][i]
    # count y values strictly above the previous break and at or below this one
    cnt[j, i] <- sum(a < y[[j]] & y[[j]] <= b)
  }
}
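If the list form is needed after all, one possible post-processing step (a sketch of the apply/sapply idea mentioned above) is to pull the relevant leading columns back out of each row:
res_list <- lapply(seq_along(x), function(j) cnt[j, seq_along(x[[j]])])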

Related

Using parLapply instead of lapply makes memory usage explode

I recently started analysing a much bigger dataset than I am used to and realised my code was not efficient enough in terms of speed.
In order to make parts of my script faster, I decided to parallelise some of the lapply calls I make.
This is my original call; it works but it is very slow:
list_vect <- lapply(X = df1$start, function(x){
vect_of_num <- which(df2$start <= x + 500 & df2$start >= x - 500)
})
My first attempt at going parallel was like this:
cl <- makeCluster(detectCores() -2) # 6 out of 8 cores
list_vect <- parLapply(cl, X = df1$start, function(x){
vect_of_num <- which(df2$start <= x + 500 & df2$start >= x - 500)
})
This produces an error telling me df2 doesn't exist.
Following advice, I created the function outside the call:
get_list_vect <- function(var1, var2){
vect_of_num <- which(var2$start <= var1 + 500 & var2$start >= var1 - 500)
}
cl <- makeCluster(detectCores() -2) # 6 out of 8 cores
list_vect <- parLapply(cl = cl, df1$start, df2$start, get_list_vect)
This last piece of code does run, but I feel I did something wrong. When using lapply, I can see on the monitoring screen that memory usage is steady (around 8 GB). However, when calling parLapply I see memory usage increasing until it reaches the maximum of 32 GB, and then the system freezes.
I hope you guys will see where I am wrong. Feel free to suggest a better approach.
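For reference, the "df2 doesn't exist" error in the first parallel attempt comes from df2 not being available on the worker processes; the usual fix is to export it with clusterExport, as in the hedged sketch below. Note that this copies df2 to every worker, which is exactly the kind of duplication that drives memory usage up, so the non-parallel data.table answer that follows sidesteps the problem entirely.
library(parallel)
cl <- makeCluster(detectCores() - 2)
# make df2 visible on each worker (this duplicates it once per worker)
clusterExport(cl, varlist = "df2")
list_vect <- parLapply(cl, df1$start, function(x) {
  which(df2$start <= x + 500 & df2$start >= x - 500)
})
stopCluster(cl)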
A data.table approach using non-equi joins:
require(data.table)
# create data.tables:
dt1 <- data.table(start = df1$start)
dt2 <- data.table(start = df2$start)
# calculate beginning and end values
dt2[, beg := start - 500L]
dt2[, end := start + 500L]
# add indexes
dt1[, i1 := .I]
dt2[, i2 := .I]
# join:
x <- dt2[dt1, on = .(end >= start, beg <= start), allow.cartesian = T]
x[, .(.(i2)), keyby = i1]
# i1 V1
# 1: 1 788,1148,2511,3372,5365,8315,...
# 2: 2 2289,3551,4499,4678,4918,5008,...
# 3: 3 2319,3459,3834,5013,6537,9714
r <- x[, .(.(i2)), keyby = i1][[2]] # transform to list
all.equal(list_vect, r)
# TRUE

Apply function over rows of data.table keeping the first two columns and results as a data.table

Apologies for another 'apply to rows of data.table' question; however, I have not found a solution in any of the other answers.
I have a data.table with >2 million rows and >5000 columns. I would like to keep the first two columns and collapse the remaining columns by some summaries.
Example: Translate this ...
keep1 keep2 c d
1: a A 568.62060 599.4427
2: b B 815.63027 728.9226
To this ...
keep1 keep2 mean median
1: a A 584.0316 584.0316
2: b B 772.2765 772.2765
Currently my solution keeps all of the original columns and is not as fast as I had hoped.
library(data.table)
x = data.table(keep1=letters[1:5], keep2=LETTERS[1:5], c=runif(5, 1, 1000), d=runif(5, 1, 1000))
stats = function(x) list(mean(x), median(x))
x[,c("mean", "median") := get_stats(unlist(.SD)),
by = seq_len(nrow(x)), .SDcols = 3:ncol(x)]
I have two questions:
Is there a way to prevent outputting all columns and instead only output column1, column2 and the summaries I have made, as in the example?
Is there a quicker way to do this?
EDIT:
To give some context to the speed problem, here is the timing on a million-row, ~50-column table compared to base R. The data.table version is roughly twice as slow (about 85 s vs 40 s). I assume this is because I am returning the whole table instead of only the columns I want, so I am hoping for a solution to this.
library(data.table)
ids = function(n) sample(LETTERS, n, rep=T)
nums = function(n) runif(n, 1, 1000)
x = data.table(keep1=ids(1e6), keep2=ids(1e6), replicate(50, nums(1e6)))
stats = function(x) c(mean(x), median(x))
ss = Sys.time()
y = x[,c("mean", "median") := stats(unlist(.SD)),
by = seq_len(nrow(x)), .SDcols = 3:ncol(x)]
Sys.time() - ss # Time difference of 1.408833 mins
ss = Sys.time()
y = cbind(x[,1:2], t(apply(x[,3:ncol(x)], 1, function(x) c(mean(x), median(x)))))
Sys.time() - ss # Time difference of 40.196 secs
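For what it's worth, here is a hedged sketch (not from the thread) of one way to keep only the two key columns plus the summaries while avoiding the per-row grouping: compute the row statistics on a numeric matrix and bind them to the key columns.
library(data.table)
# assumes x as freshly built in the EDIT above (keep1, keep2, 50 numeric columns)
m <- as.matrix(x[, -(1:2), with = FALSE])
res <- data.table(x[, .(keep1, keep2)],
                  mean   = rowMeans(m),
                  median = apply(m, 1, median))  # matrixStats::rowMedians(m) is a faster drop-in if installed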

Ideas to improve this loop?

I've been reading about how to improve code in R, looking at some of the answers here and also reading a bit of the R Inferno. Now I have this problem, and the loop I created seems to be taking forever (15 hours and counting).
k <- NROW(unique(df$EndStation.Id))
l <- NROW(unique(df$StartStation.Id))
m1 <- as.matrix(df[,c("Duration","StartStation.Id","EndStation.Id")])
g <- function(m){
  for (i in 1:l){
    for (j in 1:k){
      duration <- m[(m[,2]==i & m[,3]==j),1]
      if (NROW(duration)<=1) {
        m[(m[,2]==i & m[,3]==j),1] <- NA
        next
      }
      duration <- duration/median(duration)
      m[(m[,2]==i & m[,3]==j),1] <- duration
    }
  }
  return(m)
}
answer <- g(m1)
The number of stations (start and end) is 750 each, and the duration vector size can vary a lot, from 1 or 2 up to 80. Is this loop improvable, or should I give up and try to get access to a faster computer?
Best regards,
Fernando
The code is a bit hard to read, but I think this is what you want to do:
library(data.table)
## generate a data table
dt <- setDT(df[,c("Duration","StartStation.Id","EndStation.Id")])
## calculate the duration
dt[, Duration := Duration / median(Duration), by = .(StartStation.Id, EndStation.Id)]
## replace the result with NA when the vector length == 1
dt[, N := .N, by = .(StartStation.Id, EndStation.Id)][
N == 1, Duration := NA
][, N := NULL]
If I understand your function correctly, you want to divide the duration between two stations by its median duration and, if there is only one entry for the pair of stations, set it to NA.
Here is a base solution (it's a bit clunky, I haven't finished my first cup of coffee):
##Some sample data
df <- data.frame(StartStation.Id=sample(LETTERS[1:10], 100, replace =T),
EndStation.Id=sample(LETTERS[11:20], 100, replace =T),
Duration=runif(100, 0.1,100))
res <- tapply(df$Duration, paste0(df$StartStation.Id, df$EndStation.Id), function(x) x/median(x))
res <- data.frame(StartStation.Id = sapply(strsplit(rep(names(res), sapply(res, length)), ""), "[", 1),
                  EndStation.Id = sapply(strsplit(rep(names(res), sapply(res, length)), ""), "[", 2),
                  durn = unlist(res))
# note: this blanks any row whose normalised duration equals 1, which includes,
# but is not limited to, the single-entry pairs
res[res$durn == 1, ] <- NA

Faster way to subset data table instead of a for loop R

I have a data.table (you'll need the data.table package installed) in R with X and Y coordinates and random data values drawn from both normal and uniform distributions. The coordinates represent points on a 2000x1600 array that has to be divided into 16 smaller "sectors", each 500x400. For each sector I need the mean of the normal values divided by the squared minimum of the uniform values. Using a provided function startstop I created two variables, x and y, that hold the coordinates of the 16 sectors, plus a function that calculates the number for each sector.
library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000), Norm =rnorm(1600*2000), Unif = runif(1600*2000))
sectorCalc <- function(x,y,DT) {
  sector <- numeric(length = 16)
  for (i in 1:length(sector)) {
    sect <- DT[X %between% c(x[[1]][i],x[[2]][i]) & Y %between% c(y[[1]][i],y[[2]][i])]
    sector[i] <- sCalc(sect)
  }
  return(sector)
}
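# sCalc is referenced above but not shown in the question; based on the
# description (mean of Norm divided by the squared minimum of Unif), a
# hypothetical definition consistent with the answers below would be:
sCalc <- function(sect) sect[, mean(Norm) / min(Unif)^2]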
startstop <- function(width, y = FALSE) {
  startend <- width - (width/4 - 1)
  start <- round(seq(0, startend, length.out = 4))
  stop <- round(seq(width/4, width, length.out = 4))
  if (length(c(start,stop)[anyDuplicated(c(start,stop))]) != 0) {
    dup <- anyDuplicated(c(start,stop))
    stop[which(stop == c(start,stop)[dup])] <- stop[which(stop == c(start,stop)[dup])] - 1
  }
  if (y == TRUE) {
    coord <- list(rep(start, each = 4), rep(stop, each = 4))
  } else if (y == FALSE) {
    coord <- list(rep(start, times = 4), rep(stop, times = 4))
  }
  return(coord)
}
x <- startstop(2000)
y <- startstop(1600, T)
sectorNos <- sectorCalc(x,y,DT)
The startstop function isn't really the issue, but I need a faster way to subset the data table, which means modifying the sectorCalc function. The for loop was the best way I could think of, but I don't have much experience with data.tables. Any ideas on a faster method of breaking up the data table?
A solution using not only the package data.table but also the cut function to build the interval "groups":
# Create your test data
library(data.table)
set.seed(123) # make random numbers reproducible to allow comparison of different answers
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000), Norm =rnorm(1600*2000), Unif = runif(1600*2000))
# calculate the sector by cutting the x and y values into groups defined by the interval breaks
DT[, x.sect := cut(DT[, X], c(0, 499, 1000, 1500, 2000), dig.lab=10)] # Intervals should be: seq(0, 2000, by=500) lower bound is less one since it is not included in the interval (see help for cut function)
DT[, y.sect := cut(DT[, Y], c(0, 399, 800, 1200, 1600), dig.lab=10)] # Intervals should be: seq(0, 1600, by=400)
# Now calculate per group (calculation logic "stolen" from the working answer of user "Symbolix")
DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)]
Please note: I think the sizes of the first and second intervals are wrong in this solution (499 instead of 500 for x and 399 instead of 400 for y), so I could not use the seq function to reproduce your desired intervals and had to enumerate the interval breaks manually.
Edit 1: I have replaced the original code that adds the x.sect and y.sect columns with an improved solution that adds the columns by reference (:=).
Edit 2: If you want to order the result you have (at least) two options:
# "Chaining" (output is input of next)
DT[, .(sect = mean(Norm)/min(Unif)^2), by=.(x.sect, y.sect)][order(x.sect, y.sect),]
# Or: Use the "keyby" param instead of "by"
DT[, .(sect = mean(Norm)/min(Unif)^2), keyby=.(x.sect, y.sect)]
Edit 3: Added dig.lab=10 param to cut function in code above to avoid scientific notation of the interval breaks.
To replace your sectorCalc function, I think we can make use of data.table's joins. Since you are looping over each row of sector, you just have to create a data.table of the sector boundaries to join onto, specify a column to join on (here I'm using key_col), and add a 'group' variable for each row so that the calculation can be done at the end:
x <- startstop(2000)
y <- startstop(1600, T)
## copy the original DT
dt <- copy(DT)
dt_xy <- data.table(x_1 = x[[1]],
                    x_2 = x[[2]],
                    y_1 = y[[1]],
                    y_2 = y[[2]])
dt[, key_col := 1]
dt_xy[, `:=`(key_col = 1, xy_grp = seq(1,.N))]
## Use a data.table join, allowing cartesian, then filter out results.
dt_res <- dt[ dt_xy, on="key_col", allow.cartesian=T][x_1 <= X & X <= x_2 & y_1 <= Y & Y <= y_2]
## calculate 'sect' as required.
dt_sect <- dt_res[, .(sect = mean(Norm)/min(Unif)^2) , by=.(xy_grp)]
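A quick optional check (assuming the same DT and seed are used for both approaches): order the result by group and compare it against the loop-based sectorNos from the question.
setorder(dt_sect, xy_grp)
all.equal(dt_sect$sect, sectorNos)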

Return factor associated with a numeric range defined in two columns

Using a database with a numeric range defined by two columns start and end, I am trying to look up the factor, code, associated with a numeric value in a separate vector identityCodes.
database <- data.frame(start = seq(1, 150000000, 1000),
                       end = seq(1000, 150000000, 1000),
                       code = paste0(sample(LETTERS, 15000, replace = TRUE),
                                     sample(LETTERS, 15000, replace = TRUE)))
identityCodes <- sample(1:15000000, 1000)
I've come up with a method for finding the corresponding codes using a for loop and subsetting:
fun <- function (x, y) {
  z <- rep(NA, length(x))
  for (i in 1:length(x)){
    z[i] <- as.character(y[y["start"] <= x[i] & y["end"] >= x[i], "code"])
  }
  return(z)
}
a <- fun(identityCodes, database)
But the method is slow, especially if I am to scale it:
system.time(fun(identityCodes, database))
user system elapsed
15.36 0.00 15.50
How can I identify the code associated with each element of identityCodes faster? Is there a better way to go about this than using a for loop and subsetting?
Here's my attempt using data.table. Very fast - even though I am sure I am not leveraging it efficiently.
Given function:
# method 1
system.time(result1 <- fun(identityCodes, database))
user system elapsed
8.99 0.00 8.98
Using data.table
# method 2
require(data.table)
# x: a data.frame with columns start, end, code
# y: a vector with lookup codes
dt_comb <- function(x, y) {
  # convert x to a data.table and set 'start' and 'end' as keys
  DT <- setDT(x)
  setkey(DT, start, end)
  # create a lookup data.table where start and end are the identityCodes
  DT2 <- data.table(start=y, end=y)
  # overlap join where DT2 start & end are within DT start and end
  res <- foverlaps(DT2, DT[, .(start, end)], type="within")
  # store i as row number and key (for sorting later)
  res[, i := seq_len(nrow(res))]
  setkey(res, i)
  # merge the joined table to the original to get codes
  final <- merge(res, DT, by=c("start", "end"))[order(i), .(code)]
  # export the codes as character
  as.character(final[[1]])
}
system.time(result2 <- dt_comb(x=database, y=identityCodes))
user system elapsed
0.08 0.00 0.08
identical(result1, result2)
[1] TRUE
Edit: trimmed a couple of lines from the function.
This is about 45% faster on my machine:
result = lapply(identityCodes, function(x) {
  data.frame(identityCode = x,
             code = database[database$start <= x & database$end >= x, "code"])
})
result = do.call(rbind, result)
Here's a sample of the output:
identityCode code
1 6836845 OK
2 14100352 RB
3 2313115 NK
4 8440671 XN
5 11349271 TI
6 14467193 VL
