Optimizing a vectorized function using apply, compiler, or other techniques - r

I'm seeking to optimize this function, smartWindow, and the process in which I apply it (here is my original post, which explains some of the context around the function and how I got to this point:
Vectorizing a loop through lines of a data frame in R while accessing multiple variables of the data frame).
This currently takes 240 seconds to run on my actual data. I've done some profiling with Rprof, and it seems that the chg2 <- ... line of smartWindow is eating most of the time. I've also tried byte-compiling the function with compiler::cmpfun. I'm wondering whether there's a way to significantly improve the speed of what I'm trying to do.
What I'm really looking for is a technique that accomplishes what I've done below in something closer to 20 seconds than 240. I've shaved off 1-5% of the computation time with various tweaks, but what I'm really wondering is whether I can decrease the time by a factor greater than 2.
## the function
smartWindow <- function(tdate, aid, chgdf, datev='Submit.Date', assetv='Asset.ID', fdays=30, bdays=30) {
  fdays <- tdate + fdays
  bdays <- tdate - bdays
  chg2 <- chgdf[chgdf[, assetv] == aid & chgdf[, datev] < fdays & chgdf[, datev] > bdays, ]
  ret <- nrow(chg2)
  return(ret)
}
## set up some data #################################################
dates <- seq(as.Date('2011-01-01'), as.Date('2013-12-31'), by='days')
aids <- paste(rep(letters[1:26], 3), 1:3, sep='')
n <- 3000
inc <- data.frame(
  Submit.Date = sample(dates, n, replace=T),
  Asset.ID = sample(aids, n, replace=T))
chg <- data.frame(
  Submit.Date = sample(dates, n, replace=T),
  Asset.ID = sample(aids, n, replace=T))
## applying function to just one incident ###########################
smartWindow(inc$Submit.Date[1], inc$Asset.ID[1], chgdf=chg, bdays=100)
## applying to every incident... this is process i seek to optimize #########
system.time({
  inc$chg_b30 <- apply(inc[, c('Submit.Date', 'Asset.ID')], 1, function(row)
    smartWindow(as.Date(row[1]), row[2], chgdf=chg,
                datev='Submit.Date', assetv='Asset.ID', bdays=30, fdays=0))
})
table(inc$chg_b30)
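One direction that often helps with this pattern (a sketch only, not from the original post; chg_by_asset and chg_b30_fast are illustrative names) is to split the change dates by asset once, so each incident only scans the dates of its own asset with a cheap vectorized comparison instead of re-subsetting the whole chg data frame:
## do the expensive grouping once, up front
chg_by_asset <- split(chg$Submit.Date, chg$Asset.ID)
sub_dates <- inc$Submit.Date
asset_ids <- as.character(inc$Asset.ID)
system.time({
  inc$chg_b30_fast <- vapply(seq_along(sub_dates), function(i) {
    v <- chg_by_asset[[asset_ids[i]]]
    if (is.null(v)) return(0L)
    # same window as above: bdays = 30, fdays = 0
    sum(v > sub_dates[i] - 30 & v < sub_dates[i])
  }, integer(1))
})
## should agree with the apply() result
all.equal(inc$chg_b30_fast, inc$chg_b30)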

Related

Assigning value to dataframe in R - for loop speed

I have the following code:
n <- 1e6
no_clm <- rpois(n,30)
hold <- data.frame("x" = double(n))
c = 1
for (i in no_clm){
  ctl <- sum(rgamma(i, 30000) - 2000)
  hold[c, 1] <- ctl
  # hold <- rbind(hold, df)
  c = c + 1
}
Unfortunately this code is quite slow. I've narrowed the slowness down to hold[c,1] <- ctl; if I remove this, the code runs near instantly.
How can I make this efficient? I need to store the results in some sort of data frame or list quickly. In reality the actual code is more complex than this, but the slow point is the assignment.
Note that the above is just an example, in reality I have multiple calculations on the rgamma samples and each of these calculations are then stored in a large dataframe.
Try this
hold = data.frame(sapply(no_clm, function(x) {
  return(sum(rgamma(x, 30000) - 2000))
}))
It looks like you can just use one call to rgamma, as you are iterating over the number of observations parameter.
So if you do one call and then split the vector into the lengths required (no_clm), you can iterate over that list and sum.
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame("x" = double(n))
# total observations to use for rgamma
total_clm <- sum(no_clm)
# get values
gammas <- rgamma(total_clm, 30000) - 2000
# split into list of lengths dictated by no_clm
hold$x <- sapply(split(gammas, cumsum(sequence(no_clm) == 1)), sum)
This took 5.919892 seconds
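To see what that split index does, here is a tiny illustration with a toy no_clm (toy values, not from the answer):
no_clm_toy <- c(2, 3, 1)
sequence(no_clm_toy)               # 1 2 1 2 3 1
cumsum(sequence(no_clm_toy) == 1)  # 1 1 2 2 2 3 -> one group id per draw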
Move to an sapply() loop instead of a for loop, and then notice that 2000 * no_clm can be moved outside the loop (to minimise the number of function calls).
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame(x = sapply(no_clm, function(i) sum(rgamma(i, 30000))) - 2000 * no_clm)
You may observe a speed pickup using data.table:
library(data.table)
dt <- data.table(no_clm)
dt[, hold := sapply(no_clm, function(x) sum(rgamma(x, 30000) - 2000))]
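The common thread in these answers is avoiding the per-iteration data.frame assignment. For comparison, even the original for loop becomes fast if the results go into a preallocated numeric vector and the data.frame is built once at the end (a minimal sketch, reusing n and no_clm from the question; res is an illustrative name):
res <- numeric(n)                      # plain vector: element assignment is cheap
for (i in seq_along(no_clm)) {
  res[i] <- sum(rgamma(no_clm[i], 30000) - 2000)
}
hold <- data.frame(x = res)            # build the data.frame once, after the loop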

Filter data using a loop to compare calculated metrics for different days of week in R

I have a data set of traffic by day and by hour. I have written a function that I would like to apply to different conditions of this data set.
For instance, I want to compare average traffic for different days of the week and different hours of the day.
How do I use a loop to filter through each possible day of the week and return the metric I have a function for, for each of them?
Would really appreciate some help here.
Thanks,
Zach
I'm not sure a loop is the best thing for what you're trying to do, but here's one way to do it.
# generate example data
set.seed(1234)
df <- data.frame(hour = sample(1:24, 100, T),
                 dow = sample(1:7, 100, T),
                 traffic = round(runif(100, 1, 50)))
# prep storage matrix for results
H <- sort(unique(df$hour))
D <- sort(unique(df$dow))
res_mat <- matrix(NA, nrow=length(H), ncol=length(D))
colnames(res_mat) <- D
rownames(res_mat) <- H
# function I want to apply to subsets of values
my_fun <- function(x) { mean(x) + 2 }
# loop
for(h in seq_along(H)) {
  for(d in seq_along(D)) {
    # get vector of traffic for a particular hour and day-of-week combo
    subset_of_traffic <- df[df$hour == H[h] & df$dow == D[d], "traffic"]
    # skip if no traffic data for this hour and day-of-week combo
    if(length(subset_of_traffic) == 0) next
    # run function on that subset and store result
    res_mat[h, d] <- my_fun(subset_of_traffic)
  }
}
A faster way to get the same results with data.table:
library(data.table)
dt <- data.table(df)
res_dt <- dt[ , .(results = my_fun(traffic)), by=.(hour, dow)]
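If you need the same hour x day-of-week matrix layout as res_mat, the long result can be reshaped afterwards (a small follow-up, not part of the original answer; res_wide is an illustrative name):
res_wide <- dcast(res_dt, hour ~ dow, value.var = "results")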

Increasing speed using while loops: finding MULTIPLE chains of infection in R

I recently asked a question about improving performance in my code (Faster method than "while" loop to find chain of infection in R).
Background:
I'm analyzing large tables (300 000 - 500 000 rows) that store data output by a disease simulation model. In the model, animals on a landscape infect other animals. For example, in the example pictured below, animal a1 infects every animal on the landscape, and the infection moves from animal to animal, branching off into "chains" of infection.
In my original question, I asked how I could output a data.frame corresponding to animal "d2"'s "chain of infection" (see below, outlined in green, for an illustration of one "chain"). The suggested solution worked well for one animal.
In reality, I will need to calculate chains for about 400 animals, corresponding to a subset of all animals (allanimals table).
I've included a link to an example dataset that is large enough to play with.
Here is the code for one chain, starting with animal 5497370. Note that I've slightly changed the column names from my previous question and updated the code!
The code:
allanimals <- read.csv("https://www.dropbox.com/s/0o6w29lz8yzryau/allanimals.csv?raw=1",
                       stringsAsFactors = FALSE)
# Here's an example animal
ExampleAnimal <- 5497370
ptm <- proc.time()
allanimals_ID <- setdiff(unique(c(allanimals$ID, allanimals$InfectingAnimal_ID)), -1)
infected <- rep(NA_integer_, length(allanimals_ID))
infected[match(allanimals$ID, allanimals_ID)] <-
  match(allanimals$InfectingAnimal_ID, allanimals_ID)
path <- rep(NA_integer_, length(allanimals_ID))
curOne <- match(ExampleAnimal, allanimals_ID)
i <- 1
while (!is.na(nextOne <- infected[curOne])) {
  path[i] <- curOne
  i <- i + 1
  curOne <- nextOne
}
chain <- allanimals[path[seq_len(i - 1)], ]
chain
proc.time() - ptm
# check it out
chain
I'd like to output chains for each animal in "sel.set":
library(dplyr)
sel.set <- allanimals %>%
  filter(HexRow < 4 & Year == 130) %>%
  pull("ID")
If possible, I'd like to store each "chain" data.frame in a list with length equal to the number of chains.
So I'll return the indices used to access the data frame rather than all the data frame subsets. You'll just need to use lapply(test, function(path) allanimals[path, ]), or a more complicated function inside the lapply if you want to do other things with the data frame subsets (see the example after the timing below).
One could think of just using lapply on the solution for one animal:
get_path <- function(animal) {
  curOne <- match(animal, allanimals_ID)
  i <- 1
  while (!is.na(nextOne <- infected[curOne])) {
    path[i] <- curOne
    i <- i + 1
    curOne <- nextOne
  }
  path[seq_len(i - 1)]
}
sel.set <- allanimals %>%
  filter(HexRow < 4 & Year == 130) %>%
  pull("ID")
system.time(
  test <- lapply(sel.set, get_path)
) # 0.66 seconds
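As described above, the index vectors in test can then be turned back into one chain data.frame per animal (chains is an illustrative name):
chains <- lapply(test, function(path) allanimals[path, ])
length(chains)  # equals length(sel.set)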
We could rewrite this function as a recursive function (this will introduce my third and last solution).
system.time(
  sel.set.match <- match(sel.set, allanimals_ID)
) # 0
get_path_rec <- function(animal.match) {
  `if`(is.na(nextOne <- infected[animal.match]),
       NULL,
       c(animal.match, get_path_rec(nextOne)))
}
system.time(
  test2 <- lapply(sel.set.match, get_path_rec)
) # 0.06
all.equal(test2, test) # TRUE
This solution is 10 times as fast. I don't understand why though.
Why did I want to write a recursive function? I thought you might have a lot of cases where you want, for example, the paths of animalX and animalY where animalY infected animalX. When computing the path of animalX, you would then recompute the whole path of animalY.
So I wanted to use memoisation to store already-computed results, and memoisation works well with recursive functions. Hence my last solution:
get_path_rec_memo <- memoise::memoize(get_path_rec)
memoise::forget(get_path_rec_memo)
system.time(
  test3 <- lapply(sel.set.match, get_path_rec_memo)
) # 0.12
all.equal(test3, test) # TRUE
Unfortunately, this is slower than the second solution. Hope it will be useful for the whole dataset.

Efficiently building a large (200 MM line) dataframe

I am attempting to build a large (~200 MM line) dataframe in R. Each entry in the dataframe will consist of approximately 10 digits (e.g. 1234.12345). The code is designed to walk through a list, subtract an item in position [i] from every item after [i], but not the items before [i] (If I was putting the output into a matrix it would be a triangular matrix). The code is simple and works fine on smaller lists, but I am wondering if there is a faster or more efficient way to do this? I assume the first part of the answer is going to entail "don't use a nested for loop," but I am not sure what the alternatives are.
The idea is that this will be an "edge list" for a social network analysis graph. Once I have outlist, I will reduce the number of edges based on some criteria (<, >, ==) so the final list (and graph) won't be quite so ponderous.
#Fake data of same approximate dimensions as real data
dlist<-sample(1:20,20, replace=FALSE)
#purge the output list before running the loop
rm(outlist)
outlist<-data.frame()
for(i in 1:(length(dlist)-1)){
  for(j in (i+1):length(dlist)){
    outlist <- rbind(outlist, c(dlist[i], dlist[j], dlist[j] - dlist[i]))
  }
}
IIUC your final dataset will be ~200 million rows by 3 columns, all of type numeric, which takes a total space of:
200e6 (rows) * 3 (cols) * 8 (bytes) / (1024 ^ 3)
# ~ 4.5GB
That's quite a lot of data, so it's essential to avoid copies wherever possible.
Here's a method that uses the data.table package's unexported (internal) vecseq function (written in C, fast and memory efficient) and makes use of its assignment-by-reference operator :=, to avoid copies.
fn1 <- function(x) {
  require(data.table) ## 1.9.2
  lx = length(x)
  vx = as.integer(lx * (lx-1)/2)
  # R v3.1.0 doesn't copy on doing list(.) - so should be even faster there
  ans = setDT(list(v1 = rep.int(head(x, -1L), (lx-1L):1L),
                   v2 = x[data.table:::vecseq(2:lx, (lx-1L):1, vx)]))
  ans[, v3 := v2 - v1]
}
Benchmarking:
I'll benchmark with functions from the other answers on your data dimensions. Note that my benchmark is on R v3.0.2, but fn1() should give better performance (both speed and memory) on R v3.1.0 because list(.) no longer results in a copy there.
fn2 <- function(x) {
  diffmat <- outer(x, x, "-")
  ss <- which(upper.tri(diffmat), arr.ind = TRUE)
  data.frame(v1 = x[ss[,1]], v2 = x[ss[,2]], v3 = diffmat[ss])
}
fn3 <- function(x) {
  idx <- combn(seq_along(x), 2)
  out2 <- data.frame(v1 = x[idx[1, ]], v2 = x[idx[2, ]])
  out2$v3 <- out2$v2 - out2$v1
  out2
}
set.seed(45L)
x = runif(20e3L)
system.time(ans1 <- fn1(x)) ## 18 seconds + ~8GB (peak) memory usage
system.time(ans2 <- fn2(x)) ## 158 seconds + ~19GB (peak) memory usage
system.time(ans3 <- fn3(x)) ## 809 seconds + ~12GB (peak) memory usage
Note that fn2(), due to its use of outer, requires quite a lot of memory (peak usage was >= 19GB) and is slower than fn1(). fn3() is just very, very slow (due to combn and unnecessary copies).
Another way to create that data is
#Sample Data
N <- 20
set.seed(15) #for reproducibility
dlist <- sample(1:N,N, replace=FALSE)
we could do
idx <- combn(1:N,2)
out2 <- data.frame(i=dlist[idx[1, ]], j=dlist[idx[2, ]])
out2$dist <- out2$j-out2$i
This uses combn to create all pairs of indices in the data set rather than doing loops. This allows us to build the data.frame all at once rather than adding a row at a time.
We compare that to
out1 <- data.frame()
for(i in 1:(length(dlist)-1)){
  for(j in (i+1):length(dlist)){
    out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j] - dlist[i]))
  }
}
we see that
all(out1==out2)
# [1] TRUE
Plus, if we compare with microbenchmark we see that
microbenchmark(loops(), combdata())
# Unit: microseconds
# expr min lq median uq max neval
# loops() 30888.403 32230.107 33764.7170 34821.2850 82891.166 100
# combdata() 684.316 800.384 873.5015 940.9215 4285.627 100
The method that doesn't use loops is much faster.
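The original answer doesn't show the bodies of loops() and combdata(); presumably they just wrap the two approaches above, roughly like this (assumed definitions, using dlist and N from the sample data):
library(microbenchmark)
loops <- function() {
  out1 <- data.frame()
  for(i in 1:(length(dlist)-1)){
    for(j in (i+1):length(dlist)){
      out1 <- rbind(out1, c(dlist[i], dlist[j], dlist[j] - dlist[i]))
    }
  }
  out1
}
combdata <- function() {
  idx <- combn(1:N, 2)
  out2 <- data.frame(i = dlist[idx[1, ]], j = dlist[idx[2, ]])
  out2$dist <- out2$j - out2$i
  out2
}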
You can always start with a triangular matrix and then make your dataframe directly from that:
vec <- 1:10
diffmat <- outer(vec,vec,"-")
ss <- which(upper.tri(diffmat),arr.ind = TRUE)
data.frame(one = vec[ss[,1]],
           two = vec[ss[,2]],
           diff = diffmat[ss])
You need to preallocate outlist; this will significantly increase the speed of your code. By preallocating I mean creating an output structure that already has the desired size, but is filled with, for example, NAs (see the sketch below).
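A minimal preallocation sketch along those lines (illustrative names; a triangular result has length(dlist) * (length(dlist) - 1) / 2 rows):
n_pairs <- length(dlist) * (length(dlist) - 1) / 2
outmat <- matrix(NA_real_, nrow = n_pairs, ncol = 3)  # sized up front, filled with NA
k <- 1
for(i in 1:(length(dlist)-1)){
  for(j in (i+1):length(dlist)){
    outmat[k, ] <- c(dlist[i], dlist[j], dlist[j] - dlist[i])
    k <- k + 1
  }
}
outlist <- as.data.frame(outmat)  # convert once, after the loop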

How can I speed up this sapply for cross checking samples?

I'm trying to speed up a QC function for checking similarity between samples. I wanted to know if there is a faster way to do the comparison than the way I am doing it below. I know there have been answers to this kind of question that are pretty definitive (on SO or otherwise), but I can't find them. I know I should investigate plyr, but I'm still getting the hang of sapply.
The following sample data is representative of what I would be working with, but randomized; I don't think that affects how answers would apply to my original question.
## sample data
nSamples <- 1000
nSamplesQC <- 100
nAssays <- 96
microarrayScores <- matrix(sample(c("G:G", "T:G", "T:T", NA), nSamples * nAssays, replace = TRUE),
                           nrow = nSamples, ncol = nAssays)
microarrayScoresQC <- matrix(sample(c("G:G", "T:G", "T:T", NA), nSamples * nAssays, replace = TRUE),
                             nrow = nSamples, ncol = nAssays)
mycombs <- data.frame(Experiment = rep(1:nSamples, nSamplesQC),
                      QC = sort(rep(1:nSamplesQC, nSamples)))
## testing function
system.time(
  sapply(seq(length(mycombs[,1])), function(x) {
    compare <- microarrayScores[mycombs[x,1],] == microarrayScoresQC[mycombs[x,2],]
    sum(compare[!is.na(compare)]) / sum(!is.na(compare))
  })
)
Here is a vectorized version of your code, about 20 times faster on my machine:
rowMeans(microarrayScores[mycombs[,1], ] ==
microarrayScoresQC[mycombs[,2], ], na.rm = TRUE)
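A quick sanity check on the sample data above, confirming that the NA-aware row mean reproduces the original ratio of matches to non-NA comparisons (vec_res and loop_res are illustrative names):
vec_res <- rowMeans(microarrayScores[mycombs[, 1], ] ==
                    microarrayScoresQC[mycombs[, 2], ], na.rm = TRUE)
loop_res <- sapply(seq_len(nrow(mycombs)), function(x) {
  compare <- microarrayScores[mycombs[x, 1], ] == microarrayScoresQC[mycombs[x, 2], ]
  sum(compare[!is.na(compare)]) / sum(!is.na(compare))
})
all.equal(vec_res, loop_res)  # TRUE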
Something like this:
foo <- function(x){
  compare <- microarrayScores[x[1],] == microarrayScoresQC[x[2],]
  sum(compare[!is.na(compare)]) / sum(!is.na(compare))
}
system.time(apply(mycombs,1,foo))
appears to be modestly faster. (Maybe 2-3x)
