Alternate loop functions in R/Optimize loop - r

I am trying to identify connected transactions: the rows from the first TRUE to the last TRUE are treated as one transaction, and within each transaction I need to determine whether the tpt_mode is mixed or pure. I then insert new columns with the aggregated data. The for loop below works on a small volume of data, but on a large volume it runs very slowly. How can I optimize the for loop to speed up performance?
firstid <- 1
currTpt <- 'NA'
count <- 0
n <- nrow(tnx)

for (i in 1:n) {
  if (tnx$FIRST[i]) {
    firstid <- i
    currTpt <- tnx$mode[i]
    count <- 1
  } else {
    count <- count + 1
  }
  if (as.character(tnx$mode[i]) != as.character(currTpt)) {
    currTpt <- 'both'
  }
  if (tnx$LAST[i]) {
    tnx$final_end_loc[firstid]  <- tnx$end_loc[i]
    tnx$final_end_date[firstid] <- as.character(tnx$end_date[i])
    tnx$final_end_time[firstid] <- as.character(tnx$end_time[i])
    tnx$final_mode[firstid]     <- as.character(currTpt)
    tnx$final_count[firstid]    <- count
  }
}

final_tnx <- subset(tnx, FIRST == TRUE,
                    c("id", "start_date", "start_time", "final_end_date", "final_end_time",
                      "start_loc", "final_end_loc", "final_mode", "final_count"))
EDIT: Sample data:
tnx <- data.frame(
  id         = c("A","A","A","A","C","C","D","D","E"),
  mode       = c("on","on","off","on","on","off","off","off","on"),
  start_time = c("8:20:22","17:20:22","17:45:22","18:20:22","16:35:22","17:20:22","15:20:22","16:00:22","12:20:22"),
  end_time   = c("8:45:22","17:30:22","18:00:22","18:30:22","17:00:22","17:50:22","15:45:22","16:14:22","27:50:22"),
  start_loc  = c("12","12","207","12","11","65","222","32","12"),
  end_loc    = c(31,31,29,11,22,12,45,31,11),
  start_date = c("6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012"),
  end_date   = c("6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012"),
  FIRST      = c(T,T,F,F,T,F,T,F,T),
  LAST       = c(T,F,F,T,F,T,F,T,T)
)
(The sample dataset and the expected results were shown as images in the original post.)
Thanks in advance.

To get your results, you don't need a loop. If you check where your transactions start and end and index accordingly, your code simplifies to:
nLAST  <- which(tnx$LAST)
nFIRST <- which(tnx$FIRST)

# one index sequence per transaction (first row .. last row)
count <- sapply(seq_along(nFIRST), FUN = function(i) nFIRST[i]:nLAST[i])

# the single mode if only one occurs within a transaction, otherwise 'both'
mode <- unlist(lapply(count, FUN = function(x) {
  ifelse(length(unique(tnx$mode[x])) == 1,
         as.character(unique(tnx$mode[x])), 'both')
}))

final_tnx <- data.frame(id             = tnx$id[nFIRST],
                        start_date     = tnx$start_date[nFIRST],
                        start_time     = tnx$start_time[nFIRST],
                        final_end_date = tnx$end_date[nLAST],
                        final_end_time = tnx$end_time[nLAST],
                        start_loc      = tnx$start_loc[nFIRST],
                        final_end_loc  = tnx$end_loc[nLAST],
                        final_mode     = mode,
                        final_count    = nLAST - nFIRST + 1)
This should definitely speed things up and also perform well on larger data sets.
EDIT: When the mode is allowed to change more than once, you have to check for uniqueness on every subset. In count I build a list of index sequences, one per transaction. I then apply a function over that index list which checks whether the subset contains one mode or several.
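For what it's worth, the same start/end logic can also be expressed as a grouped aggregation. The sketch below is only a hedged illustration (it assumes, as in the sample data, that every FIRST row is eventually closed by a LAST row; the txn column and final_tnx2 are just illustrative names) and uses cumsum(tnx$FIRST) as a transaction id:

# hedged sketch: one transaction per run starting at FIRST == TRUE
tnx$txn <- cumsum(tnx$FIRST)
final_tnx2 <- do.call(rbind, lapply(split(tnx, tnx$txn), function(g) {
  data.frame(id             = g$id[1],
             start_date     = g$start_date[1],
             start_time     = g$start_time[1],
             final_end_date = g$end_date[nrow(g)],
             final_end_time = g$end_time[nrow(g)],
             start_loc      = g$start_loc[1],
             final_end_loc  = g$end_loc[nrow(g)],
             final_mode     = if (length(unique(as.character(g$mode))) == 1)
                                as.character(g$mode[1]) else "both",
             final_count    = nrow(g))
}))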

I'm sure there are far more improvements to be made, but if you index as little as possible inside the loop and pull the columns out as plain vectors first, you can see some improvement.
require("rbenchmark")
###Specify data as vectors
FIRST <- tnx$FIRST
mode <- tnx$mode
LAST <- tnx$LAST
final_end_loc <- tnx$final_end_loc
final_end_date <- tnx$final_end_date
final_end_time <- tnx$final_end_time
final_mode <- tnx$final_mode
final_count <- tnx$final_count
end_date <- tnx$end_date
end_time <- tnx$end_time
end_loc <- tnx$end_loc
benchmark(for (i in 1:n) {
  if (FIRST[i]) {
    firstid <- i
    currTpt <- mode[i]
    count <- 1
  } else {
    count <- count + 1
  }
  if (as.character(mode[i]) != as.character(currTpt)) {
    currTpt <- 'both'
  }
  if (LAST[i]) {
    final_end_loc[firstid]  <- end_loc[i]
    final_end_date[firstid] <- as.character(end_date[i])
    final_end_time[firstid] <- as.character(end_time[i])
    final_mode[firstid]     <- as.character(currTpt)
    final_count[firstid]    <- count
  }
})
replications elapsed relative user.self sys.self user.child sys.child
1 100 0.11 1 0.11 0 NA NA
Now your loop
replications elapsed relative user.self sys.self user.child sys.child
1 100 0.18 1 0.19 0 NA NA
I cannot be certain this will perform well with a large dataset, but keeping indexing to a minimum has worked for me in the past. If this isn't fast enough for you or doesn't work well with large data, a good post can be found here: Speed up the loop operation in R.
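As a general, hedged aside (not specific to this answer): byte-compiling a loop-heavy function with compiler::cmpfun() is another low-effort thing to try, although R 3.4+ JIT-compiles functions by default, so the gain is often modest. A toy illustration:

library(compiler)
slow_sum <- function(x) { s <- 0; for (v in x) s <- s + v; s }  # deliberately loopy
fast_sum <- cmpfun(slow_sum)                                    # byte-compiled copy
slow_sum(1:10) == fast_sum(1:10)                                # TRUE, same result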

Related

Running a quicker calculation

I have 2 data frames in R, one of which is a subset of the other. I had to do some manipulations on it and, for 6 x-values (DayTreat in the code), calculate the percentage that the subsetted data represents of the main data frame. So I created a function that does the calculation and creates a new column. My issue is that it's painfully slow. Any suggestions?
percDay <- function(fullDat, subDat) {
  subDat$DaySum <- NULL
  for (i in fullDat$DayTreat) {  # for each DayTreat value in fullDat. Must be a `psmelt()`-made phyloseq object
    r <- sum(fullDat$Abundance[fullDat$DayTreat == i])  # take the sum of all the taxa for that day
    subDat$DaySum[subDat$DayTreat == i] <- r            # add the value to the subset of data
  }
  subDat$DayPerc <- subDat$Abundance / subDat$DaySum    # make the percentage of the subset
  subDat
}
Examining your code, it looks like you are doing redundant calculations.
The line:
for (i in fullDat$DayTreat)
should be:
for (i in unique(fullDat$DayTreat))
After that, you could use data.table and avoid keeping separate data frames, since you say that one is a subset of the other:
require(data.table)
setDT(fullDat)
fullDat[, subsetI := Abundance > 30] # for example, should be your Condition
fullDat[, DaySum:= sum(Abundance), by = DayTreat]
fullDat[, DayPerc := Abundance/DaySum]
# get subset:
fullDat[subsetI == T]
If you would provide example data and desired output, it could be possible to supply more concrete code.
So, at a high level, I think the solutions are:
Use faster data classes if you aren't already
Avoid for loops: vectorize manually, or rely on faster functions/libraries that use more C code and/or have more vectorization "under the hood"
Try data.table and/or the tidyverse for greater speed and cleaner code
Benchmark and profile your code
Example:
require(tidyverse)
require(data.table)
percDay <- function(fullDat, subDat) {
  subDat$DaySum <- NULL
  for (i in fullDat$DayTreat) {  # for each DayTreat value in fullDat. Must be a `psmelt()`-made phyloseq object
    r <- sum(fullDat$Abundance[fullDat$DayTreat == i])  # take the sum of all the taxa for that day
    subDat$DaySum[subDat$DayTreat == i] <- r            # add the value to the subset of data
  }
  subDat$DayPerc <- subDat$Abundance / subDat$DaySum    # make the percentage of the subset
  subDat
}
# My simulation of your data.frame:
fullDat <- data.frame(Abundance = rnorm(200),
                      DayTreat  = c(1:100, 1:100))
subDat <- dplyr::sample_frac(fullDat, .25)
# Your function modifies the data, so I'll make a copy. For a potential
# speed improvement I'll try data.table class
fullDat0 <- as.data.table(fullDat)
subDat0 <- as.data.table(subDat)
require(rbenchmark)
benchmark("original" = {
            percDay(fullDat, subDat)
          },
          "example_improvement" = {
            # Tidy approach
            tmp <- fullDat0 %>%
              group_by(DayTreat) %>%
              summarize(DaySum = sum(Abundance))
            subDat0 <- merge(subDat, tmp, by = "DayTreat")          # could use semi_join
            subDat0$DayPerc <- subDat0$Abundance / subDat0$DaySum   # could use mutate
          },
          replications = 100,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))
test replications elapsed relative user.self sys.self
example_improvement 100 0.22 1.000 0.22 0.00
original 100 1.42 6.455 1.23 0.01
Typically a data.table approach is going to have the greatest speed. The tibble-based "tidy" approach has clearer syntax and is typically faster than a plain data.frame but slower than data.table. An experienced data.table expert like @akrun could probably offer a maximal-performance solution using a single data.table statement.
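As a hedged illustration of roughly what such a data.table solution might look like (my sketch, not @akrun's; fd, sd, daySums and result are just illustrative names, applied to the fullDat/subDat objects simulated above):

library(data.table)
fd <- as.data.table(fullDat)
sd <- as.data.table(subDat)
# per-day totals from the full data, joined onto the subset, then the percentage
daySums <- fd[, .(DaySum = sum(Abundance)), by = DayTreat]
result  <- daySums[sd, on = "DayTreat"][, DayPerc := Abundance / DaySum][]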

R: Remove nested for loops in order to make a custom bootstrap more efficient

I am trying to gather some bootstrapped estimates for summary statistics from a dataset, but I want to resample parts of the dataset at different rates, which has led me to lean on nested for loops.
Specifically, suppose there are two groups in my dataset, and each group is further divided into test and control. Group 1 has a 75% / 25% test-control ratio, and Group 2 has a 50% / 50% test-control ratio.
I want to resample such that the dataset is the same size, but the test-control ratios are 90% / 10% for both groups... in other words, resample different subgroups at different rates, which strikes me as different from what the boot package normally does.
In my dataset, I created a group variable representing the groups, and a groupT variable representing group concatenated with test/control, e.g.:
id group groupT
1 1 1T
2 1 1T
3 2 2T
4 1 1C
5 2 2C
Here's what I am running right now, with nreps arbitrarily set to be my number of bootstrap replications:
for (j in 1:nreps) {
  bootdat <- datafile[-(1:nrow(datafile)), ]  ## initialize empty dataset
  for (i in unique(datafile$groups)) {
    tstring <- paste0(i, "T")  ## e.g. 1T
    cstring <- paste0(i, "C")  ## e.g. 1C
    ## Size of test group resample should be ~90% of total group size
    tsize <- round(.90 * length(which(datafile$groups == i)), 0)
    ## Size of control group resample should be total group size minus test group size
    csize <- length(which(datafile$groups == i)) - tsize
    ## Continue building bootdat by rbinding the test and control resamples
    ## before moving on to the next group
    ## Note the use of datafile$groupT == tstring to ensure I'm only sampling from test, etc.
    bootdat <- rbind(bootdat, datafile[sample(which(datafile$groupT == tstring), size = tsize,
                                              replace = TRUE), ])
    bootdat <- rbind(bootdat, datafile[sample(which(datafile$groupT == cstring), size = csize,
                                              replace = TRUE), ])
  }
  ## Here, there is code to grab some summary statistics from bootdat
  ## and store them in statVector[j] before moving on to the next replication
}
With a dataset size of about 1 million total records, this takes 3-4 minutes per replication. I feel certain there is a better way to do this either with sapply or possibly some of the dplyr functions, but I have come up empty in my attempts so far. Any help would be appreciated!
I'd strongly encourage you to look into data.table and foreach, using keyed searches for bootstraps. It'll allow you to do a single bootstrap very rapidly, and you can run each bootstrap independently on a different core. Each bootstrap of the below takes 0.5 seconds on my machine, searching through a table of 1 million rows. Something like the following should get you started:
library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)
# example data
dat <- data.table(id = 1:1e6,
                  group = sample(2, size = 1e6, replace = TRUE),
                  test_control = sample(c("T", "C"), size = 1e6, replace = TRUE))
# define number of bootstraps
nBootstraps <- 1000
# define sampling fractions
fraction_test <- 0.90
fraction_control <- 1 - fraction_test
# get number that you want to sample from each group
N.test <- round(fraction_test * dim(dat)[1])
N.control <- round(fraction_control * dim(dat)[1])
# key data by id
setkey(dat, id)
# get ID values for each combination, to be used for keyed search during bootstrapping
group1_test_ids <- dat[group==1 & test_control=="T"]$id
group1_control_ids <- dat[group==1 & test_control=="C"]$id
group2_test_ids <- dat[group==2 & test_control=="T"]$id
group2_control_ids <- dat[group==2 & test_control=="C"]$id
results <- foreach(n = 1:nBootstraps, .combine = "rbind", .inorder = FALSE) %dopar% {
  # sample each group with the defined sizes, with replacement
  g1T <- dat[.(sample(group1_test_ids,    size = N.test,    replace = TRUE))]
  g1C <- dat[.(sample(group1_control_ids, size = N.control, replace = TRUE))]
  g2T <- dat[.(sample(group2_test_ids,    size = N.test,    replace = TRUE))]
  g2C <- dat[.(sample(group2_control_ids, size = N.control, replace = TRUE))]
  dat.all <- rbindlist(list(g1T, g1C, g2T, g2C))
  dat.all[, bootstrap := n]
  # do summary stats here with dat.all, return the summary stats data.table object
  return(dat.summarized)
}
EDIT: the example below includes a lookup table for each of any arbitrary number of unique groups. The IDs corresponding to each combination of group + (test OR control) can be referenced within a foreach loop for simplicity. With lower numbers for N.test and N.control (900 and 100) it spits out the results of 1000 bootstraps in about a minute on a single core (timing shown after the code).
library(data.table)
library(foreach)
# example data
dat <- data.table(id = 1:1e6,
                  group = sample(24, size = 1e6, replace = TRUE),
                  test_control = sample(c("T", "C"), size = 1e6, replace = TRUE))
# save vector of all group values & change group to character vector for hashed environment lookup
all_groups <- as.character(sort(unique(dat$group)))
dat[, group := as.character(group)]
# define number of bootstraps
nBootstraps <- 1000
# get number that you want to sample from each group
N.test <- 900
N.control <- 100
# key data by id
setkey(dat, id)
# all values for group
# Set up lookup table for every combination of group + test/control
control.ids <- new.env()
test.ids <- new.env()
for (i in all_groups) {
  control.ids[[i]] <- dat[group == i & test_control == "C"]$id
  test.ids[[i]]    <- dat[group == i & test_control == "T"]$id
}
results <- foreach(n = 1:nBootstraps, .combine = "rbind", .inorder = FALSE) %do% {
  foreach(group.i = all_groups, .combine = "rbind") %do% {
    # get IDs that correspond to this group, for both test and control
    control_id_vector <- control.ids[[group.i]]
    test_id_vector    <- test.ids[[group.i]]
    # search and bind
    controls <- dat[.(sample(control_id_vector, size = N.control, replace = TRUE))]
    tests    <- dat[.(sample(test_id_vector,    size = N.test,    replace = TRUE))]
    dat.group <- rbindlist(list(controls, tests))
    dat.group[, bootstrap := n]
    return(dat.group[])
  }
  # summarize across all groups for this bootstrap and return a summary stat data.table object
}
yielding
> results
id group test_control bootstrap
1: 701570 1 C 1
2: 424018 1 C 1
3: 909932 1 C 1
4: 15354 1 C 1
5: 514882 1 C 1
---
23999996: 898651 24 T 1000
23999997: 482374 24 T 1000
23999998: 845577 24 T 1000
23999999: 862359 24 T 1000
24000000: 602078 24 T 1000
This doesn't involve any of the summary stat calculation time, but here 1000 bootstraps were pulled out on 1 core serially in
user system elapsed
62.574 1.267 63.844
If you need to manually code N to be different for each group, you can do the same thing as with the id lookup:
# create environments
control.Ns <- new.env()
test.Ns <- new.env()
# assign size values
control.Ns[["1"]] <- 900
test.Ns[["1"]] <- 100
control.Ns[["2"]] <- 400
test.Ns[["2"]] <- 50
... ...
control.Ns[["24"]] <- 200
test.Ns[["24"]] <- 5
then change the big bootstrap loop to look up these values based on the loop's current group:
results <- foreach(n = 1:nBootstraps, .combine = "rbind", .inorder = FALSE) %do% {
  foreach(group.i = all_groups, .combine = "rbind") %do% {
    # get IDs that correspond to this group, for both test and control
    control_id_vector <- control.ids[[group.i]]
    test_id_vector    <- test.ids[[group.i]]
    # get size values
    N.control <- control.Ns[[group.i]]
    N.test    <- test.Ns[[group.i]]
    # search and bind
    controls <- dat[.(sample(control_id_vector, size = N.control, replace = TRUE))]
    tests    <- dat[.(sample(test_id_vector,    size = N.test,    replace = TRUE))]
    dat.group <- rbindlist(list(controls, tests))
    dat.group[, bootstrap := n]
    return(dat.group[])
  }
  # summarize across all groups for this bootstrap and return a summary stat data.table object
}
Just like caw5cv, I recommend taking a look at data.table; it is usually very efficient for solving such problems. However, if you prefer to work with dplyr, you can try doing something like this:
library(dplyr)  # for the %>% pipe used below

summary_of_boot_data <- lapply(1:nreps, function(y) {
  # get bootdata
  bootdata <- lapply(unique(datafile$group), function(x) {
    tstring <- paste0(x, "T")
    cstring <- paste0(x, "C")
    tsize <- round(.90 * length(which(datafile$group == x)), 0)
    csize <- length(which(datafile$group == x)) - tsize
    df <- rbind(datafile[sample(which(datafile$groupT == tstring),
                                size = tsize,
                                replace = TRUE), ],
                datafile[sample(which(datafile$groupT == cstring),
                                size = csize,
                                replace = TRUE), ])
    return(df)
  }) %>% do.call(rbind, .)
  # return your summary thing for bootdata e.g. summary(bootdata)
  summary(bootdata)
})
summary_of_boot_data
I tried not to change your code a lot; I just replaced the use of for with lapply.
Hope this helps.
EDIT: Based on the comment from Hugh, you might want to try using data.table::rbindlist() instead of do.call(rbind, .)
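For illustration (a hedged sketch, not part of the original answer), the swap looks like this on a small list of data frames:

library(data.table)
pieces <- lapply(1:5, function(i) data.frame(group = i, x = rnorm(3)))
bound_base <- do.call(rbind, pieces)  # base R: repeatedly copies while binding
bound_dt   <- rbindlist(pieces)       # data.table: binds the whole list in one pass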

Alternative to for loop with "dynamic" variables with R

I am new to StackOverflow even though I have been playing with R for a while. I am struggling with a problem for which I was not able to find any answer on the site; please correct me if my question is not accurate enough.
I have two 3d arrays, in this simplified case 256x256x200. The first one is a field, the second one is made of indices spanning from 1 to 8. I want to compute the average of the field on each vertical level according to the values and counts of the indices, i.e. the average of the field over the 200 levels for each index (from 1 to 8). This should be done only if there are enough counts of the index (i.e. an if condition within the loop). My output must be a matrix of 8x200.
For the example I create two random arrays. Below is the basic code I am using:
nz = 200
lev = 1:nz
indices = 8
var0 = array(rnorm(256 * 256 * nz), dim = c(256, 256, nz))
#octo = array(sample(1:indices), dim = c(256, 256, nz))
octo = array(sample(1:indices, size = 256 * 256 * nz, replace = T), dim = c(256, 256, nz))
counts = apply(octo, 3, function(x) table(factor(x, levels = 1:indices)))
#thr = 0.1
thr = 0.125
np = length(var0[, 1, 1]) * length(var0[1, , 1])
profile = array(NA, dim = c(nz, indices))

t0 = proc.time()
for (i in 1:indices) {
  for (z in 1:length(lev)) {
    if (counts[i, z] / np > thr) {
      v0 = var0[, , z]
      profile[z, i] = counts[i, z] / np * mean(v0[octo[, , z] == i], na.rm = T)
    }
  }
}
print(proc.time() - t0)
user system elapsed
5.169 0.001 5.170
I tried the apply family of functions, but I was not able to write this down in a reasonable and efficient way, considering that each computation must take into account a "dynamic" variable that changes with the level (i.e. the octo and counts variables). My real case involves much bigger matrices, and this has to be done on dozens of fields, so time is quite relevant.
Are you aware of any faster alternatives?
Many thanks for any help!
EDIT: I corrected the original definition of octo and I adjusted the threshold thr. In this way the if condition makes sense, since it is not always respected.
Here's a data.table reshape solution that avoids loops and apply statements:
nz=200
lev=1:nz
indices=8
var0=array(rnorm(256*256*nz),dim=c(256,256,nz))
octo=array(sample(1:indices),dim=c(256,256,nz))
counts=apply(octo,3,function(x) table(factor(x,levels=1:indices)))
thr=0.1
np=length(var0[,1,1])*length(var0[1,,1])
profile=array(NA,dim=c(nz,indices))
# From here load data.table to do the manipulation
# reshape2 to convert back into a matrix at the end
library(data.table)
library(reshape2)
# Take the data long and convert to data.table
var01 <- setDT(melt(var0))
octo1 <- setDT(melt(octo))
# Join the data to get corresponding data
# EDIT, it currently works, but I think that's because all data is defined
# adding nomatch in case of missing data
octo1 <- octo1[var01, on = c('Var1','Var2','Var3'), nomatch = NA]
# Make our calculation grouping by the vertical dimension and the value
profile <- octo1[,if(.N/np > thr) .N / np * mean(i.value, na.rm = TRUE) else NA, by = .(value,Var3)]
# Recast to matrix
profile <- acast(profile, value ~ Var3, mean, value.var = 'V1')
I think I found a good solution with sapply that includes the thr condition.
f1 <- function() {
  for (i in 1:indices) {
    for (z in 1:length(lev)) {
      if (counts[i, z] / np > thr) {
        v0 = var0[, , z]
        profile[z, i] = counts[i, z] / np * mean(v0[octo[, , z] == i], na.rm = T)
      }
    }
  }
  return(profile)
}

f2 <- function() {
  profile = sapply(lev, function(i) {
    v0 = var0[, , i]
    mV = sapply(1:indices, function(j) { mean(v0[octo[, , i] == j], na.rm = TRUE) })
    counts[, i] / np * mV
  })
  profile[counts / np <= thr] = NA
  profile <- matrix(profile, nz, indices, byrow = TRUE)
  return(profile)
}

f3 <- function() {
  profile = sapply(lev, function(i) {
    v0 = var0[, , i]
    mV = sapply(1:indices, function(j) {
      if (counts[j, i] / np > thr) { mean(v0[octo[, , i] == j], na.rm = TRUE) } else { NA }
    })
    counts[, i] / np * mV
  })
  profile <- matrix(profile, nz, indices, byrow = TRUE)
  return(profile)
}
Actually, f1() is the original, f2() is @parksw3's version and f3() is my slightly improved version.
library(rbenchmark)
benchmark(f1(), f2(), f3(), replications = 10)
test replications elapsed relative user.self sys.self user.child sys.child
1 f1() 10 27.382 1.411 27.375 0 0 0
2 f2() 10 35.195 1.814 35.186 0 0 0
3 f3() 10 19.403 1.000 19.392 0 0 0
In this way it is always faster than the standard loop. A data.table approach is likely faster still, but it requires a full change of data structure that I cannot perform for the moment. Hope this helps!
This seems to be faster on my machine:
profile2 <- sapply(lev, function(i) {
  v0 <- var0[, , i]
  mV <- sapply(1:indices, function(j) {
    mean(v0[octo[, , i] == j], na.rm = TRUE)
  })
  counts[, i] / np * mV
})
profile2[counts / np <= thr] <- NA  # NA where there are not enough counts
profile2 <- t(profile2)
all.equal(profile, profile2)
## TRUE
I tried comparing them with microbenchmark package but it takes fairly long... Here's a quick comparison I did with rbenchmark package
f1 <- function() {
  for (i in 1:indices) {
    for (z in 1:length(lev)) {
      if (counts[i, z] / np > thr) {
        v0 = var0[, , z]
        profile[z, i] = counts[i, z] / np * mean(v0[octo[, , z] == i], na.rm = T)
      }
    }
  }
}

f2 <- function() {
  prof <- sapply(lev, function(i) {
    v0 <- var0[, , i]
    mV <- sapply(1:indices, function(j) {
      mean(v0[octo[, , i] == j], na.rm = TRUE)
    })
    counts[, i] / np * mV
  })
  prof[counts / np <= thr] <- NA
  prof <- t(prof)
}
library(rbenchmark)
benchmark(f1(), f2(), replications = 10)
I put both pieces of code into functions and tested them. Here's the result:
## test replications elapsed relative user.self sys.self
## 1 f1() 10 89.03 1.342 85.15 1.72
## 2 f2() 10 66.34 1.000 61.50 0.75

Fast count of number of Events for each Object with R data.tables

I have a number of Objects that can be within a number of Locations (where the number of locations is considerably smaller than the number of objects); each object has a start and an end date. I also have a number of Events, which also have a location and a day on which they occur. I want to know, for each Object, the number of Events that occurred at the same location during its stay (i.e. that occur between the object's start and end date).
As I have several sets and the number of objects ranges from 450,000 to 6 million, this task takes considerable time. Until now, the fastest method I found uses data.table. The function below shows an example of this where you can vary the sizes.
coupleEventObject <- function(sizeO = 100, sizeE = 100) {
  require(data.table)
  require(zoo)
  # create the events
  Events <- data.table(EventNumber = c(1:sizeE),
                       Location = as.character(sample(c(1:floor(sizeO / 10)), size = sizeE, replace = T)),
                       DayEvent = rand.day(day.start = "2007-01-01",
                                           day.end   = "2015-12-31",
                                           size      = sizeE))
  # create the objects
  Objects <- data.table(ObjectNumber = c(1:sizeO),
                        Location = as.character(sample(c(1:floor(sizeO / 10)), size = sizeO, replace = T)),
                        Day1 = rand.day(day.start = "2007-01-01",
                                        day.end   = "2015-12-31",
                                        size      = sizeO),
                        Day2 = rand.day(day.start = "2007-01-01",
                                        day.end   = "2015-12-31",
                                        size      = sizeO))
  Objects[, DayStart := as.Date(ifelse(Day1 > Day2, Day2, Day1))]
  Objects[, DayEnd   := as.Date(ifelse(Day1 < Day2, Day2, Day1))]
  Objects[, c("Day1", "Day2") := NULL]
  # set keys right for the coupling/counting
  setkey(Objects, Location, DayStart, DayEnd)
  setkey(Events, Location, DayEvent)
  # count the number of events
  system.time(
    Objects[, NumberEvents := Events[Location, ][DayEvent >= DayStart & DayEvent <= DayEnd, .N],
            by = list(DayStart, DayEnd, Location)]
  )
}

rand.day <- function(day.start, day.end, size) {
  dayseq <- seq.Date(as.Date(day.start), as.Date(day.end), by = "day")
  dayselect <- sample(dayseq, size, replace = TRUE)
  return(dayselect)
}
For 100 Objects and 100 Events this code runs on my laptop within 0.3 seconds
> coupleEventObject()
user system elapsed
0.30 0.00 0.29
But if I increase the number of objects, the processing time scales almost linearly.
> coupleEventObject(sizeE=200,sizeO=6000)
user system elapsed
15.11 0.00 15.26
So counting the number of Events for 6 million objects would cost about 4 hours, and I have to do this several times (for different kinds of location levels, and so on). Is there a way to speed this up? Thanks for your help and ideas!
Here is an option. The main idea is to join the Events and Objects in such a way that every combination exists first, and then just count the valid values. As you will see, the new way runs far faster and you get the same results.
Btw... You might have to change the random date generator because I didn't have access to your original function.
coupleEventObject1 <- function(sizeO = 100, sizeE = 100) {
  require(data.table)
  require(zoo)
  # create the events
  Events <- data.table(EventNumber = c(1:sizeE),
                       Location = as.character(sample(c(1:floor(sizeO / 10)), size = sizeE, replace = T)),
                       DayEvent = as.Date(as.integer(runif(sizeE) * 1000)))
  # create the objects
  Objects <- data.table(ObjectNumber = c(1:sizeO),
                        Location = as.character(sample(c(1:floor(sizeO / 10)), size = sizeO, replace = T)),
                        Day1 = as.Date(as.integer(runif(sizeO) * 1000)),
                        Day2 = as.Date(as.integer(runif(sizeO) * 1000)))
  Objects[, DayStart := as.Date(ifelse(Day1 > Day2, Day2, Day1))]
  Objects[, DayEnd   := as.Date(ifelse(Day1 < Day2, Day2, Day1))]
  Objects[, c("Day1", "Day2") := NULL]

  ## first method
  # set keys right for the coupling/counting
  setkey(Objects, Location, DayStart, DayEnd)
  setkey(Events, Location, DayEvent)
  # count the number of events
  cat("First method:")
  cat(system.time(
    res1 <- Objects[, NumberEvents := Events[Location, ][DayEvent >= DayStart & DayEvent <= DayEnd, .N],
                    by = list(DayStart, DayEnd, Location)]
  ))
  cat("\n")

  ## second method
  # set keys right for the coupling/counting
  setkey(Objects, Location)
  setkey(Events, Location)
  # count the number of events
  cat("Second method:")
  cat(system.time({
    oe <- Objects[Events, allow.cartesian = T]
    res2 <- oe[, sum(DayEvent >= DayStart & DayEvent <= DayEnd),
               by = list(ObjectNumber, DayStart, DayEnd, Location)]
  }))
  cat("\n")

  # comparing
  setkey(res1, ObjectNumber, Location, DayStart, DayEnd)
  setkey(res2, ObjectNumber, Location, DayStart, DayEnd)
  cat("Compare values: ", nrow(res1[res2][NumberEvents != V1, ]), " mismatches\n")
  return(list(res1 = res1, res2 = res2))
}
and here the results:
xx <- coupleEventObject1(200,6000)
First method:8.151 0.041 8.15 0 0
Second method:0.614 0.017 0.625 0 0
Compare values: 0 mismatches
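As a hedged aside that is not part of the answer above: data.table 1.9.8+ also supports non-equi joins, which can express this count directly without building the cartesian product first. A rough sketch using the same Events/Objects tables (cnt and NumberEvents2 are just illustrative names) might look like this:

# count, for each Object row, the Events at the same Location within [DayStart, DayEnd]
cnt <- Events[Objects,
              on = .(Location, DayEvent >= DayStart, DayEvent <= DayEnd),
              .(NumberEvents = sum(!is.na(EventNumber))),  # unmatched objects count as 0
              by = .EACHI]
Objects[, NumberEvents2 := cnt$NumberEvents]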

Constructing an R data.table by selecting each row from an array of tables

Assume I have a list of length D containing data.table objects. Each data.table has the same columns (X, Y) and same number of rows N. I'd like to construct another table with N rows, with the individual rows taken from the tables specified by an index vector also of length N. Restated, each row in the final table is taken from one and only one of the tables in the array, with the index of the source table specified by an existing vector.
N = 100 # rows in each table (actual ~1000000 rows)
D = 4 # number of tables in array (actual ~100 tables)
tableArray = vector("list", D)
for (d in 1:D) {
  tableArray[[d]] = data.table(X = rnorm(N), Y = d)  # actual ~100 columns
}
tableIndexVector = sample.int(D, N, replace=TRUE) # length N of random 1:D
finalTable = copy(tableArray[[1]]) # just for length and column names
for (n in 1:N) {
  finalTable[n] = tableArray[[tableIndexVector[n]]][n]
}
This seems to work the way I want, but the array within array notation is hard to understand, and I presume the performance of the for loop isn't going to be very good. It seems like there should be some elegant way of doing this, but I haven't stumbled across it yet. Is there another way of doing this that is efficient and less arcane?
(In case you are wondering, each table in the array represents simulated counterfactual observations for a subject under a particular regime of treatment, and I want to sample from these with different probabilities to test the behavior of different regression approaches with different ratios of regimes observed.)
for loops work just fine with data.table but we can improve the performance of your specific loop significantly (I believe) using the following approaches.
Approach # 1
Use set instead, as it avoids the [.data.table overhead
Don't loop over 1:N because you can simplify your loop to run only on unique values of tableIndexVector and assign all the corresponding values at once. This should decrease the run time by at least x10K (as N is of size 1MM and D is only of size 100, while unique(tableIndexVector) <= D)
So you basically could convert your loop to the following
for (i in unique(tableIndexVector)) {
  indx <- which(tableIndexVector == i)
  set(finalTable, i = indx, j = 1:2, value = tableArray[[i]][indx])
}
Approach # 2
Another approach is to use rbindlist and combine all the tables into one big data.table while adding the new idcol parameter in order to identify the different tables within the big table. You will need the devel version for that. This will avoid the loop as requested, but the result will be ordered by the tables appearance
temp <- rbindlist(tableArray, idcol = "indx")
indx <- temp[, .I[which(tableIndexVector == indx)], by = indx]$V1
finalTable <- temp[indx]
Here's a benchmark on bigger data set
N = 100000
D = 10
tableArray = vector("list", D)
set.seed(123)
for (d in 1:D) {
  tableArray[[d]] = data.table(X = rnorm(N), Y = d)
}
set.seed(123)
tableIndexVector = sample.int(D, N, replace=TRUE)
finalTable = copy(tableArray[[1]])
finalTable2 = copy(tableArray[[1]])
## Your approach
system.time(for (n in 1:N) {
  finalTable[n] = tableArray[[tableIndexVector[n]]][n]
})
# user system elapsed
# 154.79 33.14 191.57

## My approach # 1
system.time(for (i in unique(tableIndexVector)) {
  indx <- which(tableIndexVector == i)
  set(finalTable2, i = indx, j = 1:2, value = tableArray[[i]][indx])
})
# user system elapsed
# 0.01 0.00 0.02
## My approach # 2
system.time({
  temp <- rbindlist(tableArray, idcol = "indx")
  indx <- temp[, .I[which(tableIndexVector == indx)], by = indx]$V1
  finalTable3 <- temp[indx]
})
# user system elapsed
# 0.11 0.00 0.11
identical(finalTable, finalTable2)
## [1] TRUE
identical(setorder(finalTable, X), setorder(finalTable3[, indx := NULL], X))
## [1] TRUE
So, in conclusion:
My first approach is by far the fastest, running about 15K times faster than your original one. It also returns an identical result.
My second approach is still about 1.5K times faster than your original approach and avoids the loop (which you don't like for some reason), though the result is ordered by the tables' appearance, so the row order isn't identical to your result.
