Slow loop using data.table and lapply in R - r

I have a function fnEp_ that takes a data.table column, a data.table and a logical type. I am trying to make the function iterate over each element of eltIndexEnriched$Max (see below). It works but it is very slow. I wonder if there is a better way to iterate or perhaps a setting to make it faster.
Here I create a column EP_1 in the data.table by applying the function fnEp_ with lapply:
eltIndexEnriched <- eltIndexEnriched[, EP_1 :=
lapply(Max, fnEp_, dt = eltIndexEnriched, Indemn_Bool = TRUE)]
fnEp_ <- function(Att_, dt, Indemn_Bool) {
if (Indemn_Bool == TRUE) {
retval <- (1 - exp(-1 * sum(dt$Rate * (ifelse(Att_ > dt$Max,
0, 1 - pbeta(Att_ / dt$Max, dt$Alpha, dt$Beta))))))
} else {
retval <- dt[Mean > Att_, 1 - exp(-1 * sum(Rate))]
}
return(retval)
}
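One possible speed-up, offered as a sketch rather than a tested drop-in: extract the columns once and use vapply, which returns a plain numeric vector instead of a list and avoids repeated dt$ lookups inside every call. The name fnEpVec_, its argument names and the toy values are hypothetical, and only the Indemn_Bool = TRUE branch is covered.

```r
# Hypothetical helper: same formula as the Indemn_Bool = TRUE branch of fnEp_,
# but operating on plain vectors that are extracted from the table once.
fnEpVec_ <- function(att, rate, mx, alpha, beta) {
  vapply(att, function(a) {
    # Per-row exceedance probability; zero where the attachment exceeds Max.
    surv <- ifelse(a > mx, 0, 1 - pbeta(a / mx, alpha, beta))
    1 - exp(-sum(rate * surv))
  }, numeric(1))
}

# Toy stand-in for eltIndexEnriched (made-up values).
toy <- data.frame(Rate = c(0.01, 0.02, 0.05),
                  Max = c(100, 250, 400),
                  Alpha = c(1, 2, 1.5),
                  Beta = c(3, 2, 4))
ep <- fnEpVec_(toy$Max, toy$Rate, toy$Max, toy$Alpha, toy$Beta)
```

On the real table this would presumably be called as eltIndexEnriched[, EP_1 := fnEpVec_(Max, Rate, Max, Alpha, Beta)], which yields a numeric column rather than a list column.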

Related

R dataframe uses values in current row from previous row

I have a data frame in R as defined below:
df <- data.frame('ID'=c(1,1,1,1),
'Month' =c('M1','M2','M3','M4'),
"Initial.Balance" =c(100,100,100,0),
"Value" = c(0.1,0.2,0.2,0.2),
"Threshold"=c(0.05,0.18,0.25,0.25),
"Intermediate.Balance"=c(0,0,100,0),
"Final.Balance"=c(100,100,0,0))
This task uses Initial.Balance (in current row) from the Final.Balance of the previous row.
When Value >= Threshold, Intermediate.Balance=0 and Final.Balance = Initial.Balance-Intermediate.Balance
When Value < Threshold, Intermediate.Balance = Initial.Balance and Final.Balance = Initial.Balance-Intermediate.Balance
I have tried to accomplish this task using a for loop, but it takes a lot of time on a large dataset (with many IDs).
Here is my solution:
for (i in 1:nrow(df)){
  df$Intermediate.Balance[i] <- ifelse(df$Value[i]>df$Threshold[i],0,df$Initial.Balance[i])
  df$Final.Balance[i] <- df$Initial.Balance[i]-df$Intermediate.Balance[i]
  if(i+1<=nrow(df)){
    df$Initial.Balance[i+1] <- df$Final.Balance[i] }
}
Is there a similar solution using data.table? Since data.table operations are faster than a for loop over a data frame, I believe this would save computation time.
Thanks,
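I think that in this particular case the final balance goes to 0 at the first row where Value does not exceed Threshold, and all subsequent balances stay at 0. So you can use this (df must be a data.table for := to work, hence the setDT call):
library(data.table)
setDT(df)
ib <- 100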
df[, InitBal := ib * 0^shift(cumsum(Value<=Threshold), fill=0L)]
df[, ItmdBal := replace(rep(0, .N), which(Value<=Threshold)[1L], ib)]
df[, FinlBal := InitBal - ItmdBal]
or in one []:
df[, c("InitBal", "ItmdBal", "FinlBal") := {
v <- Value<=Threshold
InitBal <- ib * 0^shift(cumsum(v), fill=0L)
ItmdBal <- replace(rep(0, .N), which(v)[1L], ib)
.(InitBal, ItmdBal, InitBal - ItmdBal)
}]
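To see why the 0^shift(cumsum(...)) trick works, here is a base R sketch on the example columns. It needs no data.table; c(0, head(x, -1)) stands in for shift(x, fill = 0L), and hits_before is a hypothetical name.

```r
Value <- c(0.1, 0.2, 0.2, 0.2)
Threshold <- c(0.05, 0.18, 0.25, 0.25)
ib <- 100

v <- Value <= Threshold                    # rows that wipe out the balance
hits_before <- c(0, head(cumsum(v), -1))   # number of such rows strictly above each row
InitBal <- ib * 0^hits_before              # 0^0 is 1, and 0^k is 0 for k > 0
ItmdBal <- replace(numeric(length(v)), which(v)[1L], ib)  # only the first hit carries ib
FinlBal <- InitBal - ItmdBal
```

The initial balance stays at ib until one row above has triggered, after which the power of zero switches it off for good.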
Or a more general approach using Rcpp when the intermediate balance is not simply equal to the initial balance:
library(Rcpp)
cppFunction('List calc(NumericVector Value, NumericVector Threshold, double init) {
int n = Value.size();
NumericVector InitialBalance(n), IntermediateBalance(n), FinalBalance(n);
InitialBalance[0] = init;
for (int i=0; i<n; i++) {
if (Value[i] <= Threshold[i]) {
IntermediateBalance[i] = InitialBalance[i];
}
FinalBalance[i] = InitialBalance[i] - IntermediateBalance[i];
if (i < n-1) {
InitialBalance[i+1] = FinalBalance[i];
}
}
return List::create(Named("InitialBalance") = InitialBalance,
Named("IntermediateBalance") = IntermediateBalance,
Named("FinalBalance") = FinalBalance);
}')
setDT(df)[, calc(Value, Threshold, Initial.Balance[1L])]
I can't see an obvious way of getting rid of the loop, since each row determines the next. That said, a data.frame copies the whole frame, or at least whole columns, whenever you assign to part of it, whereas data.table's := updates by reference. So you can do this:
dt<-as.data.table(df)
for(i in 1:nrow(dt)) {
dt[i,Intermediate.Balance:=ifelse(Value>Threshold,0,Initial.Balance)]
dt[i,Final.Balance:=Initial.Balance-Intermediate.Balance]
if(i+1<=nrow(dt)) dt[i+1,Initial.Balance:=dt[i,Final.Balance]]
}
You could also try the set function but I'm not sure if it'll be faster, or by how much, given that the data comes from the data.table anyway.
dt<-as.data.table(df)
for(i in 1:nrow(dt)) {
i<-as.integer(i)
set(dt,i,"Intermediate.Balance", ifelse(dt[i,Value]>dt[i,Threshold],0,dt[i,Initial.Balance]))
set(dt,i,"Final.Balance", dt[i,Initial.Balance-Intermediate.Balance])
if(i+1<=nrow(dt)) set(dt,i+1L,"Initial.Balance", dt[i,Final.Balance])
}
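Another option, sketched here under assumptions: keep the loop but run it over plain vectors and only build a data frame at the end, so nothing is copied or looked up in a table on each iteration. The function name calc_balance and the second ID in the toy data are hypothetical; the rule used is the one from the question's loop (Value > Threshold keeps the balance).

```r
# Row-by-row recursion on plain vectors for one ID.
calc_balance <- function(value, threshold, init) {
  n <- length(value)
  initial <- intermediate <- final <- numeric(n)
  bal <- init
  for (i in seq_len(n)) {
    initial[i] <- bal
    intermediate[i] <- if (value[i] > threshold[i]) 0 else bal
    final[i] <- bal - intermediate[i]
    bal <- final[i]                      # carried into the next row
  }
  data.frame(Initial.Balance = initial,
             Intermediate.Balance = intermediate,
             Final.Balance = final)
}

# Toy input with two IDs (the second ID is made up).
toy <- data.frame(ID = c(1, 1, 1, 1, 2, 2),
                  Value = c(0.1, 0.2, 0.2, 0.2, 0.3, 0.1),
                  Threshold = c(0.05, 0.18, 0.25, 0.25, 0.2, 0.2),
                  Start = c(100, 100, 100, 100, 50, 50))

# Apply the recursion separately to each ID and stack the results.
res <- do.call(rbind, lapply(split(toy, toy$ID), function(g) {
  cbind(ID = g$ID, calc_balance(g$Value, g$Threshold, g$Start[1]))
}))
```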

Return() Not Working While Print() does after building a function in R

I'm working with panel data in R and am endeavoring to build a function that returns every user ID where PCA==1. I've largely gotten this to work, with one small problem: it only returns the values when I end the function with print() but does not do so when I end the function with return(). As I want the ids in a vector so I can later subset the data to only include those IDs, that's a problem. Code reflected below - can anyone advise on what I'm doing wrong?
The version that works (but doesn't do what I want):
retrievePCA<-function(data) {
for (i in 1:dim(data)[1]) {
if (data$PCA[i] == 1) {
id<-data$CPSIDP[i]
print(id)
}
}
}
retrievePCA(data)
The version that doesn't:
retrievePCA<-function(data) {
for (i in 1:dim(data)[1]) {
if (data$PCA[i] == 1) {
id<-data$CPSIDP[i]
return(id)
}
}
}
vector<-retrievePCA(data)
vector
Your problem is a simple misunderstanding of what returning from a function does.
Take the small example below
f <- function(x){
  x <- x * x
  return(x)
  x <- x * x   # never reached
  return(x)
}
f(2)
[1] 4
4 is returned, 16 is not. That is because return() exits the function immediately with the given value. So your function hits the first row where PCA[i] == 1 and exits right there. Instead, collect the ids in a vector or list and return that at the end.
retrievePCA<-function(data) {
ids <- vector('list', nrow(data))
for (i in 1:nrow(data)) {
if (data$PCA[i] == 1) {
ids[[i]] <-data$CPSIDP[i]
}
}
  return(unlist(ids))
}
However you could just do this in one line
data$CPSIDP[data$PCA == 1]
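As a small illustration of the one-liner and of the later subsetting step, here is a sketch on made-up data (toy and its values are hypothetical). Wrapping the condition in which() is an optional extra that drops rows where PCA is NA instead of returning NA ids.

```r
# Toy panel data: two users with PCA == 1, one with a missing PCA value.
toy <- data.frame(CPSIDP = c(11, 12, 13, 14),
                  PCA = c(1, 0, NA, 1))

ids <- toy$CPSIDP[which(toy$PCA == 1)]   # vector of matching ids, NAs skipped
kept <- toy[toy$CPSIDP %in% ids, ]       # subset the data to those ids
```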

Having trouble with nested loops (R)

I've written this nested loop so that the inner loop runs through the first row; the outer loop then moves on so that the inner one can run through the second row (and so on). The data comes from 'supergerador', a matrix. "rodadas" is the number of rows and "n" is the number of columns. "vec" is the vector of interest. Thank you in advance!
Edit: i, j were initially assigned i = 1, j = 2
for(e in 1:rodadas) {
for(f in 1:(n-1)) {
if(j >= 10) {
vec[f] = min(supergerador[i, j] - supergerador[i, j - 1], 1 - supergerador[i, j])
}
else {
vec[f] = func(i, j)
}
j = j + 1
}
i = i + 1
}
func is defined as
func = function(i, j) {
minf = min(supergerador[i, j] - supergerador[i, j - 1], supergerador[i, j + 1] - supergerador[i, j])
return(minf)
}
For reference, this is what the nested loop returns. You can tell that it only went through a single row.
> vec
[1] 0.127387378 0.068119707 0.043472981 0.043472981 0.027431603 0.027431603
[7] 0.015739046 0.008010766 0.008010766
I'm not quite sure what you intend to do here, but here are a few suggestions and code edits:
Suggestions:
If you have a for loop, use the loop index for your subsetting (as much as possible) and avoid additional indices where you can.
This avoids code clutter and unforeseen errors when indices should be reset but aren't.
Avoid repeated subsetting of the same element whenever possible. E.g. if you have multiple calls to x[i, j], store the value in a variable and use that variable in your result.
Single-line functions are fine if they add readability to your code. Otherwise, inlining the code is better from an efficiency perspective.
Incorporating these into your code, I believe you are looking for
for(i in 1:rodadas) {
  for(j in 2:n) {
    x1 = supergerador[i, j]
    x2 = supergerador[i, j - 1]
    # j runs from 2 to n, so j - 1 is the position in vec (the old index f)
    if(j >= 10) {
      vec[j - 1] = min(x1 - x2, 1 - x1)
    }
    else {
      vec[j - 1] = min(x1 - x2, supergerador[i, j + 1] - x1)
    }
  }
}
Here I am assuming that you wish to loop over the columns of every row up to rodadas. Note that vec is overwritten on each pass of the outer loop, so it ends up holding the values for the last row only; store the results in a matrix if you need every row.
Once you are more familiar with R you should look into vectorization. With a bit more knowledge of your problem, it should be easy to vectorize your second for loop, removing the if statement and performing the calculation in one fast sweep. Until then this is a good place to start, and a strong understanding of for loops is vital in any language.
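A rough sketch of what such a vectorized version could look like, under two assumptions that are not confirmed by the question: the hard-coded j >= 10 is taken to mean "the last column", and one result row per matrix row is wanted. The function name min_gap and the toy matrix are hypothetical.

```r
# For every row: min(x[j] - x[j-1], x[j+1] - x[j]) for j = 2..n-1,
# and min(x[n] - x[n-1], 1 - x[n]) for the last column.
min_gap <- function(m) {
  n <- ncol(m)
  # Gap to the left neighbour, for columns 2..n.
  left <- m[, 2:n, drop = FALSE] - m[, 1:(n - 1), drop = FALSE]
  # Gap to the right neighbour; the last column uses the distance to 1 instead.
  right <- cbind(left[, -1, drop = FALSE], 1 - m[, n])
  pmin(left, right)   # elementwise minimum, keeps the matrix shape
}

toy <- matrix(c(0.1, 0.3, 0.6,
                0.2, 0.5, 0.9), nrow = 2, byrow = TRUE)
gaps <- min_gap(toy)
```

Each row of gaps then plays the role that vec plays for a single row in the loop version.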

R -- screening Excel rows according to characteristics of multiple cells

I am trying to eliminate all rows in an Excel sheet that have the following features:
First column is an integer
Second column begins with an integer
Third column is empty
The code I have written appears to run indefinitely. CAS.MULT is the name of my dataframe.
for (i in 1:nrow(CAS.MULT)) {
testInteger <- function(x) {
test <- all.equal(x, as.integer(x), check.attributes = FALSE)
if (test == TRUE) {
return (TRUE)
}
else {
return (FALSE)
}
}
if (testInteger(as.integer(CAS.MULT[i,1])) == TRUE) {
if (testInteger(as.integer(substring(CAS.MULT[i,2],1,1))) == TRUE) {
if (CAS.MULT[i,3] == '') {
CAS.MULT <- data.frame(CAS.MULT[-i,])
}
}
}
}
You should be very wary of deleting rows within a for loop; it often leads to undesired behavior, because the row indices shift while the loop still runs over the original range. There are a number of ways you could handle this. For instance, you can flag the rows for deletion and then delete them afterwards.
Another thing I noticed is that you are converting your columns to integers before passing them to your function to test if they are integers, so you will be incorrectly returning true for all values passed to the function.
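The second point can be checked with a two-line sketch (the value 3.7 is just an example): once a value has been passed through as.integer(), comparing it with its own integer version can never fail.

```r
x <- 3.7

# Converting first throws the fractional part away, so the test always passes.
always_true <- isTRUE(all.equal(as.integer(x), as.integer(as.integer(x))))

# Testing the raw value detects that 3.7 is not a whole number.
raw_check <- isTRUE(all.equal(x, as.integer(x)))
```

Note the isTRUE() wrapper: all.equal() returns a character description rather than FALSE when the values differ, so its result should not be compared with == TRUE directly.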
Maybe something like this would work (without a reproducible example it's hard to say if it will work or not):
# Define the helper once, outside the loop. It coerces to numeric first, so
# text such as "7" counts as an integer and non-numeric text does not.
testInteger <- function(x) {
  x <- suppressWarnings(as.numeric(as.character(x)))
  !is.na(x) && x %% 1 == 0
}
toDelete <- integer(0)
for (i in seq_len(nrow(CAS.MULT))) {
  if (testInteger(CAS.MULT[i,1]) &&
      testInteger(substring(CAS.MULT[i,2],1,1)) &&
      CAS.MULT[i,3] %in% '') {
    toDelete <- c(toDelete, i)
  }
}
# Only subset when something was flagged: CAS.MULT[-integer(0),] would drop every row.
if (length(toDelete) > 0) CAS.MULT <- CAS.MULT[-toDelete,]
Hard to be sure without testing my code on your data, but this might work. Instead of a loop, the code below uses logical indexing based on the conditions you specified in your question. This is vectorized (meaning it operates on the entire data frame at once, rather than by row) and is much faster than looping row by row:
CAS.MULT.screened = CAS.MULT[!((CAS.MULT[,1] %% 1 == 0 &
                                as.numeric(substring(CAS.MULT[,2],1,1)) %% 1 == 0 &
                                CAS.MULT[,3] == "") %in% TRUE), ]
For more on checking whether a value is an integer, see this SO question.
One other thing: Just for future reference, for efficiency you should define your function outside the loop, rather than recreating the function every time through the loop.
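To make the logical-indexing idea concrete, here is a sketch on made-up data, assuming a row is dropped only when all three conditions from the question hold at once. The data frame toy, its column names and the grepl() digit test are illustrative choices, not part of the original code.

```r
toy <- data.frame(col1 = c(1, 2.5, 3, 4),
                  col2 = c("1abc", "2x", "xyz", "7q"),
                  col3 = c("", "", "", "keep"),
                  stringsAsFactors = FALSE)

first_char <- substring(toy$col2, 1, 1)
drop <- toy$col1 %% 1 == 0 &              # first column is a whole number
  grepl("^[0-9]$", first_char) &          # second column begins with a digit
  toy$col3 %in% ""                        # third column is empty (NA counts as not empty)
screened <- toy[!drop, ]
```

Only the first toy row meets all three conditions, so it is the only one removed.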

R - vectorised conditional replace

Hi, I'm trying to manipulate a list of numbers and I would like to do so without a for loop, using fast native operations in R. The pseudocode for the manipulation is:
By default the starting total is 100 (for every block between zeros)
From the first zero to the next zero, the moment the cumulative total falls by more than 2% from its running maximum, replace all subsequent numbers with zero.
Do this for all blocks of numbers between zeros
The cumulative sum resets to 100 every time
For example if following were my data :
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);
Results would be :
0 0 0 1 3 4 5 -1 2 3 -5 0 0 0 -2 -3 0 0 0 0 0 -1 -1 -1 0
Currently I have an implementation with a for loop, but since my vector is really long, the performance is terrible.
Thanks in advance.
Here is a running sample code :
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);
ans <- d;
running_total <- 100;
count <- 1;
max <- 100;
toggle <- FALSE;
processing <- FALSE;
for(i in d){
if( i != 0 ){
processing <- TRUE;
if(toggle == TRUE){
ans[count] = 0;
}
else{
running_total = running_total + i;
if( running_total > max ){ max = running_total;}
else if ( 0.98*max > running_total){
toggle <- TRUE;
}
}
}
if( i == 0 && processing == TRUE )
{
running_total = 100;
max = 100;
toggle <- FALSE;
}
count <- count + 1;
}
cat(ans)
I am not sure how to translate your loop into vectorized operations. However, there are two fairly easy options for large performance improvements. The first is to simply put your loop into an R function and use the compiler package to precompile it. The second, slightly more complicated option is to translate your R loop into a C++ loop and use the Rcpp package to link it to an R function. You then call an R function that hands the data to the C++ code, which is fast. I show both these options and timings. I do want to gratefully acknowledge the help of Alexandre Bujard from the Rcpp listserv, who helped me with a pointer issue I did not understand.
First, here is your R loop as a function, foo.r.
## Your R loop as a function
foo.r <- function(d) {
ans <- d
running_total <- 100
count <- 1
max <- 100
toggle <- FALSE
processing <- FALSE
for(i in d){
if(i != 0 ){
processing <- TRUE
if(toggle == TRUE){
ans[count] <- 0
} else {
running_total = running_total + i;
if (running_total > max) {
max <- running_total
} else if (0.98*max > running_total) {
toggle <- TRUE
}
}
}
if(i == 0 && processing == TRUE) {
running_total <- 100
max <- 100
toggle <- FALSE
}
count <- count + 1
}
return(ans)
}
Now we can load the compiler package and compile the function and call it foo.rcomp.
## load compiler package and compile your R loop
require(compiler)
foo.rcomp <- cmpfun(foo.r)
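A note for readers on current R: since R 3.4.0 the JIT byte-code compiler is enabled by default, so functions are compiled automatically after their first few calls and an explicit cmpfun() is likely to give a much smaller gain than on older versions. A small sketch to check this on your own machine; sum_loop is a hypothetical toy function.

```r
library(compiler)

# Toy loop function (hypothetical) and its explicitly compiled twin.
sum_loop <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}
sum_loop_cmp <- cmpfun(sum_loop)

# A negative level reports the current JIT setting without changing it.
jit_level <- enableJIT(-1)

# Compilation must not change the result.
same <- identical(sum_loop(1000), sum_loop_cmp(1000))
```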
That is all it takes for the compilation route. It is all R and obviously very easy. Now for the C++ approach, we use the Rcpp package as well as the inline package, which allows us to "inline" the C++ code. That is, we do not have to make a source file and compile it; we just include it in the R code and the compilation is handled for us.
## load Rcpp package and inline for ease of linking
require(Rcpp)
require(inline)
## Rcpp version
src <- '
const NumericVector xx(x);
int n = xx.size();
NumericVector res = clone(xx);
int toggle = 0;
int processing = 0;
double tot = 100; // double, so that non-integer inputs are not truncated
double max = 100;
typedef NumericVector::iterator vec_iterator;
vec_iterator ixx = xx.begin();
vec_iterator ires = res.begin();
for (int i = 0; i < n; i++) {
if (ixx[i] != 0) {
processing = 1;
if (toggle == 1) {
ires[i] = 0;
} else {
tot += ixx[i];
if (tot > max) {
max = tot;
} else if (.98 * max > tot) {
toggle = 1;
}
}
}
if (ixx[i] == 0 && processing == 1) {
tot = 100;
max = 100;
toggle = 0;
}
}
return res;
'
foo.rcpp <- cxxfunction(signature(x = "numeric"), src, plugin = "Rcpp")
Now we can test that we get the expected results:
## demonstrate equivalence
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
all.equal(foo.r(d), foo.rcpp(d))
Finally, create a much larger version of d by repeating it 10^5 times. Then we can run the three different functions: pure R code, compiled R code, and the R function linked to C++ code.
## make larger vector to test performance
dbig <- rep(d, 10^5)
system.time(res.r <- foo.r(dbig))
system.time(res.rcomp <- foo.rcomp(dbig))
system.time(res.rcpp <- foo.rcpp(dbig))
Which on my system, gives:
> system.time(res.r <- foo.r(dbig))
user system elapsed
12.55 0.02 12.61
> system.time(res.rcomp <- foo.rcomp(dbig))
user system elapsed
2.17 0.01 2.19
> system.time(res.rcpp <- foo.rcpp(dbig))
user system elapsed
0.01 0.00 0.02
The compiled R code takes about 1/6 the time of the uncompiled R code, needing only about 2 seconds to operate on the vector of 2.5 million elements. The C++ code is orders of magnitude faster even than the compiled R code, requiring just 0.02 seconds to complete. Aside from the initial setup, the syntax for the basic loop is nearly identical in R and C++, so you do not even lose clarity. I suspect that even if parts or all of your loop could be vectorized in R, you would be hard pressed to beat the performance of the R function linked to C++. Lastly, just for proof:
> all.equal(res.r, res.rcomp)
[1] TRUE
> all.equal(res.r, res.rcpp)
[1] TRUE
The different functions return the same results.
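For completeness, here is one way the loop might be vectorized in base R, given as a sketch rather than a benchmarked solution. It treats every zero as the start of a new block and assumes the inputs are such that cumsum() within a block reproduces the loop's running total exactly (true for whole numbers; with fractional values the last floating-point bits could differ). The name foo.vec is hypothetical.

```r
foo.vec <- function(d) {
  blk <- cumsum(d == 0)                          # block id, bumped at every zero
  tot <- 100 + ave(d, blk, FUN = cumsum)         # running total, restarts at 100 per block
  mx <- pmax(100, ave(tot, blk, FUN = cummax))   # running maximum within the block
  breach <- as.numeric(0.98 * mx > tot)          # 1 where the total is more than 2% below the max
  # Count breaches strictly earlier in the same block; any such row gets zeroed.
  prior <- ave(breach, blk, FUN = function(b) cumsum(b) - b) > 0
  d[prior] <- 0
  d
}

d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
res.vec <- foo.vec(d)
```

The row that triggers the 2% rule is kept and only the rows after it in the same block are zeroed, which matches the behaviour of the toggle in the loop.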
