R data frame uses values in current row from previous row

I have a data frame in R as defined below:
df <- data.frame(ID = c(1, 1, 1, 1),
                 Month = c('M1', 'M2', 'M3', 'M4'),
                 Initial.Balance = c(100, 100, 100, 0),
                 Value = c(0.1, 0.2, 0.2, 0.2),
                 Threshold = c(0.05, 0.18, 0.25, 0.25),
                 Intermediate.Balance = c(0, 0, 100, 0),
                 Final.Balance = c(100, 100, 0, 0))
Each row's Initial.Balance must come from the previous row's Final.Balance.
When Value >= Threshold: Intermediate.Balance = 0 and Final.Balance = Initial.Balance - Intermediate.Balance.
When Value < Threshold: Intermediate.Balance = Initial.Balance and Final.Balance = Initial.Balance - Intermediate.Balance.
I have tried to accomplish this task using a for loop, but it takes a lot of time on a large dataset (with many IDs).
Here is my solution:
for (i in 1:nrow(df)) {
  df$Intermediate.Balance[i] <- ifelse(df$Value[i] > df$Threshold[i], 0, df$Initial.Balance[i])
  df$Final.Balance[i] <- df$Initial.Balance[i] - df$Intermediate.Balance[i]
  if (i + 1 <= nrow(df)) {
    df$Initial.Balance[i + 1] <- df$Final.Balance[i]
  }
}
Can we find a similar solution using data.table? Since data.table operations are quicker than a for loop over a data frame, I believe this will help me save computation time.
Thanks.

I think in this particular case, the final balance goes to 0 once there is a row with Value less than or equal to Threshold, and all subsequent balances are 0 as well. So you can use this:
ib <- 100
df[, InitBal := ib * 0^shift(cumsum(Value<=Threshold), fill=0L)]
df[, ItmdBal := replace(rep(0, .N), which(Value<=Threshold)[1L], ib)]
df[, FinlBal := InitBal - ItmdBal]
or in one []:
df[, c("InitBal", "ItmdBal", "FinlBal") := {
v <- Value<=Threshold
InitBal <- ib * 0^shift(cumsum(v), fill=0L)
ItmdBal <- replace(rep(0, .N), which(v)[1L], ib)
.(InitBal, ItmdBal, InitBal - ItmdBal)
}]
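Both snippets assume df is already a data.table and that ib holds the opening balance; if you are starting from the data.frame defined in the question, convert it first (a minimal sketch):
library(data.table)
setDT(df)
ib <- df$Initial.Balance[1L]
The 0^shift(cumsum(v)) trick works because 0^0 is 1 and 0^k is 0 for any k > 0, so InitBal stays at ib up to and including the first row where Value <= Threshold, and is 0 from the next row onward.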
Or a more general approach using Rcpp when the intermediate balance is not simply equal to the initial balance:
library(Rcpp)
cppFunction('List calc(NumericVector Value, NumericVector Threshold, double init) {
  int n = Value.size();
  NumericVector InitialBalance(n), IntermediateBalance(n), FinalBalance(n);
  InitialBalance[0] = init;
  for (int i = 0; i < n; i++) {
    if (Value[i] <= Threshold[i]) {
      IntermediateBalance[i] = InitialBalance[i];
    }
    FinalBalance[i] = InitialBalance[i] - IntermediateBalance[i];
    if (i < n-1) {
      InitialBalance[i+1] = FinalBalance[i];
    }
  }
  return List::create(Named("InitialBalance") = InitialBalance,
                      Named("IntermediateBalance") = IntermediateBalance,
                      Named("FinalBalance") = FinalBalance);
}')
setDT(df)[, calc(Value, Threshold, Initial.Balance[1L])]
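Since the question mentions many IDs, the same function can be applied per group. A hedged sketch, assuming the first row of each ID carries that ID's opening balance:
setDT(df)[, c("Initial.Balance", "Intermediate.Balance", "Final.Balance")
            := calc(Value, Threshold, Initial.Balance[1L]), by = ID]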

I can't see an obvious way of getting rid of the loop, since each row depends deterministically on the previous one. That said, data.frames copy the whole frame, or at least whole columns, whenever you assign to some portion of them. As such, you can do this:
dt <- as.data.table(df)
for (i in 1:nrow(dt)) {
  dt[i, Intermediate.Balance := ifelse(Value > Threshold, 0, Initial.Balance)]
  dt[i, Final.Balance := Initial.Balance - Intermediate.Balance]
  if (i + 1 <= nrow(dt)) dt[i + 1, Initial.Balance := dt[i, Final.Balance]]
}
You could also try the set function, but I'm not sure whether it will be faster, or by how much, given that the data is read from the data.table on each iteration anyway.
dt <- as.data.table(df)
for (i in 1:nrow(dt)) {
  i <- as.integer(i)
  set(dt, i, "Intermediate.Balance", ifelse(dt[i, Value] > dt[i, Threshold], 0, dt[i, Initial.Balance]))
  set(dt, i, "Final.Balance", dt[i, Initial.Balance - Intermediate.Balance])
  if (i + 1 <= nrow(dt)) set(dt, i + 1L, "Initial.Balance", dt[i, Final.Balance])
}
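To see whether set actually helps, you can time both variants. A minimal sketch using microbenchmark; the wrapper names via_bracket and via_set are made up for illustration, and df is the data.frame from the question:
library(data.table)
library(microbenchmark)

via_bracket <- function(df) {
  dt <- as.data.table(df)
  for (i in 1:nrow(dt)) {
    dt[i, Intermediate.Balance := ifelse(Value > Threshold, 0, Initial.Balance)]
    dt[i, Final.Balance := Initial.Balance - Intermediate.Balance]
    if (i + 1 <= nrow(dt)) dt[i + 1, Initial.Balance := dt[i, Final.Balance]]
  }
  dt
}

via_set <- function(df) {
  dt <- as.data.table(df)
  for (i in seq_len(nrow(dt))) {
    set(dt, i, "Intermediate.Balance",
        if (dt[i, Value] > dt[i, Threshold]) 0 else dt[i, Initial.Balance])
    set(dt, i, "Final.Balance", dt[i, Initial.Balance - Intermediate.Balance])
    if (i + 1 <= nrow(dt)) set(dt, i + 1L, "Initial.Balance", dt[i, Final.Balance])
  }
  dt
}

microbenchmark(bracket = via_bracket(df), set = via_set(df), times = 10)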

Related

How to implement a function with a sum inside in R?

I am trying to define a function in RStudio containing a for loop with a conditional inside. Yesterday, with the help of another thread, I was able to devise the piece of code below. The problem is that I want to sum the elements of the vector ma for any possible x, inside the function l. This is a simpler case that I am trying to solve in order to adapt the original model, but I do not know how to proceed.
ma <- rep(0, 20)
l <- function(x, ma) {
  for (i in seq_along(ma)) {
    if (i %% 2 == 1) {
      ma[i] <- i + x
    } else {
      ma[i] <- 0
    }
  }
  return(ma)
}
My problem is that I would like to have the sum (i + x) + 0 + (i + x) + ... of the vector's elements for any possible x; that is, a function of the kind l(x) = sum(ma) for any possible x.
Question:
Can someone explain to me how to implement such a function in R?
Thanks in advance!
I am going to update the original function:
Theta_alpha_s <- function(s, alpha, t, Basis) {
  for (i in seq_along(Basis)) {
    if (i %% 2 == 1) {
      Basis[i] <- s * i^{-alpha - 0.5} * sqrt(2) * cos(2 * pi * i * t)
    } else {
      Basis[i] <- s * i^{-alpha - 0.5} * sqrt(2) * sin(2 * pi * i * t)
    }
  }
  return(Basis)
}
If you don't want to change the values in Basis, you can create a new vector in the function (here result) that you will return:
l <- function(s, alpha, t, Basis) {
  idx <- seq_along(Basis)            # test parity of positions, not of the values in Basis
  is.odd <- which(idx %% 2 == 1)
  not.odd <- which(idx %% 2 == 0)
  result <- rep(NA, length(Basis))
  result[is.odd] <- s * is.odd^{-alpha - 0.5} * sqrt(2) * cos(2 * pi * is.odd * t)
  result[not.odd] <- s * not.odd^{-alpha - 0.5} * sqrt(2) * sin(2 * pi * not.odd * t)
  # return(result)
  return(c(sum(result[is.odd]), sum(result[not.odd])))
}
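A quick check, using the same 20-element vector as in the question (parameter values chosen purely for illustration):
Basis <- rep(0, 20)
l(s = 1, alpha = 0.5, t = 0.1, Basis = Basis)
# returns c(sum over odd positions, sum over even positions)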

Having trouble with nested loops (R)

I've written this nested loop so that the inner loop runs through the first row; then the outer one updates the index so that the inner one runs through the second row (and so on). The data comes from supergerador, a matrix; rodadas is the number of rows and n is the number of columns. vec is the vector of interest. Thank you in advance!
Edit: i, j were initially assigned i = 1, j = 2
for(e in 1:rodadas) {
  for(f in 1:(n-1)) {
    if(j >= 10) {
      vec[f] = min(supergerador[i, j] - supergerador[i, j - 1], 1 - supergerador[i, j])
    }
    else {
      vec[f] = func(i, j)
    }
    j = j + 1
  }
  i = i + 1
}
func is defined as
func = function(i, j) {
  minf = min(supergerador[i, j] - supergerador[i, j - 1], supergerador[i, j + 1] - supergerador[i, j])
  return(minf)
}
For reference, this is what the nested loop returns. You can tell that it only went through a single row.
> vec
[1] 0.127387378 0.068119707 0.043472981 0.043472981 0.027431603 0.027431603
[7] 0.015739046 0.008010766 0.008010766
I'm not quite sure what you are intending to do here, but here are a few suggestions and code edits.
Suggestions:
1. If you have a for loop, use the loop index for your subsetting (as much as plausible) and avoid extra manually-updated indices. This avoids code clutter and unforeseen errors when indices should be reset but aren't.
2. Avoid subsetting the same variable repeatedly whenever possible. E.g. if you have multiple calls to x[i, j], store the value in a variable and then use that variable in your result.
3. Single-line functions are fine, but they should add readability to your code. Otherwise, inlining the code is optimal from an efficiency perspective.
Incorporating these into your code, I believe you are looking for
for(i in 1:rodadas) {
  for(j in 2:n) {
    x1 = supergerador[i, j]
    x2 = supergerador[i, j - 1]
    if(j >= 10) {
      vec[j - 1] = min(x1 - x2, 1 - x1)
    }
    else {
      vec[j - 1] = min(x1 - x2, supergerador[i, j + 1] - x1)
    }
  }
}
Here i am making the assumption that you wish to loop over columns for every row up to rodadas.
Once you get a bit more familiarized with R, you should look into vectorization. With a bit more knowledge of your problem, the inner for loop should be very easy to vectorize, removing the if statement and performing the calculation in one fast sweep; see the sketch below. Until then, this is a good place to start your programming experience, and a strong understanding of for loops is vital in any language.
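For instance, here is a minimal, untested sketch of what a vectorised inner loop could look like, assuming the j >= 10 branch exists to avoid reading past the last column (hence the pmin guard on the index below):
vec_row <- function(x, n) {
  j <- 2:n
  left <- x[j] - x[j - 1]                                    # x[j] - x[j-1] for all j at once
  right <- ifelse(j >= 10, 1 - x[j], x[pmin(j + 1, n)] - x[j])
  pmin(left, right)                                          # element-wise minimum
}
# then simply: for (i in 1:rodadas) vec <- vec_row(supergerador[i, ], n)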

Get out of infinite while loop

What is the best way to have a while loop recognize when it is stuck in an infinite loop in R?
Here's my situation:
diff_val = Inf
last_val = 0
while(diff_val > 0.1){
  ### calculate val from the data subset that is greater than the previous iteration's val
  val = foo(subset(data, col1 > last_val))
  diff_val = abs(val - last_val)  ### how much did this change val?
  last_val = val                  ### set last_val for the next iteration
}
The goal is to have val get progressively closer and closer to a stable value, and when val is within 0.1 of the val from the last iteration, then it is deemed sufficiently stable and is released from the while loop. My problem is that with some data sets, val gets stuck alternating back and forth between two values. For example, iterating back and forth between 27.0 and 27.7. Thus, it never stabilizes. How can I break the while loop if this occurs?
I know of break, but I do not know how to tell the loop when to use it. I imagine holding onto the value from two iterations before would work, but I do not know of a way to keep the value from two iterations ago...
while(diff_val > 0.1){
  val = foo(subset(data, col1 > last_val))
  diff_val = abs(val - last_val)
  last_val = val
  if(val == val_2_iterations_ago) break
}
How can I create val_2_iterations_ago?
Apologies for the non-reproducible code. The real foo() and data that are needed to replicate the situation are not mine to share... they aren't key to figuring out this issue with control flow, though.
I don't know if just keeping track of the previous two iterations will actually suffice, but it isn't too much trouble to add logic for this.
The logic is that at each iteration, the second to last value becomes the last value, the last value becomes the current value, and the current value is derived from foo(). Consider this code:
val_2_iterations_ago <- Inf   # initialise so the first comparison is well-defined
while (diff_val > 0.1) {
  val <- foo(subset(data, col1 > last_val))
  if (val == val_2_iterations_ago) break
  diff_val <- abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
}
Another approach, perhaps a little more general, would be to track your iterations and set a maximum.
Pairing this with Tim's nice answer:
iter <- 0
max_iter <- 1e6
val_2_iterations_ago <- Inf   # as above
while (diff_val > 0.1 && iter < max_iter) {
  val <- foo(subset(data, col1 > last_val))
  if (val == val_2_iterations_ago) break
  diff_val <- abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
  iter <- iter + 1
}
How this is generally done is that you have:
1. A convergence tolerance, so that when your objective function doesn't change appreciably, the algorithm is deemed to have converged.
2. A limit on the number of iterations, so that the code is guaranteed to terminate eventually.
3. A check that the objective function is actually decreasing, to catch the situation where it's diverging or cycling (many optimisation algorithms are designed so this shouldn't happen, but in your case it does).
Pseudocode:
oldVal <- Inf
for (i in 1:NITERS)
{
  val <- objective(x)
  diffVal <- val - oldVal
  converged <- (diffVal <= 0 && abs(diffVal) < TOL)
  if (converged || diffVal > 0)
    break
  oldVal <- val
}
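To make the pattern concrete, here is a minimal, self-contained sketch with a toy objective standing in for the asker's foo() (x is simply halved each step, so the objective shrinks towards 0):
TOL <- 0.1
NITERS <- 1000
x <- 10
oldVal <- Inf
for (i in 1:NITERS) {
  x <- x / 2        # stands in for the real update step
  val <- x^2        # toy objective
  diffVal <- val - oldVal
  if (diffVal <= 0 && abs(diffVal) < TOL) break   # converged
  if (diffVal > 0) break                          # diverging or cycling
  oldVal <- val
}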

Slow loop using data.table and lapply in R

I have a function fnEp_ that takes a value from a data.table column, a data.table and a logical. I am trying to make the function iterate over each element of eltIndexEnriched$Max (see below). It works, but it is very slow. I wonder if there is a better way to iterate, or perhaps a setting to make it faster.
Here I am creating a column EP_1 in the data.table from the function fnEp_, using lapply:
eltIndexEnriched <- eltIndexEnriched[, EP_1 :=
lapply(Max, fnEp_, dt = eltIndexEnriched, Indemn_Bool = TRUE)]
fnEp_ <- function(Att_, dt, Indemn_Bool) {
  if (Indemn_Bool == TRUE) {
    retval <- 1 - exp(-1 * sum(dt$Rate * ifelse(Att_ > dt$Max,
                  0, 1 - pbeta(Att_ / dt$Max, dt$Alpha, dt$Beta))))
  } else {
    retval <- dt[Mean > Att_, 1 - exp(-1 * sum(Rate))]
  }
  return(retval)
}
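One small, hedged observation: it won't remove the main cost of passing the whole table into fnEp_ for every element, but since fnEp_ returns a single number, vapply would keep EP_1 a plain numeric column rather than a list column:
eltIndexEnriched[, EP_1 := vapply(Max, fnEp_, numeric(1),
                                  dt = eltIndexEnriched, Indemn_Bool = TRUE)]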

Fastest way to drop rows with missing values?

I'm working with a large dataset x. I want to drop rows of x that are missing in one or more columns in a set of columns of x, that set being specified by a character vector varcols.
So far I've tried the following:
require(data.table)
x <- CJ(var1=c(1,0,NA),var2=c(1,0,NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1","var2")
x[, missing := apply(sapply(.SD,is.na),1,any),.SDcols=varcols]
x <- x[!missing]
Is there a faster way of doing this?
Thanks.
This should be faster than using apply:
x[rowSums(is.na(x[, ..varcols])) == 0, ]
# var1 var2 textcol
# 1: 0 0 e
# 2: 0 1 f
# 3: 1 0 h
# 4: 1 1 i
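Not from the answers below, but worth adding to any benchmark: base R's complete.cases performs the same row-wise NA check in C:
x[complete.cases(x[, ..varcols]), ]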
Here is a revised version of a C++ solution with a number of modifications based on a long discussion with Matthew (see comments below). I am new to C++, so I am sure someone can still improve this.
After library("RcppArmadillo") you should be able to run the whole file, including the benchmark, using sourceCpp('cleanmat.cpp'). The C++ file includes two functions: cleanmat takes two arguments (X and the index of the columns) and returns the matrix, restricted to those columns, with the rows containing missing values removed; keep takes one argument X and returns a logical vector.
Note about passing data.table objects: these functions do not accept a data.table as an argument. They would have to be modified to take a DataFrame as an argument (see here).
cleanmat.cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
mat cleanmat(mat X, uvec idx) {
  // keep only the requested columns (idx is 1-based)
  X = X.cols(idx - 1);
  // get dimensions
  int n = X.n_rows, k = X.n_cols;
  // create keep vector
  vec keep = ones<vec>(n);
  for (int j = 0; j < k; j++)
    for (int i = 0; i < n; i++)
      if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;
  // alternative with a view for each row (slightly slower)
  /*vec keep = zeros<vec>(n);
  for (int i = 0; i < n; i++) {
    keep(i) = is_finite(X.row(i));
  }*/
  return (X.rows(find(keep == 1)));
}

// [[Rcpp::export]]
LogicalVector keep(NumericMatrix X) {
  int n = X.nrow(), k = X.ncol();
  // create keep vector
  LogicalVector keep(n, true);
  for (int j = 0; j < k; j++)
    for (int i = 0; i < n; i++)
      if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;
  return (keep);
}
/*** R
require("Rcpp")
require("RcppArmadillo")
require("data.table")
require("microbenchmark")
# create matrix
X = matrix(rnorm(1e+07),ncol=100)
X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]=NA
colnames(X)=paste("c",1:ncol(X),sep="")
idx=sample(ncol(X),90)
microbenchmark(
X[!apply(X[,idx],1,function(X) any(is.na(X))),idx],
X[rowSums(is.na(X[,idx])) == 0, idx],
cleanmat(X,idx),
X[keep(X[,idx]),idx],
times=3)
# output
# Unit: milliseconds
# expr min lq median uq max
# 1 cleanmat(X, idx) 253.2596 259.7738 266.2880 272.0900 277.8921
# 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
# 3 X[keep(X[, idx]), idx] 360.8254 361.5165 362.2077 371.2061 380.2045
# 4 X[rowSums(is.na(X[, idx])) == 0, idx] 358.4772 367.5698 376.6625 379.6093 382.5561
*/
For speed, with a large number of varcols, perhaps look to iterate by column. Something like this (untested) :
keep = rep(TRUE,nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]
The issue with is.na is that it creates a new logical vector to hold its result, which then must be looped through by R to find the TRUEs so it knows which of the keep to set FALSE. However, in the above for loop, R can reuse the (identically sized) previous temporary memory for that result of is.na, since it is marked unused and available for reuse after each iteration completes. IIUC.
1. is.na(x[, ..varcols])
This is ok, but it creates a large copy to hold the logical matrix, which has as many columns as length(varcols). And the == 0 on the result of rowSums will need a new vector, too.
2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector again, and so will &. Each of the results of is.na has to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.
3. CJ(c(0,1),c(0,1))
Best so far but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.
So, the very fastest (I guess), would be a C version like this (pseudo-code) :
keep = rep(TRUE, nrow(x))
for (j=0; j<varcols; j++)
  for (i=0; i<nrow(x); i++)
    if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]
That would need one single allocation for keep (in C or R) and then the C loop would loop through the columns updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, inline package, or old school. It's important the two loops are that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NA in some rows, to save even fetching the later column values at all after the first NA in each row.
Two more approaches
two vector scans
x[!is.na(var1) & !is.na(var2)]
join with unique combinations of non-NA values
If you know the possible unique values in advance, this will be the fastest
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
Some timings
x <-data.table(var1 = sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
var2= sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
key = c('var1','var2'))
system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
user system elapsed
0.09 0.02 0.11
system.time(x[!is.na(var1) & !is.na(var2)])
user system elapsed
0.06 0.02 0.07
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
user system elapsed
0.03 0.00 0.04
