Fastest way to drop rows with missing values?

I'm working with a large dataset x. I want to drop the rows of x that have a missing value in one or more of a set of columns, that set being specified by a character vector varcols.
So far I've tried the following:
require(data.table)
x <- CJ(var1=c(1,0,NA),var2=c(1,0,NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1","var2")
x[, missing := apply(sapply(.SD,is.na),1,any),.SDcols=varcols]
x <- x[!missing]
Is there a faster way of doing this?
Thanks.

This should be faster than using apply:
x[rowSums(is.na(x[, ..varcols])) == 0, ]
#    var1 var2 textcol
# 1:    0    0       e
# 2:    0    1       f
# 3:    1    0       h
# 4:    1    1       i
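To check the speed claim, one could benchmark the two approaches on a larger table; this is an illustrative sketch (my addition, not from the answer):
library(data.table)
library(microbenchmark)
big <- as.data.table(matrix(sample(c(1, 0, NA), 2e6, replace = TRUE), ncol = 2))
varcols <- c("V1", "V2")
microbenchmark(
  apply   = big[!apply(sapply(big[, ..varcols], is.na), 1, any)],
  rowSums = big[rowSums(is.na(big[, ..varcols])) == 0],
  times = 10)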

Here is a revised version of a C++ solution with a number of modifications, based on a long discussion with Matthew (see comments below). I am new to C++, so I am sure someone can still improve this.
After library("RcppArmadillo") you should be able to run the whole file, including the benchmark, using sourceCpp('cleanmat.cpp'). The C++ file includes two functions. cleanmat takes two arguments (X and the index of the columns) and returns the matrix restricted to those columns, with any rows containing missing values removed. keep takes one argument X and returns a logical vector flagging the rows to keep.
Note about passing data.table objects: these functions do not accept a data.table as an argument. They have to be modified to take a DataFrame as an argument (see here); a sketch of such a variant follows the benchmark below.
cleanmat.cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
mat cleanmat(mat X, uvec idx) {
  // keep only the requested columns (idx is 1-based, coming from R)
  X = X.cols(idx - 1);
  // get dimensions
  int n = X.n_rows, k = X.n_cols;
  // create keep vector
  vec keep = ones<vec>(n);
  for (int j = 0; j < k; j++)
    for (int i = 0; i < n; i++)
      if (keep[i] && !is_finite(X(i, j))) keep[i] = 0;
  // alternative with a view for each row (slightly slower)
  /*vec keep = zeros<vec>(n);
  for (int i = 0; i < n; i++) {
    keep(i) = is_finite(X.row(i));
  }*/
  return X.rows(find(keep == 1));
}

// [[Rcpp::export]]
LogicalVector keep(NumericMatrix X) {
  int n = X.nrow(), k = X.ncol();
  // create keep vector
  LogicalVector keep(n, true);
  for (int j = 0; j < k; j++)
    for (int i = 0; i < n; i++)
      if (keep[i] && NumericVector::is_na(X(i, j))) keep[i] = false;
  return keep;
}
/*** R
require("Rcpp")
require("RcppArmadillo")
require("data.table")
require("microbenchmark")
# create matrix
X = matrix(rnorm(1e+07), ncol = 100)
X[sample(nrow(X), 1000, replace = TRUE), sample(ncol(X), 1000, replace = TRUE)] = NA
colnames(X) = paste("c", 1:ncol(X), sep = "")
idx = sample(ncol(X), 90)
microbenchmark(
  X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx],
  X[rowSums(is.na(X[, idx])) == 0, idx],
  cleanmat(X, idx),
  X[keep(X[, idx]), idx],
  times = 3)
# output
# Unit: milliseconds
#                                                      expr       min        lq    median        uq       max
# 1                                        cleanmat(X, idx)  253.2596  259.7738  266.2880  272.0900  277.8921
# 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
# 3                                  X[keep(X[, idx]), idx]  360.8254  361.5165  362.2077  371.2061  380.2045
# 4                   X[rowSums(is.na(X[, idx])) == 0, idx]  358.4772  367.5698  376.6625  379.6093  382.5561
*/
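Regarding the note above about data.table objects, here is a minimal sketch (my addition, untested) of a keep variant that takes a DataFrame instead of a matrix. It assumes every selected column is numeric; since data.table inherits from data.frame, it should accept one directly, e.g. x[keepDF(x[, ..varcols])].
// [[Rcpp::export]]
LogicalVector keepDF(DataFrame X) {
  int n = X.nrows(), k = X.size();
  LogicalVector keep(n, true);
  for (int j = 0; j < k; j++) {
    NumericVector col = X[j]; // assumes a numeric (double) column
    for (int i = 0; i < n; i++)
      if (keep[i] && NumericVector::is_na(col[i])) keep[i] = false;
  }
  return keep;
}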

For speed, with a large number of varcols, perhaps look to iterate by column. Something like this (untested) :
keep = rep(TRUE,nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]
The issue with is.na is that it creates a new logical vector to hold its result, which R must then loop through to find the TRUEs so it knows which elements of keep to set FALSE. In the for loop above, however, R can reuse the (identically sized) temporary memory for the result of is.na, since that memory is marked unused and available for reuse after each iteration completes. IIUC.
1. is.na(x[, ..varcols])
This is ok, but it allocates a logical matrix as wide as length(varcols). And the == 0 on the result of rowSums needs a new vector, too.
2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector, and so will &. Each result of is.na has to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.
3. CJ(c(0,1),c(0,1))
Best so far, but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory and loop through to populate that memory with all the combinations before the join can start.
So, the very fastest (I guess), would be a C version like this (pseudo-code) :
keep = rep(TRUE,nrow(x))
for (j=0; j<varcols; j++)
  for (i=0; i<nrow(x); i++)
    if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]
That would need one single allocation for keep (in C or R), and then the C loop would sweep through the columns, updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, with the inline package, or old school. It's important that the two loops are that way round, for cache efficiency. The keep[i] && part helps speed when some rows contain many NAs, by avoiding even fetching the later column values after the first NA in each row.

Two more approaches
two vector scans
x[!is.na(var1) & !is.na(var2)]
join with unique combinations of non-NA values
If you know the possible unique values in advance, this will be the fastest
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
Some timings
x <- data.table(var1 = sample(c(1,0,NA), 1e6, TRUE, prob = c(0.45,0.45,0.1)),
                var2 = sample(c(1,0,NA), 1e6, TRUE, prob = c(0.45,0.45,0.1)),
                key = c('var1','var2'))
system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
#   user  system elapsed
#   0.09    0.02    0.11
system.time(x[!is.na(var1) & !is.na(var2)])
#   user  system elapsed
#   0.06    0.02    0.07
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
#   user  system elapsed
#   0.03    0.00    0.04
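If the distinct non-NA values are not known in advance, the join table could be built from the data first. A rough sketch (my addition, assuming x is keyed on varcols as above); note that scanning for the unique values gives back some of the speed advantage:
vals <- lapply(varcols, function(j) sort(unique(na.omit(x[[j]]))))
x[do.call(CJ, vals), nomatch=0]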

Related

R dataframe uses values in current row from previous row

I have a data frame in R as defined below:
df <- data.frame(ID = c(1,1,1,1),
                 Month = c('M1','M2','M3','M4'),
                 Initial.Balance = c(100,100,100,0),
                 Value = c(0.1,0.2,0.2,0.2),
                 Threshold = c(0.05,0.18,0.25,0.25),
                 Intermediate.Balance = c(0,0,100,0),
                 Final.Balance = c(100,100,0,0))
In this task, the Initial.Balance of the current row is taken from the Final.Balance of the previous row.
When Value >= Threshold, Intermediate.Balance=0 and Final.Balance = Initial.Balance-Intermediate.Balance
When Value < Threshold, Intermediate.Balance = Initial.Balance and Final.Balance = Initial.Balance-Intermediate.Balance
I have tried to accomplish this task using a for loop, but it takes a lot of time on a large dataset (with many IDs).
Here is my solution:
for (i in 1:nrow(df)){
  df$Intermediate.Balance[i] <- ifelse(df$Value[i] > df$Threshold[i], 0, df$Initial.Balance[i])
  df$Final.Balance[i] <- df$Initial.Balance[i] - df$Intermediate.Balance[i]
  if(i+1 <= nrow(df)){
    df$Initial.Balance[i+1] <- df$Final.Balance[i]
  }
}
Can we get a similar solution using data.table? As data.table operations are quicker than a for loop over a data frame, I believe this will help me save computation time.
Thanks,
I think in this particular case the final balance drops to 0 at the first row where Value is less than or equal to Threshold, and all subsequent balances stay at 0. So you can use this (after setDT(df); ib is the starting balance):
ib <- 100
df[, InitBal := ib * 0^shift(cumsum(Value <= Threshold), fill=0L)]
df[, ItmdBal := replace(rep(0, .N), which(Value <= Threshold)[1L], ib)]
df[, FinlBal := InitBal - ItmdBal]
or in one []:
df[, c("InitBal", "ItmdBal", "FinlBal") := {
v <- Value<=Threshold
InitBal <- ib * 0^shift(cumsum(v), fill=0L)
ItmdBal <- replace(rep(0, .N), which(v)[1L], ib)
.(InitBal, ItmdBal, InitBal - ItmdBal)
}]
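As a quick check (my addition), the derived FinlBal column should reproduce the Final.Balance given in the question:
df[, all.equal(FinlBal, Final.Balance)]
# [1] TRUE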
Or a more general approach using Rcpp when the intermediate balance is not simply equal to the initial balance:
library(Rcpp)
cppFunction('List calc(NumericVector Value, NumericVector Threshold, double init) {
  int n = Value.size();
  NumericVector InitialBalance(n), IntermediateBalance(n), FinalBalance(n);
  InitialBalance[0] = init;
  for (int i = 0; i < n; i++) {
    if (Value[i] <= Threshold[i]) {
      IntermediateBalance[i] = InitialBalance[i];
    }
    FinalBalance[i] = InitialBalance[i] - IntermediateBalance[i];
    if (i < n-1) {
      InitialBalance[i+1] = FinalBalance[i];
    }
  }
  return List::create(Named("InitialBalance") = InitialBalance,
                      Named("IntermediateBalance") = IntermediateBalance,
                      Named("FinalBalance") = FinalBalance);
}')
setDT(df)[, calc(Value, Threshold, Initial.Balance[1L])]
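To store the results back into df rather than just return them, one could assign the returned list to new columns (my addition; the column names are arbitrary):
setDT(df)[, c("InitialBalance2", "IntermediateBalance2", "FinalBalance2") :=
            calc(Value, Threshold, Initial.Balance[1L])]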
I can't see an obvious way of getting rid of the loop, since each row depends deterministically on the previous one. That said, data.frames copy the whole frame, or at least whole columns, whenever you assign to part of them. As such you can do this:
dt <- as.data.table(df)
for(i in 1:nrow(dt)) {
  dt[i, Intermediate.Balance := ifelse(Value > Threshold, 0, Initial.Balance)]
  dt[i, Final.Balance := Initial.Balance - Intermediate.Balance]
  if(i+1 <= nrow(dt)) dt[i+1, Initial.Balance := dt[i, Final.Balance]]
}
You could also try the set function but I'm not sure if it'll be faster, or by how much, given that the data comes from the data.table anyway.
dt <- as.data.table(df)
for(i in 1:nrow(dt)) {
  i <- as.integer(i)
  set(dt, i, "Intermediate.Balance", ifelse(dt[i,Value] > dt[i,Threshold], 0, dt[i,Initial.Balance]))
  set(dt, i, "Final.Balance", dt[i, Initial.Balance - Intermediate.Balance])
  if(i+1 <= nrow(dt)) set(dt, i+1L, "Initial.Balance", dt[i, Final.Balance])
}

Loop for minimum spanning tree does not work

Three of us are trying to solve the minimum spanning tree with conflicts problem using R. We read files in .txt format that contain, for example,
"1 2 5
2 4 6" etc., which indicates that from node 1 to node 2 there exists an edge with weight 5, and
"1 2 2 4" etc., which indicates a conflict relationship between the edges 1-2 and 2-4. From this we have to form an n x n conflict matrix in which we store 0 if there is no conflict between a pair of edges and 1 if there is. For this purpose, we developed a triple for loop:
for(i in 1:dim(edges_read)[1]){
  for(k in 1:dim(edges_read)[1]){
    for(t in 1:dim(conflicts)[1]){
      if(all(conflicts[t,] == c(edges_read[i,1], edges_read[i,2],
                                edges_read[k,1], edges_read[k,2]))){
        conflictmatrix[i,k] <- 1
      }
    }
  }
}
However, R cannot get us a solution and these for loops take a very long time. How can we solve this? Thanks for further assistance.
As you have discovered, for() loops are not fast in R. There are faster approaches, but it's hard to provide examples without data. Please use something like dput(edges_read) and dput(conflicts) to provide a small example of the data.
As one example, you could implement the for loops in the Rcpp package for speed improvement. Based on the code in your question, you could re-implement the 3-loop code sort of like this:
Rcpp::cppFunction('NumericMatrix MSTC_nxn_Cpp(NumericMatrix edges_read, NumericMatrix conflicts){
  int n = edges_read.nrow(); // number of edges; output matrix is n x n
  int m = conflicts.nrow();  // number of conflict pairs
  NumericMatrix conflictmatrix( n , n ); // the output matrix
  for(int i = 0; i < n; i++){   // your i loop
    for(int k = 0; k < n; k++){ // your k loop
      double te = edges_read( i, 0 ); // same as edges_read[i,1]
      double tf = edges_read( i, 1 ); // same as edges_read[i,2]
      double tg = edges_read( k, 0 ); // same as edges_read[k,1]
      double th = edges_read( k, 1 ); // same as edges_read[k,2]
      NumericVector w = NumericVector::create(te, tf, tg, th);
      for(int t = 0; t < m; t++){ // your t loop
        NumericVector v = conflicts( t , _ ); // same as conflicts[t,]
        bool same = true; // equivalent of all(conflicts[t,] == c(...))
        for(int p = 0; p < 4; p++){
          if(v[p] != w[p]){ same = false; break; }
        }
        if(same){ conflictmatrix(i, k) = 1; }
      }
    }
  }
  return conflictmatrix; // your output
}')
#Then run the function
MSTC_nxn_Cpp(edges_read, conflicts )
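The question did not include data, so here is a small hypothetical input, just to make the call above runnable (my addition):
edges_read <- matrix(c(1,2,5, 2,4,6, 1,3,4), ncol = 3, byrow = TRUE) # one edge per row: from, to, weight
conflicts  <- matrix(c(1, 2, 2, 4), ncol = 4, byrow = TRUE)          # edge 1-2 conflicts with edge 2-4
MSTC_nxn_Cpp(edges_read, conflicts) # a 3 x 3 matrix with a 1 in row 1, column 2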

Get out of infinite while loop

What is the best way to have a while loop recognize when it is stuck in an infinite loop in R?
Here's my situation:
diff_val = Inf
last_val = 0
while(diff_val > 0.1){
### calculate val from data subset that is greater than the previous iteration's val
val = foo(subset(data, col1 > last_val))
diff_val = abs(val - last_val) ### how much did this change val?
last_val = val ### set last_val for the next iteration
}
The goal is to have val get progressively closer and closer to a stable value, and when val is within 0.1 of the val from the last iteration, then it is deemed sufficiently stable and is released from the while loop. My problem is that with some data sets, val gets stuck alternating back and forth between two values. For example, iterating back and forth between 27.0 and 27.7. Thus, it never stabilizes. How can I break the while loop if this occurs?
I know of break but do not know how to tell the loop when to use it. I imagine holding onto the value from two iterations before would work, but I do not know of a way to keep values two iterations ago...
while(diff_val > 0.1){
val = foo(subset(data, col1 > last_val))
diff_val = abs(val - last_val)
last_val = val
if(val == val_2_iterations_ago) break
}
How can I create val_2_iterations_ago?
Apologies for the non-reproducible code. The real foo() and data that are needed to replicate the situation are not mine to share... they aren't key to figuring out this issue with control flow, though.
I don't know if just keeping track of the previous two iterations will actually suffice, but it isn't too much trouble to add logic for this.
The logic is that at each iteration, the second-to-last value becomes the last value, the last value becomes the current value, and the current value is derived from foo(). Consider this code (note that val_2_iterations_ago must be initialised, e.g. to Inf, before the loop so the first comparison is valid):
val_2_iterations_ago <- Inf
while (diff_val > 0.1) {
  val <- foo(subset(data, col1 > last_val))
  if (val == val_2_iterations_ago) break
  diff_val <- abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
}
Another approach, perhaps a little more general, would be to track your iterations and set a maximum.
Pairing this with Tim's nice answer:
iter <- 0
max_iter <- 1e6
val_2_iterations_ago <- Inf
while (diff_val > 0.1 && iter < max_iter) {
  val <- foo(subset(data, col1 > last_val))
  if (val == val_2_iterations_ago) break
  diff_val <- abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
  iter <- iter + 1
}
How this is generally done is that you have:
A convergence tolerance, so that when your objective function doesn't change appreciably, the algorithm is deemed to have converged
A limit on the number of iterations, so that the code is guaranteed to terminate eventually
A check that the objective function is actually decreasing, to catch the situation where it's diverging/cyclic (many optimisation algorithms are designed so this shouldn't happen, but in your case it does happen)
Pseudocode:
oldVal <- Inf
for(i in 1:NITERS)
{
  val <- objective(x)
  diffVal <- val - oldVal
  converged <- (diffVal <= 0 && abs(diffVal) < TOL)
  if(converged || diffVal > 0)
    break
  oldVal <- val
}
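A runnable toy version of this skeleton (my addition; the objective, the update step, and the constants are hypothetical stand-ins for the OP's foo() and data):
TOL <- 1e-8
NITERS <- 1000
x <- 10
objective <- function(x) (x - 3)^2  # hypothetical objective
oldVal <- Inf
for (i in 1:NITERS) {
  x <- x - 0.1 * 2 * (x - 3)  # a gradient step, so the objective actually changes
  val <- objective(x)
  diffVal <- val - oldVal
  converged <- (diffVal <= 0 && abs(diffVal) < TOL)
  if (converged || diffVal > 0) break
  oldVal <- val
}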

Remove NA values efficiently

I need to remove NA values efficiently from vectors inside a function which is implemented with RcppEigen. I can of course do it using a for loop, but I wonder if there is a more efficient way.
Here is an example:
library(RcppEigen)
library(inline)
incl <- '
using Eigen::Map;
using Eigen::VectorXd;
typedef Map<VectorXd> MapVecd;
'
body <- '
const MapVecd x(as<MapVecd>(xx)), y(as<MapVecd>(yy));
VectorXd x1(x), y1(y);
int k(0);
for (int i = 0; i < x.rows(); ++i) {
    if (x.coeff(i) == x.coeff(i) && y.coeff(i) == y.coeff(i)) {
        x1(k) = x.coeff(i);
        y1(k) = y.coeff(i);
        k++;
    };
};
x1.conservativeResize(k);
y1.conservativeResize(k);
return Rcpp::List::create(Rcpp::Named("x") = x1,
                          Rcpp::Named("y") = y1);
'
na.omit.cpp <- cxxfunction(signature(xx = "Vector", yy= "Vector"),
body, "RcppEigen", incl)
na.omit.cpp(c(1.5, NaN, 7, NA), c(7.0, 1, NA, 3))
#$x
#[1] 1.5
#
#$y
#[1] 7
In my use case I need to do this about one million times in a loop (inside the Rcpp function) and the vectors could be quite long (let's assume 1000 elements).
PS: I've also investigated the route to find all NA/NaN values using x.array()==x.array(), but was unable to find a way to use the result for subsetting with Eigen.
Perhaps I am not understanding the question correctly, but within Rcpp I don't see how you could possibly do this more efficiently than with a for loop. for loops are generally inefficient in R only because iterating through a loop in R requires a lot of heavy interpreted machinery; that is not the case once you are down at the C++ level. Even natively vectorized R functions are ultimately implemented with for loops in C. So the only way I can think of to make this more efficient is to try to do it in parallel.
For example, here's a simple na.omit.cpp function that omits NA values from a single vector:
rcppfun<-"
Rcpp::NumericVector naomit(Rcpp::NumericVector x){
std::vector<double> r(x.size());
int k=0;
for (int i = 0; i < x.size(); ++i) {
if (x[i]==x[i]) {
r[k] = x[i];
k++;
}
}
r.resize(k);
return Rcpp::wrap(r);
}"
na.omit.cpp<-cppFunction(rcppfun)
This runs even more quickly than R's built-in na.omit:
> set.seed(123)
> x <- 1:10000
> x[sample(10000, 1000)] <- NA
> y1 <- na.omit(x)
> y2 <- na.omit.cpp(x)
> all(y1 == y2)
[1] TRUE
> require(microbenchmark)
> microbenchmark(na.omit(x), na.omit.cpp(x))
Unit: microseconds
           expr     min       lq   median      uq      max neval
     na.omit(x) 290.157 363.9935 376.4400 401.750 6547.447   100
 na.omit.cpp(x) 107.524 168.1955 173.6035 210.524  222.564   100
I do not know if I understand the problem correctly, but you can drop the NA positions with negative indexing:
a <- c(1.5, NaN, 7, NA)
a[-which(is.na(a))]
[1] 1.5 7.0
(Beware that this returns an empty vector when a contains no NA values; a[!is.na(a)] avoids that edge case.) It might be useful to use RInside if you want to call this from C++.

R - vectorised conditional replace

Hi, I'm trying to manipulate a list of numbers and I would like to do so without a for loop, using fast native operations in R. The pseudocode for the manipulation is:
By default the starting total is 100 (for every block between zeros)
From one zero to the next zero, the moment the cumulative total falls more than 2% below its running maximum, replace all subsequent numbers with zero.
Do this for all blocks of numbers between zeros
The cumulative sum resets to 100 for every block
For example if following were my data :
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);
Results would be :
0 0 0 1 3 4 5 -1 2 3 -5 0 0 0 -2 -3 0 0 0 0 0 -1 -1 -1 0
Currently I have an implementation with a for loop, but since my vector is really long, the performance is terrible.
Thanks in advance.
Here is a running sample code :
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
ans <- d
running_total <- 100
count <- 1
max <- 100
toggle <- FALSE
processing <- FALSE
for(i in d){
  if( i != 0 ){
    processing <- TRUE
    if(toggle == TRUE){
      ans[count] <- 0
    }
    else{
      running_total <- running_total + i
      if( running_total > max ){ max <- running_total }
      else if ( 0.98*max > running_total ){
        toggle <- TRUE
      }
    }
  }
  if( i == 0 && processing == TRUE ){
    running_total <- 100
    max <- 100
    toggle <- FALSE
  }
  count <- count + 1
}
cat(ans)
I am not sure how to translate your loop into vectorized operations. However, there are two fairly easy options for large performance improvements. The first is to simply put your loop into an R function and precompile it with the compiler package. The second, slightly more complicated, option is to translate your R loop into a C++ loop and use the Rcpp package to link it to an R function; you then call an R function that passes the work to fast C++ code. I show both options and timings below. I want to gratefully acknowledge the help of Alexandre Bujard from the Rcpp listserv, who helped me with a pointer issue I did not understand.
First, here is your R loop as a function, foo.r.
## Your R loop as a function
foo.r <- function(d) {
  ans <- d
  running_total <- 100
  count <- 1
  max <- 100
  toggle <- FALSE
  processing <- FALSE
  for (i in d) {
    if (i != 0) {
      processing <- TRUE
      if (toggle == TRUE) {
        ans[count] <- 0
      } else {
        running_total <- running_total + i
        if (running_total > max) {
          max <- running_total
        } else if (0.98*max > running_total) {
          toggle <- TRUE
        }
      }
    }
    if (i == 0 && processing == TRUE) {
      running_total <- 100
      max <- 100
      toggle <- FALSE
    }
    count <- count + 1
  }
  return(ans)
}
Now we can load the compiler package and compile the function and call it foo.rcomp.
## load compiler package and compile your R loop
require(compiler)
foo.rcomp <- cmpfun(foo.r)
That is all it takes for the compilation route; it is all R and obviously very easy. Now for the C++ approach, we use the Rcpp package as well as the inline package, which allows us to "inline" the C++ code. That is, we do not have to make a source file and compile it; we just include it in the R code and the compilation is handled for us.
## load Rcpp package and inline for ease of linking
require(Rcpp)
require(inline)
## Rcpp version (tot and max are doubles so fractional input is not truncated)
src <- '
const NumericVector xx(x);
int n = xx.size();
NumericVector res = clone(xx);
int toggle = 0;
int processing = 0;
double tot = 100;
double max = 100;
typedef NumericVector::iterator vec_iterator;
vec_iterator ixx = xx.begin();
vec_iterator ires = res.begin();
for (int i = 0; i < n; i++) {
    if (ixx[i] != 0) {
        processing = 1;
        if (toggle == 1) {
            ires[i] = 0;
        } else {
            tot += ixx[i];
            if (tot > max) {
                max = tot;
            } else if (.98 * max > tot) {
                toggle = 1;
            }
        }
    }
    if (ixx[i] == 0 && processing == 1) {
        tot = 100;
        max = 100;
        toggle = 0;
    }
}
return res;
'
foo.rcpp <- cxxfunction(signature(x = "numeric"), src, plugin = "Rcpp")
Now we can test that we get the expected results:
## demonstrate equivalence
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
all.equal(foo.r(d), foo.rcpp(d))
Finally, create a much larger version of d by repeating it 10^5 times. Then we can run the three different functions: pure R code, compiled R code, and the R function linked to C++ code.
## make larger vector to test performance
dbig <- rep(d, 10^5)
system.time(res.r <- foo.r(dbig))
system.time(res.rcomp <- foo.rcomp(dbig))
system.time(res.rcpp <- foo.rcpp(dbig))
Which on my system, gives:
> system.time(res.r <- foo.r(dbig))
   user  system elapsed
  12.55    0.02   12.61
> system.time(res.rcomp <- foo.rcomp(dbig))
   user  system elapsed
   2.17    0.01    2.19
> system.time(res.rcpp <- foo.rcpp(dbig))
   user  system elapsed
   0.01    0.00    0.02
The compiled R code takes about 1/6 the time of the uncompiled R code, needing only around 2 seconds to operate on the vector of 2.5 million. The C++ code is orders of magnitude faster even than the compiled R code, requiring just .02 seconds. Aside from the initial setup, the syntax for the basic loop is nearly identical in R and C++, so you do not even lose clarity. I suspect that even if parts or all of your loop could be vectorized in R, you would be hard pressed to beat the performance of the R function linked to C++. Lastly, just for proof:
> all.equal(res.r, res.rcomp)
[1] TRUE
> all.equal(res.r, res.rcpp)
[1] TRUE
The different functions return the same results.
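For what it's worth, the loop can also be vectorised in base R with ave() and per-block cumulative functions. This is my addition, not part of the original answer, and I have only checked it against the example d above; each zero starts a new block, and within each block everything after the first point where the total falls more than 2% below its running maximum is zeroed:
g    <- cumsum(d == 0)                            # block id: each zero starts a new block
tot  <- 100 + ave(d, g, FUN = cumsum)             # running total within each block
trig <- tot < 0.98 * ave(tot, g, FUN = cummax)    # TRUE where the 2% drawdown rule fires
zap  <- as.logical(ave(trig, g, FUN = function(z) c(FALSE, head(cumsum(z) > 0, -1))))
ans2 <- ifelse(zap, 0, d)                         # zero everything after the first trigger
all.equal(ans2, foo.r(d))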
