R - vectorised conditional replace - r

Hi, I'm trying to manipulate a list of numbers, and I would like to do so without a for loop, using fast native operations in R. The pseudocode for the manipulation is:
By default the starting total is 100 (for every block within zeros)
From the first zero to next zero, the moment the cumulative total falls by more than 2% replace all subsequent numbers with zero.
Do this for all blocks of numbers within zeros
The cumulative sum resets to 100 every time
For example if following were my data :
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);
Results would be :
0 0 0 1 3 4 5 -1 2 3 -5 0 0 0 -2 -3 0 0 0 0 0 -1 -1 -1 0
Currently I have an implementation with a for loop, but since my vector is really long, the performance is terrible.
Thanks in advance.
Here is a running sample code :
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);
ans <- d;
running_total <- 100;
count <- 1;
max <- 100;
toggle <- FALSE;
processing <- FALSE;
for(i in d){
if( i != 0 ){
processing <- TRUE;
if(toggle == TRUE){
ans[count] = 0;
}
else{
running_total = running_total + i;
if( running_total > max ){ max = running_total;}
else if ( 0.98*max > running_total){
toggle <- TRUE;
}
}
}
if( i == 0 && processing == TRUE )
{
running_total = 100;
max = 100;
toggle <- FALSE;
}
count <- count + 1;
}
cat(ans)

I am not sure how to translate your loop into vectorized operations. However, there are two fairly easy options for large performance improvements. The first is to simply put your loop into an R function, and use the compiler package to precompile it. The second slightly more complicated option is to translate your R loop into a c++ loop and use the Rcpp package to link it to an R function. Then you call an R function that passes it to c++ code which is fast. I show both these options and timings. I do want to gratefully acknowledge the help of Alexandre Bujard from the Rcpp listserv, who helped me with a pointer issue I did not understand.
First, here is your R loop as a function, foo.r.
## Your R loop as a function
foo.r <- function(d) {
ans <- d
running_total <- 100
count <- 1
max <- 100
toggle <- FALSE
processing <- FALSE
for(i in d){
if(i != 0 ){
processing <- TRUE
if(toggle == TRUE){
ans[count] <- 0
} else {
running_total = running_total + i;
if (running_total > max) {
max <- running_total
} else if (0.98*max > running_total) {
toggle <- TRUE
}
}
}
if(i == 0 && processing == TRUE) {
running_total <- 100
max <- 100
toggle <- FALSE
}
count <- count + 1
}
return(ans)
}
Now we can load the compiler package and compile the function and call it foo.rcomp.
## load compiler package and compile your R loop
require(compiler)
foo.rcomp <- cmpfun(foo.r)
That is all it takes for the compilation route. It is all R and obviously very easy. Now for the c++ approach, we use the Rcpp package as well as the inline package which allows us to "inline" the c++ code. That is, we do not have to make a source file and compile it, we just include it in the R code and the compilation is handled for us.
## load Rcpp package and inline for ease of linking
require(Rcpp)
require(inline)
## Rcpp version
src <- '
const NumericVector xx(x);
int n = xx.size();
NumericVector res = clone(xx);
int toggle = 0;
int processing = 0;
double tot = 100;  // double, so non-integer inputs are not truncated
double max = 100;
typedef NumericVector::iterator vec_iterator;
vec_iterator ixx = xx.begin();
vec_iterator ires = res.begin();
for (int i = 0; i < n; i++) {
if (ixx[i] != 0) {
processing = 1;
if (toggle == 1) {
ires[i] = 0;
} else {
tot += ixx[i];
if (tot > max) {
max = tot;
} else if (.98 * max > tot) {
toggle = 1;
}
}
}
if (ixx[i] == 0 && processing == 1) {
tot = 100;
max = 100;
toggle = 0;
}
}
return res;
'
foo.rcpp <- cxxfunction(signature(x = "numeric"), src, plugin = "Rcpp")
Now we can test that we get the expected results:
## demonstrate equivalence
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
all.equal(foo.r(d), foo.rcpp(d))
Finally, create a much larger version of d by repeating it 10^5 times. Then we can run the three different functions: pure R code, compiled R code, and an R function linked to C++ code.
## make larger vector to test performance
dbig <- rep(d, 10^5)
system.time(res.r <- foo.r(dbig))
system.time(res.rcomp <- foo.rcomp(dbig))
system.time(res.rcpp <- foo.rcpp(dbig))
Which on my system, gives:
> system.time(res.r <- foo.r(dbig))
user system elapsed
12.55 0.02 12.61
> system.time(res.rcomp <- foo.rcomp(dbig))
user system elapsed
2.17 0.01 2.19
> system.time(res.rcpp <- foo.rcpp(dbig))
user system elapsed
0.01 0.00 0.02
The compiled R code takes about 1/6 of the time of the uncompiled R code, needing only about 2 seconds to process the vector of 2.5 million elements. The C++ code is orders of magnitude faster even than the compiled R code, requiring just 0.02 seconds to complete. Aside from the initial setup, the syntax for the basic loop is nearly identical in R and C++, so you do not even lose clarity. I suspect that even if parts or all of your loop could be vectorized in R, you would be hard pressed to beat the performance of the R function linked to C++. Lastly, just for proof:
> all.equal(res.r, res.rcomp)
[1] TRUE
> all.equal(res.r, res.rcpp)
[1] TRUE
The different functions return the same results.
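For what it is worth, the per-block logic can also be sketched with vectorised base R calls. The function name vectorised_stop and its arguments start and tol are made up for this illustration, and the sketch assumes the running totals stay positive (otherwise the max comparison in the loop behaves slightly differently). It treats every zero as the start of a new block, computes the running total and running peak within each block, and zeroes everything after the first breach:

```r
# Hypothetical helper, not part of the original answer.
# Assumes running totals stay positive.
vectorised_stop <- function(d, start = 100, tol = 0.98) {
  block   <- cumsum(d == 0)                            # each zero opens a new block
  running <- start + ave(d, block, FUN = cumsum)       # running total per block
  peak    <- pmax(start, ave(running, block, FUN = cummax))  # running max per block
  breach  <- running < tol * peak                      # total fell more than 2% below peak
  # number of breaches strictly before each element, within its block
  earlier <- ave(as.numeric(breach), block, FUN = cumsum) - breach
  d[earlier > 0] <- 0                                  # zero everything after the first breach
  d
}

d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
res_vec <- vectorised_stop(d)
cat(res_vec, "\n")
```

I have only checked this against the example vector from the question, so compare it with the loop on your own data before relying on it.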

Related

Translating a VBA function into R

I am attempting to translate the function DISCRINV(), an Excel function available in the simtools Excel add-in created by Roger Myerson, into an R function. I believe I am close, but I am having difficulty understanding the looping syntax of VBA.
The VBA code for this function is as follows:
Function DISCRINV(ByVal randprob As Double, values As Object, probabilities As Object)
On Error GoTo 63
Dim i As Integer, cumv As Double, cel As Object
If values.Count <> probabilities.Count Then GoTo 63
For Each cel In probabilities
i = i + 1
cumv = cumv + cel.Value
If randprob < cumv Then
DISCRINV = values.Cells(i).Value
Exit Function
End If
Next cel
If randprob < cumv + 0.001 Then
DISCRINV = values.Cells(i).Value
Exit Function
End If
63 DISCRINV = CVErr(xlErrValue)
End Function
Attempting to translate this directly from the VBA code, I have come up with this (not correct):
DISCRINV <- function(R,V,P){
if(length(V) != length(P)){
print("ERROR NUMBER OF VALUES DOES NOT EQUAL NUMBER OF PROBABILITIES")
} else{
for (i in 1:length(P)){
cumv=cumv+P[i]
if (R < cumv){
DISCY1 = V[i]
return(DISCY1)
}
print(cumv)
if (R < cumv +0.001){
DISCY2 = V[i]
return(DISCY2)
}
}
}
}
Attempting to translate this through my understanding of what it is doing, I have come up with this:
DISCRINV <- function(x,values,probabilities){
require(FSA)
precumsum <- pcumsum(probabilities)
middle <- c()
for (i in 1:(length(values)-2)){
if (precumsum[i+1] <= x & x < precumsum[i+2]){
middle[i] <- values[i+1]}
else{
middle[i] <- 0
}
}
firstrow <- ifelse(x < precumsum[2], values[1], 0)
lastrow <- ifelse(precumsum[length(precumsum)] <= x , values[length(precumsum)] , 0)
Gvector <- c(firstrow,middle,lastrow)
print(firstrow)
print(middle)
print(lastrow)
print(Gvector)
simulatedvalue <- sum(Gvector)
return(simulatedvalue)
}
The latter option works 99% of the time, but not when the first function parameter is over 0.5, the second parameter is a vector of values c(1000,2000) and the third parameter is a vector (0.5,0.5). The case of the latter option not working 100% of the time is what has led me to try to translate the function directly. Could someone please give some insight into where my translation is going wrong?
Additionally a description of the function is as follows:
DISCRINV(randprob, values, probabilities) returns inverse cumulative values for a discrete random variable. When the first parameter is a RAND, DISCRINV returns a discrete random variable with possible values and corresponding probabilities in the given ranges.
Thank you in advance for the insight!
For anyone who is interested, I was able to successfully translate this VBA script using this code:
DISCRINV <- function(x,values,probabilities){
require(FSA)
precumsum <- pcumsum(probabilities)
middle <- c()
if(length(values) < 3){
if(x < probabilities[1]){  # threshold is the first probability (0.5 in the example)
middle1 <- values[1]
return(middle1)
} else{
middle2 <- values[2]
return(middle2)
}
}
else{
for (i in 1:(length(values)-2)){
if (precumsum[i+1] <= x & x < precumsum[i+2]){
middle[i] <- values[i+1]}
else{
middle[i] <- 0
}
}
firstrow <- ifelse(x < precumsum[2], values[1], 0)
lastrow <- ifelse(precumsum[length(precumsum)] <= x , values[length(precumsum)] , 0)
Gvector <- c(firstrow,middle,lastrow)
print(firstrow)
print(middle)
print(lastrow)
print(Gvector)
simulatedvalue <- sum(Gvector)
return(simulatedvalue)
}
}
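As a point of comparison, here is a minimal sketch of a more direct translation of the VBA loop using cumsum. The name discrinv_sketch is made up for this illustration, NA stands in for VBA's CVErr(xlErrValue), and the sketch assumes non-empty numeric inputs:

```r
# Hypothetical direct translation of the VBA logic; NA plays the role of #VALUE!.
discrinv_sketch <- function(randprob, values, probabilities) {
  if (length(values) != length(probabilities)) {
    return(NA)                          # VBA: GoTo 63
  }
  cumv <- cumsum(probabilities)         # cumulative probabilities
  hit  <- which(randprob < cumv)        # positions where randprob < cumv
  if (length(hit) > 0) {
    return(values[hit[1]])              # first position, as in the For Each loop
  }
  # VBA fallback: allow for rounding error in the total probability
  if (randprob < cumv[length(cumv)] + 0.001) {
    return(values[length(values)])
  }
  NA
}

print(discrinv_sketch(0.7, c(1000, 2000), c(0.5, 0.5)))  # 2000
print(discrinv_sketch(0.3, c(1000, 2000), c(0.5, 0.5)))  # 1000
```

This needs no extra package and does not special-case vectors of length two.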

Dealing with recursion depth limitation in R

The algorithm is from https://www.math.upenn.edu/~wilf/eastwest.pdf page 16 RandomKSubsets
RandomKSubsets = function(n, k){
if (n<0 | k<0 | k<n){
return()
}
else {
if (n==0 && k==0){
return(c())
}
else {
rno = runif(1)
if (rno < n/k){
east = RandomKSubsets(n-1,k-1)
return (c(east, k))
}
else{
west = RandomKSubsets(n,k-1)
return(west)
}
}
}
}
Running the program with k=4000 and n=1200 I run into recursion depth limit. I tried options(expressions=500000) but it's not enough for the algorithm. How can I run this code for my variables?
This is close to tail recursion: the only recursive calls are in the return statements. This blog: http://blog.moertel.com/posts/2013-05-11-recursive-to-iterative.html describes how to change such functions into loops. I followed the mostly mechanical process described there, and came up with this version:
RandomKSubsetsLoop = function(n, k) {
acc <- NULL
while (TRUE) {
if (n<0 | k<0 | k<n){
return(acc)
}
else {
if (n==0 && k==0){
return(acc)
}
else {
rno = runif(1)
if (rno < n/k){
acc <- c(k, acc)
k <- k - 1
n <- n - 1
next
}
else{
k <- k - 1
next
}
}
}
break
}
}
I haven't tested it extensively, but it produces the same result as the original in this test:
set.seed(1)
RandomKSubsets(5, 10)
# [1] 1 3 6 9 10
set.seed(1)
RandomKSubsetsLoop(5, 10)
# [1] 1 3 6 9 10
You'll probably want to do more extensive testing, and read the blog to make sure I've done things as it describes.
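One way to do that extra testing is to run both versions from the same seed over many seeds and check that the results are identical. The sketch below uses compact re-statements of the two functions (subsets_rec and subsets_loop are names made up here so the block runs on its own); both draw exactly one runif(1) per step, so they should agree whenever they start from the same seed:

```r
# Compact restatement of the recursive version (hypothetical name).
subsets_rec <- function(n, k) {
  if (n < 0 || k < 0 || k < n) return(NULL)
  if (n == 0 && k == 0) return(NULL)
  if (runif(1) < n / k) c(subsets_rec(n - 1, k - 1), k) else subsets_rec(n, k - 1)
}

# Compact restatement of the loop version (hypothetical name).
subsets_loop <- function(n, k) {
  acc <- NULL
  while (!(n < 0 || k < 0 || k < n) && !(n == 0 && k == 0)) {
    if (runif(1) < n / k) {   # "east": keep k and shrink the subset size
      acc <- c(k, acc)
      n <- n - 1
    }
    k <- k - 1                # both branches move on to k - 1
  }
  acc
}

# Compare the two on many seeds and two problem sizes.
agree <- vapply(1:50, function(seed) {
  set.seed(seed); a1 <- subsets_rec(5, 10)
  set.seed(seed); b1 <- subsets_loop(5, 10)
  set.seed(seed); a2 <- subsets_rec(20, 60)
  set.seed(seed); b2 <- subsets_loop(20, 60)
  identical(a1, b1) && identical(a2, b2)
}, logical(1))
print(all(agree))

# The loop version has no recursion depth to worry about.
print(length(subsets_loop(1200, 4000)))
```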
By the way, there are other algorithms to do this sampling, e.g. the one described in McLeod, A.I. and Bellhouse, D.R. (1983), "A convenient algorithm for drawing a simple random sample", Applied Statistics, 32, 182-184.
That one is based on a loop by design, and has the advantage that you don't need to know the population size (k in your notation) in advance: you just keep updating your sample until there are no more items to process.
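To give a feel for that kind of loop-based sampler, here is a standard reservoir-sampling sketch. I have not checked it line by line against the cited paper, so treat it as an illustration of the general idea rather than the published algorithm; the name reservoir_sample is made up here:

```r
# Hypothetical helper: keep a simple random sample of size n while
# streaming through items, without knowing length(items) in advance.
reservoir_sample <- function(items, n) {
  kept <- vector(mode = typeof(items), length = 0)
  seen <- 0
  for (item in items) {
    seen <- seen + 1
    if (seen <= n) {
      kept[seen] <- item            # fill the reservoir with the first n items
    } else {
      j <- sample.int(seen, 1)      # uniform integer in 1..seen
      if (j <= n) kept[j] <- item   # replace slot j with probability n / seen
    }
  }
  kept
}

set.seed(1)
s <- reservoir_sample(1:4000, 1200)
print(length(s))
```

Unlike the recursive version, the result is not sorted; wrap it in sort() if the order matters.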

When running code in R was given an error that there was a missing value where true/false needed and I can't fix it

I am new to using R and have a minimal amount of Python experience. I am sure this is an easy fix, but I am just not seeing it. I was given code to run a Fibonacci sequence to 100 and I copied and pasted it, but I am getting the following error: Error in if (numterms <= 0) { : missing value where TRUE/FALSE needed. I know this has to do with the if/else clause, but I am not seeing the problem.
I have run through the code a couple different ways but it has not helped. And the person to assist is not available during the weekend. Any help would be appreciated.
# take the max number input from the user
numterms = as.integer(readline(prompt="What is your max number? "))
# first two items
num1 = 0
num2 = 1
counter = 2
# check if the number of terms is valid
if(numterms <= 0) {
print("Please enter an integer above zero")
} else {
if(numterms == 1) {
print("The Fibonacci sequence:")
print(num1)
} else {
print("The Fibonacci sequence:")
print(num1)
print(num2)
while(counter < numterms) {
numth = num1 + num2
print(numth)
# update values
num1 = num2
num2 = numth
counter = counter + 1
}
}
}
If you execute the whole script at once, numterms is not defined correctly. It is meant to come from user input: the readline function reads what the user types at the console. If you execute only this line, you can enter a number and numterms is defined properly.
If you execute all the code at once numterms is set to NA which cannot be compared to 0 in the numterms <= 0 clause. In this case numterms <= 0 is also NA which is not a logical value and can therefore not be evaluated by if. This ultimately causes your error.
The solution would be to just run the first line of your code and enter the number and only after you entered the number to execute the rest of the code.
Alternatively you can define your code as a function:
printFibonacci <- function(){
numterms = as.integer(readline(prompt="What is your max number? "))
if(is.na(numterms)){
numterms <- 4
}
# first two items
num1 = 0
num2 = 1
counter = 2
# check if the number of terms is valid
if(numterms <= 0) {
print("Please enter an integer above zero")
} else {
if(numterms == 1) {
print("The Fibonacci sequence:")
print(num1)
} else {
print("The Fibonacci sequence:")
print(num1)
print(num2)
while(counter < numterms) {
numth = num1 + num2
print(numth)
# update values
num1 = num2
num2 = numth
counter = counter + 1
}
}
}
}
And then just call your function with printFibonacci(). In this case the prompt and answer of the readline function gets executed first and numterms can be defined by the user before the rest of the code is executed.
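The NA behaviour described above can be reproduced in isolation. This sketch assumes that readline ended up receiving a line of the pasted script rather than a number, which is simulated here with a plain string; it also shows an is.na guard as one way to avoid the error:

```r
# Simulate as.integer() receiving text that is not a number.
numterms <- suppressWarnings(as.integer("# first two items"))
print(numterms)        # NA
print(numterms <= 0)   # NA: comparing NA with a number gives NA

# if() cannot branch on NA, so it raises an error; catch it to show the outcome.
outcome <- tryCatch(
  if (numterms <= 0) "not positive" else "positive",
  error = function(e) "error"
)
print(outcome)         # "error"

# Guard: test for NA first, so the comparison is never evaluated on NA.
if (is.na(numterms) || numterms <= 0) {
  message("Please enter an integer above zero")
}
```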

Adding a counter to a loop

On a broad question that I haven't been able to find for R:
I'm trying to add a counter at the beginning of a loop.
So that when I run the loop sim = 1000:
if(hours$week1 > 1 and hours$week1 < 48) add 1 to the counter
ifelse add 0
I have come across counter tutorials that print a sentence to let you know where you are (if something goes wrong):
e.g
for (i in 1:1000) {
if (i%%100==0) print(paste("No work", i))
}
But the purpose of my counter is to generate a value output, measuring how many of the 1000 runs in the loop fall inside a specified range.
You basically had it. You just need to a) initialize the counter before the loop, b) use & instead of and in your if condition, c) actually add 1 to the counter. Since adding 0 is the same as doing nothing, you don't have to worry about the "else".
counter = 0
for (blah in your_loop_definition) {
... loop code ...
if(hours$week1 > 1 & hours$week1 < 48) {
counter = counter + 1
}
... more loop code ...
}
Instead of
if(hours$week1 > 1 & hours$week1 < 48) {
counter = counter + 1
}
you could also use
counter = counter + (hours$week1 > 1 && hours$week1 < 48)
since R is converting TRUE to 1 and FALSE to 0.
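Here is a small runnable version of that idea. The vector week1_hours is simulated data made up for this sketch (the real hours$week1 is not shown in the question), with one value per simulation run:

```r
set.seed(42)
# Hypothetical data: one simulated "hours" value per run.
week1_hours <- runif(1000, min = 0, max = 60)

counter <- 0
for (h in week1_hours) {
  # TRUE is treated as 1 and FALSE as 0 when added to a number
  counter <- counter + (h > 1 && h < 48)
}

# The same count without a loop: & is the element-wise version of &&.
counter_vec <- sum(week1_hours > 1 & week1_hours < 48)

print(c(loop = counter, vectorised = counter_vec))
```

Both numbers should match, and the vectorised form is usually the more idiomatic choice when all the values are available up front.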
How about this?
count = 0
for (i in 1:1000) {
count = ifelse(i %in% 1:100, count + 1, count)
}
count
#> [1] 100
If your goal is just to monitor progress coarsely, and you're using RStudio, a simple solution is to refresh the Environment tab to check the current value of i.

Fastest way to drop rows with missing values?

I'm working with a large dataset x. I want to drop rows of x that are missing in one or more columns in a set of columns of x, that set being specified by a character vector varcols.
So far I've tried the following:
require(data.table)
x <- CJ(var1=c(1,0,NA),var2=c(1,0,NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1","var2")
x[, missing := apply(sapply(.SD,is.na),1,any),.SDcols=varcols]
x <- x[!missing]
Is there a faster way of doing this?
Thanks.
This should be faster than using apply:
x[rowSums(is.na(x[, ..varcols])) == 0, ]
# var1 var2 textcol
# 1: 0 0 e
# 2: 0 1 f
# 3: 1 0 h
# 4: 1 1 i
Here is a revised version of a c++ solution with a number of modifications based on a long discussion with Matthew (see comments below). I am new to c so I am sure that someone might still be able to improve this.
After library("RcppArmadillo") you should be able to run the whole file, including the benchmark, using sourceCpp('cleanmat.cpp'). The C++ file includes two functions. cleanmat takes two arguments (X and the index of the columns) and returns the matrix restricted to those columns, without the rows that contain missing values. keep takes just one argument, X, and returns a logical vector.
Note about passing data.table objects: These functions do not accept a data.table as an argument. The functions have to be modified to take DataFrame as an argument (see here).
cleanmat.cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
mat cleanmat(mat X, uvec idx) {
// keep only the requested columns (idx is 1-based, Armadillo is 0-based)
X = X.cols(idx - 1);
// get dimensions
int n = X.n_rows,k = X.n_cols;
// create keep vector
vec keep = ones<vec>(n);
for (int j = 0; j < k; j++)
for (int i = 0; i < n; i++)
if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;
// alternative with view for each row (slightly slower)
/*vec keep = zeros<vec>(n);
for (int i = 0; i < n; i++) {
keep(i) = is_finite(X.row(i));
}*/
return (X.rows(find(keep==1)));
}
// [[Rcpp::export]]
LogicalVector keep(NumericMatrix X) {
int n = X.nrow(), k = X.ncol();
// create keep vector
LogicalVector keep(n, true);
for (int j = 0; j < k; j++)
for (int i = 0; i < n; i++)
if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;
return (keep);
}
/*** R
require("Rcpp")
require("RcppArmadillo")
require("data.table")
require("microbenchmark")
# create matrix
X = matrix(rnorm(1e+07),ncol=100)
X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]=NA
colnames(X)=paste("c",1:ncol(X),sep="")
idx=sample(ncol(X),90)
microbenchmark(
X[!apply(X[,idx],1,function(X) any(is.na(X))),idx],
X[rowSums(is.na(X[,idx])) == 0, idx],
cleanmat(X,idx),
X[keep(X[,idx]),idx],
times=3)
# output
# Unit: milliseconds
# expr min lq median uq max
# 1 cleanmat(X, idx) 253.2596 259.7738 266.2880 272.0900 277.8921
# 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
# 3 X[keep(X[, idx]), idx] 360.8254 361.5165 362.2077 371.2061 380.2045
# 4 X[rowSums(is.na(X[, idx])) == 0, idx] 358.4772 367.5698 376.6625 379.6093 382.5561
*/
For speed, with a large number of varcols, perhaps look to iterate by column. Something like this (untested) :
keep = rep(TRUE,nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]
The issue with is.na is that it creates a new logical vector to hold its result, which then must be looped through by R to find the TRUEs so it knows which of the keep to set FALSE. However, in the above for loop, R can reuse the (identically sized) previous temporary memory for that result of is.na, since it is marked unused and available for reuse after each iteration completes. IIUC.
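A small runnable check of the column-wise loop, using a base data.frame so that no package is needed (the toy data are made up for this sketch; with a data.table the final subset would be x[keep] as above). The result is compared with complete.cases() from the stats package, which answers the same question:

```r
# Toy data, made up for illustration.
x <- data.frame(
  var1    = c(1, 0, NA, 1, NA),
  var2    = c(NA, 0, 1, 1, NA),
  textcol = letters[1:5]
)
varcols <- c("var1", "var2")

# One logical vector, updated column by column.
keep <- rep(TRUE, nrow(x))
for (j in varcols) keep[is.na(x[[j]])] <- FALSE

res <- x[keep, ]
print(res)

# Cross-check against complete.cases() on the same columns.
print(identical(keep, complete.cases(x[varcols])))
```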
1. is.na(x[, ..varcols])
This is OK, but it allocates a logical matrix with nrow(x) rows and length(varcols) columns. The == 0 on the result of rowSums needs a new vector, too.
2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector again and so will &. Each of the results of is.na have to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.
3. CJ(c(0,1),c(0,1))
Best so far but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.
So, the very fastest (I guess), would be a C version like this (pseudo-code) :
keep = rep(TRUE,nrow(x))
for (j=0; j<length(varcols); j++)
for (i=0; i<nrow(x); i++)
if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]
That would need one single allocation for keep (in C or R) and then the C loop would loop through the columns updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, inline package, or old school. It's important the two loops are that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NA in some rows, to save even fetching the later column values at all after the first NA in each row.
Two more approaches
two vector scans
x[!is.na(var1) & !is.na(var2)]
join with unique combinations of non-NA values
If you know the possible unique values in advance, this will be the fastest
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
Some timings
x <-data.table(var1 = sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
var2= sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
key = c('var1','var2'))
system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
user system elapsed
0.09 0.02 0.11
system.time(x[!is.na(var1) & !is.na(var2)])
user system elapsed
0.06 0.02 0.07
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
user system elapsed
0.03 0.00 0.04
