Count Sequential Occurrences of Value in Vector - r

Given a generic vector x (which could be numeric, character, factor, etc.), I need to be able to count the number of sequential occurrences of value, including singletons.
x <- c(0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1)
In this case, the following would be returned if value == 0.
[1] 1 1 2 3
The code below works, but it is very slow when x gets large. Does anyone know of a way to speed this up? I imagine there must be a clever vectorized way of doing this.
getSequential <- function(x, value) {
  counts <- c()
  last <- FALSE
  for (i in 1:length(x)) {
    if (last & x[i] == value) {
      counts[length(counts)] <- counts[length(counts)] + 1
    } else if (x[i] == value) {
      counts <- c(counts, 1)
      last <- TRUE
    } else {
      last <- FALSE
    }
  }
  return(counts)
}

You can use rle:
rle(x)$lengths[which(rle(x)$values==0)]
# 1 1 2 3
For speed, you could run rle only once:
x1 <- rle(x)
x1$lengths[which(x1$values==0)]
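To see why this works, look at what rle returns for the example x:
x1
# Run Length Encoding
#   lengths: int [1:8] 1 3 1 1 2 2 3 1
#   values : num [1:8] 0 1 0 1 0 1 0 1
The lengths of the runs whose value is 0 are exactly 1, 1, 2, 3.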

Well, the code is quite good. I doubt using rle together with which would increase the speed of the algorithm (much).
My proposition:
#include <stdlib.h>

int* counting(int input[], int n, int value) {
    int* len = calloc(n, sizeof(int)); // zero-initialized; assume worst case scenario (n/2 runs)
    int j = 0;
    for (int i = 0; i < n; i++) {              // 2 unit operations
        if (input[i] != value && len[j] == 0)  // followed relatively often, 3-4 unit operations
            continue;                          // 5 unit operations
        else if (input[i] == value)            // also followed relatively often, 4 unit operations (with '&&', if the first test is false the second is not checked)
            len[j]++;                          // 5 unit operations
        else { // input[i] != value && len[j] != 0: the only remaining possible outcome, 4 unit operations
            j++;                               // 5 unit operations
            len[j] = 0;                        // 7 unit operations (access and assignment)
            continue;                          // 8 unit operations
        }
    }
    return len;
}
As you can see, each iteration takes at most 8 unit operations and at best 4-5. The worst case is that n/2 iterations go through all 8 operations, but in most cases I suppose it would follow one of the 5-step paths.
Well, perhaps rle and those other functions are better optimized, but the question is: are they optimized for your problem? I recommend checking.
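If you want to check, the microbenchmark package makes that easy. A minimal sketch (assuming the getSequential() from the question is defined; timings will depend on your machine and data):
library(microbenchmark)
x <- sample(c(0, 1), 1e4, replace = TRUE)
microbenchmark(
  loop = getSequential(x, 0),
  rle  = {r <- rle(x); r$lengths[r$values == 0]},
  times = 10
)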

Related

Optimizing for + if in R

I am a bit lost about how to optimize for loops in R.
I have a set such that element i belongs to the set iff contains[[i]] == 1. I want to check whether sets of indices are included in this set. Currently I have the following code. Can it be written more efficiently?
contains = c(1, 0, 0, 1, 0, 1, 1, 0)
indices = c(4, 5) # not ok
# indices = c(4, 6) # ok
ok <- TRUE
for (index in indices) {
  if (contains[[index]] == 0) {
    ok <- FALSE
    break
  }
}
if (ok) {
  print("ok")
} else {
  print("not ok")
}
I would suggest either of these:
ok = all(indices %in% which(contains == 1))
ok = all(contains[indices] == 1)
They will be faster than a for loop in almost all cases. (Exception: if the vectors involved are very long and there is an early discrepancy, your break will stop searching as soon as a first false is found and probably be faster.)
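For the example data in the question, the second one-liner gives the expected results:
all(contains[c(4, 5)] == 1)
# [1] FALSE ("not ok": index 5 is not in the set)
all(contains[c(4, 6)] == 1)
# [1] TRUE ("ok")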
If you need really fast solutions on biggish data, please share some code to simulate data at scale so we can benchmark on a relevant use case.

if else statement concatenation - R

This is a very common question (see 1, 2, 3, 4, 5), and still I cannot find an answer to my problem.
If a == 1, then do X.
If a == 0, then do Y.
If a == 0 and b == 1, then do Z.
Just to explain: the if/else statement has to do Y if a == 0, no matter the value of b. But if b == 1 and a == 0, Z will make additional changes on top of those already done by Y.
My current code and its error:
if (a == 1) {
  X
} else if (a == 0) {
  Y
} else if (a == 0 & b == 1) {
  Z
}
Error in !criterion : invalid argument type
An else only happens if a previous if hasn't happened.
When you say
But if b == 1 and a == 0, Z will do additional changes to those already done by Y
Then you have two options:
## Option 1: nest Z inside Y
if (a == 1) {
  X
} else if (a == 0) {
  Y
  if (b == 1) {
    Z
  }
}
## Option 2: just use `if` again (not `else if`):
if (a == 1) {
  X
} else if (a == 0) {
  Y
}
if (a == 0 & b == 1) {
  Z
}
Really, you don't need any else here at all.
## This will work just as well
## (assuming that `X` can't change the value of a from 1 to 0)
if (a == 1) {
  X
}
if (a == 0) {
  Y
  if (b == 1) {
    Z
  }
}
Typically else is needed when you want to have a "final" action that is done only if none of the previous if options were used, for example:
# try to guess my number between 1 and 10
if (your_guess == 8) {
  print("Congratulations, you guessed my number!")
} else if (your_guess == 7 | your_guess == 9) {
  print("Close, but not quite")
} else {
  print("Wrong. Not even close!")
}
In the above, else is useful because I don't want to have to enumerate all the other possible guesses (or even bad inputs) that a user might enter. If they guess 8, they win. If they guess 7 or 9, I tell them they were close. Anything else, no matter what it is, I just say "wrong".
Note: this is true for programming languages in general. It is not unique to R.
However, since this is in the R tag, I should mention that R has if{}else{} and ifelse(), and they are different.
if{} (and optionally else{}) evaluates a single condition, and you can run code to do anything in {} depending on that condition.
ifelse() is a vectorized function; its arguments are test, yes, no. The test argument evaluates to a logical vector of TRUE and FALSE values, and yes and no are recycled to the length of test. The result is a vector of the same length as test, containing the corresponding values of yes (where test is TRUE) and no (where test is FALSE).
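A small illustration (using made-up vectors):
ifelse(c(1, 0, 0, 1) == 1, "X", "Y")
# [1] "X" "Y" "Y" "X"
Here the length-1 yes and no are recycled to the length of test.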
I believe you want to include Z in the second condition, like this:
if (a == 1) {
  X
} else if (a == 0) {
  Y
  if (b == 1) {
    Z
  }
}

Get out of infinite while loop

What is the best way to have a while loop recognize when it is stuck in an infinite loop in R?
Here's my situation:
diff_val = Inf
last_val = 0
while (diff_val > 0.1) {
  ### calculate val from data subset that is greater than the previous iteration's val
  val = foo(subset(data, col1 > last_val))
  diff_val = abs(val - last_val) ### how much did this change val?
  last_val = val ### set last_val for the next iteration
}
The goal is to have val get progressively closer and closer to a stable value, and when val is within 0.1 of the val from the last iteration, then it is deemed sufficiently stable and is released from the while loop. My problem is that with some data sets, val gets stuck alternating back and forth between two values. For example, iterating back and forth between 27.0 and 27.7. Thus, it never stabilizes. How can I break the while loop if this occurs?
I know of break but do not know how to tell the loop when to use it. I imagine holding onto the value from two iterations before would work, but I do not know of a way to keep values from two iterations ago...
while (diff_val > 0.1) {
  val = foo(subset(data, col1 > last_val))
  diff_val = abs(val - last_val)
  last_val = val
  if (val == val_2_iterations_ago) break
}
How can I create val_2_iterations_ago?
Apologies for the non-reproducible code. The real foo() and data that are needed to replicate the situation are not mine to share... they aren't key to figuring out this issue with control flow, though.
I don't know if just keeping track of the previous two iterations will actually suffice, but it isn't too much trouble to add logic for this.
The logic is that at each iteration, the second to last value becomes the last value, the last value becomes the current value, and the current value is derived from foo(). Consider this code:
val_2_iterations_ago <- Inf # initialize so the first comparison is FALSE
while (diff_val > 0.1) {
  val <- foo(subset(data, col1 > last_val))
  if (val == val_2_iterations_ago) break
  diff_val <- abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
}
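To see the cycle detection work, here is a self-contained toy sketch; update_val() is a hypothetical stand-in for foo(subset(data, col1 > last_val)), made to oscillate between the two values from the question:
update_val <- function(v) if (v == 27.0) 27.7 else 27.0 # hypothetical stand-in for foo()
diff_val <- Inf
last_val <- 27.0
val_2_iterations_ago <- Inf
while (diff_val > 0.1) {
  val <- update_val(last_val)
  if (val == val_2_iterations_ago) break # val matches the value from two iterations ago
  diff_val <- abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
}
# the loop exits on the second iteration instead of alternating forever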
Another approach, perhaps a little more general, would be to track your iterations and set a maximum.
Pairing this with Tim's nice answer:
iter = 0
max_iter = 1e6
while (diff_val > 0.1 & iter < max_iter) {
  val <- foo(subset(data, col1 > last_val))
  if (val == val_2_iterations_ago) break
  diff_val = abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
  iter = iter + 1
}
How this is generally done is that you have:
- A convergence tolerance, so that when your objective function doesn't change appreciably, the algorithm is deemed to have converged
- A limit on the number of iterations, so that the code is guaranteed to terminate eventually
- A check that the objective function is actually decreasing, to catch the situation where it's diverging/cyclic (many optimisation algorithms are designed so this shouldn't happen, but in your case it does happen)
Pseudocode:
oldVal <- Inf
for (i in 1:NITERS) {
  val <- objective(x)
  diffVal <- val - oldVal
  converged <- (diffVal <= 0 && abs(diffVal) < TOL)
  if (converged || diffVal > 0)
    break
  oldVal <- val
}

Divide&Conquer Algorithm for Checking if a list is sorted in Descending order

I need to write a function that checks if an array/list/vector (in R) is sorted in descending order. The answer must be implemented using divide & conquer.
Here is what I tried (the implementation is in R):
is.sort <- function(x) {
  n <- length(x)
  if (n == 1)
    return(T)
  mid <- floor(n/2)
  x1 <- is.sort(x[1:mid])
  x2 <- is.sort(x[(mid+1):n])
  return(ifelse(x1 >= x2, T, F))
}
But that always returns T :/
Thanks for the help.
It is quite awkward to implement that using D&C. However, let's try.
A sequence of length 1 is always sorted.
If x is of length >= 2, then divide it into 2 non-empty subsequences x1 and x2. Then x is sorted if x1 is sorted, x2 is sorted, and the last element in x1 is greater than or equal to (assuming a non-increasing ordering) the first element in x2.
This may be implemented as:
is.sort <- function(x) {
  n <- length(x)
  if (n == 1)
    return(TRUE)
  mid <- floor(n/2)
  return(x[mid] >= x[mid+1] && is.sort(x[1:mid]) && is.sort(x[(mid+1):n]))
}
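A quick check with a sorted and an unsorted vector:
is.sort(c(5, 3, 3, 1))
# [1] TRUE
is.sort(c(5, 3, 4, 1))
# [1] FALSE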
BTW, R has an is.unsorted function, but it determines only whether a sequence is sorted w.r.t. < or <=. For those who miss its >= equivalent, here's a trivial Rcpp implementation:
Rcpp::cppFunction('
bool is_sort(NumericVector x) {
    int n = x.size();
    for (int i = 0; i < n-1; ++i)
        if (x[i] < x[i+1]) return false;
    return true;
}
')
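Once compiled, it behaves like the recursive R version (a quick sanity check):
is_sort(c(5, 3, 3, 1))
# [1] TRUE
is_sort(c(1, 2))
# [1] FALSE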

Fastest way to drop rows with missing values?

I'm working with a large dataset x. I want to drop the rows of x that have a missing value in one or more of a set of columns of x, that set being specified by a character vector varcols.
So far I've tried the following:
require(data.table)
x <- CJ(var1=c(1,0,NA),var2=c(1,0,NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1","var2")
x[, missing := apply(sapply(.SD,is.na),1,any),.SDcols=varcols]
x <- x[!missing]
Is there a faster way of doing this?
Thanks.
This should be faster than using apply:
x[rowSums(is.na(x[, ..varcols])) == 0, ]
# var1 var2 textcol
# 1: 0 0 e
# 2: 0 1 f
# 3: 1 0 h
# 4: 1 1 i
Here is a revised version of a C++ solution with a number of modifications based on a long discussion with Matthew (see comments below). I am new to C++, so I am sure someone can still improve this.
After library("RcppArmadillo") you should be able to run the whole file, including the benchmark, using sourceCpp('cleanmat.cpp'). The C++ file includes two functions: cleanmat takes two arguments (X and the indices of the columns) and returns the matrix restricted to those columns, with rows containing missing values dropped; keep takes just one argument X and returns a logical vector.
Note about passing data.table objects: these functions do not accept a data.table as an argument. They would have to be modified to take a DataFrame as an argument (see here).
cleanmat.cpp
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
mat cleanmat(mat X, uvec idx) {
    // keep only the requested columns (idx is 1-based)
    X = X.cols(idx - 1);
    // get dimensions
    int n = X.n_rows, k = X.n_cols;
    // create keep vector
    vec keep = ones<vec>(n);
    for (int j = 0; j < k; j++)
        for (int i = 0; i < n; i++)
            if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;
    // alternative with a view for each row (slightly slower):
    /* vec keep = zeros<vec>(n);
       for (int i = 0; i < n; i++) {
           keep(i) = is_finite(X.row(i));
       } */
    return X.rows(find(keep == 1));
}

// [[Rcpp::export]]
LogicalVector keep(NumericMatrix X) {
    int n = X.nrow(), k = X.ncol();
    // create keep vector
    LogicalVector keep(n, true);
    for (int j = 0; j < k; j++)
        for (int i = 0; i < n; i++)
            if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;
    return keep;
}
/*** R
require("Rcpp")
require("RcppArmadillo")
require("data.table")
require("microbenchmark")
# create matrix
X = matrix(rnorm(1e+07), ncol = 100)
X[sample(nrow(X), 1000, replace = TRUE), sample(ncol(X), 1000, replace = TRUE)] = NA
colnames(X) = paste("c", 1:ncol(X), sep = "")
idx = sample(ncol(X), 90)
microbenchmark(
  X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx],
  X[rowSums(is.na(X[, idx])) == 0, idx],
  cleanmat(X, idx),
  X[keep(X[, idx]), idx],
  times = 3)
# output
# Unit: milliseconds
# expr min lq median uq max
# 1 cleanmat(X, idx) 253.2596 259.7738 266.2880 272.0900 277.8921
# 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
# 3 X[keep(X[, idx]), idx] 360.8254 361.5165 362.2077 371.2061 380.2045
# 4 X[rowSums(is.na(X[, idx])) == 0, idx] 358.4772 367.5698 376.6625 379.6093 382.5561
*/
For speed, with a large number of varcols, perhaps look to iterate by column. Something like this (untested):
keep = rep(TRUE, nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]
The issue with is.na is that it creates a new logical vector to hold its result, which then must be looped through by R to find the TRUEs so it knows which of the keep to set FALSE. However, in the above for loop, R can reuse the (identically sized) previous temporary memory for that result of is.na, since it is marked unused and available for reuse after each iteration completes. IIUC.
1. is.na(x[, ..varcols])
This is ok, but it creates a large copy to hold a logical matrix with length(varcols) columns. And the == 0 on the result of rowSums will need a new vector, too.
2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector again and so will &. Each of the results of is.na have to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.
3. CJ(c(0,1),c(0,1))
Best so far but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.
So, the very fastest (I guess) would be a C version like this (pseudo-code):
keep = rep(TRUE, nrow(x))
for (j = 0; j < length(varcols); j++)
  for (i = 0; i < nrow(x); i++)
    if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]
That would need one single allocation for keep (in C or R) and then the C loop would loop through the columns updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, inline package, or old school. It's important the two loops are that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NA in some rows, to save even fetching the later column values at all after the first NA in each row.
Two more approaches
two vector scans
x[!is.na(var1) & !is.na(var2)]
join with unique combinations of non-NA values
If you know the possible unique values in advance, this will be the fastest
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
Some timings
x <- data.table(var1 = sample(c(1,0,NA), 1e6, TRUE, prob = c(0.45,0.45,0.1)),
                var2 = sample(c(1,0,NA), 1e6, TRUE, prob = c(0.45,0.45,0.1)),
                key = c('var1','var2'))
system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
user system elapsed
0.09 0.02 0.11
system.time(x[!is.na(var1) & !is.na(var2)])
user system elapsed
0.06 0.02 0.07
system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
user system elapsed
0.03 0.00 0.04
