Imagine a series of numbers like
c(21,22,23,30,31,32,34,35,36,37,38,50,NA,52)
where subseries are defined by the rule: x[t] belongs to the same subseries as x[t-1] if x[t] = x[t-1] + 1.
So in the example above we have the following subseries:
c(21,22,23,30,31,32,34,35,36,37,38,50,NA,52)
## 1  1  1  2  2  2  3  3  3  3  3  4  -  5   # series ID
## 3        3        5              1     1   # length
What would be the most efficient way of tagging the subseries and counting their lengths (as a single function or two separate ones)?
We can get the difference between the adjacent elements, check whether it is equal to 1, get the cumulative sum, and use that as group to get the length of the vector
unname(tapply(v1, cumsum(c(TRUE, diff(replace(v1, is.na(v1), 0))!=1)), length))
#[1] 3 3 5 1 1 1
If we need the NA elements returned as "":
unname(tapply(v1, cumsum(c(TRUE, diff(replace(v1, is.na(v1), 0))!=1)),
function(x) if(all(is.na(x))) "" else length(x)))
#[1] "3" "3" "5" "1" "" "1"
Or a variation posted by @DavidArenburg with rle:
rle(cumsum(c(TRUE, diff(replace(v1, is.na(v1), 0))!=1)))$lengths
I'm accepting the answer by akrun (with a contribution by David Arenburg), but for reference I provide an Rcpp solution I created in the meantime.
NumericVector cpp_seriesLengths(NumericVector x) {
  int n = x.length();
  if (n == 1)
    return wrap(1);
  NumericVector out(n);
  int tmpCount = 1;
  int prevStart = 0;
  for (int i = 0; i < (n-1); i++) {
    if (x[i] == (x[i+1] - 1)) {
      tmpCount += 1;
    } else {
      for (int j = prevStart; j <= i; j++)
        out[j] = tmpCount;
      tmpCount = 1;
      prevStart = i+1;
    }
  }
  for (int j = prevStart; j < n; j++)
    out[j] = tmpCount;
  return out;
}
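For readers who want to check the logic outside R, here is an illustrative standalone C++ translation of the same pass (my own sketch, not part of the original answer). NA is represented as NaN; since any comparison with NaN is false, an NA terminates the running series exactly as in the Rcpp version.

```cpp
#include <cmath>
#include <vector>

// Standalone translation of cpp_seriesLengths: for each element,
// report the length of the maximal run of consecutive integers it sits in.
std::vector<int> seriesLengths(const std::vector<double>& x) {
    const int n = static_cast<int>(x.size());
    std::vector<int> out(n, 1);
    int tmpCount = 1, prevStart = 0;
    for (int i = 0; i + 1 < n; ++i) {
        if (x[i] == x[i + 1] - 1.0) {
            ++tmpCount;                 // run continues
        } else {                        // run (or NaN) ends here: flush its length
            for (int j = prevStart; j <= i; ++j) out[j] = tmpCount;
            tmpCount = 1;
            prevStart = i + 1;
        }
    }
    for (int j = prevStart; j < n; ++j)  // flush the final run
        out[j] = tmpCount;
    return out;
}
```

On the example vector this yields per-element lengths 3,3,3, 3,3,3, 5,5,5,5,5, 1, 1, 1, matching the grouping above.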
I have a model with multiple conditions that returns a value it depends on for the next prediction. Let's say, given time series A and B, the model returns a value of variable C, which in turn is used to estimate a value of D. In the next iteration, along with the new A and B, the model also uses the estimated D as input:
df = data.frame(A = sample(-5:5, 10000, replace = TRUE),
                B = sample(-5:5, 10000, replace = TRUE),
                C = 0,
                D = 0)

for (i in 1:nrow(df)) {
  if (df$A[i] < 0 & df$B[i] > 0) {
    df$C[i] <- df$B[i]
  } else if (df$A[i] == 0 & df$B[i] == 0) {
    df$C[i] <- 0
  } else {
    df$C[i] <- df$A[i] + df$B[i] - df$D[i]
  }
  df$D[i+1] <- ifelse(df$D[i] <= -df$C[i], 0, df$D[i] + df$C[i])  # a cumulative sum with reset
}
Though the code works well, it is very slow since I have hundreds of thousands of observations. I would appreciate any suggestion that could speed it up.
Since each row depends on the result of the previous row, this is difficult to write in a way that takes advantage of R's vectorization. In cases like this, we get a massive advantage by writing the code in Rcpp.
library(Rcpp)
cppFunction('
DataFrame f_Rcpp(DataFrame df) {
NumericVector A = df["A"];
NumericVector B = df["B"];
NumericVector C = df["C"];
NumericVector D = df["D"];
for(int i = 0; i < (df.nrows() - 1); ++i) {
if (A[i] < 0 && B[i] > 0) {
C[i] = B[i];
} else if(A[i] == 0 && B[i] == 0 ) {
C[i] = 0;
} else {
C[i] = A[i] + B[i] - D[i];
}
if(D[i] <= -C[i]) {
D[i+1] = 0;
} else {
D[i+1] = D[i] + C[i];
}
}
return(df);
}
')
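As a sanity check outside R, the same recurrence can be traced on plain vectors. This is my own illustrative sketch, not part of the answer; the function name and the tiny hand-made input are mine.

```cpp
#include <vector>

// Illustrative mirror of the f_Rcpp recurrence on plain vectors.
// C[i] is chosen from A[i], B[i] and the carried D[i]; D then
// accumulates C but resets to 0 whenever D[i] <= -C[i].
void recurrence(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, std::vector<double>& D) {
    for (std::size_t i = 0; i + 1 < A.size(); ++i) {
        if (A[i] < 0 && B[i] > 0)        C[i] = B[i];
        else if (A[i] == 0 && B[i] == 0) C[i] = 0;
        else                             C[i] = A[i] + B[i] - D[i];
        // cumulative sum with reset, written out as a branch
        D[i + 1] = (D[i] <= -C[i]) ? 0 : D[i] + C[i];
    }
}
```

Tracing A = (-1, 0, 2), B = (1, 0, 1) by hand: C[0] = 1 so D[1] = 1; then A = B = 0 gives C[1] = 0 and D[2] stays 1 (no reset, since 1 > 0).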
If we wrap your own code as a function so we can compare it, we see that our Rcpp function gives the same results:
f_R <- function(df) {
  for (i in 1:(nrow(df) - 1)) {
    if (df$A[i] < 0 & df$B[i] > 0) {
      df$C[i] <- df$B[i]
    } else if (df$A[i] == 0 & df$B[i] == 0) {
      df$C[i] <- 0
    } else {
      df$C[i] <- df$A[i] + df$B[i] - df$D[i]
    }
    df$D[i+1] <- ifelse(df$D[i] <= -df$C[i], 0, df$D[i] + df$C[i])
  }
  return(df)
}
res1 <- f_R(df)
res2 <- f_Rcpp(df)
identical(res1, res2)
#> [1] TRUE
But look what happens when we benchmark:
microbenchmark::microbenchmark(f_R(df), f_Rcpp(df), times = 10)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> f_R(df) 1746032.401 1793779.0 1794274.9209 1802222.051 1810686.801 1815285.001 10 b
#> f_Rcpp(df) 567.701 585.9 610.1607 601.851 642.801 650.101 10 a
The Rcpp function processes all 10,000 rows in less than a millisecond, as opposed to almost 2 seconds in base R. The Rcpp version is almost 3,000 times faster.
Edit
To get this working with your own data, try:
cppFunction('
DataFrame f_Rcpp(DataFrame df, NumericVector v) {
  NumericVector A = df["Tav"];
  NumericVector B = df["dprcp"];
  NumericVector C = df["dSWE"];
  NumericVector D = df["simSWE"];
  NumericVector E = df["dSWElag"];
  for (int i = 5; i < (df.nrows() - 1); ++i) {
    if (A[i] < -1 && B[i] > 0) {
      C[i] = B[i];
    } else if (A[i] < -1 && B[i] == 0) {
      C[i] = 0;
    } else {
      C[i] = v[i];
    }
    if (D[i-1] <= -C[i]) {
      D[i] = 0;
    } else {
      D[i] = D[i-1] + C[i];
    }
    E[i + 1] = C[i];
  }
  df["dSWE"] = C;
  df["simSWE"] = D;
  df["dSWElag"] = E;
  return df;
}
')
Which you could call like this:
preds <- predict(svm_model,station)
station2 <- f_Rcpp(station, preds)
An alternative approach, if you don't mind using another library, {dplyr}.
Admittedly, this alternative, while (perhaps) more readable, is about 200 times slower than @Allan Cameron's Rcpp solution.
library(dplyr)
f_dplyr <- function(df) {
  df |>
    mutate(C = ifelse(!any(A, B),
                      0,
                      ifelse(A < 0 & B > 0,
                             B,
                             A + B - D)),
           lag_C = lag(C),  ## default: lag by 1
           lag_D = lag(D)) |>
    rowwise() |>
    mutate(D = ifelse(lag_D <= lag_C,
                      0,
                      sum(lag_C, lag_D, na.rm = TRUE)))
}
output:
> f_dplyr(df) |> head()
# A tibble: 6 x 6
# Rowwise:
A B C D lag_C lag_D
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 -4 -2 -6 NA NA NA
2 -5 -2 -6 -6 -6 0
3 3 1 -6 -6 -6 0
4 1 -2 -6 -6 -6 0
5 4 -4 -6 -6 -6 0
6 4 -3 -6 -6 -6 0
speed:
> microbenchmark(f_dplyr(df), times = 10)
Unit: milliseconds
        expr      min       lq     mean   median      uq      max neval
 f_dplyr(df) 112.5365 115.7435 122.5075 122.0079 127.432 136.4511    10
I was working on the exercise in section 25.4 of "Advanced R" by Hadley Wickham, from https://adv-r.hadley.nz/rcpp.html#rcpp-na. The exercise asks for a function in Rcpp that mimics the base R function min(x) with an na.rm option. According to the exercise, the final answer, minC(x, na_rm), should behave exactly like min(x). The solution below is from http://advanced-r-solutions.rbind.io/rewriting-r-code-in-c.html (Exercise 20.2):
NumericVector minC(NumericVector x, bool na_rm = false)
{
  int n = x.size();
  NumericVector out = NumericVector::create(R_PosInf);
  if (na_rm) {
    for (int i = 0; i < n; ++i) {
      if (x[i] == NA_REAL) {
        continue;
      }
      if (x[i] < out[0]) {
        out[0] = x[i];
      }
    }
  } else {
    for (int i = 0; i < n; ++i) {
      if (NumericVector::is_na(x[i])) {
        out[0] = NA_REAL;
        return out;
      }
      if (x[i] < out[0]) {
        out[0] = x[i];
      }
    }
  }
  return out;
}
I test the function and it works as expected:
x <- 1:10
min(x)
[1] 1
minC(x)
[1] 1
x[1] <- NA
min(x)
[1] NA
minC(x)
[1] NA
min(x,na.rm=T)
[1] 2
minC(x,na.rm=T)
[1] 2
However, if I change the second if condition from NumericVector::is_na(x[i]) to x[i]==NA_REAL, the results become:
> x <- 1:10
> min(x)
[1] 1
> minC(x)
[1] 1
> x[1] <- NA
> min(x)
[1] NA
> minC(x)
[1] 2
> min(x,na.rm=T)
[1] 2
> minC(x,na_rm = T)
[1] 2
Can anyone help me understand why the results changed after I made the change mentioned above?
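A note on the likely cause (my own addition, not from the original thread): R's NA_REAL is implemented as a special NaN payload, and under IEEE 754 rules any `==` comparison involving NaN is false, even NaN == NaN. So `x[i] == NA_REAL` can never be true, which is why a predicate like `NumericVector::is_na` is required. A minimal C++ sketch of that semantics, with hypothetical helper names:

```cpp
#include <cmath>

// Stand-in for R's NA_REAL: a quiet NaN.
// naiveIsNA mimics the broken `x[i] == NA_REAL` test: it is always false,
// because every `==` comparison with NaN evaluates to false under IEEE 754.
inline bool naiveIsNA(double x)  { return x == std::nan(""); }
// properIsNA mimics what NumericVector::is_na effectively does.
inline bool properIsNA(double x) { return std::isnan(x); }
```

With the naive test, the NA element is never skipped and never detected, so it simply loses every `<` comparison and the minimum of the remaining values is returned, which matches the observed `minC(x)` result of 2.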
So I have a grid which is in one form and I'm trying to get the X and Y.
Is there a formula that could turn, for example, 12 into (2,2) or 14 into (2,3)?
Also, is there a name for this type of grid?
static int getX(int z)
{
    int count = 0;
    int res = 0;
    int curr = 0;
    for (int temp = z; temp > 0; temp >>= 1)
    {
        if (count % 2 == 0)
        {
            res += ((temp & 1) << curr);
            curr++;
        }
        count++;
    }
    return res;
}

static int getY(int z)
{
    int count = 0;
    int res = 0;
    int curr = 0;
    for (int temp = z; temp > 0; temp >>= 1)
    {
        if (count % 2 == 1)
        {
            res += ((temp & 1) << curr);
            curr++;
        }
        count++;
    }
    return res;
}
As Sneftel observed, this looks like a Z-order curve. As such, you can convert coordinates by interleaving binary representations. So your examples are
 0 1 0   x=2      0 1 0   x=2
0 1 0    y=2     0 1 1    y=3
001100   p=12    001110   p=14
So to get the x and y coordinates from the cell number p, you assign the bits of p alternately to x and y. This kind of bit arithmetic is hard to express using elementary arithmetic operations, and there is no generally recognized formula symbol for it that I'm aware of, but the idea is quite simple.
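To make the interleaving concrete, here is an illustrative C++ sketch (my own, not from the answer) using the standard bit-spreading trick for 16-bit coordinates; it reproduces the two examples above:

```cpp
#include <cstdint>

// Spread the low 16 bits of v into the even bit positions (0, 2, 4, ...).
static std::uint32_t spread(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Inverse: gather the even bit positions of v back into the low 16 bits.
static std::uint32_t compact(std::uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

// Z-order (Morton) index: x occupies the even bits, y the odd bits.
std::uint32_t mortonEncode(std::uint32_t x, std::uint32_t y) {
    return spread(x) | (spread(y) << 1);
}
std::uint32_t mortonX(std::uint32_t p) { return compact(p); }
std::uint32_t mortonY(std::uint32_t p) { return compact(p >> 1); }
```

So mortonEncode(2, 2) gives 12 and mortonEncode(2, 3) gives 14, and mortonX/mortonY invert them, matching the bit-by-bit loops in the question.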
I've recently come across the following problem:
Let's say I have a vector of random length L of 0s and 1s randomly distributed (for example [0,1,1,1,0,0,1,0]). I need to split the vector into two sub-vectors at index K so that the following conditions hold:
the left sub-vector must contain the maximum number of elements from K in reverse order such that the number of 0s is greater than or equal to the number of 1s
the right sub-vector must contain the maximum number of elements starting from K+1 such that the number of 1s is greater than or equal to the number of 0s
For example, for [1,1,1,1,1,1,1,1,1,0,1,0,0,0,0,0,0,0,0] the split is at index 9, the left vector is [1,0], the right vector [0,1].
I wrote the following solution, but its complexity is O(L^2). I think there could be a solution with worst-case complexity O(L), but I cannot find anything that helps me. Any idea? Thanks
var max = 0;
var kMax = -1;
var firstZeroFound = false;
for (var i = 0; i < testVector.Length - 1; i++)
{
    if (!firstZeroFound)
    {
        if (testVector[i]) continue;
        firstZeroFound = true;
    }
    var maxZero = FindMax(testVector, i, -1, -1, false);
    if (maxZero == 0) continue;
    var maxOne = FindMax(testVector, i + 1, testVector.Length, 1, true);
    if (maxOne == 0) continue;
    if ((maxZero + maxOne) <= max)
        continue;
    max = maxOne + maxZero;
    kMax = i;
    if (max == testVector.Length)
        break;
}
Console.Write("The result is {0}", kMax);

int FindMax(bool[] v, int start, int end, int increment, bool maximize)
{
    var max = 0;
    var sum = 0;
    var count = 0;
    var i = start;
    while (i != end)
    {
        count++;
        if (v[i])
            sum++;
        if (maximize)
        {
            if (sum * 2 >= count)
                max = count;
        }
        else if (sum * 2 <= count)
        {
            max = count;
        }
        i += increment;
    }
    return max;
}
I think you should look at rle.
y <- c(1,1,1,1,1,1,1,1,1,0,1,0,0,0,0,0,0,0,0)
z <- rle(y)
d <- cbind(z$values, z$lengths)
[,1] [,2]
[1,] 1 9
[2,] 0 1
[3,] 1 1
[4,] 0 8
Basically, rle computes the lengths of the runs of consecutive 0s and 1s.
From there, things should be easier for you.
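For readers outside R, run-length encoding is only a few lines in any language. An illustrative C++ sketch (my own, not part of the answer) that reproduces the table above:

```cpp
#include <utility>
#include <vector>

// Run-length encode: collapse consecutive equal values into (value, length) pairs,
// like R's rle() returning $values and $lengths side by side.
std::vector<std::pair<int, int>> rle(const std::vector<int>& v) {
    std::vector<std::pair<int, int>> runs;
    for (int x : v) {
        if (!runs.empty() && runs.back().first == x)
            ++runs.back().second;        // extend the current run
        else
            runs.push_back({x, 1});      // start a new run
    }
    return runs;
}
```

For the example vector this yields the runs (1,9), (0,1), (1,1), (0,8), matching the cbind output shown above.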
Given a binary digit count of n, and a maximum consecutive occurrence count of m, find the number of different possible binary numbers. Also, the leftmost and rightmost bit must be 1.
For example n = 5, and m = 3.
The count is 7:
10001
10011
10101
10111
11001
11011
11101
Notice we excluded 11111 because too many consecutive 1's exist in it.
This was an interview question I had recently, and it has been bothering me. I don't want to brute-force check each number for legitimacy, because n can be > 32.
Let's call a binary sequence almost valid if it starts with "1" and has at most m consecutive "1" digits.
For i = 1, ..., n and j = 0, ..., m let a(i, j) be the number of almost valid sequences with length i that end with exactly j consecutive "1" digits.
Then
a(1, 1) = 1 and a(1, j) = 0 for j != 1, because "1" is the only almost valid sequence of length one.
For i >= 2 and j = 0 we have a(i, 0) = a(i-1, 0) + a(i-1, 1) + ... + a(i-1, m), because appending "0" to any almost valid sequence of length i-1 gives an almost valid sequence of length i ending with "0".
For i >= 2 and j > 0 we have a(i, j) = a(i-1, j-1), because appending "1" to an almost valid sequence of length i-1 ending with exactly j-1 trailing ones gives an almost valid sequence of length i with exactly j trailing ones.
Finally, the wanted number is the number of almost valid sequences with length n that have a trailing "1", so this is
f(n, m) = a(n, 1) + a(n, 2) + ... + a(n, m)
Written as a C function:
int a[NMAX+1][MMAX+1];

int f(int n, int m)
{
    int i, j, s;
    // compute a(1, j):
    for (j = 0; j <= m; j++)
        a[1][j] = (j == 1);
    for (i = 2; i <= n; i++) {
        // compute a(i, 0):
        s = 0;
        for (j = 0; j <= m; j++)
            s += a[i-1][j];
        a[i][0] = s;
        // compute a(i, j) for j > 0:
        for (j = 1; j <= m; j++)
            a[i][j] = a[i-1][j-1];
    }
    // final result:
    s = 0;
    for (j = 1; j <= m; j++)
        s += a[n][j];
    return s;
}
The storage requirement could even be reduced, because only the previous row of the matrix a is needed at any point. The runtime complexity is O(n*m).
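Following that remark, here is an illustrative C++ version (my own sketch, not part of the answer) that keeps only the previous row; it reproduces the n = 5, m = 3 example from the question:

```cpp
#include <numeric>
#include <vector>

// f(n, m): count binary strings of length n whose first and last bits are 1
// and which contain no run of more than m consecutive 1s.
// Space is O(m): only the previous DP row a(i-1, .) is retained.
long long f(int n, int m) {
    std::vector<long long> prev(m + 1, 0), cur(m + 1, 0);
    prev[1] = 1;  // a(1, 1) = 1: "1" is the only almost valid string of length 1
    for (int i = 2; i <= n; ++i) {
        // appending "0" to any almost valid string resets the trailing-1 count
        cur[0] = std::accumulate(prev.begin(), prev.end(), 0LL);
        // appending "1" extends the trailing-1 run by one (capped at m)
        for (int j = 1; j <= m; ++j)
            cur[j] = prev[j - 1];
        prev = cur;
    }
    // sum over strings ending in at least one "1"
    return std::accumulate(prev.begin() + 1, prev.end(), 0LL);
}
```

For n = 5, m = 3 this returns 7, matching the enumeration in the question; for n = 3, m = 1 it returns 1 (only "101").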
Without too much combinatorial insight you can tackle this with DP. Let's write left #(n,m) right for the number of binary strings of length n, with no run of consecutive 1's longer than m, beginning with the string left and ending with the string right. Clearly, we want to find '1' #(n-2,m) '1'.
The key observation is simply that left #(n,m) right = left+'1' #(n-1,m) right + left+'0' #(n-1,m) right.
A simplistic implementation in js (not sure if it works for small m, and in general untested):
function hash(n, m) {
    return _('1', n - 2);
    function _(left, n) {
        if (m + 1 <= left.length && left.lastIndexOf('0') <= left.length - m - 2)
            return 0;
        if (n == 0)
            return (m <= left.length &&
                    left.lastIndexOf('0') <= left.length - m - 1 ? 0 : 1);
        return _(left + '1', n - 1) + _(left + '0', n - 1);
    }
}
hash(5,3); // 7
Of course this is more efficient than brute force, however the runtime complexity is still exponential, so it isn't practical for large values of n.