How to create binary constraints for optimization in R?

I have a function f(x) that I intend to minimize. x is a vector of 50 parameters. The function has two constraints: first, all parameters in x must be binary, so that x = (1, 1, 0, 1, ...); second, the sum of x must be exactly 25, so that sum(x) = 25. The problem can be stated as:
min f(x)
s.t. sum(x) = 25,
     x_i in {0, 1}
However, when I try to solve this problem in R, I run into problems. Prevalent packages such as optim and constrOptim from stats only accept the coefficients of the target function (in my case the function is a bit complex and cannot simply be expressed as a coefficient matrix), and donlp2 from Rdonlp2 does not support setting parameters to be binary. Does anyone have an idea of how to set binary constraints for this case?

Expanding my comment, here is an example of a Local Search, as implemented in package NMOF. (I borrow Stéphane's objective function).
library("NMOF")
library("neighbours")
## Stéphane's objective function
f <- function(x)
    sum(1:20 * x)
nb <- neighbourfun(type = "logical", kmin = 10, kmax = 10)
x0 <- c(rep(FALSE, 10), rep(TRUE, 10))
sol <- LSopt(f, list(x0 = x0, neighbour = nb, nI = 1000))
## initial solution
as.numeric(x0)
## [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
## final solution
as.numeric(sol$xbest)
## [1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
(Disclosure: I am the maintainer of packages NMOF and neighbours.)
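For the original 50-parameter problem, the same pattern should carry over; here is a minimal sketch, with f50 as a hypothetical stand-in for the actual objective:
library("NMOF")
library("neighbours")

## hypothetical stand-in for the actual objective over 50 binary parameters
f50 <- function(x)
    sum(seq_len(50) * x)

## neighbourhood that only proposes solutions with exactly 25 TRUE values
nb25 <- neighbourfun(type = "logical", kmin = 25, kmax = 25)

x0 <- c(rep(TRUE, 25), rep(FALSE, 25))  # feasible start: sum(x0) == 25
sol <- LSopt(f50, list(x0 = x0, neighbour = nb25, nI = 5000))
sum(sol$xbest)  # stays 25, since every neighbour preserves the cardinality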

You can try the amazing package rgenoud. Below is an example.
I take 20 binary variables instead of your 50 for easier reading. I take f(x) = sum(1:20 * x); this is a weighted sum with increasing weights, so the best solution subject to sum(x) = 10 is clearly 1, 1, ..., 1, 0, 0, ..., 0 (the ten smallest weights selected). And rgenoud brilliantly finds it.
library(rgenoud)
f <- function(x) { # the function to be minimized
  sum(1:20 * x)
}
g <- function(x){
  c(
    ifelse(sum(x) == 10, 0, 1), # set the constraint (here sum(x) = 10) in this way
    f(x)                        # the objective function (to minimize/maximize)
  )
}
solution <- genoud(
  g,
  pop.size = 3000,
  lexical = 2, # see ?genoud for explanations
  nvars = 20,  # number of x_i's
  starting.values = c(rep(0, 10), rep(1, 10)),
  Domains = cbind(rep(0, 20), rep(1, 20)), # lower and upper bounds
  data.type.int = TRUE # the x_i's are integers
)
solution$par # the values of x
## [1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
solution$value
## [1] 0 55   (0 is the value of the constraint ifelse(sum(x) == 10, 0, 1); 55 is the value of f(x))
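Adapting this to the original question (50 variables, sum(x) = 25) should only require changing the sizes; a sketch, again with a placeholder objective f50:
library(rgenoud)

f50 <- function(x) sum(seq_len(50) * x)  # placeholder objective

g50 <- function(x) {
  c(ifelse(sum(x) == 25, 0, 1),  # constraint sum(x) = 25, violation first (lexical)
    f50(x))
}

sol50 <- genoud(
  g50,
  pop.size = 5000,
  lexical = 2,
  nvars = 50,
  starting.values = c(rep(1, 25), rep(0, 25)),  # feasible start
  Domains = cbind(rep(0, 50), rep(1, 50)),
  data.type.int = TRUE
)
sol50$par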

Related

How to submit a "vector of distributions" to a function in R?

I want to write an R function, say f, which has inputs x and n, where x is some kind of "list of distributions" and f is supposed to draw n samples from each distribution in x.
What is a good way to implement this in R?
My current idea is
f = function(x, n){
  out = list()
  for(i in 1:length(x)){
    name = sub("\\(.*", "", x[i])
    size = ifelse(name == "sample", paste("size=", n), paste0("n=", n))
    args = paste(size, gsub("[\\(\\)]", "", regmatches(x[i], gregexpr("\\(.*?\\)", x[i]))[[1]]), sep = ",")
    out[[i]] = eval(parse(text = paste0(name, "(", args, ")")))
  }
  return(out)
}
f(x = c("rnorm(mean=1,sd=2)","sample(0:1,replace=TRUE)","rbinom(size=10,prob=0.1)"), n = 10)
I don't like this implementation, because
n is not always the input name for the sample size (e.g. in sample it is size),
the code will crash if not all inputs for the distributions are properly defined.
Can I improve the implementation, for example with x of class alist?
You could change your input and create a list of function names and arguments. For each distribution we set the n/size-value to 1.
ls_func <- list("rnorm"  = list(mean = 0, sd = 1, n = 1),
                "sample" = list(x = 0:1, replace = TRUE, size = 1),
                "rbinom" = list(size = 10, prob = 0.1, n = 1))
Your function then takes those distributions and replicates each n times:
g <- function(ls_func, n) {
  out = list()
  for(i in seq_along(ls_func)){
    out[[i]] <- replicate(do.call(names(ls_func)[i], ls_func[[i]]), n = n)
  }
  return(out)
}
so
set.seed(4096)
g(ls_func, 10)
returns
[[1]]
[1] 0.1894398 -0.1622468 0.5327100 -1.5747229 -0.6884024 -0.3092226 -0.0879258 -0.4320240 -0.7799596 0.4525895
[[2]]
[1] 0 1 0 0 0 1 1 1 0 0
[[3]]
[1] 0 0 1 1 0 1 1 1 1 0
Basically, it's not a good approach to use eval(parse(text=...)) to execute functions. Use do.call instead.
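For contrast, a minimal illustration of the two call styles (the rnorm arguments here are arbitrary):
## same call, two ways; do.call keeps arguments as data instead of strings
eval(parse(text = "rnorm(n = 3, mean = 1, sd = 2)"))   # fragile: string building
do.call("rnorm", list(n = 3, mean = 1, sd = 2))        # robust: named argument list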
You can remove the for-loop:
g <- function(ls_func, n) {
  out <- lapply(seq_along(ls_func), function(i) replicate(do.call(names(ls_func)[i], ls_func[[i]]), n = n))
  return(out)
}
Note: this code also crashes if your distributions aren't defined properly. To avoid this, you need some error handling; look at the try and stop functions.
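As a sketch of such error handling (my own addition, using tryCatch), here is a guarded variant of g that skips mis-specified entries with a warning instead of crashing:
g_safe <- function(ls_func, n) {
  lapply(seq_along(ls_func), function(i) {
    tryCatch(
      replicate(do.call(names(ls_func)[i], ls_func[[i]]), n = n),
      error = function(e) {
        warning("skipping '", names(ls_func)[i], "': ", conditionMessage(e))
        NULL  # placeholder for the failed distribution
      }
    )
  })
}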
I've been putting together an R package -- distionary -- that can help with this.
First make a list of input distributions:
library(distionary)
x <- list(
  dst_norm(1, 2^2),
  dst_empirical(0:1),
  dst_binom(10, 0.1)
)
The function for drawing from a distribution is realize(), which fits nicely with lapply() (or purrr's map()):
set.seed(123)
lapply(x, realize, n = 10)
#> [[1]]
#> [1] -0.1209513 0.5396450 4.1174166 1.1410168 1.2585755 4.4301300
#> [7] 1.9218324 -1.5301225 -0.3737057 0.1086761
#>
#> [[2]]
#> [1] 0 0 0 0 0 0 0 0 1 1
#>
#> [[3]]
#> [1] 3 2 1 2 0 1 2 0 0 0
Putting this code in a function is then straightforward:
f <- function(x, n) {
  lapply(x, realize, n = n)
}
set.seed(123)
f(x, n = 10)
#> [[1]]
#> [1] -0.1209513 0.5396450 4.1174166 1.1410168 1.2585755 4.4301300
#> [7] 1.9218324 -1.5301225 -0.3737057 0.1086761
#>
#> [[2]]
#> [1] 0 0 0 0 0 0 0 0 1 1
#>
#> [[3]]
#> [1] 3 2 1 2 0 1 2 0 0 0

How to solve linear programming model in R

I need to solve the following microeconomic problem:
I have six assets I can produce (asset 1 - 6) across five years (2011 - 2015).
Each asset can only be produced during one year.
Each asset must be produced in my five year period.
Production is not mutually exclusive; I can produce more than one good in a year without affecting the production of either.
Each asset has a fixed cost of production equal to 30.
I must have non-negative profit in each year; revenues must be at least 30.
Below is a matrix representing my potential revenue for producing each asset (i) in a given year (j).
        2011  2012  2013  2014  2015
Asset1   35*   37    39    42    45
Asset2   16    17    18    19    20*
Asset3  125   130   136*  139   144
Asset4   15    27    29    30*   33
Asset5   14    43*   46    50    52
Asset6    5     7     8    10    11*
The asterisks (*) represent what should be the optimal solution set.
How can I use R to solve for the production plan that maximizes my revenue (and therefore profit) subject to the constraints outlined above? My output should be a similar 6x5 matrix of 0s and 1s, where a 1 represents choosing to produce an asset in a given year.
This is a classic problem, and one that needs to be reformulated. Start by restating your problem:
max  sum_[i,t] (pi_[i,t] - C_[i,t]) * x_[i,t]
s.t.
sum_t x_[i,t] = 1                 [for all i]
sum_i pi_[i,t] * x_[i,t] >= 30    [for all t]
x_[i,t] in {0, 1}                 [for all i, t]
In the lpSolve package the maximization problem is given in a linear representation, i.e. in non-matrix form. Let's start by making the vector of objective coefficients, one per x_[i,t]. For ease, let's name its entries (the names are not used by the solver), just so we can keep track.
n <- 6
t <- 5
#x ordered by column.
x <- c(35, 16, 125, 15, 14, 5, 37, 17, 130, 27, 43, 7, 39, 18, 136, 29, 46, 8, 42, 19, 139, 30, 50, 10, 45, 20, 144, 33, 52, 11)
# if x is matrix use:
# x <- as.vector(x)
names(x) <- paste0('x_[', seq(n), ',', rep(seq(t), each = n), ']')
head(x, n * 2)
x_[1,1] x_[2,1] x_[3,1] x_[4,1] x_[5,1] x_[6,1] x_[1,2] x_[2,2] x_[3,2] x_[4,2] x_[5,2] x_[6,2]
35 16 125 15 14 5 37 17 130 27 43 7
length(x)
[1] 30
Now we need to create our constraints, starting with the first condition
sum_t x_[i,t] = 1 [ for all i ]
which we can create rather simply. The thing to watch out for here is that the dimensions have to be right. We have a vector of length 30, so our constraint matrix needs 30 columns. In addition we have 6 assets, so we need 6 rows for this condition. Again, let's name the rows and columns to keep track.
cond1 <- matrix(0, ncol = t * n,
                nrow = n,
                dimnames = list(paste0('x_[', seq(n), ',t]'),
                                names(x)))
cond1[, seq(n + 1)]
x_[1,1] x_[2,1] x_[3,1] x_[4,1] x_[5,1] x_[6,1] x_[1,2]
x_[1,t] 0 0 0 0 0 0 0
x_[2,t] 0 0 0 0 0 0 0
x_[3,t] 0 0 0 0 0 0 0
x_[4,t] 0 0 0 0 0 0 0
x_[5,t] 0 0 0 0 0 0 0
x_[6,t] 0 0 0 0 0 0 0
Next we fill in the correct fields: x_[1,1] + x_[1,2] + ... = 1, x_[2,1] + x_[2,2] + ... = 1, and so forth. A for loop is the simplest tool for this:
for(i in seq(n)){
  cond1[i, seq(i, 30, n)] <- 1
}
cond1[, seq(n + 1)]
x_[1,1] x_[2,1] x_[3,1] x_[4,1] x_[5,1] x_[6,1] x_[1,2]
x_[1,t] 1 0 0 0 0 0 1
x_[2,t] 0 1 0 0 0 0 0
x_[3,t] 0 0 1 0 0 0 0
x_[4,t] 0 0 0 1 0 0 0
x_[5,t] 0 0 0 0 1 0 0
x_[6,t] 0 0 0 0 0 1 0
We still have to create the RHS and specify the directions, but I'll leave that for later. So next let's create the matrix for the second condition
sum_i pi_[i,t] * x_[i,t] >= 30 [ for all t ]
The process is very similar, but now we need a row for each period, so the matrix has dimension 5x30. The main difference is that here we insert the revenue values pi_[i,t] rather than 1s:
cond2 <- matrix(0, ncol = t * n,
                nrow = t,
                dimnames = list(paste0('t=', seq(t)),
                                names(x)))
for(i in seq(t)){
  cond2[i, seq(n) + n * (i - 1)] <- x[seq(n) + n * (i - 1)]
}
cond2[, seq(1, n * t, n)]
x_[1,1] x_[1,2] x_[1,3] x_[1,4] x_[1,5]
t=1 35 0 0 0 0
t=2 0 37 0 0 0
t=3 0 0 39 0 0
t=4 0 0 0 42 0
t=5 0 0 0 0 45
Note that I'm printing the result for x_[1, t] to illustrate we've got it right.
Last we have the final condition. For this, note that ?lpSolve::lp has an argument all.bin; the documentation states
Logical: should all variables be binary? Default: FALSE.
Since all variables are either 1 or 0, we simply set this argument to TRUE. Before continuing, let's combine our conditions into one matrix
cond <- rbind(cond1, cond2)
Now both the RHS and the directions follow directly from the two conditions. From the documentation of the const.dir argument
Vector of character strings giving the direction of the constraint: each value should be one of "<", "<=", "=", "==", ">", or ">=". (In each pair the two values are identical.)
In our combined matrix we have n (6) rows representing the first condition and t (5) rows representing the second, so we need == six times followed by >= five times.
cond_dir <- c(rep('==', n), rep('>=', t))
The RHS is created in a similar fashion
RHS <- c(rep(1, n), rep(30, t))
And that's it! Now we're ready to solve our problem using the lpSolve::lp function.
sol = lpSolve::lp(direction = 'max',
                  objective.in = x,
                  const.mat = cond,
                  const.dir = cond_dir,
                  const.rhs = RHS,
                  all.bin = TRUE)
sol$objval
[1] 275
The weights for the solution are stored in sol$solution
names(sol$solution) <- names(x)
sol$solution
x_[1,1] x_[2,1] x_[3,1] x_[4,1] x_[5,1] x_[6,1] x_[1,2] x_[2,2] x_[3,2] x_[4,2] x_[5,2] x_[6,2] x_[1,3] x_[2,3] x_[3,3]
1 0 0 0 0 0 0 0 0 0 1 0 0 0 1
x_[4,3] x_[5,3] x_[6,3] x_[1,4] x_[2,4] x_[3,4] x_[4,4] x_[5,4] x_[6,4] x_[1,5] x_[2,5] x_[3,5] x_[4,5] x_[5,5] x_[6,5]
0 0 0 0 0 0 1 0 0 0 1 0 0 0 1
matrix(sol$solution,
       ncol = t,
       dimnames = list(rownames(cond1),
                       rownames(cond2)))
t=1 t=2 t=3 t=4 t=5
x_[1,t] 1 0 0 0 0
x_[2,t] 0 0 0 0 1
x_[3,t] 0 0 1 0 0
x_[4,t] 0 0 0 1 0
x_[5,t] 0 1 0 0 0
x_[6,t] 0 0 0 0 1
Which we quickly see is the correct solution. :-)
Side note on costs
One may have noticed "Where the hell did the costs go?". In this specific case, costs are fixed and not very interesting. This means we can ignore them during the calculations, because we know the total cost is going to be 30 * 6 = 180 (which has to be subtracted from the objective value). However, it is not uncommon that costs depend on various factors and might affect the optimal solution. For illustration, I'll include how we could incorporate costs in this example.
First we'll have to extend our objective vector to incorporate the costs for each product at each period
Fixed_C <- -30
x <- c(x, rep(Fixed_C, n * t))
Next we'll add a pseudo-constraint
x_[i,t] - C_[i,t] = 0 [for all i, t]
This constraint ensures that if x_[i,t] = 1 then the relevant cost is added to the objective. There are two ways to create it. The first is to have a matrix with n * t rows, one for each asset and period (a sketch of this variant appears at the end of this answer). Alternatively, we can rely on our first constraint and live with only a single aggregate constraint
sum_[i,t] (x_[i,t] - C_[i,t]) = 0
because the first constraint already forces exactly n of the x_[i,t] to equal 1. So our third constraint becomes
cond3 <- c(rep(1, n * t), rep(-1, n * t))
Lastly, we have to extend our RHS and the condition 1 and 2 matrices. Simply pad the condition matrices with 0s so the dimensions fit (note nrow = t for cond2, which has one row per period):
cond1 <- cbind(cond1, matrix(0, nrow = n, ncol = n * t))
cond2 <- cbind(cond2, matrix(0, nrow = t, ncol = n * t))
cond <- rbind(cond1, cond2, cond3)
cond_dir <- c(cond_dir, '==')
RHS <- c(RHS, 0)
And now we can once again find the optimal solution using lpSolve::lp
solC = lpSolve::lp(direction = 'max',
                   objective.in = x,
                   const.mat = cond,
                   const.dir = cond_dir,
                   const.rhs = RHS,
                   all.bin = TRUE)
solC$objval
[1] 95
which is equal to our previous value 275 minus the total fixed costs of 30 * 6 = 180.
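For completeness, here is a sketch of the first of the two ways mentioned above: one pseudo-constraint per cell, x_[i,t] - C_[i,t] = 0, built with diag(). It assumes cond1 and cond2 are the padded versions from above and x is the extended objective vector:
## one row per cell: x_[i,t] - c_[i,t] == 0
cond3_cell <- cbind(diag(n * t), -diag(n * t))
cond_cell  <- rbind(cond1, cond2, cond3_cell)
dir_cell   <- c(rep('==', n), rep('>=', t), rep('==', n * t))
RHS_cell   <- c(rep(1, n), rep(30, t), rep(0, n * t))

solC2 <- lpSolve::lp(direction = 'max', objective.in = x,
                     const.mat = cond_cell, const.dir = dir_cell,
                     const.rhs = RHS_cell, all.bin = TRUE)
solC2$objval  # should again be 95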

Transform categorical attribute vector into similarity matrix

I need to transform a categorical attribute vector into a "same attribute" matrix using R.
For example, I have a vector reporting the gender of N people (male = 1, female = 0). I need to convert this vector into an NxN matrix named A (with people's names on rows and columns), where each cell A_ij has the value 1 if persons i and j have the same gender and 0 otherwise.
Here is an example with 3 persons (first male, second female, third male), which produces this vector:
c(1, 0, 1)
I want to transform it into this matrix:
A = matrix( c(1, 0, 1, 0, 1, 0, 1, 0, 1), nrow=3, ncol=3, byrow = TRUE)
As lmo said in a comment, it's impossible to know the structure of your dataset, so what follows is just an example to show how it could be done.
First, make up some data.
set.seed(3488) # make the results reproducible
x <- LETTERS[1:5]
y <- sample(0:1, 5, TRUE)
df <- data.frame(x, y)
Now tabulate it according to your needs
A <- outer(df$y, df$y, function(a, b) as.integer(a == b))
dimnames(A) <- list(df$x, df$x)
A
# A B C D E
#A 1 1 1 0 0
#B 1 1 1 0 0
#C 1 1 1 0 0
#D 0 0 0 1 1
#E 0 0 0 1 1
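Applied to the 3-person vector from the question (the names p1..p3 are made up), the same two lines reproduce the matrix A asked for:
gender <- c(1, 0, 1)   # male, female, male
A <- outer(gender, gender, function(a, b) as.integer(a == b))
dimnames(A) <- list(paste0("p", 1:3), paste0("p", 1:3))
A
#   p1 p2 p3
#p1  1  0  1
#p2  0  1  0
#p3  1  0  1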

How to make double for loops faster in R

I am trying to do the calculation below using R. My function is recursive and uses a double for loop to compute the values of the result matrix. Is there a way to replace the for loops, or to evaluate the stopping condition faster?
x <- rnorm(2400, 0, 3)
y <- rnorm(400, 0, 3)
no_row <- length(x)
no_col <- length(y)
input <- matrix(data = 1, nrow = no_row, ncol = no_col)
result <- matrix(nrow = no_row, ncol = no_col)
calculation <- function(x, y)
{
  for(i in 1:no_row)
  {
    for(j in 1:no_col)
    {
      z <- exp(x[i] - y[j])
      result[i, j] <- (z/1 + z)
    }
  }
  new_x <- x - 1
  new_y <- y - 1
  residual <- input - result
  sq_sum_residulas <- sum((rowSums(residual, na.rm = T))^2)
  if(sq_sum_residulas >= 1){calculation(new_x, new_y)}
  else(return(residual))
}
output <- calculation(x, y)
To complement Benjamin's answer: you shouldn't use a recursive function here. Use a while loop with a max_iter parameter instead.
Reusing Benjamin's function:
calculation2 <- function(x, y){
  result <- outer(x, y, function(x, y) { z <- exp(x - y); z / 1 + z })
  result
}

calculation <- function(x, y, max_iter = 10){
  input <- matrix(data = 1, nrow = length(x), ncol = length(y))
  sq_sum_residulas <- 1  # initialize so we enter the while loop
  new_x <- x             # working copy of x: updated at each iteration
  new_y <- y             # working copy of y
  n_iter <- 1            # iteration counter
  while(sq_sum_residulas >= 1 & n_iter < max_iter){
    result <- calculation2(new_x, new_y)
    new_x <- x - 1
    new_y <- y - 1
    residual <- input - result
    sq_sum_residulas <- sum((rowSums(residual, na.rm = T))^2)
    n_iter <- n_iter + 1
  }
  if (n_iter == max_iter){
    stop("Didn't converge")
  }
  return(residual)
}
If you try to run this code, you will see that it doesn't converge. I guess there is a mistake in your computation, especially in z/1 + z: should it be z/(1 + z)?
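If the logistic function exp(x - y) / (1 + exp(x - y)) was indeed intended, a numerically safer variant (my own suggestion, not from the original answer) would use stats::plogis:
## hypothetical fix, assuming the logistic function was intended
calculation2_logistic <- function(x, y) {
  outer(x, y, function(x, y) plogis(x - y))  # plogis(q) == exp(q) / (1 + exp(q))
}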
The outer function is the tool you are looking for.
Compare these two functions that only generate the result matrix
x<-rnorm(100,0, 3)
y<-rnorm(100,0,3)
calculation <- function(x, y)
{
  result <- matrix(nrow = length(x), ncol = length(y))
  for(i in seq_along(x))
  {
    for(j in seq_along(y))
    {
      z <- exp(x[i] - y[j])
      result[i, j] <- (z/1 + z)
    }
  }
  result
}
calculation2 <- function(x, y){
  result <- outer(x, y, function(x, y) { z <- exp(x - y); z / 1 + z })
  result
}
library(microbenchmark)
microbenchmark(
  calculation(x, y),
  calculation2(x, y)
)
Unit: microseconds
               expr     min       lq      mean   median        uq      max neval
  calculation(x, y) 1862.40 1868.119 1941.5523 1871.490 1876.1825 8375.666   100
 calculation2(x, y)  466.26  469.192  515.3696  471.392  480.9225 4481.371   100
That discrepancy in time seems to grow as the length of the vectors increases.
Note, this will solve the speed problem for your double for loop, but there seem to be other issues in your function. It isn't clear what you are trying to do, or why calculation calls itself. As written, subtracting 1 from both x and y leaves every difference x[i] - y[j] unchanged, so each recursive call repeats the same computation and the function would be stuck in a loop forever, if it worked at all (it doesn't on my machine).
@Benjamin @Emmanuel-Lin Thanks for the solutions :) I was able to solve the issue with your inputs. Please find below the sample data set and code. The solution converges when sq_sum_residual becomes less than 0.01. This is more than 12x faster than my code with double for loops. Sorry for the confusion created by the sample data and the new_x, new_y calculation provided in the question.
The input is a dichotomous 9x10 matrix:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA 1 1 1 1 1 1 1 0 1
2 1 1 1 1 1 1 1 0 1 0
3 1 1 1 1 1 1 0 1 0 0
4 1 1 1 1 1 1 0 1 0 0
5 1 1 1 1 1 1 0 1 0 0
6 1 1 1 1 1 0 1 0 0 0
7 1 1 1 1 0 1 0 0 0 0
8 1 0 1 0 1 0 0 0 0 0
9 0 1 0 1 0 0 0 0 0 0
x <- c( 2.0794415,  1.3862944,  0.8472979,  0.8472979,  0.8472979,
        0.4054651,  0.0000000, -0.8472979, -1.3862944)
y <- c(-1.4404130, -1.5739444, -1.5739444, -1.5739444, -0.7472659,
       -0.1876501,  1.1986443,  0.7286407,  2.5849387,  2.5849387)
calculation <- function(x, y)
{
  ## 'input' is the dichotomous 9x10 matrix shown above,
  ## taken from the calling environment
  result <- outer(x, y, function(x, y){ z <- exp(x - y); z/(1 + z) })
  result[!is.finite(result)] <- NA
  variance_result <- result * (1 - result)
  row_var <- (-1) * rowSums(variance_result, na.rm = T)
  col_var <- (-1) * colSums(variance_result, na.rm = T)
  residual <- input - result
  row_residual <- rowSums(residual, na.rm = T)  # (not to be multiplied by -1)
  col_residual <- (-1) * colSums(residual, na.rm = T)
  new_x <- x - (row_residual / row_var)
  new_x[!is.finite(new_x)] <- NA
  new_x <- as.array(new_x)
  new_y <- y - (col_residual / col_var)
  new_y[!is.finite(new_y)] <- NA
  avg_new_y <- mean(new_y, na.rm = T)
  new_y <- new_y - avg_new_y
  new_y <- as.array(new_y)
  sq_sum_residual <- round(sum(row_residual^2), 5)
  if(sq_sum_residual >= .01)
    {calculation(new_x, new_y)}
  else(return(residual))
}
calculation(x, y)
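Note that calculation() reads input from the calling environment, so it must exist before the call; a sketch reconstructing it from the 9x10 matrix shown above:
input <- matrix(c(NA, 1, 1, 1, 1, 1, 1, 1, 0, 1,
                   1, 1, 1, 1, 1, 1, 1, 0, 1, 0,
                   1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
                   1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
                   1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
                   1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
                   1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
                   1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
                   0, 1, 0, 1, 0, 0, 0, 0, 0, 0),
                 nrow = 9, byrow = TRUE,
                 dimnames = list(1:9, paste0("X", 1:10)))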

Create counter within consecutive runs of certain values

I have an hourly value. I want to count how many consecutive hours the value has been zero since the last time it was not zero. This is an easy job for a spreadsheet or for loop, but I am hoping for a snappy vectorized one-liner to accomplish the task.
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
df <- data.frame(x, zcount = NA)
df$zcount[1] <- ifelse(df$x[1] == 0, 1, 0)
for(i in 2:nrow(df))
  df$zcount[i] <- ifelse(df$x[i] == 0, df$zcount[i - 1] + 1, 0)
Desired output:
R> df
x zcount
1 1 0
2 0 1
3 1 0
4 0 1
5 0 2
6 0 3
7 1 0
8 1 0
9 0 1
10 0 2
William Dunlap's posts on R-help are the place to look for all things related to run lengths. His f7 from this post is
f7 <- function(x){ tmp <- cumsum(x); tmp - cummax((!x) * tmp) }
and in the current situation you would call f7(!x). In terms of performance there is
> x <- sample(0:1, 1000000, TRUE)
> system.time(res7 <- f7(!x))
user system elapsed
0.076 0.000 0.077
> system.time(res0 <- cumul_zeros(x))
user system elapsed
0.345 0.003 0.349
> identical(res7, res0)
[1] TRUE
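As a quick check on the question's example vector, f7(!x) reproduces the desired zcount column:
> x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
> f7(!x)
[1] 0 1 0 1 2 3 0 0 1 2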
Here's a way, building on Joshua's rle approach: (EDITED to use seq_len and lapply as per Marek's suggestion)
> (!x) * unlist(lapply(rle(x)$lengths, seq_len))
[1] 0 1 0 1 2 3 0 0 1 2
UPDATE. Just for kicks, here's another way to do it, around 5 times faster:
cumul_zeros <- function(x) {
  x <- !x
  rl <- rle(x)
  len <- rl$lengths
  v <- rl$values
  cumLen <- cumsum(len)
  z <- x
  # replace the 0 at the end of each zero-block in z by the
  # negative of the length of the preceding 1-block....
  iDrops <- c(0, diff(v)) < 0
  z[ cumLen[ iDrops ] ] <- -len[ c(iDrops[-1], FALSE) ]
  # ... to ensure that the cumsum below does the right thing.
  # We zap the cumsum with x so only the cumsums for the 1-blocks survive:
  x * cumsum(z)
}
Try an example:
> cumul_zeros(c(1,1,1,0,0,0,0,0,1,1,1,0,0,1,1))
[1] 0 0 0 1 2 3 4 5 0 0 0 1 2 0 0
Now compare times on a million-length vector:
> x <- sample(0:1, 1000000,T)
> system.time( z <- cumul_zeros(x))
user system elapsed
0.15 0.00 0.14
> system.time( z <- (!x) * unlist( lapply( rle(x)$lengths, seq_len)))
user system elapsed
0.75 0.00 0.75
Moral of the story: one-liners are nicer and easier to understand, but not always the fastest!
rle will "count how many consecutive hours the value has been zero since the last time it was not zero", but not in the format of your "desired output".
Note the lengths for the elements where the corresponding values are zero:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 1 3 2 2
# values : num [1:6] 1 0 1 0 1 0
A simple base R approach:
ave(!x, cumsum(x), FUN = cumsum)
#[1] 0 1 0 1 2 3 0 0 1 2
Here cumsum(x) increases at every non-zero value, so it labels the runs; within each label, ave() takes the cumulative sum of !x, which counts the consecutive zeros.
One-liner, not exactly super elegant:
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
unlist(lapply(split(x, c(0, cumsum(abs(diff(!x == 0))))), function(x) (x[1] == 0) * seq(length(x))))
Using purrr::accumulate() is very straightforward, so this tidyverse solution may add some value here. I must acknowledge it is definitely not the fastest, as it calls the same function length(x) times.
library(purrr)
accumulate(x==0, ~ifelse(.y!=0, .x+1, 0))
[1] 0 1 0 1 2 3 0 0 1 2
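For reference, a base R sketch of the same left-fold, using Reduce() with accumulate = TRUE (the same speed caveat applies; the initial FALSE is coerced to 0 when the result is simplified):
Reduce(function(acc, is_zero) if (is_zero) acc + 1 else 0,
       x == 0, accumulate = TRUE)
[1] 0 1 0 1 2 3 0 0 1 2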
