Create a counter within consecutive runs of certain values in R

I have an hourly value. I want to count how many consecutive hours the value has been zero since the last time it was not zero. This is an easy job for a spreadsheet or a for loop, but I am hoping for a snappy vectorized one-liner to accomplish the task.
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
df <- data.frame(x, zcount = NA)
df$zcount[1] <- ifelse(df$x[1] == 0, 1, 0)
for(i in 2:nrow(df))
  df$zcount[i] <- ifelse(df$x[i] == 0, df$zcount[i - 1] + 1, 0)
Desired output:
R> df
   x zcount
1  1      0
2  0      1
3  1      0
4  0      1
5  0      2
6  0      3
7  1      0
8  1      0
9  0      1
10 0      2

William Dunlap's posts on R-help are the place to look for all things related to run lengths. His f7 from this post is
f7 <- function(x){ tmp<-cumsum(x);tmp-cummax((!x)*tmp)}
and in the current situation we can use f7(!x). In terms of performance:
> x <- sample(0:1, 1000000, TRUE)
> system.time(res7 <- f7(!x))
user system elapsed
0.076 0.000 0.077
> system.time(res0 <- cumul_zeros(x))
user system elapsed
0.345 0.003 0.349
> identical(res7, res0)
[1] TRUE
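For the question's sample vector, f7(!x) reproduces the desired counter; here is a quick check (added here for illustration, following the cumsum/cummax logic above):
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
f7(!x)
# [1] 0 1 0 1 2 3 0 0 1 2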

Here's a way, building on Joshua's rle approach: (EDITED to use seq_len and lapply as per Marek's suggestion)
> (!x) * unlist(lapply(rle(x)$lengths, seq_len))
[1] 0 1 0 1 2 3 0 0 1 2
UPDATE. Just for kicks, here's another way to do it, around 5 times faster:
cumul_zeros <- function(x) {
  x <- !x
  rl <- rle(x)
  len <- rl$lengths
  v <- rl$values
  cumLen <- cumsum(len)
  z <- x
  # replace the 0 at the end of each zero-block in z by the
  # negative of the length of the preceding 1-block....
  iDrops <- c(0, diff(v)) < 0
  z[cumLen[iDrops]] <- -len[c(iDrops[-1], FALSE)]
  # ... to ensure that the cumsum below does the right thing.
  # We zap the cumsum with x so only the cumsums for the 1-blocks survive:
  x * cumsum(z)
}
Try an example:
> cumul_zeros(c(1,1,1,0,0,0,0,0,1,1,1,0,0,1,1))
[1] 0 0 0 1 2 3 4 5 0 0 0 1 2 0 0
Now compare times on a million-length vector:
> x <- sample(0:1, 1000000,T)
> system.time( z <- cumul_zeros(x))
user system elapsed
0.15 0.00 0.14
> system.time( z <- (!x) * unlist( lapply( rle(x)$lengths, seq_len)))
user system elapsed
0.75 0.00 0.75
Moral of the story: one-liners are nicer and easier to understand, but not always the fastest!

rle will "count how many consecutive hours the value has been zero since the last time it was not zero", but not in the format of your "desired output".
Note the lengths for the elements where the corresponding values are zero:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 1 3 2 2
# values : num [1:6] 1 0 1 0 1 0
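If only the lengths of the zero runs are needed, they can be pulled straight out of the rle object (a small extension of the idea, not from the original answer):
with(rle(x), lengths[values == 0])
# [1] 1 3 2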

A simple base R approach:
ave(!x, cumsum(x), FUN = cumsum)
#[1] 0 1 0 1 2 3 0 0 1 2
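This works because cumsum(x) increments at every non-zero value, so each zero run falls into its own group, and the cumulative sum of the zero indicator counts within that group. A quick illustration with the question's x (added here, not part of the original answer):
cumsum(x)
# [1] 1 1 2 2 2 2 3 4 4 4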

One-liner, not exactly super elegant:
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
unlist(lapply(split(x, c(0, cumsum(abs(diff(!x == 0))))), function(x) (x[1] == 0) * seq(length(x))))

Using purrr::accumulate() is very straightforward, so this tidyverse solution may add some value here. I must acknowledge it is definitely not the fastest, as it calls the same function length(x) times.
library(purrr)
accumulate(x==0, ~ifelse(.y!=0, .x+1, 0))
[1] 0 1 0 1 2 3 0 0 1 2

Related

Creating a table with proportional values

I have got a data set that looks like this:
COMPANY DATABREACH CYBERBACKGROUND
A 1 2
B 0 2
C 0 1
D 0 2
E 1 1
F 1 2
G 0 2
H 0 2
I 0 2
J 0 2
Now I want to do the following: in 40% of the cases where the column DATABREACH has the value 1, I want CYBERBACKGROUND to take the value 2. I figure there must be some function to do this, but I cannot find it.
ind <- which(df$DATABREACH == 1)
ind <- ind[rbinom(length(ind), 1, prob = 0.4) > 0]
df$CYBERBACKGROUND[ind] <- 2
The above is a bit more efficient in that it only draws randomness for as many rows as strictly required. If you aren't concerned (11000 doesn't seem too high), you can reduce that to
df$CYBERBACKGROUND <-
  ifelse(df$DATABREACH == 1 & rbinom(nrow(df), 1, prob = 0.4) > 0,
         2, df$CYBERBACKGROUND)
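For a reproducible run of the first approach on the question's ten rows, here is a sketch (constructing the data frame and calling set.seed are additions for illustration only):
df <- data.frame(COMPANY = LETTERS[1:10],
                 DATABREACH = c(1, 0, 0, 0, 1, 1, 0, 0, 0, 0),
                 CYBERBACKGROUND = c(2, 2, 1, 2, 1, 2, 2, 2, 2, 2))
set.seed(1)  # only to make the example reproducible
ind <- which(df$DATABREACH == 1)
ind <- ind[rbinom(length(ind), 1, prob = 0.4) > 0]
df$CYBERBACKGROUND[ind] <- 2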
We may use
library(dplyr)
df1 <- df1 %>%
  mutate(CYBERBACKGROUND = replace(CYBERBACKGROUND,
           sample(which(DATABREACH == 0), sum(ceiling(sum(DATABREACH) * 0.4))), 2))

Multiple for loop time computation very high in R

I have data about machines in the following form
Number of rows - 900k
Data
A B C D E F G H I J K L M N
---- -- --- ---- --- --- --- --- --- --- --- --- --- ---
1 1 1 1 1 1 1 1 1 1 0 1 1 0 0
2 0 0 0 0 1 1 1 0 1 1 0 0 1 0
3 0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 indicates that the machine was active and 0 indicates that it was inactive.
I want my output to look like
A B C D E F G H I J K L M N
---- -- --- ---- --- --- --- --- --- --- --- --- --- ---
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
2 0 0 0 0 1 1 1 1 1 1 0 0 1 0
3 0 0 0 0 0 0 0 1 1 1 1 1 0 0
Basically, all I am trying to do is look for zeros in a particular row and, if a zero is surrounded by ones on either side, replace that 0 with 1.
For example, in row 1 there is a zero in column J, but columns I and K both contain 1, so I replace that 0 with 1 because it is surrounded by 1s.
The code I am using is this
for(i in 1:nrow(data)) {
  for(j in 2:13) {
    if(data[i,j]==0 && data[i,j-1]==1 && data[i,j+1]==1){
      data[i,j] = 1
    }
  }
}
Is there a way to reduce the time computation for this? This takes me almost 30 mins to run in R. Any help would be appreciated.
This is faster because it does not require iterating over the rows:
for(j in 2:13) {
  data[, j] = ifelse(data[, j - 1] * data[, j + 1] == 1, 1, data[, j])
}
or, a little bit more optimized, without using ifelse:
for(j in 2:(ncol(data) - 1)) {
  data[data[, j - 1] * data[, j + 1] == 1, j] <- 1
}
You could also use gsub to replace any instances of 101 with 111 using the following code:
collapsed <- gsub('101', '111', apply(df1, 1, paste, collapse = ''))
data <- as_tibble(t(matrix(unlist(sapply(collapsed, strsplit, split = '')), nrow = numLetters)))
names(data) <- LETTERS[1:numLetters]
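To see what the gsub step does, assume df1 holds the question's three rows as a 0/1 data frame (an illustration added here, not part of the original answer); each row is collapsed into a string and every "101" becomes "111":
apply(df1, 1, paste, collapse = '')
# row strings: "11111111101100" "00001110110010" "00000001111100"
gsub('101', '111', apply(df1, 1, paste, collapse = ''))
# after replacement: "11111111111100" "00001111110010" "00000001111100"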
Here's a comparison of everyone's solutions:
library(data.table)
library(rbenchmark)
library(tidyverse)
set.seed(1)
numLetters <- 13
df <- as_tibble(matrix(round(runif(numLetters * 100)), ncol = numLetters))
names(df) <- LETTERS[1:numLetters]
benchmark(
  'gsub' = {
    data <- df
    collapsed <- gsub('101', '111', apply(data, 1, paste, collapse = ''))
    data <- as_tibble(t(matrix(unlist(sapply(collapsed, strsplit, split = '')), nrow = numLetters)))
    names(data) <- LETTERS[1:numLetters]
  },
  'for_orig' = {
    data <- df
    for(i in 1:nrow(data)) {
      for(j in 2:(ncol(data) - 1)) {
        if(data[i, j] == 0 && data[i, j - 1] == 1 && data[i, j + 1] == 1) {
          data[i, j] = 1
        }
      }
    }
  },
  'for_norows' = {
    data <- df
    for(j in 2:(ncol(data) - 1)) {
      data[, j] = ifelse(data[, j - 1] * data[, j + 1] == 1, 1, data[, j])
    }
  },
  'vectorize' = {
    data <- df
    for(i in seq(ncol(data) - 2) + 1) {
      condition <- data[, i - 1] == data[, i + 1] & data[, i - 1] == 1 & data[, i] == 0
      data[which(condition), i] <- 1
    }
  },
  'index' = {
    data <- df
    idx <- apply(data, 1, function(x) c(0, diff(x)))
    data[which(idx == -1 & lead(idx == 1), arr.ind = TRUE)[, 2:1]] <- 1
  },
  replications = 100
)
The indexing solution (which has since been deleted) wins hands-down in terms of computational time for a 13-by-100 data frame.
        test replications elapsed relative user.self sys.self user.child sys.child
3 for_norows          100    1.19    7.438      1.19        0         NA        NA
2   for_orig          100    9.29   58.063      9.27        0         NA        NA
1       gsub          100    0.28    1.750      0.28        0         NA        NA
5      index          100    0.16    1.000      0.16        0         NA        NA
4  vectorize          100    0.87    5.438      0.87        0         NA        NA
Cut the time by using vectorized operations. As you are planning to do the same thing for every row, this can be done with vectorized conditional statements.
for(i in seq(ncol(data) - 2) + 1){ #<== all but the first and last column
  # Find all neighbouring columns that are equal, where the center column is equal to 0
  condition <- data[, i - 1] == data[, i + 1] & data[, i - 1] == 1 & data[, i] == 0
  # Overwrite only the values that satisfy the condition
  data[which(condition), i] <- 1
}
You can avoid loops altogether and use indexing to replace all the values at once:
nc <- ncol(df)
df[, 2:(nc - 1)][df[, 1:(nc - 2)] * df[, 3:nc] == 1] <- 1
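As a quick sanity check, here are the first two rows from the question entered as a plain matrix (a sketch added here, not part of the original answer):
m <- rbind(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0),
           c(0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0))
nc <- ncol(m)
m[, 2:(nc - 1)][m[, 1:(nc - 2)] * m[, 3:nc] == 1] <- 1
m
# row 1: the 0 in column J becomes 1; row 2: the 0 in column H becomes 1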

Creating a new column based on two old columns in a data frame

data <- data.frame(foo = c(0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1),
                   bar = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0))
Here I have a data frame with two columns, foo and bar. I want to create a new column, complete, based on the foo and bar data:
If foo and bar are both zero, then complete should be 0.
If foo is 1 and bar is 0, then complete should be 1.
If bar is 1 and foo is 0, then complete should be 2.
For example.
foo bar complete
0 0 0
1 0 1
0 1 2
Edit:
If foo==1 and bar==1 then NA.
Following suit, using NA when both columns are 1. Start with the row sums. Wherever a row sum is 2 (the number of columns), replace it with NA. Then multiply that by the max.col() value.
rs <- rowSums(data)
cbind(data, complete = max.col(data) * replace(rs, rs == 2, NA))
# foo bar complete
# 1 0 1 2
# 2 1 0 1
# 3 0 0 0
# 4 0 0 0
# 5 1 1 NA
# 6 0 0 0
# 7 0 1 2
# 8 0 0 0
# 9 1 0 1
# 10 1 1 NA
# 11 1 0 1
If you don't wish to assign new objects, you can use a local environment or wrap it up into a function:
local({
  rs <- rowSums(data)
  max.col(data) * replace(rs, rs == 2, NA)
})
# [1] 2 1 0 0 NA 0 2 0 1 NA 1
If an algebraic approach is sought, we can try one of the lines below:
with(data, 2L * bar + foo + 0L * NA^(bar & foo))
with(data, 2L * bar + foo + NA^(bar & foo) - 1L)
with(data, (2L * bar + foo) * NA^(bar & foo))
All return
[1] 2 1 0 0 NA 0 2 0 1 NA 1
Explanation
The expression 2L * bar + foo is treating bar and foo as digits of a binary number. The difficulty is to return NA in case of foo == 1 & bar == 1. For that, bar and foo are treated as logical values. If both are 1, i.e., TRUE then NA^(bar & foo) returns NA, otherwise 1.
If one operand of an expression is NA, so is the overall expression. So there are several possibilities for combining NA^(bar & foo) with 2L * bar + foo; I wonder which is the fastest.
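A quick check of all four input combinations with the third variant (an illustration added here, not from the original answer):
grid <- expand.grid(foo = 0:1, bar = 0:1)
with(grid, (2L * bar + foo) * NA^(bar & foo))
# [1]  0  1  2 NA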
Benchmark
So far, 7 different approaches have been posted by
d.b
Balter
PoGibas
Rich Scriven
Frank (in chat)
user 20650 in a comment
Uwe
The OP has supplied his sample data as type double. As I have seen remarkably different timings for integer and double values on other occasions, the benchmark runs will be repeated for each type to investigate the impact of the data type on the different approaches.
Benchmark data
The benchmark data will consist of 1 million rows:
n_row <- 1e6L
set.seed(1234L)
data_int <- data.frame(foo = sample(0:1, n_row, replace = TRUE),
                       bar = sample(0:1, n_row, replace = TRUE))
with(data_int, table(foo, bar))
     bar
foo        0      1
  0   249978 250330
  1   249892 249800
data_dbl <- data.frame(foo = as.double(data_int$foo),
                       bar = as.double(data_int$bar))
Benchmark code
For benchmarking, the microbenchmark package is used.
# define check function to compare results
check <- function(values) {
  all(sapply(values[-1], function(x) all.equal(values[[1]], x)))
}
library(dplyr)
data <- data_dbl
microbenchmark::microbenchmark(
  d.b = {
    vect = c("0 0" = 0, "1 0" = 1, "0 1" = 2)
    unname(vect[match(with(data, paste(foo, bar)), names(vect))])
  },
  Balter = with(data, ifelse(foo == 0 & bar == 0, 0,
                      ifelse(foo == 1 & bar == 0, 1,
                      ifelse(foo == 0 & bar == 1, 2, NA)))),
  PoGibas = with(data, case_when(foo == 0 & bar == 0 ~ 0,
                                 foo == 1 & bar == 0 ~ 1,
                                 foo == 0 & bar == 1 ~ 2)),
  Rich = local({rs = rowSums(data); max.col(data) * replace(rs, rs == 2, NA)}),
  Frank = with(data, ifelse(xor(foo, bar), max.col(data), 0*NA^foo)),
  user20650 = with(data, c(0, 1, 2, NA)[c(2*bar + foo + 1)]),
  uwe1i = with(data, 2L * bar + foo + 0L * NA^(bar & foo)),
  uwe1d = with(data, 2 * bar + foo + 0 * NA^(bar & foo)),
  uwe2i = with(data, 2L * bar + foo + NA^(bar & foo) - 1L),
  uwe2d = with(data, 2 * bar + foo + NA^(bar & foo) - 1),
  uwe3i = with(data, (2L * bar + foo) * NA^(bar & foo)),
  uwe3d = with(data, (2 * bar + foo) * NA^(bar & foo)),
  times = 11L,
  check = check)
Note that only the result vector is created without creating a new column in data. The approach of PoGibas was modified accordingly.
As mentioned above, there might be speed differences between using integer and double values. Therefore, I also wanted to test the effect of using integer constants, e.g., 0L, 1L, versus double constants, e.g., 0, 1.
Benchmark results
First, for input data of type double:
Unit: milliseconds
expr min lq mean median uq max neval cld
d.b 1687.05063 1700.52197 1707.72896 1706.48511 1715.46814 1730.62160 11 e
Balter 287.89649 377.42284 412.59764 452.75668 458.21178 472.92971 11 d
PoGibas 152.90900 154.82164 176.09522 158.23214 165.73524 333.48223 11 c
Rich 67.43862 68.68331 76.42759 77.10620 82.42179 89.90016 11 b
Frank 170.78293 174.66258 192.85203 179.69422 184.55237 333.74578 11 c
user20650 20.11790 20.29744 22.32541 20.81453 21.11509 34.45654 11 a
uwe1i 24.86296 25.13935 28.38634 25.60604 28.79395 45.53514 11 a
uwe1d 24.90034 25.05439 28.62943 25.41460 29.47379 41.08459 11 a
uwe2i 25.21222 25.59754 30.15579 26.29135 33.00361 47.13382 11 a
uwe2d 24.38305 25.09385 29.46715 25.41951 29.11112 45.05486 11 a
uwe3i 23.27334 23.95714 27.12474 24.28073 25.86336 44.40467 11 a
uwe3d 23.23332 23.65073 27.60330 23.96620 29.53911 40.41175 11 a
Now, for input data of type integer:
Unit: milliseconds
expr min lq mean median uq max neval cld
d.b 591.71859 596.31904 607.51452 601.24232 617.13886 636.51405 11 e
Balter 284.08896 297.06170 374.42691 303.14888 465.27859 488.19606 11 d
PoGibas 151.75851 155.28304 174.31369 159.18364 163.50864 329.00412 11 c
Rich 67.79770 71.22311 78.38562 77.46642 84.56777 96.55540 11 b
Frank 166.60802 170.34078 192.19833 180.09257 182.43584 350.86681 11 c
user20650 19.79204 20.06220 21.95963 20.18624 20.42393 30.13135 11 a
uwe1i 27.54680 27.83169 32.36917 28.08939 37.82286 45.21722 11 ab
uwe1d 22.60162 22.89350 25.94329 23.10419 23.74173 47.39435 11 a
uwe2i 27.05104 27.57607 27.80843 27.68122 28.02048 28.88193 11 a
uwe2d 22.83384 22.93522 23.22148 23.12231 23.41210 24.18633 11 a
uwe3i 25.17371 26.44427 29.34889 26.68290 27.08276 47.71379 11 a
uwe3d 21.68712 21.83060 26.16276 22.37659 28.40750 43.33989 11 a
For both integer and double input values, the approach by user20650 is the fastest. Next are my algebraic approaches. Third is Rich's solution, but it is three times slower than the second.
The type of input data has the strongest impact on d.b's solution and to a lesser extent on Balter's. The other solutions seem to be rather invariant.
Interestingly, there seems to be no remarkable difference from using integer or double constants in my algebraic solutions.
You can create a named vector (vect in this example) and look up values from that vector using match:
vect = c("0 0" = 0, "1 0" = 1, "0 1" = 2)
unname(vect[match(with(data, paste(foo, bar)), names(vect))])
# [1] 2 1 0 0 NA 0 2 0 1 NA 1
There are a lot of ways to do this, some more efficient than others depending on how many conditions you have. But a basic way is:
data$New_Column <- with(data, ifelse(foo == 0 & bar == 0, 0,
                             ifelse(foo == 1 & bar == 0, 1,
                             ifelse(foo == 0 & bar == 1, 2, NA))))
# foo bar New_Column
#1 0 1 2
#2 1 0 1
#3 0 0 0
#4 0 0 0
#5 1 1 NA
#6 0 0 0
#7 0 1 2
#8 0 0 0
#9 1 0 1
#10 1 1 NA
#11 1 0 1

Combine each element of a vector with another vector in R

I have two vectors
x <- c(2, 3, 4)
y <- rep(0, 5)
I want to get the following output:
> z
2, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0
How can I create z? I have tried to use paste and c, but nothing seems to work. The only thing I can think of is using a for() loop, and it is terribly slow. I have googled this and I am sure the solution is out there; I am just not hitting the right keywords.
UPDATE:
For benchmarking purposes:
Using Nicola's solution:
> system.time(
+ precipitation <- `[<-`(numeric(length(x)*(length(y)+1)),seq(1,by=length(y)+1,length.out=length(x)),x)
+ )
user system elapsed
0.419 0.407 0.827
This is ridiculously fast, I must say! Can someone please explain it to me? My for() loop, which I know is always the wrong approach in R, would have taken at least a day, if it finished at all.
The other suggestions:
> length(prate)
[1] 4914594
> length(empty)
[1] 207
> system.time(
+ precipitation <- unlist(sapply(prate, FUN = function(prate) c(prate,empty), simplify=FALSE))
+ )
user system elapsed
16.470 3.859 28.904
I had to kill the following after 15 minutes:
len <- length(prate)
precip2 <- c(rbind(prate, matrix(rep(empty, len), ncol = len)))
You can try this:
unlist(sapply(x, FUN = function(x) c(x,y), simplify=FALSE))
[1] 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0
or, simpler, from @docendodiscimus:
unlist(lapply(x, FUN = function(x) c(x,y)))
This seems faster for some reason:
unlist(t(matrix(c(as.list(x),rep(list(y),length(x))),ncol=2)))
The above solution is general, in the sense that both x and y can have any value. In the OP case, where y is made just of 0, this is fast as it can be:
`[<-`(numeric(length(x)*(length(y)+1)),seq(1,by=length(y)+1,length.out=length(x)),x)
#[1] 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0
Edit
I realise I've been very cryptic and the code I produced is not easy to follow, despite being just one line. I'm going to explain in detail what the second solution does.
First of all, you notice that the resulting vector will have the values contained in x plus the zeroes in y repeated length(x) times. So in total, it will be length(x) + length(x)*length(y), or length(x)*(length(y)+1), elements long. So we create a vector of just zeroes, as long as needed:
res<-numeric(length(x)*(length(y)+1))
Now we have to place the x values in res. We notice that the first value of x occupies the first position of res; the second comes length(y)+1 positions after the first, and so on, until all length(x) values are placed. We can create a vector of the indices at which to put the x values:
indices<-seq.int(1,by=length(y)+1,length.out=length(x))
And then we make the replacement:
res[indices]<-x
My line was just a shortcut for the three lines above. Hope this clarifies a little.
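Putting those three lines together with the question's x and y gives a small check (added here, not part of the original answer):
x <- c(2, 3, 4)
y <- rep(0, 5)
res <- numeric(length(x) * (length(y) + 1))
indices <- seq.int(1, by = length(y) + 1, length.out = length(x))
res[indices] <- x
res
# [1] 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0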
You could also try to vectorize as follows
len <- length(x)
c(rbind(x, matrix(rep(y, len), ncol = len)))
## [1] 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0
A more compact, but potentially slower, option (contributed by @akrun) would be
c(rbind(x, replicate(len, y)))
## [1] 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0
You can try:
c(sapply(x, 'c', y))
#[1] 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0
Or a crazy solution with gsub and paste:
library(functional)
p = Curry(paste0, collapse='')
as.numeric(strsplit(p(gsub('(.*)$', paste0('\\1',p(y)),x)),'')[[1]])
#[1] 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0
Here's another way; it relies on y being five zeros, so multiplying x by 10^5 appends five zero digits to each value before splitting into characters:
options(scipen=100)
as.numeric(unlist(strsplit(as.character(x * 10^5), "")))
And some benchmarks:
library(microbenchmark)
microbenchmark({as.numeric(unlist(strsplit(as.character(x*10^5), "")))},
               {unlist(t(matrix(c(as.list(x),rep(list(y),length(x))),ncol=2)))},
               {unlist(sapply(x, FUN = function(x) c(x,y), simplify=FALSE))},
               times=100000)
Unit: microseconds
                                                                     expr   min     lq     mean median     uq       max  neval
       { as.numeric(unlist(strsplit(as.character(x * 10^5), ""))) }       9.286 10.644 12.15242 11.678 12.286  1650.133 100000
 { unlist(t(matrix(c(as.list(x), rep(list(y), length(x))), ncol = 2))) }  9.485 11.164 13.25424 12.288 13.067  1887.761 100000
    { unlist(sapply(x, FUN = function(x) c(x, y), simplify = FALSE)) }    5.607  7.429  9.21015  8.147  8.784 30457.994 100000
And here's another idea (but it seems slow):
r = rle(1)
r$lengths = rep(c(1,5), length(x))
r$values = as.vector(rbind(x, 0))
inverse.rle(r)
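# [1] 2 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0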

Finding sublists in lists in R

I have a matrix of booleans. Most rows look like this
1 1 1 1 1 0 0 0 0
but some of them look like this
1 1 1 1 0 0 1 1 1
I want to find the ones that have a 0 followed by a 1. How can I do this? My naive attempt is to try
c(0, 1) %in% my_list
but that returns
[1] TRUE TRUE
since both 0 and 1 are in the list. D'oh! Any help?
How about looking at diffs instead:
x <- c(1,1,1,1,0,0,1,1,1)
1 %in% diff(x)
Then create a function, and apply it to the rows of your matrix.
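For example, a small sketch of that apply step (added here; the same pattern appears in the benchmark below):
m <- rbind(c(1, 1, 1, 1, 1, 0, 0, 0, 0),
           c(1, 1, 1, 1, 0, 0, 1, 1, 1))
apply(m, 1, function(row) 1 %in% diff(row))
# [1] FALSE  TRUE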
This is probably pretty pointless as the other answer is already pretty quick, but this will scale better for very big matrices to identify the rows you want. E.g.:
no <- c(1, 1, 1, 1, 1, 0, 0, 0, 0)
yes <- c(1, 1, 1, 1, 0, 0, 1, 1, 1)
m <- rbind(no,yes,no,yes,no,yes,yes)
# 1 2 3 4 5 6 7
# result should thus be c(2,4,6,7)
col(t(m[,-1]))[diff(t(m))==1]
#[1] 2 4 6 7
1 million row matrix benchmark:
m <- m[sample(1:2,1000000,replace=TRUE),]
system.time(apply(m, 1, function(x) 1 %in% diff(x) ))
# user system elapsed
# 12.09 0.00 12.09
system.time(col(t(m[,-1]))[diff(t(m))==1])
# user system elapsed
# 0.61 0.05 0.65
An alternative suggested by @MatthewLundberg, which is probably a really good balance of speed and readability:
system.time(apply(diff(t(m))==1, 2, any))
# user system elapsed
# 1.85 0.00 1.84
