Generating coin flips using purrr in R

I'm now learning the purrr package in R, and thinking about how to generate n samples of 5 coin flips for each n = 1, 2, ..., 100.
What I have in mind is a list that looks like this:
[[1]]
[[1]]
[1] 1 0 1 0 0
[[2]]
[[1]]
[1] 1 0 0 0 1
[[2]]
[1] 0 1 0 1 1
[[3]]
[[1]]
[1] 0 1 1 1 0
[[2]]
[1] 1 0 0 0 1
[[3]]
[1] 0 1 1 1 1
..
Can anyone help me construct this?

You want the function rerun applied to each element of the vector 1:100, using the map function, as follows:
library(purrr)
1:100 %>% map(function(x) rerun(x, rbinom(5,1,.5)))
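(Note: rerun was later deprecated, in purrr 1.0.0. A sketch of the same idea in current purrr replaces it with an inner map over seq_len(x).)
library(purrr)
# For each x in 1:100, draw x samples of 5 flips; the inner map()
# replaces rerun(), which was deprecated in purrr 1.0.0.
flips <- map(1:100, function(x) map(seq_len(x), function(i) rbinom(5, 1, 0.5)))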
However, it is just as easy to use replicate; by default, replicate simplifies its result into a column-wise array.
lapply(1:100, function(x) replicate(x,rbinom(5,1,0.5)))
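To see what that column-wise default means, here's a quick shape check (an illustration added here, not from the original answer):
# replicate() simplifies to a 5 x n matrix: one column per replication.
dim(replicate(3, rbinom(5, 1, 0.5)))
# [1] 5 3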
Note that the base R expression is much faster in this case.
a <- function() 1:100 %>% map(function(x) rerun(x, rbinom(5,1,.5)))
b <- function() lapply(1:100, function(x) replicate(x,rbinom(5,1,0.5)))
library(microbenchmark)
microbenchmark(a(),b())
Unit: milliseconds
 expr       min        lq      mean    median        uq      max neval cld
  a()  96.89941 104.83822 117.10245 111.48309 120.28554 391.9411   100   b
  b()  16.88232  18.47104  23.22976  22.20549  26.31445  49.0042   100  a
Edit: Regarding your question in the comments, if you are just interested in visualizing the law of large numbers, you could do the following.
plot(1:100, do.call("c", lapply(b(), mean)),
     type = "l", xlab = "replications",
     ylab = "proportion of heads")
abline(h = .5)

If I understand you correctly, this is what you're after:
lapply(1:100, function(x) replicate(x, rbinom(5, 1, 0.5), simplify = FALSE))
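A quick way to confirm the shape matches the target (a check added here, not part of the original answer):
res <- lapply(1:100, function(x) replicate(x, rbinom(5, 1, 0.5), simplify = FALSE))
length(res[[3]])   # 3 samples in the third element
lengths(res[[3]])  # each sample has 5 flips: 5 5 5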


Save the output of a for loop in R

Suppose I have a binomial sample of size 12 with p = 0.2. I split this sample into 4 chunks (groups), each of size 3. Then I remove the chunks whose sum is equal to 0. What I'm trying to do is combine all the remaining chunks into a new vector. Here's my code:
set.seed(123)
sample1 = rbinom(12, 1, 0.2)
chuck2 = function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))
chunk = chuck2(sample1, 4)
for (i in 1:4) {
  aa = chunk[[i]]
  if (sum(aa) != 0) {
    a.no0 = aa
    print(a.no0)
  }
}
And here's the output:
[1] 1 1 0
[1] 0 1 0
[1] 0 1 0
I want to combine these three outputs into a new vector like:
[1] 1 1 0 0 1 0 0 1 0
but I have no idea how to do this. Any hints, please?
set.seed(123)
sample1 = rbinom(12, 1, 0.2)
chuck2 = function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))
chunk = chuck2(sample1, 4)
int_vector <- c()
for (i in 1:4) {
  aa = chunk[[i]]
  if (sum(aa) != 0) {
    a.no0 = aa
    int_vector <- c(int_vector, a.no0)
  }
}
int_vector
# [1] 1 1 0 0 1 0 0 1 0
Create a list() and assign it to a variable. Then, inside the loop, append the looped values to that list.
new_vector <- list()
for (i in 1:4) {
  aa = chunk[[i]]
  if (sum(aa) != 0) {
    a.no0 = aa
    new_vector <- append(new_vector, a.no0)
  }
}
new_vector
This will return:
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 0
[[4]]
[1] 0
[[5]]
[1] 1
[[6]]
[1] 0
[[7]]
[1] 0
[[8]]
[1] 1
[[9]]
[1] 0
But I think you want a flattened vector:
as.vector(unlist(new_vector))
[1] 1 1 0 0 1 0 0 1 0
Doesn't directly address your issue, but this can be accomplished without a for-loop:
library(dplyr)
library(tidyr)  # gather() is from tidyr, not dplyr
set.seed(123)
sample1 <- rbinom(12, 1, 0.2)
as.data.frame(matrix(sample1, ncol = 3, byrow = TRUE)) %>%
  mutate(test = rowSums(.), id = 1:n()) %>%
  filter(test > 0) %>%
  dplyr::select(-test) %>%
  gather(key, value, -id) %>%
  arrange(id, key) %>%
  .$value
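As an aside, gather() has since been superseded in tidyr; a sketch of the same pipeline with pivot_longer() (my translation, giving the same result on this data):
library(tidyr)  # pivot_longer()
as.data.frame(matrix(sample1, ncol = 3, byrow = TRUE)) %>%
  mutate(test = rowSums(.), id = row_number()) %>%
  filter(test > 0) %>%
  dplyr::select(-test) %>%
  pivot_longer(-id) %>%
  arrange(id, name) %>%
  pull(value)
# [1] 1 1 0 0 1 0 0 1 0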
Two versions without a for loop.
Data:
set.seed(123)
sample1 <- rbinom(12, 1, 0.2)
Base R functional version:
split.sample1 <- split(sample1, cut(seq_along(sample1), 4, labels = FALSE))
sumf <- function(x) if (sum(x) == 0) NULL else x
result <- unlist(lapply(split.sample1, sumf), use.names = FALSE)
> result
[1] 1 1 0 0 1 0 0 1 0
A version using the %>% pipe operator:
library(magrittr) # for %>% operator
grp.indx <- cut(seq_along(sample1), 4, labels = FALSE)
split.sample1 <- sample1 %>% split(grp.indx)
result <- split.sample1 %>% lapply(sumf) %>% unlist(use.names = FALSE)
> result
[1] 1 1 0 0 1 0 0 1 0
It seems like your function builds a pseudo-matrix as a list. This instead makes a matrix directly from sample1 and then returns, as a vector, the rows whose rowSums are greater than 0.
set.seed(123)
sample1 = rbinom(12, 1, 0.2)
chunk_mat = matrix(sample1, ncol = 3, byrow = TRUE)
as.vector(t(chunk_mat[which(rowSums(chunk_mat) != 0), ]))
Here are benchmarks. I kept chuck2 in the global environment, but each function still has to generate the chunk data frame / matrix / list, so the comparison is apples to apples.
Unit: microseconds
            expr      min        lq       mean    median        uq       max neval
     cole_matrix   19.902   26.2515   38.60094   43.3505   47.4505    56.801   100
 heds_int_vector 4965.201 5101.9010 5616.53893 5251.8510 5490.9010 23417.401   100
 bwilliams_dplyr 5278.602 5506.4010 5847.55298 5665.7010 5821.5515  9413.801   100
      Simon_base  128.501  138.0010  196.46697  185.6005  203.1515  2481.101   100
  Simon_magrittr  366.601  392.5005  453.74806  455.1510  492.0010   739.501   100
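The benchmark harness itself isn't shown; a minimal sketch of how two of the entries could be wrapped, with names taken from the table above and bodies from the answers:
library(microbenchmark)
# Matrix-based approach from the answer above
cole_matrix <- function(x) {
  chunk_mat <- matrix(x, ncol = 3, byrow = TRUE)
  as.vector(t(chunk_mat[which(rowSums(chunk_mat) != 0), ]))
}
# Base R split/lapply approach from the answer above
Simon_base <- function(x) {
  split.x <- split(x, cut(seq_along(x), 4, labels = FALSE))
  unlist(lapply(split.x, function(g) if (sum(g) == 0) NULL else g),
         use.names = FALSE)
}
microbenchmark(cole_matrix(sample1), Simon_base(sample1))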

Return vector position in a list in R

I am trying to determine which vector in a list each element of a computed result comes from. Here's a reproducible example:
set.seed(101)
a <- runif(10, min=0, max=100)
b <- runif(10, min=0, max=100)
c <- runif(10, min=0, max=100)
d <- runif(10, min=0, max=100)
information <- list(a, b, c, d)
information.wanted <- do.call(pmax, information)
The code to get information.wanted works just fine. What I am now trying to find is the individual vector in the list that each of the maximum values comes from. For example, value 1 of information.wanted (87.97...) comes from vector b in the information list. I would like code that gives, for each value, the position of the vector it came from.
> information.wanted
[1] 87.97957 95.68375 73.19726 93.16344 92.33189 91.34787 82.04361 81.42830 62.20120
[10] 92.48044
I have no idea how to do this though. None of the code that I've tried has gotten me anywhere close.
position.of.information.wanted <- ??
I'm looking to get something like this. A numeric vector is fine; I can fill in the values later.
> position.of.information.wanted
[1] 2 3 ...
Any help would be greatly appreciated. Thanks.
You need to apply which.max to each "i" index of each element in "information":
f1 = function(x)
sapply(seq_along(x[[1]]), function(i) which.max(sapply(x, "[[", i)))
f1(information)
# [1] 2 3 2 2 3 4 2 4 1 4
mapply already provides that kind of "parallel" functionality:
f2 = function(x)
unlist(.mapply(function(...) which.max(c(...)), x, NULL))
f2(information)
# [1] 2 3 2 2 3 4 2 4 1 4
Or, instead of concatenating "information" in chunks, convert it to a matrix at the start (as David Arenburg notes in the comments) and apply which.max to its rows:
f3a = function(x)
apply(do.call(cbind, x), 1, which.max)
f3a(information)
# [1] 2 3 2 2 3 4 2 4 1 4
or its columns:
f3b = function(x)
apply(do.call(rbind, x), 2, which.max)
f3b(information)
# [1] 2 3 2 2 3 4 2 4 1 4
also, max.col is convenient for a "matrix":
f4 = function(x)
max.col(do.call(cbind, x), "first")
f4(information)
# [1] 2 3 2 2 3 4 2 4 1 4
If this weren't R, a simple loop over the elements would provide both which.max and max... but R handles vectors, too:
f5 = function(x)
{
  ans = rep_len(1L, length(x[[1]]))
  maxs = x[[1]]
  for (i in 2:length(x)) {
    wh = x[[i]] > maxs
    maxs[wh] = x[[i]][wh]
    ans[wh] = i
  }
  ans  # or 'data.frame(i = ans, val = maxs)' for both
}
f5(information)
# [1] 2 3 2 2 3 4 2 4 1 4
It had to end with a benchmark:
set.seed(007)
dat = replicate(13, runif(1e4), FALSE)
identical(f1(dat), f2(dat))
#[1] TRUE
identical(f2(dat), f3a(dat))
#[1] TRUE
identical(f3a(dat), f3b(dat))
#[1] TRUE
identical(f3b(dat), f4(dat))
#[1] TRUE
identical(f4(dat), f5(dat))
#[1] TRUE
microbenchmark::microbenchmark(f1(dat), f2(dat), f3a(dat), f3b(dat), f4(dat), f5(dat), do.call(pmax, dat), times = 50)
#Unit: microseconds
#               expr        min         lq       mean     median         uq        max neval cld
#            f1(dat) 274995.963 298662.210 339279.948 318937.172 350822.539 723673.972    50   d
#            f2(dat)  94619.397 100079.205 114664.776 107479.127 114619.439 226733.260    50  c
#           f3a(dat)  19767.925  23423.688  26382.919  25795.499  29215.839  40100.656    50 b
#           f3b(dat)  20351.872  22829.997  28889.845  25090.446  30503.100 140311.058    50 b
#            f4(dat)    975.102   1109.431   1546.571   1169.462   1361.733   8954.100    50 a
#            f5(dat)   2427.665   2470.816   5299.386   2520.755   3197.793 112986.612    50 a
# do.call(pmax, dat)   1477.618   1530.166   1627.934   1551.046   1602.898   2814.295    50 a

Most efficient way to turn factor matrix into binary (indicator) matrix in R

I can think of several ways to turn a matrix (data frame) of this type:
dat = data.frame(
  x1 = rep(c('a', 'b'), 100),
  x2 = rep(c('x', 'y'), 100),
  stringsAsFactors = TRUE  # needed on R >= 4.0 (here and in the later data.frame() calls) so the columns are factors
)
head(dat)
  x1 x2
1  a  x
2  b  y
3  a  x
4  b  y
5  a  x
6  b  y
Into a binary (indicator) matrix (or data frame) like this:
a b x y
1 0 1 0
0 1 0 1
...
(This structure is, of course, trivial and only for illustrative purposes!)
Many thanks!
We can use table:
tbl <- table(rep(1:nrow(dat),2),unlist(dat))
head(tbl, 2)
#   a b x y
# 1 1 0 1 0
# 2 0 1 0 1
Or a possibly more efficient option would be:
library(Matrix)
sM <- sparse.model.matrix(~ -1 + x1 + x2, dat,
        contrasts.arg = lapply(dat, contrasts, contrasts = FALSE))
colnames(sM) <- sub(".*\\d", "", colnames(sM))
head(sM, 2)
# 2 x 4 sparse Matrix of class "dgCMatrix"
#   a b x y
# 1 1 . 1 .
# 2 . 1 . 1
It can be converted to binary by converting to a matrix:
head(as.matrix(sM), 2)
#   a b x y
# 1 1 0 1 0
# 2 0 1 0 1
There are some good solutions posted already, but none are optimal for performance. We can optimize performance by looping over each input column, and then looping over each factor level index within each input column and doing a straight integer comparison of the factor indexes. It's not the most concise or elegant piece of code, but it's fairly straightforward and fast:
do.call(cbind, lapply(dat, function(col)
  `colnames<-`(do.call(cbind, lapply(seq_along(levels(col)), function(i)
    as.integer(as.integer(col) == i)
  )), levels(col))
));
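A quick look at the first rows of the result (using the dat from the question, with factor columns):
m <- do.call(cbind, lapply(dat, function(col)
  `colnames<-`(do.call(cbind, lapply(seq_along(levels(col)), function(i)
    as.integer(as.integer(col) == i)
  )), levels(col))
))
head(m, 2)
#      a b x y
# [1,] 1 0 1 0
# [2,] 0 1 0 1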
Performance:
library(Matrix);
library(data.table);
library(microbenchmark);
bgoldst <- function(dat) do.call(cbind,lapply(dat,function(col) `colnames<-`(do.call(cbind,lapply(seq_along(levels(col)),function(i) as.integer(as.integer(col)==i))),levels(col))));
akrun1 <- function(dat) table(rep(1:nrow(dat),2),unlist(dat));
akrun2 <- function(dat) sparse.model.matrix(~-1+x1+x2,dat,contrasts.arg=lapply(dat,contrasts,contrasts=FALSE));
davidar <- function(dat) { dat[,rowid:=.I]; dcast(melt(dat,id='rowid'),rowid~value,length); }; ## requires a data.table
dataminer <- function(dat) t(apply(dat,1,function(x) as.numeric(unique(unlist(dat))%in%x)));
N <- 100L; dat <- data.frame(x1=rep(c('a','b'),N),x2=rep(c('x','y'),N)); datDT <- setDT(copy(dat));
identical(unname(bgoldst(dat)),matrix(as.vector(akrun1(dat)),ncol=4L));
## [1] TRUE
identical(unname(bgoldst(dat)),unname(matrix(as.integer(as.matrix(akrun2(dat))),ncol=4L)));
## [1] TRUE
identical(bgoldst(dat),as.matrix(davidar(datDT)[,rowid:=NULL]));
## [1] TRUE
identical(unname(bgoldst(dat)),matrix(as.integer(dataminer(dat)),ncol=4L));
## [1] TRUE
N <- 100L;
dat <- data.frame(x1=rep(c('a','b'),N),x2=rep(c('x','y'),N)); datDT <- setDT(copy(dat));
microbenchmark(bgoldst(dat),akrun1(dat),akrun2(dat),davidar(datDT),dataminer(dat));
## Unit: microseconds
##            expr       min        lq       mean     median         uq       max neval
##    bgoldst(dat)    67.570    92.374   106.2853    99.6440   121.2405   188.596   100
##     akrun1(dat)   581.182   652.386   773.6300   690.6605   916.4625  1192.299   100
##     akrun2(dat)  4429.208  4836.119  5554.5902  5145.3135  5977.0990 11263.537   100
##  davidar(datDT)  5064.273  5498.555  6104.7621  5664.9115  6203.9695 11713.856   100
##  dataminer(dat) 47577.729 49529.753 55217.3726 53190.8940 60041.9020 74346.268   100
N <- 1e4L;
dat <- data.frame(x1=rep(c('a','b'),N),x2=rep(c('x','y'),N)); datDT <- setDT(copy(dat));
microbenchmark(bgoldst(dat),akrun1(dat),akrun2(dat),davidar(datDT));
## Unit: milliseconds
##           expr       min        lq      mean   median        uq        max neval
##   bgoldst(dat)  1.775617  1.820949  2.299493  1.84725  1.972124   8.362336   100
##    akrun1(dat) 38.954524 41.109257 48.409613 45.60304 52.147633 162.365472   100
##    akrun2(dat) 16.915832 17.762799 21.288200 19.20164 23.775180  46.494055   100
## davidar(datDT) 36.151684 38.366715 42.875940 42.38794 45.916937  58.695008   100
N <- 1e5L;
dat <- data.frame(x1=rep(c('a','b'),N),x2=rep(c('x','y'),N)); datDT <- setDT(copy(dat));
microbenchmark(bgoldst(dat),akrun1(dat),akrun2(dat),davidar(datDT));
## Unit: milliseconds
##           expr       min        lq      mean    median        uq      max neval
##   bgoldst(dat)  17.16473  22.97654  35.01815  26.76662  31.75562 152.6188   100
##    akrun1(dat) 501.72644 626.14494 671.98315 680.91152 727.88262 828.8313   100
##    akrun2(dat) 212.12381 242.65505 298.90254 272.28203 357.65106 429.6023   100
## davidar(datDT) 368.04924 461.60078 500.99431 511.54921 540.39358 638.3840   100
If you have a data.frame as you are showing (not a matrix), you could just as well recast the data:
library(data.table)
setDT(dat)[, rowid := .I] # Creates a row index
res <- dcast(melt(dat, id = "rowid"), rowid ~ value, length) # long/wide format
head(res)
#   rowid a b x y
# 1     1 1 0 1 0
# 2     2 0 1 0 1
# 3     3 1 0 1 0
# 4     4 0 1 0 1
# 5     5 1 0 1 0
# 6     6 0 1 0 1
Some benchmarks:
dat = data.frame(
x1 = rep(c('a', 'b'), 1e3),
x2 = rep(c('x', 'y'), 1e3)
)
library(data.table)
library(Matrix)
library(microbenchmark)
dat2 <- copy(dat)
microbenchmark("akrun1 : " = table(rep(1:nrow(dat),2),unlist(dat)),
"akrun2 : " = sparse.model.matrix(~ -1 + x1 +x2, dat, contrasts.arg = lapply(dat, contrasts, contrasts = FALSE)),
"DatamineR : " = t(apply(dat,1, function(x) as.numeric(unique(unlist(dat)) %in% x))),
"David Ar : " = {setDT(dat2)[, rowid := .I] ; dcast(melt(dat2, id = "rowid"), rowid ~ value, length)},
times = 10L)
# Unit: milliseconds
#         expr         min          lq        mean      median         uq        max neval cld
#    akrun1 :     3.826075    4.061904    6.654399    5.165376   11.26959   11.82029    10  a
#    akrun2 :     5.269531    5.713672    8.794434    5.943422   13.34118   20.01961    10  a
#  DatamineR : 3199.336286 3343.774160 3410.618547 3385.756972 3517.22133 3625.70909    10   b
#   David Ar :    8.092769    8.254682   11.030785    8.465232   15.44893   19.83914    10  a
The apply solution is highly inefficient and will take forever on a bigger data set. Comparing on a bigger data set while excluding the apply solution:
dat = data.frame(
x1 = rep(c('a', 'b'), 1e4),
x2 = rep(c('x', 'y'), 1e4)
)
dat2 <- copy(dat)
microbenchmark("akrun1 : " = table(rep(1:nrow(dat),2),unlist(dat)),
"akrun2 : " = sparse.model.matrix(~ -1 + x1 +x2, dat, contrasts.arg = lapply(dat, contrasts, contrasts = FALSE)),
#"DatamineR : " = t(apply(dat,1, function(x) as.numeric(unique(unlist(dat)) %in% x))),
"David Ar : " = {setDT(dat2)[, rowid := .I] ; dcast(melt(dat2, id = "rowid"), rowid ~ value, length)},
times = 100L)
# Unit: milliseconds
#        expr      min       lq     mean   median       uq      max neval cld
#   akrun1 : 38.66744 41.27116 52.97982 42.72534 47.17203 161.0420   100  b
#   akrun2 : 17.02006 18.93534 27.27582 19.35580 20.72022 153.2397   100 a
#  David Ar : 34.15915 37.91659 46.11050 38.58536 41.40412 149.0038   100  b
It seems like the Matrix package shines for bigger data sets.
It's probably worth comparing different scenarios when there are more columns/unique values, too.
One alternative using apply:
head(t(apply(dat,1, function(x) as.numeric(unique(unlist(dat)) %in% x))))
     [,1] [,2] [,3] [,4]
[1,]    1    0    1    0
[2,]    0    1    0    1
[3,]    1    0    1    0
[4,]    0    1    0    1
[5,]    1    0    1    0
[6,]    0    1    0    1

R: Generating a new variable based on a conditional statement applied to many columns

There is probably an obvious and elegant way to do this, likely using lapply, but I am still mastering the apply family of commands and am struggling to find it.
I have a dataframe that looks like the following, except that instead of 5 factor variables there are dozens, and instead of 10 rows there are hundreds.
a<- data.frame("id" = c(1:10),
"a1" = factor(c(0,0,1,1,0,1,0,1,0,1)),
"a2" = factor(c(0,0,0,0,0,0,0,0,1,0)),
"a3" = factor(c(0,0,0,0,0,1,0,0,0,0)),
"a4" = factor(c(0,0,0,0,0,0,0,0,1,1)),
"a5" = factor(c(0,0,0,1,0,0,0,0,0,0)))
I want to create a new variable which is 1 if any of 13 columns contains a particular level of the factor. The equivalent in the example dataframe would be creating a new variable called "b" which is 1 if there's a "1" in any of the columns a1:a4, which would look like the following.
a<- data.frame("id" = c(1:10),
"a1" = factor(c(0,0,1,1,0,1,0,1,0,1)),
"a2" = factor(c(0,0,0,0,0,0,0,0,1,0)),
"a3" = factor(c(0,0,0,0,0,1,0,0,0,0)),
"a4" = factor(c(0,0,0,0,0,0,0,0,1,1)),
"a5" = factor(c(0,0,0,1,0,0,0,0,0,0)),
"b" = c(0,0,1,1,0,1,0,1,1,1))
There has GOT to be a way to do this using the 13 column positions instead of writing a conditional if-then statement for each of the 13 variables.
Just use rowSums, something like this:
> as.numeric(rowSums(a[paste0("a", 1:5)] == 1) >= 1)
[1] 0 0 1 1 0 1 0 1 1 1
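Since the real data has dozens of columns, the same idea works with column positions instead of names; in the toy example the factor columns sit at positions 2:6 (for the real data you would substitute the 13 relevant positions):
# b is 1 when any of the selected columns equals 1
a$b <- as.numeric(rowSums(a[, 2:6] == 1) >= 1)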
In case you wanted to try lapply:
Reduce(`|`,lapply(a[,-1], function(x) as.numeric(as.character(x))))+0
#[1] 0 0 1 1 0 1 0 1 1 1
Or just
Reduce(`|`, lapply(a[,-1], `==`, 1)) +0
#[1] 0 0 1 1 0 1 0 1 1 1
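To see what the reduction is doing: each lapply element marks the target level in one column, and Reduce() folds those vectors together with | pairwise. A tiny sketch with made-up logical vectors:
Reduce(`|`, list(c(TRUE, FALSE, FALSE),
                 c(FALSE, FALSE, TRUE)))
# [1]  TRUE FALSE  TRUE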
Benchmarks:
set.seed(155)
df <- as.data.frame(matrix(sample(0:1, 5000*1e4, replace=TRUE), ncol=5000))
library(microbenchmark)
f1 <- function() {as.numeric(rowSums(df == 1) >= 1) }
f2 <- function() {Reduce(`|`, lapply(df, `==`, 1)) +0}
f3 <- function() {apply(df == 1, 1, function(x) any(x %in% TRUE))+0}
microbenchmark(f1(), f2(), f3(), unit="relative")
# Unit: relative
# expr      min       lq   median       uq      max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000   100
# f2() 1.040561 1.043713 1.053773 1.032932 1.045067   100
# f3() 2.538287 2.517184 2.825253 2.477225 2.454511   100
You could also use any after converting the matrix to logical.
> apply(a[grep("a[1-4]", names(a))] == 1, 1, any)+0
# [1] 0 0 1 1 0 1 0 1 1 1
Or
> apply(a[grepl("a[1-4]", names(a))] == 1, 1, any)+0
# [1] 0 0 1 1 0 1 0 1 1 1

R: condense indexes

I have a vector like the following:
xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)
I want to find the start and end indexes of each run of ones. In this case, I want the output to be a 2x2 matrix with rows 1 6 and 11 14. My vector is actually very long, so I can't do this by hand. Can anyone help me with this? Thanks.
Since the question originally had the tag 'bioinformatics', I'll mention the Bioconductor package IRanges (and its companion for ranges on genomes, GenomicRanges):
> library(IRanges)
> xx <- c(1,1,1,1,1,1,0,0,0,0,1,1,1,1)
> sl = slice(Rle(xx), 1)
> sl
Views on a 14-length Rle subject
views:
    start end width
[1]     1   6     6 [1 1 1 1 1 1]
[2]    11  14     4 [1 1 1 1]
which could be coerced to a matrix, but that would often not be convenient for whatever the next step is:
> matrix(c(start(sl), end(sl)), ncol=2)
     [,1] [,2]
[1,]    1    6
[2,]   11   14
Other operations might start on the Rle, e.g.,
> xx = c(2,2,2,3,3,3,0,0,0,0,4,4,1,1)
> r = Rle(xx)
> m = cbind(start(r), end(r))[runValue(r) != 0,,drop=FALSE]
> m
     [,1] [,2]
[1,]    1    3
[2,]    4    6
[3,]   11   12
[4,]   13   14
See the help page ?Rle for the full flexibility of the Rle class. To go from a matrix like the one above back to a new Rle, as asked in the comment below, one might create a new Rle of appropriate length and then subset-assign using an IRanges as the index:
> r = Rle(0L, max(m))
> r[IRanges(m[,1], m[,2])] = 1L
> r
integer-Rle of length 14 with 3 runs
Lengths: 6 4 4
Values : 1 0 1
One could expand this to a full vector
> as(r, "integer")
[1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1
but often it's better to continue the analysis on the Rle. The class is very flexible, so one way of going from xx to an integer vector of 1's and 0's is
> as(Rle(xx) > 0, "integer")
[1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1
Again, though, it often makes sense to stay in Rle space. And Arun's answer to your separate question is probably best of all.
Performance (speed) is important, although in this case I think the Rle class provides a lot of flexibility that weighs against raw performance, and ending up at a matrix is an unlikely end-point for a typical analysis. Nonetheless, the IRanges infrastructure is performant:
eddi <- function(xx)
  matrix(which(diff(c(0, xx, 0)) != 0) - c(0, 1),
         ncol = 2, byrow = TRUE)
iranges = function(xx) {
  sl = slice(Rle(xx), 1)
  matrix(c(start(sl), end(sl)), ncol = 2)
}
iranges.1 = function(xx) {
  r = Rle(xx)
  cbind(start(r), end(r))[runValue(r) != 0, , drop = FALSE]
}
with
> xx = sample(c(0, 1), 1e5, TRUE)
> microbenchmark(eddi(xx), iranges(xx), iranges.1(xx), times=10)
Unit: milliseconds
          expr       min        lq    median        uq      max neval
      eddi(xx)  45.88009  46.69360  47.67374 226.15084 234.8138    10
   iranges(xx) 112.09530 114.36889 229.90911 292.84153 294.7348    10
 iranges.1(xx)  31.64954  31.72658  33.26242  35.52092 226.7817    10
Something like this, maybe?
if (xx[1] == 1) {
  rr <- cumsum(c(0, rle(xx)$lengths))
} else {
  rr <- cumsum(rle(xx)$lengths)
}
if (length(rr) %% 2 == 1) {
  rr <- head(rr, -1)
}
oo <- matrix(rr, ncol = 2, byrow = TRUE)
oo[, 1] <- oo[, 1] + 1
oo
     [,1] [,2]
[1,]    1    6
[2,]   11   14
This edit takes care of cases where (1) the vector starts with a "0" rather than a "1", and (2) the run-length sequence has an odd number of entries, e.g. xx <- c(1,1,1,1,1,1,0,0,0,0).
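A quick check of both cases, wrapping the snippet above in a hypothetical helper:
runs_of_ones <- function(xx) {  # hypothetical wrapper around the code above
  if (xx[1] == 1) {
    rr <- cumsum(c(0, rle(xx)$lengths))
  } else {
    rr <- cumsum(rle(xx)$lengths)
  }
  if (length(rr) %% 2 == 1) rr <- head(rr, -1)
  oo <- matrix(rr, ncol = 2, byrow = TRUE)
  oo[, 1] <- oo[, 1] + 1
  oo
}
runs_of_ones(c(0, 0, 1, 1, 1, 0, 0, 1))  # starts with 0 and ends with 1
#      [,1] [,2]
# [1,]    3    5
# [2,]    8    8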
Another, short one:
cbind(start = which(diff(c(0, xx)) == +1),
      end = which(diff(c(xx, 0)) == -1))
#      start end
# [1,]     1   6
# [2,]    11  14
I tested on a very long vector and it is marginally slower than using rle, but more readable, IMHO. If speed were really a concern, you could also do:
xx.diff <- diff(c(0, xx, 0))
cbind(start = which(head(xx.diff, -1) == +1),
      end = which(tail(xx.diff, -1) == -1))
#      start end
# [1,]     1   6
# [2,]    11  14
Here's another solution that's built upon the others' ideas, and is a bit shorter and faster:
matrix(which(diff(c(0, xx, 0)) != 0) - c(0, 1), ncol = 2, byrow = TRUE)
#      [,1] [,2]
# [1,]    1    6
# [2,]   11   14
I didn't test the non-base solution, but here's a comparison of base ones:
xx = sample(c(0, 1), 1e5, TRUE)
microbenchmark(arun(xx), flodel(xx), flodel.fast(xx), eddi(xx))
# Unit: milliseconds
#            expr       min        lq    median        uq       max neval
#        arun(xx) 14.021134 14.181134 14.246415 14.332655 15.220496   100
#      flodel(xx) 12.885134 13.186254 13.248334 13.432974 14.367695   100
# flodel.fast(xx)  9.704010  9.952810 10.063691 10.211371 11.108171   100
#        eddi(xx)  7.029448  7.276008  7.328968  7.439528  8.361609   100
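For reference, the benchmarked names presumably wrap the preceding snippets; for instance, the two flodel variants would be (a sketch):
flodel <- function(xx)
  cbind(start = which(diff(c(0, xx)) == +1),
        end = which(diff(c(xx, 0)) == -1))
flodel.fast <- function(xx) {
  xx.diff <- diff(c(0, xx, 0))
  cbind(start = which(head(xx.diff, -1) == +1),
        end = which(tail(xx.diff, -1) == -1))
}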
