I have 100,000 individuals.
Using a combination of upper case letters, lower case letters and numbers, I want to create
a five-character ID for each individual. I should not have any duplicates.
How can I do this? I have tried the code below but I have 4 duplicates.
What is the number of possible unique combinations to create a 5 character ID with "letters", "LETTERS" and "0:9"?
set.seed(0)
mydata<-data.frame(
ID=rep(NA,10^5),
Poids=rnorm(n=10^5,mean = 65,sd=5)
)
for (i in 1:nrow(mydata)){
mydata$ID[i]<-c(
paste(sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),
sample(c(0:9,LETTERS,letters),replace = F,size = 1),sep = "")
)
}
table(duplicated(mydata$ID))
FALSE TRUE
99996 4
(length(letters) + length(LETTERS) + length(0:9))^5 is 916,132,832, so there is plenty of space to avoid clashes.
In fact, we can use this number to help generate our sample. We draw 100,000 integers out of 916,132,832 without replacement and interpret each number as its unique string of characters using a bit of modular math and indexing. This can all be done in a single pass:
space <- c(LETTERS, letters, 0:9)
set.seed(0)
samps <- sample(length(space)^5, 10^5)
m <- matrix("", nrow = 10^5, ncol = 5)
for(i in seq(ncol(m))) {
m[,i] <- space[(samps %% length(space)) + 1]
samps <- samps %/% length(space)
}
ID <- apply(m, 1, paste, collapse = "")
We can see this fulfils our requirements:
head(ID)
#> [1] "vpdnq" "rK0ej" "ofE9t" "PqLIr" "6G6tu" "Vhc7R"
length(ID)
#> [1] 100000
length(unique(ID))
#> [1] 100000
The whole thing takes less than a second on my modest machine:
user system elapsed
0.72 0.00 0.74
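To make the modular arithmetic in the loop above concrete, here is a minimal sketch that decodes one integer at a time. The helper name decode_id and its arguments are illustrative assumptions, not part of the answer's code; it follows the same least-significant-digit-first order as the loop.

```r
space <- c(LETTERS, letters, 0:9)

# Hypothetical helper: turn one integer into a fixed-width string by
# repeatedly taking the remainder (one character) and the quotient (the rest).
decode_id <- function(x, width = 5, alphabet = space) {
  base <- length(alphabet)
  chars <- character(width)
  for (i in seq_len(width)) {
    chars[i] <- alphabet[(x %% base) + 1]  # remainder picks the character
    x <- x %/% base                        # quotient carries to the next position
  }
  paste(chars, collapse = "")
}

decode_id(0)   # "AAAAA"
decode_id(1)   # "BAAAA"
decode_id(62)  # "ABAAA"
```

Because distinct integers below 62^5 have distinct base-62 digits, distinct draws always give distinct IDs.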
Update
It occurs to me that it is possible to give 100,000 people a unique ID using only 16 characters, i.e. 0-9 and a-f, with code that is much quicker and simpler than above:
set.seed(0)
ID <- as.hexmode(sample(16^5, 10^5))
head(ID)
#> [1] "d43f9" "392a7" "033a2" "cf1d7" "aa10e" "134bb"
length(unique(ID))
#> [1] 100000
Which takes less than 10 milliseconds.
Created on 2022-05-15 by the reprex package (v2.0.1)
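If plain character strings of a guaranteed fixed width are preferred over a hexmode object, one possible variation (an assumption on my part, not something the answer above does) is to sample from 0 to 16^5 - 1 and format with sprintf:

```r
set.seed(0)
# Shift the draws to the range 0 .. 16^5 - 1 so that every value fits in
# five hexadecimal digits, then zero-pad to width 5.
ID <- sprintf("%05x", sample(16^5, 10^5) - 1L)

all(nchar(ID) == 5)      # TRUE
anyDuplicated(ID) == 0   # TRUE
```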
You can try the code below (given N <- 1e5 and k <- 5):
n <- ceiling(N^(1 / k))
S <- sample(c(LETTERS, letters, 0:9), n)
ID <- head(do.call(paste0, expand.grid(rep(list(S), k))),N)
where
n gives the size of a subset of the whole space that supports all unique combinations up to the given number N, e.g., N <- 100000
S denotes a sub-space from which we draw the letters or digits
expand.grid gives all combinations
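The three steps can be followed on a toy case. This is only a sketch with made-up small values (N = 10 and k = 2 are assumptions chosen so that the output is easy to inspect):

```r
set.seed(1)
N <- 10
k <- 2

n <- ceiling(N^(1 / k))                    # 4 symbols are enough: 4^2 = 16 >= 10
S <- sample(c(LETTERS, letters, 0:9), n)   # the sub-space of symbols
grid <- expand.grid(rep(list(S), k), stringsAsFactors = FALSE)  # all 16 combinations
ID <- head(do.call(paste0, grid), N)       # keep only the first N

length(ID)               # 10
anyDuplicated(ID) == 0   # TRUE
```

Note that the IDs produced this way only ever use the n sampled symbols, so they are unique but not spread over the full 62-character space.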
If you don't need randomness, the highly performant arrangements package can help by iterating over the permutations in order, not generating any more than are needed:
library(arrangements)
x = c(letters, LETTERS, 0:9)
ix = ipermutations(x = x, k = 5)
ind = ix$getnext(d = nrow(mydata))
mydata$ID = apply(ind, MAR = 1, FUN = \(i) paste(x[i], collapse = ""))
rbind(head(mydata), tail(mydata))
# ID Poids
# 1 abcde 64.46278
# 2 abcdf 62.00053
# 3 abcdg 75.71787
# 4 abcdh 67.73765
# 5 abcdi 66.45402
# 6 abcdj 66.85561
# 99995 abFpe 56.20545
# 99996 abFpf 64.14443
# 99997 abFpg 70.70191
# 99998 abFph 66.83226
# 99999 abFpi 65.22835
# 100000 abFpj 56.28880
This is quite fast:
user system elapsed
0.194 0.001 0.203
I have a vector as follows:
playtimes <- c("1H18M20S", "1H27M5S", "18M27S", "56M38S", "21S")
and I want to convert these to playtimes in seconds. For example, the resulting vector would be something like this:
playtimeInSeconds <- c(4700, 5225, 1107, 3398, 21)
I'm having trouble separating the strings correctly based on the H, M and S. I wrote the following, which works for the playtimes under 1 hour:
minutes <- gsub("M.*", "", playtimes)
seconds <- gsub(".*M", "", playtimes) %>%
gsub("S", "", .)
totalPlaytime <- as.numeric(minutes)*60 + as.numeric(seconds)
But I'm not sure how to tackle the H portion of some strings.
You could strsplit and then pad each list element on the left to length 3 (reverse it, set its length to 3, and reverse it back). sapply then returns a matrix with one row each for hours, minutes and seconds, to which you can apply the matrix product %*%.
m <- sapply(strsplit(playtimes, 'H|M|S'), \(x) as.double(rev(`length<-`(rev(x), 3))))
res <- as.vector(t(replace(m, is.na(m), 0)) %*% rbind(3600, 60, 1))
res
# [1] 4700 5225 1107 3398 21
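The padding trick is easier to see on a single element. Here is a small walk-through for "18M27S", using only base R; the intermediate variable names are illustrative:

```r
x <- strsplit("18M27S", "H|M|S")[[1]]   # c("18", "27"): no hour part
padded <- rev(`length<-`(rev(x), 3))    # c(NA, "18", "27"): NA fills the missing hours
vals <- as.double(padded)
vals[is.na(vals)] <- 0                  # treat a missing unit as zero
total <- sum(vals * c(3600, 60, 1))     # weight hours, minutes, seconds
total
# [1] 1107
```

Reversing before setting the length matters because `length<-` pads at the end; without it the NA would land in the seconds slot instead of the hours slot.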
Interesting problem. Here is a solution that could probably be made more efficient, but it does the job:
# function from https://www.statworx.com/de/blog/strsplit-but-keeping-the-delimiter/
strsplit <- function(x,
split,
type = "remove",
perl = FALSE,
...) {
if (type == "remove") {
# use base::strsplit
out <- base::strsplit(x = x, split = split, perl = perl, ...)
} else if (type == "before") {
# split before the delimiter and keep it
out <- base::strsplit(x = x,
split = paste0("(?<=.)(?=", split, ")"),
perl = TRUE,
...)
} else if (type == "after") {
# split after the delimiter and keep it
out <- base::strsplit(x = x,
split = paste0("(?<=", split, ")"),
perl = TRUE,
...)
} else {
# wrong type input
stop("type must be remove, after or before!")
}
return(out)
}
# convert to seconds
to_seconds <- c(H = 60 * 60,
M = 60,
S = 1)
get_seconds <- function(value, unit) {
value * to_seconds[unit]
}
# example vector
playtimes <- c("1H18M20S", "1H27M5S", "18M27S", "56M38S", "21S")
# extract time parts
times <- strsplit(playtimes,
split = "[A-Z]",
type = "after")
times
#> [[1]]
#> [1] "1H" "18M" "20S"
#>
#> [[2]]
#> [1] "1H" "27M" "5S"
#>
#> [[3]]
#> [1] "18M" "27S"
#>
#> [[4]]
#> [1] "56M" "38S"
#>
#> [[5]]
#> [1] "21S"
# calculate each time in seconds
sapply(times,
function(t) {
# split numeric and unit part
t_split <- strsplit(x = t,
split = "[A-Z]",
type = "before")
# calculate seconds for each unit part
times_in_seconds <- get_seconds(value = as.numeric(sapply(t_split, `[`, 1)),
unit = sapply(t_split, `[`, 2))
# sum of all parts
sum(times_in_seconds)
})
#> [1] 4700 5225 1107 3398 21
I followed the example given in the 3rd answer here and came up with the following (str_count is from stringr; hms and time_length are from lubridate):
playtime <- sapply(playtimes, function(x){paste(paste(rep(0, 3 - str_count(x, '[0-9]+')), collapse = ' '), x)})
totalPlaytime <- time_length(hms(playtime))
Short, sweet, and it checks for potential errors where the playtime is less than 1 hr or less than 1 min.
For instance, how can I convert the number '10010000110000011000011111011000' from base 2 to base 4?
Here is one approach that breaks up the string into units of length 2 and then looks up the corresponding base 4 for the pair:
convert <- c("00"="0","01"="1","10"="2","11"="3")
from2to4 <- function(s){
if(nchar(s) %% 2 == 1) s <- paste0('0',s)
n <- nchar(s)
bigrams <- sapply(seq(1,n,2),function(i) substr(s,i,i+1))
digits <- convert[bigrams]
paste0(digits, collapse = "")
}
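A one-liner approach (with the binary string stored in a; the factor levels are given explicitly so that the mapping still holds when one of the four pairs does not occur in the input):
> paste(as.numeric(factor(substring(a,seq(1,nchar(a),2),seq(2,nchar(a),2)),levels=c("00","01","10","11")))-1,collapse="")
[1] "2100300120133120"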
There are multiple ways to split the string into pairs of digits; see "Chopping a string into a vector of fixed width character elements".
Here are a couple inverses:
bin_to_base4 <- function(x){
x <- strsplit(x, '')
vapply(x, function(bits){
bits <- as.integer(bits)
paste(2 * bits[c(TRUE, FALSE)] + bits[c(FALSE, TRUE)], collapse = '')
}, character(1))
}
base4_to_bin <- function(x){
x <- strsplit(x, '')
vapply(x, function(quats){
quats <- as.integer(quats)
paste0(quats %/% 2, quats %% 2, collapse = '')
}, character(1))
}
x <- '10010000110000011000011111011000'
bin_to_base4(x)
#> [1] "2100300120133120"
base4_to_bin(bin_to_base4(x))
#> [1] "10010000110000011000011111011000"
...and they're vectorized!
base4_to_bin(bin_to_base4(c(x, x)))
#> [1] "10010000110000011000011111011000" "10010000110000011000011111011000"
For actual use, it would be a good idea to put in some sanity checks to ensure the input is actually in the appropriate base.
Convert Base2 to Base10 first, then from Base10 to Base4
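A minimal sketch of that two-step route is shown below. The function names bin_to_dec and dec_to_base4 are hypothetical. It uses doubles rather than strtoi(x, base = 2), because the 32-bit example from the question does not fit in R's integer type; doubles represent whole numbers exactly only up to 2^53, so this is an assumption-laden sketch for inputs of at most 53 bits.

```r
# Base 2 string -> base 10 number (stored as a double)
bin_to_dec <- function(s) {
  bits <- as.integer(strsplit(s, "")[[1]])
  sum(bits * 2^rev(seq_along(bits) - 1))   # leftmost bit has the highest power
}

# Base 10 number -> base 4 string, by repeated division by 4
dec_to_base4 <- function(n) {
  if (n == 0) return("0")
  digits <- character(0)
  while (n > 0) {
    digits <- c(as.character(n %% 4), digits)  # remainder is the next digit
    n <- n %/% 4
  }
  paste(digits, collapse = "")
}

x <- "10010000110000011000011111011000"
dec_to_base4(bin_to_dec(x))
# [1] "2100300120133120"
```

Unlike the pairwise lookup approaches, this drops leading zeros, since they do not survive the trip through a number.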
I'm trying to use R to calculate the moving average over a series of values in a matrix. There doesn't seem to be a built-in function in R that will allow me to calculate moving averages. Do any packages provide one? Or do I need to write my own?
Or you can simply calculate it using filter, here's the function I use:
ma <- function(x, n = 5){filter(x, rep(1 / n, n), sides = 2)}
If you use dplyr, be careful to specify stats::filter in the function above.
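For illustration, here is the same function written with an explicit stats:: prefix and applied to a short vector. The input 1:10 and the window n = 3 are made-up values; with sides = 2 the window is centred, so the first and last positions have no complete window and come back as NA.

```r
ma <- function(x, n = 5) stats::filter(x, rep(1 / n, n), sides = 2)

res <- as.numeric(ma(1:10, n = 3))
res
# approximately: NA 2 3 4 5 6 7 8 9 NA
```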
Rolling Means/Maximums/Medians in the zoo package (rollmean)
MovingAverages in TTR
ma in forecast
Using cumsum should be sufficient and efficient. Assuming you have a vector x and you want a running mean over windows of n numbers (stored in rsum below):
cx <- c(0,cumsum(x))
rsum <- (cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n
As pointed out in the comments by @mzuther, this assumes that there are no NAs in the data. Dealing with those would require dividing each window by the number of non-NA values. Here's one way of doing that, incorporating the comment from @Ricardo Cruz:
cx <- c(0, cumsum(ifelse(is.na(x), 0, x)))
cn <- c(0, cumsum(ifelse(is.na(x), 0, 1)))
rx <- cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]
rn <- cn[(n+1):length(cx)] - cn[1:(length(cx) - n)]
rsum <- rx / rn
This still has the issue that if all the values in a window are NAs, the result for that window is 0/0, which R evaluates to NaN.
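One possible way to handle windows that contain only NAs is sketched below, under the assumption that such windows should simply yield NA. The function name roll_mean_na is hypothetical, and the example input is made up.

```r
roll_mean_na <- function(x, n) {
  cx <- c(0, cumsum(ifelse(is.na(x), 0, x)))  # running sum, NAs counted as 0
  cn <- c(0, cumsum(!is.na(x)))               # running count of non-NA values
  hi <- (n + 1):length(cx)
  lo <- 1:(length(cx) - n)
  rx <- cx[hi] - cx[lo]                       # window sums
  rn <- cn[hi] - cn[lo]                       # non-NA values per window
  ifelse(rn == 0, NA_real_, rx / rn)          # avoid 0/0 for all-NA windows
}

roll_mean_na(c(1, 2, NA, NA, 5), 2)
# [1] 1.5 2.0  NA 5.0
```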
data.table 1.12.0 added the frollmean function, which computes a fast and exact rolling mean while carefully handling NA, NaN, +Inf and -Inf values.
As there is no reproducible example in the question, there is not much more to address here.
You can find more information in the manual under ?frollmean, which is also available online.
Examples from the manual are below:
library(data.table)
d = as.data.table(list(1:6/2, 3:8/4))
# rollmean of single vector and single window
frollmean(d[, V1], 3)
# multiple columns at once
frollmean(d, 3)
# multiple windows at once
frollmean(d[, .(V1)], c(3, 4))
# multiple columns and multiple windows at once
frollmean(d, c(3, 4))
## three above are embarrassingly parallel using openmp
The caTools package has very fast rolling mean/min/max/sd and few other functions. I've only worked with runmean and runsd and they are the fastest of any of the other packages mentioned to date.
You could use RcppRoll for very quick moving averages written in C++. Just call the roll_mean function. Docs can be found here.
Otherwise, this (slower) for loop should do the trick:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n):i])
}
res
}
Here is example code showing how to compute a centered moving average and a trailing moving average using the rollmean function from the zoo package.
library(tidyverse)
library(zoo)
some_data = tibble(day = 1:10)
# cma = centered moving average
# tma = trailing moving average
some_data = some_data %>%
mutate(cma = rollmean(day, k = 3, fill = NA)) %>%
mutate(tma = rollmean(day, k = 3, fill = NA, align = "right"))
some_data
#> # A tibble: 10 x 3
#> day cma tma
#> <int> <dbl> <dbl>
#> 1 1 NA NA
#> 2 2 2 NA
#> 3 3 3 2
#> 4 4 4 3
#> 5 5 5 4
#> 6 6 6 5
#> 7 7 7 6
#> 8 8 8 7
#> 9 9 9 8
#> 10 10 NA 9
In fact, RcppRoll is very good.
The code posted by cantdutchthis must be corrected in the fourth line so that the window has a fixed width:
ma <- function(arr, n=15){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n+1):i])
}
res
}
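The difference between the two index ranges can be checked directly on a small vector (the values below are made up for illustration):

```r
arr <- c(1, 2, 3, 4, 5, 6)
n <- 3
i <- 5

len_old <- length(arr[(i - n):i])      # indices 2:5, one element too many
len_new <- length(arr[(i - n + 1):i])  # indices 3:5, exactly n elements

c(len_old, len_new)
# [1] 4 3
```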
Another way, which handles missings, is given here.
A third way, extending cantdutchthis's code with an option to keep or drop the partial averages at the start, follows:
ma <- function(x, n=2,parcial=TRUE){
res = x #set the first values
if (parcial==TRUE){
for(i in 1:length(x)){
t<-max(i-n+1,1)
res[i] = mean(x[t:i])
}
res
}else{
for(i in 1:length(x)){
t<-max(i-n+1,1)
res[i] = mean(x[t:i])
}
res[-c(seq(1,n-1,1))] #remove the n-1 first,i.e., res[c(-3,-4,...)]
}
}
To complement the answers of cantdutchthis and Rodrigo Remedio:
moving_fun <- function(x, w, FUN, ...) {
# x: a double vector
# w: the length of the window, i.e., the section of the vector selected to apply FUN
# FUN: a function that takes a vector and return a summarize value, e.g., mean, sum, etc.
# Given a double type vector apply a FUN over a moving window from left to the right,
# when a window boundary is not a legal section, i.e. lower_bound and i (upper bound)
# are not contained in the length of the vector, return a NA_real_
if (w < 1) {
stop("The length of the window 'w' must be greater than 0")
}
output <- x
for (i in 1:length(x)) {
# plus 1 because the index is inclusive with the upper_bound 'i'
lower_bound <- i - w + 1
if (lower_bound < 1) {
output[i] <- NA_real_
} else {
output[i] <- FUN(x[lower_bound:i], ...)
}
}
output
}
# example
v <- seq(1:10)
# compute a MA(2)
moving_fun(v, 2, mean)
# compute moving sum of two periods
moving_fun(v, 2, sum)
You may calculate the moving average of a vector x with a window width of k by:
apply(embed(x, k), 1, mean)
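To see why this works, it helps to look at what embed returns: each row is one window, with the most recent value first. A small sketch with made-up data follows; rowMeans is used here as an equivalent of the apply call.

```r
x <- c(1, 2, 3, 4, 5)
k <- 3

windows <- embed(x, k)
windows
#      [,1] [,2] [,3]
# [1,]    3    2    1
# [2,]    4    3    2
# [3,]    5    4    3

rowMeans(windows)
# [1] 2 3 4
```

The result has length(x) - k + 1 elements, so it is shorter than the input rather than padded with NAs.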
The slider package can be used for this. It has an interface that has been specifically designed to feel similar to purrr. It accepts any arbitrary function, and can return any type of output. Data frames are even iterated over row wise. The pkgdown site is here.
library(slider)
x <- 1:3
# Mean of the current value + 1 value before it
# returned as a double vector
slide_dbl(x, ~mean(.x, na.rm = TRUE), .before = 1)
#> [1] 1.0 1.5 2.5
df <- data.frame(x = x, y = x)
# Slide row wise over data frames
slide(df, ~.x, .before = 1)
#> [[1]]
#> x y
#> 1 1 1
#>
#> [[2]]
#> x y
#> 1 1 1
#> 2 2 2
#>
#> [[3]]
#> x y
#> 1 2 2
#> 2 3 3
The overhead of both slider and data.table's frollapply() should be pretty low (much faster than zoo). frollapply() looks to be a little faster for this simple example here, but note that it only takes numeric input, and the output must be a scalar numeric value. slider functions are completely generic, and you can return any data type.
library(slider)
library(zoo)
library(data.table)
x <- 1:50000 + 0L
bench::mark(
slider = slide_int(x, function(x) 1L, .before = 5, .complete = TRUE),
zoo = rollapplyr(x, FUN = function(x) 1L, width = 6, fill = NA),
datatable = frollapply(x, n = 6, FUN = function(x) 1L),
iterations = 200
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slider 19.82ms 26.4ms 38.4 829.8KB 19.0
#> 2 zoo 177.92ms 211.1ms 4.71 17.9MB 24.8
#> 3 datatable 7.78ms 10.9ms 87.9 807.1KB 38.7
EDIT: took great joy in adding the side parameter, for a moving average (or sum, or ...) of e.g. the past 7 days of a Date vector.
For people just wanting to calculate this themselves, it's nothing more than:
# x = vector with numeric data
# w = window length
y <- numeric(length = length(x))
for (i in seq_len(length(x))) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- mean(x[ind])
}
y
But it gets fun to make it independent of mean(), so you can calculate any 'moving' function!
# our working horse:
moving_fn <- function(x, w, fun, side = "centre", ...) {
# x = vector with numeric data
# w = window length
# fun = function to apply
# side = side to take, (c)entre, (l)eft or (r)ight
# ... = parameters passed on to 'fun'
y <- numeric(length(x))
for (i in seq_len(length(x))) {
if (side %in% c("c", "centre", "center")) {
ind <- c((i - floor(w / 2)):(i + floor(w / 2)))
} else if (side %in% c("l", "left")) {
ind <- c((i - floor(w) + 1):i)
} else if (side %in% c("r", "right")) {
ind <- c(i:(i + floor(w) - 1))
} else {
stop("'side' must be one of 'centre', 'left', 'right'", call. = FALSE)
}
ind <- ind[ind %in% seq_len(length(x))]
y[i] <- fun(x[ind], ...)
}
y
}
# and now any variation you can think of!
moving_average <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = mean, side = side, na.rm = na.rm)
}
moving_sum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = sum, side = side, na.rm = na.rm)
}
moving_maximum <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = max, side = side, na.rm = na.rm)
}
moving_median <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = median, side = side, na.rm = na.rm)
}
moving_Q1 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.25)
}
moving_Q3 <- function(x, w = 5, side = "centre", na.rm = FALSE) {
moving_fn(x = x, w = w, fun = quantile, side = side, na.rm = na.rm, 0.75)
}
Although it is a bit slow, you can also use zoo::rollapply to perform calculations on matrices.
reqd_ma <- rollapply(x, FUN = mean, width = n)
where x is the data set, FUN = mean is the function (you can also change it to min, max, sd, etc.) and width is the rolling window.
One can use the runner package for moving functions, in this case the mean_run function. The problem with cummean is that it doesn't handle NA values, but mean_run does. The runner package also supports irregular time series, and windows can depend on dates:
library(runner)
set.seed(11)
x1 <- rnorm(15)
x2 <- sample(c(rep(NA,5), rnorm(15)), 15, replace = TRUE)
date <- Sys.Date() + cumsum(sample(1:3, 15, replace = TRUE))
mean_run(x1)
#> [1] -0.5910311 -0.2822184 -0.6936633 -0.8609108 -0.4530308 -0.5332176
#> [7] -0.2679571 -0.1563477 -0.1440561 -0.2300625 -0.2844599 -0.2897842
#> [13] -0.3858234 -0.3765192 -0.4280809
mean_run(x2, na_rm = TRUE)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] -0.13873536 -0.14571604 -0.12596067 -0.11116961 -0.09881996 -0.08871569
#> [13] -0.05194292 -0.04699909 -0.05704202
mean_run(x2, na_rm = FALSE )
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.12188853 -0.13873536
#> [7] NA NA NA NA NA NA
#> [13] NA NA NA
mean_run(x2, na_rm = TRUE, k = 4)
#> [1] -0.18760011 -0.09022066 -0.06543317 0.03906450 -0.10546063 -0.16299272
#> [7] -0.21203756 -0.39209010 -0.13274756 -0.05603811 -0.03894684 0.01103493
#> [13] 0.09609256 0.09738460 0.04740283
mean_run(x2, na_rm = TRUE, k = 4, idx = date)
#> [1] -0.187600111 -0.090220655 -0.004349696 0.168349653 -0.206571573 -0.494335093
#> [7] -0.222969541 -0.187600111 -0.087636571 0.009742884 0.009742884 0.012326968
#> [13] 0.182442234 0.125737145 0.059094786
One can also specify other options like lag, and roll only at specific indexes. More in package and function documentation.
Here is a simple function with filter demonstrating one way to take care of beginning and ending NAs with padding, and computing a weighted average (supported by filter) using custom weights:
library(dplyr)  # for first(), last() and %>%
wma <- function(x) {
wts <- c(seq(0.5, 4, 0.5), seq(3.5, 0.5, -0.5))
nside <- (length(wts)-1)/2
# pad x with begin and end values for filter to avoid NAs
xp <- c(rep(first(x), nside), x, rep(last(x), nside))
z <- stats::filter(xp, wts/sum(wts), sides = 2) %>% as.vector
z[(nside+1):(nside+length(x))]
}
vector_avg <- function(x){
sum_x = 0
for(i in 1:length(x)){
if(!is.na(x[i]))
sum_x = sum_x + x[i]
}
return(sum_x/length(x))
}
I use aggregate along with a vector created by rep(). This has the advantage of using cbind() to aggregate more than one column in your dataframe at a time. Below is an example of averaging over periods of 60 values for a vector (v) of length 1000; note that the periods are consecutive, non-overlapping blocks rather than a sliding window:
v=1:1000*0.002+rnorm(1000)
mrng=rep(1:round(length(v)/60+0.5), length.out=length(v), each=60)
aggregate(v~mrng, FUN=mean, na.rm=T)
Note the first argument in rep is to simply get enough unique values for the moving range, based on the length of the vector and the amount to be averaged; the second argument keeps the length equal to the vector length, and the last repeats the values of the first argument the same number of times as the averaging period.
In aggregate you could use several functions (median, max, min); mean is shown as an example. Again, you could use a formula with cbind to do this on more than one (or all) columns in a dataframe.
Suppose that my vector numbers contains c(1,2,3,5,7,8), and I wish to find if it contains 3 consecutive numbers, which in this case, are 1,2,3.
numbers = c(1,2,3,5,7,8)
difference = diff(numbers) # The difference output would be 1,1,2,2,1
To verify that there are 3 consecutive integers in my numbers vector, I've tried the following with little reward.
rep(1,2)%in%difference
The above code works in this case, but if my difference vector = (1,2,2,2,1), it would still return TRUE even though the "1"s are not consecutive.
Using diff and rle, something like this should work:
result <- rle(diff(numbers))
any(result$lengths>=2 & result$values==1)
# [1] TRUE
In response to the comments below, my previous answer was specifically only testing for runs of length==3 excluding longer lengths. Changing the == to >= fixes this. It also works for runs involving negative numbers:
> numbers4 <- c(-2, -1, 0, 5, 7, 8)
> result <- rle(diff(numbers4))
> any(result$lengths>=2 & result$values==1)
[1] TRUE
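To illustrate why the run-length encoding avoids the false positive from the question, here is a small sketch. The wrapper name has_run and the second test vector are assumptions made for this example; that vector is chosen so that its differences are 1, 2, 2, 2, 1.

```r
# Hypothetical wrapper: TRUE if x contains at least n values in a row
# that each differ from the previous one by incr.
has_run <- function(x, n, incr = 1) {
  r <- rle(diff(x) == incr)
  any(r$values & r$lengths >= n - 1)
}

bad <- c(1, 2, 4, 6, 8, 9)
rle(diff(bad))$lengths          # 1 3 1: the two 1s are in separate runs
rle(diff(bad))$values           # 1 2 1

has_run(c(1, 2, 3, 5, 7, 8), 3) # TRUE
has_run(bad, 3)                 # FALSE
```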
Benchmarks!
I am including a couple functions of mine. Feel free to add yours. To qualify, you need to write a general function that tells if a vector x contains n or more consecutive numbers. I provide a unit test function below.
The contenders:
flodel.filter <- function(x, n, incr = 1L) {
if (n > length(x)) return(FALSE)
x <- as.integer(x)
is.cons <- tail(x, -1L) == head(x, -1L) + incr
any(filter(is.cons, rep(1L, n-1L), sides = 1, method = "convolution") == n-1L,
na.rm = TRUE)
}
flodel.which <- function(x, n, incr = 1L) {
is.cons <- tail(x, -1L) == head(x, -1L) + incr
any(diff(c(0L, which(!is.cons), length(x))) >= n)
}
thelatemail.rle <- function(x, n, incr = 1L) {
result <- rle(diff(x))
any(result$lengths >= n-1L & result$values == incr)
}
improved.rle <- function(x, n, incr = 1L) {
result <- rle(diff(as.integer(x)) == incr)
any(result$lengths >= n-1L & result$values)
}
carl.seqle <- function(x, n, incr = 1) {
if(!is.numeric(x)) x <- as.numeric(x)
z <- length(x)
y <- x[-1L] != x[-z] + incr
i <- c(which(y | is.na(y)), z)
any(diff(c(0L, i)) >= n)
}
Unit tests:
check.fun <- function(fun)
stopifnot(
fun(c(1,2,3), 3),
!fun(c(1,2), 3),
!fun(c(1), 3),
!fun(c(1,1,1,1), 3),
!fun(c(1,1,2,2), 3),
fun(c(1,1,2,3), 3)
)
check.fun(flodel.filter)
check.fun(flodel.which)
check.fun(thelatemail.rle)
check.fun(improved.rle)
check.fun(carl.seqle)
Benchmarks:
x <- sample(1:10, 1000000, replace = TRUE)
library(microbenchmark)
microbenchmark(
flodel.filter(x, 6),
flodel.which(x, 6),
thelatemail.rle(x, 6),
improved.rle(x, 6),
carl.seqle(x, 6),
times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# flodel.filter(x, 6) 96.03966 102.1383 144.9404 160.9698 177.7937 10
# flodel.which(x, 6) 131.69193 137.7081 140.5211 185.3061 189.1644 10
# thelatemail.rle(x, 6) 347.79586 353.1015 361.5744 378.3878 469.5869 10
# improved.rle(x, 6) 199.35402 200.7455 205.2737 246.9670 252.4958 10
# carl.seqle(x, 6) 213.72756 240.6023 245.2652 254.1725 259.2275 10
After diff you can check for any consecutive 1s -
numbers = c(1,2,3,5,7,8)
difference = diff(numbers) == 1
## [1] TRUE TRUE FALSE FALSE TRUE
## find at least one pair of consecutive TRUEs
any(tail(difference, -1) &
head(difference, -1))
## [1] TRUE
It's nice to see home-grown solutions here.
Fellow Stack Overflow user Carl Witthoft posted a function he named seqle() and shared it here.
The function looks like this:
seqle <- function(x,incr=1) {
if(!is.numeric(x)) x <- as.numeric(x)
n <- length(x)
y <- x[-1L] != x[-n] + incr
i <- c(which(y|is.na(y)),n)
list(lengths = diff(c(0L,i)),
values = x[head(c(0L,i)+1L,-1L)])
}
Let's see it in action. First, some data:
numbers1 <- c(1, 2, 3, 5, 7, 8)
numbers2 <- c(-2, 2, 3, 5, 6, 7, 8)
numbers3 <- c(1, 2, 2, 2, 1, 2, 3)
Now, the output:
seqle(numbers1)
# $lengths
# [1] 3 1 2
#
# $values
# [1] 1 5 7
#
seqle(numbers2)
# $lengths
# [1] 1 2 4
#
# $values
# [1] -2 2 5
#
seqle(numbers3)
# $lengths
# [1] 2 1 1 3
#
# $values
# [1] 1 2 2 1
#
Of particular interest to you is the "lengths" in the result.
Another interesting point is the incr argument. Here we can set the increment to, say, "2" and look for sequences where the difference between the numbers is two. So, for the first vector, we would expect the sequence of 3, 5, and 7 to be detected.
Let's try:
> seqle(numbers1, incr = 2)
$lengths
[1] 1 1 3 1
$values
[1] 1 2 3 8
So, we can see that we have a sequence of 1 (1), 1 (2), 3 (3, 5, 7), and 1 (8) if we set incr = 2.
How does it work with ECII's second challenge? Seems OK!
> numbers4 <- c(-2, -1, 0, 5, 7, 8)
> seqle(numbers4)
$lengths
[1] 3 1 2
$values
[1] -2 5 7
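To turn the seqle() output into the yes/no answer the question asks for, one option is to test the "lengths" component. In the sketch below seqle is copied from above so that the block runs on its own; the wrapper name has_consecutive is hypothetical.

```r
# seqle() as posted above, repeated here so the example is self-contained
seqle <- function(x, incr = 1) {
  if (!is.numeric(x)) x <- as.numeric(x)
  n <- length(x)
  y <- x[-1L] != x[-n] + incr
  i <- c(which(y | is.na(y)), n)
  list(lengths = diff(c(0L, i)),
       values = x[head(c(0L, i) + 1L, -1L)])
}

# Hypothetical wrapper: does x contain a run of at least n values?
has_consecutive <- function(x, n, incr = 1) {
  any(seqle(x, incr)$lengths >= n)
}

has_consecutive(c(1, 2, 3, 5, 7, 8), 3)            # TRUE  (1, 2, 3)
has_consecutive(c(1, 2, 2, 2, 1, 2, 3), 4)         # FALSE (longest run is 3)
has_consecutive(c(1, 2, 3, 5, 7, 8), 3, incr = 2)  # TRUE  (3, 5, 7)
```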
Simple but works
numbers = c(-2,2,3,4,5,10,6,7,8)
x1<-c(diff(numbers),0)
x2<-c(0,diff(numbers[-1]),0)
x3<-c(0,diff(numbers[c(-1,-2)]),0,0)
rbind(x1,x2,x3)
colSums(rbind(x1,x2,x3) )==3 #Returns TRUE or FALSE where in the vector the consecutive intervals triplet takes place
[1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
sum(colSums(rbind(x1,x2,x3) )==3) #How many triplets of consecutive intervals occur in the vector
[1] 3
which(colSums(rbind(x1,x2,x3) )==3) #Returns the location of the triplets consecutive integers
[1] 2 3 7
Note that this will not work for consecutive negative intervals c(-2,-1,0) because of how diff() works