Create grouping variable for consecutive sequences and split vector

I have a vector, such as c(1, 3, 4, 5, 9, 10, 17, 29, 30), and I would like to group together the 'neighboring' elements that form a regular, consecutive sequence (i.e. an increase by 1) in a ragged vector, resulting in:
L1: 1
L2: 3,4,5
L3: 9,10
L4: 17
L5: 29,30
Naive code (of an ex-C programmer):
partition.neighbors <- function(v)
{
    result <<- list()    # jagged array
    currentList <<- v[1] # current series
    for (i in 2:length(v))
    {
        if (v[i] - v[i - 1] == 1)
        {
            currentList <<- c(currentList, v[i])
        }
        else
        {
            result <<- c(result, list(currentList))
            currentList <<- v[i] # next series
        }
    }
    result <<- c(result, list(currentList)) # flush the final series, otherwise it is dropped
    return(result)
}
Now I understand that (a) R is not C (despite the curly brackets), (b) global variables are pure evil, and (c) this is a horribly inefficient way of achieving the result, so any better solutions are welcome.

Making heavy use of some R idioms:
> split(v, cumsum(c(1, diff(v) != 1)))
$`1`
[1] 1
$`2`
[1] 3 4 5
$`3`
[1] 9 10
$`4`
[1] 17
$`5`
[1] 29 30
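To see how the grouping variable is built, here are the intermediate steps for the example vector (my annotation of the one-liner above):
v <- c(1, 3, 4, 5, 9, 10, 17, 29, 30)
diff(v)                    # 2 1 1 4 1 7 12 1   (gaps between neighbors)
diff(v) != 1               # TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
cumsum(c(1, diff(v) != 1)) # 1 2 2 2 3 3 4 5 5  (a new id wherever a gap occurs)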

daroczig writes below that "you could write a lot neater code based on diff". Here's one way; prepending -Inf makes the first difference infinite (never equal to 1), so the first element always starts a new group:
split(v, cumsum(diff(c(-Inf, v)) != 1))
EDIT (added timings):
Tommy discovered this could be faster by being careful with types; the reason it got faster is that split is faster on integers, and is actually faster still on factors.
Here's Joshua's solution; the result from the cumsum is a numeric because it's being c'd with 1, so it's the slowest.
system.time({
    a <- cumsum(c(1, diff(v) != 1))
    split(v, a)
})
# user system elapsed
# 1.839 0.004 1.848
Just c()ing with 1L instead, so the result is an integer, speeds it up considerably.
system.time({
    a <- cumsum(c(1L, diff(v) != 1))
    split(v, a)
})
# user system elapsed
# 0.744 0.000 0.746
This is Tommy's solution, for reference; it's also splitting on an integer.
system.time({
    a <- cumsum(c(TRUE, diff(v) != 1L))
    split(v, a)
})
# user system elapsed
# 0.742 0.000 0.746
Here's my original solution; it also is splitting on an integer.
system.time({
    a <- cumsum(diff(c(-Inf, v)) != 1)
    split(v, a)
})
# user system elapsed
# 0.750 0.000 0.754
Here's Joshua's, with the result converted to an integer before the split.
system.time({
    a <- cumsum(c(1, diff(v) != 1))
    a <- as.integer(a)
    split(v, a)
})
# user system elapsed
# 0.736 0.002 0.740
All the versions that split on an integer vector are about the same; it could be even faster if that integer vector were already a factor, as the conversion from integer to factor actually takes about half the time. Here I make it into a factor directly; this is not recommended in general because it depends on the internal structure of the factor class. It's done here for comparison purposes only.
system.time({
    a <- cumsum(c(1L, diff(v) != 1))
    a <- structure(a, class = "factor", levels = 1L:a[length(a)])
    split(v, a)
})
# user system elapsed
# 0.356 0.000 0.357

Joshua and Aaron were spot on. However, their code can still be made more than twice as fast by careful use of the correct types, integers and logicals:
split(v, cumsum(c(TRUE, diff(v) != 1L)))
v <- rep(c(1:5, 19), len = 1e6) # Huge vector...
system.time( split(v, cumsum(c(1, diff(v) != 1))) ) # Joshua's code
# user system elapsed
# 2.64 0.00 2.64
system.time( split(v, cumsum(c(TRUE, diff(v) != 1L))) ) # Modified code
# user system elapsed
# 1.09 0.00 1.12

You could define the cut-points easily:
which(diff(v) != 1)
Based on that try:
v <- c(1,3,4,5,9,10,17,29,30)
cutpoints <- c(0, which(diff(v) != 1), length(v))
ragged.vector <- vector("list", length(cutpoints)-1)
for (i in 2:length(cutpoints)) ragged.vector[[i-1]] <- v[(cutpoints[i-1]+1):cutpoints[i]]
Which results in:
> ragged.vector
[[1]]
[1] 1
[[2]]
[1] 3 4 5
[[3]]
[1] 9 10
[[4]]
[1] 17
[[5]]
[1] 29 30
This algorithm is not a nice one but you could write a lot neater code based on diff :) Good luck!
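For reference, here is the same cut-point idea without the explicit loop, a sketch using base Map on the cutpoints defined above:
starts <- head(cutpoints, -1) + 1
ends <- tail(cutpoints, -1)
Map(function(s, e) v[s:e], starts, ends) # list(1, c(3,4,5), c(9,10), 17, c(29,30))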

You can create a data.frame and assign the elements to groups using diff, ifelse and cumsum, then aggregate using tapply:
v.df <- data.frame(v = v)
v.df$group <- cumsum(ifelse(c(1, diff(v) - 1), 1, 0))
tapply(v.df$v, v.df$group, function(x) x)
$`1`
[1] 1
$`2`
[1] 3 4 5
$`3`
[1] 9 10
$`4`
[1] 17
$`5`
[1] 29 30

Related

Removing sequential values from a vector, but not iteratively

I'm looking to subset a vector to where there are no sequential numbers. However, if there is a sequence of more than two sequential numbers, then only every second sequential number is removed, since removing that number will disrupt the sequence.
e.g. 1,2,4,6,7 would give 1,4,6
e.g. 6,7,8,9 would give 6,8
This is easy to do iteratively, but iterating over 10M+ elements is incredibly slow:
x <- c(1,2,4,6,7,8,9) # Ideal output is c(1,4,6,8)
for (i in 2:length(x)) {
    if (!is.na(x[i - 1])) {
        if (x[i] == x[i - 1] + 1) x[i] <- NA_integer_
    }
}
x[!is.na(x)]
Is there another solution that would be significantly faster?
Using convenience functions collapse::seqid and data.table::rowid:
library(collapse)
library(data.table)
x[rowid(seqid(x)) %% 2 == 1]
# [1] 1 4 6 8
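The intermediate values show how this works (my annotation: seqid assigns an id to each run of consecutive values, and rowid counts positions within each run):
seqid(x)        # 1 1 2 3 3 3 3
rowid(seqid(x)) # 1 2 1 1 2 3 4  (keep the odd within-run positions)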
Seems faster on a longer vector:
x = rep(c(1,2,4,6,7,8,9), 1e7)
system.time({
    seq_id = data.table::rleid(x - seq_along(x))
    obs_id = unlist(lapply(split(seq_id, seq_id), seq_along))
    r1 = x[obs_id %% 2 == 1]
})
# user system elapsed
# 112.77 55.99 177.11
system.time({
    r2 = x[rowid(seqid(x)) %% 2 == 1]
})
# user system elapsed
# 8.03 5.97 10.23
all.equal(r1, r2)
# [1] TRUE
We can use the fantastic data.table::rleid to generate an ID for each sequence, then keep only the odd-numbered elements within each sequence. This should be quite fast, though more optimization is certainly possible.
disrupt_seqs = function(x) {
    seq_id = data.table::rleid(x - seq_along(x))
    obs_id = unlist(lapply(split(seq_id, seq_id), seq_along))
    x[obs_id %% 2 == 1]
}
x <- c(1,2,4,6,7,8,9)
disrupt_seqs(x)
# [1] 1 4 6 8
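The trick is that x - seq_along(x) is constant within each run of consecutive values, so rleid can turn it into run ids; a quick sketch with the example vector:
x - seq_along(x)                    # 0 0 1 2 2 2 2
data.table::rleid(x - seq_along(x)) # 1 1 2 3 3 3 3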

Floor and ceiling with 2 or more significant digits

It is possible to round results to two significant digits using signif:
> signif(12500,2)
[1] 12000
> signif(12501,2)
[1] 13000
But are there equally handy functions, like the fictitious signif.floor and signif.ceiling below, that would give me two or more significant digits with flooring or ceiling?
> signif.ceiling(12500,2)
[1] 13000
> signif.floor(12501,2)
[1] 12000
EDIT:
The existing signif function works with negative numbers and decimal numbers, so the solution should preferably also handle negative numbers:
> signif(-125,2)
[1] -120
> signif.floor(-125,2)
[1] -130
and decimal numbers:
> signif(1.23,2)
[1] 1.2
> signif.ceiling(1.23,2)
[1] 1.3
As a special case, also 0 should return 0:
> signif.floor(0,2)
[1] 0
I think this approach works for all types of numbers (integers, negatives, decimals).
The floor function
signif.floor <- function(x, n) {
    pow <- floor(log10(abs(x))) + 1 - n
    y <- floor(x / 10^pow) * 10^pow
    y[x == 0] <- 0 # handle the x = 0 case
    y
}
The ceiling function
signif.ceiling <- function(x, n) {
    pow <- floor(log10(abs(x))) + 1 - n
    y <- ceiling(x / 10^pow) * 10^pow
    y[x == 0] <- 0 # handle the x = 0 case
    y
}
Both do the same thing: first compute the power of ten corresponding to the digits to be dropped, then apply the standard floor/ceiling function at that scale. Check if it works for you.
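A worked trace for one of the question's examples (my annotation):
# signif.floor(12501, 2):
pow <- floor(log10(abs(12501))) + 1 - 2 # 4 + 1 - 2 = 3
floor(12501 / 10^pow) * 10^pow          # floor(12.501) * 1000 = 12000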
Edit 1 Added the handler for the case of x = 0 as suggested in the comments by Heikki.
Edit 2 Again following Heikki I add some examples:
Testing different values of x
# for negative values
> values <- -0.12151 * 10^(0:4); values
# [1] -0.12151 -1.21510 -12.15100 -121.51000 -1215.10000
> sapply(values, function(x) signif.floor(x, 2))
# [1] -0.13 -1.30 -13.00 -130.00 -1300.00
> sapply(values, function(x) signif.ceiling(x, 2))
# [1] -0.12 -1.20 -12.00 -120.00 -1200.00
# for positive values
> sapply(-values, function(x) signif.floor(x, 2))
# [1] 0.12 1.20 12.00 120.00 1200.00
> sapply(-values, function(x) signif.ceiling(x, 2))
# [1] 0.13 1.30 13.00 130.00 1300.00
Testing different values of n
> sapply(1:5, function(n) signif.floor(-121.51,n))
# [1] -200.00 -130.00 -122.00 -121.60 -121.51
> sapply(1:5, function(n) signif.ceiling(-121.51,n))
# [1] -100.00 -120.00 -121.00 -121.50 -121.51
Edit Nowhere near as nice as @storaged's answer, but I'd started, so I might as well finish. Basically it runs through each case (positive or negative, decimal or not):
signif.floor <- function(x, n) {
    if (x == 0) out <- 0
    if (x %% round(x) == 0 & sign(x) == 1)  out <- as.numeric(paste0(el(strsplit(as.character(x), ''))[1:n], collapse = '')) * 10^(nchar(x) - n)
    if (x %% round(x)  > 0 & sign(x) == 1)  out <- as.numeric(paste0(el(strsplit(as.character(x), ''))[1:(n + 1)], collapse = ''))
    if (x %% round(x) == 0 & sign(x) == -1) out <- (as.numeric(paste0(el(strsplit(as.character(x), ''))[1:(n + 1)], collapse = '')) - 1) * 10^(nchar(x) - n - 1)
    if (x %% round(x)  < 0 & sign(x) == -1) out <- as.numeric(paste0(el(strsplit(as.character(x), ''))[1:(n + 2)], collapse = '')) - 10^(-n + 1)
    return(out)
}
signif.ceiling <- function(x, n) {
    if (x == 0) out <- 0
    if (x %% round(x) == 0 & sign(x) == 1)  out <- (as.numeric(paste0(el(strsplit(as.character(x), ''))[1:n], collapse = '')) + 1) * 10^(nchar(x) - n)
    if (x %% round(x)  > 0 & sign(x) == 1)  out <- as.numeric(paste0(el(strsplit(as.character(x), ''))[1:(n + 1)], collapse = '')) + 10^(-n + 1)
    if (x %% round(x) == 0 & sign(x) == -1) out <- as.numeric(paste0(el(strsplit(as.character(x), ''))[1:(n + 1)], collapse = '')) * 10^(nchar(x) - n - 1)
    if (x %% round(x)  < 0 & sign(x) == -1) out <- as.numeric(paste0(el(strsplit(as.character(x), ''))[1:(n + 2)], collapse = ''))
    return(out)
}

"raise" inner list to level of outer list in R [duplicate]

I am trying to achieve the functionality similar to unlist, with the exception that types are not coerced to a vector, but the list with preserved types is returned instead. For instance:
flatten(list(NA, list("TRUE", list(FALSE), 0L)))
should return
list(NA, "TRUE", FALSE, 0L)
instead of
c(NA, "TRUE", "FALSE", "0")
which would be returned by unlist(list(NA, list("TRUE", list(FALSE), 0L))).
As it is seen from the example above, the flattening should be recursive. Is there a function in standard R library which achieves this, or at least some other function which can be used to easily and efficiently implement this?
UPDATE: I don't know if it is clear from the above, but non-lists should not be flattened, i.e. flatten(list(1:3, list(4, 5))) should return list(c(1, 2, 3), 4, 5).
Interesting non-trivial problem!
MAJOR UPDATE With all that's happened, I've rewritten the answer and removed some dead ends. I also timed the various solutions on different cases.
Here's the first, rather simple but slow, solution:
flatten1 <- function(x) {
    y <- list()
    rapply(x, function(x) y <<- c(y, x))
    y
}
rapply lets you traverse a list and apply a function on each leaf element. Unfortunately, it treats the returned values exactly as unlist does. So I ignore the result from rapply and instead append values to the variable y using <<-.
Growing y in this manner is not very efficient (it's quadratic in time). So if there are many thousands of elements this will be very slow.
A more efficient approach is the following, with simplifications from #JoshuaUlrich:
flatten2 <- function(x) {
    len <- sum(rapply(x, function(x) 1L))
    y <- vector('list', len)
    i <- 0L
    rapply(x, function(x) { i <<- i + 1L; y[[i]] <<- x })
    y
}
Here I first find out the result length and pre-allocate the vector. Then I fill in the values.
As you will see, this solution is much faster.
Here's a version of @JoshO'Brien's great solution based on Reduce, but extended so it handles arbitrary depth:
flatten3 <- function(x) {
    repeat {
        if (!any(vapply(x, is.list, logical(1)))) return(x)
        x <- Reduce(c, x)
    }
}
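Each pass of Reduce(c, x) concatenates the top-level elements, which splices every sub-list into its parent and so removes exactly one level of nesting; that is why the repeat loop terminates. A small illustration (my example, not from the thread):
Reduce(c, list(1, list(2, list(3))))
# list(1, 2, list(3))  -- one level of nesting removed per pass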
Now let the battle begin!
# Check correctness on original problem
x <- list(NA, list("TRUE", list(FALSE), 0L))
dput( flatten1(x) )
#list(NA, "TRUE", FALSE, 0L)
dput( flatten2(x) )
#list(NA, "TRUE", FALSE, 0L)
dput( flatten3(x) )
#list(NA_character_, "TRUE", FALSE, 0L)
# Time on a huge flat list
x <- as.list(1:1e5)
#system.time( flatten1(x) ) # Long time
system.time( flatten2(x) ) # 0.39 secs
system.time( flatten3(x) ) # 0.04 secs
# Time on a huge deep list
x <-'leaf'; for(i in 1:11) { x <- list(left=x, right=x, value=i) }
#system.time( flatten1(x) ) # Long time
system.time( flatten2(x) ) # 0.05 secs
system.time( flatten3(x) ) # 1.28 secs
...So what we observe is that the Reduce solution is faster when the depth is low, and the rapply solution is faster when the depth is large!
As correctness goes, here are some tests:
> dput(flatten1( list(1:3, list(1:3, 'foo')) ))
list(1L, 2L, 3L, 1L, 2L, 3L, "foo")
> dput(flatten2( list(1:3, list(1:3, 'foo')) ))
list(1:3, 1:3, "foo")
> dput(flatten3( list(1:3, list(1:3, 'foo')) ))
list(1L, 2L, 3L, 1:3, "foo")
Unclear what result is desired, but I lean towards the result from flatten2...
For lists that are only a few nestings deep, you could use Reduce() and c() to do something like the following. Each application of c() removes one level of nesting. (For a fully general solution, see the EDITs below.)
L <- list(NA, list("TRUE", list(FALSE), 0L))
Reduce(c, Reduce(c, L))
[[1]]
[1] NA
[[2]]
[1] "TRUE"
[[3]]
[1] FALSE
[[4]]
[1] 0
# TIMING TEST
x <- as.list(1:4e3)
system.time(flatten(x)) # Using the improved version
# user system elapsed
# 0.14 0.00 0.13
system.time(Reduce(c, x))
# user system elapsed
# 0.04 0.00 0.03
EDIT Just for fun, here's a version of @Tommy's version of @JoshO'Brien's solution that does work for already-flat lists. FURTHER EDIT Now @Tommy's solved that problem as well, but in a cleaner way. I'll leave this version in place.
flatten <- function(x) {
    x <- list(x)
    repeat {
        x <- Reduce(c, x)
        if (!any(vapply(x, is.list, logical(1)))) return(x)
    }
}
flatten(list(3, TRUE, 'foo'))
# [[1]]
# [1] 3
#
# [[2]]
# [1] TRUE
#
# [[3]]
# [1] "foo"
How about this? It builds off Josh O'Brien's solution but does the recursion with a while loop instead, using unlist with recursive=FALSE.
flatten4 <- function(x) {
    while (any(vapply(x, is.list, logical(1)))) {
        # this next line gives behavior like Tommy's answer;
        # removing it gives behavior like Josh's
        x <- lapply(x, function(x) if (is.list(x)) x else list(x))
        x <- unlist(x, recursive = FALSE)
    }
    x
}
Keeping the commented line in gives results like this (which Tommy prefers, and so do I, for that matter).
> x <- list(1:3, list(1:3, 'foo'))
> dput(flatten4(x))
list(1:3, 1:3, "foo")
Output from my system, using Tommy's tests:
dput(flatten4(foo))
#list(NA, "TRUE", FALSE, 0L)
# Time on a huge flat list
x <- as.list(1:1e5)
system.time( x2 <- flatten2(x) ) # 0.48 secs
system.time( x3 <- flatten3(x) ) # 0.07 secs
system.time( x4 <- flatten4(x) ) # 0.07 secs
identical(x2, x4) # TRUE
identical(x3, x4) # TRUE
# Time on a huge deep list
x <-'leaf'; for(i in 1:11) { x <- list(left=x, right=x, value=i) }
system.time( x2 <- flatten2(x) ) # 0.05 secs
system.time( x3 <- flatten3(x) ) # 1.45 secs
system.time( x4 <- flatten4(x) ) # 0.03 secs
identical(x2, unname(x4)) # TRUE
identical(unname(x3), unname(x4)) # TRUE
EDIT: As for getting the depth of a list, maybe something like this would work; it gets the index for each element recursively.
depth <- function(x) {
    foo <- function(x, i = NULL) {
        if (is.list(x)) lapply(seq_along(x), function(xi) foo(x[[xi]], c(i, xi)))
        else i
    }
    flatten4(foo(x))
}
It's not super fast but it seems to work fine.
x <- as.list(1:1e5)
system.time(d <- depth(x)) # 0.327 s
x <-'leaf'; for(i in 1:11) { x <- list(left=x, right=x, value=i) }
system.time(d <- depth(x)) # 0.041s
I'd imagined it being used this way:
> x[[ d[[5]] ]]
[1] "leaf"
> x[[ d[[6]] ]]
[1] 1
But you could also get a count of how many nodes are at each depth too.
> table(sapply(d, length))
1 2 3 4 5 6 7 8 9 10 11
1 2 4 8 16 32 64 128 256 512 3072
Edited to address a flaw pointed out in the comments. Sadly, it just makes it even less efficient. Ah well.
Another approach, although I'm not sure it will be more efficient than anything @Tommy has suggested:
l <- list(NA, list("TRUE", list(FALSE), 0L))
flatten <- function(x) {
    obj <- rapply(x, identity, how = "unlist")
    cl <- rapply(x, class, how = "unlist")
    len <- rapply(x, length, how = "unlist")
    cl <- rep(cl, times = len)
    mapply(function(obj, cl) as(obj, cl), obj, cl,
           SIMPLIFY = FALSE, USE.NAMES = FALSE)
}
> flatten(l)
[[1]]
[1] NA
[[2]]
[1] "TRUE"
[[3]]
[1] FALSE
[[4]]
[1] 0
purrr::flatten achieves that, though it is not recursive (by design), so applying it twice works for this example:
library(purrr)
l <- list(NA, list("TRUE", list(FALSE), 0L))
flatten(flatten(l))
Here is an attempt at a recursive version:
flatten_recursive <- function(x) {
    stopifnot(is.list(x))
    if (any(vapply(x, is.list, logical(1)))) Recall(purrr::flatten(x)) else x
}
flatten_recursive(l)
Another trick: force unlist() itself to return a list rather than an atomic vector, by temporarily inserting a function element (functions cannot be coerced into an atomic vector):
hack_list <- function(.list) {
    .list[['_hack']] <- function() NULL
    .list <- unlist(.list)
    .list$`_hack` <- NULL
    .list
}
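A quick usage sketch (my addition; expect some empty names left over from unlist):
hack_list(list(NA, list("TRUE", list(FALSE), 0L)))
# list(NA, "TRUE", FALSE, 0L)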
You can also use rrapply in the rrapply package (an extended version of base rapply) by setting how = "flatten":
library(rrapply)
rrapply(list(NA, list("TRUE", list(FALSE), 0L)), how = "flatten")
#> [[1]]
#> [1] NA
#>
#> [[2]]
#> [1] "TRUE"
#>
#> [[3]]
#> [1] FALSE
#>
#> [[4]]
#> [1] 0
Computation times
Below are some benchmark timings against the flatten2 and flatten3 functions in Tommy's response for two large nested lists:
flatten2 <- function(x) {
    len <- sum(rapply(x, function(x) 1L))
    y <- vector('list', len)
    i <- 0L
    rapply(x, function(x) { i <<- i + 1L; y[[i]] <<- x })
    y
}
flatten3 <- function(x) {
    repeat {
        if (!any(vapply(x, is.list, logical(1)))) return(x)
        x <- Reduce(c, x)
    }
}
## large deeply nested list (1E6 elements, 6 layers)
deep_list <- rrapply(
    replicate(10, 1, simplify = FALSE),
    classes = c("list", "numeric"),
    condition = function(x, .xpos) length(.xpos) < 6,
    f = function(x) replicate(10, 1, simplify = FALSE),
    how = "recurse"
)
system.time(flatten2(deep_list))
#> user system elapsed
#> 1.715 0.012 1.727
## system.time(flatten3(deep_list)) not run; it takes more than 10 minutes
system.time(rrapply(deep_list, how = "flatten"))
#> user system elapsed
#> 0.105 0.016 0.121
## large shallow nested list (1E6 elements, 2 layers)
shallow_list <- lapply(replicate(1000, 1, simplify = FALSE),
                       function(x) replicate(1000, 1, simplify = FALSE))
system.time(flatten2(shallow_list))
#> user system elapsed
#> 1.308 0.040 1.348
system.time(flatten3(shallow_list))
#> user system elapsed
#> 5.246 0.012 5.259
system.time(rrapply(shallow_list, how = "flatten"))
#> user system elapsed
#> 0.09 0.00 0.09

Memoize and vectorize a custom function

I want to know how to vectorize and memoize a custom function in R. It seems
my way of thinking is not aligned with R's way of operation. So, I gladly
welcome any links to good reading material. For example, The R Inferno is a nice resource, but it didn't help me figure out memoization in R.
More generally, can you provide a relevant usage example for the memoise
or R.cache packages?
I haven't been able to find any other discussions on this subject. Searching
for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
for those keywords at http://r-project.markmail.org/ does not return helpful
discussions. I emailed the mailing list and did not receive a complete
answer.
I am not solely interested in memoizing the GC function, and I am aware of
Bioconductor and the various packages
available there.
Here's my data:
seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")
Some sequences are missing, so they're blank "".
I have a function for calculating GC content:
GC <- function(s) {
    if (!is.character(s)) return(NA)
    n <- nchar(s)
    if (n == 0) return(NA)
    m <- gregexpr('[GCSgcs]', s)[[1]]
    if (m[1] < 1) return(0)
    return(100.0 * length(m) / n)
}
It works:
> GC('')
[1] NA
> GC('G')
[1] 100
> GC('GAG')
[1] 66.66667
> sapply(seqs, GC)
                  G         C       CCC         T                 TTCCT
       NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000        NA
        C       CTC
100.00000  66.66667
I want to memoize it. Then, I want to vectorize it.
Apparently, I must have the wrong mindset for using the memoise or
R.cache R packages:
> system.time(dummy <- sapply(rep(seqs,100), GC))
user system elapsed
0.044 0.000 0.054
>
> library(memoise)
> GCm1 <- memoise(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm1))
user system elapsed
0.164 0.000 0.173
>
> library(R.cache)
> GCm2 <- addMemoization(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm2))
user system elapsed
10.601 0.252 10.926
Notice that the memoized functions are several orders of magnitude slower.
I tried the hash package, but things seem to be happening behind the
scenes and I don't understand the output. The sequence C should have a
value of 100, not NULL.
Note that using has.key(s, cache) instead of exists(s, cache) results
in the same output. Also, using cache[s] <<- result instead of
cache[[s]] <<- result results in the same output.
cache <- hash()
GCc <- function(s) {
    if (!is.character(s) || nchar(s) == 0) {
        return(NA)
    }
    if (exists(s, cache)) {
        return(cache[[s]])
    }
    result <- GC(s)
    cache[[s]] <<- result
    return(result)
}
> sapply(seqs,GCc)
[[1]]
[1] NA
$G
[1] 100
$C
NULL
$CCC
[1] 100
$T
NULL
[[6]]
[1] NA
$TTCCT
[1] 40
[[8]]
[1] NA
$C
NULL
$CTC
[1] 66.66667
At least I figured out how to vectorize:
> GCv <- Vectorize(GC)
> GCv(seqs)
                  G         C       CCC         T                 TTCCT
       NA 100.00000 100.00000 100.00000   0.00000        NA  40.00000        NA
        C       CTC
100.00000  66.66667
Relevant stackoverflow posts:
Options for caching / memoization / hashing in R
While this won't give you memoization across calls, you can use factors to make individual calls a lot faster if there is a fair bit of repetition. E.g. using Joshua's GC2 (though I had to remove fixed=T to get it to work):
GC2 <- function(s) {
    if (!is.character(s)) stop("'s' must be character")
    n <- nchar(s)
    m <- gregexpr('[GCSgcs]', s)
    len <- sapply(m, length)
    neg <- sapply(m, "[[", 1)
    len <- len * (neg > 0)
    100.0 * len / n
}
One can easily define a wrapper like:
GC3 <- function(s) {
    x <- factor(s)
    GC2(levels(x))[x]
}
system.time(GC2(rep(seqs, 50000)))
# user system elapsed
# 8.97 0.00 8.99
system.time(GC3(rep(seqs, 50000)))
# user system elapsed
# 0.06 0.00 0.06
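The levels-then-index trick generalizes to any function that is vectorized over its input; a minimal sketch (memo_by_factor is my name, not from the thread):
memo_by_factor <- function(f) {
    f <- match.fun(f)
    function(s) {
        x <- factor(s)
        f(levels(x))[x] # compute once per unique value, then expand via the factor's integer codes
    }
}
# memo_by_factor(GC2) behaves like GC3 above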
This doesn't explicitly answer your question, but this function is ~4 times faster than yours.
GC2 <- function(s) {
    if (!is.character(s)) stop("'s' must be character")
    n <- nchar(s)
    m <- gregexpr('[GCSgcs]', s)
    len <- sapply(m, length)
    neg <- sapply(m, "[[", 1)
    len <- len * (neg > 0)
    len / n
}

Return system.time from evaluated function

R version 2.12, Windows XP
I am attempting to write a function (say 'g') that takes one argument, a function (say 'f'), and returns the matched function. Furthermore, enclosed within the body of 'g' is a statement that tells the resulting object to return the value of system.time when the object is called. An example will clarify.
What I want:
g <- function(f) {...}
z <- g(mean)
z(c(1, 4, 7))
with output
user system elapsed
0.04 0.00 0.04
What I have:
g <- function(f) {if (!exists("x")) {x <- match.fun(f)} else {system.time(x)}}
z <- g(mean)
z(c(1, 4, 7))
with output
[1] 4
Any help is greatly appreciated.
Maybe this will help:
g <- function(f) {
    function(x) {
        zz <- system.time(
            xx <- match.fun(f)(x)
        )
        list(value = xx, system.time = zz)
    }
}
In use:
g(mean)(c(1, 4, 7))
$value
[1] 4
$system.time
user system elapsed
0 0 0
You may want to think about how you return the values. I used a list, but another option is to print the system time as a side effect and return the calculated value.
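A sketch of that second option (g_print is my name for it, not from the answer):
g_print <- function(f) {
    f <- match.fun(f)
    function(...) {
        tm <- system.time(res <- f(...))
        print(tm)      # timing shown as a side effect
        invisible(res) # the computed value is still returned
    }
}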
Recently I made a similar function for myself:
with_times <- function(f) {
    f <- match.fun(f)
    function(...) {
        .times <- system.time(res <- f(...))
        attr(res, "system.time") <- as.list(na.omit(.times))
        res
    }
}
For example:
g <- function(x,y) {r<-x+y; Sys.sleep(.5); r}
g(1, 1)
# [1] 2
g2 <- with_times(g)
w <- g2(1, 1)
Timings can be extracted in two ways:
attributes(w)$system.time
# $user.self
# [1] 0
# $sys.self
# [1] 0
# $elapsed
# [1] 0.5
or
attr(w, "system.time")
# $user.self
# [1] 0
# $sys.self
# [1] 0
# $elapsed
# [1] 0.5
