Apply a function on subsequent pairs of elements in R

Suppose x is a vector, and myfunc is a function of two arguments. I wish to get a vector of the results of myfunc on subsequent pairs of elements from x. By definition, that vector should be of length 1 less than x's length.
For example, if x <- 1:4 and
myfunc <- function(a,b) {
return(log(b/a))
}
Then I would expect
> apply_on_pairs(x, myfunc)
[1] 0.6931472 0.4054651 0.2876821
(which is equivalent to c(myfunc(1,2), myfunc(2,3), myfunc(3,4)))

mapply(myfunc,x[-length(x)],x[-1])
# [1] 0.6931472 0.4054651 0.2876821
mapply(...) "applies" the function in the first argument to the subsequent arguments, in this case we pass x[1:3] and x[2:4] as the second the third arguments to mapply(...).

library(zoo)
x <- 1:4
rollapply(x, width=2, FUN=function(x) return(log(x[2]/x[1])))
## [1] 0.6931472 0.4054651 0.2876821

In this case you can diff() the log() of your vector.
x <- 1:4
diff(log(x))
Yields:
> diff(log(x))
[1] 0.6931472 0.4054651 0.2876821
Update: A more general solution uses head() and tail() to drop the last and first elements. You want to do your best to stick to vectorized solutions, which should be faster and more memory efficient.
myFun <- function(x) log(tail(x, -1)) - log(head(x, -1))
There's a slight speed edge to diff().
> x <- seq(1e8)
> system.time(A <- diff(log(x)))
user system elapsed
8.42 1.28 9.90
> myFun <- function(x) log(tail(x, -1)) - log(head(x, -1))
> system.time(B <- myFun(x))
user system elapsed
9.29 1.40 10.78
> all.equal(A, B)
[1] TRUE

Related

Save a variable in succession with "for" loop

I need to save the sum of the results from a for loop.
The output shown is just the individual values of the variable n, not the sum of these values.
for(i in 1:10){
n=1/(i+1)^2
m=sum(n)
print(m)
}
[1] 0.25
[1] 0.1111111
[1] 0.0625
[1] 0.04
[1] 0.02777778
[1] 0.02040816
[1] 0.015625
[1] 0.01234568
[1] 0.01
[1] 0.008264463
Most of the operations in R are vectorized, so you can use them instead of a for loop.
In this case, we can do
i <- 1:10
ans <- sum((1/(i+1))^2)
ans
#[1] 0.558
If you do have to use a for loop, you need to store all the values in a vector by index. Currently you are just overwriting the previous value with each newly calculated one.
n <- numeric(length = 10L)
for(i in seq_along(n)) {
n[i] = 1/(i+1)^2
}
ans <- sum(n)
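If you also want the running total after each term, cumsum() gives it in a single vectorized call (a small sketch):
i <- 1:10
cumsum(1/(i + 1)^2)   # running totals; the last element equals sum(1/(i + 1)^2)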

How to compose a list of functions

For example, I have a vector of functions: fun_vec <- c(step1,step2,step3).
Now I want to compose them like this: step1(step2(step3(x))). How do I do this using fun_vec? (Suppose that fun_vec isn't fixed and can have more or less functions.)
Similar to Frank's use of freduce, you can use Reduce:
step1 <- function(a) a^2
step2 <- function(a) sum(a)
step3 <- function(a) sqrt(a)
steps <- list(step1, step2, step3)
Reduce(function(a,f) f(a), steps, 1:3)
# [1] 3.741657
step3(step2(step1(1:3)))
# [1] 3.741657
You can see it "in action" with:
Reduce(function(a,f) f(a), steps, 1:3, accumulate=TRUE)
# [[1]]
# [1] 1 2 3
# [[2]]
# [1] 1 4 9
# [[3]]
# [1] 14
# [[4]]
# [1] 3.741657
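Note that Reduce as used above applies step1 first, i.e. it computes step3(step2(step1(x))). If you want the exact nesting from the question, step1(step2(step3(x))), one option is to reverse the list first (a small sketch using the steps defined above):
Reduce(function(a, f) f(a), rev(steps), 1:3)
# equivalent to step1(step2(step3(1:3)))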
You can use freduce from the magrittr package:
fun_vec = c(function(x) x^2, function(x) sum(x), function(x) sqrt(x))
library(magrittr)
freduce(1:10, fun_vec)
Alternatively, define a function sequence with pipes like...
library(magrittr)
f = . %>% raise_to_power(2) %>% sum %>% sqrt
f(1:10)
A similar example: Is there a way to 'pipe through a list'?
Here's a base R recursive approach:
compose <- function(funs) {
n <- length(funs)
fcomp <- function(x) funs[[n - 1]](funs[[n]](x))
if (n > 2) compose(c(funs[1:(n - 2)], fcomp)) else fcomp
}
x <- c(sqrt, log, exp)
compose(x)(2)
# [1] 1.414214
sqrt(log(exp(2)))
# [1] 1.414214
If the number of functions in funs is greater than two, we shorten the list by one by replacing the last two functions by their composition. Otherwise, we return the composition of the last remaining two. It's assumed that initially there are at least two functions in funs.
Take a look at purrr::compose. If your functions are stored inside a list, use purrr::invoke to pass that list to compose:
fun_vec <- c( exp, log10, sqrt )
f <- purrr::invoke( purrr::compose, fun_vec )
f(4) # 1.35125
exp( log10( sqrt(4) ) ) # 1.35125

unique() or duplicated() with all.equal() functionality?

I am searching for a (simple) function in R to remove duplicated elements, like unique() or duplicated(), but one that allows for "near equality" of numerical values the way all.equal() does:
unique( c(0, 0))
[1] 0
works fine, but
unique( c(0, cos(pi/2)) )
[1] 0.000000e+00 6.123032e-17
does not remove the second element, although a comparison with all.equal returns TRUE:
all.equal( 0, cos(pi/2) )
[1] TRUE
Same is valid for duplicated:
duplicated( c(0, cos(pi/2)))
[1] FALSE FALSE
Any suggestions? Thanks!
You might also consider the zapsmall function:
x <- rep(c(1,2), each=5) + rnorm(10)/(10^rep(1:5,2))
unique(x)
# [1] 1.0571484 1.0022854 1.0014347 0.9998829 0.9999985 2.1095720 1.9888208 2.0002687 1.9999723 2.0000078
unique(zapsmall(x, digits=4))
# [1] 1.0571 1.0023 1.0014 0.9999 1.0000 2.1096 1.9888 2.0003 2.0000
unique(zapsmall(x, digits=2))
# [1] 1.06 1.00 2.11 1.99 2.00
unique(zapsmall(x, digits=0))
# [1] 1 2
If you'd like to take into account the absolute error, and not the relative error (as all.equal does), try:
x <- c(0, cos(pi/2), 1, 1+1e-16)
unique(x)
## [1] 0.000000e+00 6.123234e-17 1.000000e+00
(x <- x[!duplicated(round(x, 10))])
## [1] 0 1
Here we remove the elements that are the same w.r.t. a fixed (10 above) number of decimal digits.
You could try this code (disclaimer: from my package cgwtools)
approxeq <- function (x, y, tolerance = .Machine$double.eps^0.5, ...)
{
if (length(x) != length(y))
warning("x,y lengths differ. Will recycle.")
checkit <- abs(x - y) < tolerance
return(invisible(checkit))
}
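For example, applied to the values from the question (approxeq returns its result invisibly, so it is wrapped in print() here):
print(approxeq(0, cos(pi/2)))
# [1] TRUE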

Memoize and vectorize a custom function

I want to know how to vectorize and memoize a custom function in R. It seems
my way of thinking is not aligned with R's way of operation. So, I gladly
welcome any links to good reading material. For example, R inferno is a nice
resource, but it didn't help to figure out memoization in R.
More generally, can you provide a relevant usage example for the memoise
or R.cache packages?
I haven't been able to find any other discussions on this subject. Searching
for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
for those keywords at http://r-project.markmail.org/ does not return helpful
discussions. I emailed the mailing list and did not receive a complete
answer.
I am not solely interested in memoizing the GC function, and I am aware of
Bioconductor and the various packages
available there.
Here's my data:
seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")
Some sequences are missing, so they're blank "".
I have a function for calculating GC content:
GC <- function(s) {
if (!is.character(s)) return(NA)
n <- nchar(s)
if (n == 0) return(NA)
m <- gregexpr('[GCSgcs]', s)[[1]]
if (m[1] < 1) return(0)
return(100.0 * length(m) / n)
}
It works:
> GC('')
[1] NA
> GC('G')
[1] 100
> GC('GAG')
[1] 66.66667
> sapply(seqs, GC)
G C CCC T TTCCT
NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 NA
C CTC
100.00000 66.66667
I want to memoize it. Then, I want to vectorize it.
Apparently, I must have the wrong mindset for using the memoise or
R.cache R packages:
> system.time(dummy <- sapply(rep(seqs,100), GC))
user system elapsed
0.044 0.000 0.054
>
> library(memoise)
> GCm1 <- memoise(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm1))
user system elapsed
0.164 0.000 0.173
>
> library(R.cache)
> GCm2 <- addMemoization(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm2))
user system elapsed
10.601 0.252 10.926
Notice that the memoized functions are considerably slower: the memoise version takes roughly three times as long as the plain sapply, and the R.cache version is around two orders of magnitude slower.
I tried the hash package, but things seem to be happening behind the
scenes and I don't understand the output. The sequence C should have a
value of 100, not NULL.
Note that using has.key(s, cache) instead of exists(s, cache) results
in the same output. Also, using cache[s] <<- result instead of
cache[[s]] <<- result results in the same output.
cache <- hash()
GCc <- function(s) {
if (!is.character(s) || nchar(s) == 0) {
return(NA)
}
if(exists(s, cache)) {
return(cache[[s]])
}
result <- GC(s)
cache[[s]] <<- result
return(result)
}
> sapply(seqs,GCc)
[[1]]
[1] NA
$G
[1] 100
$C
NULL
$CCC
[1] 100
$T
NULL
[[6]]
[1] NA
$TTCCT
[1] 40
[[8]]
[1] NA
$C
NULL
$CTC
[1] 66.66667
At least I figured out how to vectorize:
> GCv <- Vectorize(GC)
> GCv(seqs)
G C CCC T TTCCT
NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 NA
C CTC
100.00000 66.66667
Relevant stackoverflow posts:
Options for caching / memoization / hashing in R
While this won't give you memoization across calls, you can use factors to make individual calls a lot faster if there is a fair bit of repetition, e.g. using Joshua's GC2 from the answer below (though I had to remove fixed=T to get it to work):
GC2 <- function(s) {
if(!is.character(s)) stop("'s' must be character")
n <- nchar(s)
m <- gregexpr('[GCSgcs]', s)
len <- sapply(m, length)
neg <- sapply(m, "[[", 1)
len <- len*(neg > 0)
100.0 * len/n
}
One can easily define a wrapper that calls GC2 only once per unique string (the factor levels) and then expands the results by indexing with the factor:
GC3 <- function(s) {
x <- factor(s)
GC2(levels(x))[x]
}
system.time(GC2(rep(seqs, 50000)))
# user system elapsed
# 8.97 0.00 8.99
system.time(GC3(rep(seqs, 50000)))
# user system elapsed
# 0.06 0.00 0.06
This doesn't explicitly answer your question, but this function is ~4 times faster than yours.
GC2 <- function(s) {
if(!is.character(s)) stop("'s' must be character")
n <- nchar(s)
m <- gregexpr('[GCSgcs]', s)
len <- sapply(m, length)
neg <- sapply(m, "[[", 1)
len <- len*(neg > 0)
100.0 * len/n
}
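For what it's worth, a minimal memoising wrapper can also be written with a plain environment as the cache. This is only a sketch (reusing GC and seqs from the question), not a benchmark:
cache_env <- new.env(parent = emptyenv())
GCe <- function(s) {
  if (!is.character(s) || nchar(s) == 0) return(NA)
  # exact lookup in the cache environment only (inherits = FALSE),
  # so unrelated objects named e.g. "T" or "C" elsewhere are never picked up
  if (exists(s, envir = cache_env, inherits = FALSE))
    return(get(s, envir = cache_env))
  result <- GC(s)
  assign(s, result, envir = cache_env)
  result
}
sapply(seqs, GCe)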

Create grouping variable for consecutive sequences and split vector

I have a vector, such as c(1, 3, 4, 5, 9, 10, 17, 29, 30) and I would like to group together the 'neighboring' elements that form a regular, consecutive sequence, i.e. an increase by 1, in a ragged vector resulting in:
L1: 1
L2: 3,4,5
L3: 9,10
L4: 17
L5: 29,30
Naive code (of an ex-C programmer):
partition.neighbors <- function(v)
{
result <<- list() #jagged array
currentList <<- v[1] #current series
for(i in 2:length(v))
{
if(v[i] - v [i-1] == 1)
{
currentList <<- c(currentList, v[i])
}
else
{
result <<- c(result, list(currentList))
currentList <<- v[i] #next series
}
}
return(result)
}
Now I understand that a) R is not C (despite the curly brackets), b) global variables are pure evil, and c) that is a horribly inefficient way of achieving the result, so any better solutions are welcome.
Making heavy use of some R idioms:
> split(v, cumsum(c(1, diff(v) != 1)))
$`1`
[1] 1
$`2`
[1] 3 4 5
$`3`
[1] 9 10
$`4`
[1] 17
$`5`
[1] 29 30
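To see how the grouping index is built (a small illustration, with v as in the question):
v <- c(1, 3, 4, 5, 9, 10, 17, 29, 30)
diff(v) != 1                 # TRUE wherever a new run starts
cumsum(c(1, diff(v) != 1))   # [1] 1 2 2 2 3 3 4 5 5  -- one group id per run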
daroczig writes "you could write a lot neater code based on diff"...
Here's one way:
split(v, cumsum(diff(c(-Inf, v)) != 1))
EDIT (added timings):
Tommy discovered this could be faster by being careful with types; the reason it got faster is that split is faster on integers, and is actually faster still on factors.
Here's Joshua's solution; the result from the cumsum is a numeric because it's being c'd with 1, so it's the slowest.
system.time({
a <- cumsum(c(1, diff(v) != 1))
split(v, a)
})
# user system elapsed
# 1.839 0.004 1.848
Simply c()ing with 1L instead, so the result is an integer, speeds it up considerably.
system.time({
a <- cumsum(c(1L, diff(v) != 1))
split(v, a)
})
# user system elapsed
# 0.744 0.000 0.746
This is Tommy's solution, for reference; it's also splitting on an integer.
> system.time({
a <- cumsum(c(TRUE, diff(v) != 1L))
split(v, a)
})
# user system elapsed
# 0.742 0.000 0.746
Here's my original solution; it also is splitting on an integer.
system.time({
a <- cumsum(diff(c(-Inf, v)) != 1)
split(v, a)
})
# user system elapsed
# 0.750 0.000 0.754
Here's Joshua's, with the result converted to an integer before the split.
system.time({
a <- cumsum(c(1, diff(v) != 1))
a <- as.integer(a)
split(v, a)
})
# user system elapsed
# 0.736 0.002 0.740
All the versions that split on an integer vector are about the same; it could be even faster if that integer vector were already a factor, as the conversion from integer to factor actually takes about half the time. Here I make it into a factor directly; this is not recommended in general because it depends on the internal structure of the factor class, and it is done here for comparison purposes only.
system.time({
a <- cumsum(c(1L, diff(v) != 1))
a <- structure(a, class = "factor", levels = 1L:a[length(a)])
split(v,a)
})
# user system elapsed
# 0.356 0.000 0.357
Joshua and Aaron were spot on. However, their code can still be made more than twice as fast by careful use of the correct types, integers and logicals:
split(v, cumsum(c(TRUE, diff(v) != 1L)))
v <- rep(c(1:5, 19), len = 1e6) # Huge vector...
system.time( split(v, cumsum(c(1, diff(v) != 1))) ) # Joshua's code
# user system elapsed
# 2.64 0.00 2.64
system.time( split(v, cumsum(c(TRUE, diff(v) != 1L))) ) # Modified code
# user system elapsed
# 1.09 0.00 1.12
You could define the cut-points easily:
which(diff(v) != 1)
Based on that try:
v <- c(1,3,4,5,9,10,17,29,30)
cutpoints <- c(0, which(diff(v) != 1), length(v))
ragged.vector <- vector("list", length(cutpoints)-1)
for (i in 2:length(cutpoints)) ragged.vector[[i-1]] <- v[(cutpoints[i-1]+1):cutpoints[i]]
Which results in:
> ragged.vector
[[1]]
[1] 1
[[2]]
[1] 3 4 5
[[3]]
[1] 9 10
[[4]]
[1] 17
[[5]]
[1] 29 30
This algorithm is not a nice one but you could write a lot neater code based on diff :) Good luck!
You can create a data.frame and assign the elements to groups using diff, ifelse and cumsum, then aggregate using tapply:
v.df <- data.frame(v = v)
v.df$group <- cumsum(ifelse(c(1, diff(v) - 1), 1, 0))
tapply(v.df$v, v.df$group, function(x) x)
$`1`
[1] 1
$`2`
[1] 3 4 5
$`3`
[1] 9 10
$`4`
[1] 17
$`5`
[1] 29 30
