compare adjacent elements of the same vector (avoiding loops) - r

I managed to write a for loop to compare letters in the following vector:
bases <- c("G","C","A","T")
test <- sample(bases, replace=T, 20)
test will return
[1] "T" "G" "T" "G" "C" "A" "A" "G" "A" "C" "A" "T" "T" "T" "T" "C" "A" "G" "G" "C"
With the function Comp() I can check whether each letter matches the next letter:
Comp <- function(data)
{
output <- vector()
for(i in 1:(length(data)-1))
{
if(data[i]==data[i+1])
{
output[i] <-1
}
else
{
output[i] <-0
}
}
return(output)
}
Resulting in:
> Comp(test)
[1] 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 1 0
This works, but it's very slow on large vectors. Therefore I tried lapply():
Comp <- function(x,i) if(x[i]==x[i+1]) 1 else 0
unlist(lapply(test, Comp, test))
Unfortunately it's not working (Error in i + 1 : non-numeric argument to binary operator). I have trouble figuring out how to access the adjacent letter in the vector to compare it. Also, the length(data)-1 needed to "not compare" the last letter might become a problem.
Thank you all for the help!
Cheers
Lucky

Just "lag" test and use ==, which is vectorized.
bases <- c("G","C","A","T")
set.seed(21)
test <- sample(bases, replace=TRUE, 20)
lag.test <- c(tail(test,-1),NA)
#lag.test <- c(NA,head(test,-1))
test == lag.test
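For reference, the lapply() attempt in the question fails because lapply(test, Comp, test) passes each letter as x and the whole vector as i, so i + 1 tries to add 1 to character data. A minimal sketch of two alternatives that reproduce the 0/1 output of the original Comp(), using the letters printed in the question (the names adjacent_equal and adjacent_sapply are made up for illustration):

```r
# Same letters as the example output in the question.
test <- c("T", "G", "T", "G", "C", "A", "A", "G", "A", "C",
          "A", "T", "T", "T", "T", "C", "A", "G", "G", "C")

# Vectorized: compare elements 1..(n-1) with elements 2..n in a single call.
adjacent_equal <- as.integer(head(test, -1) == tail(test, -1))

# Index-based sapply(): iterate over positions, not over the letters themselves.
adjacent_sapply <- sapply(seq_len(length(test) - 1),
                          function(i) as.integer(test[i] == test[i + 1]))
```

Dropping the last element with head() and the first with tail() also takes care of the "do not compare the last letter" concern, since both pieces have length n - 1.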
Update:
Also, your Comp function is slow because you don't specify the length of output when you initialize it. I suspect you were trying to pre-allocate, but vector() creates a zero-length vector that must be expanded during every iteration of your loop. Your Comp function is significantly faster if you change the call to vector() to vector(length=NROW(data)-1).
set.seed(21)
test <- sample(bases, replace=T, 1e5)
system.time(orig <- Comp(test))
# user system elapsed
# 34.760 0.010 34.884
system.time(prealloc <- Comp.prealloc(test))
# user system elapsed
# 1.18 0.00 1.19
identical(orig, prealloc)
# [1] TRUE
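Comp.prealloc is timed above but its definition is not shown. A plausible reconstruction, assuming it is simply the original Comp with the vector() call changed as described:

```r
# Assumed definition: identical to Comp except that output is pre-allocated.
Comp.prealloc <- function(data) {
  # vector(length = n) creates a logical vector of length n; the first
  # assignment of 1 or 0 coerces it to numeric, but the length never changes.
  output <- vector(length = NROW(data) - 1)
  for (i in seq_along(output)) {
    output[i] <- if (data[i] == data[i + 1]) 1 else 0
  }
  output
}
```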

As @Joshua wrote, you should of course use vectorization; it is far more efficient.
...But just for reference, your Comp function can still be optimized a bit.
The result of a comparison is TRUE/FALSE, which are glorified versions of 1/0. Also, storing the result as integer instead of numeric uses half the memory.
Comp.opt <- function(data)
{
output <- integer(length(data)-1L)
for(i in seq_along(output))
{
output[[i]] <- (data[[i]]==data[[i+1L]])
}
return(output)
}
...and the speed difference:
> system.time(orig <- Comp(test))
user system elapsed
21.10 0.00 21.11
> system.time(prealloc <- Comp.prealloc(test))
user system elapsed
0.49 0.00 0.49
> system.time(opt <- Comp.opt(test))
user system elapsed
0.41 0.00 0.40
> all.equal(opt, orig) # opt is integer, orig is double
[1] TRUE

Have a look at this :
> x = c("T", "G", "T", "G", "G","T","T","T")
>
> res = sequence(rle(x)$lengths)-1
>
> dt = data.frame(x,res)
>
> dt
x res
1 T 0
2 G 0
3 T 0
4 G 0
5 G 1
6 T 0
7 T 1
8 T 2
Might work faster.
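Note that res counts the position within each run, which is not quite the same thing as the 0/1 output of Comp(). A small sketch of how the two relate, assuming the goal is still "does element i equal element i + 1" (the variable names are made up for illustration):

```r
x <- c("T", "G", "T", "G", "G", "T", "T", "T")

# Position within each run of identical letters, starting at 0.
res <- sequence(rle(x)$lengths) - 1

# An element repeats its predecessor exactly when its run position is > 0,
# so dropping the first entry gives the adjacent-equality indicator.
same_as_previous <- as.integer(res[-1] > 0)

# Direct vectorized comparison for cross-checking.
direct <- as.integer(head(x, -1) == tail(x, -1))
```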

Related

Selecting consecutive values from a bootstrap sample in r with repeated values

I'm not exactly sure how to go about this in R. I've got a data set with 40 values, some of which repeat and I want to perform a small bootstrap on this dataset to find the mean of two or more consecutive values. For example, I randomly select a value from the dataset provided below, say the very first value is selected which is 0.2, so x1=0.2. How can I make sure that in the same for loop R is able to select the next value, x2, to be 0.2 as that is the second value in the dataset? Thus it would appear as x1=0.2 and x2=0.2.
I can't really think of a way for this to be done as it would need to be repeated for each iteration and since the sample() function selects any random value that makes it harder to pinpoint exactly which value it selected given there are repeated values.
I've provided a sample code that calculates the mean for 1 observation and I would like to get it to work for 2 consecutive observations. So then I can calculate the means individually and display them.
If anyone has any way to handle this I would appreciate it.
Thanks ahead of time.
x=c(0.20,0.20,0.21,0.21,0.21,0.20,0.19,0.18,0.16,0.10,
0.02,-0.02,0.01,0.03,0.07,0.14,0.22,0.13,0.12,
0.16,0.17,0.18,0.18,0.17,0.15,0.15,0.13,0.12,
0.10,0.08,0.06,0.04,0.03,0.02,0.03,0.05,0.34,
0.13,0.11,0.12)
B<- 500
result1<- numeric(B)
# result2<- numeric(B)
for (b in 1:B){
x1<-sample(x=x,size =1, replace=TRUE)
# x2<-
result1[b]<-x1
# result2[b]<-x2
}
mean1<- mean(result1)
# mean2<- mean(result2)
A simple approach could be:
result <- matrix(nrow = B, ncol = 2)
for (b in 1:B){
idx1 <- sample(seq_along(x), size = 1)
idx2 <- idx1 %% length(x) + 1
result[b, 1] <- x[idx1]
result[b, 2] <- x[idx2]
}
storing the results in a matrix:
> result
[,1] [,2]
[1,] 0.21 0.21
[2,] 0.12 0.20
[3,] 0.21 0.21
[4,] 0.10 0.02
[5,] 0.10 0.02
[6,] 0.21 0.20
[7,] 0.02 -0.02
[8,] -0.02 0.01
[9,] 0.21 0.20
[10,] 0.17 0.15
Sample the indices of x, then use the sampled index to subset x for result1. Use the sampled index + 1 to subset x for result2. However, you also need a wrap-around so that if you sample the last member of x, you take the first member as the "next" value:
B <- 500
result1<- numeric(B)
result2 <- numeric(B)
for(i in 1:B) {
j <- sample(seq_along(x), 1)
if(j == 40) k <- 1
else k <- j + 1
result1[i] <- x[j]
result2[i] <- x[k]
}
mean(result1)
#> [1] 0.12618
mean(result2)
#> [1] 0.13034
Note also that since R is vectorized, you don't need a loop here at all. You could just do:
result1 <- sample(seq_along(x), 500, replace = TRUE)
result2 <- result1 + 1
result2[result2 == 41] <- 1
mean(x[result1])
#> [1] 0.12568
mean(x[result2])
#> [1] 0.12596
Created on 2022-03-28 by the reprex package (v2.0.1)
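Since the question asks about "two or more" consecutive values, the modular-index idea can be generalized. This is only a sketch: consecutive_sample and its arguments are hypothetical names, and it assumes that wrapping around from the last element to the first is acceptable.

```r
# Draw B random starting positions and return a B x k matrix whose rows hold
# k consecutive values of x, wrapping around at the end of the vector.
consecutive_sample <- function(x, B, k = 2) {
  start <- sample(seq_along(x), B, replace = TRUE)
  offsets <- seq_len(k) - 1L
  # Shift to 0-based positions, wrap with %%, then shift back to 1-based.
  idx <- (outer(start, offsets, "+") - 1L) %% length(x) + 1L
  matrix(x[idx], nrow = B, ncol = k)
}

x <- c(0.20, 0.20, 0.21, 0.21, 0.21)
res <- consecutive_sample(x, B = 500, k = 2)
col_means <- colMeans(res)  # one bootstrap mean per position
```

Because indices are sampled instead of values, repeated values in x cause no ambiguity about which element was picked.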
Could you work out all the possible consecutive means and then sample from that? How about:
library(RcppRoll)
x=c(0.20,0.20,0.21,0.21,0.21,0.20,0.19,0.18,0.16,0.10,
0.02,-0.02,0.01,0.03,0.07,0.14,0.22,0.13,0.12,
0.16,0.17,0.18,0.18,0.17,0.15,0.15,0.13,0.12,
0.10,0.08,0.06,0.04,0.03,0.02,0.03,0.05,0.34,
0.13,0.11,0.12)
rollmean <- roll_mean(x,2)
r <- sample(rollmean, 500, replace= T)
hist(r)
Which gives you a histogram of the sampled means (image not shown).
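If adding a package is not an option, the window-of-two means can be sketched in base R. This assumes roll_mean(x, 2) returns one mean per adjacent pair with no wrap-around, which is my understanding of its default behavior; the short x here is just for illustration.

```r
x <- c(0.20, 0.20, 0.21, 0.10)

# Mean of each adjacent pair: (x[1], x[2]), (x[2], x[3]), ...
pair_means <- (head(x, -1) + tail(x, -1)) / 2

# Bootstrap from the pair means, as in the answer above.
r <- sample(pair_means, 500, replace = TRUE)
```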

Append value to empty vector in R?

I'm trying to learn R and I can't figure out how to append to a list.
If this were Python I would . . .
#Python
vector = []
values = ['a','b','c','d','e','f','g']
for i in range(0,len(values)):
vector.append(values[i])
How do you do this in R?
#R Programming
> vector = c()
> values = c('a','b','c','d','e','f','g')
> for (i in 1:length(values))
+ #append value[i] to empty vector
Appending to an object in a for loop causes the entire object to be copied on every iteration, which causes a lot of people to say "R is slow", or "R loops should be avoided".
As BrodieG mentioned in the comments: it is much better to pre-allocate a vector of the desired length, then set the element values in the loop.
Here are several ways to append values to a vector. All of them are discouraged.
Appending to a vector in a loop
# one way
for (i in 1:length(values))
vector[i] <- values[i]
# another way
for (i in 1:length(values))
vector <- c(vector, values[i])
# yet another way?!?
for (v in values)
vector <- c(vector, v)
# ... more ways
help("append") would have answered your question and saved the time it took you to write this question (but would have caused you to develop bad habits). ;-)
Note that vector <- c() isn't an empty vector; it's NULL. If you want an empty character vector, use vector <- character().
Pre-allocate the vector before looping
If you absolutely must use a for loop, you should pre-allocate the entire vector before the loop. This will be much faster than appending for larger vectors.
set.seed(21)
values <- sample(letters, 1e4, TRUE)
vector <- character(0)
# slow
system.time( for (i in 1:length(values)) vector[i] <- values[i] )
# user system elapsed
# 0.340 0.000 0.343
vector <- character(length(values))
# fast(er)
system.time( for (i in 1:length(values)) vector[i] <- values[i] )
# user system elapsed
# 0.024 0.000 0.023
FWIW: analogous to python's append():
b <- 1
b <- c(b, 2)
You have a few options:
c(vector, values)
append(vector, values)
vector[(length(vector) + 1):(length(vector) + length(values))] <- values
The first one is the standard approach. The second one gives you the option to append someplace other than the end. The last one is a bit contorted but has the advantage of modifying vector in place (though really, you could just as easily do vector <- c(vector, values)).
Notice that in R you don't need to cycle through vectors. You can just operate on them in whole.
Also, this is fairly basic stuff, so you should go through some of the references.
Some more options based on OP feedback:
for(i in values) vector <- c(vector, i)
Just for the sake of completeness, appending values to a vector in a for loop is not really the philosophy in R. R works better by operating on vectors as a whole, as #BrodieG pointed out. See if your code can't be rewritten as:
output <- sapply(values, function(v) return(2*v))
output will be a vector of return values. You can also use lapply if values is a list instead of a vector.
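The values in the question are letters, so 2*v would fail on them. A small runnable sketch with numeric values (chosen here purely for illustration) shows the sapply form next to the fully vectorized form:

```r
values <- c(1, 2, 3)

# Element-by-element with sapply().
output_sapply <- sapply(values, function(v) 2 * v)

# Arithmetic is already vectorized, so no loop or sapply() is needed.
output_vectorized <- 2 * values
```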
Sometimes we have to use loops, for example, when we don't know how many iterations we need to get the result. Take while loops as an example. Below are methods you absolutely should avoid:
a=numeric(0)
b=1
system.time(
{
while(b<=1e5){
b=b+1
a<-c(a,pi)
}
}
)
# user system elapsed
# 13.2 0.0 13.2
a=numeric(0)
b=1
system.time(
{
while(b<=1e5){
b=b+1
a<-append(a,pi)
}
}
)
# user system elapsed
# 11.06 5.72 16.84
These are very inefficient because R copies the vector every time it appends.
The most efficient way to append is to assign by index. Note that this time I let it iterate 1e7 times, but it's still much faster than c.
a=numeric(0)
system.time(
{
while(length(a)<1e7){
a[length(a)+1]=pi
}
}
)
# user system elapsed
# 5.71 0.39 6.12
This is acceptable. And we can make it a bit faster by replacing [ with [[.
a=numeric(0)
system.time(
{
while(length(a)<1e7){
a[[length(a)+1]]=pi
}
}
)
# user system elapsed
# 5.29 0.38 5.69
Maybe you already noticed that length can be time consuming. If we replace length with a counter:
a=numeric(0)
b=1
system.time(
{
while(b<=1e7){
a[[b]]=pi
b=b+1
}
}
)
# user system elapsed
# 3.35 0.41 3.76
As other users mentioned, pre-allocating the vector is very helpful. But this is a trade-off between speed and memory usage if you don't know how many loops you need to get the result.
a=rep(NaN,2*1e7)
b=1
system.time(
{
while(b<=1e7){
a[[b]]=pi
b=b+1
}
a=a[!is.na(a)]
}
)
# user system elapsed
# 1.57 0.06 1.63
An intermediate method is to gradually add blocks of results.
a=numeric(0)
b=0
step_count=0
step=1e6
system.time(
{
repeat{
a_step=rep(NaN,step)
for(i in seq_len(step)){
b=b+1
a_step[[i]]=pi
if(b>=1e7){
a_step=a_step[1:i]
break
}
}
a[(step_count*step+1):b]=a_step
if(b>=1e7) break
step_count=step_count+1
}
}
)
#user system elapsed
#1.71 0.17 1.89
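A related strategy, not from the answer above but a common alternative, is to double the buffer whenever it fills up, so the number of reallocations grows only logarithmically with the result size. A minimal sketch; grow_collect and n_target are hypothetical names, and n_target stands in for a stopping condition that would not be known in advance:

```r
grow_collect <- function(n_target) {
  buf <- numeric(16)  # small initial capacity
  n <- 0L             # number of slots actually filled
  while (n < n_target) {
    # Double the capacity when the buffer is full (new slots are NA).
    if (n == length(buf)) length(buf) <- 2L * length(buf)
    n <- n + 1L
    buf[[n]] <- pi
  }
  # Trim by count, so legitimate NA results would not be dropped.
  buf[seq_len(n)]
}
```

Trimming with a counter avoids the pitfall of filtering with !is.na(), which would also remove genuine NA results.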
In R, you can try out this way:
X = NULL
X
# NULL
values = letters[1:10]
values
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
X = append(X,values)
X
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
X = append(X,letters[23:26])
X
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "w" "x" "y" "z"
> vec <- c(letters[1:3]) # vec <- c("a","b","c") ; or just empty vector: vec <- c()
> values<- c(1,2,3)
> for (i in 1:length(values)){
print(paste("length of vec", length(vec)));
vec[length(vec)+1] <- values[i] #Appends value at the end of vector
}
[1] "length of vec 3"
[1] "length of vec 4"
[1] "length of vec 5"
> vec
[1] "a" "b" "c" "1" "2" "3"
What you're using in the Python code is called a list in Python, and it's totally different from R vectors. If I understand what you want to do:
# you can do like this if you'll put them manually
v <- c("a", "b", "c")
# if your values are in a list
v <- unlist(your_list)  # as.vector() would return the list unchanged
# if you just need to append
v <- append(v, value, after=length(v))
In R you create the equivalent of an empty Python "list" like this:
v <- numeric() (is a numeric vector, or int in Python)
v <- character() (is a character vector, or str in Python)
Then, if you want to append a single value, you have to do this:
v[1] <- 10 (assigns the value 10 to position 1 of vector "v")
v[2] <- 11 (assigns the value 11 to position 2 of vector "v")
So, if you want to append multiple values in a for loop, try this:
v <- numeric()
for (value in 1:10) {
v[value] <- value
}
v
[1] 1 2 3 4 5 6 7 8 9 10

Why is R's which function not returning "correct" answer

I'm writing a variant of the Monty Hall problem, building up on another person's code. The difference is that instead of 3 doors, I have "n" doors. Let's say n = 4 for this question. The doors are labeled A, B, C and D.
The code is as follows:
n <- 4
doors <- LETTERS[seq( from = 1, to = n )]
xdata = c()
for(i in 1:10000) {
prize <- sample(doors)[1]
pick <- sample(doors)[1]
open1 <- doors[which(doors != pick & doors != prize)]
open <- sample(open1,n-2)
# the line with the problem
switchyes <- doors[which( doors != open & doors != pick)]
if(pick==prize) {
xdata <- c(xdata, "noswitchwin")
}
if(switchyes==prize) {
xdata=c(xdata, "switchwin")
}
}
When I run the code, I get the warning:
There were 50 or more warnings (use warnings() to see the first 50)
The problem seems to be due to the line:
switchyes <- doors[which( doors != open & doors != pick)]
This should only return 1 item (C) since the statement doors != open and doors != pick eliminates doors A and B and D. However, I'm getting more than one, B and C. Anybody see what's going on?
length(which(xdata == "switchwin"))
# [1] 4728
length(which(xdata == "noswitchwin"))
# [1] 2424
switchyes
# [1] "B" "C"
open
# [1] "B" "D"
open1
# [1] "B" "D"
pick
# [1] "A"
prize
# [1] "C"
The problem you have is the usage of != when LHS and RHS size differ:
p <- letters[1:4]
# [1] "a" "b" "c" "d"
q <- c("a", "e", "d", "d")
# [1] "a" "e" "d" "d"
p == q
# [1] TRUE FALSE FALSE TRUE
p != q
# [1] FALSE TRUE TRUE FALSE
What is happening? Since p and q are of equal size, each element of p is compared to the value at the corresponding index of q. Now, what if we change q to this:
q <- c("b", "d")
p == q
# [1] FALSE FALSE FALSE TRUE
What's happening here? Since the length of q (RHS) is not equal to p (LHS), q gets recycled to get to the length of p. That is,
# p q p q
a == b, b == d # first two comparisons
c == b, d == d # recycled comparisons
Instead you should use
!(doors %in% open) & !(doors %in% pick).
Also, note that !A AND !B = !(A OR B), so you could rewrite this as
!(doors %in% open | doors %in% pick)
In turn, this could be simplified to use only one %in% as:
!(doors %in% c(open, pick))
Further, you could create a function using Negate, say %nin% (corresponding to !(x %in% y)), and replace the ! and %in% in the above statement as follows:
`%nin%` <- Negate(`%in%`)
doors %nin% c(open, pick) # note the %nin% here
So basically your statement assigning to switchyes could read just:
# using %nin% after defining the function
switchyes <- doors[doors %nin% c(open, pick)]
You don't need to use which here as you are not looking for indices. You can directly use the logicals here to get the result. Hope this helps.
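To make the difference concrete, here is a small sketch using the values printed in the question (pick "A", open "B" and "D"). The variable names on the left are made up for illustration, and setdiff() is an extra alternative not mentioned above:

```r
doors <- LETTERS[1:4]
pick  <- "A"
open  <- c("B", "D")

# Recycling: open is compared position by position as B, D, B, D.
recycled <- doors[doors != open & doors != pick]

# Membership test: each door is checked against the whole set.
by_membership <- doors[!(doors %in% c(open, pick))]

# setdiff() expresses the same idea as a set operation.
by_setdiff <- setdiff(doors, c(open, pick))
```

Here recycled contains "B" and "C", matching the unexpected output in the question, while the other two contain only "C".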

Memoize and vectorize a custom function

I want to know how to vectorize and memoize a custom function in R. It seems
my way of thinking is not aligned with R's way of operation. So, I gladly
welcome any links to good reading material. For example, R inferno is a nice
resource, but it didn't help to figure out memoization in R.
More generally, can you provide a relevant usage example for the memoise
or R.cache packages?
I haven't been able to find any other discussions on this subject. Searching
for "memoise" or "memoize" on r-bloggers.com returns zero results. Searching
for those keywords at http://r-project.markmail.org/ does not return helpful
discussions. I emailed the mailing list and did not receive a complete
answer.
I am not solely interested in memoizing the GC function, and I am aware of
Bioconductor and the various packages
available there.
Here's my data:
seqs <- c("","G","C","CCC","T","","TTCCT","","C","CTC")
Some sequences are missing, so they're blank "".
I have a function for calculating GC content:
> GC <- function(s) {
if (!is.character(s)) return(NA)
n <- nchar(s)
if (n == 0) return(NA)
m <- gregexpr('[GCSgcs]', s)[[1]]
if (m[1] < 1) return(0)
return(100.0 * length(m) / n)
}
It works:
> GC('')
[1] NA
> GC('G')
[1] 100
> GC('GAG')
[1] 66.66667
> sapply(seqs, GC)
G C CCC T TTCCT
NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 NA
C CTC
100.00000 66.66667
I want to memoize it. Then, I want to vectorize it.
Apparently, I must have the wrong mindset for using the memoise or
R.cache R packages:
> system.time(dummy <- sapply(rep(seqs,100), GC))
user system elapsed
0.044 0.000 0.054
>
> library(memoise)
> GCm1 <- memoise(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm1))
user system elapsed
0.164 0.000 0.173
>
> library(R.cache)
> GCm2 <- addMemoization(GC)
> system.time(dummy <- sapply(rep(seqs,100), GCm2))
user system elapsed
10.601 0.252 10.926
Notice that the memoized functions are slower, not faster: about 3 times slower with memoise and about 200 times slower with R.cache.
I tried the hash package, but things seem to be happening behind the
scenes and I don't understand the output. The sequence C should have a
value of 100, not NULL.
Note that using has.key(s, cache) instead of exists(s, cache) results
in the same output. Also, using cache[s] <<- result instead of
cache[[s]] <<- result results in the same output.
> cache <- hash()
> GCc <- function(s) {
if (!is.character(s) || nchar(s) == 0) {
return(NA)
}
if(exists(s, cache)) {
return(cache[[s]])
}
result <- GC(s)
cache[[s]] <<- result
return(result)
}
> sapply(seqs,GCc)
[[1]]
[1] NA
$G
[1] 100
$C
NULL
$CCC
[1] 100
$T
NULL
[[6]]
[1] NA
$TTCCT
[1] 40
[[8]]
[1] NA
$C
NULL
$CTC
[1] 66.66667
At least I figured out how to vectorize:
> GCv <- Vectorize(GC)
> GCv(seqs)
G C CCC T TTCCT
NA 100.00000 100.00000 100.00000 0.00000 NA 40.00000 NA
C CTC
100.00000 66.66667
Relevant stackoverflow posts:
Options for caching / memoization / hashing in R
While this won't give you memoization across calls, you can use factors to make individual calls a lot faster if there is a fair bit of repetition. Eg using Joshua's GC2 (though I had to remove fixed=T to get it to work):
GC2 <- function(s) {
if(!is.character(s)) stop("'s' must be character")
n <- nchar(s)
m <- gregexpr('[GCSgcs]', s)
len <- sapply(m, length)
neg <- sapply(m, "[[", 1)
len <- len*(neg > 0)
100.0 * len/n
}
One can easily define a wrapper like:
GC3 <- function(s) {
x <- factor(s)
GC2(levels(x))[x]
}
system.time(GC2(rep(seqs, 50000)))
# user system elapsed
# 8.97 0.00 8.99
system.time(GC3(rep(seqs, 50000)))
# user system elapsed
# 0.06 0.00 0.06
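If memoization across calls is what you are after, one base-R option is to keep the cache in an environment captured by a closure. This is a sketch under stated assumptions, not how memoise or R.cache work internally: memo_env is a hypothetical helper, it assumes f takes a single non-NA string, and it bypasses the cache for empty strings because they cannot be used as names in an environment.

```r
# GC as defined in the question.
GC <- function(s) {
  if (!is.character(s)) return(NA)
  n <- nchar(s)
  if (n == 0) return(NA)
  m <- gregexpr("[GCSgcs]", s)[[1]]
  if (m[1] < 1) return(0)
  100 * length(m) / n
}

# Wrap f so that results are stored in a private environment keyed by s.
memo_env <- function(f) {
  cache <- new.env(hash = TRUE, parent = emptyenv())
  function(s) {
    # Non-character input and "" go straight to f without caching.
    if (!is.character(s) || !nzchar(s)) return(f(s))
    if (exists(s, envir = cache, inherits = FALSE)) {
      return(get(s, envir = cache))
    }
    result <- f(s)
    assign(s, result, envir = cache)
    result
  }
}

seqs <- c("", "G", "C", "CCC", "T", "", "TTCCT", "", "C", "CTC")
GCm <- memo_env(GC)
memo_result <- sapply(seqs, GCm)
```

For a function as cheap as GC, the lookup may well cost more than recomputing, so it is worth timing before adopting this.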
This doesn't explicitly answer your question, but this function is ~4 times faster than yours.
GC2 <- function(s) {
if(!is.character(s)) stop("'s' must be character")
n <- nchar(s)
m <- gregexpr('[GCSgcs]', s)
len <- sapply(m, length)
neg <- sapply(m, "[[", 1)
len <- len*(neg > 0)
100.0 * len/n
}
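Another way to vectorize, swapping gregexpr for a different technique, is to count matches by deleting them and comparing string lengths. A sketch under that assumption; GC_vec is a hypothetical name, and it returns NA for empty strings to match the original GC:

```r
GC_vec <- function(s) {
  stopifnot(is.character(s))
  n <- nchar(s)
  # Number of G/C/S characters = length before minus length after removal.
  gc_count <- n - nchar(gsub("[GCSgcs]", "", s))
  ifelse(n == 0, NA_real_, 100 * gc_count / n)
}

seqs <- c("", "G", "C", "CCC", "T", "", "TTCCT", "", "C", "CTC")
gc_percent <- GC_vec(seqs)
```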

Return system.time from evaluated function

R version 2.12, Windows XP
I am attempting to write a function (say 'g') that takes one argument, a function (say 'f'), and returns the matched function. Furthermore, enclosed within the body of 'g' is a statement that tells the resulting object to return the value of system.time when the object is called. An example will clarify.
What I want:
g <- function(f) {...}
z <- g(mean)
z(c(1, 4, 7))
with output
user system elapsed
0.04 0.00 0.04
What I have:
g <- function(f) {if (!exists("x")) {x <- match.fun(f)} else {system.time(x)}}
z <- g(mean)
z(c(1, 4, 7))
with output
[1] 4
Any help is greatly appreciated.
Maybe this will help:
g <- function(f){
function(x){
zz <- system.time(
xx <- match.fun(f)(x)
)
list(value=xx, system.time=zz)
}
}
In use:
g(mean)(c(1, 4, 7))
$value
[1] 4
$system.time
user system elapsed
0 0 0
You may want to think about how you return the values. I used a list, but another option is to print the system time as a side effect and return the calculated value.
Recently I made a similar function for myself:
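The side-effect variant could look roughly like this. It is a sketch only; the name timed is made up, and the wrapper forwards all arguments with ...:

```r
timed <- function(f) {
  f <- match.fun(f)
  function(...) {
    # system.time() evaluates the assignment in this function's frame,
    # so value is available after the call.
    timing <- system.time(value <- f(...))
    print(timing)  # side effect: show the timing
    value          # return the computed result
  }
}

z <- timed(mean)
result <- z(c(1, 4, 7))
```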
with_times <- function(f) {
f <- match.fun(f)
function(...) {
.times <- system.time(res <- f(...))
attr(res, "system.time") <- as.list(na.omit(.times))
res
}
}
For example:
g <- function(x,y) {r<-x+y; Sys.sleep(.5); r}
g(1, 1)
# [1] 2
g2 <- with_times(g)
w <- g2(1, 1)
Timings can be extracted in two ways:
attributes(w)$system.time
# $user.self
# [1] 0
# $sys.self
# [1] 0
# $elapsed
# [1] 0.5
or
attr(w, "system.time")
# $user.self
# [1] 0
# $sys.self
# [1] 0
# $elapsed
# [1] 0.5
