Understanding the behaviour of subsetting using 'which'

I was trying to define a function that generates all the prime numbers up to n.
I came up with the following solution and compared it with a readily available one (given below for reference). The two differ essentially in a single line (indicated below).
sieve <- function(n){
sq.n <- sqrt(n)
vec <- 2:n
primes <- rep(0, times=(sq.n))
i <- 1
while (!(is.na(primes[i] < sq.n)) && (primes[i]) < (sq.n)) {
primes[i] <- vec[1]
vec <- vec[which(vec%%primes[i] != 0)] # This keeps all the numbers not divisible by
# the prime in question
i <- i + 1
}
return(c(primes[which(primes!=0)], vec))
}
Curious about efficiency, I searched Google and found the following code:
getPrimeNumTilln <- function(n) {
a <- c(2:n)
l <- 2
r <- c()
while (l*l < n) {
r <- c(r,a[1])
a <- a[-(which(a %% l ==0))] # This removes all the numbers which are
# divisible by the prime in question
l <- a[1]
}
c(r,a)
}
Both solutions work. (The internet solution gives a wrong answer if n is the square of a prime, but that is easily corrected.)
These are the microbenchmark results:
microbenchmark(sieve(100),getPrimeNumTilln(100),times=100)
Unit: microseconds
expr min lq mean median uq max neval
sieve(100) 142.107 153.106 165.85155 162.785 165.425 466.795 100
getPrimeNumTilln(100) 41.797 47.076 51.09312 49.276 51.036 126.269 100
I would like to understand why the runtimes of the two functions differ so much.

For n = 100, the loop of the first function runs 10 iterations, while the loop of the second runs only 4. Adding a counter to each function shows this:
sieve <- function(n){
sq.n <- sqrt(n)
vec <- 2:n
primes <- rep(0, times=(sq.n))
i <- 1
while (!(is.na(primes[i] < sq.n)) && (primes[i]) < (sq.n)) {
count <<- count + 1
primes[i] <- vec[1]
vec <- vec[which(vec%%primes[i] != 0)] # This keeps all the numbers not divisible by
# the prime in question
i <- i + 1
}
return(c(primes[which(primes!=0)], vec))
}
count <- 0
sieve(100)
#[1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
count
#[1] 10
getPrimeNumTilln <- function(n) {
a <- c(2:n)
l <- 2
r <- c()
while (l*l < n) {
count <<- count + 1
r <- c(r,a[1])
a <- a[-(which(a %% l ==0))] # This removes all the numbers which are
# divisible by the prime in question
l <- a[1]
}
c(r,a)
}
count <- 0
getPrimeNumTilln(100)
# [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
count
#[1] 4
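The extra iterations come from the loop condition in sieve: it tests primes[i], which is the slot about to be filled and is therefore still 0, so the test stays true until i runs past the end of primes and the value becomes NA. For n = 100 that is floor(sqrt(100)) = 10 passes, each one filtering vec again. A minimal sketch of a variant that tests the next candidate instead (sieve2 is a hypothetical name, not from the question) might look like this:

```r
# Sketch only: sieve2 is an illustrative name, not part of the original post.
sieve2 <- function(n) {
  if (n < 2) return(integer(0))
  vec <- 2:n
  primes <- integer(0)
  # Keep sieving while the smallest remaining candidate is at most sqrt(n)
  while (length(vec) > 0 && vec[1]^2 <= n) {
    p <- vec[1]
    primes <- c(primes, p)
    vec <- vec[vec %% p != 0]   # logical subsetting, no which() needed
  }
  c(primes, vec)
}

sieve2(100)
```

With this condition the loop body runs 4 times for n = 100, and because the test uses <= it also handles the case where n is the square of a prime.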

Related

Find pattern and filter the start location

I would like to find the location of a pattern AND filter the location.
I'm looking for a function to return the start location of the pattern "gaaa" between 30 and 34 for each row.
I explain, for the moment here is what I have as a result with the function str_locate_all :
library(stringr)
Sequence <- data.frame(All = c("ggcgaagcagugcucccaguguuuuagagcuagaaauagcaaguuaaaauaaggcuaguccguuaucaacuugaaaaaguggcaccgagucggugcuu",
"aggacaacucgcuccacggccguuuuagagcuagaaauagcaaguuaaaauaaggcuaguccguuaucaacuugaaaaaguggcaccgagucggugcuu",
"cugaaauggcagcagaaacguuuuagagcuagaaauagcaaguuaaaauaaggcuaguccguuaucaacuugaaaaaguggcaccgagucggugcaacaaa",
"ggucaaagaggaggagcucguuuuagagcuagaaauagcaaguuaaaauaaggcuaguccguuaucaacuugaaaaaguggcaccgagucggugcuu"))
str_locate_all(pattern = 'gaaa', Sequence$All)
[[1]]
start end
[1,] 33 36
[2,] 73 76
[[2]]
start end
[1,] 34 37
[2,] 74 77
[[3]]
start end
[1,] 3 6
[2,] 15 18
[3,] 32 35
[4,] 72 75
[[4]]
start end
[1,] 32 35
[2,] 72 75
Here is what I would like to have as a result:
start
1 33
2 34
3 32
4 32
Thank you!

Here is one way. It takes the output of the str_locate_all call in the question and filters it in an lapply loop.
found <- str_locate_all(pattern = 'gaaa', Sequence$All)
found <- lapply(found, function(x){
y <- x[, 'start']
data.frame(start = y[y >= 30 & y <= 34])
})
do.call(rbind, found)
# start
#1 33
#2 34
#3 32
#4 32
Here is another way. It searches only a substring of the original string.
first <- 30
last <- 34
tmp <- substr(Sequence$All, first, last + nchar('gaaa') - 1)
data.frame(start = str_locate(pattern = 'gaaa', tmp)[, 1] + first - 1)
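The same substring idea can be sketched without stringr, using base R's regexpr. The input strings below are synthetic (built only so the match positions are easy to verify), and the explicit NA handling for rows without a match in the window is an assumption about the desired behaviour:

```r
# Synthetic inputs, not the sequences from the question
x <- c(paste0(strrep("u", 32), "gaaa", "uu"),    # "gaaa" starts at 33
       paste0("gaaa", strrep("u", 27), "gaaa"),  # "gaaa" starts at 1 and 32
       strrep("u", 40))                          # no "gaaa" at all

first <- 30
last <- 34
# Window wide enough to hold a match that starts at position `last`
tmp <- substr(x, first, last + nchar("gaaa") - 1)
pos <- as.integer(regexpr("gaaa", tmp, fixed = TRUE))  # -1 when not found
start <- ifelse(pos > 0L, pos + first - 1, NA)
start
# 33 32 NA
```

Like the str_locate version, this reports only the first match inside the window for each string.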
Timings
Here are the timings of the three answers available so far: r2evans's, mine and tmfmnk's.
I only post the results for the larger input, since that is where the timings matter.
library(stringr)
library(dplyr)
library(purrr)
r2evans <- function(){
Sequence$start <-
sapply(str_locate_all(pattern = 'gaaa', Sequence$All),
function(z) { ind <- which(30 <= z[,1] & z[,1] <= 34); if (length(ind)) z[ind[1],1] else NA })
Sequence[,2,drop=FALSE]
}
rui <- function(){
first <- 30
last <- 34
tmp <- substr(Sequence$All, first, last + nchar('gaaa') - 1)
data.frame(start = str_locate(pattern = 'gaaa', tmp)[, 1] + first - 1)
}
tmfmnk <- function(){
map_dfr(.x = str_locate_all(pattern = "gaaa", Sequence$All),
~ as.data.frame(.x) %>%
filter(start %in% c(30:34)),
.id = "ID")
}
library(microbenchmark)
for(i in 1:8) Sequence <- rbind(Sequence, Sequence)
dim(Sequence)
#[1] 1024 1
mb <- microbenchmark(
r2evans = r2evans(),
rui = rui(),
tmfmnk = tmfmnk()
)
print(mb, unit = 'relative', order = 'median')
#Unit: relative
# expr min lq mean median uq max neval
# rui 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 100
# r2evans 19.66135 17.52724 16.28008 15.47317 16.20747 5.60779 100
# tmfmnk 1529.51644 1235.86285 1079.56958 1073.49131 1072.39265 317.95638 100

Sequence$start <-
sapply(str_locate_all(pattern = 'gaaa', Sequence$All),
function(z) { ind <- which(30 <= z[,1] & z[,1] <= 34); if (length(ind)) z[ind[1],1] else NA })
Sequence[,2,drop=FALSE]
# start
# 1 33
# 2 34
# 3 32
# 4 32

One dplyr and purrr solution could be:
map_dfr(.x = str_locate_all(pattern = "gaaa", Sequence$All),
~ as.data.frame(.x) %>%
filter(start %in% c(30:34)),
.id = "ID")
ID start end
1 1 33 36
2 2 34 37
3 3 32 35
4 4 32 35

Add number to vector repeatedly and duplicate vector

I have two values, 3 and 12, and I make a vector from them:
num1 <- 3
num2 <- 12
a <- c(num1, num2)
I want to add a number (12) to the vector "a" repeatedly, appending each result, to build a new vector like this:
3,12, 15,24, 27,36, 39,48 ....
The number of repetitions "n" is 6.
I don't have any idea how to do this.

Here are two methods in base R.
With outer, you could do
c(outer(c(3, 12), (12 * 0:4), "+"))
[1] 3 12 15 24 27 36 39 48 51 60
or with sapply, you can explicitly loop through and calculate the pairs of sums.
c(sapply(0:4, function(i) c(3, 12) + (12 * i)))
[1] 3 12 15 24 27 36 39 48 51 60
outer returns a matrix where every pair of elements of the two vectors have been added together. c is used to return a vector. sapply loops through 0:4 and then calculates the element-wise sum. It also returns a matrix in this instance, so c is used to return a vector.

Here is a somewhat generic function that takes as input your original vector a, the number to add (12), and n:
f1 <- function(vec, x, n){
len1 <- length(vec)
v1 <- sapply(seq(n/len1), function(i) x*i)
v2 <- rep(v1, each = n/length(v1))
v3 <- rep(vec, n/len1)
return(c(vec, v3 + v2))
}
f1(a, 12, 6)
#[1] 3 12 15 24 27 36 39 48
f1(a, 11, 12)
#[1] 3 12 14 23 25 34 36 45 47 56 58 67 69 78
f1(a, 3, 2)
#[1] 3 12 6 15
EDIT
If by n=6 you mean 6 times the whole vector then,
f1 <- function(vec, x, n){
len1 <- length(vec)
v1 <- sapply(seq(n), function(i) x*i)
v2 <- rep(v1, each = len1)
v3 <- rep(vec, n)
return(c(vec, v3 + v2))
}
f1(a, 12, 6)
#[1] 3 12 15 24 27 36 39 48 51 60 63 72 75 84

Using rep for repeating and cumsum for the addition:
n = 6
rep(a, n) + cumsum(rep(c(12, 0), n))
# [1] 15 24 27 36 39 48 51 60 63 72 75 84
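Note that the rep/cumsum line above starts at 15 rather than at the original pair 3, 12. If the original values should lead the result, as in the question's example, one possible variant adds multiples of the step starting from zero. This is an illustrative sketch; the name step and the reading of n as the number of pairs are assumptions:

```r
a <- c(3, 12)
step <- 12   # assumed name for the number being added
n <- 4       # assumed meaning: number of pairs in the result

# Offsets 0, 12, 24, ... each repeated once per element of a
res <- rep(a, times = n) + rep(step * (0:(n - 1)), each = length(a))
res
# 3 12 15 24 27 36 39 48
```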

Multiple range of rows deletion in R

Let's say I have
v <- matrix(seq(150), 50, 3)
k <- c(10, 40)
delta <- 5
How can I delete the 10-delta to 10+delta rows and 40-delta to 40+delta rows simultaneously?
I used vnew <- v[-((k-delta):(k+delta)),] but it seems that the command only uses the first element of k (which is 10) and does not delete the 40-delta to 40+delta rows. Does anyone have any idea how to do this?
Oh and I will need to put this inside a loop where k is being updated in each iteration, so v[c(-{(10-delta):(10+delta)},-{(40-delta):(40+delta)}),] won't work.

If k is growing in each iteration and delta doesn't change I would suggest the following:
d <- -delta:delta
for (...) {
# ...
vnew <- v[-(rep(k, each=length(d)) + d),]
# ...
}
For your example:
d <- -5:5
k <- c(10, 40)
rep(k, each=length(d)) + d
# [1] 5 6 7 8 9 10 11 12 13 14 15 35 36 37 38 39 40 41 42 43 44 45
EDIT: a benchmark of both solutions:
library("rbenchmark")
idx1 <- function(k, delta) {
d <- -delta:delta
lapply(seq_along(k), function(i) {
rep(k[1:i], each=length(d)) + d
})
}
idx2 <- function(k, delta) {
lapply(seq_along(k), function(i) {
c(sapply(1:i, function(ii) {
(k[ii]-delta):(k[ii]+delta)
}))
})
}
set.seed(1)
k <- sample(1e3, 1e2)
delta <- 5
all.equal(idx1(k, delta), idx2(k, delta))
# [1] TRUE
benchmark(idx1(k, delta), idx2(k, delta), order="relative", replications=100)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 idx1(k, delta) 100 0.174 1.000 0.172 0 0 0
# 2 idx2(k, delta) 100 1.579 9.075 1.576 0 0 0

Richard Scriven's answer only returns the indexes 10-delta:10+delta and 40-delta:40+delta of the rows to be removed from v. To actually remove them, you must combine it with what you tried, like this:
v[-c(sapply(seq_along(k), function(i) (k[i]-delta):(k[i]+delta))), ]
or shorter but dirtier(?): v[-sapply(seq_along(k), function(i) (k[i]-delta):(k[i]+delta)), ]
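One caveat with negative indexing: if k - delta drops below zero, negating the indices produces positive values mixed with negative ones and the subsetting fails, and an empty index vector selects no rows rather than all of them. A hedged sketch that builds the indices with outer and guards both cases (drop_windows is a made-up helper name, not from any answer here):

```r
# Sketch only: drop_windows is a hypothetical helper.
drop_windows <- function(v, k, delta) {
  # All indices k[j] - delta ... k[j] + delta, for every element of k
  idx <- unique(as.vector(outer(-delta:delta, k, "+")))
  # Keep only indices that actually exist in v
  idx <- idx[idx >= 1 & idx <= nrow(v)]
  if (length(idx) == 0) return(v)   # v[-integer(0), ] would return zero rows
  v[-idx, , drop = FALSE]
}

v <- matrix(seq(150), 50, 3)
vnew <- drop_windows(v, k = c(10, 40), delta = 5)
dim(vnew)
# 28 3
```

Because the function takes k as an argument, it can be called inside a loop where k changes on each iteration.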

Condense a matrix in R

I have loaded a table of integer data with 2,200 columns. What I'd like to do is condense the data down by averaging the values in every 5 columns and placing that in a new column in a new table.
For example, if I had:
Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10
2 4 6 8 10 12 14 16 18 20
I would get:
Col1 | Col2
6 16
Which is just the average of the values in columns 1-5 from the original table in Col1 and the average of the values in columns 6-10 in Col2.
I haven't quite wrapped my head around R syntax, so any help would be appreciated.

Here's one approach that's applicable if the number of elements to be grouped is divisible by n (5, in your case):
x <- 1:100
n <- 5
tapply(x, rep(seq(1, length(x), n), each=n), mean)
# 1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96
# 3 8 13 18 23 28 33 38 43 48 53 58 63 68 73 78 83 88 93 98
The first row of output contains element names, and the second row contains means of successive groups of n elements.
To apply this to all rows of a matrix or data.frame, you can do, e.g.:
m <- matrix(1:1000, ncol=100)
apply(m, 1, function(x) tapply(x, rep(seq(1, length(x), n), each=n), mean))
EDIT
This alternative approach will give you some performance gains due to vectorisation with rowMeans:
t(mapply(function(x, y) rowMeans(m[, x:y]),
seq(1, ncol(m), n), seq(n, ncol(m), n)))

Oops, I see this is the comment of @user20650 on @jbaums's answer. The rowsum function splits rows of a matrix by a factor, and sums the columns of each split. So for
m <- matrix(1:1000, ncol=100)
n <- 5
we have
rowsum(t(m), rep(seq_len(ncol(m) / n), each=n)) / n
This is fast, if that's important
library(microbenchmark)
f0 = function(m, n) rowsum(t(m), rep(seq_len(ncol(m) / n), each=n)) / n
f1 = function(m, n)
apply(m, 1, function(x) tapply(x, rep(seq(1, length(x), n), each=n), mean))
f2 = function(m, n)
t(mapply(function(x, y) rowMeans(m[, x:y]),
seq(1, ncol(m), n), seq(n, ncol(m), n)))
all.equal(f0(m, n), f1(m, n), check.attributes=FALSE)
## [1] TRUE
all.equal(f0(m, n), f2(m, n), check.attributes=FALSE)
## [1] TRUE
microbenchmark(f0(m, n), f1(m, n), f2(m, n))
## Unit: microseconds
## expr min lq median uq max neval
## f0(m, n) 164.351 170.1675 176.730 187.8570 237.419 100
## f1(m, n) 8060.639 8513.3035 8696.742 8908.5190 9771.019 100
## f2(m, n) 540.894 588.3820 603.787 634.1615 732.209 100
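As a quick check, the rowsum approach can be applied to the ten-column example from the question. This is a small sketch; the final transpose is added here only so the result has one row per original row, and the names grp and res are illustrative:

```r
m <- matrix(seq(2, 20, by = 2), nrow = 1)   # the single row 2, 4, ..., 20
n <- 5

# Group label for each column: 1 1 1 1 1 2 2 2 2 2
grp <- rep(seq_len(ncol(m) / n), each = n)

# rowsum() sums rows by group, so transpose first, then divide to get means
res <- t(rowsum(t(m), grp) / n)
res
# one row holding the means 6 and 16
```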

Here's another approach using a loop and rowMeans instead, in case you prefer a loop. It works for matrices, but needs adjustment for vectors.
# example data
dat <- as.data.frame( matrix(1:20,ncol=10,byrow=TRUE) )
# pick range
range <- 5
ind <- seq(1,ncol(dat),range)
newdat <- NULL
for(i in ind){
newcol <- rowMeans(dat[,i:(i+range-1)])
newdat <- cbind(newdat, newcol)
}
This results in:
> newdat
newcol newcol
[1,] 3 8
[2,] 13 18

@jbaums's answer looks pretty good. Since I had already started this answer, I thought I would post my solution as well.
#Make some fake data
require(data.table)
data <- data.table(t(iris[,1:4]))
#Transpose since rows are easier to deal with than columns
data <- data.table(t(data))
data[ , row := .I]
#Sum by every 5 rows
data <- data[ , lapply(.SD,sum), by=cut(row,seq(0,nrow(data),5))]
#Transpose back to original results
result <- data.table(t(data))

If you wanted to get the means of the elements from col1-col5, col6-col10, etc.
m1 <- matrix(c(rep(1:100, 2), 1:20), ncol=22)
n <- 5
p1 <- prod(dim(m1))
n1 <- nrow(m1)*n
n2 <- p1-p1%%n1
c(rowMeans(matrix(m1[1:n2], nrow=p1%/%n1, byrow=TRUE)), mean(m1[(n2+1):p1]))
#[1] 25.5 75.5 25.5 75.5 10.5
Or
sapply(seq(1,ncol(m1), by=n), function(i) mean(m1[,i:(min(c(i+n-1), ncol(m1)))]) )
#[1] 25.5 75.5 25.5 75.5 10.5
With some labels
indx <- seq(1,n2/nrow(m1), by=n)
indx1 <- paste("Col",paste(indx, indx+4, sep="-"),sep="_")
indx2 <- paste("Col", paste(seq(p1%%n1+1, ncol(m1)),collapse="-"), sep="_")
c(rowMeans(matrix(m1[1:n2], nrow=p1%/%n1, byrow=TRUE, dimnames=list(indx1, NULL))), setNames(mean(m1[(n2+1):p1]), indx2))
# Col_1-5 Col_6-10 Col_11-15 Col_16-20 Col_21-22
# 25.5 75.5 25.5 75.5 10.5
Update
I realized that you wanted the rowMeans by splitting up columns 1:5, 6:10, 11:15 etc. If that is the case:
res1 <- cbind( colMeans(aperm(array(m1[1:n2], dim=c(nrow(m1), n, p1%/%n1)), c(2,1,3))),
rowMeans(m1[,(ncol(m1)-ncol(m1)%%n+1):ncol(m1)]))
which is equal to manually splitting the columns
res2 <- cbind(rowMeans(m1[,1:5]), rowMeans(m1[,6:10]), rowMeans(m1[,11:15]),
rowMeans(m1[,16:20]), rowMeans(m1[,21:22]))
identical(res1,res2)
#[1] TRUE
colnames(res1) <- c(indx1,indx2)
res1
# Col_1-5 Col_6-10 Col_11-15 Col_16-20 Col_21-22
#[1,] 21 71 21 71 6
#[2,] 22 72 22 72 7
#[3,] 23 73 23 73 8
#[4,] 24 74 24 74 9
#[5,] 25 75 25 75 10
#[6,] 26 76 26 76 11
#[7,] 27 77 27 77 12
#[8,] 28 78 28 78 13
#[9,] 29 79 29 79 14
#[10,] 30 80 30 80 15

Advice wanted on getting rid of loops

I have written a program that works with the 3n + 1 problem (aka "wondrous numbers" and various other things). But it has a double loop. How could I vectorize it?
The code is:
count <- vector("numeric", 100000)
L <- length(count)
for (i in 1:L)
{
x <- i
while (x > 1)
{
if (round(x/2) == x/2)
{
x <- x/2
count[i] <- count[i] + 1
} else
{
x <- 3*x + 1
count[i] <- count[i] + 1
}
}
}
Thanks!

I turned this 'inside-out' by creating a vector x where the ith element is the value after each iteration of the algorithm. The result is relatively intelligible as
f1 <- function(L) {
x <- seq_len(L)
count <- integer(L)
while (any(i <- x > 1)) {
count[i] <- count[i] + 1L
x <- ifelse(round(x/2) == x/2, x / 2, 3 * x + 1) * i
}
count
}
This can be optimized to (a) track only those values still in play (via idx) and (b) avoid unnecessary operations, e.g., ifelse evaluates both arguments for all values of x, x/2 evaluated twice.
f2 <- function(L) {
idx <- x <- seq_len(L)
count <- integer(L)
while (length(x)) {
ix <- x > 1
x <- x[ix]
idx <- idx[ix]
count[idx] <- count[idx] + 1L
i <- as.logical(x %% 2)
x[i] <- 3 * x[i] + 1
i <- !i
x[i] <- x[i] / 2
}
count
}
With f0 being the original code from the question wrapped in a function, I have
> L <- 10000
> system.time(ans0 <- f0(L))
user system elapsed
7.785 0.000 7.812
> system.time(ans1 <- f1(L))
user system elapsed
1.738 0.000 1.741
> identical(ans0, ans1)
[1] TRUE
> system.time(ans2 <- f2(L))
user system elapsed
0.301 0.000 0.301
> identical(ans1, ans2)
[1] TRUE
A tweak is to update odd values to 3 * x[i] + 1 and then do the division by two unconditionally
x[i] <- 3 * x[i] + 1
count[idx[i]] <- count[idx[i]] + 1L
x <- x / 2
count[idx] <- count[idx] + 1L
With this as f3 (not sure why f2 is slower this morning!) I get
> system.time(ans2 <- f2(L))
user system elapsed
0.36 0.00 0.36
> system.time(ans3 <- f3(L))
user system elapsed
0.201 0.003 0.206
> identical(ans2, ans3)
[1] TRUE
It seems like larger steps can be taken at the divide-by-two stage, e.g., 8 is 2^3 so we could take 3 steps (add 3 to count) and be finished, 20 is 2^2 * 5 so we could take two steps and enter the next iteration at 5. Implementations?
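The answer shows only the changed lines of f3. Assuming it otherwise keeps the structure of f2, the tweak could be assembled into a complete function roughly as follows; this is a reconstruction, not the author's exact code:

```r
# Reconstruction sketch: odd values take the 3x + 1 step, then every
# remaining value is halved, since 3x + 1 is always even for odd x.
f3 <- function(L) {
  idx <- x <- seq_len(L)
  count <- integer(L)
  repeat {
    ix <- x > 1               # drop values that have reached 1
    x <- x[ix]
    idx <- idx[ix]
    if (!length(x)) break
    i <- as.logical(x %% 2)   # TRUE for odd values
    x[i] <- 3 * x[i] + 1
    count[idx[i]] <- count[idx[i]] + 1L
    x <- x / 2                # all values are even at this point
    count[idx] <- count[idx] + 1L
  }
  count
}

f3(10)
# 0 1 7 2 5 8 16 3 19 6
```

Using 1L in both increments keeps count an integer vector, which is what identical() needs in order to agree with the result of f2.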

Because you need to iterate on values of x you can't really vectorize this. At some point, R has to work on each value of x separately and in turn. You might be able to run the computations on separate CPU cores to speed things up, perhaps using foreach in the package of the same name.
Otherwise, (and this is just hiding the loop from you), wrap the main body of your loop as a function, e.g.:
wonderous <- function(n) {
count <- 0
while(n > 1) {
if(isTRUE(all.equal(n %% 2, 0))) {
n <- n / 2
} else {
n <- (3*n) + 1
}
count <- count + 1
}
return(count)
}
and then you can use sapply() to run the function on a set of numbers:
> sapply(1:50, wonderous)
[1] 0 1 7 2 5 8 16 3 19 6 14 9 9 17 17
[16] 4 12 20 20 7 7 15 15 10 23 10 111 18 18 18
[31] 106 5 26 13 13 21 21 21 34 8 109 8 29 16 16
[46] 16 104 11 24 24
Or you can use Vectorize to return a vectorized version of wonderous which is itself a function that hides even more of this from you:
> wonderousV <- Vectorize(wonderous)
> wonderousV(1:50)
[1] 0 1 7 2 5 8 16 3 19 6 14 9 9 17 17
[16] 4 12 20 20 7 7 15 15 10 23 10 111 18 18 18
[31] 106 5 26 13 13 21 21 21 34 8 109 8 29 16 16
[46] 16 104 11 24 24
I think that is about as far as you can get with standard R tools at the moment. @Martin Morgan shows you can do a lot better than this with an ingenious take on the problem that does use R's vectorised abilities.

A different approach recognizes that one frequently revisits low numbers, so why not remember them and save the re-calculation cost?
memo_f <- function() {
e <- new.env(parent=emptyenv())
e[["1"]] <- 0L
f <- function(x) {
k <- as.character(x)
if (!exists(k, envir=e))
e[[k]] <- 1L + if (x %% 2) f(3L * x + 1L) else f(x / 2L)
e[[k]]
}
f
}
which gives
> L <- 100
> vals <- seq_len(L)
> system.time({ f <- memo_f(); memo1 <- sapply(vals, f) })
user system elapsed
0.018 0.000 0.019
> system.time(won <- sapply(vals, wonderous))
user system elapsed
0.921 0.005 0.930
> all.equal(memo1, won) ## integer vs. numeric
[1] TRUE
This might not parallelize well, but then maybe that's not necessary with the 50x speedup? Also the recursion might get too deep, but the recursion could be written as a loop (which is probably faster, anyway).
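The loop-based rewrite mentioned above could look something like the sketch below. The name memo_loop and the approach of recording the visited keys and filling the cache backwards are illustrative choices, not taken from the answer, and no timing claim is made for it:

```r
# Sketch only: an iterative version of the memoised step counter.
memo_loop <- function() {
  e <- new.env(parent = emptyenv())
  e[["1"]] <- 0L
  function(x) {
    path <- character(0)
    k <- as.character(x)
    # Walk forward until a cached value is reached, recording the keys visited
    while (!exists(k, envir = e, inherits = FALSE)) {
      path <- c(path, k)
      x <- if (x %% 2) 3 * x + 1 else x / 2
      k <- as.character(x)
    }
    steps <- e[[k]]
    # Fill the cache backwards along the recorded path
    for (key in rev(path)) {
      steps <- steps + 1L
      e[[key]] <- steps
    }
    steps
  }
}

f <- memo_loop()
sapply(1:10, f)
# 0 1 7 2 5 8 16 3 19 6
```

This avoids deep recursion because each call only grows a character vector of keys rather than the call stack.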
