Deleting multiple ranges of rows in R

Let's say I have
v <- matrix(seq(150), 50, 3)
k <- c(10, 40)
delta <- 5
How can I delete rows 10-delta to 10+delta and rows 40-delta to 40+delta simultaneously?
I used vnew <- v[-((k-delta):(k+delta)),], but it seems the command only uses the first element of k (which is 10) and does not delete rows 40-delta to 40+delta. Does anyone have an idea how to do this?
Oh, and I will need to put this inside a loop where k is updated in each iteration, so v[c(-{(10-delta):(10+delta)},-{(40-delta):(40+delta)}),] won't work.

If k is growing in each iteration and delta doesn't change I would suggest the following:
d <- -delta:delta
for (...) {
  # ...
  vnew <- v[-(rep(k, each=length(d)) + d),]
  # ...
}
For your example:
d <- -5:5
k <- c(10, 40)
rep(k, each=length(d)) + d
# [1] 5 6 7 8 9 10 11 12 13 14 15 35 36 37 38 39 40 41 42 43 44 45
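Applied to the v from the question, this removes those 22 rows:
vnew <- v[-(rep(k, each=length(d)) + d),]
dim(vnew)
# [1] 28  3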
EDIT: a benchmark of both solutions:
library("rbenchmark")
idx1 <- function(k, delta) {
  d <- -delta:delta
  lapply(seq_along(k), function(i) {
    rep(k[1:i], each=length(d)) + d
  })
}
idx2 <- function(k, delta) {
  lapply(seq_along(k), function(i) {
    c(sapply(1:i, function(ii) {
      (k[ii]-delta):(k[ii]+delta)
    }))
  })
}
set.seed(1)
k <- sample(1e3, 1e2)
delta <- 5
all.equal(idx1(k, delta), idx2(k, delta))
# [1] TRUE
benchmark(idx1(k, delta), idx2(k, delta), order="relative", replications=100)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 idx1(k, delta) 100 0.174 1.000 0.172 0 0 0
# 2 idx2(k, delta) 100 1.579 9.075 1.576 0 0 0

Richard Scriven's answer only returns the indexes 10-delta:10+delta and 40-delta:40+delta of the rows to be removed from v. To actually remove them, combine it with what you tried, like this:
v[-c(sapply(seq(k), function(i) (k[i]-delta):(k[i]+delta))), ]
or, shorter but dirtier(?): v[-sapply(seq(k), function(i) (k[i]-delta):(k[i]+delta)), ]

Related

In R, how do I make an iterative calculation without using a loop?

Here is a simple example of one type of iterative calc:
vals <- data.frame( "x"=c( 14, 15, 12, 10, 17 ), "ema"=0 )
vals$ema[1] <- vals$x[1]
K <- 0.90
for( jj in 2:nrow( vals ) )
vals$ema[jj] <- K * vals$ema[jj-1] + (1-K) * vals$x[jj]
vals
x ema
1 14 14.0000
2 15 14.1000
3 12 13.8900
4 10 13.5010
5 17 13.8509
The more involved examples use if...else to determine the next value:
for( jj in 2:nrow( vals ) ) {
  if( K * vals$ema[jj-1] + (1-K) * vals$x[jj] < 5.0 )
    vals$ema[jj] <- 5.0
  else if( K * vals$ema[jj-1] + (1-K) * vals$x[jj] > 15.0 )
    vals$ema[jj] <- 15.0
  else
    vals$ema[jj] <- K * vals$ema[jj-1] + (1-K) * vals$x[jj]
}
I am not sure if it would be more involved or not, but the decision can be based on the previous value as well:
K1 <- 0.999
K2 <- 0.95
K3 <- 0.90
for( jj in 2:nrow( vals ) ) {
  if( vals$ema[jj-1] < 0.0 )
    vals$ema[jj] <- K1 * vals$ema[jj-1] + (1-K1) * vals$x[jj]
  else if( vals$ema[jj-1] > 100.0 )
    vals$ema[jj] <- K3 * vals$ema[jj-1] + (1-K3) * vals$x[jj]
  else
    vals$ema[jj] <- K2 * vals$ema[jj-1] + (1-K2) * vals$x[jj]
}
This answer by WaltS to a similar question I had about recursive calculations provides two potential solutions. Adapting one of them to your question:
vals$ema.Reduce <- Reduce(function(myema, x) K * myema + (1-K) * x,
x = tail(vals$x, -1), init = 14, accumulate = TRUE)
vals
# x ema ema.Reduce
#1 14 14.0000 14.0000
#2 15 14.1000 14.1000
#3 12 13.8900 13.8900
#4 10 13.5010 13.5010
#5 17 13.8509 13.8509
Explanation of the function:
Reduce() calculates ema for the current row jj, with myema holding the previous value (row jj-1), starting from init. The x vector required by Reduce consists of vals$x for the rows you want to calculate: row 2 to the last row, i.e. x = tail(vals$x, -1). The accumulate = TRUE option returns the vector of intermediate results instead of only the final value. (Note that the x argument of Reduce is a generic name and is not the same as vals$x in the example data. For calculations that do not require the additional term vals$x, a vector of 0s would work, as in the linked answer.)
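As a minimal illustration of how accumulate = TRUE threads each result into the next call (a toy running sum, not from the original answer):
Reduce(function(prev, cur) prev + cur, x = 2:5, init = 1, accumulate = TRUE)
# [1]  1  3  6 10 15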
Adding if/else conditions to Reduce (note: init is changed in these examples to illustrate the conditional statements):
Reduce(function(myema, x) {
  if(myema < 5) {
    5
  } else if(myema > 15) {
    15
  } else {
    K * myema + (1-K) * x
  }
}, x = tail(vals$x, -1), init = 16, accumulate = TRUE)
#[1] 16.000 15.000 14.700 14.230 14.507
Reduce(function(myema, x) {
  if(myema < 0) {
    K1 * myema + (1-K1) * x
  } else if(myema > 100) {
    K3 * myema + (1-K3) * x
  } else {
    K2 * myema + (1-K2) * x
  }
}, x = tail(vals$x, -1), init = 110, accumulate = TRUE)
#[1] 110.00000 100.50000 91.65000 87.56750 84.03912
K3*110 + (1-K3)*vals$x[2] #100.5
K3*100.5 + (1-K3)*vals$x[3] #91.65
K2*91.65 + (1-K2)*vals$x[4] #87.5675
K2*87.5675 + (1-K2)*vals$x[5] #84.03912
Seems this succeeds:
vals$ema2 <- c(vals$ema[1], K*vals$ema[1:4] +(1-K)*vals$x[2:5] )
> vals
x ema ema2
1 14 14.0000 14.0000
2 15 14.1000 14.1000
3 12 13.8900 13.8900
4 10 13.5010 13.5010
5 17 13.8509 13.8509
Sometimes it is best to work with the time series and data munging libraries. In this case, lag.zoo from the zoo library handles lagged values for you.
library(dplyr)
library(zoo)
vals <- data.frame( "x"=c( 14, 15, 12, 10, 17 ) )
K <- 0.90
vals %>% mutate(ema = (1-K)*vals$x + K*(lag(vals$x,1)))
For this particular problem, the weight of each value is some function of k and i (as in the ith value). We can write a function for the weights and vectorize it:
weights <- function(i, k) {
  q <- 1-k
  qs <- q^(1:i)
  rev(qs) * c(1, rep(k, (i-1)))
}
v_weights <- Vectorize(weights)
An example:
> v_weights(1:3, .1)
[[1]]
[1] 0.9
[[2]]
[1] 0.81 0.09
[[3]]
[1] 0.729 0.081 0.090
where these are the weights of the "preceding" x values. We proceed with some matrix algebra. I write a function to make the weights (above) into a matrix:
weight_matrix <- function(j, k) {
  w <- v_weights(1:j, k=k)
  Ws <- matrix(0, j+1, j+1)
  Ws[row(Ws)+col(Ws)<(j+2)] <- unlist(rev(w))
  Ws <- t(Ws)
  Ws[row(Ws)+col(Ws)==(j+2)] <- k
  Ws[(j+1),1] <- 1
  Ws
}
Example:
> weight_matrix(3, .1)
[,1] [,2] [,3] [,4]
[1,] 0.729 0.081 0.09 0.1
[2,] 0.810 0.090 0.10 0.0
[3,] 0.900 0.100 0.00 0.0
[4,] 1.000 0.000 0.00 0.0
Then multiply this with the vector of xs:
ema <- function(x, k) rev(weight_matrix(length(x)-1, k) %*% x)
To get the dataframe above (I "flipped" the k so it's 0.1 instead of 0.9):
> x <- c(14, 15, 12, 10, 17)
> k <- .1
> vals <- data.frame("x"=x, "ema"=ema(x, k))
> vals
x ema
1 14 14.0000
2 15 14.1000
3 12 13.8900
4 10 13.5010
5 17 13.8509
#shayaa's answer is 99% correct. dplyr implements lag just fine, and apart from a typo in that answer (one value of x should be ema), extraneous calls to column names, and a missing default value (without which it puts NA in the first row), it works perfectly well.
library(dplyr)
vals %>% mutate(ema = K*lag(ema, 1, default=ema[1]) + (1-K)*x)
#> x ema
#> 1 14 14.0000
#> 2 15 14.1000
#> 3 12 13.8900
#> 4 10 13.5010
#> 5 17 13.8509

Understanding the behaviour of subsetting using 'which'

I was trying to define a function that generates all the prime numbers up to n.
I came up with the following solution, which I compared to a readily available solution (given below for reference). Essentially there is just one line of difference between the two codes (indicated below):
sieve <- function(n){
  sq.n <- sqrt(n)
  vec <- 2:n
  primes <- rep(0, times=(sq.n))
  i <- 1
  while (!(is.na(primes[i] < sq.n)) && (primes[i]) < (sq.n)) {
    primes[i] <- vec[1]
    vec <- vec[which(vec%%primes[i] != 0)]  # This keeps all the numbers not divisible
                                            # by the prime in question
    i <- i + 1
  }
  return(c(primes[which(primes!=0)], vec))
}
Curious about efficiency, I did a Google search, which yielded the following code:
getPrimeNumTilln <- function(n) {
  a <- c(2:n)
  l <- 2
  r <- c()
  while (l*l < n) {
    r <- c(r,a[1])
    a <- a[-(which(a %% l ==0))]  # This removes all the numbers which are
                                  # divisible by the prime in question
    l <- a[1]
  }
  c(r,a)
}
Both solutions work okay. (The internet solution gives a wrong answer if n is the square of a prime, but that can be corrected easily.)
And these are the microbenchmark results:
microbenchmark(sieve(100),getPrimeNumTilln(100),times=100)
Unit: microseconds
expr min lq mean median uq max neval
sieve(100) 142.107 153.106 165.85155 162.785 165.425 466.795 100
getPrimeNumTilln(100) 41.797 47.076 51.09312 49.276 51.036 126.269 100
I would like to understand the sizable difference in the runtimes of the two functions.
The while loop of the first function runs 10 times for n = 100; that of the second function runs only 4 times. Adding a counter to each function makes this visible:
sieve <- function(n){
  sq.n <- sqrt(n)
  vec <- 2:n
  primes <- rep(0, times=(sq.n))
  i <- 1
  while (!(is.na(primes[i] < sq.n)) && (primes[i]) < (sq.n)) {
    count <<- count + 1
    primes[i] <- vec[1]
    vec <- vec[which(vec%%primes[i] != 0)]  # This keeps all the numbers not divisible
                                            # by the prime in question
    i <- i + 1
  }
  return(c(primes[which(primes!=0)], vec))
}
count <- 0
sieve(100)
#[1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
count
#[1] 10
getPrimeNumTilln <- function(n) {
  a <- c(2:n)
  l <- 2
  r <- c()
  while (l*l < n) {
    count <<- count + 1
    r <- c(r,a[1])
    a <- a[-(which(a %% l ==0))]  # This removes all the numbers which are
                                  # divisible by the prime in question
    l <- a[1]
  }
  c(r,a)
}
count <- 0
getPrimeNumTilln(100)
# [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
count
#[1] 4

Finding position of the first TRUE of a series of `n` TRUEs

From a vector of TRUE/FALSE
set.seed(1)
x = rnorm(1503501) > 0
I am looking for a fast method to get the position of the first TRUE of the first series of n TRUEs.
The vectors (x) I am dealing with contain 1503501 elements (except for a few that are much shorter). Below is my current solution. It uses a for loop, but for loops are extremely slow in R. Are there nicer and especially faster solutions?
n = 20
count = 0
solution = -1
for (i in 1:length(x)){
  if (x[i]){
    count = count + 1
    if (count == n){solution = i+1-n; break}
  } else {count = 0}
}
print(solution)
1182796
I was thinking about using vectorized functions, doing something like y = which(x) (or eventually y = paste(which(x))) and searching for a particular pattern, but I am not sure how to do that.
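One way to finish that which() idea, as a sketch: among the TRUE positions w <- which(x), a run of n TRUEs shows up as n consecutive values, which diff can detect:
w <- which(x)
i <- which(diff(w, lag = n - 1) == n - 1)[1]  # first window of n consecutive TRUE positions
w[i]
# should print 1182796, matching the loop above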
You can use Rcpp:
library(Rcpp)
cppFunction('int fC(LogicalVector x, int n) {
  int xs = x.size();
  int count = 0;
  int solution = -1;
  for (int i = 0; i < xs; ++i) {
    if (x[i]) {
      if (++count == n) { solution = i+2-n; break; }
    } else {
      count = 0;
    }
  }
  return solution;
}')
Here is a small benchmarking study:
f1 <- function(x,n) {
  count = 0
  solution = -1
  for (i in 1:length(x)){
    if (x[i]){
      count = count + 1
      if (count == n){solution = i+1-n; break}
    } else {count = 0}
  }
  solution
}
set.seed(1)
x = rnorm(150350100) > 0
n = 20
print(f1(x,n)==fC(x,n))
# [1] TRUE
library(rbenchmark)
benchmark(f1(x,n),fC(x,n))
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 f1(x, n) 100 80.038 180.673 63.300 16.686 0 0
# 2 fC(x, n) 100 0.443 1.000 0.442 0.000 0 0
[Updated benchmark]
# Suggested by BondedDust
tpos <- function(x, pos) {
  rl <- rle(x); len <- rl$lengths
  sum(len[1:(which(len == pos & rl$values==TRUE)[1]-1)], 1)
}
set.seed(1)
x = rnorm(1503501) > 0
n = 20
print(f1(x,n)==fC(x,n))
# [1] TRUE
print(f1(x,n)==tpos(x,n))
# [1] TRUE
benchmark(f1(x,n),fC(x,n),tpos(x,n),replications = 10)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 f1(x, n) 10 4.756 110.605 4.735 0.020 0 0
# 2 fC(x, n) 10 0.043 1.000 0.043 0.000 0 0
# 3 tpos(x, n) 10 2.591 60.256 2.376 0.205 0 0
Take a look at this transcript (using only a much smaller random sample). I think it is fairly clear that it would be easy to write a function that picks out the first position satisfying the joint condition and uses cumsum on the lengths up to that point:
> x = rnorm(1500) > 0
> rle(x)
Run Length Encoding
lengths: int [1:751] 1 1 1 2 1 3 1 2 2 1 ...
values : logi [1:751] FALSE TRUE FALSE TRUE FALSE TRUE ...
> table( rle(x)$lengths )
1 2 3 4 5 6 7 8 9
368 193 94 46 33 10 2 4 1
> table( rle(x)$lengths , rle(x)$values)
FALSE TRUE
1 175 193
2 100 93
3 47 47
4 23 23
5 21 12
6 5 5
7 2 0
8 3 1
9 0 1
> which( rle(x)$lengths==8 & rle(x)$values==TRUE)
[1] 542
> which( rle(x)$lengths==7 & rle(x)$values==TRUE)
integer(0)
> which( rle(x)$lengths==6 & rle(x)$values==TRUE)
[1] 12 484 510 720 744
This is my candidate function:
tpos <- function(x, pos) {
  rl <- rle(x); len <- rl$lengths
  sum(len[1:(which(len == pos & rl$values==TRUE)[1]-1)], 1)
}
tpos(x,6)
#[1] 18
Note that I subtracted one from the first index so the length of the first qualifying run of TRUEs would not be added in, and then added one to that sum so that the position of the first such TRUE is calculated. I'm guessing the position of the first run of n TRUEs will be distributed as one of the extreme value distributions (although it's not always monotonically increasing):
> tpos(x,8)
[1] 1045
> tpos(x,8)
[1] 1045
> tpos(x,9)
[1] 1417
> tpos(x,10)
[1] 4806
> tpos(x,11)
[1] 2845
> tpos(x,12)
Error in 1:(which(len == pos & rl$values == TRUE)[1] - 1) :
NA/NaN argument
> set.seed(1)
> x = rnorm(30000) > 0
> tpos(x,12)
[1] 23509
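Following the note above about using cumsum on the lengths, an equivalent formulation (a sketch; the name tpos2 is mine, and it also handles the edge case where the very first run qualifies):
tpos2 <- function(x, pos) {
  rl <- rle(x)
  w <- which(rl$lengths == pos & rl$values)[1]  # index of the first qualifying run
  if (w == 1) 1 else cumsum(rl$lengths)[w - 1] + 1
}
tpos2(x, 12)
# [1] 23509, matching tpos above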
You can take your vector, prepend a FALSE (zero), drop the last element, and add this shifted copy to your original vector (treating the vectors as 0/1 integers). Then repeat: prepend one more FALSE (zero) to the previously shifted vector, drop the end, and add the result to the running sum, again as integer vectors, until you have added up n shifted copies of your vector in total. At that point sum_x[i] counts the TRUEs among positions i-n+1 to i, so you can call which(sum_x == n) (where sum_x is the sum vector), take the minimum index returned, and subtract n-1 to get the start of the first occurrence of n TRUEs in a row. This will work much faster if n is somewhat small compared to the length of your vector.
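A minimal sketch of that approach (the function name is mine):
first_true_run <- function(x, n) {
  sum_x <- as.integer(x)
  shifted <- as.integer(x)
  for (s in seq_len(n - 1)) {
    shifted <- c(0L, shifted[-length(shifted)])  # prepend a zero, drop the end
    sum_x <- sum_x + shifted
  }
  hits <- which(sum_x == n)  # positions whose last n values are all TRUE
  if (length(hits)) min(hits) - (n - 1L) else -1L
}
set.seed(1)
x <- rnorm(1503501) > 0
first_true_run(x, 20)
# should print 1182796, matching the loop in the question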

Determine position of ith element in vector

I have a vector: a<-rep(sample(1:5,20, replace=T))
I determine the frequency of occurrence of each value:
tabulate(a)
I would now like to determine the position of the most frequently occurring values.
Let's say the vector is:
[1] 3 3 3 5 2 2 4 1 4 2 5 1 2 1 3 1 3 2 5 1
tabulate returns:
[1] 5 5 5 2 3
Now I determine the highest value returned by tabulate with max(tabulate(a)); this returns
[1] 5
There are 3 values with frequency 5. I would like to know the positions of these values in the tabulate output, i.e., the first three entries of tabulate in this case.
Perhaps it is easier to work with table:
x <- table(a)
x
# a
# 1 2 3 4 5
# 5 5 5 2 3
names(x)[x == max(x)]
# [1] "1" "2" "3"
which(a %in% names(x)[x == max(x)])
# [1] 1 2 3 5 6 8 10 12 13 14 15 16 17 18 20
Alternatively, there's a similar approach with tabulate:
x <- tabulate(a)
sort(unique(a))[x == max(x)]
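If you want the positions within the tabulate output itself (which is what the question literally asks), which() gives them directly; for the vector from the question:
a <- c(3, 3, 3, 5, 2, 2, 4, 1, 4, 2, 5, 1, 2, 1, 3, 1, 3, 2, 5, 1)
x <- tabulate(a)
which(x == max(x))
# [1] 1 2 3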
Here are some benchmarks on numeric and character vectors. The difference in performance is more noticeable with numeric data.
Sample data
set.seed(1)
a <- sample(20, 1000000, replace = TRUE)
b <- sample(letters, 1000000, replace = TRUE)
Functions to benchmark
t1 <- function() {
  x <- table(a)
  out1 <- names(x)[x == max(x)]
  out1
}
t2 <- function() {
  x <- tabulate(a)
  out2 <- sort(unique(a))[x == max(x)]
  out2
}
t3 <- function() {
  x <- table(b)
  out3 <- names(x)[x == max(x)]
  out3
}
t4 <- function() {
  x <- tabulate(factor(b))
  out4 <- sort(unique(b))[x == max(x)]
  out4
}
The results
library(rbenchmark)
benchmark(t1(), t2(), t3(), t4(), replications = 50)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 t1() 50 30.548 24.244 30.416 0.064 0 0
# 2 t2() 50 1.260 1.000 1.240 0.016 0 0
# 3 t3() 50 8.919 7.079 8.740 0.160 0 0
# 4 t4() 50 5.680 4.508 5.564 0.100 0 0

Advice wanted on getting rid of loops

I have written a program that works with the 3n + 1 problem (aka "wondrous numbers" and various other things), but it has a double loop. How could I vectorize it?
The code is:
count <- vector("numeric", 100000)
L <- length(count)
for (i in 1:L)
{
  x <- i
  while (x > 1)
  {
    if (round(x/2) == x/2)
    {
      x <- x/2
      count[i] <- count[i] + 1
    } else
    {
      x <- 3*x + 1
      count[i] <- count[i] + 1
    }
  }
}
Thanks!
I turned this 'inside-out' by creating a vector x where the ith element is the value after each iteration of the algorithm. The result is relatively intelligible as
f1 <- function(L) {
  x <- seq_len(L)
  count <- integer(L)
  while (any(i <- x > 1)) {
    count[i] <- count[i] + 1L
    x <- ifelse(round(x/2) == x/2, x / 2, 3 * x + 1) * i
  }
  count
}
This can be optimized to (a) track only those values still in play (via idx) and (b) avoid unnecessary operations: e.g., ifelse evaluates both arguments for all values of x, and x/2 is evaluated twice.
f2 <- function(L) {
  idx <- x <- seq_len(L)
  count <- integer(L)
  while (length(x)) {
    ix <- x > 1
    x <- x[ix]
    idx <- idx[ix]
    count[idx] <- count[idx] + 1L
    i <- as.logical(x %% 2)
    x[i] <- 3 * x[i] + 1
    i <- !i
    x[i] <- x[i] / 2
  }
  count
}
with f0 the original function, I have
> L <- 10000
> system.time(ans0 <- f0(L))
user system elapsed
7.785 0.000 7.812
> system.time(ans1 <- f1(L))
user system elapsed
1.738 0.000 1.741
> identical(ans0, ans1)
[1] TRUE
> system.time(ans2 <- f2(L))
user system elapsed
0.301 0.000 0.301
> identical(ans1, ans2)
[1] TRUE
A tweak is to update odd values to 3 * x[i] + 1 and then do the division by two unconditionally:
x[i] <- 3 * x[i] + 1
count[idx[i]] <- count[idx[i]] + 1L
x <- x / 2
count[idx] <- count[idx] + 1L
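For reference, the full f3 is presumably something like this (reconstructed from the tweak above; the answer shows only the changed lines):
f3 <- function(L) {
  idx <- x <- seq_len(L)
  count <- integer(L)
  while (length(x)) {
    ix <- x > 1
    x <- x[ix]
    idx <- idx[ix]
    i <- as.logical(x %% 2)
    x[i] <- 3 * x[i] + 1            # odd values: one step
    count[idx[i]] <- count[idx[i]] + 1L
    x <- x / 2                      # everything is now even: halve unconditionally
    count[idx] <- count[idx] + 1L
  }
  count
}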
With this as f3 (not sure why f2 is slower this morning!) I get
> system.time(ans2 <- f2(L))
user system elapsed
0.36 0.00 0.36
> system.time(ans3 <- f3(L))
user system elapsed
0.201 0.003 0.206
> identical(ans2, ans3)
[1] TRUE
It seems larger steps could be taken at the divide-by-two stage: e.g., 8 is 2^3, so we could take 3 steps (add 3 to count) and be finished; 20 is 2^2 * 5, so we could take two steps and enter the next iteration at 5. Implementations?
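One possible implementation of that idea, as a rough sketch (f4 is my own variant, not from the original answers): strip all factors of two in a vectorized inner loop, so each outer iteration advances past an entire run of halvings.
f4 <- function(L) {
  idx <- x <- seq_len(L)
  count <- integer(L)
  while (length(x)) {
    ix <- x > 1
    x <- x[ix]
    idx <- idx[ix]
    i <- as.logical(x %% 2)
    x[i] <- 3 * x[i] + 1            # odd values step to 3x + 1 (now even)
    count[idx[i]] <- count[idx[i]] + 1L
    repeat {                        # divide out every factor of two at once
      j <- x %% 2 == 0
      if (!any(j)) break
      x[j] <- x[j] / 2
      count[idx[j]] <- count[idx[j]] + 1L
    }
  }
  count
}
identical(f3(L), f4(L)) should hold; whether it is actually faster depends on how long the runs of halvings are.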
Because you need to iterate on the values of x, you can't really vectorize this. At some point R has to work on each value of x separately and in turn. You might be able to run the computations on separate CPU cores to speed things up, perhaps using foreach from the package of the same name.
Otherwise (and this is just hiding the loop from you), wrap the main body of your loop in a function, e.g.:
wonderous <- function(n) {
  count <- 0
  while(n > 1) {
    if(isTRUE(all.equal(n %% 2, 0))) {
      n <- n / 2
    } else {
      n <- (3*n) + 1
    }
    count <- count + 1
  }
  return(count)
}
and then you can use sapply() to run the function on a set of numbers:
> sapply(1:50, wonderous)
[1] 0 1 7 2 5 8 16 3 19 6 14 9 9 17 17
[16] 4 12 20 20 7 7 15 15 10 23 10 111 18 18 18
[31] 106 5 26 13 13 21 21 21 34 8 109 8 29 16 16
[46] 16 104 11 24 24
Or you can use Vectorize to return a vectorized version of wonderous which is itself a function that hides even more of this from you:
> wonderousV <- Vectorize(wonderous)
> wonderousV(1:50)
[1] 0 1 7 2 5 8 16 3 19 6 14 9 9 17 17
[16] 4 12 20 20 7 7 15 15 10 23 10 111 18 18 18
[31] 106 5 26 13 13 21 21 21 34 8 109 8 29 16 16
[46] 16 104 11 24 24
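For the foreach route mentioned above, a rough sketch (it assumes the foreach and doParallel packages are installed; for a function this cheap, the parallel overhead may well dominate):
library(foreach)
library(doParallel)
registerDoParallel(cores = 2)  # two worker processes
counts <- foreach(n = 1:50, .combine = c) %dopar% wonderous(n)
stopImplicitCluster()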
I think that is about as far as you can get with standard R tools at the moment. @Martin Morgan shows you can do a lot better than this with an ingenious take on the problem that does use R's vectorised abilities.
A different approach recognizes that one frequently revisits low numbers, so why not remember them and save the re-calculation cost?
memo_f <- function() {
  e <- new.env(parent=emptyenv())
  e[["1"]] <- 0L
  f <- function(x) {
    k <- as.character(x)
    if (!exists(k, envir=e))
      e[[k]] <- 1L + if (x %% 2) f(3L * x + 1L) else f(x / 2L)
    e[[k]]
  }
  f
}
which gives
> L <- 100
> vals <- seq_len(L)
> system.time({ f <- memo_f(); memo1 <- sapply(vals, f) })
user system elapsed
0.018 0.000 0.019
> system.time(won <- sapply(vals, wonderous))
user system elapsed
0.921 0.005 0.930
> all.equal(memo1, won) ## integer vs. numeric
[1] TRUE
This might not parallelize well, but then maybe that's not necessary with the 50x speedup? Also the recursion might get too deep, but the recursion could be written as a loop (which is probably faster, anyway).
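A sketch of that last point: the same memoisation with the recursion unrolled into a loop (my own variant, not from the original answer):
memo_loop <- function() {
  e <- new.env(parent=emptyenv())
  e[["1"]] <- 0L
  function(x) {
    path <- character(0)                # keys whose counts are still unknown
    while (!exists(as.character(x), envir=e)) {
      path <- c(as.character(x), path)  # the deepest key ends up first
      x <- if (x %% 2) 3L * x + 1L else x / 2L
    }
    n <- e[[as.character(x)]]
    for (k in path) {                   # walk back up the path, filling the memo
      n <- n + 1L
      e[[k]] <- n
    }
    n
  }
}
f <- memo_loop(); sapply(vals, f) should then reproduce memo1 above.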
