bucketIndex <- function(v, N) {
  o <- rep(0, length(v))
  curSum <- 0
  index <- 1
  for (i in seq(length(v))) {
    o[i] <- index
    curSum <- curSum + v[i]
    if (curSum > N) {
      curSum <- 0
      index <- index + 1
    }
  }
  o
}
> bucketIndex(c(1, 1, 2, 1, 5, 1), 3)
[1] 1 1 1 2 2 3
I'm wondering if this function is fundamentally un-vectorizable. If it is, is there some package that deals with this "class" of functions, or is the only alternative (if I want speed) to write it as a C extension?
Here's a try (it does not yet arrive at bucketIndex!):
Your
curSum <- curSum + v[i]
if (curSum > N) {
  curSum <- 0
  index <- index + 1
}
is almost an integer division (%/%) of cumsum(v). But not quite: your index counts up by only 1 even if v[i] is larger than several multiples of N, and you start at 1. We can almost take care of that by converting to a factor and back to an integer.
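Here is a sketch of that idea (approxIndex is a made-up name; as said, it does not reproduce bucketIndex, because the original resets curSum to 0 instead of carrying the overshoot):
approxIndex <- function(v, N) {
  raw <- cumsum(v) %/% N    # integer division of the running sum
  as.integer(factor(raw))   # collapse gaps so the indices count 1, 2, 3, ...
}
> approxIndex(c(1, 1, 2, 1, 5, 1), 3)
[1] 1 1 2 2 3 3
Compare bucketIndex(c(1, 1, 2, 1, 5, 1), 3), which gives 1 1 1 2 2 3.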
However, I'm wondering (from the name of the function) whether this behaviour is really intended:
> bucketIndex(c(1, 1, 2, 1, 2, 1, 1, 2, 1, 5, 1), 3)
[1] 1 1 1 2 2 2 3 3 3 4 5
> bucketIndex(c(1, 1, 1, 2, 2, 1, 1, 2, 1, 5, 1), 3)
[1] 1 1 1 1 2 2 2 3 3 3 4
I.e., just exchanging two consecutive entries in v can lead to a different maximum in the result.
The other point is that you count up only after the element that causes the sum to exceed N, which means that the result should have an additional 1 at the beginning and the last element should be dropped.
You reset curSum to 0 regardless of how far it overshoots N. So for all elements with cumsum(v) > N, you'd need to subtract this value, then look for the next cumsum(v) > N, and so on. This reduces the number of loop iterations relative to your for loop, but whether it gives a substantial improvement depends on the entries of v and on N (or, equivalently, on the max(index) : length(v) ratio). If that ratio is 50% as in your example, I don't think you can get a substantial gain. Unless there is at least an order of magnitude between them, I'd go for inline::cfunction.
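For reference, a minimal sketch of that compiled route, using Rcpp::cppFunction rather than inline::cfunction (either interface works here); the name bucketIndexC is made up for illustration:
library(Rcpp)
cppFunction('
IntegerVector bucketIndexC(NumericVector v, double N) {
  int n = v.size();
  IntegerVector o(n);
  double curSum = 0.0;
  int index = 1;
  for (int i = 0; i < n; i++) {
    o[i] = index;       // record the current bucket
    curSum += v[i];     // accumulate the running sum
    if (curSum > N) {   // bucket full: reset and advance
      curSum = 0.0;
      index++;
    }
  }
  return o;
}')
bucketIndexC(c(1, 1, 2, 1, 5, 1), 3)  # should reproduce bucketIndex()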
I'm going to go out on a limb here and say the answer is "no." Essentially, you're changing what it is you sum over based on the results of the current sum. This means future calculations depend on the result of an intermediate calculation, which vectorized operations can't do.
I don't think that this is completely vectorizable, but @cbeleites gets at one way to reduce the number of iterations in the loop: deal with a whole chunk (bucket) at a time. Each iteration looks for where the cumulative sum exceeds N, assigns the index to that range, reduces the cumulative sum by the value at which it exceeded N, and repeats until the vector is exhausted. The rest is bookkeeping (initialization and incrementing of values).
bucketIndex2 <- function(v, N) {
  index <- 1
  cs <- cumsum(v)
  bk.old <- 0
  o <- rep(0, length(v))
  repeat {
    bk <- suppressWarnings(min(which(cs > N)))  # Inf (with a warning) if nothing exceeds N
    o[(bk.old + 1):min(bk, length(v))] <- index
    if (bk >= length(v)) break
    cs <- cs - cs[bk]
    index <- index + 1
    bk.old <- bk
  }
  o
}
This matches your function for a variety of random inputs:
for (i in 1:200) {
  v <- sample(sample(20, 1), sample(50, 1) + 20, replace = TRUE)
  N <- sample(10, 1)
  bi <- bucketIndex(v, N)
  bi2 <- bucketIndex2(v, N)
  if (any(bi != bi2)) {
    print("MISMATCH:")
    dump("v", "")
    dump("N", "")
  }
}
I just saw a YouTube video from Numberphile on the Yellowstone sequence (A098550). It's based on a sequence starting with 1 and 2, with subsequent terms generated by the rules:
no repeated terms
always pick the lowest integer
gcd(a_n, a_(n-1)) = 1
gcd(a_n, a_(n-2)) > 1
The first 15 terms would be: 1 2 3 4 9 8 15 14 5 6 25 12 35 16 7
A Q&D approach in R could be something like this, but understandably it becomes very slow when attempting longer sequences. It also makes an assumption about the highest number possible within the sequence (as info: the first 10,000 terms never go higher than 5000).
What can we do to make this faster?
library(DescTools)
a <- c(1, 2, 3)
p <- length(a)
# all natural numbers (assumes the sequence never exceeds 5000)
all_ints <- 1:5000
for (n in p:1000) {
  # rule 1 - remove all numbers that are in the sequence already
  next_a_set <- all_ints[which(!all_ints %in% a)]
  # rule 3 - search the remaining set for numbers with gcd == 1 to the previous term
  next_a_option <- next_a_set[which(
    sapply(
      next_a_set,
      function(x) GCD(a[n], x)
    ) == 1
  )]
  # rule 4 - search the remaining numbers for gcd > 1 to the term before that
  next_a <- next_a_option[which(
    sapply(
      next_a_option,
      function(x) GCD(a[n - 1], x)
    ) > 1
  )]
  # rule 2 - select the lowest
  a <- c(a, min(next_a))
}
Here's a version that's about 20 times faster than yours, with comments about the changes:
# Set a to the final length from the start.
a <- c(1, 2, 3, rep(NA, 997))
p <- 3
# Define a vectorized gcd() function. We'll be testing
# lots of gcds at once. This uses the Euclidean algorithm.
gcd <- function(x, y) { # vectorized gcd
  while (any(y != 0)) {
    x1 <- ifelse(y == 0, x, y)
    y <- ifelse(y == 0, 0, x %% y)
    x <- x1
  }
  x
}
# Guess at a reasonably large vector to work from,
# but we'll grow it later if not big enough.
allnum <- 1:1000
# Keep a logical record of what has been used
used <- c(rep(TRUE, 3), rep(FALSE, length(allnum) - 3))
for (n in p:(length(a) - 1)) { # stop one short, so a[n + 1] stays within the preallocated vector
  # rule 1 - remove all numbers that are in the sequence already:
  # nothing to do -- used already records that.
  repeat {
    # rule 3 - search the remaining set for numbers that have gcd == 1
    keep <- !used & gcd(a[n], allnum) == 1
    # rule 4 - search the remaining numbers for gcd > 1
    keep <- keep & gcd(a[n - 1], allnum) > 1
    # If we found anything, break out of this loop
    if (any(keep))
      break
    # Otherwise, make the set of possible values twice as big,
    # and try again
    allnum <- seq_len(2 * length(allnum))
    used <- c(used, rep(FALSE, length(used)))
  }
  # select the lowest (which.max returns the index of the first TRUE)
  newval <- which.max(keep)
  # Assign into the appropriate place
  a[n + 1] <- newval
  # Record that it has been used
  used[newval] <- TRUE
}
If you profile it, you'll see it spends most of its time in the gcd() function. You could probably make that a lot faster by redoing it in C or C++.
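As a sketch of that route (not part of the answer's code; gcd_cpp is a made-up name, compiled here with Rcpp::cppFunction):
library(Rcpp)
cppFunction('
IntegerVector gcd_cpp(int a, IntegerVector y) {
  int n = y.size();
  IntegerVector out(n);
  for (int i = 0; i < n; i++) {
    int x = a, z = y[i];
    while (z != 0) {    // Euclidean algorithm, one element at a time
      int t = x % z;
      x = z;
      z = t;
    }
    out[i] = x;
  }
  return out;
}')
gcd_cpp(12L, c(8L, 9L, 30L))  # 4 3 6
In the loop above, gcd(a[n], allnum) would then become gcd_cpp(a[n], allnum).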
The biggest change here is pre-allocation and restricting the search to numbers that have not yet been used.
library(numbers)
N <- 5e3
a <- integer(N)
a[1:3] <- 1:3
b <- logical(N) # which numbers have been used already?
b[1:3] <- TRUE
NN <- 1:N
system.time({
  for (n in 4:N) {
    a1 <- a[n - 1L]
    a2 <- a[n - 2L]
    for (k in NN[!b]) { # try unused candidates in increasing order
      if (GCD(k, a1) == 1L && GCD(k, a2) > 1L) {
        a[n] <- k
        b[k] <- TRUE
        break
      }
    }
    if (!a[n]) { # no candidate up to N worked: truncate and stop
      a <- a[1:(n - 1L)]
      break
    }
  }
})
#> user system elapsed
#> 1.28 0.00 1.28
length(a)
#> [1] 1137
For a fast C++ algorithm, see here.
I have the following vector in R: c(0,1).
I want to randomly sample 10 elements at a time from this vector, such that no value repeats more than twice in a row.
The code I have tried is sample(c(0,1),10,replace=T)
But I would like to get results like
sample(c(0,1),10,replace=T) = (0,1,1,0,1,1,0,0,1,0)
sample(c(0,1),10,replace=T) = (0,1,0,1,0,0,1,0,1,0)
but not
sample(c(0,1),10,replace=T) = (1,0,0,0,1,1,0,0,0)
(the last draw contains a run of three 0s). And so on.
How could I accomplish this?
Since the number of repeats can only be 1 or 2, and since the values need to alternate, you can achieve this in a one-liner: randomly choose 1 or 2 repeats for each element of an alternating sequence of 0s and 1s, and truncate the result to 10 elements.
rep(rep(0:1, 5), times = sample(c(1:2), 10, TRUE))[1:10]
#> [1] 0 0 1 1 0 1 1 0 1 0
If you wish to remove the constraint that the sequence always starts with a zero, you can randomly flip the whole result by subtracting it from a random 0 or 1 and taking the absolute value:
abs(sample(0:1, 1) - rep(rep(0:1, 5), times = sample(c(1:2), 10, TRUE))[1:10])
#> [1] 1 1 0 0 1 0 0 1 1 0
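As a quick sanity check (a sketch, not part of the answer), one can confirm over many draws that no run ever exceeds 2:
draws <- replicate(1000, rep(rep(0:1, 5), times = sample(1:2, 10, TRUE))[1:10])
all(apply(draws, 2, function(x) max(rle(x)$lengths)) <= 2)  # expected: TRUE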
foo <- function() {
  innerfunc <- function() sample(c(0, 1), 10, TRUE)
  x <- innerfunc()
  while (max(rle(x)$lengths) > 2) {
    x <- innerfunc()
  }
  x
}
foo()
This function looks at the maximum length of any run of zeroes and ones. If this is > 2, it reruns your sample function, named innerfunc here.
I think this is an interesting coding exercise if you would like to use recursion; below is an option that might give some hints:
f <- function(n) {
  if (n <= 2) {
    return(sample(c(0, 1), n, replace = TRUE))
  }
  m <- sample(c(1, 2), 1)                # length of the final run
  v <- Recall(n - m)                     # recursively build the first n - m values
  c(v, rep((tail(v, 1) + 1) %% 2, m))    # append m copies of the flipped last value
}
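A quick way to try it (the checks hold by construction, since each recursive step appends a run of at most 2 that differs from the preceding value):
set.seed(42)               # arbitrary seed, just for reproducibility
x <- f(10)
length(x)                  # always 10
max(rle(x)$lengths) <= 2   # TRUE: no run is longer than 2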
I'm trying to set upper and lower boundaries for a vector by simply adding a set value to, and subtracting it from, each element. I then want to create a loop that tells me, for each value (i) in the vector, how many other points in the vector fall within that boundary.
Essentially I am creating a pseudo-density calculation based on how many values fall within the established range.
I have my vector "v" that contains random values. I add/subtract three to get the upper and lower bounds, but I can't create a loop that counts how many other values from the vector fall within them.
v <- c(1, 3, 4, 5, 8, 9, 10, 54)
for (i in v) {
  vec2 <- (vec + 3 > vec[i] & vec - 3 < vec[i])
}
vec2
I get NA's from this code.
I've also tried indexing the vec +/- 3 and it also didn't work.
vec2 <- (vec[i] + 3 > vec[i] & vec - 3 < vec[i])
What I want is, for every value i in the vector, to know how many points fall within that value - 3 and + 3.
I.e., the first value is 1, so the upper limit would be 4 and the lower would be -2. I want it to count how many values in the vector fall within this range, which would be 3 for the first index (if it includes itself).
vec2 = (3, 4, 3, ...)
Are you looking for something like this? Your code doesn't work because your syntax is incorrect.
vec <- c(1, 3, 4, 5, 8, 9, 10, 54) # Input vector
countvalswithin <- vector() # Empty vector that will store counts of values within bounds
# For loop to cycle through values stored in input vector
for (i in 1:length(vec)) {
  currval <- vec[i] # Take current value
  lbound <- (currval - 3) # Calculate lower bound w.r.t. this value
  ubound <- (currval + 3) # Calculate upper bound w.r.t. this value
  # Create vector containing all values from source vector except current value.
  # This will be used for comparison against current value to find values within bounds.
  othervals <- subset(vec, vec != currval)
  currcount <- 1 # Counter of values within bounds; starts at 1 to include self (set to 0 to exclude self)
  # For loop to cycle through all other values (excluding current value) to find values within bounds of current value
  for (j in 1:length(othervals)) {
    # If compared value is within bounds of current value, the counter updates by 1
    # (note the bounds: lower is exclusive, upper is inclusive)
    if (othervals[j] > lbound & othervals[j] <= ubound) {
      currcount <- currcount + 1
    }
  }
  countvalswithin[i] <- currcount # Append count for current value to a vector
}
df <- data.frame(vec, countvalswithin) # Input vector and respective counts as a dataframe
df
# vec countvalswithin
# 1 1 3
# 2 3 4
# 3 4 3
# 4 5 4
# 5 8 3
# 6 9 3
# 7 10 3
# 8 54 1
Edit: added comments to the code explaining what it does.
In your for loop we can loop over every element of v, create the range (v[i] - 3, v[i] + 3), check how many elements of v fall within that range, and store the result in a new vector vec2.
vec2 <- numeric(length = length(v))
for (i in seq_along(v)) {
  vec2[i] <- sum((v >= v[i] - 3) & (v <= v[i] + 3))
}
vec2
#[1] 3 4 4 4 4 3 3 1
However, you can avoid the for loop by using mapply
mapply(function(x, y) sum(v >= y & v <= x), v + 3, v - 3)
#[1] 3 4 4 4 4 3 3 1
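Another vectorized option (a sketch; fine for short vectors, but memory-hungry for long ones, since it builds the full matrix of pairwise differences) is outer:
rowSums(abs(outer(v, v, "-")) <= 3)
#[1] 3 4 4 4 4 3 3 1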
We have a big for loop in R for simulating various data. In some iterations the data are generated in such a way that a quantity becomes 0 inside the loop, which is not desirable, and we should skip that step of data generation. But at the same time we also need to increase the number of iterations by one step for each such skip; otherwise we will end up with fewer observations than required.
For example, while running the following code, we get z = 0 in iterations 1, 8 and 9.
rm(list = ls())
n <- 10
z <- NULL
for (i in 1:n) {
  set.seed(i)
  a <- rbinom(1, 1, 0.5)
  b <- rbinom(1, 1, 0.5)
  z[i] <- a + b
}
z
[1] 0 1 1 1 1 2 1 0 0 1
We want to skip these steps so that we do not have any z = 0, but we also want a vector z of length 10. It may be done in many ways, but what I particularly want to see is how we can skip the current step when z = 0 is encountered and go on to the next one, ultimately obtaining 10 observations for z.
Normally we do this via a while loop, as the number of iterations required is unknown beforehand.
n <- 10L
z <- integer(n)
m <- 1L; i <- 0L
while (m <= n) {
  set.seed(i)
  z_i <- sum(rbinom(2L, 1, 0.5))
  if (z_i > 0L) { z[m] <- z_i; m <- m + 1L }
  i <- i + 1L
}
Output:
z
# [1] 1 1 1 1 1 2 1 1 1 1
i
# [1] 14
So we sample 14 times; 4 of the draws are 0 and the remaining 10 are retained.
More efficient vectorized method
set.seed(0)
n <- 10L
z <- rbinom(n, 1, 0.5) + rbinom(n, 1, 0.5)
m <- length(z <- z[z > 0L]) ## filtered samples
p <- m / n ## estimated success probability
k <- round(1.5 * (n - m) / p) ## number of further draws needed to expect (n - m) more non-zero samples
z_more <- rbinom(k, 1, 0.5) + rbinom(k, 1, 0.5)
z <- c(z, z_more[which(z_more > 0)[seq_len(n - m)]])
Some probability theory of the geometric distribution is used here. Initially we draw n samples, m of which are retained, so the estimated probability of accepting a sample is p <- m / n. By the theory of the geometric distribution we need, on average, 1/p draws to observe one success, so we should sample at least (n - m) / p more times to expect (n - m) successes. The 1.5 is just an inflation factor: by drawing 1.5 times as many samples we can hopefully ensure (n - m) successes.
According to the law of large numbers, the estimate of p is more precise when n is large. Therefore, this approach is stable for large n.
If you feel that 1.5 is not large enough, use 2 or 3. But my feeling is that it is sufficient.
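If you want a guarantee rather than a good chance, one option (a sketch; the name draw_nonzero is made up for illustration) is to wrap the oversampling step in a loop that keeps topping up until n non-zero values have been collected. For this particular example the acceptance probability is known exactly, P(z > 0) = 1 - 0.25 = 0.75, so we can use it directly instead of estimating it:
draw_nonzero <- function(n, inflate = 1.5) {
  z <- integer(0)
  while (length(z) < n) {
    # we expect to need (n - length(z)) / 0.75 draws; inflate to be safe
    k <- ceiling(inflate * (n - length(z)) / 0.75)
    z_more <- rbinom(k, 1, 0.5) + rbinom(k, 1, 0.5)
    z <- c(z, z_more[z_more > 0L])
  }
  z[seq_len(n)]  # exactly n non-zero observations
}
set.seed(0)
draw_nonzero(10L)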
Good morning,
I have the following problem.
My data.frame "data" has the format:
Type  amount
   1       2
   2       0
   3       3
I would like to create a vector with the format:
1
1
3
3
3
This means I would like to transform my data.
I created a vector and wrote the following code for my transformation in R:
vector <- numeric(5)
for (i in 1:3) {
  k <- 1
  while (k <= data[i, 2]) {
    vector[k] <- data[i, 1]
    k <- k + 1
  }
}
The problem is, I get the following result, and I have no idea where I go wrong:
3
3
3
0
0
There might be many different ways of solving this particular problem in R, but I am curious why my solution doesn't work. I am thankful for alternatives, but I would really like to know what my mistake is.
Thanks for your help!
Try this solution:
df <- data.frame(type = c(1, 2, 3), amount = c(2, 0, 3))
result <- unlist(mapply(function(x, y) rep.int(x, y), df[, "type"], df[, "amount"]))
result
The output is the following:
# [1] 1 1 3 3 3
Exactly, your code is buggy. Corrected code should look like the following:
df <- data.frame(type = c(1, 2, 3), amount = c(2, 0, 3))
vector <- numeric(5)
k <- 1
for (i in 1:3) {
  j <- 1
  while (j <= df[i, 2]) {
    vector[k] <- df[i, 1]
    k <- k + 1
    j <- j + 1
  }
}
vector
# [1] 1 1 3 3 3
Probably the fastest and most elegant way to obtain this result has been posted before in a comment by @akrun:
with(data, rep(Type, amount))
[1] 1 1 3 3 3
However, if you want to do this with for/while loops, it could be helpful to use a list for such cases, where the number of entries is not known at the beginning.
Here is an example with minimal modifications of your code:
my_list <- vector("list", 3)
for (i in 1:3) {
  k <- 1
  while (k <= data[i, 2]) {
    my_list[[i]][k] <- data[i, 1]
    k <- k + 1
  }
}
vector <- unlist(my_list)
vector <- unlist(my_list)
#> vector
#[1] 1 1 3 3 3
The reason why your code didn't work is essentially that you were trying to put too much information into a single variable, k. It cannot serve both as an index into your output vector and as a counter for the repetitions of each entry, a counter which is reset to 1 each time the while loop finishes.