I just saw a YouTube video from Numberphile on the Yellowstone sequence (A098550). It's based on a sequence starting with 1 and 2, with subsequent terms generated by these rules:
1. no repeated terms
2. always pick the lowest integer
3. gcd(a_n, a_(n-1)) = 1
4. gcd(a_n, a_(n-2)) > 1
The first 15 terms would be: 1 2 3 4 9 8 15 14 5 6 25 12 35 16 7
A Q&D approach in R could be something like this, but understandably it becomes very slow when building longer sequences. It also makes an assumption about the highest number that can appear in the sequence (for reference: a sequence of 10,000 terms never goes higher than 5000).
What can we do to make this faster?
library(DescTools)

a <- c(1, 2, 3)
p <- length(a)
# all natural numbers (assumes the sequence never goes above 5000)
all_ints <- 1:5000
for (n in p:1000) {
  # rule 1 - remove all numbers that are in the sequence already
  next_a_set <- all_ints[which(!all_ints %in% a)]
  # rule 3 - search the remaining set for numbers with gcd(a[n], x) == 1
  next_a_option <- next_a_set[which(
    sapply(next_a_set, function(x) GCD(a[n], x)) == 1
  )]
  # rule 4 - search the remaining numbers for gcd(a[n-1], x) > 1
  next_a <- next_a_option[which(
    sapply(next_a_option, function(x) GCD(a[n - 1], x)) > 1
  )]
  # rule 2 - select the lowest
  a <- c(a, min(next_a))
}
Here's a version that's about 20 times faster than yours, with comments about the changes:
# Set a to its final length from the start: three seed values plus
# the 998 values the loop below fills in at indices 4 to 1001.
a <- c(1, 2, 3, rep(NA, 998))
p <- 3
# Define a vectorized gcd() function. We'll be testing
# lots of gcds at once. This uses the Euclidean algorithm.
gcd <- function(x, y) { # vectorized gcd
  while (any(y != 0)) {
    x1 <- ifelse(y == 0, x, y)
    y <- ifelse(y == 0, 0, x %% y)
    x <- x1
  }
  x
}
# Guess at a reasonably large vector to work from,
# but grow it later if it's not big enough.
allnum <- 1:1000
# Keep a logical record of what has been used
used <- c(rep(TRUE, 3), rep(FALSE, length(allnum) - 3))
for (n in p:1000) {
  # rule 1 - remove all numbers that are in the sequence already:
  # nothing to do -- `used` already records that.
  repeat {
    # rule 3 - search the remaining set for numbers with gcd(a[n], x) == 1
    keep <- !used & gcd(a[n], allnum) == 1
    # rule 4 - of those, keep the numbers with gcd(a[n-1], x) > 1
    keep <- keep & gcd(a[n - 1], allnum) > 1
    # If we found anything, break out of this loop
    if (any(keep))
      break
    # Otherwise, make the set of possible values twice as big,
    # and try again
    allnum <- seq_len(2 * length(allnum))
    used <- c(used, rep(FALSE, length(used)))
  }
  # rule 2 - select the lowest (the first TRUE in keep)
  newval <- which.max(keep)
  # Assign it into the appropriate place
  a[n + 1] <- newval
  # Record that it has been used
  used[newval] <- TRUE
}
If you profile it, you'll see it spends most of its time in the gcd() function. You could probably make that a lot faster by redoing it in C or C++.
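One quick way to confirm this is R's sampling profiler. A minimal sketch, assuming the version above has been saved in a script called yellowstone.R (the file names here are arbitrary):
Rprof("yellowstone.prof")                        # start the sampling profiler
source("yellowstone.R")                          # run the loop above
Rprof(NULL)                                      # stop profiling
head(summaryRprof("yellowstone.prof")$by.self)   # gcd() should top the self.time column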
The biggest changes here are pre-allocation and restricting the search to numbers that have not yet been used.
library(numbers)
N <- 5e3
a <- integer(N)
a[1:3] <- 1:3
b <- logical(N) # which numbers have been used already?
b[1:3] <- TRUE
NN <- 1:N
system.time({
  for (n in 4:N) {
    a1 <- a[n - 1L]
    a2 <- a[n - 2L]
    for (k in NN[!b]) {
      if (GCD(k, a1) == 1L && GCD(k, a2) > 1L) {
        a[n] <- k
        b[k] <- TRUE
        break
      }
    }
    if (!a[n]) {
      a <- a[1:(n - 1L)]
      break
    }
  }
})
#> user system elapsed
#> 1.28 0.00 1.28
length(a)
#> [1] 1137
For a fast C++ algorithm, see here.
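As a rough illustration (not the linked solution), a C++ translation of the same search loop via Rcpp::cppFunction() might look something like the sketch below; the function name and the pool-sizing heuristic are my own, and it assumes n >= 3:
library(Rcpp)

cppFunction('
IntegerVector yellowstone(int n) {
  // assumes n >= 3; fills the first three seed terms directly
  IntegerVector a(n);
  a[0] = 1; a[1] = 2; a[2] = 3;
  std::vector<bool> used(4 * n + 1, false);   // candidate pool, grown on demand
  used[1] = used[2] = used[3] = true;
  for (int i = 3; i < n; i++) {
    int a1 = a[i - 1], a2 = a[i - 2];
    for (int k = 4; ; k++) {
      if (k >= (int)used.size()) used.resize(2 * used.size(), false);
      if (used[k]) continue;
      int g1 = k, y1 = a1;                    // gcd(k, a[i-1]) must equal 1
      while (y1) { int t = g1 % y1; g1 = y1; y1 = t; }
      int g2 = k, y2 = a2;                    // gcd(k, a[i-2]) must exceed 1
      while (y2) { int t = g2 % y2; g2 = y2; y2 = t; }
      if (g1 == 1 && g2 > 1) { a[i] = k; used[k] = true; break; }
    }
  }
  return a;
}
')

yellowstone(15)
# should reproduce 1 2 3 4 9 8 15 14 5 6 25 12 35 16 7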
How do I retrieve the maximum sum of possible divisor numbers?
I have the function below, which gives the possible divisors of a number.
Code
divisors <- function(x) {
  y <- seq_len(ceiling(x / 2))
  y[x %% y == 0]
}
Example
Divisors of 99 will give the below possible values.
divisors(99)
[1] 1 3 9 11 33
My expected logic:
Go from the last value to the first value in the divisors output.
The last number is 33. The next number down that divides 33 is 11, so I select 11. Traversing on from 11, the next number down that divides 11 is 1, so I select 1. Now add all the numbers:
33 + 11 + 1 = 45
Move to the next number, 11. The next number down that divides 11 is 1, so I select 1. Now add all the numbers:
11 + 1 = 12
Move to the next number, 9. The next number down that divides 9 is 3, and the next that divides 3 is 1. Now add all the numbers:
9 + 3 + 1 = 13
Move to the next number, 3. The next number down that divides 3 is 1, so I select 1. Now add all the numbers:
3 + 1 = 4
Now the maximum among these is 45.
I am struggling to write this logic in R. Help / advice much appreciated.
Note: prime numbers can be ignored.
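For reference, a direct (if naive) translation of that walk could look like the sketch below; chain_sum is a hypothetical helper name, and divisors() is the function defined above:
# Greedy walk: from each starting divisor, repeatedly step down to the largest
# remaining divisor that divides the current value, summing as we go.
chain_sum <- function(start, divs) {
  total <- start
  current <- start
  repeat {
    smaller <- divs[divs < current & current %% divs == 0]
    if (length(smaller) == 0) break
    current <- max(smaller)
    total <- total + current
  }
  total
}

divs <- divisors(99)                       # 1 3 9 11 33
sapply(divs, chain_sum, divs = divs)       # 1 4 13 12 45
max(sapply(divs, chain_sum, divs = divs))  # 45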
Update
For large integers, e.g. the maximum integer .Machine$integer.max (which is a prime), you can run the code below (note that I modified the functions divisors and f a bit):
divisors <- function(x) {
  y <- seq(x / 2)
  y[as.integer(x) %% y == 0]
}

f <- function(y) {
  if (length(y) <= 2) {
    return(as.integer(sum(y)))
  }
  l <- length(y)
  h <- y[l]
  yy <- y[-l]
  h + f(yy[h %% yy == 0])
}
and you will see
> n <- .Machine$integer.max - 1
> x <- divisors(n)
> max(sapply(length(x):2, function(k) f(head(x, k))))
[1] 1569603656
You can define a recursive function f that gives successive divisors
f <- function(y) {
  if (length(y) == 1) {
    return(y)
  }
  h <- y[length(y)]
  yy <- y[-length(y)]
  c(f(yy[h %% yy == 0]), h)
}
and you will see all possible successive divisor tuples
> sapply(rev(seq_along(x)), function(k) f(head(x, k)))
[[1]]
[1] 1 11 33
[[2]]
[1] 1 11
[[3]]
[1] 1 3 9
[[4]]
[1] 1 3
[[5]]
[1] 1
Then, we apply f within sapply like below
> max(sapply(rev(seq_along(x)), function(k) sum(f(head(x, k)))))
[1] 45
which gives the desired output.
You can also use the following solution. It may sound a little bit complicated and of course there is always an easier, more efficient solution. However, I thought this could be useful to you. I will take it from your divisors output:
> x
[1] 1 3 9 11 33
# First I create a list whose first element is the original x; each subsequent
# element drops one more value from the end of x
lst <- lapply(0:(length(x) - 1), function(a) x[1:(length(x) - a)])
> lst
[[1]]
[1] 1 3 9 11 33
[[2]]
[1] 1 3 9 11
[[3]]
[1] 1 3 9
[[4]]
[1] 1 3
[[5]]
[1] 1
Then I wrote a custom function to implement your conditions and collect the desired output. For this purpose I created a function factory, which is simply a function that creates another function:
As you may have noticed, the outer function does not take any arguments. It only sets up an empty vector out in which to save the desired elements. out is created in the execution environment of the outer function to shield it from any changes that might affect it in the global environment.
The inner function is the one that takes our vector x, so we call the whole setup like fnf()(x). The first element of our out vector is the first element of the original x (33). Then I find all the values of x that divide that element without a remainder. Of those, I take the second one (11), since the first is 33 itself, and store it in out. Then I drop the maximum value (33) from x and repeat the same process.
Since we repeat the process over and over, I thought this might be a good case for recursion. Recursion is a programming technique in which a function calls itself from its own body. As you may have noticed, I call fn inside the function to repeat the process, each time with fewer values.
This may sound a bit complicated, but I believe there are some useful points here to pick up for future exploration; I found them very useful and hope that's the case for you too.
fnf <- function() {
  out <- c()
  fn <- function(x) {
    out <<- c(out, x[1])
    z <- x[out[length(out)] %% x == 0]
    if (length(z) >= 2) {
      out[length(out) + 1] <<- z[2]
    } else {
      return(out)
    }
    x <- x[!duplicated(x)][which(x[!duplicated(x)] == z[2]):length(x[!duplicated(x)])]
    fn(x)
    out[!duplicated(out)]
  }
}
# Applying the custom function to `lst` reproduces your divisor chains
lapply(lst, function(x) fnf()(sort(x, decreasing = TRUE)))
[[1]]
[1] 33 11 1
[[2]]
[1] 11 1
[[3]]
[1] 9 3 1
[[4]]
[1] 3 1
[[5]]
[1] 1
In the end we sum each element and extract the max value
Reduce(max, lapply(lst, function(x) sum(fnf()(sort(x, decreasing = TRUE)))))
[1] 45
To test a very large integer, I used @ThomasIsCoding's modified divisors function:
divisors <- function(x) {
  y <- seq(x / 2)
  y[as.integer(x) %% y == 0]
}
x <- divisors(.Machine$integer.max - 1)
lst <- lapply(0:(length(x)-1), function(a) x[1:(length(x)-a)])
Reduce(max, lapply(lst, function(x) sum(fnf()(sort(x, decreasing = TRUE)))))
[1] 1569603656
You'll need to recurse. If I understand correctly, this should do what you want:
fact <- function(x) {
  x <- as.integer(x)
  div <- seq_len(abs(x) / 2)
  factors <- div[x %% div == 0L]
  return(factors)
}

maxfact <- function(x) {
  factors <- fact(x)
  if (length(factors) < 3L) {
    return(sum(factors))
  } else {
    return(max(factors + mapply(maxfact, factors)))
  }
}
maxfact(99)
[1] 45
I'm looking to subset a vector so that it contains no sequential numbers. However, if there is a run of more than two sequential numbers, then only every second number in the run is removed, since removing it is enough to disrupt the sequence.
e.g. 1,2,4,6,7 would give 1,4,6
e.g. 6,7,8,9 would give 6,8
This is easy to do iteratively, but iterating over 10M+ elements is incredibly slow:
x <- c(1, 2, 4, 6, 7, 8, 9) # Ideal output is c(1, 4, 6, 8)
for (i in 2:length(x)) {
  if (!is.na(x[i - 1])) {
    if (x[i] == x[i - 1] + 1) x[i] <- NA_integer_
  }
}
x[!is.na(x)]
Is there another solution that would be significantly faster?
Using the convenience functions collapse::seqid (which assigns a run ID to each maximal run of consecutive integers) and data.table::rowid (which numbers the positions within each run), we keep the odd positions within each run:
library(collapse)
library(data.table)
x[rowid(seqid(x)) %% 2 == 1]
# [1] 1 4 6 8
Seems faster on a longer vector:
x = rep(c(1, 2, 4, 6, 7, 8, 9), 1e7)

system.time({
  seq_id = data.table::rleid(x - seq_along(x))
  obs_id = unlist(lapply(split(seq_id, seq_id), seq_along))
  r1 = x[obs_id %% 2 == 1]
})
#    user  system elapsed
#  112.77   55.99  177.11

system.time({
  r2 = x[rowid(seqid(x)) %% 2 == 1]
})
#    user  system elapsed
#    8.03    5.97   10.23

all.equal(r1, r2)
# [1] TRUE
We can use the fantastic data.table::rleid to generate an ID for each sequence, then keep only the odd-numbered elements within each sequence. This should be quite fast, though more optimization is certainly possible.
disrupt_seqs = function(x) {
  seq_id = data.table::rleid(x - seq_along(x))
  obs_id = unlist(lapply(split(seq_id, seq_id), seq_along))
  x[obs_id %% 2 == 1]
}
x <- c(1,2,4,6,7,8,9)
disrupt_seqs(x)
# [1] 1 4 6 8
I'm a beginning R programmer. I'm having trouble with a loop calculation that uses a previous value, like a recursion.
An example of my data:
library(data.table)
dt <- data.table(a = c(0:4), b = c(0, 1, 2, 1, 3))
The calculated value y is y[n] = (y[n-1] + b[n]) * a[n], with initial value y[1] = 0.
I used a for loop; the code and result are below.
dt$y <- 0
for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
}
a b y
1: 0 0 0
2: 1 1 1
3: 2 2 6
4: 3 1 21
5: 4 3 96
This result is what I want. However, my data has over 1,000,000 rows and several columns, so I'm trying to find a way that avoids the for loop. I tried Reduce(), but it seemed to only work with a single vector (e.g. y[n] = y[n-1] + b[n]). As shown above, my function uses two vectors, a and b, so I couldn't find a solution.
Is there a more efficient, faster way that avoids the for loop, such as a recursive function or a good package function?
This kind of computation cannot take advantage of R's vectorization because of the iterative dependency. But the slow-down appears to come mainly from indexing performance on a data.frame or data.table.
Interestingly, I was able to speed up the loop considerably by accessing a, b, and y directly as numeric vectors (1000+ fold advantage for 2*10^5 rows) or as matrix "columns" (100+ fold advantage for 2*10^5 rows) versus as columns in a data.table or data.frame.
This old discussion may still shed some light on this rather surprising result: https://stat.ethz.ch/pipermail/r-help/2011-July/282666.html
Please note that I also made a different toy data.frame, so I could test a larger example without returning Inf as y grew with i:
Option data.frame (numeric vectors embedded in a data.frame or data.table per your example):
vec_length <- 200000
dt <- data.frame(a = seq(from = 0, to = 1, length.out = vec_length),
                 b = seq(from = 0, to = -1, length.out = vec_length),
                 y = 0)
system.time(for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
})
#   user  system elapsed
#  79.39  146.30  225.78
# NOTE: Sorry, I didn't have the patience to let the data.table version finish
# for vec_length = 2*10^5.
tail(dt$y)
# [1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option vector (numeric vectors extracted in advance of loop):
vec_length <- 200000
dt <- data.frame(a = seq(from = 0, to = 1, length.out = vec_length),
                 b = seq(from = 0, to = -1, length.out = vec_length),
                 y = 0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
  y[i] <- (y[i - 1] + b[i]) * a[i]
})
#  user  system elapsed
#  0.03    0.00    0.03
tail(y)
# [1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option matrix (data.frame converted to matrix before loop):
vec_length <- 200000
dt <- as.matrix(data.frame(a = seq(from = 0, to = 1, length.out = vec_length),
                           b = seq(from = 0, to = -1, length.out = vec_length),
                           y = 0))
system.time(for (i in 2:nrow(dt)) {
  dt[i, 3] <- (dt[i - 1, 3] + dt[i, 2]) * dt[i, 1]
})
#  user  system elapsed
#  0.67    0.01    0.69
tail(dt[, 3])
# [1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
# NOTE: a matrix is actually a vector with an additional attribute (its "dim")
# that says how the "matrix" should be organized into rows and columns
Option data.frame with matrix style indexing:
vec_length <- 200000
dt <- data.frame(a = seq(from = 0, to = 1, length.out = vec_length),
                 b = seq(from = 0, to = -1, length.out = vec_length),
                 y = 0)
system.time(for (i in 2:nrow(dt)) {
  dt[i, 3] <- (dt[(i - 1), 3] + dt[i, 2]) * dt[i, 1]
})
#    user  system elapsed
#  110.69    0.03  112.01
tail(dt[, 3])
# [1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
An option is to use Rcpp, since this recursive equation is easy to code in C++:
library(Rcpp)

cppFunction("
  NumericVector func(NumericVector b, NumericVector a) {
    int len = b.size();
    NumericVector y(len);
    for (int i = 1; i < len; i++) {
      y[i] = (y[i-1] + b[i]) * a[i];
    }
    return(y);
  }
")

func(c(0, 1, 2, 1, 3), c(0:4))
# [1]  0  1  6 21 96
timing code:
vec_length <- 1e7
dt <- data.frame(a = 1:vec_length, b = 1:vec_length, y = 0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
  y[i] <- (y[i - 1] + b[i]) * a[i]
})
#  user  system elapsed
# 19.22    0.06   19.44
system.time(func(b, a))
#  user  system elapsed
#  0.09    0.02    0.09
Here is a base R solution.
According to the information from @ThetaFC, one way to speed this up is to use a matrix or a plain vector (rather than a data.frame or data.table). Thus, it is better to do the following preprocessing before calculating df$y, i.e.,
a <- as.numeric(df$a)
b <- as.numeric(df$b)
Then, you have two approaches to get df$y:
writing your customized recursion function
f <- function(k) {
  if (k == 1) return(0)
  prev <- f(k - 1)                       # compute f(k-1) once and reuse it
  c(prev, (tail(prev, 1) + b[k]) * a[k])
}
df$y <- f(nrow(df))
Or a non-recursive function, based on expanding the recursion into y[k] = sum over j of b[j] * a[j] * a[j+1] * ... * a[k] (I guess this will be much faster than the recursive approach):
g <- Vectorize(function(k) sum(rev(cumprod(rev(a[2:k])))*b[2:k]))
df$y <- g(seq(nrow(df)))
such that
> df
a b y
1 0 0 0
2 1 1 1
3 2 2 6
4 3 1 21
5 4 3 96
I don't think this will be any faster, but here's one way to do it without an explicit loop
dt[, y := purrr::accumulate2(a, b, function(last, a, b) (last + b) * a,
                             .init = 0)[-1]]
dt
# a b y
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 6
# 4: 3 1 21
# 5: 4 3 96
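On the Reduce() point from the question: base R's Reduce() can also carry this recursion if you fold over the row indices and pass accumulate = TRUE. A minimal sketch using the same dt, with the columns pulled out as plain vectors first:
# Fold over row indices; `prev` carries y[i-1] forward, and accumulate = TRUE
# keeps every intermediate value, so the result has one entry per row.
a <- dt$a
b <- dt$b
dt$y <- Reduce(function(prev, i) (prev + b[i]) * a[i],
               x = seq_along(a)[-1], init = 0, accumulate = TRUE)
dt$y
# [1]  0  1  6 21 96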
Problem
Find the sum of all numbers below 1000 that are divisible by 3 or 5.
One solution I created:
x <- c(1:999)
values <- x[x %% 3 == 0 | x %% 5 == 0]
sum(values)
The second solution I can't get to work and need help with; I've pasted it below.
I'm trying to use a loop (here, I use while() and after this I'll try for()). I am still struggling with keeping references to indexes (locations in a vector) separate from values/observations within vectors. Loops seem to make it more challenging for me to distinguish the two.
Why does this not produce the answer to Euler #1?
x <- 0
i <- 1
while (i < 100) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x[i] <- c(x, i)
  }
  i <- i + 1
}
sum(x)
And in words, line by line this is what I understand is happening:
x gets value 0
i gets value 1
while object i's value (not the index #) is < 1000
if is divisible by 3 or 5
add that number i to the vector x
add 1 to i (in order to keep the loop going to the defined limit of 1e3)
sum all items in vector x
I am guessing x[i] <- c(x, i) is not the right way to add an element to vector x. How do I fix this and what else is not accurate?
First, your loop runs until i < 100, not i < 1000.
Second, replace x[i] <- c(x, i) with x <- c(x, i) to add an element to the vector.
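With both fixes applied, the loop from the question becomes:
x <- 0                       # the leading 0 does not affect the sum
i <- 1
while (i < 1000) {           # fix 1: loop bound
  if (i %% 3 == 0 | i %% 5 == 0) {
    x <- c(x, i)             # fix 2: append to x instead of indexing into it
  }
  i <- i + 1
}
sum(x)
# [1] 233168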
Here is a shortcut that performs this sum, which is probably more in the spirit of the problem:
3*(333*334/2) + 5*(199*200/2) - 15*(66*67/2)
## [1] 233168
Here's why this works:
In the set of integers [1,999] there are:
333 values that are divisible by 3. Their sum is 3*sum(1:333) or 3*(333*334/2).
199 values that are divisible by 5. Their sum is 5*sum(1:199) or 5*(199*200/2).
Adding these up gives a number that is too high by their intersection, the values that are divisible by 15, so that amount must be subtracted. There are 66 such values, and their sum is 15*sum(1:66) or 15*(66*67/2).
As a function of N, this can be written:
f <- function(N) {
  threes <- floor(N/3)
  fives <- floor(N/5)
  fifteens <- floor(N/15)
  3*(threes*(threes+1)/2) + 5*(fives*(fives+1)/2) - 15*(fifteens*(fifteens+1)/2)
}
Giving:
f(999)
## [1] 233168
f(99)
## [1] 2318
And another way:
x <- 1:999
sum(which(x %% 5 == 0 | x %% 3 == 0))
# [1] 233168
This works because x is 1:999, so the indices returned by which() are equal to the values themselves.
A very efficient approach is the following:
div_sum <- function(x, n) {
  # calculates double the sum of all integers from 1 to n
  # that are divisible by x
  max_num <- n %/% x
  x * (max_num + 1) * max_num
}
n <- 999
a <- 3
b <- 5
(div_sum(a, n) + div_sum(b, n) - div_sum(a * b, n)) / 2
In contrast, a very short code is the following:
x=1:999
sum(x[!x%%3|!x%%5])
Here is an alternative that I think gives the same answer (using 99 instead of 999 as the upper bound):
iters <- 100
x <- rep(0, iters - 1)
i <- 1
while (i < iters) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x[i] <- i
  }
  i <- i + 1
}
sum(x)
# [1] 2318
Here is the for-loop mentioned in the original post:
iters <- 99
x <- rep(0, iters)
for (i in 1:iters) {
  if (i %% 3 == 0 | i %% 5 == 0) {
    x[i] <- i
  }
}
sum(x)
# [1] 2318
bucketIndex <- function(v, N) {
  o <- rep(0, length(v))
  curSum <- 0
  index <- 1
  for (i in seq(length(v))) {
    o[i] <- index
    curSum <- curSum + v[i]
    if (curSum > N) {
      curSum <- 0
      index <- index + 1
    }
  }
  o
}
> bucketIndex(c(1, 1, 2, 1, 5, 1), 3)
[1] 1 1 1 2 2 3
I'm wondering if this function is fundamentally un-vectorizable. If it is, is there some package to deal with this "class" of functions, or is the only alternative (if I want speed) to write it as a C extension?
Here's a try (it does not yet arrive at bucketIndex!):
Your
curSum <- curSum + v[i]
if (curSum > N) {
  curSum <- 0
  index <- index + 1
}
is almost an integer division (%/%) of cumsum(v).
But not quite: your index only counts up by 1 even if v[i] is several times larger than N, and you start at 1. We can almost take care of that by converting to a factor and back to an integer.
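To illustrate the %/%-of-cumsum idea and where it diverges (a rough sketch only; as said, it does not yet match bucketIndex()):
v <- c(1, 1, 2, 1, 5, 1)
N <- 3
cumsum(v) %/% N + 1
# [1] 1 1 2 2 4 4    # compare bucketIndex(v, 3): 1 1 1 2 2 3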
However, I'm wondering (from the name of the function) whether this behaviour is really intended:
> bucketIndex(c(1, 1, 2, 1, 2, 1, 1, 2, 1, 5, 1), 3)
[1] 1 1 1 2 2 2 3 3 3 4 5
> bucketIndex(c(1, 1, 1, 2, 2, 1, 1, 2, 1, 5, 1), 3)
[1] 1 1 1 1 2 2 2 3 3 3 4
I.e. just exchanging two consecutive entries in v can lead to a different maximum in the result.
The other point is that you count up only after the element that causes the sum to exceed N, which means the result should have an additional 1 at the beginning and the last element should be dropped.
You reset curSum to 0 regardless of how far it overshoots N. So for all elements with cumsum(v) > N, you'd need to subtract that value, then look for the next cumsum(v) > N, and so on. This reduces the number of loop iterations relative to your for loop, but whether it gives a substantial improvement depends on the entries of v and on N (or on the max(index) : length(v) ratio). If that ratio is 50% as in your example, I don't think you can get a substantial gain. Unless there is at least an order of magnitude between them, I'd go for inline::cfunction.
I'm going to go out on a limb here and say the answer is "no." Essentially, you're changing what it is you sum over based on the results of the current sum. This means future calculations depend on the result of an intermediate calculation, which vectorized operations can't do.
I don't think this is completely vectorizable, but @cbeleites gets at one way to reduce the number of iterations in the loop by dealing with a whole chunk (bucket) at a time. Each iteration looks for where the cumulative sum exceeds N, assigns the index to that range, reduces the cumulative sum by its value at the point where it exceeded N, and repeats until the vector is exhausted. The rest is bookkeeping (initializing and incrementing values).
bucketIndex2 <- function(v, N) {
  index <- 1
  cs <- cumsum(v)
  bk.old <- 0
  o <- rep(0, length(v))
  repeat {
    bk <- suppressWarnings(min(which(cs > N)))
    o[(bk.old + 1):min(bk, length(v))] <- index
    if (bk >= length(v)) break
    cs <- cs - cs[bk]
    index <- index + 1
    bk.old <- bk
  }
  o
}
This matches your function for a variety of random inputs:
for (i in 1:200) {
  v <- sample(sample(20, 1), sample(50, 1) + 20, replace = TRUE)
  N <- sample(10, 1)
  bi <- bucketIndex(v, N)
  bi2 <- bucketIndex2(v, N)
  if (any(bi != bi2)) {
    print("MISMATCH:")
    dump("v", "")
    dump("N", "")
  }
}
}