Merge Sort in R

I am self-studying the book "Introduction to Algorithms" by Cormen et al. In their book, they use pseudocode which assumes that arrays are passed by pointer (by reference). This is different from R (where objects are passed by value), so I am having some difficulty translating their pseudocode as closely as possible, especially when recursion is involved. Most of the time, I have to implement things quite differently.
For example, with the Merge Sort algorithm, they define the Merge function (which I think I have translated correctly) and the recursive MergeSort function (where a direct translation to R does not work).
The Merge function in pseudocode is as follows, where A is an array and p, q, and r are indices into the array such that p ≤ q < r. The procedure assumes that the subarrays A[p:q] and A[q+1:r] are in sorted order. It merges them to form a single sorted subarray that replaces the current subarray A[p:r]:
Merge(A, p, q, r)
    n1 = q - p + 1
    n2 = r - q
    let L[1...n1+1] and R[1...n2+1] be new arrays
    for i = 1 to n1
        L[i] = A[p+i-1]
    for j = 1 to n2
        R[j] = A[q+j]
    L[n1+1] = ∞
    R[n2+1] = ∞
    i = 1
    j = 1
    for k = p to r
        if L[i] <= R[j]
            A[k] = L[i]
            i = i + 1
        else
            A[k] = R[j]
            j = j + 1
Which I've translated to R as:
Merge <- function(a, p, q, r) {
    n1 <- q - p + 1
    n2 <- r - q
    L <- numeric(n1 + 1)
    R <- numeric(n2 + 1)
    for (i in 1:n1) {
        L[i] <- a[p + i - 1]
    }
    for (j in 1:n2) {
        R[j] <- a[q + j]
    }
    L[n1 + 1] <- Inf
    R[n2 + 1] <- Inf
    i <- 1
    j <- 1
    for (k in p:r) {
        if (L[i] <= R[j]) {
            a[k] <- L[i]
            i <- i + 1
        } else {
            a[k] <- R[j]
            j <- j + 1
        }
    }
    a
}
And it seems to work fine.
Merge(c(1,3,5, 2,4,6), 1, 3, 6)
[1] 1 2 3 4 5 6
Now the MergeSort function is defined in pseudocode as follows:
MergeSort(A, p, r)
    if p < r
        q = ⌊(p+r)/2⌋
        MergeSort(A, p, q)
        MergeSort(A, q+1, r)
        Merge(A, p, q, r)
This assumes that A is passed by reference and that every change is visible to every recursive call, which is not true in R.
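For example, here is a quick demonstration (a minimal sketch of my own, not from the book) that modifying an argument inside an R function does not affect the caller's object:
f <- function(v) {
    v[1] <- 99  # modifies a local copy of v, not the caller's vector
    v
}
x <- c(1, 2, 3)
f(x)
## [1] 99  2  3
x           # unchanged
## [1] 1 2 3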
So, given the Merge function defined above, how would you implement the MergeSort function in R to obtain the correct results? (If possible, and preferably, but not necessarily, somewhat similar to the pseudocode.)

Trying to do a literal translation of pseudocode written for a language that allows pass-by-reference in a language that does not support it is a terrible idea. R is not meant to work on slices of an array within a function, so that is simply not an appropriate translation. The pseudocode is supposed to communicate the spirit of the algorithm, which you then translate into the target language. Here's one possible translation of the spirit of merge sort into R:
mmerge <- function(a, b) {
    r <- numeric(length(a) + length(b))
    ai <- 1; bi <- 1
    for (j in 1:length(r)) {
        if ((ai <= length(a) && a[ai] < b[bi]) || bi > length(b)) {
            r[j] <- a[ai]
            ai <- ai + 1
        } else {
            r[j] <- b[bi]
            bi <- bi + 1
        }
    }
    r
}
mmergesort <- function(A) {
    if (length(A) > 1) {
        q <- ceiling(length(A)/2)
        a <- mmergesort(A[1:q])
        b <- mmergesort(A[(q+1):length(A)])
        mmerge(a, b)
    } else {
        A
    }
}
You can run it with
x<-c(18, 16, 8, 7, 6, 3, 11, 9, 15, 1)
mmergesort(x)
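which should print the fully sorted vector:
## [1]  1  3  6  7  8  9 11 15 16 18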
In this version nothing is replaced via reference: all functions return new values. Additionally, rather than passing in slice indices, we simply subset vectors and pass them whole to the functions.
Of course, the performance of this version is likely to suffer because of all the memory reallocations that occur at the intermediate steps. There's not much you can do about that in base R, because of how the language was designed. If you like, you can write C/C++ code and call it via R's foreign language interfaces.
If you want to leave your Merge as-is (and ignore the R way of doing things), then you could do...
MergeSort <- function(A, p, r) {
    if (p < r) {
        q <- floor((p+r)/2)
        A <- MergeSort(A, p, q)
        A <- MergeSort(A, q+1, r)
        Merge(A, p, q, r)
    } else {
        A
    }
}
x <- c(18, 16, 8, 7, 6, 3, 11, 9, 15, 1)
MergeSort(x, 1, length(x))
UPDATE:
Including a benchmarking harness (this requires the microbenchmark package):
library(microbenchmark)

m1 <- function() {
    x <- sample(1000, 250)
    mmergesort(x)
}
m2 <- function() {
    x <- sample(1000, 250)
    MergeSort(x, 1, length(x))
}
microbenchmark(m1(), m2())

This solution computes the length only once and uses simpler logic; the merge step is implemented inline within mergesort:
mergesort <- function(x) {
    l <- length(x)
    if (l == 1) {
        return(x)
    } else {
        a <- mergesort(x[1:((l - l %% 2)/2)])
        b <- mergesort(x[((l + 2 - l %% 2)/2):l])
        a <- c(a, Inf)
        b <- c(b, Inf)
        for (el in 1:l) {
            if (a[1] >= b[1]) {
                x[el] <- b[1]
                b <- b[-1]
            } else {
                x[el] <- a[1]
                a <- a[-1]
            }
        }
        return(x)
    }
}
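As a quick sanity check, using the same example vector as the earlier answer:
mergesort(c(18, 16, 8, 7, 6, 3, 11, 9, 15, 1))
## [1]  1  3  6  7  8  9 11 15 16 18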

Related

Rewriting a loop with Rcpp

I am a novice Rcpp user. I want to speed up a for loop which uses several objects from the R environment and updates two vectors through the iterations.
The problem is that this is my first time dealing with C or C++, so I do not understand how to write Rcpp code with the inline package.
Here is the reproducible loop that I want to rewrite:
rsi <- c(NaN, 0, 0, 9.2, 28, 11, 9, 8, 38, 27, 62, 57, 59, 67, 76, 68, 69, 49)
L <- 2
o <- 2
T_min <- 100
T_m <- 0
# Predefine two vectors for the results to be written into
rsi_u <- rep(0, length(rsi))
rsi_d <- rep(0, length(rsi))
# Set the range the for loop is applied over
st <- L + 1 # L and o are parameters from the environment
en <- length(rsi) - o - 2
for (i in st:en) {
    k <- i - o + 1
    k1 <- i - L + 1
    if (sum(rsi_u[k:i]) == 0 & sum(rsi_d[k:i]) == 0) {
        if (min(rsi[k1:i]) == rsi[i] & rsi[i] < T_min) {
            rsi_d[i] <- 1
        }
        if (max(rsi[k1:i]) == rsi[i] & rsi[i] > T_m) {
            rsi_u[i] <- 1
        }
    }
}
So as you can see, there is a loop which first checks the condition
if (sum(rsi_u[k:i]) == 0 & sum(rsi_d[k:i]) == 0)
and then checks two other conditions. If one of those conditions is TRUE, it writes 1L to the i-th element of one of the two predefined vectors. In addition, each iteration relies on the results of previous iterations.
The result of this loop is two vectors: rsi_u and rsi_d.
In order to speed up this loop I decided to rewrite it with Rcpp and inline.
This is what I ended up with:
library("Rcpp")
library("inline")
loop_c <- cxxfunction(signature(k = "numeric", L = "numeric",
en = "numeric", rsi = "numeric", o = "numeric", T_min = "numeric", T_m ="numeric"),
plugin = "Rcpp", body = "
for (int i = L + 1; i <= en; i++) {
k = i - o + 1
k1 = i - L + 1
if (accumulate(rsi_u.k(), rsi_u.i(), 0)=0 &&
accumulate(rsi_d.k(), rsi_d.i(), 0)=0) {
if (min_element(rsi.k1(), rsi.i()) = rsi.i() && rsi.i < T_min) {
rsi_u.i = 1
}
if (max_element(rsi.k1(), rsi.i()) = rsi.i() && rsi.i > T_m) {
rsi_d.i = 1
}
}
}
return ?")
So here are my questions:
How can I return the vectors rsi_u and rsi_d to the R environment in the form of a data.frame or a matrix with 2 columns and length(rsi) rows?
Maybe this loop can be sped up with other tools? I tried the apply family, but it was slower.
How can I return the vectors rsi_u and rsi_d to the R environment in the form of a data.frame or a matrix with 2 columns and length(rsi) rows?
Not entirely sure what you're trying to achieve, but regardless you can rewrite your code in C++ using Rcpp and the sugar functions sum, max and min. The code is very similar to the equivalent R code. Some important things to be aware of are that C++ is strongly typed, meaning 2 and 2.0 are different numbers (equivalent to 2L and 2 in R), and that vectors are 0-indexed rather than 1-indexed as in R (e.g. the first element of NumericVector F(3) is at index 0 and the last at index 2; in R they would be at 1 and 3). This can lead to some confusion, but the remaining code is the same.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List fun(NumericVector rsi,
         double T_min, double T_m,
         R_xlen_t L, R_xlen_t o) {
    R_xlen_t n = rsi.size(),
             st = L + 1,
             en = n - o - 2;
    NumericVector rsi_u(n), rsi_d(n);
    // Note subsets are 0-indexed, so add -1 to indices
    for (R_xlen_t i = st - 1; i < en; i++) {
        R_xlen_t k = i - o + 1;
        R_xlen_t k1 = i - L + 1;
        Range sr(k, i), mr(k1, i);
        //LogicalVector rsub = sum(rsi_u[sr]) == 0, rsdb = sum(rsi_d[sr]) == 0;
        if (sum(rsi_u[sr]) == 0 && sum(rsi_d[sr]) == 0) {
            if (min(rsi[mr]) == rsi[i] && rsi[i] < T_min) {
                rsi_d[i] = 1.0;
            }
            if (max(rsi[mr]) == rsi[i] && rsi[i] > T_m) {
                rsi_u[i] = 1.0;
            }
        }
    }
    return DataFrame::create(Named("rsi_d") = rsi_d, Named("rsi_u") = rsi_u);
}
As a side note, the inline package is nowadays largely redundant. Most (if not all?) of its functionality is encapsulated within the Rcpp::cppFunction and Rcpp::sourceCpp functions. The code above can be imported using either of the commands below:
library(Rcpp)
cppFunction('
    // copy the code to here. Note the single quotes: needed if there are double quotes in your C++ code
')
# Alternative
sourceCpp(
    file = # insert the path to the file with the code here
    # Alt:
    # code = '
    # // copy the code to here. Note the single quotes: needed if there are double quotes in your C++ code
    # '
)
And that's it.
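Once compiled, the exported fun can be called from R like any ordinary function, for instance with the values from the question (a sketch using the question's data):
fun(rsi, T_min = 100, T_m = 0, L = 2, o = 2)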
Maybe this loop can be sped up with other tools? I tried the apply family, but it was slower.
As for this part of your question, the main idea you should be looking toward is vectorizing your code. In your example that is not immediately possible, as you are overwriting parts of the rsi_d and rsi_u vectors used in your conditions within the loop. Using *apply is equivalent to using a for loop and will not improve performance significantly.

Slow recursion even with memoization in R

I'm trying to solve problem #14 of Project Euler.
The main objective is finding the length of Collatz sequences.
First I solved the problem with a regular loop:
compute <- function(n) {
    result <- 0
    max_chain <- 0
    hashmap <- 1
    for (i in 1:n) {
        chain <- 1
        number <- i
        while (number > 1) {
            if (!is.na(hashmap[number])) {
                chain <- chain + hashmap[number]
                break
            }
            if (number %% 2 == 0) {
                chain <- chain + 1
                number <- number / 2
            } else {
                chain <- chain + 2
                number <- (3 * number + 1) / 2
            }
        }
        hashmap[i] <- chain
        if (chain > max_chain) {
            max_chain <- chain
            result <- i
        }
    }
    return(result)
}
It takes only about 2 seconds for n = 1000000.
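For reference, the run can be timed with base R's system.time:
system.time(compute(1000000))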
I decided to replace the while loop with recursion:
len_collatz_chain <- function(n, hashmap) {
    get_len <- function(n) {
        if (is.na(hashmap[n])) {
            hashmap[n] <<- ifelse(n %% 2 == 0, 1 + get_len(n / 2), 2 + get_len((3 * n + 1) / 2))
        }
        return(hashmap[n])
    }
    get_len(n)
    return(hashmap)
}
compute <- function(n) {
    result <- 0
    max_chain <- 0
    hashmap <- 1
    for (i in 1:n) {
        hashmap <- len_collatz_chain(i, hashmap)
        print(length(hashmap))
        if (hashmap[i] > max_chain) {
            max_chain <- hashmap[i]
            result <- i
        }
    }
    return(result)
}
This solution works, but it is very slow: almost 1 minute for n = 10000.
I suppose that one of the reasons is that R copies the hashmap object on every call to len_collatz_chain.
I know about the Rcpp package and yes, the first solution works fine, but I can't understand where I'm going wrong here.
Any tips?
For comparison, my recursive Python solution runs in 1 second with n = 1000000:
def len_collatz_chain(n: int, hashmap: dict) -> int:
    if n not in hashmap:
        hashmap[n] = 1 + len_collatz_chain(n // 2, hashmap) if n % 2 == 0 else 2 + len_collatz_chain((3 * n + 1) // 2, hashmap)
    return hashmap[n]

def compute(n: int) -> int:
    result, max_chain, hashmap = 0, 0, {1: 1}
    for i in range(2, n):
        chain = len_collatz_chain(i, hashmap)
        if chain > max_chain:
            result, max_chain = i, chain
    return result
The main difference between your R and Python code is that in R you use a vector for the hashmap, while in Python you use a dictionary, and that hashmap is passed many times as a function argument.
In Python, if you pass a dictionary as a function argument, only a reference to the actual data is transferred to the called function. This is fast: the called function works on the same data as the caller.
In R, a vector is copied when used as a function argument. This is potentially slow, but safer in the sense that the called function cannot alter the caller's data.
This is the main reason that Python is so much faster on your code.
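A minimal illustration of these copy semantics (my own sketch, not from the question):
grow <- function(h) {
    h[length(h) + 1] <- 0  # grows a local copy of h
    h
}
hashmap <- 1
grow(hashmap)
## [1] 1 0
length(hashmap)  # still 1: the caller's vector is unchanged
## [1] 1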
You can, however, alter the R code slightly so that the hashmap is no longer passed as a function argument:
len_collatz_chain <- local({
    hashmap <- 1L
    get_len <- function(n) {
        if (is.na(hashmap[n])) {
            hashmap[n] <<- ifelse(n %% 2 == 0, 1 + get_len(n / 2), 2 + get_len((3 * n + 1) / 2))
        }
        hashmap[n]
    }
    get_len
})
compute <- function(n) {
    result <- rep(NA_integer_, n)
    for (i in seq_len(n)) {
        result[i] <- len_collatz_chain(i)
    }
    result
}
compute(n = 10000)
This makes the R code much faster (Python will probably still be faster, though).
Note that I have also removed the return statements in the R code, as they are not needed and add one level to the call stack.

How to use the output of an R function in another function?

I want to create a script that calculates probabilities for a role-playing game.
I'm new to programming and I'm stuck on return values and nested functions. What I want is to use the values returned by the first function in the next one.
I have two functions, dice(k, n) and fight(a, b) (for the example, the functions are only partly written):
dice <- function(k, n) {
    if (k > 3 && n > 2) {
        a <- 3
        b <- 2
        attack <- sample(1:6, a)
        deff <- sample(1:6, b)
    }
    return(c(attack, deff))
}
So I want to use the vectors attack and deff in the next function:
fight <- function(a, b) {
    if (a == 3 && b == 2) {
        if (sort(attack, T)[1] > sort(deff, T)[1]) {
            n <- n - 1
        }
        if (sort(attack, T)[1] <= sort(deff, T)[1]) {
            k <- k - 1
        }
        if (sort(attack, T)[2] > sort(deff, T)[2]) {
            n <- n - 1
        }
        if (sort(attack, T)[2] <= sort(deff, T)[2]) {
            k <- k - 1
        }
    }
    return(c(k, n))
}
But this gives me the following error:
Error in sort(attack, T) : object 'attack' not found
Any ideas? Thanks!
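One minimal way to wire such functions together is to capture dice()'s return value in an object and pass that object on. The following is only a sketch with simplified dice logic, not the exact game mechanics:
dice <- function(k, n) {
    attack <- sample(1:6, min(k, 3), replace = TRUE)  # attacker rolls up to 3 dice
    deff <- sample(1:6, min(n, 2), replace = TRUE)    # defender rolls up to 2 dice
    list(attack = attack, deff = deff)                # return both vectors in one object
}

fight <- function(k, n) {
    rolls <- dice(k, n)                               # capture the returned list
    attack <- sort(rolls$attack, decreasing = TRUE)
    deff <- sort(rolls$deff, decreasing = TRUE)
    for (m in seq_len(min(length(attack), length(deff)))) {
        if (attack[m] > deff[m]) n <- n - 1 else k <- k - 1
    }
    c(k, n)
}

fight(4, 3)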

Quicksort returning same vector as passed

I recently started using R and, as an exercise, am trying to implement quicksort. I am working from the book "Introduction to Algorithms (3rd ed)".
I am using RStudio and am not seeing any errors, but the function returns the same vector it was passed. Am I assuming something wrong? I believe my code matches the book's pseudocode. The pseudocode is as follows:
Partition(A, p, r)
    x = A[r]
    i = p - 1
    for j = p to r - 1
        if A[j] <= x
            i = i + 1
            swap(A[i], A[j])
    swap(A[i+1], A[r])
    return i + 1

Quicksort(A, p, r)
    if p < r
        q = Partition(A, p, r)
        Quicksort(A, p, q - 1)
        Quicksort(A, q + 1, r)
I then wrote the same two functions in R:
partition <- function(a, p, r) {
    x = a[r]
    i = p - 1
    for (j in p:(r-1)) {
        if (a[j] <= x) {
            i = i + 1
            t = a[i]
            a[i] = a[j]
            a[j] = t
        }
    }
    t = a[i+1]
    a[i+1] = a[r]
    a[r] = t
    i + 1
}

quicksort <- function(a, p, r) {
    if (p < r) {
        q = partition(a, p, r)
        quicksort(a, p, q-1)
        quicksort(a, q+1, r)
    }
    a
}
In RStudio, I source the file and call it with a vector I created:
> v
[1] 8 5 6 7 4 1 3 2
> quicksort(v, 1, length(v))
[1] 8 5 6 7 4 1 3 2
As far as I have read, you can write recursive functions in R. I know you can't pass by reference, but wouldn't this be calling the same function with a changed vector? I am confused as to why it returns the same vector it was passed. Any help would be appreciated.
Every time you want to change an object using a function, you have to return it and assign it. I changed your code in two places: the partition function now returns a list of two items, the vector a and the position i. In quicksort, the result of partition is stored initially in temp and its items are assigned to a and q. Also, you have to assign the result of each recursive call back to a.
partition <- function(a, p, r) {
    x = a[r]
    i = p - 1
    for (j in p:(r-1)) {
        if (a[j] <= x) {
            i = i + 1
            t = a[i]
            a[i] = a[j]
            a[j] = t
        }
    }
    t = a[i+1]
    a[i+1] = a[r]
    a[r] = t
    list(i = i+1, a = a)
}

quicksort <- function(a, p, r) {
    if (p < r) {
        temp = partition(a, p, r)
        a <- temp$a
        q = temp$i
        a = quicksort(a, p, q-1)
        a = quicksort(a, q+1, r)
    }
    a
}

v = c(8, 5, 6, 7, 4, 1, 3, 2)
quicksort(v, 1, length(v))
## [1] 1 2 3 4 5 6 7 8
Cheers,
alex
This is because you aren't modifying a within the functions quicksort or partition. R (in effect) passes by value, so when you modify the value within the function, you're modifying a copy, and that modification does not persist after the function returns.
So you need to return the partitioned value, instead of modifying it within partition, in order for this to work.
For example, partition could return a list of two vectors, the two partitions, and you can then call quicksort on each of the returned vectors, as in the sketch below.
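A sketch of that approach (partition2 and quicksort2 are illustrative names, not part of the question):
partition2 <- function(a) {
    pivot <- a[length(a)]   # last element as pivot, as in the book
    rest <- a[-length(a)]
    list(lo = rest[rest <= pivot], pivot = pivot, hi = rest[rest > pivot])
}

quicksort2 <- function(a) {
    if (length(a) < 2) return(a)
    p <- partition2(a)
    # recurse on the returned pieces and stitch them back together
    c(quicksort2(p$lo), p$pivot, quicksort2(p$hi))
}

quicksort2(c(8, 5, 6, 7, 4, 1, 3, 2))
## [1] 1 2 3 4 5 6 7 8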

Speeding up Julia's poorly written R examples

The Julia examples to compare performance against R seem particularly convoluted. https://github.com/JuliaLang/julia/blob/master/test/perf/perf.R
What is the fastest performance you can eke out of the two algorithms below (preferably with an explanation of what you changed to make it more R-like)?
## mandel
mandel = function(z) {
    c = z
    maxiter = 80
    for (n in 1:maxiter) {
        if (Mod(z) > 2) return(n-1)
        z = z^2+c
    }
    return(maxiter)
}

mandelperf = function() {
    re = seq(-2,0.5,.1)
    im = seq(-1,1,.1)
    M = matrix(0.0,nrow=length(re),ncol=length(im))
    count = 1
    for (r in re) {
        for (i in im) {
            M[count] = mandel(complex(real=r,imag=i))
            count = count + 1
        }
    }
    return(M)
}
assert(sum(mandelperf()) == 14791)
## quicksort ##
qsort_kernel = function(a, lo, hi) {
    i = lo
    j = hi
    while (i < hi) {
        pivot = a[floor((lo+hi)/2)]
        while (i <= j) {
            while (a[i] < pivot) i = i + 1
            while (a[j] > pivot) j = j - 1
            if (i <= j) {
                t = a[i]
                a[i] = a[j]
                a[j] = t
            }
            i = i + 1
            j = j - 1
        }
        if (lo < j) qsort_kernel(a, lo, j)
        lo = i
        j = hi
    }
    return(a)
}

qsort = function(a) {
    return(qsort_kernel(a, 1, length(a)))
}

sortperf = function(n) {
    v = runif(n)
    return(qsort(v))
}

sortperf(5000)
The key word in this question is "algorithm":
What is the fastest performance you can eke out of the two algorithms below (preferably with an explanation of what you changed to make it more R-like)?
As in "how fast can you make these algorithms in R?" The algorithms in question here are the standard Mandelbrot complex loop iteration algorithm and the standard recursive quicksort kernel.
There are certainly faster ways to compute the answers to the problems posed in these benchmarks – but not using the same algorithms. You can avoid recursion, avoid iteration, and avoid whatever else R isn't good at. But then you're no longer comparing the same algorithms.
If you really wanted to compute Mandelbrot sets in R or sort numbers, yes, this is not how you would write the code. You would either vectorize it as much as possible – thereby pushing all the work into predefined C kernels – or just write a custom C extension and do the computation there. Either way, the conclusion is that R isn't fast enough to get really good performance on its own – you need to have C do most of the work in order to get good performance.
And that's exactly the point of these benchmarks: in Julia you never have to rely on C code to get good performance. You can just write what you want to do in pure Julia and it will have good performance. If an iterative scalar loop algorithm is the most natural way to do what you want to do, then just do that. If recursion is the most natural way to solve the problem, then that's ok too. At no point will you be forced to rely on C for performance – whether via unnatural vectorization or writing custom C extensions. Of course, you can write vectorized code when it's natural, as it often is in linear algebra; and you can call C if you already have some library that does what you want. But you don't have to.
We do want to have the fairest possible comparison of the same algorithms across languages:
If someone does have faster versions in R that use the same algorithm, please submit patches!
I believe that the R benchmarks on the julia site are already byte-compiled, but if I'm doing it wrong and the comparison is unfair to R, please let me know and I will fix it and update the benchmarks.
Hmm, in the Mandelbrot example the matrix M has its dimensions transposed; it should be
M = matrix(0.0, nrow=length(im), ncol=length(re))
because it's filled by incrementing count in the inner loop (over successive values of im). My implementation creates a vector of complex numbers in mandelperf.1 and operates on all elements, using an index and subsetting to keep track of which elements of the vector have not yet satisfied the condition Mod(z) <= 2:
mandel.1 = function(z, maxiter=80L) {
    c <- z
    result <- integer(length(z))
    i <- seq_along(z)
    n <- 0L
    while (n < maxiter && length(z)) {
        j <- Mod(z) <= 2
        if (!all(j)) {
            result[i[!j]] <- n
            i <- i[j]
            z <- z[j]
            c <- c[j]
        }
        z <- z^2 + c
        n <- n + 1L
    }
    result[i] <- maxiter
    result
}

mandelperf.1 = function() {
    re = seq(-2,0.5,.1)
    im = seq(-1,1,.1)
    mandel.1(complex(real=rep(re, each=length(im)), imaginary=im))
}
for a 13-fold speed-up (the results are equal but not identical because the original returns numeric rather than integer values).
> library(rbenchmark)
> benchmark(mandelperf(), mandelperf.1(),
+ columns=c("test", "elapsed", "relative"),
+ order="relative")
test elapsed relative
2 mandelperf.1() 0.412 1.00000
1 mandelperf() 5.705 13.84709
> all.equal(sum(mandelperf()), sum(mandelperf.1()))
[1] TRUE
The quicksort example doesn't actually sort
> set.seed(123L); qsort(sample(5))
[1] 2 4 1 3 5
but my main speed-up was to vectorize the partition around the pivot
qsort_kernel.1 = function(a) {
    if (length(a) < 2L)
        return(a)
    pivot <- a[floor(length(a) / 2)]
    c(qsort_kernel.1(a[a < pivot]), a[a == pivot], qsort_kernel.1(a[a > pivot]))
}

qsort.1 = function(a) {
    qsort_kernel.1(a)
}

sortperf.1 = function(n) {
    v = runif(n)
    return(qsort.1(v))
}
for a 7-fold speedup (in comparison to the uncorrected original)
> benchmark(sortperf(5000), sortperf.1(5000),
+ columns=c("test", "elapsed", "relative"),
+ order="relative")
test elapsed relative
2 sortperf.1(5000) 6.60 1.000000
1 sortperf(5000) 47.73 7.231818
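Unlike the original kernel, the corrected version also actually sorts:
> set.seed(123L); qsort.1(sample(5))
[1] 1 2 3 4 5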
Since in the original comparison Julia is about 30 times faster than R for mandel, and 500 times faster for quicksort, the implementations above are still not really competitive.
