R: Optimise spike pruning function

Since I have not found an R package for analysis of electrophysiological data, I have used a function for spike pruning from my group:
prune.spikes <- function(spikes, min.isi) {
  # copy spike matrix
  prunedspikes <- spikes
  for (i in 1:ncol(spikes)) {
    # initialise index of last spike: infinitely before the first one.
    last <- -Inf
    for (j in 1:nrow(spikes)) {
      if (spikes[j, i] == 1) {
        if (j - last < min.isi) {
          prunedspikes[j, i] <- 0  # remove the spike
        } else {
          last <- j
        }
      }
    }
  }
  return(prunedspikes)
}
The function takes a spike vector or matrix consisting of 0 and 1 values and removes any 1 that occurs within min.isi samples of the previously kept spike.
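To make the behaviour concrete, here is a small made-up example (a one-column matrix and a minimum interval of 3 samples; the numbers are mine, not real data):
spikes <- matrix(0, nrow = 10, ncol = 1)
spikes[c(2, 4, 8, 9), 1] <- 1
prune.spikes(spikes, min.isi = 3)
# the spikes at rows 2 and 8 are kept; those at rows 4 and 9 fall within
# 3 samples of the previously kept spike and are set back to 0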
Because of the two nested loops it takes ages to run. In order to optimise it I have come up with this solution (removes one loop):
prune.cols <- function(spikes, min.isi) {
  prunedspikes <- apply(spikes, 2, FUN = prune.rows, min.isi = min.isi)
  return(prunedspikes)
}

prune.rows <- function(spikes, min.isi) {
  prunedspikes <- spikes
  last <- -Inf
  for (i in 1:length(spikes)) {
    if (spikes[i] == 1) {
      if (i - last < min.isi) {
        prunedspikes[i] <- 0  # remove the spike
      } else {
        last <- i
      }
    }
  }
  return(prunedspikes)
}
Calling prune.cols on a large data set is noticeably faster than the original version (roughly 60 times). One loop remains, though, and so far I have not come up with a nice and simple way to remove it. How can the function be improved even further?
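For reference, the ~60x figure could be checked with something like the following sketch using the microbenchmark package; the matrix size and spike density are arbitrary choices of mine, not the original data:
library(microbenchmark)
set.seed(42)
# 10,000 samples x 50 channels, roughly 5 % spike probability
spikes <- matrix(rbinom(10000 * 50, 1, 0.05), nrow = 10000)
microbenchmark(
  original  = prune.spikes(spikes, min.isi = 10),
  optimised = prune.cols(spikes, min.isi = 10),
  times = 10
)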

As @Khashaa proposed, I implemented the function with the help of Rcpp:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix prunespikes(NumericMatrix spikes, double minisi) {
  // clone() is needed: plain assignment would make prunedspikes share
  // memory with spikes, so the input matrix would be modified as well
  NumericMatrix prunedspikes = clone(spikes);
  int ncol = spikes.ncol();
  int nrow = spikes.nrow();
  for (int i = 0; i < ncol; i++) {
    // find the first spike in this column (guarding against empty columns)
    int last = 0;
    while (last < nrow && spikes(last, i) == 0) {
      last++;
    }
    for (int j = last + 1; j < nrow; j++) {
      if (spikes(j, i) == 1) {
        if (j - last < minisi) {
          prunedspikes(j, i) = 0;
        } else {
          last = j;
        }
      }
    }
  }
  return prunedspikes;
}
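The listing above is only the C++ source; with the Rcpp header and export attribute included as shown, one way to compile and call it from R is via Rcpp::sourceCpp. The file name below is just an assumption:
library(Rcpp)
sourceCpp("prunespikes.cpp")  # assumed file containing the C++ code above
pruned <- prunespikes(spikes, 10)  # using the example matrix from the benchmark above
# sanity check against the pure-R version
all(pruned == prune.cols(spikes, min.isi = 10))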

If the speed difference is not a problem yet, it may be better to keep the loop instead of using Rcpp.
According to Hadley Wickham's section "Loops that should be left as is", keeping this loop is not a bad idea, since it falls into the "recursive relationship" case.
Once speed becomes the bottleneck, resorting to Rcpp or to the page the article suggests may be the solution.

Related

R - nested loop alternatives/optimization

I'm currently trying to implement an algorithm in R that requires looping through the rows and columns of a matrix, computing for every cell a value that depends on previously computed cells.
Here is the code that does what I described above; it is part of the Needleman-Wunsch algorithm:
globalSequenceAlignment <- function(seq1, seq2, match, mismatch, gap) {
  # splitting the sequences in order to use them as rows and columns names
  seq1_split <- unlist(strsplit(toString(seq1), ""))
  seq2_split <- unlist(strsplit(toString(seq2), ""))
  len1 <- length(seq1_split)
  len2 <- length(seq2_split)
  # creating the alignment matrix
  alignment_matrix <- matrix(0, nrow = len2 + 1, ncol = len1 + 1)
  colnames(alignment_matrix) <- c("-", seq1_split)
  rownames(alignment_matrix) <- c("-", seq2_split)
  # filling first row and column of the alignment matrix
  for (i in 2:ncol(alignment_matrix)) {
    alignment_matrix[1, i] <- (alignment_matrix[1, i] + (i - 1)) * gap
  }
  for (j in 2:nrow(alignment_matrix)) {
    alignment_matrix[j, 1] <- (alignment_matrix[j, 1] + (j - 1)) * gap
  }
  for (i in 2:ncol(alignment_matrix)) {
    for (j in 2:nrow(alignment_matrix)) {
      horizontal_score <- alignment_matrix[j, i - 1] + gap
      vertical_score <- alignment_matrix[j - 1, i] + gap
      if (colnames(alignment_matrix)[i] == rownames(alignment_matrix)[j]) {
        diagonal_score <- alignment_matrix[j - 1, i - 1] + match
      } else {
        diagonal_score <- alignment_matrix[j - 1, i - 1] + mismatch
      }
      scores <- c(horizontal_score, vertical_score, diagonal_score)
      alignment_matrix[j, i] <- max(scores)
    }
  }
  return(alignment_matrix)
}
a <- 'GAATC'
b <- 'CATACG'
globalSequenceAlignment(a, b, 10,-5,-4)
Using this code I get the result that I want.
The problem is that with matrices of dimensions greater than 500x500 the nested loops become way too slow (running this code on a 500x500 matrix takes more or less 2 minutes).
I know that *apply functions could improve this, but I couldn't manage to use them, since computing each cell requires that the previous cells have already been computed.
I was wondering if there is a way to achieve the same result using *apply functions, or a way to vectorize this type of code so that it runs faster in R.
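As a small aside (not a full answer to the vectorization question): the two border-filling loops in the function above can be replaced by vectorized assignments, since each border cell depends only on its own index; only the inner recurrence is truly sequential.
# equivalent to the two loops that fill the first row and first column
alignment_matrix[1, -1] <- (1:len1) * gap
alignment_matrix[-1, 1] <- (1:len2) * gap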
In case anyone ever needs this, I wrote my own solution to this problem using the Rcpp package. The runtime for sequences of 500 characters went from about 3 minutes down to about 0.3 s.
I am posting the code for the part corresponding to the two nested loops shown in the question; I hope it will be useful to someone.
library(Rcpp)
cppFunction('IntegerMatrix rcpp_compute_matrices(IntegerMatrix Am, StringMatrix Dm,
                                                 StringVector seq1, StringVector seq2,
                                                 int gap, int miss, int match) {
  int nrow = Am.nrow(), ncol = Am.ncol();
  for (int i = 1; i < nrow; i++) {
    for (int j = 1; j < ncol; j++) {
      int vertical_score = Am(i-1, j) + gap;
      int horizontal_score = Am(i, j-1) + gap;
      int diagonal_score = 0;
      if (seq1[j-1] == seq2[i-1]) {
        diagonal_score = Am(i-1, j-1) + match;
      } else {
        diagonal_score = Am(i-1, j-1) + miss;
      }
      IntegerVector score = {vertical_score, horizontal_score, diagonal_score};
      int max_score = max(score);
      Am(i, j) = max_score;
    }
  }
  return Am;
}')
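A hedged sketch of how the compiled function might be wired into globalSequenceAlignment in place of the nested loops; the empty direction matrix Dm is my own placeholder, since the posted snippet accepts it but never uses it:
# inside globalSequenceAlignment, after the first row and column are filled
storage.mode(alignment_matrix) <- "integer"  # the C++ signature expects an IntegerMatrix
Dm <- matrix("", nrow = nrow(alignment_matrix), ncol = ncol(alignment_matrix))  # unused placeholder
alignment_matrix <- rcpp_compute_matrices(alignment_matrix, Dm,
                                          seq1_split, seq2_split,
                                          gap, mismatch, match)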

Why is this recursive function exceeding call stack size?

I'm trying to write a function to find the lowest number that all integers between 1 and 20 divide. (Let's call this Condition D)
Here's my solution, which is somehow exceeding the call stack size limit.
function findSmallest(num){
  var count = 2
  while (count < 21) {
    count++
    if (num % count !== 0) {
      // exit the loop
      return findSmallest(num++)
    }
  }
  return num
}
console.log(findSmallest(20))
Somewhere my reasoning on this is faulty but here's how I see it (please correct me where I'm wrong):
Calling this function with a number N that doesn't meet Condition D will result in the function being called again with N + 1. Eventually, when it reaches a number M that should satisfy Condition D, the while loop runs all the way through and the number M is returned by the function and there are no more recursive calls.
But I get this error on running it:
function findSmallest(num){
^
RangeError: Maximum call stack size exceeded
I know errors like this are almost always due to recursive functions not reaching a base case. Is this the problem here, and if so, where's the problem?
I found two bugs.
In your while loop, count runs from 3 to 21, because it is incremented before the divisibility check.
The value of num passed on is not actually advanced: num++ hands the old value to the recursive call, so it should be num + 1.
However, even if these bugs are fixed, the error is not solved.
The answer is 232792560.
That recursion depth is far too large, so the stack memory is exhausted.
For example, this code causes the same error:
function foo (num) {
  if (num === 0) return
  else foo(num - 1)
}
foo(232792560)
Writing the function without recursion avoids the error.
Your problem is that you enter the recursion more than 200 million times (on top of the bug spotted in the previous answer). The number you are looking for is the least common multiple of the range, i.e. the product of every prime raised to the maximum power with which it occurs in any number of the range. So here is your solution:
function findSmallestDivisible(n) {
  if (n < 2 || n > 100) {
    throw "Numbers between 2 and 100 please";
  }
  var arr = new Array(n), res = 2;
  arr[0] = 1;
  arr[1] = 2;
  for (var i = 2; i < arr.length; i++) {
    arr[i] = fix(i, arr);
    res *= arr[i];
  }
  return res;
}

function fix(idx, arr) {
  var res = idx + 1;
  for (var i = 1; i < idx; i++) {
    if ((res % arr[i]) == 0) {
      res /= arr[i];
    }
  }
  return res;
}
https://jsfiddle.net/7ewkeamL/
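To make the prime-factor argument concrete for the range 1..20: the relevant prime powers are 2^4, 3^2, 5, 7, 11, 13, 17 and 19, and 16 · 9 · 5 · 7 · 11 · 13 · 17 · 19 = 232792560, the value quoted in the previous answer.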

R: How to compute correlation between rows of a matrix without having to transpose it?

I have a big matrix and am interested in computing the correlation between the rows of the matrix. Since the cor method computes correlation between the columns of a matrix, I am transposing the matrix before calling cor. But since the matrix is big, transposing it is expensive and is slowing down my program. Is there a way to compute the correlations among the rows without having to take transpose?
EDIT: Thanks for the responses, I thought I'd share some findings. My input matrix is 16 rows by 239766 columns and comes from a .mat file. I wrote C# code to do the same thing using the csmatio library; it looks like this:
foreach (var file in Directory.GetFiles(path, interictal_pattern))
{
    var reader = new MatFileReader(file);
    var mla = reader.Data[0] as MLStructure;
    convert(mla.AllFields[0] as MLNumericArray<double>, data);
    double sum = 0;
    for (var i = 0; i < 16; i++)
    {
        for (var j = i + 1; j < 16; j++)
        {
            sum += cor(data, i, j);
        }
    }
    var avg = sum / 120;
    if (++count == 10)
    {
        var t2 = DateTime.Now;
        var t = t2 - t1;
        Console.WriteLine(t.TotalSeconds);
        break;
    }
}

static double[][] createArray(int rows, int cols)
{
    var ans = new double[rows][];
    for (var row = 0; row < rows; row++)
    {
        ans[row] = new double[cols];
    }
    return ans;
}

static void convert(MLNumericArray<double> mla, double[][] M)
{
    var rows = M.Length;
    var cols = M[0].Length;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            M[i][j] = mla.Get(i, j);
}

static double cor(double[][] M, int i, int j)
{
    var count = M[0].Length;
    double sum1 = 0, sum2 = 0;
    for (int ctr = 0; ctr < count; ctr++)
    {
        sum1 += M[i][ctr];
        sum2 += M[j][ctr];
    }
    var mu1 = sum1 / count;
    var mu2 = sum2 / count;
    double numerator = 0, sumOfSquares1 = 0, sumOfSquares2 = 0;
    for (int ctr = 0; ctr < count; ctr++)
    {
        var x = M[i][ctr] - mu1;
        var y = M[j][ctr] - mu2;
        numerator += x * y;
        sumOfSquares1 += x * x;
        sumOfSquares2 += y * y;
    }
    return numerator / Math.Sqrt(sumOfSquares1 * sumOfSquares2);
}
This gave a throughput of 22.22 s for 10 files, or 2.22 s/file.
Then I profiled my R code:
ptm = proc.time()
for (file in files)
{
  i = i + 1
  mat = readMat(paste(path, file, sep = ""))
  a = t(mat[[1]][[1]])
  C = cor(a)
  correlations[i] = mean(C[lower.tri(C)])
}
print(proc.time() - ptm)
To my surprise it runs faster than the C#, giving a throughput of 5.7 s per 10 files, or 0.6 s/file (an improvement of almost 4x!). The bottleneck in C# is the methods inside the csmatio library that parse double values from the input stream.
And if I do not convert the csmatio classes into a double[][], the C# code runs extremely slowly (an order of magnitude slower, ~20-30 s/file).
Seeing that this problem arises from a data input issue whose details are not stated (and only hinted at in a comment), I will assume this is a comma-delimited file of unquoted numbers with the number of columns equal to Ncol. This does the transposition on input.
in.mat <- matrix(scan("path/to/the_file/fil.txt", what = numeric(0), sep = ","),
                 ncol = Ncol, byrow = TRUE)
cor(in.mat)
One dirty work-around would be to apply cor row-wise and build the correlation matrix from the results. You could try whether this is any more efficient (which I doubt, though you could fine-tune it by not computing everything twice and by skipping the redundant diagonal cases):
# Apply 2-fold nested row-wise functions
set.seed(1)
dat <- matrix(rnorm(1000), nrow=10)
cormat <- apply(dat, MARGIN=1, FUN=function(z) apply(dat, MARGIN=1, FUN=function(y) cor(z, y)))
cormat[1:3,1:3] # Show few first
# [,1] [,2] [,3]
#[1,] 1.000000000 0.002175792 0.1559263
#[2,] 0.002175792 1.000000000 -0.1870054
#[3,] 0.155926259 -0.187005418 1.0000000
Though, generally I would expect the transpose to have a really efficient implementation, so it is hard to imagine that it would be the bottleneck. You could also dig through the implementation of the cor function and call the underlying C function directly, after first making sure your rows are in a suitable form. Type cor at the R prompt to see the implementation, which is mostly a wrapper that makes the input suitable for the C function:
# Row with C-call from the implementation of 'cor':
# if (method == "pearson")
# .Call(C_cor, x, y, na.method, FALSE)
You can use outer:
outer(seq(nrow(mat)), seq(nrow(mat)),
      Vectorize(function(x, y) cor(mat[x, ], mat[y, ])))
where mat is the name of your matrix.
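As a quick sanity check (on a small random matrix of my own, not the asker's data), the outer formulation agrees with the transpose-based call:
set.seed(1)
mat <- matrix(rnorm(200), nrow = 8)
res <- outer(seq(nrow(mat)), seq(nrow(mat)),
             Vectorize(function(x, y) cor(mat[x, ], mat[y, ])))
all.equal(res, cor(t(mat)), check.attributes = FALSE)  # should be TRUE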

Find closest value in a vector with binary search

As a silly toy example, suppose
x=4.5
w=c(1,2,4,6,7)
I wonder if there is a simple R function that finds the index of the closest match to x in w. So if foo is that function, foo(w,x) would return 3. The function match is the right idea, but seems to apply only for exact matches.
Solutions here (e.g. which.min(abs(w - x)), which(abs(w-x)==min(abs(w-x))), etc.) are all O(n) instead of O(log n) (I'm assuming that w is already sorted).
R>findInterval(4.5, c(1,2,4,5,6))
[1] 3
will do that with price-is-right matching (closest without going over).
You can use data.table to do a binary search:
dt = data.table(w, val = w) # you'll see why val is needed in a sec
setattr(dt, "sorted", "w") # let data.table know that w is sorted
Note that if the column w isn't already sorted, then you'll have to use setkey(dt, w) instead of setattr(.).
# binary search and "roll" to the nearest neighbour
dt[J(x), roll = "nearest"]
# w val
#1: 4.5 4
In the final expression the val column will have the value you're looking for.
# or to get the index as Josh points out
# (and then you don't need the val column):
dt[J(x), .I, roll = "nearest", by = .EACHI]
# w .I
#1: 4.5 3
# or to get the index alone
dt[J(x), roll = "nearest", which = TRUE]
#[1] 3
See match.closest() from the MALDIquant package:
> library(MALDIquant)
> match.closest(x, w)
[1] 3
x = 4.5
w = c(1,2,4,6,7)
closestLoc = which.min(abs(w - x))
closestVal = w[closestLoc]
# On my phone- please pardon typos
If your vector is lengthy, try a 2-step approach:
x = 4.5
w = c(1,2,4,6,7)
sdev = sapply(w, function(v, x) abs(v - x), x = x)
closestLoc = which.min(sdev)
For maddeningly long vectors (millions of rows!), try parallel processing (warning: this will actually be slower for data that is not very, very, very large):
require(doMC)
registerDoMC()
closestLoc = which.min(unlist(foreach(i = w) %dopar% {
  abs(i - x)
}))
This example is just to give you a basic idea of leveraging parallel processing when you have huge data. Note, I do not recommend you use it for simple & fast functions like abs().
To do this on character vectors, Martin Morgan suggested this function on R-help:
bsearch7 <-
    function(val, tab, L = 1L, H = length(tab))
{
    b <- cbind(L = rep(L, length(val)), H = rep(H, length(val)))
    i0 <- seq_along(val)
    repeat {
        updt <- M <- b[i0, "L"] + (b[i0, "H"] - b[i0, "L"]) %/% 2L
        tabM <- tab[M]
        val0 <- val[i0]
        i <- tabM < val0
        updt[i] <- M[i] + 1L
        i <- tabM > val0
        updt[i] <- M[i] - 1L
        b[i0 + i * length(val)] <- updt
        i0 <- which(b[i0, "H"] >= b[i0, "L"])
        if (!length(i0)) break
    }
    b[, "L"] - 1L
}
NearestValueSearch = function(x, w) {
  ## A simple binary search algo
  ## Assume the w vector is sorted so we can use binary search
  left = 1
  right = length(w)
  while (right - left > 1) {
    middle = floor((left + right) / 2)
    if (x < w[middle]) {
      right = middle
    } else {
      left = middle
    }
  }
  if (abs(x - w[right]) < abs(x - w[left])) {
    return(right)
  } else {
    return(left)
  }
}
x = 4.5
w = c(1,2,4,6,7)
NearestValueSearch(x, w) # return 3
Based on @neal-fultz's answer, here is a simple function that uses findInterval():
get_closest_index <- function(x, vec) {
  # vec must be sorted
  iv <- findInterval(x, vec)
  dist_left <- x - vec[ifelse(iv == 0, NA, iv)]
  dist_right <- vec[iv + 1] - x
  ifelse(!is.na(dist_left) & (is.na(dist_right) | dist_left < dist_right), iv, iv + 1)
}

values <- c(-15, -0.01, 3.1, 6, 10, 100)
grid <- c(-2, -0.1, 0.1, 3, 7)
get_closest_index(values, grid)
#> [1] 1 2 4 5 5 5
Created on 2020-05-29 by the reprex package (v0.3.0)
You can always implement a custom binary search algorithm to find the closest value. Alternatively, you can leverage the standard implementation of libc bsearch(). You can use other binary search implementations as well, but that does not change the fact that you have to implement the comparison function carefully to find the closest element in the array. The issue with a standard binary search implementation is that it is meant for exact comparison. That means your improvised comparison function needs to figure out whether an element in the array is close enough. To achieve this, the comparison function needs awareness of other elements in the array, especially the following aspects:
the position of the current element (the one being compared with the key);
the distance to the key and how it compares with the neighbours (previous or next element).
To provide this extra knowledge to the comparison function, the key needs to be packaged with additional information (not just the key value). Once the comparison function is aware of these aspects, it can figure out whether the element itself is the closest. When it knows that it is the closest, it returns "match".
The following C code finds the closest value.
#include <stdio.h>
#include <stdlib.h>

struct key {
    int key_val;
    int *array_head;
    int array_size;
};

int compar(const void *k, const void *e) {
    struct key *key = (struct key*)k;
    int *elem = (int*)e;
    int *arr_first = key->array_head;
    int *arr_last = key->array_head + key->array_size - 1;
    int kv = key->key_val;
    int dist_left;
    int dist_right;

    if (kv == *elem) {
        /* easy case: if both same, got to be closest */
        return 0;
    } else if (key->array_size == 1) {
        /* easy case: only element got to be closest */
        return 0;
    } else if (elem == arr_first) {
        /* element is the first in array */
        if (kv < *elem) {
            /* if keyval is less than the first element then
             * the first elem is closest.
             */
            return 0;
        } else {
            /* check distance between first and 2nd elem.
             * if distance with first elem is smaller, it is closest.
             */
            dist_left = kv - *elem;
            dist_right = *(elem + 1) - kv;
            return (dist_left <= dist_right) ? 0 : 1;
        }
    } else if (elem == arr_last) {
        /* element is the last in array */
        if (kv > *elem) {
            /* if keyval is larger than the last element then
             * the last elem is closest.
             */
            return 0;
        } else {
            /* check distance between last and last-but-one.
             * if distance with last elem is smaller, it is closest.
             */
            dist_left = kv - *(elem - 1);
            dist_right = *elem - kv;
            return (dist_right <= dist_left) ? 0 : -1;
        }
    }

    /* condition for remaining cases (other cases are handled already):
     * - elem is neither first nor last in the array
     * - array has at least three elements.
     */
    if (kv < *elem) {
        /* keyval is smaller than elem */
        if (kv <= *(elem - 1)) {
            /* keyval is smaller than the previous (of "elem") too.
             * hence, elem cannot be closest.
             */
            return -1;
        } else {
            /* check distance between elem and elem-prev.
             * if distance with elem is smaller, it is closest.
             */
            dist_left = kv - *(elem - 1);
            dist_right = *elem - kv;
            return (dist_right <= dist_left) ? 0 : -1;
        }
    }

    /* remaining case: (keyval > *elem) */
    if (kv >= *(elem + 1)) {
        /* keyval is larger than the next (of "elem") too.
         * hence, elem cannot be closest.
         */
        return 1;
    }
    /* check distance between elem and elem-next.
     * if distance with elem is smaller, it is closest.
     */
    dist_right = *(elem + 1) - kv;
    dist_left = kv - *elem;
    return (dist_left <= dist_right) ? 0 : 1;
}

int main(int argc, char **argv) {
    int arr[] = {10, 20, 30, 40, 50, 60, 70};
    int *found;
    struct key k;

    if (argc < 2) {
        return 1;
    }
    k.key_val = atoi(argv[1]);
    k.array_head = arr;
    k.array_size = sizeof(arr) / sizeof(int);

    found = (int*)bsearch(&k, arr, sizeof(arr) / sizeof(int), sizeof(int),
                          compar);
    if (found) {
        printf("found closest: %d\n", *found);
    } else {
        printf("closest not found. absurd!\n");
    }
    return 0;
}
Needless to say, bsearch() in the above example should never fail (unless the array size is zero).
If you implement your own custom binary search, you essentially have to embed the same comparison logic in the main body of the binary search code (instead of having it in a comparison function, as in the example above).

Handling large groups of numbers

Project Euler problem 14:
The following iterative sequence is defined for the set of positive integers:
n → n/2 (n is even)
n → 3n + 1 (n is odd)
Using the rule above and starting with 13, we generate the following sequence: 13 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1
It can be seen that this sequence (starting at 13 and finishing at 1) contains 10 terms. Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1.
Which starting number, under one million, produces the longest chain?
My first instinct is to create a function to calculate the chains and run it with every number between 1 and 1 million. Obviously, that takes a long time, way longer than solving this should take according to Project Euler's "About" page. I've found several problems on Project Euler involving large groups of numbers where a program running for hours didn't finish. Clearly, I'm doing something wrong.
How can I handle large groups of numbers quickly?
What am I missing here?
Have a read about memoization. The key insight is that if you've got a sequence starting at A that has length 1001, and you then get a sequence B that produces an A, you don't have to repeat all that work again.
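To make the memoization idea concrete, here is a minimal R sketch (R being the main language of this thread); it is my own sketch rather than the answerer's code, and plain R is still much slower than the compiled versions further down:
collatz_lengths <- function(limit = 1e6) {
  cache <- integer(limit)  # cache[n] holds the chain length starting at n; 0 means unknown
  cache[1] <- 1L
  for (start in 2:limit) {
    n <- start
    steps <- 0L
    # follow the chain until we reach a value whose length is already cached
    while (n > limit || cache[n] == 0L) {
      n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
      steps <- steps + 1L
    }
    cache[start] <- steps + cache[n]
  }
  cache
}

lens <- collatz_lengths(1e6)  # note: this takes a while in interpreted R
which.max(lens)  # starting number with the longest chain
max(lens)        # its chain length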
This is the code in Mathematica, using memoization and recursion. Just four lines :)
f[x_] := f[x] = If[x == 1, 1, 1 + f[If[EvenQ[x], x/2, (3 x + 1)]]];
Block[{$RecursionLimit = 1000, a = 0, j},
  Do[If[a < f[i], a = f[i]; j = i], {i, Reverse@Range@10^6}];
  Print@a; Print[j];
]
Output: chain length 525, and the number is ... ohhhh ... font too small! :)
BTW, here you can see a plot of the frequency for each chain length
Starting with 1,000,000, generate the chain. Keep track of each number that was generated in the chain, as you know for sure that their chains are shorter than the chain for the starting number. Once you reach 1, store the starting number along with its chain length. Take the next biggest number that has not been generated before, and repeat the process.
This will give you the list of numbers and chain length. Take the greatest chain length, and that's your answer.
I'll make some code to clarify.
public static long nextInChain(long n) {
    if (n == 1) return 1;
    if (n % 2 == 0) {
        return n / 2;
    } else {
        return (3 * n) + 1;
    }
}

public static void main(String[] args) {
    long iniTime = System.currentTimeMillis();
    HashSet<Long> numbers = new HashSet<Long>();
    HashMap<Long, Long> lenghts = new HashMap<Long, Long>();
    long currentTry = 1000000l;
    int i = 0;
    do {
        doTry(currentTry, numbers, lenghts);
        currentTry = findNext(currentTry, numbers);
        i++;
    } while (currentTry != 0);
    Set<Long> longs = lenghts.keySet();
    long max = 0;
    long key = 0;
    for (Long aLong : longs) {
        if (max < lenghts.get(aLong)) {
            key = aLong;
            max = lenghts.get(aLong);
        }
    }
    System.out.println("number = " + key);
    System.out.println("chain lenght = " + max);
    System.out.println("Elapsed = " + ((System.currentTimeMillis() - iniTime) / 1000));
}

private static long findNext(long currentTry, HashSet<Long> numbers) {
    for (currentTry = currentTry - 1; currentTry >= 0; currentTry--) {
        if (!numbers.contains(currentTry)) return currentTry;
    }
    return 0;
}

private static void doTry(Long tryNumber, HashSet<Long> numbers, HashMap<Long, Long> lenghts) {
    long i = 1;
    long n = tryNumber;
    do {
        numbers.add(n);
        n = nextInChain(n);
        i++;
    } while (n != 1);
    lenghts.put(tryNumber, i);
}
Suppose you have a function CalcDistance(i) that calculates the "distance" to 1. For instance, CalcDistance(1) == 0 and CalcDistance(13) == 9. Here is a naive recursive implementation of this function (in C#):
public static int CalcDistance(long i)
{
    if (i == 1)
        return 0;
    return (i % 2 == 0) ? CalcDistance(i / 2) + 1 : CalcDistance(3 * i + 1) + 1;
}
The problem is that this function has to calculate the distance of many numbers over and over again. You can make it a little bit smarter (and a lot faster) by giving it a memory. For instance, let's create a static array that can store the distance for the first million numbers:
static int[] list = new int[1000000];
We prefill each value in the list with -1 to indicate that the value for that position is not yet calculated. After this, we can optimize the CalcDistance() function:
public static int CalcDistance(long i)
{
    if (i == 1)
        return 0;
    if (i >= 1000000)
        return (i % 2 == 0) ? CalcDistance(i / 2) + 1 : CalcDistance(3 * i + 1) + 1;
    if (list[i] == -1)
        list[i] = (i % 2 == 0) ? CalcDistance(i / 2) + 1 : CalcDistance(3 * i + 1) + 1;
    return list[i];
}
If i >= 1000000, then we cannot use our list, so we must always calculate it. If i < 1000000, then we check if the value is in the list. If not, we calculate it first and store it in the list. Otherwise we just return the value from the list. With this code, it took about 120 ms to process all million numbers.
This is a very simple example of memoization. I use a simple list to store intermediate values in this example. You can use more advanced data structures like hashtables, vectors or graphs when appropriate.
Minimize how many levels deep your loops are, and use an efficient data structure such as IList or IDictionary, that can auto-resize itself when it needs to expand. If you use plain arrays they need to be copied to larger arrays as they expand - not nearly as efficient.
This variant doesn't use a HashMap but only tries to avoid repeating the first 1,000,000 numbers. I don't use a hash map because the biggest number found is around 56 billion, and a hash map could crash.
I have already done some premature optimization: instead of / I use >>, instead of % I use &, and instead of * I use some +.
void Main()
{
    var elements = new bool[1000000];
    int longestStart = -1;
    int longestRun = -1;
    long biggest = 0;
    for (int i = elements.Length - 1; i >= 1; i--) {
        if (elements[i]) {
            continue;
        }
        elements[i] = true;
        int currentStart = i;
        int currentRun = 1;
        long current = i;
        while (current != 1) {
            if (current > biggest) {
                biggest = current;
            }
            if ((current & 1) == 0) {
                current = current >> 1;
            } else {
                current = current + current + current + 1;
            }
            currentRun++;
            if (current < elements.Length) {
                elements[current] = true;
            }
        }
        if (currentRun > longestRun) {
            longestStart = i;
            longestRun = currentRun;
        }
    }
    Console.WriteLine("Longest Start: {0}, Run {1}", longestStart, longestRun);
    Console.WriteLine("Biggest number: {0}", biggest);
}
