Julia for loops slower than while? - julia

I'm playing around with the Julia language, and noticed that the small program I wrote was quite slow.
Suspecting that it was somehow related to the for loops, I rewrote it to use while, and it became about 15x faster.
I'm sure there's something I'm doing wrong with the ranges etc., but I can't figure out what.
function primes_for()
num_primes = 0
for a = 2:3000000
prime = true
sa = floor(sqrt(a))
for c in 2:sa
if a % c == 0
prime = false
break
end
end
if prime
num_primes += 1
end
end
println("Number of primes is $num_primes")
end
function primes()
num_primes = 0
a = 2
while a < 3000000
prime = true
c = 2
while c * c <= a
if a % c == 0
prime = false
break
end
c += 1
end
if prime
num_primes += 1
end
a += 1
end
println("Number of primes is $num_primes")
end
@time primes_for()
@time primes()

As explained in the comments by @Vincent Yu and @Kelly Bundy, this is because sa = floor(sqrt(a)) creates a float. Then c becomes a float, and a % c is slow.
You can replace floor(sqrt(a)) with floor(Int, sqrt(a)), or preferably, I think, with isqrt(a), which returns
Integer square root: the largest integer m such that m*m <= n.
This avoids the (unlikely) event that floor(Int, sqrt(a)) may round down too far, which could happen if sqrt(x^2) = x - ε due to floating point errors.
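A small sketch of the type difference described above, written for Julia 1.x (the variable names are only illustrative and do not come from the question):

```julia
a = 3_000_000

sa_float = floor(sqrt(a))        # floor of a Float64 stays a Float64
sa_int   = floor(Int, sqrt(a))   # converts the floored value to an Int
sa_isqrt = isqrt(a)              # integer square root, never leaves Int

println(typeof(sa_float))        # Float64
println(eltype(2:sa_float))      # Float64: the range, and therefore c, is promoted
println(typeof(sa_isqrt))        # Int64 on a 64-bit system
println(sa_int == sa_isqrt)      # true for this value of a
```

With the float bound, every `a % c` in the inner loop mixes an integer with a float, which is where the time goes.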
Edit: Here's a benchmark to demonstrate (note the use of isqrt):
function primes_for2()
num_primes = 0
for a = 2:3000000
prime = true
# sa = floor(sqrt(a))
sa = isqrt(a)
for c in 2:sa
if a % c == 0
prime = false
break
end
end
if prime
num_primes += 1
end
end
println("Number of primes is $num_primes")
end
1.7.0> @time primes_for()
Number of primes is 216816
6.705099 seconds (15 allocations: 480 bytes)
1.7.0> @time primes_for2()
Number of primes is 216816
0.691304 seconds (15 allocations: 480 bytes)
1.7.0> @time primes()
Number of primes is 216816
0.671784 seconds (15 allocations: 480 bytes)
I can note that each call to isqrt on my computer takes approximately 8ns, and that 3000000 times 8ns is 0.024 seconds. A call to regular sqrt is approximately 1ns.

It's not the for/while that makes the speed difference, it's the sqrt. It doesn't help that sqrt returns float, which promotes all the rest of the code around the sqrt output from integers.
Note that @time is not measuring only the while and for loops, but also the code outside those loops.
If you are benchmarking code, the rest of your code needs to be the same, and removing the sqrt is one of the prime optimizations in this algorithm. It's also possible to remove the c * c in the test, but this is trickier.
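One possible reading of "removing the c * c" is to keep a running square and update it with the identity (c + 1)^2 = c^2 + 2c + 1. The sketch below is an assumption about what was meant, not the answerer's code; the function name `count_primes_incremental` and the `limit` parameter are hypothetical.

```julia
# Trial division where the square of c is maintained incrementally
# instead of being recomputed with a multiplication on every test.
function count_primes_incremental(limit::Int)
    num_primes = 0
    for a in 2:limit
        prime = true
        c = 2
        csq = 4                  # invariant: csq == c * c
        while csq <= a
            if a % c == 0
                prime = false
                break
            end
            csq += 2c + 1        # (c + 1)^2 == c^2 + 2c + 1
            c += 1
        end
        prime && (num_primes += 1)
    end
    return num_primes
end

println(count_primes_incremental(100))   # 25
```

Whether this is measurably faster than `c * c` would need to be benchmarked; integer multiplication is cheap on current hardware.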

Related

Julia: Searching for a column in a sorted matrix

I have a matrix that is sorted like the one shown below
1 1 2 2 3
1 2 3 4 1
2 1 2 1 1
It's a bit hard for me to describe the ordering, but hopefully it's clear from the example. The rough idea is that we first sort on the first row, then the second, etc.
I would like to find a specific column in the matrix, and that column may or may not exist in it.
I tried the following code:
index = searchsortedfirst(1:total_cols, col, lt=(index,x) -> (matrix[:, index] < x))
The above code works, but it is slow. I profiled the code, and it spends a lot of time in "_get_index". I then tried the following
@views index = searchsortedfirst(1:total_cols, col, lt=(index,x) -> (matrix[:, index] < x))
As expected this helped a lot, likely due to the slices I'm taking. However, is there a better way to go about this? There still seems to be a lot of overhead, and I feel like there might be a cleaner way to write this, which would be easier to optimize.
However, I absolutely value speed over clarity.
Here is some code I wrote to compare binary vs. linear search.
using Profile
function test_search()
max_val = 20
rows = 4
matrix = rand(1:max_val, rows, 10^5)
matrix = Array{Int64,2}(sortslices(matrix, dims=2))
indices = @time @profile lin_search(matrix, rows, max_val, 10^3)
indices = @time @profile bin_search(matrix, rows, max_val, 10^3)
end
function bin_search(matrix, rows, max_val, repeats)
indices = zeros(repeats)
x = zeros(Int64, rows)
cols = size(matrix)[2]
for i = 1:repeats
x = rand(1:max_val, rows)
@inbounds @views index = searchsortedfirst(1:cols, x, lt=(index,x)->(matrix[:,index] < x))
indices[i] = index
end
return indices
end
function array_eq(matrix, index, y, rows)
for i=1:rows
@inbounds if view(matrix, i, index) != y[i]
return false
end
end
return true
end
function lin_search(matrix, rows, max_val, repeats)
indices = zeros(repeats)
x = zeros(Int64, rows)
cols = size(matrix)[2]
for i = 1:repeats
index = cols + 1
x = rand(1:max_val, rows)
for j=1:cols
if array_eq(matrix, j, x, rows)
index = j;
break
end
end
indices[i] = index
end
return indices
end
Profile.clear()
test_search()
Here is some sample output
0.041356 seconds (68.90 k allocations: 3.431 MiB)
0.070224 seconds (110.45 k allocations: 5.418 MiB)
After adding some more @inbounds, it looks like a linear search is faster than binary. Seems strange when there are 10^5 columns.
If speed is most important, why not simply use the fact that Julia allows you to write fast loops?
julia> function findcol(M, col)
@inbounds @views for c in axes(M, 2)
M[:,c] == col && return c
end
return nothing
end
findcol (generic function with 1 method)
julia> col = [2,3,2];
julia> M = [1 1 2 2 3;
1 2 3 4 1;
2 1 2 1 1];
julia> @btime findcol($M, $col)
32.854 ns (3 allocations: 144 bytes)
3
This should probably be fast enough and does not even take into account any ordering.
I discovered two issues that, when fixed, make both the linear and the binary search much faster, and make the binary search faster than the linear one.
First, there was some type instability. I changed one of the lines to
matrix::Array{Int64,2} = Array{Int64,2}(sortslices(matrix, dims=2))
This resulted in an order of magnitude speedup. Also, it turns out that using @views does not do anything in the following code
@inbounds @views index = searchsortedfirst(1:cols, x, lt=(index,x)->(matrix[:,index] < x))
I am new to Julia, but my hunch is that matrix[:,index] is copied no matter what inside the anonymous function. This would make sense, since it allows for closures.
If I write a separate non-anonymous function, then that copy goes away. Linear search didn't copy the slices, so this also really sped up the binary search.
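A minimal sketch of the "separate non-anonymous function" idea: compare a column against the query element by element, so no slice is ever materialized, and run a hand-written binary search on top of it. The names `col_isless` and `find_sorted_col` are hypothetical, and the code assumes the columns are sorted lexicographically as in the question.

```julia
# true if column j of M is lexicographically smaller than x (no slice allocated)
function col_isless(M::AbstractMatrix, j::Integer, x::AbstractVector)
    for i in axes(M, 1)
        M[i, j] < x[i] && return true
        M[i, j] > x[i] && return false
    end
    return false                      # the column equals x
end

# Binary search for x among the sorted columns; returns the index or nothing
function find_sorted_col(M::AbstractMatrix, x::AbstractVector)
    length(x) == size(M, 1) || throw(DimensionMismatch("x must have one entry per row"))
    lo, hi = 1, size(M, 2) + 1
    while lo < hi
        mid = (lo + hi) >>> 1
        if col_isless(M, mid, x)
            lo = mid + 1
        else
            hi = mid
        end
    end
    # lo is now the first column that is not smaller than x; check for equality
    if lo <= size(M, 2) && all(M[i, lo] == x[i] for i in axes(M, 1))
        return lo
    end
    return nothing
end

M = [1 1 2 2 3;
     1 2 3 4 1;
     2 1 2 1 1]
println(find_sorted_col(M, [2, 3, 2]))   # 3
```

Returning `nothing` for a missing column is a design choice of this sketch; the original code returned `cols + 1` instead.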

how to change max recursion depth in Julia?

I was curious how quick and accurate the algorithm from Rosetta Code ( https://rosettacode.org/wiki/Ackermann_function ) could be for the parameters (4,2), but I got a StackOverflowError.
julia> using Memoize
@memoize ack3(m, n) =
m == 0 ? n + 1 :
n == 0 ? ack3(m-1, 1) :
ack3(m-1, ack3(m, n-1))
# WARNING! Next line has to calculate and print number with 19729 digits!
julia> ack3(4,2) # -> StackOverflowError
# has to be -> 2003529930406846464979072351560255750447825475569751419265016973710894059556311
# ...
# 4717124577965048175856395072895337539755822087777506072339445587895905719156733
EDIT:
Oscar Smith is right that trying ack3(4,2) is unrealistic. This is a version translated from Rosetta Code's C++:
module Ackermann
function ackermann(m::UInt, n::UInt)
function ack(m::UInt, n::BigInt)
if m == 0
return n + 1
elseif m == 1
return n + 2
elseif m == 2
return 3 + 2 * n;
elseif m == 3
return 5 + 8 * (BigInt(2) ^ n - 1)
else
if n == 0
return ack(m - 1, BigInt(1))
else
return ack(m - 1, ack(m, n - 1))
end
end
end
return ack(m, BigInt(n))
end
end
julia> import Ackermann;Ackermann.ackermann(UInt(1),UInt(1));@time(a4_2 = Ackermann.ackermann(UInt(4),UInt(2)));t = "$a4_2"; println("len = $(length(t)) first_digits=$(t[1:20]) last digits=$(t[end-20:end])")
0.000041 seconds (57 allocations: 33.344 KiB)
len = 19729 first_digits=20035299304068464649 last digits=445587895905719156733
Julia itself does not have an internal limit to the stack size, but your operating system does. The exact limits here (and how to change them) will be system dependent. On my Mac (and I assume other POSIX-y systems), I can check and change the stack size of programs that get called by my shell with ulimit:
$ ulimit -s
8192
$ julia -q
julia> f(x) = x > 0 ? f(x-1) : 0 # a simpler recursive function
f (generic function with 1 method)
julia> f(523918)
0
julia> f(523919)
ERROR: StackOverflowError:
Stacktrace:
[1] f(::Int64) at ./REPL[1]:1 (repeats 80000 times)
$ ulimit -s 16384
$ julia -q
julia> f(x) = x > 0 ? f(x-1) : 0
f (generic function with 1 method)
julia> f(1048206)
0
julia> f(1048207)
ERROR: StackOverflowError:
Stacktrace:
[1] f(::Int64) at ./REPL[1]:1 (repeats 80000 times)
I believe the exact number of recursive calls that will fit on your stack will depend upon both your system and the complexity of the function itself (that is, how much each recursive call needs to store on the stack). This is the bare minimum. I have no idea how big you'd need to make the stack limit in order to compute that Ackermann function.
Note that I doubled the stack size and it more than doubled the number of recursive calls — this is because of a constant overhead:
julia> log2(523918)
18.998981503278365
julia> 2^19 - 523918
370
julia> log2(1048206)
19.99949084151746
julia> 2^20 - 1048206
370
Just FYI, even if you change the max recursion depth, you won't get the right answer, because Julia uses 64-bit integers and integer overflow will silently give wrong results. To have any hope of getting the right answer, you will have to use big ints. The next problem is that you probably don't want to memoize, as almost all of the computations are not repeated, and you would be computing the function for more than 10^19729 different inputs, which you really do not want to store.
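To illustrate the overflow point, here is a short sketch comparing fixed-width and arbitrary-precision arithmetic. It uses the closed form A(4, 2) = 2^65536 - 3, which follows from the m == 3 branch of the translated code above; the variable names are illustrative.

```julia
wrapped = Int64(2)^64        # 64-bit arithmetic wraps around silently
exact   = big(2)^64          # BigInt keeps every digit

println(wrapped)             # 0
println(exact)               # 18446744073709551616

a42 = big(2)^65536 - 3       # closed form for ackermann(4, 2)
println(ndigits(a42))        # 19729
```

The digit count agrees with the `len = 19729` reported by the translated version.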

Julia significantly slower with @parallel

I have this code(primitive heat transfer):
function heat(first, second, m)
@sync @parallel for d = 2:m - 1
for c = 2:m - 1
@inbounds second[c,d] = (first[c,d] + first[c+1, d] + first[c-1, d] + first[c, d+1] + first[c, d-1]) / 5.0;
end
end
end
m = parse(Int,ARGS[1]) #size of matrix
firstm = SharedArray(Float64, (m,m))
secondm = SharedArray(Float64, (m,m))
for c = 1:m
for d = 1:m
if c == m || d == 1
firstm[c,d] = 100.0
secondm[c,d] = 100.0
else
firstm[c,d] = 0.0
secondm[c,d] = 0.0
end
end
end
@time for i = 0:opak
heat(firstm, secondm, m)
firstm, secondm = secondm, firstm
end
This code gives good times when run sequentially, but when I add @parallel it slows down, even if I run it on one thread. I just need an explanation of why this is happening. Please post code only if it doesn't change the algorithm of the heat function.
Have a look at http://docs.julialang.org/en/release-0.4/manual/performance-tips/ . Contrary to the advice there, you make heavy use of global variables. They are assumed to be able to change type at any time, so they have to be boxed and unboxed every time they are referenced. The question "Julia pi aproximation slow" suffers from the same problem. To make your function faster, pass global variables as input arguments to the function.
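A minimal serial sketch of that advice in Julia 1.x syntax: the time-stepping loop lives inside a function and receives everything it needs as arguments, so no untyped globals are touched in the hot loop. The names `heat_step!`, `run_heat!`, `init_plate` and the `steps` parameter are hypothetical, plain arrays replace the SharedArrays, and nothing here says anything about the parallel case.

```julia
# One Jacobi-style sweep: same stencil as the question, reading src and writing dst
function heat_step!(dst, src, m)
    for d in 2:m-1, c in 2:m-1
        dst[c, d] = (src[c, d] + src[c+1, d] + src[c-1, d] +
                     src[c, d+1] + src[c, d-1]) / 5.0
    end
    return dst
end

# The outer loop is wrapped in a function instead of running at global scope
function run_heat!(a, b, m, steps)
    for _ in 1:steps
        heat_step!(b, a, m)
        a, b = b, a              # swap the roles of the two buffers
    end
    return a                     # buffer holding the latest state
end

# Same boundary condition as the question: last row and first column at 100.0
function init_plate(m)
    plate = zeros(m, m)
    plate[m, :] .= 100.0
    plate[:, 1] .= 100.0
    return plate
end

m = 4
result = run_heat!(init_plate(m), init_plate(m), m, 1)
println(result[2, 2])            # 20.0
```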
There are some points to consider. One of them is the size of m. If it is small, parallelism would give much overhead for not a big gain:
julia 36967257.jl 4
# Parallel:
0.040434 seconds (4.44 k allocations: 241.606 KB)
# Normal:
0.042141 seconds (29.13 k allocations: 1.308 MB)
For bigger m you could have better results:
julia 36967257.jl 4000
# Parallel:
0.054848 seconds (4.46 k allocations: 241.935 KB)
# Normal:
3.779843 seconds (29.13 k allocations: 1.308 MB)
Plus two remarks:
1/ initialisation could be simplified to:
for c = 1:m, d = 1:m
if c == m || d == 1
firstm[c,d] = 100.0
secondm[c,d] = 100.0
else
firstm[c,d] = 0.0
secondm[c,d] = 0.0
end
end
2/ your finite difference scheme does not look stable. Please take a look at linear multistep methods or ADI/Crank-Nicolson.

Updating a dense vector by a sparse vector in Julia is slow

I am using Julia version 0.4.5 and I am experiencing the following issue:
As far as I know, taking inner product between a sparse vector and a dense vector should be as fast as updating the dense vector by a sparse vector. The latter one is much slower.
A = sprand(100000,100000,0.01)
w = rand(100000)
@time for i=1:100000
w += A[:,i]
end
26.304380 seconds (1.30 M allocations: 150.556 GB, 8.16% gc time)
@time for i=1:100000
A[:,i]'*w
end
0.815443 seconds (921.91 k allocations: 1.540 GB, 5.58% gc time)
I created a simple sparse matrix type of my own, and the addition code was ~ the same as the inner product.
Am I doing something wrong? I feel like there should be a special function doing the operation w += A[:,i], but I couldn't find it.
Any help is appreciated.
I asked the same question on GitHub and we came to the following conclusion. The type SparseVector was added as of Julia 0.4 and with it the BLAS function LinAlg.axpy!, which updates in-place a (possibly dense) vector x by a sparse vector y multiplied by a scalar a, i.e. performs x += a*y efficiently. However, in Julia 0.4 it is not implemented properly; it works only in Julia 0.5.
@time for i=1:100000
LinAlg.axpy!(1,A[:,i],w)
end
1.041587 seconds (799.49 k allocations: 1.530 GB, 8.01% gc time)
However, this code is still sub-optimal, as it creates the SparseVector A[:,i]. One can get an even faster version with the following function:
function upd!(w,A,i,c)
rowval = A.rowval
nzval = A.nzval
@inbounds for j = nzrange(A,i)
w[rowval[j]] += c* nzval[j]
end
return w
end
@time for i=1:100000
upd!(w,A,i,1)
end
0.500323 seconds (99.49 k allocations: 1.518 MB)
This is exactly what I needed to achieve, after some research we managed to get there, thanks everyone!
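For readers on Julia 1.x, the same idea can be sketched with the accessor functions exported by the SparseArrays standard library rather than the raw fields. This is an adaptation under that assumption, not the code from the GitHub discussion; the name `add_col!` is hypothetical.

```julia
using SparseArrays

# w += c * A[:, i] without building the intermediate SparseVector
function add_col!(w::AbstractVector, A::SparseMatrixCSC, i::Integer, c=1)
    rows = rowvals(A)            # row index of every stored entry
    vals = nonzeros(A)           # value of every stored entry
    for j in nzrange(A, i)       # positions of the stored entries in column i
        w[rows[j]] += c * vals[j]
    end
    return w
end

A = sparse([1, 3, 2], [1, 1, 2], [1.0, 2.0, 5.0], 3, 2)
w = zeros(3)
add_col!(w, A, 1, 2)
println(w)                       # [2.0, 0.0, 4.0]
```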
Assuming you want to compute w += c * A[:, i], there is an easy way to vectorize it:
>>> A = sprand(100000, 100000, 0.01)
>>> c = rand(100000)
>>> r1 = zeros(100000)
>>> @time for i = 1:100000
>>> r1 += A[:, i] * c[i]
>>> end
29.997412 seconds (1.90 M allocations: 152.077 GB, 12.73% gc time)
>>> @time r2 = sum(A .* c', 2);
1.191850 seconds (50 allocations: 1.493 GB, 0.14% gc time)
>>> all(r1 == r2)
true
First, create a vector c of the constants to multiply with. Then multiply the columns of A element-wise by the values of c (A .* c', which broadcasts internally). Last, reduce over the columns of A (the sum(.., 2) part).
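The column-scaled sum described above is the same quantity as the matrix-vector product A * c. A small sketch in Julia 1.x syntax, where the reduction dimension is given as `dims=2` (the tiny matrix is made up for illustration):

```julia
using SparseArrays

A  = sparse([1, 3, 2], [1, 1, 2], [1.0, 2.0, 5.0], 3, 2)
c  = [2.0, 10.0]
Ad = Matrix(A)                        # dense copy, used only for the reference loop

# Reference: accumulate c[i] * A[:, i] column by column
r_loop = zeros(3)
for i in 1:size(A, 2)
    r_loop .+= Ad[:, i] .* c[i]
end

r_sum = vec(sum(A .* c', dims=2))     # scale the columns, then sum along each row
r_mul = A * c                         # sparse matrix times dense vector

println(r_mul)                        # [2.0, 50.0, 4.0]
println(r_loop ≈ r_sum ≈ r_mul)       # true
```

The product form avoids the scaled temporary matrix that `A .* c'` creates.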

How can I do a bitwise-or reduction along an axis of a boolean array in Julia?

I'm trying to find the best way to do a bitwise-or reduction of a 3D boolean array of masks to 2D in Julia.
I can always write a for loop, of course:
x = randbool(3,3,3)
out = copy(x[:,:,1])
for i = 1:3
for j = 1:3
for k = 2:3
out[i,j] |= x[i,j,k]
end
end
end
But I'm wondering if there is a better way to do the reduction.
A simple answer would be
out = x[:,:,1] | x[:,:,2] | x[:,:,3]
but I did some benchmarking:
function simple(n,x)
out = x[:,:,1] | x[:,:,2]
for k = 3:n
@inbounds out |= x[:,:,k]
end
return out
end
function forloops(n,x)
out = copy(x[:,:,1])
for i = 1:n
for j = 1:n
for k = 2:n
@inbounds out[i,j] |= x[i,j,k]
end
end
end
return out
end
function forloopscolfirst(n,x)
out = copy(x[:,:,1])
for j = 1:n
for i = 1:n
for k = 2:n
@inbounds out[i,j] |= x[i,j,k]
end
end
end
return out
end
shorty(n,x) = |([x[:,:,i] for i in 1:n]...)
timholy(n,x) = any(x,3)
function runtest(n)
x = randbool(n,n,n)
@time out1 = simple(n,x)
@time out2 = forloops(n,x)
@time out3 = forloopscolfirst(n,x)
@time out4 = shorty(n,x)
@time out5 = timholy(n,x)
println(all(out1 .== out2))
println(all(out1 .== out3))
println(all(out1 .== out4))
println(all(out1 .== out5))
end
runtest(3)
runtest(500)
which gave the following results
# For 500
simple: 0.039403016 seconds (39716840 bytes allocated)
forloops: 6.259421683 seconds (77504 bytes allocated)
forloopscolfirst 1.809124505 seconds (77504 bytes allocated)
shorty: elapsed time: 0.050384062 seconds (39464608 bytes allocated)
timholy: 2.396887396 seconds (31784 bytes allocated)
So I'd go with simple or shorty
Try any(x, 3). Just typing a little more here so StackOverflow doesn't nix this response.
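On Julia 1.x the reduction dimension is passed as a keyword and randbool no longer exists, so the one-liner would look roughly like the sketch below. The array sizes are arbitrary and the explicit loop is only there to check the result.

```julia
x = rand(Bool, 4, 5, 6)

# OR along the third axis, then drop the singleton dimension to get a 4×5 result
out = dropdims(any(x, dims=3), dims=3)

# Reference result from an explicit loop
expected = falses(4, 5)
for k in axes(x, 3), j in axes(x, 2), i in axes(x, 1)
    expected[i, j] |= x[i, j, k]
end

println(out == expected)     # true
println(size(out))           # (4, 5)
```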
There are various standard optimization tricks and hints that can be applied, but the critical observation here is that Julia stores arrays in column-major rather than row-major order. For small arrays this is not easily seen, but when the arrays grow large it's telling. There is a function reduce that is optimized to apply a function (in this case OR) across a collection, but it comes at a cost. If the number of combining steps is relatively small, then it's better to simply loop. In all cases, minimizing the number of memory accesses is better overall. Below are various attempts at optimization with these two things in mind.
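A tiny illustration of what column-major order means in practice (a sketch with a made-up 2×3 array): consecutive memory locations run down a column, so the first index should vary fastest in the innermost loop.

```julia
A = reshape(collect(1:6), 2, 3)   # filled in memory order

println(A[2, 1])      # 2: the second element in memory is one row down
println(A[1, 2])      # 3: the next column starts after a full column of 2
println(strides(A))   # (1, 2): step 1 along rows, step 2 along columns
```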
Various Attempts and Observations
Initial function
Here's a function that takes your example and generalizes it.
function boolReduce1(x)
out = copy(x[:,:,1])
for i = 1:size(x,1)
for j = 1:size(x,2)
for k = 2:size(x,3)
out[i,j] |= x[i,j,k]
end
end
end
out
end
Creating a fairly large array, we can time its performance
julia> @time boolReduce1(b);
elapsed time: 42.372058096 seconds (1056704 bytes allocated)
Applying optimizations
Here's another similar version but with the standard type hints, use of @inbounds and inverting the loops.
function boolReduce2(b::BitArray{3})
a = BitArray{2}(size(b)[1:2]...)
for j = 1:size(b,2)
for i = 1:size(b,1)
@inbounds a[i,j] = b[i,j,1]
for k = 2:size(b,3)
@inbounds a[i,j] |= b[i,j,k]
end
end
end
a
end
And take the time
julia> @time boolReduce2(b);
elapsed time: 12.892392891 seconds (500520 bytes allocated)
The insight
The 2nd function is a lot faster, and also less memory is allocated because a temporary array wasn't created. But what if we simply take the first function and invert the array indexing?
function boolReduce3(x)
out = copy(x[:,:,1])
for j = 1:size(x,2)
for i = 1:size(x,1)
for k = 2:size(x,3)
out[i,j] |= x[i,j,k]
end
end
end
out
end
and take the time now
julia> @time boolReduce3(b);
elapsed time: 12.451501749 seconds (1056704 bytes allocated)
That's just as fast as the 2nd function.
Using reduce
There is a function called reduce that we can use to eliminate the 3rd loop. Its function is to repeatedly apply an operation on all of the elements with the result of the previous operation. This is exactly what we want.
function boolReduce4(b)
a = BitArray{2}(size(b)[1:2]...)
for j = 1:size(b,2)
for i = 1:size(b,1)
@inbounds a[i,j] = reduce(|,b[i,j,:])
end
end
a
end
Now take its time
julia> @time boolReduce4(b);
elapsed time: 15.828273008 seconds (1503092520 bytes allocated, 4.07% gc time)
That's OK, but not even as fast as the simple optimized original. The reason is all of the extra memory that was allocated: data has to be copied from all over the array to produce the input for reduce.
Combining things
But what if we push the insight as far as we can and reduce over the first index instead of the last?
function boolReduceX(b)
a = BitArray{2}(size(b)[2:3]...)
for j = 1:size(b,3)
for i = 1:size(b,2)
@inbounds a[i,j] = reduce(|,b[:,i,j])
end
end
a
end
And now create a similar array and time it.
julia> c = randbool(200,2000,2000);
julia> @time boolReduceX(c);
elapsed time: 1.877547669 seconds (927092520 bytes allocated, 21.66% gc time)
Resulting in a function 20x faster than the original version for large arrays. Pretty good.
But what if medium size?
If the size is very large then the above function appears best, but if the data set is smaller, the use of reduce doesn't pay enough back and the following is faster. Using a temporary variable speeds things up relative to version 2. Another version of boolReduceX using a loop instead of reduce (not shown here) was even faster.
function boolReduce5(b)
a = BitArray{2}(size(b)[1:2]...)
for j = 1:size(b,2)
for i = 1:size(b,1)
@inbounds t = b[i,j,1]
for k = 2:size(b,3)
@inbounds t |= b[i,j,k]
end
@inbounds a[i,j] = t
end
end
a
end
julia> b = randbool(2000,2000,20);
julia> c = randbool(20,2000,2000);
julia> @time boolReduceX(c);
elapsed time: 1.535334322 seconds (799092520 bytes allocated, 23.79% gc time)
julia> @time boolReduce5(b);
elapsed time: 0.491410981 seconds (500520 bytes allocated)
It is faster to devectorize. It's just a matter of how much work you want to put in. The naïve devectorized approach is slow because it's a BitArray: extracting contiguous regions and bitwise OR can both be done a 64-bit chunk at a time, but the naïve devectorized approach operates an element at a time. On top of that, indexing BitArrays is slow, both because there is a sequence of bit operations involved and because it can't presently be inlined due to the bounds check. Here's a strategy that is devectorized but exploits the structure of the BitArray. Most of the code is copy-pasted from copy_chunks! in bitarray.jl and I didn't try to prettify it (sorry!).
function devec(n::Int, x::BitArray)
src = x.chunks
out = falses(n, n)
dest = out.chunks
numbits = n*n
kd0 = 1
ld0 = 0
for j = 1:n
pos_s = (n*n)*(j-1)+1
kd1, ld1 = Base.get_chunks_id(numbits - 1)
ks0, ls0 = Base.get_chunks_id(pos_s)
ks1, ls1 = Base.get_chunks_id(pos_s + numbits - 1)
delta_kd = kd1 - kd0
delta_ks = ks1 - ks0
u = Base._msk64
if delta_kd == 0
msk_d0 = ~(u << ld0) | (u << (ld1+1))
else
msk_d0 = ~(u << ld0)
msk_d1 = (u << (ld1+1))
end
if delta_ks == 0
msk_s0 = (u << ls0) & ~(u << (ls1+1))
else
msk_s0 = (u << ls0)
end
chunk_s0 = Base.glue_src_bitchunks(src, ks0, ks1, msk_s0, ls0)
dest[kd0] |= (dest[kd0] & msk_d0) | ((chunk_s0 << ld0) & ~msk_d0)
delta_kd == 0 && continue
for i = 1 : kd1 - kd0
chunk_s1 = Base.glue_src_bitchunks(src, ks0 + i, ks1, msk_s0, ls0)
chunk_s = (chunk_s0 >>> (64 - ld0)) | (chunk_s1 << ld0)
dest[kd0 + i] |= chunk_s
chunk_s0 = chunk_s1
end
end
out
end
With Iain's benchmarks, this gives me:
simple: 0.051321131 seconds (46356000 bytes allocated, 30.03% gc time)
forloops: 6.226652258 seconds (92976 bytes allocated)
forloopscolfirst: 2.099381939 seconds (89472 bytes allocated)
shorty: 0.060194226 seconds (46387760 bytes allocated, 36.27% gc time)
timholy: 2.464298752 seconds (31784 bytes allocated)
devec: 0.008734413 seconds (31472 bytes allocated)
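The chunk-at-a-time idea behind devec can be illustrated in a much simpler special case. The sketch below assumes that each 2D slab holds a whole number of 64-bit chunks (size(x,1) * size(x,2) divisible by 64), so no masking or shifting is needed. It relies on the internal `chunks` field of BitArray, just as the answer does, and the name `or_slabs_aligned` is hypothetical.

```julia
# OR-reduce a BitArray{3} along its third axis, 64 bits at a time.
# Only valid when every slab x[:, :, k] starts on a chunk boundary.
function or_slabs_aligned(x::BitArray{3})
    n1, n2, n3 = size(x)
    slab_bits = n1 * n2
    slab_bits % 64 == 0 ||
        throw(ArgumentError("slab size must be a multiple of 64 bits"))
    nchunks = slab_bits ÷ 64
    out = falses(n1, n2)
    for k in 1:n3
        offset = (k - 1) * nchunks        # first chunk of slab k, zero-based
        for c in 1:nchunks
            out.chunks[c] |= x.chunks[offset + c]
        end
    end
    return out
end

x = falses(8, 8, 3)
x[1, 1, 1] = true
x[3, 4, 2] = true
x[8, 8, 3] = true
out = or_slabs_aligned(x)
println(count(out))      # 3
```

The general case in devec has to handle slabs that straddle chunk boundaries, which is what the mask and shift bookkeeping there is for.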

Resources