writedlm a matrix of Bool as a 0,1 matrix - julia

I have a matrix of Bool values, for example
x = bitrand(2,3)
If I try to save this to a file:
writedlm("mat.txt", x)
I get a matrix of true and false. I would like to get instead a matrix of 0 and 1 (where 0 replaces false and 1 replaces true). Is there a simple way to do this, perhaps by some options in writedlm, without writing the file line by line myself?

writedlm("mat.txt", map(Int8,x))
Takes each element of x and converts it to an integer using the Int8 function/constructor.
You could also use other integer types but Int8 is more memory efficient than for example Int64.

Try 1*x, it gets the numerical version (perhaps not super memory/time efficient, but good enough for non "big data" stuff). 0x1*x will get a UInt8 - more memory compact (but probably slower).
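As a minimal sketch of the conversions suggested above (assuming Julia 1.x, where `writedlm` lives in the `DelimitedFiles` standard library and `bitrand` in `Random`), the matrix is written to an in-memory buffer instead of "mat.txt" so the result can be inspected:

```julia
using DelimitedFiles, Random

x = bitrand(2, 3)                  # BitMatrix of random Bool values

io = IOBuffer()                    # in-memory stand-in for "mat.txt"
writedlm(io, Int8.(x))             # broadcast conversion, same values as map(Int8, x)
text = String(take!(io))           # tab-separated rows of 0s and 1s
print(text)

roundtrip = readdlm(IOBuffer(text), Int)   # read the text back as integers
```

The broadcast `Int8.(x)` and `map(Int8, x)` should produce the same array; which one reads better is a matter of taste.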

Another option that is slightly faster is to copy the array to UInt8 on the fly as Array{UInt8, ndims(x)}(x), rather than applying map:
julia> x = bitrand(100,100)
julia> a = map(UInt8, x)
julia> b = Array{UInt8, ndims(x)}(x)
julia> all(a .== b)
true
I ran some quick tests, and it is slightly faster the larger the matrices are (at least on my computer).
for i in [10, 100, 1_000, 10_000]
    x = bitrand(i,i)
    println("$i x $i")
    @time map(UInt8, x)
    @time Array{UInt8, ndims(x)}(x)
end
Outputs:
10 x 10
0.000002 seconds (2 allocations: 208 bytes)
0.000006 seconds (2 allocations: 208 bytes)
100 x 100
0.000053 seconds (2 allocations: 9.891 KB)
0.000018 seconds (2 allocations: 9.891 KB)
1000 x 1000
0.001945 seconds (5 allocations: 976.703 KB)
0.001490 seconds (5 allocations: 976.703 KB)
10000 x 10000
0.224491 seconds (5 allocations: 95.368 MB)
0.117774 seconds (5 allocations: 95.368 MB)


Optimize looping over a large string to reduce allocations

I am trying to loop over a string in Julia to parse it. I have a DefaultDict inside a struct, containing the number of times I have seen a particular character.
@with_kw mutable struct Metrics
...
nucleotides = DefaultDict{Char, Int64}(0)
...
end
I have written a function to loop over a string and increment the value of each character in the DefaultDict.
function compute_base_composition(sequence::String, metrics::Metrics)
for i in 1:sizeof(sequence)
metrics.nucleotides[sequence[i]] += 1
end
end
This function is called in a for loop because I need to do this for multiple strings (which can be up to 2 billion characters long). When I run the @time macro, I get this result:
@time compute_base_composition(sequence, metrics)
0.167172 seconds (606.20 k allocations: 15.559 MiB, 78.00% compilation time)
0.099403 seconds (1.63 M allocations: 24.816 MiB)
0.032346 seconds (633.24 k allocations: 9.663 MiB)
0.171382 seconds (3.06 M allocations: 46.751 MiB, 4.64% gc time)
As you can see, there are a lot of memory allocations for such a simple function. I have tried to change the for loop to something like for c in sequence but that didn't change much. Would there be a way to reduce them and make the function faster?
Work on bytes, not on Unicode chars
Use Vectors, not Dicts
Avoid untyped fields in containers
@with_kw struct MetricsB
    nucleotides::Vector{Int} = zeros(Int, 256)
end
function compute_base_composition(sequence::String, metrics::MetricsB)
    bs = Vector{UInt8}(sequence)
    for i in 1:length(bs)
        @inbounds metrics.nucleotides[bs[i]] += 1
    end
end
And a benchmark with a nice speedup of 90x:
julia> st = randstring(10_000_000);
julia> @time compute_base_composition(st, Metrics())
1.793991 seconds (19.94 M allocations: 304.213 MiB, 3.33% gc time)
julia> @time compute_base_composition(st, MetricsB())
0.019398 seconds (3 allocations: 9.539 MiB)
Actually you can almost totally avoid allocations with the following code:
function compute_base_composition2(sequence::String, metrics::MetricsB)
    pp = pointer(sequence)
    for i in 1:length(sequence)
        @inbounds metrics.nucleotides[Base.pointerref(pp, i, 1)] += 1
    end
end
and now:
julia> @time compute_base_composition2(st, MetricsB())
0.021161 seconds (1 allocation: 2.125 KiB)
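A pointer-free variant of the same idea can be sketched with `codeunits`, which (on Julia 0.7 and later) exposes the bytes of a `String` without copying it. The function name `count_bytes` is hypothetical, and the index is shifted by one so that a byte value of 0 also maps to a valid slot of the 256-element vector:

```julia
# Hypothetical helper: count how often each byte value occurs in a String.
function count_bytes(sequence::String)
    counts = zeros(Int, 256)            # one slot per possible byte value 0..255
    for b in codeunits(sequence)        # iterate the raw UInt8 bytes, no copy of the string
        counts[Int(b) + 1] += 1         # +1 because Julia arrays are 1-based
    end
    return counts
end

counts = count_bytes("ACGTACGA")
n_A = counts[Int(UInt8('A')) + 1]       # occurrences of 'A'
n_T = counts[Int(UInt8('T')) + 1]       # occurrences of 'T'
```

This only counts bytes, so it matches per-character counts for ASCII input such as nucleotide sequences, not for arbitrary Unicode text.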

Generic maximum/minimum function for complex numbers

In julia, one can find (supposedly) efficient implementations of the min/minimum and max/maximum over collections of real numbers.
As these concepts are not uniquely defined for complex numbers, I was wondering if a parametrized version of these functions was already implemented somewhere.
I am currently sorting elements of the array of interest, then taking the last value, which is as far as I know, much more costly than finding the value with the maximum absolute value (or something else).
This is mostly to reproduce the Matlab behavior of the max function over complex arrays.
Here is my current code
a = rand(ComplexF64,4)
b = sort(a,by = (x) -> abs(x))
c = b[end]
The probable function call would look like
c = maximum/minimum(a,by=real/imag/abs/phase)
EDIT Some performance tests in Julia 1.5.3 with the provided solutions
function maxby0(f,iter)
b = sort(iter,by = (x) -> f(x))
c = b[end]
end
function maxby1(f, iter)
reduce(iter) do x, y
f(x) > f(y) ? x : y
end
end
function maxby2(f, iter; default = zero(eltype(iter)))
isempty(iter) && return default
res, rest = Iterators.peel(iter)
fa = f(res)
for x in rest
fx = f(x)
if fx > fa
res = x
fa = fx
end
end
return res
end
compmax(CArray) = CArray[ (abs.(CArray) .== maximum(abs.(CArray))) .& (angle.(CArray) .== maximum( angle.(CArray))) ][1]
Main.isless(u1::ComplexF64, u2::ComplexF64) = abs2(u1) < abs2(u2)
function maxby5(arr)
arr_max = arr[argmax(map(abs, arr))]
end
a = rand(ComplexF64,10)
using BenchmarkTools
@btime maxby0(abs,$a)
@btime maxby1(abs,$a)
@btime maxby2(abs,$a)
@btime compmax($a)
@btime maximum($a)
@btime maxby5($a)
Output for a vector of length 10:
>841.653 ns (1 allocation: 240 bytes)
>214.797 ns (0 allocations: 0 bytes)
>118.961 ns (0 allocations: 0 bytes)
>Execution fails
>20.340 ns (0 allocations: 0 bytes)
>144.444 ns (1 allocation: 160 bytes)
Output for a vector of length 1000:
>315.100 μs (1 allocation: 15.75 KiB)
>25.299 μs (0 allocations: 0 bytes)
>12.899 μs (0 allocations: 0 bytes)
>Execution fails
>1.520 μs (0 allocations: 0 bytes)
>14.199 μs (1 allocation: 7.94 KiB)
Output for a vector of length 1000 (with all comparisons made with abs2):
>35.399 μs (1 allocation: 15.75 KiB)
>3.075 μs (0 allocations: 0 bytes)
>1.460 μs (0 allocations: 0 bytes)
>Execution fails
>1.520 μs (0 allocations: 0 bytes)
>2.211 μs (1 allocation: 7.94 KiB)
Some remarks :
Sorting clearly (and as expected) slows the operations
Using abs2 saves a lot of performance (expected as well)
To conclude :
As a built-in function will provide this in 1.7, I will avoid using the additional Main.isless method, though it is all things considered the most performing one, to not modify the core of my julia
The maxby1 and maxby2 allocate nothing
The maxby1 feels more readable
the winner is therefore Andrej Oskin!
EDIT n°2 a new benchmark using the corrected compmax implementation
julia> @btime maxby0(abs2,$a)
36.799 μs (1 allocation: 15.75 KiB)
julia> @btime maxby1(abs2,$a)
3.062 μs (0 allocations: 0 bytes)
julia> @btime maxby2(abs2,$a)
1.160 μs (0 allocations: 0 bytes)
julia> @btime compmax($a)
26.899 μs (9 allocations: 12.77 KiB)
julia> @btime maximum($a)
1.520 μs (0 allocations: 0 bytes)
julia> @btime maxby5(abs2,$a)
2.500 μs (4 allocations: 8.00 KiB)
In Julia 1.7 you can use argmax
julia> a = rand(ComplexF64,4)
4-element Vector{ComplexF64}:
0.3443509906876845 + 0.49984979589871426im
0.1658370274750809 + 0.47815764287341156im
0.4084798173736195 + 0.9688268736875587im
0.47476987432458806 + 0.13651720575229853im
julia> argmax(abs2, a)
0.4084798173736195 + 0.9688268736875587im
Since it will take some time to get to 1.7, you can use the following analog
maxby(f, iter) = reduce(iter) do x, y
f(x) > f(y) ? x : y
end
julia> maxby(abs2, a)
0.4084798173736195 + 0.9688268736875587im
UPD: slightly more efficient way to find such maximum is to use something like
function maxby(f, iter; default = zero(eltype(iter)))
isempty(iter) && return default
res, rest = Iterators.peel(iter)
fa = f(res)
for x in rest
fx = f(x)
if fx > fa
res = x
fa = fx
end
end
return res
end
According to octave's documentation (which presumably mimics matlab's behaviour):
For complex arguments, the magnitude of the elements are used for
comparison. If the magnitudes are identical, then the results are
ordered by phase angle in the range (-pi, pi]. Hence,
max ([-1 i 1 -i])
=> -1
because all entries have magnitude 1, but -1 has the largest phase
angle with value pi.
Therefore, if you'd like to mimic matlab/octave functionality exactly, then based on this logic, I'd construct a 'max' function for complex numbers as:
function compmax( CArray )
Absmax = CArray[ abs.(CArray) .== maximum( abs.(CArray)) ]
Totalmax = Absmax[ angle.(Absmax) .== maximum(angle.(Absmax)) ]
return Totalmax[1]
end
(adding appropriate typing as desired).
Examples:
Nums0 = [ 1, 2, 3 + 4im, 3 - 4im, 5 ]; compmax( Nums0 )
# 3 + 4im
Nums1 = [ -1, im, 1, -im ]; compmax( Nums1 )
# -1 + 0im
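The same tie-breaking rule can also be sketched as a single pass, assuming Julia 1.7 or later where `argmax(f, itr)` is available. The helper name `octave_like_max` is hypothetical; it compares `(magnitude, phase)` tuples lexicographically and uses `abs2` for the magnitude, which orders elements the same way as `abs`:

```julia
# Hypothetical helper: element with the largest magnitude,
# ties broken by the largest phase angle (the rule quoted from Octave's docs).
octave_like_max(a) = argmax(z -> (abs2(z), angle(z)), a)

m0 = octave_like_max([1, 2, 3 + 4im, 3 - 4im, 5])   # three entries share magnitude 5
m1 = octave_like_max([-1, im, 1, -im])              # all entries have magnitude 1
```

With floating-point data, magnitudes that are mathematically equal may differ in the last bits, so the phase tie-break only applies when the computed magnitudes compare exactly equal.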
If this was a code for my computations, I would have made my life much simpler by:
julia> Main.isless(u1::ComplexF64, u2::ComplexF64) = abs2(u1) < abs2(u2)
julia> maximum(rand(ComplexF64, 10))
0.9876138798492835 + 0.9267321874614858im
This adds a new implementation for an existing method in Main. Therefore, for library code it is not an elegant idea, but it will get you where you need to be with the least effort.
The "size" of a complex number is determined by the size of its modulus. You can use abs for that. Or get 1.7 as Andrej Oskin said.
julia> arr = rand(ComplexF64, 10)
10-element Array{Complex{Float64},1}:
0.12749588414783353 + 0.09918182087026373im
0.7486501790575264 + 0.5577981676269863im
0.9399200789666509 + 0.28339836191094747im
0.9695470502095325 + 0.9978696209350371im
0.6599207157942191 + 0.0999992072342546im
0.30059521996405425 + 0.6840859625686171im
0.22746651600614132 + 0.33739559003514885im
0.9212471084010287 + 0.2590484924393446im
0.74848598947588 + 0.41129443181449554im
0.8304447441317468 + 0.8014240389454632im
julia> arr_max = arr[argmax(map(abs, arr))]
0.9695470502095325 + 0.9978696209350371im
julia> arr_min = arr[argmin(map(abs, arr))]
0.12749588414783353 + 0.09918182087026373im

Converting SuiteSparse.SPQR.QRSparseQ to SparseMatrixCSC?

I have this problem that converting the native sparse format for the QR decomposition of a sparse Matrix takes forever. However, I need it in the CSC format to use it for further computations.
using LinearAlgebra, SparseArrays
N = 1000
A = sprand(N,N,1e-4)
@time F = qr(A)
@time F.Q
@time Q_sparse = sparse(F.Q)
0.000420 seconds (1.15 k allocations: 241.017 KiB)
0.000008 seconds (6 allocations: 208 bytes)
6.067351 seconds (2.00 M allocations: 15.140 GiB, 36.25% gc time)
Any suggestions?
Okay, I found the problem. For other people trying to do it:
factors = F.Q.factors
τ = F.Q.τ
Nτ = size(factors)[2]
Isp = sparse(I(N));
@time Q_constr = prod(Isp - factors[:,i]*τ[i]*factors[:,i]' for i in 1:Nτ)
Q_constr ≈ Q_sparse
0.084461 seconds (62.64 k allocations: 3.321 MiB, 18.28% gc time)
true
You see that the method sparse(F.Q) is somehow using the wrong representation. If you construct Q as I did above, it will be considerably faster.

How to find the index of the last maximum in julialang?

I have an array that contains repeated nonnegative integers, e.g., A=[5,5,5,0,1,1,0,0,0,3,3,0,0]. I would like to find the position of the last maximum in A. That is the largest index i such that A[i]>=A[j] for all j. In my example, i=3.
I tried to find the indices of all maximum of A then find the maximum of these indices:
A = [5,5,5,0,1,1,0,0,0,3,3,0,0];
Amax = maximum(A);
i = maximum(find(x -> x == Amax, A));
Is there any better way?
length(A) - indmax(@view A[end:-1:1]) + 1
should be pretty fast, but I didn't benchmark it.
EDIT: I should note that by definition @crstnbr's solution (to write the algorithm from scratch) is faster (how much faster is shown in Xiaodai's response). This is an attempt to do it using julia's inbuilt array functions.
What about findlast(A.==maximum(A)) (which of course is conceptually similar to your approach)?
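On Julia 1.0 and later, `find` and `indmax` go by the names `findall` and `argmax`. A minimal sketch of the one-liners discussed so far under that assumption, using the array from the question:

```julia
A = [5, 5, 5, 0, 1, 1, 0, 0, 0, 3, 3, 0, 0]
Amax = maximum(A)

i_findall  = maximum(findall(x -> x == Amax, A))       # the question's approach with `findall`
i_findlast = findlast(==(Amax), A)                     # predicate form, no temporary Bool array
i_reverse  = length(A) - argmax(@view A[end:-1:1]) + 1 # first maximum of the reversed view
```

All three should agree on the index of the last maximum; they differ only in how many passes and temporaries they need.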
The fastest thing would probably be explicit loop implementation like this:
function lastindmax(x)
    k = 1
    m = x[1]
    @inbounds for i in eachindex(x)
        if x[i]>=m
            k = i
            m = x[i]
        end
    end
    return k
end
I tried @Michael's solution and @crstnbr's solution and I found the latter much faster
a = rand(Int8(1):Int8(5),1_000_000_000)
@time length(a) - indmax(@view a[end:-1:1]) + 1 # 19 seconds
@time length(a) - indmax(@view a[end:-1:1]) + 1 # 18 seconds
function lastindmax(x)
    k = 1
    m = x[1]
    @inbounds for i in eachindex(x)
        if x[i]>=m
            k = i
            m = x[i]
        end
    end
    return k
end
@time lastindmax(a) # 3 seconds
@time lastindmax(a) # 2.8 seconds
Michael's solution doesn't support Strings (ERROR: MethodError: no method matching view(::String, ::StepRange{Int64,Int64})) or sequences so I add another solution:
julia> lastimax(x) = maximum((j,i) for (i,j) in enumerate(x))[2]
julia> A="abžcdž"; lastimax(A) # unicode is OK
6
julia> lastimax(i^2 for i in -10:7)
1
If you would rather not get an exception for an empty sequence:
julia> lastimax(x) = !isempty(x) ? maximum((j,i) for (i,j) in enumerate(x))[2] : 0;
julia> lastimax(i for i in 1:3 if i>4)
0
Simple(!) benchmarks:
This is up to 10 times slower than Michael's solution for Float64:
julia> mlastimax(A) = length(A) - indmax(@view A[end:-1:1]) + 1;
julia> A = rand(Float64, 1_000_000); @time lastimax(A); @time mlastimax(A)
0.166389 seconds (4.00 M allocations: 91.553 MiB, 4.63% gc time)
0.019560 seconds (6 allocations: 240 bytes)
80346
(I am surprised) it is 2 times faster for Int64!
julia> A = rand(Int64, 1_000_000); @time lastimax(A); @time mlastimax(A)
0.015453 seconds (10 allocations: 304 bytes)
0.031197 seconds (6 allocations: 240 bytes)
423400
it is 2-3 times slower for Strings
julia> A = ["A$i" for i in 1:1_000_000]; @time lastimax(A); @time mlastimax(A)
0.175117 seconds (2.00 M allocations: 61.035 MiB, 41.29% gc time)
0.077098 seconds (7 allocations: 272 bytes)
999999
EDIT2:
@crstnbr's solution is faster and works with Strings too (but not with generators). There is a difference between lastindmax and lastimax: the first returns the byte index, the second returns the character index:
julia> S = "1š3456789ž"
julia> length(S)
10
julia> lastindmax(S) # return value is bigger than length
11
julia> lastimax(S) # return character index (which is not byte index to String) of last max character
10
julia> S[chr2ind(S, lastimax(S))]
'ž': Unicode U+017e (category Ll: Letter, lowercase)
julia> S[chr2ind(S, lastimax(S))]==S[lastindmax(S)]
true
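`chr2ind` was removed in Julia 1.0. As a sketch under the assumption of Julia 1.0 or later, the same character-index to byte-index conversion can be written with `nextind(s, 0, n)`, shown here on the string from the session above:

```julia
S = "1š3456789ž"

nchars   = length(S)                # number of characters
byte_idx = nextind(S, 0, nchars)    # byte index at which the last character starts
last_chr = S[byte_idx]              # indexing a String uses byte indices
```

The byte index exceeds the character count because 'š' and 'ž' each occupy two bytes in UTF-8.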

What is the fastest way to compute the sum of outer products [Julia]

In Julia, what is the fastest way to compute
$A = \sum_{t=1}^{T} y_t y_t'$
where $y_t$ is an $n$-dimensional column vector of variables at time $t$.
In Julia code, one option is:
A = zeros(n,n);
for j=1:T
    A = A + Y[j:j,:]'*Y[j:j,:];   # Y[j:j,:] is the 1×n row y_j'
end
where
Y = [y_1'
     ...
     y_T']
is a (T×n) matrix.
But, is there a faster way ? Thanks.
For comparison, I have tried several codes for computing the A matrix (which I hope to be what OP wants...), including built-in matrix multiplication, BLAS.ger!, and explicit loops:
print_(x) = print(rpad(x,12))
# built-in vector * vector'
function perf0v( n, T, Y )
print_("perf0v")
out = zeros(n,n)
for t = 1 : T
out += slice( Y, :,t ) * slice( Y, :,t )'
end
return out
end
# built-in matrix * matrix'
function perf0m( n, T, Y )
print_("perf0m")
out = Y * Y'
return out
end
# BLAS.ger!
function perf1( n, T, Y )
print_("perf1")
out = zeros(n,n)
for t = 1 : T
BLAS.ger!( 1.0, Y[ :,t ], Y[ :,t ], out )
end
return out
end
# BLAS.ger! with sub
function perf1sub( n, T, Y )
print_("perf1sub")
out = zeros(n,n)
for t = 1 : T
BLAS.ger!( 1.0, sub( Y, :,t ), sub( Y, :,t ), out )
end
return out
end
# explicit loop
function perf2( n, T, Y )
print_("perf2")
out = zeros(n,n)
for t = 1 : T,
i2 = 1 : n,
i1 = 1 : n
out[ i1, i2 ] += Y[ i1, t ] * Y[ i2, t ]
end
return out
end
# explicit loop with simd
function perf2simd( n, T, Y )
print_("perf2simd")
out = zeros(n,n)
for i2 = 1 : n,
i1 = 1 : n
@simd for t = 1 : T
out[ i1, i2 ] += Y[ i1, t ] * Y[ i2, t ]
end
end
return out
end
# transposed perf2
function perf2tr( n, T, Yt )
print_("perf2tr")
out = zeros(n,n)
for t = 1 : T,
i2 = 1 : n,
i1 = 1 : n
out[ i1, i2 ] += Yt[ t, i1 ] * Yt[ t, i2 ]
end
return out
end
# transposed perf2simd
function perf2simdtr( n, T, Yt )
print_("perf2simdtr")
out = zeros(n,n)
for i2 = 1 : n,
i1 = 1 : n
@simd for t = 1 : T
out[ i1, i2 ] += Yt[ t, i1 ] * Yt[ t, i2 ]
end
end
return out
end
#.........................................................
n = 100
T = 1000
@show n, T
Y = rand( n, T )
Yt = copy( Y' )
out = Dict()
for loop = 1:2
println("loop = ", loop)
for fn in [ perf0v, perf0m, perf1, perf1sub, perf2, perf2simd ]
@time out[ fn ] = fn( n, T, Y )
end
for fn in [ perf2tr, perf2simdtr ]
@time out[ fn ] = fn( n, T, Yt )
end
end
# Check
error = 0.0
for k1 in keys( out ),
k2 in keys( out )
@assert sumabs( out[ k1 ] ) > 0.0
@assert sumabs( out[ k2 ] ) > 0.0
error += sumabs( out[ k1 ] - out[ k2 ] )
end
@show error
The result obtained with julia -O --check-bounds=no test.jl (ver0.4.5) is:
(n,T) = (100,1000)
loop = 2
perf0v 0.056345 seconds (15.04 k allocations: 154.803 MB, 31.66% gc time)
perf0m 0.000785 seconds (7 allocations: 78.406 KB)
perf1 0.155182 seconds (5.96 k allocations: 1.846 MB)
perf1sub 0.155089 seconds (8.01 k allocations: 359.625 KB)
perf2 0.011192 seconds (6 allocations: 78.375 KB)
perf2simd 0.016677 seconds (6 allocations: 78.375 KB)
perf2tr 0.011698 seconds (6 allocations: 78.375 KB)
perf2simdtr 0.009682 seconds (6 allocations: 78.375 KB)
and for some different values of n & T:
(n,T) = (1000,100)
loop = 2
perf0v 0.610885 seconds (2.01 k allocations: 1.499 GB, 25.11% gc time)
perf0m 0.008866 seconds (9 allocations: 7.630 MB)
perf1 0.182409 seconds (606 allocations: 9.177 MB)
perf1sub 0.180720 seconds (806 allocations: 7.657 MB, 0.67% gc time)
perf2 0.104961 seconds (6 allocations: 7.630 MB)
perf2simd 0.119964 seconds (6 allocations: 7.630 MB)
perf2tr 0.137186 seconds (6 allocations: 7.630 MB)
perf2simdtr 0.103878 seconds (6 allocations: 7.630 MB)
(n,T) = (2000,100)
loop = 2
perf0v 2.514622 seconds (2.01 k allocations: 5.993 GB, 24.38% gc time)
perf0m 0.035801 seconds (9 allocations: 30.518 MB)
perf1 0.473479 seconds (606 allocations: 33.591 MB, 0.04% gc time)
perf1sub 0.475796 seconds (806 allocations: 30.545 MB, 0.95% gc time)
perf2 0.422808 seconds (6 allocations: 30.518 MB)
perf2simd 0.488539 seconds (6 allocations: 30.518 MB)
perf2tr 0.554685 seconds (6 allocations: 30.518 MB)
perf2simdtr 0.400741 seconds (6 allocations: 30.518 MB)
(n,T) = (3000,100)
loop = 2
perf0v 5.444797 seconds (2.21 k allocations: 13.483 GB, 20.77% gc time)
perf0m 0.080458 seconds (9 allocations: 68.665 MB)
perf1 0.927325 seconds (806 allocations: 73.261 MB, 0.02% gc time)
perf1sub 0.926690 seconds (806 allocations: 68.692 MB, 0.51% gc time)
perf2 0.958189 seconds (6 allocations: 68.665 MB)
perf2simd 1.067098 seconds (6 allocations: 68.665 MB)
perf2tr 1.765001 seconds (6 allocations: 68.665 MB)
perf2simdtr 0.902838 seconds (6 allocations: 68.665 MB)
Hmm, so the built-in matrix * matrix (Y * Y') was fastest. It seems that BLAS gemm is called at the end (from the output of @less Y * Y').
If you know the components of y_t in advance, the simplest, easiest, and likely fastest way is simply:
A = Y*Y'
Where the different values of y_t are stored as columns in the matrix Y.
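A small sketch of why the single product works, assuming a recent Julia 1.x and small illustrative sizes: accumulating the outer product of each column of Y gives the same matrix as `Y * Y'`.

```julia
using LinearAlgebra

n, T = 4, 50
Y = rand(n, T)                  # each column Y[:, t] is one y_t

A_loop = zeros(n, n)
for t in 1:T
    y = @view Y[:, t]           # column t without copying
    A_loop .+= y .* y'          # n×n outer product, accumulated in place
end

A_gemm = Y * Y'                 # the same sum as one matrix product
same = A_loop ≈ A_gemm
```

The comparison uses `≈` rather than `==` because the two versions add the floating-point terms in a different order.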
If you don't know the component of y_t in advance, you can use BLAS:
n = 100;
t = 1000;
Y = rand(n,t);
outer_sum = zeros(n,n);
for tdx = 1:t
BLAS.ger!(1.0, Y[:,tdx], Y[:,tdx], outer_sum)
end
See this post (for a similar example) in case you're new to BLAS and would like help interpreting the arguments in this function here.
One of the key things here is to store the y_t vectors as columns rather than as rows of Y because accessing columns is much faster than accessing rows. See the Julia performance tips for more info on this.
Update: for the second option (where it isn't known in advance what the components of Y will be), BLAS will sometimes, but not always, be fastest. The big determining factor is the size of the vectors you're using. Calling BLAS incurs a certain overhead and so is only worthwhile in certain settings. Julia's native matrix multiplication will automatically choose whether to use BLAS, and will generally do a good job with it. But if you know ahead of time that you're dealing with a situation where BLAS will be optimal, then you can save the Julia optimizer some work (and thus speed up your code) by specifying it ahead of time.
See the great response by roygvib below. It presents a LOT of creative and instructive ways to compute this sum of dot products. Many will be faster than BLAS in certain situations. From the time trials that roygvib presents, it looks like the break-even point is around n = 3000.
For the sake of completion, here is another, vectorised approach:
Assume Y is as follows:
julia> Y = rand(1:10, 10,5)
10×5 Array{Int64,2}:
2 1 6 2 10
8 2 6 8 2
2 10 10 4 6
5 9 8 5 1
5 4 9 9 4
4 6 3 4 8
2 9 2 8 1
6 8 5 10 2
1 7 10 6 9
8 7 10 10 8
julia> Y = reshape(Y, 10,5,1); # add a singular 3rd dimension, so we can
# be allowed to shuffle the dimensions
The idea is that you create one array which is defined in dimensions 1 and 3, and only has one column, and you array-multiply this by an array which is defined in dimensions 2 and 3, but only has one row. Your 'Time' variable varies along dimension 3. This essentially results in the individual kronecker products from each timestep, concatenated along the time (i.e. 3rd) dimension.
julia> KroneckerProducts = permutedims(Y, [2,3,1]) .* permutedims(Y, [3,2,1]);
Now it wasn't clear to me if your end result was meant to be an "nxn" matrix, resulting from the sum of all timings at each 'kronecker' position
julia> sum(KroneckerProducts, 3)
5×5×1 Array{Int64,3}:
[:, :, 1] =
243 256 301 324 192
256 481 442 427 291
301 442 555 459 382
324 427 459 506 295
192 291 382 295 371
or simply the sum of all the elements in that massive 3D array
julia> sum(KroneckerProducts)
8894
Choose your preferred poison :p
I'm not sure this will be faster than Michael's approach above, since the permutedims step is presumably expensive, and for very large arrays it may actually be a bottleneck (but I don't know how it's implemented in Julia ... maybe it's not!), so it may not necessarily perform better than a simple loop iterating for each timestep, even though it's "vectorised code". You can try both approaches and see for yourself what is fastest for your particular arrays!
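For reference, a sketch of the same vectorised construction on Julia 1.x, where the reduction dimension is passed as the keyword `dims`. Here Y is T×n with T = 10 and n = 5 as in the example above, and the result is checked against the plain matrix product:

```julia
Y  = rand(1:10, 10, 5)                  # T = 10 time steps (rows), n = 5 variables (columns)
Y3 = reshape(Y, 10, 5, 1)               # add a singleton third dimension

# K[i, j, t] = Y[t, i] * Y[t, j]: one n×n outer product per time step
K = permutedims(Y3, (2, 3, 1)) .* permutedims(Y3, (3, 2, 1))

A = dropdims(sum(K, dims=3), dims=3)    # sum over time, then drop the singleton dimension
matches = A == Y' * Y                   # integer data, so exact comparison is fine
```

Because Y stores the y_t as rows here, the equivalent matrix product is `Y' * Y` rather than `Y * Y'`.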
