Julia - Linear combination of row-wise outer products

I have a matrix A of dimension (n, m) and a matrix B of dimension (n, p). For each of the n rows, I would like to compute the outer product between the corresponding rows of A and B, which gives an (m, p) matrix. I then have a vector x of size n, and I would like to multiply each of these matrices by the corresponding entry of x and sum everything up. How can I do that?
# Parameters
n, m, p = 100, 10, 3
# Matrices & Vectors
A, B, x = randn(n, m), randn(n, p), randn(n)
# Slow method
result = zeros(m, p)
for i in 1:n
    result += x[i] * (A[i, :] * B[i, :]')
end

Here's another version:
# Parameters
n, m, p = 100, 10, 3
# Matrices & Vectors
A, B, x = randn(n, m), randn(n, p), randn(n)
function old_way(A, B, x)
    # Slow method (note: n, m, p are read from global scope here)
    result = zeros(m, p)
    for i in 1:n
        result += x[i] * (A[i, :] * B[i, :]')
    end
    result
end
function another_way(A, B, x)
    sum(xi * (Arow * Brow') for (xi, Arow, Brow) in zip(x, eachrow(A), eachrow(B)))
end
And benchmarking:
julia> using BenchmarkTools
julia> @benchmark old_way(A, B, x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 83.495 μs … 2.653 ms ┊ GC (min … max): 0.00% … 95.79%
Time (median): 87.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 101.496 μs ± 115.196 μs ┊ GC (mean ± σ): 6.46% ± 5.53%
▃█▇▆▂ ▁ ▁ ▁ ▁
███████████████▇█▇██████▆▇▇▇▆▇▇▇▇▇▇▇▇▆▇▆▆▇▆▇▅▆▆▆▄▅▅▅▅▆▅▆▅▅▅▄▅ █
83.5 μs Histogram: log(frequency) by time 200 μs <
Memory estimate: 153.48 KiB, allocs estimate: 1802.
julia> @benchmark another_way(A, B, x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 25.850 μs … 923.032 μs ┊ GC (min … max): 0.00% … 95.94%
Time (median): 27.477 μs ┊ GC (median): 0.00%
Time (mean ± σ): 31.851 μs ± 35.440 μs ┊ GC (mean ± σ): 5.03% ± 4.49%
▇█▇▅▃▁ ▁▁▁ ▁
███████▇▇▆▇████████████▇█▇▇▇▆▇▆▅▆▆▆▅▆▆▅▆▆▇▆▅▅▅▆▅▆▅▅▅▄▃▅▅▅▄▄▄ █
25.8 μs Histogram: log(frequency) by time 77.4 μs <
Memory estimate: 98.31 KiB, allocs estimate: 304.
So it's a little faster, and uses a little less memory.
Does this answer your question?

General tips to save time and memory:
Put code in a method instead of the global scope, and make sure every variable in that method comes from the arguments, not from global variables. That way, Julia's compiler can infer the types of variables and optimize.
Reduce allocations where possible, and you have many opportunities. These changes are what distinguish the old_way and new_way methods below, and together they give a 5-6x speedup and a reduction to 1 allocation.
When slicing an array, use @view to avoid the default behavior of allocating a copy.
You can update result in place with .+=; plain += allocates a new array and rebinds the variable result to it.
For elementwise operations like x[i] * ..., chaining dotted operators fuses the underlying elementwise loops and avoids allocating intermediate arrays.
A matrix product of a column (Mx1) vector and a row (1xN) vector can be replaced by a broadcasted elementwise multiplication (a quick check of this follows below).
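A minimal sketch of that last identity, with arbitrary sizes:
u, v = randn(4), randn(3)
u * v' ≈ u .* v'   # true: both form the 4×3 outer product, but the broadcast fuses with other dotted operations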
n, m, p = 100, 10, 3
A, B, x = randn(n, m), randn(n, p), randn(n)
# Methods below do not use the above global variables
function old_way(A, B, x, n, m, p)
    result = zeros(m, p)
    for i in 1:n
        result += x[i] * (A[i, :] * B[i, :]')
    end
    result
end
function new_way(A, B, x, n, m, p)
    result = zeros(m, p)
    for i in 1:n
        result .+= x[i] .* ( @view(A[i, :]) .* @view(B[i, :])' )
    end
    result
end
using BenchmarkTools
@btime old_way(A, B, x, n, m, p);
# 36.753 μs (501 allocations: 125.33 KiB)
@btime new_way(A, B, x, n, m, p);
# 6.542 μs (1 allocation: 336 bytes)
old_way(A, B, x, n, m, p) == new_way(A, B, x, n, m, p)
# true
The example above avoided global variables, and the example below shows why. Even if you put your code in a method but still read global variables, not only is performance generally worse, but trying to reduce allocations backfires:
# Methods below use n, m, p as global inputs
function old_oops(A, B, x)
    # same code as old_way(A, B, x, n, m, p)
end
function new_oops(A, B, x)
    # same code as new_way(A, B, x, n, m, p)
end
@btime old_oops(A, B, x);
# 95.317 μs (1802 allocations: 153.48 KiB)
@btime new_oops(A, B, x);
# 235.191 μs (1302 allocations: 81.61 KiB)
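A side note, not benchmarked here: if you really must keep n, m, p global, declaring them const (in a fresh session, before they are otherwise assigned) gives the compiler their types, which avoids most of the penalty above.
const n, m, p = 100, 10, 3   # typed globals: methods reading these can now be inferred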

If your setup has the same structure as your MWE, using LinearAlgebra:
faster(A,B,x) = (diagm(x)*A)'*B
runs 4x faster:
using LinearAlgebra, BenchmarkTools
# Parameters
n, m, p = 100, 10, 3
# Matrices & Vectors
A, B, x = randn(n, m), randn(n, p), randn(n)
# Slow method (note: reads the globals n, m, p)
function slow(A,B,x)
    result = zeros(m, p)
    for i in 1:n
        result += x[i] * (A[i, :] * B[i, :]')
    end
    result
end
faster(A,B,x) = (diagm(x)*A)'*B
@assert slow(A,B,x) ≈ faster(A,B,x)
@btime slow($A,$B,$x)   # 113.400 μs (1802 allocations: 139.39 KiB)
@btime faster($A,$B,$x) # 28.100 μs (4 allocations: 86.41 KiB)
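A variant worth trying (a sketch, not benchmarked here): LinearAlgebra's lazy Diagonal computes the same A' * diagm(x) * B product without materializing the dense n×n matrix that diagm(x) allocates.
faster2(A, B, x) = A' * (Diagonal(x) * B)   # Diagonal(x) * B scales the rows of B in O(n·p)
# @assert faster2(A, B, x) ≈ faster(A, B, x)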

Related

Binding multiple variables to copies in BenchmarkTools.jl setup

I am trying to use the @benchmarkable macro from BenchmarkTools.jl. In the package documentation they explain how to pass setup expressions to @benchmark and @benchmarkable. They also explain that this can be used for in-place/mutating functions in order to bind copies of the input variables.
I am not sure, however, how to use the setup expression to copy multiple variables at the same time.
For example, imagine I want to benchmark the following function (the actual function is irrelevant):
function my_function!(x, y)
    deleteat!(x, y .== 0)
    deleteat!(y, y .== 0)
    x .= x .* 2
end
With the following inputs:
using BenchmarkTools
a = collect(1:30)
b = rand(0:5, 30)
I would like to perform the benchmark by binding a copy of a and b to variables y and z respectively.
t = @benchmarkable my_function!(m, n) setup=(m = copy($a), n = copy($b)) evals = 30
run(t)
However, running the previous code returns the following error:
ERROR: LoadError: UndefVarError: m not defined
setup requires a block of code, so:
t = @benchmarkable my_function!(m, n) setup=begin; m = copy($a); n = copy($b); end evals = 30
Now you can run it:
julia> run(t)
BenchmarkTools.Trial: 10000 samples with 30 evaluations.
Range (min … max): 173.333 ns … 131.320 μs ┊ GC (min … max): 0.00% … 99.33%
Time (median): 210.000 ns ┊ GC (median): 0.00%
Time (mean ± σ): 251.470 ns ± 1.315 μs ┊ GC (mean ± σ): 5.19% ± 0.99%
▂█ ▁
██▇█▅▇█▅▆▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
173 ns Histogram: frequency by time 807 ns <
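Equivalently, you can keep the parentheses and separate the assignments with semicolons, since a parenthesized sequence of expressions also forms a block:
t = @benchmarkable my_function!(m, n) setup=(m = copy($a); n = copy($b)) evals = 30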

julia vector of matrix or 3d Array

I want to multiply several matrices (all of same dimensions) to a vector beta.
I have tried two versions: storing the matrices either as a vector of matrices, or as a three-dimensional array.
Is it normal that the version with the vector of matrices runs faster?
using BenchmarkTools
nid =10000
npar = 5
x1 = reshape(repeat([1], nid * npar), nid,:)
x2 = reshape(repeat([2], nid * npar), nid,:)
x3 = reshape(repeat([3], nid * npar), nid,:)
X = reshape([x1 x2 x3], nid, npar, :);
X1 = [x1, x2, x3]
beta = rand(npar)
function f(X::Array{Int,3}, beta::Vector{Float64})::Array{Float64}
    hcat([X[:,:,i] * beta for i=1:size(X,3)]...)
end
function g(X::Array{Array{Int,2},1}, beta::Vector{Float64})::Array{Float64}
    hcat([X[i] * beta for i=1:size(X)[1]]...)
end
f(X,beta);
g(X1,beta);
@benchmark f(X, beta)
@benchmark g(X1, beta)
Results indicate that f takes almost twice the time of g.
Is it a normal pattern, or I am not using the 3D Array properly?
That's because the slice operator copies each matrix, hence the growth in memory allocations.
Notice the last line of each benchmark:
julia> @benchmark f(X, beta)
BenchmarkTools.Trial: 8011 samples with 1 evaluation.
Range (min … max): 356.013 μs … 4.076 ms ┊ GC (min … max): 0.00% … 87.21%
Time (median): 457.231 μs ┊ GC (median): 0.00%
Time (mean ± σ): 615.235 μs ± 351.236 μs ┊ GC (mean ± σ): 6.54% ± 11.69%
▃▇██▆▅▄▄▄▃▂ ▂▃▄▄▄▃▁▁▁ ▁ ▁ ▂
████████████▆▇▅▅▄▄▇███████████████▇█▇▇█▇▆▇█▆▆▆▆▅▆▅▅▄▃▃▂▂▄▂▄▄▄ █
356 μs Histogram: log(frequency) by time 1.96 ms <
Memory estimate: 1.60 MiB, allocs estimate: 17.
julia> @benchmark g(X1, beta)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 174.348 μs … 2.493 ms ┊ GC (min … max): 0.00% … 83.85%
Time (median): 219.383 μs ┊ GC (median): 0.00%
Time (mean ± σ): 245.192 μs ± 119.612 μs ┊ GC (mean ± σ): 3.54% ± 7.68%
▃▇█▂
▅████▆▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
174 μs Histogram: frequency by time 974 μs <
Memory estimate: 469.25 KiB, allocs estimate: 11.
To avoid the copies, use the @views macro, which turns slices into references (views) instead of allocations. The two implementations then perform the same, up to random noise:
function fbis(X::Array{Int,3}, beta::Vector{Float64})::Array{Float64}
    @views hcat([X[:,:,i] * beta for i=1:size(X,3)]...)
end
julia> @benchmark fbis(X, beta)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 175.984 μs … 2.710 ms ┊ GC (min … max): 0.00% … 79.70%
Time (median): 225.990 μs ┊ GC (median): 0.00%
Time (mean ± σ): 274.611 μs ± 166.015 μs ┊ GC (mean ± σ): 4.17% ± 7.78%
▅█▃
▆███▆▄▃▄▄▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
176 μs Histogram: frequency by time 1.15 ms <
Memory estimate: 469.25 KiB, allocs estimate: 11.
While using views improves these benchmarks, take care not to overuse them. If you are creating a matrix that you are going to use over and over again, the one-time cost of copying it may be negligible compared to the cost of repeatedly following references into scattered memory.
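A related spelling, assuming Julia ≥ 1.1 (a sketch, not benchmarked here): eachslice iterates views over the chosen dimension, so it avoids the copies without the macro.
ftris(X, beta) = reduce(hcat, [Xi * beta for Xi in eachslice(X; dims=3)])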

Generic maximum/minimum function for complex numbers

In Julia, one can find (supposedly) efficient implementations of min/minimum and max/maximum over collections of real numbers.
As these concepts are not uniquely defined for complex numbers, I was wondering if a parametrized version of these functions was already implemented somewhere.
I am currently sorting the elements of the array of interest and taking the last value, which is, as far as I know, much more costly than directly finding the value with the maximum absolute value (or some other criterion).
This is mostly to reproduce the Matlab behavior of the max function over complex arrays.
Here is my current code
a = rand(ComplexF64,4)
b = sort(a,by = (x) -> abs(x))
c = b[end]
The probable function call would look like
c = maximum/minimum(a,by=real/imag/abs/phase)
EDIT: Some performance tests in Julia 1.5.3 with the provided solutions:
function maxby0(f,iter)
    b = sort(iter, by = (x) -> f(x))
    c = b[end]
end
function maxby1(f, iter)
    reduce(iter) do x, y
        f(x) > f(y) ? x : y
    end
end
function maxby2(f, iter; default = zero(eltype(iter)))
    isempty(iter) && return default
    res, rest = Iterators.peel(iter)
    fa = f(res)
    for x in rest
        fx = f(x)
        if fx > fa
            res = x
            fa = fx
        end
    end
    return res
end
compmax(CArray) = CArray[ (abs.(CArray) .== maximum(abs.(CArray))) .& (angle.(CArray) .== maximum( angle.(CArray))) ][1]
Main.isless(u1::ComplexF64, u2::ComplexF64) = abs2(u1) < abs2(u2)
function maxby5(arr)
    arr_max = arr[argmax(map(abs, arr))]
end
a = rand(ComplexF64,10)
using BenchmarkTools
@btime maxby0(abs,$a)
@btime maxby1(abs,$a)
@btime maxby2(abs,$a)
@btime compmax($a)
@btime maximum($a)
@btime maxby5($a)
Output for a vector of length 10:
>841.653 ns (1 allocation: 240 bytes)
>214.797 ns (0 allocations: 0 bytes)
>118.961 ns (0 allocations: 0 bytes)
>Execution fails
>20.340 ns (0 allocations: 0 bytes)
>144.444 ns (1 allocation: 160 bytes)
Output for a vector of length 1000:
>315.100 μs (1 allocation: 15.75 KiB)
>25.299 μs (0 allocations: 0 bytes)
>12.899 μs (0 allocations: 0 bytes)
>Execution fails
>1.520 μs (0 allocations: 0 bytes)
>14.199 μs (1 allocation: 7.94 KiB)
Output for a vector of length 1000 (with all comparisons made with abs2):
>35.399 μs (1 allocation: 15.75 KiB)
>3.075 μs (0 allocations: 0 bytes)
>1.460 μs (0 allocations: 0 bytes)
>Execution fails
>1.520 μs (0 allocations: 0 bytes)
>2.211 μs (1 allocation: 7.94 KiB)
Some remarks:
Sorting clearly (and as expected) slows the operation.
Using abs2 instead of abs saves a lot of time (also expected).
To conclude:
As a built-in function will provide this in 1.7, I will avoid the additional Main.isless method, though it is, all things considered, the fastest option, so as not to modify the core of my Julia.
maxby1 and maxby2 allocate nothing.
maxby1 feels more readable.
The winner is therefore Andrej Oskin!
EDIT 2: a new benchmark using the corrected compmax implementation:
julia> @btime maxby0(abs2,$a)
36.799 μs (1 allocation: 15.75 KiB)
julia> @btime maxby1(abs2,$a)
3.062 μs (0 allocations: 0 bytes)
julia> @btime maxby2(abs2,$a)
1.160 μs (0 allocations: 0 bytes)
julia> @btime compmax($a)
26.899 μs (9 allocations: 12.77 KiB)
julia> @btime maximum($a)
1.520 μs (0 allocations: 0 bytes)
julia> @btime maxby5($a)
2.500 μs (4 allocations: 8.00 KiB)
In Julia 1.7 you can use argmax
julia> a = rand(ComplexF64,4)
4-element Vector{ComplexF64}:
0.3443509906876845 + 0.49984979589871426im
0.1658370274750809 + 0.47815764287341156im
0.4084798173736195 + 0.9688268736875587im
0.47476987432458806 + 0.13651720575229853im
julia> argmax(abs2, a)
0.4084798173736195 + 0.9688268736875587im
Since it will take some time to get to 1.7, you can use the following analog
maxby(f, iter) = reduce(iter) do x, y
    f(x) > f(y) ? x : y
end
julia> maxby(abs2, a)
0.4084798173736195 + 0.9688268736875587im
UPD: a slightly more efficient way to find such a maximum is to use something like
function maxby(f, iter; default = zero(eltype(iter)))
    isempty(iter) && return default
    res, rest = Iterators.peel(iter)
    fa = f(res)
    for x in rest
        fx = f(x)
        if fx > fa
            res = x
            fa = fx
        end
    end
    return res
end
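Applied to the same a as above, it returns the same element:
julia> maxby(abs2, a)
0.4084798173736195 + 0.9688268736875587im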
According to octave's documentation (which presumably mimics matlab's behaviour):
For complex arguments, the magnitude of the elements are used for
comparison. If the magnitudes are identical, then the results are
ordered by phase angle in the range (-pi, pi]. Hence,
max ([-1 i 1 -i])
=> -1
because all entries have magnitude 1, but -1 has the largest phase
angle with value pi.
Therefore, if you'd like to mimic matlab/octave functionality exactly, then based on this logic, I'd construct a 'max' function for complex numbers as:
function compmax( CArray )
    Absmax = CArray[ abs.(CArray) .== maximum( abs.(CArray)) ]
    Totalmax = Absmax[ angle.(Absmax) .== maximum(angle.(Absmax)) ]
    return Totalmax[1]
end
(adding appropriate typing as desired).
Examples:
Nums0 = [ 1, 2, 3 + 4im, 3 - 4im, 5 ]; compmax( Nums0 )
# 3 + 4im
Nums1 = [ -1, im, 1, -im ]; compmax( Nums1 )
# -1 + 0im
If this were code for my own computations, I would make my life much simpler with:
julia> Main.isless(u1::ComplexF64, u2::ComplexF64) = abs2(u1) < abs2(u2)
julia> maximum(rand(ComplexF64, 10))
0.9876138798492835 + 0.9267321874614858im
This adds a new method to an existing Base function (isless) for a type you do not own, i.e. type piracy. For library code that is not an elegant idea, but it will get you where you need with the least effort.
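A piracy-free sketch of the same trick, with a made-up wrapper name: defining isless on a type you own leaves Base's ComplexF64 behavior untouched.
struct ByAbs2        # hypothetical wrapper type
    z::ComplexF64
end
Base.isless(u::ByAbs2, v::ByAbs2) = abs2(u.z) < abs2(v.z)
maximum(ByAbs2.(rand(ComplexF64, 10))).z   # element with the largest modulus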
The "size" of a complex number is determined by the size of its modulus. You can use abs for that. Or get 1.7 as Andrej Oskin said.
julia> arr = rand(ComplexF64, 10)
10-element Array{Complex{Float64},1}:
0.12749588414783353 + 0.09918182087026373im
0.7486501790575264 + 0.5577981676269863im
0.9399200789666509 + 0.28339836191094747im
0.9695470502095325 + 0.9978696209350371im
0.6599207157942191 + 0.0999992072342546im
0.30059521996405425 + 0.6840859625686171im
0.22746651600614132 + 0.33739559003514885im
0.9212471084010287 + 0.2590484924393446im
0.74848598947588 + 0.41129443181449554im
0.8304447441317468 + 0.8014240389454632im
julia> arr_max = arr[argmax(map(abs, arr))]
0.9695470502095325 + 0.9978696209350371im
julia> arr_min = arr[argmin(map(abs, arr))]
0.12749588414783353 + 0.09918182087026373im

Julia: nested loops for pairwise distances is really slow

I have some code which loads a csv file of 2000 2D coordinates, then a function called collision_count counts the number of pairs of coordinates that are closer than a distance d of each other:
using BenchmarkTools
using CSV
using DataFrames   # current CSV.jl requires an explicit sink type such as DataFrame
using LinearAlgebra
function load_csv()::Array{Float64,2}
    df = CSV.read("pos.csv", DataFrame; header=0)
    return Matrix(df)'
end
function collision_count(pos::Array{Float64,2}, d::Float64)::Int64
    count::Int64 = 0
    N::Int64 = size(pos, 2)
    for i in 1:N
        for j in (i+1):N
            @views dist = norm(pos[:,i] - pos[:,j])
            count += dist < d
        end
    end
    return count
end
Here are the results:
pos = load_csv()
@benchmark collision_count($pos, 2.0)
BenchmarkTools.Trial:
memory estimate: 366.03 MiB
allocs estimate: 5997000
--------------
minimum time: 152.070 ms (18.80% GC)
median time: 158.915 ms (20.60% GC)
mean time: 158.751 ms (20.61% GC)
maximum time: 181.726 ms (21.98% GC)
--------------
samples: 32
evals/sample: 1
This is about 30x slower than this Python code:
import numpy as np
import scipy.spatial.distance
pos = np.loadtxt('pos.csv', delimiter=',')
def collision_count(pos, d):
    pdist = scipy.spatial.distance.pdist(pos)
    return np.count_nonzero(pdist < d)
%timeit collision_count(pos, 2)
5.41 ms ± 63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Any way to make it faster? And what's up with all the allocations?
The fastest I can get trivially is the following:
using Distances
using StaticArrays
using LinearAlgebra
pos = [@SVector rand(2) for _ in 1:2000]
function collision_count(pos::Vector{<:AbstractVector}, d)
    count = 0
    @inbounds for i in eachindex(pos)
        for j in (i+1):lastindex(pos)
            dist = sqeuclidean(pos[i], pos[j])
            count += dist < d*d
        end
    end
    return count
end
There are a variety of changes here, some stylistic, some structural. Starting with style, you may note that I don't type anything more restrictively than I need to. This has no performance benefit, since Julia is smart enough to infer types for your code.
The biggest structural change is switching from a matrix to a vector of StaticVectors. The reason for this change is that points are your basic elements, so it makes more sense to have a vector where each element is a point. The next change I made is to use the squared norm, since sqrt operations are expensive. The results speak for themselves:
@benchmark collision_count($pos, .1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.182 ms (0.00% GC)
median time: 1.214 ms (0.00% GC)
mean time: 1.218 ms (0.00% GC)
maximum time: 2.160 ms (0.00% GC)
--------------
samples: 4101
evals/sample: 1
Note that there are n log(n) algorithms that may be faster, but this should be pretty close to optimal for a naive implementation.
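For reference, a sketch of one such approach using the NearestNeighbors.jl package, assuming pos is a 2×N Matrix{Float64} of coordinates (not benchmarked here; boundary cases at exactly distance d may differ from the strict < above, since inrange uses ≤):
using NearestNeighbors
function collision_count_tree(pos::AbstractMatrix, d)
    tree = KDTree(pos)
    # each in-range list counts the query point itself once and every unordered pair twice
    total = sum(length(inrange(tree, pos[:, i], d)) for i in axes(pos, 2))
    return (total - size(pos, 2)) ÷ 2
end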
Here's a solution that doesn't rely on specific knowledge about the dimensionality of the points:
(Edit: I updated the function to make it more robust with respect to indexing. Some AbstractArrays have indices that do not start at 1, so now I use axes and lastindex instead of size.)
function collision_count2(pos::AbstractMatrix, d)
    count = 0
    @inbounds for i in axes(pos, 2)
        for j in (i+1):lastindex(pos, 2)
            dist2 = sum(abs2(pos[k, i] - pos[k, j]) for k in axes(pos, 1))
            count += dist2 < d^2
        end
    end
    return count
end
Benchmarks:
julia> using BenchmarkTools
julia> @btime collision_count(pos, 0.7) setup=(pos=rand(2, 2000));
533.055 ms (13991005 allocations: 488.01 MiB)
julia> @btime collision_count2(pos, 0.7) setup=(pos=rand(2, 2000));
4.700 ms (0 allocations: 0 bytes)
The speed is actually close to the SVector solution. On the upcoming Julia version 1.5, the difference compared to the OP's code should be much smaller, since views become more efficient.
BTW: drop the type annotations, like these
count::Int64 = 0
N::Int64 = size(pos, 2)
they're just adding visual noise.

Julia multiply each matrix along dim

I have a 3 dimensional array
x = rand(6,6,2^10)
I want to multiply each matrix along the third dimension by a vector. Is there a cleaner way to do this than:
y = rand(6,1)
z = zeros(6,1,2^10)
for i in 1:2^10
    z[:,:,i] = x[:,:,i] * y
end
If you are working with matrices, it may be appropriate to consider x as a vector of matrices instead of a 3D array. Then you could do
x = [rand(6,6) for _ in 1:2^10]
y = [rand(6)]
z = x .* y
z is now a vector of vectors.
And if z is preallocated, that would be
z .= x .* y
And, if you want it really fast, use vectors of StaticArrays
using StaticArrays
x = [@SMatrix rand(6, 6) for _ in 1:2^10]
y = [@SVector rand(6)]
z = x .* y
That's showing a 10x speedup on my computer, running in 12us.
mapslices(i->i*y, x, dims=(1,2)) is maybe "cleaner" but it will be slower.
Read as: apply the function "times by y" to each slice of the first two dimensions.
function tst(x,y)
    z = zeros(6,1,2^10)
    for i in 1:2^10
        z[:,:,i] = x[:,:,i] * y
    end
    return z
end
tst2(x,y) = mapslices(i->i*y, x, dims=(1,2))
@time tst(x,y);
# 0.002152 seconds (4.10 k allocations: 624.266 KB)
@time tst2(x,y);
# 0.005720 seconds (13.36 k allocations: 466.969 KB)
sum(x.*y', dims=2) is a clean short solution.
It also has good speed and memory properties. The trick is to view matrix-vector multiplication as a linear combination of the matrix's columns, scaled by the elements of the vector. Instead of doing each linear combination separately for each matrix x[:,:,i], we apply the same scale y[i] to the whole slab x[:,i,:]. In code:
const x = rand(6,6,2^10);
const y = rand(6,1);
function tst(x,y)
    z = zeros(6,1,2^10)
    for i in 1:2^10
        z[:,:,i] = x[:,:,i]*y
    end
    return z
end
tst2(x,y) = mapslices(i->i*y, x, dims=(1,2))
tst3(x,y) = sum(x.*y', dims=2)
Benchmarking gives:
julia> using BenchmarkTools
julia> z = tst(x,y); z2 = tst2(x,y); z3 = tst3(x,y);
julia> @benchmark tst(x,y)
BenchmarkTools.Trial:
memory estimate: 688.11 KiB
allocs estimate: 8196
--------------
median time: 759.545 μs (0.00% GC)
samples: 6068
julia> @benchmark tst2(x,y)
BenchmarkTools.Trial:
memory estimate: 426.81 KiB
allocs estimate: 10798
--------------
median time: 1.634 ms (0.00% GC)
samples: 2869
julia> @benchmark tst3(x,y)
BenchmarkTools.Trial:
memory estimate: 336.41 KiB
allocs estimate: 12
--------------
median time: 114.060 μs (0.00% GC)
samples: 10000
So tst3 using sum has better performance (~7x over tst and ~15x over tst2).
Using StaticArrays as suggested by @DNF is also an option, and it would be nice to compare it to the solutions here.
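On newer Julia versions this also has a tidy one-line spelling (a sketch assuming Julia ≥ 1.9 for stack and ≥ 1.1 for eachslice, not benchmarked here):
z = stack(Xi * y for Xi in eachslice(x; dims=3))   # 6×1×1024; the slices are views, so no per-slice copy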
