Efficient way to calculate part of a multivariate normal density - julia

I am interested in calculating the quantity

(x_i - μ_k)' * W_k * (x_i - μ_k),   for k = 1, …, K,

where x_i is a 1×D vector (one of my N data points of dimension D), μ is a D×K matrix whose k-th column is μ_k, and W is a list of K D×D matrices. This should result in a 1×K vector. I compute it for all N and K in the following way, which works:
res = zeros(N, K)
for i in 1:N
    for k in 1:K
        res[i, k] = (x_matrix[i, :] - mus_matrix[:, k])' *
                    w_matrix[k] * (x_matrix[i, :] - mus_matrix[:, k])
    end
end
If I try to vectorize it using the following:
res = zeros(N, K)
for i in 1:N
    res[i, :] = (x_matrix[i, :] .- mus_matrix)' .* w_matrix .* (x_matrix[i, :] .- mus_matrix)
end
I get the following error:
ERROR: DimensionMismatch("arrays could not be broadcast to a common size")
Stacktrace:
[1] _bcs1(::Base.OneTo{Int64}, ::Base.OneTo{Int64}) at ./broadcast.jl:70
[2] _bcs at ./broadcast.jl:63 [inlined]
[3] broadcast_shape(::Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}, ::Tuple{Base.OneTo{Int64}}, ::Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}, ::Vararg{Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},N} where N) at ./broadcast.jl:57 (repeats 3 times)
[4] broadcast_indices(::Array{Float64,2}, ::Array{Any,1}, ::Array{Float64,1}, ::Vararg{Any,N} where N) at ./broadcast.jl:53
[5] broadcast_c(::Function, ::Type{Array}, ::Array{Float64,2}, ::Array{Any,1}, ::Vararg{Any,N} where N) at ./broadcast.jl:311
[6] broadcast(::Function, ::Array{Float64,2}, ::Array{Any,1}, ::Array{Float64,1}, ::Vararg{Any,N} where N) at ./broadcast.jl:434
Here is an example:
julia> N = 5
5
julia> D=2
2
julia> K = 4
4
julia> W=[]
0-element Array{Any,1}
julia> x = rand(N,D)
5×2 Array{Float64,2}:
0.576477 0.9575
0.184454 0.660436
0.470267 0.729649
0.648879 0.782561
0.626453 0.111332
julia> mu = rand(K,D)
4×2 Array{Float64,2}:
0.989281 0.00126782
0.659106 0.66136
0.50843 0.289442
0.327962 0.523229
julia> for i in 1:K
push!(W,rand(D,D))
end
And then run
julia> (x_matrix[i,:]-mus_matrix[:,k])'*
w_matrix[k]*(x_matrix[i,:]-mus_matrix[:,k])
34649.850360744866
But with the second version I get:
julia> (x_matrix[i,:].-mus_matrix)'.*w_matrix.*(x_matrix[i,:].-mus_matrix)
ERROR: DimensionMismatch("arrays could not be broadcast to a common size")

TL;DR: optimized variant below, but Einsum looks nicer, IMHO.
Looks like a case for using Einstein summation notation. In Julia, Einsum.jl can do this:
julia> N = 5
5
julia> D = 3
3
julia> K = 10
10
julia> x = rand(N, D)
5×3 Array{Float64,2}:
0.587436 0.210529 0.261725
0.527269 0.457477 0.482939
0.52726 0.411209 0.138872
0.89107 0.464789 0.758392
0.885267 0.931014 0.672959
julia> μ = rand(D, K)
3×10 Array{Float64,2}:
0.280792 0.265066 0.81437 0.503377 0.0717916 … 0.275872 0.609961 0.0820088 0.0042564
0.0177643 0.0959438 0.563948 0.332433 0.088527 0.691971 0.0296638 0.604488 0.956057
0.668128 0.444816 0.74203 0.518232 0.48689 0.465067 0.117469 0.729514 0.109973
julia> W = rand(K, D, D)
10×3×3 Array{Float64,3}:
[:, :, 1] =
0.320861 0.662103 0.219234
0.780944 0.769377 0.566203
0.466207 0.428527 0.330901
0.15534 0.035435 0.346737
0.810676 0.328116 0.469505
0.676575 0.668204 0.285334
0.455551 0.211295 0.85295
0.229995 0.741487 0.783361
0.0937583 0.401419 0.47032
0.956335 0.434213 0.967791
[:, :, 2] =
0.275903 0.130298 0.184485
0.941648 0.940107 0.439454
0.425292 0.252654 0.797115
0.0203406 0.594075 0.484809
0.164309 0.941597 0.455314
0.73628 0.109502 0.920664
0.906305 0.177235 0.540193
0.360038 0.0486971 0.20626
0.914357 0.699901 0.295872
0.284143 0.659117 0.291479
[:, :, 3] =
0.138311 0.921371 0.353719
0.345247 0.70865 0.246736
0.361364 0.636543 0.343837
0.752149 0.581561 0.346399
0.705888 0.24765 0.703952
0.992327 0.369668 0.109407
0.341624 0.223715 0.970667
0.762169 0.94248 0.917569
0.0367128 0.589345 0.121106
0.826602 0.692111 0.229499
julia> using Einsum
julia> @einsum r[n,k] := (x[n,i] - μ[i,k]) * W[k,i,j] * (x[n,j] - μ[j,k])
julia> r
5×10 Array{Float64,2}:
0.0176889 0.087092 0.522184 0.0417967 … -0.0430999 0.041266 -0.0596579 0.432076
0.0521066 0.364059 0.181515 0.00434307 -0.0248712 0.226976 -0.0686294 0.437169
-0.0472136 0.127803 0.458812 0.0119074 0.0391649 -0.0190299 -0.0585371 0.264379
0.468634 1.16498 -0.00263205 0.192809 0.273537 1.13787 -0.0653081 1.41321
0.749655 2.20266 0.0205068 0.420249 0.573358 1.42499 0.441232 1.67574
Which @macroexpands to essentially the following loops (plus preparation and bounds checking):
begin
    local k
    for k = 1:size(μ, 2)
        begin
            local n
            for n = 1:size(x, 1)
                begin
                    local s = zero(T)
                    begin
                        local j
                        for j = 1:size(W, 3)
                            begin
                                local i
                                for i = 1:size(x, 2)
                                    s += (x[n, i] - μ[i, k]) * W[k, i, j] * (x[n, j] - μ[j, k])
                                end
                            end
                        end
                    end
                    r[n, k] = s
                end
            end
        end
    end
end
Now, to find something more performant, I compared a couple of variants using BenchmarkTools.jl. You can see the full code and results on my laptop here. They show that the Einsum variant is in fact already better than the original:
# Original:
# memory estimate: 1017.73 MiB
# allocs estimate: 3429967
# median time: 361.982 ms (15.94% GC)
# Einsum:
# memory estimate: 2.64 MiB
# allocs estimate: 76
# median time: 127.536 ms (0.00% GC)
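For reference, here is a minimal sketch of how the Einsum variant can be wrapped and timed with BenchmarkTools (the wrapper name einsum_version is mine, not part of the linked results):
using BenchmarkTools, Einsum

# Wrap the whole Einsum computation in a function so a single call can be benchmarked.
function einsum_version(x, μ, W)
    @einsum r[n, k] := (x[n, i] - μ[i, k]) * W[k, i, j] * (x[n, j] - μ[j, k])
    return r
end

@btime einsum_version($x, $μ, $W);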
By far the most efficient and least allocating variant is the following, which requires x = x' and W = permutedims(W, [2, 3, 1]) (assuming you can change your representation easily):
function test_optimized!(res, x, μ, W)
    z = zero(eltype(x))
    for k = 1:size(μ, 2)
        for n = 1:size(x, 2)
            res[n, k] = z
            for i = 1:size(W, 1)
                for j = 1:size(W, 2)
                    @inbounds res[n, k] += (x[i, n] - μ[i, k]) * W[i, j, k] * (x[j, n] - μ[j, k])
                end
            end
        end
    end
end

function test_optimized(x, μ, W)
    res = zeros(size(x, 2), size(μ, 2))
    test_optimized!(res, x, μ, W)
    res
end
This brings us down to
# memory estimate: 2.63 MiB
# allocs estimate: 2
# median time: 521.215 μs (0.00% GC)
It uses a couple of "tricks" that can be found in the docs: filling a preallocated matrix in a separate method, accessing arrays in column-major order, and using @inbounds (although that only improves things on the order of a microsecond).
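For concreteness, a short sketch of how the reshaped inputs would be prepared and passed, using the arrays from the setup above (the names xt and Wp are only illustrative):
# Assuming x is N×D, μ is D×K, and W is a K×D×D array as in the setup above.
xt = collect(x')                  # D×N: observations in columns (column-major friendly)
Wp = permutedims(W, [2, 3, 1])    # D×D×K
res = test_optimized(xt, μ, Wp)   # N×K matrix of quadratic forms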
There is also TensorOperations.jl, which I think does more intelligent things under the hood, but it fails on this:
julia> @tensor r[n,k] := (x[n,i] - μ[i,k]) * W[k,i,j] * (x[n,j] - μ[j,k])
ERROR: TensorOperations.IndexError{String}("invalid index specification: (:n, :i) to (:i, :k)")
Stacktrace:
[1] add_indices(::Tuple{Symbol,Symbol}, ::Tuple{Symbol,Symbol}) at /home/philipp/.julia/v0.6/TensorOperations/src/implementation/indices.jl:22
[2] + at /home/philipp/.julia/v0.6/TensorOperations/src/indexnotation/sum.jl:40 [inlined]
[3] -(::TensorOperations.IndexedObject{(:n, :i),:N,Array{Float64,2},Int64}, ::TensorOperations.IndexedObject{(:i, :k),:N,Array{Float64,2},Int64}) at /home/philipp/.julia/v0.6/TensorOperations/src/indexnotation/sum.jl:44
I guess that's deliberate and has to do with efficiency, see this issue.

Related

How to efficiently generate random unique series of bits with specific length?

Let's say I want to generate 3 unique random series of bits with a length of three. The possible output can be:
001 or [0, 0, 1]
010 or [0, 1, 0]
111 or [1, 1, 1]
# or
011 or [0, 1, 1]
110 or [1, 1, 0]
111 or [1, 1, 1]
# etc.
I provided two notations above (the Vector notation is preferred). The point is that they should be unique. I tried:
julia> unique(convert.(BitVector, rand.(Ref([0, 1]), repeat([3], 3))))
2-element Vector{BitVector}:
[0, 1, 1]
[0, 1, 0]
As you can see, there might be only two unique BitVectors rather than 3, which is natural here. I can replace repeat([3], 3) with a larger count, e.g. repeat([3], 5), to make it more likely that I get three unique sets:
julia> unique(convert.(BitVector, rand.(Ref([0, 1]), repeat([3], 5))))[1:3]
3-element Vector{BitVector}:
[1, 0, 0]
[1, 1, 1]
[1, 0, 1]
But I wonder if there's any better idea for this?
However, I'm really curious how I can efficiently generate the first notation from the question (strings like 101, 001, etc.).
Update: The following randBitSeq is about 3× faster. It generates unique random integers first, then fills a Boolean matrix with their binary digits.
using StatsBase

function randBitSeq(N, L)
    M = Matrix{Bool}(undef, N, L)
    S = sample(0:2^L-1, N; replace=false)
    i = 0
    for n in S
        i += 1
        for j = 1:L
            if n > 0
                M[i,j] = isodd(n)
                n ÷= 2
            else
                M[i,j] = false
            end
        end
    end
    return M
end
@btime randBitSeq(50, 10)
  1.350 μs (3 allocations: 9.17 KiB)
# vs.
@btime randseqset(50, 10)
  3.050 μs (5 allocations: 10.84 KiB)
Constructing all possible combinations will exponentially eat memory. A better option is to generate a Set of N random binary series of length L each. Then add more series if the required number is not achieved. This seems much faster for N,L > 3.
function randSeq(N, L)
    s = Set(rand(Bool, L) for i = 1:N)
    while length(s) < N
        push!(s, rand(Bool, L))
    end
    s
end
N = 50; L = 10
@btime randSeq($N, $L)
  4.071 μs (57 allocations: 4.71 KiB)
Another nice option is:
using Random, StatsBase
function randseqset(N, L)
    L < sizeof(Int)*8-1 || error("Short seqs only")
    m = BitArray(undef, L, N)
    s = sample(0:(1<<L)-1, N; replace=false)
    map(i -> digits!(@view(m[:, i]), s[i]; base=2), 1:N)
end
A version with simple vectors instead of @views is:
function randseqset(N, L)
    L < sizeof(Int)*8-1 || error("Short seqs only")
    s = sample(0:(1<<L)-1, N; replace=false)
    map(i -> digits!(Vector{Bool}(undef, L), s[i]; base=2), 1:N)
end
It has the benefit of adapting to the parameters a bit (inherited from the no-replacement sample code), and it is quite performant and allocation-thrifty.
For example:
julia> N = 10; L = 4;
julia> @btime randSeq($N, $L);
  1.126 μs (16 allocations: 1.12 KiB)
julia> @btime randseqset($N, $L);
  654.878 ns (5 allocations: 944 bytes)
PS: If a Matrix{Bool} is preferable to a BitMatrix, replace the m = ... line with m = Matrix{Bool}(undef, L, N).
PPS: As for the question about the strings, the following works (using the same logic as above):
randseqstrset(N, L) = getindex.(
bitstring.(sample(0:(1<<L)-1, N; replace=false)),
Ref(sizeof(Int)*8-L+1:sizeof(Int)*8))
for example:
julia> randseqstrset(3,3)
3-element Vector{String}:
"101"
"000"
"011"
UPDATE: If speed is really an issue, another version can use some BitMatrix trickery:
function randBitSeq2(N, L)
    BM = BitMatrix(undef, 0, 0)
    BM.chunks = sample(0:2^L-1, N; replace=false)
    BM.dims = (sizeof(Int64)*8, N)
    BM.len = sizeof(Int64)*8*N
    return @view(BM[1:L, :])
end
This version is called randBitSeq2 because it returns a matrix like randBitSeq but is twice as fast:
julia> @btime randBitSeq2(50,10);
  1.882 μs (5 allocations: 9.20 KiB)
julia> @btime randBitSeq(50,10);
  3.634 μs (3 allocations: 9.17 KiB)
Here's a thought: given you're only talking about bit strings of length 3, instead of trying to randomly generate them and then enforce uniqueness, how about just taking the set of all 3-bit strings, shuffling it, and selecting 3?
For example, you could use collect(Iterators.product([0,1],[0,1],[0,1])) to generate all length-3 bit strings, and then shuffle(x)[1:3] from Random to sample without replacement, e.g.
julia> using Random
julia> shuffle(collect(Iterators.product([0,1],[0,1],[0,1])))[1:3]
3-element Array{Tuple{Int64,Int64,Int64},1}:
(0, 1, 1)
(1, 1, 1)
(1, 0, 1)
Also if you want BitVectors instead you can do
julia> shuffle(BitArray.(Iterators.product([0,1],[0,1],[0,1])))[1:3]
3-element Array{BitArray{1},1}:
[1, 0, 1]
[0, 1, 1]
[1, 1, 1]
Obviously this won't scale well with the length of the strings, but it is a reasonable suggestion for this small case.
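For what it's worth, the same shuffle idea can be written for a general length L; here is a minimal sketch (the function name shuffled_bitvectors is mine, and it still materializes all 2^L combinations, so it only suits small L):
using Random

# Sketch: enumerate all 2^L bit patterns, shuffle, and take the first N (distinct by construction).
function shuffled_bitvectors(N, L)
    combos = vec(collect(Iterators.product(ntuple(_ -> (0, 1), L)...)))
    N <= length(combos) || error("cannot pick $N distinct length-$L sequences")
    return [BitVector(collect(t)) for t in shuffle(combos)[1:N]]
end

shuffled_bitvectors(3, 3)   # three distinct BitVectors of length 3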

Combining vectors of unequal length

x = [1, 2, 3, 4]
y = [1, 2]
If I want to be able to operate on the two vectors with a default value filling in, what are the strategies?
E.g. I would like to do the following and implicitly fill in with 0 or missing:
x + y # would like [2, 4, 3, 4]
Ideally I would like to do this in a generic way so that I could do arbitrary operations on the two.
Disregarding whether Julia has something built-in to do this, remember that Julia is fast. This means that you can write code to support this kind of need.
extend!(x, y::Vector, default=0) = extend!(x, length(y), default)

extend!(x, n::Int, default=0) = begin
    while length(x) < n
        push!(x, default)
    end
    x
end
Then when you have code such as you describe, you can symmetrically extend x and y:
x = [1, 2, 3, 4]
y = [1, 2]
extend!(x, y)
extend!(y, x)
x + y
==> [2, 4, 3, 4]
Note that this mutates y. In many cases, the desired length would come from outside the code and would be applied to both x and y. I can also imagine that 0 is a bad default in general (even though it is completely appropriate in your context of addition).
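If mutation is undesirable, the same idea can be written without touching the inputs; a quick sketch, assuming a numeric default (the name padded_add is hypothetical):
# Pad the shorter vector with `default`, then add; neither input is modified.
function padded_add(x, y, default=0)
    n = max(length(x), length(y))
    xp = [i <= length(x) ? x[i] : default for i in 1:n]
    yp = [i <= length(y) ? y[i] : default for i in 1:n]
    return xp + yp
end

padded_add([1, 2, 3, 4], [1, 2])   # [2, 4, 3, 4]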
A comment below makes the worthy point that you should consider using append! instead of looping over push!. If you care about small differences like that, it is best to measure. I went ahead and tested:
julia> using BenchmarkTools

julia> extend1(x, n) = begin
           while length(x) < n
               push!(x, 0)
           end
           x
       end

julia> @btime begin
           x = rand(10)
           sum(x)
       end
  59.815 ns (1 allocation: 160 bytes)
5.037723569560573

julia> @btime begin
           x = rand(10)
           extend1(x, 1000)
           sum(x)
       end
  7.281 μs (8 allocations: 20.33 KiB)
6.079832879992913

julia> x = rand(10)

julia> @btime begin
           x = rand(10)
           append!(x, zeros(990))
           sum(x)
       end
  1.290 μs (3 allocations: 15.91 KiB)
3.688526541987817
Pushing primitives in a loop is damned fast; allocating a vector of zeros so we can use append! is faster still.
But the real lesson here is seen in the fact that the loop version takes only microseconds to append nearly 1000 values (reallocating the array several times). Appending 10 values one by one takes just over 150 ns (and append! is slightly faster). This is blindingly fast. Literally doing nothing in R or Python can take longer than this.
This difference would matter in some situations and would be undetectable in many others. If it matters, measure. If it doesn't, do the simplest thing that comes to mind, because Julia has your back (performance-wise).
FURTHER UPDATE
Taking a hint from another of Colin's comments, here are results where we use append! but don't allocate a vector. Instead, we use a generator ... that is, an object that produces values on demand through an iteration interface much like an array's. The results are much better than what I showed above.
julia> @btime begin
           x = rand(10)
           append!(x, (0 for i in 1:990))
           sum(x)
       end
  565.814 ns (2 allocations: 8.03 KiB)
Note the round brackets around 0 for i in 1:990; they create a generator rather than an array.
In the end, Colin was right. Using append! is much faster if we can avoid related overheads. Surprisingly, the base function Iterators.repeated(0, 990) is much slower.
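For reference, that variant reads as follows (its timing is not reproduced above):
append!(x, Iterators.repeated(0, 990))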
But, no matter what, all of these options are pretty blazingly fast and all of them would probably be so fast that none of these subtle differences would matter.
Julia is fun!
Note that if you want to fill with missing or some other value whose type differs from the element type of your original vector, then you will need to change the element type of your vectors to allow the new elements. The function below handles any case.
function fillvectors(x, y, fillvalue=missing)
    xl = length(x)
    yl = length(y)
    if xl < yl
        x::Vector{Union{eltype(x), typeof(fillvalue)}} = x
        for i in xl+1:yl
            push!(x, fillvalue)
        end
    end
    if yl < xl
        y::Vector{Union{eltype(y), typeof(fillvalue)}} = y
        for i in yl+1:xl
            push!(y, fillvalue)
        end
    end
    return x, y
end
x = [1, 2, 3, 4]
y = [1, 2]
julia> (x, y) = fillvectors(x, y)
([1, 2, 3, 4], Union{Missing, Int64}[1, 2, missing, missing])
julia> y
4-element Vector{Union{Missing, Int64}}:
1
2
missing
missing
julia> (x, y) = fillvectors(x, y, 0)
([1, 2, 3, 4], [1, 2, 0, 0])
julia> y
4-element Vector{Int64}:
1
2
0
0
julia> (x, y) = fillvectors(x, y, 1.001)
([1, 2, 3, 4], Union{Float64, Int64}[1, 2, 1.001, 1.001])
julia> y
4-element Vector{Union{Float64, Int64}}:
1
2
1.001
1.001

Can't call `sort_exercise()`

I am trying to call both functions, starting with sort_exercise.
# reference https://www.geeksforgeeks.org/merge-sort/
# Merges two subarrays of A[]
# First subarray is A[p..m]
# Second subarray is A[m+1..r]
julia> function sort_exercise(A::Vector{Int}, p, m, r)
           n1 = m - p + 1
           n2 = r - m
           # create temp arrays
           L = zeros(Int, n1)
           R = zeros(Int, n2)
           # copy data to temp arrays L[] and R[]
           for i = 1:n1
               L[i] = A[p + i]
           end
           for j = 1:n2
               R[j] = A[m + 1 + j]
           end
           # Merge temp arrays back to A[1..r]
           i = 0 # Initial index of first subarray
           j = 0 # Initial index of second subarray
           k = p # Initial index of merged subarray
           while i < n1; j < n2
               if L[i] <= R[j]
                   A[k] = L[i]
                   i += 1
               else
                   A[k] = R[j]
                   j += 1
               end
               k += 1
           end
           # Copy any possible remaining elements of L[]
           while i < n1
               A[k] = L[i]
               i += 1
               k += 1
           end
           # Copy any possible remaining elements of R[]
           while j < n2
               A[k] = R[j]
               j += 1
               k += 1
           end
       end
sort_exercise (generic function with 1 method)
julia> sort_exercise([4, 5, 22, 1, 3], 1, 3, 5)
ERROR: BoundsError: attempt to access 5-element Array{Int64,1} at index [6]
Stacktrace:
[1] sort_exercise(::Array{Int64,1}, ::Int64, ::Int64, ::Int64) at ./REPL[1]:14
julia> function merge_exercise(A::Vector{Int}, p, r)
           if p < r
               # equivalent to `(p + r) / 2` w/o overflow for big p and h (no idea what h is)
               m = (p + (r - 1)) / 2
               # merge first half
               merge_exercise(A, p, m)
               # with second half
               merge_exercise(A, m + 1, r)
               # sort merged halves
               sort_exercise(A, p, m, r)
           end
       end
merge_exercise (generic function with 1 method)
It seems that you have translated the Python code.
In Python, L = [0] * n1 creates an array of size n1 filled with zeros. In Julia you can use L = zeros(Int, n1) to accomplish the same; L = zeros(Int, 1) * n1 is just the array [0], hence the out-of-bounds error.
Note that for i in range(1, n1) can also be written as for i = 1:n1.
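The copy loops in the code as posted also carry Python's 0-based offsets, which matches the BoundsError at index [6] above. A minimal sketch of those loops with 1-based indexing (the helper name copy_halves is mine, for illustration only):
# Sketch: copy the two halves of A with 1-based indices.
function copy_halves(A::Vector{Int}, p, m, r)
    n1 = m - p + 1
    n2 = r - m
    L = zeros(Int, n1)
    R = zeros(Int, n2)
    for i = 1:n1
        L[i] = A[p + i - 1]   # the question's A[p + i] starts one element too late
    end
    for j = 1:n2
        R[j] = A[m + j]       # the question's A[m + 1 + j] reads past the end (A[6] for a 5-element array)
    end
    return L, R
end

copy_halves([4, 5, 22, 1, 3], 1, 3, 5)   # ([4, 5, 22], [1, 3])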

How to find the index of the last maximum in julialang?

I have an array that contains repeated nonnegative integers, e.g., A=[5,5,5,0,1,1,0,0,0,3,3,0,0]. I would like to find the position of the last maximum in A. That is the largest index i such that A[i]>=A[j] for all j. In my example, i=3.
I tried to find the indices of all maxima of A and then take the maximum of those indices:
A = [5,5,5,0,1,1,0,0,0,3,3,0,0];
Amax = maximum(A);
i = maximum(find(x -> x == Amax, A));
Is there any better way?
length(A) - indmax(@view A[end:-1:1]) + 1
should be pretty fast, but I didn't benchmark it.
EDIT: I should note that by definition @crstnbr's solution (to write the algorithm from scratch) is faster (how much faster is shown in Xiaodai's response). This is an attempt to do it using Julia's built-in array functions.
What about findlast(A.==maximum(A)) (which of course is conceptually similar to your approach)?
The fastest thing would probably be explicit loop implementation like this:
function lastindmax(x)
    k = 1
    m = x[1]
    @inbounds for i in eachindex(x)
        if x[i] >= m
            k = i
            m = x[i]
        end
    end
    return k
end
I tried @Michael's solution and @crstnbr's solution, and I found the latter much faster:
a = rand(Int8(1):Int8(5), 1_000_000_000)

@time length(a) - indmax(@view a[end:-1:1]) + 1 # 19 seconds
@time length(a) - indmax(@view a[end:-1:1]) + 1 # 18 seconds

function lastindmax(x)
    k = 1
    m = x[1]
    @inbounds for i in eachindex(x)
        if x[i] >= m
            k = i
            m = x[i]
        end
    end
    return k
end

@time lastindmax(a) # 3 seconds
@time lastindmax(a) # 2.8 seconds
Michael's solution doesn't support Strings (ERROR: MethodError: no method matching view(::String, ::StepRange{Int64,Int64})) or sequences, so I add another solution:
julia> lastimax(x) = maximum((j,i) for (i,j) in enumerate(x))[2]
julia> A="abžcdž"; lastimax(A) # unicode is OK
6
julia> lastimax(i^2 for i in -10:7)
1
If you would rather not get an exception for an empty sequence:
julia> lastimax(x) = !isempty(x) ? maximum((j,i) for (i,j) in enumerate(x))[2] : 0;
julia> lastimax(i for i in 1:3 if i>4)
0
Simple(!) benchmarks:
This is up to 10 times slower than Michael's solution for Float64:
julia> mlastimax(A) = length(A) - indmax(@view A[end:-1:1]) + 1;
julia> A = rand(Float64, 1_000_000); @time lastimax(A); @time mlastimax(A)
0.166389 seconds (4.00 M allocations: 91.553 MiB, 4.63% gc time)
0.019560 seconds (6 allocations: 240 bytes)
80346
(I am surprised) it is 2 times faster for Int64!
julia> A = rand(Int64, 1_000_000); @time lastimax(A); @time mlastimax(A)
0.015453 seconds (10 allocations: 304 bytes)
0.031197 seconds (6 allocations: 240 bytes)
423400
It is 2-3 times slower for Strings:
julia> A = ["A$i" for i in 1:1_000_000]; @time lastimax(A); @time mlastimax(A)
0.175117 seconds (2.00 M allocations: 61.035 MiB, 41.29% gc time)
0.077098 seconds (7 allocations: 272 bytes)
999999
EDIT2:
@crstnbr's solution is faster and works with Strings too (but not with generators). There is a difference between lastindmax and lastimax: the first returns a byte index, the second a character index:
julia> S = "1š3456789ž"
julia> length(S)
10
julia> lastindmax(S) # return value is bigger than length
11
julia> lastimax(S) # return character index (which is not byte index to String) of last max character
10
julia> S[chr2ind(S, lastimax(S))]
'ž': Unicode U+017e (category Ll: Letter, lowercase)
julia> S[chr2ind(S, lastimax(S))]==S[lastindmax(S)]
true

Order of linear algebra operations in Julia

If I have a command y = A*B*x where A & B are large matrices and x & y are vectors, will Julia perform y = ((A*B)*x) or y = (A*(B*x))?
The second option should be better, as it only has to allocate an extra vector rather than a large matrix.
The best way to verify this kind of thing is to dump the lowered code via the @code_lowered macro:
julia> @code_lowered A * B * x
CodeInfo(:(begin
nothing
return (Core._apply)(Base.afoldl, (Core.tuple)(Base.*, (a * b) * c), xs)
end))
Like many other languages, Julia does y = (A*B)*x instead of y = A*(B*x), so it's up to you to explicitly use parens to reduce the allocation.
julia> using BenchmarkTools
julia> @btime $A * ($B * $x);
  6.800 μs (2 allocations: 1.75 KiB)
julia> @btime $A * $B * $x;
  45.453 μs (3 allocations: 79.08 KiB)
