How to overload Base.view() for a custom struct-of-arrays type in Julia, or extend the StructArrays type with an overloaded Base.push!() method

I should preface this by stating I am very new to the Julia language and I think I am missing something fundamental.
I am trying to set up a numerical simulation code and ultimately want it to work on both CPU and GPU. The method I am attempting is the Material Point Method (MPM), and it has been reported (here, for instance) that the better memory layouts for this type of problem are generally SoA and AoSoA (Struct of Arrays and Array of Structs of Arrays), as they allow for coalesced memory access. Since a large portion of MPM falls into the category of "embarrassingly parallel", this is of great benefit.
I have a pretty good idea of how I might approach this using just AoS (Array of Structs), as it follows the OOP paradigms I am more familiar with. I also have some idea of what a SoA should (I think...) look like when converting from AoS, but am having difficulties realizing things in Julia (as I said, the language is new to me).
I realize that the Julia package StructArrays gets close to what I am after but does not have push! built in, and I wanted to get more familiar with the language by doing things more manually.
What I have so far:
struct Node{S,V,M}
    mass::S
    velocity::V
    affine::M
end

function Node()::Node
    return Node(0., zeros(3), zeros(3,3))
end

struct Nodes{S,V,M}
    mass::Array{S}
    velocity::Array{V}
    affine::Array{M}
end

function Nodes(S,V,M)::Nodes
    return Nodes{S,V,M}(Array{S}(undef,0), Array{V}(undef,0), Array{M}(undef,0))
end

function Base.push!(nodes::Nodes)
    push!(nodes.mass, 0.0)
    push!(nodes.velocity, zeros(3))
    push!(nodes.affine, zeros(3,3))
end

function Base.push!(nodes::Nodes, m, v, a)
    push!(nodes.mass, m)
    push!(nodes.velocity, v)
    push!(nodes.affine, a)
end

function Base.getindex(nodes::Nodes, idx::Int)::Node
    return Node(nodes.mass[idx], nodes.velocity[idx], nodes.affine[idx])
end
Which works like it should, e.g. -
julia> nodes = Nodes(Float64, SVector{3,Float64}, SMatrix{3,3,Float64})
Nodes{Float64, SVector{3, Float64}, SMatrix{3, 3, Float64, L} where L}(Float64[], SVector{3, Float64}[], SMatrix{3, 3, Float64, L} where L[])

julia> @benchmark push!(nodes)
BenchmarkTools.Trial: 10000 samples with 952 evaluations.
 Range (min … max):  96.954 ns … 409.226 μs  ┊ GC (min … max): 0.00% … 84.19%
 Time  (median):     166.807 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   480.813 ns ± 7.516 μs   ┊ GC (mean ± σ):  64.80 ± 6.64%
 Memory estimate: 352 bytes, allocs estimate: 3.

julia> nodes.mass
10020502-element Vector{Float64}:
but I wanted to then extend Nodes to be compatible with views and slices but:
julia> view(nodes, 1:1)
ERROR: MethodError: no method matching view(::Nodes{Float64, SVector{3, Float64}, SMatrix{3, 3, Float64, L} where L}, ::UnitRange{Int64})
Which seemed fairly straightforward, as I have not defined a method with that (or a compatible) signature. I thought I would look at one of the closest candidates (which just so happens to be from StructArrays) below:
Closest candidates are:
view(::ChainRulesCore.AbstractZero, ::Any...) at C:\Users\DIi.julia\packages\ChainRulesCore\a4mIA\src\tangent_types\abstract_zero.jl:39
view(::StructArray{T, N, C, I} where I, ::Any...) where {T, N, C} at C:\Users\DIi.julia\packages\StructArrays\F5fDf\src\structarray.jl:361
view(::AbstractUnitRange, ::AbstractUnitRange{var"#s77"} where var"#s77"<:Integer) at subarray.jl:186 ...
Which I went and looked at
@inline function Base.view(s::StructArray{T, N, C}, I...) where {T, N, C}
    @boundscheck checkbounds(s, I...)
    StructArray{T}(map(v -> @inbounds(view(v, I...)), components(s)))
end
And I seem to know relatively little of what is going on...
In the above, what is I? I know T is the type of the structure, N is the number of elements, and C is a tuple argument which allows forming a StructArray using a NamedTuple (and is also used by the constructor function, via new(), to package the field components by name...), but I have no idea what the I is, other than that in the base library I is a UniformScaling (as in the identity matrix, vector, value 1.0, etc.). What is it doing in the call signature? Note, the docstring in structarray.jl explains T, N, and C, as well as components below, but makes no mention of I.
Why is "I" followed by ellipses?
What does the where {T, N, C} mean? Is it just asserting that all three parametric types were supplied before continuing?
What is "StructArray{T}(map(v -> #inbounds(view(v, I...)), components(s)))" doing? I get what #inbounds macro does and map i think is supposed to be applying view to each of the elements of the StructArray and then is calling its own constructor to instantiate that view of the existing allocation but what is with the "v,I..." and for that matter what is "v->" doing?
Lastly, more questions:
Is there a more Julia-like way to define the structs and methods of Nodes, Node, etc.? I think I should be returning something more general, like an AbstractVector?
How should I write a view method for the Nodes struct?
How would I add a push!() method to extend StructArray, as I think I will have better luck using that than rolling my own? I can use
nodes = StructArray{Node}(undef)
which does allow me to define the first element as in
nodes[1]=Node()
but then throws a BoundsError if I try to add another
nodes[2]=Node()
I could pre-define some length
StructArray{Node}(undef,N)
but then I don't know how to grow beyond that size, and I would like to be able to arbitrarily shrink or grow the Nodes arrays, as I intend to use a hierarchical grid structure and get some kind of automatic refinement working.
If you got this far, thank you for your patience.

Turns out I got myself confused. While messing around with push!() on the individual arrays of a StructArray, I ended up leaving them in a state that was incompatible with push!().
Turns out StructArray is compatible with push!() right out of the box, so there is no reason to re-invent the implementation.
As an aside, there is additional memory overhead associated with the use of StructArray, but indexing in is faster, so do with that what you will.
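For anyone landing here later, a minimal sketch of that out-of-the-box behaviour (assuming StructArrays and StaticArrays are installed; the concrete type parameters of Node are chosen purely for illustration):

```julia
using StructArrays, StaticArrays

struct Node{S,V,M}
    mass::S
    velocity::V
    affine::M
end

# An empty StructArray with a fully concrete element type
nodes = StructArray{Node{Float64, SVector{3,Float64}, SMatrix{3,3,Float64,9}}}(undef, 0)

# push! works out of the box: each field lands in its own backing array
push!(nodes, Node(0.0, zeros(SVector{3}), zeros(SMatrix{3,3})))

# ...and so do views/slices, via the Base.view method shown above
v = view(nodes, 1:1)
```

Because the element type is fully concrete (note the `9` length parameter on `SMatrix`), each component array is a plain typed `Vector`, which is what makes both `push!` and coalesced access cheap.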


Writing a function that will take an AbstractVector as its parameter

From Performance Tips:
When working with parameterized types, including arrays, it is best to
avoid parameterizing with abstract types where possible.
Suppose you're writing a function that requires a Vector, and with each use of the function, the Vector can contain different types.
function selection_sort!(unsorted_vect::AbstractVector)
    # Code to sort.
end

names = String["Sarah", "Kathy", "Amber"]
unsorted_fibonacci = Int32[8, 1, 34, 21, 3, 5, 0, 13, 2, 1]
selection_sort!(names)
selection_sort!(unsorted_fibonacci)
When using selection_sort!() the first time, the Vector contains String, the second time Int32.
The function knows it's getting an AbstractVector, but it doesn't know the type of elements.
Will performance be inefficient? Would it be the same as writing selection_sort!(unsorted_vect::AbstractVector{Real})?
The same section also says:
If you cannot avoid containers with abstract value types, it is
sometimes better to parametrize with Any to avoid runtime type
checking. E.g. IdDict{Any, Any} performs better than IdDict{Type, Vector}
Would it be better to write the function this way?
function selection_sort!(unsorted_vect::AbstractVector{Any})
If so, why do the sorting algorithms just use AbstractVector?
function sort!(v::AbstractVector, ...) # other parameters left out
What you need is:
function selection_sort!(unsorted_vect::AbstractVector{T}) where T
    # Code to sort.
end
In this way you have an abstract type for the container (and hence any container will get accepted).
Yet those containers can be non-abstract, because they can have a concrete type for their elements, T. As noted by Bogumil, if in the body code you do not need the type T, you could write function selection_sort!(unsorted_vect::AbstractVector) instead.
However, this is just a constraint on argument types. Julia would be just as happy with function selection_sort!(unsorted_vect), and the code would be just as efficient.
What will really matter are the types of arguments. Consider the following 3 variables:
a = Vector{Float64}(rand(10))
b = Vector{Union{Float64,Int}}(rand(10))
c = Vector{Any}(rand(10))
a is a typed container, b is a small-Union-typed container, and c is an abstract container. Let us have a look at what happens with the performance:
julia> @btime minimum($a);
  18.530 ns (0 allocations: 0 bytes)

julia> @btime minimum($b);
  28.600 ns (0 allocations: 0 bytes)

julia> @btime minimum($c);
  241.071 ns (9 allocations: 144 bytes)
The variable c requires value unboxing (due to unknown element type) and hence the performance is degraded by an order of magnitude.
In conclusion, it is the element type parameter that actually matters.
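To make that point concrete, a small sketch (the wrapper function is hypothetical): the specialization Julia compiles depends on the argument's actual element type, not on how loosely the signature is spelled:

```julia
# Both calls below hit this same method, but Julia compiles a separate,
# fully typed specialization for each concrete argument type it sees.
my_minimum(v::AbstractVector) = minimum(v)

a = rand(10)               # Vector{Float64}: tight, specialized machine code
c = Vector{Any}(rand(10))  # Vector{Any}: boxed elements, dynamic dispatch per element

my_minimum(a)  # fast path
my_minimum(c)  # same answer, an order of magnitude slower
```

`@code_warntype my_minimum(c)` would show the `Any`-typed intermediate values that force the slow path.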

How to increase Julia code performance by preventing memory allocation?

I am reading Julia performance tips,
https://docs.julialang.org/en/v1/manual/performance-tips/
At the beginning, it mentions two examples.
Example 1,
julia> x = rand(1000);

julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_global()
  0.009639 seconds (7.36 k allocations: 300.310 KiB, 98.32% compilation time)
496.84883432553846

julia> @time sum_global()
  0.000140 seconds (3.49 k allocations: 70.313 KiB)
496.84883432553846
We see a lot of memory allocations.
Now example 2,
julia> x = rand(1000);

julia> function sum_arg(x)
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_arg(x)
  0.006202 seconds (4.18 k allocations: 217.860 KiB, 99.72% compilation time)
496.84883432553846

julia> @time sum_arg(x)
  0.000005 seconds (1 allocation: 16 bytes)
496.84883432553846
We see that by putting x into the argument of the function, the memory allocations almost disappeared and the speed is much faster.
My questions are, can anyone explain:
why example 1 needs so many allocations, and why example 2 does not need as many allocations as example 1?
I am a little confused.
In the two examples, we see that the second time we run Julia, it is always faster than the first time.
Does that mean we need to run Julia twice? If Julia is only fast on the second run, then what is the point? Why does Julia not just do a compile first and then a run, like Fortran?
Is there any general rule for preventing memory allocations? Or do we just always have to do a @time to identify the issue?
Thanks!
why example 1 needs so many allocation, and why example 2 does not need as many allocations as example 1?
Example 1 needs so many allocations because x is a global variable (defined outside the scope of the function sum_global). Therefore the type of the variable x can potentially change at any time, i.e. it is possible that:
you define x and sum_global
you compile sum_global
you redefine x (change its type) and run sum_global
In particular, as Julia supports multithreading, both actions in step 3 could in general even happen in parallel (i.e. you could change the type of x in one thread while sum_global is running in another thread).
So, because the type of x can change after the compilation of sum_global, Julia, when compiling sum_global, has to ensure that the compiled code does not rely on the type of x that was present when the compilation took place. Instead Julia, in such cases, allows the type of x to be changed dynamically. However, this dynamic nature of x means that it has to be checked at run time (not compile time), and this dynamic checking of x causes the performance degradation and memory allocations.
You could have fixed this by declaring x to be a const (as const ensures that the type of x may not change):
julia> const x = rand(1000);

julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_global() # this is now fast
  0.000002 seconds
498.9290555615045
Why not Julia just do a compiling first, then do a run, just like Fortran?
This is exactly what Julia does. However, the benefit of Julia is that it does compilation automatically when needed. This allows you for a smooth interactive development process.
If you wanted you could compile the function before it is run with the precompile function, and then run it separately. However, normally people just run the function without doing it explicitly.
The consequence is that if you use @time:
The first time you run a function, it reports both the execution time and the compilation time (and, as you can see in the examples you have pasted, you get information about what percentage of time was spent on compilation).
In consecutive runs the function is already compiled, so only the execution time is reported.
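As a sketch of the explicit ahead-of-time route mentioned above (the function name here is made up for illustration):

```julia
double(x) = 2x

# Compile the Int specialization before the first call;
# precompile returns true if a specialization was generated.
precompile(double, (Int,))

# The first call now pays (almost) no compilation latency.
double(21)  # → 42
```

In practice almost nobody does this interactively; it is mostly used inside packages to shift compilation cost to package-install time.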
Is there any general rule to preventing memory allocations?
These rules are given exactly in the Performance Tips section of the manual that you are quoting in your question. The tip on using @time is a diagnostic tip there. All the other tips are rules recommended for getting fast code. However, I understand that the list is long, so a shorter list that in my experience is good enough to start with is:
Avoid global variables
Avoid containers with abstract type parameters
Write type stable functions
Avoid changing the type of a variable
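A minimal illustration of the type-stability rule (both functions are hypothetical): a type-stable function returns the same type for given argument types, which lets the compiler emit tight code:

```julia
# Type-unstable: the return type depends on the *value* of x
# (Int for positive x, Float64 otherwise), so callers cannot be specialized.
unstable(x) = x > 0 ? 1 : 1.0

# Type-stable: always returns Float64, whatever the value of x.
stable(x) = x > 0 ? 1.0 : 0.0
```

Running `@code_warntype unstable(1)` highlights the problematic `Union{Float64, Int64}` return type in red; `@code_warntype stable(1)` shows a clean `Float64`.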

Efficiently loop through structs in Julia

I have a simple question. I have defined a struct, and I need to instantiate a lot of them (on the order of millions) and loop over them.
I am creating one at a time and going through the loop as follows:
using Distributions

mutable struct help_me{Z<:Bool}
    can_you_help_me::Z
    millions_of_thanks::Z
end

for i in 1:max_iter
    tmp_help = help_me(rand(Bernoulli(0.5),1)[1], rand(Bernoulli(0.99),1)[1])
    # many follow-up processes
end
The memory allocation scales with max_iter. For my purposes, I do not need to save each struct. Is there a way to "re-use" the memory allocation used by the struct?
Your main problem lies here:
rand(Bernoulli(0.5),1)[1], rand(Bernoulli(0.99),1)[1]
You are creating a length-1 array and then reading the first element from that array. This allocates unnecessary memory and takes time. Don't create an array here. Instead, write
rand(Bernoulli(0.5)), rand(Bernoulli(0.99))
This will just create random scalar numbers, no array.
Compare timings here:
julia> using BenchmarkTools

julia> @btime rand(Bernoulli(0.5),1)[1]
  36.290 ns (1 allocation: 96 bytes)
false

julia> @btime rand(Bernoulli(0.5))
  6.708 ns (0 allocations: 0 bytes)
false
6 times as fast, and no memory allocation.
This seems to be a general issue. Very often I see people writing rand(1)[1], when they should be using just rand().
Also, consider whether you actually need to make the struct mutable, as others have mentioned.
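For instance, a sketch of an immutable variant (the struct name is made up; the Distributions draws are replaced by plain rand() comparisons to keep the sketch self-contained):

```julia
struct HelpMe            # immutable: instances can often be stack-allocated
    can_you_help_me::Bool
    millions_of_thanks::Bool
end

for i in 1:1_000
    tmp = HelpMe(rand() < 0.5, rand() < 0.99)  # scalar draws, no temporary arrays
    # many follow-up processes
end
```

Because `HelpMe` is an immutable bits type, creating one inside the loop does not require a heap allocation at all, so there is no memory for the GC to reclaim in the first place.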
If the structure is not needed anymore (i.e. not referenced anywhere outside the current loop iteration), the Garbage Collector will free up its memory automatically if required.
Otherwise, I agree with the suggestions of Oscar Smith: memory allocation and garbage collection take time, avoid it for performance reasons if possible.

Why is this in-place assignment allocating more memory?

I am trying to understand memory management a little better. I have the following example code:
begin
    mutable struct SimplestStruct
        a::Float64
    end

    function SimplestLoop!(a::Array{SimplestStruct, 1}, x::Float64)
        for i in 1:length(a)
            @inbounds a[i].a = x
        end
    end

    simples = fill(SimplestStruct(rand()), 100)

    @time SimplestLoop!(simples, 6.0)
end
As far as I can work out from the docs plus various good posts about in-place operations, SimplestLoop! should operate on its first argument without allocating any extra memory. However, @time reports 17k allocations.
Is there an obvious reason why this is happening?
Thank you in advance!
If you perform the @time measurement several times, you'll see that the first measurement is different from the others. This is because you're actually mostly measuring (just-ahead-of-time) compilation time and memory allocations.
When the objective is to better understand runtime performance, it is generally recommended to use BenchmarkTools to perform the benchmarks:
julia> using BenchmarkTools
julia> @btime SimplestLoop!($simples, 6.0)
  82.930 ns (0 allocations: 0 bytes)
BenchmarkTools's @btime macro takes care of excluding compilation time, interpolates globals via $ so that untyped-global access does not distort the measurement, and averages runtime over a sufficiently large number of samples to get accurate estimates. With this, we see that there are indeed no allocations in your code, as expected.

Immutable types and performances

I'm wondering about immutable types and performance in Julia.
In which cases does making a composite type immutable improve performance? The documentation says:
They are more efficient in some cases. Types like the Complex example
above can be packed efficiently into arrays, and in some cases the
compiler is able to avoid allocating immutable objects entirely.
I don't really understand the second part.
Are there cases where making a composite type immutable reduces performance (beyond the case where a field needs to be changed by reference)? I thought one example could be when an object of an immutable type is used repeatedly as an argument, since
An object with an immutable type is passed around (both in assignment statements and in function calls) by copying, whereas a mutable type is passed around by reference.
However, I can't find any difference in a simple example:
abstract type MyType end

mutable struct MyType1 <: MyType
    v::Vector{Int}
end

struct MyType2 <: MyType
    v::Vector{Int}
end

g(x::MyType) = sum(x.v)

function f(x::MyType)
    a = zero(Int)
    for i in 1:10_000
        a += g(x)
    end
    return a
end

x = fill(one(Int), 10_000)
x1 = MyType1(x)
@time f(x1)
# elapsed time: 0.030698826 seconds (96 bytes allocated)
x2 = MyType2(x)
@time f(x2)
# elapsed time: 0.031835494 seconds (96 bytes allocated)
So why isn't f slower with an immutable type? Are there cases where using immutable types makes a code slower?
Immutable types are especially fast when they are small and consist entirely of immediate data, with no references (pointers) to heap-allocated objects. For example, an immutable type that consists of two Ints can potentially be stored in registers and never exist in memory at all.
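A quick way to check whether a type qualifies as such plain immediate data is isbitstype (the Point type here is illustrative):

```julia
struct Point
    x::Int
    y::Int
end

isbitstype(Point)        # true: no heap references, packed inline in arrays
isbitstype(Vector{Int})  # false: a heap-allocated container
```

A `Vector{Point}` therefore stores the coordinate pairs contiguously, with no pointer indirection per element.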
Knowing that a value won't change also helps us optimize code. For example you access x.v inside a loop, and since x.v will always refer to the same vector we can hoist the load for it outside the loop instead of re-loading on every iteration. However whether you get any benefit from that depends on whether that load was taking a significant fraction of the time in the loop.
It is rare in practice for immutables to slow down code, but there are two cases where it might happen. First, if you have a large immutable type (say 100 Ints) and do something like sorting an array of them where you need to move them around a lot, the extra copying might be slower than pointing to objects with references. Second, immutable objects are usually not allocated on the heap initially. If you need to store a heap reference to one (e.g. in an Any array), we need to move the object to the heap. From there the compiler is often not smart enough to re-use the heap-allocated version of the object, and so might copy it repeatedly. In such a case it would have been faster to just heap-allocate a single mutable object up front.
The test above covers a special case, so it does not generalize and cannot rule out better performance for immutable types.
Check the following test and look at the different allocation times when creating a vector of immutables compared to a vector of mutables:
abstract type MyType end

mutable struct MyType1 <: MyType
    i::Int
    b::Bool
    f::Float64
end

struct MyType2 <: MyType
    i::Int
    b::Bool
    f::Float64
end

@time x = [MyType2(i, 1, 1) for i = 1:100_000];
# => 0.001396 seconds (2 allocations: 1.526 MB)
@time x = [MyType1(i, 1, 1) for i = 1:100_000];
# => 0.003683 seconds (100.00 k allocations: 3.433 MB)