Performance of looping over array of dicts in Julia - julia

I am doing loops over arrays/vector of dicts. My Julia performance is slower than Python.
Run time is 25% longer than Python.
The toy program below represents the structure of what I am doing.
const poems = [
Dict(
:text => "Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore—",
:author => "Edgar Allen Poe"),
Dict(
:text => "Because I could not stop for Death – He kindly stopped for me – The Carriage held but just Ourselves – And Immortality.",
:author => "Emily Dickinson"),
# etc... 10,000 more
]
const remove_count = [
Dict(:words => "the", :count => 0),
Dict(:words => "and", :count => 0),
# etc... 100 more
]
function loop_poems(poems, remove_count)
for p in poems
for r in remove_count
if occursin(r[:words], p[:text])
r[:count] += 1
end
end
end
end
How do I optimize? I have read through Performance Tips in Julia website:
First, I declare constants.
Second, I assume since I pass arguments remove count and poems into the function, I don't need to declare as global.
Third, the meat of the processing (the loops) are in a function.
Fourth... I don't know how to declare types in an array of dicts (specifically for performance). How to do it? Would it help performance?

The issue here seems to be, what we call "type instability". Julia code is fast, when Julia can figure out the correct types at runtime, but is slower when the types are not known. To figure out, if there is any kind of type instability, you can use the #code_warntype macro in your REPL:
julia> #code_warntype loop_poems(poems, remove_count)
StackOverflow does not show the colors of the output, but you should look out for red parts, that indicate that Julia cannot narrow down the types enough. Yellow parts also indicate places, where the type is not known exactly, but these parts are often intentionally, so we have to worry less about them.
In my case (Julia v1.8.5) the following lines have some red color
│ %17 = Base.getindex(r, :words)::Any
│ %18 = Base.getindex(p, :text)::String
│ %19 = Main.occursin(%17, %18)::Bool
└── goto #5 if not %19
4 ─ %21 = Base.getindex(r, :count)::Any
│ %22 = (%21 + 1)::Any
the suffixes ::Any indicate that Julia could only infer the type Any here, which could be any type.
We also see that this happens in the cases of Base.getindex(r, :words) and Base.getindex(r, :count) - these are just the de-sugared expressions r[:words] and r[:count].
So why is that the case? If we look at the type of remove_count with
julia> typeof(remove_count)
Vector{Dict{Symbol, Any}}
We see that that the key type of the dictionary can only be a Symbol - but the value type can be any kind of type. We can get a very moderate speed up by constructing remove_count so that we narrow down the value type to a union:
const remove_count = Dict{Symbol, Union{String, Int}[
Dict(:words => "the", :count => 0),
Dict(:words => "and", :count => 0),
# etc... 100 more
]
Running #code_warntype again, shows, that we still have some red entries, but this time at least they are of type Union{String, Int} - but the speed up is still disappointing.
As others have pointed out, it might be better, if you find a different data structure so that your code is type stable.
There are multiple ways to do that. Probably the easiest is to use a vector of NamedTuple:
const remove_count = [
(word="the", count=0),
(word="and", count=0),
# etc... 100 more
)
so that
typeof(remove_count)
Vector{NamedTuple{(:word, :count), Tuple{String, Int64}}}
and your function then becomes
function loop_poems(poems, remove_count)
for p in poems
for i in eachindex(remove_count)
word = remove_count[i].word
count = remove_count[i].count
if occursin(word, text)
# NamedTuple is immutable, so wee need to create a new one
remove_count[i] = (word=word, count=count + 1)
end
end
end
end
If we use #code_warntype again, the red parts have disappeared.
There are few other easy improvements:
Use the #inbounds macro when looping over arrays: https://docs.julialang.org/en/v1/devdocs/boundscheck/#Eliding-bounds-checks
Move p[:text] into the outer loop
and your function then becomes:
function loop_poems(poems, remove_count)
#inbounds for p in poems
text = p[:text]
for i in eachindex(remove_count)
word = remove_count[i].word
count = remove_count[i].count
if occursin(word, text)
# NamedTuple is immutable, so wee need to create a new one
remove_count[i] = (word=word, count=count + 1)
end
end
end
end
It also might make sense to also convert poems into a vector of NamedTuple.
Ultimately, if you still need more performance, you might better look at your domain and at more complex string algorithms:
Can your words contain any white spaces? If not, maybe split the poems into tokens.
If you have a lot of words, your words might share a lot of prefixes - in such a case a trie might be of help: https://en.wikipedia.org/wiki/Trie

You want function loop_poems(poems, remove_count), not function loop_poems(poem, remove_count). Your code as written is accessing poems as a global variable.

Related

Trying to pass an array into a function

I'm very new to Julia, and I'm trying to just pass an array of numbers into a function and count the number of zeros in it. I keep getting the error:
ERROR: UndefVarError: array not defined
I really don't understand what I am doing wrong, so I'm sorry if this seems like such an easy task that I can't do.
function number_of_zeros(lst::array[])
count = 0
for e in lst
if e == 0
count + 1
end
end
println(count)
end
lst = [0,1,2,3,0,4]
number_of_zeros(lst)
There are two issues with your function definition:
As noted in Shayan's answer and Dan's comment, the array type in Julia is called Array (capitalized) rather than array. To see:
julia> array
ERROR: UndefVarError: array not defined
julia> Array
Array
Empty square brackets are used to instantiate an array, and if preceded by a type, they specifically instantiate an array holding objects of that type:
julia> x = Int[]
Int64[]
julia> push!(x, 3); x
1-element Vector{Int64}:
3
julia> push!(x, "test"); x
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
Thus when you do Array[] you are actually instantiating an empty vector of Arrays:
julia> y = Array[]
Array[]
julia> push!(y, rand(2)); y
1-element Vector{Array}:
[0.10298669573927233, 0.04327245960128345]
Now it is important to note that there's a difference between a type and an object of a type, and if you want to restrict the types of input arguments to your functions, you want to do this by specifying the type that the function should accept, not an instance of this type. To see this, consider what would happen if you had fixed your array typo and passed an Array[] instead:
julia> f(x::Array[])
ERROR: TypeError: in typeassert, expected Type, got a value of type Vector{Array}
Here Julia complains that you have provided a value of the type Vector{Array} in the type annotation, when I should have provided a type.
More generally though, you should think about why you are adding any type restrictions to your functions. If you define a function without any input types, Julia will still compile a method instance specialised for the type of input provided when first call the function, and therefore generate (most of the time) machine code that is optimal with respect to the specific types passed.
That is, there is no difference between
number_of_zeros(lst::Vector{Int64})
and
number_of_zeros(lst)
in terms of runtime performance when the second definition is called with an argument of type Vector{Int64}. Some people still like type annotations as a form of error check, but you also need to consider that adding type annotations makes your methods less generic and will often restrict you from using them in combination with code other people have written. The most common example of this are Julia's excellent autodiff capabilities - they rely on running your code with dual numbers, which are a specific numerical type enabling automatic differentiation. If you strictly type your functions as suggested (Vector{Int}) you preclude your functions from being automatically differentiated in this way.
Finally just a note of caution about the Array type - Julia's array's can be multidimensional, which means that Array{Int} is not a concrete type:
julia> isconcretetype(Array{Int})
false
to make it concrete, the dimensionality of the array has to be provided:
julia> isconcretetype(Array{Int, 1})
true
First, it might be better to avoid variable names similar to function names. count is a built-in function of Julia. So if you want to use the count function in the number_of_zeros function, you will undoubtedly face a problem.
Second, consider returning the value instead of printing it (Although you didn't write the print function in the correct place).
Third, You can update the value by += not just a +!
Last but not least, Types in Julia are constantly introduced with the first capital letter! So we don't have an array standard type. It's an Array.
Here is the correction of your code.
function number_of_zeros(lst::Array{Int64})
counter = 0
for e in lst
if e == 0
counter += 1
end
end
return counter
end
lst = [0,1,2,3,0,4]
number_of_zeros(lst)
would result in 2.
Additional explanation
First, it might be better to avoid variable names similar to function names. count is a built-in function of Julia. So if you want to use the count function in the number_of_zeros function, you will undoubtedly face a problem.
Check this example:
function number_of_zeros(lst::Array{Int64})
count = 0
for e in lst
if e == 0
count += 1
end
end
return count, count(==(1), lst)
end
number_of_zeros(lst)
This code will lead to this error:
ERROR: MethodError: objects of type Int64 are not callable
Maybe you forgot to use an operator such as *, ^, %, / etc. ?
Stacktrace:
[1] number_of_zeros(lst::Vector{Int64})
# Main \t.jl:10
[2] top-level scope
# \t.jl:16
Because I overwrote the count variable on the count function! It's possible to avoid such problems by calling the function from its module:
function number_of_zeros(lst::Array{Int64})
count = 0
for e in lst
if e == 0
count += 1
end
end
return count, Base.count(==(1), lst)
The point is I used Base.count, then the compiler knows which count I mean by Base.count.

What is an alternative to Kotlin's `reduce` operation in Rust?

I encountered this competitive programming problem:
nums is a vector of integers (length n)
ops is a vector of strings containing + and - (length n-1)
It can be solved with the reduce operation in Kotlin like this:
val op_iter = ops.iterator();
nums.reduce {a, b ->
when (op_iter.next()) {
"+" -> a+b
"-" -> a-b
else -> throw Exception()
}
}
reduce is described as:
Accumulates value starting with the first element and applying operation from left to right to current accumulator value and each element.
It looks like Rust vectors do not have a reduce method. How would you achieve this task?
Edited: since Rust version 1.51.0, this function is called reduce
Be aware of similar function which is called fold. The difference is that reduce will produce None if iterator is empty while fold accepts accumulator and will produce accumulator's value if iterator is empty.
Outdated answer is left to capture the history of this function debating how to name it:
There is no reduce in Rust 1.48. In many cases you can simulate it with fold but be aware that the semantics of the these functions are different. If the iterator is empty, fold will return the initial value whereas reduce returns None. If you want to perform multiplication operation on all elements, for example, getting result 1 for empty set is not too logical.
Rust does have a fold_first function which is equivalent to Kotlin's reduce, but it is not stable yet. The main discussion is about naming it. It is a safe bet to use it if you are ok with nightly Rust because there is little chance the function will be removed. In the worst case, the name will be changed. If you need stable Rust, then use fold if you are Ok with an illogical result for empty sets. If not, then you'll have to implement it, or find a crate such as reduce.
Kotlin's reduce takes the first item of the iterator for the starting point while Rust's fold and try_fold let you specify a custom starting point.
Here is an equivalent of the Kotlin code:
let mut it = nums.iter().cloned();
let start = it.next().unwrap();
it.zip(ops.iter()).try_fold(start, |a, (b, op)| match op {
'+' => Ok(a + b),
'-' => Ok(a - b),
_ => Err(()),
})
Playground
Or since we're starting from a vector, which can be indexed:
nums[1..]
.iter()
.zip(ops.iter())
.try_fold(nums[0], |a, (b, op)| match op {
'+' => Ok(a + b),
'-' => Ok(a - b),
_ => Err(()),
});
Playground

How to find if item is contained in Dict in Julia

I'm fairly new to Julia and am trying to figure out how to check if the given expression is contained in a Dict I've created.
function parse( expr::Array{Any} )
if expr[1] == #check here if "expr[1]" is in "owl"
return BinopNode(owl[expr[1]], parse( expr[2] ), parse( expr[3] ) )
end
end
owl = Dict(:+ => +, :- => -, :* => *, :/ => /)
I've looked at Julia's documentation and other resources, but can't find any answer to this.
"owl" is the name of my dictionary that I'm trying to check. I want to run the return statement should expr[1] be either "+,-,* or /".
A standard approach to check if some dictionary contains some key would be:
:+ in keys(owl)
or
haskey(owl, :+)
Your solution depends on the fact that you are sure that 0 is not one of the values in the dictionary, which might not be true in general. However, if you want to use such an approach (it is useful when you do not want to perform a lookup in the dictionary twice: once to check if it contains some key, and second time to get the value assigned to the key if it exists) then normally you would use nothing as a sentinel and then perform the check get_return_value !== nothing (note two = here - they are important for the compiler to generate an efficient code). So your code would look like this:
function myparse(expr::Array{Any}, owl) # better pass `owl` as a parameter to the function
v = get(expr[1], owl, nothing)
if v !== nothing
return BinopNode(v, myparse(expr[2]), myparse(expr[3]))
end
# and what do we do if v === nothing?
end
Note that I use myparse name, as parse is a function defined in Base, so we do not want to have a name clash. Finally your myparse is recursive so you should define a second method to this function handling the case when expr is not an Array{Any}.
I feel like an idiot for finding this so fast, but I came up with the following solution: (Willing to hear more efficient answers however)
yes = 1
yes = get(owl,expr[1],0)
if yes != 0
#do return statement here
"yes" should get set equal to 0 if the expression is not found in the dictionary "owl". So a simple != if statement to see if it's zero fixes my problem.

How can I shorten a type in the body of a function

Best be explained by an example:
I define a type
type myType
α::Float64
β::Float64
end
z = myType( 1., 2. )
Then suppose that I want to pass this type as an argument to a function:
myFunc( x::Vector{Float64}, m::myType ) =
x[1].^m.α+x[2].^m.β
Is there a way to pass myType so that I can actually use it in the body of the function in a "cleaner" fashion as follows:
x[1].^α+x[2].^β
Thanks for any answer.
One way is to use dispatch to a more general function:
myFunc( x::Vector{Float64}, α::Float64, β::Float64) = x[1].^α+x[2].^β
myFunc( x::Vector{Float64}, m::myType = myFunc(x,m.α,m.β)
Or if your functions are longer, you may want to use Parameters.jl's #unpack:
function myFunc( x::Vector{Float64}, m::myType )
#unpack m: α,β #now those are defined
x[1].^α+x[2].^β
end
The overhead of unpacking is small because you're not copying, it's just doing a bunch of α=m.α which is just making an α which points to m.α. For longer equations, this can be a much nicer form if you have many fields and use them in long calculations (for reference, I use this a lot in DifferentialEquations.jl).
Edit
There's another way as noted in the comments. Let me show this. You can define your type (with optional kwargs) using the #with_kw macro from Parameters.jl. For example:
using Parameters
#with_kw type myType
α::Float64 = 1.0 # Give a default value
β::Float64 = 2.0
end
z = myType() # Generate with the default values
Then you can use the #unpack_myType macro which is automatically made by the #with_kw macro:
function myFunc( x::Vector{Float64}, m::myType )
#unpack_myType m
x[1].^α+x[2].^β
end
Again, this only has the overhead of making the references α and β without copying, so it's pretty lightweight.
You could add this to the body of your function:
(α::Float64, β::Float64) = (m.α, m.β)
UPDATE: My original answer was wrong for a subtle reason, but I thought it was a very interesting bit of information so rather than delete it altogether, I'm leaving it with an explanation on why it's wrong. Many thanks to Fengyang for pointing out the global scope of eval! (as well as the use of $ in an Expr context!)
The original answer suggested that:
[eval( parse( string( i,"=",getfield( m,i)))) for i in fieldnames( m)]
would return a list comprehension which had assignment side-effects, since it conceptually would result in something like [α=1., β=2., etc]. The assumption was that this assignment would be within local scope. However, as pointed out, eval is always assessed at global scope, therefore the above one-liner does not do what it's meant to. Example:
julia> type MyType
α::Float64
β::Float64
end
julia> function myFunc!(x::Vector{Float64}, m::MyType)
α=5.; β=6.;
[eval( parse( string( i,"=",getfield( m,i)))) for i in fieldnames( m)]
x[1] = α; x[2] = β; return x
end;
julia> myFunc!([0.,0.],MyType(1., 2.))
2-element Array{Float64,1}:
5.0
6.0
julia> whos()
MyType 124 bytes DataType
myFunc 0 bytes #myFunc
α 8 bytes Float64
β 8 bytes Float64
I.e. as you can see, the intention was for the local variables α and β to be overwritten, but they didn't; eval placed α and β variables at global scope instead. As a matlab programmer I naively assumed that eval() was conceptually equivalent to Matlab, without actually checking. Turns out it's more similar to the evalin('base',...) command.
Thanks again to Fengyand for giving another example of why the phrase "parse and eval" seems to have about the same effect on Julia programmers as the word "it" on the knights who until recently said "NI". :)

Julia: non-destructively update immutable type variable

Let's say there is a type
immutable Foo
x :: Int64
y :: Float64
end
and there is a variable foo = Foo(1,2.0). I want to construct a new variable bar using foo as a prototype with field y = 3.0 (or, alternatively non-destructively update foo producing a new Foo object). In ML languages (Haskell, OCaml, F#) and a few others (e.g. Clojure) there is an idiom that in pseudo-code would look like
bar = {foo with y = 3.0}
Is there something like this in Julia?
This is tricky. In Clojure this would work with a data structure, a dynamically typed immutable map, so we simply call the appropriate method to add/change a key. But when working with types we'll have to do some reflection to generate an appropriate new constructor for the type. Moreover, unlike Haskell or the various MLs, Julia isn't statically typed, so one does not simply look at an expression like {foo with y = 1} and work out what code should be generated to implement it.
Actually, we can build a Clojure-esque solution to this; since Julia provides enough reflection and dynamism that we can treat the type as a sort of immutable map. We can use fieldnames to get the list of "keys" in order (like [:x, :y]) and we can then use getfield(foo, :x) to get field values dynamically:
immutable Foo
x
y
z
end
x = Foo(1,2,3)
with_slow(x, p) =
typeof(x)(((f == p.first ? p.second : getfield(x, f)) for f in fieldnames(x))...)
with_slow(x, ps...) = reduce(with_slow, x, ps)
with_slow(x, :y => 4, :z => 6) == Foo(1,4,6)
However, there's a reason this is called with_slow. Because of the reflection it's going to be nowhere near as fast as a handwritten function like withy(foo::Foo, y) = Foo(foo.x, y, foo.z). If Foo is parametised (e.g. Foo{T} with y::T) then Julia will be able to infer that withy(foo, 1.) returns a Foo{Float64}, but won't be able to infer with_slow at all. As we know, this kills the crab performance.
The only way to make this as fast as ML and co is to generate code effectively equivalent to the handwritten version. As it happens, we can pull off that version as well!
# Fields
type Field{K} end
Base.convert{K}(::Type{Symbol}, ::Field{K}) = K
Base.convert(::Type{Field}, s::Symbol) = Field{s}()
macro f_str(s)
:(Field{$(Expr(:quote, symbol(s)))}())
end
typealias FieldPair{F<:Field, T} Pair{F, T}
# Immutable `with`
for nargs = 1:5
args = [symbol("p$i") for i = 1:nargs]
#eval with(x, $([:($p::FieldPair) for p = args]...), p::FieldPair) =
with(with(x, $(args...)), p)
end
#generated function with{F, T}(x, p::Pair{Field{F}, T})
:($(x.name.primary)($([name == F ? :(p.second) : :(x.$name)
for name in fieldnames(x)]...)))
end
The first section is a hack to produce a symbol-like object, f"foo", whose value is known within the type system. The generated function is like a macro that takes types as opposed to expressions; because it has access to Foo and the field names it can generate essentially the hand-optimised version of this code. You can also check that Julia is able to properly infer the output type, if you parametrise Foo:
#code_typed with(x, f"y" => 4., f"z" => "hello") # => ...::Foo{Int,Float64,String}
(The for nargs line is essentially a manually-unrolled reduce which enables this.)
Finally, lest I be accused of giving slightly crazy advice, I want to warn that this isn't all that idiomatic in Julia. While I can't give very specific advice without knowing your use case, it's generally best to have fields with a manageable (small) set of fields and a small set of functions which do the basic manipulation of those fields; you can build on those functions to create the final public API. If what you want is really an immutable dict, you're much better off just using a specialised data structure for that.
There is also setindex (without the ! at the end) implemented in the FixedSizeArrays.jl package, which does this in an efficient way.

Resources