I have a long string in Julia. I'd like to apply some operation to each line. How can I efficiently iterate over each line? I think I can use split but I am wondering if there is a method that won't allocate all the strings upfront?
You can use eachline for this:
julia> str = """
a
b
c
"""
"a\nb\nc\n"
julia> for line in eachline(IOBuffer(str))
println(line)
end
a
b
c
There's also a version that operates directly on a file, in case that's relevant to you:
help?> eachline
search: eachline eachslice
eachline(io::IO=stdin; keep::Bool=false)
eachline(filename::AbstractString; keep::Bool=false)
Create an iterable EachLine object that will yield each line from an I/O stream or a file. Iteration calls readline on
the stream argument repeatedly with keep passed through, determining whether trailing end-of-line characters are
retained. When called with a file name, the file is opened once at the beginning of iteration and closed at the end. If
iteration is interrupted, the file will be closed when the EachLine object is garbage collected.
Examples
≡≡≡≡≡≡≡≡≡≡
julia> open("my_file.txt", "w") do io
write(io, "JuliaLang is a GitHub organization.\n It has many members.\n");
end;
julia> for line in eachline("my_file.txt")
print(line)
end
JuliaLang is a GitHub organization. It has many members.
julia> rm("my_file.txt");
If you already have the complete string in memory then you can (and should) use split, as pointed out in the comments. split basically indexes into the string and doesn't allocate new Strings for each line, as opposed to eachline.
Following the approach of this answer I am trying to understand what happens exactly and how expressions and generated functions work in Julia within the concept of metaprogramming.
The goal is to optimize a recursive function using expressions and generated functions (for a concrete example you can have a look at the question answered in the link provided above).
Consider the following modified fibonacci function, in which I want to compute the fibonacci series up to n and multiply it by a number p.
The straightforward, recursive implementation would be
function fib(n::Integer, p::Real)
if n <= 1
return 1 * p
else
return n * fib(n-1, p)
end
end
As a first step, I could define a function which returns an expression instead of the computed value
function fib_expr(n::Integer, p::Symbol)
if n <= 1
return :(1 * $p)
else
return :($n * $(fib_expr(n-1, p)))
end
end
which, e.g. returns something like
julia> ex = fib_expr(3, :myp)
:(3 * (2 * (1myp)))
In this way I get an expression which is fully expanded and depends on the value assigned to the symbol myp. In this way I do not see the recursion anymore, basically I am metaprogramming: I created a function that creates another "function" (in this case we call it expression though).
I can now set myp = 0.5 and call eval(ex) to compute the result.
However, this is slower than the first approach.
What I can do though, is to generate a parametric function in the following way
#generated function fib_gen{n}(::Type{Val{n}}, p::Real)
return fib_expr(n, :p)
end
And magically, calling fib_gen(Val{3}, 0.5) gets things done, and is incredibly fast.
So, what is going on?
To my understanding, in the first call to fib_gen(Val{3}, 0.5), the parametric function fib_gen{Val{3}}(...) gets compiled and its content is the fully expanded expression obtained through fib_expr(3, :p), i.e. 3*2*1*p with p substituted with the input value.
The reason why it is so fast then, is because fib_gen is basically just a series of multiplications, whereas the original fib has to allocate on the stack every single recursive call making it slower, am I correct?
To give some numbers, here is my short benchmark using BenchmarkTools.
julia> #benchmark fib(10, 0.5)
...
mean time: 26.373 ns
...
julia> p = 0.5
0.5
julia> #benchmark eval(fib_expr(10, :p))
...
mean time: 177.906 μs
...
julia> #benchmark fib_gen(Val{10}, 0.5)
...
mean time: 2.046 ns
...
I have many questions:
Why the second case is so slow?
What exactly is and means ::Type{Val{n}}? (I copied that from the answer linked above)
Because of the JIT compiler, sometimes I am lost in what happens at compile-time and at run-time, as it is the case here...
Furthermore, I tried to combine fib_expr and fib_gen in a single function according to
#generated function fib_tot{n}(::Type{Val{n}}, p::Real)
if n <= 1
return :(1 * p)
else
return :(n * fib_tot(Val{n-1}, p))
end
end
which however is slow
julia> #benchmark fib_tot(Val{10}, 0.5)
...
mean time: 4.601 μs
...
What am I doing wrong here? Is it even possible to combine fib_expr and fib_gen in a single function?
I realize this is more a monograph rather than a question, however, even though I read the metaprogramming section few times, I am having a hard time to grasp everything, in particular with an applied example such as this one.
A monograph in response:
Metaprogramming basics
It will be easier to start with "normal" macros first. I'll relax the definition you used a bit:
function fib_expr(n::Integer, p)
if n <= 1
return :(1 * $p)
else
return :($n * $(fib_expr(n-1, p)))
end
end
That allows to pass in more than just symbols for p, like integer literals or whole expressions. Given this, we can define a macro for the same functionality:
macro fib_macro(n::Integer, p)
fib_expr(n, p)
end
Now, if #fib_macro 45 1 is used anywhere in the code, at compile time it will first be replaced by a long nested expression
:(45 * (44 * ... * (1 * 1)) ... )
and then compiled normally -- to a constant.
That's all there is to macros, really. Replacing syntax during compile time; and by recursion, this can be an arbitrarily long alteration between compiling, and evaluating functions on expressions. And for things that are essentially constant, but tedious to write otherwise, it is very useful: a bood example example is Base.Math.#evalpoly.
Evaluation at runtime?
But it has the problem that you cannot inspect values which are only known at runtime: you can't implement fib(n) = #fib_macro n 1, since at compile time, n is a symbol representing the parameter, and not a number you can dispatch on.
The next best solution to this would be to use
fib_eval(n::Integer) = eval(fib_expr(n, 1))
which works, but will repeat the compilation process every time it is called -- and that is much more overhead than the original function, since now at runtime, we perform the whole recursion on the expression tree and then call the compiler on the result. Not good.
Method dispatch & compilation
So we need a way to intermingle runtime and compile time. Enter #generated functions. These will at runtime dispatch on a type, and then work like a macro defining the function body.
First about type dispatch. If we have
f(x) = x + 1
and have a function call f(1), about the following will happen:
The type of the argument is determined (Int)
The method table of the function is consulted to find the best matching method
The method body is compiled for the specific Int argument type, if that hasn't been done before
The compiled method is evaluated on the concrete argument
If we then enter f(1.0), the same will happen again, with a new, different specialized method being compiled for Float64, based on the same function body.
Value types & singleton types
Now, Julia has the peculiar feature that you can use numbers as types. That means that the dispatch process outlined above will also work on the following function:
g(::Type{Val{N}}) where N = N + 1
That's a bit tricky. Remember that types are themselves values in Julia: Int isa Type.
Here, Val{N} is for every N a so-called singleton type having exactly one instance, namely Val{N}() -- just like Int is a type having many instances 0, -1, 1, -2, ....
Type{T} is also a singleton type, having as its single instance the type T. Int is a Type{Int}, and Val{3} is a Type{Val{3}} -- in fact, both are the only values of their type.
So, for each N, there is a type Val{N}, being the single instance of Type{Val{N}}. Thus, g will be dispatched and compiled for each single N. This is how we can dispatch on numbers as types. This already allows for optimization:
julia> #code_llvm g(Val{1})
define i64 #julia_g_61158(i8**) #0 !dbg !5 {
top:
ret i64 2
}
julia> #code_llvm f(1)
define i64 #julia_f_61076(i64) #0 !dbg !5 {
top:
%1 = shl i64 %0, 2
%2 = or i64 %1, 3
%3 = mul i64 %2, %0
%4 = add i64 %3, 2
ret i64 %4
}
But remember that it requires compilation for each new N at the first call.
(And fkt(::T) is just short for fkt(x::T) if you don't use x in the body.)
Integrating generating functions and value types
Finally to generated functions. They work as a slight modification of the above dispatch pattern:
The type of the argument is determined (Int)
The method table of the function is consulted to find the best matching method
The method body is treated as a macro and called with the Int argument type as a parameter, if that hasn't been done before. The resulting expression is compiled into a method.
The compiled method is evaluated on the concrete argument
This pattern allows to change the implementation for each type which the function is dispatched on.
For our concrete setting, we want to dispatch on the Val types representing the arguments of the Fibonacci sequence:
#generated function fib_gen{n}(::Type{Val{n}}, p::Real)
return fib_expr(n, :p)
end
You now see that your explanation was exactly right:
in the first call to fib_gen(Val{3}, 0.5), the parametric function
fib_gen{Val{3}}(...) gets compiled and its content is the fully
expanded expression obtained through fib_expr(3, :p), i.e. 3*2*1*p
with p substituted with the input value.
I hope that the whole story has also answered all three of your listed questions:
The implementation using eval replicates the recursion every time, plus the overhead of compilation
Val is a trick to lift numbers to types, and Type{T} the singleton type containing only T -- but I hope the examples were helpful enough
Compile time is not before execution, because of JIT -- it is every time a method gets compiled first time, because it get's called.
First of all, I am joining myself to the comments: your question is very well written & constructive.
I have reproduced your results using Julia 0.7-beta.
Difference between #generated fib_tot (one piece of code) and fib_gen (that calls fib_expr)
With my julia version results are identicals:
julia> #btime fib_tot(Val{10},0.5)
0.042 ns (0 allocations: 0 bytes)
1.8144e6
julia> #btime fib_gen(Val{10},0.5)
0.042 ns (0 allocations: 0 bytes)
1.8144e6
Sometimes breaking a function into multiple parts see official doc:performance tips can be useful, however in your peculiar case I do not see why this could be useful. At compile time Julia has everything it needs to optimize fib_tot. There is a branch if n<=1 however n is known at "compile time" thanks to the Type{Val{n}} trick and this branch should be removed without problem in the generated (specialized) code.
The Type{Val{n}} trick
To specialize functions, Julia inference is performed according to argument type and not according to argument value.
For instance a compiled version of foo(n::Int) = ... is not generated for each n value. You must define a type that depends on n value to reach this goal. This is precisely how Type{Val{n}} works: Val{n} is simply a parametrized empty structure:
struct Val{T} end
Hence, each Val{1}, Val{2}, ... Val{100}, ... is a different type. By consequence, if foo is defined as:
foo(::Type{Val{n}}) where {n} = ...
Each foo(Val{1}), foo(Val{2}), ... foo(Val{100}) will trigger a specialized foo version (because argument type is different).
The eval(fib_expr(n, 1)) case
This
julia> #btime eval(fib_expr(10, :p))
401.651 μs (99 allocations: 6.45 KiB)
1.8144e6
is slow because your expression is (re-)compiled every time. The problem can be avoided if you use a macro instead (see phg answer).
The fib version
.
julia> #btime fib(10,0.5)
30.778 ns (0 allocations: 0 bytes)
1.8144e6
There is only one compiled version of this fib function. By consequence, it must contain all the runtime branch tests etc... This explains how slow it is.
Just a remark about:
foo{n}(::Type{Val{n}}) deprecated syntax
The foo{n}(::Type{Val{n}}) syntax is deprecated, the new one is foo(::Type{Val{n}}) where {n}. You can read Julia doc, parametric methods for further details.
My Julia version:
julia> versioninfo()
Julia Version 0.7.0-beta.0
Commit f41b1ecaec (2018-06-24 01:32 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2603 v3 # 1.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
I have some code that runs with no problems without parallelization. However, the same code generates exceptions if I try to run it using PSeq instead of Seq. The messages I get look a bit random, they are hard to replicate exactly.
Here is the code. When the exception happens the three lines starting with let tmp2 are highlighted.
let frameToRMatrix (df: Frame<'R,string>) =
let foo k df : float list =
df
|> Frame.getCol k
|> Series.values
|> List.ofSeq
let folder acc k = (k, foo k df |> box) :: acc
let tmp =
List.fold folder [] (df.ColumnKeys |> List.ofSeq)
|> namedParams
let sd = df |> Frame.getCol "Vol0" |> Series.lastValue
let sd = sd * 1000.0 |> int
printfn "%s" "I was here"
let rand = System.Random(sd)
let rms = rand.Next(500)
System.Threading.Thread.Sleep rms
let tmp2 =
tmp
|> R.cbind // This line prints something on the console the first time it is executed
printfn "%s" "And here too"
tmp2
The code above includes random number generation and calls to System.Threading.Thread.Sleep. If I do not include this code, which is not needed under sequential execution, I get a message:
System.ArgumentException: 'An item with the same key has already been added.'
and the following on the console:
I was here
I was here
[1] 4095
So execution never gets to the And here too lines.
When I include the random number generator and the call to sleep I get different results, which seem to depend on the build options.
Below are four examples, with the build options, error message and what I see on the console. Notice that in all examples there are four instances of I was here but only three instances of And here too.
---------------------------------------
Any CPU with Prefer 32-bit checked
System.Runtime.InteropServices.SEHException: 'External component has thrown an exception.'
I was here
I was here
[1] 4095
And here too
I was here
And here too
I was here
And here too
Warning: stack imbalance in 'lazyLoadDBfetch', 66 then 65
Error in value[[3L]](cond) : unprotect_ptr: pointer not found
---------------------------------------
Any CPU with Prefer 32-bit unchecked
System.Runtime.InteropServices.SEHException: 'External component has thrown an exception.'
I was here
I was here
[1] 1.759219e+13
And here too
I was here
And here too
I was here
And here too
Error: cons memory exhausted (limit reached?)
Error: cons memory exhausted (limit reached?)
---------------------------------------
x86
System.AccessViolationException" 'Attempted to read or write protected memory. This is often an indication that other memory is corrupt.'
I was here
I was here
[1] 4095
And here too
I was here
And here too
I was here
And here too
---------------------------------------
x64
Exception thrown: 'System.AccessViolationException' in Unknown Module. Attermpted to read or write protected memory.
$$$ - MachineLearning.signal: Calculating signal for ticker AAPL
$$$ - MachineLearning.signal: Calculating signal for ticker AAPL
I was here
I was here
[1] 1.759219e+13
And here too
I was here
And here too
I was here
And here too
Error in loadNamespace(name) :
no function to return from, jumping to top level
Based on my experience with debugging subtle issues with threading in the R type provider, I think the answer is no - sadly, the R native interop layer is not thread-safe and so you cannot call it from multiple threads in your F# application.
I think that the standard way of running R in parallel is to spawn multiple R.exe processes doing the work. I don't think you can easily initialise multiple independent R processes from F#, so your best bet is probably to create multiple .NET processes that each controls one R engine.
I need to read records from a file, each being 9 bytes long. I need to know how to start reading at different points in the file
It looks like you're looking for the seek function:
help?> seek
search: seek seekend seekstart ParseError setenv select select! selectperm
seek(s, pos)
Seek a stream to the given position.
In particular you might want to
open(filename) do f
seek(f, n) # seek past nth byte
read(f, m) # read m bytes
end
There is also the skip function that may come in useful
help?> skip
search: skip skipchars
skip(s, offset)
Seek a stream relative to the current position.
Is there a way to enforce a dictionary being constant?
I have a function which reads out a file for parameters (and ignores comments) and stores it in a dict:
function getparameters(filename::AbstractString)
f = open(filename,"r")
dict = Dict{AbstractString, AbstractString}()
for ln in eachline(f)
m = match(r"^\s*(?P<key>\w+)\s+(?P<value>[\w+-.]+)", ln)
if m != nothing
dict[m[:key]] = m[:value]
end
end
close(f)
return dict
end
This works just fine. Since i have a lot of parameters, which i will end up using on different places, my idea was to let this dict be global. And as we all know, global variables are not that great, so i wanted to ensure that the dict and its members are immutable.
Is this a good approach? How do i do it? Do i have to do it?
Bonus answerable stuff :)
Is my code even ok? (it is the first thing i did with julia, and coming from c/c++ and python i have the tendencies to do things differently.) Do i need to check whether the file is actually open? Is my reading of the file "julia"-like? I could also readall and then use eachmatch. I don't see the "right way to do it" (like in python).
Why not use an ImmutableDict? It's defined in base but not exported. You use one as follows:
julia> id = Base.ImmutableDict("key1"=>1)
Base.ImmutableDict{String,Int64} with 1 entry:
"key1" => 1
julia> id["key1"]
1
julia> id["key1"] = 2
ERROR: MethodError: no method matching setindex!(::Base.ImmutableDict{String,Int64}, ::Int64, ::String)
in eval(::Module, ::Any) at .\boot.jl:234
in macro expansion at .\REPL.jl:92 [inlined]
in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at .\event.jl:46
julia> id2 = Base.ImmutableDict(id,"key2"=>2)
Base.ImmutableDict{String,Int64} with 2 entries:
"key2" => 2
"key1" => 1
julia> id.value
1
You may want to define a constructor which takes in an array of pairs (or keys and values) and uses that algorithm to define the whole dict (that's the only way to do so, see the note at the bottom).
Just an added note, the actual internal representation is that each dictionary only contains one key-value pair, and a dictionary. The get method just walks through the dictionaries checking if it has the right value. The reason for this is because arrays are mutable: if you did a naive construction of an immutable type with a mutable field, the field is still mutable and thus while id["key1"]=2 wouldn't work, id.keys[1]=2 would. They go around this by not using a mutable type for holding the values (thus holding only single values) and then also holding an immutable dict. If you wanted to make this work directly on arrays, you could use something like ImmutableArrays.jl but I don't think that you'd get a performance advantage because you'd still have to loop through the array when checking for a key...
First off, I am new to Julia (I have been using/learning it since only two weeks). So do not put any confidence in what I am going to say unless it is validated by others.
The dictionary data structure Dict is defined here
julia/base/dict.jl
There is also a data structure called ImmutableDict in that file. However as const variables aren't actually const why would immutable dictionaries be immutable?
The comment states:
ImmutableDict is a Dictionary implemented as an immutable linked list,
which is optimal for small dictionaries that are constructed over many individual insertions
Note that it is not possible to remove a value, although it can be partially overridden and hidden
by inserting a new value with the same key
So let us call what you want to define as a dictionary UnmodifiableDict to avoid confusion. Such object would probably have
a similar data structure as Dict.
a constructor that takes a Dict as input to fill its data structure.
specialization (a new dispatch?) of the the method setindex! that is called by the operator [] =
in order to forbid modification of the data structure. This should be the case of all other functions that end with ! and hence modify the data.
As far as I understood, It is only possible to have subtypes of abstract types. Therefore you can't make UnmodifiableDict as a subtype of Dict and only redefine functions such as setindex!
Unfortunately this is a needed restriction for having run-time types and not compile-time types. You can't have such a good performance without a few restrictions.
Bottom line:
The only solution I see is to copy paste the code of the type Dict and its functions, replace Dict by UnmodifiableDict everywhere and modify the functions that end with ! to raise an exception if called.
you may also want to have a look at those threads.
https://groups.google.com/forum/#!topic/julia-users/n-lqjybIO_w
https://github.com/JuliaLang/julia/issues/1974
REVISION
Thanks to Chris Rackauckas for pointing out the error in my earlier response. I'll leave it below as an illustration of what doesn't work. But, Chris is right, the const declaration doesn't actually seem to improve performance when you feed the dictionary into the function. Thus, see Chris' answer for the best resolution to this issue:
D1 = [i => sind(i) for i = 0.0:5:3600];
const D2 = [i => sind(i) for i = 0.0:5:3600];
function test(D)
for jdx = 1:1000
# D[2] = 2
for idx = 0.0:5:3600
a = D[idx]
end
end
end
## Times given after an initial run to allow for compiling
#time test(D1); # 0.017789 seconds (4 allocations: 160 bytes)
#time test(D2); # 0.015075 seconds (4 allocations: 160 bytes)
Old Response
If you want your dictionary to be a constant, you can use:
const MyDict = getparameters( .. )
Update Keep in mind though that in base Julia, unlike some other languages, it's not that you cannot redefine constants, instead, it's just that you get a warning when doing so.
julia> const a = 2
2
julia> a = 3
WARNING: redefining constant a
3
julia> a
3
It is odd that you don't get the constant redefinition warning when adding a new key-val pair to the dictionary. But, you still see the performance boost from declaring it as a constant:
D1 = [i => sind(i) for i = 0.0:5:3600];
const D2 = [i => sind(i) for i = 0.0:5:3600];
function test1()
for jdx = 1:1000
for idx = 0.0:5:3600
a = D1[idx]
end
end
end
function test2()
for jdx = 1:1000
for idx = 0.0:5:3600
a = D2[idx]
end
end
end
## Times given after an initial run to allow for compiling
#time test1(); # 0.049204 seconds (1.44 M allocations: 22.003 MB, 5.64% gc time)
#time test2(); # 0.013657 seconds (4 allocations: 160 bytes)
To add to the existing answers, if you like immutability and would like to get performant (but still persistent) operations which change and extend the dictionary, check out FunctionalCollections.jl's PersistentHashMap type.
If you want to maximize performance and take maximal advantage of immutability, and you don't plan on doing any operations on the dictionary whatsoever, consider implementing a perfect hash function-based dictionary. In fact, if your dictionary is a compile-time constant, these can even be computed ahead of time (using metaprogramming) and precompiled.