How to iterate over the lines in a string? - julia

I have a long string in Julia. I'd like to apply some operation to each line. How can I efficiently iterate over each line? I think I can use split but I am wondering if there is a method that won't allocate all the strings upfront?

You can use eachline for this:
julia> str = """
a
b
c
"""
"a\nb\nc\n"
julia> for line in eachline(IOBuffer(str))
println(line)
end
a
b
c
There's also a version that operates directly on a file, in case that's relevant to you:
help?> eachline
search: eachline eachslice
eachline(io::IO=stdin; keep::Bool=false)
eachline(filename::AbstractString; keep::Bool=false)
Create an iterable EachLine object that will yield each line from an I/O stream or a file. Iteration calls readline on
the stream argument repeatedly with keep passed through, determining whether trailing end-of-line characters are
retained. When called with a file name, the file is opened once at the beginning of iteration and closed at the end. If
iteration is interrupted, the file will be closed when the EachLine object is garbage collected.
Examples
≡≡≡≡≡≡≡≡≡≡
julia> open("my_file.txt", "w") do io
write(io, "JuliaLang is a GitHub organization.\n It has many members.\n");
end;
julia> for line in eachline("my_file.txt")
print(line)
end
JuliaLang is a GitHub organization. It has many members.
julia> rm("my_file.txt");
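The example output above runs together because eachline strips the line endings by default and print adds none. A minimal sketch of the keep keyword, using an in-memory IOBuffer instead of a file (the variable names are only illustrative):

```julia
# An in-memory stream stands in for a file here.
io = IOBuffer("a\nb\n")

# Default: trailing end-of-line characters are removed.
stripped = collect(eachline(io))

# Rewind and iterate again, this time keeping the line endings.
seekstart(io)
kept = collect(eachline(io; keep=true))
```

With keep=true, each yielded line still ends in "\n", so print(line) would reproduce the original layout.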
If you already have the complete string in memory then you can (and should) use split, as pointed out in the comments. split returns SubString views that index into the original string, so it doesn't allocate a new String for each line, as eachline does.
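If even the array returned by split is more than you want up front, newer Julia versions offer a lazy variant. The sketch below assumes Julia 1.8 or later, where eachsplit is available; it yields SubString views one at a time.

```julia
# Assumes Julia 1.8+, where `eachsplit` exists.
str = "a\nb\nc\n"

# Each `line` is a SubString pointing into `str`; no array of lines is built.
# keepempty=false drops the empty piece after the final newline
# (it would also drop blank lines in the middle of the string).
lines_seen = String[]
for line in eachsplit(str, '\n'; keepempty=false)
    push!(lines_seen, line)   # converted to String only for this demonstration
end
```

Storing the lines in lines_seen is done here purely so the result can be inspected; in real use you would process each line inside the loop.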

Related

Julia iterator which parses each line in file

I'm new to Julia and have hit a rock with something I imagine should be a common scenario:
I would like an iterator which parses a text file one line at a time. So, like eachline(f), except a function parse is applied to each line. Call the result eachline(parse, f) if you wish (like the version of open with an extra function argument): mapping parse over eachline.
To be more specific: I have a function poset(s::String) which turns a string representation of a poset into a poset:
function poset(s::String)
nums = split(s)
...
return transitiveclosure(g)
end
Now I'd like to be able to say something like
open("posets7.txt") do f
for p in eachline(poset, f)
#something
end
end
(and I specifically need this to be an iterator: my files are rather large, so I really want to parse them one line at a time).
I think I would (personally) use Iterators.map in such a case:
open("posets7.txt") do f
for p in Iterators.map(poset, eachline(f))
# do something
end
end
But, as the documentation notes (and as you discovered yourself), this is equivalent to using a generator expression:
open("posets7.txt") do f
for p in (poset(x) for x in eachline(f))
# do something
end
end
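Both forms are lazy, so each line is parsed only when the loop reaches it. A self-contained sketch that can be run without a file: parse_line is a hypothetical stand-in for the poset function from the question, and an IOBuffer replaces posets7.txt. Iterators.map requires Julia 1.6 or later.

```julia
# Hypothetical parser standing in for `poset`: turns "1 2 3" into [1, 2, 3].
parse_line(s::AbstractString) = parse.(Int, split(s))

# In-memory stream used in place of the real file.
io = IOBuffer("1 2 3\n4 5\n")

results = Vector{Int}[]
for p in Iterators.map(parse_line, eachline(io))
    push!(results, p)   # each line is read and parsed one at a time
end
```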

When defined in a script, Julia functions don't print output

I have this Julia script:
function f(x,y)
x+y
end
f(3, 4)
When I run this in the live terminal (via copy/paste), I get the desired result 7. But if I run the script, the output from the function is suppressed. Why is that?
Julia, unlike Matlab, doesn't automatically print values (the REPL does since that's what it's for: REPL = "read, eval, print loop"). You have to explicitly print the value using print or show, e.g. show(f(3, 4)). In this case, print and show do the same thing, but in general they have somewhat different meanings:
print([io::IO], xs...)
Write to io (or to the default output stream stdout if io is not given) a canonical (un-decorated) text representation. The representation used by print includes minimal formatting and tries to avoid Julia-specific details.
versus
show(x)
Write an informative text representation of a value to the current output stream. New types should overload show(io::IO, x) where the first argument is a stream. The representation used by show generally includes Julia-specific formatting and type information.
Note that there is also the @show macro, which prints the expression that is evaluated followed by its value, like so:
julia> @show f(3, 4);
f(3, 4) = 7
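A small script-style sketch of the three options, plus one case where print and show differ. The sprint calls are only there to capture the text each function would write.

```julia
f(x, y) = x + y

println(f(3, 4))   # print with a trailing newline
show(f(3, 4))      # same text for an integer, but no newline
println()
@show f(3, 4)      # prints the expression followed by its value

# For strings the two representations differ: show adds the quotes.
plain  = sprint(print, "hi")
quoted = sprint(show, "hi")
```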

How to start reading a file x bytes from the beginning in Julia?

I need to read records from a file, each being 9 bytes long. I need to know how to start reading at different points in the file.
It looks like you're looking for the seek function:
help?> seek
search: seek seekend seekstart ParseError setenv select select! selectperm
seek(s, pos)
Seek a stream to the given position.
In particular you might want to
open(filename) do f
seek(f, n) # seek past nth byte
read(f, m) # read m bytes
end
There is also the skip function, which may come in useful:
help?> skip
search: skip skipchars
skip(s, offset)
Seek a stream relative to the current position.
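For fixed-size records as in the question, the byte offset of a record can be computed directly. A minimal sketch, assuming 1-based record numbers; read_record and its record_size keyword are hypothetical names, and an IOBuffer stands in for the file.

```julia
# Read the k-th record (1-based) from a seekable stream.
function read_record(io::IO, k::Integer; record_size::Integer=9)
    seek(io, (k - 1) * record_size)   # stream positions are 0-based byte offsets
    return read(io, record_size)      # Vector{UInt8} with up to record_size bytes
end

# Three 9-byte records of dummy data in memory.
io = IOBuffer(UInt8.(1:27))
second = read_record(io, 2)
```

The same function should work on a handle obtained from open(filename), since files support seek as well.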

Immutable dictionary

Is there a way to enforce a dictionary being constant?
I have a function which reads out a file for parameters (and ignores comments) and stores it in a dict:
function getparameters(filename::AbstractString)
f = open(filename,"r")
dict = Dict{AbstractString, AbstractString}()
for ln in eachline(f)
m = match(r"^\s*(?P<key>\w+)\s+(?P<value>[\w+-.]+)", ln)
if m != nothing
dict[m[:key]] = m[:value]
end
end
close(f)
return dict
end
This works just fine. Since I have a lot of parameters, which I will end up using in different places, my idea was to make this dict global. And as we all know, global variables are not that great, so I wanted to ensure that the dict and its members are immutable.
Is this a good approach? How do I do it? Do I have to do it?
Bonus answerable stuff :)
Is my code even ok? (It is the first thing I did with Julia, and coming from C/C++ and Python I tend to do things differently.) Do I need to check whether the file is actually open? Is my reading of the file "julia"-like? I could also readall and then use eachmatch. I don't see the "right way to do it" (like in Python).
Why not use an ImmutableDict? It's defined in base but not exported. You use one as follows:
julia> id = Base.ImmutableDict("key1"=>1)
Base.ImmutableDict{String,Int64} with 1 entry:
"key1" => 1
julia> id["key1"]
1
julia> id["key1"] = 2
ERROR: MethodError: no method matching setindex!(::Base.ImmutableDict{String,Int64}, ::Int64, ::String)
in eval(::Module, ::Any) at .\boot.jl:234
in macro expansion at .\REPL.jl:92 [inlined]
in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at .\event.jl:46
julia> id2 = Base.ImmutableDict(id,"key2"=>2)
Base.ImmutableDict{String,Int64} with 2 entries:
"key2" => 2
"key1" => 1
julia> id.value
1
You may want to define a constructor which takes in an array of pairs (or keys and values) and uses that algorithm to define the whole dict (that's the only way to do so, see the note at the bottom).
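Such a constructor could look roughly like the sketch below. The helper name immutabledict is hypothetical; it only relies on the empty Base.ImmutableDict{K,V}() constructor and the two-argument form shown above.

```julia
# Hypothetical helper: fold a list of pairs into a Base.ImmutableDict.
function immutabledict(kvs::Pair{K,V}...) where {K,V}
    d = Base.ImmutableDict{K,V}()        # start from an empty dictionary
    for kv in kvs
        d = Base.ImmutableDict(d, kv)    # each step wraps the previous dict
    end
    return d
end

params = immutabledict("alpha" => 1, "beta" => 2)
```

Every insertion creates a new node that points at the previous dictionary, so lookups take time proportional to the number of entries.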
Just an added note: internally, each ImmutableDict contains only one key-value pair plus a parent dictionary. The get method just walks through the chain of dictionaries, checking whether each one holds the requested key. The reason for this design is that arrays are mutable: if you naively constructed an immutable type with a mutable field, the field would still be mutable, so while id["key1"]=2 wouldn't work, id.keys[1]=2 would. They get around this by not using a mutable type to hold the values (each node holds only a single key and value) and then also holding an immutable dict. If you wanted to make this work directly on arrays, you could use something like ImmutableArrays.jl, but I don't think you'd get a performance advantage, because you'd still have to loop through the array when checking for a key.
First off, I am new to Julia (I have been using/learning it for only two weeks). So do not put any confidence in what I am going to say unless it is validated by others.
The dictionary data structure Dict is defined here
julia/base/dict.jl
There is also a data structure called ImmutableDict in that file. However as const variables aren't actually const why would immutable dictionaries be immutable?
The comment states:
ImmutableDict is a Dictionary implemented as an immutable linked list,
which is optimal for small dictionaries that are constructed over many individual insertions
Note that it is not possible to remove a value, although it can be partially overridden and hidden
by inserting a new value with the same key
So let us call what you want to define UnmodifiableDict to avoid confusion. Such an object would probably have:
a data structure similar to Dict,
a constructor that takes a Dict as input to fill its data structure,
a specialization (a new dispatch?) of the method setindex!, which is called by the operator [] =, in order to forbid modification of the data structure. The same should apply to all other functions that end with ! and hence modify the data.
As far as I understand, it is only possible to subtype abstract types. Therefore you can't make UnmodifiableDict a subtype of Dict and only redefine functions such as setindex!
Unfortunately this is a needed restriction for having run-time types and not compile-time types. You can't have such a good performance without a few restrictions.
Bottom line:
The only solution I see is to copy paste the code of the type Dict and its functions, replace Dict by UnmodifiableDict everywhere and modify the functions that end with ! to raise an exception if called.
You may also want to have a look at these threads:
https://groups.google.com/forum/#!topic/julia-users/n-lqjybIO_w
https://github.com/JuliaLang/julia/issues/1974
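As an alternative to copy-pasting the Dict code, a thin wrapper that subtypes AbstractDict can forward reads and reject writes. This is only a sketch, not something from Base: the UnmodifiableDict name follows the answer above, and the set of forwarded methods is an assumption about what a read-only dictionary minimally needs.

```julia
# Hypothetical read-only wrapper around a Dict; not part of Base.
struct UnmodifiableDict{K,V} <: AbstractDict{K,V}
    data::Dict{K,V}
end

# Forward the read-only operations to the wrapped Dict.
Base.length(d::UnmodifiableDict) = length(d.data)
Base.iterate(d::UnmodifiableDict, state...) = iterate(d.data, state...)
Base.get(d::UnmodifiableDict, key, default) = get(d.data, key, default)
Base.getindex(d::UnmodifiableDict, key) = d.data[key]

# Reject writes made through the usual d[key] = value syntax.
Base.setindex!(::UnmodifiableDict, value, key) =
    error("UnmodifiableDict is read-only")

# Copy the source so later changes to it do not leak into the wrapper.
ud = UnmodifiableDict(copy(Dict("a" => 1)))
```

Note that the wrapped Dict is still reachable as ud.data, so this guards against accidental modification rather than enforcing true immutability.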
REVISION
Thanks to Chris Rackauckas for pointing out the error in my earlier response. I'll leave it below as an illustration of what doesn't work. But, Chris is right, the const declaration doesn't actually seem to improve performance when you feed the dictionary into the function. Thus, see Chris' answer for the best resolution to this issue:
D1 = [i => sind(i) for i = 0.0:5:3600];
const D2 = [i => sind(i) for i = 0.0:5:3600];
function test(D)
for jdx = 1:1000
# D[2] = 2
for idx = 0.0:5:3600
a = D[idx]
end
end
end
## Times given after an initial run to allow for compiling
@time test(D1); # 0.017789 seconds (4 allocations: 160 bytes)
@time test(D2); # 0.015075 seconds (4 allocations: 160 bytes)
Old Response
If you want your dictionary to be a constant, you can use:
const MyDict = getparameters( .. )
Update: Keep in mind though that in base Julia, unlike some other languages, it's not that you cannot redefine constants; instead, you just get a warning when doing so.
julia> const a = 2
2
julia> a = 3
WARNING: redefining constant a
3
julia> a
3
It is odd that you don't get the constant redefinition warning when adding a new key-val pair to the dictionary. But, you still see the performance boost from declaring it as a constant:
D1 = [i => sind(i) for i = 0.0:5:3600];
const D2 = [i => sind(i) for i = 0.0:5:3600];
function test1()
for jdx = 1:1000
for idx = 0.0:5:3600
a = D1[idx]
end
end
end
function test2()
for jdx = 1:1000
for idx = 0.0:5:3600
a = D2[idx]
end
end
end
## Times given after an initial run to allow for compiling
@time test1(); # 0.049204 seconds (1.44 M allocations: 22.003 MB, 5.64% gc time)
@time test2(); # 0.013657 seconds (4 allocations: 160 bytes)
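On the observation that adding a key-value pair triggers no redefinition warning: const fixes which object the name refers to, not the contents of that object. A minimal sketch, with PARAMS as a hypothetical name:

```julia
# `const` pins the binding PARAMS to this particular Dict object.
const PARAMS = Dict("a" => 1)

# Mutating the contents is not a redefinition, so no warning is issued.
PARAMS["b"] = 2
PARAMS["a"] = 10

n_entries = length(PARAMS)
```

This is why const alone does not give an immutable dictionary.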
To add to the existing answers, if you like immutability and would like to get performant (but still persistent) operations which change and extend the dictionary, check out FunctionalCollections.jl's PersistentHashMap type.
If you want to maximize performance and take maximal advantage of immutability, and you don't plan on doing any operations on the dictionary whatsoever, consider implementing a perfect hash function-based dictionary. In fact, if your dictionary is a compile-time constant, these can even be computed ahead of time (using metaprogramming) and precompiled.

How to simply read binary data sheets which contains string column in Julia?

I am trying to write a number of tabular data sheets to a binary file and read them back. The data are of Integer, Float64 and ASCIIString types. Writing works without difficulty: I lpad each ASCIIString so that ASCIIString columns all have the same length. Now I am facing the read operation: I want to read each table of data with a single call to the read function, e.g.:
read(myfile,Tuple{[UInt16;[Float64 for i=1:10];UInt8]...}, dim) # => works
EDIT-> I do not use the above line of code in my real solution because
I found that
sizeof(Tuple{Float64,Int32})!=sizeof(Float64)+sizeof(Int32)
but how can I include ASCIIString fields in my Tuple type?
check this simplified example:
file=open("./testfile.txt","w");
ts1="5char";
ts2="7 chars";
write(file,ts1,ts2);
close(file);
file=open("./testfile.txt","r");
data=read(file,typeof(ts1)); # => Error
close(file);
Julia is right because typeof(ts1)==ASCIIString and ASCIIString is a variable-length array, so Julia doesn't know how many bytes must be read.
What type should I use there instead? Is there a type that represents ConstantLengthString<length>, Bytes<length> or Chars<length>? Does any better solution exist?
EDIT
I should add more complete sample code that includes my latest progress. My latest solution is to read some part of the data into a buffer (one row or more), allocate memory for one row of data, then reinterpret the bytes and copy the resulting values from the buffer into an out location:
#convert array of bits and copy them to out
function reinterpretarray!{ty}(out::Vector{ty}, buffer::Vector{UInt8}, pos::Int)
count=length(out)
out[1:count]=reinterpret(ty,buffer[pos:count*sizeof(ty)+pos-1])
return count*sizeof(ty)+pos
end
file=open("./testfile.binary","w");
#generate test data
infloat=ones(20);
instr=b"MyData";
inint=Int32[12];
#write tuple
write(file,([infloat...],instr,inint)...);
close(file);
file=open("./testfile.binary","r");
#read data into a buffer
buffer=readbytes(file,sizeof(infloat)+sizeof(instr)+sizeof(inint));
close(file);
#allocate memory
outfloat=zeros(20)
outstr=b"123456"
outint=Int32[1]
outdata=(outfloat,outstr,outint)
#copy and convert
pos=1
for elm in outdata
pos=reinterpretarray!(elm, buffer, pos)
end
assert(outdata==(infloat,instr,inint))
But my experiments in C tell me that there must be a better, more convenient and faster solution. I would like to do it using C-style pointers and references; I don't like copying data from one location to another.
Thanks
You can use Array{UInt8} as an alternative type for ASCIIString, since that is the type of its underlying data.
ts1="5chars"
print(ts1.data) #Array{UInt8}
someotherarray=ts1.data[:] #copies as new array
someotherstring=ASCIIString(someotherarray)
assert(someotherstring == ts1)
Do mind that I'm reading UInt8 on an x86_64 system, which might not be your case. You should use Array{eltype(ts1.data)} for safety reasons.
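The code in this thread targets an old Julia release. A rough equivalent of the round trip for Julia 1.x is sketched below, assuming fixed-width string fields: ASCIIString became String, and readbytes became read(io, n). An IOBuffer stands in for the binary file, and the field sizes mirror the question's test data.

```julia
# Write one "row": 20 floats, a 6-byte string field, and an Int32.
io = IOBuffer()
infloat = ones(20)
instr = "MyData"        # fixed width: 6 bytes
inint = Int32(12)
write(io, infloat, instr, inint)
seekstart(io)

# read! fills the preallocated array directly from the stream.
outfloat = read!(io, Vector{Float64}(undef, 20))
# A fixed-width string field is read as raw bytes, then wrapped in a String.
outstr = String(read(io, 6))
# Scalars of a bits type can be read by naming the type.
outint = read(io, Int32)
```

Reading field by field also sidesteps the padding issue noted in the question, where sizeof of a Tuple type can exceed the sum of its field sizes.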
