How to start reading a file x bytes from the beginning in Julia?

I need to read records from a file, each 9 bytes long. I need to know how to start reading at different points in the file.

It looks like you're looking for the seek function:
help?> seek
search: seek seekend seekstart ParseError setenv select select! selectperm
seek(s, pos)
Seek a stream to the given position.
In particular you might want to
open(filename) do f
    seek(f, n) # seek past nth byte
    read(f, m) # read m bytes
end
There is also the skip function, which may come in useful:
help?> skip
search: skip skipchars
skip(s, offset)
Seek a stream relative to the current position.
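For the 9-byte records in the question, the offset of a record is its index times the record size. A minimal sketch, assuming Julia 1.x; read_record and the sample file contents are hypothetical, introduced only for illustration:

```julia
# Hypothetical helper: return the k-th fixed-size record (1-based) as bytes.
function read_record(path::AbstractString, k::Integer; recsize::Integer = 9)
    open(path) do io
        seek(io, (k - 1) * recsize)  # seek takes a 0-based byte offset
        read(io, recsize)            # Vector{UInt8} with up to recsize bytes
    end
end

# Build a throwaway file holding three 9-byte records.
path, io = mktemp()
write(io, "AAAAAAAAA", "BBBBBBBBB", "CCCCCCCCC")
close(io)

second = String(read_record(path, 2))
println(second)
```

When reading records in order from an already open stream, skip(io, 9) moves forward one record from the current position without computing an absolute offset.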

Related

How to iterate over the lines in a string?

I have a long string in Julia. I'd like to apply some operation to each line. How can I efficiently iterate over each line? I think I can use split but I am wondering if there is a method that won't allocate all the strings upfront?
You can use eachline for this:
julia> str = """
a
b
c
"""
"a\nb\nc\n"
julia> for line in eachline(IOBuffer(str))
           println(line)
       end
a
b
c
There's also a version that operates directly on a file, in case that's relevant to you:
help?> eachline
search: eachline eachslice
eachline(io::IO=stdin; keep::Bool=false)
eachline(filename::AbstractString; keep::Bool=false)
Create an iterable EachLine object that will yield each line from an I/O stream or a file. Iteration calls readline on
the stream argument repeatedly with keep passed through, determining whether trailing end-of-line characters are
retained. When called with a file name, the file is opened once at the beginning of iteration and closed at the end. If
iteration is interrupted, the file will be closed when the EachLine object is garbage collected.
Examples
≡≡≡≡≡≡≡≡≡≡
julia> open("my_file.txt", "w") do io
           write(io, "JuliaLang is a GitHub organization.\n It has many members.\n");
       end;
julia> for line in eachline("my_file.txt")
           print(line)
       end
JuliaLang is a GitHub organization. It has many members.
julia> rm("my_file.txt");
If you already have the complete string in memory then you can (and should) use split, as pointed out in the comments. split basically indexes into the string and doesn't allocate new Strings for each line, as opposed to eachline.
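A small sketch of the split approach, assuming Julia 1.x (where the keepempty keyword exists); the sample string is the one from the eachline example:

```julia
str = "a\nb\nc\n"

# split returns SubString values that share str's memory; keepempty=false
# drops the empty piece after the final newline.
lines = split(str, '\n'; keepempty = false)

for line in lines
    println(uppercase(line))
end
```

Note that split still builds the vector of SubStrings up front; what it avoids is copying each line into a new String.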

Immutable dictionary

Is there a way to enforce a dictionary being constant?
I have a function which reads out a file for parameters (and ignores comments) and stores it in a dict:
function getparameters(filename::AbstractString)
    f = open(filename,"r")
    dict = Dict{AbstractString, AbstractString}()
    for ln in eachline(f)
        m = match(r"^\s*(?P<key>\w+)\s+(?P<value>[\w+-.]+)", ln)
        if m != nothing
            dict[m[:key]] = m[:value]
        end
    end
    close(f)
    return dict
end
This works just fine. Since I have a lot of parameters, which I will end up using in different places, my idea was to make this dict global. And as we all know, global variables are not that great, so I wanted to ensure that the dict and its members are immutable.
Is this a good approach? How do I do it? Do I have to do it?
Bonus answerable stuff :)
Is my code even OK? (It is the first thing I did with Julia, and coming from C/C++ and Python I tend to do things differently.) Do I need to check whether the file is actually open? Is my reading of the file "Julia"-like? I could also readall and then use eachmatch. I don't see the "right way to do it" (like in Python).
Why not use an ImmutableDict? It's defined in base but not exported. You use one as follows:
julia> id = Base.ImmutableDict("key1"=>1)
Base.ImmutableDict{String,Int64} with 1 entry:
"key1" => 1
julia> id["key1"]
1
julia> id["key1"] = 2
ERROR: MethodError: no method matching setindex!(::Base.ImmutableDict{String,Int64}, ::Int64, ::String)
in eval(::Module, ::Any) at .\boot.jl:234
in macro expansion at .\REPL.jl:92 [inlined]
in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at .\event.jl:46
julia> id2 = Base.ImmutableDict(id,"key2"=>2)
Base.ImmutableDict{String,Int64} with 2 entries:
"key2" => 2
"key1" => 1
julia> id.value
1
You may want to define a constructor which takes in an array of pairs (or keys and values) and uses that algorithm to define the whole dict (that's the only way to do so, see the note at the bottom).
Just an added note: internally, each ImmutableDict holds only one key-value pair plus a reference to another ImmutableDict (the rest of the list). The get method just walks through this chain of dictionaries, checking each one for the right key. The reason for this design is that arrays are mutable: if you naively constructed an immutable type with a mutable field, the field would still be mutable, so while id["key1"]=2 wouldn't work, id.keys[1]=2 would. They get around this by not using a mutable type to hold the values (each node holds only a single pair) and by having each node hold another immutable dict. If you wanted to make this work directly on arrays, you could use something like ImmutableArrays.jl, but I don't think you'd get a performance advantage, because you'd still have to loop through the array when checking for a key.
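The constructor suggested above could be sketched as a fold over the pairs, each step wrapping the previous dictionary. This assumes Julia 1.x; immutabledict is a hypothetical helper name, not part of Base:

```julia
# Hypothetical helper: build an ImmutableDict by adding one pair at a time.
function immutabledict(pairs::AbstractVector{Pair{K,V}}) where {K,V}
    foldl((d, p) -> Base.ImmutableDict(d, p), pairs;
          init = Base.ImmutableDict{K,V}())
end

params = immutabledict(["key1" => 1, "key2" => 2])
println(params["key2"])
println(haskey(params, "key3"))
```

Because lookups walk the linked list, this is only a reasonable choice for small parameter sets.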
First off, I am new to Julia (I have been using/learning it for only two weeks). So do not put any confidence in what I am going to say unless it is validated by others.
The dictionary data structure Dict is defined here
julia/base/dict.jl
There is also a data structure called ImmutableDict in that file. However as const variables aren't actually const why would immutable dictionaries be immutable?
The comment states:
ImmutableDict is a Dictionary implemented as an immutable linked list,
which is optimal for small dictionaries that are constructed over many individual insertions
Note that it is not possible to remove a value, although it can be partially overridden and hidden
by inserting a new value with the same key
So let us call what you want to define as a dictionary UnmodifiableDict to avoid confusion. Such an object would probably have
a data structure similar to that of Dict.
a constructor that takes a Dict as input to fill its data structure.
a specialization (a new dispatch?) of the method setindex! that is called by the operator [] =
in order to forbid modification of the data structure. The same should apply to all other functions that end with ! and hence modify the data.
As far as I understand, it is only possible to have subtypes of abstract types. Therefore you can't make UnmodifiableDict a subtype of Dict and only redefine functions such as setindex!
Unfortunately this is a necessary restriction for having run-time types rather than compile-time types. You can't have such good performance without a few restrictions.
Bottom line:
The only solution I see is to copy paste the code of the type Dict and its functions, replace Dict by UnmodifiableDict everywhere and modify the functions that end with ! to raise an exception if called.
You may also want to have a look at these threads:
https://groups.google.com/forum/#!topic/julia-users/n-lqjybIO_w
https://github.com/JuliaLang/julia/issues/1974
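As an alternative to copying the Dict source, a thin wrapper that subtypes AbstractDict (which is abstract, so subtyping is allowed) and forwards only the read-only methods may be enough. This is a sketch under the assumption of Julia 1.x; UnmodifiableDict is the hypothetical name used above, not an existing type:

```julia
# Hypothetical read-only wrapper; it subtypes AbstractDict, not Dict.
struct UnmodifiableDict{K,V} <: AbstractDict{K,V}
    data::Dict{K,V}
    # Copy on construction so later changes to the source Dict do not leak in.
    UnmodifiableDict(d::Dict{K,V}) where {K,V} = new{K,V}(copy(d))
end

# Read-only interface, forwarded to the wrapped Dict.
Base.length(d::UnmodifiableDict) = length(d.data)
Base.iterate(d::UnmodifiableDict, state...) = iterate(d.data, state...)
Base.get(d::UnmodifiableDict, key, default) = get(d.data, key, default)

# Assignment through [] = raises an error.
Base.setindex!(::UnmodifiableDict, value, key) =
    error("UnmodifiableDict does not support modification")

ud = UnmodifiableDict(Dict("a" => 1, "b" => 2))
println(ud["a"])
```

The wrapped Dict is still reachable as ud.data, so this guards against accidental modification rather than deliberate misuse.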
REVISION
Thanks to Chris Rackauckas for pointing out the error in my earlier response. I'll leave it below as an illustration of what doesn't work. But, Chris is right, the const declaration doesn't actually seem to improve performance when you feed the dictionary into the function. Thus, see Chris' answer for the best resolution to this issue:
# Dict(...) generator form; the original `[k => v for ...]` comprehension
# built a Dict only in old Julia versions and now builds a Vector of Pairs.
D1 = Dict(i => sind(i) for i = 0.0:5:3600);
const D2 = Dict(i => sind(i) for i = 0.0:5:3600);
function test(D)
    for jdx = 1:1000
        # D[2] = 2
        for idx = 0.0:5:3600
            a = D[idx]
        end
    end
end
## Times given after an initial run to allow for compiling
@time test(D1); # 0.017789 seconds (4 allocations: 160 bytes)
@time test(D2); # 0.015075 seconds (4 allocations: 160 bytes)
Old Response
If you want your dictionary to be a constant, you can use:
const MyDict = getparameters( .. )
Update: Keep in mind, though, that in base Julia, unlike some other languages, redefining a constant is not forbidden; you just get a warning when you do so.
julia> const a = 2
2
julia> a = 3
WARNING: redefining constant a
3
julia> a
3
It is odd that you don't get the constant redefinition warning when adding a new key-val pair to the dictionary. But, you still see the performance boost from declaring it as a constant:
# Dict(...) generator form; the original `[k => v for ...]` comprehension
# built a Dict only in old Julia versions and now builds a Vector of Pairs.
D1 = Dict(i => sind(i) for i = 0.0:5:3600);
const D2 = Dict(i => sind(i) for i = 0.0:5:3600);
function test1()
    for jdx = 1:1000
        for idx = 0.0:5:3600
            a = D1[idx]
        end
    end
end
function test2()
    for jdx = 1:1000
        for idx = 0.0:5:3600
            a = D2[idx]
        end
    end
end
## Times given after an initial run to allow for compiling
@time test1(); # 0.049204 seconds (1.44 M allocations: 22.003 MB, 5.64% gc time)
@time test2(); # 0.013657 seconds (4 allocations: 160 bytes)
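The missing warning when adding a key is expected: const fixes which object the name refers to (and therefore its type), not the contents of that object. A minimal illustration, with PARAMS as a hypothetical name:

```julia
const PARAMS = Dict("alpha" => "1")

# Mutating the object is allowed: PARAMS still refers to the same Dict.
PARAMS["beta"] = "2"
println(length(PARAMS))
```

This is also why test2 above is faster than test1: the compiler knows the type of a const global, while a non-const global could be rebound to anything.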
To add to the existing answers, if you like immutability and would like to get performant (but still persistent) operations which change and extend the dictionary, check out FunctionalCollections.jl's PersistentHashMap type.
If you want to maximize performance and take maximal advantage of immutability, and you don't plan on doing any operations on the dictionary whatsoever, consider implementing a perfect hash function-based dictionary. In fact, if your dictionary is a compile-time constant, these can even be computed ahead of time (using metaprogramming) and precompiled.

How are bytes represented in a file by Ocaml?

I'm trying to write a function to write a list of bytes to a file (we're writing a parser of the .class file and after some insertions writing the file back.) When my partner wrote the code to read it in, the bytecode list variable is a list of ints. So now I need to convert it to bytes and then write it to a new .class file. Will the representation of the hex numbers be the same essentially so that the new .class file can be processed by the JVM? My functional programming is LISP from a single semester three years ago and a semester of Coq one year ago. Not enough to make me think in functional terms easily.
Your question is, in fact, pretty confusing. Here's some code that reads in the bytes of a file as an int list, then writes the ints back out as bytes to a new file. On a reasonable system (you don't mention your system), this will copy a file exactly so that no program including the JVM can tell the difference.
let get_bytes fn =
  let inc = open_in_bin fn in
  let rec go sofar =
    match input_char inc with
    | b -> go (Char.code b :: sofar)
    | exception End_of_file -> List.rev sofar
  in
  let res = go [] in
  close_in inc;
  res

let put_bytes fn ints =
  let outc = open_out_bin fn in
  List.iter (fun b -> output_char outc (Char.chr b)) ints;
  close_out outc

let copy_file infn outfn =
  put_bytes outfn (get_bytes infn)
I tested this on my system (OS X 10.11.2). I don't have any class files around, but the JVM had no trouble running a jarfile copied with copy_file.
The essence of this problem has nothing to do with hexadecimal numbers. Those are a way of representing numbers as strings, which don't appear anywhere. It also has little to do with functional programming, other than the fact that you want to write your code in OCaml.
The essence of the problem is the meaning of a series of bytes stored in a file. At the lowest level, the bytes stored in the file are the meaning of the file. So you can faithfully copy the file just by copying the bytes. That's what copy_file does.
Since you want to change the bytes, you of course need to make sure your new bytes represent a valid class file. Once you've figured out the new bytes that you want, you can write them out with put_bytes (on a reasonable system).

StackOverflowError with tuple

I have written a recursive function for getting objects in larger arrays in Julia. The following error occurred:
ERROR: LoadError: StackOverflowError:
in cat_t at abstractarray.jl:831
in recGetObjChar at /home/user/Desktop/program.jl:1046
in recGetObjChar at /home/user/Desktop/program.jl:1075 (repeats 9179 times)
in getImChars at /home/user/Desktop/program.jl:968
in main at /home/user/Desktop/program.jl:69
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:308
in _start at ./client.jl:411
while loading /home/user/Desktop/program.jl, in expression starting on line 78
If you want to have a look at the code, I have already opened an issue (Assertion failed, process aborted). After debugging my code for Julia v0.4, it is more obvious what causes the problem. The tuple locObj gets much bigger than 9000 entries, because one object can be, e.g., 150 x 150 in size.
That would result in a length of 22500 for locObj. How big can tuples get, and how can I avoid a stack overflow? Is there another way to save my values?
As noted in the comments, I think there are better approaches for working with big arrays of data. This answer mainly addresses this part of your question:
Is there another way to save my values?
I have prepared a test to show how mmap helps when dealing with large arrays of data. The following two functions do the same thing: each creates a vector of 3×10^6 Float64 values, fills it, computes the sum, and prints the result. The first (mmaptest()) stores the Vector{Float64} in a memory-mapped file, while the second (ramtest()) does the work in ordinary RAM:
function mmaptest()
    s = open("./tmp/mmap.bin","w+") # the tmp folder must exist under pwd()
    A = Mmap.mmap(s, Vector{Float64}, 3_000_000)
    for j=1:3_000_000
        A[j]=j
    end
    println("sum = $(sum(A))")
    close(s)
end
function ramtest()
    A = Vector{Float64}(3_000_000)
    for j=1:3_000_000
        A[j]=j
    end
    println("sum = $(sum(A))")
end
Then both functions were called and the allocated memory was measured:
julia> gc(); # => remove old handles to closed stream
julia> @allocated mmaptest()
sum = 4.5000015e12
861684
julia> @allocated ramtest()
sum = 4.5000015e12
24072791
These tests show that the memory-mapped version allocates far less memory.
julia> gc()
julia> @time ramtest()
sum = 4.5000015e12
0.012584 seconds (29 allocations: 22.889 MB, 3.43% gc time)
julia> @time mmaptest()
sum = 4.5000015e12
0.019602 seconds (58 allocations: 2.277 KB)
As the @time results show, using mmap makes the code somewhat slower but needs far less memory.
I hope this helps. Regards.
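The functions above use pre-1.0 syntax (Vector{Float64}(n), gc()). On Julia 1.x, Mmap is a standard library that must be loaded explicitly, and the same idea might look like the sketch below; mmapsum and the temporary file are hypothetical names chosen for illustration, and the vector is kept small so it runs quickly:

```julia
using Mmap

# Fill a memory-mapped vector of n Float64 values and return its sum.
function mmapsum(path::AbstractString, n::Integer)
    open(path, "w+") do io
        A = mmap(io, Vector{Float64}, n)  # the file is grown to hold n values
        for j in 1:n
            A[j] = j
        end
        Mmap.sync!(A)                     # flush changes to disk
        sum(A)
    end
end

path, io = mktemp()
close(io)
total = mmapsum(path, 1_000)
println("sum = $total")
```

On 1.x the in-memory equivalent would be Vector{Float64}(undef, n), and the garbage collector is invoked with GC.gc().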

How to simply read binary data sheets which contains string column in Julia?

I am trying to write a number of tabular data sheets to a binary file and read them back. The data are of Integer, Float64 and ASCIIString types. Writing them is no problem: I lpad each ASCIIString so that all values in an ASCIIString column have the same length. Now I am facing the read operation: I want to read each table of data with a single call to the read function, e.g.:
read(myfile,Tuple{[UInt16;[Float64 for i=1:10];UInt8]...}, dim) # => works
EDIT: I do not use the above line of code in my real solution, because I found that
sizeof(Tuple{Float64,Int32})!=sizeof(Float64)+sizeof(Int32)
But how can I include ASCIIString fields in my Tuple type?
Check this simplified example:
file=open("./testfile.txt","w");
ts1="5char";
ts2="7 chars";
write(file,ts1,ts2);
close(file);
file=open("./testfile.txt","r");
data=read(file,typeof(ts1)); # => Error
close(file);
Julia is right, because typeof(ts1)==ASCIIString and ASCIIString is a variable-length array, so Julia doesn't know how many bytes must be read.
What type should I use there instead? Is there a type that represents ConstantLengthString<length>, Bytes<length> or Chars<length>? Is there a better solution?
EDIT
I should add a more complete code sample that includes my latest progress. My latest solution is to read part of the data into a buffer (one row or more), allocate memory for one row of data, then reinterpret the bytes and copy the resulting values from the buffer into an output location:
#convert array of bits and copy them to out
function reinterpretarray!{ty}(out::Vector{ty}, buffer::Vector{UInt8}, pos::Int)
    count=length(out)
    out[1:count]=reinterpret(ty,buffer[pos:count*sizeof(ty)+pos-1])
    return count*sizeof(ty)+pos
end
file=open("./testfile.binary","w");
#generate test data
infloat=ones(20);
instr=b"MyData";
inint=Int32[12];
#write tuple
write(file,([infloat...],instr,inint)...);
close(file);
file=open("./testfile.binary","r");
#read data into a buffer
buffer=readbytes(file,sizeof(infloat)+sizeof(instr)+sizeof(inint));
close(file);
#allocate memory
outfloat=zeros(20)
outstr=b"123456"
outint=Int32[1]
outdata=(outfloat,outstr,outint)
#copy and convert
pos=1
for elm in outdata
    pos=reinterpretarray!(elm, buffer, pos)
end
assert(outdata==(infloat,instr,inint))
But my experiments in C tell me that there must be a better, more convenient and faster solution. I would like to do it using C-style pointers and references; I don't want to copy data from one location to another.
Thanks
You can use Array{UInt8}, the type of the underlying data, as an alternative to ASCIIString.
ts1="5chars"
print(ts1.data) #Array{UInt8}
someotherarray=ts1.data[:] #copies as new array
someotherstring=ASCIIString(someotherarray)
assert(someotherstring == ts1)
Do mind that I'm reading UInt8 on an x86_64 system, which might not be your case. To be safe, use Array{eltype(ts1.data)}.
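On Julia 1.x, where ASCIIString has been replaced by String and the .data field is gone, a fixed-width string field can be read by asking for an exact number of bytes and converting the result. A sketch under that assumption; the record layout (a Float64, an Int32 and an 8-byte padded string) and all names are made up for illustration:

```julia
path, io = mktemp()
# One record: a Float64, an Int32, and a string padded to 8 bytes.
write(io, 1.5, Int32(12), rpad("MyData", 8))
close(io)

x, n, name = open(path) do f
    xval = read(f, Float64)       # 8 bytes
    nval = read(f, Int32)         # 4 bytes; the file has no alignment padding
    raw  = String(read(f, 8))     # fixed-width field: read exactly 8 bytes
    (xval, nval, rstrip(raw))     # drop the padding spaces
end
println((x, n, name))
```

Reading field by field also sidesteps the sizeof(Tuple{...}) mismatch noted in the question, since no in-memory struct layout is involved.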
