I am trying to get Avro working in Julia and having some real issues. It is important for my application that I use a row-oriented data format to which I can append a hierarchical data structure row by row as they are generated.
Avro seems like a good fit, but I am having issues with it in Julia. I have things working in a Python test, but I need this to work in Julia because the main code is in Julia.
Here are my simplified test examples which show my issue. The first one works, the rest don't. Any help would be appreciated. The second gives the wrong answer. The rest give errors.
import Avro
v1=Dict("RUTHERFORD" => 7, "DURHAM" => 11)
buf=Avro.write(v1)
Avro.read(buf,typeof(v1))
output:
Dict{String, Int64} with 2 entries:
"DURHAM" => 11
"RUTHERFORD" => 7
example 2:
@show v3=Dict((5,2) => 7, (5,4) => 11)
@show typeof(v3)
buf=Avro.write(v3)
Avro.read(buf,typeof(v3))
output:
v3 = Dict((5, 2) => 7, (5, 4) => 11) = Dict((5, 2) => 7, (5, 4) => 11)
typeof(v3) = Dict{Tuple{Int64, Int64}, Int64}
Dict{Tuple{Int64, Int64}, Int64} with 1 entry:
(40, 53) => 11
example 3:
@show v2=Dict(("jcm",2) => 7, ("sem",4) => 11)
@show typeof(v2)
buf=Avro.write(v2)
v2o=Avro.read(buf,typeof(v2))
output:
v2 = Dict(("jcm", 2) => 7, ("sem", 4) => 11) = Dict(("sem", 4) => 11, ("jcm", 2) => 7)
typeof(v2) = Dict{Tuple{String, Int64}, Int64}
MethodError: Cannot `convert` an object of type Char to an object of type String
Closest candidates are:
convert(::Type{String}, ::String) at essentials.jl:210
convert(::Type{T}, ::T) where T<:AbstractString at strings/basic.jl:231
convert(::Type{T}, ::AbstractString) where T<:AbstractString at strings/basic.jl:232
...
Stacktrace:
[1] _totuple
# ./tuple.jl:316 [inlined]
[2] Tuple{String, Int64}(itr::String)
# Base ./tuple.jl:303
[3] construct(T::Type, args::String; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:310
[4] construct(T::Type, args::String)
# StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:310
[5] construct(::Type{Tuple{String, Int64}}, ptr::Ptr{UInt8}, len::Int64; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:435
[6] construct(::Type{Tuple{String, Int64}}, ptr::Ptr{UInt8}, len::Int64)
# StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:435
[7] readvalue(B::Avro.Binary, #unused#::Avro.StringType, #unused#::Type{Tuple{String, Int64}}, buf::Vector{UInt8}, pos::Int64, len::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:247
[8] readvalue(B::Avro.Binary, MT::Avro.MapType, #unused#::Type{Dict{Tuple{String, Int64}, Int64}}, buf::Vector{UInt8}, pos::Int64, buflen::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# Avro ~/.julia/packages/Avro/JEoRa/src/types/maps.jl:63
[9] read(buf::Vector{UInt8}, ::Type{Dict{Tuple{String, Int64}, Int64}}; schema::Avro.MapType, jsonencoding::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:58
[10] read(buf::Vector{UInt8}, ::Type{Dict{Tuple{String, Int64}, Int64}})
# Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:58
[11] top-level scope
# In[209]:5
[12] eval
# ./boot.jl:360 [inlined]
[13] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
# Base ./loading.jl:1094
Last example:
v=Dict(("RUTHERFORD", "05A", "371619611022065") => 7, ("DURHAM", "28","jcm") => 11)
buf=Avro.write(v)
vo=Avro.read(buf,typeof(v))
output:
MethodError: Cannot `convert` an object of type Char to an object of type String
Closest candidates are:
convert(::Type{String}, ::String) at essentials.jl:210
convert(::Type{T}, ::T) where T<:AbstractString at strings/basic.jl:231
convert(::Type{T}, ::AbstractString) where T<:AbstractString at strings/basic.jl:232
...
Stacktrace:
[1] _totuple
# ./tuple.jl:316 [inlined]
[2] Tuple{String, String, String}(itr::String)
# Base ./tuple.jl:303
[3] construct(T::Type, args::String; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:310
[4] construct(T::Type, args::String)
# StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:310
[5] construct(::Type{Tuple{String, String, String}}, ptr::Ptr{UInt8}, len::Int64; kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:435
[6] construct(::Type{Tuple{String, String, String}}, ptr::Ptr{UInt8}, len::Int64)
# StructTypes ~/.julia/packages/StructTypes/NJXhA/src/StructTypes.jl:435
[7] readvalue(B::Avro.Binary, #unused#::Avro.StringType, #unused#::Type{Tuple{String, String, String}}, buf::Vector{UInt8}, pos::Int64, len::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:247
[8] readvalue(B::Avro.Binary, MT::Avro.MapType, #unused#::Type{Dict{Tuple{String, String, String}, Int64}}, buf::Vector{UInt8}, pos::Int64, buflen::Int64, opts::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# Avro ~/.julia/packages/Avro/JEoRa/src/types/maps.jl:63
[9] read(buf::Vector{UInt8}, ::Type{Dict{Tuple{String, String, String}, Int64}}; schema::Avro.MapType, jsonencoding::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:58
[10] read(buf::Vector{UInt8}, ::Type{Dict{Tuple{String, String, String}, Int64}})
# Avro ~/.julia/packages/Avro/JEoRa/src/types/binary.jl:58
[11] top-level scope
# In[210]:3
[12] eval
# ./boot.jl:360 [inlined]
[13] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
# Base ./loading.jl:1094
What is going wrong?
Avro.jl is unable to properly read from the buffer into a Dict (or, as Avro calls it, a "Map") that uses a Tuple as a key because, according to the Avro specification:
Map keys are assumed to be strings.
This assumption is hard-coded into Avro.jl: no matter what the actual type of the Dict keys is, the code forces the key to be a String. Avro.jl does not bother to check that the key is actually a subtype of AbstractString because as long as the value can be converted to a String via the Base.string method, the code will write that string representation to the buffer. And that is exactly what is happening when you write a Dict with Tuple keys:
v = Dict((1,2) => 3)
buf = Avro.write(v)
Char.(buf)
This decodes the bytes in buf as ASCII/Unicode characters and prints them to the REPL. You should see the string representation of the Tuple (1,2) in there encoded as "(1, 2)":
11-element Vector{Char}:
'\x01': ASCII/Unicode U+0001 (category Cc: Other, control)
'\x10': ASCII/Unicode U+0010 (category Cc: Other, control)
'\f': ASCII/Unicode U+000C (category Cc: Other, control)
'(': ASCII/Unicode U+0028 (category Ps: Punctuation, open)
'1': ASCII/Unicode U+0031 (category Nd: Number, decimal digit)
',': ASCII/Unicode U+002C (category Po: Punctuation, other)
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
'2': ASCII/Unicode U+0032 (category Nd: Number, decimal digit)
')': ASCII/Unicode U+0029 (category Pe: Punctuation, close)
'\x06': ASCII/Unicode U+0006 (category Cc: Other, control)
'\0': ASCII/Unicode U+0000 (category Cc: Other, control)
The problem arises when you try to read that key back into a Tuple. When reading a key of a Map element, Avro.jl will try to read whatever is in the buffer as a String and stuff it into whatever type the key is. If the type is a Tuple of N types that can be constructed from UInt8 values (eltype(buf)), then the next N UInt8 values in the buffer will be used to create the key:
Avro.read(buf, typeof(v))
# Dict{Tuple{Int64, Int64}, Int64} with 1 entry:
# (40, 49) => 3
Why 40 and 49? Because those are the Int64 representations of the Chars '(' and '1', respectively:
Char(40)
# '(': ASCII/Unicode U+0028 (category Ps: Punctuation, open)
Char(49)
# '1': ASCII/Unicode U+0031 (category Nd: Number, decimal digit)
Note that this is why your second example reads back only one element of the Dict even though two were written. The two-element Tuple that is parsed as the key takes only the first two characters of the string representation, which are '(' and '5' for both keys in your example. The Dict cannot have duplicate keys, so the second value simply overwrites the first.
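The key collision can be reproduced without Avro.jl at all. The following is a minimal sketch using only Base; it mimics the behavior described above by taking the first two bytes of each key's string representation, and the variable names are made up for illustration:

```julia
# String representations that get written for the two Tuple keys of example 2
keys_as_written = string.([(5, 2), (5, 4)])

# Mimic the read side: only the first two bytes of each string become the key
keys_as_read = [Tuple(Int.(codeunits(s)[1:2])) for s in keys_as_written]

# Both keys collapse to the same Tuple, so the later value overwrites the earlier one
d = Dict(zip(keys_as_read, [7, 11]))
println(keys_as_read)
println(d)
```

Both strings start with "(5", so both keys become the byte values of '(' and '5', and only one entry survives.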
How to fix it
Avoid using non-strings as keys
Because the Avro specifications specifically state that the key of a Map is assumed to be a string, you should probably follow the specification and avoid using non-strings as keys. In my opinion, Avro.jl should not let the user write a Dict with keys that are not subtypes of AbstractString. Maybe that's a design choice, or maybe that's a bug, but it might be worth filing an issue on the project page just in case.
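One way to stay within the specification is to turn each Tuple key into a String yourself before writing and to parse it back after reading. This is only a sketch under assumptions: encode_key and decode_key are hypothetical helper names, the comma-separated format is an arbitrary choice, and the Avro.write/Avro.read calls are left out so the snippet runs with Base alone:

```julia
# Hypothetical helpers: a two-Int Tuple key <-> a plain String key
encode_key(t::NTuple{2,Int}) = join(t, ',')
function decode_key(s::AbstractString)
    parts = split(s, ',')
    return (parse(Int, parts[1]), parse(Int, parts[2]))
end

v3 = Dict((5, 2) => 7, (5, 4) => 11)

# This Dict{String,Int} is what would be handed to Avro.write
as_strings = Dict(encode_key(k) => val for (k, val) in v3)

# After Avro.read returns a Dict{String,Int}, rebuild the Tuple keys
restored = Dict(decode_key(k) => val for (k, val) in as_strings)
println(restored == v3)
```

The string-keyed Dict has the same shape as the first example in the question, which is the case that already round-trips correctly.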
Use a custom type as a key
If you really, really want to use something other than a String as a key, Avro.jl will always convert the key to a String when it serializes a Map to a buffer using the Base.string method. During deserialization, if the code recognizes the key as a struct, it will try to pass the serialized String to the struct's constructor. Therefore all you have to do is define a custom struct with a constructor that takes a String and make it do the right thing (and optionally overload the Base.string method). Here's an example:
struct XY
x::Int64
y::Int64
end
function XY(s::String)
# parse the default string representation of an XY value
# very inefficient: for demonstration purposes only
m = match(r"XY\((\d+), (\d+)\)", s)
XY(parse.(Int64, m.captures)...)
end
v2 = Dict(XY(1,2) => 3)
buf2 = Avro.write(v2)
Avro.read(buf2, typeof(v2))
# Dict{XY, Int64} with 1 entry:
# XY(1, 2) => 3
Write your own Tuple construct method
If you really, really, really want to use a Tuple as a key, you can take advantage of StructTypes.StringType and define your own StructTypes.construct method. Because Avro.jl uses the unsafe pointer version, you're stuck defining the same for your Tuple. Here is an awkward example (it needs import StructTypes first):
function StructTypes.construct(::Type{Tuple{Int64,Int64}}, ptr::Ptr{UInt8}, len::Int; kw...)
arr = unsafe_wrap(Vector{UInt8}, ptr, len)
s = join(Char.(arr))
m = findall(r"\d+", s)
(parse(Int64, s[m[1]]), parse(Int64, s[m[2]]))
end
Avro.read(buf, typeof(v))
# Dict{Tuple{Int64, Int64}, Int64} with 1 entry:
# (1, 2) => 3
For the curious: why does Avro.jl get the value right, even if the key is parsed incorrectly?
In Avro's binary encoding scheme, strings are serialized with their lengths stored at the beginning of the string. This allows Avro.jl to pass the known length of the string key to the pointer-based StructTypes.construct method, which passes an Array{UInt8,1} to the Tuple constructor. A fun fact about Julia is that the iterable-based constructor for a Tuple will only read as many elements from the iterable as necessary to construct the Tuple, then stop. Example:
Tuple{Int64, Int64}([1,2,3,4])
# (1, 2)
So Avro.jl passes a 6-element Array{UInt8,1} (['(', '1', ',', ' ', '2', ')']) to the constructor of Tuple{Int64,Int64} which in turn reads only the first two elements, then returns the Tuple for Avro.jl to use as the key of the Map element. Avro.jl then skips ahead to where it knows the string ends (remember: it stores the length of the string in the buffer) and starts reading there for the value of the Map element. Avro.jl knows that value should be an Int64, and it knows how to parse an Int64, so it reads the appropriate value. Neat!
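To make the length prefix concrete, the 11 bytes shown earlier for Dict((1, 2) => 3) can be decoded by hand. This sketch uses only Base; zigzag is a hypothetical helper that handles single-byte values only, which is enough here. Per the Avro specification, integers are zigzag-encoded, and a negative block count means that a byte size for the block follows:

```julia
# The 11 bytes from the Char.(buf) dump above
buf = UInt8[0x01, 0x10, 0x0c, codeunits("(1, 2)")..., 0x06, 0x00]

# Decode a zigzag-encoded integer that fits in a single byte
zigzag(b::UInt8) = xor(Int(b) >> 1, -(Int(b) & 1))

nentries  = zigzag(buf[1])            # negative: one entry, block byte size follows
blocksize = zigzag(buf[2])            # bytes in the block: key length, key, value
keylen    = zigzag(buf[3])            # length prefix of the string key
key       = String(buf[4:3+keylen])   # the key bytes themselves
value     = zigzag(buf[4+keylen])     # the Int64 value stored after the key
println((nentries, blocksize, keylen, key, value))
```

Because keylen is known before the key is interpreted, the reader can jump straight to the value no matter how the key bytes end up being parsed. The final 0x00 byte marks the end of the map.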
Related
Suppose I make my own custom vector type with its own custom show method:
struct MyVector{T} <: AbstractVector{T}
v::Vector{T}
end
function Base.show(io::IO, v::MyVector{T}) where {T}
println(io, "My custom vector with eltype $T with elements")
for i in eachindex(v)
println(io, " ", v.v[i])
end
end
If I try making one of these objects at the REPL I get unexpected errors related to functions I never intended to call:
julia> MyVector([1, 2, 3])
Error showing value of type MyVector{Int64}:
ERROR: MethodError: no method matching size(::MyVector{Int64})
Closest candidates are:
size(::AbstractArray{T,N}, ::Any) where {T, N} at abstractarray.jl:38
size(::BitArray{1}) at bitarray.jl:77
size(::BitArray{1}, ::Integer) at bitarray.jl:81
...
Stacktrace:
[1] axes at ./abstractarray.jl:75 [inlined]
[2] summary(::IOContext{REPL.Terminals.TTYTerminal}, ::MyVector{Int64}) at ./show.jl:1877
[3] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::MyVector{Int64}) at ./arrayshow.jl:316
[4] display(::REPL.REPLDisplay, ::MIME{Symbol("text/plain")}, ::Any) at /Users/mason/julia/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:132
[5] display(::REPL.REPLDisplay, ::Any) at /Users/mason/julia/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:136
[6] display(::Any) at ./multimedia.jl:323
...
Okay, whatever so I'll implement Base.size so it'll leave me alone:
julia> Base.size(v::MyVector) = size(v.v)
julia> MyVector([1, 2, 3])
3-element MyVector{Int64}:
Error showing value of type MyVector{Int64}:
ERROR: getindex not defined for MyVector{Int64}
Stacktrace:
[1] error(::String, ::Type) at ./error.jl:42
[2] error_if_canonical_getindex(::IndexCartesian, ::MyVector{Int64}, ::Int64) at ./abstractarray.jl:991
[3] _getindex at ./abstractarray.jl:980 [inlined]
[4] getindex at ./abstractarray.jl:981 [inlined]
[5] isassigned(::MyVector{Int64}, ::Int64, ::Int64) at ./abstractarray.jl:405
[6] alignment(::IOContext{REPL.Terminals.TTYTerminal}, ::MyVector{Int64}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Int64, ::Int64, ::Int64) at ./arrayshow.jl:67
[7] print_matrix(::IOContext{REPL.Terminals.TTYTerminal}, ::MyVector{Int64}, ::String, ::String, ::String, ::String, ::String, ::String, ::Int64, ::Int64) at ./arrayshow.jl:186
[8] print_matrix at ./arrayshow.jl:159 [inlined]
[9] print_array at ./arrayshow.jl:308 [inlined]
[10] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::MyVector{Int64}) at ./arrayshow.jl:345
[11] display(::REPL.REPLDisplay, ::MIME{Symbol("text/plain")}, ::Any) at /Users/mason/julia/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:132
[12] display(::REPL.REPLDisplay, ::Any) at /Users/mason/julia/usr/share/julia/stdlib/v1.3/REPL/src/REPL.jl:136
[13] display(::Any) at ./multimedia.jl:323
...
Hmm, now it wants getindex
julia> Base.getindex(v::MyVector, args...) = getindex(v.v, args...)
julia> MyVector([1, 2, 3])
3-element MyVector{Int64}:
1
2
3
What? That wasn't the print formatting I told it to do! What's going on here?
The problem is that in Julia, Base defines a method Base.show(io::IO, ::MIME"text/plain", X::AbstractArray) which is actually more specific than the Base.show(io::IO, v::MyVector) for the purposes of display. This section of the Julia manual describes the sort of custom printing that AbstractArray uses. So if we want to use our custom show method, we instead need to do
julia> function Base.show(io::IO, ::MIME"text/plain", v::MyVector{T}) where {T}
println(io, "My custom vector with eltype $T and elements")
for i in eachindex(v)
println(io, " ", v.v[i])
end
end
julia> MyVector([1, 2, 3])
My custom vector with eltype Int64 and elements
1
2
3
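The difference between the two show methods can also be checked outside the REPL with sprint. This is a small sketch; DemoVector is a hypothetical type that mirrors MyVector, with size and getindex defined so that the fallback array printing works too:

```julia
struct DemoVector{T} <: AbstractVector{T}
    v::Vector{T}
end
Base.size(m::DemoVector) = size(m.v)
Base.getindex(m::DemoVector, i::Int) = m.v[i]

# Three-argument show: this is the method the REPL display goes through
Base.show(io::IO, ::MIME"text/plain", m::DemoVector{T}) where {T} =
    print(io, "DemoVector{$T} with ", length(m), " elements")

x = DemoVector([1, 2, 3])
rich  = sprint(show, MIME("text/plain"), x)  # uses the custom method above
plain = sprint(show, x)                      # two-argument show, still Base's array printing
println(rich)
println(plain)
```

Only the three-argument method was overridden here, so print(x) and string interpolation keep using Base's compact array printing.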
See also: https://discourse.julialang.org/t/extending-base-show-for-array-of-types/31289
I wish to use the JLD package to write an OrderedDict to file in such a way that I can subsequently read it back unchanged.
Here was my first effort:
using JLD, HDF5, DataStructures
function testjld()
res = OrderedDict("A" => 1, "B" => 2)
filename = "c:/temp/test.jld"
save(File(format"JLD", filename), "res", res)
res2 = load(filename)["res"]
#Check if round-tripping works
res == res2
end
But the "round-tripping" doesn't work - the function returns false. It also raises a warning:
julia> testjld()
┌ Warning: type JLD.AssociativeWrapper{Core.String,Core.Int64,OrderedCollections.OrderedDict{Core.String,Core.Int64}} not present in workspace; reconstructing
└ @ JLD C:\Users\Philip\.julia\packages\JLD\1BoSz\src\jld_types.jl:703
false
After reading the docs, I thought that JLD does not support OrderedDict "out of the box", but does support Dict and I can use that fact to write my own custom serialisation for OrderedDict. Something like this:
struct OrderedDictSerializer
d::Dict
end
JLD.writeas(data::OrderedDict) = OrderedDictSerializer(Dict("contents" => convert(Dict, data),
"keyorder" => [k for (k, v) in data]))
function JLD.readas(serdata::OrderedDictSerializer)
unordered = serdata.d["contents"]
keyorder = serdata.d["keyorder"]
OrderedDict((k, unordered[k]) for k in keyorder)
end
Hardly an exhaustive test, but this does seem to work:
julia> testjld()
true
Am I correct in thinking I need to write my own serializer for OrderedDict, and can my serializer be improved?
EDIT
The answer to my question "Can my serializer be improved?" seems to be "It will have to be, though I don't yet understand how."
Consider the two following test functions:
function testjld2()
res = OrderedDict("A" => [1.0,2.0],"B" => [3.0,4.0])
#check if round-tripping of readas and writeas methods works:
JLD.readas(JLD.writeas(res)) == res
end
function testjld3()
res = OrderedDict("A" => [1.0,2.0],"B" => [3.0,4.0])
filename = "c:/temp/test.jld"
save(File(format"JLD", filename), "res", res)
res2 = load(filename)["res"]
#Check if round-tripping to jld file and back works
res == res2
end
testjld2 shows that my writeas and readas methods correctly round-trip for an OrderedDict{String,Array{Float64,1}} with 2 entries:
julia> testjld2()
true
and yet testjld3 doesn't work at all, but yields an error:
julia> testjld3()
HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
#000: E:/mingwbuild/mingw-w64-hdf5/src/hdf5-1.10.5/src/H5Tfields.c line 60 in H5Tget_nmembers(): not a datatype
major: Invalid arguments to routine
minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
#000: E:/mingwbuild/mingw-w64-hdf5/src/hdf5-1.10.5/src/H5Tfields.c line 60 in H5Tget_nmembers(): not a datatype
major: Invalid arguments to routine
minor: Inappropriate type
ERROR: Error getting the number of members
Stacktrace:
[1] error(::String) at .\error.jl:33
[2] h5t_get_nmembers at C:\Users\Philip\.julia\packages\HDF5\rF1Fe\src\HDF5.jl:2279 [inlined]
[3] _gen_h5convert!(::Any) at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\jld_types.jl:638
[4] #s27#9(::Any, ::Any, ::Any, ::Any, ::Any, ::Any) at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\jld_types.jl:664
[5] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any,N} where N) at .\boot.jl:524
[6] #write_compound#24(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(JLD.write_compound), ::JLD.JldGroup, ::String, ::JLD.AssociativeWrapper{String,Any,Dict{String,Any}}, ::JLD.JldWriteSession) at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:700
[7] write_compound at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:694 [inlined]
[8] #_write#23 at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:690 [inlined]
[9] _write at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:690 [inlined]
[10] write_ref(::JLD.JldFile, ::Dict{String,Any}, ::JLD.JldWriteSession) at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:658
[11] macro expansion at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\jld_types.jl:648 [inlined]
[12] h5convert!(::Ptr{UInt8}, ::JLD.JldFile, ::OrderedDictSerializer, ::JLD.JldWriteSession) at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\jld_types.jl:664
[13] #write_compound#24(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(JLD.write_compound), ::JLD.JldFile, ::String, ::OrderedDictSerializer, ::JLD.JldWriteSession) at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:700
[14] write_compound at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:694 [inlined]
[15] #_write#23 at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:690 [inlined]
[16] _write at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:690 [inlined]
[17] #write#17(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(write), ::JLD.JldFile, ::String, ::OrderedDict{String,Array{Float64,1}}, ::JLD.JldWriteSession) at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:514
[18] write at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:514 [inlined]
[19] #35 at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:1223 [inlined]
[20] #jldopen#14(::Base.Iterators.Pairs{Symbol,Bool,Tuple{Symbol,Symbol},NamedTuple{(:compatible, :compress),Tuple{Bool,Bool}}}, ::typeof(jldopen), ::getfield(JLD, Symbol("##35#36")){String,OrderedDict{String,Array{Float64,1}},Tuple{}},
::String, ::Vararg{String,N} where N) at C:\Users\Philip\.julia\packages\JLD\1BoSz\src\JLD.jl:246
[21] testjld3() at .\none:0
[22] top-level scope at REPL[48]:1
Use JLD2 instead:
using JLD2, DataStructures, FileIO
function testjld2()
res = OrderedDict("A" => 1, "B" => 2)
myfilename = "c:/temp/test.jld2"
save(myfilename, "res", res)
res2 = load(myfilename)["res"]
#Check if round-tripping works
res == res2
end
Testing:
julia> testjld2()
true
Personally, whenever I can, I use BSON:
using DataStructures, BSON, OrderedCollections
function testbson()
res = OrderedDict("A" => 1, "B" => 2)
myfilename = "c:/temp/test.bjson"
BSON.bson(myfilename, Dict("res" => res))
res2 = BSON.load(myfilename)["res"]
#Check if round-tripping works
res == res2
end
julia> testbson()
true
I have the following code. It basically iterates over rows in a dataframe and tries to assign a value to column C.
I have tried to find out how to do this without success. I know that the statement r.C = i*100 is not correct; what is the right way to assign a value to column C for each iterated row?
Note that the question is a simplified example, in my real code I need to actually iterate over each row because the calculations are far more complex.
File main2.jl:
struct MyStruct
a::Int32
b::Int32
c::String
end
df = DataFrame( A=Int[], B=Int[] )
push!(df, [1, 10])
push!(df, [2, 20])
push!(df, [3, 30])
insertcols!(df, 3, :C => Int)
println(df)
i = 1
for r in eachrow(df)
global i
r.C = i*100
i = i + 1
end
And I get:
julia> include("main2.jl")
| A | B | C |
| Int64 | Int64 | DataType |
|-------|-------|----------|
| 1 | 10 | Int64 |
| 2 | 20 | Int64 |
| 3 | 30 | Int64 |
ERROR: LoadError: MethodError: Cannot `convert` an object of type Int64 to an object of type DataType
Closest candidates are:
convert(::Type{S}, ::T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)) where {S, T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)} at /home/.../.julia/packages/CategoricalArrays/qcwgl/src/value.jl:91
convert(::Type{T}, ::T) where T at essentials.jl:167
Stacktrace:
[1] setindex!(::Array{DataType,1}, ::Int64, ::Int64) at ./array.jl:766
[2] insert_single_entry!(::DataFrame, ::Int64, ::Int64, ::Int64) at /home/.../.julia/packages/DataFrames/yH0f6/src/dataframe/dataframe.jl:458
[3] setindex! at /home/.../.julia/packages/DataFrames/yH0f6/src/dataframe/dataframe.jl:497 [inlined]
[4] setindex! at /home/.../.julia/packages/DataFrames/yH0f6/src/dataframerow/dataframerow.jl:106 [inlined]
[5] setproperty!(::DataFrameRow{DataFrame,DataFrames.Index}, ::Symbol, ::Int64) at /home/.../.julia/packages/DataFrames/yH0f6/src/dataframerow/dataframerow.jl:129
[6] top-level scope at /usr/home/.../main2.jl:23
[7] include at ./boot.jl:328 [inlined]
[8] include_relative(::Module, ::String) at ./loading.jl:1094
[9] include(::Module, ::String) at ./Base.jl:31
[10] include(::String) at ./client.jl:431
[11] top-level scope at REPL[1]:1
in expression starting at /usr/home/.../main2.jl:21
The standard way to add a column with a sentinel value to a DataFrame is just:
df[!, :C] .= 0
insertcols! is OK to use, but typically it is employed when you want to insert a column in the middle of the DataFrame (not as the last column, which is what my example does).
Now the loop you have written at the end of your question can be stated as:
for (i, r) in enumerate(eachrow(df))
r.C = i*100
end
which I would say is a more typical way to do it.
Finally you could have simply written:
df.C = 100 .* axes(df, 1)
to get the same effect. Note that the last statement could have been much more complex like:
df.C = @. 100 * $axes(df, 1) + df.A + sin(df.B)
or equivalently in this case
df.C = 100 * axes(df, 1) + df.A + sin.(df.B)
(in general - you can freely use broadcasting when working with data frames instead of loops)
The problem is in
insertcols!(df, 3, :C => Int)
where you initialize the :C column with a type (Int) instead of an Int value, like 0. Changing this to
insertcols!(df, 3, :C => 0)
works.
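The same failure can be shown with a plain Vector, without DataFrames. In this sketch (variable names are illustrative only) a column filled with the type Int has element type DataType, so it cannot hold the number 100, while a column filled with the value 0 can:

```julia
col_of_types  = fill(Int, 3)   # what `:C => Int` amounts to: a Vector{DataType}
col_of_values = fill(0, 3)     # what `:C => 0` amounts to: a Vector{Int}

# Assigning an integer into a Vector{DataType} fails in convert
failed = try
    col_of_types[1] = 100
    false
catch err
    err isa MethodError
end

col_of_values[1] = 100         # works as expected
println((eltype(col_of_types), failed, col_of_values))
```

This matches the stack trace in the question, where the error is raised from setindex! on an Array{DataType,1}.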
I am having trouble with pmap() throwing a BoundsError when setting the values of array elements - my code works for 1 worker but not >1. I have written a minimum working example which roughly follows the real code flow:
1. Get source data
2. Define set of points over which to iterate
3. Initialise array points to be calculated
4. Calculate each array point
The main file:
#pmapdemo.jl
using Distributed
#addprocs(length(Sys.cpu_info())) # uncomment this line for error
@everywhere include(joinpath(@__DIR__, "pmapdemo2.jl"))
function main()
# Get source data
source = Dict{String, Any}("t"=>zeros(5),
"x"=>zeros(5,6),
"y"=>zeros(5,3),
"z"=>zeros(5,3))
# Define set of points over which to iterate
iterset = Dict{String, Any}("t"=>source["t"],
"x"=>source["x"],
"y"=>fill(2, size(source["t"])[1], 1),
"z"=>fill(2, size(source["t"])[1], 1))
data = Dict{String, Any}()
# Initialise array points to be calculated
MyMod.initialisearray!(data, iterset)
# Calculate each array point
MyMod.calcarray!(data, iterset, source)
@show data
end
main()
The functionality file:
#pmapdemo2.jl
module MyMod
using Distributed
@everywhere using SharedArrays
# Initialise data array
function initialisearray!(data, fieldset)
zerofield::SharedArray{Float64, 4} = zeros(size(fieldset["t"])[1],
size(fieldset["x"])[2],
size(fieldset["y"])[2],
size(fieldset["z"])[2])
data["field"] = deepcopy(zerofield)
end
# Calculate values of array elements according to values in source
function calcpoint!((data, source, a, b, c, d))
data["field"][a,b,c,d] = rand()
end
# Set values in array
function calcarray!(data, iterset, source)
for a in eachindex(iterset["t"])
# [additional functionality f(a) here]
b = eachindex(iterset["x"][a,:])
c = eachindex(iterset["y"][a,:])
d = eachindex(iterset["z"][a,:])
pmap(calcpoint!, Iterators.product(Iterators.repeated(data,1), Iterators.repeated(source,1), Iterators.repeated(a,1), b, c, d))
end
end
end
The error output:
ERROR: LoadError: On worker 2:
BoundsError: attempt to access 0×0×0×0 Array{Float64,4} at index [1]
setindex! at ./array.jl:767 [inlined]
setindex! at /build/julia/src/julia-1.1.1/usr/share/julia/stdlib/v1.1/SharedArrays/src/SharedArrays.jl:500 [inlined]
_setindex! at ./abstractarray.jl:1043
setindex! at ./abstractarray.jl:1020
calcpoint! at /home/dave/pmapdemo2.jl:25
#112 at /build/julia/src/julia-1.1.1/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:269
run_work_thunk at /build/julia/src/julia-1.1.1/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:56
macro expansion at /build/julia/src/julia-1.1.1/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:269 [inlined]
#111 at ./task.jl:259
Stacktrace:
[1] (::getfield(Base, Symbol("##696#698")))(::Task) at ./asyncmap.jl:178
[2] foreach(::getfield(Base, Symbol("##696#698")), ::Array{Any,1}) at ./abstractarray.jl:1866
[3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Base.Iterators.ProductIterator{Tuple{Base.Iterators.Take{Base.Iterators.Repeated{Dict{String,Any}}},Base.Iterators.Take{Base.Iterators.Repeated{Dict{String,Any}}},Base.Iterators.Take{Base.Iterators.Repeated{Int64}},Base.OneTo{Int64},Base.OneTo{Int64},Base.OneTo{Int64}}}) at ./asyncmap.jl:178
[4] #async_usemap#681 at ./asyncmap.jl:154 [inlined]
[5] #async_usemap at ./none:0 [inlined]
[6] #asyncmap#680 at ./asyncmap.jl:81 [inlined]
[7] #asyncmap at ./none:0 [inlined]
[8] #pmap#213(::Bool, ::Int64, ::Nothing, ::Array{Any,1}, ::Nothing, ::Function, ::Function, ::WorkerPool, ::Base.Iterators.ProductIterator{Tuple{Base.Iterators.Take{Base.Iterators.Repeated{Dict{String,Any}}},Base.Iterators.Take{Base.Iterators.Repeated{Dict{String,Any}}},Base.Iterators.Take{Base.Iterators.Repeated{Int64}},Base.OneTo{Int64},Base.OneTo{Int64},Base.OneTo{Int64}}}) at /build/julia/src/julia-1.1.1/usr/share/julia/stdlib/v1.1/Distributed/src/pmap.jl:126
[9] pmap(::Function, ::WorkerPool, ::Base.Iterators.ProductIterator{Tuple{Base.Iterators.Take{Base.Iterators.Repeated{Dict{String,Any}}},Base.Iterators.Take{Base.Iterators.Repeated{Dict{String,Any}}},Base.Iterators.Take{Base.Iterators.Repeated{Int64}},Base.OneTo{Int64},Base.OneTo{Int64},Base.OneTo{Int64}}}) at /build/julia/src/julia-1.1.1/usr/share/julia/stdlib/v1.1/Distributed/src/pmap.jl:101
[10] #pmap#223(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Base.Iterators.ProductIterator{Tuple{Base.Iterators.Take{Base.Iterators.Repeated{Dict{String,Any}}},Base.Iterators.Take{Base.Iterators.Repeated{Dict{String,Any}}},Base.Iterators.Take{Base.Iterators.Repeated{Int64}},Base.OneTo{Int64},Base.OneTo{Int64},Base.OneTo{Int64}}}) at /build/julia/src/julia-1.1.1/usr/share/julia/stdlib/v1.1/Distributed/src/pmap.jl:156
[11] pmap(::Function, ::Base.Iterators.ProductIterator{Tuple{Base.Iterators.Take{Base.Iterators.Repeated{Dict{String,Any}}},Base.Iterators.Take{Base.Iterators.Repeated{Dict{String,Any}}},Base.Iterators.Take{Base.Iterators.Repeated{Int64}},Base.OneTo{Int64},Base.OneTo{Int64},Base.OneTo{Int64}}}) at /build/julia/src/julia-1.1.1/usr/share/julia/stdlib/v1.1/Distributed/src/pmap.jl:156
[12] calcarray!(::Dict{String,Any}, ::Dict{String,Any}, ::Dict{String,Any}) at /home/dave/pmapdemo2.jl:20
[13] main() at /home/dave/pmapdemo.jl:19
[14] top-level scope at none:0
in expression starting at /home/dave/pmapdemo.jl:23
In pmapdemo2.jl, replacing data["field"][a,b,c,d] = rand() with @show a, b, c, d demonstrates that all workers are running and have full access to the variables being passed; however, replacing it with @show data["field"] instead throws the same error. Surely the entire purpose of SharedArrays is to avoid this? Or am I misunderstanding how to use it with pmap?
This is a crosspost from the Julia discourse here.
pmap will do the work of passing the data to the processes, so you don't need to use SharedArrays. Typically, the function provided to pmap (and indeed map) will be a pure function (and therefore doesn't mutate any variable) which returns one element of an output array. That function is mapped across each element of the input array, and the pmap function will construct the output array for you. For example, in your case, the code may look a bit like this
calcpoint(source, (a,b,c,d)) = rand() # Or some function of source and the indices a,b,c,d
data["field"] = pmap(calcpoint, Iterators.repeated(source), Iterators.product(a,b,c,d))
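A slightly fuller version of this idea, as a sketch only: the formula inside calcpoint is a made-up placeholder, the index ranges are chosen arbitrarily, and the source Dict is passed through a closure rather than Iterators.repeated. Without addprocs this runs on the master process; with workers added, calcpoint would have to be defined with @everywhere:

```julia
using Distributed

# Hypothetical pure kernel: returns one value instead of mutating a shared array
calcpoint(source, (a, b, c, d)) = source["t"][a] + 1000a + 100b + 10c + d

source = Dict{String,Any}("t" => zeros(5))

# All index combinations to evaluate; no output array is created beforehand
indices = Iterators.product(1:5, 1:6, 1:1, 1:1)

# pmap collects the returned values into the output for us
field = pmap(idx -> calcpoint(source, idx), indices)
println(length(field))
```

Because the function only returns values, nothing has to be shared between processes and no SharedArray is needed.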
When I tried to calculate
julia> -2.3^-7.6
-0.0017818389423254909
But the result given by my calculator is
0.0005506 + 0.001694 i
Just to be safe I tried it again and this time it complains. Why does it not complain when I tried it the first time?
julia> a = -2.3; b = -7.6; a^b
ERROR: DomainError with -2.6:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.
Stacktrace:
[1] throw_exp_domainerror(::Float64) at ./math.jl:35
[2] ^(::Float64, ::Float64) at ./math.jl:769
[3] top-level scope at none:0
[4] eval at ./boot.jl:319 [inlined]
[5] #85 at /Users/ssiew/.julia/packages/Atom/jodeb/src/repl.jl:129 [inlined]
[6] with_logstate(::getfield(Main, Symbol("##85#87")),::Base.CoreLogging.LogState) at ./logging.jl:397
[7] with_logger(::Function, ::Atom.Progress.JunoProgressLogger) at ./logging.jl:493
[8] top-level scope at /Users/ssiew/.julia/packages/Atom/jodeb/src/repl.jl:128
This is an order of operations issue. You can see how Julia's parsing that expression:
julia> parse("-2.3^-7.6")
:(-(2.3 ^ -7.6))
and so the reason you don't have any problems is because you're actually taking 2.3 ^ (-7.6), which is 0.0017818389423254909, and then flipping the sign.
Your second approach is equivalent to making sure that the "x" in "x^y" is really negative, or:
julia> parse("(-2.3)^-7.6")
:(-2.3 ^ -7.6)
julia> eval(parse("(-2.3)^-7.6"))
ERROR: DomainError:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.
Stacktrace:
[1] nan_dom_err at ./math.jl:300 [inlined]
[2] ^(::Float64, ::Float64) at ./math.jl:699
[3] eval(::Module, ::Any) at ./boot.jl:235
[4] eval(::Any) at ./boot.jl:234
And if we follow that instruction, we get what you expect:
julia> Complex(-2.3)^-7.6
0.0005506185144176565 + 0.0016946295370871215im
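On Julia 0.7 and later, parse for source text lives in Meta, so the same checks can be written as below. This is a sketch that only restates what is shown above; the variable names are arbitrary:

```julia
# Unary minus binds less tightly than ^, so the outermost call is the negation
ex = Meta.parse("-2.3^-7.6")
println(ex)

# With parentheses, the outermost call is the exponentiation itself
ex_paren = Meta.parse("(-2.3)^-7.6")
println(ex_paren)

# A negative real base with a non-integer exponent raises a DomainError
threw = try
    (-2.3)^-7.6
    false
catch err
    err isa DomainError
end

# A complex base gives the value the calculator reported
z = Complex(-2.3)^-7.6
println((threw, z))
```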