DataFrame column with nothing values in Julia? - julia

I am trying to understand how DataFrames work in Julia and I am having a rough time.
I usually worked with DataFrames --in Python-- adding new columns on every simulation step and populating each row with values.
For example, I have this DataFrame which contains input Data:
using DataFrames
df = DataFrame( A=Int[], B=Int[] )
push!(df, [1, 10])
push!(df, [2, 20])
push!(df, [3, 30])
Now, let's say that I do calculations based on those A and B columns that generate a third column C with DateTime objects. But DateTime objects are not generated for all rows, they could be null.
How is that use case handled in Julia?
How shall I create the new C column and assign values inside the for r in eachrow(df)?
# Pseudocode of what I intend to do
df[! :C] .= nothing
for r in eachrow(df)
if condition
r.C = mySuperComplexFunctionThatReturnsDateTimeForEachRow()
else
r.C = nothing
end
end
To give a runable and concrete code, let's fake condition and function:
df[! :C] .= nothing
for r in eachrow(df)
if r.A == 2
r.C = Dates.now()
else
r.C = nothing
end
end

The efficient pattern to do this is:
df.C = f.(df.A, df.B)
where f is a function that takes scalars and calculates an output based on them (i.e. your simulation code) and you pass to it the columns you need to extract from df to perform the calculations. In this way the Julia compiler will be able to generate fast (type-stable) native code.
In your example the function f would be ifelse so you could write:
df.C = ifelse.(df.A .== 2, Dates.now(), nothing)
Also consider if you return nothing or missing (they have a different interpretation in Julia: nothing means absence of a value and missing means that the value is present but is not known; I am not sure which would be better in your case).

If you initialize the column with df[!, :C] .= nothing it has the element type Nothing. When writing DateTimes to this column, Julia is attempting to convert them to Nothing and fails.
I am not sure if this is the most efficient or recommended solution, but if you initialize the column as a union of DateTime and Nothing
df[!, :C] = Vector{Union{DateTime, Nothing}}(nothing, size(df, 1))
your example should work.

Related

Creating an array of `nothing` of any size in Julia

In Julia it is possible to create arrays of any size using the functions zeros(.) or ones(.). Is there a similar function to create an array that is filled with nothing at initialization but also accepts floats? I mean a function like in this example:
a = array_of_nothing(3)
# a = [nothing,nothing,nothing]
a[1] = 3.14
# a = [3.14,nothing,nothing]
I tried to find information on internet, but without success... Sorry, I am a beginner in Julia.
The fill function can be used to create arrays of arbitrary values, but it's not so easy to use here, since you want a Vector{Union{Float64, Nothing}}. Two options come to mind:
A comprehension:
a = Union{Float64, Nothing}[nothing for _ in 1:3];
a[2] = 3.14;
>> a
3-element Array{Union{Nothing, Float64},1}:
nothing
3.14
nothing
Or ordinary array initialization:
a = Vector{Union{Float64, Nothing}}(undef, 3)
fill!(a, nothing)
a[2] = 3.14
It seems that when you do Vector{Union{Float64, Nothing}}(undef, 3) the vector automatically contains nothing, but I wouldn't rely on that, so fill! may be necessary.
I think you are looking for the Base.fill — Function.
fill(x, dims)
This creates an array filled with value x.
println(fill("nothing", (1,3)))
You can also pass a function Foo() like fill(Foo(), dims) which will return an array filled with the result of evaluating Foo() once.

select string array values based on another array

I can select data that is equal to a value with
data = rand(1:3, 10)
value = 2
data .== value
or equal to a list of values with
values = [1, 2]
in.(data, (values,))
The last one is generic and also works for a scalar: in.(data, (value, )) .
However, this works for Int, but the generic does not work for String values:
data = rand(["A", "B", "C"], 10)
value = "B"
data .== value
values = ["A","B"]
in.(data, (values, ))
in.(data, (value, ))
ERROR: use occursin(x, y) for string containment
Is there a generic way for Strings?
For a generic val input I'm now writing the following, but I feel there must be a better solution.
isa(val, AbstractArray) ? in.(data, (val,)) : data .== val
Background: I'm creating a function to select rows from a dataframe (and do something with them) but I want to allow for both a list of values as well as a single value.
Here is a trick that is worth knowing:
[x;]
Now - if x is an array it will remain an array. If x is a scalar it will become a 1-element array. And this is exactly what you need.
So you can write
in.(data, ([val;],))
The drawback is that it allocates a new array, but I guess that val is small and it is not used in performance critical code? If the code is performance critical I think it is better to treat scalars and arrays by separate branches.

In-place modification/reassignment of vector in Julia without getting copies

Here's some toy code:
type MyType
x::Int
end
vec = [MyType(1), MyType(2), MyType(3), MyType(4)]
ids = [2, 1, 3, 1]
vec = vec[ids]
julia> vec
4-element Array{MyType,1}:
MyType(2)
MyType(1)
MyType(3)
MyType(1)
That looks fine, except for this behavior:
julia> vec[2].x = 60
60
julia> vec
4-element Array{MyType,1}:
MyType(2)
MyType(60)
MyType(3)
MyType(60)
I want to be able to rearrange the contents of a vector, with the possibility that I eliminate some values and duplicate others. But when I duplicate values, I don't want this copy behavior. Is there an "elegant" way to do this? Something like this works, but yeesh:
vec = [deepcopy(vec[ids[i]]) for i in 1:4]
The issue is that you're creating mutable types, and your vector therefore contains references to the instantiated data - so when you create a vector based on ids, you're creating what amounts to a vector of pointers to the structures. This further means that the elements in the vector with the same id are actually pointers to the same object.
There's no good way to do this without ensuring that your references are different. That either means 1) immutable types, which means you can't reassign x, or 2) copy/deepcopy.

Slicing and broadcasting multidimensional arrays in Julia : meshgrid example

I recently started learning Julia by coding a simple implementation of Self Organizing Maps. I want the size and dimensions of the map to be specified by the user, which means I can't really use for loops to work on the map arrays because I don't know in advance how many layers of loops I will need. So I absolutely need broadcasting and slicing functions that work on arrays of arbitrary dimensions.
Right now, I need to construct an array of indices of the map. Say my map is defined by an array of size mapsize = (5, 10, 15), I need to construct an array indices of size (3, 5, 10, 15) where indices[:, a, b, c] should return [a, b, c].
I come from a Python/NumPy background, in which the solution is already given by a specific "function", mgrid :
indices = numpy.mgrid[:5, :10, :15]
print indices.shape # gives (3, 5, 10, 15)
print indices[:, 1, 2, 3] gives [1, 2, 3]
I didn't expect Julia to have such a function on the get-go, so I turned to broadcasting. In NumPy, broadcasting is based on a set of rules that I find quite clear and logical. You can use arrays of different dimensions in the same expression as long as the sizes in each dimension match or one of it is 1 :
(5, 10, 15) broadcasts to (5, 10, 15)
(10, 1)
(5, 1, 15) also broadcasts to (5, 10, 15)
(1, 10, 1)
To help with this, you can also use numpy.newaxis or None to easily add new dimensions to your array :
array = numpy.zeros((5, 15))
array[:,None,:] has shape (5, 1, 15)
This helps broadcast arrays easily :
A = numpy.arange(5)
B = numpy.arange(10)
C = numpy.arange(15)
bA, bB, bC = numpy.broadcast_arrays(A[:,None,None], B[None,:,None], C[None,None,:])
bA.shape == bB.shape == bC.shape = (5, 10, 15)
Using this, creating the indices array is rather straightforward :
indices = numpy.array(numpy.broadcast_arrays(A[:,None,None], B[None,:,None], C[None,None,:]))
(indices == numpy.mgrid[:5,:10,:15]).all() returns True
The general case is of course a bit more complicated, but can be worked around using list comprehension and slices :
arrays = [ numpy.arange(i)[tuple([None if m!=n else slice(None) for m in range(len(mapsize))])] for n, i in enumerate(mapsize) ]
indices = numpy.array(numpy.broadcast_arrays(*arrays))
So back to Julia. I tried to apply the same kind of rationale and ended up achieving the equivalent of the arrays list of the code above. This ended up being rather simpler than the NumPy counterpart thanks to the compound expression syntax :
arrays = [ (idx = ones(Int, length(mapsize)); idx[n] = i;reshape([1:i], tuple(idx...))) for (n,i)=enumerate(mapsize) ]
Now I'm stuck here, as I don't really know how to apply the broadcasting to my list of generating arrays here... The broadcast[!] functions ask for a function f to apply, and I don't have any. I tried using a for loop to try forcing the broadcasting:
indices = Array(Int, tuple(unshift!([i for i=mapsize], length(mapsize))...))
for i=1:length(mapsize)
A[i] = arrays[i]
end
But this gives me an error : ERROR: convert has no method matching convert(::Type{Int64}, ::Array{Int64,3})
Am I doing this the right way? Did I overlook something important? Any help is appreciated.
If you're running julia 0.4, you can do this:
julia> function mgrid(mapsize)
T = typeof(CartesianIndex(mapsize))
indices = Array(T, mapsize)
for I in eachindex(indices)
indices[I] = I
end
indices
end
It would be even nicer if one could just say
indices = [I for I in CartesianRange(CartesianIndex(mapsize))]
I'll look into that :-).
Broadcasting in Julia has been modelled pretty much on broadcasting in NumPy, so you should hopefully find that it obeys more or less the same simple rules (not sure if the way to pad dimensions when not all inputs have the same number of dimensions is the same though, since Julia arrays are column-major).
A number of useful things like newaxis indexing and broadcast_arrays have not been implemented (yet) however. (I hope they will.) Also note that indexing works a bit differently in Julia compared to NumPy: when you leave off indices for trailing dimensions in NumPy, the remaining indices default to colons. In Julia they could be said to default to ones instead.
I'm not sure if you actually need a meshgrid function, most things that you would want to use it for could be done by using the original entries of your arrays array with broadcasting operations. The major reason that meshgrid is useful in matlab is because it is terrible at broadcasting.
But it is quite straightforward to accomplish what you want to do using the broadcast! function:
# assume mapsize is a vector with the desired shape, e.g. mapsize = [2,3,4]
N = length(mapsize)
# Your line to create arrays below, with an extra initial dimension on each array
arrays = [ (idx = ones(Int, N+1); idx[n+1] = i;reshape([1:i], tuple(idx...))) for (n,i) in enumerate(mapsize) ]
# Create indices and fill it one coordinate at a time
indices = zeros(Int, tuple(N, mapsize...))
for (i,arr) in enumerate(arrays)
dest = sub(indices, i, [Colon() for j=1:N]...)
broadcast!(identity, dest, arr)
end
I had to add an initial singleton dimension on the entries of arrays to line up with the axes of indices (newaxis had been useful here...).
Then I go through each coordinate, create a subarray (a view) on the relevant part of indices, and fill it. (Indexing will default to returning subarrays in Julia 0.4, but for now we have to use sub explicitly).
The call to broadcast! just evaluates the identity function identity(x)=x on the input arr=arrays[i], broadcasts to the shape of the output. There's no efficiency lost in using the identity function for this; broadcast! generates a specialized function based on the given function, number of arguments, and number of dimensions of the result.
I guess this is the same as the MATLAB meshgrid functionality. I've never really thought about the generalization to more than two dimensions, so its a bit harder to get my head around.
First, here is my completely general version, which is kinda crazy but I can't think of a better way to do it without generating code for common dimensions (e.g. 2, 3)
function numpy_mgridN(dims...)
X = Any[zeros(Int,dims...) for d in 1:length(dims)]
for d in 1:length(dims)
base_idx = Any[1:nd for nd in dims]
for i in 1:dims[d]
cur_idx = copy(base_idx)
cur_idx[d] = i
X[d][cur_idx...] = i
end
end
#show X
end
X = numpy_mgridN(3,4,5)
#show X[1][1,2,3] # 1
#show X[2][1,2,3] # 2
#show X[3][1,2,3] # 3
Now, what I mean by code generation is that, for the 2D case, you can simply do
function numpy_mgrid(dim1,dim2)
X = [i for i in 1:dim1, j in 1:dim2]
Y = [j for i in 1:dim1, j in 1:dim2]
return X,Y
end
and for the 3D case:
function numpy_mgrid(dim1,dim2,dim3)
X = [i for i in 1:dim1, j in 1:dim2, k in 1:dim3]
Y = [j for i in 1:dim1, j in 1:dim2, k in 1:dim3]
Z = [k for i in 1:dim1, j in 1:dim2, k in 1:dim3]
return X,Y,Z
end
Test with, e.g.
X,Y,Z=numpy_mgrid(3,4,5)
#show X
#show Y
#show Z
I guess mgrid shoves them all into one tensor, so you could do that like this
all = cat(4,X,Y,Z)
which is still slightly different:
julia> all[1,2,3,:]
1x1x1x3 Array{Int64,4}:
[:, :, 1, 1] =
1
[:, :, 1, 2] =
2
[:, :, 1, 3] =
3
julia> vec(all[1,2,3,:])
3-element Array{Int64,1}:
1
2
3

Selecting from a Julia Dataframe using the `contains` function

I have a DataFrame df with a column named "cond". One of the values in this column is "aer". To select all the rows with cond == "aer", this code works:
select(:(cond .== "aer"), df)
But this doesn't
select(:(contains(["aer"],cond)), df)
It fails with the error:
ERROR: all SubDataFrame indices must be > 0
in SubDataFrame at /Users/seanmackesey/.julia/DataFrames/src/dataframe.jl:1007
in sub at /Users/seanmackesey/.julia/DataFrames/src/dataframe.jl:1020
in select at /Users/seanmackesey/.julia/DataFrames/src/dataframe.jl:1031
I looked at the source but fail to understand what's going on here. What are the general limitations on what I can put in expression predicates like this?
I think the problem is that contain isn't a vectorized operation:
julia> contains(["aer"], ["aer", "aer", "abr"])
false
This probably means that it's not generating valid indices.
In general, the family of expressions that should work in select are those that generate a vector of indices. There are a few broken cases, but I believe the problem in this case is just that the predicate isn't producing useful indices.

Resources