I need to populate a column of a data frame with unique factors. I have been using sequential integers, but I don't want the consumer of my function to be confused and think that they can do arithmetic on these values. These values are categorical, with no defined order, distance, or scale. In R, I would have solved this problem with as.factor. I see that there is a CategoricalArrays.jl project, which I have never used, that might offer similar functionality.
Mathematica has a useful Unique function that can create a (as the name implies) unique symbol.
In[1]:= Unique[]
Out[1]= $10
Julia has a similar Symbol that generates a lightweight value that I think makes sense to treat as a factor, but I haven't found a built-in technique to automatically generate unique symbols. You cannot invoke Symbol() without a parameter. I suppose I could call Symbol(UUIDs.uuid1()), but these are very long.
julia> using UUIDs
julia> Symbol(UUIDs.uuid1())
Symbol("8a9452d0-2451-11ec-08b4-3bb7f56a346a")
Is there an idiomatic way to generate short and unique symbols in Julia?
The way to generate a unique Symbol is to use the gensym function.
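For illustration, here is a minimal sketch of how gensym could be used to label rows. The variable names (ids, tagged) are made up for this example. As far as I know, the uniqueness only holds within a running Julia session, so treat that as an assumption if you plan to save the symbols to disk.

```julia
# Five fresh symbols; each call to gensym() returns a new one.
ids = [gensym() for _ in 1:5]

# gensym also accepts a tag that becomes part of the generated name,
# which can make the values easier to recognise when printed.
tagged = gensym("factor")

println(ids)
println(tagged)
```

The generated names are short, but they are not pretty (they contain `#` characters), so they are better suited to internal identifiers than to labels shown to users.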
However, I assume you most likely want to use CategoricalArrays.jl, as you mentioned. This package allows you to create arrays of both ordered and unordered factors, just like in R. The difference from R is that the user will be able to clearly see that what is stored in an array is a factor even after extracting it from the array, e.g.:
julia> using CategoricalArrays
julia> x = categorical(1:3)
3-element CategoricalArray{Int64,1,UInt32}:
1
2
3
julia> x[1]
CategoricalValue{Int64, UInt32} 1
As you can see, the notion of being categorical is not lost, which I guess is exactly what you want.
I have the following function: problema_firma_emprestimo(r,w,r_emprestimo,posicao,posicao_banco), where all inputs are scalars.
This function returns three different matrices, using
return demanda_k_emprestimo,demanda_l_emprestimo,lucro_emprestimo
I need to run this function for a series of values of posicao_banco that are stored in a vector.
I'm doing this using a for loop, because I need three separate arrays, each storing one of the three outputs of the function, where the first dimension of each array corresponds to the index into posicao_banco. My code for this part is:
demanda_k_emprestimo = zeros(num_bancos,na,ny);
demanda_l_emprestimo = similar(demanda_k_emprestimo);
lucro_emprestimo = similar(demanda_k_emprestimo);
for i in eachindex(posicao_bancos)
    demanda_k_emprestimo[i,:,:], demanda_l_emprestimo[i,:,:], lucro_emprestimo[i,:,:] = problema_firma_emprestimo(r,w,r_emprestimo[i],posicao,posicao_bancos[i]);
end
Is there a fast and clean way of doing this using vectorized functions? Something like problema_firma_emprestimo.(r,w,r_emprestimo[i],posicao,posicao_bancos)? When I do this, I get a tuple with the result, but I can't find a good way of unpacking the answer.
Thanks!
Unfortunately, it's not easy to use broadcasting here, since then you will end up with output that is an array of tuples, instead of a tuple of arrays. I think a loop is a very good approach, and has no performance penalty compared to broadcasting.
I would suggest, however, that you organize your output array dimensions differently, so that i indexes into the last dimension instead of the first:
for i in eachindex(posicao_bancos)
demanda_k_emprestimo[:, :, i] , ...
end
This is because Julia arrays are column major, and this way the output values are filled into the output arrays in the most efficient way. You could also consider making the output arrays into vectors of matrices, instead of 3D arrays.
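To make this concrete, here is a small sketch with a made-up stand-in function (foo, ps, and the array sizes are all hypothetical, chosen only for illustration). It shows the loop filling the last dimension, and, for comparison, one way to pull a vector of matrices out of the array of tuples that broadcasting returns.

```julia
# Hypothetical stand-in for a function that returns three matrices.
foo(p) = (fill(p, 2, 3), fill(2p, 2, 3), fill(3p, 2, 3))

ps = [1.0, 2.0, 3.0, 4.0]

# Loop version: i indexes the LAST dimension, so each assignment
# writes to a contiguous block of memory (column-major layout).
A = zeros(2, 3, length(ps))
B = similar(A)
C = similar(A)
for i in eachindex(ps)
    a, b, c = foo(ps[i])
    A[:, :, i] = a
    B[:, :, i] = b
    C[:, :, i] = c
end

# Broadcast version: the result is a Vector of 3-tuples.
results = foo.(ps)

# Picking out the first entry of every tuple gives a Vector of matrices.
As = getindex.(results, 1)
```

The broadcast version is shorter, but it still needs one getindex call per output, so I would not expect it to be any cleaner than the loop in practice.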
On a side note: since you are (or should be) creating an MWE for the sake of the people answering, it would be better if you used shorter and less confusing variable names. Particularly for people who don't understand Portuguese (I'm guessing), your variable names are very long and confusing, and they make the code visually dense. Telling the difference between demanda_k_emprestimo and demanda_l_emprestimo at a glance is hard. The meaning of the variables is not important either, so it's better to just call them A and B or X and Y, and the function foo or something.
I am very new to Julia and mostly code in Python these days. I am using Julia to work with and manipulate HDF5 files.
When I get to writing the data out (h5write), I get an error because the data argument is of mixed type, and I need to find out why.
The error message says that what I am trying to pass in is an Array{Dict{String,Any},4}, but when I look at the values (and it is a huge structure), I see a lot of 0xff and similar values. How do I quickly find out why the value type is Any and not a single type?
Just to make this an answer:
If my_dicts is an Array{Dict{String, Any}, 4}, then one way of working out what types are hiding in the Any part of the dict is:
unique(typeof.(values(my_dicts[1])))
To explain:
my_dicts[1] picks out the first element of your Array, i.e. one of your Dict{String, Any}
values then extracts the values, which is the Any part of the dictionary,
typeof. (notice the dot) broadcasts the typeof function over all elements returned by values, returning the types of all of these elements; and
unique takes the list of all these types and reduces it to its unique elements, so you'll end up with a list of each separate type contained in the Any part of your dictionary.
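As a small worked example of the steps above (the contents of my_dicts are invented here, since the real data isn't shown), plus a variant that scans every dictionary in the array rather than only the first:

```julia
# Made-up data with the same container type as in the error message.
my_dicts = Array{Dict{String,Any},4}(undef, 1, 1, 1, 2)
my_dicts[1] = Dict{String,Any}("a" => 0xff, "b" => 1.5, "c" => "text")
my_dicts[2] = Dict{String,Any}("a" => 0x01, "b" => 2)

# Types hiding in the first dictionary only.
types_first = unique(typeof.(values(my_dicts[1])))

# Types across all dictionaries in the array.
types_all = unique(typeof(v) for d in my_dicts for v in values(d))

println(types_first)
println(types_all)
```

If types_all turns out to contain more than one type, that would explain why the dictionaries cannot be narrowed to a single value type.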
I store important metadata in R objects as attributes. I want to migrate my workflow to Julia and I am looking for a way to represent at least temporarily the attributes as something accessible by Julia. Then I can start thinking about extending the RData package to fill this data structure with actual objects' attributes.
I understand that annotating columns with things like label or unit in a DataFrame (which I think is the most important use of object attributes) is probably going to be implemented in the DataFrames package at some point (https://github.com/JuliaData/DataFrames.jl/issues/35). But I am asking about a more general solution that doesn't depend on this specific use case.
For anyone interested, here is a related discussion in the RData package
In Julia it is idiomatic to define your own types: you'd simply add fields to the type to store the attributes. In R, the nice thing about storing things as attributes is that they don't affect how the type dispatches, e.g. adding metadata to a Vector doesn't make it stop behaving like a Vector. In Julia, that approach is a little more complicated: you'd have to define the AbstractVector interface for your type (https://docs.julialang.org/en/latest/manual/interfaces/#man-interface-array-1) to have it behave like a Vector.
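As a rough sketch of that interface approach (AnnotatedVector and its meta field are names invented here for illustration, not something from Base or any package), a wrapper type only needs a handful of methods before generic code treats it as a vector:

```julia
# A vector that carries a dictionary of attributes alongside its data.
struct AnnotatedVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{Symbol,Any}
end

# Minimal AbstractArray interface: size, getindex, setindex!.
Base.size(v::AnnotatedVector) = size(v.data)
Base.getindex(v::AnnotatedVector, i::Int) = v.data[i]
Base.setindex!(v::AnnotatedVector, x, i::Int) = (v.data[i] = x)
Base.IndexStyle(::Type{<:AnnotatedVector}) = IndexLinear()

v = AnnotatedVector([1.0, 2.0, 3.0], Dict{Symbol,Any}(:unit => "kg"))

total = sum(v)       # generic functions work on the wrapper
doubled = v .* 2     # but the result here is a plain Vector
```

Note that in this sketch operations such as broadcasting return an ordinary Vector, so the metadata is dropped unless you define more methods to propagate it.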
In essence, this means that the workflow solutions are a little different - e.g. often the attribute metadata in R is used to associate metadata to an object when it's returned from a function. An easy way to do something similar in Julia is to have the function return a tuple and assign the result to a tuple:
function ex()
    res = rand(5)
    met = "uniformly distributed random numbers"
    res, met
end
result, metadata = ex()
I don't think there are plans to implement attributes like in R.
I am trying to find an efficient way to create a new array by repeating each element of an old array a different, specified number of times. I have come up with something that works, using array comprehensions, but it is not very efficient, either in memory or in computation:
LENGTH = 1e6
A = collect(1:LENGTH) ## arbitrary values that will be repeated specified numbers of times
NumRepeats = [rand(20:100) for idx = 1:LENGTH] ## arbitrary numbers of times to repeat each value in A
B = vcat([ [A[idx] for n = 1:NumRepeats[idx]] for idx = 1:length(A) ]...)
Ideally, what I would like would be a structure akin to the sparse matrix apparatus that Julia has but that would instead store data efficiently based on the indices where repeated values occur. Barring that, I would at least like an efficient way to create a vector such as B in the example above. I looked into the repeat() function, but as far as I can tell from the documentation and my experimentation with the function, it is just for repeating slices of an array the same number of times for each slice. What is the best way to approach this?
Sounds like you're looking for run-length encoding. There's an RLEVectors.jl package here: https://github.com/phaverty/RLEVectors.jl. Not sure how usable it is. You could also make your own data type fairly easily.
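To give an idea of what a hand-rolled data type could look like, here is a minimal sketch. RunLengthVector and from_counts are names made up for this example and have nothing to do with the RLEVectors.jl API. It stores each value once together with the cumulative end index of its run, and finds elements by binary search.

```julia
# Read-only run-length encoded vector: values[k] occupies the
# indices (ends[k-1] + 1):ends[k] of the logical vector.
struct RunLengthVector{T} <: AbstractVector{T}
    values::Vector{T}
    ends::Vector{Int}   # cumulative run ends, non-decreasing
end

# Build from per-value repeat counts (hypothetical helper).
from_counts(values, counts) = RunLengthVector(values, cumsum(counts))

Base.size(r::RunLengthVector) = (isempty(r.ends) ? 0 : last(r.ends),)

function Base.getindex(r::RunLengthVector, i::Int)
    @boundscheck checkbounds(r, i)
    # First run whose end is >= i contains element i.
    return r.values[searchsortedfirst(r.ends, i)]
end

r = from_counts([10, 20, 30], [2, 0, 3])
println(collect(r))
```

Memory use is proportional to the number of runs rather than the expanded length, at the cost of a binary search on every lookup.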
Thanks for trying RLEVectors.jl. Some features and optimizations had been languishing on master without a version bump. It can definitely be mixed with other vectors for element-wise arithmetic. I'll put the linear algebra operations on the feature request list. Any additional feature suggestions would be most welcome.
RLEVectors.jl has a rep function that works like R's and RLEVectors.inverse_ree is like StatsBase.inverse_rle, but it works on run ends rather than lengths.
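If you just need the dense vector B and don't want to depend on a package, a preallocated loop is one option. This is only a sketch; repeat_each is a name invented here and is not part of Base, RLEVectors.jl, or StatsBase.

```julia
# Expand `values` so that values[k] appears counts[k] times.
function repeat_each(values::AbstractVector, counts::AbstractVector{<:Integer})
    length(values) == length(counts) ||
        throw(ArgumentError("values and counts must have the same length"))
    # Allocate the output once instead of concatenating many small arrays.
    out = Vector{eltype(values)}(undef, sum(counts))
    pos = 1
    for (v, n) in zip(values, counts)
        for _ in 1:n
            out[pos] = v
            pos += 1
        end
    end
    return out
end

B = repeat_each([10, 20, 30], [2, 0, 3])
println(B)
```

This avoids the many temporary arrays and the splatted vcat from the question, which is where I would expect most of the memory overhead to come from.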
Suppose I have a 2-dimensional array and I want to apply several functions to each of its columns. Ideally I would like to get the results back in the form of a matrix (with one row per function, and one column per input column).
The following code generates the values I want, but as an Array of Arrays.
A = rand(10,10)
[mapslices(f, A, 1) for f in [mean median iqr]]
Another similar example is here: "Julia: use of pmap with matrices".
Is there a better syntax for getting the results back in the form of a 2 dimensional array, instead of an array of arrays?
What I'd really like is something with functionality similar to sapply from R (https://stat.ethz.ch/R-manual/R-devel/library/base/html/lapply.html).
You can use an anonymous function as in
mapslices(t -> [mean(t), median(t), iqr(t)], A, 1)
but using comprehensions and splatting, as in your last example, is also fine. For very large arrays, you might want to avoid the temporary allocations introduced by transpose and splatting, but in most cases you don't have to pay attention to that.
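For readers on a recent Julia version, here is a sketch of the same idea. It assumes the keyword form mapslices(f, A, dims=1) and the Statistics standard library. To keep it self-contained, iqr is replaced by a small stand-in, iqr_, defined via quantile (that helper is my own, not a library function). It also shows a two-dimensional comprehension that builds the matrix directly.

```julia
using Statistics

# Stand-in for an interquartile range function (hypothetical helper).
iqr_(x) = quantile(x, 0.75) - quantile(x, 0.25)

# Small deterministic input: columns are 1:5, 6:10, 11:15, 16:20.
A = reshape(collect(1.0:20.0), 5, 4)

# One call to mapslices, returning a 3-element vector per column.
M = mapslices(t -> [mean(t), median(t), iqr_(t)], A, dims=1)

# Alternative: a comprehension over functions (rows) and columns.
fs = [mean, median, iqr_]
R = [f(col) for f in fs, col in eachcol(A)]

println(size(M))
println(size(R))
```

Both results have one row per function and one column per input column, with no transposing or splatting.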
After playing around a bit I found one option, but I am still interested in hearing if there are any better ways of doing it.
[[mapslices(f, A, 1)' for f in [mean median iqr]]...]