how to create dictionary from julia dataframe? - julia

I have a data frame like the one below, and I want to build a dictionary from it.
df = DataFrame(id=[1, 2, 3, 4], value=["Rajesh", "John", "Jacob", "sundar"], other=[0.43, 0.42,0.54, 0.63])
│ Row │ id │ value │ other │
│ │ Int64 │ String │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1 │ 1 │ Rajesh │ 0.43 │
│ 2 │ 2 │ John │ 0.42 │
│ 3 │ 3 │ Jacob │ 0.54 │
│ 4 │ 4 │ sundar │ 0.63 │
Expected Output:
{1: 'Rajesh', 2: 'John', 3: 'Jacob', 4: 'sundar'}
I know how to do this in pandas,
df.set_index("id")["value"].to_dict()
What would be the pandas's equivalent code in julia?

To make a dictionary from a data frame you could write:
julia> Dict(pairs(eachcol(df)))
Dict{Symbol,AbstractArray{T,1} where T} with 3 entries:
:value => ["Rajesh", "John", "Jacob", "sundar"]
:id => [1, 2, 3, 4]
:other => [0.43, 0.42, 0.54, 0.63]
However, what you ask for is making a dictionary from a vector (that just happens to be stored in a data frame), which you can do in the following way (the pattern is very similar, but just applied to a vector):
julia> Dict(pairs(df.value))
Dict{Int64,String} with 4 entries:
4 => "sundar"
2 => "John"
3 => "Jacob"
1 => "Rajesh"
and if you want a mapping from :id to :value, write (assuming :id is unique; again, it is just two vectors, and the fact that they are stored in a data frame is not important here):
julia> Dict(Pair.(df.id, df.value))
Dict{Int64,String} with 4 entries:
4 => "sundar"
2 => "John"
3 => "Jacob"
1 => "Rajesh"
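The uniqueness assumption mentioned above can be checked explicitly before building the dictionary. The following is a minimal sketch rather than part of the original answer; the variable name id_to_value and the error message are illustrative choices, and Dict(zip(...)) is used as an alternative spelling of Dict(Pair.(...)).

```julia
using DataFrames

df = DataFrame(id=[1, 2, 3, 4],
               value=["Rajesh", "John", "Jacob", "sundar"],
               other=[0.43, 0.42, 0.54, 0.63])

# A Dict keeps only the last value stored under a key, so duplicate ids
# would silently drop rows. Fail loudly instead.
allunique(df.id) || error("id column contains duplicates")

# zip pairs the two columns element by element; Dict consumes the pairs.
id_to_value = Dict(zip(df.id, df.value))
```

If duplicates are expected, the check can be dropped, in which case the last row for each id wins.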

Related

Using combine Julia function on GroupedDataFrame while using regex to reference columns

As you will be able to tell from the question I am a VERY new user to Julia and just trying to do things that I have already done in python and stumbling a bit in the dark. What I am trying to do right now is to create some simple stats over multiple columns based on a certain grouping of the data. So after doing something like:
df = DataFrame(CSV.File(file));
gdf = groupby(df, :Class);
where df looks like:
df[1:3, [:Class, :V1, :V2, :V10]]
Class V1 V2 V10
Int64 Float64 Float64 Float64
1 0 -1.35981 -0.0727812 0.0907942
2 1 1.19186 0.266151 -0.166974
3 0 -1.35835 -1.34016 0.207643
...
I know I can do something like:
combine(gdf, :V1 => maximum => :v1_max, :V1 => minimum => :v1_min, nrow)
But then I saw that I could use regex to reference multiple columns and so my thought was to do something simple like:
combine(gdf, r"V[0-9]{1,2}" => maximum)
and have Julia in a single line generate the max value for all of the columns matching the regex for the grouped DataFrame.
I was finally able to do this, in what I am guessing is not a nice or efficient way, so I am looking for help improving my usage of Julia.
foo = DataFrame(Class=[0, 1])
for v in ["V$i" for i in 1:28]
    foo = join(foo,
               combine(gdf, v => maximum => string(v, "_max")),
               combine(gdf, v => minimum => string(v, "_min")),
               on=:Class)
end
Just write:
combine(gdf, names(gdf, r"V[0-9]{1,2}") .=> maximum)
(note the . in front of =>)
In this case the target column names will be automatically generated.
What I have written above is a shorthand for:
combine(gdf, [n => maximum for n in names(gdf, r"V[0-9]{1,2}")])
Another way to write it is:
combine(AsTable(r"V[0-9]{1,2}") => x -> map(maximum, x), gdf)
in which case the old column names are retained.
The combine syntax is very flexible. I recommend you to have a look at its docstring for all available options.
Consider the following examples:
julia> using DataFrames
julia> passthrough(x...) = (@show x; x)
passthrough (generic function with 1 method)
julia> df = DataFrame(Class=[1,1,2], V1=1:3, V2=11:13)
3×3 DataFrame
│ Row │ Class │ V1 │ V2 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 11 │
│ 2 │ 1 │ 2 │ 12 │
│ 3 │ 2 │ 3 │ 13 │
julia> gdf = groupby(df, :Class)
GroupedDataFrame with 2 groups based on key: Class
First Group (2 rows): Class = 1
│ Row │ Class │ V1 │ V2 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 11 │
│ 2 │ 1 │ 2 │ 12 │
⋮
Last Group (1 row): Class = 2
│ Row │ Class │ V1 │ V2 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 3 │ 13 │
julia> combine(gdf, r"V[0-9]{1,2}" .=> passthrough)
x = ([1, 2], [11, 12])
x = ([3], [13])
2×2 DataFrame
│ Row │ Class │ V1_V2_passthrough │
│ │ Int64 │ Tuple… │
├─────┼───────┼────────────────────┤
│ 1 │ 1 │ ([1, 2], [11, 12]) │
│ 2 │ 2 │ ([3], [13]) │
julia> combine(gdf, r"V[0-9]{1,2}" => passthrough)
x = ([1, 2], [11, 12])
x = ([3], [13])
2×2 DataFrame
│ Row │ Class │ V1_V2_passthrough │
│ │ Int64 │ Tuple… │
├─────┼───────┼────────────────────┤
│ 1 │ 1 │ ([1, 2], [11, 12]) │
│ 2 │ 2 │ ([3], [13]) │
julia> combine(gdf, names(gdf, r"V[0-9]{1,2}") .=> passthrough)
x = ([1, 2],)
x = ([3],)
x = ([11, 12],)
x = ([13],)
2×3 DataFrame
│ Row │ Class │ V1_passthrough │ V2_passthrough │
│ │ Int64 │ Tuple… │ Tuple… │
├─────┼───────┼────────────────┼────────────────┤
│ 1 │ 1 │ ([1, 2],) │ ([11, 12],) │
│ 2 │ 2 │ ([3],) │ ([13],) │
julia> combine(gdf, names(gdf, r"V[0-9]{1,2}") => passthrough)
x = ([1, 2], [11, 12])
x = ([3], [13])
2×2 DataFrame
│ Row │ Class │ V1_V2_passthrough │
│ │ Int64 │ Tuple… │
├─────┼───────┼────────────────────┤
│ 1 │ 1 │ ([1, 2], [11, 12]) │
│ 2 │ 2 │ ([3], [13]) │
In particular it is crucial to understand what gets passed to combine:
julia> r"V[0-9]{1,2}" .=> passthrough
r"V[0-9]{1,2}" => passthrough
julia> r"V[0-9]{1,2}" => passthrough
r"V[0-9]{1,2}" => passthrough
julia> names(gdf, r"V[0-9]{1,2}") .=> passthrough
2-element Array{Pair{String,typeof(passthrough)},1}:
"V1" => passthrough
"V2" => passthrough
julia> names(gdf, r"V[0-9]{1,2}") => passthrough
["V1", "V2"] => passthrough
So, as you can see, it all depends on what gets passed to combine. In particular, r"V[0-9]{1,2}" .=> passthrough and r"V[0-9]{1,2}" => passthrough are parsed as exactly the same expression, in which case passthrough is called only ONCE per group and receives multiple positional arguments.
On the other hand names(gdf, r"V[0-9]{1,2}") .=> passthrough makes passthrough get called for each column separately for each group.
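Coming back to the original goal of one maximum and one minimum per matching column, the broadcasted form can be passed twice. This is a small sketch on the toy data from the examples above; the names vcols and stats are illustrative, the regex is anchored with ^ and $ as an extra precaution, and names is called on the parent data frame rather than on gdf.

```julia
using DataFrames

df = DataFrame(Class=[1, 1, 2], V1=1:3, V2=11:13)
gdf = groupby(df, :Class)

# Column names matching the pattern, taken from the parent data frame.
vcols = names(df, r"^V[0-9]{1,2}$")

# Two broadcasted vectors of pairs: one maximum and one minimum per column.
# Target names are generated automatically (V1_maximum, V1_minimum, ...).
stats = combine(gdf, vcols .=> maximum, vcols .=> minimum)
```

This replaces the loop over repeated join calls from the question with a single combine call.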

How to get dtypes of columns in julia dataframe

How can I get the dtypes of all columns, and of particular column(s), in Julia? To be specific, what is the Julia equivalent of pandas' df.dtypes?
For example,
I have a df like below,
│ Row │ Id │ name │ item location │
│ │ Int64 │ String │ String │
├─────┼───────┼────────┼───────────────┤
│ 1 │ 1 │ A │ xyz │
│ 2 │ 2 │ B │ abc │
│ 3 │ 3 │ C │ def │
│ 4 │ 4 │ D │ ghi │
│ 5 │ 5 │ E │ xyz │
│ 6 │ 6 │ F │ abc │
│ 7 │ 7 │ G │ def │
│ 8 │ 8 │ H │ ghi │
│ 9 │ 9 │ I │ xyz │
│ 10 │ 10 │ J │ abc │
Expected output:
{'id': Int64, 'name': String, 'item location': String}
How to get dtypes, i.e., Int64 │ String │ String in Julia?
You have specified two different expected outputs so I show here how to get both:
julia> df = DataFrame("Id" => 1, "name" => "A", "item_location" => "xyz")
1×3 DataFrame
│ Row │ Id │ name │ item_location │
│ │ Int64 │ String │ String │
├─────┼───────┼────────┼───────────────┤
│ 1 │ 1 │ A │ xyz │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
Int64
String
String
julia> Dict(names(df) .=> eltype.(eachcol(df)))
Dict{String,DataType} with 3 entries:
"Id" => Int64
"name" => String
"item_location" => String
Additionally, if you want to store the result in a DataFrame instead of a Dict, you can write (see the mapcols documentation):
julia> mapcols(eltype, df)
1×3 DataFrame
│ Row │ Id │ name │ item_location │
│ │ DataType │ DataType │ DataType │
├─────┼──────────┼──────────┼───────────────┤
│ 1 │ Int64 │ String │ String │
And if you want a NamedTuple storing this information, write (see the documentation of Tables.columntable):
julia> map(eltype, Tables.columntable(df))
(Id = Int64, name = String, item_location = String)
(in this case note that for very wide tables this might have some extra compilation cost as each time you call it you potentially get a new type of NamedTuple)
You can also use describe(df) which is a catchall for learning about the columns in your data frame.
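The question also asks about particular columns, which the snippets above do not show directly. A minimal sketch, assuming the same one-row data frame (built here from vectors); the names id_type, wanted and subset_types are illustrative:

```julia
using DataFrames

df = DataFrame(Id = [1], name = ["A"], item_location = ["xyz"])

# Element type of a single column.
id_type = eltype(df.Id)

# Element types of a chosen subset of columns, keyed by column name.
# df[!, wanted] selects the columns without copying them.
wanted = [:Id, :name]
subset_types = Dict(wanted .=> eltype.(eachcol(df[!, wanted])))
```

The pattern is the same as for all columns; only the set of columns passed to eachcol changes.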

How can I convert a list of dictionaries into a DataFrame?

I have a list of dictionaries with a format similar to the following. The list
is generated by other functions which I don't want to change. Therefore, the
existence of the list and its dicts can be taken as a given.
dictlist=[]
for i in 1:20
    push!(dictlist, Dict(:a=>i, :b=>2*i))
end
Is there a syntactically clean way of converting this list into a DataFrame?
You can push! the rows (represented by the dictionaries) into an empty DataFrame, per the docs on row-by-row construction.
While, as the docs say, this is substantially slower than column-by-column construction, it is no slower than constructing the columns from the dicts yourself.
df = DataFrame()
for row in dictlist
    push!(df, row)
end
There is a current proposal to make Vector{Dict} a Tables.jl row-table type.
If that is done (which seems likely to happen within a month or so), you could then just do
df = DataFrame(dictlist)
There's no nice direct way (that I'm aware of), but with a list like this, you can first convert it to a list of NamedTuples:
julia> using DataFrames
julia> dictlist=[]
0-element Array{Any,1}
julia> for i in 1:20
push!(dictlist, Dict(:a=>i, :b=>2*i))
end
julia> DataFrame([NamedTuple{Tuple(keys(d))}(values(d)) for d in dictlist])
20×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │
│ 2 │ 2 │ 4 │
│ 3 │ 3 │ 6 │
│ 4 │ 4 │ 8 │
│ 5 │ 5 │ 10 │
│ 6 │ 6 │ 12 │
│ 7 │ 7 │ 14 │
│ 8 │ 8 │ 16 │
│ 9 │ 9 │ 18 │
│ 10 │ 10 │ 20 │
│ 11 │ 11 │ 22 │
│ 12 │ 12 │ 24 │
│ 13 │ 13 │ 26 │
│ 14 │ 14 │ 28 │
│ 15 │ 15 │ 30 │
│ 16 │ 16 │ 32 │
│ 17 │ 17 │ 34 │
│ 18 │ 18 │ 36 │
│ 19 │ 19 │ 38 │
│ 20 │ 20 │ 40 │
Note that just today, I opened this up as an issue in Tables.jl, so there may be better support for this soon.
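One caveat with the NamedTuple conversion: it takes the field order from each dictionary's iteration order, which Julia does not guarantee to be the same across dictionaries. A hedged variant that fixes the column order up front is sketched below; the tuple cols is an assumption about which keys every dictionary contains.

```julia
using DataFrames

dictlist = []
for i in 1:20
    push!(dictlist, Dict(:a => i, :b => 2 * i))
end

# Fix the column order explicitly instead of relying on Dict iteration order.
# Assumes every dictionary has all of these keys.
cols = (:a, :b)
rows = [NamedTuple{cols}(Tuple(d[c] for c in cols)) for d in dictlist]

# A vector of NamedTuples with identical fields is accepted as a row table.
df = DataFrame(rows)
```

If a key is missing from some dictionary, d[c] throws a KeyError, which makes incomplete input visible early.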
This function provides one possible solution:
using DataFrames
function DictionariesToDataFrame(dictlist)
    ret = Dict() #Holds dataframe's columns while we build it
    #Get all unique keys from dictlist and make them entries in ret
    for x in unique([y for x in [collect(keys(x)) for x in dictlist] for y in x])
        ret[x] = []
    end
    for row in dictlist #Loop through each row
        for (key,value) in ret #Use ret to check all possible keys in row
            if haskey(row,key) #Is key present in row?
                push!(value, row[key]) #Yes
            else #Nope
                push!(value, nothing) #So add nothing. Keeps columns same length.
            end
        end
    end
    #Fix the data types of the columns
    for (k,v) in ret #Consider each column
        row_type = unique([typeof(x) for x in v]) #Get datatypes of each row
        if length(row_type)==1 #All rows had same datatype
            row_type = row_type[1] #Fetch datatype
            ret[k] = convert(Array{row_type,1}, v) #Convert column to that type
        end
    end
    #DataFrame is ready to go!
    return DataFrames.DataFrame(ret)
end
#Generate some data
dictlist=[]
for i in 1:20
    push!(dictlist, Dict("a"=>i, "b"=>2*i))
    if i>10
        dictlist[end-1]["c"]=3*i
    end
end
DictionariesToDataFrame(dictlist)
Here is one that does not lose data, but adds missing, for a potentially sparse frame:
using DataFrames
dictlist = [Dict("a" => 2), Dict("a" => 5, "b" => 8)]
keycol = unique(mapreduce(x -> collect(keys(x)), vcat, dictlist))
df = DataFrame()
df[!, Symbol("Keys")] = keycol
for (i, d) in enumerate(dictlist)
    df[!, Symbol(string(i))] = [get(d, k, missing) for k in keycol]
end
println(df)
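The snippet above stores one dictionary per column, with the keys listed in a "Keys" column. If one row per dictionary is wanted instead, the same get(d, k, missing) idea can be applied column by column. This is only a sketch; allkeys is an illustrative name, and the keys are sorted purely to make the column order deterministic.

```julia
using DataFrames

dictlist = [Dict("a" => 2), Dict("a" => 5, "b" => 8)]

# Union of all keys, sorted so the column order is deterministic.
allkeys = sort(unique(mapreduce(d -> collect(keys(d)), vcat, dictlist)))

# One column per key; get fills in missing where a dictionary lacks the key.
df = DataFrame([Symbol(k) => [get(d, k, missing) for d in dictlist]
                for k in allkeys]...)
```

Here each dictionary becomes a row, and the b column holds missing for the first dictionary.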
Just for reference: it looks like there is no method available to convert a list of dicts into a DataFrame. Instead, we have to convert the list of dicts into a dict of lists, i.e. from [(:a => 1, :b => 2), (:a => 3, :b => 4)] into (:a => [1, 3], :b => [2, 4]). So we need to write a function like this:
function to_dict_of_array(data::Array, fields::Array)
    # Pre-allocate the arrays up front to speed things up for large datasets
    doa = Dict(Symbol(field) => Array{Any}(undef, length(data)) for field in fields)
    for (i, datum) in enumerate(data)
        for fn in fields
            sym_fn = Symbol(fn)
            doa[sym_fn][i] = datum[fn]
        end
    end
    return doa
end
Then we can use that function to create the DataFrame.
array_of_dict = [Dict("a" => 1, "b" =>2), Dict("a" => 3, "b" =>4)]
required_field = ["a", "b"]
df = DataFrame(to_dict_of_array(array_of_dict, required_field));
It's just a conceptual example and should be adapted to the use case.

Mutate DataFrames in Julia

I am looking for a function that works like by but doesn't collapse my DataFrame. In R I would use dplyr's group_by(b) %>% mutate(x1 = sum(a)). I don't want to lose information from the table, such as that in variable :c.
mydf = DataFrame(a = 1:4, b = repeat(1:2,2), c=4:-1:1)
bypreserve(mydf, :b, x -> sum(x.a))
│ Row │ a │ b │ c │ x1 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 4 │ 4 │
│ 2 │ 2 │ 2 │ 3 │ 6 │
│ 3 │ 3 │ 1 │ 2 │ 4 │
│ 4 │ 4 │ 2 │ 1 │ 6 │
Adding this functionality is discussed, but I would say that it will take several months to be shipped (the general idea is to allow select to have groupby keyword argument + also add transform function that will work like select but preserve columns of the source data frame).
For now the solution is to use join after by:
join(mydf, by(mydf, :b, x1 = :a => sum), on=:b)
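The transform function mentioned above did ship in later DataFrames.jl releases. Assuming such a release is installed, the grouped mutate can be sketched as follows; result is an illustrative name.

```julia
using DataFrames

mydf = DataFrame(a = 1:4, b = repeat(1:2, 2), c = 4:-1:1)

# Sum :a within each group of :b. The single value per group is repeated
# for every row of that group, and all source columns are kept.
result = transform(groupby(mydf, :b), :a => sum => :x1)
```

Unlike the join-based workaround, this keeps the rows in their original order and needs no second pass over the data frame.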

How to sort a data frame by multiple column(s) in Julia

I want to sort a data frame by multiple columns. Here is a simple data frame I made. How can I sort each column by a different sort type?
using DataFrames
DataFrame(b = ("Hi", "Med", "Hi", "Low"),
levels = ("Med", "Hi", "Low"),
x = ("A", "E", "I", "O"), y = (6, 3, 7, 2),
z = (2, 1, 1, 2))
Ported this over from here.
Unlike R, Julia's DataFrame constructor expects the values in each column to be passed as a vector rather than as a tuple, so: DataFrame(b = ["Hi", "Med", "Hi", "Low"], etc.
Also, DataFrames does not expect explicit levels to be given in the way R does it. Instead, the optional keyword argument categorical is available and should be set to "a vector of Bool indicating which columns should be converted to CategoricalVector".
(after adding the DataFrames and the CategoricalArrays packages)
julia> using DataFrames, CategoricalArrays
julia> xyorz = categorical(rand(("x","y","z"), 5))
5-element CategoricalArray{String,1,UInt32}:
"z"
"y"
"x"
"x"
"z"
julia> smallints = rand(1:4, 5)
5-element Array{Int64,1}:
2
3
2
1
1
julia> df = DataFrame(A = 1:5, B = xyorz, C = smallints)
5×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Categorical… │ Int64 │
├─────┼───────┼──────────────┼───────┤
│ 1 │ 1 │ z │ 2 │
│ 2 │ 2 │ y │ 3 │
│ 3 │ 3 │ x │ 2 │
│ 4 │ 4 │ x │ 1 │
│ 5 │ 5 │ z │ 1 │
Now, what do you want to sort? A ordered by B and then C? That gives [4, 3, 2, 5, 1]:
julia> sort(df, (:B, :C))
5×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Categorical… │ Int64 │
├─────┼───────┼──────────────┼───────┤
│ 1 │ 4 │ x │ 1 │
│ 2 │ 3 │ x │ 2 │
│ 3 │ 2 │ y │ 3 │
│ 4 │ 5 │ z │ 1 │
│ 5 │ 1 │ z │ 2 │
julia> sort(df, (:B, :C)).A
5-element Array{Int64,1}:
4
3
2
5
1
This is a good place to start http://juliadata.github.io/DataFrames.jl/stable/
Your code was creating a single row DataFrame containing a tuple so I corrected it.
Note that for nominal variables you would normally use Symbols rather than Strings.
using DataFrames
df = DataFrame(b = [:Hi, :Med, :Hi, :Low, :Hi],
x = ["A", "E", "I", "O","A"],
y = [6, 3, 7, 2, 1],
z = [2, 1, 1, 2, 2])
sort(df, [:z,:y])
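The question also asks how to give each column its own sort direction. DataFrames.jl exports order for this; the sketch below reuses the data frame above and sorts :z descending and :y ascending. The name sorted is illustrative.

```julia
using DataFrames

df = DataFrame(b = [:Hi, :Med, :Hi, :Low, :Hi],
               x = ["A", "E", "I", "O", "A"],
               y = [6, 3, 7, 2, 1],
               z = [2, 1, 1, 2, 2])

# order wraps a column with its own sort options; plain symbols use the
# defaults (ascending). Rows are sorted by :z descending, ties broken by :y.
sorted = sort(df, [order(:z, rev = true), :y])
```

Other per-column options, such as by for a transformation applied before comparing, can be passed to order in the same way.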
