I'm relatively new to Julia and wondered how to select columns in DataFrames.jl based on a condition, e.g., all columns with an average greater than 0.
One way to select columns based on a column-wise condition is to map that condition on the columns using eachcol, then use the resulting Bool array as a column selector on the DataFrame:
julia> using DataFrames, Statistics
julia> df = DataFrame(a=randn(10), b=randn(10) .- 1, c=randn(10) .+ 1, d=randn(10))
10×4 DataFrame
Row │ a b c d
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────
1 │ -1.05612 -2.01901 1.99614 -2.08048
2 │ -0.37359 0.00750529 2.11529 1.93699
3 │ -1.15199 -0.812506 -0.721653 -0.286076
4 │ 0.992366 -2.05898 0.474682 -0.210283
5 │ 0.206846 -0.922274 1.87723 -0.403679
6 │ -1.01923 -1.4401 -0.0769749 0.0557395
7 │ 1.99409 -0.463743 1.83163 -0.585677
8 │ 2.21445 0.658119 2.33056 -1.01474
9 │ 0.918917 -0.371214 1.76301 -0.234561
10 │ -0.839345 -1.09017 1.38716 -2.82545
julia> f(x) = mean(x) > 0
f (generic function with 1 method)
julia> df[:, map(f, eachcol(df))]
10×2 DataFrame
Row │ a c
│ Float64 Float64
─────┼───────────────────────
1 │ -1.05612 1.99614
2 │ -0.37359 2.11529
3 │ -1.15199 -0.721653
4 │ 0.992366 0.474682
5 │ 0.206846 1.87723
6 │ -1.01923 -0.0769749
7 │ 1.99409 1.83163
8 │ 2.21445 2.33056
9 │ 0.918917 1.76301
10 │ -0.839345 1.38716
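The same Bool mask can also index `names(df)` if you want the matching column names rather than a sub-frame. A minimal sketch, using a small hypothetical frame rather than the random one above so the results are deterministic:

```julia
using DataFrames, Statistics

# Hypothetical frame: column means are 1.5, -3.5, and 0.5
df = DataFrame(a=[1.0, 2.0], b=[-3.0, -4.0], c=[0.0, 1.0])

mask = map(col -> mean(col) > 0, eachcol(df))  # one Bool per column
kept = names(df)[mask]                          # names of columns with positive mean
sub  = df[:, mask]                              # sub-frame with those columns
```

Here `kept` is `["a", "c"]` and `sub` holds just those two columns.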
Related
Suppose my DataFrame has two columns v and g. First, I grouped the DataFrame by column g and calculated the sum of the column v. Second, I used the function maximum to retrieve the maximum sum. I am wondering whether it is possible to retrieve the value in one step? Thanks.
julia> using Random
julia> Random.seed!(1)
TaskLocalRNG()
julia> dt = DataFrame(v = rand(15), g = rand(1:3, 15))
15×2 DataFrame
Row │ v g
│ Float64 Int64
─────┼──────────────────
1 │ 0.0491718 3
2 │ 0.119079 2
3 │ 0.393271 2
4 │ 0.0240943 3
5 │ 0.691857 2
6 │ 0.767518 2
7 │ 0.087253 1
8 │ 0.855718 1
9 │ 0.802561 3
10 │ 0.661425 1
11 │ 0.347513 2
12 │ 0.778149 3
13 │ 0.196832 1
14 │ 0.438058 2
15 │ 0.0113425 1
julia> gdt = combine(groupby(dt, :g), :v => sum => :v)
3×2 DataFrame
Row │ g v
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
julia> maximum(gdt.v)
2.7572966050340257
I am not sure if that is what you mean, but on Julia 1.7+ you can retrieve the values of v and g in one step using the following command:
julia> v, g = findmax(x -> (x.v, x.g), eachrow(gdt))[1]
(2.7572966050340257, 2)
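On Julia versions before 1.7, `findmax` does not accept a function as its first argument. An alternative one-step lookup is to index the aggregated frame with `argmax`; a sketch, with `gdt` reconstructed as a literal so the snippet is self-contained:

```julia
using DataFrames

# Mirrors the aggregated gdt above
gdt = DataFrame(g=[1, 2, 3], v=[1.81257, 2.7573, 1.65398])

row = gdt[argmax(gdt.v), :]  # DataFrameRow at the maximum of :v
v, g = row.v, row.g
```

This returns the whole winning row, so it generalizes to any number of extra columns.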
DataFramesMeta.jl has an @by macro:
julia> @by(dt, :g, :sv = sum(:v))
3×2 DataFrame
Row │ g sv
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
which gives you somewhat neater syntax for the first part of this.
With that, you can do either:
julia> @by(dt, :g, :sv = sum(:v)).sv |> maximum
2.7572966050340257
or (IMO more readably):
julia> @chain dt begin
           @by(:g, :sv = sum(:v))
           maximum(_.sv)
       end
2.7572966050340257
I have a list of dictionaries with a format similar to the following. The list is generated by other functions which I don't want to change, so the existence of the list and its dicts can be taken as a given.
dictlist = []
for i in 1:20
    push!(dictlist, Dict(:a => i, :b => 2 * i))
end
Is there a syntactically clean way of converting this list into a DataFrame?
You can push! the rows (represented by the dictionaries) into an empty DataFrame, per the docs on row-by-row construction.
While, as the docs say, this is substantially slower than column-by-column construction, it is not any slower than constructing the columns from the dicts yourself.
df = DataFrame()
for row in dictlist
    push!(df, row)
end
There is a current proposal to make Vector{Dict} a Tables.jl row-table type.
If that is done (which seems likely to happen within a month or so), then you could just do
df = DataFrame(dictlist)
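That proposal has since been addressed in the ecosystem: if your Tables.jl version provides `Tables.dictrowtable` (added around v1.7, to my knowledge), it unions the keys across the dicts and yields a row table that `DataFrame` can consume. A sketch assuming such a version:

```julia
using DataFrames, Tables

dictlist = [Dict(:a => i, :b => 2 * i) for i in 1:20]

# dictrowtable unions keys across the dicts and preserves row order;
# keys absent from some dicts would become missing
df = DataFrame(Tables.dictrowtable(dictlist))
```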
There's no nice direct way (that I'm aware of), but with a DataFrame like this, you can first convert it to a list of NamedTuples:
julia> using DataFrames
julia> dictlist=[]
0-element Array{Any,1}
julia> for i in 1:20
           push!(dictlist, Dict(:a=>i, :b=>2*i))
       end
julia> DataFrame([NamedTuple{Tuple(keys(d))}(values(d)) for d in dictlist])
20×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │
│ 2 │ 2 │ 4 │
│ 3 │ 3 │ 6 │
│ 4 │ 4 │ 8 │
│ 5 │ 5 │ 10 │
│ 6 │ 6 │ 12 │
│ 7 │ 7 │ 14 │
│ 8 │ 8 │ 16 │
│ 9 │ 9 │ 18 │
│ 10 │ 10 │ 20 │
│ 11 │ 11 │ 22 │
│ 12 │ 12 │ 24 │
│ 13 │ 13 │ 26 │
│ 14 │ 14 │ 28 │
│ 15 │ 15 │ 30 │
│ 16 │ 16 │ 32 │
│ 17 │ 17 │ 34 │
│ 18 │ 18 │ 36 │
│ 19 │ 19 │ 38 │
│ 20 │ 20 │ 40 │
Note that just today, I opened this up as an issue in Tables.jl, so there may be better support for this soon.
This function provides one possible solution:
using DataFrames
function DictionariesToDataFrame(dictlist)
    ret = Dict()  # Holds the dataframe's columns while we build it
    # Get all unique keys from dictlist and make them entries in ret
    for x in unique([y for x in [collect(keys(x)) for x in dictlist] for y in x])
        ret[x] = []
    end
    for row in dictlist          # Loop through each row
        for (key, value) in ret  # Use ret to check all possible keys in row
            if haskey(row, key)            # Is key present in row?
                push!(value, row[key])     # Yes
            else                           # Nope
                push!(value, nothing)      # So add nothing. Keeps columns same length.
            end
        end
    end
    # Fix the data types of the columns
    for (k, v) in ret  # Consider each column
        row_type = unique([typeof(x) for x in v])   # Get datatypes of each row
        if length(row_type) == 1                    # All rows had same datatype
            row_type = row_type[1]                  # Fetch datatype
            ret[k] = convert(Array{row_type,1}, v)  # Convert column to that type
        end
    end
    # DataFrame is ready to go!
    return DataFrames.DataFrame(ret)
end
# Generate some data
dictlist = []
for i in 1:20
    push!(dictlist, Dict("a" => i, "b" => 2 * i))
    if i > 10
        dictlist[end-1]["c"] = 3 * i
    end
end
DictionariesToDataFrame(dictlist)
Here is one that does not lose data, but adds missing, for a potentially sparse frame:
using DataFrames
dictlist = [Dict("a" => 2), Dict("a" => 5, "b" => 8)]
keycol = unique(mapreduce(x -> collect(keys(x)), vcat, dictlist))
df = DataFrame()
df[!, Symbol("Keys")] = keycol
for (i, d) in enumerate(dictlist)
    df[!, Symbol(string(i))] = [get(d, k, missing) for k in keycol]
end
println(df)
Just for reference: it looks like there is no method available to convert a list of dicts into a DataFrame directly. Instead we have to convert the list of dicts into a dict of lists, i.e., from [(:a => 1, :b => 2), (:a => 3, :b => 4)] into (:a => [1, 3], :b => [2, 4]). So we need to create such a function:
function to_dict_of_array(data::Array, fields::Array)
    # Pre-allocate the arrays for speed in case of a large dataset
    doa = Dict(Symbol(field) => Array{Any}(undef, length(data)) for field in fields)
    for (i, datum) in enumerate(data)
        for fn in fields
            sym_fn = Symbol(fn)
            doa[sym_fn][i] = datum[fn]
        end
    end
    return doa
end
Then we can use that function to create the DataFrame.
array_of_dict = [Dict("a" => 1, "b" =>2), Dict("a" => 3, "b" =>4)]
required_field = ["a", "b"]
df = DataFrame(to_dict_of_array(array_of_dict, required_field));
It's just a conceptual example and should be modified based on the use case.
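In later DataFrames.jl versions (roughly 0.22+), `push!` also accepts a `cols` keyword, which extends the push!-based answer above to dicts whose keys differ from row to row. A sketch assuming such a version:

```julia
using DataFrames

# Hypothetical dicts with differing keys
dicts = [Dict(:a => 1, :b => 2), Dict(:a => 3, :c => 4)]

df = DataFrame()
for d in dicts
    # cols=:union adds columns as new keys appear and fills gaps with missing
    push!(df, d; cols=:union)
end
```

After the loop, df has columns a, b, and c, with missing where a dict lacked the key.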
I'm looking for a function that works like by but doesn't collapse my DataFrame. In R I would use dplyr's group_by(b) %>% mutate(x1 = sum(a)). I don't want to lose information from the table, such as that in variable :c.
mydf = DataFrame(a = 1:4, b = repeat(1:2,2), c=4:-1:1)
bypreserve(mydf, :b, x -> sum(x.a))
│ Row │ a     │ b     │ c     │ x1    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 4     │ 4     │
│ 2   │ 2     │ 2     │ 3     │ 6     │
│ 3   │ 3     │ 1     │ 2     │ 4     │
│ 4   │ 4     │ 2     │ 1     │ 6     │
Adding this functionality is being discussed, but I would say that it will take several months to be shipped (the general idea is to allow select to have a groupby keyword argument and also to add a transform function that will work like select but preserve the columns of the source data frame).
For now the solution is to use join after by:
join(mydf, by(mydf, :b, x1 = :a => sum), on=:b)
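The transform-over-groups plan described above did ship in later DataFrames.jl releases (roughly 0.22+). Assuming such a version, the join-free sketch is:

```julia
using DataFrames

mydf = DataFrame(a=1:4, b=repeat(1:2, 2), c=4:-1:1)

# transform keeps every row and every source column,
# and appends the per-group sum aligned with the original row order
res = transform(groupby(mydf, :b), :a => sum => :x1)
```

This yields exactly the four-row table the question asks for, with :c preserved.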
Working with Julia 1.0
I am trying to aggregate (in this case mean-center) several columns by group and looking for a way to loop over the columns as opposed to writing all column names explicitly.
The below works but I am looking for more succinct syntax for cases where I have many columns.
using DataFrames, Statistics
dd=DataFrame(A=["aa";"aa";"bb";"bb"], B=[1.0;2.0;3.0;4.0], C=[5.0;5.0;10.0;10.0])
by(dd, :A, df -> DataFrame(bm = df[:B].-mean(df[:B]), cm = df[:C].-mean(df[:C])))
Is there a way to loop over [:B, :C] and not write the statement separately for each?
You can use aggregate:
julia> centered(col) = col .- mean(col)
centered (generic function with 1 method)
julia> aggregate(dd, :A, centered)
4×3 DataFrame
│ Row │ A │ B_centered │ C_centered │
│ │ String │ Float64 │ Float64 │
├─────┼────────┼────────────┼────────────┤
│ 1 │ aa │ -0.5 │ 0.0 │
│ 2 │ aa │ 0.5 │ 0.0 │
│ 3 │ bb │ -0.5 │ 0.0 │
│ 4 │ bb │ 0.5 │ 0.0 │
Note that the function name is used as a suffix. If you need more customized suffixes, use by and pass it a fancier third argument that iterates over the passed columns, giving them appropriate names.
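In later DataFrames.jl versions, aggregate was deprecated in favor of combine over a groupby. Assuming such a version, broadcasting the pair constructor over the column list gives the same loop-free pattern:

```julia
using DataFrames, Statistics

dd = DataFrame(A=["aa", "aa", "bb", "bb"], B=[1.0, 2.0, 3.0, 4.0], C=[5.0, 5.0, 10.0, 10.0])

centered(col) = col .- mean(col)

# [:B, :C] .=> centered expands to [:B => centered, :C => centered];
# the returned vectors are unnested back to one row per original row
out = combine(groupby(dd, :A), [:B, :C] .=> centered)
```

As with aggregate, the function name is appended to each source column name.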
In R we can convert NA to 0 with:
df[is.na(df)] <- 0
This works for single columns:
df[ismissing.(df[:col]), :col] = 0
Is there a way to do it for the full df?
I don't think there's such a function in DataFrames.jl yet.
But you can hack your way around it by combining colwise and recode. I'm also providing a reproducible example here, in case someone wants to iterate on this answer:
julia> using DataFrames
julia> df = DataFrame(a = [missing, 5, 5],
b = [1, missing, missing])
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1 │
│ 2 │ 5 │ missing │
│ 3 │ 5 │ missing │
julia> DataFrame(colwise(col -> recode(col, missing=>0), df), names(df))
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼───┼───┤
│ 1 │ 0 │ 1 │
│ 2 │ 5 │ 0 │
│ 3 │ 5 │ 0 │
This is a bit ugly as you have to reassign the dataframe column names.
Maybe a simpler way to convert all missing values in a DataFrame is to just use a comprehension:
[df[ismissing.(df[i]), i] = 0 for i in names(df)]
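In later DataFrames.jl versions (roughly 0.21+), DataFrames participate in broadcasting, so the whole frame can be handled in one expression. A sketch assuming such a version:

```julia
using DataFrames

df = DataFrame(a=[missing, 5, 5], b=[1, missing, missing])

# Broadcasting over a DataFrame returns a new DataFrame;
# coalesce replaces each missing with the fallback value 0
df0 = coalesce.(df, 0)
```

Unlike the in-place comprehension above, this leaves df untouched and returns a new frame.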