Splitting datasets into train and test in julia - julia

I am trying to split the dataset into train and test subsets in Julia. So far, I have tried using MLDataUtils.jl package for this operation, however, the results are not up to the expectations.
Below are my findings and issues:
Code
# the inputs are
a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
)
b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
using MLDataUtils
(x1, y1), (x2, y2) = stratifiedobs((a,b), p=0.7)
#Output of this operation is: (which is not the expectation)
println("x1 is: $x1")
x1 is:
10×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │ 3 │
│ 4 │ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │ 6 │
│ 7 │ 7 │ 7 │ 7 │
│ 8 │ 8 │ 8 │ 8 │
│ 9 │ 9 │ 9 │ 9 │
│ 10 │ 10 │ 10 │ 10 │
println("y1 is: $y1")
y1 is:
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
# but x2 is printed as
(0×3 SubDataFrame, Float64[])
# while y2 as
0-element view(::Array{Float64,1}, Int64[]) with eltype Float64)
However, I would like this dataset to be split in 2 parts with 70% data in train and 30% in test.
Please suggest a better approach to perform this operation in julia.
Thanks in advance.

Probably MLJ.jl developers can show you how to do it using the general ecosystem. Here is a solution using DataFrames.jl only:
julia> using DataFrames, Random
julia> a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
)
10×3 DataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 1
2 │ 2 2 2
3 │ 3 3 3
4 │ 4 4 4
5 │ 5 5 5
6 │ 6 6 6
7 │ 7 7 7
8 │ 8 8 8
9 │ 9 9 9
10 │ 10 10 10
julia> function splitdf(df, pct)
#assert 0 <= pct <= 1
ids = collect(axes(df, 1))
shuffle!(ids)
sel = ids .<= nrow(df) .* pct
return view(df, sel, :), view(df, .!sel, :)
end
splitdf (generic function with 1 method)
julia> splitdf(a, 0.7)
(7×3 SubDataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 3 3
2 │ 4 4 4
3 │ 6 6 6
4 │ 7 7 7
5 │ 8 8 8
6 │ 9 9 9
7 │ 10 10 10, 3×3 SubDataFrame
Row │ A B C
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 1
2 │ 2 2 2
3 │ 5 5 5)
I am using views to save memory, but alternatively you could just materialize train and test data frames if you prefer this.

This is how I did implement it for generic arrays in the Beta Machine Learning Toolkit:
"""
partition(data,parts;shuffle=true)
Partition (by rows) one or more matrices according to the shares in `parts`.
# Parameters
* `data`: A matrix/vector or a vector of matrices/vectors
* `parts`: A vector of the required shares (must sum to 1)
* `shufle`: Wheter to randomly shuffle the matrices (preserving the relative order between matrices)
"""
function partition(data::AbstractArray{T,1},parts::AbstractArray{Float64,1};shuffle=true) where T <: AbstractArray
n = size(data[1],1)
if !all(size.(data,1) .== n)
#error "All matrices passed to `partition` must have the same number of rows"
end
ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
return partition.(data,Ref(parts);shuffle=shuffle, fixedRIdx = ridx)
end
function partition(data::AbstractArray{T,N} where N, parts::AbstractArray{Float64,1};shuffle=true,fixedRIdx=Int64[]) where T
n = size(data,1)
nParts = size(parts)
toReturn = []
if !(sum(parts) ≈ 1)
#error "The sum of `parts` in `partition` should total to 1."
end
ridx = fixedRIdx
if (isempty(ridx))
ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
end
current = 1
cumPart = 0.0
for (i,p) in enumerate(parts)
cumPart += parts[i]
final = i == nParts ? n : Int64(round(cumPart*n))
push!(toReturn,data[ridx[current:final],:])
current = (final +=1)
end
return toReturn
end
Use it with:
julia> x = [1:10 11:20]
julia> y = collect(31:40)
julia> ((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3])
Ore that you can partition also in three or more parts, and the number of arrays to partition also is variable.
By default they are also shuffled, but you can avoid it with the parameter shuffle...

using Pkg Pkg.add("Lathe") using Lathe.preprocess: TrainTestSplit train, test = TrainTestSplit(df)
There is also a positional argument, at in the second position that takes a percentage to split at.

Related

Julia panel data find data

Suppose I have the following data.
dt = DataFrame(
id = [1,1,1,1,1,2,2,2,2,2,],
t = [1,2,3,4,5, 1,2,3,4,5],
val = randn(10)
)
Row │ id t val
│ Int64 Int64 Float64
─────┼─────────────────────────
1 │ 1 1 0.546673
2 │ 1 2 -0.817519
3 │ 1 3 0.201231
4 │ 1 4 0.856569
5 │ 1 5 1.8941
6 │ 2 1 0.240532
7 │ 2 2 -0.431824
8 │ 2 3 0.165137
9 │ 2 4 1.22958
10 │ 2 5 -0.424504
I want to make a dummy variable from t to t+2 whether the val>0.5.
For instance, I want to make val_gr_0.5 a new variable.
Could someone help me with how to do this?
Row │ id t val val_gr_0.5
│ Int64 Int64 Float64 Float64
─────┼─────────────────────────
1 │ 1 1 0.546673 0 (search t:1 to 3)
2 │ 1 2 -0.817519 1 (search t:2 to 4)
3 │ 1 3 0.201231 1 (search t:3 to 5)
4 │ 1 4 0.856569 missing
5 │ 1 5 1.8941 missing
6 │ 2 1 0.240532 0 (search t:1 to 3)
7 │ 2 2 -0.431824 1 (search t:2 to 4)
8 │ 2 3 0.165137 1 (search t:3 to 5)
9 │ 2 4 1.22958 missing
10 │ 2 5 -0.424504 missing
julia> using DataFramesMeta
julia> function checkvals(subsetdf)
vals = subsetdf[!, :val]
length(vals) < 3 && return missing
any(vals .> 0.5)
end
checkvals (generic function with 1 method)
julia> for sdf in groupby(dt, :id)
transform!(sdf, :t => ByRow(t -> checkvals(#subset(sdf, #byrow t <= :t <= t+2))) => :val_gr)
end
julia> dt
10×4 DataFrame
Row │ id t val val_gr
│ Int64 Int64 Float64 Bool?
─────┼──────────────────────────────────
1 │ 1 1 0.0619327 false
2 │ 1 2 0.278406 false
3 │ 1 3 -0.595824 true
4 │ 1 4 0.0466594 missing
5 │ 1 5 1.08579 missing
6 │ 2 1 -1.57656 true
7 │ 2 2 0.17594 true
8 │ 2 3 0.865381 true
9 │ 2 4 0.972024 missing
10 │ 2 5 1.54641 missing
first define a function
function run_max(x, window)
window -= 1
res = missings(eltype(x), length(x))
for i in 1:length(x)-window
res[i] = maximum(view(x, i:i+window))
end
res
end
then use it in DataFrames.jl
dt.new = dt.val .> 0.5
transform!(groupby(dt,1), :new => x->run_max(x, 3))

Return the maximum sum in `DataFrames.jl`?

Suppose my DataFrame has two columns v and g. First, I grouped the DataFrame by column g and calculated the sum of the column v. Second, I used the function maximum to retrieve the maximum sum. I am wondering whether it is possible to retrieve the value in one step? Thanks.
julia> using Random
julia> Random.seed!(1)
TaskLocalRNG()
julia> dt = DataFrame(v = rand(15), g = rand(1:3, 15))
15×2 DataFrame
Row │ v g
│ Float64 Int64
─────┼──────────────────
1 │ 0.0491718 3
2 │ 0.119079 2
3 │ 0.393271 2
4 │ 0.0240943 3
5 │ 0.691857 2
6 │ 0.767518 2
7 │ 0.087253 1
8 │ 0.855718 1
9 │ 0.802561 3
10 │ 0.661425 1
11 │ 0.347513 2
12 │ 0.778149 3
13 │ 0.196832 1
14 │ 0.438058 2
15 │ 0.0113425 1
julia> gdt = combine(groupby(dt, :g), :v => sum => :v)
3×2 DataFrame
Row │ g v
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
julia> maximum(gdt.v)
2.7572966050340257
I am not sure if that is what you mean but you can retrieve the values of g and v in one step using the following command:
julia> v, g = findmax(x-> (x.v, x.g), eachrow(gdt))[1]
(4.343050512360169, 3)
DataFramesMeta.jl has an #by macro:
julia> #by(dt, :g, :sv = sum(:v))
3×2 DataFrame
Row │ g sv
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
which gives you somewhat neater syntax for the first part of this.
With that, you can do either:
julia> #by(dt, :g, :sv = sum(:v)).sv |> maximum
2.7572966050340257
or (IMO more readably):
julia> #chain dt begin
#by(:g, :sv = sum(:v))
maximum(_.sv)
end
2.7572966050340257

Condition-based column selection in Julia Programming Language

I'm relatively new to Julia - I wondered how to select some columns in DataFrames.jl, based on condition, e.q., all columns with an average greater than 0.
One way to select columns based on a column-wise condition is to map that condition on the columns using eachcol, then use the resulting Bool array as a column selector on the DataFrame:
julia> using DataFrames, Statistics
julia> df = DataFrame(a=randn(10), b=randn(10) .- 1, c=randn(10) .+ 1, d=randn(10))
10×4 DataFrame
Row │ a b c d
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────
1 │ -1.05612 -2.01901 1.99614 -2.08048
2 │ -0.37359 0.00750529 2.11529 1.93699
3 │ -1.15199 -0.812506 -0.721653 -0.286076
4 │ 0.992366 -2.05898 0.474682 -0.210283
5 │ 0.206846 -0.922274 1.87723 -0.403679
6 │ -1.01923 -1.4401 -0.0769749 0.0557395
7 │ 1.99409 -0.463743 1.83163 -0.585677
8 │ 2.21445 0.658119 2.33056 -1.01474
9 │ 0.918917 -0.371214 1.76301 -0.234561
10 │ -0.839345 -1.09017 1.38716 -2.82545
julia> f(x) = mean(x) > 0
f (generic function with 1 method)
julia> df[:, map(f, eachcol(df))]
10×2 DataFrame
Row │ a c
│ Float64 Float64
─────┼───────────────────────
1 │ -1.05612 1.99614
2 │ -0.37359 2.11529
3 │ -1.15199 -0.721653
4 │ 0.992366 0.474682
5 │ 0.206846 1.87723
6 │ -1.01923 -0.0769749
7 │ 1.99409 1.83163
8 │ 2.21445 2.33056
9 │ 0.918917 1.76301
10 │ -0.839345 1.38716

How can I convert a list of dictionaries into a DataFrame?

I have a list of dictionaries with a format similar to the following. The list
is generated by other functions which I don't want to change. Therefore, the
existance of the list and its dicts can be taken as a given.
dictlist=[]
for i in 1:20
push!(dictlist, Dict(:a=>i, :b=>2*i))
end
Is there a syntactically clean way of converting this list into a DataFrame?
You can push! the rows (represented by the dictionaries) in
Per the docs on row by row construction.
While as the docs say this is substantially slower than column by column construction, it is not any slower than constructing the columns from the dicts yourself.
df = DataFrame()
for row in dictlist
push!(df, row)
end
There is a current proposal
to make Vector{Dict} a Tables.jl row-table type.
If that was done (which seems likely to happen within a month or so)
Then you could just do
df = DataFrame(dictlist)
There's no nice direct way (that I'm aware of), but with a DataFrame like this, you can first convert it to a list of NamedTuples:
julia> using DataFrames
julia> dictlist=[]
0-element Array{Any,1}
julia> for i in 1:20
push!(dictlist, Dict(:a=>i, :b=>2*i))
end
julia> DataFrame([NamedTuple{Tuple(keys(d))}(values(d)) for d in dictlist])
20×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │
│ 2 │ 2 │ 4 │
│ 3 │ 3 │ 6 │
│ 4 │ 4 │ 8 │
│ 5 │ 5 │ 10 │
│ 6 │ 6 │ 12 │
│ 7 │ 7 │ 14 │
│ 8 │ 8 │ 16 │
│ 9 │ 9 │ 18 │
│ 10 │ 10 │ 20 │
│ 11 │ 11 │ 22 │
│ 12 │ 12 │ 24 │
│ 13 │ 13 │ 26 │
│ 14 │ 14 │ 28 │
│ 15 │ 15 │ 30 │
│ 16 │ 16 │ 32 │
│ 17 │ 17 │ 34 │
│ 18 │ 18 │ 36 │
│ 19 │ 19 │ 38 │
│ 20 │ 20 │ 40 │
Note that just today, I opened this up as an issue in Tables.jl, so there may be better support for this soon.
This function provides one possible solution:
using DataFrames
function DictionariesToDataFrame(dictlist)
ret = Dict() #Holds dataframe's columns while we build it
#Get all unique keys from dictlist and make them entries in ret
for x in unique([y for x in [collect(keys(x)) for x in dictlist] for y in x])
ret[x] = []
end
for row in dictlist #Loop through each row
for (key,value) in ret #Use ret to check all possible keys in row
if haskey(row,key) #Is key present in row?
push!(value, row[key]) #Yes
else #Nope
push!(value, nothing) #So add nothing. Keeps columns same length.
end
end
end
#Fix the data types of the columns
for (k,v) in ret #Consider each column
row_type = unique([typeof(x) for x in v]) #Get datatypes of each row
if length(row_type)==1 #All rows had same datatype
row_type = row_type[1] #Fetch datatype
ret[k] = convert(Array{row_type,1}, v) #Convert column to that type
end
end
#DataFrame is ready to go!
return DataFrames.DataFrame(ret)
end
#Generate some data
dictlist=[]
for i in 1:20
push!(dictlist, Dict("a"=>i, "b"=>2*i))
if i>10
dictlist[end-1]["c"]=3*i
end
end
DictionariesToDataFrame(dictlist)
Here is one that does not lose data, but adds missing, for a potentially sparse frame:
using DataFrames
dictlist = [Dict("a" => 2), Dict("a" => 5, "b" => 8)]
keycol = unique(mapreduce(x -> collect(keys(x)), vcat, dictlist))
df = DataFrame()
df[!, Symbol("Keys")] = keycol
for (i, d) in enumerate(dictlist)
df[!, Symbol(string(i))] = [get(d, k, missing) for k in keycol]
end
println(df)
Just for reference Its looks there is no method available to covert a list of dict in to dataframe. Instead we have convert the list of dict into dict of list. I mean from [(:a => 1, :b =>2), (:a => 3, :b =>4)] into (:a => [1, 3], :b => [2, 4]) So we need to create such function:
function to_dict_of_array(data::Array, fields::Array)
# Pre allocate the array needed for speed up in case of large dataset
doa = Dict(Symbol(field) => Array{Any}(undef, length(data)) for field in fields)
for (i, datum) in enumerate(data)
for fn in fields
sym_fn = Symbol(fn)
doa[sym_fn][i] = datum[fn]
end
end
return doa
end
Then we can use that method to create dataframe.
array_of_dict = [Dict("a" => 1, "b" =>2), Dict("a" => 3, "b" =>4)]
required_field = ["a", "b"]
df = DataFrame(to_dict_of_array(array_of_dict, required_field));
Its just a conceptual example. Should be modified based on the usecase.

How to sort a data frame by multiple column(s) in Julia

I want to sort a data frame by multiple columns. Here is a simple data frame I made. How can I sort each column by a different sort type?
using DataFrames
DataFrame(b = ("Hi", "Med", "Hi", "Low"),
levels = ("Med", "Hi", "Low"),
x = ("A", "E", "I", "O"), y = (6, 3, 7, 2),
z = (2, 1, 1, 2))
Ported this over from here.
Unlike R, Julia's DataFrame constructor expects the values in each column to be passed as a vector rather than as a tuple: so DataFrame(b = ["Hi", "Med", "Hi", "Low"], &tc.
Also, DataFrames does not expect explicit levels to be given in the way R does it. Instead, the optional keyword argument categorical is available and should be set to "a vector of Bool indicating which columns should be converted to CategoricalVector".
(after adding the DataFrames and the CategoricalArrays packages)
julia> using DataFrames, CategoricalArrays
julia> xyorz = categorical(rand(("x","y","z"), 5))
5-element CategoricalArray{String,1,UInt32}:
"z"
"y"
"x"
"x"
"z"
julia> smallints = rand(1:4, 5)
5-element Array{Int64,1}:
2
3
2
1
1
julia> df = DataFrame(A = 1:5, B = xyorz, C = smallints)
5×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Categorical… │ Int64 │
├─────┼───────┼──────────────┼───────┤
│ 1 │ 1 │ z │ 2 │
│ 2 │ 2 │ y │ 3 │
│ 3 │ 3 │ x │ 2 │
│ 4 │ 4 │ x │ 1 │
│ 5 │ 5 │ z │ 1 │
now, what do you want to sort? A on (B then C)? [4, 3, 2, 5, 1]
julia> sort(df, (:B, :C))
5×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Categorical… │ Int64 │
├─────┼───────┼──────────────┼───────┤
│ 1 │ 4 │ x │ 1 │
│ 2 │ 3 │ x │ 2 │
│ 3 │ 2 │ y │ 3 │
│ 4 │ 5 │ z │ 1 │
│ 5 │ 1 │ z │ 2 │
julia> sort(df, (:B, :C)).A
5-element Array{Int64,1}:
4
3
2
5
1
This is a good place to start http://juliadata.github.io/DataFrames.jl/stable/
Your code was creating a single row DataFrame containing a tuple so I corrected it.
Note that for nominal variables you would normally used Symbols rather than Strings.
using DataFrames
df = DataFrame(b = [:Hi, :Med, :Hi, :Low, :Hi],
x = ["A", "E", "I", "O","A"],
y = [6, 3, 7, 2, 1],
z = [2, 1, 1, 2, 2])
sort(df, [:z,:y])

Resources