Imagine I have a data frame like below:
What I want to do is to fill those missing values with previous values, so after fillling the data frame would be like:
Is there any simple way that I can do this?
This is the way to do it using Impute.jl:
julia> using Impute, DataFrames
julia> df = DataFrame(dt1=[0.2, missing, missing, 1, missing, 5, 6],
dt2=[0.3, missing, missing, 3, missing, 5, 6])
7×2 DataFrame
Row │ dt1 dt2
│ Float64? Float64?
─────┼──────────────────────
1 │ 0.2 0.3
2 │ missing missing
3 │ missing missing
4 │ 1.0 3.0
5 │ missing missing
6 │ 5.0 5.0
7 │ 6.0 6.0
julia> transform(df, names(df) .=> Impute.locf, renamecols=false)
7×2 DataFrame
Row │ dt1 dt2
│ Float64? Float64?
─────┼────────────────────
1 │ 0.2 0.3
2 │ 0.2 0.3
3 │ 0.2 0.3
4 │ 1.0 3.0
5 │ 1.0 3.0
6 │ 5.0 5.0
7 │ 6.0 6.0
Using a quick for loop would do the trick. Maybe it is useful to wrap into a function
using DataFrames
df = DataFrame(a = [1, missing, 2], b = [3, missing, 4])
previous = 0
for c in 1:ncol(df), r in 1:nrow(df)
if !ismissing(df[r, c])
previous = df[r,c]
else
df[r,c] = previous
end
end
julia> df
3×2 DataFrame
│ Row │ a │ b │
│ │ Int64? │ Int64? │
├─────┼────────┼────────┤
│ 1 │ 1 │ 3 │
│ 2 │ 1 │ 3 │
│ 3 │ 2 │ 4 │
df = DataFrame(dt1=[missing, 0.2, missing, missing, 1, missing, 5, 6],
dt2=[9, 0.3, missing, missing, 3, missing, 5, 6])
filldown(v)=accumulate((x,y)->coalesce(y,x), v,init=v[1])
transform(df,[:dt1,:dt2].=>filldown,renamecols=false)
julia> transform(df,[:dt1,:dt2].=>filldown,renamecols=false)
8×2 DataFrame
Row │ dt1 dt2
│ Float64? Float64
─────┼────────────────────
1 │ missing 9.0
2 │ 0.2 0.3
3 │ 0.2 0.3
4 │ 0.2 0.3
5 │ 1.0 3.0
6 │ 1.0 3.0
7 │ 5.0 5.0
8 │ 6.0 6.0
fillup(v)=reverse(filldown(reverse(v)))
transform(df,[:dt2,:dt1].=>[filldown,fillup],renamecols=false)
Related
I'm relatively new to Julia - I wondered how to select some columns in DataFrames.jl, based on condition, e.q., all columns with an average greater than 0.
One way to select columns based on a column-wise condition is to map that condition on the columns using eachcol, then use the resulting Bool array as a column selector on the DataFrame:
julia> using DataFrames, Statistics
julia> df = DataFrame(a=randn(10), b=randn(10) .- 1, c=randn(10) .+ 1, d=randn(10))
10×4 DataFrame
Row │ a b c d
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────
1 │ -1.05612 -2.01901 1.99614 -2.08048
2 │ -0.37359 0.00750529 2.11529 1.93699
3 │ -1.15199 -0.812506 -0.721653 -0.286076
4 │ 0.992366 -2.05898 0.474682 -0.210283
5 │ 0.206846 -0.922274 1.87723 -0.403679
6 │ -1.01923 -1.4401 -0.0769749 0.0557395
7 │ 1.99409 -0.463743 1.83163 -0.585677
8 │ 2.21445 0.658119 2.33056 -1.01474
9 │ 0.918917 -0.371214 1.76301 -0.234561
10 │ -0.839345 -1.09017 1.38716 -2.82545
julia> f(x) = mean(x) > 0
f (generic function with 1 method)
julia> df[:, map(f, eachcol(df))]
10×2 DataFrame
Row │ a c
│ Float64 Float64
─────┼───────────────────────
1 │ -1.05612 1.99614
2 │ -0.37359 2.11529
3 │ -1.15199 -0.721653
4 │ 0.992366 0.474682
5 │ 0.206846 1.87723
6 │ -1.01923 -0.0769749
7 │ 1.99409 1.83163
8 │ 2.21445 2.33056
9 │ 0.918917 1.76301
10 │ -0.839345 1.38716
I want to extract the 3rd and 7th row of a data frame in Julia. The MWE is:
using DataFrames
my_data = DataFrame(A = 1:10, B = 16:25);
my_data
10×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 16 │
│ 2 │ 2 │ 17 │
│ 3 │ 3 │ 18 │
│ 4 │ 4 │ 19 │
│ 5 │ 5 │ 20 │
│ 6 │ 6 │ 21 │
│ 7 │ 7 │ 22 │
│ 8 │ 8 │ 23 │
│ 9 │ 9 │ 24 │
│ 10 │ 10 │ 25 │
This should give you the expected output:
using DataFrames
my_data = DataFrame(A = 1:10, B = 16:25);
my_data;
my_data[[3, 7], :]
2×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 3 │ 18 │
│ 2 │ 7 │ 22 │
The great thing about Julia is that you do not need to materialize the result (and hence save memory and time on copying the data). Hence, if you need a subrange of any array-like structure it is better to use #view rather than materialize directly
julia> #view my_data[[3, 7], :]
2×2 SubDataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 3 │ 18 │
│ 2 │ 7 │ 22 │
Now the performance testing.
function submean1(df)
d = df[[3, 7], :]
mean(d.A)
end
function submean2(df)
d = #view df[[3, 7], :]
mean(d.A)
end
And tests:
julia> using BenchmarkTools
julia> #btime submean1($my_data)
689.262 ns (19 allocations: 1.38 KiB)
5.0
julia> #btime submean2($my_data)
582.315 ns (9 allocations: 288 bytes)
5.0
Even in this simplistic example #view is 15% faster and uses four times less memory. Of course sometimes you want to copy the data but the rule of thumb is not to materialize.
I want to sort a data frame by multiple columns. Here is a simple data frame I made. How can I sort each column by a different sort type?
using DataFrames
DataFrame(b = ("Hi", "Med", "Hi", "Low"),
levels = ("Med", "Hi", "Low"),
x = ("A", "E", "I", "O"), y = (6, 3, 7, 2),
z = (2, 1, 1, 2))
Ported this over from here.
Unlike R, Julia's DataFrame constructor expects the values in each column to be passed as a vector rather than as a tuple: so DataFrame(b = ["Hi", "Med", "Hi", "Low"], &tc.
Also, DataFrames does not expect explicit levels to be given in the way R does it. Instead, the optional keyword argument categorical is available and should be set to "a vector of Bool indicating which columns should be converted to CategoricalVector".
(after adding the DataFrames and the CategoricalArrays packages)
julia> using DataFrames, CategoricalArrays
julia> xyorz = categorical(rand(("x","y","z"), 5))
5-element CategoricalArray{String,1,UInt32}:
"z"
"y"
"x"
"x"
"z"
julia> smallints = rand(1:4, 5)
5-element Array{Int64,1}:
2
3
2
1
1
julia> df = DataFrame(A = 1:5, B = xyorz, C = smallints)
5×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Categorical… │ Int64 │
├─────┼───────┼──────────────┼───────┤
│ 1 │ 1 │ z │ 2 │
│ 2 │ 2 │ y │ 3 │
│ 3 │ 3 │ x │ 2 │
│ 4 │ 4 │ x │ 1 │
│ 5 │ 5 │ z │ 1 │
now, what do you want to sort? A on (B then C)? [4, 3, 2, 5, 1]
julia> sort(df, (:B, :C))
5×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Categorical… │ Int64 │
├─────┼───────┼──────────────┼───────┤
│ 1 │ 4 │ x │ 1 │
│ 2 │ 3 │ x │ 2 │
│ 3 │ 2 │ y │ 3 │
│ 4 │ 5 │ z │ 1 │
│ 5 │ 1 │ z │ 2 │
julia> sort(df, (:B, :C)).A
5-element Array{Int64,1}:
4
3
2
5
1
This is a good place to start http://juliadata.github.io/DataFrames.jl/stable/
Your code was creating a single row DataFrame containing a tuple so I corrected it.
Note that for nominal variables you would normally used Symbols rather than Strings.
using DataFrames
df = DataFrame(b = [:Hi, :Med, :Hi, :Low, :Hi],
x = ["A", "E", "I", "O","A"],
y = [6, 3, 7, 2, 1],
z = [2, 1, 1, 2, 2])
sort(df, [:z,:y])
Working with Julia 1.0
I am trying to aggregate (in this case mean-center) several columns by group and looking for a way to loop over the columns as opposed to writing all column names explicitly.
The below works but I am looking for more succinct syntax for cases where I have many columns.
using DataFrames, Statistics
dd=DataFrame(A=["aa";"aa";"bb";"bb"], B=[1.0;2.0;3.0;4.0], C=[5.0;5.0;10.0;10.0])
by(dd, :A, df -> DataFrame(bm = df[:B].-mean(df[:B]), cm = df[:C].-mean(df[:C])))
Is there a way to loop over [:B, :C] and not write the statement separately for each?
You can use aggregate:
julia> centered(col) = col .- mean(col)
centered (generic function with 1 method)
julia> aggregate(dd, :A, centered)
4×3 DataFrame
│ Row │ A │ B_centered │ C_centered │
│ │ String │ Float64 │ Float64 │
├─────┼────────┼────────────┼────────────┤
│ 1 │ aa │ -0.5 │ 0.0 │
│ 2 │ aa │ 0.5 │ 0.0 │
│ 3 │ bb │ -0.5 │ 0.0 │
│ 4 │ bb │ 0.5 │ 0.0 │
Note that function name is used as a suffix. If you need more customized suffixes use by and pass it a more fancy third argument that iterates over passed columns giving them appropriate names.
In R we can convert NA to 0 with:
df[is.na(df)] <- 0
This works for single columns:
df[ismissing.(df[:col]), :col] = 0
There a way for the full df?
I don't think there's such a function in DataFrames.jl yet.
But you can hack your way around it by combining colwise and recode. I'm also providing a reproducible example here, in case someone wants to iterate on this answer:
julia> using DataFrames
julia> df = DataFrame(a = [missing, 5, 5],
b = [1, missing, missing])
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1 │
│ 2 │ 5 │ missing │
│ 3 │ 5 │ missing │
julia> DataFrame(colwise(col -> recode(col, missing=>0), df), names(df))
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼───┼───┤
│ 1 │ 0 │ 1 │
│ 2 │ 5 │ 0 │
│ 3 │ 5 │ 0 │
This is a bit ugly as you have to reassign the dataframe column names.
Maybe a simpler way to convert all missing values in a DataFrame is to just use list comprehension:
[df[ismissing.(df[i]), i] = 0 for i in names(df)]