Replacing missing values in Julia

I have a data frame that contains some missing values. I want to replace all the missing values in the LoanAmount column with the mean of that column:
df[ismissing.(df.LoanAmount),:LoanAmount]= floor(mean(skipmissing(df.LoanAmount)))
but when I run the code above I get:
MethodError: no method matching setindex!(::DataFrame, ::Float64, ::BitArray{1}, ::Symbol)

Use skipmissing e.g.:
mean(skipmissing(df.LoanAmount))
Answer to the second, edited question: you should broadcast the assignment, using .= instead of =, as in the example below:
julia> df = DataFrame(col=rand([missing;1:3],10))
10×1 DataFrame
│ Row │ col │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ missing │
│ 2 │ 3 │
│ 3 │ 2 │
│ 4 │ 2 │
│ 5 │ missing │
│ 6 │ missing │
│ 7 │ missing │
│ 8 │ 3 │
│ 9 │ 1 │
│ 10 │ 3 │
julia> df[ismissing.(df.col),:col] .= floor(mean(skipmissing(df.col)));
julia> df
10×1 DataFrame
│ Row │ col │
│ │ Int64? │
├─────┼────────┤
│ 1 │ 2 │
│ 2 │ 3 │
│ 3 │ 2 │
│ 4 │ 2 │
│ 5 │ 2 │
│ 6 │ 2 │
│ 7 │ 2 │
│ 8 │ 3 │
│ 9 │ 1 │
│ 10 │ 3 │
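The same rule can be seen on a plain vector, which may make the error easier to understand. The sketch below uses a made-up vector (the names v, mask and assign_scalar_fails are mine, not from the question): assigning a single value to several masked positions with = is rejected, while the broadcasting form .= works.

```julia
using Statistics

# Hypothetical stand-in for a data frame column with missing values.
v = Union{Missing, Int}[missing, 3, 2, missing]
mask = ismissing.(v)

# Returns true if assigning a scalar to the masked positions with `=` throws.
function assign_scalar_fails(x, m)
    try
        x[m] = 0
        return false
    catch
        return true
    end
end

fails = assign_scalar_fails(copy(v), mask)

# Broadcasting assignment writes the same value into every masked position.
fill_value = floor(Int, mean(skipmissing(v)))
v[mask] .= fill_value
```

Here floor(Int, ...) is used so that the fill value is already an integer, which matches the element type of the column.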
Impute.jl
Yet another option is to use Impute.jl, as suggested by Bogumil:
Impute.fill(df;value=(x)->floor(mean(x)))

I also found this approach.
To replace missing values with the mean:
replace!(df.col,missing => floor(mean(skipmissing(df[!,:col]))))
To replace missing values with the mode (mode is provided by StatsBase.jl):
replace!(df.col,missing => mode(skipmissing(df[!,:col])))
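A non-mutating variant of the same idea is to broadcast coalesce over the column. This is a swapped-in technique, not what the answer above uses, and the vector col below is invented for illustration. Unlike replace!, it returns a new vector whose element type no longer allows missing.

```julia
using Statistics

# Hypothetical column with missing values.
col = [missing, 3, 2, 2, missing, 1]

# Mean of the non-missing entries, rounded down to an integer.
fill_value = floor(Int, mean(skipmissing(col)))

# coalesce returns its first non-missing argument, so each missing entry
# is replaced by fill_value and all other entries are kept as they are.
filled = coalesce.(col, fill_value)
```

With a data frame, the result could then be assigned back to the column, for example with df.col = filled.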

Related

How to get a new column that depends on a subset of dataframe columns

My dataframe has 3 columns A, B and C, and in each row only one of these columns contains a value.
I want a MERGE column that contains the value from A, B or C:
using DataFrames
df = DataFrame(NAME = ["a", "b", "c"], A = [1, missing, missing], B = [missing, 2, missing], C = [missing, missing, 3])
3×4 DataFrame
│ Row │ NAME │ A │ B │ C │
│ │ String │ Int64? │ Int64? │ Int64? │
├─────┼────────┼─────────┼─────────┼─────────┤
│ 1 │ a │ 1 │ missing │ missing │
│ 2 │ b │ missing │ 2 │ missing │
│ 3 │ c │ missing │ missing │ 3 │
What is the best Julia way to get the MERGE column?
3×5 DataFrame
│ Row │ NAME │ A │ B │ C │ MERGE │
│ │ String │ Int64? │ Int64? │ Int64? │ Int64 │
├─────┼────────┼─────────┼─────────┼─────────┼───────┤
│ 1 │ a │ 1 │ missing │ missing │ 1 │
│ 2 │ b │ missing │ 2 │ missing │ 2 │
│ 3 │ c │ missing │ missing │ 3 │ 3 │
What I was able to work out so far is:
select(df, :, [:A, :B, :C] => ByRow((a,b,c) -> sum(skipmissing([a, b, c]))) => :MERGE)
What about a scenario with a variable range of columns?
select(df, range => ??? => :MERGE)
You can write it like this:
julia> transform!(df, [:A, :B, :C] => ByRow(coalesce) => :MERGE)
3×5 DataFrame
│ Row │ NAME │ A │ B │ C │ MERGE │
│ │ String │ Int64? │ Int64? │ Int64? │ Int64 │
├─────┼────────┼─────────┼─────────┼─────────┼───────┤
│ 1 │ a │ 1 │ missing │ missing │ 1 │
│ 2 │ b │ missing │ 2 │ missing │ 2 │
│ 3 │ c │ missing │ missing │ 3 │ 3 │
Instead of [:A, :B, :C] you can put any selector, like All(), Between(:A, :C), 1:3 etc.
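To see why coalesce handles a variable number of columns, here is a small sketch on plain vectors rather than on a data frame (the list cols is my own construction). coalesce accepts any number of arguments and returns the first one that is not missing, so splatting a list of columns into a broadcast call merges them row by row.

```julia
# The three columns from the question, as plain vectors.
A = [1, missing, missing]
B = [missing, 2, missing]
C = [missing, missing, 3]

# Any number of columns can be collected in a list...
cols = [A, B, C]

# ...and splatted into a broadcast coalesce, which picks the first
# non-missing value in each row.
merged = coalesce.(cols...)
```

A row in which every column is missing would stay missing in the result.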

How to add suffix or prefix for duplicate columns in julia?

I have two dfs, and both have some common columns that are not included in the on list. If I add the makeunique parameter, it creates a new column with a suffix of the form _n. Is there any way I can pass suffix values such as ['_left', '_right'] to the result df?
In pandas I can pass the lsuffix and rsuffix arguments.
Sample Input:
Df1:
│ Row │ ID │ Name │
│ │ Int64 │ String │
├─────┼───────┼─────────┤
│ 1 │ 1 │ Mohamed │
│ 2 │ 2 │ Thasin │
Df2:
│ Row │ ID │ Job │ Name │
│ │ Int64 │ String │ String │
├─────┼───────┼────────┼────────┤
│ 1 │ 1 │ Tech │ Md │
│ 2 │ 2 │ Tech │ Tn │
│ 3 │ 3 │ Assist │ Rj │
│ 4 │ 4 │ Test │ Mi │
inner join result:
innerjoin(people, jobs, on = :ID, makeunique=true)
│ Row │ ID │ Name │ Job │ Name_1 │
│ │ Int64 │ String │ String │ String │
├─────┼───────┼─────────┼────────┼─────────┤
│ 1 │ 1 │ Mohamed │ Tech │ Md │
│ 2 │ 2 │ Thasin │ Tech │ Tn │
Expected Output:
│ Row │ ID │ Name_left │ Job │ Name_right │
│ │ Int64 │ String │ String │ String │
├─────┼───────┼─────────┼────────┼─────────┤
│ 1 │ 1 │ Mohamed │ Tech │ Md │
│ 2 │ 2 │ Thasin │ Tech │ Tn │
This is not implemented yet. You can expect that it will be added this year. See https://github.com/JuliaData/DataFrames.jl/issues/1333.
What you can do for the time being is:
innerjoin(rename!(s -> s == "ID" ? "ID" : s*"_left", DataFrame!(people)),
rename!(s -> s == "ID" ? "ID" : s*"_right", DataFrame!(jobs)),
on = :ID)
If you do not care about efficiency and want slightly shorter code, use:
innerjoin(rename(s -> s == "ID" ? "ID" : s*"_left", people),
rename(s -> s == "ID" ? "ID" : s*"_right", jobs),
on = :ID)
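For readers on a newer DataFrames.jl, the join functions accept a renamecols keyword that covers this case. The sketch below assumes a release recent enough to provide that keyword, and it rebuilds the two tables from the question under the names people and jobs.

```julia
using DataFrames

# The two tables from the question.
people = DataFrame(ID = [1, 2], Name = ["Mohamed", "Thasin"])
jobs = DataFrame(ID = [1, 2, 3, 4],
                 Job = ["Tech", "Tech", "Assist", "Test"],
                 Name = ["Md", "Tn", "Rj", "Mi"])

# renamecols takes a pair of suffixes (left => right) that are appended
# to the non-key columns of each table.
joined = innerjoin(people, jobs, on = :ID, renamecols = "_left" => "_right")

result_names = names(joined)
```

Note that the suffix is applied to every non-key column, not only to the clashing ones, so Job comes out as Job_right here.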

How to remove/drop rows of nothing and NaN in Julia dataframe?

I have a df which contains nothing, NaN and missing. To remove rows which contain missing I can use dropmissing. Are there any methods to deal with NaN and nothing?
Sample df:
│ Row │ x │ y │
│ │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1 │ 1.0 │ 'a' │
│ 2 │ missing │ 'b' │
│ 3 │ 3.0 │ 'c' │
│ 4 │ │ 'd' │
│ 5 │ 5.0 │ 'e' │
│ 6 │ NaN │ 'f' │
Expected output:
│ Row │ x │ y │
│ │ Any │ Char │
├─────┼─────┼──────┤
│ 1 │ 1.0 │ 'a' │
│ 2 │ 3.0 │ 'c' │
│ 3 │ 5.0 │ 'e' │
What I have tried so far, based on my knowledge of Julia:
df.x = replace(df.x, NaN=>"something", missing=>"something", nothing=>"something")
print(df[df."x".!="something", :])
My code works as I expected, but I feel it is an inefficient way of solving this issue.
Is there a dedicated method to deal with nothing and NaN?
You can do e.g. this:
julia> df = DataFrame(x=[1,missing,3,nothing,5,NaN], y='a':'f')
6×2 DataFrame
│ Row │ x │ y │
│ │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1 │ 1.0 │ 'a' │
│ 2 │ missing │ 'b' │
│ 3 │ 3.0 │ 'c' │
│ 4 │ │ 'd' │
│ 5 │ 5.0 │ 'e' │
│ 6 │ NaN │ 'f' │
julia> filter(:x => x -> !any(f -> f(x), (ismissing, isnothing, isnan)), df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1 │ 1.0 │ 'a' │
│ 2 │ 3.0 │ 'c' │
│ 3 │ 5.0 │ 'e' │
Note that the order of checks matters here: isnan should be last, because otherwise the check will fail for a missing or nothing value (isnan(missing) returns missing rather than a Bool, and isnan has no method for nothing).
You could also have written it more directly as:
julia> filter(:x => x -> !(ismissing(x) || isnothing(x) || isnan(x)), df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1 │ 1.0 │ 'a' │
│ 2 │ 3.0 │ 'c' │
│ 3 │ 5.0 │ 'e' │
but I felt that the example with any is more extensible (you can then store the list of predicates to check in a variable).
The reason why DataFrames.jl only provides a function for removing missing is that missing is normally considered a valid value that you nevertheless often want to remove in data science pipelines.
Normally in Julia when you see nothing or NaN you probably want to handle them in a different way than missing as they most likely signal there is some error in the data or in processing of data (as opposed to missing which signals that the data was just not collected).
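The remark about storing the predicates in a variable can be sketched as follows, on a plain vector instead of a data frame column. The names bad_checks, isbad and xs are mine. any stops at the first predicate that returns true, which is why isnan is only reached for values that are neither missing nor nothing.

```julia
# Predicates that mark a value as unwanted. isnan stays last because it
# only gives a Bool for numbers.
bad_checks = (ismissing, isnothing, isnan)

# true as soon as one predicate matches; later predicates are not called.
isbad(x) = any(check -> check(x), bad_checks)

# The x column from the question, as a plain vector.
xs = Any[1.0, missing, 3.0, nothing, 5.0, NaN]

# Keep only the values for which no predicate matches.
kept = filter(!isbad, xs)
```

Adding another condition then only means adding another function to bad_checks.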

Cumulative Returns

In R we can do:
cum.ret <- cumprod(1 + df$rets) - 1
I want to do the same thing in Julia. Here is some dummy data:
# Dummy Data
df = DataFrame(a = 1:10, b = 10*rand(10), Close = 10 * rand(10))
# Calculate Returns
Close = df[:Close]
Close = convert(Array, Close)
df[:Close_Rets] = [NaN; (Close[2:end] ./ Close[1:(end-1)] - 1)]
# Calculate Cumulative Returns
df[:Cum_Ret] = cumprod(((1 .+ df[:Close_Rets])-1),2)
With the output:
julia> head(df)
6×5 DataFrames.DataFrame
│ Row │ a │ b │ Close │ Close_Rets │ Cum_Ret │
├─────┼───┼─────────┼──────────┼────────────┼───────────┤
│ 1 │ 1 │ 6.15507 │ 3.6363 │ NaN │ NaN │
│ 2 │ 2 │ 7.73259 │ 0.98378 │ -0.729456 │ -0.729456 │
│ 3 │ 3 │ 3.64926 │ 7.94633 │ 7.07735 │ 7.07735 │
│ 4 │ 4 │ 5.15762 │ 0.744905 │ -0.906258 │ -0.906258 │
│ 5 │ 5 │ 9.49532 │ 8.51811 │ 10.4352 │ 10.4352 │
│ 6 │ 6 │ 6.14604 │ 5.02165 │ -0.410473 │ -0.410473 │
Is there any way to make this work?
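One possible reading of the problem, offered as a sketch rather than a confirmed answer: in the code above the 1 is subtracted inside the call to cumprod, so the product runs over the plain returns, and the leading NaN would propagate through any cumulative product. A version that mirrors the R expression on a plain vector (the prices are made up) could look like this:

```julia
# Hypothetical closing prices.
prices = [100.0, 110.0, 99.0, 108.9]

# Simple period returns: one fewer element than prices, so no NaN is needed.
rets = prices[2:end] ./ prices[1:end-1] .- 1

# Cumulative return: take the running product of (1 + r) first,
# then subtract 1 from the result, as in cumprod(1 + rets) - 1 in R.
cum_ret = cumprod(1 .+ rets) .- 1
```

The last element of cum_ret agrees with the total return prices[end] / prices[1] - 1, which is a quick way to check the calculation.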

Multiply two data frame columns

So I tried this:
df[:new_col] = (df[:col_one ] .* [df[:col_two]])
It produces a wild result.
I then thought to iterate row-wise by accessing the data frame by index:
v = Float64[]
for i in 1:nrow(df)
z = df[[i],[:col_one]] * df[[i],[:col_two]]
append!(v,z)
end
This however does not work. Any ideas?
What are my options from here? Pull the data from a data frame and make a vector?
** Update **
df = DataFrame(a = 1:10, b = 10*rand(10), c = 10 * rand(10))
df[:new_d] = df[:b] .* df[:c]
For output:
julia> head(df)
6×4 DataFrames.DataFrame
│ Row │ a │ b │ c │ new_d │
├─────┼───┼─────────┼─────────┼─────────┤
│ 1 │ 1 │ 6.67916 │ 8.38096 │ 55.9778 │
│ 2 │ 2 │ 7.50056 │ 5.26593 │ 39.4974 │
│ 3 │ 3 │ 7.76419 │ 3.54361 │ 27.5133 │
│ 4 │ 4 │ 2.86521 │ 8.41335 │ 24.1061 │
│ 5 │ 5 │ 3.7417 │ 8.10884 │ 30.3409 │
│ 6 │ 6 │ 7.52014 │ 2.61603 │ 19.6729 │
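A likely explanation for the wild result of the first attempt, sketched on plain vectors with made-up values: wrapping the second column in [ ] creates a one-element vector whose only element is the whole column, so broadcasting multiplies every entry of the first column by the entire second column.

```julia
# Hypothetical stand-ins for df[:col_one] and df[:col_two].
col_one = [1.0, 2.0, 3.0]
col_two = [10.0, 20.0, 30.0]

# Element-wise product: one number per row.
elementwise = col_one .* col_two

# With the extra brackets each row becomes a whole vector,
# namely col_one[i] times all of col_two.
wrapped = col_one .* [col_two]
```

Dropping the brackets, as in the update above, gives the intended element-wise product.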

Resources