How to extract particular rows from a data frame in Julia? - julia

I want to extract the 3rd and 7th row of a data frame in Julia. The MWE is:
using DataFrames
my_data = DataFrame(A = 1:10, B = 16:25);
my_data
10×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 16 │
│ 2 │ 2 │ 17 │
│ 3 │ 3 │ 18 │
│ 4 │ 4 │ 19 │
│ 5 │ 5 │ 20 │
│ 6 │ 6 │ 21 │
│ 7 │ 7 │ 22 │
│ 8 │ 8 │ 23 │
│ 9 │ 9 │ 24 │
│ 10 │ 10 │ 25 │

This should give you the expected output:
using DataFrames
my_data = DataFrame(A = 1:10, B = 16:25);
my_data;
my_data[[3, 7], :]
2×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 3 │ 18 │
│ 2 │ 7 │ 22 │

The great thing about Julia is that you do not need to materialize the result, which saves the memory and time otherwise spent copying the data. Hence, if you need a subrange of any array-like structure, it is usually better to use @view than to materialize it directly:
julia> @view my_data[[3, 7], :]
2×2 SubDataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 3 │ 18 │
│ 2 │ 7 │ 22 │
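Now the performance testing.
using Statistics  # provides mean
function submean1(df)
    d = df[[3, 7], :]        # copies rows 3 and 7 into a new DataFrame
    mean(d.A)
end
function submean2(df)
    d = @view df[[3, 7], :]  # SubDataFrame, no copy of the column data
    mean(d.A)
end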
And tests:
julia> using BenchmarkTools
julia> @btime submean1($my_data)
689.262 ns (19 allocations: 1.38 KiB)
5.0
julia> @btime submean2($my_data)
582.315 ns (9 allocations: 288 bytes)
5.0
Even in this simplistic example @view is about 15% faster and allocates nearly five times less memory (288 bytes versus 1.38 KiB). Of course, sometimes you do want to copy the data, but the rule of thumb is not to materialize.
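The difference between the two forms can also be seen without benchmarking. The following is a minimal sketch, assuming a recent DataFrames.jl release; the variable names copied and viewed are only illustrative. It shows that indexing produces an independent copy, while a view writes through to the parent data frame.

```julia
using DataFrames

df = DataFrame(A = 1:10, B = 16:25)

copied = df[[3, 7], :]        # new DataFrame, column data is copied
viewed = @view df[[3, 7], :]  # SubDataFrame that refers back to df

copied[1, :A] = 100   # changes only the copy
viewed[2, :B] = -1    # writes through to row 7 of df

println(df[3, :A])    # row 3 of the parent is unaffected by the copy
println(df[7, :B])    # row 7 of the parent was changed through the view
```

This is also the main caveat of views: if the parent must stay untouched, take a copy instead.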

Related

Replacing missing values in Julia

I have a data frame that contains some missing values. I want to replace all the missing values with the mean of the LoanAmount column:
df[ismissing.(df.LoanAmount),:LoanAmount]= floor(mean(skipmissing(df.LoanAmount)))
but when I run the above code I get
MethodError: no method matching setindex!(::DataFrame, ::Float64, ::BitArray{1}, ::Symbol)
Use skipmissing e.g.:
mean(skipmissing(df.LoanAmount))
Answer to the second, edited question: you should broadcast the assignment using the dot operator (.) as in the example below:
julia> df = DataFrame(col=rand([missing;1:3],10))
10×1 DataFrame
│ Row │ col │
│ │ Int64? │
├─────┼─────────┤
│ 1 │ missing │
│ 2 │ 3 │
│ 3 │ 2 │
│ 4 │ 2 │
│ 5 │ missing │
│ 6 │ missing │
│ 7 │ missing │
│ 8 │ 3 │
│ 9 │ 1 │
│ 10 │ 3 │
julia> df[ismissing.(df.col),:col] .= floor(mean(skipmissing(df.col)));
julia> df
10×1 DataFrame
│ Row │ col │
│ │ Int64? │
├─────┼────────┤
│ 1 │ 2 │
│ 2 │ 3 │
│ 3 │ 2 │
│ 4 │ 2 │
│ 5 │ 2 │
│ 6 │ 2 │
│ 7 │ 2 │
│ 8 │ 3 │
│ 9 │ 1 │
│ 10 │ 3 │
Impute.jl
Yet another option is to use Impute.jl, as suggested by Bogumil:
Impute.fill(df;value=(x)->floor(mean(x)))
I also found this approach.
To replace missing values with the mean:
replace!(df.col,missing => floor(mean(skipmissing(df[!,:col]))))
To replace missing values with the mode (mode comes from StatsBase.jl):
replace!(df.col,missing => mode(skipmissing(df[!,:col])))
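The answers above all reduce to the same two steps: compute a fill value while skipping missings, then substitute it wherever a value is missing. As a minimal sketch of the same idea, assuming a recent DataFrames.jl release, coalesce can be broadcast over the column; the sample data and the name fill_value are made up for illustration.

```julia
using DataFrames, Statistics

df = DataFrame(col = [missing, 3, 2, 2, missing, 1])

# Mean of the non-missing entries, rounded down to an integer.
fill_value = floor(Int, mean(skipmissing(df.col)))

# coalesce returns its first non-missing argument, so broadcasting it
# replaces every missing entry and leaves the other values alone.
df.col = coalesce.(df.col, fill_value)

println(df.col)
println(eltype(df.col))
```

Unlike the in-place replace!, this builds a new column whose element type no longer allows missing.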

How to insert a new row in julia at specific index [duplicate]

Is there a way to add a row to an existing dataframe at a specific index?
E.g. you have a data frame with 3 rows and 1 column
df = DataFrame(x = [2,3,4])
X
2
3
4
any way to do the following:
insert!(df, 1, [1])
in order to get
X
1
2
3
4
I know that I could probably concatenate two data frames, df = [df1; df2], but I was hoping to avoid throwing away and reallocating a large data frame whenever I want to insert a row.
In DataFrames 0.21.4 just write (I give two options: one, with broadcasting, is short but creates a temporary object; the other, with foreach is longer to write but allocates a bit less):
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> insert!.(eachcol(df), 2, [4, "d"]) # creates a temporary object but is terse
2-element Array{Array{T,1} where T,1}:
[1, 4, 2, 3]
["a", "d", "b", "c"]
julia> df
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
julia> foreach((c, v) -> insert!(c, 2, v), eachcol(df), [4, "d"]) # does not create a temporary object
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 4 │ d │
│ 4 │ 2 │ b │
│ 5 │ 3 │ c │
Note that the above operation is not atomic: it may corrupt your data frame if the type of the element you want to add does not match the element type allowed in the column, because earlier columns will already have been extended when the failure occurs.
If you want a safe operation that provides automatic promotion, use this:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> [view(df, 1:1, :); DataFrame(names(df) .=> ['a', 'b']); view(df, 3:3, :)]
3×2 DataFrame
│ Row │ x │ y │
│ │ Any │ Any │
├─────┼─────┼─────┤
│ 1 │ 1 │ a │
│ 2 │ 'a' │ 'b' │
│ 3 │ 3 │ c │
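(It is a bit slower though and creates a new data frame. Note also that this particular example replaces the original row 2 rather than inserting before it, because the last piece is view(df, 3:3, :); use view(df, 2:3, :) there to keep every original row.)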
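As a minimal sketch of the concatenation approach written as a true insertion, assuming a recent DataFrames.jl release: the rows before the insertion point, the new row, and the remaining rows are concatenated with vcat. The names new_row, pos and result are only illustrative.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])
new_row = DataFrame(x = [4], y = ["d"])
pos = 2  # the new row should end up at this index

# Views avoid copying the two halves before vcat builds the result.
result = vcat(view(df, 1:pos-1, :), new_row, view(df, pos:nrow(df), :))

println(result)
```

Because vcat allocates a new data frame, df itself is left unchanged, and nothing is modified if the concatenation fails. Recent DataFrames.jl releases also document an insert! method that works on a data frame directly; check the documentation of the installed version before relying on it.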
Deprecated
The original answer is here. It was valid for Julia before 1.0 release (and DataFrames.jl version that was compatible with it).
I guess you want to do it in place. Then you can use insert! function like this:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> foreach((v,n) -> insert!(df[n], 2, v), [4, "d"], names(df))
julia> df
4×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
Of course you have to make sure that you have the right number of columns in the added collection.
If you accept using unexported internal structure of a DataFrame you can do it even simpler:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> insert!.(df.columns, 2, [4, "d"])
2-element Array{Array{T,1} where T,1}:
[1, 4, 2, 3]
String["a", "d", "b", "c"]
julia> df
4×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │

How to append zero values for each DateTime x-axis on a dataframe in julia

I want to plot some data whose x-axis is time. For every half-hour that has no observation, I want to fill in zero.
using CSV, DataFrames, Dates
s="ts, v
2020-01-01T01:00:00, 3
2020-01-01T04:00:00, 6
2020-01-01T05:00:00, 1"
d=CSV.read(IOBuffer(s))
I expect to expand d so that it looks like d2:
s2="ts, v
2020-01-01T01:00:00, 3
2020-01-01T01:30:00, 0
2020-01-01T02:00:00, 0
2020-01-01T02:30:00, 0
2020-01-01T03:00:00, 0
2020-01-01T03:30:00, 0
2020-01-01T04:00:00, 6
2020-01-01T04:30:00, 0
2020-01-01T05:00:00, 1"
d2=CSV.read(IOBuffer(s2))
I would probably do the following:
# Create half-hourly data frame with zeros from first to last observation
julia> df = DataFrame(ts = minimum(d.ts):Minute(30):maximum(d.ts), v_filled = 0);
# Join the existing observations onto this dataframe
julia> df = join(df, d, on = :ts, kind = :left);
# Replace zeros with observations where available
julia> df[.!ismissing.(df.v), :v_filled] = df[.!ismissing.(df.v), :v];
julia> df
9×3 DataFrame
│ Row │ ts │ v_filled │ v │
│ │ DateTime │ Int64 │ Int64⍰ │
├─────┼─────────────────────┼──────────┼─────────┤
│ 1 │ 2020-01-01T01:00:00 │ 3 │ 3 │
│ 2 │ 2020-01-01T01:30:00 │ 0 │ missing │
│ 3 │ 2020-01-01T02:00:00 │ 0 │ missing │
│ 4 │ 2020-01-01T02:30:00 │ 0 │ missing │
│ 5 │ 2020-01-01T03:00:00 │ 0 │ missing │
│ 6 │ 2020-01-01T03:30:00 │ 0 │ missing │
│ 7 │ 2020-01-01T04:00:00 │ 6 │ 6 │
│ 8 │ 2020-01-01T04:30:00 │ 0 │ missing │
│ 9 │ 2020-01-01T05:00:00 │ 1 │ 1 │
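The join call above uses the API of older DataFrames.jl releases. A minimal sketch of the same idea for a recent release, assuming leftjoin and coalesce are available, is given below. The data frame is built by hand instead of through CSV.read to keep the example self-contained, and the names grid and filled are only illustrative.

```julia
using DataFrames, Dates

d = DataFrame(
    ts = DateTime.(["2020-01-01T01:00:00",
                    "2020-01-01T04:00:00",
                    "2020-01-01T05:00:00"]),
    v = [3, 6, 1],
)

# Half-hourly grid from the first to the last observation.
grid = DataFrame(ts = minimum(d.ts):Minute(30):maximum(d.ts))

# Attach the observations; grid rows without a match get missing in v.
filled = leftjoin(grid, d, on = :ts)

# Replace the missings with zero and restore chronological order,
# since the row order after a join should not be relied on.
filled.v = coalesce.(filled.v, 0)
sort!(filled, :ts)

println(filled)
```

This produces a single v column directly, so the separate v_filled column is not needed.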

Cumulative Returns

In R we can do:
cum.ret <- cumprod(1 + df$rets) - 1
I want to do the same thing with Julia. Here is some dummy data:
# Dummy Data
df = DataFrame(a = 1:10, b = 10*rand(10), Close = 10 * rand(10))
# Calculate Returns
Close = df[:Close]
Close = convert(Array, Close)
df[:Close_Rets] = [NaN; (Close[2:end] ./ Close[1:(end-1)] - 1)]
# Calculate Cumulative Returns
df[:Cum_Ret] = cumprod(((1 .+ df[:Close_Rets])-1),2)
With the output:
julia> head(df)
6×5 DataFrames.DataFrame
│ Row │ a │ b │ Close │ Close_Rets │ Cum_Ret │
├─────┼───┼─────────┼──────────┼────────────┼───────────┤
│ 1 │ 1 │ 6.15507 │ 3.6363 │ NaN │ NaN │
│ 2 │ 2 │ 7.73259 │ 0.98378 │ -0.729456 │ -0.729456 │
│ 3 │ 3 │ 3.64926 │ 7.94633 │ 7.07735 │ 7.07735 │
│ 4 │ 4 │ 5.15762 │ 0.744905 │ -0.906258 │ -0.906258 │
│ 5 │ 5 │ 9.49532 │ 8.51811 │ 10.4352 │ 10.4352 │
│ 6 │ 6 │ 6.14604 │ 5.02165 │ -0.410473 │ -0.410473 │
Is there any way to make this work?
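In the attempt above, 1 is subtracted before cumprod is applied, so the cumulative product runs over the raw returns and Cum_Ret simply repeats Close_Rets. A minimal sketch that mirrors the R expression cumprod(1 + rets) - 1 follows, assuming a recent DataFrames.jl release with the df.col syntax. The prices are made up, and setting the first return to 0.0 instead of NaN is an assumption made here, because NaN would otherwise propagate through the whole cumulative product.

```julia
using DataFrames

df = DataFrame(Close = [100.0, 110.0, 99.0, 108.9])
prices = df.Close

# Simple returns; the first row has no previous price, so use 0.0.
df.Close_Rets = [0.0; prices[2:end] ./ prices[1:end-1] .- 1]

# Compound the growth factors first, then subtract 1 at the end.
df.Cum_Ret = cumprod(1 .+ df.Close_Rets) .- 1

println(df)
```

As a sanity check, each cumulative return should equal the current price divided by the first price, minus 1.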

Multiply two data frame columns

So I tried this:
df[:new_col] = (df[:col_one ] .* [df[:col_two]])
It produces a wild result.
I then thought to iterate row-wise by accessing the data frame index:
v = Float64[]
for i in 1:nrow(df)
z = df[[i],[:col_one]] * df[[i],[:col_two]]
append!(v,z)
end
This, however, does not work. Any ideas?
What are my options from here? Should I pull the data out of the data frame and make a vector?
** Update **
df = DataFrame(a = 1:10, b = 10*rand(10), c = 10 * rand(10))
df[:new_d] = df[:b] .* df[:c]
For output:
julia> head(df)
6×4 DataFrames.DataFrame
│ Row │ a │ b │ c │ new_d │
├─────┼───┼─────────┼─────────┼─────────┤
│ 1 │ 1 │ 6.67916 │ 8.38096 │ 55.9778 │
│ 2 │ 2 │ 7.50056 │ 5.26593 │ 39.4974 │
│ 3 │ 3 │ 7.76419 │ 3.54361 │ 27.5133 │
│ 4 │ 4 │ 2.86521 │ 8.41335 │ 24.1061 │
│ 5 │ 5 │ 3.7417 │ 8.10884 │ 30.3409 │
│ 6 │ 6 │ 7.52014 │ 2.61603 │ 19.6729 │
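The first attempt went wrong because of the extra square brackets around df[:col_two], which wrap the whole column in a one-element vector before broadcasting. The update uses the df[:b] indexing of older DataFrames.jl releases. As a minimal sketch for a recent release, assuming the df.col syntax and the transform! function with ByRow are available, the element-wise product can be written in either of the two ways below; the column name new_d2 is only illustrative.

```julia
using DataFrames

df = DataFrame(a = 1:3, b = [1.0, 2.0, 3.0], c = [4.0, 5.0, 6.0])

# Broadcast the multiplication over the two column vectors.
df.new_d = df.b .* df.c

# The same product written as a row-wise transformation.
transform!(df, [:b, :c] => ByRow(*) => :new_d2)

println(df)
```

Both forms work on whole columns, so no explicit loop over rows is needed.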
