Julia: Within data frame convert missing to 0

In R we can convert NA to 0 with:
df[is.na(df)] <- 0
This works for single columns:
df[ismissing.(df[:col]), :col] = 0
Is there a way to do this for the full df?

I don't think there's such a function in DataFrames.jl yet.
But you can hack your way around it by combining colwise and recode. I'm also providing a reproducible example here, in case someone wants to iterate on this answer:
julia> using DataFrames
julia> df = DataFrame(a = [missing, 5, 5],
b = [1, missing, missing])
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼─────────┼─────────┤
│ 1 │ missing │ 1 │
│ 2 │ 5 │ missing │
│ 3 │ 5 │ missing │
julia> DataFrame(colwise(col -> recode(col, missing=>0), df), names(df))
3×2 DataFrames.DataFrame
│ Row │ a │ b │
├─────┼───┼───┤
│ 1 │ 0 │ 1 │
│ 2 │ 5 │ 0 │
│ 3 │ 5 │ 0 │
This is a bit ugly as you have to reassign the dataframe column names.

Maybe a simpler way to convert all missing values in a DataFrame is to use an array comprehension over the column names:
[df[ismissing.(df[i]), i] = 0 for i in names(df)]
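Both answers above were written against an older DataFrames.jl. The following is a minimal sketch of the same idea, assuming a DataFrames.jl 1.x release where data frames support broadcasting and `mapcols` is available; the names `filled` and `filled_cols` are only illustrative.

```julia
using DataFrames

df = DataFrame(a = [missing, 5, 5], b = [1, missing, missing])

# Broadcast coalesce over every cell; this returns a new data frame
filled = coalesce.(df, 0)

# Equivalent column-by-column form that keeps the column names automatically
filled_cols = mapcols(col -> coalesce.(col, 0), df)

# Replace each column of df itself with its filled version
for name in names(df)
    df[!, name] = coalesce.(df[!, name], 0)
end
```

`coalesce` returns its first non-missing argument, so existing values are left untouched and only `missing` entries become 0.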

Related

What is the meaning of the exclamation mark in indexing a Julia DataFrame?

I thought that the exclamation mark ! is the symbol for the logical NOT operator. Now, learning indexing in the DataFrames package, I came across this: data[!,:Treatment]. It seems to do the same thing as the familiar colon symbol :
data[:,:Treatment]==data[!,:Treatment] is true.
Why this redundancy then?
! in indexing is specific to DataFrames, and signals that you want a reference to the underlying vector storing the data, rather than a copy of it. You can read all about it in the indexing section of the DataFrames.jl documentation. In your example they are both == because all values are identical, but they are not === since data[:, :Treatment] gives you a copy of the underlying data.
Example:
julia> using DataFrames
julia> df = DataFrame(y = [1, 2, 3]);
julia> df[:, :y] == df[!, :y] # true because all values are equal
true
julia> df[:, :y] === df[!, :y] # false because they are not the same vector
false
Quoting the documentation of DataFrames.jl:
Columns can be directly (i.e. without copying) accessed via df.col or df[!, :col]. [...] Since df[!, :col] does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. To get a copy of the column use df[:, :col]: changing the vector returned by this syntax does not change df.
An example might make this clearer:
julia> using DataFrames
julia> df = DataFrame(x = rand(5), y=rand(5))
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `a` is a copy of `df.x`: modifying it will not affect `df`
julia> a = df[:, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> a[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `b` is the vector stored in `df` itself (not a copy): any change made to it will be reflected in df
julia> b = df[!, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> b[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 1.0 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
Note that, since the indexing with ! does not involve any data copy, it will generally be more efficient.
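The same distinction shows up on the left-hand side of an assignment. The sketch below assumes DataFrames.jl 1.x semantics, where `df[:, :y] = v` writes into the existing column vector and `df[!, :y] = v` replaces the stored column; the variable names are only illustrative.

```julia
using DataFrames

df = DataFrame(y = [1, 2, 3])
col = df.y                        # the vector that df currently stores

# `:` assigns in place: the existing vector is reused and overwritten
df[:, :y] = [10, 20, 30]
same_after_colon = col === df.y

# `!` swaps in a new vector, so even the element type may change
df[!, :y] = [1.5, 2.5, 3.5]
same_after_bang = col === df.y
```

After the second assignment `col` still holds the old integers, because it is no longer the vector stored in `df`.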

How to append zero values for each DateTime x-axis on a dataframe in julia

I want to plot some data whose x-axis is time. For every half-hour slot with no observation, I would like to fill in zero.
using CSV, DataFrames, Dates
s="ts, v
2020-01-01T01:00:00, 3
2020-01-01T04:00:00, 6
2020-01-01T05:00:00, 1"
d=CSV.read(IOBuffer(s))
I expect to expand d into something like d2:
s2="ts, v
2020-01-01T01:00:00, 3
2020-01-01T01:30:00, 0
2020-01-01T02:00:00, 0
2020-01-01T02:30:00, 0
2020-01-01T03:00:00, 0
2020-01-01T03:30:00, 0
2020-01-01T04:00:00, 6
2020-01-01T04:30:00, 0
2020-01-01T05:00:00, 1"
d2=CSV.read(IOBuffer(s2))
I would probably do the following:
# Create half-hourly data frame with zeros from first to last observation
julia> df = DataFrame(ts = minimum(d.ts):Minute(30):maximum(d.ts), v_filled = 0);
# Join the existing observations onto this dataframe
julia> df = join(df, d, on = :ts, kind = :left);
# Replace zeros with observations where available
julia> df[.!ismissing.(df.v), :v_filled] = df[.!ismissing.(df.v), :v];
julia> df
9×3 DataFrame
│ Row │ ts │ v_filled │ v │
│ │ DateTime │ Int64 │ Int64⍰ │
├─────┼─────────────────────┼──────────┼─────────┤
│ 1 │ 2020-01-01T01:00:00 │ 3 │ 3 │
│ 2 │ 2020-01-01T01:30:00 │ 0 │ missing │
│ 3 │ 2020-01-01T02:00:00 │ 0 │ missing │
│ 4 │ 2020-01-01T02:30:00 │ 0 │ missing │
│ 5 │ 2020-01-01T03:00:00 │ 0 │ missing │
│ 6 │ 2020-01-01T03:30:00 │ 0 │ missing │
│ 7 │ 2020-01-01T04:00:00 │ 6 │ 6 │
│ 8 │ 2020-01-01T04:30:00 │ 0 │ missing │
│ 9 │ 2020-01-01T05:00:00 │ 1 │ 1 │
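On newer DataFrames.jl releases the same approach can be written with `leftjoin` and `coalesce`. This is a sketch under the assumption of DataFrames.jl 1.x; the data frame is built directly rather than parsed with CSV, and `grid` and `filled` are illustrative names.

```julia
using DataFrames, Dates

d = DataFrame(ts = DateTime.(["2020-01-01T01:00:00",
                              "2020-01-01T04:00:00",
                              "2020-01-01T05:00:00"]),
              v = [3, 6, 1])

# Half-hourly grid from the first to the last observation
grid = DataFrame(ts = collect(minimum(d.ts):Minute(30):maximum(d.ts)))

# Left join keeps every grid row; slots without an observation get missing
filled = leftjoin(grid, d, on = :ts)

# Turn the missing slots into zeros and restore chronological order
filled.v = coalesce.(filled.v, 0)
sort!(filled, :ts)
```

The explicit `sort!` is there because the row order of a join result should not be relied on.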

Mutate DataFrames in Julia

I am looking for a function that works like by but doesn't collapse my DataFrame. In R I would use dplyr's group_by(b) %>% mutate(x1 = sum(a)). I don't want to lose information from the table, such as the values in column :c.
mydf = DataFrame(a = 1:4, b = repeat(1:2,2), c=4:-1:1)
bypreserve(mydf, :b, x -> sum(x.a))
│ Row │ a │ b │ c │ x1 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 4 │ 4 │
│ 2 │ 2 │ 2 │ 3 │ 6 │
│ 3 │ 3 │ 1 │ 2 │ 4 │
│ 4 │ 4 │ 2 │ 1 │ 6 │
Adding this functionality is being discussed, but I would say that it will take several months to ship. The general idea is to let select accept a groupby keyword argument, and also to add a transform function that works like select but preserves the columns of the source data frame.
For now the solution is to use join after by:
join(mydf, by(mydf, :b, x1 = :a => sum), on=:b)
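The `transform` function mentioned above did land in later releases. A minimal sketch, assuming DataFrames.jl 1.x where `transform` accepts a grouped data frame (`out` is an illustrative name):

```julia
using DataFrames

mydf = DataFrame(a = 1:4, b = repeat(1:2, 2), c = 4:-1:1)

# Per-group sum of :a, broadcast back to every row of the group;
# all original columns and the original row order are kept
out = transform(groupby(mydf, :b), :a => sum => :x1)
```

This plays the role of dplyr's group_by followed by mutate: the result has the same number of rows as `mydf`, plus the new column `x1`.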

Build dataframe from matrix - specify column types

OK, let's say I have a series of arrays:
data_one = ["dog","cat"]
data_two = [1,2]
data_three = ["1/1/2018","1/2/2018"]
I build them into a matrix
m = hcat(data_one,data_two,data_three)
and convert to a df
df = DataFrame(m)
showcols(df)
for output:
julia> showcols(df)
3×5 DataFrames.DataFrame
│ Row │ variable │ eltype │ nmissing │ first │ last │
├─────┼──────────┼────────┼──────────┼──────────┼──────────┤
│ 1 │ x1 │ Any │ 0 │ dog │ cat │
│ 2 │ x2 │ Any │ 0 │ 1 │ 2 │
│ 3 │ x3 │ Any │ 0 │ 1/1/2018 │ 1/2/2018 │
When I build this data frame, how can I specify the type of each column?
col1 should be String
col2 should be Int
col3 should be String
You can do it only indirectly through the following DataFrame constructor (of course you could pass [String, Int, String] as a variable here):
DataFrame([([String, Int, String][i]).(m[:,i]) for i in 1:size(m, 2)])
and if you want to use automatic detection of column type you can use:
DataFrame([[v for v in m[:,i]] for i in 1:size(m, 2)])
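On recent DataFrames.jl versions the vector-of-vectors constructor expects column names or `:auto` as a second argument. The sketch below assumes DataFrames.jl 1.x; `types`, `typed`, and `detected` are illustrative names.

```julia
using DataFrames

m = hcat(["dog", "cat"], [1, 2], ["1/1/2018", "1/2/2018"])  # Matrix{Any}

# Explicit types: convert each matrix column with the matching constructor
types = [String, Int, String]
typed = DataFrame([T.(m[:, i]) for (i, T) in enumerate(types)], :auto)

# Automatic detection: broadcasting identity narrows each column's eltype
detected = DataFrame([identity.(m[:, i]) for i in axes(m, 2)], :auto)
```

With `:auto` the columns are named x1, x2, x3, matching the names shown in the question.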

Adding rows to a dataframe with pre-allocated memory?

Let's say I have a pre-sized dataframe and I want to assign values to every row. (Therefore push! and append! are out of the question.)
length = 10
df = DataFrame(id = Array(Int64,length),value = Array(String,length))
for n in 1:10
df[n,:id] = n
df[n,:value] = "random text"
end
The above code shows how to do that cell by cell for each iterated row.
Is there a solution to add an entire row at once for each iteration?
Because
for n in 1:10
df[n] = [n "random text"]
end
throws a wrong type exception.
To access a row the syntax is [row,:] rather than just row.
Also you'll need to convert the row to a DataFrame first.
for n in 1:10
df[n,:] = DataFrame([n "random text2"])
end
You can roll your own function to set a row quite easily:
julia> function setrow!(df, rowi, val)
for j in eachindex(val)
df[rowi, j] = val[j]
end
df
end
setrow! (generic function with 1 method)
julia> setrow!(df, 1, [1, "a"])
10×2 DataFrames.DataFrame
│ Row │ id │ value │
├─────┼─────────────────┼──────────┤
│ 1 │ 1 │ "a" │
│ 2 │ 140525709817424 │ "#undef" │
│ 3 │ 140525709817488 │ "#undef" │
│ 4 │ 140525709817072 │ "#undef" │
│ 5 │ 140525709817104 │ "#undef" │
│ 6 │ 140525709817136 │ "#undef" │
│ 7 │ 140525709817168 │ "#undef" │
│ 8 │ 140525709817200 │ "#undef" │
│ 9 │ 140525709817232 │ "#undef" │
│ 10 │ 0 │ "#undef" │
Ideally, you might be able to use the broadcasting assignment syntax:
df[2, :] .= [2, "b"]
But that appears to be not implemented (perhaps for good reason, I'm not sure).
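For readers on a newer release: the sketch below assumes DataFrames.jl 1.x, where assigning a tuple or vector to `df[row, :]` sets a whole row in place. The columns are pre-filled with placeholder values instead of uninitialized memory, which is a choice made for this example only.

```julia
using DataFrames

n = 10
# Pre-allocate both columns with placeholder values
df = DataFrame(id = zeros(Int, n), value = fill("", n))

for i in 1:n
    # One assignment per row; the tuple must have one element per column
    df[i, :] = (i, "random text")
end
```

A tuple is used rather than `[i, "random text"]` so that the row values are not first collected into a `Vector{Any}`.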

Resources