Is there a names() equivalent in Julia?

I am transferring a script written in R to Julia, and one of the R functions is the names() function. Is there a synonymous function in Julia?

DataFrames
In Julia, the names function returns the column names of a DataFrame, e.g.:
julia> using DataFrames
julia> x = DataFrame(rand(3,4))
3×4 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1 │ 0.721198 │ 0.605882 │ 0.191941 │ 0.597295 │
│ 2 │ 0.0537836 │ 0.619698 │ 0.764937 │ 0.273197 │
│ 3 │ 0.679952 │ 0.899523 │ 0.206124 │ 0.928319 │
julia> names(x)
4-element Array{Symbol,1}:
:x1
:x2
:x3
:x4
To set the column names of a DataFrame, use the names! function (in current DataFrames.jl versions names! has been removed and rename! takes its place); example continued:
julia> names!(x, [:a,:b,:c,:d])
3×4 DataFrames.DataFrame
│ Row │ a │ b │ c │ d │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1 │ 0.721198 │ 0.605882 │ 0.191941 │ 0.597295 │
│ 2 │ 0.0537836 │ 0.619698 │ 0.764937 │ 0.273197 │
│ 3 │ 0.679952 │ 0.899523 │ 0.206124 │ 0.928319 │
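For readers on a current DataFrames.jl release, here is a minimal sketch of the same workflow. It assumes DataFrames.jl 1.x, where building a data frame from a matrix needs the :auto argument, names returns strings rather than symbols, and rename! replaces names!. The variable names are illustrative only.

```julia
using DataFrames

# Build a 3×4 data frame with auto-generated column names x1..x4.
x = DataFrame(rand(3, 4), :auto)

# names returns the column names as strings in DataFrames.jl 1.x.
@show names(x)

# propertynames returns them as symbols, as the old names did.
@show propertynames(x)

# rename! sets all column names in place (the successor of names!).
rename!(x, [:a, :b, :c, :d])
@show names(x)
```

If only some columns need new names, rename! also accepts pairs such as :a => :alpha.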
Arrays
Standard arrays do not support naming their dimensions, but the NamedArrays.jl package adds this functionality. With it you can get and set the names of dimensions as well as the names of indices along each dimension. The details are at https://github.com/davidavdav/NamedArrays.jl#general-functions.

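I'm not an R expert, but I think you want fieldnames:
# `type` was renamed to `mutable struct` in Julia 1.0
mutable struct Foo
    bar::Int
end
@show fieldnames(Foo)
baz = Foo(2)
# fieldnames takes a type, so pass the type of the instance
@show fieldnames(typeof(baz))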

Related

How to remove/drop rows of nothing and NaN in Julia dataframe?

I have a df which contains nothing, NaN and missing. To remove rows which contain missing I can use dropmissing. Are there any methods to deal with NaN and nothing?
Sample df:
│ Row │ x │ y │
│ │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1 │ 1.0 │ 'a' │
│ 2 │ missing │ 'b' │
│ 3 │ 3.0 │ 'c' │
│ 4 │ │ 'd' │
│ 5 │ 5.0 │ 'e' │
│ 6 │ NaN │ 'f' │
Expected output:
│ Row │ x │ y │
│ │ Any │ Char │
├─────┼─────┼──────┤
│ 1 │ 1.0 │ 'a' │
│ 2 │ 3.0 │ 'c' │
│ 3 │ 5.0 │ 'e' │
Here is what I have tried so far, based on my knowledge of Julia:
df.x = replace(df.x, NaN=>"something", missing=>"something", nothing=>"something")
print(df[df."x".!="something", :])
My code works as I expected, but I feel it is an inefficient way of solving this issue.
Is there a dedicated method to deal with nothing and NaN?
You can do e.g. this:
julia> df = DataFrame(x=[1,missing,3,nothing,5,NaN], y='a':'f')
6×2 DataFrame
│ Row │ x │ y │
│ │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1 │ 1.0 │ 'a' │
│ 2 │ missing │ 'b' │
│ 3 │ 3.0 │ 'c' │
│ 4 │ │ 'd' │
│ 5 │ 5.0 │ 'e' │
│ 6 │ NaN │ 'f' │
julia> filter(:x => x -> !any(f -> f(x), (ismissing, isnothing, isnan)), df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1 │ 1.0 │ 'a' │
│ 2 │ 3.0 │ 'c' │
│ 3 │ 5.0 │ 'e' │
Note that the order of checks matters here: isnan must come last, because otherwise the check fails for a missing or nothing value.
You could also have written it more directly as:
julia> filter(:x => x -> !(ismissing(x) || isnothing(x) || isnan(x)), df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Union…? │ Char │
├─────┼─────────┼──────┤
│ 1 │ 1.0 │ 'a' │
│ 2 │ 3.0 │ 'c' │
│ 3 │ 5.0 │ 'e' │
but I felt that the example with any is more extensible (you can then store the list of predicates to check in a variable).
The reason DataFrames.jl only provides a function for removing missing is that missing is normally considered a valid value that you nevertheless want to remove in data science pipelines.
In Julia, when you see nothing or NaN you probably want to handle them differently from missing, as they most likely signal an error in the data or in its processing (as opposed to missing, which signals that the data was simply not collected).
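To make the point about the order of checks and the stored list of predicates concrete, here is a small sketch that needs only Base Julia. The names predicates and isbad are hypothetical helpers introduced for illustration, not part of DataFrames.jl.

```julia
# Predicates are tried left to right; `any` stops at the first one that returns true.
# isnan goes last because isnan(nothing) throws a MethodError and
# isnan(missing) returns missing rather than a Bool.
predicates = (ismissing, isnothing, isnan)

# Hypothetical helper: true when a value should be dropped.
isbad(x) = any(f -> f(x), predicates)

xs = [1, missing, 3, nothing, 5, NaN]

# Keep only the values that pass none of the predicates.
kept = filter(!isbad, xs)
@show kept
```

The same helper should plug into the data frame version as filter(:x => !isbad, df), and extending the check is then a matter of adding another predicate before isnan.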

How to get dtypes of columns in julia dataframe

How do I get the dtypes of all columns, or of particular column(s), in Julia? Specifically, what is the Julia equivalent of pandas' df.dtypes?
For example, I have a df like the one below:
│ Row │ Id │ name │ item location │
│ │ Int64 │ String │ String │
├─────┼───────┼────────┼───────────────┤
│ 1 │ 1 │ A │ xyz │
│ 2 │ 2 │ B │ abc │
│ 3 │ 3 │ C │ def │
│ 4 │ 4 │ D │ ghi │
│ 5 │ 5 │ E │ xyz │
│ 6 │ 6 │ F │ abc │
│ 7 │ 7 │ G │ def │
│ 8 │ 8 │ H │ ghi │
│ 9 │ 9 │ I │ xyz │
│ 10 │ 10 │ J │ abc │
Expected output:
{'id': Int64, 'name': String, 'item location': String}
How to get dtypes, i.e., Int64 │ String │ String in Julia?
You have specified two different expected outputs so I show here how to get both:
julia> df = DataFrame("Id" => 1, "name" => "A", "item_location" => "xyz")
1×3 DataFrame
│ Row │ Id │ name │ item_location │
│ │ Int64 │ String │ String │
├─────┼───────┼────────┼───────────────┤
│ 1 │ 1 │ A │ xyz │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
Int64
String
String
julia> Dict(names(df) .=> eltype.(eachcol(df)))
Dict{String,DataType} with 3 entries:
"Id" => Int64
"name" => String
"item_location" => String
Additionally, if you want to store the result in a DataFrame instead of a Dict, you can write (see the mapcols documentation):
julia> mapcols(eltype, df)
1×3 DataFrame
│ Row │ Id │ name │ item_location │
│ │ DataType │ DataType │ DataType │
├─────┼──────────┼──────────┼───────────────┤
│ 1 │ Int64 │ String │ String │
And if you want a NamedTuple storing this information, write (see the documentation of Tables.columntable):
julia> map(eltype, Tables.columntable(df))
(Id = Int64, name = String, item_location = String)
(in this case note that for very wide tables this might have some extra compilation cost as each time you call it you potentially get a new type of NamedTuple)
You can also use describe(df) which is a catchall for learning about the columns in your data frame.
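The question also asks about particular column(s), which the snippets above do not show directly. A minimal sketch, assuming a DataFrames.jl version that accepts string column names (as the constructor call above does); cols and subset_types are illustrative names:

```julia
using DataFrames

df = DataFrame("Id" => 1, "name" => "A", "item_location" => "xyz")

# Element type of a single column.
@show eltype(df.name)

# Element types of a chosen subset of columns, keyed by column name.
cols = ["Id", "name"]
subset_types = Dict(c => eltype(df[!, c]) for c in cols)
@show subset_types
```

Indexing with ! avoids copying the column just to inspect its element type.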

What is the meaning of the exclamation mark in indexing a Julia DataFrame?

I thought that the exclamation mark ! is the symbol for the logical NOT operator. Now, while learning indexing in the DataFrames package, I came across data[!,:Treatment], which seems to be the same as using the familiar colon symbol:
data[:,:Treatment]==data[!,:Treatment] is true.
Why this redundancy then?
! in indexing is specific to DataFrames and signals that you want a reference to the underlying vector storing the data, rather than a copy of it. The DataFrames.jl indexing documentation covers this in detail. In your example they are both == because all values are identical, but they are not === since df[:, :Treatment] gives you a copy of the underlying data.
Example:
julia> using DataFrames
julia> df = DataFrame(y = [1, 2, 3]);
julia> df[:, :y] == df[!, :y] # true because all values are equal
true
julia> df[:, :y] === df[!, :y] # false because they are not the same vector
false
Quoting the documentation of DataFrames.jl:
Columns can be directly (i.e. without copying) accessed via df.col or df[!, :col]. [...] Since df[!, :col] does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. To get a copy of the column use df[:, :col]: changing the vector returned by this syntax does not change df.
An example might make this clearer:
julia> using DataFrames
julia> df = DataFrame(x = rand(5), y=rand(5))
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `a` is a copy of `df.x`: modifying it will not affect `df`
julia> a = df[:, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> a[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `b` is the column vector stored in `df` itself (no copy): any change made to it will be reflected in `df`
julia> b = df[!, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> b[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 1.0 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
Note that, since the indexing with ! does not involve any data copy, it will generally be more efficient.

Repeat random data with Faker

I want to use Faker data for many rows. My current code only repeats whatever was generated by the Faker library at that moment:
Current output:
│ Row │ Identifier │
│ │ String │
├─────┼────────────┤
│ 1 │ 40D593 │
│ 2 │ 40D593 │
│ 3 │ 40D593 │
Desired outputs:
│ Row │ Digits │
│ │ String │
├─────┼────────┤
│ 1 │ 23K125 │
│ 2 │ 13K125 │
│ 3 │ 45K125 │
df2 = DataFrame(Identifier = repeat([Faker.bothify("##?###")], outer=[3]))
I thought I could do something like Faker.bothify("##?###") * 3. I suppose there may also be a way to apply it to a dataframe column that was already made, but I can't find a way just looking through the docs quickly.
The simple way is to use a comprehension:
df2 = DataFrame(Identifier=[Faker.bothify("##?###") for _ in 1:3])
An alternative is to use broadcasting (though a comprehension feels more natural to me in this case):
df2 = DataFrame(Identifier=Faker.bothify.(Iterators.repeated("##?###", 3)))
(I assume this is what you want.)
And this is the way to apply it to an existing column with String eltype. The operation is in-place:
julia> df = DataFrame(Identifier=Vector{String}(undef, 3))
3×1 DataFrame
│ Row │ Identifier │
│ │ String │
├─────┼────────────┤
│ 1 │ #undef │
│ 2 │ #undef │
│ 3 │ #undef │
julia> df.Identifier .= Faker.bothify.("##?###")
3-element Array{String,1}:
"12H314"
"56G992"
"23X588"
julia> df
3×1 DataFrame
│ Row │ Identifier │
│ │ String │
├─────┼────────────┤
│ 1 │ 12H314 │
│ 2 │ 56G992 │
│ 3 │ 23X588 │

Adding rows to a dataframe with pre-allocated memory?

Let's say I have a pre-sized dataframe and I want to assign values to every row (therefore push! and append! are out of the question).
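n_rows = 10  # avoid naming this `length`, which would shadow Base.length
# Vector{T}(undef, n) is the Julia 1.x replacement for Array(T, n)
df = DataFrame(id = Vector{Int64}(undef, n_rows), value = Vector{String}(undef, n_rows))
for n in 1:n_rows
    df[n, :id] = n
    df[n, :value] = "random text"
end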
The above code shows how to do that cell by cell for each iterated row.
Is there a solution to add an entire row at once for each iteration?
Because
for n in 1:10
df[n] = [n "random text"]
end
throws a wrong type exception.
To access a row the syntax is [row,:] rather than just row.
Also you'll need to convert the row to a DataFrame first.
for n in 1:10
df[n,:] = DataFrame([n "random text2"])
end
You can roll your own function to set a row quite easily:
julia> function setrow!(df, rowi, val)
for j in eachindex(val)
df[rowi, j] = val[j]
end
df
end
setrow! (generic function with 1 method)
julia> setrow!(df, 1, [1, "a"])
10×2 DataFrames.DataFrame
│ Row │ id │ value │
├─────┼─────────────────┼──────────┤
│ 1 │ 1 │ "a" │
│ 2 │ 140525709817424 │ "#undef" │
│ 3 │ 140525709817488 │ "#undef" │
│ 4 │ 140525709817072 │ "#undef" │
│ 5 │ 140525709817104 │ "#undef" │
│ 6 │ 140525709817136 │ "#undef" │
│ 7 │ 140525709817168 │ "#undef" │
│ 8 │ 140525709817200 │ "#undef" │
│ 9 │ 140525709817232 │ "#undef" │
│ 10 │ 0 │ "#undef" │
Ideally, you might be able to use the broadcasting assignment syntax:
df[2, :] .= [2, "b"]
But that does not appear to be implemented (perhaps for good reason, I'm not sure).
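As a sketch for newer setups: assuming a recent DataFrames.jl release (1.x), plain (non-broadcast) assignment to df[row, :] accepts a tuple or vector with one element per column, so a helper like setrow! may no longer be needed. The column contents and n_rows below are illustrative placeholders.

```julia
using DataFrames

n_rows = 3
# Pre-allocate both columns with placeholder values.
df = DataFrame(id = zeros(Int, n_rows), value = fill("", n_rows))

for n in 1:n_rows
    # Assign a whole row at once; the tuple must have one element per column.
    df[n, :] = (n, "random text $n")
end

@show df
```

Each element is converted to the element type of its column, so a value that cannot be converted (for example a String into the Int column) raises an error.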
