Using the combine function on a GroupedDataFrame while using regex to reference columns

As you will be able to tell from the question, I am a VERY new user of Julia, just trying to do things that I have already done in Python and stumbling a bit in the dark. What I am trying to do right now is to compute some simple statistics over multiple columns based on a certain grouping of the data. So after doing something like:
df = DataFrame(CSV.File(file));
gdf = groupby(df, :Class);
where df looks like:
df[1:3, [:Class, :V1, :V2, :V10]]
Class V1 V2 V10
Int64 Float64 Float64 Float64
1 0 -1.35981 -0.0727812 0.0907942
2 1 1.19186 0.266151 -0.166974
3 0 -1.35835 -1.34016 0.207643
...
I know I can do something like:
combine(gdf, :V1 => maximum => :v1_max, :V1 => minimum => :v1_min, nrow)
But then I saw that I could use regex to reference multiple columns and so my thought was to do something simple like:
combine(gdf, r"V[0-9]{1,2}" => maximum)
and have Julia in a single line generate the max value for all of the columns matching the regex for the grouped DataFrame.
I finally managed to do this, in what I am guessing is not a nice or efficient way, so I am looking for help improving my usage of Julia.
foo = DataFrame(Class=[0, 1])
for v in ["V$i" for i in 1:28]
foo = join(foo,
combine(gdf, v => maximum => string(v, "_max")),
combine(gdf, v => minimum => string(v, "_min")),
on=:Class)
end

Just write:
combine(gdf, names(gdf, r"V[0-9]{1,2}") .=> maximum)
(note the . in front of =>)
In this case the target column names will be automatically generated.
What I have written above is a shorthand for:
combine(gdf, [n => maximum for n in names(gdf, r"V[0-9]{1,2}")])
Another way to write it is:
combine(AsTable(r"V[0-9]{1,2}") => x -> map(maximum, x), gdf)
in which case the old column names are retained.
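If you also want to control the output names (the loop in the question used `_max` and `_min` suffixes), the same broadcasting idea can be extended with a second `.=>`. This is a minimal sketch on a small made-up data frame, not the asker's data; the variable names `cols` and `res` and the suffixes are only illustrative.

```julia
using DataFrames

# Small stand-in for the asker's data (the values are made up).
df = DataFrame(Class = [1, 1, 2], V1 = 1:3, V2 = 11:13)
gdf = groupby(df, :Class)

# Column names matching the regex, as a Vector{String}.
cols = names(df, r"V[0-9]{1,2}")

# One `source => function => target` entry per column and per statistic.
res = combine(gdf,
              cols .=> maximum .=> (cols .* "_max"),
              cols .=> minimum .=> (cols .* "_min"),
              nrow)
```

Each broadcasted expression expands to a vector with one entry per matching column, so the call scales to any number of `V` columns without a loop or a join.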
The combine syntax is very flexible. I recommend you to have a look at its docstring for all available options.
Consider the following examples:
julia> using DataFrames
julia> passthrough(x...) = (@show x; x)
passthrough (generic function with 1 method)
julia> df = DataFrame(Class=[1,1,2], V1=1:3, V2=11:13)
3×3 DataFrame
│ Row │ Class │ V1 │ V2 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 11 │
│ 2 │ 1 │ 2 │ 12 │
│ 3 │ 2 │ 3 │ 13 │
julia> gdf = groupby(df, :Class)
GroupedDataFrame with 2 groups based on key: Class
First Group (2 rows): Class = 1
│ Row │ Class │ V1 │ V2 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 1 │ 11 │
│ 2 │ 1 │ 2 │ 12 │
⋮
Last Group (1 row): Class = 2
│ Row │ Class │ V1 │ V2 │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 2 │ 3 │ 13 │
julia> combine(gdf, r"V[0-9]{1,2}" .=> passthrough)
x = ([1, 2], [11, 12])
x = ([3], [13])
2×2 DataFrame
│ Row │ Class │ V1_V2_passthrough │
│ │ Int64 │ Tuple… │
├─────┼───────┼────────────────────┤
│ 1 │ 1 │ ([1, 2], [11, 12]) │
│ 2 │ 2 │ ([3], [13]) │
julia> combine(gdf, r"V[0-9]{1,2}" => passthrough)
x = ([1, 2], [11, 12])
x = ([3], [13])
2×2 DataFrame
│ Row │ Class │ V1_V2_passthrough │
│ │ Int64 │ Tuple… │
├─────┼───────┼────────────────────┤
│ 1 │ 1 │ ([1, 2], [11, 12]) │
│ 2 │ 2 │ ([3], [13]) │
julia> combine(gdf, names(gdf, r"V[0-9]{1,2}") .=> passthrough)
x = ([1, 2],)
x = ([3],)
x = ([11, 12],)
x = ([13],)
2×3 DataFrame
│ Row │ Class │ V1_passthrough │ V2_passthrough │
│ │ Int64 │ Tuple… │ Tuple… │
├─────┼───────┼────────────────┼────────────────┤
│ 1 │ 1 │ ([1, 2],) │ ([11, 12],) │
│ 2 │ 2 │ ([3],) │ ([13],) │
julia> combine(gdf, names(gdf, r"V[0-9]{1,2}") => passthrough)
x = ([1, 2], [11, 12])
x = ([3], [13])
2×2 DataFrame
│ Row │ Class │ V1_V2_passthrough │
│ │ Int64 │ Tuple… │
├─────┼───────┼────────────────────┤
│ 1 │ 1 │ ([1, 2], [11, 12]) │
│ 2 │ 2 │ ([3], [13]) │
In particular it is crucial to understand what gets passed to combine:
julia> r"V[0-9]{1,2}" .=> passthrough
r"V[0-9]{1,2}" => passthrough
julia> r"V[0-9]{1,2}" => passthrough
r"V[0-9]{1,2}" => passthrough
julia> names(gdf, r"V[0-9]{1,2}") .=> passthrough
2-element Array{Pair{String,typeof(passthrough)},1}:
"V1" => passthrough
"V2" => passthrough
julia> names(gdf, r"V[0-9]{1,2}") => passthrough
["V1", "V2"] => passthrough
So, as you can see, it all depends on what gets passed to combine. In particular, r"V[0-9]{1,2}" .=> passthrough and r"V[0-9]{1,2}" => passthrough evaluate to exactly the same Pair, in which case passthrough is called only ONCE per group and receives the matching columns as multiple positional arguments.
On the other hand names(gdf, r"V[0-9]{1,2}") .=> passthrough makes passthrough get called for each column separately for each group.
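The practical consequence of this distinction can be sketched with ordinary functions instead of `passthrough`. The example below reuses the small data frame from above; the output name `:dot` and the variable names are assumptions made for illustration.

```julia
using DataFrames

df = DataFrame(Class = [1, 1, 2], V1 = 1:3, V2 = 11:13)
gdf = groupby(df, :Class)

# A single Pair with several source columns: the function is called once per
# group and receives the columns as separate positional arguments.
once_per_group = combine(gdf, [:V1, :V2] => ((a, b) -> sum(a .* b)) => :dot)

# A vector of Pairs (note the dot): the function is called once per column
# for each group, and each call receives a single column.
once_per_column = combine(gdf, [:V1, :V2] .=> sum)
```

The first form suits functions that need several columns at once; the second suits per-column summaries such as `maximum` or `sum`.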

Related

How to get a new column that depends of a subset of dataframe columns

My dataframe has 3 columns A, B and C and for each row only one of these columns contains a value.
I want a MERGE column that contains the values from A or B or C
using DataFrames
df = DataFrame(NAME = ["a", "b", "c"], A = [1, missing, missing], B = [missing, 2, missing], C = [missing, missing, 3])
3×4 DataFrame
│ Row │ NAME │ A │ B │ C │
│ │ String │ Int64? │ Int64? │ Int64? │
├─────┼────────┼─────────┼─────────┼─────────┤
│ 1 │ a │ 1 │ missing │ missing │
│ 2 │ b │ missing │ 2 │ missing │
│ 3 │ c │ missing │ missing │ 3 │
What is the best Julia way to get the MERGE column?
3×5 DataFrame
│ Row │ NAME │ A │ B │ C │ MERGE │
│ │ String │ Int64? │ Int64? │ Int64? │ Int64 │
├─────┼────────┼─────────┼─────────┼─────────┼───────┤
│ 1 │ a │ 1 │ missing │ missing │ 1 │
│ 2 │ b │ missing │ 2 │ missing │ 2 │
│ 3 │ c │ missing │ missing │ 3 │ 3 │
What I was able to work out so far is:
select(df, :, [:A, :B, :C] => ByRow((a,b,c) -> sum(skipmissing([a, b, c]))) => :MERGE)
What about a scenario for which there is a variable range of columns?
select(df, range => ??? => :MERGE)
You can write it like this:
julia> transform!(df, [:A, :B, :C] => ByRow(coalesce) => :MERGE)
3×5 DataFrame
│ Row │ NAME │ A │ B │ C │ MERGE │
│ │ String │ Int64? │ Int64? │ Int64? │ Int64 │
├─────┼────────┼─────────┼─────────┼─────────┼───────┤
│ 1 │ a │ 1 │ missing │ missing │ 1 │
│ 2 │ b │ missing │ 2 │ missing │ 2 │
│ 3 │ c │ missing │ missing │ 3 │ 3 │
Instead of [:A, :B, :C] you can put any selector, like All(), Between(:A, :C), 1:3 etc.
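For the variable-range case from the question, a selector can stand in for the explicit list. The sketch below uses `Between` and a regex; the regex pattern and the result names `merged` and `merged_rx` are chosen here for illustration, and `transform` (without `!`) is used so that `df` itself stays unchanged.

```julia
using DataFrames

df = DataFrame(NAME = ["a", "b", "c"],
               A = [1, missing, missing],
               B = [missing, 2, missing],
               C = [missing, missing, 3])

# Between(:A, :C) selects every column from A through C, however many there are.
merged = transform(df, Between(:A, :C) => ByRow(coalesce) => :MERGE)

# A regex selector works the same way; this one matches the names A, B and C.
merged_rx = transform(df, r"^[ABC]$" => ByRow(coalesce) => :MERGE)
```

`coalesce` returns its first non-missing argument, which is why it fits the "only one column has a value" layout without any summing.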

How to insert a new row in julia at specific index [duplicate]

Is there a way to add a row to an existing dataframe at a specific index?
E.g. you have a dataframe with 3 rows and 1 column
df = DataFrame(x = [2,3,4])
X
2
3
4
any way to do the following:
insert!(df, 1, [1])
in order to get
X
1
2
3
4
I know that I could probably concatenate two dataframes with df = [df1; df2], but I was hoping to avoid creating garbage from a large DF whenever I want to insert a row.
In DataFrames 0.21.4 just write (I give two options: one, with broadcasting, is short but creates a temporary object; the other, with foreach is longer to write but allocates a bit less):
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> insert!.(eachcol(df), 2, [4, "d"]) # creates a temporary object but is terse
2-element Array{Array{T,1} where T,1}:
[1, 4, 2, 3]
["a", "d", "b", "c"]
julia> df
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
julia> foreach((c, v) -> insert!(c, 2, v), eachcol(df), [4, "d"]) # does not create a temporary object
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 4 │ d │
│ 4 │ 2 │ b │
│ 5 │ 3 │ c │
Note that the above operation is not atomic (it may corrupt your data frame if the type of the element you want to add does not match the element type allowed in the column).
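One way to reduce that risk while still inserting in place is to convert all values before touching any column. The helper below is hypothetical (it is not part of DataFrames.jl) and is only a sketch: it assumes the columns are plain vectors that support `insert!`.

```julia
using DataFrames

# Hypothetical helper: convert every value to the element type of its column
# first, so that a mismatch throws before any column has been modified.
function checked_insert_row!(df::DataFrame, pos::Integer, vals)
    length(vals) == ncol(df) ||
        throw(ArgumentError("expected $(ncol(df)) values, got $(length(vals))"))
    converted = [convert(eltype(c), v) for (c, v) in zip(eachcol(df), vals)]
    # Only reached if every conversion succeeded.
    foreach((c, v) -> insert!(c, pos, v), eachcol(df), converted)
    return df
end

df = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])
checked_insert_row!(df, 2, [4, "d"])
```

A call such as `checked_insert_row!(df, 1, ["oops", "d"])` fails during the conversion step, leaving all columns at their previous length.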
If you want a safe operation that will provide automatic promotion use this:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> [view(df, 1:1, :); DataFrame(names(df) .=> ['a', 'b']); view(df, 3:3, :)]
3×2 DataFrame
│ Row │ x │ y │
│ │ Any │ Any │
├─────┼─────┼─────┤
│ 1 │ 1 │ a │
│ 2 │ 'a' │ 'b' │
│ 3 │ 3 │ c │
(it is a bit slower though and creates a new data frame)
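Note that the snippet above keeps only rows 1 and 3 of the original, so it effectively replaces row 2. To insert a row while keeping all existing rows, the second slice has to start at the insertion position. A minimal sketch of that variant, with the variable names `new_row`, `pos` and `result` assumed for illustration:

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])

# The new row as a one-row data frame; 4.5 is a Float64 on purpose,
# to show that vcat promotes the column element type.
new_row = DataFrame(x = [4.5], y = ["d"])

pos = 2  # the new row ends up at this index

# Rows before `pos`, then the new row, then the remaining rows.
result = vcat(df[1:pos-1, :], new_row, df[pos:end, :])
```

As with the original snippet, this allocates a new data frame and leaves `df` unchanged.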
Deprecated
The original answer is here. It was valid for Julia before 1.0 release (and DataFrames.jl version that was compatible with it).
I guess you want to do it in place. Then you can use insert! function like this:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> foreach((v,n) -> insert!(df[n], 2, v), [4, "d"], names(df))
julia> df
4×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
Of course you have to make sure that you have the right number of columns in the added collection.
If you accept using unexported internal structure of a DataFrame you can do it even simpler:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> insert!.(df.columns, 2, [4, "d"])
2-element Array{Array{T,1} where T,1}:
[1, 4, 2, 3]
String["a", "d", "b", "c"]
julia> df
4×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │

How to get dtypes of columns in julia dataframe

How do I get the dtypes of all columns, and of particular column(s), in Julia? To be specific, what is the Julia equivalent of pandas' df.dtypes?
For example,
I have a df like below,
│ Row │ Id │ name │ item location │
│ │ Int64 │ String │ String │
├─────┼───────┼────────┼───────────────┤
│ 1 │ 1 │ A │ xyz │
│ 2 │ 2 │ B │ abc │
│ 3 │ 3 │ C │ def │
│ 4 │ 4 │ D │ ghi │
│ 5 │ 5 │ E │ xyz │
│ 6 │ 6 │ F │ abc │
│ 7 │ 7 │ G │ def │
│ 8 │ 8 │ H │ ghi │
│ 9 │ 9 │ I │ xyz │
│ 10 │ 10 │ J │ abc │
Expected output:
{'id': Int64, 'name': String, 'item location': String}
How to get dtypes, i.e., Int64 │ String │ String in Julia?
You have specified two different expected outputs so I show here how to get both:
julia> df = DataFrame("Id" => 1, "name" => "A", "item_location" => "xyz")
1×3 DataFrame
│ Row │ Id │ name │ item_location │
│ │ Int64 │ String │ String │
├─────┼───────┼────────┼───────────────┤
│ 1 │ 1 │ A │ xyz │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
Int64
String
String
julia> Dict(names(df) .=> eltype.(eachcol(df)))
Dict{String,DataType} with 3 entries:
"Id" => Int64
"name" => String
"item_location" => String
Additionally, if you want to store the result in a DataFrame instead of a Dict you can write (see mapcols documentation here):
julia> mapcols(eltype, df)
1×3 DataFrame
│ Row │ Id │ name │ item_location │
│ │ DataType │ DataType │ DataType │
├─────┼──────────┼──────────┼───────────────┤
│ 1 │ Int64 │ String │ String │
And if you want a NamedTuple storing this information, write (the documentation of Tables.columntable is here):
julia> map(eltype, Tables.columntable(df))
(Id = Int64, name = String, item_location = String)
(in this case note that for very wide tables this might have some extra compilation cost as each time you call it you potentially get a new type of NamedTuple)
You can also use describe(df) which is a catchall for learning about the columns in your data frame.
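The question also asks about particular columns. A minimal sketch of that case, using the same one-row example data (the names `wanted` and `selected_types` are assumptions made for illustration):

```julia
using DataFrames

df = DataFrame(Id = [1], name = ["A"], item_location = ["xyz"])

# Element type of a single column.
name_type = eltype(df.name)

# Element types of a chosen subset of columns, keyed by column name.
wanted = ["Id", "item_location"]
selected_types = Dict(c => eltype(df[!, c]) for c in wanted)
```

Using `df[!, c]` avoids copying the column just to inspect its element type.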

What is the meaning of the exclamation mark in indexing a Julia DataFrame?

I thought that the exclamation mark ! is the symbol for the logical NOT operator. Now, learning indexing in the DataFrames package, I came across this: data[!,:Treatment]. Which seems to be the same as using the known colon symbol :
data[:,:Treatment]==data[!,:Treatment] is true.
Why this redundancy then?
! in indexing is specific to DataFrames, and signals that you want a reference to the underlying vector storing the data, rather than a copy of it. You can read all about indexing DataFrames here. In your example they are == because all values are identical, but they are not === since df[:, :Treatment] gives you a copy of the underlying data.
Example:
julia> using DataFrames
julia> df = DataFrame(y = [1, 2, 3]);
julia> df[:, :y] == df[!, :y] # true because all values are equal
true
julia> df[:, :y] === df[!, :y] # false because they are not the same vector
false
Quoting the documentation of DataFrames.jl:
Columns can be directly (i.e. without copying) accessed via df.col or df[!, :col]. [...] Since df[!, :col] does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. To get a copy of the column use df[:, :col]: changing the vector returned by this syntax does not change df.
An example might make this clearer:
julia> using DataFrames
julia> df = DataFrame(x = rand(5), y=rand(5))
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `a` is a copy of `df.x`: modifying it will not affect `df`
julia> a = df[:, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> a[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `b` is the column vector of `df.x` itself (not a copy): any change made to it will be reflected in df
julia> b = df[!, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> b[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 1.0 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
Note that, since the indexing with ! does not involve any data copy, it will generally be more efficient.

Adding rows to a dataframe with pre-allocated memory?

Let's say I have a pre-sized dataframe and I want to assign values to every row. (Therefore push! and append! are out of the question.)
length = 10
df = DataFrame(id = Array(Int64,length),value = Array(String,length))
for n in 1:10
df[n,:id] = n
df[n,:value] = "random text"
end
The above code shows how to do that cell by cell for each iterated row.
Is there a solution to add an entire row at once for each iteration?
Because
for n in 1:10
df[n] = [n "random text"]
end
throws a wrong type exception.
To access a row the syntax is [row,:] rather than just row.
Also you'll need to convert the row to a DataFrame first.
for n in 1:10
df[n,:] = DataFrame([n "random text2"])
end
You can roll your own function to set a row quite easily:
julia> function setrow!(df, rowi, val)
for j in eachindex(val)
df[rowi, j] = val[j]
end
df
end
setrow! (generic function with 1 method)
julia> setrow!(df, 1, [1, "a"])
10×2 DataFrames.DataFrame
│ Row │ id │ value │
├─────┼─────────────────┼──────────┤
│ 1 │ 1 │ "a" │
│ 2 │ 140525709817424 │ "#undef" │
│ 3 │ 140525709817488 │ "#undef" │
│ 4 │ 140525709817072 │ "#undef" │
│ 5 │ 140525709817104 │ "#undef" │
│ 6 │ 140525709817136 │ "#undef" │
│ 7 │ 140525709817168 │ "#undef" │
│ 8 │ 140525709817200 │ "#undef" │
│ 9 │ 140525709817232 │ "#undef" │
│ 10 │ 0 │ "#undef" │
Ideally, you might be able to use the broadcasting assignment syntax:
df[2, :] .= [2, "b"]
But that appears to be not implemented (perhaps for good reason, I'm not sure).
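On a current Julia and DataFrames.jl, plain whole-row assignment with `df[n, :] = ...` is, as far as I know, supported for tuples and vectors whose length matches the number of columns. The sketch below relies on that assumption, so check it against the version you have installed. It also replaces the pre-1.0 `Array(Int64, length)` constructor from the question with `Vector{T}(undef, n)`.

```julia
using DataFrames

n = 10

# Pre-sized columns with uninitialized contents.
df = DataFrame(id = Vector{Int}(undef, n), value = Vector{String}(undef, n))

# Assign one whole row per iteration; a tuple keeps the element types intact.
for i in 1:n
    df[i, :] = (i, "random text")
end
```

A tuple is used rather than `[i, "random text"]` only to avoid building a `Vector{Any}` on every iteration.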
