Julia - How to aggregate many columns by group

I am working with Julia 1.0.
I am trying to aggregate (in this case mean-center) several columns by group, and I am looking for a way to loop over the columns instead of writing out every column name explicitly.
The code below works, but I would like a more succinct syntax for cases where I have many columns.
using DataFrames, Statistics
dd=DataFrame(A=["aa";"aa";"bb";"bb"], B=[1.0;2.0;3.0;4.0], C=[5.0;5.0;10.0;10.0])
by(dd, :A, df -> DataFrame(bm = df[:B].-mean(df[:B]), cm = df[:C].-mean(df[:C])))
Is there a way to loop over [:B, :C] and not write the statement separately for each?

You can use aggregate:
julia> centered(col) = col .- mean(col)
centered (generic function with 1 method)
julia> aggregate(dd, :A, centered)
4×3 DataFrame
│ Row │ A │ B_centered │ C_centered │
│ │ String │ Float64 │ Float64 │
├─────┼────────┼────────────┼────────────┤
│ 1 │ aa │ -0.5 │ 0.0 │
│ 2 │ aa │ 0.5 │ 0.0 │
│ 3 │ bb │ -0.5 │ 0.0 │
│ 4 │ bb │ 0.5 │ 0.0 │
Note that the function name is used as the suffix of each new column name. If you need custom names, use by and pass as its third argument a function that iterates over the columns it receives and gives each result an appropriate name.
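Later DataFrames.jl releases dropped by and aggregate in favor of groupby plus combine or transform. As a rough sketch only, assuming DataFrames.jl 1.x (the output names bm and cm and the "_ctr" suffix are illustrative choices, not part of the original answer), the same centering with custom column names could look like this:

```julia
using DataFrames, Statistics

dd = DataFrame(A = ["aa", "aa", "bb", "bb"],
               B = [1.0, 2.0, 3.0, 4.0],
               C = [5.0, 5.0, 10.0, 10.0])

# Center a column around its (per-group) mean.
centered(col) = col .- mean(col)

cols = [:B, :C]          # columns to transform
newnames = [:bm, :cm]    # illustrative output names, one per column

# Broadcasting `.=>` builds one `source => function => target` pair per column.
result = combine(groupby(dd, :A), cols .=> centered .=> newnames)

# Names can also be derived from the input columns instead of listed by hand.
suffixed = combine(groupby(dd, :A), cols .=> centered .=> Symbol.(cols, "_ctr"))
```

Using transform instead of combine should keep the original B and C columns alongside the centered ones.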

Related

What is the meaning of the exclamation mark in indexing a Julia DataFrame?

I thought that the exclamation mark ! is the symbol for the logical NOT operator. Now, learning indexing in the DataFrames package, I came across this: data[!,:Treatment]. Which seems to be the same as using the known colon symbol :
data[:,:Treatment]==data[!,:Treatment] is true.
Why this redundancy then?

! in indexing is specific to DataFrames and signals that you want a reference to the underlying vector storing the data, rather than a copy of it. The DataFrames.jl documentation covers indexing in detail. In your example the two are == because all values are identical, but they are not === since data[:, :Treatment] gives you a copy of the underlying data.
Example:
julia> using DataFrames
julia> df = DataFrame(y = [1, 2, 3]);
julia> df[:, :y] == df[!, :y] # true because all values are equal
true
julia> df[:, :y] === df[!, :y] # false because they are not the same vector
false

Quoting the documentation of DataFrames.jl:
Columns can be directly (i.e. without copying) accessed via df.col or df[!, :col]. [...] Since df[!, :col] does not make a copy, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. To get a copy of the column use df[:, :col]: changing the vector returned by this syntax does not change df.
An example might make this clearer:
julia> using DataFrames
julia> df = DataFrame(x = rand(5), y=rand(5))
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `a` is a copy of `df.x`: modifying it will not affect `df`
julia> a = df[:, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> a[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 0.54413 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
# `b` is the column `df.x` itself, not a copy: any change made to it will be reflected in df
julia> b = df[!, :x]
5-element Array{Float64,1}:
0.9378915597741728
0.544130347207969
0.9613717853719412
0.958788066884128
0.9425183324742632
julia> b[2] = 1;
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Float64 │ Float64 │
├─────┼──────────┼───────────┤
│ 1 │ 0.937892 │ 0.42232 │
│ 2 │ 1.0 │ 0.932265 │
│ 3 │ 0.961372 │ 0.680818 │
│ 4 │ 0.958788 │ 0.923667 │
│ 5 │ 0.942518 │ 0.0428454 │
Note that, since the indexing with ! does not involve any data copy, it will generally be more efficient.

Mutate DataFrames in Julia

I am looking for a function that works like by but doesn't collapse my DataFrame. In R I would use dplyr's group_by(b) %>% mutate(x1 = sum(a)). I don't want to lose information from the table, such as the values in variable :c.
mydf = DataFrame(a = 1:4, b = repeat(1:2,2), c=4:-1:1)
bypreserve(mydf, :b, x -> sum(x.a))
│ Row │ a     │ b     │ c     │ x1    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 4     │ 4     │
│ 2   │ 2     │ 2     │ 3     │ 6     │
│ 3   │ 3     │ 1     │ 2     │ 4     │
│ 4   │ 4     │ 2     │ 1     │ 6     │

Adding this functionality is under discussion, but I expect it will take several months to ship. The general idea is to give select a groupby keyword argument and to add a transform function that works like select but preserves the columns of the source data frame.
For now, the solution is to use join after by:
join(mydf, by(mydf, :b, x1 = :a => sum), on=:b)
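A transform function of the kind described above did ship in later DataFrames.jl releases. A minimal sketch, assuming DataFrames.jl 1.x (where by and the old join are no longer available), of the grouped mutate the question asks for:

```julia
using DataFrames

mydf = DataFrame(a = 1:4, b = repeat(1:2, 2), c = 4:-1:1)

# transform on a grouped data frame keeps every row and column of mydf
# and adds x1, the per-group sum of :a, repeated for each row of the group.
result = transform(groupby(mydf, :b), :a => sum => :x1)
```

The rows come back in the order of the source data frame, so :c stays lined up with the rows it belongs to.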

Is there a names() equivalent in Julia?

I am transferring a script written in R to Julia, and one of the R functions is the names() function. Is there a synonymous function in Julia?

DataFrames
In Julia, the names function returns the column names of a DataFrame, e.g.:
julia> using DataFrames
julia> x = DataFrame(rand(3,4))
3×4 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1 │ 0.721198 │ 0.605882 │ 0.191941 │ 0.597295 │
│ 2 │ 0.0537836 │ 0.619698 │ 0.764937 │ 0.273197 │
│ 3 │ 0.679952 │ 0.899523 │ 0.206124 │ 0.928319 │
julia> names(x)
4-element Array{Symbol,1}:
:x1
:x2
:x3
:x4
To set the column names of a DataFrame, use the names! function (example continued):
julia> names!(x, [:a,:b,:c,:d])
3×4 DataFrames.DataFrame
│ Row │ a │ b │ c │ d │
├─────┼───────────┼──────────┼──────────┼──────────┤
│ 1 │ 0.721198 │ 0.605882 │ 0.191941 │ 0.597295 │
│ 2 │ 0.0537836 │ 0.619698 │ 0.764937 │ 0.273197 │
│ 3 │ 0.679952 │ 0.899523 │ 0.206124 │ 0.928319 │
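The session above is from an older DataFrames.jl release. As a hedged sketch, assuming DataFrames.jl 1.x (where names! was replaced by rename!, names returns strings, and building a data frame from a matrix needs :auto to generate column names), the same steps might look like this:

```julia
using DataFrames

# :auto asks DataFrames to generate the names x1, x2, x3, x4.
x = DataFrame(rand(3, 4), :auto)

colnames = names(x)          # column names as a Vector{String}
colsyms = propertynames(x)   # the same names as Symbols

# Rename all columns in place.
rename!(x, [:a, :b, :c, :d])
```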
Arrays
Standard arrays do not support naming their dimensions, but the NamedArrays.jl package adds this functionality. With it you can get and set the names of dimensions as well as the names of indices along each dimension. The details are at https://github.com/davidavdav/NamedArrays.jl#general-functions.

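I'm not an R expert, but I think you want fieldnames:
mutable struct Foo  # written as `type Foo` in Julia versions before 0.6
    bar::Int
end
@show fieldnames(Foo)  # names of the fields declared on the type
baz = Foo(2)
@show fieldnames(typeof(baz))  # fieldnames takes a type, so pass typeof(baz) for an instance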

Adding rows to a dataframe with pre-allocated memory?

Let's say I have a pre-sized dataframe and I want to assign values to every row. (Therefore push! and append! are out of the question.)
length = 10
df = DataFrame(id = Array(Int64,length),value = Array(String,length))
for n in 1:10
    df[n,:id] = n
    df[n,:value] = "random text"
end
The above code shows how to do that cell by cell for each iterated row.
Is there a solution to add an entire row at once for each iteration?
I ask because the following attempt throws a wrong type exception:
for n in 1:10
    df[n] = [n "random text"]
end

To access a row the syntax is [row,:] rather than just row.
Also you'll need to convert the row to a DataFrame first.
for n in 1:10
    df[n,:] = DataFrame([n "random text2"])
end

You can roll your own function to set a row quite easily:
julia> function setrow!(df, rowi, val)
           for j in eachindex(val)
               df[rowi, j] = val[j]
           end
           df
       end
setrow! (generic function with 1 method)
julia> setrow!(df, 1, [1, "a"])
10×2 DataFrames.DataFrame
│ Row │ id │ value │
├─────┼─────────────────┼──────────┤
│ 1 │ 1 │ "a" │
│ 2 │ 140525709817424 │ "#undef" │
│ 3 │ 140525709817488 │ "#undef" │
│ 4 │ 140525709817072 │ "#undef" │
│ 5 │ 140525709817104 │ "#undef" │
│ 6 │ 140525709817136 │ "#undef" │
│ 7 │ 140525709817168 │ "#undef" │
│ 8 │ 140525709817200 │ "#undef" │
│ 9 │ 140525709817232 │ "#undef" │
│ 10 │ 0 │ "#undef" │
Ideally, you might be able to use the broadcasting assignment syntax:
df[2, :] .= [2, "b"]
But that does not appear to be implemented (perhaps for good reason, I'm not sure).
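The answers above target an old DataFrames.jl release. As a sketch under the assumption of DataFrames.jl 1.x, where assigning a tuple or vector to df[row, :] sets a whole row in place, the pre-allocated fill could be written like this (the variable name nrows is illustrative):

```julia
using DataFrames

nrows = 10

# Pre-allocate both columns; the contents are undefined until assigned.
df = DataFrame(id = Vector{Int}(undef, nrows),
               value = Vector{String}(undef, nrows))

for i in 1:nrows
    # One assignment per row; a tuple keeps each element's own type.
    df[i, :] = (i, "random text")
end
```

The right-hand side needs one element per column, in column order.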

Combine Julia Dataframes by reference, instead of making copy

In Julia you can combine dataframes:
d1 = DataFrame(A=1:10)
d2 = DataFrame(A=11:20)
d3 = [d1; d2]
However this appears to copy d1, d2 into d3. I don't want to copy them. If you make a modification to d1, it is not reflected in d3.
Does anyone know how to combine them by reference instead of by value, so that if d1 is modified, the change is reflected in d3?
Thanks!

In Array terminology, what you want is for d1 and d2 to be views into the data in d3. This is also possible with DataFrames:
julia> using DataFrames
julia> d3 = DataFrame(A=1:20);
julia> d1 = view(d3,1:10);
julia> d2 = view(d3,11:20);
julia> d1[1:3,:]
3×1 DataFrames.DataFrame
│ Row │ A │
├─────┼───┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> d3[1:3,:]
3×1 DataFrames.DataFrame
│ Row │ A │
├─────┼───┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
julia> d1[1,:A] = 999
999
julia> d3[1:3,:]
3×1 DataFrames.DataFrame
│ Row │ A │
├─────┼─────┤
│ 1 │ 999 │
│ 2 │ 2 │
│ 3 │ 3 │
Of course, you may want to create d1 and d2 first and then combine them into d3, but this requires a copy operation (to make the columns contiguous in memory). After that, you can generate the views and assign them to d1 and d2. Using different variable names for the views is advisable, because rebinding d1 and d2 to a different type can cause type instability, which hurts performance in Julia.
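In more recent DataFrames.jl releases a view needs both a row and a column selector. A minimal sketch of the same idea, assuming DataFrames.jl 1.x:

```julia
using DataFrames

d3 = DataFrame(A = 1:20)

# Both forms create a SubDataFrame that shares memory with d3.
d1 = view(d3, 1:10, :)
d2 = @view d3[11:20, :]

# Writing through the view changes the parent data frame.
d1[1, :A] = 999
first_value = d3[1, :A]
```

Here parent(d1) returns d3 itself, which is a quick way to check that no copy was made.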
