How to do fast group by operations in Julia?

Specifically, I would like something similar to R's data.table d[, function(...), by = key]. Using the answer to another Stack Overflow question (Julia Dataframe group by and pivot tables functions), I have this solution:
using DataFrames
df = DataFrame(Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
               Class = ["H", "L", "H", "L", "L", "H", "H", "L", "L", "M"],
               Address = ["12 Silver", "10 Fak", "12 Silver", "1 North", "10 Fak", "2 Fake", "1 Red", "1 Dog", "2 Fake", "1 White"],
               Score = ["4", "5", "3", "2", "1", "5", "4", "3", "2", "1"])
julia> by(df, :Location, d -> DataFrame(count=nrow(d)))
4x2 DataFrames.DataFrame
| Row | Location | count |
|-----|----------|-------|
| 1 | "DC" | 1 |
| 2 | "NY" | 3 |
| 3 | "SF" | 3 |
| 4 | "TX" | 3 |
That works fine, but it turns out to be extremely slow for large datasets. Is there any faster solution?

For counting, the following solution is faster but not as readable:
using StatsBase  # provides countmap
cmap = countmap(df[:Location]);
res = DataFrame(Location=collect(keys(cmap)),count=collect(values(cmap)))
Or, more generally (again for counting):
countdf(df::DataFrame, fld) =
( h = countmap(df[fld]) ; DataFrame(collect.([keys(h),values(h)]),[fld,:count]) )
Giving:
julia> countdf(df,:Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ count │
├─────┼──────────┼───────┤
│ 1 │ "DC" │ 1 │
│ 2 │ "SF" │ 3 │
│ 3 │ "NY" │ 3 │
│ 4 │ "TX" │ 3 │
For other aggregation functions (which can be computed sequentially) we can define functions:
foldmap(op, v0, df, col) =
foldl((x,y)->setindex!(x,op(get(x,y[col],v0),y),y[col]),
Dict{eltype(df[col]),typeof(v0)}(), eachrow(df))
folddf(op, v0, df, col) =
(h = foldmap(op, v0, df, col) ;
DataFrame(collect.([keys(h),values(h)]),[col,:res]) )
inc1(x,y) = x+1
sumScore(x,y) = x+y[:Score]
maxScore(x,y) = max(x,y[:Score])
With these definitions:
julia> eltype(df[:Score])<:Real || ( df[:Score] = parse.(Float64, df[:Score]) );
julia> foldmap(inc1, 0, df, :Location)
Dict{String,Int64} with 4 entries:
"DC" => 1
"SF" => 3
"NY" => 3
"TX" => 3
julia> folddf(sumScore, 0.0, df, :Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ res │
├─────┼──────────┼──────┤
│ 1 │ "DC" │ 1.0 │
│ 2 │ "SF" │ 11.0 │
│ 3 │ "NY" │ 9.0 │
│ 4 │ "TX" │ 9.0 │
julia> folddf(maxScore, 0.0, df, :Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ res │
├─────┼──────────┼─────┤
│ 1 │ "DC" │ 1.0 │
│ 2 │ "SF" │ 5.0 │
│ 3 │ "NY" │ 4.0 │
│ 4 │ "TX" │ 4.0 │
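The folds above all follow the same pattern: one pass over the rows, updating a dictionary keyed by the group value. A minimal sketch of that pattern on plain vectors, without DataFrames, might look like the following. The name groupreduce and the two parallel vectors are illustrative assumptions, not part of any package, and the 0.0 starting value for max assumes non-negative scores, as in the answer above.

```julia
# Hypothetical helper: single-pass grouped reduction over two parallel vectors.
function groupreduce(op, init, groups, vals)
    acc = Dict{eltype(groups), typeof(init)}()
    for (g, v) in zip(groups, vals)
        # start from `init` the first time a group is seen
        acc[g] = op(get(acc, g, init), v)
    end
    return acc
end

# Same data as the example data frame, with Score already parsed to Float64.
location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"]
score = [4.0, 5.0, 3.0, 2.0, 1.0, 5.0, 4.0, 3.0, 2.0, 1.0]

counts = groupreduce((n, _) -> n + 1, 0, location, score)  # rows per group
sums = groupreduce(+, 0.0, location, score)                # total score per group
maxes = groupreduce(max, 0.0, location, score)             # highest score per group
```

Working on the column vectors directly avoids building a row object for every iteration, which is likely where much of the cost of the row-wise version goes.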

Related

How to insert a new row in julia at specific index [duplicate]

Is there a way to add a row to an existing dataframe at a specific index?
E.g. you have a data frame with 3 rows and 1 column
df = DataFrame(x = [2,3,4])
X
2
3
4
Is there any way to do the following:
insert!(df, 1, [1])
in order to get
X
1
2
3
4
I know that I could probably concatenate two data frames with df = [df1; df2], but I was hoping to avoid discarding and reallocating a large data frame every time I want to insert a row.
In DataFrames 0.21.4 just write the following (I give two options: one, with broadcasting, is short but creates a temporary object; the other, with foreach, is longer to write but allocates a bit less):
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> insert!.(eachcol(df), 2, [4, "d"]) # creates a temporary object but is terse
2-element Array{Array{T,1} where T,1}:
[1, 4, 2, 3]
["a", "d", "b", "c"]
julia> df
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
julia> foreach((c, v) -> insert!(c, 2, v), eachcol(df), [4, "d"]) # does not create a temporary object
julia> df
5×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 4 │ d │
│ 4 │ 2 │ b │
│ 5 │ 3 │ c │
Note that the above operation is not atomic (it may corrupt your data frame if the type of the element you want to add does not match the element type allowed in the column).
If you want a safe operation that will provide automatic promotion use this:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> [view(df, 1:1, :); DataFrame(names(df) .=> ['a', 'b']); view(df, 3:3, :)]
3×2 DataFrame
│ Row │ x │ y │
│ │ Any │ Any │
├─────┼─────┼─────┤
│ 1 │ 1 │ a │
│ 2 │ 'a' │ 'b' │
│ 3 │ 3 │ c │
(it is a bit slower though and creates a new data frame)
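One way to reduce the risk of a half-updated data frame with the in-place approach is to convert every value to its column's element type before touching any column. The sketch below illustrates the idea on a plain vector of column vectors; insert_row! is a hypothetical helper, not a DataFrames.jl function, and it does no promotion (a value that cannot be converted simply raises an error).

```julia
# Hypothetical helper: insert `row` at position `i` into every column,
# validating all values first so that a failure leaves the columns untouched.
function insert_row!(cols::AbstractVector, i::Integer, row)
    length(cols) == length(row) || throw(ArgumentError("row length mismatch"))
    # convert everything up front; this throws before any column is mutated
    converted = [convert(eltype(c), v) for (c, v) in zip(cols, row)]
    for (c, v) in zip(cols, converted)
        insert!(c, i, v)
    end
    return cols
end

cols = [[1, 2, 3], ["a", "b", "c"]]
insert_row!(cols, 2, (4, "d"))          # both values fit their columns

failed = try
    insert_row!(cols, 1, (5, 6))        # 6 cannot be converted to String
    false
catch
    true
end
```

With the plain foreach version, the second call would already have inserted 5 into the first column before failing on the second one.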
Deprecated
The original answer is here. It was valid for Julia before 1.0 release (and DataFrames.jl version that was compatible with it).
I guess you want to do it in place. Then you can use insert! function like this:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> foreach((v,n) -> insert!(df[n], 2, v), [4, "d"], names(df))
julia> df
4×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │
Of course you have to make sure that you have the right number of columns in the added collection.
If you accept using unexported internal structure of a DataFrame you can do it even simpler:
julia> df = DataFrame(x = [1,2,3], y = ["a", "b", "c"])
3×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
│ 3 │ 3 │ c │
julia> insert!.(df.columns, 2, [4, "d"])
2-element Array{Array{T,1} where T,1}:
[1, 4, 2, 3]
String["a", "d", "b", "c"]
julia> df
4×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 1 │ a │
│ 2 │ 4 │ d │
│ 3 │ 2 │ b │
│ 4 │ 3 │ c │

Julia: Subset data frame

I had two tab delimited files; one with the data, and the second one with the names of the columns that I am interested in. I want to subset the data frame so that it only has my columns of interest. Here is my code:
dat1 = DataFrame(CSV.File("data.txt"))
hdr = Symbol(readdlm("header.txt",'\t'))
which gives
julia> dat1
4×5 DataFrame
│ Row │ chr │ pos │ alt │ ref │ cadd │
│ │ String │ Int64 │ String │ String │ Float64 │
├─────┼────────┼───────┼────────┼────────┼─────────┤
│ 1 │ chr1 │ 1234 │ A │ T │ 23.4 │
│ 2 │ chr2 │ 1234 │ C │ G │ 5.4 │
│ 3 │ chr2 │ 1234 │ G │ C │ 11.0 │
│ 4 │ chr5 │ 3216 │ A │ T │ 3.0 │
julia> hdr
Symbol("Any[\"pos\" \"alt\"]")
However, I get an error if I try to subset with:
julia> dat2 = dat1[ :, :hdr]
What would be the correct way to subset? Thanks!
Just do:
hdr = vec(readdlm("header.txt",'\t'))
dat2 = dat1[:, hdr]
or for the second step
dat2 = select(dat1, hdr)
What is important here is that hdr should be a vector of strings.
You could also have written:
dat2 = select(dat1, readdlm("header.txt",'\t')...)
splatting the contents of the matrix (strings holding column names) as positional arguments.
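The reason the original attempt failed can be seen without any files: calling Symbol on the whole matrix builds a single symbol from its printed form, while vec (optionally followed by broadcasting) gives one name per entry. In this sketch the matrix m is a stand-in for what readdlm returned in the question.

```julia
# Stand-in for readdlm("header.txt", '\t') on a one-line header file.
m = Any["pos" "alt"]

single = Symbol(m)             # one Symbol made from the printed matrix
colnames = Symbol.(vec(m))     # one Symbol per column name
colstrings = String.(vec(m))   # or a concretely typed Vector{String}
```

Either colnames or colstrings should work as a column selector in dat1[:, ...].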

Build dataframe from matrix - specify column types

OK, let's say I have a series of arrays:
data_one = ["dog","cat"]
data_two = [1,2]
data_three = ["1/1/2018","1/2/2018"]
I build them into a matrix
m = hcat(data_one,data_two,data_three)
and convert to a df
df = DataFrame(m)
showcols(df)
for output:
julia> showcols(df)
3×5 DataFrames.DataFrame
│ Row │ variable │ eltype │ nmissing │ first │ last │
├─────┼──────────┼────────┼──────────┼──────────┼──────────┤
│ 1 │ x1 │ Any │ 0 │ dog │ cat │
│ 2 │ x2 │ Any │ 0 │ 1 │ 2 │
│ 3 │ x3 │ Any │ 0 │ 1/1/2018 │ 1/2/2018 │
When I build this data frame, how can I specify the type of each column?
col1 should be String
col2 = Int
col3 = String
You can do it only indirectly through the following DataFrame constructor (of course you could pass [String, Int, String] as a variable here):
DataFrame([([String, Int, String][i]).(m[:,i]) for i in 1:size(m, 2)])
and if you want to use automatic detection of column type you can use:
DataFrame([[v for v in m[:,i]] for i in 1:size(m, 2)])
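To see what the two comprehensions produce before they are handed to the DataFrame constructor, they can be run on the matrix alone. This is only a sketch of the column-building step; the variable names types, typed_cols and auto_cols are illustrative.

```julia
# hcat of String and Int vectors yields a Matrix{Any}, as in the question.
m = hcat(["dog", "cat"], [1, 2], ["1/1/2018", "1/2/2018"])

# Explicit types: apply each constructor elementwise to its column.
types = [String, Int, String]
typed_cols = [T.(m[:, i]) for (i, T) in enumerate(types)]

# Automatic detection: the inner comprehension narrows the element type.
auto_cols = [[v for v in m[:, i]] for i in 1:size(m, 2)]
```

In both cases each column ends up with a concrete element type instead of Any.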

Cumulative Returns

In R we can do:
cum.ret <- cumprod(1 + df$rets) - 1
I want to do the same thing in Julia. Here is some dummy data:
# Dummy Data
df = DataFrame(a = 1:10, b = 10*rand(10), Close = 10 * rand(10))
# Calculate Returns
Close = df[:Close]
Close = convert(Array, Close)
df[:Close_Rets] = [NaN; (Close[2:end] ./ Close[1:(end-1)] - 1)]
# Calculate Cumulative Returns
df[:Cum_Ret] = cumprod(((1 .+ df[:Close_Rets])-1),2)
With the output:
julia> head(df)
6×5 DataFrames.DataFrame
│ Row │ a │ b │ Close │ Close_Rets │ Cum_Ret │
├─────┼───┼─────────┼──────────┼────────────┼───────────┤
│ 1 │ 1 │ 6.15507 │ 3.6363 │ NaN │ NaN │
│ 2 │ 2 │ 7.73259 │ 0.98378 │ -0.729456 │ -0.729456 │
│ 3 │ 3 │ 3.64926 │ 7.94633 │ 7.07735 │ 7.07735 │
│ 4 │ 4 │ 5.15762 │ 0.744905 │ -0.906258 │ -0.906258 │
│ 5 │ 5 │ 9.49532 │ 8.51811 │ 10.4352 │ 10.4352 │
│ 6 │ 6 │ 6.14604 │ 5.02165 │ -0.410473 │ -0.410473 │
Is there any way to make this work?
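A possible reading of the output above: in (1 .+ rets) - 1 the +1 and -1 cancel before cumprod is applied, and the second argument asks for a product along dimension 2, so the column comes back unchanged. A minimal sketch of the R expression cumprod(1 + rets) - 1 on a plain vector follows. The price values are made up for illustration, and the returns vector deliberately has no leading NaN, since a NaN would propagate through every later product.

```julia
# Made-up closing prices, for illustration only.
prices = [100.0, 110.0, 99.0, 108.9]

# Simple period returns: one fewer element than prices.
rets = prices[2:end] ./ prices[1:end-1] .- 1

# Cumulative return: compound the growth factors, then subtract 1.
cum_ret = cumprod(1 .+ rets) .- 1
```

The last element of cum_ret equals prices[end] / prices[1] - 1 up to floating-point rounding, which is a quick sanity check.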

Multiply two data frame columns

So I tried this:
df[:new_col] = (df[:col_one ] .* [df[:col_two]])
It produces a wild result.
I then thought to iterate row-wise by accessing the data frame index:
v = Float64[]
for i in 1:nrow(df)
z = df[[i],[:col_one]] * df[[i],[:col_two]]
append!(v,z)
end
This however does not work. Any ideas?
What are my options from here? Pull the data from a data frame and make a vector?
** Update **
df = DataFrame(a = 1:10, b = 10*rand(10), c = 10 * rand(10))
df[:new_d] = df[:b] .* df[:c]
For output:
julia> head(df)
6×4 DataFrames.DataFrame
│ Row │ a │ b │ c │ new_d │
├─────┼───┼─────────┼─────────┼─────────┤
│ 1 │ 1 │ 6.67916 │ 8.38096 │ 55.9778 │
│ 2 │ 2 │ 7.50056 │ 5.26593 │ 39.4974 │
│ 3 │ 3 │ 7.76419 │ 3.54361 │ 27.5133 │
│ 4 │ 4 │ 2.86521 │ 8.41335 │ 24.1061 │
│ 5 │ 5 │ 3.7417 │ 8.10884 │ 30.3409 │
│ 6 │ 6 │ 7.52014 │ 2.61603 │ 19.6729 │
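The "wild result" of the first attempt probably comes from the extra brackets: wrapping a column in [ ] makes broadcasting treat the whole column as a single element. A small sketch on plain vectors, with made-up values, shows the difference.

```julia
col_one = [1.0, 2.0, 3.0]
col_two = [10.0, 20.0, 30.0]

# Elementwise product: one number per row.
elementwise = col_one .* col_two

# With the extra brackets, each entry of col_one multiplies the whole of
# col_two, giving a vector of vectors.
nested = col_one .* [col_two]
```

Here elementwise is a Vector{Float64}, while nested holds three vectors of length 3.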

Resources