Group DataFrame by sequential occurrence of values in a column in Julia

Suppose I have the following DataFrame:
julia> Random.seed!(1)
TaskLocalRNG()
julia> df = DataFrame(data = rand(1:10, 10), gr = rand([0, 1], 10))
10×2 DataFrame
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 4 0
3 │ 7 0
4 │ 7 0
5 │ 10 1
6 │ 2 1
7 │ 8 0
8 │ 8 0
9 │ 7 0
10 │ 2 0
I want to group not only by the value of :gr but also by where it occurs, so that every uninterrupted run of equal values forms its own group. In this case, the number of groups should be 4:
Group 1 (1 row)
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
Group 2 (3 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
2 │ 4 0
3 │ 7 0
4 │ 7 0
Group 3 (2 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
5 │ 10 1
6 │ 2 1
Group 4 (4 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
7 │ 8 0
8 │ 8 0
9 │ 7 0
10 │ 2 0
If I group by the column :gr, however, I only get two groups:
julia> groupby(df, :gr)
GroupedDataFrame with 2 groups based on key: gr
First Group (7 rows): gr = 0
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 4 0
2 │ 7 0
3 │ 7 0
4 │ 8 0
5 │ 8 0
6 │ 7 0
7 │ 2 0
⋮
Last Group (3 rows): gr = 1
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 10 1
3 │ 2 1
How can I implement this with DataFrames.jl in Julia? Thanks.
versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.5.0)
CPU: Apple M1 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, westmere)
Environment:
JULIA_NUM_THREADS = 1
JULIA_EDITOR = code

You can group by a new variable that increments each time :gr changes value.
For example:
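n = nrow(df)  # avoid naming a variable `nrow`: it clashes with DataFrames.nrow
gr1 = zeros(Int, n)
for i in 2:n
    # bump the group id whenever :gr differs from the previous row
    gr1[i] = gr1[i-1] + (df.gr[i] != df.gr[i-1])
end
df.gr1 = gr1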
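Grouping on the new column then gives the four groups asked for. A minimal sketch, assuming the data and gr values printed in the question (typed in by hand here so the example does not depend on the random seed); the name group_sizes is only for illustration:

```julia
using DataFrames

# Values copied from the table printed in the question
df = DataFrame(data = [1, 4, 7, 7, 10, 2, 8, 8, 7, 2],
               gr   = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0])

n = nrow(df)
gr1 = zeros(Int, n)
for i in 2:n
    # bump the group id whenever :gr differs from the previous row
    gr1[i] = gr1[i-1] + (df.gr[i] != df.gr[i-1])
end
df.gr1 = gr1

gdf = groupby(df, :gr1)
group_sizes = [nrow(g) for g in gdf]
```

Because the comparison uses != rather than arithmetic, the same loop should also work for non-numeric columns.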

Same basic idea as @GEK's answer, with a more vectorized implementation:
julia> edgedetect(col) = [0; abs.(diff(col))] |> cumsum
edgedetect (generic function with 1 method)
julia> edgedetect([0, 1, 1, 1, 0, 0, 1]) |> print
[0, 1, 1, 1, 2, 2, 3]
For a 0/1 column such as :gr, abs.(diff(col)) places a 1 wherever the value of column col changes, and a 0 elsewhere. (diff returns n-1 differences given n elements, so we prefix the result with a 0 to maintain column length.) Doing a cumulative sum on this, we get a new column that increases every time the value in the original column changes.
We can then use this function to groupby on a transformed dataframe, like this:
julia> groupby(transform(df, :gr => edgedetect => :gr_edges, copycols = false), :gr_edges) |> print
GroupedDataFrame with 4 groups based on key: gr_edges
Group 1 (1 row): gr_edges = 0
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 1 1 0
Group 2 (3 rows): gr_edges = 1
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 4 0 1
2 │ 7 0 1
3 │ 7 0 1
Group 3 (2 rows): gr_edges = 2
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 10 1 2
2 │ 2 1 2
Group 4 (4 rows): gr_edges = 3
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 8 0 3
2 │ 8 0 3
3 │ 7 0 3
4 │ 2 0 3
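One caveat: diff needs a numeric column, and for values other than 0 and 1 the ids still separate the runs but no longer increase by exactly one. A minimal sketch of a variant that compares neighbours with isequal instead; the name runids is made up for this example and is not part of DataFrames.jl:

```julia
# Run ids: start at 1 and increase by one at every change of value.
# isequal treats `missing` as equal to `missing`, so a run of missings stays together.
function runids(col::AbstractVector)
    changed = [i == firstindex(col) || !isequal(col[i], col[i-1]) for i in eachindex(col)]
    return cumsum(changed)
end

ids_bits = runids([0, 1, 1, 1, 0, 0, 1])
ids_strs = runids(["a", "a", "b", "a"])
```

It would slot into the same pattern as above, for example groupby(transform(df, :gr => runids => :gr_runs), :gr_runs).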

Related

Julia transpose grouped data in DataFrames?

ds = Dataset(group = repeat(1:3, inner = 2),
b = repeat(1:2, inner = 3),
c = repeat(1:1, inner = 6),
d = repeat(1:6, inner = 1),
e = string.('a':'f'))
In the InMemoryDatasets package, we can transpose grouped data like this:
#transpose by group
transpose(groupby(ds, :group), 2:4)
How can I do this with the DataFrames package?
How can I do this in R?
Expected result:
Row │ group variable 1 2
│ Int64 String Int64? Int64?
─────┼─────────────────────────────────
1 │ 1 b 1 1
2 │ 1 c 1 1
3 │ 1 d 1 2
4 │ 2 b 1 2
5 │ 2 c 1 1
6 │ 2 d 3 4
7 │ 3 b 2 2
8 │ 3 c 1 1
9 │ 3 d 5 6
Answer (attempt) regarding the Julia DataFrames part of the question:
First creating the DataFrame:
df = DataFrame(group = repeat(1:3, inner = 2),
b = repeat(1:2, inner = 3),
c = repeat(1:1, inner = 6),
d = repeat(1:6, inner = 1),
e = string.('a':'f'))
Next, since the transpose operation depends on row ordering, we fix a row ordering in the groups:
julia> ordereddf = transform(DataFrames.groupby(df, :group),"group" => (x->1:length(x)) => "rn")[:,Not(:e)]
6×5 DataFrame
Row │ group b c d rn
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 1 1 1 1
2 │ 1 1 1 2 2
3 │ 2 1 1 3 1
4 │ 2 2 1 4 2
5 │ 3 2 1 5 1
6 │ 3 2 1 6 2
Finally, the stack/unstack combination does the transposing:
julia> sort!(unstack(stack(ordereddf,[:b,:c,:d]),:rn, :value),:group)
9×4 DataFrame
Row │ group variable 1 2
│ Int64 String Int64? Int64?
─────┼─────────────────────────────────
1 │ 1 b 1 1
2 │ 1 c 1 1
3 │ 1 d 1 2
4 │ 2 b 1 2
5 │ 2 c 1 1
6 │ 2 d 3 4
7 │ 3 b 2 2
8 │ 3 c 1 1
9 │ 3 d 5 6
Feels like there might be easier ways to do this, but in general, transpose is rarely appropriate for database-like tables, and if it is appropriate, then maybe a matrix should have been used to store information in the first place.
The R part is left for someone else to answer.
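For comparison, the same result can probably be reached by stacking first and numbering the rows afterwards, within each (group, variable) pair. This is only a sketch of one alternative ordering of the same stack/unstack steps, not an established idiom; the names long and wide are made up here:

```julia
using DataFrames

df = DataFrame(group = repeat(1:3, inner = 2),
               b = repeat(1:2, inner = 3),
               c = repeat(1:1, inner = 6),
               d = repeat(1:6, inner = 1),
               e = string.('a':'f'))

# Long format: one row per (original row, variable); :e is dropped first
long = stack(select(df, Not(:e)), [:b, :c, :d])

# Number the observations inside each (group, variable) pair
long = transform(groupby(long, [:group, :variable]),
                 :value => (v -> 1:length(v)) => :rn)

# Spread the row numbers into columns "1" and "2"
wide = unstack(long, [:group, :variable], :rn, :value)
sort!(wide, [:group, :variable])
```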

Add a column with a constant value to a DataFrame

How can I add a column with a constant value to a DataFrame?
E.g. I have the following DataFrame:
using DataFrames
df = DataFrame(x = 1:10, y = 'a':'j')
And I would like to add a new variable z with constant value 1 and obtain:
10×3 DataFrame
Row │ x y z
│ Int64 Char Int64
─────┼────────────────────
1 │ 1 a 1
2 │ 2 b 1
3 │ 3 c 1
4 │ 4 d 1
5 │ 5 e 1
6 │ 6 f 1
7 │ 7 g 1
8 │ 8 h 1
9 │ 9 i 1
10 │ 10 j 1
To create such a column when constructing the DataFrame:
df = DataFrame(x = 1:10, y = 'a':'j', d = 1)
To append such a column to an existing DataFrame, you need broadcasting:
df.e .= 1
or
df[:, "f"] .= 1
A more general alternative is:
julia> insertcols!(df, :z => 1)
10×3 DataFrame
Row │ x y z
│ Int64 Char Int64
─────┼────────────────────
1 │ 1 a 1
2 │ 2 b 1
3 │ 3 c 1
4 │ 4 d 1
5 │ 5 e 1
6 │ 6 f 1
7 │ 7 g 1
8 │ 8 h 1
9 │ 9 i 1
10 │ 10 j 1
which by default does the same, but it additionally:
allows you to specify the location of the new column;
by default makes sure that you do not accidentally overwrite an existing column
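Both points can be illustrated with a short sketch on a three-row data frame; the variable overwrite_blocked is just a helper for this example:

```julia
using DataFrames

df = DataFrame(x = 1:3, y = 'a':'c')

# Put the constant column first instead of last
insertcols!(df, 1, :z => 1)

# A second insert under the same name is rejected rather than overwriting :z
overwrite_blocked = try
    insertcols!(df, :z => 2)
    false
catch
    true
end

# makeunique=true keeps both columns by renaming the new one
insertcols!(df, :z => 2, makeunique = true)
```

By contrast, df.z .= 2 would overwrite the existing column without complaint.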

Stack multiple columns into two columns in DataFrames.jl

Suppose my source and target DataFrames are df1 and df2, as shown below.
How can I convert df1 to df2 using stack? Thanks.
julia> df1 = DataFrame(x1 = 1:4, x2 = 5:8, y1 = Char.(65:68), y2 = Char.(69:72))
4×4 DataFrame
Row │ x1 x2 y1 y2
│ Int64 Int64 Char Char
─────┼──────────────────────────
1 │ 1 5 A E
2 │ 2 6 B F
3 │ 3 7 C G
4 │ 4 8 D H
julia> df2 = DataFrame(x = [df1.x1; df1.x2], y = [df1.y1; df1.y2])
8×2 DataFrame
Row │ x y
│ Int64 Char
─────┼─────────────
1 │ 1 A
2 │ 2 B
3 │ 3 C
4 │ 4 D
5 │ 5 E
6 │ 6 F
7 │ 7 G
8 │ 8 H
What you want is not a regular stack.
Obviously, you can stack the x columns and the y columns separately, then hcat the results together to get your df2. But here is a different approach.
julia> long = stack(df1,All())
16×2 DataFrame
Row │ variable value
│ String Any
─────┼─────────────────
1 │ x1 1
2 │ x1 2
3 │ x1 3
4 │ x1 4
5 │ x2 5
6 │ x2 6
7 │ x2 7
8 │ x2 8
9 │ y1 A
10 │ y1 B
11 │ y1 C
12 │ y1 D
13 │ y2 E
14 │ y2 F
15 │ y2 G
16 │ y2 H
Turn x1, x2, y1, y2 into x, x, y, y and add a new unique id column:
julia> long.variable = SubString.(long.variable,1,1);
julia> long.id = repeat(1:8,2);
julia> long
16×3 DataFrame
Row │ variable value id
│ SubStrin… Any Int64
─────┼─────────────────────────
1 │ x 1 1
2 │ x 2 2
3 │ x 3 3
4 │ x 4 4
5 │ x 5 5
6 │ x 6 6
7 │ x 7 7
8 │ x 8 8
9 │ y A 1
10 │ y B 2
11 │ y C 3
12 │ y D 4
13 │ y E 5
14 │ y F 6
15 │ y G 7
16 │ y H 8
Finally, unstack back:
julia> unstack(long, :variable, :value)
8×3 DataFrame
Row │ id x y
│ Int64 Any Any
─────┼─────────────────
1 │ 1 1 A
2 │ 2 2 B
3 │ 3 3 C
4 │ 4 4 D
5 │ 5 5 E
6 │ 6 6 F
7 │ 7 7 G
8 │ 8 8 H
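Note that the columns above end up with element type Any, because stack(df1, All()) mixes Int64 and Char values in one column. A minimal sketch of the "stack each group separately" route mentioned at the start, which keeps the concrete types; the names xs and ys are made up for this example:

```julia
using DataFrames

df1 = DataFrame(x1 = 1:4, x2 = 5:8, y1 = Char.(65:68), y2 = Char.(69:72))

# Stack the x columns and the y columns separately, keeping only the values
xs = stack(df1, [:x1, :x2]).value
ys = stack(df1, [:y1, :y2]).value

df2 = DataFrame(x = xs, y = ys)
```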

Julia - separate() tidyr equivalent

Is there a function in Julia that is equivalent to separate() in R?
I have a column of long strings with ":" as a delimiter, and I want to split each string on that delimiter into 8 columns.
Assuming you use DataFrames.jl you can use the following.
First generate some test data:
julia> using DataFrames
julia> df = DataFrame(in = join.(eachrow(rand(1:9, 10, 8)), ":"))
10×1 DataFrame
Row │ in
│ String
─────┼─────────────────
1 │ 8:2:9:4:1:9:3:1
2 │ 9:6:9:9:8:1:9:5
3 │ 2:4:9:8:5:4:8:7
4 │ 8:2:2:9:5:3:7:7
5 │ 1:4:6:1:3:9:2:1
6 │ 8:6:1:5:1:4:8:8
7 │ 4:6:4:4:4:4:8:8
8 │ 4:3:3:5:1:4:3:4
9 │ 9:5:5:7:5:3:4:3
10 │ 4:5:8:5:2:5:7:4
The splitting could be done in many ways (we are fully flexible in how you can split your column). Here I just use the split function.
If you want auto-generated column names use:
julia> transform(df, :in => ByRow(x -> split(x, ":")) => AsTable)
10×9 DataFrame
Row │ in x1 x2 x3 x4 x5 x6 x7 x8
│ String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 8:2:9:4:1:9:3:1 8 2 9 4 1 9 3 1
2 │ 9:6:9:9:8:1:9:5 9 6 9 9 8 1 9 5
3 │ 2:4:9:8:5:4:8:7 2 4 9 8 5 4 8 7
4 │ 8:2:2:9:5:3:7:7 8 2 2 9 5 3 7 7
5 │ 1:4:6:1:3:9:2:1 1 4 6 1 3 9 2 1
6 │ 8:6:1:5:1:4:8:8 8 6 1 5 1 4 8 8
7 │ 4:6:4:4:4:4:8:8 4 6 4 4 4 4 8 8
8 │ 4:3:3:5:1:4:3:4 4 3 3 5 1 4 3 4
9 │ 9:5:5:7:5:3:4:3 9 5 5 7 5 3 4 3
10 │ 4:5:8:5:2:5:7:4 4 5 8 5 2 5 7 4
Alternatively pass your own column names:
julia> transform(df, :in => ByRow(x -> split(x, ":")) => "out" .* string.(1:8))
10×9 DataFrame
Row │ in out1 out2 out3 out4 out5 out6 out7 out8
│ String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 8:2:9:4:1:9:3:1 8 2 9 4 1 9 3 1
2 │ 9:6:9:9:8:1:9:5 9 6 9 9 8 1 9 5
3 │ 2:4:9:8:5:4:8:7 2 4 9 8 5 4 8 7
4 │ 8:2:2:9:5:3:7:7 8 2 2 9 5 3 7 7
5 │ 1:4:6:1:3:9:2:1 1 4 6 1 3 9 2 1
6 │ 8:6:1:5:1:4:8:8 8 6 1 5 1 4 8 8
7 │ 4:6:4:4:4:4:8:8 4 6 4 4 4 4 8 8
8 │ 4:3:3:5:1:4:3:4 4 3 3 5 1 4 3 4
9 │ 9:5:5:7:5:3:4:3 9 5 5 7 5 3 4 3
10 │ 4:5:8:5:2:5:7:4 4 5 8 5 2 5 7 4
Note that the benefit of allowing a custom parser is that you can convert the parts of your original string to their numeric values in one shot, like this:
julia> transform(df, :in => ByRow(x -> parse.(Int, split(x, ":"))) => AsTable)
10×9 DataFrame
Row │ in x1 x2 x3 x4 x5 x6 x7 x8
│ String Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 8:2:9:4:1:9:3:1 8 2 9 4 1 9 3 1
2 │ 9:6:9:9:8:1:9:5 9 6 9 9 8 1 9 5
3 │ 2:4:9:8:5:4:8:7 2 4 9 8 5 4 8 7
4 │ 8:2:2:9:5:3:7:7 8 2 2 9 5 3 7 7
5 │ 1:4:6:1:3:9:2:1 1 4 6 1 3 9 2 1
6 │ 8:6:1:5:1:4:8:8 8 6 1 5 1 4 8 8
7 │ 4:6:4:4:4:4:8:8 4 6 4 4 4 4 8 8
8 │ 4:3:3:5:1:4:3:4 4 3 3 5 1 4 3 4
9 │ 9:5:5:7:5:3:4:3 9 5 5 7 5 3 4 3
10 │ 4:5:8:5:2:5:7:4 4 5 8 5 2 5 7 4
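tidyr's separate() drops the input column by default. To mimic that, select can be used in place of transform, since it keeps only the columns it is asked for. A minimal sketch on made-up sample rows with three parts each; the column names a, b, c are placeholders:

```julia
using DataFrames

df = DataFrame(in = ["8:2:9", "9:6:9", "2:4:9"])  # made-up sample rows

# select keeps only the newly created columns, so :in is dropped
parts = select(df, :in => ByRow(x -> parse.(Int, split(x, ":"))) => ["a", "b", "c"])
```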
Here is another way to do it. Not as composable as the above but showing you the flexibility of the ecosystem:
julia> using CSV
julia> CSV.read(IOBuffer(join(df.in, "\n")), DataFrame, header=false, delim=":")
10×8 DataFrame
Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────────────────────
1 │ 8 2 9 4 1 9 3 1
2 │ 9 6 9 9 8 1 9 5
3 │ 2 4 9 8 5 4 8 7
4 │ 8 2 2 9 5 3 7 7
5 │ 1 4 6 1 3 9 2 1
6 │ 8 6 1 5 1 4 8 8
7 │ 4 6 4 4 4 4 8 8
8 │ 4 3 3 5 1 4 3 4
9 │ 9 5 5 7 5 3 4 3
10 │ 4 5 8 5 2 5 7 4
julia> CSV.read(IOBuffer(join(df.in, "\n")), DataFrame, header="out" .* string.(1:8), delim=":")
10×8 DataFrame
Row │ out1 out2 out3 out4 out5 out6 out7 out8
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────
1 │ 8 2 9 4 1 9 3 1
2 │ 9 6 9 9 8 1 9 5
3 │ 2 4 9 8 5 4 8 7
4 │ 8 2 2 9 5 3 7 7
5 │ 1 4 6 1 3 9 2 1
6 │ 8 6 1 5 1 4 8 8
7 │ 4 6 4 4 4 4 8 8
8 │ 4 3 3 5 1 4 3 4
9 │ 9 5 5 7 5 3 4 3
10 │ 4 5 8 5 2 5 7 4

DataFrames.jl - leftjoin() - retain left dataframe index positioning

Using leftjoin() to join two dataframes df1 and df2.
df1 = DataFrame(x1 = collect(1:1:10), x2 = fill(1.0,10))
Row │ x1 x2
│ Int64 Float64
─────┼────────────────
1 │ 1 1.0
2 │ 2 1.0
3 │ 3 1.0
4 │ 4 1.0
5 │ 5 1.0
6 │ 6 1.0
7 │ 7 1.0
8 │ 8 1.0
9 │ 9 1.0
10 │ 10 1.0
df2 = DataFrame(x1 = collect(1:2:10), x2 = fill(1.0,5))
Row │ x1 x2
│ Int64 Float64
─────┼────────────────
1 │ 1 1.0
2 │ 3 1.0
3 │ 5 1.0
4 │ 7 1.0
5 │ 9 1.0
out_df = leftjoin(df1,df2, on = :x1, makeunique=true)
which gives the output:
Row │ x1 x2 x2_1
│ Int64 Float64? Float64?
─────┼────────────────────────────
1 │ 1 1.0 1.0
2 │ 3 1.0 1.0
3 │ 5 1.0 1.0
4 │ 7 1.0 1.0
5 │ 9 1.0 1.0
6 │ 2 1.0 missing
7 │ 4 1.0 missing
8 │ 6 1.0 missing
9 │ 8 1.0 missing
10 │ 10 1.0 missing
My question: df1 has 10 rows and df2 has 5. I want df1 to be the 'master' data frame and to keep its original row order, so that after the join the df2 values slot in where :x1 matches and missing is filled in where it does not. The desired output is:
Row │ x1 x2 x2_1
│ Int64 Float64? Float64?
─────┼────────────────────────────
1 │ 1 1.0 1.0
2 │ 2 1.0 missing
3 │ 3 1.0 1.0
4 │ 4 1.0 missing
5 │ 5 1.0 1.0
6 │ 6 1.0 missing
7 │ 7 1.0 1.0
8 │ 8 1.0 missing
9 │ 9 1.0 1.0
10 │ 10 1.0 missing
Is there any way I can achieve this?
This is a feature we plan to add in the future, see https://github.com/JuliaData/DataFrames.jl/issues/2753.
For now, before we add the requested functionality, add a column to your left data frame with row id (in your example there is already such a column :x1) and sort the result on this column.
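A minimal sketch of that workaround, using an explicit helper column so it also applies when the key is not already in order; the column name rowid is made up for this example:

```julia
using DataFrames

df1 = DataFrame(x1 = collect(1:1:10), x2 = fill(1.0, 10))
df2 = DataFrame(x1 = collect(1:2:10), x2 = fill(1.0, 5))

# Remember the original position of every row of the left table
df1.rowid = collect(1:nrow(df1))

out_df = leftjoin(df1, df2, on = :x1, makeunique = true)

# Restore the left table's order, then drop the helper column
sort!(out_df, :rowid)
select!(out_df, Not(:rowid))
```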
