DataFrames.jl - leftjoin() - retain left dataframe index positioning - julia

I am using leftjoin() to join two data frames, df1 and df2:
df1 = DataFrame(x1 = collect(1:1:10), x2 = fill(1.0,10))
Row │ x1 x2
│ Int64 Float64
─────┼────────────────
1 │ 1 1.0
2 │ 2 1.0
3 │ 3 1.0
4 │ 4 1.0
5 │ 5 1.0
6 │ 6 1.0
7 │ 7 1.0
8 │ 8 1.0
9 │ 9 1.0
10 │ 10 1.0
df2 = DataFrame(x1 = collect(1:2:10), x2 = fill(1.0,5))
Row │ x1 x2
│ Int64 Float64
─────┼────────────────
1 │ 1 1.0
2 │ 3 1.0
3 │ 5 1.0
4 │ 7 1.0
5 │ 9 1.0
out_df = leftjoin(df1,df2, on = :x1, makeunique=true)
which gives the output:
Row │ x1 x2 x2_1
│ Int64 Float64? Float64?
─────┼────────────────────────────
1 │ 1 1.0 1.0
2 │ 3 1.0 1.0
3 │ 5 1.0 1.0
4 │ 7 1.0 1.0
5 │ 9 1.0 1.0
6 │ 2 1.0 missing
7 │ 4 1.0 missing
8 │ 6 1.0 missing
9 │ 8 1.0 missing
10 │ 10 1.0 missing
My question: df1 has 10 rows and df2 has 5 rows. I want df1 to be the 'master' data frame, if you will, and to retain its original row order after the join, so that df2 values slot into the matching df1 rows and non-matching rows get missing. The desired output is:
Row │ x1 x2 x2_1
│ Int64 Float64? Float64?
─────┼────────────────────────────
1 │ 1 1.0 1.0
2 │ 2 1.0 missing
3 │ 3 1.0 1.0
4 │ 4 1.0 missing
5 │ 5 1.0 1.0
6 │ 6 1.0 missing
7 │ 7 1.0 1.0
8 │ 8 1.0 missing
9 │ 9 1.0 1.0
10 │ 10 1.0 missing
Is there any way I can achieve this?

This is a feature we plan to add in the future, see https://github.com/JuliaData/DataFrames.jl/issues/2753.
For now, before we add the requested functionality, add a column to your left data frame with row id (in your example there is already such a column :x1) and sort the result on this column.
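A minimal sketch of that workaround, using the df1 and df2 from the question. The helper column name :rowid is an assumption made for illustration; any unused column name would do.

```julia
using DataFrames

df1 = DataFrame(x1 = collect(1:10), x2 = fill(1.0, 10))
df2 = DataFrame(x1 = collect(1:2:10), x2 = fill(1.0, 5))

# Record the original row positions of the left table in a helper column
# (:rowid is a hypothetical name chosen for this sketch).
left = insertcols!(copy(df1), :rowid => 1:nrow(df1))

out_df = leftjoin(left, df2, on = :x1, makeunique = true)

# Restore the left table's row order, then drop the helper column.
sort!(out_df, :rowid)
select!(out_df, Not(:rowid))
```

Since :x1 already identifies the rows of df1 in this example, sort!(out_df, :x1) alone would give the same result. More recent DataFrames.jl releases added an order keyword to the join functions (order = :left), which, if available in your version, avoids the helper column.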

Related

Julia transpose grouped data in DataFrames?

ds = Dataset(group = repeat(1:3, inner = 2),
             b = repeat(1:2, inner = 3),
             c = repeat(1:1, inner = 6),
             d = repeat(1:6, inner = 1),
             e = string.('a':'f'))
In the InMemoryDatasets package, we can transpose grouped data like this:
# transpose by group
transpose(groupby(ds, :group), 2:4)
How can I do this in the DataFrames package?
How can I do this in R?
result:
Row │ group variable 1 2
│ Int64 String Int64? Int64?
─────┼─────────────────────────────────
1 │ 1 b 1 1
2 │ 1 c 1 1
3 │ 1 d 1 2
4 │ 2 b 1 2
5 │ 2 c 1 1
6 │ 2 d 3 4
7 │ 3 b 2 2
8 │ 3 c 1 1
9 │ 3 d 5 6
Answer (attempt) regarding Julia DataFrames part of the question:
First creating the DataFrame:
df = DataFrame(group = repeat(1:3, inner = 2),
               b = repeat(1:2, inner = 3),
               c = repeat(1:1, inner = 6),
               d = repeat(1:6, inner = 1),
               e = string.('a':'f'))
Next, since the transpose operation depends on row ordering, we fix a row ordering in the groups:
julia> ordereddf = transform(DataFrames.groupby(df, :group),"group" => (x->1:length(x)) => "rn")[:,Not(:e)]
6×5 DataFrame
Row │ group b c d rn
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 1 1 1 1
2 │ 1 1 1 2 2
3 │ 2 1 1 3 1
4 │ 2 2 1 4 2
5 │ 3 2 1 5 1
6 │ 3 2 1 6 2
Finally, the stack/unstack combination does the transposing:
julia> sort!(unstack(stack(ordereddf,[:b,:c,:d]),:rn, :value),:group)
9×4 DataFrame
Row │ group variable 1 2
│ Int64 String Int64? Int64?
─────┼─────────────────────────────────
1 │ 1 b 1 1
2 │ 1 c 1 1
3 │ 1 d 1 2
4 │ 2 b 1 2
5 │ 2 c 1 1
6 │ 2 d 3 4
7 │ 3 b 2 2
8 │ 3 c 1 1
9 │ 3 d 5 6
Feels like there might be easier ways to do this, but in general, transpose is rarely appropriate for database-like tables, and if it is appropriate, then maybe a matrix should have been used to store information in the first place.
The R part is left for someone else to answer.

Add a column with a constant value to a DataFrame

How can I add a column with a constant value to a DataFrame?
E.g. I have the following DataFrame:
using DataFrames
df = DataFrame(x = 1:10, y = 'a':'j')
And I would like to add a new variable z with constant value 1 and obtain:
10×3 DataFrame
Row │ x y z
│ Int64 Char Int64
─────┼────────────────────
1 │ 1 a 1
2 │ 2 b 1
3 │ 3 c 1
4 │ 4 d 1
5 │ 5 e 1
6 │ 6 f 1
7 │ 7 g 1
8 │ 8 h 1
9 │ 9 i 1
10 │ 10 j 1
To create such a column together with the data frame:
df = DataFrame(x = 1:10, y = 'a':'j', d = 1)
To append such a column to an existing DataFrame, you need broadcasting:
df.e .= 1
or
df[:, "f"] .= 1
A more general alternative is:
julia> insertcols!(df, :z => 1)
10×3 DataFrame
Row │ x y z
│ Int64 Char Int64
─────┼────────────────────
1 │ 1 a 1
2 │ 2 b 1
3 │ 3 c 1
4 │ 4 d 1
5 │ 5 e 1
6 │ 6 f 1
7 │ 7 g 1
8 │ 8 h 1
9 │ 9 i 1
10 │ 10 j 1
which by default does the same, but additionally:
- allows you to specify the location of the new column;
- by default makes sure that you do not accidentally overwrite an existing column.
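A short sketch of those two extra behaviours of insertcols!, assuming a DataFrames.jl 1.x release. The position 1 and the value 2 are arbitrary choices for illustration.

```julia
using DataFrames

df = DataFrame(x = 1:10, y = 'a':'j')

# Insert the constant column at position 1 instead of appending it at the end.
insertcols!(df, 1, :z => 1)

# Inserting :z a second time is rejected by default, so the existing
# column is not silently overwritten.
threw = try
    insertcols!(df, :z => 2)
    false
catch err
    err isa ArgumentError
end

# With makeunique = true the new column is added under a deduplicated name.
insertcols!(df, :z => 2, makeunique = true)
```

After the last call the data frame should hold the columns z, x, y and z_1, with the original z left untouched.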

Group DataFrame by sequential occurrence of values in a column in Julia

Suppose I have the following DataFrame:
julia> Random.seed!(1)
TaskLocalRNG()
julia> df = DataFrame(data = rand(1:10, 10), gr = rand([0, 1], 10))
10×2 DataFrame
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 4 0
3 │ 7 0
4 │ 7 0
5 │ 10 1
6 │ 2 1
7 │ 8 0
8 │ 8 0
9 │ 7 0
10 │ 2 0
What I want is to group not only by the value of :gr but also by consecutive runs of that value. In this case, the number of groups should be 4:
Group 1 (1 row)
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
Group 2 (3 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
2 │ 4 0
3 │ 7 0
4 │ 7 0
Group 3 (2 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
5 │ 10 1
6 │ 2 1
Group 4 (4 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
7 │ 8 0
8 │ 8 0
9 │ 7 0
10 │ 2 0
If I group by the column :gr, however, I could only get two groups:
julia> groupby(df, :gr)
GroupedDataFrame with 2 groups based on key: gr
First Group (7 rows): gr = 0
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 4 0
2 │ 7 0
3 │ 7 0
4 │ 8 0
5 │ 8 0
6 │ 7 0
7 │ 2 0
⋮
Last Group (3 rows): gr = 1
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 10 1
3 │ 2 1
How could I implement this in Julia DataFrames.jl? Thanks
versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.5.0)
CPU: Apple M1 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, westmere)
Environment:
JULIA_NUM_THREADS = 1
JULIA_EDITOR = code
You can group by a new variable that increments each time :gr changes value.
For example:
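n = nrow(df)  # avoid a variable called nrow, which would clash with DataFrames.nrow
gr1 = zeros(Int, n)
for i in 2:n
    # increment the group counter whenever :gr differs from the previous row
    gr1[i] = gr1[i-1] + (df.gr[i] != df.gr[i-1])
end
df.gr1 = gr1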
Same basic idea as @GEK's answer, with a more vectorized implementation:
julia> edgedetect(col) = [0; abs.(diff(col))] |> cumsum
edgedetect (generic function with 1 method)
julia> edgedetect([0, 1, 1, 1, 0, 0, 1]) |> print
[0, 1, 1, 1, 2, 2, 3]
For a 0/1 column such as :gr, abs.(diff(col)) places a 1 wherever the value of col changes and a 0 elsewhere. (diff returns n-1 differences given n elements, so we prefix the result with a 0 to maintain the column length.) Taking the cumulative sum of this gives a new column that increases every time the value in the original column changes.
We can then use this function to groupby on a transformed dataframe, like this:
julia> groupby(transform(df, :gr => edgedetect => :gr_edges, copycols = false), :gr_edges) |> print
GroupedDataFrame with 4 groups based on key: gr_edges
Group 1 (1 row): gr_edges = 0
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 1 1 0
Group 2 (3 rows): gr_edges = 1
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 4 0 1
2 │ 7 0 1
3 │ 7 0 1
Group 3 (2 rows): gr_edges = 2
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 10 1 2
2 │ 2 1 2
Group 4 (4 rows): gr_edges = 3
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 8 0 3
2 │ 8 0 3
3 │ 7 0 3
4 │ 2 0 3
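The edgedetect helper above relies on diff, so it needs a numeric column, and for values other than 0/1 the labels jump by the size of the change rather than by 1. A hedged variant that compares neighbours with != instead is sketched below; runid is a hypothetical name introduced here, and the sketch assumes the column contains no missing values.

```julia
# Label consecutive runs of equal values: 0 for the first run, 1 for the next, ...
# (runid is a hypothetical helper name, not part of DataFrames.jl).
function runid(col::AbstractVector)
    isempty(col) && return Int[]
    changed = col[2:end] .!= col[1:end-1]   # true wherever a new run starts
    return cumsum(vcat(0, Int.(changed)))
end

runid([0, 1, 1, 1, 0, 0, 1])    # same labels as edgedetect for 0/1 data
runid(["a", "a", "b", "a"])     # also works for non-numeric values
```

It can be dropped into the same pattern, for example groupby(transform(df, :gr => runid => :run), :run).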

Stack multiple columns into two columns in DataFrames.jl

Suppose my source and target data frames are df1 and df2, as illustrated below.
How can I convert df1 to df2 using stack? Thanks.
julia> df1 = DataFrame(x1 = 1:4, x2 = 5:8, y1 = Char.(65:68), y2 = Char.(69:72))
4×4 DataFrame
Row │ x1 x2 y1 y2
│ Int64 Int64 Char Char
─────┼──────────────────────────
1 │ 1 5 A E
2 │ 2 6 B F
3 │ 3 7 C G
4 │ 4 8 D H
julia> df2 = DataFrame(x = [df1.x1; df1.x2], y = [df1.y1; df1.y2])
8×2 DataFrame
Row │ x y
│ Int64 Char
─────┼─────────────
1 │ 1 A
2 │ 2 B
3 │ 3 C
4 │ 4 D
5 │ 5 E
6 │ 6 F
7 │ 7 G
8 │ 8 H
What you want is not a regular stack.
Obviously, you can stack the x and y columns separately and then hcat the results together to get your df2. But here is a different approach.
julia> long = stack(df1,All())
16×2 DataFrame
Row │ variable value
│ String Any
─────┼─────────────────
1 │ x1 1
2 │ x1 2
3 │ x1 3
4 │ x1 4
5 │ x2 5
6 │ x2 6
7 │ x2 7
8 │ x2 8
9 │ y1 A
10 │ y1 B
11 │ y1 C
12 │ y1 D
13 │ y2 E
14 │ y2 F
15 │ y2 G
16 │ y2 H
Turn all of x1, x2, y1, y2 into x, x, y, y and add a new unique id column:
julia> long.variable = SubString.(long.variable,1,1);
julia> long.id = repeat(1:8,2);
julia> long
16×3 DataFrame
Row │ variable value id
│ SubStrin… Any Int64
─────┼─────────────────────────
1 │ x 1 1
2 │ x 2 2
3 │ x 3 3
4 │ x 4 4
5 │ x 5 5
6 │ x 6 6
7 │ x 7 7
8 │ x 8 8
9 │ y A 1
10 │ y B 2
11 │ y C 3
12 │ y D 4
13 │ y E 5
14 │ y F 6
15 │ y G 7
16 │ y H 8
Finally, unstack back:
julia> unstack(long, :variable, :value)
8×3 DataFrame
Row │ id x y
│ Int64 Any Any
─────┼─────────────────
1 │ 1 1 A
2 │ 2 2 B
3 │ 3 3 C
4 │ 4 4 D
5 │ 5 5 E
6 │ 6 6 F
7 │ 7 7 G
8 │ 8 8 H
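The stack/unstack route above leaves x and y with element type Any, because stack mixes integers and characters in a single value column. If concrete column types matter, one alternative is to rename each pair of columns and concatenate the pieces. Note that this swaps in select and vcat rather than stack, so it is a different technique from the one asked about:

```julia
using DataFrames

df1 = DataFrame(x1 = 1:4, x2 = 5:8, y1 = Char.(65:68), y2 = Char.(69:72))

# Take each (x, y) column pair under common names, then stack the pieces vertically.
df2 = vcat(
    select(df1, :x1 => :x, :y1 => :y),
    select(df1, :x2 => :x, :y2 => :y),
)
```

Here x stays an Int64 column and y stays a Char column, and no id column is needed.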

Is there a way to subtract multiple dataframe columns at once?

I'm pretty new to Julia, so I apologize if this is a super basic question. From R I'm used to doing basic operations on multiple columns of a dataframe at once. I tried to do this in Julia the following way:
I have two dataframes, let's call them data_1 and data_2:
using DataFrames
data_1 = DataFrame(rand(4,6))
data_2 = DataFrame(zeros(4,6))
And now I want to fill data_2 as the difference of certain rows from data_1, e.g.:
data_2[1,:] = data_1[1,:] - data_1[2,:]
but this produces an error. So how can I modify this approach to successfully subtract multiple columns of dataframe rows?
Thank you very much!
This particular task is unfortunately a bit tricky, as we try to retain consistency with Julia Base.
Here are the ways to do it:
Option 1
Use iteration:
julia> data_1 = DataFrame(reshape(1:24, 4, 6))
4×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 5 │ 9 │ 13 │ 17 │ 21 │
│ 2 │ 2 │ 6 │ 10 │ 14 │ 18 │ 22 │
│ 3 │ 3 │ 7 │ 11 │ 15 │ 19 │ 23 │
│ 4 │ 4 │ 8 │ 12 │ 16 │ 20 │ 24 │
julia> data_2 = DataFrame(zeros(4,6))
4×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
julia> foreach(i -> data_2[1,i] = data_1[1, i] - data_1[2, i], axes(data_1, 2))
julia> data_2
4×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
Option 2
Use broadcasting of data frames:
julia> data_1 = DataFrame(reshape(1:24, 4, 6))
4×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 5 │ 9 │ 13 │ 17 │ 21 │
│ 2 │ 2 │ 6 │ 10 │ 14 │ 18 │ 22 │
│ 3 │ 3 │ 7 │ 11 │ 15 │ 19 │ 23 │
│ 4 │ 4 │ 8 │ 12 │ 16 │ 20 │ 24 │
julia> data_2
4×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
julia> data_2[1:1,:] .= data_1[1:1,:] .- data_1[2:2,:]
1×6 SubDataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │
julia> data_2
4×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
As you can see the trick is to use broadcasting (the .) and slices (1:1 etc.) not single indices.
The problem with single indices is that DataFrameRow does not currently support broadcasting:
julia> data_2[1,:] .= data_1[1,:] .- data_1[2,:]
ERROR: ArgumentError: broadcasting over `DataFrameRow`s is reserved
because it is undecided how broadcasting will work for NamedTuple objects in Base, as you can see here:
julia> (a=1,b=2) .- (a=1,b=2)
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
(once Base supports broadcasting over NamedTuples we will add this support to DataFrameRows)
Option 3
This works around the lack of broadcasting support for DataFrameRow objects:
julia> data_1 = DataFrame(reshape(1:24, 4, 6))
4×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ 1 │ 1 │ 5 │ 9 │ 13 │ 17 │ 21 │
│ 2 │ 2 │ 6 │ 10 │ 14 │ 18 │ 22 │
│ 3 │ 3 │ 7 │ 11 │ 15 │ 19 │ 23 │
│ 4 │ 4 │ 8 │ 12 │ 16 │ 20 │ 24 │
julia> data_2 = DataFrame(zeros(4,6))
4×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
julia> data_2[1,:] = Vector(data_1[1,:]) - Vector(data_1[2,:])
DataFrameRow
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │
julia> data_2
4×6 DataFrame
│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │ -1.0 │
│ 2 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 3 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
│ 4 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │
As you can see, the trick is to convert the right-hand side into Vectors, which support -.
Finally (as an additional reference that might be useful in some cases), you can write Vector(data_1[1,:]) - Vector(data_1[2,:]) more compactly as:
julia> -(Vector.((data_1[1,:],data_1[2,:]))...)
6-element Array{Int64,1}:
-1
-1
-1
-1
-1
-1
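The transcripts above were produced with an older DataFrames.jl release. On DataFrames.jl 1.x, building a data frame from a matrix requires column names or the :auto argument. A sketch of Option 2 under that assumption follows; data_3 is a hypothetical extra name used to show the same broadcast over several rows at once.

```julia
using DataFrames

# :auto generates the column names x1, x2, ... for a matrix input.
data_1 = DataFrame(reshape(1:24, 4, 6), :auto)
data_2 = DataFrame(zeros(4, 6), :auto)

# One-row slices (1:1, 2:2) are data frames, so broadcasting is allowed.
data_2[1:1, :] .= data_1[1:1, :] .- data_1[2:2, :]

# The same idea for several rows: rows 1:3 minus rows 2:4, column by column.
data_3 = data_1[1:3, :] .- data_1[2:4, :]
```

Broadcasting between data frames requires both operands to have the same column names, which holds here because both are slices of data_1.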
