ds = Dataset(group = repeat(1:3, inner = 2),
b = repeat(1:2, inner = 3),
c = repeat(1:1, inner = 6),
d = repeat(1:6, inner = 1),
e = string.('a':'f'))
In the InMemoryDatasets package, we can transpose grouped data like this.
#transpose by group
transpose(groupby(ds, :group), 2:4)
How can I do this in the DataFrames package?
How can I do this in R?
Result:
Row │ group variable 1 2
│ Int64 String Int64? Int64?
─────┼─────────────────────────────────
1 │ 1 b 1 1
2 │ 1 c 1 1
3 │ 1 d 1 2
4 │ 2 b 1 2
5 │ 2 c 1 1
6 │ 2 d 3 4
7 │ 3 b 2 2
8 │ 3 c 1 1
9 │ 3 d 5 6
Answer (attempt) regarding the Julia DataFrames part of the question:
First creating the DataFrame:
df = DataFrame(group = repeat(1:3, inner = 2),
b = repeat(1:2, inner = 3),
c = repeat(1:1, inner = 6),
d = repeat(1:6, inner = 1),
e = string.('a':'f'))
Next, since the transpose operation depends on row ordering, we fix a row ordering in the groups:
julia> ordereddf = transform(DataFrames.groupby(df, :group),"group" => (x->1:length(x)) => "rn")[:,Not(:e)]
6×5 DataFrame
Row │ group b c d rn
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 1 1 1 1
2 │ 1 1 1 2 2
3 │ 2 1 1 3 1
4 │ 2 2 1 4 2
5 │ 3 2 1 5 1
6 │ 3 2 1 6 2
Finally, the stack/unstack combination does the transposing:
julia> sort!(unstack(stack(ordereddf,[:b,:c,:d]),:rn, :value),:group)
9×4 DataFrame
Row │ group variable 1 2
│ Int64 String Int64? Int64?
─────┼─────────────────────────────────
1 │ 1 b 1 1
2 │ 1 c 1 1
3 │ 1 d 1 2
4 │ 2 b 1 2
5 │ 2 c 1 1
6 │ 2 d 3 4
7 │ 3 b 2 2
8 │ 3 c 1 1
9 │ 3 d 5 6
It feels like there might be easier ways to do this. In general, though, transposing is rarely appropriate for database-like tables, and where it is appropriate, a matrix might have been the better way to store the information in the first place.
The R part is left for someone else to answer.
How can I add a column with a constant value to a DataFrame?
E.g. I have the following DataFrame:
using DataFrames
df = DataFrame(x = 1:10, y = 'a':'j')
And I would like to add a new variable z with constant value 1 and obtain:
10×3 DataFrame
Row │ x y z
│ Int64 Char Int64
─────┼────────────────────
1 │ 1 a 1
2 │ 2 b 1
3 │ 3 c 1
4 │ 4 d 1
5 │ 5 e 1
6 │ 6 f 1
7 │ 7 g 1
8 │ 8 h 1
9 │ 9 i 1
10 │ 10 j 1
To create such a column:
df = DataFrame(x = 1:10, y = 'a':'j', d = 1)
To append such a column to an existing DataFrame, you need broadcasting:
df.e .= 1
or
df[:, "f"] .= 1
A more general alternative is:
julia> insertcols!(df, :z => 1)
10×3 DataFrame
Row │ x y z
│ Int64 Char Int64
─────┼────────────────────
1 │ 1 a 1
2 │ 2 b 1
3 │ 3 c 1
4 │ 4 d 1
5 │ 5 e 1
6 │ 6 f 1
7 │ 7 g 1
8 │ 8 h 1
9 │ 9 i 1
10 │ 10 j 1
which by default does the same, but it additionally:
allows you to specify the location of the new column;
by default makes sure that you do not accidentally overwrite an existing column
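The two extras listed above can be sketched as follows. This is a minimal illustration on a made-up three-row data frame; the variable name overwrite_refused is hypothetical and only records whether the second insertion was rejected.

```julia
using DataFrames

df = DataFrame(x = 1:3, y = 'a':'c')

# Put the constant column z in the first position instead of the last.
insertcols!(df, 1, :z => 1)

# A second column named z is rejected by default (an error is thrown)...
overwrite_refused = try
    insertcols!(df, :z => 2)
    false
catch
    true
end

# ...unless makeunique=true, which stores it under a deduplicated name.
insertcols!(df, :z => 2, makeunique = true)
println(names(df))
```

With plain broadcasting assignment (df.z .= 2) the existing column would be overwritten silently, which is the accident insertcols! guards against.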
Suppose I have the following DataFrame:
julia> Random.seed!(1)
TaskLocalRNG()
julia> df = DataFrame(data = rand(1:10, 10), gr = rand([0, 1], 10))
10×2 DataFrame
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 4 0
3 │ 7 0
4 │ 7 0
5 │ 10 1
6 │ 2 1
7 │ 8 0
8 │ 8 0
9 │ 7 0
10 │ 2 0
What I want is to group not only by the value of :gr but also by where those values occur, so that each consecutive run of equal values forms its own group. In this case, the number of groups should be 4:
Group 1 (1 row)
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
Group 2 (3 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
2 │ 4 0
3 │ 7 0
4 │ 7 0
Group 3 (2 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
5 │ 10 1
6 │ 2 1
Group 4 (4 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
7 │ 8 0
8 │ 8 0
9 │ 7 0
10 │ 2 0
If I group by the column :gr, however, I could only get two groups:
julia> groupby(df, :gr)
GroupedDataFrame with 2 groups based on key: gr
First Group (7 rows): gr = 0
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 4 0
2 │ 7 0
3 │ 7 0
4 │ 8 0
5 │ 8 0
6 │ 7 0
7 │ 2 0
⋮
Last Group (3 rows): gr = 1
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 10 1
3 │ 2 1
How could I implement this in Julia DataFrames.jl? Thanks
versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.5.0)
CPU: Apple M1 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, westmere)
Environment:
JULIA_NUM_THREADS = 1
JULIA_EDITOR = code
You can group by a new variable that increments each time :gr changes value.
For example:
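n = nrow(df)   # avoid naming a variable nrow, which would clash with DataFrames.nrow
gr1 = zeros(Int, n)
for i in 2:n
    # increment the run counter whenever :gr differs from the previous row
    gr1[i] = gr1[i-1] + (df.gr[i] != df.gr[i-1])
end
df.gr1 = gr1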
Same basic idea as @GEK's answer, with a more vectorized implementation:
julia> edgedetect(col) = [0; abs.(diff(col))] |> cumsum
edgedetect (generic function with 1 method)
julia> edgedetect([0, 1, 1, 1, 0, 0, 1]) |> print
[0, 1, 1, 1, 2, 2, 3]
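For a 0/1 column such as :gr, abs.(diff(col)) places a 1 wherever the value of column col changes, and a 0 elsewhere. (diff returns n-1 differences given n elements, so we prefix the result with a 0 to maintain column length.) Taking a cumulative sum of this gives a new column that increases every time the value in the original column changes. For other numeric columns the absolute difference can exceed 1, so the labels still differ between runs but are no longer consecutive integers.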
We can then use this function to groupby on a transformed dataframe, like this:
julia> groupby(transform(df, :gr => edgedetect => :gr_edges, copycols = false), :gr_edges) |> print
GroupedDataFrame with 4 groups based on key: gr_edges
Group 1 (1 row): gr_edges = 0
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 1 1 0
Group 2 (3 rows): gr_edges = 1
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 4 0 1
2 │ 7 0 1
3 │ 7 0 1
Group 3 (2 rows): gr_edges = 2
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 10 1 2
2 │ 2 1 2
Group 4 (4 rows): gr_edges = 3
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 8 0 3
2 │ 8 0 3
3 │ 7 0 3
4 │ 2 0 3
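Since edgedetect relies on diff, it needs a numeric column. A minimal sketch of the same idea for arbitrary values compares each element with its predecessor instead. The name runid is made up for illustration, and a non-empty column without missing values is assumed.

```julia
# Hypothetical helper (not part of DataFrames.jl): label consecutive runs.
# The label starts at 0 and increases by 1 whenever an element differs
# from the one before it. Assumes `col` is a non-empty vector.
runid(col) = cumsum([0; Int.(col[2:end] .!= col[1:end-1])])

println(runid([0, 1, 1, 1, 0, 0, 1]))
println(runid(["a", "a", "c", "a"]))
```

It can be passed to transform in place of edgedetect, for example transform(df, :gr => runid => :gr_run), and the result grouped on :gr_run.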
Suppose my source DataFrame and target DataFrame are df1 and df2, as illustrated below.
How can I convert df1 to df2 using stack? Thanks.
julia> df1 = DataFrame(x1 = 1:4, x2 = 5:8, y1 = Char.(65:68), y2 = Char.(69:72))
4×4 DataFrame
Row │ x1 x2 y1 y2
│ Int64 Int64 Char Char
─────┼──────────────────────────
1 │ 1 5 A E
2 │ 2 6 B F
3 │ 3 7 C G
4 │ 4 8 D H
julia> df2 = DataFrame(x = [df1.x1; df1.x2], y = [df1.y1; df1.y2])
8×2 DataFrame
Row │ x y
│ Int64 Char
─────┼─────────────
1 │ 1 A
2 │ 2 B
3 │ 3 C
4 │ 4 D
5 │ 5 E
6 │ 6 F
7 │ 7 G
8 │ 8 H
What you want is not a regular stack.
Obviously, you can stack x and y separately and then hcat the results together to get your df2. But here is a different approach.
julia> long = stack(df1,All())
16×2 DataFrame
Row │ variable value
│ String Any
─────┼─────────────────
1 │ x1 1
2 │ x1 2
3 │ x1 3
4 │ x1 4
5 │ x2 5
6 │ x2 6
7 │ x2 7
8 │ x2 8
9 │ y1 A
10 │ y1 B
11 │ y1 C
12 │ y1 D
13 │ y2 E
14 │ y2 F
15 │ y2 G
16 │ y2 H
Turn all of x1, x2, y1, y2 into x, x, y, y and add a new unique id column:
julia> long.variable = SubString.(long.variable,1,1);
julia> long.id = repeat(1:8,2);
julia> long
16×3 DataFrame
Row │ variable value id
│ SubStrin… Any Int64
─────┼─────────────────────────
1 │ x 1 1
2 │ x 2 2
3 │ x 3 3
4 │ x 4 4
5 │ x 5 5
6 │ x 6 6
7 │ x 7 7
8 │ x 8 8
9 │ y A 1
10 │ y B 2
11 │ y C 3
12 │ y D 4
13 │ y E 5
14 │ y F 6
15 │ y G 7
16 │ y H 8
Finally, unstack back:
julia> unstack(long, :variable, :value)
8×3 DataFrame
Row │ id x y
│ Int64 Any Any
─────┼─────────────────
1 │ 1 1 A
2 │ 2 2 B
3 │ 3 3 C
4 │ 4 4 D
5 │ 5 5 E
6 │ 6 6 F
7 │ 7 7 G
8 │ 8 8 H
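The simpler route mentioned at the start of this answer, stacking x and y separately and putting the results side by side, could look roughly like this. It is a sketch that assumes the default value column name (value) produced by stack.

```julia
using DataFrames

df1 = DataFrame(x1 = 1:4, x2 = 5:8, y1 = Char.(65:68), y2 = Char.(69:72))

# Stack the x columns and the y columns in two separate calls and keep only
# the resulting `value` column of each; x1 comes before x2, y1 before y2.
df2 = DataFrame(x = stack(df1, [:x1, :x2]).value,
                y = stack(df1, [:y1, :y2]).value)
println(df2)
```

Because integers and characters are never mixed in one column here, x and y keep their concrete element types (Int64 and Char) rather than becoming Any as in the stack(df1, All()) variant.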
Is there a function in Julia that is equivalent to separate() in R?
I have a column containing long strings with ":" as a delimiter, and I want to split each string on that delimiter into 8 columns.
Assuming you use DataFrames.jl you can use the following.
First generate some test data:
julia> using DataFrames
julia> df = DataFrame(in = join.(eachrow(rand(1:9, 10, 8)), ":"))
10×1 DataFrame
Row │ in
│ String
─────┼─────────────────
1 │ 8:2:9:4:1:9:3:1
2 │ 9:6:9:9:8:1:9:5
3 │ 2:4:9:8:5:4:8:7
4 │ 8:2:2:9:5:3:7:7
5 │ 1:4:6:1:3:9:2:1
6 │ 8:6:1:5:1:4:8:8
7 │ 4:6:4:4:4:4:8:8
8 │ 4:3:3:5:1:4:3:4
9 │ 9:5:5:7:5:3:4:3
10 │ 4:5:8:5:2:5:7:4
The splitting could be done in many ways (we are fully flexible in how you can split your column). Here I just use the split function.
If you want auto-generated column names use:
julia> transform(df, :in => ByRow(x -> split(x, ":")) => AsTable)
10×9 DataFrame
Row │ in x1 x2 x3 x4 x5 x6 x7 x8
│ String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 8:2:9:4:1:9:3:1 8 2 9 4 1 9 3 1
2 │ 9:6:9:9:8:1:9:5 9 6 9 9 8 1 9 5
3 │ 2:4:9:8:5:4:8:7 2 4 9 8 5 4 8 7
4 │ 8:2:2:9:5:3:7:7 8 2 2 9 5 3 7 7
5 │ 1:4:6:1:3:9:2:1 1 4 6 1 3 9 2 1
6 │ 8:6:1:5:1:4:8:8 8 6 1 5 1 4 8 8
7 │ 4:6:4:4:4:4:8:8 4 6 4 4 4 4 8 8
8 │ 4:3:3:5:1:4:3:4 4 3 3 5 1 4 3 4
9 │ 9:5:5:7:5:3:4:3 9 5 5 7 5 3 4 3
10 │ 4:5:8:5:2:5:7:4 4 5 8 5 2 5 7 4
Alternatively pass your own column names:
julia> transform(df, :in => ByRow(x -> split(x, ":")) => "out" .* string.(1:8))
10×9 DataFrame
Row │ in out1 out2 out3 out4 out5 out6 out7 out8
│ String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 8:2:9:4:1:9:3:1 8 2 9 4 1 9 3 1
2 │ 9:6:9:9:8:1:9:5 9 6 9 9 8 1 9 5
3 │ 2:4:9:8:5:4:8:7 2 4 9 8 5 4 8 7
4 │ 8:2:2:9:5:3:7:7 8 2 2 9 5 3 7 7
5 │ 1:4:6:1:3:9:2:1 1 4 6 1 3 9 2 1
6 │ 8:6:1:5:1:4:8:8 8 6 1 5 1 4 8 8
7 │ 4:6:4:4:4:4:8:8 4 6 4 4 4 4 8 8
8 │ 4:3:3:5:1:4:3:4 4 3 3 5 1 4 3 4
9 │ 9:5:5:7:5:3:4:3 9 5 5 7 5 3 4 3
10 │ 4:5:8:5:2:5:7:4 4 5 8 5 2 5 7 4
Note that the benefit of allowing a custom parser is that you can convert the parts of your original string to their numeric values in one shot, like this:
julia> transform(df, :in => ByRow(x -> parse.(Int, split(x, ":"))) => AsTable)
10×9 DataFrame
Row │ in x1 x2 x3 x4 x5 x6 x7 x8
│ String Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 8:2:9:4:1:9:3:1 8 2 9 4 1 9 3 1
2 │ 9:6:9:9:8:1:9:5 9 6 9 9 8 1 9 5
3 │ 2:4:9:8:5:4:8:7 2 4 9 8 5 4 8 7
4 │ 8:2:2:9:5:3:7:7 8 2 2 9 5 3 7 7
5 │ 1:4:6:1:3:9:2:1 1 4 6 1 3 9 2 1
6 │ 8:6:1:5:1:4:8:8 8 6 1 5 1 4 8 8
7 │ 4:6:4:4:4:4:8:8 4 6 4 4 4 4 8 8
8 │ 4:3:3:5:1:4:3:4 4 3 3 5 1 4 3 4
9 │ 9:5:5:7:5:3:4:3 9 5 5 7 5 3 4 3
10 │ 4:5:8:5:2:5:7:4 4 5 8 5 2 5 7 4
Here is another way to do it. Not as composable as the above but showing you the flexibility of the ecosystem:
julia> using CSV
julia> CSV.read(IOBuffer(join(df.in, "\n")), DataFrame, header=false, delim=":")
10×8 DataFrame
Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────────────────────
1 │ 8 2 9 4 1 9 3 1
2 │ 9 6 9 9 8 1 9 5
3 │ 2 4 9 8 5 4 8 7
4 │ 8 2 2 9 5 3 7 7
5 │ 1 4 6 1 3 9 2 1
6 │ 8 6 1 5 1 4 8 8
7 │ 4 6 4 4 4 4 8 8
8 │ 4 3 3 5 1 4 3 4
9 │ 9 5 5 7 5 3 4 3
10 │ 4 5 8 5 2 5 7 4
julia> CSV.read(IOBuffer(join(df.in, "\n")), DataFrame, header="out" .* string.(1:8), delim=":")
10×8 DataFrame
Row │ out1 out2 out3 out4 out5 out6 out7 out8
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────
1 │ 8 2 9 4 1 9 3 1
2 │ 9 6 9 9 8 1 9 5
3 │ 2 4 9 8 5 4 8 7
4 │ 8 2 2 9 5 3 7 7
5 │ 1 4 6 1 3 9 2 1
6 │ 8 6 1 5 1 4 8 8
7 │ 4 6 4 4 4 4 8 8
8 │ 4 3 3 5 1 4 3 4
9 │ 9 5 5 7 5 3 4 3
10 │ 4 5 8 5 2 5 7 4
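Returning to the transform-based approach: R's separate() drops the input column by default, whereas transform keeps :in. A small sketch of how the same effect might be obtained is to use select instead; the two-row data frame and the column names a, b, c are made up for illustration.

```julia
using DataFrames

df = DataFrame(in = ["8:2:9", "1:4:6"])

# select keeps only the columns it is asked to produce, so :in is dropped.
out = select(df, :in => ByRow(x -> parse.(Int, split(x, ":"))) => [:a, :b, :c])
println(out)
```

Any other columns that should survive can be listed explicitly in the same select call.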
I am using leftjoin() to join two data frames, df1 and df2.
df1 = DataFrame(x1 = collect(1:1:10), x2 = fill(1.0,10))
Row │ x1 x2
│ Int64 Float64
─────┼────────────────
1 │ 1 1.0
2 │ 2 1.0
3 │ 3 1.0
4 │ 4 1.0
5 │ 5 1.0
6 │ 6 1.0
7 │ 7 1.0
8 │ 8 1.0
9 │ 9 1.0
10 │ 10 1.0
df2 = DataFrame(x1 = collect(1:2:10), x2 = fill(1.0,5))
Row │ x1 x2
│ Int64 Float64
─────┼────────────────
1 │ 1 1.0
2 │ 3 1.0
3 │ 5 1.0
4 │ 7 1.0
5 │ 9 1.0
out_df = leftjoin(df1,df2, on = :x1, makeunique=true)
which gives this output:
Row │ x1 x2 x2_1
│ Int64 Float64? Float64?
─────┼────────────────────────────
1 │ 1 1.0 1.0
2 │ 3 1.0 1.0
3 │ 5 1.0 1.0
4 │ 7 1.0 1.0
5 │ 9 1.0 1.0
6 │ 2 1.0 missing
7 │ 4 1.0 missing
8 │ 6 1.0 missing
9 │ 8 1.0 missing
10 │ 10 1.0 missing
My question: df1 has 10 rows and df2 has 5 rows. I want df1 to be the 'master' data frame, if you will, and to retain its original row order. When df1 is joined to df2, the df2 values should slot into the matching df1 rows, with missing values on non-matches, while df1's row positions are kept, so that the output is:
Row │ x1 x2 x2_1
│ Int64 Float64? Float64?
─────┼────────────────────────────
1 │ 1 1.0 1.0
2 │ 2 1.0 missing
3 │ 3 1.0 1.0
4 │ 4 1.0 missing
5 │ 5 1.0 1.0
6 │ 6 1.0 missing
7 │ 7 1.0 1.0
8 │ 8 1.0 missing
9 │ 9 1.0 1.0
10 │ 10 1.0 missing
Is there any way I can achieve this?
This is a feature we plan to add in the future, see https://github.com/JuliaData/DataFrames.jl/issues/2753.
For now, before we add the requested functionality, add a column to your left data frame with row id (in your example there is already such a column :x1) and sort the result on this column.
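The workaround described above might be sketched like this. The helper column name rowid is an assumption chosen for illustration; in this particular example :x1 could serve the same purpose.

```julia
using DataFrames

df1 = DataFrame(x1 = collect(1:10), x2 = fill(1.0, 10))
df2 = DataFrame(x1 = collect(1:2:10), x2 = fill(1.0, 5))

# Tag each row of the left table with its original position.
df1.rowid = collect(1:nrow(df1))

out_df = leftjoin(df1, df2, on = :x1, makeunique = true)
sort!(out_df, :rowid)          # restore the original row order of df1
select!(out_df, Not(:rowid))   # drop the helper column again
println(out_df)
```

An explicit row id is the safer choice when the join key is not already sorted or unique in the left table.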