Julia - separate() tidyr equivalent

Are there any functions in Julia that are equivalent to separate() in R?
I have a column holding a long string with ":" as the delimiter, and I want to split the string at each delimiter into 8 columns.

Assuming you use DataFrames.jl you can use the following.
First generate some test data:
julia> using DataFrames
julia> df = DataFrame(in = join.(eachrow(rand(1:9, 10, 8)), ":"))
10×1 DataFrame
Row │ in
│ String
─────┼─────────────────
1 │ 8:2:9:4:1:9:3:1
2 │ 9:6:9:9:8:1:9:5
3 │ 2:4:9:8:5:4:8:7
4 │ 8:2:2:9:5:3:7:7
5 │ 1:4:6:1:3:9:2:1
6 │ 8:6:1:5:1:4:8:8
7 │ 4:6:4:4:4:4:8:8
8 │ 4:3:3:5:1:4:3:4
9 │ 9:5:5:7:5:3:4:3
10 │ 4:5:8:5:2:5:7:4
The splitting could be done in many ways (we are fully flexible in how you can split your column). Here I just use the split function.
If you want auto-generated column names use:
julia> transform(df, :in => ByRow(x -> split(x, ":")) => AsTable)
10×9 DataFrame
Row │ in x1 x2 x3 x4 x5 x6 x7 x8
│ String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 8:2:9:4:1:9:3:1 8 2 9 4 1 9 3 1
2 │ 9:6:9:9:8:1:9:5 9 6 9 9 8 1 9 5
3 │ 2:4:9:8:5:4:8:7 2 4 9 8 5 4 8 7
4 │ 8:2:2:9:5:3:7:7 8 2 2 9 5 3 7 7
5 │ 1:4:6:1:3:9:2:1 1 4 6 1 3 9 2 1
6 │ 8:6:1:5:1:4:8:8 8 6 1 5 1 4 8 8
7 │ 4:6:4:4:4:4:8:8 4 6 4 4 4 4 8 8
8 │ 4:3:3:5:1:4:3:4 4 3 3 5 1 4 3 4
9 │ 9:5:5:7:5:3:4:3 9 5 5 7 5 3 4 3
10 │ 4:5:8:5:2:5:7:4 4 5 8 5 2 5 7 4
Alternatively pass your own column names:
julia> transform(df, :in => ByRow(x -> split(x, ":")) => "out" .* string.(1:8))
10×9 DataFrame
Row │ in out1 out2 out3 out4 out5 out6 out7 out8
│ String SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin… SubStrin…
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 8:2:9:4:1:9:3:1 8 2 9 4 1 9 3 1
2 │ 9:6:9:9:8:1:9:5 9 6 9 9 8 1 9 5
3 │ 2:4:9:8:5:4:8:7 2 4 9 8 5 4 8 7
4 │ 8:2:2:9:5:3:7:7 8 2 2 9 5 3 7 7
5 │ 1:4:6:1:3:9:2:1 1 4 6 1 3 9 2 1
6 │ 8:6:1:5:1:4:8:8 8 6 1 5 1 4 8 8
7 │ 4:6:4:4:4:4:8:8 4 6 4 4 4 4 8 8
8 │ 4:3:3:5:1:4:3:4 4 3 3 5 1 4 3 4
9 │ 9:5:5:7:5:3:4:3 9 5 5 7 5 3 4 3
10 │ 4:5:8:5:2:5:7:4 4 5 8 5 2 5 7 4
Note that the benefit of allowing a custom parser is that you can convert the parts of your original string to their numeric values in one shot, like this:
julia> transform(df, :in => ByRow(x -> parse.(Int, split(x, ":"))) => AsTable)
10×9 DataFrame
Row │ in x1 x2 x3 x4 x5 x6 x7 x8
│ String Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 8:2:9:4:1:9:3:1 8 2 9 4 1 9 3 1
2 │ 9:6:9:9:8:1:9:5 9 6 9 9 8 1 9 5
3 │ 2:4:9:8:5:4:8:7 2 4 9 8 5 4 8 7
4 │ 8:2:2:9:5:3:7:7 8 2 2 9 5 3 7 7
5 │ 1:4:6:1:3:9:2:1 1 4 6 1 3 9 2 1
6 │ 8:6:1:5:1:4:8:8 8 6 1 5 1 4 8 8
7 │ 4:6:4:4:4:4:8:8 4 6 4 4 4 4 8 8
8 │ 4:3:3:5:1:4:3:4 4 3 3 5 1 4 3 4
9 │ 9:5:5:7:5:3:4:3 9 5 5 7 5 3 4 3
10 │ 4:5:8:5:2:5:7:4 4 5 8 5 2 5 7 4
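The two ideas above (custom column names and parsing to integers) can be combined, and using select instead of transform drops the original column, which is close to what tidyr's separate() does by default (remove = TRUE). This is only a minimal sketch: the three-part input and the column names a, b, c are made up for illustration.

```julia
using DataFrames

# Hypothetical input with three parts per row (the question has eight).
df = DataFrame(in = ["1:2:3", "4:5:6"])

# `select` returns only the columns it is asked to produce, so `:in` is dropped.
# Each row is split on ":" and every piece is parsed to an Int.
res = select(df, :in => ByRow(x -> parse.(Int, split(x, ":"))) => [:a, :b, :c])
```

Note that parse will throw if any piece is not a valid integer, so this assumes clean input.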
Here is another way to do it. Not as composable as the above but showing you the flexibility of the ecosystem:
julia> using CSV
julia> CSV.read(IOBuffer(join(df.in, "\n")), DataFrame, header=false, delim=":")
10×8 DataFrame
Row │ Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────────────────────
1 │ 8 2 9 4 1 9 3 1
2 │ 9 6 9 9 8 1 9 5
3 │ 2 4 9 8 5 4 8 7
4 │ 8 2 2 9 5 3 7 7
5 │ 1 4 6 1 3 9 2 1
6 │ 8 6 1 5 1 4 8 8
7 │ 4 6 4 4 4 4 8 8
8 │ 4 3 3 5 1 4 3 4
9 │ 9 5 5 7 5 3 4 3
10 │ 4 5 8 5 2 5 7 4
julia> CSV.read(IOBuffer(join(df.in, "\n")), DataFrame, header="out" .* string.(1:8), delim=":")
10×8 DataFrame
Row │ out1 out2 out3 out4 out5 out6 out7 out8
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼────────────────────────────────────────────────────────
1 │ 8 2 9 4 1 9 3 1
2 │ 9 6 9 9 8 1 9 5
3 │ 2 4 9 8 5 4 8 7
4 │ 8 2 2 9 5 3 7 7
5 │ 1 4 6 1 3 9 2 1
6 │ 8 6 1 5 1 4 8 8
7 │ 4 6 4 4 4 4 8 8
8 │ 4 3 3 5 1 4 3 4
9 │ 9 5 5 7 5 3 4 3
10 │ 4 5 8 5 2 5 7 4

Related

Julia transpose grouped data in DataFrames?

ds = Dataset(group = repeat(1:3, inner = 2),
b = repeat(1:2, inner = 3),
c = repeat(1:1, inner = 6),
d = repeat(1:6, inner = 1),
e = string.('a':'f'))
In the InMemoryDatasets package, we can transpose grouped data like this:
#transpose by group
transpose(groupby(ds, :group), 2:4)
How can I do this in the DataFrames package?
How can I do this in R?
result:
Row │ group variable 1 2
│ Int64 String Int64? Int64?
─────┼─────────────────────────────────
1 │ 1 b 1 1
2 │ 1 c 1 1
3 │ 1 d 1 2
4 │ 2 b 1 2
5 │ 2 c 1 1
6 │ 2 d 3 4
7 │ 3 b 2 2
8 │ 3 c 1 1
9 │ 3 d 5 6
Answer (attempt) regarding Julia DataFrames part of the question:
First creating the DataFrame:
df = DataFrame(group = repeat(1:3, inner = 2),
b = repeat(1:2, inner = 3),
c = repeat(1:1, inner = 6),
d = repeat(1:6, inner = 1),
e = string.('a':'f'))
Next, since the transpose operation depends on row ordering, we fix a row ordering in the groups:
julia> ordereddf = transform(DataFrames.groupby(df, :group),"group" => (x->1:length(x)) => "rn")[:,Not(:e)]
6×5 DataFrame
Row │ group b c d rn
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 1 1 1 1
2 │ 1 1 1 2 2
3 │ 2 1 1 3 1
4 │ 2 2 1 4 2
5 │ 3 2 1 5 1
6 │ 3 2 1 6 2
Finally, the stack-unstack combo does the transposing:
julia> sort!(unstack(stack(ordereddf,[:b,:c,:d]),:rn, :value),:group)
9×4 DataFrame
Row │ group variable 1 2
│ Int64 String Int64? Int64?
─────┼─────────────────────────────────
1 │ 1 b 1 1
2 │ 1 c 1 1
3 │ 1 d 1 2
4 │ 2 b 1 2
5 │ 2 c 1 1
6 │ 2 d 3 4
7 │ 3 b 2 2
8 │ 3 c 1 1
9 │ 3 d 5 6
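To see what each half of the stack-unstack combo contributes, here is a minimal sketch on a smaller, made-up table with a single value column b and the within-group row number rn already in place:

```julia
using DataFrames

# Hypothetical toy data: two groups, two rows each, row number in `rn`.
df = DataFrame(group = [1, 1, 2, 2], b = [1, 2, 3, 4], rn = [1, 2, 1, 2])

# stack: wide -> long. `b` becomes (variable, value) pairs;
# the remaining columns (group, rn) are kept as identifiers.
long = stack(df, [:b])

# unstack: long -> wide. Each distinct `rn` becomes a column filled from `value`;
# the remaining columns (group, variable) identify the output rows.
wide = unstack(long, :rn, :value)
```

Here long has the columns group, rn, variable, value, and wide has one row per (group, variable) pair with one column per row number.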
Feels like there might be easier ways to do this, but in general, transpose is rarely appropriate for database-like tables, and if it is appropriate, then maybe a matrix should have been used to store information in the first place.
The R part is left for someone else to answer.

Add a column with a constant value to a DataFrame

How can I add a column with a constant value to a DataFrame?
E.g. I have the following DataFrame:
using DataFrames
df = DataFrame(x = 1:10, y = 'a':'j')
And I would like to add a new variable z with constant value 1 and obtain:
10×3 DataFrame
Row │ x y z
│ Int64 Char Int64
─────┼────────────────────
1 │ 1 a 1
2 │ 2 b 1
3 │ 3 c 1
4 │ 4 d 1
5 │ 5 e 1
6 │ 6 f 1
7 │ 7 g 1
8 │ 8 h 1
9 │ 9 i 1
10 │ 10 j 1
To create such a column when constructing the DataFrame:
df = DataFrame(x = 1:10, y = 'a':'j', d = 1)
To append such a column to an existing DataFrame, you need broadcasting:
df.e .= 1
or
df[:, "f"] .= 1
A more general alternative is:
julia> insertcols!(df, :z => 1)
10×3 DataFrame
Row │ x y z
│ Int64 Char Int64
─────┼────────────────────
1 │ 1 a 1
2 │ 2 b 1
3 │ 3 c 1
4 │ 4 d 1
5 │ 5 e 1
6 │ 6 f 1
7 │ 7 g 1
8 │ 8 h 1
9 │ 9 i 1
10 │ 10 j 1
which by default does the same, but it additionally:
allows you to specify the location of the new column;
by default makes sure that you do not accidentally overwrite an existing column
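A minimal sketch of those two extra features, using a shortened version of the data frame above (three rows instead of ten):

```julia
using DataFrames

df = DataFrame(x = 1:3, y = 'a':'c')

# Passing a position as the second argument places the new column there.
insertcols!(df, 1, :z => 1)

# Inserting a column whose name already exists is rejected by default,
# so the existing `z` is not silently overwritten.
err = try
    insertcols!(df, :z => 2)
    nothing
catch e
    e
end
```

After this, z is the first column, and err holds the exception raised by the second call.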

Stack multiples columns into two columns in DataFrame.jl

Suppose my source DataFrame and target DataFrame are df1 and df2, as illustrated below.
How can I convert df1 to df2 using stack? Thanks.
julia> df1 = DataFrame(x1 = 1:4, x2 = 5:8, y1 = Char.(65:68), y2 = Char.(69:72))
4×4 DataFrame
Row │ x1 x2 y1 y2
│ Int64 Int64 Char Char
─────┼──────────────────────────
1 │ 1 5 A E
2 │ 2 6 B F
3 │ 3 7 C G
4 │ 4 8 D H
julia> df2 = DataFrame(x = [df1.x1; df1.x2], y = [df1.y1; df1.y2])
8×2 DataFrame
Row │ x y
│ Int64 Char
─────┼─────────────
1 │ 1 A
2 │ 2 B
3 │ 3 C
4 │ 4 D
5 │ 5 E
6 │ 6 F
7 │ 7 G
8 │ 8 H
What you want is not a regular stack.
Obviously, you can stack x and y separately, then hcat the results together to get your df2. But here is a different approach.
julia> long = stack(df1,All())
16×2 DataFrame
Row │ variable value
│ String Any
─────┼─────────────────
1 │ x1 1
2 │ x1 2
3 │ x1 3
4 │ x1 4
5 │ x2 5
6 │ x2 6
7 │ x2 7
8 │ x2 8
9 │ y1 A
10 │ y1 B
11 │ y1 C
12 │ y1 D
13 │ y2 E
14 │ y2 F
15 │ y2 G
16 │ y2 H
Turn all x1, x2, y1, y2 into x, x, y, y and add a new unique id column:
julia> long.variable = SubString.(long.variable,1,1);
julia> long.id = repeat(1:8,2);
julia> long
16×3 DataFrame
Row │ variable value id
│ SubStrin… Any Int64
─────┼─────────────────────────
1 │ x 1 1
2 │ x 2 2
3 │ x 3 3
4 │ x 4 4
5 │ x 5 5
6 │ x 6 6
7 │ x 7 7
8 │ x 8 8
9 │ y A 1
10 │ y B 2
11 │ y C 3
12 │ y D 4
13 │ y E 5
14 │ y F 6
15 │ y G 7
16 │ y H 8
Finally, unstack back:
julia> unstack(long, :variable, :value)
8×3 DataFrame
Row │ id x y
│ Int64 Any Any
─────┼─────────────────
1 │ 1 1 A
2 │ 2 2 B
3 │ 3 3 C
4 │ 4 4 D
5 │ 5 5 E
6 │ 6 6 F
7 │ 7 7 G
8 │ 8 8 H
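For comparison, the simpler route mentioned at the start of this answer (stack the x columns and the y columns separately, then hcat) could look like the sketch below. Because each stack call only mixes columns of one type, the results keep their concrete element types instead of the Any seen in the output above. The intermediate names x_long and y_long are just illustrative.

```julia
using DataFrames

df1 = DataFrame(x1 = 1:4, x2 = 5:8, y1 = Char.(65:68), y2 = Char.(69:72))

# Stack each family of columns on its own, naming the value column directly.
x_long = stack(df1, [:x1, :x2]; value_name = :x)
y_long = stack(df1, [:y1, :y2]; value_name = :y)

# Keep only the stacked values and glue them side by side.
df2 = hcat(x_long[:, [:x]], y_long[:, [:y]])
```

This relies on both stack calls emitting rows in the same order (all rows of the first measured column, then all rows of the second), which is why the x and y values line up.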

Skip random rows from a table

I have a table that has the column header in row 2 with the actual data starting in row 5. My question is how to read the table skipping rows 1, 3 and 4 and assign row 2 as column header?
I'm using something like the code below. However, I would like to understand if there are better ways.
headers <- read.table("file_1", skip=1, header=F, sep =',', nrows=1, as.is=T)
df <- read.table("file_1", skip=3, header=F, sep =',')
colnames(df) <- headers
Not very different, but you could scan the header row and read.table for the remainder.
You are probably facing something like this.
tb <- ' 1 1 1 1 1 1 1 1 1 1 1
2 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9 9
10 10 10 10 10 10 10 10 10 10 10'
Use scan on your file with what=character(), and set skip and nlines so that only the header line is read into the column names r1. For the data r2, use read.table with skip= set to pass over all the unneeded lines. Drop the first element of each, since it holds the row indices. Finally, use r1 to setNames of r2, and apply type.convert.
r1 <- scan(text=tb, what=character(), skip=1, nlines=1)[-1]
r2 <- read.table(text=tb, skip=4)[-1]
res <- r2 |>
setNames(r1) |>
type.convert(as.is=TRUE)
res
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 5 5 5 5 5 5 5 5 5 5
# 2 6 6 6 6 6 6 6 6 6 6
# 3 7 7 7 7 7 7 7 7 7 7
# 4 8 8 8 8 8 8 8 8 8 8
# 5 9 9 9 9 9 9 9 9 9 9
# 6 10 10 10 10 10 10 10 10 10 10
Note: It depends a little on how the data is stored in the file, and you will probably have to customize the skip= values.

How to delete a specific row in Julia

How can I delete a specific row in Julia? Let's say I have an array:
[A , 2
B , 4
C , 6]
I want to delete the lines for which 'B' is in the first column. I can identify which row this is, but am not able to delete this row. Can anybody help me?
Thanks,
Nico
julia> a = rand(1:10, 5,3)
5×3 Array{Int64,2}:
4 5 7
8 4 3
8 6 3
10 4 1
9 3 10
To delete row 4:
julia> row = 4
julia> a = a[setdiff(1:end, row), :]
4×3 Array{Int64,2}:
4 5 7
8 4 3
8 6 3
9 3 10
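The same indexing idea works when the row is identified by a value rather than by its position, which is closer to the question. A minimal sketch, assuming the array from the question is a matrix with strings in the first column:

```julia
# Mixed matrix as in the question: labels in column 1, numbers in column 2.
a = ["A" 2; "B" 4; "C" 6]

# Boolean mask that is true for every row whose first column is not "B".
keep = a[:, 1] .!= "B"

# Index with the mask to build a new matrix without the unwanted rows.
a2 = a[keep, :]
```

As with setdiff, this creates a new array rather than modifying a in place.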
Say you have a dataframe called "data".
julia> data = DataFrame(rand(1:10, 5, 3), :auto)
5×3 DataFrames.DataFrame
Row x1 x2 x3
1 9 1 1
2 8 5 8
3 9 2 2
4 9 6 5
5 3 8 7
You want to delete every row where column x1 has the value 8.
julia> data[data.x1 .!= 8, :]
4×3 DataFrames.DataFrame
Row x1 x2 x3
1 9 1 1
2 9 2 2
3 9 6 5
4 3 8 7
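An alternative to boolean indexing is filter, which takes a column => predicate pair. The sketch below uses fixed values copied from the example output above instead of random numbers, so the result is reproducible:

```julia
using DataFrames

data = DataFrame(x1 = [9, 8, 9, 9, 3], x2 = [1, 5, 2, 6, 8], x3 = [1, 8, 2, 5, 7])

# Returns a new data frame with only the rows where x1 is not 8.
kept = filter(:x1 => !=(8), data)

# The in-place variant removes those rows from `data` itself.
filter!(:x1 => !=(8), data)
```

Use filter when the original data frame should stay untouched, and filter! when the rows should really be deleted from it.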
