How is the R code below equivalent to Julia code? - r

I am trying some julia code as shown here:
However, I get an error:

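In Julia, 1 and 1.0 are different: 1 is an integer (Int), while 1.0 is a floating-point number (Float64). In R, a plain numeric literal such as 1 is stored as a double, so the distinction rarely shows up there. In your Julia code, you want x and y to be integers.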
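The original code and error message are not shown, so the following is only a generic sketch of the integer/float distinction described above. The names i, f, v and idx are made up for illustration and are not taken from the question.

```julia
# Integer and floating-point literals have different types in Julia.
i = 1
f = 1.0
println(typeof(i))   # Int64 on a 64-bit machine
println(typeof(f))   # Float64
println(i == f)      # true: equal in value
println(i === f)     # false: not the same type

v = [10, 20, 30]
idx = 2.0            # hypothetical value, e.g. the result of a division

# v[idx] would throw an ArgumentError, because array indices must be integers.
println(v[Int(idx)])          # Int(2.0) converts exactly; Int(2.5) would throw an InexactError
println(v[round(Int, 2.4)])   # round(Int, x) is an option when the value may not be whole
```

If a value is meant to be a count or an index, writing it as an integer literal (1 rather than 1.0) or using integer division (div or ÷) avoids the conversion entirely.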

You are most likely using the filtering incorrectly.
Suppose you have a data frame:
using DataFrames
data = DataFrame(a=1:6, b='a':'f');
One way to filter would be to use a BitVector such as:
julia> rows = data.a .< 3
6-element BitVector:
1
1
0
0
0
0
julia> data[rows, :]
2×2 DataFrame
 Row │ a      b
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
You could of course just write data[data.a .< 3, :].
If you want to use filter instead, the code could look like this:
julia> filter(row -> row.a < 3, data)
2×2 DataFrame
 Row │ a      b
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b

Related

How to create TypedTable from names (strings) and vectors?

Consider
import TypedTables as TT
TT.Table(this=[1,2,3])
Fine. Now instead I have
a = "this"
b = [1,2,3]
How do I create the same table from a and b? Going via a NamedTuple is a bit roundabout but seems to work:
TT.Table((; Symbol(a) => b))
Is a less roundabout approach available?
You can skip NamedTuple construction and just pass this as kwargs:
julia> Table(;Symbol(a) =>b)
Table with 1 column and 3 rows:
     this
   ┌─────
 1 │ 1
 2 │ 2
 3 │ 3
Regarding the multi-column comments:
julia> as = ["this", "that"];
julia> bs = [[1,2,3],[4,5,6]];
julia> Table(; (Symbol.(as) .=> bs)...)
Table with 2 columns and 3 rows:
     this  that
   ┌───────────
 1 │ 1     4
 2 │ 2     5
 3 │ 3     6
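To see what the splatted pairs are doing without TypedTables in the picture, here is a small sketch in plain Julia that builds the same name/column structure as a NamedTuple. The variable names pairs_vec, nt and nt2 are made up for illustration.

```julia
as = ["this", "that"]
bs = [[1, 2, 3], [4, 5, 6]]

# Broadcasting `=>` builds a vector of Symbol => Vector pairs.
pairs_vec = Symbol.(as) .=> bs

# Splatting the pairs after `;` creates a NamedTuple with those field names.
nt = (; pairs_vec...)
println(keys(nt))   # (:this, :that)
println(nt.that)    # [4, 5, 6]

# An equivalent construction that spells out the NamedTuple type explicitly.
nt2 = NamedTuple{Tuple(Symbol.(as))}(Tuple(bs))
println(nt == nt2)  # true
```

The keyword form Table(; pairs...) relies on the same splatting, which is why the intermediate NamedTuple can be skipped.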

Is there an as.factor analogue in Julia?

I have an integer column in a data frame. How can I convert its values into strings in Julia?
In R I can simply write:
mutate(column2 = as.factor(column1))
In Julia:
julia> using DataFramesMeta, CategoricalArrays
julia> df = DataFrame(a=1:3, b='a':'c')
3×2 DataFrame
 Row │ a      b
     │ Int64  Char
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c
julia> @transform!(df, :b = categorical(:b))
3×2 DataFrame
 Row │ a      b
     │ Int64  Cat…
─────┼─────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c
or @transform if you want a new data frame. The target column name can also be different, e.g. :b_categorical = categorical(:b).

Counting occurrences and making calculations with a dataframe

I have a column in a data frame like this:
df = DataFrame(:num=>rand(0:10,20))
From df I want to make two other data frames:
df1 = counter(df[!, :num])
To have the frequencies of each integer from 0 to 10. But I need the values sorted from 0 to 10:
0=>2
1=>3
2=>7
and so on.
Then I want a new dataframe df2 where:
column_p = sum of occurrences of 9 and 10
column_n = sum of occurrences of 7 and 8
column_d = sum of occurrences of 0 to 6
I managed to get the first part, even though the result is not sorted, but this last data frame has been a challenge to my Julia skills (I am still learning).
UPDATE 1
I managed to write this function:
function f(dff)
    @eachrow dff begin
        if :num >= 9
            :class = "Positive"
        elseif :num >= 7
            :class = "Neutral"
        elseif :num < 7
            :class = "Negative"
        end
    end
end
This function does half of what I want and fails if there is no :class column in the data frame.
Now I want to count how many positives, neutrals and negatives there are in order to do this operation:
(positives - negatives) / (negatives + neutrals + positives)
The first part is:
julia> using DataFrames, Random
julia> Random.seed!(1234);
julia> df = DataFrame(:num=>rand(0:10,20));
julia> df1 = combine(groupby(df, :num, sort=true), nrow)
10×2 DataFrame
 Row │ num    nrow
     │ Int64  Int64
─────┼──────────────
   1 │     0      1
   2 │     2      2
   3 │     3      2
   4 │     4      2
   5 │     5      1
   6 │     6      2
   7 │     7      2
   8 │     8      4
   9 │     9      1
  10 │    10      3
I was not sure what you wanted in the second step, but here are two ways to achieve the third step using either df1 or df:
julia> (sum(df1.nrow[df1.num .>= 9]) - sum(df1.nrow[df1.num .<= 6])) / sum(df1.nrow)
-0.3
julia> (count(>=(9), df.num) - count(<=(6), df.num)) / nrow(df)
-0.3
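For the second step, one possible reading of the question is a single record holding the three class totals. The sketch below uses only Base Julia; the vector nums is reconstructed from the counts shown in df1, and the names p, n, d, df2_row and score are made up for illustration.

```julia
# Values reconstructed from the df1 counts above (20 observations in total).
nums = [0, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 7, 8, 8, 8, 8, 9, 10, 10, 10]

p = count(>=(9), nums)              # occurrences of 9 and 10
n = count(x -> 7 <= x <= 8, nums)   # occurrences of 7 and 8
d = count(<=(6), nums)              # occurrences of 0 to 6

# One way to hold the three totals under the column names from the question.
df2_row = (column_p = p, column_n = n, column_d = d)
println(df2_row)

# (positives - negatives) / (negatives + neutrals + positives)
score = (p - d) / (p + n + d)
println(score)
```

With DataFrames loaded, DataFrame([df2_row]) should turn that record into a one-row data frame, since a vector of NamedTuples is a valid table source.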

Return the maximum sum in `DataFrames.jl`?

Suppose my DataFrame has two columns v and g. First, I grouped the DataFrame by column g and calculated the sum of the column v. Second, I used the function maximum to retrieve the maximum sum. I am wondering whether it is possible to retrieve the value in one step? Thanks.
julia> using DataFrames, Random
julia> Random.seed!(1)
TaskLocalRNG()
julia> dt = DataFrame(v = rand(15), g = rand(1:3, 15))
15×2 DataFrame
 Row │ v          g
     │ Float64    Int64
─────┼──────────────────
   1 │ 0.0491718      3
   2 │ 0.119079       2
   3 │ 0.393271       2
   4 │ 0.0240943      3
   5 │ 0.691857       2
   6 │ 0.767518       2
   7 │ 0.087253       1
   8 │ 0.855718       1
   9 │ 0.802561       3
  10 │ 0.661425       1
  11 │ 0.347513       2
  12 │ 0.778149       3
  13 │ 0.196832       1
  14 │ 0.438058       2
  15 │ 0.0113425      1
julia> gdt = combine(groupby(dt, :g), :v => sum => :v)
3×2 DataFrame
 Row │ g      v
     │ Int64  Float64
─────┼────────────────
   1 │     1  1.81257
   2 │     2  2.7573
   3 │     3  1.65398
julia> maximum(gdt.v)
2.7572966050340257
I am not sure if that is what you mean, but you can retrieve the values of v and g in one step using the following command:
julia> v, g = findmax(x-> (x.v, x.g), eachrow(gdt))[1]
(2.7572966050340257, 2)
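The findmax call above uses the two-argument form findmax(f, domain), which to my knowledge requires Julia 1.7 or later. As a sketch of how it behaves, here is the same idea on a plain vector of NamedTuples that stands in for the rows of gdt; the values are the rounded sums from the printed table, and the names rows, best_v, best_g and idx are made up for illustration.

```julia
# Stand-in for eachrow(gdt), using the rounded sums displayed above.
rows = [(g = 1, v = 1.81257), (g = 2, v = 2.7573), (g = 3, v = 1.65398)]

# findmax(f, domain) returns (maximum of f, index where it occurs).
# Tuples compare lexicographically, so (v, g) is ordered by v first.
(best_v, best_g), idx = findmax(r -> (r.v, r.g), rows)
println((best_v, best_g))   # (2.7573, 2)
println(idx)                # 2

# argmax(f, domain) returns the element itself rather than its index.
best_row = argmax(r -> r.v, rows)
println(best_row)
```

Putting v first in the tuple is what makes the sum drive the comparison; g only acts as a tie-breaker.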
DataFramesMeta.jl has a @by macro:
julia> @by(dt, :g, :sv = sum(:v))
3×2 DataFrame
 Row │ g      sv
     │ Int64  Float64
─────┼────────────────
   1 │     1  1.81257
   2 │     2  2.7573
   3 │     3  1.65398
which gives you somewhat neater syntax for the first part of this.
With that, you can do either:
julia> @by(dt, :g, :sv = sum(:v)).sv |> maximum
2.7572966050340257
or (IMO more readably):
julia> @chain dt begin
           @by(:g, :sv = sum(:v))
           maximum(_.sv)
       end
2.7572966050340257

How do you apply a shift to a Julia Dataframe?

In Python's pandas, the shift function shifts the rows of a data frame forward or backward relative to the original, which is useful for calculating changes in time series data. What is the equivalent method in Julia?
Normally one would use ShiftedArrays.jl and apply it to columns that require shifting.
Here is a small working example:
using DataFrames, ShiftedArrays
df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6
transform(df, :a => lag => :lag_a)
3×3 DataFrame
 Row │ a      b      lag_a
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     1      4  missing
   2 │     2      5        1
   3 │     3      6        2
or you could do:
df.c = lag(df.a)
or, to have the lead of two rows:
df.c = lead(df.a, 2)
etc.
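To make the semantics of lag and lead concrete, here is a minimal sketch in plain Julia. The helper shift_sketch is hypothetical and is not how ShiftedArrays.jl is implemented (this version builds a new vector on every call); it only mimics the results shown above on an ordinary vector.

```julia
# Hypothetical helper: shift v by n positions, filling the gaps with missing.
# Positive n behaves like a lag, negative n like a lead.
function shift_sketch(v::AbstractVector, n::Integer)
    return [checkbounds(Bool, v, i - n) ? v[i - n] : missing for i in eachindex(v)]
end

prices = [10, 12, 15]

println(shift_sketch(prices, 1))    # lag by one row
println(shift_sketch(prices, -2))   # lead by two rows

# Row-to-row change, the typical use of a shift in time series work.
changes = prices .- shift_sketch(prices, 1)
println(changes)
```

The first element of changes is missing because there is no previous row to subtract, which matches the missing value in the lag_a column above.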
