Counting occurrences and making calculations with a dataframe - julia

I have a column in a dataframe like this:
df = DataFrame(:num=>rand(0:10,20))
From df I want to make two other dataframes:
df1 = counter(df[!, :num])
to get the frequency of each integer from 0 to 10. But I need the values sorted from 0 to 10:
0=>2
1=>3
2=>7
and so on.
Then I want a new dataframe df2 where:
column_p = sum of occurrences of 9 and 10
column_n = sum of occurrences of 7 and 8
column_d = sum of occurrences of 0 to 6
I managed to get the first part, even though the result is not sorted, but this last dataframe has been a challenge for my Julia skills (I am still learning).
UPDATE 1
I managed to write this function:
function f(dff)
    @eachrow dff begin
        if :num >= 9
            :class = "Positive"
        elseif :num >= 7
            :class = "Neutral"
        elseif :num < 7
            :class = "Negative"
        end
    end
end
This function does half of what I want, and it fails if there is no :class column in the dataframe.
Now I want to count how many positives, neutrals and negatives there are, to do this operation:
(positives - negatives) / (negatives + neutrals + positives)

The first part is:
julia> using DataFrames, Random
julia> Random.seed!(1234);
julia> df = DataFrame(:num=>rand(0:10,20));
julia> df1 = combine(groupby(df, :num, sort=true), nrow)
10×2 DataFrame
Row │ num nrow
│ Int64 Int64
─────┼──────────────
1 │ 0 1
2 │ 2 2
3 │ 3 2
4 │ 4 2
5 │ 5 1
6 │ 6 2
7 │ 7 2
8 │ 8 4
9 │ 9 1
10 │ 10 3
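Note that groupby only reports values that actually occur, which is why the output above has 10 rows and no row for 1. If every integer from 0 to 10 should appear, including those with a count of zero, one possible sketch is to count against a fixed set of levels. The name df1_full and the small sample data are assumptions made for illustration only:

```julia
using DataFrames

# Small hand-written sample so the counts are easy to check by eye.
df = DataFrame(num = [0, 2, 2, 3, 10])

# Every value that should appear in the result, already in sorted order.
levels = 0:10

# Count the occurrences of each level; values that never occur get 0.
df1_full = DataFrame(num = levels,
                     nrow = [count(==(k), df.num) for k in levels])
```

This scans the column once per level, which is fine for a handful of levels but not meant for large level sets.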
I was not sure what you wanted in the second step, but here are two ways to achieve the third step using either df1 or df:
julia> (sum(df1.nrow[df1.num .>= 9]) - sum(df1.nrow[df1.num .<= 6])) / sum(df1.nrow)
-0.3
julia> (count(>=(9), df.num) - count(<=(6), df.num)) / nrow(df)
-0.3
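If the second step is meant literally (a one-row data frame with the three totals), the following is a rough sketch of one way to build it. The helper classify and the sample data are hypothetical; the column names column_p, column_n and column_d are taken from the question:

```julia
using DataFrames

# Hypothetical helper: map a score to the class names used in the question.
classify(n) = n >= 9 ? "Positive" : n >= 7 ? "Neutral" : "Negative"

# Hand-written sample: 5 negatives, 3 neutrals, 2 positives.
df = DataFrame(num = [0, 2, 2, 3, 6, 7, 8, 8, 9, 10])

# transform creates the :class column, so it does not need to exist beforehand.
classified = transform(df, :num => ByRow(classify) => :class)

# One-row data frame holding the three totals.
df2 = DataFrame(
    column_p = [count(==("Positive"), classified.class)],
    column_n = [count(==("Neutral"), classified.class)],
    column_d = [count(==("Negative"), classified.class)],
)

# (positives - negatives) / total
score = (df2.column_p[1] - df2.column_d[1]) / nrow(df)
```

Because transform returns a new data frame, the original df is left unchanged; transform! would add the column in place instead.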

Related

How to filter a dataframe keeping the highest value of a certain column

Given the dataframe below, I want to filter records that share the same q2, id_q and check_id, and keep only the ones with the highest value.
Input dataframe:
q1      q2        id_q  check_id  value
sdfsdf  dfsdfsdf  10    10        90
hdfhhd  dfsdfsdf  10    10        80
There are two rows with the same q2, id_q and check_id but different values: 90 and 80.
For each combination of q2, id_q and check_id, I want to return the row with the highest value.
So I want to drop duplicates with respect to check_id and id_q and keep the row with the highest value in the value column.
Desired output:
q1      q2        id_q  check_id  value
sdfsdf  dfsdfsdf  10    10        90
For this case, this code seems to be shorter than the ones referenced in other answers:
Suppose you have
julia> df = DataFrame(a=["a","a","b","b","b","b"], b=[1,1,2,2,3,3],c=11:16,notimportant=rand(6))
6×4 DataFrame
Row │ a b c notimportant
│ String Int64 Int64 Float64
─────┼────────────────────────────────────
1 │ a 1 11 0.93785
2 │ a 1 12 0.877777
3 │ b 2 13 0.845306
4 │ b 2 14 0.477606
5 │ b 3 15 0.722569
6 │ b 3 16 0.122807
Then you can just do:
julia> combine(groupby(df, [:a, :b]), :c => maximum => :c)
3×3 DataFrame
Row │ a b c
│ String Int64 Int64
─────┼──────────────────────
1 │ a 1 12
2 │ b 2 14
3 │ b 3 16
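The combine call above returns only the grouping columns and the maximum, so other columns such as q1 are dropped. If the whole row should be kept, a possible sketch is to pick the row at argmax within each group. The data below is the example from the question; the name best is an assumption:

```julia
using DataFrames

# The two rows from the question.
df = DataFrame(q1 = ["sdfsdf", "hdfhhd"],
               q2 = ["dfsdfsdf", "dfsdfsdf"],
               id_q = [10, 10],
               check_id = [10, 10],
               value = [90, 80])

# For each group, keep the complete row whose value is largest.
best = combine(groupby(df, [:q2, :id_q, :check_id])) do sdf
    sdf[argmax(sdf.value), :]
end
```

If several rows tie for the maximum, argmax picks the first one, so exactly one row per group survives.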

How is this R code equivalent to Julia code?

I am trying some julia code as shown here:
However, I get an error:
In Julia, 1 and 1.0 are different: 1 is an integer, while 1.0 is a floating-point number. In R, a literal such as 1 is a floating-point number by default. You want x and y to be integers.
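The failing code is not shown here, so the following is only a generic sketch of the integer versus floating-point distinction, using made-up variables x, y and v:

```julia
x = 1      # integer literal
y = 1.0    # floating-point literal

x isa Integer         # true
y isa AbstractFloat   # true
x == y                # true: the values compare equal
x === y               # false: they are different objects of different types

v = [10, 20, 30]
first_by_int = v[x]   # indexing with an integer works

# v[y] would throw an error, because array indices must be integers.
# Converting is exact only when there is no fractional part:
idx = Int(y)          # fine for 1.0; Int(1.5) would throw an InexactError
first_by_converted = v[idx]
```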
You are most likely using the filtering incorrectly.
Suppose you have a data frame:
data = DataFrame(a=1:6, b='a':'f');
One way to filter would be to use a BitVector such as:
julia> rows = data.a .< 3
6-element BitVector:
1
1
0
0
0
0
julia> data[rows, :]
2×2 DataFrame
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
You could of course just write data[data.a .< 3, :]
If you want to use filter instead the code could look like this:
julia> filter(row -> row.a < 3, data)
2×2 DataFrame
Row │ a b
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
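Two further spellings that should give the same result are the column-selector form of filter and subset with ByRow. This is a small sketch on the same data; the names by_column and by_subset are assumptions:

```julia
using DataFrames

data = DataFrame(a = 1:6, b = 'a':'f')

# filter with a column => predicate pair; the predicate receives one value of :a.
by_column = filter(:a => <(3), data)

# subset works on whole columns, so ByRow applies the predicate element-wise.
by_subset = subset(data, :a => ByRow(<(3)))
```

The pair form of filter avoids building a row object for every row, which tends to matter on wide or large data frames.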

How to create TypedTable from names (strings) and vectors?

Consider
import TypedTables as TT
TT.Table(this=[1,2,3])
Fine. Now instead I have
a = "this"
b = [1,2,3]
How do I create the same table from a and b? Going via a NamedTuple is a bit roundabout but seems to work:
TT.Table((; Symbol(a) =>b))
Is a less roundabout approach available?
You can skip NamedTuple construction and just pass this as kwargs:
julia> Table(;Symbol(a) =>b)
Table with 1 column and 3 rows:
this
┌─────
1 │ 1
2 │ 2
3 │ 3
Regarding the multi-column comments:
julia> as = ["this", "that"];
julia> bs = [[1,2,3],[4,5,6]];
julia> Table(; (Symbol.(as) .=> bs)...)
Table with 2 columns and 3 rows:
this that
┌───────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
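For comparison, here is a sketch of the explicit NamedTuple route for several columns, which can be handy when the NamedTuple is needed for something else as well. The names nt and t are assumptions, and the TT alias follows the question:

```julia
import TypedTables as TT   # the `import ... as` form needs Julia 1.6 or later

as = ["this", "that"]
bs = [[1, 2, 3], [4, 5, 6]]

# Build a NamedTuple whose field names come from the strings in `as`.
nt = NamedTuple{Tuple(Symbol.(as))}(Tuple(bs))

# A Table can be constructed from a NamedTuple of equal-length columns.
t = TT.Table(nt)
```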

Return the maximum sum in `DataFrames.jl`?

Suppose my DataFrame has two columns v and g. First, I grouped the DataFrame by column g and calculated the sum of the column v. Second, I used the function maximum to retrieve the maximum sum. I am wondering whether it is possible to retrieve the value in one step? Thanks.
julia> using Random
julia> Random.seed!(1)
TaskLocalRNG()
julia> dt = DataFrame(v = rand(15), g = rand(1:3, 15))
15×2 DataFrame
Row │ v g
│ Float64 Int64
─────┼──────────────────
1 │ 0.0491718 3
2 │ 0.119079 2
3 │ 0.393271 2
4 │ 0.0240943 3
5 │ 0.691857 2
6 │ 0.767518 2
7 │ 0.087253 1
8 │ 0.855718 1
9 │ 0.802561 3
10 │ 0.661425 1
11 │ 0.347513 2
12 │ 0.778149 3
13 │ 0.196832 1
14 │ 0.438058 2
15 │ 0.0113425 1
julia> gdt = combine(groupby(dt, :g), :v => sum => :v)
3×2 DataFrame
Row │ g v
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
julia> maximum(gdt.v)
2.7572966050340257
I am not sure if that is what you mean but you can retrieve the values of g and v in one step using the following command:
julia> v, g = findmax(x-> (x.v, x.g), eachrow(gdt))[1]
(2.7572966050340257, 2)
DataFramesMeta.jl has a @by macro:
julia> @by(dt, :g, :sv = sum(:v))
3×2 DataFrame
Row │ g sv
│ Int64 Float64
─────┼────────────────
1 │ 1 1.81257
2 │ 2 2.7573
3 │ 3 1.65398
which gives you somewhat neater syntax for the first part of this.
With that, you can do either:
julia> @by(dt, :g, :sv = sum(:v)).sv |> maximum
2.7572966050340257
or (IMO more readably):
julia> @chain dt begin
           @by(:g, :sv = sum(:v))
           maximum(_.sv)
       end
2.7572966050340257
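With plain DataFrames.jl, a single expression is also possible by iterating over the groups directly. The following is a minimal sketch on small hand-written data rather than the random data above; the names max_sum, sums and best are assumptions:

```julia
using DataFrames

# Group sums: g = 1 gives 4.0, g = 2 gives 6.0, g = 3 gives 5.0.
dt = DataFrame(v = [1.0, 2.0, 3.0, 4.0, 5.0], g = [1, 2, 1, 2, 3])

# Iterating a grouped data frame yields one sub-data-frame per group.
max_sum = maximum(sum(sdf.v) for sdf in groupby(dt, :g))

# If the group key is needed as well, index the summary at argmax.
sums = combine(groupby(dt, :g), :v => sum => :v)
best = sums[argmax(sums.v), :]
```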

How do you apply a shift to a Julia Dataframe?

In Python pandas, the shift function is useful for shifting the rows of a dataframe forward or backward relative to the original, which allows for calculating changes in time series data. What is the equivalent method in Julia?
Normally one would use ShiftedArrays.jl and apply it to columns that require shifting.
Here is a small working example:
using DataFrames, ShiftedArrays
df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
transform(df, :a => lag => :lag_a)
3×3 DataFrame
Row │ a b lag_a
│ Int64 Int64 Int64?
─────┼───────────────────────
1 │ 1 4 missing
2 │ 2 5 1
3 │ 3 6 2
or you could do:
df.c = lag(df.a)
or, to have the lead of two rows:
df.c = lead(df.a, 2)
etc.
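If adding a package is not an option, the same idea can be sketched in a few lines of plain Julia. The function shift below is a hypothetical helper written for this illustration, not part of ShiftedArrays.jl or DataFrames.jl, and it copies the data instead of creating a lazy view:

```julia
# Hypothetical helper: shift `v` by `n` positions, filling the gaps with `missing`.
# Positive `n` behaves like a lag, negative `n` like a lead.
# Assumes `v` uses ordinary 1-based indexing.
function shift(v::AbstractVector, n::Integer)
    out = Vector{Union{Missing, eltype(v)}}(missing, length(v))
    for i in eachindex(out)
        j = i - n                      # source position for output slot i
        if 1 <= j <= length(v)
            out[i] = v[j]
        end
    end
    return out
end

a = [1, 2, 3]
lagged = shift(a, 1)      # [missing, 1, 2]
led    = shift(a, -2)     # [3, missing, missing]

# Row-to-row change, as used for time series differences.
change = a .- shift(a, 1) # [missing, 1, 1]
```

The result is an ordinary vector, so it can be assigned to a data frame column in the same way as the lag and lead results above.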
