Julia subsetting dataframe with multiple conditions - julia

In DataFramesMeta, why should I wrap every condition within a pair of parentheses? Below is an example dataframe where I want a subset that contains values greater than 1 or is missing.
d = DataFrame(a = [1, 2, missing], b = ["x", "y", missing]);
Using DataFramesMeta to subset:
#chain d begin
#subset #byrow begin
(:a > 1) | (:a===missing)
end
end
If I don't use parentheses, errors pop up.
#chain d begin
#subset #byrow begin
:a > 1 | :a===missing
end
end
# ERROR: LoadError: TypeError: non-boolean (Missing) used in boolean context

The reason is operator precedence (and is unrelated to DataFramesMeta.jl).
See:
julia> dump(:(2 > 1 | 3 > 4))
Expr
head: Symbol comparison
args: Array{Any}((5,))
1: Int64 2
2: Symbol >
3: Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol |
2: Int64 1
3: Int64 3
4: Symbol >
5: Int64 4
as you can see 2 > 1 | 3 > 4 gets parsed as: 2 > (1 | 3) > 4 which is not what you want.
However, I would recommend you the following syntax for your case:
julia> #chain d begin
#subset #byrow begin
coalesce(:a > 1, true)
end
end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
or
julia> #chain d begin
#subset #byrow begin
ismissing(:a) || :a > 1
end
end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
I personally prefer coalesce but it is a matter of taste.
Note that || as opposed to | does not require parentheses, but you need to reverse the order of the conditions to take advantage of short circuiting behavior of || as if you reversed the conditions you would get an error:
julia> #chain d begin
#subset #byrow begin
:a > 1 || ismissing(:a)
end
end
ERROR: TypeError: non-boolean (Missing) used in boolean context
Finally with #rsubset this can be just:
julia> #chain d begin
#rsubset coalesce(:a > 1, true)
end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
(I assume you want #chain as this is one of the steps you want to do in the analysis so I keep it)

Related

Iterate over dataframe names in Julia

I am trying to generate n dataframes using a loop, where each dataframe has one column with i rows populated with random numbers where i=1:n. So far, none of the following iterating methods work for iterating over dataframe names in order to generate them:
n = 5;
for i = 1:n
"df_$i" = DataFrame(rand($i), :auto)
end
or
n = 5;
for i = 1:n
"df_$(i)" = DataFrame(rand($i), :auto)
end
Thanks!
Is this what you want?
julia> [DataFrame("col$i" => rand(i)) for i in 1:3]
3-element Vector{DataFrame}:
1×1 DataFrame
Row │ col1
│ Float64
─────┼──────────
1 │ 0.368821
2×1 DataFrame
Row │ col2
│ Float64
─────┼──────────
1 │ 0.757023
2 │ 0.201711
3×1 DataFrame
Row │ col3
│ Float64
─────┼──────────
1 │ 0.702651
2 │ 0.256179
3 │ 0.560374
(I additionally showed you how to dynamically generate the name of the column in each data frame)

Is it possible to access row index inside DataFramesMeta macros?

Is there a way to access current_row_index in the following snippet ?
#with df begin
fn.(:col, current_row_index)
end
In this context, since you are broacasting just pass first axes of df:
julia> using DataFramesMeta
julia> fn(x, y) = (x, y)
fn (generic function with 1 method)
julia> df = DataFrame(col=["a", "b", "c"])
3×1 DataFrame
Row │ col
│ String
─────┼────────
1 │ a
2 │ b
3 │ c
julia> #with df begin
fn.(:col, axes(df, 1))
end
3-element Vector{Tuple{String, Int64}}:
("a", 1)
("b", 2)
("c", 3)

Bind function arguments in Julia

Does Julia provide something similar to std::bind in C++? I wish to do something along the lines of:
function add(x, y)
return x + y
end
add45 = bind(add, 4, 5)
add2 = bind(add, _1, 2)
add3 = bind(add, 3, _2)
And if this is possible does it incur any performance overhead?
As answered here you can obtain this behavior using higher order functions in Julia.
Regarding the performance. There should be no overhead. Actually the compiler should inline everything in such a situation and even perform constant propagation (so that the code could actually be faster). The use of const in the other answer here is needed only because we are working in global scope. If all this would be used within a function then const is not required (as the function that takes this argument will be properly compiled), so in the example below I do not use const.
Let me give an example with Base.Fix1 and your add function:
julia> using BenchmarkTools
julia> function add(x, y)
return x + y
end
add (generic function with 1 method)
julia> add2 = Base.Fix1(add, 10)
(::Base.Fix1{typeof(add), Int64}) (generic function with 1 method)
julia> y = 1:10^6;
julia> #btime add.(10, $y);
1.187 ms (2 allocations: 7.63 MiB)
julia> #btime $add2.($y);
1.189 ms (2 allocations: 7.63 MiB)
Note that I did not define add2 as const and since we are in global scope I need to prefix it with $ to interpolate its value into the benchmarking suite.
If I did not do it you would get:
julia> #btime add2.($y);
1.187 ms (6 allocations: 7.63 MiB)
Which is essentially the same timing and memory use, but does 6 not 2 allocations since in this case add2 is a type-unstable global variable.
I work on DataFrames.jl, and there using the patterns which we discuss here is very useful. Let me give just one example:
julia> using DataFrames
julia> df = DataFrame(x = 1:5)
5×1 DataFrame
Row │ x
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
5 │ 5
julia> filter(:x => <(2.5), df)
2×1 DataFrame
Row │ x
│ Int64
─────┼───────
1 │ 1
2 │ 2
What the operation does is picking rows where values from column :x that are less than 2.5. The key thing to understand here is what <(2.5) does. It is:
julia> <(2.5)
(::Base.Fix2{typeof(<), Float64}) (generic function with 1 method)
so as you can see it is similar to what we would have obtained if we defined the x -> x < 2.5 function (essentially fixing the second argument of < function, as in Julia < is just a two argument function). Such shortcuts like <(2.5) above are defined in Julia by default for several common comparison operators.

Frequencies in a vector or list using counts()

How can i use counts() to show the frequencies and the items? for example:
a=[1,2,2,3]
count(a) gives 1,2,1
How can i do to get:
1:1, 2:2, 3:1?
Thanks
It looks like you are already using StatsBase, because that is where the counts function you mention is defined. The function you are looking for is called countmap:
using StatsBase
a = [1,2,2,3];
countmap(a)
# Dict{Int64, Int64} with 3 entries:
# 2 => 2
# 3 => 1
# 1 => 1
If you prefer tabular output you can also do:
julia> using FreqTables
julia> a = [1,2,2,3];
julia> freqtable(a)
3-element Named Vector{Int64}
Dim1 │
──────┼──
1 │ 1
2 │ 2
3 │ 1

Running into an issue with using a variable as an exponent in Julia

I am new to Julia and was working on some example problems from here as a way of getting a handle on the language. To describe the specific issue I'm facing. I am trying to write some code for question 11 in the programming problems which requires me to compute a summation. I am reproducing my code below. I set a variable k to 1 and the formula needs to find the value of -1 to the power of k + 1. When k = 1, it should compute the result as -1 squared which should be 1 but it returns -1. Not sure what is going wrong here. Help me understand my error?
function computeequation()
result = 0
for k = 1:1000000
result = result + ((-1^(k+1))/((2 * k) - 1))
end
return 4 * result
end
This is common to several programming languages, not just Julia: exponentiation has a higher precedence than subtraction or negation. For Julia, you can see the list table of operator precedence here: https://docs.julialang.org/en/v1/manual/mathematical-operations/#Operator-Precedence-and-Associativity-1.
For this reason, -1^2 doesn't produce what you may naively expect:
julia> -1^2
-1
In order to override the default precedence, just use parentheses as appropriate:
julia> (-1)^2
1
As suggested by Lyndon White in a comment, a nice way to visualise the precedence of operations in an expression is to quote it
julia> :(-1 ^ 2)
:(-(1 ^ 2))
julia> :((-1) ^ 2)
:((-1) ^ 2)
and dump it to see the full AST:
julia> dump(:(-1 ^ 2))
Expr
head: Symbol call
args: Array{Any}((2,))
1: Symbol -
2: Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol ^
2: Int64 1
3: Int64 2
julia> dump(:((-1) ^ 2))
Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol ^
2: Int64 -1
3: Int64 2
Here you can note that in the first case the exponentiation is done before the negation, in the second case where parentheses are used, negation comes before exponentiation.
Another neat way to see how an expression is lowered in Julia is to use the Meta.lower function:
julia> Meta.lower(Main, :(-1 ^ 2) )
:($(Expr(:thunk, CodeInfo(
# none within `top-level scope'
1 ─ %1 = Core.apply_type(Base.Val, 2)
│ %2 = (%1)()
│ %3 = Base.literal_pow(^, 1, %2)
│ %4 = -%3
└── return %4
))))
julia> Meta.lower(Main, :((-1) ^ 2) )
:($(Expr(:thunk, CodeInfo(
# none within `top-level scope'
1 ─ %1 = Core.apply_type(Base.Val, 2)
│ %2 = (%1)()
│ %3 = Base.literal_pow(^, -1, %2)
└── return %3
))))
For your particular problem you can do
function computeequation()
result = 0
for k = 1:1000_000
result = result + ((-1) ^ (k + 1))/((2 * k) - 1)
end
return 4 * result
end
Answering my own question. Looks like adding braces around the -1 solves the problem.
function computeequation()
result = 0
for k = 1:1000000
result = result + (((-1)^(k+1))/((2 * k) - 1))
end
return 4 * result
end

Resources