I am trying to generate n dataframes in a loop, where the i-th dataframe has one column with i rows of random numbers, for i = 1:n. So far, neither of the following attempts at building the dataframe names inside the loop works:
n = 5;
for i = 1:n
"df_$i" = DataFrame(rand($i), :auto)
end
or
n = 5;
for i = 1:n
"df_$(i)" = DataFrame(rand($i), :auto)
end
Thanks!
Is this what you want?
julia> [DataFrame("col$i" => rand(i)) for i in 1:3]
3-element Vector{DataFrame}:
1×1 DataFrame
Row │ col1
│ Float64
─────┼──────────
1 │ 0.368821
2×1 DataFrame
Row │ col2
│ Float64
─────┼──────────
1 │ 0.757023
2 │ 0.201711
3×1 DataFrame
Row │ col3
│ Float64
─────┼──────────
1 │ 0.702651
2 │ 0.256179
3 │ 0.560374
(I additionally showed you how to dynamically generate the name of the column in each data frame)
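If you also want to look up each data frame by a name such as "df_3", one option is to store them in a Dict instead of trying to create variables in the loop. This is only a minimal sketch: the container name dfs and the key pattern "df_$i" are assumptions for illustration, not something from the question.

```julia
using DataFrames

n = 5

# Hypothetical container: map each generated name to its data frame
dfs = Dict("df_$i" => DataFrame("col$i" => rand(i)) for i in 1:n)

nrow(dfs["df_3"])    # 3, since the i-th data frame has i rows
names(dfs["df_3"])   # ["col3"]
```

A Dict keeps the names as ordinary strings, so they can be built with interpolation just like the column names above.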
In DataFramesMeta, why do I have to wrap every condition in a pair of parentheses? Below is an example dataframe from which I want the rows where a is greater than 1 or missing.
d = DataFrame(a = [1, 2, missing], b = ["x", "y", missing]);
Using DataFramesMeta to subset:
@chain d begin
    @subset @byrow begin
        (:a > 1) | (:a===missing)
    end
end
If I don't use parentheses, errors pop up.
@chain d begin
    @subset @byrow begin
        :a > 1 | :a===missing
    end
end
# ERROR: LoadError: TypeError: non-boolean (Missing) used in boolean context
The reason is operator precedence (and is unrelated to DataFramesMeta.jl).
See:
julia> dump(:(2 > 1 | 3 > 4))
Expr
head: Symbol comparison
args: Array{Any}((5,))
1: Int64 2
2: Symbol >
3: Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol |
2: Int64 1
3: Int64 3
4: Symbol >
5: Int64 4
as you can see 2 > 1 | 3 > 4 gets parsed as: 2 > (1 | 3) > 4 which is not what you want.
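The same effect can be checked with plain numbers, without any data frame. This is a small sketch using only Base Julia; the variable names are made up for illustration.

```julia
# A quoted expression shows how the parser groups the operators
ex = :(2 > 1 | 3 > 4)
ex.head                              # :comparison, i.e. the chain 2 > (1 | 3) > 4

# Without parentheses, 1 | 3 is evaluated first (bitwise or, giving 3)
unparenthesized = 2 > 1 | 3 > 4      # 2 > 3 > 4, which is false

# With parentheses, each comparison is evaluated before |
parenthesized = (2 > 1) | (3 > 4)    # true | false, which is true
```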
However, I would recommend the following syntax for your case:
julia> @chain d begin
           @subset @byrow begin
               coalesce(:a > 1, true)
           end
       end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
or
julia> @chain d begin
           @subset @byrow begin
               ismissing(:a) || :a > 1
           end
       end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
I personally prefer coalesce but it is a matter of taste.
Note that ||, unlike |, does not require parentheses. However, the ismissing check has to come first so that the short-circuiting behavior of || skips the comparison when the value is missing. With the conditions in the other order you would get an error:
julia> @chain d begin
           @subset @byrow begin
               :a > 1 || ismissing(:a)
           end
       end
ERROR: TypeError: non-boolean (Missing) used in boolean context
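The difference between the two orderings can be reproduced on a single missing value. The following is a sketch in Base Julia only; x and err are illustrative names.

```julia
x = missing

# ismissing(x) is true, so || never evaluates x > 1
ok = ismissing(x) || x > 1         # true

# coalesce replaces the missing result of the comparison with true
also_ok = coalesce(x > 1, true)    # true

# In the reversed order, x > 1 is evaluated first and returns missing,
# which || cannot use as a condition
err = try
    x > 1 || ismissing(x)
catch e
    e
end
err isa TypeError                  # true
```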
Finally, with @rsubset this can be just:
julia> @chain d begin
           @rsubset coalesce(:a > 1, true)
       end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
(I assume you want @chain because this is one step of a longer analysis, so I kept it.)
The following code works as intended:
x = 1
exp = Expr(:(=), :x, 4) # :(x = 4)
eval(exp) # x is now equal to 4 as expected
The following code fails:
x = 1
exp = Expr(:(==), :x, 4) # Got :($(Expr(:(==), :x, 4))) instead of the expected :(x == 4)
eval(exp) # ERROR: syntax: invalid syntax (== (outerref x) 4)
== is a function, so you have:
julia> dump(:(x==4))
Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol ==
2: Symbol x
3: Int64 4
but
julia> dump(:(x=4))
Expr
head: Symbol =
args: Array{Any}((2,))
1: Symbol x
2: Int64 4
so in particular the following works:
julia> x = 1
1
julia> exp = Expr(:call, :(==), :x, 4)
:(x == 4)
julia> dump(exp)
Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol ==
2: Symbol x
3: Int64 4
julia> eval(exp)
false
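If you are unsure which head an expression needs, you can compare a hand-built Expr against the quoted form. A brief sketch, using only Base Julia; the variable names are illustrative.

```julia
# Comparison is a function call, so the head is :call and the operator is the first argument
ex_call = Expr(:call, :(==), :x, 4)
ex_call == :(x == 4)                 # true

# Assignment is special syntax, so the operator itself is the head
ex_assign = Expr(:(=), :x, 4)
ex_assign == :(x = 4)                # true

# Parsing a string produces the same call expression
Meta.parse("x == 4") == ex_call      # true

# Other operators such as + follow the same :call pattern
eval(Expr(:call, :+, 1, 2))          # 3
```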
How can I use counts() to show the frequencies together with the items? For example:
a=[1,2,2,3]
counts(a) gives 1,2,1
How can I get:
1:1, 2:2, 3:1?
Thanks
It looks like you are already using StatsBase, because that is where the counts function you mention is defined. The function you are looking for is called countmap:
using StatsBase
a = [1,2,2,3];
countmap(a)
# Dict{Int64, Int64} with 3 entries:
# 2 => 2
# 3 => 1
# 1 => 1
If you prefer tabular output you can also do:
julia> using FreqTables
julia> a = [1,2,2,3];
julia> freqtable(a)
3-element Named Vector{Int64}
Dim1 │
──────┼──
1 │ 1
2 │ 2
3 │ 1
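To get text in exactly the 1:1, 2:2, 3:1 form from the question, the countmap result can be formatted by hand. This is a minimal sketch; the names cm and formatted are assumptions for illustration.

```julia
using StatsBase

a = [1, 2, 2, 3]
cm = countmap(a)

# Dict iteration order is not guaranteed, so sort the keys before formatting
formatted = join(["$k:$(cm[k])" for k in sort(collect(keys(cm)))], ", ")
println(formatted)   # 1:1, 2:2, 3:1
```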
I would like to do some residual analysis for a GLM.
My model is in the form
using GLM
model = glm(@formula(y ~ x), data, Binomial(), LogitLink())
My textbook suggests that residual analysis for a GLM be performed using deviance residuals. I was glad to see that Julia's GLM has a devresid() function, and that its documentation suggests how to use it for plotting (sign(y - μ) * sqrt(devresid(D, y, μ))). However, I'm at a total loss as to what the input arguments are supposed to be. Looking at the docstring:
?devresid
devresid(D, y, μ::Real)
Return the squared deviance residual of μ from y for distribution D
The deviance of a GLM can be evaluated as the sum of the squared deviance residuals. This is the principal use for these values. The actual deviance residual, say for plotting, is the signed square root of this value
sign(y - μ) * sqrt(devresid(D, y, μ))
Examples
julia> devresid(Normal(), 0, 0.25) ≈ abs2(0.25)
true
julia> devresid(Bernoulli(), 1, 0.75) ≈ -2*log(0.75)
true
julia> devresid(Bernoulli(), 0, 0.25) ≈ -2*log1p(-0.25)
true
D: I'm guessing that in my case it is Binomial()
y: I'm guessing this is the indicator variable for a single case, i.e. 1 or 0
μ: What is this?
How can I use this function to produce things like plots of the deviance residual on a normal probability scale and versus fitted values?
Here's the data I'm using in CSV form
x,y
400,0
220,1
490,0
210,1
500,0
270,0
200,1
470,0
480,0
310,1
240,1
490,0
420,0
330,1
280,1
210,1
300,1
470,1
230,0
430,0
460,0
220,1
250,1
200,1
390,0
I understand this is what you want:
julia> data = DataFrame(X=[1,1,1,2,2], Y=[1,1,0,0,1])
5×2 DataFrame
│ Row │ X │ Y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 1 │ 1 │
│ 3 │ 1 │ 0 │
│ 4 │ 2 │ 0 │
│ 5 │ 2 │ 1 │
julia> model = glm(@formula(Y ~ X), data, Binomial(), LogitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Y ~ 1 + X
Coefficients:
─────────────────────────────────────────────────────────────────────────────
Estimate Std. Error z value Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept) 1.38629 2.82752 0.490286 0.6239 -4.15554 6.92813
X -0.693146 1.87049 -0.37057 0.7110 -4.35923 2.97294
─────────────────────────────────────────────────────────────────────────────
julia> p = predict(model)
5-element Array{Float64,1}:
0.6666664218508201
0.6666664218508201
0.6666664218508201
0.5
0.5
julia> y = data.Y
5-element Array{Int64,1}:
1
1
0
0
1
julia> @. sign(y - p) * sqrt(devresid(Bernoulli(), y, p))
5-element Array{Float64,1}:
0.9005170462928523
0.9005170462928523
-1.4823033118905455
-1.1774100225154747
1.1774100225154747
(this is what you would get from calling residuals(model, type="deviance") in R)
Note that in the last line I use @. to vectorize the whole line. Alternatively you could have written it as:
julia> sign.(y .- p) .* sqrt.(devresid.(Bernoulli(), y, p))
5-element Array{Float64,1}:
0.9005170462928523
0.9005170462928523
-1.4823033118905455
-1.1774100225154747
1.1774100225154747
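To make the roles of the arguments concrete: in the computation above, y is the observed 0/1 response and μ is the fitted probability for the same row (here taken from predict(model)). The sketch below recomputes the same residuals by hand in Base Julia, using the Bernoulli formulas from the docstring examples. The helper bernoulli_devresid is hypothetical and not part of GLM, and the fitted probabilities are typed in as rounded values.

```julia
# Hypothetical helper: squared deviance residual for a 0/1 response,
# following the Bernoulli examples in the devresid docstring
bernoulli_devresid(y, μ) = y == 1 ? -2 * log(μ) : -2 * log1p(-μ)

y = [1, 1, 0, 0, 1]                # observed responses
μ = [2/3, 2/3, 2/3, 0.5, 0.5]      # fitted probabilities, rounded from predict(model)

# Signed square root gives the deviance residual for each observation
r = @. sign(y - μ) * sqrt(bernoulli_devresid(y, μ))
```

The values in r agree with the residuals shown above to about five decimal places, the small difference coming from the rounded probabilities.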