Iterate over dataframe names in Julia - julia

I am trying to generate n dataframes using a loop, where each dataframe has one column with i rows populated with random numbers where i=1:n. So far, none of the following iterating methods work for iterating over dataframe names in order to generate them:
n = 5;
for i = 1:n
"df_$i" = DataFrame(rand($i), :auto)
end
or
n = 5;
for i = 1:n
"df_$(i)" = DataFrame(rand($i), :auto)
end
Thanks!

Is this what you want?
julia> [DataFrame("col$i" => rand(i)) for i in 1:3]
3-element Vector{DataFrame}:
1×1 DataFrame
Row │ col1
│ Float64
─────┼──────────
1 │ 0.368821
2×1 DataFrame
Row │ col2
│ Float64
─────┼──────────
1 │ 0.757023
2 │ 0.201711
3×1 DataFrame
Row │ col3
│ Float64
─────┼──────────
1 │ 0.702651
2 │ 0.256179
3 │ 0.560374
(I additionally showed you how to dynamically generate the name of the column in each data frame)

Related

Is it possible to access row index inside DataFramesMeta macros?

Is there a way to access current_row_index in the following snippet ?
#with df begin
fn.(:col, current_row_index)
end
In this context, since you are broacasting just pass first axes of df:
julia> using DataFramesMeta
julia> fn(x, y) = (x, y)
fn (generic function with 1 method)
julia> df = DataFrame(col=["a", "b", "c"])
3×1 DataFrame
Row │ col
│ String
─────┼────────
1 │ a
2 │ b
3 │ c
julia> #with df begin
fn.(:col, axes(df, 1))
end
3-element Vector{Tuple{String, Int64}}:
("a", 1)
("b", 2)
("c", 3)

Julia subsetting dataframe with multiple conditions

In DataFramesMeta, why should I wrap every condition within a pair of parentheses? Below is an example dataframe where I want a subset that contains values greater than 1 or is missing.
d = DataFrame(a = [1, 2, missing], b = ["x", "y", missing]);
Using DataFramesMeta to subset:
#chain d begin
#subset #byrow begin
(:a > 1) | (:a===missing)
end
end
If I don't use parentheses, errors pop up.
#chain d begin
#subset #byrow begin
:a > 1 | :a===missing
end
end
# ERROR: LoadError: TypeError: non-boolean (Missing) used in boolean context
The reason is operator precedence (and is unrelated to DataFramesMeta.jl).
See:
julia> dump(:(2 > 1 | 3 > 4))
Expr
head: Symbol comparison
args: Array{Any}((5,))
1: Int64 2
2: Symbol >
3: Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol |
2: Int64 1
3: Int64 3
4: Symbol >
5: Int64 4
as you can see 2 > 1 | 3 > 4 gets parsed as: 2 > (1 | 3) > 4 which is not what you want.
However, I would recommend you the following syntax for your case:
julia> #chain d begin
#subset #byrow begin
coalesce(:a > 1, true)
end
end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
or
julia> #chain d begin
#subset #byrow begin
ismissing(:a) || :a > 1
end
end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
I personally prefer coalesce but it is a matter of taste.
Note that || as opposed to | does not require parentheses, but you need to reverse the order of the conditions to take advantage of short circuiting behavior of || as if you reversed the conditions you would get an error:
julia> #chain d begin
#subset #byrow begin
:a > 1 || ismissing(:a)
end
end
ERROR: TypeError: non-boolean (Missing) used in boolean context
Finally with #rsubset this can be just:
julia> #chain d begin
#rsubset coalesce(:a > 1, true)
end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
(I assume you want #chain as this is one of the steps you want to do in the analysis so I keep it)

How to estimate many GLM models in Julia?

I have a dataset of 5000 variables. One target and 4999 covariates. I want to estimate one glm per each target-variable combination (4999 models).
How can I do that without manually typing 4999 formulas for GLM ?
In R I would simply define a list of 4999 strings ("target ~ x1) , convert each string to a formula and use map to estimate multiple glm. Is there something similar that can be done in Julia ? Or is there an elegant alternative ?
Thanks in advance.
You can programatically create the formula via Term objects. The docs for that can be found here, but consider the following simple example which should meet your needs:
Start with dummy data
julia> using DataFrames, GLM
julia> df = hcat(DataFrame(y = rand(10)), DataFrame(rand(10, 5)))
10×6 DataFrame
│ Row │ y │ x1 │ x2 │ x3 │ x4 │ x5 │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼───────────┼───────────┼──────────┼───────────┼────────────┼──────────┤
│ 1 │ 0.0200963 │ 0.924856 │ 0.947904 │ 0.429068 │ 0.00833488 │ 0.547378 │
│ 2 │ 0.169498 │ 0.0915296 │ 0.375369 │ 0.0341015 │ 0.390461 │ 0.835634 │
│ 3 │ 0.900145 │ 0.502495 │ 0.38106 │ 0.47253 │ 0.637731 │ 0.814095 │
│ 4 │ 0.255163 │ 0.865253 │ 0.791909 │ 0.0833828 │ 0.741899 │ 0.961041 │
│ 5 │ 0.651996 │ 0.29538 │ 0.161443 │ 0.23427 │ 0.23132 │ 0.947486 │
│ 6 │ 0.305908 │ 0.170662 │ 0.569827 │ 0.178898 │ 0.314841 │ 0.237354 │
│ 7 │ 0.308431 │ 0.835606 │ 0.114943 │ 0.19743 │ 0.344216 │ 0.97108 │
│ 8 │ 0.344968 │ 0.452961 │ 0.595219 │ 0.313425 │ 0.102282 │ 0.456764 │
│ 9 │ 0.126244 │ 0.593456 │ 0.818383 │ 0.485622 │ 0.151394 │ 0.043125 │
│ 10 │ 0.60174 │ 0.8977 │ 0.643095 │ 0.0865611 │ 0.482014 │ 0.858999 │
Now when you run a linear model with GLM, you'd do something like lm(#formula(y ~ x1), df), which indeed can't easily be used in a loop to construct different formulas. We'll therefore follow the docs and create the output of the #formula macro directly - remember macros in Julia just transform syntax to other syntax, so they don't do anything we can't write ourselves!
julia> lm(Term(:y) ~ Term(:x1), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
y ~ 1 + x1
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.428436 0.193671 2.21 0.0579 -0.0181696 0.875041
x1 -0.106603 0.304597 -0.35 0.7354 -0.809005 0.595799
──────────────────────────────────────────────────────────────────────────
You can verify for yourself that the above is equivalent to lm(#formula(y ~ x1), df).
Now it's hopefully an easy step to building the loop that you're looking for (restricted to two covariates below to limit the output):
julia> for x ∈ names(df[:, Not(:y)])[1:2]
#show lm(term(:y) ~ term(x), df)
end
lm(term(:y) ~ term(x), df) = StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
y ~ 1 + x1
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.428436 0.193671 2.21 0.0579 -0.0181696 0.875041
x1 -0.106603 0.304597 -0.35 0.7354 -0.809005 0.595799
──────────────────────────────────────────────────────────────────────────
lm(Term(:y) ~ Term(x), df) = StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
y ~ 1 + x2
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) 0.639633 0.176542 3.62 0.0068 0.232527 1.04674
x2 -0.502327 0.293693 -1.71 0.1256 -1.17958 0.17493
─────────────────────────────────────────────────────────────────────────
As Dave points out below, it's helpful to use the term() function here to create our terms rather than the Term() constructor directly - this is because names(df) returns a vector of Strings, while the Term() constructor expects Symbols. term() has a method for Strings that handles the conversion automatically.
You can also use the low-level API and pass the dependent variable as a vector and the independent variable as a matrix directly without even building formulas. You will lose coefficient names, but since you have only one independent variable in each model it's probably OK.
This is documented in ?fit. The call for each model will look like glm([ones(length(x1)) x1], target, dist). The column full of ones is for the intercept.

How to provide reproducible Sample Data in Julia

Here on stackoverflow.com - when I provide sample data to make a reproducible example, how can I do it the Julian way?
In R for example dput(df) will output a string with which you can create df again. Hence, you just post the result here on stackoverflow and bam! - reproducible example. So, how should one do it in Julia?
I think the easiest thing to do generally is to simply construct an MWE DataFrame with random numbers etc in your example, so there's no need to read/write out.
In situations where that's inconvenient, you might consider writing out to an IO buffer and taking the string representation of that, which people can then read back in the same way in reverse:
julia> using CSV, DataFrames
julia> df = DataFrame(a = rand(5), b = rand(1:10, 5));
julia> io = IOBuffer()
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1)
julia> string_representation = String(take!(CSV.write(io, df)))
"a,b\n0.5613453808585873,9\n0.3308122459718885,6\n0.631520224612919,9\n0.3533712075535982,3\n0.35289980394398723,9\n"
julia> CSV.read(IOBuffer(string_representation))
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64 │ Int64 │
├─────┼──────────┼───────┤
│ 1 │ 0.561345 │ 9 │
│ 2 │ 0.330812 │ 6 │
│ 3 │ 0.63152 │ 9 │
│ 4 │ 0.353371 │ 3 │
│ 5 │ 0.3529 │ 9 │
Here is one way to mimic the behaviour of R's dput in Julia:
julia> using DataFrames
julia> using Random; Random.seed!(0);
julia> df = DataFrame(a = rand(3), b = rand(1:10, 3))
3×2 DataFrame
Row │ a b
│ Float64 Int64
─────┼──────────────────
1 │ 0.405699 1
2 │ 0.0685458 7
3 │ 0.862141 2
julia> julian_dput(x) = invoke(show, Tuple{typeof(stdout), Any}, stdout, df);
julia> julian_dput(df)
DataFrame(AbstractVector[[0.4056994708920292, 0.06854582438651502, 0.8621408571954849], [1, 7, 2]], DataFrames.Index(Dict(:a => 1, :b => 2), [:a, :b]))
That is, julian_dput() takes a DataFrame as input and returns a string that can generate the input.
Source: https://discourse.julialang.org/t/given-an-object-return-julia-code-that-defines-the-object/80579/12

Julia GLM - using devresid for plotting

I would like to do some residual analysis for a GLM.
My model is in the form
using GLM
model = glm(#formula(y ~ x), data, Binomial(), LogitLink())
My text book suggests that residual analysis in the GLM be performed using deviance residuals. I was glad to see that Julia's GLM has a devresid() function, and that it suggests how to use it for plotting (sign(y - μ) * sqrt(devresid(D, y, μ))). However, I'm at a total loss as to what the input arguments are supposed to be. Looking at the doc-string:
?devresid
devresid(D, y, μ::Real)
Return the squared deviance residual of μ from y for distribution D
The deviance of a GLM can be evaluated as the sum of the squared deviance residuals. This is the principal use for these values. The actual deviance residual, say for plotting, is the signed square root of this value
sign(y - μ) * sqrt(devresid(D, y, μ))
Examples
julia> devresid(Normal(), 0, 0.25) ≈ abs2(0.25)
true
julia> devresid(Bernoulli(), 1, 0.75) ≈ -2*log(0.75)
true
julia> devresid(Bernoulli(), 0, 0.25) ≈ -2*log1p(-0.25)
true
D: I'm guessing that in my case it is Binomial()
y: I'm guessing this is the indicator variable for a single case, i.e. 1 or 0
μ: What is this?
How can I use this function to produce things like plots of the deviance residual on a normal probability scale and versus fitted values?
Here's the data I'm using in CSV form
x,y
400,0
220,1
490,0
210,1
500,0
270,0
200,1
470,0
480,0
310,1
240,1
490,0
420,0
330,1
280,1
210,1
300,1
470,1
230,0
430,0
460,0
220,1
250,1
200,1
390,0
I understand this is what you want:
julia> data = DataFrame(X=[1,1,1,2,2], Y=[1,1,0,0,1])
5×2 DataFrame
│ Row │ X │ Y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 1 │ 1 │
│ 3 │ 1 │ 0 │
│ 4 │ 2 │ 0 │
│ 5 │ 2 │ 1 │
julia> model = glm(#formula(Y ~ X), data, Binomial(), LogitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Y ~ 1 + X
Coefficients:
─────────────────────────────────────────────────────────────────────────────
Estimate Std. Error z value Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept) 1.38629 2.82752 0.490286 0.6239 -4.15554 6.92813
X -0.693146 1.87049 -0.37057 0.7110 -4.35923 2.97294
─────────────────────────────────────────────────────────────────────────────
julia> p = predict(model)
5-element Array{Float64,1}:
0.6666664218508201
0.6666664218508201
0.6666664218508201
0.5
0.5
julia> y = data.Y
5-element Array{Int64,1}:
1
1
0
0
1
julia> #. sign(y - p) * sqrt(devresid(Bernoulli(), y, p))
5-element Array{Float64,1}:
0.9005170462928523
0.9005170462928523
-1.4823033118905455
-1.1774100225154747
1.1774100225154747
(this is what you would get from calling residuals(model, type="deviance") in R)
Note that in the last line I use #. to vectorize the whole line. Alternatively you could have written it as:
julia> sign.(y .- p) .* sqrt.(devresid.(Bernoulli(), y, p))
5-element Array{Float64,1}:
0.9005170462928523
0.9005170462928523
-1.4823033118905455
-1.1774100225154747
1.1774100225154747

Resources