First row of a DF as column names

How do I make the first row my column names instead of x1, x2, ..., xn?
Thanks
I am able to load the data with XLSX.readtable, but I am unable to convert it to a DataFrame:
dir = expanduser("~/Downloads/")
using DataFrames, XLSX
f = dir * "data.xlsx"
d = XLSX.readtable(f, "dados", "A:C", first_row = 5, header = false)
df = DataFrame(d)
ArgumentError: 'Tuple{Vector{Any}, Vector{Symbol}}' iterates 'Vector{Any}' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
This fails at the last step. The spreadsheet looks like pretty standard data. If I dump the data, I get:
Tuple{Vector{Any}, Vector{Symbol}}
1: Array{Any}((3,))
1: Array{Any}((225,))
1: Dates.Date
instant: Dates.UTInstant{Dates.Day}
periods: Dates.Day
value: Int64 731672
2: Dates.Date
instant: Dates.UTInstant{Dates.Day}
periods: Dates.Day
value: Int64 731702
3: Dates.Date
instant: Dates.UTInstant{Dates.Day}
periods: Dates.Day
value: Int64 731733
4: Dates.Date
instant: Dates.UTInstant{Dates.Day}
periods: Dates.Day
value: Int64 731763
5: Dates.Date
instant: Dates.UTInstant{Dates.Day}
periods: Dates.Day
value: Int64 731794
...
221: Dates.Date
instant: Dates.UTInstant{Dates.Day}
periods: Dates.Day
value: Int64 738368
222: Dates.Date
instant: Dates.UTInstant{Dates.Day}
periods: Dates.Day
value: Int64 738399
223: Dates.Date
instant: Dates.UTInstant{Dates.Day}
periods: Dates.Day
value: Int64 738429
224: Dates.Date
instant: Dates.UTInstant{Dates.Day}
periods: Dates.Day
value: Int64 738460
225: Dates.Date
instant: Dates.UTInstant{Dates.Day}
periods: Dates.Day
value: Int64 738490
2: Array{Any}((225,))
1: Float64 329.16091
2: Float64 303.23791
3: Float64 284.96296
4: Float64 283.10436
5: Float64 286.08795
...
221: Missing missing
222: Missing missing
223: Missing missing
224: Missing missing
225: Missing missing
3: Array{Any}((225,))
1: Float64 189.49076
2: Float64 191.64219
3: Float64 194.163
4: Float64 194.49731
5: Float64 198.85504
...
221: Missing missing
222: Missing missing
223: Missing missing
224: Missing missing
225: Missing missing
2: Array{Symbol}((3,))
1: Symbol A
2: Symbol B
3: Symbol C
The spreadsheet may be downloaded at the link I have indicated in the comment section.

Ideally you would read the file with CSV.read, which by default treats the first row as column names. However, if you already have a DataFrame like this one, you can still rename its columns.
Suppose you have data like this:
julia> df = DataFrame([["a" "b"];[3 7]], :auto)
2×2 DataFrame
Row │ x1 x2
│ Any Any
─────┼──────────
1 │ a b
2 │ 3 7
You can use rename! to assign the column names:
julia> rename!(df, Symbol.(Vector(df[1,:])))[2:end,:]
1×2 DataFrame
Row │ a b
│ Any Any
─────┼──────────
1 │ 3 7
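A non-mutating variant of the same idea, as a minimal sketch: it takes the names from the first row, drops that row, and then narrows the `Any` columns to concrete element types. The variable names `new_names` and `out` are illustrative only and do not come from the original answer.

```julia
using DataFrames

# Same toy data as above: the first row holds the intended column names.
df = DataFrame(["a" "b"; 3 7], :auto)

# Convert the first row to a vector of strings to use as names.
new_names = string.(Vector(df[1, :]))

# Drop the header row and rename; `rename` (no `!`) leaves `df` untouched.
out = rename(df[2:end, :], new_names)

# Columns are still `Any`; broadcasting `identity` narrows each eltype.
out = mapcols(col -> identity.(col), out)
```

Whether narrowing is worth it depends on the data: a column that mixes types will simply stay `Any`.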

There is XLSX.readtable, which has signature
readtable(filepath, sheet, [columns]; [first_row], <args omitted for brevity>)
You can pass first_row to tell it where the data begins.
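Regarding the `ArgumentError` in the question: in XLSX.jl releases where `readtable` returns a `(columns, labels)` tuple, which is what the dump above shows, the tuple can be splatted into the `DataFrame` constructor. The sketch below emulates such a tuple with invented values so that it runs without a spreadsheet. The commented call at the end reuses the file and sheet names from the question and is an assumption about how the real call would look; newer XLSX.jl releases, to my knowledge, return a Tables.jl-compatible object that `DataFrame(d)` accepts directly.

```julia
using DataFrames, Dates

# Stand-in for the (columns, labels) tuple shown in the dump; values are invented.
columns = Any[
    Any[Date(2004, 4, 1), Date(2004, 5, 1)],
    Any[329.16091, missing],
    Any[189.49076, missing],
]
labels = [:A, :B, :C]
d = (columns, labels)

# Splatting passes the columns and the labels as two separate arguments.
df = DataFrame(d...)

# With a real file (names taken from the question) the call would presumably be:
# df = DataFrame(XLSX.readtable(f, "dados", "A:C", first_row = 5)...)
# Leaving out `header = false` should make the row at `first_row` the column names.
```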

Related

Iterate over dataframe names in Julia

I am trying to generate n dataframes using a loop, where each dataframe has one column with i rows populated with random numbers where i=1:n. So far, none of the following iterating methods work for iterating over dataframe names in order to generate them:
n = 5;
for i = 1:n
"df_$i" = DataFrame(rand($i), :auto)
end
or
n = 5;
for i = 1:n
"df_$(i)" = DataFrame(rand($i), :auto)
end
Thanks!

Is this what you want?
julia> [DataFrame("col$i" => rand(i)) for i in 1:3]
3-element Vector{DataFrame}:
1×1 DataFrame
Row │ col1
│ Float64
─────┼──────────
1 │ 0.368821
2×1 DataFrame
Row │ col2
│ Float64
─────┼──────────
1 │ 0.757023
2 │ 0.201711
3×1 DataFrame
Row │ col3
│ Float64
─────┼──────────
1 │ 0.702651
2 │ 0.256179
3 │ 0.560374
(I additionally showed you how to dynamically generate the name of the column in each data frame)
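If the data frames need to be looked up by name, as the `df_$i` attempts in the question suggest, one option is a `Dict` keyed by the generated names. This is a sketch rather than part of the original answer; the key pattern `df_$i` comes from the question, while `dfs` and the column name `x1` are assumptions.

```julia
using DataFrames

n = 5

# Key each data frame by the name the loop was trying to assign.
dfs = Dict("df_$i" => DataFrame(x1 = rand(i)) for i in 1:n)

# Look a data frame up by its generated name.
third = dfs["df_3"]
```

A collection avoids creating variables dynamically, which Julia only allows through `eval` or macros.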

Julia subsetting dataframe with multiple conditions

In DataFramesMeta, why should I wrap every condition within a pair of parentheses? Below is an example dataframe where I want a subset that contains values greater than 1 or is missing.
d = DataFrame(a = [1, 2, missing], b = ["x", "y", missing]);
Using DataFramesMeta to subset:
@chain d begin
    @subset @byrow begin
        (:a > 1) | (:a===missing)
    end
end
If I don't use parentheses, errors pop up.
@chain d begin
    @subset @byrow begin
        :a > 1 | :a===missing
    end
end
# ERROR: LoadError: TypeError: non-boolean (Missing) used in boolean context

The reason is operator precedence (and is unrelated to DataFramesMeta.jl).
See:
julia> dump(:(2 > 1 | 3 > 4))
Expr
head: Symbol comparison
args: Array{Any}((5,))
1: Int64 2
2: Symbol >
3: Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol |
2: Int64 1
3: Int64 3
4: Symbol >
5: Int64 4
As you can see, 2 > 1 | 3 > 4 is parsed as 2 > (1 | 3) > 4, which is not what you want.
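The same effect can be checked on plain integers, without any data frame. This small sketch uses only Base Julia; the variable names are illustrative.

```julia
# `|` binds tighter than `>`, so the unparenthesized form is a chained comparison.
unparenthesized = 2 > 1 | 3 > 4      # parsed as 2 > (1 | 3) > 4
parenthesized = (2 > 1) | (3 > 4)    # the intended "or" of two comparisons

# The two spellings produce the same expression tree.
same_tree = :(2 > 1 | 3 > 4) == :(2 > (1 | 3) > 4)
```

Here `unparenthesized` is `false` because `1 | 3` is `3` and `2 > 3` fails, while `parenthesized` is `true`.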
However, I would recommend the following syntax for your case:
julia> @chain d begin
           @subset @byrow begin
               coalesce(:a > 1, true)
           end
       end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
or
julia> @chain d begin
           @subset @byrow begin
               ismissing(:a) || :a > 1
           end
       end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
I personally prefer coalesce but it is a matter of taste.
Note that ||, unlike |, does not require parentheses, but the order of the conditions matters: ismissing(:a) has to come first so that short-circuiting skips the comparison when :a is missing. With the conditions in the other order you get an error:
julia> @chain d begin
           @subset @byrow begin
               :a > 1 || ismissing(:a)
           end
       end
ERROR: TypeError: non-boolean (Missing) used in boolean context
Finally, with @rsubset this can be just:
julia> @chain d begin
           @rsubset coalesce(:a > 1, true)
       end
2×2 DataFrame
Row │ a b
│ Int64? String?
─────┼──────────────────
1 │ 2 y
2 │ missing missing
(I assume you want @chain because this is one of several steps in your analysis, so I keep it.)
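For comparison, the same filter can be written with plain DataFrames.jl and no macros. This is a sketch under the assumption that only the `:a` column matters; the helper name `keep` is hypothetical.

```julia
using DataFrames

d = DataFrame(a = [1, 2, missing], b = ["x", "y", missing])

# `x > 1` is `missing` when `x` is missing; `coalesce` turns that into `true`.
keep(x) = coalesce(x > 1, true)

# `ByRow` applies `keep` to each element of column `:a`.
out = subset(d, :a => ByRow(keep))
```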

Julia expression with equality operator does not work

The following code works as intended:
x = 1
exp = Expr(:(=), :x, 4) # :(x = 4)
eval(exp) # x is now equal to 4 as expected
The following code fails:
x = 1
exp = Expr(:(==), :x, 4) # Got :($(Expr(:(==), :x, 4))) instead of the expected :(x == 4)
eval(exp) # ERROR: syntax: invalid syntax (== (outerref x) 4)

== is a function, so you have:
julia> dump(:(x==4))
Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol ==
2: Symbol x
3: Int64 4
but
julia> dump(:(x=4))
Expr
head: Symbol =
args: Array{Any}((2,))
1: Symbol x
2: Int64 4
so in particular the following works:
julia> x = 1
1
julia> exp = Expr(:call, :(==), :x, 4)
:(x == 4)
julia> dump(exp)
Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol ==
2: Symbol x
3: Int64 4
julia> eval(exp)
false
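A quick way to confirm the point is to compare the hand-built expression with what the parser produces. The sketch below uses `ex` instead of `exp` only to avoid shadowing `Base.exp`; that renaming is my choice, not part of the original answer.

```julia
x = 1

# `x == 4` is an ordinary function call, so its head is `:call`.
ex = Expr(:call, :(==), :x, 4)

# It matches what the parser builds from the literal syntax.
matches_quote = ex == :(x == 4)
matches_parse = ex == Meta.parse("x == 4")

# Evaluates the comparison against the global `x`.
result = eval(ex)
```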

Frequencies in a vector or list using counts()

How can I use counts() to show both the items and their frequencies? For example:
a=[1,2,2,3]
counts(a) gives 1,2,1
How can I get:
1:1, 2:2, 3:1?
Thanks

It looks like you are already using StatsBase, because that is where the counts function you mention is defined. The function you are looking for is called countmap:
using StatsBase
a = [1,2,2,3];
countmap(a)
# Dict{Int64, Int64} with 3 entries:
# 2 => 2
# 3 => 1
# 1 => 1
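To get the `item:count` layout from the question, the dictionary can be sorted and formatted. A minimal sketch, with `pairs_sorted` and `formatted` as illustrative names:

```julia
using StatsBase

a = [1, 2, 2, 3]

# A Dict has no guaranteed order, so sort the pairs by item first.
pairs_sorted = sort(collect(countmap(a)))

# Format each pair as "item:count".
formatted = join(("$k:$v" for (k, v) in pairs_sorted), ", ")
```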

If you prefer tabular output you can also do:
julia> using FreqTables
julia> a = [1,2,2,3];
julia> freqtable(a)
3-element Named Vector{Int64}
Dim1 │
──────┼──
1 │ 1
2 │ 2
3 │ 1

Julia GLM - using devresid for plotting

I would like to do some residual analysis for a GLM.
My model is in the form
using GLM
model = glm(@formula(y ~ x), data, Binomial(), LogitLink())
My text book suggests that residual analysis in the GLM be performed using deviance residuals. I was glad to see that Julia's GLM has a devresid() function, and that it suggests how to use it for plotting (sign(y - μ) * sqrt(devresid(D, y, μ))). However, I'm at a total loss as to what the input arguments are supposed to be. Looking at the doc-string:
?devresid
devresid(D, y, μ::Real)
Return the squared deviance residual of μ from y for distribution D
The deviance of a GLM can be evaluated as the sum of the squared deviance residuals. This is the principal use for these values. The actual deviance residual, say for plotting, is the signed square root of this value
sign(y - μ) * sqrt(devresid(D, y, μ))
Examples
julia> devresid(Normal(), 0, 0.25) ≈ abs2(0.25)
true
julia> devresid(Bernoulli(), 1, 0.75) ≈ -2*log(0.75)
true
julia> devresid(Bernoulli(), 0, 0.25) ≈ -2*log1p(-0.25)
true
D: I'm guessing that in my case it is Binomial()
y: I'm guessing this is the indicator variable for a single case, i.e. 1 or 0
μ: What is this?
How can I use this function to produce things like plots of the deviance residual on a normal probability scale and versus fitted values?
Here's the data I'm using in CSV form
x,y
400,0
220,1
490,0
210,1
500,0
270,0
200,1
470,0
480,0
310,1
240,1
490,0
420,0
330,1
280,1
210,1
300,1
470,1
230,0
430,0
460,0
220,1
250,1
200,1
390,0

I understand this is what you want:
julia> data = DataFrame(X=[1,1,1,2,2], Y=[1,1,0,0,1])
5×2 DataFrame
│ Row │ X │ Y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 1 │ 1 │
│ 3 │ 1 │ 0 │
│ 4 │ 2 │ 0 │
│ 5 │ 2 │ 1 │
julia> model = glm(@formula(Y ~ X), data, Binomial(), LogitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Y ~ 1 + X
Coefficients:
─────────────────────────────────────────────────────────────────────────────
Estimate Std. Error z value Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept) 1.38629 2.82752 0.490286 0.6239 -4.15554 6.92813
X -0.693146 1.87049 -0.37057 0.7110 -4.35923 2.97294
─────────────────────────────────────────────────────────────────────────────
julia> p = predict(model)
5-element Array{Float64,1}:
0.6666664218508201
0.6666664218508201
0.6666664218508201
0.5
0.5
julia> y = data.Y
5-element Array{Int64,1}:
1
1
0
0
1
julia> @. sign(y - p) * sqrt(devresid(Bernoulli(), y, p))
5-element Array{Float64,1}:
0.9005170462928523
0.9005170462928523
-1.4823033118905455
-1.1774100225154747
1.1774100225154747
(this is what you would get from calling residuals(model, type="deviance") in R)
Note that in the last line I use @. to vectorize the whole line. Alternatively you could have written it as:
julia> sign.(y .- p) .* sqrt.(devresid.(Bernoulli(), y, p))
5-element Array{Float64,1}:
0.9005170462928523
0.9005170462928523
-1.4823033118905455
-1.1774100225154747
1.1774100225154747
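The docstring quoted in the question says the deviance is the sum of the squared deviance residuals, which gives a simple sanity check on the result. The sketch below repeats the toy model from this answer and compares the two quantities. The tolerance is an arbitrary choice of mine, and for a plot against fitted values `p` would go on the horizontal axis and `r` on the vertical one.

```julia
using DataFrames, GLM

data = DataFrame(X = [1, 1, 1, 2, 2], Y = [1, 1, 0, 0, 1])
model = glm(@formula(Y ~ X), data, Binomial(), LogitLink())

p = predict(model)   # fitted probabilities, the μ argument of devresid
y = data.Y

# Signed square root of the squared deviance residual, one per observation.
r = sign.(y .- p) .* sqrt.(devresid.(Bernoulli(), y, p))

# Squared residuals should sum to the model deviance.
check = isapprox(sum(abs2, r), deviance(model); rtol = 1e-6)
```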
