I am quite new to the Julia language. I would like to know whether it implements some statistical tests, in particular the t-test for comparison of means. And if so, how can I invoke it?
You can use HypothesisTests.jl for this; see the relevant docs.
(See also: the website juliastats.github.io neatly lists Julia packages for stats and ML.)
Here's how you could go about it:
julia> using Pkg; Pkg.add("HypothesisTests")
julia> using HypothesisTests
julia> xs = [1, 2, 3, 4]
julia> ys = [5, 6, 7, 8]
julia> OneSampleTTest(xs, ys)  # with two vectors this is a paired t-test on the differences xs .- ys
One sample t-test
-----------------
Population details:
parameter of interest: Mean
value under h_0: 0
point estimate: -4.0
95% confidence interval: (-4.0,-4.0)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: 0.0 (extremely significant)
Details:
number of observations: 4
t-statistic: -Inf
degrees of freedom: 3
empirical standard error: 0.0
(The example above just illustrates usage of the function; the data themselves are meaningless.)
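For an actual comparison of means between two independent samples, HypothesisTests.jl also provides two-sample t-tests. A minimal sketch with the same toy data:
julia> UnequalVarianceTTest(xs, ys)  # Welch's t-test, does not assume equal variances
julia> EqualVarianceTTest(xs, ys)    # classic two-sample t-test, assumes equal variances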
This is my first post so I apologize for any formatting issues.
I'm trying to calculate the expected value of a collection of numbers in Julia, given a probability distribution that is a mixture of two Beta distributions. The following code gives the error shown below:
using Distributions, Expectations, Statistics
d = MixtureModel([Beta(1,1),Beta(3,1.6)],[0.5,0.5])
E = expectation(d)
E*rand(32,1)
MethodError: no method matching *(::MixtureExpectation{Vector{IterableExpectation{Vector{Float64}, Vector{Float64}}}, Vector{Float64}}, ::Matrix{Float64})
If I use just a single Beta distribution, the above syntax works fine:
d = Beta(1,1)
E = expectation(d)
E*rand(32,1)
Out = 0.503
And if I use function notation in the expectation, I can calculate expectations of functions using the Mixture model as well.
d = MixtureModel([Beta(1,1),Beta(3,1.6)],[0.5,0.5])
E = expectation(d)
E(x -> x^2)
It just seems not to work with the E * array syntax shown above.
A single distribution yields an IterableExpectation, which supports multiplication by an array, while a mixture distribution yields a MixtureExpectation, which supports multiplication only by a scalar. You can run typeof(E) to check the type in your code.
julia> methodswith(IterableExpectation)
[1] *(r::Real, e::IterableExpectation) in Expectations at C:\JuliaPkg\Julia-1.8.0\packages\Expectations\hZ5Gh\src\iterable.jl:53
[2] *(e::IterableExpectation, h::AbstractArray) in Expectations at C:\JuliaPkg\Julia-1.8.0\packages\Expectations\hZ5Gh\src\iterable.jl:44
...
julia> methodswith(MixtureExpectation)
[1] *(r::Real, e::MixtureExpectation) in Expectations at C:\JuliaPkg\Julia-1.8.0\packages\Expectations\hZ5Gh\src\mixturemodels.jl:15
...
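As a quick check (a sketch using the same model as in the question), you can confirm the type and fall back on the function-application form, which does work for mixtures:
julia> using Distributions, Expectations
julia> d = MixtureModel([Beta(1, 1), Beta(3, 1.6)], [0.5, 0.5]);
julia> E = expectation(d);
julia> typeof(E) <: MixtureExpectation  # explains why E * rand(32, 1) finds no matching * method
true
julia> E(identity)  # mixture mean, i.e. 0.5 * mean(Beta(1, 1)) + 0.5 * mean(Beta(3, 1.6))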
I have a column of FTSE prices. To calculate the log returns of the FTSE, the R code is given below:
log_return = diff(log(FTSE))*100
How do I translate this to Julia? I also need to run an ADF test:
adf.test(log(FTSE))
adf.test(log_return)
I tried this:
ADFTest(FTSE::AbstractVector{T}, deterministic::Symbol, lag::Int) where T<:Real
But I got an error.
In the same way, I tried to translate this line:
Box.test(log_return,lag=10,type = "Ljung-Box")
But an error occurred.
Can anyone help me write the Julia version of these? TIA.
I assume you want to use ADFTest from HypothesisTests.jl. Here is how to use it (I picked some example parameters for the test; this assumes you want to run a unit-root test on the log returns of the FTSE):
julia> using HypothesisTests
julia> FTSE = rand(20);
julia> log_return = diff(log.(FTSE));
julia> ADFTest(log_return, :none, 1)
Augmented Dickey-Fuller unit root test
--------------------------------------
Population details:
parameter of interest: coefficient on lagged non-differenced variable
value under h_0: 0
point estimate: -2.23676
Test summary:
outcome with 95% confidence: reject h_0
p-value: <1e-06
Details:
sample size in regression: 17
number of lags: 1
ADF statistic: -5.27364
Critical values at 1%, 5%, and 10%: [-2.69346 -1.95991 -1.60666]
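For the Box.test part of the question, HypothesisTests.jl also provides a Ljung-Box test. A minimal sketch (check the exact argument list in the package docs):
julia> LjungBoxTest(log_return, 10)  # Ljung-Box Q test with 10 lags, analogous to Box.test(log_return, lag = 10, type = "Ljung-Box")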
My professor assigned us some homework questions regarding normal distributions. We are using RStudio to calculate our values instead of the z-tables.
One question is about meteors, where the mean (μ) = 4.35, the standard deviation (σ) = 0.59, and we are looking for the probability of x > 5.
I already figured out the answer with 1-pnorm((5-4.35)/0.59) ~ 0.135.
However, I am currently having some difficulty trying to understand what pnorm calculates.
Originally, I just assumed that z-scores were the only arguments needed. So I proceeded to use pnorm(z-score) for most of the normal curve problems.
The help page for pnorm accessed through ?pnorm() indicates that the usage is:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE).
My professor also says that I am ignoring the mean and sd by just using pnorm(z-score). I feel like it is just easier to type in one value instead of the whole set of arguments. So I experimented and found that
1-pnorm((5-4.35)/0.59) = 1-pnorm(5,4.35,0.59)
So it looks like pnorm(z-score) = pnorm(x, μ, σ).
Is there a reason that using the z-score allows you to skip the mean and standard deviation in the pnorm function?
I have also noticed that adding the μ, σ arguments together with the z-score gives the wrong answer (e.g., pnorm(z-score, μ, σ)).
> 1-pnorm((5-4.35)/0.59)
[1] 0.1352972
> pnorm(5,4.35,0.59)
[1] 0.8647028
> 1-pnorm(5,4.35,0.59)
[1] 0.1352972
> 1-pnorm((5-4.35)/0.59,4.35,0.59)
[1] 1
That is because a z-score is standard normally distributed, meaning it has μ = 0 and σ = 1, which, as you found out, are the default parameters for pnorm().
The z-score is just the transformation of any normally distributed value to a standard normally distributed one.
So when you compute pnorm of the z-score for x = 5, you indeed get the same value as pnorm(5, 4.35, 0.59), i.e. P(X ≤ 5) in a normal distribution with μ = 4.35 and σ = 0.59.
But when you add μ = 4.35 and σ = 0.59 to your z-score inside pnorm(), the result is wrong, because you are looking up a standard normally distributed value in a different distribution.
pnorm() (to answer your first question) computes the cumulative distribution function, which gives P(X ≤ x), the probability that the random variable takes a value less than or equal to x. That is why you use 1 - pnorm(...) to get P(X > x).
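Worked through with the numbers from the question: z = (5 - 4.35) / 0.59 ≈ 1.1017, so pnorm(1.1017) = pnorm(5, 4.35, 0.59) ≈ 0.8647 = P(X ≤ 5), and 1 - 0.8647 ≈ 0.1353 = P(X > 5).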
Does Julia have a function to calculate the density points where p% of the distribution is included?
Something like the scipy.stats norm.ppf function mentioned in this answer
Example: 2-sided 95% confidence interval:
> norm.ppf(1-(1-0.95)/2)
1.96
> norm.ppf(1-(1+0.95)/2)
-1.96
The quantile function from the Distributions package is probably (95% CI) what you are looking for. For the Normal distribution you have:
julia> using Distributions
julia> quantile(Normal(0.0, 1.0),1-(1+0.95)/2)
-1.9599639845400576
julia> quantile(Normal(0.0, 1.0),1-(1-0.95)/2)
1.9599639845400576
The same function quantile can be used for other distributions.
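For example, the same call on a Student t distribution gives the familiar t critical value (a quick sketch):
julia> quantile(TDist(5), 0.975)  # ≈ 2.571, the two-sided 95% critical value for 5 degrees of freedom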
Just to add another related enhancement to the answer, especially for users of Bayesian posteriors, we can define medianinterval as follows:
medianinterval(d, p = 0.95) = quantile(d, 1 - (1 + p)/2), quantile(d, (1 + p)/2)
and have:
julia> medianinterval(Normal())
(-1.9599639845400576, 1.9599639845400576)
But sometimes a more efficient (i.e. shorter) interval will be around the mode of the distribution. To address this we can define:
function modeinterval(d, p = 0.95)
    mcdf = cdf(d, mode(d))
    endpoints = mcdf < p/2 ? (0, p) : mcdf > 1 - p/2 ? (1 - p, 1) : (mcdf - p/2, mcdf + p/2)
    return map(x -> quantile(d, x), endpoints)
end
For the Normal distribution it doesn't matter since the mode is also the median, but for other distributions such as the Beta, we can have:
julia> modeinterval(Beta(2,8),0.2)
(0.09639068616673087, 0.15355172436770012)
julia> medianinterval(Beta(2,8),0.2)
(0.1498495815725847, 0.21227857915644155)
julia> 0.15355172436770012 - 0.09639068616673087
0.05716103820096925
julia> 0.21227857915644155 - 0.1498495815725847
0.06242899758385684
The mode interval covers the same fraction of the distribution with a shorter length. See Credible interval for related discussion.
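As a quick sanity check (a sketch), both intervals do cover the requested 20% of the Beta(2, 8) mass:
julia> cdf(Beta(2, 8), 0.15355172436770012) - cdf(Beta(2, 8), 0.09639068616673087)  # ≈ 0.2
julia> cdf(Beta(2, 8), 0.21227857915644155) - cdf(Beta(2, 8), 0.1498495815725847)   # ≈ 0.2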
I am using the eigs() function in Julia to compute eigenvalues and eigenvectors. The results are non-deterministic and often full of 0.0 values. A temporary solution is to link LAPACK 2.0.
Any idea how to do that on Ubuntu Linux? So far I have not been able to link it, and I do not have advanced Linux administration skills, so it would be good if someone could post a guide on how to link it correctly.
Thanks a lot.
Edit:
I wanted to add results, but I noticed one flaw in the code. I was using matrix = sparse(map(collect,zip([triple(e,"weight") for e in edges(g)]...))..., num_vertices(g), num_vertices(g)), which came from an answer to one of my earlier questions. It works fine when the vertices are indexed from 1, but my vertices have arbitrary indexes because they are read from a file. So I changed num_vertices to be the largest index. What I had not noticed is that the computation then treats the graph as having, for example, 1000 vertices whenever the largest index is 1000, even though the whole graph might consist of only the three vertices 1, 10, and 1000. Any idea how to fix it?
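(One possible fix, sketched here with hypothetical data: remap the arbitrary vertex ids to consecutive indices before building the sparse matrix.)
ids = [1, 10, 1000]                                  # example vertex ids as read from a file
index = Dict(id => i for (i, id) in enumerate(ids))  # 1 => 1, 10 => 2, 1000 => 3
# use index[src] and index[dst] when filling the sparse matrix,
# so its size is length(ids) × length(ids) instead of maximum(ids)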
Edit 2:
#Content of matrix = matrix+matrix'
[2, 1] = 10.0
[3, 1] = 14.0
[1, 2] = 10.0
[3, 2] = 10.0
[5, 2] = 2.0
[1, 3] = 14.0
[2, 3] = 10.0
[4, 3] = 20.0
[5, 3] = 20.0
[3, 4] = 20.0
[2, 5] = 2.0
[3, 5] = 20.0
[6, 5] = 10.0
[5, 6] = 10.0
matrix = matrix+matrix'
(d, v) = eigs(matrix, nev=1, which=:LR, maxiter=1)
Five executions of the code above:
[-0.3483956604402672
-0.3084333257587648
-0.6697046040724708
-0.37450798643794125
-0.4249810113292739
-0.11882760090004019]
[0.3483956604402674
0.308433325758765
0.6697046040724703
0.3745079864379416
0.424981011329274
0.11882760090004027]
[-0.3483956604402673
-0.308433325758765
-0.669704604072471
-0.37450798643794114
-0.4249810113292739
-0.1188276009000403]
[0.34839566044026726
0.30843332575876503
0.6697046040724703
0.37450798643794114
0.4249810113292739
0.11882760090004038]
[0.34839566044026715
0.30843332575876503
0.6697046040724708
0.3745079864379412
0.4249810113292738
0.11882760090004038]
The algorithm is indeed non-deterministic (as is obvious in the example in the question). But, there are two kinds of non-determinism in the answers:
the complete sign reversals of the eigenvector.
small accuracy errors.
If a vector is an eigenvector, so is every scalar multiple of it (mathematically, the eigenvectors belonging to an eigenvalue form a subspace). Thus, if v is an eigenvector, so is λv. When λ = -1 this is the sign reversal; but 2v is also an eigenvector. The eigs function normalizes the vectors to norm 1, so the only freedom left is this sign reversal. To remove this non-determinism, you can choose a sign for the first non-zero coordinate of the vector (say, positive) and multiply the eigenvector so that it holds. In code:
v = v * sign(v[findfirst(!iszero, v)])  # flip the sign so the first non-zero entry is positive
Regarding the second non-determinism source (inaccuracies), it is important to note that the true eigenvalues and eigenvectors are often real numbers which cannot be accurately represented by Float64, thus the return values are always off. If the level of accuracy needed is low enough, rounding the values deterministically should make the resulting approximation the same. If this is not clear, consider an algorithm for calculating sqrt(2). It may be non-deterministic and return 1.4142135623730951 and sometimes 1.4142135623730949, but rounding to 5 decimal places would always yield 1.41421.
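Putting both fixes together (a sketch; on recent Julia versions eigs lives in the Arpack.jl package):
julia> using Arpack                            # eigs moved from Base to Arpack.jl
julia> d, v = eigs(matrix, nev=1, which=:LR);
julia> u = vec(v);
julia> u .*= sign(u[findfirst(!iszero, u)]);   # remove the sign ambiguity
julia> round.(u, digits=5)                     # absorb small floating-point noise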
The above should provide a guide to making the results more deterministic. But consider:
If there are multiple eigenvalues with the same value, the subspace of eigenvectors is more than 1 dimensional and there is more freedom to choose an eigenvector. This could make finding a deterministic vector (or vectors) to span this space more intricate.
Does the application really require this determinism?
(Thanks for the code bits - they do help. Even better when they can be quickly cut-and-pasted).