How to create simple covariance in Julia on a matrix

Using Julia 0.5. Given:
Supertech = [-.2 .1 .3 .5];
Slowpoke = [.05 .2 -.12 .09];
How in the world can I get a covariance? In Excel I just use
=covariance.p(Supertech,Slowpoke)
and it gives me the correct answer of -0.004875
For the life of me I can't figure out how to get this to work using StatsBase.cov()
I've tried putting this into a matrix like:
X = [Supertech; Slowpoke]'
which gives me a nice:
4×2 Array{Float64,2}:
-0.2 0.05
0.1 0.2
0.3 -0.12
0.5 0.09
but I can't get this simple thing to work. I keep coming up with dimension mismatches when I try to use the WeightedVector type.

The syntax [-.2 .1 .3 .5] doesn't create a vector; it creates a one-row matrix. The cov function is actually defined in base Julia, but it requires vectors. So you either need to use commas so that you create vectors in the first place ([-.2, .1, .3, .5]), or use the vec function to reshape the one-row matrix into a one-dimensional vector. Note also that cov computes the "corrected" (sample) covariance by default, dividing by n-1, whereas Excel's COVARIANCE.P computes the "uncorrected" (population) covariance, dividing by n. Pass false as the third argument to turn the correction off.
julia> cov(vec(Supertech), vec(Slowpoke))
-0.0065
julia> cov(vec(Supertech), vec(Slowpoke), false)
-0.004875

Related

Get quantile for each value

Is there an implemented (!) function in R which gives you the empirical quantile for each value? I couldn't find any ...
Let's say we have x
x = c(1,3,4,2)
I want to have the quantile of each element.
[1] 0.25, 0.75, 1, 0.5
Thank you very much!
You can use the ecdf() function:
ecdf(x)(x)
[1] 0.25 0.75 1.00 0.50
ecdf(x) creates a function, and you pass the elements of x to that function. The syntax admittedly looks strange.
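If it helps to see what ecdf() returns, the same call can be split into two steps (Fn is just an arbitrary name for the returned function):
Fn <- ecdf(x)  # Fn is itself a function: the empirical CDF of x
Fn(x)          # [1] 0.25 0.75 1.00 0.50
Fn(2.5)        # the function can be evaluated anywhere; here it gives 0.5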

Julia : eigs() function returning different values after every evaluation

I noticed that after running the eigs() function multiple times, it gives a different but approximately equal result every time.
Is there a way to make it return the same result every time? The output sometimes comes with a "+" sign and sometimes with a "-" sign.
Content of M:
[2, 1] = 1.0
[3, 1] = 0.5
[1, 2] = 1.0
[3, 2] = 2.5
[1, 3] = 0.5
[2, 3] = 2.5
M = M+M'
(d, v) = eigs(M, nev=1, which=:LR)
I tried running the same function on the same sparse matrix in Python. Although the matrix display looks a bit different, I think it is the same matrix; the indices just start from 0 in Python, whereas in Julia they start from 1. I do not know if that is a big difference. The values are approximately the same in Julia and Python, but in Python they are identical after every evaluation. Also, the return values in Python are complex numbers, while in Julia they are real.
Python code:
from scipy.sparse import linalg
eigenvalue, eigenvector = linalg.eigs(M.T, k=1, which='LR')
Content of M.T:
(1, 0) 1.0
(2, 0) 0.5
(0, 1) 1.0
(2, 1) 2.5
(0, 2) 0.5
(1, 2) 2.5
Any idea why this behavior is occurring?
Edit :
These are the results of four evaluations of eigs:
==========eigvalues==============
[2.8921298144977587]
===========eigvector=============
[-0.34667468634025667
-0.679134250677923
-0.6469878912367839]
=================================
==========eigvalues==============
[2.8921298144977596]
===========eigvector=============
[0.34667468634025655
0.6791342506779232
0.646987891236784]
=================================
==========eigvalues==============
[2.8921298144977596]
===========eigvector=============
[0.34667468634025655
0.6791342506779233
0.6469878912367841]
=================================
==========eigvalues==============
[2.8921298144977583]
===========eigvector=============
[0.3466746863402567
0.679134250677923
0.646987891236784]
=================================
The result of eigs depends on the initial vector for the Lanczos iterations. When it is not specified, it is random, so even though all the returned vectors are correct, the phase (sign) is not guaranteed to be the same across runs.
If you want the result to be the same every time, you can set v0 in eigs, e.g.
eigs(M, nev=1, which=:LR, v0 = ones(3))
As long as v0 doesn't change you should get deterministic results.
Note that if you want a deterministic result for testing purposes, you might want to consider a testing scheme that allows for phase shifts, since the phase can flip under the smallest perturbation. For example, if you link a different BLAS or change the number of threads, the result might change again.

Multiply Probability Distribution Functions

I'm having a hard time building an efficient procedure that adds and multiplies probability density functions to predict the distribution of time that it will take to complete two process steps.
Let "a" represent the probability distribution function of how long it takes to complete process "A". Zero days = 10%, one day = 40%, two days = 50%. Let "b" represent the probability distribution function of how long it takes to complete process "B". Zero days = 10%, one day = 20%, etc.
Process "B" can't be started until process "A" is complete, so "B" is dependent upon "A".
a <- c(.1, .4, .5)
b <- c(.1,.2,.3,.3,.1)
How can I calculate the probability density function of the time to complete "A" and "B"?
This is what I'd expect as the output for the following example:
totallength <- 0 # initialize
totallength[1:(length(a) + length(b))] <- 0 # initialize
totallength[1] <- a[1]*b[1]
totallength[2] <- a[1]*b[2] + a[2]*b[1]
totallength[3] <- a[1]*b[3] + a[2]*b[2] + a[3]*b[1]
totallength[4] <- a[1]*b[4] + a[2]*b[3] + a[3]*b[2]
totallength[5] <- a[1]*b[5] + a[2]*b[4] + a[3]*b[3]
totallength[6] <- a[2]*b[5] + a[3]*b[4]
totallength[7] <- a[3]*b[5]
print(totallength)
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
sum(totallength)
[1] 1
I have an approach in Visual Basic that uses three for loops (one for each of the steps, and one for the output), but I hope I don't have to loop in R.
Since this seems to be a pretty standard process flow question, part two of my question is whether any libraries exist to model operations flow so I'm not creating this from scratch.
The efficient way to do this sort of operation is to use a convolution:
convolve(a, rev(b), type="open")
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
This is efficient both because it's less typing than computing each value individually and also because it's implemented in an efficient way (using the Fast Fourier Transform, or FFT).
You can confirm that each of these values is correct using the formulas you posted:
(expected <- c(a[1]*b[1], a[1]*b[2] + a[2]*b[1], a[1]*b[3] + a[2]*b[2] + a[3]*b[1], a[1]*b[4] + a[2]*b[3] + a[3]*b[2], a[1]*b[5] + a[2]*b[4] + a[3]*b[3], a[2]*b[5] + a[3]*b[4], a[3]*b[5]))
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
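Another way to see why the convolution gives the right numbers: form every pairwise product a[i]*b[j] and then sum the products that correspond to the same total number of days. A small sketch of that bookkeeping (just for intuition; convolve() is the efficient way):
probs <- outer(a, b)                                     # P(A takes i-1 days and B takes j-1 days)
days  <- outer(seq_along(a) - 1, seq_along(b) - 1, "+")  # total days for each combination
tapply(probs, days, sum)
#    0    1    2    3    4    5    6
# 0.01 0.06 0.16 0.25 0.28 0.19 0.05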
See the distr package. The term "multiply" is unfortunate here, since the situation described is not one where probabilities are being multiplied (where "multiply" would be the natural word). It is rather a sort of sequential addition, and that is exactly what the distr package provides as its interpretation of what "+" should mean when applied to two discrete distributions.
A <- DiscreteDistribution(setNames(0:2, c("Zero", "one", "two")), a)
B <- DiscreteDistribution(setNames(0:4, c("Zero2", "one2", "two2",
                                          "three2", "four2")), b)
?'operators-methods' # where operations on 2 DiscreteDistribution are convolution
plot(A+B)
After a bit of nosing around I see that the actual numeric values can be found here:
A.then.B <- A + B
environment(A.then.B@d)$dx
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
It seems like there should be a method for displaying the probabilities, and since I'm not a regular user of this fascinating package there may well be one. Do read the vignette and the code demos, which I have not yet done. Further noodling around convinces me that the right place to look is the companion package distrDoc, whose vignette is 100+ pages long. It shouldn't have required any effort to find, either, since that advice is in the messages printed when the package is loaded ... though in my defense there were a couple of pages of messages, so it was more tempting to jump straight into coding and the help pages.
I'm not familiar with a dedicated package that does exactly what your example describes, but let me suggest a more robust approach to this problem.
You are looking for a way to estimate the distribution of a process made up of n steps (two in your case), which may not always be as easy to compute analytically as in your example.
The approach I would use is simulation: draw 10,000 observations from the underlying distributions and then estimate the density of the simulated results.
Using your example, we can do the following:
x <- runif(10000)
y <- runif(10000)
library(data.table)
z <- as.data.table(cbind(x,y))
# map each uniform draw to a number of days using the cumulative probabilities of a and b
z[x >= 0   & x < 0.1, a_days := 0]
z[x >= 0.1 & x < 0.5, a_days := 1]
z[x >= 0.5 & x <= 1,  a_days := 2]
z[y >= 0   & y < 0.1, b_days := 0]
z[y >= 0.1 & y < 0.3, b_days := 1]
z[y >= 0.3 & y < 0.6, b_days := 2]
z[y >= 0.6 & y < 0.9, b_days := 3]
z[y >= 0.9 & y <= 1,  b_days := 4]
z[,total_days:=a_days+b_days]
hist(z[,total_days])
This gives a very good proxy for the density, and the approach would also work if your second process were drawn from, say, an exponential distribution, in which case you would use the rexp function to generate b_days directly.
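For instance, here is a minimal sketch of that exponential variant (the rate of 1 is an arbitrary choice, and sample() is used as a shortcut for the uniform-cut construction above):
set.seed(1)
n <- 10000
a_days <- sample(0:2, n, replace = TRUE, prob = a)  # discrete step A, as before
b_days <- rexp(n, rate = 1)                         # step B now drawn from an exponential
total  <- a_days + b_days
plot(density(total), main = "Simulated time to complete A then B")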

Non-conformable arrays in R

y <- matrix(c(7, 9, -5, 0, 2, 6), ncol = 1)
try <- t(y)
tryy <- try %*% y
i <- solve(tryy)
h <- y %*% i %*% try
uniroot(as.vector(solve(((1-x) * diag(6)) + h)), c(-Inf, Inf))
Error in (1 - x) * diag(6) : non-conformable arrays
The purpose of the command uniroot(as.vector(solve(((1-x) * diag(6)) + h)), c(-Inf, Inf)) is to solve the characteristic equation det[(1-λ)I + h] = 0, where λ denotes the (unknown) eigenvalues, I is the identity matrix, and h is the hat matrix y(y'y)^(-1)y'. We have to solve for λ.
I don't understand where the problem is here. I have tried:
as.vector(solve(6*diag(6)+h))
This is not non-conformable. So why does it not work inside the uniroot function?
Your question is a bit confusing, so I have to make a couple of assumptions. If you want the eigenvalues of h, then the characteristic equation is:
det(h - I*λ) = 0
not
det[(1-λ)I+h] = 0
So I used the former.
Given the above, the short answer is: do it this way.
f <- function(lambda) det(h -lambda*diag(6))
F <- Vectorize(f)
library(rootSolve)
uniroot.all(F,c(-1000,1000),n=2000)
# [1] 0 1
# or, much more simply
eigen(h)$values
# [1] 1.000000e+00 2.220446e-16 0.000000e+00 -2.731318e-18 -6.876381e-18 -7.365903e-17
So h has 2 eigenvalues, 0 and 1. Note that the built-in function eigen(...) finds 6 roots, but 5 of them are within the machine tolerance of 0.
The question about why your code fails is a bit more involved.
First, your code:
tryy <- try %*% y
is the dot product of y with itself (so, a scalar), returned as a matrix with one element. When you "invert" that using solve(...)
i <- solve(tryy)
you simply take the reciprocal, so i is also a matrix with 1 element. I'm not sure if this is what you had in mind.
Second, uniroot(...) does not work this way. The first argument must be a function; you've passed an expression which depends on x, which in turn is undefined. You could try:
f <- function(x) det(h-x*diag(6))
uniroot(f,c(-Inf,Inf))
but this wouldn't work either because (a) uniroot(...) works on a finite interval, (b) it requires that the function f(...) have different sign at the ends of the interval, and (c) in any event it would return only one root (the smaller one).
So you could use uniroot.all(...) from the package rootSolve. uniroot.all(...) also requires a function as its first argument, but there's a twist: the function must be "vectorized". This means that if you pass a vector of lambda values, f(...) should return a vector of the same length. Fortunately, R provides an easy way to "vectorize" a given function:
F <- Vectorize(f)
Even this has its limits. uniroot.all(...) also requires a finite interval, so we have to guess what that is, and it evaluates F on n sub-intervals. So if your interval does not contain all the roots, or if the sub-intervals are not small enough, you will not find all of them.
Using the built-in eigen(...) function is definitely the best option.

KS test for power law

I'm attempting to fit a power-law distribution to a data set, using the method outlined by Aaron Clauset, Cosma Rohilla Shalizi and M. E. J. Newman in their paper "Power-Law Distributions in Empirical Data".
I've found code to compare with my own, but I'm a bit mystified about where some of it comes from. The story so far:
to identify a suitable xmin for the power-law fit, we take each possible xmin, fit a power law to the data above it, compute the corresponding exponent (a), then compute the KS statistic (D) between the fit and the observed data, and finally pick the xmin that corresponds to the minimum of D. The KS statistic is computed as follows:
cx <- c(0:(n-1))/n # n is the sample size for the data >= xmin
cf <- 1-(xmin/z)^a # the cdf for a powerlaw z = x[x>=xmin]
D <- max(abs(cf-cx))
What I don't get is where cx comes from; surely we should be comparing the distance between the empirical distribution and the fitted distribution, something along the lines of:
cx = ecdf(sort(z))
cf <- 1-(xmin/z)^a
D <- max(abs(cf-cx(z)))
I think I'm just missing something very basic, but please do correct me!
The answer is that they are (almost) the same. The easiest way to see this is to generate some data:
z = sort(runif(5,xmin, 10*xmin))
n = length(z)
Then examine the values of the two CDFs
R> (cx1 = c(0:(n-1))/n)
[1] 0.0 0.2 0.4 0.6 0.8
R> (cx2 = ecdf(sort(z)))
[1] 0.2 0.4 0.6 0.8 1.0
Notice that they are almost the same: essentially cx1 is the proportion of the sample strictly below each sorted point, whilst cx2 is the proportion less than or equal to it, so the two differ only by a shift of 1/n.
The advantage of the first approach is that it is very efficient and quick to calculate. The disadvantage is that if your data aren't truly continuous, e.g. z = c(1, 1, 2), then cx1 is wrong. But in that case you shouldn't be fitting a continuous distribution to your data anyway.
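As a quick check of that caveat, here is what happens with a small made-up tied sample:
z <- sort(c(1, 1, 2))
n <- length(z)
(cx1 <- c(0:(n - 1))/n)  # 0.0000000 0.3333333 0.6666667
(cx2 <- ecdf(z)(z))      # 0.6666667 0.6666667 1.0000000
cx1 assigns two different values to the two tied observations, which no CDF can do, hence the caveat about continuous data.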
