Pearson's r in Julia

I couldn't find a ready-made function in Julia to compute Pearson's r, so I tried to write it myself, but I've run into trouble.
Code:
r(x,y) = (sum(x*y) - (sum(x)*sum(y))/length(x))/sqrt((sum(x^2)-(sum(x)^2)/length(x))*(sum(y^2)-(sum(y)^2)/length(x)))
If I attempt to run this on two arrays:
b = [4,8,12,16,20,24,28]
q = [5,10,15,20,25,30,35]
I get the following error:
ERROR: `*` has no method matching *(::Array{Int64,1}, ::Array{Int64,1})
in r at none:1

Pearson's r is available in Julia as cor:
julia> cor(b,q)
1.0
When you're looking for functions in Julia, the apropos function can be very helpful:
julia> apropos("pearson")
Base.cov(v1[, v2][, vardim=1, corrected=true, mean=nothing])
Base.cor(v1[, v2][, vardim=1, mean=nothing])
The issue you're running into with your definition is the difference between elementwise multiplication/exponentiation and matrix multiplication/exponentiation. To get the elementwise behavior you intend, you need to use .* and .^:
r(x,y) = (sum(x.*y) - (sum(x)*sum(y))/length(x))/sqrt((sum(x.^2)-(sum(x)^2)/length(x))*(sum(y.^2)-(sum(y)^2)/length(x)))
With only those three changes, your r definition seems to match Julia's cor to within a few ULPs:
julia> cor(b,q)
1.0
julia> x,y = randn(10),randn(10)
([-0.2384626335813905,0.0793838075714518,2.395918475924737,-1.6271954454542266,-0.7001484742860653,-0.33511064476423336,-1.5419149314518956,-0.8284664940238087,-0.6136547926069563,-0.1723749334766532],[0.08581770755520171,2.208288163473674,-0.5603452667737798,-3.0599443201343854,0.585509815026569,0.3876891298047877,-0.8368409374755644,1.672421071281691,0.19652240951291933,0.9838306761261647])
julia> r(x,y)
0.23514468093214283
julia> cor(x,y)
0.23514468093214275
Julia's cor is defined iteratively (calling cor first subtracts the means and then calls the zero-mean implementation corzm), which means fewer allocations and better performance. I can't speak to the numerical accuracy.
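For comparison, here is a rough sketch (not Base's actual code) of a centered, two-pass version in the same spirit:
using Statistics   # for mean on Julia 0.7+; on the older Julia in this post, mean is in Base
function r_centered(x, y)
    # subtract the means first, as cor effectively does, then use the zero-mean form
    xm = x .- mean(x)
    ym = y .- mean(y)
    sum(xm .* ym) / sqrt(sum(xm .^ 2) * sum(ym .^ 2))
end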

Your function is trying to multiply two column vectors. You will need to transpose one of them. Consider:
> [1,2]*[3,4]
ERROR: `*` has no method matching *(::Array{Int64,1}, ::Array{Int64,1})
but:
> [1,2]'*[3,4]
1-element Array{Int64,1}:
11
and:
> [1,2]*[3,4]'
2x2 Array{Int64,2}:
3 4
6 8
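For completeness, the same products can also be written as inner products (sum(x.*y) is dot(x, y)). A minimal sketch of the r definition in that form; it assumes equal-length vectors, just like the original:
using LinearAlgebra   # for dot on Julia 0.7+; on the older Julia in this post, dot is in Base
r2(x, y) = (dot(x, y) - sum(x)*sum(y)/length(x)) /
    sqrt((dot(x, x) - sum(x)^2/length(x)) * (dot(y, y) - sum(y)^2/length(x)))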

Related

Julia - exponentiating a matrix returned from another function

I have functions f1 and f2 returning matrices m1 and m2, which are built using Diagonal, Tridiagonal, and SymTridiagonal from the LinearAlgebra package.
In a new function f3 I try computing
j = m1 - m2*im
m3 = exp(j)
but this gives a MethodError unless I use j = Matrix(m1 - m2*im), saying there is no matching method for exp(::LinearAlgebra.Tridiagonal ...).
My question is: how can I do this computation most efficiently? I am a total beginner in Julia.
Unless j has a very special structure (i.e. its exponential is sparse, which is unlikely), the best you can do AFAICT is to use a dense matrix as the input to exp:
m3 = LinearAlgebra.exp!([float(x) for x in Tridiagonal(dl, d, du)])
If you expect m3 to be sparse then I think currently there is no special algorithm implemented for that case in Julia.
Note that I use exp! to do the operation in place and use a comprehension to make sure the argument to exp! is dense. As exp! expects LinearAlgebra.BlasFloat (that is, Union{Complex{Float32}, Complex{Float64}, Float32, Float64}), I use float to make sure that the elements of j are appropriately converted. Note that this might fail if you work with e.g. BigFloat or Float16 values; in that case you have to do the conversion to the expected types yourself.
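For illustration, a minimal end-to-end sketch with made-up inputs (stand-ins for f1 and f2, not the asker's actual functions), densifying before calling exp:
using LinearAlgebra
m1 = SymTridiagonal(ones(4), fill(0.5, 3))              # stand-in for f1's output
m2 = Tridiagonal(fill(0.1, 3), zeros(4), fill(0.1, 3))  # stand-in for f2's output
j  = Matrix(m1 - m2*im)   # densify: the structured types have no exp method
m3 = exp(j)               # dense matrix exponential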

Cumulative Integration Options With Julia

I have two 1-D arrays in which I would like to calculate the approximate cumulative integral of 1 array with respect to the scalar spacing specified by the 2nd array. MATLAB has a function called cumtrapz that handles this scenario. Is there something similar that I can try within Julia to accomplish the same thing?
The expected result is another 1-D array with the integral calculated for each element.
There is a numerical integration package for Julia (see the link) that defines cumul_integrate(X, Y) and uses the trapezoidal rule by default.
If this package didn't exist, though, you could easily write the function yourself and have a very efficient implementation out of the box because the loop does not come with a performance penalty.
Edit: Added an @assert to check matching vector dimensions and fixed a typo.
function cumtrapz(X::T, Y::T) where {T <: AbstractVector}
    # Check matching vector length
    @assert length(X) == length(Y)
    # Initialize output
    out = similar(X)
    out[1] = 0
    # Iterate over arrays
    for i in 2:length(X)
        out[i] = out[i-1] + 0.5*(X[i] - X[i-1])*(Y[i] + Y[i-1])
    end
    # Return output
    out
end
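As a quick sanity check (made-up data, not from the question): integrating x^2 on [0, 1] should give a final value close to 1/3.
X = collect(0.0:0.1:1.0)
Y = X .^ 2
cumtrapz(X, Y)[end]   # 0.335 with this spacing; the exact integral is 1/3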

How to do a mathematical sum in R?

I have the following mathematical formula that I want to program as efficiently as possible in R.
$\sum_{i=1}^{N}(x_i-\bar x)(y_i-\bar y)$
Let's say we have the following example data:
x = c(1,5,7,10,11)
y = c(2,4,8,9,12)
How can I easily get this sum with this data without making a separate function?
Isn't there a package or a function that can compute these mathematical sums?
Use the sum command and vectorized operations: sum((x-mean(x))*(y-mean(y)))
The key insight here is that the sum function just takes the sum over its argument (vector, matrix, whatever). Here it's enough to give it a vector; the expression is a little more ornate than a plain sum(z), but (x-mean(x))*(y-mean(y)) simply evaluates to that vector, so the complexity doesn't change how the function works. The same pattern holds in many places, not just the sum command.
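Since the rest of this post is Julia-flavoured, the same vectorized idea in Julia would look like this (a sketch; on Julia 0.7+ mean needs using Statistics):
using Statistics
x = [1, 5, 7, 10, 11]
y = [2, 4, 8, 9, 12]
sum((x .- mean(x)) .* (y .- mean(y)))   # the centered cross-product sum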

Generate identical random numbers in R and Julia

I'd like to generate identical random numbers in R and Julia. Both languages appear to use the Mersenne-Twister library by default, however in Julia 1.0.0:
julia> using Random
julia> Random.seed!(3)
julia> rand()
0.8116984049958615
Produces 0.811..., while in R:
set.seed(3)
runif(1)
produces 0.168.
Any ideas?
Related SO questions here and here.
My use case for those who are interested: Testing new Julia code that requires random number generation (e.g. statistical bootstrapping) by comparing output to that from equivalent libraries in R.
That is an old problem.
Paul Gilbert addressed the same issue in the late 1990s (!!) when trying to assert that simulations in R (then the newcomer) gave the same result as those in S-Plus (then the incumbent).
His solution, and still the golden approach AFAICT: re-implement it in fresh code in both languages, as this is the only way to ensure identical seeding, state, ... and whatever else affects the stream.
Pursuing the RCall suggestion made by @Khashaa, it's clear that you can set the seed and get the random numbers from R.
julia> using RCall
julia> RCall.reval("set.seed(3)")
RCall.NilSxp(16777344,Ptr{Void} @0x0a4b6330)
julia> a = zeros(Float64,20);
julia> unsafe_copy!(pointer(a), RCall.reval("runif(20)").pv, 20)
Ptr{Float64} @0x972f4860
julia> map(x -> @printf("%20.15f\n", x), a);
0.168041526339948
0.807516399072483
0.384942351374775
0.327734317164868
0.602100674761459
0.604394054040313
0.124633444240317
0.294600924244151
0.577609919011593
0.630979274399579
0.512015897547826
0.505023914156482
0.534035353455693
0.557249435689300
0.867919487645850
0.829708693316206
0.111449153395370
0.703688358888030
0.897488264366984
0.279732553754002
and from R:
> options(digits=15)
> set.seed(3)
> runif(20)
[1] 0.168041526339948 0.807516399072483 0.384942351374775 0.327734317164868
[5] 0.602100674761459 0.604394054040313 0.124633444240317 0.294600924244151
[9] 0.577609919011593 0.630979274399579 0.512015897547826 0.505023914156482
[13] 0.534035353455693 0.557249435689300 0.867919487645850 0.829708693316206
[17] 0.111449153395370 0.703688358888030 0.897488264366984 0.279732553754002
** EDIT **
Per the suggestion by @ColinTBowers, here's a simpler/cleaner way to access R random numbers from Julia.
julia> using RCall
julia> reval("set.seed(3)");
julia> a = rcopy("runif(20)");
julia> map(x -> @printf("%20.15f\n", x), a);
0.168041526339948
0.807516399072483
0.384942351374775
0.327734317164868
0.602100674761459
0.604394054040313
0.124633444240317
0.294600924244151
0.577609919011593
0.630979274399579
0.512015897547826
0.505023914156482
0.534035353455693
0.557249435689300
0.867919487645850
0.829708693316206
0.111449153395370
0.703688358888030
0.897488264366984
0.279732553754002
See:
?set.seed
"Mersenne-Twister":
From Matsumoto and Nishimura (1998). A twisted GFSR with period 2^19937 - 1 and equidistribution in 623 consecutive dimensions (over the whole period). The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current position in that set.
And you might see if you can link to the same C code from both languages. If you want to see the list/vector, type:
.Random.seed

What is idiomatic Julia style for by column or row operations?

Apologies if this is rather general - albeit still a coding question.
With a bit of time on my hands I've been trying to learn a bit of Julia. I thought a good start would be to copy the R microbenchmark function - so I could seamlessly compare R and Julia functions.
e.g. this is microbenchmark output for 2 R functions that I am trying to emulate:
Unit: seconds
expr min lq median uq max neval
vectorised(x, y) 0.2058464 0.2165744 0.2610062 0.2612965 0.2805144 5
devectorised(x, y) 9.7923054 9.8095265 9.8097871 9.8606076 10.0144012 5
So thus far in Julia I am trying to write idiomatic and hopefully understandable/terse code. Therefore I replaced a double loop with a list comprehension to create an array of timings, like so:
function timer(fs::Vector{Function}, reps::Integer)
    # funs = length(fs)
    # times = Array(Float64, reps, funs)
    # for funsitr in 1:funs
    #     for repsitr in 1:reps
    #         times[reps, funs] = @elapsed fs[funs]()
    #     end
    # end
    times = [@elapsed fs[funs]() for x=1:reps, funs=1:length(fs)]
    return times
end
This gives an array of timings for each of 2 functions:
julia> test=timer([vec, devec], 10)
10x2 Array{Float64,2}:
0.231621 0.173984
0.237173 0.210059
0.26722 0.174007
0.265869 0.208332
0.266447 0.174051
0.266637 0.208457
0.267824 0.174044
0.26576 0.208687
0.267089 0.174014
0.266926 0.208741
My question (finally) is how do I idiomatically apply a function such as min, max, median across columns (or rows) of an array without using a loop?
I can of course do it easily for this simple case with a loop (similar to the one I crossed out above) - but I can't find anything in the docs that is equivalent to, say, apply(array, 1, fun) or even colMeans.
The closest generic sort of function I can think of is
julia> [mean(test[:,col]) for col=1:size(test)[2]]
2-element Array{Any,1}:
0.231621
0.237173
... but the syntax really doesn't appeal. Is there a more natural way to apply functions across the columns or rows of a multidimensional array in Julia?
The function you want is mapslices.
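For example, on the 10x2 timing array test above (a sketch using the positional dims syntax of this Julia era; 0.7+ writes mapslices(median, test, dims=1)):
mapslices(median, test, 1)   # 1x2 matrix: median of each column
mapslices(median, test, 2)   # 10x1 matrix: median of each row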
Anonymous functions are currently slow in Julia, so I would not use them for benchmarking unless you are benchmarking anonymous functions themselves. Otherwise you will get a wrong performance prediction for code that does not use anonymous functions in its performance-critical parts.
I think you want the two-argument version of the reduction functions, like sum(arr, 1) to sum over the first dimension. If there isn't a library function available, you might use reducedim.
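A minimal sketch of that two-argument form on the same array (again the older positional syntax; newer Julia uses a dims keyword, e.g. sum(test, dims=1)):
sum(test, 1)       # 1x2 matrix of column sums
mean(test, 1)      # 1x2 matrix of column means
maximum(test, 2)   # 10x1 matrix of row maxima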
I think @ivarne has the right answer (and have ticked it), but I'll just add that I made an apply-like function:
function aaply(fun::Function, dim::Integer, ar::Array)
    if !(1 <= dim <= 2)
        error("rows is 1, columns is 2")
    end
    if dim == 1
        res = [fun(ar[row, :]) for row=1:size(ar)[dim]]
    end
    if dim == 2
        res = [fun(ar[:, col]) for col=1:size(ar)[dim]]
    end
    return res
end
This then gets what I want, like so:
julia> aaply(quantile, 2, test)
2-element Array{Any,1}:
[0.231621,0.265787,0.266542,0.267048,0.267824]
[0.173984,0.174021,0.191191,0.20863,0.210059]
where quantile is a built-in that gives min, lq, median, uq, and max... just like microbenchmark.
EDIT Following the advice here, I tested the new function mapslices, which works pretty much like R's apply, and benchmarked it against the function above. Note that mapslices uses dim=1 to slice by column, whilst test[:,1] is the first column... so the opposite of R, though it has the same indexing?
# nonsense test data big columns
julia> ar=ones(Int64,1000000,4)
1000000x4 Array{Int64,2}:
# built in function
julia> ms()=mapslices(quantile,ar,1)
ms (generic function with 1 method)
# my apply function
julia> aa()=aaply(quantile, 2, ar)
aa (generic function with 1 method)
# compare both functions
julia> aaply(quantile, 2, timer1([ms, aa], 40))
2-element Array{Any,1}:
[0.23566,0.236108,0.236348,0.236735,0.243008]
[0.235401,0.236058,0.236257,0.236686,0.238958]
So the functions are approximately as quick as each other. From reading bits of the Julia mailing list, the developers seem to intend to do some work on this part of Julia so that taking a slice is by reference rather than making a new copy of each slice (column, row, etc.)...
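(For what it's worth, later Julia versions did add non-copying slices; a brief sketch, not something available when this was written:)
col1 = view(ar, :, 1)   # or equivalently @view ar[:, 1]; no copy of the column is made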
