How to calculate the variance of an array in Julia?

I have the following array which I need to figure out its variance:
julia> a = [5, 6, 7, 8, 10, 12, 12, 17, 67, 68, 69, 72, 74, 74, 92, 93, 100, 105, 110, 120, 124]
21-element Vector{Int64}:
5
6
7
8
10
12
12
17
67
68
⋮
74
74
92
93
100
105
110
120
124
How can I do this in Julia?

Julia has a var function built into the standard Statistics module, so you can just do:
using Statistics
var(a)
The StatsBase.jl package does not export the var function, so your code would not work when used in a fresh Julia session. You would have to write StatsBase.var(a) instead (or add using Statistics).
What StatsBase.jl adds to var is additional methods that allow computing weighted variances. So, for example, the following works with StatsBase.jl (but would currently not work without it):
julia> using Statistics
julia> using StatsBase
julia> var([1,2,3], Weights([1,2,3]))
0.5555555555555555
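That 0.5555… can be reproduced by hand: with the generic Weights type, StatsBase computes (as far as I recall, by default) the weighted mean of squared deviations normalised by the total weight, with no bias correction. A quick sketch:

```julia
x = [1, 2, 3]
w = [1, 2, 3]

# Weighted mean: (1*1 + 2*2 + 3*3) / (1 + 2 + 3) = 14/6 = 7/3
μw = sum(w .* x) / sum(w)

# Weighted variance without bias correction:
# weighted mean of the squared deviations from the weighted mean
var_w = sum(w .* (x .- μw) .^ 2) / sum(w)   # 5/9 ≈ 0.5555555555555555
```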

If someone wants to calculate variance from first principles, you can do this:
function my_variance(x)
    n = length(x)
    μ = sum(x) / n
    sum((x .- μ) .^ 2) / (n - 1)
end
But please, just use StatsBase.var!
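As a sanity check (the function is repeated here so the snippet runs on its own), the hand-rolled version should agree with var from Statistics, since both use the n - 1 denominator:

```julia
using Statistics

# Sample variance from first principles (same definition as above)
function my_variance(x)
    n = length(x)
    μ = sum(x) / n
    sum((x .- μ) .^ 2) / (n - 1)
end

a = [5, 6, 7, 8, 10, 12, 12, 17, 67, 68, 69, 72, 74, 74, 92, 93, 100, 105, 110, 120, 124]
my_variance(a) ≈ var(a)   # true
</imports></imports>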

Related

Get a seed that generates a specific set of pseudo-random numbers

I was wondering if there is a way in R to find the specific seed that generates a specific set of numbers.
For example:
sample(1:300, 10)
I want to find the seed that gives me, as the output of the previous code:
58 235 243 42 281 137 2 219 284 184
As far as I know there is no elegant way to do this, but you could brute force it:
desired_output <- c(58, 235, 243, 42, 281, 137, 2, 219, 284, 184)
MAX_SEED <- .Machine$integer.max
MIN_SEED <- MAX_SEED * -1
i <- MIN_SEED
while (i < MAX_SEED - 1) {
  set.seed(i)
  actual_output <- sample(1:300, 10)
  if (identical(actual_output, desired_output)) {
    message("Seed found! Seed is: ", i)
    break
  }
  i <- i + 1
}
This takes 11.5 seconds to run with the first 1e6 seeds on my laptop - so if you're unlucky then it would take about 7 hours to run. Also, this is exactly the kind of task you could run in parallel in separate threads to cut the time down quite a bit.
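A minimal sketch of that parallel idea, splitting a block of seeds across workers with the base parallel package (mclapply forks, so this is Unix-alikes only; the seed range and core count here are illustrative, not tuned):

```r
library(parallel)

desired_output <- c(58, 235, 243, 42, 281, 137, 2, 219, 284, 184)

# Scan one contiguous block of seeds; return the matching seed, or NULL
find_seed <- function(seeds) {
  for (i in seeds) {
    set.seed(i)
    if (identical(sample(1:300, 10), desired_output)) return(i)
  }
  NULL
}

# Illustrative: split the first 200,000 seeds into one chunk per worker
n_cores <- 4
chunks <- split(0:199999, cut(seq_along(0:199999), n_cores))
hits <- mclapply(chunks, find_seed, mc.cores = n_cores)

# Keep only the chunks that found the seed (usually zero or one)
Filter(Negate(is.null), hits)
```

Each worker reseeds the RNG itself via set.seed(i), so the usual worries about parallel RNG streams do not apply here.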
EDIT: Updated to include negative seeds which I had not considered. So in fact it could take twice as long.

Using the akima bilinear function for interpolation

I am using the akima package and bilinear function to interpolate z values (temperatures) from a coarse coordinate grid (2.5° x 2.5°) to a finer grid (0.5° x 0.5°). The bilinear function works as follows:
Usage
bilinear(x, y, z, x0, y0)
Arguments
x a vector containing the x coordinates of the rectangular data grid.
y a vector containing the y coordinates of the rectangular data grid.
z a matrix containing the z[i,j] data values for the grid points (x[i],y[j]).
x0 vector of x coordinates used to interpolate at.
y0 vector of y coordinates used to interpolate at.
Value
This function produces a list of interpolated points:
x vector of x coordinates.
y vector of y coordinates.
z vector of interpolated data z.
Given the following data:
# coarse grid longitudes x -> c(0, 2.5, 5, 7.5, 10)
# coarse grid latitudes y -> c(50, 55, 60, 65, 70)
# temperatures z -> c(10.5, 11.1, 12.4, 9.8, 10.6)
# fine grid longitudes x0 -> c(0, 0.5, 1, 1.5, 2)
# fine grid latitudes y0 -> c(50, 50.5, 51, 51.5, 52)
I tried the function:
bilinear(x = x, y = y, z = z, x0 = x0, y0 = y0)
But I get the following:
Error in if (dim(z)[1] != nx) stop("dim(z)[1] and length of x differs!") :
argument is of length zero
I clearly don't fully understand how this function works and would really appreciate suggestions if somebody knows what I'm doing wrong. I'm also open to an alternative solution using a different package.
Read the description of the function carefully: z needs to be a matrix whose dimensions match the lengths of x and y:
library(akima)
x <- c(0, 2.5, 5, 7.5, 10)
y <- c(50, 55, 60, 65, 70)
z <- matrix(rnorm(25), 5, 5)
x0 <- seq(0, 10, .5)
y0 <- seq(50, 70, length = length(x0))
> bilinear(x, y, z, x0, y0)
$x
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
[16] 7.5 8.0 8.5 9.0 9.5 10.0
$y
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
$z
[1] 1.14880762 1.08789150 0.88252672 0.53271328 0.03845118 -0.60025959
[7] -0.13758256 0.17947029 0.35089894 0.37670342 0.25688371 -0.06736752
[13] -0.42197570 -0.80694083 -1.22226291 -1.66794194 -1.38940279 -1.08889523
[19] -0.76641923 -0.42197481 -0.05556197
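Note that bilinear() interpolates at the point pairs (x0[i], y0[i]), not at every combination of x0 and y0. If you want values on the whole fine grid, one way (sketched here with the same random z as above) is to expand the coordinates first:

```r
library(akima)

# Same toy setup as above
x  <- c(0, 2.5, 5, 7.5, 10)
y  <- c(50, 55, 60, 65, 70)
z  <- matrix(rnorm(25), 5, 5)
x0 <- seq(0, 10, 0.5)
y0 <- seq(50, 70, length = length(x0))

# bilinear() pairs x0[i] with y0[i]; expand.grid() gives every combination
grid <- expand.grid(x0 = x0, y0 = y0)
res  <- bilinear(x, y, z, grid$x0, grid$y0)

# expand.grid varies x0 fastest, so fill the matrix column by column
zfine <- matrix(res$z, nrow = length(x0), ncol = length(y0))
```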

Find entry that causes low p-value

In R is have 2 vectors
u <- c(109, 77, 57, 158, 60, 63, 42, 20, 139, 15, 64, 18)
v <- c(734, 645, 1001, 1117, 1071, 687, 162, 84, 626, 64, 218, 79)
I want to test H: u and v are independent so I run a chi-square test:
chisq.test( as.data.frame( rbind(u,v) ) )
and get a very low p-value, meaning that I can reject H, i.e. that u and v are not independent.
But when I type
chisq.test(u,v)
I get a p-value of 0.23, which means that I cannot reject H.
Which one of these two tests should I choose?
Furthermore I want to find the entries in these vectors that causes this low p-value. Any ideas how to do this?
The test statistic uses the sum of squared standardised residuals, so you can look at those values to get an idea of which entries matter most:
m <- chisq.test(u, v)
residuals(m)  # Pearson residuals
m$stdres      # standardised residuals
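For example, applied to the table form of the test from the question (the one that gives the low p-value), you could look for the cell with the largest absolute standardised residual; this is an illustrative sketch, not the only diagnostic:

```r
u <- c(109, 77, 57, 158, 60, 63, 42, 20, 139, 15, 64, 18)
v <- c(734, 645, 1001, 1117, 1071, 687, 162, 84, 626, 64, 218, 79)

# Treat the two vectors as rows of a 2 x 12 contingency table
m <- chisq.test(as.data.frame(rbind(u, v)))

# Standardised residuals, one per cell; large absolute values
# are the cells driving the chi-square statistic
m$stdres

# Row/column index of the most extreme cell
which(abs(m$stdres) == max(abs(m$stdres)), arr.ind = TRUE)
```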

R nls function and starting values

I'm wondering how I can find/choose the starting values for the nls function as I'm getting errors with any I put in. I also want to confirm that I can actually use the nls function with my data set.
data
[1] 108 128 93 96 107 126 105 137 78 117 111 131 106 123 112 90 79 106 120
[20] 91 100 103 112 138 61 57 97 100 95 92 78
week = (1:31)
> data.fit = nls(data~M*(((P+Q)^2/P)*exp((P+Q)*week)/(1+(Q/P)*exp(-(P+Q)*week))^2), start=c(M=?, P=?, Q=?))
If we change the function a bit and use nls2 to get starting values then we can get it to converge. The model we are using is:
log(data) = .lin1 + .lin2 * log((exp((P+Q)*week)/(1+(Q/P)*exp(-(P+Q)*week))^2)) + error
In this model, .lin1 = log(M*((P+Q)^2/P)), and when .lin2 = 1 it reduces to the model in the question (except for the multiplicative rather than additive error, and the fact that the parameterization is different but gives the same predictions when appropriately reduced). This is a 4-parameter rather than 3-parameter model.
The linear parameters are .lin1 and .lin2. We are using algorithm = "plinear" which does not require starting values for these parameters. The RHS of plinear formulas is specified as a matrix with one column for each linear parameter specifying its coefficient (which may be a nonlinear function of the nonlinear parameters).
The code is:
data <- c(108, 128, 93, 96, 107, 126, 105, 137, 78, 117, 111, 131, 106,
123, 112, 90, 79, 106, 120, 91, 100, 103, 112, 138, 61, 57, 97,
100, 95, 92, 78)
week <- 1:31
if (exists("fit2")) rm(fit2)
library(nls2)
fo <- log(data) ~ cbind(1, log((exp((P+Q)*week)/(1+(Q/P)*exp(-(P+Q)*week))^2)))
# try maxiter random starting values
set.seed(123)
fit2 = nls2(fo, alg = "plinear-random",
            start = data.frame(P = c(-10, 10), Q = c(-10, 10)),
            control = nls.control(maxiter = 1000))
# use starting value found by nls2
fit = nls(fo, alg = "plinear", start = coef(fit2)[1:2])
plot(log(data) ~ week)
lines(fitted(fit) ~ week, col = "red")
giving:
> fit
Nonlinear regression model
model: log(data) ~ cbind(1, log((exp((P + Q) * week)/(1 + (Q/P) * exp(-(P + Q) * week))^2)))
data: parent.frame()
P Q .lin1 .lin2
0.05974 -0.02538 5.63199 -0.87963
residual sum-of-squares: 1.069
Number of iterations to convergence: 16
Achieved convergence tolerance: 9.421e-06

Using RGL to plot 3d line segments in R

I'm having some problems with an application of the rgl 3D graphing package.
I'm trying to draw some line segments, and my data is arranged in a data frame called 'markers' with six columns: one each for the starting x, y, and z values, and one each for the ending x, y, and z values.
startX startY startZ endX endY endZ
69.345 45.732 20 115 39.072 1.92413
80.270 38.480 30 175 44.548 0.36777
99.590 33.596 20 175 35.224 0.06929
32.120 41.218 20 115 39.294 2.81424
11.775 37.000 30 175 35.890 1.38047
76.820 44.104 22 115 44.992 4.14674
85.790 23.384 18 115 36.112 0.40508
80.040 17.464 20 175 31.080 2.59038
103.615 38.850 22 115 39.220 3.18201
41.200 31.006 30 175 36.260 3.48049
88.665 43.956 30 115 39.738 0.50635
109.365 23.976 20 175 33.374 3.99750
This should be a piece of cake. Just feed those values to the segments3d() command and I should get the plot I want. Only I can't figure out how to correctly pass the respective starting and ending pairs into segments3d().
I've tried just about everything possible ($ notation, indexing, concatenating, using a loop, apply and sapply, etc.), including reading the documentation. It's great, it says for the arguments x, y, and z: "Any reasonable way of defining the coordinates is acceptable." Ugh... it does refer you to the xyz.coords utility.
So I went over that documentation. And I think I understand what it does; I can even use it to standardize my data e.g.
starts <- xyz.coords(markers$startX, markers$startY, markers$startZ)
ends <- xyz.coords(markers$endX, markers$endY, markers$endZ)
But then I'm still not sure what to do with those two lists.
segments3d(starts, ends)
segments3d(starts + ends)
segments3d((starts, ends), (starts, ends), (starts, ends))
segments3d(c(starts, ends), c(starts, ends), c(starts, ends))
segments3d(c(starts$x, ends$x), c(starts$y, ends$y), c(starts$z, ends$z))
I mean I know why the above don't work. I'm basically just trying things at this point as this is making me feel incredibly stupid, like there is something obvious—I mean facepalm level obvious—I'm missing.
I went through the rgl documentation itself looking for an example, and the only place I found segments3d() used in any manner resembling what I'm trying to do, they used the '+' notation I tried above. Basically they built 2 matrices and added the second to the first.
Something like this should work.
library(rgl)
open3d(scale=c(1/5,1,1))
segments3d(x = as.vector(t(markers[, c(1, 4)])),
           y = as.vector(t(markers[, c(2, 5)])),
           z = as.vector(t(markers[, c(3, 6)])))
axes3d()
title3d(xlab="X",ylab="Y",zlab="Z")
The problem is that segments3d(...) takes the x (and y and z) values in pairs. So rows 1-2 are the first segment, rows 3-4 are the second segment, etc. You need to interleave, e.g. $startx and $endx, etc. The code above does that.
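To see why the transpose trick interleaves correctly, here is a toy two-row example (just the x columns):

```r
# A two-row stand-in for 'markers', keeping only startX/endX
toy <- data.frame(startX = c(1, 2), endX = c(10, 20))

# t() turns each row into a column; as.vector() then reads column by
# column, which interleaves each start with its matching end
as.vector(t(toy[, c("startX", "endX")]))
# [1]  1 10  2 20
```

That 1, 10, 2, 20 ordering is exactly the pairs-of-points layout segments3d() expects.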
Code for creating the data set:
markers <- data.frame(
  startX = c(69.345, 80.270, 99.590, 32.120, 11.775, 76.820, 85.790, 80.040, 103.615, 41.200, 88.665, 109.365),
  startY = c(45.732, 38.480, 33.596, 41.218, 37.000, 44.104, 23.384, 17.464, 38.850, 31.006, 43.956, 23.976),
  startZ = c(20, 30, 20, 20, 30, 22, 18, 20, 22, 30, 30, 20),
  endX = c(115, 175, 175, 115, 175, 115, 115, 175, 115, 175, 115, 175),
  endY = c(39.072, 44.548, 35.224, 39.294, 35.890, 44.992, 36.112, 31.080, 39.220, 36.260, 39.738, 33.374),
  endZ = c(1.92413, 0.36777, 0.06929, 2.81424, 1.38047, 4.14674, 0.40508, 2.59038, 3.18201, 3.48049, 0.50635, 3.99750)
)
