Storing several matrices in one object - Julia

I have this simple econometrics/statistics exercise that I'm trying to implement in Julia. I've never used Julia before; I have already written this code in R, but I can't seem to translate it.
Here is the outline: I have an inner loop where, for a fixed N, I draw K samples from a distribution and store them in a matrix that is N x K. I do this for two matrices, then multiply each column of each matrix (obtaining a 2x1 vector) and store them in a 2 x K matrix.
After that, I do an outer loop for, say, 5 different values of N. In the end, I would like to have 5 different matrices that are 2 x K, so that I can plot them. What I can't figure out is how to store these matrices efficiently. In R, I would simply put them in a list and call each one out for calculations.
using Distributions
using StatsBase
J = 500
N = [10 100 500 1000 10000]
i = 1
b = ones(3,J)
for n in N
    x = ones(n,J)
    e = ones(n,J)
    y = ones(n,J)
    for j = 1:J
        x[:,j] = rand(Normal(3,1),n)
        y[:,j] = 3 + 2.5*x[:,j] + e[:,j]
        x = [y[:,j] x[:,j]]
        b[:,j] = inv(y'*x)*(x'*y[:,j])
    end
    i = i + 1
    i
end
I've tried this code, but it doesn't seem to work at all. I can't even make the simple for loop below work:
for i = 1:10
    x = ones(i, 10)
end
I get an ERROR: UndefVarError: x not defined. Can you guys help me?

You'd simply put them in a Vector{Matrix}, a vector of matrices.
But a common paradigm in Julia would be to instead define a function f that does what you want, and then broadcast that function over N, i.e. by calling a = f.(N) (N should be a Vector, not a Matrix, BTW, so commas between the numbers rather than spaces).
So, say you want to store 5 Matrices that are 2 x K with varying K. There are in fact numerous ways you can do that, depending on taste, style, convenience, etc. Here are some examples, where k varies from 20 to 24:
# a python-style list comprehension
mymatrices = [randn(K,2) for K in 20:24]
# a function broadcast over the array
f(k) = randn(k,2)
mymatrices = f.(20:24)
# a Vector with preallocated elements
mymatrices = Vector{Matrix{Float64}}(undef, 5)
for (i, k) in enumerate(20:24)  # enumerate pairs each index i with a value k
    mymatrices[i] = randn(k, 2)
end
# a Vector that grows dynamically
mymatrices = Matrix{Float64}[] # shorter optional syntax for an empty vector
for k in 20:24  # no index needed when push!-ing
    push!(mymatrices, randn(k, 2))
end
# map with an anonymous function
mymatrices = map(k -> randn(k,2), 20:24)
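Whichever of these you use, the result is an ordinary Vector, so indexing and broadcasting work as usual. For example, with the 20:24 sizes above:
mymatrices[3]       # the third matrix (22 x 2 here)
size.(mymatrices)   # [(20, 2), (21, 2), (22, 2), (23, 2), (24, 2)]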

So after help from you guys and some more googling, I've come up with this code that works. Not sure if this violates the purpose of StackOverflow, but I'm posting my answer. If you have time, please criticize it; it seems a little too verbose, and I'm sure there is a more efficient way to do this.
using Distributions
N = [10, 100, 500, 1000, 10000]
J = 500
bet = Vector{Matrix{Float64}}()
for n in N
    b = Matrix{Float64}(undef, 3, J)
    for j = 1:J
        x1 = Matrix{Float64}(undef, n, J)
        x2 = Matrix{Float64}(undef, n, J)
        epsilon = Matrix{Float64}(undef, n, J)
        y = Matrix{Float64}(undef, n, J)
        cons = ones(n)
        x1[:,j] = rand(Normal(3,1), n)
        x2[:,j] = rand(Normal(-1,1), n)
        epsilon[:,j] = rand(Normal(0,1), n)
        y[:,j] = 3 .+ 2.5*x1[:,j] + 4*x2[:,j] + epsilon[:,j]
        x = [cons x1[:,j] x2[:,j]]
        b[:,j] = inv(x'*x)*(x'*y[:,j])
    end
    push!(bet, b)
end
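For what it's worth, here is one possible leaner version of the same simulation (a sketch, not the only way; the simulate name and structure are just one choice): draw length-n vectors directly instead of allocating n x J matrices on every inner pass, wrap the work in a function (which also sidesteps the scoping issues discussed below), and use the backslash operator, which solves the least-squares problem more stably than inv(x'*x)*(x'*y):
using Distributions

function simulate(n, J)
    b = Matrix{Float64}(undef, 3, J)
    for j in 1:J
        x1 = rand(Normal(3, 1), n)        # first regressor
        x2 = rand(Normal(-1, 1), n)       # second regressor
        eps = randn(n)                    # N(0,1) errors
        y = 3 .+ 2.5 .* x1 .+ 4 .* x2 .+ eps
        X = [ones(n) x1 x2]               # design matrix with a constant
        b[:, j] = X \ y                   # least-squares fit
    end
    return b
end

N = [10, 100, 500, 1000, 10000]
J = 500
bet = [simulate(n, J) for n in N]         # a Vector of 3 x J matrices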

The loop works fine, and x is set correctly inside, but x is out of scope once the loop ends, and so is undefined.
julia> for i = 1:10
x = ones(i, 10)
end
julia> x
ERROR: UndefVarError: x not defined
If you create it first:
julia> x = Array{Float64}(10, 10);
julia> for i = 1:10
x = ones(i, 10)
end
julia> x
10×10 Array{Float64,2}:
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
then it stays in scope. (As an aside, the scoping rules for top-level loops have changed across Julia versions: in recent versions the REPL uses "soft scope", so run interactively, the first loop above assigns the global x without any error.)
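A more idiomatic fix (a sketch of the usual advice, not from the original answer) is to put the loop inside a function, where the loop shares the function's local scope:
function lastones(n)
    local x               # declared in the function's scope...
    for i in 1:n
        x = ones(i, 10)   # ...so the loop assigns that same x
    end
    return x
end

lastones(10)              # 10x10 matrix of ones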

Related

Split number into random parts with constraint

I am trying to find a smart way of splitting a number (e.g. 50) into a number of random parts (e.g. 20), BUT under the constraint that each generated random value cannot be greater than a specific value (e.g. 4).
So, for example, in this case I would expect as output a vector of 20 values whose sum is 50, but none of which is greater than 4 (e.g. 2.5, 1.3, 3.9, etc.).
I had a look at similar questions, but from what I see they deal with splitting a number into equal or random parts, and none of them includes the constraint, which is the bit I am stuck on! Any help would be highly appreciated!
Here is a fast (random) solution (as long as you can accept one-decimal parts).
Every time you run partitionsSample, you will get a different answer.
library(RcppAlgos)
# input
goal <- 50
parts <- 20
maxval <- 4
# sample with 1 decimal
m <- partitionsSample(maxval * 10, parts, repetition = FALSE, target = goal * 10, n = 1)
# divide by ten
answer <- m[1,]/10
# [1] 0.2 1.4 1.5 1.6 1.7 1.8 1.9 2.2 2.3 2.6 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.7 3.9
# check
sum(answer)
# [1] 50
A simple rejection-sampling alternative: draw random values, rescale them to the target sum, and retry until the constraint holds.
set.seed(42)
repeat {
    vec <- runif(20)
    vec <- 50 * vec/sum(vec)
    if (all(vec <= 4)) break
}
sum(vec)
# [1] 50
vec
# [1] 3.7299658 3.8207653 1.1666852 3.3860087 2.6166080 2.1165253 3.0033133 0.5490801 2.6787741 2.8747815 1.8663641
# [12] 2.9320577 3.8109668 1.0414675 1.8849202 3.8327490 3.9885516 0.4790347 1.9367197 2.2846613
Note: with certain input combinations this could feasibly run "forever" (when it is improbable to find a vector where all values are under 4). It works well here, but if you (say) change 20 parts to 10, it can never succeed (10 parts of at most 4 sum to at most 40 < 50) and will loop forever.
Another possible solution (by limiting the range of the uniform distribution, a valid draw becomes more likely):
set.seed(123)
x <- 50
n <- 20
threshold <- 4
gotit <- FALSE
while (!gotit) {
    v <- round(runif(n, 0, x/(n/2)), 1)
    if (sum(v) == x && all(v <= threshold))
        gotit <- TRUE
}
v
#> [1] 2.2 3.0 2.2 3.0 3.0 0.5 2.4 2.5 2.7 4.0 1.0 1.7 1.2 2.8 2.9 3.3 2.9 3.0 3.0
#> [20] 2.7

R Use of number 2 prefacing nrow function?

Probably a silly question, but what is the purpose of the number 2 in the code below (specifically the 2:nrow(iris))?
cumsum_Petal.Width <- iris$Petal.Width[1]    # Create new vector object
for(i in 2:nrow(iris)) {                     # Use nrow as the loop bound
    cumsum_Petal.Width[i] <-                 # Calculate cumulative sum
        cumsum_Petal.Width[i - 1] + iris$Petal.Width[i]
}
cumsum_Petal.Width # Print vector to RStudio console
# 0.2 0.4 0.6 0.8 1.0 1.4 1.7 1.9 2.1 2.2

Is there a way to obtain a collection pointing to variables in Julia?

Let's assume I have 3 variables, R1, R2 and R3. I'd like to have a Dictionary (or other collection) that points to the variables, so that if I modify a variable, the value in the Dictionary changes as well.
Basically I want to do something like this:
R1 = 0.0
R2 = 0.0
R3 = 0.0
D = Dict(1=>R1, 2=>R2, 3=>R3)
D[1]
output> 0.0
R1 = 1.0
D[1]
output> 1.0
Is there a way to do this in Julia?
Thanks
You can make them Refs:
R1 = Ref(0.0)
R2 = Ref(0.0)
R3 = Ref(0.0)
D = Dict(1=>R1, 2=>R2, 3=>R3)
D[1][] # output> 0.0
R1[] = 1.0
D[1][] # output> 1.0
Refs are like pointers. The syntax for assigning into them is ref[] = x, and the syntax for getting their value is ref[]. So just make sure you don't forget the [].
You could also just use a mutable object and mutate it instead of assigning over it, e.g.
R1 = [0.];
R2 = [0.];
R3 = [0.];
D = Dict(1=>R1, 2=>R2, 3=>R3);
D[1] #> 0.0
R1[1] = 1. # or just R1[] = 1. since empty brackets reference first element
D[1] #> 1.0
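A side note (an assumption about the use case, not from the answers above): if R1, R2 and R3 exist only to be collected, you can skip the named globals entirely and build the Dict of Refs with a generator:
D = Dict(i => Ref(0.0) for i in 1:3)   # same as Dict(1=>Ref(0.0), 2=>Ref(0.0), 3=>Ref(0.0))
D[2][] = 5.0                           # write through the Ref
D[2][]                                 # 5.0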

Coin Toss game in R

I'm trying to make a simulation of a coin-toss game where you double your money if you get heads and halve it if you get tails, and I want to see what you end up with after n throws if you start with x money.
However, I'm not sure how to tackle this problem in a nice clean way, without just writing a for loop up to n.
Is there some clean way to do this?
You can use sample to create a vector of multipliers: times 0.5 or times 2.
sample_products = sample(c(0.5, 2), 100, replace = TRUE)
> sample_products
[1] 0.5 2.0 0.5 2.0 2.0 0.5 2.0 0.5 2.0 2.0 0.5 0.5 0.5 0.5 2.0 2.0 0.5 0.5
[19] 2.0 2.0 0.5 0.5 0.5 2.0 2.0 2.0 2.0 0.5 0.5 2.0 2.0 2.0 2.0 2.0 2.0 0.5
[37] 2.0 2.0 2.0 0.5 2.0 2.0 0.5 0.5 0.5 2.0 0.5 2.0 2.0 0.5 2.0 2.0 2.0 2.0
[55] 0.5 2.0 0.5 2.0 0.5 0.5 0.5 2.0 2.0 2.0 2.0 0.5 2.0 0.5 0.5 2.0 0.5 0.5
[73] 0.5 2.0 0.5 0.5 0.5 2.0 2.0 0.5 2.0 0.5 0.5 0.5 2.0 2.0 2.0 2.0 0.5 0.5
[91] 2.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 2.0 0.5
and to get the cumulative effect of those products:
cumulative_prod = prod(sample_products)
and include the start money:
start_money = 1000
new_money = cumulative_prod * start_money
Note that for a fair coin (which sample simulates), the log of cumulative_prod is a symmetric random walk, so the median outcome is 1, but any individual run can drift far above or below 1 as the sample size grows.
You can loop over this if you want to run multiple iterations.
n <- 10
toss <- round(runif(n), 0)   # 0 or 1, equally likely
toss[toss == 0] <- -1        # recode 0 (tails) as -1
toss <- 2^toss               # exponents give multipliers 0.5 and 2
Reduce(`*`, toss)            # multiply them all together
This is not the best way (I'm sure there are a lot of better ways to do it); nevertheless, you can consider it a starting point for understanding how to do it:
> set.seed(1)
> x <- 100 # amount of money
> N <- 10 #number of throws
> TH <- sample(c("H", "T"), N, TRUE) # Heads or Tails, drawin "H" or "T" with same probability
> sum(ifelse(TH=="H", 2*x, 0.5*x)) # final amount of money
[1] 1100
You can also write a function that takes as arguments the initial amount of money x and the number of trials N:
> head.or.tails <- function(x, N){
TH <- sample(c("H", "T"), N, TRUE) # Heads or Tails
sum(ifelse(TH=="H", 2*x, 0.5*x)) # final amount of money
}
>
> set.seed(1)
> head.or.tails(100, 10)
[1] 1100
In order to avoid the ifelse part, you can write sample(c(0.5, 2), 100, replace = TRUE) instead of sample(c("H", "T"), N, TRUE); see @Paul Hiemstra's answer.
If you're starting to get your head around this sort of thing, I'd be tempted to work in log space, i.e. add one for a win and subtract one for a loss. You can sample as others have done, e.g. @Paul's answer:
y <- sample(c(-1,1), 100, replace=TRUE)
plot(cumsum(y), type="s")
if you want to convert back to "winnings" you can just do:
plot(2^cumsum(y)*start_money, type="s", log="y", xlab="Round", ylab="Winnings")
this will look very similar, but the y-axis will be in winnings.
If you're new to stochastic processes such as this, it can be interesting to see lots of "winning" or "losing" streaks. If you want to see how long they are, the rle function can be useful here, for example:
table(rle(y)$lengths)
will print the frequencies of the lengths of these runs, which can get surprisingly long. You could play with the negative-binomial distribution to see where this comes from:
plot(table(rle(y)$lengths) / length(y))
points(1:15, dnbinom(1:15, 1, 0.5), col=2)
although you'll probably need to work with larger samples (e.g. 1000 samples or more) to see the same "shape".

Computing sparse pairwise distance matrix in R

I have an N x M matrix and I want to compute the N x N matrix of Euclidean distances between the N points (the rows, each of dimension M). In my problem, N is about 100,000. As I plan to use this matrix for a k-nearest-neighbor algorithm, I only need to keep the k smallest distances, so the resulting N x N matrix is very sparse. This is in contrast to what comes out of dist(), for example, which would result in a dense matrix (and probably storage problems for my size of N).
The packages for kNN that I've found so far (knnflex, kknn, etc) all appear to use dense matrices. Also, the Matrix package does not offer a pairwise distance function.
Closer to my goal, I see that the spam package has a nearest.dist() function that allows one to only consider distances less than some threshold, delta. In my case, however, a particular value of delta may produce too many distances (so that I have to store the NxN matrix densely) or too few distances (so that I can't use kNN).
I have seen previous discussion on trying to perform k-means clustering using the bigmemory/biganalytics packages, but it doesn't seem like I can leverage these methods in this case.
Does anybody know a function/implementation that will compute a distance matrix in a sparse fashion in R? My (dreaded) backup plan is to have two for loops and save results in a Matrix object.
Well, we can't have you resorting to for-loops, now can we :)
There is of course the question of how to represent the sparse matrix. A simple way is to have it contain only the indices of the points that are closest (and recalculate the distances as needed). But in the solution below, I put both distance ('d1' etc.) and index ('i1' etc.) in a single matrix:
sparseDist <- function(m, k) {
    m <- t(m)
    n <- ncol(m)
    d <- vapply(seq_len(n-1L), function(i) {
        d <- colSums((m[, seq(i+1L, n), drop=FALSE] - m[,i])^2)
        o <- sort.list(d, na.last=NA, method='quick')[seq_len(k)]
        c(sqrt(d[o]), o+i)
    }, numeric(2*k))
    dimnames(d) <- list(c(paste('d', seq_len(k), sep=''),
                          paste('i', seq_len(k), sep='')), colnames(m)[-n])
    d
}
Trying it out on 9 2d-points:
> m <- matrix(c(0,0, 1.1,0, 2,0, 0,1.2, 1.1,1.2, 2,1.2, 0,2, 1.1,2, 2,2),
9, byrow=TRUE, dimnames=list(letters[1:9], letters[24:25]))
> print(dist(m), digits=2)
a b c d e f g h
b 1.1
c 2.0 0.9
d 1.2 1.6 2.3
e 1.6 1.2 1.5 1.1
f 2.3 1.5 1.2 2.0 0.9
g 2.0 2.3 2.8 0.8 1.4 2.2
h 2.3 2.0 2.2 1.4 0.8 1.2 1.1
i 2.8 2.2 2.0 2.2 1.2 0.8 2.0 0.9
> print(sparseDist(m, 3), digits=2)
a b c d e f g h
d1 1.1 0.9 1.2 0.8 0.8 0.8 1.1 0.9
d2 1.2 1.2 1.5 1.1 0.9 1.2 2.0 NA
d3 1.6 1.5 2.0 1.4 1.2 2.2 NA NA
i1 2.0 3.0 6.0 7.0 8.0 9.0 8.0 9.0
i2 4.0 5.0 5.0 5.0 6.0 8.0 9.0 NA
i3 5.0 6.0 9.0 8.0 9.0 7.0 NA NA
And trying it on a larger problem (10k points). Still, on 100k points and more dimensions it will take a long time (like 15-30 minutes).
n<-1e4; m<-3; m=matrix(runif(n*m), n)
system.time( d <- sparseDist(m, 3) ) # 9 seconds on my machine...
P.S. Just noted that you posted an answer as I was writing this: the solution here is roughly twice as fast because it doesn't calculate the same distance twice (the distance between points 1 and 13 is the same as between points 13 and 1).
For now I am using the following, inspired by this answer. The output is a k x n matrix whose (j, i) entry is the index of the data point that is the jth closest to point i.
n <- 10
d <- 3
x <- matrix(rnorm(n * d), ncol = n)
min.k.dists <- function(x, k=5) {
    apply(x, 2, function(r) {
        b <- colSums((x - r)^2)   # squared distances from point r to all points
        o <- order(b)
        o[1:k]                    # indices of the k closest points (r itself first)
    })
}
min.k.dists(x) # first row should be 1:ncol(x); these points have distance 0
dist(t(x)) # can check answer against this
If one is worried about how ties are handled and whatnot, perhaps rank() should be incorporated.
The above code seems somewhat fast, but I'm sure it could be improved (though I don't have time to go the C or Fortran route). So I'm still open to fast and sparse implementations of the above.
Below I include a parallelized version that I ended up using:
min.k.dists <- function(x, k=5, cores=1) {
    require(multicore)
    xx <- as.list(as.data.frame(x))
    names(xx) <- c()
    m <- mclapply(xx, function(r) {
        b <- colSums((x - r)^2)
        o <- order(b)
        o[1:k]
    }, mc.cores=cores)
    t(do.call(rbind, m))
}
If you want to keep the logic of your min.k.dists function and return duplicate distances, you might want to consider modifying it a bit. It seems pointless to return the first row with 0 distance, right? ...and by incorporating some of the tricks from my other answer, you can speed up your version by some 30%:
min.k.dists2 <- function(x, k=4L) {
    k <- max(2L, k + 1L)   # ranks 2..k+1: skip rank 1, the point itself (distance 0)
    apply(x, 2, function(r) {
        sort.list(colSums((x - r)^2), na.last=NA, method='quick')[2:k]
    })
}
> n<-1e4; m<-3; m=matrix(runif(n*m), n)
> system.time(d <- min.k.dists(t(m), 4)) #To get 3 nearest neighbours and itself
user system elapsed
17.26 0.00 17.30
> system.time(d <- min.k.dists2(t(m), 3)) #To get 3 nearest neighbours
user system elapsed
12.7 0.0 12.7
