A simple approach to approximating the maximum likelihood of a model given some data is grid approximation. For example, in R, we can generate a grid of parameter values and then evaluate the likelihood of each value given some data (example from Statistical Rethinking by McElreath):
p_grid <- seq(from=0, to=1, length.out=1000)
likelihood <- dbinom(6, size=9, prob=p_grid)
Here, likelihood is an array of 1000 values and I assume this is an efficient way to get such an array.
I am new to Julia (and not so good at R) so my approach of doing the same as above relies on comprehension syntax:
using Distributions
p_grid = collect(LinRange(0, 1, 1000))
likelihood = [pdf(Binomial(9, p), 6) for p in p_grid]
which is not only clunky but somehow seems inefficient because a new Binomial gets constructed 1000 times. Is there a better, perhaps vectorized, approach to accomplishing the same task?
In languages like R or Python, people often use the term "vectorization" to mean "avoid for loops in the language". I say "in the language" because there are still for loops, it's just that they're now in C instead of R/Python.
In Julia, there's nothing to worry about. You'll still sometimes hear "vectorization", but Julia folks tend to use this in the original sense of hardware vectorization. More on that here.
As for your code, I think it's fine. To be sure, let's benchmark!
julia> using BenchmarkTools
julia> @btime [pdf(Binomial(9, p), 6) for p in $p_grid];
111.352 μs (1 allocation: 7.94 KiB)
Another way you could write this is using map:
julia> @btime map($p_grid) do p
           pdf(Binomial(9, p), 6)
       end;
111.623 μs (1 allocation: 7.94 KiB)
To check for construction overhead, you could make lower-level calls to StatsFuns, like this
julia> using StatsFuns
julia> @btime map($p_grid) do p
           binompdf(9, p, 6)
       end;
109.809 μs (1 allocation: 7.94 KiB)
It looks like there is some difference, but it's pretty minor, maybe around 2% of the overall cost.
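If you do want the one-liner feel of the R version, Julia's broadcasting (dot syntax) gives you that without an explicit comprehension. A minimal sketch, equivalent to the code above (it still constructs one Binomial per grid point, which, as the benchmarks suggest, is cheap):

using Distributions

p_grid = LinRange(0, 1, 1000)                # no collect needed; broadcasting works on ranges
likelihood = pdf.(Binomial.(9, p_grid), 6)   # fused broadcast: one pass over the grid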
Suppose I have the polynomial f(x) = x^n + x + a. I set a value for n, and want 0 <= a <= A, where A is some other value I set. This means I will have a total of A different polynomials, since a can be any value between 0 and A.
Using Sage, I'm finding the number of these A polynomials that are reducible. For example, suppose I set n=5 and A=10^7. That would tell me how many of these 10^7 polynomials of degree 5 are reducible. I've done this using a loop, which works for low values of A. But for the large values I need (i.e. A=10^7), it's taking an extremely long and impractical amount of time. The code is below. Could someone please help me meaningfully optimize this?
x = polygen(QQ)
n = 5
A = 10^7
count = 0
for i in range(A):
    p_pol = x^n + x + i
    if not p_pol.is_irreducible():
        count = count + 1
        print(i)
print('Count:' + str(count))
One small, but in this case pretty meaningless optimization is to replace range(A) with xrange(A). The former will create a list of all integers from 0 to A - 1, which is a waste of time and space. xrange(A) will just produce integers one by one and discard them when you're done. Sage 9.0 will be based on Python 3 by default, where range is equivalent to xrange.
Let's do a little experiment though. Another small optimization will be to pre-define the part of your polynomial that's constant in each loop:
x = polygen(QQ)
n = 5
A = 10^7
base = x^n + x
Now just as a general test, let's see how long it takes in a few cases to add an integer to the polynomial and then compute its irreducibility:
sage: (base + 1).is_irreducible()
False
sage: %timeit (base + 1).is_irreducible()
1000 loops, best of 3: 744 µs per loop
sage: (base + 3).is_irreducible()
True
sage: %timeit (base + 3).is_irreducible()
1000 loops, best of 3: 387 µs per loop
So it seems that in the cases where the polynomial is irreducible (which will be the majority) the check is a little faster, so let's say it will take on average 387 µs per polynomial. Then:
sage: 0.000387 * 10^7 / 60
64.5000000000000
So this will still take a little over an hour, on average (on my machine).
One thing you can do to speed things up is parallelize it, if you have many CPU cores. For example:
x = polygen(QQ)
A = 10^7
def is_irreducible(i, base=(x^5 + x)):
    return (base + i).is_irreducible()
from multiprocessing import Pool
pool = Pool()
A - sum(pool.map(is_irreducible, xrange(A)))
That will in principle give you the same result, though the speedup you'll get will only be on the order of the number of CPUs you have at best (typically a little less). Sage also comes with some parallelization helpers, but I tend to find them a bit lacking for the case of speeding up small calculations over a large range of values (they can be used for this, but it requires some care, such as manually batching your inputs; I'm not crazy about it).
Beyond that, you may need to use some mathematical intuition to try to reduce the problem space.
I'm using Julia at the moment but I have a performance-critical function which requires an enormous amount of repeated matrix operations on small fixed-size matrices (3-dimensional or 4-dimensional). It seems that all the matrix operations in Julia are handled by a BLAS and LAPACK back end. It also appears there's a lot of memory allocation going on within some of these functions.
There is a Julia library for small matrices which boasts impressive speedups for 3x3 matrices, but it has not been updated in 3 years. I am considering rewriting my performance-critical function in Eigen.
I know that Eigen claims to be really good for fixed-size matrices, but I am still trying to judge whether I should rewrite this function in Eigen or not. The performance benchmarks are for dynamically sized matrices. Does anyone have any data to suggest how much performance one gets from fixed-size matrices? The types of operations I'm doing are matrix x matrix, matrix x vector, and positive definite linear solves.
If you want fast operations for small matrices, I highly recommend StaticArrays. For example (NOTE: this was originally written before the BenchmarkTools package, which is now recommended):
using StaticArrays
using LinearAlgebra
function foo(A, b, n)
    s = 0.0
    for i = 1:n
        s += sum(A*b)   # A*b allocates a fresh array each iteration for ordinary Arrays
    end
    s
end

function foo2(A, b, n)
    c = A*b
    s = 0.0
    for i = 1:n
        mul!(c, A, b)   # overwrite the preallocated buffer c instead of allocating
        s += sum(c)
    end
    s
end
A = rand(3,3)
b = rand(3)
Af = SMatrix{3,3}(A)
bf = SVector{3}(b)
foo(A, b, 1)
foo2(A, b, 1)
foo(Af, bf, 1)
@time foo(A, b, 10^6)
@time foo2(A, b, 10^6)
@time foo(Af, bf, 10^6)
Results:
julia> include("/tmp/foo.jl")
0.080535 seconds (1.00 M allocations: 106.812 MiB, 14.86% gc time)
0.064963 seconds (3 allocations: 144 bytes)
0.001719 seconds (2 allocations: 32 bytes)
foo2 tries to be clever and avoid memory allocation, yet it's simply blown away by the naive implementation when using StaticArrays.
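As an aside, since the note above says BenchmarkTools is now the recommended way to time things, the same comparison can be rerun with @btime; this is just a sketch reusing the definitions above, with $ interpolation so the global variables don't skew the measurement:

using BenchmarkTools

@btime foo($A, $b, 10^6)     # plain Arrays
@btime foo2($A, $b, 10^6)    # plain Arrays with a preallocated buffer
@btime foo($Af, $bf, 10^6)   # StaticArrays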
I have used the time library and timed how long the recursive algorithm takes to calculate the Fibonacci numbers up to 50. Given those numbers, is there a formula I can use to determine how long it would have potentially taken to calculate fib(100)?
Times for smaller values:
Fib(40): 0.316 sec
Fib(80): 2.3 years
Fib(100): ???
This depends very much on the algorithm in use. The direct computation takes constant time. The recursive computation without memoization is exponential, with a base of phi. Add memoization to this, and it drops to linear time.
The only one that could fit your data is the exponential time. Doing the basic math ...
(2.3 years / 0.316 sec) ** (1.0/40)
gives us
base = 1.6181589...
Gee, look at that! Less than one part in 10^4 more than phi!
Let t(n) be the time to compute Fib(n).
We can support the hypothesis that
t(n) = phi * t(n-1)
Therefore,
t(100) = phi^(100-80) * t(80)
I trust you can finish from here.
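(For anyone who wants to check their arithmetic: phi^20 ≈ 15127, so under this model t(100) ≈ 15127 * 2.3 years, i.e. on the order of 3.5 * 10^4 years.)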
I am interested in solving the linear system of equations Ax = b, where A is a lower-triangular (n × n) matrix and b is an (n × 1) vector, with n ≈ 600k.
I coded up backsubstitution in R and it works fast for matrices of size up to 1000, but is really slow for larger n (≈ 600k). I know that naive backsubstitution is O(n^2).
My R function is below; does anyone know of a more efficient (vectorized, parallelized etc.) way of doing it, which scales to large n?
Backsubstitution
backsub = function(X, y)
{
  l = dim(X)
  n = l[1]
  p = l[2]
  y = as.matrix(y)
  for (j in seq(p, 1, -1))
  {
    # scale by the diagonal entry, then remove column j's contribution
    # from the rows that have not been solved yet
    y[j, 1] = y[j, 1] / X[j, j]
    if ((j - 1) > 0)
      y[1:(j - 1), 1] = y[1:(j - 1), 1] - (y[j, 1] * X[1:(j - 1), j])
  }
  return(y)
}
How about the R function backsolve? It calls the Level 3 BLAS routine dtrsm, which is probably what you want to be doing. In general, you won't beat BLAS/LAPACK linear algebra routines: they're insanely optimized.
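For example, a minimal sketch (the small lower-triangular system here is made up purely for illustration; for a lower-triangular X you want upper.tri = FALSE, or equivalently forwardsolve):

set.seed(1)
n <- 5
X <- matrix(rnorm(n^2), n, n)
X[upper.tri(X)] <- 0                      # make X lower triangular
b <- rnorm(n)

x <- backsolve(X, b, upper.tri = FALSE)   # triangular solve using the lower triangle of X
max(abs(X %*% x - b))                     # residual should be ~0 up to rounding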
This question is related to this one and this one
I have two full rank matrices A1, A2 each
of dimension p x p and a p-vector y.
These matrices are closely related in the sense that
matrix A2 is a rank one update of matrix A1.
I'm interested in the vector
β2 | (β1, y, A1, A2, A1^(-1))
where
β2 = (A2' A2)^(-1) (A2' y)
and
β1 = (A1' A1)^(-1) (A1' y)
Now, in a previous question here I have been advised
to estimate β2 by the Choleski approach since the Choleski
decomposition is easy to update using R functions such as chud()
in package SamplerCompare.
Below are two functions to solve linear systems in R, the first one uses
the solve() function and the second one the Choleski approach
(the second one I can efficiently update).
fx01 <- function(ll,A,y) chol2inv(chol(crossprod(A))) %*% crossprod(A,y)
fx03 <- function(ll,A,y) solve(A,y)
p <- 5
A <- matrix(rnorm(p^2),p,p)
y <- rnorm(p)
system.time(lapply(1:1000,fx01,A=A,y=y))
system.time(lapply(1:1000,fx03,A=A,y=y))
My question is: for p small, both functions seem to be comparable
(actually fx01 is even faster). But as I increase p,
fx01 becomes increasingly slower so that for p = 100,
fx03 is three times as fast as fx01.
What is causing the performance deterioration of fx01, and can it be improved or solved? (Maybe my implementation of the Choleski approach is too naive? Shouldn't I be using functions of the Choleski constellation such as backsolve, and if yes, how?)
A %*% B is the R lingo for matrix multiplication of A by B.
crossprod(A,B) is the R lingo for A' B (i.e., the transpose of A multiplying the matrix/vector B).
solve(A,b) solves for x the linear system A x=b.
chol(A) is the Choleski decomposition of a PSD matrix A.
chol2inv computes (X' X)^(-1) from the (R part) of the QR decomposition of X.
Your 'fx01' implementation is, as you mentioned, somewhat naive and is performing far more work than the 'fx03' approach. In linear algebra (my apologies for the main StackOverflow not supporting LaTeX!), 'fx01' performs:
B := A' A in roughly n^3 flops.
L := chol(B) in roughly 1/3 n^3 flops.
L := inv(L) in roughly 1/3 n^3 flops.
B := L' L in roughly 1/3 n^3 flops.
z := A' y in roughly 2n^2 flops.
x := B z in roughly 2n^2 flops.
Thus, the cost looks very similar to 2n^3 + 4n^2, whereas your 'fx03' approach uses the default 'solve' routine, which likely performs an LU decomposition with partial pivoting (2/3 n^3 flops) and two triangular solves (plus pivoting) in 2n^2 flops. Your 'fx01' approach therefore performs three times as much work asymptotically, and this amazingly agrees with your experimental results. Note that if A were real symmetric or complex Hermitian, an LDL^T (i.e., LDL') factorization and solve would only require half as much work.
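To make the backsolve part of your question concrete, here is a sketch (fx02 is just an illustrative name) of how fx01 can keep the Choleski factorization but replace the explicit inverse with two triangular solves, dropping the cost to roughly n^3 + 1/3 n^3 plus a few n^2 terms:

# A sketch of fx01 without the explicit inverse: factor A'A = R'R once,
# then solve R'R beta = A'y with one forwardsolve and one backsolve.
fx02 <- function(ll, A, y) {
  R <- chol(crossprod(A))                     # R is upper triangular, A'A = R'R
  z <- forwardsolve(t(R), crossprod(A, y))    # solve R' z = A'y
  backsolve(R, z)                             # solve R beta = z
}

p <- 100
A <- matrix(rnorm(p^2), p, p)
y <- rnorm(p)
all.equal(c(fx02(1, A, y)), c(solve(A, y)))   # for square invertible A these should agree up to rounding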
With that said, I think that you should replace your Cholesky update of A' A with a more stable QR update of A, as I just answered in your previous question. A QR decomposition costs roughly 4/3 n^3 flops and a rank-one update to a QR decomposition is only O(n^2), so this approach only makes sense for general A when there is more than just one related solve that is simply a rank-one modification.