I'm reading Deep Learning by Goodfellow et al. and am trying to implement gradient descent as shown in Section 4.5 Example: Linear Least Squares. This is page 92 in the hard copy of the book.
The algorithm can be viewed in detail at https://www.deeplearningbook.org/contents/numerical.html with R implementation of linear least squares on page 94.
I've tried implementing in R, and the algorithm as implemented converges on a vector, but this vector does not seem to minimize the least squares function as required. Adding epsilon to the vector in question frequently produces a "minimum" less than the minimum outputted by my program.
options(digits = 15)
dim_square = 2 ### set dimension of square matrix
# Generate random vector, random matrix, and
set.seed(1234)
A = matrix(nrow = dim_square, ncol = dim_square, byrow = T, rlnorm(dim_square ^ 2)/10)
b = rep(rnorm(1), dim_square)
# having fixed A & B, select X randomly
x = rnorm(dim_square) # vector length of dim_square--supposed to be arbitrary
f = function(x, A, b){
total_vector = A %*% x + b # this is the function that we want to minimize
total = 0.5 * sum(abs(total_vector) ^ 2) # L2 norm squared
return(total)
}
f(x,A,b)
# how close do we want to get?
epsilon = 0.1
delta = 0.01
value = (t(A) %*% A) %*% x - t(A) %*% b
L2_norm = (sum(abs(value) ^ 2)) ^ 0.5
steps = vector()
while(L2_norm > delta){
x = x - epsilon * value
value = (t(A) %*% A) %*% x - t(A) %*% b
L2_norm = (sum(abs(value) ^ 2)) ^ 0.5
print(L2_norm)
}
minimum = f(x, A, b)
minimum
minimum_minus = f(x - 0.5*epsilon, A, b)
minimum_minus # less than the minimum found by gradient descent! Why?
Following page 94 of the PDF at https://www.deeplearningbook.org/contents/numerical.html, I am trying to find the values of the vector x such that f(x) is minimized. However, as minimum and minimum_minus in my code show, minimum is not the actual minimum, since it exceeds minimum_minus.
Any idea what the problem might be?
Original Problem
Finding the value of x that minimizes the L2 norm of Ax - b is equivalent to finding the value of x such that Ax - b = 0, or x = (A^-1)*b, provided A is invertible. This is because the L2 norm is the Euclidean norm, more commonly known as the distance formula. A distance cannot be negative, so its minimum is zero whenever zero is attainable.
This algorithm, as implemented, actually comes quite close to estimating x. However, because of recursive subtraction and rounding one quickly runs into the problem of underflow, resulting in massive oscillation, below:
[Figure: Value of L2 norm as a function of step size]
[Figure: Above algorithm vs. solve function in R]
Above we have the results of A %*% x followed by A %*% min_x, with x estimated by the implemented algorithm and min_x estimated by the solve function in R.
The problem of underflow, well known to those familiar with numerical analysis, is probably best tackled by the programmers of lower-level libraries best equipped to tackle it.
To summarize, the algorithm appears to work as implemented. Note, however, that not every function has a minimum (think of a straight line), and that in general this algorithm is only guaranteed to find a local minimum, not a global one.
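One thing worth checking in the question's code, as an observation rather than something stated in the answer above: f is defined with A %*% x + b, while the gradient t(A) %*% A %*% x - t(A) %*% b belongs to 0.5 * ||Ax - b||^2. The following is a minimal sketch in which the objective and the gradient agree. The matrix A, the vector b, the helper grad and the tolerance are illustrative assumptions, not the question's data.

```r
# Sketch only: A and b are made-up illustrative values, not the question's data.
A <- matrix(c(2, 0,
              0, 1), nrow = 2, byrow = TRUE)
b <- c(2, 3)

# Objective from the book: f(x) = 0.5 * ||A x - b||^2 (note the minus sign).
f <- function(x, A, b) 0.5 * sum((A %*% x - b)^2)

# Matching gradient: t(A) %*% A %*% x - t(A) %*% b.
grad <- function(x, A, b) as.vector(t(A) %*% (A %*% x - b))

x <- c(0, 0)       # arbitrary starting point
epsilon <- 0.1     # step size
delta <- 1e-8      # tolerance on the gradient norm

g <- grad(x, A, b)
while (sqrt(sum(g^2)) > delta) {
  x <- x - epsilon * g
  g <- grad(x, A, b)
}

# Closed-form solution for comparison; valid because this A is invertible.
x_exact <- solve(A, b)
```

With a consistent sign, perturbing x in any direction (for example f(x + 0.05, A, b) or f(x - 0.05, A, b)) should no longer give a value below f(x, A, b).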
In the SVM optimization problem, we either want to maximise the margin of 2/||w||,
or minimise the Euclidean Norm of weight vector w:
(1/2)*w^t*w
Can somebody explain to me why the Euclidean Norm is the formula above? And not 1/sqrt(w^t*w)?
I assume the Euclidean norm is the Euclidean distance, so how do we get to that formula?
The reason is that the following three are equivalent (under suitable mathematical conditions which are usually met):
Maximize a quantity z.
Maximize f(z), where f is a strictly increasing function.
Minimize g(z), where g is a strictly decreasing function.
In your case, set z = ||w|| and apply the above the other way round. Then minimizing ||w|| is equivalent to minimizing f(z) = 1/2 ||w||^2 (strictly increasing for z >= 0) and to maximizing g(z) = 2/||w|| (strictly decreasing for z > 0).
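A small numerical check of this equivalence, as a sketch: the candidate weight vectors below are hypothetical, and the point is only that all three criteria select the same one.

```r
# Hypothetical candidate weight vectors (one per row), for illustration only.
candidates <- rbind(c(3, 4), c(1, 1), c(0.5, 2), c(2, 2))
norms <- sqrt(rowSums(candidates^2))   # ||w|| for each candidate

best_min_norm   <- which.min(norms)           # minimize ||w||
best_min_half   <- which.min(0.5 * norms^2)   # minimize (1/2) w^t w
best_max_margin <- which.max(2 / norms)       # maximize the margin 2/||w||
```

The squared form is usually preferred in practice because it is differentiable everywhere and leads to a quadratic program.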
We have two boundaries, w · x + b = 1 and w · x + b = −1, and a middle one, which is w · x + b = 0. Now we want to figure out the distance between each of these two lines and the middle one. Call this distance d.
We consider a point z on w · x + b = 1. Then, setting this point z as the origin, the point on w · x + b = 0 would be z − d · w/||w||. Here, the distance between the middle line and w · x + b = 1 is the distance between z and z − d · w/||w||.
Now, since this point is on w · x + b = 0, we would have:
w · (z − d · w/||w||) + b = 0
w · z + b − w · d · w/||w|| = 0
We know that since z is located on w · x + b = 1, w · z + b equals 1, hence:
1 − w · d · w/||w|| = 0
w · d · w/||w|| = 1
d · ||w||^2/||w|| = 1
d · ||w|| = 1
d = 1/||w||
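The result can be checked numerically. This is a sketch with made-up values for w and b: it picks a point on the boundary w · x + b = 1, steps a distance 1/||w|| against the unit normal, and confirms that the new point lies on the middle line.

```r
# Hypothetical values, chosen only for illustration.
w <- c(3, 4)
b <- 2
w_norm <- sqrt(sum(w^2))

# A point z on the boundary w.x + b = 1 (here taken along the direction of w).
z <- w * (1 - b) / w_norm^2

# Step a distance 1/||w|| from z against the unit normal w/||w||.
dist <- 1 / w_norm
p <- z - dist * w / w_norm

on_upper  <- sum(w * z) + b       # value of w.x + b at z
on_middle <- sum(w * p) + b       # value of w.x + b at p
gap       <- sqrt(sum((z - p)^2)) # Euclidean distance between z and p
```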
I would like to know the range of values that a function f(x) can take based on a range of values of x.
For instance, say I have a quadratic equation f(x)=x^2 - x + 0.2 and I want to know the range of f(x) for x in the range [0.2, 1].
is there a function or package in R that can do this?
If I understand your question correctly, you are looking for:
f <- function(x) x^2 - x + 0.2
x <- seq(0.2, 1, by=0.1)
range(f(x))
# [1] -0.05 0.20 # approximate numerical answer
If you want to know the range in an analytical way you have to do some mathematics (or further programming) to determine the maximum and minimum of the function f in that range of x.
An analytic answer can be calculated using calculus, if the function is differentiable. For the example quadratic, the calculation is:
f'(x) = 2x - 1 = 0 => x* = 1/2 is the candidate argmin/argmax, and it lies within the domain for x: [0.2, 1]
Evaluate f at the domain endpoints, and the argmin/max:
f(0.2) = 0.04, f(0.5) = -0.05, f(1) = 0.2.
So min = -0.05, max = 0.2.
A numerical approximation will work if the function is well-behaved (e.g. continuous, differentiable). Otherwise, a spike or discontinuity (e.g. f(x) = 1/x) could be missed depending on the step-size.
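The endpoint-plus-critical-point check can also be sketched numerically with base R's optimize(). The variable names below are illustrative, and optimize() only locates a local optimum, so this is reliable for well-behaved functions such as the example quadratic rather than in general.

```r
f <- function(x) x^2 - x + 0.2
lower <- 0.2
upper <- 1

# Locations of an interior minimum and maximum found by a one-dimensional search.
interior_min <- optimize(f, interval = c(lower, upper))$minimum
interior_max <- optimize(f, interval = c(lower, upper), maximum = TRUE)$maximum

# Evaluate f at the endpoints and at the points found above, then take the range.
candidates <- c(lower, upper, interior_min, interior_max)
f_range <- range(f(candidates))
```

Including the endpoints explicitly matters because the search works inside the interval and may stop slightly short of a boundary optimum.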
I have a (non-symmetric) probability matrix, and an observed vector of integer outcomes. I would like to find a vector that maximises the probability of the outcomes, given the transition matrix. Simply, I am trying to estimate a distribution of particles at sea given their ultimate distribution on land, and a matrix of probabilities of a particle released from a given point in the ocean ending up at a given point on the land.
The vector that I want to find is subject to the constraints that all components must be between 0 and 1 and that the sum of the components must equal 1. I am trying to figure out the best optimisation approach for the problem.
My transition matrix and data set are quite large, but I have created a smaller one here:
I used a simulated known at-sea distribution of
msim<-c(.3,.2,.1,.3,.1,0) and a simulated probability matrix (t) to come up with an estimated coastal data set (Datasim2), as follows:
t<-matrix (c(0,.1,.1,.1,.1,.2,0,.1,0,0,.3,0,0,0,0,.4,.1,.3,0,.1,0,.1,.4,0,0,0,.1,0,.1,.1),
nrow=5,ncol=6, byrow=T)
rownames(t)<-c("C1","C2","C3","C4","C5") ### locations on land
colnames(t)<-c("S1","S2","S3","S4","S5","S6") ### locations at sea
Datasim<-as.numeric (round((t %*% msim)*500))
Datasim2<-c(rep("C1",95), rep("C2",35), rep("C3",90),rep("C4",15),rep("C5",30))
M <-c(0.1,0.1,0.1,0.1,0.1,0.1) ## starting M
I started with a straightforward function as follows:
EstimateSource3<-function(M,Data,T){
EstEndProbsall<-M%*%T
TotalLkhd<-rep(NA, times=dim(Data)[1])
for (j in 1:dim(Data)[1]){
ObsEstEndLkhd<-0
ObsEstEndLkhd<-1-EstEndProbsall[1,] ## likelihood of particle NOT ending up at locations other than the location of interest
IndexC<-which(colnames(EstEndProbsall)==Data$LocationCode[j], arr.ind=T) ## likelihood of ending up at location of interest
ObsEstEndLkhd[IndexC]<-EstEndProbsall[IndexC]
#Total likelihood
TotalLkhd[j]<-sum(log(ObsEstEndLkhd))
}
SumTotalLkhd<-sum(TotalLkhd)
return(SumTotalLkhd)
}
DistributionEstimate <- optim(par = M, fn = EstimateSource3, Data = Datasim2, T=t,
control = list(fnscale = -1, trace=5, maxit=500), lower = 0, upper = 1)
To constrain the sum to 1, I tried using a few of the suggestions posted here: How to set parameters' sum to 1 in constrained optimization
e.g. adding M<-M/sum(M) or SumTotalLkhd<-SumTotalLkhd-(10*pwr) to the body of the function, but neither yielded anything like msim, and in fact, the 2nd solution came up with the error "L-BFGS-B needs finite values of 'fn'"
I thought perhaps the quadprog package might be of some help, but I don't think I have a symmetric positive definite matrix…
Thanks in advance for your help!
What about this: let D = distribution on land, M = distribution at sea, and T the transition matrix. You know D and T, and you want to calculate M. You have
D' = M' T
hence D' T' = M' (T T')
and accordingly D'T'(T T')^(-1) = M'
Basically you solve it as when doing linear regression (seems SO does not support math notation: ' is transpose, ^(-1) is ordinary matrix inverse.)
Alternatively, D may be counts of particles, and now you can ask questions like: what is the most likely distribution of particles at sea. That needs a different approach though.
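A minimal sketch of the least-squares formula M' = D'T'(T T')^(-1), using made-up numbers. It assumes T T' is invertible, which requires at least as many land locations as sea locations; that does not hold for the question's example with 6 sea and 5 land locations. The names Tm, M_true and M_hat are illustrative.

```r
# Hypothetical example: 2 sea locations (rows) and 3 land locations (columns).
Tm <- rbind(c(0.5, 0.3, 0.2),
            c(0.1, 0.4, 0.5))
M_true <- c(0.3, 0.7)

# Land distribution implied by M_true: D' = M' T.
D <- as.vector(M_true %*% Tm)

# Recover M from D and T: M' = D' T' (T T')^(-1).
M_hat <- as.vector(D %*% t(Tm) %*% solve(Tm %*% t(Tm)))
```

Note that this unconstrained solve does not by itself keep the components of M between 0 and 1 or force them to sum to 1 when D is noisy.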
Well, I have never done such models but think along the following lines. Let M be of length 3 and D of length 2, and T is hence 3x2. We know T and we observe D_1 particles at location 1 and D_2 particles at location 2.
What is the likelihood that you observe one particle at location 1? It is Pr(D = 1) = M_1 T_11 + M_2 T_21 + M_3 T_31. Analogously, Pr(D = 2) = M_1 T_12 + M_2 T_22 + M_3 T_32. Now you can easily write the log-likelihood of observing D_1 and D_2 particles at locations 1 and 2. The code might look like this:
loglik <- function(M) {
if(M[1] < 0 | M[1] > 1)
return(NA)
if(M[2] < 0 | M[2] > 1)
return(NA)
M3 <- 1 - M[1] - M[2]
if(M3 < 0 | M3 > 1)
return(NA)
D[1]*log(T[1,1]*M[1] + T[2,1]*M[2] + T[3,1]*M3) +
D[2]*log(T[1,2]*M[1] + T[2,2]*M[2] + T[3,2]*M3)
}
T <- matrix(c(0.1,0.2,0.3,0.9,0.8,0.7), 3, 2)
D <- c(100,200)
library(maxLik)
m <- maxLik(loglik, start=c(0.4,0.4), method="BFGS")
summary(m)
I get the answer (0, 0.2, 0.8) when I estimate it but standard errors are very large.
As I said, I have never done this before, so I don't know if it makes sense.
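The sum-to-one and 0-1 constraints from the question can also be handled by reparameterizing M so that any unconstrained parameter value maps onto a valid distribution. The following is only a sketch of that idea using base R's optim(): the matrix Tm, the counts D and the helper to_simplex are hypothetical, and the counts were constructed from M = (0.3, 0.7) so that the fit can be compared against a known answer.

```r
# Hypothetical example: 2 sea locations (rows) and 3 land locations (columns).
Tm <- rbind(c(0.5, 0.3, 0.2),
            c(0.1, 0.4, 0.5))
D <- c(220, 370, 410)   # made-up counts on land, built from M = (0.3, 0.7)

# Map unconstrained parameters to a probability vector (softmax with the
# last component fixed at 0 so the parameterization is identifiable).
to_simplex <- function(theta) {
  e <- exp(c(theta, 0))
  e / sum(e)
}

# Negative multinomial log-likelihood of the land counts given theta.
neg_loglik <- function(theta) {
  p_land <- as.vector(to_simplex(theta) %*% Tm)
  -sum(D * log(p_land))
}

fit <- optim(par = 0, fn = neg_loglik, method = "BFGS")
M_hat <- to_simplex(fit$par)
```

Because the constraints are built into to_simplex, the optimizer never evaluates the likelihood at an invalid M and no NA returns or penalty terms are needed.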
I am trying to generate power-law distributed numbers ranging from 0 to 1 from a uniform distribution. I found two approaches and I am not sure which one is right and which one is wrong.
1st Source: Wolfram:
2nd Source: Physical Review (Page 2):
Where: y = uniform variate, n = distribution power, x0 and x1 = range of the distribution, x = power-law distributed variate.
The second one only gives decent results for x0 = 0 and x1 = 1, when n is between 0 and 1.
If y is a uniform random variable between 0 and 1, then 1-y also is. Thereby letting z = 1-y you can transform your formula (1) as :
x = [x_1^{n+1} - (x_1^{n+1} - x_0^{n+1}) z]^{1/(n+1)}
which is then the same as your formula (2) except for the change n -> (-n).
So I suppose that the only difference between these two formulas is in the notation for how n relates to the power-law decay (unfortunately the link you gave for the Wolfram Alpha formula is invalid, so I cannot check which notation they use).
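The substitution z = 1 - y can be checked numerically. This sketch assumes formula (1) is the inverse-CDF form obtained by putting z = 1 - y back into the expression above, that is x = [(x_1^{n+1} - x_0^{n+1}) y + x_0^{n+1}]^{1/(n+1)}. The function names and the parameter values are illustrative.

```r
# Formula (1) as assumed here, written in terms of the uniform variate y.
power_law_from_uniform <- function(y, n, x0, x1) {
  ((x1^(n + 1) - x0^(n + 1)) * y + x0^(n + 1))^(1 / (n + 1))
}

# The same formula rewritten in terms of z = 1 - y.
power_law_from_complement <- function(z, n, x0, x1) {
  (x1^(n + 1) - (x1^(n + 1) - x0^(n + 1)) * z)^(1 / (n + 1))
}

set.seed(1)
y <- runif(1000)

# Illustrative parameters: exponent n = 2 on the range [0.1, 1].
x_a <- power_law_from_uniform(y, n = 2, x0 = 0.1, x1 = 1)
x_b <- power_law_from_complement(1 - y, n = 2, x0 = 0.1, x1 = 1)
```

Both vectors hold the same values, and since y and 1 - y are both uniform on (0, 1), either form can be fed a uniform variate directly.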