Error in Gradient Descent Calculation - r

I tried to write a function to calculate gradient descent for a linear regression model. However the answers I was getting does not match the answers I get using the normal equation method.
My sample data is:
df <- data.frame(c(1,5,6),c(3,5,6),c(4,6,8))
with c(4,6,8) being the y values.
lm_gradient_descent <- function(df,learning_rate, y_col=length(df),scale=TRUE){
n_features <- length(df) #n_features is the number of features in the data set
#using mean normalization to scale features
if(scale==TRUE){
for (i in 1:(n_features)){
df[,i] <- (df[,i]-mean(df[,i]))/sd(df[,i])
}
}
y_data <- df[,y_col]
df[,y_col] <- NULL
par <- rep(1,n_features)
df <- merge(1,df)
data_mat <- data.matrix(df)
#we need a temp_arr to store each iteration of parameter values so that we can do a
#simultaneous update
temp_arr <- rep(0,n_features)
diff <- 1
while(diff>0.0000001){
for (i in 1:(n_features)){
temp_arr[i] <- par[i]-learning_rate*sum((data_mat%*%par-y_data)*df[,i])/length(y_data)
}
diff <- par[1]-temp_arr[1]
print(diff)
par <- temp_arr
}
return(par)
}
Running this function,
lm_gradient_descent(df,0.0001,,0)
the results I got were
c(0.9165891,0.6115482,0.5652970)
when I use the normal equation method, I get
c(2,1,0).
Hope someone can shed some light on where I went wrong in this function.

You used the stopping criterion
old parameters - new parameters <= 0.0000001
First of all I think there's an abs() missing if you want to use this criterion (though my ignorance of R may be at fault).
But even if you use
abs(old parameters - new parameters) <= 0.0000001
this is not a good stopping criterion: it only tells you that progress has slowed down, not that it's already sufficiently accurate. Try instead simply to iterate for a fixed number of iterations. Unfortunately it's not that easy to give a good, generally applicable stopping criterion for gradient descent here.

It seems that you have not implemented a bias term. In a linear model like this, you always want to have an additional additive constant, i.e., your model should be like
w_0 + w_1*x_1 + ... + w_n*x_n.
Without the w_0 term, you usually won't get a good fit.

I know this is a couple of weeks old at this point but I'm going to take a stab at for several reasons, namely
Relatively new to R so deciphering your code and rewriting it is good practice for me
Working on a different Gradient Descent problem so this is all fresh to me
Need the stackflow points and
As far as I can tell you never got a working answer.
First, regarding your data structures. You start with a dataframe, rename a column, strip out a vector, then strip out a matrix. It would be a lot easier to just start with an X matrix (capitalized since its component 'features' are referred to as xsubscript i) and a y solution vector.
X <- cbind(c(1,5,6),c(3,5,6))
y <- c(4,6,8)
We can easily see what the desired solutions are, with and without scaling by fitting a linear fit model. (NOTE We only scale X/features and not y/solutions)
> lm(y~X)
Call:
lm(formula = y ~ X)
Coefficients:
(Intercept) X1 X2
-4 -1 3
> lm(y~scale(X))
Call:
lm(formula = y ~ scale(X))
Coefficients:
(Intercept) scale(X)1 scale(X)2
6.000 -2.646 4.583
With regards to your code, one of the beauties of R is that it can perform matrix multiplication which is significantly faster than using loops.
lm_gradient_descent <- function(X, y, learning_rate, scale=TRUE){
if(scale==TRUE){X <- scale(X)}
X <- cbind(1, X)
theta <- rep(0, ncol(X)) #your old temp_arr
diff <- 1
old.error <- sum( (X %*% theta - y)^2 ) / (2*length(y))
while(diff>0.000000001){
theta <- theta - learning_rate * t(X) %*% (X %*% theta - y) / length(y)
new.error <- sum( (X %*% theta - y)^2 ) / (2*length(y))
diff <- abs(old.error - new.error)
old.error <- new.error
}
return(theta)
}
And to show it works...
> lm_gradient_descent(X, y, .01, 0)
[,1]
[1,] -3.9360685
[2,] -0.9851775
[3,] 2.9736566
vs expected of (-4, -1, 3)
For what its worth while I agree with #cfh that I would prefer a loop with a defined number of iterations, I'm actually not sure you need the abs function. If diff < 0 then your function is not converging.
Finally rather than using something like old.error and new.error I'd suggest using a a vector that records all errors. You can then plot that vector to see how quickly your function converges.

Related

how to draw the log-likelihood graph

I am learning how to draw a log-likelihood graph. Please allow me briefly introduce what I want to do specifically:
Assume we have the data/vector as below:
set.seed(123)
sample <- rpois(50, 1.65)
And the log_like function is given as below:
log_like_graph <- function(lambda){
X <- as.matrix(sample) # not sure whether this is necessary for one-parameter distribution.
N <- nrow(X)
logLik <- N*log(lambda) - lambda*N*mean(X)
return(loglik)
}
log_like_graph <- Vectorize(log_like_graph)
# set range of lambda
lambda_vals <- seq(-10,10,by=1)
log_vals <- outer(lambda_vals,log_like_graph)
Based on the above lambda_vals and log_vals, I expect to produce a plot like below:
However, when I excute the last command: log_vals <- outer(lambda_vals,log_like_graph), I got the error hint
Error in as.vector(x, mode) :
cannot coerce type 'closure' to vector of type 'any'
Could you please help me solve this problem? Thank you very much!
(FYI: I mainly follow the youtube video https://www.youtube.com/watch?v=w3drLH-DFpE&ab_channel=CalebLikesR that teaches to draw the curve for a log-likelihood function, although it uses normal distribution for demonstration.)
A couple of things I see; no need to vectorise log_like_graph as you can just pass lambda values into it with sapply rather than outer, you are passing lambda_vals < 0 but the support of lambda is >= 0, and I don't think your log-likelihood function is correct (I think it should be -N * lambda - sum(lfactorial(sample)) + log(lambda) * sum(X) but it is easier/more accurate to use dpois(..., log=TRUE)).
So fixing these things
# data
set.seed(123)
samples <- rpois(50, 1.65)
# The log-likelihood becomes
log_like_graph <- function(X, lambda){
N <- NROW(X)
logLik <- -N * lambda - sum(lfactorial(X)) + log(lambda) * sum(X)
return(logLik)
}
# set lambda >= 0 and take smaller steps (0.01) for a smoother curve
lambda_vals <- seq(0,10,by=0.01)
# loop through lambda values calculating the log-likehood at each value
ll1 <- sapply(lambda_vals, function(i) log_like_graph(samples, i))
plot(lambda_vals, ll1, type="l")
This can also be done with dpois(..., log=TRUE) :
ll2 <- sapply(lambda_vals, function(i) sum(dpois(samples, lambda=i, log=TRUE)))
all.equal(ll1, ll2)
# [1] TRUE

Non-linear optimization for exponential function with linear constraints

I try to solve a non-linear optimization problem using the function donlp2 in R. My goal is to find out the maximum value of the following function:
442.8658*(x1+1)^(0.008752747)*(y1+1)^(0.555782)+(x2+1)^(0.008752747)*(y2+1)^(0.555782)
There is no non-linear constraints. The linear constraints are listed below:
x1+x2<=20000;
y1+y2<=20000;
x1<=4662.41;
x2<=149339;
y1<=14013.94;
y2<=1342738;
x1>=0;
x2>=0;
y1>=0;
y2>=0;
Below is my code:
p <- c(rep(0,4))
par.l <- c(rep(0,4))
par.u <- c(4662.41, 149339, 14013.94, 1342738)
fn <- function(par){
x1 <- par[1]; y1<-par[3]
x2 <- par[2]; y2<-par[4]
y <- 1 / (442.8658*(x1+1)^(0.008752747)*(y1+1)^(0.555782)
+ (x2+1)^(0.008752747)*(y2+1)^(0.555782))
}
A <- matrix(c(rep(c(1,0),2), rep(c(0,1),2)), nrow=2)
lin.l <- c(-Inf, 20000)
lin.u <- c(-Inf, 20000)
ret <- donlp2(p, fn, par.u=par.u, par.l=par.l, A=A, lin.l=lin.l, lin.u=lin.u)
I searched and found some related posts saying that donlp2 is only good to find minimum value of a function, which is the reason I took the reciprocal in the objective function.
The code ran correctly, but I have concerns with the results, since I can easily find other values that can give me greater outcome, i.e. the minimization of the objective function is not true.
I also found that when I change the initial value or the lower bound of x1,x2,y1,y2, the results will change dramatically. For example, if I set p=c(rep(0,4)), par.l<-c(rep(1,4)) instead of p=c(rep(0,4)), par.l<-c(rep(0,4)), the results will change from
$par
[1] 2.410409e+00 5.442753e-03 1.000000e+04 1.000000e+04
to
$par
[1] 2331.748 74670.025 3180.113 16819.887
Any ideas? I appreciate your suggestions and help!

Function to calculate R2 (R-squared) in R

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like
obs <- 1:5
mod <- c(0.8,2.4,2,3,4.8)
df <- data.frame(obs, mod)
R2 <- rsq(df)
# 0.85
You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation. So you can define you function as:
rsq <- function (x, y) cor(x, y) ^ 2
Sandipan's answer will return you exactly the same result (see the following proof), but as it stands it appears more readable (due to the evident $r.squared).
Let's do the statistics
Basically we fit a linear regression of y over x, and compute the ratio of regression sum of squares to total sum of squares.
lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)
lemma 2: beta = cov(x, y) / var(x)
lemma 3: R.square = cor(x, y) ^ 2
Warning
R squared between two arbitrary vectors x and y (of the same length) is just a goodness measure of their linear relationship. Think twice!! R squared between x + a and y + b are identical for any constant shift a and b. So it is a weak or even useless measure on "goodness of prediction". Use MSE or RMSE instead:
How to obtain RMSE out of lm result?
R - Calculate Test MSE given a trained model from a training set and a test set
I agree with 42-'s comment:
The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.
R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.
Here is very extreme example:
preds <- 1:4/4
actual <- 1:4
The R squared between those two vectors is 1. Yes of course, one is just a linear rescaling of the other so they have a perfect linear relationship. But, do you really think that the preds is a good prediction on actual??
In reply to wordsforthewise
Thanks for your comments 1, 2 and your answer of details.
You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x then compute regression sum of squares and total sum of squares. It looks like you skip this regression step and go straight to the sum of square computation. That is false, since the partition of sum of squares does not hold and you can't compute R squared in a consistent way.
As you demonstrated, this is just one way for computing R squared:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25
But there is another:
regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75
Also, your formula can give a negative value (the proper value should be 1 as mentioned above in the Warning section).
preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375
Final remark
I had never expected that this answer could eventually be so long when I posted my initial answer 2 years ago. However, given the high views of this thread, I feel obliged to add more statistical details and discussions. I don't want to mislead people that just because they can compute an R squared so easily, they can use R squared everywhere.
Why not this:
rsq <- function(x, y) summary(lm(y~x))$r.squared
rsq(obs, mod)
#[1] 0.8560185
It is not something obvious, but the caret package has a function postResample() that will calculate "A vector of performance estimates" according to the documentation. The "performance estimates" are
RMSE
Rsquared
mean absolute error (MAE)
and have to be accessed from the vector like this
library(caret)
vect1 <- c(1, 2, 3)
vect2 <- c(3, 2, 2)
res <- caret::postResample(vect1, vect2)
rsq <- res[2]
However, this is using the correlation squared approximation for r-squared as mentioned in another answer. I'm not sure why Max Kuhn didn't just use the conventional 1-SSE/SST.
caret also has an R2() method, although it's hard to find in the documentation.
The way to implement the normal coefficient of determination equation is:
preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)
tss <- sum((actual - mean(actual)) ^ 2)
rsq <- 1 - rss/tss
Not too bad to code by hand of course, but why isn't there a function for it in a language primarily made for statistics? I'm thinking I must be missing the implementation of R^2 somewhere, or no one cares enough about it to implement it. Most of the implementations, like this one, seem to be for generalized linear models.
You can also use the summary for linear models:
summary(lm(obs ~ mod, data=df))$r.squared
Here is the simplest solution based on [https://en.wikipedia.org/wiki/Coefficient_of_determination]
# 1. 'Actual' and 'Predicted' data
df <- data.frame(
y_actual = c(1:5),
y_predicted = c(0.8, 2.4, 2, 3, 4.8))
# 2. R2 Score components
# 2.1. Average of actual data
avr_y_actual <- mean(df$y_actual)
# 2.2. Total sum of squares
ss_total <- sum((df$y_actual - avr_y_actual)^2)
# 2.3. Regression sum of squares
ss_regression <- sum((df$y_predicted - avr_y_actual)^2)
# 2.4. Residual sum of squares
ss_residuals <- sum((df$y_actual - df$y_predicted)^2)
# 3. R2 Score
r2 <- 1 - ss_residuals / ss_total
Not sure why this isn't implemented directly in R, but this answer is essentially the same as Andrii's and Wordsforthewise, I just turned into a function for the sake of convenience if somebody uses it a lot like me.
r2_general <-function(preds,actual){
return(1- sum((preds - actual) ^ 2)/sum((actual - mean(actual))^2))
}
I am use the function MLmetrics::R2_Score from the packages MLmetrics, to compute R2 it uses the vanilla 1-(RSS/TSS) formula.

sampling a multimensional posterior distribution using MCMC Metropolis-Hastings algo in R

I am quite new in sampling posterior distributions(so therefore Bayesian approach) using a MCMC technique based on Metropolis-Hastings algorithm.
I am using the mcmc library in R for this. My distribution is multidimensionnal. In order to check if this metro algorithm works for multivaiate distribution I did it successfully on a multidimensional student-t distribution (package mvtnorm, function dmvt).
Now I want to apply the same thing to my multivariate distribution (2 vars x and y) but it doesn't work; I get an error : Error in X[, 1] : incorrect number of dimensions
Here is my code:
library(mcmc)
library(mvtnorm)
my.seed <- 123
logprior<-function(X,...)
{
ifelse( (-50.0 <= X[,1] & X[,1]<=50.0) & (-50.0 <= X[,2] & X[,2]<=50.0), return(0), return(-Inf))
}
logpost<-function(X,...)
{
log.like <- log( exp(-((X[,1]^2 + X[,2]^2 - 4)/10 )^2) * sin(4*atan(X[,2]/X[,1])) )
log.prior<-logprior(X)
log.post<-log.like + log.prior # if flat prior, the posterior distribution is the likelihood one
return (log.post)
}
x <- seq(-5,5,0.15)
y <- seq(-5,5,0.15)
X<-cbind(x,y)
#out <- metrop(function(X) dmvt(X, df=3, log=TRUE), 0, blen=100, nbatch=100) ; this works
out <- metrop(function(X) logpost(X), c(0,0), blen=100, nbatch=100)
out <- metrop(out)
out$accept
So I tried to respect the same kind of format than for the MWE, but it doesn't work still as I got the error mentioned before.
Another thing, is that applying logpost to X works perfectly.
Thanks in advance for your help, best
The metrop function passes individual samples, and therefore a simple vector to logpost, not a matrix (which is what X is). Hence, the solution is to change X[,1] and X[,2] to X[1] and X[2], respectively.
I ran it like this, and it leads to other issues (X[2]/X[1] is NaN for the initialization), but that has more to do with your specific likelihood model and is out of the scope of your question.

Does cattell's profile similarity coefficient (Rp) exist as a function in R?

i'm comparing different measures of distance and similarity for vector profiles (Subtest results) in R, most of them are easy to compute and/or exist in dist().
Unfortunately, one that might be interesting and is to difficult for me to calculate myself is Cattel's Rp. I can not find it in R.
Does anybody know if this exists already?
Or can you help me to write a function?
The formula (Cattell 1994) of Rp is this:
(2k-d^2)/(2k + d^2)
where:
k is the median for chi square on a sample of size n;
d is the sum of the (weighted=m) difference between the two profiles,
sth like: sum(m(x(i)-y(i)));
one thing i don't know is, how to get the chi square median in there
Thank you
What i get without defining the k is:
Rp.Cattell <- function(x,y){z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);return(z)}
Vector examples are:
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
They are measures by the same device, but related to different bodyparts. They don't need to be standartised or weighted, i would say.
This page gives a general formula for k, and then gives a more thorough method using SAS/IML which pretty much gives the same results. So I used the general formula, added calculation of degrees of freedom, which leads to this:
Rp.Cattell <- function(x,y) {
dof <- (2-1) * (length(y)-1)
k <- (1-2/(9*dof))^3
z <- (2*k-sum(sum(x-y))^2)/(2*k+sum(sum(x-y))^2)
return(z)
}
x <- c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758)
y <- c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925)
Rp.Cattell(x, y)
# [1] -0.9012083
Does this figure appear to make sense?
Trying to verify the function, I found out now that the median of chisquare is the chisquare value for 50% probability - relating to random. So the function should be:
Rp.Cattell <- function(x,y){
dof <- (2-1) * (length(y)-1)
k <- qchisq(.50, df=dof)
z <- (2k-(sum(x-y))^2)/(2k+(sum(x-y))^2);
return(z)}
It is necessary though to standardize the Values before, so the results are distributed correctly.
So:
library ("stringr")
# they are centered already
x <- as.vector(scale(c(-1.2357,-1.1999,-1.4727,-0.3915,-0.2547,-0.4758),center=F, scale=T))
y <- as.vector(scale(c(0.7785,0.9357,0.7165,-0.6067,-0.4668,-0.5925),center=F, scale=T))
Rp.Cattell(x, y) -0.584423
This sounds reasonable now - or not?
I consider calculation of z is incorrect.
You need to calculate the sum of the squared differences. Not the square of the sum of differences. Besides product operator is missing in 2k.
It should be
z <- (2*k-sum((x-y)^2))/(2*k+sum((x-y)^2))
Do you agree?

Resources