Difference in theta values between gradient descent and linear model in R

I am using the Boston dataset as my input and I am trying to build a model to predict MEDV (median value of owner-occupied homes in USD 1000s) using RM (average number of rooms per dwelling).
I have bastardised the following code from Digithead's blog, and not by very much, as you can see.
My code is as follows:
library(MASS)   # the Boston housing data ships with the MASS package
data("Boston")
x <- Boston$rm
y <- Boston$medv
# fit a linear model
res <- lm( y ~ x )
print(res)
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
    -34.671        9.102
# plot the data and the model
plot(x,y, col=rgb(0.2,0.4,0.6,0.4), main='Linear regression')
abline(res, col='blue')
# squared error cost function
cost <- function(X, y, theta) {
  sum((X %*% theta - y)^2) / (2 * length(y))
}
# learning rate and iteration limit
alpha <- 0.01
num_iters <- 1000
# keep history
cost_history <- double(num_iters)
theta_history <- list(num_iters)
# initialize coefficients
theta <- matrix(c(0,0), nrow=2)
# add a column of 1's for the intercept coefficient
X <- cbind(1, matrix(x))
# gradient descent
for (i in 1:num_iters) {
  error <- (X %*% theta - y)
  delta <- t(X) %*% error / length(y)
  theta <- theta - alpha * delta
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}
print(theta)
          [,1]
[1,] -3.431269
[2,]  4.191125
As per Digithead's blog, his theta values from lm and from gradient descent match, whereas mine do not. Shouldn't these figures match?
As you can see from the plot of theta over the iterations, my final y-intercept does not tally with the print(theta) value a few lines up.
Can anyone make a suggestion as to where I am going wrong?

Gradient descent takes a while to converge. Increasing the number of iterations will get the model to converge to the OLS values. For instance:
# learning rate and iteration limit
alpha <- 0.01
num_iters <- 100000 # Here I increase the number of iterations in your code to 100k.
# The gd algorithm now takes a minute or so to run on my admittedly
# middle-of-the-line laptop.
# keep history
cost_history <- double(num_iters)
theta_history <- list(num_iters)
# initialize coefficients
theta <- matrix(c(0,0), nrow=2)
# add a column of 1's for the intercept coefficient
X <- cbind(1, matrix(x))
# gradient descent (now takes a little longer!)
for (i in 1:num_iters) {
  error <- (X %*% theta - y)
  delta <- (t(X) %*% error) / length(y)
  theta <- theta - alpha * delta
  cost_history[i] <- cost(X, y, theta)
  theta_history[[i]] <- theta
}
print(theta)
           [,1]
[1,] -34.670410
[2,]   9.102076
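Since the loop already stores cost_history, it is easy to check whether a run has actually converged before trusting theta; a small sketch using the objects defined above:
# plot the cost over the iterations; a curve that is still clearly falling
# at the right-hand edge means the run has not converged yet
plot(cost_history, type = "l", xlab = "iteration", ylab = "cost")
# change over the last two iterations; a non-negligible value is another
# sign that more iterations (or a different step size) are needed
abs(diff(tail(cost_history, 2)))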

Related

How to include penalisation in Hat matrix?

I am trying to optimize the span value of a LOESS fit based on GCV. Below is the code I am using.
gcv.calc <- function(span) {
  cal <- list()   # collect one GCV value per span
  for (sp in span) {
    fit <- loess(y ~ x, data = df, span = sp)
    res <- residuals(fit)
    sse <- sum(res ^ 2)
    ## calculate trace of the hat matrix
    H <- X %*% (solve(t(X) %*% X) %*% t(X))   ### solve() calculates the inverse of a matrix
    traceMat <- sum(diag(H))
    GCV <- (n * sse) / (n - traceMat) ^ 2
    cal[[length(cal) + 1]] <- GCV
  }
  return(cal)
}
span <- seq(from = 0.1, to = 0.5, by = 0.01)
output <- gcv.calc(span)
I want to incorporate the LOESS smoothing so that the degrees of freedom change as the span changes, and then choose the span that makes the GCV as small as possible.
How can I incorporate the regularization parameter and smoothing constraint into the hat matrix?
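The hat matrix built from X above belongs to a fixed parametric fit, so its trace never changes with the span. One way to get an effective degrees of freedom that does change with the span is to take it from the loess fit itself; a minimal sketch, assuming x and y hold the data and relying on the trace.hat component of the loess object:
gcv.loess <- function(span, x, y) {
  n <- length(y)
  sapply(span, function(sp) {
    fit <- loess(y ~ x, span = sp)
    sse <- sum(residuals(fit)^2)
    trace.hat <- fit$trace.hat   # trace of the loess smoother matrix for this span
    (n * sse) / (n - trace.hat)^2
  })
}
span <- seq(from = 0.1, to = 0.5, by = 0.01)
gcv <- gcv.loess(span, x, y)
span[which.min(gcv)]   # span with the smallest GCV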

Linear regression gradient descent algorithms in R produce varying results

I am trying to implement a linear regression in R from scratch without using any packages or libraries using the following data:
UCI Machine Learning Repository, Bike-Sharing-Dataset
The linear regression was easy enough, here is the code:
data <- read.csv("Bike-Sharing-Dataset/hour.csv")
# Select the useable features
data1 <- data[, c("season", "mnth", "hr", "holiday", "weekday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed", "cnt")]
# Split the data
trainingObs<-sample(nrow(data1),0.70*nrow(data1),replace=FALSE)
# Create the training dataset
trainingDS<-data1[trainingObs,]
# Create the test dataset
testDS<-data1[-trainingObs,]
x0 <- rep(1, nrow(trainingDS)) # column of 1's
x1 <- trainingDS[, c("season", "mnth", "hr", "holiday", "weekday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed")]
# create the x- matrix of explanatory variables
x <- as.matrix(cbind(x0,x1))
# create the y-matrix of dependent variables
y <- as.matrix(trainingDS$cnt)
m <- nrow(y)
solve(t(x)%*%x)%*%t(x)%*%y
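As a sanity check, lm fitted on the same training data should reproduce these coefficients (a quick sketch):
# lm with an intercept on the same predictors should give the same estimates
coef(lm(cnt ~ ., data = trainingDS))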
The next step is to implement the batch-update gradient descent, and this is where I am running into problems. I don't know where the errors are coming from or how to fix them; the code runs, but the values it produces are radically different from the results of the regression, and I am unsure why.
The two versions of the batch update gradient descent that I have implemented are as follows (the results of both algorithms differ from one another and from the results of the regression):
# Gradient descent 1
gradientDesc <- function(x, y, learn_rate, conv_threshold, n, max_iter) {
  plot(x, y, col = "blue", pch = 20)
  m <- runif(1, 0, 1)
  c <- runif(1, 0, 1)
  yhat <- m * x + c
  MSE <- sum((y - yhat) ^ 2) / n
  converged = F
  iterations = 0
  while(converged == F) {
    ## Implement the gradient descent algorithm
    m_new <- m - learn_rate * ((1 / n) * (sum((yhat - y) * x)))
    c_new <- c - learn_rate * ((1 / n) * (sum(yhat - y)))
    m <- m_new
    c <- c_new
    yhat <- m * x + c
    MSE_new <- sum((y - yhat) ^ 2) / n
    if(MSE - MSE_new <= conv_threshold) {
      abline(c, m)
      converged = T
      return(paste("Optimal intercept:", c, "Optimal slope:", m))
    }
    iterations = iterations + 1
    if(iterations > max_iter) {
      abline(c, m)
      converged = T
      return(paste("Optimal intercept:", c, "Optimal slope:", m))
    }
  }
  return(paste("MSE=", MSE))
}
AND:
grad <- function(x, y, theta) { # note that for readability, I redefined theta as a column vector
  gradient <- 1 / m * t(x) %*% (x %*% theta - y)
  return(gradient)
}
grad.descent <- function(x, maxit, alpha) {
  theta <- matrix(rep(0, length = ncol(x)), ncol = 1)
  for (i in 1:maxit) {
    theta <- theta - alpha * grad(x, y, theta)
  }
  return(theta)
}
If someone could explain why these two functions are producing different results I would greatly appreciate it. I also want to make sure that I am in fact implementing the gradient descent correctly.
Lastly, how can I plot the results of the descent with varying learning rates and superimpose this data over the results of the regression itself?
EDIT
Here are the results of running the two algorithms with alpha = .005 and 10,000 iterations:
1)
> gradientDesc(trainingDS, y, 0.005, 0.001, 32, 10000)
[1] "Optimal intercept: 2183458.95872599 Optimal slope: 62417773.0184353"
2)
> print(grad.descent(x, 10000, .005))
                   [,1]
x0            8.3681113
season       19.8399837
mnth         -0.3515479
hr            8.0269388
holiday     -16.2429750
weekday       1.9615369
workingday    7.6063719
weathersit  -12.0611266
temp        157.5315413
atemp       138.8019732
hum        -162.7948299
windspeed    31.5442471
To give you an example of how to write functions like this in a slightly better way, consider the following:
gradientDesc <- function(x, y, learn_rate, conv_threshold, max_iter) {
  n <- nrow(x)
  m <- runif(ncol(x), 0, 1) # m is a vector of dimension ncol(x), 1
  yhat <- x %*% m           # since x already contains a constant, no need to add another one
  MSE <- sum((y - yhat) ^ 2) / n
  converged = F
  iterations = 0
  while(converged == F) {
    m <- m - learn_rate * (1 / n * t(x) %*% (yhat - y))
    yhat <- x %*% m
    MSE_new <- sum((y - yhat) ^ 2) / n
    if(abs(MSE - MSE_new) <= conv_threshold) {
      converged = T
    }
    iterations = iterations + 1
    MSE <- MSE_new
    if(iterations >= max_iter) break
  }
  return(list(converged = converged,
              num_iterations = iterations,
              MSE = MSE_new,
              coefs = m))
}
For comparison:
ols <- solve(t(x) %*% x) %*% t(x) %*% y
Now,
out <- gradientDesc(x,y, 0.005, 1e-7, 200000)
data.frame(ols, out$coefs)
                    ols    out.coefs
x0           33.0663095   35.2995589
season       18.5603565   18.5779534
mnth         -0.1441603   -0.1458521
hr            7.4374031    7.4420685
holiday     -21.0608520  -21.3284449
weekday       1.5115838    1.4813259
workingday    5.9953383    5.9643950
weathersit   -0.2990723   -0.4073493
temp        100.0719903  147.1157262
atemp       226.9828394  174.0260534
hum        -225.7411524 -225.2686640
windspeed    12.3671942    9.5792498
Here, x refers to your x as defined in your first code chunk. Note the similarity between the coefficients. However, also note that
out$converged
[1] FALSE
so that you could increase the accuracy by increasing the number of iterations or by playing around with the step size. It might also help to scale your variables first.
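As a rough sketch of the scaling idea (the exact gains depend on the data): standardise the non-constant columns, run the same gradientDesc, and map the coefficients back to the original scale afterwards.
# standardise every column except the intercept column x0
xs <- scale(x[, -1])
x_sc <- cbind(x0 = 1, xs)
out_sc <- gradientDesc(x_sc, y, 0.005, 1e-7, 200000)
out_sc$converged  # convergence is typically reached in far fewer iterations
# map the coefficients back to the original predictor scale
ctr <- attr(xs, "scaled:center")
scl <- attr(xs, "scaled:scale")
slopes <- out_sc$coefs[-1] / scl
intercept <- out_sc$coefs[1] - sum(slopes * ctr)
data.frame(ols, gd_rescaled = c(intercept, slopes))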

Estimate a probit regression model with optim()

I need to program a probit regression model manually, without using glm. I would like to use optim for direct minimization of the negative log-likelihood.
I wrote the code below, but it does not work, giving this error:
cannot coerce type 'closure' to vector of type 'double'
# load data (provided via the link at the bottom); read.dta is from the foreign package
library(foreign)
Datospregunta2a <- read.dta("problema2_1.dta")
attach(Datospregunta2a)
# model matrix `X` and response `Y`
X <- cbind(1, associate_professor, full_professor, emeritus_professor, other_rank)
Y <- volunteer
# number of regression coefficients
K <- ncol(X)
# initial guess on coefficients
vi <- lm(volunteer ~ associate_professor, full_professor, emeritus_professor, other_rank)$coefficients
# negative log-likelihood
probit.nll <- function (beta) {
  exb <- exp(X %*% beta)
  prob <- rnorm(exb)
  logexb <- log(prob)
  y0 <- (1 - y)
  logexb0 <- log(1 - prob)
  yt <- t(y)
  y0t <- t(y0)
  -sum(yt %*% logexb + y0t %*% logexb0)
}
# gradient
probit.gr <- function (beta) {
  grad <- numeric(K)
  exb <- exp(X %*% beta)
  prob <- rnorm(exb)
  for (k in 1:K) grad[k] <- sum(X[, k] * (y - prob))
  return(-grad)
}
# direct minimization
fit <- optim(vi, probit.nll, gr = probit.gr, method = "BFGS", hessian = TRUE)
data: https://drive.google.com/file/d/0B06Id6VJyeb5OTFjbHVHUE42THc/view?usp=sharing
case sensitive
Y and y are different objects, so you should use Y, not y, inside your functions probit.nll and probit.gr.
These two functions also do not look correct. The most evident problem is the use of rnorm (random draws) where the normal CDF and density are needed. The following are corrected versions.
negative log-likelihood function
# requires model matrix `X` and binary response `Y`
probit.nll <- function (beta) {
  # linear predictor
  eta <- X %*% beta
  # probability
  p <- pnorm(eta)
  # negative log-likelihood
  -sum((1 - Y) * log(1 - p) + Y * log(p))
}
gradient function
# requires model matrix `X` and binary response `Y`
probit.gr <- function (beta) {
  # linear predictor
  eta <- X %*% beta
  # probability
  p <- pnorm(eta)
  # chain rule
  u <- dnorm(eta) * (Y - p) / (p * (1 - p))
  # gradient
  -crossprod(X, u)
}
initial parameter values from lm()
This is not a reasonable idea: linear regression should not be applied to binary data.
However, purely in terms of the lm call itself, you need + rather than , to separate covariates on the right-hand side of the formula.
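Purely to illustrate the formula syntax (not as an endorsement of using lm on binary data), the call would look like this:
vi <- lm(volunteer ~ associate_professor + full_professor +
           emeritus_professor + other_rank)$coefficients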
reproducible example
Let's generate a toy dataset
set.seed(0)
# model matrix
X <- cbind(1, matrix(runif(300, -2, 1), 100))
# coefficients
b <- runif(4)
# response
Y <- rbinom(100, 1, pnorm(X %*% b))
# `glm` estimate
GLM <- glm(Y ~ X - 1, family = binomial(link = "probit"))
# our own estimation via `optim`
# I am using `b` as initial parameter values (being lazy)
fit <- optim(b, probit.nll, gr = probit.gr, method = "BFGS", hessian = TRUE)
# comparison
unname(coef(GLM))
# 0.62183195 0.38971121 0.06321124 0.44199523
fit$par
# 0.62183540 0.38971287 0.06321318 0.44199659
They are very close to each other!
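Since hessian = TRUE was requested, approximate standard errors can also be read off the returned Hessian of the negative log-likelihood and compared with glm; a small sketch:
# the inverse Hessian of the negative log-likelihood at the optimum
# approximates the variance-covariance matrix of the estimates
se_optim <- sqrt(diag(solve(fit$hessian)))
se_glm <- sqrt(diag(vcov(GLM)))
cbind(se_optim, se_glm)   # the two sets of standard errors should be close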

How to calculate variance of least squares estimator using QR decomposition in R?

I'm trying to learn QR decomposition, but can't figure out how to get the variance of beta_hat without resorting to traditional matrix calculations. I'm practising with the iris data set, and here's what I have so far:
y<-(iris$Sepal.Length)
x<-(iris$Sepal.Width)
X<-cbind(1,x)
n<-nrow(X)
p<-ncol(X)
qr.X<-qr(X)
b<-(t(qr.Q(qr.X)) %*% y)[1:p]
R<-qr.R(qr.X)
beta<-as.vector(backsolve(R,b))
res<-as.vector(y-X %*% beta)
Thanks for your help!
setup (copying in your code)
y <- iris$Sepal.Length
x <- iris$Sepal.Width
X <- cbind(1,x)
n <- nrow(X)
p <- ncol(X)
qr.X <- qr(X)
b <- (t(qr.Q(qr.X)) %*% y)[1:p] ## can be optimized; see Remark 1 below
R <- qr.R(qr.X) ## can be optimized; see Remark 2 below
beta <- as.vector(backsolve(R, b))
res <- as.vector(y - X %*% beta)
math
Since X = QR with Q having orthonormal columns, X'X = R'R. Hence the covariance of the least squares estimator is Var(beta_hat) = sigma^2 (X'X)^(-1) = sigma^2 (R'R)^(-1), which can be formed from R alone via chol2inv.
computation
The residual degrees of freedom are n - p, so the estimated error variance is
se2 <- sum(res ^ 2) / (n - p)
Thus the variance-covariance matrix of the estimated coefficients is
V <- chol2inv(R) * se2
# [,1] [,2]
#[1,] 0.22934170 -0.07352916
#[2,] -0.07352916 0.02405009
validation
Let's check the correctness by comparing with lm:
fit <- lm(Sepal.Length ~ Sepal.Width, iris)
vcov(fit)
# (Intercept) Sepal.Width
#(Intercept) 0.22934170 -0.07352916
#Sepal.Width -0.07352916 0.02405009
Identical result!
Remark 1 (skip forming 'Q' factor)
Instead of b <- (t(qr.Q(qr.X)) %*% y)[1:p], you can use the function qr.qty (which avoids forming the 'Q' matrix explicitly):
b <- qr.qty(qr.X, y)[1:p]
Remark 2 (skip forming 'R' factor)
You don't have to extract R <- qr.R(qr.X) for backsolve; using qr.X$qr is sufficient:
beta <- as.vector(backsolve(qr.X$qr, b))
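A quick check that both routes give the same coefficients:
beta1 <- as.vector(backsolve(R, b))
beta2 <- as.vector(backsolve(qr.X$qr, b))
all.equal(beta1, beta2)   # TRUE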
Appendix: A function for estimation
The above is the simplest demonstration. In practice, column pivoting and rank deficiency need to be dealt with. The following is an implementation; X is a model matrix and y is the response. Results should be compared with lm(y ~ X + 0).
qr_estimation <- function (X, y) {
  ## QR factorization
  QR <- qr(X)
  r <- QR$rank
  piv <- QR$pivot[1:r]
  ## estimate identifiable coefficients
  b <- qr.qty(QR, y)[1:r]
  beta <- backsolve(QR$qr, b, r)
  ## fitted values
  yhat <- base::c(X[, piv] %*% beta)
  ## residuals
  resi <- y - yhat
  ## error variance
  se2 <- base::c(crossprod(resi)) / (nrow(X) - r)
  ## variance-covariance for coefficients
  V <- chol2inv(QR$qr, r) * se2
  ## post-processing on pivoting and rank-deficiency
  p <- ncol(X)
  beta_full <- rep.int(NA_real_, p)
  beta_full[piv] <- beta
  V_full <- matrix(NA_real_, p, p)
  V_full[piv, piv] <- V
  ## return
  list(coefficients = beta_full, vcov = V_full,
       fitted.values = yhat, residuals = resi, sig = sqrt(se2))
}
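A quick usage sketch with the iris objects defined in the setup above:
out <- qr_estimation(X, y)
fit0 <- lm(y ~ X + 0)
cbind(qr = out$coefficients, lm = coef(fit0))   # coefficients agree
out$vcov
vcov(fit0)                                      # variance-covariance matrices agree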

R - Fitting a constrained AutoRegression time series

I have a time series which I need to fit with an AR (autoregression) model.
The AR model has the form:
x(t) = a0 + a1*x(t-1) + a2*x(t-2) + ... + aq*x(t-q) + noise.
I have two constraints:
Find the best AR fit when lag.max = 50.
Sum of all coefficients a0 + a1 + ... + aq = 1
I wrote the below code:
require(FitAR)
data(lynx) # my real data comes from the stock market.
z <- -log(lynx)
#find best model
step <- SelectModel(z, ARModel = "AR" ,lag.max = 50, Criterion = "AIC",Best=10)
summary(step) # display results
# fit the model and get coefficients
arfit <- ar(z, order.max = ceiling(mean(step[, 1])), aic = FALSE)  # ceiling(), not ceil(), is the base R function; aic = FALSE fixes the fit at order.max
#check if sum of coefficients are 1
sum(arfit$ar)
[1] 0.5784978
My question is: how do I add the constraint that all the coefficients sum to 1?
I looked at this question, but I cannot see how to apply it here.
UPDATE
I think I managed to solve my question as follows.
library(quadprog)
coeff <- arfit$ar
y <- 0
for (i in 1:length(coeff)) {
  y <- y + coeff[i] * c(z[(i + 1):length(z)], rep(0, i))
  ifelse(i == 1,
         X <- c(z[2:length(z)], 0),
         X <- cbind(X, c(z[(i + 1):length(z)], rep(0, i))))
}
Dmat <- t(X) %*% X
# equality constraint (meq = 1): the coefficients must sum to 1
s <- solve.QP(Dmat, t(y) %*% X, matrix(1, nrow = length(coeff), ncol = 1), 1, meq = 1)
s$solution
# The coefficients should sum up to 1
sum(s$solution)
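With the objects built above, the effect of the equality constraint can be seen by comparing against the unconstrained least squares solution for the same design (a sketch):
# unconstrained least squares coefficients on the same lagged design
unconstrained <- solve(Dmat, t(X) %*% y)
sum(unconstrained)   # generally not equal to 1
sum(s$solution)      # forced to 1 by the equality constraint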
