I am trying to calculate the standardized Pearson Residuals by hand in R. However, I am struggling when it comes to calculating the hat matrix.
I have built my own logistic regression and I am trying to calculate the standardized Pearson residuals in the logReg function.
logRegEst <- function(x, y, threshold = 1e-10, maxIter = 100)
{
calcPi <- function(x, beta)
{
beta <- as.vector(beta)
return(exp(x %*% beta) / (1 + exp(x %*% beta)))
}
beta <- rep(0, ncol(x)) # initial guess for beta
diff <- 1000
# initial value bigger than threshold so that we can enter our while loop
iterCount = 0
# counter for the iterations to ensure we're not stuck in an infinite loop
while(diff > threshold) # tests for convergence
{
pi <- as.vector(calcPi(x, beta))
# calculate pi by using the current estimate of beta
W <- diag(pi * (1 - pi))
# calculate matrix of weights W as defined int he fisher scooring algorithem
beta_change <- solve(t(x) %*% W %*% x) %*% t(x) %*% (y - pi)
# calculate the change in beta
beta <- beta + beta_change # new beta
diff <- sum(beta_change^2)
# calculate how much we changed beta by in this iteration
# if this is less than threshold, we'll break the while loop
iterCount <- iterCount + 1
# see if we've hit the maximum number of iterations
if(iterCount > maxIter){
stop("This isn't converging.")
}
# stop if we have hit the maximum number of iterations
}
n <- length(y)
df <- length(y) - ncol(x)
# calculating the degrees of freedom by taking the length of y minus
# the number of x columns
vcov <- solve(t(x) %*% W %*% x)
logLik <- sum(y * log(pi / (1 - pi)) + log(1 - pi))
deviance <- -2 * logLik
AIC <- -2 * logLik + 2 * ncol(x)
rank <- ncol(x)
list(coefficients = beta, vcov = vcov, df = df, deviance = deviance,
AIC = AIC, iter = iterCount - 1, x = x, y = y, n = n, rank = rank)
# returning results
}
logReg <- function(formula, data)
{
if (sum(is.na(data)) > 0) {
print("missing values in data")
} else {
mf <- model.frame(formula = formula, data = data)
# model.frame() returns us a data.frame with the variables needed to use the
# formula.
x <- model.matrix(attr(mf, "terms"), data = mf)
# model.matrix() creates a design matrix. That means that for example the
#"Sex"-variable is given as a dummy variable with ones and zeros.
y <- as.numeric(model.response(mf)) - 1
# model.response gives us the response variable.
est <- logRegEst(x, y)
# Now we have the starting position to apply our function from above.
est$formula <- formula
est$call <- match.call()
# We add the formular and the call to the list.
nullModel <- logRegEst(x = as.matrix(rep(1, length(y))), y)
est$nullDeviance <- nullModel$deviance
est$nullDf <- nullModel$df
mu <- exp(as.vector(est$x %*% est$coefficients)) /
(1 + exp(as.vector(est$x %*% est$coefficients)))
# computing the fitted values
est$residuals <- (est$y - mu) / sqrt(mu * (1 - mu))
est$mu <- mu
est$x <- x
est$y <- y
est$data <- data
hat <- (t(mu))^(1/2)%*%x%*%(t(x)%*%mu%*%x)^(-1)%*%t(x)%*%mu^(1/2)
est$stdresiduals <- est$residuals/(sqrt(1-hat))
class(est) <- "logReg"
# defining the class
est
}
}
I am struggling when it comes to calculating 𝐻=𝑉̂1/2𝑋(𝑋𝑇𝑉̂𝑋)−1𝑋𝑇𝑉̂1/2. This is called hat in my code.
If I try to calculate the hat matrix (hat) I get the error that I cannot multiply the vector mu and the matrix x in this case: t(x)%*%mu%*%x.
I can see that the rank of the matrices are not identical and therefor I can't multiply them.
Can Anyone see where my mistake is? Help is very appreciated. Thanks!
Related
Following this question: How to get the value of `t` so that my function `h(t)=epsilon` for a fixed `epsilon`?
I first sampling 500 eigenvectors v of a random matrix G and then generate 100 different random vectors initial of dimension 500. I normalized them in mats.
#make this example reproducible
set.seed(100001)
n <- 500
#Sample GOE random matrix
A <- matrix(rnorm(n*n, mean=0, sd=1), n, n)
G <- (A + t(A))/sqrt(2*n)
ev <- eigen(G)
l <- ev$values
v <- ev$vectors
#size of multivariate distribution
mean <- rep(0, n)
var <- diag(n)
#simulate bivariate normal distribution
initial <- MASS::mvrnorm(n=1000, mu=mean, Sigma=var) #ten random vectors
#normalized the first possible initial value, the initial data uniformly distributed on the sphere
xmats <- lapply(1:1000, function(i) initial[i, ]/norm(initial[i, ], type="2"))
Then I compute res
h1t <- function(t,x_0) {
h10 <- c(x_0 %*% v[, n])
denom <- vapply(t, function(.t) {
sum((x_0 %*% v)^2 * exp(-4*(l - l[n]) * .t))
}, numeric(1L))
abs(h10) / sqrt(denom)
}
find_t <- function(x, epsilon = 0.01, range = c(-50, 50)) {
uniroot(function(t) h1t(t, x) - epsilon, range,
tol = .Machine$double.eps)$root
}
I want to get res:
res <- lapply(xmats, find_t)
However, it shows error that Error in uniroot(function(t) h1t(t, x) - epsilon, range, tol = .Machine$double.eps) : f() values at end points not of opposite sign
res is a list. I run hist(unlist(res)) and it worked well.
I have a problem when I try to run the dffits() function for an object of my own logistic regression.
When I'm running dffits(log) I get the error message:
error in if (model$rank == 0) { : Argument is of length 0
However, when I'm using the inbuilt gym function (family = binomial), then dffits(glm) works just fine.
Here is my function for the logistic regression and a short example of my problem:
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mydata$admit <- factor(mydata$admit)
logRegEst <- function(x, y, threshold = 1e-10, maxIter = 100)
{
calcPi <- function(x, beta)
{
beta <- as.vector(beta)
return(exp(x %*% beta) / (1 + exp(x %*% beta)))
}
beta <- rep(0, ncol(x)) # initial guess for beta
diff <- 1000
# initial value bigger than threshold so that we can enter our while loop
iterCount = 0
# counter to ensure we're not stuck in an infinite loop
while(diff > threshold) # tests for convergence
{
pi <- as.vector(calcPi(x, beta))
# calculate pi by using the current estimate of beta
W <- diag(pi * (1 - pi)) # calculate matrix of weights W
beta_change <- solve(t(x) %*% W %*% x) %*% t(x) %*% (y - pi)
# calculate the change in beta
beta <- beta + beta_change # new beta
diff <- sum(beta_change^2)
# calculate how much we changed beta by in this iteration
# if this is less than threshold, we'll break the while loop
iterCount <- iterCount + 1
# see if we've hit the maximum number of iterations
if(iterCount > maxIter){
stop("This isn't converging.")
}
# stop if we have hit the maximum number of iterations
}
df <- length(y) - ncol(x)
# calculating the degrees of freedom by taking the length of y minus
# the number of x columns
vcov <- solve(t(x) %*% W %*% x)
list(coefficients = beta, vcov = vcov, df = df)
# returning results
}
logReg <- function(formula, data)
{
mf <- model.frame(formula = formula, data = data)
# model.frame() returns us a data.frame with the variables needed to use the
# formula.
x <- model.matrix(attr(mf, "terms"), data = mf)
# model.matrix() creates a design matrix. That means that for example the
#"Sex"-variable is given as a dummy variable with ones and zeros.
y <- as.numeric(model.response(mf)) - 1
# model.response gives us the response variable.
est <- logRegEst(x, y)
# Now we have the starting position to apply our function from above.
est$formula <- formula
est$call <- match.call()
est$data <- data
# We add the formular and the call to the list.
est$x <- x
est$y <- y
# We add x and y to the list.
class(est) <- "logReg"
# defining the class
est
}
log <- logReg(admit ~ gre + gpa, data= mydata)
glm <- glm(admit ~ gre + gpa, data= mydata, family = binomial)
dffits(glm)
dffits(log)
log$data
glm$data
I don't understand why mydata$rank == 0, because when I look at log$data I see that the rank is just defined as in glm$data.
I really appreciate your help!
I am trying to code gradient descent in R. The goal is to collect a data frame of each estimate so I can plot the algorithm's search through the parameter space.
I am using the built-in dataset data(cars) in R. Unfortunately something is way off in my function. The estimates just increase linearly with each iteration! But I cannot figure out where I err.
Any tips?
Code:
GradientDescent <- function(b0_start, b1_start, x, y, niter=10, alpha=0.1) {
# initialize
gradient_b0 = 0
gradient_b1 = 0
x <- as.matrix(x)
y <- as.matrix(y)
N = length(y)
results <- matrix(nrow=niter, ncol=2)
# gradient
for(i in 1:N){
gradient_b0 <- gradient_b0 + (-2/N) * (y[i] - (b0_start + b1_start*x[i]))
gradient_b1 <- gradient_b1 + (-2/N) * x[i] * (y[i] - (b0_start + b1_start*x[i]))
}
# descent
b0_hat <- b0_start
b1_hat <- b1_start
for(i in 1:niter){
b0_hat <- b0_hat - (alpha*gradient_b0)
b1_hat <- b1_hat - (alpha*gradient_b1)
# collect
results[i,] <- c(b0_hat,b1_hat)
}
# return
df <- data.frame(results)
colnames(df) <- c("b0", "b1")
return(df)
}
> test <- GradientDescent(0,0,cars$speed, cars$dist, niter=1000)
> head(test,2); tail(test,2)
b0 b1
1 8.596 153.928
2 17.192 307.856
b0 b1
999 8587.404 153774.1
1000 8596.000 153928.0
Here is a solution for cars dataset:
# dependent and independent variables
y <- cars$dist
x <- cars$speed
# number of iterations
iter_n <- 100
# initial value of the parameter
theta1 <- 0
# learning rate
alpha <- 0.001
m <- nrow(cars)
yhat <- theta1*x
# a tibble to record the parameter update and cost
library(tibble)
results <- data_frame(theta1 = as.numeric(),
cost = NA,
iteration = 1)
# run the gradient descent
for (i in 1:iter_n){
theta1 <- theta1 - alpha * ((1 / m) * (sum((yhat - y) * x)))
yhat <- theta1*x
cost <- (1/m)*sum((yhat-y)^2)
results[i, 1] = theta1
results[i, 2] <- cost
results[i, 3] <- i
}
# print the parameter value after the defined iteration
print(theta1)
# 2.909132
Checking whether cost is decreasing:
library(ggplot2)
ggplot(results, aes(x = iteration, y = cost))+
geom_line()+
geom_point()
I wrote a more detailed blog post here.
I'm having trouble optimizing a multivariate normal log-likelihood in R. If anyone has a good solution for that, please let me know. Specifically, I cannot seem to keep the variance-covariance matrix positive-definite and the parameters in a reasonable range.
Let me introduce the problem more completely. I am essentially trying to simultaneously solve these two regression equations using MLE:
$$
y_1 = \beta_1 + \beta_2 x_1 + \beta_3 x_2 \\
y_2 = \beta_4 + \beta_3 x_1 + \beta_5 x_2
$$
The fact that $\beta_3$ is in both equations is not a mistake. I try to solve this using MLE by maximizing the likelihood of the multivariate normal distribution for $Y = (y_1, y_2)^\top$ where the mean is parameterized as above in the regression equations.
I've attached the log-likelihood function as I believe it should be, where I constrain the variance covariance matrix to be positive-definite by recreating it from necessarily positive eigenvalues and a cholesky decomposition.
mvrestricted_ll <- function(par, Y, X) {
# Indices
n <- nrow(X)
nbetas <- (2 + 3 * (ncol(Y) - 1))
# Extract parameters
beta <- par[1:nbetas]
eigvals <- exp(par[(nbetas + 1):(nbetas + ncol(Y))]) # constrain to be positive
chole <- par[(nbetas + ncol(Y) + 1):(nbetas + ncol(Y) + ncol(Y)*(ncol(Y)+1)/2)]
# Build Sigma from positive eigenvalues and cholesky (should be pos def)
L <- diag(ncol(Y))
L[lower.tri(L, diag=T)] <- chole
Sigma <- diag(eigvals) + tcrossprod(L)
# Linear predictor
# Hard coded for 2x2 example for now
mu <- cbind(beta[1] + beta[2]*X[,1] + beta[3]*X[,2],
beta[4] + beta[3]*X[,1] + beta[5]*X[,2])
yminmu <- Y - mu
nlogs <- n * log(det(Sigma))
invSigma <- solve(Sigma)
meat <- yminmu %*% tcrossprod(invSigma, yminmu)
return(- nlogs - sum(diag(meat)))
}
# Create fake data
n <- 1000
p <- 2
set.seed(20160201)
X <- matrix(rnorm(n*p), nrow = n)
set.seed(20160201)
Y <- matrix(rnorm(n*p), nrow = n)
# Initialize parameters
initpars <- c(rep(0, (2 + 3 * (ncol(Y) - 1)) + ncol(Y) + ncol(Y)*(ncol(Y)+1)/2))
# Optimize fails with BFGS
optim(par = initpars, fn = mvrestricted_ll, X=X, Y=Y, method = "BFGS")
# Optim does not converge with Nelder-mead, if you up the maxits it also fails
optim(par = initpars, fn = mvrestricted_ll, X=X, Y=Y)
Any help would be greatly appreciated.
EDIT: I should note that just letting Sigma be a vector in the parameters and then returning a very large value whenever it is not positive definite does not work either.
I have no idea if the code/answer is correct, but
invSigma <- try(solve(Sigma))
if (inherits(invSigma, "try-error")) return(NA)
and running
optim(par = initpars, fn = mvrestricted_ll, X=X, Y=Y,
control = list(maxit = 1e5))
gets me a little farther to a convergence code of 10 (degenerate Nelder-Mead simplex).
$par
[1] 1.361612e+01 4.674349e+01 -3.050170e+01 3.305013e+01 6.731194e+01
[6] -3.117192e+01 -5.408598e+00 -6.326897e-07 -1.987449e+01 -1.795924e+01
$value
[1] -1.529013e+19
$counts
function gradient
1219 NA
$convergence
[1] 10
I suspect that a real solution will involve looking more carefully at the code to see if it's really doing what you think it's doing (sorry); understanding why solve() errors occur might be a good first step. You can work on troubleshooting this by putting a cat(par, "\n") as the first line of the function and running it without the try/NA-return code. That will allow you to isolate an example data set that throws the error — then you can work your way through your code a line at a time (with debug() or by hand) to see what's happening.
You can consider using the following approach :
library(DEoptim)
fn <- function(par, mat_X, mat_Y)
{
X <- mat_X
Y <- mat_Y
n <- nrow(X)
nbetas <- (2 + 3 * (ncol(Y) - 1))
beta <- par[1 : nbetas]
eigvals <- exp(par[(nbetas + 1) : (nbetas + ncol(Y))])
chole <- par[(nbetas + ncol(Y) + 1) : (nbetas + ncol(Y) + ncol(Y) * (ncol(Y) + 1) / 2)]
L <- diag(ncol(Y))
L[lower.tri(L, diag = TRUE)] <- chole
Sigma <- tryCatch(diag(eigvals) + tcrossprod(L), error = function(e) NA)
if(is.null(dim(Sigma)))
{
return(10 ^ 30)
}else
{
mu <- cbind(beta[1] + beta[2] * X[,1] + beta[3] * X[,2],
beta[4] + beta[3] * X[,1] + beta[5] * X[,2])
yminmu <- Y - mu
nlogs <- n * log(det(Sigma))
invSigma <- tryCatch(solve(Sigma), error = function(e) NA)
if(is.null(dim(invSigma)))
{
return(10 ^ 30)
}else
{
meat <- yminmu %*% tcrossprod(invSigma, yminmu)
log_Lik <- - nlogs - sum(diag(meat))
if(is.na(log_Lik) | is.nan(log_Lik) | is.infinite(log_Lik))
{
return(10 ^ 30)
}else
{
return(-log_Lik)
}
}
}
}
n <- 1000
p <- 2
set.seed(20160201)
mat_X <- matrix(rnorm(n * p), nrow = n)
set.seed(2436537)
mat_Y <- matrix(rnorm(n * p), nrow = n)
lower <- rep(-10, 10)
upper <- rep(10, 10)
DEoptim(fn = fn, lower = lower, upper = upper,
control = list(itermax = 10000, parallelType = 1), mat_X = mat_X, mat_Y = mat_Y)
I did code for Newton Raphson for logistic regression. Unfortunately I tried many data there is no convergence. there is a mistake I do not know where is it. Can anyone help to figure out what is the problem.
First the data is as following; y indicate the response (0,1) , Z is 115*30 matrix which is the exploratory variables. I need to estimate the 30 parameters.
y = c(rep(0,60),rep(1,55))
X = sample(c(0,1),size=3450,replace=T)
Z = t(matrix(X,ncol=115))
#The code is ;
B = matrix(rep(0,30*10),ncol=10)
B[,1] = matrix(rep(0,30),ncol=1)
for(i in 2 : 10){
print(i)
p <- exp(Z %*%as.matrix(B[,i])) / (1 + exp(Z %*% as.matrix(B[,i])))
v.2 <- diag(as.vector(1 * p*(1-p)))
score.2 <- t(Z) %*% (y - p) # score function
increm <- solve(t(Z) %*% v.2 %*% Z)
B[,i] = as.matrix(B[,i-1])+increm%*%score.2
if(B[,i]-B[i-1]==matrix(rep(0.0001,30),ncol=1)){
return(B)
}
}
Found it! You're updating p based on B[,i], you should be using B[,i-1] ...
While I was finding the answer, I cleaned up your code and incorporated the results in a function. R's built-in glm seems to work (see below). One note is that this approach is likely to be unstable: fitting a binary model with 30 predictors and only 115 binary responses, and without any penalization or shrinkage, is extremely optimistic ...
set.seed(101)
n.obs <- 115
n.zero <- 60
n.pred <- 30
y <- c(rep(0,n.zero),rep(1,n.obs-n.zero))
X <- sample(c(0,1),size=n.pred*n.obs,replace=TRUE)
Z <- t(matrix(X,ncol=n.obs))
R's built-in glm fitter does work (it uses iteratively reweighted least squares, not N-R):
g1 <- glm(y~.-1,data.frame(y,Z),family="binomial")
(If you want to view the results, library("arm"); coefplot(g1).)
## B_{m+1} = B_m + (X^T V_m X)^{-1} X^T (Y-P_m)
NRfit function:
NRfit <- function(y,X,start,n.iter=100,tol=1e-4,verbose=TRUE) {
## used X rather than Z just because it's more standard notation
n.pred <- ncol(X)
B <- matrix(NA,ncol=n.iter,
nrow=n.pred)
B[,1] <- start
for (i in 2:n.iter) {
if (verbose) cat(i,"\n")
p <- plogis(X %*% B[,i-1])
v.2 <- diag(c(p*(1-p)))
score.2 <- t(X) %*% (y - p) # score function
increm <- solve(t(X) %*% v.2 %*% X)
B[,i] <- B[,i-1]+increm%*%score.2
if (all(abs(B[,i]-B[,i-1]) < tol)) return(B)
}
B
}
matplot(res1 <- t(NRfit(y,Z,start=coef(g1))))
matplot(res2 <- t(NRfit(y,Z,start=rep(0,ncol(Z)))))
all.equal(res2[6,],unname(coef(g1))) ## TRUE