I am trying to calculate standardized Pearson residuals by hand in R, but I am struggling when it comes to calculating the hat matrix.
I have built my own logistic regression and I am trying to calculate the standardized Pearson residuals in the logReg function.
logRegEst <- function(x, y, threshold = 1e-10, maxIter = 100)
{
calcPi <- function(x, beta)
{
beta <- as.vector(beta)
return(exp(x %*% beta) / (1 + exp(x %*% beta)))
}
beta <- rep(0, ncol(x)) # initial guess for beta
diff <- 1000
# initial value bigger than threshold so that we can enter our while loop
iterCount = 0
# counter for the iterations to ensure we're not stuck in an infinite loop
while(diff > threshold) # tests for convergence
{
pi <- as.vector(calcPi(x, beta))
# calculate pi by using the current estimate of beta
W <- diag(pi * (1 - pi))
# calculate the matrix of weights W as defined in the Fisher scoring algorithm
beta_change <- solve(t(x) %*% W %*% x) %*% t(x) %*% (y - pi)
# calculate the change in beta
beta <- beta + beta_change # new beta
diff <- sum(beta_change^2)
# calculate how much we changed beta by in this iteration
# if this is less than threshold, we'll break the while loop
iterCount <- iterCount + 1
# see if we've hit the maximum number of iterations
if(iterCount > maxIter){
stop("This isn't converging.")
}
# stop if we have hit the maximum number of iterations
}
n <- length(y)
df <- length(y) - ncol(x)
# calculating the degrees of freedom by taking the length of y minus
# the number of x columns
vcov <- solve(t(x) %*% W %*% x)
logLik <- sum(y * log(pi / (1 - pi)) + log(1 - pi))
deviance <- -2 * logLik
AIC <- -2 * logLik + 2 * ncol(x)
rank <- ncol(x)
list(coefficients = beta, vcov = vcov, df = df, deviance = deviance,
AIC = AIC, iter = iterCount - 1, x = x, y = y, n = n, rank = rank)
# returning results
}
logReg <- function(formula, data)
{
if (sum(is.na(data)) > 0) {
print("missing values in data")
} else {
mf <- model.frame(formula = formula, data = data)
# model.frame() returns us a data.frame with the variables needed to use the
# formula.
x <- model.matrix(attr(mf, "terms"), data = mf)
# model.matrix() creates a design matrix. That means that for example the
#"Sex"-variable is given as a dummy variable with ones and zeros.
y <- as.numeric(model.response(mf)) - 1
# model.response gives us the response variable.
est <- logRegEst(x, y)
# Now we have the starting position to apply our function from above.
est$formula <- formula
est$call <- match.call()
# We add the formula and the call to the list.
nullModel <- logRegEst(x = as.matrix(rep(1, length(y))), y)
est$nullDeviance <- nullModel$deviance
est$nullDf <- nullModel$df
mu <- exp(as.vector(est$x %*% est$coefficients)) /
(1 + exp(as.vector(est$x %*% est$coefficients)))
# computing the fitted values
est$residuals <- (est$y - mu) / sqrt(mu * (1 - mu))
est$mu <- mu
est$x <- x
est$y <- y
est$data <- data
hat <- (t(mu))^(1/2)%*%x%*%(t(x)%*%mu%*%x)^(-1)%*%t(x)%*%mu^(1/2)
est$stdresiduals <- est$residuals/(sqrt(1-hat))
class(est) <- "logReg"
# defining the class
est
}
}
I am struggling when it comes to calculating H = V̂^(1/2) X (X^T V̂ X)^(-1) X^T V̂^(1/2). This is called hat in my code.
If I try to calculate the hat matrix (hat), I get an error saying that I cannot multiply the vector mu and the matrix x in the term t(x) %*% mu %*% x.
I can see that the dimensions are not conformable and that this is why I can't multiply them.
Can anyone see where my mistake is? Help is very much appreciated. Thanks!
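If I understand the formula correctly, V̂ should be a diagonal matrix rather than the vector mu, so inside logReg (after mu has been computed) the calculation would look something like the sketch below; this is only my reading of the formula and I have not verified it:
V <- diag(mu * (1 - mu))             # V-hat: n x n diagonal variance matrix
Vhalf <- diag(sqrt(mu * (1 - mu)))   # V-hat^(1/2)
H <- Vhalf %*% x %*% solve(t(x) %*% V %*% x) %*% t(x) %*% Vhalf
stdres <- (y - mu) / sqrt(mu * (1 - mu)) / sqrt(1 - diag(H))   # standardized Pearson residuals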
I hope you are well. I was wondering if you could help me with the question in the link below. Beneath the link I attach R code that solves the problem recursively for particular values of the parameters of the distributions involved; however, I have realized that this method is very inefficient. Thanks a lot for your help.
How to obtain the probability distribution of a sum of dependent discrete random variables more efficiently
library(boot) # The library boot is necessary to use the command inv.logit.
TMax <- 500 # In this R-code, I am using TMax instead of using T.
M <- 2000
beta0 <- 1
beta1 <- 0.5
Prob_S <- function(k, r){ # In this R-code, I am using r instead of using t.
if(r == 1){
Aux <- dbinom(x = k, size = M, prob = inv.logit(beta0))
}
if(r %in% 2:TMax){
Aux <- 0
for(u in 0:k){
Aux <- Aux + dbinom(x = k - u, size = M - u,
prob = inv.logit(beta0 + beta1 * u)) * Prob_S(u, r - 1)
}
}
Aux
}
m <- 300
P <- Prob_S(k = m, r = TMax) # Computing P takes a loooong time. :(
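One direction that avoids recomputing the same subproblems is to iterate over r and carry the whole distribution of S_r as a vector, as in the rough sketch below (not carefully tested), but I would still like to know if there is a better approach:
Prob_S_iter <- function(m, TMax, M, beta0, beta1){
  # p[u + 1] holds P(S_r = u) for u = 0, ..., m at the current stage r
  p <- dbinom(0:m, size = M, prob = inv.logit(beta0))   # stage r = 1
  for(r in 2:TMax){
    p_new <- numeric(m + 1)
    for(k in 0:m){
      u <- 0:k
      p_new[k + 1] <- sum(dbinom(k - u, size = M - u,
                                 prob = inv.logit(beta0 + beta1 * u)) * p[u + 1])
    }
    p <- p_new
  }
  p
}
# P <- Prob_S_iter(m, TMax, M, beta0, beta1)[m + 1]   # same quantity as above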
Consider a data set with three variables (here normal and independent, but they could be correlated):
data = data.frame(x1 = rnorm(10000),
x2 = rnorm(10000),
x3 = rnorm(10000))
I would like to find the narrowest ranges for x1, x2 and x3 such that 95% of the observations fall within all three ranges.
So far I have the code below.
is.between <- function(x, a, b){
x <= max(c(a, b)) & x >= min(c(a, b))
}
getlims <- function(lims, x1, x2, x3){
abs(mean(
is.between(x1, lims[1], lims[2]) &
is.between(x2, lims[3], lims[4]) &
is.between(x3, lims[5], lims[6])
) - 0.95)
}
optim(initial_values, getlims, x1=x1,x2=x2,x3=x3)
Where lims[1:2] is the range for x1, lims[3:4] the range for x2 and lims[5:6] the range for x3.
It provides limits that contain 95% of my observations, but it doesn't guarantee that the volume, i.e. the product of the widths of lims[1:2], lims[3:4] and lims[5:6], is the smallest possible.
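For concreteness, initial_values could for example be taken as the marginal 2.5%/97.5% quantiles (just one possible, hypothetical choice):
initial_values <- c(quantile(data$x1, c(0.025, 0.975)),
                    quantile(data$x2, c(0.025, 0.975)),
                    quantile(data$x3, c(0.025, 0.975)))
optim(initial_values, getlims, x1 = data$x1, x2 = data$x2, x3 = data$x3)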
I think this is actually a problem in discrete optimization. It was posed in three dimensions; I have reformulated it in two dimensions to make it easier to visualize, and it can be extended immediately to more dimensions.
Let us try to solve it as a nonlinear optimization problem with constraints.
set.seed(1009)
N <- 1000
x <- rnorm(N); y <- rnorm(N)
The 0.05 and 0.95 quantiles of these coordinates will also be needed.
q1 <- quantile(x, 0.05); q2 <- quantile(x, 0.95)
q3 <- quantile(y, 0.05); q4 <- quantile(y, 0.95)
We define two functions: fmin, the function to be minimized, and fbnd, the function that defines the constraints. We require fbnd(p) >= 0, which expresses that at least 95% of the points lie inside the rectangle.
fmin <- function(p) (p[2]-p[1]) * (p[4]-p[3])
fbnd <- function(p) {
c(0.05 - sum(x < p[1] | x > p[2] | y < p[3] | y > p[4]) / N,
q1 - p[1], p[2] - q2,
q3 - p[3], p[4] - q4 )
}
As starting point we can take the ranges of x- and y-coordinates.
start <- c(range(x), range(y))
The optimization solver has to minimize a function with nonlinear constraints. The auglag routine in package nloptr is a candidate solver.
S <- nloptr::auglag(start, fn=fmin, hin=fbnd)
S$par; S$value
# [1] -2.301263 2.308038 -2.079166 2.130744
# [1] 19.40474
We can improve the solution by moving each rectangle boundary inward to the nearest x- or y-coordinate still inside it. This step is necessary because the objective function is locally constant.
r <- S$par
r[1] <- min(x[x >= r[1]]); r[2] <- max(x[x <= r[2]])
r[3] <- min(y[y >= r[3]]); r[4] <- max(y[y <= r[4]])
r
# [1] -2.299467 2.281395 -2.079166 2.127260
We can see that 50 points lie outside the rectangle and that the area is 19.26905.
(r[2]-r[1]) * (r[4]-r[3]) # 19.26905
sum(x < r[1] | x > r[2] | y < r[3] | y > r[4]) # 50
The solution could still be a local minimum. Fortunately, the objective function is also locally monotone, so this will normally not happen. Of course, one could verify the solution by applying a global solver.
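For instance, such a check could be sketched with the ISRES global algorithm from the same nloptr package (untested here; maxeval is an arbitrary choice, and eval_g_ineq expects constraints of the form g(p) <= 0, hence the sign flip):
library(nloptr)
G <- nloptr(x0 = start, eval_f = fmin,
            lb = c(min(x), min(x), min(y), min(y)),
            ub = c(max(x), max(x), max(y), max(y)),
            eval_g_ineq = function(p) -fbnd(p),
            opts = list(algorithm = "NLOPT_GN_ISRES", maxeval = 1e5))
G$solution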
I'm playing around with the apply() family of functions in R, and was trying to write a log-likelihood function using apply().
Here's the log-likelihood for a linear regression model assuming Gaussian disturbances:
# Likelihood function for the standard linear regression model
logL <- function(theta, data){
# Return minus the log likelihood function for the standard linear regression model
# y: endogenous variable
# x: matrix of regressors
y <- data[, 1]
x <- data[, -1]
N <- nrow(data)
# This is the contribution to the log-likelihood of individual i. Initialized at 0.
contrib <- 0
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
for (i in 1:N){
contrib <- contrib + (y[i] - beta%*%x[i,])**2
}
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*contrib) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
And below we simulate some data and minimize the negative log-likelihood (which is equivalent to maximising the log-likelihood).
# Simulate some data
N <- 1000
x <- cbind(1, rnorm(N,0,sd=1), rnorm(N, 0, sd=2))
true_theta <- c(2, 3, 2, 4)
y <- true_theta[1:3]%*%t(x) + rnorm(N, mean = 0, sd = true_theta[4])
my_data <- cbind(t(y),x)
optim(c(1,1,1, 1), fn = logL, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01))
So far so good: we recover the parameters used to simulate the data. Using the rbenchmark package, I find that 10 replications of the optimization step take around 4 seconds on my computer.
library(rbenchmark)
benchmark(optim(c(1,1,1, 1), fn = logL, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01)),
replications=10)
Now I tried replacing the for-loop with the apply function. For this, I defined contrib to be a function:
contrib <- function(beta, one_obs){
y <- one_obs[1]
x <- one_obs[-1]
return((y - beta%*%x)**2)
}
And the new log-likelihood function:
logL2 <- function(theta, data){
# Return minus the log likelihood function for the standard linear regression model
# y: endogenous variable
# x: matrix of regressors
N <- nrow(data)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*sum(apply(data, FUN=contrib, beta = beta, 1))) -
1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
This is almost twice as slow. Now, I may have misunderstood the role of the apply family of functions, as they should be used for code clarity rather than for performance, but they shouldn't be slower than a for loop, right? So what is happening in my code? Is some type conversion going on? I checked, and logL returns a matrix while logL2 returns a numeric. I tried using vapply(), since it lets me specify the type of the returned object, but vapply() seems to convert my data matrix into a vector by stacking every column on top of each other. This causes the contrib function to stop working:
logL2 <- function(theta, data){
# Return minus the log likelihood function for the standard linear regression model
# y: endogenous variable
# x: matrix of regressors
N <- nrow(data)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*sum(vapply(data, FUN=contrib, beta = beta, FUN.VALUE = matrix(1)))) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
This is what I get then:
class(logL2(theta = c(1,2,2,2), my_data))
Error in beta %*% x : non-conformable arguments
So how could I use the apply family of functions to make my code more readable, and at least as fast as with a for loop?
You can simplify your code by thinking about the maths involved in your for loop.
Your for loop is
contrib <- contrib + (y[i] - beta%*%x[i,])**2
Now this is the same as just calculating all the (y[i] - beta %*% x[i, ])^2 and summing them all. Thinking about beta %*% x[i, ] you are doing matrix multiplication of a 1x3 matrix (beta) with a 3x1 (x[i, ]), giving a 1x1 result. So what you are doing is matrix-multiplying beta by each row of x independently.
However, with matrix multiplication you can do them all simultaneously anyway, and get a 1xN matrix out!
i.e. beta (1x3) %*% x (3xN) would give you a 1xN matrix, and then subtract this from y which is also a vector of length N, square each difference independently and sum them. This is equivalent to your for loop.
The only catch is that your x is Nx3 not 3xN, so we t() it first:
contrib <- sum((y - beta %*% t(x))^2)
This does away with your for loop entirely.
logL2 <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
N <- nrow(data)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
contrib <- sum((y - beta %*% t(x))^2)
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*contrib) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
library(rbenchmark)
benchmark(
orig={orig.answer <- optim(c(1,1,1, 1), fn = logL, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01))},
new={new.answer <- optim(c(1,1,1, 1), fn = logL2, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01))},
replications=10
)
which yields
test replications elapsed relative user.self sys.self user.child sys.child
2 new 10 0.306 1.00 0.332 0.048 0 0
1 orig 10 4.584 14.98 4.588 0.000 0 0
and also let's just check we didn't make a mistake
all.equal(orig.answer, new.answer)
# [1] TRUE
As a style point, why not make y a third argument to logL2 (rather than cbinding it to data at the start and then having to select the appropriate rows/columns all the time)? This saves you from doing y <- data[, 1] and x <- data[, -1] on every call, i.e. do something like logL <- function (theta, x, y) { ... } and then in your optim() call provide the x and y arguments rather than my_data. You might even get a further improvement by doing t(x) once at the very start (e.g. in your call to optim) so it doesn't have to be done every time logL2 is called:
logL3 <- function(theta, x, y){
N <- length(y)
beta <- head(theta, -1) # Every element but the last one
sigma <- tail(theta, 1) # Only the last element
contrib <- sum((y - beta %*% x)^2)
sigma <- abs(sigma)
L <- -(1/(2*sigma^2)*contrib) - 1/2 * N * log(2*pi) - N * log(sigma)
return(-L)
}
benchmark(
new=optim(c(1,1,1, 1), fn = logL2, data = my_data,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01)),
new.new=optim(c(1,1,1, 1), fn = logL3, x=t(x), y=y,
method = "L-BFGS-B",upper = c(Inf, Inf, Inf), lower=c(-Inf, -Inf, 0.01)),
replications=100
)
test replications elapsed relative user.self sys.self user.child sys.child
1 new 100 3.149 2.006 3.317 0.700 0 0
2 new.new 100 1.570 1.000 1.488 0.344 0 0
It's about twice as fast. In general, if you can do something once rather than every time logL2 is called (e.g. t(x), data[, 1] etc) it'll save you some small amount of time.
With respect to your original question, however (specifically regarding the *apply functions):
vapply works element-wise on a vector or list; your data is a matrix, which gets treated as one long vector, so contrib operates on one element of data at a time. That is, contrib sees x as a single number. Hence the non-conformable matrices: your matrix multiplication is multiplying beta (a 1x3) with x (a 1x1), and for matrix multiplication to work you need the number of columns of beta to equal the number of rows of x. To use vapply you'd need something like
vapply(1:nrow(data), function(i) contrib(beta, data[i, ]), FUN.VALUE=1)
(A caveat: I have not benchmarked any of these statements; this is just my experience.) Of all the *apply functions, I find that apply() is slow (often slower than the for loop). It is handy for neatness of code ("do this for every row" or "do this for every column" type tasks: instead of lots of data[i, ] it's just apply(..., MARGIN=1)), but if you need speed, write a for loop or use one of the other cousins like vapply, lapply or sapply.
vapply and lapply are fast. sapply is too, but in general one of the former two is faster: sapply is easier to use because it works out vapply's FUN.VALUE for you, and if the results don't all have the same shape it is equivalent to lapply, so you may as well use that. Because sapply does this extra work for you it can be easier to use, but it is minutely slower (see the small illustration after these points).
Fastest of all is if you can use some maths to avoid a loop altogether, e.g. if you can rephrase your loop in terms of a matrix multiplication, as I did here. This only applies in a small number of situations, though.
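To make the FUN.VALUE point concrete, here is a trivial illustration (not part of the benchmarks above):
# sapply guesses the shape of each result; vapply is told it explicitly
sapply(1:5, function(i) i^2)                          # [1]  1  4  9 16 25
vapply(1:5, function(i) i^2, FUN.VALUE = numeric(1))  # same values, type-checked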
I tried to minimize the following function:
func <- function(qq){
x <- qq[1]
y <- qq[2]
output <- 1 - 2 * x + x^2 - 2 * y + 2 * x * y + y^2
return(output)
}
subject to x + y = 1 and 0 <= x, y <= 1. To use gosolnp from the Rsolnp package, I first defined cons to pass to the eqfun argument:
cons <- function(qq)
sum(qq)
Then I applied the gosolnp function:
install.packages("Rsolnp")
require(Rsolnp)
res <- gosolnp(fun = func, LB = c(0, 0), UB = c(1, 1), eqfun = cons, eqB = 1)
res$pars
[1] 0.8028775 0.1971225
res$values
[1] 2.606528e-09 -5.551115e-17
The answer should be x = 0 and y = 1, but, as you can check, every run of gosolnp returns new points at which func is only approximately (not exactly) 0.
Mathematica and Maple optimize this function very quickly and give the answer x = 0 and y = 1, whereas every run in R gives a different solution, which does not look correct.
I also tried other optimization functions, such as spg() and DEoptim(), but the problem remained unsolved.
So my questions are:
1. Is there a way to minimize func in R?
2. Is there a difference in precision between R and Mathematica, and why can Mathematica give me the exact answer while R cannot?
Thank you in advance.
If you have two variables x and y, with y = 1 - x, then you really have a problem in just one variable x. Noting that, you can reparametrise your function to be
1 - 2 * x + x^2 - 2 * (1 - x) + 2 * x * (1 - x) + (1 - x)^2
and going through the algebra shows that this is constant as a function of x: the constant terms give 1 - 2 + 1 = 0, the coefficient of x is -2 + 2 + 2 - 2 = 0, and the coefficient of x^2 is 1 - 2 + 1 = 0, so the expression is identically 0. Thus any value of x in (0, 1) is a solution, and which one your algorithm converges to will basically be random, based on numerical roundoff and your choice of starting point.
The fact that gosolnp's returned value is zero to within the limits of numerical precision should have been a tip-off; simply plotting the function along the constraint would have shown it too.
I can't speak to these particular packages, but nloptr(...) in package nloptr seems to work well:
# Non-Linear Optimization (package::nloptr)
F <- function(v){
x=v[1]
y=v[2]
output <- 1 - 2 * x + x^2 - 2 * y + 2 * x * y + y^2
}
Hc <- function(v) return(1-sum(v))
library(nloptr)
opt <- nloptr(x0=c(1/2,1/2), eval_f=F, lb = c(0,0), ub = c(1,1),
eval_g_eq = Hc,
opts = list(algorithm="NLOPT_GN_ISRES",maxeval=1e6))
opt$solution
# [1] 0.0005506997 0.9994492982
Your function is identically equal to 0 on the constraint x + y = 1, so there is no point in trying to minimize it.
library(Ryacas)
x <- Sym("x")
y <- 1-x
Simplify( 1 - 2 * x + x^2 - 2 * y + 2 * x * y + y^2)
which gives:
expression(0)