Fitting probit model inr R - r

For my thesis I have to fit some glm models with MLEs that R doesn't have, I was going ok for the models with close form but now I have to use de Gausian CDF, so i decide to fit a simple probit model.
this is the code:
Data:
set.seed(123)
x <-matrix( rnorm(50,2,4),50,1)
m <- matrix(runif(50,2,4),50,1)
t <- matrix(rpois(50,0.5),50,1)
z <- (1+exp(-((x-mean(x)/sd(x)))))^-1 + runif(50)
y <- ifelse(z < 1.186228, 0, 1)
data1 <- as.data.frame(cbind(y,x,m,t))
myprobit <- function (formula, data)
{
mf <- model.frame(formula, data)
y <- model.response(mf, "numeric")
X <- model.matrix(formula, data = data)
if (any(is.na(cbind(y, X))))
stop("Some data are missing.")
loglik <- function(betas, X, y, sigma) { #loglikelihood
p <- length(betas)
beta <- betas[-p]
eta <- X %*% beta
sigma <- 1 #because of identification, sigma must be equal to 1
G <- pnorm(y, mean = eta,sd=sigma)
sum( y*log(G) + (1-y)*log(1-G))
}
ls.reg <- lm(y ~ X - 1)#starting values using ols, indicating that this model already has a constant
start <- coef(ls.reg)
fit <- optim(start, loglik, X = X, y = y, control = list(fnscale = -1), method = "BFGS", hessian = TRUE) #optimizar
if (fit$convergence > 0) {
print(fit)
stop("optim failed to converge!") #verify convergence
}
return(fit)
}
myprobit(y ~ x + m + t,data = data1)
And i get: Error in X %*% beta : non-conformable arguments, if i change start <- coef(ls.reg) with start <- c(coef(ls.reg), 1) i get wrong stimatives comparing with:
probit <- glm(y ~ x + m + t,data = data1 , family = binomial(link = "probit"))
What am I doing wrong?
Is possible to correctly fit this model using pnorm, if no, what algorithm should I use to approximate de gausian CDF. Thanks!!

The line of code responsible for your error is the following:
eta <- X %*% beta
Note that "%*%" is the matrix multiplication operator. By reproducing your code I noticed that X is a matrix with 50 rows and 4 columns. Hence, for matrix multiplication to be possible your "beta" needs to have 4 rows. But when you run "betas[-p]" you subset the betas vector by removing its last element, leaving only three elements instead of the four you need for matrix multiplication to be defined. If you remove [-p] the code will work.

Related

Unexpected result from cross validation

I would like to perform 10-fold cross validation manually using prostate data to learn how to do it manually. I utilise the elasticnet package for code. I estimated the parameters by glmnet package (of course, it can perform cross validation too, but I would like to do that manually). After the analysis, It seems to me that I need a different criterion to choose tuning parameter other than minimum of cv.error because this gives the almost null model, if not so "where is my mistake?". (According to the original paper of Tibshirani, optimum model has three variables)
Here is the code
library(ElemStatLearn)
library(glmnet)
x <- scale(prostate[,1:8],T,T)
y <- scale(prostate[,9],T,F)
lambda = seq(0,1,0.02)
cv.folds <- function(n, folds = 10){
split(sample(1:n), rep(1:folds, length = n))
}
c.val <- function(x, y, K = 10, lambda, plot.it = TRUE){
n <- nrow(x)
all.folds <- cv.folds(length(y), K)
residmat <- matrix(0, length(lambda), K)
for(i in seq(K)) {
omit <- all.folds[[i]]
xk <- as.matrix(x[-omit, ])
yk <- as.vector(y[-omit])
xg <- x[omit, ]
yg <- y[omit]
fit <- glmnet(xk, yk, family="gaussian",
alpha=1, lambda=lambda,standardize = FALSE, intercept = FALSE)
fit <- predict(fit,newx=xg,lambda=lambda)
if(length(omit)==1){fit<-matrix(fit,nrow=1)}
residmat[, i] <- apply((yg - fit)^2, 2, mean)
}
cv <- apply(residmat, 1, mean)
cv.error <- sqrt(apply(residmat, 1, var)/K)
object<-list(lambda = lambda, cv = cv, cv.error = cv.error)
if(plot.it) {
plot(lambda, cv, type = "b", xlab="lambda", ylim = range(cv, cv + cv.error, cv - cv.error))
invisible(object)
}
}
result <- c.val(x,y,K = 10,lambda = lambda)
lambda.opt <- lambda[which.min(result$cv.error)]
fit <- glmnet(x, y, family="gaussian",
alpha=1, lambda=lambda.opt,standardize = FALSE, intercept = FALSE)
coef(fit)
Result:
> coef(fit)
9 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
lcavol 0.01926724
lweight .
age .
lbph .
svi .
lcp .
Edit:
Model generated directly from glmnet.
fit.lasso <- glmnet(x, y, family="gaussian", alpha=1,
standardize = FALSE, intercept = FALSE)
fit.lasso.cv <- cv.glmnet(x, y, type.measure="mse", alpha=1,
family="gaussian",standardize = FALSE, intercept = FALSE)
coef.lambda.min <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.min)
coef.lambda.1se <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.1se)
cbind(coef.lambda.min,coef.lambda.1se)
Result:
9 x 2 sparse Matrix of class "dgCMatrix"
1 1
(Intercept) . .
lcavol 0.59892674 0.5286355
lweight 0.23669159 0.1201279
age -0.06979581 .
lbph 0.09392021 .
svi 0.24620007 0.1400748
lcp . .
gleason 0.00346421 .
pgg45 0.06631013 .
The second column shows the correct (lambda.1se) result.
Your "mistake" is very hard to spot: it comes from the fact that glmnet will not use the order of your own lambda vector to sort the vector of results.
Example with the data you used:
res <- glmnet(x, y, lambda=lambda)
res$lambda
So when you call the command lambda[which.min(result$cv.error)] at the end of your procedure, you will not get the value corresponding to the minimum of the cross-validated error. Also, it explains why your graph looks strange.
An easy fix would be to declare lambda at the beginning of the script as a decreasing vector:
lambda = seq(1, 0, 0.02)
Final remark: be careful when using a single lambda.

Estimate a probit regression model with optim()

I need to manually program a probit regression model without using glm. I would use optim for direct minimization of negative log-likelihood.
I wrote code below but it does not work, giving error:
cannot coerce type 'closure' to vector of type 'double'
# load data: data provided via the bottom link
Datospregunta2a <- read.dta("problema2_1.dta")
attach(Datospregunta2a)
# model matrix `X` and response `Y`
X <- cbind(1, associate_professor, full_professor, emeritus_professor, other_rank)
Y <- volunteer
# number of regression coefficients
K <- ncol(X)
# initial guess on coefficients
vi <- lm(volunteer ~ associate_professor, full_professor, emeritus_professor, other_rank)$coefficients
# negative log-likelihood
probit.nll <- function (beta) {
exb <- exp(X%*%beta)
prob<- rnorm(exb)
logexb <- log(prob)
y0 <- (1-y)
logexb0 <- log(1-prob)
yt <- t(y)
y0t <- t(y0)
-sum(yt%*%logexb + y0t%*%logexb0)
}
# gradient
probit.gr <- function (beta) {
grad <- numeric(K)
exb <- exp(X%*%beta)
prob <- rnorm(exb)
for (k in 1:K) grad[k] <- sum(X[,k]*(y - prob))
return(-grad)
}
# direct minimization
fit <- optim(vi, probit.nll, gr = probit.gr, method = "BFGS", hessian = TRUE)
data: https://drive.google.com/file/d/0B06Id6VJyeb5OTFjbHVHUE42THc/view?usp=sharing
case sensitive
Y and y are different. So you should use Y not y in your defined functions probit.nll and probit.gr.
These two functions also do not look correct to me. The most evident problem is the existence of rnorm. The following are correct ones.
negative log-likelihood function
# requires model matrix `X` and binary response `Y`
probit.nll <- function (beta) {
# linear predictor
eta <- X %*% beta
# probability
p <- pnorm(eta)
# negative log-likelihood
-sum((1 - Y) * log(1 - p) + Y * log(p))
}
gradient function
# requires model matrix `X` and binary response `Y`
probit.gr <- function (beta) {
# linear predictor
eta <- X %*% beta
# probability
p <- pnorm(eta)
# chain rule
u <- dnorm(eta) * (Y - p) / (p * (1 - p))
# gradient
-crossprod(X, u)
}
initial parameter values from lm()
This does not sound like a reasonable idea. In no cases should we apply linear regression to binary data.
However, purely focusing on the use of lm, you need + not , to separate covariates in the right hand side of the formula.
reproducible example
Let's generate a toy dataset
set.seed(0)
# model matrix
X <- cbind(1, matrix(runif(300, -2, 1), 100))
# coefficients
b <- runif(4)
# response
Y <- rbinom(100, 1, pnorm(X %*% b))
# `glm` estimate
GLM <- glm(Y ~ X - 1, family = binomial(link = "probit"))
# our own estimation via `optim`
# I am using `b` as initial parameter values (being lazy)
fit <- optim(b, probit.nll, gr = probit.gr, method = "BFGS", hessian = TRUE)
# comparison
unname(coef(GLM))
# 0.62183195 0.38971121 0.06321124 0.44199523
fit$par
# 0.62183540 0.38971287 0.06321318 0.44199659
They are very close to each other!

Lasso regression makes a mistake by a constant

I'm trying to apply a lasso regression for my data. I'm using lars package for R. Using coef function, I get coefficients of lasso model and using them, I plot this model. But this model is always wrong by a constant (blue color).
FX <- cbind(1, X, X^2, X^3, X^4, X^5,X^6, X^7)
lasso <- lars(FX, Y, type='lasso')
alpha <- coef(lasso, s=1, mode='lambda')
tr_x <- (1:100)/100
y0 <- getFuncByParam(tr_x, alpha)
lines(x,y0, col='blue', lwd=2)
But when I use predict function, I'm getting correct model (pink color)
Ftest <- cbind(1, tr_x , tr_x^2, tr_x^3, tr_x^4, tr_x^5, tr_x^6, tr_x^7)
y0 <- predict(lasso, Ftest, 1, type='fit', mode='lambda')
UPD
getFuncByParam <- function(x, a){
n <- length(a)
res <- 0
for(i in 1 : n ){
res <- res + a[i]*x^(i - 1)
}
return(res)
}

Maximum likelihood estimation of the log-normal distribution using R

I'm trying to estimate a linear model with a log-normal distributed error term. I already have working code for a linear model with normally distributed errors:
library(Ecdat)
library(assertthat)
library(maxLik)
# Load the data
data(Wages1)
# Check what R says
summary(lm(wage ~ school + exper + sex, data = Wages1))
# Use maxLik from package maxLik
# The likelihood function
my_log_lik_pos <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
beta <- head(theta, -1)
sigma <- tail(theta, 1)
xb <- x%*%beta
are_equal(dim(xb), c(nrow(my_data), 1))
return(sum(log(dnorm(y, mean = xb, sd = sigma))))
}
# Bind the data
my_data <- cbind(Wages1$wage, 1, Wages1$school, Wages1$exper, Wages1$sex)
my_problem <- maxLik(my_log_lik_pos, data = my_data,
start = rep(1,5), method = "BFGS")
summary(my_problem)
I get approximately the same results. Now I try to do the same, but using the log-normal likelihood. For this, I have to first simulate some data:
true_beta <- c(0.1, 0.2, 0.3, 0.4, 0.5)
ys <- my_data[, -1] %*% head(true_beta, -1) +
rlnorm(nrow(my_data), 0, tail(true_beta, 1))
my_data_2 <- cbind(ys, my_data[, -1])
And the log-likelihood function:
my_log_lik_lognorm <- function(theta, data){
y <- data[, 1]
x <- data[, -1]
beta <- head(theta, -1)
sigma <- tail(theta, 1)
xb <- x%*%beta
are_equal(dim(xb), c(nrow(data), 1))
return(sum(log(dlnorm(y, mean = xb, sd = sigma))))
}
my_problem2 <- maxLik(my_log_lik_lognorm, data = my_data_2,
start = rep(0.2,5), method = "BFGS")
summary(my_problem2)
The estimated parameters should be around the values of true_beta, but for some reason I find completely different values. I tried with different methods, different starting values but to no avail. I'm sure that I'm missing something obvious, but I don't see what.
Am I right to assume that the log-likelihood of the log-normal distribution is:
sum(log(dlnorm(y, mean = .., sd = ...))
Unless I'm mistaken, this is the definition of the log-likelihood (sum of the logs of the densities).
I found the issue: it seems the problem is not my log-likelihood function. When I try to estimate the model with glm:
summary(glm(ys ~ school + exper + sex, family=gaussian(link="log"), data=Wages1))
I get the same result as with maxLik and my log-likelihood. It would seem the problem comes from when I tried to simulate some data:
ys <- my_data[, -1] %*% head(true_beta, -1) +
rlnorm(nrow(my_data), 0, tail(true_beta, 1))
The correct way to simulate the data:
ys <- rlnorm(nrow(my_data), my_data[, -1] %*% head(true_beta, -1), tail(true_beta, 1))
Now everything works!

R - Fitting a constrained AutoRegression time series

I have a time-series which I need to fit onto an AR (auto-regression) model.
The AR model has the form:
x(t) = a0 + a1*x(t-1) + a2*x(t-2) + ... + aq*x(t-q) + noise.
I have two contraints:
Find the best AR fit when lag.max = 50.
Sum of all coefficients a0 + a1 + ... + aq = 1
I wrote the below code:
require(FitAR)
data(lynx) # my real data comes from the stock market.
z <- -log(lynx)
#find best model
step <- SelectModel(z, ARModel = "AR" ,lag.max = 50, Criterion = "AIC",Best=10)
summary(step) # display results
# fit the model and get coefficients
arfit <- ar(z,p=1, order.max=ceil(mean(step[,1])), aic=FALSE)
#check if sum of coefficients are 1
sum(arfit$ar)
[1] 0.5784978
My question is, how to add the constraint: sum of all coefficients = 1?
I looked at this question, but I do not realize how to use it.
**UPDATE**
I think I manage to solve my question as follow.
library(quadprog)
coeff <- arfit$ar
y <- 0
for (i in 1:length(coeff)) {
y <- y + coeff[i]*c(z[(i+1):length(z)],rep(0,i))
ifelse (i==1, X <- c(z[2:length(z)],0), X <- cbind(X,c(z[(i+1):length(z)],rep(0,i))))
}
Dmat <- t(X) %*% X
s <- solve.QP(Dmat , t(y) %*% X, matrix(1, nr=15, nc=1), 1, meq=1 )
s$solution
# The coefficients should sum up to 1
sum(s$solution)

Resources