I want to do a Negative Binomial regression manually and define a function that can estimate an arbitrary number of coefficients. How can I get a matrix of betas and p-values without having to write out every beta explicitly?
For Poisson regression I was successful; see the code below:
daten <- warpbreaks

## negative log-likelihood of the Poisson model
LogLike <- function(y, x, par) {
  beta <- par
  lambda <- exp(beta %*% t(x))  # Poisson mean for each observation
  LL <- -sum(dpois(y, lambda, log = TRUE))
  return(LL)
}
PoisMod <- function(formula, data) {
  model <- model.frame(formula, data = data)
  x <- model.matrix(formula, data = data)  # design matrix
  y <- model.response(model)               # response vector
  par <- rep(0, ncol(x))                   # starting values for the betas
  erg <- list(optim(par = par, fn = LogLike, x = x, y = y)$par)
  return(erg)
}
PoisMod(breaks ~ wool + tension, as.data.frame(daten))
glm(breaks ~ wool + tension, family = "poisson", data = as.data.frame(daten))
In the case of the Negative Binomial, how do I write this part?
daten <- warpbreaks
LogLike <- function(y, x, par) {
  beta <- par
  lambda <- exp(beta %*% t(x))
  LL <- -sum(dpois(y, lambda, log = TRUE))
  return(LL)
}
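The Poisson code above extends directly. A minimal sketch, assuming the NB2 parameterisation that dnbinom() exposes through its mu and size arguments; the dispersion is optimised on the log scale so it stays positive, and the names NegBinLike and NegBinMod are illustrative, not from the original:

## negative log-likelihood of the Negative Binomial model:
## the last element of par is log(theta), the rest are the betas
NegBinLike <- function(y, x, par) {
  beta <- par[-length(par)]
  theta <- exp(par[length(par)])  # dispersion, kept positive via exp()
  mu <- exp(beta %*% t(x))        # mean, exactly as in the Poisson version
  -sum(dnbinom(y, mu = mu, size = theta, log = TRUE))
}
NegBinMod <- function(formula, data) {
  model <- model.frame(formula, data = data)
  x <- model.matrix(formula, data = data)
  y <- model.response(model)
  par <- c(rep(0, ncol(x)), 0)    # betas start at 0, log(theta) at 0
  optim(par = par, fn = NegBinLike, x = x, y = y, hessian = TRUE)
}
fit <- NegBinMod(breaks ~ wool + tension, as.data.frame(daten))
fit$par  # compare with MASS::glm.nb(breaks ~ wool + tension, data = daten)

For the p-values, invert the returned Hessian: since NegBinLike is a negative log-likelihood, solve(fit$hessian) estimates the covariance matrix of the parameters, sqrt(diag(...)) of that gives standard errors, and Wald z-statistics follow as usual.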
Using the following dataset "data": load(url("https://www.math.ntnu.no/emner/TMA4315/2020h/hoge-veluwe.Rdata"))
I have fit a Poisson GLM: mod_quad = glm(y ~ t + I(t^2), family = poisson, data = data). I now want to estimate the covariance matrix of the β coefficients from this fit using parametric bootstrapping with 1000 simulations. My code so far is:
ysim = simulate(mod_quad, 1)
betahat = matrix(0, nrow = 1000, ncol = 3)
for (i in 1:1000) {
  sim_data = cbind(ysim, data$t)
  betahat[i, ] = glm(ysim ~ data$t + I(data$t^2),
                     family = poisson,
                     data = sim_data)$coefficients
  ysim = simulate(glm(ysim ~ data$t + I(data$t^2), family = poisson,
                      data = sim_data), 1)
}
var(betahat[1000, ])
The betahat matrix just comes out as a zero matrix every time, so I'm not sure what is missing in my approach.
This is really more of a coding than a statistical question. Streamlining, I would do this:
## set vals to NA (not 0) to make detection of problems easier
betahat <- matrix(NA, nrow = 1000, ncol = 3)
for (i in 1:1000) {
  ## replace response with parametric simulation
  sim_data <- transform(data, y = simulate(mod_quad, 1)[[1]])
  ## refit model with new data
  newfit <- update(mod_quad, data = sim_data)
  ## store new coefficients
  betahat[i, ] <- coef(newfit)
}
## compute variance
var(betahat)
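As a quick sanity check (my addition, assuming mod_quad is the original fit from the question), the bootstrap covariance should be close to the model-based asymptotic one:

var(betahat)    # parametric-bootstrap covariance of the coefficients
vcov(mod_quad)  # asymptotic covariance from the GLM fit, for comparison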
I want to estimate a model in R.
One part of it is a finite mixture model consisting of two OLS regressions.
As a newcomer to R, I don't know how to write this probability density function in R.
I wonder if you can give some help.
The probability density function is the following:
f(y|x) = (p/σ1)*φ((y - x*b1)/σ1) + ((1-p)/σ2)*φ((y - x*b2)/σ2)
I have written an example in Stata:
gen double `f1' = normalden($ML_y1, `xb1', exp(`lns1'))
gen double `f2' = normalden($ML_y1, `xb2', exp(`lns2'))
tempvar p
gen double `p' = exp(`lp')/(1 + exp(`lp'))
replace `lnf' = ln(`p'*`f1' + (1-`p')*`f2')
I wonder if you can show me how to write this function in R.
Thanks a lot; I am looking forward to your help.
See the function FLXMRglm. The density is estimated with dnorm:
library(flexmix)
FLXMRglm
# your case
if (family == "gaussian") {
  z@defineComponent <- function(para) {
    predict <- function(x, ...) {
      dotarg <- list(...)
      if ("offset" %in% names(dotarg))
        offset <- dotarg$offset
      p <- x %*% para$coef
      if (!is.null(offset))
        p <- p + offset
      p
    }
    logLik <- function(x, y, ...)
      dnorm(y, mean = predict(x, ...), sd = para$sigma, log = TRUE)
    new("FLXcomponent",
        parameters = list(coef = para$coef, sigma = para$sigma),
        logLik = logLik, predict = predict, df = para$df)
  }
  z@fit <- function(x, y, w, component) {
    fit <- lm.wfit(x, y, w = w, offset = offset)
    z@defineComponent(para = list(
      coef = coef(fit), df = ncol(x) + 1,
      sigma = sqrt(sum(fit$weights * fit$residuals^2 / mean(fit$weights)) /
                     (nrow(x) - fit$rank))))
  }
}
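If you just want the density from the question written directly in R, a minimal sketch follows; b1, b2, sigma1, sigma2 and p are the mixture parameters, x is the design matrix, and the function names are illustrative. Note that dnorm(y, mean, sd) already equals (1/sd)*φ((y - mean)/sd), so no explicit division by sigma is needed:

## f(y|x) = (p/s1)*phi((y - x*b1)/s1) + ((1-p)*/s2)*phi((y - x*b2)/s2)
mixture_density <- function(y, x, b1, b2, sigma1, sigma2, p) {
  p * dnorm(y, mean = x %*% b1, sd = sigma1) +
    (1 - p) * dnorm(y, mean = x %*% b2, sd = sigma2)
}

## log-likelihood, e.g. for use with optim()
mixture_loglik <- function(y, x, b1, b2, sigma1, sigma2, p) {
  sum(log(mixture_density(y, x, b1, b2, sigma1, sigma2, p)))
}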
For my thesis I have to fit some GLMs with MLEs that R doesn't provide. I was doing fine for the models with a closed form, but now I have to use the Gaussian CDF, so I decided to fit a simple probit model.
This is the code:
Data:
set.seed(123)
x <- matrix(rnorm(50, 2, 4), 50, 1)
m <- matrix(runif(50, 2, 4), 50, 1)
t <- matrix(rpois(50, 0.5), 50, 1)
z <- (1 + exp(-((x - mean(x)/sd(x)))))^-1 + runif(50)
y <- ifelse(z < 1.186228, 0, 1)
data1 <- as.data.frame(cbind(y, x, m, t))
myprobit <- function(formula, data) {
  mf <- model.frame(formula, data)
  y <- model.response(mf, "numeric")
  X <- model.matrix(formula, data = data)
  if (any(is.na(cbind(y, X))))
    stop("Some data are missing.")
  loglik <- function(betas, X, y, sigma) {  # log-likelihood
    p <- length(betas)
    beta <- betas[-p]
    eta <- X %*% beta
    sigma <- 1  # because of identification, sigma must equal 1
    G <- pnorm(y, mean = eta, sd = sigma)
    sum(y*log(G) + (1 - y)*log(1 - G))
  }
  ls.reg <- lm(y ~ X - 1)  # starting values from OLS; X already contains a constant
  start <- coef(ls.reg)
  fit <- optim(start, loglik, X = X, y = y, control = list(fnscale = -1),
               method = "BFGS", hessian = TRUE)  # maximise
  if (fit$convergence > 0) {
    print(fit)
    stop("optim failed to converge!")  # verify convergence
  }
  return(fit)
}
myprobit(y ~ x + m + t, data = data1)
And I get: Error in X %*% beta : non-conformable arguments. If I change start <- coef(ls.reg) to start <- c(coef(ls.reg), 1), I get wrong estimates compared with:
probit <- glm(y ~ x + m + t, data = data1, family = binomial(link = "probit"))
What am I doing wrong?
Is it possible to fit this model correctly using pnorm? If not, what algorithm should I use to approximate the Gaussian CDF? Thanks!
The line of code responsible for your error is the following:
eta <- X %*% beta
Note that %*% is the matrix multiplication operator. Reproducing your code, X is a matrix with 50 rows and 4 columns, so for the multiplication to be defined beta needs 4 rows. But betas[-p] subsets the betas vector by removing its last element, leaving only three elements instead of the four you need. If you remove [-p], the code will run.
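Beyond the dimension error, the "wrong estimates" come from how the CDF is used: a probit model evaluates the Gaussian CDF at the linear predictor, so P(y = 1) = Φ(Xβ), not pnorm(y, mean = eta). A minimal sketch of a corrected log-likelihood (my rewrite, not from the thread; drop the sigma argument and the betas[-p] subsetting, and keep start <- coef(ls.reg)):

loglik <- function(betas, X, y) {
  eta <- X %*% betas  # linear predictor, one value per observation
  G <- pnorm(eta)     # P(y = 1) = Phi(X beta); sigma is fixed at 1
  sum(y * log(G) + (1 - y) * log(1 - G))
}

With this version, the existing optim call should give estimates close to the glm probit fit above.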
I am fitting a multivariate model in JAGS using the Dirichlet distribution. I have a matrix y of proportional abundances of 3 species.
# generate 3 columns of species proportional abundance data
y <- matrix(ncol = 3, nrow = 100)
y[,] <- abs(rnorm(length(y)))
for (i in 1:nrow(y)) {
  y[i,] <- y[i,] / sum(y[i,])
}
I have a matrix x of predictor values, the first column of which is an intercept.
#generate 2 columns of predictors and an intercept
x <- matrix(ncol = 2, nrow = 100)
x[,] <- rnorm(length(x), mean = 20, sd = 4)
x <- cbind(rep(1,nrow(x)),x)
I specify a multivariate JAGS model, jags.model:
jags.model = "
model {
  # setup parameter priors for each species * predictor combination
  for (j in 1:N.spp) {
    for (k in 1:N.preds) {
      m[k,j] ~ dgamma(1.0E-3, 1.0E-3)
    }
  }
  # fit means of species abundances as a linear combination of predictors and parameters
  for (i in 1:N) {
    for (j in 1:N.spp) {
      log(a0[i,j]) <- m[,j] * x[i,]
    }
    y[i,1:N.spp] ~ ddirch(a0[i,1:N.spp])
  }
} # close model
"
I set up the JAGS data object, jags.data:
jags.data <- list(y = as.matrix(y), x = as.matrix(x),
N.spp = ncol(y), N.preds = ncol(x), N = nrow(y))
I fit the JAGS model using the runjags package in R.
jags.out <- runjags::run.jags(jags.model,
data=jags.data,
adapt = 100,
burnin = 200,
sample = 400,
n.chains=3,
monitor=c('m'))
I get the following error:
Error: The following error occured when compiling and adapting the model using rjags:
Error in rjags::jags.model(model, data = dataenv, n.chains = length(runjags.object$end.state), :
RUNTIME ERROR:
Invalid vector argument to exp
What am I doing wrong here? For reference, spelling out each parameter by predictor combination still fits fine:
jags.model = "
model {
  # setup parameter priors for each species * predictor combination
  for (j in 1:N.spp) {
    for (k in 1:N.preds) {
      m[k,j] ~ dgamma(1.0E-3, 1.0E-3)
    }
  }
  # fit means of species abundances as a linear combination of predictors and parameters
  for (i in 1:N) {
    for (j in 1:N.spp) {
      log(a0[i,j]) <- m[1,j] * x[i,1] + m[2,j] * x[i,2] + m[3,j] * x[i,3]
    }
    y[i,1:N.spp] ~ ddirch(a0[i,1:N.spp])
  }
} # close model
"
The solution is to take a dot product (an inner product) in JAGS. Change the line:
log(a0[i,j]) <- m[,j] * x[i,]
to:
log(a0[i,j]) <- inprod(m[,j], x[i,])
and everything should work fine. Full model below.
jags.model = "
model {
  # setup parameter priors for each species * predictor combination
  for (j in 1:N.spp) {
    for (k in 1:N.preds) {
      m[k,j] ~ dgamma(1.0E-3, 1.0E-3)
    }
  }
  # fit means of species abundances as a linear combination of predictors and parameters
  for (i in 1:N) {
    for (j in 1:N.spp) {
      log(a0[i,j]) <- inprod(m[,j], x[i,])
    }
    y[i,1:N.spp] ~ ddirch(a0[i,1:N.spp])
  }
} # close model
"
I would like to perform 10-fold cross-validation manually on the prostate data, to learn how to do it by hand. I adapted code from the elasticnet package and estimated the parameters with the glmnet package (which can of course perform the cross-validation itself, but I would like to do it manually). After the analysis, it seems to me that I need a criterion other than the minimum of cv.error to choose the tuning parameter, because that gives almost the null model; if not, where is my mistake? (According to Tibshirani's original paper, the optimum model has three variables.)
Here is the code:
library(ElemStatLearn)
library(glmnet)
x <- scale(prostate[,1:8], TRUE, TRUE)
y <- scale(prostate[,9], TRUE, FALSE)
lambda <- seq(0, 1, 0.02)

cv.folds <- function(n, folds = 10) {
  split(sample(1:n), rep(1:folds, length = n))
}
c.val <- function(x, y, K = 10, lambda, plot.it = TRUE) {
  n <- nrow(x)
  all.folds <- cv.folds(length(y), K)
  residmat <- matrix(0, length(lambda), K)
  for (i in seq(K)) {
    omit <- all.folds[[i]]
    xk <- as.matrix(x[-omit, ])
    yk <- as.vector(y[-omit])
    xg <- x[omit, ]
    yg <- y[omit]
    fit <- glmnet(xk, yk, family = "gaussian", alpha = 1,
                  lambda = lambda, standardize = FALSE, intercept = FALSE)
    fit <- predict(fit, newx = xg, lambda = lambda)
    if (length(omit) == 1) fit <- matrix(fit, nrow = 1)
    residmat[, i] <- apply((yg - fit)^2, 2, mean)
  }
  cv <- apply(residmat, 1, mean)
  cv.error <- sqrt(apply(residmat, 1, var)/K)
  object <- list(lambda = lambda, cv = cv, cv.error = cv.error)
  if (plot.it) {
    plot(lambda, cv, type = "b", xlab = "lambda",
         ylim = range(cv, cv + cv.error, cv - cv.error))
  }
  invisible(object)  # return the object whether or not a plot was drawn
}
result <- c.val(x, y, K = 10, lambda = lambda)
lambda.opt <- lambda[which.min(result$cv.error)]
fit <- glmnet(x, y, family = "gaussian", alpha = 1,
              lambda = lambda.opt, standardize = FALSE, intercept = FALSE)
coef(fit)
Result:
> coef(fit)
9 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
lcavol 0.01926724
lweight .
age .
lbph .
svi .
lcp .
Edit:
Model generated directly from glmnet.
fit.lasso <- glmnet(x, y, family="gaussian", alpha=1,
standardize = FALSE, intercept = FALSE)
fit.lasso.cv <- cv.glmnet(x, y, type.measure="mse", alpha=1,
family="gaussian",standardize = FALSE, intercept = FALSE)
coef.lambda.min <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.min)
coef.lambda.1se <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.1se)
cbind(coef.lambda.min,coef.lambda.1se)
Result:
9 x 2 sparse Matrix of class "dgCMatrix"
1 1
(Intercept) . .
lcavol 0.59892674 0.5286355
lweight 0.23669159 0.1201279
age -0.06979581 .
lbph 0.09392021 .
svi 0.24620007 0.1400748
lcp . .
gleason 0.00346421 .
pgg45 0.06631013 .
The second column shows the correct (lambda.1se) result.
Your "mistake" is very hard to spot: glmnet sorts the lambda sequence in decreasing order internally, so the results are not in the order of your own lambda vector.
Example with the data you used:
res <- glmnet(x, y, lambda = lambda)
res$lambda  # sorted in decreasing order, not the order you supplied
So when you call lambda[which.min(result$cv.error)] at the end of your procedure, you do not get the value corresponding to the minimum of the cross-validated error. This also explains why your graph looks strange.
An easy fix is to declare lambda at the beginning of the script as a decreasing vector:
lambda = seq(1, 0, by = -0.02)
Final remark: be careful when fitting glmnet with a single lambda value; it is safer to fit the whole path and extract the coefficients at the lambda you want.
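Putting the fix together, a sketch of the corrected selection step (note one other likely slip in the question: which.min should arguably be taken over result$cv, the mean cross-validated error, rather than result$cv.error, which is its standard error):

lambda <- seq(1, 0, by = -0.02)  # decreasing, matching glmnet's internal ordering
result <- c.val(x, y, K = 10, lambda = lambda)
lambda.opt <- lambda[which.min(result$cv)]  # minimise the mean CV error
## fit the full path, then extract coefficients at the chosen lambda
fit <- glmnet(x, y, family = "gaussian", alpha = 1,
              standardize = FALSE, intercept = FALSE)
coef(fit, s = lambda.opt)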