How to fit a negative binomial distribution in R while incorporating censoring - r

I need to fit Y_ij ~ NegBin(m_ij,k), i.e. a negative binomial distribution, to a count. However, the data I have observed are censored: I know the value of y_ij, but the true count could be more than that value. The log-likelihood for this problem is:
ll = \sum_{i=1}^n w_i (c_i log(P(Y_ij=y_ij|X_ij)) + (1- c_i) log(1- \sum_{k=1}^{32} P(Y_ij = k|X_ij)))
where X_ij represents the design matrix (with the covariates of interest), w_i is the weight for each observation, c_i indicates whether the observation is uncensored, y_ij is the response variable and P(Y_ij=y_ij|X_ij) is the negative binomial probability with m_ij=exp(X_ij \beta), and \alpha is the overdispersion parameter.
Does anyone know whether there is built-in code in R that could be used to fit this?

Check this paper out: Regression Models for Count Data in R
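If no packaged routine fits, one option is to code the censored log-likelihood by hand and pass it to optim. The following is only a minimal sketch under stated assumptions: it treats censoring as right censoring (a censored y means the true count is at least y), uses the log link m = exp(X beta) with the size parameterisation of dnbinom, and the function name nb_cens_negll and the simulated data are made up for illustration.

```r
# Weighted negative log-likelihood of a right-censored negative binomial model.
# par  = c(beta, log(size)); the log keeps the size parameter positive.
# cens = 1 if y was observed exactly, 0 if the true count is >= y.
nb_cens_negll <- function(par, X, y, cens, w = rep(1, length(y))) {
  p <- ncol(X)
  mu <- exp(drop(X %*% par[seq_len(p)]))
  size <- exp(par[p + 1])
  ll_exact <- dnbinom(y, size = size, mu = mu, log = TRUE)
  # log P(Y >= y) = log(1 - P(Y <= y - 1))
  ll_cens <- pnbinom(y - 1, size = size, mu = mu,
                     lower.tail = FALSE, log.p = TRUE)
  -sum(w * ifelse(cens == 1, ll_exact, ll_cens))
}

# Simulated example (hypothetical data): counts are capped at 6
set.seed(1)
n <- 2000
x <- rnorm(n)
X <- cbind(1, x)
y_true <- rnbinom(n, size = 2, mu = exp(0.5 + 0.3 * x))
limit <- 6
y <- pmin(y_true, limit)
cens <- as.integer(y_true < limit)

start <- c(log(mean(y)), 0, 0)   # intercept, slope, log(size)
fit <- optim(start, nb_cens_negll, X = X, y = y, cens = cens,
             method = "BFGS", hessian = TRUE)
fit$par[1:2]      # estimates of beta
exp(fit$par[3])   # estimate of the size (dispersion) parameter
```

The inverse of fit$hessian gives approximate variances for the parameters on the scale used in par.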

Related

Output from Linear Mixed Models differs from Estimated Marginal Means

I have a query about the output statistics gained from linear mixed models (using the lmer function) relative to the output statistics taken from the estimated marginal means gained from this model
Essentially, I am running an LMM comparing the within-subjects effect of different contexts (with "Negative" coded as the baseline) on enjoyment ratings. The LMM output suggests that the difference between negative and polite contexts is not significant, with a p-value of .35. See the screenshot below with the relevant line highlighted:
[Screenshot: LMM output]
However, when I then run the lsmeans function on the same model (with the Holm correction), the p-value for the comparison between Negative and Polite context categories is now .05, and all of the other statistics have changed too. Again, see the screenshot below with the relevant line highlighted:
[Screenshot: LSMeans output]
I'm probably being dense because my understanding of LMMs isn't hugely advanced, but I've tried to Google the reason for this and yet I can't seem to find out why? I don't think it has anything to do with the corrections because the smaller p-value is observed when the Holm correction is used. Therefore, I was wondering why this is the case, and which value I should report/stick with and why?
Thank you for your help!
Regression coefficients and marginal means are not one and the same. Once you learn these concepts it'll be easier to figure out which one is more informative and therefore which one you should report.
After we fit a regression by estimating its coefficients, we can predict the outcome yi given the m input variables Xi = (Xi1, ..., Xim). If the inputs are informative about the outcome, the predicted yi is different for different Xi. If we average the predictions yi for examples with Xij = xj, we get the marginal effect of the jth feature at the value xj. It's crucial to keep track of which inputs are kept fixed (and at what values) and which inputs are averaged over (aka marginalized out).
In your case, contextCatPolite in the coefficients summary is the difference between Polite and Negative when smileType is set to its reference level (no reward, I'd guess). In the emmeans contrasts, Polite - Negative is the average difference over all smileTypes.
Interactions have a way of making interpretation more challenging and your model includes an interaction between smileType and contextCat. See Interaction analysis in emmeans.
To add to @dipetkov's answer, the coefficients in your LMM are based on treatment coding (sometimes called 'dummy' coding). With the interactions in the model, these coefficients are no longer "main effects" in the traditional sense of factorial ANOVA. For instance, if you have:
y = b_0 + b_1(X_1) + b_2(X_2) + b_3 (X_1 * X_2)
...b_1 is "the effect of X_1" only when X_2 = 0:
y = b_0 + b_1(X_1) + b_2(0) + b_3 (X_1 * 0)
y = b_0 + b_1(X_1)
Thus, as @dipetkov points out, 1.625 is not the difference between Negative and Polite on average across all other factors (which you get from emmeans). Instead, this coefficient is the difference between Negative and Polite specifically when smileType = 0, i.e. at its reference level.
If you use sum-to-zero contrast coding (sometimes called effect or deviation coding) instead of treatment coding, then the coefficients from the regression output would correspond to the estimated marginal means (up to a scaling factor that depends on the coding), because smileType = 0 would now be the average across smile types. The coding scheme thus has a huge effect on the estimated values and statistical significance of regression coefficients, but it should not affect F-tests based on the reduction in deviance/variance (because no matter how you code it, a given variable explains the same amount of variance).
https://stats.oarc.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/
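The difference between the two coding schemes can be seen in a small simulation. This is a sketch only: it uses a plain lm on balanced made-up data rather than the original lmer model, and the level names s1 to s3 for smileType are placeholders.

```r
set.seed(42)
# Balanced hypothetical design: 2 contexts x 3 smile types, 20 replicates each
d <- expand.grid(rep = 1:20,
                 contextCat = c("Negative", "Polite"),
                 smileType = c("s1", "s2", "s3"))
# Polite differs from Negative only for smile type s3
d$y <- rnorm(nrow(d),
             mean = 5 + 2 * (d$contextCat == "Polite") * (d$smileType == "s3"))

# Cell means, and the two differences being compared
cell <- tapply(d$y, list(d$contextCat, d$smileType), mean)
simple_diff <- cell["Polite", "s1"] - cell["Negative", "s1"]   # at reference level
marginal_diff <- mean(cell["Polite", ] - cell["Negative", ])   # averaged over smile types

# Treatment (dummy) coding: the coefficient is the difference at smileType = s1
fit_trt <- lm(y ~ contextCat * smileType, data = d)
coef(fit_trt)["contextCatPolite"]

# Sum-to-zero coding: the coefficient is half the marginal Negative - Polite difference
fit_sum <- lm(y ~ contextCat * smileType, data = d,
              contrasts = list(contextCat = "contr.sum",
                               smileType = "contr.sum"))
coef(fit_sum)["contextCat1"]

c(simple = simple_diff, marginal = marginal_diff)
```

With treatment coding the coefficient reproduces simple_diff; with sum-to-zero coding it equals -marginal_diff / 2, so its test corresponds to the averaged comparison.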

How to specify random coefficients priors in rstanarm?

Suppose I have a following formula for a mixed effects model:
Performance ~ 1 + WorkingHours + Tenure + (1 + WorkingHours + Tenure || JobClass)
then I can specify priors for fixed slopes and fixed intercept as:
prior = normal(c(mu1,mu2), c(sd1,sd2), autoscale = FALSE)
prior_intercept = normal(mean, scale, autoscale = FALSE)
But how do I specify the priors for random slopes and intercept using
prior_covariance = decov(regularization, concentration, shape, scale)
(or)
lkj(regularization, scale, df)
if I know the variance between the slopes and intercepts and the correlation between them.
I am unable to understand how to specify the parameters for the above mixed effects formula.
Because you're working in a Bayesian model, you aren't going to specify the correlations or variances directly. You're going to specify a prior distribution over covariance matrices (by way of the correlation matrix and vector of variances) by giving the values for a few parameters.
The regularization parameter is a positive real value that determines how likely things are to be correlated. A value of 1 is sort of the "anything's possible" option (this is the default). Values greater than 1 mean that you believe there are few, if any, correlations. Values less than 1 mean you believe there is a lot of correlation.
The scale parameter is related to the sum of the variances. In particular, the scale parameter is equal to the square root of the average variance.
The concentration parameter is used to control how the total variance is distributed among the different variables. A value of 1 is saying you don't have an expectation. Larger values say that you believe that the variables have similar proportions of the total variance. Values between 0 and 1 mean that you think there are dissimilar contributions.
The shape parameter is used for a Gamma distribution that acts as a prior on the scale.
Then, finally, df is your prior degrees of freedom.
So, decov and lkj each give you a different way to express your expectations about properties of the covariance matrix, but they won't let you specify which specific variables you believe to be correlated with which other specific variables. The model works that out as part of the fitting process.
This is all from the rstanarm documentation
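To get a feel for what concentration, shape and scale do, the variance part of the prior can be simulated in base R. This is a rough sketch based only on the description above, not a call into rstanarm: it assumes the scale tau gets a Gamma(shape, scale) prior, that the variance proportions follow a symmetric Dirichlet with the given concentration, and that the variances are J * tau^2 * proportions. It ignores the correlation part (regularization), and the function name draw_decov_variances is hypothetical.

```r
# One draw of J group-level variances under the assumed decomposition
draw_decov_variances <- function(J, concentration = 1, shape = 1, scale = 1) {
  tau <- rgamma(1, shape = shape, scale = scale)  # sqrt of the average variance
  g <- rgamma(J, shape = concentration)           # symmetric Dirichlet via gammas
  props <- g / sum(g)                             # share of total variance per term
  list(tau = tau, props = props, variances = J * tau^2 * props)
}

set.seed(123)
one_draw <- draw_decov_variances(J = 3)
one_draw$variances   # intercept, WorkingHours, Tenure variances (one prior draw)

# Larger concentration -> more similar shares; smaller -> more dissimilar shares
spread_high <- mean(replicate(1000,
  sd(draw_decov_variances(3, concentration = 20)$props)))
spread_low <- mean(replicate(1000,
  sd(draw_decov_variances(3, concentration = 0.2)$props)))
c(concentration_20 = spread_high, concentration_0.2 = spread_low)
```

In a model call these values would be passed as prior_covariance = decov(regularization = 1, concentration = 1, shape = 1, scale = 1), which are the documented defaults.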

LASSO-type regressions with non-negative continuous dependent variable (dependent var)

I am using "glmnet" package (in R) mostly to perform regularized linear regression.
However, I am wondering if it can perform LASSO-type regressions with a non-negative (integer) continuous (dependent) outcome variable.
I can use family = poisson, but the outcome variable is not specifically a "count" variable. It is just a continuous variable with a lower limit of 0.
I am aware of the "lower.limits" argument, but I guess it is for covariates (independent variables). (Please correct me if my understanding of this argument is not right.)
I look forward to hearing from you all! Thanks :-)
You are right that lower.limits in glmnet does not apply to the response: it sets lower bounds on the coefficients of the covariates. A Poisson model keeps predictions above zero because you exponentiate the linear predictor to get back the "counts".
Going along those lines, most likely it will work if you transform your response variable. One quick way is to take the log of your response variable, do the fit and transform the predictions back; this ensures that they are always positive. You will have to deal with zeros, though, since log(0) is undefined.
An alternative is a power transformation. There's a lot to think about, and I can only try a two-parameter Box-Cox with an example dataset since you did not provide yours:
library(glmnet)
library(mlbench)
library(geoR)
data(BostonHousing)
data = BostonHousing
data$chas=as.numeric(data$chas)
# change it to min 0 and max 1
data$medv = (data$medv-min(data$medv))/diff(range(data$medv))
Then here I use a quick approximation via pca (without fitting all the variables) to get the suitable lambda1 and lambda2 :
bcfit = boxcoxfit(object = data[,14],
xmat = prcomp(data[,-14],scale=TRUE,center=TRUE)$x[,1:2],
lambda2=TRUE)
bcfit
Fitted parameters:
lambda lambda2 beta0 beta1 beta2 sigmasq
0.42696313 0.00001000 -0.83074178 -0.09876102 0.08970137 0.05655903
Convergence code returned by optim: 0
Check lambda2: it is the one that's critical for deciding whether you get a negative value. It should be rather small.
Create the functions to power transform:
bct = function(y,l1,l2){((y+l2)^l1 -1)/l1}
bctinverse = function(y,l1,l2){(y*l1+1)^(1/l1) -l2}
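As a quick sanity check, the two functions can be verified to be inverses of each other before they are used on the real response. The values of l1 and l2 below are illustrative, chosen close to the fitted parameters shown above. For l1 > 0 the back-transform (y*l1+1)^(1/l1) cannot go below zero, so the smallest value bctinverse can return is -l2, which is why a small lambda2 matters.

```r
# Two-parameter Box-Cox transform and its inverse (same definitions as above)
bct <- function(y, l1, l2) ((y + l2)^l1 - 1) / l1
bctinverse <- function(y, l1, l2) (y * l1 + 1)^(1 / l1) - l2

l1 <- 0.427   # illustrative value, near the fitted lambda
l2 <- 1e-5    # illustrative value, near the fitted lambda2

y <- c(0, 0.01, 0.25, 0.5, 1)   # response values on the rescaled [0, 1] range
z <- bct(y, l1, l2)

# Round trip: should recover y up to floating point error
roundtrip_ok <- isTRUE(all.equal(bctinverse(z, l1, l2), y))
roundtrip_ok
```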
Now we transform the response:
data$medv_trans = bct(data$medv,bcfit$lambda[1],bcfit$lambda[2])
And fit glmnet:
fit = glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans,nlambda=500)
Get predictions over all lambdas, and you can see there are no negative predictions once you transform back:
pred = predict(fit,as.matrix(data[,1:13]))
range(bctinverse(pred,bcfit$lambda[1],bcfit$lambda[2]))
[1] 0.006690685 0.918473356
And let's say we do a fit with cv:
fit = cv.glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans)
pred = predict(fit,as.matrix(data[,1:13]))
pred_transformed = bctinverse(pred,bcfit$lambda[1],bcfit$lambda[2])
plot(data$medv,pred_transformed,xlab="orig response",ylab="predictions")

How does lmer (from the R package lme4) compute log likelihood?

I'm trying to understand the function lmer. I've found plenty of information about how to use the command, but not much about what it's actually doing (save for some cryptic comments here: http://www.bioconductor.org/help/course-materials/2008/PHSIntro/lme4Intro-handout-6.pdf). I'm playing with the following simple example:
library(data.table)
library(lme4)
options(digits=15)
n<-1000
m<-100
data<-data.table(id=sample(1:m,n,replace=TRUE),key="id")
# one random intercept per group
b<-rnorm(m)
# response = group intercept + residual noise with sd 0.1
data$y<-b[data$id]+rnorm(n)*0.1
fitted<-lmer(y~(1|id),data=data,verbose=TRUE)
fitted
I understand that lmer is fitting a model of the form Y_{ij} = beta + B_i + epsilon_{ij}, where epsilon_{ij} and B_i are independent normals with variances sigma^2 and tau^2 respectively. If theta = tau/sigma is fixed, I computed the estimate for beta with the correct mean and minimum variance to be
c = sum_{i,j} alpha_i y_{ij}
where
alpha_i = lambda/(1 + theta^2 n_i)
lambda = 1/[\sum_i n_i/(1+theta^2 n_i)]
n_i = number of observations from group i
I also computed the following unbiased estimate for sigma^2:
s^2 = \sum_{i,j} alpha_i (y_{ij} - c)^2 / (1 + theta^2 - lambda)
These estimates seem to agree with what lmer produces. However, I can't figure out how log likelihood is defined in this context. I calculated the probability density to be
pd(Y_{ij}=y_{ij}) = \prod_{i,j}[f_sigma(y_{ij}-ybar_i)]
* prod_i[f_{sqrt(sigma^2/n_i+tau^2)}(ybar_i-beta) sigma sqrt(2 pi/n_i)]
where
ybar_i = \sum_j y_{ij}/n_i (the mean of observations in group i)
f_sigma(x) = 1/(sqrt{2 pi}sigma) exp(-x^2/(2 sigma^2)) (normal density with sd sigma)
But log of the above is not what lmer produces. How is log likelihood computed in this case (and for bonus marks, why)?
Edit: Changed notation for consistency, struck out incorrect formula for standard deviation estimate.
The links in the comments contained the answer. Below I've put what the formulae simplify to in this simple example, since the results are somewhat intuitive.
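lmer fits a model of the form Y_{ij} = beta + B_i + epsilon_{ij}, where B_i and epsilon_{ij} are independent normals with variances theta^2 sigma^2 and sigma^2 respectively. The joint probability distribution of Y_{ij} and B_i is therefore
\prod_{i,j} f_sigma(y_{ij} - beta - b_i) * \prod_i f_{theta sigma}(b_i)
where
f_s(x) = 1/(sqrt{2 pi} s) exp(-x^2/(2 s^2)) (normal density with sd s).
The likelihood is obtained by integrating this with respect to b_i (which isn't observed) to give
L = \prod_{i,j} f_sigma(y_{ij} - ybar_i) * \prod_i [f_{sigma sqrt(theta^2 + 1/n_i)}(ybar_i - beta) sigma sqrt(2 pi/n_i)]
where n_i is the number of observations from group i, and ybar_i is the mean of observations from group i. This is somewhat intuitive since the first term captures spread within each group, which should have variance sigma^2, and the second captures the spread between groups. Note that (theta^2 + 1/n_i) sigma^2 is the variance of ybar_i.
However, by default (REML=T) lmer maximises not the likelihood but the "REML criterion", obtained by additionally integrating this with respect to beta to give
L_R = L(beta = betahat) * sigma sqrt(2 pi / \sum_i [n_i/(1 + theta^2 n_i)])
where betahat is given below.
Maximising likelihood (REML=F)
If theta is fixed, we can explicitly find the beta and sigma^2 which maximise likelihood. They turn out to be
betahat = \sum_i [n_i ybar_i/(1 + theta^2 n_i)] / \sum_i [n_i/(1 + theta^2 n_i)]
sigmahat^2 = [\sum_{i,j} (y_{ij} - ybar_i)^2 + \sum_i n_i (ybar_i - betahat)^2/(1 + theta^2 n_i)] / N
where N = \sum_i n_i is the total number of observations.
Note sigmahat^2 has two terms for variation within and between groups, and betahat is somewhere between the mean of y_{ij} and the mean of ybar_i depending on the value of theta.
Substituting these into likelihood, we can express the log likelihood l in terms of theta only:
-2 l = N (1 + log(2 pi sigmahat^2)) + \sum_i log(1 + theta^2 n_i)
lmer iterates to find the value of theta which minimises this. In the output, -2 l and l are shown in the fields "deviance" and "logLik" (if REML=F) respectively.
Maximising restricted likelihood (REML=T)
Since the REML criterion doesn't depend on beta, we use the same estimate for beta as above. We estimate sigma to maximise the REML criterion:
sigmahat_R^2 = [\sum_{i,j} (y_{ij} - ybar_i)^2 + \sum_i n_i (ybar_i - betahat)^2/(1 + theta^2 n_i)] / (N - 1)
The restricted log likelihood l_R is given by
-2 l_R = (N - 1) (1 + log(2 pi sigmahat_R^2)) + \sum_i log(1 + theta^2 n_i) + log(\sum_i [n_i/(1 + theta^2 n_i)])
In the output of lmer, -2 l_R and l_R are shown in the fields "REMLdev" and "logLik" (if REML=T) respectively.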
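The profiled criterion for this random-intercept model can also be evaluated directly in base R, which makes it easy to compare against lmer. This is a sketch under the model Y_ij = beta + B_i + epsilon_ij with Var(B_i) = theta^2 sigma^2 and Var(epsilon_ij) = sigma^2; the function name profiled_deviance is made up here and is not part of lme4.

```r
# -2 * profiled (restricted) log likelihood as a function of theta = tau / sigma
profiled_deviance <- function(theta, y, id, REML = FALSE) {
  n_i <- tapply(y, id, length)              # group sizes
  ybar <- tapply(y, id, mean)               # group means
  w <- n_i / (1 + n_i * theta^2)            # weight of each group mean
  beta_hat <- sum(w * ybar) / sum(w)        # estimate of beta for this theta
  ss_within <- sum((y - ybar[as.character(id)])^2)
  S <- ss_within + sum(w * (ybar - beta_hat)^2)
  N <- length(y)
  df <- if (REML) N - 1 else N              # REML loses one df for beta
  dev <- df * (1 + log(2 * pi * S / df)) + sum(log(1 + n_i * theta^2))
  if (REML) dev <- dev + log(sum(w))
  dev
}

# Simulated data similar to the question: tau = 1, sigma = 0.1, so theta is near 10
set.seed(1)
m <- 100
n <- 1000
id <- sample(1:m, n, replace = TRUE)
y <- rnorm(m)[id] + rnorm(n) * 0.1

opt_ml <- optimize(profiled_deviance, interval = c(0, 50), y = y, id = id)
opt_reml <- optimize(profiled_deviance, interval = c(0, 50), y = y, id = id,
                     REML = TRUE)
c(theta_ML = opt_ml$minimum, deviance = opt_ml$objective)
c(theta_REML = opt_reml$minimum, REML_criterion = opt_reml$objective)
```

If lme4 is installed, -2 * logLik(lmer(y ~ (1 | id), REML = FALSE)) on the same data should agree with opt_ml$objective up to optimiser tolerance, and likewise for the REML fit.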

Determining the values of beta in a logistic regression equation

I read about logistic regression on Wikipedia and it talks of the equation where z, the output depends on the values of beta1, beta2, beta3, and so on which are termed as regression coefficients along with beta0, which is the intercept.
How do you determine the values of these coefficients and intercept given a sample set where we know the output probability and the values of the input dependent factors?
Use R. Here's a video: www.youtube.com/watch?v=Yv05RjKpEKY
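In R the coefficients are usually estimated by maximum likelihood with glm. A minimal sketch on simulated data follows; it assumes the sample records a binary 0/1 outcome for each row (rather than a known probability), and the variable names x1, x2 and the true coefficient values are made up.

```r
set.seed(1)
n <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)

# True model: z = beta0 + beta1 * x1 + beta2 * x2, probability = 1 / (1 + exp(-z))
p <- plogis(-0.5 + 1.2 * x1 - 0.8 * x2)
y <- rbinom(n, size = 1, prob = p)

# Maximum likelihood fit of the intercept and regression coefficients
fit <- glm(y ~ x1 + x2, family = binomial)
coef(fit)   # estimates of beta0, beta1, beta2

# Predicted probability for a new observation
predict(fit, newdata = data.frame(x1 = 0.3, x2 = -1), type = "response")
```

summary(fit) additionally reports standard errors and z-tests for each coefficient.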
