lme4: Random slopes shared by all observations - r

I'm using R's lme4. Suppose I have a mixed-effects logistic-regression model where I want some random slopes shared by every observation. They're supposed to be random in the sense that these random slopes should all come from a single normal distribution. This is essentially the same thing as ridge regression, but without choosing a penalty size with cross-validation.
I tried the following code:
library(lme4)

ilogit = function(v)
    1 / (1 + exp(-v))

set.seed(20)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)
x3 = rnorm(n)
x4 = rnorm(n)
x5 = rnorm(n)
y.p = ilogit(.5 + x1 - x2)
y = rbinom(n = n, size = 1, prob = y.p)

m1 = glm(
    y ~ x1 + x2 + x3 + x4 + x5,
    family = binomial)
print(round(d = 2, unname(coef(m1))))

m2 = glmer(
    y ~ ((x1 + x2 + x3 + x4 + x5)|1),
    family = binomial)
print(round(d = 2, unname(coef(m2))))
This yields:
Loading required package: Matrix
[1] 0.66 1.14 -0.78 -0.01 -0.16 0.25
Error: (p <- ncol(X)) == ncol(Y) is not TRUE
Execution halted
What did I do wrong? What's the right way to do this?

Looks like lme4 can't do this as-is. Here's what @amoeba said in the stats.SE chat:
What Kodi wants to do is definitely a mixed model, in the sense of Bates et al.; see e.g. eq. (2) here: https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf As far as I can see, the X and Z design matrices are equal in this case. However, there is no way one can use lme4 to fit this (without hacking into the code): it allows only the particular Z matrices that arise from model formulas of the type (formula | factor).
See https://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/015581.html "We intend to allow lmer to be able to use more flexible model matrices for the random effects although, at present, that requires a certain amount of tweaking on the part of the user"
And https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q2/002351.html "I view the variance-covariance structures available in the lme4 package as being related to random-effects terms in the model matrix. A random-effects term is of the form (LMexpr | GrpFac). The expression on the right of the vertical bar is evaluated as a factor, which I call the grouping factor. The expression on the left is evaluated as a linear model expression."
Those are all quotes from Bates. He did say (back in 2009) "In future versions of lme4 I plan to allow for extensions of the unconditional variance-covariance structures," but I don't think this was ever implemented.
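For contrast, here is a minimal sketch of the kind of random-effects term lme4 does accept, per Bates's description above: the expression to the right of the bar must evaluate to a grouping factor, so the random intercepts and slopes vary by the levels of that factor instead of being shared by every observation. The grouping factor g below is purely hypothetical and not part of the original simulation.
# Hypothetical grouping factor tacked onto the simulated data above (not in the
# original question): each of the 10 groups gets its own random intercept and
# random slope for x1, drawn from a bivariate normal distribution.
g = gl(10, 10)
m3 = glmer(
    y ~ x1 + x2 + (1 + x1 | g),
    family = binomial)
print(summary(m3))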

Related

How to do a one-sample location, two-way approximate Z test in R using estimates from the delta method?

I used the delta method to estimate the difference between two coefficients from a glm fit (code attached below). Now I want to compare this estimate to zero (i.e., a null hypothesis of no difference). One article mentions using a one-sample location, two-way approximate Z test to test this difference.
However, I cannot find an easy way to do that in R using the delta-method difference. I looked over the two-sample Z test documentation and considered using the difference as a substitute in the z-statistic formula, but I am not sure that's the best way to go about it.
##GENERATE DATA SET
y <- c(1:12)
x1 <- rep(c(1000, 4000, 0), each = 4)
x2 <- rep(c(0, 1000, 4000), each = 4)
df <- data.frame(y, x1, x2)
##RUN GLM
library(lmerTest)
g1 <- glm(log(y) ~ x1 + x2, data = df)
##Use delta-method to estimate the difference between coefficients of x1 and x2 (Ritz & Streibig 2008)
library(car)
g1.delta <- deltaMethod(g1,"(-x1) - (-x2)")
              Estimate         SE      2.5 %  97.5 %
(-x1) - (-x2) 2.3217e-04 7.3180e-05 8.8738e-05  4e-04
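For what it's worth, here is a minimal sketch of the Wald-type (approximate Z) test against zero, assuming the g1.delta object produced by deltaMethod() above: divide the estimate by its delta-method standard error and refer the ratio to a standard normal.
## One-sample approximate Z test of the delta-method estimate against zero
z <- g1.delta$Estimate / g1.delta$SE
p <- 2 * pnorm(-abs(z))   # two-sided p-value
c(z = z, p.value = p)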

SparkR MLlib & spark.ml: least squares and glm optimization

Would anyone be able to explain how to specify optimization methods in the SparkR operation glm? When I try to fit an OLS model with glm, I can only specify "normal" or "auto" as the solver type. SparkR isn't able to interpret the solver specification "l-bfgs", leading me to believe that when I do specify "auto", SparkR simply assumes "normal" and then estimates the model coefficients analytically, using the LS normal equation.
Is fitting GLMs with stochastic gradient descent and L-BFGS not available in SparkR, or am I writing the following evaluation incorrectly?
m <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")
There's plenty of documentation in Spark about using iterative methods to fit GLMs, e.g. LogisticRegressionWithLBFGS and LinearRegressionWithSGD (discussed here), but I haven't been able to find any such documentation for the R API. Is this simply not available in SparkR (i.e. are SparkR users constrained to solving analytically and, therefore, constrained in the size of our data), or am I missing something essential here? If it isn't currently available in SparkR, is it supposed to come out with SparkR 2.0.0?
Below, I create a toy data set and fit three models, each with a different solver specification:
x1 <- rnorm(n=200, mean=10, sd=2)
x2 <- rnorm(n=200, mean=17, sd=3)
x3 <- rnorm(n=200, mean=8, sd=1)
y <- 1 + .2 * x1 + .4 * x2 + .5 * x3 + rnorm(n=200, mean=0, sd=.1)
dat <- cbind.data.frame(y, x1, x2, x3)
df <- as.DataFrame(sqlContext, dat)
m1 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "normal")
m2 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "auto")
m3 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")
The first and second models produce the same parameter estimates (supporting my assumption that SparkR solves the normal equation when fitting both models and that, consequently, the models are equivalent). SparkR is able to fit the third model, but when I try to print a summary of the GLM, I receive the following error:
For reference, I am doing this through AWS and have tried different versions of EMR, including the most recent (in case that makes a difference). Also, I am using Spark 1.6.1 (R API).
Spark 1.6.2 API documentation is here
solver:
The solver algorithm used for optimization; this can be "l-bfgs", "normal", or "auto". "l-bfgs" denotes limited-memory BFGS, a limited-memory quasi-Newton optimization method. "normal" denotes using the normal equation as an analytical solution to the linear regression problem. The default value is "auto", which means the solver algorithm is selected automatically.
To me, this looks worthy of a bug report on the Apache Spark JIRA.

Test the proportional odds assumption with 2 random variables (R, ordinal logistic regression)

I'm using the package ordinal in R to run ordinal logistic regression on a dependent variable that is based on a 1 - 5 likert scale and trying to figure out how to test the proportional odds assumption.
My current model is y ~ x1 + x2 + x3 + x4 + x2*x3 + (1|ID) + (1|form) where x1 and x2 are dichotomous and x3 and x4 are continuous variables. (92 subjects, 4 forms).
As far as I know,
-"nominal" is not implemented in the more recent version of clmm.
-clmm2 (the older version) does not accept more than one random variable
-nominal_test() only appears to work for clm2 (without random effects at all)
For a different dv (that only has one random term and no interaction), I had used:
m1 <- clmm2(y ~ x1 + x2 + x3, random = ID, Hess = TRUE, data = d)
m1.nom <- clmm2(y ~ x1 + x2, random = ID, Hess = TRUE, nominal = ~ x3, data = d)
m2.nom <- clmm2(y ~ x2 + x3, random = ID, Hess = TRUE, nominal = ~ x1, data = d)
m3.nom <- clmm2(y ~ x1 + x3, random = ID, Hess = TRUE, nominal = ~ x2, data = d)
anova(m1.nom, m1)
anova(m2.nom, m1)
anova(m3.nom, m1)  # as well as considering the output in summary(m#.nom)
But I'm not sure how to modify this approach to handle the current model (2 random terms and an interaction of the fixed effects), nor am I sure that this is actually a correct way to test the proportional odds assumption in the first place. (The example in the package tutorial only has 2 fixed effects.)
I'm open to other approaches (be they other packages, software, or graphical approaches) that would let me test this. Any suggestions?
Even in the case of the most basic ordinal logistic regression models, the diagnostic tests for the proportional odds assumption are known to frequently reject the null hypothesis that the coefficients are the same across the levels of the ordered factor. The statistician Frank Harrell suggests here a general graphical method for examining the proportional odds assumption, which is probably your best bet. In this approach you'd just graph the linear predictions from a logit model (with random effects) for each level of the outcome and one predictor variable at a time.
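A minimal sketch of that graphical check, under the assumption that the 1-5 response y is stored numerically and that the data frame d, the predictors x1-x4, and the random factors ID and form are as described in the question: dichotomize the outcome at each cut point, refit the same logistic mixed model, and see whether the coefficients stay roughly constant across cut points, as proportional odds would imply.
library(lme4)

cuts <- 2:5  # cut points: y >= 2, ..., y >= 5
coefs <- sapply(cuts, function(k) {
  d$ybin <- as.integer(d$y >= k)  # dichotomize the ordinal outcome at cut point k
  fit <- glmer(ybin ~ x1 + x2 + x3 + x4 + x2:x3 + (1 | ID) + (1 | form),
               family = binomial, data = d)
  fixef(fit)["x3"]  # track one predictor's coefficient across cut points
})

plot(cuts, coefs, type = "b", xlab = "cut point (y >= k)",
     ylab = "coefficient of x3",
     main = "Informal check of proportional odds")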

optimal predictor value for multivariate regression in R

Suppose I have 1 response variable Y and 2 predictors X1 and X2, such as the following:
  Y    X1    X2
  2.3  1.1   1.2
  2.5  1.24  1.17
  ...
Assuming I have a strong belief that the following model works well:
fit <- lm(Y ~ poly(X1,2) + X2)
In other words, there is a quadratic relationship between Y and X1 and a linear relationship between Y and X2.
Now here are my questions:
How do I find the value of (x1, x2) at which the fitted model reaches its maximum?
Now assuming X2 has to be fixed at some particular value, how do I find the value of x1 that maximizes the fitted value?
So here is an empirical way to do this:
# create some random data...
set.seed(1)
X1 <- 1:100
X2 <- sin(2*pi/100*(1:100))
df <- data.frame(Y=3 + 5*X1 -0.2 * X1^2 + 100*X2 + rnorm(100,0,5),X1,X2)
fit <- lm(Y ~ poly(X1,2,raw=T) + X2, data=df)
# X1 and X2 unconstrained
df$pred <- predict(fit)
result <- with(df,df[pred==max(pred),])
result
# Y X1 X2 pred
# 19 122.8838 19 0.9297765 119.2087
# max(Y|X2=0)
newdf <- data.frame(Y=df$Y, X1=df$X1, X2=0)
newdf$pred2 <- predict(fit,newdata=newdf)
result2 <- with(newdf,newdf[pred2==max(pred2),])
result2
# Y X1 X2 pred2
#12 104.6039 12 0 35.09141
So in this example, when X1 and X2 are unconstrained, the maximum fitted value is 119.2 and occurs at (X1, X2) = (19, 0.930). When X2 is constrained to 0, the maximum fitted value is 35.1 and occurs at (X1, X2) = (12, 0).
There are a couple of things to consider:
- These are global maxima in the space of your data. In other words, if your real data has a large number of variables, there might be local maxima that you will not find this way.
- This method has resolution only as fine as your dataset. If the true maximum occurs at a point between your data points, you will not find it this way.
- The technique is restricted to the bounds of your dataset, so if the true maximum lies outside those bounds, you will not find it. On the other hand, using a model outside the bounds of your data is, IMHO, the definition of reckless.
Finally, you should be aware that poly(...) produces orthogonal polynomials, which will generate a fit, but the coefficients will be very difficult to interpret. If you really want a quadratic fit of the form a + b*x + c*x^2, you are better off specifying it explicitly with Y ~ X1 + I(X1^2) + X2, or using raw=T in the call to poly(...).
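Building on that last point, once you refit with interpretable (raw) coefficients, the second question has a closed form: for any fixed X2 the fitted curve in X1 is b0 + b1*X1 + b2*X1^2 + b3*X2, which (when b2 < 0) is maximized at X1 = -b1/(2*b2), regardless of the X2 value. A quick sketch reusing the simulated df from above:
# Closed-form maximizer in X1 from the raw-coefficient fit (a sketch, reusing df)
b <- coef(lm(Y ~ X1 + I(X1^2) + X2, data = df))
x1.opt <- -b["X1"] / (2 * b["I(X1^2)"])
x1.opt  # should land near 12.5, the optimum of the true curve 3 + 5*x - 0.2*x^2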
Credit to @sashkello.
Basically, I have to extract the coefficients from the lm object and multiply them by the corresponding terms to build the fitted formula myself before I can proceed. I think this is not very efficient. What if this were a regression with hundreds of predictors?
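One way around both the resolution issue and the coefficient bookkeeping, sketched below under the assumption that the fit and df objects from the answer above are available, is to optimize the fitted surface directly with optim() inside box constraints taken from the data range; this scales to many predictors without writing out the formula by hand.
# Maximize the fitted surface directly (a sketch, not from the original answer)
obj <- function(par) {
  -as.numeric(predict(fit, newdata = data.frame(X1 = par[1], X2 = par[2])))  # negate for minimization
}
best <- optim(c(mean(df$X1), mean(df$X2)), obj,
              method = "L-BFGS-B",
              lower = c(min(df$X1), min(df$X2)),   # stay inside the data's range
              upper = c(max(df$X1), max(df$X2)))
best$par     # maximizing (X1, X2)
-best$value  # maximum fitted value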

R probit regression marginal effects

I am using R to replicate a study and obtain mostly the same results the author reported. At one point, however, I calculate marginal effects that seem unrealistically small. I would greatly appreciate it if you could have a look at my reasoning and the code below and see whether I am mistaken somewhere.
My sample contains 24,535 observations, the dependent variable "x028bin" is a binary variable taking the values 0 and 1, and there are 10 explanatory variables. Nine of those independent variables are numeric; the independent variable "f025grouped" is a factor consisting of different religious denominations.
I would like to run a probit regression including dummies for religious denomination and then compute marginal effects. To do so, I first eliminate missing values and use cross-tabs between the dependent and independent variables to verify that there are no small or empty cells. Then I run the probit model, which works fine, and I obtain reasonable results:
probit4AKIE <- glm(
    x028bin ~ x003 + x003squ + x025secv2 + x025terv2 + x007bin + x04chief +
        x011rec + a009bin + x045mod + c001bin + f025grouped,
    family = binomial(link = "probit"),
    data = wvshm5red2delna, na.action = na.pass)
summary(probit4AKIE)
However, when calculating marginal effects with all variables at their means from the probit coefficients and a scale factor, the marginal effects I obtain are much too small (e.g. 2.6042e-78).
The code looks like this:
ttt <- cbind(wvshm5red2delna$x003,
wvshm5red2delna$x003squ,
wvshm5red2delna$x025secv2,
wvshm5red2delna$x025terv2,
wvshm5red2delna$x007bin,
wvshm5red2delna$x04chief,
wvshm5red2delna$x011rec,
wvshm5red2delna$a009bin,
wvshm5red2delna$x045mod,
wvshm5red2delna$c001bin,
wvshm5red2delna$f025grouped,
wvshm5red2delna$f025grouped,
wvshm5red2delna$f025grouped,
wvshm5red2delna$f025grouped,
wvshm5red2delna$f025grouped,
wvshm5red2delna$f025grouped,
wvshm5red2delna$f025grouped,
wvshm5red2delna$f025grouped,
wvshm5red2delna$f025grouped) #I put variable "f025grouped" 9 times because this variable consists of 9 levels
ttt <- as.data.frame(ttt)
xbar <- as.matrix(mean(cbind(1,ttt[1:19]))) #1:19 position of variables in dataframe ttt
betaprobit4AKIE <- probit4AKIE$coefficients
zxbar <- t(xbar) %*% betaprobit4AKIE
scalefactor <- dnorm(zxbar)
marginprobit4AKIE <- scalefactor * betaprobit4AKIE[2:20] #2:20 are the positions of variables in the output of the probit model 'probit4AKIE' (variables need to be in the same ordering as in data.frame ttt), the constant in the model occupies the first position
marginprobit4AKIE #in this step I obtain values that are much too small
I apologize that I cannot provide you with a working example, as my dataset is much too large. Any comment would be greatly appreciated. Thanks a lot.
Best,
Tobias
@Gavin is right and it's better to ask at the sister site.
In any case, here's my trick to interpret probit coefficients.
The probit regression coefficients are the same as the logit coefficients, up to a scale factor of about 1.6. So, if the fit of a probit model is Pr(y = 1) = Φ(.5 - .3*x), this is approximately equivalent to the logistic model Pr(y = 1) = invlogit(1.6*(.5 - .3*x)).
I use this to make a graphic, using the invlogit function from the arm package. Another possibility is simply to multiply all coefficients (including the intercept) by 1.6 and then apply the 'divide by 4' rule (see the book by Gelman and Hill), i.e., divide the new coefficients by 4, which gives an upper bound on the predictive difference corresponding to a unit difference in x.
Here's an example.
x1 = rbinom(100,1,.5)
x2 = rbinom(100,1,.3)
x3 = rbinom(100,1,.9)
ystar = -.5 + x1 + x2 - x3 + rnorm(100)
y = ifelse(ystar>0,1,0)
probit = glm(y~x1 + x2 + x3, family=binomial(link='probit'))
xbar <- c(1, mean(x1), mean(x2), mean(x3))  # means of this example's covariates, with a leading 1 for the intercept
# now the graphic, i.e., the marginal effect of x1, x2 and x3
library(arm)
curve(invlogit(1.6*(probit$coef[1] + probit$coef[2]*x + probit$coef[3]*xbar[3] + probit$coef[4]*xbar[4]))) #x1
curve(invlogit(1.6*(probit$coef[1] + probit$coef[2]*xbar[2] + probit$coef[3]*x + probit$coef[4]*xbar[4]))) #x2
curve(invlogit(1.6*(probit$coef[1] + probit$coef[2]*xbar[2] + probit$coef[3]*xbar[3] + probit$coef[4]*x))) #x3
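And a one-line sketch of the 'divide by 4' rule mentioned above, applied to the probit fit from this example: rescaling the probit slopes by 1.6 puts them on the logit scale, and dividing by 4 gives an upper bound on the change in Pr(y = 1) per unit change in each predictor.
round(1.6 * coef(probit)[-1] / 4, 3)  # upper bounds on the marginal effects of x1, x2, x3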
This will do the trick for probit or logit:
mfxboot <- function(modform, dist, data, boot = 1000, digits = 3){
  x <- glm(modform, family = binomial(link = dist), data)
  # get marginal effects
  pdf <- ifelse(dist == "probit",
                mean(dnorm(predict(x, type = "link"))),
                mean(dlogis(predict(x, type = "link"))))
  marginal.effects <- pdf * coef(x)
  # start bootstrap
  bootvals <- matrix(rep(NA, boot * length(coef(x))), nrow = boot)
  set.seed(1111)
  for(i in 1:boot){
    samp1 <- data[sample(1:dim(data)[1], replace = T, dim(data)[1]), ]
    x1 <- glm(modform, family = binomial(link = dist), samp1)
    # note: the scale factor must come from the refitted model x1, not the original x
    pdf1 <- ifelse(dist == "probit",
                   mean(dnorm(predict(x1, type = "link"))),
                   mean(dlogis(predict(x1, type = "link"))))
    bootvals[i, ] <- pdf1 * coef(x1)
  }
  res <- cbind(marginal.effects,
               apply(bootvals, 2, sd),
               marginal.effects / apply(bootvals, 2, sd))
  if(names(x$coefficients[1]) == "(Intercept)"){
    # drop the intercept row and round to the requested number of digits
    res1 <- res[2:nrow(res), ]
    res2 <- matrix(as.numeric(sprintf(paste("%.", paste(digits, "f", sep = ""), sep = ""), res1)),
                   nrow = dim(res1)[1])
    rownames(res2) <- rownames(res1)
  } else {
    res2 <- matrix(as.numeric(sprintf(paste("%.", paste(digits, "f", sep = ""), sep = ""), res)),
                   nrow = dim(res)[1])
    rownames(res2) <- rownames(res)
  }
  colnames(res2) <- c("marginal.effect", "standard.error", "z.ratio")
  return(res2)
}
Source: http://www.r-bloggers.com/probitlogit-marginal-effects-in-r/
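A possible usage sketch (not part of the original post), reusing the simulated x1, x2, x3, and y from the earlier example in this thread:
dat <- data.frame(y, x1, x2, x3)
mfxboot(y ~ x1 + x2 + x3, dist = "probit", data = dat, boot = 200)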

Resources