GLMER Output from R, meaning of .sig01, .sig02, .sig03 - r

When computing profile confidence intervals using confint(m1), where m1 is a glmer() model, there are one or more terms at the top labelled .sig01, .sig02, and so on, and I can't find any documentation explaining what they mean.

You probably didn't find the documentation for the 'merMod' method of confint. In ?lme4::confint.merMod the following can be read under the oldNames argument:
oldNames: (logical) use old-style names for variance-covariance parameters,
e.g. ".sig01", rather than newer (more informative) names such as
"sd_(Intercept)|Subject"? (See signames argument to profile).
The default for oldNames is TRUE; setting it to FALSE gives clearer output.
Example
library(lme4)
(gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
              data = cbpp, family = binomial))
confint(gm1, oldNames = FALSE)
# 2.5 % 97.5 %
# sd_(Intercept)|herd 0.3460732 1.0999743
# (Intercept) -1.9012119 -0.9477540
# period2 -1.6167830 -0.4077632
# period3 -1.8010241 -0.5115362
# period4 -2.5007502 -0.8007554
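For comparison, with the default oldNames = TRUE the very same row appears under the old-style label, i.e. .sig01 is simply the random-effect standard deviation sd_(Intercept)|herd (models with more variance-covariance parameters get .sig02, .sig03, and so on):
confint(gm1)  # same intervals as above; the first row is now labelled ".sig01"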

Related

R Quantreg: Singularity with categorical survey data

For my Bachelor's thesis I am trying to apply a linear median regression model to constant-sum data from a survey (see the formula in Blass et al. (2008)). It is an attempt to recreate the probability elicitation approach proposed in A. Blass et al. (2008), "Using Elicited Choice Probabilities to Estimate Random Utility Models: Preferences for Electricity Reliability".
My dependent variable is the log-odds transformation of the constant sum allocations. Calculated using the following formula:
PE_raw <- PE_raw %>%
  group_by(sys_RespNum, Task) %>%
  mutate(LogProb = c(log(Response[1] / Response[1]),
                     log(Response[2] / Response[1]),
                     log(Response[3] / Response[1])))
My independent variables are delivery costs, minimum order quantity and delivery window, each a categorical variable with levels 0, 1, 2 and 3. Here, level 0 represents the none-option.
Data snapshot
I tried running the following quantile regression (using R's quantreg package):
LAD.factor <- rq(LogProb ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) + factor(`Delivery window`) + factor(NoneOpt), data=PE_raw, tau=0.5)
However, I ran into the following error indicating singularity:
Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix
I ran a linear regression and applied R's alias function for further investigation. This informed me of three cases of perfect multicollinearity:
minimum order quantity 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - minimum order quantity 1 - minimum order quantity 2
delivery window 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - delivery window 1 - delivery window 2
NoneOpt = intercept - delivery costs 1 - delivery costs 2 - delivery costs 3
In hindsight these cases all make sense. When R dichotomizes the categorical variables you get these results by construction, since delivery costs 1 + delivery costs 2 + delivery costs 3 = 1 and minimum order quantity 1 + minimum order quantity 2 + minimum order quantity 3 = 1; rewriting gives the first formula.
It looks like a classic dummy-variable trap. In an attempt to work around this issue I tried to manually dichotomize the data and used the following formula:
LM.factor <- rq(LogProb ~ Delivery.costs_1 + Delivery.costs_2 + Minimum.order.quantity_1 + Minimum.order.quantity_2 + Delivery.window_1 + Delivery.window_2 + factor(NoneOpt), data=PE_dichomitzed, tau=0.5)
Instead of an error message I now got the following:
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
When using the summary function:
> summary(LM.factor)
Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose, :
singular matrix in 'backsolve'. First zero in diagonal [2]
In addition: Warning message:
In summary.rq(LM.factor) : 153 non-positive fis
Is anyone familiar with this issue? I am looking for alternative solutions. Perhaps I am making mistakes using the rq() function, or the data might be misrepresented.
I am grateful for any input, thank you in advance.
Reproducible example
library(quantreg)
#### Raw dataset (PE_raw_SO) ####
# quantile regression (produces singularity error)
LAD.factor <- rq(
  LogProb ~ factor(`Delivery costs`) +
    factor(`Minimum order quantity`) + factor(`Delivery window`) +
    factor(NoneOpt),
  data = PE_raw_SO,
  tau = 0.5
)
# linear regression to check for singularity
LM.factor <- lm(
  LogProb ~ factor(`Delivery costs`) +
    factor(`Minimum order quantity`) + factor(`Delivery window`) +
    factor(NoneOpt),
  data = PE_raw_SO
)
alias(LM.factor)
# impose assumptions on standard errors
summary(LM.factor, se = "iid")
summary(LM.factor, se = "boot")
#### Manually created dummy variables to get rid of
#### collinearity (PE_dichotomized_SO) ####
LAD.di.factor <- rq(
  LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
    Minimum.order.quantity_1 + Minimum.order.quantity_2 +
    Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
  data = PE_dichotomized_SO,
  tau = 0.5
)
summary(LAD.di.factor)  # backsolve error
# impose assumptions (unusual results)
summary(LAD.di.factor, se = "iid")
summary(LAD.di.factor, se = "boot")
# linear regression to check for singularity
LM.di.factor <- lm(
  LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
    Minimum.order.quantity_1 + Minimum.order.quantity_2 +
    Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
  data = PE_dichotomized_SO
)
alias(LM.di.factor)
summary(LM.di.factor)  # regular results, all significant
Link to sample data + code: GitHub
The "Solution may be nonunique" behaviour is not unusual when doing quantile regression with dummy explanatory variables.
See, e.g., the quantreg FAQ:
The estimation of regression quantiles is a linear programming
problem. And the optimal solution may not be unique.
A more intuitive explanation for what is happening is given by Roger Koenker (the author of quantreg) on r-help back in 2006:
When computing the median from a sample with an even number of
distinct values there is inherently some ambiguity about its value:
any value between the middle order statistics is "a" median.
Similarly, in regression settings the optimization problem solved by
the "br" version of the simplex algorithm, modified to do general
quantile regression, identifies cases where there may be
non-uniqueness of this type. When there are "continuous" covariates this
is quite rare; when covariates are discrete it is relatively
common, at least when tau is chosen from the rationals. For univariate
quantiles R provides several methods of resolving this sort of
ambiguity by interpolation; "br" doesn't try to do this, instead
returning the first vertex solution that it comes to.
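To make the non-uniqueness concrete, here is a tiny illustration of my own (not part of the original answer): with a discrete covariate and an even number of observations per group, the group medians are ambiguous and the "br" algorithm flags it.
library(quantreg)
y <- c(1, 2, 3, 4)                  # each group median is ambiguous: anywhere in [1, 2] or [3, 4]
x <- factor(c("a", "a", "b", "b"))  # a discrete (dummy-coded) covariate
fit <- rq(y ~ x, tau = 0.5)         # typically warns: "Solution may be nonunique"
coef(fit)                           # one of several equally optimal vertex solutions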
Your second warning -- "153 non-positive fis" -- relates to how rq estimates the local densities of the response at the fitted quantiles. Occasionally these estimated densities come out non-positive (which is impossible for a true density); when that happens, rq automatically sets them to zero. Again, quoting from the FAQ:
This is generally harmless, leading to a somewhat conservative
(larger) estimate of the standard errors, however if the reported
number of non-positive fis is large relative to the sample size then
it is an indication of misspecification of the model.

Logistic Regression in R: glm() vs rxGlm()

I fit a lot of GLMs in R. Usually I use revoScaleR::rxGlm() for this because I work with large data sets and quite complex model formulae - glm() just won't cope.
In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.
Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using revoScaleR::rxLogit() although revoScaleR::rxGlm() produces the same output - and has the same problem.
Consider this reprex:
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1))  # number of successes
df_reprex$p <- df_reprex$y / df_reprex$x    # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
             family = binomial,
             data = df_reprex,
             weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1]))  # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
                 data = df_reprex,
                 pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))  # overall fitted average 0.167 - incorrect
The first call to glm() produces the correct answer. The second call to rxLogit() does not. Reading the docs for rxLogit(): https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit it states that "Dependent variable must be binary".
So it looks like rxLogit() needs me to use y as the dependent variable rather than p. However if I run
glm_2 <- rxLogit(y ~ 1,
                 data = df_reprex,
                 pweights = "x")
I get an overall average
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))
of 0.5 instead, which also isn't the correct answer.
Does anyone know how I can fix this? Do I need to use an offset() term in the model formula, or change the weights, or...
(By using the revoScaleR package I occasionally paint myself into a corner like this, because not many other people seem to use it.)
I'm flying blind here because I can't verify these in RevoScaleR myself -- but would you try running the code below and leave a comment on what the results were? I can then edit/delete this post accordingly.
Two things to try:
Expand the data and get rid of the weights statement
Use cbind(y, x - y) ~ 1 in either rxLogit or rxGlm, without weights and without expanding the data
If the dependent variable is required to be binary, then the data has to be expanded so that each row corresponds to a single 1 or 0 response, and this expanded data is then used in a glm call without a weights argument.
I tried to demonstrate this with your example by applying labels to df_reprex and then making a corresponding df_reprex_expanded -- I know this is unfortunate, because you said the data you were working with was already large.
Does rxLogit allow a cbind() representation, like glm() does (I put an example as glm_1b), because that would allow the data to stay the same size... From the rxLogit page, I'm guessing not for rxLogit, but rxGlm might allow it, given the following note in the formula page:
A formula typically consists of a response, which in most RevoScaleR
functions can be a single variable or multiple variables combined
using cbind, the "~" operator, and one or more predictors, typically
separated by the "+" operator. The rxSummary function typically
requires a formula with no response.
Does glm_2b or glm_2c in the example below work?
df_reprex <- data.frame(x = c(1, 1, 2, 2),                               # number of trials
                        y = c(0, 1, 0, 1),                               # number of successes
                        trial = c("first", "second", "third", "fourth")) # trial label
df_reprex$p <- df_reprex$y / df_reprex$x  # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
             family = binomial,
             data = df_reprex,
             weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1]))  # overall fitted average 0.333 - correct
df_reprex_expanded <- data.frame(y = c(0, 1, 0, 0, 1, 0),
                                 trial = c("first", "second", "third", "third", "fourth", "fourth"))
## binary dependent variable
## expanded data
## no weights
glm_1a <- glm(y ~ 1,
              family = binomial,
              data = df_reprex_expanded)
exp(glm_1a$coefficients[1]) / (1 + exp(glm_1a$coefficients[1]))  # overall fitted average 0.333 - correct
## cbind(success, failures) dependent variable
## compressed data
## no weights
glm_1b <- glm(cbind(y, x - y) ~ 1,
              family = binomial,
              data = df_reprex)
exp(glm_1b$coefficients[1]) / (1 + exp(glm_1b$coefficients[1]))  # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
                 data = df_reprex,
                 pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))  # overall fitted average 0.167 - incorrect
glm_2a <- rxLogit(y ~ 1,
                  data = df_reprex_expanded)
exp(glm_2a$coefficients[1]) / (1 + exp(glm_2a$coefficients[1]))  # overall fitted average ???
# try cbind() in rxLogit. If no, then try rxGlm below
glm_2b <- rxLogit(cbind(y, x - y) ~ 1,
                  data = df_reprex)
exp(glm_2b$coefficients[1]) / (1 + exp(glm_2b$coefficients[1]))  # overall fitted average ???
# cbind() + rxGlm + family=binomial FTW(?)
glm_2c <- rxGlm(cbind(y, x - y) ~ 1,
                family = binomial,
                data = df_reprex)
exp(glm_2c$coefficients[1]) / (1 + exp(glm_2c$coefficients[1]))  # overall fitted average ???
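If the data really does need to be expanded for rxLogit, here is a sketch of my own (not part of the original answer) showing how aggregated binomial data could be expanded programmatically into one row per Bernoulli trial; it assumes columns named x (trials) and y (successes) as in df_reprex:
expand_binomial <- function(df) {
  idx <- rep(seq_len(nrow(df)), df$x)                           # one output row per trial
  out <- df[idx, setdiff(names(df), c("x", "y")), drop = FALSE] # keep the other columns
  # within each original row, the first y trials are successes (1), the rest failures (0)
  out$y <- unlist(mapply(function(s, n) c(rep(1, s), rep(0, n - s)),
                         df$y, df$x, SIMPLIFY = FALSE))
  out
}
df_long <- expand_binomial(df_reprex)  # 6 rows; mean(df_long$y) is 2/6 = 0.333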

Predict function returns Fitted value even though I put newdata

I'm making a model validation testing function of my own.
In doing so, I let
a = the entire set of predictor variables in the model-building set
b = the response variable in the model-building set
c = the entire set of predictor variables in the validation set
d = the response variable in the validation set
e = the numbers of the columns I am interested in
This is based on the book Applied Linear Regression Models by Kutner, so I used library(ALSM).
In my case, the model-building set is SurgicalUnit and the validation set is SurgicalUnitAdditional.
Both data sets consist of 10 columns: columns 1 to 8 are the full set of independent variables, column 9 is the response variable, and column 10 is the log of the response variable.
So,
a=SurgicalUnit[,1:8]; b=SurgicalUnit[,10];
c=SurgicalUnitAdditional[,1:8]; d=SurgicalUnitAdditional[,10]; e=c(1,2,3,8)
since I want to fit the logged response variable and regress on variables x1, x2, x3 and x8.
(Please note: the reason I pass the "entire" set of independent variables together with the column numbers of interest, instead of passing only the independent variables of interest directly, is that I also need to obtain Mallow's Cp inside the function.)
So my regression is asdf = lm(b ~ as.matrix(a[e])). The problem is that I want to predict the validation set using the model built on the model-building set. So I let preds = data.frame(c[e]) and finally call predict(asdf, newdata = preds), but this gives the same result as predict(asdf), i.e. the fitted values of asdf.
Why doesn't predict work? Help will be appreciated.
Below is my function
mod.valid=function(a,b,c,d,e){
asdf=lm(b~as.matrix(a[e])) # model what you want
qwer=lm(b~as.matrix(a[1:max(e)])) # full model in order to get Cp
mat=round(coef(summary(asdf))[,c(-3,-4)],4); mat2=matrix(0,5,2)
mat=rbind(mat,mat2); mat # matrix for coefficients and others(model-building)
n=nrow(anova(asdf)); m=nrow(anova(qwer))
nn=length(b) # To get size of sample size
p=asdf$rank # To get parameters p
cp=anova(asdf)$Sum[n] / (anova(qwer)$Mean[m]) - (nn-2*p); cp=round(cp,4)
mat[p+1,1]=p; mat[p+1,2]=cp # adding p and Cp
rp=summary(asdf)$r.squared; rap=summary(asdf)$adj.r.squared; rp=round(rp,4); rap=round(rap,4)
mat[p+2,1]=rp; mat[p+2,2]=rap # adding Rp2 and Rap2
sse=anova(asdf)$Sum[n]; pre=MPV::PRESS(asdf); sse=round(sse,4); pre=round(pre,4)
mat[p+3,1]=sse; mat[p+3,2]=pre # adding SSE and PRESS
preds=data.frame(c[e]); predd=predict(asdf,newdata=preds) # I got the problem here!
mspr=sum((d-predd)^2) / length(d); mse=anova(asdf)$Mean[n]; mspr=round(mspr,4); mse=round(mse,4)
mat[p+4,1]=mse; mat[p+4,2]=mspr # adding MSE and MSPR
aic=nn*log(anova(asdf)$Sum[n]) - nn*log(nn) + 2*p; aic=round(aic,4)
bic=nn*log(anova(asdf)$Sum[n]) - nn*log(nn) + log(nn)*p; bic=round(bic,4)
mat[p+5,1]=aic; mat[p+5,2]=bic # adding AIC and BIC
rownames(mat)[p+1]="p&Cp"; rownames(mat)[p+2]="Rp.sq&Rap.sq"
rownames(mat)[p+3]="SSE&PRESS"; rownames(mat)[p+4]="MSE&MSPR"; rownames(mat)[p+5]="AIC&BIC"
asdf2=lm(d~as.matrix(c[e]))
qwer2=lm(d~as.matrix(c[1:max(e)]))
matt=round(coef(summary(asdf2))[,c(-3,-4)],4); matt2=matrix(0,5,2)
matt=rbind(matt,matt2); matt # matrix for coefficients and others(validation)
n2=nrow(anova(asdf2)); m2=nrow(anova(qwer2))
nn2=length(d) # To get size of sample size
p2=asdf$rank # To get parameters p
cp2=anova(asdf2)$Sum[n2] / (anova(qwer2)$Mean[m2]) - (nn2-2*p2); cp2=round(cp2,4)
matt[p2+1,1]=p2; matt[p2+1,2]=cp2 # adding p and Cp
rp2=summary(asdf2)$r.squared; rap2=summary(asdf2)$adj.r.squared; rp2=round(rp2,4); rap2=round(rap2,4)
matt[p2+2,1]=rp2; matt[p2+2,2]=rap2 # adding Rp2 and Rap2
sse2=anova(asdf2)$Sum[n]; pre2=MPV::PRESS(asdf2); sse2=round(sse2,4); pre2=round(pre2,4)
matt[p2+3,1]=sse2; matt[p2+3,2]=pre2 # adding SSE and PRESS
mse2=anova(asdf2)$Mean[n]; mse2=round(mse2,4)
matt[p2+4,1]=mse2; matt[p2+4,2]=NA # adding MSE and MSPR, in this case MSPR=0
aic2=nn2*log(anova(asdf2)$Sum[n2]) - nn2*log(nn2) + 2*p2; aic2=round(aic2,4)
bic2=nn2*log(anova(asdf2)$Sum[n2]) - nn2*log(nn2) + log(nn2)*p2; bic2=round(bic2,4)
matt[p2+5,1]=aic2; matt[p2+5,2]=bic2 # adding AIC and BIC
mat=cbind(mat,matt); colnames(mat)=c("Estimate","Std.Error","Val.Estimate","Val.Std.Error")
print(mat)
}
This function will provide useful statistics for model validation.
It returns a matrix with coefficients, p, Mallow's Cp, R.squared, R.adj.squared, SSE, PRESS, MSE, MSPR, AIC and BIC.
Everything works fine for the given data, except for MSPR, since the predict function doesn't work! It only returns the fitted values.
Can you try something like this? You have to make sure that both the training and test data have the same column names.
x <- rnorm(100)
y <- x + rnorm(100)
df <- data.frame(x = x, y=y)
# model fitting
fit <- lm(y ~ x, data=df)
predict(fit)
# creating new data
newx <- rnorm(50)
newdf <- data.frame(x = newx)
# making predictions
predict(fit, newdata = newdf)
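As a further sketch of the same point (my own illustration, with made-up column names x1, x2, x3 standing in for the surgical-unit variables): if the model is built from a formula that refers to column names, rather than to as.matrix(a[e]), then predict() can match those names in newdata.
a <- data.frame(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))  # model-building predictors
b <- with(a, x1 + 0.5 * x2 + rnorm(20))                          # model-building response
e <- c(1, 2)                                                     # columns of interest
c <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))  # validation predictors
train <- cbind(a[e], y = b)                        # training data with named columns
fml   <- reformulate(names(a)[e], response = "y")  # builds y ~ x1 + x2
fit   <- lm(fml, data = train)
predict(fit, newdata = c[e])                       # now really uses the validation predictors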

Changepoint Analysis in R

I am experimenting with the cpt and bcp packages in R. I import the following data from a CSV file:
Here is the CSV file link: http://www.filedropper.com/cpttest
Running bcp:
p.bcp=bcp(p$Rate);
plot(p.bcp)
Point 28 is a change point according to bcp.
When running cpt, I get no indication that a change point exists:
p.cpt=cpt.mean(p$Rate,method="AMOC")
p.cpt
Wondering if anyone could advise why cpt does not detect a change point?
See if the changepoint package satisfies your purpose.
Replace the word "method" in the call below with "PELT", "BinSeg", or "SegNeigh".
library(changepoint)
p.cpt <- changepoint::cpt.mean(p$Rate,method="method")
cpts(p.cpt)
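For example, a minimal sketch (assuming p$Rate is the numeric series read from your CSV):
library(changepoint)
p.cpt <- changepoint::cpt.mean(p$Rate, method = "PELT")
cpts(p.cpt)  # locations of any detected change points
plot(p.cpt)  # the series with the fitted segment means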
There are many change point packages in R and you could try others. I've compiled a list here. Disclosure: I am the developer of the mcp package.
To use mcp for your problem, do:
model = list(
  Rate ~ 1,  # Intercept
  ~ 1        # Another intercept
)
library(mcp)
fit = mcp(model, p)
plot(fit)
See the estimate of the change point (cp_1) below, as well as the other parameters of the model. The change point also shows up as a narrow blue distribution in the plot produced by plot(fit) above (not shown here).
summary(fit)
Family: gaussian(link = 'identity')
Iterations: 9000 from 3 chains.
Segments:
1: Rate ~ 1
2: Rate ~ 1 ~ 1
Population-level parameters:
name mean lower upper Rhat n.eff
cp_1 30.5009 3.0e+01 3.1e+01 1 6250
int_1 0.0132 1.3e-02 1.3e-02 1 5501
int_2 0.0071 6.8e-03 7.3e-03 1 5514
sigma_1 0.0008 6.8e-04 9.4e-04 1 5056

rpart package median or geometric mean instead of mean

Is it possible to change the estimator used for predictions within a region to something other than the mean, such as the median or geometric mean, using the rpart library in R (or another library)?
I believe my tree partitioning is highly affected by extreme values and I would like to build trees showing other estimators.
Thanks!
One of the usual tricks for right-skewed responses would be to take logs. In many applications this makes the response distribution more symmetric and then you don't need to switch from the usual mean predictions.
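A minimal sketch of that log trick with rpart (my own illustration, using the same cars data as below); note that exponentiating the predicted mean of log(dist) is effectively a geometric-mean-type prediction:
library("rpart")
data("cars", package = "datasets")
rp_log <- rpart(log(dist) ~ speed, data = cars)                    # tree on the log scale
exp(predict(rp_log, newdata = data.frame(speed = c(10, 15, 20))))  # back-transformed predictions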
Another option for changing how the tree is learned would be to use more robust scores, e.g., ranks. The ctree() function from the partykit package offers a nonparametric inference framework for this.
Finally, the partykit package also allows you to compute predictions other than the mean in the terminal nodes. You can easily transform rpart trees to party trees via as.party(). A very simple example would be to learn an rpart tree for the cars data:
library("rpart")
data("cars", package = "datasets")
rp <- rpart(dist ~ speed, data = cars)
And then transform it to party:
library("partykit")
pr <- as.party(rp)
The tree structure remains unchanged but you get enhanced plotting and predictions (the default plot method draws the full party-style tree display, not shown here).
Furthermore, the default predictions on both objects are the same.
nd <- data.frame(speed = c(10, 15, 20))
predict(rp, nd)
## 1 2 3
## 18.20000 39.75000 65.26316
predict(pr, nd)
## 1 2 3
## 18.20000 39.75000 65.26316
However, the latter allows you to specify a FUN argument that is applied in each of the nodes. This must be a function of the form function(y, w), where y is the response and w are the case weights. As we haven't used any weights here, we can simply ignore that argument and do:
predict(pr, nd, FUN = function(y, w) mean(y))
## 1 2 3
## 18.20000 39.75000 65.26316
predict(pr, nd, FUN = function(y, w) median(y))
## 1 2 3
## 18 35 64
predict(pr, nd, FUN = function(y, w) quantile(y, 0.9))
## 1 2 3
## 28.0 57.0 92.2
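In the same way, a geometric mean per node could be requested (a small sketch of my own; it assumes all responses in a node are strictly positive, as dist is here):
predict(pr, nd, FUN = function(y, w) exp(mean(log(y))))  # geometric mean per node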
And so on... See the package vignettes for more details.

Resources