Get lm estimate for each categorical variable - r

So I am doing a multiple linear regression to see if fracture density and rock type effect retreat rates in rocks.
retreat <- lm(retreat_rate ~ fracture_dens + rock_unit, data = coast)
> summary(retreat)
I would like it to treat the 'rock_unit' as a category. I have two rock types in the vector. Here is my current result.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.22631 0.53806 -0.421 0.676353
fracture_dens 0.11467 0.02704 4.241 0.000132 ***
rock_unitSC_mudstone 1.73490 0.36097 4.806 2.3e-05 ***
I would like there to be 'SC_mudstone' and 'Purisima' (the other rock type) instead of the 'rock_unitSC_mudstone' it is giving me now.

this is the typical outcome for linear models: the variable rock_unitSC_mudstoneis a dummy variable which is defined as:
rock_unitSC_mudstone = 1 if rock unit = SC_mudstone and 0 otherwise.
Adding a further variable rock_unitPurisima would cause the model matrix $X$ to not have full rank.
Anyway, you do not need the rock_unitPurisima variable. You can interpret the results as follows:
Average retreat rate for SC_mudstone = -0.22631 + 1.73490
Average retreat rate for Purisima = -0.22631
If you insist on a variable rock_unitPurisimayou can set the intercept to zero:
retreat2 <- lm(retreat_rate ~ 0 + fracture_dens + rock_unit, data = coast)
But as I said, an intercept and both dummy variables would simply contain too much information.
Hope that this was helpful.

Related

Weird plots when plotting logistic regression residuals vs predictor variables?

I have fitted a logistic regression for an outcome (a type of side effect - whether patients have this or not). The formula and results of this model is below:
model <- glm(side_effect_G1 ~ age + bmi + surgerytype1 + surgerytype2 + surgerytype3 + cvd + rt_axilla, family = 'binomial', data= data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.888112 0.859847 -9.174 < 2e-16 ***
age 0.028529 0.009212 3.097 0.00196 **
bmi 0.095759 0.015265 6.273 3.53e-10 ***
surgery11 0.923723 0.524588 1.761 0.07826 .
surgery21 1.607389 0.600113 2.678 0.00740 **
surgery31 1.544822 0.573972 2.691 0.00711 **
cvd1 0.624692 0.290005 2.154 0.03123 *
rt1 -0.816374 0.353953 -2.306 0.02109 *
I want to check my models, so I have plotted residuals against predictors or fitted values. I know, if a model is properly fitted, there should be no correlation between residuals and predictors and fitted values so I essentially run...
residualPlots(model)
My plots look funny because from what I have seen from examples online, is that it should be symmetrical around 0. Also, my factor variables aren't shown in box-plots although I have checked the structure of my data and coded surgery1, surgery2, surgery4,cvd,rt as factors. Can someone help me interpret my plots and guide me how to plot boxplots for my factor variables?
Thanks
Your label or response variable is expected for an imbalanced dataset. From your plots most of your residuals actually go below the dotted line, so I suspect this is the case.
Very briefly, the symmetric around residuals only holds for logistic regression when your classes are balanced. If it is heavily imbalanced towards the reference label (or 0 label), the intercept will be forced towards a low value (i.e the 0 label), and you will see that positive labels will have a very large pearson residual (because they deviate a lot from the expected). You can read more about imbalanced class and logistic regression in this post
Here's an example to demonstrate this, using a dataset where you see the evenly distributed residues :
library(mlbench)
library(car)
data(PimaIndiansDiabetes)
table(PimaIndiansDiabetes$diabetes)
neg pos
500 268
mdl = glm(diabetes ~ .,data=PimaIndiansDiabetes,family="binomial")
residualPlots(mdl)
Let's make it more imbalanced, and you get a plot exactly like yours:
da = PimaIndiansDiabetes
wh = c(which(da$diabetes=="neg"),which(da$diabetes == "pos")[1:100])
da = da[wh,]
table(da$diabetes)
neg pos
500 100
mdl = glm(diabetes ~ .,data=da,family="binomial")
residualPlots(mdl)

Two-level modelling with lme in R

I am interested in estimating a mixed effect model with two random components (I am sorry for the somewhat unprecise notation. I am somewhat new to these kind of models). Finally, I also want also the standard errors of the variances of the random components. That is why I am somewhat boudn to using the package lme. The reason is that I found this description on how to calculate those standard errors and also interesting, the standard error for function of these variances link.
I believe I know how to use the package lmer. I am finally interested in model2. For the model1, both command yield the same estimates. But model2 with lme yields different results than model2 with lmer from the lme4 package. Could you help me to get around how to set up the random components for lme? This would be much appreciated. Thanks. Please find attached my MWE.
Best
Daniel
#### load all packages #####
loadpackage <- function(x){
for( i in x ){
# require returns TRUE invisibly if it was able to load package
if( ! require( i , character.only = TRUE ) ){
# If package was not able to be loaded then re-install
install.packages( i , dependencies = TRUE )
}
# Load package (after installing)
library( i , character.only = TRUE )
}
}
# Then try/install packages...
loadpackage( c("nlme", "msm", "lmeInfo", "lme4"))
alcohol1 <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", header=T, sep=",")
attach(alcohol1)
id <- as.factor(id)
age <- as.factor(age)
model1.lmer <-lmer(alcuse ~ 1 + peer + (1|id))
summary(model1.lmer)
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age))
summary(model2.lmer)
model1.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id, method ="REML")
summary(model1.lme)
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id + 1|age, method ="REML")
Edit (15/09/2021):
Estimating the model as follows end then returning the estimates via nlme::VarCorr gives me different results. While the estimates seem to be in the ball park, it is as they are switched across components.
model2a.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2a.lme)
nlme::VarCorr(model2a.lme)
Variance StdDev
id = pdLogChol(1)
(Intercept) 0.38390274 0.6195989
age = pdLogChol(1)
(Intercept) 0.47892113 0.6920413
Residual 0.08282585 0.2877948
EDIT (16/09/2021):
Since Bob pushed me to think more about my model, I want to give some additional information. Please know that the data I use in the MWE do not match my true data. I just used it for illustrative purposes since I can not upload myy true data. I have a household panel with income, demographic informations and parent indicators.
I am interested in intergenerational mobility. Sibling correlations of permanent income are one industry standard. At the very least, contemporanous observations are very bad proxies of permanent income. Due to transitory shocks, i.e., classical measurement error, those estimates are most certainly attenuated. For this reason, we exploit the longitudinal dimension of our data.
For sibling correlations, this amounts to hypothesising that the income process is as follows:
$$Y_{ijt} = \beta X_{ijt} + \epsilon_{ijt}.$$
With Y being income from individual i from family j in year t. X comprises age and survey year indicators to account for life-cycle effects and macroeconmic conditions in survey years. Epsilon is a compund term comprising a random individual and family component as well as a transitory component (measurement error and short lived shocks). It looks as follows:
$$\epsilon_{ijt} = \alpha_i + \gamma_j + \eta_{ijt}.$$
The variance of income is then:
$$\sigma^2_\epsilon = \sigma^2_\alpha + \sigma^2\gamma + \sigma^2\eta.$$
The quantitiy we are interested in is
$$\rho = \frac(\sigma^2\gamma}{\sigma^2_\alpha + \sigma^2\gamma},$$
which reflects the share of shared family (and other characteristics) among siblings of the variation in permanent income.
B.t.w.: The struggle is simply because I want to have a standard errors for all estimates and for \rho.
This is an example of crossed vs nested random effects. (Note that the example you refer to is fitting a different kind of model, a random-slopes model rather than a model with two different grouping variables ...)
If you try with(alcohol1, table(age,id)) you can see that every id is associated with every possible age (14, 15, 16). Or subset(alcohol1, id==1) for example:
id age coa male age_14 alcuse peer cpeer ccoa
1 1 14 1 0 0 1.732051 1.264911 0.2469111 0.549
2 1 15 1 0 1 2.000000 1.264911 0.2469111 0.549
3 1 16 1 0 2 2.000000 1.264911 0.2469111 0.549
There are three possible models you could fit for a model with random effects of age(indexed by i) and id (indexed by j)
crossed ((1|age) + (1|id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_j +epsr_{ij}; alcohol use varies among individuals and, independently, across ages (this model won't work very well because there are only three distinct ages in the data set, more levels are usually needed)
id nested within age ((1|age/id) = (1|age) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_{ij} + epsr_{ij}; alcohol use varies across ages, and varies across individuals within ages (see note above about number of levels).
age nested within id ((1|id/age) = (1|id) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_j + eps2_{ij} + epsr_{ij}; alcohol use varies across individuals, and varies across ages within individuals
Here eps1_i, eps2_{ij}, and epsr_{ij} are normal deviates; epsr is the residual error term.
The latter two models actually don't make sense in this case; because there is only one observation per age/id combination, the nested variance (eps2) is completely confounded with the residual variance (epsr). lme doesn't notice this; if you try to fit one of the nested models in lmer it will give an error that
number of levels of each grouping factor must be < number of observations (problems: id:age)
(Although if you try to compute confidence intervals based on model1.lme you'll get an error "cannot get confidence intervals on var-cov components: Non-positive definite approximate variance-covariance", which is a hint that something is wrong.)
You could restate this problem as saying that the residual variation, and the variation among ages within individuals, are jointly unidentifiable (can't be separated from each other, statistically).
The updated answer here shows how to get the standard errors of the variance components from an lmer model, so you shouldn't be stuck with lme (but you should think carefully about which model you're really trying to fit ...)
The GLMM FAQ might also be useful.
More generally, the standard error of
rho = (V_gamma)/(V_alpha + V_gamma)
will be hard to compute accurately, because this is a nonlinear function of the model parameters. You can apply the delta method, but the most reliable approach would be to use parametric bootstrapping: if you have a fitted model m, then something like this should work:
var_ratio <- function(m) {
v <- as.data.frame(sapply(VarCorr(m), as.numeric))
return(v$family/(v$family + v$id))
}
confint(m, method="boot", FUN =var_ratio)
You should specify random effects in lme by using / not +
By lmer
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age), data = alcohol1)
summary(model2.lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: alcuse ~ 1 + peer + (1 | id) + (1 | age)
Data: alcohol1
REML criterion at convergence: 651.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.0228 -0.5310 -0.1329 0.5854 3.1545
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.08078 0.2842
age (Intercept) 0.30313 0.5506
Residual 0.56175 0.7495
Number of obs: 246, groups: id, 82; age, 82
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.3039 0.1438 2.113
peer 0.6074 0.1151 5.276
Correlation of Fixed Effects:
(Intr)
peer -0.814
By lme
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2.lme)
Linear mixed-effects model fit by REML
Data: alcohol1
AIC BIC logLik
661.3109 678.7967 -325.6554
Random effects:
Formula: ~1 | id
(Intercept)
StdDev: 0.4381206
Formula: ~1 | age %in% id
(Intercept) Residual
StdDev: 0.4381203 0.7494988
Fixed effects: alcuse ~ 1 + peer
Value Std.Error DF t-value p-value
(Intercept) 0.3038946 0.1438333 164 2.112825 0.0361
peer 0.6073948 0.1151228 80 5.276060 0.0000
Correlation:
(Intr)
peer -0.814
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.0227793 -0.5309669 -0.1329302 0.5853768 3.1544873
Number of Observations: 246
Number of Groups:
id age %in% id
82 82
Okay, finally. Just to sketch my confidential data: I have a panel of individuals. The data includes siblings, identified via mnr. income is earnings, wavey survey year, age age factors. female a factor for gender, pid is the factor identifying the individual.
m1 <- lmer(income ~ age + wavey + female + (1|pid) + (1 | mnr),
data = panel)
vv <- vcov(m1, full = TRUE)
covvar <- vv[58:60, 58:60]
covvar
3 x 3 Matrix of class "dgeMatrix"
cov_pid.(Intercept) cov_mnr.(Intercept) residual
[1,] 2.6528679 -1.4624588 -0.4077576
[2,] -1.4624588 3.1015001 -0.0597926
[3,] -0.4077576 -0.0597926 1.1634680
mean <- as.data.frame(VarCorr(m1))$vcov
mean
[1] 17.92341 16.86084 56.77185
deltamethod(~ x2/(x1+x2), mean, covvar, ses =TRUE)
[1] 0.04242089
The last scalar should be what I interprete as the shared background of the siblings of permanent income.
Thanks to #Ben Bolker who pointed me into this direction.

R: predict.averaging is not taking an offset into account when plotting

I'm currently trying to use the predict.averaging function in MuMIn to create some graphs from some model averaging I've done on some GLMMs. I'm interested in whether the number of insects caught per daylight hour in some traps changes when the traps are left out for different lengths of time; I included offset(log(Daylight)) in my GLMMs to account for this. But when I use the predict function it doesn't take the offset into account and I get the same graph that I get if I hadn't included the offset in the first place. But I know the offset is having an effect due to the output from my model averaged GLMMs, and it's the kind of effect I would expect from observations of my data.
Does anyone know why this problem might be and how I might make predict.averaging take the offset into account? I've included the code that I'm using below:
# global model for total insect abundance
glmm11 <- glmmadmb(Total_polls ~ Max_temp+Wind+Precipitation+Veg_height+Season+Year+log(Mean.nectar+1)+I(log(Nectar+1)-log(Mean.nectar+1))+Pan_colour*Assoc_col+Treatment*Area*Depth+(1|Transect)+offset(log(Daylight)), data = ab, zeroInflation = FALSE, family = "nbinom")
# make predictions based on model averaging output (subset delta < 2)
preds<-predict(ave21, full = F, type = "response", backtransform = FALSE) # on the response scale
Where ave21 is a model averaging object generated using pdredge and model.avg that was constrained to have the offset in every model: model11 <- pdredge(glmm11, cluster = clust, fixed = ~offset(log(Daylight))+(1|Transect)). The object itself looks like this:
Call:
model.avg(object = get.models(object = model11, subset = delta <
2))
Component model call:
glmmadmb(formula = Total_polls ~ <3 unique rhs>, data = ab, family = nbinom,
zeroInflation = FALSE)
Component models:
df logLik AICc delta weight
1/2/3/4/5/6/7/8/9/10/11/12 20 -864.14 1769.22 0.00 0.47
1/2/3/4/5/6/7/8/9/10/11/12/13 23 -861.39 1770.03 0.81 0.31
1/3/4/5/6/7/8/9/10/11/12 19 -865.97 1770.79 1.57 0.21
Term codes:
Area Assoc_col
1 2
Depth I(log(Nectar + 1) - log(Mean.nectar + 1))
3 4
Max_temp Pan_colour
5 6
Season Treatment
7 8
Year log(Mean.nectar + 1)
9 10
offset(log(Daylight)) Area:Depth
11 12
Assoc_col:Pan_colour
13
Which I then used to get predictions:
pred_results<-cbind(glmm21$frame, preds) # append original dataframe to predictions
plot(pred_results$preds~pred_results$Treatment) # Treatment = trap duration (hours)
This code might go around the houses a little as I borrowed it off of a fellow PhD student. The graph I get when I plot my predictions looks like this:[Model predictions vs. Trap duration (hours)][1], which is very different from the view given by the summary results of my model averaging:
(conditional average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) -5.896725 0.948102 0.949386 6.211 < 2e-16 ***
Treatment24 -0.714283 0.130226 0.130403 5.478 < 2e-16 ***
Treatment48 -0.983881 0.122416 0.122582 8.026 < 2e-16 ***
Any help would be great, as I can't find any specific instances where this has been addressed on the site to date. Thank you in advance and please let me know if you need me to add anything to make this question better.
Tom
[1]:
https://i.stack.imgur.com/Pn4dK.jpg

GAM: why mgcv::gam provides different results regarding to the order of the levels of the explanatory variable

I am trying to get the seasonal trend of two groups of individuals using GAMMs. I performed two analysis changing the order of the levels of the explanatory variable in order to get one plot of the seasonal trend for each level.
However, I am surprised with the output of the two GAMMs because they vary according to the order of the levels of the explanatory variable. I expected that the results would be the same because the data and the model are the same in both occasions. However, as you can see below, the results vary the inference of the data studied.
My database contained the next variables:
Species: 4 levels
Populations: 20 levels
Reproductive_State: 2 levels
Survival_probability: range [0-1]
Year
Month
Fortnight: from 1 to 26 (called Seasonality in analysis)
I am trying to get a descriptive estimates and plots of the "common seasonal survival of the species" checking the existence of differences between the two levels of the variable reproductive_state.
In order to check it I performed did:
# Specify the contrast: Reproductive group
data$Reproductive_Group <- as.factor (data$Reproductive_State)
data$Reproductive_Group <- as.ordered(data$Reproductive_Group )
contrasts(data$Reproductive_Group )<-'contr.treatment'
model_1 <- gam (Survival_probability ~ Reproductive_Group + s(Seasonality) + s(Seasonality, by=Reproductive_Group ), random=list(Species=~1, Population=~1), family=quasibinomial, data=data)
later I change the order of the levels of the Reproductive_Group and perform the same analysis:
data$Reproductive_Group <- factor (data$Reproductive_Group , levels=c("phiNB", "phiB"))
levels (data$Reproductive_Group )
model_2 <- gam (Survival_probability ~ Reproductive_Group + s(Seasonality) + s(Seasonality, by=Reproductive_Group ), random=list(Species=~1, Population=~1), family=quasibinomial, data=data)
In the first model the output is:
Formula:
Survival_probability ~ +s(Seasonality) + s(Seasonality, by = Rep_Group)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.83569 0.01202 152.8 <2e-16 ***
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Seasonality) 3.201 3.963 2.430 0.05046 .
s(Seasonality):Rep_GroupphiNB 5.824 6.956 2.682 0.00991 **
whereas the output of the second model is:
Formula:
Survival_probability ~ +s(Seasonality) + s(Seasonality, by = Rep_Group)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.83554 0.01205 152.4 <2e-16 ***
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Seasonality) 5.927 7.061 6.156 3.66e-07 ***
s(Seasonality):Rep_GroupphiB 3.218 3.981 1.029 0.411
Furthermore I have attached the plots of the two models:
Group_B_as_second_level
Group_NB_as_second_level
I thought that the plot of the seasonality should be the same for both analysis, as long as it represents exclusively the seasonality. However if the seasonality reflects the seasonal trend of the other level, the plot 1 of the first picture should match with the plot 2 of the second picture and viceversa, and they donĀ“t do it.
To note, that I followed the blog Overview GAMM analysis of time series data for writting the formula and checking the differences of the seasonal trend accross the two reproductive state.
Do you know why I obtain different results with these two models?

Testing differences in coefficients including interactions from piecewise linear model

I'm running a piecewise linear random coefficient model testing the influence of a covariate on the second piece. Thereby, I want to test whether the coefficient of the second piece under the influence of the covariate (piece2 + piece2:covariate) differs from the coefficient of the first piece (piece1), hence whether the growth rate differs.
I set up some exemplary data:
set.seed(100)
# set up dependent variable
temp <- rep(seq(0,23),50)
y <- c(rep(seq(0,23),50)+rnorm(24*50), ifelse(temp <= 11, temp + runif(1200), temp + rnorm(1200) + (temp/sqrt(temp))))
# set up ID variable, variables indicating pieces and the covariate
id <- sort(rep(seq(1,100),24))
piece1 <- rep(c(seq(0,11), rep(11,12)),100)
piece2 <- rep(c(rep(0,12), seq(1,12)),100)
covariate <- c(rep(0,24*50), rep(c(rep(0,12), rep(1,12)), 50))
# data frame
example.data <- data.frame(id, y, piece1, piece2, covariate)
# run piecewise linear random effects model and show results
library(lme4)
lmer.results <- lmer(y ~ piece1 + piece2*covariate + (1|id) , example.data)
summary(lmer.results)
I came across the linearHypothesis() command from the car package to test differences in coefficients. However, I could not find an example on how to use it when including interactions.
Can I even use linearHypothesis() to test this or am I aiming for the wrong test?
I appreciate your help.
Many thanks in advance!
Mac
Assuming your output looks like this
Estimate Std. Error t value
(Intercept) 0.26293 0.04997 5.3
piece1 0.99582 0.00677 147.2
piece2 0.98083 0.00716 137.0
covariate 2.98265 0.09042 33.0
piece2:covariate 0.15287 0.01286 11.9
if I understand correctly what you want, you are looking for the contrast:
piece1-(piece2+piece2:covariate)
or
c(0,1,-1,0,-1)
My preferred tool for this is function estimable in gmodels; you could also do it by hand or with one of the functions in Frank Harrel's packages.
library(gmodels)
estimable(lmer.results,c(0,1,-1,0,-1),conf.int=TRUE)
giving
Estimate Std. Error p value Lower.CI Upper.CI
(0 1 -1 0 -1) -0.138 0.0127 0 -0.182 -0.0928

Resources