Getting p-values for all included parameters using glmmLasso

I am fitting a mixed model using glmmLasso in R using the command:
glmmLasso(fix = Activity ~ Novelty + Valence + ROI + Novelty:Valence +
            Novelty:ROI + Valence:ROI + Novelty:Valence:ROI,
          rnd = list(Subject = ~1),
          data = KNov, lambda = 195, switch.NR = FALSE, final.re = TRUE)
To give you a sense of the data, the output of head(KNov) is:
Subject Activity ROI Novelty Valence Side STAIt
1 202 -0.4312944 H N E L -0.2993321
2 202 -0.6742497 H N N L -0.2993321
3 202 -1.0914216 H R E L -0.2993321
4 202 -0.6296091 H R N L -0.2993321
5 202 -0.6023507 H N E R -0.2993321
6 202 -1.1554196 H N N R -0.2993321
(I used KNov$Subject <- factor(KNov$Subject) to have Subject read as a categorical variable)
Activity is a measure of brain activity; Novelty and Valence are categorical variables coding the type of stimulus used to elicit the response; ROI is a categorical variable coding the three brain regions from which this activity was sampled; and Subject is an ID number for the individuals from whom the data were sampled (n = 94).
The output for glmmLasso is:
Fixed Effects:
Coefficients:
Estimate StdErr z.value p.value
(Intercept) 0.232193 0.066398 3.4970 0.0004705 ***
NoveltyR -0.190878 0.042333 -4.5089 6.516e-06 ***
ValenceN -0.164214 NA NA NA
ROIB 0.000000 NA NA NA
ROIH 0.000000 NA NA NA
NoveltyR:ValenceN 0.064523 0.077290 0.8348 0.4038189
NoveltyR:ROIB 0.000000 NA NA NA
NoveltyR:ROIH 0.000000 NA NA NA
ValenceN:ROIB -0.424670 0.069561 -6.1050 1.028e-09 ***
ValenceN:ROIH 0.000000 NA NA NA
NoveltyR:ValenceN:ROIB 0.000000 NA NA NA
NoveltyR:ValenceN:ROIH 0.000000 NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Random Effects:
StdDev:
Subject
Subject 0.6069078
I would like to get a p-value for the effect of Valence. My first thought was that the p-value for Valence was not included because it was non-significant and was only kept in the model because it is part of the significant ValenceN:ROIB interaction; however, NoveltyR:ValenceN was also non-significant, yet a p-value is given for it. I would like a p-value for Valence even if it is non-significant: this analysis is going to be used in a paper, and I prefer to report actual p-values rather than p > .05.

The problem here is most likely due to a "reduced rank" set of predictors, i.e. you have a lot of combinations where there are either no entries, or where some smaller subset of entries is sufficient to unambiguously predict the rest of the values. I suggest you run this code and notice that you get zero cells:
with(KNov, table(Novelty,
                 Valence,
                 ROI,
                 interaction(Novelty, Valence)))
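To confirm the rank deficiency directly, one option (a sketch using base R only, assuming the fixed-effects formula from the call above) is to build the design matrix and compare its numerical rank to its column count:
# Build the fixed-effects design matrix; a rank lower than ncol(X) means some terms are aliased
X <- model.matrix(~ Novelty * Valence * ROI, data = KNov)
qr(X)$rank   # numerical rank of the design matrix
ncol(X)      # number of columns; rank < ncol(X) confirms the deficiency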

Related

summary.manova output shows different p values from the summary.manova stats table and broom tidy()

I noticed that the summary.manova() function in R produces two different p-values: one in the table printed to the console, and the other in the stats table located in the summary object. Which p-values should be reported? The values are slightly different. I first noticed this when using the tidy() function from broom; it was reporting the p-values from the stats table, not from the console output.
I can recreate the problem using the iris data frame:
head(iris)
fit = manova(as.matrix(iris[,1:4]) ~ Species, data = iris)
fit_summary = summary.manova(fit, test = "Wilks")
fit_summary #output1
fit_summary$stats #output2
broom::tidy(fit, test = "Wilks") #output2
Nice reproducible example! From everything I can see here, the only differences are in output representation, not in the underlying values.
In the printed summary output, p-values less than a threshold are printed only as "<2.2e-16" (on the theory that you probably shouldn't be worrying about differences among tiny p-values anyway ...)
fit_summary #output1
Df Wilks approx F num Df den Df Pr(>F)
Species 2 0.023439 199.15 8 288 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If you explicitly extract the $stats component, then you get a value printed to R's default 7-digit precision:
> fit_summary$stats #output2
Df Wilks approx F num Df den Df Pr(>F)
Species 2 0.02343863 199.1453 8 288 1.365006e-112
Residuals 147 NA NA NA NA NA
If you use tidy, it returns a tibble rather than a data frame, which has a different set of defaults for output precision (i.e., it only reports 3 significant digits).
> broom::tidy(fit, test = "Wilks")
# A tibble: 2 x 7
term df wilks statistic num.df den.df p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 0.0234 199. 8 288 1.37e-112
2 Residuals 147 NA NA NA NA NA
All of these defaults can be reset: for example, ?tibble::formatting tells you that options(pillar.sigfig=7) will set the significant digits for tibble-printing to 7; ?options tells you that you can use options(digits=n) to change the defaults for base-R printing.
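For example, a quick sketch using the objects defined above to make all three representations print at comparable precision:
# Widen base-R printing and pull the full-precision p-value from the stats matrix
options(digits = 15)
fit_summary$stats["Species", "Pr(>F)"]
# Widen tibble printing to 7 significant digits
options(pillar.sigfig = 7)
broom::tidy(fit, test = "Wilks")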

Bonferroni Simultaneous Confidence Intervals of differences in means

I am trying to obtain Bonferroni simultaneous confidence intervals in R. I have the following data set that I made up for practice:
df2 <- read.table(textConnection(
'group value
1 25
2 36
3 42
4 50
1 27
2 35
3 49
4 57
1 22
2 37
3 45
4 51'), header = TRUE)
I have tried
aov(formula = value ~ group, data = df2)
However, this doesn't output simultaneous confidence intervals. Using SAS, the calculations should come out as: [SAS output image not reproduced]
There seem to be some conceptual/coding mistakes here:
1. df2$group needs to be a categorical variable for your ANOVA to work; at the moment it is numeric.
2. You want to perform what's called a post-hoc analysis, to correct the ANOVA p-values for multiple group comparisons.
Here is an example using the R package DescTools, based on the sample data you give:
# Step 1: Make sure that group is a factor
df2$group <- as.factor(df2$group)
# Step 2: Perform the ANOVA
res <- aov(formula = value ~ group, data = df2)
# Step 3: Perform the post-hoc analysis
require(DescTools)
PostHocTest(res, method = "bonferroni")
#
# Posthoc multiple comparisons of means : Bonferroni
# 95% family-wise confidence level
#
#$group
# diff lwr.ci upr.ci pval
#2-1 11.333333 3.0519444 19.61472 0.00855 **
#3-1 20.666667 12.3852778 28.94806 0.00014 ***
#4-1 28.000000 19.7186111 36.28139 1.5e-05 ***
#3-2 9.333333 1.0519444 17.61472 0.02648 *
#4-2 16.666667 8.3852778 24.94806 0.00067 ***
#4-3 7.333333 -0.9480556 15.61472 0.09062 .
#
#---
#Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The reported differences between the group means and confidence intervals match the SAS numbers you give.
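As a cross-check, base R's pairwise.t.test() (which uses a pooled SD by default, like the ANOVA) should reproduce the Bonferroni-adjusted p-values, although it does not report confidence intervals:
# Bonferroni-adjusted pairwise comparisons, base R only
pairwise.t.test(df2$value, df2$group, p.adjust.method = "bonferroni")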

Nested ANOVA with time data

I am trying to perform a nested ANOVA including two factors.
Essentially, I have a time variable, measured every week over one year. I want to explore differences among seasons and months, so I have assigned the months to the four seasons (Winter: Jan, Feb, March; Spring: April, May, June; Summer: July, Sept; Autumn: Oct, Nov, Dec), resulting in an unbalanced nested design.
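(For reference, a season factor like this can be built from a month factor in base R; the object and level names below are illustrative, not taken from the original data.)
# Map each month level to its season; `month` is assumed to be a factor of month names
season_of <- c(Jan = "Winter", Feb = "Winter", March = "Winter",
               April = "Spring", May = "Spring", June = "Spring",
               July = "Summer", Sept = "Summer",
               Oct = "Autumn", Nov = "Autumn", Dec = "Autumn")
season <- factor(season_of[as.character(month)])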
> modello <- lm(formula = y ~ season + season:month)
> anova(modello)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
season 3 178811 59604 144.216 < 2.2e-16 ***
season:month 7 41335 5905 14.287 < 2.2e-16 ***
Residuals 493 203754 413
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
However, the df of the season:month term does not seem correct: the df formula for a nested ANOVA is A(B-1), which in my case is 4(11-1). I also performed a Tukey test, but most of the results were NA:
$season
diff lwr upr p adj
Spring-Autumn 32.93056 26.453002 39.408109 0e+00
Summer-Autumn 15.14239 8.303663 21.981119 1e-07
Winter-Autumn -16.66300 -23.360587 -9.965413 0e+00
Summer-Spring -17.78816 -24.342077 -11.234252 0e+00
Winter-Spring -49.59356 -56.000055 -43.187056 0e+00
Winter-Summer -31.80539 -38.576856 -25.033926 0e+00
$`season:month`
diff lwr upr p adj
Spring:April-Autumn:April NA NA NA NA
Summer:April-Autumn:April NA NA NA NA
Winter:April-Autumn:April NA NA NA NA
Autumn:December-Autumn:April NA NA NA NA
...
Which would be the correct procedure?
Thank you in advance for your help,
Ennio

R coxph() with interaction term, Warning: X matrix deemed to be singular

Please be patient with me. I'm new to this site.
I am modeling turtle nest survival using the coxph() function and have run into a confusing problem with an interaction term between species and nest cages. I have nests from 3 species of turtles (7, 10, and 111 nests per species).
Species 1 (7 nests): all nests have cages.
Species 2 (10 nests): none of the nests have cages.
Species 3 (111 nests): about half of the nests have cages.
Here is my model with the summary output:
S<-Surv(time, event)
n8<-coxph(S~species:cage, data=nesta1)
Warning message:
In coxph(S ~ species:cage, data = nesta1) :
X matrix deemed to be singular; variable 1 5 6
summary(n8)
Call:
coxph(formula = S ~ species:cage, data = nesta1)
n= 128, number of events= 73
coef exp(coef) se(coef) z Pr(>|z|)
species1:cageN NA NA 0.0000 NA NA
species2:cageN 1.2399 3.4554 0.3965 3.128 0.00176 **
species3:cageN 0.5511 1.7351 0.2664 2.068 0.03860 *
species1:cageY -0.1054 0.8999 0.6145 -0.172 0.86379
species2:cageY NA NA 0.0000 NA NA
species3:cageY NA NA 0.0000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
species1:cageN NA NA NA NA
species2:cageN 3.4554 0.2894 1.5887 7.515
species3:cageN 1.7351 0.5763 1.0293 2.925
species1:cageY 0.8999 1.1112 0.2698 3.001
species2:cageY NA NA NA NA
species3:cageY NA NA NA NA
Concordance= 0.61 (se = 0.038 )
Rsquare= 0.079 (max possible= 0.993 )
Likelihood ratio test= 10.57 on 3 df, p=0.01426
Wald test = 11.36 on 3 df, p=0.009908
Score (logrank) test = 12.22 on 3 df, p=0.006672
I understand that I would have singularities for species 1 and 2, but not for species 3. Why would the "species3:cageY" line be singular when there are species 3 nests with nest cages on them?
Is it ok to include species 1 and 2 even though they have those singularities?
Edit: I cannot find any errors in my data. I have decimal numbers for the time variable for a few nests, but that doesn't seem to be a problem for species 3 nests without a nest cage. For species 3, I have the full range of time values for nests with and without a nest cage and I have both true and false events for nests with and without a nest cage.
Edit:
with( nesta1, table(event, species, cage))
, , cage = N
species
event 1 2 3
0 0 1 24
1 0 9 38
, , cage = Y
species
event 1 2 3
0 4 0 26
1 3 0 23
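The checks described in the first edit can be sketched like this (column names as in the model above):
# For species 3, compare the range of time values and the event mix with vs. without a cage
with(subset(nesta1, species == 3), tapply(time, cage, range))
with(subset(nesta1, species == 3), table(event, cage))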
Edit 2: I understand that interaction-only models are not very useful, but the interaction term results behave the same way whether I have other main effects in the model or not. I've removed the other main effects to simplify this question.
Thank you!

How can I compare regression coefficients across three (or more) groups using R?

Sometimes your research may predict that the size of a regression coefficient may vary across groups. For example, you might believe that the regression coefficient of height predicting weight would differ across three age groups (young, middle age, senior citizen). Below, we have a data file with 3 fictional young people, 3 fictional middle age people, and 3 fictional senior citizens, along with their height and their weight. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
So, how can I compare regression coefficients (slope mainly) across three (or more) groups using R?
Sample data:
age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269
There is an elegant answer to this on Cross Validated. But briefly:
require(emmeans)
data <- data.frame(age = factor(c(1,1,1,2,2,2,3,3,3)),
                   height = c(56,60,64,56,60,64,74,75,82),
                   weight = c(140,155,142,117,125,133,245,241,269))
model <- lm(weight ~ height*age, data)
anova(model) #check the results
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
height 1 25392.3 25392.3 481.5984 0.0002071 ***
age 2 2707.4 1353.7 25.6743 0.0129688 *
height:age 2 169.0 84.5 1.6027 0.3361518
Residuals 3 158.2 52.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
slopes <- emtrends(model, 'age', var = 'height') #gets each slope
slopes
age height.trend SE df lower.CL upper.CL
1 0.25 1.28 3 -3.84 4.34
2 2.00 1.28 3 -2.09 6.09
3 3.37 1.18 3 -0.38 7.12
Confidence level used: 0.95
pairs(slopes) #gets their comparisons two by two
contrast estimate SE df t.ratio p.value
1 - 2 -1.75 1.82 3 -0.964 0.6441
1 - 3 -3.12 1.74 3 -1.790 0.3125
2 - 3 -1.37 1.74 3 -0.785 0.7363
P value adjustment: tukey method for comparing a family of 3 estimates
To determine whether the regression coefficients "differ across three age groups", we can use the anova function in R. For example, using the data in the question, shown reproducibly in Note 4 at the end:
fm1 <- lm(weight ~ height, DF)
fm3 <- lm(weight ~ age/(height - 1), DF)
This gives the following output, which is significant at the 2.7% level, so we would conclude that there are differences in the regression coefficients across the groups if we were using a 5% cutoff, but not if we were using a 1% cutoff. The age/(height - 1) in the formula for fm3 means that height is nested in age and the overall intercept is omitted; thus the model estimates separate intercepts and slopes for each age group. This is equivalent to age + age:height - 1 (a quick check of this equivalence follows the output below).
> anova(fm1, fm3)
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ age/(height - 1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 2991.57
2 3 149.01 4 2842.6 14.307 0.02696 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
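As a quick sketch to verify that equivalence (using DF from Note 4 below):
# The two parameterisations give identical fitted values
fm3b <- lm(weight ~ age + age:height - 1, DF)
all.equal(fitted(fm3), fitted(fm3b))   # TRUE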
Note 1: Above, fm3 has 6 coefficients: an intercept and a slope for each group. If you want 4 coefficients, a common intercept and separate slopes, then use
lm(weight ~ age:height, DF)
Note 2: We can also compare a model in which subsets of levels are the same. For example, we can compare a model in which ages 1 and 2 are the same to models in which they are all the same (fm1) and all different (fm3):
fm2 <- lm(weight ~ age/(height - 1), transform(DF, age = factor(c(1, 1, 3)[age])))
anova(fm1, fm2, fm3)
If you do a large number of tests, some can come out significant just by chance, so you will want to lower the cutoff for p-values.
Note 3: There are some notes on lm formulas here: https://sites.google.com/site/r4naturalresources/r-topics/fitting-models/formulas
Note 4: We used this as the input:
Lines <- "age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269"
DF <- read.table(text = Lines, header = TRUE)
DF$age <- factor(DF$age)
