R coxph() with interaction term, Warning: X matrix deemed to be singular

Please be patient with me. I'm new to this site.
I am modeling turtle nest survival using the coxph() function and have run into a confusing problem with an interaction term between species and nest cages. I have nests from 3 species of turtles (7, 10, and 111 nests per species).
There are nest cages on all nests for the species(1) with 7 nests.
There are no nest cages on all the nests for the species(2) with 10 nests.
There are nest cages on about half of the nests for the species(3) with 111 nests.
Here is my model with the summary output:
S<-Surv(time, event)
n8<-coxph(S~species:cage, data=nesta1)
Warning message:
In coxph(S ~ species:cage, data = nesta1) :
X matrix deemed to be singular; variable 1 5 6
summary(n8)
Call:
coxph(formula = S ~ species:cage, data = nesta1)
n= 128, number of events= 73
coef exp(coef) se(coef) z Pr(>|z|)
species1:cageN NA NA 0.0000 NA NA
species2:cageN 1.2399 3.4554 0.3965 3.128 0.00176 **
species3:cageN 0.5511 1.7351 0.2664 2.068 0.03860 *
species1:cageY -0.1054 0.8999 0.6145 -0.172 0.86379
species2:cageY NA NA 0.0000 NA NA
species3:cageY NA NA 0.0000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
species1:cageN NA NA NA NA
species2:cageN 3.4554 0.2894 1.5887 7.515
species3:cageN 1.7351 0.5763 1.0293 2.925
species1:cageY 0.8999 1.1112 0.2698 3.001
species2:cageY NA NA NA NA
species3:cageY NA NA NA NA
Concordance= 0.61 (se = 0.038 )
Rsquare= 0.079 (max possible= 0.993 )
Likelihood ratio test= 10.57 on 3 df, p=0.01426
Wald test = 11.36 on 3 df, p=0.009908
Score (logrank) test = 12.22 on 3 df, p=0.006672
I understand that I would have singularities for species 1 and 2, but not for species 3. Why would the "species3:cageY" line be singular when there are species 3 nests with nest cages on them?
Is it ok to include species 1 and 2 even though they have those singularities?
Edit: I cannot find any errors in my data. I have decimal numbers for the time variable for a few nests, but that doesn't seem to be a problem for species 3 nests without a nest cage. For species 3, I have the full range of time values for nests with and without a nest cage and I have both true and false events for nests with and without a nest cage.
Edit:
with( nesta1, table(event, species, cage))
, , cage = N
species
event 1 2 3
0 0 1 24
1 0 9 38
, , cage = Y
species
event 1 2 3
0 4 0 26
1 3 0 23
Edit 2: I understand that interaction-only models are not very useful, but the interaction term results behave the same way whether I have other main effects in the model or not. I've removed the other main effects to simplify this question.
Thank you!
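One way to see where the three NA rows come from is to rebuild the design matrix from the cell counts in the table above. The sketch below is not from the original post: it uses only base R, ignores the survival times, and the object names (cells, design, X) are made up for illustration. It relies on the fact that a Cox model has no intercept, because any constant is absorbed into the baseline hazard; centering the columns is used here to mimic that.

```r
# Cell counts taken from table(event, species, cage), summed over event
cells <- data.frame(
  species = factor(c(1, 2, 3, 3)),
  cage    = factor(c("Y", "N", "N", "Y"), levels = c("N", "Y")),
  n       = c(7, 10, 62, 49)
)

# One row per nest (128 rows in total)
design <- cells[rep(seq_len(nrow(cells)), cells$n), c("species", "cage")]

# Interaction-only design; drop the intercept column, as coxph() does
X <- model.matrix(~ species:cage, data = design)[, -1]

# Number of nests behind each column; empty cells give all-zero columns
colSums(X)

# Columns that remain estimable once a constant is removed
estimable <- qr(scale(X, center = TRUE, scale = FALSE))$rank
estimable
```

Under these assumptions, two of the six interaction columns are all zero because the cells are empty (species 1 without a cage, species 2 with a cage), which would account for variables 1 and 5 in the warning. The four remaining indicator columns add up to 1 for every nest, so only three of them are estimable and the last one, species3:cageY, ends up as the reference group. That agrees with the 3 df reported by the tests in the summary.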

Related

Nested ANOVA with time data

I am trying to perform a nested ANOVA including two factors.
Essentially, I have a time variable, which has been measured every week along one year. I want to explore differences among seasons and months, therefore I have assigned three different months to the four seasons (seasons(months)=Winter(Jan, Feb, March), Spring(April, May, June), Summer(July, Sept), Autumn(Oct, Nov, Dec)), resulting in a nested unbalanced design.
> modello <- lm(formula = y ~ season + season:month)
> anova(modello)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
season 3 178811 59604 144.216 < 2.2e-16 ***
season:month 7 41335 5905 14.287 < 2.2e-16 ***
Residuals 493 203754 413
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
However, the df of the season:month does not seem to be correct: the df formula for a nested ANOVA is A(B-1), which in my case is 4(11-1). I also performed a Tukey test, but most of the results were NA:
$season
diff lwr upr p adj
Spring-Autumn 32.93056 26.453002 39.408109 0e+00
Summer-Autumn 15.14239 8.303663 21.981119 1e-07
Winter-Autumn -16.66300 -23.360587 -9.965413 0e+00
Summer-Spring -17.78816 -24.342077 -11.234252 0e+00
Winter-Spring -49.59356 -56.000055 -43.187056 0e+00
Winter-Summer -31.80539 -38.576856 -25.033926 0e+00
$`season:month`
diff lwr upr p adj
Spring:April-Autumn:April NA NA NA NA
Summer:April-Autumn:April NA NA NA NA
Winter:April-Autumn:April NA NA NA NA
Autumn:December-Autumn:April NA NA NA NA
...
which would be the correct procedure?
Thank you in advance for help
Ennio
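On the degrees of freedom: when the number of months differs between seasons, the nested term gets the sum over seasons of (months in that season - 1), which is the number of months minus the number of seasons, rather than A(B-1) with B the total number of months. A small sketch with simulated data reproduces the 3 and 7 df seen in the table above. The response values and the four observations per month are made up; only the season/month layout follows the question.

```r
# Season/month layout as described in the question (11 months in 4 seasons)
months_by_season <- list(
  Winter = c("Jan", "Feb", "Mar"),
  Spring = c("Apr", "May", "Jun"),
  Summer = c("Jul", "Sep"),
  Autumn = c("Oct", "Nov", "Dec")
)

d <- data.frame(
  season = rep(names(months_by_season), lengths(months_by_season)),
  month  = unlist(months_by_season, use.names = FALSE)
)
d <- d[rep(seq_len(nrow(d)), each = 4), ]  # four made-up observations per month

set.seed(1)
d$y <- rnorm(nrow(d))                      # simulated response, illustration only

fit <- lm(y ~ season + season:month, data = d)
tab <- anova(fit)
tab[["Df"]]                                # df for season, season:month, residuals

# df of the nested term computed directly from the layout
nested_df <- sum(lengths(months_by_season) - 1)
nested_df
```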

Getting p-values for all included parameters using glmmLasso

I am fitting a mixed model using glmmLasso in R using the command:
glmmLasso(fix = Activity ~ Novelty + Valence + ROI + Novelty:Valence +
Novelty:ROI + Valence:ROI + Novelty:Valence:ROI, rnd = list(Subject = ~1),
data = KNov, lambda = 195, switch.NR = FALSE, final.re = TRUE)
To give you a sense of the data, the output of head(KNov) is:
Subject Activity ROI Novelty Valence Side STAIt
1 202 -0.4312944 H N E L -0.2993321
2 202 -0.6742497 H N N L -0.2993321
3 202 -1.0914216 H R E L -0.2993321
4 202 -0.6296091 H R N L -0.2993321
5 202 -0.6023507 H N E R -0.2993321
6 202 -1.1554196 H N N R -0.2993321
(I used KNov$Subject <- factor(KNov$Subject) to have Subject read as a categorical variable)
Activity is a measure of brain activity, Novelty and Valence are categorical variables coding the type of stimulus used to elicit the response and ROI is a categorical variable coding three regions of the brain that we have sampled this activity from. Subject is an ID number for the individuals the data was sampled from (n=94).
The output for glmmLasso is:
Fixed Effects:
Coefficients:
Estimate StdErr z.value p.value
(Intercept) 0.232193 0.066398 3.4970 0.0004705 ***
NoveltyR -0.190878 0.042333 -4.5089 6.516e-06 ***
ValenceN -0.164214 NA NA NA
ROIB 0.000000 NA NA NA
ROIH 0.000000 NA NA NA
NoveltyR:ValenceN 0.064523 0.077290 0.8348 0.4038189
NoveltyR:ROIB 0.000000 NA NA NA
NoveltyR:ROIH 0.000000 NA NA NA
ValenceN:ROIB -0.424670 0.069561 -6.1050 1.028e-09 ***
ValenceN:ROIH 0.000000 NA NA NA
NoveltyR:ValenceN:ROIB 0.000000 NA NA NA
NoveltyR:ValenceN:ROIH 0.000000 NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Random Effects:
StdDev:
Subject
Subject 0.6069078
I would like to get a p-value for the effect of valence. My first thought was that the p-value for valence was omitted because the effect was non-significant and was kept in the model only because it is part of the significant ValenceN:ROIB interaction. However, NoveltyR:ValenceN was also non-significant, and a p-value is given for that. I would like a p-value for valence even if it is non-significant, as this analysis is going to be used for a paper and I prefer to report actual p-values rather than p > .05.
The problem here is most likely due to a "reduced rank set of predictors", i.e., you have a lot of combinations where there are either no entries or where some smaller subset of entries is sufficient to unambiguously predict the rest of the values. I suggest you run this code and notice that you get zero cells.
with(KNov, table(Novelty,
Valence,
ROI,
interaction(Novelty, Valence)))
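The same kind of check can be tried on a toy data set before running it on KNov. The sketch below uses a hypothetical design (toy) in which one Valence by ROI combination was deleted on purpose. The ROI level "A" is a placeholder, since the question only shows the levels B and H.

```r
# Hypothetical fully crossed design: 2 x 2 x 3 = 12 cells
toy <- expand.grid(
  Novelty = c("N", "R"),
  Valence = c("E", "N"),
  ROI     = c("A", "B", "H")
)

# Remove one Valence x ROI combination to create empty cells
toy <- toy[!(toy$Valence == "N" & toy$ROI == "H"), ]

# Zero counts reveal the missing combinations
cell_counts <- with(toy, table(Novelty, Valence, ROI))
any(cell_counts == 0)

# The full-interaction design matrix then has fewer independent
# columns than coefficients, i.e. it is of reduced rank
X <- model.matrix(~ Novelty * Valence * ROI, data = toy)
c(columns = ncol(X), rank = qr(X)$rank)
```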

How can I compare regression coefficients across three (or more) groups using R?

Sometimes your research may predict that the size of a regression coefficient may vary across groups. For example, you might believe that the regression coefficient of height predicting weight would differ across three age groups (young, middle age, senior citizen). Below, we have a data file with 3 fictional young people, 3 fictional middle age people, and 3 fictional senior citizens, along with their height and their weight. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
So, how can I compare regression coefficients (slope mainly) across three (or more) groups using R?
Sample data:
age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269
There is an elegant answer to this in CrossValidated.
But briefly,
require(emmeans)
data <- data.frame(age = factor(c(1,1,1,2,2,2,3,3,3)),
height = c(56,60,64,56,60,64,74,75,82),
weight = c(140,155,142,117,125,133,245,241,269))
model <- lm(weight ~ height*age, data)
anova(model) #check the results
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
height 1 25392.3 25392.3 481.5984 0.0002071 ***
age 2 2707.4 1353.7 25.6743 0.0129688 *
height:age 2 169.0 84.5 1.6027 0.3361518
Residuals 3 158.2 52.7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
slopes <- emtrends(model, 'age', var = 'height') #gets each slope
slopes
age height.trend SE df lower.CL upper.CL
1 0.25 1.28 3 -3.84 4.34
2 2.00 1.28 3 -2.09 6.09
3 3.37 1.18 3 -0.38 7.12
Confidence level used: 0.95
pairs(slopes) #gets their comparisons two by two
contrast estimate SE df t.ratio p.value
1 - 2 -1.75 1.82 3 -0.964 0.6441
1 - 3 -3.12 1.74 3 -1.790 0.3125
2 - 3 -1.37 1.74 3 -0.785 0.7363
P value adjustment: tukey method for comparing a family of 3 estimates
To determine whether the regression coefficients "differ across three age groups" we can use the anova function in R. For example, using the data in the question and shown reproducibly in the note at the end:
fm1 <- lm(weight ~ height, DF)
fm3 <- lm(weight ~ age/(height - 1), DF)
giving the following which is significant at the 2.7% level so we would conclude that there are differences in the regression coefficients of the groups if we were using a 5% cutoff but not if we were using a 1% cutoff. The age/(height - 1) in the formula for fm3 means that height is nested in age and the overall intercept is omitted. Thus the model estimates separate intercepts and slopes for each age group. This is equivalent to age + age:height - 1.
> anova(fm1, fm3)
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ age/(height - 1)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 7 2991.57
2 3 149.01 4 2842.6 14.307 0.02696 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
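The statement that age/(height - 1) fits the same model as age + age:height - 1 can be checked directly. The sketch below rebuilds DF inline from the question's data so that it runs on its own. The name fm3b is introduced here only for the comparison and is not part of the original answer.

```r
# Same data as in the question (age as a factor)
DF <- data.frame(
  age    = factor(rep(1:3, each = 3)),
  height = c(56, 60, 64, 56, 60, 64, 74, 75, 82),
  weight = c(140, 155, 143, 117, 125, 133, 245, 241, 269)
)

fm3  <- lm(weight ~ age/(height - 1), DF)       # formula used in the answer
fm3b <- lm(weight ~ age + age:height - 1, DF)   # expanded form

length(coef(fm3))                               # one intercept and one slope per group
all.equal(fitted(fm3), fitted(fm3b))            # identical fitted values
deviance(fm3)                                   # residual sum of squares of fm3
```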
Note 1: Above fm3 has 6 coefficients, an intercept and slope for each group. If you want 4 coefficients, a common intercept and separate slopes, then use
lm(weight ~ age:height, DF)
Note 2: We can also compare a model in which subsets of levels are the same. For example, we can compare a model in which ages 1 and 2 are the same to models in which they are all the same (fm1) and all different (fm3):
fm2 <- lm(weight ~ age/(height - 1), transform(DF, age = factor(c(1, 1, 3)[age])))
anova(fm1, fm2, fm3)
If you do a large number of tests you can get significance on some just by chance so you will want to lower the cutoff for p values.
Note 3: There are some notes on lm formulas here: https://sites.google.com/site/r4naturalresources/r-topics/fitting-models/formulas
Note 4: We used this as the input:
Lines <- "age height weight
1 56 140
1 60 155
1 64 143
2 56 117
2 60 125
2 64 133
3 74 245
3 75 241
3 82 269"
DF <- read.table(text = Lines, header = TRUE)
DF$age <- factor(DF$age)

R - plm regression with time in posix-format

I have little experience with panel data in R, and am trying to run a simple panel regression with the plm package. When converting my dataframe to a pdata.frame, however, my time index variable is transformed to a factor variable. This means that if I want to regress a dependent variable as a function of time, the regression generates a long list of dummy variables for time and calculates individual coefficients for each. I just want the average effect per time unit (i.e., the average monthly increase/decrease in points).
Example dataframe:
ID Date Points
1 1/11/2014 2
1 1/12/2014 4
1 1/1/2015 6
1 1/2/2015 8
2 1/11/2014 1
2 1/12/2014 2
2 1/1/2015 3
2 1/2/2015 4
Say the example dataframe structure is ID = int, Date = POSIXct, Points = int.
I then convert it to a pdata.frame with index ID and Date:
panel <- pdata.frame(dataframe, c("ID", "Date"))
And run a plm fixed effects regression:
fixed <- plm(Points ~ Date, data=panel, model="within")
summary(fixed)
The resulting coefficients are then broken down by each month as dummies.
I want to treat my time-variable as a continuous variable, so I get only one coefficient for Date. How can I do this? Is there a way to avoid formatting the time index-variable as a factor in panel dataframes?
I think you need to create a separate clock or time counter from panel$Date to use in your model. For example:
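library(dplyr)
library(plm)
# Number the observations within each ID (assumes rows are sorted by Date)
dataframe <- dataframe %>%
  group_by(ID) %>%
  mutate(clock = seq_along(ID)) %>%
  ungroup() %>%
  as.data.frame()  # hand pdata.frame() a plain, ungrouped data frame
panel <- pdata.frame(dataframe, index = c("ID", "Date"))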
That produces these data:
ID Date Points clock
1-2014-11-01 1 2014-11-01 2 1
1-2014-12-01 1 2014-12-01 4 2
1-2015-01-01 1 2015-01-01 6 3
1-2015-02-01 1 2015-02-01 8 4
2-2014-11-01 2 2014-11-01 1 1
2-2014-12-01 2 2014-12-01 2 2
2-2015-01-01 2 2015-01-01 3 3
2-2015-02-01 2 2015-02-01 4 4
That produces this output:
> fixed <- plm(Points ~ clock, data=panel, model="within")
> summary(fixed)
Oneway (individual) effect Within Model
Call:
plm(formula = points ~ clock, data = panel, model = "within")
Balanced Panel: n=2, T=4, N=8
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-0.750 -0.375 0.000 0.375 0.750
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
clock 1.50000 0.22361 6.7082 0.001114 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 25
Residual Sum of Squares: 2.5
R-Squared : 0.9
Adj. R-Squared : 0.5625
F-statistic: 45 on 1 and 5 DF, p-value: 0.0011144
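The coefficient on clock can be cross-checked without plm, because a one-way within (fixed effects) estimator amounts to subtracting each ID's mean from the response and the regressor and then regressing without an intercept. A base-R sketch on the example data follows. The names points_dm, clock_dm and within_fit are made up here, and the standard errors of this shortcut regression are not adjusted for the estimated ID means, so only the point estimate and the residual sum of squares are compared.

```r
# Example data from the question, without the Date column
dataframe <- data.frame(
  ID     = rep(1:2, each = 4),
  Points = c(2, 4, 6, 8, 1, 2, 3, 4)
)

# Time counter within each ID: 1, 2, 3, 4
dataframe$clock <- as.numeric(ave(dataframe$ID, dataframe$ID, FUN = seq_along))

# Within transformation: subtract the ID-specific means
points_dm <- dataframe$Points - ave(dataframe$Points, dataframe$ID)
clock_dm  <- dataframe$clock  - ave(dataframe$clock,  dataframe$ID)

# Regression on demeaned data, no intercept
within_fit <- lm(points_dm ~ clock_dm - 1)
coef(within_fit)              # average change in Points per time step
sum(resid(within_fit)^2)      # residual sum of squares
```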

R - cox hazard model not including levels of a factor

I am fitting a cox model to some data that is structured as such:
str(test)
'data.frame': 147 obs. of 8 variables:
$ AGE : int 71 69 90 78 61 74 78 78 81 45 ...
$ Gender : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
$ RACE : Factor w/ 5 levels "","BLACK","HISPANIC",..: 5 2 5 5 5 5 5 5 5 1 ...
$ SIDE : Factor w/ 2 levels "L","R": 1 1 2 1 2 1 1 1 2 1 ...
$ LESION.INDICATION: Factor w/ 12 levels "CLAUDICATION",..: 1 11 4 11 9 1 1 11 11 11 ...
$ RUTH.CLASS : int 3 5 4 5 4 3 3 5 5 5 ...
$ LESION.TYPE : Factor w/ 3 levels "","OCCLUSION",..: 3 3 2 3 3 3 2 3 3 3 ...
$ Primary : int 1190 1032 166 689 219 840 1063 115 810 157 ...
The RUTH.CLASS variable is actually a factor, and I've changed it to one as such:
> test$RUTH.CLASS <- as.factor(test$RUTH.CLASS)
> summary(test$RUTH.CLASS)
3 4 5 6
48 56 35 8
great.
after fitting the model
stent.surv <- Surv(test$Primary)
> cox.ruthclass <- coxph(stent.surv ~ RUTH.CLASS, data=test )
>
> summary(cox.ruthclass)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS, data = test)
n= 147, number of events= 147
coef exp(coef) se(coef) z Pr(>|z|)
RUTH.CLASS4 0.1599 1.1734 0.1987 0.804 0.42111
RUTH.CLASS5 0.5848 1.7947 0.2263 2.585 0.00974 **
RUTH.CLASS6 0.3624 1.4368 0.3846 0.942 0.34599
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4 1.173 0.8522 0.7948 1.732
RUTH.CLASS5 1.795 0.5572 1.1518 2.796
RUTH.CLASS6 1.437 0.6960 0.6762 3.053
Concordance= 0.574 (se = 0.026 )
Rsquare= 0.045 (max possible= 1 )
Likelihood ratio test= 6.71 on 3 df, p=0.08156
Wald test = 7.09 on 3 df, p=0.06902
Score (logrank) test = 7.23 on 3 df, p=0.06478
> levels(test$RUTH.CLASS)
[1] "3" "4" "5" "6"
When i fit more variables in the model, similar things happen:
cox.fit <- coxph(stent.surv ~ RUTH.CLASS + LESION.INDICATION + LESION.TYPE, data=test )
>
> summary(cox.fit)
Call:
coxph(formula = stent.surv ~ RUTH.CLASS + LESION.INDICATION +
LESION.TYPE, data = test)
n= 147, number of events= 147
coef exp(coef) se(coef) z Pr(>|z|)
RUTH.CLASS4 -0.5854 0.5569 1.1852 -0.494 0.6214
RUTH.CLASS5 -0.1476 0.8627 1.0182 -0.145 0.8847
RUTH.CLASS6 -0.4509 0.6370 1.0998 -0.410 0.6818
LESION.INDICATIONEMBOLIC -0.4611 0.6306 1.5425 -0.299 0.7650
LESION.INDICATIONISCHEMIA 1.3794 3.9725 1.1541 1.195 0.2320
LESION.INDICATIONISCHEMIA/CLAUDICATION 0.2546 1.2899 1.0189 0.250 0.8027
LESION.INDICATIONREST PAIN 0.5302 1.6993 1.1853 0.447 0.6547
LESION.INDICATIONTISSUE LOSS 0.7793 2.1800 1.0254 0.760 0.4473
LESION.TYPEOCCLUSION -0.5886 0.5551 0.4360 -1.350 0.1770
LESION.TYPESTEN -0.7895 0.4541 0.4378 -1.803 0.0714 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
RUTH.CLASS4 0.5569 1.7956 0.05456 5.684
RUTH.CLASS5 0.8627 1.1591 0.11726 6.348
RUTH.CLASS6 0.6370 1.5698 0.07379 5.499
LESION.INDICATIONEMBOLIC 0.6306 1.5858 0.03067 12.964
LESION.INDICATIONISCHEMIA 3.9725 0.2517 0.41374 38.141
LESION.INDICATIONISCHEMIA/CLAUDICATION 1.2899 0.7752 0.17510 9.503
LESION.INDICATIONREST PAIN 1.6993 0.5885 0.16645 17.347
LESION.INDICATIONTISSUE LOSS 2.1800 0.4587 0.29216 16.266
LESION.TYPEOCCLUSION 0.5551 1.8015 0.23619 1.305
LESION.TYPESTEN 0.4541 2.2023 0.19250 1.071
Concordance= 0.619 (se = 0.028 )
Rsquare= 0.137 (max possible= 1 )
Likelihood ratio test= 21.6 on 10 df, p=0.01726
Wald test = 22.23 on 10 df, p=0.01398
Score (logrank) test = 23.46 on 10 df, p=0.009161
> levels(test$LESION.INDICATION)
[1] "CLAUDICATION" "EMBOLIC" "ISCHEMIA" "ISCHEMIA/CLAUDICATION"
[5] "REST PAIN" "TISSUE LOSS"
> levels(test$LESION.TYPE)
[1] "" "OCCLUSION" "STEN"
truncated output from model.matrix below:
> model.matrix(cox.fit)
RUTH.CLASS4 RUTH.CLASS5 RUTH.CLASS6 LESION.INDICATIONEMBOLIC LESION.INDICATIONISCHEMIA
1 0 0 0 0 0
2 0 1 0 0 0
We can see that the first level of each of these is being excluded from the model. Any input would be greatly appreciated. I noticed that on the LESION.TYPE variable, the blank level "" is not being included, but that is not by design - that should be an NA or something similar.
I'm confused and could use some help with this. Thanks.
Factors in any model return coefficients relative to a base level (a contrast). By default R uses treatment contrasts, which take the first level of each factor as the baseline. There is no point in calculating a coefficient for the dropped level: it is the case in which all the other indicator columns are 0, so the other coefficients are already measured against it (the levels of a factor are exhaustive and mutually exclusive for every observation). You can alter the default by changing the contrasts in your options.
For your coefficients to be versus an average of all factor levels:
options(contrasts=c(unordered="contr.sum", ordered="contr.poly"))
For your coefficients to be versus a specific treatment (what you have above and your default):
options(contrasts=c(unordered="contr.treatment", ordered="contr.poly"))
As you can see, there are two types of factors in R: unordered (or categorical, e.g. red, green, blue) and ordered (e.g. strongly disagree, disagree, no opinion, agree, strongly agree).
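A small base-R sketch of what the default contrasts do to a four-level factor, and of two related adjustments: choosing a different baseline with relevel(), and turning a blank level into NA as mentioned in the question. The vectors ruth and lesion are toy stand-ins for RUTH.CLASS and LESION.TYPE, not the asker's data.

```r
# Toy stand-in for RUTH.CLASS with levels "3", "4", "5", "6"
ruth <- factor(c(3, 4, 5, 6, 3, 5))

# Default treatment contrasts: the first level ("3") has no column of its own
treat_cols <- colnames(model.matrix(~ ruth))
treat_cols

# Pick another baseline without touching the global options
ruth_ref5 <- relevel(ruth, ref = "5")
levels(ruth_ref5)

# Sum-to-zero contrasts for this factor only
sum_cols <- colnames(model.matrix(~ ruth, contrasts.arg = list(ruth = "contr.sum")))
sum_cols

# Toy stand-in for LESION.TYPE: recode the blank level as missing
lesion <- factor(c("", "OCCLUSION", "STEN", "STEN"))
lesion[lesion == ""] <- NA
lesion <- droplevels(lesion)
levels(lesion)
```

Setting contrasts per model or per factor like this avoids changing options(contrasts = ...) for the whole session, which can be easy to forget later.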
