I'm running an analysis of deviance with a chi-square test in R to test the individual significance of each explanatory variable for the response, as well as the interactions between explanatory variables. For some reason, when I test the interaction between time (explanatory) and a variable called casualty code (explanatory), I'm not getting a p-value in my output.
I'm doing my dissertation on the factors that affect the survival of wildlife during the rehabilitation process. Many of my factors are categorical; however, time spent in the center is continuous. I've run a GLM with a logit link for the response variable "result" (binomial: lived or died), as a function of time, age (categorical), species type (categorical), and casualty code (injury type, categorical).
analysis.time<-glm(Result~Time + Species.Typefac + Codefac + Agefac, family = binomial, data = GBH_Data)
I then remove time to test for significance with ANOVA:
acst.sigtime<-update(analysis.time,~.-Time)
anova(analysis.time, acst.sigtime, test = "Chisq")
Which works just fine. I did the same for the time:age and time:species type interactions and got normal output. However, when I try to run the same test for time and casualty code, I'm not getting a p-value. This is the code:
time.interactions.code <- glm(Result ~ Time + Agefac + Codefac + Species.Typefac + Time:Codefac, family = binomial, data = GBH_Data)
time.code.anova<-update(time.interactions.code,~.-Time:Codefac)
anova(time.code.anova, time.interactions.code, test = "Chisq")
And this is the output:
Analysis of Deviance Table

Model 1: Result ~ Time + Agefac + Codefac + Species.Typefac
Model 2: Result ~ Time + Agefac + Codefac + Species.Typefac + Time:Codefac
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1     29561      25297
2     29554      25472  7   -174.6
For time:age and time:species type I'm using the exact same code and getting p-values. Casualty code has 8 categories, while age has 3 and species type has 4. I've double-checked my data and don't have any NAs/blanks. For context, my overall dataset is very large (over 28,000 individual casualties). What could be the reason I'm not getting a p-value here? Answers in lay terms are greatly appreciated; I don't have a lot of experience with statistics, so I'm thankful for any simplification of concepts or elaboration of terms.
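A blank Pr(>Chi) next to a negative deviance change is itself a clue: for nested models fitted to the same rows, adding terms cannot increase the residual deviance at convergence, so this often points to a convergence problem or to the two models not being fitted to exactly the same data. A minimal diagnostic sketch, using the model names above:

# Were both models fitted to the same number of rows?
nobs(time.code.anova)
nobs(time.interactions.code)

# Did any coefficients fail to estimate (NA indicates aliased/collinear terms)?
any(is.na(coef(time.interactions.code)))

# drop1() refits the reduced model on the same data automatically:
drop1(time.interactions.code, test = "Chisq")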
I have created the following plot based on air quality data over three years of observation, and would like to know if these slopes are different across the two time periods (March-June 2018-2019 average vs. March-June 2020):
A snapshot of my data frame is shown here:
The figure is made using the following code:
Lockdown_Period_plot_weekday <- ggplot(COVID_NO2_weekday_avgs_Rathmines,
                                       aes(x = Date_1, y = avg_daily_Rath_NO2,
                                           color = Period, shape = Period)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(size = 2) +
  theme_bw() +
  labs(x = 'Date',
       y = 'Daily Avg [NO2] µg/m^3',
       title = 'Weekday NO2 Trends During Lockdown',
       subtitle = 'Rathmines AQ Station')
I know that I need to remove the effect of serial correlation first (as the independent variable is a time series), but I'm not exactly sure how to do this. Should I use the date column to do so? Or should I use the dummy column Date_2 to do this? This column is just a concatenation of Month.Date to create a series of x values that are numerical and continuous.
I used the gls() function to do this, and believe I have designated the date column as the basis of the serial-correlation structure.
My attempt is displayed here:
library(nlme)
m <- gls(avg_daily_Rath_NO2 ~ Period,
data=COVID_NO2_weekday_avgs_Rathmines,
correlation=corARMA(p=1, q=0, form=~date))
summary(m)
Output:
Generalized least squares fit by REML
Model: avg_daily_Rath_NO2 ~ Period
Data: COVID_NO2_weekday_avgs_Rathmines
Correlation Structure: ARMA(1,0)
Formula: ~date
Parameter estimate(s):
Phi1
0.6066636
Coefficients:
[coefficient table truncated in the original output]

 Correlation:
                      (Intr)
PeriodMarch-June 2020 -0.569

Standardized residuals:
       Min         Q1        Med         Q3        Max
-1.8573362 -0.6487672 -0.1588551  0.5597100  3.4017470
Residual standard error: 10.46725
Degrees of freedom: 256 total; 254 residual
I am a tad rusty when it comes to linear regression outputs, and am not sure how to interpret this one.
Additionally, I would like to check that my model is correctly structured to achieve my desired output.
Any help with this would be appreciated.
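On interpretation: the coefficient table (which appears to have been cut from the output above) is where the Period contrast lives, and it can be pulled out of the fitted model directly. A minimal sketch using the model m:

# Coefficients, standard errors, t- and p-values from a gls fit:
summary(m)$tTable

# Confidence intervals for the coefficients and the AR(1) parameter Phi:
intervals(m)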
-TL;DR-
I want to run an ANCOVA on two lines to find out if the slopes differ across the Period variable.
I would like to remove the effect of serial correlation since the independent variable is a time series.
What is the most effective way to accomplish this?
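A minimal sketch of one way to set this up, assuming Date_2 is the numeric, continuous time index described above: include the time-by-Period interaction, so that its coefficient directly tests whether the slopes differ, and keep an autoregressive error structure for the serial correlation.

library(nlme)

# Slopes differ across Period if the Date_2:Period term is significant.
# corCAR1 is the continuous-time AR(1) structure; corAR1 would also work
# if Date_2 is integer-spaced.
m2 <- gls(avg_daily_Rath_NO2 ~ Date_2 * Period,
          data = COVID_NO2_weekday_avgs_Rathmines,
          correlation = corCAR1(form = ~ Date_2 | Period))
summary(m2)   # the Date_2:Period row is the slope difference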
More information can be provided if necessary.
I originally ran my data in SPSS because the lmer package took some time for me to learn. I spent a few weeks writing up a script in R, but my output in R is different from what I'm getting in SPSS.
I have 3 Fixed Effects: Group, Session, and TrialType.
When I ran a mixed model in SPSS, I got the interaction Group*Session p=.08 OR p=.02, depending on which covariance structure I used. This is partly the reason I wanted to use R, because I didn't have enough information to help me decide which structure to use.
Here are my models in R. I'm using a likelihood-ratio test to get a p-value for this Group*Session interaction.
Mod2 = lmer(accuracy ~ group*session*trialtype + (trialtype|subject),
            REML=F, data=data,
            control = lmerControl(optimizer = "optimx", optCtrl=list(method='L-BFGS-B')))
Mod5 = lmer(accuracy ~ session + trialtype + group + session*trialtype + trialtype*group + (trialtype|subject),
data=data, REML=FALSE,
control = lmerControl(optimizer = "optimx", optCtrl=list(method='L-BFGS-B')))
anova(Mod2, Mod5)
Data: data
Models:
Mod5: accuracy ~ session + trialtype + group + session * trialtype +
Mod5:     trialtype * group + (trialtype | subject)
Mod2: accuracy ~ group * session * trialtype + (trialtype | subject)
     Df     AIC     BIC logLik deviance  Chisq Chi Df Pr(>Chisq)
Mod5 23 -961.32 -855.74 503.66  -1007.3
Mod2 27 -956.32 -832.38 505.16  -1010.3 2.9989      4      0.558
I'll also note that I added the lmerControl based on the two warning/error messages I was getting. When I added this, I got the singular boundary warning message.
Is it possible that R is not recognizing a grouping variable in my data? I'm not sure how to identify this or correct it.
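One quick way to check what R is treating as the grouping structure (a sketch; adjust the column names to match your data frame):

# Are subject, group, session and trialtype stored as factors?
str(data[, c("subject", "group", "session", "trialtype")])

# How many subjects, and how many observations per subject/trialtype cell?
length(unique(data$subject))
xtabs(~ subject + trialtype, data = data)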
Here is my syntax from SPSS:
MIXED Acc BY Test TrialType Group
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0,
ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED=Test TrialType Group Test*TrialType Test*Group TrialType*Group Test*TrialType*Group |
SSTYPE(3)
/METHOD=ML
/PRINT=COVB DESCRIPTIVES G SOLUTION
/RANDOM=INTERCEPT TrialType | SUBJECT(Subject) COVTYPE(CS)
/REPEATED=Test | SUBJECT(Subject) COVTYPE(ID).
The first thing to do to figure this out is to make sure the log-likelihood values for the fitted models are the same: if the models aren't producing the same fits, the test statistics wouldn't be expected to match. Even if the models are the same, in R you're using a chi-square statistic rather than the F statistic used in SPSS Statistics MIXED. The p-values often would differ, though not usually by as much as from .02-.08 to .558. I suspect you haven't actually got strictly comparable results here.
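For example, the maximized log-likelihoods can be read off the R fits directly and put on the -2LL scale that SPSS reports (a sketch using the models above):

logLik(Mod2)
logLik(Mod5)
-2 * as.numeric(logLik(Mod2))   # compare with the -2 log-likelihood SPSS reports under ML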
Because this is such a long question I've broken it down into 2 parts; the first being just the basic question and the second providing details of what I've attempted so far.
Question - Short
How do you fit an individual frailty survival model in R? In particular I am trying to re-create the coefficient estimates and SEs in the table below, which were found by fitting a semi-parametric frailty model to this dataset (link). The model takes the form:
h_i(t) = z_i h_0(t) \exp(\beta' X_i)
where z_i is the unknown frailty term for each patient, X_i is a vector of explanatory variables, \beta is the corresponding vector of coefficients, and h_0(t) is the baseline hazard function. The explanatory variables are disease, gender, bmi and age (I have included code below to clean up the factor reference levels).
Question - Long
I am attempting to follow and re-create the Modelling Survival Data in Medical Research textbook example for fitting frailty models. In particular I am focusing on the semi-parametric model, for which the textbook provides parameter and variance estimates for the normal Cox model, log-normal frailty and gamma frailty, shown in the above table.
I am able to recreate the no-frailty model estimates using
library(dplyr)
library(survival)

dat <- read.table(
  "./Survival of patients registered for a lung transplant.dat",
  header = TRUE
) %>%
  as_data_frame %>%
  # set the factor reference levels used in the textbook
  mutate(disease = factor(disease, levels = c(3, 1, 2, 4))) %>%
  mutate(gender = factor(gender, levels = c(2, 1)))

mod_cox <- coxph(Surv(time, status) ~ age + gender + bmi + disease, data = dat)
mod_cox
however I am really struggling to find a package that can reliably re-create the results of the last two columns (the log-normal and gamma frailty models). Searching online I found a table which attempts to summarise the available packages.
Below I have posted my current findings, as well as the code I've used, in case it helps someone identify whether I have simply specified the functions incorrectly:
frailtyEM - Seems to work best for the gamma model; however, it doesn't offer log-normal frailties
frailtyEM::emfrail(
Surv(time, status) ~ age + gender + bmi + disease + cluster(patient),
data = dat ,
distribution = frailtyEM::emfrail_dist(dist = "gamma")
)
survival - Gives warnings on the gamma fit, and from everything I've read its frailty functionality is considered deprecated, with the recommendation to use coxme instead.
coxph(
Surv(time, status) ~ age + gender + bmi + disease + frailty.gamma(patient),
data = dat
)
coxph(
Surv(time, status) ~ age + gender + bmi + disease + frailty.gaussian(patient),
data = dat
)
coxme - Seems to work but provides estimates different from those in the table, and doesn't support the gamma distribution
coxme::coxme(
Surv(time, status) ~ age + gender + bmi + disease + (1|patient),
data = dat
)
frailtySurv - I couldn't get this to work properly; it seemed to always fit the variance parameter with a flat value of 1 and provide coefficient estimates as if a no-frailty model had been fitted. Additionally, the documentation doesn't state which strings are supported for the frailty argument, so I couldn't work out how to fit a log-normal model
frailtySurv::fitfrail(
Surv(time, status) ~ age + gender + bmi + disease + cluster(patient),
dat = dat,
frailty = "gamma"
)
frailtyHL - Produces warning messages saying "did not converge"; it still produced coefficient estimates, but they were different from those in the textbook
mod_n <- frailtyHL::frailtyHL(
Surv(time, status) ~ age + gender + bmi + disease + (1|patient),
data = dat,
RandDist = "Normal"
)
mod_g <- frailtyHL::frailtyHL(
Surv(time, status) ~ age + gender + bmi + disease + (1|patient),
data = dat,
RandDist = "Gamma"
)
frailtypack - I simply don't understand the implementation (or at least it's very different from what is taught in the textbook). The function requires the specification of knots and a smoother, which seem to greatly impact the resulting estimates.
parfm - Only fits parametric models; having said that, every time I tried to use it to fit a Weibull proportional-hazards model it simply threw an error.
phmm - Have not yet tried
Given the large number of packages I've gone through unsuccessfully, I fully appreciate that it is highly likely the problem is my not properly understanding the implementations and misusing the packages. Any help or examples on how to successfully re-create the above estimates would be greatly appreciated.
Regarding
I am really struggling to find a package that can reliably re-create the results of the second 2 columns.
See the Survival Analysis CRAN task view under Random Effect Models or do a search on R Site Search on e.g., "survival frailty".
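For what it's worth, the task views can also be listed and installed from within R via the ctv package (a sketch):

install.packages("ctv")
library(ctv)
available.views()            # list the CRAN task views
install.views("Survival")    # installs every package in the view (large!)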
I have a data frame which contains some characteristics of clients and contracts, plus 0s and 1s showing whether a fall happened during the period between 2008 and 2017. I'm using a binomial model to regress the probability of a fall on the characteristics. I have 38,000 different contracts.
So I'm using a binomial model like this (R code):
formule <- y ~ Niveau_gar_incapacite + Niv_indem_mens + Regrpt_franchise + Niveau_prime + Situation_familiale + Classe_age_chute + Grde_Region + Regrpt_strate + Taille_courtier + Commission + Retention + Anciennete + Regrpt_CSP + Regrpt_sinistres + Couplage
logit <- glm(Chute_commerciale~1, data=train, family=binomial(link="logit"))
selection_asc_AIC <- step(logit, direction="forward", trace=TRUE, k=2, scope=list(upper=formule))
After some tests for multicollinearity, I eliminated some variables and grouped some terms.
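(For reference, a typical multicollinearity check on a fitted GLM is the generalized VIF from the car package; a sketch, where logit_final is a hypothetical name for the model chosen by step():)

library(car)
vif(logit_final)   # GVIF values well above ~5-10 flag collinear terms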
I have this result:
[screenshot: results from GLM]
[screenshot: results from GLM 2]
These results are not correct: the null deviance and residual deviance are wrong.
I suspect my exposure variable is the problem.
In fact, I have contracts beginning and ending in different years.
So my exposure can be 5.32 or 1.36 years, and I have truncation and censoring.
How can I treat this exposure variable in a binomial logistic regression?
If I duplicate each row by its number of years of exposure, there is a problem with the independence of observations.
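One standard way to handle unequal exposure in a binary-outcome GLM, without duplicating rows, is a complementary log-log link with log(exposure) as an offset; this corresponds to assuming a constant event rate per year of exposure. A sketch, where exposure is a hypothetical column holding the years observed for each contract:

# cloglog link + log-exposure offset: no row duplication needed.
# (`exposure` is a hypothetical column name for years observed.)
logit_expo <- glm(Chute_commerciale ~ 1 + offset(log(exposure)),
                  family = binomial(link = "cloglog"),
                  data = train)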
I am fitting a maximal model to untransformed response times to correct trials, with two, two-level, centered categorical predictors (Stimulation, Cognate Status) and an orthogonal second-order polynomial with 5 levels (Block). Random effects include full crossed structure with correlations. 32 subjects, 60 items, balanced, within-subjects design, 12,406 observations. The model converges but the summary takes an age to process.
The model runs without any convergence issues, but summary() kicks off a memory-intensive process and never finishes printing the output. I don't have any issues with the summary() function for any other objects.
I have included the code for the model for reference.
Max.lmer.RT = lmer(RT ~ StimCent.r * (ot1 + ot2) * CogStatCent.r +
                     (1 + StimCent.r * (ot1 + ot2) * CogStatCent.r | PID) +
                     (1 + StimCent.r * (ot1 + ot2) | DutchName:CogStatCent.r),
                   data = TDL.cent.RT, REML = FALSE,
                   control = lmerControl(optimizer = "nloptwrap2",
                                         optCtrl = list(maxfun = 100000)))
summary(Max.lmer.RT)
A fix or suggestions on what might be causing this would be much appreciated.
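Two standard lme4 options sometimes help with a model of this size (a sketch): print the summary without the fixed-effects correlation matrix, which is very large with this many interaction terms, and skip the post-fit derivative check when refitting.

# Print the summary without the fixed-effects correlation matrix:
print(summary(Max.lmer.RT), correlation = FALSE)

# Refitting with calc.derivs = FALSE skips the slow finite-difference
# Hessian check after optimization:
# control = lmerControl(optimizer = "nloptwrap2", calc.derivs = FALSE,
#                       optCtrl = list(maxfun = 100000))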