Step vs. StepAIC - r

I'm trying to understand why my code has taken several days to run and how I can improve the next iteration. I'm on my third day, and each step still yields only marginal improvements in AIC. The last few AIC values have been 18135.38, 18187.43, and 18243.13. I currently have 33 covariates in the model. The "none" option is 12th from the bottom, so there are still many covariates to run.
The data has ~610K observations and ~1600 variables. The outcome variables and covariates are mostly binary. My covariates were chosen after univariate logistic regression and p-value adjustment using the Holm procedure (alpha = 0.05). No interaction terms are included.
The code I've written is here:
intercept_only <- glm(outcome ~ 1, data=data, family="binomial")
full.model <- glm(outcome ~ 157 covariates, data=data, family = "binomial")
forward_step_model <- step(intercept_only, direction = "forward", scope = formula(full.model))
I'm hoping to run the same code on a different outcome variable with double the number of covariates, identified in the same way as above, but I am worried it will take even longer to process. I see there are both the step and stepAIC functions to perform stepwise regression. Is there an appreciable difference between these functions? Are there any other ways of doing this? Is there any way to speed up the processing?

Related

Syntax error when fitting a Bayesian logistic regression

I am attempting to model binary species traits, where presence is represented by 1 and absence by 0, as a function of some sampling variables. To accomplish this, I have constructed a brms model and added a phylogenetic structure to it. Here is the model I used:
model <- brms::brm(male_head | trials(1 + 0) ~
PC1 + PC2 + PC3 +
(1|gr(phylo, cov = covariance_matrix)),
data = data,
family = binomial(),
prior = prior,
data2 = list(covariance_matrix = covariance_matrix))
Each line of my df represents one observation with a binary outcome.
Initially, I was unsure about which arguments to use in the trials() function. Since my species are non-repeated and some have the traits I'm modeling while others do not, I thought that trials(1 + 0) might be appropriate. I recall seeing a vignette that suggested this, but I can't find it now. Is this syntax correct?
Furthermore, for some reason I'm unaware of, the model is producing one estimate for each line of my predictors. As my df has 362 lines, the model summary displays a lengthy list of 362 estimates. I would prefer to have one estimate for each sampling variable instead. Although I have managed to achieve this by making the treatment effect a random effect (i.e., (1|PC1) + (1|PC2) + (1|PC3)), I don't think this is the appropriate approach. I also tried bernoulli(), but with no success either. Do you have any suggestions for how I can address this issue?
EDIT:
For some reason the values of my sampling variables/principal components were being read as factors. The second part of this question was solved.
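The symptom described in the edit can be reproduced without brms. The sketch below uses simulated values and a hypothetical column name (PC1_num), and it only counts the design-matrix columns base R builds for a formula, on the assumption that the model's population-level terms are expanded in a comparable way.

```r
# Simulated stand-in for a principal component that was read in as a factor
set.seed(1)
d <- data.frame(PC1 = factor(round(rnorm(20), 3)))

# A factor predictor expands to an intercept plus one dummy column per extra level
n_levels    <- nlevels(d$PC1)
cols_factor <- ncol(model.matrix(~ PC1, data = d))

# Convert through character so the original values (not the level codes) come back
d$PC1_num    <- as.numeric(as.character(d$PC1))
cols_numeric <- ncol(model.matrix(~ PC1_num, data = d))

c(levels = n_levels, factor_columns = cols_factor, numeric_columns = cols_numeric)
```

Checking str(data) before fitting is a quick way to spot predictors that were read in as factors.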

GLS / GLM nested design with autocorrelation over time

Still fairly new to GLM and a bit confused about how to establish my model.
About my project:
I sampled the microbiome (and measured a diversity index value = Shannon) from the root system of a sample of 9 trees (=tree1_cat).
In each tree I sampled fine and thick roots (=rootpart), and each tree was sampled four times (=days) over the course of one season. Thus I have a nested design but have to keep time in mind for autocorrelation. Also, not all values are present (I have a few missing values). So far I have tried and tested the following:
Model <- gls(Shannon ~ tree1_cat/rootpart + tree1_cat + days,
na.action = na.omit, data = psL.meta,
correlation = corAR1(form =~ 1|days),
weights = varIdent(form= ~ 1|days))
Furthermore, I've tried to get more insight and used anova(Model) to get the p-values of those factors. Am I allowed to use those p-values? I've also used emmeans(Model, specs = pairwise ~ rootpart) for pairwise comparisons, but since rootpart was entered as a nested factor it only gives me the paired interactions.
It all works, but I am not sure whether this is the right model! Any help would be highly appreciated!
It would be helpful to know your scientific question, but let's suppose you're interested in differences in Shannon diversity between fine and thick roots and in time trends. A model you could use would be:
library(lmerTest)
lmer(Shannon ~ rootpart*days + (rootpart*days|tree1_cat), data = ...)
The fixed-effect component rootpart*days can be expanded into 1 + rootpart + days + rootpart:days (where 1 signifies the intercept). Writing SD for Shannon diversity:
intercept: SD in fine roots on day 0 (hopefully that's the beginning of the season)
rootpart: difference in SD between fine and thick roots on day 0
days: change per day in SD in fine roots (slope)
rootpart:days: difference in slope between thick roots and fine roots
The random-effect component (rootpart*days|tree1_cat) measures how all four of these effects vary across trees, and their correlations (e.g. do trees with a larger-than-average difference between fine and thick roots on day 0 also have a larger-than-average change over time in fine root SD?)
This 'maximal' random-effects model is almost certainly too complex for your data. A rough rule of thumb says you should have 10-20 data points per estimated parameter, and the fixed-effect model alone takes 4 parameters. A full model with 4 random effects requires estimating a 4×4 covariance matrix, which has (4*5)/2 = 10 parameters all by itself. I might just try (1+days|tree1_cat) (random slopes) or (rootpart|tree1_cat) (among-tree variation in the fine vs. thick difference), with a bias towards allowing for variation in the effect that is your primary interest (e.g. if your primary question is about fine vs. thick, then go with (rootpart|tree1_cat)).
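The parameter count can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes rootpart has two levels, days is numeric, there is a single residual variance, and the design is complete (9 trees, 2 root parts, 4 sampling days), which overstates the sample size because some values are missing. The function name n_cov_params is made up for illustration.

```r
# Number of variance-covariance parameters for k correlated random effects
n_cov_params <- function(k) k * (k + 1) / 2

# Design described in the question: 9 trees x 2 root parts x 4 sampling days
n_obs_max <- 9 * 2 * 4   # upper bound, before removing missing values

n_fixed <- 4             # intercept, rootpart, days, rootpart:days
n_resid <- 1             # residual variance (assumed single, homoscedastic)

params_maximal <- n_fixed + n_cov_params(4) + n_resid   # (rootpart*days | tree1_cat)
params_slopes  <- n_fixed + n_cov_params(2) + n_resid   # (1 + days | tree1_cat)

# Observations available per estimated parameter
n_obs_max / c(maximal = params_maximal, random_slopes = params_slopes)
```

Under these assumptions the maximal model has fewer than 5 observations per parameter, well short of the 10-20 rule of thumb, while the random-slopes model has about 9.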
I probably wouldn't worry about autocorrelation at all, nor heteroscedasticity by day (your varIdent(~1|days) term) unless those patterns are very strongly evident in the data.
If you want to allow for autocorrelation you'll need to fit the model with nlme::lme or glmmTMB (lmer still doesn't have machinery for autocorrelation models); something like
library(nlme)
lme(Shannon ~ rootpart*days,
random = ~days|tree1_cat,
data = ...,
correlation = corCAR1(form = ~days|tree1_cat)
)
You need to use corCAR1 (continuous-time autoregressive order-1) rather than the more common corAR1 for unevenly sampled data. Be aware that lme is more finicky/worse at dealing with singular models, so you may discover you need to simplify your model before you can actually get this model to run.
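To see why the distinction matters, the two correlation patterns can be written out by hand. This sketch does not call nlme; it uses hypothetical sampling days and an assumed one-day correlation phi, and it takes the discrete AR(1) to be indexed by observation order.

```r
# Hypothetical, unevenly spaced sampling days for one tree
days <- c(0, 10, 40, 90)
phi  <- 0.95   # assumed correlation between observations one day apart

# Continuous-time AR(1): correlation decays with the actual time gap
car1 <- phi^abs(outer(days, days, "-"))

# Discrete AR(1) on observation order: every consecutive pair is treated alike
idx <- seq_along(days)
ar1 <- phi^abs(outer(idx, idx, "-"))

round(car1, 3)
round(ar1, 3)
```

In the continuous-time version the first two samples (10 days apart) are more strongly correlated than the last two (50 days apart), whereas the order-based version assigns both pairs the same correlation.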

Why use heteroscedasticity-robust standard errors in logistic regression?

I am following a course on R. At the moment, we are working with logistic regression. The basic form we are taught is this one:
model <- glm(
formula = y ~ x1 + x2,
data = df,
family = quasibinomial(link = "logit"),
weights = weight
)
This makes perfect sense to me. However, we are then advised to use the following to get coefficients and heteroscedasticity-robust inference:
model_rob <- lmtest::coeftest(model, sandwich::vcovHC(model))
This confuses me a bit. The documentation for vcovHC states that it creates a "heteroskedasticity-consistent estimation". Why would you do this in logistic regression? I thought it did not assume homoscedasticity. Also, I am not sure what coeftest does.
Thank you!
You're right: homoscedasticity (residuals having the same variance at each level of the predictor) is not an assumption in logistic regression. However, the binary response in logistic regression (0 or 1) is inherently heteroscedastic, because its variance p(1 - p) changes with the predicted probability p, which is why a corresponding estimator should be consistent with it. I guess that is what is meant by "heteroscedasticity-consistent". As @MrFlick already pointed out, if you would like more information on that topic, Cross Validated is likely the place to go. coeftest produces Wald test statistics for the estimated coefficients. These tests give you some information on whether a predictor (independent variable) seems to be associated with the dependent variable according to your data.
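For intuition, here is a rough base-R sketch of what a sandwich covariance and the resulting Wald tests look like. It is not the sandwich or lmtest implementation: it uses simulated data, an unweighted binomial fit rather than the weighted quasibinomial model from the question, and the simplest "HC0" form computed by hand, so its numbers need not match what vcovHC returns with its own default settings.

```r
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1))   # simulated outcome; x2 has no true effect
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))

X  <- model.matrix(fit)
mu <- fitted(fit)

# "Bread": inverse of X' W X with W = mu * (1 - mu), the model-based covariance
bread <- solve(crossprod(X * sqrt(mu * (1 - mu))))
# "Meat": sum of outer products of the score contributions x_i * (y_i - mu_i)
meat  <- crossprod(X * (y - mu))

vcov_hc0  <- bread %*% meat %*% bread
robust_se <- sqrt(diag(vcov_hc0))

# Wald z statistics and two-sided p-values, the kind of table a coefficient test reports
z <- coef(fit) / robust_se
p <- 2 * pnorm(-abs(z))
cbind(estimate = coef(fit), robust_se = robust_se, z = z, p = p)
```

The point estimates are untouched; only the standard errors, and therefore the test statistics and p-values, change when a different covariance matrix is supplied.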

How to get individual coefficients and residuals in panel data using fixed effects

I have panel data including income for individuals over years, and I am interested in the income trends of individuals, i.e. individual coefficients for income over years, and residuals for each individual for each year (the unexpected changes in income according to my model). However, I have a lot of observations with missing income data for one or more years, so with a linear regression I lose the majority of my observations. The data structure is like this:
caseid<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
years<-c(1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008,
1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008)
income<-c(1100,NA,NA,NA,NA,1300,1500,1900,2000,NA,2200,NA,
NA,NA,NA,NA,NA,NA, 2300,2500,2000,1800,NA, 1900)
df<-data.frame(caseid, years, income)
I first decided to use a random effects model, which I think will still predict income for missing years by using a maximum likelihood approach. However, since the Hausman test gives a significant result, I decided to use a fixed effects model instead. I ran the code below, using the plm package:
inc.fe<-plm(income~years, data=df, model="within", effect="individual")
However, I get coefficients only for years and not for individuals; and I cannot get residuals.
To maybe give an idea, the code in Stata should be
xtset caseid
xtreg income year, fe
predict resid, e
Then I tried to run the pvcm function from the same library, which is a function for variable coefficients.
inc.wi<-pvcm(Income~Year, data=ldf, model="within", effect="individual")
However, I get the following error message:
"Error in FUN(X[[i]], ...) : insufficient number of observations".
How can I get individual coefficients and residuals with pvcm by resolving this error or by using some other function?
My original long form data has 202976 observations and 15 years.
Does the fixef function from package plm give you what you are looking for?
Continuing your example:
fixef(inc.fe)
Residuals are extracted by:
residuals(inc.fe)
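If plm is not at hand, a least-squares dummy-variable (LSDV) fit with lm gives the same kind of output: one intercept per individual, a common slope on years, and a residual for every observed row. This is a sketch on the example data from the question, not the plm internals; the object names lsdv, res and used are made up here.

```r
caseid <- rep(1:4, each = 6)
years  <- rep(seq(1998, 2008, by = 2), times = 4)
income <- c(1100, NA, NA, NA, NA, 1300, 1500, 1900, 2000, NA, 2200, NA,
            NA, NA, NA, NA, NA, NA, 2300, 2500, 2000, 1800, NA, 1900)
df <- data.frame(caseid, years, income)

# LSDV fit: one intercept per individual, common slope on years
# (rows with missing income are dropped, so individual 3 disappears entirely)
lsdv <- lm(income ~ 0 + factor(caseid) + years, data = df)

coef(lsdv)               # individual intercepts plus the shared years slope
res <- residuals(lsdv)   # one residual per non-missing observation

# Attach residuals to the rows that were actually used in the fit
used <- as.integer(names(res))
cbind(df[used, ], resid = res)
```

Note that neither approach yields a separate slope per individual; that would need an interaction between the individual and years, which individuals with only one or two observed years cannot support.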
You have a random effects model with random slopes and intercepts. This is also known as a random coefficients regression model. The missingness is the tricky part, which (I'm guessing) you'll have to write custom code to solve after you choose how you wish to do so.
But you haven't clearly/properly specified your model (at least in your question) as far as I can tell. Let's define some terms:
Let Y_it = income for individual i (i = 1,..., N) in year t (t = 1,...,T). As I read your question, you have not specified which of the two models below you wish to have:
M1: random intercepts, global slope, random slopes
Y_it ~ N(\mu_i + B T + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
M2: random intercepts, random slopes
Y_it ~ N(\mu_i + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
Also, your example data is nonsensical (see below). As you can see, you don't have enough observations to estimate all parameters. I'm not familiar with library(plm) but the above models (without missingness) can be estimated in lme4 easily. Without a realistic example dataset, I won't bother providing code.
R> table(df$caseid, is.na(df$income))
    FALSE TRUE
  1     2    4
  2     4    2
  3     0    6
  4     5    1
Given that you do have missingness, you should be able to produce estimates for either hierarchical model via the typical methods, such as EM. But I do think you'll have to write the code to do the estimation yourself.

Tukey HSD for multiple variables and single variable return different results

I have tried to run Tukey HSD for a multi-variable dataset. However, when I run the same test on a single variable, the results are completely opposite.
While running for multiple variables, I observed the following error in ANOVA output:
8 out of 87 effects not estimable
Estimated effects may be unbalanced
While running for single variable, I observed the following error in ANOVA output:
Estimated effects may be unbalanced
Is this in any way related to the completely opposite Tukey HSD output which I received? Also, how do I go on solving this problem?
I used aov() and have close to 500000 datapoints in my dataset.
To be more specific, the following code gave me different results:
code1:
lm_test1 <- lm(y ~ x1+ x2, data=data)
glht(lm_test1, linfct = mcp(x1 = "Tukey"))
code2:
lm_test1 <- lm(y ~ x1, data=data)
glht(lm_test1, linfct = mcp(x1 = "Tukey"))
Please tell me how this is possible...
After some more research, I found the answer, so I thought I should post it. ANOVA in R is Type I (sequential) by default. That means the effect of the first variable entered is assessed without controlling for any other factors, whereas the effect of each later variable is assessed after controlling for the variables entered before it. Since I was entering my variable second, the results shown were after controlling for the first variable, which happened to be in the completely opposite direction to the direct effect.
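The order dependence of Type I sums of squares is easy to reproduce on simulated data. In this sketch the factors x1 and x2 are deliberately associated, so the design is unbalanced; all names and effect sizes are invented for illustration.

```r
set.seed(42)
n  <- 200
x1 <- factor(sample(c("a", "b"), n, replace = TRUE))
# x2 is deliberately associated with x1, so the design is unbalanced
x2 <- factor(ifelse(runif(n) < ifelse(x1 == "a", 0.8, 0.2), "u", "v"))
y  <- 1 * (x1 == "b") + 2 * (x2 == "v") + rnorm(n)
dat <- data.frame(y, x1, x2)

# Type I (sequential) sums of squares depend on the order of the terms
a12 <- anova(lm(y ~ x1 + x2, data = dat))
a21 <- anova(lm(y ~ x2 + x1, data = dat))

ss_x1_first  <- a12["x1", "Sum Sq"]   # x1 not adjusted for x2
ss_x1_second <- a21["x1", "Sum Sq"]   # x1 adjusted for x2
c(x1_entered_first = ss_x1_first, x1_entered_second = ss_x1_second)
```

The two orderings split the same total model sum of squares differently between x1 and x2, which is why the row for a given factor changes when it moves from first to second position.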
