assumption of variance homogeneity violated two-way ANOVA R / RStudio - r

I want to conduct a two-way ANOVA to investigate whether self-rated health is differently related to life satisfaction in different age groups. First I checked the assumption of variance homogeneity with the Levene test in R. The outcome was that the assumption is violated (p < 2.2e-16). I then decided to calculate a Welch ANOVA with the oneway.test() function, since it does not require homogeneous variances. But after that you have to control the alpha error with pairwise t-tests, and that is not possible for a two-way design.
What can I do now? And what is the detrimental outcome if I calculate an ANOVA even though the assumption of variance homogeneity is violated?
I am new here and to statistics; I hope you can still understand my question.
my variables:
lz_20 (life satisfaction): numeric
Fjp40 (self-rated health): converted from numeric to a factor with 5 levels
Falter (age): converted from numeric to a factor with 3 levels
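For reference, the steps described above look roughly like this in R (a hedged sketch: the data frame name d is assumed, and WRS2::t2way is just one robust two-way alternative, not something from the question):
library(car)   # for leveneTest()
library(WRS2)  # for t2way(), a robust two-way ANOVA

# Levene test across all cells of the two-factor design
leveneTest(lz_20 ~ Fjp40 * Falter, data = d)

# oneway.test() handles only a one-way layout, so the two factors
# must be collapsed into a single grouping variable
oneway.test(lz_20 ~ interaction(Fjp40, Falter), data = d, var.equal = FALSE)

# A heteroscedasticity-robust two-way ANOVA on trimmed means
t2way(lz_20 ~ Fjp40 * Falter, data = d)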

Related

How to specify contrasts in lme to test hypotheses with interactions

I have a generalized mixed model that has 2 factors (fac1 (2 levels), fac2 (3 levels)) and 4 continuous variables (x1,x2,x3,x4) as fixed effects and a continuous response. I am interested in answering:
1. are the main effects x1-x4 (slopes) significant, ignoring fac1 and fac2
2. are fac1 and fac2 levels significantly different from the model mean and from each other
3. is there a difference in slopes between fac1 levels, fac2 levels, and fac1*fac2 levels
This means I would need to include interactions in my model (random effects ignored here),
say: Y~x1+x2+x3+x4+fac1+fac2+x1:fac1+x2:fac1+x3:fac1+x4:fac1+x1:fac2+x2:fac2+x3:fac2+x4:fac2
but now my coefficients for x1-x4 are based on my reference level, and interpretation of the overall main effects is not possible.
Also, do I have to include xi:fac1:fac2 + fac1:fac2 in my model as well to answer 3)?
Is there an R package that can do this? I thought about refitting the model (e.g. without the interactions) to answer 1), but the number of data points in each factor level is not the same, so if I ignore this and fit Y~x1+x2+x3+x4, the slope of the most common factor combination may dominate the result and the inference. I know you can use contrasts, e.g. by coding a factor with 2 levels not as 0 and 1 but as -0.5 and 0.5, but I am not sure what that would look like in this case.
Would it be better to simplify the model by combining the factors first, e.g.
fac3 <- interaction(fac1, fac2)  # and then
Y ~ x1 + x2 + x3 + x4 + x1:fac3 + x2:fac3 + x3:fac3 + x4:fac3
But how do I answer 1)-3) from that?
Thanks a lot for your advice
I think you have to take a step back and ask yourself what hypotheses exactly you want to test here. Taken word for word, your 3-point list results in a lot (!) of hypothesis tests, some of which can be done in the same model, some requiring different parameterizations. Given that the question at hand is about hypotheses and not how to code them in R, this is more about statistics than programming and may be better moved to CrossValidated.
Nevertheless, for starters, I would propose the following procedure:
To test x1-x4 alone, just add all of these to your model, then use drop1() to check which of them actually add to the model fit.
In order to reduce the number of hypothesis tests (and different models to fit), I suggest you also test for each factor and the interaction as a whole whether it is relevant. Again, add all three terms (both factors and their interaction, so just fac1*fac2 if they are formatted as factors) to the model and use drop1().
This point alone includes many potential hypotheses/contrasts to test. Depending on the parameterization (dummy or effect coding), for each of the 4 continuous predictors you have 3 or 5 first-order interactions with the factor dummies/effects and 2 or 6 second-order interactions, given that you test each group against all others. That is a total of 20 or 44 tests, which makes false positives very likely (if you test at the 95% confidence level). Additionally, please ask yourself whether these interactions can even be interpreted in a meaningful way. Therefore, I would advise you to focus on some specific interactions that you expect to be relevant. If you really want to explore all interactions, test entire interactions first (e.g. fac1:x1, not specific contrasts). For this you have to fit 8 models, each including one factor-continuous interaction, and then compare each of them to the no-interaction model using anova().
One last thing: I have assumed that you have already figured out the random variable structure of your model (i.e. what cluster variable(s) to consider and whether there should be random slopes). If not, do that first.
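To make the proposed procedure concrete, here is a minimal sketch; the data frame dat, the cluster variable cluster, and the use of lme4 are all assumptions, not taken from the question:
library(lme4)

# Step 1: main effects only, fitted with ML so likelihood-ratio tests are valid
m_main <- lmer(Y ~ x1 + x2 + x3 + x4 + (1 | cluster), data = dat, REML = FALSE)
drop1(m_main, test = "Chisq")  # does each slope add to the model fit?

# Step 2: add both factors and their interaction as a whole
m_fac <- update(m_main, . ~ . + fac1 * fac2)
drop1(m_fac, test = "Chisq")

# Step 3: test one entire factor-continuous interaction at a time
m_int <- update(m_fac, . ~ . + x1:fac1)
anova(m_fac, m_int)  # compare against the no-interaction model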

Regression with factor variables [duplicate]

I want to do linear regression with the lm function. My dependent variable is a factor called AccountStatus:
1: 0 days in arrears, 2: 30-60 days in arrears, 3: 60-90 days in arrears, and 4: 90+ days in arrears.
As independent variables I have several numeric variables: loan-to-value, debt-to-income, and interest rate.
Is it possible to do a linear regression with these variables? I looked on the internet and found something about dummies, but those were all for the independent variable.
This did not work:
fit <- lm(factor(AccountStatus) ~ OriginalLoanToValue, data=mydata)
summary(fit)
Linear regression does not take categorical variables for the dependent part; it has to be continuous. Considering that your AccountStatus variable has only four levels, it is unfeasible to treat it as continuous. Before commencing any statistical analysis, one should be aware of the measurement levels of one's variables.
What you can do is use multinomial logistic regression; see here for instance. Alternatively, you can recode AccountStatus as dichotomous and use simple logistic regression.
Sorry to disappoint you, but this is just an inherent restriction of multiple regression, nothing to do with R really. If you want to learn more about which statistical technique is appropriate for different combinations of measurement levels of dependent and independent variables, I can wholeheartedly advise this book.
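For instance, a minimal sketch of the multinomial route with nnet::multinom(), assuming the question's data frame mydata; the predictor names other than OriginalLoanToValue are hypothetical:
library(nnet)

mydata$AccountStatus <- factor(mydata$AccountStatus)  # unordered factor
fit <- multinom(AccountStatus ~ OriginalLoanToValue + DebtToIncome + InterestRate,
                data = mydata)
summary(fit)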
Expanding a little bit on @MaximK's answer: multinomial approaches are appropriate when the levels of the factor are unordered. In your case, however, where the measurement level is ordinal (i.e. ordered, but the distance between the levels is unknown/undefined), you can get more out of your data by doing ordinal regression, e.g. with the polr() function in the MASS package or with functions in the ordinal package. However, since ordinal regression has different/more complex underlying theory than simple linear regression, you should probably read more about it first (e.g. in the Wikipedia article linked above, in the vignettes of the ordinal package, at the UCLA stats consulting page on ordinal regression, or by browsing related questions on CrossValidated).
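A corresponding sketch with MASS::polr(), under the same assumptions about the data; polr() requires the response to be an ordered factor:
library(MASS)

mydata$AccountStatus <- factor(mydata$AccountStatus, levels = 1:4, ordered = TRUE)
fit <- polr(AccountStatus ~ OriginalLoanToValue + DebtToIncome + InterestRate,
            data = mydata, Hess = TRUE)  # Hess = TRUE so summary() can give SEs
summary(fit)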
If you can assign a numeric value to each level then you might have a solution. You have to rename the values to numbers and then convert the variable to numeric. Here is how:
library(plyr)

# Map each factor level to a numeric label...
my.data2$islamic_leviathan_score <- revalue(
  my.data2$islamic_leviathan,
  c("(1) Very Suitable" = "3", "(2) Suitable" = "2",
    "(3) Somewhat Suitable" = "1", "(4) Not Suitable At All" = "-1")
)
# ...then convert via character so the labels, not the internal codes, are used
my.data2$islamic_leviathan_score_1 <- as.numeric(as.character(my.data2$islamic_leviathan_score))
This revalues the factor levels while converting the variable to a numeric one. The results I get are consistent with the original values contained in the dataset when the variables are stored as factors. You can use this solution to rename the values to whatever you like while transforming them to numeric variables.
Finally, this is worth doing because it allows you to draw histograms or run regressions, which you cannot do directly with factor variables.
Hope this helps!

How to treat a variable as random factor in GLM in R

I am doing statistical analysis for a dataset using GLM in R. The predictor variables are: "Probe" (type of probe used in the experiment - factor with 4 levels), "Extraction" (type of extraction used in the experiment - factor with 2 levels), "Tank" (the tank number that the sample was collected from - integers from 1 to 9), and "Dilution" (the dilution of each sample - numbers: 3.125, 6.25, 12.5, 25, 50, 100). The response is the number of positive responses ("Positive") obtained from a number of repetitions of the experiment ("Rep"). I want to assess the effects of all predictor variables (and their interactions) on the number of positive responses, so I tried to fit a GLM like this:
y <- cbind(mydata$Positive, mydata$Rep - mydata$Positive)
model1 <- glm(y ~ Probe * Extraction * Dilution * Tank, family = quasibinomial, data = mydata)
But I was later advised by my supervisor that the "Tank" predictor variable should not be treated as a level-based variable, i.e. it has values of 1 to 9, but these are just tank labels, so the difference between 1 and, say, 7 is not meaningful. Treating this variable as a factor would only produce a large model with poor results. So how do I treat the "Tank" variable as a random factor and include it in the GLM?
Thanks
It is called a "mixed-effects model". Check out the lme4 package.
library(lme4)

# Fixed effects entered additively; (1 | Tank) gives each tank its own
# random intercept instead of eight dummy coefficients
glmer(y ~ Probe + Extraction + Dilution + (1 | Tank),
      family = binomial, data = mydata)
Also, you should probably use + instead of * to combine predictors. * includes all interactions between the factors' levels, which would lead to a huge, overfitted model, unless you have a specific reason to believe that there is an interaction, in which case you should code that interaction explicitly.
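For illustration, * is shorthand for the main effects plus their interaction, so these two calls (using the question's variable names) specify the same model:
# Equivalent ways to include a Probe-by-Extraction interaction
glmer(y ~ Probe * Extraction + (1 | Tank), family = binomial, data = mydata)
glmer(y ~ Probe + Extraction + Probe:Extraction + (1 | Tank),
      family = binomial, data = mydata)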

Post hoc test in Generalised linear mixed models: how to do?

I am working with a mixed model (glmmadmb) in R for count data. I have one random factor (Locality) and one fixed factor (Habitat). The fixed factor has two levels, and the random factor has seven levels. I want to do comparisons between the two levels of the fixed factor within each of the seven levels of the random factor, but I don't know how to do it in R. I am very new to R. Can anyone help me? Many thanks.
This is my GLMM formula for overdispersed data:
model <- glmmadmb(Species.abundance ~ Habitat + (1 | Locality:Habitat),
                  data = data, family = "nbinom1")
I tried it with just "Habitat" but it is clearly not taking Locality into account:
summary(glht(model,linfct=mcp(Habitat='Tukey')))
Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: glmmadmb(formula = Species.abundance ~ Habitat + (1 | Locality:Habitat),
data = data, family = "nbinom1")
Linear Hypotheses:
Estimate Std. Error z value Pr(>|z|)
Fynbos - Forest == 0 -0.2614 0.2010 -1.301 0.193
(Adjusted p values reported -- single-step method)
I would probably just do separate tests within each Locality and apply a multiple-comparison correction if you like. Functions from plyr are convenient, but not necessary, for this; something like:
library(plyr)
library(glmmADMB)

# Fit a separate Habitat-only model within each Locality
model.list <- dlply(data, "Locality", glmmadmb,
                    formula = Species.abundance ~ Habitat,
                    family = "nbinom1")
# Pull the Habitat p-value out of each fit, then correct for multiplicity
p.vals <- laply(model.list, function(x) coef(summary(x))[2, "Pr(>|z|)"])
p.adjust(p.vals)
(I can't guarantee that this actually works since you haven't given a reproducible example and I can't be bothered to invent one ...)

interpret/extract coefficients from factor variable in glmnet

I have run a logit model through glmnet. I am extracting the coefficients at the minimum lambda, and it gives me the results I expect. However, I have a factor variable with nine unique values, and glmnet produces a single coefficient for it, which is expected for a binary variable but not for a factor...
library(glmnet)
coef(model.obj, s = 'lambda.min')
#output:
TraumaticInj 2.912419e-02
Toxin .
OthInj 4.065266e-03
CurrentSTDYN 7.601812e-01
GeoDiv 1.372628e-02 #this is a factor variable w/ 9 options...
so my questions:
1) how should I interpret a single coefficient from a factor variable in glmnet?
2) is there a method to extract the coefficients for the different factors of the variable?
glmnet doesn't handle factor variables; you have to convert them to dummies yourself, e.g. with model.matrix(). So what you are seeing is glmnet treating your factor variable as a single numeric variable.
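A minimal sketch of that conversion, assuming a hypothetical data frame df that holds the predictors from the output above plus a binary outcome y:
library(glmnet)

# model.matrix() expands GeoDiv into one dummy column per non-reference level
x <- model.matrix(y ~ TraumaticInj + Toxin + OthInj + CurrentSTDYN + GeoDiv,
                  data = df)[, -1]  # drop the intercept column
fit <- cv.glmnet(x, df$y, family = "binomial")
coef(fit, s = "lambda.min")  # each GeoDiv level now gets its own coefficient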
It can't be done, because glmnet doesn't handle factor variables. This is pretty much answered here: How does glmnet's standardize argument handle dummy variables?
This comment by @R_User in that answer is particularly insightful:
@DTRM - In general, one does not standardize categorical variables to retain the interpretability of the estimated regressors. However, as pointed out by Tibshirani here: statweb.stanford.edu/~tibs/lasso/fulltext.pdf, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables" - so while this causes arbitrary scaling between continuous and categorical variables, it's done for equal penalization treatment. – R_User Dec 6 '13 at 1:20
