Post Hoc tests for ANOVA in R [closed]

I was trying to figure out how to do post hoc tests for two-way ANOVAs and found the following two approaches (both sketched in R below):
Do pairwise t-tests (Bonferroni-corrected) if one finds significance with the ANOVA.
Link- http://rtutorialseries.blogspot.com/2011/01/r-tutorial-series-two-way-anova-with.html
Do TukeyHSD on an aov model
Link- http://www.r-bloggers.com/post-hoc-pairwise-comparisons-of-two-way-anova/
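For reference, a minimal sketch of both approaches, assuming a data frame dat with a numeric response score and factors Treatment and Age (the names are illustrative, not taken from the linked tutorials):

```r
# Two-way ANOVA with the interaction term included
fit <- aov(score ~ Treatment * Age, data = dat)
summary(fit)

# Approach 1: Bonferroni-corrected pairwise t-tests, one factor at a time
pairwise.t.test(dat$score, dat$Treatment, p.adjust.method = "bonferroni")
pairwise.t.test(dat$score, dat$Age, p.adjust.method = "bonferroni")

# Approach 2: Tukey HSD on the fitted aov model (covers all factor
# levels and the interaction cells)
TukeyHSD(fit)
```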
Running the data set given in the first example in SPSS gives significant pairwise differences for both Treatment and Age (Treatment and Age were the independent variables), while following the directions in the first link gave me significant pairwise differences only for Age, not for Treatment.
I have a few questions:
Is the first method completely incorrect, as hinted in the second link?
What is the right way to do Bonferroni-corrected post hoc tests for a two-way ANOVA in R?
Does anyone know how SPSS's post hoc tests work in the case of two-way ANOVAs (Univariate analysis)? Especially the Bonferroni-corrected tests.
I am new to R, so please let me know if I made a mistake in framing the question; I will try to elucidate as much as I personally can. Thanks for your help.

Related

R influence variables on y target [closed]

One task of machine learning / data science is making predictions. But I also want to get more insight into the variables of my model.
To get that insight, I tried different methods:
Logistic regression (the output provides some 'insight' into the influence of the different variables, see: Checking interpretation of GLM summary in R)
The xgb.plot.importance function applied to a boosted tree (applied to the Titanic data set; the importance plot is not reproduced here).
And I saw a great article on how to explain a boosting tree (but unfortunately, its code is not working): https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
My question: are there other methods to give yourself (or, even better, the business) more insight into which variables have an influence on the target variable? And of course: is the influence positive or negative, and how big is it?
You could also try lasso regression (https://stats.stackexchange.com/questions/17251/what-is-the-lasso-in-regression-analysis), which essentially selects the variables that most influence the response variable.
The glmnet package provides support for this type of regression.
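A minimal sketch of a lasso fit with glmnet, assuming a data frame df with a binary target y (the names are illustrative):

```r
library(glmnet)

# glmnet needs a numeric predictor matrix rather than a formula
x <- model.matrix(y ~ ., data = df)[, -1]  # drop the intercept column

# alpha = 1 selects the lasso penalty; cv.glmnet chooses lambda by cross-validation
fit <- cv.glmnet(x, df$y, family = "binomial", alpha = 1)

# Coefficients at the cross-validated lambda: variables shrunk exactly
# to zero have effectively been deselected
coef(fit, s = "lambda.min")
```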

Machine learning - Calculating the importance of a "value" in a variable [closed]

I’m analyzing a medical dataset containing 15 variables and 1.5 million data points. I would like to predict hospitalization and, more importantly, which type of medication may be responsible. The medicine variable has around 700 types of drugs. Does anyone know how to calculate the importance of a "value" (a type of drug, in this case) within a variable for boosting? I need to know whether ‘drug A’ is better for prediction than ‘drug B’, both within a variable called ‘medicine’.
A logistic regression model can give such information in terms of p-values for each drug, but I would like to use a more complex method. Of course you can create a binary variable for each type of drug, but that gives 700 extra variables and does not seem to work very well. I’m currently using R. I really hope you can help me solve this problem. Thanks in advance! Kind regards, Peter
See varImp() in the caret library, which supports all the ML algorithms you referenced.
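A rough sketch of how that could look, assuming a data frame df with a factor outcome hospitalized and medicine among the predictors (the names, and the choice of method = "xgbTree", are illustrative, not from the question):

```r
library(caret)

# caret's formula interface one-hot encodes factors, so each drug level
# in medicine becomes its own column with its own importance score
fit <- train(hospitalized ~ ., data = df, method = "xgbTree")

# Model-specific (or, failing that, model-agnostic) variable importance
varImp(fit)
```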

How can a SE be above 1000 in a multilevel logistic regression? [closed]

Maybe my question will fail to be specific, but when fitting a generalized linear mixed model (using the lme4 package in R) I get SE = 1000 for one of the parameters, with the estimated coefficient as high as 16. The variable is dichotomous. My question is whether there might be an explanation for such a result, considering that the other covariates have estimates and SEs that seem OK.
That's a sign that you have complete separation. You should re-run the model without that covariate. Since it's a mixed-effects model, you may need to tabulate outcome by covariate by grouping level to see what is happening. More details would allow greater specificity in our answers.
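A quick way to do that tabulation, assuming a data frame dat with a binary outcome, the suspect dichotomous covariate, and a grouping factor group (names are illustrative):

```r
# If one level of the covariate has outcomes that are all 0 or all 1,
# that is complete separation: the estimate diverges and the SE explodes
with(dat, table(covariate, outcome))

# For a mixed-effects model, also break the table down by grouping level
with(dat, table(covariate, outcome, group))
```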
This is a link to a posting by Jarrod Hadfield, one of the guRus on the R mixed-model mailing list. It demonstrates how complete separation leads to the Hauck-Donner effect, and it offers some further approaches for attempting to deal with it.
You may be seeing a case of the Hauck-Donner effect. Here is one post that discusses it; you can read the original paper or search the web for additional discussions.

comparison of regression models built on two time points [closed]

I have two multiple linear regression models built on the same group of subjects and the same variables; the only difference is the time point: one uses baseline data, the other uses data obtained some time later.
I want to test whether there is a statistically significant difference between the two models. I have seen articles saying that AIC may be a better option than p-values when comparing models.
My question is: does it make sense to simply compare the AICs using extractAIC in R, or should I use anova() on the fitted lm objects?
It is not standard to test for statistical significance between observations recorded at two points in time by estimating two different models.
You may mean that you are testing whether the observations recorded at the second point in time are statistically different from the first, by including some dummy variables and testing the coefficients on these. Still, that is estimating only one model.
In your model you will have dummy variables for your second point in time: either just an intercept shift, or an intercept shift plus an interaction term, as sketched below.
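A sketch of what that single pooled model could look like, assuming data frames baseline and followup with identical columns (y, x1, and x2 are stand-ins for your variables):

```r
# Stack both time points and mark the second with a dummy variable
dat <- rbind(transform(baseline, time = 0),
             transform(followup, time = 1))

# time shifts the intercept; x1:time lets the slope of x1 differ at follow-up
fit <- lm(y ~ x1 * time + x2, data = dat)

summary(fit)  # test the coefficients on time and x1:time (the "gammas")
AIC(fit)      # and compare AIC against the model without the time terms
```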
Then you should do both: test the significance of either or both time-dummy coefficients (the gammas) in the model described, and also look at the AIC. There is no definitive 'better', as the articles likely described.

Should Categorical predictors within a linear model be normally distributed? [closed]

I am running simple linear models (Y ~ X) in R where my predictor is a categorical variable (0-10). However, this variable is not normally distributed, and none of the available transformation techniques are helpful (e.g. log, square root, etc.), as the data is not negatively/positively skewed but rather all over the place. I am aware that for lm the outcome variable (Y) has to be normally distributed, but is this also required for predictors? If yes, any suggestions on how to do this would be more than welcome.
Also, as the data I am looking at has two groups, patients vs. controls (I am interested in group differences, as you can guess), do I have to check whether the data is normally distributed within each of the two groups or overall across both groups?
Thanks.
See @Roman Luštrik's comment above: it does not matter how your predictors are distributed (apart from problems with multicollinearity). What is important is that the residuals be normal (and have homogeneous variances).
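A minimal sketch of checking those assumptions in R (y, x, and dat are placeholders, not from the question):

```r
fit <- lm(y ~ x, data = dat)

# Standard diagnostic plots: residuals vs. fitted, normal Q-Q,
# scale-location (variance homogeneity), and residuals vs. leverage
par(mfrow = c(2, 2))
plot(fit)

# Optional formal test of residual normality (for n < 5000)
shapiro.test(residuals(fit))
```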
