How can I classify post-hoc test results in R? - r

I am trying to understand how to work with ANOVAs and post-hoc tests in R.
So far, I have used aov() and TukeyHSD() to analyse my data. Example:
uni2.anova <- aov(Sum_Uni ~ Micro, data= uni2)
uni2.anova
Call:
aov(formula = Sum_Uni ~ Micro, data = uni2)
Terms:
Micro Residuals
Sum of Squares 0.04917262 0.00602925
Deg. of Freedom 15 48
Residual standard error: 0.01120756
Estimated effects may be unbalanced
My problem is, now I have a huge list of pairwise comparisons but cannot do anything with it:
TukeyHSD(uni2.anova)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Sum_Uni ~ Micro, data = uni2)
$Micro
diff lwr upr p adj
Act_Glu2-Act_Ala2 -0.0180017863 -0.046632157 0.0106285840 0.6448524
Ana_Ala2-Act_Ala2 -0.0250134285 -0.053643799 0.0036169417 0.1493629
NegI_Ala2-Act_Ala2 0.0702274527 0.041597082 0.0988578230 0.0000000
This dataset has 40 rows...
Ideally, I would like to get a dataset that looks something like this:
Act_Glu2 : a
Act_Ala2 : a
NegI_Ala2: b...
I hope you get the point. So far, I have found nothing comparable online... I also tried to select only the significant pairs in the object returned by TukeyHSD, but the object does not "acknowledge" that it is made up of rows and columns, which makes selecting impossible...
Maybe there is something fundamentally wrong with my approach?

I think the OP wants the letters to get a view of the comparisons.
library(multcompView)
multcompLetters(TukeyHSD(uni2.anova)$Micro[, "p adj"])
That will get you the letters.
You can also use the multcomp package
library(multcomp)
cld(glht(uni2.anova, linfct = mcp(Micro = "Tukey")))
I hope this is what you need.

The results from the TukeyHSD are a list. Use str to look at the structure. In your case you'll see that it's a list of one item and that item is basically a matrix. So, to extract the first column you'll want to save the TukeyHSD result
hsd <- TukeyHSD(uni2.anova)
If you look at str(hsd), you can see how to get at the individual bits...
hsd$Micro[,1]
That will give you the column of your differences. You should be able to extract what you want now.
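Building on the structure described above, here is a minimal sketch of selecting only the significant pairs. Since uni2 is not available, it uses the warpbreaks data that ships with R as a stand-in, with tension playing the role of Micro. The names fit, pairs, sig and sig_df are only illustrative.

```r
# Stand-in data: warpbreaks ships with R; 'tension' plays the role of 'Micro'.
fit <- aov(breaks ~ tension, data = warpbreaks)
hsd <- TukeyHSD(fit)

# hsd is a list with one matrix per model term. Rows are the pairs,
# columns are diff, lwr, upr and "p adj".
pairs <- hsd$tension

# Keep the rows whose adjusted p-value is below 0.05.
# drop = FALSE keeps a matrix even if only one row survives.
sig <- pairs[pairs[, "p adj"] < 0.05, , drop = FALSE]

# Turn the result into a data frame with the pair label as a normal column.
sig_df <- data.frame(pair = rownames(sig), sig,
                     row.names = NULL, check.names = FALSE)
sig_df
```

For your own model, replace hsd$tension with hsd$Micro.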

Hard to tell without example data, but assuming Micro is just a factor with 4 levels and uni2 looks something like
n = 40
Micro = c('Act_Glu2', 'Act_Ala2', 'Ana_Ala2', 'NegI_Ala2')[sample(4, n, replace = TRUE)]
Sum_Uni = rnorm(n, 5, 0.5)
Sum_Uni[Micro=='Act_Glu2'] = Sum_Uni[Micro=='Act_Glu2'] + 0.5
uni2 = data.frame(Sum_Uni, Micro)
> uni2
Sum_Uni Micro
1 4.964061 Ana_Ala2
2 4.807680 Ana_Ala2
3 4.643279 NegI_Ala2
4 4.793383 Act_Ala2
5 5.307951 NegI_Ala2
6 5.171687 Act_Glu2
...
then I think what you're actually trying to get at is the basic multiple regression output:
fit = lm(Sum_Uni ~ Micro, data = uni2)
summary(fit)
anova(fit)
> summary(fit)
Call:
lm(formula = Sum_Uni ~ Micro, data = uni2)
Residuals:
Min 1Q Median 3Q Max
-1.26301 -0.35337 -0.04991 0.29544 1.07887
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8364 0.1659 29.157 < 2e-16 ***
MicroAct_Glu2 0.9542 0.2623 3.638 0.000854 ***
MicroAna_Ala2 0.1844 0.2194 0.841 0.406143
MicroNegI_Ala2 0.1937 0.2158 0.898 0.375239
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4976 on 36 degrees of freedom
Multiple R-squared: 0.2891, Adjusted R-squared: 0.2299
F-statistic: 4.88 on 3 and 36 DF, p-value: 0.005996
> anova(fit)
Analysis of Variance Table
Response: Sum_Uni
Df Sum Sq Mean Sq F value Pr(>F)
Micro 3 3.6254 1.20847 4.8801 0.005996 **
Residuals 36 8.9148 0.24763
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You can access the numbers in any of these tables like, for example,
> summary(fit)$coef[2,4]
[1] 0.0008536287
To see the list of what is stored in each object, use names():
> names(summary(fit))
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled"
In addition to the TukeyHSD() function you found, there are many other options for looking at the pairwise tests further, and correcting the p-values if desired. These include pairwise.table(), estimable() in gmodels, the resampling and boot packages, and others...
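As one example of those other options, here is a rough sketch using pairwise.t.test() and p.adjust() from base R. It again uses the built-in warpbreaks data as a stand-in, and the choice of the Holm and Bonferroni corrections is only an assumption for illustration.

```r
# Stand-in data: warpbreaks ships with R.
# Pairwise t-tests between all levels of 'tension', Holm-adjusted.
pw <- pairwise.t.test(warpbreaks$breaks, warpbreaks$tension,
                      p.adjust.method = "holm")

# Lower-triangular matrix of adjusted p-values, one cell per pair of levels.
pw$p.value

# The unadjusted p-values can be re-adjusted with any method p.adjust() knows.
raw <- pairwise.t.test(warpbreaks$breaks, warpbreaks$tension,
                       p.adjust.method = "none")$p.value
bonf <- p.adjust(raw[!is.na(raw)], method = "bonferroni")
bonf
```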

Related

How to center the response and predictor variables (at the mean) using the summary table?

**We don't have the data; only the summary table is given.**
Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.8522 12.6560 -2.912 0.005573 **
X1 -0.7120 1.4540 -0.490 0.626747
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.27 on 45 degrees of freedom
Multiple R-squared: 0.7377, Adjusted R-squared: 0.7144
F-statistic: 31.63 on 4 and 45 DF, p-value: 1.478e-12
I know that centering a predictor at its mean gives it new values: the entire scale shifts so that the mean becomes 0. I also know that the intercept will change, but the regression coefficient for that variable will not.
It looks like you have access to the model in R (since you mention you can get the variance/covariance matrix of the betas). If that's the case, then you can convert it using the model.
The relationship between standardized slopes and unstandardized slopes is:
beta = b*(sx/sy)
Embedded within all lm's is the data, which can be accessed using the following R code:
model$model
(Assuming the fitted object is called "model")
To get the standardized slopes, all you'd have to do is something like this:
get_betas = function(object){
b = summary(object)$coef[, 1]
sx = apply(model.matrix(object), 2, sd)
sy = apply(object$model[1], 2, sd)
beta <- b * sx/sy
return(beta)
}
Then you would use that function to extract the betas
get_betas(model)
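On the centering point raised in the question, here is a small sketch (using the built-in cars data as a stand-in, with illustrative object names) showing that centering the predictor leaves the slope alone and moves the intercept to b0 + b1 * mean(x). Note that this needs the predictor mean, which does not appear in a coefficient table, so it has to come from somewhere else.

```r
# Stand-in data: cars ships with R (dist regressed on speed).
fit_raw <- lm(dist ~ speed, data = cars)

# Center the predictor at its mean and refit.
cars_c <- transform(cars, speed_c = speed - mean(speed))
fit_centered <- lm(dist ~ speed_c, data = cars_c)

coef(fit_raw)
coef(fit_centered)   # same slope, different intercept

# The centered intercept can be computed from the uncentered coefficients
# plus the predictor mean: b0 + b1 * mean(x).
b <- coef(fit_raw)
new_intercept <- unname(b["(Intercept)"] + b["speed"] * mean(cars$speed))
new_intercept
```

With a single intercept and a centered predictor, that new intercept is simply the mean of the response, so centering the response as well would bring it to zero.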

Binary Logistic Regression output

I'm an undergrad student and am currently struggling with R. I've been trying to teach myself for weeks but I'm not a natural, so I thought I'd seek some support.
I'm currently trying to analyse the interaction of my variables on recall of a target using logistic regression, as specified by my tutor. I have a 2 (isolate vs. control condition) by 2 (similarity vs. difference list type) study, and my dependent variable is binary recall (yes or no). I've tried to clean my data and run the code,
Call:
glm(formula = Target ~ Condition * List, family = "binomial",
data = pro)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8297 -0.3288 0.6444 0.6876 2.4267
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.4663 0.6405 2.289 0.022061 *
Conditionisolate -1.1097 0.8082 -1.373 0.169727
Listsim -4.3567 1.2107 -3.599 0.000320 ***
Conditionisolate:Listsim 5.3218 1.4231 3.740 0.000184 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 97.736 on 70 degrees of freedom
Residual deviance: 65.869 on 67 degrees of freedom
AIC: 73.869
That's my output ^. It completely ignores the difference and control conditions. I know I'm doing something wrong and I'm feeling quite exasperated by it. Can anyone help me?
In the model output, R is treating control and difference as the baseline levels of your two variables. The outcome associated with them is wrapped up in the intercept. For other combinations of variable levels, the coefficients show how those differ from that baseline.
Control/Difference: just use the intercept
Control/Similarity: intercept + Listsim
Isolate/Difference: intercept + Conditionisolate
Isolate/Similarity: intercept + Listsim + Conditionisolate + Conditionisolate:Listsim
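To make those sums concrete, here is a short sketch that plugs in the coefficients from the output above and converts the resulting log-odds to probabilities. The vector names are made up for readability, and the probabilities refer to whichever level of Target R treats as the "success" level, which I am assuming is recall.

```r
# Coefficients copied from the summary output in the question.
b <- c(intercept = 1.4663, isolate = -1.1097,
       sim = -4.3567, isolate_sim = 5.3218)

# Log-odds for each of the four cells of the 2 x 2 design.
log_odds <- c(
  control_difference = b[["intercept"]],
  control_similarity = b[["intercept"]] + b[["sim"]],
  isolate_difference = b[["intercept"]] + b[["isolate"]],
  isolate_similarity = b[["intercept"]] + b[["sim"]] +
    b[["isolate"]] + b[["isolate_sim"]]
)

# plogis() is the inverse logit: it turns log-odds into probabilities.
probs <- plogis(log_odds)
round(probs, 3)
```

On the fitted model itself, predict(fit, newdata = ..., type = "response") should give the same probabilities without doing the sums by hand.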

How to extract various statistics like teststat and pvalue etc. from ur test output object

I need to run stationarity tests for multiple time series. I use the ur.df function from the urca package to do them. I then store the outputs from each of these ADF tests as a list of lists, since each output is itself a list.
I need to be able to store all the parameters from the output in a data frame. Is there a way to do it?
I know I can extract some of the parameters, like @teststat, @cval and the like, but how do we get out all of the parameters, as we can with the lm regression output using broom?
For example, if this is the output from ur.df function
test1 <- ur.df(usage_1601_1612, type = "none", lags = 1, selectlags = "AIC")
The contents of test1 are shown below
Test regression none
Call:
lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
Residuals:
Min 1Q Median 3Q Max
-6093.2 -1385.8 -100.9 1414.3 6962.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
z.lag.1 -0.004212 0.005191 -0.811 0.4177
z.diff.lag -0.126685 0.052161 -2.429 0.0156 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2351 on 362 degrees of freedom
Multiple R-squared: 0.01838, Adjusted R-squared: 0.01296
F-statistic: 3.39 on 2 and 362 DF, p-value: 0.03479
Value of test-statistic is: -0.8114
Critical values for test statistics:
1pct 5pct 10pct
tau1 -2.58 -1.95 -1.62
This is the general code that I run
urresultorigobjects <-
lapply(usagextsobjects, function(x) {
summary(ur.df(x, type = "none", lags = 1, selectlags = "AIC"))
})
This generates the list urresultorigobjects containing the outputs from the stationarity test on each of the xts objects in the list usagextsobjects.
I would like to be able to capture all of the output and store the relevant information in a data frame for code based reference for downstream processing.
Any suggestions would be greatly appreciated.
I don't know if there is an easier way but I managed it like this:
(ur.df(valdata, type="none", lags=0))@testreg[["coefficients"]][,"Pr(>|t|)"]
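Extending that idea, here is a rough sketch of collecting the pieces into one data frame, one row per series. It assumes the urca package is installed. The simulated series_list is only a hypothetical stand-in for usagextsobjects, and the column names are my own choice.

```r
library(urca)

# Hypothetical stand-in for usagextsobjects: two simulated random walks.
set.seed(42)
series_list <- list(a = cumsum(rnorm(200)), b = cumsum(rnorm(200)))

adf_rows <- lapply(names(series_list), function(nm) {
  test <- ur.df(series_list[[nm]], type = "none", lags = 1,
                selectlags = "AIC")
  # Coefficient table of the test regression, as in the one-liner above.
  coefs <- test@testreg[["coefficients"]]
  data.frame(
    series     = nm,
    teststat   = test@teststat[1, 1],  # tau1 statistic
    cval_1pct  = test@cval[1, 1],      # critical values for tau1
    cval_5pct  = test@cval[1, 2],
    cval_10pct = test@cval[1, 3],
    p_z_lag_1  = coefs["z.lag.1", "Pr(>|t|)"]
  )
})

# Stack the one-row data frames into a single table.
adf_table <- do.call(rbind, adf_rows)
adf_table
```

Indexing the slots by position keeps the sketch short, but it is tied to type = "none"; the other types return more than one test statistic, so the columns would need adjusting.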

plm vs lm - different results?

I tried several times to use lm and plm to do a regression. And I get different results.
First, I used lm as follows:
fixed.Region1 <- lm(CapNormChange ~ Policychanges + factor(Region),
data=Panel)
Further I used plm in the following way:
fixed.Region2 <- plm(CapNormChange ~ Policychanges+ factor(Region),
data=Panel, index=c("Region", "Year"), model="within", effect="individual")
I think there is something wrong with plm because I don't see an intercept in the results (see below).
Furthermore, I am not entirely sure if + factor(Region) is necessary; however, if it is not there, I don't see the coefficients (and significance) for the dummies.
So, my question is:
Am I using the plm function wrong? (Or what is wrong with it?)
If not, how can it be that the results are different?
If somebody could give me a hint, I would really appreciate it.
Results from LM:
Call:
lm(formula = CapNormChange ~ Policychanges + factor(Region),
data = Panel)
Residuals:
Min 1Q Median 3Q Max
-31.141 -4.856 -0.642 1.262 192.803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.3488 4.9134 3.531 0.000558 ***
Policychanges 0.6412 0.1215 5.277 4.77e-07 ***
factor(Region)Asia -19.3377 6.7804 -2.852 0.004989 **
factor(Region)C America + Carib 0.1147 6.8049 0.017 0.986578
factor(Region)Eurasia -17.6476 6.8294 -2.584 0.010767 *
factor(Region)Europe -20.7759 8.8993 -2.335 0.020959 *
factor(Region)Middle East -17.3348 6.8285 -2.539 0.012200 *
factor(Region)N America -17.5932 6.8064 -2.585 0.010745 *
factor(Region)Oceania -14.0440 6.8417 -2.053 0.041925 *
factor(Region)S America -14.3580 6.7781 -2.118 0.035878 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.72 on 143 degrees of freedom
Multiple R-squared: 0.3455, Adjusted R-squared: 0.3043
F-statistic: 8.386 on 9 and 143 DF, p-value: 5.444e-10
Results from PLM:
Call:
plm(formula = CapNormChange ~ Policychanges, data = Panel, effect = "individual",
model = "within", index = c("Region", "Year"))
Balanced Panel: n = 9, T = 17, N = 153
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-31.14147 -4.85551 -0.64177 1.26236 192.80277
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Policychanges 0.64118 0.12150 5.277 4.769e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 66459
Residual Sum of Squares: 55627
R-Squared: 0.16299
Adj. R-Squared: 0.11031
F-statistic: 27.8465 on 1 and 143 DF, p-value: 4.7687e-07
You would need to leave out + factor(Region) in your formula for the within model with plm to get what you want.
Within models do not have an intercept, but some software packages (esp. Stata and Gretl) report one. You can estimate it with plm by running within_intercept on your estimated model. The help page has the details about this somewhat artificial intercept.
If you want the individual effects and their significance, use summary(fixef(<your_plm_model>)). Use pFtest to check if the within specification seems worthwhile.
The R-squareds diverge between the lm model and the plm model. This is because the lm model (used like this with the dummies, it is usually called the LSDV model, for least squares dummy variables) gives what is sometimes called the overall R-squared, while plm gives the R-squared of the demeaned regression, sometimes called the within R-squared. Stata's documentation has some details about this: https://www.stata.com/manuals/xtxtreg.pdf
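To illustrate the last point without plm, here is a minimal base-R sketch on simulated data (all names and numbers are made up). It fits the LSDV model and then a regression on group-demeaned variables, which is what the within estimator does under the hood. The slope should agree, while the two R-squareds differ.

```r
# Simulated balanced panel: 9 regions, 17 periods each (hypothetical data).
set.seed(1)
panel <- data.frame(
  region = factor(rep(paste0("R", 1:9), each = 17)),
  x      = rnorm(153)
)
region_effect <- rnorm(9, sd = 5)[as.integer(panel$region)]
panel$y <- 2 + 0.6 * panel$x + region_effect + rnorm(153)

# LSDV: ordinary lm with one dummy per region.
lsdv <- lm(y ~ x + region, data = panel)

# Within transformation: subtract each region's mean from y and x.
panel$y_dm <- panel$y - ave(panel$y, panel$region)
panel$x_dm <- panel$x - ave(panel$x, panel$region)
within_fit <- lm(y_dm ~ x_dm - 1, data = panel)

slope_lsdv   <- unname(coef(lsdv)["x"])
slope_within <- unname(coef(within_fit)["x_dm"])
r2_overall   <- summary(lsdv)$r.squared        # counts variance explained by dummies
r2_within    <- summary(within_fit)$r.squared  # only the demeaned variation

c(slope_lsdv = slope_lsdv, slope_within = slope_within)
c(r2_overall = r2_overall, r2_within = r2_within)
```

One caveat: the standard errors from the demeaned lm are not directly comparable, because it does not subtract the degrees of freedom used up by the region means.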

A glm with interactions for overdispersed rates

I have measurements obtained from 2 groups (a and b) where each group has the same 3 levels (x, y, z). The measurements are counts out of totals (i.e., rates), but in group a there cannot be zeros whereas in group b there can (hard coded in the example below).
Here's my example data.frame:
set.seed(3)
df <- data.frame(count = c(rpois(15,5),rpois(15,5),rpois(15,3),
rpois(15,7.5),rpois(15,2.5),rep(0,15)),
group = as.factor(c(rep("a",45),rep("b",45))),
level = as.factor(rep(c(rep("x",15),rep("y",15),rep("z",15)),2)))
#add total - fixed for all
df$total <- rep(max(df$count)*2,nrow(df))
I'm interested in quantifying, for each level x, y, z, whether there is any difference between the (average) measurements of a and b. If there is, is it statistically significant?
From what I understand a Poisson GLM for rates seems to be appropriate for these types of data. In my case it seems that perhaps a negative binomial GLM would be even more appropriate since my data are over dispersed (I tried to create that in my example data to some extent but in my real data it is definitely the case).
Following the answer I got for a previous post I went with:
library(dplyr)
library(MASS)
df %>%
mutate(interactions = paste0(group,":",level),
interactions = ifelse(group=="a","a",interactions)) -> df2
df2$interactions = as.factor(df2$interactions)
fit <- glm.nb(count ~ interactions + offset(log(total)), data = df2)
> summary(fit)
Call:
glm.nb(formula = count ~ interactions + offset(log(total)), data = df2,
init.theta = 41.48656798, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.40686 -0.75495 -0.00009 0.46892 2.28720
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02047 0.07824 -25.822 < 2e-16 ***
interactionsb:x 0.59336 0.13034 4.552 5.3e-06 ***
interactionsb:y -0.28211 0.17306 -1.630 0.103
interactionsb:z -20.68331 2433.94201 -0.008 0.993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(41.4866) family taken to be 1)
Null deviance: 218.340 on 89 degrees of freedom
Residual deviance: 74.379 on 86 degrees of freedom
AIC: 330.23
Number of Fisher Scoring iterations: 1
Theta: 41.5
Std. Err.: 64.6
2 x log-likelihood: -320.233
I'd expect the difference between a and b for level z to be significant. However, the Std. Error for level z seems enormous and hence the p-value is nearly 1.
My question is whether the model I'm using is set up correctly to answer my question (mainly through the use of the interactions factor?)
