I have looked through online forums and various papers and am a little stumped on how to interpret the results of my RDA analysis.
I ran the full model, conditioning on genetic cluster, and a global permutation test of the ANOVA (PERMANOVA) with the anova.cca() function found the model to be significant.
signif.full.c <- anova.cca(gno.rda.c)
signif.full.c
#Permutation test for rda under reduced model
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + lat + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#          Df Variance      F Pr(>F)
#Model      4    221.0 1.0546  0.007 **
#Residual 100   5239.3
Then I look at the RDA axes, none of which are significant:
signif.axis.c <- anova.cca(gno.rda.c, by="axis")
signif.axis.c
#Permutation test for rda under reduced model
#Forward tests for axes
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + lat + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#          Df Variance      F Pr(>F)
#RDA1       1     58.0 1.1078  0.123
#RDA2       1     56.3 1.0740  0.307
#RDA3       1     55.3 1.0549  0.302
#RDA4       1     51.4 0.9816  0.686
#Residual 100   5239.3
But looking at the "margin" permutations, which test the significance of the individual terms, I get significant results for longitude and depth:
signif.margin.c <- anova.cca(gno.rda.c, by="margin")
signif.margin.c
#Permutation test for rda under reduced model
#Marginal effects of terms
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + lat + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#          Df Variance      F Pr(>F)
#long       1     56.2 1.0717  0.027 *
#lat        1     53.9 1.0285  0.214
#Depth      2    112.8 1.0762  0.007 **
#Residual 100   5239.3
I removed latitude from the model, and again the model and the terms are significant, but the RDA axes still are not:
#Permutation test for rda under reduced model
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#          Df Variance     F Pr(>F)
#Model      3    167.1 1.063  0.005 **
#Residual 101   5293.2
#Permutation test for rda under reduced model
#Marginal effects of terms
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#          Df Variance      F Pr(>F)
#long       1     55.9 1.0657  0.039 *
#Depth      2    112.4 1.0719  0.015 *
#Residual 101   5293.2
#Permutation test for rda under reduced model
#Forward tests for axes
#Permutation: free
#Number of permutations: 999
#Model: rda(formula = gen.imp ~ long + Depth + Condition(Clusters), data = gno.clusters, scale = T)
#          Df Variance      F Pr(>F)
#RDA1       1     57.1 1.0900  0.165
#RDA2       1     56.0 1.0681  0.178
#RDA3       1     54.0 1.0308  0.245
#Residual 101   5293.2
Does this mean that I can ignore the model significance and the term significance because the RDA axes are not significant?
I am using the lrm function from the rms package to get:
> model_1 <- lrm(dependent_variable ~ var1+ var2 + var3, data = merged_dataset, na.action="na.delete")
> print(model_1)
Logistic Regression Model
lrm(dependent_variable ~ var1+ var2 + var3, data = merged_dataset, na.action="na.delete")
                     Model Likelihood     Discrimination    Rank Discrim.
                        Ratio Test            Indexes          Indexes
Obs          6046    LR chi2      21.97    R2       0.005    C       0.531
 0           3151    d.f.            11    g        0.138    Dxy     0.062
 1           2895    Pr(> chi2)  0.0246    gr       1.148    gamma   0.062
max |deriv| 1e-13                          gp       0.034    tau-a   0.031
                                           Brier    0.249

          Coef    S.E.   Wald Z Pr(>|Z|)
Intercept -0.0752 0.0348 -2.16  0.0305
var1      10.6916 2.1476  0.32  0.7474
var2      -0.1595 0.4125 -0.39  0.6990
var3      -0.0563 0.0266 -2.12  0.0341
My question is: are these coefficients odds ratios or not? If not, how can I get the odds ratios?
Hi there, here is an approach. Note that it helps if you include some sample data for us to work with.
Generating some fake data...
fake_data <- matrix(rnorm(300), ncol = 3)
y_start <- 1/(1+exp(-(fake_data %*% c(1, .3, 2))))
y <- rbinom(100, size = 1, prob = y_start)
dat <- data.frame(y, fake_data)
Now we fit the model:
library(rms)
fit <- lrm(y ~ ., data = dat)
The model coefficients will be in the form of log-odds (still on the log scale)
# Log-odds
coef(fit)
Intercept X1 X2 X3
0.03419513 0.92890297 0.48097414 1.86036897
If you want odds ratios, exponentiate the coefficients to move off the log scale.
# Odds ratios (the exponentiated intercept is the baseline odds, not a ratio)
exp(coef(fit))
Intercept X1 X2 X3
1.034787 2.531730 1.617649 6.426107
So in this example the odds of Y = 1 are multiplied by about 2.5 for each one-unit increase in X1, holding X2 and X3 fixed.
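For an odds ratio with a confidence interval, a minimal base-R sketch is shown below. It uses glm() with family = binomial as a stand-in for lrm() (an assumption, made only so the example runs without rms; both report coefficients on the log-odds scale). The data are simulated and the names sim, fit_glm and or_table are made up for illustration.

```r
# Simulated data; base glm() stands in for rms::lrm() here (assumption)
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.5 * sim$x1 - 0.25 * sim$x2))

fit_glm <- glm(y ~ x1 + x2, data = sim, family = binomial)

# Coefficients are log-odds; exponentiate them to get odds ratios
odds_ratios <- exp(coef(fit_glm))

# Wald 95% intervals, computed on the log-odds scale and then exponentiated
ci <- exp(confint.default(fit_glm))

or_table <- cbind(OR = odds_ratios, ci)
print(round(or_table, 3))
```

The interval is built on the log-odds scale first because that is where the Wald approximation is applied; exponentiating the endpoints keeps the interval positive.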
I ran a coxph model and a frailty model, and now I would like to report the hazard ratio for a continuous variable (age) per 5-unit increment instead of per 1-unit increment. Is there a function in R that can do this? If so, does it also work for a frailty model? I used the package frailtypack.
library('survival')
data(veteran)
cox <- coxph(Surv(time, status) ~ age, data = veteran)
summary(cox)
# Call:
# coxph(formula = Surv(time, status) ~ age, data = veteran)
#
# n= 137, number of events= 128
#
# coef exp(coef) se(coef) z Pr(>|z|)
# age 0.007500 1.007528 0.009565 0.784 0.433
#
# exp(coef) exp(-coef) lower .95 upper .95
# age 1.008 0.9925 0.9888 1.027
#
# Concordance= 0.515 (se = 0.029 )
# Likelihood ratio test= 0.63 on 1 df, p=0.4
# Wald test = 0.61 on 1 df, p=0.4
# Score (logrank) test = 0.62 on 1 df, p=0.4
Just add a new variable that represents the age group each subject belongs to; for example 1: 0-4, 2: 5-9, 3: 10-14, etc.
This is an example using the veteran dataset in the survival package. The data has a continuous variable age. Adding this as a predictor to the model will give you the relative risk (hazard ratio) for a one-year increase in age. If you are interested in an x-year increment, you should generate a new variable which groups subjects accordingly. For these data, I applied the following grouping; group 1: younger than 40, group 2: 40 - <50, group 3: 50 - <60, group 4: 60 - <70, and group 5: 70 or older. As such, the HR for a 10-year increment is 1.049. In other words, the risk increases by about 5% for every 10-year increase in age. Note that the association is not statistically significant.
library(survival)
data(veteran)
veteran$ageCat <- 5
veteran$ageCat[veteran$age < 70] <- 4
veteran$ageCat[veteran$age < 60] <- 3
veteran$ageCat[veteran$age < 50] <- 2
veteran$ageCat[veteran$age < 40] <- 1
table(veteran$ageCat)
1 2 3 4 5
11 20 22 72 12
cox <- coxph(Surv(time, status) ~ ageCat, data = veteran)
summary(cox)
Call:
coxph(formula = Surv(time, status) ~ ageCat, data = veteran)
n= 137, number of events= 128
coef exp(coef) se(coef) z Pr(>|z|)
ageCat 0.04793 1.04910 0.09265 0.517 0.605
exp(coef) exp(-coef) lower .95 upper .95
ageCat 1.049 0.9532 0.8749 1.258
Concordance= 0.509 (se = 0.028 )
Rsquare= 0.002 (max possible= 0.999 )
Likelihood ratio test= 0.27 on 1 df, p=0.6024
Wald test = 0.27 on 1 df, p=0.6049
Score (logrank) test = 0.27 on 1 df, p=0.6048
@milan's post answers a similar question, but not the one asked. Since age was split into decades and the group index was modeled as a continuous variable, the hazard ratio compares a subject's age decade with the next youngest decade. That is, the HR for subjects aged 51 vs 49 or 59 vs 41 would be the same, despite there being 2 or 18 years between them.
Anyway, the default as you suggest is for a 1-unit increment in the continuous variable, age in this case. It's not always useful to compare subjects by 1-unit change especially when the range gets to be much larger.
You can do the following, which is agnostic to the model, so it should work for lm, glm, survival::coxph, frailtypack::frailtyPenal, etc.
library('survival')
data(veteran)
## 1-year increase in age
cox <- coxph(Surv(time, status) ~ age, data = veteran)
exp(coef(cox))
# age
# 1.007528
For a multiplicative model like Cox regressions, you can get the x-unit change after the model is fit:
## 5-year increase in age
exp(coef(cox)) ^ 5
# age
# 1.038211
## or equivalently
exp(coef(cox) * 5)
# age
# 1.038211
However, it's easier to create a variable for the age transformation then fit the model:
## or you can create a variable to model
veteran <- within(veteran, {
age5 <- age / 5
})
cox5_1 <- coxph(Surv(time, status) ~ age5, data = veteran)
exp(coef(cox5_1))
# age5
# 1.038211
cox5_2 <- coxph(Surv(time, status) ~ I(age / 5), data = veteran)
exp(coef(cox5_2))
# I(age/5)
# 1.038211
Note you need to use I here in the formula interface since some operators have special meanings in formulae. For example, lm(mpg ~ wt - 1, mtcars) and lm(mpg ~ I(wt - 1), mtcars) are two different models.
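If an interval is wanted along with the rescaled hazard ratio, the same idea carries over: multiply the coefficient and its confidence limits by the increment before exponentiating. The sketch below assumes a Wald interval from confint() is acceptable; the names k, hr_k and ci_k are made up for illustration.

```r
library(survival)

# veteran ships with the survival package
cox <- coxph(Surv(time, status) ~ age, data = veteran)

k <- 5  # size of the increment, in years

# Hazard ratio and Wald 95% interval for a k-year increase in age
hr_k <- exp(k * coef(cox))
ci_k <- exp(k * confint(cox))

print(round(c(HR = unname(hr_k), lower = ci_k[1, 1], upper = ci_k[1, 2]), 4))

# Cross-check: refitting with age / k should give the same hazard ratio
cox_k <- coxph(Surv(time, status) ~ I(age / k), data = veteran)
print(exp(coef(cox_k)))
```

Rescaling only changes the units in which the effect is reported; the test statistics and p-values are the same as for the 1-year model.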
You can use these methods in other models, for example frailtyPenal if that is indeed the one you are using:
library('frailtypack')
fp <- frailtyPenal(Surv(time, status) ~ age, data = veteran, n.knots = 12, kappa = 1e5)
exp(fp$coef)
exp(fp$coef) ^ 5
fp5_1 <- frailtyPenal(Surv(time, status) ~ age5, data = veteran, n.knots = 12, kappa = 1e5)
fp5_2 <- frailtyPenal(Surv(time, status) ~ I(age / 5), data = veteran, n.knots = 12, kappa = 1e5)
exp(fp5_1$coef)
exp(fp5_2$coef)
I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, which vary over time for a number of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I do not consider time fixed effects and use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that allow me to run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
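To see why the dummy-variable and within estimators should agree, the two-way within transformation can be written out by hand in base R. This is a sketch for a balanced panel only; the simulated data and the helper name demean2 are made up for illustration.

```r
set.seed(42)
n_region <- 30
n_year <- 10
panel <- expand.grid(region = seq_len(n_region), year = seq_len(n_year))
panel$a <- rnorm(nrow(panel)) + panel$region / 10
panel$b <- rnorm(nrow(panel)) + panel$year / 10
panel$y <- 2 * panel$a - 1.5 * panel$b +
  panel$region / 5 + panel$year / 5 + rnorm(nrow(panel))

# Two-way within transformation for a balanced panel:
# x_it - mean over region i - mean over year t + overall mean
demean2 <- function(x, g1, g2) x - ave(x, g1) - ave(x, g2) + mean(x)

within_df <- data.frame(
  y = demean2(panel$y, panel$region, panel$year),
  a = demean2(panel$a, panel$region, panel$year),
  b = demean2(panel$b, panel$region, panel$year)
)

# Regression on demeaned data versus regression with explicit dummies
fit_within <- lm(y ~ a + b - 1, data = within_df)
fit_lsdv <- lm(y ~ a + b + factor(region) + factor(year), data = panel)

print(coef(fit_within))
print(coef(fit_lsdv)[c("a", "b")])
```

Only the coefficients should be compared here: the standard errors from the demeaned lm() are not adjusted for the degrees of freedom absorbed by the fixed effects. The simple demeaning formula also relies on the panel being balanced.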
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python)
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the plm package in R gives the same results for the first-difference model:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
data = data, model = "fd")
No problem so far. However, the fixed effect reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu, data = data,
model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.
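One possible reading, offered as a sketch rather than a diagnosis: assuming the panel has exactly two periods per city, first differencing matches fixed effects for the cross-sectional unit (plus a period dummy), not fixed effects for time. In the plm call above, effect = "time" requests time effects only, so effect = "individual" with a year dummy in the formula may be the closer analogue. The base-R simulation below uses made-up names (id, year, x, y) and does not touch the rental data.

```r
set.seed(7)
n <- 64
id <- rep(seq_len(n), each = 2)
year <- rep(c(0, 1), times = n)
alpha <- rep(rnorm(n), each = 2)   # unit-specific effect
x <- rnorm(2 * n) + alpha          # regressor correlated with the unit effect
y <- 0.3 * year + 0.5 * x + alpha + rnorm(2 * n, sd = 0.2)

# First differences: one row per unit; the intercept picks up the period effect
dx <- x[year == 1] - x[year == 0]
dy <- y[year == 1] - y[year == 0]
fit_fd <- lm(dy ~ dx)

# Fixed effects as a dummy for each unit, plus a period dummy
fit_fe <- lm(y ~ x + year + factor(id))

fd <- summary(fit_fd)$coefficients["dx", c("Estimate", "Std. Error")]
fe <- summary(fit_fe)$coefficients["x", c("Estimate", "Std. Error")]
print(rbind(first_difference = fd, fixed_effects = fe))
```

With two periods the slope estimate and its standard error coincide across the two fits; with more than two periods the estimators generally differ.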
I'm doing some exploring with the same data and I'm trying to highlight the within-group variance versus the between-group variance. I have been able to show that the between-group variance is very strong; however, the nature of the data should show weak within-group variance (my Shapiro-Wilk normality test suggests this). I believe that some resampling with a Welch correction might show this.
I was wondering if someone knew whether there is a resampling-based ANOVA with a Welch correction in R. I see there is an R implementation of the permutation test, but with no correction. If not, how would I code the test directly while using this implementation?
http://finzi.psych.upenn.edu/library/lmPerm/html/aovp.html
Here is the outline for my basic between group ANOVA:
fit <- lm(formula = data$Boys ~ data$GroupofBoys)
anova(fit)
I believe you're correct that there isn't an easy way to do a Welch-corrected ANOVA with resampling, but it should be possible to cobble a few things together to make it work.
require('Ecdat')
I'll use the "Star" dataset from the "Ecdat" package, which looks at the effects of small class sizes on standardized test scores.
star<-Star
attach(star)
head(star)
tmathssk treadssk classk totexpk sex freelunk race schidkn
2 473 447 small.class 7 girl no white 63
3 536 450 small.class 21 girl no black 20
5 463 439 regular.with.aide 0 boy yes black 19
11 559 448 regular 16 boy no white 69
12 489 447 small.class 5 boy yes white 79
13 454 431 regular 8 boy yes white 5
Some exploratory analysis:
#boxplots
boxplot(treadssk ~ classk, ylab="Total Reading Scaled Score")
title("Reading Scores by Class Size")
#histograms
hist(treadssk, xlab="Total Reading Scaled Score")
Run regular anova
model1 = aov(treadssk ~ classk, data = star)
summary(model1)
Df Sum Sq Mean Sq F value Pr(>F)
classk 2 37201 18601 18.54 9.44e-09 ***
Residuals 5745 5764478 1003
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A look at the anova residuals
#qqplot
qqnorm(residuals(model1),ylab="Reading Scaled Score")
qqline(residuals(model1),ylab="Reading Scaled Score")
qqplot shows that ANOVA residuals deviate from the normal qqline
#Fitted Y vs. Residuals
plot(fitted(model1), residuals(model1))
Fitted Y vs. Residuals shows converging trend in the residuals, can test with a Shapiro-Wilk test just to be sure
shapiro.test(treadssk[1:5000]) #shapiro.test is constrained to sample sizes between 3 and 5000
Shapiro-Wilk normality test
data: treadssk[1:5000]
W = 0.92256, p-value < 2.2e-16
Just confirms that we aren't going to be able to assume a normal distribution.
We can use bootstrap to estimate the true F-dist.
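#Bootstrap version (with 10,000 iterations)
# Centre each group at its own mean so the resampled data satisfy the null
# hypothesis of equal means while each group keeps its own variance
grpA = treadssk[classk=="regular"] - mean(treadssk[classk=="regular"])
grpB = treadssk[classk=="small.class"] - mean(treadssk[classk=="small.class"])
grpC = treadssk[classk=="regular.with.aide"] - mean(treadssk[classk=="regular.with.aide"])
# Group labels in the same order in which the resampled scores are concatenated
sim_classk <- factor(rep(c("regular", "small.class", "regular.with.aide"),
times = c(length(grpA), length(grpB), length(grpC))), levels = levels(classk))
R = 10000
sim_Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=length(grpA), replace=T)
groupB = sample(grpB, size=length(grpB), replace=T)
groupC = sample(grpC, size=length(grpC), replace=T)
sim_score = c(groupA,groupB,groupC)
sim_data = data.frame(sim_score,sim_classk)
# Welch F statistic for this bootstrap sample
sim_Fstar[i] = oneway.test(sim_score ~ sim_classk, data = sim_data)$statistic
}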
Now we need to get the set of unique pairs of the Group factor
allPairs <- expand.grid(levels(sim_data$sim_classk), levels(sim_data$sim_classk))
## http://stackoverflow.com/questions/28574006/unique-combination-of-two-columns-in-r/28574136#28574136
allPairs <- unique(t(apply(allPairs, 1, sort)))
allPairs <- allPairs[ allPairs[,1] != allPairs[,2], ]
allPairs
[,1] [,2]
[1,] "regular" "small.class"
[2,] "regular" "regular.with.aide"
[3,] "regular.with.aide" "small.class"
Since oneway.test() applies a Welch correction by default, we can use that on our simulated data.
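allResults <- apply(allPairs, 1, function(p) {
#http://stackoverflow.com/questions/28587498/post-hoc-tests-for-one-way-anova-with-welchs-correction-in-r
# dat holds only the two groups in the current pair, but the test below is run on
# the full sim_data (all three groups, hence num df = 2 in the output that follows);
# pass data = dat instead for a true pairwise comparison
dat <- sim_data[sim_data$sim_classk %in% p, ]
ret <- oneway.test(sim_score ~ sim_classk, data = sim_data, na.action = na.omit)
ret$sim_classk <- p
ret
})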
length(allResults)
[1] 3
allResults[[1]]
One-way analysis of means (not assuming equal variances)
data: sim_score and sim_classk
F = 1.7741, num df = 2.0, denom df = 1305.9, p-value = 0.170
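To turn the resampling idea into a p-value, one option is to compare the observed Welch F with the Welch F statistics from data resampled under the null. The sketch below is only one way to do this: welch_boot is a hypothetical helper, not a function from any package, and it uses the built-in PlantGrowth data so that it runs on its own.

```r
# Hypothetical helper: bootstrap p-value for the Welch F statistic
welch_boot <- function(y, g, R = 2000) {
  g <- factor(g)
  # oneway.test() uses var.equal = FALSE by default, i.e. the Welch version
  f_obs <- unname(oneway.test(y ~ g)$statistic)
  # Centre each group at its own mean so the null of equal means holds,
  # while every group keeps its own spread
  centred <- y - ave(y, g)
  f_star <- replicate(R, {
    # Resample with replacement within each group, keeping group sizes fixed
    y_star <- ave(centred, g,
                  FUN = function(v) v[sample.int(length(v), replace = TRUE)])
    unname(oneway.test(y_star ~ g)$statistic)
  })
  list(f_obs = f_obs, f_star = f_star, p_value = mean(f_star >= f_obs))
}

set.seed(123)
res <- welch_boot(PlantGrowth$weight, PlantGrowth$group, R = 2000)
print(res$f_obs)
print(res$p_value)
```

Resampling within groups, rather than pooling, is what lets the groups keep unequal variances, which is the situation the Welch correction is meant for.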
Normally from aov() you can get residuals after using summary() function on it.
But how can I get residuals when I use Repeated measures ANOVA and formula is different?
## as a test, not particularly sensible statistically
npk.aovE <- aov(yield ~ N*P*K + Error(block), npk)
npk.aovE
summary(npk.aovE)
Error: block
Df Sum Sq Mean Sq F value Pr(>F)
N:P:K 1 37.0 37.00 0.483 0.525
Residuals 4 306.3 76.57
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
N 1 189.28 189.28 12.259 0.00437 **
P 1 8.40 8.40 0.544 0.47490
K 1 95.20 95.20 6.166 0.02880 *
N:P 1 21.28 21.28 1.378 0.26317
N:K 1 33.14 33.14 2.146 0.16865
P:K 1 0.48 0.48 0.031 0.86275
Residuals 12 185.29 15.44
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The intuitive summary(npk.aovE)$residuals returns NULL.
Can anyone help me with this?
Look at the output of
> names(npk.aovE)
and try
> npk.aovE$residuals
EDIT: I apologize I read your example way too quickly. What I suggested is not possible with multilevel models with aov(). Try the following:
> npk.pr <- proj(npk.aovE)
> npk.pr[[3]][, "Residuals"]
Here's a simpler reproducible example anyone can play around with if they run into the same issue:
x1 <- gl(8, 4)
block <- gl(2, 16)
y <- as.numeric(x1) + rnorm(length(x1))
d <- data.frame(block, x1, y)
m <- aov(y ~ x1 + Error(block), d)
m.pr <- proj(m)
m.pr[[3]][, "Residuals"]
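As a quick check on the npk model from the question, the strata returned by proj() can also be selected by name, and the extracted residuals compared with the ANOVA table. This sketch uses only base R; res_within is a made-up name.

```r
# npk ships with base R (datasets package)
npk.aovE <- aov(yield ~ N * P * K + Error(block), data = npk)

# proj() returns one matrix of projections per error stratum
npk.pr <- proj(npk.aovE)
print(names(npk.pr))

# Residuals of the within-block stratum, selected by name rather than position
res_within <- npk.pr[["Within"]][, "Residuals"]

# Their sum of squares should match the Residuals line of the
# "Error: Within" table in the summary
print(sum(res_within^2))
```

Selecting the stratum by name avoids picking the wrong element when the Error() term has more than one level.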
The other option is with lme:
require(MASS) ## for oats data set
require(nlme) ## for lme()
require(multcomp) ## for multiple comparison stuff
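Aov.mod <- aov(Y ~ N * V + Error(B/V), data = oats)
aov.out.pr <- proj(Aov.mod) ## projections, one matrix per error stratum
## with Error(B/V) the within-plot residuals sit in the last stratum, "Within"
the_residuals <- aov.out.pr[["Within"]][, "Residuals"]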
Lme.mod <- lme(Y ~ N * V, random = ~1 | B/V, data = oats)
the_residuals <- residuals(Lme.mod)
The original example came without the interaction (Lme.mod <- lme(Y ~ N + V, random = ~1 | B/V, data = oats)), but it seems to work with it (and produces different results, so it is doing something).
And that's it...
...but for completeness:
1 - The summaries of the model
summary(Aov.mod)
anova(Lme.mod)
2 - The Tukey test with repeated measures anova (3 hours looking for this!!). It does raise a warning when there is an interaction (* instead of +), but it seems to be safe to ignore it. Notice that V and N are factors inside the formula.
summary(Lme.mod)
summary(glht(Lme.mod, linfct=mcp(V="Tukey")))
summary(glht(Lme.mod, linfct=mcp(N="Tukey")))
3 - The normality and homoscedasticity plots
par(mfrow=c(1,2)) #two plots side by side
aov.out.pr <- proj(Aov.mod)
#oats$resi <- aov.out.pr[["Within"]][, "Residuals"]
oats$resi <- residuals(Lme.mod)
qqnorm(oats$resi, main="Normal Q-Q") # A quantile normal plot - good for checking normality
qqline(oats$resi)
boxplot(resi ~ interaction(N,V), main="Homoscedasticity",
xlab = "Code Categories", ylab = "Residuals", border = "white",
data=oats)
points(resi ~ interaction(N,V), pch = 1,
main="Homoscedasticity", data=oats)