Is anyone using R2OpenBUGS? Should I rather be using R2WinBUGS? ...
I am trying to model final (2-year) treatment outcomes (e.g. success, death, default or failure) for my sample of patients with a (single) intermediate (3-month) outcome.
R2OpenBUGS is giving me a strange posterior on the multinomial node, in which two of the outcomes are constant, the other two outcomes are equal, and the total number of outcomes is greater than the cohort size.
What am I doing wrong? Many thanks in advance! Code and output are below.
library(R2OpenBUGS)
model <- function() {
  # Prior: distribution of final outcomes for the treatment cohort N_tx
  outc[1:4] ~ dmulti(p.outc[], N_tx)
  p.outc[1] <- 164/1369
  p.outc[2] <- 907/1369
  p.outc[3] <- 190/1369
  p.outc[4] <- 108/1369
  # Prior: distribution of the intermediate outcome (a proxy of the final outcome) for each final-outcome cohort
  # (e.g. the proportion of patients with final outcome 1 that exhibited the intermediate outcome)
  cr_1 ~ dunif(0.451, 0.609)
  cr_2 ~ dunif(0.730, 0.787)
  cr_3 ~ dunif(0.559, 0.700)
  cr_4 ~ dunif(0.148, 0.312)
  # Probability p of the intermediate outcome given the priors above
  p <- (outc[1]*cr_1 + outc[2]*cr_2 + outc[3]*cr_3 + outc[4]*cr_4) / N_tx
  # Likelihood for the number of culture conversions at 3 months among those still on treatment in month 6 (excludes confirmed deaths and defaulters)
  cs ~ dbin(p, N_tx)
}
# N_tx is the number of patients in our cohort
N_tx <- 100
# cs is the number of patient exhibiting the intermediate outcome
cs <- 80
data <- list("N_tx", "cs")
inits <- function() { list(outc=c(round(164/1369*N_tx),
round(907/1369*N_tx),
round(190/1369*N_tx),
round(108/1369*N_tx)),
cr_1=87/(87+77),
cr_2=689/(689+218),
cr_3=120/(120+70),
cr_4=24/(24+84))
}
params <- c("outc")
model.file <- file.path(tempdir(), "model.txt")
write.model(model, model.file)
out <- bugs(data, inits, params, model.file, n.iter = 100000, debug = TRUE)
all(out$summary[,"Rhat"] < 1.1)
out$mean["outc"]
out$sd["outc"]
print(out, digits=5)
And here are some of the outputs:
> all(out$summary[,"Rhat"] < 1.1)
[1] TRUE
>
> out$mean["outc"]
$outc
[1] 15.53095 66.00000 14.00000 15.53095
> out$sd["outc"]
$outc
[1] 3.137715 0.000000 0.000000 3.137715
>
> print(out, digits=5)
Inference for Bugs model at "C:\",
Current: 3 chains, each with 1e+05 iterations (first 50000 discarded)
Cumulative: n.sims = 150000 iterations saved
mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
outc[1] 15.53095 3.13771 10.00000 13.000 15.000 18.000 22.00 1.00100 130000
outc[2] 66.00000 0.00000 66.00000 66.000 66.000 66.000 66.00 1.00000 1
outc[3] 14.00000 0.00000 14.00000 14.000 14.000 14.000 14.00 1.00000 1
outc[4] 15.53095 3.13771 10.00000 13.000 15.000 18.000 22.00 1.00100 130000
deviance 8.59096 2.23382 5.08097 6.927 8.323 9.963 13.66 1.00102 55000
For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
DIC info (using the rule, pD = var(deviance)/2)
pD = 2.5 and DIC = 11.1
DIC is an estimate of expected predictive error (lower deviance is better).
I'm following up on this great answer. The function foo below takes the Name column of the VarCorr(fit) output and uses it as the column names for the summary(rePCA(fit)) output.
It works fine when we input fm1 or fm2, but why does it fail for fm3? Is there a fix?
library(lme4)
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/e/master/sng.csv')
fm1 <- lmer(diameter ~ 1 + (1|plate) + (1|sample), Penicillin)
fm2 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fm3 <- lmer(y ~ A * B * C + (A + B | group) + (C|group), data = dat)
foo <- function(fit) {
obj <- summary(rePCA(fit))
model <- VarCorr(fit)
Map(function(x, z) {
colnames(x$importance) <- paste(z, unique(sapply(model, colnames)), sep = '_')
x
}, obj, names(obj))
}
#EXAMPLE OF USE:
foo(fm1) ### OK !
foo(fm2) ### OK !
foo(fm3) ### :-( Error in dimnames(x) <- dn
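A quick length check shows the mismatch (a minimal sketch; the counts in the comments follow from the outputs shown below):
# rePCA() pools the two terms sharing the grouping factor "group" into a single
# element, while VarCorr() keeps them as separate entries:
length(summary(rePCA(fm3))) # 1
length(VarCorr(fm3))        # 2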
The function fails when the lengths of obj and model differ. Here is a hack to make it work for fm3.
foo <- function(fit) {
  obj <- summary(rePCA(fit))
  model <- VarCorr(fit)
  if (length(obj) == length(model)) {
    obj <- Map(function(x, z) {
      colnames(x$importance) <- paste(z, unique(sapply(model, colnames)), sep = '_')
      x
    }, obj, names(obj))
  } else if (length(obj) == 1) {
    colnames(obj[[1]]$importance) <- unlist(mapply(paste, names(model),
                                                   sapply(model, colnames),
                                                   MoreArgs = list(sep = '_')))
  }
  return(obj)
}
This returns the following output:
foo(fm1)
#$plate
#Importance of components:
# plate_(Intercept)
#Standard deviation 1.54
#Proportion of Variance 1.00
#Cumulative Proportion 1.00
#$sample
#Importance of components:
# sample_(Intercept)
#Standard deviation 3.51
#Proportion of Variance 1.00
#Cumulative Proportion 1.00
foo(fm2)
#$Subject
#Importance of components:
# Subject_(Intercept) Subject_Days
#Standard deviation 0.967 0.2309
#Proportion of Variance 0.946 0.0539
#Cumulative Proportion 0.946 1.0000
foo(fm3)
#$group
#Importance of components:
# group_(Intercept) group_A group_B group.1_(Intercept) group.1_C
#Standard deviation 1.418 1.291 0.5129 0.4542 0.0000497
#Proportion of Variance 0.485 0.402 0.0634 0.0498 0.0000000
#Cumulative Proportion 0.485 0.887 0.9502 1.0000 1.0000000
You could get the column names from fit@cnms (note the @ slot accessor), which saves you the trouble of using VarCorr. The devil seems to be the case of fm1, which gives a list output that we may want to coerce with as.data.frame. Then we can just use `colnames<-` and no Map battles are needed.
foo2 <- function(fit) {
  obj <- summary(rePCA(fit))
  obj <- as.data.frame(lapply(obj, `[`, "importance"))
  `colnames<-`(obj, paste(names(fit@cnms), unlist(fit@cnms), sep = "_"))
}
foo2(fm1)
# plate_(Intercept) sample_(Intercept)
# Standard deviation 1.539676 3.512519
# Proportion of Variance 1.000000 1.000000
# Cumulative Proportion 1.000000 1.000000
foo2(fm2)
# Subject_(Intercept) Subject_Days
# Standard deviation 0.966868 0.2308798
# Proportion of Variance 0.946050 0.0539500
# Cumulative Proportion 0.946050 1.0000000
foo2(fm3)
# group_(Intercept) group_A group_B group_(Intercept) group_C
# Standard deviation 1.385987 1.322335 0.5128262 0.4547251 0.08892506
# Proportion of Variance 0.463190 0.421630 0.0634100 0.0498600 0.00191000
# Cumulative Proportion 0.463190 0.884820 0.9482300 0.9980900 1.00000000
Another advantage I find is that the numbers come out unrounded. You could build rounding into the function or do it afterwards:
round(foo2(fm1), 2)
# plate_(Intercept) sample_(Intercept)
# Standard deviation 1.54 3.51
# Proportion of Variance 1.00 1.00
# Cumulative Proportion 1.00 1.00
I am trying to do a restriction test for a GARCH model (fitted with the 'rugarch' package) using the following hypotheses:
H0: alpha1 + beta1 = 1
H1: alpha1 + beta1 ≠ 1
So I am trying to follow the advice from
https://stats.stackexchange.com/questions/151573/testing-the-sum-of-garch1-1-parameters/151578?noredirect=1#comment629951_151578
1. Specify the unrestricted model using ugarchspec with option variance.model = list(model = "sGARCH") and estimate it using ugarchfit. Obtain the log-likelihood from the slot fit, sub-slot likelihood.
2. Specify the restricted model using ugarchspec with option variance.model = list(model = "iGARCH") and estimate it using ugarchfit. Obtain the log-likelihood as above.
3. Calculate LR = 2*(log-likelihood of unrestricted model - log-likelihood of restricted model) and obtain the p-value as pchisq(q = LR, df = 1, lower.tail = FALSE).
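A minimal sketch of step 3 (fit.u and fit.r are hypothetical names for the fitted sGARCH and iGARCH models):
library(rugarch)
# fit.u: unrestricted sGARCH fit; fit.r: restricted iGARCH fit (hypothetical names)
LR <- 2 * (likelihood(fit.u) - likelihood(fit.r))
pchisq(q = LR, df = 1, lower.tail = FALSE) # p-value for H0: alpha1 + beta1 = 1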
I am using the following 'sGARCH' and 'iGARCH' models from the 'rugarch' package.
(A) sGARCH (unrestricted model):
speccR <- ugarchspec(variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
                     mean.model = list(armaOrder = c(0, 0), include.mean = TRUE, archm = TRUE, archpow = 1))
ugarchfit(speccR, data = data.matrix(P), fit.control = list(scale = 1))
The following is the sGARCH output:
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : norm
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu -0.000355 0.001004 -0.35377 0.723508
archm 0.096364 0.039646 2.43059 0.015074
omega 0.000049 0.000010 4.91096 0.000001
alpha1 0.289964 0.021866 13.26117 0.000000
beta1 0.709036 0.023200 30.56156 0.000000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu -0.000355 0.001580 -0.22482 0.822122
archm 0.096364 0.056352 1.71002 0.087262
omega 0.000049 0.000051 0.96346 0.335316
alpha1 0.289964 0.078078 3.71375 0.000204
beta1 0.709036 0.111629 6.35173 0.000000
LogLikelihood : 5411.828
Information Criteria
------------------------------------
Akaike -3.9180
Bayes -3.9073
Shibata -3.9180
Hannan-Quinn -3.9141
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 233.2 0
Lag[2*(p+q)+(p+q)-1][2] 239.1 0
Lag[4*(p+q)+(p+q)-1][5] 247.4 0
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 4.695 0.03025
Lag[2*(p+q)+(p+q)-1][5] 5.941 0.09286
Lag[4*(p+q)+(p+q)-1][9] 7.865 0.13694
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.556 0.500 2.000 0.4559
ARCH Lag[5] 1.911 1.440 1.667 0.4914
ARCH Lag[7] 3.532 2.315 1.543 0.4190
Nyblom stability test
------------------------------------
Joint Statistic: 5.5144
Individual Statistics:
mu 0.5318
archm 0.4451
omega 1.3455
alpha1 4.1443
beta1 2.2202
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.28 1.47 1.88
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 0.2384 0.8116
Negative Sign Bias 1.1799 0.2381
Positive Sign Bias 1.1992 0.2305
Joint Effect 2.9540 0.3988
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 272.1 9.968e-47
2 30 296.9 3.281e-46
3 40 313.3 1.529e-44
4 50 337.4 1.091e-44
Elapsed time : 0.4910491
(B) iGARCH (restricted model):
speccRR <- ugarchspec(variance.model = list(model = "iGARCH", garchOrder = c(1, 1)),
                      mean.model = list(armaOrder = c(0, 0), include.mean = TRUE, archm = TRUE, archpow = 1))
ugarchfit(speccRR, data = data.matrix(P), fit.control = list(scale = 1))
However, I get the following output, with NA for beta1's standard error, t-value, and p-value.
The following is the iGARCH output:
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : iGARCH(1,1)
Mean Model : ARFIMA(0,0,0)
Distribution : norm
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu -0.000355 0.001001 -0.35485 0.722700
archm 0.096303 0.039514 2.43718 0.014802
omega 0.000049 0.000008 6.42826 0.000000
alpha1 0.290304 0.021314 13.62022 0.000000
beta1 0.709696 NA NA NA
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu -0.000355 0.001488 -0.2386 0.811415
archm 0.096303 0.054471 1.7680 0.077066
omega 0.000049 0.000032 1.5133 0.130215
alpha1 0.290304 0.091133 3.1855 0.001445
beta1 0.709696 NA NA NA
LogLikelihood : 5412.268
Information Criteria
------------------------------------
Akaike -3.9190
Bayes -3.9105
Shibata -3.9190
Hannan-Quinn -3.9159
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 233.2 0
Lag[2*(p+q)+(p+q)-1][2] 239.1 0
Lag[4*(p+q)+(p+q)-1][5] 247.5 0
d.o.f=0
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 4.674 0.03063
Lag[2*(p+q)+(p+q)-1][5] 5.926 0.09364
Lag[4*(p+q)+(p+q)-1][9] 7.860 0.13725
d.o.f=2
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[3] 0.5613 0.500 2.000 0.4538
ARCH Lag[5] 1.9248 1.440 1.667 0.4881
ARCH Lag[7] 3.5532 2.315 1.543 0.4156
Nyblom stability test
------------------------------------
Joint Statistic: 1.8138
Individual Statistics:
mu 0.5301
archm 0.4444
omega 1.3355
alpha1 0.4610
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 1.07 1.24 1.6
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 0.2252 0.8218
Negative Sign Bias 1.1672 0.2432
Positive Sign Bias 1.1966 0.2316
Joint Effect 2.9091 0.4059
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 273.4 5.443e-47
2 30 300.4 6.873e-47
3 40 313.7 1.312e-44
4 50 337.0 1.275e-44
Elapsed time : 0.365
If I calculate the log-likelihood difference to derive the chi-square statistic as suggested, I get a negative value:
2*(5411.828 - 5412.268) = -0.88
The log-likelihood of the restricted model (iGARCH), 5412.268, is higher than the log-likelihood of the unrestricted model (sGARCH), 5411.828, which should not happen.
The data I use are as follows, in time-series form (I am only posting the first 100 observations due to space limits):
Time P
1 0.454213593
2 0.10713195
3 -0.106819399
4 -0.101610699
5 -0.094327846
6 -0.037176107
7 -0.101550977
8 -0.016309894
9 -0.041889484
10 0.103384357
11 -0.011746377
12 0.063304432
13 0.059539249
14 -0.049946177
15 -0.023251656
16 0.013989353
17 -0.002815588
18 -0.009678745
19 -0.011139779
20 0.031592303
21 -0.02348106
22 -0.007206591
23 0.077422089
24 0.064632768
25 -0.003396734
26 -0.025524166
27 -0.026632474
28 0.014614485
29 -0.012380888
30 -0.007463018
31 0.022759969
32 0.038667465
33 -0.028619484
34 -0.021995984
35 -0.006162809
36 -0.031187399
37 0.022455611
38 0.011419264
39 -0.005700445
40 -0.010106343
41 -0.004310162
42 0.00513715
43 -0.00498106
44 -0.021382251
45 -0.000694252
46 -0.033326085
47 0.002596086
48 0.011008057
49 -0.004754233
50 0.008969559
51 -0.00354088
52 -0.007213115
53 -0.003101495
54 0.005016228
55 -0.010762641
56 0.030770993
57 -0.015636325
58 0.000875417
59 0.03975863
60 -0.050207219
61 0.011308261
62 -0.021453315
63 -0.003309127
64 0.025687191
65 0.009467306
66 0.005519485
67 -0.011473758
68 0.00223934
69 -0.000913651
70 -0.003055385
71 0.000974694
72 0.000288611
73 -0.002432251
74 -0.0016975
75 -0.001565034
76 0.003332848
77 -0.008007295
78 -0.003086435
79 -0.00160435
80 0.005825885
81 0.020078093
82 0.018055453
83 0.181098137
84 0.102698818
85 0.128786594
86 -0.013587077
87 -0.038429879
88 0.043637258
89 0.042741709
90 0.016384872
91 0.000216317
92 0.009275681
93 -0.008595197
94 -0.016323335
95 -0.024083247
96 0.035922206
97 0.034863621
98 0.032401779
99 0.126333922
100 0.054751935
In order to perform the restriction test of my H0 against H1, may I know how I can fix this problem?
There seems to be a problem with the estimation procedure... Since one model is a restricted version of the other, using iGARCH should indeed lead to a lower (or at most equal) likelihood.
Using the subset of your data,
# speca / speca2: specs for the unrestricted (sGARCH) and restricted (iGARCH) models from the question
fit1 <- ugarchfit(speca, data = data.matrix(P))
likelihood(fit1)
# [1] 161.7373
fit2 <- ugarchfit(speca2, data = data.matrix(P))
likelihood(fit2)
# [1] 165.333
As I said in my deleted post, those numbers looked suspicious, as if they were negative log-likelihoods. However, recovering the likelihood from the residuals gives
-sum(log(2 * pi * sigma(fit1)^2)) / 2 - sum(residuals(fit1, standardize = TRUE)^2) / 2
# [1] 161.7373
-sum(log(2 * pi * sigma(fit2)^2)) / 2 - sum(residuals(fit2, standardize = TRUE)^2) / 2
# [1] 165.333
This means my suspicion was wrong (it must then be that the density values are > 1). For this reason, I think there is no way to use the current output to construct a test. The iGARCH restriction fits miraculously well...
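(A positive log-likelihood is not an error in itself: for continuous data the density can exceed 1 whenever the residual standard deviation is small enough, as a quick check shows.)
# A normal density exceeds 1 once the standard deviation drops below 1/sqrt(2*pi):
dnorm(0, mean = 0, sd = 0.05)
# [1] 7.978846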
However, some experimenting showed that using
fit.control = list(scale = 1)
changes things. In particular,
fit1 <- ugarchfit(speca, data = data.matrix(P), fit.control = list(scale = 1))
likelihood(fit1)
# [1] 161.7373
-sum(log(2 * pi * sigma(fit1)^2)) / 2 - sum(residuals(fit1, standardize = TRUE)^2) / 2
# [1] 161.7373
fit2 <- ugarchfit(speca2, data = data.matrix(P), fit.control = list(scale = 1))
likelihood(fit2)
# [1] 19.5233
-sum(log(2 * pi * sigma(fit2)^2)) / 2 - sum(residuals(fit2, standardize = TRUE)^2) / 2
# [1] 19.5233
That would somewhat make sense given
(page 25) "scaling sometimes facilitates the estimation process"
(page 46) Q: My model does not converge, what can I do?
"...Additionally, in the fit.control list of the fitting routines, the option to perform scaling of the data prior to fitting usually helps, although it is not available under some setups..."
However, it is again suspicious that the likelihood of the first model remains the same. Then we have
fit1 <- ugarchfit(speca, data = data.matrix(P), fit.control = list(scale = 0), solver.control = list(trace = TRUE))
# Iter: 1 fn: -161.7373 Pars: -0.0454619 0.0085993 0.0002706 0.0593231 0.6898473
# Iter: 2 fn: -161.7373 Pars: -0.0454619 0.0085993 0.0002706 0.0593231 0.6898473
# solnp--> Completed in 2 iterations
coef(fit1)
# mu mxreg1 omega alpha1 beta1
# -0.0454619274 0.0085992743 0.0002706018 0.0593231138 0.6898472858
fit1 <- ugarchfit(speca, data = data.matrix(P), fit.control = list(scale = 1), solver.control = list(trace = TRUE))
# Iter: 1 fn: 114.8143 Pars: -0.72230 0.13663 0.06830 0.05930 0.68988
# Iter: 2 fn: 114.8143 Pars: -0.72228 0.13662 0.06830 0.05931 0.68986
# solnp--> Completed in 2 iterations
coef(fit1)
# mu mxreg1 omega alpha1 beta1
# -0.045463099 0.008599494 0.000270610 0.059310622 0.689858216
and
fit2 <- ugarchfit(speca2, data = data.matrix(P), fit.control = list(scale = 0), solver.control = list(trace = TRUE))
# Iter: 1 fn: -165.3330 Pars: 0.0292439 -0.0051098 0.0002221 0.7495846
# Iter: 2 fn: -165.3330 Pars: 0.0292434 -0.0051097 0.0002221 0.7495853
# solnp--> Completed in 2 iterations
coef(fit2)
# mu mxreg1 omega alpha1 beta1
# 0.0292434276 -0.0051096984 0.0002221457 0.7495853224 0.2504146776
fit2 <- ugarchfit(speca2, data = data.matrix(P), fit.control = list(scale = 1), solver.control = list(trace = TRUE))
# Iter: 1 fn: 111.2185 Pars: 0.46462 -0.08118 0.05607 0.74959
# Iter: 2 fn: 111.2185 Pars: 0.46458 -0.08118 0.05607 0.74959
# solnp--> Completed in 2 iterations
coef(fit2)
# mu mxreg1 omega alpha1 beta1
# 0.46458110 -0.08117626 0.05607215 0.74959242 0.25040758
This makes things even stranger, with multiple inconsistencies...
This is the answer I received from the package author, Alexios Galanos:
The problem is that there is a restriction on the stationarity of the GARCH model which may interfere with
the solver convergence for models which are on the border of stationarity. Here is the solution:
library(rugarch)
library(xts)
dat <- read.table("data.txt", header = TRUE, stringsAsFactors = FALSE)
dat <- xts(dat[,2], as.Date(strptime(dat[,1], "%d/%m/%Y")))
spec1 <- ugarchspec(mean.model = list(armaOrder = c(0,0), archm = TRUE, archpow = 1), variance.model = list(model = "iGARCH"))
spec2 <- ugarchspec(mean.model = list(armaOrder = c(0,0), archm = TRUE, archpow = 1), variance.model = list(model = "sGARCH"))
mod1 <- ugarchfit(spec1, dat, solver = "solnp")
mod2 <- ugarchfit(spec2, dat)
persistence(mod2)
# 0.999 (at the limit of the internal constraint)
mod2 <- ugarchfit(spec2, dat, solver = "solnp", fit.control = list(stationarity = 0))
likelihood(mod2)
# 5428.871
likelihood(mod1)
# 5412.268
persistence(mod2)
# 1.08693 (above the limit)
Here is one solution to change the constraint:
.garchconbounds2 <- function() {
  return(list(LB = 1e-12, UB = 0.99999999999))
}
assignInNamespace(x = ".garchconbounds", value = .garchconbounds2, ns = "rugarch")
mod2 <- ugarchfit(spec2, dat, solver = "solnp")
likelihood(mod2)
# 5412.268
Now the value is the same as for the constrained model (they are both effectively integrated), but the constrained model has one less parameter to estimate. Note that fit.control = list(scale = 1) is not needed at all here; it is probably better to drop the scaling.
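With these fits in hand, step 3 of the recipe above can be completed (a sketch using the author's mod1 and mod2 objects; since both log-likelihoods come out at 5412.268, the statistic is essentially zero and H0 is not rejected):
LR <- 2 * (likelihood(mod2) - likelihood(mod1)) # unrestricted minus restricted
pchisq(q = LR, df = 1, lower.tail = FALSE)      # p-value for H0: alpha1 + beta1 = 1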
I have the following data (dat):
V W X Y Z
1 8 89 3 900
1 8 100 2 800
0 9 333 4 980
0 9 560 1 999
I wish to perform Tukey's HSD pairwise test on the above data set.
library(tidyr) # gather() comes from tidyr, not reshape2
dat1 <- gather(dat) # convert to long form
pairwise.t.test(dat1$key, dat1$value, p.adj = "holm")
However, every time I try to run it, it keeps running and does not yield an output. Any suggestions on how to correct this?
I would also like to perform the same test using the function TukeyHSD(). However, when I try to use the wide/long format, I run into an error that says:
Error in UseMethod("TukeyHSD") :
  no applicable method for 'TukeyHSD' applied to an object of class "data.frame"
We need 'x' to be dat1$value: when the arguments are not named, the first argument is taken as 'x' and the second as 'g'.
pairwise.t.test(dat1$value, dat1$key, p.adj = "holm")
#data: dat1$value and dat1$key
# V W X Y
#W 1.000 - - -
#X 0.018 0.018 - -
#Y 1.000 1.000 0.018 -
#Z 4.1e-08 4.1e-08 2.8e-06 4.1e-08
#P value adjustment method: holm
Or we can name the arguments and use them in any order we want:
pairwise.t.test(g = dat1$key, x= dat1$value, p.adj = "holm")
Regarding TukeyHSD(): it needs a fitted aov model, not a data.frame, which is exactly what the error message says. So fit the model with aov() first:
TukeyHSD(aov(value~key, data = dat1), ordered = TRUE)
#Tukey multiple comparisons of means
# 95% family-wise confidence level
# factor levels have been ordered
#Fit: aov(formula = value ~ key, data = dat1)
#$key
# diff lwr upr p adj
#Y-V 2.00 -233.42378 237.4238 0.9999999
#W-V 8.00 -227.42378 243.4238 0.9999691
#X-V 270.00 34.57622 505.4238 0.0211466
#Z-V 919.25 683.82622 1154.6738 0.0000000
#W-Y 6.00 -229.42378 241.4238 0.9999902
#X-Y 268.00 32.57622 503.4238 0.0222406
#Z-Y 917.25 681.82622 1152.6738 0.0000000
#X-W 262.00 26.57622 497.4238 0.0258644
#Z-W 911.25 675.82622 1146.6738 0.0000000
#Z-X 649.25 413.82622 884.6738 0.0000034
I'm doing some exploring with the same data, and I'm trying to highlight the within-group variance versus the between-group variance. I have been able to show that the between-group variance is very strong; however, the nature of the data should show weak within-group variance (my Shapiro-Wilk normality test suggests this). I believe that if I do some re-sampling with a Welch correction, this might be the case.
I was wondering if someone knew whether there is a re-sampling based ANOVA with a Welch correction in R. I see there is an R implementation of the permutation test, but with no correction. If not, how would I code the test directly using this implementation?
http://finzi.psych.upenn.edu/library/lmPerm/html/aovp.html
Here is the outline for my basic between-group ANOVA:
fit <- lm(formula = data$Boys ~ data$GroupofBoys)
anova(fit)
I believe you're correct that there isn't an easy way to do a Welch-corrected ANOVA with resampling, but it should be possible to cobble a few things together to make it work.
require('Ecdat')
I'll use the "Star" dataset from the "Ecdat" package, which looks at the effects of small class sizes on standardized test scores.
star<-Star
attach(star)
head(star)
tmathssk treadssk classk totexpk sex freelunk race schidkn
2 473 447 small.class 7 girl no white 63
3 536 450 small.class 21 girl no black 20
5 463 439 regular.with.aide 0 boy yes black 19
11 559 448 regular 16 boy no white 69
12 489 447 small.class 5 boy yes white 79
13 454 431 regular 8 boy yes white 5
Some exploratory analysis:
#boxplots
boxplot(treadssk ~ classk, ylab="Total Reading Scaled Score")
title("Reading Scores by Class Size")
#histograms
hist(treadssk, xlab="Total Reading Scaled Score")
Run a regular ANOVA:
model1 = aov(treadssk ~ classk, data = star)
summary(model1)
Df Sum Sq Mean Sq F value Pr(>F)
classk 2 37201 18601 18.54 9.44e-09 ***
Residuals 5745 5764478 1003
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A look at the ANOVA residuals:
#qqplot
qqnorm(residuals(model1),ylab="Reading Scaled Score")
qqline(residuals(model1),ylab="Reading Scaled Score")
The Q-Q plot shows that the ANOVA residuals deviate from the normal Q-Q line.
#Fitted Y vs. Residuals
plot(fitted(model1), residuals(model1))
The fitted Y vs. residuals plot shows a converging trend in the residuals; we can run a Shapiro-Wilk test just to be sure.
shapiro.test(treadssk[1:5000]) #shapiro.test is constrained to sample sizes between 3 and 5000
Shapiro-Wilk normality test
data: treadssk[1:5000]
W = 0.92256, p-value < 2.2e-16
This just confirms that we aren't going to be able to assume a normal distribution. We can use the bootstrap to estimate the true F-distribution.
#Bootstrap version (with 10,000 iterations)
#Center each group at its own mean so that H0 (equal group means) holds in the resampled data
mean_read = tapply(treadssk, classk, mean)
grpA = treadssk[classk == "regular"] - mean_read["regular"]
grpB = treadssk[classk == "small.class"] - mean_read["small.class"]
grpC = treadssk[classk == "regular.with.aide"] - mean_read["regular.with.aide"]
#Group labels matching the order in which the resampled scores are concatenated
sim_classk <- factor(rep(c("regular", "small.class", "regular.with.aide"), c(2000, 1733, 2015)))
R = 10000
sim_Fstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size = 2000, replace = TRUE)
  groupB = sample(grpB, size = 1733, replace = TRUE)
  groupC = sample(grpC, size = 2015, replace = TRUE)
  sim_score = c(groupA, groupB, groupC)
  sim_data = data.frame(sim_score, sim_classk)
  sim_Fstar[i] = oneway.test(sim_score ~ sim_classk, data = sim_data)$statistic #Welch F
}
Now we need to get the set of unique pairs of the Group factor
allPairs <- expand.grid(levels(sim_data$sim_classk), levels(sim_data$sim_classk))
## http://stackoverflow.com/questions/28574006/unique-combination-of-two-columns-in-r/28574136#28574136
allPairs <- unique(t(apply(allPairs, 1, sort)))
allPairs <- allPairs[ allPairs[,1] != allPairs[,2], ]
allPairs
[,1] [,2]
[1,] "regular" "small.class"
[2,] "regular" "regular.with.aide"
[3,] "regular.with.aide" "small.class"
Since oneway.test() applies a Welch correction by default, we can use that on our simulated data.
allResults <- apply(allPairs, 1, function(p) {
  #http://stackoverflow.com/questions/28587498/post-hoc-tests-for-one-way-anova-with-welchs-correction-in-r
  dat <- sim_data[sim_data$sim_classk %in% p, ]
  ret <- oneway.test(sim_score ~ sim_classk, data = dat, na.action = na.omit)
  ret$sim_classk <- p
  ret
})
length(allResults)
[1] 3
allResults[[1]]
One-way analysis of means (not assuming equal variances)
data: sim_score and sim_classk
F = 1.7741, num df = 2.0, denom df = 1305.9, p-value = 0.170
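To turn the simulated statistics into an overall test, one could compare the observed Welch F with the bootstrap distribution (a sketch, assuming sim_Fstar was filled inside the loop above):
# Observed Welch F on the real data
F_obs <- oneway.test(treadssk ~ classk, data = star)$statistic
# Bootstrap p-value: proportion of simulated F* values at least as large as F_obs
mean(sim_Fstar >= F_obs)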
I am new to R and I am stuck with a problem. I am trying to read a set of data into a table and I want to perform linear modeling on it. Below is how I read my data and my variable names:
> data = read.table(datafilename, header = TRUE)
> names(data)
[1] "price" "model" "size" "year" "color"
What I want to do is create several linear models using different combinations of the variables (price being the target), such as:
> attach(data)
> model1 = lm(price~model+size)
> model2 = lm(price~model+year)
> model3 = lm(price~model+color)
> model4 = lm(price~model+size)
> model5 = lm(price~size+year+color)
#... and so on for all different combinations...
My main aim is to compare the different models. Is there a cleverer way to generate these models instead of hard-coding the variables, especially since the number of variables will in some cases increase to 13 or so?
If your goal is model selection, there are several tools available in R which attempt to automate this process. Read the documentation on dredge(...) in the MuMIn package.
# dredge: example of use
library(MuMIn)
df <- mtcars[, c("mpg","cyl","disp","hp","wt")] # subset of mtcars
full.model <- lm(mpg ~ cyl + disp + hp + wt, df, na.action = na.fail) # model for predicting mpg; dredge() requires na.action = na.fail
dredge(full.model)
# Global model call: lm(formula = mpg ~ cyl + disp + hp + wt, data = df)
# ---
# Model selection table
# (Intrc) cyl disp hp wt df logLik AICc delta weight
# 10 39.69 -1.5080 -3.191 4 -74.005 157.5 0.00 0.291
# 14 38.75 -0.9416 -0.01804 -3.167 5 -72.738 157.8 0.29 0.251
# 13 37.23 -0.03177 -3.878 4 -74.326 158.1 0.64 0.211
# 16 40.83 -1.2930 0.011600 -0.02054 -3.854 6 -72.169 159.7 2.21 0.096
# 12 41.11 -1.7850 0.007473 -3.636 5 -73.779 159.9 2.37 0.089
# 15 37.11 -0.000937 -0.03116 -3.801 5 -74.321 161.0 3.46 0.052
# 11 34.96 -0.017720 -3.351 4 -78.084 165.6 8.16 0.005
# 9 37.29 -5.344 3 -80.015 166.9 9.40 0.003
# 4 34.66 -1.5870 -0.020580 4 -79.573 168.6 11.14 0.001
# 7 30.74 -0.030350 -0.02484 4 -80.309 170.1 12.61 0.001
# 2 37.88 -2.8760 3 -81.653 170.2 12.67 0.001
# 8 34.18 -1.2270 -0.018840 -0.01468 5 -79.009 170.3 12.83 0.000
# 6 36.91 -2.2650 -0.01912 4 -80.781 171.0 13.55 0.000
# 3 29.60 -0.041220 3 -82.105 171.1 13.57 0.000
# 5 30.10 -0.06823 3 -87.619 182.1 24.60 0.000
# 1 20.09 2 -102.378 209.2 51.68 0.000
You should consider these tools to help you make intelligent decisions. Do not let the tool make the decision for you!!!
For example, in this case dredge(...) suggests that the "best" model for predicting mpg, based on the AICc criterion, includes cyl and wt. But note that the AICc for this model is 157.5, whereas the second-best model has an AICc of 157.8, so these are basically the same. In fact, the first 5 models in this list are not significantly different in their ability to predict mpg. It does, however, narrow things down a bit. Among these 5, I would want to look at the distribution of the residuals (should be normal), trends in the residuals (there should be none), and leverage (do some points have undue influence?) before picking a "best" model.
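For instance, the diagnostics for the top-ranked model might be checked like this (a sketch; the formula mpg ~ cyl + wt is read off row 10 of the table above):
# Refit the top-ranked model and inspect the standard lm diagnostics
best <- lm(mpg ~ cyl + wt, data = df)
par(mfrow = c(2, 2))
plot(best)                    # residuals vs fitted, Q-Q, scale-location, leverage
shapiro.test(residuals(best)) # formal check that the residuals are roughly normal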
Here's one way to get all the combinations of variables using the combn function. It's a bit messy and uses a loop (perhaps someone can improve on this with mapply):
vars <- c("price", "model", "size", "year", "color")
N <- list(1, 2, 3, 4)
COMB <- sapply(N, function(m) combn(x = vars[2:5], m))
COMB2 <- list()
k <- 0
for (i in seq(COMB)) {
  tmp <- COMB[[i]]
  for (j in seq(ncol(tmp))) {
    k <- k + 1
    COMB2[[k]] <- formula(paste("price", "~", paste(tmp[, j], collapse = " + ")))
  }
}
Then, you can call these formulas and store the model objects in a list, or possibly give them unique names with the assign function:
res <- vector(mode = "list", length(COMB2))
for (i in seq(COMB2)) {
  res[[i]] <- lm(COMB2[[i]], data = data)
}
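Since the aim is to compare the models, you could then rank them all by an information criterion (a sketch, using the res list built above):
# Rank every candidate model by AIC (lower is better)
aics <- sapply(res, AIC)
names(aics) <- sapply(COMB2, deparse)
sort(aics)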
You can use stepwise multiple regression to determine which variables make sense to include. To get started, write one lm() statement with all variables, such as:
library(MASS)
fit <- lm(price ~ model + size + year + color)
Then you continue with:
step <- stepAIC(fit, direction = "both")
Finally, you can use the following to show the results:
step$anova
Hope this gives you some inspiration for advancing your script.