Whenever I use manova(), then summary.aov(), I only get df, Sum sq, and Mean Sq, with no p value.
My data frame looks like: (sorry I'm not sure if there's a better way to display this!)
subtype lymphocytosis anemia thrombocytopenia eosinophilia hypercalcemia hyperglobulinemia
1 MBC 0.60 0.18 0.17 0.02 0.01 0.04
2 SBC 0.25 0.18 0.14 0.03 0.02 0.12
3 BCLL 1.00 0.29 0.18 0.08 0.03 0.21
neutrophilia neutropenia lymphadenopathy_peripheral lymphadenopathy_visceral splenomegaly
1 0.23 0.02 1.00 0.65 0.60
2 0.22 0.04 0.99 0.62 0.49
3 0.23 0.04 0.40 0.25 0.49
hepatomegaly pleural_effusion peritoneal_effusion intestinal_mass mediastinal_mass pulmonary_mass
1 0.41 0.02 0.05 0.10 0.09 0.22
2 0.37 0.03 0.05 0.17 0.12 0.22
3 0.27 0.01 0.04 0.25 0.03 0.25
The values in the data frame represent the mean number of cases from each subtype for each clinical sign. I am a little worried that, for manova() to work, I should have entered each individual case and its clinical signs so that manova() can do its own math? That would be a huge pain for me to assemble, hence why I've done it this way. Either way, I still think I should be getting p-values; they just might be wrong if my data frame is wrong?
The code I am using is:
cs_comp_try <- manova(cbind(lymphocytosis, anemia, thrombocytopenia, eosinophilia, hypercalcemia,
hyperglobulinemia, neutrophilia, neutropenia, lymphadenopathy_peripheral, lymphadenopathy_visceral,
splenomegaly, hepatomegaly, pleural_effusion, peritoneal_effusion, intestinal_mass, mediastinal_mass, pulmonary_mass) ~ subtype, data = cs_comp)
summary(cs_comp_try)
summary.aov(cs_comp_try)
The result I get for summary.aov() is:
Response peritoneal_effusion :
Df Sum Sq Mean Sq
subtype 2 6.6667e-05 3.3333e-05
Response intestinal_mass :
Df Sum Sq Mean Sq
subtype 2 0.011267 0.0056333
Response mediastinal_mass :
Df Sum Sq Mean Sq
subtype 2 0.0042 0.0021
Response pulmonary_mass :
Df Sum Sq Mean Sq
subtype 2 6e-04 3e-04
I think I've replicated all the examples I've seen on the internet, so I'm not sure why I'm not getting an F statistic and p value when I run this code.
You can just use the summary function to get the p-values like this (I use iris data as an example):
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(fit)
#> Df Pillai approx F num Df den Df Pr(>F)
#> Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
#> Residuals 147
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Created on 2022-07-15 by the reprex package (v2.0.1)
If you want to extract the actual p-values, you can use the following code:
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(fit)$stats[1, "Pr(>F)"]
#> [1] 2.216888e-42
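Note that with the aggregated data frame in the question (one mean row per subtype), no p-value can ever come out: three rows fitted against a three-level factor leave zero residual degrees of freedom, which is exactly why summary.aov() printed only Df, Sum Sq, and Mean Sq. A small sketch with iris, aggregated the same way, shows this:

```r
# Aggregating iris to one mean row per Species mimics the question's layout
agg <- aggregate(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris, FUN = mean)
fit_agg <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = agg)
df.residual(fit_agg)  # 0 residual df, so no F statistic or p-value can be computed
```

So to get real p-values you do need the individual cases, not the per-subtype means.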
I have questions about multivariable Cox regression analysis including non-binary categorical variables.
My data consist of several variables; some of them are binary (like sex, and age over 70, etc.),
whereas the rest are not (for example, ECOG).
I tried both the analyse_multivariate function and the coxph function, but it seems that I can only get an overall hazard ratio when such a variable is treated as continuous. I'd like to know both the overall hazard ratio for the variable and the individual hazard ratios for its subcategories (like hazard ratios for ECOG 0, ECOG 1, ECOG 2, and for ECOG overall).
What I tried in the process is like this:
(1)
ECOG = as.factor(df$ECOG)
analyse_multivariate(data = df,
                     time_status = vars(df$OS, df$survival_status == 1),
                     covariates = vars(df$age70, df$sex, ECOG),
                     reference_level_dict = c(ECOG == 0))
and the result is like this:
Hazard Ratios:
factor.id factor.name factor.value HR Lower_CI Upper_CI Inv_HR Inv_Lower_CI Inv_Upper_CI
df$age70 df$age70 <continuous> 1.07 0.82 1.41 0.93 0.71 1.22
ECOG:4 ECOG 4 1.13 0.16 8.19 0.89 0.12 6.43
df$sex df$sex <continuous> 1.87 0.96 3.66 0.53 0.27 1.04
ECOG:1 ECOG 1 2.14 1.63 2.81 0.47 0.36 0.61
ECOG:3 ECOG 3 12.12 7.83 18.76 0.08 0.05 0.13
ECOG:2 ECOG 2 13.72 4.92 38.26 0.07 0.03 0.2
(2)
analyse_multivariate(data = df,
                     time_status = vars(df$OS, df$survival_status == 1),
                     covariates = vars(df$age70, df$sex, df$ECOG),
                     reference_level_dict = c(ECOG == 0))
and the result is:
Hazard Ratios:
factor.id factor.name factor.value HR Lower_CI Upper_CI Inv_HR Inv_Lower_CI Inv_Upper_CI
df$age70 df$age70 <continuous> 0.89 0.68 1.16 1.13 0.86 1.47
df$sex df$sex <continuous> 1.87 0.96 3.65 0.53 0.27 1.04
df$ECOG df$ECOG <continuous> 1.9 1.69 2.15 0.53 0.47 0.59
Does it make sense to use the overall p-value for ECOG from (2), consider ECOG a significant variable if that p-value is <0.05, and combine it with the individual hazard ratios for each ECOG level from (1)?
For example, to generate a table like the following:
p-value 0.01
ECOG 1 Reference
ECOG 2 13.72 (4.92-38.26)
ECOG 3 12.12 (7.83-18.76)
ECOG 4 1.13 (0.16-8.19)
I believe there are better solutions but couldn't find one.
Any comments would be appreciated!
Thank you in advance.
Short answer is no. In (2), ECOG is treated as a continuous covariate, meaning you expect the log hazard to vary linearly with ECOG, whereas in (1) you expect every level (1 to 4) to have a different effect on survival. To test the variable ECOG collectively, you can do an ANOVA:
library(survivalAnalysis)
data = survival::lung
data$ECOG = factor(data$ph.ecog)
data$sex = factor(data$sex)
fit1 = data %>%
  analyse_multivariate(vars(time, status),
                       covariates = vars(age, sex, ECOG, wt.loss))
anova(fit1$coxph)
Analysis of Deviance Table
Cox model: response is Surv(time, status)
Terms added sequentially (first to last)
loglik Chisq Df Pr(>|Chi|)
NULL -675.02
age -672.36 5.3325 1 0.020931 *
sex -667.82 9.0851 1 0.002577 **
ECOG -660.26 15.1127 3 0.001723 **
wt.loss -659.31 1.9036 1 0.167680
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
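If you also want the per-level hazard ratios to go with that overall test, summary() on the underlying coxph fit reports exp(coef) for each non-reference level. A sketch on the same lung data, fitting coxph directly rather than through analyse_multivariate:

```r
library(survival)

lung2 <- survival::lung
lung2$ECOG <- factor(lung2$ph.ecog)  # categorical; reference level is ECOG 0

fit <- coxph(Surv(time, status) ~ age + factor(sex) + ECOG, data = lung2)

# "exp(coef)" column = hazard ratio of each ECOG level vs the reference level,
# with 95% confidence limits
summary(fit)$conf.int
```

That gives you both pieces the question asks for: anova() for the collective test, summary() for the level-by-level hazard ratios.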
I am trying to find a function that allows me to easily get the confidence interval of the difference between two means.
I am pretty sure t.test has this functionality, but I haven't been able to make it work.
This is the dataset I am using
Indoor Outdoor
1 0.07 0.29
2 0.08 0.68
3 0.09 0.47
4 0.12 0.54
5 0.12 0.97
6 0.12 0.35
7 0.13 0.49
8 0.14 0.84
9 0.15 0.86
10 0.15 0.28
11 0.17 0.32
12 0.17 0.32
13 0.18 1.55
14 0.18 0.66
15 0.18 0.29
16 0.18 0.21
17 0.19 1.02
18 0.20 1.59
19 0.22 0.90
20 0.22 0.52
21 0.23 0.12
22 0.23 0.54
23 0.25 0.88
24 0.26 0.49
25 0.28 1.24
26 0.28 0.48
27 0.29 0.27
28 0.34 0.37
29 0.39 1.26
30 0.40 0.70
31 0.45 0.76
32 0.54 0.99
33 0.62 0.36
and I have been trying to use the t.test function that I installed from
install.packages("ggpubr")
I am pretty new to R, so sorry if there is a simple answer to this question. I have searched around quite a bit and haven't been able to find anything that I am looking for.
Note: The output I am looking for is Between -1.224 and 0.376
Edit:
The CI of the difference between means I am looking for is as if a random 34th data point were added to the table by picking a random value from the Indoor column and a random value from the Outdoor column. Running t.test outputs the correct CI for the difference of means for the given sample size of 33.
How can I go about doing this pretending the sample size is 34?
There's probably something more convenient in the standard library, but it's pretty easy to calculate. Given your df variable, we can just do:
# calculate mean of difference
d_mu <- mean(df$Indoor) - mean(df$Outdoor)
# calculate SD of difference
d_sd <- sqrt(var(df$Indoor) + var(df$Outdoor))
# calculate 95% CI of this
d_mu + d_sd * qt(c(0.025, 0.975), nrow(df)*2)
giving me: -1.2246 0.3767
Mostly for #AkselA: I often find it helpful to check my work by sampling simpler distributions; in this case I'd do something like:
a <- mean(df$Indoor) + sd(df$Indoor) * rt(1000000, nrow(df)-1)
b <- mean(df$Outdoor) + sd(df$Outdoor) * rt(1000000, nrow(df)-1)
quantile(a - b, c(0.025, 0.975))
which gives me answers much closer to the CI I gave in the comment
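For comparison with the t.test route the question asked about: t.test gives the (much narrower) CI for the difference in *means*, while the calculation above targets a single new Indoor-minus-Outdoor difference, which matches the -1.224 to 0.376 the question expects. A sketch that rebuilds the question's data frame:

```r
# Rebuild the data frame from the question
df <- data.frame(
  Indoor  = c(0.07, 0.08, 0.09, 0.12, 0.12, 0.12, 0.13, 0.14, 0.15, 0.15, 0.17,
              0.17, 0.18, 0.18, 0.18, 0.18, 0.19, 0.20, 0.22, 0.22, 0.23, 0.23,
              0.25, 0.26, 0.28, 0.28, 0.29, 0.34, 0.39, 0.40, 0.45, 0.54, 0.62),
  Outdoor = c(0.29, 0.68, 0.47, 0.54, 0.97, 0.35, 0.49, 0.84, 0.86, 0.28, 0.32,
              0.32, 1.55, 0.66, 0.29, 0.21, 1.02, 1.59, 0.90, 0.52, 0.12, 0.54,
              0.88, 0.49, 1.24, 0.48, 0.27, 0.37, 1.26, 0.70, 0.76, 0.99, 0.36)
)

# base R: CI for the difference in means (narrow)
t.test(df$Indoor, df$Outdoor)$conf.int

# interval for a single new Indoor-minus-Outdoor difference (wide), as above
mean(df$Indoor) - mean(df$Outdoor) +
  sqrt(var(df$Indoor) + var(df$Outdoor)) * qt(c(0.025, 0.975), nrow(df) * 2)
```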
Even though I find the approach of manually calculating the results, as shown by #Sam Mason, the most insightful, some people want a shortcut. And sometimes it's also OK to be lazy :)
So among the different ways to calculate CIs, this is IMHO the most comfortable:
DescTools::MeanDiffCI(df$Indoor, df$Outdoor)
Here's a reprex:
library(ggplot2)  # provides the diamonds data set
IV <- diamonds$price
DV <- rnorm(length(IV), mean = mean(IV), sd = sd(IV))
DescTools::MeanDiffCI(IV, DV)
gives
meandiff lwr.ci upr.ci
-18.94825 -66.51845 28.62195
This is calculated with 999 bootstrapped samples by default. If you want 1000 or more, you can just add that in the argument R:
DescTools::MeanDiffCI(IV, DV, R = 1000)
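If you'd rather not add a dependency, a percentile-bootstrap mean-difference CI, one of the methods MeanDiffCI offers, is only a few lines of base R. A sketch using simulated vectors x and y as placeholders for your own data:

```r
set.seed(42)
x <- rnorm(100)                # placeholder samples; substitute your own vectors
y <- rnorm(100, mean = 0.3)

# resample each group 999 times and take the 2.5% / 97.5% quantiles
# of the bootstrapped mean differences
boot_diffs <- replicate(999, mean(sample(x, replace = TRUE)) -
                             mean(sample(y, replace = TRUE)))
c(meandiff = mean(x) - mean(y), quantile(boot_diffs, c(0.025, 0.975)))
```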
I have plotted the conditional density distribution of my variables using cdplot in R. My independent variable and my dependent variable are not independent of each other. The independent variable is discrete (it takes only certain values between 0 and 3) and the dependent variable is also discrete (11 levels from 0 to 1 in steps of 0.1).
Some data:
dat <- read.table( text="y x
3.00 0.0
2.75 0.0
2.75 0.1
2.75 0.1
2.75 0.2
2.25 0.2
3 0.3
2 0.3
2.25 0.4
1.75 0.4
1.75 0.5
2 0.5
1.75 0.6
1.75 0.6
1.75 0.7
1 0.7
0.54 0.8
0 0.8
0.54 0.9
0 0.9
0 1.0
0 1.0", header=TRUE, colClasses="factor")
I wonder if my variables are appropriate for this kind of analysis.
Also, I'd like to know how to report these results elegantly, in a way that makes academic and statistical sense.
This is a run using the rms package's `lrm` function, which is typically used for binary outcomes but also handles ordered categorical variables:
library(rms)  # also loads Hmisc
# first get the data in the form you described
dat[] <- lapply(dat, ordered)  # makes both columns ordered factor variables
?lrm  # read the help page; also look at the supporting book and citations there
lrm(y ~ x, data = dat)
# --- output------
Logistic Regression Model
lrm(formula = y ~ x, data = dat)
Frequencies of Responses
0 0.54 1 1.75 2 2.25 2.75 3 3.00
4 2 1 5 2 2 4 1 1
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 22 LR chi2 51.66 R2 0.920 C 0.869
max |deriv| 0.0004 d.f. 10 g 20.742 Dxy 0.738
Pr(> chi2) <0.0001 gr 1019053402.761 gamma 0.916
gp 0.500 tau-a 0.658
Brier 0.048
Coef S.E. Wald Z Pr(>|Z|)
y>=0.54 41.6140 108.3624 0.38 0.7010
y>=1 31.9345 88.0084 0.36 0.7167
y>=1.75 23.5277 74.2031 0.32 0.7512
y>=2 6.3002 2.2886 2.75 0.0059
y>=2.25 4.6790 2.0494 2.28 0.0224
y>=2.75 3.2223 1.8577 1.73 0.0828
y>=3 0.5919 1.4855 0.40 0.6903
y>=3.00 -0.4283 1.5004 -0.29 0.7753
x -19.0710 19.8718 -0.96 0.3372
x=0.2 0.7630 3.1058 0.25 0.8059
x=0.3 3.0129 5.2589 0.57 0.5667
x=0.4 1.9526 6.9051 0.28 0.7773
x=0.5 2.9703 8.8464 0.34 0.7370
x=0.6 -3.4705 53.5272 -0.06 0.9483
x=0.7 -10.1780 75.2585 -0.14 0.8924
x=0.8 -26.3573 109.3298 -0.24 0.8095
x=0.9 -24.4502 109.6118 -0.22 0.8235
x=1 -35.5679 488.7155 -0.07 0.9420
There is also the MASS::polr function, but I find Harrell's version more approachable. This could also be approached with rank regression. The quantreg package is pretty standard if that were the route you chose. Looking at your other question, I wondered if you had tried a logistic transform as a method of linearizing that relationship. Of course, the illustrated use of lrm with an ordered variable is a logistic transformation "under the hood".
JAGS
I have an intercept-only logistic model in JAGS, defined as follows:
model{
  for(i in 1:Ny){
    y[i] ~ dbern(mu[s[i]])
  }
  for(j in 1:Ns){
    mu[j] <- ilogit(b0[j])
    b0[j] ~ dnorm(0, sigma)
  }
  sigma ~ dunif(0, 100)
}
When I plot the posterior distribution of b0 collapsing across all subjects (i.e., all b0[j]), my 95% HDI includes 0: -0.55 to 2.13. The Effective Sample Size is way above 10,000 for every b0 (around 18,000 on average). Diagnostics look good.
glmer()
Now, this is the equivalent glmer() model:
glmer(response ~ 1 + (1|subject), data = myData, family = "binomial")
The result of this model, however, is as follows:
Random effects:
Groups Name Variance Std.Dev.
speaker (Intercept) 0.3317 0.576
Number of obs: 1544, groups: subject, 27
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.7401 0.1247 5.935 2.94e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So here it says my estimate is significantly above 0.
What the data look like
Here are the proportions of 0s and 1s by subject. You can see that, for the vast majority of subjects, the proportion of 1 is above 50%.
Any ideas why JAGS and glmer() are so different here?
0 1
1 0.47 0.53
2 0.36 0.64
3 0.29 0.71
4 0.42 0.58
5 0.12 0.88
6 0.22 0.78
7 0.54 0.46
8 0.39 0.61
9 0.30 0.70
10 0.32 0.68
11 0.36 0.64
12 0.66 0.34
13 0.38 0.62
14 0.49 0.51
15 0.35 0.65
16 0.32 0.68
17 0.12 0.88
18 0.45 0.55
19 0.36 0.64
20 0.36 0.64
21 0.28 0.72
22 0.40 0.60
23 0.41 0.59
24 0.19 0.81
25 0.27 0.73
26 0.08 0.92
27 0.12 0.88
You forgot to include a mean value, so your intercept parameter is fixed to zero. Something like this should work:
model{
  for(i in 1:Ny){
    y[i] ~ dbern(mu[s[i]])
  }
  for(j in 1:Ns){
    mu[j] <- ilogit(b0[j])
    b0[j] ~ dnorm(mu0, sigma)   # note: JAGS dnorm takes mean and precision
  }
  mu0 ~ dnorm(0, 0.001)         # diffuse prior on the overall intercept
  sigma ~ dunif(0, 100)
}
Now the posterior density of mu0 should match the sampling distribution of the intercept parameter from glmer reasonably well.
Alternatively, if you use response ~ -1 + (1|subject) as your glmer formula, you should get results that match your current JAGS model.
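You can sanity-check that equivalence by simulation. The sketch below generates data loosely matching the glmer output in the question (27 subjects; the intercept mean 0.74 and SD 0.58 are placeholders taken from that output) and recovers a positive overall intercept:

```r
library(lme4)

set.seed(1)
n_subj <- 27
n_obs  <- 57                                 # ~1539 observations, close to 1544
subject <- factor(rep(seq_len(n_subj), each = n_obs))

# per-subject intercepts on the logit scale, centred on a nonzero mean
b0 <- rnorm(n_subj, mean = 0.74, sd = 0.58)
d <- data.frame(subject,
                response = rbinom(n_subj * n_obs, 1, plogis(b0[subject])))

# with an overall mean, as in the corrected JAGS model above
fixef(glmer(response ~ 1 + (1 | subject), data = d, family = binomial))
```

The fixed-effect intercept comes back near the simulated mean, which is what the posterior of mu0 should also concentrate around.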
Sorry for this stupid question, I am new to R. I have some data in the following format, saved as CSV:
%Y-%m-%d,st1,st2,st3,st4,st5,st6,st7,st8,st9,st10
2005-09-20,38.75,48.625,48.5,23.667,45.5,48.75,18.75,33.25,43.455,76.042
2005-09-21,39.482,49.3,49,23.9,46.15,50.281,18.975,34.125,44.465,78.232
...
I import it into R:
library(fPortfolio)
Data <- readSeries(file = "data.csv", header = TRUE, sep = ",")
I want to have some descriptive statistics
library(psych)
describe(Data)
Error in x[!is.na(x[, i]), i] :
invalid or not-yet-implemented 'timeSeries' subsetting
Any suggestion?
You probably want to convert it to a base time series first, right?
tS <- dummySeries()  # make a quick dummy timeSeries object
describe(tS)         # fails
but
newtS <- as.ts(tS)
describe(newtS)      # works fine, giving:
var n mean sd median trimmed mad min max range skew kurtosis se
Series 1 1 12 0.49 0.25 0.44 0.48 0.29 0.13 0.89 0.76 0.24 -1.52 0.07
Series 2 2 12 0.45 0.28 0.44 0.45 0.42 0.07 0.83 0.77 0.03 -1.74 0.08
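If you don't specifically need the timeSeries class at all, it may be simpler to read the CSV with base R and describe the numeric columns directly. A sketch with an inline two-row version of the file (substitute your "data.csv" path for the text argument):

```r
library(psych)

csv <- "date,st1,st2
2005-09-20,38.75,48.625
2005-09-21,39.482,49.3"
Data <- read.csv(text = csv)

# describe() works on a plain data frame; drop the date column first
describe(Data[, -1])
```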