I have done logistic regression, a part of result is like below.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.9056 0.4967 -3.837 0.000125 ***
GWAS$value 0.4474 0.1157 3.868 0.000110 ***
This is the data which I used to do logistic regression.
ID Phenotype value
1 128 0 1.510320
2 193 1 1.956477
3 067 0 2.038308
4 034 1 2.058739
5 159 0 2.066371
6 013 0 2.095866
I would like to know how to calculate Odds Ratio and 95% Confidence interval for the decile of the value? My purpose is out put a plot, the y axis is OR(95%CI) and the x axis is the decile of the value in my data Can anyone please tell me how can I calculate this in R?
This is the example of the figure.
enter image description here
I don't have your data, so I cannot obtain the right model for you. The trick is to make the predictor ordinal, and use that to regress your response variable. After that you just plot the CI of each group, and join the lines if need be. Below I used an example dataset, and if you use the same steps, you should get the plot below:
library(tidyverse)
ldata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# we break gre column into quintiles
ldata <- ldata %>% mutate(GRE = cut_number(gre,5))
#regression like you did, calculate lor for all quintiles
fit <- glm(admit ~ 0+GRE,data=ldata,family="binomial")
# results like you have
results = coefficients(summary(fit))
# rename second column, SE for plotting
colnames(results)[2] = "SE"
#use ggplot
data.frame(results) %>%
mutate(X=1:n()) %>%
ggplot(aes(x=X,y=Estimate)) + geom_point()+
geom_line() +
# 95% interval is 1.96*SE
geom_errorbar(aes(ymin=-1.96*SE+Estimate,ymax=1.96*SE+Estimate),width=0.2)+
scale_x_continuous(label=rownames(results))+
xlab("GRE quintiles") +ylab("Log Odds Ratio")
Related
I am using the quantreg package to predict quantiles and their confidence intervals. I can't understand why the predicted quantiles are different from the quantiles calculated directly from the data using quantile().
library(tidyverse)
library(quantreg)
data <- tibble(data=runif(10)*10)
qr1 <- rq(formula=data ~ 1, tau=0.9, data=data) # quantile regression
yqr1<- predict(qr1, newdata=tibble(data=c(1)), interval='confidence', level=0.95, se='boot') # predict quantile
q90 <- quantile(data$data, 0.9) # quantile of sample
> yqr1
fit lower higher
1 6.999223 3.815588 10.18286
> q90
90%
7.270891
You should realize the predicting the 90th percentile for a dataset with only 10 items is really based solely on the two highest values. You should review the help page for quantile where you will find multiple definitions of the term.
When I run this, I see:
yqr1<- predict(qr1, newdata=tibble(data=c(1)) )
yqr1
1
8.525812
And when I look at the data I see:
data
# A tibble: 10 x 1
data
<dbl>
1 8.52581158
2 7.73959380
3 4.53000680
4 0.03431813
5 2.13842058
6 5.60713159
7 6.17525537
8 8.76262959
9 5.30750304
10 4.61817190
So the rq function is estimating the second highest value as the 90th percentile, which seems perfectly reasonable. The quantile result is not actually estimated that way:
quantile(data$data, .9)
# 90%
#8.549493
?quantile
I am trying to get the seasonal trend of two groups of individuals using GAMMs. I performed two analysis changing the order of the levels of the explanatory variable in order to get one plot of the seasonal trend for each level.
However, I am surprised with the output of the two GAMMs because they vary according to the order of the levels of the explanatory variable. I expected that the results would be the same because the data and the model are the same in both occasions. However, as you can see below, the results vary the inference of the data studied.
My database contained the next variables:
Species: 4 levels
Populations: 20 levels
Reproductive_State: 2 levels
Survival_probability: range [0-1]
Year
Month
Fortnight: from 1 to 26 (called Seasonality in analysis)
I am trying to get a descriptive estimates and plots of the "common seasonal survival of the species" checking the existence of differences between the two levels of the variable reproductive_state.
In order to check it I performed did:
# Specify the contrast: Reproductive group
data$Reproductive_Group <- as.factor (data$Reproductive_State)
data$Reproductive_Group <- as.ordered(data$Reproductive_Group )
contrasts(data$Reproductive_Group )<-'contr.treatment'
model_1 <- gam (Survival_probability ~ Reproductive_Group + s(Seasonality) + s(Seasonality, by=Reproductive_Group ), random=list(Species=~1, Population=~1), family=quasibinomial, data=data)
later I change the order of the levels of the Reproductive_Group and perform the same analysis:
data$Reproductive_Group <- factor (data$Reproductive_Group , levels=c("phiNB", "phiB"))
levels (data$Reproductive_Group )
model_2 <- gam (Survival_probability ~ Reproductive_Group + s(Seasonality) + s(Seasonality, by=Reproductive_Group ), random=list(Species=~1, Population=~1), family=quasibinomial, data=data)
In the first model the output is:
Formula:
Survival_probability ~ +s(Seasonality) + s(Seasonality, by = Rep_Group)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.83569 0.01202 152.8 <2e-16 ***
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Seasonality) 3.201 3.963 2.430 0.05046 .
s(Seasonality):Rep_GroupphiNB 5.824 6.956 2.682 0.00991 **
whereas the output of the second model is:
Formula:
Survival_probability ~ +s(Seasonality) + s(Seasonality, by = Rep_Group)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.83554 0.01205 152.4 <2e-16 ***
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Seasonality) 5.927 7.061 6.156 3.66e-07 ***
s(Seasonality):Rep_GroupphiB 3.218 3.981 1.029 0.411
Furthermore I have attached the plots of the two models:
Group_B_as_second_level
Group_NB_as_second_level
I thought that the plot of the seasonality should be the same for both analysis, as long as it represents exclusively the seasonality. However if the seasonality reflects the seasonal trend of the other level, the plot 1 of the first picture should match with the plot 2 of the second picture and viceversa, and they donĀ“t do it.
To note, that I followed the blog Overview GAMM analysis of time series data for writting the formula and checking the differences of the seasonal trend accross the two reproductive state.
Do you know why I obtain different results with these two models?
I have a Cox proportional hazards model set up using the following code in R that predicts mortality. Covariates A, B and C are added simply to avoid confounding (i.e. age, sex, race) but we are really interested in the predictor X. X is a continuous variable.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I'm having troubles plotting a Kaplan-Meier curve for this. I've been searching on how to create this figure but I haven't had much luck. I'm not sure if plotting a Kaplan-Meier for a Cox model is possible? Does the Kaplan-Meier adjust for my covariates or does it not need them?
What I did try is below, but I've been told this isn't right.
plot(survfit(cox.model), xlab = 'Time (years)', ylab = 'Survival Probabilities')
I also tried to plot a figure that shows cumulative hazard of mortality. I don't know if I'm doing it right since I've tried it a few different ways and get different results. Ideally, I would like to plot two lines, one that shows the risk of mortality for the 75th percentile of X and one that shows the 25th percentile of X. How can I do this?
I could list everything else I've tried, but I don't want to confuse anyone!
Many thanks.
Here is an example taken from this paper.
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi <- read.table(url, header=TRUE)
Rossi[1:5, 1:10]
# week arrest fin age race wexp mar paro prio educ
# 1 20 1 no 27 black no not married yes 3 3
# 2 17 1 no 18 black no not married yes 8 4
# 3 25 1 no 19 other yes not married yes 13 3
# 4 52 0 yes 23 black yes married yes 1 5
# 5 52 0 no 19 other yes not married yes 3 3
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + wexp + mar + paro + prio,
data=Rossi)
mod.allison
# Call:
# coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
# mar + paro + prio, data = Rossi)
#
#
# coef exp(coef) se(coef) z p
# finyes -0.3794 0.684 0.1914 -1.983 0.0470
# age -0.0574 0.944 0.0220 -2.611 0.0090
# raceother -0.3139 0.731 0.3080 -1.019 0.3100
# wexpyes -0.1498 0.861 0.2122 -0.706 0.4800
# marnot married 0.4337 1.543 0.3819 1.136 0.2600
# paroyes -0.0849 0.919 0.1958 -0.434 0.6600
# prio 0.0915 1.096 0.0286 3.194 0.0014
#
# Likelihood ratio test=33.3 on 7 df, p=2.36e-05 n= 432, number of events= 114
Note that the model uses fin, age, race, wexp, mar, paro, prio to predict arrest. As mentioned in this document the survfit() function uses the Kaplan-Meier estimate for the survival rate.
plot(survfit(mod.allison), ylim=c(0.7, 1), xlab="Weeks",
ylab="Proportion Not Rearrested")
We get a plot (with a 95% confidence interval) for the survival rate. For the cumulative hazard rate you can do
# plot(survfit(mod.allison)$cumhaz)
but this doesn't give confidence intervals. However, no worries! We know that H(t) = -ln(S(t)) and we have confidence intervals for S(t). All we need to do is
sfit <- survfit(mod.allison)
cumhaz.upper <- -log(sfit$upper)
cumhaz.lower <- -log(sfit$lower)
cumhaz <- sfit$cumhaz # same as -log(sfit$surv)
Then just plot these
plot(cumhaz, xlab="weeks ahead", ylab="cumulative hazard",
ylim=c(min(cumhaz.lower), max(cumhaz.upper)))
lines(cumhaz.lower)
lines(cumhaz.upper)
You'll want to use survfit(..., conf.int=0.50) to get bands for 75% and 25% instead of 97.5% and 2.5%.
The request for estimated survival curve at the 25th and 75th percentiles for X first requires determining those percentiles and specifying values for all the other covariates in a dataframe to be used as newdata argument to survfit.:
Can use the data suggested by other resondent from Fox's website, although on my machine it required building an url-object:
url <- url("http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt")
Rossi <- read.table(url, header=TRUE)
It's probably not the best example for this wquestion but it does have a numeric variable that we can calculate the quartiles:
> summary(Rossi$prio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.984 4.000 18.000
So this would be the model fit and survfit calls:
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + prio ,
data=Rossi)
prio.fit <- survfit(mod.allison,
newdata= data.frame(fin="yes", age=30, race="black", prio=c(1,4) ))
plot(prio.fit, col=c("red","blue"))
Setting the values of the confounders to a fixed value and plotting the predicted survival probabilities at multiple points in time for given values of X (as #IRTFM suggested in his answer), results in a conditional effect estimate. That is not what a standard Kaplan-Meier estimator is used for and I don't think that is what the original poster wanted. Usually we are interested in average causal effects. In other words: What would the survival probability be if X had been set to some specific value x in the entire sample?
We can obtain this probability using the cox-model that was fit plus g-computation. In g-computation, we set the value of X to x in the entire sample and then use the cox model to predict the survival probability at t for each individual, using their observed covariate values in the process. Then we simply take the average of those predictions to obtain our final estimate. By repeating this process for a range of points in time and a range of possible values for X, we obtain a three-dimensional survival surface. We can then visualize this surface using color scales.
This can be done using the contsurvplot R-package I developed, as discussed in this previous answer: Converting survival analysis by a continuous variable to categorical or in the documentation of the package. More information about this strategy in general can be found in the preprint version of my article on this topic: https://arxiv.org/pdf/2208.04644.pdf
I have the following list of data each has 10 samples.
The values indicate binding strength of a particular molecule.
What I want so show is that 'x' is statistically different from
'y', 'z' and 'w'. Which it does if you look at X it has
more values greater than zero (2.8,1.00,5.4, etc) than others.
I tried t-test, but all of them shows insignificant difference
with high P-value.
What's the appropriate test for that?
Below is my code:
#!/usr/bin/Rscript
x <-c(2.852672123,0.076840264,1.009542943,0.430716968,5.4016,0.084281843,0.065654548,0.971907344,3.325405405,0.606504718)
y <- c(0.122615039,0.844203734,0.002128992,0.628740077,0.87752229,0.888600425,0.728667099,0.000375047,0.911153571,0.553786408);
z <- c(0.766445916,0.726801899,0.389718652,0.978733927,0.405585807,0.408554832,0.799010791,0.737676439,0.433279599,0.947906524)
w <- c(0.000124984,1.486637663,0.979713013,0.917105894,0.660855127,0.338574774,0.211689885,0.434050179,0.955522972,0.014195184)
t.test(x,y)
t.test(x,z)
You have not specified in what way you expect the samples to differ. One typically assumes you mean the mean differs across samples. In that case, the t-test is appropriate. While x has some high values, it also has some low values which pull the mean in. It seems what you thought was a significant difference (visually) is actually a larger variance.
If your question is about variance, then you need an F-test.
The classic test for this type of data is analysis of variance. Analysis of variance tells you if the means of all four categories are the likely the same (failure to reject null hypothesis) or if at least one mean likely differs from the others (rejection of the null hypothesis).
If the anova is significant, you will often want to perform the Tukey HSD post-hoc test to figure out which category differs from the others. Tukey HSD yields p-values that are already adjusted for multiple comparisons.
library(ggplot2)
library(reshape2)
x <- c(2.852672123,0.076840264,1.009542943,0.430716968,5.4016,0.084281843,
0.065654548,0.971907344,3.325405405,0.606504718)
y <- c(0.122615039,0.844203734,0.002128992,0.628740077,0.87752229,
0.888600425,0.728667099,0.000375047,0.911153571,0.553786408);
z <- c(0.766445916,0.726801899,0.389718652,0.978733927,0.405585807,
0.408554832,0.799010791,0.737676439,0.433279599,0.947906524)
w <- c(0.000124984,1.486637663,0.979713013,0.917105894,0.660855127,
0.338574774,0.211689885,0.434050179,0.955522972,0.014195184)
dat = data.frame(x, y, z, w)
mdat = melt(dat)
anova_results = aov(value ~ variable, data=mdat)
summary(anova_results)
# Df Sum Sq Mean Sq F value Pr(>F)
# variable 3 5.83 1.9431 2.134 0.113
# Residuals 36 32.78 0.9105
The anova p-value is 0.113 and the Tukey test p-values for your "x" category are in a similar range. This is the quantification of your intuition that "x" is different from the others. Most researchers would find p = 0.11 to be suggestive but still have too high risk of being a false positive. Note that the large difference in means (diff column) along with the boxplot figure below might be more persuasive than the p-value.
TukeyHSD(anova_results)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
#
# Fit: aov(formula = value ~ variable, data = mdat)
#
# $variable
# diff lwr upr p adj
# y-x -0.92673335 -2.076048 0.2225815 0.1506271
# z-x -0.82314118 -1.972456 0.3261737 0.2342515
# w-x -0.88266565 -2.031981 0.2666492 0.1828672
# z-y 0.10359217 -1.045723 1.2529071 0.9948795
# w-y 0.04406770 -1.105247 1.1933826 0.9995981
# w-z -0.05952447 -1.208839 1.0897904 0.9990129
plot_1 = ggplot(mdat, aes(x=variable, y=value, colour=variable)) +
geom_boxplot() +
geom_point(size=5, shape=1)
ggsave("plot_1.png", plot_1, height=3.5, width=7, units="in")
In your question you referred to the distributions being different b/c some of them had more values greater than 0. Defining the distributions according to the "number of values greater than 0", then you would use the binomial distribution (after converting the values to 1's and 0's). A function you could then use would be prop.test()
I'm running a piecewise linear random coefficient model testing the influence of a covariate on the second piece. Thereby, I want to test whether the coefficient of the second piece under the influence of the covariate (piece2 + piece2:covariate) differs from the coefficient of the first piece (piece1), hence whether the growth rate differs.
I set up some exemplary data:
set.seed(100)
# set up dependent variable
temp <- rep(seq(0,23),50)
y <- c(rep(seq(0,23),50)+rnorm(24*50), ifelse(temp <= 11, temp + runif(1200), temp + rnorm(1200) + (temp/sqrt(temp))))
# set up ID variable, variables indicating pieces and the covariate
id <- sort(rep(seq(1,100),24))
piece1 <- rep(c(seq(0,11), rep(11,12)),100)
piece2 <- rep(c(rep(0,12), seq(1,12)),100)
covariate <- c(rep(0,24*50), rep(c(rep(0,12), rep(1,12)), 50))
# data frame
example.data <- data.frame(id, y, piece1, piece2, covariate)
# run piecewise linear random effects model and show results
library(lme4)
lmer.results <- lmer(y ~ piece1 + piece2*covariate + (1|id) , example.data)
summary(lmer.results)
I came across the linearHypothesis() command from the car package to test differences in coefficients. However, I could not find an example on how to use it when including interactions.
Can I even use linearHypothesis() to test this or am I aiming for the wrong test?
I appreciate your help.
Many thanks in advance!
Mac
Assuming your output looks like this
Estimate Std. Error t value
(Intercept) 0.26293 0.04997 5.3
piece1 0.99582 0.00677 147.2
piece2 0.98083 0.00716 137.0
covariate 2.98265 0.09042 33.0
piece2:covariate 0.15287 0.01286 11.9
if I understand correctly what you want, you are looking for the contrast:
piece1-(piece2+piece2:covariate)
or
c(0,1,-1,0,-1)
My preferred tool for this is function estimable in gmodels; you could also do it by hand or with one of the functions in Frank Harrel's packages.
library(gmodels)
estimable(lmer.results,c(0,1,-1,0,-1),conf.int=TRUE)
giving
Estimate Std. Error p value Lower.CI Upper.CI
(0 1 -1 0 -1) -0.138 0.0127 0 -0.182 -0.0928