I am using the quantreg package to predict quantiles and their confidence intervals. I can't understand why the predicted quantiles are different from the quantiles calculated directly from the data using quantile().
library(tidyverse)
library(quantreg)
data <- tibble(data=runif(10)*10)
qr1 <- rq(formula=data ~ 1, tau=0.9, data=data) # quantile regression
yqr1<- predict(qr1, newdata=tibble(data=c(1)), interval='confidence', level=0.95, se='boot') # predict quantile
q90 <- quantile(data$data, 0.9) # quantile of sample
> yqr1
fit lower higher
1 6.999223 3.815588 10.18286
> q90
90%
7.270891
You should realize the predicting the 90th percentile for a dataset with only 10 items is really based solely on the two highest values. You should review the help page for quantile where you will find multiple definitions of the term.
When I run this, I see:
yqr1<- predict(qr1, newdata=tibble(data=c(1)) )
yqr1
1
8.525812
And when I look at the data I see:
data
# A tibble: 10 x 1
data
<dbl>
1 8.52581158
2 7.73959380
3 4.53000680
4 0.03431813
5 2.13842058
6 5.60713159
7 6.17525537
8 8.76262959
9 5.30750304
10 4.61817190
So the rq function is estimating the second highest value as the 90th percentile, which seems perfectly reasonable. The quantile result is not actually estimated that way:
quantile(data$data, .9)
# 90%
#8.549493
?quantile
Related
I have done logistic regression, a part of result is like below.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.9056 0.4967 -3.837 0.000125 ***
GWAS$value 0.4474 0.1157 3.868 0.000110 ***
This is the data which I used to do logistic regression.
ID Phenotype value
1 128 0 1.510320
2 193 1 1.956477
3 067 0 2.038308
4 034 1 2.058739
5 159 0 2.066371
6 013 0 2.095866
I would like to know how to calculate Odds Ratio and 95% Confidence interval for the decile of the value? My purpose is out put a plot, the y axis is OR(95%CI) and the x axis is the decile of the value in my data Can anyone please tell me how can I calculate this in R?
This is the example of the figure.
enter image description here
I don't have your data, so I cannot obtain the right model for you. The trick is to make the predictor ordinal, and use that to regress your response variable. After that you just plot the CI of each group, and join the lines if need be. Below I used an example dataset, and if you use the same steps, you should get the plot below:
library(tidyverse)
ldata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# we break gre column into quintiles
ldata <- ldata %>% mutate(GRE = cut_number(gre,5))
#regression like you did, calculate lor for all quintiles
fit <- glm(admit ~ 0+GRE,data=ldata,family="binomial")
# results like you have
results = coefficients(summary(fit))
# rename second column, SE for plotting
colnames(results)[2] = "SE"
#use ggplot
data.frame(results) %>%
mutate(X=1:n()) %>%
ggplot(aes(x=X,y=Estimate)) + geom_point()+
geom_line() +
# 95% interval is 1.96*SE
geom_errorbar(aes(ymin=-1.96*SE+Estimate,ymax=1.96*SE+Estimate),width=0.2)+
scale_x_continuous(label=rownames(results))+
xlab("GRE quintiles") +ylab("Log Odds Ratio")
I am use to conducting Tukey post-hoc tests in minitab. When I do, I usually get family grouping of the dependent/predictor variables.
In R, using TukeyHSD() the family grouping is not displayed (or calculated?). It only displays the relationship between each of the dependent/predictor variables. Is it possible to display the family groupings like in minitab?
Using the diamonds data set:
av <- aov(price ~ cut, data = diamonds)
tk <- TukeyHSD(av, ordered = T, which = "cut")
plot(tk)
Output:
Fit: aov(formula = price ~ cut, data = diamonds)
$cut
diff lwr upr p adj
Good-Ideal 471.32248 300.28228 642.3627 0.0000000
Very Good-Ideal 524.21792 401.33117 647.1047 0.0000000
Fair-Ideal 901.21579 621.86019 1180.5714 0.0000000
Premium-Ideal 1126.71573 1008.80880 1244.6227 0.0000000
Very Good-Good 52.89544 -130.15186 235.9427 0.9341158
Fair-Good 429.89331 119.33783 740.4488 0.0014980
Premium-Good 655.39325 475.65120 835.1353 0.0000000
Fair-Very Good 376.99787 90.13360 663.8622 0.0031094
Premium-Very Good 602.49781 467.76249 737.2331 0.0000000
Premium-Fair 225.49994 -59.26664 510.2665 0.1950425
Picture added to help clarify my response to Maruits's comment:
Here is a step-by-step example on how to reproduce minitab's table for the ggplot2::diamonds dataset. I've included details/explanation as much as possible.
Please note that as far as I can tell, results shown in minitab's table are not dependent/related to results from Tukey's post-hoc test; they are based on results from the analysis of variance. Tukey's honest significant difference (HSD) test is a post-hoc test that establishes which comparisons (of all the possible pairwise comparisons of group means) are (honestly) statistically significant, given the ANOVA results.
In order to reproduce minitabs "mean-grouping" summary table (see the first table of "Interpret the results: Step 3" of the minitab Express Support), I recommend (re-)running a linear model to extract means and confidence intervals. Note that this is exactly how aov fits the analysis of variance model for each group.
Fit a linear model
We specify a 0 offset to get absolute estimates for every group (rather than estimates for the changes relative to an offset).
fit <- lm(price ~ 0 + cut, data = diamonds)
coef <- summary(fit)$coef;
coef;
# Estimate Std. Error t value Pr(>|t|)
#cutFair 4358.758 98.78795 44.12236 0
#cutGood 3928.864 56.59175 69.42468 0
#cutVery Good 3981.760 36.06181 110.41487 0
#cutPremium 4584.258 33.75352 135.81570 0
#cutIdeal 3457.542 27.00121 128.05137 0
Determine family groupings
In order to obtain something similar to minitab's "family groupings", we adopt the following approach:
Calculate confidence intervals for all parameters
Perform a hierarchical clustering analysis on the confidence interval data for all parameters
Cut the resulting tree at a height corresponding to the standard deviation of the CIs. This will gives us a grouping of parameter estimates based on their confidence intervals. This is a somewhat empirical approach but justifiable as the tree measures pairwise distances between the confidence intervals, and the standard deviation can be interpreted as a Euclidean distance.
We start by calculating the confidence interval and cluster the resulting distance matrix using hierarchical clustering using complete linkage.
CI <- confint(fit);
hc <- hclust(dist(CI));
We inspect the cluster dendrogram
plot(hc);
We now cut the tree at a height corresponding to the standard deviation of all CIs across all parameter estimates to get the "family groupings"
grps <- cutree(hc, h = sd(CI))
Summarise results
Finally, we collate all quantities and store results in a table similar to minitab's "mean-grouping" table.
library(tidyverse)
bind_cols(
cut = rownames(coef),
N = as.integer(table(fit$model$cut)),
Mean = coef[, 1],
Groupings = grps) %>%
as.data.frame()
# cut N Mean Groupings
#1 cutFair 1610 4358.758 1
#2 cutGood 4906 3928.864 2
#3 cutVery Good 12082 3981.760 2
#4 cutPremium 13791 4584.258 1
#5 cutIdeal 21551 3457.542 3
Note the near-perfect agreement of our results with those from the minitab "mean-grouping" table: cut = Ideal is by itself in group 3 (group C in minitab's table), while Fair+Premium share group 1 (minitab: group A ), and Good+Very Good share group 2 (minitab: group B).
See the cld function in the multcomp package, as explained here (copy-pasted below).
Example data set:
> data(ToothGrowth)
> ToothGrowth$treat <- with(ToothGrowth, interaction(supp,dose))
> str(ToothGrowth)
'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
$ treat: Factor w/ 6 levels "OJ.0.5","VC.0.5",..: 2 2 2 2 2 2 2 2 2 2 ...
Model fit:
> fit <- lm(len ~ treat, data=ToothGrowth)
All pairwise comparisons with Tukey test:
> apctt <- multcomp::glht(fit, linfct = multcomp::mcp(treat = "Tukey"))
Letter-based representation of all-pairwise comparisons (algorithm from Piepho 2004):
> lbrapc <- multcomp::cld(apctt)
> lbrapc
OJ.0.5 VC.0.5 OJ.1 VC.1 OJ.2 VC.2
"b" "a" "c" "b" "c" "c"
I'm working on a model to predict the probability that college baseball players will make the major leagues. My dataset has 633 observations and 13 predictors with a binary response. The code below generates smaller reproducible examples of training and testing datasets:
set.seed(1)
OBP <- rnorm(50, mean=1, sd=.2)
HR.PCT <- rnorm(50, mean=1, sd=.2)
AGE <- rnorm(50, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=50, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
df.train <- data.frame(OBP, HR.PCT, AGE, CONF)
df.train <- df.train[order(-OBP),]
df.train$MADE.MAJORS <- 0
df.train$MADE.MAJORS[1:10] <- 1
OBP <- rnorm(10, mean=1, sd=.2)
HR.PCT <- rnorm(10, mean=1, sd=.2)
AGE <- rnorm(10, mean=21, sd=1)
CONF <- sample(c("A","B","C","D","E"), size=10, replace=TRUE)
CONF <- factor(CONF, levels=c("A","B","C","D","E"))
MADE.MAJORS <- sample(0:1, size=10, replace=TRUE, prob=c(0.8,0.2))
df.test <- data.frame(OBP, HR.PCT, AGE, CONF, MADE.MAJORS)
I then used glmnet to perform the lasso with logistic regression and generate predictions. I want the predictions to be in the form of probabilities (that is, between 0 and 1).
library(glmnet)
train.mtx <- with(df.train, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
glmmod <- glmnet(x=train.mtx, y=as.factor(df.train$MADE.MAJORS), alpha=1, family="binomial")
cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1)
test.mtx <- with(df.test, model.matrix(MADE.MAJORS ~ OBP + HR.PCT + AGE + CONF)[,-1])
preds <- predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="response")
cv.preds <- predict.cv.glmnet(object=cv.glmmod, newx=test.mtx, s="lambda.min")
Here are the predictions:
> preds
1
1 -3.2589440
2 -0.4435265
3 3.9646670
4 0.3772816
5 0.9952887
6 -7.3555661
7 0.2283675
8 -2.3871317
9 -8.1632749
10 -1.3563051
> cv.preds
1
1 0.1568839
2 0.3630938
3 0.7435941
4 0.4808428
5 0.5261076
6 -0.1431655
7 0.4123054
8 0.2207381
9 -0.1446941
10 0.2962391
I have a few questions about these results. Feel free to answer any or all (or none) of them. I'm most interested in an answer for the first question.
Why are the predictions from predict.glmnet (the preds vector) not in the form of probabilities? I put the preds values through the inverse logit function and got reasonable probabilities. Was that correct?
The predictions from predict.cv.glmnet (the cv.preds vector) mostly look like probabilities, but some of them are negative. Why is this?
When I use the glmnet function to create the glmmod object, I include the family="binomial" argument to indicate that I'm using logistic regression. However, when I use the cv.glmnet function to find the best value for lambda, I'm not able to specify logistic regression. Am I actually getting the best value for lambda if the cross-validation doesn't use logistic regression?
Similarly, when I use the predict.cv.glmnet function, I'm not able to specify logistic regression. Does this function produce the predictions that I want?
I am not 100% sure on the following because the package does seem to operate counter to its documentation, as you've noticed, but it may produce some indication whether your thinking is along the right path.
Question 1
Yes, you're right. Note that,
> predict.glmnet(object=glmmod, newx=test.mtx, s=cv.glmmod$lambda.min, type="link")
1
1 -3.2589440
2 -0.4435265
3 3.9646670
4 0.3772816
5 0.9952887
6 -7.3555661
7 0.2283675
8 -2.3871317
9 -8.1632749
10 -1.3563051
which is the same output as type="response". Thus, putting it through the inverse logit function would be the right way to get the probabilities. As to why is this happening, I have not clue -perhaps a bug.
Question 2...4
For the cv.preds, you're getting something along the lines of probabilities because you're fitting a Gaussian link. In order to fit a logit link, you should specify the family parameter. Namely:
cv.glmmod <- cv.glmnet(x=train.mtx, y=df.train$MADE.MAJORS, alpha=1, family="binomial")
> cv.preds
1
1 -10.873290
2 1.299113
3 15.812671
4 3.622259
5 5.621857
6 -24.826551
7 1.734000
8 -5.420878
9 -26.160403
10 -4.496020
In this case, cv.preds will output along the real line and you can put those values through the inverse logit to get the probabilities.
I have a Cox proportional hazards model set up using the following code in R that predicts mortality. Covariates A, B and C are added simply to avoid confounding (i.e. age, sex, race) but we are really interested in the predictor X. X is a continuous variable.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I'm having troubles plotting a Kaplan-Meier curve for this. I've been searching on how to create this figure but I haven't had much luck. I'm not sure if plotting a Kaplan-Meier for a Cox model is possible? Does the Kaplan-Meier adjust for my covariates or does it not need them?
What I did try is below, but I've been told this isn't right.
plot(survfit(cox.model), xlab = 'Time (years)', ylab = 'Survival Probabilities')
I also tried to plot a figure that shows cumulative hazard of mortality. I don't know if I'm doing it right since I've tried it a few different ways and get different results. Ideally, I would like to plot two lines, one that shows the risk of mortality for the 75th percentile of X and one that shows the 25th percentile of X. How can I do this?
I could list everything else I've tried, but I don't want to confuse anyone!
Many thanks.
Here is an example taken from this paper.
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi <- read.table(url, header=TRUE)
Rossi[1:5, 1:10]
# week arrest fin age race wexp mar paro prio educ
# 1 20 1 no 27 black no not married yes 3 3
# 2 17 1 no 18 black no not married yes 8 4
# 3 25 1 no 19 other yes not married yes 13 3
# 4 52 0 yes 23 black yes married yes 1 5
# 5 52 0 no 19 other yes not married yes 3 3
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + wexp + mar + paro + prio,
data=Rossi)
mod.allison
# Call:
# coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
# mar + paro + prio, data = Rossi)
#
#
# coef exp(coef) se(coef) z p
# finyes -0.3794 0.684 0.1914 -1.983 0.0470
# age -0.0574 0.944 0.0220 -2.611 0.0090
# raceother -0.3139 0.731 0.3080 -1.019 0.3100
# wexpyes -0.1498 0.861 0.2122 -0.706 0.4800
# marnot married 0.4337 1.543 0.3819 1.136 0.2600
# paroyes -0.0849 0.919 0.1958 -0.434 0.6600
# prio 0.0915 1.096 0.0286 3.194 0.0014
#
# Likelihood ratio test=33.3 on 7 df, p=2.36e-05 n= 432, number of events= 114
Note that the model uses fin, age, race, wexp, mar, paro, prio to predict arrest. As mentioned in this document the survfit() function uses the Kaplan-Meier estimate for the survival rate.
plot(survfit(mod.allison), ylim=c(0.7, 1), xlab="Weeks",
ylab="Proportion Not Rearrested")
We get a plot (with a 95% confidence interval) for the survival rate. For the cumulative hazard rate you can do
# plot(survfit(mod.allison)$cumhaz)
but this doesn't give confidence intervals. However, no worries! We know that H(t) = -ln(S(t)) and we have confidence intervals for S(t). All we need to do is
sfit <- survfit(mod.allison)
cumhaz.upper <- -log(sfit$upper)
cumhaz.lower <- -log(sfit$lower)
cumhaz <- sfit$cumhaz # same as -log(sfit$surv)
Then just plot these
plot(cumhaz, xlab="weeks ahead", ylab="cumulative hazard",
ylim=c(min(cumhaz.lower), max(cumhaz.upper)))
lines(cumhaz.lower)
lines(cumhaz.upper)
You'll want to use survfit(..., conf.int=0.50) to get bands for 75% and 25% instead of 97.5% and 2.5%.
The request for estimated survival curve at the 25th and 75th percentiles for X first requires determining those percentiles and specifying values for all the other covariates in a dataframe to be used as newdata argument to survfit.:
Can use the data suggested by other resondent from Fox's website, although on my machine it required building an url-object:
url <- url("http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt")
Rossi <- read.table(url, header=TRUE)
It's probably not the best example for this wquestion but it does have a numeric variable that we can calculate the quartiles:
> summary(Rossi$prio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.984 4.000 18.000
So this would be the model fit and survfit calls:
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + prio ,
data=Rossi)
prio.fit <- survfit(mod.allison,
newdata= data.frame(fin="yes", age=30, race="black", prio=c(1,4) ))
plot(prio.fit, col=c("red","blue"))
Setting the values of the confounders to a fixed value and plotting the predicted survival probabilities at multiple points in time for given values of X (as #IRTFM suggested in his answer), results in a conditional effect estimate. That is not what a standard Kaplan-Meier estimator is used for and I don't think that is what the original poster wanted. Usually we are interested in average causal effects. In other words: What would the survival probability be if X had been set to some specific value x in the entire sample?
We can obtain this probability using the cox-model that was fit plus g-computation. In g-computation, we set the value of X to x in the entire sample and then use the cox model to predict the survival probability at t for each individual, using their observed covariate values in the process. Then we simply take the average of those predictions to obtain our final estimate. By repeating this process for a range of points in time and a range of possible values for X, we obtain a three-dimensional survival surface. We can then visualize this surface using color scales.
This can be done using the contsurvplot R-package I developed, as discussed in this previous answer: Converting survival analysis by a continuous variable to categorical or in the documentation of the package. More information about this strategy in general can be found in the preprint version of my article on this topic: https://arxiv.org/pdf/2208.04644.pdf
How do I get the CI for one point on the regression line? I'm quite sure I should use confint() for that, but if I try this
confint(model,param=value)
it just gives me the same number as if I just type in
confint(model)
if I try without a value, it does not give me any values at all.
What am I doing wrong?
You want predict() instead of confint(). Also, as Joran noted, you'll need to be clear about whether you want the confidence interval or prediction interval for a given x. (A confidence interval expresses uncertainty about the expected value of y-values at a given x. A prediction interval expresses uncertainty surrounding the predicted y-value of a single sampled point with that value of x.)
Here's a simple example of how to do this in R:
df <- data.frame(x=1:10, y=1:10 + rnorm(10))
f <- lm(y~x, data=df)
predict(f, newdata=data.frame(x=c(0, 5.5, 10)), interval="confidence")
# fit lwr upr
# 1 0.5500246 -1.649235 2.749284
# 2 5.7292889 4.711230 6.747348
# 3 9.9668688 8.074662 11.859075
predict(f, newdata=data.frame(x=c(0, 5.5, 10)), interval="prediction")
# fit lwr upr
# 1 0.5500246 -3.348845 4.448895
# 2 5.7292889 2.352769 9.105809
# 3 9.9668688 6.232583 13.701155