I would like to fit a Weibull curve to some event data and then include the fitted Weibull curve in a survival plot drawn by survminer::ggsurvplot. Any ideas how?
Here is an example to work on:
A function for simulating Weibull data:
# N = sample size
# lambda = scale parameter in h0()
# rho = shape parameter in h0()
# beta = fixed effect parameter
# rateC = rate parameter of the exponential distribution of C
simulWeib <- function(N, lambda, rho, beta, rateC)
{
# covariate --> N Bernoulli trials
x <- sample(x=c(0, 1), size=N, replace=TRUE, prob=c(0.5, 0.5))
# Weibull latent event times
v <- runif(n=N)
Tlat <- (- log(v) / (lambda * exp(x * beta)))^(1 / rho)
# censoring times
C <- rexp(n=N, rate=rateC)
# follow-up times and event indicators
time <- pmin(Tlat, C)
status <- as.numeric(Tlat <= C)
# data set
data.frame(id=1:N,
time=time,
status=status,
x=x)
}
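As a side note on the simulation step: the latent event times are drawn by inverse-transform sampling from a Weibull distribution. The following is a minimal sketch, assuming R's usual shape/scale parameterisation of qweibull, that checks the formula used in simulWeib() against the Weibull quantile function. The parameter values and variable names are illustrative only and are not taken from the question.

```r
set.seed(1)
lambda <- 0.01; rho <- 1.5; beta <- -0.6   # illustrative values (assumption)
x <- rbinom(10, size = 1, prob = 0.5)      # binary covariate
v <- runif(10)                             # uniform draws, read as survival probabilities

# Formula used inside simulWeib()
t_formula <- (-log(v) / (lambda * exp(x * beta)))^(1 / rho)

# Same draws via the Weibull quantile function:
# shape = rho, scale = (lambda * exp(x * beta))^(-1 / rho),
# with v interpreted as the upper-tail probability S(t)
t_qweibull <- qweibull(v, shape = rho,
                       scale = (lambda * exp(x * beta))^(-1 / rho),
                       lower.tail = FALSE)

all.equal(t_formula, t_qweibull)
```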
Generate data:
library(survival) # for coxph(), Surv() and survfit()
set.seed(1234)
betaHat <- rep(NA, 1e3)
for(k in 1:1e3)
{
dat <- simulWeib(N=100, lambda=0.01, rho=1, beta=-0.6, rateC=0.001)
fit <- coxph(Surv(time, status) ~ x, data=dat)
betaHat[k] <- fit$coef
}
# Estimate a survival function (the event indicator is status, not the covariate x)
out0 <- survfit(Surv(time, status) ~ 1, data = dat)
#plot
library(survminer)
ggsurvplot(out0, data = dat, risk.table = TRUE)
gg1 <- ggsurvplot(
out0, # survfit object with calculated statistics.
data = dat, # data used to fit survival curves.
risk.table = TRUE, # show risk table.
pval = TRUE, # show p-value of log-rank test.
conf.int = TRUE, # show confidence intervals for
# point estimates of survival curves.
xlim = c(0,2000), # present narrower X axis, but not affect
# survival estimates.
break.time.by = 500, # break X axis in time intervals by 500.
ggtheme = theme_minimal(), # customize plot and risk table with a theme.
risk.table.y.text.col = TRUE, # colour risk table text annotations.
risk.table.y.text = FALSE, # show bars instead of names in text annotations
# in legend of risk table
surv.median.line = "hv",
color = "darkgreen",
conf.int.fill = "lightblue",
title = "Survival probability"
)
gg1
As far as I can see, it is not possible to do this with ggsurvplot at the moment.
I created an issue requesting this feature: https://github.com/kassambara/survminer/issues/276
You can plot the survivor curves of a Weibull model with ggplot2 like this:
library("survival")
wbmod <- survreg(Surv(time, status) ~ x, data = dat)
s <- seq(.01, .99, by = .01)
t_0 <- predict(wbmod, newdata = data.frame(x = 0),
type = "quantile", p = s)
t_1 <- predict(wbmod, newdata = data.frame(x = 1),
type = "quantile", p = s)
smod <- data.frame(time = c(t_0, t_1),
surv = rep(1 - s, times = 2),
strata = rep(c(0, 1), each = length(s)),
upper = NA, lower = NA)
head(smod)
library("ggplot2")
ggplot() +
geom_line(data = smod, aes(x = time, y = surv, color = factor(strata))) +
theme_classic()
However, to my knowledge, you cannot use survminer for this (yet):
library("survminer")
# wrong:
ggsurvplot(smod)
# does not work:
gg1$plot + geom_line(data = smod, aes(x = time, y = surv, color = factor(strata)))
The following works for me. Credit probably goes to Heidi for filing the feature request.
I hope someone finds this useful.
library(survival) # for Surv(), survreg(), survfit() and the lung data
library(survminer)
library(tidyr)
s <- with(lung,Surv(time,status))
sWei <- survreg(s ~ as.factor(sex),dist='weibull',data=lung)
fKM <- survfit(s ~ sex,data=lung)
pred.sex1 = predict(sWei, newdata=list(sex=1),type="quantile",p=seq(.01,.99,by=.01))
pred.sex2 = predict(sWei, newdata=list(sex=2),type="quantile",p=seq(.01,.99,by=.01))
df = data.frame(y=seq(.99,.01,by=-.01), sex1=pred.sex1, sex2=pred.sex2)
df_long = gather(df, key= "sex", value="time", -y)
p = ggsurvplot(fKM, data = lung, risk.table = T)
p$plot = p$plot + geom_line(data=df_long, aes(x=time, y=y, group=sex))
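The quantile-based approach above can also be turned around: instead of predicting times for a grid of probabilities, one can evaluate the fitted Weibull survival function on a grid of times. The following is a minimal sketch, assuming sex is entered as a numeric covariate and using the standard relationship between the survreg parameterisation and the Weibull shape and scale. The helper names scale_for and weibull_surv are hypothetical and introduced only for illustration.

```r
library(survival)

# Weibull accelerated failure time fit on the lung data shipped with survival
fit <- survreg(Surv(time, status) ~ sex, data = lung, dist = "weibull")

# survreg parameterisation -> Weibull shape and scale:
# shape = 1 / fit$scale, scale = exp(linear predictor)
shape <- 1 / fit$scale
scale_for <- function(sex) {
  unname(exp(coef(fit)[["(Intercept)"]] + coef(fit)[["sex"]] * sex))
}

# Fitted survival function S(t) = exp(-(t / scale)^shape)
weibull_surv <- function(t, sex) exp(-(t / scale_for(sex))^shape)

# Survival curves on a regular time grid, one block of rows per group
times <- seq(0, 1000, by = 10)
curve_df <- rbind(
  data.frame(time = times, surv = weibull_surv(times, 1), sex = "1"),
  data.frame(time = times, surv = weibull_surv(times, 2), sex = "2")
)
head(curve_df)
```

The resulting curve_df could then be added to the Kaplan-Meier plot in the same way as df_long, for example with geom_line(data = curve_df, aes(x = time, y = surv, group = sex)).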
I am trying to plot 95% confidence intervals for a model fitted to some simulated values, but I am running into issues when I try to draw the CIs with geom_ribbon(). The trouble is that the CIs do not show up when I plot them.
I have included all of my code below in case anyone can see where I have gone wrong:
set.seed(20220520)
#simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3+5*x)
y = rpois(200, lambda = lam)
#before we do this each Yi may contain zeros so we need to add a small constant
y <- y + .1
#combining x and y into a dataframe so we can plot
df = data.frame(x, y)
#fitting a Poisson GLM
model2 <- glm(y ~ x,
data = df,
family = poisson(link='log'))
#make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
#making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
#calculate confidence intervals limit
upper_mod2 = predictions_mod2$fit+1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit-1.96*predictions_mod2$se.fit
#transform the CI limit to get one at the level of the mean
upper_mod2 = exp(upper_mod2)/(1+exp(upper_mod2))
lower_mod2 = exp(lower_mod2)/(1+exp(lower_mod2))
#combining into a df
predframe = data.frame(lwr=lower_mod2,upr=upper_mod2, x = df$x, y = df$y)
#plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
geom_ribbon(data = predframe, aes(ymin=lwr, ymax=upr), alpha = 0.4) +
geom_point() +
geom_line(aes(x, preds2), col = 'blue')
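A comment on the question asks why the predicted values should not be logit-transformed. The reason is that the type of prediction requested is "response", so the predictions, and the limits computed from them, are already on the scale of the mean and need no back-transformation. From the documentation of predict.glm (my emphasis):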
type
the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
The clearest way to answer is to show the code:
library(ggplot2, quietly = TRUE)
set.seed(20220520)
#simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3+5*x)
y = rpois(200, lambda = lam)
#before we do this each Yi may contain zeros so we need to add a small constant
y <- y + 0.1
#combining x and y into a dataframe so we can plot
df = data.frame(x, y)
#fitting a Poisson GLM
suppressWarnings(
model2 <- glm(y ~ x,
data = df,
family = poisson(link='log'))
)
#make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
#making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
#calculate confidence intervals limit
upper_mod2 = predictions_mod2$fit+1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit-1.96*predictions_mod2$se.fit
#combining into a df
predframe = data.frame(lwr=lower_mod2,upr=upper_mod2, x = df$x, y = df$y)
#plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
geom_ribbon(data = predframe, aes(ymin=lwr, ymax=upr), alpha = 0.4) +
geom_point() +
geom_line(aes(x, preds), col = 'blue')
Created on 2022-05-29 by the reprex package (v2.0.1)
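An alternative that is sometimes preferred is to build the interval on the link scale and back-transform it with the inverse link, which for a log link is exp() and keeps the limits positive. The following is a minimal sketch of that variant, not the method used in the answer above. It assumes the raw counts are modelled without adding a constant, since a Poisson GLM accepts zeros.

```r
set.seed(20220520)
x <- runif(200, min = 0, max = 1)
y <- rpois(200, lambda = exp(0.3 + 5 * x))
df <- data.frame(x, y)

# Poisson GLM with log link on the raw counts (assumption: no constant added)
model <- glm(y ~ x, data = df, family = poisson(link = "log"))

# Predictions and standard errors on the scale of the linear predictor
link_pred <- predict(model, newdata = df, type = "link", se.fit = TRUE)

# Wald interval on the link scale, back-transformed with exp()
predframe <- data.frame(
  x   = df$x,
  fit = exp(link_pred$fit),
  lwr = exp(link_pred$fit - 1.96 * link_pred$se.fit),
  upr = exp(link_pred$fit + 1.96 * link_pred$se.fit)
)
head(predframe)
```

predframe can be passed to geom_ribbon(aes(ymin = lwr, ymax = upr)) exactly as in the plot above.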
I was wondering how I can modify the following code to produce a single plot of the fitted curves:
data(airquality)
library(quantreg)
library(ggplot2)
library(data.table)
library(devtools)
# source Quantile LOESS
source("https://www.r-statistics.com/wp-content/uploads/2010/04/Quantile.loess_.r.txt")
airquality2 <- na.omit(airquality[ , c(1, 4)])
#'' quantreg::rq
rq_fit <- rq(Ozone ~ Temp, 0.95, airquality2)
rq_fit_df <- data.table(t(coef(rq_fit)))
names(rq_fit_df) <- c("intercept", "slope")
#'' quantreg::lprq
lprq_fit <- lapply(1:3, function(bw){
fit <- lprq(airquality2$Temp, airquality2$Ozone, h = bw, tau = 0.95)
return(data.table(x = fit$xx, y = fit$fv, bw = paste0("bw=", bw), fit = "quantreg::lprq"))
})
#'' Quantile LOESS
ql_fit <- Quantile.loess(airquality2$Ozone, jitter(airquality2$Temp), window.size = 10,
the.quant = .95, window.alignment = c("center"))
ql_fit_df <- data.table(x = ql_fit$x, y = ql_fit$y.loess, bw = "bw=1", fit = "Quantile LOESS")
I want to have all these fits in a plot.
geom_quantile can calculate quantiles using the rq method internally, so we don't need to create the rq_fit_df separately. However, the lprq and Quantile LOESS methods aren't available within geom_quantile, so I've used the data frames you provided and plotted them using geom_line.
In addition, to include the rq line in the color and linetype mappings and in the legend we add aes(colour="rq", linetype="rq") as a sort of "artificial" mapping inside geom_quantile.
library(dplyr) # For bind_rows()
ggplot(airquality2, aes(Temp, Ozone)) +
geom_point() +
geom_quantile(quantiles=0.95, formula=y ~ x, aes(colour="rq", linetype="rq")) +
geom_line(data=bind_rows(lprq_fit, ql_fit_df),
aes(x, y, colour=paste0(gsub("q.*:","",fit),": ", bw),
linetype=paste0(gsub("q.*:","",fit),": ", bw))) +
theme_bw() +
scale_linetype_manual(values=c(2,4,5,1,1)) +
labs(colour="Method", linetype="Method",
title="Different methods of estimating the 95th percentile by quantile regression")
Could someone show me how to fit a polynomial marginal distribution to my data? I have done a binomial and beta binomial, but I would like to see how to fit a polynomial. I would also be interested in trying a gamma if that is something you know how to do.
This is what I have done so far.
library(ggplot2)
library(dplyr) # for %>%
library(stats4) # for mle()
nodes <- read.table("https://web.stanford.edu/~hastie/CASI_files/DATA/nodes.txt",
header = T)
nodes %>%
ggplot(aes(x=x/n))+
geom_histogram(bins = 30)+
theme_bw()+
labs(x = "nodes",
n = "p=x/n")
# log-likelihood function
ll <- function(alpha, beta) {
x <- nodes$x
total <- nodes$n
-sum(VGAM::dbetabinom.ab(x, total, alpha, beta, log = TRUE))
}
# maximum likelihood estimation
m <- mle(ll, start = list(alpha = 1, beta = 10), method = "L-BFGS-B",
lower = c(0.0001, .1))
ab <- coef(m)
alpha0 <- ab[1]
beta0 <- ab[2]
nodes %>%
ggplot() +
geom_histogram(aes(x/n, y = ..density..), bins= 30) +
stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
size = 1) +
xlab("p=x/n")
Here is another fit
ll <- function(a){
x <- nodes$x
total <- nodes$n
-sum(stats::dbinom(x, total, a, log = TRUE))
}
#stats::dbinom()
m <- mle(ll, start = list(a=.5), method = "L-BFGS-B",
lower = c(0.0001, .1))
a = coef(m)
nodes %>%
ggplot() +
geom_histogram(aes(x/n, y = ..density..), bins=40) +
stat_function(fun = function(x) dbeta(x, a, 1), color = "red",
size = 1) +
xlab("proportion x/n")
For fitting a gamma distribution:
data(iris)
library(MASS) ## for the fitdistr function
library(ggplot2)
fit.params <- fitdistr(iris$Sepal.Length, "gamma", lower = c(0, 0))
## map Sepal.Length once in ggplot() instead of referring to iris$ inside aes()
ggplot(data = iris, aes(x = Sepal.Length)) +
geom_histogram(aes(y = ..density..)) +
geom_line(aes(y = dgamma(Sepal.Length, fit.params$estimate["shape"],
fit.params$estimate["rate"])), color = "red", size = 1) +
theme_classic()
You might also like to take a look at the distribution of the quantiles using the qqp function in the car package. Here are a few examples:
library(car)
qqp(iris$Sepal.Length, "norm") ##normal distribution
qqp(iris$Sepal.Length, "lnorm") ##log-normal distribution
gamma <- fitdistr(iris$Sepal.Length, "gamma")
qqp(iris$Sepal.Length, "gamma", shape = gamma$estimate[[1]],
rate = gamma$estimate[[2]]) ##gamma distribution
nbinom <- fitdistr(iris$Sepal.Length, "Negative Binomial")
qqp(iris$Sepal.Length, "nbinom", size = nbinom$estimate[[1]],
mu = nbinom$estimate[[2]]) ##negative binomial distribution
You can use the fitdistr function for ggplots or qqPlots. It supports lots of different distributions. Take a look at ?fitdistr
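To choose between candidate distributions, the fitted objects can also be compared numerically. The following is a minimal sketch, assuming AIC is an acceptable comparison criterion here (an assumption, not something stated in the answer above); it relies on the logLik method that MASS provides for fitdistr objects.

```r
library(MASS)  # fitdistr()

x <- iris$Sepal.Length

# Fit three candidate distributions by maximum likelihood
fits <- list(
  normal    = fitdistr(x, "normal"),
  lognormal = fitdistr(x, "lognormal"),
  gamma     = fitdistr(x, "gamma")
)

# Lower AIC suggests a better trade-off between fit and number of parameters
aic <- sapply(fits, AIC)
aic
```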
I feel that I am close to finding the answer to my problem, but somehow I just cannot manage it. I have used the nls function to fit 3 parameters of a rather complicated function describing the fertilization success of eggs (y-axis) over a range of sperm concentrations (x-axis) (Styan's model [1], [2]). Fitting the parameters works fine, but I cannot manage to plot a smoothed, extrapolated curve using the predict function (see the end of this post). I guess this is because the x-axis variable is not one of the fitted parameters. My question is: how can I plot a smoothed and extrapolated curve based on a model fitted with nls when the x-axis variable is not a fitted parameter?
Here is an example:
library(ggplot2)
data.nls <- structure(list(S0 = c(0.23298, 2.32984, 23.2984, 232.98399, 2329.83993,
23298.39926), fert = c(0.111111111111111, 0.386792452830189,
0.158415841584158, 0.898648648648649, 0.616, 0.186440677966102
), speed = c(0.035161615379406, 0.035161615379406, 0.035161615379406,
0.035161615379406, 0.035161615379406, 0.035161615379406), E0 = c(6.86219803476946,
6.86219803476946, 6.86219803476946, 6.86219803476946, 6.86219803476946,
7.05624476582978), tau = c(1800, 1800, 1800, 1800, 1800, 1800
), B0 = c(0.000102758645352932, 0.000102758645352932, 0.000102758645352932,
0.000102758645352932, 0.000102758645352932, 0.000102758645352932
)), .Names = c("S0", "fert", "speed", "E0", "tau", "B0"), row.names = c(NA,
6L), class = "data.frame")
## Model S
modelS <- function(Fe, tb, Be) with (data.nls,{
x <- Fe*(S0/E0)*(1-exp(-B0*E0*tau))
b <- Fe*(S0/E0)*(1-exp(-B0*E0*tb))
x*exp(-x)+Be*(1-exp(-x)-(x*exp(-x)))*exp(-b)})
## Define starting values
start <- list(Fe = 0.2, tb = 0.1, Be = 0.1)
## Fit the model using nls
modelS.fitted <- nls(formula = fert ~ modelS(Fe, tb, Be), data = data.nls, start = start,
control=nls.control(warnOnly=TRUE,minFactor=1e-5),trace = T, lower = c(0,0,0),
upper = c(1, Inf, 1), algorithm = "port")
## Combine model parameters
model.data <- cbind(data.nls, data.frame(pred = predict(modelS.fitted)))
## Plot
ggplot(model.data) +
geom_point(aes(x = S0, y = fert), size = 2) +
geom_line(aes(x = S0, y = pred), lwd = 1.3) +
scale_x_log10()
I have tried following joran's example here, but it has no effect, maybe because I did not fit S0:
r <- range(model.data$S0)
S0.ext <- seq(r[1],r[2],length.out = 200)
predict(modelS.fitted, newdata = list(S0 = S0.ext))
# [1] 0.002871585 0.028289057 0.244399948 0.806316161 0.705116868 0.147974213
Your function should take the arguments (S0,E0,B0,tau,Fe,tb,Be). nls will look for these names in the data.frame passed to its data argument and only try to fit those it doesn't find there (provided that starting values are given). There is no need for the with() construct in your function. (with shouldn't be used inside functions anyway; it is meant for interactive use.) In predict, newdata must contain all variables, that is S0, E0, B0, and tau.
Try this:
modelS <- function(S0,E0,B0,tau,Fe, tb, Be) {
x <- Fe*(S0/E0)*(1-exp(-B0*E0*tau))
b <- Fe*(S0/E0)*(1-exp(-B0*E0*tb))
x*exp(-x)+Be*(1-exp(-x)-(x*exp(-x)))*exp(-b)}
## Define starting values
start <- list(Fe = 0.2, tb = 0.1, Be = 0.1)
## Fit the model using nls
modelS.fitted <- nls(formula = fert ~ modelS(S0,E0,B0,tau,Fe, tb, Be), data = data.nls, start = start,
control=nls.control(warnOnly=TRUE,minFactor=1e-5),trace = T, lower = c(0,0,0),
upper = c(1, Inf, 1), algorithm = "port")
## Combine model parameters
model.data <- data.frame(
S0=seq(min(data.nls$S0),max(data.nls$S0),length.out=1e5),
E0=seq(min(data.nls$E0),max(data.nls$E0),length.out=1e5),
B0=seq(min(data.nls$B0),max(data.nls$B0),length.out=1e5),
tau=seq(min(data.nls$tau),max(data.nls$tau),length.out=1e5))
model.data$pred <- predict(modelS.fitted,newdata=model.data)
## Plot
ggplot(data.nls) +
geom_point(aes(x = S0, y = fert), size = 2) +
geom_line(data=model.data,aes(x = S0, y = pred), lwd = 1.3) +
scale_x_log10()
Obviously, this might not be what you want, since the function has multiple variables and more than one of them varies in newdata. Normally one would vary only one and keep the others constant for such a plot.
So this might be more appropriate:
S0 <- seq(min(data.nls$S0),max(data.nls$S0),length.out=1e4)
E0 <- seq(1,20,length.out=20)
B0 <- unique(data.nls$B0)
tau <- unique(data.nls$tau)
model.data <- expand.grid(S0,E0,B0,tau)
names(model.data) <- c("S0","E0","B0","tau")
model.data$pred <- predict(modelS.fitted,newdata=model.data)
## Plot
ggplot(model.data) +
geom_line(aes(x = S0, y = pred, color=interaction(E0,B0,tau)), lwd = 1.3) +
geom_point(data=data.nls,aes(x = S0, y = fert), size = 2) +
scale_x_log10()
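The point about nls taking variables from data and predict needing them in newdata can be seen in a much smaller sketch. The example below assumes a simple exponential decay model on simulated data; the function decay and all values are hypothetical and unrelated to Styan's model.

```r
set.seed(1)

# Simulated data: x is a variable in the data, a and b are parameters to fit
dat <- data.frame(x = 1:20)
dat$y <- 5 * exp(-0.3 * dat$x) + rnorm(20, sd = 0.05)

# All variables and parameters are explicit function arguments
decay <- function(x, a, b) a * exp(-b * x)

# nls finds x in dat and estimates only a and b, for which start values are given
fit <- nls(y ~ decay(x, a, b), data = dat, start = list(a = 4, b = 0.2))

# A dense grid of x values gives a smooth predicted curve
grid <- data.frame(x = seq(1, 20, length.out = 200))
grid$pred <- predict(fit, newdata = grid)
head(grid)
```

Because x is looked up in newdata, the predictions follow the grid rather than the six or twenty original observations, which is what makes the plotted line smooth.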
I have created a multiple linear regression model and would now like to plot it, but I can't seem to figure out how. Any help would be greatly appreciated! I used Boruta to select the feature attributes and then used train() to fit the model. When I try to plot model_lm I get the error:
There are no tuning parameters with more than 1 value.
Here is my code at what I have attempted so far:
rt_train <- rttotal2
rt_train$year <- NULL
#rt_train$box_office <- NULL
#impute NAs and address multicollinearity
preproc <- preProcess(rt_train, method = c("knnImpute","center",
"scale"))
rt_proc <- predict(preproc, rt_train)
rt_proc$box_office <- rt_train$box_office
sum(is.na(rt_proc))
titles <- rt_proc$titles
rt_proc$titles <- NULL
#rt_train$interval <- as.factor(rt_train$interval)
dmy <- dummyVars(" ~ .", data = rt_proc,fullRank = T)
rt_transform <- data.frame(predict(dmy, newdata = rt_proc))
index <- createDataPartition(rt_transform$interval, p =.75, list = FALSE)
train_m <- rt_transform[index, ]
rt_test <- rt_transform[-index, ]
str(rt_train)
y_train <- train_m$box_office
y_test <-rt_test$box_office
train_m$box_office <- NULL
rt_test$box_office <- NULL
#selected feature attributes
boruta.train <- Boruta(interval~., train_m, doTrace =1)
#graph to see most important var to interval
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
plot(boruta.train, xlab = "", xaxt = "n")
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
#get most important attributes
final.boruta <- TentativeRoughFix(boruta.train)
print(final.boruta)
getSelectedAttributes(final.boruta, withTentative = F)
boruta.rt_df <- attStats(final.boruta)
boruta.rt_df
boruta.rt_df <- setDT(boruta.rt_df, keep.rownames = TRUE)[]
predictors <- boruta.rt_df %>%
filter(., decision =="Confirmed") %>%
select(., rn)
predictors <- unlist(predictors)
control <- trainControl(method="repeatedcv",
number=10,
repeats=6)
#look at residuals
#p-value is very small so reject H0 that predictors have no effect so
#we can use rotten tomatoes to predict box_office ranges
train_m$interval <- NULL
model_lm <- train(train_m[,predictors],
y_train, method='lm',
trControl = control, tuneLength = 10)
model_lm #.568
#
plot(model_lm)
plot(model_lm)
z <- varImp(object=model_lm)
z <- setDT(z, keep.rownames = TRUE)
z$model <- NULL
z$calledFrom <- NULL
row.names(z)
plot(varImp(object=model_lm),main="Linear Model Variable Importance")
predictions<-predict.train(object=model_lm,rt_test[,predictors],type="raw")
table(predictions)
#get coeff
interc <- coef(model_lm$finalModel)
slope <- coef(model_lm$finalModel)
ggplot(data = rt_train, aes(y = box_office)) +
geom_point() +
geom_abline(slope = slope, intercept = interc, color = 'red')
This is what some of my input looks like. Thank you!!
Here is an example using the inbuilt data set cars:
data(cars, package = "datasets")
library(caret)
Build the model:
control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 6)
model_lm <- train(dist ~ speed, data = cars, method='lm',
trControl = control, tuneLength = 10)
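plot(model_lm) fails because plot.train plots resampled performance against tuning parameter values, and method = 'lm' has no tuning parameter with more than one value. I will assume you would like to plot the final model instead.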
You can use the caret predict.train function to get the predictions from the model and plot them:
pred <- predict(model_lm, cars)
pred <- data.frame(pred = pred, speed = cars$speed)
Additionally, you can supply the cars data set to geom_point to plot the observations:
library(ggplot2)
ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x=speed, y = dist))
If you would like to obtain the confidence or prediction interval, you can use the predict.lm function on model_lm$finalModel.
Here is an example for the prediction interval:
pred <- predict(model_lm$finalModel, cars, se.fit = TRUE, interval = "prediction")
pred <- data.frame(pred = pred$fit[,1], speed = cars$speed, lwr = pred$fit[,2], upr = pred$fit[,3])
pred_int <- ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x = speed, y = dist)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = speed), alpha = 0.2)
Or the confidence interval:
pred <- predict(model_lm$finalModel, cars, se.fit = TRUE, interval = "confidence")
pred <- data.frame(pred = pred$fit[,1], speed = cars$speed, lwr = pred$fit[,2], upr = pred$fit[,3])
pred_conf <- ggplot(data = pred)+
geom_line(aes(x = speed, y = pred))+
geom_point(data = cars, aes(x = speed, y = dist)) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = speed), alpha = 0.2)
Plotting them side by side:
library(cowplot)
plot_grid(pred_int, pred_conf)
To plot the linear dependence on two variables you can use a 3D plot; with more variables than that, plotting the fit directly becomes a problem.
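For a model with several predictors, one common alternative to a 3D plot is to plot observed against predicted values. The following is a minimal sketch that uses plain lm on the built-in mtcars data rather than the caret model from the question (an assumption made purely for illustration); the name diag_df is hypothetical.

```r
# Multiple linear regression with two predictors on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One row per observation: what was observed and what the model predicts
diag_df <- data.frame(
  observed  = mtcars$mpg,
  predicted = unname(predict(fit))
)

# Points close to the line observed == predicted indicate a good fit
head(diag_df)
cor(diag_df$observed, diag_df$predicted)
```

diag_df can then be drawn with ggplot(diag_df, aes(predicted, observed)) + geom_point() + geom_abline(slope = 1, intercept = 0). For a caret model, the predicted column could instead come from predict(model_lm, newdata).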