I have set up a logistic regression model in R and successfully plotted the points of the model to show a relationship in the dataset. I am having trouble showing the line graph of the prediction. The model predicts readmission rates of a hospital based on the length of the initial stay (in days). Here is my code:
mydata <- read.csv(file = 'C:\\Users\\nickg\\Downloads\\3kfid8emf9rkc9ek30sf\\medical_clean.csv', header=TRUE)[,c("Initial_days","ReAdmis")]
head(mydata)
mydata$ReAdmis.f <- factor(mydata$ReAdmis)
logfit <- glm(mydata$ReAdmis.f ~ mydata$Initial_days, data = mydata, family = binomial)
summary(logfit)
range(mydata$Initial_days)
xweight <- seq(0, 79.992, .008)
yweight <- predict(logfit, list(xweight), type = "response")
plot(mydata$Initial_days, mydata$ReAdmis.f, pch = 16, xlab = "Initial Days", ylab = "ReAdmission Y/N")
lines(xweight, yweight)
As you can see I have the model set up and ranges described by xweight and yweight, but nothing shows up for the line.
Always use curve for this:
plot(ReAdmis.f ~ Initial_days, data = mydata,
pch = 16, xlab = "Initial Days", ylab = "ReAdmission Y/N")
curve(predict(logfit, newdata = data.frame(Initial_days = x),
#x is created by the curve function based on the plot's x limits
#note that newdata must contain the x variable with exactly the same name as in the original data
type = "response"),
add = TRUE)
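However, the issue here could be that your y variable is a factor (internally stored as the integer codes 1 and 2 if you have two levels), whereas logistic regression predictions always lie in the interval [0, 1], so the fitted line falls below the plotted y range. You should convert ReAdmis.f into 0/1 integer values before running the code. Also fit the model with bare column names (response ~ Initial_days, data = mydata) rather than mydata$Initial_days; otherwise predict() cannot match the column in newdata and returns the fitted values for the original rows instead.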
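A minimal sketch of both fixes together, using simulated data in place of medical_clean.csv. The simulated values, the "Yes"/"No" coding and the column name ReAdmis01 are assumptions made only for illustration:

```r
set.seed(42)

# Simulated stand-in for the real data (hypothetical values)
n <- 500
mydata <- data.frame(Initial_days = runif(n, 1, 72))
p_true <- plogis((mydata$Initial_days - 40) / 5)
mydata$ReAdmis <- ifelse(runif(n) < p_true, "Yes", "No")

# Recode the response as 0/1 so points and predictions share one y scale
mydata$ReAdmis01 <- as.integer(mydata$ReAdmis == "Yes")

# Bare column names in the formula, so predict() can match newdata
logfit <- glm(ReAdmis01 ~ Initial_days, data = mydata, family = binomial)

plot(ReAdmis01 ~ Initial_days, data = mydata, pch = 16,
     xlab = "Initial Days", ylab = "ReAdmission (0 = No, 1 = Yes)")
curve(predict(logfit, newdata = data.frame(Initial_days = x),
              type = "response"),
      add = TRUE)
```

With the response on a 0/1 scale, the predicted probabilities and the observed points sit on the same axis, so the curve is visible.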
I am using the R programming language. I am trying to follow this tutorial: https://rdrr.io/cran/randomForestSRC/man/plot.competing.risk.rfsrc.html
This tutorial shows how to use the "survival random forest" algorithm, an algorithm used to analyze survival data. In this example, the "follic" data set is used, and the survival random forest algorithm is used to analyze the instantaneous hazard of an observation experiencing "status 1" vs "status 2" (this is called "competing risks").
In the code below, the survival random forest model is trained on the follic data set using all observations except the last two observations. Then, this model is used to predict the hazards of the last two observations:
#load library
library(randomForestSRC)
#load data
data(follic, package = "randomForestSRC")
#train model on all observations except the last 2 observations
follic.obj <- rfsrc(Surv(time, status) ~ ., follic[c(1:539),], nsplit = 3, ntree = 100)
#use model to predict the last two observations
f <- predict(follic.obj, follic[540:541, ])
#plot individual curves - does not work
plot.competing.risk(f)
However, this seems to produce the average hazards for the last two observations experiencing "status 1 vs status 2".
Is there a way to plot the individual hazards of the first observation and the second observation?
Thanks
EDIT1:
I know how to do this for other functions in this package, e.g. here you can plot these curves for 7 observations at once:
data(veteran, package = "randomForestSRC")
plot.survival(rfsrc(Surv(time, status)~ ., veteran), cens.model = "rfsrc")
## pbc data
data(pbc, package = "randomForestSRC")
pbc.obj <- rfsrc(Surv(days, status) ~ ., pbc)
## use subset to focus on specific individuals
plot.survival(pbc.obj, subset = c(3, 10))
This example seems to show the predicted survival curves for 7 observations (plus the confidence intervals - the red line is the average) at once. But I still do not know how to do this for the "plot.competing.risk" function.
EDIT2:
I think there might be an indirect way to solve this - you can predict each observation individually:
#use model to predict the last two observations individually
f1 <- predict(follic.obj, follic[540, ])
f2 <- predict(follic.obj, follic[541, ])
#plot individual curves
plot.competing.risk(f1)
plot.competing.risk(f2)
But I was hoping there was a more straightforward way to do this. Does anyone know how?
One possible way is to modify the body of plot.competing.risk so that it uses each individual's values instead of the group mean, and to loop over the individuals so that their lines are overlaid in the same panel, as shown below.
# use the model to predict the last three observations
f <- predict(follic.obj, follic[539:541, ])
x <- f
par(mfrow = c(2, 2))
for (k in 1:3) {                # k: plot type (1 = CSCHF, 2 = CIF, 3 = CPC)
  for (i in 1:dim(x$chf)[1]) {  # i: individual in x
    # cschf <- apply(x$chf, c(2, 3), mean, na.rm = TRUE)  # original: group mean
    cschf <- x$chf[i, , ]       # individual values
    # cif <- apply(x$cif, c(2, 3), mean, na.rm = TRUE)    # original: group mean
    cif <- x$cif[i, , ]         # individual values
    cpc <- do.call(cbind, lapply(1:ncol(cif), function(j) {
      cif[, j] / (1 - rowSums(cif[, -j, drop = FALSE]))
    }))
    if (k == 1) {
      matx <- cschf
      ylim <- range(x$chf)
    }
    if (k == 2) {
      matx <- cif
      ylim <- range(x$cif)
    }
    if (k == 3) {
      matx <- cpc
      ylim <- c(0, 1)           # manually assigned, for now
    }
    # values are plotted on the 0-1 scale, not multiplied by 100
    ylab <- c("Cause-Specific CHF", "Probability", "Probability")[k]
    # add = TRUE from the second individual on, so the lines overlap
    matplot(x$time.interest, matx, type = "l", lty = 1, lwd = 3, col = 1:2,
            add = (i != 1), ylim = ylim, xlab = "Time", ylab = ylab)
  }
  legend.txt <- paste(c("CSCHF", "CIF", "CPC")[k], 1:2, " ")
  legend("bottomright", legend = legend.txt, col = 1:2, lty = 1, lwd = 3)
}
I am trying to predict and plot the (estimated) survival curve for a new observation in R. Using the "survival" library and the "lung" data set, I first fit a Cox proportional hazards model to the data. Then I tried to predict and plot the survival curve for a hypothetical new observation (I entered the details of this observation in the "list" command). However, this is not working.
I have attached my code below:
#load library
library(survival)
data(lung)
#create survival object
s <- with(lung,Surv(time,status))
#create model
modelA <- coxph(s ~ as.factor(sex)+age+ph.ecog+wt.loss+ph.karno,data=lung, model=TRUE)
summary(modelA)
#plot
plot(survfit(modelA), ylab="Probability of Survival",
xlab="Time", col=c("red", "black", "black"))
#predict for a hypothetical NEW observation (here is where the error is)
lines(predict(modelA, newdata=list(sex=1,
age = 56,
ph.ecog = 1,
ph.karno = 50,
wt.loss = 11),
type="quantile",
p=seq(.01,.99,by=.01)),
seq(.99,.01,by=-.01),
col="blue")
## Error in match.arg(type) :
## 'arg' should be one of “lp”, “risk”, “expected”, “terms”, “survival”
Does anyone know what I am doing wrong? Thanks
This is what the survfit function is for. In your example, you plot the survfit for the model, but you can feed a newdata argument into this function and it will produce the estimated survival for these data.
If we reproduce your example:
library(survival)
s <- with(lung, Surv(time, status))
modelA <- coxph(s ~ as.factor(sex) + age + ph.ecog + wt.loss + ph.karno,
data = lung, model = TRUE)
plot(survfit(modelA), ylab = "Probability of Survival",
xlab = "Time", col = c("red", "black", "black"))
Then we can create a survival curve given your specified covariates like this:
est <- survfit(modelA, newdata = data.frame(sex = 1,
age = 56,
ph.ecog = 1,
ph.karno = 50,
wt.loss = 11))
Now est is an S3 object with members that include time and surv, so we can plot a blue line tracking the estimated survival of individuals with the given covariates like this:
lines(est$time, est$surv, col = 'blue', type = 's')
Or plot it on its own with a 95% confidence interval:
plot(est, ylab = "Probability of Survival",
xlab = "Time", col = c("red", "black", "black"))
Created on 2022-05-26 by the reprex package (v2.0.1)
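The same newdata mechanism extends to several individuals at once. The following is a minimal sketch, assuming two hypothetical patients that differ only in sex; the names fit, newpat and est2 are illustrative and not part of the original answer:

```r
library(survival)

fit <- coxph(Surv(time, status) ~ as.factor(sex) + age + ph.ecog +
               wt.loss + ph.karno, data = lung)

# Two hypothetical patients that differ only in sex
newpat <- data.frame(sex = c(1, 2), age = 56, ph.ecog = 1,
                     ph.karno = 50, wt.loss = 11)

est2 <- survfit(fit, newdata = newpat)

# est2$surv is a matrix with one column per row of newpat
plot(est2, col = c("blue", "red"), conf.int = FALSE,
     xlab = "Time", ylab = "Probability of Survival")
legend("topright", legend = c("sex = 1", "sex = 2"),
       col = c("blue", "red"), lty = 1)
```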
See the description of the predict() function (you can open it in R help by running ?predict.coxph):
type - the type of predicted value. Choices are the linear predictor
("lp"), the risk score exp(lp) ("risk"), the expected number of events
given the covariates and follow-up time ("expected"), and the terms of
the linear predictor ("terms"). The survival probability for a subject
is equal to exp(-expected).
You can see that your type="quantile" does not match any of the expected inputs. If you call predict() without the type argument, in your case it will default to "lp" (the linear predictor).
When you call predict() on your modelA object, R determines that it is of class coxph, so predict.coxph() is applied. Arguments like type="quantile" and p=seq(.01,.99,by=.01) are not valid for predict.coxph() (p is ignored, type raises the error). They belong to another method, predict.survreg(); for it to be called, your modelA object must be of class survreg, i.e. it should be created with survreg() instead of coxph().
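For comparison, here is a minimal sketch of where type = "quantile" does apply. It fits a parametric survreg() model (Weibull by default), which is a different model from the Cox fit in the question; the reduced covariate set and the name modelB are assumptions chosen to keep the example short:

```r
library(survival)

# Parametric model; predict() on this object dispatches to predict.survreg()
modelB <- survreg(Surv(time, status) ~ age + ph.ecog, data = lung)

p <- seq(0.01, 0.99, by = 0.01)
q <- predict(modelB, newdata = data.frame(age = 56, ph.ecog = 1),
             type = "quantile", p = p)

# The p-th quantile of survival time corresponds to survival probability 1 - p
plot(as.numeric(q), 1 - p, type = "l", col = "blue",
     xlab = "Time", ylab = "Probability of Survival")
```

The resulting curve comes from the parametric model, so it will not coincide with the survfit() curve from the Cox model.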
I do have a question related to plotting actual data of a time series and the values from a fitted model. In particular, my questions relate to this paper:
https://static.googleusercontent.com/media/www.google.com/en//googleblogs/pdfs/google_predicting_the_present.pdf
In the appendix of the document, you can find an R script. Here, I do have two initial questions: (1) What does
##### Define Predictors - Time Lags;
dat$s1 = c(NA, dat$sales[1:(nrow(dat)-1)]);
dat$s12 = c(rep(NA, 12), dat$sales[1:(nrow(dat)-12)]);
do and what is the function of:
##### Divide data by two parts - model fitting & prediction
dat1 = mdat[1:(nrow(mdat)-1), ]
dat2 = mdat[nrow(mdat), ]
Final and main question: Let's say I get a calculation for my data with
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1);
summary(fit)
The adj. R-squared value is 0.342. Thus, I'd argue that the model above explains roughly 34% of the variance between modeled data (predictive data?) and the actual data. Now, how can I plot this "model graph" (fitted) so that I get something like this in the paper?
I assume the second graph's "fitted" is actually the data from the estimated model, right? If so, then this part seems missing in the script.
Thanks a lot!
EDIT 1:
Tried this:
# Actual values and fitted values
plot(sales ~ month, data= dat1, col="blue", lwd=1, type="l", xaxt = "n", xaxs="r",yaxs="r", xlab="", ylab="Total Sales");
par(new=TRUE)
plot(fitted(fit) ~ month, data= dat1, col="red", lwd=1, type="l", xaxs="r", yaxs="r", yaxt = "n", xlab="Month", ylab="Index", xaxt="n");
axis(4)
Output: Error in (function (formula, data = NULL, subset = NULL, na.action = na.fail, : variable lengths differ (found for 'month')
dat$s1 = c(NA, dat$sales[1:(nrow(dat)-1)])
This creates a new column s1 holding sales lagged by one period: the first element is NA and each later element is the previous row's sales value. The last value of sales is dropped.
dat$s12 = c(rep(NA, 12), dat$sales[1:(nrow(dat)-12)])
This creates the s12 column in the same way with a lag of 12: twelve NAs followed by the first nrow(dat)-12 values of dat$sales.
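A small sketch of what the two lag columns look like, using a made-up sales vector (the values 101 to 115 are hypothetical):

```r
# Hypothetical data: 15 periods of sales
dat <- data.frame(sales = 101:115)

dat$s1  <- c(NA, dat$sales[1:(nrow(dat) - 1)])            # lag 1
dat$s12 <- c(rep(NA, 12), dat$sales[1:(nrow(dat) - 12)])  # lag 12

# Row 2 carries the sales of row 1 in s1; row 13 carries the sales of row 1 in s12
dat[c(1, 2, 13, 14), ]
```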
dat1 = mdat[1:(nrow(mdat)-1), ]
dat2 = mdat[nrow(mdat), ]
dat1 contains all rows except the last observation; dat2 is only the last row. When predicting the response (sales), you only need to feed a data.frame with at least the columns that appear on the right side of the formula (also called explanatory variables), in this case s1, s12 and trends1, as the newdata argument to the predict() function. This is where dat2 is used.
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
This next line fits a model using dat1.
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1)
fitted(fit) will give you fitted values. Try predict(fit) and compare if it's any different.
Semicolons at the end of each statement are redundant in R.
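Regarding the "variable lengths differ" error in EDIT 1: by default lm() drops the rows where the lagged predictors are NA, so fitted(fit) is shorter than dat1$month. A minimal sketch of one way around this, using simulated data in place of the paper's data set (all values, and the column name fitted, are assumptions for illustration):

```r
set.seed(1)
n <- 48
# Simulated stand-in for the real data (hypothetical values)
dat <- data.frame(
  month   = seq(as.Date("2004-01-01"), by = "month", length.out = n),
  sales   = exp(rnorm(n, mean = 10, sd = 0.1)),
  trends1 = rnorm(n)
)
dat$s1  <- c(NA, dat$sales[1:(nrow(dat) - 1)])
dat$s12 <- c(rep(NA, 12), dat$sales[1:(nrow(dat) - 12)])
dat1 <- dat[1:(nrow(dat) - 1), ]

# na.exclude pads fitted() with NA for the rows dropped because of the lags
fit <- lm(log(sales) ~ log(s1) + log(s12) + trends1, data = dat1,
          na.action = na.exclude)

# The model is on the log scale, so back-transform before plotting
dat1$fitted <- exp(fitted(fit))

plot(dat1$month, dat1$sales, type = "l", col = "blue",
     xlab = "Month", ylab = "Total Sales")
lines(dat1$month, dat1$fitted, col = "red")
legend("topleft", legend = c("actual", "fitted"),
       col = c("blue", "red"), lty = 1)
```

Because both series are drawn on the same axis, par(new = TRUE) and a second plot() call are not needed.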
When applying gam.check in the mgcv package, R produces some residual plots and basis dimension output. Is there a way to only produce the plots and not the printed output?
library(mgcv)
set.seed(0)
dat <- gamSim(1,n=200)
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3), data=dat)
plot(b, pages=1)
gam.check(b, pch=19, cex=.3)
There are four plots, from top left, moving down and across we have:
A QQ plot of the residuals
A histogram of the residuals
A plot of residuals vs the linear predictor
A plot of observed vs fitted values.
In the code below, I assume b contains your fitted model, as per your example. First some things we need
type <- "deviance" ## "pearson" & "response" are other valid choices
resid <- residuals(b, type = type)
linpred <- napredict(b$na.action, b$linear.predictors)
observed.y <- napredict(b$na.action, b$y)
Note the last two lines are applying the NA handling method used when the model was fitted to the information on the linear.predictors and y, the stored copy of the response data.
The above code and that shown below is all given in the first 10 or so lines of the gam.check() source. To view this, just enter
gam.check
at the R prompt.
Each plot is produced as follows:
QQ plot
This is produced via qq.gam():
qq.gam(b, rep = 0, level = 0.9, type = type, rl.col = 2,
rep.col = "gray80")
Histogram of residuals
This is produced using
hist(resid, xlab = "Residuals", main = "Histogram of residuals")
Residuals vs linear predictor
This is produced using
plot(linpred, resid, main = "Resids vs. linear pred.",
xlab = "linear predictor", ylab = "residuals")
Observed vs fitted values
This is produced using
plot(fitted(b), observed.y, xlab = "Fitted Values",
ylab = "Response", main = "Response vs. Fitted Values")
There are now the two packages gratia and mgcViz which have functions to produce the gam.check output as ggplots which you can store as an object. The former doesn't print anything to console, the latter does.
require(gratia)
appraise(b)
require(mgcViz)
b = getViz(b)
check(b)
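Another option, not mentioned in the answers above, is to leave gam.check() as it is and swallow its printed text with base R's capture.output(). A minimal sketch (the variable name txt is illustrative):

```r
library(mgcv)

set.seed(0)
dat <- gamSim(1, n = 200, verbose = FALSE)  # verbose = FALSE silences gamSim's message
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

# The four plots are still drawn; the printed diagnostics end up in txt
txt <- capture.output(gam.check(b, pch = 19, cex = 0.3))
```

The captured text stays available in txt, so it can still be inspected later with cat(txt, sep = "\n").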
I have this code
factors<-read.csv("India_Factors.csv",header=TRUE)
marketfactor<-factors[,4]
sizefactor<-factors[,5]
valuefactor<-factors[,6]
dati<-get.hist.quote("SI", quote = "AdjClose", compression = "m")
returns<-diff(dati)
regression<-lm(returns ~ marketfactor + sizefactor + valuefactor,na.action=na.omit)
that does multilinear regression.
I want to plot on a 2D plane the returns against a factor (and this is trivial of course) with superimposed the projection of the linear regression hyperplane for the specific factor. To be more clear the result should be like this: wolfram demonstrations (see the snapshots).
Any help will be greatly appreciated.
Thank you for your time and have a nice week end.
Giorgio.
The points in my comment notwithstanding, here is the canonical way to generate output from a fitted model in R for combinations of predictors. It really isn't clear what the plots you want are showing, but the ones that make sense to me are partial plots, where one variable is varied over its range whilst the others are held at some common value. Here I use the sample mean when holding a variable constant.
First some dummy data, with only two covariates, but this extends to any number
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat <- transform(dat,
x1 = 0.2 + (0.4 * y) + rnorm(100),
x2 = 2.4 + (2.3 * y) + rnorm(100))
Fit the regression model
mod <- lm(y ~ x1 + x2, data = dat)
Next some data values to predict at using the model. You could do all variables in a single prediction and then subset the resulting object to plot only the relevant rows. Alternatively, more clearly (though more verbose), you can deal with each variable separately. Below I create two data frames, one per covariate in the model. In a data frame I generate 100 values over the range of the covariate being varied, and repeat the mean value of the other covariate(s).
pdatx1 <- with(dat, data.frame(x1 = seq(min(x1), max(x1), length = 100),
x2 = rep(mean(x2), 100)))
pdatx2 <- with(dat, data.frame(x1 = rep(mean(x1), 100),
x2 = seq(min(x2), max(x2), length = 100)))
In the linear regression with straight lines, you really don't need 100 values --- the two end points of the range of the covariate will do. However for models where the fitted function is not linear you need to predict at more locations.
Next, use the model to predict at these data points
pdatx1 <- transform(pdatx1, yhat = predict(mod, pdatx1))
pdatx2 <- transform(pdatx2, yhat = predict(mod, pdatx2))
Now we are ready to draw the partial plots. First compute a range for the y axis. Again, it is mostly redundant here, but if you are adding confidence intervals you will need to include their values below,
ylim <- range(pdatx1$yhat, pdatx2$yhat, dat$y)
To plot (here putting two figures on the same plot device) we can use the following code
layout(matrix(1:2, ncol = 2))
plot(y ~ x1, data = dat, ylim = ylim)
lines(yhat ~ x1, data = pdatx1, col = "red", lwd = 2)
plot(y ~ x2, data = dat, ylim = ylim)
lines(yhat ~ x2, data = pdatx2, col = "red", lwd = 2)
layout(1)
Which produces
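If you do want the confidence intervals mentioned above, predict() for lm objects can return them via interval = "confidence". A minimal sketch for the x1 panel only, rebuilding the same dummy data so that it runs on its own (the 95% level and the dashed line style are arbitrary choices):

```r
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat <- transform(dat,
                 x1 = 0.2 + (0.4 * y) + rnorm(100),
                 x2 = 2.4 + (2.3 * y) + rnorm(100))
mod <- lm(y ~ x1 + x2, data = dat)

# Vary x1 over its range, hold x2 at its mean
pdatx1 <- with(dat, data.frame(x1 = seq(min(x1), max(x1), length.out = 100),
                               x2 = mean(x2)))

# Matrix with columns fit, lwr and upr
ci <- predict(mod, newdata = pdatx1, interval = "confidence", level = 0.95)
pdatx1 <- cbind(pdatx1, ci)

# Include the interval limits when computing the y-axis range
ylim <- range(pdatx1$lwr, pdatx1$upr, dat$y)
plot(y ~ x1, data = dat, ylim = ylim)
lines(fit ~ x1, data = pdatx1, col = "red", lwd = 2)
lines(lwr ~ x1, data = pdatx1, col = "red", lty = 2)
lines(upr ~ x1, data = pdatx1, col = "red", lty = 2)
```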