R: Plot Individual Predictions

I am using the R programming language. I am trying to follow this tutorial: https://rdrr.io/cran/randomForestSRC/man/plot.competing.risk.rfsrc.html
This tutorial shows how to use the "survival random forest" algorithm, an algorithm used to analyze survival data. In this example, the "follic" data set is used, and the survival random forest algorithm analyzes the instantaneous hazard of an observation experiencing "status 1" vs. "status 2" (this is called "competing risks").
In the code below, the survival random forest model is trained on the follic data set using all observations except the last two. Then, this model is used to predict the hazards of the last two observations:
#load library
library(randomForestSRC)
#load data
data(follic, package = "randomForestSRC")
#train model on all observations except the last 2 observations
follic.obj <- rfsrc(Surv(time, status) ~ ., follic[c(1:539),], nsplit = 3, ntree = 100)
#use model to predict the last two observations
f <- predict(follic.obj, follic[540:541, ])
#plot individual curves - does not work
plot.competing.risk(f)
However, this seems to plot the average hazards across the last two observations for "status 1" vs. "status 2".
Is there a way to plot the individual hazards of the first observation and the second observation?
Thanks
EDIT1:
I know how to do this for other functions in this package, e.g. here you can plot these curves for 7 observations at once:
data(veteran, package = "randomForestSRC")
plot.survival(rfsrc(Surv(time, status)~ ., veteran), cens.model = "rfsrc")
## pbc data
data(pbc, package = "randomForestSRC")
pbc.obj <- rfsrc(Surv(days, status) ~ ., pbc)
## use subset to focus on specific individuals
plot.survival(pbc.obj, subset = c(3, 10))
This example seems to show the predicted survival curves for 7 observations at once (plus the confidence intervals; the red line is the average). But I still do not know how to do this for the "plot.competing.risk" function.
EDIT2:
I think there might be an indirect way to solve this - you can predict each observation individually:
#use model to predict the last two observations individually
f1 <- predict(follic.obj, follic[540, ])
f2 <- predict(follic.obj, follic[541, ])
#plot individual curves
plot.competing.risk(f1)
plot.competing.risk(f2)
But I was hoping there was a more straightforward way to do this. Does anyone know how?

One possible way is to adapt the internals of plot.competing.risk to draw individual curves, plotting over a for loop so that the individual lines overlap, as shown below.
#use model to predict the last three observations
f <- predict(follic.obj, follic[539:541, ])
x <- f
par(mfrow = c(2, 2))
for (k in 1:3) {                        #k selects the plot type (CSCHF, CIF, CPC)
  for (i in 1:dim(x$chf)[1]) {          #i runs over all individuals in x
    #cschf <- apply(x$chf, c(2, 3), mean, na.rm = TRUE)  #original group mean
    cschf <- x$chf[i, , ]               #individual values
    #cif <- apply(x$cif, c(2, 3), mean, na.rm = TRUE)    #original group mean
    cif <- x$cif[i, , ]                 #individual values
    cpc <- do.call(cbind, lapply(1:ncol(cif), function(j) {
      cif[, j] / (1 - rowSums(cif[, -j, drop = FALSE]))
    }))
    if (k == 1) {
      matx <- cschf
      yrange <- range(x$chf)
    }
    if (k == 2) {
      matx <- cif
      yrange <- range(x$cif)
    }
    if (k == 3) {
      matx <- cpc
      yrange <- c(0, 1)                 #manually assigned, for now
    }
    ylab <- c("Cause-Specific CHF", "Probability (%)", "Probability (%)")[k]
    #add = TRUE after the first individual, so the individual lines overlap
    matplot(x$time.interest, matx, type = "l", lty = 1, lwd = 3, col = 1:2,
            add = (i != 1), ylim = yrange, xlab = "Time", ylab = ylab)
  }
  legend <- paste(c("CSCHF", "CIF", "CPC")[k], 1:2, " ")
  legend("bottomright", legend = legend, col = 1:2, lty = 1, lwd = 3)
}
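Alternatively, since the prediction object exposes the CIF array directly (indexed as individual x time x event, as the loop above uses), a shorter sketch along these lines should overlay one line per individual for each event type (assuming the same f as above):
#pull the CIF array out of the prediction object and plot every individual at once
cif <- f$cif
matplot(f$time.interest, t(cif[, , 1]), type = "l", lty = 1, col = "blue",
        xlab = "Time", ylab = "CIF", ylim = c(0, 1)) #event 1, one line per individual
matlines(f$time.interest, t(cif[, , 2]), lty = 2, col = "red") #event 2
legend("topleft", legend = c("event 1", "event 2"), lty = 1:2, col = c("blue", "red"))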

cumulative survival rate after weighting using survey::svykm()

I used the survey package to draw a weighted Kaplan Meier survival plot, like this:
library(survey)
library(jskm)  #provides svyjskm()
data(pbc, package = "survival")
pbc$randomized <- with(pbc, !is.na(trt) & trt > 0)
biasmodel <- glm(randomized ~ age * edema, data = pbc)
pbc$randprob <- fitted(biasmodel)
dpbc <- svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(pbc, randomized))
s2 <- svykm(Surv(time, status > 0) ~ sex, design = dpbc)
svyjskm(s2, pval = TRUE, table = TRUE, design = dpbc)
Now I would like to calculate the survival rates in the two groups, similar to what I would get when using summary(x, times = c(1:5)) on a survfit object. Does anybody know how I can extract these values?
Many thanks in advance!
Saving survival probabilities to data frames
The time and survival probability values can be extracted from the svykm object ("s2") and saved in separate data frames for females ("s2_data_f") and males ("s2_data_m") using the following code:
s2_data_f <- data.frame(time = s2[["f"]][["time"]], surv = s2[["f"]][["surv"]])
s2_data_m <- data.frame(time = s2[["m"]][["time"]], surv = s2[["m"]][["surv"]])
Note that not all time values will be available in the data (e.g. no one has time = 5), and for these values the nearest smaller value should be taken (e.g. for time = 5, the value for time = 0 should be used, which is 100% survival).
Functions to extract survival probabilities or time values from data frames
Below is a function that locates the row in "s2_data_f" with a specified time value (or the nearest smaller time value) and returns the corresponding survival probability value.
return_surv_f <- function(x) {
  time <- max(s2_data_f$time[s2_data_f$time <= x])
  return(s2_data_f$surv[s2_data_f$time == time])
}
Similarly, for males ("s2_data_m") the function would be:
return_surv_m <- function(x) {
  time <- max(s2_data_m$time[s2_data_m$time <= x])
  return(s2_data_m$surv[s2_data_m$time == time])
}
These functions can then be used with "sapply" to return survival probability results for one or more chosen time values.
sapply(c(1:5), return_surv_f)
sapply(c(1:5), return_surv_m)
If you need to get the results in reverse (i.e. find the time corresponding with a specific survival probability) the "quantile" function can be used. For example, if you want to know at what time 75% of participants were alive then:
quantile(s2[["f"]], probs = 0.75)
quantile(s2[["m"]], probs = 0.75)
Confidence intervals or standard errors for survival probabilities
If you wish to calculate confidence intervals or standard errors for the survival probabilities then "se = TRUE" must be added to the svykm function.
s2 <- svykm(Surv(time,status>0) ~ sex, design = dpbc, se = TRUE)
Note, however, that this changes the statistical method; the R survey package documentation states:
"When standard errors are computed, the survival curve is actually the Aalen (hazard-based) estimator rather than the Kaplan-Meier estimator."
Confidence intervals can then be obtained using "confint" with one or more time values specified in "parm =".
confint(s2[["f"]], parm = c(1000:1005), level = 0.95)
confint(s2[["m"]], parm = c(1000:1005), level = 0.95)
Standard errors can be obtained from the "varlog" values.
s2_data_f <- data.frame(varlog = s2[["f"]][["varlog"]])
s2_data_m <- data.frame(varlog = s2[["m"]][["varlog"]])
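As a rough sketch of how these might be used (this assumes, per the survey package documentation, that varlog is the variance of log(survival)): the standard error on the log scale is sqrt(varlog), and approximate 95% limits can be built on the log scale and then exponentiated.
s2_f <- data.frame(
  time   = s2[["f"]][["time"]],
  surv   = s2[["f"]][["surv"]],
  se_log = sqrt(s2[["f"]][["varlog"]]) #standard error on the log scale
)
#approximate 95% confidence limits, constructed on the log scale
s2_f$lower <- exp(log(s2_f$surv) - 1.96 * s2_f$se_log)
s2_f$upper <- pmin(exp(log(s2_f$surv) + 1.96 * s2_f$se_log), 1)
head(s2_f)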
Example of a data frame including survival probabilities and their confidence intervals:
s2_results_f <- data.frame(
  time = c(1000:1005),
  surv = sapply(c(1000:1005), return_surv_f),
  CI = confint(s2[["f"]], parm = c(1000:1005), level = 0.95)
)
s2_results_f[2:4] <- round(s2_results_f[2:4], 2)*100
s2_results_f[2:4] <- paste0(unlist(s2_results_f[2:4]), "%")
names(s2_results_f)[1:4] <- c("Follow-up time", "Survival probability", "95% CI lower", "95% CI upper")

(R) Adding Confidence Intervals To Plots

I am using R. I am following this tutorial (https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/) and I am trying to adapt the code for a similar problem.
In this tutorial, a statistical model is developed on a dataset and then used to predict 3 new observations. We then plot the results for these 3 observations:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a = na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new = a[1:3,]
#create a training set by removing first three rows
a = a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time,status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <-data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients (as.numeric() turns each one-row data.table into a plain vector)
plot(r_fit$unique.death.times, as.numeric(pred[1, ]), type = "l", col = "red",
     xlab = "Time", ylab = "Survival probability", ylim = c(0, 1))
lines(r_fit$unique.death.times, as.numeric(pred[2, ]), col = "green")
lines(r_fit$unique.death.times, as.numeric(pred[3, ]), col = "blue")
From here, I would like to try to add confidence intervals (confidence regions) to each of these 3 curves, so that each curve has a shaded band around it.
I found a previous Stack Overflow post (survfit() Shade 95% confidence interval survival plot) that shows how to do something similar, but I am not sure how to extend the results from this post to each individual observation.
Does anyone know if there is a direct way to add these confidence intervals?
Thanks
If you create your plot using ggplot, you can use the geom_ribbon function to draw confidence intervals as follows:
ggplot(data = ...) +
  geom_line(aes(x = ..., y = ...), color = ...) +
  geom_ribbon(aes(x = ..., ymin = ..., ymax = ...), fill = ..., alpha = ...) +
  geom_line(aes(x = ..., y = ...), color = ...) +
  geom_ribbon(aes(x = ..., ymin = ..., ymax = ...), fill = ..., alpha = ...)
You can put + after geom_line and repeat the same steps for each observation.
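As a concrete sketch for the first of the three patients above: ranger does not (to my knowledge) return interval estimates for predicted survival curves, so the lower/upper columns below are placeholders to show the plotting mechanics; substitute whatever interval estimates you can obtain.
library(ggplot2)
curve_df <- data.frame(
  time  = r_fit$unique.death.times,
  surv  = as.numeric(pred[1, ]),
  lower = pmax(as.numeric(pred[1, ]) - 0.1, 0), #placeholder lower bound
  upper = pmin(as.numeric(pred[1, ]) + 0.1, 1)  #placeholder upper bound
)
ggplot(curve_df, aes(x = time)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "red", alpha = 0.2) +
  geom_line(aes(y = surv), color = "red") +
  labs(x = "Time", y = "Survival probability")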
You can also check:
Having trouble plotting multiple data sets and their confidence intervals on the same GGplot. Data Frame included and
https://bookdown.org/ripberjt/labbook/appendix-guide-to-data-visualization.html

ARIMA loop in R

I'm pretty new to R and I've run into a problem with finding the optimal ARIMA model. So far I've modeled the trend and a seasonal component, and now I want to model the cyclical component with an ARIMA model. I want the output in the end to include coefficients for the time variable, the seasonal variables, and also the ARIMA terms. I've tried to use a loop to find the optimal ARIMA model and the coefficients, but I just get this message:
"Error in optim(init[mask], armaCSS, method = optim.method, hessian = FALSE, : non-finite value supplied by optim"
I've tried looking for other answers in here, but I just can't seem to figure out what I'm doing wrong.
I've included the entire code in case it is necessary, but the error appears after running the loop in the end.
I appreciate any help I can get, thank you!
#clear workspace
rm(list=ls())
#load data
setwd("~/Desktop/CBS/HA almen year 3 /Forecasting /R koder ")
data <- scan("onlineretail.txt")
data <- data[2:69] #cut off first period + two last periods for whole years
T=length(data)
s=4
years=T/s
styear=2000
st=c(styear,1)
data = ts(data,start=st, frequency = s)
plot(data)
summary(data)
#plot shows increasing variance - log transform data
lndata <- log(data)
plot(lndata)
dataTSE = decompose(lndata, type="additive")
plot(dataTSE)
########### Trend ##########
t=(1:T)
t2=t^2
lny <- lndata
lmtrend.model <- lm(lny~t)
summary(lmtrend.model)
#linear trend T_t = 8.97 + 0.039533*TIME - both coefficients significant
#Project 2, explanation why linear is better than quadratic
qtrend.model <- lm(lny~t+t2)
summary(qtrend.model)
lntrend = fitted(lmtrend.model)
lntrend = ts(lntrend, start=st, frequency = s)
#lntrend2 = fitted(qtrend.model)
#lntrend2 = ts(lntrend2, start=st, frequency = s)
residuals=lny-lntrend
par(mar=c(5,5,5,5))
plot(lny, ylim=c(5,12), main="Log e-commerce retail sales")
lines(lntrend, col="blue")
#lines(lntrend2, col="red")
par(new=T)
plot(residuals,ylim=c(-0.2,0.8),ylab="", axes=F)
axis(4, pretty(c(-0.2,0.4)))
abline(h=0, col="grey")
mtext("Residuals", side=4, line=2.5, at=0)
############# Season #################
#The ACF of the residuals confirms the neglected seasonality, because there
#is a clear pattern for every k+4 lags:
acf(residuals)
#Remove trend to observe seasonal factors without the trend:
detrended = residuals
plot(detrended, ylab="ln sales", main="Seasonality in ecommerce retail sales")
abline(h=0, col="grey")
#We can check out the average magnitude of seasonal factors
seasonal.matrix = matrix(detrended, ncol = s, byrow = TRUE)
SeasonalFactor = apply(seasonal.matrix, 2, mean)
SeasonalFactor=ts(SeasonalFactor, frequency = s)
SeasonalFactor
plot(SeasonalFactor);abline(h=0, col="grey")
#We add seasonal dummies to our model of trend and omit the last quarter
library("forecast")
M <- seasonaldummy(lny)
ST.model <- lm(lny ~ t+M)
summary(ST.model)
#ST.model <- tslm(lny~t+season)
#summary(ST.model)
#Both the trend and seasonal dummies appears highly significant
#We will use a Durbin-Watson test to detect serial correlation
library("lmtest")
dwtest(ST.model)
#The DW value is 0.076396. This is quite small, as the value should be around 2,
#and we should therefore try to improve the model with a cyclical component
#I will construct a plot that shows how the model fits the data and
#how the residuals look
lntrend=fitted(ST.model)
lntrend = ts(lntrend, start=st, frequency = s)
residuals=lny-lntrend
par(mar=c(5,5,5,5))
plot(lny, ylim=c(5,12), main="Log e-commerce retail sales")
lines(lntrend, col="blue")
#tell R to draw over the current plot with a new one
par(new=T)
plot(residuals,ylim=c(-0.2,0.8),ylab="", axes=F)
axis(4, pretty(c(-0.2,0.4)))
abline(h=0, col="grey")
mtext("Residuals", side=4, line=2.5, at=0)
############## Test for unit root ############
#We will check if the data is stationary, and to do so we will
#test for unit root.
#To do so, we will perform a Dickey-Fuller test. First, we have to remove the
#seasonal component.
#We can also perform an informal test with ACF and PACF
#the autocorrelation function shows that the data damps slowly
#while the PACF is close to 1 at lag 1 and then lags become insignificant
#this is informal evidence of unit root
acf(residuals)
pacf(residuals)
#Detrended and deseasonalized data
deseason = residuals
plot(deseason)
#level changes a lot over time, not stationary in mean
#Dickey-Fuller test
require(urca)
test <- ur.df(deseason, type = c("trend"), lags=3, selectlags = "AIC")
summary(test)
#We do not reject that there is a unit root if
# |test statistic| < |critical value|
# 1.97 < 4.04
#We can see from the output that the absolute value of the test statistic
#is smaller than the critical value. Therefore, there is no evidence against
#the unit root.
#We check the ACF and PACF in first differences. There should be no significant
#lags if the data is white noise in first differences.
acf(diff(deseason))
pacf(diff(deseason))
deseasondiff = diff(deseason, differences = 2)
plot(deseasondiff)
test2 <- ur.df(deseasondiff, type=c("trend"), lags = 3, selectlags = "AIC")
summary(test2)
#From the plot and the Dickey-Fuller test, it looks like we need to difference twice
############# ARIMA model ############
S1 = rep(c(1,0,0,0), T/s)
S2 = rep(c(0,1,0,0), T/s)
S3 = rep(c(0,0,1,0), T/s)
TrSeas = model.matrix(~ t+S1+S2+S3)
#Double loop for finding the best fitting ARIMA model and since there was
#a drift, we include this in the model
best.order <- c(0, 2, 0)
best.aic <- Inf
for (q in 1:6) for (p in 1:6) {
  fit.aic <- AIC(arima(lny, order = c(p, 2, q), include.mean = TRUE, xreg = TrSeas))
  print(c(p, q, fit.aic))
  if (fit.aic < best.aic) {
    best.order <- c(p, 2, q)  #keep d = 2, consistent with the model being fit
    best.arma <- arima(lny, order = c(p, 2, q), include.mean = TRUE, xreg = TrSeas)
    best.aic <- fit.aic
  }
}
best.order
Please use the forecast package from Prof. Hyndman.
The call to:
auto.arima(data)
will return the best-fitting ARIMA model for your time series (selected automatically by information criteria). You will also find https://www.otexts.org/fpp/8/7 a great reference.
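As a minimal sketch in terms of the question's own objects (assuming lny and TrSeas are defined as above; the intercept column of TrSeas is dropped because auto.arima handles the constant itself):
library(forecast)
#search over (p, d, q) automatically, keeping the trend and seasonal
#dummies as external regressors, as in the manual loop above
fit <- auto.arima(lny, xreg = TrSeas[, -1])
summary(fit)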

R: Plotting "Actual vs. Fitted"

I have a question about plotting the actual data of a time series together with the values from a fitted model. In particular, my questions relate to this paper:
https://static.googleusercontent.com/media/www.google.com/en//googleblogs/pdfs/google_predicting_the_present.pdf
In the appendix of the document, you can find an R script. Here, I have two initial questions: (1) what does
##### Define Predictors - Time Lags;
dat$s1 = c(NA, dat$sales[1:(nrow(dat)-1)]);
dat$s12 = c(rep(NA, 12), dat$sales[1:(nrow(dat)-12)]);
do, and (2) what is the purpose of:
##### Divide data by two parts - model fitting & prediction
dat1 = mdat[1:(nrow(mdat)-1), ]
dat2 = mdat[nrow(mdat), ]
Final and main question: Let's say I get a calculation for my data with
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1);
summary(fit)
The adjusted R-squared value is 0.342. Thus, I'd argue that the model above explains roughly 34% of the variance in the actual data. Now, how can I plot this "model graph" (the fitted values) so that I get something like the figure in the paper?
I assume the second graph's "fitted" line is the data from the estimated model, right? If so, then this part seems to be missing from the script.
Thanks a lot!
EDIT 1:
Tried this:
# Actual values and fitted values
plot(sales ~ month, data= dat1, col="blue", lwd=1, type="l", xaxt = "n", xaxs="r",yaxs="r", xlab="", ylab="Total Sales");
par(new=TRUE)
plot(fitted(fit) ~ month, data= dat1, col="red", lwd=1, type="l", xaxs="r", yaxs="r", yaxt = "n", xlab="Month", ylab="Index", xaxt="n");
axis(4)
Output: Error in (function (formula, data = NULL, subset = NULL, na.action = na.fail, : variable lengths differ (found for 'month')
dat$s1 = c(NA, dat$sales[1:(nrow(dat)-1)])
This creates a new column s1 containing sales shifted down by one position: the first element is NA, and the last sales value is dropped so the lengths match. In other words, s1 is a lag-1 predictor.
dat$s12 = c(rep(NA, 12), dat$sales[1:(nrow(dat)-12)])
This creates an s12 column with 12 NAs followed by the first nrow(dat)-12 values of dat$sales, i.e. a lag-12 predictor.
dat1 = mdat[1:(nrow(mdat)-1), ]
dat2 = mdat[nrow(mdat), ]
dat1 is all but the last observation (row); dat2 is only the last row. When predicting the response (sales), you only need to feed a data.frame with at least the columns that appear on the right-hand side of the formula (also called explanatory variables), in this case s1, s12, and trends1, as the newdata argument to the predict() function. This is where dat2 is used.
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
This next line fits a model using dat1.
fit = lm(log(sales) ~ log(s1) + log(s12) + trends1, data=dat1)
fitted(fit) will give you fitted values. Try predict(fit) and compare if it's any different.
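For an lm fit, predict() without a newdata argument should simply reproduce the fitted values, so a quick check is:
all.equal(unname(fitted(fit)), unname(predict(fit)))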
The semicolons at the end of each statement are redundant in R.
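To answer the main plotting question, here is one hedged sketch, assuming dat1, fit, and a numeric (or Date) month column as in the paper's data. lm() silently drops rows with NA in s1 or s12, so the fitted values must first be aligned with the rows that were actually used, and exp() undoes the log transform of the response:
used <- rownames(model.frame(fit)) #rows of dat1 that lm() actually kept
plot(dat1$month, dat1$sales, type = "l", col = "blue",
     xlab = "Month", ylab = "Total Sales")
lines(dat1[used, "month"], exp(fitted(fit)), col = "red") #fitted values, back on the sales scale
legend("topleft", legend = c("actual", "fitted"), col = c("blue", "red"), lty = 1)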

How to create a ROC in R using predicted value from SAS?

I have a dataset from SAS: scored data with two columns, y and yhat. y is binary (0/1), yhat is the predicted score, and the model is a logistic regression. I want to create a ROC curve in R for this SAS model and compare it with other models in R. I have no clue how to accomplish this. Any suggestions? Thanks.
You can use the ROCR package like this:
## computing a simple ROC curve (x-axis: fpr, y-axis: tpr)
library(ROCR)
pred <- prediction(SASdataset$yhat, SASdataset$y) #yhat = predicted scores from SAS, y = observed 0/1 labels
perf <- performance(pred, "tpr", "fpr")
plot(perf)
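The same objects also give you the area under the curve, via ROCR's "auc" measure:
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]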
This is quite simple if you know how ROC curves work. You want to classify people into your dichotomous outcomes (0 or 1 in what follows) using the predicted values from your model.
So if you were to select a cut-off of 0.5 for your predicted values, anyone above this threshold would be considered positive/1/diseased, and anyone below a 0/unaffected.
That's great, but can it be improved? The idea is to step through a range of cutoff points and see which one classifies people into the dichotomous outcomes most accurately, comparing the predicted values from the model to the actual classifications that we know.
# some simulated data: 'pred' is the true 0/1 outcome, 'predict' the model score
dat <- data.frame(pred = rep(0:1, each = 50),
                  predict = c(runif(50), runif(50, .5, 1.5)))
# the cutoff values to step through
cutoffs <- seq(min(dat$predict), .95, 0.05)
# a matrix of the cutoffs, sensitivity, and specificity
p1 <- matrix(0, nrow = length(cutoffs), ncol = 3)
i <- 1
# for each cutoff value, create a 2x2 table and calculate sens/spec
for (p in cutoffs) {
  t1 <- table(dat$predict > p, dat$pred)
  p1[i, ] <- c(p, t1[2, 2] / sum(t1[, 2]), t1[1, 1] / sum(t1[, 1]))
  i <- i + 1
}
# and plot
plot(1 - p1[, 3], p1[, 2], type = 'l',
     xlab = '1 - spec', ylab = 'sens',
     main = 'ROC', cex.main = .8)
There are packages out there (ROCR is one I have used), but this takes only a couple of minutes to program, is very simple to understand, and uses only base R.
