Writing a function to produce a Kaplan-Meier curve - r

I am trying to write a function that produces a KM survival curve. I am going to use this in a Shiny app, which is why I want a function: it lets me easily pass in arguments from a dropdown menu (which will arrive as a string in the strata argument). Here is a simplified version of what I need:
survival_function <- function(data_x, strata_x="1"){
  survFormula <- Surv(data_x$time, data_x$status)
  my_survfit <- survfit(data=data_x, as.formula(paste("survFormula~", {{strata_x}})))
  ggsurvplot(my_survfit, data = data_x, pval=T)
}
survival_function(inputdata, "strata_var")
I get an error:
Error in paste("survFormula1~", { : object 'strata_x' not found
I'm at a loss because
as.formula(paste("~", {{arg}}))
has worked in other functions I've written to produce plots using ggplot to easily change variables to facet by, but this doesn't even seem to recognize strata_x as an argument.

Your function needs a couple of tweaks to get it working with ggsurvplot. It would be best to create the Surv object as a new column in the data frame and use this column in your formula. You also need to make sure you have an actual symbolic formula as the $call$formula member of the survfit object, otherwise ggsurvplot will fail to work due to non-standard evaluation deep within its internals.
library(survival)
library(survminer)
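survival_function <- function(data_x, strata_x) {
  # Store the Surv object as a column so the formula can refer to it by name
  data_x$s <- Surv(data_x$time, data_x$status)
  # Build the formula from the string passed in, e.g. "sex" gives s ~ sex
  survFormula <- as.formula(paste("s ~", strata_x))
  my_survfit <- survfit(survFormula, data = data_x)
  # Put the actual formula object into the stored call, which ggsurvplot inspects
  my_survfit$call$formula <- survFormula
  ggsurvplot(my_survfit, data = data_x)
}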
We can test this on the included lung data set:
survival_function(lung, "sex")
Created on 2022-08-03 by the reprex package (v2.0.1)
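For what it's worth, the same idea can be sketched without the extra column by putting the Surv() call directly into the pasted formula string. This is only a sketch under assumptions: km_fit is a hypothetical helper name, the data are assumed to have columns called time and status (as lung does), and it stops at the survfit object rather than plotting.

```r
library(survival)

# Hypothetical helper: fit a Kaplan-Meier model stratified by the column
# whose name is given as a string ("1" means no stratification).
km_fit <- function(data_x, strata_x = "1") {
  # Assumes data_x has columns named `time` and `status`
  f <- as.formula(paste("Surv(time, status) ~", strata_x))
  fit <- survfit(f, data = data_x)
  # Store the real formula object in the call, as in the answer above
  fit$call$formula <- f
  fit
}

fit_sex <- km_fit(lung, "sex")  # one curve per level of sex
fit_all <- km_fit(lung)         # a single overall curve
```

The returned object could then be handed to ggsurvplot(fit_sex, data = lung) in the same way as before.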

Related

Plotting using xyplot()

I am having trouble understanding the xyplot() function to create plots in R. Below, I have an example of R code that does create a nice plot
install.packages("mice")
library("mice")
data <- airquality[, c("Ozone", "Solar.R")]
# Applies regression imputation to Ozone w.r.t Solar.R
imp <- mice(data, method = "norm.predict", seed = 1,
            m = 1, print = FALSE)
xyplot(imp, Ozone ~ Solar.R)
The above code creates this image as desired:
[Image: code1 output]
But the code below does not create a nice plot and instead gives me the error message "Error in UseMethod("xyplot") :
no applicable method for 'xyplot' applied to an object of class "data.frame""
airquality2 <- tidyr::fill(airquality, Ozone)
xyplot(airquality2, Ozone ~ Day)
Why does this occur? I am confused, as applying the typeof() function to both "imp" and "airquality2" returns "list", so I believe I have the object types correct. Thank you!
Remember that in R, generic functions call a specific method depending on the "class" attribute of the object passed as the first argument. This is known as S3 dispatch. The "class" of an object is not the same thing as its storage mode or internal type, which is what typeof returns. The fact that typeof(imp) == typeof(airquality2) is therefore irrelevant.
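The difference between type and class can be seen in a small base R sketch. The generic describe() and its methods are made-up names for illustration only; they are not part of lattice or mice.

```r
df  <- data.frame(a = 1:3)
lst <- list(a = 1:3)

# Both objects are stored as lists...
typeof(df)
typeof(lst)

# ...but only one carries the "data.frame" class attribute
class(df)
class(lst)

# A hypothetical S3 generic: dispatch looks at class(), not typeof()
describe <- function(x, ...) UseMethod("describe")
describe.data.frame <- function(x, ...) "data frame method"
describe.default    <- function(x, ...) "default method"

describe(df)
describe(lst)
```

If describe.default were not defined, describe(lst) would fail with the same kind of "no applicable method" error as in the question.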
xyplot is a generic function, borrowed from the lattice package. The lattice package itself only defines methods for the classes formula and ts. It has no method for data frames.
The reason why xyplot works with imp passed as the first argument is that imp is an object of class "mids", and the method xyplot.mids is defined as an (unexported) function in mice, so there is an available method for it.
The upshot is that, since a method is available for objects of class “formula”, you can easily plot airquality2 by passing the formula as the first argument:
xyplot(Ozone ~ Day, airquality2)
Or explicitly naming the data argument:
xyplot(data = airquality2, Ozone ~ Day)
Both of which produce the same scatter plot of Ozone against Day.

What does "invalid type (closure) for variable 'variable1'" mean and how do I fix it?

I am trying to write a function in R, which contains a function from another package. The code works perfectly outside a function.
I am guessing it might have something to do with the package I am using (survey).
A self-contained code example:
#activating the package
library(survey)
#getting the dataset into R
tm <- read.spss("tm.sav", to.data.frame = T, max.value.labels = 5)
# creating svydesign object (it basically contains the weights to adjust the variables (~persgew: also a column variable contained in the tm-dataset))
tm_w <- svydesign(ids=~0, weights = ~persgew, data = tm)
#getting overview of the welle-variable
#this variable is part of the tm-dataset. it is needed to execute the following steps
table(tm$welle)
# data manipulation as in: taking the v12d_gr-variable as well as the welle-variable and the svydesign-object to create a longitudinal variable which is transformed into a data frame that can be passed to ggplot
t <- svytable(~v12d_gr+welle, tm_w)
tt <- round(prop.table(t,2)*100, digits=0)
v12d <- tt[2,]
v12d <- as.data.frame(v12d)
This is the code outside the function, working perfectly. Since I have to transform quite a few variables in exactly the same way, I aim to create a function to save some time.
The following function is supposed to take a variable that will be transformed as an argument (v12sd2_gr).
#making sure the survey-object is loaded
tm_w <- svydesign(ids=~0, weights = ~persgew, data = tm)
#trying to write a function containing the code from above
ltd_zsw <- function(variable1){
  t <- svytable(~variable1+welle, tm_w)
  tt <- round(prop.table(t,2)*100, digits=0)
  var_ltd_zsw <- tt[2,]
  var_ltd_zsw <- as.data.frame(var_ltd_zsw)
  return(var_ltd_zsw)
}
Calling the function:
#as v12d has been altered already, I am trying to transform another variable v12sd2_gr
v12sd2 <- ltd_zsw(v12sd2_gr)
Console output:
Error in model.frame.default(formula = weights ~ variable1 + welle, data = model.frame(design)) :
invalid type (closure) for variable 'variable1'
Called from: model.frame.default(formula = weights ~ variable1 + welle, data = model.frame(design))
How do I fix it? And what does it mean to dynamically build a formula and reformulating?
PS: I hope it is the appropriate way to answer to the feedback in the comments.
Update: I think I was able to trace the problem back to the argument I am passing (variable1), and I am guessing it has something to do with the fact that I try to use a formula within the function. But when I try to call svytable with as.formula(svytable(~variable1+welle, tm_w)) it still doesn't work.
What to do?
I have found a solution to the problem.
Here is the tested and working function:
ltd_test <- function (var, x, string1="con", string2="pro") {
  print (table (var))
  x$w12d_gr <- ifelse(as.numeric(var)>2,1,0)
  x$w12d_gr <- factor(x$w12d_gr, levels = c(0,1), labels = c(string1,string2))
  print (table (x$w12d_gr))
  x_w <- svydesign(ids=~0, weights = ~persgew, data = x)
  t <- svytable(~w12d_gr+welle, x_w)
  tt <- round(prop.table(t,2)*100, digits=0)
  w12d <- tt[2,]
  w12d <- as.data.frame(w12d)
}
The problem appeared to be caused by the svydesign() function. Its output is an object which is then used by the formula in svytable(). That's why it is imperative to first create the x_w object with svydesign() and then use svytable() to create the t object.
Within the code snippet I posted originally in the question the tm_w-object has been created and stored globally.
Thanks for the help to everyone. I hope this is gonna be of use to someone one day!
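On the part of the question about building a formula dynamically: the usual idea is to pass the variable name as a string and let reformulate() (or as.formula(paste(...))) assemble the formula. The sketch below uses base R's xtabs() on mtcars as a stand-in, because the tm data and the survey design are not available here; pct_table is a hypothetical helper name, and the assumption is that the resulting formula could be passed to svytable() in the same way.

```r
# Hypothetical helper: column percentages of `var_name` within `group_name`
pct_table <- function(data, var_name, group_name) {
  # reformulate() turns strings into a one-sided formula: ~ var_name + group_name
  f <- reformulate(c(var_name, group_name))
  tab <- xtabs(f, data = data)
  round(prop.table(tab, 2) * 100, digits = 0)
}

f_example <- reformulate(c("am", "gear"))  # ~am + gear
res <- pct_table(mtcars, "am", "gear")
res["1", ]                                 # second row, one value per gear
```

Because the column is looked up by name inside the data, nothing depends on an object called variable1 existing anywhere.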

Converting a R2jags object into a Stanreg (rstanarm) object

I made a model using R2jags. I like the jags syntax but I find the output produced by R2jags not easy to use. I recently read about the rstanarm package. It has many useful functions and is well supported by the tidybayes and bayesplot packages for easy model diagnostics and visualisation. However, I'm not a fan of the syntax used to write a model in rstanarm. Ideally, I would like to get the best of the two worlds, that is writing the model in R2jags and convert the output into a Stanreg object to use rstanarm functions.
Is that possible? If so, how?
I think the question isn't necessarily whether or not it's possible - I suspect it probably is. The question really is how much time you're prepared to spend doing it. All you'd have to do is try to replicate in structure the object that gets created by rstanarm, to the extent that it's possible with the R2jags output. That would make it so that some post-processing tasks would probably work.
If I might be so bold, I suspect a better use of your time would be to turn the R2jags object into something that could be used with the post-processing functions you want to use. For example, it only takes a small modification to the JAGS output to make all of the mcmc_*() plotting functions from bayesplot work. Here's an example. Below is the example model from the jags() function help.
# An example model file is given in:
model.file <- system.file(package="R2jags", "model", "schools.txt")
# data
J <- 8.0
y <- c(28.4,7.9,-2.8,6.8,-0.6,0.6,18.0,12.2)
sd <- c(14.9,10.2,16.3,11.0,9.4,11.4,10.4,17.6)
jags.data <- list("y","sd","J")
jags.params <- c("mu","sigma","theta")
jags.inits <- function(){
  list("mu"=rnorm(1),"sigma"=runif(1),"theta"=rnorm(J))
}
jagsfit <- jags(data=jags.data, inits=jags.inits, jags.params,
                n.iter=5000, model.file=model.file, n.chains = 2)
Now, what the mcmc_*() plotting functions from bayesplot expect is a list of matrices of MCMC draws, one per chain, where the column names give the names of the parameters. By default, jags() puts all of them into a single matrix. In the above case, there are 5000 iterations in total, with 2500 as burn-in (leaving 2500 sampled), and n.thin is set to 2 (jags() has an algorithm for choosing the thinning parameter). In any case, the jagsfit$BUGSoutput$n.keep element identifies how many iterations are kept per chain; in this case, it's 1250. So you could use that to make a list of two matrices from the output.
jflist <- list(jagsfit$BUGSoutput$sims.matrix[1:jagsfit$BUGSoutput$n.keep, ],
               jagsfit$BUGSoutput$sims.matrix[(jagsfit$BUGSoutput$n.keep+1):(2*jagsfit$BUGSoutput$n.keep), ])
Now, you'd just have to call some of the plotting functions:
mcmc_trace(jflist, regex_pars="theta")
or
mcmc_areas(jflist, regex_pars="theta")
So, instead of trying to replicate all of the output that rstanarm produces, it might be a better use of your time to try to bend the jags output into a format that would be amenable to the post-processing functions you want to use.
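If the draws are available as a three-dimensional array laid out as iterations x chains x parameters, the per-chain list can also be built generically rather than by slicing row ranges by hand. The sketch below uses simulated numbers as a stand-in for real sampler output; split_chains is a hypothetical helper, and whether a given element of the R2jags output (for example BUGSoutput$sims.array) has exactly this layout is an assumption to check against its documentation.

```r
set.seed(1)
n_keep   <- 5
n_chains <- 2
pars     <- c("mu", "sigma")

# Simulated stand-in for an [iterations, chains, parameters] array of draws
sims <- array(rnorm(n_keep * n_chains * length(pars)),
              dim = c(n_keep, n_chains, length(pars)),
              dimnames = list(NULL, NULL, pars))

# Hypothetical helper: one matrix per chain, columns named by parameter
split_chains <- function(draws) {
  lapply(seq_len(dim(draws)[2]), function(k) {
    matrix(draws[, k, ], nrow = dim(draws)[1],
           dimnames = list(NULL, dimnames(draws)[[3]]))
  })
}

chain_list <- split_chains(sims)
str(chain_list)
```

A list shaped like chain_list is the same kind of object as jflist above, so it could be passed to the mcmc_*() functions in the same way.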
EDIT - added possibility for pp_check() from bayesplot.
The posterior draws of y in this case are in the theta parameters. So, we make an object that has elements y and yrep and make it of class foo
x <- list(y = y, yrep = jagsfit$BUGSoutput$sims.list$theta)
class(x) <- "foo"
We can then write a pp_check method for objects of class foo. This comes straight out of the help file for bayesplot::pp_check().
pp_check.foo <- function(object, ..., type = c("multiple", "overlaid")) {
  y <- object[["y"]]
  yrep <- object[["yrep"]]
  switch(match.arg(type),
         multiple = ppc_hist(y, yrep[1:min(8, nrow(yrep)),, drop = FALSE]),
         overlaid = ppc_dens_overlay(y, yrep[1:min(8, nrow(yrep)),, drop = FALSE]))
}
Then, just call the function:
pp_check(x, type="overlaid")

Writing a function in R to plot ROC curve using pROC

I'm trying to write a function to plot ROC curves based on different scoring systems I have to predict an outcome.
I have a dataframe data_all, with columns "score_1" and "Threshold.2000". I generate a ROC curve as desired with the following:
plot.roc(data_all$Threshold.2000, data_all$score_1)
My goal is to generate a ROC curve for a number of different outcomes (e.g. Threshold.1000) and scores (score_1, score_2 etc), but am initially trying to set it up just for different scores. My function is as follows:
roc_plot <- function(dataframe_of_interest, score_of_interest) {
  plot.roc(dataframe_of_interest$Threshold.2000, dataframe_of_interest$score_of_interest)
}
I get the following error: Error in roc.default(x, predictor, plot =
TRUE, ...) : No valid data provided.
I'd be very grateful if someone can spot why my function doesn't work! I'm a python coder and new-ish to R, and haven't had much luck trying a number of different things. Thanks very much.
EDIT:
Here is the same example with mtcars so it's reproducible:
data(mtcars)
plot.roc(mtcars$vs, mtcars$mpg) # --> makes correct graph
roc_plot <- function(dataframe_of_interest, score_of_interest) {
  plot.roc(dataframe_of_interest$mpg, dataframe_of_interest$score_of_interest)
}
Outcome:
Error in roc.default(x, predictor, plot = TRUE, ...) : No valid data provided.
roc_plot(mtcars, vs)
Here's one solution that works as desired (i.e. lets the user specify different values for score_of_interest):
library(pROC)
data(mtcars)
plot.roc(mtcars$vs, mtcars$mpg) # --> makes correct graph
# expects `score_of_interest` to be a string!!!
roc_plot <- function(dataframe_of_interest, score_of_interest) {
  plot.roc(dataframe_of_interest$vs, dataframe_of_interest[, score_of_interest])
}
roc_plot(mtcars, 'mpg')
roc_plot(mtcars, 'cyl')
Note that your error was not resulting from an incorrect column name, it was resulting from an incorrect use of the data.frame class. Notice what happens with a simpler function:
foo <- function(x, col_name) {
  head(x$col_name)
}
foo(mtcars, mpg)
## NULL
This returns NULL. So in your original function when you tried to supply plot.roc with dataframe_of_interest$score_of_interest you were actually feeding plot.roc a NULL.
There are several ways to extract a column from a data.frame by the column name when that name is stored in an object (which is what you're doing when you pass it as an argument in a function). Perhaps the easiest way is to remember that a data.frame is like a 2D array-type object and so we can use familiar object[i, j] syntax, but we ask for all rows and we specify the column by name, e.g., mtcars[, 'mpg']. This still works if we assign the string 'mpg' to an object:
x <- 'mpg'
mtcars[, x]
So that's how I produced my solution. Going a step further, it's not hard to imagine how you would be able to supply both a score_of_interest and a threshold_of_interest:
roc_plot2 <- function(dataframe_of_interest, threshold_of_interest, score_of_interest) {
  plot.roc(dataframe_of_interest[, threshold_of_interest],
           dataframe_of_interest[, score_of_interest])
}
roc_plot2(mtcars, 'vs', 'mpg')
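As a side-by-side comparison of the extraction styles discussed above, here is a small base R sketch. The variable names are just for illustration; [[ ]] is a further option alongside [, j] when the column name is held in a string.

```r
col_name <- "mpg"

# `$` takes the text after it literally, so this looks for a column
# actually called "col_name", which mtcars does not have
by_dollar <- mtcars$col_name

# `[[ ]]` and `[ , ]` evaluate their argument, so the string "mpg" is used
by_double_bracket <- mtcars[[col_name]]
by_matrix_style   <- mtcars[, col_name]

is.null(by_dollar)
identical(by_double_bracket, by_matrix_style)
```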

Dynamic time-series prediction and rollapply

I am trying to get a rolling prediction of a dynamic timeseries in R (and then work out squared errors of the forecast). I based a lot of this code on this StackOverflow question, but I am very new to R so I am struggling quite a bit. Any help would be much appreciated.
require(zoo)
require(dynlm)
set.seed(12345)
#create variables
x<-rnorm(mean=3,sd=2,100)
y<-rep(NA,100)
y[1]<-x[1]
for(i in 2:100) y[i]=1+x[i-1]+0.5*y[i-1]+rnorm(1,0,0.5)
int<-1:100
dummydata<-data.frame(int=int,x=x,y=y)
zoodata<-as.zoo(dummydata)
prediction<-function(series)
{
  mod<-dynlm(formula = y ~ L(y) + L(x), data = series) #get model
  nextOb<-nrow(series)+1
  #make forecast
  predicted<-coef(mod)[1]+coef(mod)[2]*zoodata$y[nextOb-1]+coef(mod)[3]*zoodata$x[nextOb-1]
  #strip timeseries information
  attributes(predicted)<-NULL
  return(predicted)
}
rolling<-rollapply(zoodata,width=40,FUN=prediction,by.column=FALSE)
This returns:
20 21 ..... 80
10.18676 10.18676 10.18676
Which has two problems I was not expecting:
Runs from 20->80, not 40->100 as I would expect (as the width is 40)
The forecasts it gives out are constant: 10.18676
What am I doing wrong? And is there an easier way to do the prediction than to write it all out? Thanks!
The main problem with your function is the data argument to dynlm. If you look in ?dynlm you will see that the data argument must be a data.frame or a zoo object. Unfortunately, I just learned that rollapply splits your zoo objects into array objects. This means that dynlm, after noting that your data argument was not of the right form, searched for x and y in your global environment, which of course were defined at the top of your code. Every window was therefore fitted to the same full-length x and y, which is why the forecasts were constant. The solution is to convert series into a zoo object. There were a couple of other issues with your code, so I post a corrected version here:
prediction<-function(series) {
  mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
  # nextOb <- nrow(series)+1 # This will always be 21. I think you mean:
  nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
  if (nextOb<=nrow(zoodata)) { # You won't predict the last one
    # make forecast
    # predicted<-coef(mod)[1]+coef(mod)[2]*zoodata$y[nextOb-1]+coef(mod)[3]*zoodata$x[nextOb-1]
    # That would work, but there is a very nice function called predict
    predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
    # I'm not sure why you used nextOb-1
    attributes(predicted)<-NULL
    # I added the square error as well as the prediction.
    c(predicted=predicted,square.res=(predicted-zoodata[nextOb,'y'])^2)
  }
}
rollapply(zoodata,width=20,FUN=prediction,by.column=F,align='right')
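To see why the original version silently picked up the wrong data, the same variable lookup can be reproduced with plain lm(), which needs no extra packages. This is only an illustration of the lookup rule with made-up numbers, not of dynlm itself: when the data argument does not contain the variables named in the formula, they are looked up in the environment where the formula was created.

```r
# Variables defined at the top level, like x and y in the question
x <- 1:10
y <- 2 * x + 1

# A "window" whose own x and y follow a different relationship (slope -1)
window_df <- data.frame(x = 1:5, y = c(5, 4, 3, 2, 1))

# Columns found in `data`: the fit uses the window
fit_window <- lm(y ~ x, data = window_df)

# `data` has no x or y columns: lm quietly falls back to the outer x and y
fit_fallback <- lm(y ~ x, data = data.frame(z = 1:10))

coef(fit_window)[["x"]]    # slope from the window
coef(fit_fallback)[["x"]]  # slope from the outer variables
```

No error or warning is raised in the second case, which is what makes this kind of bug easy to miss.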
Your second question, about the numbering of your results, can be controlled by the align argument in rollapply. With width 40 and 100 observations, "left" would give you 1..61, "center" (the default) would give you 20..80 and "right" gets you 40..100.
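The index ranges can be checked with a little arithmetic in base R, without running rollapply at all. The offset used for the centered case below is an assumption chosen to match the 20..80 reported in the question for an even width; the left and right cases follow directly from the window definition.

```r
n     <- 100  # number of observations
width <- 40   # window width used in the question

n_windows <- n - width + 1   # number of complete windows
starts    <- seq_len(n_windows)

# Index each result is labelled with under the three alignments
left_idx   <- starts                            # first observation of the window
right_idx  <- starts + width - 1                # last observation of the window
center_idx <- starts + floor((width - 1) / 2)   # assumed midpoint rule

range(left_idx)
range(center_idx)
range(right_idx)
```

Whatever the alignment, there are n - width + 1 results; only the labels attached to them change.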
