Can mice() handle crr()? Fine-Gray model in R

My question is whether it is possible to pool multiply imputed data sets from mice() across Fine-Gray model fits from crr(), and whether doing so is statistically correct.
Example:
library(survival)
library(mice)
library(cmprsk)
test1 <- as.data.frame(list(
  time   = c(4,3,1,1,2,2,3,5,2,4,5,1, 4,3,1,1,2,2,3,5,2,4,5,1),
  status = c(1,1,1,0,2,2,0,0,1,1,2,0, 1,1,1,0,2,2,0,0,1,1,2,0),
  x      = c(0,2,1,1,NA,NA,0,1,1,2,0,1, 0,2,1,1,NA,NA,0,1,1,2,0,1),
  sex    = c(0,0,0,NA,1,1,1,1,NA,1,0,0, 0,0,0,NA,1,1,1,1,NA,1,0,0)))
dat <- mice(test1,m=10, seed=1982)
#Cox regression: cause 1
models.cox1 <- with(dat,coxph(Surv(time, status==1) ~ x +sex ))
summary(pool(models.cox1))
#Cox regression: cause 1 or 2
models.cox <- with(dat,coxph(Surv(time, status==1 | status==2) ~ x +sex ))
models.cox
summary(pool(models.cox))
#### crr()
#Fine-Gray model
models.FG <- with(dat, crr(ftime=time, fstatus=status,
                           cov1=test1[,c("x","sex")],
                           failcode=1, cencode=0, variance=TRUE))
summary(pool(models.FG))
#Error in pool(models.FG) : Object has no vcov() method.
models.FG

There are a couple of things that need to be done to get this to work.
First, your initial data and imputation:
library(survival)
library(mice)
library(cmprsk)
test1 <- as.data.frame(list(
  time   = c(4,3,1,1,2,2,3,5,2,4,5,1, 4,3,1,1,2,2,3,5,2,4,5,1),
  status = c(1,1,1,0,2,2,0,0,1,1,2,0, 1,1,1,0,2,2,0,0,1,1,2,0),
  x      = c(0,2,1,1,NA,NA,0,1,1,2,0,1, 0,2,1,1,NA,NA,0,1,1,2,0,1),
  sex    = c(0,0,0,NA,1,1,1,1,NA,1,0,0, 0,0,0,NA,1,1,1,1,NA,1,0,0)))
dat <- mice(test1,m=10, print=FALSE)
There is no vcov() method for crr models, which mice requires; however, the covariance matrix is available in the fitted model's $var component.
So we can write our own vcov() method to extract it, and a coef() method is needed as well.
vcov.crr <- function(object, ...) object$var # or getS3method('vcov','coxph')
coef.crr <- function(object, ...) object$coef
There is also an error in how the model is passed to with.mids: your code has cov1=test1[,c("x","sex")], but cov1 should really use the imputed data. I am not sure how to write this correctly as an expression, because cov1 requires a matrix of the relevant variables, but you can easily hard-code a function.
# This function comes from mice:::with.mids
Andreus_with <- function (data, ...) {
  call <- match.call()
  if (!is.mids(data))
    stop("The data must have class mids")
  analyses <- as.list(1:data$m)
  for (i in 1:data$m) {
    # fit crr() on each completed (imputed) data set
    data.i <- complete(data, i)
    analyses[[i]] <- crr(ftime=data.i[,'time'], fstatus=data.i[,'status'],
                         cov1=data.i[,c("x","sex")],
                         failcode=1, cencode=0, variance=TRUE)
  }
  object <- list(call = call, call1 = data$call, nmis = data$nmis,
                 analyses = analyses)
  oldClass(object) <- c("mira", "matrix")
  return(object)
}
EDIT:
The mice internals have changed since this answer; it now uses the broom package to extract elements from the fitted crr model. So tidy and glance methods for crr models are required:
tidy.crr <- function(x, ...) {
  co <- coef(x)
  data.frame(term = names(co),
             estimate = unname(co),
             std.error = sqrt(diag(x$var)),
             stringsAsFactors = FALSE)
}
glance.crr <- function(x, ...){ }
The above code then allows the data to be pooled.
models.FG <- Andreus_with(dat)
summary(pool(models.FG))
Note that this gives warnings about df.residual not being defined, so large samples are assumed. I'm not familiar with crr, so perhaps a more sensible value can be extracted; this would then be added to the tidy method. (mice version '3.6.0')
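Instead of extending the tidy method, a hedged alternative for silencing those warnings is to give crr fits a df.residual() method, which pool() appears to consult when computing the complete-data degrees of freedom. This is only a sketch: it assumes the crr object stores its sample size in $n (check str() of a fitted model first).
# Hedged sketch: residual df = observations minus estimated coefficients.
# Assumes fitted crr objects carry the sample size in $n.
df.residual.crr <- function(object, ...) object$n - length(coef(object))
summary(pool(models.FG))  # should now use a finite df instead of assuming large samples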


Caret train function for multiple data frames as function

There was a similar question to mine 6+ years ago, and it was never solved (R -- Can I apply the train function in caret to a list of data frames?).
This is why I am bringing up this topic again.
I'm writing my own functions for my big R project at the moment, and I'm wondering whether there is a way to wrap the model training function train() of the caret package so it works for different data frames with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
  model <- train(predictor ~ ., data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately, as far as I have tried, an R formula cannot take a variable as input.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT to keep my function sheet clean and would save work for sure.
By writing predictor_iris <- "Species", you are saving a string in predictor_iris. Thus, when you run lda_ex, you presumably get an error concerning the formula object in train(), since you are trying to predict a string using vectors of covariates.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
  model <- train(formula, data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Remember to specify the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)
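If you would rather keep passing the outcome name as a string, as in your original attempt, a sketch of an alternative (my addition, not part of the answer above) is to build the formula inside the function with reformulate():
# Sketch: reformulate(".", response = predictor) builds e.g. Species ~ .
lda_ex2 <- function(data, predictor){
  model <- train(reformulate(".", response = predictor), data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
iris_res2 <- lda_ex2(data = iris, predictor = "Species")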

R Output of fGarch

I am modelling a time series r_t as a GARCH(1,1) process,
r_t = mu + sigma_t * z_t
sigma_t^2 = omega + alpha_1 * (r_{t-1} - mu)^2 + beta_1 * sigma_{t-1}^2
and the z_t are t-distributed.
In R, I do this in the fGarch-package via
model <- garchFit(formula = ~garch(1,1), cond.dist = "std", data=r)
Is this correct?
Now, I would like to understand the output of this to check my formula.
Obviously, model@fit$coefs gives me the coefficients and model@fitted gives me the fitted r_t.
But how do I get the fitted sigma_t and z_t?
I believe that the best way is to define extractor functions when generics are not available and methods when generics already exist.
The first two functions extract the values of interest from the fitted objects.
get_sigma_t <- function(x, ...){
  x@sigma.t
}
get_z_t <- function(x, ...){
  x@fit$series$z
}
Here a logLik method for objects of class "fGARCH" is defined.
logLik.fGARCH <- function(x, ...){
  x@fit$value
}
Now use the functions, including the method. The data comes from the first example in help("garchFit").
N <- 200
r <- as.vector(garchSim(garchSpec(rseed = 1985), n = N)[,1])
model <- garchFit(~ garch(1, 1), data = r, trace = FALSE)
get_sigma_t(model) # output not shown
get_z_t(model) # output not shown
logLik(model)
#LogLikelihood
# -861.9494
Note also that coef and fitted methods exist, so there is no need for model@fitted or model@fit$coefs as written in the question.
fitted(model) # much simpler
coef(model)
# mu omega alpha1 beta1
#3.541769e-05 1.081941e-06 8.885493e-02 8.120038e-01
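As a side note, a hedged alternative for z_t, assuming your fGarch version supports the standardize argument of the residuals method:
# Hedged alternative for z_t: standardized residuals, if your fGarch
# version's residuals() method accepts standardize.
z_t <- residuals(model, standardize = TRUE)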
The fitted model is a nested structure. You can inspect it with
str(model)
From the structure, it is easier to see what to extract with $ (for list components) or @ (for S4 slots):
model@fit$series$z
model@sigma.t

Variable importance for support vector machine and naive Bayes classifiers in R

I’m working on building predictive classifiers in R on a cancer dataset.
I'm using random forest, support vector machine and naive Bayes classifiers. I'm unable to calculate variable importance for the SVM and NB models; I end up receiving the following error:
Error in UseMethod("varImp") :
no applicable method for 'varImp' applied to an object of class "c('svm.formula', 'svm')"
I would greatly appreciate it if anyone could help me.
Given
library(e1071)
model <- svm(Species ~ ., data = iris)
class(model)
# [1] "svm.formula" "svm"
library(caret)
varImp(model)
# Error in UseMethod("varImp") :
# no applicable method for 'varImp' applied to an object of class "c('svm.formula', 'svm')"
methods(varImp)
# [1] varImp.bagEarth varImp.bagFDA varImp.C5.0* varImp.classbagg*
# [5] varImp.cubist* varImp.dsa* varImp.earth* varImp.fda*
# [9] varImp.gafs* varImp.gam* varImp.gbm* varImp.glm*
# [13] varImp.glmnet* varImp.JRip* varImp.lm* varImp.multinom*
# [17] varImp.mvr* varImp.nnet* varImp.pamrtrained* varImp.PART*
# [21] varImp.plsda varImp.randomForest* varImp.RandomForest* varImp.regbagg*
# [25] varImp.rfe* varImp.rpart* varImp.RRF* varImp.safs*
# [29] varImp.sbf* varImp.train*
There is no function varImp.svm in methods(varImp), hence the error. You might want to have a look at this post on Cross Validated, too.
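If you only need some measure of per-predictor importance, one hedged workaround (my own sketch, not from the linked post) is to fit the SVM through caret::train(), whose objects do have a varImp() method; for models without a model-specific importance measure, caret falls back to a filter-based (ROC) importance. This assumes the kernlab package is installed for method = "svmLinear":
# Hedged sketch: varImp() on a train object; for SVMs this is caret's
# model-free, filter-based importance (per-predictor ROC), not SVM weights.
library(caret)
fit <- train(Species ~ ., data = iris, method = "svmLinear")
varImp(fit)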
The variable importance can also be calculated with the Importance() function in the rminer package. This is my sample code:
library(rminer)
M <- fit(y~., data=train, model="svm", kpar=list(sigma=0.10), C=2)
svm.imp <- Importance(M, data=train)
For details, refer to https://cran.r-project.org/web/packages/rminer/rminer.pdf
I have created a loop that removes one predictor at a time and captures various performance measures derived from the confusion matrix in a data frame. It is not meant to be a one-size-fits-all solution (I don't have the time for that), but it should not be difficult to modify.
Make sure that the predicted variable is last in the data frame.
I mainly needed specificity values from the models, and by removing one predictor at a time I can evaluate the importance of each predictor: if removing predictor i gives the model with the smallest specificity, that predictor is the most important one. You need to decide which indicator you will use to attribute importance.
You can also add another for loop inside to switch between kernels (linear, polynomial, radial), but then you might have to account for the other parameters, e.g. gamma. Replace "label_fake" with your target variable and df_final with your data frame.
SVM version:
set.seed(1)
varimp_df <- NULL    # df with results
ptm1 <- proc.time()  # Start the clock!
for(i in 1:(ncol(df_final)-1)) {  # the last var is the dep var, hence the -1
  smp_size <- floor(0.70 * nrow(df_final))  # 70/30 split
  train_ind <- sample(seq_len(nrow(df_final)), size = smp_size)
  training <- df_final[train_ind, -c(i)]  # receives all of the df less 1 var
  testing  <- df_final[-train_ind, -c(i)]
  tune.out.linear <- tune(svm, label_fake ~ .,
                          data = training,
                          kernel = "linear",
                          ranges = list(cost = 10^seq(1, 3, by = 0.5)))  # any range you see fit
  svm.linear <- svm(label_fake ~ .,
                    kernel = "linear",
                    data = training,
                    cost = tune.out.linear[["best.parameters"]][["cost"]])
  train.pred.linear <- predict(svm.linear, testing)
  testing_y <- as.factor(testing$label_fake)
  conf.matrix.svm.linear <- caret::confusionMatrix(train.pred.linear, testing_y)
  varimp_df <- rbind(varimp_df, data.frame(
    var_no = i,
    variable = colnames(df_final)[i],  # note: colnames(df_final[,i]) would return NULL
    cost_param = tune.out.linear[["best.parameters"]][["cost"]],
    accuracy = conf.matrix.svm.linear[["overall"]][["Accuracy"]],
    kappa = conf.matrix.svm.linear[["overall"]][["Kappa"]],
    sensitivity = conf.matrix.svm.linear[["byClass"]][["Sensitivity"]],
    specificity = conf.matrix.svm.linear[["byClass"]][["Specificity"]]))
  runtime1 <- as.data.frame(t(data.matrix(proc.time() - ptm1)))$elapsed  # loop runtime so far
  runtime1  # divide by 60 for minutes, by 3600 for hours
}
Naive Bayes version:
varimp_nb_df <- NULL
ptm1 <- proc.time()  # Start the clock!
ctrl <- trainControl(method = "cv", number = 5)  # ctrl was not defined in the original; a plain resampling control is assumed here
for(i in 1:(ncol(df_final)-1)) {
  smp_size <- floor(0.70 * nrow(df_final))
  train_ind <- sample(seq_len(nrow(df_final)), size = smp_size)
  training <- df_final[train_ind, -c(i)]
  testing  <- df_final[-train_ind, -c(i)]
  x <- training[, names(training) != "label_fake"]
  y <- training$label_fake
  model_nb_var <- train(x, y, 'nb', trControl = ctrl)
  predict_nb_var <- predict(model_nb_var, newdata = testing)
  confusion_matrix_nb_1 <- caret::confusionMatrix(predict_nb_var, testing$label_fake)
  varimp_nb_df <- rbind(varimp_nb_df, data.frame(
    var_no = i,
    variable = colnames(df_final)[i],  # note: colnames(df_final[,i]) would return NULL
    accuracy = confusion_matrix_nb_1[["overall"]][["Accuracy"]],
    kappa = confusion_matrix_nb_1[["overall"]][["Kappa"]],
    sensitivity = confusion_matrix_nb_1[["byClass"]][["Sensitivity"]],
    specificity = confusion_matrix_nb_1[["byClass"]][["Specificity"]]))
  runtime1 <- as.data.frame(t(data.matrix(proc.time() - ptm1)))$elapsed  # loop runtime so far
  runtime1  # divide by 60 for minutes, by 3600 for hours
}
Have fun!

Estimate CV error of R's lda using the package cvTools

I'm trying to use the package cvTools to estimate the classification error of a LDA fit. I've used cvTools successfully in the past with other models, but with LDA it doesn't work as expected.
library(MASS)     # lda
library(cvTools)  # cvFit

data <- rbind(data.frame(cls='A', x=rnorm(10, mean=5)),
              data.frame(cls='B', x=rnorm(10, mean=10)))
cv.fit <- cvFit(
  lda,
  formula = cls ~ x,
  data = data,
  cost = function(true, pred) {
    mean(true != pred)
  },
  K = 10
)
It seems cvTools uses the generic predict function internally. predict.lda doesn't return a simple vector of class labels, but rather a list with components class, posterior and x.
If I set a debug breakpoint in the cost function, I see that true is a vector of class labels but pred is said list converted to a vector.
So my question is how I can still use cvTools with LDA.
My first idea was to provide a new predict method by subclassing lda in the hope that cvTools will then call my predict function which returns a vector of class labels:
lda.fit <- lda(cls ~ x, data)
class(lda.fit) <- c('ldaCV', class(lda.fit))
predict.ldaCV <- function(m, newdata) {
  MASS:::predict.lda(m, newdata)$class
}
cv.fit <- cvFit(
  lda.fit,
  data = data,
  y = data$cls,
  cost = function(true, pred) {
    mean(true != pred)
  },
  K = 10
)
Nothing changes, however: pred is still the same messed-up list.
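The question stops here without a resolution. A plausible explanation, and a hedged sketch of a workaround (verify against your cvTools version): when cvFit() receives a fitted model, it refits it on each fold from the model's original call, so the extra 'ldaCV' class is dropped again. Passing a fitting function that applies the subclass itself, so each refit keeps it, may work around this while reusing the predict.ldaCV method above:
# Hedged sketch: a fitting function that tags its result, so the refits
# done inside cvFit() keep the 'ldaCV' class and dispatch to predict.ldaCV.
ldaCV <- function(formula, data) {
  fit <- lda(formula, data)
  class(fit) <- c('ldaCV', class(fit))
  fit
}
cv.fit <- cvFit(
  ldaCV,            # a fitting function instead of a fitted model
  formula = cls ~ x,
  data = data,
  y = data$cls,
  cost = function(true, pred) mean(true != pred),
  K = 10
)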

Setting Random seeds do not affect classification methods C5.0 and ctree

I want to compare two different classification methods, namely ctree and C5.0 from the packages party and C50 respectively; the comparison is to test their sensitivity to the initial starting points. The test should be carried out 30 times; each time, the number of wrongly classified items is calculated and stored in a vector, and then, using a t-test, I hope to see whether they are really different or not.
library("foreign")  # for read.arff
library("party")    # for ctree
library("C50")      # for C5.0

trainTestSplit <- function(data, trainPercentage){
  newData <- list()
  all <- nrow(data)
  splitPoint <- floor(all * trainPercentage)
  newData$train <- data[1:splitPoint, ]
  newData$test <- data[splitPoint:all, ]
  return(newData)
}

ctreeErrorCount <- function(st, ss){
  set.seed(ss)
  model <- ctree(Class ~ ., data=st$train)
  class <- st$test$Class
  st$test$Class <- NULL
  pre <- predict(model, newdata=st$test, type="response")
  errors <- length(which(class != pre))  # count the misclassified items
  return(errors)
}

C50ErrorCount <- function(st, ss){
  model <- C5.0(Class ~ ., data=st$train, seed=ss)
  class <- st$test$Class
  pre <- predict(model, newdata=st$test, type="class")
  errors <- length(which(class != pre))  # count the misclassified items
  return(errors)
}

compare <- function(n = 30){
  data <- read.arff(file.choose())
  set.seed(100)
  errors <- list(ctree = c(), c50 = c())
  seeds <- floor(abs(rnorm(n) * 10000))
  for(i in 1:n){
    splitData <- trainTestSplit(data, 0.66)
    errors$ctree[i] <- ctreeErrorCount(splitData, seeds[i])
    errors$c50[i] <- C50ErrorCount(splitData, seeds[i])
  }
  cat("\n\n")
  cat("============= ctree Vs C5.0 =================\n")
  cat(paste(errors$ctree, " ", errors$c50, "\n"))
  tt <- t.test(errors$ctree, errors$c50)
  print(tt)
}
The program shown is supposed to do the comparison, but because the number of errors does not change within the vectors, the t.test function produces an error. I tested it with iris from inside R (renaming class to Class) and with the Wisconsin breast cancer data, which can be downloaded here, but any data can be used as long as it has a Class attribute.
The problem is that the results of both methods remain constant and do not change while I am changing the random seed. Theoretically, as described in their documentation, both functions use random seeds: ctree uses set.seed(x) while C5.0 uses an argument called seed. Unfortunately, I cannot observe any effect.
Could you please tell me how to control the initial conditions of these functions?
ctree only depends on a random seed if you configure it to use a random selection of input variables (i.e. mtry > 0 within ctree_control). See http://cran.r-project.org/web/packages/party/party.pdf (p. 11). A sketch of such a configuration is shown below.
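For example, a minimal sketch reusing the st and ss names from the question's functions:
# Hedged sketch: make ctree seed-sensitive by sampling mtry inputs per split.
set.seed(ss)
ctrl <- ctree_control(mtry = 2)  # mtry > 0 turns on random input selection
model <- ctree(Class ~ ., data = st$train, controls = ctrl)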
Regarding C5.0 trees, the seed is used this way:
ctrl <- C5.0Control(sample = 0.5, seed = ss)
model <- C5.0(Class ~ ., data = st$train, control = ctrl)
Notice that the seed is used to select a sample of the data, not within the algorithm itself. See http://cran.r-project.org/web/packages/C50/C50.pdf (p. 5)
