C-index for each treatment arm with variable-treatment interaction - r

I have difficulty calculating the C-index (UnoC with survAUC R package) for each treatment arm to assess the variable-treatment interaction.
I have a database with 4 explanatory variables X1, X2, X3, X4, as follows:
> str(data)
'data.frame': 1000 obs. of 7 variables:
$ X1 : num -0.578 0.351 0.759 -0.858 -1.022 ...
$ X2 : num -0.7897 0.0339 -1.608 -1.1642 -0.0787 ...
$ X3 : num -0.1561 -0.7147 -0.8229 -0.1519 -0.0318 ...
$ X4 : num 1.4161 -0.0688 -0.155 -0.1571 -0.649 ...
$ TRT : num 0 0 0 0 0 0 0 1 0 1 ...
$ time: num 6.52 2.15 3 1.31 1.56 ...
$ stat: num 1 1 1 1 1 1 1 1 1 1 ...
The variable X4 interacts with the treatment variable and I don't have censored data.
I would like to calculate the C-index (UnoC) for each treatment arm. I expect the C-index to be equal to 0.5 in the control arm and much higher in the experimental arm.
But, I get almost the same value for both arms!
Can anyone confirm that: if I have a strong interaction between a variable and the treatment, the C-index in the experimental arm is high and in the control arm = 0.5?
Here is my attempt:
TR <- data[1:500,]
TE <- data[501:1000,]
s <- Surv(TR$time, TR$stat)
sNew <- Surv(TE$time, TE$stat)
train.fit <- coxph(Surv(time, stat) ~ X4, data=TR)
lpnew <- predict(train.fit, newdata=TE)
# The C-index for each treatment arm
UnoC(Surv.rsp = s[TR$TRT == 1], Surv.rsp.new = sNew[TE$TRT == 1], lpnew = lpnew[TE$TRT == 1])
[1] 0.7577109
UnoC(Surv.rsp = s[TR$TRT == 0], Surv.rsp.new = sNew[TE$TRT == 0], lpnew = -lpnew[TE$TRT == 0])
[1] 0.7295202
Thank you for your Help
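One way to check the expectation above is a small simulation. The sketch below does not use the asker's data: it assumes X4 acts on the hazard only in the treated arm (log hazard ratio of 1, an arbitrary choice) and that nothing is censored. It uses a hand-written pairwise concordance (the helper cindex is hypothetical) rather than UnoC, which should measure the same thing when there is no censoring.

```r
set.seed(1)
n   <- 1000
X4  <- rnorm(n)
TRT <- rbinom(n, 1, 0.5)
# Assumption: X4 changes the hazard only when TRT == 1 (pure interaction)
time <- rexp(n, rate = exp(1 * X4 * TRT))

# Pairwise concordance for uncensored data without ties:
# share of pairs in which the higher score has the shorter time
cindex <- function(time, score) {
  higher_score <- outer(score, score, ">")
  shorter_time <- outer(time, time, "<")
  sum(higher_score & shorter_time) / sum(higher_score)
}

c_trt  <- cindex(time[TRT == 1], X4[TRT == 1])
c_ctrl <- cindex(time[TRT == 0], X4[TRT == 0])
round(c(treated = c_trt, control = c_ctrl), 3)
```

In a setup like this the treated-arm value comes out well above 0.5 and the control-arm value close to 0.5. Also note that, without ties, replacing a score by its negative turns a concordance of C into 1 - C, so the sign given to lpnew matters when the two arms are compared.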

Related

Error in MEEM(object, conLin, control$niterEM) in lme function

I'm trying to apply the lme function to my data, but the model gives the following message:
mod.1 = lme(lon ~ sex + month2 + bat + sex*month2, random=~1|id, method="ML", data = AA_patch_GLM, na.action=na.exclude)
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
dput for data, copy from https://pastebin.com/tv3NvChR (too large to include here)
str(AA_patch_GLM)
'data.frame': 2005 obs. of 12 variables:
$ lon : num -25.3 -25.4 -25.4 -25.4 -25.4 ...
$ lat : num -51.9 -51.9 -52 -52 -52 ...
$ id : Factor w/ 12 levels "24641.05","24642.03",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ bat : int -3442 -3364 -3462 -3216 -3216 -2643 -2812 -2307 -2131 -2131 ...
$ year : chr "2005" "2005" "2005" "2005" ...
$ month : chr "12" "12" "12" "12" ...
$ patch_id: Factor w/ 45 levels "111870.17_1",..: 34 34 34 34 34 34 34 34 34 34 ...
$ YMD : Date, format: "2005-12-30" "2005-12-31" "2005-12-31" ...
$ month2 : Ord.factor w/ 7 levels "January"<"February"<..: 7 7 7 7 7 1 1 1 1 1 ...
$ lonsc : num [1:2005, 1] -0.209 -0.213 -0.215 -0.219 -0.222 ...
$ batsc : num [1:2005, 1] 0.131 0.179 0.118 0.271 0.271 ...
What's the problem?
I saw a solution that uses the lme4::lmer function, but is there another option that lets me keep using the lme function?
The problem is that you have collinear combinations of predictors. In particular, here are some diagnostics:
## construct the fixed-effect model matrix for your problem
X <- model.matrix(~ sex + month2 + bat + sex*month2, data = AA_patch_GLM)
lc <- caret::findLinearCombos(X)
colnames(X)[lc$linearCombos[[1]]]
## [1] "sexM:month2^6" "(Intercept)" "sexM" "month2.L"
## [5] "month2.C" "month2^4" "month2^5" "month2^6"
## [9] "sexM:month2.L" "sexM:month2.C" "sexM:month2^4" "sexM:month2^5"
This is in a weird order, but it suggests that the sex × month interaction is causing problems. Indeed:
with(AA_patch_GLM, table(sex, month2))
## sex January February March April May June December
## F 367 276 317 204 43 0 6
## M 131 93 90 120 124 75 159
shows that you're missing data for one sex/month combination (i.e., no females were sampled in June).
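The same diagnosis can be reproduced without caret by comparing the rank of the model matrix with its number of columns. This is a minimal sketch on made-up data (the toy data frame and its "no females in March" gap are assumptions for illustration, not the poster's data):

```r
# Toy data: two sexes, three months, five replicates per cell
toy <- expand.grid(sex = c("F", "M"),
                   month = c("Jan", "Feb", "Mar"),
                   rep = 1:5)
# Assumption: remove one sex/month combination entirely
toy <- subset(toy, !(sex == "F" & month == "Mar"))

X_toy    <- model.matrix(~ sex * month, data = toy)
rank_toy <- qr(X_toy)$rank

# Fewer independent columns than columns means collinear predictors
c(columns = ncol(X_toy), rank = rank_toy)
```

Here the model matrix has six columns but only five occupied cells, so its rank is five and one coefficient cannot be estimated.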
You can:
construct the sex/month interaction yourself (data$SM <- with(data, interaction(sex, month2, drop = TRUE))) and use ~ SM + bat — but then you'll have to sort out main effects and interactions yourself (ugh)
construct the model matrix by hand (as above), drop the redundant column(s), then include all the resulting columns in the model:
d2 <- with(AA_patch_GLM,
data.frame(lon,
as.data.frame(X),
id))
## drop linearly dependent column
## note data.frame() has "sanitized" variable names (:, ^ both converted to .)
d2 <- d2[names(d2) != "sexM.month2.6"]
lme(reformulate(colnames(d2)[2:15], response = "lon"),
random=~1|id, method="ML", data = d2)
Again, the results will be uglier than the simpler version of the model.
use a patched version of nlme (I submitted a patch here but it hasn't been considered)
remotes::install_github("bbolker/nlme")
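To illustrate the first option on a small scale, the sketch below builds the combined factor with interaction(..., drop = TRUE) and checks that the resulting design is full rank. The toy data frame is an assumption for illustration (it is not the poster's data), and only base R is used, so no lme fit is shown.

```r
# Toy data with one empty sex/month cell (assumed example)
toy <- expand.grid(sex = c("F", "M"),
                   month = c("Jan", "Feb", "Mar"),
                   rep = 1:5)
toy <- subset(toy, !(sex == "F" & month == "Mar"))

# Combined factor; drop = TRUE removes the empty F.Mar level
toy$SM <- interaction(toy$sex, toy$month, drop = TRUE)

X_full <- model.matrix(~ sex * month, data = toy)  # rank deficient
X_sm   <- model.matrix(~ SM, data = toy)           # one column per occupied cell

c(full_cols = ncol(X_full), full_rank = qr(X_full)$rank,
  sm_cols = ncol(X_sm), sm_rank = qr(X_sm)$rank)
```

The price, as noted above, is that the coefficients of SM are cell contrasts, so main effects and interactions have to be reconstructed by hand.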

No starting estimate was successful error with coxme upon data subsetting

I have a large dataset that I subsetted and created a new dataset.
I used the following code that works perfectly
require(sjPlot);require(coxme)
tab_model(coxme(Surv(comp2_years, comp2)~FEMALE+(1|TRIAL), data))
But when I used the subsetted data set with the following code,
www<- subset(data, (data$TRIAL != 5 & data$Sex.standerd.BMI.gpM1F2 >=1))
tab_model(coxme(Surv(comp2_years, comp2)~FEMALE+(1|TRIAL), www))
it gave me the following error:
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
No starting estimate was successful
This is my new data structure
str(www)
Classes ‘data.table’ and 'data.frame': 7576 obs. of 79 variables:
$ TRIAL : num 1 1 1 1 1 1 1 1 1 1 ...
$ FEMALE : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ type_comp2 : chr "0" "0" "Revasc" "0" ...
$ comp2 : num 0 0 1 0 0 0 0 0 0 1 ...
$ comp2_years : num 10 10 9.77 10 10 ...
$ Sex.standerd.BMI.gpM1F2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ Trial1_4.MiddleBMI : num 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
I saw this post but I could not solve my current problem.
Any advice will be greatly appreciated.
Add the droplevels() command to your subset.
This happened to me too, and I found that using droplevels() to forget about the levels you did not include in the subset solved it:
library(survival)
library(coxme)
Change ph.ecog from number to categorical to make this point:
lung$ph.ecog <- as.factor(lung$ph.ecog)
(fit <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lung))
Works well for the full data set. Subset out some levels of ph.ecog, and it gives this error:
lunga <- subset(lung, !ph.ecog %in% c(2, 3))
(fita <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lunga))
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
No starting estimate was successful
Using droplevels() to forget about empty levels allows coxme to fit again:
lungb <- droplevels(subset(lung, !ph.ecog %in% c(2, 3)))
(fitb <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lungb))
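Why the empty levels matter can be seen in base R alone. In this minimal sketch (the data frame d is made up for illustration), subset() keeps all factor levels, so the model matrix still gets a column for every level, including columns that are entirely zero:

```r
# Made-up factor with four levels
d <- data.frame(f = factor(c("0", "1", "2", "3", "1", "0")),
                y = 1:6)

sub_keep <- subset(d, !f %in% c("2", "3"))   # rows removed, levels kept
sub_drop <- droplevels(sub_keep)             # unused levels removed

levels(sub_keep$f)   # still "0" "1" "2" "3"
levels(sub_drop$f)   # only "0" "1"

# Columns of the design matrix before and after dropping levels
ncol_keep <- ncol(model.matrix(~ f, data = sub_keep))
ncol_drop <- ncol(model.matrix(~ f, data = sub_drop))
c(before = ncol_keep, after = ncol_drop)
```

The all-zero columns carry no information, which is a plausible reason for the fitting routine failing to find a starting estimate.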

Caret: There were missing values in resampled performance measures

I am running caret's neural network on the Bike Sharing dataset and I get the following error message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
I am not sure what the problem is. Can anyone help please?
The dataset is from:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Here is the code:
library(caret)
library(bestNormalize)
data_hour = read.csv("hour.csv")
# Split dataset
set.seed(3)
split = createDataPartition(data_hour$casual, p=0.80, list=FALSE)
validation = data_hour[-split,]
dataset = data_hour[split,]
dataset = dataset[,c(-1,-2,-4)]
# View structure of data
str(dataset)
# 'data.frame': 13905 obs. of 14 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num 0.22 0.22 0.24 0.24 0.24 0.38 0.36 0.42 0.46 0.44 ...
# $ atemp : num 0.273 0.273 0.288 0.258 0.288 ...
# $ hum : num 0.8 0.8 0.75 0.75 0.75 0.76 0.81 0.77 0.72 0.77 ...
# $ windspeed : num 0 0 0 0.0896 0 ...
# $ casual : int 8 5 3 0 1 12 26 29 35 40 ...
# $ registered: int 32 27 10 1 7 24 30 55 71 70 ...
# $ cnt : int 40 32 13 1 8 36 56 84 106 110 ...
## transform numeric data to Gaussian
dataset_selected = dataset[,c(-13,-14)]
for (i in 8:12) { dataset_selected[,i] = predict(boxcox(dataset_selected[,i] +0.1))}
# View transformed dataset
str(dataset_selected)
#'data.frame': 13905 obs. of 12 variables:
#' $ season : int 1 1 1 1 1 1 1 1 1 1 ...
#' $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
#' $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
#' $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
#' $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
#' $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
#' $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
#' $ temp : num -1.47 -1.47 -1.35 -1.35 -1.35 ...
#' $ atemp : num -1.18 -1.18 -1.09 -1.27 -1.09 ...
#' $ hum : num 0.899 0.899 0.637 0.637 0.637 ...
#' $ windspeed : num -1.8 -1.8 -1.8 -0.787 -1.8 ...
#' $ casual : num -0.361 -0.588 -0.81 -1.867 -1.208 ...
# Train data with Neural Network model from caret
control = trainControl(method = 'repeatedcv', number = 10, repeats =3)
metric = 'RMSE'
set.seed(3)
fit = train(casual ~., data = dataset_selected, method = 'nnet', metric = metric, trControl = control, trace = FALSE)
Thanks for your help!
phiver's comment is spot on; however, I would still like to provide a more verbose answer for this concrete example.
In order to investigate what is going on in more detail one should add the argument savePredictions = "all" to trainControl:
control = trainControl(method = 'repeatedcv',
number = 10,
repeats = 3,
returnResamp = "all",
savePredictions = "all")
metric = 'RMSE'
set.seed(3)
fit = train(casual ~.,
data = dataset_selected,
method = 'nnet',
metric = metric,
trControl = control,
trace = FALSE,
form = "traditional")
Now, when running:
fit$results
#output
size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
2 1 1e-04 0.9479487 0.1850270 0.7657225 0.074211541 0.20380571 0.079640883
3 1 1e-01 0.8801701 0.3516646 0.6937938 0.074484860 0.20787440 0.077960642
4 3 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
5 3 1e-04 0.9272942 0.2482794 0.7434689 0.091409600 0.24363651 0.098854133
6 3 1e-01 0.7943899 0.6193242 0.5944279 0.011560524 0.03299137 0.013002708
7 5 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
8 5 1e-04 0.8811411 0.3621494 0.6941335 0.092169810 0.22980560 0.098987058
9 5 1e-01 0.7896507 0.6431808 0.5870894 0.009947324 0.01063359 0.009121535
We notice that the problem occurs when decay = 0.
Let's filter the observations and predictions for decay = 0:
library(tidyverse)
fit$pred %>%
filter(decay == 0) -> for_r2
var(for_r2$pred)
#output
0
We can observe that all of the predictions when decay == 0 are identical (zero variance). The model exclusively predicts 0:
unique(for_r2$pred)
#output
0
So when the summary function tries to compute R squared, the correlation is undefined and the result is NA:
caret::R2(for_r2$obs, for_r2$pred)
#output
[1] NA
Warning message:
In cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
the standard deviation is zero
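The effect can be reproduced with base R alone. In this minimal sketch the vectors obs and pred are made-up values standing in for one resample, with the predictions held constant as in the decay = 0 rows:

```r
obs  <- c(1.2, 0.4, 2.5, 3.1)      # made-up observed values
pred <- rep(0, length(obs))        # constant predictions

# The correlation with a constant vector is undefined: cor() returns NA
# and warns that the standard deviation is zero
r <- suppressWarnings(cor(obs, pred))
is.na(r)

# RMSE and MAE do not depend on a correlation, so they stay defined
rmse <- sqrt(mean((obs - pred)^2))
mae  <- mean(abs(obs - pred))
c(RMSE = rmse, MAE = mae)
```

This matches the results table above: RMSE and MAE are reported for decay = 0, while Rsquared is NaN.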
Answer by @topepo (main developer of the caret package). See the detailed GitHub thread here.
It looks like it happens when you have one hidden unit and almost no
regularization. What is happening is that the model is predicting a
value very close to a constant (so that the RMSE is a little worse
than the basic st deviation of the outcome):
> ANN_cooling_fit$resample %>% dplyr::filter(is.na(Rsquared))
RMSE Rsquared MAE size decay Resample
1 8.414010 NA 6.704311 1 0e+00 Fold04.Rep01
2 8.421244 NA 6.844363 1 0e+00 Fold01.Rep03
3 7.855925 NA 6.372947 1 1e-04 Fold10.Rep07
4 7.963816 NA 6.428947 1 0e+00 Fold07.Rep09
5 8.492898 NA 6.901842 1 0e+00 Fold09.Rep09
6 7.892527 NA 6.479474 1 0e+00 Fold10.Rep10
> sd(mydata$V7)
[1] 7.962888
So it's nothing to really worry about; just some parameters that do very poorly.
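The comparison with the standard deviation of the outcome can be checked directly: a model that always predicts the mean has an RMSE equal to the standard deviation, up to the n versus n - 1 divisor. A minimal sketch on simulated values (y is made up, not the data from the thread):

```r
set.seed(42)
y <- rnorm(200, mean = 10, sd = 8)      # made-up outcome

const_pred <- rep(mean(y), length(y))   # best possible constant prediction
rmse_const <- sqrt(mean((y - const_pred)^2))

# sd() divides by n - 1, the RMSE by n, hence the small correction factor
c(rmse = rmse_const,
  sd = sd(y),
  sd_rescaled = sd(y) * sqrt((length(y) - 1) / length(y)))
```

Any other constant prediction gives a larger RMSE, which is why a nearly constant model ends up slightly worse than the standard deviation.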
The answer by @missuse already gives a lot of insight into why this error happens.
So I just want to add some straightforward ways to get rid of it.
If in some cross-validation folds the predictions get zero variance, the model didn't converge. In such cases, you can try the neuralnet package which offers two parameters you can tune:
threshold : default value = 0.01. Set it to 0.3 and then try lower values 0.2, 0.1, 0.05.
stepmax : default value = 1e+05. Set it to 1e+08 and then try lower values 1e+07, 1e+06.
In most cases, it is sufficient to change the threshold parameter like this:
model.nn <- caret::train(formula1,
method = "neuralnet",
data = training.set[,],
# apply preProcess within cross-validation folds
preProcess = c("center", "scale"),
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3),
threshold = 0.3
)

Logistic Ridge Regression predict ROC/ AUC and accuracy testing

I am trying to fit a logistic ridge regression and have developed the model as follows. I need help with the code for testing its accuracy and for producing the ROC/AUC curve with a threshold value.
My code is as follows:
Fitting the model
library(glmnet)
library(caret)
data1<-read.csv("D:\\Research\\Final2.csv",header=T,sep=",")
str(data1)
'data.frame': 154 obs. of 12 variables:
$ Earningspershare : num 12 2.69 8.18 -0.91 3.04 ...
$ NetAssetsPerShare: num 167.1 17.2 41.1 14.2 33 ...
$ Dividendpershare : num 3 1.5 1.5 0 1.25 0 0 0 0 0.5 ...
$ PE : num 7.35 8.85 6.66 -5.27 18.49 ...
$ PB : num 0.53 1.38 1.33 0.34 1.7 0.23 0.5 3.1 0.5 0.3 ...
$ ROE : num 0.08 0.16 0.27 -0.06 0.09 -0.06 -0.06 0.15 0.09 0.
$ ROA : num 0.02 0.09 0.14 -0.03 0.05 -0.04 -0.05 0.09 0.03 0
$ Log_MV : num 8.65 10.38 9.81 8.3 10.36 ..
$ Return_yearly : int 0 1 0 0 0 0 0 0 0 0 ...
$ L3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ L6 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Sector : int 2 2 2 2 2 1 2 2 4 1 ...
smp_size <- floor(0.8 * nrow(data1))
set.seed(123)
train_ind <- sample(seq_len(nrow(data1)), size = smp_size)
train <- data1[train_ind, ]
test <- data1[-train_ind, ]
train$Return_yearly <-as.factor(train$Return_yearly)
train$L3 <-as.factor(train$L3)
train$L6 <-as.factor(train$L6)
train$Sector <-as.factor(train$Sector)
train$L3 <-model.matrix( ~ L3 - 1, data=train)
train$L6 <-model.matrix( ~ L6 - 1, data=train)
train$Sector<-model.matrix( ~ Sector - 1, data=train)
x <- model.matrix(Return_yearly ~., train)
y <- train$Return_yearly
ridge.mod <- glmnet(x, y=as.factor(train$Return_yearly), family='binomial', alpha=0, nlambda=100, lambda.min.ratio=0.0001)
set.seed(1)
cv.out <- cv.glmnet(x, y=as.factor(train$Return_yearly), family='binomial', alpha=0, nfolds = 5, type.measure = "auc", nlambda=100, lambda.min.ratio=0.0001)
plot(cv.out)
best.lambda <- cv.out$lambda.min
best.lambda
[1] 5.109392
Testing the model
test$L3 <-as.factor(test$L3)
test$L6 <-as.factor(test$L6)
test$Sector <-as.factor(test$Sector)
test$Return_yearly <-as.factor(test$Return_yearly)
test$L3 <-model.matrix( ~ L3 - 1, data=test)
test$L6 <-model.matrix( ~ L6 - 1, data=test)
test$Sector<-model.matrix( ~ Sector - 1, data=test)
newx <- model.matrix(Return_yearly ~., test)
y.pred <- as.matrix(ridge.mod,newx=newx, type="class",data=test)
Comparing predictions with the actual values for accuracy testing; a warning pops up and I am unable to continue:
compare <- cbind (actual=test$Return_yearly, y.pred)
Warning message:
In cbind(actual = test$Return_yearly, y.pred) :
number of rows of result is not a multiple of vector length (arg 1)
Without a reproducible dataset here's a guess:
The train and test matrices have different columns as the result of converting L3 and L6 to factors. By default, as.factor() creates as many levels in a factor as there are unique values, so if by chance the train/test split has different unique values of L3 or L6, the number of dummy variables created by model.matrix() will be different as well.
Possible solution: call as.factor() before the train/test split, or use factor() with the complete set of levels (as.factor() has no levels argument), like
train$L3 <- factor(train$L3, levels = unique(data1$L3))
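The mismatch is easy to reproduce on a made-up column. In this minimal sketch the vector all_vals and the split are assumptions for illustration; the test part happens to lack one of the values:

```r
all_vals  <- c(0, 1, 2, 1, 0, 0, 0, 1)   # made-up coded variable
train_raw <- all_vals[1:5]               # contains 0, 1 and 2
test_raw  <- all_vals[6:8]               # contains only 0 and 1

# Levels taken from each part separately: column counts differ
mm_train_naive <- model.matrix(~ f - 1, data.frame(f = as.factor(train_raw)))
mm_test_naive  <- model.matrix(~ f - 1, data.frame(f = as.factor(test_raw)))

# Levels taken from the complete data: same columns in both parts
lv <- sort(unique(all_vals))
mm_train_fixed <- model.matrix(~ f - 1, data.frame(f = factor(train_raw, levels = lv)))
mm_test_fixed  <- model.matrix(~ f - 1, data.frame(f = factor(test_raw, levels = lv)))

c(naive_train = ncol(mm_train_naive), naive_test = ncol(mm_test_naive),
  fixed_train = ncol(mm_train_fixed), fixed_test = ncol(mm_test_fixed))
```

With fixed levels the test matrix simply gets an all-zero column for the value it does not contain, so it lines up with the matrix the model was trained on.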
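Use the following code (ROCR package) to plot the ROC curve, i.e. the true positive rate (sensitivity) against the false positive rate.
library(ROCR)
# ROC_Pre is assumed to hold the predicted scores, data$LSD the observed labels
ROC_Pre <- prediction(ROC_Pre, data$LSD)
ROC <- performance(ROC_Pre, "tpr", "fpr")
plot(ROC)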
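Accuracy at a chosen threshold and the AUC can also be computed with base R. This is a minimal sketch on made-up numbers (prob, actual and the 0.5 threshold are assumptions for illustration); the AUC uses the rank-sum (Mann-Whitney) identity rather than any package function:

```r
prob   <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)   # made-up predicted probabilities
actual <- c(1, 1, 1, 0, 0, 0)                # made-up observed classes

# Accuracy: share of cases classified correctly at the threshold
threshold  <- 0.5
pred_class <- as.integer(prob > threshold)
accuracy   <- mean(pred_class == actual)

# AUC: probability that a random positive is ranked above a random negative
r   <- rank(prob)
n1  <- sum(actual == 1)
n0  <- sum(actual == 0)
auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(accuracy = accuracy, auc = auc)
```

For these six cases the accuracy is 4/6 and the AUC is 8/9, since eight of the nine positive/negative pairs are ordered correctly.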

Anova in R to fetch the significant codes

Code:
pred_model = anova(m1, m2, test="Chisq")
pred_model
Results:
Analysis of Variance Table
Model 1: male_birth ~ male_death + female_birth + female_death
Model 2: male_birth ~ male_death + female_birth
Res.Df RSS Df Sum of Sq Pr(>Chi)
1 48 3.4883
2 49 3.4951 -1 -0.0068189 0.7594
I am new to R. Can anyone suggest how to fetch the significance codes for the model?
The summary function will often return a matrix with a column of p-values. In this case, however, the result of anova is returned as a dataframe which has been further "class()-ed" as an "anova"-object (so that it can have its own print-method). Run the first example in ?lm and then:
> lm.D9 <- lm(weight ~ group)
> lm.0 <- lm(weight ~ 1)
> anova(lm.D9,lm.0)
Analysis of Variance Table
Model 1: weight ~ group
Model 2: weight ~ 1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 18 8.7292
2 19 9.4175 -1 -0.68821 1.4191 0.249
> str( anova(lm.D9,lm.0) )
Classes ‘anova’ and 'data.frame': 2 obs. of 6 variables:
$ Res.Df : num 18 19
$ RSS : num 8.73 9.42
$ Df : num NA -1
$ Sum of Sq: num NA -0.688
$ F : num NA 1.42
$ Pr(>F) : num NA 0.249
- attr(*, "heading")= chr "Analysis of Variance Table\n" "Model 1: weight ~ group\nModel 2: weight ~ 1"
So you want the second value in the column named 'Pr(>F)':
anova(lm.D9,lm.0)$'Pr(>F)'[2]
[1] 0.2490232
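To go from the p-value to a significance code, the value can be binned with cut() using the cutpoints of R's usual "Signif. codes" legend. A minimal sketch, assuming the plant weight data from the ?lm example; the variable names p_f, p_chi and code are made up for illustration. With test = "Chisq", as in the question, the column is named 'Pr(>Chi)' instead:

```r
# Data from the first example in ?lm
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group  <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)

lm.D9 <- lm(weight ~ group)
lm.0  <- lm(weight ~ 1)

# p-value from the default F test and from the chi-square test
p_f   <- anova(lm.D9, lm.0)[["Pr(>F)"]][2]
p_chi <- anova(lm.D9, lm.0, test = "Chisq")[["Pr(>Chi)"]][2]

# Map a p-value to the usual star code
code <- as.character(cut(p_f,
                         breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                         labels = c("***", "**", "*", ".", " "),
                         include.lowest = TRUE))
list(p_f = p_f, p_chi = p_chi, code = code)
```

Here the p-value is about 0.249, so the code is a blank, meaning not significant at the 0.1 level.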
