Why do I get NAs in my confusionMatrix doing KNN?

Working with R, I fitted a KNN model with knn <- train(x = x_train, y = y_train, method = "knn") on this dataframe:
1 0.35955056 0.62068966 0.34177215 0.27 0.7260274 0.22 MIT
2 0.59550562 0.56321839 0.35443038 0.15 0.7260274 0.22 MIT
3 0.52808989 0.35632184 0.45569620 0.13 0.7397260 0.22 NUC
4 0.34831461 0.35632184 0.34177215 0.54 0.6575342 0.22 MIT
5 0.44943820 0.31034483 0.44303797 0.17 0.6712329 0.22 CYT
6 0.43820225 0.47126437 0.34177215 0.65 0.7260274 0.22 MIT
7 0.41573034 0.36781609 0.48101266 0.20 0.7945205 0.34 NUC
8 0.49438202 0.42528736 0.56962025 0.36 0.6712329 0.22 MIT
9 0.32584270 0.29885057 0.49367089 0.15 0.7945205 0.30 CYT
10 0.35955056 0.29885057 0.41772152 0.21 0.7260274 0.27 NU
...
Obtaining this result:
k-Nearest Neighbors
945 samples
6 predictor
8 classes: 'CYT', 'ERL', 'EXC', 'ME', 'MIT', 'NUC', 'POX', 'VAC'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 945, 945, 945, 945, 945, 945, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.5273630 0.3760233
7 0.5480598 0.4004283
9 0.5667651 0.4242597
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
After that, I wanted to do a confusion matrix with this code:
predictions <- predict(knn, x_test)
results <- data.frame(real = y_test, predicted = predictions)
attach(results)
confusionMatrix(real, predicted)
And I got these results:
Confusion Matrix and Statistics
Reference
Prediction CYT ERL EXC ME MIT NUC POX VAC
CYT 73 0 0 3 7 44 0 0
ERL 0 0 0 1 0 0 0 0
EXC 2 0 6 3 1 0 0 0
ME 5 0 1 68 2 11 0 0
MIT 19 0 0 8 44 14 0 0
NUC 57 0 0 6 8 74 0 0
POX 3 0 0 0 1 2 0 0
VAC 3 0 2 2 1 1 0 0
Overall Statistics
Accuracy : 0.5614
95% CI : (0.5153, 0.6068)
No Information Rate : 0.3432
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.417
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: CYT Class: ERL Class: EXC Class: ME Class: MIT Class: NUC Class: POX Class: VAC
Sensitivity 0.4506 NA 0.66667 0.7473 0.68750 0.5068 NA NA
Specificity 0.8258 0.997881 0.98704 0.9501 0.89951 0.7822 0.98729 0.98093
Pos Pred Value 0.5748 NA 0.50000 0.7816 0.51765 0.5103 NA NA
Neg Pred Value 0.7420 NA 0.99348 0.9403 0.94832 0.7798 NA NA
Prevalence 0.3432 0.000000 0.01907 0.1928 0.13559 0.3093 0.00000 0.00000
Detection Rate 0.1547 0.000000 0.01271 0.1441 0.09322 0.1568 0.00000 0.00000
Detection Prevalence 0.2691 0.002119 0.02542 0.1843 0.18008 0.3072 0.01271 0.01907
Balanced Accuracy 0.6382 NA 0.82685 0.8487 0.79350 0.6445 NA NA
I would like to know why I got these NAs in my sensitivity for class ERL, for example.
Did I do something wrong?
What is the reason for these NAs? I can provide the complete dataframe if necessary.

Based on the confusion matrix, your prediction set contains no observations with the classifications ERL, POX, and VAC, which is what leads to the NAs in the Statistics by Class.
Take the sensitivity of class ERL, for example. Sensitivity, also called the true positive rate, is calculated as the number of correct positive predictions divided by the total number of actual positives.
Positive ERL predictions = 0
Actual ERL classifications = 0
Sensitivity ERL = 0/0, which is undefined and therefore reported as NA.
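A quick way to confirm this is to tabulate the reference labels: classes with zero test observations are exactly the ones with NA sensitivity. A minimal sketch, assuming y_test and predictions exist as in the question:
# Hedged sketch: classes absent from y_test produce 0/0 in the
# sensitivity calculation, which caret reports as NA.
table(y_test)                        # expect 0 for ERL, POX and VAC
cm <- table(Prediction = predictions, Reference = y_test)
tp_erl     <- cm["ERL", "ERL"]       # correct ERL predictions
actual_erl <- sum(cm[, "ERL"])       # actual ERL cases in the test set
tp_erl / actual_erl                  # 0/0 -> NaN, shown as NA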

How do I generate a sample completeness~diversity order plot in iNEXT?

I am trying to create a plot in the style of the sample completeness~diversity order plot shown below, but referencing a different dataset.
The plot shown above is Fig. 3a in Chao et al. 2020.
I want to create a plot in this style for the ciliate dataset included in the iNEXT package, principally using the functions iNEXT::iNEXT() and iNEXT::ggiNEXT(). I see that sample completeness values are contained in the output of function iNEXT() as the variable SC in multiple places (shown below).
> library(ggplot2)
> library(iNEXT)
> data("ciliates")
>
> #define output of function iNEXT as object c
> c <- iNEXT::iNEXT(ciliates, datatype = "incidence_raw", q=c(0,1,2), se=TRUE, nboot = 10)
>
> head(c$iNextEst$size_based)
Assemblage t Method Order.q qD qD.LCL qD.UCL SC SC.LCL SC.UCL
1 EtoshaPan 1 Rarefaction 0 27.15789 25.46898 28.84681 0.1901378 0.1602758 0.2199998
2 EtoshaPan 2 Rarefaction 0 49.15205 46.46171 51.84239 0.3154101 0.2763704 0.3544499
3 EtoshaPan 3 Rarefaction 0 67.74407 64.36418 71.12395 0.4040033 0.3625663 0.4454402
4 EtoshaPan 4 Rarefaction 0 83.93008 80.01195 87.84822 0.4704869 0.4293215 0.5116523
5 EtoshaPan 5 Rarefaction 0 98.31054 93.93120 102.68989 0.5227274 0.4830091 0.5624458
6 EtoshaPan 6 Rarefaction 0 111.27226 106.47865 116.06587 0.5652248 0.5274311 0.6030185
>
> head(c$iNextEst$coverage_based)
Assemblage SC t Method Order.q qD qD.LCL qD.UCL
1 EtoshaPan 0.1901402 1 Rarefaction 0 27.15824 25.46933 28.84715
2 EtoshaPan 0.3154100 2 Rarefaction 0 49.15201 43.79056 54.51347
3 EtoshaPan 0.4040045 3 Rarefaction 0 67.74432 61.04684 74.44180
4 EtoshaPan 0.4704874 4 Rarefaction 0 83.93019 75.75317 92.10722
5 EtoshaPan 0.5227259 5 Rarefaction 0 98.31009 88.86377 107.75640
6 EtoshaPan 0.5652251 6 Rarefaction 0 111.27235 100.79007 121.75464
>
> c$DataInfo
Assemblage T U S.obs SC Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
1 EtoshaPan 19 516 216 0.8017 107 44 26 14 6 5 4 3 2 2
2 CentralNamibDesert 17 379 130 0.8425 63 28 13 4 3 7 1 2 1 0
3 SouthernNamibDesert 15 358 150 0.7816 82 28 14 8 6 1 1 2 2 1
However, nowhere in object c do I see unique combinations of assemblage (EtoshaPan, CentralNamibDesert, SouthernNamibDesert), sample completeness value, and diversity order ("Order q" in plot above). Can a sample completeness~diversity order plot be created simply by setting the appropriate arguments in function ggiNEXT? If not, what steps must I take in order to create it?
Edit: I'm not the only one asking this question: link
Reference:
Chao, Anne, et al. "Quantifying sample completeness and comparing diversities among assemblages." Ecological Research 35.2 (2020): 292-314.
Link to reference: link
This is done using the iNEXT.4steps package, downloadable from GitHub.
> #install_github('AnneChao/iNEXT.4steps') #not currently on CRAN
> library(iNEXT)
> library(iNEXT.4steps)
> library(ggplot2)
> data("ciliates") #included in package iNEXT
>
> i <- iNEXT4steps(ciliates,
+ diversity = "TD",
+ q = seq(0, 2, 0.2),
+ datatype = "incidence_raw",
+ nboot = 99)
> i$figure #automatically generate plots shown below
> i
$summary
$summary$`STEP1. Sample completeness profiles`
Assemblage q = 0 q = 1 q = 2
1 CentralNamibDesert 0.66 0.84 0.98
2 EtoshaPan 0.64 0.80 0.95
3 SouthernNamibDesert 0.57 0.78 0.96
$summary$`STEP2. Asymptotic analysis`
Assemblage Diversity Observed Estimator s.e. LCL UCL
1 CentralNamibDesert Species richness 130.00 196.71 21.27 155.02 238.39
2 CentralNamibDesert Shannon diversity 81.81 106.48 6.06 94.60 118.36
3 CentralNamibDesert Simpson diversity 54.22 59.56 3.31 53.07 66.05
4 EtoshaPan Species richness 216.00 339.25 25.67 288.94 389.57
5 EtoshaPan Shannon diversity 158.37 222.94 11.84 199.74 246.13
6 EtoshaPan Simpson diversity 116.68 142.83 8.78 125.62 160.04
7 SouthernNamibDesert Species richness 150.00 262.07 26.45 210.23 313.90
8 SouthernNamibDesert Shannon diversity 103.70 149.91 8.09 134.05 165.77
9 SouthernNamibDesert Simpson diversity 72.33 84.60 4.48 75.82 93.37
$summary$`STEP3. Non-asymptotic coverage-based rarefaction and extrapolation analysis`
Cmax = 0.893 q = 0 q = 1 q = 2
1 CentralNamibDesert 151.42 88.95 55.70
2 EtoshaPan 272.81 185.79 126.43
3 SouthernNamibDesert 207.21 127.10 77.98
$summary$`STEP4. Evenness among species abundances`
Pielou J' q = 1 q = 2
CentralNamibDesert 0.89 0.58 0.36
EtoshaPan 0.93 0.68 0.46
SouthernNamibDesert 0.91 0.61 0.37
$figure
$figure[[1]]
$figure[[2]]
$figure[[3]]
$figure[[4]]
$figure[[5]]
$figure[[6]]
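If only the sample completeness~diversity order panel is wanted, the corresponding element of the figure list can be pulled out on its own. A minimal sketch, assuming (per the STEP1 summary above) that the first list element is the sample completeness profile and is an ordinary ggplot object:
# Hedged sketch: i$figure is assumed to be a list of ggplot objects,
# with the STEP1 sample completeness profile in the first slot.
p_sc <- i$figure[[1]] +
  ggplot2::labs(title = "Sample completeness profiles")
print(p_sc)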

R hmftest multinomial logit model "system is computationally singular"

I have a multinomial logit model with two individual specific variables (first and age).
I would like to conduct the hmftest to check if the IIA holds.
My dataset looks like this:
head(df)
mode choice first age
1 both 1 0 24
2 pre 1 1 23
3 both 1 2 53
4 post 1 3 43
5 no 1 1 55
6 both 1 2 63
I reshaped it for mlogit to:
mode choice first age idx
1 TRUE 1 0 24 1:both
2 FALSE 1 0 24 1:no
3 FALSE 1 0 24 1:post
4 FALSE 1 0 24 1:pre
5 FALSE 1 1 23 2:both
6 FALSE 1 1 23 2:no
7 FALSE 1 1 23 2:post
8 TRUE 1 1 23 2:pre
9 TRUE 1 2 53 3:both
10 FALSE 1 2 53 3:no
~~~ indexes ~~~~
id1 id2
1 1 both
2 1 no
3 1 post
4 1 pre
5 2 both
6 2 no
7 2 post
8 2 pre
9 3 both
10 3 no
indexes: 1, 2
My original (full) model runs as follows:
full <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no")
leading to the following result:
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
method = "nr")
Frequencies of alternatives:choice
no both post pre
0.2 0.4 0.2 0.2
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 8.11E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 2.0077e+01 1.0441e+04 0.0019 0.9985
(Intercept):post -4.1283e-01 1.4771e+04 0.0000 1.0000
(Intercept):pre 5.3346e-01 1.4690e+04 0.0000 1.0000
first1:both -4.0237e+01 1.1059e+04 -0.0036 0.9971
first1:post -8.9168e-01 1.4771e+04 -0.0001 1.0000
first1:pre -6.6805e-01 1.4690e+04 0.0000 1.0000
first2:both -1.9674e+01 1.0441e+04 -0.0019 0.9985
first2:post -1.8975e+01 1.5683e+04 -0.0012 0.9990
first2:pre -1.8889e+01 1.5601e+04 -0.0012 0.9990
first3:both -2.1185e+01 1.1896e+04 -0.0018 0.9986
first3:post 1.9200e+01 1.5315e+04 0.0013 0.9990
first3:pre 1.9218e+01 1.5237e+04 0.0013 0.9990
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 9.3377e-03 2.3157e-02 0.4032 0.6868
age:pre -1.2338e-02 2.2812e-02 -0.5408 0.5886
Log-Likelihood: -61.044
McFadden R^2: 0.54178
Likelihood ratio test : chisq = 144.35 (p.value = < 2.22e-16)
To test for IIA, I exclude one alternative from the model (here "pre") and run the model as follows:
part <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
alt.subset = c("no", "post", "both"))
leading to
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, alt.subset = c("no",
"post", "both"), reflevel = "no", method = "nr")
Frequencies of alternatives:choice
no both post
0.25 0.50 0.25
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 6.88E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 1.9136e+01 6.5223e+03 0.0029 0.9977
(Intercept):post -9.2040e-01 9.2734e+03 -0.0001 0.9999
first1:both -3.9410e+01 7.5835e+03 -0.0052 0.9959
first1:post -9.3119e-01 9.2734e+03 -0.0001 0.9999
first2:both -1.8733e+01 6.5223e+03 -0.0029 0.9977
first2:post -1.8094e+01 9.8569e+03 -0.0018 0.9985
first3:both -2.0191e+01 1.1049e+04 -0.0018 0.9985
first3:post 2.0119e+01 1.1188e+04 0.0018 0.9986
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 1.9879e-02 2.7872e-02 0.7132 0.4757
Log-Likelihood: -27.325
McFadden R^2: 0.67149
Likelihood ratio test : chisq = 111.71 (p.value = < 2.22e-16)
However, when I want to conduct the hmftest, the following error occurs:
> hmftest(full, part)
Error in solve.default(diff.var) :
system is computationally singular: reciprocal condition number = 4.34252e-21
Does anyone have an idea where the problem might be?
I believe the issue here is that hmftest checks whether the probability ratio of two alternatives depends only on the characteristics of those alternatives.
Since the model contains only individual-specific variables, the test won't work in this case.
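For intuition, here is a hedged sketch of the Hausman-McFadden statistic that hmftest is based on (not mlogit's exact internals); the solve() call is where the "computationally singular" error is raised:
# Hedged sketch of the Hausman-McFadden test statistic: compare the
# coefficients the restricted and full models share. A (near-)singular
# difference of covariance matrices makes solve() fail, as in the error.
common <- names(coef(part))
b_full <- coef(full)[common]
b_part <- coef(part)
V_diff <- vcov(part)[common, common] - vcov(full)[common, common]
stat   <- t(b_part - b_full) %*% solve(V_diff) %*% (b_part - b_full)
# stat is compared against a chi-squared distribution with length(common) df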

Logistic regression detection probability

I'm attempting to assess the key covariates in detection probability.
I'm currently using this code:
model1 <- glm(P ~ Width + MBL + DFT + SGP + SGC + Depth,
              family = binomial("logit"),
              data = dframe2, na.action = na.exclude)
summary.lm(model1)  # note: summary(model1) is the usual call for a glm
My data is structured like this:
Site Transect Q ID P Width DFT Depth Substrate SGP SGC MBL
1 Vr1 Q1 1 0 NA NA 0.5 Sand 0 0 0.00000
2 Vr1 Q2 2 0 NA NA 1.4 Sand&Searass 1 30 19.14286
3 Vr1 Q3 3 0 NA NA 1.7 Sand&Searass 1 15 16.00000
4 Vr1 Q4 4 1 17 0 2.0 Sand&Searass 1 95 35.00000
5 Vr1 Q5 5 0 NA NA 2.4 Sand 0 0 0.00000
6 Vr1 Q6 6 0 NA NA 2.9 Sand&Searass 1 50 24.85714
My sample size is really small (n=12) and I only have ~70 rows of data.
When I run the code it returns:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.457e+01 4.519e+00 5.437 0.00555 **
Width 1.810e-08 1.641e-01 0.000 1.00000
MBL -2.827e-08 9.906e-02 0.000 1.00000
DFT 2.905e-07 1.268e+00 0.000 1.00000
SGP 1.064e-06 2.691e+00 0.000 1.00000
SGC -2.703e-09 3.289e-02 0.000 1.00000
Depth 1.480e-07 9.619e-01 0.000 1.00000
SubstrateSand&Searass -8.516e-08 1.626e+00 0.000 1.00000
Does this mean my data set is just too small to assess detection probability, or am I doing something wrong?
According to Hair (author of the book Multivariate Data Analysis), you need at least 15 examples for each feature (column) of your data. If you have 12, you can only select one feature.
So, run a t-test comparing the means of each feature across the two classes (0 and 1 of the target, the dependent variable) and choose the feature (independent variable) whose mean difference between classes is the largest. That variable is the one best able to create a boundary that splits the two classes.
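A minimal sketch of that screening step, assuming the column names from the question's dframe2 (the exact candidate set is an assumption here):
# Hedged sketch: t-test each candidate predictor against the binary
# target P and keep the variable with the smallest p-value.
candidates <- c("Width", "MBL", "DFT", "SGP", "SGC", "Depth")
pvals <- sapply(candidates, function(v)
  t.test(dframe2[[v]] ~ dframe2$P)$p.value)
sort(pvals)                          # inspect the ranking
best <- names(which.min(pvals))
model_simple <- glm(reformulate(best, response = "P"),
                    family = binomial("logit"),
                    data = dframe2, na.action = na.exclude)
summary(model_simple)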

Caret: There were missing values in resampled performance measures

I am running caret's neural network on the Bike Sharing dataset and I get the following warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
I am not sure what the problem is. Can anyone help please?
The dataset is from:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Here is the code:
library(caret)
library(bestNormalize)
data_hour = read.csv("hour.csv")
# Split dataset
set.seed(3)
split = createDataPartition(data_hour$casual, p=0.80, list=FALSE)
validation = data_hour[-split,]
dataset = data_hour[split,]
dataset = dataset[,c(-1,-2,-4)]
# View structure of data
str(dataset)
# 'data.frame': 13905 obs. of 14 variables:
# $ season : int 1 1 1 1 1 1 1 1 1 1 ...
# $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
# $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
# $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
# $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
# $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
# $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
# $ temp : num 0.22 0.22 0.24 0.24 0.24 0.38 0.36 0.42 0.46 0.44 ...
# $ atemp : num 0.273 0.273 0.288 0.258 0.288 ...
# $ hum : num 0.8 0.8 0.75 0.75 0.75 0.76 0.81 0.77 0.72 0.77 ...
# $ windspeed : num 0 0 0 0.0896 0 ...
# $ casual : int 8 5 3 0 1 12 26 29 35 40 ...
# $ registered: int 32 27 10 1 7 24 30 55 71 70 ...
# $ cnt : int 40 32 13 1 8 36 56 84 106 110 ...
## transform numeric data to Gaussian
dataset_selected = dataset[,c(-13,-14)]
for (i in 8:12) { dataset_selected[,i] = predict(boxcox(dataset_selected[,i] +0.1))}
# View transformed dataset
str(dataset_selected)
#'data.frame': 13905 obs. of 12 variables:
#' $ season : int 1 1 1 1 1 1 1 1 1 1 ...
#' $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
#' $ hr : int 1 2 3 5 8 10 11 12 14 15 ...
#' $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
#' $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
#' $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
#' $ weathersit: int 1 1 1 2 1 1 1 1 2 2 ...
#' $ temp : num -1.47 -1.47 -1.35 -1.35 -1.35 ...
#' $ atemp : num -1.18 -1.18 -1.09 -1.27 -1.09 ...
#' $ hum : num 0.899 0.899 0.637 0.637 0.637 ...
#' $ windspeed : num -1.8 -1.8 -1.8 -0.787 -1.8 ...
#' $ casual : num -0.361 -0.588 -0.81 -1.867 -1.208 ...
# Train data with Neural Network model from caret
control = trainControl(method = 'repeatedcv', number = 10, repeats =3)
metric = 'RMSE'
set.seed(3)
fit = train(casual ~., data = dataset_selected, method = 'nnet', metric = metric, trControl = control, trace = FALSE)
Thanks for your help!
phiver's comment is spot on; however, I would still like to provide a more verbose answer on this concrete example.
In order to investigate what is going on in more detail, one should add the arguments returnResamp = "all" and savePredictions = "all" to trainControl:
control = trainControl(method = 'repeatedcv',
                       number = 10,
                       repeats = 3,
                       returnResamp = "all",
                       savePredictions = "all")
metric = 'RMSE'
set.seed(3)
fit = train(casual ~ .,
            data = dataset_selected,
            method = 'nnet',
            metric = metric,
            trControl = control,
            trace = FALSE,
            form = "traditional")
Now when running:
fit$results
#output
size decay RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
2 1 1e-04 0.9479487 0.1850270 0.7657225 0.074211541 0.20380571 0.079640883
3 1 1e-01 0.8801701 0.3516646 0.6937938 0.074484860 0.20787440 0.077960642
4 3 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
5 3 1e-04 0.9272942 0.2482794 0.7434689 0.091409600 0.24363651 0.098854133
6 3 1e-01 0.7943899 0.6193242 0.5944279 0.011560524 0.03299137 0.013002708
7 5 0e+00 0.9999205 NaN 0.8213177 0.009655872 NA 0.007919575
8 5 1e-04 0.8811411 0.3621494 0.6941335 0.092169810 0.22980560 0.098987058
9 5 1e-01 0.7896507 0.6431808 0.5870894 0.009947324 0.01063359 0.009121535
We notice the problem occurs when decay = 0.
Let's filter the observations and predictions for decay = 0:
library(tidyverse)
fit$pred %>%
  filter(decay == 0) -> for_r2
var(for_r2$pred)
#output
0
We can observe that all of the predictions when decay == 0 are the same (they have zero variance). The model exclusively predicts 0:
unique(for_r2$pred)
#output
0
So when the summary function tries to compute R squared:
caret::R2(for_r2$obs, for_r2$pred)
#output
[1] NA
Warning message:
In cor(obs, pred, use = ifelse(na.rm, "complete.obs", "everything")) :
the standard deviation is zero
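The warning is easy to reproduce in isolation: Pearson correlation, which caret's default R squared is based on, is undefined when one of the vectors is constant.
# Minimal demonstration: correlating anything with a constant vector
# yields NA because the constant's standard deviation is zero.
cor(1:5, rep(0, 5))   # NA, with the same "standard deviation is zero" warning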
Answer by @topepo (caret package main developer). See the detailed GitHub thread here.
It looks like it happens when you have one hidden unit and almost no regularization. What is happening is that the model is predicting a value very close to a constant (so that the RMSE is a little worse than the basic standard deviation of the outcome):
> ANN_cooling_fit$resample %>% dplyr::filter(is.na(Rsquared))
RMSE Rsquared MAE size decay Resample
1 8.414010 NA 6.704311 1 0e+00 Fold04.Rep01
2 8.421244 NA 6.844363 1 0e+00 Fold01.Rep03
3 7.855925 NA 6.372947 1 1e-04 Fold10.Rep07
4 7.963816 NA 6.428947 1 0e+00 Fold07.Rep09
5 8.492898 NA 6.901842 1 0e+00 Fold09.Rep09
6 7.892527 NA 6.479474 1 0e+00 Fold10.Rep10
> sd(mydata$V7)
[1] 7.962888
So it's nothing to really worry about; just some parameters that do very poorly.
The answer by @missuse is already very insightful for understanding why this error happens.
So I just want to add some straightforward ways to get rid of it.
If the predictions in some cross-validation folds have zero variance, the model didn't converge. In such cases, you can try the neuralnet package, which offers two parameters you can tune:
threshold : default value = 0.01. Set it to 0.3 and then try lower values 0.2, 0.1, 0.05.
stepmax : default value = 1e+05. Set it to 1e+08 and then try lower values 1e+07, 1e+06.
In most cases, it is sufficient to change the threshold parameter like this:
model.nn <- caret::train(formula1,
                         method = "neuralnet",
                         data = training.set[,],
                         # apply preProcess within cross-validation folds
                         preProcess = c("center", "scale"),
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 3),
                         threshold = 0.3)
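Alternatively, staying with method = 'nnet': since the analysis above traced the NA values to the decay = 0 candidates, a hedged option is simply to exclude them from the tuning grid (the grid values below are an assumption, chosen to mirror caret's defaults minus zero):
# Hedged sketch: tune only over non-zero weight decay so no resample
# produces constant predictions and an undefined R squared.
grid <- expand.grid(size = c(1, 3, 5), decay = c(1e-4, 1e-2, 1e-1))
fit2 <- train(casual ~ .,
              data = dataset_selected,
              method = 'nnet',
              metric = 'RMSE',
              trControl = control,
              tuneGrid = grid,   # no decay = 0 candidates
              trace = FALSE)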

R coxph() with interaction term, Warning: X matrix deemed to be singular

Please be patient with me. I'm new to this site.
I am modeling turtle nest survival using the coxph() function and have run into a confusing problem with an interaction term between species and nest cages. I have nests from 3 species of turtles (7, 10, and 111 nests per species).
All nests of species 1 (7 nests) have nest cages.
None of the nests of species 2 (10 nests) have nest cages.
About half of the nests of species 3 (111 nests) have nest cages.
Here is my model with the summary output:
library(survival)
S <- Surv(time, event)
n8 <- coxph(S ~ species:cage, data = nesta1)
Warning message:
In coxph(S ~ species:cage, data = nesta1) :
X matrix deemed to be singular; variable 1 5 6
summary(n8)
Call:
coxph(formula = S ~ species:cage, data = nesta1)
n= 128, number of events= 73
coef exp(coef) se(coef) z Pr(>|z|)
species1:cageN NA NA 0.0000 NA NA
species2:cageN 1.2399 3.4554 0.3965 3.128 0.00176 **
species3:cageN 0.5511 1.7351 0.2664 2.068 0.03860 *
species1:cageY -0.1054 0.8999 0.6145 -0.172 0.86379
species2:cageY NA NA 0.0000 NA NA
species3:cageY NA NA 0.0000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
species1:cageN NA NA NA NA
species2:cageN 3.4554 0.2894 1.5887 7.515
species3:cageN 1.7351 0.5763 1.0293 2.925
species1:cageY 0.8999 1.1112 0.2698 3.001
species2:cageY NA NA NA NA
species3:cageY NA NA NA NA
Concordance= 0.61 (se = 0.038 )
Rsquare= 0.079 (max possible= 0.993 )
Likelihood ratio test= 10.57 on 3 df, p=0.01426
Wald test = 11.36 on 3 df, p=0.009908
Score (logrank) test = 12.22 on 3 df, p=0.006672
I understand that I would have singularities for species 1 and 2, but not for species 3. Why would the "species3:cageY" line be singular when there are species 3 nests with nest cages on them?
Is it ok to include species 1 and 2 even though they have those singularities?
Edit: I cannot find any errors in my data. I have decimal numbers for the time variable for a few nests, but that doesn't seem to be a problem for species 3 nests without a nest cage. For species 3, I have the full range of time values for nests with and without a nest cage and I have both true and false events for nests with and without a nest cage.
Edit:
with(nesta1, table(event, species, cage))
, , cage = N
species
event 1 2 3
0 0 1 24
1 0 9 38
, , cage = Y
species
event 1 2 3
0 4 0 26
1 3 0 23
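One hedged way to see where the singularities come from (a sketch, assuming nesta1 as in the question) is to build the interaction design matrix directly. The table above shows two empty cells, which give all-zero columns; the four remaining indicator columns then sum to 1 for every nest, and because a Cox model has no intercept (constants are absorbed into the baseline hazard), one of those four is also unidentifiable, which is why the last one, species3:cageY, gets aliased:
# Hedged sketch, assuming nesta1 from the question.
X <- model.matrix(~ species:cage - 1, data = nesta1)
colSums(X)            # species1:cageN and species2:cageY have no nests
qr(cbind(1, X))$rank  # rank deficit: the non-empty indicators sum to a
                      # constant, so one coefficient cannot be estimated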
Edit 2: I understand that interaction-only models are not very useful, but the interaction term results behave the same way whether I have other main effects in the model or not. I've removed the other main effects to simplify this question.
Thank you!
