Logistic regression detection probability in R

I'm attempting to assess the key covariates of detection probability.
I'm currently using this code:
model1 <- glm(P ~ Width + MBL + DFT + SGP + SGC + Depth + Substrate,
              family = binomial("logit"),
              data = dframe2, na.action = na.exclude)
summary.lm(model1)
My data is structured like this:
Site Transect Q ID P Width DFT Depth Substrate SGP SGC MBL
1 Vr1 Q1 1 0 NA NA 0.5 Sand 0 0 0.00000
2 Vr1 Q2 2 0 NA NA 1.4 Sand&Searass 1 30 19.14286
3 Vr1 Q3 3 0 NA NA 1.7 Sand&Searass 1 15 16.00000
4 Vr1 Q4 4 1 17 0 2.0 Sand&Searass 1 95 35.00000
5 Vr1 Q5 5 0 NA NA 2.4 Sand 0 0 0.00000
6 Vr1 Q6 6 0 NA NA 2.9 Sand&Searass 1 50 24.85714
My sample size is really small (n=12) and I only have ~70 rows of data.
When I run the code, it returns:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.457e+01 4.519e+00 5.437 0.00555 **
Width 1.810e-08 1.641e-01 0.000 1.00000
MBL -2.827e-08 9.906e-02 0.000 1.00000
DFT 2.905e-07 1.268e+00 0.000 1.00000
SGP 1.064e-06 2.691e+00 0.000 1.00000
SGC -2.703e-09 3.289e-02 0.000 1.00000
Depth 1.480e-07 9.619e-01 0.000 1.00000
SubstrateSand&Searass -8.516e-08 1.626e+00 0.000 1.00000
Does this mean my data set is just too small to assess detection probability, or am I doing something wrong?

According to Hair (author of the book Multivariate Data Analysis), you need at least 15 examples for each feature (column) of your data. If you have only 12, you can justify at most one feature.
So, run a t-test comparing the means of each feature between the two classes of the target (dependent) variable, 0 and 1, and choose the feature (independent variable) whose mean difference between the classes is biggest. That variable is the one best able to create a boundary splitting the two classes.
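A minimal sketch of that procedure in R, assuming dframe2 and the covariate names from the question (the tryCatch guards against features with too few non-NA values in a class):
features <- c("Width", "MBL", "DFT", "SGP", "SGC", "Depth")
scores <- sapply(features, function(f) {
  x0 <- dframe2[dframe2$P == 0, f]  # feature values for non-detections
  x1 <- dframe2[dframe2$P == 1, f]  # feature values for detections
  tt <- tryCatch(t.test(x1, x0), error = function(e) NULL)  # Welch two-sample t-test
  if (is.null(tt)) c(t = NA, p = NA) else c(t = unname(tt$statistic), p = tt$p.value)
})
scores[, order(abs(scores["t", ]), decreasing = TRUE)]  # largest class separation first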

Related

R hmftest multinomial logit model: "system is computationally singular"

I have a multinomial logit model with two individual-specific variables (first and age).
I would like to conduct the hmftest to check whether the IIA assumption holds.
My dataset looks like this:
head(df)
mode choice first age
1 both 1 0 24
2 pre 1 1 23
3 both 1 2 53
4 post 1 3 43
5 no 1 1 55
6 both 1 2 63
I reshaped it for mlogit to:
mode choice first age idx
1 TRUE 1 0 24 1:both
2 FALSE 1 0 24 1:no
3 FALSE 1 0 24 1:post
4 FALSE 1 0 24 1:pre
5 FALSE 1 1 23 2:both
6 FALSE 1 1 23 2:no
7 FALSE 1 1 23 2:post
8 TRUE 1 1 23 2:pre
9 TRUE 1 2 53 3:both
10 FALSE 1 2 53 3:no
~~~ indexes ~~~~
id1 id2
1 1 both
2 1 no
3 1 post
4 1 pre
5 2 both
6 2 no
7 2 post
8 2 pre
9 3 both
10 3 no
indexes: 1, 2
My original (full) model runs as follows:
full <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no")
leading to the following result:
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
method = "nr")
Frequencies of alternatives:choice
no both post pre
0.2 0.4 0.2 0.2
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 8.11E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 2.0077e+01 1.0441e+04 0.0019 0.9985
(Intercept):post -4.1283e-01 1.4771e+04 0.0000 1.0000
(Intercept):pre 5.3346e-01 1.4690e+04 0.0000 1.0000
first1:both -4.0237e+01 1.1059e+04 -0.0036 0.9971
first1:post -8.9168e-01 1.4771e+04 -0.0001 1.0000
first1:pre -6.6805e-01 1.4690e+04 0.0000 1.0000
first2:both -1.9674e+01 1.0441e+04 -0.0019 0.9985
first2:post -1.8975e+01 1.5683e+04 -0.0012 0.9990
first2:pre -1.8889e+01 1.5601e+04 -0.0012 0.9990
first3:both -2.1185e+01 1.1896e+04 -0.0018 0.9986
first3:post 1.9200e+01 1.5315e+04 0.0013 0.9990
first3:pre 1.9218e+01 1.5237e+04 0.0013 0.9990
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 9.3377e-03 2.3157e-02 0.4032 0.6868
age:pre -1.2338e-02 2.2812e-02 -0.5408 0.5886
Log-Likelihood: -61.044
McFadden R^2: 0.54178
Likelihood ratio test : chisq = 144.35 (p.value = < 2.22e-16)
To test for IIA, I exclude one alternative from the model (here "pre") and run the model as follows:
part <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
alt.subset = c("no", "post", "both"))
leading to
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, alt.subset = c("no",
"post", "both"), reflevel = "no", method = "nr")
Frequencies of alternatives:choice
no both post
0.25 0.50 0.25
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 6.88E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 1.9136e+01 6.5223e+03 0.0029 0.9977
(Intercept):post -9.2040e-01 9.2734e+03 -0.0001 0.9999
first1:both -3.9410e+01 7.5835e+03 -0.0052 0.9959
first1:post -9.3119e-01 9.2734e+03 -0.0001 0.9999
first2:both -1.8733e+01 6.5223e+03 -0.0029 0.9977
first2:post -1.8094e+01 9.8569e+03 -0.0018 0.9985
first3:both -2.0191e+01 1.1049e+04 -0.0018 0.9985
first3:post 2.0119e+01 1.1188e+04 0.0018 0.9986
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 1.9879e-02 2.7872e-02 0.7132 0.4757
Log-Likelihood: -27.325
McFadden R^2: 0.67149
Likelihood ratio test : chisq = 111.71 (p.value = < 2.22e-16)
However, when I conduct the hmftest, the following error occurs:
> hmftest(full, part)
Error in solve.default(diff.var) :
system is computationally singular: reciprocal condition number = 4.34252e-21
Does anyone have an idea where the problem might be?
I believe the issue here is that the hmftest checks whether the probability ratio of two alternatives depends only on the characteristics of those alternatives.
Since there are only individual-specific variables in this model, the test cannot work in this case.
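For what it's worth, hmftest() does run when the model contains alternative-specific covariates. A minimal sketch adapted from the hmftest() documentation, using the TravelMode data from the AER package, where wait and gcost vary across alternatives (mlogit.data() is deprecated in recent mlogit versions in favour of dfidx() but still works):
library(mlogit)
data("TravelMode", package = "AER")
TM <- mlogit.data(TravelMode, choice = "choice", shape = "long",
                  alt.var = "mode", chid.var = "individual")
full <- mlogit(choice ~ wait + gcost, data = TM, reflevel = "car")
part <- mlogit(choice ~ wait + gcost, data = TM, reflevel = "car",
               alt.subset = c("car", "bus", "train"))
hmftest(full, part)  # Hausman-McFadden test of IIA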

Why do I get NAs in my confusionMatrix doing KNN?

Working with R, I trained a KNN model with knn <- train(x = x_train, y = y_train, method = "knn") on this dataframe:
1 0.35955056 0.62068966 0.34177215 0.27 0.7260274 0.22 MIT
2 0.59550562 0.56321839 0.35443038 0.15 0.7260274 0.22 MIT
3 0.52808989 0.35632184 0.45569620 0.13 0.7397260 0.22 NUC
4 0.34831461 0.35632184 0.34177215 0.54 0.6575342 0.22 MIT
5 0.44943820 0.31034483 0.44303797 0.17 0.6712329 0.22 CYT
6 0.43820225 0.47126437 0.34177215 0.65 0.7260274 0.22 MIT
7 0.41573034 0.36781609 0.48101266 0.20 0.7945205 0.34 NUC
8 0.49438202 0.42528736 0.56962025 0.36 0.6712329 0.22 MIT
9 0.32584270 0.29885057 0.49367089 0.15 0.7945205 0.30 CYT
10 0.35955056 0.29885057 0.41772152 0.21 0.7260274 0.27 NU
...
Obtaining this result:
k-Nearest Neighbors
945 samples
6 predictor
8 classes: 'CYT', 'ERL', 'EXC', 'ME', 'MIT', 'NUC', 'POX', 'VAC'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 945, 945, 945, 945, 945, 945, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.5273630 0.3760233
7 0.5480598 0.4004283
9 0.5667651 0.4242597
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
After that, I wanted to do a confusion matrix with this code:
predictions <- predict(knn, x_test)
results <- data.frame(real = y_test, predicted = predictions)
attach(results)
confusionMatrix(real, predicted)
And I got these results:
Confusion Matrix and Statistics
Reference
Prediction CYT ERL EXC ME MIT NUC POX VAC
CYT 73 0 0 3 7 44 0 0
ERL 0 0 0 1 0 0 0 0
EXC 2 0 6 3 1 0 0 0
ME 5 0 1 68 2 11 0 0
MIT 19 0 0 8 44 14 0 0
NUC 57 0 0 6 8 74 0 0
POX 3 0 0 0 1 2 0 0
VAC 3 0 2 2 1 1 0 0
Overall Statistics
Accuracy : 0.5614
95% CI : (0.5153, 0.6068)
No Information Rate : 0.3432
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.417
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: CYT Class: ERL Class: EXC Class: ME Class: MIT Class: NUC Class: POX Class: VAC
Sensitivity 0.4506 NA 0.66667 0.7473 0.68750 0.5068 NA NA
Specificity 0.8258 0.997881 0.98704 0.9501 0.89951 0.7822 0.98729 0.98093
Pos Pred Value 0.5748 NA 0.50000 0.7816 0.51765 0.5103 NA NA
Neg Pred Value 0.7420 NA 0.99348 0.9403 0.94832 0.7798 NA NA
Prevalence 0.3432 0.000000 0.01907 0.1928 0.13559 0.3093 0.00000 0.00000
Detection Rate 0.1547 0.000000 0.01271 0.1441 0.09322 0.1568 0.00000 0.00000
Detection Prevalence 0.2691 0.002119 0.02542 0.1843 0.18008 0.3072 0.01271 0.01907
Balanced Accuracy 0.6382 NA 0.82685 0.8487 0.79350 0.6445 NA NA
I would like to know why I got these NAs in my sensitivity for the class ERL, for example.
Did I do something wrong?
What is the reason for these NAs? I can provide the complete dataframe if necessary.
Based on the confusion matrix, your prediction set is lacking data with the classifications ERL, POX, and VAC, which is what leads to the NAs in the Statistics by Class.
Take the sensitivity of class ERL as an example. Sensitivity, also called the true positive rate, is calculated as the number of correct positive predictions divided by the total number of actual positives.
Correct ERL predictions = 0
Actual ERL classifications = 0
Sensitivity for ERL = 0/0, which leads to the NA.
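You can reproduce this with a small hypothetical example: per-class sensitivity is the diagonal of the confusion table divided by the column (reference) sums, so a class that never appears in the reference set gives 0/0.
pred <- factor(c("A", "A", "B"), levels = c("A", "B", "C"))  # toy predictions
real <- factor(c("A", "B", "B"), levels = c("A", "B", "C"))  # toy reference; no "C" cases
tab <- table(Prediction = pred, Reference = real)
diag(tab) / colSums(tab)  # class C: 0/0 = NaN in base R; caret reports it as NA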

Kaplan-Meier survival plot

Good morning,
I am having trouble understanding some of the output of my Kaplan-Meier analyses.
I have managed to produce the following plots and outputs using ggsurvplot and survfit.
I first made a plot of survival over time for 55 nests, and then did the same with the top predictors of nest failure, one being microtopography, as seen in this example.
Call: npsurv(formula = (S) ~ 1, data = nestdata, conf.type = "log-log")
26 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
55 45 0 13 29 2 NA
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata, conf.type = "log-log")
29 observations deleted due to missingness
records n.max n.start events median 0.95LCL 0.95UCL
Microtopography=0 14 13 0 1 NA NA NA
Microtopography=1 26 21 0 7 NA 29 NA
Microtopography=2 12 8 0 5 3 2 NA
So, I have two primary questions.
1. The survival curves are for a ground-nesting bird with an egg incubation time of 21-23 days; incubation time is the number of days the hen sits on the eggs before they hatch. Knowing that, how is it possible that the median survival time in plot #1 is 29 days? It seems to fit with the literature I have read on this same species; however, I assume it has something to do with the left censoring in my models, but I am honestly at a loss. If anyone has any insight, or even any literature that could help me understand this concept, I would really appreciate it.
2. I am also wondering how I can compare median survival times for the 2nd plot. Because microtopography survival curves 1 and 2 never cross the 0.5 point, the median survival times returned are NA. I understand I can choose another survival level, such as 0.75, but in this example that still wouldn't help me, because microtopography 0 never drops below roughly 0.9. How would one go about reporting these data? Would the workaround be to choose a set of survival times, using:
summary(s,times=c(7,14,21,29))
Call: npsurv(formula = (S) ~ Microtopography, data = nestdata,
conf.type =
"log-log")
29 observations deleted due to missingness
Microtopography=0
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 3 0 0 1.000 0.0000 1.000 1.000
14 7 0 0 1.000 0.0000 1.000 1.000
21 13 0 0 1.000 0.0000 1.000 1.000
29 8 1 5 0.909 0.0867 0.508 0.987
Microtopography=1
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 9 0 0 1.000 0.0000 1.000 1.000
14 17 1 0 0.933 0.0644 0.613 0.990
21 21 3 0 0.798 0.0909 0.545 0.919
29 15 3 7 0.655 0.1060 0.409 0.819
Microtopography=2
time n.risk n.event censored survival std.err lower 95% CI upper 95% CI
7 1 2 0 0.333 0.272 0.00896 0.774
14 7 1 0 0.267 0.226 0.00968 0.686
21 8 1 0 0.233 0.200 0.00990 0.632
29 3 1 5 0.156 0.148 0.00636 0.504
Late to the party...
The median survival time of 29 days is the median incubation time that birds of this species are expected to spend in the egg before hatching, based on your data. Your 21-23 days (based on what source?) is probably based on many experiments/studies of eggs that hatched, ignoring those that never hatched (those that failed?).
From your overall survival curve, it is clear that some eggs had not yet hatched even after more than 35 days. These are taken into account when calculating the expected survival times. If you think that these eggs will fail, then omit them; otherwise, the software cannot possibly know that they will eventually fail. But how can anyone know for sure whether an egg is going to fail, even after 30 days? Is there a known maximum hatching time, i.e. the record among all hatched eggs?
These are not really R questions, so this post might be more appropriate for the statistics site. But the following might help.
how is it possible that the median survival time in plot #1 is 29 days?
The median survival is where the survival curve passes the 50% mark. Eyeballing it, 29 days looks right.
I am also wondering how I can compare median survival times for the 2nd plot, because microtopography survival curves 1 and 2 never cross the 0.5 point.
Given your data, you cannot compare the medians. You can compare the 75% or 90% survival times, if you must. You can compare the point survival at, say, 30 days. You can compare the truncated (restricted) mean survival over the first 30 days.
In order to compare the medians, you would have to make an assumption. A reasonable assumption would be exponential decay after some tenure point that includes at least one failure.
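A minimal sketch of those alternatives, assuming S and nestdata as defined in the question:
library(survival)
fit <- survfit(S ~ Microtopography, data = nestdata, conf.type = "log-log")
summary(fit, times = 30)  # point survival at 30 days for each group
print(fit, rmean = 30)    # restricted (truncated) mean survival over the first 30 days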

R coxph() with interaction term, Warning: X matrix deemed to be singular

Please be patient with me; I'm new to this site.
I am modeling turtle nest survival using the coxph() function and have run into a confusing problem with an interaction term between species and nest cages. I have nests from 3 species of turtles (7, 10, and 111 nests per species).
All nests of species 1 (7 nests) have nest cages.
None of the nests of species 2 (10 nests) have nest cages.
About half of the nests of species 3 (111 nests) have nest cages.
Here is my model with the summary output:
S <- Surv(time, event)
n8 <- coxph(S ~ species:cage, data = nesta1)
Warning message:
In coxph(S ~ species:cage, data = nesta1) :
X matrix deemed to be singular; variable 1 5 6
summary(n8)
Call:
coxph(formula = S ~ species:cage, data = nesta1)
n= 128, number of events= 73
coef exp(coef) se(coef) z Pr(>|z|)
species1:cageN NA NA 0.0000 NA NA
species2:cageN 1.2399 3.4554 0.3965 3.128 0.00176 **
species3:cageN 0.5511 1.7351 0.2664 2.068 0.03860 *
species1:cageY -0.1054 0.8999 0.6145 -0.172 0.86379
species2:cageY NA NA 0.0000 NA NA
species3:cageY NA NA 0.0000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
species1:cageN NA NA NA NA
species2:cageN 3.4554 0.2894 1.5887 7.515
species3:cageN 1.7351 0.5763 1.0293 2.925
species1:cageY 0.8999 1.1112 0.2698 3.001
species2:cageY NA NA NA NA
species3:cageY NA NA NA NA
Concordance= 0.61 (se = 0.038 )
Rsquare= 0.079 (max possible= 0.993 )
Likelihood ratio test= 10.57 on 3 df, p=0.01426
Wald test = 11.36 on 3 df, p=0.009908
Score (logrank) test = 12.22 on 3 df, p=0.006672
I understand why I would have singularities for species 1 and 2, but not for species 3. Why would the species3:cageY line be singular when there are species 3 nests with nest cages on them?
Is it OK to include species 1 and 2 even though they have those singularities?
Edit: I cannot find any errors in my data. I have decimal numbers for the time variable for a few nests, but that doesn't seem to be a problem for species 3 nests without a nest cage. For species 3, I have the full range of time values for nests with and without a nest cage and I have both true and false events for nests with and without a nest cage.
Edit:
with( nesta1, table(event, species, cage))
, , cage = N
species
event 1 2 3
0 0 1 24
1 0 9 38
, , cage = Y
species
event 1 2 3
0 4 0 26
1 3 0 23
Edit 2: I understand that interaction-only models are not very useful, but the interaction term results behave the same way whether I have other main effects in the model or not. I've removed the other main effects to simplify this question.
Thank you!
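Not a full answer, but one way to see which columns coxph() is dropping, assuming nesta1 as above with species and cage as factors, is to build the cell-indicator design matrix for the interaction yourself:
mm <- model.matrix(~ species:cage - 1, data = nesta1)
colSums(mm)  # all-zero columns are the empty cells (species1:cageN, species2:cageY)
# The non-empty indicators sum to 1 for every nest, and a constant shift of the
# linear predictor cancels out of the Cox partial likelihood, so one further
# column (here species3:cageY) is redundant and also comes back NA.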

Why does R return a low p-value for ANOVA on a set of 1s?

I'm trying to use repeated rounds of ANOVA to sort a large dataset into different categories. For each element in the dataset I have twelve data points, representing three replicates each of four conditions, which arise as the combinations of two two-level variables (v1 and v2). The data are expression levels relative to a control, which means that for the control itself all twelve values are 1:
>at
v1 v2 values
1. a X 1
2. b X 1
3. a X 1
4. b X 1
5. a X 1
6. b X 1
7. a Y 1
8. b Y 1
9. a Y 1
10. b Y 1
11. a Y 1
12. b Y 1
which I analyze this way (the Tukey wrapper gives me information about whether it is up or down, in addition to whether it is different, which is why I'm using it):
stats <- TukeyHSD(aov(values~v1+v2, data=at))
> stats
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = values ~ v1 + v2, data = at)
$v1
diff lwr upr p adj
a-b 4.440892e-16 -1.359166e-16 1.024095e-15 0.1173068
$v2
diff lwr upr p adj
X-Y -4.440892e-16 -1.024095e-15 1.359166e-16 0.1173068
I expected the p-value to be very close or equal to 1, since the null hypothesis that the two groups are the same is clearly correct for both of these tests. Instead the p-value is surprisingly low at 0.117! Clearly the difference and the bounds are tiny (on the order of 1e-16), so I'm guessing the problem has to do with the numbers being stored internally as slightly off from 1, but I'm not sure how to solve it. Any suggestions?
Thanks a lot!
I'm adding some sample data:
aX1 bX1 aX2 bX2 aX3 bX3 aY1 bY1 aY2 bY2 aY3 bY3
element1 0.112 0 0.172 0.072 0.058 0.055 0 0 0.046 0 0.042 0
element2 0.859 0.294 0.565 0 0.669 0 0.11 0 1.707 0 1.324 0
element3 1.255 0.721 3.645 1.636 5.36 6.701 0 0.097 0.533 0.209 0.358 2.219
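The floating-point guess is easy to confirm: values that print as 1 can be off by one part in 2^52 if they were computed rather than typed, and the reported diff of 4.440892e-16 is exactly 2 * .Machine$double.eps.
x <- 0.1 + 0.2
x == 0.3             # FALSE: classic double-precision representation error
x - 0.3              # 5.551115e-17
.Machine$double.eps  # 2.220446e-16; twice this is the 4.440892e-16 diff above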
