Kaplan Meier Survival curve results differ between R and SAS? - r

I'm re-running Kaplan-Meier Survival Curves from previously published data, using the exact data set used in the publication (Charpentier et al. 2008 - Inbreeding depression in ring-tailed lemurs (Lemur catta): genetic diversity predicts parasitism, immunocompetence, and survivorship). This publication ran the curves in SAS Version 9, using LIFETEST, to analyze the age at death structured by genetic heterozygosity and sex of the animal (n=64). She reports a Chi square value of 6.31 and a p value of 0.012; however, when I run the curves in R, I get a Chi square value of 0.9 and a p value of 0.821. Can anyone explain this??
R code used: age is the time to death, mort1 is the censoring indicator, sex is the stratum (sex of the animal), and ho2 is the factor delineating the two groups to be compared.
> survdiff(Surv(age, mort1)~ho2+sex,data=mariekmsurv1)
Call:
survdiff(formula = Surv(age, mort1) ~ ho2 + sex, data = mariekmsurv1)
              N Observed Expected (O-E)^2/E (O-E)^2/V
ho2=1, sex=F 18        3     3.23    0.0166    0.0215
ho2=1, sex=M 12        3     2.35    0.1776    0.2140
ho2=2, sex=F 17        5     3.92    0.3004    0.4189
ho2=2, sex=M 17        4     5.50    0.4088    0.6621
 Chisq= 0.9  on 3 degrees of freedom, p= 0.821
> str(mariekmsurv1)
'data.frame': 64 obs. of 6 variables:
$ id : Factor w/ 65 levels "","aeschylus",..: 14 31 33 30 47 57 51 39 36 3 ...
$ sex : Factor w/ 3 levels "","F","M": 3 2 3 2 2 2 2 2 2 2 ...
$ mort1: int 0 0 0 0 0 0 0 0 0 0 ...
$ age : num 0.12 0.192 0.2 0.23 1.024 ...
$ sex.1: Factor w/ 3 levels "","F","M": 3 2 3 2 2 2 2 2 2 2 ...
$ ho2 : int 1 1 1 2 1 1 1 1 1 2 ...
- attr(*, "na.action")=Class 'omit' Named int [1:141] 65 66 67 68 69 70 71 72 73 74 ...
.. ..- attr(*, "names")= chr [1:141] "65" "66" "67" "68" ...

Some ideas:
Try running it in SAS -- see if you get the same results as the author. Maybe they didn't send you the exact same dataset they used.
Look into the default values of the relevant SAS PROC and compare to the defaults of the R function you are using.

Given the huge difference between the chi-squared values (6.31 vs. 0.9) and the p values (0.012 vs. 0.821) from the SAS and R procedures, I suspect that the wrong variables were used in one of the two analyses.
Procedural or data-handling differences between SAS and R can cause only very small differences in the results.
This is not a software error; it is highly likely to be a human error.
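One detail worth checking along these lines: a chi-square of 6.31 gives p = 0.012 on 1 degree of freedom, whereas the R output in the question is a 3-degree-of-freedom test across the four ho2-by-sex groups. That would be consistent with (though it does not prove) the SAS analysis having compared the two ho2 groups while stratifying by sex. The sketch below illustrates the difference between the two model specifications. The data frame `d` and all its values are simulated for illustration, and the exact SAS statements used in the paper are not known here.

```r
library(survival)

# The published statistic matches a 1-df test rather than a 3-df test
p_1df <- pchisq(6.31, df = 1, lower.tail = FALSE)
p_3df <- pchisq(6.31, df = 3, lower.tail = FALSE)
round(c(p_1df, p_3df), 3)

# Simulated stand-in for the real data (hypothetical values)
set.seed(1)
d <- data.frame(
  age   = rexp(64, rate = 0.1),
  mort1 = rbinom(64, 1, 0.3),
  ho2   = rep(1:2, each = 32),
  sex   = rep(c("F", "M"), times = 32)
)

# Four-group comparison, as in the question: 3 degrees of freedom
fit_groups <- survdiff(Surv(age, mort1) ~ ho2 + sex, data = d)

# Two-group comparison of ho2, stratified by sex: 1 degree of freedom
fit_strat <- survdiff(Surv(age, mort1) ~ ho2 + strata(sex), data = d)

fit_groups
fit_strat
```

If the stratified call on the real data reproduces a chi-square near 6.31, the discrepancy is a matter of model specification rather than of software.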

Related

summary_factorlist, Having Error due variables with less than two levels

I have a large data set with 2000 variables, including factors and continuous variables.
For example:
library(finalfit)
library(dplyr)
data(colon_s)
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = "perfor.factor"
I use the following function to compare the mean of each continuous variable across the levels of the categorical dependent variable (ANOVA), or the percentage of each categorical variable across those levels (chi-square):
summary_factorlist(colon_s, dependent = "perfor.factor", explanatory = explanatory, add_dependent_label = TRUE, p = TRUE, p_cat = "fisher", p_cont_para = "aov", fit_id = TRUE)
But as soon as running the above code, I got the following error:
Error in dplyr::summarise():
! Problem while computing ..1 = ...$p.value.
Caused by error in fisher.test():
! 'x' and 'y' must have at least 2 levels
In the data set, some variables do not have at least two levels, or only one of their levels has a non-zero frequency. I was wondering whether there is a loop or function that removes a variable if any of these conditions holds:
The variable has just one level.
The variable has more than one level, but only one level has a non-zero frequency.
All values of the variable are missing.
Update (partial answer):
With this code we can remove factors with only one level and keep other non factor variables:
x <- colon_s[, (sapply(colon_s, nlevels)>1) | (sapply(colon_s, is.factor)==FALSE)]
The OP's code does work with the data provided
library(dplyr)
library(finalfit)
summary_factorlist(colon_s, dependent ="perfor.factor",
explanatory =explanatory ,
add_dependent_label=TRUE, p=TRUE,p_cat="fisher", p_cont_para = "aov", fit_id = TRUE)
Dependent: Perforation              No          Yes         p     fit_id                index
Age (years)             Mean (SD)   59.8 (11.9) 58.4 (13.3) 0.542 age                   1
Age                     <40 years   68 (7.5)    2 (7.4)     1.000 age.factor<40 years   2
                        40-59 years 334 (37.0)  10 (37.0)         age.factor40-59 years 3
                        60+ years   500 (55.4)  15 (55.6)         age.factor60+ years   4
Sex                     Female      432 (47.9)  13 (48.1)   1.000 sex.factorFemale      5
                        Male        470 (52.1)  14 (51.9)         sex.factorMale        6
Obstruction             No          715 (81.2)  17 (63.0)   0.026 obstruct.factorNo     7
                        Yes         166 (18.8)  10 (37.0)         obstruct.factorYes    8
The structure of the data shows that the factor variables have more than one level:
> str(colon_s[c(explanatory, dependent)])
'data.frame': 929 obs. of 5 variables:
$ age : num 43 63 71 66 69 57 77 54 46 68 ...
..- attr(*, "label")= chr "Age (years)"
$ age.factor : Factor w/ 3 levels "<40 years","40-59 years",..: 2 3 3 3 3 2 3 2 2 3 ...
..- attr(*, "label")= chr "Age"
$ sex.factor : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 2 2 2 1 ...
..- attr(*, "label")= chr "Sex"
$ obstruct.factor: Factor w/ 2 levels "No","Yes": NA 1 1 2 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Obstruction"
$ perfor.factor : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "label")= chr "Perforation"
To select only the factor variables that meet the conditions mentioned (note that this drops every non-factor column), we could use
library(dplyr)
colon_s_sub <- colon_s %>%
  select(where(~ is.factor(.x) && nlevels(.x) > 1 && all(table(.x) > 0) &&
    sum(complete.cases(.x)) > 0))
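If the continuous variables should be kept as well, one alternative is to test every column for at least two distinct observed values. The base R sketch below covers the three cases listed in the question. The helper name `has_two_observed_values` and the toy data frame are made up for illustration. Note that this rule also drops numeric columns that are constant, which is an assumption about what is wanted.

```r
# Keep a column only if it has at least two distinct non-missing values.
# This covers single-level factors, factors where only one level actually
# occurs, and columns that are entirely NA.
has_two_observed_values <- function(x) {
  length(unique(x[!is.na(x)])) > 1
}

# Toy data (hypothetical) with one example of each problem case
df <- data.frame(
  ok_factor   = factor(c("a", "b", "a", "b")),
  one_level   = factor(c("a", "a", "a", "a")),
  empty_level = factor(c("a", "a", "a", "a"), levels = c("a", "b")),
  all_missing = factor(c(NA, NA, NA, NA), levels = c("a", "b")),
  numeric_var = c(1.5, 2.0, 3.5, 4.0)
)

keep <- vapply(df, has_two_observed_values, logical(1))
df_clean <- df[, keep, drop = FALSE]
names(df_clean)
```

The vector of explanatory variables can then be rebuilt from `names(df_clean)` (minus the dependent variable) before calling `summary_factorlist()`.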

clmm model summary does not show p-values

I run the following model in R:
clmm_br<-clmm(Grado_amenaza~Life_Form + size_max_cm +
leaf_length_mean + petals_length_mean +
silicua_length_mean + bloom_length + categ_color+ (1|Genero) ,
data=brasic1)
I didn't get any warnings or errors but when I run summary(clmm_br) I can't get the p-values:
summary(clmm_br)
Cumulative Link Mixed Model fitted with the Laplace approximation
formula: Grado_amenaza ~ Life_Form + size_max_cm + leaf_length_mean +
petals_length_mean + silicua_length_mean + bloom_length +
categ_color + (1 | Genero)
data: brasic1
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 76 -64.18 160.36 1807(1458) 1.50e-03 NaN
Random effects:
Groups Name Variance Std.Dev.
Genero (Intercept) 0.000000008505 0.00009222
Number of groups: Genero 39
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Life_Form[T.G] 2.233338 NA NA NA
Life_Form[T.Hem] 0.577112 NA NA NA
Life_Form[T.Hyd] -22.632916 NA NA NA
Life_Form[T.Th] -1.227512 NA NA NA
size_max_cm 0.006442 NA NA NA
leaf_length_mean 0.008491 NA NA NA
petals_length_mean 0.091623 NA NA NA
silicua_length_mean -0.036001 NA NA NA
bloom_length -0.844697 NA NA NA
categ_color[T.2] -2.420793 NA NA NA
categ_color[T.3] 1.268585 NA NA NA
categ_color[T.4] 1.049953 NA NA NA
Threshold coefficients:
Estimate Std. Error z value
1|3 -1.171 NA NA
3|4 1.266 NA NA
4|5 1.800 NA NA
(4 observations deleted due to missingness)
I tried with no random effects and excluding the rows with NAs but it's the same.
The structure of my data:
str(brasic1)
tibble[,13] [80 x 13] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:80] 135 137 142 145 287 295 585 593 646 656 ...
$ Genero : chr [1:80] "Alyssum" "Alyssum" "Alyssum" "Alyssum" ...
$ Cons.stat : chr [1:80] "LC" "VU" "VU" "LC" ...
$ Amenazada : num [1:80] 0 1 1 0 1 0 0 1 0 0 ...
$ Grado_amenaza : Factor w/ 5 levels "1","3","4","5",..: 1 2 2 1 4 1 1 2 1 1 ...
$ Life_Form : chr [1:80] "Th" "Hem" "Hem" "Th" ...
$ size_max_cm : num [1:80] 12 6 7 15 20 27 60 62 50 60 ...
$ leaf_length_mean : num [1:80] 7.5 7 11 14.5 31.5 45 90 65 65 39 ...
$ petals_length_mean : num [1:80] 2.2 3.5 5.5 2.55 6 8 10.5 9.5 9.5 2.9 ...
$ silicua_length_mean: num [1:80] 3.5 4 3.5 4 25 47.5 37.5 41.5 17.5 2.9 ...
$ X2n : num [1:80] 32 NA 16 16 NA NA 20 20 18 14 ...
$ bloom_length : num [1:80] 2 1 2 2 2 2 2 2 11 2 ...
$ categ_color : chr [1:80] "1" "4" "4" "4" ...
For a full answer we really need a reproducible example, but I can point to a few things that raise suspicions.
The fact that you get estimates but not standard errors implies that something is wrong with the Hessian (the estimate of the curvature of the log-likelihood surface at the maximum likelihood estimate). There are several distinct, possibly overlapping, explanations:
Any time you have a "large" parameter estimate (say, absolute value > 10), as in your example (Life_Form[T.Hyd] = -22.632916), it suggests complete separation, i.e. that predictor level perfectly predicts the response. (You can search for that term, e.g. on CrossValidated.) However, complete separation usually leads to absurdly large standard errors (along with the large parameter estimates) rather than to NAs.
You may have perfect multicollinearity, i.e. combinations of your predictor variables that are perfectly (jointly) correlated with other such combinations. Some R estimation procedures can detect and deal with this case (typically by dropping one or more predictors), but clmm might not be able to. (You should be able to construct your model matrix (X <- model.matrix(your_formula, your_data), excluding the random effect from the formula) and then use caret::findLinearCombos(X) to explore this issue.)
More generally, if you want to do reliable inference you may need to cut down the size of your model (not by stepwise or other forms of model selection); a rule of thumb is that you need 10-20 observations per parameter estimated. You're trying to estimate 12 fixed effect parameters plus a few more (ordinal-threshold parameters and random effect variance) from 80 observations (76 after rows with missing values are dropped) ...
In addition to dropping random effects, it may help the diagnosis to fit a regular linear model with lm() (which should tell you something about collinearity, by dropping parameters) or a binomial model based on some threshold grade values (which might help with identifying complete separation).
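Both checks can be sketched in base R. The example below uses simulated data with two problems built in on purpose: a predictor that is an exact linear combination of two others, and a factor level that only ever occurs with one response grade. All names and values are hypothetical, and the rank comparison via `qr()` is used here as a dependency-free stand-in for `caret::findLinearCombos()`.

```r
# Simulated data (hypothetical) with two deliberate problems
set.seed(42)
n <- 60
dat <- data.frame(
  y  = sample(1:3, n, replace = TRUE),   # ordinal grade, kept as integer here
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
dat$x3 <- dat$x1 + 2 * dat$x2            # exact linear combination of x1 and x2
dat$y[dat$g == "C"] <- 1L                # level C only ever has grade 1

# 1. Perfect multicollinearity: compare the column count with the matrix rank
X <- model.matrix(~ x1 + x2 + x3 + g, data = dat)
qr_X <- qr(X)
rank_deficient <- qr_X$rank < ncol(X)
rank_deficient
colnames(X)[qr_X$pivot[-seq_len(qr_X$rank)]]  # columns flagged as redundant

# 2. Separation: cross-tabulate each categorical predictor against the response
tab <- table(dat$g, dat$y)
tab
rows_single_outcome <- rownames(tab)[rowSums(tab > 0) == 1]
rows_single_outcome
```

On real data, the same cross-tabulation for each categorical predictor (for example Life_Form against Grado_amenaza) would show whether a level such as Hyd has very few observations or a single response grade.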

C5.0 algorithm not working due to logical factor, solutions?

This question has been asked before, however it was not answered in a way that solved my problem. The question was also slightly different.
I am trying to build a decision tree model using the C50 package. I am trying to predict whether MMA fighters have championship potential (this is a logical factor with 2 levels, yes/no).
Originally this column was a boolean, but I converted it to a factor using
fighters_clean$championship_potential <- as.factor(fighters_clean$championship_potential)
table(fighters_clean$championship_potential)
#Rename binary outcome
fighters_clean$championship_potential <- factor(fighters_clean$championship_potential,
levels = c("TRUE", "FALSE"), labels = c("YES", "NO"))
On my data frame it says "Factor with 2 levels", which should work as the classifier for a C5.0 decision tree; however, I keep getting this error message.
Error in UseMethod("QuinlanAttributes") :
no applicable method for 'QuinlanAttributes' applied to an object of class "logical"
The code for my model is below.
#Lets use a decision tree to see what fighters have that championship potential
table(fighters_clean$championship_potential)
#FALSE TRUE
#2578 602
#create test and training data
#set seed alters the random number generator so that it is random but repeatable, the number is arbitrary.
set.seed(123)
Tree_training <- sample(3187, 2868)
str(Tree_training)
#So what this does is it creates a vector of 2868 random integers.
#We use this vector to split our data into training and test data
#it should be a representative 90/10 split.
Tree_Train <- fighters_clean[Tree_training, ]
Tree_Test <- fighters_clean[-Tree_training, ]
#That worked, sweet.
#Now lets see if they are representative.
#Should be even number of champ potential in both data sets,
prop.table(table(Tree_Train$championship_potential))
prop.table(table(Tree_Test$championship_potential))
#awesome so thats a perfect split, with each data set having 18% champions.
#C5.0 is a decision tree algorithm available in R through the C50 package
#We will use this to build a decision tree.
str(Tree_Train)
'data.frame': 2868 obs. of 12 variables:
$ name : chr "Jesse Juarez" "Milton Vieira" "Joey Gomez" "Gilbert Smith" ...
$ SLpM : num 1.71 1.13 2.93 1.09 5.92 0 0 1.2 0 2.11 ...
$ Str_Acc : num 48 35 35 41 51 0 0 33 0 50 ...
$ SApM : num 2.87 2.36 4.03 2.73 3.6 0 0 1.73 0 1.89 ...
$ Str_Def : num 52 48 53 35 55 0 0 73 0 63 ...
$ TD_Avg : num 2.69 2.67 1.15 3.51 0.44 0 0 0 0 0.19 ...
$ TD_Acc : num 33 53 37 60 33 0 0 0 0 40 ...
$ TD_Def : num 50 12 50 0 70 0 0 50 0 78 ...
$ Sub_Avg : num 0 0.7 0 1.2 0.4 0 0 0 0 0.3 ...
$ Win_percentage : num 0.667 0.565 0.875 0.714 0.8 ...
$ championship_potential: Factor w/ 2 levels "YES","NO": 2 2 1 2 2 2 1 2 2 2 ...
$ contender : logi FALSE FALSE TRUE TRUE TRUE TRUE ...
library(C50)
DTModel <- C5.0(Tree_Train [-11], Tree_Train$championship_potential, trials = 1, costs = NULL)
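The error message names class "logical", and the str() output above shows that contender is still a logical column among the predictors passed to C5.0(). As far as I know, C50 has no handling for logical predictors, so converting them to factors is one thing to try. A minimal sketch on toy data follows. The column names follow the question, but the values are made up, and dropping the name column from the predictors is an extra assumption (it is an identifier rather than a feature).

```r
# Toy stand-in for Tree_Train (hypothetical values; column names follow the question)
toy <- data.frame(
  name                   = c("A", "B", "C", "D", "E", "F"),
  SLpM                   = c(1.7, 1.1, 2.9, 1.0, 5.9, 0.0),
  Win_percentage         = c(0.67, 0.57, 0.88, 0.71, 0.80, 0.50),
  championship_potential = factor(c("NO", "NO", "YES", "NO", "YES", "NO"),
                                  levels = c("YES", "NO")),
  contender              = c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors       = FALSE
)

# Convert every logical column to a factor with explicit levels
is_lgl <- vapply(toy, is.logical, logical(1))
toy[is_lgl] <- lapply(toy[is_lgl], factor, levels = c(FALSE, TRUE))

# Drop the outcome and the identifier column from the predictors
predictors <- toy[, setdiff(names(toy), c("championship_potential", "name"))]
vapply(predictors, function(col) class(col)[1], character(1))

# Fit only if the C50 package is installed
if (requireNamespace("C50", quietly = TRUE)) {
  fit <- C50::C5.0(x = predictors, y = toy$championship_potential, trials = 1)
}
```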

Predicting species presence with a random forest model based on unbalanced training data

I want to build a species distribution model using random-forest:
My training data consists of 971 records of species presence (71)/absence (900) and three environmental variables at systematically sampled points (4*4m, random starting point).
Training data:
str(train)
'data.frame': 971 obs. of 4 variables:
$ presence: num 0 0 0 0 0 0 0 0 0 0 ...
$ v1 : num 0.18 0.18 0.24 0.24 0.75 0.7 0.27 0 0.29 0.77 ...
$ v2 : num 10 110 19 99 97 71 64 45 54 74 ...
$ v3 : Factor w/ 3 levels "cat1","cat2",..: 1 1 1 1 2 2 2 3 1 2 ...
model:
model <- randomForest(as.factor(presence) ~ v1 + v2 + v3, data = train)
My test data (test) consists of 1019 records of the same variables including their coordinates at location B. Additionally I have mapped the species occurrence at location B. So I applied the model on that data:
prediction <- predict(model, newdata = test, type="prob")
I set type="prob" because I wanted to predict the occurrence probability of the species.
The data I generated and want to test against the observed occurrence looks like this:
str(prediction_data)
'data.frame': 1019 obs. of 16 variables:
$ x : num 180574 180575 180576 180576 180576 ...
$ y : num 226954 226953 226951 226953 226955 ...
$ v1 : num 0.1131 0.5996 0.7187 0.5885 0.0611 ...
$ v2 : num 10 110 19 99 97 71 64 45 54 74 ...
$ v3 : int 1 1 1 1 1 1 1 1 2 1 ...
$ occurrence_prob : num 0.3252 0.1826 0.0909 0.1014 0.4195 ...
Now my doubt is whether it makes sense to account for the unbalanced training data and try to improve the sensitivity of the prediction with parameters such as sampsize = c(71, 71) or classwt = c(0.5, 0.5) in the model-building function, given that I also want to set a probability threshold for classifying species presence by analyzing the receiver operating characteristic (ROC) curve.
Would this improve the model sensitivity, be redundant or make things worse?
I would really appreciate any thoughts, advice, opinion, hint. Unfortunately I do not know anyone with whom I could discuss my doubts in person. Thank you!
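For reference, the thresholding step described above can be sketched in base R: compute sensitivity and specificity at each candidate cut-off and pick one by some criterion. The sketch uses Youden's J (sensitivity + specificity - 1), which is only one common choice and an assumption here, and the observed and predicted values are simulated rather than taken from the question.

```r
# Simulated observed occurrence and predicted probabilities (hypothetical values)
set.seed(7)
observed <- rbinom(1000, 1, 0.08)
occurrence_prob <- ifelse(observed == 1,
                          rbeta(1000, 3, 4),   # presences: higher scores on average
                          rbeta(1000, 1, 6))   # absences: lower scores on average

# Sensitivity and specificity at each candidate threshold
thresholds <- sort(unique(occurrence_prob))
roc <- t(vapply(thresholds, function(th) {
  predicted <- occurrence_prob >= th
  c(threshold   = th,
    sensitivity = sum(predicted & observed == 1) / sum(observed == 1),
    specificity = sum(!predicted & observed == 0) / sum(observed == 0))
}, numeric(3)))

# Youden's J = sensitivity + specificity - 1; pick the threshold that maximises it
youden <- roc[, "sensitivity"] + roc[, "specificity"] - 1
best <- roc[which.max(youden), ]
best
```

Running this once on predictions from a model fitted with default settings and once on predictions from a model fitted with sampsize or classwt would show whether rebalancing changes the sensitivity achieved at the chosen threshold on the test data.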

Visualising logistic regression using the effects package in R

I am using the effects package in R to plot the effects of categorical and numerical predictors in a binomial logistic regression estimated using the lme4 package. My dependent variable is the presence or absence of a virus in an individual animal and my predictive factors are various individual traits (eg. sex, age, month/year captured, presence of parasites, scaled mass index (SMI), with site as a random variable).
When I use the allEffects function on my regression, I get the plots below. When compared to the model summary output below, you can see that the slope of each line appears to be zero, regardless of the estimated coefficients, and there is something strange going on with the scale of the y-axes where the ticks and tick labels appear to be overwritten on the same point.
Here is my code for the model and the summary output:
library(lme4)
library(effects)
virus1.mod<-glmer(virus1~ age + sex + month.yr + parasite + SMI + (1|site) , data=virus1data, family=binomial)
virus1.effects<-allEffects(virus1.mod)
plot(virus1.effects, ylab="Probability(infected)", rug=FALSE)
> summary(virus1.mod)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: binomial ( logit )
Formula: virus1 ~ age + sex + month.yr + parasite + SMI + (1 | site)
Data: virus1data
AIC BIC logLik deviance
189.5721 248.1130 -76.7860 153.5721
Random effects:
Groups Name Variance Std.Dev.
site (Intercept) 4.729e-10 2.175e-05
Number of obs: 191, groups: site, 6
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.340e+00 2.572e+00 2.076 0.03789 *
ageJ 1.126e+00 8.316e-01 1.354 0.17583
sexM -3.943e-02 4.562e-01 -0.086 0.93113
month.yrFeb-08 -2.259e+01 6.405e+04 0.000 0.99972
month.yrFeb-09 -2.201e+01 2.741e+04 -0.001 0.99936
month.yrJan-08 -2.516e+00 8.175e-01 -3.078 0.00208 **
month.yrJan-09 -2.607e+00 8.066e-01 -3.232 0.00123 **
month.yrJul-08 -1.428e+00 8.571e-01 -1.666 0.09563 .
month.yrJul-09 -2.795e+00 1.170e+00 -2.389 0.01691 *
month.yrJun-08 -2.259e+01 3.300e+04 -0.001 0.99945
month.yrMar-09 -5.451e-01 6.705e-01 -0.813 0.41622
month.yrMar-08 -1.863e+00 7.921e-01 -2.352 0.01869 *
month.yrMay-09 -6.319e-01 8.956e-01 -0.706 0.48047
month.yrMay-08 3.818e-01 1.015e+00 0.376 0.70691
month.yrSep-08 2.563e+01 5.806e+05 0.000 0.99996
parasiteTRUE -6.329e-03 4.834e-01 -0.013 0.98955
SMI -3.438e-01 1.616e-01 -2.127 0.03342 *
And str of my data frame:
> str(virus1data)
'data.frame': 191 obs. of 8 variables:
$ virus1 : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
$ age : Factor w/ 2 levels "A","J": 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 2 1 2 2 ...
$ site : Factor w/ 6 levels "site1","site2","site3",..: 1 1 1 1 2 2 2 3 2 3 ...
$ rep : Factor w/ 7 levels "NRF","L","NR",..: 3 7 3 7 1 1 3 1 7 7 ...
$ month.yr : Factor w/ 17 levels "Feb-08","Feb-09",..: 4 5 5 5 13 7 14 9 9 9 ...
$ parasite : Factor w/ 2 levels "FALSE","TRUE": 1 1 2 1 1 2 2 1 2 1 ...
$ SMI : num 14.1 14.8 14.5 13.1 15.3 ...
- attr(*, "na.action")=Class 'omit' Named int [1:73] 6 12 13 21 22 23 24 25 26 27 ...
.. ..- attr(*, "names")= chr [1:73] "1048" "1657" "1866" "2961" ...
Without making my actual data available, does anyone have an idea of what might be causing this? I have used this function with a different dataset (same independent variables but a different virus as the response variable, and different records) without problems.
This is the first time I have posted on CV, so I hope that the question is appropriate and that I have provided enough (and the right) information.
