"No starting estimate was successful" error with coxme after subsetting data

I have a large dataset, from which I created a new dataset by subsetting.
The following code works perfectly on the full dataset:
require(sjPlot);require(coxme)
tab_model(coxme(Surv(comp2_years, comp2)~FEMALE+(1|TRIAL), data))
But when I ran the same model on the subsetted dataset, using the following code,
www<- subset(data, (data$TRIAL != 5 & data$Sex.standerd.BMI.gpM1F2 >=1))
tab_model(coxme(Surv(comp2_years, comp2)~FEMALE+(1|TRIAL), www))
it gave me the following error:
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
No starting estimate was successful
This is my new data structure
str(www)
Classes ‘data.table’ and 'data.frame': 7576 obs. of 79 variables:
$ TRIAL : num 1 1 1 1 1 1 1 1 1 1 ...
$ FEMALE : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ type_comp2 : chr "0" "0" "Revasc" "0" ...
$ comp2 : num 0 0 1 0 0 0 0 0 0 1 ...
$ comp2_years : num 10 10 9.77 10 10 ...
$ Sex.standerd.BMI.gpM1F2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ Trial1_4.MiddleBMI : num 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
I saw this post but I could not solve my current problem.
Any advice will be greatly appreciated.

Add the droplevels() command to your subset.

This happened to me too, and I found that using droplevels() to forget about the levels you did not include in the subset solved it:
library(survival)
library(coxme)
Convert ph.ecog from numeric to a factor to illustrate the point:
lung$ph.ecog <- as.factor(lung$ph.ecog)
(fit <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lung))
Works well for the full data set. Subset out some levels of ph.ecog, and it gives this error:
lunga <- subset(lung, !ph.ecog %in% c(2, 3))
(fita <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lunga))
Error in coxme.fit(X, Y, strats, offset, init, control, weights = weights, :
No starting estimate was successful
Using droplevels() to forget about empty levels allows coxme to fit again:
lungb <- droplevels(subset(lung, !ph.ecog %in% c(2, 3)))
(fitb <- coxme(Surv(time, status) ~ ph.ecog + age + (1|inst), lungb))
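Why this helps can be seen on a small factor, independently of coxme. The following is a minimal sketch with a made-up factor `f` (not taken from the question's data). It only illustrates that subsetting keeps unused levels attached and that droplevels() removes them.

```r
# Hypothetical toy factor, for illustration only
f <- factor(c("0", "1", "2", "3", "1", "0"))

# Subsetting removes the values, but the unused levels stay attached
f_sub <- f[!f %in% c("2", "3")]
levels(f_sub)
table(f_sub)            # the counts for "2" and "3" are zero

# droplevels() discards levels that no longer occur in the data
f_clean <- droplevels(f_sub)
levels(f_clean)
```

Applied to the question's data, the equivalent would presumably be wrapping the subset() call in droplevels(); whether that resolves the error there depends on which factor levels remain in the subset.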

Related

C-index for each treatment arm with variable-treatment interaction

I have difficulty calculating the C-index (UnoC with survAUC R package) for each treatment arm to assess the variable-treatment interaction.
I have a database with 4 explanatory variables X1, X2, X3, X4, as follows:
> str(data)
'data.frame': 1000 obs. of 7 variables:
$ X1 : num -0.578 0.351 0.759 -0.858 -1.022 ...
$ X2 : num -0.7897 0.0339 -1.608 -1.1642 -0.0787 ...
$ X3 : num -0.1561 -0.7147 -0.8229 -0.1519 -0.0318 ...
$ X4 : num 1.4161 -0.0688 -0.155 -0.1571 -0.649 ...
$ TRT : num 0 0 0 0 0 0 0 1 0 1 ...
$ time: num 6.52 2.15 3 1.31 1.56 ...
$ stat: num 1 1 1 1 1 1 1 1 1 1 ...
The variable X4 interacts with the treatment variable and I don't have censored data.
I would like to calculate the C-index (UnoC) for each treatment arm. I expect the C-index to be equal to 0.5 in the control arm and much higher in the experimental arm.
But I get almost the same value for both arms!
Can anyone confirm that: if I have a strong interaction between a variable and the treatment, the C-index in the experimental arm is high and in the control arm = 0.5?
Here is my attempt:
library(survival)  # Surv(), coxph()
library(survAUC)   # UnoC()
TR <- data[1:500,]
TE <- data[501:1000,]
s <- Surv(TR$time, TR$stat)
sNew <- Surv(TE$time, TE$stat)
train.fit <- coxph(Surv(time, stat) ~ X4, data=TR)
lpnew <- predict(train.fit, newdata=TE)
# The C-index for each treatment arm
UnoC(Surv.rsp = s[TR$TRT == 1], Surv.rsp.new = sNew[TE$TRT == 1], lpnew = lpnew[TE$TRT == 1])
[1] 0.7577109
UnoC(Surv.rsp = s[TR$TRT == 0], Surv.rsp.new = sNew[TE$TRT == 0], lpnew = -lpnew[TE$TRT == 0])
[1] 0.7295202
Thank you for your Help

"contrasts can be applied only to factors with 2 or more levels" Despite having multiple levels in each factor

I am running a two-way mixed ANOVA on the data below, with one dependent variable, one between-subjects variable and one within-subjects variable. When I tested the residuals of the dependent variable for normality, I found that they are not normally distributed. At this point I am still able to perform the two-way ANOVA. However, when I apply a log10 transformation and run the script again using the log-transformed variable, I get the error "contrasts can be applied only to factors with 2 or more levels".
> str(m_runjumpFREQ)
'data.frame': 564 obs. of 8 variables:
$ ID1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ Group : Factor w/ 2 levels "II","Non-II": 1 1 1 1 1 1 1 1 1 1 ...
$ Pos : Factor w/ 3 levels "center","forward",..: 2 1 2 3 2 2 1 3 2 2 ...
$ Match_outcome : Factor w/ 2 levels "W","L": 2 2 2 2 2 2 2 2 2 1 ...
$ time : Factor w/ 8 levels "runjump_nADJmin_q1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ runjump : num 0.0561 0.0858 0.0663 0.0425 0.0513 ...
$ log_runjumpFREQ: num -1.25 -1.07 -1.18 -1.37 -1.29 ...
Some answers on StackOverflow about this error mention that it occurs when one or more factors in the data set used for the ANOVA have fewer than two levels. But as seen above, that is not the case here.
Another explanation I have read is that missing values (NAs) may be the issue. There are some:
m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 88
However, I get the same error even after removing the rows including NA's as follows.
> m_runjumpFREQ <- na.omit(m_runjumpFREQ)
> m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 0
I can run the same script without the log transformation and it works, but with it I get the error. The factors are the same either way, and the missing values do not make a difference. Either I am making a crucial mistake or the issue is in the log transformation lines below.
log_runjumpFREQ <- log10(m_runjumpFREQ$runjump)
m_runjumpFREQ <- cbind(m_runjumpFREQ, log_runjumpFREQ)
I appreciate the help.
It is not good enough that the factors have 2 levels. In addition those levels must actually be present. For example, below f has 2 levels but only 1 is actually present.
y <- (1:6)^2
x <- 1:6
f <- factor(rep(1, 6), levels = 1:2)
nlevels(f) # f has 2 levels
## [1] 2
lm(y ~ x + f)
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
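To find which factor is responsible, it can help to count the levels that actually occur in each factor column. This is a sketch on a made-up data frame `d`; the helper name `levels_present` is hypothetical and not part of any package.

```r
# Hypothetical data: g declares two levels but only one of them occurs
d <- data.frame(
  y = (1:6)^2,
  g = factor(rep("a", 6), levels = c("a", "b")),
  h = factor(rep(c("u", "v"), 3))
)

# Number of levels that actually appear in each factor column
levels_present <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  vapply(df[is_fac], function(f) nlevels(droplevels(f)), integer(1))
}

levels_present(d)
```

Any factor reported with fewer than 2 levels will trigger the contrasts error. It is worth running such a check on the rows that actually enter the model, since removing rows with missing values may leave a factor with a single observed level.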

Error in cross validation with factor value

I have this code:
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(is_nocnv ~., data = mydata, method = "lm", trControl = train.control)
# Summarize the results
print(model)
When I execute this code I obtain this error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
The field is_nocnv is a factor with the values 'YES' and 'NO'.
str(mydata)
'data.frame': 8334 obs. of 7 variables:
$ chr : Factor w/ 1 level "chr1": 1 1 1 1 1 1 1 1 1 1 ...
$ start : int 3218610 154080441 154089408 61735 2069681 2074104 3135175 3137913 3214732 5901288 ...
$ stop : int 154074261 154081058 247813706 2061969 2071738 3130590 3136858 3212946 5900106 5902086 ...
$ strand : Factor w/ 1 level "*": 1 1 1 1 1 1 1 1 1 1 ...
$ num_probes : int 69643 3 59364 379 2 333 2 33 1943 3 ...
$ segment_mean: num -0.122 -13.462 -0.1 -0.326 -25.242 ...
$ is_nocnv : Factor w/ 2 levels "NO","YES": 2 2 2 1 1 1 1 1 1 1 ...
Here is a small part of my dataset (CSV):
"chr","start","stop","strand","num_probes","segment_mean","is_nocnv"
chr1,3218610,154074261,*,69643,-0.122,YES
chr1,154080441,154081058,*,3,-13.462,YES
chr1,154089408,247813706,*,59364,-0.1003,YES
chr1,61735,2061969,*,379,-0.326,NO
chr1,2069681,2071738,*,2,-25.242,NO
chr1,2074104,3130590,*,333,-0.3957,NO
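Judging only from the str() output above, chr and strand each have a single level, which is the situation the error message describes. Below is a minimal sketch, on a made-up data frame with some of the same column names, of dropping such constant factor columns before fitting. It addresses only the contrasts error and says nothing about whether method = "lm" suits a factor outcome.

```r
# Toy stand-in for mydata (values copied from the CSV excerpt above)
mydata <- data.frame(
  chr = factor(rep("chr1", 6)),
  start = c(3218610, 154080441, 154089408, 61735, 2069681, 2074104),
  strand = factor(rep("*", 6)),
  is_nocnv = factor(c("YES", "YES", "YES", "NO", "NO", "NO"))
)

# Flag columns that are factors with fewer than two observed levels
single_level <- vapply(
  mydata,
  function(col) is.factor(col) && nlevels(droplevels(col)) < 2,
  logical(1)
)

# Keep everything else
mydata_ok <- mydata[, !single_level, drop = FALSE]
names(mydata_ok)
```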

R: Error in if (any(y < 0)) stop("negative values not allowed for the 'Poisson' family")

I tried to use glm to estimate soccer teams' strengths.
# data is a data frame (structure shown below).
model <- glm(Goals ~ Home + Team + Opponent, family=poisson(link=log), data=data)
but I get the error:
Error in if (any(y < 0)) stop("negative values not allowed for the 'Poisson' family") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(y, 0) : ‘<’ not meaningful for factors
data:
> data
Team Opponent Goals Home
1 5a51f2589d39c31899cce9d9 5a51f2579d39c31899cce9ce 3 1
2 5a51f2579d39c31899cce9ce 5a51f2589d39c31899cce9d9 0 0
3 5a51f2589d39c31899cce9da 5a51f2579d39c31899cce9cd 3 1
4 5a51f2579d39c31899cce9cd 5a51f2589d39c31899cce9da 0 0
> is.factor(data$Goals)
[1] TRUE
From the "Details" section of the documentation for the glm() function:
A typical predictor has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response.
So you want to make sure your Goals column is numeric:
df <- data.frame(
  Team = c("5a51f2589d39c31899cce9d9", "5a51f2579d39c31899cce9ce", "5a51f2589d39c31899cce9da", "5a51f2579d39c31899cce9cd"),
  Opponent = c("5a51f2579d39c31899cce9ce", "5a51f2589d39c31899cce9d9", "5a51f2579d39c31899cce9cd", "5a51f2589d39c31899cce9da"),
  Goals = c(3, 0, 3, 0),
  Home = c(1, 0, 1, 0),
  stringsAsFactors = TRUE  # needed in R >= 4.0 to get the factor columns shown by str() below
)
str(df)
#'data.frame': 4 obs. of 4 variables:
# $ Team : Factor w/ 4 levels "5a51f2579d39c31899cce9cd",..: 3 2 4 1
# $ Opponent: Factor w/ 4 levels "5a51f2579d39c31899cce9cd",..: 2 3 1 4
# $ Goals : num 3 0 3 0
# $ Home : num 1 0 1 0
model <- glm(Goals ~ Home + Team + Opponent, family=poisson(link=log), data=df)
Then here is the output:
> model
Call: glm(formula = Goals ~ Home + Team + Opponent, family = poisson(link = log),
data = df)
Coefficients:
(Intercept) Home Team5a51f2579d39c31899cce9ce
-2.330e+01 2.440e+01 -3.089e-14
Team5a51f2589d39c31899cce9d9 Team5a51f2589d39c31899cce9da Opponent5a51f2579d39c31899cce9ce
-6.725e-15 NA NA
Opponent5a51f2589d39c31899cce9d9 Opponent5a51f2589d39c31899cce9da
NA NA
Degrees of Freedom: 3 Total (i.e. Null); 0 Residual
Null Deviance: 8.318
Residual Deviance: 3.033e-10 AIC: 13.98
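If Goals is already stored as a factor, as is.factor(data$Goals) suggests in the question, it has to be converted through its labels: as.numeric() applied directly to a factor returns the internal level codes. A small sketch with a made-up factor:

```r
# Hypothetical goals column that was read in as a factor
goals <- factor(c("3", "0", "3", "0"))

as.numeric(goals)                 # level codes, not goals: 2 1 2 1
as.numeric(as.character(goals))   # the actual values: 3 0 3 0
```

In the question this would amount to data$Goals <- as.numeric(as.character(data$Goals)) before calling glm().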

Between/within standard deviations

When working with a hierarchical/multilevel/panel dataset, it can be very useful to have a package that returns the within- and between-group standard deviations of the available variables.
In Stata, with the data below, this is easily done with the command
xtsum, i(momid)
I have searched, but I cannot find any R package that does this.
edit:
Just to fix ideas, an example of hierarchical dataset could be this:
son_id mom_id hispanic mom_smoke son_birthweigth
1 1 1 1 3950
2 1 1 0 3890
3 1 1 0 3990
1 2 0 1 4200
2 2 0 1 4120
1 3 0 0 2975
2 3 0 1 2980
The "multilevel" structure is given by the fact that each mother (higher level) has two or more sons (lower level). Hence, each mother defines a group of observations.
Accordingly, each variable in the dataset can vary either both between and within mothers, or only between mothers. son_birthweigth varies between mothers, but also within the same mother. In contrast, hispanic is fixed for a given mother.
For example, the within-mother variance of son_birthweigth is:
# per-mother means
bwt_mean1 <- (3950+3890+3990)/3
bwt_mean2 <- (4200+4120)/2
bwt_mean3 <- (2975+2980)/2
# Within-mother variance for birthweigth
((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 +
(4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 +
(2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)
While the between-mother variance is:
# overall mean of birthweigth:
# grand_mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
grand_mean <- (3950+3890+3990+4200+4120+2975+2980)/7
# between variance:
((bwt_mean1-grand_mean)^2 + (bwt_mean2-grand_mean)^2 + (bwt_mean3-grand_mean)^2)/(3-1)
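The hand calculation above can also be written against a data frame rather than typed-out numbers. The following is a sketch in base R using the seven example rows. The object names (`dat`, `within_var`, `between_var`) are illustrative, and the divisors follow the formulas in the question (N - 1 for within, number of mothers - 1 for between).

```r
# Example rows from the question
dat <- data.frame(
  son_id = c(1, 2, 3, 1, 2, 1, 2),
  mom_id = c(1, 1, 1, 2, 2, 3, 3),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)

x <- dat$son_birthweigth

# Within: deviations of each son from his own mother's mean
mom_mean_per_row <- ave(x, dat$mom_id, FUN = mean)
within_var <- sum((x - mom_mean_per_row)^2) / (length(x) - 1)

# Between: deviations of the mother means from the overall mean
mom_means <- tapply(x, dat$mom_id, mean)
between_var <- sum((mom_means - mean(x))^2) / (length(mom_means) - 1)

sqrt(c(within = within_var, between = between_var))  # standard deviations
```

Because ave() and tapply() group by mom_id, the same lines work when mothers have different numbers of sons.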
I don't know what your Stata command is supposed to produce, but to answer the second part of the question, about
hierarchical structure: this is easy to do with a list.
For example, you define a structure like this:
tree = list(
"var1" = list(
"panel" = list(type ='p',mean = 1,sd=0)
,"cluster" = list(type = 'c',value = c(5,8,10)))
,"var2" = list(
"panel" = list(type ='p',mean = 2,sd=0.5)
,"cluster" = list(type="c",value =c(1,2)))
)
To create this, lapply is convenient for working with lists:
tree <- lapply(list('var1','var2'),function(x){
ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
cluster= list(type = 'c',value = rnorm(3))) ## R prefer symbols
})
names(tree) <-c('var1','var2')
You can view the structure with str:
str(tree)
List of 2
$ var1:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num 0.284
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
$ var2:List of 2
..$ panel :List of 3
.. ..$ type: chr "p"
.. ..$ mean: num -0.144
.. ..$ sd : num 0
..$ cluster:List of 2
.. ..$ type : chr "c"
.. ..$ value: num [1:3] -0.595 -1.795 -0.439
Edit after OP clarification
I think the reshape2 package is what you want. I will demonstrate it here.
The idea is that, in order to do the multilevel analysis, we need to reshape the data.
First, divide the variables into two groups: identifier variables and measured variables.
library(reshape2)
dat.m <- melt(dat,id.vars=c('son_id','mom_id')) ## other columns are measured
str(dat.m)
'data.frame': 21 obs. of 4 variables:
$ son_id : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
$ mom_id : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
$ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
$ value : num 1 1 1 0 0 0 0 1 0 0 ..
Once you have the data in "molten" form, you can "cast" it into the shape that you want:
# per-mother means for all variables
acast(dat.m,variable~mom_id,mean)
1 2 3
hispanic 1.0000000 0 0.0
mom_smoke 0.3333333 1 0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother sum of squared deviations, per mother
acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
1 2 3
hispanic 0.0000000 0 0.0
mom_smoke 0.6666667 0 0.5
son_birthweigth 5066.6666667 3200 12.5
## overall mean of each variable
acast(dat.m,variable~.,mean)
[,1]
hispanic 0.4285714
mom_smoke 0.5714286
son_birthweigth 3729.2857143
I know this question is four years old, but recently I wanted to do the same in R and came up with the following function. It depends on dplyr and tibble. Here, df is the data frame, columns is a numeric vector of column positions to summarise, and individuals is the name of the column identifying the individuals.
xtsumR <- function(df, columns, individuals) {
  # Build one three-row block (overall / between / within) per requested column
  blocks <- lapply(columns, function(i) {
    x <- df[[i]]
    id <- df[[individuals]]
    # one mean per individual, used for the "between" statistics
    ind_mean <- tapply(x, id, mean)
    # deviation from the individual's own mean, re-centred on the overall mean;
    # ave() also handles unbalanced panels, unlike the earlier rep(v2, each = 12)
    within <- x - ave(x, id, FUN = mean) + mean(x)
    tibble::tibble(
      variable = c(names(df)[i], NA, NA),
      variation = c("overall", "between", "within"),
      mean = c(mean(x), NA, NA),
      standard.deviation = c(sd(x), sd(ind_mean), sd(within)),
      min = c(min(x), min(ind_mean), min(within)),
      max = c(max(x), max(ind_mean), max(within))
    )
  })
  panel <- dplyr::bind_rows(blocks)
  # round the four numeric columns to two decimals
  panel[3:6] <- lapply(panel[3:6], round, 2)
  panel
}
