Adjusted survival curve based on weighted Cox regression - r

I'm trying to make an adjusted survival curve based on a weighted Cox regression performed on a case-cohort data set in R, but I can't make it work. I was hoping that some of you might be able to figure out why.
To illustrate the problem, I have used (and slightly adjusted) the example from the "Package 'survival'" document, which means I'm working with:
library(survival)  # provides nwtco, Surv() and cch()
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
                data = ccoh.data, subcoh = ~subcohort, id = ~seqno,
                cohort.size = 4028, method = "LinYing")
The data set looks like this:
seqno instit histol stage study rel edrel age in.subcohort subcohort
4 4 2 1 4 3 0 6200 2.333333 TRUE TRUE
7 7 1 1 4 3 1 324 3.750000 FALSE FALSE
11 11 1 2 2 3 0 5570 2.000000 TRUE TRUE
14 14 1 1 2 3 0 5942 1.583333 TRUE TRUE
17 17 1 1 2 3 1 960 7.166667 FALSE FALSE
22 22 1 1 2 3 1 93 2.666667 FALSE FALSE
Then, I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves function from the survminer package:
library(survminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
# Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted cox regression performed on a case cohort data set, so I would, generally, also be interested in hearing if there are other alternatives than the adjusted survival curves?

There are two reasons it is throwing errors.
The ggadjustedcurves function is not being given a coxph object, which its help page indicates is the intended first argument.
The specification of the variable argument is incorrect. The correct way to specify a column is a length-1 character vector that matches one of the names in the formula. You gave it a vector of length 1154.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data)
ggadjustedcurves(fit.ccSP, variable = 'stage', data = ccoh.data)
It might not answer your desires, but it does answer the "why-error" part of your question. You might want to review the methods used by Therneau, Cynthia S Crowson, and Elizabeth J Atkinson in their paper on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf
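If marginal adjusted curves are not essential, another way to show the stage effect is to plot model-based curves for a fixed covariate profile. The following is only a minimal sketch under loud assumptions: it fits an ordinary, unweighted coxph on the full nwtco cohort (not the case-cohort cch fit), and the reference profile nd (histol = 1, median age) is chosen purely for illustration. These are conditional curves for that profile, not the population-averaged curves that ggadjustedcurves draws.

```r
library(survival)

# Assumption: full cohort, unweighted fit; this is NOT the case-cohort estimate.
d <- nwtco
d$age <- d$age / 12  # age in years, as in the question

fit <- coxph(Surv(edrel, rel) ~ stage + histol + age, data = d)

# Hypothetical reference profile: one row per stage, other covariates held fixed.
nd <- data.frame(stage = 1:4, histol = 1, age = median(d$age))

# One predicted survival curve per row of nd.
sf <- survfit(fit, newdata = nd)

plot(sf, col = 1:4, xlab = "Days", ylab = "Relapse-free survival")
legend("bottomleft", legend = paste("Stage", 1:4), col = 1:4, lty = 1)
```

Changing the rows of nd changes which profile the curves describe, so the choice of reference values should be stated alongside any such plot.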

inputs of 2 separate predict() return the same set of fitted values

Confession: I attempted to ask this question yesterday, but used a sample dataset that resembles my "real" data, in the hope that this would be more convenient for readers here. One issue was resolved, but another remains that appears immutable.
My objective is to create two predicted vectors from a linear model, "yC.hat" and "yT.hat", which are meant to project average effects for unique observed values of pri2000v as a function of the average poverty level I(avgpoverty^2) under control (treatment = 0) and treatment (treatment = 1) conditions.
While I appear to have no issues running the regression itself, the inputs of my data argument have no effect on predict(), and only the model object itself affects the output. As a result, treatment = 0 and treatment = 1 in the data argument result in the same fitted values. In fact, I can plug ANY value into the data argument and it makes no difference. So I suspect my misunderstanding of the issue starts here.
Here is my code:
q6rega <- lm(pri2000v ~ treatment + I(log(pobtot1994)) + I(avgpoverty^2)
#interactions
+ treatment:avgpoverty + treatment:I(avgpoverty^2), data = pga)
## predicted PRI support under the Treatment condition
q6.yT.hat <- predict(q6rega,
data = data.frame(I(avgpoverty^2) = 9:25, treatment = 1))
## predicted PRI support rate under the Control condition
q6.yC.hat <- predict(q6rega,
data = data.frame(I(avgpoverty^2) = 9:25, treatment = 0))
q6.yC.hat == q6.yT.hat
TRUE[417]
The dput() of pga has been posted on my GitHub, if needed.
EDIT: There were a few things wrong with my code above, but not specifying pobtot1994 somehow resulted in R treating newdata as omitted. Since I'm fairly new to statistics, I confused fitted values with the prediction output that I was actually trying to obtain. I would have expected an unexpected input to produce an error instead.
I'm surprised you are able to run a prediction when the new data frame lacks a variable the model requires (pobtot1994).
Anyway, you need to create a new data frame containing the three variables used in the model, in untransformed form. Since you are interested in comparing the fitted values at avgpoverty 3 to 5 for treatment 1 and 0, you need to hold the third variable, pobtot1994, constant. I use the mean of pobtot1994 here for simplicity.
newdat <- expand.grid(avgpoverty=3:5, treatment=factor(c(0,1)), pobtot1994=mean(pga$pobtot1994))
avgpoverty treatment pobtot1994
1 3 0 2037.384
2 4 0 2037.384
3 5 0 2037.384
4 3 1 2037.384
5 4 1 2037.384
6 5 1 2037.384
The prediction will show you the different values for the two conditions.
newdat$fitted <- predict(q6rega, newdata=newdat)
avgpoverty treatment pobtot1994 fitted
1 3 0 2037.384 38.86817
2 4 0 2037.384 50.77476
3 5 0 2037.384 55.67832
4 3 1 2037.384 51.55077
5 4 1 2037.384 49.03148
6 5 1 2037.384 59.73910
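The reason the original calls returned identical vectors can be reproduced on any lm fit: predict.lm reads new observations from newdata, and an argument named data is swallowed by ... rather than raising an error, so the fitted values for the training data come back. A minimal sketch, assuming the built-in mtcars data as a stand-in for pga (wt and am are placeholders for avgpoverty and treatment, not the real variables):

```r
# Stand-in model on built-in data; wt and am play the roles of
# avgpoverty and treatment (an assumption for illustration only).
fit <- lm(mpg ~ am + I(wt^2) + am:wt + am:I(wt^2), data = mtcars)

nd_treat   <- data.frame(wt = 3:5, am = 1)
nd_control <- data.frame(wt = 3:5, am = 0)

# 'data' is not an argument of predict.lm, so nd_treat is never used here:
p_ignored <- predict(fit, data = nd_treat)
length(p_ignored)  # one value per row of mtcars, not per row of nd_treat
isTRUE(all.equal(unname(p_ignored), unname(fitted(fit))))

# 'newdata' is the argument that predict.lm actually reads:
p_treat   <- predict(fit, newdata = nd_treat)
p_control <- predict(fit, newdata = nd_control)
cbind(wt = 3:5, control = p_control, treat = p_treat)
```

Note that the new data frame holds the untransformed variable (wt here); the I(wt^2) term is rebuilt from it by the model formula.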

Errors using powerSim and powerCurve for a clmm in R

I'm new to clmm and have run into the following problem:
I want to obtain the optimal sample size for my study with R using powerSim and powerCurve. Because my data is ordinal, I'm using a clmm. Study participants (VPN) should evaluate three sentence types (SH1, SM1, and SP1) on a 5-point Likert scale (evaluation.likert). I need to account for my participants as a random factor, while the sentence types and the evaluation are my fixed factors.
Here's a glimpse of my data (count of VPN goes up to 40 for each of the parameters, I just shortened it here):
VPN parameter evaluation.likert
1 1 SH1 2
2 2 SH1 4
3 3 SH1 5
4 4 SH1 3
...
5 1 SM1 4
6 2 SM1 2
7 3 SM1 2
8 4 SM1 5
...
9 1 SP1 1
10 2 SP1 1
11 3 SP1 3
12 4 SP1 5
...
Now, with some help I created the following model:
clmm(likert~parameter+(1|VPN), data=dfdata)
With this model, I'm doing the simulation:
ps1 <- powerSim(power, test=fixed("likert:parameter", "anova"), nsim=40)
Warning:
In observedPowerWarning(sim) :
This appears to be an "observed power" calculation
print(ps1)
Power for predictor 'likert:parameter', (95% confidence interval):
0.00% ( 0.00, 8.81)
Test: Type-I F-test
Based on 40 simulations, (0 warnings, 40 errors)
alpha = 0.05, nrow = NA
Time elapsed: 0 h 0 m 0 s
nb: result might be an observed power calculation
In the above example, I tried it with 40 participants, but I have also run a simulation with 2000000 participants to check whether I just need a huge number of people. The results were the same though: 0.0%.
lastResult()$errors tells me that I'm using a method which is not applicable for clmm:
not applicable method for 'simulate' on object of class "clmm"
But besides the anova I'm doing here, I've also tried z, t, f, chisq, lr, sa, kr, pb. (And instead of test=fixed, I've also tried test=compare, test=fcompare, test=rcompare, and even test=random().)
So I guess there must be something wrong with my model? Or are really none of these methods applicable to clmms?
Many thanks in advance, your help is already very much appreciated!

Stata twoway graph of means with confidence intervals

Using
clear
score group test
2 0 A
3 0 B
6 0 B
8 0 A
2 0 A
2 0 A
10 1 B
7 1 B
8 1 A
5 1 A
10 1 A
11 1 B
end
I want to scatter plot the mean score by group for each test (on the same graph) with confidence intervals (the real data has thousands of observations). The resulting graph would have two sets of two dots: one set for test A (group==0 vs group==1) and one set for test B (group==0 vs group==1).
My current approach works but it is laborious. I compute all of the needed statistics using egen: the mean, number of observations, standard deviations...for each group by test. I then collapse the data and plot.
There has to be another way, no?
I assumed that Stata would be able to take as its input the score group and test variables and then compute and present this pretty standard graph.
After spending a lot of time on Google, I had to ask.
Although there are user-written programs, I lean towards statsby as a basic approach here. Discussion is accessible in this paper.
This example takes your data example (almost executable code). Some choices depend on the large confidence intervals implied. Note that if your version of Stata is not up-to-date, the syntax of ci will be different. (Just omit means.)
clear
input score group str1 test
2 0 A
3 0 B
6 0 B
8 0 A
2 0 A
2 0 A
10 1 B
7 1 B
8 1 A
5 1 A
10 1 A
11 1 B
end
save cj12 , replace
* test A
statsby mean=r(mean) ub=r(ub) lb=r(lb) N=r(N), by(group) clear : ///
ci means score if test == "A"
gen test = "A"
save cj12results, replace
* test B
use cj12
statsby mean=r(mean) ub=r(ub) lb=r(lb) N=r(N), by(group) clear : ///
ci means score if test == "B"
gen test = "B"
append using cj12results
* graph; show sample sizes too, but where to show them is empirical
set scheme s1color
gen where = -20
scatter mean group, ms(O) mcolor(blue) || ///
rcap ub lb group, lcolor(blue) ///
by(test, note("95% confidence intervals") legend(off)) ///
subtitle(, fcolor(ltblue*0.2)) ///
ytitle(score) xla(0 1) xsc(r(-0.25 1.25)) yla(-10(10)10, ang(h)) || ///
scatter where group, ms(none) mla(N) mlabpos(12) mlabsize(*1.5)
We can't compare your complete code or your graph, because you show neither.

R multiclass/multinomial classification ROC using multiclass.roc (Package ‘pROC’)

I am having difficulty understanding what the multiclass.roc parameters should look like.
Here is a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
1 2 3
9 1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
>
I tried calling multiclass.roc with:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs,
formula=response~predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem, I know that 'predictor' should contain probabilities (one per observation). However, in my case I have 3 classes, so my predictor is a list of rows that each have 3 columns (or a sublist of 3 values) corresponding to the probability for each class.
Does anyone know what my 'predictor' should look like, rather than what it currently looks like?
The pROC package is not really designed to handle this case, where you get multiple predictions (as probabilities for each class). Typically you would assess your P(class = 1):
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs[,1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=predicted.class)
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.
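One caveat with which.max is that it returns column positions, which only coincide with the class labels when the columns are ordered and named 1, 2, 3 as here. A minimal base-R sketch, using the six rows printed in the question, that maps positions back to the column names (the object names probs, response and idx are illustrative, not from the original code):

```r
# The six rows of predicted probabilities shown in the question.
probs <- matrix(
  c(1.013755e-04, 3.713862e-02, 0.96276001,
    1.904435e-11, 3.153587e-02, 0.96846413,
    6.445101e-23, 1.119782e-11, 1.00000000,
    1.238355e-04, 2.882145e-02, 0.97105472,
    9.027254e-01, 7.259787e-07, 0.09727389,
    1.365667e-01, 4.034372e-01, 0.45999610),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("9", "10", "12", "13", "22", "26"), c("1", "2", "3"))
)
response <- factor(c(3, 3, 3, 3, 1, 3), levels = 1:3)

# Column index of the largest probability in each row, mapped to the class label.
idx <- apply(probs, 1, which.max)
predicted.class <- factor(colnames(probs)[idx], levels = levels(response))

# Cross-tabulate observed against predicted classes.
table(observed = response, predicted = predicted.class)
```

Since the error message above asks for a numeric or ordered predictor, predicted.class would need converting, for example with as.numeric(as.character(predicted.class)), before being passed on.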

How to organize data for a multivariate probit model?

I've conducted a psychometric test on some subjects, and I'm trying to create a multivariate probit model.
The test was conducted as follows:
Subject 1 was given a certain stimulus under 11 different conditions, 10 times for each condition. Answers (correct=1, incorrect=0) were recorded.
So for subject 1, I have the following table of results:
# Subj 1
correct
cnt 1 0
1 0 10
2 0 10
3 1 9
4 5 5
5 7 3
6 10 0
7 10 0
8 10 0
9 9 1
10 10 0
11 10 0
This means that Subj1 answered incorrectly 10 times under conditions 1 and 2, and answered correctly 10 times under conditions 10 and 11. For the other conditions, the proportion of correct responses increased from condition 3 to condition 9.
I hope I was clear.
I usually analyze the data using the following code:
prob.glm <- glm(resp.mat1 ~ cnt, family = binomial(link = "probit"))
Here resp.mat1 is the responses' table, while cnt is the contrast c(1,11). So I'm able to draw the sigmoid curve using the predict() function. The graph, for the subject-1 is the following.
Now suppose I've conducted the same test on 20 subjects. I have now 20 tables, organized like the first one.
What I want to do is to compare subgroups, for example: male vs. female; young vs. older and so on. But I want to keep the inter-individual variability, so simply "adding" the 20 tables will be wrong.
How can I organize the data in order to use the glm() function?
I want to be able to write a command like:
prob.glm <- glm(resp.matTOT ~ cnt + sex, family = binomial(link = "probit"))
And then graphing the curve for sex=M, and sex=F.
I tried using the rbind() function to create a single table, then adding columns for Subj (1 to 20), Sex and Age. But it looks like a bad solution to me, so any alternative solutions would be really appreciated.
Looks like you are using the wrong function for the job. Check the first example of glmer in package lme4; it comes quite close to what you want. herd should be replaced by the subject number, but make sure that you do something like
mydata$subject = as.factor(mydata$subject)
when you have numerical subject numbers.
# Stolen from lme4
library(lattice)
library(lme4)  # provides glmer() and the cbpp data
xyplot(incidence/size ~ period|herd, cbpp, type=c('g','p','l'),
layout=c(3,5), index.cond = function(x,y)max(y))
(gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial))
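To make the data layout concrete, here is a minimal base-R sketch of the long format that glm and glmer both accept: one row per subject and condition, with counts of correct and incorrect answers plus subject-level covariates. The helper to_long and the sex value "F" assigned to subject 1 are hypothetical; only the counts come from the table in the question.

```r
# Counts of correct answers for subject 1, copied from the question
# (conditions 1 to 11, 10 trials each).
correct1 <- c(0, 0, 1, 5, 7, 10, 10, 10, 9, 10, 10)

# Hypothetical helper: turn one subject's counts into long-format rows.
to_long <- function(correct, subject, sex, n_trials = 10) {
  data.frame(subject = factor(subject), sex = sex,
             cnt = seq_along(correct),
             correct = correct, incorrect = n_trials - correct)
}

# The sex value is a placeholder, not taken from the question.
long <- to_long(correct1, subject = 1, sex = "F")
# Further subjects would be stacked with rbind(long, to_long(...)).

# Same single-subject probit fit as in the question, now from the long table.
fit1 <- glm(cbind(correct, incorrect) ~ cnt,
            family = binomial(link = "probit"), data = long)
coef(fit1)
```

With all subjects stacked, the glmer call from the answer would presumably take the form cbind(correct, incorrect) ~ cnt + sex + (1 | subject) with family = binomial(link = "probit").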
There's a multivariate probit command in the mlogit library of all things. You can see an example of the data structure required here:
https://stats.stackexchange.com/questions/28776/multinomial-probit-for-varying-choice-set
