Number-at-risk table using survplot with a cph() object

This is pretty specific to the rms package. When using survplot() with a cph object, the number-at-risk table is not stratified by the covariate; using npsurv() does this correctly.
library(survival)
library(rms)
data(lung)
fit <- cph(Surv(time, status == 2) ~ sex, data = lung, surv = TRUE, x = TRUE, y = TRUE)
survplot(fit, n.risk = TRUE)
# compared to:
survplot(npsurv(fit$sformula, data = lung), conf = "none", n.risk = TRUE)
For usual use cases, I guess re-fitting the model with npsurv() will do, but in others, such as when using fit.mult.impute() (my specific use case), you want to use the original fit created by cph().
Is there a workaround or a way to fix this?

So I should have used
strat(sex)
as the term in the model formula.
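For example (a minimal sketch, re-using the lung data from the question):
# Refit with sex as a stratification factor rather than a modeled covariate
fit2 <- cph(Surv(time, status == 2) ~ strat(sex), data = lung,
            surv = TRUE, x = TRUE, y = TRUE)
survplot(fit2, n.risk = TRUE)  # the number-at-risk table is now stratified by sex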

Back transformation of emmeans in lmer

I had to transform a response variable (e.g., Variable1) to fulfil the assumptions of linear models in lmer, using an approach for heavy-tailed data suggested here (https://www.r-bloggers.com/2020/01/a-guide-to-data-transformation/) and demonstrated below:
TransformVariable1 <- sqrt(abs(Variable1 - median(Variable1)))
I then fit the data to the following example model:
fit <- lmer(TransformVariable1 ~ x + y + (1|z), data = dataframe)
Next, I update the reference grid to account for the transformation, as suggested in "Specifying that model is logit transformed to plot backtransformed trends":
rg <- update(ref_grid(fit), tran = "TransformVariable1")
Nevertheless, the emmeans are not back-transformed to the original scale after using the following command:
fitemm <- as.data.frame(emmeans(rg, ~ x + y, type = "response"))
My question is: how can I back-transform the emmeans to the original scale?
Thank you in advance.
There are two major problems here.
The lesser of them is in specifying tran. You need to either specify one of a handful of known transformations, such as "log", or a list with the needed functions to undo the transformation and implement the delta method. See the help for make.link, make.tran, and vignette("transformations", "emmeans").
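For illustration (this sketch and the name mytran are mine, not from the original post), such a list has the same elements as the object returned by stats::make.link(); here it re-implements the log transformation, which emmeans already knows, purely to show the required structure:
# Hypothetical custom transformation object in the form emmeans expects
mytran <- list(
  linkfun = function(mu) log(mu),    # the transformation applied to the response
  linkinv = function(eta) exp(eta),  # its inverse, used to back-transform
  mu.eta  = function(eta) exp(eta),  # derivative of linkinv, for the delta method
  name    = "mylog"
)
rg <- update(ref_grid(fit), tran = mytran)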
The much more serious issue is that the transformation used here is not a monotone function, so it is impossible to back-transform the results. Each transformed response value corresponds to two possible values on either side of the median of the original variable. The model we have here does not estimate effects on the given variable, but rather effects on the dispersion of that variable. It's like trying to use the speedometer as a substitute for a navigation system.
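To see the problem concretely (made-up numbers, assuming a median of 5): values on either side of the median collapse to the same transformed value, so no inverse exists.
# Hypothetical values: 4 and 6 straddle the median 5, yet transform identically
sqrt(abs(4 - 5))  # 1
sqrt(abs(6 - 5))  # also 1, so the original value cannot be recovered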
I would suggest using a different model, or at least a different response variable.
A possible remedy
Looking again at this, I wonder if what was meant was the symmetric square-root transformation -- what is shown multiplied by sign(Variable1 - median(Variable1)). This transformation is available in emmeans::make.tran(). You will need to re-fit the model.
What I suggest is creating the transformation object first, then using it throughout:
require(lme4)
require(emmeans)
# Variable1 lives inside 'dataframe', so refer to it there when computing the median
symsqrt <- make.tran("sympower", param = c(0.5, median(dataframe$Variable1)))
fit <- with(symsqrt,
    lmer(linkfun(Variable1) ~ x + y + (1|z), data = dataframe)
)
emmeans(fit, ~ x + y, type = "response")
symsqrt comprises a list of functions needed to implement the transformation. The transformation itself is symsqrt$linkfun, and the emmeans package knows to look for the other stuff when the response transformation is named linkfun.
BTW, please break the habit of wrapping emmeans() in as.data.frame(). That renders invisible some important annotations, and also disables the possibility of following up with contrasts and comparisons. If you think you want to see more precision than is shown, you can precede the call with emm_options(opt.digits = FALSE); but really, you are kidding yourself if you think those extra digits give you useful information.

Why is predict in R taking Train data instead of Test data?

Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm, testset$independent)
and every single time, I get a mysterious warning from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must be there in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm, tempset$independent)
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is a waste of typing, redundant, and, not least, the cause of the problem you are hitting. When you now call predict, it will look for a variable in testset that has the name trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict():
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them is legibility. Getting the object name out of the formula allows you to see more easily what the model is.
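A self-contained sketch of the whole pattern, using the built-in mtcars data as a stand-in (my choice, not from the original question):
# Split built-in data into a training and a test set for illustration
trainingset <- mtcars[1:20, ]
testset <- mtcars[21:32, ]
c_lm <- lm(mpg ~ wt, data = trainingset)    # fit on the training data only
c_pred <- predict(c_lm, newdata = testset)  # one prediction per test-set row
length(c_pred)  # 12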

Getting "Variable Importance" from rpart

I'm performing a tree analysis using rpart, and I need to access the values of "Variable importance" as shown when the rpart object is printed.
Is there a way to do that?
Thanks!
@rawr indicated it in the comments; I'll just make it an answer:
You can extract the variable importance from a rpart object using:
fit$variable.importance
Just adding details on @user7779's answer, you can also access the information you need in the following way:
library(rpart)
my.tree = rpart(y ~ X, data = dta, method = "anova") # I am assuming regression tree.
summary(my.tree)
In the output, among the first lines, you will find variable importance. Note, though, that everything there is rescaled, so you get the relative importance (i.e., the numbers sum to one hundred).
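A short sketch tying the two answers together (my.tree as fitted above): the raw scores from the fitted object, and a rescaling that reproduces the relative importances shown by summary():
imp <- my.tree$variable.importance  # named numeric vector of raw importance scores
round(100 * imp / sum(imp))         # rescaled to sum to one hundred, as in summary()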

R random forest using cforest, how to plot tree

I have created a random forest model using cforest:
library("party")
crs$rf <- cforest(as.factor(Censor) ~ .,
                  data = crs$dataset[crs$sample, c(crs$input, crs$target)],
                  controls = cforest_unbiased(ntree = 500, mtry = 4))
cf <- crs$rf
tr <- party:::prettytree(cf@ensemble[[1]], names(cf@data@get("input")))
# tr
plot(new("BinaryTree", tree = tr, data = cf@data, responses = cf@responses))
I get an error when plotting the tree:
Error: no string supplied for 'strwidth/height' unit
Any help on how to overcome this error?
Looking at your code, I assume that crs refers to a data frame, and the dollar sign may be the problem (specifically in crs$rf). If crs is indeed a data frame, the $ tells R to extract items from inside it using the usual list-indexing device, which may conflict with the call that generates the random forest and cause the error. You could fix this by starting with:
crs_rf <- cforest(as.factor(Censor) ~ ., .....
That would create the random forest object. This would replace:
crs$rf <- cforest(as.factor(Censor) ~ ., ......
As a reference, in case this doesn't fix it, I wanted to refer you to a great guide from Stanford that covers random forests, with examples showing how the party package is used. To make troubleshooting easier, I would recommend pulling the call apart as they do in the guide. For example (taking from that guide), first set the controls:
data.controls <- cforest_unbiased(ntree=1000, mtry=3)
Then make the call:
data.cforest <- cforest(Resp ~ x + y + z…, data = mydata, controls=data.controls)
Then generate the plot once the call works. If you need help plotting trees using cforest(), you might think about looking at this excellent resource (as I believe the root of the problem is with your plotting function call): http://www.r-bloggers.com/a-brief-tour-of-the-trees-and-forests/
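As a runnable version of that pulled-apart pattern (the built-in iris data here is my stand-in, not part of the original question):
library(party)
# Set the controls first, then fit, so each piece can be checked separately
data.controls <- cforest_unbiased(ntree = 100, mtry = 2)
data.cforest <- cforest(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                        data = iris, controls = data.controls)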
