Posterior_survfit() is not using the nd - r

I am having trouble generating posterior predictions using posterior_survfit(). I am passing a new data frame, but the function ignores it and instead uses values from the dataset I used to fit the model. The variables in the model are New.Treatment (categorical, 6 treatments), Openness (a continuous light index; min = 2.22, mean = 6.903221, max = 10.54), subplot_by_site (categorical, 720 sites) and New.Species.name (categorical, 165 species). My new data frame has 94 rows, but posterior_survfit() returns 3017800 rows. Help, please!
head(nd)
New.Treatment Openness
1 BE 5
2 BE 6
3 BE 7
4 BE 8
5 BE 9
6 BE 10
fit= stan_surv(formula = Surv(days, Status_surv) ~ New.Treatment*Openness + (1 |subplot_by_site)+(1|New.Species.name),
data = dataset,
basehaz = "weibull",
chains=4,
iter = 2000,
cores =4 )
Post=posterior_survfit(fit, type="surv",
newdata=nd5)
head(Post)
id cond_time time median ci_lb ci_ub
1 1 NA 62.0000 0.9626 0.9623 1.0000
2 1 NA 69.1313 0.9603 0.9600 0.9997
3 1 NA 76.2626 0.9581 0.9579 0.9696
4 1 NA 83.3939 0.9561 0.9557 0.9665
5 1 NA 90.5253 0.9541 0.9537 0.9545
6 1 NA 97.6566 0.9522 0.9517 0.9526
##Here some reproducible code to explain my problem:
library(rstanarm)
data_NHN<- expand.grid(New.Treatment = c("A","B","C"), Openness = c(seq(2, 11, by=0.15)))
data_NHN$subplot_by_site=c(rep("P1",63),rep("P2",60),rep("P3",60))
data_NHN$Status_surv=sample(0:1,183, replace=TRUE)
data_NHN$New.Species.name=c(rep("sp1",10),rep("sp2",40),rep("sp1",80),rep("sp2",20),rep("sp1",33))
data_NHN$days=sample(10, size = nrow(data_NHN), replace = TRUE)
nd_t<- expand.grid(New.Treatment = c("A","B","C"), Openness = c(seq(2, 11, by=1)))
mod= stan_surv(formula = Surv(days, Status_surv) ~ New.Treatment+Openness + (1 |subplot_by_site)+(1|New.Species.name),
data =data_NHN,
basehaz = "weibull",
chains=4,
iter = 30,
cores =4)
summary(mod)
pos=posterior_survfit(mod, type="surv",
newdataEvent=nd_t,
times = 0)
head(pos)
#I am interested in predicting values for specific Openess values
#(nd_t=20 rows)but I am getting instead values for each point in time
#(pos=18300rows)
Operating System: Mac OS Catalina 10.15.6
R version: 4.0
rstan version: 2.21.2
rstanarm Version: rstanarm_2.21.2
Any suggestions on why it is not working? It's also not clear to me how to plot the effect of one variable in the interaction as the other changes, together with the associated uncertainty (i.e. a marginal effects plot). In my example, I am interested in getting the values at specific "Openness" values and not at each specific time as appears in the posterior results. TIA.

I can only give you a partial answer. You should be aware that this is pretty bleeding-edge. The arXiv paper (dated Feb 2020) says
"Hopefully by the time you are reading this, the functionality will be available in the stable release on the Comprehensive R Archive Network (CRAN)"
but so far that's not true; it's not even in the master GitHub branch, so I used remotes::install_github("stan-dev/rstanarm@feature/survival") to install it from source.
The proximal problem is that you should specify the new data frame as newdata, not newdataEvent. There seem to be a lot of mismatches between the master and this branch, and between the docs and the code ... newdataEvent is used in the older method for stanjm models, but not for stansurv models. You can look at the code here, or use formals(rstanarm:::posterior_survfit.stansurv). Unfortunately, because this method has an (unused, unchecked) ... argument, that means that any misnamed arguments will be silently ignored.
The next problem is that if you specify the new data in your example as newdata you'll get
Error: The following variables are missing from the data: subplot_by_site, New.Species.name
That is, there doesn't seem to be an obvious way to generate a population-level posterior prediction. (Setting the random effects grouping variables to NA isn't allowed.) If you want to do this, you could either:
expand your newdata to include all combinations of the grouping variables in your data set, and average the results across levels yourself;
post an issue on GitHub or contact the maintainers ...
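The first option could look roughly like the following. This is a minimal sketch that assumes the toy data from the question (grouping levels P1 to P3 and sp1/sp2); `groups` and `nd_full` are hypothetical names, and the commented-out `posterior_survfit()` call uses only the `newdata` argument discussed above.

```r
# Covariate values of interest, as in the question's nd_t
nd_t <- expand.grid(New.Treatment = c("A", "B", "C"),
                    Openness = seq(2, 11, by = 1))

# Observed levels of the grouping variables (taken from the toy data_NHN)
groups <- expand.grid(subplot_by_site = c("P1", "P2", "P3"),
                      New.Species.name = c("sp1", "sp2"))

# Cartesian product: every covariate row crossed with every grouping combination
nd_full <- merge(nd_t, groups, by = NULL)
dim(nd_full)  # 180 rows, 4 columns

# With the fitted model `mod` from the question (not run here):
# pos <- posterior_survfit(mod, type = "surv", newdata = nd_full)
# The predictions can then be averaged over the grouping levels for each
# New.Treatment/Openness combination.
```

With 720 sites and 165 species, as in the real data, the full cross gets large quickly, so it may be more practical to cross only the combinations that actually occur in the fitted data.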

Related

different output for PR AUC for different R packages

I find different numeric values for the computation of the Area Under the Precision Recall Curve (PRAUC) with the dataset I am working on when computed via 2 different R packages: yardstick and caret.
I am afraid I was not able to reproduce this mismatch with synthetic data, but only with my dataset (this is strange as well)
In order to make this reproducible, I am sharing the prediction output of my model, you can download it here https://drive.google.com/open?id=1LuCcEw-RNRcdz6cg0X5bIEblatxH4Rdz (don't worry, it's a small csv).
The csv contains a dataframe with 4 columns:
yes: probability estimate of being in class yes
no: 1 - yes
obs: actual class label
pred: predicted class label (with .5 threshold)
Here is the code that produces the two PRAUC values:
require(data.table)
require(yardstick)
require(caret)
pr <- fread('pred_sample.csv')
# transform to factors
# put the positive class in the first level
pr[, obs := factor(obs, levels = c('yes', 'no'))]
pr[, pred := factor(pred, levels = c('yes', 'no'))] # this is actually not needed
# compute yardstick PRAUC
pr_auc(pr, obs, yes) # 0.315
# compute caret PRAUC
prSummary(pr, lev = c('yes', 'no')) # 0.2373
I could understand a little difference, due to the approximation when computing the area (interpolating the curve), but this seems way too high.
I even tried a third package, PRROC, and the result is still different, namely around .26.

Lambda Issue, or cross validation

I am doing double cross-validation with LASSO from the glmnet package. However, when I plot the results, lambda ranges from 0 to 150000, which is unrealistic in my case. I am not sure what I am doing wrong; can someone point me in the right direction? Thanks in advance!
calcium = read.csv("calciumgood.csv", header=TRUE)
dim(calcium)
n = dim(calcium)[1]
calcium = na.omit(calcium)
names(calcium)
library(glmnet) # use LASSO model from package glmnet
lambdalist = exp((-1200:1200)/100) # defines models to consider
fulldata.in = calcium
x.in = model.matrix(CAMMOL~. - CAMLEVEL - AGE,data=fulldata.in)
y.in = fulldata.in[,2]
k.in = 10
n.in = dim(fulldata.in)[1]
groups.in = c(rep(1:k.in,floor(n.in/k.in)),1:(n.in%%k.in))
set.seed(8)
cvgroups.in = sample(groups.in,n.in) #orders randomly, with seed (8)
#LASSO cross-validation
cvLASSOglm.in = cv.glmnet(x.in, y.in, lambda=lambdalist, alpha = 1, nfolds=k.in, foldid=cvgroups.in)
plot(cvLASSOglm.in$lambda,cvLASSOglm.in$cvm,type="l",lwd=2,col="red",xlab="lambda",ylab="CV(10)")
whichlowestcvLASSO.in = order(cvLASSOglm.in$cvm)[1]; min(cvLASSOglm.in$cvm)
bestlambdaLASSO = (cvLASSOglm.in$lambda)[whichlowestcvLASSO.in]; bestlambdaLASSO
abline(v=bestlambdaLASSO)
bestlambdaLASSO # this is the lambda for the best LASSO model
LASSOfit.in = glmnet(x.in, y.in, alpha = 1,lambda=lambdalist) # fit the model across possible lambda
LASSObestcoef = coef(LASSOfit.in, s = bestlambdaLASSO); LASSObestcoef # coefficients for the best model fit
I found the dataset you are referring to:
Calcium, inorganic phosphorus and alkaline phosphatase levels in elderly patients.
Basically the data are "dirty", which is a possible reason why the algorithm does not converge properly. E.g. there are 771-year-old patients and, besides 1 and 2 for male and female, there is a 22 in the sex encoding, etc.
In your case you removed only the NAs.
You also need to check the column types of the imported data.frame. E.g. categorical variables (SEX, Lab and Age group) could have been imported as integers instead of factors, which will affect the model.
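A check along those lines might look like this. The data frame below is a made-up miniature (the values, the age cutoff of 120 and the factor labels are assumptions for illustration, not taken from the real file).

```r
# Hypothetical miniature of the calcium data; all values are invented
calcium <- data.frame(
  AGE    = c(78, 72, 771, 69, 85),
  SEX    = c(1, 2, 1, 22, 2),
  CAMMOL = c(2.45, 2.30, 2.25, 2.40, 2.35)
)

# Inspect the imported types: SEX shows up as numeric, not as a factor
str(calcium)

# Drop rows with implausible values (the thresholds are assumptions)
clean <- subset(calcium, AGE < 120 & SEX %in% c(1, 2))

# Encode the categorical variable as a factor before calling model.matrix()
clean$SEX <- factor(clean$SEX, levels = c(1, 2), labels = c("male", "female"))
nrow(clean)  # 3 rows survive the cleaning
```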
I think you need to:
1) cleanse the data;
2) if that does not work, submit the *.csv file.
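Independently of data quality, it may also be worth looking at the grid itself: the lambdalist defined in the question spans exp(-12) to exp(12), so a plot against raw lambda will run up to roughly 160000 by construction. A quick base R check (plotting on the log scale is a suggestion here, not something proposed in the answer above):

```r
# The candidate grid from the question
lambdalist <- exp((-1200:1200) / 100)

length(lambdalist)      # 2401 candidate values
range(lambdalist)       # about 6.1e-06 to 1.6e+05
range(log(lambdalist))  # -12 to 12
```

With the cv.glmnet fit from the question, plot(log(cvLASSOglm.in$lambda), cvLASSOglm.in$cvm, type = "l") or simply plot(cvLASSOglm.in) shows the CV curve against log(lambda), which is usually easier to read.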

GAM model error

My data frame looks like:
head(bush_status)
distance status count
0 endemic 844
1 exotic 8
5 native 3
10 endemic 5
15 endemic 4
20 endemic 3
The count data is non-normally distributed. I'm trying to fit a generalized additive model to my data in two ways so I can use anova to see if the p-value supports m2.
m1 <- gam(count ~ s(distance) + status, data=bush_status, family="nb")
m2 <- gam(count ~ s(distance, by=status) + status, data=bush_status, family="nb")
m1 works fine, but m2 sends the error message:
"Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons,
scale.penalty = scale.penalty, :
Can't find by variable"
This is pretty beyond me so if anyone could offer any advice that would be much appreciated!
From your comments it became clear that you passed a character variable to by in the smoother. You must pass a factor variable there. This has been a frequent gotcha for me too and I consider it a design flaw (because base R regression functions deal with character variables just fine).
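In code, the fix amounts to converting the column before fitting. The sketch below uses simulated data as a stand-in for bush_status (the simulation and the choice of k = 5 are assumptions for illustration); only the factor() conversion is the point.

```r
library(mgcv)

# Simulated stand-in for bush_status, with status stored as character
set.seed(1)
bush_status <- data.frame(
  distance = rep(seq(0, 100, by = 5), times = 3),
  status   = rep(c("endemic", "exotic", "native"), each = 21),
  stringsAsFactors = FALSE
)
bush_status$count <- rnbinom(nrow(bush_status), size = 2,
                             mu = 20 * exp(-bush_status$distance / 30) + 1)

# The fix: the `by` variable of a smooth must be a factor (or numeric)
bush_status$status <- factor(bush_status$status)

m1 <- gam(count ~ s(distance, k = 5) + status,
          data = bush_status, family = nb(), method = "REML")
m2 <- gam(count ~ s(distance, by = status, k = 5) + status,
          data = bush_status, family = nb(), method = "REML")

# Compare the common smooth against the per-status smooths
anova(m1, m2, test = "Chisq")
```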

predict.lm after regression with missing data in Y

I don't understand how to generate predicted values from a linear regression using the predict.lm command when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically, this isn't a problem, but I don't know an efficient method to do it in R. Take for example this fake dataframe and regression model. I attempt to assign predictions in the source dataframe but am unable to do so because of one missing Y value: I get an error.
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip<-predict.lm(testy)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions using algebra, df$y<-B0+ B1*df$x, or by pulling the coefficients out of the model, df$y<-(summary(model)$coefficients[1, 1]) + (summary(model)$coefficients[2, 1]*(df$x)); however, I am now working with a large model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
There is built-in functionality for this in R (but not necessarily obvious): it's the na.action argument/?na.exclude function. With this option set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit model: default na.action is na.omit, which simply removes non-complete cases.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
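Because the padded vector has one element per row of the original data, it can be assigned straight back into the data frame, which is what the question was trying to do. A short self-contained sketch (the column names y_ip and resid are just illustrative):

```r
# Same toy data as above
df <- data.frame(x = 1:10, y = 100 * (1:10))
df$y[5] <- NA

# Fit with na.exclude so downstream results are padded with NA
mod2 <- lm(y ~ x + 1, data = df, na.action = na.exclude)

df$y_ip  <- predict(mod2)    # length 10, NA in row 5
df$resid <- residuals(mod2)  # padded the same way
df
```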
Actually, you are not using the predict.lm function correctly.
Either way, you have to pass the model itself as its first argument (here, model), with or without new data. Without new data, it only predicts on the training data, which excludes your NA row, so you need this workaround to fit the result into the initial data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or explicitly specify some new data. Since the new x has one more row than the training x, the missing row is filled with a new prediction:
df$y_ip <- predict.lm(model, newdata = df)

R multiclass/multinomial classification ROC using multiclass.roc (Package ‘pROC’)

I am having difficulty understanding what the multiclass.roc parameters should look like.
Here a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
1 2 3
9 1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
>
I tried calling multiclass.roc with:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs,
formula=response~predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem I know that 'predictor' should contain probabilities (one per observation). However, in my case, I have 3 classes, so my predictor is a list of rows that each have 3 columns (or a sublist of 3 values) corresponding to the probability for each class.
Does anyone know what my 'predictor' should look like, as opposed to what it currently looks like?
The pROC package is not really designed to handle this case where you get multiple predictions (as probabilities for each class). Typically you would assess your P(class = 1)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs[,1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=predicted.class)
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.

Resources