smooth.Pspline yields the same results with different spar values

I am trying to determine the best value of spar to implement across a dataset by reducing the root mean square error between test and training replicates on the same raw data. My test and training replicates look like this:
Traindataset
t        dist
-0.008   NA
-0.006   0
-0.004   0
-0.002   0
0        0
0.002    0.000165038
0.004    0.000686934
0.006    0.001168098
0.008    0.001928885
0.01     0.003147262
0.012    0.004054971
0.014    0.005605361
0.016    0.007192645
0.018    0.009504648
0.02     0.011498809
0.022    0.013013655
0.024    0.01342625
Testdataset
t        dist
-0.008   NA
-0.006   0
-0.004   0
-0.002   0
0        0
0.002    0
0.004    0.000481184
0.006    0.001306409
0.008    0.002590156
0.01     0.003328259
0.012    0.004429246
0.014    0.005012768
0.016    0.005829698
0.018    0.006567801
0.02     0.008030102
0.022    0.009617453
0.024    0.011202827
I need the spline to be 5th order so I can accurately predict the 3rd derivative, so I am using smooth.Pspline (from the pspline package) instead of the more common smooth.spline. I attempted using a variant of the solution outlined here (using root mean squared error of predicting testdataset from traindataset instead of cross validation sum of squares within one dataset). My code looks like this:
library(pspline)

# Root mean squared error between predictions m and observations o
RMSE <- function(m, o){
  sqrt(mean((m - o)^2))
}
# Fit on the training replicate, then score predictions on the test replicate
Psplinermse <- function(spar){
  trainmod <- smooth.Pspline(traindataset$t, traindataset$dist, norder = 5,
                             spar = spar)
  testpreddist <- predict(trainmod, testdataset$t)[,1]
  RMSE(testpreddist, testdataset$dist)
}
spars <- seq(0, 1, by = 0.001)
rmsevals <- rep(0, length(spars))
for (i in seq_along(spars)){
  rmsevals[i] <- Psplinermse(spars[i])
}
plot(spars, rmsevals, type = 'l', xlab = 'spar', ylab = 'RMSE')
The issue I am having is that, for pspline, the RMSE values are the same for any spar above 0. When I dug into just the prediction line of the code, I realized I am getting exactly the same predicted values of dist for any spar above 0. Any ideas on why this might be are greatly appreciated.
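One possible explanation, offered as an assumption to check against the pspline documentation rather than a confirmed diagnosis: if spar in smooth.Pspline is the raw coefficient of the roughness penalty (not a 0-1 scaled value as in smooth.spline), then every nonzero value on the grid above may already be large enough, given the tiny scale of t, to push the fit to its limiting polynomial, at which point the predictions stop changing. The base-R sketch below only illustrates that saturation effect; it uses a stand-in discrete penalized smoother (the hypothetical helper whittaker_fit), not pspline itself.

```r
# Stand-in smoother (NOT pspline): discrete penalized least squares that
# minimizes sum((y - z)^2) + lambda * sum((d-th differences of z)^2).
whittaker_fit <- function(y, lambda, d = 3) {
  n <- length(y)
  D <- diff(diag(n), differences = d)   # (n - d) x n difference matrix
  solve(diag(n) + lambda * crossprod(D), y)
}

set.seed(1)
x <- seq(0, 1, length.out = 30)
y <- sin(2 * pi * x) + rnorm(30, sd = 0.1)

# Two very different but both "large" penalties: the fits nearly coincide,
# because both are close to the limiting polynomial of degree d - 1.
fit_big    <- whittaker_fit(y, lambda = 1e8)
fit_bigger <- whittaker_fit(y, lambda = 1e10)
max_gap_large <- max(abs(fit_big - fit_bigger))

# Two small penalties, by contrast, give clearly different fits.
fit_small  <- whittaker_fit(y, lambda = 1e-2)
fit_medium <- whittaker_fit(y, lambda = 1e1)
max_gap_small <- max(abs(fit_small - fit_medium))

c(large = max_gap_large, small = max_gap_small)
```

If this is what is happening in the question, a log-spaced grid reaching much smaller spar values might be a more informative search range than seq(0, 1, by = 0.001).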

R-INLA not computing fitted marginal values

I've run into an issue where R INLA isn't computing the fitted marginal values. I first had it with my own dataset, and have been able to reproduce it following an example from this book. I suspect there must be some configuration I need to change, or maybe INLA isn't working well with something under the hood? Anyways here is the code:
library("rgdal")
library("spdep")  # poly2nb(), nb2mat()
library("INLA")   # inla()
boston.tr <- readOGR(system.file("shapes/boston_tracts.shp",
                                 package="spData")[1])
#create adjacency matrices
boston.adj <- poly2nb(boston.tr)
W.boston <- nb2mat(boston.adj, style = "B")
W.boston.rs <- nb2mat(boston.adj, style = "W")
boston.tr$CMEDV2 <- boston.tr$CMEDV
boston.tr$CMEDV2[boston.tr$CMEDV2 == 50.0] <- NA
#define formula
boston.form <- log(CMEDV2) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + I(RM^2) +
  AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B + log(LSTAT)
boston.tr$ID <- 1:length(boston.tr)
#run model
boston.iid <- inla(update(boston.form, . ~. + f(ID, model = "iid")),
                   data = as.data.frame(boston.tr),
                   control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
                   control.predictor = list(compute = TRUE)
)
When I look at the output of this model, it specifies that the fitted values were computed:
summary(boston.iid)
Call:
c("inla(formula = update(boston.form, . ~ . + f(ID, model = \"iid\")), ", " data = as.data.frame(boston.tr),
control.compute = list(dic = TRUE, ", " waic = TRUE, cpo = TRUE), control.predictor = list(compute = TRUE))"
)
Time used:
Pre = 0.981, Running = 0.481, Post = 0.0337, Total = 1.5
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 4.376 0.151 4.080 4.376 4.672 4.376 0
CRIM -0.011 0.001 -0.013 -0.011 -0.009 -0.011 0
ZN 0.000 0.000 -0.001 0.000 0.001 0.000 0
INDUS 0.001 0.002 -0.003 0.001 0.006 0.001 0
CHAS1 0.056 0.034 -0.010 0.056 0.123 0.056 0
I(NOX^2) -0.540 0.107 -0.751 -0.540 -0.329 -0.540 0
I(RM^2) 0.007 0.001 0.005 0.007 0.010 0.007 0
AGE 0.000 0.001 -0.001 0.000 0.001 0.000 0
log(DIS) -0.143 0.032 -0.206 -0.143 -0.080 -0.143 0
log(RAD) 0.082 0.018 0.047 0.082 0.118 0.082 0
TAX 0.000 0.000 -0.001 0.000 0.000 0.000 0
PTRATIO -0.031 0.005 -0.040 -0.031 -0.021 -0.031 0
B 0.000 0.000 0.000 0.000 0.001 0.000 0
log(LSTAT) -0.329 0.027 -0.382 -0.329 -0.277 -0.329 0
Random effects:
Name Model
ID IID model
Model hyperparameters:
mean sd 0.025quant 0.5quant 0.975quant mode
Precision for the Gaussian observations 169.24 46.04 99.07 160.46 299.72 141.30
Precision for ID 42.84 3.40 35.40 43.02 49.58 43.80
Deviance Information Criterion (DIC) ...............: -996.85
Deviance Information Criterion (DIC, saturated) ....: 1948.94
Effective number of parameters .....................: 202.49
Watanabe-Akaike information criterion (WAIC) ...: -759.57
Effective number of parameters .................: 337.73
Marginal log-Likelihood: 39.74
CPO and PIT are computed
Posterior marginals for the linear predictor and
the fitted values are computed
However, when I try to inspect those fitted marginal values, there is nothing there:
> boston.iid$marginals.fitted.values
NULL
Interestingly enough, I do get a summary of the posteriors, so they must be getting computed somehow?
> boston.iid$summary.fitted.values
mean sd 0.025quant 0.5quant 0.975quant mode
fitted.Predictor.001 2.834677 0.07604927 2.655321 2.844934 2.959994 2.858717
fitted.Predictor.002 3.020424 0.08220780 2.824525 3.034319 3.149766 3.052558
fitted.Predictor.003 3.053759 0.08883760 2.841738 3.071530 3.188051 3.094010
fitted.Predictor.004 3.032981 0.09846662 2.801099 3.056692 3.175215 3.084842
Any ideas on what I'm mis-specifying in the call? I have set compute = T, which is the setting I had seen causing issues on the R-INLA forums.
The developers intentionally disabled computing the marginals to make the model faster.
To enable it, you can add these to the inla arguments:
control.predictor=list(compute=TRUE)
control.compute=list(return.marginals.predictor=TRUE)
So it looks something like this:
boston.form <- log(CMEDV2) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + I(RM^2) +
  AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B + log(LSTAT)
boston.tr$ID <- 1:length(boston.tr)
#run model
boston.iid <- inla(update(boston.form, . ~. + f(ID, model = "iid")),
                   data = as.data.frame(boston.tr),
                   control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE,
                                          return.marginals.predictor = TRUE),
                   control.predictor = list(compute = TRUE)
)
boston.iid$summary.fitted.values
boston.iid$marginals.fitted.values

rstan MCMC: Different sequence of data resulting in different results, why?

I am new to Stan and rstan.
I recently ran into a weird issue while working on Markov chain Monte Carlo (MCMC). In short, suppose the data has 10 observations, with IDs 1 to 10. I permute it by moving the 10th row between the original first and second rows, giving IDs 1, 10, and 2 to 9. The two orderings of the data give different estimates, even if I fix the same random seed.
To illustrate the issue in a simpler way, I write the following R scripts.
##TEST 01
# generate data
N <- 100
set.seed(123)
Y <- rnorm(N, 1.6, 0.2)
stan_code1 <- "
data {
int <lower=0> N; //number of data
real Y[N]; //data in an (C++) array
}
parameters {
real mu; //mean parameter of a normal distribution
real <lower=0> sigma; //standard deviation parameter of a normal distribution
}
model {
//prior distributions for parameters
mu ~ normal(1.7, 0.3);
sigma ~ cauchy(0, 1);
//likelihood of Y given parameters
for (i in 1:N) {
Y[i] ~ normal(mu, sigma);
}
}
"
# compile model
library(rstan)
model1 <- stan_model(model_code = stan_code1) #usually, take half a minute to run
# pass data to stan and run model
set.seed(123)
fit <- sampling(model1, list(N=N, Y=Y), iter=200, chains=4)
print(fit)
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.58 1.61 1.62 1.63 1.66 473 1.00
# sigma 0.18 0.00 0.01 0.16 0.18 0.18 0.19 0.21 141 1.02
# lp__ 117.84 0.07 0.85 115.77 117.37 118.07 118.51 118.78 169 1.01
Yp <- Y[c(1,100,2:99)]
set.seed(123)
fit2 <- sampling(model1, list(N=N, Y=Yp), iter=200, chains=4)
print(fit2)
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.59 1.61 1.62 1.63 1.66 480 0.99
# sigma 0.18 0.00 0.01 0.16 0.18 0.18 0.19 0.21 139 1.02
# lp__ 117.79 0.09 0.95 115.72 117.35 118.05 118.49 118.77 124 1.01
As we can see from the above simple case, two results fit and fit2 are different.
And, even stranger, if I write the likelihood before the priors (previously, the priors were written ahead of the likelihood) in the model code, the same random seed and the same data still give different estimates.
##TEST 01'
# generate data
#N <- 100
set.seed(123)
Y <- rnorm(N, 1.6, 0.2)
stan_code11 <- "
data {
int <lower=0> N; //number of data
real Y[N]; //data in an (C++) array
}
parameters {
real mu; //mean parameter of a normal distribution
real <lower=0> sigma; //standard deviation parameter of a normal distribution
}
model {
//likelihood of Y given parameters
for (i in 1:N) {
Y[i] ~ normal(mu, sigma);
}
//prior distributions for parameters
mu ~ normal(1.7, 0.3);
sigma ~ cauchy(0, 1);
}
"
# compile model
#library(rstan)
model11 <- stan_model(model_code = stan_code11) #usually, take half a minute to run
# pass data to stan and run model
set.seed(123)
fit11 <- sampling(model11, list(N=N, Y=Y), iter=200, chains=4)
print(fit11)
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.58 1.61 1.62 1.63 1.66 455 0.99
# sigma 0.19 0.00 0.01 0.16 0.18 0.18 0.20 0.21 94 1.04
# lp__ 117.68 0.08 0.93 115.24 117.18 117.90 118.45 118.77 149 1.01
##TEST01 was
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# mu 1.62 0.00 0.02 1.58 1.61 1.62 1.63 1.66 473 1.00
# sigma 0.18 0.00 0.01 0.16 0.18 0.18 0.19 0.21 141 1.02
# lp__ 117.84 0.07 0.85 115.77 117.37 118.07 118.51 118.78 169 1.01
Stan does not use the same pseudo-random number generator as R. Thus, calling set.seed(123) only makes Y repeatable and does not make the MCMC sampling repeatable. To accomplish the latter, you need to pass an integer as the seed argument to the stan (or sampling) function in the rstan package, like
sampling(model11, list(N = N, Y = Y), seed = 1234).
Even then, I could imagine that permuting the observations could result in different realizations of the draws from the posterior distribution due to floating-point reasons. But none of this really matters (unless you conduct too few iterations or get warning messages) because the posterior distribution is the same even if a finite set of realizations from the posterior distribution are randomly different numbers.
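The floating-point point can be illustrated in base R without Stan. This is only a sketch of the general mechanism (addition of doubles is not associative, so accumulating the same terms in a different order can change the last bits of a sum); it does not reproduce Stan's internals, and the parameter values 1.62 and 0.18 are simply taken from the posterior means printed in the question.

```r
# Floating-point addition is not associative, so summation order matters.
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)
order_matters <- !identical(a, b)      # TRUE on IEEE 754 doubles

# The same effect on a log-likelihood accumulated in two data orders.
set.seed(123)
Y  <- rnorm(100, 1.6, 0.2)
Yp <- Y[c(1, 100, 2:99)]               # same permutation as in the question

loglik <- function(y, mu, sigma) {
  total <- 0
  for (v in y) total <- total + dnorm(v, mu, sigma, log = TRUE)
  total
}
ll  <- loglik(Y,  mu = 1.62, sigma = 0.18)
llp <- loglik(Yp, mu = 1.62, sigma = 0.18)
ll - llp                               # zero or a rounding-level difference
```

A difference this small is irrelevant for inference, but a sampler that makes accept/reject decisions from such sums can follow a different random path once the last bits differ.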

ZIP - Hidden Markov model r Stan

I'm trying to fit a zero-inflated Poisson hidden Markov model (ZIP-HMM) with Stan. A Stan implementation of the Poisson HMM was shown in an earlier forum post (see link).
Fitting the ZIP-HMM with classical methods, by contrast, is well documented, both the code and the model.
ziphsmm
library(ziphsmm)
set.seed(123)
prior_init <- c(0.5,0.5)
emit_init <- c(20,6)
zero_init <- c(0.5,0)
tpm <- matrix(c(0.9, 0.1, 0.2, 0.8),2,2,byrow=TRUE)
result <- hmmsim(n=100,M=2,prior=prior_init, tpm_parm=tpm,emit_parm=emit_init,zeroprop=zero_init)
y <- result$series
serie <- data.frame(y = result$series, m = result$state)
fit1 <- fasthmmfit(y,x=NULL,ntimes=NULL,M=2,prior_init,tpm,
emit_init,0.5, hessian=FALSE,method="BFGS",
control=list(trace=1))
fit1
$prior
[,1]
[1,] 0.997497445
[2,] 0.002502555
$tpm
[,1] [,2]
[1,] 0.9264945 0.07350553
[2,] 0.3303533 0.66964673
$zeroprop
[1] 0.6342182
$emit
[,1]
[1,] 20.384688
[2,] 7.365498
$working_parm
[1] -5.9879373 -2.5340475 0.7065877 0.5503559 3.0147840 1.9968067
$negloglik
[1] 208.823
Stan
library(rstan)
ZIPHMM <- 'data {
  int<lower=0> N;
  int<lower=0> y[N];
  int<lower=1> m;
}
parameters {
  real<lower=0, upper=1> theta; // zero-inflation probability
  positive_ordered[m] lambda; // Poisson means
  simplex[m] Gamma[m]; // tpm
}
model {
  vector[m] log_Gamma_tr[m];
  vector[m] lp;
  vector[m] lp_p1;
  // priors
  lambda ~ gamma(0.1,0.01);
  theta ~ beta(0.05, 0.05);
  // transposing tpm and taking the log of each entry
  for(i in 1:m)
    for(j in 1:m)
      log_Gamma_tr[j, i] = log(Gamma[i, j]);
  lp = rep_vector(-log(m), m); // uniform initial state distribution
  for(n in 1:N) {
    for(j in 1:m){
      if (y[n] == 0)
        lp_p1[j] = log_sum_exp(log_Gamma_tr[j] + lp) +
          log_sum_exp(bernoulli_lpmf(1 | theta),
                      bernoulli_lpmf(0 | theta) + poisson_lpmf(y[n] | lambda[j]));
      else
        lp_p1[j] = log_sum_exp(log_Gamma_tr[j] + lp) +
          bernoulli_lpmf(0 | theta) +
          poisson_lpmf(y[n] | lambda[j]);
    }
    lp = lp_p1;
  }
  target += log_sum_exp(lp);
}'
mod_ZIP <- stan(model_code = ZIPHMM, data=list(N=length(y), y=y, m=2), iter=1000, chains=1)
print(mod_ZIP,digits_summary = 3)
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta 0.518 0.002 0.052 0.417 0.484 0.518 0.554 0.621 568 0.998
lambda[1] 7.620 0.039 0.787 6.190 7.038 7.619 8.194 9.132 404 1.005
lambda[2] 20.544 0.039 0.957 18.861 19.891 20.500 21.189 22.611 614 1.005
Gamma[1,1] 0.664 0.004 0.094 0.473 0.604 0.669 0.730 0.841 541 0.998
Gamma[1,2] 0.336 0.004 0.094 0.159 0.270 0.331 0.396 0.527 541 0.998
Gamma[2,1] 0.163 0.003 0.066 0.057 0.114 0.159 0.201 0.312 522 0.999
Gamma[2,2] 0.837 0.003 0.066 0.688 0.799 0.841 0.886 0.943 522 0.999
lp__ -222.870 0.133 1.683 -227.154 -223.760 -222.469 -221.691 -220.689 161 0.999
True values
real = list(tpm = tpm,
zeroprop = nrow(serie[serie$m == 1 & serie$y == 0, ]) / nrow(serie[serie$m == 1,]),
emit = t(t(tapply(serie$y[serie$y != 0],serie$m[serie$y != 0], mean))))
real
$tpm
[,1] [,2]
[1,] 0.9 0.1
[2,] 0.2 0.8
$zeroprop
[1] 0.6341463
$emit
[,1]
1 20.433333
2 7.277778
The estimates look odd; could someone help me see what I am doing wrong? The Stan estimate of zeroprop is 0.518, while the real value is 0.634. The t.p.m. estimates from Stan are also quite far from the true values, and the means lambda1 = 7.62 and lambda2 = 20.54, although close to the real values 20.43 and 7.27, come out in the opposite order. I think I'm making a mistake in defining the model in Stan, but I do not know which.
Although I don't know the inner workings of the ZIP-HMM fitting algorithm, there are some obvious differences in what you have implemented in the Stan model and how the ZIP-HMM optimization algorithm describes itself. Addressing these appears to be sufficient to generate similar results.
Differences Between the Models
Initial State Probability
The values that the ZIP-HMM estimates, specifically fit1$prior, indicate that it can learn a probability for the initial state. However, in the Stan model, this is fixed to equal probabilities (1:1):
lp = rep_vector(-log(m), m);
This should be changed to allow the model to estimate an initial state.
Priors on Parameters (optional)
The Stan model has non-flat priors on lambda and theta, but presumably the ZIP-HMM is not weighting the specific values it arrives at. If one wanted to mimic the ZIP-HMM more closely, then flat priors would be better. However, the ability to have non-flat priors in Stan is really an opportunity to develop a better-tuned model than is achievable with standard HMM inference algorithms.
Zero-Inflation on State 1
From the documentation of the fasthmmfit method
Fast gradient descent / stochastic gradient descent algorithm to learn the parameters in a specialized zero-inflated hidden Markov model, where zero-inflation only happens in State 1. [emphasis added]
The Stan model assumes zero-inflation on all states. This is likely why the estimated theta value is deflated relative to the ZIP-HMM MAP estimate.
State Ordering
When estimating discrete latent states or clusters in Stan, one can use an ordered vector as a trick to mitigate against label switching issues. This is effectively achieved here with
positive_ordered[m] lambda;
However, since the ZIP-HMM only has zero-inflation on the first state, correctly implementing this behavior in Stan requires prior knowledge of what the rank of the lambda is for the "first" state. This seems very problematic for generalizing this code. For now, let's just move forward under the assumption that we can always recover this information somehow. In this specific case, we will assume that state 1 in the HMM has the higher lambda value, and therefore will be state 2 in the Stan model.
Updated Stan Model
Incorporating the above changes in the model should be something like
Stan Model
data {
  int<lower=0> N;    // length of chain
  int<lower=0> y[N]; // emissions
  int<lower=1> m;    // num states
}
parameters {
  simplex[m] start_pos;         // initial pos probs
  real<lower=0, upper=1> theta; // zero-inflation parameter
  positive_ordered[m] lambda;   // emission poisson params
  simplex[m] Gamma[m];          // transition prob matrix
}
model {
  vector[m] log_Gamma_tr[m];
  vector[m] lp;
  vector[m] lp_p1;
  // transposing tpm and taking the log of each entry
  for (i in 1:m) {
    for (j in 1:m) {
      log_Gamma_tr[j, i] = log(Gamma[i, j]);
    }
  }
  // initial position log-lik
  lp = log(start_pos);
  for (n in 1:N) {
    for (j in 1:m) {
      // log-lik for state
      lp_p1[j] = log_sum_exp(log_Gamma_tr[j] + lp);
      // log-lik for emission
      if (j == 2) { // assuming only state 2 has zero-inflation
        if (y[n] == 0) {
          lp_p1[j] += log_mix(theta, 0, poisson_lpmf(0 | lambda[j]));
        } else {
          lp_p1[j] += log1m(theta) + poisson_lpmf(y[n] | lambda[j]);
        }
      } else {
        lp_p1[j] += poisson_lpmf(y[n] | lambda[j]);
      }
    }
    lp = lp_p1; // log-lik for next position
  }
  target += log_sum_exp(lp);
}
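The zero-inflated emission term used for state 2 in the model above can be checked outside Stan. The following base-R sketch writes the same two branches as a log pmf (the helper name zip_lpmf and the parameter values are illustrative assumptions, not part of the model) and confirms numerically that it is a proper distribution.

```r
# Log pmf of a zero-inflated Poisson: with probability theta emit a
# structural zero, otherwise draw from Poisson(lambda).
zip_lpmf <- function(y, theta, lambda) {
  pois <- dpois(y, lambda, log = TRUE)
  ifelse(y == 0,
         log(theta + (1 - theta) * exp(pois)),  # mirrors log_mix(theta, 0, pois)
         log1p(-theta) + pois)                  # mirrors log1m(theta) + pois
}

theta  <- 0.634   # illustrative values only
lambda <- 20.4

p0    <- exp(zip_lpmf(0, theta, lambda))          # probability of observing 0
total <- sum(exp(zip_lpmf(0:200, theta, lambda))) # should be very close to 1
c(p0 = p0, total = total)
```

Because lambda is large here, almost all of the mass at zero comes from theta, which is why the zero-inflation parameter and the observed share of zeros in that state end up so close.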
MAP Estimate
Loading the above as a string variable code.ZIPHMM, we first compile it and run a MAP estimate (since MAP estimation is going to behave most like the HMM fitting algorithm):
model.ZIPHMM <- stan_model(model_code=code.ZIPHMM)
# note the use of some initialization on the params,
# otherwise it can occasionally converge to strange extrema
map.ZIPHMM <- optimizing(model.ZIPHMM, algorithm="BFGS",
                         data=list(N=length(y), y=y, m=2),
                         init=list(theta=0.5, lambda=c(5,10)))
Examining the estimated parameters
> map.ZIPHMM$par
start_pos[1] start_pos[2]
9.872279e-07 9.999990e-01
theta
6.342449e-01
lambda[1] lambda[2]
7.370525e+00 2.038363e+01
Gamma[1,1] Gamma[2,1] Gamma[1,2] Gamma[2,2]
6.700871e-01 7.253215e-02 3.299129e-01 9.274678e-01
shows they closely reflect the values that fasthmmfit inferred, excepting that the state orders are switched.
Sampling the Posterior
This model can also be run with MCMC to infer a full posterior,
samples.ZIPHMM <- stan(model_code = code.ZIPHMM,
data=list(N=length(y), y=y, m=2),
iter=2000, chains=4)
which samples well and yields similar results (and without any parameter initializations)
> samples.ZIPHMM
Inference for Stan model: b29a2b7e93b53c78767aa4b0c11b62a0.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
start_pos[1] 0.45 0.00 0.29 0.02 0.20 0.43 0.69 0.97 6072 1
start_pos[2] 0.55 0.00 0.29 0.03 0.31 0.57 0.80 0.98 6072 1
theta 0.63 0.00 0.05 0.53 0.60 0.63 0.67 0.73 5710 1
lambda[1] 7.53 0.01 0.72 6.23 7.02 7.49 8.00 9.08 4036 1
lambda[2] 20.47 0.01 0.87 18.83 19.87 20.45 21.03 22.24 5964 1
Gamma[1,1] 0.65 0.00 0.11 0.43 0.57 0.65 0.72 0.84 5664 1
Gamma[1,2] 0.35 0.00 0.11 0.16 0.28 0.35 0.43 0.57 5664 1
Gamma[2,1] 0.08 0.00 0.03 0.03 0.06 0.08 0.10 0.16 5605 1
Gamma[2,2] 0.92 0.00 0.03 0.84 0.90 0.92 0.94 0.97 5605 1
lp__ -214.76 0.04 1.83 -219.21 -215.70 -214.43 -213.43 -212.25 1863 1

Access / save information from metafor forest plot in meta-analysis

I'm wondering if it's possible to access (in some form) the information that is presented in the -forest- command in the -metafor- package.
I am checking / verifying results, and I'd like to have the output of values produced. Thus far, the calculations all check, but I'd like to have them available for printing, saving, etc. instead of having to type them out by hand.
Sample code is below :
es <- read.table(header=TRUE, text = "
b se_b
0.083 0.011
0.114 0.011
0.081 0.013
0.527 0.017
" )
library(metafor)
es.est <- rma(yi=b, sei=se_b, dat=es, method="DL")
studies <- as.vector( c("Larry (2011)" , "Curly (2011)", "Moe (2015)" , "Shemp (2010)" ) )
forest(es.est , transf=exp , slab = studies , refline = 1 , xlim=c(0,3), at = c(1, 1.5, 2, 2.5, 3, 3.5, 4) , showweights=TRUE)
I'd like to access the values (effect size and c.i. for each study, as well as the overall estimate, and c.i.) that are presented on the right of the graphic.
Thanks so much,
-Jon
How about:
> summary(escalc(measure="GEN", yi=b, sei=se_b, data=es), transf=exp)
b se_b yi vi sei zi ci.lb ci.ub
1 0.083 0.011 1.0865 0.0001 0.0110 7.5455 1.0634 1.1102
2 0.114 0.011 1.1208 0.0001 0.0110 10.3636 1.0968 1.1452
3 0.081 0.013 1.0844 0.0002 0.0130 6.2308 1.0571 1.1124
4 0.527 0.017 1.6938 0.0003 0.0170 31.0000 1.6383 1.7512
Then yi, ci.lb, and ci.ub provide the same info.
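For checking by hand, the per-study values in that table can also be reproduced in base R. This is a sketch that assumes normal-theory 95% intervals on the log scale followed by exponentiation, which matches the numbers shown above; it covers only the individual studies, not the pooled estimate, and the file name in the last line is hypothetical.

```r
es <- data.frame(b    = c(0.083, 0.114, 0.081, 0.527),
                 se_b = c(0.011, 0.011, 0.013, 0.017))

z <- qnorm(0.975)                      # two-sided 95% interval
per_study <- data.frame(
  est   = exp(es$b),
  ci.lb = exp(es$b - z * es$se_b),
  ci.ub = exp(es$b + z * es$se_b)
)
round(per_study, 4)

# Save for printing or further checks (hypothetical file name):
# write.csv(round(per_study, 4), "forest_values.csv", row.names = FALSE)
```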

Problems with using plotCalibration() from the PredictABEL package in R

I’ve been having some trouble with the plotCalibration() function. I have managed to get it to work before, but recently, while working with another dataset (here is a link to the .Rda data file), I have been unable to shake off an error message which keeps cropping up:
> plotCalibration(data = data, cOutcome = 2, predRisk = data$sortmort)
Error in plotCalibration(data = data, cOutcome = 2, predRisk = data$sortmort) : The specified outcome is not a binary variable.
When I’ve tried to set the cOutcome column to factors or to logical, it still doesn’t work.
I’ve looked at the source of the function and the only time the error message comes up is in the first if()else{} statement:
if (length(unique(y))!=2) {stop(" The specified outcome is not a binary variable.\n")}
else{
But I have checked that the length(unique(y)) is indeed ==2, and so don’t understand why the error message still crops up!
Be sure you're passing a data frame to plotCalibration(). Passing a dplyr tibble can cause this error. Converting with the normal as.data.frame() worked for me.
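A likely mechanism, assuming the function extracts the outcome with something like data[, cOutcome] (that line is not shown in the excerpt above): single-column indexing drops to a vector for a data frame but not for a tibble, and length() of a one-column table is 1. The base-R sketch below mimics the tibble behaviour with drop = FALSE so that it runs without extra packages.

```r
df <- data.frame(id = 1:6, outcome = c(0, 1, 0, 0, 1, 0))

y_vec <- df[, 2]                  # data frame: a plain numeric vector
y_df  <- df[, 2, drop = FALSE]    # mimics a tibble: still a one-column table

n_vec <- length(unique(y_vec))    # counts distinct outcome values
n_df  <- length(unique(y_df))     # counts columns, not distinct values

c(vector = n_vec, one_column_table = n_df)
```

Under that assumption the check length(unique(y)) != 2 fails for a tibble even when the outcome is binary, which would explain why as.data.frame() fixes it.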
Using the data you sent earlier, I do not see any error, though.
The following output was produced along with a calibration plot:
> library(PredictABEL)
> plotCalibration(data = data, cOutcome = 2, predRisk = data$sortmort)
$Table_HLtest
total meanpred meanobs predicted observed
[0.000632,0.00129) 340 0.001 0.000 0.31 0
0.001287 198 0.001 0.000 0.25 0
[0.001374,0.00201) 283 0.002 0.004 0.53 1
0.002009 310 0.002 0.000 0.62 0
[0.002505,0.00409) 154 0.003 0.000 0.52 0
[0.004086,0.00793) 251 0.006 0.000 1.42 0
[0.007931,0.00998) 116 0.008 0.009 0.96 1
[0.009981,0.19545] 181 0.024 0.011 4.40 2
$Chi_square
[1] 4.906
$df
[1] 8
$p_value
[1] 0.7676
Please try using table(data[,2],useNA = "ifany") to see the number of levels of the outcome variable of your dataset.
The function plotCalibration will execute when the outcome is a binary variable (two levels).
