posterior predictive check in simmr - r

I am using simmr (https://andrewcparnell.github.io/simmr/articles/simmr.html#how-to-run-simmr-1) and am trying to determine what threshold is acceptable for posterior predictive checks.
In the linked vingette, the text says
"You can check the fit of the model with a posterior predictive check. This is similar to a fitted values plot in a linear regression. If the data points (denoted by the plot as y) broadly lie in the fitted value intervals (denoted yrep; the default is a 50% interval) then the model is fitting well. Observations lie outside the posterior predictive and the proportion doing so, should approximately match the proportion specified in the posterior_predictive function (default 50%)."
So, does that mean as long as the proportion of y outside yrep is <=0.5, that the model has "good fit"? I do not understand at what threshold would be considered "bad fit"

Related

Accuracy of predicted model in R

I am trying to calculate the accuracy of a predicted model with respect to the real case. In this case, I am using linear regression to predict my desired value. Plots show good accuracy between predicted and real models but not as good as my calculation for accuracy suggests. I am using the following code to calculate the accuracy of my model in Rstudio:
predicted <- import.list$y1
actual <- import.list$T_stp_cool
comparison <- data.frame(actual,predicted)
difference <- ((actual-predicted)/actual)
accuracy=1-abs(mean(difference))
In this case, accuracy is 99% most of the time but comparing the plots of the real case and predicted one does not show 99% accuracy. What's the most accurate way to calculate the accuracy of my model?

Calculating marginal effects from predicted probabilities of zeroinfl() model object

This plot, which I previously created, shows predicted probabilities of claim onset based on two variables, PIB (scaled across the x-axis) and W, presented as its 75th and 25th percentiles. Confidence intervals for the predictions are presented alongside the two lines.
Probability of Claim Onset
As I theorize that W and PIB have an interactive effect on claim onset, I'd like to see if there is any significance in the marginal effect of W on PIB. Confidence intervals of the predicted probabilities alone cannot confirm that this effect is insignificant, per my reading here (https://www.sociologicalscience.com/download/vol-6/february/SocSci_v6_81to117.pdf).
I know that you can calculate marginal effect easily from predicted probabilities by subtracting one from the other. Yet, I don't understand how I can get the confidence intervals for the marginal effect -- obviously needed to determine when and where my two sets of probabilities are indeed significantly different from one another.
The function that I used for calculating predicted probabilities of the zeroinfl() model object and the confidence intervals of those predicted probabilities is derived from an online posting (https://stat.ethz.ch/pipermail/r-help/2008-December/182806.html). I'm happy to provide more code if needed, but as this is not a question about an error, I am not sure it is needed.
So, I'm not entirely sure this is the correct answer, but to anyone who might come across the same problem I did:
Assuming that the two prediction lines maintain the same variance, you can pool SE before then calculating. See the wikipedia for Pooled Variance to confirm.
SEpooled <- ((pred_1_OR_pred_2$SE * sqrt(simulation_n))^2) * (sqrt((1/simulation_n)+(1/simulation_n)))
low_conf <- (pred_1$PP - pred_2$PP) - (1.96*SEpooled)
high_conf <- (pred_1$PP - pred_2$PP) + (1.96*SEpooled)
##Add this to the plot
lines(pred_1$x_val, low_conf, lty=2)
lines(pred_1$x_val, high_conf, lty=2)

confidence interval of estimates in a fitted hybrid model by spatstat

hybrid Gibbs models are flexible for fitting spatial pattern data, however, I am confused on how to get the confidence interval for the fitted model's estimate. for instance, I fitted a hybrid geyer model including a hardcore and a geyer saturation components, got the estimates:
Mo.hybrid<-Hybrid(H=Hardcore(), G=Geyer(81,1))
my.hybrid<-ppm(my.X~1,Mo.hybrid, correction="bord")
#beta = 1.629279e-06
#Hard core distance: 31.85573
#Fitted G interaction parameter gamma: 10.241487
what I interested is the gamma, which present the aggregation of points. obviously, the data X is a sample, i.e., of cells in a anatomical image. in order to report statistical result, a confidence interval for gamma is needed. however, i do not have replicates for the image data.
can i simlate 10 time of the fitted hybrid model, then refitted them to get confidence interval of the estimate? something like:
mo.Y<-rmhmodel(cif=c("hardcore","geyer"),
par=list(list(beta=1.629279e-06,hc=31.85573),
list(beta=1, gamma=10.241487,r=81,sat=1)), w=my.X)
Y1<-rmh(model=mo.Y, control = list(nrep=1e6,p=1, fixall=TRUE),
start=list(n.start=c(npoint(my.X))))
Y1.fit<-ppm(Y1~1, Mo.hybrid,rbord=0.1)
# simulate and fit Y2,Y3,...Y10 in same way
or:
Y10<-simulate(my.hybrid,nsim=10)
Y1.fit<-ppm(Y10[1]~1, Mo.hybrid,rbord=0.1)
# fit Y2,Y3,...Y10 in same way
certainly, the algorithms is different, the rmh() can control simulated intensity while the simulate() does not.
now the questions are:
is it right to use simualtion to get confidence interval of estimate?
or the fitted model can provide estimate interval that could be extracted?
if simulation is ok, which algorithm is better in my case?
The function confint calculates confidence intervals for the canonical parameters of a statistical model. It is defined in the standard stats package. You can apply it to fitted point process models in spatstat: in your example just type confint(my.hybrid).
You wanted a confidence interval for the non-canonical parameter gamma. The canonical parameter is theta = log(gamma) so if you do exp(confint(my.hybrid) you can read off the confidence interval for gamma.
Confidence intervals and other forms of inference for fitted point process models are discussed in detail in the spatstat book chapters 9, 10 and 13.
The confidence intervals described above are the asymptotic ones (based on the asymptotic variance matrix using the central limit theorem).
If you really wanted to estimate the variance-covariance matrix by simulation, it would be safer and easier to fit the model using method='ho' (which performs the simulation) and then apply confint as before (which would then use the variance of the simulations rather than the asymptotic variance).
rmh.ppm and simulate.ppm are essentially the same algorithm, apart from some book-keeping. The differences observed in your example occur because you passed different arguments. You could have passed the same arguments to either of these functions.

regTermTest in R survey package does not appear to produce correct p-values

I am trying to use regTermTest to test for estimated with the survey package in R. This is a simulation study. I simulated 1,000 stratified random samples from a finite population. The correct model is given by y ~ x + Stratum, for some covariate x. The intercept varies by stratum. I fit y ~ x * Stratum using svyglm() and then ran regTermTest(myOutput, ~x:Stratum), to test for the presence of an interaction effect. I also the usual ANOVA using a linear model. The p-values from regTermTest ought to follow a uniform distribution, with 5% of them having values below 5%. Same for the ANOVA. The p-values for the regular linear model worked out, but regTermTest produced p-values that were way too small. The median p-value was 0.02, when it should have been 0.5.
Does anyone know what's going wrong here? And should I have posted this question to Cross-Validated?

How to set a weighted least-squares in r for heteroscedastic data?

I'm running a regression on census data where my dependent variable is life expectancy and I have eight independent variables. The data is aggregated be cities, so I have many thousand observations.
My model is somewhat heteroscedastic though. I want to run a weighted least-squares where each observation is weighted by the city’s population. In this case, it would mean that I want to weight the observations by the inverse of the square root of the population. It’s unclear to me, however, what would be the best syntax. Currently, I have:
Model=lm(…,weights=(1/population))
Is that correct? Or should it be:
Model=lm(…,weights=(1/sqrt(population)))
(I found this question here: Weighted Least Squares - R but it does not clarify how R interprets the weights argument.)
From ?lm: "weights: an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used." R doesn't do any further interpretation of the weights argument.
So, if what you want to minimize is the sum of (the squared distance from each point to the fit line * 1/sqrt(population) then you want ...weights=(1/sqrt(population)). If you want to minimize the sum of (the squared distance from each point to the fit line * 1/population) then you want ...weights=1/population.
As to which of those is most appropriate... that's a question for CrossValidated!
To answer your question, Lucas, I think you want weights=(1/population). R parameterizes the weights as inversely proportional to the variances, so specifying the weights this way amounts to assuming that the variance of the error term is proportional to the population of the city, which is a common assumption in this setting.
But check the assumption! If the variance of the error term is indeed proportional to the population size, then if you divide each residual by the square root of its corresponding sample size, the residuals should have constant variance. Remember, dividing a random variable by a constant results in the variance being divided by the square of that constant.
Here's how you can check this: Obtain residuals from the regression by
residuals = lm(..., weights = 1/population)$residuals
Then divide the residuals by the square roots of the population variances:
standardized_residuals = residuals/sqrt(population)
Then compare the sample variance among the residuals corresponding to the bottom half of population sizes:
variance1 = var(standardized_residuals[population < median(population)])
to the sample variance among the residuals corresponding to the upper half of population sizes:
variance2 = var(standardized_residuals[population > median(population)])
If these two numbers, variance1 and variance2 are similar, then you're doing something right. If they are drastically different, then maybe your assumption is violated.

Resources