{MethComp} – Deming / orthogonal regression – goodness of fit + confidence intervals - r
A question following this post. I have the following data:
x1, disease symptom
y1, another disease symptom
I fitted the x1/y1 data with a Deming regression, with the vr (or, equivalently, sdr) option set to 1. In other words, the regression is a total least squares regression, i.e. an orthogonal regression. See the previous post for the graph.
x1=c(24.0,23.9,23.6,21.6,21.0,20.8,22.4,22.6,
21.6,21.2,19.0,19.4,21.1,21.5,21.5,20.1,20.1,
20.1,17.2,18.6,21.5,18.2,23.2,20.4,19.2,22.4,
18.8,17.9,19.1,17.9,19.6,18.1,17.6,17.4,17.5,
17.5,25.2,24.4,25.6,24.3,24.6,24.3,29.4,29.4,
29.1,28.5,27.2,27.9,31.5,31.5,31.5,27.8,31.2,
27.4,28.8,27.9,27.6,26.9,28.0,28.0,33.0,32.0,
34.2,34.0,32.6,30.8)
y1=c(100.0,95.5,93.5,100.0,98.5,99.5,34.8,
45.8,47.5,17.4,42.6,63.0,6.9,12.1,30.5,
10.5,14.3,41.1,2.2,20.0,9.8,3.5,0.5,3.5,5.7,
3.1,19.2,6.4,1.2,4.5,5.7,3.1,19.2,6.4,
1.2,4.5,81.5,70.5,91.5,75.0,59.5,73.3,66.5,
47.0,60.5,47.5,33.0,62.5,87.0,86.0,77.0,
86.0,83.0,78.5,83.0,83.5,73.0,69.5,82.5,78.5,
84.0,93.5,83.5,96.5,96.0,97.5)
# plot of data
x11()
plot(x1, y1, xlim = c(0, 35), ylim = c(0, 100))
# Deming regression with variance ratio 1, i.e. orthogonal regression
library(MethComp)
dem_reg <- Deming(x1, y1)
abline(dem_reg[1:2], col = "green")  # elements 1:2 are intercept and slope
I would like to know how much x1 helps to predict y1:
normally, I'd go for an R-squared, but it does not seem to be relevant here, although another mathematician told me he thinks an R-squared may be appropriate. And this page suggests calculating a Pearson product-moment correlation coefficient, which is r, I believe?
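For reference, a minimal base-R sketch of that correlation (note that cor() is symmetric in x1 and y1, which matches the orthogonal setting better than the OLS R-squared does):
r <- cor(x1, y1)   # Pearson product-moment correlation
r^2                # its square, one candidate goodness-of-fit summary
cor.test(x1, y1)   # also gives a confidence interval for r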
partially related, there is possibly a role for a tolerance interval. I could calculate it with R (the {tolerance} package, or the code shown in the post), but it is not exactly what I am searching for.
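For illustration, here is a rough sketch of such an interval: normal-theory tolerance bounds on the orthogonal residuals of the Deming fit. This is an ad-hoc construction on my part, not a method prescribed by {tolerance} or {MethComp} for Deming fits:
library(tolerance)
a <- dem_reg[1]; b <- dem_reg[2]
# perpendicular (orthogonal) distances from each point to the fitted line
ortho_res <- (y1 - a - b * x1) / sqrt(1 + b^2)
# two-sided interval covering 95% of residuals, with 95% confidence
normtol.int(ortho_res, alpha = 0.05, P = 0.95, side = 2)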
Does someone know how to calculate a goodness of fit for Deming regression, using R? I looked at the MethComp PDF but could not find it (perhaps I missed it, though).
EDIT: following Gaurav's answer about confidence intervals, here is the R code.
Firstly: confidence intervals for parameters
library(mcr)
MCR_reg = mcreg(x1, y1, method.reg = "Deming", error.ratio = 1, method.ci = "analytical")
getCoefficients(MCR_reg)  # estimates, standard errors, and confidence limits
Secondly: confidence intervals for predicted values
# plot of data
x11()
plot(x1,y1,xlim=c(0,35),ylim=c(0,100))
# Deming regression using functions from {mcr}
library(mcr)
MCR_reg = mcreg(x1, y1, method.reg = "Deming", error.ratio = 1, method.ci = "analytical")
MCR_intercept=getCoefficients(MCR_reg)[1,1]
MCR_slope=getCoefficients(MCR_reg)[2,1]
# CI for predicted values
x_to_predict=seq(0,35)
predicted_values = MCResultAnalytical.calcResponse(MCR_reg, x_to_predict, alpha = 0.05)
CI_low = predicted_values[, 4]  # lower bound of the 95% CI for the response
CI_up = predicted_values[, 5]   # upper bound of the 95% CI for the response
# plot regression line and CI for predicted values
abline(MCR_intercept,MCR_slope, col="red")
lines(x_to_predict,CI_low,col="royalblue",lty="dashed")
lines(x_to_predict,CI_up,col="royalblue",lty="dashed")
# comments
text(7.5,60, "Deming regression", col="red")
text(7.5,40, "Confidence Interval for", col="royalblue")
text(7.5,35, "Predicted values - 95%", col="royalblue")
EDIT 2
Topic moved to Cross Validated:
https://stats.stackexchange.com/questions/167907/deming-orthogonal-regression-measuring-goodness-of-fit
There are many proposed methods for calculating goodness of fit and tolerance intervals for Deming regression, but none of them is widely accepted. The conventional methods we use for OLS regression may not make sense here. This is an area of active research. I don't think there are many R packages that will help you compute these, since not many mathematicians agree on any particular method. Most methods for calculating intervals are based on resampling techniques.
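To make the resampling idea concrete, here is a minimal nonparametric bootstrap for the Deming coefficients, reusing the Deming() call from the question (the 2000 replicates and the percentile intervals are arbitrary choices; I believe Deming() also has a boot argument that automates something similar):
library(MethComp)
set.seed(1)
boot_coefs <- replicate(2000, {
  i <- sample(seq_along(x1), replace = TRUE)  # resample (x, y) pairs with replacement
  Deming(x1[i], y1[i])[1:2]                   # refit; keep intercept and slope
})
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))  # percentile 95% CIs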
However, you can also check out the 'mcr' package for intervals...
https://cran.r-project.org/web/packages/mcr/
Related
Test of second differences for average marginal effects in logistic regression
I have a question similar to the one here: Testing the difference between marginal effects calculated across factors. I used the same code to generate average marginal effects for two groups. The difference is that I am running a logistic rather than linear regression model. My average marginal effects are on the probability scale, so emmeans will not provide the correct contrast. Does anyone have any suggestions for how to test whether there is a significant difference in the average marginal effects between group 1 and group 2? Thank you so much, Ilana
It is a bit unclear what the issue really is, but I'll try. I'm supposing your logistic regression model was fitted using, say, glm:
mod <- glm(cbind(heads, tails) ~ treat, data = mydata, family = binomial())
If you then do
emm <- emmeans(mod, "treat")
emm         ### marginal means
pairs(emm)  ### differences
your results will be presented on the logit scale. If you want them on the probability scale, you can do
summary(emm, type = "response")
summary(pairs(emm), type = "response")
However, the latter will back-transform the differences of logits, thereby producing odds ratios. If you actually want differences of probabilities rather than ratios of odds, use regrid(), which will construct a new grid of values after back-transforming (and hence it will forget the log transformation):
pairs(regrid(emm))
It seems possible that two or more factors are present and you want contrasts of contrasts on the probability scale. In that case, extend this idea by calling regrid() on the table of EMMs to put everything on the probability scale, then follow the analogous procedure used in the linked article.
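A self-contained version of the regrid() step, using mtcars as stand-in data (the original mydata is not available, so the variable names here are placeholders):
library(emmeans)
mtcars$vs <- factor(mtcars$vs)
mod <- glm(am ~ vs, data = mtcars, family = binomial())
emm <- emmeans(mod, "vs")               # estimated marginal means, logit scale
summary(pairs(emm), type = "response")  # back-transformed difference: an odds ratio
pairs(regrid(emm))                      # difference of probabilities instead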
Calculating marginal effects from predicted probabilities of zeroinfl() model object
This plot, which I previously created, shows predicted probabilities of claim onset based on two variables: PIB (scaled across the x-axis) and W, presented at its 75th and 25th percentiles. Confidence intervals for the predictions are presented alongside the two lines. [Figure: Probability of Claim Onset] As I theorize that W and PIB have an interactive effect on claim onset, I'd like to see whether the marginal effect of W on PIB is significant. Confidence intervals of the predicted probabilities alone cannot confirm that this effect is insignificant, per my reading here (https://www.sociologicalscience.com/download/vol-6/february/SocSci_v6_81to117.pdf). I know that you can calculate the marginal effect easily from predicted probabilities by subtracting one from the other. Yet I don't understand how I can get the confidence intervals for the marginal effect, which are obviously needed to determine when and where my two sets of probabilities are indeed significantly different from one another. The function I used for calculating predicted probabilities of the zeroinfl() model object, and the confidence intervals of those predicted probabilities, is derived from an online posting (https://stat.ethz.ch/pipermail/r-help/2008-December/182806.html). I'm happy to provide more code if needed, but as this is not a question about an error, I am not sure it is needed.
So, I'm not entirely sure this is the correct answer, but to anyone who might come across the same problem I did: assuming that the two prediction lines maintain the same variance, you can pool the SEs before calculating. See the Wikipedia article on pooled variance to confirm.
SEpooled <- ((pred_1_OR_pred_2$SE * sqrt(simulation_n))^2) * (sqrt((1/simulation_n) + (1/simulation_n)))
low_conf <- (pred_1$PP - pred_2$PP) - (1.96 * SEpooled)
high_conf <- (pred_1$PP - pred_2$PP) + (1.96 * SEpooled)
## Add this to the plot
lines(pred_1$x_val, low_conf, lty = 2)
lines(pred_1$x_val, high_conf, lty = 2)
Plotting standard errors for effects
I have a lme4 model I have run for a hierarchical logistic regression, and I'm plotting the effects using the effects package. I would like to create an effects graph with the standard error of the mean as the error bars. I can get the point estimates, 95% confidence intervals, and standard errors into a dataframe. The standard errors, however, seem at odds with the confidence limit parameters; see below for an example with a regular glm.
library(effects)
library(dplyr)
mtcars <- mtcars %>% mutate(vs = factor(vs))
glm1 <- glm(am ~ vs, mtcars, family = "binomial")
(glm1_eff <- Effect("vs", glm1) %>% as.data.frame())
  vs       fit        se     lower     upper
1  0 0.3333333 0.4999999 0.1580074 0.5712210
2  1 0.5000000 0.5345225 0.2596776 0.7403224
My understanding is that the fit column displays the point estimate for the probability that am equals 1, and that lower and upper correspond to the 95% confidence intervals for that probability. Note that the standard error does not seem to correspond to the confidence interval (e.g., .33 + .49 > .57). Here's what I am shooting for: as opposed to a 95% confidence interval, I would like an effects plot with +/- the standard error of the mean. Are the standard errors in log-odds instead of probability? Is there a simple way to convert them to probabilities and plot them so that I can make the graph?
John Fox shared this helpful response: From ?Effect: "se: (for "eff" objects) a vector of standard errors for the effect, on the scale of the linear predictor." So the standard errors are on the log-odds scale. You could use the delta method to get standard errors on the probability scale, but that would be very ill-advised, since the approach to asymptotic normality of estimated probabilities is much slower than that of log-odds. Effect() computes confidence limits on the scale of the linear predictor (log-odds for a logit model) and then inverse-transforms them to the scale of the response (probabilities). All of the information you need to create a custom plot is in the "eff" object returned by Effect(); the contents of the object are documented in ?Effect. I agree, by the way, that the as.data.frame.eff() method could be improved, and I'll do that when I have a chance. In particular, it invites misunderstanding to report the effects and confidence limits on the scale of the response but to show standard errors on the linear-predictor scale.
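To make the (discouraged) delta-method conversion concrete anyway: the derivative of the inverse logit at probability p is p(1 - p), so a first-order SE on the probability scale follows directly. The values below are taken from the table in the question:
p <- 0.3333333          # fitted probability for vs == 0
se_logit <- 0.4999999   # standard error on the linear-predictor (logit) scale
se_prob <- se_logit * p * (1 - p)  # delta-method SE on the probability scale
se_prob                 # about 0.111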
I'm answering the mystery first, then addressing the "show SE on the plot" question.
Explanation of the SE mystery: all math in a GLM needs to be done on the link scale, because this is the additive scale (where stuff can be added up). So:
The values in the column "fit" are the predicted probabilities of success (the "predictions on the response scale"). Their values are expit(b0) and expit(b0 + b1), where expit() is the inverse logit function.
The SEs are on the link scale. An SE on the response scale doesn't make much sense, because the response scale is non-linear (although it's kind of weird to have stats on the response and link scales in the same table).
"lower" and "upper" are on the response scale, so these are the CIs of the predicted probabilities of success. They are computed as expit(b0 ± 1.96 SE) and expit(b0 + b1 ± 1.96 SE).
To recover these values with what is given:
library(boot)  # inv.logit and logit functions
expit.pred_0 <- 1/3  # fit 0
expit.pred_1 <- 1/2  # fit 1
se1 <- 1/2
se2 <- .5345225
inv.logit(logit(expit.pred_0) - qnorm(.975) * se1)
inv.logit(logit(expit.pred_0) + qnorm(.975) * se1)
inv.logit(logit(expit.pred_1) - qnorm(.975) * se2)
inv.logit(logit(expit.pred_1) + qnorm(.975) * se2)
> inv.logit(logit(expit.pred_0) - qnorm(.975)*se1)
[1] 0.1580074
> inv.logit(logit(expit.pred_0) + qnorm(.975)*se1)
[1] 0.5712211
> inv.logit(logit(expit.pred_1) - qnorm(.975)*se2)
[1] 0.2596776
> inv.logit(logit(expit.pred_1) + qnorm(.975)*se2)
[1] 0.7403224
Showing an SE computed from a GLM on the response (non-additive) scale doesn't make sense, because the SE is only additive on the link scale. In other words, multiplying the SE by some quantile on the response scale (the scale of the plot you envision, with probability on the y axis) is meaningless. A CI is a point estimate back-transformed from the link scale, and so makes sense for plotting. I frequently see researchers plotting SE bars computed from a linear model, like you envision, even though the statistics presented are from a GLM. These SEs are meaningful in a sense, I guess, but they often imply absurd consequences (like probabilities less than zero or greater than one), so... don't do that either.
Confidence intervals for predicted probabilities from predict.lrm
I am trying to determine confidence intervals for predicted probabilities from a binomial logistic regression in R. The model is estimated using lrm (from the rms package) to allow for clustering standard errors on survey respondents (each respondent appears up to 3 times in the data):
library(rms)
model1 <- lrm(outcome ~ var1 + var2 + var3, data = mydata, x = T, y = T, se.fit = T)
model.rob <- robcov(model1, cluster = respondent.id)
I am able to estimate a predicted probability for the outcome using predict.lrm:
predicted.prob <- predict(model.rob, newdata = data.frame(var1 = 1, var2 = .33, var3 = .5), type = "fitted")
What I want to determine is a 95% confidence interval for this predicted probability. I have tried specifying se.fit = T, but this is not permissible in predict.lrm when type = "fitted". I have spent the last few hours scouring the Internet for how to do this with lrm, to no avail (obviously). Can anyone point me toward a method for determining this confidence interval? Alternatively, if it is impossible or difficult with lrm models, is there another way to estimate a logit with clustered standard errors for which confidence intervals would be more easily obtainable?
The help file for predict.lrm has a clear example. Here is a slight modification of it:
L <- predict(fit, newdata = data.frame(...), se.fit = TRUE)
plogis(with(L, linear.predictors + 1.96 * cbind(-se.fit, se.fit)))
For some problems you may want to use the gendata or Predict functions, e.g.
L <- predict(fit, gendata(fit, var1 = 1), se.fit = TRUE)  # leave other vars at median/mode
Predict(fit, var1 = 1:2, var2 = 3)  # leave other vars at median/mode; gives CLs
How can I predict the probability of an event using the Weibull distribution?
I have a data set of connection forces, based on axial force in N (http://pastebin.com/Huwg4vxv). Some previous analysis has been undertaken (by another party) that fitted a Weibull distribution to the data and then predicted that the chance of recording a force of 60 N or higher is around 1.2%. I have to say that, eyeballing the data, that doesn't seem likely to me, but I know nothing about this particular distribution. So far I am able to fit the curve:
force <- read.csv(file = "forcestats.csv", header = T)
library(MASS)
fitdistr(force$F, 'weibull')
hist(force$F)
I am trying to understand:
is a Weibull distribution really the best fit for this data?
how can I make that same prediction using R (i.e., calculate the probability of values above 60 N)?
is it possible to calculate the 95% confidence interval for that value (i.e., 1.2% +/- x%)?
Thanks for reading, Pete
To address your first item,
is a Weibull distribution really the best fit for this data?
conceptually, this is more a question about statistical inference than programming, so you most likely want to tackle it on CrossValidated rather than SO. However, you can certainly inquire about the means of investigating this programmatically, such as comparing the estimated density of the observed data to the theoretical density function, or to the density function of random samples from a Weibull distribution with your parameter estimates:
library(MASS)
Weibull <- read.csv("F:/Studio/MiscData/force_in_newtons.txt", header = TRUE)
params <- fitdistr(Weibull$F, 'weibull')
Shape <- params[[1]][1]
Scale <- params[[1]][2]
set.seed(123)
plot(density(rweibull(500, shape = Shape, scale = Scale)),
     col = "red", lwd = 2, lty = 3, main = "")
lines(density(Weibull$F), col = "blue", lty = 3, lwd = 2)
legend("topright",
       legend = c("rweibull(n=500,...)", "observed data"),
       lty = c(3, 3), col = c("red", "blue"), lwd = c(3, 3), bty = "n")
Of course, there are many other ways of assessing the fit of your model; this is just a quick sanity check. As for your second question, you can use the pweibull function with lower.tail=FALSE to get probabilities from the theoretical survival function (S(x) = 1 - F(x)):
## Pr(X >= 60)
> pweibull(60, shape = Shape, scale = Scale, lower.tail = FALSE)
[1] 0.01268268
As for your final item, I believe that calculating confidence intervals on probabilities (as well as certain other statistical quantities) for an estimated distribution requires using the delta method; I could be recalling incorrectly, though, so you may want to double-check this. If that is the case and you aren't familiar with the delta method, then unfortunately you will probably have to do a fair amount of reading on the subject, because the calculation involved is generally non-trivial - here's another link; the Wikipedia article doesn't give a very in-depth treatment of the subject. Or you could inquire about this on Cross Validated as well.
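As an alternative to the delta method for that final item, a parametric bootstrap is straightforward, if somewhat brute-force. This is a sketch that reuses the Weibull, Shape, and Scale objects from the code above; the 2000 replicates and percentile interval are arbitrary choices:
set.seed(42)
n <- nrow(Weibull)
boot_p <- replicate(2000, {
  sim <- rweibull(n, shape = Shape, scale = Scale)  # simulate from the fitted model
  est <- fitdistr(sim, 'weibull')$estimate          # refit to the simulated sample
  pweibull(60, est["shape"], est["scale"], lower.tail = FALSE)
})
quantile(boot_p, c(0.025, 0.975))  # rough 95% CI for Pr(X >= 60)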