GAM smooths interaction differences - calculate p value using mgcv and gratia 0.6 - r

I am using the useful gratia package by Gavin Simpson to extract the difference in two smooths for two different levels of a factor variable. The smooths are generated by the wonderful mgcv package. For example
library(mgcv)
library(gratia)
m1 <- gam(outcome ~ s(dep_var, by = fact_var) + fact_var, data = my.data)
diff1 <- difference_smooths(m1, smooth = "s(dep_var)")
draw(diff1)
This gives me a graph of the difference between the two smooths for each level of the "by" variable in the gam() call. The graph has a shaded 95% credible interval (CI) for the difference.
Statistical significance, or regions of statistical significance at the 0.05 level, is assessed by whether, and where, the CI excludes the line y = 0, where the y axis represents the difference between the smooths.
Here is an example from Gavin's site where the "by" factor variable had 3 levels.
The differences are clearly statistically significant (at 0.05) over nearly all of the graphs.
Here is another example I have generated using a "by" variable with 2 levels.
The difference in my example is clearly not statistically significant anywhere.
In the mgcv package, an approximate p value is reported for each smooth. It tests the null hypothesis that all of the smooth's coefficients equal 0, based on a chi-square test.
My question is, can anyone suggest a way of calculating a p value that similarly assesses the difference between the two smooths instead of solely relying on graphical evidence?
The output from difference_smooths() is a data frame with differences between the smooth functions at 100 points in the range of the smoothed variable, the standard error for the difference and the upper and lower limits of the CI.
The release announcement for gratia 0.4 explains the difference_smooths() function, but gratia is now at version 0.6.
Thanks in advance for taking the time to consider this.
Don

One way of getting a p value for the difference between the by-factor smooths is to vary the ci_level argument of difference_smooths(), which defaults to 0.95. The level can be adjusted until y = 0 is just no longer within the CI band. If, for example, this occurs at ci_level = my_level, the p value for testing the hypothesis that the difference is zero everywhere would be 1 - my_level.
This is not totally satisfactory. It takes some manual experimentation, and it may be difficult to discern accurately when zero drops out of the CI. A function could be written to search the data frame returned by difference_smooths() as ci_level is varied, but this is not totally satisfactory either, because detecting a CI that excludes zero would depend on the 100 points chosen by difference_smooths() to assess the difference between the two curves. Then again, the standard errors are approximate for a GAM fitted with mgcv, so that shouldn't be too much of a problem.
Here is a graph where the zero first drops out of the CI.
Zero dropped out at ci_level = 0.88 and was still in the interval at ci_level = 0.89. So an approximate p value would be 1 - 0.88 = 0.12.
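The manual search over ci_level can be replaced by a direct calculation, sketched here under two assumptions: that the band is the pointwise normal-theory interval diff ± z × se, and that the data frame returned by difference_smooths() has columns named diff and se (check the column names in your gratia version; a small hypothetical data frame stands in for that output below). Zero first leaves the band at the grid point with the largest |diff / se|, so the level, and 1 - level, follow from that single point.

```r
# Hypothetical stand-in for the output of difference_smooths();
# the column names `diff` and `se` are assumed, not guaranteed.
diffs <- data.frame(
  diff = c(0.10, 0.50, -0.20),
  se   = c(0.20, 0.25,  0.20)
)

# Pointwise z scores of the estimated difference
z <- diffs$diff / diffs$se

# Largest ci_level at which zero is still excluded somewhere on the grid
level_out <- 2 * pnorm(max(abs(z))) - 1

# The "1 - my_level" quantity described in the text
p_approx <- 1 - level_out
p_approx
```

Note that this is the smallest pointwise p value over the grid. It is not adjusted for looking at many points, so it will tend to overstate the evidence against "zero everywhere".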
Can anyone think of a better way?
Reply to Gavin Simpson's comments Feb 19
Thanks very much Gavin for taking the time to make your comments.
I am not sure that using the criterion >= 0 (for negative diffs) is a good way to go. Because of the draws from the posterior, there are likely to be many diffs that meet this criterion. I am interpreting your criterion as: sample the posterior distribution, count how many differences meet the criterion, calculate the proportion, and that is the p value. Correct me if I have misunderstood. Using this approach, I consistently got p values of around 0.45 - 0.5 for different gam models, even when it was clear the difference in the smooths should be statistically significant, at least at p = 0.05, because the confidence band around the difference smooth did not contain zero at a number of points.
Instead, I was thinking perhaps it would be better to compare the means of the posterior distribution of each of the diffs. For example
# get coefficients for the by smooths
coeff.level1 <- coef(gam.model1)[31:38]
coeff.level0 <- coef(gam.model1)[23:30]
# these indices are specific to my multi-variable gam.model1
# in my case 8 coefficients per smooth
# get posterior coefficients variances for the by smooths' coefficients
vp_level1 <- gam.model1$Vp[31:38, 31:38]
vp_level0 <- gam.model1$Vp[23:30, 23:30]
# run the simulation to get the distribution of each
# difference coefficient using the joint variance
library(MASS)
no.draws <- 1000
sim <- mvrnorm(n = no.draws, mu = coeff.level1 - coeff.level0,
               Sigma = vp_level1 + vp_level0)
# sim is a no.draws X no. of coefficients (8 in my case) matrix
# put the results into a data.frame.
y.group <- data.frame(y = as.vector(sim),
                      group = rep(1:8, each = no.draws))
# y has the differences sampled from their posterior distributions.
# group is just a grouping name for the 8 sets of differences,
# (one set for each difference in coefficients)
# compare means with a linear regression
lm.test <- lm(y ~ as.factor(group), data = y.group)
summary(lm.test)
# The p value for the F statistic tells you how
# compatible the data are with the null hypothesis that
# all the group means are equal to each other.
# Same F statistic and p value from
anova(lm.test)
One could argue that if the coefficients are not all equal to each other then they cannot all be equal to zero, but that isn't what we want here.
The smooth tests of fit given by summary(gam.model1) are based on a joint test of all coefficients == 0. This would be from a type of likelihood ratio test where model fit with and without a term are compared.
I would appreciate some ideas how to do this with the difference between two smooths.
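One possible direction, offered as a rough sketch rather than an established method, is a Wald-type joint test that the coefficient differences are all zero. Unlike the sum of the two Vp blocks, it uses the full posterior covariance of the difference, including the cross-covariance between the two sets of coefficients. It assumes both by-level smooths share the same basis so their coefficients are directly comparable, and it uses the number of coefficients as the degrees of freedom, which ignores penalisation. The data, variable names (x, grp, y) and model below are simulated and hypothetical.

```r
library(mgcv)

set.seed(1)
n <- 400
dat <- data.frame(
  x   = runif(n),
  grp = factor(sample(c("no", "yes"), n, replace = TRUE))
)
# Simulated response: the "yes" group has an extra curved component
dat$y <- sin(2 * pi * dat$x) + (dat$grp == "yes") * dat$x^2 +
  rnorm(n, sd = 0.3)

m <- gam(y ~ grp + s(x, by = grp), data = dat, method = "REML")

# Locate the coefficients of each by-level smooth by name
nm      <- names(coef(m))
idx_no  <- grep("s(x):grpno",  nm, fixed = TRUE)
idx_yes <- grep("s(x):grpyes", nm, fixed = TRUE)

# Coefficient differences and their covariance:
# Var(b1 - b0) = V11 + V00 - V10 - V01
d  <- coef(m)[idx_yes] - coef(m)[idx_no]
V  <- vcov(m)  # Bayesian posterior covariance (Vp) by default
Vd <- V[idx_yes, idx_yes] + V[idx_no, idx_no] -
  V[idx_yes, idx_no] - V[idx_no, idx_yes]

# Wald statistic and a crude chi-square reference distribution
W      <- drop(crossprod(d, solve(Vd, d)))
p_wald <- pchisq(W, df = length(d), lower.tail = FALSE)
p_wald
```

Because each by-level smooth is centred, this only compares the shapes of the two curves, not their overall levels, and the resulting p value should be treated as approximate.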
Now that I have got this far, I had a rethink of your original suggestion of using the criterion >= 0 (for negative diffs). I reinterpreted this as follows: for each simulated coefficient-difference distribution (8 in my case), count how often the criterion is met, and make a table with one row per distribution and two columns holding this count and (number of simulation draws minus count). Then run a chi-square test on this table. When I did this, I got a very low p value when I believe I shouldn't have, as 0 was well within the smooth difference CI across almost all the levels of the exposure. Maybe I am still misunderstanding your suggestion.
Follow up thought Feb 24
In a follow up thought, we could create a variable that represents the interaction between the by factor and continuous variable
library(dplyr)
my.dat <- my.dat %>% mutate(interact.var =
ifelse(factor.2levels == "yes", 1, 0)*cont.var)
Here I am assuming that factor.2levels has the levels ("no", "yes"), and "no" is the reference level. The ifelse function creates a dummy variable which is multiplied by the continuous variable to generate the interactive variable.
Then we place this interactive variable in the GAM and get the usual statistical test for fit, that is, testing all the coefficients == 0.

Gavin Simpson actually posted a method for getting the difference between two smooths and assessing its statistical significance here in 2017. Thanks to Matteo Fasiolo for pointing me in that direction.
In that approach, the by variable is converted to an ordered categorical variable which causes mgcv::gam to produce difference smooths in comparison to the reference level. Statistical significance for the difference smooths is then tested in the usual way with the summary command for the gam model.
However, and correct me if I have misunderstood, the ordered factor approach causes the smooth for the main effect to now be the smooth for the reference level of the ordered factor.
The approach I suggested, see the main post under the heading, Follow up thought Feb 24, where the interaction variable is created, gives an almost identical result for the p value for the difference smooth but does not change the smooth for the main effect. It also does not change the intercept and the linear term for the by categorical variable which also both changed with the ordered variable approach.
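For readers who want to try the ordered-factor approach described above, here is a minimal sketch on simulated data. The variable names (x, f, of, y) are hypothetical, and the model is only an illustration of the general idea, not a reproduction of the 2017 post.

```r
library(mgcv)

set.seed(42)
n <- 400
dat <- data.frame(
  x = runif(n),
  f = factor(sample(c("no", "yes"), n, replace = TRUE))
)
dat$y <- sin(2 * pi * dat$x) + (dat$f == "yes") * dat$x^2 +
  rnorm(n, sd = 0.3)

# Ordered version of the by factor; "no" is the reference level
dat$of <- ordered(dat$f, levels = c("no", "yes"))

# s(x) is the smooth for the reference level;
# s(x, by = of) is the difference smooth for "yes" relative to "no"
m_of <- gam(y ~ of + s(x) + s(x, by = of), data = dat, method = "REML")

# Approximate test for each smooth, including the difference smooth
s_tab <- summary(m_of)$s.table
s_tab
```

The row labelled s(x):ofyes holds the approximate p value for the difference smooth.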

Related

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variables, y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too few data points to estimate the 3 parameters needed for a beta-binomial. Hence I fix the probability so that the mean matches the mean of your scores; looking at the distribution above, that seems OK:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob,size,theta) {
-sum(dbetabinom(x1,prob,size,theta,log=TRUE))
}
m0 <- mle2(mtmp,start=list(theta=100),
data=list(size=10,prob=mean(x1)/10),control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually ok with using a normal estimation and in this case it will be:
normal_fit$estimate
mean sd
6.560000 1.134196
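For the stated goal of a 95% interval, one option, assuming the normal approximation above is acceptable, is to take the 2.5% and 97.5% quantiles of the fitted normal and divide by 2 to get back to the original 0-5 scale. This describes the central 95% range of individual scores under the fitted model, not a confidence interval for the mean.

```r
library(MASS)

# Body condition scores from the question
scores <- c(3, 3, 2.5, 2.5, 4.5, 3, 2, 4, 3, 3.5, 3.5, 2.5, 3,
            3, 3.5, 3, 3, 4, 3.5, 3.5, 4, 3.5, 3.5, 4, 3.5)
x1 <- scores * 2  # same doubling as in the answer above

normal_fit <- fitdistr(x1, "normal")

# Central 95% range of the fitted normal, back on the 0-5 scale
range95 <- qnorm(c(0.025, 0.975),
                 mean = normal_fit$estimate["mean"],
                 sd   = normal_fit$estimate["sd"]) / 2
range95
```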

ROCR cutoff value and accuracy plots

I have a continuous independent variable (let's say 'height') and a binary dependent variable (let's say 'gets a job'). I want to see which cutoff value for height best predicts one's ability to get a job. I also want to see how accurate this model is. I assumed a multinomial logistic model. I wanted a ROC curve so I used the ROCR package in R. This was my code:
library(nnet)  # multinom()
library(ROCR)  # prediction(), performance()
mymodel <- multinom(job ~ height, data = dataset)
pred <- predict(mymodel, dataset, type = "probs")
roc_pred <- prediction(pred, dataset$job)
roc <- performance(roc_pred, "tpr", "fpr")
plot(roc, colorize = TRUE)
Now, this is my question. When I colorize the plot, it gives me the range of cut-off values used to make the plot. I'm a little confused as to what the cutoff values actually are though. Are the cutoff values the heights? Or the probability that a certain data point (person) with a certain height is able to get a job? I have a feeling it's the latter, but I am interested in the former. If it is the latter, how do I obtain the cutoff value for the height??
I found a video that explains the cutoffs you see: https://www.youtube.com/watch?v=YdNhNfJ4Vl8
There are many different ways to estimate optimal cutoffs: Youden index, sensitivity + specificity, distance to corner and many others (see this article).
I suggest you use the pROC package to do so:
library(pROC)
# obs: observed binary outcome, fit: fitted probabilities (roc() takes the response first)
roc_obj <- roc(response = obs, predictor = fit, percent = TRUE)
roc.out <- coords(roc_obj, "best", ret = c("threshold", "sens", "spec"), transpose = TRUE)
The "best" method uses the Youden index (J index). The maximum value of the Youden index is 1 (perfect test) and the minimum is 0, when the test has no diagnostic value. The minimum occurs when sensitivity = 1 − specificity, i.e. on the equal line (the diagonal) in the ROC diagram. The vertical distance between the equal line and the ROC curve is the J index for that particular cutoff. The J-index is represented by the ROC-curve itself.
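To turn a probability cutoff back into a height, one option is to invert the fitted logistic curve: if logit(p) = b0 + b1 × height, then height = (logit(p) − b0) / b1. The sketch below uses simulated data, a plain binomial glm() instead of multinom(), and a hand-computed Youden index; all names and numbers are hypothetical. With a single predictor the fitted probability is a monotone function of height, so the ROC curve could equally be built on height directly.

```r
set.seed(123)
n <- 300
height <- rnorm(n, mean = 170, sd = 10)
job <- rbinom(n, 1, plogis(-17 + 0.1 * height))  # simulated outcome

fit <- glm(job ~ height, family = binomial)
p_hat <- fitted(fit)

# Youden index (sensitivity + specificity - 1) at each candidate cutoff
cutoffs <- sort(unique(p_hat))
youden <- sapply(cutoffs, function(cut) {
  pred <- p_hat >= cut
  sens <- mean(pred[job == 1])
  spec <- mean(!pred[job == 0])
  sens + spec - 1
})
best_p <- cutoffs[which.max(youden)]

# Invert the logistic curve to express the cutoff as a height
b <- coef(fit)
best_height <- unname((qlogis(best_p) - b[1]) / b[2])
c(probability = best_p, height = best_height)
```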

How to plot the difference between two density distributions

I've trained a model to predict a certain variable. When I now use this model to predict said value and compare these predictions to the actual values, I get the two following distributions.
The corresponding R Data Frame looks as follows:
x_var | kind
3.532 | actual
4.676 | actual
...
3.12 | predicted
6.78 | predicted
These two distributions obviously have slightly different means, quantiles, etc. What I would now like to do is combine these two distributions into one (especially as they are fairly similar), but not like in the following thread.
Instead, I would like to plot one density function that shows the difference between the actual and predicted values and enables me to say e.g. 50% of the predictions are within -X% and +Y% of the actual values.
I've tried just plotting the difference between predicted and actual, and also the difference compared to the mean in the respective group. However, neither approach has produced my desired result. With the plotted distribution, it is especially important to be able to make the above statement, i.e. 50% of the predictions are within -X% and +Y% of the actual values. How can this be achieved?
Let's consider the two distributions as df_actual, df_predicted, then calculate
# dataframe with difference between two distributions
df_diff <- data.frame(x = df_predicted$x - df_actual$x, y = df_predicted$y - df_actual$y)
Then find the mean relative difference in percent:
x_diff = mean(df_diff$x / df_actual$x) * 100
y_diff = mean(df_diff$y / df_actual$y) * 100
This gives the average percentage by which the predictions are above (+) or below (-) the actual values, in x as well as y. This is my opinion; also see this thread on displaying and measuring the area between two distribution curves.
I hope this helps.
ParthChaudhary is right - rather than subtracting the distributions, you want to analyze the distribution of differences. But take care to subtract the values within corresponding pairs, or otherwise the actual - predicted differences will be overshadowed by the variance of actual (and predicted) alone. I.e., if you have something like:
x y type
0 10.9 actual
1 15.7 actual
2 25.3 actual
...
0 10 predicted
1 17 predicted
2 23 predicted
...
you would merge(df[df$type=="actual",], df[df$type=="predicted",], by="x"), then calculate and plot y.x-y.y.
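As a sketch of the paired approach, assuming every actual value has exactly one matching prediction identified by x (the data here are simulated and the suffixes are chosen only for readability), the relative errors and their quartiles give the "50% of predictions are within -X% and +Y%" statement directly:

```r
set.seed(1)
n <- 200
actual    <- rlnorm(n, meanlog = 1.5, sdlog = 0.3)
predicted <- actual * exp(rnorm(n, sd = 0.1))  # simulated predictions

df <- data.frame(
  x    = rep(seq_len(n), 2),
  y    = c(actual, predicted),
  type = rep(c("actual", "predicted"), each = n)
)

# Pair each prediction with its actual value via the shared key x
paired <- merge(df[df$type == "actual", ], df[df$type == "predicted", ],
                by = "x", suffixes = c(".actual", ".predicted"))

# Relative error of each prediction, in percent
paired$pct_err <- 100 * (paired$y.predicted - paired$y.actual) /
  paired$y.actual

# The middle 50% of predictions lie between these two percentages
q <- quantile(paired$pct_err, c(0.25, 0.75))
q

# Density of the relative errors with the quartiles marked
plot(density(paired$pct_err), main = "Relative prediction error (%)")
abline(v = q, lty = 2)
```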
To better quantify whether the differences between your predicted and actual distributions are significant, you could consider using the Kolmogorov-Smirnov test in R, available via the function ks.test

Estimating p-value thresholds from a distribution plot

My data is in the following format and includes a particular statistic
site LRStat
1 3.580728
2 2.978038
3 5.058644
4 3.699278
5 4.349046
This is just a sample of the data.
I then obtained the null LR distribution as well by permuting random pairs of data. I used this to plot a histogram with frequency on the y-axis and the LR statistic on the x-axis. How is it possible to determine the critical p-value cut-off points based on the null distribution (as shown in the below figure)?
You now have a sampling distribution of LR values. The quantile function in R will give you an estimate of whatever "critical value" you prefer. If, for instance, you decided you wanted the conventional 0.05 "p-value" you could take your dataframe, named LR_df for illustration, and issue this command:
quantile( LR_df[ , 'LRStat'] , 0.95)
If you wanted all of those "probabilities" on the figure, you would use a vector of values complementary to unity. The following code gives you the LRStat values which a given proportion of the sample falls at or below, so that the complementary proportion is higher than that value.
quantile( LR_df[ , 'LRStat'] , c(0.9, 0.95, 0.99, 0.999, 0.9999) )
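The same null distribution can also be used the other way round, to attach an empirical p value to each observed statistic. The sketch below uses a simulated chi-square sample as a stand-in for the permutation null (null_LR is a hypothetical name) together with the five LRStat values shown in the question. The +1 in numerator and denominator is a common convention that avoids reporting a p value of exactly zero.

```r
set.seed(2024)

# Stand-in for the permutation null distribution (assumed, for illustration)
null_LR <- rchisq(10000, df = 1)

# Critical values for a few conventional thresholds
crit <- quantile(null_LR, c(0.95, 0.99))

# Observed statistics from the question
obs <- c(3.580728, 2.978038, 5.058644, 3.699278, 4.349046)

# Empirical p value: share of null statistics at least as large as observed
emp_p <- sapply(obs, function(s) {
  (1 + sum(null_LR >= s)) / (1 + length(null_LR))
})
data.frame(site = seq_along(obs), LRStat = obs, emp_p = emp_p)
```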
The p-values are just a sampling distribution of a test statistic under a null hypothesis. Your null hypothesis in this case is that the LRstats are uniformly distributed. (I know it sounds strange to put it that way, but if you want to argue with the statisticians then get a copy of http://amstat.tandfonline.com/doi/pdf/10.1198/000313008X332421 .) The choice of p-value for cutoff will depend on scientific or business setting. If you were assessing an investment opportunity the cutoff might be 0.15 but if you are trying to find new scientific knowledge, I think it should be smaller (more stringent test). The field of molecular genetics has a lot of junk (i.e. fails to reproduce results) in their literature because they were not strict enough in the statistical methods.

Standardisation in MuMIn package in R

I am using the 'MuMIn' package in R to select models and calculate effect sizes of the input variables (rain, brk, onset, wid). To make my effect size comparable between variables, I standardised them using standardize function in arm package. Here is the code that I am following:
For reference, please refer to the appendix of this paper: http://onlinelibrary.wiley.com/doi/10.1111/j.1420-9101.2010.02210.x/full
Grueber et al. 2011: Multimodel inference in ecology and evolution: challenges and solutions
library(lme4)   # lmer()
library(arm)    # standardize()
library(MuMIn)  # dredge(), get.models(), model.avg()
data1 <- read.csv("data.csv", header = TRUE) # reads the data
global.model <- lmer(yld.res ~ rain + brk + onset + wid + (1 | state), data = data1, REML = FALSE) # prepares a global model (REML is a logical, not a string)
stdz.model <- standardize(global.model, standardize.y = FALSE) # standardises the input variables
model.set <- dredge(stdz.model) # generates the full submodel set
top.models <- get.models(model.set, subset = delta < 2) # selects models with delta AIC < 2
model.avg(top.models) # calculates the average effect size of input variables
Here is the result of model.avg(top.models) which gives the average effect size of each input variable
Coefficients:
         (Intercept)       brk     rain      wid    onset
subset -4.281975e-14 -106.0919 51.54688 39.82837 35.68766
I read up on how the standardize function works: it subtracts the mean and divides by 2 SD.
My question is this: since I have standardised the input variables, shouldn't the effect sizes be between -1 and 1? Or are the effect sizes shown in the output correct?
Please advise
Thanks a lot
This is more of a statistical question than a programming question, but: you've only standardized the predictor variables, not the response variable (you specified standardize.y=FALSE); therefore, each of your coefficients represents the expected change of the response (in the response's units!) per 2 SD change in the predictor. If the range of the response is large (as it must be in your example), then there could be a very large change. For example, if I were analyzing the change in elephant weight measured in milligrams, I could expect very large changes in the response for reasonably small changes in the predictors (e.g. sex, age, food availability). You should probably use standardize.y=TRUE if you want truly nondimensional/unitless effect sizes. Even nondimensional effects aren't necessarily constrained to be between -1 and +1, but it would be surprising for them to be so large.
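A small simulated illustration of this point, using a plain lm() with one predictor and hand-rolled 2 SD scaling rather than arm::standardize() (all names and numbers are hypothetical, and the response is assumed to be scaled the same way as the predictor):

```r
set.seed(10)
n <- 200
rain <- rnorm(n, mean = 800, sd = 150)
yld  <- 2000 + 0.35 * rain + rnorm(n, sd = 40)

# Centre and divide by 2 SD, as described for the predictors
rain_z <- (rain - mean(rain)) / (2 * sd(rain))

b_raw <- coef(lm(yld ~ rain))["rain"]      # response units per unit of rain
b_std <- coef(lm(yld ~ rain_z))["rain_z"]  # response units per 2 SD of rain

# The standardised slope is the raw slope times 2 SD of the predictor,
# so it is still in the units of the response and can be large
all.equal(unname(b_std), unname(b_raw) * 2 * sd(rain))

# Scaling the response the same way gives a unitless coefficient
yld_z  <- (yld - mean(yld)) / (2 * sd(yld))
b_both <- coef(lm(yld_z ~ rain_z))["rain_z"]
b_both
```

In this single-predictor case the fully standardised coefficient equals the correlation between the two variables. With several predictors that bound no longer holds exactly.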
By the way, I think your standardize function comes from the arm package, not from MuMIn (library("sos"); findFn("standardize", sortby="Function")).
