Using Savage-Dickey point null hypotheses for brms models - Stan

Let's say I have a brms model of y ~ a*b + (1|group) and I'd like to compare it against an alternative model y ~ a + b + (1|group) (in my example, a is time and b is experimental condition).
Using hypothesis() I could write:
hypothesis(m1, c("a:b = 0"))
However, the evidence ratio returned for this point null hypothesis is often odd in that it depends heavily on the priors used. With relatively uninformative priors, the ER can still be > 1 or even > 3, even when a fairly precise and clearly non-zero estimate for a:b is found, simply because the prior placed so much weight far away from zero.
I realise I could try using the bridgesampling package to calculate a Bayes factor, but I have always found that quite slow and cumbersome. I wonder instead whether it would be reasonable to compare the ERs for these two hypotheses:
a:b = 0 vs a:b > 0
That is, if I calculated ER(a:b > 0) / ER(a:b = 0), would this give a Bayes factor in favour of a non-zero, positive effect, as compared with a zero effect?
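For reference, the two evidence ratios being compared can be pulled out like this (a minimal sketch; it assumes m1 was fitted with sample_prior = "yes", which hypothesis() needs for the point null):
library(brms)
## m1 is assumed to be the fitted interaction model y ~ a*b + (1|group),
## fitted with sample_prior = "yes" so the Savage-Dickey ratio is available
h_point <- hypothesis(m1, "a:b = 0")   # ER = posterior/prior density at zero (Savage-Dickey)
h_dir   <- hypothesis(m1, "a:b > 0")   # ER = posterior odds that a:b is positive
h_point$hypothesis$Evid.Ratio
h_dir$hypothesis$Evid.Ratio
h_dir$hypothesis$Evid.Ratio / h_point$hypothesis$Evid.Ratio   # the ratio asked about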

Related

Linear Regression Model with a variable that zeroes the result

For my class we have to create a model to predict the credit balance of each individual. Based on the observations, many balances are zero, but lm still tries to estimate them.
To overcome this I created a new variable that is zero if two conditions, X and Y, are both true.
CB$Balzero = ifelse(CB$Rating<=230 & CB$Income<90,0,1)
This got 90% of the zero results right. The problem is:
How can I include this variable in the lm so that it returns zero when the condition is true and the usual calculation when it is false?
Something like: lm=Balzero*(Balance~.)
I think that
y ~ -1 + Balzero:Balance
might work (you haven't given us a reproducible example to try).
-1 tells R to omit the intercept
: specifies an interaction. If both variables are numeric, then A:B includes the product of A and B as a term in the model.
The second term could also be specified as I(Balzero*Balance) (I means "as is", i.e. interpret * in the usual numerical sense, not in its formula-construction context.)
These specifications should fit the model
Y = beta1*Balzero*Balance + eps
where eps is an error term.
If Balzero == 0, the predicted value will be zero. If Balzero==1 the predicted value will be beta1*Balance.
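As a quick sanity check of that specification, here is a small simulated sketch (made-up data and names, since no reproducible example was given; pred stands in for whatever drives the non-zero balances):
set.seed(1)
n <- 200
pred    <- runif(n, 0, 1000)                 # stand-in numeric predictor
Balzero <- rbinom(n, 1, 0.5)                 # stand-in 0/1 indicator
Balance <- ifelse(Balzero == 1, 2 * pred + rnorm(n, sd = 10), 0)
fit <- lm(Balance ~ -1 + Balzero:pred)       # no intercept, product term only
head(cbind(Balzero, fitted(fit)))            # fitted values are exactly 0 when Balzero == 0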
You might want to look into random forest models, which naturally incorporate the kind of qualitative splitting that you're doing by hand in your example.

A strange case of singular fit in lme4 glmer - simple random structure with variation among groups

Background
I am trying to test for differences in wind speed data among different groups. For the purpose of my question, I am looking only at side wind (wind whose direction is 90 degrees from the individual), and I only care about the strength of the wind, so I use absolute values. The wind speeds range from 0.0004 to 6.8 m/s, and because I use absolute values, a Gamma distribution describes them much better than a normal distribution.
My data contain 734 samples from 68 individuals, with each individual having between 1 and 30 repeats. However, even if I reduce my samples to include only individuals with at least 10 repeats (which leaves me with 26 individuals and a total of 466 samples), I still get the problematic error.
The model
The full model is Wind ~ a*b + c*d + (1|individual), but for the purpose of this question, the simple model of Wind ~ 1 + (1|individual) gives the same singularity error, so I do not think that the explanatory variables are the problem.
The complete code line is glmer(Wind ~ 1 + (1|individual), data = X, family = Gamma(log))
The problem and the strange part
When running the model, I get the boundary (singular) fit: see ?isSingular error, although, as you can see, I use a very simple model and random structure. The strange part is that I can solve this by adding 0.1 to the Wind variable (i.e. glmer(Wind+0.1 ~ 1 + (1|Tag), data = X, family = Gamma(log)) does not give any error). I honestly do not remember why I added 0.1 the first time I did it, but I was surprised to see that it solved the error.
The question
Is this a problem with lme4? Am I missing something? Any ideas what might cause this, and why adding 0.1 to the variable solves the problem?
Edit following questions
I am not sure what the best way to share data is, so here is a link to a CSV file on Google Drive.
Using glmmTMB does not produce any warnings with the basic formula glmmTMB(Wind ~ 1 + (1|Tag), data = X, family = Gamma(log)), but it gives a convergence warning ('non-positive-definite Hessian matrix') with the full model (i.e., Wind ~ a*b + c*d + (1|individual)); that warning goes away if I scale the continuous variables.
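For concreteness, the scaling fix mentioned above would look something like this (a sketch with placeholder names, since the real data are only linked; it assumes a and c are the continuous covariates):
library(glmmTMB)
## basic model: fits without warnings
m0 <- glmmTMB(Wind ~ 1 + (1 | Tag), data = X, family = Gamma(link = "log"))
## full model: scaling the continuous covariates avoids the
## non-positive-definite Hessian warning reported above
m1 <- glmmTMB(Wind ~ scale(a) * b + scale(c) * d + (1 | Tag),
              data = X, family = Gamma(link = "log"))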

Understanding the output of my Ramsey RESET test

I am new to R and am doing a replication study where I need to check whether the original regression satisfies the classical OLS assumptions. For the specification assumption, I am running the Ramsey RESET test; here is my code:
simple_model <- lm(deploy ~ loggdppc + natoyears + milspend + lagdeploy + logland + logcoast + lag3terror + logmindist, data = natopanel)
resettest(simple_model, power=2, type="regressor", data = natopanel)
Here is my output:
RESET = 2.0719, df1 = 6, df2 = 355, p-value = 0.05586
Since the p-value is (albeit slightly) above 0.05, does this mean that the model 'passes' the Ramsey test? Or is there an issue of omitted variables? I still have not quite gotten the hang of these interpretations. This model does not include all of their variables, as they are testing a specific hypothesis.
Thank you for your help!
According to Wikipedia:
"[The intuition of the Ramsey RESET test] is that if non-linear combinations of the explanatory variables have any power in explaining the response variable, the model is misspecified in the sense that the data generating process might be better approximated by a polynomial or another non-linear functional form."
It tests whether higher-degree polynomial terms of your explanatory variables -- in your example 2nd-degree terms, because of power=2 -- have any additional explanatory power. In essence, you test whether the 2nd-degree terms of your regressors are jointly significantly different from zero.
Suppose you use 5% as your cut-off for significance. In that case, you (barely) fail to reject the null hypothesis that the 2nd-degree terms add no explanatory power, i.e. the test gives no strong evidence that the linear specification is inadequate.
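If it helps to see what the test is doing: resettest(..., power = 2, type = "regressor") amounts to adding the squared regressors and testing them jointly with an F test. A small sketch, using only two of your regressors for brevity (it should give the same F statistic as the corresponding resettest call):
library(lmtest)
m0 <- lm(deploy ~ loggdppc + natoyears, data = natopanel)
m1 <- update(m0, . ~ . + I(loggdppc^2) + I(natoyears^2))
anova(m0, m1)                                  # joint F test of the squared terms
resettest(m0, power = 2, type = "regressor")   # same test, done for you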

Finding model predictor values that maximize the outcome

How do you find the set of values for the model predictors (a mixture of linear and non-linear terms) that yields the highest value of the response?
Example Model:
library(lme4); library(splines)
summary(lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month), data = airquality, REML = F))
Here I am interested in what conditions (predictors) produce the highest solar radiation (outcome).
This question seems simple, but I've failed to find a good answer using Google.
If the model was simple, I could take the derivatives to find the maximum or minimum. Someone has suggested that if the model function can be extracted, the stats::optim() function might be used. As a last resort, I could simulate all the reasonable variations of input values and plug it into the predict() function and look for the maximum value.
The last approach mentioned doesn't seem very efficient and I imagine that this is a common enough task (e.g., finding optimal customers for advertising) that someone has built some tools for handling it. Any help is appreciated.
There are some conceptual issues here.
For the simple terms (Wind and Temp), the response is a linear (and hence both monotonic and unbounded) function of the predictors. Thus, if these terms have positive parameter estimates, increasing their values to infinity (Inf) will give you an infinite response (Solar.R); if the coefficients are negative, the values should be as small (negatively large) as possible. Practically speaking, then, you want to set these predictors to the minimum or maximum reasonable value according to whether the parameter estimates are negative or positive.
For the bs term, I'm not sure what the properties of the B-spline are beyond the boundary knots, but I'm pretty sure the curves go off to positive or negative infinity, so you've got the same issue. However, for the bs case it's also possible that there are one or more interior maxima. Here I would probably try to extract the basis terms and evaluate the spline over the range of the data ...
Alternatively, your mentioning optim makes me think that this is a possibility:
data(airquality)
library(lme4)
library(splines)
m1 <- lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month),
           data = airquality, REML = FALSE)
predval <- function(x) {
    newdata <- data.frame(Ozone = x[1], Wind = x[2], Temp = x[3])
    ## return population-averaged prediction (no Month effect)
    return(predict(m1, newdata = newdata, re.form = ~0))
}
aq <- na.omit(airquality)
sval <- with(aq, c(mean(Ozone), mean(Wind), mean(Temp)))
predval(sval)
opt1 <- optim(fn = predval,
              par = sval,
              lower = with(aq, c(min(Ozone), min(Wind), min(Temp))),
              upper = with(aq, c(max(Ozone), max(Wind), max(Temp))),
              method = "L-BFGS-B",            ## for constrained optimization
              control = list(fnscale = -1))   ## for maximization
## opt1
## $par
## [1] 70.33851 20.70000 97.00000
##
## $value
## [1] 282.9784
As expected, the optimum is intermediate in the range of Ozone (1-168) and at the maximum for both Wind (2.3-20.7) and Temp (57-97).
This brute force solution could be made much more efficient by automatically selecting the min/max values for the simple terms and optimizing only over the complex (polynomial/spline/etc.) terms.
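A rough sketch of that refinement, continuing from the code above: fix Wind and Temp at the boundary implied by the sign of their coefficients and optimise over Ozone alone.
## signs of the fixed-effect estimates decide which boundary to use
best_wind <- with(aq, if (fixef(m1)["Wind"] > 0) max(Wind) else min(Wind))
best_temp <- with(aq, if (fixef(m1)["Temp"] > 0) max(Temp) else min(Temp))
## 1-D optimisation over Ozone only (population-level prediction, no Month effect)
f_oz <- function(oz) predict(m1,
                             newdata = data.frame(Ozone = oz, Wind = best_wind, Temp = best_temp),
                             re.form = ~0)
opt2 <- optimize(f_oz, interval = range(aq$Ozone), maximum = TRUE)
## opt2$maximum is the Ozone value, opt2$objective the predicted Solar.R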

Screening (multi)collinearity in a regression model

I hope that this is not going to be an "ask-and-answer" question... here goes:
(Multi)collinearity refers to extremely high correlations between predictors in a regression model. How to cure it... well, sometimes you don't need to "cure" collinearity, since it doesn't affect the regression model itself, only the interpretation of the effects of individual predictors.
One way to spot collinearity is to take each predictor in turn as a dependent variable, with the other predictors as independent variables, and determine R^2; if it is larger than .9 (or .95), we can consider that predictor redundant. This is one "method"... what about other approaches? Some of them are time-consuming, like excluding predictors from the model one at a time and watching for changes in the b-coefficients - they should be noticeably different.
Of course, we must always bear in mind the specific context/goal of the analysis... Sometimes the only remedy is to repeat the research, but right now I'm interested in various ways of screening for redundant predictors when (multi)collinearity occurs in a regression model.
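For reference, the R^2 screening just described can be scripted directly (a small sketch; X is assumed to be a data frame holding only the predictors):
screen_r2 <- function(X, cutoff = 0.9) {
  ## regress each predictor on all the others and collect the R^2 values
  r2 <- sapply(names(X), function(v) {
    summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared
  })
  r2[r2 > cutoff]   # predictors that look redundant
}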
The kappa() function can help. Here is a simulated example:
> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001 # so x3 approx a linear comb. of x1+x2
> mm12 <- model.matrix(~ x1 + x2) # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3) # bad model with near collinearity
> kappa(mm12) # a 'low' kappa is good
[1] 1.166029
> kappa(mm123) # a 'high' kappa indicates trouble
[1] 121530.7
and we go further by making the third regressor more and more collinear:
> x4 <- x1 + 2*x2 + rnorm(100)*0.000001 # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2 # now x5 is linear comb of x1,x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16
kappa() uses approximations by default; see help(kappa) for details.
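If you want the exact 2-norm condition number rather than the approximation (a bit slower for large matrices), kappa() takes an exact argument:
kappa(mm123, exact = TRUE)   # exact condition number of the near-collinear model matrix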
Just to add to what Dirk said about the Condition Number method, a rule of thumb is that values of CN > 30 indicate severe collinearity. Other methods, apart from condition number, include:
1) The determinant of the correlation matrix, which ranges from 0 (perfect collinearity) to 1 (no collinearity). The determinant of the covariance matrix, used in the example below, also shrinks towards 0 under collinearity, although it is not bounded above by 1.
# using Dirk's example
> det(cov(mm12[,-1]))
[1] 0.8856818
> det(cov(mm123[,-1]))
[1] 8.916092e-09
2) Using the fact that the determinant of a matrix equals the product of its eigenvalues, the presence of one or more very small eigenvalues indicates collinearity
> eigen(cov(mm12[,-1]))$values
[1] 1.0876357 0.8143184
> eigen(cov(mm123[,-1]))$values
[1] 5.388022e+00 9.862794e-01 1.677819e-09
3) The value of the Variance Inflation Factor (VIF). The VIF for predictor i is 1/(1-R_i^2), where R_i^2 is the R^2 from a regression of predictor i against the remaining predictors. Collinearity is present when VIF for at least one independent variable is large. Rule of Thumb: VIF > 10 is of concern. For an implementation in R see here. I would also like to comment that the use of R^2 for determining collinearity should go hand in hand with visual examination of the scatterplots because a single outlier can "cause" collinearity where it doesn't exist, or can HIDE collinearity where it exists.
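Using the simulated x1, x2, x3 from the example above, the VIF for x3 can be computed by hand from the 1/(1 - R^2) formula (illustrative only):
r2_x3 <- summary(lm(x3 ~ x1 + x2))$r.squared   # R^2 from regressing x3 on the others
1 / (1 - r2_x3)                                # enormous VIF, as expected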
You might like Vito Ricci's Reference Card "R Functions For Regression Analysis"
http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf
It succinctly lists many useful regression related functions in R including diagnostic functions.
In particular, it lists the vif function from the car package which can assess multicollinearity.
http://en.wikipedia.org/wiki/Variance_inflation_factor
Consideration of multicollinearity often goes hand in hand with issues of assessing variable importance. If this applies to you, perhaps check out the relaimpo package: http://prof.beuth-hochschule.de/groemping/relaimpo/
See also Section 9.4 in this Book: Practical Regression and Anova using R [Faraway 2002].
Collinearity can be detected in several ways:
Examination of the correlation matrix of the predictors will reveal large pairwise collinearities.
A regression of x_i on all other predictors gives R^2_i. Repeat for all predictors. R^2_i close to one indicates a problem — the offending linear combination may be found.
Examine the eigenvalues of t(X) %*% X, where X denotes the model matrix; small eigenvalues indicate a problem. The 2-norm condition number of X is the ratio of its largest to its smallest non-zero singular value, i.e. $\kappa = \sqrt{\lambda_1/\lambda_p}$ in terms of the eigenvalues of t(X) %*% X (see ?kappa); $\kappa \ge 30$ is considered large.
Since there is no mention of VIF so far, I will add my answer. A Variance Inflation Factor > 10 usually indicates serious redundancy between predictor variables. The VIF measures the factor by which the variance of a variable's coefficient is inflated relative to what it would be if that variable were uncorrelated with the other predictors.
vif() is available in the car package and is applied to an object of class lm. It returns the VIF for each predictor x1, x2, ..., xn in the fitted model. It is a good idea to exclude variables with VIF > 10, or to transform them.
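A minimal sketch of car::vif() on the simulated collinear data from the earlier answer (the response is arbitrary, since the VIF depends only on the predictors):
library(car)
y <- rnorm(100)               # arbitrary response
vif(lm(y ~ x1 + x2 + x3))     # x3 is nearly a linear combination of x1 and x2, so all VIFs are huge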
