Multiple meta-regression with metafor - r

I have performed a multiple meta-regression with the package metafor, but struggle with the interpretation of the Test of Moderators (i.e., QM). My model includes two variables: (1) sample type (dummy: community vs. forensic) and (2) proportion of females in sample (continuous).
This is the output I get:
The results indicate that proportion_females significantly predicts the effect size while controlling for sample type. However, QM shows a non-significant result (p > 0.05).
How is that possible? It was my understanding that QM tests the hypothesis H0: β_sample = β_females = 0. If Proportion_females is clearly != 0, why does QM not yield a significant result?

This can happen just like in regular regression (where the overall/omnibus F-test can fail to be significant, but an individual coefficient is found to be significant). This is more likely to happen when the model includes multiple non-relevant predictors/moderators, since this will decrease the power of the omnibus test. It can also go the other way around where none of the individual coefficients are found to be significant, but the omnibus test is.
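For illustration, here is a minimal sketch of how the omnibus QM test and the coefficient-level tests show up in metafor; the data and moderator values are made up, not taken from the question:

library(metafor)

# hypothetical data: yi = effect sizes, vi = sampling variances,
# sample_type = community vs. forensic dummy, prop_females = continuous moderator
dat <- data.frame(
  yi = c(0.10, 0.25, 0.40, 0.05, 0.30, 0.55, 0.20, 0.45),
  vi = c(0.02, 0.03, 0.01, 0.04, 0.02, 0.03, 0.01, 0.02),
  sample_type  = rep(c("community", "forensic"), 4),
  prop_females = c(0.2, 0.5, 0.8, 0.1, 0.6, 0.9, 0.4, 0.7)
)

res <- rma(yi, vi, mods = ~ sample_type + prop_females, data = dat)
summary(res)          # QM tests H0: both moderator coefficients are zero; the z-tests are per coefficient
anova(res, btt = 3)   # test restricted to the 3rd coefficient (prop_females) only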

Related

How to assess the model and prediction of random forest when doing regression analysis?

I know that when random forest (RF) is used for classification, the AUC is normally used to assess the quality of the classification after applying it to test data. However, I have no clue which metric to use to assess the quality of a regression with RF. I now want to use RF for regression analysis, e.g. using a matrix with several hundred samples and features to predict the (numerical) concentration of chemicals.
The first step is to run randomForest to build the regression model, with y as a continuous numeric. How can I tell whether the model is good or not, based on the Mean of squared residuals and % Var explained? Sometimes my % Var explained is negative.
Afterwards, if the model is fine, I apply it to the test data and get the predicted values. How can I then assess whether the predicted values are good or not? I read online that some people calculate an accuracy (formula: 1 - abs(predicted - actual)/actual), which also makes sense to me. However, I have many zero values in my actual dataset; are there other ways to assess the accuracy of the predicted values?
Looking forward to any suggestions and thanks in advance.
The randomForest R package comes with an importance function which can be used to determine the accuracy of a model. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measurements. One uses a permutation of the out-of-bag data to test the accuracy of the model. The other uses the Gini index. Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
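In R, both measures can be extracted directly from a fitted regression forest; a minimal sketch, assuming a hypothetical model object rf_fit that was grown with importance=TRUE:

library(randomForest)
importance(rf_fit, type = 1)   # permutation measure (%IncMSE); requires importance=TRUE at fit time
importance(rf_fit, type = 2)   # total decrease in node impurity (IncNodePurity, i.e. residual sum of squares)
varImpPlot(rf_fit)             # dot charts of both measures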
One more simple check you can do, really more of a sanity check than anything else, is to use something called the best constant model. The best constant model has a constant output, which is the mean of all responses in the test data set. The best constant model can be taken as the crudest model possible. You can compare the average performance of your random forest model against the best constant model on a given set of test data. If the former does not outperform the latter by at least a factor of, say, 3-5, then your RF model is not very good.
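A minimal sketch of that comparison, assuming hypothetical objects rf_fit (a regression forest) and test (a held-out data frame with response column y):

pred_rf    <- predict(rf_fit, newdata = test)
pred_const <- mean(test$y)                 # best constant model as defined above: the mean of all test responses

mse_rf    <- mean((test$y - pred_rf)^2)
mse_const <- mean((test$y - pred_const)^2)

mse_const / mse_rf                         # how many times better the forest is than the constant baseline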

R - Interpreting Random Forest Importance

I'm working with random forest models in R as part of an independent research project. I have fit my random forest model and generated the overall importance of each predictor to the model's accuracy. However, in order to interpret my results in a research paper, I need to understand whether the variables have a positive or negative impact on the response variable.
Is there a way to produce this information from a random forest model? E.g. I expect age to have a positive impact on the likelihood that a surgical complication occurs, but the existence of osteoarthritis not so much.
Code:
surgery.bagComp = randomForest(complication~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,mtry=2,importance=T,cutoff=c(0.90,0.10)) #The cutoff is the probability for each group selection, probs of 10% or higher are classified as 'Complication' occurring
surgery.bagComp #Get stats for random forest model
imp=as.data.frame(importance(surgery.bagComp)) #Analyze the importance of each variable in the model
imp = cbind(vars=rownames(imp), imp)
imp = imp[order(imp$MeanDecreaseAccuracy),]
imp$vars = factor(imp$vars, levels=imp$vars)
dotchart(imp$MeanDecreaseAccuracy, imp$vars,
xlim=c(0,max(imp$MeanDecreaseAccuracy)), pch=16,xlab = "Mean Decrease Accuracy",main = "Complications - Variable Importance Plot",color="black")
Importance Plot:
Any suggestions/areas of research anyone can suggest would be greatly appreciated.
In order to interpret my results in a research paper, I need to understand whether the variables have a positive or negative impact on the response variable.
You need to perform "feature impact" analysis, not "feature importance" analysis.
Algorithmically, it's about traversing the decision tree data structures and observing the impact of each split on the prediction outcome. For example, consider the split "age <= 40". Does the left branch (condition evaluates to true) carry a lower likelihood than the right branch (condition evaluates to false)?
Feature importances may give you a hint about which features to look at, but they cannot be "transformed" into feature impacts.
You might find the following articles helpful: WHY did your model predict THAT? (Part 1 of 2) and WHY did your model predict THAT? (Part 2 of 2).
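A related and readily available option in R is a partial dependence plot; it is not the per-split impact analysis described above, but it does show whether the predicted class probability rises or falls with a feature. A minimal sketch using the model from the question, assuming the outcome factor has a level named "Complication":

library(randomForest)
# direction of the age effect on the predicted probability of a complication
partialPlot(surgery.bagComp, pred.data = surgery, x.var = "age",
            which.class = "Complication", main = "Partial dependence on age")
# same for osteoarthritis
partialPlot(surgery.bagComp, pred.data = surgery, x.var = "baseline_osteoart",
            which.class = "Complication", main = "Partial dependence on baseline_osteoart")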

Interactions in logistic regression in R

I am struggling to interpret the results of a binomial logistic regression I ran. The experiment has 4 conditions; in each condition all participants receive a different version of the treatment.
DVs (1 per condition) = DE01, DE02, DE03, DE04, all binary (1 = participant takes a specific decision, 0 = does not)
Predictors: FTFinal (continuous, a freedom threat scale)
SRFinal (continuous, situational reactance scale)
TRFinal (continuous, trait reactance scale)
SVO_Type (binary, egoists = 1, altruists = 0)
After running the binomial (logit) models, I ended up with the following: see output. Initially I tested 2 models per condition, when condition 2 (DE02 as the DV) caught my attention. In model (3) there are two variables which are significant predictors of DE02 (taking a decision or not): FTFinal and SVO_Type. In context, the values for model (3) would mean that, all else equal, being an egoist (SVO_Type 1) decreases the log-odds of taking the decision compared to being an altruist. Also, higher scores on FTFinal (freedom threat) increase the likelihood of taking the decision. So far so good. Removing SVO_Type from the regression (model 4) made the FTFinal coefficient non-significant. Removing FTFinal from the model does not change the significance of SVO_Type.
So I figured: OK, perhaps mediation, or moderation.
I tried all models both in R and SPSS, and entering an interaction term SVO_Type*FTFinal makes all variables in model (2) non-significant. I followed the mediation procedure for logistic regression at http://www.nrhpsych.com/mediation/logmed.html, but found no mediation. To sum up: predicting DE02 from SVO_Type alone is not significant; predicting DE02 from FTFinal alone is not significant; putting both into the regression makes them significant predictors.
code and summaries here
Including an interaction between the two in the model makes all coefficients non-significant.
So I am at a total loss. As far as I know, to test moderation you need an interaction term. This term is between a categorical variable (SVO_Type) and a continuous one (FTFinal); perhaps that is where it goes wrong?
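For reference, a minimal sketch of how a categorical-by-continuous interaction is usually coded in R; the data frame name dat2 is hypothetical:

# dat2 is a hypothetical data frame holding DE02, FTFinal and SVO_Type
dat2$SVO_Type <- factor(dat2$SVO_Type, levels = c(0, 1), labels = c("altruist", "egoist"))

m_main <- glm(DE02 ~ FTFinal + SVO_Type, data = dat2, family = binomial)
m_int  <- glm(DE02 ~ FTFinal * SVO_Type, data = dat2, family = binomial)  # main effects plus interaction

summary(m_int)                         # the interaction coefficient is the test of moderation
anova(m_main, m_int, test = "Chisq")   # likelihood ratio test for the interaction term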
And to test mediation outside SPSS, I tried the "mediate" package, only to discover that there is a "treatment" argument in the function, which is supposed to be the treatment variable (experimental vs. control). I don't have such a variable; all participants are subjected to different versions of the same treatment.
Any help will be appreciated.

Can't generate correlated random numbers from binomial distributions in R using rmvbin

I am trying to get a sample of correlated random numbers from binomial distributions in R. I tried to use rmvbin and it worked fine with some probabilities:
> rmvbin(100, margprob = c(0.1,0.1), bincorr=0.5*diag(2)+0.5)
while the next, quite similar call raises an error:
> rmvbin(100, margprob = c(0.01,0.01), bincorr=0.5*diag(2)+0.5)
Error in commonprob2sigma(commonprob, simulvals) :
Extrapolation occurred ... margprob and commonprob not compatible?
I can't find any justification for this.
This is a math/stats "problem" and not an R problem (in the sense that it is not really a problem but a consequence of the model).
Short version: For bivariate binary data there is a link between the marginal probabilities and the correlation that can be observed. You can see it if you do a bit of boring juggling with the marginal probabilities $p_A$ and $p_B$ and the simultaneous probability $p_{AB}$. In other words: the marginal probabilities put restrictions on range of allowed correlations (and vice versa), and you are violating this in your call.
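Spelled out, that juggling gives the Fréchet bounds on the joint probability, and the correlation is determined by it, so the marginals restrict the range of achievable correlations:

$$\max(0,\; p_A + p_B - 1) \;\le\; p_{AB} \;\le\; \min(p_A, p_B), \qquad \rho = \frac{p_{AB} - p_A p_B}{\sqrt{p_A (1 - p_A)\, p_B (1 - p_B)}}.$$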
For bivariate Gaussian random variables the marginals and the correlations are separate and can be specified independently of each other.
The question should probably be moved to stats exchange.

same regression, different statistics (R v. SAS)?

I ran the same probit regression in SAS and R and while my coefficient estimates are (essentially) equivalent, the reported test statistics are different. Specifically, SAS reports test statistics as t-statistics whereas R reports test statistics as z-statistics.
I checked my econometrics text and found (with little elaboration) that it reports probit results in terms of t statistics.
Which statistic is appropriate? And why does R differ from SAS?
Here's my SAS code:
proc qlim data=DavesData;
model y = x1 x2 x3/ discrete(d=probit);
run;
quit;
And here's my R code:
> model.1 <- glm(y ~ x1 + x2 + x3, family=binomial(link="probit"))
> summary(model.1)
Just to answer a little bit - it's seriously off topic, and the question should in fact be closed - but neither the t-statistic nor the z-statistic is meaningful. They are related though, as Z is just the standard normal distribution and T is an adapted "close-to-normal" distribution that takes into account the fact that your sample is limited to n cases.
Now, both the z and the t statistic provide the significance for the null hypothesis that the respective coefficient is equal to zero. The standard error on the coefficients, used for that test, is based on the residual error. Using the link function, you practically transform your response in such a way that the residuals become normal again, whereas in fact the residuals represent the difference between the observed and the estimated proportion. Due to this transformation, calculation of the degrees of freedom for the T-statistic isn't useful anymore and hence R assumes the standard normal distribution for the test statistic.
Both results are essentially equivalent; R will just give slightly sharper p-values. It's a matter of debate, but if you look at proportion-difference tests, they are also always done using the standard normal approximation (Z-test).
Which brings me back to the point that neither of these values actually has any meaning. If you want to know whether or not a variable has a significant contribution with a p-value that actually says something, you use a Chi-squared test like the Likelihood Ratio test (LR), Score test or Wald test. R just gives you the standard likelihood ratio, SAS also gives you the other two. But all three tests are essentially equivalent, if they differ seriously it's time to look again at your data.
e.g. in R:
anova(model.1, test="Chisq")
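Or, for a single predictor, an explicit comparison of nested models (a sketch reusing the probit model from the question):

# drop x3 and compare via a likelihood ratio test
model.0 <- glm(y ~ x1 + x2, family = binomial(link = "probit"))
anova(model.0, model.1, test = "Chisq")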
For SAS : see the examples here for use of contrasts, getting the LR, Score or Wald test
