I am using the glmmTMB package in R to fit a logistic regression (a GLMM) with fixed and random effects (random intercepts and slopes). For some background, I have 5 fixed covariates, one of which also enters as a quadratic term (so really 6 fixed effects), and I am including random slopes for each of those 6 terms. Before fitting the model, I scaled and centered each covariate (using the scale function) and checked for correlation between covariates (other than the quadratic, all correlations < 0.6). I would like to convert the standardized estimates from the model to unstandardized estimates, because I need to create a predictive map in ArcGIS, where the rasters are unstandardized. To get unstandardized estimates, I have tried fitting the model to the raw data (i.e. skipping the scale-and-center step), but I believe I am running into convergence issues: even though the model runs without warnings, the estimates have large standard errors (10-100x larger than the estimate) and the sign of the estimate (+ or -) flips between the standardized and unstandardized runs. I have found similar posts such as this, this, and this, but I don't think they describe exactly my issue, or I am not understanding the math in the solutions. Advice would be very much appreciated.
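The back-transformation asked about here can be sketched with plain algebra. This is a minimal Python sketch, not glmmTMB output handling: it assumes each covariate was scaled as z = (x - mean) / sd, that the quadratic term is the square of the scaled covariate (z^2), and that only the fixed effects are used for the map. The function name, covariate names, and all numbers are made up for illustration.

```python
def unstandardize(intercept, linear, quadratic, means, sds):
    """Convert fixed effects fitted on z = (x - mean) / sd to the raw x scale.

    linear:    {name: coefficient of z}
    quadratic: {name: coefficient of z**2}  (ASSUMES the scaled covariate was squared)
    means/sds: {name: mean or sd that was used for scaling}
    Returns (raw_intercept, raw_linear, raw_quadratic).
    """
    raw_intercept = intercept
    raw_linear, raw_quadratic = {}, {}
    for name, b1 in linear.items():
        m, s = means[name], sds[name]
        b2 = quadratic.get(name, 0.0)
        # b1*z + b2*z^2 with z = (x - m)/s expands to
        #   (b2/s^2) x^2 + (b1/s - 2*b2*m/s^2) x + (b2*m^2/s^2 - b1*m/s)
        raw_linear[name] = b1 / s - 2.0 * b2 * m / s**2
        raw_intercept += b2 * m**2 / s**2 - b1 * m / s
        if name in quadratic:
            raw_quadratic[name] = b2 / s**2
    return raw_intercept, raw_linear, raw_quadratic


# Made-up illustration values, not taken from any fitted model.
means = {"elev": 1200.0, "slope": 15.0}
sds = {"elev": 300.0, "slope": 5.0}
b0, lin, quad = -1.0, {"elev": 0.8, "slope": -0.4}, {"elev": -0.3}
raw_b0, raw_lin, raw_quad = unstandardize(b0, lin, quad, means, sds)


def eta_standardized(x):
    """Linear predictor (logit scale) computed from scaled covariates."""
    z = {k: (x[k] - means[k]) / sds[k] for k in x}
    return (b0 + sum(lin[k] * z[k] for k in lin)
            + sum(quad[k] * z[k] ** 2 for k in quad))


def eta_raw(x):
    """Same linear predictor computed from raw covariates."""
    return (raw_b0 + sum(raw_lin[k] * x[k] for k in raw_lin)
            + sum(raw_quad[k] * x[k] ** 2 for k in raw_quad))


cell = {"elev": 1500.0, "slope": 10.0}  # one hypothetical raster cell
print(eta_standardized(cell), eta_raw(cell))
```

Both functions should give the same linear predictor for any cell, which is a quick check that the conversion is right. If the raw covariate was squared first and then scaled separately, the algebra differs. An alternative that avoids the conversion altogether is to standardize the rasters with the same means and standard deviations used for the model data and apply the standardized coefficients directly.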
Related
I am running a quantile regression (as my residuals in linear regression were not normally distributed) for a study on the association between the Mediterranean diet and inflammatory markers. While building the model, I got output for the beta coefficients, standard errors, p-values, and confidence intervals. However, once I stratified by low and high levels of exercise, standard errors were no longer reported. Any ideas?
By default, the summary of an rq fit does not return standard errors for smaller samples, but rather confidence intervals obtained by inverting a rank test. If you want standard errors, you have to specify another method for se, such as se="boot".
Note: the whole point of quantile regression is to move away from means and SD, so this may not be the most adequate estimate for your problem.
I'm looking for a function for visualizing interaction effects that works with ivreg or plm. My model is 2SLS with fixed effects, but it seems there are no packages available for calculating interaction effects for such models in R.
I'd be pleased if someone could solve my concern.
You might want to have a look at interplot(). You can use this function to visualize e.g. the estimated coefficient of regressor X on outcome Y conditional on values of instrument Z by simply plugging in the fitted values from ivreg(). (The confidence intervals are trickier, but you are probably less interested in those in the first instance.)
https://cran.r-project.org/web/packages/interplot/vignettes/interplot-vignette.html
Apologies in advance for no data samples:
I built a random forest of 128 trees, with no tuning, using 1 binary outcome and 4 continuous explanatory variables. I then compared the AUC of this forest against that of an existing forest that is already being used to predict on cases. What I want to figure out is what exactly is lending predictive power to this new forest. Univariate analysis of each variable against the outcome led to no significant findings. Any technique recommendations would be greatly appreciated.
EDIT: To summarize, I want to perform multivariate analysis on these 4 explanatory variables to identify what interactions are taking place that may explain the forest's predictive power.
Random Forest is what's known as a "black box" learning algorithm, because there is no good way to interpret the relationship between input and outcome variables. You can however use something like the variable importance plot or partial dependence plot to give you a sense of what variables are contributing the most in making predictions.
Here are some discussions on variable importance plots, also here and here. It is implemented in the randomForest package as varImpPlot() and in the caret package as varImp(). The interpretation of this plot depends on the metric you are using to assess variable importance. For example if you use MeanDecreaseAccuracy, a high value for a variable would mean that on average, a model that includes this variable reduces classification error by a good amount.
Here are some other discussions on partial dependence plots for predictive models, also here. It is implemented in the randomForest package as partialPlot().
In practice, 4 explanatory variables is not many, so you can easily run a binary logistic regression (possibly with an L2 regularization) for a more interpretable model, and compare its performance against the random forest. See this discussion about variable selection. It is implemented in the glmnet package. Basically, an L2 regularization, also known as ridge, is a penalty term added to your loss function that shrinks your coefficients to reduce variance, at the expense of increased bias. This effectively reduces prediction error if the reduction in variance more than compensates for the added bias (this is often the case). Since you only have 4 input variables, I suggested L2 instead of L1 (also known as lasso, which also does automatic feature selection). See this answer for ridge and lasso shrinkage parameter tuning using cv.glmnet: How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?
I am using Orange Canvas and its regression methods to make some estimates about my data set. As I understand it, the regression coefficient r^2 must lie in the interval [-1,1] to be statistically meaningful. But sometimes I get values such as -50,.. or 26,.. etc., so I am confused. How can I interpret such coefficients? Thank you all in advance.
From Wikipedia:
Important cases where the computational definition of R² can yield negative values, depending on the definition used, arise where the predictions that are being compared to the corresponding outcomes have not been derived from a model-fitting procedure using those data, and where linear regression is conducted without including an intercept. Additionally, negative values of R² may occur when fitting non-linear functions to data.
There is nothing in the definition of R² that theoretically prevents it from having arbitrarily negative values. I guess you can interpret -50 as even worse than -1. But with regard to R² = 26, I'm clueless.
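To make the negative case concrete, here is a small Python sketch using the common definition R² = 1 - SS_res / SS_tot. Whether Orange uses exactly this definition is an assumption on my part, and the data below are made up.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (one common definition; others exist)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0, 5.0]

# Predictions close to the data: R^2 is near 1.
good = [1.1, 1.9, 3.2, 3.9, 5.1]

# Predictions far worse than simply predicting the mean of y:
# SS_res is much larger than SS_tot, so R^2 is strongly negative.
bad = [20.0, 18.0, 25.0, 22.0, 19.0]

r2_good = r_squared(y, good)
r2_bad = r_squared(y, bad)
print(r2_good, r2_bad)
```

Under this definition the value can be arbitrarily negative but can never exceed 1, because SS_res is never negative. A reported value like 26 would therefore have to come from a different quantity or definition.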
I have received AUCs and predictions from a collaborator, generated in Weka. The statistical model behind them was cross-validated, so my dataset with the predictions includes columns for fold, predicted probability, and true class. Using these data, I was unable to replicate the AUCs from the predicted probabilities in R. The values always differ slightly.
Additional details:
Weka was used via GUI, not command line
I checked the AUC in R with packages pROC and ROCR
I first tried calculating the AUC over the collected predictions (without regard to fold) and I got different AUCs
Then I tried calculating the AUCs per fold and averaging. This also did not match.
The model was ridge logistic regression and there is a single tie in the predictions
The first fold has one sample more than the others. I have tried taking a weighted average, but this did not work out either
I have even tested averaging the AUCs after logit-transformation (for normality)
Taking the median instead of the mean did not help either
I am familiar with how the AUC is calculated in R, but I don't see what Weka could do differently.
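For reference, the two computations described above (pooled over all predictions versus averaged over folds) can be sketched in Python. This uses the rank-based (Mann-Whitney) form of the AUC with ties counted as one half. It makes no claim about what Weka does internally, and the rows below are made up.

```python
def auc(labels, scores):
    """Probability that a random positive scores higher than a random
    negative, counting tied scores as one half (Mann-Whitney form)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up rows: (fold, true_class, predicted_probability).
# Fold 2 contains a single tie between a positive and a negative.
rows = [
    (1, 1, 0.9), (1, 0, 0.8), (1, 1, 0.7), (1, 0, 0.2),
    (2, 1, 0.6), (2, 0, 0.1), (2, 1, 0.3), (2, 0, 0.3),
]

# Pooled AUC: ignore the fold column entirely.
pooled_auc = auc([r[1] for r in rows], [r[2] for r in rows])

# Per-fold AUC, then an unweighted mean across folds.
folds = sorted({r[0] for r in rows})
fold_aucs = [
    auc([r[1] for r in rows if r[0] == f], [r[2] for r in rows if r[0] == f])
    for f in folds
]
mean_fold_auc = sum(fold_aucs) / len(fold_aucs)

print(pooled_auc, fold_aucs, mean_fold_auc)
```

Even on this tiny example the pooled and fold-averaged values differ, since pooling also compares scores across folds. How ties are counted and how the ROC curve is discretized are further places where two implementations can diverge slightly.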