I am busy with a simulation study for my PhD where I am comparing output generated from ordinary least squares, weighted least squares and survey-weighted least squares regression applied to a stratified two-stage cluster sample. I've used influence.measures(), vif() and colldiag() on the ordinary least squares and weighted least squares regressions, but cannot find a clear answer as to how the same diagnostics can be obtained for svyglm(). Can these measures be obtained in R and if so, how?
Related
I am using the glmmTMB package in R to fit a logistic regression GLMM with fixed and random effects (random intercepts and slopes). For some background, I have 5 fixed covariates, one of which also enters as a quadratic term (so really 6 fixed effects), and I am including random slopes for each of those 6 terms. Before fitting the model, I centered and scaled each covariate (using the scale function) and checked for correlation between covariates (other than the quadratic, all correlations < 0.6). I would like to convert the standardized estimates from the model to unstandardized estimates, because I need to create a predictive map in ArcGIS, where the rasters are unstandardized. To obtain unstandardized estimates directly, I tried fitting the model to the raw data (i.e. skipping the centering and scaling step), but I believe I am running into convergence issues: even though the model runs without warnings, the estimates have large standard errors (10-100x larger than the estimate) and the sign of some estimates flips between the standardized and unstandardized runs. I have found similar posts such as this, this, and this, but I don't think they address exactly my issue, or I am not understanding the math in the solutions. Advice would be very much appreciated.
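The back-transformation asked about here can be sketched for the fixed effects. If each column was standardized as z = (x - m) / s, then a slope on the raw scale is b / s and the raw intercept is b0 - sum(b * m / s). The example below is only an illustration under loud assumptions: it uses simulated data and a plain glm() rather than glmmTMB, it has no random slopes (their variances do not back-transform this simply), and the names x1, x2, beta_raw are made up. A quadratic term that was scaled as its own column would be handled like any other column.

```r
# Simulated data (hypothetical): two covariates on very different scales.
set.seed(1)
n  <- 2000
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- rnorm(n, mean = 3,  sd = 0.5)
y  <- rbinom(n, 1, plogis(-4 + 0.05 * x1 + 0.5 * x2))

# Standardize with scale() and keep the centers and scales it used.
Z  <- scale(cbind(x1 = x1, x2 = x2))
mu <- attr(Z, "scaled:center")
s  <- attr(Z, "scaled:scale")

# Fit on the standardized covariates.
d_std   <- data.frame(y = y, z1 = Z[, "x1"], z2 = Z[, "x2"])
fit_std <- glm(y ~ z1 + z2, family = binomial, data = d_std)
b       <- coef(fit_std)

# Back-transform: slope_raw = b / s, intercept_raw = b0 - sum(b * mu / s).
beta_raw <- c(b[1] - sum(b[-1] * mu / s), b[-1] / s)
names(beta_raw) <- c("(Intercept)", "x1", "x2")

# Check against a fit on the raw covariates (well-conditioned here).
fit_raw <- glm(y ~ x1 + x2, family = binomial)
print(cbind(back_transformed = beta_raw, raw_fit = coef(fit_raw)))
```

In this well-behaved simulation the two columns agree, which suggests that back-transforming the standardized fit is a reasonable way to get raw-scale fixed effects when the raw-scale fit itself is numerically unstable.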
I want to build a survival model and then calculate the X-year (e.g. 10-year) survival probability.
Is there a way to do this using coxph or survreg? Is this possible using random survival forest (e.g. ranger)?
P.S. not sure if important but data is wide (~100 features - mostly continuous) and 17k samples.
For anyone else trying to do the same: if you build a Cox model with survival::coxph or rms::cph, you can use the function pec::predictSurvProb.
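As an alternative sketch that stays within the survival package (this swaps in survfit() rather than the pec::predictSurvProb route mentioned above), the predicted survival probability at a fixed horizon can be read off the fitted curve for new covariate values. The lung data set, the two covariates and the 365-day horizon are assumptions chosen only so the example runs; they stand in for the 10-year horizon in the question.

```r
library(survival)

# Cox model on the lung data shipped with the survival package (time in days).
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Hypothetical new subjects to predict for.
new_subjects <- data.frame(age = c(60, 70), sex = c(1, 2))

# One predicted survival curve per new subject, evaluated at 365 days.
sf       <- survfit(fit, newdata = new_subjects)
surv_365 <- as.numeric(summary(sf, times = 365)$surv)

# Survival probability and the complementary risk of the event by day 365.
print(data.frame(new_subjects, surv_365 = surv_365, risk_365 = 1 - surv_365))
```

The horizon has to lie within the observed follow-up; asking for a time beyond the last observed time gives no estimate.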
I would like to cluster standard errors at the subject level in a nonlinear least squares regression in R. Is it possible to do this with the nls function?
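One way this could be approached, as a rough sketch rather than an established nls feature, is to build a cluster-robust sandwich covariance by hand from the gradient matrix and residuals that an nls fit exposes. Everything below is an assumption for illustration: the data are simulated, the exponential-decay model and the names id, J, meat, bread are made up, and the G / (G - 1) small-sample factor is just one common convention.

```r
# Simulated repeated measures: 40 subjects, 10 observations each.
set.seed(42)
G  <- 40
m  <- 10
id <- rep(seq_len(G), each = m)
x  <- runif(G * m, 0, 5)
u  <- rnorm(G, sd = 0.3)[id]              # shock shared within a subject
y  <- 2 * exp(-0.5 * x) + u + rnorm(G * m, sd = 0.2)

fit <- nls(y ~ a * exp(-b * x), start = list(a = 1, b = 1))

# Gradient of the fitted mean w.r.t. the parameters, and the residuals.
J <- fit$m$gradient()
e <- as.numeric(residuals(fit))

# Sum the per-observation scores J_i * e_i within each subject.
scores_by_id <- rowsum(J * e, group = id)

bread <- solve(crossprod(J))
meat  <- crossprod(scores_by_id)
V_cl  <- (G / (G - 1)) * bread %*% meat %*% bread

se_cluster <- sqrt(diag(V_cl))
names(se_cluster) <- names(coef(fit))
se_default <- sqrt(diag(vcov(fit)))

print(rbind(default = se_default, clustered = se_cluster))
```

The point estimates are unchanged; only the standard errors differ, so inference would use coef(fit) together with se_cluster.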
Apologies in advance for no data samples:
I built a random forest of 128 trees, with no tuning, using 1 binary outcome and 4 continuous explanatory variables. I then compared the AUC of this forest against that of a forest that had already been built and was already predicting on cases. What I want to figure out is what exactly is lending predictive power to this new forest. Univariate analysis against the outcome variable led to no significant findings. Any technique recommendations would be greatly appreciated.
EDIT: To summarize, I want to perform multivariate analysis on these 4 explanatory variables to identify what interactions are taking place that may explain the forest's predictive power.
Random Forest is what's known as a "black box" learning algorithm, because there is no good way to interpret the relationship between input and outcome variables. You can however use something like the variable importance plot or partial dependence plot to give you a sense of what variables are contributing the most in making predictions.
Here are some discussions on variable importance plots, also here and here. It is implemented in the randomForest package as varImpPlot() and in the caret package as varImp(). The interpretation of this plot depends on the metric you are using to assess variable importance. For example, if you use MeanDecreaseAccuracy, a high value for a variable means that randomly permuting that variable's values noticeably worsens the out-of-bag classification accuracy, i.e. the forest relies on that variable for its predictions.
Here are some other discussions on partial dependence plots for predictive models, also here. It is implemented in the randomForest package as partialPlot().
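The two tools mentioned above can be tried on a small simulated case. This is a sketch under stated assumptions: the data are invented so that the outcome depends only on an interaction between x1 and x2 (which is why univariate analysis finds nothing), and x3 and x4 are pure noise. Only the forest size of 128 trees is taken from the question.

```r
library(randomForest)

# Simulated data: y depends only on the sign of x1 * x2 (pure interaction).
set.seed(123)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- factor(as.integer(d$x1 * d$x2 > 0))

# importance = TRUE is needed to get the permutation-based measure.
rf <- randomForest(y ~ ., data = d, ntree = 128, importance = TRUE)

# type = 1 selects MeanDecreaseAccuracy (permutation importance).
imp <- importance(rf, type = 1)
print(imp)
varImpPlot(rf)

# Partial dependence of the predicted class "1" on x1.
# Note: with a pure interaction, a one-variable curve can look nearly flat
# even though the variable matters, so read it alongside the importance.
pd <- partialPlot(rf, pred.data = d, x.var = "x1", which.class = "1")
```

In this setup the permutation importance should single out x1 and x2 even though neither is related to the outcome on its own.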
In practice, 4 explanatory variables is not many, so you can easily run a binary logistic regression (possibly with an L2 regularization) for a more interpretable model, and compare its performance against the random forest. See this discussion about variable selection. It is implemented in the glmnet package. Basically, an L2 regularization, also known as ridge, is a penalty term added to your loss function that shrinks your coefficients for reduced variance, at the expense of increased bias. This reduces prediction error if the reduction in variance more than compensates for the added bias (this is often the case). Since you only have 4 input variables, I suggested L2 instead of L1 (also known as lasso, which also does automatic feature selection). See this answer for ridge and lasso shrinkage parameter tuning using cv.glmnet: How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?
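A minimal sketch of the ridge logistic regression suggested above, assuming simulated data with 4 continuous inputs (the variable names and true coefficients are invented). In glmnet, alpha = 0 selects the ridge (L2) penalty and alpha = 1 the lasso (L1) penalty; cv.glmnet picks the shrinkage parameter lambda by cross-validation.

```r
library(glmnet)

# Simulated data: 4 continuous predictors, binary outcome.
set.seed(123)
n <- 500
x <- matrix(rnorm(n * 4), ncol = 4,
            dimnames = list(NULL, paste0("x", 1:4)))
y <- rbinom(n, 1, plogis(0.8 * x[, 1] - 0.5 * x[, 2]))

# Ridge logistic regression with 10-fold cross-validation over lambda.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0, nfolds = 10)

# Lambda with the lowest cross-validated deviance, and its coefficients.
print(cvfit$lambda.min)
ridge_coef <- coef(cvfit, s = "lambda.min")
print(ridge_coef)

# Predicted probabilities, e.g. for comparing AUC against the forest.
p_hat <- predict(cvfit, newx = x, s = "lambda.min", type = "response")
```

For a fair comparison with the random forest, the predicted probabilities should come from held-out data rather than the training rows used here.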
I have received AUCs and predictions from a collaborator, generated in Weka. The statistical model behind them was cross-validated, so my dataset with the predictions includes columns for fold, predicted probability and true class. Using this data, I was unable to replicate the AUCs from the predicted probabilities in R. The values always differ slightly.
Additional details:
Weka was used via GUI, not command line
I checked the AUC in R with packages pROC and ROCR
I first tried calculating the AUC over the collected predictions (without regard to fold) and I got different AUCs
Then I tried calculating the AUC per fold and averaging. This also did not match.
The model was ridge logistic regression and there is a single tie in the predictions
The first fold has one sample more than the others. I have tried taking a weighted average, but this did not work out either
I have even tested averaging the AUCs after logit-transformation (for normality)
Taking the median instead of the mean did not help either
I am familiar with how the AUC is calculated in R, but I don't see what Weka could do differently.
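The variants tried above can be written out in base R to make the differences between them explicit. This is only a sketch: the data are simulated, the function name auc_mw is made up, and nothing here claims to reproduce what Weka computes. The AUC is obtained from the Mann-Whitney rank statistic, where tied scores receive average ranks and therefore count as one half.

```r
# AUC from the Mann-Whitney statistic; ties in score count as 1/2.
auc_mw <- function(score, label) {
  r  <- rank(score)                      # average ranks for ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated cross-validated predictions: 10 folds, first fold has one extra.
set.seed(7)
n     <- 201
fold  <- c(1, rep(1:10, each = 20))
label <- rep(c(0, 1), length.out = n)
score <- plogis(rnorm(n, mean = ifelse(label == 1, 0.8, 0)))

# Variant 1: pool all predictions and compute a single AUC.
pooled <- auc_mw(score, label)

# Variant 2: AUC within each fold, then plain and size-weighted averages.
per_fold   <- sapply(split(seq_len(n), fold),
                     function(i) auc_mw(score[i], label[i]))
fold_sizes <- as.numeric(table(fold))

print(c(pooled        = pooled,
        mean_fold     = mean(per_fold),
        weighted_fold = weighted.mean(per_fold, fold_sizes)))
```

The pooled and fold-averaged values generally differ, because pooling also compares positives and negatives that come from different folds, so small discrepancies between tools can arise from this choice alone.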