I am conducting a benchmark analysis comparing different learners (logistic regression, gradient boosting, random forest, extreme gradient boosting) with the mlr package.
I understand that there are two different types of preprocessing: data- and learner-dependent vs. independent. Now I would like to conduct the data-dependent preprocessing using mlr's wrapper functionality makePreprocWrapperCaret().
However, I am unsure about the settings. If I understand correctly, I should impute missing values with the median (or mean) for logistic regression, whereas for the tree-based models I should instead impute with very large values.
Question 1) How would I impute NAs with very large values in the code below (for the tree-based models)?
Next, as far as I understand, I should cut off outliers for the logistic regression (e.g. at the 1st and 99th percentiles). However, for tree-based models that is not necessary.
Question 2) How can I cut off outliers (e.g. at the 1st and 99th percentiles) in the code below?
Lastly (again, if I understood correctly), I should standardize the data for the logistic regression. However, I can only find the "center" option within makePreprocWrapperCaret(), which is not exactly the same.
Question 3) How can I standardize in the code below?
Many thanks in advance!!
lrn_logreg = makePreprocWrapperCaret("classif.logreg", ppc.medianImpute = TRUE) # logistic regression --> include standardization + cutoff outliers
lrn_gbm = makePreprocWrapperCaret("classif.gbm") #gradient boosting --> include imputation with great values
lrn_rf = makePreprocWrapperCaret("classif.randomForest") #Random Forest --> include imputation with great values
lrn_xgboost = makePreprocWrapperCaret("classif.xgboost") #eXtreme Gradient Boosting --> include imputation with great values
You can have a look at the mlr tutorial for imputation: https://mlr.mlr-org.com/articles/tutorial/impute.html
1)
You can use mlr's makeImputeWrapper(). To impute with the maximum (or a multiple of it, i.e. a very large value) you can use imputeMax() inside makeImputeWrapper().
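For example, a minimal sketch for the tree-based learners (assuming the features to impute are numeric or integer; the multiplier of 2 is an arbitrary choice):
library(mlr)

# impute NAs with 2 * the maximum of the observed values
lrn_gbm = makeImputeWrapper(
  "classif.gbm",
  classes = list(numeric = imputeMax(multiplier = 2),
                 integer = imputeMax(multiplier = 2))
)
# the same wrapper can be put around "classif.randomForest" and "classif.xgboost"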
2)
For cutting off the highest and lowest values you can write your own preprocessing wrapper: https://mlr.mlr-org.com/articles/tutorial/preproc.html
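A minimal sketch of such a wrapper (my own example, not taken from the tutorial): the 1%/99% cutpoints are learned on the training data and then re-applied to new data at prediction time.
library(mlr)

trainfun = function(data, target, args = list()) {
  nums = setdiff(names(data)[sapply(data, is.numeric)], target)
  cuts = lapply(data[nums], quantile, probs = c(0.01, 0.99), na.rm = TRUE)
  for (v in nums)
    data[[v]] = pmin(pmax(data[[v]], cuts[[v]][1]), cuts[[v]][2])
  list(data = data, control = cuts)   # control stores the learned cutpoints
}

predictfun = function(data, target, args, control) {
  for (v in names(control))
    data[[v]] = pmin(pmax(data[[v]], control[[v]][1]), control[[v]][2])
  data
}

lrn_logreg = makePreprocWrapper("classif.logreg", train = trainfun, predict = predictfun)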
3)
For standardization mlr already provides a function: normalizeFeatures().
See also here: https://mlr.mlr-org.com/reference/normalizeFeatures.html
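A minimal sketch of both options (iris is only a placeholder task; ppc.center and ppc.scale together amount to standardization):
library(mlr)

# on the task itself
task = makeClassifTask(data = iris, target = "Species")
task = normalizeFeatures(task, method = "standardize")

# or as part of the learner, so the scaling is re-estimated in every resampling iteration
lrn_logreg = makePreprocWrapperCaret("classif.logreg", ppc.center = TRUE, ppc.scale = TRUE)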
I was wondering if someone knows of an R package that would allow me to fit an ordinal logistic regression with LASSO regularization or, alternatively, a beta regression, still with the LASSO? And if you also know of a nice tutorial to help me code that in R (with appropriate cross-validation), that would be even better!
Some context: My response variable is a satisfaction score between 0 and 10 (actually, values lie between 2 and 10) so I can model it with a Beta regression or I can convert its values into ranked categories. My interest is to identify important variables explaining this score but as I have too many potential explanatory variables (p = 12) compared to my sample size (n = 105), I need to use a penalized regression method for model selection, hence my interest in the LASSO.
The ordinalNet package does this. There's a paper with examples here:
https://www.jstatsoft.org/article/download/v099i06/1440
Also the glmnetcr package: https://cran.r-project.org/web/packages/glmnetcr/vignettes/glmnetcr.pdf
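A minimal sketch with ordinalNet (placeholder data standing in for your p = 12 predictors and the ordered satisfaction score; see the paper above for a proper tutorial):
library(ordinalNet)

set.seed(1)
y <- factor(sample(2:10, 105, replace = TRUE), levels = 2:10, ordered = TRUE)  # placeholder response
x <- matrix(rnorm(105 * 12), nrow = 105)                                       # placeholder predictors

fit <- ordinalNet(x, y, family = "cumulative", link = "logit", alpha = 1)  # alpha = 1 -> LASSO
summary(fit)

# cross-validation over the lambda sequence
cvfit <- ordinalNetCV(x, y, family = "cumulative", link = "logit", alpha = 1)
summary(cvfit)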
modelo <- lm(P3J_IOP ~ PräOP_IOP + OPTyp + P3J_Med, data = na.omit(df))
summary(modelo)
Error:
Error in step(modelo, direction = "backward") :
  number of rows in use has changed: remove missing values?
I have a lot of missing values in my dependent variable P3J_IOP.
Has anyone any idea how to create the model?
tl;dr unfortunately, this is going to be hard.
It is fairly difficult to make linear regression work smoothly with missing values in the predictors/dependent variables (this is true of most statistical modeling approaches, with the exception of random forests). In case it's not clear, the problem with stepwise approaches with missing data in the predictors is:
incomplete cases (i.e., observations with missing data for any of the current set of predictors) must be dropped in order to fit a linear model;
models with different predictor sets will typically have different sets of incomplete cases, leading to the models being fitted on different subsets of the data;
models fitted to different data sets aren't easily comparable.
You basically have the following choices:
drop any predictors with large numbers of missing values, then drop all cases that have missing values in any of the remaining predictors;
use some form of imputation, e.g. with the mice package, to fill in your missing data (in order to do proper statistical inference, you need to do multiple imputation, which may be hard to combine with stepwise regression); a minimal sketch of the mice route follows below.
There are some advanced statistical techniques that will allow you to simultaneously do the imputation and the modeling, such as the brms package (here is some documentation on imputation with brms), but it's a pretty big hammer/jump in statistical sophistication if all you want to do is fit a linear model to your data ...
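If you go the imputation route from the list above, here is a minimal sketch with mice (using the variable names from the question; stepwise selection is deliberately left out because it does not combine easily with multiple imputation):
library(mice)

imp  <- mice(df, m = 5, seed = 1)                              # 5 imputed data sets
fits <- with(imp, lm(P3J_IOP ~ PräOP_IOP + OPTyp + P3J_Med))   # fit the model on each
pool(fits)                                                     # pool the estimates (Rubin's rules)
summary(pool(fits))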
My Goal: I have an ordinal factor variable (5 levels) to which I would like to apply contrasts to test for a linear trend. However, the factor groups have heterogeneity of variance.
What I've done: Upon recommendation, I used lmRob() from the robust package to create a robust linear model, then applied the contrasts.
# assign the codes for a linear contrast of 5 groups, save as object
contrast5 <- contr.poly(5)
# set contrast property of sf1 to contain the weights
contrasts(SCI$sf1) <- contrast5
# fit and save a robust model (exhaustive instead of subsampling)
robmod.sf1 <- lmRob(ICECAP_A ~ sf1, data = SCI, nrep = Exhaustive)
summary.lmRob(robmod.sf1)
My problem: I have since been reading that robust regression is more suited to addressing outliers, not heterogeneity of variance (bottom of https://stats.idre.ucla.edu/r/dae/robust-regression/). This UCLA page (among others) suggests the sandwich package to get heteroskedasticity-consistent (HC) standard errors (as in https://thestatsgeek.com/2014/02/14/the-robust-sandwich-variance-estimator-for-linear-regression-using-r/).
But these examples use a series of functions/calls to generate output that gives you the HC standard errors, which could then be used to calculate confidence intervals, t-values, p-values, etc.
My thinking is that if I use vcovHC(), I could get the HC std errors, but the HC std errors would not have been 'applied'/a property of the model, so I couldn't pass the model (with the HC errors) through a function to apply the contrasts that I ultimately want. I hope I am not conflating two separate concepts, but surely if a function addresses/down-weights outliers, that should at least somewhat address unequal variances as well?
Can anyone confirm whether my reasoning is sound (and I should thus stay with lmRob())? Or suggest how I could just correct my standard errors and still apply the contrasts?
vcovHC() is the right function to deal with heteroscedasticity; HC stands for heteroscedasticity-consistent estimator. It will not downweight outliers in the estimates of the model effects, but it will calculate the CIs and p-values differently to accommodate the impact of such outlying observations. lmRob() does downweight outlying values but does not handle heteroscedasticity.
See more here:
https://stats.stackexchange.com/questions/50778/sandwich-estimator-intuition/50788#50788
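A minimal sketch of the HC route, reusing the objects from the question (SCI, sf1, ICECAP_A): fit an ordinary lm() with the polynomial contrasts, then test the contrast coefficients with sandwich standard errors.
library(sandwich)
library(lmtest)

contrasts(SCI$sf1) <- contr.poly(5)       # linear, quadratic, ... trend contrasts
mod <- lm(ICECAP_A ~ sf1, data = SCI)

coeftest(mod, vcov = vcovHC(mod, type = "HC3"))  # t-tests with HC3 standard errors
coefci(mod, vcov = vcovHC(mod, type = "HC3"))    # HC-based confidence intervals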
Sorry for a quite stupid question. I am doing multiple comparisons of morphological traits through correlations of bootstrapped data. I'm curious whether such multiple comparisons are impacting my level of inference, and also about the effect of potential multicollinearity in my data. Perhaps a reasonable option would be to use my bootstraps to generate maximum-likelihood estimates and then AICc values to compare models with all of my parameters, to see what comes out as most important. The problem is that although the way forward is (more or less) clear to me, I don't know how to implement this in R. Can anybody be so kind as to throw some light on this for me?
So far, here is an example (using R, but not my data):
library(boot)
data(iris)
head(iris)
# The function
pearson <- function(data, indices) {
  dt <- data[indices, ]
  c(
    cor(dt[, 1], dt[, 2], method = "p"),
    median(dt[, 1]),
    median(dt[, 2])
  )
}
# One example: iris$Sepal.Length ~ iris$Sepal.Width
# I calculate the Pearson correlation (and the medians) with 1000 replications
set.seed(12345)
dat <- iris[,c(1,2)]
dat <- na.omit(dat)
results <- boot(dat, statistic=pearson, R=1000)
# 95% CIs
boot.ci(results, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = "bca")
Intervals :
Level BCa
95% (-0.2490, 0.0423 )
Calculations and Intervals on Original Scale
plot(results)
I have several more pairs of comparisons.
More of a Cross Validated question.
Multicollinearity shouldn't be a problem if you're just assessing the relationship between two variables (in your case correlation). Multicollinearity only becomes an issue when you fit a model, e.g. multiple regression, with several highly correlated predictors.
Multiple comparisons are always a problem, though, because they inflate your type-I error rate. The way to address that is to apply a multiple-comparison correction, e.g. Bonferroni-Holm or the less conservative FDR. That can have its downsides, though, especially if you have a lot of predictors and few observations: it may lower your power so much that you won't be able to find any effect, no matter how big it is.
In a high-dimensional setting like this, your best bet may be some sort of regularized regression. With regularization, you put all predictors into your model at once, similarly to multiple regression; the trick is that you constrain the model so that all of the regression slopes are pulled towards zero, and only the ones with big effects "survive". The machine-learning versions of regularized regression are called ridge, LASSO, and elastic net, and they can be fitted with the glmnet package. There are also Bayesian equivalents, so-called shrinkage priors, such as the horseshoe (see e.g. https://avehtari.github.io/modelselection/regularizedhorseshoe_slides.pdf). You can fit Bayesian regularized regression with the brms package.
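A minimal sketch of both ideas (placeholder numbers and iris data, not yours): correcting a vector of correlation p-values, and a cross-validated LASSO as the regularized alternative.
library(glmnet)

# p-values from the individual correlation tests (placeholder values)
pvals <- c(0.012, 0.048, 0.300, 0.004)
p.adjust(pvals, method = "holm")   # Bonferroni-Holm
p.adjust(pvals, method = "BH")     # Benjamini-Hochberg FDR

# LASSO with cross-validated penalty; x is a numeric predictor matrix, y the response
x <- as.matrix(iris[, 2:4])
y <- iris[, 1]
cvfit <- cv.glmnet(x, y, alpha = 1)
coef(cvfit, s = "lambda.min")      # only predictors with big effects keep nonzero slopes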
Apologies in advance for no data samples:
I built a random forest of 128 trees with no tuning, with 1 binary outcome and 4 continuous explanatory variables. I then compared the AUC of this forest against that of an already-built forest when predicting on cases. What I want to figure out is what exactly is lending predictive power to this new forest. Univariate analysis with the outcome variable led to no significant findings. Any technique recommendations would be greatly appreciated.
EDIT: To summarize, I want to perform multivariate analysis on these 4 explanatory variables to identify what interactions are taking place that may explain the forest's predictive power.
Random Forest is what's known as a "black box" learning algorithm, because there is no good way to interpret the relationship between input and outcome variables. You can however use something like the variable importance plot or partial dependence plot to give you a sense of what variables are contributing the most in making predictions.
Here are some discussions on variable importance plots, also here and here. It is implemented in the randomForest package as varImpPlot() and in the caret package as varImp(). The interpretation of this plot depends on the metric you use to assess variable importance. For example, if you use MeanDecreaseAccuracy, a high value for a variable means that, on average, permuting that variable's values reduces the model's classification accuracy by a good amount, i.e. the variable carries a lot of predictive information.
Here are some other discussions on partial dependence plots for predictive models, also here. It is implemented in the randomForest package as partialPlot().
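A minimal sketch of both plots on built-in data (iris, not your data):
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 128, importance = TRUE)

varImpPlot(rf)                                    # mean decrease in accuracy / Gini
partialPlot(rf, pred.data = iris, x.var = "Petal.Length", which.class = "virginica")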
In practice, 4 explanatory variables is not many, so you can easily run a binary logistic regression (possibly with L2 regularization) for a more interpretable model, and compare its performance against the random forest. See this discussion about variable selection. It is implemented in the glmnet package. Basically, L2 regularization, also known as ridge, is a penalty term added to your loss function that shrinks your coefficients for reduced variance, at the expense of increased bias. This effectively reduces prediction error if the reduction in variance more than compensates for the bias (which is often the case). Since you only have 4 input variables, I suggested L2 instead of L1 (also known as lasso, which also does automatic feature selection). See this answer for ridge and lasso shrinkage parameter tuning using cv.glmnet: How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?
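A minimal sketch of the ridge-penalized logistic regression (assuming a data frame dat with a binary outcome y and your 4 numeric predictors; dat and y are placeholders):
library(glmnet)

x <- model.matrix(y ~ ., data = dat)[, -1]                    # predictor matrix, intercept dropped
cvfit <- cv.glmnet(x, dat$y, family = "binomial", alpha = 0)  # alpha = 0 -> ridge
coef(cvfit, s = "lambda.min")                                 # shrunken coefficients
prob <- predict(cvfit, newx = x, s = "lambda.min", type = "response")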