I have identified genes of interest in disease cases and controls within a microarray gene expression set and have applied PCA. I want to use elastic net regression to build a model that can determine which principal components are predictive of the source (case versus control), but I'm unsure how to do this, i.e. what to input as the x and y variables. Any help at all would be much appreciated!
Some form of subset selection (i.e. the elastic net regression you refer to), where you fit a 'penalized' model and determine the most effective predictors, isn't applicable to PCA or PCR (principal component regression). PCR reduces the data set to 'n' components, and the different principal components refer to different 'directions' of variance within the data: the first principal component is the direction within the data with the most variance, the second principal component is the direction with the second most variance, and so on.
If you were to type:
summary(pcr.model)
It will return a table containing the amount of variance explained in the response (i.e. your y) by each principal component. You will notice there is a cumulative total of variance explained by the principal components.
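Here, pcr.model would be a fitted PCR object; a minimal sketch of how it could be created with the pls package (the data and variable names are placeholders):

library(pls)
# Fit a PCR model with cross-validation; 'y' and 'mydata' are placeholder names
pcr.model <- pcr(y ~ ., data = mydata, scale = TRUE, validation = "CV")
summary(pcr.model)  # % variance explained in the predictors and in y, for each number of components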
The idea of PCR is that you can select a subset of these components (if your data is suitable, i.e. most of the variance is captured in the first few principal components), allowing you to greatly reduce the dimensionality of your data (so that you can, say, plot a graph of PC1 vs PC2). Note that PCR is generally used in the categorisation of ordinal or categorical data types, so if your data isn't like this, you should probably use something else.
If, however, you want to know which predictors are useful and apply an elastic-net type regression, I would recommend using the Lasso. I would also recommend the ISLR book, which contains excellent R walkthroughs of all of the essential frequentist modelling techniques.
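If you do go down that route, a minimal sketch with the glmnet package (x and y are placeholders: x would be a numeric matrix of predictors with one row per sample, y the case/control labels; alpha = 1 gives the lasso, values between 0 and 1 an elastic net):

library(glmnet)
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # binomial family for case vs control
coef(cv_fit, s = "lambda.min")  # non-zero coefficients are the selected predictors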
Briefly, I am working with data sets from two different countries. My aim is to ensemble the models for both countries to see how generalizable the ensemble becomes
My set-up is: I have trained one workflow_set for each country (10 model specifications with resampling and a grid search of size 20).
This is the error I get when trying to add them as candidates:
library(stacks)

predictions <- stacks() %>%
  add_candidates(wf_set_1) %>%
  add_candidates(wf_set_2)
Error:
It seems like the new candidate member 'Logistic Regression' doesn't make use of the same resampling object as the existing candidates.
Thanks for the question!
Unfortunately, we don't support ensembling models trained on different data sets in stacks. There are a few operations that are no longer well-defined when this is the case.
Given your description of the problem, though, this sounds like a setting where, rather than fitting a separate model for each country, the country would be included as a feature in one model fitted across both countries. For any covariate x_i whose effect you feel may depend on country, you can create an interaction term with step_interact() (e.g. step_interact(~ x_i:country)).
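A rough sketch of what that recipe could look like (combined_data, outcome, and x_i are placeholders; combined_data would stack the rows from both countries and carry a country column):

library(recipes)
library(dplyr)  # for %>%

rec <- recipe(outcome ~ ., data = combined_data) %>%
  step_dummy(country) %>%                              # indicator column(s) for country
  step_interact(terms = ~ starts_with("country"):x_i)  # country-specific effect of x_i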
Maybe someone can help me with this question. I conducted a follow-up study and now, of course, have to face missing data. I am considering how best to impute the missing data using MLM in R (e.g. participants completed the follow-up 2 survey but not the follow-up 1 survey, so I am missing L1 predictors for my longitudinal analysis).
I read about Multiple Imputation of multilevel data using the pan package (Schafer & Yucel, 2002) and came across the following code:
library(mitml)  # panImpute() is provided by the mitml package, an interface to pan
imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
Yet I have trouble understanding it completely. Is there maybe another way to impute missing data in R? Or could somebody illustrate the imputation process in a bit more detail? That would be great! Do I have to conduct the imputation for every model I build in my MLM? (e.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)
Thank you in advance
Is there maybe another way to impute missing data in R?
There are other packages. mice is the one that I normally use, and it does support multilevel data.
Do I have to conduct the imputation for every model I build in my MLM? (e.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)
You have to specify the imputation model. Basically, that means you have to tell the software which variables are predicted by which other variables. Since you are comparing models with the same fixed effects and only changing the random effects (in particular, comparing models with and without random slopes), the imputation model should be the same in both cases. So the workflow is:
perform the imputations;
run the model on all the imputed datasets;
pool the results (typically using Rubin's rules).
So you will need to do this twice, to end up with 2 sets of pooled results - one for each model. The software should provide functionality for doing all of this.
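For example, with mice the workflow might be sketched like this (all object and variable names are placeholders: dat is the data, id the cluster variable, x1 a level-1 predictor with missing values, time the measurement occasion, y the outcome; pooling mixed models also needs the broom.mixed package installed):

library(mice)
library(lme4)
meth <- make.method(dat)
meth["x1"] <- "2l.pan"                # multilevel imputation method (see ?mice.impute.2l.pan)
pred <- make.predictorMatrix(dat)
pred["x1", "id"] <- -2                # flag id as the cluster indicator when imputing x1
imp <- mice(dat, method = meth, predictorMatrix = pred, m = 5)   # 1. impute
fit <- with(imp, lmer(y ~ x1 + time + (1 | id)))                 # 2. fit on each imputed dataset
pool(fit)                                                        # 3. pool (Rubin's rules)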
Having said all of that, I would advise against choosing your model based on fit statistics and instead use expert knowledge. If you have strong theoretical reasons for expecting slopes to vary by group, then include random slopes. If not, then don't include them.
I am building in Python a credit scorecard using this public dataset: https://www.kaggle.com/sivakrishna3311/delinquency-telecom-dataset
It's a binary classification problem:
Target = 1 -> Good applicant
Target = 0 -> Bad applicant
I only have numeric continuous predictive characteristics.
In the credit industry it is a legal requirement to explain why an applicant got rejected (or didn't even get the maximum score): to meet that requirement, Adverse Codes are produced.
In a classic logistic regression approach, one would do this:
1. Calculate the Weight-of-Evidence (WoE) for each predictive characteristic, forcing a monotonic relationship between the feature values and the WoE (or log(odds)). In the following example, the higher the Network Age, the higher the WoE:
2. Replace the data values with the corresponding Weight-of-Evidence. For example, a value of 250 for Network Age would be replaced by 0.04 (its corresponding WoE).
3. Train a logistic regression.
After some linear transformations you'd get something like this:
And therefore it'd be straightforward to assign the Adverse Codes, so that the bin with the maximum score doesn't return an Adverse Code. For example:
Now, I want to train an XGBoost model (which typically outperforms a logistic regression on imbalanced, low-noise data). XGBoost models are very predictive but need to be explained (typically via SHAP).
What I have read is that in order to make the model decision explainable you must ensure that the monotonic constraints are applied.
Question 1. Does it mean that I need to train the XGBoost on the Weight-of-Evidence transformed data like it's done with the Logistic Regression (see point 2 above)?
Question 2. In Python, the XGBoost package offers the option to set monotonic constraints (via the monotone_constraints option). If I don't transform the data by replacing values with their Weight-of-Evidence (and therefore don't force any monotonic relationship), does it still make sense to use monotone_constraints in XGBoost for a binary problem? I mean, does it make sense to use monotone_constraints with an XGBClassifier at all?
Thanks.
I fitted an rpart model with leave-one-out cross-validation on my data using the caret library in R. Everything is OK, but I want to understand the difference between the model's variable importance and the decision tree plot.
Calling the variable importance with the function varImp() shows nine variables. Plotting the decision tree using functions such as fancyRpartPlot() or rpart.plot() shows a decision tree that uses only two variables to classify all subjects.
How can this be? Why does the decision tree plot not show the same nine variables that appear in the variable importance table?
Thank you.
Similar to rpart(), caret has a cool property: it deals with surrogate variables, i.e. variables that are not chosen for splits but that were close to winning the competition.
Let me be more clear. Say at a given split the algorithm decided to split on x1. Suppose also there is another variable, say x2, which would be almost as good as x1 for splitting at that stage. We call x2 a surrogate, and we assign it its variable importance just as we do for x1.
This is why you can find, in the importance ranking, variables that are never actually used for splitting. You may even find that such variables rank as more important than variables that are actually used!
The rationale for this is explained in the documentation for rpart(): suppose we have two identical covariates, say x3 and x4. Then rpart() is likely to split on one of them only, e.g., x3. How can we say that x4 is not important?
To conclude, variable importance considers the increase in fit for both primary variables (i.e., the ones actually chosen for splitting) and surrogate variables. So the importance for x1 considers both the splits for which x1 is chosen as the splitting variable and the splits for which another variable is chosen but x1 is a close competitor.
Hope this clarifies your doubts. For more details, see the caret documentation on variable importance. Just a quick quotation:
The following methods for estimating the contribution of each variable to the model are available [speaking of how variable importance is computed]:
[...]
- Recursive Partitioning: The reduction in the loss function (e.g. mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control.
I don't use caret much, but from this quote it appears that the package actually uses rpart() to grow the trees, thus inheriting this treatment of surrogate and competing variables.
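If you want to see this for yourself, a rough sketch along the lines of the quoted documentation (the formula and data are placeholders):

library(rpart)
library(caret)
fit_default <- rpart(y ~ ., data = dat)                                          # maxcompete = 4 is rpart's default
fit_nocomp  <- rpart(y ~ ., data = dat, control = rpart.control(maxcompete = 0)) # competing splits not tabulated
varImp(fit_default)   # can rank variables that never appear in the plotted tree
varImp(fit_nocomp)    # should track the plotted tree more closely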
My question has to do with using the RSGHB package for predicting choice probabilities per alternative by applying mixed logit models (variation across respondents) with correlated coefficients.
I understand that the choice probabilities are simulated on an individual level and that, in order to get a preference share, an average of the individual shares would do. All the sources I have found treat each prediction as a separate simulation, which makes the whole process cumbersome if many predictions are needed.
Since one can save the respondent-specific coefficient draws, wouldn't it be faster to simply apply the logit transform to each (vector of) coefficient draw? Once this is done, new or existing alternatives could be evaluated much faster than rerunning a whole simulation process for each required alternative. For the time being, using a fitted() approach will not help me understand how prediction actually works.
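For what it's worth, the arithmetic I have in mind looks like this (all object names are placeholders: beta_draws would hold the saved respondent-level coefficient draws, X the attribute levels of the alternatives of interest):

# beta_draws: rows = draws/respondents, columns = coefficients
# X: rows = alternatives, columns = the same attributes
V <- beta_draws %*% t(X)          # utilities: one row per draw, one column per alternative
P <- exp(V) / rowSums(exp(V))     # logit transform applied draw by draw
shares <- colMeans(P)             # average over draws = predicted preference shares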