Let's say a dataset has both numeric and categorical features, and I've created an XGBoost model using the gblinear booster. I've analyzed the model with xgb.importance; how can I express the weights of the categorical variables?
While XGBoost is often considered a black-box model, you can understand the feature importance (for both categorical and numeric features) by averaging the gain of each feature over all splits and all trees.
The code below computes and plots this.
library(xgboost)

# Get the real feature names
names <- dimnames(trainMatrix)[[2]]
# Compute the feature importance matrix
importance_matrix <- xgb.importance(names, model = bst)
# Plot the top 10 features
xgb.plot.importance(importance_matrix[1:10, ])
The resulting plot shows the 10 most important features.
The function also assigns a color to each bar: a k-means clustering is applied to group the features by importance.
Alternatively, the same information can be represented as a tree diagram.
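Note that the question asks about the gblinear booster, where there are no trees and no gain: in that case xgb.importance reports a Weight column containing each feature's linear coefficient instead. Categorical variables must be one-hot encoded before training, so each level gets its own coefficient. A minimal sketch of that workflow, assuming a data frame df with a 0/1 column label (all names here are illustrative):

library(xgboost)
library(Matrix)

# One-hot encode categorical columns into a sparse model matrix;
# each factor level becomes its own column with its own weight
trainMatrix <- sparse.model.matrix(label ~ . - 1, data = df)

bst <- xgboost(data = trainMatrix, label = df$label,
               booster = "gblinear", objective = "binary:logistic",
               nrounds = 50)

# With a linear booster, the importance matrix has a Weight column:
# the coefficient of each (one-hot) feature rather than a tree gain
imp <- xgb.importance(feature_names = colnames(trainMatrix), model = bst)
head(imp)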
I have a set of customers with attributes of different types: continuous, categorical, binary, and ordinal.
How can I cluster them, given that the same distance metric cannot be applied to these different attribute types?
Thank you in advance
As mentioned already, the daisy package is an option: it automatically selects a suitable distance metric based on each variable's type. But I would suggest the following approach instead, and I'd ask experts to please chime in.
Rather than relying on automatic selection, first identify and remove correlated variables using, for example (a short sketch follows the list):
Pearson correlation: for continuous variables
Chi-square test: for categorical variables
Categorical vs. numerical: one-way ANOVA, etc.
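A minimal sketch of these checks in base R, assuming a data frame df with illustrative column names num1, num2, cat1, cat2:

# Pearson correlation between two continuous variables
cor(df$num1, df$num2, method = "pearson")

# Chi-square test of independence between two categorical variables
chisq.test(table(df$cat1, df$cat2))

# One-way ANOVA: does a numeric variable differ across the levels
# of a categorical one?
summary(aov(num1 ~ cat1, data = df))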
Taking the subset of useful variables, consider one-hot encoding the categorical variables, and perhaps converting the ordinal ones to continuous (or to categorical, followed by one-hot encoding). Then test different distance metrics, such as Euclidean and Manhattan, and evaluate the results. This way you will get a better picture of the overall clustering process.
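For comparison, here is a minimal sketch of the daisy route mentioned above: Gower distance, which handles mixed types directly, paired with PAM clustering (the data frame df and the choice of k = 3 are illustrative):

library(cluster)

# Gower distance accepts mixed types: numeric, factor, ordered, binary
d <- daisy(df, metric = "gower")

# Partitioning around medoids on the precomputed dissimilarities
fit <- pam(d, k = 3, diss = TRUE)
table(fit$clustering)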
One task of machine learning / data science is making predictions. But I also want to get more insight into the variables of my model.
To get more insight, I tried different methods:
Logistic regression (the output provides some insight into the influence of the different variables, see: Checking interpretation of GLM summary in R)
The xgb.plot.importance function applied to a boosted tree model (on the Titanic dataset).
And I saw a great article on how to explain a boosted tree (unfortunately, the code there is not working for me): https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
My question: are there other methods that give you (or, even better, the business) more insight into which variables have an influence on the target variable? And of course: is the influence positive or negative, and how big is it?
You could also try lasso regression (https://stats.stackexchange.com/questions/17251/what-is-the-lasso-in-regression-analysis), which essentially selects the variables with the strongest influence on the response.
The glmnet package provides support for this type of regression.
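A minimal sketch with cross-validated lasso via glmnet, assuming a data frame df with a binary target column y (names are illustrative):

library(glmnet)

# glmnet needs a numeric matrix; model.matrix one-hot encodes factors
x <- model.matrix(y ~ . - 1, data = df)

# alpha = 1 is the lasso penalty; cross-validation selects lambda
cvfit <- cv.glmnet(x, df$y, alpha = 1, family = "binomial")

# Variables with nonzero coefficients at the chosen lambda are selected;
# the sign and magnitude of each coefficient indicate direction and size
coef(cvfit, s = "lambda.min")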
I'm working with a large, national survey that was collected using complex survey methods. As such, I need to account for sample weights and other survey design features (e.g., sampling strata). I'm new to this methodology, so apologies if the answers here are obvious.
I've had success running path analysis models using the 'lavaan' package paired with the 'lavaan.survey' package. However, some of my models involve only a subset of the data (e.g., only female participants).
How can I adjust the sample weights to reflect the fact that I am only analyzing a subsample (e.g., females)?
The subset() function in the survey package handles subpopulations correctly, and since lavaan.survey uses the survey package to get the basic standard errors for the population covariance matrix, it should all flow through properly.
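A minimal sketch of that flow, assuming a data frame dat with illustrative design variables (psu, strata, wt), a female indicator, and lavaan model syntax in model:

library(survey)
library(lavaan)
library(lavaan.survey)

# Declare the complex design: clusters, strata, and sampling weights
des <- svydesign(ids = ~psu, strata = ~strata, weights = ~wt,
                 data = dat, nest = TRUE)

# subset() keeps the full design while restricting estimation to the
# subpopulation, which is the correct way to handle a subsample
des_female <- subset(des, female == 1)

# Fit the lavaan model, then adjust it for the survey design
fit <- sem(model, data = dat)
fit_svy <- lavaan.survey(lavaan.fit = fit, survey.design = des_female)
summary(fit_svy)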
I'm doing credit risk modelling and the data have a large number of features. I am using the Boruta package for feature selection, but it is too computationally expensive to run on the complete training dataset. What I'm trying to do is take a subset of the training data (say, 20-30%), run Boruta on that subset, and get the important features. But when I train with random forest, I have to use the full dataset. My question: is it right to select features on only part of the training data, but then build the model on the whole training set?
Since the question is conceptual in nature, I will give my two cents.
A single random sample of 20% of the data is good enough, I believe.
A step further would be to take 3-4 such random samples; the intersection of the significant variables across all of them improves on the above.
Use feature selection from multiple methods (xgboost, some caret feature selection methods) with a different random sample for each, and then take the common significant features (see the sketch after this list).
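A minimal sketch of the subsample-then-select workflow, assuming a data frame train with a factor column target (names are illustrative):

library(Boruta)
library(randomForest)

set.seed(42)

# Run Boruta on a 20% random subsample to keep it tractable
idx <- sample(nrow(train), size = floor(0.2 * nrow(train)))
bor <- Boruta(target ~ ., data = train[idx, ])

# Keep only confirmed features, dropping tentative ones
feats <- getSelectedAttributes(bor, withTentative = FALSE)

# Train the final model on the FULL dataset, restricted to those features
rf <- randomForest(x = train[, feats], y = train$target)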
Fitting repeated measures in R, convergence issues. I have the following fit, on one of many datasets, and it doesn't converge; other datasets do. This dataset and model work in SAS... Could I get some direction on what to do to make this work in R? Things to look at: matrices, option settings, a reference on this topic for R/S-PLUS...
fit.gls <- gls(resp ~ grpnum + grpnum/day, data = long,
               corr = cormat, na.action = na.omit)
Error in glsEstimate(object, control = control) :
computed "gls" fit is singular, rank 62
I have read the following and am still trying to work through it...
Converting Repeated Measures mixed model formula from SAS to R
The problem is the data. gls needs to invert a matrix to work (see Wikipedia for the generalized least squares formula that estimates the coefficients). For your particular dataset, that matrix is not invertible.
You can allow singular fits with the control argument:
fit.gls <- gls(resp ~ grpnum + grpnum/day, data = long,
               corr = cormat, na.action = na.omit,
               control = list(singular.ok = TRUE))
Be careful with this as you may get bad results! Always check the model fit afterwards.
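A minimal sketch of such post-fit checks with standard nlme tooling (fit.gls as above):

# Coefficients and standard errors; implausible values signal trouble
summary(fit.gls)

# Confidence intervals; these can blow up for a near-singular fit
intervals(fit.gls)

# Residuals vs. fitted values; systematic patterns suggest a bad fit
plot(fit.gls)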
Look at the help pages for gls and glsControl for more details about the options.