I have built a model on medical data in R using XGBoost. I can get the feature importance, and the top feature is BMI. However, I do not know how to tell in which direction it affects the model's predictions. The outcome is binary (hemorrhage), so does a lower or a higher BMI make the model more likely to classify a patient as at risk of hemorrhage? Does anyone know how to find this out? Thank you!
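One way to look at the direction of an effect is a partial dependence curve: force BMI to each value on a grid for every patient, average the predicted risk, and see whether the average rises or falls. Below is a minimal, model-agnostic sketch in Python (standard library only) rather than R. `toy_predict`, the patient records and the BMI grid are hypothetical stand-ins for the fitted XGBoost model and the real data. The curve describes what the model has learned, not a causal effect of BMI.

```python
import math


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original record is untouched
            modified[feature] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row):
    # Hypothetical stand-in for a fitted classifier's predicted probability.
    # In practice this would wrap the trained model's predict call.
    score = 0.15 * (row["bmi"] - 27) - 0.02 * (row["age"] - 50)
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical patient records (not real data).
patients = [
    {"bmi": 22, "age": 40},
    {"bmi": 31, "age": 65},
    {"bmi": 27, "age": 52},
]

bmi_grid = [18, 22, 26, 30, 34, 38]
pd_curve = partial_dependence(toy_predict, patients, "bmi", bmi_grid)

for bmi, risk in zip(bmi_grid, pd_curve):
    print(f"BMI {bmi}: mean predicted risk {risk:.3f}")
```

If the printed risks rise along the grid, higher BMI pushes the model towards the "at risk" class; if they fall, lower BMI does. A curve that goes up and then down would mean the relationship is not monotone.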
Related
I'm currently working on my master's thesis. For my research, I have to measure the impact of two independent variables (number of infographics and number of other images) on a dependent variable (amount raised). Other control variables are also included. The data consist of a database of 2000+ observations, some of which have missing (NA in R) values for the independent variables. Someone suggested using a Heckman model for this. I watched a YouTube video to understand it better, but I doubt whether a Heckman model is the right way to analyze this. Would you recommend using the Heckman model here?
How would you analyze this?
Best regards
I’m currently working with a dataset that has lots of zeros in the predictor variables as well as in the response variable. The response variable is continuous and heavily right-skewed.
I’m trying to apply a discrete-continuous (two-part) model: in the first part I fit a binomial logit model for whether the response is zero or nonzero, and in the second part I fit a regression model to the nonzero observations only.
Stata lets you do this type of analysis very easily, but I am using RStudio and have not found a package that clearly implements this approach. I’d greatly appreciate it if someone could point me to the package I should be using, ideally with an example.
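To make the two parts concrete, here is a minimal sketch of the approach described above, written in plain Python (standard library only) rather than R, on simulated data. Everything in it is an illustrative assumption: the single predictor `x`, the simulated coefficients, the choice of a log-linear regression for the nonzero part, and the helper names `fit_logistic` and `fit_ols`. In R the same two steps can be fitted with two separate `glm()` calls, one with `family = binomial` on the zero/nonzero indicator and one on the nonzero subset.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(x, z, steps=2000, lr=0.1):
    """Fit P(z = 1) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a = b = 0.0
    n = len(x)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for xi, zi in zip(x, z):
            p = sigmoid(a + b * xi)
            grad_a += zi - p
            grad_b += (zi - p) * xi
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def fit_ols(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Simulated data (an assumption for illustration): many exact zeros,
# and a right-skewed positive outcome otherwise.
random.seed(1)
x = [random.gauss(0, 1) for _ in range(400)]
y = []
for xi in x:
    if random.random() < sigmoid(0.5 + 1.0 * xi):
        y.append(math.exp(1.0 + 0.3 * xi + random.gauss(0, 0.5)))
    else:
        y.append(0.0)

# Part 1: logit model for zero versus nonzero.
is_positive = [1 if yi > 0 else 0 for yi in y]
a1, b1 = fit_logistic(x, is_positive)

# Part 2: regression on log(y), using the nonzero observations only.
x_pos = [xi for xi, yi in zip(x, y) if yi > 0]
log_y_pos = [math.log(yi) for yi in y if yi > 0]
a2, b2 = fit_ols(x_pos, log_y_pos)
resid_var = sum(
    (ly - (a2 + b2 * xi)) ** 2 for xi, ly in zip(x_pos, log_y_pos)
) / (len(x_pos) - 2)


def expected_y(xi):
    # E[y | x] = P(y > 0 | x) * E[y | y > 0, x]; the exp(resid_var / 2) term
    # assumes normal errors on the log scale.
    return sigmoid(a1 + b1 * xi) * math.exp(a2 + b2 * xi + resid_var / 2)


print(f"Part 1 (logit):   intercept={a1:.2f}, slope={b1:.2f}")
print(f"Part 2 (log-OLS): intercept={a2:.2f}, slope={b2:.2f}")
print(f"Expected y at x=0: {expected_y(0.0):.2f}")
```

The two parts are estimated separately, so each can use its own set of predictors. The log-linear second part is only one option; a Gamma regression with a log link on the nonzero observations is a common alternative for right-skewed outcomes.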
I am trying to understand and use PISA data in the most "accurate" way possible.
In the end, all I need are country means, but I want to control for individual-level variables so that I get the country net effect, since I want to analyse education system characteristics. I am currently getting them with ranef from an lmer multilevel model (with (1|CNT/SCHOOLID) and (1|CNT) nesting).
I first thought of using the survey package so that I could include the replicate weights (svyrepdesign works), but then I cannot control for individual-level variables (no multilevel model is possible). I tried svynlm, but it did not work: it takes far too long (48 hours, and then it does not converge).
I have now figured out that in my case I should be using senate weights instead of the final student weights (not a problem so far, since I can include those in lmer).
If I use mitools and lmer, I have a multilevel model with senate weights and the plausible values, but without the replicate weights. Since I am not using the standard errors in my further analysis, I could go with just the senate weights, but I need a better justification than "technical considerations".
Does someone know a way out of this dilemma?
Look at BIFIEsurvey. For two-level regressions, BIFIE.twolevelreg allows you to use plausible values (PVs) and replicate weights.
Also consider how to scale your weights; see Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169, 805–827.
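To illustrate what scaling the weights can mean in practice, here is a small sketch in Python (standard library only) of two rescalings of level-1 weights within each cluster that are commonly discussed in the multilevel weighting literature: one makes the weights in a cluster sum to the cluster's sample size, the other to its effective sample size. The function names, school labels and weights are hypothetical, and the sketch does not settle which scaling, if either, is appropriate for PISA.

```python
def scale_to_cluster_size(weights_by_cluster):
    """Rescale so the weights in each cluster sum to the number of sampled units."""
    scaled = {}
    for cluster, weights in weights_by_cluster.items():
        factor = len(weights) / sum(weights)
        scaled[cluster] = [w * factor for w in weights]
    return scaled


def scale_to_effective_size(weights_by_cluster):
    """Rescale so the weights in each cluster sum to (sum w)^2 / sum(w^2)."""
    scaled = {}
    for cluster, weights in weights_by_cluster.items():
        factor = sum(weights) / sum(w * w for w in weights)
        scaled[cluster] = [w * factor for w in weights]
    return scaled


# Hypothetical student weights grouped by school (not real PISA values).
raw_weights = {
    "school_A": [10.0, 20.0, 30.0],
    "school_B": [5.0, 5.0],
}

by_size = scale_to_cluster_size(raw_weights)
by_effective = scale_to_effective_size(raw_weights)

for school in raw_weights:
    print(
        school,
        "sum (cluster size):", round(sum(by_size[school]), 3),
        "sum (effective size):", round(sum(by_effective[school]), 3),
    )
```

Both rescalings keep the relative weights of students within a school and change only the total. When all weights in a cluster are equal, as in `school_B`, the two scalings coincide.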
I am trying to predict the Spotify popularity score using a range of machine learning algorithms, including logistic regression, from the caret package in R. The aim is to predict track popularity from audio features such as danceability and energy. The problem is that Spotify is not transparent about how the popularity score is calculated, but I know it is based on a number of things, including play counts and how recent the track is. That means the number of days since release will affect the popularity score, so I have included days_released as an independent variable in my models to try to control for it.
So, I have 50 variables (days_released being one of them). I am using the rfe function in caret to perform feature selection, but for every algorithm days_released is the only variable selected. Does anyone have any advice or recommended reading on how to overcome this problem? I want to predict popularity and explore which track features have a significant relationship with popularity, controlling for days_released.
Do I take the days_released variable out altogether?
Do I leave it in but force rfe to select more than one feature?
Any help would be much appreciated! Thanks in advance!
I've been using the BTYD package to predict customer churn and the number of future orders, but I find the included models (Pareto/NBD, BG/NBD and BG/BB) limited in that they only take recency, frequency, age and monetary value into account. Using these values I get accuracy of up to 80%, but I'm sure this could be improved by incorporating more meaningful predictors into the model. Is that possible in this package? I couldn't find any information about it in the vignette. Any help is welcome, thanks!
Kasia