I'm working on a churn model; the main idea is to predict whether a customer will churn within 30 days.
I've been struggling with my dataset: I have 100k rows and my target variable is imbalanced, 95% no churn and 5% churn.
I'm trying GLM and RF. If I train both models on the raw data, I don't get any churn predictions at all, so that doesn't work for me. I have tried balancing by taking all the churners and the same number of non-churners (50% churn, 50% no churn), training on that and then testing on my real data, and I get a lot of churn predictions for customers who don't actually churn. I tried oversampling, undersampling, ROSE, and SMOTE, and nothing seems to work for me.
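Roughly what I'm doing for the balanced GLM run, as a minimal sketch (the names `train`, `test`, and a 0/1 `churn` column are placeholders, not my real data):

```r
# Minimal sketch of the 50/50 balancing + GLM setup described above.
# Placeholder names: data frames train/test with a 0/1 target column churn.
set.seed(42)
churners     <- train[train$churn == 1, ]
non_churners <- train[train$churn == 0, ]
train_bal    <- rbind(churners,
                      non_churners[sample(nrow(non_churners), nrow(churners)), ])

fit  <- glm(churn ~ ., data = train_bal, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

# Confusion table at the default 0.5 cutoff; this is where I see far too many
# predicted churners. The cutoff can of course be moved away from 0.5.
table(predicted = as.integer(prob > 0.5), actual = test$churn)
```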
At best, both models predict a maximum of 20% of all my churners, so my gain and lift are not that good. I think I've tried everything, but I can't get above 20% of the churners I need to catch.
I have customer behavior variables, personal information, and more. I also did an exploratory analysis, calculating the percentage of churn per age, per sex, and per behavior, and I saw that every group has the same churn percentage, so I'm thinking that maybe I lack variables that separate the groups better (this last idea is just a personal hunch).
Thank you everyone, greetings.
My job is to make sure that an online retailer achieves a certain service level (in-stock rate) for their products, while avoiding aging and excess stock. I have a robust cost and leadtime simulation model. One of the inputs into that model is a vector of prediction intervals for cumulative demand over the next leadtime weeks.
I've been reading about quantile regression, conformal models, gradient boosting, and quantile random forests... frankly, all of these are far above my head, and they seem focused on multivariate regression of non-time-series data. I know that I can't just regress against time, so I'm not even sure how to set up a complex regression method correctly. Moreover, since I'm forecasting many thousands of items every week, the parameter setting and tuning needs to be completely automated.
To date, I've been using a handful of traditional forecasting methods (TSB [a variant of Croston], ETS, ARIMA, etc.), including hybrids, using R packages like forecastHybrid. My prediction intervals are almost universally much narrower than our actual results (e.g., in a sample of 500 relatively steady-selling items, 20% of actuals fell below my ARIMA 1% prediction bound and 12% above the 99% bound).
I switched to using simulation + bootstrapping the residuals to build my intervals, but the results are directionally the same as above.
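For a single item, the kind of thing I mean looks roughly like this (a minimal sketch using ets() from the forecast package; `y` and `L` are placeholders for a weekly demand series and the leadtime in weeks, and this may not match my exact setup):

```r
library(forecast)

# Minimal sketch: fit ETS, bootstrap-simulate future sample paths from the
# residuals, and take empirical quantiles of *cumulative* demand over the
# leadtime. 'y' is a weekly demand ts, 'L' the leadtime in weeks (placeholders).
fit   <- ets(y)
L     <- 6
nsims <- 2000

cum_demand <- replicate(
  nsims,
  sum(simulate(fit, nsim = L, bootstrap = TRUE))  # one simulated path, summed over L weeks
)

# Empirical prediction intervals for cumulative leadtime demand
quantile(cum_demand, probs = c(0.01, 0.10, 0.50, 0.90, 0.99))
```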
I'm looking for the simplest way to arrive at a univariate time series model with more accurate prediction intervals for cumulative demand over leadtime weeks, particularly at the upper/lower 10% and beyond. All my current models are trained on MSE, so one step is probably to use something more like pinball loss scored against cumulative demand (rather than the per-period error). Unfortunately, I'm totally unfamiliar with how to write a custom loss function for the legacy forecasting libraries (much less the sexy new ones above).
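For scoring, my understanding of the pinball (quantile) loss is something like the following sketch (argument names are placeholders):

```r
# Pinball (quantile) loss for a single quantile level tau.
# actual: realized cumulative demand; q_pred: forecast of the tau-quantile.
pinball <- function(actual, q_pred, tau) {
  err <- actual - q_pred
  mean(ifelse(err >= 0, tau * err, (tau - 1) * err))
}

# e.g. score the 90th-percentile cumulative-demand forecasts across items:
# pinball(actual = realized_cum_demand, q_pred = p90_forecast, tau = 0.90)
```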
I'd deeply appreciate any advice!
A side note: we already have an AWS setup that can compute each item from an R job in parallel, so computing time is not a major factor.
Dataset Description: I use a dataset with neuropsychological (np) tests from several subjects. Every subject has more than one test in his/her follow-up, i.e., one test per year. I study the cognitive decline in these subjects. The information that I have is: Individual number (identity number), Education (years), Gender (M/F as factor), Age (years), and Time from Baseline (= years after the first np test).
AIM: My aim is to measure the rate of change in their np tests, i.e., the cognitive decline per year for each of them. To do that I use Linear Mixed Effects Models (LMEM), taking the above parameters into account, and I compute the slope for each subject.
Question: When I run the possible models (combining different parameters each time), I also check them for singularity, and the result in almost all cases is TRUE. So my models are singular! If I wanted to use these models for prediction, this would not be good, as it means the model overfits the data. But since I just want to find the slope for each individual, I think this is not a problem, or even an advantage, as in that case singularity offers a more precise calculation of the subjects' slopes. Do you think this reasoning is correct?
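For concreteness, the kind of model I fit looks roughly like this (a minimal lme4 sketch; `dat`, `Score`, and the column names are placeholders for my real variables):

```r
library(lme4)

# Minimal sketch: random intercept and slope for Time from Baseline per subject.
# Placeholder names: data frame 'dat' with columns Score (np test result),
# TimeFromBaseline, Age, Education, Gender, Subject.
m <- lmer(Score ~ TimeFromBaseline + Age + Education + Gender +
            (1 + TimeFromBaseline | Subject),
          data = dat)

isSingular(m)     # TRUE in almost all of my fits
coef(m)$Subject   # per-subject intercepts and slopes (fixed + random effects)
```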
I would like to detect patterns within a weather dataset of around 10'000 data points. I have around 40 possible predictors (temperature, humidity etc.) which may explain good or bad weather the next day (dependent variable). Normally, I would apply classical machine learning methods like Random Forest to build and test models for classifying the whole dataset and find reliable predictors to forecast the next day's weather.
My task though is different. I want to find predictors and their parameters which "guarantee" me good or bad weather in a subset of the overall data. I am not interested in describing the whole dataset but finding the pattern of predictors (and their parameters) that give me good or bad weather indications. So I am trying to find, for example, 100 datapoints with 100% good weather if certain predictors are set to certain levels. I am not interested in the other 9'900 datapoints.
It is essentially the task of trying all combinations and calibrations of the predictors to find a subset of the data points that can be predicted with very high accuracy.
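One crude direction I've sketched, though I'm not sure it's the right tool (placeholder names: a data frame `weather` with the predictors and a factor target `good_tomorrow`):

```r
library(rpart)

# Rough sketch: grow a deep tree, then keep only leaves that are large enough
# AND nearly pure; the splits on the path to such a leaf are one candidate "pattern".
tree <- rpart(good_tomorrow ~ ., data = weather, method = "class",
              control = rpart.control(cp = 0, minbucket = 50))

leaves <- tree$frame[tree$frame$var == "<leaf>", ]
leaves$purity <- 1 - leaves$dev / leaves$n   # dev = misclassified cases in the node

# Candidate subsets: at least 100 points and at least 99% one class
good_leaves <- subset(leaves, n >= 100 & purity >= 0.99)
good_leaves

# Print the predictor conditions that lead to those leaves
path.rpart(tree, nodes = as.numeric(rownames(good_leaves)))
```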
How would you do this systematically?
I am working on a decision tree model. The dataset is related to cars. I have 80% of the data in the training set and 20% in the test set. The summary of the model (based on the training data) shows a misclassification rate of around 0.02605, whereas when I run the model on the test set it comes out at around 0.0289, a difference of about 0.003. Is the difference acceptable, and what is causing it? I am new to R/statistics. Please share your feedback.
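Roughly how I compute the two numbers, as a sketch (`cars_df` and `class` are placeholder names, not my exact code):

```r
library(rpart)

# Sketch of the comparison (placeholder names: data frame 'cars_df' with a
# factor target 'class'; 80/20 split).
set.seed(1)
idx   <- sample(nrow(cars_df), floor(0.8 * nrow(cars_df)))
train <- cars_df[idx, ]
test  <- cars_df[-idx, ]

fit <- rpart(class ~ ., data = train, method = "class")

misclass <- function(model, data) {
  mean(predict(model, data, type = "class") != data$class)
}
misclass(fit, train)   # resubstitution (training) error, ~0.026 in my case
misclass(fit, test)    # held-out (test) error, ~0.029 in my case
```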
An acceptable misclassification rate is more art than science. If your data are generated from a single population, then there is almost certainly some unavoidable overlap between the groups, which will make linear classification error-prone. This doesn't mean it's a problem. For instance, if you are classifying credit card charges as possibly fraudulent or not, and your recourse isn't too harsh when you classify an observation as fraudulent, then it may be advantageous to err on the safe side and accept more false positives rather than chase a low misclassification rate. You could (1) visualize your data to identify overlap, or (2) compute N * 0.03 to see the number of misclassified cases; if you have an understanding of what you are classifying, you can assess the seriousness of the misclassifications that way.
I have thousands of factors (categorical variables) on which I am applying a classification using Naive Bayes.
My problem is that I have many factors that appear very few times in my dataset, so it seems they decrease the performance of my prediction.
Indeed, I noticed that if I removed the categorical variables that occurred very few times, I had a significant improvement in my accuracy. But ideally I would like to keep all my factors; do you know what the best practice is to do so?
Big Thanks.
This is too long for a comment.
The lowest frequency terms may be adversely affecting the accuracy simply because there is not enough data to make an accurate prediction. Hence, the observations in the training set may say nothing about the validation set.
You could combine all the lowest frequency observations into a single value. Off-hand, I don't know what the right threshold is. You can start by taking everything that occurs 5 or fewer times and lumping them together.
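Something along these lines, as a sketch (`df`, `pred_cols`, and the threshold of 5 are placeholders you would adjust):

```r
# Sketch: lump every level that occurs <= min_count times into one "other" level.
# Placeholder: f is a factor vector; adjust min_count as needed.
lump_rare <- function(f, min_count = 5, other = "other") {
  counts <- table(f)
  rare   <- names(counts)[counts <= min_count]
  f      <- as.character(f)
  f[f %in% rare] <- other
  factor(f)
}

# Apply to the categorical predictor columns (not the class label) before Naive Bayes;
# pred_cols is a placeholder vector of those column names.
df[pred_cols] <- lapply(df[pred_cols], lump_rare)
```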