I am attempting to build a classification model based on the results of a qualitative survey. About 10K of our customers were surveyed, a segmentation model was built from the responses, and each respondent was then categorised into 1 of 8 customer segments. The challenge is now to classify the TOTAL customer base into those segments. As only certain customers responded, the researcher used overall demographics to apply post-stratification weights (i.e. frequency weights).
My task is now to use our internal customer data as explanatory variables for these 10K respondents in order to build a classification model that can be applied to the whole base.
To handle the customer weights I simply duplicated each customer record according to its frequency weight, which exploded the data set to about 72K rows. I then split this data into train and test sets, used the R caret package to train a GBM, and classified my hold-out test set with the final chosen model.
I was getting 82% accuracy and thought the results were too good to be true. After thinking about it, I believe the issue is that the model is inadvertently seeing records in the training set that are exact copies of records in the test set (some records are duplicated up to 10 times).
I know that the glm function lets you supply a vector of case weights via its weights argument, but my question is how to use such weights with other machine learning algorithms, such as GBM or random forests, in R?
Thanks
You can use case weights with gbm and train. In general, the caret documentation lists which models can use case weights.
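A minimal sketch of what that looks like, assuming the 10K survey records sit in a data frame dat with a factor outcome segment and a numeric frequency-weight column freq_wt (all placeholder names), so no row duplication is needed:

library(caret)

set.seed(1)
idx       <- createDataPartition(dat$segment, p = 0.75, list = FALSE)
train_dat <- dat[idx, ]
test_dat  <- dat[-idx, ]

fit <- train(segment ~ . - freq_wt, data = train_dat,
             method    = "gbm",
             weights   = train_dat$freq_wt,   # case weights, passed through to gbm()
             trControl = trainControl(method = "cv", number = 5),
             verbose   = FALSE)

# Evaluate on the untouched (and un-duplicated) hold-out set
confusionMatrix(predict(fit, test_dat), test_dat$segment)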
I'm using caret's timeslice method to do step-ahead cross validation on time series data. I was surprised to find that:
the 'best' hyperparameters chosen by caret are those with the best average performance across all train/test splits, and similarly the reported performance is the average across all train/test splits based on these hyperparam values, and
caret trains a final model using all of the data available - which makes sense when fixedWindow = TRUE but perhaps not otherwise.
My data are non-stationary, so I'd like hyperparameter tuning, performance reporting and final model training to be based on recent data so that:
optimal hyperparameter values can change as underlying relationships change
reported performance can account for the fact that the best hyperparam values may change across splits, and
recent changes in underlying relationships are best captured when training the final model.
I'm wondering if the best approach for my non-stationary data would be something like the following (a rough code sketch is at the end of this post):
split each training fold into a training subset and validation subset - using the validation subset to pick hyperparam values within each training fold.
train on the entire training fold using the hyperparam values selected in (1)
report performance based on whatever hyperparameter values were selected in (1), even though these may change from fold to fold
The final model is trained, and hyperparameter values selected, based on steps (1) and (2) using the most recent data only.
Having typed this up I've realised that I'm just describing nested CV for time series. My questions:
is this best practice for training time series models when data are non-stationary?
can caret, or another R package, do this?
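To make that concrete, here is a rough sketch of the manual outer loop I have in mind, using caret::createTimeSlices for the outer rolling-origin splits and method = "timeslice" inside train for the per-fold tuning (x, y, the GBM choice and all window sizes are placeholders):

library(caret)

# Outer rolling-origin splits over the time-ordered data
outer <- createTimeSlices(seq_len(nrow(x)),
                          initialWindow = 200, horizon = 20,
                          fixedWindow = TRUE, skip = 19)

fold_results <- lapply(seq_along(outer$train), function(i) {
  tr <- outer$train[[i]]
  te <- outer$test[[i]]

  # (1) inner validation: tune hyperparameters within this training window only
  inner_ctrl <- trainControl(method = "timeslice",
                             initialWindow = 150, horizon = 20,
                             fixedWindow = TRUE)

  # (2) train() refits on the whole training window with the chosen hyperparameters
  fit <- train(x[tr, ], y[tr], method = "gbm",
               trControl = inner_ctrl, verbose = FALSE)

  # (3) outer performance, using this fold's own chosen hyperparameters
  pred <- predict(fit, x[te, ])
  data.frame(fold = i, RMSE = sqrt(mean((pred - y[te])^2)), fit$bestTune)
})

do.call(rbind, fold_results)

The final model (point 4) would then be a last call to train() on the most recent window only, done by hand in the same way.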
Maybe someone can help me with this question. I conducted a follow-up study and now, unsurprisingly, have to deal with missing data. I am considering how best to impute the missing data for my multilevel model (MLM) in R (e.g. some participants completed the follow-up 2 survey but not the follow-up 1 survey, so I am missing level-1 predictors for my longitudinal analysis).
I read about Multiple Imputation of multilevel data using the pan package (Schafer & Yucel, 2002) and came across the following code:
library(mitml)   # panImpute() is provided by the mitml package, which wraps pan

imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
Yet I have trouble understanding it completely. Is there maybe another way to impute missing data in R? Or could somebody illustrate the imputation process in a bit more detail? That would be great! Do I have to conduct the imputation for every model I build in my MLM? (e.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)
Thank you in advance
Is there maybe another way to impute missing data in R?
There are other packages. mice is the one that I normally use, and it does support multilevel data.
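A minimal sketch with mice, assuming a level-1 predictor x1 with missing values and a cluster identifier column called group (both placeholder names):

library(mice)

ini  <- mice(dat, maxit = 0)        # dry run to get the default settings
pred <- ini$predictorMatrix
meth <- ini$method

pred["x1", "group"] <- -2           # -2 flags the cluster (level-2 id) variable
meth["x1"]          <- "2l.pan"     # multilevel imputation via the pan algorithm (needs pan installed)

imp <- mice(dat, method = meth, predictorMatrix = pred, m = 5, seed = 123)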
Do I have to conduct the imputation for every model I build in my MLM? (e.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)
You have to specify the imputation model. Basically that means you have to tell the software which variables are predicted by which other variables. Since you are comparing models with the same fixed effect, and only changing the random effects (in particular comparing models with and without random slopes), the imputation model should be the same in both cases. So the workflow is:
perform the imputations;
run the model on each of the imputed datasets;
pool the results (typically using Rubin's rules).
So you will need to do this twice, to end up with 2 sets of pooled results - one for each model. The software should provide functionality for doing all of this.
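In mice that workflow looks roughly like this (model formulas and variable names are placeholders):

library(mice)
library(lme4)

# Fit each candidate model on every imputed dataset, then pool with
# Rubin's rules (pooling mixed models relies on broom.mixed being installed)
fit_ri <- with(imp, lmer(y ~ time + x1 + (1 | group)))
fit_rs <- with(imp, lmer(y ~ time + x1 + (1 + time | group)))

summary(pool(fit_ri))   # pooled results for the random-intercept model
summary(pool(fit_rs))   # pooled results for the random-slope model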
Having said all of that, I would advise against choosing your model based on fit statistics and instead use expert knowledge. If you have strong theoretical reasons for expecting slopes to vary by group, then include random slopes. If not, then don't include them.
Currently I am working on a project whose objective is to find the customers who are most likely to purchase our product. It's a binary classification model (0 & 1).
I have built models with both RF and XGB and calculated the gain score (the data are imbalanced). More than 80% of the target customers are covered in the top 3 deciles on the training data, but when I run the models on the validation dataset this falls back to 56-59% for both models.
Say I have 20 customers and, for better accuracy, I have clustered them. Now the model gives near-perfect results on cluster 1 customers but performs poorly on cluster 2 customers.
Any suggestions on how to tune this?
Firstly, if there is a large accuracy gap between your training and validation sets, your model is most likely overfitting (high variance). You may need a simpler or more regularized model, or more training data.
Secondly, because your dataset is imbalanced, you may want to resample the training set. You can use under-sampling or over-sampling techniques such as SMOTE.
Thirdly, you may need to use evaluation metrics suited to imbalanced data, such as precision, recall and F1, rather than accuracy.
Finally, when making the train/validation/test split you need to be careful that the class distribution is preserved in each part, i.e. use stratified splits (see the sketch below).
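A hedged sketch of the last three points in caret (dat and the factor outcome purchased, with levels "no"/"yes", are placeholders):

library(caret)

set.seed(42)
idx       <- createDataPartition(dat$purchased, p = 0.7, list = FALSE)  # stratified on the outcome
train_set <- dat[idx, ]
valid_set <- dat[-idx, ]

ctrl <- trainControl(method = "cv", number = 5,
                     sampling = "smote",            # resample inside CV; needs a SMOTE backend installed
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(purchased ~ ., data = train_set,
             method = "rf",
             metric = "ROC",                        # more informative than accuracy here
             trControl = ctrl)

# Precision / recall / F1 on the held-out validation set
confusionMatrix(predict(fit, valid_set), valid_set$purchased, mode = "prec_recall")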
I am currently fitting curves to the sales volumes of 2 successive product generations over an observation period, looking at two different markets. Through some research, I have identified a nonlinear model that fits the data well in both markets. To estimate the parameters of the model, I am using R and the nls2 package.
I have obtained regression output for the two markets individually.
Now, I would like to test whether the estimated parameters are different from each other on the two markets. That is, I would like to test each model parameter from one market against the corresponding parameter estimate from the other market.
Is there any function in R or in the nls2 package, that will allow me to do that? Or is there a smarter method?
Thank you in advance!
You can compare the parameters using MANOVA.
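For comparison, another common way to test this in R is to fit a pooled model and a model with market-specific parameters on the combined data and compare them with an F test. A sketch only, using plain nls(), a logistic-style curve, and placeholder column names and starting values:

# dat contains sales, time and a 2-level factor market (all placeholders)
pooled <- nls(sales ~ M / (1 + exp(-b * (time - t0))),
              data  = dat,
              start = list(M = 1000, b = 0.5, t0 = 10))

# Index each parameter by market to get market-specific estimates
by_market <- nls(sales ~ M[market] / (1 + exp(-b[market] * (time - t0[market]))),
                 data  = dat,
                 start = list(M = c(1000, 1200), b = c(0.5, 0.5), t0 = c(10, 12)))

anova(pooled, by_market)   # tests whether market-specific parameters improve the fit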
I have been working with a survey of 10K customers who have been segmented into several customer segments. Due to the nature of the respondents who actually completed the survey, the researcher who did the qualitative work applied case weights (also known as probability weights) and supplied the data to me with each customer assigned one of 8 class labels. So we have a multi-class problem which, of course, is highly imbalanced.
One approach I have taken is to decompose the problem into pairwise (one-vs-one) models that all contribute to a final vote (a rough sketch of this setup is at the end of the post). Now my question is twofold:
I am using the wonderful SMOTE algorithm to balance each pairwise training set and address the class imbalance problem. However, as each customer record has an associated case weight, SMOTE treats every customer equally. After applying SMOTE the classes appear to be balanced, but once you take the respective case weights into account they actually aren't.
My second question relates to my strategy. Should I simply not worry about the case weights and build my classification model on the raw, unweighted data, even though it doesn't represent the total customer base that I want to classify into segments?
I have been using the R caret package to build these multiple binary classifiers.
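For reference, the pairwise setup I am describing looks roughly like this (object and column names are placeholders; seg is the 8-level segment factor and new_customers is the base to score):

library(caret)

pairs <- combn(levels(dat$seg), 2, simplify = FALSE)     # 28 class pairs

models <- lapply(pairs, function(p) {
  sub <- droplevels(dat[dat$seg %in% p, ])
  train(seg ~ ., data = sub,
        method = "gbm",
        trControl = trainControl(method = "cv", number = 5,
                                 sampling = "smote"),    # balances each pair, but ignores the case weights
        verbose = FALSE)
})

# Final segment by majority vote across the 28 pairwise classifiers
votes     <- sapply(models, function(m) as.character(predict(m, newdata = new_customers)))
predicted <- apply(votes, 1, function(v) names(which.max(table(v))))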
Regards