Balance an imbalanced data set containing case weights with SMOTE - R

I have been working on a survey of 10K customers who have been segmented into several customer segments. Because of the nature of the respondents who actually completed the survey, the researcher who did the qualitative work applied case weights (also known as probability weights) and supplied the data to me with every customer assigned one of 8 class labels. So we have a multi-class problem which, of course, is highly imbalanced.
One approach I have taken is to decompose these classes into pairwise models that all contribute to a final vote. Now my question is twofold:
I am using the wonderful SMOTE package to balance each model and address the class imbalance problem. However, because each customer record has an associated case weight, SMOTE treats every customer equally. After applying SMOTE the classes appear to be balanced, but once you take the respective case weights into account they actually are not.
My second question relates to my strategy. Should I simply not worry about my case weights and build my classification model on the raw unweighted data, even though it doesn't represent the total customer base that I want to classify into each segment?
I have been using the R caret package to build these multiple binary classifiers.
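For context, my per-model workflow looks roughly like this (a simplified sketch; the column names, SMOTE percentages, and tuning settings are illustrative, not my actual setup):

library(caret)
library(DMwR)   # provides the SMOTE() function

# One pairwise model, e.g. segment A vs segment B (illustrative names).
pair_df <- subset(survey_df, segment %in% c("A", "B"))
pair_df$segment <- factor(pair_df$segment)

# SMOTE balances the two classes, but it knows nothing about case_weight,
# so the weighted distribution is still imbalanced afterwards.
balanced <- SMOTE(segment ~ ., data = pair_df[, names(pair_df) != "case_weight"],
                  perc.over = 200, perc.under = 150)

fit <- train(segment ~ ., data = balanced, method = "gbm",
             trControl = trainControl(method = "cv", number = 5),
             verbose = FALSE)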
Regards

Related

Singularity in Linear Mixed Effects Models

Dataset Description: I use a dataset with neuropsychological (NP) tests from several subjects. Every subject has more than one test in his/her follow-up, i.e. one test per year. I study the cognitive decline in these subjects. The information that I have is: individual number (identity number), education (years), gender (M/F as factor), age (years), and time from baseline (= years after the first NP test).
AIM: My aim is to measure the rate of change in their NP tests, i.e. the cognitive decline per year for each of them. To do that I use linear mixed effects models (LMEMs), taking into account the above parameters, and I compute the slope for each subject.
Question: When I run the candidate models (combining different parameters every time), I also check them for singularity, and the result in almost all cases is TRUE. So my models are singular! If I wanted to use these models for prediction this would not be good, as it would mean the models overfit the data. But since I only want to find the slope for each individual, I think this is not a problem, or even an advantage, because in that case singularity offers a more precise calculation of the subjects' slopes. Do you think this reasoning is correct?
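For concreteness, the kind of model I am fitting looks roughly like this (a sketch using lme4; the variable names are placeholders for my actual columns):

library(lme4)

# Random intercept and random slope for time-from-baseline within each subject.
fit <- lmer(np_score ~ time + age + education + gender + (1 + time | id),
            data = np_data)

isSingular(fit)   # returns TRUE for almost all of my candidate models
coef(fit)$id      # per-subject intercepts and slopes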

How to implement cost-sensitive learning for logistic regression in R

I have a data set which is highly imbalanced: the majority-to-minority class ratio is 99:1. I would like to build a model which predicts the minority class accurately. In simple terms, I want to perform cost-sensitive learning in which the cost of a false negative is higher than the cost of a false positive.
But I didn't find any R package for logistic regression that does this.
Can anybody recommend a document or site with example R code for this? Thanks in advance.
For any algorithm that does not offer a cost option, you can just oversample the minority class. For example, if you want to weight them 5x then just oversample them by a factor of 5.
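A minimal sketch of that idea for a logistic regression, assuming a data frame df with a 0/1 outcome y (placeholder names):

# Oversample the minority class by a factor of 5 before fitting.
minority    <- df[df$y == 1, ]
oversampled <- rbind(df, minority[rep(seq_len(nrow(minority)), 4), ])  # 1 original + 4 copies = 5x
fit         <- glm(y ~ ., data = oversampled, family = binomial)

# glm() also accepts case weights directly, which gives the same point estimates:
w     <- ifelse(df$y == 1, 5, 1)
fit_w <- glm(y ~ ., data = df, family = binomial, weights = w)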
There is a lot of literature out there for how to deal with imbalanced data. General approaches include oversampling the minority class or undersampling the majority class. Additionally, you can get into more advanced techniques such as SMOTE, which will create synthetic observations based on your minority class.
In cases with high imbalance such as yours, I've found that oversampling the minority and undersampling the majority many times, so that you get multiple models you can average together, produces good results. (Basically, this is modified bagging.)
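A rough sketch of that ensemble idea (df with a 0/1 outcome y and new_data are placeholder names, and logistic regression stands in for whatever base learner you prefer):

n_models <- 25
preds <- replicate(n_models, {
  minority     <- df[df$y == 1, ]
  majority     <- df[df$y == 0, ]
  majority_sub <- majority[sample(nrow(majority), nrow(minority)), ]  # undersample the majority
  fit <- glm(y ~ ., data = rbind(minority, majority_sub), family = binomial)
  predict(fit, newdata = new_data, type = "response")
})
avg_pred <- rowMeans(preds)   # average the predicted probabilities across the models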

Build GBM classification model with customer post-stratification weights

I am attempting to produce a classification model based on qualitative survey work. About 10K of our customers were researched and, as a result, a segmentation model was built and each customer was subsequently categorised into 1 of 8 customer segments. The challenge is now to classify the TOTAL customer base into those segments. As only certain customers responded, the researcher used overall demographics to apply post-stratification weights (or frequency weights).
My task is now to use our customer data as explanatory variables for these 10K respondents in order to build a classification model for the whole base.
In order to handle the customer weights I simply duplicated each customer record by its respective frequency weight, and the data set exploded to about 72K rows. I then split this data into train and test sets and used the R caret package to train a GBM, classifying my hold-out test set with the final chosen model.
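The expansion step was essentially a one-liner along these lines (column names are illustrative):

# Repeat each respondent's row according to its (rounded) frequency weight.
expanded <- survey_df[rep(seq_len(nrow(survey_df)), round(survey_df$freq_weight)), ]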
I was getting 82% accuracy and thought the results were too good to be true. After thinking about it, I believe the issue is that the model is inadvertently seeing records in train that are exactly the same as records in test (some records may be duplicated up to 10 times).
I know that the glm function allows you to supply a vector of weights via its weights parameter, but my question is how to make use of such weights with other machine learning algorithms, such as GBM or random forests, in R?
Thanks
You can use case weights with gbm and train. In general, the list of models in caret that can use case weights is here.
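For example, something along these lines should work (a sketch; the data frame and column names are placeholders):

library(caret)

# train() accepts a weights argument and forwards it to gbm as case weights.
fit <- train(segment ~ ., data = survey_df,
             method = "gbm",
             weights = survey_df$freq_weight,
             trControl = trainControl(method = "cv", number = 5),
             verbose = FALSE)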

SMOTE fails when oversampling

I have just oversampled my dataset using SMOTE, from the DMwR package.
My dataset contains two classes. The original distribution is 12 vs 62, so I coded this oversampling:
newData <- SMOTE(Score ~ ., data, k = 3, perc.over = 400, perc.under = 150)
Now, the distribution is 60 vs 72.
However, when I display the 'newData' dataset I discover how SMOTE has made an oversampling and there are some samples repeated.
For example, the sample number 24 appears as 24.1, 24.2 and 24.3.
Is this correct? It directly affects classification, because the classifier will learn a model from data that will also be present in the test set, so this is not legitimate for classification.
Edit:
I think I didn't explain my issue correctly:
As you know, SMOTE is an oversampling technique. It creates new samples from the original ones by modifying their feature values. However, when I display my new data generated by SMOTE, I obtain this:
(these values are the values of the features) Sample50: 1.8787547 0.19847987 -0.0105946940 4.420207 4.660536 1.0936388 0.5312777 0.07171645 0.008043167
Sample 50.1: 1.8787547 0.19847987 -0.0105946940 4.420207 4.660536 1.0936388 0.5312777 0.07171645
Sample 50 belongs to the original dataset. Sample 50.1 is the 'artificial' sample generated by SMOTE. However (and this is my issue), SMOTE has created a repeated sample instead of creating an artificial one by modifying the feature values 'a bit'.
I hope you can understand me.
Thanks!
SMOTE is an algorithm that generates synthetic examples of a given class (the minority class) to handle imbalanced distributions. This strategy for generating new data is then combined with random under-sampling of the majority class. When you use SMOTE in the DMwR package you need to specify an over-sampling percentage and also an under-sampling percentage. These values must be set carefully, because otherwise the obtained distribution of the data may remain imbalanced.
In your case, given the under- and over-sampling percentages you set, SMOTE will introduce replicas of some of your examples.
Your initial class distribution is 12 to 62, and after applying SMOTE you end up with 60 to 72. This means that the minority class was oversampled with SMOTE and new synthetic examples of this class were produced.
However, your majority class, which had 62 examples, now contains 72! The under-sampling percentage was applied to this class, but it actually increased the number of examples. Since the number of examples to select from the majority class is determined relative to the number of examples generated for the minority class, the number of examples sampled from this class was larger than the number already existing.
Therefore, you had 62 examples and the algorithm tried to randomly select 72! This means that some replicas of the examples of the majority class were introduced.
So, to explain the over-sampling and under-sampling you selected:
12 examples in the minority class with 400% oversampling gives 12*400/100 = 48. So 48 new synthetic examples were added to the minority class (12 + 48 = 60, the final number of examples in the minority class).
The number of examples to select from the majority class is 48*150/100 = 72.
But the majority class only has 62, so replicas are necessarily introduced.
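In code, the arithmetic is simply:

n_min     <- 12
n_new     <- n_min * 400 / 100   # perc.over = 400 -> 48 synthetic minority examples
n_min_out <- n_min + n_new       # 60 minority examples in the output
n_maj_out <- n_new * 150 / 100   # perc.under = 150 -> 72 draws from only 62 majority examples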
I'm not sure about the implementation of SMOTE in DMwR, but it should be safe for you to round the new data to the nearest integer value. One guess is that this is left for you to do on the off chance that you want to do regression instead of classification. Otherwise, if you wanted regression and SMOTE returned integers, you would have unintentionally lost information by going in the opposite direction (SMOTE -> integers -> reals).
If you are not familiar with what SMOTE does: it creates 'new data' by looking at nearest neighbors to establish a neighborhood and then sampling from within that neighborhood. It is usually done when there is insufficient data for a given class in a classification problem. It operates on the assumption that data near your data is similar because of proximity.
Alternately you can use Weka's implementation of SMOTE which does not make you do this additional work.
SMOTE is a very simple algorithm for generating synthetic samples. However, before you go ahead and start to use it, you have to understand your features, for example whether each of them should vary on the same scale, and so on.
Simply put, before you do, try to understand your data...!!!

Regressions with many nested categorical covariates

I have a few hundred thousand measurements where the dependent variable is a probability, and would like to use logistic regression. However, the covariates I have are all categorical, and worse, are all nested. By this I mean that if a certain measurement has "city - Phoenix" then obviously it is certain to have "state - Arizona" and "country - U.S." I have four such factors - the most granular has some 20k levels, but if need be I could do without that one, I think. I also have a few non-nested categorical covariates (only four or so, with maybe three different levels each).

What I am most interested in is prediction - given a new observation in some city, I would like to know the relevant probability/dependent variable. I am not interested as much in the related inferential machinery - standard deviations, etc - at least as of now. I am hoping I can afford to be sloppy. However, I would love to have that information unless it requires methods that are more computationally expensive.

Does anyone have any advice on how to attack this? I have looked into mixed effects, but am not sure it is what I am looking for.
I think this is more of a model design question than a question about R specifically; as such, I'd like to address the context of the question first and then the appropriate R packages.
If your dependent variable is a probability, e.g. $y\in[0,1]$, a logistic regression is not appropriate for the data, particularly given that you are interested in predicting probabilities out of sample. The logistic regression models the contribution of the independent variables to the probability that your dependent variable flips from a zero to a one, and since your variable is continuous and truncated you need a different specification.
I think your latter intuition about mixed effects is a good one. Since your observations are nested, i.e., US <-> AZ <-> Phoenix, a multi-level model, or in this case a hierarchical linear model, may be the best specification for your data. The best R packages for this type of modeling are multilevel and nlme, and there is an excellent introduction to both multi-level models in R and nlme available here. You may be particularly interested in the discussion of data manipulation for multi-level modeling, which begins on page 26.
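As a sketch of what that nested specification might look like with nlme (the variable names are placeholders, and this ignores the bounded-response issue raised above):

library(nlme)

# Nested random intercepts: city within state within country.
fit <- lme(fixed  = y ~ x1 + x2,
           random = ~ 1 | country/state/city,
           data   = df)
summary(fit)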
I would suggest looking into penalised regressions like the elastic net. The elastic net is used in text mining, where each column represents the presence or absence of a single word and there may be hundreds of thousands of variables, an analogous problem to yours. A good place to start in R is the glmnet package and its accompanying JSS paper: http://www.jstatsoft.org/v33/i01/.
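A minimal sketch of that approach (placeholder names; logit-transforming the [0,1] response and using a Gaussian loss is just one simple choice, not the only one):

library(glmnet)
library(Matrix)

# Sparse one-hot encoding of the high-cardinality categorical covariates.
X <- sparse.model.matrix(~ city + state + country + other_factor - 1, data = df)
y <- qlogis(pmin(pmax(df$prob, 1e-6), 1 - 1e-6))   # clamp away from 0/1 before the logit

cv_fit <- cv.glmnet(X, y, alpha = 0.5)             # alpha = 0.5 gives an elastic net penalty
pred   <- predict(cv_fit, newx = X, s = "lambda.min")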
