Partition data while preserving groups with caret - r

Apologies for the cross-stack post; I wasn't sure whether this was more appropriate for stackoverflow or for crossvalidated. I initially posted on the latter, but realized this might be the better place.
So, I have a dataset with many rows of individuals, each with a unique individual ID.
For each individual, there is also a column indicating which household that person belongs to, given by a unique householdID.
Finally, each row has a Target variable, which is what I will be trying to predict, and of course there are columns with various features.
My question is--as household membership is important--is there a way to partition the data into train and test sets where all the people belonging to the same household are kept together and not randomly split over both sets (i.e., any given householdID should not appear in both sets)? And, at the same time, is it possible to split the households over train and test while keeping the Target variable balanced?
So, using the createDataPartition function in caret, I've managed to get a balanced Target value in both train and test when I set y = Target, and I've managed to separate the households cleanly across train and test when I set y = unique(householdID), but I can't figure out whether there's a way to get both of these results at the same time.
I'm pretty flat out of ideas, so any suggestions would be most welcome!
Thanks!

groupKFold is the way to go. But instead of using data$Target you need to split on data$householdID (or whatever your household ID column is named). This makes sure that all members of a group end up in the same fold.
After this you can use the folds in trainControl to model on data$Target.
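For example, a minimal sketch, assuming your data frame is called data, the grouping and outcome columns are named householdID and Target, you want 5 folds, and rpart is just a placeholder method (you would also exclude the individual ID column from the formula in the same way):

library(caret)
set.seed(42)
# groupKFold returns a list of training-row indices; every row of a given
# householdID lands in exactly one fold, so households are never split
folds <- groupKFold(data$householdID, k = 5)
ctrl <- trainControl(method = "cv", index = folds)
# model the Target as usual; the grouping is handled entirely by the folds
fit <- train(Target ~ . - householdID, data = data, method = "rpart", trControl = ctrl)

Note that groupKFold only keeps households together; it does not stratify on Target, so the Target balance in each fold is usually close but not guaranteed.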

Related

How can I split two groups as equally as possible in R?

I'm trying to do a basic A/B test, where I have to split my observations into two groups that are as equal as possible. The observations are based on session data from Google Analytics for different cities.
Therefore I need to create a random 50/50 split between the cities, but the split has to make the two groups as equal as possible in terms of sessions. What would be my best option?
I've tried a genetic algorithm approach to the problem, but with only one variable there must be an easier and faster way, right?
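One simpler alternative to a genetic algorithm (not from the original thread, just a sketch with an assumed data frame cities that has a numeric sessions column) is a greedy partition heuristic: sort the cities by sessions in descending order and always assign the next city to the group with the smaller running total.

cities <- cities[order(-cities$sessions), ]   # largest cities first
group <- integer(nrow(cities))
totals <- c(0, 0)
for (i in seq_len(nrow(cities))) {
  g <- which.min(totals)                      # add to the lighter group
  group[i] <- g
  totals[g] <- totals[g] + cities$sessions[i]
}
cities$group <- group
totals                                        # the two group totals should be close

This is deterministic rather than random, so it trades the randomization of a true A/B split for balance.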

How should I approach this multi-classification problem?

I am trying to think of a potential way to predict an ID value given text data. The data is broken down by:
Group: a 4-digit number under which a group of IDs exists
ID: a 13-digit number that is the group number plus a unique value
Text: words coming from documents.
Goal: predict an ID number given only the text from a document.
The data that I have has about 1200 different IDs while there are only 140 different groups. The document term matrix is about 186 columns wide with about 20,000 rows. I have a lot more data I could include. I had created a simple neural net to predict the Group number with 70% accuracy. My idea is to use this model first to predict the group number and then build separate models for each group to narrow the amount of IDs in the prediction. A final model would be trained and would be used to predict the ID. Below is a drawing of what I had in mind. Is this similar to stacking in ensemble learning? I am relatively new to machine learning and I am trying to think of different ways to approach this problem.
Am I on the right path or is there a better way of doing this? Any advice is greatly appreciated.
A lot depends on how well you think you can infer the group_number and unique_value from the text. Does the unique_value depend at all on the group_number? If so, then you will likely want to predict the group_number first and use that in the prediction of the unique_value - as you have suggested doing for each unique group number. You will also have to consider how much data you have for each group and whether it is enough to train the respective models. Give it a shot, and if it doesn't work, try a single neural network where you enter the text and the group number you've already predicted!
Good Luck!
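If it helps, here is a rough sketch of the two-stage idea (not from the answer; dtm, labels, and new_dtm are placeholder names for the document-term matrix and label columns, and multinomial models stand in for whatever classifiers you actually use):

library(nnet)
# one training frame: text features plus the two label columns
train_df <- data.frame(as.matrix(dtm),
                       group = factor(labels$group_number),
                       id = factor(labels$id))
# stage 1: predict the group from the text features
grp_fit <- multinom(group ~ . - id, data = train_df, MaxNWts = 1e5, trace = FALSE)
# stage 2: one ID model per group, each choosing only among that group's IDs
id_fits <- lapply(split(train_df, train_df$group), function(d) {
  d$id <- droplevels(d$id)
  multinom(id ~ . - group, data = d, MaxNWts = 1e5, trace = FALSE)
})
# prediction: route each new document through its predicted group's ID model
new_df <- data.frame(as.matrix(new_dtm))
pred_grp <- predict(grp_fit, newdata = new_df)
pred_id <- vapply(seq_len(nrow(new_df)), function(i) {
  as.character(predict(id_fits[[as.character(pred_grp[i])]],
                       newdata = new_df[i, , drop = FALSE]))
}, character(1))

Structurally this is closer to a hierarchical (two-level) classifier than to stacking, since the second stage narrows the label space rather than combining several models' predictions.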

how do duplicated rows affect a decision tree?

I am using rpart to build a decision tree for a categorical variable, and I am wondering whether I should use the full data set or just the set of unique rows.
I am answering this as a general question on decision trees, rather than on the R implementation.
The parameters for decision trees are often based on record counts -- minimum leaf size and minimum split search size come to mind. In addition, purity measures are affected by the size of nodes as the tree is being built. When you have duplicated records, then you are implicitly putting a weight on the values in those rows.
This is neither good nor bad. You simply need to understand the data and the model that you want to build. If the duplicated values arise from different runs of an experiment, then they should be fine.
In some cases, duplicates (or equivalently weights) can be quite bad. For instance, if you are oversampling the data to get a balanced sample on the target, then the additional rows would be problematic. A single leaf might end up consisting of a single instance from the original data -- and overfitting would be a problem.
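To make the duplicates-as-weights point concrete, here is a small illustrative sketch on synthetic data (not from the answer): fitting rpart on data with duplicated rows behaves essentially like fitting it once with integer case weights on those rows.

library(rpart)
set.seed(1)
df <- data.frame(x = rnorm(100),
                 y = factor(sample(c("a", "b"), 100, replace = TRUE)))
# duplicate the first 10 rows ...
dup_fit <- rpart(y ~ x, data = rbind(df, df[1:10, ]), method = "class")
# ... which acts like giving those rows a case weight of 2
wt_fit <- rpart(y ~ x, data = df,
                weights = c(rep(2, 10), rep(1, 90)), method = "class")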
In some ways this would depend on the data itself. Are the duplicated rows valid data? Or are they only partly duplicated but still important?
If the data were temperature measurements in a town at a given hour, duplicated temperatures might be important: they weight that value toward being the more accurate temperature compared with a lone measurement that differs.
If the data were temperature measurements that three people recorded off the same thermometer at the same time, then you would want to remove the noise from the data by reducing it to just the unique values.
The answer could very well be a combination of the above. If you had multiple readings that conflicted in the same time period, you would choose the most heavily weighted one and then decide how to break ties; if all the measurements were the same, you would remove the duplicates. In this way you are cleaning the data before you put it through an algorithm.
It all comes down to what is relevant in the data model and whether duplicated rows are of relevance to the result.

Cluster Analysis using R for large data sample

I am just starting out with segmenting, in R, a customer database I have for an ecommerce retail business. I am seeking some guidance on the best approach for this exercise.
I have searched the topics already posted here and tried the suggestions myself, such as dist() and hclust(). However, I keep running into one issue or another and cannot get past them, since I am new to R.
Here is the brief description of my problem.
I have approximately 480K records of customers who have bought so far. The data contains the following columns:
email id
gender
city
total transactions so far
average basket value
average basket size (number of items purchased in one transaction)
average discount claimed per transaction
number of days since the user's first purchase
average duration between two purchases
number of days since the last transaction
The business goal of this exercise is to identify the most profitable segments and encourage repeat purchases in those segments using campaigns. Can I please get some guidance on how to do this successfully without running into problems with the sample size or the column data types?
Read this to learn how to subset data frames. When you try to define d, it looks like you're providing way too much data, which might be fixed by subsetting your table first. If not, you might want to take a random sample of your data instead of all of it. Suppose you know that columns 4 through 10 of your data frame, called cust_data, contain numerical data; then you might try this:
cust_data2 <- cust_data[, 4:10]
d <- dist(cust_data2)
For large values you may want to log-transform them--just experiment and see what makes sense. I'm really not sure about this; it's just a suggestion. Choosing a more appropriate clustering method or distance metric might also be better.
Finally, when you run hclust, you need to pass in the d matrix, and not the original data set.
h <- hclust(d, method = "average")
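Pulling the pieces above together with the random-sample suggestion (a sketch only; the sample size of 5000 and k = 5 are placeholders you would tune):

set.seed(123)
# dist() on ~480K rows needs on the order of n^2/2 distances, which will not
# fit in memory, so cluster a random sample of rows instead
samp <- cust_data[sample(nrow(cust_data), 5000), 4:10]
samp_scaled <- scale(samp)            # put the numeric columns on comparable scales
d <- dist(samp_scaled)
h <- hclust(d, method = "average")
plot(h)                               # inspect the dendrogram to choose a cut
clusters <- cutree(h, k = 5)
table(clusters)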
Sadly your data does not contain any attributes that indicate what types of items/transactions did NOT result in a sale.
I am not sure if clustering is the way to go here.
Here are some ideas:
First split your data into a training set (say 70%) and a test set.
Set up a simple linear regression model with, say, "average basket value" as the response variable and all other variables as independent variables.
fit <- lm(averagebasketvalue ~ ., data = custdata)
Run the model on the training set, determine significant attributes (those with at least one star in the summary(fit) output), then focus on those variables.
Check how the model generalizes by calculating R-squared and the sum of squared errors (SSE) on the test set. You can use the predict() function; the calls will look like
fitpred <- predict(fit, newdata = testset)
sse <- sum((testset$averagebasketvalue - fitpred)^2)
sst <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
1 - sse / sst  # out-of-sample R-squared (summary(fitpred) alone will not give you this)
Maybe "city" contains too many unique values to be meaningful. Try to generalize them by introducing a new attribute CityClass (e.g. BigCity-MediumCity-SmallCity ... or whatever classification scheme is useful for your cities). You might also condition the model on "gender". Drop "email id".
This can go on for a while... play with the model to try to get better R-squared and SSEs.
I think a tree-based model (rpart) might also work well here.
Then you might change to cluster analysis at a later time.

recombining data frames in R without using row.names

I start with a data.frame (or a data_frame) containing my dependent Y variable for analysis, my independent X variables, and some "Z" variables -- extra columns that I don't need for my modeling exercise.
What I would like to do is:
Create an analysis data set without the Z variables;
Break this data set into random training and test sets;
Find my best model;
Predict on both the training and test sets using this model;
Recombine the training and test sets by rows; and finally
Recombine these data with the Z variables, by column.
It's the last step, of course, that presents the problem -- how do I make sure that the rows in the recombined training and test sets match the rows in the original data set? We might try to use the row.names variable from the original set, but I agree with Hadley that this is an error-prone kludge (my words, not his) -- why have a special column that's treated differently from all other data columns?
One alternative is to create an ID column that uniquely identifies each row, and then keep this column around when dividing into the train and test sets (but excluding it from all modeling formulas, of course). This seems clumsy as well, and would make all my formulas harder to read.
This must be a solved problem -- could people tell me how they deal with this? Especially using the plyr/dplyr/tidyr package framework?
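One way to write down the ID-column alternative described above with dplyr (a sketch, not an authoritative answer; Z1 and Z2 stand for the extra columns and the 70/30 split is arbitrary):

library(dplyr)
dat <- dat %>% mutate(.row_id = row_number())          # stable key instead of row.names
analysis <- dat %>% select(-Z1, -Z2)                   # 1. drop the Z variables
train <- analysis %>% sample_frac(0.7)                 # 2. random training set ...
test <- anti_join(analysis, train, by = ".row_id")     #    ... and the remaining test set
# 3-4. fit the model on train (leaving .row_id out of the formula) and predict on both sets
recombined <- bind_rows(train, test) %>%               # 5. stack the rows back together
  left_join(dat %>% select(.row_id, Z1, Z2), by = ".row_id") %>%  # 6. reattach the Z columns
  arrange(.row_id) %>%
  select(-.row_id)

Keeping the key in a throwaway column like .row_id (rather than in row.names) means it survives dplyr verbs, and dropping it at the end keeps the modeling formulas and the final data frame clean.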
