Cluster Analysis using R for large data sample - r

I am just starting out with segmenting a customer database for an ecommerce retail business using R, and I am looking for guidance on the best approach for this exercise.
I have searched the topics already posted here and tried out the suggestions myself, such as dist() and hclust(). However, I keep running into one issue or another and cannot get past them, since I am new to R.
Here is a brief description of my problem.
I have approximately 480K records of customers who have bought so far. The data contains the following columns:
email id
gender
city
total transactions so far
average basket value
average basket size (number of items purchased in one transaction)
average discount claimed per transaction
number of days since the user's first purchase
average duration between two purchases
number of days since the last transaction
The business goal of this exercise is to identify the most profitable segments and encourage repeat purchases in those segments using campaigns. Can I please get some guidance as to how to do this successfully without running into problems like the size of the sample or the data type of columns?

Read up on how to subset data frames. When you try to define d, it looks like you're providing way too much data: dist() stores n(n-1)/2 pairwise distances, which for 480K rows is roughly 1.15 × 10^11 values and will not fit in memory. This might be fixed by subsetting your table first. If not, you might want to take a random sample of your data instead of all of it. Suppose you know that columns 4 through 10 of your data frame, called cust_data, contain numerical data; then you might try this:
cust_data2 <- cust_data[, 4:10]
d <- dist(cust_data2)
For large values, you may want to log transform them--just experiment and see what makes sense. I really am not sure about this, and that's just a suggestion. Maybe choosing a more appropriate clustering or distance metric would be better.
Finally, when you run hclust, you need to pass in the d matrix, and not the original data set.
h <- hclust(d, "ave")
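To make the sampling suggestion concrete, here is a minimal sketch. It assumes made-up data (the data frame cust_num and its three columns are hypothetical stand-ins for the numeric columns of cust_data), an arbitrary sample size of 2000 and an arbitrary choice of 4 segments, so treat it as a starting point rather than a recipe.

```r
# Synthetic stand-in for the numeric columns of cust_data (hypothetical values).
set.seed(42)
n <- 20000
cust_num <- data.frame(
  total_transactions = rpois(n, 3) + 1,
  avg_basket_value   = rlnorm(n, meanlog = 4, sdlog = 0.6),
  days_since_last    = sample(1:720, n, replace = TRUE)
)

# dist() is quadratic in the number of rows, so cluster a random sample.
idx <- sample(nrow(cust_num), 2000)
smp <- cust_num[idx, ]

# Log-transform the skewed column, then standardize so no column dominates.
smp$avg_basket_value <- log(smp$avg_basket_value)
smp_scaled <- scale(smp)

d <- dist(smp_scaled)                # Euclidean distances on the sample only
h <- hclust(d, method = "average")   # average linkage, as in the call above
segments <- cutree(h, k = 4)         # k = 4 is an arbitrary illustration

# Segment sizes and per-segment means on the original scale.
table(segments)
aggregate(cust_num[idx, ], by = list(segment = segments), FUN = mean)
```

Standardizing before dist() matters here because the columns are on very different scales; without it the column with the largest spread would drive the distances almost alone.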

Sadly your data does not contain any attributes that indicate what types of items/transactions did NOT result in a sale.
I am not sure if clustering is the way to go here.
Here are some ideas:
First split your data into a training set (say 70%) and a test set.
Set up a simple linear regression model with, say, "average basket value" as the response variable and all other variables as independent variables.
fit <- lm(averagebasketvalue ~ ., data = trainset)
Fit the model on the training set, determine the significant attributes (those with at least one star in the summary(fit) output), then focus on those variables.
Check the fitted model on the test set by calculating the sum of squared errors (SSE) and R-squared there. Note that summary() of the predictions only gives their quartiles, so compute both quantities from the predictions yourself. You can use the predict() function; the calls will look like
fitpred <- predict(fit, newdata = testset)
sse <- sum((testset$averagebasketvalue - fitpred)^2) # SSE on the test set
sst <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
1 - sse / sst # R² on the test set
Maybe "city" contains too many unique values to be meaningful. Try to generalize them by introducing a new attribute CityClass (e.g. BigCity-MediumCity-SmallCity ... or whatever classification scheme is useful for your cities). You might also condition the model on "gender". Drop "email id".
This can go on for a while... play with the model to try to get better R-squared and SSEs.
I think a tree-based model (rpart) might also work well here.
Then you might change to cluster analysis at a later time.
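The split, fit and evaluate steps suggested above can be sketched end to end as follows. The data here is simulated, and the column names other than averagebasketvalue (as well as trainset, testset and the coefficients used to generate the response) are assumptions made up for the illustration.

```r
# Simulated customer table; all columns and effect sizes are hypothetical.
set.seed(1)
n <- 1000
custdata <- data.frame(
  gender        = factor(sample(c("F", "M"), n, replace = TRUE)),
  transactions  = rpois(n, 4) + 1,
  avgbasketsize = rpois(n, 3) + 1,
  avgdiscount   = runif(n, 0, 20)
)
custdata$averagebasketvalue <- 15 * custdata$avgbasketsize -
  0.5 * custdata$avgdiscount + rnorm(n, sd = 5)

# 70% training set, 30% test set.
train_idx <- sample(n, size = round(0.7 * n))
trainset <- custdata[train_idx, ]
testset  <- custdata[-train_idx, ]

# Fit on the training rows only and inspect the significance stars.
fit <- lm(averagebasketvalue ~ ., data = trainset)
summary(fit)

# Evaluate on the held-out rows.
fitpred <- predict(fit, newdata = testset)
sse <- sum((testset$averagebasketvalue - fitpred)^2)
sst <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
r2_test <- 1 - sse / sst
c(SSE = sse, R2 = r2_test)
```

Comparing r2_test with the R-squared reported by summary(fit) gives a quick check for overfitting: a large gap means the model does not generalize beyond the training rows.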

Related

R caret training - but each sample has three separate measurements and I want to use majority vote to predict

I have a very specific dataset with 50 people. Each person has a response (sex) and ~2000 measurements of some biological stuff.
We have three independent replicates from each person, so 3 rows per person.
I can easily use caret and groupKFold() to keep each person in either training or test sets - that works fine.
Then I simply predict each replicate separately (so 3 predictions per person).
I want to use these three predictions together and make a combined prediction per person, using majority vote and/or some other scheme.
That is, for each person I get the 3 predictions and predict the response to be the one with the most votes. That's pretty easy to do for the final model, but it should also be used in the tuning step (i.e. in the cross-validation that picks the parameter values).
I think I can do that in the summaryFunction=... when calling caret::trainControl() but I would simply like to ask:
Is there a simpler way of doing this?
I have googled around, but I keep failing to find people with similar problems. I really hope someone can point me in the right direction.
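For reference, the per-person majority vote described in the question can be sketched in base R like this. The data frame preds and its columns person, pred and obs are hypothetical, and the sketch covers only the aggregation step, not how to wire it into caret's tuning loop.

```r
# Hypothetical replicate-level predictions: 4 people, 3 replicates each.
preds <- data.frame(
  person = rep(c("p1", "p2", "p3", "p4"), each = 3),
  pred   = c("F", "F", "M",  "M", "M", "M",  "F", "M", "M",  "F", "F", "F"),
  obs    = rep(c("F", "M", "M", "M"), each = 3),
  stringsAsFactors = FALSE
)

# Most frequent label among a person's replicate predictions.
majority <- function(x) names(which.max(table(x)))

# One combined prediction and one true label per person.
vote  <- tapply(preds$pred, preds$person, majority)
truth <- tapply(preds$obs, preds$person, function(x) x[1])

# Accuracy at the person level rather than the replicate level.
person_accuracy <- mean(vote == truth[names(vote)])
person_accuracy
```

With three replicates and two classes a tie cannot occur; with an even number of replicates or more classes, which.max() silently picks the first label among the tied ones, so a tie-breaking rule would need to be chosen explicitly.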

Partition data while preserving groups with caret

Apologies for the cross-stack post; I wasn't sure whether this is more appropriate for Stack Overflow or for Cross Validated. I initially posted on the latter, but realized this might be the more appropriate place.
So, I have a dataset with many rows of individuals, each with a unique individual ID.
For each individual, there is also a column indicating which household that person belongs to, a unique householdID.
Finally, there is a Target variable, for each row, which is what I will be trying to make predictions on. Of course, there are columns with different features.
My question is: as membership of a household is important, is there a way to partition the data into train and test sets where all the people belonging to the same household are kept together and not randomly split over both sets (i.e., any given householdID number should not appear in both sets)? And is it also possible to split the households over both train and test sets while keeping a balanced Target variable?
So, using the createDataPartition function in caret, I've managed to have a balanced Target value in both train and test when I set y = Target, and I've managed to separate the households cleanly over both train and test when I set y = unique(householdID), but I can't figure out if there's a way to get both of these results at the same time.
I'm pretty flat out of ideas, so any suggestions would be most welcome!
Thanks!
groupKFold is the way to go. But instead of using data$Target you need to split on data$householdID (or whatever your household ID column is named). This will make sure that all members of a group will be in the same fold.
After this you can use the folds in trainControl to model on data$Target.
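To show what splitting on the household ID means in practice, here is a base-R sketch of the same idea. It is not caret's implementation: the data is simulated, and dat, fold_of_household and train_index are names made up for the example.

```r
# Simulated data: 40 households with 1 to 4 members each (hypothetical).
set.seed(7)
dat <- data.frame(
  householdID = rep(1:40, times = sample(1:4, 40, replace = TRUE))
)
dat$Target <- rbinom(nrow(dat), 1, 0.4)

# Assign every household (not every row) to one of k folds.
k <- 5
households <- unique(dat$householdID)
fold_of_household <- sample(rep(seq_len(k), length.out = length(households)))
names(fold_of_household) <- households
dat$fold <- unname(fold_of_household[as.character(dat$householdID)])

# Training-row indices per fold: all rows whose household is NOT held out.
train_index <- lapply(seq_len(k), function(f) which(dat$fold != f))
names(train_index) <- paste0("Fold", seq_len(k))

# Check the Target balance per fold; grouping alone does not guarantee it.
tapply(dat$Target, dat$fold, mean)
```

A list of training-row indices like train_index is the shape that the index argument of trainControl accepts, which is also what groupKFold returns. The last line is worth looking at because keeping households together and balancing Target are separate goals; if the per-fold means differ a lot, re-drawing the fold assignment with another seed is a simple workaround.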

Predicting WHEN an event is going to occur

I am very new to Machine learning and r, so my question might seem unclear or would need more information. I have tried to explain as much as possible. Please correct me if I have used wrong terminologies or phrases. Any help on this will be greatly appreciated.
Context - I am trying to build a model to predict "when" an event is going to happen.
I have a dataset which has the below structure. This is not the actual data; it is dummy data created to explain the scenario. The actual data cannot be shared due to confidentiality.
About data -
A customer buys a subscription under which he is allowed to use $x worth of the service provided.
A customer can have multiple subscriptions. Subscriptions can overlap in time or follow one another.
Each subscription has a usage limit of $x.
Each subscription has a start date and an end date.
A subscription will no longer be used after its end date.
Each customer has his own behavior/pattern in using the service. This is described by other derived variables: monthly utilization, average monthly utilization, etc.
A customer can use the service above $x. This is indicated by the column "ExceedanceMonth" in the table above. A value of 1 says the customer went above $x in the first month of the subscription; a value of 5 says the customer went above $x in the 5th month of the subscription. A value of NULL indicates that the limit $x has not been reached yet. This could be either because
the subscription ended and the customer didn't overuse
or
the subscription is yet to end and the customer might overuse in future.
The second scenario is what I want to predict: among the subscriptions which are yet to end and where the customer hasn't overused, WHEN will the limit be reached? That is, predict the ExceedanceMonth column in the above table.
Before reaching this model: I have a classification model built using a decision tree, which predicts whether the customer is going to cross the limit amount or not, i.e. whether LimitReached = 1 or 0 in the next 2 months. I am not sure whether I should train the model discussed here (predicting time to event) on all the data and test/use it on customers/subscriptions with LimitReached = 1, or train it only on the customers/subscriptions which have LimitReached = 1.
I have researched survival models. I understand that a survival model like Cox can be used to understand the hazard function and how each variable affects the time to event. I tried to use the predict function with Cox, but I did not understand whether any of the values passed to the "type" parameter can be used to predict the actual time, i.e. how I can predict the actual value for "WHEN" the limit will be crossed.
Maybe a survival model isn't the right approach for this scenario. So please advise me on the best way to approach this problem.
library(survival)
# define survival object
recsurv <- Surv(time = df$ExceedanceMonth, event = df$LimitReached)
# only for testing the code
train <- subset(df, SubStartDate >= "20150301" & SubEndDate <= "20180401")
test <- subset(df, SubStartDate > "20180401") # only for testing the code
# bare column names, so the model is fitted on train and predict() can use newdata
fit <- coxph(Surv(ExceedanceMonth, LimitReached) ~ SubDurationInMonths + `#subs` + LimitAmount + Monthlyutitlization + AvgMonthlyUtilization, data = train, model = TRUE)
predicted <- predict(fit, newdata = test)
head(predicted)
1 2 3 4 5 6
0.75347328 0.23516619 -0.05535162 -0.03759123 -0.65658488 -0.54233043
Thank you in advance!
Survival models are fine for what you're trying to do. (I'm assuming you've estimated the model correctly from this point on.)
The key is understanding what comes out of the model. For a Cox model, the default quantity out of predict() is the linear predictor (b1x1 + b2x2 + ...; the Cox model doesn't estimate an intercept b0). That alone won't tell you anything about when.
Specifying type="expected" for predict() gives the expected number of events for each subject, given the covariates and the follow-up time (how long you watch the customer). That is not a duration, so it does not answer "when" directly either. To get a time, estimate each customer's survival curve with survfit(fit, newdata = ...) and read off a summary such as the median time until the customer reaches his/her limit.
The coxed package gives you expected durations directly, calculated using a different method, without the need to worry about follow-up time. It's also a little more forgiving when it comes to inputting a newdata argument, particularly if you have a specific covariate profile in mind. See the package vignette.
There is also a separate thread with more on predict.coxph().
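One way to turn a fitted Cox model into a time estimate is to compute a survival curve per new subject with survfit() and take the first time at which the curve drops to 0.5 or below (the median). The sketch below uses the lung dataset shipped with the survival package as a stand-in, since the subscription data is not available; new_cases and median_time are names made up for the example.

```r
library(survival)

# Stand-in data: the lung dataset from the survival package.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Two hypothetical covariate profiles to predict for.
new_cases <- data.frame(age = c(70, 50), sex = c(1, 2))

# One predicted survival curve per row of new_cases.
sf <- survfit(fit, newdata = new_cases)

# Median time to event: first time the curve is at or below 0.5.
# NA means the curve never gets that low within the observed follow-up.
median_time <- apply(sf$surv, 2, function(s) {
  below <- which(s <= 0.5)
  if (length(below) == 0) NA_real_ else sf$time[min(below)]
})
median_time
```

For subscriptions that are already running, the same curve can be conditioned on the months survived so far by dividing it by its value at the current month, but that is a modelling decision and not something the sketch does.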

Z score normalizing r dataframe consecutively

I would like to normalize an R data.frame by computing the z-score using the function scale().
However, I am not sure whether this approach is subject to "look-ahead bias", which is a finance term for making up features that would not have been known or available during the period being analyzed.
These are stock returns, and I want to use this data for a "backtest" (a finance term for validation). I want to make sure that each period's z-score is only using data available up to that point and not the entire series mean and std when computing the z-score.
Does anyone know how to perform the calculation for this? Or is there a different approach?
You can normalize data or create new features using normalization without introducing "look-ahead" bias. It's very common.
You just must not use any data that would not have been available in the period being analyzed.
Much like with target encoding or other feature engineering techniques, you simply compute those features on a training subset of your historical data, then validate them on a validation split. You may also consider KFold cross-validation.
If you'd like to augment your question with a reproducible example I can show you.
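In the meantime, here is a minimal sketch of one way to do it: an expanding-window z-score, where each period is standardized with the mean and standard deviation of the data up to and including that period. The function name expanding_z, the min_obs parameter and the simulated returns are assumptions made for the illustration.

```r
# Expanding-window z-score: period t only uses observations 1..t.
# min_obs is a hypothetical guard; sd() needs at least 2 observations.
expanding_z <- function(x, min_obs = 2) {
  sapply(seq_along(x), function(t) {
    if (t < min_obs) return(NA_real_)
    window <- x[1:t]
    (x[t] - mean(window)) / sd(window)
  })
}

# Simulated daily returns for two hypothetical stocks.
set.seed(3)
returns <- data.frame(stock_a = rnorm(250, 0, 0.01),
                      stock_b = rnorm(250, 0, 0.02))

# Apply column by column; the first row is NA by construction.
z <- as.data.frame(lapply(returns, expanding_z))
head(z)
```

Only the last row agrees with what scale() would give, because only there does the window cover the whole series. If the current period should not influence its own statistics, use x[1:(t - 1)] for the window instead and raise min_obs accordingly.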

How to stratify sample a data set, conduct statistical analysis with Caret and repeat in r?

I have a data set that I would like to stratify sample, create statistical models on using the caret package and then generate predictions.
The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be in part due to the relatively small data sample M=1000).
What I want to be able to do is:
Generate the stratified data sample
Create the machine learning model
Repeat 1000 times & take the average model output
I hope that by repeating the steps on the variations of the stratified data set, I am able to avoid the subtle changes in the predictions generated due to a smaller data sample.
For example, it may look something like this in R:
Original.Dataset = data.frame(A)
Stratified.Dataset = stratified(Original.Dataset, group = x)
Model = train(Stratified.Dataset.....other model inputs)
Repeat process with new stratified data set based on the original data and average out.
Thank you in advance for any help, or package suggestions that might be useful. Is it possible to stratify the sample in caret or simulate in caret?
First of all, welcome to SO.
It is hard to understand exactly what you are asking; your question is very broad.
If you need input on statistics, I would suggest asking more clearly defined questions on Cross Validated.
Q&A for people interested in statistics, machine learning, data analysis, data mining, and data visualization.
The problem I am finding is that in different iterations of the
stratified data set I get significantly different results (this may be
in part due to the relatively small data sample M=1000).
I assume you are referring to different iterations of your model. This depends on how large your different groups are. E.g. if you are trying to divide your data set of 1000 samples into groups of 10 samples, your model could very likely be unstable and hence give different results in each iteration. It could also be because your model depends on some randomness; the smaller your data is (and the more groups you have), the larger the variation will be. For more information, read up on cross-validation, model stability and bootstrap aggregating.
Generate the stratified data sample
How to generate it: the dplyr package is excellent at grouping data by different variables. You might also want to use the split function from base R. You could also use the built-in data-splitting methods in the caret package.
How to know how to split it: this very much depends on the question you would like to answer. Most likely you would like to even out some variables, e.g. gender and age when creating a model for predicting disease.
If you have, e.g., replicated observations and want to create unique subsets with different combinations of the replicates, each with its own measurements, you would have to use other methods. If the replicates share a common identifier (here sample_names), you could do something like this to select all samples but with different combinations of the replicates:
tg <- data.frame(sample_names = rep(1:5, each = 2))
set.seed(10)
tg$values <- rnorm(10)
partition <- lapply(1:100, function(z) {
  set.seed(z)
  sapply(unique(tg$sample_names), function(x) {
    which(x == tg$sample_names)[sample(1:2, 1)]
  })
})
# the first partition of your data to train a model.
tg[partition[[1]], ]
Create the machine learning model
If you want to use caret, you could go to the caret webpage and see all the available models. Depending on your research question and/or data, you would use different types of models. Therefore, I would recommend taking some online machine learning courses, for instance the Stanford University course given by Andrew Ng (I have taken it myself), to get more familiar with the major algorithms. If you are already familiar with the algorithms, just search for the available models.
Repeat 1000 times & take the average model output
You can either repeat your model 1000 times with different seeds (see set.seed) or use a resampling method such as cross-validation or bootstrap aggregation. There are a lot of different training parameters in the caret package:
The function trainControl generates parameters that further control
how models are created, with possible values:
method: The resampling method: "boot", "cv", "LOOCV", "LGOCV",
"repeatedcv", "timeslice", "none" and "oob"
For more information on the methods, see here.
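As a rough illustration of the "stratify, fit, repeat, average" loop from the question, here is a base-R sketch that does not rely on caret. Everything in it is made up for the example: the simulated data, the helper stratified_rows, the 50% sampling fraction, the plain lm model and the 200 repetitions (used instead of 1000 to keep it quick).

```r
# Simulated data with three unevenly sized strata (hypothetical).
set.seed(123)
n <- 1000
dat <- data.frame(
  x     = rnorm(n),
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE,
                        prob = c(0.6, 0.3, 0.1)))
)
dat$y <- 2 * dat$x + as.integer(dat$group) + rnorm(n)

# Draw the same fraction of rows from every stratum.
stratified_rows <- function(groups, frac) {
  unlist(lapply(split(seq_along(groups), groups), function(rows) {
    rows[sample.int(length(rows), size = ceiling(frac * length(rows)))]
  }))
}

# Fixed observations to predict in every repetition.
new_obs <- data.frame(
  x     = c(-1, 0, 1),
  group = factor(c("a", "b", "c"), levels = levels(dat$group))
)

# Repeat: stratified sample, fit, predict. One column per repetition.
n_rep <- 200
pred_matrix <- sapply(seq_len(n_rep), function(i) {
  rows <- stratified_rows(dat$group, frac = 0.5)
  fit <- lm(y ~ x + group, data = dat[rows, ])
  predict(fit, newdata = new_obs)
})

avg_pred <- rowMeans(pred_matrix)       # averaged model output
sd_pred  <- apply(pred_matrix, 1, sd)   # spread across repetitions
cbind(new_obs, avg_pred, sd_pred)
```

The sd_pred column shows how much the predictions move between stratified samples, which is the instability described in the question; averaging over repetitions is essentially what bootstrap aggregating does.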
