How should I approach this multi-class classification problem? - r

I am trying to think of a way to predict an ID value given text data. The data is broken down as follows:
Group: a 4-digit number identifying the group an ID belongs to
ID: a 13-digit number made up of the group number plus a unique value
Text: words coming from documents.
Goal: predict an ID number given only the text from a document.
The data I have contains about 1,200 different IDs but only 140 different groups. The document-term matrix is about 186 columns wide with about 20,000 rows, and I have a lot more data I could include. I built a simple neural net that predicts the group number with 70% accuracy. My idea is to use this model first to predict the group number, then build a separate model for each group to narrow down the set of candidate IDs; the final model would then be trained to predict the ID. Below is a drawing of what I had in mind. Is this similar to stacking in ensemble learning? I am relatively new to machine learning and I am trying to think of different ways to approach this problem.
Am I on the right path or is there a better way of doing this? Any advice is greatly appreciated.

A lot depends on how well you think you can infer the group_number and unique_value from the text. Does the unique_value depend at all on the group_number? If so, then you will likely want to predict the group_number first and use that in the prediction of the unique_value, as you have suggested doing for each unique group number. You will also have to consider how much data you have for each group and whether it is enough to train the respective models. Give it a shot, and if it doesn't work, try a single neural network where you feed in the text and the group number you've already predicted!
Good Luck!
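To make the two-stage idea concrete, here is a rough sketch in R. It assumes a document-term matrix dtm (as a data frame), factors group and id aligned with its rows, and a single new document dtm_new with the same columns; all of these names are placeholders, and random forests are used only as a stand-in classifier.
library(randomForest)
# stage 1: predict the group from the text
group_fit <- randomForest(x = dtm, y = group)
# stage 2: one ID model per group, trained only on that group's documents
id_fits <- lapply(levels(group), function(g) {
  rows <- group == g
  randomForest(x = dtm[rows, ], y = droplevels(id[rows]))
})
names(id_fits) <- levels(group)
# predict a new document: group first, then the ID model for that group
g_hat  <- as.character(predict(group_fit, newdata = dtm_new))
id_hat <- predict(id_fits[[g_hat]], newdata = dtm_new)
Whether this beats a single flat model over all 1,200 IDs depends on how much data each group has, so it is worth comparing both.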

Related

R caret training - but each sample has three separate measurements and I want to use majority vote to predict

I have a very specific dataset with 50 people. Each person has a response (sex) and ~2000 biological measurements.
We have three independent replicates from each person, so 3 rows per person.
I can easily use caret and groupKFold() to keep each person in either training or test sets - that works fine.
Then I simply predict each replicate separately (so 3 predictions per person).
I want to use these three predictions together to make a combined prediction per person, using majority vote and/or some other scheme.
That is, for each person I take the 3 predictions and predict the response to be the one with the most votes. That's pretty easy to do for the final model, but it should also be used in the tuning step (i.e. in the cross-validation that picks the parameter values).
I think I can do that in the summaryFunction=... when calling caret::trainControl() but I would simply like to ask:
Is there a simpler way of doing this?
I have googled around but keep failing to find people with similar problems, and I really hope someone can point me in the right direction.
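For reference, the vote itself is simple to compute once the per-replicate predictions exist. A minimal sketch, assuming a factor of per-replicate predictions preds and a vector person giving the person each row belongs to (both names are placeholders):
# majority vote: for each person, pick the class predicted most often among the replicates
vote <- tapply(as.character(preds), person, function(x) names(which.max(table(x))))
person_pred <- factor(unname(vote))
Getting the same vote into the tuning step would mean a custom summaryFunction that can map the held-out predictions back to people, as the question suspects.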

Partition data while preserving groups with caret

Apologies for the cross-stack post; I wasn't sure whether this is more appropriate for Stack Overflow or Cross Validated. I initially posted on the latter, but realized this might be the more appropriate place.
So, I have a dataset with many rows of individuals, each with a unique individual ID.
For each individual, there is also a column indicating which household that person belongs to, given by a unique householdID.
Finally, there is a Target variable, for each row, which is what I will be trying to make predictions on. Of course, there are columns with different features.
My question is: since membership in different households is important, is there a way to partition the data into train and test sets so that all the people belonging to the same household are kept together and not randomly split over both sets (i.e., any given householdID should not appear in both sets)? And is it also possible to split the households over the train and test sets while keeping the Target variable balanced?
So, using the createDataPartition function in caret, I've managed to get a balanced Target in both train and test when I set y = Target, and I've managed to separate the households cleanly over train and test when I set y = unique(householdID), but I can't figure out whether there's a way to get both results at the same time.
I'm pretty flat out of ideas, so any suggestions would be most welcome!
Thanks!
groupKFold is the way to go. But instead of using data$Target you need to split on data$householdID (or whatever your household ID column is named). This will make sure that all members of a group will be in the same fold.
After this you can use the folds in trainControl to model on data$Target.
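A minimal sketch of that, assuming your data frame is called data with columns householdID and Target as in the question (the method "rf" is just a placeholder):
library(caret)
# build the folds on the household ID so that no household is split across folds
folds <- groupKFold(data$householdID, k = 5)
ctrl  <- trainControl(method = "cv", index = folds)
# then train on Target as usual (drop householdID from the predictors)
fit <- train(Target ~ . - householdID, data = data, method = "rf", trControl = ctrl)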

How do I transform a list of items into groups to predict group ratings in azure machine learning?

I'm a newbie to Azure Machine Learning and I'm trying to build a model that rates groups of items.
My data is a file with a list of items with features (a small list of fewer than 80 items), and I need to make groups of different sizes (groups of 2, 3, 4, ... 10 items, for all possible combinations) so that the model can rate those groups (ratings from 1 to 10). I also have some group ratings to train the model.
I don't know how to transform the items into groups.
Another thing: I'm not sure which model is best. From what I gather, I think multiclass classification is the most suitable for this problem. Is it?
Thank you in advance and sorry for any grammar error in my text.
You need to convert the groups into columns. One such example is when you have sales for a specific day and you need the sales for past days as additional features. Here is the code that converts rows to columns to provide the sales for previous days - https://gallery.cortanaintelligence.com/CustomModule/Generate-Lag-Features-1 - source code for this - https://gist.github.com/nk773/a2ed7cd0ce8020647f5e7711f749b3b5
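As a toy illustration of that rows-to-columns idea in plain R (hypothetical daily sales data, not the linked module itself):
# one row per day of sales
sales <- data.frame(day = 1:10, amount = c(5, 7, 6, 9, 8, 10, 12, 11, 13, 14))
# add the previous days' sales as extra columns (lag features)
sales$lag1 <- c(NA, head(sales$amount, -1))      # sales one day back
sales$lag2 <- c(NA, NA, head(sales$amount, -2))  # sales two days back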

How do duplicated rows affect a decision tree?

I am using rpart to build a decision tree for a categorical variable, and I am wondering whether I should use the full data set or just the set of unique rows.
I am answering this as a general question on decision trees, rather than on the R implementation.
The parameters for decision trees are often based on record counts -- minimum leaf size and minimum split search size come to mind. In addition, purity measures are affected by the size of nodes as the tree is being built. When you have duplicated records, then you are implicitly putting a weight on the values in those rows.
This is neither good nor bad. You simply need to understand the data and the model that you want to build. If the duplicated values arise from different runs of an experiment, then they should be fine.
In some cases, duplicates (or equivalently weights) can be quite bad. For instance, if you are oversampling the data to get a balanced sample on the target, then the additional rows would be problematic. A single leaf might end up consisting of a single instance from the original data -- and overfitting would be a problem.
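To see the weighting point in code, here is a small sketch with rpart; it assumes a data frame df with a categorical target y (names are placeholders), and the equivalence is only approximate because the record-count parameters mentioned above then operate on weights rather than raw rows.
library(rpart)
# fitting on the full data, duplicates included ...
fit_full <- rpart(y ~ ., data = df, method = "class")
# ... is roughly the same as fitting on the unique rows with case weights equal to each row's count
dedup   <- aggregate(cnt ~ ., data = transform(df, cnt = 1), FUN = sum)
fit_wtd <- rpart(y ~ ., data = dedup[, names(df)], weights = dedup$cnt, method = "class")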
In some ways this would depend on the data itself. Are the duplicated rows valid data? Or are they only partly duplicated but still important?
If the data were temperature measurements in a town at a given hour, maybe duplicated temperatures are important as they would weight this variable to be a more correct temperature than another lone measurement that was different.
If the data were temperature measurements that three people recorded off the same thermometer at the same time, then you would want to remove the noise from the data by reducing to just unique values.
The answer could very well be a combination of the above. If you had multiple readings that conflicted in the same time period, you would choose the most heavily weighted one and then decide how to break ties; if all the measurements were the same, you would simply remove the duplicates. In this way you are cleaning the data before you put it through an algorithm.
It all comes down to what is relevant in the data model and whether duplicated rows are of relevance to the result.

Cluster Analysis using R for large data sample

I am just starting out with using R to segment a customer database I have for an ecommerce retail business. I am looking for some guidance on the best approach for this exercise.
I have searched the topics already posted here and tried them out myself, such as dist() and hclust(). However, I keep running into one issue or another and am unable to overcome them since I am new to R.
Here is the brief description of my problem.
I have approximately 480K records of customers who have bought from us so far. The data contains the following columns:
email id
gender
city
total transactions so far
average basket value
average basket size (number of items purchased in one transaction)
average discount claimed per transaction
No of days since the user first purchased
Average duration between two purchases
No of days since last transaction
The business goal of this exercise is to identify the most profitable segments and encourage repeat purchases in those segments using campaigns. Can I please get some guidance as to how to do this successfully without running into problems like the size of the sample or the data type of columns?
Read this to learn how to subset data frames. When you try to define d, it looks like you're providing way too much data, which might be fixed by subsetting your table first. If not, you might want to take a random sample of your data instead of using all of it. Suppose you know that columns 4 through 10 of your data frame, called cust_data, contain numerical data; then you might try this:
cust_data2 <- cust_data[, 4:10]
d <- dist(cust_data2)
For large values, you may want to log transform them--just experiment and see what makes sense. I really am not sure about this, and that's just a suggestion. Maybe choosing a more appropriate clustering or distance metric would be better.
Finally, when you run hclust, you need to pass in the d matrix, and not the original data set.
h <- hclust(d, method = "average")
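If dist() runs out of memory on the full table, here is a rough sketch of the random-sample route mentioned above (again assuming cust_data with numeric columns 4 through 10):
set.seed(1)
idx <- sample(nrow(cust_data), 5000)        # dist() on all 480K rows would need hundreds of GB of memory
cust_sample <- scale(cust_data[idx, 4:10])  # scale so no single column dominates the distances
d <- dist(cust_sample)
h <- hclust(d, method = "average")
plot(h, labels = FALSE)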
Sadly your data does not contain any attributes that indicate what types of items/transactions did NOT result in a sale.
I am not sure if clustering is the way to go here.
Here are some ideas:
First split your data into a training set (say 70%) and a test set.
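For example, a minimal split in base R (assuming the data frame is called custdata, as in the model call below):
set.seed(1)
train_idx <- sample(nrow(custdata), floor(0.7 * nrow(custdata)))
trainset  <- custdata[train_idx, ]
testset   <- custdata[-train_idx, ]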
Set up a simple linear regression model with, say, "average basket value" as the response variable and all other variables as independent variables.
fit <- lm(averagebasketvalue ~ ., data = custdata)
Run the model on the training set, determine significant attributes (those with at least one star in the summary(fit) output), then focus on those variables.
Check your regression fit on the test set by calculating R-squared and the sum of squared errors (SSE) there. You can use the predict() function; the calls will look like
fitpred <- predict(fit, newdata = testset)
SSE <- sum((testset$averagebasketvalue - fitpred)^2)
SST <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
1 - SSE / SST   # R-squared on the test set
Maybe "city" contains too many unique values to be meaningful. Try to generalize them by introducing a new attribute CityClass (e.g. BigCity-MediumCity-SmallCity ... or whatever classification scheme is useful for your cities). You might also condition the model on "gender". Drop "email id".
This can go on for a while... play with the model to try to get better R-squared and SSEs.
I think a tree-based model (rpart) might also work well here.
Then you might change to cluster analysis at a later time.
