How do I transform a list of items into groups to predict group ratings in Azure Machine Learning? - azure-machine-learning-studio

I'm a newbie to Azure Machine Learning, and I'm trying to build a model that rates groups of items.
My data is a file with a list of items with features (a small list, fewer than 80 items), and I need to form groups (of different sizes: groups of 2, 3, 4, ... 10 items, for all possible combinations) so that the model can rate those groups (ratings from 1 to 10). I also have some group ratings to train the model.
I don't know how to transform the items into groups.
Another thing: I'm not sure which model is best. From what I gather, multiclass classification is the most suitable for this problem. Is it?
Thank you in advance, and sorry for any grammar errors in my text.

You need to convert the groups into columns. One such example is when you have sales for a specific day and need the sales for past days as additional features. Here is code that converts rows to columns to provide sales for previous days: https://gallery.cortanaintelligence.com/CustomModule/Generate-Lag-Features-1 (source code: https://gist.github.com/nk773/a2ed7cd0ce8020647f5e7711f749b3b5).
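As a sketch of the group-building step itself in R (the data frame items and its feature columns f1 and f2 are hypothetical), you can enumerate combinations with combn() and collapse each group's member features into a single row. Note that enumerating every group of up to 10 items from 80 is infeasible (choose(80, 10) alone is about 1.6e12 combinations), so restrict the group sizes or sample combinations:
# Hypothetical toy data: one row per item, numeric feature columns f1 and f2
items <- data.frame(id = 1:6, f1 = runif(6), f2 = runif(6))
# For each group size k, enumerate every combination of items and collapse
# the members' features into one row per group (here via column means)
group_rows <- do.call(rbind, lapply(2:4, function(k) {
  combos <- combn(items$id, k)                      # k x (number of groups)
  t(apply(combos, 2, function(ids) {
    members <- items[items$id %in% ids, c("f1", "f2")]
    c(size = k, colMeans(members))
  }))
}))
head(group_rows)  # one row per group; join your known ratings onto these rows
Each resulting row is then a training example, so the known group ratings can be attached as the label column.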

Related

How can Keras predict sequences of sales (individually) of 11106 distinct customers, each a series of varying length (anywhere from 1 to 15 periods)

I am approaching a problem that Keras must offer an excellent solution for, but I am having problems developing an approach (because I am such a neophyte concerning anything deep learning). I have sales data. It contains 11106 distinct customers, each with a time series of purchases of varying length (anywhere from 1 to 15 periods).
I want to develop a single model to predict each customer's purchase amount for the next period. I like the idea of an LSTM, but clearly I cannot make one for each customer; even if I tried, there would not be enough data for an LSTM in any case, as the longest individual time series has only 15 periods.
I have used types of Markov chains, clustering, and regression in the past to model this kind of data. I am asking the question here, though, about what type of model in Keras is suited to this type of prediction. A complication is that all customers can be clustered by their overall patterns. Some belong together based on similarity; others do not; e.g., some customers spend with patterns like $100-$100-$100, others like $100-$100-$1000-$10000, and so on.
Can anyone point me to a type of sequential model supported by Keras that might handle this well? Thank you.
I am trying to achieve this in R. I haven't been able to build a model that gives me more than about 0.3 accuracy.
I don't think the main difficulty is which model to use so much as how to frame the problem.
As you mention, "WHO" is spending the money seems as relevant as their past transactions in predicting how much they will likely spend.
But you cannot train 10k+ models, one for each customer, either.
Instead, I would suggest clustering your customer base and fitting one model per cluster, using all the time series of the customers in that cluster combined to train the same model.
This would allow each model to learn the spending pattern of that particular group.
For that you can use an LSTM or other RNN model.
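To make the cluster-then-model idea concrete, here is a minimal sketch in R with the keras package (series, x, and y are hypothetical: series is a list of per-customer purchase vectors, x an array of histories zero-padded to 15 steps with shape [customers, 15, 1], and y the next-period amounts):
library(keras)
# Cluster customers on simple summary statistics of their series
feats <- data.frame(mean_spend = sapply(series, mean),
                    n_periods  = sapply(series, length))
cl <- kmeans(scale(feats), centers = 5)$cluster
# One small LSTM per cluster, trained on all of that cluster's sequences
build_model <- function() {
  m <- keras_model_sequential() %>%
    layer_masking(mask_value = 0, input_shape = c(15, 1)) %>%  # skip padding
    layer_lstm(units = 16) %>%
    layer_dense(units = 1)
  m %>% compile(loss = "mse", optimizer = "adam")
  m
}
models <- lapply(1:5, function(k) {
  m <- build_model()
  m %>% fit(x[cl == k, , , drop = FALSE], y[cl == k], epochs = 20, verbose = 0)
  m
})
The masking layer lets sequences of different true lengths share one model, which sidesteps the 1-to-15-period problem.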
Hi, here's my suggestion (I will edit it later to provide more information): since it's a sequence problem, you should use RNN-based models such as LSTMs or GRUs.

How should I approach this multi-class classification problem?

I am trying to think of a potential way to predict an ID value given text data. The data is broken down as follows:
Group: a 4-digit number in which a group of IDs exists
ID: a 13-digit number that is the group number plus a unique value
Text: words coming from documents
Goal: predict an ID number given only the text from a document.
The data that I have has about 1200 distinct IDs, while there are only 140 distinct groups. The document-term matrix is about 186 columns wide with about 20,000 rows. I have a lot more data I could include. I created a simple neural net to predict the group number with 70% accuracy. My idea is to use this model first to predict the group number and then build separate models for each group to narrow the number of IDs in the prediction. A final model would then be trained and used to predict the ID. Below is a drawing of what I had in mind. Is this similar to stacking in ensemble learning? I am relatively new to machine learning and am trying to think of different ways to approach this problem.
Am I on the right path, or is there a better way of doing this? Any advice is greatly appreciated.
A lot depends on how well you think you can infer the group_number and unique_value from the text. Does the unique_value depend at all on the group_number? If so, then you will likely want to predict the group_number first and use it in the prediction of the unique_value, as you have suggested doing for each unique group number. You will also have to consider the amount of data you have for each group and whether it's enough to train the respective models. Give it a shot, and if it doesn't work, try a single neural network where you enter the text and the group number you've already predicted!
Good luck!
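A hedged sketch of that two-stage routing in R, using xgboost for both stages rather than a neural net (X, group, id, and x_new are hypothetical; groups containing a single ID would need special-casing):
library(xgboost)
# Stage 1: multiclass model over the 140 group numbers.
# X is the document-term matrix; group and id are factors.
grp_fit <- xgb.train(list(objective = "multi:softmax",
                          num_class = nlevels(group)),
                     xgb.DMatrix(as.matrix(X), label = as.integer(group) - 1),
                     nrounds = 100)
# Stage 2: one ID model per group, trained on that group's documents only
id_fits <- lapply(levels(group), function(g) {
  rows <- group == g
  ids  <- droplevels(id[rows])
  fit  <- xgb.train(list(objective = "multi:softmax",
                         num_class = nlevels(ids)),
                    xgb.DMatrix(as.matrix(X[rows, ]),
                                label = as.integer(ids) - 1),
                    nrounds = 100)
  list(fit = fit, ids = levels(ids))
})
names(id_fits) <- levels(group)
# Route a new document (a 1-row term vector x_new) through the two stages
new_dm <- xgb.DMatrix(matrix(x_new, nrow = 1))
g_hat  <- levels(group)[predict(grp_fit, new_dm) + 1]
stage2 <- id_fits[[g_hat]]
stage2$ids[predict(stage2$fit, new_dm) + 1]  # predicted ID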

Is it possible to use the evtree package in R for panel data / over multiple years?

I would like to know if it is possible to use evtree over multiple years.
I have an unbalanced panel data set (8 years) with two groups based on a binary dependent variable (dv). For every year, the dv value for each observation can be either 0 or 1 and thus constitutes group membership. I also have multiple predictor variables (pv), whose influence on dv might change over time.
evtree generally seems like the correct approach to me (at least for a single year). My goal is to train the evtree model over multiple periods (to capture possible temporal effects) in order to classify the two groups as well as possible.
Any help is highly appreciated.
Thanks in advance!
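For reference, the basic evtree call with year included as an ordinary splitting variable might look like this minimal sketch (dv, pv1, pv2, and year are hypothetical column names for the panel):
library(evtree)
panel$dv <- factor(panel$dv)  # binary dependent variable as a factor
# year enters as a predictor, so the evolved tree can split on it and
# learn period-specific rules for the pv -> dv relationship
fit <- evtree(dv ~ pv1 + pv2 + year, data = panel,
              control = evtree.control(maxdepth = 4))
plot(fit)
table(predicted = predict(fit), observed = panel$dv)  # in-sample confusion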

How can I tweak xgboost to assign more weight to a variable?

I have historical purchase data for some 10k customers over 3 months, and I want to use that data to make predictions about their purchases in the next 3 months. I am using customer ID as an input variable, as I want xgboost to learn each individual's spending across different categories. Is there a way to tweak the model so that the emphasis is on learning from each individual's purchases? Or is there a better way of addressing this problem?
You can use a weight vector, which you pass in the weight argument of xgboost; it is a vector of size equal to nrow(trainingData). However, this is generally used to penalize mistakes on particular rows (think of sparse data with items that sell only, say, once a month; if you want to learn those sales, you need to give more weight to the sale instances, or else all predictions will be zero). You, however, appear to be trying to tweak the weight of an independent variable, which I don't fully understand.
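For example, a minimal sketch in R (train_x, purchased, and the weight of 10 are hypothetical):
library(xgboost)
# Up-weight the rare rows that contain a sale so the model is not
# rewarded for simply predicting "no sale" everywhere
w <- ifelse(purchased == 1, 10, 1)
dtrain <- xgb.DMatrix(data = as.matrix(train_x),
                      label = purchased, weight = w)
fit <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 100)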
Learning the behavior of the dependent variable (sales, in your case) is what machine learning models do; you should let the model do its job. You should not force it to learn from some feature only. For learning purchase behavior, unsupervised techniques such as clustering will be more useful.
To include user-specific behavior, a first take would be to cluster and identify under-indexed and over-indexed categories for each user. Then you can create categorical features from these flags.
PS: Some sample data illustrating your problem would help others help you better.
This arrived with XGBoost 1.3.0, released on 10 December 2020, under the name feature_weights: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit . I'll edit this answer when I can work through or find a tutorial for it.

Cluster Analysis using R for large data sample

I am just starting out with segmenting a customer database that I have for an e-commerce retail business, using R. I am seeking some guidance about the best approach for this exercise.
I have searched the topics already posted here and tried them out myself, like dist() and hclust(). However, I keep running into one issue or another and am not able to overcome them, since I am new to R.
Here is the brief description of my problem.
I have approximately 480K records of customers who have bought so far. The data contains the following columns:
email id
gender
city
total transactions so far
average basket value
average basket size (number of items purchased in one transaction)
average discount claimed per transaction
number of days since the user first purchased
average duration between two purchases
number of days since the last transaction
The business goal of this exercise is to identify the most profitable segments and encourage repeat purchases in those segments using campaigns. Can I please get some guidance on how to do this successfully without running into problems like the sample size or the data types of the columns?
Read this to learn how to subset data frames. When you try to define d, it looks like you're providing way too much data, which might be fixed by subsetting your table first. If not, you might want to take a random sample of your data instead of all of it. Suppose you know that columns 4 through 10 of your data frame called cust_data contain numerical data; then you might try this:
# keep only the numeric columns (4 through 10)
cust_data2 <- cust_data[, 4:10]
# pairwise distance matrix that hclust() will consume
d <- dist(cust_data2)
For large values, you may want to log-transform them; just experiment and see what makes sense. I really am not sure about this, and that's just a suggestion. Maybe choosing a more appropriate clustering method or distance metric would be better.
Finally, when you run hclust, you need to pass in the d matrix, and not the original data set.
h <- hclust(d, method = "average")  # average linkage ("ave" is accepted shorthand)
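One caution: dist() on all 480K rows would need roughly 480,000^2 / 2 pairwise distances, which will not fit in memory. A sketch of one workaround, clustering a random sample first (the sample size of 10,000 and k = 5 are arbitrary choices; kmeans() is another option, since it needs no distance matrix):
set.seed(42)
# cluster a manageable random sample rather than all 480K rows
samp <- cust_data[sample(nrow(cust_data), 10000), 4:10]
d <- dist(scale(samp))             # scale() so no single column dominates
h <- hclust(d, method = "average")
segments <- cutree(h, k = 5)       # cut the dendrogram into 5 segments
table(segments)                    # segment sizes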
Sadly, your data does not contain any attributes that indicate which types of items/transactions did NOT result in a sale.
I am not sure clustering is the way to go here.
Here are some ideas:
First split your data into a training set (say 70%) and a test set.
Set up a simple linear regression model with, say, "average basket value" as the response variable and all other variables as independent variables:
fit <- lm(averagebasketvalue ~ ., data = custdata)
Run the model on the training set, determine significant attributes (those with at least one star in the summary(fit) output), then focus on those variables.
Check how the model performs on the test set by calculating R-squared and the sum of squared errors (SSE) there. You can use the predict() function; the calls will look like:
fitpred <- predict(fit, newdata = testset)
sse <- sum((testset$averagebasketvalue - fitpred)^2)
sst <- sum((testset$averagebasketvalue - mean(testset$averagebasketvalue))^2)
1 - sse / sst  # R-squared on the test set (summary(fitpred) alone would not give this)
Maybe "city" contains too many unique values to be meaningful. Try to generalize them by introducing a new attribute CityClass (e.g. BigCity-MediumCity-SmallCity ... or whatever classification scheme is useful for your cities). You might also condition the model on "gender". Drop "email id".
This can go on for a while... play with the model to try to get better R-squared and SSEs.
I think a tree-based model (rpart) might also work well here.
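A minimal sketch of that tree-based alternative, on the same hypothetical custdata:
library(rpart)
# Tree-based alternative to the linear model above; captures non-linear
# effects and interactions between attributes automatically
tree_fit <- rpart(averagebasketvalue ~ ., data = custdata)
printcp(tree_fit)  # complexity table, useful for deciding how far to prune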
Then you might change to cluster analysis at a later time.
