R: Clustering customers based on similar product interests for an event

I have a dataset with a list of customers and their product preferences. Basically, it is a simple CSV with a column called "CUSTOMER" and 5 other columns called "PRODUCT_WANTED_A", "PRODUCT_WANTED_B" and so on.
I asked these customers if they were interested in knowing more about a particular product, and the answers could simply be YES or NO (1 or 0 in the dataset). The dataset can be downloaded here. Obviously, there will be customers with many different interests, based on the mix of YES and NO answers in these 5 columns.
My goal is to understand which customers are similar to others in such interests. This will help me manage an agenda of product presentations and, in each meeting, I would like to understand the best grouping for it. I started with a hierarchical plot like this:
customer_list <- read.csv("customers_products_wanted.csv", sep = ",", header = TRUE)
# cluster on the five product columns only, keeping CUSTOMER for the labels
customer.hclust <- hclust(dist(customer_list[, -1]))
plot(customer.hclust, labels = customer_list$CUSTOMER)
rect.hclust(customer.hclust, k = 5)  # rect.hclust() is in base stats, no extra package needed
This is the plot I got, asking for 5 clusters:
Tried the same, but with 10 clusters:
Question 1: I know it's always hard to tell, but looking at the charts and dataset, what would be your 'cut' to group customers? 5? 10?
I was reviewing the results and, in the same group, I had CUSTOMER112 with 1,0,1,0,1 as their preferences together with CUSTOMER110 (1,1,1,1,1), CUSTOMER106 (1,1,1,1,0) and so on. The "distance" may be right, but in a given group I have customers with some relevant differences in their preferences.
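For reference, a minimal way to list the members of each group from the tree above (using cutree, which should match the boxes drawn by rect.hclust) is:

groups <- cutree(customer.hclust, k = 5)
split(as.character(customer_list$CUSTOMER), groups)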
Question 2: I don't know if it's a case of total ignorance about clustering, the code I used or even the dataset. Based on your experience, what would be your approach for the best clustering in this case?
Any comments will be highly appreciated. As you can see, I made some effort, but I'm still in doubt.
Thanks a lot!
Ricardo

All answers were important, but thanks to @Ben's video recommendation and @Samuel Tan's advice on breaking the customers into grids, I found a good way to handle it.
The video gave me a lot of insight into "noisy" variables in hierarchical clustering, and the grid recommendation helped me think about what the data is really trying to tell me.
That said, a basic data cleaning step eliminated all customers with no interest in any product (this is obvious, but I didn't pay attention to it at first). Then I set aside customers interested in a single product, because they wouldn't need to attend the workshop series I'm planning (they just want to hear about one product).
Evaluating all the others, those interested in more than one product, I realized the product mix could point me to a better classification. From there, I grouped customers into 3 clusters: integration opportunities (2 or 3 products), convergence opportunities (4 products) and transformation opportunities (all 5 products).
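A minimal sketch of that final grouping in R (reusing customer_list from the code above; the cut points are just the ones described here):

interest_count <- rowSums(customer_list[, -1])

# drop customers with no interest or only a single interest
multi <- customer_list[interest_count >= 2, ]

# 2-3 products = integration, 4 = convergence, 5 = transformation
multi$group <- cut(rowSums(multi[, -1]),
                   breaks = c(1, 3, 4, 5),
                   labels = c("integration", "convergence", "transformation"))
table(multi$group)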
Now it's clear which customers I should focus on for my workshops, and I can plan my post-workshop sales campaigns with materials that target each customer group (integration, convergence, transformation).
Thanks for all the advice!
Ricardo

Related

Check if any combination of binary variables is correlated/has impact on an ordinal dependent variable

I am working on a case study to finish my (not so advanced) data science course, and I have already been helped a lot by topics here, thanks!
Unfortunately now I am stuck again and cannot find an existing answer.
My data comes from a bike shop and I want to see whether the products bought during a customer's first registered purchase are related to (have an impact on) how important they will become to the shop in the future. I have grouped customers into 5 clusters (from those who registered but never made any registered purchase again, through those who made 2-3 purchases for little money and those who made a few purchases for a lot of money, to those who purchase regularly and really bring a lot of money to this bike shop), and I have ordered them into an ordinal dependent variable.
As the independent variables I have prepared 20+ binary variables that identify products/services bought during the first purchase from this shop (the first purchase as a registered customer), one row per customer. I want to check whether there are combinations of products (probably "extras" to the bike purchase) that increase the chance that a customer registers and hopefully stays a loyal customer.
The dream would be to be able to say, for example: if you buy a cheap or mid-priced bike during this first purchase, you probably won't contribute much to the bike shop in the long term, so you get a low grade on the dependent variable. But those who bought a mid-priced bike AND a helmet AND a lock (probably at a special price) are more likely to become loyal registered customers who bring in money for a longer time.
There might be no relation like that, but I want to test it anyway. The result could be implemented as recommending an extra product during a purchase (with a good price on it).
I am learning R during this course. We went through some techniques, and at first I imagined it would be possible to work with neural networks (just because it sounded the most fun to try), with all these products as input in a sparse matrix and the customer clusters as the output (I hoped it was similar to the examples I read about, with a sparse matrix of pixels from a picture as the input and the numbers 1-9 as the output), but then I was told that those examples rely on pictures containing real patterns, and in my case I don't even know if there is any.
Then I thought I could try the ordinal forest. But it doesn't predict my clusters well at all (2 out of 5 clusters get no predictions). That is OK; I don't expect the first purchase to predict a customer's entire future. But I would really like to see whether there are combinations of products that increase the chance that a customer ends up in one of the "higher" clusters on the loyalty scale.
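To make the idea concrete, this is roughly what I mean by "testing a combination"; a sketch with an ordered logit model (MASS::polr) rather than the ordinal forest, and with made-up column names:

library(MASS)

# loyalty: the ordered 5-level cluster variable; bike_mid, helmet, lock: 0/1 product flags
first_purchase$loyalty <- factor(first_purchase$loyalty, ordered = TRUE)

fit <- polr(loyalty ~ bike_mid * helmet * lock, data = first_purchase, Hess = TRUE)
summary(fit)  # the interaction terms show whether the combination adds anything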
I am not sure if this was clear enough. :) Do you think that there is any way of testing my idea? What could I try to do? Let me know if you need more information.

KNIME ANALYTIC PLATFORM: What does pattern mean in knime?

I am working on KNIME ANALYTIC PLATFORM as part of my project. I am new to this analytics platform.
Predictive analysis is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. ... KNIME is based on the Eclipse platform and provides a visual programming language based on data flows to create an easy-to-understand analysis process quickly.
My Approach
Using existing data, I was trying to form a pattern. For example:
There are several customers with a pending amount to be paid, and a few of them have paid. In my case there may be one or more orders per customer.
Say customers 1, 2 and 3 exist: Cust_1 has 3 orders, Cust_2 has 2 orders and Cust_3 has 1 order, with some of the order amounts paid and some not paid.
My Question
My question is: can we generate a pattern based on customers?
For example, can I find the customers with more than 2 orders, highlight (colour) them, and arrange them into a pattern? Which nodes in KNIME would build such a pattern?
Can anyone please help with this question?
The patterns in this case are what customers buy together, which can be expressed as association rules. These rules can be applied to new data and can help predict new purchases by suggesting the remaining products when one of them is already in the basket.
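As an illustration of the same idea outside KNIME (my own sketch in R, not something the KNIME workflow requires), association rules can be mined with the arules package:

library(arules)

# toy baskets: each element is one customer's order (made-up data)
baskets <- list(c("product_A", "product_B", "product_C"),
                c("product_A", "product_B"),
                c("product_B", "product_C"),
                c("product_A", "product_C"))
trans <- as(baskets, "transactions")

# mine rules above minimum support and confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.25, conf = 0.6))
inspect(rules)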
If more information is available on the customers, it can be used to cluster them based on those properties (in which case the patterns are the similarities between customers), and if a new customer fits into one of those clusters, the most common product(s) of that cluster can be suggested to them. The nice thing is that KNIME makes this very easy once you have your data and get familiar with KNIME (which is itself user friendly; there are many free resources available: https://www.knime.com/resources).
Obviously other patterns might also be useful. If you have more data, you might see trends (patterns) in the purchases of individual customers (or in the order amounts, where the ARIMA nodes might be useful) or in the popularity of different products. These can also be called patterns.
For complex models you might need other tools too, like R or Python or something else. I should emphasize that KNIME has very good PMML support, so you are not tied to a single tool: you can create/train your model in KNIME and use some other tool to make predictions based on that model, or the other way around.

Recommendation systems - converting transaction counts to star ratings

I'm doing some exploratory work on recommendation systems and have been reading about collaborative filtering techniques involving user-based, item-based, and SVD algorithms. I am also trying out R's recommenderlab package.
One apparent assumption in the literature is that the user data has labelled items based on a rating scale, e.g. between 1 and 5 stars. I'm looking at problems where the user data does not have ratings but rather just transactions. For example, if I want to recommend restaurants to a user, the only data I have is how often he has visited other restaurants.
How can I convert these "transaction" counts into ratings that can be used by recommendation algorithms that expect a fixed-scale rating? One approach I thought of is simple binning (a small code sketch follows the list):
0 stars = 0-1 visits
1 star = 2-3 visits
...
5 stars = 10+ visits
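A minimal sketch of that binning in R (assuming a vector of visit counts) could be:

# hypothetical visit counts
visits <- c(0, 1, 2, 3, 5, 8, 10, 14)

# 0-1 visits -> 0 stars, 2-3 -> 1 star, ..., 10+ -> 5 stars
stars <- cut(visits,
             breaks = c(-Inf, 1, 3, 5, 7, 9, Inf),
             labels = 0:5)
data.frame(visits, stars)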
However, that doesn't seem like it would work well. For example, if someone visited a restaurant only once, he may still really love it.
Any help would be appreciated.
I would try different approaches. As you said, a single visit may indicate that the user still loves the restaurant, but you don't know for sure. Your goal is not to optimize for one single user but for all users. To do this, you can split your data into training and test sets, train on the training data with different scales, and evaluate on the test data.
The different scales could be:
a binary scale (0: never visited, 1: visited). This is mostly used in online shops (bought or not) and would support your assumption about the one-time visit.
your proposed scale, or other ranges for the 5 stars. You could also use more than 5 stars. I would probably not group 0-1 visits together.
The approach with the best accuracy should be chosen.
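Since the question mentions recommenderlab, a rough sketch of that train/test comparison (assuming a hypothetical users x restaurants matrix of binned star ratings called star_matrix, with NA where a user never visited) might be:

library(recommenderlab)

# star_matrix: users x restaurants, 1-5 stars, NA = unknown
ratings <- as(star_matrix, "realRatingMatrix")

# hold out part of each user's ratings for evaluation
scheme <- evaluationScheme(ratings, method = "split", train = 0.8,
                           given = 3, goodRating = 3)

rec  <- Recommender(getData(scheme, "train"), method = "UBCF")
pred <- predict(rec, getData(scheme, "known"), type = "ratings")

# RMSE/MAE on the held-out ratings; repeat with the other scales and compare
calcPredictionAccuracy(pred, getData(scheme, "unknown"))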
Here's an idea: restaurants the user has visited zero or one times tell you nothing about what they like, while restaurants they have visited many times tell you a lot. Why not just look for restaurants similar to those the customer most regularly frequents? That way you use the positive information (what they like) and none of the negative, since you don't have access to it anyway.
If you absolutely had to infer some continuous measure, I think it would only be sensible to look at the propensity for another visit given past behaviour. This would start with the prior probability of choosing that restaurant (the background frequency, or just a uniform distribution over restaurants), combined with a likelihood term related to the number of visits to that restaurant. In this way, the more a user visits a restaurant, the more likely they are to visit it again.
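One concrete form this could take (my own choice of smoothing, not something the answer specifies) is a Dirichlet-smoothed visit propensity:

# visit counts for one user across four restaurants (made-up data)
visits <- c(r1 = 0, r2 = 1, r3 = 7, r4 = 12)

# prior over restaurants: background popularity, or uniform as here
prior <- rep(1 / length(visits), length(visits))

# posterior-mean propensity: the prior pulled towards the observed visit frequencies
alpha <- 2  # strength of the prior (assumed)
propensity <- (visits + alpha * prior) / (sum(visits) + alpha)
round(propensity, 3)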

Get Annual Financial Data for a Stock for many years in R

Suppose I want to regress, in R, Gross Profit on Total Revenue. I need data for this, and the more, the better.
There is a library on CRAN that I find very useful: quantmod. It does what I need.
library(quantmod)

getFinancials(Symbol = "AMD", src = "google")  # assigns AMD.f in the workspace

# row names of the annual income statement: rownames(AMD.f$IS$A)
Total.Revenue <- AMD.f$IS$A["Revenue", ]
Gross.Profit  <- AMD.f$IS$A["Gross Profit", ]

# finally:
reg1 <- lm(Gross.Profit ~ Total.Revenue)
The biggest issue that I have is that this library gets me data only for 4 years (4 observations, and who runs a regression with only 4 observations???). Is there any other way (maybe other libraries) that would get data for MORE than 4 years?
I agree that this is not an R programming question, but I'm going to make a few comments anyway before this question is (likely) closed.
It boils down to this: getting reliable fundamental data across sectors and markets is difficult enough even if you have money to spend. If you are looking at the US then there are a number of options, but all the major (read 'relatively reliable') providers require thousands of dollars per month - FactSet, Bloomberg, Datastream and so on. For what it's worth, for working with fundamental data I prefer and use FactSet.
Generally speaking, because the Excel tools offered by each provider are more mature, I have found it easier to populate spreadsheets with the data and then read the data into R. Then again, I typically deal with the fundamentals of a few dozen companies at most, because once you move out of the domain of your "known" companies the time it takes to check anomalies increases exponentially.
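As a small illustration of that workflow (the file and column names here are made up), reading an exported spreadsheet back into R might look like:

library(readxl)

# hypothetical export from the provider's Excel tools
fundamentals <- read_excel("amd_fundamentals.xlsx")

# the same regression as above, now on the longer exported history
reg <- lm(gross_profit ~ total_revenue, data = fundamentals)
summary(reg)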
There are numerous potential "gotchas". The most obvious is that definitions vary from sector to sector. "Sales" for an industrial company is very different from "sales" for a bank, for example. Another problem is changes in definitions: pretty much every year some accounting regulation changes and breaks your data series. Last year minority interests were reported in one line, but this year the item has moved to another position in the P&L, and so on.
Another problem is companies themselves changing. How does one deal with mergers, acquisitions and spin-offs, for example? This sort of thing can make measuring organic sales growth next to impossible. Yet another point to bear in mind is that if you're dealing with operating or net profit, you have to consider exceptionals and whether to adjust for them.
Dealing with companies outside the US adds a whole bunch of further problems. Of course, the major data providers try to standardise globally (FactSet Fundamentals, for example), but this just adds another layer of abstraction, and typically it is hard to check how the data has been manipulated.
In short, getting the data is onerous and I know of no reliable free sources. Unless you're dealing with the simplest items for a very homogenous group of companies, this is a can of worms even if you do have the data.

Website Layout Statistics

I have a client who has suggested laying out a long list of categories in a custom order. The order would be decided by them, based on the product items they sell the most, etc.
I tend to disagree: I feel that people browsing the internet prefer to scan lists of categories that are in alphabetical order or sorted by something they can use as a reference, such as a date.
I would like to know others thoughts on this and it would be appreciated if anyone could point me in the direction of any open source surveys that have been taken in this area.
Thanks
Ben
What a silly stance to take regarding a simple customer request. Allow for both orderings, and other ones too. There is no survey that will demonstrate that the client is wrong, since they are, by definition, correct.
Code that allows for different orderings has greater utility anyway, and real user data will show them which ordering, if either, should be the default.
