Association rule mining basics - How to read association rules

A very basic question here:
Example rule (suppose it's generated by WEKA):
bread=t 10 ==> milk=t 10 conf:(1)
This means that "in these 10 instances, every time people buy bread, they also buy milk" (ignore the support).
Can this rule be read both ways? Like, "every time people buy milk, they also buy bread"?
Another example
Physics101=A ==> Superphysics401=A
Can it be read both ways like this:
"If people got A on Physics101, they also got A on Superphysics401"
"If people got A on Superphysics401, they also got A on Physics101" ?
If so, what makes WEKA generate the rule in that order (Physics ==> Superphysics) and not the other way? Or is the order not relevant?

Can this rule be read both ways? Like, "every time people buy milk, they also buy bread"?
No, it can only be read one way.
This follows from the rules of implication. A -> B and B -> A are different things. Read the former as "A is a subset of B": whenever you are in A, you are also in B. B -> A, also called the converse of A -> B, can be interpreted in a similar way. When both of these hold, we say A <-> B, which means that A and B are essentially the same.
If the above looks like too much jargon, keep the following in mind:
Rain -> Clouds is true: whenever there is rain, there will be clouds. But Clouds -> Rain is not always true; there may be clouds but no rain.
If so, what makes WEKA generate the rule in that order (Physics ==> Superphysics) and not the other way? Or is the order not relevant?
The dataset leads to the rules. Here is an example:
Milk, Bread, Wafers
Milk, Toasts, Butter
Milk, Bread, Cookies
Milk, Cashew nuts
Convince yourself that Bread -> Milk holds, but Milk -> Bread does not.
Note that we may not be always interested in rules that either hold or do not hold. Thus, we try to add a notion of confidence to the rules. A natural way of defining confidence for A->B is P(B|A) i.e. how often do we see B when we see A.
This can be calculated by dividing the number of transactions in which A and B appear together by the number of transactions in which A appears.
In our example,
P(Milk | Bread) = 2 / 2 = 1 and
P(Bread | Milk) = 2 / 4 = 0.5
You can now sort the list of rules by confidence and decide which ones you want to use.
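For completeness, here is a minimal sketch in base R (the item names simply mirror the example above) that computes the confidence of both directions:
transactions <- list(
  c("Milk", "Bread", "Wafers"),
  c("Milk", "Toasts", "Butter"),
  c("Milk", "Bread", "Cookies"),
  c("Milk", "Cashew nuts")
)
# confidence(A -> B) = P(B | A) = count(A and B together) / count(A)
confidence <- function(transactions, a, b) {
  has_a  <- sapply(transactions, function(t) a %in% t)
  has_ab <- sapply(transactions, function(t) all(c(a, b) %in% t))
  sum(has_ab) / sum(has_a)
}
confidence(transactions, "Bread", "Milk")  # 2 / 2 = 1
confidence(transactions, "Milk", "Bread")  # 2 / 4 = 0.5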

Related

SKU comparison for ADX VM

This link shows a comparison of the various VM SKUs available for an ADX cluster. My question is about the following two SKUs:
D14 v2 (Category: compute-optimized), SSD: 614 GB, Cores: 16, RAM: 112 GB
DS14 v2 + 4 TB PS (Category: storage-optimized), SSD: 4 TB, Cores: 16, RAM: 112 GB
Purely looking at the numbers (SSD, RAM, cores), it looks like #2 has everything #1 has, plus 4 TB of SSD instead of only 614 GB, so based on that I would always choose #2 over #1. What, then, is the meaning of the category here? #1 falls in the category "compute-optimized" whereas #2 belongs to "storage-optimized". If the category is decided purely on the basis of the configuration listed here, then #2 should qualify as both storage- and compute-optimized, because it has the same compute as #1 plus something extra, so why is #2 listed only as storage-optimized? I am trying to understand whether #1 has an additional edge over #2 for compute-intensive jobs, because from the numbers alone I don't see any reason (apart from cost, which is not very different anyway) why I shouldn't use #2 over #1. Perhaps #1 has something unique that #2 lacks and that isn't specified in that link.
Based on your question, it appears you're largely disregarding cost. The following table (in the same doc you've linked to) summarizes the main considerations for choosing a SKU; you can see that one of them is Cost per GB cache per core.
Another example: let's assume you can reach the same total cache (SSD) size with either SKU; with one, your cluster will have X nodes, and with the other Y nodes. If Y > X, data in the second cluster will be distributed across more nodes, allowing more parallelism during ingestion and queries. Of course, the cost of the two options could also differ.
Last, given that cost isn't meaningless in your case, I would strongly recommend consulting the cost estimator to see how a different choice of SKU affects the total estimated cost of your cluster (given that you know the volumes of data you're dealing with).

PageRank Theory -- Unassisted Goal Scoring in R with igraph

I'm trying to analyze goal-scoring networks in hockey. I have data for the player who scored the goal and the player who assisted on that goal. My issue is that some goals do not have an assist, so I'm not sure what I should do in those situations.
So, an example for my data looks like this:
scorer <- c("Lidstrom", "Yzerman", "Fedorov", "Yzerman", "Shanahan")
assister <- c("", "Lidstrom", "Yzerman", "Shanahan", "Lidstrom")
mydata <- data.frame(scorer, assister)
And the output is:
scorer assister
1 Lidstrom
2 Yzerman Lidstrom
3 Fedorov Yzerman
4 Yzerman Shanahan
5 Shanahan Lidstrom
When I'm dealing with unassisted goals, does it make sense to act as if the assist goes to the scorer?
EX:
scorer assister
1 Lidstrom Lidstrom
2 Yzerman Lidstrom
3 Fedorov Yzerman
4 Yzerman Shanahan
5 Shanahan Lidstrom
Or does it make sense to create a new name "unassisted" for unassisted goals?
EX:
scorer assister
1 Lidstrom UNASSISTED
2 Yzerman Lidstrom
3 Fedorov Yzerman
4 Yzerman Shanahan
5 Shanahan Lidstrom
Here's the rest of my code for the PageRank, assuming that something is filled in for the blank assister space:
library(igraph)
library(dplyr)
my_network <- mydata %>%
  as.matrix() %>%
  graph.edgelist(directed = TRUE)
page_rank(my_network, directed = TRUE)$vector
I can't just remove goals that are unassisted, so I'm trying to come up with a solution that doesn't violate any major graph-theory principles (about which I'm not very knowledgeable). Any ideas?
I agree with the suggestion of @emilliman5 outlined in the comments: for unassisted goals, just add an edge from the scorer's node to itself (a self-loop). Then use PageRank to find the most influential players. Actually, PageRank can be a particularly good choice here because the principles underlying the PageRank score bear some similarity to what goes on in a "real" hockey match.
Let me elaborate on this a bit. PageRank was originally invented for modeling the behaviour of a randomly chosen Internet user browsing the pages on the web. In each time step, the user can choose to follow a link on the web page currently being viewed, or surf to another, unrelated page, chosen uniformly from the set of all pages on the Internet. There is a fixed probability value that decides whether the user is going to follow a link (typically 0.85) or the user is going to "teleport" to a randomly chosen page (typically 0.15). The idea behind PageRank is that the most important pages are where the user is likely to spend a lot of time when following the rules above. The behaviour of the user is essentially a random walk over the set of webpages.
Now, in a hockey game, the "user" is the hockey puck that is being passed from player to player. At each pass, the puck is either passed from one player to another, or a goal is scored, or the puck is accidentally passed to the opposing team. In the latter two cases, the puck ends up at the opposing team, and eventually it is returned to the first team at a randomly chosen player. (This is a first approximation; if you want to go deeper, you could keep on "tracking" the puck for the opposing team as well). I think you can start seeing the similarities here. The assister-to-scorer network that you have captures a fragment of this, namely the last pass before each goal. From this point of view, I think it totally makes sense to think about unassisted goals as events where the player passed to himself before scoring.
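Here is a minimal sketch of that self-loop idea in R, reusing the mydata data frame from the question (assuming its columns are plain character vectors rather than factors):
library(igraph)
library(dplyr)

my_network <- mydata %>%
  mutate(assister = ifelse(assister == "", scorer, assister)) %>%  # unassisted goal becomes a self-loop
  as.matrix() %>%
  graph_from_edgelist(directed = TRUE)  # current name for graph.edgelist

page_rank(my_network, directed = TRUE)$vector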
Of course you would have a much better understanding of the team dynamics if your dataset contained all the passes, not only the ones that resulted in a goal. In fact, in that case, you could add an additional node called "GOAL" to your network, draw edges from scorers to the "GOAL" node, and then calculate the so-called personalized PageRank vector for the "GOAL" node, which would give you the most influential nodes from which the "GOAL" node is the easiest to reach. But this is more like a research question from this point onwards, and it is probably not a good fit for further discussion on Stack Overflow.

Google code jam 2016: Round 1A, BFF

Question :
You are a teacher at the brand new Little Coders kindergarten. You have N kids in your class, and each one has a different student ID number from 1 through N. Every kid in your class has a single best friend forever (BFF), and you know who that BFF is for each kid. BFFs are not necessarily reciprocal -- that is, B being A's BFF does not imply that A is B's BFF.
Your lesson plan for tomorrow includes an activity in which the participants must sit in a circle. You want to make the activity as successful as possible by building the largest possible circle of kids such that each kid in the circle is sitting directly next to their BFF, either to the left or to the right. Any kids not in the circle will watch the activity without participating.
What is the greatest number of kids that can be in the circle?
Input
The first line of the input gives the number of test cases, T. T test cases follow. Each test case consists of two lines. The first line of a test case contains a single integer N, the total number of kids in the class. The second line of a test case contains N integers F1, F2, ..., FN, where Fi is the student ID number of the BFF of the kid with student ID i.
Output
For each test case, output one line containing "Case #x: y", where x is the test case number (starting from 1) and y is the maximum number of kids in the group that can be arranged in a circle such that each kid in the circle is sitting next to his or her BFF.
My problem: there is a contest analysis on the Code Jam site, but I don't understand it. Where is the optimization happening? If someone can explain this problem and its solution in detail, it would be very helpful.
Edit: I am not adding any pseudo-code because I want to improve my understanding of the problem; it's not a coding issue.

customer segmentation in retail [closed]

I have a large sales database of a 'home and construction' retail.
I need to know who the electricians, plumbers, painters, etc. shopping at the store are.
My first approach was to select the articles related to each specialty (wire [article] is related to electrician [specialty], for example) and then, based on customer sales, work out which customers belong to which specialty.
But this is a lot of work.
My second approach is to do a cluster segmentation first, and then discover which clusters belong to which specialty (this is a lot better because I would be able to discover new segments).
But how can I do that? What type of clustering should I use: k-means, fuzzy? What variables should I feed into the model? Should I use PCA to decide how many clusters to look for?
The header of my data (simplified):
customer_id | transaction_id | transaction_date | item_article_id | item_group_id | item_category_id | item_qty | sales_amt
Any help would be appreciated.
(Sorry for my English.)
You want to identify classes of customers based on what they buy (I presume this is for marketing reasons). This calls for a clustering approach. I will talk you through the entire setup.
The clustering space
Let us first consider what exactly you are clustering: either orders or customers. In either case, the way you characterize the items and the distances between them is the same. I will discuss the basic case for orders first, and then explain the considerations that apply to clustering by customers instead.
For your purpose, an order is characterized by what articles were purchased, and possibly also how many of them. In terms of a space, this means that you have a dimension for each type of article (item_article_id), for example the "wire" dimension. If all you care about is whether an article is bought or not, each item has a coordinate of either 0 or 1 in each dimension. If some order includes wire but not pipe, then it has a value of 1 on the "wire" dimension and 0 on the "pipe" dimension.
However, there is something to say for caring about the quantities. Perhaps plumbers buy lots of glue while electricians buy only small amounts. In that case, you can set the coordinate in each dimension to the quantity of the corresponding article (presumably item_qty). So suppose you have three articles, wire, pipe and glue, then an order described by the vector (2, 3, 0) includes 2 wire, 3 pipe and 0 glue, while an order described by the vector (0, 1, 4) includes 0 wire, 1 pipe and 4 glue.
If there is a large spread in the quantities for a given article, i.e. if some orders include orders of magnitude more of some article than other orders do, then it may be helpful to work with a log scale. Suppose you have these four orders:
2 wire, 2 pipe, 1 glue
3 wire, 2 pipe, 0 glue
0 wire, 100 pipe, 1 glue
0 wire, 300 pipe, 3 glue
The former two orders look like they may belong to electricians while the latter two look like they belong to plumbers. However, if you work with a linear scale, order 3 will turn out to be closer to orders 1 and 2 than to order 4. We fix that by using a log scale for the vectors that encode these orders (I use the base 10 logarithm here, but it does not matter which base you take because they differ only by a constant factor):
(0.30, 0.30, 0)
(0.48, 0.30, -2)
(-2, 2, 0)
(-2, 2.48, 0.48)
Now order 3 is closest to order 4, as we would expect. Note that I have used -2 as a special value to indicate the absence of an article, because the logarithm of 0 is not defined (log(x) tends to negative infinity as x tends to 0). -2 means that we pretend that the order included 1/100th of the article; you could make the special value more or less extreme, depending on how much weight you want to give to the fact that an article was not included.
The input to your clustering algorithm (regardless of which algorithm you take, see below) will be a position matrix with one row for each item (order or customer), one column for each dimension (article), and either the presence (0/1), amount, or logarithm of the amount in each cell, depending on which you choose based on the discussion above. If you cluster by customers, you can simply sum the amounts from all orders that belong to that customer before you calculate what goes into each cell of your position matrix (if you use the log scale, sum the amounts before taking the logarithm).
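As a sketch, assuming your sales table is loaded into a data frame called sales with the columns listed in the question, a customer-by-article position matrix (with the optional log transform described above) could be built like this:
# Sum item_qty per customer and article; xtabs produces the customer-by-article matrix
my.matrix <- unclass(xtabs(item_qty ~ customer_id + item_article_id, data = sales))

# Optional: log scale, with a floor of -2 for articles a customer never bought
my.matrix <- ifelse(my.matrix == 0, -2, log10(my.matrix))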
Clustering by orders rather than by customers gives you more detail, but also more noise. Customers may be consistent within an order but not between them; perhaps a customer sometimes behaves like a plumber and sometimes like an electrician. This is a pattern that you will only find if you cluster by orders. You will then find how often each customer belongs to each cluster; perhaps 70% of somebody's orders belong to the electrician type and 30% belong to the plumber type. On the other hand, a plumber may only buy pipe in one order and then only buy glue in the next order. Only if you cluster by customers and sum the amounts of their orders, you get a balanced view of what each customer needs on average.
From here on I will refer to your position matrix by the name my.matrix.
The clustering algorithm
If you want to be able to discover new customer types, you probably want to let the data speak for themselves as much as possible. Good old-fashioned hierarchical clustering with complete linkage (CLINK) may be an appropriate choice in this case. In R, you simply do hclust(dist(my.matrix)) (this will use the Euclidean distance measure, which is probably good enough in your case). It will join closely neighbouring items or clusters together until all items are categorized in a hierarchical tree. You can treat any branch of the tree as a cluster, observe typical article amounts for that branch, and decide whether that branch represents a customer segment by itself, should be split into sub-branches, or should be joined with a sibling branch instead. The advantage is that you find the "full story" of which items and clusters of items are most similar to each other, and how much. The disadvantage is that the outcome of the algorithm does not tell you where to draw the borders between your customer segments; you can cut up the clustering tree in many ways, so it's up to your interpretation how you want to identify your customer types.
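A short sketch of that workflow (cutting the tree into 10 candidate segments is just an illustration, not a recommendation):
fit <- hclust(dist(my.matrix))   # complete linkage is hclust's default
plot(fit)                        # inspect the tree before deciding where to cut
groups <- cutree(fit, k = 10)    # e.g. cut into 10 candidate segments
table(groups)                    # how many customers fall into each segment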
On the other hand, if you are comfortable fixing the number of clusters (k) beforehand, k-means is a very robust way to get just any segmentation of your customers in k distinct types. In R, you would do kmeans(my.matrix, k). For marketing purposes, it may be sufficient to have (say) 5 different profiles of customers that you make custom advertisement for, rather than treating all customers the same. With k-means you don't explore all of the diversity that is present in your data, but you might not need to do so anyway.
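For example, a sketch with k = 5 (an arbitrary choice):
set.seed(1)                          # k-means starts from random centers
km <- kmeans(my.matrix, centers = 5, nstart = 25)
table(km$cluster)                    # segment sizes
round(km$centers, 2)                 # typical article profile of each segment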
If you don't want to fix the number of clusters beforehand, but you also don't want to manually decide where to draw the borders between the segments afterwards, there is a third possibility. You start with the k-means algorithm, letting it generate a number of cluster centers that is much larger than the number of clusters you hope to end up with (for example, if you hope to end up with somewhere around 10 clusters, let the k-means algorithm look for 200 clusters). Then, use the mean shift algorithm to further cluster the resulting centers. You will end up with a smaller number of compact clusters. The approach is explained in more detail by James Li over here. You can use the mean shift algorithm in R with the ms function from the LPCM package; see this documentation.
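A rough sketch of that two-stage approach; the bandwidth h passed to ms is an assumed value that you would need to tune:
library(LPCM)
set.seed(1)
many_centers <- kmeans(my.matrix, centers = 200, nstart = 10)$centers
ms_fit <- ms(many_centers, h = 0.1)  # h = 0.1 is only a placeholder bandwidth
ms_fit                               # inspect the merged, compact cluster centres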
About using PCA
PCA will not tell you how many clusters you need. PCA answers a different question: which variables seem to represent a common underlying (hidden) factor. In a sense, it is a way to cluster variables, i.e. properties of entities, not to cluster the entities themselves. The number of principal components (common underlying factors) is not indicative of the number of clusters needed. PCA can still be interesting if you want to learn something about the predictive value of each article about a customer's interests.
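If you do want to explore that angle, a quick sketch:
pca <- prcomp(my.matrix, scale. = TRUE)  # scale. = TRUE assumes no article column is constant
summary(pca)                             # proportion of variance explained per component
round(pca$rotation[, 1:3], 2)            # loadings: which articles move together on the first components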
Sources
Michael J. Crawley, 2005. Statistics. An Introduction using R.
Gerry P. Quinn and Michael J. Keough, 2002. Experimental Design and Data Analysis for Biologists.
Wikipedia: hierarchical clustering, k-means, mean shift, PCA

Looking for a good reference on calculating permutations

As a programmer, I frequently need to know how to calculate the number of permutations of a set, usually for estimation purposes.
There are a lot of different ways to specify the allowable combinations, depending on the problem at hand. For example, given the set of letters A, B, C, D:
Assuming a 4-digit result, how many ways can those letters be arranged?
What if you can have 1, 2, 3 or 4 digits, then how many ways?
What if you are only allowed to use each letter at most once? Twice?
What if you must avoid the same letter appearing twice in a row, but twice is OK if they are not in a row?
Etc. I'm sure there are many more.
Does anyone know of a web reference or book that talks about this subject in terms that a non-mathematician can understand? Thanks!
Assuming a 4-digit result, how many ways can those letters be arranged?
When picking the 1st digit, you have 4 choices: one of A, B, C and D. It is the same when picking the 2nd, 3rd and 4th, since repetition is allowed.
So you have a total of 4 * 4 * 4 * 4 = 256 choices.
What if you can have 1, 2, 3 or 4 digits, then how many ways?
It is easy to deduce from question 1: sum the counts for each length, 4 + 4^2 + 4^3 + 4^4 = 340.
What if you are only allowed to use each letter at most once?
When picking the 1st digit, you have 4 choices: one of A, B, C and D. When picking the 2nd, you have 3 choices (everything except the one you picked for the 1st), then 2 choices for the 3rd and 1 choice for the 4th.
So you have a total of 4 * 3 * 2 * 1 = 24 choices.
The concepts involved here are combinations, permutations and probability. Here is a good tutorial for understanding the difference.
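For quick estimates, the counts discussed above are one-liners in R (using the four letters A, B, C, D, so n = 4):
n <- 4
n^4              # 4-position strings, repetition allowed: 256
sum(n^(1:4))     # 1 to 4 positions, repetition allowed: 4 + 16 + 64 + 256 = 340
factorial(n)     # all 4 letters, each used at most once: 24
choose(n, 2)     # picking 2 of the 4 letters when order does not matter: 6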
First of all, the topics you are speaking of are:
Permutations (where the order matters)
Combinations (order doesn't matter)
I would recommend Math Tutor DVD for teaching yourself math topics. The "Probability and Statistics" disc set will give you the formulas and skills you need to solve the problems. It's great because it's the closest thing you can get to going back to school: a teacher solves problems on a whiteboard for you.
I've found a clip on the Combinations chapter of the video for you to check out.
If you need to do more than just count the number of combinations and permutations, if you actually need to generate the sequences, then have a look at Donald Knuth's books Generating All Combinations and Partitions and Generating All Tuples and Permutations. He goes into great detail regarding algorithms subject to various restrictions, looking at the advantages and disadvantages of different solutions for each problem.
It all depends on how simple you need the explanation to be.
The topic you are looking for is called "Permutations and Combinations".
Here's a fairly simple introduction. There are dozens like this on the first few pages of Google results.
