Let's say we have a bookshelf that can fit 6 books. We want 4 computer science books and 2 physics books, but the computer science books should be together and the physics books should also be together. We have 8 computer science and 6 physics books in total. In how many ways can we do this?
I believe that the answer is like this:
c(8,4)*c(6,2) + c(6,2)*c(8,4)
but my instructor solved it this way (by p I mean permutation):
p(8,4)*p(6,2) + p(6,2)*p(8,4)
Could you please tell me which one is right?
Your answer is correct if you don't care about the order of the books within each group on the shelf. Your instructor's answer is correct if you count different orderings of the books within each of the two groups as different ways of filling the bookshelf. (In both answers, the two equal terms account for whether the computer science block or the physics block comes first.)
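A quick check of both counts in R (choose() gives combinations; a permutation count P(n, k) equals choose(n, k) * factorial(k)):

perm <- function(n, k) choose(n, k) * factorial(k)  # P(n, k)
2 * choose(8, 4) * choose(6, 2)  # unordered within each block: 2100
2 * perm(8, 4) * perm(6, 2)      # ordered within each block: 100800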
Related
I have to divide a class of 50 students writing a dissertation into 10 discussion groups of 5 members each. In theory, there are 1.35363 × 10^37 possible ways of doing this, which is just the result of 50!/((5!)^10 × 10!), if it is already decided that the groups will consist of 5.
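A quick check in R reproduces this figure:

factorial(50) / (factorial(5)^10 * factorial(10))  # about 1.35363e+37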
However, each group is to be led by a facilitator. This reduces the number of possible combinations considerably, because each facilitator has one field of expertise among 5 possible ones, which should be matched to the topics the students are writing about as much as possible. If there are three facilitators with competence A, three with competence B, two with competence C, one with competence D and one with competence E, and 15 students are assigned to A, 15 to B, 10 to C, 5 to D and 5 to E, the number of possible combinations comes down to 252,505.
But both students and facilitators keep advocating for the use of more criteria, instead of just focusing on field of expertise: for example, wanting to be in a group of students who know each other, or being in a group with a facilitator who has particular knowledge of a specific research method.
I am trying to illustrate my intuitive reasoning, which tells me that each new criterion increases the complexity/impossibility of the task, if the objective is a completely efficient solution. But I can't get my head around expressing this analytically in a satisfactory manner.
Is my reasoning correct that adding criteria would reduce the number of possibilities that can be discarded following the inclusion-exclusion principle, thus making the task more complex by leaving more possible combinations in play? I also think that if the criteria are not compatible (for example, if students who know each other are writing about different topics, and there aren't enough competent facilitators), certain constraints become infeasible.
You need to distinguish between computational complexity and human complexity. Adding constraints almost automatically increases the human complexity of the problem, in the sense that there is more to wrap your mind around. But it isn't true that the computational complexity always increases; at least sometimes it decreases.
For example, say you have a set of 200 items and you want to determine if there is a subset of them which satisfies some constraint. Depending on the constraint, there might be no feasible way to do it; after all, 2^200 is much too large to brute-force. Now add the constraint that the subset needs to have exactly 3 elements. All of a sudden it is possible to brute-force (just run through all 1,313,400 3-element subsets until you either find a solution or determine that none exists). This is enough to show that it isn't true that adding a constraint always makes a problem intrinsically more difficult. In the discrete case a new constraint can cut down on the size of the search space in a way that can be exploited. In the continuous case it can reduce degrees of freedom and thus lower the dimension of the problem. This isn't to say that an extra constraint always makes things easier; as a rule of thumb, additional constraints tend to make a problem more difficult.
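A minimal sketch in R of that brute force, with a made-up predicate standing in for "some constraint":

items <- rnorm(200)                          # hypothetical set of 200 items
satisfies <- function(s) abs(sum(s)) < 1e-3  # stand-in constraint for illustration
hits <- combn(200, 3, FUN = function(idx) satisfies(items[idx]))  # all choose(200, 3) subsets
any(hits)  # TRUE if one of the 1,313,400 three-element subsets meets the constraint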
Your actual problem isn't spelled out enough to give concrete advice. One possibility (and one way to handle a proliferation of somewhat extraneous constraints) is to divide the constraints into hard constraints, which need to be satisfied, and soft constraints, which are merely desired but not strictly needed. Turn it into an optimization problem: find the solution which maximizes the number of soft constraints that are satisfied, subject to the condition that it satisfies the hard constraints. Perhaps you can formulate it as an integer programming problem and find an exact solution. Or, if it is easy to generate solutions that satisfy the hard constraints and it is easy to mutate one such solution to obtain another (e.g. swap two students who are in different groups), then an evolutionary algorithm would be a reasonable heuristic.
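A rough sketch in R of a single-solution version of that mutation idea (a swap-based local search rather than a full evolutionary algorithm; the scoring function is a placeholder, since the real soft constraints aren't spelled out here):

n_students <- 50; n_groups <- 10
grouping <- rep(1:n_groups, each = 5)  # a feasible start: 10 groups of 5
n_soft_satisfied <- function(g) 0      # placeholder: plug in your soft-constraint counter

best <- n_soft_satisfied(grouping)
for (step in 1:10000) {
  i <- sample(n_students, 1)                      # pick a student at random
  j <- sample(which(grouping != grouping[i]), 1)  # and one from a different group
  cand <- grouping
  cand[c(i, j)] <- cand[c(j, i)]                  # swap their groups (sizes stay intact)
  if (n_soft_satisfied(cand) >= best) {           # keep the swap if it doesn't hurt
    grouping <- cand
    best <- n_soft_satisfied(cand)
  }
}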
I have a large sales database from a 'home and construction' retailer, and I need to know which of the customers are electricians, plumbers, painters, etc.
My first approach was to select the articles related to a specialty (wire [article] is related to electricians [specialty], for example), and then, based on customer sales, identify who those customers are.
But this is a lot of work.
My second approach is to do a cluster segmentation first, and then discover which clusters belong to which specialty. (This is much better, because I would be able to discover new segments.)
But how can I do that? What type of clustering should I use: k-means, fuzzy? What variables should I feed into the model? Should I use PCA to know how many clusters to look for?
The header of my data (simplified):
customer_id | transaction_id | transaction_date | item_article_id | item_group_id | item_category_id | item_qty | sales_amt
Any help would be appreciated
(sorry for my English)
You want to identify classes of customers based on what they buy (I presume this is for marketing reasons). This calls for a clustering approach. I will talk you through the entire setup.
The clustering space
Let us first consider what exactly you are clustering: either orders or customers. In either case, the way you characterize the items and the distances between them is the same. I will discuss the basic case for orders first, and then explain the considerations that apply to clustering by customers instead.
For your purpose, an order is characterized by what articles were purchased, and possibly also how many of them. In terms of a space, this means that you have a dimension for each type of article (item_article_id), for example the "wire" dimension. If all you care about is whether an article is bought or not, each item has a coordinate of either 0 or 1 in each dimension. If some order includes wire but not pipe, then it has a value of 1 on the "wire" dimension and 0 on the "pipe" dimension.
However, there is something to say for caring about the quantities. Perhaps plumbers buy lots of glue while electricians buy only small amounts. In that case, you can set the coordinate in each dimension to the quantity of the corresponding article (presumably item_qty). So suppose you have three articles, wire, pipe and glue, then an order described by the vector (2, 3, 0) includes 2 wire, 3 pipe and 0 glue, while an order described by the vector (0, 1, 4) includes 0 wire, 1 pipe and 4 glue.
If there is a large spread in the quantities for a given article, i.e. if some orders include orders of magnitude more of some article than other orders do, then it may be helpful to work with a log scale. Suppose you have these four orders:
2 wire, 2 pipe, 1 glue
3 wire, 2 pipe, 0 glue
0 wire, 100 pipe, 1 glue
0 wire, 300 pipe, 3 glue
The former two orders look like they may belong to electricians while the latter two look like they belong to plumbers. However, if you work with a linear scale, order 3 will turn out to be closer to orders 1 and 2 than to order 4. We fix that by using a log scale for the vectors that encode these orders (I use the base 10 logarithm here, but it does not matter which base you take because they differ only by a constant factor):
(0.30, 0.30, 0)
(0.48, 0.30, -2)
(-2, 2, 0)
(-2, 2.48, 0.48)
Now order 3 is closest to order 4, as we would expect. Note that I have used -2 as a special value to indicate the absence of an article, because the logarithm of 0 is not defined (log(x) tends to negative infinity as x tends to 0). -2 means that we pretend that the order included 1/100th of the article; you could make the special value more or less extreme, depending on how much weight you want to give to the fact that an article was not included.
The input to your clustering algorithm (regardless of which algorithm you take, see below) will be a position matrix with one row for each item (order or customer), one column for each dimension (article), and either the presence (0/1), amount, or logarithm of the amount in each cell, depending on which you choose based on the discussion above. If you cluster by customers, you can simply sum the amounts from all orders that belong to that customer before you calculate what goes into each cell of your position matrix (if you use the log scale, sum the amounts before taking the logarithm).
Clustering by orders rather than by customers gives you more detail, but also more noise. Customers may be consistent within an order but not between orders; perhaps a customer sometimes behaves like a plumber and sometimes like an electrician. This is a pattern that you will only find if you cluster by orders. You will then find how often each customer belongs to each cluster; perhaps 70% of somebody's orders belong to the electrician type and 30% to the plumber type. On the other hand, a plumber may buy only pipe in one order and only glue in the next. Only if you cluster by customers and sum the amounts over their orders do you get a balanced view of what each customer needs on average.
From here on I will refer to your position matrix by the name my.matrix.
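A minimal sketch of building my.matrix in R, assuming the transactions sit in a data frame named sales with the columns listed in the question:

# Sum quantities per customer and article; the result has one row per customer
# and one column per article.
my.matrix <- as.matrix(xtabs(item_qty ~ customer_id + item_article_id, data = sales))

# Optional: switch to the log scale discussed above, with -2 standing in for
# articles that were never bought.
my.matrix <- ifelse(my.matrix > 0, log10(my.matrix), -2)

To cluster by orders instead, substitute transaction_id for customer_id in the formula.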
The clustering algorithm
If you want to be able to discover new customer types, you probably want to let the data speak for themselves as much as possible. Good old-fashioned hierarchical clustering with complete linkage (CLINK) may be an appropriate choice in this case. In R, you simply do hclust(dist(my.matrix)) (this uses the Euclidean distance measure, which is probably good enough in your case). It joins closely neighbouring items or clusters together until all items are categorized in a hierarchical tree. You can treat any branch of the tree as a cluster, observe typical article amounts for that branch, and decide whether that branch represents a customer segment by itself, should be split into sub-branches, or joined with a sibling branch instead. The advantage is that you find the "full story" of which items and clusters of items are most similar to each other, and how much. The disadvantage is that the outcome of the algorithm does not tell you where to draw the borders between your customer segments; you can cut up the clustering tree in many ways, so it's up to your interpretation how you want to identify your customer types.
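In code, the whole pipeline is only a few lines (the choice of 5 clusters below is purely illustrative; the point of this method is that you decide where to cut after inspecting the tree):

hc <- hclust(dist(my.matrix))  # complete linkage is hclust's default
plot(hc)                       # inspect the dendrogram to decide where to cut
segments <- cutree(hc, k = 5)  # or cutree(hc, h = ...) to cut at a given height
table(segments)                # customers (or orders) per segment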
On the other hand, if you are comfortable fixing the number of clusters (k) beforehand, k-means is a very robust way to get a segmentation of your customers into k distinct types. In R, you would do kmeans(my.matrix, k). For marketing purposes, it may be sufficient to have (say) 5 different profiles of customers that you make custom advertisements for, rather than treating all customers the same. With k-means you don't explore all of the diversity that is present in your data, but you might not need to do so anyway.
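A sketch, again with k = 5 purely as an example:

set.seed(1)                                        # k-means starts from random centers
km <- kmeans(my.matrix, centers = 5, nstart = 25)  # nstart restarts guard against bad starts
km$centers                                         # typical article amounts per profile
table(km$cluster)                                  # how many customers in each profile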
If you don't want to fix the number of clusters beforehand, but you also don't want to manually decide where to draw the borders between the segments afterwards, there is a third possibility. You start with the k-means algorithm, letting it generate a number of cluster centers that is much larger than the number of clusters you hope to end up with (for example, if you hope to end up with about 10 clusters, let the k-means algorithm look for 200). Then, use the mean shift algorithm to further cluster the resulting centers. You will end up with a smaller number of compact clusters. The approach is explained in more detail by James Li over here. You can use the mean shift algorithm in R with the ms function from the LPCM package; see this documentation.
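A sketch of that two-stage approach (this assumes the LPCM package is installed; I leave ms's bandwidth at its default here, though it is worth experimenting with, and the component name below follows the LPCM documentation):

library(LPCM)
km <- kmeans(my.matrix, centers = 200)  # deliberately far too many centers
res <- ms(km$centers)                   # mean shift merges nearby centers
res$cluster.center                      # the final, compact set of centers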
About using PCA
PCA will not tell you how many clusters you need. PCA answers a different question: which variables seem to represent a common underlying (hidden) factor. In a sense, it is a way to cluster variables, i.e. properties of entities, not to cluster the entities themselves. The number of principal components (common underlying factors) is not indicative of the number of clusters needed. PCA can still be interesting if you want to learn something about the predictive value of each article about a customer's interests.
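If you do want to explore that, a minimal sketch (scale. = TRUE standardizes the articles first; drop any constant columns, e.g. never-bought articles, beforehand or scaling will fail):

pc <- prcomp(my.matrix, scale. = TRUE)  # principal component analysis
summary(pc)                             # variance explained per component
pc$rotation[, 1:3]                      # article loadings on the first three components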
Sources
Michael J. Crawley, 2005. Statistics. An Introduction using R.
Gerry P. Quinn and Michael J. Keough, 2002. Experimental Design and Data Analysis for Biologists.
Wikipedia: hierarchical clustering, k-means, mean shift, PCA
I'm doing some basic text analysis in R and want to count the number of lines in a transcript from a .txt file that I load into R. With the example below, I want a count in which each speaker's turn is tallied in lines, such that Mr. Smith = 4, Mr. Gordon = 6, Mr. Catalano = 3.
[71] "\"511\"\t\"MR Smith: Mr Speaker, I like the spirit in which we are agreeing on this. The administration of FUFA is present here. FUFA could be used as a conduit, but the intention of what hon. Beti Kamya brought up and what hon. Rose Namayanja has said was okufuwa - just giving a token of appreciation to the players who achieved this.\""
[72] "\"513\"\t\"MR Gordon: Thank you very much, Mr Speaker. FUFA is an organisation and the players are the ones who got the cup for us. To promote motivation in all activities, not only football, you should remunerate people who have done well. In this case, we have heard about FUFA with their problems. They have not paid water bills and they can take this money to pay the water bills. If we agree that this money is supposed to go to the players and the coaches, then when it goes there they would know the amount and they will sit among themselves and distribute according to what we will have given. (Applause) I thank you.\""
[73] "\"515\"\t\"MR Catalano: Mr Speaker, I want to give information to my dear colleagues. The spirit is very good but you must be mindful that the administration of FUFA is what has made this happen. The money to the players. That indicates to you that FUFA is very trustworthy. This is not the old FUFA we are talking about.\""
The function countLine() doesn't work since it requires a connection; these are just .txt files imported into R. I realize that the line count depends on the formatting of whatever program the text is opened in, but any general help on whether this is feasible would be appreciated. Thanks.
I didn't think your example was reproducible, so I edited it to contain what you posted, but I do not know if the names will match:
txtvec <- structure(list(`'511' ` = "MR Smith: Mr Speaker, I like the spirit in which we are agreeing on this. The administration of FUFA is present here. FUFA could be used as a conduit, but the intention of what hon. Beti Kamya brought up and what hon. Rose Namayanja has said was okufuwa - just giving a token of appreciation to the players who achieved this.\"",
`'513' ` = "MR Gordon: Thank you very much, Mr Speaker. FUFA is an organisation and the players are the ones who got the cup for us. To promote motivation in all activities, not only football, you should remunerate people who have done well. In this case, we have heard about FUFA with their problems. They have not paid water bills and they can take this money to pay the water bills. If we agree that this money is supposed to go to the players and the coaches, then when it goes there they would know the amount and they will sit among themselves and distribute according to what we will have given. (Applause) I thank you.\"",
`'515' ` = "MR Catalano: Mr Speaker, I want to give information to my dear colleagues. The spirit is very good but you must be mindful that the administration of FUFA is what has made this happen. The money to the players. That indicates to you that FUFA is very trustworthy. This is not the old FUFA we are talking about.\""), .Names = c("'511'\t",
"'513'\t", "'515'\t"))
So it's only a matter of running a regular expression across it and tabling the results:
> table( sapply(txtvec, function(x) sub("(^MR.+)\\:.+", "\\1", x) ) )
#MR Catalano MR Gordon MR Smith
1 1 1
There was concern expressed that the names were not in the original structure. Here is another version, with an unnamed vector and a slightly modified regex:
txtvec <- c("\"511\"\t\"\nMR Smith: Mr Speaker, I like the spirit in which we are agreeing on this. The administration of FUFA is present here. FUFA could be used as a conduit, but the intention of what hon. Beti Kamya brought up and what hon. Rose Namayanja has said was okufuwa - just giving a token of appreciation to the players who achieved this.\"",
"\"513\"\t\"\nMR Gordon: Thank you very much, Mr Speaker. FUFA is an organisation and the players are the ones who got the cup for us. To promote motivation in all activities, not only football, you should remunerate people who have done well. In this case, we have heard about FUFA with their problems. They have not paid water bills and they can take this money to pay the water bills. If we agree that this money is supposed to go to the players and the coaches, then when it goes there they would know the amount and they will sit among themselves and distribute according to what we will have given. (Applause) I thank you.\"",
"\"515\"\t\"\nMR Catalano: Mr Speaker, I want to give information to my dear colleagues. The spirit is very good but you must be mindful that the administration of FUFA is what has made this happen. The money to the players. That indicates to you that FUFA is very trustworthy. This is not the old FUFA we are talking about.\""
)
table( sapply(txtvec, function(x) sub(".+\\n(MR.+)\\:.+", "\\1", x) ) )
#MR Catalano MR Gordon MR Smith
# 1 1 1
To count the number of "lines" these would occupy on a wrapping device with 80 characters per line you could use this code (which could easily be converted to a function):
sapply(txtvec, function(tt) 1+nchar(tt) %/% 80)
#[1] 5 8 4
This is raised in the comments, but it really bears being its own answer:
You cannot "count lines" without defining what a "line" is. A line is a very vague concept and can vary by the program being used.
Unless, of course, the data contains some indicator of a line break, such as \n. But even then, you would not be counting lines, you would be counting line breaks. You would then have to ask yourself whether the hardcoded line breaks are in accord with what you are hoping to analyze.
--
If your data does not contain line breaks, but you still want to count the number of lines, then we're back to the question of how you define a line. The most basic way, as @flodel suggests, is to use character length. For example, you can define a line as 76 characters long, and then take
ceiling(nchar(X) / 76)
This of course assumes that you can cut words. (If you need words to remain whole, then you have to get craftier; see the sketch below.)
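If words must stay whole, base R's strwrap() wraps text at word boundaries, so the number of lines at a given width is simply:

sapply(txtvec, function(x) length(strwrap(x, width = 76)))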
Possible duplicate: Algorithm to find minimum number of weighings required to find defective ball from a set of n balls
We have n coins. One of them is fake and is either heavier or lighter (we don't know which). We have a balance scale with two pans. How can we find the fake coin in p weighings?
Can you give me a hand with writing such a program? There's no need for a whole program, just ideas.
Thank you.
This is known as the balance puzzle. See Marcel Kołodziejczyk's "Two-pan balance and generalized counterfeit coin problem" for a generalization of this problem.
I remember solving this for n=12 and 13, partly by hand and then with a program at the end. I don't know how I would solve it for a general n... but I know how I'd start - by considering small values of n and doing it by hand.
I suspect there are essentially patterns that can be used recursively for this... but you'll find them much easier to discover with pen and paper for small values (n=4 to 7, for example) than by coding.
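For orientation while experimenting, the classical bound is worth having at hand: with p weighings (three outcomes each), you can handle at most (3^p - 3)/2 coins if you must also report whether the fake is heavy or light, and (3^p - 1)/2 if you only have to point at it. In R:

max_coins <- function(p, must_report_heavy_light = TRUE) {
  if (must_report_heavy_light) (3^p - 3) / 2 else (3^p - 1) / 2
}
max_coins(3)        # 12: the classic twelve-coin puzzle
max_coins(3, FALSE) # 13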
Put one coin on each side: the real ones will balance each other out, while the fake will tip the scale one way or the other. When the scale isn't balanced, one of the two coins you just put on is fake; weigh each against a known-real coin to see which.
If the coins are objects you're handed, then you should be able to do that in a program quite easily.
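A minimal sketch of that strategy in R, assuming the coins arrive as a numeric vector in which all real coins share one weight (and n >= 3):

find_fake <- function(coins) {
  n <- length(coins)
  for (i in seq(1, n - 1, by = 2)) {  # weigh coins i and i+1 against each other
    if (coins[i] != coins[i + 1]) {   # scale tips: one of these two is fake
      ref <- if (i + 2 <= n) coins[i + 2] else coins[i - 1]  # a known-real coin
      return(if (coins[i] != ref) i else i + 1)
    }
  }
  n                                   # odd n: all pairs balanced, so the leftover coin is fake
}
find_fake(c(1, 1, 1, 1.2, 1, 1))  # 4

Note that this uses about n/2 weighings, far more than the optimal log-base-3 number, but it matches the pairwise idea described above.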
As a programmer, I frequently need to know how to calculate the number of permutations of a set, usually for estimation purposes.
There are a lot of different ways to specify the allowable combinations, depending on the problem at hand. For example, given the set of letters A, B, C, D:
Assuming a 4-digit result, how many ways can those letters be arranged?
What if you can have 1, 2, 3 or 4 digits, then how many ways?
What if you are only allowed to use each letter at most once? Twice?
What if you must avoid the same letter appearing twice in a row, but if they are not in a row, then twice is OK?
Etc. I'm sure there are many more.
Does anyone know of a web reference or book that talks about this subject in terms that a non-mathematician can understand?
Thanks!
Assuming a 4-digit result, how many ways can those letters be arranged?
When picking the 1st position, you have 4 choices: one of A, B, C and D. It is the same when picking the 2nd, 3rd and 4th, since repetition is allowed.
So in total you have 4 * 4 * 4 * 4 = 256 choices.
What if you can have 1, 2, 3 or 4 digits, then how many ways?
It is easy to deduce from question 1: sum the counts for each length, 4 + 4^2 + 4^3 + 4^4 = 4 + 16 + 64 + 256 = 340.
What if you are only allowed to use each letter at most once?
When picking the 1st position, you have 4 choices: one of A, B, C and D. When picking the 2nd, you have 3 choices, excluding the one you picked for the 1st; then 2 choices for the 3rd and 1 choice for the 4th.
So in total you have 4 * 3 * 2 * 1 = 24 choices.
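Both results are quick to confirm in R:

4^4           # 256: four positions, repetition allowed
factorial(4)  # 24: each letter used at most once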
The concepts involved here are combinations, permutations and probability. Here is a good tutorial for understanding the difference.
First of all, the topics you are speaking of are:
Permutations (where the order matters)
Combinations (order doesn't matter)
I would recommend Math Tutor DVD for teaching yourself math topics. The "probability and statistics" disc set will give you the formulas and skills you need to solve the problems. It's great because it's the closest thing you can get to going back to school: a teacher solves problems on a whiteboard for you.
I've found a clip on the Combinations chapter of the video for you to check out.
If you need to do more than just count the number of combinations and permutations, and actually need to generate the sequences, then see Donald Knuth's fascicles Generating All Combinations and Partitions and Generating All Tuples and Permutations. He goes into great detail regarding algorithms subject to various restrictions, looking at the advantages and disadvantages of different solutions for each problem.
It all depends on how simple you need the explanation to be.
The topic you are looking for is called "Permutations and Combinations".
Here's a fairly simple introduction. There are dozens like this on the first few pages of Google results.