I need to explain to the client why dupes are showing up between 2 supposedly different exams. It's been 20 years since Prob and Stats.
I have a generated Multiple choice exam.
There are 192 questions in the database,
100 are chosen at random (no dupes).
Obviously, there is a 100% chance of there being at least 8 dupes between any two exams so generated. (Pigeonhole principle)
How do I calculate the probability of there being
25 dupes?
50 dupes?
75 dupes?
-- Edit after the fact --
I ran this through Excel, taking sums of the probabilities from n to 100.
For this particular problem, the probabilities were
n P(n+ dupes)
40 97.5%
52 ~50%
61 ~0
Erm, this is really really hazy for me. But there are (192 choose 100) possible exams, right?
And there are (100 choose N) ways of picking N dupes, each with (92 choose 100-N) ways of picking the rest of the questions, no?
So isn't the probability of picking N dupes just:
(100 choose N) * (92 choose 100-N) / (192 choose 100)
EDIT: So if you want the chance of N or more dupes instead of exactly N, you sum the numerator of that fraction for every value from N up to 100, then divide by the same (192 choose 100).
Errrr, maybe...
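For what it's worth, that formula is exactly the hypergeometric distribution, so if you have R handy you can check it directly (dhyper's arguments here are m = 100 questions on the first exam, n = 92 unused ones, k = 100 drawn for the second):
dhyper(25, m = 100, n = 92, k = 100)                      # P(exactly 25 dupes): essentially 0
dhyper(52, m = 100, n = 92, k = 100)                      # P(exactly 52), the most likely count
phyper(51, m = 100, n = 92, k = 100, lower.tail = FALSE)  # P(52 or more dupes): roughly 0.5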
It's probably higher than you think. I won't attempt to duplicate this article: http://en.wikipedia.org/wiki/Birthday_paradox
Once you've created the first exam, there are 92 questions that have never been used and 100 that have. If you now generate another exam with 100 questions in it, you are choosing from a set of 92 questions that have never been used and 100 that have. Clearly you are going to get quite a few duplicates.
You would expect to get (100/192) * 100 duplicates, i.e. in any two randomly chosen exams there will, on average, be about 52 duplicate questions.
If you want the probability that there are 25, or 75, or whatever, then you have two choices.
a) Work out the maths
b) Simulate a few runs on a computer (see the sketch below)
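Option (b) takes only a few lines in R, for instance:
set.seed(42)
dupes <- replicate(10000, length(intersect(sample(192, 100), sample(192, 100))))
mean(dupes)        # close to the expected (100/192) * 100 = 52.08
mean(dupes >= 25)  # estimated P(25 or more dupes); effectively 1 here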
I'm looking for a way to generate different data frames where a variable is distributed randomly among a set number of observations, but where the sum of those values adds up to a predetermined total. More specifically, I'm looking for a way to distribute 20,000,000 votes among 15 political parties randomly. I've looked around the forums a bit but can't seem to find an answer, and while trying to generate the data on my own I've gotten nowhere; I don't even know where to begin. The distribution itself does not matter, though I'd love to be able to influence the way it distributes the votes.
Thank you :)
You could make a vector of 20,000,000 samples of the numbers 1 through 15 then make a table from them, but this seems rather computationally expensive, and will result in an unrealistically even split of votes. Instead, you could normalise the cumulative sum of 15 numbers drawn from a uniform distribution and multiply by 20 million. This will give a more realistic spread of votes, with some parties having significantly more votes than others.
# Normalised cumulative sums of 15 uniform draws give 16 break points on [0, 1]
my_sample <- cumsum(runif(15))
my_sample <- c(0, my_sample/max(my_sample))
# Round the scaled break points *before* differencing, so the parts sum to exactly 20,000,000
votes <- diff(round(my_sample * 20000000))
votes
#> [1] 725623 2052337 1753844 61946 1173750 1984897
#> [7] 554969 1280220 1381259 1311762 766969 2055094
#> [13] 1779572 2293662 824096
These will add up to 20,000,000:
sum(votes)
#> [1] 2e+07
And we can see quite a "natural looking" spread of votes.
barplot(setNames(votes, letters[1:15]), xlab = "party")
I'm guessing if you substitute rexp for runif in the above solution this would more closely match actual voting numbers in real life, with a small number of high-vote parties and a large number of low-vote parties.
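If you want to try that, it is a one-line change to the draws (everything else, including the rounding trick, stays the same):
my_sample <- cumsum(rexp(15))
my_sample <- c(0, my_sample/max(my_sample))
votes <- diff(round(my_sample * 20000000))
barplot(setNames(votes, letters[1:15]), xlab = "party")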
I've got a number and I need to split it into 2 factors so that I can put them into Euler's phi function for RSA encryption. How do I find the two integers?
For example, the number 1387 = 19 * 73, so I get phi(1387) = (19-1)*(73-1) = 1296.
How do I get the 19 and 73 the fastest way? I don't want to use any internet calculators because I won't be able to use them in my exam.
In RSA, n = pq; in your example, p = 19 and q = 73. Because p and q are both prime, and because RSA's security relies on n not being easily split into two primes, your teacher won't give you too large a value of n, so you can guess the factors. For example, take n = 1387: the last digit is 7, so the last digits of the two factors must multiply to something ending in 7, i.e. the pattern is either ?1 * ?7 or ?3 * ?9. If it is ?3 * ?9, then ?3 may be 3, 13, 23, 43, 53, 73, 83, ... and ?9 may be 19, 29, 59, 79, 89, ...; now divide 1387 by each ?3 candidate and check whether the result is a matching prime ending in 9. It's just my thought; good luck to you.
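If you want to check your by-hand work afterwards, trial division up to sqrt(n) is enough for exam-sized numbers. A minimal R sketch (the function name is my own):
factor_semiprime <- function(n) {
  # try every candidate up to sqrt(n); the first hit gives both factors
  for (p in 2:floor(sqrt(n))) {
    if (n %% p == 0) return(c(p, n / p))
  }
  stop("n is prime")
}
pq <- factor_semiprime(1387)  # 19 73
(pq[1] - 1) * (pq[2] - 1)     # phi(1387) = 1296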
I have the following optimization problem. A company produces a product, say Big A. Producing it requires 5 processes (please find the detail table below). For each process, there are a number of suppliers that supply raw material for that particular process; e.g. for process 1, there are 3 suppliers: 1, 2 & 3.
The constraint for the CEO of this company, say C, is that for each process he has to purchase supplies from supplier 1 first, then get additional supplies from supplier 2, and so on.
The optimization problem is: if C wants 700 units of total material to produce 1 unit of Big A, how will he do it at minimum cost? And how does the optimization change if the required amount increases to 1,500 units?
I'll be grateful for a solution to this problem, but if somebody can suggest some references regarding this kind of problem, that would be a great help too. I'm mainly using R here.
Process  Supplier  Cost  Units  Cumm_Cost  Cumm_Unit
1        1         10    100    10         100
1        2         20    110    30         210
1        3         10    200    40         410
2        1         20    100    20         100
2        2         30    150    50         250
2        3         10    150    60         400
3        1         40    130    40         130
3        2         30    140    70         270
3        3         50    120    120        390
4        1         20    120    20         120
4        2         40    120    60         240
4        3         20    180    80         420
5        1         30    180    30         180
5        2         10    160    40         320
5        3         30    140    70         460
Regards,
I will start by solving the specific problem that you have posted and then will demonstrate how to formulate the problem more abstractly. For simplicity, I will use Excel's Solver add-in to solve the problem, but any configuration of a modeling language (such as AIMMS, AMPL, LINGO, OPL, MOSEL and numerous others) with a solver (CPLEX, GUROBI, GLPK, CBC and numerous others) can be used. If you would like to use R, there exists an lpSolve package that calls the lpSolve solver (which is not the best one in the world, to be honest, but it is free of charge).
Note that for "real" (large scale) integer problems, the commercial solvers CPLEX, GUROBI and XPRESS perform a lot better than others. The first completely free solver that performs decently in most tests (including Hans Mittelmann's benchmarks page) is CBC. CBC can be hooked up in Excel and solve the built-in Solver model without restrictions on the number of constraints or variables, by using this add-in. Therefore, assuming that most CPU time is going to be spent by the optimization algorithm, using CBC/OpenSolver seems like an efficient choice.
SPREADSHEET SETUP
I follow some conventions for convenience:
Decision variable cells are marked Green.
Constraints are marked red.
Data are marked grey.
Objective function is marked blue.
First, let's augment the table you presented as follows:
The added columns explained briefly:
Selected?: equals 1 if the (Process, Supplier) combo is allowed to produce a positive quantity, zero otherwise.
Quantity: the quantity produced, defined for each (Process, Supplier) combo.
Max Quantity?: equals 1 if the supplier produces the maximum amount of units for that particular process.
Quantity UB: equals Units * Selected?. This makes the upper bound either equal to Units, when the Supplier is allowed to produce this Process, or zero otherwise.
Quantity LB: equals Units * Max Quantity?. This is to ensure that whenever the Max Quantity? column is 1, the produced quantity will be equal to Units.
Selection: For the 1st supplier, it equals 0. For the 2nd and 3rd suppliers, it equals the Max Quantity? of the previous supplier (row) minus the Selected? of the current supplier (row).
A screenshot with formulas:
There exist two more constraints:
There must be at least one item produced from each process and
The total number of items should be 700 (or later 1,500).
Here is their setup:
and here are the formulas:
In brief, we use SUMIF to sum the quantities that belong to each process, which we are going to constrain to be at least 1 item for each process.
To finish the spreadsheet setup, we need to calculate the objective function, namely the cost of the allocation. This is easily done by taking the SUMPRODUCT of columns Quantity and Cost. Note that the cumulative quantities are derived data and not very useful in the current context.
After the above steps, the spreadsheet looks like below:
SOLVER MODEL
For the solver model we need to declare
The Objective
The Decisions
The Constraints
The Solver (and tweak some parameters if necessary).
For ease of exposition, I have given each range the name of its header. The solver model looks as follows:
It should all be self-explanatory, except possibly the Selection >= 0 part. The Selection column equals the binary Max Quantity? of the previous supplier minus the Selected? of the current supplier, so Selection >= 0 means: Max Quantity? of the previous supplier >= Selected? of the current supplier. Therefore, if the previous supplier does not produce at max quantity (binary = 0), the current supplier cannot produce.
Then we need to make sure that the solver settings are OK:
and solve the problem.
Solution for req = 700:
As we see, the model tries to avoid processes 3 and 5 as much as possible, and satisfies the "at least 1 item per process" constraint by picking exactly 1 item for processes 3 and 5. The objective function value is 11,710.
Solution for req = 1,500:
Here we need more capacity, but process 3 still seems expensive, so the model avoids it by allocating only what is necessary (just 1 unit, to supplier 1).
I hope this helps. The spreadsheet can be downloaded here. I include the definition of the mathematical model below, in case you would like to transfer it to another language.
MATHEMATICAL FORMULATION
A formal definition of your problem is as follows.
SETS:
P: the set of processes (here p = 1, ..., 5)
S: the ordered set of suppliers for each process (here s = 1, 2, 3)
PARAMETERS:
Cost[p,s]: unit cost charged by supplier s for process p
Units[p,s]: maximum number of units supplier s can provide for process p
Req: the total required amount (700, later 1,500)
Decisions:
Quantity[p,s] >= 0: units purchased from supplier s for process p
Selected[p,s] in {0,1}: 1 if supplier s is allowed to produce for process p
MaxQuantity[p,s] in {0,1}: 1 if supplier s produces its maximum for process p
Objective:
minimize the sum over all (p,s) of Cost[p,s] * Quantity[p,s]
Constraints:
C1: Quantity[p,s] <= Units[p,s] * Selected[p,s], for all p, s
C2: Quantity[p,s] >= Units[p,s] * MaxQuantity[p,s], for all p, s
C3: Selected[p,s] <= MaxQuantity[p,s-1], for all p and all s > 1
C4: sum over s of Quantity[p,s] >= 1, for each process p
C5: sum over all (p,s) of Quantity[p,s] = Req
Constraint explanation:
C1: A supplier cannot produce anything from a process if he has not been allocated to that process.
C2: If a supplier's maximum indicator is set to 1, then the production variable should be the maximum possible.
C3: We cannot select supplier s for process p if we have not produced the max quantity available from the previous supplier, s-1.
C4: We need to produce at least 1 item from each process.
C5: the total production from all processes and suppliers should equal the required amount.
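Since the question mentions R, here is a sketch of the same model with the lpSolve package mentioned earlier. The variable layout and helper names are my own, and it assumes the table data above; treat it as a starting point rather than a drop-in solution:
library(lpSolve)

cost  <- c(10, 20, 10,  20, 30, 10,  40, 30, 50,  20, 40, 20,  30, 10, 30)
units <- c(100, 110, 200,  100, 150, 150,  130, 140, 120,  120, 120, 180,  180, 160, 140)
proc <- rep(1:5, each = 3)   # process of each (Process, Supplier) row
sup  <- rep(1:3, times = 5)  # supplier of each row
n <- 15; req <- 700          # switch req to 1500 for the second question

# Variable layout: columns 1..15 = Quantity, 16..30 = Selected? (binary),
# 31..45 = Max Quantity? (binary). Only the quantities carry cost.
obj <- c(cost, rep(0, 2 * n))

rows <- list(); dirs <- character(0); rhs <- numeric(0)
add_row <- function(r, d, b) {
  rows[[length(rows) + 1]] <<- r; dirs <<- c(dirs, d); rhs <<- c(rhs, b)
}

for (i in 1:n) {
  # C1: Quantity <= Units * Selected?
  r <- rep(0, 3 * n); r[i] <- 1; r[n + i] <- -units[i]; add_row(r, "<=", 0)
  # C2: Quantity >= Units * Max Quantity?
  r <- rep(0, 3 * n); r[i] <- 1; r[2 * n + i] <- -units[i]; add_row(r, ">=", 0)
  # C3: Selected? <= Max Quantity? of the previous supplier in the same process
  if (sup[i] > 1) {
    r <- rep(0, 3 * n); r[n + i] <- 1; r[2 * n + i - 1] <- -1; add_row(r, "<=", 0)
  }
}
for (p in 1:5) {  # C4: at least 1 unit from each process
  r <- rep(0, 3 * n); r[which(proc == p)] <- 1; add_row(r, ">=", 1)
}
r <- rep(0, 3 * n); r[1:n] <- 1; add_row(r, "=", req)  # C5: total = req

sol <- lp("min", obj, do.call(rbind, rows), dirs, rhs,
          binary.vec = (n + 1):(3 * n))
sol$objval                                         # should reproduce the 11,710 above
matrix(sol$solution[1:n], ncol = 3, byrow = TRUE)  # Quantity per (process, supplier)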
Looks like you should look at the simplex algorithm (or some existing implementation of it).
Wikipedia has a fairly nice description of the algorithm, http://en.wikipedia.org/wiki/Simplex_algorithm
I would like to unit test the time-writing software used at my company. In order to do this I would like to create sets of random numbers that add up to a defined value.
I want to be able to control the parameters:
The min and max value of the generated numbers
The number (n) of generated numbers
The sum of the generated numbers
For example, in 250 days a person worked 2000 hours. The 2000 hours have to be randomly distributed over the 250 days. The maximum time spent per day is 9 hours and the minimum is 0.25 hours.
I worked my way through this SO question and found the method
diff(c(0, sort(runif(249)), 2000))  # runif defaults to [0, 1], so this yields one huge final gap
This results in 1 big number and 249 small numbers. That's why I would like to be able to set a min and max for the generated numbers. But I don't know where to start.
You will have no problem meeting any two out of your three constraints, but all three might be a problem. As you note, the standard way to generate N random numbers that add to a sum is to generate N-1 random numbers in the range of 0..sum, sort them, and take the differences. This is basically treating your sum as a number line, choosing N-1 random points, and your numbers are the segments between the points.
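For the 250-day example, that construction looks like this in R (sum and count constraints only; note the break points must be drawn over the full 0..2000 range, which is what the snippet in the question was missing):
n <- 250; total <- 2000
hours <- diff(c(0, sort(runif(n - 1, min = 0, max = total)), total))
sum(hours)  # 2000, up to floating-point rounding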
But this might not be compatible with constraints on the numbers themselves. For example, what if you want 10 numbers that add to 1000, but each has to be less than 100? That won't work. Even if you have ranges that are mathematically possible, forcing compliance with all the constraints might mean sacrificing uniformity or other desirable properties.
I suspect the only way to do this is to keep the sum constraint, the N constraint, do the standard N-1, sort, and diff thing, but restrict the resolution of the individual randoms to your desired minimum (in other words, instead of 0..100, maybe generate 0..10 times 10).
Or, instead of generating N-1 uniformly random points along the line, generate a random sample of points along the line within a similar low-resolution constraint.
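Here is a sketch of that last idea: sampling distinct break points from a grid of 0.25-hour steps guarantees the per-day minimum, although nothing here enforces the 9-hour maximum, for the reasons above:
n <- 250; total <- 2000; step <- 0.25
grid <- seq(step, total - step, by = step)  # interior grid points, all multiples of 0.25
hours <- diff(c(0, sort(sample(grid, n - 1)), total))
c(sum(hours), min(hours))  # exactly 2000 in total; every value >= 0.25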
This is more of a maths question than programming but I figure a lot of people here are pretty good at maths! :)
My question is: given a 9 x 9 grid (81 cells) that must contain the numbers 1 to 9 each exactly 9 times, how many different grids can be produced? The order of the numbers doesn't matter; for example, the first row could contain nine 1's. This is related to Sudoku: we know the number of valid Sudoku grids is 6.67×10^21, and since my problem isn't constrained like Sudoku by having to have each of the 9 numbers in each row, column and box, the answer should be greater than 6.67×10^21.
My first thought was that the answer is 81!, however on further reflection this assumes that the 81 numbers placed in the cells are all different, distinct numbers. They are not: there are 81 cells, but only 9 possible different numbers.
My next thought was then that each of the cells in the first row can be any number between 1 and 9. If by chance the first row happened to be all the same number, say all 1s, then each cell in the second row could only have 8 possibilities, 2-9. If this continued down to the last row, the number of different permutations could be calculated by 9^2 * 8^2 * 7^2 ..... * 1^2. However this doesn't work if each row doesn't contain 9 of the same number.
It's been quite a while since I studied this stuff and I can't think of a way to work it out, I'd appreciate any help anyone can offer.
Imagine taking 81 blank slips of paper and writing a number from 1 to 9 on each slip (nine of each number). Shuffle the deck, and start placing the slips on the 9x9 grid.
You'd be able to create 81! different patterns if you considered each slip to be unique.
But instead you want to consider all the 1's to be equivalent.
For any particular configuration, how many times will that configuration be repeated due to the 1's all being equivalent? The answer is 9!, the number of ways you can permute the nine slips with 1 written on them.
So that cuts the total number of permutations down to 81!/9!. (You divide by the number of indistinguishable permutations. Instead of 9! indistinguishable permutations, imagine there were just 2 indistinguishable permutations. You would divide the count by 2, right? So the rule is, you divide by the number of indistinguishable permutations.)
Ah, but you also want the 2's to be equivalent, and the 3's, and so forth.
By the same reasoning, that cuts down the number of permutations to
81!/(9!)^9
By Stirling's approximation, that is roughly 5.8 * 10^70.
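If you want to check that in R without overflowing anything, you can evaluate it on the log scale:
exp(lfactorial(81) - 9 * lfactorial(9))  # ~5.31e70, matching the order of magnitude above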
First, let's start with 81 numbers, 1 through 81. The number of permutations for that is 81P81, or 81!. Simple enough.
However, we have nine 1s, which can be arranged in 9! indistinguishable permutations. Same with 2, 3, etc.
So what we have is the total number of board permutations divided by all the indistinguishable permutations of all numbers, or 81! / (9! ** 9).
>>> import operator
>>> from functools import reduce  # reduce is a builtin in Python 2
>>> reduce(operator.mul, range(1, 82)) // reduce(operator.mul, range(1, 10)) ** 9
53130688706387569792052442448845648519471103327391407016237760000000000