Randomly pairing elements of a vector in R to count unique arrangements - r

Background:
On this combinatorics question, the issue is how to determine the sample space: the ways 8 different soccer teams can be paired up for the next round of competition. Two different answers have been advanced for that part of the problem: 28 (see comments OP) and 105 (see edit within OP and answer).
I'd like to do this manually to try to home in on the mistake in whichever answer is incorrect.
What I have tried:
teams = 1:8
names(teams) = c("RM", "BCN", "SEV", "JUV", "ROM", "MC", "LIV", "BYN")
split(sample(teams), rep(1:(length(teams)/2), each=2))
Unfortunately, the output is a list, and I wanted a vector to be able to run something like:
unique(...,MARGIN=2)
Is there a way of doing this in an elegant manner?
After a now erased answer (thank you), I would go with
a <- replicate(1e5, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2))))
to simulate 100,000 random samples, and later run
unique(a, MARGIN = 2).
But how can I account for the fact that the order of the 4 pairings of opponents doesn't matter, and that LIV-BYN and BYN-LIV, for example, is the same pairing (field advantage notwithstanding)?

> u = ncol(unique(replicate(1e6, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2)))), MARGIN = 2))
> u / (factorial(4) * 2^4)
[1] 105
The idea of using unlist is from @Song Zhengyi, and if his answer is un-deleted, I'll accept it. The complete answer is in the lines above.
u needs to be divided by 4! because
BCN-RM, BYN-SEV, JUV-ROM, LIV-MC
is exactly the same as
LIV-MC, BCN-RM, BYN-SEV, JUV-ROM
or
BCN-RM, LIV-MC, BYN-SEV, JUV-ROM
etc.
The term 2^4 is to avoid over-counting since for every possible unique draw, each one of the pairings can be flipped without loss (discarding field advantage): BCN-RM is the same as RM-BCN, and there are 4 pairs in each draw.
If field advantage is a consideration (real life)...
> u/factorial(4)
[1] 1680
we end up with 1,680 possible draws.
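As a cross-check on the division by 4! and 2^4, here is a minimal sketch that puts every sampled draw into a canonical form before counting, so that unique() counts pairings directly. The helper name canonical_draw and the sample size of 2e4 are my own choices for illustration, not part of the original answer. The last line compares against 7 * 5 * 3 * 1, which counts the same thing: the first team can meet any of 7 opponents, the next unpaired team any of 5, and so on. The figure 28 matches choose(8, 2), which counts single pairs of teams rather than complete draws.

```r
teams <- c("RM", "BCN", "SEV", "JUV", "ROM", "MC", "LIV", "BYN")

# Hypothetical helper: one random draw, written in a canonical form
canonical_draw <- function(x) {
  pairs <- matrix(sample(x), nrow = 2)   # each column is one pairing
  pairs <- apply(pairs, 2, sort)         # LIV-BYN and BYN-LIV become identical
  keys  <- sort(paste(pairs[1, ], pairs[2, ], sep = "-"))  # order of pairings ignored
  paste(keys, collapse = ", ")
}

set.seed(42)
draws <- replicate(2e4, canonical_draw(teams))
n_unique <- length(unique(draws))

# Closed-form counts for comparison
n_formula <- factorial(8) / (factorial(4) * 2^4)
n_double_factorial <- prod(seq(7, 1, by = -2))
```

With only 105 equally likely outcomes, 2e4 samples are more than enough to see every one of them, so n_unique should agree with both closed-form values.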

Related

Making a for loop in r

I am just getting started with R so I am sorry if I say things that don't make sense.
I am trying to make a for loop which does the following,
l_dtest[[1]]<-vector()
l_dtest[[2]]<-vector()
l_dtest[[3]]<-vector()
l_dtest[[4]]<-vector()
l_dtest[[5]]<-vector()
and so on, up to whatever number is assigned to n. For example, if n were 100, this would continue all the way to l_dtest[[100]]<-vector().
I have tried multiple different attempts at doing this and here is one of them.
n<-4
p<-(1:n)
l_dtest<-list()
for(i in p){
print((l_dtest[i]<-vector())<-i)
}
Again I am VERY new to R so I don't know what I am doing or what is wrong with this loop.
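For what it's worth, a minimal sketch of one way to build such a list: pre-allocate it with vector("list", n) and fill each slot with [[i]] inside the loop. The variable names follow the question; nothing else is assumed.

```r
n <- 4

# Pre-allocate a list with n slots, then put an empty vector in each one
l_dtest <- vector("list", n)
for (i in seq_len(n)) {
  l_dtest[[i]] <- vector()   # vector() returns an empty logical vector
}

# The same result without an explicit loop
l_dtest2 <- replicate(n, vector(), simplify = FALSE)
```

Note the double brackets: l_dtest[[i]] refers to the i-th element itself, whereas l_dtest[i] is a sub-list of length one.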
The detailed background for why I need to do this is that I need to write an R function that receives as input the size of the population "n", runs a simulation of the model below with that population size, and returns the number of generations it took to reach a MRCA (most recent common ancestor).
Here is the model,
We assume the population size is constant at n. Generations are discrete and non-overlapping. The genealogy is formed by this random process: in each
generation, each individual chooses two parents at random from the previous generation. The choices are made randomly and equally likely over the n possibilities and each individual chooses twice. All choices are made independently. Thus, for example, it is possible that, when an individual chooses his two parents, he chooses the same individual twice, so that in
fact he ends up with just one parent; this happens with probability 1/n.
I don't understand the specific step at the beginning of this post or why I need to do it, but my teacher said I do. I don't know if this helps, but the next step is choosing parents for the first person and then combining the lists from the step I posted with a previous step. It looks like this,
sample(1:5, 2, replace=T)
#[1] 1 2
l_dtemp[[1]]<-union(l_dtemp[[1]], l_d[[1]]) #To my understanding, l_dtemp[[1]] now receives the list of descendants from l_d[[1]] because the latter chose l_dtemp[[1]] as its first parent
l_dtemp[[2]]<-union(l_dtemp[[2]], l_d[[1]]) #Same as above, but for l_d[[1]]'s 2nd choice, which is l_dtemp[[2]]
sample(1:5, 2, replace=T)
#[1] 1 3
l_dtemp[[1]]<-union(l_dtemp[[1]], l_d[[2]])
l_dtemp[[3]]<-union(l_dtemp[[3]], l_d[[2]])
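Putting the steps above together, here is a rough sketch of how the whole simulation might look. It rests on my reading of the model, which is an assumption: l_d[[i]] holds the present-day descendants of individual i, every individual picks two parents with replacement, each parent inherits the child's descendant list via union, and a MRCA is reached once some individual has all n present-day individuals as descendants. The function name generations_to_mrca is made up for illustration.

```r
# Hypothetical function: number of generations back until some individual
# is an ancestor of all n present-day individuals
generations_to_mrca <- function(n) {
  stopifnot(n >= 1)
  l_d <- as.list(seq_len(n))   # each present-day individual "descends" from itself
  generations <- 0
  while (!any(lengths(l_d) == n)) {
    # Fresh, empty descendant lists for the previous generation
    l_dtemp <- replicate(n, integer(0), simplify = FALSE)
    for (i in seq_len(n)) {
      parents <- sample.int(n, 2, replace = TRUE)   # same parent twice is allowed
      for (p in parents) {
        l_dtemp[[p]] <- union(l_dtemp[[p]], l_d[[i]])
      }
    }
    l_d <- l_dtemp
    generations <- generations + 1
  }
  generations
}

set.seed(1)
g <- generations_to_mrca(20)
```

This is where the list of empty vectors from the first step comes in: l_dtemp has to start empty in every generation so that union can accumulate descendants into it.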

How to extract top features by CATScore in r?

I am running a machine learning algorithm that uses CAT score for feature selection as
library(sda)
train1<- data.matrix(train, rownames.force = NA)
ranking.LDA = sda.ranking(train1[,1:lengthvar], train1[,lengthtrain], diagonal=FALSE)
topfs<-which(ranking.LDA[,"score"] >2)
My question is how to ask the CAT score to give me for example top 20 features? The only way I could extract features was setting a threshold, but this way, it gives me various number of features for different data set. What I want is always having eg. top 20 (or any other number) features.
Thanks in advance for your valuable contribution.
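ranking.LDA is a matrix rather than a list: as far as I understand the sda documentation, its rows are already sorted by decreasing score, and its idx column holds the original column index of each feature. So there is no need for a threshold; just take the first 20 rows.
#As ranking.LDA is already ordered by score, the first 20 idx values identify the top 20 features.
colnames(train1)[ranking.LDA[1:20, "idx"]]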
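The general pattern, independent of sda, is to sort by score and keep the first k entries rather than thresholding. A minimal base-R sketch with made-up scores (the scores vector and the helper top_k are hypothetical, and nothing here calls the sda package):

```r
set.seed(1)
# Made-up scores for 50 hypothetical features
scores <- setNames(runif(50), paste0("feature", 1:50))

# Hypothetical helper: names of the k highest-scoring features
top_k <- function(scores, k) {
  k <- min(k, length(scores))   # never ask for more features than exist
  names(sort(scores, decreasing = TRUE))[seq_len(k)]
}

top20 <- top_k(scores, 20)
```

Unlike a threshold, this always returns the same number of features regardless of how the scores are distributed in a given data set.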

Calculate correlation coefficient between words?

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since e.g. Barack Obama is much more probable than Obama Barack; however, the problem I try to solve is much more fundamental and does not depend on the ordering of the words
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
Well, a simple way to solve your question is by shaping the data into a 2x2 matrix
             obama   not obama
barack         A         B
not barack     C         D
and scoring all occurring bi-grams in the matrix. That way you can, for instance, use a simple chi-squared test.
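A small sketch of how the four cells and the chi-squared statistic could be computed in R. The toy word vector and the function name bigram_chi_sq are invented for illustration, and the statistic is the plain Pearson chi-squared for a 2x2 table without continuity correction.

```r
# Toy text, already split into lower-case words (made up for illustration)
words <- c("barack", "obama", "met", "barack", "obama", "and", "the", "press")

# Hypothetical helper: chi-squared score for the bigram (w1, w2)
bigram_chi_sq <- function(words, w1, w2) {
  first  <- words[-length(words)]   # first word of every bigram
  second <- words[-1]               # second word of every bigram
  A <- sum(first == w1 & second == w2)
  B <- sum(first == w1 & second != w2)
  C <- sum(first != w1 & second == w2)
  D <- sum(first != w1 & second != w2)
  n <- A + B + C + D
  n * (A * D - B * C)^2 / ((A + B) * (C + D) * (A + C) * (B + D))
}

score <- bigram_chi_sq(words, "barack", "obama")
```

In the toy text A = 2, B = 0, C = 0 and D = 5, so the score works out to 7. With real data the cell counts would be far larger, and a zero row or column total would need separate handling.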
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector of length N-1
X[i] = array(); // of length N-1
as follows: the jth entry of X[i] is 1 if word[i] is either the jth or the (j+1)th word of the text, and zero otherwise.
// compute the vector X[i]
for (j = 0:N-2){
if (text[j] == word[i] OR text[j+1] == word[i])
X[i][j] = 1;
else
X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product of X[a] and X[b] divided by the product of their lengths. For two different words the dot product is the number of times they are adjacent, and the length of X[i] is the square root of the number of 1s in it, which is roughly twice the number of appearances of word[i]. Strictly speaking this is a cosine similarity rather than a Pearson correlation. Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a], word[b] are often adjacent.
This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.
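The adjacency measure described in this answer can be sketched in R as follows. The function name adjacency_cosine and the toy text are my own, and R indexes from 1 rather than 0, so the windows are (1,2), (2,3), and so on.

```r
# Hypothetical helper: cosine of the adjacency indicator vectors of a and b
adjacency_cosine <- function(text, a, b) {
  n <- length(text)
  stopifnot(n >= 2)
  left  <- text[-n]   # word j of each window
  right <- text[-1]   # word j + 1 of each window
  xa <- as.numeric(left == a | right == a)   # 1 if the window contains a
  xb <- as.numeric(left == b | right == b)   # 1 if the window contains b
  sum(xa * xb) / (sqrt(sum(xa^2)) * sqrt(sum(xb^2)))
}

words <- c("barack", "obama", "met", "barack", "obama", "and", "the", "press")
cor_bo <- adjacency_cosine(words, "barack", "obama")
cor_bp <- adjacency_cosine(words, "barack", "press")
```

Here "barack" and "obama" share two windows out of 3 and 4 respectively, giving 2 / sqrt(12), while "barack" and "press" are never adjacent and score 0. A word that does not occur at all produces NaN, which a real implementation would have to handle.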
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like, "in how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include 'Barack' but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.
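The pair counts quoted in this answer can be reproduced with combn, which enumerates all unordered pairs while keeping the words in their original order. A short sketch, using only the example sentence from the answer:

```r
sentence <- "Barack Obama is the Democratic candidate for President"
words <- strsplit(sentence, " ")[[1]]

pairs <- combn(words, 2)   # 2-row character matrix, one column per pair
n_pairs <- ncol(pairs)

# Pairs that contain 'Barack' at all, and pairs where 'Obama' follows 'Barack'
with_barack  <- sum(colSums(pairs == "Barack") > 0)
barack_obama <- sum(pairs[1, ] == "Barack" & pairs[2, ] == "Obama")
```

For a longer text the same counts would be taken per sentence or per sliding window, since the number of pairs grows quadratically with the number of words.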

R: Sample into bins of predefined sizes (partition sample vector)

I'm working on a dataset that consists of ~10^6 values which are clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering while keeping bin sizes constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
for (rand in 1:no.of.randomizations) {
rand.data <- partition.sample(seq(1,15), partitions=sizes, replace=F)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the right bin sizes given already by "sizes".
I've been trying to write one such function myself, since the task seems to be not so hard. However, the partitioning of a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could please point me to a smart solution that is around :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would in addition be interested in a trick to do the above feat for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: This should be fairly efficient. Its complexity should be primarily in the permutation step:
# A single step:
x <- sample( unlist(data))
list( one=x[1:4], two=x[5:8], three=x[9], four=x[10:12], five=x[13:15])
As mentioned above, "no.of.randomizations" may have been the number of repeated applications of this process, in which case you may want to wrap replicate around that:
replic <- replicate(n=4, { x <- sample(unlist(data))
list( x[1:4], x[5:8], x[9], x[10:12], x[13:15]) } )
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.
In principle, I can generate one long vector holding a unique permutation of "data" and then split it into a list of vectors of lengths "sizes" by going via a factor argument supplied to split. For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
So far, everything as above
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded by "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5
[14] 5 5
This new vector "cut.by" can then be provided as argument to split()
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for alright. It samples from the background "1:15" and splits the result into vectors of lengths "sizes" through the vector "cut.by".
However, I am still not happy to have to go via an additional (possibly) long vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors, it could become quite slow, I guess.
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)
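For reference, the approach in this answer can be wrapped into a small function along the lines of the "partition.sample" asked for in the question. This is only a sketch: the name partition_sample and its argument checks are my own, and it shuffles without replacement.

```r
# Hypothetical helper: shuffle x and cut it into bins of the given sizes
partition_sample <- function(x, sizes) {
  sizes <- as.integer(unlist(sizes))
  stopifnot(sum(sizes) == length(x))
  # Bin label for every position; explicit levels keep the bins in order
  groups <- factor(rep(seq_along(sizes), times = sizes),
                   levels = seq_along(sizes))
  shuffled <- x[sample.int(length(x))]   # safe even when x has length 1
  unname(split(shuffled, groups))
}

data <- list(c(1, 5, 6, 3), c(2, 4, 7, 8), c(9), c(10, 11, 15), c(12, 13, 14))
sizes <- lengths(data)

set.seed(1)
rand.data <- partition_sample(unlist(data), sizes)
```

The label vector costs one integer per data value, so for ~10^6 values it is a few megabytes; whether that is a real bottleneck is something to measure rather than assume.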

Number of combinations

Given the following characters in a license plate, how many combinations of them can you create?
AAAA1234
Please note that this is not a homework question (I am too old for college :)
I am only trying to understand permutations and combinations. I always get lost when I see questions like this. Do I use n! or nPr or nCr.
Any book on this subject in addition to the logic used to arrive at the answer will also be greatly appreciated.
I have faith in exactly one method to remember such formulas: Rethink through the reasoning to justify it as needed. Then, each time you need the formula, remembering it becomes a mental exercise that makes it easier to remember it the next time. It also allows you to know the math on your own authority, instead of someone else's authority.
If the letters are all different, then there are n choices for the first letter, n-1 choices for the second letter, and so on. That makes n! However, in your problem the letters are not all different. One trick is to tag them to make them different so that you are overcounting, then divide by the amount that you are overcounting. If a of the symbols are A, then you can tag them in a! ways. They are then all different, so that the answer to the modified question is n!. So the answer to the original question is n!/a! (This is assuming that the symbols other than the A are fixed, distinct numbers.)
Another argument is to count the positions for the numbers. There are n positions for the 1, n-1 positions for the 2, etc., so you get n(n-1)...(n-r+1) = n!/a!, where r = n-a.
In fact the answer is the same as the permutation formula nPr. And your arrangements are much the same as partial permutations, which is what the formula is for. But you'll learn it better if you reason through it before looking at the formula.
As for books, I might suggest Brualdi, Introductory Combinatorics.
One strategy that you can use (there will be many) is to get all the permutations possible, then divide out the repeats.
Permutations of 8 elements = 8!
But for each unique arrangement of these, there are a bunch more with the same positions of the A's. So, how many ways can you arrange four A's in one particular set of positions?
Permutations of 4 A's = 4!
So the total unique arrangements should be 8! / 4!
If I'm totally wrong, someone just say so and I'll delete this answer...
If you mean 4 letters A-Z and 4 digits 0...9 in that order, then you have
26 letters x
26 letters x
26 letters x
26 letters x
10 digits x
10 digits x
10 digits x
10 digits
= 26^4 * 10^4
= 4569760000
If no leading "0" is allowed, you get a few less.
Edit1: Miscounted the "A"
Edit2: I reread the question - originally I thought it was just four letters at the beginning followed by 4 numbers. If it's just a permutation thing, then the answer is obviously different: 8! permutations in total, but the 4! permutations of the A's among themselves are all the same, so 8! / 4! = 1680.
Answer is 8!/4!
Let's try to explain with a simpler question: Combinations of 112 ?
There are 112, 121 and 211. If all digits were unique, we could just find the answer as 3!. But there is a repeated digit, so we divide out the arrangements of the repeated digit: 3!/2! = 3
Another example is 1122. We have two repeated digits here, so we divide twice: 4!/(2!*2!) = 6
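These small cases are easy to verify by brute force. The sketch below lists every distinct arrangement by choosing each distinct symbol in turn as the first character and recursing on the rest; the function name unique_arrangements is invented for illustration.

```r
# Hypothetical helper: all distinct arrangements of a vector of symbols
unique_arrangements <- function(symbols) {
  if (length(symbols) <= 1) return(paste(symbols, collapse = ""))
  out <- character(0)
  for (s in unique(symbols)) {
    rest <- symbols[-match(s, symbols)]   # remove one copy of s
    out <- c(out, paste0(s, unique_arrangements(rest)))
  }
  out
}

n_112   <- length(unique_arrangements(c("1", "1", "2")))
n_1122  <- length(unique_arrangements(c("1", "1", "2", "2")))
n_plate <- length(unique_arrangements(strsplit("AAAA1234", "")[[1]]))
```

The three counts should come out as 3, 6 and 1680, matching 3!/2!, 4!/(2!*2!) and 8!/4!.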
I think this is a good explanation of permutations and combinations:
Easy Permutations and Combinations Better explained.
It goes step by step until you discover how to make the calculations.
No need for permutations, because all letters can be repeated, even the number
since the given example is [AAAA1234],then we have 4-Letters and 4-Digits.
for each letter we have 26 {A-Z} possible combinations
That's why for 4 letters we will have 26^4
For each digit we have 10 {0-9} possibilities, except that the last digit has only 9 possibilities if it is not allowed to be 0 {case 1}; otherwise it has 10 {case 2}
That's why for 4 digits we will have 9*10^3 {case 1} or 10^4 {case 2}
The total number of combinations is {case 1} 9*(26^4)*(10^3) or {case 2} (26^4)*(10^4)
But if your question is about permutations of the set {A,A,A,A,1,2,3,4}, then consider the equivalent set {1,2,3,4,5,6,7,8}, where {5,6,7,8} represents {A,A,A,A}, and avoid counting repeated sequences by dividing by the number of permutations of {5,6,7,8}: the answer is 8!/4! = 5*6*7*8 = 1680. See @Tesserex & @erkangur
How many distinct sets of positions can the A's occupy? Given this value, multiply by the number of distinct arrangements of 1234 and you have your answer. You'll need "choose" (nCr) for the positions of the A's, and then the factorial will help with the arrangements of 1234.
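Spelled out in R, that hint looks like the sketch below: combn enumerates the possible position sets for the four A's, and factorial(4) counts the orders of the digits in the remaining slots. The variable names are only illustrative.

```r
# Every way to pick 4 of the 8 positions for the A's (one column per choice)
positions <- combn(8, 4)
n_position_sets <- ncol(positions)

# The digits 1, 2, 3, 4 fill the remaining four positions in any order
n_digit_orders <- factorial(4)

total <- n_position_sets * n_digit_orders
```

There are 70 position sets and 24 digit orders, so the total is 1680, the same value as 8!/4!.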
Consider a simpler example. Let's say you had asked the question:
How many arrangements are there of the symbols: ABCD1234?
Now, since every symbol is distinct, there are 8! ways to arrange them.
Now let's build up to your problem. If we change the letter B to an A, we have AACD1234.
Each arrangement of AACD1234 corresponds to exactly two arrangements of ABCD1234, since swapping the A and the B used to give a different arrangement and now gives the same one. Therefore, we now have 8!/2 arrangements.
Similarly, replacing the C with another A means that the three A's can be permuted among themselves in 3! = 6 ways without changing the arrangement, which leaves 8!/3!. Note that this divides the previous count by 3, not by 2.
So, if only one symbol is duplicated, the generalized formula is (number of symbols total)!/(number of copies of the duplicated symbol)!
In your case, the number of possible arrangements is 8!/4! = 1680
