Permuting elements of a vector 10,000 times - efficiently? (R) - r

This question is quite straightforward. However, the solutions that I have found to it are extremely memory and time inefficient. I am wondering if this can be done in R without grinding one's machine into dust.
Take a vector:
x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
This one has 10 elements. There are five unique elements. Therefore, importantly, some elements are repeated and any permutation should contain the same total number of each type of element. I wish to permute this sequence/vector 10,000 times with each one being a randomly generated and unique one. With my real data, I could be doing these permutations for up to 1000 elements. This can be very hard to do efficiently.
To get one permutation, you can just do:
sample(x)
or, from the gtools package:
permute(x)
I could write some code to do that 10,000 times, but I am likely to get duplicates. Is there a way of doing this and dropping duplicates until 10,000 unique permutations are reached?
Other similar questions on Stack Overflow and Cross Validated have asked about generating all the unique permutations of a sequence. These questions are here:
Shuffling a vector - all possible outcomes of sample()?
Generating all distinct permutations of a list in R
https://stats.stackexchange.com/questions/24300/how-to-resample-in-r-without-repeating-permutations
These are good, and the suggestions for generating all the unique permutations are great. It would certainly be easy to run them and then draw 10,000 random rows from the result. However, once a vector goes beyond about 10 elements, generating every permutation becomes very memory intensive.
Any comments about how to do this efficiently for up to 1000 elements appreciated. This has me getting very dizzy.

I don't think that the computations should be as expensive as you are making them out to be. For small "x" vectors, you might want to overshoot a little bit (here, I've sort of overdone it), then check for duplicates using duplicated. If removing the duplicated rows leaves you short of your desired 10,000, repeat the process to fill the difference, using rbind to add the rows you want to keep to the matrix you get from replicate. This could be implemented in a while loop.
x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
set.seed(1)
N <- t(replicate(15000, sample(x)))
sum(duplicated(N))
# [1] 1389
out <- N[!(duplicated(N)), ][1:10000, ]
head(out)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] "B" "E" "C" "D" "B" "E" "A" "C" "C" "A"
# [2,] "B" "B" "C" "C" "C" "D" "E" "E" "A" "A"
# [3,] "C" "B" "C" "A" "A" "E" "D" "C" "B" "E"
# [4,] "C" "C" "E" "B" "C" "E" "A" "A" "D" "B"
# [5,] "A" "C" "D" "E" "E" "C" "A" "B" "B" "C"
# [6,] "C" "E" "E" "B" "A" "C" "D" "A" "B" "C"
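The while loop described above could be sketched roughly as follows. The function name unique_perms and its batch argument are made up for illustration, and the guard against asking for more permutations than exist is my own addition, so treat this as a starting point rather than a tuned implementation.

```r
# Hypothetical helper: keep sampling until n distinct permutations of x exist.
unique_perms <- function(x, n, batch = n) {
  # Number of distinct permutations of a multiset (may be Inf for long x).
  max_perms <- exp(lfactorial(length(x)) - sum(lfactorial(table(x))))
  if (n > round(max_perms)) stop("fewer than n distinct permutations exist")

  out <- matrix(x[0], nrow = 0, ncol = length(x))
  while (nrow(out) < n) {
    new <- t(replicate(batch, sample(x)))         # one permutation per row
    out <- rbind(out, new)
    out <- out[!duplicated(out), , drop = FALSE]  # drop repeated rows
  }
  out[seq_len(n), , drop = FALSE]
}

set.seed(1)
x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
out <- unique_perms(x, 10000)
dim(out)
```

Each pass re-checks the whole accumulated matrix with duplicated, which is simple but repeats work; for long vectors, where duplicates are essentially impossible, a single pass is normally enough.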
The duplicated step is actually the most expensive, as far as I can see:
y <- sample(500, 1000, TRUE)
system.time(N <- t(replicate(12000, sample(y))))
# user system elapsed
# 2.35 0.08 2.43
system.time(D <- sum(duplicated(N)))
# user system elapsed
# 14.82 0.01 14.84
D
# [1] 0
^^ There, we have no duplicates in our 12,000 samples.

In case you are only interested in the first 10000 permutations (in dictionary order), you can make use of the iterpc library.
library(iterpc)
x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
I <- iterpc(table(x), ordered=TRUE)
# first 10000 permutations
result <- getnext(I, d=10000)
And it is very fast to get them.
> system.time(getnext(I, d=10000))
user system elapsed
0.004 0.000 0.005

Here's an idea. This is not necessarily an answer but it's too big for a comment.
Get the permutations in an orderly way, and add them to a collection. For example, if elements are A, B, C, and D:
A B C D
A B D C
A D B C
... so on
Once you have the required number of permutations (10,000 in your case), permute that collection once.
If the cost of randomization is the bottleneck, this approach should solve it.
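A minimal sketch of this idea in base R, assuming the standard "next lexicographic permutation" step is an acceptable way to walk through the permutations in order. The helper next_perm and the variable n_wanted are names I made up for illustration. Note that the first N permutations in dictionary order are not a random sample of all permutations; shuffling the rows only randomizes their order.

```r
# Hypothetical helper: next permutation of an integer vector in
# lexicographic order, or NULL if v is already the last one.
next_perm <- function(v) {
  n <- length(v)
  i <- n - 1
  while (i >= 1 && v[i] >= v[i + 1]) i <- i - 1
  if (i < 1) return(NULL)
  j <- n
  while (v[j] <= v[i]) j <- j - 1
  v[c(i, j)] <- v[c(j, i)]            # swap pivot with its successor
  v[(i + 1):n] <- rev(v[(i + 1):n])   # reverse the tail
  v
}

x <- c("A", "B", "C", "D")
lev <- sort(unique(x))
v <- sort(match(x, lev))   # work on integer codes, starting from sorted order
n_wanted <- 5              # would be 10000 for the real problem

perms <- list()
while (!is.null(v) && length(perms) < n_wanted) {
  perms[[length(perms) + 1]] <- lev[v]
  v <- next_perm(v)
}
ordered <- do.call(rbind, perms)

# Permute the collection once, as suggested above.
shuffled <- ordered[sample(nrow(ordered)), , drop = FALSE]
```

Because repeated elements share the same integer code, the walk never produces the same arrangement twice, so no duplicate check is needed.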

Related

R: From panel structure to adjacency matrix or edge list?

I'm trying to convert a data set in a long format panel structure to an adjacency matrix or edge list to make network graphs. The data set contains articles, each identified by an ID number. Each article can appear several times under a number of categories. Hence I have a long format structure at the moment:
ID <- c(1,1,1,2,2,2,3,3)
Category <- c("A","B","C","B","E","H","C","E")
dat <- data.frame(ID,Category)
I want to convert this into an adjacency matrix or edge list, where the edge list should look something like this:
A B
A C
B C
B E
B H
E H
C E
Edit: I have tried dat <- merge(ID, Category, by="Category") but it returns the error message Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
Thanks in advance
Update: I ended up using the crossprod(table(dat)) from the comments, but the solution suggested by Navy Cheng below works just as well
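For reference, the crossprod(table(dat)) approach mentioned in the update might look like the sketch below. Zeroing the diagonal is an assumption on my part about what the adjacency matrix should contain (before that step, the diagonal holds how many articles each category appears in).

```r
ID <- c(1, 1, 1, 2, 2, 2, 3, 3)
Category <- c("A", "B", "C", "B", "E", "H", "C", "E")
dat <- data.frame(ID, Category)

# table(dat) is an ID x Category incidence table; crossprod() multiplies
# its transpose by itself, giving Category x Category co-occurrence counts.
adj <- crossprod(table(dat))
diag(adj) <- 0   # assumption: no self-loops wanted
adj
```

Entry adj["A", "B"] counts the articles that carry both category A and category B.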
This code will work
do.call(rbind,lapply(split(dat, dat$ID), function(x){
t(combn(as.vector(x$Category), 2))
}))
Update
Following @Parfait's suggestion, you can use by instead of split + lapply.
1) Use by to group the nodes ("A", "B", "C" ...) by ID;
2) Use combn to create the edges between the nodes in each group, and t to transpose each result so that the pieces can be combined with rbind
> edge.list <- by(dat, dat$ID, function(x) t(combn(as.vector(x$Category), 2)))
dat$ID: 1
[,1] [,2]
[1,] "A" "B"
[2,] "A" "C"
[3,] "B" "C"
------------------------------------------------------------
dat$ID: 2
[,1] [,2]
[1,] "B" "E"
[2,] "B" "H"
[3,] "E" "H"
------------------------------------------------------------
dat$ID: 3
[,1] [,2]
[1,] "C" "E"
3) Then merge the list
> do.call(rbind, edge.list)
[,1] [,2]
[1,] "A" "B"
[2,] "A" "C"
[3,] "B" "C"
[4,] "B" "E"
[5,] "B" "H"
[6,] "E" "H"
[7,] "C" "E"
If you are willing to convert your data.frame to a data.table, this problem can be solved efficiently and cleanly, and with many rows it will be much faster.
library(data.table)
dat<-data.table(dat)
Basically you can apply functions to columns of the data.table in the j argument and group with the by argument. So you want all the combinations of categories taken two at a time for each ID, which looks like this:
dat[,combn(Category,2),by=ID]
However stopping at this point will keep the ID column and by default create a column called V1 that basically concatenates the array returned by combn into a vector of the categories and not the two-column adjacency matrix that you need. But by chaining another call to this you can create the matrix easily as you would with any single vector. In one line of code this will look like:
dat[,combn(Category,2),by=ID][,matrix(V1,ncol=2,byrow = T)]
Remember that the vector column we wish to convert to a matrix is called V1 by default and also we want the 2-column matrix to be created by row instead of the default which is by column. Hope that helps and let me know if I need to add anything to my explanation. Good luck!

Function to generate ordered combinations [duplicate]

This question already has answers here:
How to generate all possible combinations of vectors without caring for order?
(2 answers)
Closed 5 years ago.
What R command generates all possible ordered combinations of length k?
For example from this vector:
a,b,c,d
I want to generate all combinations of length 3 but only those where the order is conserved:
a,b,c
a,b,d
a,c,d
b,c,d
Or If I have this vector
a,b,7,d,e
I want to do the same for length 2:
a,b
a,7
a,d
a,e
b,7
b,d
b,e
7,d
7,e
d,e
combn doesn't work here because it gives you all possible combinations including reversed ones such as
c,b
In simple cases I could try to do it with expand.grid but both methods would need further processing.
Maybe there is a base function (or package) able to do what I want or even accepting more complex conditions.
PS: When I say "ordered" I'm speaking about the order of appearance in the starting vector. I don't mean alphabetical order, though in my example they are the same.
You can use combn in base R:
vec <- c("a", "b", "c", "d")
len <- 2
combn(length(vec), len, function(x) vec[x])
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] "a" "a" "a" "b" "b" "c"
#[2,] "b" "c" "d" "c" "d" "d"
Of length 3:
combn(length(vec), 3, function(x) vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "b"
#[2,] "b" "b" "c" "c"
#[3,] "c" "d" "d" "d"
Or, as @Sotos pointed out in the comments:
combn(vec, len)
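As a quick check on the second vector from the question, here is a small sketch showing that combn keeps the order of appearance in the input rather than sorting alphabetically. The variable names vec2 and pairs are only for illustration.

```r
vec2 <- c("a", "b", "7", "d", "e")

# One combination per row, in the order combn generates them.
pairs <- t(combn(vec2, 2))
pairs

# In every row, the first element appears earlier in vec2 than the second.
all(match(pairs[, 1], vec2) < match(pairs[, 2], vec2))
```

No reversed pair such as "7", "b" appears, because combn works on positions and always takes them in increasing order.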

Getting paired permutations in R

Context: I have a list of sports teams called teamNames, and I would like to generate their match-ups for each week. I'm not sure if permutations are even the right approach, but I feel like they would be. What I would ideally like is to pass a vector of team names to a function, and then have it give me a matrix where each row has that vector of team names in a different order, such that if I go through them in pairs, I'll get a unique set of match-ups for each row.
For example if my input is teamNames <- c("a", "b", "c", "d"), I want the output to be a matrix that says:
a b c d
a c b d
a d c b
Edit: Further clarification: in this case, the matrix has given me three "weeks" of matchups. First week: "a" vs. "b" and "c" vs. "d"
Second week: "a" vs. "c" and "b" vs. "d"
Third week: "a" vs. "d" and "b" vs. "c"
The closest I've gotten from reading other questions is to use the permutations function in the gtools package as follows:
permutations(length(teamNames), 2, teamNames)
This generates all the possible match-ups, but what it doesn't do is divide them into sets/weeks. combinations(length(teamNames), 4, teamNames) only gives me one set of matchups.
If I understand correctly, once 2 teams are chosen from the 4, the remaining two have to be matched with each other. So the task is selecting 2 out of 4. Permutations are not the right tool, since 'a vs b' == 'b vs a'. No extra package is necessary, as the utils package has combn().
> combn(teamNames, 2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "a" "a" "b" "b" "c"
[2,] "b" "c" "d" "c" "d" "d"
The output above shows all ways of selecting 2 teams from 4. There is some duplication: selecting a and b is equivalent to selecting c and d. If one of each pair of duplicated cases is dropped, the result can be used to set up a schedule.
Update
@Buckminster - I keep updating the code. In this update, the remaining two teams are added, although there is still duplication. Also, among 4 teams, once 2 are chosen the remaining two are determined (a similar idea to solving a system of equations in linear algebra). In other words, I'm not sure why the -1 was given, probably by you.
# Update
teamNames <- c("a", "b", "c", "d")
first <- combn(teamNames, 2, simplify = FALSE)
second <- lapply(first, function(x) teamNames[!teamNames %in% x])
bind <- rbind(do.call(cbind, first), do.call(cbind, second))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "a" "a" "b" "b" "c"
[2,] "b" "c" "d" "c" "d" "d"
[3,] "c" "b" "b" "a" "a" "a"
[4,] "d" "d" "c" "d" "c" "b"
Let me check if duplication can be removed easily.
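One possible way to drop the duplicated columns, sketched under the assumption that there are exactly four teams: keep only the pairings whose first match involves the first team, so each week is listed once. This is my own illustration and does not generalize to larger leagues, where a proper round-robin scheme would be needed.

```r
teamNames <- c("a", "b", "c", "d")
first <- combn(teamNames, 2, simplify = FALSE)

# Keep only the pairs that contain the first team; their complements
# are the pairs that were duplicated in the full listing.
keep <- Filter(function(p) teamNames[1] %in% p, first)

# Each row is one week: first match in columns 1-2, second in columns 3-4.
weeks <- t(sapply(keep, function(p) c(p, setdiff(teamNames, p))))
weeks
```

Reading each row in pairs gives a vs. b and c vs. d, then a vs. c and b vs. d, then a vs. d and b vs. c.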

Subsetting Identical Observations in R [duplicate]

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 8 years ago.
I am trying to look at protein sequence homology using R, and I'd like to go through a data frame looking for identical pairs of Position and Letter. The data look similar to the frame below:
Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)
Which yields:
Position Letter
[1,] "1" "A"
[2,] "2" "B"
[3,] "3" "C"
[4,] "4" "D"
[5,] "4" "D"
[6,] "5" "E"
[7,] "6" "G"
[8,] "7" "L"
I'd like to loop through and find all identical observations (in this case, observations 4 and 5), but I'm having difficulty in discovering the best way to do it.
I'd like the resultant data frame to look like:
Position Letter
[1,] "4" "D"
[2,] "4" "D"
The ways I've tried to do this ended up yielding this code, but unfortunately it returns one value of TRUE because I realized that I am comparing two identical data frames:
> identical(data.set[1:nrow(data.set),1:2], data.set[1:nrow(data.set),1:2])
[1] TRUE
I'm not sure if looping through using the identical() function would be the best way? I'm sure there's a more elegant solution that I am missing.
Thanks for any help!
Try the unique function:
unique(data.set)
...
You can use duplicated using fromLast to go in two directions:
data.set[duplicated(data.set) | duplicated(data.set, fromLast = TRUE), ]
# Position Letter
#[1,] "4" "D"
#[2,] "4" "D"

R Question Number of Unique Combinations of A,A,A,A,B,B,B,B,B

I am trying to find a way to get a list in R of all the possible unique permutations of A,A,A,A,B,B,B,B,B.
Combinations was what was originally thought to be the method for obtaining a solution, hence the combinations answers.
I think this is what you're after. @bill was on the ball with the recommendation of combining unique and combn. We'll also use the apply family to generate ALL of the combinations. Since unique removes duplicate rows, we need to transpose the results from combn before calling unique on them. We then transpose them back before printing so that each column represents a unique answer.
#Daters
x <- c(rep("A", 4), rep("B",5))
#Generates a list with ALL of the combinations
zz <- sapply(seq_along(x), function(y) combn(x,y))
#Filter out all the duplicates
sapply(zz, function(z) t(unique(t(z))))
Which returns:
[[1]]
[,1] [,2]
[1,] "A" "B"
[[2]]
[,1] [,2] [,3]
[1,] "A" "A" "B"
[2,] "A" "B" "B"
[[3]]
[,1] [,2] [,3] [,4]
[1,] "A" "A" "A" "B"
[2,] "A" "A" "B" "B"
[3,] "A" "B" "B" "B"
...
EDIT Since the question is about permutations and not combinations, the answer above is not that useful. This post outlines a function to generate the unique permutations given a set of parameters. I have no idea if it could be improved upon, but here's one approach using that function:
fn_perm_list <- function(n, r, v = 1:n) {
  if (r == 1) {
    matrix(v, n, 1)
  } else if (n == 1) {
    matrix(v, 1, r)
  } else {
    X <- NULL
    for (i in 1:n) {
      X <- rbind(X, cbind(v[i], fn_perm_list(n - 1, r - 1, v[-i])))
    }
    X
  }
}
zz <- fn_perm_list(9, 9)
#Turn into character matrix. This currently does not generalize well, but gets the job done
zz <- ifelse(zz <= 4, "A", "B")
#Returns 126 rows as indicated in comments
unique(zz)
There's no need to generate permutations and then pick out the unique ones.
Here's a much simpler way (and much, much faster as well): To generate all permutations of 4 A's and 5 B's, we just need to enumerate all possible ways of placing 4 A's among 9 possible locations. This is simply a combinations problem. Here's how we can do this:
x <- rep('B',9) # vector of 9 B's
a_pos <- combn(9,4) # all possible ways to place 4 A's among 9 positions
perms <- apply(a_pos, 2, function(p) replace(x,p,'A')) # all desired permutations
Each column of the 9x126 matrix perms is a unique permutation of 4 A's and 5 B's:
> dim(perms)
[1] 9 126
> perms[,1:4] ## look at first few columns
[,1] [,2] [,3] [,4]
[1,] "A" "A" "A" "A"
[2,] "A" "A" "A" "A"
[3,] "A" "A" "A" "A"
[4,] "A" "B" "B" "B"
[5,] "B" "A" "B" "B"
[6,] "B" "B" "A" "B"
[7,] "B" "B" "B" "A"
[8,] "B" "B" "B" "B"
[9,] "B" "B" "B" "B"
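As a sanity check on the count, the number of distinct arrangements of 4 A's and 5 B's is 9! / (4! 5!) = choose(9, 4) = 126. The short sketch below repeats the construction above and confirms that the columns are all different and that each contains exactly four A's.

```r
x <- rep("B", 9)
a_pos <- combn(9, 4)
perms <- apply(a_pos, 2, function(p) replace(x, p, "A"))

choose(9, 4)                                  # number of placements of the A's
factorial(9) / (factorial(4) * factorial(5))  # same count via the multinomial formula
anyDuplicated(t(perms))                       # 0 means no repeated permutation
all(colSums(perms == "A") == 4)               # every column has four A's
```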

Resources