I have a vector:
seq1a <- c('a','b','c','b','a','b','c','b','a','b','c')
I wish to permute the elements of this vector to create multiple (ideally up to 5000) vectors, with the condition that a permuted vector cannot contain the same element twice in a row, e.g. "abbca..." is not allowed as 'b-b' is a repeat.
I realize that for this small example there probably are not 5000 solutions. I am typically dealing with much larger vectors. I am also willing to consider sampling with replacement, though currently I'm working on solutions without replacement.
I am looking for better solutions than my current thinking.
Option 1. - brute force.
Here, I just repeatedly sample and check if any successive elements are duplicates.
set.seed(18)
seq1b <- sample(seq1a)
seq1b
#[1] "b" "b" "a" "a" "c" "b" "b" "c" "a" "c" "b"
sum(seq1b[-length(seq1b)]==seq1b[-1]) #3
This is not a solution as there are 3 duplicated consecutive elements. I also realize that lag is probably a better way to check for duplicated elements but for some reason it is being finicky (I think it is being masked by another package I have loaded).
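For reference, a minimal sketch of the lag-style check; calling dplyr::lag explicitly sidesteps the masking problem (note that stats::lag does something quite different):
sum(seq1b == dplyr::lag(seq1b), na.rm = TRUE) #3
#or the same check in base R with head()/tail():
sum(head(seq1b, -1) == tail(seq1b, -1)) #3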
set.seed(1000)
res <- list()
for (i in 1:10000){ res[[i]] <- sample(seq1a) }
res1 <- lapply(res, function(x) sum(x[-length(x)]==x[-1]))
sum(unlist(res1)==0) #228
This produces 228 options out of 10000 iterations. But let's see how many unique ones:
res2 <- res[which(unlist(res1)==0)]
unique(unlist(lapply(res2, paste0, collapse=""))) #134
Out of 10000 attempts we only get 134 unique ones from this short example vector.
Here are 3 of the 134 example sequences produced:
# "bcbabcbabca" "cbabababcbc" "bcbcababacb"
In fact, if I try over 500,000 samples, I can only get 212 unique sequences that match my non-repeating criteria. This is probably close to the upper limit of possible ones.
Option 2. - iteratively
A second idea I had is to be more iterative about the approach.
seq1a
table(seq1a)
#a b c
#3 5 3
We could sample one of these letters as our starting point. Then sample another from the remaining ones, check if it is the same as the previously chosen one and if not, add it to the end. And so on and so forth...
set.seed(10)
newseq <- sample(seq1a,1) #b
newseq #[1] "b"
remaining <- seq1a[!seq1a %in% newseq | duplicated(seq1a)] #drop the first occurrence of each letter drawn so far
table(remaining)
#a b c
#3 4 3
set.seed(10)
newone <- sample(remaining,1) #c
#check if newone is same as previous one.
newone==newseq[length(newseq)] #FALSE
newseq <- c(newseq, newone) #update newseq
newseq #[1] "b" "c"
remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)] #update remaining
remaining
table(remaining)
#a b c
#3 4 2
This might work, but I can also see it running into lots of issues - e.g. we could go:
# "a" "c" "a" "c" "a" "b" ...
and then be left with 3 more 'b's that cannot go at the end as they'd be duplicates.
Of course, this would be a lot easier if I allowed sampling with replacement, but for now I'm trying to do this without replacement.
You can use the iterpc package to work with combinations and iterators. I hadn't heard of it until trying to answer this question, so there might also be more effective ways to use the same package.
Here I've used iterpc to set up an iterator, and getall to find all combinations of the vector based on that iterator. This seems to just report unique combinations, making it a bit nicer than finding all combinations with expand.grid.
#install.packages("iterpc")
require("iterpc")
seq1 <- c('a','b','c','b','a','b','c','b','a','b','c')
I <- iterpc(n = table(seq1), ordered=TRUE)
all_seqs <- getall(I)
# result is a matrix with permutations as rows:
head(all_seqs)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#[1,] "a" "a" "a" "b" "b" "b" "b" "b" "c" "c" "c"
#[2,] "a" "a" "a" "b" "b" "b" "b" "c" "b" "c" "c"
#[3,] "a" "a" "a" "b" "b" "b" "b" "c" "c" "b" "c"
#[4,] "a" "a" "a" "b" "b" "b" "b" "c" "c" "c" "b"
#[5,] "a" "a" "a" "b" "b" "b" "c" "b" "b" "c" "c"
#[6,] "a" "a" "a" "b" "b" "b" "c" "b" "c" "b" "c"
The rle function tells us about consecutive values equal to each other in a vector. The lengths component of the output tells us how many times each element of values is repeated:
rle(c("a", "a", "b", "b", "b", "c", "b"))
# Run Length Encoding
# lengths: int [1:4] 2 3 1 1
# values : chr [1:4] "a" "b" "c" "b"
The length of values or lengths will be equal to the length of the original vector only for combinations which have no consecutive repeats.
You can therefore apply rle to each row, calculate the length of values or lengths, and keep the rows of all_seqs where the calculated value equals the length of seq1.
#apply the rle function
all_seqs_rle <- apply(all_seqs, 1, function(x) length(rle(x)$values))
# keep rows which have an rle with a length equal to length(seq1)
all_seqs_good <- all_seqs[which(all_seqs_rle == length(seq1)), ]
all_seqs_good has an nrow of 212, suggesting that you did indeed find all possible combinations for your example vector.
nrow(all_seqs_good)
# 212
Technically this is still brute forcing (except that it doesn't calculate every possible combination - only unique ones), but is fairly quick for your example. I'm not sure how well it will cope with larger vectors yet...
Edit: this does seem to fail for larger vectors. One solution would be to break larger vectors into smaller chunks, then process those chunks as above and combine them - keeping only the combinations which meet your criteria.
For example, breaking a vector of length 24 into two vectors of length 12 and then combining the results can give you 200,000+ combinations which meet your criteria, and it is pretty quick (around 1 minute for me):
# function based on the above solution
seq_check <- function(mySeq){
  I <- iterpc(n = table(mySeq), ordered=TRUE)
  all_seqs <- getall(I)
  all_seqs_rle <- apply(all_seqs, 1, function(x) length(rle(x)$values))
  all_seqs_good <- all_seqs[which(all_seqs_rle == length(mySeq)), ]
  return(all_seqs_good)
}
set.seed(1)
seq1<-sample(c(rep("a", 8), rep("b", 8), rep("c", 8)),24)
seq1a <- seq1[1:12]
seq1b <- seq1[13:24]
#get all permutations with no consecutive repeats
seq1a = apply(seq_check(seq1a), 1, paste0, collapse="")
seq1b = apply(seq_check(seq1b), 1, paste0, collapse="")
#combine seq1a and seq1b:
combined_seqs <- expand.grid(seq1a, seq1b)
combined_seqs <- apply(combined_seqs, 1, paste0, collapse="")
#function to calculate rle lengths
rle_calc <- function(x) length(rle(unlist(strsplit(x, "")))$values)
#keep combined sequences which have rle lengths of 24
combined_seqs_rle <- sapply(combined_seqs, rle_calc)
passed_combinations <- combined_seqs[which(combined_seqs_rle == 24)]
#find number of solutions
length(passed_combinations)
#[1] 245832
length(unique(passed_combinations))
#[1] 245832
You might need to re-order the starting vector for best results. For example, if seq1 in the above example had started with "a" eight times in a row, there would be no passing solutions: try the splitting approach with seq1 <- c(rep("a", 8), rep("b", 8), rep("c", 8)) and you get nothing back, even though that multiset has exactly the same number of valid solutions as the random sequence.
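A simple workaround (just a sketch) is to shuffle such a vector before splitting, so that each chunk gets a mix of letters:
set.seed(2)
seq1 <- sample(c(rep("a", 8), rep("b", 8), rep("c", 8)))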
It doesn't look like you need to find every possible passing combination, but if you do, then for larger vectors you'll probably need to iterate through I using the getnext function from iterpc, checking each candidate in a loop, which would be very slow.
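For completeness, a minimal sketch of that getnext loop; I'm assuming here that getnext(I) returns NULL once the iterator is exhausted, so double-check the iterpc docs before relying on it:
I <- iterpc(n = table(seq1), ordered = TRUE)
passed <- list()
x <- getnext(I)
while (!is.null(x)) {
  if (length(rle(x)$values) == length(seq1)) passed[[length(passed) + 1]] <- x
  x <- getnext(I)
}
length(passed) #should again be 212 for the original example vector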
Here is another solution; please see the comments in the code for an explanation of the algorithm.
In a way, it's similar to your second (iterative) approach, but it adds
a while loop that ensures that the next element is valid
and a stopping criterion for the case when the remaining elements would necessarily form an invalid combination
The algorithm is also quite efficient with the longer seq1 vectors given in one of your comments, but I guess its performance will degrade if you have more unique elements in seq1.
Here is the code:
First a few definitions
set.seed(1234)
seq1=c('a','b','c','b','a','b','c','b','a','b','c')
#number of attempts to generate a valid combination
Nres=10000
#this list will hold the results
#we do not have to care about memory allocation
res_list=list()
Now generate the combinations
#the outer loop creates the user-defined number of combination attempts
for (i in 1:Nres) {
  #create a "population" from seq1
  popul=seq1
  #pre-allocate an NA vector of the same length as seq1
  res_vec=rep(NA_character_,length(seq1))
  #take FIRST draw from the population
  new_draw=sample(popul,1)
  #remove draw from population
  popul=popul[-match(new_draw,popul)]
  #save new draw
  res_vec[1]=new_draw
  #now take remaining draws
  for (j in 2:length(seq1)) {
    #take new draws as long as
    #1) new_draw is equal to the last draw and
    #2) there are any valid elements left in popul
    while((new_draw==res_vec[j-1])&any(res_vec[j-1]!=popul)) {
      #take new draw
      new_draw=sample(popul,1)
    }
    #if we did not find a valid draw, break the inner loop
    if (new_draw==res_vec[j-1]) {
      break
    }
    #otherwise save new_draw ...
    res_vec[j]=new_draw
    #... and delete new draw from population
    popul=popul[-match(new_draw,popul)]
  }
  #check whether we had to break the inner loop
  #if not, save the results vector
  if (!anyNA(res_vec)) res_list[[length(res_list)+1]]=res_vec
}
Now let's check the results
#for each result vector in res_list:
#1) check whether all subsequent elements are different ---> sum(x[-1]==x[-length(x)])==0
#2) and whether we have the same number of elements as in seq1 ---> isTRUE(all.equal(table(x),table(seq1),check.attributes=FALSE))
#   (isTRUE() guards against all.equal() returning a character string when the tables differ)
sum(sapply(res_list,function(x) (sum(x[-1]==x[-length(x)])==0)&isTRUE(all.equal(table(x),table(seq1),check.attributes=FALSE))))
#6085
#the previous number should be the same as the length of res_list
length(res_list)
#6085
#check the number of unique solutions
length(unique(res_list))
#212
The speed of your actual job will depend on a lot of factors (e.g. how many passing combinations exist), but I think you can accomplish this relatively quickly using two loops (similar to what you outlined, but possibly quicker):
1) Permute your set of variables and check that no values are consecutive.
2) Check whether the passing permutation is unique among those already collected.
In the following example, you set two values to control the search: nsuccess - the desired number of unique permutations; nmax - the maximum number of attempts (sets an upper limit on computation time)
Example
seq1 <- c('a','b','c','b','a','b','c','b','a','b','c')
seq1
set.seed(1)
nsuccess <- 200
nmax <- 30000
res <- matrix(NA, nrow=length(seq1), ncol=nsuccess)
i <- 1
j <- 1
while(i <= nsuccess & j <= nmax){
  s1 <- sample(seq1)
  s1str <- paste(s1, collapse=",")
  test <- rle(s1)$lengths
  if(sum(test) == length(test)) { # check that no values are consecutive
    U <- unique(apply(res, 2, function(x){paste(x, collapse=",")}))
    if(!s1str %in% U){ # check if new permutation is unique
      res[,i] <- s1
      i <- i+1
    }
  }
  j <- j+1
}
print(paste("i =", i, "; j =", j))
res # view the unique permutations
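One possible speed-up, sketched under the assumption that looking a string up in a vector is cheaper than re-pasting the whole res matrix on every hit: keep the accepted strings in a growing character vector (seen is a hypothetical name, not from the original). This would replace the while loop above, after resetting i and j:
seen <- character(0)
i <- 1
j <- 1
while(i <= nsuccess & j <= nmax){
  s1 <- sample(seq1)
  s1str <- paste(s1, collapse=",")
  if(length(rle(s1)$lengths) == length(s1) && !s1str %in% seen){ # no consecutive repeats and not seen before
    seen <- c(seen, s1str)
    res[,i] <- s1
    i <- i+1
  }
  j <- j+1
}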
Related
I'm trying to convert a data set in a long-format panel structure to an adjacency matrix or edge list to make network graphs. The data set contains articles, each identified by an ID number. Each article can appear several times under a number of categories, hence the current long format:
ID <- c(1,1,1,2,2,2,3,3)
Category <- c("A","B","C","B","E","H","C","E")
dat <- data.frame(ID,Category)
I want to convert this into an adjacency matrix or an edge list, where the edge list should look something like this:
A B
A C
B C
B E
B H
E H
C E
Edit: I have tried dat <- merge(ID, Category, by="Category") but it returns the error message Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
Thanks in advance
Update: I ended up using the crossprod(table(dat)) from the comments, but the solution suggested by Navy Cheng below works just as well
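For reference, here is what the crossprod route looks like (a sketch; zeroing the diagonal is optional and just removes self-pairs):
adj <- crossprod(table(dat)) #Category-by-Category co-occurrence counts
diag(adj) <- 0               #optional: drop self-links
adj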
This code will work
do.call(rbind, lapply(split(dat, dat$ID), function(x){
  t(combn(as.vector(x$Category), 2))
}))
Update
As per #Parfait's suggestion, you can use by instead of split + lapply.
1) Use by to group nodes ("A", "B", "C" ...) by ID;
2) Use combn to create an edge between the nodes in each group, and t to transpose the matrix for the later rbind
> edge.list <- by(dat, dat$ID, function(x) t(combn(as.vector(x$Category), 2)))
dat$ID: 1
[,1] [,2]
[1,] "A" "B"
[2,] "A" "C"
[3,] "B" "C"
------------------------------------------------------------
dat$ID: 2
[,1] [,2]
[1,] "B" "E"
[2,] "B" "H"
[3,] "E" "H"
------------------------------------------------------------
dat$ID: 3
[,1] [,2]
[1,] "C" "E"
3) Then merge the list
> do.call(rbind, edge.list)
[,1] [,2]
[1,] "A" "B"
[2,] "A" "C"
[3,] "B" "C"
[4,] "B" "E"
[5,] "B" "H"
[6,] "E" "H"
[7,] "C" "E"
If you are willing to convert your data.frame to a data.table, this problem can be solved pretty efficiently and cleanly, and it will be much faster if you have many rows.
library(data.table)
dat<-data.table(dat)
Basically you can apply functions to columns of the data.table in the j argument and group in the by argument. So you want all the combinations of categories taken two at a time for each ID, which looks like this:
dat[,combn(Category,2),by=ID]
However, stopping at this point keeps the ID column and by default creates a column called V1, which concatenates the output of combn into a single vector of categories rather than the two-column edge list you need. By chaining another call you can reshape that vector into a matrix, just as you would with any single vector. In one line of code:
dat[,combn(Category,2),by=ID][,matrix(V1,ncol=2,byrow = T)]
Remember that the vector column we wish to convert to a matrix is called V1 by default, and that we want the 2-column matrix to be filled by row instead of by column (the default). Hope that helps, and let me know if I need to add anything to my explanation. Good luck!
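PS: if you prefer to get the two columns directly, here is an alternative sketch; the as.character() call is just a guard in case Category is stored as a factor, and the result columns again default to V1 and V2:
dat[, as.data.table(t(combn(as.character(Category), 2))), by = ID]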
When I run the code:
library(vecsets)
p <- c("a","b")
q <- c( "a")
vunion(p,q, multiple = TRUE)
I get the result:
[1] "a" "b"
But I expect the result to be
vunion(p,q, multiple = TRUE)
[1] "a" "b" "a"
I also do not understand the result provided in the example of the vecsets package. The example shows:
x <- c(1:5,3,3,3,2,NA,NA)
y <- c(2:5,4,3,NA)
vunion(x,y,multiple=TRUE)
[1] 2 3 3 4 5 NA 1 3 3 2 NA 4
But if we check
length(x)+length(y); length(vunion(x,y))
[1] 18
[1] 12
we get different lengths, but I think they should be the same. Note, for example, that 5 appears only once.
What's going on here? Can someone explain?
I think the vecsets package documentation (link) describes this behavior quite well:
The base::union function removes duplicates per algebraic set theory. vunion does not, and so returns as many duplicate elements as are in either input vector (not the sum of their inputs.) In short, vunion is the same as vintersect(x,y) + vsetdiff(x,y) + vsetdiff(y,x).
It's true that you have to read carefully, though; the key phrase is "as many duplicate elements as are in either input vector". The issue is not character versus numeric vectors, but whether elements are repeated within the same vector. Consider p1 versus p2 in the following example. The result from vunion will have as many a's as either input alone, so we expect one "a" in the first case and two a's in the second; both times we expect only one "b":
library(vecsets)
q <- c("a", "b")
p1 <- c("a", "b")
vunion(p1, q, multiple = TRUE)
[1] "a" "b"
p2 <- c("a", "a", "b")
vunion(p2, q, multiple = TRUE)
[1] "a" "b" "a"
To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:
N is the length of the data
k is the number of distinct values of id
M is the number of desired parts
The idea is that M << k << N, so splitting by id is no good.
library(data.table)
library(dplyr)
set.seed(1)
N <- 16 # in application N is very large
k <- 6 # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace=T), value=runif(N)) %>%
arrange(id)
t(dt$id)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# [1,] "a" "b" "b" "b" "b" "c" "c" "c" "d" "d" "d" "e" "e" "f" "f" "f"
in this example, the desired split for M=3 is {{a,b}, {c,d}, {e,f}}
and for M=4 is {{a,b}, {c}, {d,e}, {f}}
More generally, if id were numeric, the cutoff points should be
quantile(id, probs=seq(0, 1, length.out = M+1), type=1) or some similar split to roughly-equal parts.
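For concreteness, a sketch of what such a size-based cut could look like, using cumulative group sizes instead of quantiles (part is a hypothetical column name):
sizes <- dt[, .N, by = id]
sizes[, part := ceiling(M * cumsum(N) / sum(N))] #ids stay together, parts roughly equal by row count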
What is an efficient way to do this?
Preliminary comment
I recommend reading what the main author of data.table has to say about parallelization with it.
I don't know how familiar you are with data.table, but you may have overlooked its by argument...? Quoting #eddi's comment from below...
Instead of literally splitting up the data - create a new "parallel.id" column, and then call
dt[, parallel_operation(.SD), by = parallel.id]
Answer, assuming you don't want to use by
Sort the IDs by size:
ids <- names(sort(table(dt$id)))
n <- length(ids)
Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
Split the ids in order, with roughly the same number of IDs in each group (like zero323's answer):
gs <- split(alt_ids, ceiling(seq(n) / (n/M)))
res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])]
# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],]
Check that the sizes aren't too bad:
# using the OP's example data...
sapply(res, nrow)
# [1] 7 9 for M = 2
# [1] 5 5 6 for M = 3
# [1] 1 6 3 6 for M = 4
# [1] 1 4 2 3 6 for M = 5
Although I emphasized data.table at the top, this should work fine with a data.frame, too.
If the distribution of the ids is not pathologically skewed, the simplest approach would be something like this:
split(dt, as.numeric(as.factor(dt$id)) %% M)
It assigns each id to a bucket using the factor value mod the number of buckets.
For most applications this is good enough to get a relatively balanced distribution of data. You should be careful with input like time series, though; in such a case you can simply enforce a random order of the levels when you create the factor. Choosing a prime number for M is a more robust approach, but most likely less practical.
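For example, a sketch of the randomised-levels variant:
lvls <- sample(unique(dt$id)) #random level order breaks up any systematic ordering
split(dt, as.numeric(factor(dt$id, levels = lvls)) %% M)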
If k is big enough, you can use this idea to split data into groups:
First, let's find the size of each id:
group_sizes <- dt[, .N, by = id]
Then create two lists of length M: one tracking the running size of each group, the other tracking which ids each group contains
grps_vals <- list()
grps_vals[1 : M] <- c(0)
grps_nms <- list()
grps_nms[1 : M] <- c(0)
(Here I deliberately added zero values so that lists of length M could be created.)
Then loop over the ids, on each iteration adding the next id to the currently smallest group. This keeps the groups roughly equal:
for (i in 1:nrow(group_sizes)){
  sums <- sapply(grps_vals, sum)
  idx <- which(sums == min(sums))[1]
  grps_vals[[idx]] <- c(grps_vals[[idx]], group_sizes$N[i])
  grps_nms[[idx]] <- c(grps_nms[[idx]], as.character(group_sizes$id[i]))
}
Finally, delete the leading zero element from the list of names :)
grps_nms <- lapply(grps_nms, function(x){x[-1]})
> grps_nms
[[1]]
[1] "a" "d" "f"
[[2]]
[1] "b"
[[3]]
[1] "c" "e"
Just an alternative approach using dplyr. Run the chained script step by step to visualise how the dataset changes through each step. It is a simple process.
library(data.table)
library(dplyr)
set.seed(1)
N <- 16 # in application N is very large
k <- 6 # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace=T), value=runif(N)) %>%
arrange(id)
dt %>%
  select(id) %>%
  distinct() %>% # select distinct id values
  mutate(group = ntile(id,3)) %>% # create grouping
  inner_join(dt, by="id") # join back initial information
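If you then want the actual list of M parts, a small follow-up sketch:
grouped <- dt %>%
  select(id) %>%
  distinct() %>%
  mutate(group = ntile(id, 3)) %>%
  inner_join(dt, by = "id")
split(grouped, grouped$group)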
PS: I've learnt lots of useful stuff based on previous answers.
I am teaching myself the basics of R and have been encountering trouble using the function tapply when passing the sort function while trying to use non-default optional arguments for sort. Here is an example of the trouble I am facing:
Given the vectors
x <- c(1.1, 1.0, 2.1, NA_real_)
y <- c("a", "b", "c","d")
I find that
tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
results in the same output regardless of the logical assignments I endow decreasing and na.last with. In fact, the output always defaults to the sort default values
decreasing = FALSE, na.last = NA
For the record, when inputing the above example, the output is
> tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
1 1.1 2.1
"b" "a" "c"
Let me also mention that if I define the alternate function
sort2 <- function(v) sort(v, decreasing=TRUE, na.last=TRUE);
and pass sort2 to tapply instead, I still encounter the same trouble.
I am running this code on Mac OS X 10.10.4, using R 3.2.0. Using sort standalone results in the desired behavior (calling sort on its own without passing through tapply, that is): it acts appropriately when I alter the decreasing and na.last arguments.
Thank you in advance for any help.
I don't think you're using tapply() correctly.
tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
The above line of code basically says "sort vector y grouping by categorical vector x". Your vector x is not really a categorical vector at all: it's a numeric vector with only distinct values, plus an NA. tapply() ignores the NA index and treats each of the remaining three distinct numeric values in x as a separate group, so each of the three corresponding character strings from y is passed to its own call of sort(). Sorting a length-one vector has no visible effect, which explains why your customization arguments appear to do nothing, and the result comes back ordered by the x groups.
Here's an example of how to do what I think you're trying to do:
x <- c(NA,1,2,3,NA,2,1,3);
g <- rep(letters[1:2],each=4);
x;
## [1] NA 1 2 3 NA 2 1 3
g;
## [1] "a" "a" "a" "a" "b" "b" "b" "b"
tapply(x,g,sort,decreasing=T,na.last=T);
## $a
## [1] 3 2 1 NA
##
## $b
## [1] 3 2 1 NA
##
Edit: When you want to sort a vector by another vector, you can use order():
y[order(x,decreasing=T,na.last=T)];
## [1] "c" "a" "b" "d"
y[order(x,decreasing=F,na.last=T)];
## [1] "b" "a" "c" "d"
I am trying to generate a random sequence from a fixed number of characters that contains at least one of each character.
For example having the ensemble
m = letters[1:3]
I would like to create a sequence of N = 10 elements that contains at least one of each of the m characters, like
# [1] "a" "a" "a" "a" "b" "c" "c" "c" "c" "a"
I tried with sample(m, N, replace=T) but in this way also a sequence like
# [1] "a" "a" "a" "a" "a" "c" "c" "c" "c" "a"
can be generated that does not contain b.
f <- function(x, n){
  #include each element of x once, pad with random draws from x, then shuffle
  sample(c(x, sample(x, n - length(x), replace=TRUE)))
}
f(letters[1:3], 5)
# [1] "a" "c" "a" "b" "a"
f(letters[1:3], 5)
# [1] "a" "a" "b" "b" "c"
f(letters[1:3], 5)
# [1] "a" "a" "b" "c" "a"
f(letters[1:3], 5)
# [1] "b" "c" "b" "c" "a"
Josh O'Brien's answer is a good way to do it, but it doesn't provide much input checking. Since I already wrote it, I might as well present my answer. It's pretty much the same thing, but it takes care of checking things like only considering unique items and making sure there are enough unique items to guarantee you get at least one of each.
at_least_one_samp <- function(n, input){
  # Only consider unique items.
  items <- unique(input)
  unique_items_count <- length(items)
  if(unique_items_count > n){
    stop("Not enough unique items in input to give at least one of each")
  }
  # Get values for vector - force each item in at least once,
  # then randomly select values to fill the remaining slots.
  vals <- c(items, sample(items, n - unique_items_count, replace = TRUE))
  # Now shuffle them
  sample(vals)
}
m <- c("a", "b", "c")
at_least_one_samp(10, m)
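A quick sanity check (sketch) that every draw really does contain all of m; this should be TRUE by construction, since each unique item is forced in exactly once:
set.seed(42)
all(replicate(1000, all(m %in% at_least_one_samp(10, m))))
#[1] TRUE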