It seems it is not possible to have a matrix of factors in R. Is that true? If so, why? If not, how should I do it?
f <- factor(sample(letters[1:5], 20, rep=TRUE), letters[1:5])
m <- matrix(f,4,5)
is.factor(m) # fail.
m <- factor(m,letters[1:5])
is.factor(m) # oh, yes?
is.matrix(m) # nope. fail.
dim(f) <- c(4,5) # aha?
is.factor(f) # yes..
is.matrix(f) # yes!
# but then I get a strange behavior
cbind(f,f) # is not a factor anymore
head(f,2) # doesn't give the first 2 rows but the first 2 elements of f
# should I worry about it?
In this case, it may walk like a duck and even quack like a duck, but f from:
f <- factor(sample(letters[1:5], 20, rep=TRUE), letters[1:5])
dim(f) <- c(4,5)
really isn't a matrix, even though is.matrix() claims that it strictly is one. To be a matrix as far as is.matrix() is concerned, f only needs to be a vector and have a dim attribute. By adding the attribute to f you pass the test. As you have seen, however, once you start using f as a matrix, it quickly loses the features that make it a factor (you end up working with the levels or the dimensions get lost).
There are really only matrices and arrays for the atomic vector types:
logical,
integer,
double (real),
complex,
string (or character), and
raw
plus, as @hadley reminds me, you can also have list matrices and arrays (by setting the dim attribute on a list object). See, for example, the Matrices & Arrays section of Hadley's book, Advanced R.
Anything outside those types would be coerced to some lower type via as.vector(). This happens in matrix(f, nrow = 3) not because f is atomic according to is.atomic() (which returns TRUE for f because it is internally stored as an integer and typeof(f) returns "integer"), but because it has a class attribute. This sets the OBJECT bit on the internal representation of f and anything that has a class is supposed to be coerced to one of the atomic types via as.vector():
matrix <- function(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
dimnames = NULL) {
if (is.object(data) || !is.atomic(data))
data <- as.vector(data)
....
Adding dimensions via dim<-() is a quick way to create an array without duplicating the object, but this bypasses some of the checks and balances that R would do if you coerced f to a matrix via the other methods
matrix(f, nrow = 3) # or
as.matrix(f)
This gets found out when you try to use basic functions that work on matrices or use method dispatch. Note that after assigning dimensions to f, f still is of class "factor":
> class(f)
[1] "factor"
which explains the head() behaviour; you are not getting the head.matrix behaviour because f is not a matrix, at least as far as the S3 mechanism is concerned:
> debug(head.matrix)
> head(f) # we don't enter the debugger
[1] d c a d b d
Levels: a b c d e
> undebug(head.matrix)
and the head.default method calls [ for which there is a factor method, and hence the observed behaviour:
> debugonce(`[.factor`)
> head(f)
debugging in: `[.factor`(x, seq_len(n))
debug: {
y <- NextMethod("[")
attr(y, "contrasts") <- attr(x, "contrasts")
attr(y, "levels") <- attr(x, "levels")
class(y) <- oldClass(x)
lev <- levels(x)
if (drop)
factor(y, exclude = if (anyNA(levels(x)))
NULL
else NA)
else y
}
....
The cbind() behaviour can be explained from the documented behaviour (from ?cbind, emphasis mine):
The functions cbind and rbind are S3 generic, ...
....
In the default method, all the vectors/matrices must be atomic
(see vector) or lists. Expressions are not allowed. Language
objects (such as formulae and calls) and pairlists will be coerced
to lists: other objects (such as names and external pointers) will
be included as elements in a list result. Any classes the inputs
might have are discarded (in particular, factors are replaced by
their internal codes).
Again, the fact that f is of class "factor" is defeating you because the default cbind method will get called and it will strip the levels information and return the internal integer codes as you observed.
In many respects, you have to ignore or at least not fully trust what the is.foo functions tell you, because they are just using simple tests to say whether something is or is not a foo object. is.matrix() and is.atomic() are clearly wrong when it comes to f (with dimensions) from a particular point of view. They are also right in terms of their implementation or at least their behaviour can be understood from the implementation; I think is.atomic(f) is not correct, but if by "if is of an atomic type" R Core mean "type" to be the thing returned by typeof(f) then is.atomic() is right. A more strict test is is.vector(), which f fails:
> is.vector(f)
[1] FALSE
because it has attributes beyond a names attribute:
> attributes(f)
$levels
[1] "a" "b" "c" "d" "e"
$class
[1] "factor"
$dim
[1] 4 5
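As a minimal sketch of the mismatch between the is.foo tests and S3 dispatch (the factor values here are made up for illustration, not taken from the question), the following compares what each check looks at:

```r
# Small illustrative factor (values chosen here, not taken from the question)
f <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))
dim(f) <- c(2, 2)

is.matrix(f)           # TRUE: only checks for a dim attribute of length 2
inherits(f, "matrix")  # FALSE: class(f) is still "factor"
is.vector(f)           # FALSE: attributes other than names are present
```

Method dispatch follows class(f), which is why matrix methods such as head.matrix are never reached.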
As to how should you get a factor matrix, well you can't, at least if you want it to retain the factor information (the labels for the levels). One solution would be to use a character matrix, which would retain the labels:
> fl <- levels(f)
> fm <- matrix(f, ncol = 5)
> fm
[,1] [,2] [,3] [,4] [,5]
[1,] "c" "a" "a" "c" "b"
[2,] "d" "b" "d" "b" "a"
[3,] "e" "e" "e" "c" "e"
[4,] "a" "b" "b" "a" "e"
and we store the levels of f for future use in case we lose some elements of the matrix along the way.
Or work with the internal integer representation:
> (fm2 <- matrix(unclass(f), ncol = 5))
[,1] [,2] [,3] [,4] [,5]
[1,] 3 1 1 3 2
[2,] 4 2 4 2 1
[3,] 5 5 5 3 5
[4,] 1 2 2 1 5
and you can always get back to the levels/labels again via:
> fm2[] <- fl[fm2]
> fm2
[,1] [,2] [,3] [,4] [,5]
[1,] "c" "a" "a" "c" "b"
[2,] "d" "b" "d" "b" "a"
[3,] "e" "e" "e" "c" "e"
[4,] "a" "b" "b" "a" "e"
Using a data frame would seem to be not ideal for this as each component of the data frame would be treated as a separate factor whereas you seem to want to treat the array as a single factor with one set of levels.
If you really wanted to do what you want, which is have a factor matrix, you would most likely need to create your own S3 class to do this, plus all the methods to go with it. For example, you might store the factor matrix as a character matrix but with class "factorMatrix", where you stored the levels alongside the factor matrix as an extra attribute, say. Then you would need to write [.factorMatrix, which would grab the levels, then use the default [ method on the matrix, and then add the levels attribute back on again. You could write cbind and head methods as well. The list of required methods would grow quickly, however, but a simple implementation may suit and if you make your objects have class c("factorMatrix", "matrix") (i.e. inherit from the "matrix" class), you'll pick up all the properties/methods of the "matrix" class (which will drop the levels and other attributes) so you can at least work with the objects and see where you need to add new methods to fill out the behaviour of the class.
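A rough sketch of how such a class might start is given below. The constructor name factorMatrix, its arguments, and the choice to return an ordinary factor when the result is no longer a matrix are all assumptions made for illustration; this is not an existing package API, and only the [ method is covered.

```r
# Hypothetical constructor: a character matrix that carries its levels
# in an extra attribute (the names here are illustrative only).
factorMatrix <- function(x, levels = sort(unique(as.vector(x)))) {
  stopifnot(is.matrix(x), all(x %in% levels | is.na(x)))
  structure(x, levels = levels, class = c("factorMatrix", "matrix"))
}

# Subsetting: drop to the plain matrix, subset with the default method,
# then re-attach the levels (and the class, if the result is still a matrix).
`[.factorMatrix` <- function(x, ..., drop = FALSE) {
  lev <- attr(x, "levels")
  y <- unclass(x)[..., drop = drop]
  if (is.matrix(y)) {
    structure(y, levels = lev, class = c("factorMatrix", "matrix"))
  } else {
    factor(y, levels = lev)  # a plain factor once the dimensions are gone
  }
}

fm <- factorMatrix(matrix(c("a", "b", "c", "a"), 2, 2),
                   levels = c("a", "b", "c", "d"))
sub <- fm[1, , drop = FALSE]   # still a factorMatrix
attr(sub, "levels")            # the unused level "d" is retained
col1 <- fm[, 1, drop = TRUE]   # an ordinary factor with the same levels
```

Methods such as cbind, rbind and print would still need to be written in the same style before the class is comfortable to use.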
Unfortunately factor support is not completely universal in R, so many R functions default to treating factors as their internal storage type, which is integer:
> typeof(factor(letters[1:3]))
[1] "integer"
This is what happens with matrix, cbind. They don't know how to handle factors, but they do know what to do with integers, so they treat your factor like an integer. head is actually the opposite. It does know how to handle a factor, but it never bothers to check that your factor is also a matrix so just treats it like a normal dimensionless factor vector.
Your best bet to operate as if you had factors with your matrix is to coerce it to character. Once you are done with your operations, you can restore it back to factor form. You could also do this with the integer form, but then you risk weird stuff (you could for example do matrix multiplication on an integer matrix, but that makes no sense for factors).
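A minimal sketch of that round trip, with made-up data: the levels are saved first, the work is done on a character matrix, and a factor is rebuilt at the end. Note that factor() returns a plain vector, so the dimensions are kept separately here.

```r
f <- factor(c("a", "b", "c", "a", "b", "c"), levels = c("a", "b", "c", "d"))
lev <- levels(f)                        # keep the full set of levels

m <- matrix(as.character(f), nrow = 2)  # character matrix, labels intact
m2 <- cbind(m, m)                       # ordinary matrix operations work

# Restore a factor afterwards; factor() drops the dim attribute, so the
# dimensions are stored on the side in case they are still needed.
f2 <- factor(m2, levels = lev)
d2 <- dim(m2)
```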
Note that if you add class "matrix" to your factor some (but not all) things start working:
f <- factor(letters[1:9])
dim(f) <- c(3, 3)
class(f) <- c("factor", "matrix")
head(f, 2)
Produces:
[,1] [,2] [,3]
[1,] a d g
[2,] b e h
Levels: a b c d e f g h i
This doesn't fix rbind, etc.
When I run the code:
library(vecsets)
p <- c("a","b")
q <- c( "a")
vunion(p,q, multiple = TRUE)
I get the result:
[1] "a" "b"
But I expect the result to be
vunion(p,q, multiple = TRUE)
[1] "a" "b" "a"
I also do not understand the result provided in the example of the vecsets package. The example shows:
x <- c(1:5,3,3,3,2,NA,NA)
y <- c(2:5,4,3,NA)
vunion(x,y,multiple=TRUE)
[1] 2 3 3 4 5 NA 1 3 3 2 NA 4
But if we check
length(x)+length(y); length(vunion(x,y))
[1] 18
[1] 12
we get different lengths, but I think they should be the same. Note, for example, 5 appears only once.
What's going on here? Can someone explain?
I think the vecsets package documentation describes this behavior quite well:
The base::union function removes duplicates per algebraic set theory. vunion does not, and so returns as many duplicate elements as are in either input vector (not the sum of their inputs.) In short, vunion is the same as vintersect(x,y) + vsetdiff(x,y) + vsetdiff(y,x).
It's true that you have to read carefully, though. The important part is that vunion "returns as many duplicate elements as are in either input vector (not the sum of their inputs)". The issue is not character versus numeric vectors, but whether elements are repeated within the same vector. Consider p1 versus p2 in the following example. The result from vunion will have as many a's as whichever of p or q has more of them, so we expect one "a" in the first part and two a's in the second part; both times we expect only one "b":
library(vecsets)
q <- c("a", "b")
p1 <- c("a", "b")
vunion(p1, q, multiple = TRUE)
[1] "a" "b"
p2 <- c("a", "a", "b")
vunion(p2, q, multiple = TRUE)
[1] "a" "b" "a"
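To make the counting rule concrete, here is a base R sketch that keeps each value as many times as it occurs in whichever input has more copies. The helper name multiset_union is made up for this illustration, it is not part of vecsets, and the order of its result will generally differ from what vunion returns.

```r
# Hypothetical helper: each distinct value is repeated max(count in x,
# count in y) times, which is the rule quoted from the documentation.
multiset_union <- function(x, y) {
  vals <- union(x, y)
  counts <- vapply(vals, function(v) {
    max(sum(x %in% v), sum(y %in% v))  # %in% also matches NA to NA
  }, numeric(1))
  rep(vals, times = counts)
}

x <- c(1:5, 3, 3, 3, 2, NA, NA)
y <- c(2:5, 4, 3, NA)
length(multiset_union(x, y))   # 12, not length(x) + length(y) = 18
```

For example, 5 occurs once in x and once in y, so it appears once in the result, while 3 occurs four times in x and twice in y, so it appears four times.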
I am trying to convert a basic matrix from one type to another. This seems like a really basic question, but surprisingly I have not seen an answer to it.
Here's a simple example:
> btest <- matrix(LETTERS[1:9], ncol = 3)
> ctest <- apply(btest, 2, as.factor)
> class(ctest[1,1])
[1] "character"
The only examples I could find on stack overflow dealt with data.frame columns, which seems more straightforward...
dtest <- as.data.frame(btest, stringsAsFactors = F)
dtest[] <- lapply(dtest[colnames(dtest)], as.factor)
dtest
V1 V2 V3
1 A D G
2 B E H
3 C F I
class(dtest[1,1])
[1] "factor"
Is there a straightforward way to change a matrix from character to factor and specify the levels as well?
A matrix holds only one data type. A factor is a compound data type made up of integer codes and character levels, and a matrix cannot hold two types at a time. A list is the appropriate data structure for factors, and a data.frame is a kind of list.
The help documentation of matrix ?matrix states that
an optional data vector (including a list or expression
vector). Non-atomic classed R objects are coerced by as.vector and all
attributes discarded.
The attributes of a factor are shown below.
attributes(factor(letters[1:4]))
$levels
[1] "a" "b" "c" "d"
$class
[1] "factor"
These attributes are removed using as.vector during matrix formation.
attributes(as.vector(factor(letters[1:4])))
NULL
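If a data frame is acceptable, one way to specify the levels is to pass them to factor() for every column. This is only a sketch: it assumes that all columns should share one set of levels, and the variable lev is introduced here for illustration.

```r
btest <- matrix(LETTERS[1:9], ncol = 3)
lev <- LETTERS[1:9]                      # the levels every column should use

dtest <- as.data.frame(btest, stringsAsFactors = FALSE)
dtest[] <- lapply(dtest, factor, levels = lev)

sapply(dtest, nlevels)                   # each column carries all nine levels
```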
In R, a matrix is mostly just a vector with a dim attribute of length 2 (see ?matrix). Its class is matrix, but it usually isn't stored as an attribute, unlike with list-based objects.
Thus, you can reconstruct a factor matrix with structure:
btest <- matrix(LETTERS[1:9], ncol = 3)
btest_fac <- structure(factor(btest), dim = dim(btest), class = c('matrix', 'factor'))
btest_fac
#> [,1] [,2] [,3]
#> [1,] A D G
#> [2,] B E H
#> [3,] C F I
#> Levels: A B C D E F G H I
str(btest_fac)
#> matrix [1:3, 1:3] A B C D ...
#> - attr(*, "levels")= chr [1:9] "A" "B" "C" "D" ...
class(btest_fac)
#> [1] "matrix" "factor"
However, while this is possible, it's not very useful, as functions will deal with it unpredictably, e.g. apply will coerce it to integer. You could define your own class and appropriate methods for it, but that would be a lot more work.
I read these:
https://stackoverflow.com/a/5159049/1175496
Matrices are for data of the same type.
https://stackoverflow.com/q/29732279/1175496
Vectors (and so matrix) can accept only one type of data
If matrix can only accept one data type, why can I do this:
> m_list<-matrix(list('1',2,3,4),2,2)
> m_list
[,1] [,2]
[1,] "1" 3
[2,] 2 4
The console output looks like I am combining character and integer data types.
The console output looks similar to this matrix:
> m_vector<-matrix(1:4,2,2)
> m_vector
[,1] [,2]
[1,] 1 3
[2,] 2 4
When I assign to m_list, it doesn't coerce the other values (as in https://stackoverflow.com/q/29732279/1175496 )
> m_list[2,2] <-'4'
> m_list
[,1] [,2]
[1,] "1" 3
[2,] 2 "4"
OK here is what I gather from replies so far:
Question
How can I have a matrix with different types?
Answer
You cannot; the elements are not different types; all (4) elements of this matrix are lists
all(
is.list(m_list[1,1]),
is.list(m_list[2,1]),
is.list(m_list[1,2]),
is.list(m_list[2,2]))
#[1] TRUE
Question
But I constructed matrix like this: matrix(list('1',2,3,4),2,2), how did this become a matrix of (4) lists, rather than a matrix of (4) characters, or even (4) integers?
Answer
I'm not sure. Even though the documentation says re: the first argument to matrix:
Non-atomic classed R objects are coerced by as.vector and all
attributes discarded.
It seems these are identical
identical(as.vector(list('1',2,3,4)), list('1',2,3,4))
#[1] TRUE
Question
But I assign a character ('4') to an element of m_list, how does that work?
m_list[2,2] <-'4'
Answer
It is "coerced", as if you did this:
m_list[2,2] <- as.list('4')
Question
If the elements in m_list are lists, is m_list equivalent to matrix(c(list('1'),list(2),list(3),list(4)),2,2)?
Answer
Yes, these are equivalent:
m_list <- matrix(list('1',2,3,4),2,2)
m_list2 <- matrix(c(list('1'),list(2),list(3),list(4)),2,2)
identical(m_list, m_list2)
#[1] TRUE
Question
So how can I retrieve the typeof the '1' hidden in m_list[1,1]?
Answer
At least two ways:
typeof(m_list[1,1][[1]])
#[1] "character"
...or, can directly do this (thanks, Frank) (since indexing has this "is applied in turn to the list, the selected component, the selected component of that component, and so on" behavior)...
typeof(m_list[[1,1]])
#[1] "character"
Question
How can I tell the difference between these two
m1 <- matrix(c(list(1), list(2), list(3), list(4)), 2, 2)
m2 <- matrix(1:4, 2, 2)
Answer
If you are using RStudio,
m1 is described as List of 4
m2 is described as int [1:2, 1:2] 1 2 3 4
..or else, just use typeof(), which for vectors and matrices, identifies the type of their elements... (thanks, Martin)
typeof(m1)
#[1] "list"
typeof(m2)
#[1] "integer"
class can also help distinguish, but you must wrap the matrices in vectors first:
#Without c(...)
class(m1)
#[1] "matrix"
class(m2)
#[1] "matrix"
#With c(...)
class(c(m1))
#[1] "list"
class(c(m2))
#[1] "integer"
...you could tell a subtle difference in the console output; notice how the m2 (containing integers) right-aligns its elements (because numerics are usually right-aligned)...
m1
# [,1] [,2]
#[1,] 1 3
#[2,] 2 4
m2
# [,1] [,2]
#[1,] 1 3
#[2,] 2 4
Short-Answer: Matrices in R cannot contain different data types. All data have to or will be transformed into either logical, numerical, character or list.
Matrices always contain the same type. If the input data to matrix() have different data types, they will automatically be transformed into the same type. Thus, all data will be either logical, numerical, character or list. That is what happens in your example: all elements are transformed into individual lists.
> myList <- list('1',2,3,4)
> myMatrix <- matrix( myList ,2,2)
> myMatrix
[,1] [,2]
[1,] "1" 3
[2,] 2 4
> typeof(myMatrix)
"list"
If you want to convert your data completely away from a list, you need to unlist the data.
> myList <- list('1',2,3,4)
> myMatrix <- matrix( unlist(myList) ,2,2)
> myMatrix
[,1] [,2]
[1,] "1" "3"
[2,] "2" "4"
> typeof(myMatrix)
"character"
Picking up the comments, verify yourself:
typeof(m_list)
typeof(m_list[2,2])
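To see the type of what is stored inside each cell rather than the type of the matrix itself, one option is to apply typeof() to every element and put the dimensions back on the result. The name element_types is just a label chosen for this sketch.

```r
m_list <- matrix(list("1", 2, 3, 4), 2, 2)

typeof(m_list)                 # "list": the type of the matrix itself

# Type of the object stored in each cell, arranged like the matrix
element_types <- vapply(m_list, typeof, character(1))
dim(element_types) <- dim(m_list)
element_types
```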
I am teaching myself the basics of R and have been encountering trouble using the function tapply when passing the sort function while trying to use non-default optional arguments for sort. Here is an example of the trouble I am facing:
Given the vectors
x <- c(1.1, 1.0, 2.1, NA_real_)
y <- c("a", "b", "c","d")
I find that
tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
results in the same output regardless of the logical assignments I endow decreasing and na.last with. In fact, the output always defaults to the sort default values
decreasing = FALSE, na.last = NA
For the record, when inputting the above example, the output is
> tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
1 1.1 2.1
"b" "a" "c"
Let me also mention that if I define the alternate function
sort2 <- function(v) sort(v, decreasing=TRUE, na.last=TRUE);
and pass sort2 to tapply instead, I still encounter the same trouble.
I am running this code on Mac OS X 10.10.4, using R 3.2.0. Using sort standalone results in the desired behavior (calling sort on its own without passing through tapply, that is), since it acts appropriately when altering the decreasing and na.last arguments.
Thank you in advance for any help.
I don't think you're using tapply() correctly.
tapply(y, x, sort, decreasing=TRUE, na.last=TRUE)
The above line of code basically says "sort vector y grouping by categorical vector x". Your vector x is not really a categorical vector at all, it's a numeric vector with only distinct values, plus an NA. tapply() ignores the NA index, and then treats each of the remaining three distinct numeric values in x as separate groups, so it passes each of the three corresponding character strings from y to three different calls of sort(), which obviously has no effect on anything (which explains why your customization arguments have no effect) and returns the result ordered by the x groups.
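One way to see the groups that get handed to sort() is to split y by x directly. This sketch uses split() as a stand-in for the grouping step; like tapply(), it drops the element whose group is NA.

```r
x <- c(1.1, 1.0, 2.1, NA_real_)
y <- c("a", "b", "c", "d")

groups <- split(y, x)   # one element of the list per distinct value of x
names(groups)           # "1" "1.1" "2.1"
lengths(groups)         # every group holds a single string
```

With a single element per group there is nothing for decreasing or na.last to change.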
Here's an example of how to do what I think you're trying to do:
x <- c(NA,1,2,3,NA,2,1,3);
g <- rep(letters[1:2],each=4);
x;
## [1] NA 1 2 3 NA 2 1 3
g;
## [1] "a" "a" "a" "a" "b" "b" "b" "b"
tapply(x,g,sort,decreasing=T,na.last=T);
## $a
## [1] 3 2 1 NA
##
## $b
## [1] 3 2 1 NA
##
Edit: When you want to sort a vector by another vector, you can use order():
y[order(x,decreasing=T,na.last=T)];
## [1] "c" "a" "b" "d"
y[order(x,decreasing=F,na.last=T)];
## [1] "b" "a" "c" "d"
I have a vector:
seq1a <- c('a','b','c','b','a','b','c','b','a','b','c')
I wish to permute the elements of this vector to create multiple (ideally up to 5000) vectors with the condition that the permuted vectors cannot have repeated elements within the vector in consecutive elements. e.g. "abbca...." is not allowed as 'b-b' is a repeat.
I realize that for this small example there probably are not 5000 solutions. I am typically dealing with much larger vectors. I am also willing to consider sampling with replacement, though currently I'm working on solutions without replacement.
I am looking for better solutions than my current thinking.
Option 1. - brute force.
Here, I just repeatedly sample and check if any successive elements are duplicates.
set.seed(18)
seq1b <- sample(seq1a)
seq1b
#[1] "b" "b" "a" "a" "c" "b" "b" "c" "a" "c" "b"
sum(seq1b[-length(seq1b)]==seq1b[-1]) #3
This is not a solution as there are 3 duplicated consecutive elements. I also realize that lag is probably a better way to check for duplicated elements but for some reason it is being finicky (I think it is being masked by another package I have loaded).
set.seed(1000)
res<-NULL
for (i in 1:10000){res[[i]]<-sample(seq1a)}
res1 <- lapply(res, function(x) sum(x[-length(x)]==x[-1]))
sum(unlist(res1)==0) #228
This produces 228 options out of 10000 iterations. But let's see how many unique ones:
res2 <- res[which(unlist(res1)==0)]
unique(unlist(lapply(res2, paste0, collapse=""))) #134
Out of 10000 attempts we only get 134 unique ones from this short example vector.
Here are 3 of the 134 example sequences produced:
# "bcbabcbabca" "cbabababcbc" "bcbcababacb"
In fact, if I try over 500,000 samples, I can only get 212 unique sequences that match my non-repeating criteria. This is probably close to the upper limit of possible ones.
Option 2. - iteratively
A second idea I had is to be more iterative about the approach.
seq1a
table(seq1a)
#a b c
#3 5 3
We could sample one of these letters as our starting point. Then sample another from the remaining ones, check if it is the same as the previously chosen one and if not, add it to the end. And so on and so forth...
set.seed(10)
newseq <- sample(seq1a,1) #b
newseq #[1] "b"
remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)]
table(remaining)
#a b c
#3 4 3
set.seed(10)
newone <- sample(remaining,1) #c
#check if newone is same as previous one.
newone==newseq[length(newseq)] #FALSE
newseq <- c(newseq, newone) #update newseq
newseq #[1] "b" "c"
remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)] #update remaining
remaining
table(remaining)
#a b c
#3 4 2
This might work, but I can also see it running into lots of issues - e.g. we could go:
# "a" "c" "a" "c" "a" "b" ...
and then be left with 3 more 'b's that cannot go at the end as they'd be duplicates.
Of course, this would be a lot easier if I allowed sampling with replacement, but for now I'm trying to do this without replacement.
You can use the iterpc package to work with combinations and iterations. I hadn't heard of it until trying to answer this question so there might also be more effective ways to use the same package.
Here I've used iterpc to set up an iterator, and getall to find all combinations of the vector based on that iterator. This seems to just report unique combinations, making it a bit nicer than finding all combinations with expand.grid.
#install.packages("iterpc")
require("iterpc")
seq1 <- c('a','b','c','b','a','b','c','b','a','b','c')
I <- iterpc(n = table(seq1), ordered=TRUE)
all_seqs <- getall(I)
# result is a matrix with permutations as rows:
head(all_seqs)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#[1,] "a" "a" "a" "b" "b" "b" "b" "b" "c" "c" "c"
#[2,] "a" "a" "a" "b" "b" "b" "b" "c" "b" "c" "c"
#[3,] "a" "a" "a" "b" "b" "b" "b" "c" "c" "b" "c"
#[4,] "a" "a" "a" "b" "b" "b" "b" "c" "c" "c" "b"
#[5,] "a" "a" "a" "b" "b" "b" "c" "b" "b" "c" "c"
#[6,] "a" "a" "a" "b" "b" "b" "c" "b" "c" "b" "c"
The rle function tells us about consecutive values equal to each other in a vector. The lengths component of the output tells us how many times each element of values is repeated:
rle(c("a", "a", "b", "b", "b", "c", "b"))
# Run Length Encoding
# lengths: int [1:4] 2 3 1 1
# values : chr [1:4] "a" "b" "c" "b"
The length of values or lengths will be equal to the length of the original vector only for combinations which have no consecutive repeats.
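As a small sketch of that test on a single vector (the helper name no_consecutive_repeats is made up for this illustration):

```r
# TRUE when every run has length 1, i.e. there are as many runs as elements
no_consecutive_repeats <- function(x) {
  length(rle(x)$lengths) == length(x)
}

no_consecutive_repeats(c("a", "b", "a", "c"))   # TRUE
no_consecutive_repeats(c("a", "b", "b", "c"))   # FALSE
```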
You can therefore apply rle to each row, calculate the length of values or lengths and keep rows from all_seqs where the calculated value is the same as the length of seq1.
#apply the rle function
all_seqs_rle <- apply(all_seqs, 1, function(x) length(rle(x)$values))
# keep rows which have an rle with a length equal to length(seq1)
all_seqs_good <- all_seqs[which(all_seqs_rle == length(seq1)), ]
all_seqs_good has an nrow of 212, suggesting that you did indeed find all possible combinations for your example vector.
nrow(all_seqs_good)
# 212
Technically this is still brute forcing (except that it doesn't calculate every possible combination - only unique ones), but is fairly quick for your example. I'm not sure how well it will cope with larger vectors yet...
Edit: this does seem to fail for larger vectors. One solution would be to break larger vectors into smaller chunks, then process those chunks as above and combine them - keeping only the combinations which meet your criteria.
For example, breaking a vector of length 24 into two vectors of length 12, then combining the results can give you 200,000+ combinations which meet your criteria and is pretty quick (around 1 minute for me):
# function based on the above solution
seq_check <- function(mySeq){
I = iterpc(n = table(mySeq), ordered=TRUE)
all_seqs <- getall(I)
all_seqs_rle <- apply(all_seqs, 1, function(x) length(rle(x)$values))
all_seqs_good <- all_seqs[which(all_seqs_rle == length(mySeq)), ]
return(all_seqs_good)
}
set.seed(1)
seq1<-sample(c(rep("a", 8), rep("b", 8), rep("c", 8)),24)
seq1a <- seq1[1:12]
seq1b <- seq1[13:24]
#get all permutations with no consecutive repeats
seq1a = apply(seq_check(seq1a), 1, paste0, collapse="")
seq1b = apply(seq_check(seq1b), 1, paste0, collapse="")
#combine seq1a and seq1b:
combined_seqs <- expand.grid(seq1a, seq1b)
combined_seqs <- apply(combined_seqs, 1, paste0, collapse="")
#function to calculate rle lengths
rle_calc <- function(x) length(rle(unlist(strsplit(x, "")))$values)
#keep combined sequences which have rle lengths of 24
combined_seqs_rle <- sapply(combined_seqs, rle_calc)
passed_combinations <- combined_seqs[which(combined_seqs_rle == 24)]
#find number of solutions
length(passed_combinations)
#[1] 245832
length(unique(passed_combinations))
#[1] 245832
You might need to re-order the starting vector for best results. For example, if seq1 in the above example had started with "a" eight times in a row, there would be no passing solutions. Try the splitting-up solution with seq1 <- c(rep("a", 8), rep("b", 8), rep("c", 8)) and you get no solutions back, even though there are really the same number of solutions as for the random sequence.
It doesn't look like you need to find every possible passing combination, but if you do then for larger vectors you'll probably need to iterate through I using the getnext function from iterpc, and check each one in a loop which would be very slow.
Here is another solution. Please see the comments in the code for an explanation of the algorithm.
In a way, it's similar to your second (iterative) approach, but it includes
a while loop that ensures that the next element is valid
and a stopping criterion for the case when the remaining elements would necessarily form an invalid combination
The algorithm is also quite efficient with longer seq1 vectors as given in one of your comments. But I guess its performance will degrade if you have more unique elements in seq1.
Here is the code:
First a few definitions
set.seed(1234)
seq1=c('a','b','c','b','a','b','c','b','a','b','c')
#number of attempts to generate a valid combination
Nres=10000
#this list will hold the results
#we do not have to care about memory allocation
res_list=list()
Now generate the combinations
#the outer loop creates the user-defined number of combination attempts
for (i in 1:Nres) {
#create a "population" from seq1
popul=seq1
#pre-allocate an NA vector of the same length as seq1
res_vec=rep(NA_character_,length(seq1))
#take FIRST draw from the population
new_draw=sample(popul,1)
#remove draw from population
popul=popul[-match(new_draw,popul)]
#save new draw
res_vec[1]=new_draw
#now take remaining draws
for (j in 2:length(seq1)) {
#take new draws as long as
#1) new_draw is equal to the last draw and
#2) as long as there are any valid elements left in popul
while((new_draw==res_vec[j-1])&any(res_vec[j-1]!=popul)) {
#take new draw
new_draw=sample(popul,1)
}
#if we did not find a valid draw break inner loop
if (new_draw==res_vec[j-1]) {
break
}
#otherwise save new_draw ...
res_vec[j]=new_draw
#... and delete new draw from population
popul=popul[-match(new_draw,popul)]
}
#this is to check whether we had to break the inner loop
#if not, save results vector
if (sum(is.na(res_vec[j]))==0) res_list[[length(res_list)+1]]=res_vec
}
Now let's check the results
#for each result vector in res_list:
#1) check whether all subsequent elements are different ---> sum(x[-1]==x[-length(x)])==0
#2) and whether we have the same number of elements as in seq1 ---> all.equal(table(x),table(seq1),check.attributes=FALSE)
sum(sapply(res_list,function(x) (sum(x[-1]==x[-length(x)])==0)&all.equal(table(x),table(seq1),check.attributes=FALSE)))
#6085
#the previous number should be the same as the length of res_list
length(res_list)
#6085
#check the number of unique solutions
length(unique(res_list))
#212
The speed of your actual job will depend on a lot of factors (e.g. how many possible passing combinations exist), but I think you can accomplish this relatively quickly by using 2 loops (similarly to how you outlined, but possibly quicker):
Permute your set of values and check that no two consecutive values are equal.
Assess whether the passing permutation is unique to those that have already been chosen
In the following example, two values control the search: nsuccess, the desired number of unique permutations, and nmax, the maximum number of permutations to try (which sets an upper limit on computation time).
Example
seq1 <- c('a','b','c','b','a','b','c','b','a','b','c')
seq1
set.seed(1)
nsuccess <- 200
nmax <- 30000
res <- matrix(NA, nrow=length(seq1), ncol=nsuccess)
i <- 1
j <- 1
while(i <= nsuccess & j <= nmax){
s1 <- sample(seq1)
s1str <- paste(s1, collapse=",")
test <- rle(s1)$lengths
if(sum(test) == length(test)) { # check that no values are consecutive
U <- unique(apply(res, 2, function(x){paste(x, collapse=",")}))
if(!s1str %in% U){ # check if new permutation is unique
res[,i] <- s1
i <- i+1
}
}
j <-j+1
}
print(paste("i =", i, "; j =", j))
res # view the unique permutations
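The loop above re-pastes every stored column each time a candidate passes the first check. A lighter variant, sketched here with a made-up function name and with each accepted permutation stored as a comma-separated string, keeps the keys themselves and tests new candidates against them:

```r
# Hypothetical helper: rejection sampling that stores accepted permutations
# as comma-separated strings so uniqueness is a simple %in% lookup.
sample_unique_valid <- function(x, nsuccess, nmax) {
  seen <- character(0)
  tries <- 0
  while (length(seen) < nsuccess && tries < nmax) {
    tries <- tries + 1
    s <- sample(x)
    if (any(s[-1] == s[-length(s)])) next      # consecutive repeat: reject
    key <- paste(s, collapse = ",")
    if (!(key %in% seen)) seen <- c(seen, key) # keep only new permutations
  }
  seen
}

set.seed(1)
seq1 <- c('a','b','c','b','a','b','c','b','a','b','c')
out <- sample_unique_valid(seq1, nsuccess = 20, nmax = 30000)
length(out)                 # number of unique valid permutations found
strsplit(out[1], ",")[[1]]  # back to a character vector
```

As with the original loop, nmax bounds the running time, so fewer than nsuccess results may come back when valid permutations are rare.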