Split data.table into roughly equal parts

To parallelize a task, I need to split a big data.table into roughly equal parts,
keeping together groups defined by a column, id. Suppose:
N is the length of the data
k is the number of distinct values of id
M is the number of desired parts
The idea is that M << k << N, so splitting by id is no good.
library(data.table)
library(dplyr)
set.seed(1)
N <- 16 # in application N is very large
k <- 6 # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace=T), value=runif(N)) %>%
arrange(id)
t(dt$id)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# [1,] "a" "b" "b" "b" "b" "c" "c" "c" "d" "d" "d" "e" "e" "f" "f" "f"
In this example, the desired split for M=3 is {{a,b}, {c,d}, {e,f}}
and for M=4 is {{a,b}, {c}, {d,e}, {f}}
More generally, if id were numeric, the cutoff points should be
quantile(id, probs=seq(0, 1, length.out = M+1), type=1) or some similar split to roughly-equal parts.
What is an efficient way to do this?

Preliminary comment
I recommend reading what the main author of data.table has to say about parallelization with it.
I don't know how familiar you are with data.table, but you may have overlooked its by argument...? Quoting @eddi's comment from below...
Instead of literally splitting up the data - create a new "parallel.id" column, and then call
dt[, parallel_operation(.SD), by = parallel.id]
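One way to build such a column is sketched below in base R. It is only an illustration under assumptions: assign_parts is a hypothetical helper name, and the rule it uses (place each id by the midpoint of its row range) is just one of several reasonable ways to keep ids intact while balancing row counts.

```r
# Hypothetical helper: map every row to one of M parts so that
# all rows of an id land in the same part and parts have similar row counts.
assign_parts <- function(id, M) {
  tab  <- table(id)            # rows per id, in sorted id order
  ids  <- names(tab)
  n    <- as.vector(tab)
  mid  <- cumsum(n) - n / 2    # midpoint of each id's block of rows
  part <- ceiling(mid / (sum(n) / M))
  part <- pmin(pmax(part, 1), M)
  as.integer(part)[match(id, ids)]
}

# the ids from the question's example
id <- c("a", "b", "b", "b", "b", "c", "c", "c",
        "d", "d", "d", "e", "e", "f", "f", "f")
split(id, assign_parts(id, M = 3))  # {a,b}, {c,d}, {e,f}
split(id, assign_parts(id, M = 4))  # {a,b}, {c}, {d,e}, {f}
```

With a data.table, the result could then be stored with dt[, parallel.id := assign_parts(id, M)] before grouping by it.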
Answer, assuming you don't want to use by
Sort the IDs by size:
ids <- names(sort(table(dt$id)))
n <- length(ids)
Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
Split the ids in order, with roughly the same number of IDs in each group (like zero323's answer):
gs <- split(alt_ids, ceiling(seq(n) / (n/M)))
res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])]
# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],]
Check that the sizes aren't too bad:
# using the OP's example data...
sapply(res, nrow)
# [1] 7 9 for M = 2
# [1] 5 5 6 for M = 3
# [1] 1 6 3 6 for M = 4
# [1] 1 4 2 3 6 for M = 5
Although I emphasized data.table at the top, this should work fine with a data.frame, too.

If distribution of the ids is not pathologically skewed the simplest approach would be simply something like this:
split(dt, as.numeric(as.factor(dt$id)) %% M)
It assigns each id to a bucket using the factor value modulo the number of buckets.
For most applications this is good enough to get a relatively balanced distribution of data. You should be careful with input like time series, though. In such a case you can simply enforce a random order of levels when you create the factor. Choosing a prime number for M is a more robust approach, but most likely less practical.
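A minimal sketch of the shuffled-levels variant, in base R and using the ids from the question. The variable names lv and bucket are only illustrative, and set.seed is there just to make the shuffle reproducible.

```r
set.seed(1)  # only for a reproducible shuffle
id <- c("a", "b", "b", "b", "b", "c", "c", "c",
        "d", "d", "d", "e", "e", "f", "f", "f")
M <- 3

# put the levels in random order so that neighbouring ids
# (e.g. consecutive dates) do not map to buckets systematically
lv <- sample(unique(id))
bucket <- as.integer(factor(id, levels = lv)) %% M

table(bucket)                 # rows per bucket
parts <- split(id, bucket)    # one element per bucket, ids kept together
```

The buckets are labelled 0 to M-1 because of the modulo; how even they are depends entirely on the group sizes and on the shuffle.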

If k is big enough, you can use this idea to split the data into groups:
First, let's find the size of each id:
group_sizes <- dt[, .N, by = id]
Then create 2 lists of length M: one to track the sizes in each group and one to track which ids each group contains
grps_vals <- list()
grps_vals[1 : M] <- c(0)
grps_nms <- list()
grps_nms[1 : M] <- c(0)
(Here I deliberately added zero values so that the lists have length M.)
Then, in a loop, add each id to the currently smallest group. This keeps the groups roughly equal:
for (i in 1:nrow(group_sizes)) {
# total size of each group so far
sums <- sapply(grps_vals, sum)
# first group with the smallest total
idx <- which(sums == min(sums))[1]
grps_vals[[idx]] <- c(grps_vals[[idx]], group_sizes$N[i])
grps_nms[[idx]] <- c(grps_nms[[idx]], group_sizes$id[i])
}
Finally, delete the placeholder first element from each list of names :)
grps_nms <- lapply(grps_nms, function(x){x[-1]})
> grps_nms
[[1]]
[1] "a" "d" "f"
[[2]]
[1] "b"
[[3]]
[1] "c" "e"
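The same greedy idea can be written more compactly. The sketch below is a variation, not the code above: it processes ids from largest to smallest, which often tightens the balance. greedy_groups is a hypothetical name, and the sizes are the per-id row counts from the question's example.

```r
# Hypothetical helper: assign ids to M groups, always adding the next
# (largest remaining) id to the group with the smallest running total.
greedy_groups <- function(sizes, M) {
  sizes   <- sort(sizes, decreasing = TRUE)
  totals  <- numeric(M)
  members <- vector("list", M)
  for (nm in names(sizes)) {
    idx <- which.min(totals)              # first group with the smallest total
    totals[idx] <- totals[idx] + sizes[[nm]]
    members[[idx]] <- c(members[[idx]], nm)
  }
  list(members = members, totals = totals)
}

sizes <- c(a = 1, b = 4, c = 3, d = 3, e = 2, f = 3)
out <- greedy_groups(sizes, M = 3)
out$totals   # 5 6 5
```

For the example data this gives group totals of 5, 6 and 5 rows, compared with 7, 4 and 5 when the ids are taken in alphabetical order.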

Just an alternative approach using dplyr. Run the chained script step by step to visualise how the dataset changes through each step. It is a simple process.
library(data.table)
library(dplyr)
set.seed(1)
N <- 16 # in application N is very large
k <- 6 # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace=T), value=runif(N)) %>%
arrange(id)
dt %>%
select(id) %>%
distinct() %>% # select distinct id values
mutate(group = ntile(id,3)) %>% # create grouping
inner_join(dt, by="id") # join back initial information
PS: I've learnt lots of useful stuff based on previous answers.

Related

cbind binds the integer number rather than the content

cbind binds integer numbers rather than the content. When I use the paste function instead, I can see the text content. I'm not sure why cbind is binding integers rather than the content of the column.
It's not working:
data<-read.csv("NFL.CSV",head=T)
output <- cbind( data$content, cl$cluster)
Now I could see the content
output <- paste( data$content, cl$cluster)
Sample data: there are two columns, one is content and the other is id
content ,id
NFL flexes Dallas Cowboys-Washington Redskins game , cbbbcf9395705611c3eeeffaa610a602
#special_event32 redskins still suck ,9b50b8be10460eab6c0f6f3590067bd7
RG3 leads Redskins over Eagles 27-20 (The Associated Press) PHILADELPHIA (AP) -- With one ,77e1a37031884642b8d1bccad99516c6

Since you didn't give any example data, I have to guess, but I strongly suspect that your columns content and/or cluster are factor columns in which case cbind will convert them to integer values:
> cbind(as.factor(c("a", "b")), as.factor(c("a", "c")))
[,1] [,2]
[1,] 1 1
[2,] 2 2
What you can do is put as.character around your vectors:
> cbind(as.character(as.factor(c("a", "b"))),
+ as.character(as.factor(c("a", "b"))))
[,1] [,2]
[1,] "a" "a"
[2,] "b" "b"
or in your example:
output <- cbind(as.character(data$content),
as.character(cl$cluster))
Another solution is to use cbind.data.frame
> cbind.data.frame(as.factor(c("a", "b")), as.factor(c("a", "b")))
as.factor(c("a", "b")) as.factor(c("a", "b"))
1 a a
2 b b
or just data.frame
output <- data.frame(content = data$content,
cluster = cl$cluster)
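To see why this happens, it helps to look at how a factor is stored. The sketch below uses a small made-up factor f (not the asker's data): the labels live in the levels attribute, and the vector itself holds integer codes, which is what cbind picks up.

```r
f <- factor(c("b", "a", "b"))

levels(f)       # "a" "b": the labels
as.integer(f)   # 2 1 2: the codes that point into the labels

codes  <- cbind(f, 1:3)                 # matrix of integer codes
labels <- cbind(as.character(f), 1:3)   # character matrix with the labels
```

If the factor comes from read.csv, passing stringsAsFactors = FALSE keeps the column as character in the first place (this has been the default since R 4.0.0).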

Generate multiple permutations of vector with non-repeating elements

I have a vector:
seq1<-c('a','b','c','b','a','b','c','b','a','b','c')
I wish to permute the elements of this vector to create multiple (ideally up to 5000) vectors with the condition that the permuted vectors cannot have repeated elements within the vector in consecutive elements. e.g. "abbca...." is not allowed as 'b-b' is a repeat.
I realize that for this small example there probably are not 5000 solutions. I am typically dealing with much larger vectors. I am also willing to consider sampling with replacement, though currently I'm working on solutions without replacement.
I am looking for better solutions than my current thinking.
Option 1. - brute force.
Here, I just repeatedly sample and check if any successive elements are duplicates.
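seq1a <- seq1 # assumption: seq1a is the same vector as seq1 (it is not defined in the original post)
set.seed(18)
seq1b <- sample(seq1a)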
seq1b
#[1] "b" "b" "a" "a" "c" "b" "b" "c" "a" "c" "b"
sum(seq1b[-length(seq1b)]==seq1b[-1]) #3
This is not a solution as there are 3 duplicated consecutive elements. I also realize that lag is probably a better way to check for duplicated elements but for some reason it is being finicky (I think it is being masked by another package I have loaded).
set.seed(1000)
res<-NULL
for (i in 1:10000){res[[i]]<-sample(seq1a)}
res1 <- lapply(res, function(x) sum(x[-length(x)]==x[-1]))
sum(unlist(res1)==0) #228
This produces 228 options out of 10000 iterations. But let's see how many unique ones:
res2 <- res[which(unlist(res1)==0)]
unique(unlist(lapply(res2, paste0, collapse=""))) #134
Out of 10000 attempts we only get 134 unique ones from this short example vector.
Here are 3 of the 134 example sequences produced:
# "bcbabcbabca" "cbabababcbc" "bcbcababacb"
In fact, if I try over 500,000 samples, I can only get 212 unique sequences that match my non-repeating criteria. This is probably close to the upper limit of possible ones.
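The adjacency check used in Option 1 can be wrapped in a small base-R helper, which avoids relying on lag from any package. has_repeat is a hypothetical name used only for this sketch.

```r
# TRUE if any element equals the one directly before it
has_repeat <- function(x) any(x[-1] == x[-length(x)])

has_repeat(c("b", "b", "a"))   # TRUE
has_repeat(c("b", "a", "b"))   # FALSE

# keep only the candidate shuffles without consecutive repeats
candidates <- list(c("a", "b", "a"), c("a", "a", "b"))
kept <- Filter(function(s) !has_repeat(s), candidates)
```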
Option 2. - iteratively
A second idea I had is to be more iterative about the approach.
seq1a
table(seq1a)
#a b c
#3 5 3
We could sample one of these letters as our starting point. Then sample another from the remaining ones, check if it is the same as the previously chosen one and if not, add it to the end. And so on and so forth...
set.seed(10)
newseq <- sample(seq1a,1) #b
newseq #[1] "b"
remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)]
table(remaining)
#a b c
#3 4 3
set.seed(10)
newone <- sample(remaining,1) #c
#check if newone is same as previous one.
newone==newseq[length(newseq)] #FALSE
newseq <- c(newseq, newone) #update newseq
newseq #[1] "b" "c"
remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)] #update remaining
remaining
table(remaining)
#a b c
#3 4 2
This might work, but I can also see it running into lots of issues - e.g. we could go:
# "a" "c" "a" "c" "a" "b" ...
and then be left with 3 more 'b's that cannot go at the end as they'd be duplicates.
Of course, this would be a lot easier if I allowed sampling with replacement, but for now I'm trying to do this without replacement.

You can use the iterpc package to work with combinations and iterations. I hadn't heard of it until trying to answer this question so there might also be more effective ways to use the same package.
Here I've used iterpc to set up an iterator, and getall to find all combinations of the vector based on that iterator. This seems to just report unique combinations, making it a bit nicer than finding all combinations with expand.grid.
#install.packages("iterpc")
require("iterpc")
seq1 <- c('a','b','c','b','a','b','c','b','a','b','c')
I <- iterpc(n = table(seq1), ordered=TRUE)
all_seqs <- getall(I)
# result is a matrix with permutations as rows:
head(all_seqs)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#[1,] "a" "a" "a" "b" "b" "b" "b" "b" "c" "c" "c"
#[2,] "a" "a" "a" "b" "b" "b" "b" "c" "b" "c" "c"
#[3,] "a" "a" "a" "b" "b" "b" "b" "c" "c" "b" "c"
#[4,] "a" "a" "a" "b" "b" "b" "b" "c" "c" "c" "b"
#[5,] "a" "a" "a" "b" "b" "b" "c" "b" "b" "c" "c"
#[6,] "a" "a" "a" "b" "b" "b" "c" "b" "c" "b" "c"
The rle function tells us about consecutive values equal to each other in a vector. The lengths component of the output tells us how many times each element of values is repeated:
rle(c("a", "a", "b", "b", "b", "c", "b"))
# Run Length Encoding
# lengths: int [1:4] 2 3 1 1
# values : chr [1:4] "a" "b" "c" "b"
The length of values or lengths will be equal to the length of the original vector only for combinations which have no consecutive repeats.
You can therefore apply rle to each row, calculate the length of values or lengths, and keep rows from all_seqs where the calculated value is the same as the length of seq1.
#apply the rle function
all_seqs_rle <- apply(getall(I), 1, function(x) length(rle(x)$values))
# keep rows which have an rle with a length equal to length(seq1)
all_seqs_good <- all_seqs[which(all_seqs_rle == length(seq1)), ]
all_seqs_good has an nrow of 212, suggesting that you did indeed find all possible combinations for your example vector.
nrow(all_seqs_good)
# 212
Technically this is still brute forcing (except that it doesn't calculate every possible combination - only unique ones), but is fairly quick for your example. I'm not sure how well it will cope with larger vectors yet...
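The count of 212 can be cross-checked without generating any permutations. The sketch below is a separate, plain recursive count in base R (count_valid is a hypothetical name, not part of iterpc): it places one letter at a time and never places the letter used in the previous position.

```r
# counts: how many copies of each distinct element are still unplaced
# last:   index of the element placed in the previous position (0 = none)
count_valid <- function(counts, last = 0L) {
  if (sum(counts) == 0) return(1)   # everything placed: one valid sequence
  total <- 0
  for (k in seq_along(counts)) {
    if (k != last && counts[k] > 0) {
      counts[k] <- counts[k] - 1
      total <- total + count_valid(counts, k)
      counts[k] <- counts[k] + 1    # undo before trying the next element
    }
  }
  total
}

count_valid(c(a = 3, b = 5, c = 3))   # 212
```

This only counts the sequences. For long vectors the recursion would need memoisation to stay fast.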
Edit: this does seem to fail for larger vectors. One solution would be to break larger vectors into smaller chunks, then process those chunks as above and combine them - keeping only the combinations which meet your criteria.
For example, breaking a vector of length 24 into two vectors of length 12, then combining the results can give you 200,000+ combinations which meet your criteria and is pretty quick (around 1 minute for me):
# function based on the above solution
seq_check <- function(mySeq){
I = iterpc(n = table(mySeq), ordered=TRUE)
all_seqs <- getall(I)
all_seqs_rle <- apply(getall(I), 1, function(x) length(rle(x)$values))
all_seqs_good <- all_seqs[which(all_seqs_rle == length(mySeq)), ]
return(all_seqs_good)
}
set.seed(1)
seq1<-sample(c(rep("a", 8), rep("b", 8), rep("c", 8)),24)
seq1a <- seq1[1:12]
seq1b <- seq1[13:24]
#get all permutations with no consecutive repeats
seq1a = apply(seq_check(seq1a), 1, paste0, collapse="")
seq1b = apply(seq_check(seq1b), 1, paste0, collapse="")
#combine seq1a and seq1b:
combined_seqs <- expand.grid(seq1a, seq1b)
combined_seqs <- apply(combined_seqs, 1, paste0, collapse="")
#function to calculate rle lengths
rle_calc <- function(x) length(rle(unlist(strsplit(x, "")))$values)
#keep combined sequences which have rle lengths of 24
combined_seqs_rle <- sapply(combined_seqs, rle_calc)
passed_combinations <- combined_seqs[which(combined_seqs_rle == 24)]
#find number of solutions
length(passed_combinations)
#[1] 245832
length(unique(passed_combinations))
#[1] 245832
You might need to re-order the starting vector for best results. For example, if seq1 in the above example had started with "a" eight times in a row, there would be no passing solutions. Try the splitting-up solution with seq1 <- c(rep("a", 8), rep("b", 8), rep("c", 8)) and you get no solutions back, even though there are really the same number of solutions as for the random sequence.
It doesn't look like you need to find every possible passing combination, but if you do then for larger vectors you'll probably need to iterate through I using the getnext function from iterpc, and check each one in a loop which would be very slow.

Here is another solution. Please see the comments in the code for an explanation of the algorithm.
In a way, it's similar to your second (iterative) approach, but it includes
a while loop that ensures that the next element is valid
and a stopping criterion for the case when the remaining elements would necessarily form an invalid combination
The algorithm is also quite efficient with longer seq1 vectors such as the one given in one of your comments. But I guess its performance will degrade if you have more unique elements in seq1.
Here is the code:
First, a few definitions:
set.seed(1234)
seq1=c('a','b','c','b','a','b','c','b','a','b','c')
#number of attempts to generate a valid combination
Nres=10000
#this list will hold the results
#we do not have to care about memory allocation
res_list=list()
Now generate the combinations
#the outer loop creates the user-defined number of combination attempts
for (i in 1:Nres) {
#create a "population" from seq1
popul=seq1
#pre-allocate an NA vector of the same length as seq1
res_vec=rep(NA_character_,length(seq1))
#take FIRST draw from the population
new_draw=sample(popul,1)
#remove draw from population
popul=popul[-match(new_draw,popul)]
#save new draw
res_vec[1]=new_draw
#now take remaining draws
for (j in 2:length(seq1)) {
#take new draws as long as
#1) new_draw is equal to the last draw and
#2) as long as there are any valid elements left in popul
while((new_draw==res_vec[j-1])&any(res_vec[j-1]!=popul)) {
#take new draw
new_draw=sample(popul,1)
}
#if we did not find a valid draw break inner loop
if (new_draw==res_vec[j-1]) {
break
}
#otherwise save new_draw ...
res_vec[j]=new_draw
#... and delete new draw from population
popul=popul[-match(new_draw,popul)]
}
#this is to check whether we had to break the inner loop
#if not, save results vector
if (sum(is.na(res_vec[j]))==0) res_list[[length(res_list)+1]]=res_vec
}
Now let's check the results
#for each result vector in res_list:
#1) check whether all subsequent elements are different ---> sum(x[-1]==x[-length(x)])==0
#2) and whether we have the same number of elements as in seq1 ---> all.equal(table(x),table(seq1),check.attributes=FALSE)
sum(sapply(res_list,function(x) (sum(x[-1]==x[-length(x)])==0)&all.equal(table(x),table(seq1),check.attributes=FALSE)))
#6085
#the previous number should be the same as the length of res_list
length(res_list)
#6085
#check the number of unique solutions
length(unique(res_list))
#212

The speed of your actual job will depend on a lot of factors (e.g. how many possible passing combinations exist), but I think you can accomplish this relatively quickly by using 2 loops (similar to what you outlined, but possibly quicker):
Permute your set of values and check that there are no identical consecutive values.
Assess whether the passing permutation is unique relative to those that have already been chosen.
In the following example, you set two values to control the search: nsuccess is the desired number of unique permutations; nmax is the maximum number of permutations to try (it sets an upper limit on computation time).
Example
seq1 <- c('a','b','c','b','a','b','c','b','a','b','c')
seq1
set.seed(1)
nsuccess <- 200
nmax <- 30000
res <- matrix(NA, nrow=length(seq1), ncol=nsuccess)
i <- 1
j <- 1
while(i <= nsuccess & j <= nmax){
s1 <- sample(seq1)
s1str <- paste(s1, collapse=",")
test <- rle(s1)$lengths
if(sum(test) == length(test)) { # check that no values are consecutive
U <- unique(apply(res, 2, function(x){paste(x, collapse=",")}))
if(!s1str %in% U){ # check if new permutation is unique
res[,i] <- s1
i <- i+1
}
}
j <-j+1
}
print(paste("i =", i, "; j =", j))
res # view the unique permutations
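The same rejection loop can be sketched more compactly by storing each accepted permutation as a single string, so the uniqueness check does not have to re-paste the whole result matrix on every pass. sample_valid is a hypothetical name, and the sketch assumes every element is a single character, so that pasting without a separator is unambiguous.

```r
# Hypothetical helper: draw shuffles of x until nsuccess unique valid ones
# are found or nmax shuffles have been tried.
sample_valid <- function(x, nsuccess, nmax) {
  found <- character(0)
  tries <- 0
  while (length(found) < nsuccess && tries < nmax) {
    tries <- tries + 1
    s <- sample(x)
    if (any(s[-1] == s[-length(s)])) next     # consecutive repeat: reject
    key <- paste(s, collapse = "")
    if (!(key %in% found)) found <- c(found, key)
  }
  found
}

seq1 <- c('a','b','c','b','a','b','c','b','a','b','c')
set.seed(1)
out <- sample_valid(seq1, nsuccess = 50, nmax = 30000)
length(out)   # number of unique valid permutations found
```

strsplit(out, "") turns the strings back into character vectors if needed.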

Packing and unpacking elements from list in R

I have two questions related to using list in R and I am trying to see how I can improve my naive solution. I have seen questions on similar topic here but the approach described there is not helping.
Q1:
MWE:
a <- c(1:5)
b <- "adf"
c <- array(rnorm(9), dim = c(3,3) )
Make a list, say with name "packedList", while preserving the name of
all variables.
Current solution: packedList <- list(a = a, b = b, c = c)
However, if the number of variables (three in above problem i.e. a, b, c) is
large (say we have 20 variables), then my current solution may not be
the best.
This idea is useful when returning a large number of variables from
a function.
Q2:
MWE: Given packedList, extract variables a, b, c
I would like to extract all elements in the given list (i.e. packedList) to the environment while preserving their names. This is reverse of task 1.
For example: Given variable packedList in the environment, I can define a, b, and c as follows:
a <- packedList$a
b <- packedList$b
c <- packedList$c
However, if the number of variables is very large then my solution can be cumbersome.
- After some Google search, I found one solution but I am not sure if it is the most elegant solution either. The solution is shown below:
x <- packedList
for(i in 1:length(x)){
tempobj <- x[[i]]
eval(parse(text=paste(names(x)[[i]],"= tempobj")))
}

You are most likely looking for mget (Q1) and list2env (Q2).
Here's a small example:
ls() ## Starting with an empty workspace
# character(0)
## Create a few objects
a <- c(1:5)
b <- "adf"
c <- array(rnorm(9), dim = c(3,3))
ls() ## Three objects in your workspace
# [1] "a" "b" "c"
## Pack them all into a list
mylist <- mget(ls())
mylist
# $a
# [1] 1 2 3 4 5
#
# $b
# [1] "adf"
#
# $c
# [,1] [,2] [,3]
# [1,] 0.70647167 1.8662505 1.7941111
# [2,] -1.09570748 0.9505585 1.5194187
# [3,] -0.05225881 -1.4765127 -0.6091142
## Remove the original objects, keeping just the packed list
rm(a, b, c)
ls() ## only one object is there now
# [1] "mylist"
## Use `list2env` to recreate the objects
list2env(mylist, .GlobalEnv)
# <environment: R_GlobalEnv>
ls() ## The list and the other objects...
# [1] "a" "b" "c" "mylist"
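Since the question mentions returning many variables from a function, here is a minimal sketch of the same two calls used that way. pack_locals and target are hypothetical names, and unpacking into a fresh environment instead of .GlobalEnv is just one option that avoids overwriting existing objects.

```r
# Pack every local variable of a function into a named list
pack_locals <- function() {
  a <- 1:5
  b <- "adf"
  mget(ls(), envir = environment())   # ls() lists the function's own variables
}

packed <- pack_locals()
names(packed)   # "a" "b"

# Unpack into a separate environment rather than the global one
target <- new.env()
list2env(packed, envir = target)
target$a        # 1 2 3 4 5
```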

Finding all possible combinations of vector intersections?

I have a set of four vectors that look like this:
[1] PRI2CO HEISCO PRI2CO DIALGU DIALGU ALSEBL
Levels: ALSEBL DIALGU HEISCO PRI2CO
[1] PRI2CO TET2PA ALSEBL PRI2CO ALSEBL TET2PA
[7] HEISCO TET2PA
Levels: ALSEBL HEISCO PRI2CO TET2PA
I would like to generate a vector that contains all values that match between every possible combination of the four vectors. For the two above, it would contain ALSEBL, HEISCO, and PRI2CO. I've been doing every combination by hand so far but it's tedious and I figure there has to be a better way. I tried writing a loop for it but I'm pretty new to R and it hasn't worked yet. Here's what I've been doing:
trees.species.P234<-intersect(intersect(trees.species.P2,trees.species.P3),trees.species.P4)
> trees.species.P234
[1] "PRI2CO " "ALSEBL "
I was thinking a for loop that involved a factorial might do it, but I can't get it to work.

Here you go, using the same vectors as proposed by gadzooks:
v1 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v2 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
v3 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v4 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
veclist <- list(v1,v2,v3,v4)
combos <- Reduce(c,lapply(2:length(veclist),
function(x) combn(1:length(veclist),x,simplify=FALSE) ))
lapply(combos, function(x) Reduce(intersect,veclist[x]) )
#[[1]]
#[1] "PRI2CO" "HEISCO" "ALSEBL"
#
#[[2]]
#[1] "PRI2CO" "HEISCO" "DIALGU" "ALSEBL"
#
#[[3]]
#[1] "PRI2CO" "HEISCO" "ALSEBL"
#etc etc
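With an unnamed result list it is hard to tell which intersection belongs to which combination. The sketch below is a variation on the code above that works on the names of a named list and labels each result. The list contents and the names P1 to P3 are made up for illustration.

```r
veclist <- list(P1 = c("a", "b"),
                P2 = c("b", "c"),
                P3 = c("b", "c", "d"))

# all combinations of 2 or more vectors, expressed by name
combos <- unlist(lapply(2:length(veclist),
                        function(k) combn(names(veclist), k, simplify = FALSE)),
                 recursive = FALSE)

# intersect within each combination and label the result
res <- lapply(combos, function(nm) Reduce(intersect, veclist[nm]))
names(res) <- vapply(combos, paste, character(1), collapse = "&")
res[["P2&P3"]]   # "b" "c"
```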

First you have to list all the combinations. For that, use the combn function.
> combn(1:4,2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 1 2 2 3
[2,] 2 3 4 3 4 4
Now we can use the apply function to find the intersection between your vectors. But before that,
let's create a list of vectors. For easy reproducibility I created this list.
c <- combn(1:4,2)
l <- list(c("a","b"),c("b","c"),c("c","d"),c("d","e"))
Result <- apply(c,2,function(x){intersect(l[[x[1]]],l[[x[2]]])})
This result will be a list. If you want it as a vector, you can use do.call:
do.call("c",Result)
[1] "b" "c" "d"
For unique components
unique(do.call("c",Result))
This can be used for large lists as well.

v1 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v2 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
v3 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v4 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
vall <- unique(c(v1,v2,v3,v4))
for(x in vall){
if((x %in% v1)&(x %in% v2)&(x %in% v3)&(x %in% v4)){
print(x)}
}

Shuffling a vector - all possible outcomes of sample()?

I have a vector with five items.
my_vec <- c("a","b","a","c","d")
If I want to re-arrange those values into a new vector (shuffle), I could use sample():
shuffled_vec <- sample(my_vec)
Easy - but the sample() function only gives me one possible shuffle. What if I want to know all possible shuffling combinations? The various "combn" functions don't seem to help, and expand.grid() gives me every possible combination with replacement, when I need it without replacement. What's the most efficient way to do this?
Note that in my vector, I have the value "a" twice. Therefore, each of the shuffled vectors returned should also contain "a" twice.

I think permn from the combinat package does what you want
library(combinat)
permn(my_vec)
A smaller example
> x
[1] "a" "a" "b"
> permn(x)
[[1]]
[1] "a" "a" "b"
[[2]]
[1] "a" "b" "a"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "b" "a" "a"
[[5]]
[1] "a" "b" "a"
[[6]]
[1] "a" "a" "b"
If the duplicates are a problem you could do something similar to this to get rid of duplicates
strsplit(unique(sapply(permn(my_vec), paste, collapse = ",")), ",")
Or probably a better approach to removing duplicates...
dat <- do.call(rbind, permn(my_vec))
dat[!duplicated(dat),]
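If generating all permutations and then discarding the duplicates becomes too expensive, the duplicates can be avoided up front. The sketch below is a plain base-R recursion, not part of combinat, and unique_perms is a hypothetical name: at each position it tries every distinct remaining value exactly once.

```r
# Hypothetical helper: all distinct orderings of x, returned as a list
unique_perms <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (v in unique(x)) {                 # each distinct value leads once
    rest <- x[-match(v, x)]              # drop one copy of v
    out <- c(out, lapply(unique_perms(rest), function(p) c(v, p)))
  }
  out
}

my_vec <- c("a", "b", "a", "c", "d")
perms <- unique_perms(my_vec)
length(perms)   # 60, i.e. 5!/2! because "a" appears twice
```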

Noting that your data is effectively 5 levels from 1-5, encoded as "a", "b", "a", "c", and "d", I went looking for ways to get the permutations of the numbers 1-5 and then remap those to the levels you use.
Let's start with the input data:
my_vec <- c("a","b","a","c","d") # the character
my_vec_ind <- seq(1,length(my_vec),1) # their identifier
To get the permutations, I applied the function given at Generating all distinct permutations of a list in R:
permutations <- function(n){
if(n==1){
return(matrix(1))
} else {
sp <- permutations(n-1)
p <- nrow(sp)
A <- matrix(nrow=n*p,ncol=n)
for(i in 1:n){
A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
}
return(A)
}
}
First, create a data.frame with the permutations:
tmp <- data.frame(permutations(length(my_vec)))
You now have a data frame tmp of 120 rows, where each row is a unique permutation of the numbers, 1-5:
>tmp
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 1 2 3 5 4
3 1 2 4 3 5
...
119 5 4 3 1 2
120 5 4 3 2 1
Now you need to remap them to the strings you had. You can remap them using a variation on the theme of gsub(), proposed here: R: replace characters using gsub, how to create a function?
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern))
x <- gsub(pattern[i], replacement[i], x, ...)
x
}
gsub() won't work because you have more than one value in the replacement array.
You also need a function you can call using lapply() to use the gsub2() function on every element of your tmp data.frame.
remap <- function(x,
old,
new){
return(gsub2(pattern = old,
replacement = new,
fixed = TRUE,
x = as.character(x)))
}
Almost there. We do the mapping like this:
shuffled_vec <- as.data.frame(lapply(tmp,
remap,
old = as.character(my_vec_ind),
new = my_vec))
which can be simplified to...
shuffled_vec <- as.data.frame(lapply(data.frame(permutations(length(my_vec))),
remap,
old = as.character(my_vec_ind),
new = my_vec))
.. should you feel the need.
That gives you your required answer:
> shuffled_vec
X1 X2 X3 X4 X5
1 a b a c d
2 a b a d c
3 a b c a d
...
119 d c a a b
120 d c a b a
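As an alternative to the gsub-based remapping, the index permutations can be used directly to subset my_vec. The sketch below assumes the permutations are available as a numeric matrix with one permutation per row. Here idx holds just two example rows rather than all 120.

```r
my_vec <- c("a", "b", "a", "c", "d")

# two example index permutations, one per row
idx <- rbind(c(1, 2, 3, 4, 5),
             c(5, 4, 3, 2, 1))

# look up the value for every index, then restore the matrix shape
shuffled <- matrix(my_vec[idx], nrow = nrow(idx))
shuffled[2, ]   # "d" "c" "a" "b" "a"
```

Indexing also keeps working when there are ten or more positions, where replacing "1" by text substitution would also touch the "1" in "10".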

Looking at a previous question (R: generate all permutations of vector without duplicated elements), I can see that the gtools package has a function for this. I couldn't however get this to work directly on your vector as such:
permutations(n = 5, r = 5, v = my_vec)
#Error in permutations(n = 5, r = 5, v = my_vec) :
# too few different elements
You can adapt it however like so:
apply(permutations(n = 5, r = 5), 1, function(x) my_vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "a" ...
#[2,] "b" "b" "b" "b" ...
#[3,] "a" "a" "c" "c" ...
#[4,] "c" "d" "a" "d" ...
#[5,] "d" "c" "d" "a" ...
