Given are two vectors, a and b:
a = letters[1:6]
b = letters[7:11]
The goal is to sample a two-column matrix from a and b. The first column should contain elements of a such that each element of a appears exactly twice. The second column should contain elements of b such that each element of b appears at least twice. One more condition is that the pairs have to be unique.
I have figured out how to sample the 12 pairs, but not how to ensure they are always unique. For example, in the attempt below, rows 3 and 11 are the same.
The desired output should have no duplicate rows.
set.seed(42)
m = cbind(sample(c(a, a)), sample(c(b, b, sample(b, 2, replace = TRUE))))
m
# [,1] [,2]
# [1,] "e" "g"
# [2,] "f" "k"
# [3,] "c" "k"
# [4,] "b" "h"
# [5,] "f" "j"
# [6,] "d" "i"
# [7,] "e" "h"
# [8,] "a" "g"
# [9,] "d" "h"
#[10,] "a" "i"
#[11,] "c" "k"
#[12,] "b" "j"
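For reference, the duplicated rows in m can be located with duplicated():
which(duplicated(m) | duplicated(m, fromLast = TRUE))
# [1]  3 11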
You can make it a function and throw replace in there, i.e.
f1 <- function(a, b){
  m <- cbind(sample(c(a, a)), sample(c(b, b, sample(b, 2, replace = TRUE))))
  m[, 2] <- replace(m[, 2], duplicated(m), sample(b[!b %in% m[duplicated(m), 2]], 1))
  return(m)
}
#which seems stable
sum(duplicated(f1(a, b)))
#[1] 0
sum(duplicated(f1(a, b)))
#[1] 0
sum(duplicated(f1(a, b)))
#[1] 0
sum(duplicated(f1(a, b)))
#[1] 0
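For a more thorough check than a handful of calls, the duplicate count can be tabulated over many runs (an empirical check rather than a guarantee, since f1 only patches a single duplicate):
table(replicate(1000, sum(duplicated(f1(a, b)))))
# all mass at 0 would indicate the fix holds up in practice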
Another way that doesn't require replacement is to build a 0/1 incidence matrix over the a-by-b grid (rows correspond to elements of a, columns to elements of b):
m = rbind(
  c(1,1,0,0,0),
  c(1,1,0,0,0),
  c(0,0,1,1,0),
  c(0,0,1,1,0),
  c(0,0,0,0,1),
  c(0,0,0,0,1)
)
# One "free" selection in each of the last two rows
m[5, sample(4,1)] = 1
m[6, sample(4,1)] = 1
# Scramble it while preserving row/column sums
m = m[sample(6), sample(5)]
as.matrix(expand.grid(a = a, b = b))[as.logical(m), ]
# a b
# [1,] "a" "g"
# [2,] "b" "g"
# [3,] "e" "g"
# [4,] "c" "h"
# [5,] "d" "h"
# [6,] "f" "h"
# [7,] "d" "i"
# [8,] "f" "i"
# [9,] "b" "j"
#[10,] "c" "j"
#[11,] "a" "k"
#[12,] "e" "k"
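It is easy to verify that this construction meets all three conditions (using a, b, and the scrambled m from above):
res <- as.matrix(expand.grid(a = a, b = b))[as.logical(m), ]
any(duplicated(res))   # FALSE -- each grid cell is selected at most once
table(res[, "a"])      # every element of a appears exactly twice (row sums of m are 2)
table(res[, "b"])      # every element of b appears at least twice (column sums of m are >= 2)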
Definitely not elegant, but it works: keep resampling the b column until all 12 pairs are unique.
a = letters[1:6]
b = letters[7:11]
asamp <- sample(c(a,a))
finished <- FALSE
while (!finished) {
  bsamp <- sample(c(b, b, sample(b, 2, replace = TRUE)))
  if (length(unique(paste(asamp, bsamp))) == 12) finished <- TRUE
}
cbind(asamp,bsamp)
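If useful, the rejection loop can be wrapped into a function (a small sketch; the name f2 is just illustrative):
f2 <- function(a, b) {
  asamp <- sample(c(a, a))
  repeat {
    bsamp <- sample(c(b, b, sample(b, 2, replace = TRUE)))
    if (!anyDuplicated(paste(asamp, bsamp))) break
  }
  cbind(asamp, bsamp)
}
sum(duplicated(f2(a, b)))
# [1] 0  -- always, by construction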
Related
I need a function similar to expand.grid but without the combinations of duplicate elements.
Here is a simplified version of my problem.
X1 = c("x","y","z")
X2 = c("A","B","C")
X3 = c("y","C","G")
d <- expand.grid(X1,X2,X3)
d
Var1 Var2 Var3
1 x A y
2 y A y
3 z A y
4 x B y
. . . .
. . . .
. . . .
23 y B G
24 z B G
25 x C G
26 y C G
27 z C G
d has 27 rows, but 6 of them contain duplicate values which I do not need: rows 2, 5, 8, 16, 17 and 18.
Is there a way to get the other 21 rows, which do not contain any duplicates?
Note that in the real case the vectors have more than 3 elements (c("x","y","z","k","m",...), up to 50) and there are more than 3 vectors (X4, X5, X6, ... up to 11). Because of this the expanded object gets very large and does not fit in RAM.
In RcppAlgos*, there is a function called comboGrid that does the trick:
library(RcppAlgos) ## as of v2.4.3
comboGrid(X1, X2, X3, repetition = F)
# Var1 Var2 Var3
# [1,] "x" "A" "C"
# [2,] "x" "A" "G"
# [3,] "x" "A" "y"
# [4,] "x" "B" "C"
# [5,] "x" "B" "G"
# [6,] "x" "B" "y"
# [7,] "x" "C" "G"
# [8,] "x" "C" "y"
# [9,] "y" "A" "C"
# [10,] "y" "A" "G"
# [11,] "y" "B" "C"
# [12,] "y" "B" "G"
# [13,] "y" "C" "G"
# [14,] "z" "A" "C"
# [15,] "z" "A" "G"
# [16,] "z" "A" "y"
# [17,] "z" "B" "C"
# [18,] "z" "B" "G"
# [19,] "z" "B" "y"
# [20,] "z" "C" "G"
# [21,] "z" "C" "y"
Large Test
set.seed(42)
rnd_lst <- lapply(1:11, function(x) {
  sort(sample(LETTERS, sample(26, 1)))
})
## Number of results that expand.grid would return if your machine
## had enough memory... over 300 billion!!!
prettyNum(prod(lengths(rnd_lst)), big.mark = ",")
# [1] "365,634,846,720"
exp_grd_test <- expand.grid(rnd_lst)
# Error: vector memory exhausted (limit reached?)
system.time(cmb_grd_test <- comboGrid(rnd_lst, repetition=FALSE))
# user system elapsed
# 9.866 0.330 10.196
dim(cmb_grd_test)
# [1] 3036012 11
head(cmb_grd_test)
# Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11
# [1,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "K"
# [2,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "L"
# [3,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "M"
# [4,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "N"
# [5,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "O"
# [6,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "P"
* I am the author of RcppAlgos
(Sorry, I just realized that your problem is as much a size problem as anything, so removing the duplicates post-generation may not be feasible. For that reason this may not be the best answer, but I'll keep it around for smaller, related questions.)
base R
I hard-code "3", but you can use ncol(d) and/or ncol(d)-1 for programmatic use.
d[lengths(apply(d, 1, unique)) > 2, ]
# Var1 Var2 Var3
# 1 x A y
# 3 z A y
# 4 x B y
# 6 z B y
# 7 x C y
# 9 z C y
# 10 x A C
# 11 y A C
# 12 z A C
# 13 x B C
# 14 y B C
# 15 z B C
# 19 x A G
# 20 y A G
# 21 z A G
# 22 x B G
# 23 y B G
# 24 z B G
# 25 x C G
# 26 y C G
# 27 z C G
(The row names are not reset; you can see the gaps to verify it is not 27 rows.)
And to verify, here are the rows with dupes:
d[lengths(apply(d, 1, unique)) < 3, ]
# Var1 Var2 Var3
# 2 y A y
# 5 y B y
# 8 y C y
# 16 x C C
# 17 y C C
# 18 z C C
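For programmatic use (and to guard against apply() simplifying to a matrix when every row happens to have all-distinct values), the per-row check can also be written explicitly; a minimal sketch using d from above:
keep <- apply(d, 1, function(r) length(unique(r)) == ncol(d))
d[keep, ]   # the same 21 rows, for any number of columns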
With a simple vector like
x <- sample(letters[1:3], size=20, replace=T)
I would extract the most frequent letter with something like
y <- table(x)
print(names(y)[y==max(y)])
"b"
However, using the same technique over a multidimensional dataframe does not work:
set.seed(5)
x <- data.frame(c1 = sample(letters[1:3], size = 30, replace = TRUE),
                c2 = sample(letters[4:5], size = 30, replace = TRUE),
                c3 = sample(letters[6:10], size = 30, replace = TRUE))
y <- table(x)
print(names(y)[y==max(y)])
NULL
How can I extract the levels of c1, c2, and c3 that have the highest value in the contingency table?
I know I could convert the table to a data frame and find the row where the Freq column is highest, but given the number of dimensions and levels in my dataset, the conversion to a data frame would not fit in RAM.
Edit: So my expected output in the second case would be c, d, j, as in:
z <- data.frame(y)
z[z$Freq==max(z$Freq), 1:3]
c1 c2 c3
27 c d j
But note that I cannot use the data.frame call on my data due to RAM issues.
You can use which with arr.ind = TRUE:
mapply("[",
dimnames(y),
as.data.frame(which(y == max(y), arr.ind = TRUE)))
# c1 c2 c3
#"c" "d" "j"
mapply("[",
dimnames(y),
as.data.frame(which(y == min(y), arr.ind = TRUE)))
# c1 c2 c3
# [1,] "a" "d" "f"
# [2,] "b" "d" "g"
# [3,] "c" "d" "g"
# [4,] "b" "e" "g"
# [5,] "a" "d" "h"
# [6,] "b" "d" "h"
# [7,] "c" "d" "h"
# [8,] "c" "e" "h"
# [9,] "a" "e" "i"
#[10,] "b" "e" "i"
#[11,] "c" "e" "i"
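To see why this works (using y from table(x) above): which(..., arr.ind = TRUE) returns one row of dimension indices per matching cell, and mapply("[", dimnames(y), ...) looks each index up in the corresponding dimnames vector.
idx <- which(y == max(y), arr.ind = TRUE)  # one row per maximal cell; columns index into dimnames(y)
dimnames(y)$c1[idx[, 1]]                   # the same lookup mapply does for the first dimension
# [1] "c"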
Given a variable x that can take the values A, B, C, D, and three columns holding values of x:
df1 <- rbind(
  c("A","B","C"), c("A","D","C"), c("B","A","C"), c("A","C","B"),
  c("B","C","A"), c("D","A","B"), c("A","B","D"), c("A","D","C"),
  c("A",NA,NA), c("D","A",NA), c("A","D",NA)
)
How do I add a column indicating the combination in the three preceding columns, such that permutations (ABC, ACB, BAC) are treated as the same combination ABC, and (AD, DA) as the same combination AD?
Pasting the three columns with apply(df1, 1, function(x) paste(x[!is.na(x)], collapse = ", ")) -> df1$x4 and then counting with df1 %>% group_by(x4) %>% summarize(c = n()) counts AD and DA as different instead of the same.
My desired result would be to get
a <- cbind(c("ABC", 4), c("ACD", 2), c("ABD", 2), c("A", 1), c("AD", 2))
Someone already solved my question. Thanks
You can apply paste after sorting each row vector (note that sort() drops NAs by default, so incomplete rows collapse correctly).
df1 <- cbind(df1, apply(df1, 1, function(x) paste(sort(x), collapse = "")))
df1
# [,1] [,2] [,3] [,4]
# [1,] "A" "B" "C" "ABC"
# [2,] "A" "D" "C" "ACD"
# [3,] "B" "A" "C" "ABC"
# [4,] "A" "C" "B" "ABC"
# [5,] "B" "C" "A" "ABC"
# [6,] "D" "A" "B" "ABD"
# [7,] "A" "B" "D" "ABD"
# [8,] "A" "D" "C" "ACD"
# [9,] "A" NA NA "A"
#[10,] "D" "A" NA "AD"
#[11,] "A" "D" NA "AD"
You can now simply table that column, with no need to load an external package or build a more complex pipe.
table(df1[, 4])
#A ABC ABD ACD AD
#1 4 2 2 2
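One small caveat: with multi-character values, collapsing without a separator can produce ambiguous keys (e.g. "AB" plus "C" and "A" plus "BC" both become "ABC"). Adding a separator avoids that; the counts here are unchanged:
key <- apply(df1[, 1:3], 1, function(x) paste(sort(x), collapse = "-"))
table(key)
# same counts as above: A=1, A-B-C=4, A-B-D=2, A-C-D=2, A-D=2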
Suppose I have two vectors like this:
l1 = c('C','D','E','F')
l2 = c('G','C','D','F')
I generate all combinations of two elements using the combn function:
l1_vector = t(combn(l1,2))
l2_vector = t(combn(l2,2))
> l1_vector
[,1] [,2]
[1,] "C" "D"
[2,] "C" "E"
[3,] "C" "F"
[4,] "D" "E"
[5,] "D" "F"
[6,] "E" "F"
> l2_vector
[,1] [,2]
[1,] "G" "C"
[2,] "G" "D"
[3,] "G" "F"
[4,] "C" "D"
[5,] "C" "F"
[6,] "D" "F"
Now I want to count the element pairs that appear in both l1_vector and l2_vector. In the example above, the count should be 3: ("C","D"), ("C","F"), ("D","F").
How can I do that without using a loop?
As mentioned in the comments, you can use the merge function for this. Since the default behavior of merge is to use all of the available columns, it will return only those rows that are perfect matches.
> merge(l1_vector, l2_vector)
V1 V2
1 C D
2 C F
3 D F
> nrow(merge(l1_vector, l2_vector))
[1] 3
While merge is perfectly fine for your case, there is a shortcut: the repeated pairs are exactly the 2-element combinations of the elements common to l1 and l2.
If you just need the number of repeated pairs:
choose(length(intersect(l1, l2)), 2)
[1] 3
If you need the repeated pairs themselves:
t(combn(intersect(l1, l2), 2))
[,1] [,2]
[1,] "C" "D"
[2,] "C" "F"
[3,] "D" "F"
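As a quick sanity check, this shortcut agrees with the merge approach here; note it relies on l1_vector and l2_vector each containing every 2-element combination of their source vectors.
nrow(merge(l1_vector, l2_vector)) == choose(length(intersect(l1, l2)), 2)
# [1] TRUE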
I have a data frame with 2 columns of factor variables like this:
V1 <- c("A","B","C","Y","D","E","F","U","G","H","I","J","R")
V2 <- c("Z","Y","W","B","V","U","T","E","S","R","Q","P","H")
df <- cbind(V1,V2)
df
V1 V2
[1,] "A" "Z"
[2,] "B" "Y"
[3,] "C" "W"
[4,] "Y" "B"
[5,] "D" "V"
[6,] "E" "U"
[7,] "F" "T"
[8,] "U" "E"
[9,] "G" "S"
[10,] "H" "R"
[11,] "I" "Q"
[12,] "J" "P"
[13,] "R" "H"
Now I would like to count, using a function, all the cases where the combination of V1 and V2 equals the combination of V2 and V1, and return them. For example, for df this count would be equal to 3, like this:
y <-combinations_inver(df[,1],df[,2])
y$Combinations
"B""Y"= "Y""B"
"E""U"= "U""E"
"H""R"= "R""H"
y$Count
[1] 3 #because there are three occurrences (see $Combinations)
A simple way to do it would be:
forwards <- paste(V1, V2)
backwards <- paste(V2, V1)
The intersection of these two "sets" is what you are looking for, but intersect() returns both orderings of each match, so you need to divide the length by 2:
length(intersect(forwards, backwards))/2
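To see which pairs are matched and why the division by 2 is needed (each reciprocal pair appears once in each direction):
intersect(forwards, backwards)
# [1] "B Y" "Y B" "E U" "U E" "H R" "R H"
length(intersect(forwards, backwards))/2
# [1] 3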
We can use pmin and pmax to order the two elements within each row, use duplicated to flag the rows that occur more than once, subset to those rows, take the unique ones, and count them with nrow:
m1 <- cbind(pmin(df[,1], df[,2]), pmax(df[,1], df[,2]))
i1 <- duplicated(m1)|duplicated(m1, fromLast=TRUE)
nrow(unique(m1[i1,]))
#[1] 3
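If you also want the matched combinations themselves (the y$Combinations part of the desired output), the same objects can be reused:
unique(m1[i1, ])
#      [,1] [,2]
# [1,] "B"  "Y"
# [2,] "E"  "U"
# [3,] "H"  "R"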