I need a function similar to expand.grid but without the combinations of duplicate elements.
Here is a simplified version of my problem.
X1 = c("x","y","z")
X2 = c("A","B","C")
X3 = c("y","C","G")
d <- expand.grid(X1,X2,X3)
d
Var1 Var2 Var3
1 x A y
2 y A y
3 z A y
4 x B y
. . . .
. . . .
. . . .
23 y B G
24 z B G
25 x C G
26 y C G
27 z C G
d has 27 rows, but 6 of them (rows 2, 5, 8, 16, 17 and 18) contain duplicate values, which I do not need.
Is there a way to get only the other 21 rows, which do not contain any duplicates?
Note that in the real case the vectors have more than 3 elements (c("x","y","z","k","m"...), up to 50) and there are more than 3 vectors (X4, X5, X6... up to 11). Because of this, the expanded object gets very large and does not fit in RAM.
In RcppAlgos*, there is a function called comboGrid that does the trick:
library(RcppAlgos) ## as of v2.4.3
comboGrid(X1, X2, X3, repetition = FALSE)
# Var1 Var2 Var3
# [1,] "x" "A" "C"
# [2,] "x" "A" "G"
# [3,] "x" "A" "y"
# [4,] "x" "B" "C"
# [5,] "x" "B" "G"
# [6,] "x" "B" "y"
# [7,] "x" "C" "G"
# [8,] "x" "C" "y"
# [9,] "y" "A" "C"
# [10,] "y" "A" "G"
# [11,] "y" "B" "C"
# [12,] "y" "B" "G"
# [13,] "y" "C" "G"
# [14,] "z" "A" "C"
# [15,] "z" "A" "G"
# [16,] "z" "A" "y"
# [17,] "z" "B" "C"
# [18,] "z" "B" "G"
# [19,] "z" "B" "y"
# [20,] "z" "C" "G"
# [21,] "z" "C" "y"
Large Test
set.seed(42)
rnd_lst <- lapply(1:11, function(x) {
sort(sample(LETTERS, sample(26, 1)))
})
## Number of results that expand.grid would return if your machine
## had enough memory... over 365 billion!!!
prettyNum(prod(lengths(rnd_lst)), big.mark = ",")
# [1] "365,634,846,720"
exp_grd_test <- expand.grid(rnd_lst)
# Error: vector memory exhausted (limit reached?)
system.time(cmb_grd_test <- comboGrid(rnd_lst, repetition=FALSE))
# user system elapsed
# 9.866 0.330 10.196
dim(cmb_grd_test)
# [1] 3036012 11
head(cmb_grd_test)
# Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11
# [1,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "K"
# [2,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "L"
# [3,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "M"
# [4,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "N"
# [5,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "O"
# [6,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "P"
* I am the author of RcppAlgos
(Sorry, I just realized that your problem is as much a size problem, so removing them post-generation may not be feasible. For that, this may not be the best answer, but I'll keep it around for smaller-and-related questions.)
base R
I hard-code the thresholds for 3 columns (> 2 here, < 3 further down), but you can use ncol(d) - 1 and ncol(d) for programmatic use.
d[lengths(apply(d, 1, unique)) > 2, ]
# Var1 Var2 Var3
# 1 x A y
# 3 z A y
# 4 x B y
# 6 z B y
# 7 x C y
# 9 z C y
# 10 x A C
# 11 y A C
# 12 z A C
# 13 x B C
# 14 y B C
# 15 z B C
# 19 x A G
# 20 y A G
# 21 z A G
# 22 x B G
# 23 y B G
# 24 z B G
# 25 x C G
# 26 y C G
# 27 z C G
(The row names are not reset, so the gaps in the numbering confirm that fewer than 27 rows remain.)
And to verify, here are the rows with dupes:
d[lengths(apply(d, 1, unique)) < 3, ]
# Var1 Var2 Var3
# 2 y A y
# 5 y B y
# 8 y C y
# 16 x C C
# 17 y C C
# 18 z C C
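If the full grid is too large to build first, one option is to prune while building instead of filtering afterwards. This is only a sketch, assuming character vectors with no repeated values inside a single vector; grid_no_dupes is a made-up helper name, not a function from base R or any package, and the result can still get large with many long vectors.

```r
# Hypothetical helper (not from any package): build the grid one vector
# at a time and drop rows as soon as a value repeats within the row.
grid_no_dupes <- function(...) {
  vecs <- list(...)
  res <- matrix(vecs[[1]], ncol = 1)
  for (v in vecs[-1]) {
    # pair every row kept so far with every value of the next vector
    rows <- rep(seq_len(nrow(res)), times = length(v))
    vals <- rep(v, each = nrow(res))
    # TRUE where the new value is not already present in that row
    keep <- rowSums(res[rows, , drop = FALSE] == vals) == 0
    res <- cbind(res[rows, , drop = FALSE], vals)[keep, , drop = FALSE]
  }
  unname(res)
}

X1 <- c("x", "y", "z")
X2 <- c("A", "B", "C")
X3 <- c("y", "C", "G")
res <- grid_no_dupes(X1, X2, X3)

# Same rows as filtering the full grid, without hard-coding the column count
d <- expand.grid(X1, X2, X3, stringsAsFactors = FALSE)
full <- d[apply(d, 1, anyDuplicated) == 0, ]
```

Here res is a 21 x 3 character matrix holding the same rows, in the same order, as full. The apply(d, 1, anyDuplicated) == 0 filter is the programmatic counterpart of the hard-coded length check.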
Related
With a simple vector like
x <- sample(letters[1:3], size=20, replace=T)
I would extract the most frequent letter with something like
y <- table(x)
print(names(y)[y==max(y)])
"b"
However, using the same technique over a multidimensional dataframe does not work:
set.seed(5)
x <- data.frame(c1=sample(letters[1:3], size=30, replace=T),
c2=sample(letters[4:5], size=30, replace=T),
c3=sample(letters[6:10], size=30, replace=T))
y <- table(x)
print(names(y)[y==max(y)])
NULL
How can I extract the levels of c1, c2, and c3 that have the highest value in the contingency table?
I know I could convert the table to a dataframe and find the row where the Freq column is highest, but given the number of dimensions and levels in my dataset, the resulting dataframe would not fit in my RAM.
Edit: So my expected output in the second case would be c, d, j, as in:
z <- data.frame(y)
z[z$Freq==max(z$Freq), 1:3]
c1 c2 c3
27 c d j
But note that I cannot use the data.frame call on my data due to RAM issues.
You can use which with arr.ind = TRUE:
mapply("[",
dimnames(y),
as.data.frame(which(y == max(y), arr.ind = TRUE)))
# c1 c2 c3
#"c" "d" "j"
mapply("[",
dimnames(y),
as.data.frame(which(y == min(y), arr.ind = TRUE)))
# c1 c2 c3
# [1,] "a" "d" "f"
# [2,] "b" "d" "g"
# [3,] "c" "d" "g"
# [4,] "b" "e" "g"
# [5,] "a" "d" "h"
# [6,] "b" "d" "h"
# [7,] "c" "d" "h"
# [8,] "c" "e" "h"
# [9,] "a" "e" "i"
#[10,] "b" "e" "i"
#[11,] "c" "e" "i"
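To see what the pieces do, here is a small sketch on made-up data (the data frame below is an assumption for illustration, not the data from the question). which(..., arr.ind = TRUE) returns one row of array indices per matching cell, and each index column is then used to look up the level names of the corresponding dimension.

```r
# Toy data, chosen so that the cell (a, d, f) is the unique maximum
x <- data.frame(c1 = c("a", "a", "a", "b", "b", "b"),
                c2 = c("d", "d", "d", "e", "e", "e"),
                c3 = c("f", "f", "f", "g", "g", "h"))
y <- table(x)

# One row per cell holding the maximum count, one column per dimension
idx <- which(y == max(y), arr.ind = TRUE)

# For each dimension, translate the numeric index into the level name
top <- mapply(function(lv, i) lv[i], dimnames(y), as.data.frame(idx))
top
```

Only the matching cells are converted to a data frame here, not the whole table.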
Given are two vectors, a and b
a = letters[1:6]
b = letters[7:11]
The goal is to sample a two column matrix using a and b. The first column should contain elements from a such that each element of a is repeated two times. The second column should contain elements from b such that each element of b is also repeated at least two times. One more condition is that the pairs have to be unique.
I have figured out how to sample the 12 pairs but have not figured out how I can ensure they will always be unique. For example, in the solution presented below, row 3 and row 11 are the same.
The desired output should have no duplicate rows.
set.seed(42)
m = cbind(sample(c(a, a)), sample(c(b, b, sample(b, 2, replace = TRUE))))
m
# [,1] [,2]
# [1,] "e" "g"
# [2,] "f" "k"
# [3,] "c" "k"
# [4,] "b" "h"
# [5,] "f" "j"
# [6,] "d" "i"
# [7,] "e" "h"
# [8,] "a" "g"
# [9,] "d" "h"
#[10,] "a" "i"
#[11,] "c" "k"
#[12,] "b" "j"
You can make it a function and throw replace in there, i.e.
f1 <- function(a, b){
m <- cbind(sample(c(a, a)), sample(c(b, b, sample(b, 2, replace = TRUE))))
m[,2] <-replace(m[,2], duplicated(m), sample(b[!b %in% m[duplicated(m),2]], 1))
return(m)
}
#which seems stable
sum(duplicated(f1(a, b)))
#[1] 0
sum(duplicated(f1(a, b)))
#[1] 0
sum(duplicated(f1(a, b)))
#[1] 0
sum(duplicated(f1(a, b)))
#[1] 0
Another way that doesn't require replacement
m = rbind(
c(1,1,0,0,0),
c(1,1,0,0,0),
c(0,0,1,1,0),
c(0,0,1,1,0),
c(0,0,0,0,1),
c(0,0,0,0,1)
)
# One "free" selection in each of the last two rows
m[5, sample(4,1)] = 1
m[6, sample(4,1)] = 1
# Scramble it while preserving row/column sums
m = m[sample(6), sample(5)]
as.matrix(expand.grid(a=a,b=b))[as.logical(m),]
# a b
# [1,] "a" "g"
# [2,] "b" "g"
# [3,] "e" "g"
# [4,] "c" "h"
# [5,] "d" "h"
# [6,] "f" "h"
# [7,] "d" "i"
# [8,] "f" "i"
# [9,] "b" "j"
#[10,] "c" "j"
#[11,] "a" "k"
#[12,] "e" "k"
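Duplicates cannot occur with this approach because each cell of m stands for one distinct (a, b) pair, so selecting cells can never pick the same pair twice. A quick sanity check of the constraints, as a sketch that simply re-runs the steps above (the seed and the name pairs are arbitrary choices of mine):

```r
a <- letters[1:6]
b <- letters[7:11]
set.seed(1)

# Rows are elements of a, columns are elements of b; a 1 selects that pair
m <- rbind(
  c(1, 1, 0, 0, 0),
  c(1, 1, 0, 0, 0),
  c(0, 0, 1, 1, 0),
  c(0, 0, 1, 1, 0),
  c(0, 0, 0, 0, 1),
  c(0, 0, 0, 0, 1)
)
m[5, sample(4, 1)] <- 1
m[6, sample(4, 1)] <- 1
m <- m[sample(6), sample(5)]

pairs <- as.matrix(expand.grid(a = a, b = b))[as.logical(m), ]

# Every a is used exactly twice, every b at least twice, no repeated pair
stopifnot(all(rowSums(m) == 2), all(colSums(m) >= 2), !anyDuplicated(pairs))
```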
Definitely not elegant, but would work.
a = letters[1:6]
b = letters[7:11]
asamp <- sample(c(a,a))
finished <- F
while(!finished) {
bsamp <- sample(c(b, b, sample(b, 2, replace = TRUE)))
if(length(unique(paste(asamp,bsamp)))==12) finished <- T
}
cbind(asamp,bsamp)
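The same retry idea can be wrapped in a function with an upper bound on the number of attempts. This is a sketch only: sample_pairs and max_tries are names I made up, and unlike the loop above it redraws both columns on each attempt.

```r
# Hypothetical helper: redraw until no row is duplicated, or give up
sample_pairs <- function(a, b, max_tries = 1000L) {
  for (i in seq_len(max_tries)) {
    m <- cbind(sample(c(a, a)),
               sample(c(b, b, sample(b, 2, replace = TRUE))))
    # anyDuplicated() on a matrix checks for duplicated rows
    if (anyDuplicated(m) == 0) return(m)
  }
  stop("no valid sample found in ", max_tries, " tries")
}

set.seed(42)
m <- sample_pairs(letters[1:6], letters[7:11])
```

The cap is there so the function fails with an error instead of looping forever if the inputs make a valid sample impossible.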
As a variant of this question
I have a vector with strings, each string has 2 to 4 characters.
Strng <- c("XDX", "GUV", "FQ", "ACUE", "HIT", "AYX", "NFD", "AHBW", "GKQ", "PYF")
I want to split it into a data frame with 4 columns, where each column contains one of the characters, or 0 where the string has fewer than 4 characters. The zeros could also go in front; it doesn't matter.
So (probably) after applying this:
ss<-strsplit(Strng,"")
z<-lapply(ss,as.character)
I would like to have a dataframe like this:
>df
"X" "D" "X" "0"
"G" "U" "V" "0"
"F" "Q" "0" "0"
"A" "C" "U" "E"
"H" "I" "T" "0"
"A" "Y" "X" "0"
"N" "F" "D" "0"
"A" "H" "B" "W"
"G" "K" "Q" "0"
"P" "Y" "F" "0"
Any ideas?
Thank you,
Kalin
Here's an alternative with "data.table":
library(data.table)
setDT(tstrsplit(Strng, "", fill = "0"))[]
# V1 V2 V3 V4
# 1: X D X 0
# 2: G U V 0
# 3: F Q 0 0
# 4: A C U E
# 5: H I T 0
# 6: A Y X 0
# 7: N F D 0
# 8: A H B W
# 9: G K Q 0
# 10: P Y F 0
You could also use cSplit from my "splitstackshape" package, but it fills with NA and its syntax is slightly unusual:
library(splitstackshape)
cSplit(data.table(Strng), "Strng", "", stripWhite = FALSE)
We can use stri_list2matrix from stringi after we split the "Strng" to a list.
library(stringi)
stri_list2matrix(strsplit(Strng, ''), fill=0, byrow=TRUE)
# [,1] [,2] [,3] [,4]
# [1,] "X" "D" "X" "0"
# [2,] "G" "U" "V" "0"
# [3,] "F" "Q" "0" "0"
# [4,] "A" "C" "U" "E"
# [5,] "H" "I" "T" "0"
# [6,] "A" "Y" "X" "0"
# [7,] "N" "F" "D" "0"
# [8,] "A" "H" "B" "W"
# [9,] "G" "K" "Q" "0"
#[10,] "P" "Y" "F" "0"
Or a base R option would be (variant of the one described in the link)
read.fwf(file= textConnection(Strng),
widths = rep(1,max(nchar(Strng))))
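If you want the "0" padding explicitly and without extra packages, a base R sketch could look like the following (the names chars, width and mat are mine):

```r
Strng <- c("XDX", "GUV", "FQ", "ACUE", "HIT", "AYX", "NFD", "AHBW", "GKQ", "PYF")

chars <- strsplit(Strng, "")
width <- max(lengths(chars))

# Pad each character vector on the right with "0" up to the common width;
# vapply returns one column per string, so transpose to get one row per string
mat <- t(vapply(chars,
                function(ch) c(ch, rep("0", width - length(ch))),
                character(width)))
df <- as.data.frame(mat, stringsAsFactors = FALSE)
```

To put the zeros in front instead, swap the two arguments inside c().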
I have a long list of sequences as follows
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC etc.
I also have a shorter list, and I would like to see how many times each element of the short list appears in the long list and plot the counts as a histogram. I suppose it's like a VLOOKUP function. How can I do this in R?
Try:
longlist = c("AAAAAACGTTATGATCGATC", "AAAATTCGCGCTTAGAGATC", "AAGCTACGCATGCATCGACT",
"AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT",
"AAGCTACGCATGCATCGACT", "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC"
)
shortlist = c("AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT")
longlist
[1] "AAAAAACGTTATGATCGATC" "AAAATTCGCGCTTAGAGATC" "AAGCTACGCATGCATCGACT" "AAAAAACGTTATGATCGATC" "AAAAAACGTTATGATCGATC"
[6] "AAGCTACGCATGCATCGACT" "AAGCTACGCATGCATCGACT" "AAAAAACGTTATGATCGATC" "AAAAAACGTTATGATCGATC"
shortlist
[1] "AAAAAACGTTATGATCGATC" "AAGCTACGCATGCATCGACT"
outdf = data.frame(var=character(), freq=numeric(), stringsAsFactors=F)
for(i in 1:length(shortlist)) {outdf[i,]=c(shortlist[i], sum(longlist==shortlist[i]))}
outdf
var freq
1 AAAAAACGTTATGATCGATC 5
2 AAGCTACGCATGCATCGACT 3
outdf$freq = as.numeric(outdf$freq)
barplot(outdf$freq, names.arg=outdf$var)
You can easily use the following to see the frequencies and a barplot for the full longlist:
table(longlist)
longlist
AAAAAACGTTATGATCGATC AAAATTCGCGCTTAGAGATC AAGCTACGCATGCATCGACT
5 1 3
barplot(table(longlist))
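The loop can also be avoided by restricting the factor levels to shortlist before tabulating. A minimal sketch using the same example vectors (counts is just a name I picked):

```r
longlist <- c("AAAAAACGTTATGATCGATC", "AAAATTCGCGCTTAGAGATC", "AAGCTACGCATGCATCGACT",
              "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT",
              "AAGCTACGCATGCATCGACT", "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC")
shortlist <- c("AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT")

# Values of longlist that are not in shortlist become NA and are dropped
# by table(); shortlist entries that never occur get a count of 0
counts <- table(factor(longlist, levels = shortlist))
counts
```

barplot(counts) then plots only the sequences from shortlist.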
match and table should work for your character vectors. Here's an example with just random letters:
set.seed(1492)
dat <- sample(c(letters, LETTERS), 100, replace=TRUE)
dat
## [1] "o" "l" "j" "f" "c" "a" "S" "A" "u" "N" "H" "H" "k" "B" "B" "P" "g"
## [18] "r" "I" "V" "H" "t" "g" "F" "e" "W" "E" "D" "r" "Y" "h" "Z" "R" "l"
## [35] "Z" "K" "v" "f" "b" "q" "M" "P" "i" "u" "w" "m" "S" "g" "f" "g" "G"
## [52] "h" "q" "T" "J" "M" "K" "m" "X" "Q" "f" "x" "t" "B" "k" "z" "I" "Y"
## [69] "z" "g" "z" "u" "O" "k" "G" "L" "n" "B" "A" "A" "J" "p" "U" "F" "E"
## [86] "X" "R" "J" "G" "L" "H" "o" "z" "r" "d" "r" "V" "H" "S" "I"
matches <- match(dat, LETTERS)
match_counts <- table(matches[!is.na(matches)])
match_counts
##
## 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 3 4 1 2 2 3 5 3 3 2 2 2 1 1 2 1 2 3 1 1 2 1 2 2 2
names(match_counts) <- LETTERS[as.numeric(names(match_counts))]
match_counts
## A B D E F G H I J K L M N O P Q R S T U V W X Y Z
## 3 4 1 2 2 3 5 3 3 2 2 2 1 1 2 1 2 3 1 1 2 1 2 2 2
barplot(sort(match_counts), col="#649388")
Assuming that the sequences are strings.
lines <- readLines(n=6)
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
shortlist <- readLines(n=1)
AGTD
Since it is not clear from the question, I am assuming here that each element of the shortlist is an individual character.
pat1 <- gsub("(?<=[A-Za-z])(?=[A-Za-z])", "|", shortlist, perl=TRUE)
pat1
#[1] "A|G|T|D"
library(stringr)
lvls <- unique(str_extract_all(shortlist, "[A-Za-z]")[[1]])
t1 <- table(factor(unlist(regmatches(lines,gregexpr(pat1, lines))), levels=lvls))
t1
#
# A G T D
#47 21 29 0
barplot(t1, col="#649388")
Update
If your shortlist looks like the one below and you want the frequency of each whole string rather than of the individual characters:
shortlist1 <- readLines(n=4)
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATCT
AAAAAACG
# Group the alternatives so that ^ and $ apply to every string, not only the first and last
pat2 <- paste0("^(", paste(shortlist1, collapse="|"), ")$")
lvls1 <- unique(shortlist1)
t2 <- table(factor(unlist(regmatches(lines,gregexpr(pat2, lines))), levels=lvls1))
t2
#AAGCTACGCATGCATCGACT AAAAAACGTTATGATCGATC AAAAAACGTTATCT
# 1 3 0
# AAAAAACG
# 0
barplot(t2, col="#649388")
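One thing to keep in mind when pasting several strings into an anchored pattern: ^ and $ bind more tightly than |, so without parentheses only the first alternative is anchored at the start and only the last one at the end. A small sketch with made-up strings (not the sequences above):

```r
alts <- c("a", "b", "c")

loose  <- paste0("^", paste(alts, collapse = "|"), "$")    # "^a|b|c$"
strict <- paste0("^(", paste(alts, collapse = "|"), ")$")  # "^(a|b|c)$"

# "b" in the middle of a string: the unanchored middle alternative matches
loose_hit  <- grepl(loose, "xbx")
# with grouping, the whole string has to equal one of the alternatives
strict_hit <- grepl(strict, "xbx")
exact_hit  <- grepl(strict, "b")
```

Here loose_hit is TRUE, strict_hit is FALSE and exact_hit is TRUE. For plain whole-string matching, lines %in% shortlist1 would avoid the regex altogether.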
I have two vectors:
A <- c(1,3,5,6,4,3,2,3,3,3,3,3,4,6,7,7,5,4,4,3) # 7 unique values
B <- c("a","b","c","d","e","f","g") # 7 different values
I would like to match the values of B to A such that the smallest value in A gets the first value from B and continued on to the largest.
The above example would be:
A: 1 3 5 6 4 3 2 3 3 3 3 3 4 6 7 7 5 4 4 3
assigned: a c e f d c b c c c c c d f g g e d d c
Try this:
A <- c(1,3,5,6,4,3,2,3,3,3,3,3,4,6,7,7,5,4,4,3)
B <- letters[1:7]
B[match(A, sort(unique(A)))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d" "f" "g"
# [16] "g" "e" "d" "d" "c"
Another option that handles the general case that @JoshO'Brien addresses would be
B[as.numeric(factor(A))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d"
# [14] "f" "g" "g" "e" "d" "d" "c"
A2<-ifelse(A > 4, A + 1, A)
# [1] 1 3 6 7 4 3 2 3 3 3 3 3 4 7 8 8 6 4 4 3
B[as.numeric(factor(A2))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d"
# [14] "f" "g" "g" "e" "d" "d" "c"
However, the following benchmark shows that this method is slower than @JoshOBrien's.
library(microbenchmark)
B <- make.unique(rep(letters, length.out=1000))
A <- sample(seq_along(B), replace=TRUE)
unique_sort_match <- function() B[match(A, sort(unique(A)))]
factor_as.numeric <- function() B[as.numeric(factor(A))]
bm<-microbenchmark(unique_sort_match(), factor_as.numeric(), times=1000L)
plot(bm)
To elaborate on the comments in @Josh's answer:
If A does in fact represent a permutation of the elements of B (i.e., a 1 in A represents the first element of B, a 4 in A represents the 4th element of B, etc.), then as @Matthew Plourde points out, you would simply use A as your index into B:
B[A]
If A does not represent a permutation of B, then you should use the method suggested by @Josh.
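The difference shows up as soon as the values in A have gaps. A small sketch with a made-up vector (A_gap is my own example, not from the question):

```r
B <- letters[1:7]
A_gap <- c(1, 3, 6, 7, 4, 3, 2, 8)  # 7 distinct values, but 5 is missing

# Position of each value among the sorted distinct values (1..7)
by_rank <- B[match(A_gap, sort(unique(A_gap)))]

# Direct indexing uses the values themselves, so 8 falls outside B
by_index <- B[A_gap]
```

by_rank is "a" "c" "e" "f" "d" "c" "b" "g", while by_index ends in NA because B has no 8th element.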