Extracting common character strings from multiple vectors of different lengths - r

Is there a function in R to find the common characters in multiple vectors (of different lengths)? For example, if I have 3 vectors...
a1 <- LETTERS[1:7]
a2 <- LETTERS[4:8]
a3 <- LETTERS[2:10]
a1
# [1] "A" "B" "C" "D" "E" "F" "G"
a2
# [1] "D" "E" "F" "G" "H"
a3
# [1] "B" "C" "D" "E" "F" "G" "H" "I" "J"
I can think of a messy solution:
intersect(intersect(a1,a2),a3)
# [1] "D" "E" "F" "G"
The problem is that I have around 8 or 9 vectors. Is there a better way to do this?

Yes:
Reduce(intersect,list(a1,a2,a3))
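With 8 or 9 vectors it may be easier to keep them in a list and pass that list to Reduce. A minimal sketch, where the list name vecs is only an illustration and the three vectors are the ones from the question:

```r
# Collect the vectors in a (named) list; more vectors can be added freely
vecs <- list(a1 = LETTERS[1:7], a2 = LETTERS[4:8], a3 = LETTERS[2:10])

# Reduce applies intersect pairwise: intersect(intersect(a1, a2), a3)
common <- Reduce(intersect, vecs)
common
# [1] "D" "E" "F" "G"
```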

Related

Why does gtools::combinations and permutations not work with a vector containing the same elements?

Say I have a vector vec <- c("H", "H", "H", "H", "M", "M", "A", "A")
How do I get all combinations / permutations if I, e.g., draw 5 out of 8, with the expected output below?
> head(t, 6)
[,1] [,2] [,3] [,4] [,5]
[1,] "H" "H" "H" "H" "M"
[2,] "H" "H" "H" "H" "M"
[3,] "H" "H" "H" "H" "A"
[4,] "H" "H" "H" "H" "A"
[5,] "H" "H" "H" "M" "M"
[6,] "H" "H" "H" "M" "A"
I tried gtools::combinations(), but I always get the error that there are too few different elements (the same is true for gtools::permutations(), regardless of whether repeats are allowed or not).
So I did it in a laborious way:
t <- gtools::combinations(8, 5, vec, repeats.allowed = F)
Error in gtools::combinations(8, 5, vec, repeats.allowed = F) :
too few different elements
t <- gtools::combinations(8, 5, letters[1:8], repeats.allowed = F)
for (i in 1:8) {
  if (i <= 4) {
    t[t == letters[i]] <- "H"
  } else if (i <= 6) {
    t[t == letters[i]] <- "M"
  } else if (i <= 8) {
    t[t == letters[i]] <- "A"
  }
}
I am looking for an easier solution from any package or base R, and I want to know why it doesn't work. Thanks in advance.
An alternative
combn(vec,5)
which results in 56 combinations (choose(8, 5)), returned as the columns of a 5 x 56 matrix, duplicates included.
When you need combinations/permutations of a vector that contains repeats, or multisets, many of the available functions in base R and other packages will produce unnecessary duplicate results that eventually need to be filtered out. For smaller problems this is not an issue, but the approach quickly becomes impractical as the input grows.
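For reference, the filtering approach described above can be sketched in base R as follows. This is only an illustration for the small vec from the question (the variable names are made up), and it is exactly the approach that stops scaling for larger inputs:

```r
vec <- c("H", "H", "H", "H", "M", "M", "A", "A")

# All choose(8, 5) = 56 draws, one per row, duplicates included
all_combos <- t(combn(vec, 5))

# Sort within each row so that equal multisets look identical,
# then keep only the distinct rows
sorted_combos <- t(apply(all_combos, 1, sort))
unique_combos <- unique(sorted_combos)

nrow(all_combos)     # 56
nrow(unique_combos)  # 8
```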
Currently, there are a couple of packages capable of handling these types of problems. They are arrangements and RcppAlgos (I am the author).
vec <- c("H", "H", "H", "H", "M", "M", "A", "A")
tbl_v <- table(vec)
tbl_v
vec
A H M
2 4 2
library(RcppAlgos)
comboGeneral(names(tbl_v), 5, freqs = tbl_v)
[,1] [,2] [,3] [,4] [,5]
[1,] "A" "A" "H" "H" "H"
[2,] "A" "A" "H" "H" "M"
[3,] "A" "A" "H" "M" "M"
[4,] "A" "H" "H" "H" "H"
[5,] "A" "H" "H" "H" "M"
[6,] "A" "H" "H" "M" "M"
[7,] "H" "H" "H" "H" "M"
[8,] "H" "H" "H" "M" "M"
## For package arrangements we have:
## arrangements::combinations(names(tbl_v), 5, freq = tbl_v)
Similarly, for permutations, we have:
permuteGeneral(names(tbl_v), 5, freqs = tbl_v)
[,1] [,2] [,3] [,4] [,5]
[1,] "A" "A" "H" "H" "H"
[2,] "A" "A" "H" "H" "M"
[3,] "A" "A" "H" "M" "H"
[4,] "A" "A" "H" "M" "M"
. . . . . .
. . . . . .
. . . . . .
[137,] "M" "M" "H" "A" "A"
[138,] "M" "M" "H" "A" "H"
[139,] "M" "M" "H" "H" "A"
[140,] "M" "M" "H" "H" "H"
## For package arrangements we have:
## arrangements::permutations(names(tbl_v), 5, freq = tbl_v)
Both packages contain algorithms that generate each result without the need for filtering. This approach is much more efficient.
For example, what if we had big_vec <- rep(vec, 8) and we wanted all combinations of length 16? Using the filtering approach, one would need to generate all 16-element combinations of a vector of length 64 and then filter them. That is choose(64, 16) = 4.885269e+14 total combinations. That's going to be difficult.
With these two packages, this problem is a breeze.
big_vec <- rep(vec, 8)
tbl_big_v <- table(big_vec)
tbl_big_v
big_vec
A H M
16 32 16
system.time(test_big <- comboGeneral(names(tbl_big_v), 16,
freqs = tbl_big_v))
user system elapsed
0 0 0
dim(test_big)
[1] 153 16
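The 153 rows can be sanity-checked without either package by counting how many ways the 16 slots can be split among the three distinct values. A rough base R sketch, assuming the frequencies shown in tbl_big_v above (the names freqs, k and counts are illustrative):

```r
# Frequencies of each distinct value in big_vec
freqs <- c(A = 16, H = 32, M = 16)
k <- 16

# Every possible choice of how many A's, H's and M's to take
counts <- expand.grid(lapply(freqs, function(f) 0:f))

# A multiset combination of length k is a row whose counts sum to k
n_multiset <- sum(rowSums(counts) == k)
n_multiset
# [1] 153

# Size of the unfiltered problem, for comparison
n_naive <- choose(sum(freqs), k)
```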
apply(gtools::combinations(8,5,repeats.allowed = FALSE),2,\(x) vec[x])
does what you want.
I don't know why the package wants distinct values when it is applied to a vector, though. The documentation is unclear on this.

Compact Letter Display from a matrix of significancies or by hand

I am running a multiple pairwise comparison in R. I'm using the survival package survminer. I'm using the function:
pairwise_survdiff {survminer}
It gives the pairwise comparisons with significance as expected, but doesn't seem to have a way to give a compact letter display (CLD) of the results. I'm looking at pairs of 19 levels. I ended up printing the results, putting them into excel by hand and then doing letters by hand. But now I need to do it again and am hoping for an easier way.
1) Can I have R do a CLD from the pairwise_survdiff {survminer} results directly?
2) Barring that, is there a way to get it to print the results into a table that can be read by a spreadsheet?
3) If I make the logic matrix by hand, how do I have R take that and turn it into a CLD?
4) If I'm doing it all by hand, is there a more compact way of showing this list of comparisons? Can I eliminate any of these letters due to redundancy?
[Image: hand-made CLD for comparisons]
Thank you
Here's the example from survminer
library(survminer)
library(multcomp)
library(tidyr)
data(myeloma)
res <- pairwise_survdiff(Surv(time, event) ~ molecular_group,
data = myeloma)
Looking at the internals of the cld() method for summary.glht objects in the multcomp package, we create the lvl_order vector, which identifies the ordering of the levels of x from smallest to largest.
x <- myeloma$molecular_group
levs <- levels(x)
y <- Surv(myeloma$time, myeloma$event)
lvl_order <- levels(x)[order(tapply(as.numeric(y)[1:length(x)],
x, mean))]
Then we can rearrange the p-values from the res object into a matrix. mycomps is a matrix holding the two sides of each paired comparison. signif is a logical vector indicating whether each difference is significant or not.
comps <- as_tibble(res$p.value, rownames="row") %>%
  pivot_longer(-row, names_to="col", values_to="p") %>%
  na.omit()
mycomps <- as.matrix(comps[,1:2])
signif <- comps$p < .05
Then, you can use the insert_absorb internal function to create the letters:
multcomp:::insert_absorb(signif,
decreasing=FALSE,
comps=mycomps,
lvl_order=lvl_order)
# $Letters
# MAF Proliferation Cyclin D-2 MMSET Hyperdiploid
# "ab" "a" "b" "ab" "b"
# Low bone disease Cyclin D-1
# "ab" "ab"
#
# $monospacedLetters
# MAF Proliferation Cyclin D-2 MMSET Hyperdiploid
# "ab" "a " " b" "ab" " b"
# Low bone disease Cyclin D-1
# "ab" "ab"
#
# $LetterMatrix
# a b
# MAF TRUE TRUE
# Proliferation TRUE FALSE
# Cyclin D-2 FALSE TRUE
# MMSET TRUE TRUE
# Hyperdiploid FALSE TRUE
# Low bone disease TRUE TRUE
# Cyclin D-1 TRUE TRUE
#
# $aLetters
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
# [23] "w" "x" "y" "z" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
# [45] "S" "T" "U" "V" "W" "X" "Y" "Z"
#
# $aseparator
# [1] "."
#
# attr(,"class")
# [1] "multcompLetters"
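On the spreadsheet part of the question: the p-values are an ordinary matrix, so they can be written to a CSV file with base R. A minimal sketch, using a small made-up matrix laid out like the lower-triangular res$p.value (the group labels, values and column names here are hypothetical):

```r
# Made-up p-values, laid out like the lower-triangular res$p.value matrix
p_mat <- matrix(c(0.01, 0.20, NA, 0.03), nrow = 2,
                dimnames = list(c("B", "C"), c("A", "B")))

# Long format: one row per comparison, unused (NA) cells dropped
long <- na.omit(as.data.frame(as.table(p_mat)))
names(long) <- c("group1", "group2", "p")

# Write a file that any spreadsheet program can open
out_file <- tempfile(fileext = ".csv")
write.csv(long, out_file, row.names = FALSE)
```

With the real object, res$p.value would take the place of p_mat.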

Splitting a dataframe into overlapping groups of equal sizes

I am looking for a way to split my data into groups where each group is made of the same window size I define.
Chrom Start End
chr1 1 10
chr1 11 20
chr1 21 30
chr1 31 40
For example, if I want a window size of 20, then the groups would be: 1-20, 11-30, 21-40.
Rows can keep being added to the same group as long as the size of the group does not exceed 20.
I tried using the split function but couldn't get it to work this way.
Is there a way around this?
A vector (or column of a dataframe) can be split into overlapping windows like this:
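# Step between window starts (equal to the overlap here, since n = 2 * o)
o <- 10
# Size of sliding window
n <- 20
# Dummy data
x <- sample(LETTERS, size = 40, replace = TRUE)
# Define start and end point (s and e)
s <- 1
e <- n
# Loop to create fragments; only complete windows are generated,
# so the last one ends at length(x) rather than running past it
for (i in seq_len((length(x) - n) %/% o + 1)) {
  assign(paste0("x", i), x[s:e])
  s <- s + o
  e <- (s + n) - 1
}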
# Call fragments
x1
x2
x3
Result:
> x
[1] "F" "E" "G" "X" "R" "S" "L" "F" "F" "C" "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L" "G" "I" "B" "I" "O" "V" "J" "Z" "C" "R" "W" "Z" "F" "T" "N" "U" "F" "R" "A" "V"
> x1
[1] "F" "E" "G" "X" "R" "S" "L" "F" "F" "C" "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L"
> x2
[1] "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L" "G" "I" "B" "I" "O" "V" "J" "Z" "C" "R"
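The same idea can be applied directly to the Start/End table from the question. A minimal base R sketch, assuming a single chromosome and that a window may begin at any row's Start (the names window_size, win_start and groups are illustrative):

```r
df <- data.frame(Chrom = "chr1",
                 Start = c(1, 11, 21, 31),
                 End   = c(10, 20, 30, 40))
window_size <- 20

# Candidate window starts: every Start whose window still fits in the data
win_start <- df$Start[df$Start + window_size - 1 <= max(df$End)]

# For each window, keep the rows that lie entirely inside it
groups <- lapply(win_start, function(s) {
  df[df$Start >= s & df$End <= s + window_size - 1, ]
})
names(groups) <- paste0(win_start, "-", win_start + window_size - 1)

names(groups)
# [1] "1-20"  "11-30" "21-40"
```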
library(IRanges)
library(GenomicRanges)
(gr1 <- GRanges("chr1",IRanges(c(1,11,21,31),width=10),strand="*"))
(gr2 <- GRanges("chr1",IRanges(c(1,11,21),width=20),strand="*"))
fo <- findOverlaps(gr1, gr2)
queryHits(fo)
subjectHits(fo)
Check http://genomicsclass.github.io/book/pages/bioc1_igranges.html#intrarange for more details.

Counting NA characters of a matrix

I have the following matrix:
V1 V2 V3 V4
[1,] "d" "e" "i" "NA"
[2,] "j" "e" "i" "NA"
[3,] "j" "n" "k" "l"
[4,] "j" "k" "l" "m"
[5,] "j" "k" "i" "NA"
[6,] "o" "n" "NA" "NA"
I am trying to count the number of elements per row that are not NA, but the usual approaches like !is.na(MATRIX) are not working: I always get 4. I presume this is because R is treating "NA" as a character string, but I do not know how to fix this.
'NA' is not NA_character_ so is.na does not work. Just use
rowSums(MATRIX != 'NA')
If the NAs are character strings, convert them to real NA with mat[mat=="NA"] <- NA and then use the solution in Sotos' comment
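Both suggestions can be tried on the matrix from the question. A small sketch that rebuilds the matrix by hand (the name counts_string is only for illustration):

```r
MATRIX <- matrix(c("d", "e", "i",  "NA",
                   "j", "e", "i",  "NA",
                   "j", "n", "k",  "l",
                   "j", "k", "l",  "m",
                   "j", "k", "i",  "NA",
                   "o", "n", "NA", "NA"),
                 ncol = 4, byrow = TRUE)

# The string "NA" is not a missing value, so is.na() sees nothing missing
rowSums(!is.na(MATRIX))
# [1] 4 4 4 4 4 4

# Option 1: compare against the string directly
counts_string <- rowSums(MATRIX != "NA")
counts_string
# [1] 3 3 4 4 3 2

# Option 2: convert the strings to real NA, then is.na() works as expected
mat <- MATRIX
mat[mat == "NA"] <- NA
rowSums(!is.na(mat))
# [1] 3 3 4 4 3 2
```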

Using a sample list as a template for sampling from a larger list with wraparound

Similar to my question at Using a sample list as a template for sampling from a larger list without wraparound, how can I now do this allowing for wrap-around?
Thus, if I have a vector of letters:
> all <- letters
> all
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
and then I define a reference sample from letters as follows:
> refSample <- c("j","l","m","s")
in which the spacing between elements is 2 (1st to 2nd), 1 (2nd to 3rd) and 6 (3rd to 4th), how can I then select n samples from all whose elements have the same spacing as refSample, allowing wrap-around? For example, "a","c","d","j", "q","s","t","z" and "r","t","u","a" would be valid samples, but "a","c","d","k" would not.
Again, parameterised for a function is best.
I would have left it as an exercise but here goes --
all <- letters
refSample <- c("j","l","m","s")
pick_matches <- function(n, ref, full, wrap = FALSE) {
  iref <- match(ref, full)
  spaces <- diff(iref)
  tot_space <- sum(spaces)
  N <- length(full) - 1
  max_start <- N - tot_space * (1 - wrap)
  starts <- sample(0:max_start, n, replace = TRUE)
  return(sapply(starts, function(s) full[1 + cumsum(c(s, spaces)) %% (N + 1)]))
}
> set.seed(1)
> pick_matches(5, refSample, all, wrap = FALSE)
[,1] [,2] [,3] [,4] [,5]
[1,] "e" "g" "j" "p" "d"
[2,] "g" "i" "l" "r" "f"
[3,] "h" "j" "m" "s" "g"
[4,] "n" "p" "s" "y" "m"
> pick_matches(5, refSample, all, wrap = TRUE)
[,1] [,2] [,3] [,4] [,5]
[1,] "x" "y" "r" "q" "b"
[2,] "z" "a" "t" "s" "d"
[3,] "a" "b" "u" "t" "e"
[4,] "g" "h" "a" "z" "k"
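Each column of the output is one sample. To check that a sample really has the same wrap-around spacing as the reference, the gaps can be compared modulo the length of the full vector. A small sketch; same_spacing is a hypothetical helper, not part of the answer above:

```r
# TRUE if `sample` has the same element-to-element spacing as `ref`
# within `full`, treating `full` as circular
same_spacing <- function(sample, ref, full) {
  gaps <- function(v) diff(match(v, full)) %% length(full)
  identical(gaps(sample), gaps(ref))
}

refSample <- c("j", "l", "m", "s")

same_spacing(c("a", "c", "d", "j"), refSample, letters)  # TRUE
same_spacing(c("r", "t", "u", "a"), refSample, letters)  # TRUE (wraps around)
same_spacing(c("a", "c", "d", "k"), refSample, letters)  # FALSE
```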
