Compact Letter Display from a matrix of significancies or by hand - r

I am running a multiple pairwise comparison in R. I'm using the survival package survminer. I'm using the function:
pairwise_survdiff {survminer}
It gives the pairwise comparisons with significance as expected, but doesn't seem to have a way to give a compact letter display (CLD) of the results. I'm looking at pairs of 19 levels. I ended up printing the results, putting them into excel by hand and then doing letters by hand. But now I need to do it again and am hoping for an easier way.
Can I have R do a CLD from the pairwise_survdiff {survminer} results directly?
Baring that
Is there a way to get it to print results into a table that can be read by a spreadsheet?
If I make the logic matrix by hand, how do I have R take that and turn it into a CLD?
And 4) If I'm doing it all by hand, I'm wondering if there is a more compact method of showing this list of comparisons. Can I eliminate any of these letters due to redundancy?
hand made CLD for comparisons
Thank you

Here's the example from survminer
library(survminer)
library(multcomp)
library(tidyr)
data(myeloma)
res <- pairwise_survdiff(Surv(time, event) ~ molecular_group,
data = myeloma)
Looking at the internals of the glht.summary method from the multcomp package, we create the lvl_order vector which identifies the ordering of the levels of x from smallest to largest.
x <- myeloma$molecular_group
levs <- levels(x)
y <- Surv(myeloma$time, myeloma$event)
lvl_order <- levels(x)[order(tapply(as.numeric(y)[1:length(x)],
x, mean))]
Then we can re-arrange the p-values from the res object into a matrix. mycomps is a matrix of the two sides of the paired comparisons. The signif vector is logical indicating whether differences are significant or not.
comps <- as_tibble(res$p.value, rownames="row") %>%
pivot_longer(-row, names_to="col", values_to="p") %>%
na.omit()
mycomps <- as.matrix(comps[,1:2])
signif <- comps$p < .05
Then, you can use the insert_absorb internal function to create the letters:
multcomp:::insert_absorb(signif,
decreasing=FALSE,
comps=mycomps,
lvl_order=lvl_order)
# $Letters
# MAF Proliferation Cyclin D-2 MMSET Hyperdiploid
# "ab" "a" "b" "ab" "b"
# Low bone disease Cyclin D-1
# "ab" "ab"
#
# $monospacedLetters
# MAF Proliferation Cyclin D-2 MMSET Hyperdiploid
# "ab" "a " " b" "ab" " b"
# Low bone disease Cyclin D-1
# "ab" "ab"
#
# $LetterMatrix
# a b
# MAF TRUE TRUE
# Proliferation TRUE FALSE
# Cyclin D-2 FALSE TRUE
# MMSET TRUE TRUE
# Hyperdiploid FALSE TRUE
# Low bone disease TRUE TRUE
# Cyclin D-1 TRUE TRUE
#
# $aLetters
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
# [23] "w" "x" "y" "z" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
# [45] "S" "T" "U" "V" "W" "X" "Y" "Z"
#
# $aseparator
# [1] "."
#
# attr(,"class")
# [1] "multcompLetters"

Related

Reorder vector so no certain items are positioned next to each other

Please consider the following example:
[[1]]
[1] 11 12 13 14
[[2]]
[1] 1 2 3
[[3]]
[1] 4
[[4]]
[1] 5
[[5]]
[1] 6
[[6]]
[1] 7
[[7]]
[1] 8
[[8]]
[1] 9
[[9]]
[1] 10
[[10]]
[1] 15
[[11]]
[1] 16
[[12]]
[1] 17
In this example, I have 12 unique values in a vector that is 17 elements long. For simplicity, let's say that this vector is:
foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")
The first code block shows the index positions in foo_bar of each of the unique values (the letters a–l).
I am attempting to write an algorithm that reorders foo_bar so that, for all indices except the final one (index 17 in the foo_bar example), position i and position i+1 never contains the same two values. Here's an example of what would be an appropriate outcome:
reordered_foo_bar <- c("b","c","b","d","b","e","f","g","h","a","i","a","j","a","k","a", "l")
something like this?
foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")
test == FALSE
while (test == FALSE) {
new_foo_bar <- sample(foo_bar, size = length(foo_bar), replace = FALSE)
test <- length(rle(new_foo_bar)$lengths) == length(foo_bar)
}
new_foo_bar
# [1] "f" "a" "g" "b" "h" "d" "j" "c" "e" "i" "a" "b" "k" "a" "l" "a" "b"
First we identify the indices of the unique values in the vector.
indices <-
unique(foo_bar) %>%
sort() %>%
lapply(function(x) which(foo_bar == x))
Then we create a position score based on 1) which order the value has when ordered by decreasing frequency and 2) how many previous occurences of this value has occurred, and we add these two values together. However, to ensure that we get a different value inserted between them, we divide 2) by 2. Finally, we order the position scores and reorder foo_bar with this new order.
This solution is also robust in case it is not possible to prevent duplicate values next to each other (for example because the values are c("a","a","b","a").
out <-
lengths(indices) %>%
lapply(., function(x) 1:x) %>%
{lapply(len_seq(.), function(x) (unlist(.[x]) + x / 2))} %>%
unlist() %>%
order() %>%
{unlist(indices)[.]} %>%
foo_bar[.]
The output is then:
> "a" "b" "a" "c" "b" "d" "a" "e" "b" "f" "a" "g" "h" "i" "j" "k" "l"

simple question about r language in naming

I am a developing beginner in r. I have a simple question about r language.
Thanks to many experts in this site, I am improving a lot.
I am always grateful for that, and anyone who's giving hand with this question, thank you in advance.
This is the code.
Data=sample(1:5,size=25,replace=T)
names(Data)=c("a","b","c","d","e")
I want to name each of 1,2,3,4,5 to a,b,c,d,e.
so I thought I could accomplish this by using the upper code.
I know that the right code is
Data=c("a","b","c","d","e")[Data]
But I can't understand why this is the right code and why I need the last [Data].
Any help would be really appreciated!! Thank you so much in advance!!:)
The last Data provides an index to subset values from c("a","b","c","d","e").
Let's take a simple example :
Consider,
a <- 1:10
Now to get the first value in a you can do
a[1]
#[1] 1
To get 3rd value in a you can do
a[3]
#[1] 3
To get 6th and 8th value in a you can do
a[c(6, 8)]
#[1] 6 8
What will happen if you repeat a certain index? Say you select 1 twice and 3 once.
a[c(1, 1, 3)]
#[1] 1 1 3
As you can see the first value is selected two times and third one time.
Now ,Data that you have serves as that index to subset whereas a becomes c("a","b","c","d","e")
a <- c("a","b","c","d","e")
set.seed(123)
Data=sample(1:5,size=25,replace=T)
Data
#[1] 3 3 2 2 3 5 4 1 2 3 5 3 3 1 4 1 1 5 3 2 2 1 3 4 1
Now you use this Data values to subset from a giving
a[Data]
#[1] "c" "c" "b" "b" "c" "e" "d" "a" "b" "c" "e" "c" "c" "a" "d" "a" "a" "e" "c" "b" "b" "a" "c" "d" "a"
A side note, there is an inbuilt constant letters and LETTERS which gives 26 lower and upper case alphabets.
letters
#[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
Here is a way that takes advantage of how objects of class "factor" are coded internally in R. In R, factors are coded as consecutive integers starting at 1, and what the user sees is their labels and levels, not the integer values. But the integer values do not go away, they are still there.
First, create a vector of integers like in the question but setting the RNG seed in order to make the results reproducible. This vector is saved for later.
set.seed(123)
Data <- sample(1:5, size = 25, replace = TRUE)
Saved <- Data
Now create the factor. Note the labels atribute is set to the letters "a" to"e".
Data <- factor(Data, labels = c("a","b","c","d","e"))
Data
# [1] c c b b c e d a b c e c c a d a a e c b b a c d a
#Levels: a b c d e
See the internal representation.
as.integer(Data)
# [1] 3 3 2 2 3 5 4 1 2 3 5 3 3 1 4 1 1 5 3 2 2 1 3 4 1
And compare with the initial values.
identical(Saved, as.integer(Data))
# [1] TRUE
This is because Data contains the numbers you want to name in the order you want to name them. By adding [Data] to the end you are selecting the letters in the order of Data. To understand this, try what c("a","b","c","d","e")[c(1, 2)] does; it selects just the two first letters. If you instead type c("a","b","c","d","e")[c(5, 4)] it will select the two last letters, but in reverse order. Then if you print just Data, you'll see that it contains the numbers from 1 to 5, which is the amount of unique letters. So it will select the letters according to that order. You can see that all the numbers correspond to the letters in order by printing the correctly named Data.
Using names(Data)=c("a","b","c","d","e") does not work correctly since you aren't naming all 25 of the numbers, but rather just the first five of them.

Find all unique characters in a data frame in R

I am wondering what is the most efficient way to find all unique characters from a data frame in R.
i.e for example:- [0-9,a-z,A-Z,",","$","&","#" etc]
> k
cola colb
1 1&3# %^
2 A4C% 89&
The output I am expecting is a list containing all unique characters including special characters. i.e 123#%^AC89&
There's nothing really efficient about this, but ... demonstrating on the diamonds dataset from the ggplot2 library,
library(ggplot2)
unique(unlist(lapply(diamonds, function(x) unlist(strsplit(as.character(x), "")))))
# [1] "0" "." "2" "3" "1" "9" "4" "6" "5" "8" "7" "I" "d" "e" "a" "l" "P" "r" "m" "i" "u" "G" "o"
# [24] "V" "y" " " "F" "E" "J" "H" "D" "S"
If you're curious about how many of each?
table(unlist(lapply(diamonds, function(x) unlist(strsplit(as.character(x), "")))))
# . 0 1 2 3 4 5 6 7 8 9 a
# 12082 261929 81785 142173 135042 108355 121267 157242 161862 91438 71904 67144 23161
# d D e E F G H i I J l m o
# 38539 6775 47424 9797 12942 28280 8304 15401 51763 2808 21551 27582 33976
# P r S u V y
# 13791 27483 51409 13791 49953 12082
(This is effectively akrun's answer ... posted before I saw his comment-edit.)
Using your sample frame:
k <- data.frame(cola = c("1&3#", "A4C%"), colb = c("%^", "89&"), stringsAsFactors = FALSE)
unique(unlist(lapply(k, function(x) unlist(strsplit(as.character(x), "")))))
# [1] "1" "&" "3" "#" "A" "4" "C" "%" "^" "8" "9"
And if you want them in a sorted no-space string,
paste(sort(unique(unlist(lapply(k, function(x) unlist(strsplit(as.character(x), "")))))), collapse = "")
# [1] "#%&^13489AC"
Since your question suggests you're considering using this in a regex somewhere, you can sandwich this in brackets. I wouldn't go through the pain of finding character ranges (e.g., AD-GW-Z24-9), since that buys you very little regex efficiency but would take a bit more effort to generate.

Splitting a dataframe into overlapping groups of equal sizes

I am looking for a way to split my data into groups where each group is made of the same window size I define.
Chrom Start End
chr1 1 10
chr1 11 20
chr1 21 30
chr1 31 40
For example, if I want a window size of 20, then the groups would be : 1-20 , 11-30 , 21 - 40.
As long as the size of the group did not exceed 20 it can keep adding to the same group.
I tried using the split function but couldn't implement this way using it.
Is there a way around this?
A vector (or column of a dataframe) can be split into overlapping windows like this:
# Size of overlap
o <- 10
# Size of sliding window
n <- 20
# Dummy data
x <- sample(LETTERS, size = 40, replace = T)
# Define start and end point (s and e)
s <- 1
e <- n
# Loop to create fragments
for(i in 1:(length(x)/o)){
assign(paste0("x", i), x[s:e])
s <- s + o
e <- (s + n) - 1
}
# Call fragments
x1
x2
x3
Result:
> x
[1] "F" "E" "G" "X" "R" "S" "L" "F" "F" "C" "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L" "G" "I" "B" "I" "O" "V" "J" "Z" "C" "R" "W" "Z" "F" "T" "N" "U" "F" "R" "A" "V"
> x1
[1] "F" "E" "G" "X" "R" "S" "L" "F" "F" "C" "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L"
> x2
[1] "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L" "G" "I" "B" "I" "O" "V" "J" "Z" "C" "R"
library(IRanges)
library(GenomicRanges)
(gr1 <- GRanges("chr1",IRanges(c(1,11,21,31),width=10),strand="*"))
(gr2 <- GRanges("chr1",IRanges(c(1,11,21),width=20),strand="*"))
fo <- findOverlaps(gr1, gr2)
queryHits(fo)
subjectHits(fo)
Check http://genomicsclass.github.io/book/pages/bioc1_igranges.html#intrarange for more details.

Extracting common character strings from multiple vectors of different lengths

Is there function in R to find the common characters in multiple vectors (of different lengths). For example, if I have 3 vectors...
a1 <- LETTERS[1:7]
a2 <- LETTERS[4:8]
a3 <- LETTERS[2:10]
a1
# [1] "A" "B" "C" "D" "E" "F" "G"
a2
# [1] "D" "E" "F" "G" "H"
a3
# [1] "B" "C" "D" "E" "F" "G" "H" "I" "J"
I can think of a messy solutions...
intersect(intersect(a1,a2),a3)
# [1] "D" "E" "F" "G"
Problem is, I have around 8 or 9 vectors. Is there a better way to this?
Yes:
Reduce(intersect,list(a1,a2,a3))

Resources