I am looking for a way to split my data into groups, where each group spans a window size that I define.
Chrom Start End
chr1 1 10
chr1 11 20
chr1 21 30
chr1 31 40
For example, if I want a window size of 20, then the groups would be: 1-20, 11-30, 21-40.
As long as the span of a group does not exceed 20, rows can keep being added to that group.
I tried using the split function but couldn't get it to work this way.
Is there a way around this?
A vector (or column of a dataframe) can be split into overlapping windows like this:
# Size of overlap
o <- 10
# Size of sliding window
n <- 20
# Dummy data
x <- sample(LETTERS, size = 40, replace = TRUE)
# Define start and end point (s and e)
s <- 1
e <- n
# Loop to create fragments
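# Only iterate over windows that fit entirely inside x;
# 1:(length(x)/o) would run past the end and pad the last fragment with NA
for (i in seq_len((length(x) - n) %/% o + 1)) {
  assign(paste0("x", i), x[s:e])
  s <- s + o
  e <- (s + n) - 1
}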
# Call fragments
x1
x2
x3
Result:
> x
[1] "F" "E" "G" "X" "R" "S" "L" "F" "F" "C" "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L" "G" "I" "B" "I" "O" "V" "J" "Z" "C" "R" "W" "Z" "F" "T" "N" "U" "F" "R" "A" "V"
> x1
[1] "F" "E" "G" "X" "R" "S" "L" "F" "F" "C" "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L"
> x2
[1] "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L" "G" "I" "B" "I" "O" "V" "J" "Z" "C" "R"
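The same idea can be sketched without assign(), collecting the windows in a list instead. This is only a sketch: sliding_windows, size and step are names made up for this example, and it assumes x is at least as long as one window.

```r
# Hypothetical helper: overlapping windows of a vector, returned as a list
sliding_windows <- function(x, size, step) {
  # Start positions of all windows that fit entirely inside x
  starts <- seq(1, length(x) - size + 1, by = step)
  lapply(starts, function(s) x[s:(s + size - 1)])
}

w <- sliding_windows(1:40, size = 20, step = 10)
length(w)      # number of windows
range(w[[2]])  # first and last element of the second window
```

A list is usually easier to work with afterwards (lapply, sapply) than separate x1, x2, ... objects.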
library(IRanges)
library(GenomicRanges)
(gr1 <- GRanges("chr1",IRanges(c(1,11,21,31),width=10),strand="*"))
(gr2 <- GRanges("chr1",IRanges(c(1,11,21),width=20),strand="*"))
fo <- findOverlaps(gr1, gr2)
queryHits(fo)
subjectHits(fo)
Check http://genomicsclass.github.io/book/pages/bioc1_igranges.html#intrarange for more details.
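If Bioconductor is not available, assigning rows to windows can also be sketched in base R. This assumes the windows advance in steps of 10, as the question's example suggests, and that a row belongs to a window only when it lies entirely inside it; win_size, win_step and groups are names chosen for this illustration.

```r
# The data from the question
df <- data.frame(Chrom = "chr1",
                 Start = c(1, 11, 21, 31),
                 End   = c(10, 20, 30, 40))

win_size <- 20  # window size from the question
win_step <- 10  # assumed step between window starts

# Window boundaries: 1-20, 11-30, 21-40
win_start <- seq(min(df$Start), max(df$End) - win_size + 1, by = win_step)
win_end   <- win_start + win_size - 1

# One data frame per window, holding the rows fully contained in it
groups <- lapply(seq_along(win_start), function(i) {
  df[df$Start >= win_start[i] & df$End <= win_end[i], ]
})
names(groups) <- paste(win_start, win_end, sep = "-")
groups
```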
Say I have a vector vec <- c("H", "H", "H", "H", "M", "M", "A", "A")
How do I get all combinations / permutations if I, e.g., draw 5 out of 8, with the following expected output?
> head(t, 6)
[,1] [,2] [,3] [,4] [,5]
[1,] "H" "H" "H" "H" "M"
[2,] "H" "H" "H" "H" "M"
[3,] "H" "H" "H" "H" "A"
[4,] "H" "H" "H" "H" "A"
[5,] "H" "H" "H" "M" "M"
[6,] "H" "H" "H" "M" "A"
I tried gtools::combinations() but I always get the error that there are too few different elements (the same is true for gtools::permutations(), regardless of whether repeats are allowed or not).
So I did it in a laborious way, relabelling the result of a call on distinct letters:
t <- gtools::combinations(8, 5, vec, repeats.allowed = F)
Error in gtools::combinations(8, 5, vec, repeats.allowed = F) :
too few different elements
t <- gtools::combinations(8, 5, letters[1:8], repeats.allowed = F)
for ( i in 1:8) {
if ( i <=4 ) {
t[t == letters[i]] <- "H"
} else if (i <= 6) {
t[t == letters[i]] <- "M"
} else if (i <= 8) {
t[t == letters[i]] <- "A"
}
}
I am looking for an easier solution from any package or base R, and I would like to know why my first attempt doesn't work. Thanks in advance.
An alternative
combn(vec,5)
which results in 56 combinations (choose(8,5)).
When you need combinations/permutations of a vector that contains repeats (a multiset), many of the available functions in base R and other packages will produce unnecessary duplicate results that eventually need to be filtered out. For smaller problems this is not an issue, but the approach quickly becomes impractical as the input grows.
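For reference, the filtering approach on the small example might look like the following base R sketch (all_combos and unique_combos are names chosen here): generate every combination, sort within each row so that duplicates become identical, then drop the repeats.

```r
vec <- c("H", "H", "H", "H", "M", "M", "A", "A")

# All choose(8, 5) = 56 combinations, one per row, duplicates included
all_combos <- t(combn(vec, 5))

# Sort within each row so equal multisets look identical, then keep unique rows
unique_combos <- unique(t(apply(all_combos, 1, sort)))

nrow(all_combos)     # 56
nrow(unique_combos)  # 8
```

Only 8 of the 56 rows survive, which is the waste that grows quickly with larger inputs.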
Currently, there are a couple of packages capable of handling these types of problems. They are arrangements and RcppAlgos (I am the author).
vec <- c("H", "H", "H", "H", "M", "M", "A", "A")
tbl_v <- table(vec)
tbl_v
vec
A H M
2 4 2
library(RcppAlgos)
comboGeneral(names(tbl_v), 5, freqs = tbl_v)
[,1] [,2] [,3] [,4] [,5]
[1,] "A" "A" "H" "H" "H"
[2,] "A" "A" "H" "H" "M"
[3,] "A" "A" "H" "M" "M"
[4,] "A" "H" "H" "H" "H"
[5,] "A" "H" "H" "H" "M"
[6,] "A" "H" "H" "M" "M"
[7,] "H" "H" "H" "H" "M"
[8,] "H" "H" "H" "M" "M"
## For package arrangements we have:
## arrangements::combinations(names(tbl_v), 5, freq = tbl_v)
Similarly, for permutations, we have:
permuteGeneral(names(tbl_v), 5, freqs = tbl_v)
[,1] [,2] [,3] [,4] [,5]
[1,] "A" "A" "H" "H" "H"
[2,] "A" "A" "H" "H" "M"
[3,] "A" "A" "H" "M" "H"
[4,] "A" "A" "H" "M" "M"
. . . . . .
. . . . . .
. . . . . .
[137,] "M" "M" "H" "A" "A"
[138,] "M" "M" "H" "A" "H"
[139,] "M" "M" "H" "H" "A"
[140,] "M" "M" "H" "H" "H"
## For package arrangements we have:
## arrangements::permutations(names(tbl_v), 5, freq = tbl_v)
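The 140 rows can be cross-checked in base R without enumerating any permutations. As a sketch: each distinct combination with element counts c1, c2, ... has 5! / (c1! * c2! * ...) distinct orderings, so summing that over the 8 combinations should give the total (combos and perms_per_combo are names chosen here).

```r
vec <- c("H", "H", "H", "H", "M", "M", "A", "A")

# The 8 distinct combinations, obtained by sorting and de-duplicating combn() output
combos <- unique(t(apply(t(combn(vec, 5)), 1, sort)))

# Distinct orderings of each combination: 5! divided by the factorials of the counts
perms_per_combo <- apply(combos, 1, function(r) {
  factorial(5) / prod(factorial(as.vector(table(r))))
})

sum(perms_per_combo)  # 140
```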
Both packages contain algorithms that generate each result without the need for filtering. This approach is much more efficient.
For example, what if we had big_vec <- rep(vec, 8) and wanted all combinations of length 16? Using the filtering approach, one would need to generate every combination of 16 elements drawn from a vector of length 64 and then filter them. That is choose(64, 16) = 4.885269e+14 total combinations. That's going to be difficult.
With these two packages, this problem is a breeze.
big_vec <- rep(vec, 8)
tbl_big_v <- table(big_vec)
tbl_big_v
big_vec
A H M
16 32 16
system.time(test_big <- comboGeneral(names(tbl_big_v), 16,
freqs = tbl_big_v))
user system elapsed
0 0 0
dim(test_big)
[1] 153 16
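Both numbers can be sanity-checked in base R. This is a sketch of the counting argument only, not of how either package works internally: a combination of the multiset is fully described by how many A, H and M it contains, and since only 16 elements are drawn, none of the available frequencies (16, 32, 16) is ever exceeded.

```r
# Size of the brute-force search space
n_brute <- choose(64, 16)
n_brute

# Count the ways to split 16 draws among A, H and M
counts <- expand.grid(A = 0:16, H = 0:16, M = 0:16)
n_multisets <- sum(rowSums(counts) == 16)
n_multisets  # 153
```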
apply(gtools::combinations(8,5,repeats.allowed = FALSE),2,\(x) vec[x])
does what you want.
I don't know why the package requires distinct values when it is applied to a vector, though. The documentation is unclear about this.
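A likely explanation, going by the set argument of gtools::combinations() (which defaults to TRUE and removes duplicates from v before anything else happens): after de-duplication only 3 distinct values are left, which is fewer than n = 8. The sketch below assumes that passing set = FALSE keeps the duplicates; it only runs the gtools call if the package is installed.

```r
vec <- c("H", "H", "H", "H", "M", "M", "A", "A")

# With duplicates removed there are only 3 values, fewer than n = 8
n_distinct <- length(unique(vec))
n_distinct

if (requireNamespace("gtools", quietly = TRUE)) {
  # Assumption: set = FALSE stops gtools from de-duplicating vec
  m <- gtools::combinations(8, 5, vec, set = FALSE, repeats.allowed = FALSE)
  print(dim(m))
}
```

The result still contains duplicate rows, exactly like the apply() version above.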
I am running multiple pairwise comparisons in R, using the survival-analysis package survminer. I'm using the function:
pairwise_survdiff {survminer}
It gives the pairwise comparisons with significance as expected, but doesn't seem to have a way to give a compact letter display (CLD) of the results. I'm looking at pairs of 19 levels. I ended up printing the results, putting them into Excel by hand and then assigning the letters by hand. But now I need to do it again and am hoping for an easier way.
1) Can I have R produce a CLD directly from the pairwise_survdiff {survminer} results?
Barring that:
2) Is there a way to get it to print the results as a table that can be read by a spreadsheet?
3) If I make the logic matrix by hand, how do I have R take that and turn it into a CLD?
4) If I'm doing it all by hand, is there a more compact way of showing this list of comparisons? Can I eliminate any of these letters due to redundancy?
Thank you
Here's the example from survminer
library(survival)   # provides Surv()
library(survminer)
library(multcomp)
library(tidyr)
data(myeloma)
res <- pairwise_survdiff(Surv(time, event) ~ molecular_group,
data = myeloma)
Looking at the internals of the cld method for summary.glht objects in the multcomp package (multcomp:::cld.summary.glht), we create the lvl_order vector, which orders the levels of x from smallest to largest.
x <- myeloma$molecular_group
levs <- levels(x)
y <- Surv(myeloma$time, myeloma$event)
lvl_order <- levels(x)[order(tapply(as.numeric(y)[1:length(x)],
x, mean))]
Then we can re-arrange the p-values from the res object into a matrix. mycomps is a matrix of the two sides of the paired comparisons. The signif vector is logical indicating whether differences are significant or not.
comps <- as_tibble(res$p.value, rownames="row") %>%
pivot_longer(-row, names_to="col", values_to="p") %>%
na.omit()
mycomps <- as.matrix(comps[,1:2])
signif <- comps$p < .05
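To see what this reshaping does without fitting a model, here is a base R sketch on a made-up lower-triangular p-value matrix. p_mat and its values are invented for illustration; res$p.value is assumed to have the same layout, with NA where a pair is not compared.

```r
# Toy p-value matrix: rows B, C against columns A, B (values are made up)
p_mat <- matrix(c(0.01, 0.20, NA, 0.03), nrow = 2,
                dimnames = list(c("B", "C"), c("A", "B")))

# Long format: one row per comparison, dropping the NA cell
long <- na.omit(as.data.frame(as.table(p_mat), stringsAsFactors = FALSE))
names(long) <- c("row", "col", "p")

toy_comps  <- as.matrix(long[, 1:2])  # the two sides of each comparison
toy_signif <- long$p < 0.05           # TRUE where the pair differs significantly
long
```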
Then, you can use the insert_absorb internal function to create the letters:
multcomp:::insert_absorb(signif,
decreasing=FALSE,
comps=mycomps,
lvl_order=lvl_order)
# $Letters
# MAF Proliferation Cyclin D-2 MMSET Hyperdiploid
# "ab" "a" "b" "ab" "b"
# Low bone disease Cyclin D-1
# "ab" "ab"
#
# $monospacedLetters
# MAF Proliferation Cyclin D-2 MMSET Hyperdiploid
# "ab" "a " " b" "ab" " b"
# Low bone disease Cyclin D-1
# "ab" "ab"
#
# $LetterMatrix
# a b
# MAF TRUE TRUE
# Proliferation TRUE FALSE
# Cyclin D-2 FALSE TRUE
# MMSET TRUE TRUE
# Hyperdiploid FALSE TRUE
# Low bone disease TRUE TRUE
# Cyclin D-1 TRUE TRUE
#
# $aLetters
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
# [23] "w" "x" "y" "z" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
# [45] "S" "T" "U" "V" "W" "X" "Y" "Z"
#
# $aseparator
# [1] "."
#
# attr(,"class")
# [1] "multcompLetters"
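On the question of getting the results into a spreadsheet: since the p-values are stored as a matrix, base R's write.csv() should be enough. The sketch below uses a made-up matrix in place of res$p.value and a temporary file in place of a real path.

```r
# Stand-in for res$p.value (values are invented for illustration)
p_mat <- matrix(c(0.01, 0.20, NA, 0.03), nrow = 2,
                dimnames = list(c("B", "C"), c("A", "B")))

# Write with empty cells for NA, which spreadsheets display as blanks
out_file <- tempfile(fileext = ".csv")
write.csv(p_mat, out_file, na = "")

# Reading it back shows what a spreadsheet would receive
back <- read.csv(out_file, row.names = 1)
back
```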
I am wondering what the most efficient way is to find all unique characters in a data frame in R,
for example any of [0-9,a-z,A-Z,",","$","&","#"] and so on:
> k
cola colb
1 1&3# %^
2 A4C% 89&
The output I am expecting is a list containing all unique characters, including special characters, i.e. 123#%^AC89&
There's nothing really efficient about this, but ... demonstrating on the diamonds dataset from the ggplot2 library,
library(ggplot2)
unique(unlist(lapply(diamonds, function(x) unlist(strsplit(as.character(x), "")))))
# [1] "0" "." "2" "3" "1" "9" "4" "6" "5" "8" "7" "I" "d" "e" "a" "l" "P" "r" "m" "i" "u" "G" "o"
# [24] "V" "y" " " "F" "E" "J" "H" "D" "S"
If you're curious about how many there are of each:
table(unlist(lapply(diamonds, function(x) unlist(strsplit(as.character(x), "")))))
# . 0 1 2 3 4 5 6 7 8 9 a
# 12082 261929 81785 142173 135042 108355 121267 157242 161862 91438 71904 67144 23161
# d D e E F G H i I J l m o
# 38539 6775 47424 9797 12942 28280 8304 15401 51763 2808 21551 27582 33976
# P r S u V y
# 13791 27483 51409 13791 49953 12082
(This is effectively akrun's answer ... posted before I saw his comment-edit.)
Using your sample frame:
k <- data.frame(cola = c("1&3#", "A4C%"), colb = c("%^", "89&"), stringsAsFactors = FALSE)
unique(unlist(lapply(k, function(x) unlist(strsplit(as.character(x), "")))))
# [1] "1" "&" "3" "#" "A" "4" "C" "%" "^" "8" "9"
And if you want them in a sorted no-space string,
paste(sort(unique(unlist(lapply(k, function(x) unlist(strsplit(as.character(x), "")))))), collapse = "")
# [1] "#%&^13489AC"
Since your question suggests you're considering using this in a regex somewhere, you can sandwich this in brackets. I wouldn't go through the pain of finding character ranges (e.g., AD-GW-Z24-9), since that buys you very little regex efficiency but would take a bit more effort to generate.
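A sketch of the bracket sandwich, with one caveat handled: inside a character class, the characters ], \, ^ and - can change the meaning of the class, so they are backslash-escaped first. The escaping shown assumes the pattern is used with perl = TRUE; chars, escaped and pattern are names chosen for this example.

```r
k <- data.frame(cola = c("1&3#", "A4C%"), colb = c("%^", "89&"),
                stringsAsFactors = FALSE)

# Unique characters across all columns
chars <- sort(unique(unlist(strsplit(unlist(k), ""))))

# Backslash-escape the characters that are special inside [...]
escaped <- gsub("([\\]\\\\^-])", "\\\\\\1", chars, perl = TRUE)

# Build the character class and try it out
pattern <- paste0("[", paste(escaped, collapse = ""), "]")
grepl(pattern, c("A", "z", "^"), perl = TRUE)
```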
Is there a function in R to find the common characters in multiple vectors (of different lengths)? For example, if I have 3 vectors...
a1 <- LETTERS[1:7]
a2 <- LETTERS[4:8]
a3 <- LETTERS[2:10]
a1
# [1] "A" "B" "C" "D" "E" "F" "G"
a2
# [1] "D" "E" "F" "G" "H"
a3
# [1] "B" "C" "D" "E" "F" "G" "H" "I" "J"
I can think of a messy solution...
intersect(intersect(a1,a2),a3)
# [1] "D" "E" "F" "G"
Problem is, I have around 8 or 9 vectors. Is there a better way to do this?
Yes:
Reduce(intersect,list(a1,a2,a3))
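Reduce() applies intersect() to the first two vectors, then to that result and the third vector, and so on, so it scales to any number of vectors. With 8 or 9 similarly named vectors, the list can be built with mget() rather than typed out, as in this sketch:

```r
a1 <- LETTERS[1:7]
a2 <- LETTERS[4:8]
a3 <- LETTERS[2:10]

# Collect a1, a2, a3 by name; change 1:3 to 1:9 for nine vectors
vecs <- mget(paste0("a", 1:3))

common <- Reduce(intersect, vecs)
common
# [1] "D" "E" "F" "G"
```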
Similar to my question at Using a sample list as a template for sampling from a larger list without wraparound, how can I now do this while allowing for wrap-around?
Thus, if I have a vector of letters:
> all <- letters
> all
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
and then I define a reference sample from letters as follows:
> refSample <- c("j","l","m","s")
in which the spacing between elements is 2 (1st to 2nd), 1 (2nd to 3rd) and 6 (3rd to 4th), how can I then select n samples from all whose elements have the same spacing as refSample, allowing wrap-around? For example, "a","c","d","j", then "q","s","t","z", and "r","t","u","a" would be valid samples, but "a","c","d","k" would not.
Again, parameterised for a function is best.
I would have left it as an exercise but here goes --
all <- letters
refSample <- c("j","l","m","s")
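pick_matches <- function(n, ref, full, wrap = FALSE) {
  iref <- match(ref, full)     # positions of the reference sample in full
  spaces <- diff(iref)         # spacing between consecutive elements
  tot_space <- sum(spaces)
  N <- length(full) - 1        # largest zero-based index
  # Without wrap-around, the start must leave room for the whole pattern
  max_start <- N - tot_space * (1 - wrap)
  starts <- sample(0:max_start, n, replace = TRUE)
  # One column per sample; the modulo wraps indices that run past the end
  return(sapply(starts, function(s) full[1 + cumsum(c(s, spaces)) %% (N + 1)]))
}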
> set.seed(1)
> pick_matches(5, refSample, all, wrap = FALSE)
[,1] [,2] [,3] [,4] [,5]
[1,] "e" "g" "j" "p" "d"
[2,] "g" "i" "l" "r" "f"
[3,] "h" "j" "m" "s" "g"
[4,] "n" "p" "s" "y" "m"
> pick_matches(5, refSample, all, wrap = TRUE)
[,1] [,2] [,3] [,4] [,5]
[1,] "x" "y" "r" "q" "b"
[2,] "z" "a" "t" "s" "d"
[3,] "a" "b" "u" "t" "e"
[4,] "g" "h" "a" "z" "k"
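To check that a given sample really has the same wrap-around spacing as the reference, a small helper along these lines can be used. same_spacing is a name made up for this sketch, and it assumes every element of the candidate occurs exactly once in full.

```r
# Hypothetical checker: TRUE if candidate has the same wrap-around gaps as ref
same_spacing <- function(candidate, ref, full) {
  n <- length(full)
  # Gaps between consecutive positions, taken modulo n to allow wrap-around
  gaps <- function(v) diff(match(v, full)) %% n
  identical(gaps(candidate), gaps(ref))
}

refSample <- c("j", "l", "m", "s")
same_spacing(c("r", "t", "u", "a"), refSample, letters)  # TRUE
same_spacing(c("a", "c", "d", "k"), refSample, letters)  # FALSE
```

Running it over the columns of a pick_matches() result, for example with apply(res, 2, same_spacing, ref = refSample, full = all), should return TRUE for every column.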