In R, how can I produce all the permutations of a group when the group contains repeated elements?
Example:
A = {1,1,2,2,3}
Solution:
1,1,2,2,3
1,1,2,3,2
1,1,3,2,2
1,2,1,2,3
1,2,2,1,3
1,2,2,3,1
.
.
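Using the gtools package (with set = FALSE the duplicates in x are kept, so identical rows appear more than once; unique() drops them):
library(gtools)
x <- c(1,1,2,2,3)
unique(permutations(5, 5, x, set = FALSE))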
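Just use the combinat package (permn() returns a list of all 120 orderings, so wrap it in unique() to keep only the distinct ones):
A = c(1,1,2,2,3)
library(combinat)
unique(permn(A))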
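If you want to do it with built-in R, the following draws a single random permutation (it does not enumerate all of them):
permute <- function(vec, n = length(vec)) {
  # sample n positions without replacement, then reorder vec by them
  permute.index <- sample.int(length(vec), n)
  return(vec[permute.index])
}
permute(A)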
Using the permute package:
x <- c(1,1,2,2,3)
require(permute)
allPerms(x, observed = TRUE)
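For reference, the distinct permutations can also be enumerated in base R without generating duplicates first. The following is only a minimal sketch, not taken from any of the packages above; `unique_perms` is a hypothetical helper name, and the recursion is practical only for short vectors.

```r
# Hypothetical helper: all distinct permutations of a vector that may contain
# repeated values, returned as a matrix with one permutation per row.
unique_perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  out <- NULL
  # Try each distinct value in the first position exactly once
  for (val in unique(v)) {
    rest <- v[-match(val, v)]  # remove a single occurrence of val
    out <- rbind(out, cbind(val, unique_perms(rest), deparse.level = 0))
  }
  out
}

A <- c(1, 1, 2, 2, 3)
res <- unique_perms(A)
nrow(res)  # 5! / (2! * 2! * 1!) distinct orderings
```

Because each distinct value is placed first only once per level, no duplicate rows are produced and no call to unique() is needed afterwards.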
I have done extensive research on combinations and permutations. The results are written up in a book called Junction (an art of counting combinations and permutations). To view my site, go to https://sites.google.com/site/junctionslpresentation/home
I also have a solution for your question. I have found a way to order permutations of multiple identical objects, which I call CON of MSNO (Combination Order Number of Multiple Same Number of Objects).
To view this method of ordering, go to https://sites.google.com/site/junctionslpresentation/proof-for-advance-permutation
At the bottom of that page I have attached some Word documents. The solution you need is in the documents "12 Proof (CON of MSNO)" and "13 Proof (Converse of CON of MSNO)". Download them to view the material properly.
Similar questions have been raised for other languages: C, sql, java, etc.
But I'm trying to do this in R.
I have:
ret_series <- c(1, 2, 3)
x <- "ret_series"
How do I get (1, 2, 3) by calling some function on x or otherwise manipulating it, without mentioning ret_series directly?
You provided the answer in your question. Try get.
> get(x)
[1] 1 2 3
For a one-off use, the get function works (as has been mentioned), but it does not scale well to larger projects. It is better to store your data in lists or environments, then use [[ to access the individual elements:
mydata <- list( ret_series=c(1,2,3) )
x <- 'ret_series'
mydata[[x]]
What's wrong with either of the following?
eval(as.name(x))
eval(as.symbol(x))
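Note that some of the examples above wouldn't work for a column of a data.frame.
For instance, given
x <- data.frame(a=seq(1,5))
get("x$a") would not give you x$a, because get() looks for an object literally named "x$a". Use get("x")$a or get("x")[["a"]] instead.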
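To tie the answers above together, here is a small sketch of looking objects up by name in a dedicated environment and of selecting a data.frame column by name. The names `store`, `df` and `col_name` are made up for illustration and do not come from the question.

```r
# Keep named objects in their own environment instead of the global workspace
store <- new.env()
assign("ret_series", c(1, 2, 3), envir = store)

x <- "ret_series"
from_get <- get(x, envir = store)   # lookup by name with get()
from_sub <- store[[x]]              # same lookup with [[ on the environment

# exists() checks for a name without raising an error
has_other <- exists("no_such_name", envir = store, inherits = FALSE)

# For a data.frame column, index by the column name rather than get("df$a")
df <- data.frame(a = seq(1, 5))
col_name <- "a"
col_vals <- df[[col_name]]
```

Keeping the data in a list or environment also makes it easy to loop over every stored name with `ls(store)` or `names(mydata)`.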
This question asks for an R implementation of the following question: Find the longest common starting substring in a set of strings (JavaScript)
"This problem is a more specific case of the Longest common substring problem. I need to only find the longest common starting substring in an array".
So I'm just looking for an R implementation of this (preferably not in the for / while loop fashion suggested in the JavaScript version). If possible, I would like to wrap it up as a function so I can apply it to many groups in a data table.
After some searching, I couldn't find an R example for this, hence this question.
Example Data:
I have the following vector of characters:
dput(data)
c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
I'm looking to run an algorithm in R that will find the following output: ADA44.
From what I've seen in the accepted JavaScript answer, the idea is to first sort the vector, extract the first and last elements (for example: "ADA4417-3ARMZ-R7" and "ADA4432-1BRJZ-R2"), break them into single characters, and loop through them until one of the characters doesn't match (I hope I'm right).
Any help on that would be great!
Taking inspiration from what you suggested, you can try this function :
comsub <- function(x) {
  # sort the vector
  x <- sort(x)
  # split the first and last element by character
  d_x <- strsplit(x[c(1, length(x))], "")
  # compute the cumulative sum of common elements
  cs_x <- cumsum(d_x[[1]] == d_x[[2]])
  # check if there is at least one common element
  if (cs_x[1] != 0) {
    # see when it stops incrementing and get the position of last common element
    der_com <- which(diff(cs_x) == 0)[1]
    # return the common part
    return(substr(x[1], 1, der_com))
  } else { # else, return an empty vector
    return(character(0))
  }
}
UPDATE
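Following @nicola's suggestion, a simpler variant of the function (it also copes with strings of different lengths and with the case where one string is a prefix of the other):
comsub <- function(x) {
  # sort the vector
  x <- sort(x)
  # split the first and last element by character
  d_x <- strsplit(x[c(1, length(x))], "")
  # compare only up to the length of the shorter of the two strings
  n <- min(lengths(d_x))
  # search for the first not common element (NA if there is none)
  first_diff <- match(FALSE, d_x[[1]][seq_len(n)] == d_x[[2]][seq_len(n)])
  # position of the last matching element
  der_com <- if (is.na(first_diff)) n else first_diff - 1
  # if there is no matching element, return an empty vector, else return the common part
  if (der_com == 0) character(0) else substr(x[1], 1, der_com)
}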
Examples:
With your data
x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"
When there is no common starting substring
x<-c("abc","def")
> comsub(x)
# character(0)
A non-base alternative, using the lcprefix function in Biostrings to find the "Longest Common Prefix [...] of two strings"
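# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Biostrings")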
library(Biostrings)
x2 <- sort(x)
substr(x2[1], start = 1, stop = lcprefix(x2[1], x2[length(x2)]))
# [1] "ADA44"
Piggybacking off Henrik's answer, Bioconductor has a C based prefix function and an R based one. The R based one is:
lcPrefix <- function (x, ignore.case = FALSE)
{
    x <- as.character(x)
    if (ignore.case)
        x <- toupper(x)
    nc <- nchar(x, type = "char")
    for (i in 1:min(nc)) {
        ss <- substr(x, 1, i)
        if (any(ss != ss[1])) {
            return(substr(x[1], 1, i - 1))
        }
    }
    substr(x[1], 1, i)
}
<environment: namespace:Biobase>
... and doesn't require any special features of Bioconductor (as far as I can tell).
--- Citation ---
Orchestrating high-throughput genomic analysis with Bioconductor. W.
Huber, V.J. Carey, R. Gentleman, ..., M. Morgan Nature Methods,
2015:12, 115.
Here is a compact solution:
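data <- c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
          "ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
# one row per string, one column per character (padded to the length of data[1])
chars <- do.call(rbind, lapply(strsplit(data, ''), `length<-`, nchar(data[1])))
# TRUE for every column whose characters are not all identical
differs <- apply(chars, 2, function(i) length(unique(i)) != 1)
# keep everything before the first differing column
substr(data[1], 1, which.max(differs) - 1)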
[1] "ADA44"
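Since the question also asks for a function that can be applied to many groups, here is a minimal base R sketch of that step. `common_prefix`, `parts` and `family` are hypothetical names, the two "LT" part numbers are invented sample data, and the function returns "" (rather than character(0)) when nothing is shared.

```r
# Hypothetical helper: longest common prefix of a character vector
common_prefix <- function(x) {
  x <- as.character(x)
  if (length(x) == 0) return("")
  n <- min(nchar(x))
  if (n == 0) return("")
  # one row per string, one column per character position (first n only)
  chars <- do.call(rbind, strsplit(substr(x, 1, n), ""))
  # TRUE where every string has the same character in that column
  same <- apply(chars, 2, function(col) all(col == col[1]))
  k <- if (all(same)) n else which(!same)[1] - 1
  substr(x[1], 1, k)
}

# Invented example: two groups of part numbers
parts  <- c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "LT1001-A", "LT1013-B")
family <- c("a", "a", "b", "b")

# Apply the helper once per group
prefix_by_group <- tapply(parts, family, common_prefix)
```

With a data.table the same helper could presumably be called per group inside `[`, but that usage is not shown here.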
(I've tried asking this on BioStars, but for the slight chance that someone from text mining would think there is a better solution, I am also reposting this here)
The task I'm trying to achieve is to align several sequences.
I don't have a basic pattern to match to. All I know is that the "True" pattern should be of length 30 and that the sequences I have had missing values introduced into them at random points.
Here is an example of such sequences, where on the left we see the real location of the missing values, and on the right we see the sequence that we will be able to observe.
My goal is to reconstruct the left column using only the sequences in the right column (based on the fact that many of the letters in each position are the same).
Real_sequence The_sequence_we_see
1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
Here is an example code to reproduce the above example:
ATCG <- c("A", "T", "C", "G")
set.seed(40)
original.seq <- sample(ATCG, 30, TRUE)
seqS <- matrix(original.seq, 200, 30, byrow = TRUE)
change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG)
{
  # change a random number of positions (between 1 and number.of.changes)
  number.of.changes <- sample(seq_len(number.of.changes), 1)
  new.letters <- sample(letters.to.change.with, number.of.changes, TRUE)
  where.to.change.the.letters <- sample(seq_along(x), number.of.changes, FALSE)
  x[where.to.change.the.letters] <- new.letters
  return(x)
}
change.letters(original.seq)
insert.missing.values <- function(x) change.letters(x, 3, "-")
insert.missing.values(original.seq)
seqS2 <- t(apply(seqS, 1, change.letters))
seqS3 <- t(apply(seqS2, 1, insert.missing.values))
seqS4 <- apply(seqS3, 1, function(x) {paste(x, collapse = "")})
library(stringr)
# library(help=stringr)
# str_replace_all() removes every "-"; str_replace() would only remove the first one
all.seqS <- str_replace_all(seqS4, "-", "")
# how do we align this?
data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)
I understand that if all I had was a string and a pattern I would be able to use
library(Biostrings)
pairwiseAlignment(...)
But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern).
Is there a known method for doing this in R?
Writing an alignment algorithm in R looks like a bad idea to me, but there is an R interface to the MUSCLE algorithm in the bio3d package (function seqaln()). Be aware that you have to install MUSCLE itself first.
Alternatively, you can use any of the available algorithms (e.g. ClustalW, MAFFT, T-COFFEE) and import the multiple sequence alignments into R using Bioconductor functionality. See e.g. here.
Though this is quite an old thread, I do not want to miss the opportunity to mention that, since Bioconductor 3.1, there is a package 'msa' that implements interfaces to three different multiple sequence alignment algorithms: ClustalW, ClustalOmega, and MUSCLE. The package runs on all major platforms (Linux/Unix, Mac OS, and Windows) and is self-contained in the sense that you need not install any external software. More information can be found on http://www.bioinf.jku.at/software/msa/ and http://www.bioconductor.org/packages/release/bioc/html/msa.html.
You can perform multiple alignment in R with the DECIPHER package.
Following your example, it would look something like:
library(DECIPHER)
dna <- DNAStringSet(all.seqS)
aligned_DNA <- AlignSeqs(dna)
It is fast and at least as accurate as the other methods listed here (see the paper). I hope that helps!
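Once the sequences are aligned by one of the tools above, the shared pattern can be read off column by column. The sketch below is base R only and is not an alignment algorithm: it assumes the input is already aligned (equal-length strings with "-" for gaps), and `consensus` is a hypothetical helper that takes a simple majority vote per column. The input is the Real_sequence column from the question.

```r
# Hypothetical helper: majority-vote consensus of already aligned sequences
consensus <- function(aligned) {
  stopifnot(length(unique(nchar(aligned))) == 1)  # all rows must be equal length
  chars <- do.call(rbind, strsplit(aligned, ""))  # one row per sequence
  vote <- function(col) {
    col <- col[col != "-"]                        # gaps do not vote
    if (length(col) == 0) return("-")
    names(which.max(table(col)))                  # ties go to the first letter alphabetically
  }
  paste(apply(chars, 2, vote), collapse = "")
}

# Aligned sequences copied from the Real_sequence column of the question
aligned <- c("CGCAATACTAAC-AGCTGACTTACGCACCG",
             "CGCAATACTAGC-AGGTGACTTCC-CT-CG",
             "CGCAATGATCAC--GGTGGCTCCCGGTGCG",
             "CGCAATACTAACCA-CTAACT--CGCTGCG",
             "CGCACGGGTAAGAACGTGA-TTACGCTCAG",
             "CGCTATACTAACAA-GTG-CTTAGGC-CTG",
             "CCCA-C-CTAA-ACGGTGACTTACGCTCCG")
cons <- consensus(aligned)
```

With only seven noisy rows the vote will not necessarily recover the original sequence; with the 200 sequences generated in the question it should be far more stable.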
You are looking for a global alignment algorithm on multiple sequences.
Did you look at Wikipedia before asking ?
First learn what global alignment is, then look for multiple sequence alignment.
Wikipedia doesn't give a lot of details about algorithms, but this paper is better.