I have a number of data sequences and I want to select the longest sequence out of them using R

I am working on a large number of sequences (nucleotide sequences) and I want to select the longest sequence (the sequence with the biggest length) out of them.
My sequences are elements of a list.
I am working in R.
Any help with the code? Which functions should I use?

If your list is named l, sapply(l,length) returns a vector with the length of each element in the list. To select the longest sequence, use:
s<-sapply(l,length) # or use s<-lengths(l) (Richard Scriven's comment)
longest<-l[[match(max(s),s)]]
Example :
x<-rnorm(100)
y<-rnorm(1000)
z<-rnorm(10)
l<-list(x,y,z)
s<-sapply(l,length)
longest<-l[[match(max(s),s)]]
length(longest)
[1] 1000
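Note that length() counts vector elements. If each nucleotide sequence is stored as a single character string, length() is 1 for every element and nchar() is the relevant measure. A minimal sketch, assuming a hypothetical list `seqs` of single strings (the names and data are made up for illustration):

```r
# Hypothetical example data: each list element is one nucleotide string
seqs <- list(a = "ACGT", b = "ACGTACGTAC", c = "AC")

# nchar() counts the characters in each string
n <- nchar(unlist(seqs))

# which.max() returns the index of the first maximum
longest <- seqs[[which.max(n)]]
longest

# Keep every sequence that ties for the maximum length
all_longest <- seqs[n == max(n)]
```

which.max() returns only the first maximum, so the last line is the variant to use when ties matter.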

Related

Check if vector of strings contains words created from two others words

I have very very long vector of strings (peptides).
head(unique(pseq_list))
#[1] "GPPNHHMGPMSER" "SLSGQCHHHGENLR" "HSSGQDKPHETYR"
#"DHDKPHQQSDK" "AHMESDK" "HISESHEK"
I want to check whether this vector contains peptides that are formed by joining two other peptides. For example, if it contains "AHMESDK", "AHME" and "SDK", I want to know that. I tried the grepl function, but my vector is probably too long(?). Also, how can I save such results?
If it is too difficult to verify that "AHMESDK" = "AHME" + "SDK", it would be nice to at least know whether the vector holds peptides that contain others (for example "HISESHEK" and "SES").
Context provided by @quant in the comments:
As a note for everyone without biological background.
Peptides are macromolecules. Our body can compose these macromolecules by "gluing" different amino acids together. The sequence of amino acids glued together is called the primary structure of a peptide, and in bioinformatics the one-letter code (see rpeptide.com) is often used to represent the primary structure.
So AHMESDK simply means a peptide composed of Alanine, Histidine and so on.
Data:
pseq<-c("GPPNHHMGPMSER", "SLSGQCHHHGENLR", "HSSGQDKPHETYR", "DHDKPHQQSDK", "AHMESDK", "AHME", "SES", "HISESHEK")
Two approaches:
Approach 1:
peplist<-sapply(pseq,grep, pseq, value=TRUE)
Result:
$GPPNHHMGPMSER
[1] "GPPNHHMGPMSER"
$SLSGQCHHHGENLR
[1] "SLSGQCHHHGENLR"
$HSSGQDKPHETYR
[1] "HSSGQDKPHETYR"
$DHDKPHQQSDK
[1] "DHDKPHQQSDK"
$AHMESDK
[1] "AHMESDK"
$AHME
[1] "AHMESDK" "AHME"
$SES
[1] "SES" "HISESHEK"
$HISESHEK
[1] "HISESHEK"
This gives you a list where, for every element, you get the elements it occurs in. We can then keep only those peptides that appear within other peptides:
peplist[sapply(peplist,length)>1]
Approach 2:
library(magrittr) # provides the %>% pipe
pepcombs<-expand.grid(pseq,pseq) %>%
apply(1,paste0,collapse="")
pseq[pseq %in% pepcombs]
This will give you the peptides that can be constructed by concatenating two of the peptides in the vector.
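For a very long vector, building every pairwise concatenation grows quadratically with the number of peptides. One alternative, sketched here under assumptions, is to split each peptide at every internal position and check whether both halves are in the vector. The function name `find_splits` is hypothetical, and "SDK" has been added to the example data so that there is a pair to find:

```r
# Example data from above, with "SDK" added for illustration
pseq <- c("GPPNHHMGPMSER", "SLSGQCHHHGENLR", "HSSGQDKPHETYR", "DHDKPHQQSDK",
          "AHMESDK", "AHME", "SDK", "SES", "HISESHEK")

# Hypothetical helper: one row per (peptide, left, right) decomposition
find_splits <- function(peps) {
  peps <- unique(peps)
  rows <- lapply(peps, function(p) {
    n <- nchar(p)
    if (n < 2) return(NULL)
    cut <- seq_len(n - 1)               # every internal split position
    left <- substring(p, 1, cut)
    right <- substring(p, cut + 1, n)
    ok <- left %in% peps & right %in% peps
    if (!any(ok)) return(NULL)
    data.frame(peptide = p, left = left[ok], right = right[ok],
               stringsAsFactors = FALSE)
  })
  rows <- Filter(Negate(is.null), rows) # drop peptides without a decomposition
  do.call(rbind, rows)                  # NULL when nothing was found
}

res <- find_splits(pseq)
res
# To save the result: write.csv(res, "peptide_splits.csv", row.names = FALSE)
```

The result is an ordinary data frame, so it can be written to disk with write.csv() or kept in the session for further filtering.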

How to find the length of a list based on a condition in R

The problem
I would like to find a length of a list.
The expected output
I would like to find the length based on a condition.
Example
Suppose that I have a list of 4 elements as follows:
myve <- list(1,2,3,0)
Here I have 4 elements, one of them is zero. How can I find the length by extracting the zero values? Then, if the length is > 1, I would like to subtract one. That is:
If the length is 4 then, I would like to have 4-1=3. So, the output should be 3.
Note
Please note that I am working on a problem where the number of zero values may change from one case to another. For example, the first list may have only one 0 value, while the second list may have 2 or 3 zero values.
The values are always positive or zero.
You just need to apply the condition to each element. This produces a logical vector, which you then sum to get the number of TRUE elements (i.e. the elements that satisfy your condition).
In your case:
sum(myve != 0)
In a more complex case, where the condition is expressed by a function f:
sum(sapply(myve, f))
Use sapply to flag the elements that differ from zero, and sum to count them:
sum(sapply(myve, function(x) x!=0))
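As a minimal sketch of the same counting idea with an explicit predicate, the snippet below uses vapply(), which checks that every result is a single logical value. The name `is_nonzero` is hypothetical and stands in for whatever condition applies:

```r
myve <- list(1, 2, 3, 0)

# Hypothetical predicate; replace with the condition you need
is_nonzero <- function(x) x != 0

# vapply() guarantees one logical value per element
n_nonzero <- sum(vapply(myve, is_nonzero, logical(1)))
n_nonzero

# Number of zero elements, derived from the total length
n_zero <- length(myve) - n_nonzero
n_zero
```

Using vapply() instead of sapply() makes the code fail early if an element is not a single number, which can help when the lists come from different cases.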

Creating Vector in R (multiple conditions)

Need to create and print a vector in R that includes the following in this order:
A sequence of integers from 6 to 10 (inclusive)
A twofold repetition of the vector c(2, -5.1, -33)
The value of the sum of 7/42 and 2
a) Then extract the first and last elements of the vector to form another vector
b) Form a third vector from the elements not extracted above
* Use the vectors from (a) and (b) to reconstruct and print the original first vector
That should do it:
a.vec<-c(seq(6,10,1),rep(c(2,-5.1,-33),times=2),(7/42+2))
b.vec<-a.vec[c(1,length(a.vec))]
c.vec<-a.vec[-c(1,length(a.vec))]
a.vec<-c(b.vec[1],c.vec,b.vec[2])
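A quick way to confirm that the reconstruction matches the original is to store it under a new name and compare with identical(). This is a sketch of the same steps; the name `rebuilt` is introduced here only for the check:

```r
# Original vector: 6..10, the triple repeated twice, then 7/42 + 2
a.vec <- c(6:10, rep(c(2, -5.1, -33), times = 2), 7/42 + 2)

# (a) first and last elements, (b) everything else
b.vec <- a.vec[c(1, length(a.vec))]
c.vec <- a.vec[-c(1, length(a.vec))]

# Reconstruct under a new name so the original is kept for comparison
rebuilt <- c(b.vec[1], c.vec, b.vec[2])
print(rebuilt)
identical(rebuilt, a.vec)
```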

How can I compare two strings to find the number of characters that match in R, using substitution distance?

In R, I have two character vectors, a and b.
a <- c("abcdefg", "hijklmnop", "qrstuvwxyz")
b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz")
I want a function that counts the character mismatches between each element of a and the corresponding element of b. Using the example above, such a function should return c(2,3,1). There is no need to align the strings.
I need to compare each pair of strings character-by-character and count matches and/or mismatches in each pair. Does any such function exist in R?
Or, to ask the question in another way, is there a function to give me the edit distance between two strings, where the only allowed operation is substitution (ignore insertions or deletions)?
Using some mapply fun:
mapply(function(x,y) sum(x!=y),strsplit(a,""),strsplit(b,""))
#[1] 2 3 1
Another option is adist, which computes the approximate string distance (generalized Levenshtein distance) between character vectors:
mapply(adist,a,b)
abcdefg hijklmnop qrstuvwxyz
2 3 1
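One caveat: adist also allows insertions and deletions, so it can report a smaller number than a position-by-position comparison (for "abc" versus "bca" it gives 2, although all three positions differ). If only substitutions should count, the strsplit approach can be wrapped in a small function that also checks the lengths. This is a sketch, and the name `hamming` is hypothetical:

```r
a <- c("abcdefg", "hijklmnop", "qrstuvwxyz")
b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz")

# Hypothetical helper: substitution-only (Hamming) distance per pair
hamming <- function(x, y) {
  # Pairs must line up and have equal length, otherwise the count is undefined
  stopifnot(length(x) == length(y), all(nchar(x) == nchar(y)))
  mapply(function(p, q) sum(p != q),
         strsplit(x, ""), strsplit(y, ""),
         USE.NAMES = FALSE)
}

hamming(a, b)
```

The length check turns silent recycling of unequal strings into an error, which is usually what you want when no alignment is intended.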

Counting specific characters in a string, across a data frame. sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries into their individual elements to get the following (i.e., for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations.
i.e.
[hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is to simply count the rows with commas ',' as they don't appear in any other text in any other columns, except where there are multiple genomic locations for the feature.
However I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually, running grep -c on the same data on the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data.
What I actually intended to do was to stick to something I know (the command line): grep out the rows with ',', duplicate the file, split and awk selected columns (first and second location in the respective files), then cat and sort them. If there is a niftier way for me to do this in R then I would love a pointer.
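gregexpr returns a list with one element per input string. Each element holds the match positions, and a failed match is reported as the single value -1, so the length is 1 both for rows with no comma and for rows with exactly one. To find the rows that have a match, look at the returned value, not its length.
Try foo <- sapply(gregexpr(",", testdat$genome_coordinates), function(m) m[1] != -1) to get the rows with a comma; grepl(",", testdat$genome_coordinates) returns the same logical vector directly.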
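For the second part of the question, here is a minimal base R sketch that counts the commas per row and then puts each location on its own row. It assumes the column is called genome_coordinates as in the question; the output column names (source_row, genome, chr, start, end, strand) are made up for illustration:

```r
testdat <- data.frame(
  genome_coordinates = c("[hg19:2:224840068-224840089:-]",
                         "[hg19:17:37092945-37092969:-]",
                         "[hg19:20:3904018-3904040:+]",
                         "[hg19:16:67000244-67000248,67000628-67000647:+]"),
  stringsAsFactors = FALSE
)

# A failed match is the single value -1, so count only positive positions
m <- gregexpr(",", testdat$genome_coordinates, fixed = TRUE)
testdat$multiple <- vapply(m, function(p) sum(p > 0), integer(1))

# Strip the brackets, then split into genome, chromosome, ranges, strand
parts <- strsplit(gsub("\\[|\\]", "", testdat$genome_coordinates),
                  ":", fixed = TRUE)

# One output row per start-end range; source_row points back to the input row
rows <- lapply(seq_along(parts), function(i) {
  p <- parts[[i]]
  ranges <- strsplit(p[3], ",", fixed = TRUE)[[1]]
  se <- do.call(rbind, strsplit(ranges, "-", fixed = TRUE))
  data.frame(source_row = i, genome = p[1], chr = p[2],
             start = as.numeric(se[, 1]), end = as.numeric(se[, 2]),
             strand = p[4], stringsAsFactors = FALSE)
})
long <- do.call(rbind, rows)
long
```

The remaining columns of the original data frame can be attached by indexing with source_row, for example testdat[long$source_row, ], so the adjacent data is repeated for every location.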
