Find the sequence using R - r

How to write a function that accepts a DNA sequence (as a single string) and a number “n >= 2” and returns a vector with all DNA subsequences (as strings) that start with the triplet “AAA” or “GAA” and end with the triplet “AGT” and have at least 2 and at most “n” other triplets between the start and the end.
Q1:
for "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT" and for n=2,
the answer is c=(“GAACCCACTAGT”, “AAATTTGGGAGT”).
Q2:
e.g, n=10
the answer is: c("GAACCCACTAGTATAAAATTTGGGAGT", "AAACCCTTTGGGAGT")

here is a possible approach.
it uses a regex based on 2 -> n repetitions of three [A-Z] as it's core.
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
#for n = 10, this looks like: "(AAA|GAA)([A-Z]{3}){2,10}AGT"
stringr::str_extract_all( dna, regex )
# n = 2
# [[1]]
# [1] "GAACCCACTAGT" "AAATTTGGGAGT"
# n = 10
# [[1]]
# [1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"

Related

How to split a sentence in two halves in R

I have a vector of string, and I want each string to be cut roughly in half, at the nearest space.
For exemple, with the following data :
test <- data.frame(init = c("qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk",
"qsdf",
"mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll",
"qsddddddddddddddddddddddddddddddd",
"qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj"), stringsAsFactors = FALSE)
I want to get something like this :
first sec
1 qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk
2 qsdf
3 mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd
5 lmj mjjmjmjm lkj lmj mjjmjmjm lkj
Any solution that does not cut in halves but "so that the first part isn't longer than X character" would be also great.
First, we split the strings by spaces.
a <- strsplit(test$init, " ")
Then we find the last element of each vector for which the cumulative sum of characters is lower than half the sum of all characters in the vector:
b <- lapply(a, function(x) which.max(cumsum(cumsum(nchar(x)) <= sum(nchar(x))/2)))
Afterwards we combine the two halfs, substituting NA if the vector was of length 1 (only one word).
combined <- Map(function(x, y){
if(y == 1){
return(c(x, NA))
}else{
return(c(paste(x[1:y], collapse = " "), paste(x[(y+1):length(x)], collapse = " ")))
}
}, a, b)
Finally, we rbind the combined strings and change the column names.
newdf <- do.call(rbind.data.frame, combined)
names(newdf) <- c("first", "second")
Result:
> newdf
first second
1 qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk
2 qsdf <NA>
3 mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd <NA>
5 qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj
You can use the function nbreak from the package that I wrote:
devtools::install_github("igorkf/breaker")
library(tidyverse)
test <- data.frame(init = c("Phrase with four words", "That phrase has five words"), stringsAsFactors = F)
#This counts the numbers of words of each row:
nwords = str_count(test$init, " ") + 1
#This is the position where break the line for each row:
break_here = ifelse(nwords %% 2 == 0, nwords/2, round(nwords/2) + 1)
test
# init
# 1 Phrase with four words
# 2 That phrase has five words
#the map2_chr is applying a function with two arguments,
#the string is "init" and the n is "break_here":
test %>%
mutate(init = map2_chr(init, break_here, ~breaker::nbreak(string = .x, n = .y, loop = F))) %>%
separate(init, c("first", "second"), sep = "\n")
# first second
# 1 Phrase with four words
# 2 That phrase has five words

Extracting numbers from character string based on delimiters

I have the following dataframe:
a <- seq(1:5)
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3",
"abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
df <- data.frame(a, b)
df$b <- as.character(df$b)
And I need to extract the numbers in df$b that come between the second and third underscores and assign to df$c.
I'm guessing there's a fairly simple solution, but haven't found it yet. The actual dataset is fairly large (3MM rows) so efficiency is a bit of a factor.
Thanks for the help!
We can use sub to match the zeor or more characters that are not a _ ([^_]*) from the start (^) of the string followed by an underscore (_), then another set of characters that are not an underscore followed by underscore, capture the one of more numbers that follows in a group ((\\d+)) followed by underscore and other characters, then replace it with the backreference for that group and finally convert it to numeric
as.numeric(sub("^[^_]*_[^_]+_(\\d+)_.*", "\\1", df$b))
#[1] 123456 78912 345678912 34567 891234556778
create a my_split function that finds the start and end position of "_" using gregexpr. Then extract the string between start and end position using substr.
my_split <- function(x, start, end){
a1 <- gregexpr("_", x)
substr(x, a1[[1]][start]+1, a1[[1]][end]-1)
}
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3", "abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
sapply(b, my_split, start = 2, end = 3)
# abc_a_123456_defghij_1 abc_a_78912_abc_2
# "123456" "78912"
# abc_a_345678912_xyzabc_3 abc_b_34567_defgh_4
# "345678912" "34567"
# abc_c_891234556778_ijklmnop_5
# "891234556778"
using data.table library
library(data.table)
setDT(df)[, c := lapply(b, my_split, start = 2, end = 3)]
df
# a b c
# 1: 1 abc_a_123456_defghij_1 123456
# 2: 2 abc_a_78912_abc_2 78912
# 3: 3 abc_a_345678912_xyzabc_3 345678912
# 4: 4 abc_b_34567_defgh_4 34567
# 5: 5 abc_c_891234556778_ijklmnop_5 891234556778
data:
a <- seq(1:5)
b <- c("abc_a_123456_defghij_1", "abc_a_78912_abc_2", "abc_a_345678912_xyzabc_3", "abc_b_34567_defgh_4", "abc_c_891234556778_ijklmnop_5")
df <- data.frame(a, b, stringsAsFactors = FALSE)

How to count the number of two observations binary combinations?

In the example below, I would like the know the number of 010 sequences, or the number of 1010 sequences. Below is a workable example;
x <- c(1,0,0,1,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,1,0)
In this example, the number of 010 sequences would be 6 and the number of 1010 sequences would be 4.
What would be the most efficient/simplest way to count the number of consecutive sequences?
A stringless way:
f = function(x, patt){
if (length(x) == length(patt)) return(as.integer(x == patt))
w = head(seq_along(x), 1L-length(patt))
for (k in seq_along(patt)) w <- w[ x[w + k - 1L] == patt[k] ]
w
}
length(f(x, patt = c(0,1,0))) # 6
length(f(x, patt = c(1,0,1,0))) # 4
Alternatives. From #cryo11, here's another way:
function(x,patt) sum(apply(embed(x,length(patt)),1,function(x) all(!xor(x,patt))))
or another variation:
function(x,patt) sum(!colSums( xor(patt, t(embed(x,length(patt)))) ))
or with data.table:
library(data.table)
setkey(setDT(shift(x, seq_along(patt), type = "lead")))[as.list(patt), .N]
(The shift function is very similar to embed.)
Another solution would be this:
library(stringr)
x <- c(1,0,0,1,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,1,0)
xx = paste0(x, collapse = "")
str_count(xx, '(?<=010)')
[1] 6
str_count(xx, '(?<=1010)')
[1] 4
As #Pierre Lafortune pointed out in the comments this can be done without using any packages:
length(gregexpr("(?<=010)", xx, perl=TRUE)[[1]])
[1] 6
logic : take a substr of length of pattern you are searching for and compare it with the pattern.
xx = paste0(x, collapse = "")
# [1] "1001000111001010101010"
# case 1 :
xxx = "010"
sum(sapply(1:(length(x)-nchar(xxx)+1), function(i) substr(xx,i,i+nchar(xxx)-1)==xxx))
# [1] 6
# case 2 :
xxx = "1010"
# [1] 4
R introduced the startsWith function in 3.3.0. Using this and substring, we can implement #joel.wilson's method as
sum(startsWith(substring(paste(x, collapse=""),
head(seq_along(x), -2), tail(seq_along(x), -2)), "010"))
Here, substring constructs all three character adjacent sets and startsWith tests if each of these is the same as "010". The TRUE values are then summed together.

Extract 2 parts of a string

Assume I have the following string (filename):
a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
which consists of several parts (here is given p1)
or another one
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"
which consists of only one part (so no need to label any p)
How can I extract the Identifier, which is the three letters before the VARXXXXX (so in case one it would be TKN, in case two it would be ZHN) PLUS the part identifier, if available?
So the result should be:
case1 : TKN_p1
case2 : ZHN
I know how to extract the first identifier, but I cannot handle the second one at the same time.
My approach so far:
sub(".*(.{3})_VAR29380_timely(.{3}).*","\\1\\2", a)
sub(".*(.{3})_VAR29380_timely(.{3}).*","\\1\\2", b)
but this adds .tx incorrectly in the second case.
You are not using anchors and matching the last 3 characters right after timely without checking what these characters are (. matches any character).
I suggest
sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)
Details:
^ - start of string
.*/ - part of string up to and including the last /
([A-Z]{3}) - 3 ASCII uppercase letters captured into Group 1
_VAR\\d+_timely - _VAR + 1 or more digits + _timely
(_[^_.]+)? - an optional Group 2 capturing _ + 1 or more chars other than _ and .
\\. - a dot
[^.]* - zero or more chars other than .
$ - end of string.
Replacement pattern contains 2 backreferences to both the capturing groups to insert their contents to the replaced string.
R demo:
a <- "X/ZHEB100/TKN_VAR29380_timely_p1.txt"
a2 <- sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", a)
a2
[1] "TKN_p1"
b <- "X/ZHEB100/ZHN_VAR29380_timely.txt"
b2 <- sub("^.*/([A-Z]{3})_VAR\\d+_timely(_[^_.]+)?\\.[^.]*$", "\\1\\2", b)
b2
[1] "ZHN"
Just another solution, for something different from Wiktor's already working solution:
library( magrittr )
data <- c( a, b )
First get the "ID" values by splitting on "/", taking the last value, and taking the first 3 characters of that:
ID <- strsplit( data, "/" ) %>%
sapply( tail, n = 1 ) %>%
substr( 1, 3 )
Then get the "part" values by splitting out both "timely" and ".txt", and taking the last element (which may be an empty string):
part <- strsplit( data, "timely|.txt" ) %>%
sapply( tail, n = 1 )
Now just paste them together for the result:
output <- paste0( ID, part )
output
[1] "TKN_p1" "ZHN"
Or, if you'd rather not create the intermediate objects:
output <- strsplit( data, "/" ) %>%
sapply( tail, n = 1 ) %>%
substr( 1, 3 ) %>%
paste0( strsplit( data, "timely|.txt" ) %>%
sapply( tail, n = 1 ) )

How to delete everything after nth delimiter in R?

I have this vector myvec. I want to remove everything after second ':' and get the result. How do I remove the string after nth ':'?
myvec<- c("chr2:213403244:213403244:G:T:snp","chr7:55240586:55240586:T:G:snp" ,"chr7:55241607:55241607:C:G:snp")
result
chr2:213403244
chr7:55240586
chr7:55241607
We can use sub. We match one or more characters that are not : from the start of the string (^([^:]+) followed by a :, followed by one more characters not a : ([^:]+), place it in a capture group i.e. within the parentheses. We replace by the capture group (\\1) in the replacement.
sub('^([^:]+:[^:]+).*', '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
The above works for the example posted. For general cases to remove after the nth delimiter,
n <- 2
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
Checking with a different 'n'
n <- 3
and repeating the same steps
sub(pat, '\\1', myvec)
#[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
#[3] "chr7:55241607:55241607"
Or another option would be to split by : and then paste the n number of components together.
n <- 2
vapply(strsplit(myvec, ':'), function(x)
paste(x[seq.int(n)], collapse=':'), character(1L))
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
Here are a few alternatives. We delete the kth colon and everything after it. The example in the question would correspond to k = 2. In the examples below we use k = 3.
1) read.table Read the data into a data.frame, pick out the columns desired and paste it back together again:
k <- 3 # keep first 3 fields only
do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":"))
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
2) sprintf/sub Construct the appropriate regular expression (in the case below of k equal to 3 it would be ^((.*?:){2}.*?):.* ) and use it with sub:
k <- 3
sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 1: For k=1 this can be further simplified to sub(":.*", "", myvec) and for k=n-1 it can be further simplified to sub(":[^:]*$", "", myvec)
Note 2: Here is a visualization of the regular regular expression for k equal to 3:
^((.*?:){2}.*?):.*
Debuggex Demo
3) iteratively delete last field We could remove the last field n-k times using the last regular expression in Note 1 above like this:
n <- 6 # number of fields
k < - 3 # number of fields to retain
out <- myvec
for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)
If we wanted to set n automatically we could optionally replace the hard coded line setting n above with this:
n <- count.fields(textConnection(myvec[1]), sep = ":")
4) locate position of kth colon Locate the positions of the colons using gregexpr and then extract the location of the kth subtracting one from it since we don't want the trailing colon. Use substr to extract that many characters from the respective strings.
k <- 3
substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 3: Suppose there are n fields. The question asked to delete everything after the kth delimiter so the solution should work for k = 1, 2, ..., n-1. It need not work for k = n since there are not n delimiters; however, if instead we define k as the number of fields to return then k = n makes sense and, in fact, (1) and (3) work in that case too. (2) and (4) do not work for this extension but we can easily get them to work by using paste0(myvec, ":") as the input instead of myvec.
Note 4: We compare performance:
library(rbenchmark)
benchmark(
.read.table = do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":")),
.sprintf.sub = sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec),
.for = { out <- myvec; for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)},
.gregexpr = substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1),
order = "elapsed", replications = 1000)[1:4]
giving:
test replications elapsed relative
2 .sprintf.sub 1000 0.11 1.000
4 .gregexpr 1000 0.14 1.273
3 .for 1000 0.15 1.364
1 .read.table 1000 2.16 19.636
The solution using sprintf and sub is the fastest although it does use a complex regular expression whereas the others use simpler or no regular expressions and might be preferred on grounds of simplicity.
ADDED Added additional solutions and additional notes.

Resources