Extract strings based on multiple patterns - r

I have thousands of DNA sequences that look like this:
ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
"CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")
I need to extract every sequence between CTACG and CAGTC. However, many of these sequences contain an error (deletion, insertion, substitution). Is there any way to account for mismatches based on Levenshtein distance?
ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
"CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")
qdapRegex::ex_between(ref, "CTACG", "CAGTC")
#> [[1]]
#> [1] "GTTATGTACGATTAAAGAAGATCGT"
#>
#> [[2]]
#> [1] "CGTTGATATTTTGCATGCTTACTCC"
#>
#> [[3]]
#> [1] NA
Created on 2021-12-18 by the reprex package (v2.0.1)
With approximate matching, I would also be able to extract the sequence in the last case.
UPDATE: can I create a dictionary with a certain Levenshtein distance and then match it to each sequence?

Use aregexec to find the approximate matches, build a regex pattern from them with sprintf, and finally remove the matched flanks with gsub. Everything is wrapped in a Vectorized function to avoid cluttering the script with lapply calls or loops.
In the regex, .* matches everything before (or, respectively, after) the flanking letters. Note that you will probably need to adapt max.distance= to your real data.
fu <- Vectorize(function(x) {
  p1 <- regmatches(x, aregexec('.*CTACG', x, max.distance=0.1))
  p2 <- regmatches(x, aregexec('CAGTC.*', x, max.distance=0.1))
  gsub(sprintf('%s|%s', p1, p2), '', x, perl=TRUE)
})
fu(ref)
# CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC
# "GTTATGTACGATTAAAGAAGATCGT" "CGTTGATATTTTGCATGCTTACTCC"
# CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC
# "CGTTGATATTTTGCATGCTTACTCC"
Data:
ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
"CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")
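To get a feel for what an edit-distance tolerance means before tuning max.distance, a minimal base-R sketch could compare exact and approximate matching of the left flank. The names left, damaged, flank_dist, exact_hit and fuzzy_hit are made up for illustration, and "CTCG" is simply the damaged flank that appears in the third sequence.

```r
ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC",
         "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC",
         "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC")

left <- "CTACG"     # left flanking pattern from the question
damaged <- "CTCG"   # assumed example: the same flank with one base deleted

# Levenshtein distance between the intact and the damaged flank
flank_dist <- adist(left, damaged)[1, 1]

# An exact search misses the third sequence ...
exact_hit <- grepl(left, ref, fixed = TRUE)

# ... while tolerating a single edit finds the flank in all three
fuzzy_hit <- agrepl(left, ref, max.distance = 1)
```

Here flank_dist is 1, exact_hit is TRUE TRUE FALSE, and fuzzy_hit is TRUE for every sequence. A larger tolerance will also start matching unrelated stretches, so it is worth checking on a sample of the real data.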

Related

How do you test if a matrix exists in a matrix list? (Wordle Project)

I am an infrequent R user, so my apologies if any of my terminology is incorrect. I am working on a project around the game Wordle to see whether a given Wordle submission in my family group chat is unique or has already been submitted before. The inspiration for this came from the Twitter account "Scorigami", which tracks every NFL game and tweets whether or not that score has occurred before in the history of the league.
To load the Wordle entries into R, I've decided to turn each submission into a Matrix where 0 = incorrect letter, 1 = right letter/wrong position, and 2 = right letter/correct position. In R this looks like this:
wordle_brendan <- rbind(c(1,0,0,0,0),c(2,2,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
wordle_jack <- rbind(c(2,0,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
I then combine them into a list that will be used to check against any future Wordle submissions to see if they have been previously submitted.
list <- list(wordle_brendan, wordle_jack)
I think I am on the right track, but I don't know how to create a new wordle matrix to test whether that submission has been given before. Say I recreated "wordle_brendan" with the same values but under a different name... How would I then get R to check if that matrix exists in my preexisting list of matrices? I've tried using the %in% function 1,000 different ways but can't get it to work.. Any help would be much appreciated! Thanks! (And if you can think of a better way to do this, please let me know!)
There are multiple ways to do this, but this is pretty simple. We need some samples to check:
new1 <- list[[2]] # The same as your second matrix
new2 <- new1
new2[3, 5] <- 0 # Change one position from 2 to 0.
To compare:
any(sapply(list, identical, y=new1))
# [1] TRUE
any(sapply(list, identical, y=new2))
# [1] FALSE
So new1 matches an existing matrix, but new2 does not. To see which matrix:
which(sapply(list, identical, y=new1))
# [1] 2
which(sapply(list, identical, y=new2))
# integer(0)
So new1 matches the second matrix in list, but new2 does not match any matrix.
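The check above could be wrapped in a small helper so that each new submission needs only one call. This is a sketch, not part of the original answer: seen_before, known, repeat_entry and fresh_entry are hypothetical names, and the list is called known here to avoid masking the base function list.

```r
# Hypothetical helper: TRUE if matrix `m` is identical to any entry of `known`
seen_before <- function(m, known) {
  any(vapply(known, identical, logical(1), y = m))
}

wordle_brendan <- rbind(c(1,0,0,0,0), c(2,2,0,0,0), c(2,2,0,0,0), c(2,2,2,2,2))
wordle_jack <- rbind(c(2,0,0,0,0), c(2,2,0,0,0), c(2,2,2,2,2))
known <- list(wordle_brendan, wordle_jack)

# Same values as wordle_jack, built independently under another name
repeat_entry <- rbind(c(2,0,0,0,0), c(2,2,0,0,0), c(2,2,2,2,2))
# A two-row grid that is not in the list
fresh_entry <- rbind(c(0,0,0,0,0), c(2,2,2,2,2))

was_repeat <- seen_before(repeat_entry, known)  # TRUE
was_fresh <- seen_before(fresh_entry, known)    # FALSE
```

Using vapply instead of sapply guarantees a logical result even when known is empty, in which case the helper returns FALSE.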
Here is a way with a matequal function. The base function identical compares objects, not values: if the matrices have the same values but different attributes, such as dimnames, identical returns FALSE.
This is often too strict. A function that compares values only will return TRUE in these cases.
I will use dcarlson's new1 to illustrate this point.
matequal <- function(x, y) {
ok <- is.matrix(x) && is.matrix(y) && all(dim(x) == dim(y))
ok && all(x == y)
}
wordle_brendan <- rbind(c(1,0,0,0,0),c(2,2,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
wordle_jack <- rbind(c(2,0,0,0,0),c(2,2,0,0,0),c(2,2,2,2,2))
list <- list(wordle_brendan, wordle_jack)
new1 <- list[[2]] # The same as your second matrix
wordle_john <- wordle_jack
dimnames(wordle_john) <- list(1:3, letters[1:5])
list2 <- list(wordle_brendan, wordle_jack, wordle_john)
sapply(list2, identical, y=new1)
#> [1] FALSE TRUE FALSE
sapply(list2, matequal, y=new1)
#> [1] FALSE TRUE TRUE
Created on 2022-09-27 with reprex v2.0.2
Edit
identical is not a function to compare two objects' values, it's a function to compare the objects themselves. In the following example identical returns FALSE though x and y have equal values, in the usual sense of equal.
matequal <- function(x, y) {
ok <- is.matrix(x) && is.matrix(y) && all(dim(x) == dim(y))
ok && all(x == y)
}
x <- matrix(1:5, ncol = 1)
y <- matrix(1 + 0:4, ncol = 1)
all(x == y)
#> [1] TRUE
identical(x, y)
#> [1] FALSE
matequal(x, y)
#> [1] TRUE
Created on 2022-09-28 with reprex v2.0.2
This is because the internal representations of x and y, borrowed from the C language, are different storage types. One of the objects stores elements of type "integer" and the other elements of type "double" (class "numeric"). Both matrices have the same class attribute ("matrix" "array"); the storage type of their elements is the difference.
In a comment it is asked:
Thank you and dcarlson for the response! Regarding your two sapply lines, can you explain what the use would be behind using matequal as opposed to identical? Is the only difference that matequal takes into account the column and row names?
So the answer to the question in the comment is no: attributes, in this case dimnames, are not the only reason why identical is often not ideal for comparing R objects.
typeof(x)
#> [1] "integer"
typeof(y)
#> [1] "double"
class(x[1])
#> [1] "integer"
class(y[2])
#> [1] "numeric"
class(x)
#> [1] "matrix" "array"
class(y)
#> [1] "matrix" "array"
Created on 2022-09-28 with reprex v2.0.2
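If the strictness of identical is wanted but the integer/double mismatch is not, one option is to coerce the storage mode before comparing. This is a minimal sketch using the same x and y as above; before and after are illustrative names.

```r
x <- matrix(1:5, ncol = 1)       # integer storage
y <- matrix(1 + 0:4, ncol = 1)   # double storage

# Same values, different storage type
before <- identical(x, y)

# Convert x to double in place; dim is preserved
storage.mode(x) <- "double"

# Now the two objects agree in values, type and attributes
after <- identical(x, y)
```

Here before is FALSE and after is TRUE. Differences in other attributes, such as dimnames, would still make identical return FALSE.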

Gtools mixedsort not working as expected on numeric string

I have a string
str1 <- "T-759..780, -D-27..758_E, -D-781..1338_C"
And I tried to use gtools::mixedsort to order these comma separated strings.
sapply(strsplit(str1 , ','), function(x) toString(gtools::mixedsort(x)))
I get
" -D-781..1338_C, -D-27..758_E, T-759..780"
I am expecting
"-D-27..758_E, T-759..780, -D-781..1338_C"
Not sure what I need to do to get the expected output.
I think you have a misconception about how mixedsort() works. It doesn't sort by the numbers in the string; it splits each string into separate string and number parts and sorts by all of them in order. I hope this small example illustrates how mixedsort() works. It starts by sorting the elements of the vector c("B_1", "A_2", "A_10") by their first string part c("B", "A", "A"), so A always comes before B, and then it sorts the two A elements by their numbers 2 and 10:
# example showing how mixedsort works
example <- c("B_1", "A_2", "A_10")
gtools::mixedsort(example)
#> [1] "A_2" "A_10" "B_1"
sort(example) # in comparison to normal sort, which doesn't recognize parts of the string as numbers
#> [1] "A_10" "A_2" "B_1"
Created on 2022-09-02 by the reprex package (v2.0.1)
But according to your example, you want to sort a vector by the first number that appears in each element, ignoring a possible - in front of the number. In that case, you can just use a regular expression to extract the first number in a string with gsub(".*?([0-9]+).*", "\\1", x) and use that to sort the vector. I wrote a small function for it:
# function to sort by first number, ignoring minus before the number
sort.first.number <- function(x) {
  # extract the first run of digits and compare it as a number, not as text
  v <- as.numeric(gsub(".*?([0-9]+).*", "\\1", x))
  x[order(v)]
}
str1 <- "T-759..780, -D-27..758_E, -D-781..1338_C"
sapply(strsplit(str1 , ','), function(x) toString(sort.first.number(x)))
#> [1] " -D-27..758_E, T-759..780, -D-781..1338_C"
Created on 2022-09-02 by the reprex package (v2.0.1)
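One detail worth checking when sorting by an extracted number is that gsub returns character strings, and character strings are ordered as text. The sketch below uses made-up values (v, as_text and as_number are illustrative names) to show how the two orderings differ once the numbers have different numbers of digits.

```r
v <- c("759", "27", "1000")   # made-up extracted numbers, still character

# Ordered as text: "1000" comes before "27", which comes before "759"
as_text <- order(v)

# Ordered by value: 27, then 759, then 1000
as_number <- order(as.numeric(v))
```

Here as_text is 3 2 1 while as_number is 2 1 3, so converting with as.numeric before calling order is the safer choice.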

Why don't lambda functions handle replacement functions in their intuitive form?

Why don't lambda functions handle replacement functions in their natural form? For example, consider the length<- function. Say I want to standardize the lengths of a list of objects, I may do something like:
a <- list(c("20M1", "A1", "ACC1"), c("20M2", "A2", "ACC2"), c("20M3"))
mx <- max(lengths(a))
lapply(a, `length<-`, mx)
#> [[1]]
#> [1] "20M1" "A1" "ACC1"
#>
#> [[2]]
#> [1] "20M2" "A2" "ACC2"
#>
#> [[3]]
#> [1] "20M3" NA NA
However if I wanted to specify the argument input locations explicitly using a lambda function I'd need to do (which also works):
lapply(a, function(x) `length<-`(x, mx))
But why doesn't the more intuitive notation for replacement functions (see below) work?
lapply(a, function(x) length(x) <- mx)
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 3
#>
#> [[3]]
#> [1] 3
This returns an output I did not expect. What is going on here? Lambda functions seem to handle the intuitive form of infix functions, so I was a little surprised they don't work with the intuitive form of replacement functions. Why is this / is there a way to specify replacement functions in lambda functions using their intuitive form?
(I imagine it has something to do with the special operator <-... but would be curious for a solution or more precise explanation).
Whenever you do an assignment in R, the value returned from that expression is the right-hand side value. This is true even for the "special" replacement versions of assignment functions. For example, if you do this
x <- 1:2; y <- (names(x) <- letters[1:2])
> y
[1] "a" "b"
You can see that y gets the values of the names, not the updated value of x.
In your case if you want to return the updated value itself, you need to do so explicitly
lapply(a, function(x) {length(x) <- mx; x})
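The same behaviour can be seen outside lapply. In this small sketch (value_of_assignment is an illustrative name), the value of the assignment expression is captured separately from the modified object.

```r
x <- "20M3"
mx <- 3

# The assignment expression itself evaluates to the right-hand side ...
value_of_assignment <- (length(x) <- mx)

# ... while x has been extended to length 3 and padded with NA
x
```

Here value_of_assignment is 3 and x is "20M3" NA NA. A function whose body is only length(x) <- mx returns that 3, which is why the modified x has to be returned explicitly.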

Creating a vector of graph objects in r

I am currently working with the 'igraph' package on R.
I have created two functions that each create a table of statistics for a graph object. They work well when used directly on a single graph object (here is an example of what they look like):
Sfn <- function(x) # Give a table of statistics for nodes
{
  Name <- deparse(substitute(x))
  Nodes <- V(x)$name
  Dtotal <- degree(x, mode = "all")
  # eigen_centrality() returns a list; the per-node scores are in $vector
  Eigenvector <- eigen_centrality(x)$vector
  statistics_table <- data.frame(Nodes,
                                 Dtotal,
                                 Eigenvector)
  colnames(statistics_table) <- c("Nodes", "Total Degrees",
                                  "Eigenvector centrality")
  write.table(statistics_table,
              file = paste0("Table_of_", Name, "_nodes.csv"),
              sep = ",",
              row.names = FALSE)
  print("Success.")
}
As I am using several graph objects, I would like not to have to write one line per command, such as :
Sfn(g)
Sfn(g2)
Sfn(g3)
# etc...
Sfn(n)
I would thus like to create a vector of lists in which I could collect all my graph objects. I created something like that :
G <- c(
list(CC1),list(CC2),list(CC3),
list(CC4),list(CC5),list(CC6),
list(CC7),list(CC8),list(CC9),
list(CC10),list(CC11),list(CC12))
Yet, this solution is not optimal. First, it is too long to write if I have, for example, 100 graph objects. Second, if I write my script with a for() loop, the name of the variable passed to my function will be the name of the loop variable of for(), thus ruining the variable Name in my function Sfn. In other words, the script for(i in G) {Sfn(i)} does not work, because the variable Name will be equal to "i":
# In my function Sfn, Name <- deparse(substitute(i)),
for(i in G) {print(deparse(substitute(i)))}
[1] "i"
[1] "i"
[1] "i"
[1] "i"
[1] "i"
[1] "i"
[1] "i"
[1] "i"
[1] "i"
[1] "i"
[1] "i"
[1] "i"
Also, the solution in "Change variable name in for loop using R" does not work because my graph objects have very different, arbitrarily assigned names (such as "CC1", "g2", "CT3", "CC1T3", etc.).
Do you have any idea on how I could possibly:
1 - achieve a better way of creating a vector of graph objects ?
2 - make the name of the parameter sent to my variable the same as the actual name of the variable ?
Using deparse(substitute()) makes it hard to do what you want. If you really want to do this without changing your Sfn function, you'll need to construct a call to it as a string and parse that. For example:
names <- c("CC1", "g2")
Sfn <- function(x) deparse(substitute(x)) # just return the name
result <- list()
for (n in names) {
result[[n]] <- eval(parse(text = paste("Sfn(", n, ")")))
}
result
#> $CC1
#> [1] "CC1"
#>
#> $g2
#> [1] "g2"
Created on 2021-09-12 by the reprex package (v2.0.0)
This could be much simpler if you passed the name you want as a string to Sfn, instead of trying to get it using deparse(substitute()), e.g.
names <- c("CC1", "g2")
Sfn <- function(x, name) name
result <- list()
for (n in names)
result[[n]] <- Sfn(n, n)
result
#> $CC1
#> [1] "CC1"
#>
#> $g2
#> [1] "g2"
Created on 2021-09-12 by the reprex package (v2.0.0)
Edited to add: Not only is the second solution cleaner, it's safer too. If you don't have complete control of the names vector, there's a huge security risk: someone could set the "name" to some executable code (see https://xkcd.com/327/) and it would be executed.
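Building on the second approach, the objects themselves can be collected into a named list, so that each object and its name travel together. The following is only a sketch under stated assumptions: the integer vectors stand in for igraph objects, and Sfn2 is a hypothetical variant of Sfn that takes the name as an argument and just returns the file name it would write.

```r
# Stand-ins for graph objects; real code would hold igraph objects here
CC1 <- 1:3
g2 <- 4:6

# Collect the objects by name into a named list
G <- mget(c("CC1", "g2"))

# Hypothetical variant of Sfn: the name is passed in, not deparsed
Sfn2 <- function(x, name) paste0("Table_of_", name, "_nodes.csv")

# Call Sfn2 once per object, pairing each object with its own name
files <- Map(Sfn2, G, names(G))
```

The result files is a list named CC1 and g2, holding "Table_of_CC1_nodes.csv" and "Table_of_g2_nodes.csv". Because mget looks objects up by name rather than evaluating text, it avoids the eval(parse()) risk described above.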

Conditional shift of substring position in R

> Df1
[1] "HM_004_T" "HM_004_T2" "HM_005_T" "HMFN_005_T2" "HM_007_T" "HM_007_T2" "HM_088_TR"
[8] "HM_088_T3"
This refers to the question "change position of word within a string in r", but my case is slightly different. I first want to delete _T if it appears on its own; for _T2, _T3 or _TR, I want to remove the suffix from the end and move it to the front of the text.
My ideal output will be:
Df1 <- c("HM_004", "T2_HM_004", "HM_005", "T2_HM_005", "HM_007", "T2_HM_007", "TR_HM_088", "T3_HM_088")
Input data
Df1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2", "HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")
You can do this with nested sub and backreference:
DF1 <- sub("(.*)_(T\\w)$", "\\2_\\1", sub("_T$", "", DF1))
Here you delete string-final _T in the first sub operation, the result of which you pass to the second sub operation. The second operation switches the order of (i) whatever comes before the last underscore _ and (ii) T followed by a digit or a letter (\\w), by referring to these two substrings with the backreferences \\1 and \\2.
Result:
DF1
[1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007" "TR_HM_088" "T3_HM_088"
Data:
DF1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2",
"HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")
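To see what each of the two sub calls contributes, the nested call can be unrolled into two steps. This sketch uses a three-element subset of the data; ids, step1 and step2 are illustrative names.

```r
ids <- c("HM_004_T", "HM_004_T2", "HM_088_TR")

# Step 1: drop a trailing "_T" that stands on its own
step1 <- sub("_T$", "", ids)

# Step 2: move a trailing "T" plus one word character to the front;
# strings without such a suffix are left unchanged
step2 <- sub("(.*)_(T\\w)$", "\\2_\\1", step1)
```

After step 1 the vector is "HM_004" "HM_004_T2" "HM_088_TR", and after step 2 it is "HM_004" "T2_HM_004" "TR_HM_088".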
You can achieve this relatively easily with the package stringr and the functions str_remove() and str_replace().
I am assuming that the patterns of interest always occur at the end of the text and that they are always preceded by _.
Please have a look at the updated code below. It now also treats the pattern _T*, where * can be a letter as well as a digit, as a valid target pattern.
library(stringr)
Df1 <- c("HM_004_T", "HM_004_T2", "HM_005_T", "HM_005_T2",
"HM_007_T", "HM_007_T2", "HM_088_TR", "HM_088_T3")
# Remove a trailing "_T" that stands on its own; suffixes such as "_T2" or "_TR"
# (T followed by a digit or a letter) are left for the next step
df2 <- str_remove(Df1, "_T$")
# Here I replace the patterns through the group reference
final <- str_replace( df2, "(^.*)_(T\\d+$|T\\w+$)", "\\2_\\1" )
final
#> [1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007"
#> [7] "TR_HM_088" "T3_HM_088"
# A more concise way is the following, where \\w is the workhorse.
final <- str_replace( df2, "(^.*)_(T\\w$)", "\\2_\\1" )
final
#> [1] "HM_004" "T2_HM_004" "HM_005" "T2_HM_005" "HM_007" "T2_HM_007"
#> [7] "TR_HM_088" "T3_HM_088"
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
