How to calculate longest common substring anywhere in two strings - r

I am trying to calculate the longest exact common substring without gaps between a string and a vector of strings in R. How do I modify stringdist to return any common string anywhere in the two compared strings and return the distance?
Reproduce data:
string1 <- "whereiam"
vec1 <- c("firstiam","twoiswhereiaminthisvec","thisisthree","fouriamhere","fivewherehere")
Attempted stringdist call (doesn't work for my purposes):
library(stringdist)
stringdistvec <- stringdist(string1,vec1,method="lcs")
[1] 8 14 13 11 11 #not calculating the lcs type I want
Desired result instead with explanation of matches:
#desired to work to get this result:
desired_stringdistvec <- c(3,8,1,3,5)
[1] 3 8 1 3 5
#match 1: iam (3 common substr)
#match 2: whereiam (8 common substr)
#match 3: i (one letter only)
#match 4: iam (3 common substr)
#match 5: where (5 common substr)

One approach might be to look at the transformation sequence produced by adist() and count the characters in the longest contiguous match:
trafos <- attr(adist(string1, vec1, counts = TRUE), "trafos")
sapply(gregexpr("M+", trafos), function(x) max(0, attr(x, "match.length")))
[1] 3 8 1 3 5
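One caveat, offered as a hedged observation: the trafos string describes a single optimal alignment, so its longest run of M can be shorter than the strict longest common substring. As a cross-check, here is a minimal dynamic-programming sketch. The helper name lcs_len is hypothetical (it is not part of stringdist or base R), and the sketch assumes plain single-byte characters.

```r
# Hypothetical helper: length of the longest contiguous common substring.
lcs_len <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  best <- 0L
  prev <- integer(length(y) + 1L)          # DP row for the previous character of a
  for (i in seq_along(x)) {
    cur <- integer(length(y) + 1L)
    for (j in seq_along(y)) {
      if (x[i] == y[j]) {
        cur[j + 1L] <- prev[j] + 1L        # extend the match ending at (i-1, j-1)
        if (cur[j + 1L] > best) best <- cur[j + 1L]
      }
    }
    prev <- cur
  }
  best
}

string1 <- "whereiam"
vec1 <- c("firstiam", "twoiswhereiaminthisvec", "thisisthree",
          "fouriamhere", "fivewherehere")
res <- unname(sapply(vec1, lcs_len, a = string1))
res
# [1] 3 8 2 4 5
```

Under this strict definition the third and fourth values differ from the adist() result, because "re" is shared with "thisisthree" and "here" is shared with "fouriamhere".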

Related

count of multiple partially matching DNA sequences

I have a dataset of partially matching DNA sequences and want to assign different numerical indexes to the partially matching sequences.
i.e.:
sequences <- c("AAAAAAAAAAAAAAA",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAACCC",
"AAAAAAAAAAAAAAAAAAAAAAAAACC",
"CATTTTCAG",
"CATTTTCAGTCAAAATTT",
"CATG",
"CATGG",
"CATGGGTT",
"GATC")
The first one recurs in the 2nd, 3rd and 4th and they should all get a value 1, the 5th recurs in the 6th and they should all get a 2, the 7th recurs in the 8th and 9th and should all get a 3, the 10th does not recur and should get 4 as index. This is just an example of course, sometimes the dataset could contain >3000 rows.
I tried several solutions, including grepl and str_count. My latest attempt was to create a dictionary to store all the sequences and their indices, create a list of prefixes, and then iterate over the prefixes to assign the indices. However, the result is not what I expect, as all the sequences get an index of 1.
# Create a dictionary to store the sequences and their indices
indices <- as.list(1:length(sequences))
names(indices) <- sequences
# Create a function that returns the first 7 characters of a sequence
get_prefix <- function(seq) {
return(substring(seq, 1, 7))
}
# Create a list of unique prefixes
prefixes <- unique(sapply(sequences, get_prefix))
# Iterate over the prefixes and assign the same index to all sequences that start with the same prefix
for (i in 1:length(prefixes)) {
prefix <- prefixes[i]
seqs <- sequences[sapply(sequences, get_prefix) == prefix]
indices[seqs] <- which.min(indices[seqs])
}
# Print the final indices
print(indices)
Any help is welcome! thanks!
This problem relates to grouping using relational data. You can use grep + igraph to do so:
library(igraph)
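# for each sequence, list every sequence that contains it (an edge list)
sapply(sequences, grep, sequences, value = TRUE) |>
  stack() |>
  graph_from_data_frame() |>    # graph.data.frame() in older igraph versions
  components() |>               # clusters() in older igraph versions
  getElement("membership") |>
  stack()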
   values ind
1       1 AAAAAAAAAAAAAAA
2       1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
3       1 AAAAAAAAAAAAAAAAAAAAAAAAAAAACCC
4       1 AAAAAAAAAAAAAAAAAAAAAAAAACC
5       2 CATTTTCAG
6       2 CATTTTCAGTCAAAATTT
7       3 CATG
8       3 CATGG
9       3 CATGGGTT
10      4 GATC
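If igraph is not available, a simplified base R sketch can give the same grouping for this example. Note that this is a different technique from the connected-components approach above: it labels each sequence by the first sequence in the vector that occurs inside it, which assumes that groups do not overlap. The names first_hit and group are hypothetical.

```r
sequences <- c("AAAAAAAAAAAAAAA",
               "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
               "AAAAAAAAAAAAAAAAAAAAAAAAAAAACCC",
               "AAAAAAAAAAAAAAAAAAAAAAAAACC",
               "CATTTTCAG", "CATTTTCAGTCAAAATTT",
               "CATG", "CATGG", "CATGGGTT", "GATC")

# For each sequence s, find the index of the first sequence contained in s.
first_hit <- vapply(sequences, function(s) {
  which(vapply(sequences, grepl, logical(1), x = s, fixed = TRUE))[1]
}, integer(1), USE.NAMES = FALSE)

# Renumber the hits as consecutive group indices.
group <- match(first_hit, unique(first_hit))
group
# [1] 1 1 1 1 2 2 3 3 3 4
```

Because every sequence is compared with every other one, this is quadratic in the number of rows, which should still be workable for a few thousand sequences.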

How can I check for cluster patterns in a sequence of numbers and obtain the next value?

Given a set of sequences
seq1 <- c(3,3,3,7,7,7,4,4)
seq2 <- c(17,17,77,77,3)
seq3 <- c(5,5,23)
How can we create a function to check these sequences for cluster patterns and predict the next value of each sequence, which in this case would be 4, 3, and 23 respectively?
Edit: The sequence should first be checked for cluster patterns. If it does not contain this class of pattern, the sequence should be ignored or passed on to another function.
Edit 2: A pattern is defined by more than 1 of the same consecutive number, always grouped consistently, e.g. 1,1,1,2,2,2,3,3,3 is a pattern but 1,1,2,2,2,3,3 is not.
Here's a way with rle in base R which checks if all run-lengths, except last, are equal and if TRUE then repeats the last value such that it has same pattern as others -
rl <- rle(seq1)$lengths
# check if all run-lengths, except last, are equal
if (all(head(rl, -1) == rl[1])) {
  c(seq1, rep(seq1[length(seq1)], diff(range(rl))))
} else {
  # do something else
}
# [1] 3 3 3 7 7 7 4 4 4
The same approach applies for seq2 and seq3.
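To get only the next value, as the question asks, the same rle() check can be wrapped in a function. This is a minimal sketch under stated assumptions: the name predict_next is hypothetical, and returning NA when there is no pattern (or when the last run is already complete, so the next value is unknown) is my own choice, not something the question specifies.

```r
predict_next <- function(x) {
  r <- rle(x)
  n <- length(r$lengths)
  full <- r$lengths[1]                     # run length that defines the pattern
  ok <- n >= 2 && full >= 2 &&             # "more than 1 of the same number"
    all(head(r$lengths, -1) == full) &&    # all complete runs have equal length
    r$lengths[n] < full                    # last run is still incomplete
  if (!ok) return(NA)
  r$values[n]                              # the incomplete run continues
}

seq1 <- c(3, 3, 3, 7, 7, 7, 4, 4)
seq2 <- c(17, 17, 77, 77, 3)
seq3 <- c(5, 5, 23)
sapply(list(seq1, seq2, seq3), predict_next)
# [1]  4  3 23
predict_next(c(1, 1, 2, 2, 2, 3, 3))
# [1] NA
```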

Shuffling string (non-randomly) for maximal difference

After trying for an embarrassingly long time and extensive searches online, I come to you with a problem.
I am looking for a method to (non-randomly) shuffle a string to get a string which has the maximal ‘distance’ from the original one, while still containing the same set of characters.
My particular case is for short nucleotide sequences (4-8 nt long), as represented by these example sequences:
seq_1<-"ACTG"
seq_2<-"ATGTT"
seq_3<-"ACGTGCT"
For each sequence, I would like to get a scramble sequence which contains the same nucleobase count, but in a different order.
A favourable scramble sequence for seq_3 could be something like:
seq_3.scramble<-"CATGTGC"
where none of the sequence positions 1-7 has the same nucleobase as the original, but the overall nucleobase count is the same (A = 1, C = 2, G = 2, T = 2). Naturally, it would not always be possible to get a completely different string, but I would just flag those cases in the output.
I am not particularly interested in randomising the sequence and would prefer a method which makes these scramble sequences in a consistent manner.
Do you have any ideas?
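This is in Python, since I don't know R, but the basic solution is as follows:
import itertools

def calcDistance(originalString, newString):
    # number of positions at which the two sequences differ (Hamming distance)
    d = 0
    i = 0
    while i < len(originalString):
        if originalString[i] != newString[i]: d = d + 1
        i = i + 1
    return d

s = "ACTG"
d_max = 0
s_final = ""
# try every permutation and keep the first one with the largest distance
for combo in itertools.permutations(s):
    if calcDistance(s, combo) > d_max:
        d_max = calcDistance(s, combo)
        s_final = "".join(combo)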
Give this a try. Rather than return a single string that fits your criteria, I return a data frame of all strings sorted by their string-distance score. The score is calculated using stringdist(..., ..., method="hamming"), which gives the number of substitutions required to convert string A to B.
library(combinat)
library(dplyr)
library(stringdist)
seq_3 <- "ACGTGCT"
myfun <- function(S) {
  # split into single characters
  vec <- unlist(strsplit(S, "", fixed = TRUE))
  # all permutations, pasted back into strings
  P <- sapply(permn(vec), function(i) paste(i, collapse = ""))
  # Hamming distance of each permutation to the original
  Dist <- c(stringdist(S, P, method = "hamming"))
  df <- data.frame(seq = P, HD = Dist) %>%
    distinct(seq, HD) %>%
    arrange(desc(HD))
  return(df)
}
head(myfun(seq_3), 10)
# seq HD
# 1 TACGTGC 7
# 2 TACGCTG 7
# 3 CACGTTG 7
# 4 GACGTTC 7
# 5 CGACTTG 7
# 6 CGTACTG 7
# 7 TGCACTG 7
# 8 GTCACTG 7
# 9 GACCTTG 7
# 10 GATCCTG 7
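Since the question asks for a consistent (non-random) result, here is a hedged alternative that avoids enumerating all permutations. It is a different technique from the answers above: sort the positions by character, then give each position the character that sits k places further along the sorted order, where k is the count of the most frequent character. When no character makes up more than half of the string, every position changes. The name scramble is hypothetical.

```r
scramble <- function(s) {
  ch  <- strsplit(s, "")[[1]]
  ord <- order(ch)               # positions grouped by character
  k   <- max(table(ch))          # size of the largest group
  n   <- length(ch)
  out <- character(n)
  # rotate the sorted characters by k and write them back to the grouped positions
  out[ord] <- ch[ord][(seq_len(n) + k - 1) %% n + 1]
  paste(out, collapse = "")
}

scramble("ACTG")
# [1] "CGAT"
scramble("ACGTGCT")
# [1] "CGTATGC"
scramble("ATGTT")      # T is more than half of the string, so one position stays
# [1] "TATGT"
```

Cases like the last one can be flagged by comparing the result with the input position by position, for example with stringdist(..., method = "hamming").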

Reverse only alphabetical patterns in a string in R

I'm trying to learn R and a sample problem is asking to only reverse part of a string that is in alphabetical order:
String: "abctextdefgtext"
StringNew: "cbatextgfedtext"
Is there a way to identify alphabetical patterns to do this?
Here is one approach with base R, based on the pattern shown in the example. We split the string into individual characters ('v1'), use match to find each character's position in the alphabet (letters), take the difference of adjacent positions, and start a new group wherever that difference is not equal to 1 ('i1'). Using this grouping variable, we reverse (rev) the characters within each group. Finally, we paste the characters together to get the expected output.
v1 <- strsplit(str1, "")[[1]]
i1 <- cumsum(c(TRUE, diff(match(v1, letters)) != 1L))
paste(ave(v1, i1, FUN = rev), collapse="")
#[1] "cbatextgfedtext"
Or as @alexislaz mentioned in the comments
v1 = as.integer(charToRaw(str1))
rawToChar(as.raw(ave(v1, cumsum(c(TRUE, diff(v1) != 1L)), FUN = rev)))
#[1] "cbatextgfedtext"
EDIT:
1) A mistake was corrected based on @alexislaz's comments
2) Updated with another method suggested by @alexislaz in the comments
data
str1 <- "abctextdefgtext"
You could do this in base R
vec <- match(unlist(strsplit(s, "")), letters)
x <- c(0, which(diff(vec) != 1), length(vec))
newvec <- unlist(sapply(seq(length(x) - 1), function(i) rev(vec[(x[i]+1):x[i+1]])))
paste0(letters[newvec], collapse = "")
#[1] "cbatextgfedtext"
Where s <- "abctextdefgtext"
First you find the positions of each letter in the sequence of letters ([1] 1 2 3 20 5 24 20 4 5 6 7 20 5 24 20)
Having the positions in hand, you look for consecutive numbers and, when found, reverse that sequence. ([1] 3 2 1 20 5 24 20 7 6 5 4 20 5 24 20)
Finally, you get the letters back in the last line.
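The steps above can be collected into one reusable function. This is a minimal sketch, assuming the input contains only lower-case letters; the name reverse_runs is hypothetical, and it uses split() plus lapply() instead of ave() or explicit index ranges.

```r
reverse_runs <- function(s) {
  ch  <- strsplit(s, "")[[1]]
  # start a new group whenever a letter is not the alphabetical successor
  # of the letter before it
  grp <- cumsum(c(TRUE, diff(match(ch, letters)) != 1L))
  # reverse each group and glue everything back together
  paste(unlist(lapply(split(ch, grp), rev), use.names = FALSE), collapse = "")
}

reverse_runs("abctextdefgtext")
# [1] "cbatextgfedtext"
```

For the example string the grouping vector is 1 1 1 2 3 4 5 6 6 6 6 7 8 9 10, so only "abc" and "defg" contain more than one letter and are visibly reversed.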

Find numbers whose second digit is 9 using R

I have a list of numbers and I want to find the numbers whose second digit is 9. grep() finds any number that contains a 9, but I am looking for code that finds only the numbers whose second digit is 9. The code below returns:
p <- c(34405, 09098424, 6908347, 8900333, 453434)
grep(9, p)
[1] 1 2 3 4
I am looking for something that return:
[1] 2 3 4
Thanks
Majran
We can use substr to extract the 2nd digit and check whether (==) that is equal to 9, get the numeric index by wrapping with which.
which(substr(p,2,2)=="9")
#[1] 2 3 4
Or another option is grep where we match the pattern ^.9 (where ^ suggests the start of the string, . can be any character followed by 9 i.e. the second character)
grep("^.9", p)
#[1] 2 3 4
NOTE: Here I am assuming that the OP's vector is character class because numeric elements don't have 0 padded on the left.
data
p <- c("34405", "09098424", "6908347", "8900333", "453434")
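To illustrate the note about character versus numeric input, here is a short sketch. The variable names p_num and p_chr are hypothetical. A numeric literal written with a leading zero loses that zero, so the second character shifts.

```r
# Numeric input: the leading zero of 09098424 is dropped.
p_num <- c(34405, 09098424, 6908347, 8900333, 453434)
as.character(p_num)[2]
# [1] "9098424"

# Character input keeps every digit in place.
p_chr <- c("34405", "09098424", "6908347", "8900333", "453434")
hits <- which(substr(p_chr, 2, 2) == "9")
hits
# [1] 2 3 4
```

With the numeric vector, the second element would no longer match, because its second character is "0" after conversion.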
