String match algorithms in R

I would like to find how many characters of one string match another, counting from the start. I have two strings, a <- "ABCDBADCABC" and b <- "ABC", and I want to know whether b occurs at the start of a. I am not interested in matches at any other position.
Another example: with b <- "ABCDBADCABC" and a <- "ABCDAB", only the first four characters of a match b from the start, so the output should be the match ABCD.
What options are available in R to do this?

I would keep it simple and split a and b into vectors of individual characters. Then string matching is straightforward.
## Make a and b
b = "ABCDBADCABC"
a = "ABCDAB"
Find the length of the shortest vector
min_char = min(nchar(a), nchar(b))
Then split a and b up
a_split = strsplit(substr(a,1, min_char), "")[[1]]
b_split = strsplit(substr(b,1, min_char), "")[[1]]
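Compare using standard operators
comp = a_split == b_split
Find the first occurrence of FALSE (or min_char + 1 if every character matches) and subtract one to get the number of matching characters
match(FALSE, comp, nomatch = min_char + 1) - 1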
With less code:
compare(a, b)
where
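compare = function(a, b) {
  min_char = min(nchar(a), nchar(b))
  a_split = strsplit(substr(a, 1, min_char), "")[[1]]
  b_split = strsplit(substr(b, 1, min_char), "")[[1]]
  comp = a_split == b_split
  # which.min(comp) would return 1 when every character matches,
  # so look up the first FALSE explicitly instead
  match(FALSE, comp, nomatch = min_char + 1) - 1
}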
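The question also asks for the matching characters themselves and for a plain "does b start a" check. A minimal sketch of both follows; common_prefix is a hypothetical helper name introduced here, not part of the answer above, while startsWith is a base R function.

```r
# Hypothetical helper: return the leading substring shared by a and b.
common_prefix <- function(a, b) {
  n <- min(nchar(a), nchar(b))
  if (n == 0) return("")
  a_chars <- strsplit(substr(a, 1, n), "")[[1]]
  b_chars <- strsplit(substr(b, 1, n), "")[[1]]
  # Position of the first mismatch, or n + 1 when all n characters agree
  first_diff <- match(FALSE, a_chars == b_chars, nomatch = n + 1)
  substr(a, 1, first_diff - 1)
}

common_prefix("ABCDAB", "ABCDBADCABC")  # "ABCD"
common_prefix("ABC", "ABCDBADCABC")     # "ABC"

# Base R check for "b occurs at the start of a"
startsWith("ABCDBADCABC", "ABC")        # TRUE
```

If only the yes/no answer is needed, startsWith alone is enough; the helper is useful when the length of the partial match matters.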

Related

If a row matches a criteria do paste in R

Let's imagine you have a dataframe with two columns ID and POSITION. I want to paste some text depending on the ID value.
I want to paste the ID value with GK0000 (when ID >= 10) or GK00000 (when ID < 10) along with .2:, the POSITION value, .. and the following POSITION value (POSITION+1)
For example if ID = 1 and POSITION = 10, the result would be GK000001.2:10..11 and if ID = 10 and POSITION = 10, the result would be GK000010.2:10..11
In Excel I can do this being A as ID and B as POSITION using =IF(A2<10,CONCATENATE("GK00000",A2,".2:",B2,"..",B2+1),CONCATENATE("GK0000",A2,".2:",B2,"..",B2+1)) but I want to add it to my R script line.
Here is a short, purely illustrative example of my input data
ID <- c(1,5,9,10,12)
POSITION <- c(10,50,90,100,120)
df <- cbind(ID,POSITION)
and the result I'm expecting is
CONCAT <- c("GK000001.2:10..11","GK000005.2:50..51","GK000009.2:90..91",
"GK000010.2:100..101","GK000012.2:120..121")
dfResult <- cbind(ID,POSITION,CONCAT)
I believe the question asks for a string format given two arguments, A and B and a number of digits.
concat <- function(A, B, digits = 6){
  fmt <- paste0("%0", digits, "d")
  fmt <- paste0("GK", fmt, ".2:%d..%d")
  sprintf(fmt, A, B, B + 1)
}
concat(df[, 'ID'], df[, 'POSITION'], 6)
# [1] "GK000001.2:10..11" "GK000005.2:50..51" "GK000009.2:90..91"
# [4] "GK000010.2:100..101" "GK000012.2:120..121"
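Since the question wants the result as a new column next to ID and POSITION, the same sprintf idea can be sketched directly on a data frame. This assumes the input is held in a data.frame rather than the cbind matrix from the question; the as.integer calls are a precaution for the %d format.

```r
# Assumption: the data are stored in a data.frame instead of a matrix.
df <- data.frame(ID = c(1, 5, 9, 10, 12),
                 POSITION = c(10, 50, 90, 100, 120))

# %06d zero-pads the ID to six digits, which covers both Excel branches
df$CONCAT <- sprintf("GK%06d.2:%d..%d",
                     as.integer(df$ID),
                     as.integer(df$POSITION),
                     as.integer(df$POSITION + 1))
df$CONCAT
```

The zero-padding makes the ID < 10 versus ID >= 10 distinction unnecessary, because both cases are just a six-digit padded number.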

Generation of Unique ID

Can someone help me generate a unique 6-character URN in R? I don't know how to do this.
Below are the rules for the URN:
1. It needs to be alphanumeric, start with a letter and may end with a letter (e.g. AA34YB)
2. Use only upper-case letters
3. Do not use the letters O or I (I is the letter after H and before J)
4. Use only the digits 1-9; exclude 0
5. The first two characters should be letters, followed by a two-digit number, followed by two letters, e.g. "AA22DD", "EE34TY", "ER67YU"
6. All records must contain a number as shown in rule 5
7. It must be exactly 6 characters long
I would like to generate up to 4 million unique records. Any R code suggestion is very welcome. I am not an expert in R; I am actually new to it.
Thanks very much
Here is a function that will generate ordered unique IDs:
generateIDs <- function(n, existing=NULL){
  # Initialise a counter to produce IDs
  counter <- 0
  # Create arrays of the allowed letters and digits
  letters <- LETTERS[LETTERS %in% c("O", "I") == FALSE]
  digits <- 1:9
  # Initialise an array to store the IDs created
  ids <- c()
  # Iterate through the letters (first and second characters)
  for(first in letters){
    for(second in letters){
      # Iterate through the digits (third and fourth characters)
      for(third in digits){
        for(fourth in digits){
          # Iterate through the letters (fifth and sixth characters)
          for(fifth in letters){
            for(sixth in letters){
              # Create the unique code
              code <- paste0(first, second, third, fourth, fifth, sixth)
              # Check if already exists
              if(code %in% existing == FALSE){
                # Increment the counter
                counter <- counter + 1
                # Store the ID
                ids[counter] <- code
                existing[length(existing) + 1] <- code
                # Check if created enough IDs
                if(counter == n){
                  return(ids)
                }
                # Note progress
                if(counter %% 10000 == 0){
                  cat("\rCreated", counter, "ids!")
                }
              }
            }
          }
        }
      }
    }
  }
}
That is a horrific number of nested for loops but it avoids the inefficient random generation of IDs. You can test it using the following code:
generateIDs(10)
"AA11AA" "AA11AB" "AA11AC" "AA11AD" "AA11AE" "AA11AF" "AA11AG" "AA11AH" "AA11AJ" "AA11AK"
Note that ideally you should run this function once. Theoretically, this function could create up to 26873856 unique IDs but it doesn't scale well!
See @GKi's answer for a much better solution! :-)
You can use expand.grid to generate unique IDs.
n <- 10
t1 <- LETTERS[!LETTERS %in% c("O", "I")]
t2 <- 1:9
#t1 <- rawToChar(as.raw(c(65:72,74:78,80:90)), multiple = TRUE) # Alternative
#t2 <- rawToChar(as.raw(49:57), multiple = TRUE)
apply(expand.grid(t1, t1, t2, t2, t1, t1)[seq(n),], 1, paste, collapse = "")
# 1 2 3 4 5 6 7 8
#"AA11AA" "BA11AA" "CA11AA" "DA11AA" "EA11AA" "FA11AA" "GA11AA" "HA11AA"
# 9 10
#"JA11AA" "KA11AA"
set.seed(1) #Sample randomly
apply(expand.grid(t1, t1, t2, t2, t1, t1)[sample(length(t1)^4*length(t2)^2, n),]
, 1, paste, collapse = "")
#10938497 17633234 12201267 18120554 21612295 21509711 13901861 6841049
#"SL15UK" "BG59TR" "CU65XL" "BH54ES" "GJ13HV" "YF31FV" "EE79KN" "SV66CG"
#23945701 10770210
#"NK23KX" "TG68QK"
If this needs too much memory, see @Joseph-Crispell's answer.
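A possible middle ground between the two answers, sketched here under stated assumptions: treat each ID as a number in a mixed base (24, 24, 9, 9, 24, 24) and decode sampled indices directly, so the full grid of 26873856 rows is never built. urn_from_index is a hypothetical name introduced for illustration and is not taken from either answer.

```r
# Hypothetical helper: decode 1-based indices into URNs (vectorised).
urn_from_index <- function(i) {
  lets <- LETTERS[!LETTERS %in% c("O", "I")]  # 24 allowed letters
  digs <- as.character(1:9)                   # 9 allowed digits
  i <- i - 1                                  # switch to zero-based
  # Peel off the positions from right to left
  p6 <- i %% 24; i <- i %/% 24
  p5 <- i %% 24; i <- i %/% 24
  p4 <- i %% 9;  i <- i %/% 9
  p3 <- i %% 9;  i <- i %/% 9
  p2 <- i %% 24; i <- i %/% 24
  p1 <- i %% 24
  paste0(lets[p1 + 1], lets[p2 + 1], digs[p3 + 1],
         digs[p4 + 1], lets[p5 + 1], lets[p6 + 1])
}

total <- 24^4 * 9^2        # 26873856 possible IDs
n <- 1e5                   # raise towards 4e6 as required
set.seed(1)
# Sampling indices without replacement guarantees distinct IDs
ids <- urn_from_index(sample.int(total, n))
head(ids)
```

Because every index maps to exactly one ID, uniqueness follows from sampling the indices without replacement, and no %in% lookup against existing IDs is needed.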

Split string according to ambiguous delimiter in R

I have pairs of strings stored in a data frame:
df <- data.frame(str = c("L_V1_ROI-L_MST_ROI",
"L_V6_ROI-L_V2_ROI",
"L_V3_ROI-L_V4_ROI",
"L_V8_ROI-L_4_ROI",
"L_p9-46v_ROI-L_a9-46v_ROI"))
Each pair is separated by a - symbol, with the exception of the last pair, which contains three - symbols and should be split into the substrings L_p9-46v_ROI and L_a9-46v_ROI.
The task is to split these pairs into substrings at the separator. To do this I simply use:
library(tidyr)
df %>% separate(col = str, into = c("str1", "str2"), sep = "-")
which gives the following result:
      str1      str2
1 L_V1_ROI L_MST_ROI
2 L_V6_ROI  L_V2_ROI
3 L_V3_ROI  L_V4_ROI
4 L_V8_ROI   L_4_ROI
5     L_p9   46v_ROI
Warning message:
Too many values at 1 locations: 5
As expected, the problem lies in the 5th pair, which contains more than one - symbol.
Question: what regex matches only the proper separator?
My partial solution is pasted below, but I hope there is a more elegant one.
library(stringr)
my_split <- function(string, pattern) {
  ## Match the start and end position of the "_ROI-" separator
  position <- str_locate(string = string, pattern = pattern)
  end <- position[2]
  ## Extract the substrings on either side of the separating "-"
  substring1 <- substr(string, 1, end - 1)
  substring2 <- substr(string, end + 1, nchar(string))
  return(list(substring1, substring2))
}
## Toy example
my_str <- "L_p9-46v_ROI-L_a9-46v_ROI"
my_split(string = my_str, pattern = "_ROI-")
[[1]]
[1] "L_p9-46v_ROI"
[[2]]
[1] "L_a9-46v_ROI"
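One possible regex for the separator, offered as a sketch rather than a definitive answer: split only on a - that is immediately preceded by _ROI, using a lookbehind. This assumes that the first member of every pair ends in _ROI, as in the sample data.

```r
# Assumption: the first member of each pair always ends in "_ROI".
x <- c("L_V1_ROI-L_MST_ROI",
       "L_V8_ROI-L_4_ROI",
       "L_p9-46v_ROI-L_a9-46v_ROI")

# (?<=_ROI)- matches a "-" only when "_ROI" comes right before it;
# lookbehind needs the PCRE engine, hence perl = TRUE
parts <- strsplit(x, "(?<=_ROI)-", perl = TRUE)

# Bind the pieces into a two-column character matrix
res <- do.call(rbind, parts)
colnames(res) <- c("str1", "str2")
res
```

The hyphens inside L_p9-46v and L_a9-46v are left alone because they are not preceded by _ROI.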

How to delete everything after nth delimiter in R?

I have this vector myvec. I want to remove everything after second ':' and get the result. How do I remove the string after nth ':'?
myvec<- c("chr2:213403244:213403244:G:T:snp","chr7:55240586:55240586:T:G:snp" ,"chr7:55241607:55241607:C:G:snp")
result
chr2:213403244
chr7:55240586
chr7:55241607
We can use sub. We match one or more characters that are not a : from the start of the string (^[^:]+), followed by a :, followed by one or more characters that are not a : ([^:]+), and place all of this in a capture group, i.e. within the parentheses. The rest of the string is matched by .* and the whole match is replaced by the capture group (\\1).
sub('^([^:]+:[^:]+).*', '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
The above works for the example posted. For general cases to remove after the nth delimiter,
n <- 2
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
Checking with a different 'n'
n <- 3
and rebuilding the pattern with the same steps
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
#[3] "chr7:55241607:55241607"
Or another option would be to split by : and then paste the n number of components together.
n <- 2
vapply(strsplit(myvec, ':'), function(x)
paste(x[seq.int(n)], collapse=':'), character(1L))
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
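The regex variant above can be wrapped into a small reusable function. This is only a sketch: keep_first_n is a hypothetical name, and it assumes the delimiter is a single character with no special meaning inside a regex character class.

```r
# Hypothetical wrapper around the sub() approach shown above.
# Assumption: sep is one ordinary character such as ":" or ";".
keep_first_n <- function(x, n, sep = ":") {
  stopifnot(n >= 1)
  # One field, then (n - 1) repetitions of "delimiter + field"
  pat <- sprintf("^([^%s]+(?:%s[^%s]+){%d}).*", sep, sep, sep, n - 1)
  sub(pat, "\\1", x, perl = TRUE)
}

myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
keep_first_n(myvec, 2)
```

Strings with fewer than n fields do not match the pattern and are returned unchanged, which is the usual behaviour of sub.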
Here are a few alternatives. We delete the kth colon and everything after it. The example in the question would correspond to k = 2. In the examples below we use k = 3.
1) read.table Read the data into a data.frame, pick out the columns desired and paste it back together again:
k <- 3 # keep first 3 fields only
do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":"))
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
2) sprintf/sub Construct the appropriate regular expression (in the case below of k equal to 3 it would be ^((.*?:){2}.*?):.* ) and use it with sub:
k <- 3
sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 1: For k=1 this can be further simplified to sub(":.*", "", myvec) and for k=n-1 it can be further simplified to sub(":[^:]*$", "", myvec)
Note 2: For k equal to 3 the regular expression is:
^((.*?:){2}.*?):.*
3) iteratively delete last field We could remove the last field n-k times using the last regular expression in Note 1 above like this:
n <- 6 # number of fields
k <- 3 # number of fields to retain
out <- myvec
for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)
If we wanted to set n automatically we could optionally replace the hard coded line setting n above with this:
n <- count.fields(textConnection(myvec[1]), sep = ":")
4) locate position of kth colon Locate the positions of the colons using gregexpr and then extract the location of the kth subtracting one from it since we don't want the trailing colon. Use substr to extract that many characters from the respective strings.
k <- 3
substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 3: Suppose there are n fields. The question asked to delete everything after the kth delimiter so the solution should work for k = 1, 2, ..., n-1. It need not work for k = n since there are not n delimiters; however, if instead we define k as the number of fields to return then k = n makes sense and, in fact, (1) and (3) work in that case too. (2) and (4) do not work for this extension but we can easily get them to work by using paste0(myvec, ":") as the input instead of myvec.
Note 4: We compare performance:
library(rbenchmark)
benchmark(
.read.table = do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":")),
.sprintf.sub = sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec),
.for = { out <- myvec; for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)},
.gregexpr = substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1),
order = "elapsed", replications = 1000)[1:4]
giving:
          test replications elapsed relative
2 .sprintf.sub         1000    0.11    1.000
4    .gregexpr         1000    0.14    1.273
3         .for         1000    0.15    1.364
1  .read.table         1000    2.16   19.636
The solution using sprintf and sub is the fastest although it does use a complex regular expression whereas the others use simpler or no regular expressions and might be preferred on grounds of simplicity.
ADDED Added additional solutions and additional notes.

Finding repeated substrings with R

I have the following code for finding out a pattern (consecutively repeated substring) in a string, say 0110110110000. The output patterns are 011 and 110, since they are both repeated within the string. What changes can be done to the following code?
I'd like to identify substrings that start from any position in a given string, and which repeat for at least a threshold number of times. In the above mentioned string, the threshold is three (th = 3). The repeated string should be the maximal repeated string. In the above string, 110 and 011 both satisfy these conditions.
Here's my attempt at doing this:
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  for(k in len:1) {
    pat <- paste0("(.{", k, "})", reps("\\1", th-1))
    r <- regexpr(pat, string, perl = TRUE)
    if (attr(r, "capture.length") > 0) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}
You can do this with regex:
s <- '0110110110000'
thr <- 3
m <- gregexpr(sprintf('(?=(.+)(?:\\1){%s,})', thr-1), s, perl=TRUE)
unique(mapply(function(x, y) substr(s, x, x+y-1),
attr(m[[1]], 'capture.start'),
attr(m[[1]], 'capture.length')))
# [1] "011" "110" "0"
The pattern in the gregexpr uses a positive lookahead to prevent characters from being consumed by the match (and so allowing overlapping matches, such as with the 011 and 110). We use a repeated (at least thr - 1 times) backreference to the captured group to look for repeated substrings.
Then we can extract the matched substrings by taking start positions and lengths from the attributes of the result of gregexpr, i.e. the object m.
You didn't specify a minimum string length, so this returns 0 as one of the repeated substrings. If you have a minimum and/or maximum substring length in mind, you can modify the first subexpression of the regex. For example, the following would match only substrings with at least 2 characters.
sprintf('(?=(.{2,})(?:\\1){%s,})', thr-1)
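Putting the pieces of this answer together, a minimal sketch of a reusable function might look as follows. The name repeated_substrings and the min_len argument are introduced here for illustration and are not part of the answer itself.

```r
# Hypothetical wrapper: substrings repeated consecutively at least thr times.
repeated_substrings <- function(s, thr = 3, min_len = 1) {
  # Lookahead so that overlapping candidates (e.g. 011 and 110) are all found
  pat <- sprintf("(?=(.{%d,})(?:\\1){%d,})", min_len, thr - 1)
  m <- gregexpr(pat, s, perl = TRUE)[[1]]
  if (m[1] == -1) return(character(0))        # no repeated substring
  starts <- as.vector(attr(m, "capture.start"))
  lens <- as.vector(attr(m, "capture.length"))
  unique(substring(s, starts, starts + lens - 1))
}

repeated_substrings("0110110110000")               # "011" "110" "0"
repeated_substrings("0110110110000", min_len = 2)  # "011" "110"
```

Setting min_len = 2 drops the single-character result 0, which matches the two patterns the question expects.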
