Locate all overlapping patterns in a string - R

I found the function str_locate_all:
library(stringr)
string = paste0(c(5,5,5,6,6,5,5,6), collapse = "")
pattern = paste0(c(5,5), collapse = "")
str_locate_all(string, pattern)
[[1]]
start end
[1,] 1 2
[2,] 6 7
Here I look for the pattern '55' (consecutive characters only) in the string '55566556'. It tells me that the pattern occurs only twice, but '55' also occurs at positions 2 to 3.
How can I get this function to output the following?
> str_locate_all(string, pattern)
[[1]]
start end
[1,] 1 2
[2,] 2 3
[3,] 6 7

Regex matches consume characters
To elaborate on my comment, see this Python answer:
Except for zero-length assertions, characters in the input will always be consumed in the matching. If you are ever in the case where you want to capture a certain character in the input string more than once, you will need a zero-length assertion in the regex.
What happens in your case
We can step through a (simplified) version of regex matching your string "55566556", with your pattern, "55":
Match 1: Characters in position 1 and 2 match "55" and are consumed. State of string: "566556".
Characters 3 and 4 (maintaining original indices), "56", are not a match.
Characters 4 and 5, "66", are not a match.
Characters 5 and 6, "65", are not a match.
Match 2: Characters in position 6 and 7 match "55" and are consumed.
Character 8, "6", is not a match.
No more matches.
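The stepping-through above can be checked directly. This is a minimal sketch in base R, using gregexpr rather than the stringr call from the question; the variable names consumed_starts and consumed_ends are made up for illustration.

```r
string <- "55566556"
pattern <- "55"

# gregexpr() resumes scanning after the end of each match, so the
# overlapping occurrence that starts at position 2 is never reported.
m <- gregexpr(pattern, string)[[1]]
consumed_starts <- as.vector(m)
consumed_ends <- consumed_starts + attr(m, "match.length") - 1

consumed_starts  # 1 6
consumed_ends    # 2 7
```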
Using a pattern which does not consume the input (zero-length assertion)
To resolve this issue, you need to use a pattern which does not consume the input string when it finds a match:
There are several zero-length assertions (e.g. ^ (start of input/line), $ (end of input/line), \b (word boundary)), but look-arounds ((?<=) positive look-behind and (?=) positive look-ahead) are the only way that you can capture overlapping text from the input. Negative look-arounds ((?<!) negative look-behind, (?!) negative look-ahead) are not very useful here: if they assert true, then the capture inside failed; if they assert false, then the match fails. These assertions are zero-length (as mentioned before), which means that they will assert without consuming the characters in the input string. They will actually match the empty string if the assertion passes.
However you will see slightly strange output if you apply a lookahead pattern directly:
lookahead_pattern <- paste0("(?=(", pattern, "))") # (?=(55))
str_locate_all(string, lookahead_pattern)
# [[1]]
# start end
# [1,] 1 0
# [2,] 2 1
# [3,] 6 5
As you can see, the start positions are correct but the end positions are not. That is because the lookahead is a zero-length match (so that it does not consume the string), and a zero-length match starting at position n is reported as ending at position n - 1.
In this case we know the length of the match is 2 characters. However, we do not always know the length from the input (e.g. in variable length matches such as "5.+"). One way around this is to get the matching text using stringi:
stringi::stri_match_all_regex(string, lookahead_pattern)
# [[1]]
# [,1] [,2]
# [1,] "" "55"
# [2,] "" "55"
# [3,] "" "55"
Putting it together to get your desired output
I am going to use stringi::stri_locate_all_regex, rather than stringr::str_locate_all, which is a wrapper for it:
library(stringi)
string <- paste0(c(5, 5, 5, 6, 6, 5, 5, 6), collapse = "")
pattern <- paste0(c(5, 5), collapse = "")
lookahead_pattern <- paste0("(?=(", pattern, "))")
match_starts <- stri_locate_all_regex(
string,
lookahead_pattern
)[[1]]
match_text <- stri_match_all_regex(string, lookahead_pattern)[[1]][,2]
# "55" "55" "55"
match_end <- match_starts[,"start"] + nchar(match_text) - 1
match_indices <- data.frame(
start = match_starts[,"start"],
end = match_end
)
match_indices
# start end
# 1 1 2
# 2 2 3
# 3 6 7
Incidentally, you can also do this all in base R, using the approach here.
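For reference, here is one way the base R version might look. It is only a sketch, not necessarily the linked approach: it assumes gregexpr with perl = TRUE, which exposes the capture group positions through the "capture.start" and "capture.length" attributes. The names cap_start, cap_len and overlapping are made up for illustration.

```r
string <- "55566556"
pattern <- "55"
lookahead_pattern <- paste0("(?=(", pattern, "))")  # (?=(55))

# perl = TRUE is required for lookahead. The overall match is zero-length,
# so the useful positions are those of the capture group inside it.
m <- gregexpr(lookahead_pattern, string, perl = TRUE)[[1]]
cap_start <- as.vector(attr(m, "capture.start"))
cap_len <- as.vector(attr(m, "capture.length"))

overlapping <- data.frame(start = cap_start, end = cap_start + cap_len - 1L)
overlapping
```

Because the end is computed from the captured length, the same code should also cope with variable-length patterns, where the match length is not known in advance.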

Related

Regex with 2 capture groups, "key=value" or "value_only"

I am trying to build a regex that matches either key=value or value_only, where in the key=value case the value may contain = signs. The key should go into capture group 1 and the value into capture group 2. The examples use R/stringr, which is backed by the ICU engine. I have not found any combination of greedy, possessive and lazy quantifiers that gets this to work. Am I missing something?
library(stringr)
data <- c(
"key1=value1",
"value_only_no_key",
"key2=value2=containing=equal=signs"
)
# Desired outcome:
result <- matrix(c(
"key1", "value1",
"", "value_only_no_key",
"key2", "value2=containing=equal=signs"
), ncol=2, byrow= TRUE)
# The non-optionality of = results in no match for #2
str_match(
data,
"(.*?)=(.*)"
)[,-1]
# Same here
str_match(
data,
"([^=]*?)=(.*)"
)[,-1]
# The optionality of =? lets the greedy capture 2 eat everything
str_match(
data,
"(.*?)=?(.*)"
)[,-1]
# This is better than nothing, but the value_no_key ends up in the first match
str_match(
data,
"([^=]*+)=?+(.*)"
)[,-1]
If you know that the key is before the first occurrence of the equals sign, you can use a negated character class to match all characters except =.
If you don't want to match empty strings and there should be at least a single character for the value:
^(?:([^\s=]+)=)?(.+)
Regex demo
If the key can also contain spaces, you can exclude matching a newline instead of whitespace chars.
^(?:([^\r\n=]+)=)?(.+)
Example
library(stringr)
data <- c(
"key1=value1",
"value_only_no_key",
"key2=value2=containing=equal=signs"
)
str_match(data,
"^(?:([^\\s=]+)=)?(.+)"
)[,-1]
Output
[,1] [,2]
[1,] "key1" "value1"
[2,] NA "value_only_no_key"
[3,] "key2" "value2=containing=equal=signs"
How about using an optional (?) non-capturing group (?:) anchored to the start of the string (^)?
str_match(data,
"^(?:(.*?)=)?(.*)"
)[,-1]
[,1] [,2]
[1,] "key1" "value1"
[2,] NA "value_only_no_key"
[3,] "key2" "value2=containing=equal=signs"
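If a regex is not strictly required, the same split can be sketched without capture groups by locating the first = in each string. This is an alternative technique, not what the answers above do; it uses only base R, and the names eq_pos, has_key, key and value are made up for illustration. It also yields "" rather than NA for a missing key, matching the desired outcome in the question.

```r
data <- c(
  "key1=value1",
  "value_only_no_key",
  "key2=value2=containing=equal=signs"
)

# Position of the first "=" in each string; -1 when there is none.
eq_pos <- as.vector(regexpr("=", data, fixed = TRUE))
has_key <- eq_pos > 0

# Everything before the first "=" is the key, everything after is the value.
key <- ifelse(has_key, substr(data, 1, eq_pos - 1), "")
value <- ifelse(has_key, substr(data, eq_pos + 1, nchar(data)), data)

result <- cbind(key, value)
result
```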

Turn txt file into dataframe

I have a txt file with this data in it:
1 message («random_choice»)[5];
2 reply («принято»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (загрузка):
6 OK («fine»)[2][3];
I want to turn it into a data frame with three columns: ID, Message, Comment.
I also want to remove the unnecessary numbers in square brackets at the end.
Some values in the ID column contain non-numeric text (usually xx). In these cases, the ID should be left empty.
So, desired result must look like this:
ID Message Comment
1 message random_choice
2 reply принято
3 regulate random_choice
4 Early reg for instance
Success загрузка
6 OK fine
How can I do that? Even when I try to read this txt file, I get a strange error:
df <- read.table("data_received.txt", header = TRUE)
The error I get:
Error in read.table("data_received.txt", header = TRUE) :
more columns than column names
You can use strcapture for this.
Fake data, you'll likely do txt <- readLines("data_received.txt"). (Since my locale on windows is not being friendly to those strings, I'll replace with straight ascii, assuming it'll work just fine on your system.)
txt <- readLines(textConnection("1 message («random_choice»)[5];
2 reply («asdf»)[2][3];
3 regulate («random_choice»)[5];
4 Early reg («for instance»)[2][3][4];
4xx: Success (something):
6 OK («fine»)[2][3];"))
The breakout:
out <- strcapture("^(\\S+)\\s+([^(]+)\\s+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
# Warning in fun(mat[, i]) : NAs introduced by coercion
out
# ID Message Comment
# 1 1 message «random_choice»
# 2 2 reply «asdf»
# 3 3 regulate «random_choice»
# 4 4 Early reg «for instance»
# 5 NA Success something
# 6 6 OK «fine»
The proto= argument indicates what type of columns are generated. Since I set the ID=0L, it assumes it'll be integer, so anything that does not convert to integer becomes NA (which satisfies your fifth row omission).
Explanation on the regex:
in general:
* means zero-or-more of the previous character (or character class)
+ means one-or-more
? (not used, but useful nonetheless) means zero or one
^ and $ mean the beginning and end of the string, respectively (a ^ within [..] is different)
(...) is a capture group: anything within the non-escaped parens is stored, anything not is discarded
[...] is a character group, any of the characters is a match; if this is instead [^..], then it is inverted: anything except what is listed
[[:name:]] is a POSIX character class used inside a character group, e.g. [[:space:]]
^(\\S+), start with (^) one or more (+) non-space characters (\\S);
\\s+ one or more space character (\\s) (discarded);
([^(]+) one or more character that is not a left-paren;
\\((.*)\\).*$ a literal left-paren (\\() and then zero or more of anything (.*), up to a literal right-paren (\\)), followed by anything else (.*) through the end of the string ($).
It should be noted that \\s and \\S are not POSIX regex syntax; the POSIX equivalents are [[:space:]] for \\s and [^[:space:]] for \\S (no space chars). They are equivalent here, but I went with the shorter form initially. With this replacement, it looks like
out <- strcapture("^([^[:space:]]+)[[:space:]]+([^(]+)[[:space:]]+\\((.*)\\).*$", txt,
proto = data.frame(ID=0L, Message="", Comment=""))
We can use {unglue}. Here you have two patterns: one contains "«" and an ID, the other doesn't. {unglue} will use the first pattern that matches. Any {foo} or {} expression matches the regex ".*?", and a data.frame is built from the names put between the braces.
txt <- c(
"1 message («random_choice»)[5];", "2 reply («asdf»)[2][3];",
"3 regulate («random_choice»)[5];", "4 Early reg («for instance»)[2][3][4];",
"4xx: Success (something):", "6 OK («fine»)[2][3];")
library(unglue)
patterns <-
c("{id} {Message} («{Comment}»){}",
"{} {Message} ({Comment}){}")
unglue_data(txt, patterns)
#> id Message Comment
#> 1 1 message random_choice
#> 2 2 reply asdf
#> 3 3 regulate random_choice
#> 4 4 Early reg for instance
#> 5 <NA> Success something
#> 6 6 OK fine

Create a new vector with text from strings in an old vector in R

Working with a data frame in R studio. One column, PODMap, has sentences such as "At my property there is a house at 38.1234, 123.1234 and also I have a car". I want to create new columns, one for the latitude and one for the longitude.
fvalue is the data frame. So far I have
matches <- regmatches(fvalue[,"PODMap"], regexpr("..\\.....", fvalue[,"PODMap"], perl = TRUE))
Since the only periods in the text are in the longitude and latitude, this returns the first lat or long listed in each string (I am still working on a regex to grab the longitude after the latitude, but that's a different question). The problem is that if, for instance, my vector is c("test 38.1111", "x", "test 38.2222"), then it returns c("38.1111", "38.2222"), which has the right values, but the vector is not the right length for my data frame and its elements no longer line up with the rows. I need it to return a blank, a 0 or NA for each string that has no match for the regular expression, so that it can be put into the data frame as a column. If I'm going about this entirely wrong, let me know about that too.
You can use regexec, which returns a list of the same length as the input, so you don't lose the positions of the strings that have no match.
PODMap<-c("At my property there is a house at 38.1234, 123.1234 and also I have a",
"Test TEst TEST Tes T 12.1234, 123.4567 test Tes",
"NO LONG HEre Here No Lat either",
"At my property there is a house at 12.1234, 423.1234 and also I have ")
Index<-c(1:4)
fvalue<-data.frame(Index,PODMap)
matches <- regmatches(fvalue[,"PODMap"], regexec("..\\.....", fvalue[,"PODMap"], perl = TRUE))
> matches
[[1]]
[1] "38.1234"
[[2]]
[1] "12.1234"
[[3]]
character(0)
[[4]]
[1] "12.1234"
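To turn that list into a column of the same length as the data frame, the character(0) entries can be mapped to NA. The following is a minimal sketch on the three-element vector from the question. The pattern "[0-9]+\\.[0-9]+" and the name first_match are assumptions made for illustration, not part of the original answer.

```r
x <- c("test 38.1111", "x", "test 38.2222")

# regexec() yields one list element per input string; elements with no
# match come back as character(0), which is mapped to NA here.
m <- regmatches(x, regexec("[0-9]+\\.[0-9]+", x))
first_match <- vapply(
  m,
  function(v) if (length(v) > 0) v[1] else NA_character_,
  character(1)
)

first_match
```

The result has one entry per input string, so it can be assigned directly as a new column, e.g. with fvalue$lat <- first_match when run on the PODMap column.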
Using the package stringr, we can get both the long and lat.
library(stringr)
matches<-str_match_all(fvalue[,"PODMap"], ".\\d\\d\\.\\d\\d\\d\\d")
> matches
[[1]]
[,1]
[1,] " 38.1234"
[2,] "123.1234"
[[2]]
[,1]
[1,] " 12.1234"
[2,] "123.4567"
[[3]]
[,1]
[[4]]
[,1]
[1,] " 12.1234"
[2,] "423.1234"
The \\d matches any digit 0-9, so that will keep out any words, and we use str_match_all to get all the matches from each string, since regexec only finds the first match. Note that for a string with no match, str_match_all returns an empty (zero-row) matrix instead of character(0), which should not be a problem.
Check out this regex demo

Regex - Capturing repeating groups excluding surrounding partially matching content

For the record, I am using R, but the queries I have are platform independent (as it stands) so I'll demo with regex101. I am attempting to capture repeated groups that may or may not be surrounded by other text. So the ideal behaviour is shown in this demo:
demo1
regex: (\d{2})(AB)
text: blahblah11AB12AB13ABblah
So it nicely captures all the groups I want:
Match 1
Full match 8-12 `11AB`
Group 1. 8-10 `11`
Group 2. 10-12 `AB`
Match 2
Full match 12-16 `12AB`
Group 1. 12-14 `12`
Group 2. 14-16 `AB`
Match 3
Full match 16-20 `13AB`
Group 1. 16-18 `13`
Group 2. 18-20 `AB`
However, if I include another piece of matched text, it captures that as well (which is fair enough I suppose)
text: blahblah11AB12AB13ABblah22AB
returns the same but with the extra group:
Match 4
Full match 24-28 `22AB`
Group 1. 24-26 `22`
Group 2. 26-28 `AB`
demo2
What I want to do is capture the first group but disregard all other text, even if there is a subsequent match. In essence, I want to get just the three matches from this text: blahblah11AB12AB13ABblah22AB
I have tried a number of things, such as this:
(((\d{2})(AB))+)(.*)
But then I get the following, which loses all but the last group capture:
Demo 3
Match 1
Full match 8-28 `11AB12AB13ABblah22AB`
Group 1. 8-20 `11AB12AB13AB`
Group 2. 16-20 `13AB`
Group 3. 16-18 `13`
Group 4. 18-20 `AB`
Group 5. 20-28 `blah22AB`
I need something which retains the repeated groups. Am stumped!
In R, the output should look like this:
[[1]]
[,1] [,2] [,3]
[1,] "11AB" "11" "AB"
[2,] "12AB" "12" "AB"
[3,] "13AB" "13" "AB"
Thanks in advance...
One idea is to use \G to chain matches starting from the ^ start anchor, with \K resetting the start of the reported match.
(?:^.*?\K|\G)(\d{2})(AB)
^.*?\K will match any amount of any characters lazily before the first match
|\G or continue at the end of previous match which can be: start, first, previous
See your updated demo
This will match the first chain of matches and is a pcre pattern (perl=TRUE).
If there can only be non-digits before first match, use ^\D*\K instead of ^.*?\K.
You could use the quantifier {3} to get just the first 3 groups: (((\d{2})(AB)){3}). See demo
If I understand it correctly, the problem is in the placement of the parenthesis.
pattern <- "(\\d{2}AB)"
s <- "blahblah11AB12AB13ABbla"
m <- gregexpr(pattern, s)
regmatches(s, m)
#[[1]]
#[1] "11AB" "12AB" "13AB"
s2 <- "blahblah11AB12AB13ABblah22AB"
s3 <- "11AB12AB13ABblah22AB"
S <- c(s, s2, s3)
m <- gregexpr(pattern, S)
regmatches(S, m)
#[[1]]
#[1] "11AB" "12AB" "13AB"
#
#[[2]]
#[1] "11AB" "12AB" "13AB" "22AB"
#
#[[3]]
#[1] "11AB" "12AB" "13AB" "22AB"
Note that this is often written as a single line of code. I have left it as separate steps to make it clearer.
EDIT.
Maybe the following does what the OP is asking for.
I bet there are better solutions; it seems to me that two regexes are overkill.
pattern <- "((\\d{2}AB)+)([^[:digit:]AB]+(\\d{2}AB))"
pattern2 <- "(\\d{2}AB)"
m <- gregexpr(pattern2, gsub(pattern, "\\1", S))
regmatches(S, m)
#[[1]]
#[1] "11AB" "12AB" "13AB"
#
#[[2]]
#[1] "11AB" "12AB" "13AB"
#
#[[3]]
#[1] "11AB" "12AB" "13AB"
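A variation on the same two-step idea, sketched here under the assumption that every input contains at least one run: first isolate the first uninterrupted run with regexpr, then split that run into its units. The names first_run and units are made up for illustration.

```r
S <- c(
  "blahblah11AB12AB13ABbla",
  "blahblah11AB12AB13ABblah22AB",
  "11AB12AB13ABblah22AB"
)

# Step 1: regexpr() reports only the first match, i.e. the first
# uninterrupted run of "<two digits>AB" units in each string.
first_run <- regmatches(S, regexpr("([0-9]{2}AB)+", S))

# Step 2: split that run into its individual units.
units <- regmatches(first_run, gregexpr("[0-9]{2}AB", first_run))
units
```

Note that regmatches drops inputs with no match when used with regexpr, so first_run can be shorter than S if some strings contain no run at all.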

Parsing String and splitting it in R

I have a regex problem with handling strings in R.
I have data structure provided by RNAfold software that looks like this:
"....(((..((((((((.(((((((((((.........))))))))))).))))))))..))).."
This is a typical secondary structure for miRNAs, but I also have other sequences that are not miRNAs, which look somewhat like this:
...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......
This second sequence has two hairpin loops, one at the beginning and another one in the middle, whereas the first sequence just has one hairpin loop in the middle.
Dots (".") represent nucleotides that are not paired, while "(" represent nucleotides that are paired with their counterparts, represented as ")".
I want to split this string so that I can get the stems in the structure.
The output I would like to obtain is:
Input:
[1] "....(((..((((((((.(((((((((((.........))))))))))).))))))))..))).."
Output:
[1] "....(((..((((((((.(((((((((((........."
[2] "))))))))))).))))))))..))).."
That way I can count the number of split strings and, from that, the number of stems.
The result for the second sequence would be:
Input:
[1] ...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......
Output:
[1] "...((((....."
[2] "))))...........(((((((...((..(((..((((...((((((....."
[3] ")))).))...)))).))).))...)))))))......."
So in essence, what I want is to parse the strings so that they are split where they reach a ")" symbol, while preserving all the symbols of the string.
I have tried using strsplit() and some regex variations, but I haven't been able to find the trick...
Any help?
Thanks
You could do a lookahead and look for dots ending by a closing parenthesis which come straight after an opening parenthesis.
x <- c("....(((..((((((((.(((((((((((..))))))))))).))))))))..)))..",
"...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")
strsplit(x, "\\((?=(\\.+\\)))", perl = TRUE)
# [[1]]
# [1] "....(((..((((((((.((((((((((" "..))))))))))).))))))))..))).."
#
# [[2]]
# [1] "...(((" ".....))))...........(((((((...((..(((..((((...((((("
# [3] ".....)))).))...)))).))).))...)))))))......."
If you are looking to count characters, it might be more convenient to do this:
x <- "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...)))))))......."
with(rle(strsplit(x, "")[[1]]), setNames(lengths, values))
## . ( . ) . ( . ( . ( . ( . ( . ) . ) . ) . ) . ) . ) .
## 3 4 5 4 11 7 3 2 2 3 2 4 3 6 5 4 1 2 3 4 1 3 1 2 3 7 7
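If only the number of hairpin loops is needed, it may be enough to count the places where a "(" is followed by one or more dots and then a ")", without splitting at all. This is a rough sketch under that assumption (a loop with no unpaired dots, "()", would not be counted), and count_hairpins is a hypothetical helper name. The number of split pieces asked for in the question would then be this count plus one.

```r
count_hairpins <- function(x) {
  # Each "(" + dots + ")" marks the turning point of one hairpin loop.
  m <- gregexpr("\\(\\.+\\)", x)
  # gregexpr() returns -1 for strings with no match, hence the pos > 0 test.
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

count_hairpins(c("..((..))..", "((..))..((...))", "....."))
```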
You can get the output you specified using DavidArenburg's logic, but with a twist. David uses a lookahead to find the ( that precedes one or more dots followed by a ). A variable-length lookbehind (where the pattern contains an unspecified number of characters) would be ideal, but it is not allowed. The trick is to reverse the string and use a variable-length lookahead, which then behaves much like a variable-length lookbehind would.
Data
S <- c("....(((..((((((((.(((((((((((.........))))))))))).))))))))..)))..", "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")
Functions
reverse_string <- function(S) {
paste(rev(unlist(strsplit(S, ""))), collapse="")
}
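myfun <- function(S) {
# Reverse the string so that a lookahead can play the role of a lookbehind
# (named "reversed" rather than "T", which would shadow the TRUE shorthand)
reversed <- reverse_string(S)
# Split on each ")" followed by dots and a "("; strsplit() drops that ")" itself
result <- unlist(strsplit(reversed, "\\)(?=(\\.+\\())", perl = TRUE))
# Reverse each piece back and restore the original left-to-right order
setNames(rev(sapply(result, function(i) reverse_string(i))), NULL)
}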
Result
lapply(S, myfun)
# [[1]]
# [1] "....(((..((((((((.(((((((((((........."
# [2] ")))))))))).))))))))..))).."
# [[2]]
# [1] "...((((....."
# [2] ")))...........(((((((...((..(((..((((...((((((....."
# [3] "))).))...)))).))).))...)))))))......."

Resources