Regex with 2 capture groups, "key=value" or "value_only" - r

I am trying to build a regex that matches either key=value or value_only, where in the key=value case the value may contain = signs. The key should go into capture group 1 and the value into capture group 2. The examples use R/stringr, which uses the ICU regex engine. I have not found any combination of greedy, possessive and lazy quantifiers that gets this to work. Am I missing something?
library(stringr)
data <- c(
"key1=value1",
"value_only_no_key",
"key2=value2=containing=equal=signs"
)
# Desired outcome:
result <- matrix(c(
"key1", "value1",
"", "value_only_no_key",
"key2", "value2=containing=equal=signs"
), ncol=2, byrow= TRUE)
# The non-optionality of = results in no match for #2
str_match(
data,
"(.*?)=(.*)"
)[,-1]
# Same here
str_match(
data,
"([^=]*?)=(.*)"
)[,-1]
# The optionality of =? lets the greedy capture 2 eat everything
str_match(
data,
"(.*?)=?(.*)"
)[,-1]
# This is better than nothing, but value_only_no_key ends up in the first capture group
str_match(
data,
"([^=]*+)=?+(.*)"
)[,-1]

If you know that the key comes before the first equals sign, you can use a negated character class to match all characters except =, and wrap the key and the = in an optional non-capturing group.
If you don't want to match empty strings, so that the value has at least one character:
^(?:([^\s=]+)=)?(.+)
Regex demo
If the key can also contain spaces, you can exclude matching a newline instead of whitespace chars.
^(?:([^\r\n=]+)=)?(.+)
Example
library(stringr)
data <- c(
"key1=value1",
"value_only_no_key",
"key2=value2=containing=equal=signs"
)
str_match(data,
"^(?:([^\\s=]+)=)?(.+)"
)[,-1]
Output
[,1] [,2]
[1,] "key1" "value1"
[2,] NA "value_only_no_key"
[3,] "key2" "value2=containing=equal=signs"
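If you need the empty string from the desired outcome rather than NA for a missing key, you can replace the NA entries after matching. A minimal sketch, assuming the same data vector as in the question (the name m is only illustrative):

```r
library(stringr)

data <- c(
  "key1=value1",
  "value_only_no_key",
  "key2=value2=containing=equal=signs"
)

# Optional non-capturing group: the key (group 1) only participates
# when an "=" is present; the value (group 2) takes the rest.
m <- str_match(data, "^(?:([^\\s=]+)=)?(.+)")[, -1]

# A group that did not participate in the match is reported as NA;
# replace it with "" to get the layout asked for in the question.
m[is.na(m)] <- ""
m
```

The replacement is done on the whole matrix, which is safe here because the value group always participates when the string is non-empty.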

How about using an optional (?) non-capturing group (?:) anchored to the start of the string with ^?
str_match(data,
"^(?:(.*?)=)?(.*)"
)[,-1]
[,1] [,2]
[1,] "key1" "value1"
[2,] NA "value_only_no_key"
[3,] "key2" "value2=containing=equal=signs"

Related

locate all overlapping patterns in string

I find this function "str_locate_all":
library(stringr)
string = paste0(c(5,5,5,6,6,5,5,6), collapse = "")
pattern = paste0(c(5,5), collapse = "")
str_locate_all(string, pattern)
[[1]]
start end
[1,] 1 2
[2,] 6 7
Here I look for (only consecutive) pattern '55' in string '55566556' . It tells me that it occurs only twice - but I see that '55' also happens between position 2 and position 3.
How can I get this function to output the following?
> str_locate_all(string, pattern)
[[1]]
start end
[1,] 1 2
[2,] 2 3
[3,] 6 7
Regex matches consume characters
To elaborate on my comment, see this Python answer:
Except for zero-length assertions, characters in the input will always be consumed in the matching. If you are ever in the case where you want to capture a certain character in the input string more than once, you will need a zero-length assertion in the regex.
What happens in your case
We can step through a (simplified) version of regex matching your string "55566556", with your pattern, "55":
Match 1: Characters in position 1 and 2 match "55" and are consumed. State of string: "566556".
Characters 3 and 4 (maintaining original indices), "56", are not a match.
Characters 4 and 5, "66", are not a match.
Characters 5 and 6, "65", are not a match.
Match 2: Characters in position 6 and 7 match "55" and are consumed.
Character 8, "6", is not a match.
No more matches.
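The steps above can be checked quickly in base R. This is only a small sketch (the variable names are illustrative): gregexpr() reports just the non-overlapping start positions, because the scan resumes after the end of each match.

```r
string <- "55566556"

# Plain (consuming) pattern: after matching positions 1-2, the search
# continues at position 3, so the overlapping match at 2-3 is skipped.
starts <- as.integer(gregexpr("55", string)[[1]])
starts
```

This gives the same two start positions as str_locate_all in the question.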
Using a pattern which does not consume the input (zero-length assertion)
To resolve this issue, you need to use a pattern which does not consume the input string when it finds a match:
There are several zero-length assertions (e.g. ^ (start of input/line), $ (end of input/line), \b (word boundary)), but look-arounds ((?<=) positive look-behind and (?=) positive look-ahead) are the only way that you can capture overlapping text from the input. Negative look-arounds ((?<!) negative look-behind, (?!) negative look-ahead) are not very useful here: if they assert true, then the capture inside failed; if they assert false, then the match fails. These assertions are zero-length (as mentioned before), which means that they will assert without consuming the characters in the input string. They will actually match the empty string if the assertion passes.
However you will see slightly strange output if you apply a lookahead pattern directly:
lookahead_pattern <- paste0("(?=(", pattern, "))") # (?=(55))
str_locate_all(string, lookahead_pattern)
# [[1]]
# start end
# [1,] 1 0
# [2,] 2 1
# [3,] 6 5
As you can see, the start positions are correct but the end positions are not. That is because we have had to use a zero-length match, in order to not consume the string.
In this case we know the length of the match is 2 characters. However, we do not always know the length from the input (e.g. in variable length matches such as "5.+"). One way around this is to get the matching text using stringi:
stringi::stri_match_all_regex(string, lookahead_pattern)
# [[1]]
# [,1] [,2]
# [1,] "" "55"
# [2,] "" "55"
# [3,] "" "55"
Putting it together to get your desired output
I am going to use stringi::stri_locate_all_regex, rather than stringr::str_locate_all, which is a wrapper for it:
library(stringi)
string <- paste0(c(5, 5, 5, 6, 6, 5, 5, 6), collapse = "")
pattern <- paste0(c(5, 5), collapse = "")
lookahead_pattern <- paste0("(?=(", pattern, "))")
match_starts <- stri_locate_all_regex(
string,
lookahead_pattern
)[[1]]
# "55" "55" "55"
match_text <- stri_match_all_regex(string, lookahead_pattern)[[1]][,2]
match_end <- match_starts[,"start"] + nchar(match_text) - 1
match_indices <- data.frame(
start = match_starts[,"start"],
end = match_end
)
match_indices
# start end
# 1 1 2
# 2 2 3
# 3 6 7
Incidentally, you can also do this all in base R, using the approach here.
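For reference, here is one way a base R version might look. It is a sketch under my own assumptions and not necessarily what the linked approach does: with perl = TRUE, gregexpr() attaches "capture.start" and "capture.length" attributes for the capture group inside the lookahead, so the end positions can be computed even for variable-length matches.

```r
string <- "55566556"
pattern <- "55"
lookahead_pattern <- paste0("(?=(", pattern, "))") # (?=(55))

# The overall match is zero-length, but the capture group inside the
# lookahead records where the matched text starts and how long it is.
m <- gregexpr(lookahead_pattern, string, perl = TRUE)[[1]]
start <- as.vector(attr(m, "capture.start"))
len <- as.vector(attr(m, "capture.length"))

match_indices <- data.frame(start = start, end = start + len - 1L)
match_indices
```

This avoids the extra dependency, at the cost of working with attributes rather than a ready-made matrix.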

Create a new vector with text from strings in an old vector in R

I am working with a data frame in RStudio. One column, PODMap, has sentences such as "At my property there is a house at 38.1234, 123.1234 and also I have a car". I want to create new columns, one for the latitude and one for the longitude.
fvalue is the data frame. So far I have
matches <- regmatches(fvalue[,"PODMap"], regexpr("..\\.....", fvalue[,"PODMap"], perl = TRUE))
Since the only periods in the text are in the longitude and latitude, this returns the first lat or long listed in each string (I am still working on finding a regex to grab the longitude after the latitude, but that's a different question). The problem is, for instance, if my vector is c("test 38.1111", "x", "test 38.2222") then it returns ("38.1111", "38.2222"), which has the right values, but the vector won't be the right length for my data frame and won't line up with its rows. I need it to return a blank or a 0 or NA for each string that doesn't contain a value matching the regular expression, so that it can be put into the data frame as a column. If I'm going about this entirely wrong, let me know about that too.
You can use regexec, which returns a list of the same length as the input, so you don't lose the positions of the non-matches
PODMap<-c("At my property there is a house at 38.1234, 123.1234 and also I have a",
"Test TEst TEST Tes T 12.1234, 123.4567 test Tes",
"NO LONG HEre Here No Lat either",
"At my property there is a house at 12.1234, 423.1234 and also I have ")
Index<-c(1:4)
fvalue<-data.frame(Index,PODMap)
matches <- regmatches(fvalue[,"PODMap"], regexec("..\\.....", fvalue[,"PODMap"], perl = TRUE))
> matches
[[1]]
[1] "38.1234"
[[2]]
[1] "12.1234"
[[3]]
character(0)
[[4]]
[1] "12.1234"
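To turn that list into a column of the right length, each empty element can be mapped to NA. A minimal sketch using the small vector from the question (the names x and first_match are illustrative, and I use a slightly stricter pattern than above):

```r
x <- c("test 38.1111", "x", "test 38.2222")

# regexec keeps one list element per input string, matched or not
m <- regmatches(x, regexec("[0-9]+\\.[0-9]+", x))

# character(0) (no match) becomes NA, otherwise take the matched text
first_match <- vapply(
  m,
  function(v) if (length(v) == 0) NA_character_ else v[1],
  character(1)
)
first_match
```

The result has one entry per input string, so it can be assigned directly as a data frame column.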
Using the package stringr, we can get both the long and lat.
library(stringr)
matches<-str_match_all(fvalue[,"PODMap"], ".\\d\\d\\.\\d\\d\\d\\d")
> matches
[[1]]
[,1]
[1,] " 38.1234"
[2,] "123.1234"
[[2]]
[,1]
[1,] " 12.1234"
[2,] "123.4567"
[[3]]
[,1]
[[4]]
[,1]
[1,] " 12.1234"
[2,] "423.1234"
The \\d matches any digit 0-9, which keeps out any words, and we use str_match_all to get all the matches from each string, whereas regmatches with regexec only returns the first match. For a string with no match, str_match_all returns an empty matrix instead of character(0), which should not be a problem.
Check out this regex demo
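If the goal is two new columns, one option is a single pattern with two capture groups and str_match, which returns a row of NA for strings without a match. This is only a sketch, and it assumes the coordinates always appear as "lat, long" separated by a comma, as in the sample data; the column names lat and long are my own.

```r
library(stringr)

PODMap <- c(
  "At my property there is a house at 38.1234, 123.1234 and also I have a",
  "Test TEst TEST Tes T 12.1234, 123.4567 test Tes",
  "NO LONG HEre Here No Lat either",
  "At my property there is a house at 12.1234, 423.1234 and also I have "
)

# Column 1 is the full match, columns 2 and 3 are the capture groups
coords <- str_match(PODMap, "(\\d+\\.\\d+),\\s*(\\d+\\.\\d+)")

fvalue <- data.frame(
  PODMap,
  lat = as.numeric(coords[, 2]),
  long = as.numeric(coords[, 3]),
  stringsAsFactors = FALSE
)
fvalue[, c("lat", "long")]
```

Because str_match always returns one row per input string, the new columns line up with the data frame without any extra padding.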

extracting a word (of variable length) ending with 동 from a string in R

I have a data frame in R with one column containing an address in Korean. I need to extract one of the words (a word ending with 동), if it's there (it's possible that it's missing) and create a new column named "dong" that will contain this word. So my data is shown in column "address" and desired output is shown in column "dong" shown below.
address <- c("대전광역시 서구 탄방동 홈플러스","대전광역시 동구 효동 주민센터","대전광역시 대덕구 오정동 한남마트","대전광역시 동구 자양동 87-3번지 성동경로당","대전광역시 유성구 용계로 128")
dong <- c("탄방동","효동","오정동","자양동",NA)
data <- data.frame(address,dong, stringsAsFactors = FALSE)
I've tried using grep but it's not giving me exactly what I need.
grep(".+동\\s",data$address,value=T)
I think I have 2 issues: 1) I'm not sure how to write a proper regular expression to identify the word I need and 2) I'm not sure why grep returns the whole string rather than the word. I would appreciate any suggestions.
A regex to extract Korean whole words ending with a specific letter is
\b\w*동\b
See the regex demo.
Details:
\b - leading word boundary
\w* - 0+ word chars
동 - ending letter
\b - trailing word boundary
See the R demo:
address <- c("대전광역시 서구 탄방동 홈플러스","대전광역시 동구 효동 주민센터","대전광역시 대덕구 오정동 한남마트","대전광역시 동구 자양동 87-3번지 성동경로당","대전광역시 유성구 용계로 128")
## matches <- regmatches(address, gregexpr("\\b\\w*동\\b", address, perl=TRUE ))
matches <- regmatches(address, gregexpr("\\b\\w*동\\b", address ))
dong <- unlist(lapply(matches, function(x) if (length(x) == 0) NA else x))
data <- data.frame(address,dong, stringsAsFactors = FALSE)
Output:
address dong
1 대전광역시 서구 탄방동 홈플러스 탄방동
2 대전광역시 동구 효동 주민센터 효동
3 대전광역시 대덕구 오정동 한남마트 오정동
4 대전광역시 동구 자양동 87-3번지 성동경로당 자양동
5 대전광역시 유성구 용계로 128 <NA>
Note that the dong <- unlist(lapply(matches, function(x) if (length(x) == 0) NA else x)) line is necessary to add NA to those rows where no match was found.
grep returns the whole string. In your case, stringr library is useful.
library(stringr)
str_match(paste0(data$address, ' '), '([^\\s]+동)\\s')
[,1] [,2]
[1,] "탄방동 " "탄방동"
[2,] "효동 " "효동"
[3,] "오정동 " "오정동"
[4,] "자양동 " "자양동"
[5,] NA NA
The column 2 is what you want. Note that I added a space at the end of strings so that regex would match if "dong" appears at the end of string.
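A variation that avoids padding the strings: str_extract returns NA when nothing matches, and a lookahead for "whitespace or end of string" replaces the trailing \\s. This is a sketch along the same lines, not a tested drop-in for every address format.

```r
library(stringr)

address <- c(
  "대전광역시 서구 탄방동 홈플러스",
  "대전광역시 동구 효동 주민센터",
  "대전광역시 대덕구 오정동 한남마트",
  "대전광역시 동구 자양동 87-3번지 성동경로당",
  "대전광역시 유성구 용계로 128"
)

# One or more non-space characters ending in 동, followed by
# whitespace or the end of the string (checked without consuming it)
dong <- str_extract(address, "\\S+동(?=\\s|$)")

data <- data.frame(address, dong, stringsAsFactors = FALSE)
data$dong
```

Only the first such word in each address is returned, which matches the desired output in the question.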

How to take out specific letters from a character variable

xcv(123)
wert(232)
t(145)
tyui ier(133)
ytie(435)
...
The length of each string is dynamic, meaning it varies from row to row. The numbers between the brackets are the target characters that need to be extracted and stored in a new column in the same data set.
The following key words might help:
substr() strsplit()
I'm actively looking for an answer. Your help would be deeply appreciated.
Do you mean that you want to extract, for example, the 123rd letter from the string called xcv?
set.seed(123)
xcv <- paste( sample( letters, 200, replace = TRUE ), collapse = "" )
n <- 123
You can extract the nth letter like so:
substr( xcv, n, n )
# [1] "i"
dat = c('xcv(123)' ,'wert(232)', 't(145)', 'tyui ier(133)', 'ytie(435)')
target = gsub(".*\\(|\\).*", "", dat) #captures anything in between '(' and ')'. We use \\( and \\) to denote the brackets since they are special characters.
cbind(dat, target)
dat target
[1,] "xcv(123)" "123"
[2,] "wert(232)" "232"
[3,] "t(145)" "145"
[4,] "tyui ier(133)" "133"
[5,] "ytie(435)" "435"
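If the values should end up as a numeric column in the same data set, a capture group with sub() is another option. A minimal sketch, assuming a data frame with a column I have called label (both names are hypothetical):

```r
df <- data.frame(
  label = c("xcv(123)", "wert(232)", "t(145)", "tyui ier(133)", "ytie(435)"),
  stringsAsFactors = FALSE
)

# Keep only the digits found between "(" and ")" via the capture group,
# then convert the text to integer for the new column.
df$target <- as.integer(sub(".*\\(([0-9]+)\\).*", "\\1", df$label))
df
```

Note that sub() returns the input unchanged when the pattern does not match, so a row without brackets would become NA (with a warning) after as.integer().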

R grep whole words separated by special characters

Suppose there is a vector of sequences of the form "foo" or "foo|baz|bar" (one single word or multiple words separated by special character like "|"), and we are also given a word and we want to find to which items of the vector it has a whole word match.
For example the word "foo" has a whole match in "foo|baz|bar", but not a whole match in either "foobaz|bar" or "bazfoo".
First I tried to use "\\b" that indicates either the start or the end edges of a whole word and it works successfully:
grep("\\bfoo\\b", "foo") # match
grep("\\bfoo\\b", "foobaz|bar") # mismatch
grep("\\bfoo\\b", "bazfoo") # mismatch
Then I tried to add "|" as the other possible separator of both ends, and group it with "\\b" using [ and ]:
grep("[|\\b]foo[|\\b]", "foo|baz|bar") # mismatch!
grep("[|\\b]foo[|\\b]", "foo") # mismatch!
Later I found that \\b does not mark the start or end of the character string, but the start or end of a whole word (so many characters such as space and ,|-^. separate whole words, but digits and the underscore _ do not). So "\\bfoo\\b" matches all of these strings: "foo", "foo|bar|baz", "foo-bar", "baz foo|bar" but not "foo_bar" or "foo2".
But my question still remains: Why "[|\\b]foo[|\\b]" pattern fails to match with "foo"?
You could use strsplit:
> "foo" %in% unlist(strsplit("foo|baz|bar", split = "|", fixed = TRUE))
[1] TRUE
Which you can vectorize:
> z <- c("foo|baz|bar", "foobaz|bar", "bazfoo")
> x <- c("foo", "foot")
> sapply(strsplit(z, split = "|", fixed = TRUE), function(x,y)y %in% x, x)
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE FALSE FALSE
\b matches at the following positions
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character. (Word characters are a-zA-Z0-9_)
Since | stands for alternation operator in regex, you will have to escape it.
So the regex \bfoo\b would match foo in foo|bar because | is a non word character. There is no need to use the character set [\b\|]
Edit: As flodel pointed out below \b inside the character set represents the backspace character. So it would match the | inside [\b\|] and not word boundary.
Since | has special meaning in a regular expression, you need to escape it, i.e. use \\|:
ptn <- "\\bfoo[\\|\\b]"
grep(ptn, "foo|baz|bar")
[1] 1
grep(ptn, "foo")
integer(0)
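As the last call shows, that pattern still misses the bare "foo". If only the start of the string, the end of the string and | should count as separators, one option is to spell those out with alternation instead of a character class. A sketch, where whole_word_in_list is a hypothetical helper and word is assumed to contain no regex metacharacters:

```r
# TRUE where `word` is a complete item of a "|"-separated string
whole_word_in_list <- function(word, x) {
  # (^|\\|) : start of string or a literal "|"
  # (\\||$) : a literal "|" or end of string
  grepl(paste0("(^|\\|)", word, "(\\||$)"), x)
}

z <- c("foo", "foo|baz|bar", "foobaz|bar", "bazfoo", "baz|foo", "foo-bar")
res <- whole_word_in_list("foo", z)
res
```

Unlike \\b, this does not treat characters such as - or a space as separators, so "foo-bar" is not reported as a match.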
This would also work:
gregexpr("foo|", "foo|baz|bar", fixed = TRUE)[[c(1, 1)]] > 0
gregexpr("foo|", "foobaz|bar", fixed = TRUE)[[c(1, 1)]] > 0
gregexpr("foo|", "bazfoo", fixed = TRUE)[[c(1, 1)]] > 0
This approach is different in that the pattern is matched literally, so you can include spaces in the pattern you supply to gregexpr to find entries consisting of two words:
gregexpr("foo|", "baz foo|", fixed = TRUE)[[c(1, 1)]] > 0
gregexpr(" foo|", "baz foo|", fixed = TRUE)[[c(1, 1)]] > 0
