Regular Expressions Capturing Groups - r

I am trying to extract latitudes, longitudes, and a label from a string in R (v3.4.1). My thought is that a regular expression is the way to go, and since the stringr package has the ability to extract capturing groups, I thought this is the package to use. The problem is that I am receiving an error that I cannot interpret. Any help would be appreciated.
Here is an example of a string that I would like to extract the information from. I want to grab the last set of latitude (41.505) and longitude (-81.608333) along with the label (Adelbert Hall).
a <- "Case Western Reserve University campus41°30′18″N 81°36′30″W / 41.505°N 81.608333°W / 41.505; -81.608333 (Adelbert Hall)"
Here is the regular expression that I created to grab the fields that I am interested in.
coordRegEx <- "([\\d]*\\.\\d*)(?#Capture Latitude);\\h(-\\d*\\.\\d*)(?#Capture Longitude)\\N*\\((\\N*)(?#Capture Label)\\)"
Now, when I try to match the regular expression in the string using:
s <- str_match(a,coordRegEx)
I get the following error:
Error in stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : Incorrect Unicode property. (U_REGEX_PROPERTY_SYNTAX)
My guess is that this error has something to do with the Regex pattern, but using documentation and web searches, I have been unable to decipher it.

There are several issues with the current code:
The (?#...) pieces are comments, which are only allowed when you pass the x modifier to the regex
The \N shorthand, which matches any non-line-break character, is not supported by the ICU regex library (ICU only supports \N{UNICODE CHARACTER NAME}, which matches a named character). You may replace \N with a dot (.)
Here is the fixed approach:
> a <- "Case Western Reserve University campus41°30′18″N 81°36′30″W / 41.505°N 81.608333°W / 41.505; -81.608333 (Adelbert Hall)"
> coordRegEx <- "(?x)(\\d*\\.\\d*)(?#Capture Latitude);\\h(-\\d*\\.\\d*)(?#Capture Longitude).*\\((.*)(?#Capture Label)\\)"
> s <- str_match(a,coordRegEx)
> s
     [,1]                                  [,2]     [,3]         [,4]
[1,] "41.505; -81.608333 (Adelbert Hall)" "41.505" "-81.608333" "Adelbert Hall"

If we need the output as a single string:
sub(".*\\/\\s*", "", a)
#[1] "41.505; -81.608333 (Adelbert Hall)"
If we need the pieces as separate elements:
strsplit(sub(".*\\/\\s*", "", a), ";\\s*|\\s*\\(|\\)")[[1]]
#[1] "41.505" "-81.608333" "Adelbert Hall"
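As an optional follow-up (not part of the answer above), the split pieces can be coerced into numeric coordinates and a character label, all in base R:

```r
a <- "Case Western Reserve University campus41°30′18″N 81°36′30″W / 41.505°N 81.608333°W / 41.505; -81.608333 (Adelbert Hall)"

# split the trailing "lat; lon (label)" chunk into its three parts
parts <- strsplit(sub(".*\\/\\s*", "", a), ";\\s*|\\s*\\(|\\)")[[1]]

lat   <- as.numeric(parts[1])  # 41.505
lon   <- as.numeric(parts[2])  # -81.608333
label <- parts[3]              # "Adelbert Hall"
```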

Related

Quanteda: How can I use square brackets with glob-style pattern matching using tokens_lookup?

I have two interrelated questions with respect to pattern matching in R using the package {quanteda} and the tokens_lookup() function with the default valuetype="glob" (see here and here).
Say I wanted to match a German word which can be spelt slightly differently depending on whether it is singular or plural: "Apfel" (EN: apple), "Äpfel" (EN: apples). For the plural, we thus use the umlaut "ä" instead of "a" at the beginning. So if I look up tokens, I want to make sure that whether or not I find fruits in a text does not depend on whether the word I'm looking for is singular or plural. This is a very simple example and I'm aware that I might as well build a dictionary that features "äpfel*" and "apfel*", but my question is more generally about the use of special characters like square brackets.
So in essence, I thought I could simply go with square brackets, similarly to regex pattern matching: [aä]. More generally, I thought I could use things like [a-z] to match any single letter from a to z, or [0-9] to match any single digit between 0 and 9. In fact, that's what it says here. For some reason, none of that seems to work:
library(quanteda)
text <- c(d1 = "i like apples and apple pie",
          d2 = "ich mag äpfel und apfelkuchen")
dict_1 <- dictionary(list(fruits = c("[aä]pfel*"))) # EITHER "a" OR "ä"
dict_2 <- dictionary(list(fruits = c("[a-z]pfel*"))) # ANY LETTER
tokens(text) %>%
  tokens_lookup(dict_1, valuetype = "glob")
tokens(text) %>%
  tokens_lookup(dict_2, valuetype = "glob")
1.) Is there a way to use square brackets at all in glob pattern matching?
2.) If so, would [a-z] also match umlauts (ä,ö,ü) or if not, how can we match characters like that?
1) No, you cannot use brackets with glob pattern matching. However, they work perfectly with regex pattern matching.
2) No, [a-z] will not match umlauts.
Here's how to do it, stripping away everything from your question that is not necessary to answer it.
library("quanteda")
## Package version: 2.0.1
text <- "Ich mag Äpfel und Apfelkuchen"
toks <- tokens(text)
dict_1 <- dictionary(list(fruits = c("[aä]pfel*")))
dict_2 <- dictionary(list(fruits = c("[a-z]pfel*")))
tokens_lookup(toks, dict_1, valuetype = "regex", exclusive = FALSE)
## Tokens consisting of 1 document.
## text1 :
## [1] "Ich" "mag" "FRUITS" "und" "FRUITS"
tokens_lookup(toks, dict_2, valuetype = "regex", exclusive = FALSE)
## Tokens consisting of 1 document.
## text1 :
## [1] "Ich" "mag" "Äpfel" "und" "FRUITS"
Note: No need to import all of the tidyverse just to get %>%, as quanteda makes this available through re-export.
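If you'd rather stay with glob matching, one workaround (not spelled out in the answer above, but consistent with the asker's own suggestion) is to list each spelling as its own glob pattern instead of using a character class:

```r
library(quanteda)

toks <- tokens("Ich mag Äpfel und Apfelkuchen")

# one glob pattern per spelling; glob matching is case-insensitive
# by default, so "Äpfel" is caught by "äpfel*" and "Apfelkuchen" by "apfel*"
dict_glob <- dictionary(list(fruits = c("apfel*", "äpfel*")))
res <- tokens_lookup(toks, dict_glob, valuetype = "glob", exclusive = FALSE)
res
## text1 :
## [1] "Ich" "mag" "FRUITS" "und" "FRUITS"
```

This scales poorly if many characters can vary, in which case valuetype = "regex" is the cleaner choice.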

How to replace only specific groups in a match in R using stringr package?

In my R project I am using package stringr to perform regex operations.
text <- "My code #snippet wanna get this# is simple"
pattern <- "#([^ \t]+) (.+)#"
pattern looks for stuff inside #...#. The following code:
stringr::str_match_all(text, pattern)
Will give me the content of the groups I am targeting:
[[1]]
[,1] [,2] [,3]
[1,] "#snippet wanna get this#" "snippet" "wanna get this"
How do I replace the content of group 3 (and only that) with a different text? The final desired result would be:
My code #snippet REPLACED WITH THIS# is simple
I am playing with stringr::str_replace_all but I don't seem to get how to solve this issue. I keep replacing the whole match and not just a single group content.
You may capture what you need to keep and just match what you need to replace, use
> gsub("(#[^ \t#]+ )[^#]*(#)", "\\1REPLACED WITH THIS\\2", text)
[1] "My code #snippet REPLACED WITH THIS# is simple"
Details
(#[^ \t#]+ ) - Group 1: #, then any 1+ chars other than #, space and tab, and a space
[^#]* - 0+ chars other than #
(#) - Group 2: a # char
Another way: use gsubfn with a pattern where all your pattern parts are captured into separate groups and then rebuild the replacement after performing the required manipulations:
> gsubfn::gsubfn("(#[^ \t#]+ )([^#]*)(#)", function(x, y, z) paste0(x, "REPLACED WITH THIS", z), text)
[1] "My code #snippet REPLACED WITH THIS# is simple"
Here, the x, y and z refer to the groups defined in the pattern:
(#[^ \t#]+ )([^#]*)(#)
| --- x ---||- y -||z|
With stringr, you may - but you should be very careful with this - use a pattern with lookbehind/lookahead:
> stringr::str_replace_all(text, "(?<=#[^ \t#]{1,1000} )[^#]*(?=#)", "REPLACED WITH THIS")
[1] "My code #snippet REPLACED WITH THIS# is simple"
The (?<=#[^ \t#]{1,1000} ) lookbehind works because it matches a known length pattern (the {1,1000} says there can be from 1 to 1000 occurrences of any chars but space, tab and #), and this "constrained-width lookbehind" is supported since stringr uses ICU regex library.
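stringr also supports backreferences in the replacement string, so the gsub approach above carries over directly without any lookaround tricks; a minimal sketch:

```r
library(stringr)

text <- "My code #snippet wanna get this# is simple"

# capture the parts to keep (Groups 1 and 2) and refer to them
# with \\1 and \\2 in the replacement
out <- str_replace_all(text, "(#[^ \t#]+ )[^#]*(#)", "\\1REPLACED WITH THIS\\2")
out
# [1] "My code #snippet REPLACED WITH THIS# is simple"
```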

Regular expressions, stringr - have the regex, can't get it to work in R

I have a data frame with a field called "full.path.name"
This contains things like
s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx
01 GROUP is a pattern of variable size in the whole string.
I would like to add a new field onto the data frame called "short.path"
and it would contain things like
s:///01 GROUP
s:///02 GROUP LONGER NAME
I've managed to extract the last four characters of the file using stringr, I think I should use stringr again.
This gives me the file extension
sfiles$file_type<-as.factor(str_sub(sfiles$Type.of.file,-4))
I went to https://www.regextester.com/
and got this
s:///*.[^/]*
as the regex to use
so I tried it below
sfiles$file_path_short<-as.factor(str_match(sfiles$Full.path.name,regex("s:///*.[^/]*")))
What I thought I would get is a new field on my data frame containing
01 GROUP etc
I get NA
When I try this
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,"[S]")
Gives me S
Where am I going wrong?
When I use: https://regexr.com/
I get
\d* [A-Z]* [A-Z]*[^/]
How do I put that into
sfiles$file_path_short<-str_extract(sfiles$Full.path.name,\d* [A-Z]* [A-Z]*[^\/])
And make things work?
EDIT:
There are two solutions here.
The reason the solutions didn't work at first was because
sfiles$Full.path.name
was >255 in some cases.
What I did:
To make g_t_m's regex work
library(tidyverse)
#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
# add a field to calculate path length and filter out
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
# then use str_replace to take out the full path name and leave only the
# top folder names
sfiles$file_path_short <- as.factor(str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1"))
levels(sfiles$file_path_short)
[1] "S:///01 GROUP 1"
[2] "S:///02 GROUP 2"
[3] "S:///03 GROUP 3"
[4] "S:///04 GROUP 4"
[5] "S:///05 GROUP 5"
[6] "S:///06 GROUP 6"
[7] "S:///07 GROUP 7"
I think it was the full.path.name field that was causing problems.
To make Wiktor's answer work I did this:
#read the file
sfiles<-read.csv("H:/sdrive_files.csv", stringsAsFactors = F)
str(sfiles)
sfiles$file_path_length <- str_length(sfiles$Full.path.name)
sfiles<-sfiles%>%filter(file_path_length <=255)
sfiles$file_path_short <- str_replace(sfiles$Full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
You may use a mere
sfiles$file_path_short <- str_extract(sfiles$Full.path.name, "^s:///[^/]+")
If you plan to exclude s:/// from the results, wrap it within a positive lookbehind:
"(?<=^s:///)[^/]+"
See the regex demo
Details
^ - start of string
s:/// - a literal substring
[^/]+ - a negated character class matching any 1+ chars other than /.
(?<=^s:///) - a positive lookbehind that requires the presence of s:/// at the start of the string immediately to the left of the current location (but this value does not appear in the resulting matches since lookarounds are non-consuming patterns).
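As a quick sanity check, the same extraction also works in base R with regexpr/regmatches (the sample path below is made up for illustration):

```r
# base-R equivalent of str_extract(x, "^s:///[^/]+")
path  <- "s:///01 GROUP/01 SUBGROUP/document.docx"
short <- regmatches(path, regexpr("^s:///[^/]+", path))
short
# [1] "s:///01 GROUP"
```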
Firstly, I would amend your regex to extract the file extension, since file extensions are not always 4 characters long:
library(stringr)
df <- data.frame(full.path.name = c("s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx",
                                    "s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf"),
                 stringsAsFactors = F)
df$file_type <- str_replace(basename(df$full.path.name), "^.+\\.(.+)$", "\\1")
df$file_type
[1] "docx" "pdf"
Then, the following code should give you your short name:
df$file_path_short <- str_replace(df$full.path.name, "(^.+?/[^/]+?)/.+$", "\\1")
df
full.path.name file_type file_path_short
1 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.docx docx s:///01 GROUP
2 s:///01 GROUP/01 SUBGROUP/~$ document name has spaces.pdf pdf s:///01 GROUP

Regular Expression pattern using gsub in r- get a small pattern in the middle of a larger pattern from xml file

Hi everyone.
I am completely new to regex in R, and I ran into a problem when trying to retrieve a smaller pattern in the middle of a larger pattern in a tagged XML file.
Here, I have a three-word sequence, "reinforce the advantage", tagged with the BNC (British National Corpus) Basic (C5) Tagset. Specifically, I want to retrieve only the three lemmatized words that appear immediately after every "hw=" in this long sequence.
<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>
Can anyone please offer a possible solution with gsub or other functions in r? Many thanks in advance!
NF
vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"
m <- gregexpr("(?<=hw=)\\S+", vec, perl = T)
regmatches(vec, m)
# [[1]]
# [1] "reinforce" "the" "advantage"
copied from regex101.com
/
(?<=hw=)\S+
/
Positive Lookbehind (?<=hw=)
Assert that the Regex below matches
hw= matches the characters hw= literally (case sensitive)
\S+ matches any non-whitespace character (equal to [^\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible,
giving back as needed (greedy)
first unlist, then collapse with paste0:
paste0(unlist(
regmatches(vec, m)
), collapse = " ")
# [1] "reinforce the advantage"
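For completeness, a stringr equivalent of the base-R extraction above (a sketch, assuming the same vec; ICU supports the lookbehind used here):

```r
library(stringr)

vec <- "<w c5=VVI hw=reinforce pos=VERB>reinforce </w><w c5=AT0 hw=the pos=ART>the </w><w c5=NN2 hw=advantage pos=SUBST>advantages </w>"

# grab each run of non-whitespace immediately after "hw="
lemmas <- str_extract_all(vec, "(?<=hw=)\\S+")[[1]]
lemmas
# [1] "reinforce" "the"       "advantage"
```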

Return number from string

I'm trying to extract the "Number" of "Humans" in the string below, for example:
string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
The position of the text in the string will constantly change, so I need R to search the string and find "Species|Human|Number|" and return 1.
Apologies if this is a duplicate of another thread, but I've looked here (extract a substring in R according to a pattern) and here (R extract part of string). But I'm not having any luck.
Any ideas?
Use a capturing approach - capture 1 or more digits (\d+) after the known substring (just escape the | symbols):
> string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
> pattern = "Species\\|Human\\|Number\\|(\\d+)"
> unlist(regmatches(string,regexec(pattern,string)))[2]
[1] "1"
A variation is to use a PCRE regex with regmatches/regexpr
> pattern="(?<=Species\\|Human\\|Number\\|)\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
Here, the left side context is put inside a non-consuming pattern, a positive lookbehind, (?<=...).
The same functionality can be achieved with \K operator:
> pattern="Species\\|Human\\|Number\\|\\K\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
Simplest way I can think of:
as.integer(gsub("^.+Species\\|Human\\|Number\\|(\\d+).+$", "\\1", string))
It will introduce NAs where there is no mention of Species|Human|Number. Also, there will be artefacts if any of the strings is a number (but I assume that this won't be an issue)
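If you prefer stringr, the same capturing approach can be written with str_match, which returns the full match in column 1 and each capture group in the following columns:

```r
library(stringr)

string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")

# column 2 holds the digits captured by (\\d+)
num <- str_match(string, "Species\\|Human\\|Number\\|(\\d+)")[, 2]
num
# [1] "1"
```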
