Regex - Capturing repeating groups excluding surrounding partially matching content - r

For the record, I am using R, but the queries I have are platform independent (as it stands) so I'll demo with regex101. I am attempting to capture repeated groups that may or may not be surrounded by other text. So the ideal behaviour is shown in this demo:
demo1
regex: (\d{2})(AB)
text: blahblah11AB12AB13ABblah
So it nicely captures all the groups I want:
Match 1
Full match 8-12 `11AB`
Group 1. 8-10 `11`
Group 2. 10-12 `AB`
Match 2
Full match 12-16 `12AB`
Group 1. 12-14 `12`
Group 2. 14-16 `AB`
Match 3
Full match 16-20 `13AB`
Group 1. 16-18 `13`
Group 2. 18-20 `AB`
However, if I include another piece of matched text, it captures that as well (which is fair enough I suppose)
text: blahblah11AB12AB13ABblah22AB
returns the same but with the extra group:
Match 4
Full match 24-28 `22AB`
Group 1. 24-26 `22`
Group 2. 26-28 `AB`
demo2
What I want to do is capture the first group but disregard all other text, even if there is a subsequent match. In essence, I want to get just the three matches from this text: blahblah11AB12AB13ABblah22AB
I have tried a number of things, such as this:
(((\d{2})(AB))+)(.*)
But then I get the following, which loses all both the last group capture:
Demo 3
Match 1
Full match 8-28 `11AB12AB13ABblah22AB`
Group 1. 8-20 `11AB12AB13AB`
Group 2. 16-20 `13AB`
Group 3. 16-18 `13`
Group 4. 18-20 `AB`
Group 5. 20-28 `blah22AB`
I need something which retains the repeated groups. Am stumped!
In R, the output should look like this:
[[1]]
[,1] [,2] [,3]
[1,] "11AB" "11" "AB"
[2,] "12AB" "12" "AB"
[3,] "13AB" "13" "AB"
Thanks in advance...

An idea would be to use \G for chaining matches to ^ start and reset by \K after.
(?:^.*?\K|\G)(\d{2})(AB)
^.*?\K will match any amount of any characters lazily before the first match
|\G or continue at the end of previous match which can be: start, first, previous
See your updated demo
This will match the first chain of matches and is a pcre pattern (perl=TRUE).
If there can only be non-digits before first match, use ^\D*\K instead of ^.*?\K.

You could use the quantifier {3} to get just the 3 firsts groups: (((\d{2})(AB)){3}). See demo

If I understand it correctly, the problem is in the placement of the parenthesis.
pattern <- "(\\d{2}AB)"
s <- "blahblah11AB12AB13ABbla"
m <- gregexpr(pattern, s)
regmatches(s, m)
#[[1]]
#[1] "11AB" "12AB" "13AB"
s2 <- "blahblah11AB12AB13ABblah22AB"
s3 <- "11AB12AB13ABblah22AB"
S <- c(s, s2, s3)
m <- gregexpr(pattern, S)
regmatches(S, m)
#[[1]]
#[1] "11AB" "12AB" "13AB"
#
#[[2]]
#[1] "11AB" "12AB" "13AB" "22AB"
#
#[[3]]
#[1] "11AB" "12AB" "13AB" "22AB"
Note that many times this is run in one code line only. I have left it like this to make it more clear.
EDIT.
Maybe the following does what the OP is asking for.
I bet there are better solutions, it seems to me that two regex's are an overkill.
pattern <- "((\\d{2}AB)+)([^[:digit:]AB]+(\\d{2}AB))"
pattern2 <- "(\\d{2}AB)"
m <- gregexpr(pattern2, gsub(pattern, "\\1", S))
regmatches(S, m)
#[[1]]
#[1] "11AB" "12AB" "13AB"
#
#[[2]]
#[1] "11AB" "12AB" "13AB"
#
#[[3]]
#[1] "11AB" "12AB" "13AB"

Related

R - Extract information from string following a general format

This is a complete re-write of my original question in an attempt to clarify it and make it as answerable as possible. My objective is to write a function which takes a string as input and returns the information contained therein in tabular format. Two examples of the kind of character strings the function will face are the following
s1 <- " 9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167***\r"
s2 <- " 10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0\r"
(For those who had read my original question, these are smaller strings for simplicity.)
The required output would be:
Rank Code Name Club Class Time Points
9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167
10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0
I have managed to split the string based on where there's a blank space using:
strsplit(s1, " ")[[1]][strsplit(s1, " ")[[1]] != ""]
although a more elegant solution was given by G. Grothendieck in the comments below using:
unlist(strsplit(trimws(s1), " +"))
This results in
"9" "9875" "Γεωργίου" "Άγγελος" "Δημήτρης" "ΑΒ/Γ" "Π/Π" "Β" "00:54:05" "167***\r"
However, this is still problematic as "Γεωργίου" "Άγγελος" and "Δημήτρης" should be combined into "Γεωργίου Άγγελος Δημήτρης" (note that the number of elements could be two OR three) and the same applies to "Π/Π" "Β" which should be combined into "Π/Π Β".
The question
How can I use the additional information that I have, namely:
The order of the elements will always be the same
The Name data will consist of two or three words
The Club data (i.e. ΑΒ/Γ in s1 and ΔΕΖ in s2) will come from a pre-defined list of clubs (e.g. stored in a character vector named sClub)
The Class data (i.e. Π/Π Β in s1 and N in s2) will come from a pre-defined list of classes (e.g. stored in a character vector named sClass)
The Points data will always contain "\r" and won't contain any spaces.
to produce the required output above?
Defining
sClub <- c("ΑΒ/Γ", "ΔΕΖ")
sClass <- c("Π/Π Β", "N")
we may do
library(stringr)
myfun <- function(s)
gsub("\\*", "", trimws(str_match(s, paste0("^\\s*(\\d+)\\s*?(\\w+)\\s*?([\\w ]+)\\s*(", paste(sClub, collapse = "|"),")\\s*(", paste(sClass, collapse = "|"), ")(.*?)\\s*([^ ]*\r)"))[, -1]))
sapply(list(s1, s2), myfun)
# [,1] [,2]
# [1,] "9" "10"
# [2,] "9875" "8954F"
# [3,] "Γεωργίου Άγγελος Δημήτρης" "Smith John"
# [4,] "ΑΒ/Γ" "ΔΕΖ"
# [5,] "Π/Π Β" "N"
# [6,] "00:54:05" "ΔΕΝ ΕΚΚΙΝΗΣΕ"
# [7,] "167" "0"
The way it works is just taking into account all your additional information and constructing a long regex. It finishes with erasing * and removing leading/trailing whitespace.

Extracting multiple substrings that come after certain characters in a string using stringi in R

I have a large dataframe in R that has a column that looks like this where each sentence is a row
data <- data.frame(
datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations"),
stringsAsFactors=FALSE)
I want to extract all the words that come after "wiki/" and put them in another column
So for the first row it should come out with "political_philosophy self-governance"
The second row should look like "hierarchy free_association_(communism_and_anarchism)"
The third row should be "state_(polity)"
And the fourth row should be "anti-statism"
I definitely want to use stringi because it's a huge dataframe. Thanks in advance for any help.
I've tried
stri_extract_all_fixed(data$datalist, "wiki")[[1]]
but that just extracts the word wiki
You can do this with a regex. By using stri_match_ instead of stri_extract_ we can use parentheses to make matching groups that let us extract only part of the regex match. In the result below, you can see that each row of df gives a list item containing a data frame with the whole match in the first column and each matching group in the following columns:
match <- stri_match_all_regex(df$datalist, "wiki/([\\w-()]*)")
match
[[1]]
[,1] [,2]
[1,] "wiki/political_philosophy" "political_philosophy"
[2,] "wiki/self-governance" "self-governance"
[[2]]
[,1] [,2]
[1,] "wiki/stateless_society" "stateless_society"
[2,] "wiki/hierarchy" "hierarchy"
[3,] "wiki/free_association_(communism_and_anarchism)" "free_association_(communism_and_anarchism)"
[[3]]
[,1] [,2]
[1,] "wiki/state_(polity)" "state_(polity)"
[[4]]
[,1] [,2]
[1,] "wiki/anti-statism" "anti-statism"
You can then use apply functions to make the data into any form you want:
match <- stri_match_all_regex(df$datalist, "wiki/([\\w-()]*)")
sapply(match, function(x) paste(x[,2], collapse = " "))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
You can use a lookbehind in the regex.
library(dplyr)
library(stringi)
text <- c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations")
df <- data.frame(text, stringsAsFactors = FALSE)
df %>%
mutate(words = stri_extract_all(text, regex = "(?<=wiki\\/)\\S+"))
You may use
> trimws(gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
See the online R code demo.
Details
wiki/(\\S+) - matches wiki/ and captures 1+ non-whitespace chars into Group 1
| - or
(?:(?!wiki/\\S).)+ - a tempered greedy token that matches any char, other than a line break char, 1+ occurrences, that does not start a wiki/+a non-whitespace char sequence.
If you need to get rid of redundant whitespace inside the result you may use another call to gsub:
> gsub("^\\s+|\\s+$|\\s+(\\s)", "\\1", gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"

Delete duplicate elements in String in R

I've got some problems deleting duplicate elements in a string.
My data look similar to this:
idvisit path
1 1,16,23,59
2 2,14,14,19
3 5,19,23,19
4 10,10
5 23,23,27,29,23
I have a column containing an unique ID and a column containing a path for web page navigation.
The right column contains some cases, where pages just were reloaded and the page were tracked twice or even more.
The pages are separated with commas and are saved as factors.
My problem is, that I don't want to have multiple pages in a row, so the data should look like this.
idvisit path
1 1,16,23,59
2 2,14,19
3 5,19,23,19
4 10
5 23,27,29,23
The multiple pages next to each other should be removed. I know how to delete a specific multiple number using regexpressions, but I have about 20.000 different pages and can't do this for all of them.
Does anyone have a solution or a hint, for my problem?
Thanks
Sebastian
We can use tidyverse. Use the separate_rows to split the 'path' variable by the delimiter (,) to convert to a long format, then grouped by 'idvisit', we paste the run-length-encoding values
library(tidyverse)
separate_rows(df1, path) %>%
group_by(idvisit) %>%
summarise(path = paste(rle(path)$values, collapse=","))
# A tibble: 5 × 2
# idvisit path
# <int> <chr>
#1 1 1,16,23,59
#2 2 2,14,19
#3 3 5,19,23,19
#4 4 10
#5 5 23,27,29,23
Or a base R option is
df1$path <- sapply(strsplit(df1$path, ","), function(x) paste(rle(x)$values, collapse=","))
NOTE: If the 'path' column is factor class, convert to character before passing as argument to strsplit i.e. strsplit(as.character(df1$path), ",")
Using stringr package, with function: str_replace_all, I think it gets what you want using the following regular expression: ([0-9]+),\\1and then replace it with \\1 (we need to scape the \ special character):
library(stringr)
> str_replace_all("5,19,23,19", "([0-9]+),\\1", "\\1")
[1] "5,19,23,19"
> str_replace_all("10,10", "([0-9]+),\\1", "\\1")
[1] "10"
> str_replace_all("2,14,14,19", "([0-9]+),\\1", "\\1")
[1] "2,14,19"
You can use it in a array form: x <- c("5,19,23,19", "10,10", "2,14,14,19") then:
str_replace_all(x, "([0-9]+),\\1", "\\1")
[1] "5,19,23,19" "10" "2,14,19"
or using sapply:
result <- sapply(x, function(x) str_replace_all(x, "([0-9]+),\\1", "\\1"))
Then:
> result
5,19,23,19 10,10 2,14,14,19
"5,19,23,19" "10" "2,14,19"
Notes:
The first line is the attribute information:
> str(result)
Named chr [1:3] "5,19,23,19" "10" "2,14,19"
- attr(*, "names")= chr [1:3] "5,19,23,19" "10,10" "2,14,14,19"
If you don't want to see them (it does not affect the result), just do:
attributes(result) <- NULL
Then,
> result
[1] "5,19,23,19" "10" "2,14,19"
Explanation about the regular expression used: ([0-9]+),\\1
([0-9]+): Starts with a group 1 delimited by () and finds any digit (at least one)
,: Then comes a punctuation sign: , (we can include spaces here, but the original example only uses this character as delimiter)
\\1: Then comes an identical string to the group 1, i.e.: the repeated number. If that doesn't happen, then the pattern doesn't match.
Then if the pattern matches, it replaces it, with the value of the variable \\1, i.e. the first time the number appears in the pattern matched.
How to handle more than one duplicated number, for example 2,14,14,14,19?:
Just use this regular expression instead: ([0-9]+)(,\\1)+, then it matches when at least there is one repetition of the delimiter (right) and the number. You can try other possibilities using this regex101.com (in MHO it more user friendly than other online regular expression checkers).
I hope this would work for you, it is a flexible solution, you just need to adapt it with the pattern you need.

Filter character vector based on first two elements

I have a vector that look like this:
data <- c("0115", "0159", "0256", "0211")
I want to filter the data based on the first 2 elements of my vector. For example:
group 1 - elements that start with 01
group 2 - elements that start with 02
Any idea how to accomplish this?
You might want to use Regular Expression (regex) to find strings that start with "01" or "02".
Base approach is use grep(), which returns indices of strings that match a pattern. Here's an example - notice I've changed the 2nd and 4th data elements to demonstrate how just searching for "01" or "02" will lead to incorrect answer:
d <- c("0115", "0102", "0256", "0201")
grep("01", d)
#> [1] 1 2 4
d[grep("01", d)]
#> [1] "0115" "0102" "0201"
Because this searches for "01" anywhere, you get "0201" in the mix. To avoid, add "^" to the pattern to specify that the string starts with "01":
grep("^01", d)
#> [1] 1 2
d[grep("^01", d)]
#> [1] "0115" "0102"
If you use the stringr package, you can also use str_detect() in the same way:
library(stringr)
d[str_detect(d, "^01")]
#> [1] "0115" "0102"

Fix the order of strings that have both letter and number components

I have string data like below.
a <- c("53H", "H26","14M","M47")
##"53H" "H26" "14M" "M47"
I want to fix the numbers and letters in a certain order such that
the numbers goes first, the letters goes second, or the other way around.
How can I do it?
##"53H" "26H" "14M" "47M"
or
##"H53" "H26" "M14" "M47"
You can extract the numbers and letters separately with gsub, then use paste0
to put them in any order you like.
a <- c("53H", "H26","14M","M47")
( nums <- gsub("[^0-9]", "", a) ) ## extract numbers
# [1] "53" "26" "14" "47"
( lets <- gsub("[^A-Z]", "", a) ) ## extract letters
# [1] "H" "H" "M" "M"
Numbers first answer:
paste0(nums, lets)
# [1] "53H" "26H" "14M" "47M"
Letters first answer:
paste0(lets, nums)
# [1] "H53" "H26" "M14" "M47"
You can capture the relevant parts in groups using () and then backreference them using gsub:
a <- c("53H", "H26","14M","M47")
gsub("^([0-9]+)([A-Z]+)$", "\\2\\1", a)
# [1] "H53" "H26" "M14" "M47"
This is like saying "Find a group of numbers at the start of my string and capture them in a group (^([0-9]+)). Then find the group of letters that go on to the end of my string and capture them in a second group (([A-Z]+)). That's my search pattern. Next, replace it such that the second group (referred to by \\2) is returned first and the first group (referred to by \\1) is returned second).
From Ananda Mahto's answer, you can order the number first and letter second using the following code:
gsub("^([A-Z]+)([0-9]+)$", "\\2\\1", a)
because you want to capture the strings which start with a letter (^([A-Z]+)), then capture the group of numbers ( ([0-9]+)$ )/

Resources