How to extract text inside the brackets in R? - r

How can I extract all brackets which include a name AND a year?
string="testo(antonio.2018).testo(antonio).testo(giovanni,2018).testo(2018),testo(libero 2019)"
The desired output would look like this:
"(antonio.2018)" "(giovanni,2018)" "(libero 2019)"
I do not want to extract (2018) or (antonio).

You can use str_extract_all from the stringr package with this regex pattern:
stringr::str_extract_all(string,
"\\(\\w+([[:punct:]]{1}|[[:blank:]]{1})[[:digit:]]+\\)")
# [[1]]
# [1] "(antonio.2018)" "(giovanni,2018)" "(libero 2019)"
A short description of the regex:
\\w matches any word character (letter, digit or underscore)
+ means the preceding element has to be matched at least once
[[:punct:]] matches any punctuation character
{1} requires exactly one occurrence
(....|....) indicates that one pattern OR the other has to be matched
[[:blank:]] matches a space or a tab
[[:digit:]] matches any digit
\\( and \\) match literal parentheses, which have to be escaped.
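As a minimal sketch of the same idea in base R (no stringr needed), the alternation between [[:punct:]] and [[:blank:]] can be folded into a single bracket expression. The variable names pattern and hits are only illustrative:

```r
string <- "testo(antonio.2018).testo(antonio).testo(giovanni,2018).testo(2018),testo(libero 2019)"

# One bracket expression replaces ([[:punct:]]{1}|[[:blank:]]{1}):
# it matches exactly one punctuation character, space or tab.
pattern <- "\\(\\w+[[:punct:][:blank:]][[:digit:]]+\\)"

# gregexpr() finds all matches, regmatches() extracts them.
hits <- regmatches(string, gregexpr(pattern, string))[[1]]
print(hits)
```

The brackets (antonio) and (2018) are skipped because neither contains word characters, then a separator, then digits.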

@loki's answer is great! You can also try this; I hope it works for you :)
x <- regmatches(string, gregexpr("(?=\\().*?(?<=\\))", string, perl = TRUE))[[1]]
x
# [1] "(antonio.2018)" "(antonio)" "(giovanni,2018)" "(2018)" "(libero 2019)"
# Extract every second value, starting with the first.
x[seq_along(x) %% 2 > 0]
# [1] "(antonio.2018)" "(giovanni,2018)" "(libero 2019)"
Note: I am unsure of your complete dataset, i.e. whether the matches you want will always be every second value. If they are, this will work at scale.
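If the wanted brackets do not reliably sit at every second position, one alternative is to filter by content instead of by index. The sketch below assumes that "a name AND a year" can be approximated as "contains at least one letter and at least one digit"; the helper names has_letter, has_digit and kept are made up for illustration:

```r
string <- "testo(antonio.2018).testo(antonio).testo(giovanni,2018).testo(2018),testo(libero 2019)"

# Extract every bracketed chunk: "(", anything but ")", then ")".
x <- regmatches(string, gregexpr("\\([^)]*\\)", string))[[1]]

# Keep only chunks that contain both a letter and a digit.
has_letter <- grepl("[[:alpha:]]", x)
has_digit <- grepl("[[:digit:]]", x)
kept <- x[has_letter & has_digit]
print(kept)
```

This does not depend on the order in which the brackets appear in the string.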

Related

regex to find the position of the first four concurrent unique values

I've solved 2022 Advent of Code day 6, but was wondering if there was a regex way to find the first occurrence of 4 non-repeating characters:
From the question:
bvwbjplbgvbhsrlpgdmjqwftvncz
# "bvwb" (characters 1 to 4): discard, as the letter b repeats
# "vwbj" (characters 2 to 5): match the 5th character, which marks the end of the first four-character block with no repeating characters
In R I've tried:
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_match("(.*)\1", txt)
But I'm having no luck
You can use
stringr::str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
See the regex demo. Here, each (.) captures any char into consecutively numbered groups, and the (?!...) negative lookaheads make sure each subsequent . does not match the already captured char(s).
See the R demo:
library(stringr)
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
## => [1] "vwbj"
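Note that stringr::str_match (like stringr::str_extract) takes the input string as its first argument and the regex as its second, so the arguments in your attempt are swapped. The backslash of a backreference must also be doubled inside an R string literal (\\1).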
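Since the question asks for a position rather than the matched text, here is a minimal sketch that derives the end position of the match. It uses base R's regexpr with perl = TRUE instead of stringr, and the names m and marker_end are only illustrative:

```r
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
pattern <- "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)"

# regexpr() returns the start of the first match; its length is an attribute.
m <- regexpr(pattern, txt, perl = TRUE)

# End position = start + length - 1.
marker_end <- as.integer(m) + attr(m, "match.length") - 1L
print(marker_end)
```

For this input the match "vwbj" starts at character 2, so marker_end is 5.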

Regex for alternating pattern between two character groups

I'm trying to find matches where the pattern alternates between two character groups, D\E and R\K\H.
The pattern I've come up with (through reading other posts on here) is
(([DE](?=[RKH])*)|(([RKH])(?=[DE])*))+
Using this pattern with this test string: DREDRDRDRARDK
I get the following matches: DR, DRDRD, RD
I want: DRE, DRDRDR, RDK
The matches are missing the last letter for each group.
Please could someone help me figure out why.
Match the first group followed by the second, repeated one or more times and then optionally followed by the first group, i.e. ([DE][RKH])+[DE]?; or the same with the groups interchanged, i.e. ([RKH][DE])+[RKH]?; or just a single character from the first group, i.e. [DE]; or just one from the second group, i.e. [RKH]:
library(gsubfn)
x <- "DREDRDRDRARDK" # input
rx <- "(([DE][RKH])+[DE]?|([RKH][DE])+[RKH]?|[DE]|[RKH])"
strapply(x, rx)
## [[1]]
## [1] "DRE" "DRDRDR" "RDK"
In your pattern, you repeatedly match a single character from one of the two character classes, followed by a positive lookahead which asserts that a character from the other class is directly to the right.
(Note that the lookahead should not be optionally repeated as in (?=[RKH])*, or else it will always be true and match too much.)
If the quantifier * is not present after the lookahead, you get your matches with the missing characters.
The matches are missing the last letter of each group because, when [DE] is matched, the positive lookahead asserts that what is directly to the right is [RKH] (and the other way around for the other alternative):
It does not match the E in DRE because, when matching E, the lookahead asserts one of [RKH] after it, which is not the case
It does not match the last R in DRDRDR because the character following it is A, which is not one of [DE]
As the positive lookahead asserts that a next character is present, you also don't match the last K because there is no character after it
As already answered, you can repeatedly match the pairs of character classes followed by optionally matching the first character class after it.
Without the groups, I think it could also be shortened to:
(?:[DE][RKH])+[DE]?|(?:[RKH][DE])+[RKH]?
Regex demo
library(stringr)
str_extract_all("DREDRDRDRARDK", "(?:[DE][RKH])+[DE]?|(?:[RKH][DE])+[RKH]?")
Output
[[1]]
[1] "DRE" "DRDRDR" "RDK"
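To make the lookahead explanation above concrete, this sketch runs both variants side by side in base R with perl = TRUE. The lookahead pattern is the question's pattern with the * after each lookahead removed and with non-capturing groups; the variable names are made up for illustration:

```r
x <- "DREDRDRDRARDK"

# Each character is only matched if a character of the other class follows it.
with_lookahead <- "(?:[DE](?=[RKH])|[RKH](?=[DE]))+"
# Pairs are consumed directly, then one trailing character is allowed.
pairs_pattern <- "(?:[DE][RKH])+[DE]?|(?:[RKH][DE])+[RKH]?"

short_matches <- regmatches(x, gregexpr(with_lookahead, x, perl = TRUE))[[1]]
full_matches <- regmatches(x, gregexpr(pairs_pattern, x, perl = TRUE))[[1]]

print(short_matches)  # each match lacks its last letter
print(full_matches)
```

The first result reproduces the truncated matches from the question, the second the desired ones.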

Stringr function or gsub() to find an x digit string and extract first x digits?

Regex and stringr newbie here. I have a data frame with a column from which I want to find 10-digit numbers and keep only the first three digits. Otherwise, I want to just keep whatever is there.
So to make it easy let's just pretend it's a simple vector like this:
new<-c("111", "1234567891", "12", "12345")
I want to write code that will return a vector with elements: 111, 123, 12, and 12345. I also need to write code (I'm assuming I'll do this iteratively) where I extract the first two digits of a 5-digit string, like the last element above.
I've tried:
gsub("\\d{10}", "", new)
but I don't know what I could put for the replacement argument to get what I'm looking for. Also tried:
str_replace(new, "\\d{10}", "")
But again I don't know what to put in for the replacement argument to get just the first x digits.
Edit: I disagree that this is a duplicate question because it's not just that I want to extract the first X digits from a string but that I need to do that with specific strings that match a pattern (e.g., 10 digit strings.)
If you are willing to use the stringr library, which provides the str_replace you are already using, just use str_extract:
library(stringr)
vec <- c(111, 1234567891, 12)
str_extract(vec, "^\\d{1,3}")
The regex ^\\d{1,3} matches at least 1 and at most 3 digits at the very beginning of the string. str_extract, as the name implies, extracts and returns these matches. Note that this keeps the first three digits of every element, not only of the 10-digit ones, so "12345" would become "123".
You may use
new<-c("111", "1234567891", "12")
sub("^(\\d{3})\\d{7}$", "\\1", new)
## => [1] "111" "123" "12"
See the R online demo and the regex demo.
Details
^ - start of string anchor
(\d{3}) - Capturing group 1 (this value is accessed using \1 in the replacement pattern): three digit chars
\d{7} - seven digit chars
$ - end of string anchor.
So, the sub command only matches strings that are composed only of 10 digits, captures the first three into a separate group, and then replaces the whole string (as it is the whole match) with the three digits captured in Group 1.
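The question also asks for a second pass that keeps the first two digits of 5-digit strings. One way to sketch both passes is a small wrapper around the same sub call. The function keep_prefix and its arguments total and keep are hypothetical names, not part of any package:

```r
# Hypothetical helper: for strings made of exactly `total` digits,
# keep only the first `keep` digits; leave everything else untouched.
keep_prefix <- function(x, total, keep) {
  pattern <- sprintf("^(\\d{%d})\\d{%d}$", keep, total - keep)
  sub(pattern, "\\1", x)
}

new <- c("111", "1234567891", "12", "12345")
step1 <- keep_prefix(new, total = 10, keep = 3)   # 10-digit strings -> 3 digits
step2 <- keep_prefix(step1, total = 5, keep = 2)  # 5-digit strings -> 2 digits
print(step1)
print(step2)
```

Because the pattern is anchored with ^ and $, strings of any other length pass through unchanged.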
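You can use (note that this truncates every element to its first three characters, not only the 10-digit ones):
my_vec <- c("111", "1234567891", "12")
as.numeric(substring(my_vec, 1, 3))
#[1] 111 123 12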

Finding Abbreviations in Data with R

In my data (which is text), there are abbreviations.
Are there any functions or code that search for abbreviations in text? For example, detecting 3-4-5 capital letter abbreviations and letting me count how often they happen.
Much appreciated!
detecting 3-4-5 capital letter abbreviations
You may use
\b[A-Z]{3,5}\b
See the regex demo
Details:
\b - a word boundary
[A-Z]{3,5} - 3, 4 or 5 capital letters (use [[:upper:]] to match letters other than ASCII, too)
\b - a word boundary.
R demo online (leveraging the regex occurrence count code from @TheComeOnMan)
abbrev_regex <- "\\b[A-Z]{3,5}\\b"
x <- "XYZ was seen at WXYZ with VWXYZ and did ABCDEFGH."
sum(gregexpr(abbrev_regex,x)[[1]] > 0)
## => [1] 3
regmatches(x, gregexpr(abbrev_regex, x))[[1]]
## => [1] "XYZ" "WXYZ" "VWXYZ"
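To count how often each individual abbreviation occurs, the extracted matches can be passed to table(). This is only a sketch: the sample sentences are made up, and perl = TRUE is an assumption added here so that \b is handled by PCRE:

```r
abbrev_regex <- "\\b[A-Z]{3,5}\\b"

# Made-up sample text, one element per document.
x <- c("NASA and ESA met; NASA led.", "The FBI called NASA.")

# Collect all matches across all elements into one character vector.
found <- unlist(regmatches(x, gregexpr(abbrev_regex, x, perl = TRUE)))

# Frequency table, most common abbreviation first.
counts <- sort(table(found), decreasing = TRUE)
print(counts)
```

Here NASA is counted three times, while ESA and FBI are counted once each.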
You can use the regular expression [A-Z] to match any occurrence of a capital letter. If you want this pattern to be repeated 3 times, add the quantifier {3}, i.e. [A-Z]{3}. A range quantifier such as [A-Z]{3,5} covers 3 to 5 repetitions in a single pattern, so no loop is needed.

R Remove specific character with range of possible positions within string

I would like to remove the character 'V' (always the last one in the strings) from the following vector containing a large number of strings. They look similar to the following example:
str <- c("VDM 000 V2.1.1",
"ABVC 001 V10.15.0",
"ASDV 123 V1.20.0")
I know that it is always the last 'V', I would like to remove.
I also know that this character is either the sixth, seventh or eighth last character within these strings.
I was not really able to come up with a nice solution. I know that I have to use sub or gsub but I can only remove all V's rather than only the last one.
Has anyone got an idea?
Thank you!
This regex pattern is written to match a "V" that is then followed by 5 to 7 other non-"V" characters. The "[...]" construct is a "character class", and within such constructs a leading "^" causes negation. The "{...}" construct takes two numbers specifying the minimum and maximum number of repetitions, and the "$" matches the zero-length end of the string, which I think is what you meant when you wrote "sixth, seventh or eighth last character":
sub("(V)([^V]{5,7})$", "\\2", str)
[1] "VDM 000 2.1.1" "ABVC 001 10.15.0" "ASDV 123 1.20.0"
Since you only wanted a single substitution I used sub instead of gsub.
You can use:
gsub("V(\\d+\\.\\d+\\.\\d+)$", "\\1", str)
##[1] "VDM 000 2.1.1" "ABVC 001 10.15.0" "ASDV 123 1.20.0"
The regex V(\\d+\\.\\d+\\.\\d+)$ matches the "version" consisting of the character "V" followed by three sets of digits (i.e., \\d+) separated by two literal "." (escaped as \\., since an unescaped . matches any character) at the end of the string (i.e., $). The parentheses around \\d+\\.\\d+\\.\\d+ provide a group within the match that can be referenced by \\1. Therefore, gsub will replace the whole match with the group, thereby removing that "V".
Since you know it's the last V you want to remove from the string, try this regex V(?=[^V]*$):
gsub("V(?=[^V]*$)", "", str, perl = TRUE)
# [1] "VDM 000 2.1.1" "ABVC 001 10.15.0" "ASDV 123 1.20.0"
The regex matches a V only if it is followed by [^V]*$, i.e. by nothing but non-V characters up to the end of the string, which guarantees that the matched V is the last V in the string.
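The same "last V" idea can also be sketched without a lookahead, and therefore without perl = TRUE, by capturing the tail and putting it back. The vector is named versions here rather than str only to avoid masking the base function str(); the other names are illustrative as well:

```r
versions <- c("VDM 000 V2.1.1",
              "ABVC 001 V10.15.0",
              "ASDV 123 V1.20.0")

# Capture everything after the last V and use it as the replacement.
no_lookahead <- sub("V([^V]*)$", "\\1", versions)

# Lookahead variant from the answer above, for comparison.
with_lookahead <- sub("V(?=[^V]*$)", "", versions, perl = TRUE)

print(no_lookahead)
print(identical(no_lookahead, with_lookahead))
```

Both variants only touch the final V, so any earlier V in the string (as in "VDM") is left alone.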
