Finding second space after each comma - r

This is a follow-up to this question: Concatenate previous and latter words to a word that match a condition in R
I am looking for a regex that splits the string at the second space that occurs after each comma. Look at the example below:
vector <- c("Paulsen", "Kehr,", "Diego",
"Schalper", "Sepúlveda,", "Alejandro",
"Von Housen", "Kush,", "Terry")
X <- paste(vector, collapse = " ")
X
## this is the string I am looking to split:
"Paulsen Kehr, Diego Schalper Sepúlveda, Alejandro Von Housen Kush, Terry"
The second space after each comma is the split criterion for my regex. So, my output will be:
"Paulsen Kehr, Diego"
"Schalper Sepúlveda, Alejandro"
"Von Housen Kush, Terry"
I came up with a pattern but it is not quite working.
[^ ]+ [^ ]+, [^ ]+( )
Using it with strsplit removes all the matched words instead of splitting only at group 1 (i.e. [^ ]+ [^ ]+, [^ ]+(group-1)). I think I just need to exclude the full match and match only the space that follows it.
strsplit(X, "[^ ]+ [^ ]+, [^ ]+( )")
# [1] "" [2] "" [3] "Von Housen Kush, Terry"
Can anyone think of a regex for finding the second space after each comma?

You may use
> strsplit(X, ",\\s+\\S+\\K\\s+", perl=TRUE)
[[1]]
[1] "Paulsen Kehr, Diego" "Schalper Sepúlveda, Alejandro" "Von Housen Kush, Terry"
Details
, - a comma
\s+ - 1+ whitespaces
\S+ - 1+ non-whitespaces
\K - match reset operator discarding all text matched so far
\s+ - 1+ whitespaces
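The same split can also be reached without \K. The following is only a sketch, assuming the input never contains a newline: gsub keeps the comma and the word after it through a capture group and turns the following whitespace into a newline marker, and strsplit then splits on that marker. The names marked and parts are illustrative.

```r
# Input string from the question
X <- "Paulsen Kehr, Diego Schalper Sepúlveda, Alejandro Von Housen Kush, Terry"

# Keep ", <word>" (group 1) and replace the whitespace after it with a newline.
# Assumption: the input itself contains no newline characters.
marked <- gsub("(,\\s+\\S+)\\s+", "\\1\n", X, perl = TRUE)

# Split on the marker as a fixed string
parts <- strsplit(marked, "\n", fixed = TRUE)[[1]]
print(parts)
```

The \K version does this in one step; the marker approach may be easier to read when \K is unfamiliar.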

Related

How do I extract text between two characters in R

I'd like to extract text between two strings for all occurrences of a pattern. For example, I have this string:
x<- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
I'd like to extract the words ATLANTA and LAS VEGAS as such:
[1] "ATLANTA" "LAS VEGAS"
I tried using gsub(".*CITY:\\s|\n","",x). The output this yields is:
[1] " LAS VEGAS"
I would like to output both cities (some patterns in the data include more than 2 cities) and to output them without the leading space.
I also tried the qdapRegex package but could not get close. I am not that good with regular expressions so help would be much appreciated.
You may use
> unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl=TRUE)))
[1] "ATLANTA" "LAS VEGAS"
Here, CITY:\s*\K.* regex matches
CITY: - a literal substring CITY:
\s* - 0+ whitespaces
\K - match reset operator that discards the text matched so far (zeros the current match memory buffer)
.* - any 0+ chars other than line break chars, as many as possible.
Note that since it is a PCRE regex, perl=TRUE is indispensable.
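Because \s* accepts any amount of whitespace before \K resets the match, the pattern should not depend on how many spaces follow CITY:. Here is a small sketch of that, where extract_cities is a hypothetical helper name and the three-space input x3 is an assumed variation of the question's data:

```r
# Two variants of the input: one space vs. three spaces after each label
# (the three-space variant is an assumed variation, not taken from the question)
x1 <- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"
x3 <- "\nTYPE:   School\nCITY:   ATLANTA\n\n\nCITY:   LAS VEGAS\n\n"

extract_cities <- function(x) {
  # \s* consumes the whitespace after "CITY:", \K drops everything matched
  # so far from the result, and .* takes the rest of the line
  unlist(regmatches(x, gregexpr("CITY:\\s*\\K.*", x, perl = TRUE)))
}

print(extract_cities(x1))
print(extract_cities(x3))
```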
Another option:
library(stringr)
str_extract_all(x, "(?<=CITY:\\s{3}).+(?=\\n)")
[[1]]
[1] "ATLANTA" "LAS VEGAS"
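This reads as: extract anything preceded by "CITY:" plus exactly three whitespace characters and followed by "\n". If your data has a different number of spaces after CITY:, adjust the count in \\s{3} accordingly.
Another option is: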
regmatches(x,gregexpr("(?<=CITY:).*(?=\n\n)",x,perl = TRUE))
# [[1]]
# [1] " ATLANTA" " LAS VEGAS"
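The lookbehind version keeps the whitespace that follows CITY:, as the output above shows. One possible way to drop it, sketched here with base R's trimws (the names raw and cities are illustrative):

```r
x <- "\nTYPE: School\nCITY: ATLANTA\n\n\nCITY: LAS VEGAS\n\n"

# Match everything after "CITY:" up to a blank line; leading spaces are kept
raw <- unlist(regmatches(x, gregexpr("(?<=CITY:).*(?=\n\n)", x, perl = TRUE)))

# Strip leading and trailing whitespace from each match
cities <- trimws(raw)
print(cities)
```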

Replace first comma with a semicolon into a string using R and regex

I would like to replace only the first comma in my dataset with a semicolon using R, regex, and, possibly, the library stringr.
The following is an extract of my dataset:
mydata <- structure(list(SURNAME_Name = c("AASSVE Arnstein", "ABATECOLA Gianpaolo",
"ABATEMARCO Antonio", "ABBAFATI Cristiana", "ABBATE Tindara",
"ABBRUZZO Antonino", "ABRARDI Laura", "ABRATE Graziano", "ACCONCIA Antonio",
"ACHARD Paola Olimpia", "ADAMO Rosa", "ADAMO Stefano", "ADDA Jerome Frans",
"ADDABBO Tindara", "ADDIS Elisabetta", "ADDIS Michela", "ADELFIO Giada",
"ADIGUZEL Feray", "ADIMARI Gianfranco", "DE MARCHI Maria Paola")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
I ran this code to insert a comma between the surname and the names, and then tried to replace the first comma with a semicolon:
library(dplyr)   # provides %>% and mutate()
library(stringr)
mydata %>%
  mutate(Name_delimited = str_replace_all(string = SURNAME_Name, pattern = "(\\s)(?=[A-Z]{1}[a-z]+)", replacement = "\\,"),
         Name_delimited1 = str_replace_all(string = Name_delimited, pattern = "\\1(\\,)(?=[A-Z]{1}[a-z]+)", replacement = "\\;"))
But it doesn't work as I expected. For example, row 10 remains ACHARD,Paola,Olimpia instead of ACHARD;Paola,Olimpia, and row 20 remains DE MARCHI,Maria,Paola instead of the expected DE MARCHI;Maria,Paola.
Any hints are welcome.
You may replace the first whitespace(s) with ; using str_replace and then use str_replace_all to replace all other spaces with ,:
> str_replace_all(str_replace(mydata$SURNAME_Name, "\\s+", ";"), "\\s+", ",")
[1] "AASSVE;Arnstein" "ABATECOLA;Gianpaolo" "ABATEMARCO;Antonio"
[4] "ABBAFATI;Cristiana" "ABBATE;Tindara" "ABBRUZZO;Antonino"
[7] "ABRARDI;Laura" "ABRATE;Graziano" "ACCONCIA;Antonio"
[10] "ACHARD;Paola,Olimpia" "ADAMO;Rosa" "ADAMO;Stefano"
[13] "ADDA;Jerome,Frans" "ADDABBO;Tindara" "ADDIS;Elisabetta"
[16] "ADDIS;Michela" "ADELFIO;Giada" "ADIGUZEL;Feray"
[19] "ADIMARI;Gianfranco" "DE;MARCHI,Maria,Paola"
Note that you may replace str_replace with sub and str_replace_all with gsub and use
gsub("\\s+", ",", sub("\\s+", ";", mydata$SURNAME_Name))
to rely solely on base R functions.
To preserve whitespaces inside ALLCAPS surnames, use
> reg <- "(*UCP)\\b\\p{Lu}+(?:\\s+\\p{Lu}+)+\\b(*SKIP)(*F)|\\s+"
> gsub(reg, ",", sub(reg, ";", mydata$SURNAME_Name, perl=TRUE), perl=TRUE)
[1] "AASSVE;Arnstein" "ABATECOLA;Gianpaolo" "ABATEMARCO;Antonio" "ABBAFATI;Cristiana"
[5] "ABBATE;Tindara" "ABBRUZZO;Antonino" "ABRARDI;Laura" "ABRATE;Graziano"
[9] "ACCONCIA;Antonio" "ACHARD;Paola,Olimpia" "ADAMO;Rosa" "ADAMO;Stefano"
[13] "ADDA;Jerome,Frans" "ADDABBO;Tindara" "ADDIS;Elisabetta" "ADDIS;Michela"
[17] "ADELFIO;Giada" "ADIGUZEL;Feray" "ADIMARI;Gianfranco" "DE MARCHI;Maria,Paola"
The regex engine is now PCRE (perl=TRUE). The (*UCP) PCRE verb makes \b Unicode aware, and the \\b\\p{Lu}+(?:\\s+\\p{Lu}+)+\\b(*SKIP)(*F) alternative matches runs of whitespace-separated ALLCAPS words as whole words and then skips them, which keeps the whitespace inside those runs intact.
Details
(*UCP) - makes \b in this pattern Unicode aware
\\b - a word boundary
\\p{Lu}+ - 1+ Unicode uppercase letters
(?:\\s+\\p{Lu}+)+ - 1 or more occurrences of 1+ whitespaces and then 1+ Unicode uppercase letters
\\b - word boundary
(*SKIP)(*F) - PCRE verbs that discard the matched text and proceed looking for the next match starting from the location where the previous search ended
| - or
\\s+ - 1+ whitespaces in any other context.
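To see what the (*SKIP)(*F) branch changes, the two approaches can be compared on two rows from the question. This is a minimal sketch; the names full_names, simple and guarded are illustrative.

```r
full_names <- c("ACHARD Paola Olimpia", "DE MARCHI Maria Paola")

# Simple approach: the first whitespace run becomes ";", all others ","
simple <- gsub("\\s+", ",", sub("\\s+", ";", full_names))

# SKIP-FAIL approach: whitespace inside an ALLCAPS run is left untouched
reg <- "(*UCP)\\b\\p{Lu}+(?:\\s+\\p{Lu}+)+\\b(*SKIP)(*F)|\\s+"
guarded <- gsub(reg, ",", sub(reg, ";", full_names, perl = TRUE), perl = TRUE)

print(simple)
print(guarded)
```

Only the two-word surname differs between the two results: the simple version splits it at its inner space, the guarded version keeps it whole.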

regular expression to find exact matching containing a space and a punctuation

I am going through a dataset containing text values (names) that are formatted like this example :
M.Joan (13-2)
A.Alfred (20-13)
F.O'Neil (12-231)
D.Dan Fun (23-3)
T.Collins (51-82) J.Maddon (12-31)
Some strings have two names in it like
M.Joan (13-2) A.Alfred (20-13)
I only want to extract the name from the string.
Some names are easy to extract because they don't have spaces or anything.
However some are hard because they have a space like the last one above.
name_pattern = "[A-Z][.][^(]{1,}"
base <- str_extract_all(baseball1$Managers, name_pattern)
When I use this code to extract the names, it works well even for names with spaces or punctuation. However, the extracted names have a space at the end. I was wondering if I can match the exact pattern " (", i.e. a space followed by a parenthesis.
Output:
[[1]]
[1] "Z.Taylor "
[[2]]
[1] "Z.Taylor "
[[3]]
[1] "Z.Taylor "
[[4]]
[1] "Z.Taylor "
[[5]]
[1] "Y.Berra "
[[6]]
[1] "Y.Berra "
You may use
x <- c("M.Joan (13-2) ", "A.Alfred (20-13)", "F.O'Neil (12-231)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)", "T.Hillman (12-34) and N.Yost (23-45)")
regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl=TRUE))
Or the str_extract_all version:
str_extract_all(baseball1$Managers, "\\p{Lu}.*?(?=\\s*\\()")
It matches
\p{Lu} - an uppercase letter
.*? - any chars other than line break chars, as few as possible, up to the first occurrence of the lookahead pattern (which is not included in the match, as (?=...) is a non-consuming construct)
(?=\\s*\\() - positive lookahead that, immediately to the right of the current location, requires the presence of:
\\s* - 0+ whitespace chars
\\( - a literal (.
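As an illustrative comparison (a sketch, not part of the original answer), a negated character class that runs up to the ( keeps the trailing space, while the lazy match with the \s*\( lookahead leaves it out. The names with_space and trimmed are made up for this example.

```r
x <- c("D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)")

# Negated class: consumes everything up to "(", including the space before it
with_space <- regmatches(x, gregexpr("[A-Z][.][^(]+", x))

# Lazy match plus lookahead: stops before any whitespace that precedes "("
trimmed <- regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl = TRUE))

print(with_space)
print(trimmed)
```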

Regex, R, and Commas

I'm having some trouble with a regex string in R. I'm trying to use regex to extract the tags from a string (scraped from the web) as follows:
str <- "\n\n\n \n\n\n “Don't cry because it's over, smile because it happened.”\n ―\n Dr. Seuss\n\n\n\n\n \n tags:\n attributed-no-source,\n cry,\n crying,\n experience,\n happiness,\n joy,\n life,\n misattributed-dr-seuss,\n optimism,\n sadness,\n smile,\n smiling\n \n \n 176513 likes\n \n\n\n\n\nLike\n\n"
# Why doesn't this work at all?
stringr::str_match(str, "tags:(.+)\\d")
[,1] [,2]
[1,] NA NA
# Why just the first tag? What happens at the comma?
stringr::str_match(str, "tags:\n(.+)")
[,1] [,2]
[1,] "tags:\n attributed-no-source," " attributed-no-source,"
So two questions -- why doesn't my first idea work, and why doesn't the second capture through the end of the string, rather than just the first comma?
Thanks!
Note that stringr regex flavor is that of ICU. Unlike TRE, . does not match line breaks in ICU regex patterns.
So, a possible fix is to use (?s) - a DOTALL modifier that makes . match any char including line break chars - at the start of your patterns:
str_match(str, "(?s)tags:(.+)\\d")
and
str_match(str, "(?s)tags:\n(.+)")
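The effect of (?s) can be checked without stringr. The sketch below uses base R with perl = TRUE, i.e. PCRE rather than ICU; the assumption is only that . behaves the same way here, which holds for PCRE, where . does not match a newline unless (?s) is set. The short string s is made up for the example.

```r
# A shortened, made-up stand-in for the scraped text
s <- "tags:\n cry,\n joy\n 12 likes"

# Without DOTALL: "tags:" is followed by a newline, so .+ cannot start matching
without_dotall <- grepl("tags:(.+)\\d", s, perl = TRUE)

# With DOTALL: . also matches newlines, so .+ can reach the digits
with_dotall <- grepl("(?s)tags:(.+)\\d", s, perl = TRUE)

print(c(without_dotall, with_dotall))
```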
However, I feel as if you need to get all the strings below tags: as separate matches. I suggest using a base R regmatches / gregexpr with a PCRE regex like
(?:\G(?!\A),?|tags:)\R\h*\K[^\s,]+
(?:\G(?!\A),?|tags:) - match the end of the previous successful match with 1 or 0 , after it (\G(?!\A),?) or (|) tags: substring
\R - a line break sequence
\h* - 0+ horizontal whitespaces
\K - a match reset operator discarding all the text matched so far
[^\s,]+ - 1 or more chars other than whitespace and ,
See the R demo:
str <- "\n\n\n \n\n\n “Don't cry because it's over, smile because it happened.”\n ―\n Dr. Seuss\n\n\n\n\n \n tags:\n attributed-no-source,\n cry,\n crying,\n experience,\n happiness,\n joy,\n life,\n misattributed-dr-seuss,\n optimism,\n sadness,\n smile,\n smiling\n \n \n 176513 likes\n \n\n\n\n\nLike\n\n"
reg <- "(?:\\G(?!\\A),?|tags:)\\R\\h*\\K[^\\s,]+"
vals <- regmatches(str, gregexpr(reg, str, perl=TRUE))
unlist(vals)
Result:
[1] "attributed-no-source" "cry" "crying"
[4] "experience" "happiness" "joy"
[7] "life" "misattributed-dr-seuss" "optimism"
[10] "sadness" "smile" "smiling"

Using R, how to use str_extract properly on this case?

I've learned from Ronak Shah and akrun (in this post) how to construct a regular expression that excludes every term from a data frame (alldata in my example) except these words,
^\BWORD1|WORD2|WORD3|WORD4|WORD5\>
but for some reason I can't figure out why it is giving me
"WORD2", "WORD3", NA
instead of
"WORD1 WORD2 WORD5", "WORD3", NA
Here is my example:
library(stringr)
alldata <- data.frame(toupper(c("word1 anotherword word2 word5", "word3", "none")))
names(alldata)<-"columna"
removeex <- c("word1" , "word2" ,"word3" ,"word4", "word5")
regularexprex <- toupper(paste0("^\\b",paste0(removeex, collapse = "|"), "\\>"))
alldata$columnb <- str_extract(alldata$columna, regularexprex)
I've tried adding + or * at the end of the regular expression, but without any effect.
Since I'm a beginner with regex, I am surely missing something. Could someone guide me on this?
Regards,
You need to replace the last two lines in your code above with
> regularexprex <- paste0("(?i)\\s*\\b(?!(?:",paste0(removeex, collapse = "|"), ")\\b)\\w+")
## => "(?i)\\s*\\b(?!(?:word1|word2|word3|word4|word5)\\b)\\w+"
> str_replace_all(alldata$columna, regularexprex, "")
[1] "WORD1 WORD2 WORD5" "WORD3" ""
First, toupper() turned \b into \B (a non-word boundary); all you need is case-insensitive matching, so I added the (?i) modifier. Second, the word boundaries were not applied to the whole alternation, only to the first and last alternatives.
Also, instead of extracting the allowed words, the pattern matches every word that is not in the list (together with any whitespace before it), so that str_replace_all can remove those words.
The final regex for replacing looks like
(?i)\s*\b(?!(?:word1|word2|word3|word4|word5)\b)\w+
Since the pattern contains no ., a DOTALL (s) modifier is not needed, even if your entries contain newlines; should you add a . that must match across lines, use (?si) instead of (?i).
Details:
(?i) - case insensitive modifier (works with PCRE and ICU regexes)
\s* - 0+ whitespaces
\b - a leading word boundary
(?!(?:word1|word2|word3|word4|word5)\b) - the word cannot equal word1, etc.
\w+ - 1+ word chars (letters, digits or underscores).
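The pieces above can be tried in base R as well. This is a sketch under a few assumptions: gsub with perl = TRUE stands in for str_replace_all, the extra input "FOO WORD1" and the final trimws call are additions for illustration, and the names keep, input and cleaned are made up.

```r
keep <- c("word1", "word2", "word3", "word4", "word5")

# toupper() also uppercases the escape letter, turning \b into \B
print(toupper("\\bword1"))

# Match (and later remove) any word that is not in the keep list
pattern <- paste0("(?i)\\s*\\b(?!(?:", paste(keep, collapse = "|"), ")\\b)\\w+")

input <- c("WORD1 ANOTHERWORD WORD2 WORD5", "WORD3", "NONE", "FOO WORD1")

# trimws() cleans up the space left behind when the removed word came first
cleaned <- trimws(gsub(pattern, "", input, perl = TRUE))
print(cleaned)
```

Note that entries without any listed word become empty strings rather than NA.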
