Extract string and keep some delimiters but not others in R - r

I wish to remove the text after the last cluster of delimiter characters, along with the delimiters themselves, unless the delimiter is a closing parenthesis. I trim trailing whitespace first, since whitespace is also a delimiter.
name <- c("Geomdan dong", "Geomdan-dong ", "Geomdan 1(il)-dong", "Geomdan-1(il)dong", "Geomdan-1(il) dong")
#My attempt
sub("[-\\) ][^-\\) ]*$", "", trimws(name))
[1] "Geomdan" "Geomdan" "Geomdan 1(il)" "Geomdan-1(il" "Geomdan-1(il)"
#Desired output
[1] "Geomdan" "Geomdan" "Geomdan 1(il)" "Geomdan-1(il)" "Geomdan-1(il)"

One option is to make the first character class optional and remove the ) from it:
[-\\ ]?[^-\\) ]+$
name <- c("Geomdan dong", "Geomdan-dong ", "Geomdan 1(il)-dong", "Geomdan-1(il)dong", "Geomdan-1(il) dong")
sub("[-\\ ]?[^-\\) ]+$", "", trimws(name))
Output
[1] "Geomdan" "Geomdan" "Geomdan 1(il)" "Geomdan-1(il)"
[5] "Geomdan-1(il)"
If you want to keep strings that, for example, contain only word characters, you can either match what is in the character class, or assert a ) to the left and pass perl=TRUE to use a Perl-compatible regular expression.
(?:[ -]|(?<=\)))[^-) ]*$
sub("(?:[ -]|(?<=\\)))[^-) ]*$", "", trimws(name), perl=TRUE)
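As a quick sanity check (my own addition, not part of the original answer), both patterns produce the desired output on the sample vector:

```r
name <- c("Geomdan dong", "Geomdan-dong ", "Geomdan 1(il)-dong",
          "Geomdan-1(il)dong", "Geomdan-1(il) dong")
desired <- c("Geomdan", "Geomdan", "Geomdan 1(il)", "Geomdan-1(il)", "Geomdan-1(il)")

out1 <- sub("[-\\ ]?[^-\\) ]+$", "", trimws(name))                      # optional-class variant
out2 <- sub("(?:[ -]|(?<=\\)))[^-) ]*$", "", trimws(name), perl = TRUE) # lookbehind variant

identical(out1, desired)  # TRUE
identical(out2, desired)  # TRUE
```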

regex replace parts/groups of a string in R

Trying to postprocess the LaTeX (pdf_book output) of a bookdown document to collapse biblatex citations to be able to sort them chronologically using \usepackage[sortcites]{biblatex} later on. Thus, I need to find }{ after \\autocites and replace it with ,. I am experimenting with gsub() but can't find the correct incantation.
# example input
testcase <- "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"
# desired output
"text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
A simple approach was to replace all }{
> gsub('\\}\\{', ',', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep,separate}"
But this also collapses {keep}{separate}.
I was then trying to replace }{ within a 'word' (string of characters without whitespace) starting with \\autocites by using different groups and failed bitterly:
> gsub('(\\\\autocites)([^ \f\n\r\t\v}{}]+)((\\}\\{})+)', '\\1\\2\\3', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} some text {keep}{separate}"
Addendum:
The actual document contains more lines/elements than the testcase above. Not all elements contain \\autocites and in rare cases one element has more than one \\autocites. I didn't originally think this was relevant. A more realistic testcase:
testcase2 <- c("some text",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")
A single gsub call is enough:
gsub("(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{", ",", testcase, perl=TRUE)
## => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
Here, (?:\G(?!^)|\\autocites) matches either the end of the previous match or an \autocites string; \S*? then matches zero or more non-whitespace chars, as few as possible; finally, \K discards the text matched so far from the match buffer, so only the }{ that follows is consumed and replaced with a comma.
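Because gsub is vectorized, the same call also handles the more realistic testcase2, including the element with two \autocites blocks (my own quick check):

```r
testcase2 <- c("some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

gsub("(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{", ",", testcase2, perl = TRUE)
## [1] "some text"
## [2] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
## [3] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000,wattPattern1947}"
```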
There is also a very readable solution with one regex and one fixed-text replacement using stringr::str_replace_all:
library(stringr)
str_replace_all(testcase, "\\\\autocites\\S+", function(x) gsub("}{", ",", x, fixed=TRUE))
# => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
Here, \\autocites\S+ matches \autocites and then 1+ non-whitespace chars, and gsub("}{", ",", x, fixed=TRUE) replaces (very fast) each }{ with , in the matched text.
Not the prettiest solution, but it works. This repeatedly replaces }{ with , but only if it follows autocites with no intervening blanks.
while (length(grep('(autocites\\S*)\\}\\{', testcase, perl=TRUE))) {
  testcase <- sub('(autocites\\S*)\\}\\{', '\\1,', testcase, perl=TRUE)
}
testcase
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
I'll make the input string slightly bigger to make the algorithm more clear.
str <- "
text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}
text \\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990} text {keep}{separate}
"
We will first extract all the citation blocks, replace "}{" with "," in them, and then put them back into the string.
# pattern for matching citation blocks
pattern <- "\\\\autocites(\\[[^\\[\\]]*\\])*(\\{[[:alnum:]]*\\})+"
cit <- str_extract_all(str, pattern)[[1]]
cit
#> [1] "\\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990}"
Replace in citation blocks:
newcit <- str_replace_all(cit, "\\}\\{", ",")
newcit
#> [1] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [2] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
Break the original string at the places where a citation block was found:
strspl <- str_split(str, pattern)[[1]]
strspl
#> [1] "\ntext " " text {keep}{separate}\ntext " " text {keep}{separate}\n"
Insert modified citation blocks:
combined <- character(length(strspl) + length(newcit))
combined[c(TRUE, FALSE)] <- strspl
combined[c(FALSE, TRUE)] <- newcit
combined
#> [1] "\ntext "
#> [2] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [3] " text {keep}{separate}\ntext "
#> [4] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [5] " text {keep}{separate}\n"
Paste it together to finalize:
newstr <- paste(combined, collapse = "")
newstr
#> [1] "\ntext \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}\ntext \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}\n"
I suspect there could be a more elegant fully-regex solution based on the same idea, but I wasn't able to find one.
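For reuse, the steps above can be wrapped into one function; this is a sketch of my own (the name collapse_autocites is made up), and it also handles vectors and elements without any \autocites:

```r
library(stringr)

collapse_autocites <- function(x) {
  pattern <- "\\\\autocites(\\[[^\\[\\]]*\\])*(\\{[[:alnum:]]*\\})+"
  vapply(x, function(s) {
    cit <- str_extract_all(s, pattern)[[1]]
    if (length(cit) == 0) return(s)                 # nothing to collapse
    newcit <- str_replace_all(cit, "\\}\\{", ",")
    parts  <- str_split(s, pattern)[[1]]
    # interleave the split parts with the modified citation blocks
    combined <- character(length(parts) + length(newcit))
    combined[c(TRUE, FALSE)] <- parts
    combined[c(FALSE, TRUE)] <- newcit
    paste(combined, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}
```

collapse_autocites(testcase) then returns the collapsed string, and the same call works on the multi-element testcase2.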
I found an incantation that works. It's not pretty:
gsub("\\\\autocites[^ ]*",
     gsub("\\}\\{", ",",
          gsub(".*(\\\\autocites[^ ]*).*", "\\\\\\1", testcase) # all those extra backslashes are there because R is ridiculous
     ),
     testcase)
I broke it in to lines to hopefully make it a little more intelligible. Basically, the innermost gsub extracts just the autocites (anything that follows \\autocites up to the first space), then the middle gsub replaces the }{s with commas, and the outermost gsub replaces the result of the middle one for the pattern extracted in the innermost one.
This will only work with a single autocites in a string, of course.
Also, fortune(365).

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be a more elegant solution.
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could bundle \s and the underscore in a single character class and use a quantifier to repeat it 1 or more times.
^[\s_]+
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or, as @Tim Biegeleisen points out in the comments, since only the first occurrence needs to be replaced you could use sub instead:
txt <- sub("[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class
txt <- sub("[[:space:]_]+", "", txt)
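A quick check of my own that the three variants agree on the example string:

```r
txt <- " 9PM 8-Oct-2014_0.335kwh "

v1 <- gsub("^[\\s_]+", "", txt, perl = TRUE)
v2 <- sub("[\\s_]+", "", txt, perl = TRUE)
v3 <- sub("[[:space:]_]+", "", txt)

identical(v1, v2) && identical(v2, v3)  # TRUE; all yield "9PM 8-Oct-2014_0.335kwh "
```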
The stringr package offers some task-specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start.
str_remove(s, "[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores everywhere (here, that means the start and end).
str_remove_all(s, "[\\s_]+")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (spaces, tabs, newlines) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see Thefourthbird's answer).
You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or the trimws function in base R:
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
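As a small addition of mine, trimws also takes a which argument if you only want one side trimmed (mirroring str_trim's side = option):

```r
txt <- " 9PM 8-Oct-2014_0.335kwh "

trimws(txt, which = "left")
# [1] "9PM 8-Oct-2014_0.335kwh "
trimws(txt, which = "right")
# [1] " 9PM 8-Oct-2014_0.335kwh"
```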

Replace first comma with a semicolon into a string using R and regex

I would like to replace only the first comma in my dataset with a semicolon using R, regex, and, possibly, the library stringr.
The following one is an extract of my dataset:
mydata <- structure(list(SURNAME_Name = c("AASSVE Arnstein", "ABATECOLA Gianpaolo",
"ABATEMARCO Antonio", "ABBAFATI Cristiana", "ABBATE Tindara",
"ABBRUZZO Antonino", "ABRARDI Laura", "ABRATE Graziano", "ACCONCIA Antonio",
"ACHARD Paola Olimpia", "ADAMO Rosa", "ADAMO Stefano", "ADDA Jerome Frans",
"ADDABBO Tindara", "ADDIS Elisabetta", "ADDIS Michela", "ADELFIO Giada",
"ADIGUZEL Feray", "ADIMARI Gianfranco", "DE MARCHI Maria Paola")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
I used this code to insert a comma between the surname and the given names, and then I tried replacing the first comma with a semicolon:
library(stringr)
mydata %>%
  mutate(Name_delimited  = str_replace_all(string = SURNAME_Name,   pattern = "(\\s)(?=[A-Z]{1}[a-z]+)",      replacement = "\\,"),
         Name_delimited1 = str_replace_all(string = Name_delimited, pattern = "\\1(\\,)(?=[A-Z]{1}[a-z]+)",   replacement = "\\;"))
But it doesn't work as I expected because, for example, the row number 10 in my dataset remains ACHARD,Paola,Olimpia instead of ACHARD;Paola,Olimpia and for row number 20 where I expected DE MARCHI;Maria,Paola instead of DE MARCHI,Maria,Paola
Any hints are welcome.
You may replace the first whitespace(s) with ; using str_replace and then use str_replace_all to replace all other spaces with ,:
> str_replace_all(str_replace(mydata$SURNAME_Name, "\\s+", ";"), "\\s+", ",")
[1] "AASSVE;Arnstein" "ABATECOLA;Gianpaolo" "ABATEMARCO;Antonio"
[4] "ABBAFATI;Cristiana" "ABBATE;Tindara" "ABBRUZZO;Antonino"
[7] "ABRARDI;Laura" "ABRATE;Graziano" "ACCONCIA;Antonio"
[10] "ACHARD;Paola,Olimpia" "ADAMO;Rosa" "ADAMO;Stefano"
[13] "ADDA;Jerome,Frans" "ADDABBO;Tindara" "ADDIS;Elisabetta"
[16] "ADDIS;Michela" "ADELFIO;Giada" "ADIGUZEL;Feray"
[19] "ADIMARI;Gianfranco" "DE;MARCHI,Maria,Paola"
Note you may replace str_replace with sub and str_replace_all with gsub and use
gsub("\\s+", ",", sub("\\s+", ";", mydata$SURNAME_Name))
relying on sole base R functions.
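A quick equivalence check of my own, on a two-element subset, confirming the base R version matches the stringr one:

```r
library(stringr)
x <- c("ACHARD Paola Olimpia", "ADAMO Rosa")

out_stringr <- str_replace_all(str_replace(x, "\\s+", ";"), "\\s+", ",")
out_base    <- gsub("\\s+", ",", sub("\\s+", ";", x))

identical(out_stringr, out_base)  # TRUE
out_base
# [1] "ACHARD;Paola,Olimpia" "ADAMO;Rosa"
```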
To preserve whitespaces inside ALLCAPS surnames, use
> reg <- "(*UCP)\\b\\p{Lu}+(?:\\s+\\p{Lu}+)+\\b(*SKIP)(*F)|\\s+"
> gsub(reg, ",", sub(reg, ";", mydata$SURNAME_Name, perl=TRUE), perl=TRUE)
[1] "AASSVE;Arnstein" "ABATECOLA;Gianpaolo" "ABATEMARCO;Antonio" "ABBAFATI;Cristiana"
[5] "ABBATE;Tindara" "ABBRUZZO;Antonino" "ABRARDI;Laura" "ABRATE;Graziano"
[9] "ACCONCIA;Antonio" "ACHARD;Paola,Olimpia" "ADAMO;Rosa" "ADAMO;Stefano"
[13] "ADDA;Jerome,Frans" "ADDABBO;Tindara" "ADDIS;Elisabetta" "ADDIS;Michela"
[17] "ADELFIO;Giada" "ADIGUZEL;Feray" "ADIMARI;Gianfranco" "DE MARCHI;Maria,Paola"
The regex engine is now PCRE: I added the (*UCP) verb to make \b Unicode-aware, and a \\b\\p{Lu}+(?:\\s+\\p{Lu}+)+\\b(*SKIP)(*F) alternative that matches whitespace-separated ALLCAPS words as whole words and then skips these matches, keeping their whitespace intact.
Details
(*UCP) - makes \b in this pattern Unicode aware
\\b - a word boundary
\\p{Lu}+ - 1+ Unicode uppercase letters
(?:\\s+\\p{Lu}+)+ - 1 or more occurrences of 1+ whitespaces and then 1+ Unicode uppercase letters
\\b - word boundary
(*SKIP)(*F) - PCRE verbs that discard the matched text and proceed looking for the next match starting from the location where the previous search ended
| - or
\\s+ - 1+ whitespaces in any other context.
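As a minimal illustration of the (*SKIP)(*F) technique, here is my own check on the trickiest row:

```r
reg <- "(*UCP)\\b\\p{Lu}+(?:\\s+\\p{Lu}+)+\\b(*SKIP)(*F)|\\s+"
x <- "DE MARCHI Maria Paola"

# "DE MARCHI" is matched as a whole ALLCAPS phrase and skipped,
# so only the whitespace outside it gets replaced
gsub(reg, ",", sub(reg, ";", x, perl = TRUE), perl = TRUE)
# [1] "DE MARCHI;Maria,Paola"
```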

A regex to remove all words which contains number in R

I want to write a regex in R to remove all words of a string containing numbers.
For example:
first_text = "a2c if3 clean 001mn10 string asw21"
second_text = "clean string"
Try with gsub
trimws(gsub("\\w*[0-9]+\\w*\\s*", "", first_text))
#[1] "clean string"
It is easier to select words with no numbers than to select and delete words with numbers:
> library(stringr)
> str1 <- "a2c if3 clean 001mn10 string asw21"
> paste(unlist(str_extract_all(str1, "(\\b[^\\s\\d]+\\b)")), collapse = " ")
[1] "clean string"
Note:
Backslashes have to be escaped in R to work properly, hence double backslashes
\b is word boundary
\s is white space
\d is digit character
a caret (^) inside square brackets is a negation: match characters that are not listed
"+" after the character group inside [] means "1 or more" occurrences of those (non white space and non digit) characters
Just another alternative using gsub
trimws(gsub("[^\\s]*[0-9][^\\s]*", "", first_text, perl=T))
#[1] "clean string"
A bit longer than some of the answers but very tractable is to first convert the string to a vector of words, then check word by word if there are any numbers and use standard R subsetting.
first_text_vec <- strsplit(first_text, " ")[[1]]
first_text_vec
[1] "a2c" "if3" "clean" "001mn10" "string" "asw21"
paste(first_text_vec[!grepl("[0-9]", first_text_vec)], collapse = " ")
[1] "clean string"
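The same idea extends to a whole character vector; here is a sketch of my own (the function name remove_number_words is made up):

```r
remove_number_words <- function(x) {
  vapply(strsplit(x, " "), function(words) {
    # keep only the words with no digits, then rejoin
    paste(words[!grepl("[0-9]", words)], collapse = " ")
  }, character(1))
}

remove_number_words(c("a2c if3 clean 001mn10 string asw21", "already clean"))
# [1] "clean string"  "already clean"
```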

Put space after a specific word in a string vector in R

I want to put a space after a specific character in a string vector in R.
Example:
Text <- "<U+00A6>Word"
My goal is to put a space after the ">" to separate the string into two parts: <U+00A6> Word.
I tried with gsub, but I do not have the right idea:
Text = gsub("<*", " ", Text)
But that only puts a space after each character.
Can you advise on that?
You can use this:
sub(">", "> ", Text)
# [1] "<U+00A6> Word"
or this (without repeating the >):
sub("(?<=>)", " ", Text, perl = TRUE)
# [1] "<U+00A6> Word"
If you just want to extract Word, you can use:
sub(".*>", "", Text)
# [1] "Word"
We can use str_extract to extract the word after the >
library(stringr)
str_extract(Text, "(?<=>)\\w+")
#[1] "Word"
Or another option is strsplit
strsplit(Text, ">")[[1]][2]
#[1] "Word"
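One caveat worth noting (my own addition): sub only replaces the first >, so if an element can contain several, use gsub instead. The second element below is a hypothetical example:

```r
Text2 <- c("<U+00A6>Word", "<U+00A6>First<U+00A6>Second")

gsub(">", "> ", Text2)
# [1] "<U+00A6> Word"                 "<U+00A6> First<U+00A6> Second"
```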
