gsub only part of pattern - r

I want to use gsub to correct some names that are in my data. I want names such as "R. J." and "A. J." to have no space between the letters.
For example:
x <- "A. J. Burnett"
I want to use gsub to match the pattern of his first name, and then remove the space:
gsub("[A-Z]\\.\\s[A-Z]\\.", "[A-Z]\\.[A-Z]\\.", x)
But I get:
[1] "[A-Z].[A-Z]. Burnett"
Obviously, instead of the [A-Z]'s I want the actual letters in the original name. How can I do this?

Use capture groups by enclosing patterns in (...), and refer to the captured patterns with \\1, \\2, and so on. In this example:
x <- "A. J. Burnett"
gsub("([A-Z])\\.\\s([A-Z])\\.", "\\1.\\2.", x)
[1] "A.J. Burnett"
Also note that in the replacement you don't need to escape the . characters, as they don't have a special meaning there.
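As a minimal sketch, the same capture-group call can be applied to a whole vector at once. The `players` vector below is made-up sample data, not taken from the question; elements that do not match the pattern are returned unchanged.

```r
# Hypothetical sample data for illustration
players <- c("A. J. Burnett", "R. J. Swindle", "Roy Halladay")

# \\1 and \\2 put back the two captured initials; the space between them is dropped
fixed <- gsub("([A-Z])\\.\\s([A-Z])\\.", "\\1.\\2.", players)
print(fixed)
# [1] "A.J. Burnett" "R.J. Swindle" "Roy Halladay"
```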

You can use a look-ahead ((?=\\w\\.)) and a look-behind ((?<=\\b\\w\\.)) to target such spaces and replace them with "".
x <- c("A. J. Burnett", "Dr. R. J. Regex")
gsub("(?<=\\b\\w\\.) (?=\\w\\.)", "", x, perl = TRUE)
# [1] "A.J. Burnett" "Dr. R.J. Regex"
The look-ahead matches a word character (\\w) followed by a period (\\.), and the look-behind matches a word-boundary (\\b) followed by a word character and a period.
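One practical difference between the two answers shows up with three or more initials. The sketch below uses a hypothetical `authors` vector: the capture-group pattern consumes both initials of a pair, so the third initial is left alone, while the lookaround pattern consumes only the space and therefore joins every initial.

```r
# Hypothetical sample data for illustration
authors <- c("J. R. R. Tolkien", "A. J. Burnett")

# Capture groups: the match "J. R." is consumed, so " R." is not joined
with_groups <- gsub("([A-Z])\\.\\s([A-Z])\\.", "\\1.\\2.", authors)
print(with_groups)
# [1] "J.R. R. Tolkien" "A.J. Burnett"

# Lookarounds: only the space itself is matched, so all initials are joined
with_lookaround <- gsub("(?<=\\b\\w\\.) (?=\\w\\.)", "", authors, perl = TRUE)
print(with_lookaround)
# [1] "J.R.R. Tolkien" "A.J. Burnett"
```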

Related

Using R - gsub using code in replacement - Replace comma with full stop after pattern

I would like to manually correct a record by using R. Last name and first name should always be separated by a comma.
names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
Sometimes, however, a full stop has crept in as a separator, as in the case of "JOHNSON. Richard". I would like to do this automatically. Since the last name is always at the beginning of the line, I can simply access it via sub:
sub("^[[:upper:]]+\\.","^[[:upper:]]+\\,",names)
However, I cannot use a function for the replacement that specifically replaces the full stop with a comma.
Is there a way to insert a function into the replacement that does this for me?
Your sub is mostly correct, but you'll need a capture group (the parentheses, plus the backreference \\1) for the replacement.
Because we are "capturing" the upper-case letters, \\1 in the replacement stands for the original upper-case letters in each string. The only thing that changes is the full stop, which becomes a comma. In other words, we replace the upper-case letters ^([[:upper:]]+) followed by a full stop \\. with their original content \\1 followed by a comma.
For more details you can visit this page.
test_names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
sub("^([[:upper:]]+)\\.","\\1\\,",test_names)
[1] "ADAM, Smith J." "JOHNSON, Richard" "BROWN, Wilhelm K."
[4] "DAVIS, Daniel"
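Before rewriting anything, it can help to see which records are affected. A small sketch, assuming the same anchored pattern and using a hypothetical `records` vector that copies the sample data: `grepl` flags the entries whose leading surname is followed by a full stop, and `sub` then corrects only those.

```r
# Same sample data as above, under a different (hypothetical) variable name
records <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")

# TRUE where the leading upper-case surname is directly followed by a full stop
needs_fix <- grepl("^[[:upper:]]+\\.", records)
print(records[needs_fix])
# [1] "JOHNSON. Richard"

# The trailing initials "J." and "K." are untouched because the pattern is anchored with ^
corrected <- sub("^([[:upper:]]+)\\.", "\\1,", records)
print(corrected)
```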
Can be done by a function like so:
names <- c("ADAM, Smith", "JOHNSON. Richard", "BROWN, Wilhelm", "DAVIS, Daniel")
replacedots <- function(mystring) {
  # replace every full stop in the argument (not in the global `names`) with a comma
  gsub("\\.", ",", mystring)
}
replacedots(names)
[1] "ADAM, Smith" "JOHNSON, Richard" "BROWN, Wilhelm" "DAVIS, Daniel"
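Note that replacing every full stop would also change the one after a trailing initial such as "J." in the original data. A sketch of a function that only touches the separator after the surname, assuming surnames are all upper case and start the string (the name `fix_separator` is made up for this example):

```r
# Hypothetical helper: replace only the full stop that follows the leading surname
fix_separator <- function(x) {
  sub("^([[:upper:]]+)\\.", "\\1,", x)
}

anchored <- fix_separator(c("JOHNSON. Richard K.", "ADAM, Smith J."))
print(anchored)
# [1] "JOHNSON, Richard K." "ADAM, Smith J."

# For comparison: replacing all full stops also alters the trailing initial
unanchored <- gsub("\\.", ",", "JOHNSON. Richard K.")
print(unanchored)
# [1] "JOHNSON, Richard K,"
```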

Abbreviate vector of names in R, using stringr library

I have this variable:
names<-c("Sophia Abbe", "Olivia Abbett", "Emma Abbey", "Ava Abbitt", "Isabella Abbot", "Mia Abbott", "Aria Abbs")
I want to abbreviate the first names and place them into a vector.
I want to obtain a vector like ("S. Abbe", "O. Abbett", ... , "A. Abbs")
What would be an efficient way of doing this with the stringr functions str_c(), str_split() and str_sub() ?
An option in base R is to use sub to match the lower-case letters and replace them with a .
sub("[a-z]+", ".", names)
#[1] "S. Abbe" "O. Abbett" "E. Abbey" "A. Abbitt" "I. Abbot" "M. Abbott" "A. Abbs"
Here [a-z]+ matches one or more consecutive lower-case letters. Because we are using sub, only the first such run is replaced, i.e. the lower-case part of the first name, and it is replaced with "."
Or using str_replace
library(stringr)
str_replace(names, "[a-z]+", ".")
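One caveat: "[a-z]+" replaces the first run of lower-case letters wherever it occurs, so a first name with an internal capital is cut mid-word. A sketch of an anchored base R alternative that keeps the first character and drops the rest of the first word; the `people` vector is made-up sample data.

```r
# Hypothetical sample data, including a first name with an internal capital
people <- c("Sophia Abbe", "McKenzie Abbott")

# The unanchored pattern only replaces the "c" in "McKenzie"
unanchored <- sub("[a-z]+", ".", people)
print(unanchored)
# [1] "S. Abbe" "M.Kenzie Abbott"

# Anchored: capture the first character, drop the remaining word characters
anchored <- sub("^(\\w)\\w*", "\\1.", people)
print(anchored)
# [1] "S. Abbe" "M. Abbott"
```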

Remove others in a string except a needed word including certain patterns in R

I have a vector of strings, and I would like to remove everything from each string except the word containing a certain pattern (here, mir).
s <- c("a mir-96 line (kk27)", "mir-133a cell",
"d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
I want to obtain:
mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p
I know the idea: use the gsub() and pattern is: a word beginning with (or including) mir.
But I have no idea how to construct such a pattern.
Or other idea?
Any help will be appreciated!
One way in base R would be splitting every string into words and then extracting only those with mir in it
unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
We can save the unlist step by using sapply instead of lapply, as suggested by @Rich Scriven in the comments
sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
We can use sub to match zero or more characters (.*) followed by a word boundary (\\b), then the string mir and one or more characters that are not white space (\\S+), captured as a group by placing them inside (...), followed by any remaining characters (.*). In the replacement we use the backreference to the captured group (\\1)
sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
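An alternative sketch in base R extracts the match directly instead of deleting everything around it, using regexpr with regmatches. One difference worth knowing: sub returns a string unchanged when nothing matches, whereas regmatches drops elements that have no match.

```r
s <- c("a mir-96 line (kk27)", "mir-133a cell",
       "d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")

# regexpr() locates the first "mir" token in each string; regmatches() extracts it
hits <- regmatches(s, regexpr("mir\\S+", s))
print(hits)
# [1] "mir-96"    "mir-133a"  "mir-14-3p" "mir133"    "mir_23_5p"
```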
Update
If a string contains several substrings starting with 'mir', we can restrict the match to the one that has a numeric part
sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p" "mir_23-5p"
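A sketch of the same idea with gregexpr, which collects every matching token per string rather than one. The pattern here is an adaptation, not the answer's own: it excludes white space from the part between mir and the digits so that a match cannot run across words.

```r
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
        "mir_23_5p r 27", "a mir_23-5p 1 mir-net")

# "mir", optional non-digit/non-space characters, at least one digit, then the rest of the token
pattern <- "mir[^0-9[:space:]]*[0-9]+\\S*"

# gregexpr() finds all matches in each string; "mir-net" has no digit and is skipped
tokens <- unlist(regmatches(s1, gregexpr(pattern, s1)))
print(tokens)
```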
data
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
"mir_23_5p r 27", "a mir_23-5p 1 mir-net")

Remove all vowels in a sentence except for those which occur in the beginning of the word in R

The intention is to replace all vowels with blank if they are not the first character of word in a sentence.
For Instance, I AM A HAPPY MINISTER => I AM A HPPY MNSTR. Is there a way to implement this in R?
You can try using lookaround:
gsub("(?<=[A-Z])[AEIOU]", "", "I AM A HAPPY MINISTER", perl=TRUE)
# [1] "I AM A HPPY MNSTR"
This regex searches for uppercase vowels that are preceded by any uppercase letter, then they are replaced by the empty string.
As mentioned in a comment by @Jota, another option is to use \\S (anything but white space) in the look-behind, which also removes a vowel that follows a hyphen or an apostrophe, for example:
gsub("(?<=\\S)[AEIOU]", "", "I AM A HAPPY WELL-INTENTIONED MINISTER, D'ACCORD", perl=TRUE)
#[1] "I AM A HPPY WLL-NTNTND MNSTR, D'CCRD"
A variant, using parameter ignore.case:
gsub("(?<=\\S)[aeiou]", "", "I AM A HAPPY WELL-INTENTIONED MINISTER, D'ACCORD", perl = TRUE, ignore.case = TRUE)
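Because a look-behind inspects the original string and does not consume the preceding character, runs of consecutive vowels are removed as well. A short sketch wrapping the case-insensitive variant in a function (the name `drop_inner_vowels` is made up for this example):

```r
# Hypothetical helper: remove vowels that directly follow a non-space character
drop_inner_vowels <- function(x) {
  gsub("(?<=\\S)[aeiou]", "", x, perl = TRUE, ignore.case = TRUE)
}

mixed <- drop_inner_vowels("I am a Happy Minister")
print(mixed)
# [1] "I am a Hppy Mnstr"

runs <- drop_inner_vowels("BEAUTIFUL OCEAN")
print(runs)
# [1] "BTFL OCN"
```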
We can do
x <- "I AM A HAPPY MINISTER"
gsub("(\\w)[AEIOU]+", "\\1", x)
This matches a run of vowels that directly follows a word character, i.e. vowels that are not at the start of a word. The vowels are dropped and only the captured preceding character (\\1) is put back.
We can use (*SKIP)(*F) to skip any [AEIOU] that starts a word, match the [AEIOU] in other parts of the string, and replace it with ''.
gsub("(\\b|\\s)[AEIOU](*SKIP)(*F)|[AEIOU]", "", str1, perl=TRUE)
#[1] "I AM A HPPY MNSTR"
data
str1 <- "I AM A HAPPY MINISTER"
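A more compact sketch of the same "not at the start of a word" condition uses \\B, which asserts that the position is not a word boundary. This is a different technique from the answers above and behaves like the [A-Z] look-behind rather than the \\S one: a vowel after a hyphen or apostrophe counts as starting a word and is kept.

```r
str1 <- "I AM A HAPPY MINISTER"

# \\B before a vowel means the preceding character is also a word character
plain <- gsub("\\B[AEIOU]", "", str1, perl = TRUE)
print(plain)
# [1] "I AM A HPPY MNSTR"

# Vowels right after "-" or "'" sit on a word boundary, so they survive
punct <- gsub("\\B[AEIOU]", "", "WELL-INTENTIONED, D'ACCORD", perl = TRUE)
print(punct)
# [1] "WLL-INTNTND, D'ACCRD"
```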

removing spaces outside quotes in r

I am rearranging Simpsons names with R so that they follow a first name, last name format, but there are large spaces between the names. Is it possible to remove the spaces outside the quoted names?
library(stringr)
simpsons <- c("Moe Syzlak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy", "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
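# split on the comma, trim each piece, and reverse the pieces of each name
splitname <- sapply(sapply(str_split(simpsons, ","), str_trim), rev)
for (i in seq_along(splitname)) {
  # paste the reversed pieces back together with a single space
  splitname[i] <- paste(unlist(splitname[i]), collapse = " ")
}
splitname <- unlist(splitname)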
If we need to rearrange the names so that the first name is followed by the last name, we could use sub. We capture one or more characters that are not a , as the 1st group, followed by a , and 0 or more spaces (\\s*), then capture one or more characters that are not a , as the 2nd group. In the replacement we swap the backreferences to get the output.
sub("([^,]+),\\s*([^,]+)", "\\2 \\1", simpsons)
#[1] "Moe Syzlak" "C. Montgomery Burns" "Rev. Timothy Lovejoy" "Ned Flanders" "Homer Simpson" "Dr. Julius Hibbert"
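If the input itself carries stray white space, the swap can be followed by a cleanup step. A minimal sketch, assuming a made-up `messy` vector: runs of white space are collapsed to a single space and the ends are trimmed with trimws.

```r
# Hypothetical sample data with irregular spacing
messy <- c("Burns,    C. Montgomery", "  Simpson,Homer  ", "Moe Syzlak")

# Swap "last, first" to "first last" as in the answer above
swapped <- sub("([^,]+),\\s*([^,]+)", "\\2 \\1", messy)

# Collapse any run of white space to one space, then trim both ends
tidy <- trimws(gsub("\\s+", " ", swapped))
print(tidy)
# [1] "C. Montgomery Burns" "Homer Simpson"       "Moe Syzlak"
```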
