How to remove middle name from a full name in R

I have a field in a data frame that is formatted as last name, comma, space, first name, space, middle name, and sometimes without a middle name. I need to remove the middle names from the full names that have one, and remove all spaces. I couldn't figure out how. My guess is that it will involve regular expressions. It would be nice if you could provide explanations with the answer. Below is an example,
names <- c("Casillas, Kyron Jamar", "Knoll, Keyana","McDonnell, Messiah Abdul")
names
Expected output will be,
names_n <- c("Casillas,Kyron", "Knoll,Keyana","McDonnell,Messiah")
names_n
Thanks!

You can use this:
sub("([^,]+,)\\s*(\\w+).*", "\\1\\2", names)
[1] "Casillas,Kyron" "Knoll,Keyana" "McDonnell,Messiah"
Here we divide the string into two capturing groups and use backreferences to recollect their content:
([^,]+,): the 1st capture group, which captures any sequence of characters that is not a comma, followed by a comma
\\s*: this matches the whitespace after the comma, outside any capture group, until ...
(\\w+): ... the 2nd capture group, which captures the first alphanumeric string after the comma, i.e. the first name
.*: this matches whatever follows, i.e. the middle name if there is one
\\1\\2 in the replacement argument recollects the contents of the two capture groups only, thereby removing whatever is not captured. If you wish to separate the surname from the first name not only by a comma but also by a whitespace, just put one whitespace between the two backreferences, thus: \\1 \\2

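We may capture the second word (\\w+), i.e. the first name, when it is followed by a third word, and replace the match with the backreference (\\1) of the captured word. The outer sub then removes the space that remains after the comma in names without a middle name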
sub("\\s+", "", sub("\\s+(\\w+)\\s+\\w+$", "\\1", names))
-output
[1] "Casillas,Kyron" "Knoll,Keyana" "McDonnell,Messiah"
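Since the question mentions a data frame field, the nested call can also be split into two steps and applied to a column. This is only a sketch: the data frame df and the columns full_name and short_name are hypothetical names, not taken from the question.

```r
# Hypothetical data frame; `df`, `full_name` and `short_name` are illustrative names.
df <- data.frame(
  full_name = c("Casillas, Kyron Jamar", "Knoll, Keyana", "McDonnell, Messiah Abdul"),
  stringsAsFactors = FALSE
)

# Step 1: where a middle name exists, replace " First Middle" with "First".
# This also drops the space after the comma for those names.
no_middle <- sub("\\s+(\\w+)\\s+\\w+$", "\\1", df$full_name)

# Step 2: remove the space after the comma for names that had no middle name.
df$short_name <- sub("\\s+", "", no_middle)

df$short_name
# [1] "Casillas,Kyron"    "Knoll,Keyana"      "McDonnell,Messiah"
```

Note that \\w does not match hyphens or apostrophes, so first or middle names such as Mary-Jane would need a wider character class.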

Related

Add a character to each word within a sentence in R

I have sentences in R (they represent column names of an SQL query), like the following:
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
I would need to add a character(s) like "k." in front of every word of the sentence. Notice how sometimes words within the sentence may be separated by a comma and a space, but sometimes just by a comma.
The desired output would be:
new_sentence <- "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
I would prefer to achieve this without using a for loop. I saw the post "Add a character to the start of every word", but there they work with a vector and I can't figure out how to apply that code to my example.
Thanks
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
gsub(pattern = "(\\w+)", replacement = "k.\\1", sample_sentence)
# [1] "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
Explanation: in regex, \\w+ matches one or more "word" characters (letters, digits and the underscore), and putting it in () makes each match a "capturing group". We replace each match with k.\\1, where \\1 refers to the text captured by the first set of ().
A possible solution:
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
paste0("k.", gsub("(,\\s*)", "\\1k\\.", sample_sentence))
#> [1] "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
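Both answers can be checked against the desired output. The sketch below also shows why LAST_NAME is handled as one word: the underscore counts as a word character. The variable names expected, by_word and by_separator are made up for this illustration.

```r
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
expected <- "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"

# Approach 1: prefix every run of word characters.
by_word <- gsub("(\\w+)", "k.\\1", sample_sentence)

# Approach 2: prefix the start of the string, then add the prefix
# again after each separator (a comma plus optional whitespace).
by_separator <- paste0("k.", gsub("(,\\s*)", "\\1k.", sample_sentence))

identical(by_word, expected)       # TRUE
identical(by_separator, expected)  # TRUE

# The underscore is a word character, so this yields a single match.
regmatches("LAST_NAME", gregexpr("\\w+", "LAST_NAME"))[[1]]
# [1] "LAST_NAME"
```

The two approaches differ when a column name contains a space: the first would prefix each part separately, while the second only looks at the commas.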

Place a dot before a letter

I need to put a dot before a letter in this type of strings
name of data set: V2
6K102
62D102
627Z102
I would like to get this:
6.K102
62.D102
627.Z102
I am using this regex:
mutate(V2 = gsub("^[A-Z]",'\\.', V2))
If the string has to start with 1 or more digits followed by a char A-Z, you could use 2 capturing groups
^(\d+)([A-Z])
In the replacement use "\\1.\\2"
sub("^([0-9]+)([A-Z])", "\\1.\\2", V2)
You could use sub("([A-Z])",".\\1", V2)
This applies to the question before it was updated.
Usually \u2022 prints a bullet in text items. So if your question is about a label, you can just insert it there: "\u2022 ..."
Otherwise, for text items in datasets such as V2, you can work around it by applying paste0 in combination with ifelse. In this case, the text items that need a black dot in front are stored in V2$name
V2$name <- ifelse(V2$name==1,1,paste0("\u2022", V2$name))
Your regex lacks a capturing group around the letter pattern (so that you could keep it after replacement) and contains a redundant ^ anchor that matches the string start location. Also, you are using a gsub function while you just need a sub, since only one replacement is expected.
Use
sub("([[:upper:]])", ".\\1", V2)
With stringr (see demo):
stringr::str_replace(V2, "[[:upper:]]", "\\.\\0")
Details
sub - only the first match is replaced
([[:upper:]]) - matches and captures any uppercase letter into Group 1 (later referenced to with \1 from the replacement pattern)
\1 - the value of Group 1 (the uppercase letter matched)
Note that the stringr solution uses \0, the placeholder for the whole match value, so there is no need to capture the uppercase letter in the regex pattern.
See the R demo:
V2 <- c("6K102","62D102","627Z102")
sub("([[:upper:]])", ".\\1", V2)
# => [1] "6.K102" "62.D102" "627.Z102"
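To compare the anchored and unanchored patterns from the answers, and to see what the choice between sub and gsub changes, here is a small sketch. The input "6K1B2" is a hypothetical string with two letters and does not come from the question.

```r
V2 <- c("6K102", "62D102", "627Z102")

# Anchored: requires leading digits followed by an uppercase letter.
anchored <- sub("^([0-9]+)([A-Z])", "\\1.\\2", V2)

# Unanchored: puts a dot before the first uppercase letter anywhere.
first_upper <- sub("([[:upper:]])", ".\\1", V2)

identical(anchored, first_upper)  # TRUE for these inputs

# Hypothetical input with two uppercase letters.
one_dot  <- sub("([[:upper:]])", ".\\1", "6K1B2")   # only the first match
all_dots <- gsub("([[:upper:]])", ".\\1", "6K1B2")  # every match
one_dot   # "6.K1B2"
all_dots  # "6.K1.B2"
```

With exactly one letter per string, as in the question, both functions give the same result, so sub is enough.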

Using regular expressions (regex) to replace multiple patterns at the same time in R

I have a vector of strings and I want to remove -es from all strings (words) ending in either -ses or -ces at the same time. The reason I want to do it at the same time and not consecutively is that sometimes, after removing one ending, the other ending appears, and I don't want to apply this pattern to a single word twice.
I have no idea how to use two patterns at the same time, but this is the best I could do:
text <- gsub("[sc]+s$", "[sc]", text)
I know the replacement is not correct, but I wonder how can I show that I want to replace it with the letter I just detected (c or s in this case). Thank you in advance.
To remove es at the end of words, that is preceded with s or c, you may use
gsub("([sc])es\\b", "\\1", text)
gsub("(?<=[sc])es\\b", "", text, perl=TRUE)
To remove them at the end of strings, you can go on using your $ anchor:
gsub("([sc])es$", "\\1", text)
gsub("(?<=[sc])es$", "", text, perl=TRUE)
The first gsub TRE pattern is ([sc])es\b: a capturing group #1 that matches either s or c, and then es is matched, and then \b makes sure the next char is not a letter, digit or _. The \1 in the replacement is the backreference to the value stored in the capturing group #1 memory buffer.
In the second example with the PCRE regex (due to perl=TRUE), (?<=[sc]) positive lookbehind is used instead of the ([sc]) capturing group. Lookbehinds are not consuming text, the text they match does not land in the match value, and thus, there is no need to restore it anyhow. The replacement is an empty string.
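Because each call makes a single pass and the pattern is anchored at the end of the string, the ending is removed at most once per word. A quick check with made-up sample words (the vector below is hypothetical, and "baseses" is an artificial word chosen because it still ends in -ses after one removal):

```r
# Hypothetical sample words, not from the question.
text <- c("glasses", "faces", "baseses", "homes")

# Capturing group: put the matched s or c back with \\1.
with_group <- gsub("([sc])es$", "\\1", text)

# Lookbehind: only "es" is matched, so the replacement is empty.
with_lookbehind <- gsub("(?<=[sc])es$", "", text, perl = TRUE)

with_group
# [1] "glass" "fac"   "bases" "homes"
identical(with_group, with_lookbehind)  # TRUE
```

"baseses" becomes "bases" and is left at that, and "homes" is untouched because its -es is not preceded by s or c.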
Strings ending with "ces" and "ses" follow the same pattern, i.e. "[cs]es$"
If I understand it correctly, then you don't need two patterns.
Example:
x = c("ces", "ses", "mes")
gsub(pattern = "([cs])es$", replacement = "\\1", x)
[1] "c" "s" "mes"
Hope it helps.
M

remove parenthesis after number, keep number

I need to remove a parenthesis after a number in a string:
"dl_CONH_r = a0cons+a2cons*(CONH_r_lag_1)-a3cons*HGDI_r_lag_1)-(1-a3cons)*HNW_r_lag_2)+a4cons*rate_90_r_lag_1))+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_2)+a7cons*dl_HNW_r_lag_1)+a8cons*d_rate_UNE_lag_2)+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
The resulting string should look like this:
"dl_CONH_r = a0cons+a2cons*(CONH_r_lag_1-a3cons*HGDI_r_lag_1-(1-a3cons)*HNW_r_lag_2+a4cons*rate_90_r_lag_1)+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_2+a7cons*dl_HNW_r_lag_1+a8cons*d_rate_UNE_lag_2+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
The regular expression I am trying to capture here is the first parenthesis after the string "lag_" followed by some number. Note, that in places there are two parenthesis:
rate_90_r_lag_1))
And I only want to remove the first one.
I've tried a simple regex in gsub
a <- "dl_CONH_r = a0cons+a2cons*(CONH_r_lag_1)-a3cons*HGDI_r_lag_1)-(1-a3cons)*HNW_r_lag_2)+a4cons*rate_90_r_lag_1))+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_2)+a7cons*dl_HNW_r_lag_1)+a8cons*d_rate_UNE_lag_2)+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
gsub("[0-9]\\)","[0-9]",a)
But the resulting string removes the number and replaces it with [0-9]:
"dl_CONH_r = a0cons+a2cons*(CONH_r_lag_[0-9]-a3cons*HGDI_r_lag_[0-9]-(1-a3cons)*HNW_r_lag_[0-9]+a4cons*rate_90_r_lag_[0-9])+a5cons*dl_HCOE_r+a6cons*dl_HOY_r_lag_[0-9]+a7cons*dl_HNW_r_lag_[0-9]+a8cons*d_rate_UNE_lag_[0-9]+(1-a5cons-a6cons-a7cons)*(dl_TREND_PROD+dl_TREND_AVEH+dl_TREND_WAP)"
I understand that the gsub is doing what it is intended to do. What I don't know is how to keep the number before the parenthesis?
You need to use a look around (in this case the preceded by) so that it will match just the parentheses as the matching group instead of the numbers and the parentheses. Then you can just remove the parentheses.
gsub("(?<=[0-9])\\)","", a, perl = TRUE)
You can do this using capture groups:
Let's just try it on the string my_string <- " = a0cons+a2cons*(CONH_r_lag_1)-a3cons*"
reg_expression <- "(.*[0-9])\\)(.*)" #two capture groups, with the parenthesis not in a group
my_sub_string <- sub(reg_expression,"\\1\\2", my_string)
Notice "\\1" reads like \1 to the regex engine, and so is a special character referring to the first capture group. (These can also be named)
Another way of doing this is lookarounds:
There are two basic kinds of lookarounds, a lookahead (?=) and a lookbehind (?<=). Since we want to match a pattern, but not capture, something behind our matched expression we need a lookbehind.
reg_expression <- "(?<=[0-9])\\)" #lookbehind
my_sub_string <- sub(reg_expression,"", my_string, perl = TRUE) #lookarounds need the PCRE engine
Which will match the pattern, but only replace the parenthesis.
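To handle every occurrence rather than just the first, both ideas can be used with gsub. The sketch below uses a shortened, hypothetical expression in the same shape as the question's string, including a doubled parenthesis. Note that the lookbehind version needs perl = TRUE in base R.

```r
# Shortened, hypothetical expression; not the string from the question.
s <- "a*(x_lag_1)-b*y_lag_2))+(1-c)"

# Capture group: keep the digit (\\1), drop the parenthesis after it.
with_group <- gsub("([0-9])\\)", "\\1", s)

# Lookbehind: match only a parenthesis that directly follows a digit.
with_lookbehind <- gsub("(?<=[0-9])\\)", "", s, perl = TRUE)

with_group
# [1] "a*(x_lag_1-b*y_lag_2)+(1-c)"
identical(with_group, with_lookbehind)  # TRUE
```

In "lag_2))" only the first parenthesis follows a digit, so the second one is kept, and "(1-c)" is unchanged because its closing parenthesis follows a letter.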

Separate long and complex names in R

Say I have the following list of full scientific names of plant species inside my dataset:
FullSpeciesNames <- c("Aronia melanocarpa (Michx.) Elliott", "Cotoneaster divaricatus Rehder & E. H. Wilson","Rosa canina L.","Ranunculus montanus Willd.")
I want to obtain a list of simplified names, i.e just the first two elements of a given name, namely:
SimpleSpeciesNames<- c("Aronia melanocarpa", "Cotoneaster divaricatus", "Rosa canina", "Ranunculus montanus")
How can this be done in R?
We can use sub to match a word (\\w+) followed by one or more white space (\\s+) followed by another word and space, capture as a group, and the rest of the characters (.*). In the replacement, use the backreference of the captured group (\\1)
trimws(sub("^((\\w+\\s+){2}).*", "\\1", FullSpeciesNames))
An alternative that is more complicated in function use, but does not require regular expressions is
substring(FullSpeciesNames,
1, sapply(gregexpr(" ", FullSpeciesNames, fixed=TRUE), "[[", 2) - 1)
[1] "Aronia melanocarpa" "Cotoneaster divaricatus" "Rosa canina" "Ranunculus montanus"
gregexpr can be used to find the positions of certain characters in a string (it can also look for patterns with regular expressions). Here we are looking for spaces. It returns a list of the positions for each string in the character vector. sapply is used to extract the position of the second space. The vector of these positions (minus one) is fed to substring, which runs through the initial vector and takes the substrings starting from the first character to the indicated position.
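The steps described above can be traced on a single name. This is a sketch for illustration; the variable names x, space_pos and second_space are made up.

```r
x <- "Rosa canina L."

# gregexpr returns a list with one element per input string;
# [[1]] holds the positions of every space in x.
space_pos <- gregexpr(" ", x, fixed = TRUE)[[1]]
as.integer(space_pos)
# [1]  5 12

# Take the position of the second space and cut just before it.
second_space <- space_pos[2]
simple_name <- substring(x, 1, second_space - 1)
simple_name
# [1] "Rosa canina"
```

A name with fewer than two spaces has no second position, so the sapply(..., "[[", 2) call would fail on it; such names would need separate handling.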
