Changing "firstname lastname" to "lastname, firstname" - r

I have a list of names that I need to convert from "Firstname Lastname" to "Lastname, Firstname".
Barack Obama
Donald J. Trump
J. Edgar Hoover
Beyonce Knowles-Carter
Sting
I used G. Grothendieck's answer to "last name, first name" -> "first name last name" in serialized strings to get to gsub("([^ ]*) ([^ ]*)", "\\2, \\1", str) which gives me -
Obama, Barack
J., DonaldTrump,
Edgar, J.Hoover,
Knowles-Carter, Beyonce
Sting
What I would like to get -
Obama, Barack
Trump, Donald J.
Hoover, J. Edgar
Knowles-Carter, Beyonce
Sting
I would like a regex answer.

There is an esoteric function called person designed for holding names, a conversion function as.person which does this parsing for you and a format method to make use of it afterwards (with a creative use of the braces argument). It even works with complex surnames (eg van Nistelrooy) but the single name result is unsatisfactory. It can fixed with a quick ending sub though.
x <- c("Barack Obama","Donald J. Trump","J. Edgar Hoover","Beyonce Knowles-Carter","Sting", "Ruud van Nistelrooy", "John von Neumann")
y <- as.person(x)
format(y, include=c("family","given"), braces=list(family=c("",",")))
[1] "Obama, Barack" "Trump, Donald J."
[3] "Hoover, J. Edgar" "Knowles-Carter, Beyonce"
[5] "Sting," "van Nistelrooy, Ruud"
[7] "von Neumann, John"
## fix for single names - curse you Sting!
sub(",$", "", format(y, include=c("family","given"), braces=list(family=c("",","))))
[1] "Obama, Barack" "Trump, Donald J."
[3] "Hoover, J. Edgar" "Knowles-Carter, Beyonce"
[5] "Sting" "van Nistelrooy, Ruud"
[7] "von Neumann, John"

Use
gsub("(.*[^van])\\s(.*)", "\\2, \\1", people)
The regex:
(.*[^van]) \\s (.*)
Any ammount of characters exluding "van"... the last white space... The last name containing any character.
Data:
people <- c("Barack Obama",
"Donald J. Trump",
"J. Edgar Hoover",
"Beyonce Knowles-Carter",
"Sting",
"Ruud van Nistelrooy",
"Xi Jinping",
"Hans Zimvanmer")
Result:
[1] "Obama, Barack" "Trump, Donald J." "Hoover, J. Edgar"
[4] "Knowles-Carter, Beyonce" "Sting" "van Nistelrooy, Ruud"
[7] "Jinping, Xi" "Zimvanmer, Hans"

Related

How to append strings to a vector from a for loop?

I got a data frame that has a column with names separated by commas, I want to create a vector that includes each name independently inside but my solution didn't work. Need help with it.
library(tidyverse)
cast <- netflix_titles$cast
names <- c()
for(i in cast){
splitted <- strsplit(i, ",")
for(act in splitted){
append(names, act)
}
}
rows are in this format
"Jesse Eisenberg, Woody Harrelson, Emma Stone, Abigail Breslin, Amber Heard, Bill Murray, Derek Graf"
You can get a vector of names with unlist(strsplit()). strsplit itself returns a list which you can turn into an atomic vector with unlist.
unlist(strsplit("Jesse Eisenberg, Woody Harrelson, Emma Stone, Abigail Breslin, Amber Heard, Bill Murray, Derek Graf", ", "))
#> [1] "Jesse Eisenberg" "Woody Harrelson" "Emma Stone" "Abigail Breslin"
#> [5] "Amber Heard" "Bill Murray" "Derek Graf"
Hence, you can completely remove the for loop if you add unlist().
You can even do it for the whole column in the data frame:
df <- data.frame(cast = c(
"Jesse Eisenberg, Woody Harrelson, Emma Stone, Abigail Breslin, Amber Heard, Bill Murray, Derek Graf",
"Bruce Willis, Matt Damon, Brad Pitt"
))
unlist(strsplit(df$cast, ", "))
#> [1] "Jesse Eisenberg" "Woody Harrelson" "Emma Stone" "Abigail Breslin"
#> [5] "Amber Heard" "Bill Murray" "Derek Graf" "Bruce Willis"
#> [9] "Matt Damon" "Brad Pitt"

How do I remove part of a string that follows a certain pattern up to, but not including another pattern using R?

I have a dataframe in R that contains people data. First part of a string is a full name. Every so often I encounter a nickname in brackets. There could be other data enclosed in brackets that I do not want to delete. Here is an example of a kind of data I am working with:
Name <- c(
"JOSEPH RYAN SMITH (USRID1)",
"ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)",
"TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)",
"JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)",
"WILLIAM (BILLIE) JOEL (USRID5)")
df <- as.data.frame(Name)
I get:
Name
1 JOSEPH RYAN SMITH (USRID1)
2 ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)
3 TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)
4 JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)
5 WILLIAM (BILLIE) JOEL (USRID5)
I only want to remove nicknames. I noticed that what sets a nickname apart is that it is always in brackets and is always followed by a last name. All other indicators included in brackets are followed by " (" or end of record. I tried to remove a string that is in brackets that is followed by a space and a character A-Z.
df$Name <- str_remove(df$Name, "[\\(][A-Z]+[\\)][ ][A-Z]")
This removed the first letter of the last name and gave me:
Name
1 JOSEPH RYAN SMITH (USRID1)
2 ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)
3 TIMOTHY OHNSON (USRID3) (INTERN)
4 JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)
5 WILLIAM OEL (USRID5)
I also unsuccessfully tried "not followed by (" like this:
df$Name <- str_remove(df$Name, "[\\(][A-Z]+[\\)][ ][^\\(]")
I tried a few other things which removed other indicators that are in brackets that I do need to keep. Any help is appreciated. Thank you.
Use positive lookeahd (?=) so that first letter of last name is matched but not removed.
stringr::str_remove(df$Name, "\\([A-Z]+\\)\\s(?=[A-Z])")
#[1] "JOSEPH RYAN SMITH (USRID1)"
#[2] "ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)"
#[3] "TIMOTHY JOHNSON (USRID3) (INTERN)"
#[4] "JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)"
#[5] "WILLIAM JOEL (USRID5)"
You can also write this in base R with sub :
sub('\\([A-Z]+\\)\\s(?=[A-Z])', '', df$Name, perl = TRUE)

Split 2 numbers character into it's own new line

I've a character object with 84 elements.
> head(output.by.line)
[1] "\n17"
[2] "Now when Joseph saw that his father"
[3] "laid his right hand on the head of"
[4] "Ephraim, it displeased him; so he took"
[5] "hold of his father's hand to remove it"
[6] "from Ephraim's head to Manasseh's"
But there is a line that has 2 numbers (49) that is not in it's own line:
[35] "49And Jacob called his sons and"
I'd like to transform this into:
[35] "\n49"
[36] "And Jacob called his sons and"
And insert this in the correct numeration, after object 34.
Dput Output:
dput(output.by.line)
c("\n17", "Now when Joseph saw that his father", "laid his right hand on the head of",
"Ephraim, it displeased him; so he took", "hold of his father's hand to remove it",
"from Ephraim's head to Manasseh's", "head.", "\n18", "And Joseph said to his father, \"Not so,",
"my father, for this one is the firstborn;", "put your right hand on his head.\"",
"\n19", "But his father refused and said, \"I", "know, my son, I know. He also shall",
"become a people, and he also shall be", "great; but truly his younger brother shall",
"be greater than he, and his descendants", "shall become a multitude of nations.\"",
"\n20", "So he blessed them that day, saying,", "\"By you Israel will bless, saying, \"May",
"God make you as Ephraim and as", "Manasseh!\"' And thus he set Ephraim",
"before Manasseh.", "\n21", "Then Israel said to Joseph, \"Behold, I",
"am dying, but God will be with you and", "bring you back to the land of your",
"fathers.", "\n22", "Moreover I have given to you one", "portion above your brothers, which I",
"took from the hand of the Amorite with", "my sword and my bow.\"",
"49And Jacob called his sons and", "said, \"Gather together, that I may tell",
"you what shall befall you in the last", "days:", "\n2", "\"Gather together and hear, you sons of",
"Jacob, And listen to Israel your father.", "\n3", "\"Reuben, you are my firstborn, My",
"might and the beginning of my strength,", "The excellency of dignity and the",
"excellency of power.", "\n4", "Unstable as water, you shall not excel,",
"Because you went up to your father's", "bed; Then you defiled it-- He went up to",
"my couch.", "\n5", "\"Simeon and Levi are brothers;", "Instruments of cruelty are in their",
"dwelling place.", "\n6", "Let not my soul enter their council; Let",
"not my honor be united to their", "assembly; For in their anger they slew a",
"man, And in their self-will they", "hamstrung an ox.", "\n7",
"Cursed be their anger, for it is fierce;", "And their wrath, for it is cruel! I will",
"divide them in Jacob And scatter them", "in Israel.", "\n8",
"\"Judah, you are he whom your brothers", "shall praise; Your hand shall be on the",
"neck of your enemies; Your father's", "children shall bow down before you.",
"\n9", "Judah is a lion's whelp; From the prey,", "my son, you have gone up. He bows",
"down, he lies down as a lion; And as a", "lion, who shall rouse him?",
"\n10", "The scepter shall not depart from", "Judah, Nor a lawgiver from between his",
"feet, Until Shiloh comes; And to Him", "shall be the obedience of the people.",
"\n11", "Binding his donkey to the vine, And his", "donkey's colt to the choice vine, He"
)
Please, check this:
library(tidyverse)
split_line_number <- function(x) {
x %>%
str_replace("^([0-9]+)", "\n\\1\b") %>%
str_split("\b")
}
output.by.line %>%
map(split_line_number) %>%
unlist()
# Output:
# [35] "\n49"
# [36] "And Jacob called his sons and"
# [37] "said, \"Gather together, that I may tell"
# [38] "you what shall befall you in the last"
An option using stringr::str_match is to match two components of an optional number followed by everything. Get the captured output from the matched matrix (2:3) and create a new vector of strings by dropping NAs and empty strings.
vals <- c(t(stringr::str_match(output.by.line, "(\n?\\d+)?(.*)")[, 2:3]))
output <- vals[!is.na(vals) & vals != ""]
output[32:39]
#[1] "portion above your brothers, which I"
#[2] "took from the hand of the Amorite with"
#[3] "my sword and my bow.\""
#[4] "49"
#[5] "And Jacob called his sons and"
#[6] "said, \"Gather together, that I may tell"
#[7] "you what shall befall you in the last" "days:"
We'll make use of the stringr package:
library(stringr)
Modify the object:
output.by.line <- unlist(
ifelse(grepl('[[:digit:]][[:alpha:]]', output.by.line), str_split(gsub('([[:digit:]]+)([[:alpha:]])', paste0('\n', '\\1 \\2'), output.by.line), '[[:blank:]]', n = 2), output.by.line)
)
Print the resuts:
dput(output.by.line)
#[32] "portion above your brothers, which I"
#[33] "took from the hand of the Amorite with"
#[34] "my sword and my bow.\""
#[35] "\n49"
#[36] "And Jacob called his sons and"
#[37] "said, \"Gather together, that I may tell"
#[38] "you what shall befall you in the last"

Collapse elements separated by ""

I have raw bibliographic data as follows:
bib =
c("Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte",
"Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in",
"Republican China*, Cambridge: Harvard University Press, 1976.",
"", "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing",
"History* and the Crisis of Traditional Chinese Historiography,\"",
"*Historiography East & West*2.2 (Sept. 2004): 173-204", "",
"Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:",
"Yale University Press, 1988.", "")
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte"
[2] "Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in"
[3] "Republican China*, Cambridge: Harvard University Press, 1976."
[4] ""
[5] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing"
[6] "History* and the Crisis of Traditional Chinese Historiography,\""
[7] "*Historiography East & West*2.2 (Sept. 2004): 173-204"
[8] ""
[9] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:"
[10] "Yale University Press, 1988."
[11] ""
I would like to collapse elements between the ""s in one line so that:
clean_bib[1]=paste(bib[1], bib[2], bib[3])
clean_bib[2]=paste(bib[5], bib[6], bib[7])
clean_bib[3]=paste(bib[9], bib[10])
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
Is there a one-liner that does this automatically?
You can use tapply while grouping with all "" then paste together the groups
unname(tapply(bib,cumsum(bib==""),paste,collapse=" "))
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] " Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] " Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
[4] ""
you can also do:
unname(c(by(bib,cumsum(bib==""),paste,collapse=" ")))
or
unname(tapply(bib,cumsum(grepl("^$",bib)),paste,collapse=" "))
etc
Similar to the other answer. This uses split and sapply. The second line is just to remove any elements with only has "".
vec <- unname(sapply(split(bib, f = cumsum(bib %in% "")), paste0, collapse = " "))
vec[!vec %in% ""]

Remove specific string at the end position of each row from dataframe(csv)

I am trying to clean a set of data which is in csv format. After loading data into R, i need to replace and also remove some characters from the it. Below is an example. Ideally i want to
replace the St at the end of each -> Street
in cases where there are St St.
i need to remove St and replace St. with just Street.
I tried to use this code
sub(x = evostreet, pattern = "St.", replacement = " ") and later
gsub(x = evostreet, pattern = "St.", replacement = " ") to remove the St. at the end of each row but this also remove some other occurrences of St and the next character
3 James St.
4 Glover Road St.
5 Jubilee Estate. St.
7 Fed Housing Estate St.
8 River State School St.
9 Brown State Veterinary Clinic. St.
11 Saw Mill St.
12 Dyke St St.
13 Governor Rd St.
I'm seeing a lot of close answers but I'm not seeing any that address the second problem he's having such as replacing "St St." with "Street"; e.g., "Dyke St St."
sub, as stated in the documentation:
The two *sub functions differ only in that sub replaces only the first occurrence of a pattern
So, just using "St\\." as the pattern match is incorrect.
OP needs to match a possible pattern of "St St." and I'll further assume that it could even be "St. St." or "St. St".
Assuming OP is using a simple list:
x = c("James St.", "Glover Road St.", "Jubilee Estate. St.",
"Fed Housing Estate St.", "River State School St St.",
"Brown State Vet Clinic. St. St.", "Dyke St St.")`
[1] "James St." "Glover Road St."
[3] "Jubilee Estate. St." "Fed Housing Estate St."
[5] "River State School St St." "Brown State Vet Clinic. St. St."
[7] "Dyke St St."
Then the following will replace the possible combinations mentioned above with "Street", as requested:
y <- sub(x, pattern = "[ St\\.]*$", replacement = " Street")
[1] "James Street" "Glover Road Street"
[3] "Jubilee Estate Street" "Fed Housing Estate Street"
[5] "River State School Street" "Brown State Vet Clinic Street"
[7] "Dyke Street"
Edit:
To answer OP's question below in regard to replacing one substr of St. with Saint and another with Street, I was looking for a way to be able to match similar expressions to return different values but at this point I haven't been able to find it. I suspect regmatches can do this but it's something I'll have to fiddle with later.
A simple way to accomplish what you're wanting - let's assume:
x <- c("St. Mary St St.", "River State School St St.", "Dyke St. St")
[1] "Saint Mary St St." "River State School St St."
[3] "Dyke St. St"
So you want x[1] to be Saint Mary Street, x[2] to be River State School Street and x[3] to be Dyke Street. I would want to resolve the Saint issue first by assigning sub() to y like:
y <- sub(x, pattern = "^St\\.", replacement = "Saint")
[1] "Saint Mary Street" "River State School Street"
[3] "Dyke Street"
To resolve the St's as the end, we can use the same resolution as I posted except notice now I'm not using x as my input vector but isntead the y I just made:
y <- sub(y, pattern = "[ St\\.]*$", replacement = " Street")
And that should take care of it. Now, I don't know if this is the most efficient way. And if you're dataset is rather large this may run slow. If I find a better solution I will post it (provided no one else beats me).
You don't need to use regular expression here.
sub(x = evostreet, pattern = "St.", replacement = " ", fixed=T)
The fixed argument means that you want to replace this exact character, not matches of a regular expression.
I think that your problem is that the '.' character in the regular expression world means "any single character". So to match literally in R you should write
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You will need to "comment" the dot... otherwise it means anything after St and that is why some other parts of your text are eliminated.
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You can add $ at the end if you want to remove the tag apearing just at the end of the text.
sub(x = evostreet, pattern = "St\\.$", replacement = " ")
The difference between sub and gsub is that sub will deal just with the firs time your tag appears in a text. gsub will eliminate all if there are duplicated. In your case as you are looking for the pattern at the end of the line it should not make any difference if you use the $.

Resources