How can I extract the title from a name in a column? - r

I have a column of names of the form "Hobs, Mr. jack", i.e. lastname, title. firstname. The title has 4 types: "Mr.", "Mrs.", "Miss.", "Master." How can I search each item in the column and return the title, which I can then store in another column?
Name <- c("Hobs, Mr. jack","Hobs, Master. John","Hobs, Mrs. Nicole",........)
desired output - a column "title" with values - ("Mr","Master", "Mrs",.....)
I have tried something like this:
f <- function(d) {
if (grep("Mr", d$title)) {
gsub("$Mr$", "Mr", d$title, ignore.case = T)
}
}
no success >.<

Maybe something like this:
library(stringr)
> Name <- c("Hobs, Mr. jack","Hobs, Master. John","Hobs, Mrs. Nicole")
> str_extract(string = Name,pattern = "(Mr|Master|Mrs)\\.")
[1] "Mr." "Master." "Mrs."
A fancier regex might exclude the period up front, or you could remove them in a second step.
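As a rough sketch of the "fancier regex" idea, a lookahead can require the period without including it in the match. This version uses base R (regexpr and regmatches with perl = TRUE) instead of stringr. The fourth name and the variable Title are only illustrative and are not from the question.

```r
Name <- c("Hobs, Mr. jack", "Hobs, Master. John", "Hobs, Mrs. Nicole", "Hobs, Miss. Ann")

# (?<=, ) requires ", " just before the title; (?=\\.) requires a period
# right after it. Neither lookaround is part of the returned match.
m <- regexpr("(?<=, )(Mr|Mrs|Miss|Master)(?=\\.)", Name, perl = TRUE)

# Names without a recognised title stay NA
Title <- rep(NA_character_, length(Name))
Title[m != -1] <- regmatches(Name, m)
Title
```

Names that contain none of the four titles are left as NA rather than being returned unchanged.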

Assuming the dataset is named df and the column is Name, the new column will be Title.
df$Title <- gsub('(.*, )|(\\..*)', '', df$Name)

Related

How to perform "find and replace" with multiple patterns to be found in a string in R

I am trying to switch genders of words in a string in R. For example, if I have the sentence "My gf has a mother who talks to my father and his bf", I want it to read "My bf has a father who talks to my mother and her gf".
I have a key-value pair list which contains a list of gender pairs -- right now it is just a dataframe which looks something like the below. Then my naive way of solving it was just to do a string replace where I iterate through the list and replace the key with the value. The obvious problem with this is that it just ends up swapping everything in the sentence, and then swapping it all back. You can see this in the example code below.
library(stringr)
key_vals = data.frame(first_word = c("bf", "gf", "mother", "father", "his", "her"), second_word = c("gf", "bf", "father", "mother", "her", "his"))
ex = "My gf has a mother who talks to my father and his bf"
for(i in 1:nrow(key_vals)){
ex = str_replace_all(ex, key_vals$first_word[i], key_vals$second_word[i])
}
My other idea was making two lists, one which had all male keys and all female values, and one which was the opposite. Then if I split up the sentence into individual words, for each word I could do an if statement like "if a male string is present, replace it with a female string, elif a female string is present, replace it with a male string, else do nothing". However, I can't figure out how to get just the words alone in a way I can then easily recombine into a working sentence. String split based on regex etc. just deletes the words, so I'm really struggling.
Another problem is that if, for example, there is something like "mother", it might get replaced to be "mothis", since I'm using a stupid way of matching strings which doesn't first identify the words, so it seems like I need to split it into words in any case.
This feels like it should be much more straightforward than it has been for me! Any help would be very appreciated.
We may use gsubfn
library(gsubfn)
gsubfn("(\\w+)", setNames(as.list(key_vals[[2]]), key_vals[[1]]), ex)
[1] "My bf has a father who talks to my mother and her gf"
Change for loop part to this:
plyr::mapvalues(str_split(ex, ' ')[[1]], key_vals$first_word, key_vals$second_word) %>%
str_flatten(' ')
The following `from` values were not present in `x`: her
[1] "My bf has a father who talks to my mother and her gf"
ex
[1] "My gf has a mother who talks to my father and his bf"
I think the warning can be ignored as it is just complaining that her is not in the sentence that ex contains.
The code first splits the character into a vector, then replaces the individual words and then pastes them back together again.
Rather than relying on a data frame of replacements, you could use a named vector, which is similar to a dictionary of values:
replacements <- key_vals$second_word
names(replacements) <- key_vals$first_word
bf gf mother father his her
"gf" "bf" "father" "mother" "her" "his"
ex_split <- str_split(ex, ' ')[[1]]
swapped <- replacements[ex_split]
final <- paste0(ifelse(!is.na(swapped), swapped, ex_split), collapse = ' ')
"My bf has a father who talks to my mother and her gf"
After creating ex_split, you could also substitute and glue everything together with Reduce:
Reduce(function(x, y) paste(x, ifelse(!is.na(replacements[y]), replacements[y], y)), ex_split)
Here is a base R option using strsplit + match like below
with(
key_vals,
{
v <- unlist(strsplit(ex, "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)", perl = TRUE))
p <- second_word[match(v, first_word)]
paste0(ifelse(is.na(p), v, p), collapse = "")
}
)
and it yields
[1] "My bf has a father who talks to my mother and her gf"
This does what you need.
library(stringr)
# I've updated the column names, for clarity
key_vals <- data.frame(words = c("bf", "gf", "mother", "father", "his", "her"), swapped_words = c("gf", "bf", "father", "mother", "her", "his"))
# used str_split to break the sentence into multiple words
ex <- "My gf has a mother who talks to my father and his bf"
words <- stringr::str_split(ex, " ")[[1]] #break into words
# do a left join between the two tables (all.x = TRUE keeps every word of the sentence)
dict <- merge(data.frame(words=words), key_vals, by = "words", all.x = TRUE, incomparables = NA)
# now we basically apply the dictionary to the string, using an apply function
# we also use paste(..., collapse = " ") to make them into one sentence again
words <- paste(sapply(words, function(x) {
if (!x %in% key_vals$words)
return (x)
return(dict$swapped_words[dict$words == x])
}), collapse=" ")

name splitting in base r

I've got a list of names that have been written in a messy way in a single column. I'm trying to extract first name, middle names and last names out of this column to store separately.
To do this, I gsub the first word from each name entry and save it as the first name. I then remove the last word and first word of each entry and save that as the middle names. Then I gsub the last word from each entry and save it as the last name.
This gave me a problem, because for entries that have only one name entered (so 'kevin' instead of 'kevin banks') my code saves the first name as the last name ('kevin kevin'). I tried to fix it using a for-loop that deletes the lastname column if the original name entry has only 1 word. When I try this, ALL the lastname entries are empty, even the ones that do have a last name!
This is my code:
df <- data.frame(ego = c("linda", "wendy pralice of rivera", "bruce springsteen", "dan", "sam"))
df$firstname <- gsub("([A-Za-z]+).*", "\\1", df$ego)
df$middlename <- gsub("^\\w*\\s*", "", gsub("\\s*\\w*\\.*$", "", df$ego))
df$lastname <- gsub("^.* ([A-Za-z]+)", "\\1", df$ego)
for(n in df$ego) {
if(lengths(strsplit(n, " ")) == 1) {
df$lastname <- ""
}
}
What am I doing wrong?
If there are 4 fields put double quotes around the middle two. For example, a b c d would be changed to a "b c" d giving s1. (If there are not 4 fields then no substitution is done and s1 is set to df$ego.)
If there are exactly two fields insert double quotes between the two. For example, a b would be changed to a "" b. (If there are not exactly two fields then no substitution is done and s2 is set to s1).
Finally, read it in with read.table.
s1 <- sub('^(\\w+) (\\w+ \\w+) (\\w+)$', '\\1 "\\2" \\3', df$ego)
s2 <- sub('^(\\w+) (\\w+)$', '\\1 "" \\2', s1)
read.table(text = s2, as.is = TRUE, fill = TRUE,
col.names = c("first", "middle", "last"))
giving:
  first     middle        last
1 linda
2 wendy pralice of      rivera
3 bruce            springsteen
4   dan
5   sam
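As for what went wrong in the question's loop: `df$lastname <- ""` assigns to the whole column, not to the current row, so the first single-word entry blanks every last name. A minimal sketch of a vectorised fix, reusing the question's data and its lastname regex (n_words is an illustrative name):

```r
df <- data.frame(ego = c("linda", "wendy pralice of rivera", "bruce springsteen", "dan", "sam"),
                 stringsAsFactors = FALSE)
df$lastname <- gsub("^.* ([A-Za-z]+)", "\\1", df$ego)

# Count the words in each entry, then blank only the single-word rows
n_words <- lengths(strsplit(df$ego, " "))
df$lastname[n_words == 1] <- ""
df$lastname
```

Indexing with the logical vector `n_words == 1` restricts the assignment to the rows that need it, so no loop is required.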

Partial lookup for cell contents from one column in a different column

I am working in R, using RStudio.
I have a dataframe with 4 columns.
Column A contains passenger iD,
B contains passenger name,
C contains husband name.
I am attempting to create a new column which looks to see if the husband name in column C is listed in any of the records in column B. If so, it should return the passenger iD of the husband from column A.
To make things more complicated, as in the first example, in some cases the husband's name given in column C might not include his second name, which would be included in column B.
library(stringr)
rm(list=ls())
passengerid <- c(0908,9883,7767,3302)
Name<- c("Backstrom, Mrs. Karl Alfred (Maria Mathilda Gustafsson)",
"Backstrom, Mr. Karl Alfred John",
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Cumings, Mr. John Bradley")
HusbandName <- c("Backstrom, Mr. Karl Alfred","","Cumings, Mr. John Bradley","")
df1<- data.frame(cbind(passengerid,Name,HusbandName))
df1$Name <- as.character(df1$Name)
df1$HusbandName <- as.character(df1$HusbandName)
I have tried using stringr, but I am facing problems because 1) I need the code to look at only one element of the vector HusbandName and search for it in the whole vector Name, and 2) I found it difficult to use regular expressions given that the pattern I am looking for is vectorised (as HusbandName).
This is what I have tried so far:
Attempt 1 - only finds exact matches & doesn't return the passengerID & doesn't add column to df
df1$Husbandid <- for (i in 1:NROW(df1$HusbandName)) {
print(HusbandName[i] %in% Name)}
Attempt 2 - finds partial matches, but does not ignore blanks & does not tell me passenger id & doesn't add column to df
df1$Husbandid <- for (i in 1:NROW(df1$HusbandName)) {
print(which(str_detect(df1$Name,df1$HusbandName[i])))}
Attempt 3 - almost works, but the printed results differ from those added to the dataframe as a new column. How can I correct this? Ultimately I need the values in the df to be correct. The error is that passengers without husbands are showing a Husbandid when it should be blank or NA. Can this be corrected, or is there a way to convert the output of the for loop into a vector we can add to the df?
for (i in 1:NROW(df1$HusbandName)) {
if (df1$HusbandName[i] =="") {
print("Man") & next()
}
FoundHusbandNames<- c(which(str_detect(df1$Name,df1$HusbandName[i])))
print(df1$passengerid[FoundHusbandNames]) -> df1$Husbandid[i] }
This will get you the id where the names actually match, as for Cumings. It won't work for Backstrom though. Not sure if you missed off the 'John' at the end of Karl Alfred or if the data is inconsistent. If the former, this should be fine.
library(dplyr)
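# filter on HusbandName first, while that column is still part of the data frame
husbands <- df1 %>% filter(HusbandName == '') %>% select(passengerid, Name)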
colnames(husbands)[2] <- "HusbandName"
df2 <- left_join(df1, husbands, by = "HusbandName")
View(df2)
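For the partial-match case (Backstrom), one possible sketch in base R is to treat each non-empty HusbandName as a prefix and look for it at the start of every Name. The helper find_husband_id is hypothetical, and the ids are stored as character here (an assumption) so that the leading zero of 0908 survives.

```r
df1 <- data.frame(
  passengerid = c("0908", "9883", "7767", "3302"),
  Name = c("Backstrom, Mrs. Karl Alfred (Maria Mathilda Gustafsson)",
           "Backstrom, Mr. Karl Alfred John",
           "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
           "Cumings, Mr. John Bradley"),
  HusbandName = c("Backstrom, Mr. Karl Alfred", "", "Cumings, Mr. John Bradley", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: id of the first passenger whose Name starts with
# the given husband name, or NA when the name is blank or nothing matches
find_husband_id <- function(husband, all_names, ids) {
  if (husband == "") return(NA_character_)
  hit <- which(startsWith(all_names, husband))
  if (length(hit) == 0) NA_character_ else ids[hit[1]]
}

df1$Husbandid <- vapply(df1$HusbandName, find_husband_id, character(1),
                        all_names = df1$Name, ids = df1$passengerid,
                        USE.NAMES = FALSE)
df1$Husbandid
```

Because "Mrs." does not start with "Mr.", the wife's own record is not matched. If two passengers share the same name prefix, only the first is returned, so real data may need a stricter rule.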

How to use a look up table to create a new row in a data frame?

I have two data frames. One contains the data I am trying to clean/modify(df_x) and one is a lookup table(df_y). There is a column df_x$TEXT that contains a string like "Some text - with more" and the lookup table df_y looks like this:
SORT           ABB
-------------- ----
Some Text      ST
I want to see if a value in df_y$SORT is in the df_x$TEXT for every row of df_x. If there is a match then take the df_y$ABB value at that matched row and add it to a new column in df_x like df_x$TEXT_ABB.
For the information above the algorithm would see that "Some Text" is in "Some text - with more" (ignoring case) so it would add the value "ST" to the column df_x$TEXT_ABB.
I know I can use match and/or a combination of sapply and grep to check whether it exists, but I cannot figure out how to do this AND grab the abbreviation so that I can map it back to a new column in the original dataframe.
You can try this:
df_x <- data.frame(TEXT=c("Some Text 001", "other text", "Some Text 002"))
df_y <- read.table(header=TRUE, text=
'SORT ABB
"Some Text" ST
"Other Text" OT')
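# lapply always returns a list (sapply may simplify to a matrix when all matches have equal length)
L <- lapply(df_y$SORT, grep, x=df_x$TEXT, ignore.case=TRUE)
df_x$abb <- NA
for (l in seq_along(L)) if (length(L[[l]])!=0) df_x$abb[L[[l]]] <- as.character(df_y$ABB[l])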
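An alternative sketch works row by row instead: for each TEXT value, find the first SORT entry it contains (ignoring case) and return its ABB, or NA if there is none. The helper lookup_abb and the extra "no match here" row are illustrative assumptions, not part of the question.

```r
df_x <- data.frame(TEXT = c("Some Text 001", "other text", "Some Text 002", "no match here"),
                   stringsAsFactors = FALSE)
df_y <- data.frame(SORT = c("Some Text", "Other Text"), ABB = c("ST", "OT"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: abbreviation of the first key found inside `text`.
# Both sides are lower-cased so a literal (fixed = TRUE) match ignores case.
lookup_abb <- function(text, keys, values) {
  found <- vapply(keys, function(k) grepl(tolower(k), tolower(text), fixed = TRUE),
                  logical(1))
  hit <- unname(which(found))
  if (length(hit) == 0) NA_character_ else values[hit[1]]
}

df_x$TEXT_ABB <- vapply(df_x$TEXT, lookup_abb, character(1),
                        keys = df_y$SORT, values = df_y$ABB, USE.NAMES = FALSE)
df_x$TEXT_ABB
```

Matching literally avoids surprises when a SORT value contains regex metacharacters. When several keys match the same text, only the first one in df_y is used.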

Convert string of words to a custom acronym/abbreviation in R and concatenate it with data from other rows?

Here is an example data set:
data <- data.frame (author = c('bob', 'john', 'james'),
year = c(2000, 1942, 1765),
title = c('test title one two three',
'another test title four five',
'third example title'))
And I would like to automate the process of making bibtex references, e.g. with a function like this:
bibtexify <- function (author, year, title) {
acronym <- convert.to.acronym(title)
paste(author, year, acronym, sep='')
}
so that I get the following result:
with(data, bibtexify(author, year, title))
[1] 'bob2000tto'
[2] 'john1942att'
[3] 'james1765tet'
Is it possible to do this in R?
You want abbreviate:
R> abbreviate('test title one two three')
test title one two three
"ttott"
Here is one possibility you could build from:
title <- c('test title one two three',
'another test title four five',
'third example title')
library(gsubfn)
sapply( strapply(title, "([a-zA-Z])[a-zA-Z]*"), function(x) paste(x[1:3], collapse=''))
This assumes that there are at least 3 words in each title, will need to be fixed if that is not the case.
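One way to relax the three-word assumption, sketched in base R without gsubfn, is to take at most n words with head() before collecting the first letters. The function first_letters and the extra "short one" title are illustrative. The bibtexify wrapper follows the signature given in the question.

```r
title <- c("test title one two three",
           "another test title four five",
           "third example title",
           "short one")

# First letter of up to `n` leading words; shorter titles simply give fewer letters
first_letters <- function(x, n = 3) {
  words <- strsplit(x, " +")
  vapply(words, function(w) paste(substring(head(w, n), 1, 1), collapse = ""),
         character(1))
}

bibtexify <- function(author, year, title) {
  paste0(author, year, first_letters(title))
}

first_letters(title)
bibtexify(c("bob", "john", "james"), c(2000, 1942, 1765), title[1:3])
```

With the question's data this reproduces the requested keys (bob2000tto, john1942att, james1765tet), and a two-word title yields a two-letter acronym instead of NA padding.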
If you really want an acronym, and not an abbreviation, you can use this function:
acronymatize <- function(x) {
s <- strsplit(x, " ")[[1]]
paste((substring(s, 1,1)),
sep="", collapse="")
}
Example:
> acronymatize("One Two Three")
[1] "OTT"
accronym <- function(x)
{
s <- strsplit(as.character(x)," ")
s1 <- lapply(s,substring,1,1)
s2 <- lapply(s1,paste,collapse="",sep="")
unlist(s2)
}
The answer by Patrick works when there is a single element to be made into an acronym. However, if more than one element is passed in, the function above may work better. Instead of taking only the first element of s, it converts the entire list into short form.
I was using too complicated a method before I saw Patrick's answer; thanks for the suggestion.
