Consider the following data.frame:
df <- data.frame(ID = 1:3, Name = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Chayes, Michael Jordan, John DeNero, Ani Adhikari, Jordan, Mia Scher", "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood") , stringsAsFactors = FALSE)
I would like to remove the duplicates First Name/Last Name if the Full Name is available in the String. Also, no changes made to the string if there is no match. The result should be like the data-frame provided below;
df <- data.frame(ID = 1:3, Name = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Chayes, Michael Jordan, John DeNero, Ani Adhikari, Jordan, Mia Scher", "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood"), UniqueName = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Michael Jordan, John DeNero, Ani Adhikari, Mia Scher", "Nenshad Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood") , stringsAsFactors = FALSE)
Any Inputs will be really appreciable.
Answer
Use grepl to find strings that [1] do not contain a space, and [2] are present in other names.
Code
df$UniqueName <- sapply(df$Name, function(x) {
sn <- unlist(strsplit(x, split = ", ", fixed = TRUE))
sn2 <- sn[!(!grepl(" ", sn) & sapply(sn, function(y) sum(grepl(y, sn)) > 1))]
paste(sn2, collapse = ", ")
})
Rationale
We use sapply since each entry needs a lot of work. We essentially perform 3 steps: [1] split the string with strsplit, [2] subset to keep only those that you want, [3] paste the string back together with paste.
The reasoning here is that single first or last names do not contain a space, and if they are present in other names then you want to remove them. Hence, we find those that do not have a space (!grepl(" ", sn)) and that are a substring of another entry (sapply(sn, function(y) sum(grepl(y, sn)) > 1)). Then, we remove those using [!( )].
Related
Hi I have a column with names. It has names in the format of:
c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner").
I just want to extract the first name but I am not exactly how to do it in an easy way due to the prefixes on the names. Please let me know if you have any suggestions.
This is an approach:
library(magrittr) # for %>%
dirty_names <- c(
"Tom",
"Tom Turner",
"Dr. Tom Turner",
"R. Tom Turner",
"J Tom Turner",
"Jr. Tom Turner"
)
dirty_names %>%
# Remove first word if it ends with . e.g. Dr., Jr., R.
sub("^\\w+\\.", "", .) %>%
trimws() %>%
# Remove first word if it is one letter e.g. J
sub("^[A-Za-z] ", "", .) %>%
# Delete everything after first word
sub("(\\w+).*", "\\1", .)
# [1] "Tom" "Tom" "Tom" "Tom" "Tom" "Tom"
Solution
Here is a solution in the tidyverse, which uses regular expressions ("regex") to extract every component of interest:
Optional prefix: either a single letter (J), or several letters followed by a period (Jr.); separated from the ensuing name by whitespace ( ).
Required first_name: a "streak" of characters before the next whitespace.
Optional last_name: a "streak" of characters after that next whitespace.
# Load useful functions.
library(tidyverse)
# ...
# Code to generate a 'dirty_data' table with a 'dirty_name' column.
# ...
# Define the regex for extracting the name components, each within a (capture group).
dirty_regex <-
# Prefix Next Whitespace
# |----------------------------------------------| |------------|
"^((([[:alpha:]])|([[:alpha:]]+\\.))[[:blank:]]+)?([^[:blank:]]+)(([[:blank:]]*)(.*))?$"
# |-------------| |--|
# First Name Last Name
# Clean the 'dirty_data' and store it in a fresh table: 'clean_data'.
clean_data <- dirty_data %>%
mutate(
# Remove external whitespace for easier analysis.
clean_full_name = str_trim(dirty_name),
# Break the dirty names (using regex) into a matrix of their components.
name_components = str_match(dirty_name, dirty_regex),
# Extract each component.
clean_prefix = name_components[, 2],
clean_first_name = name_components[, 6],
clean_last_name = name_components[, 9],
# Remove the matrix.
name_components = NULL,
# Trim any external whitespace in the (new) components.
across(starts_with("clean_") & !clean_full_name, str_trim),
# Replace any empty strings ("") with blanks (NAs).
across(starts_with("clean_"), na_if, y = "")
)
# Print and inspect our result.
clean_data
Result
Given data like your dirty_data below
# The dirty names.
dirty_names_vec <- c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner")
# A table with a column for the dirty names.
dirty_data <- tibble(dirty_name = dirty_names_vec)
this workflow should yield the following result for clean_data:
# A tibble: 6 × 5
dirty_name clean_full_name clean_prefix clean_first_name clean_last_name
<chr> <chr> <chr> <chr> <chr>
1 Tom Tom NA Tom NA
2 Tom Turner Tom Turner NA Tom Turner
3 Dr. Tom Turner Dr. Tom Turner Dr. Tom Turner
4 R. Tom Turner R. Tom Turner R. Tom Turner
5 J Tom Turner J Tom Turner J Tom Turner
6 Jr. Tom Turner Jr. Tom Turner Jr. Tom Turner
Note
If other "dirty" names are in different formats, you must modify your dirty_regex accordingly. You should likewise adjust the index i of each capture group, used to extract the components via clean_* = name_components[, i].
See str_match() from the stringr package, for extracting components in "capture groups". For further information on defining those groups, see regular expressions with stringr.
I have a data frame of tweets for a sentiment analysis I am working on. I want to remove references to some proper names (for example, "Jeff Smith"). Is there a way to remove all or partial references to a name in the same command? Right now I am doing it the long way:
library(stringr)
str_detect(text, c('(Jeff Smith) | (Jeff) | (Smith)' ))
But that obviously gets cumbersome as I add more names. Ideally there'd be some way to feed just "Jeff Smith" and then be able to match all or some of it. Does anybody have any ideas?
Some sample code if you would like to play with it:
tweets = data.frame(text = c('Smith said he’s not counting on Monday being a makeup day.',
"Williams says that Steve Austin will miss the rest of the week",
"Weird times: Jeff Smith just got thrown out attempting to steal home",
"Rest day for Austin today",
"Jeff says he expects to bat leadoff", "Jeff", "No reference to either name"))
name = c("Jeff Smith", "Steve Austin")
Based on the data showed, all of them should be TRUE
library(dplyr)
library(stringr)
pat <- str_c(gsub(" ", "\\b|\\b", str_c("\\b", name, "\\b"),
fixed = TRUE), collapse="|")
tweets %>%
mutate(ind = str_detect(text, pat))
-output
# text ind
#1 Smith said he’s not counting on Monday being a makeup day. TRUE
#2 Williams says that Steve Austin will miss the rest of the week TRUE
#3 Weird times: Jeff Smith just got thrown out attempting to steal home TRUE
#4 Rest day for Austin today TRUE
#5 Jeff says he expects to bat leadoff TRUE
#6 Jeff TRUE
#7 No reference to either name FALSE
Not a beauty, but it works.
#example data
namelist <- c('Jeff Smith', 'Kevin Arnold')
namelist_spreaded <- strsplit(namelist, split = ' ')
f <- function(x) {
paste0('(',
paste(x, collapse = ' '),
') | (',
paste(x, collapse = ') | ('),
')')
}
lapply(namelist_spreaded, f)
I'm newer to R and am playing around with the Titanic kaggle dataset. I've watched David Langer's great youtube videos on exploring this dataset and he is able to parse out the titles of each passenger with a for loop. However, I can't help but figure there is an easier way to do this with mutate and stringr.
note: titanic.full = data.frame
This is my best guess... obviously it doesn't work though:
mutate(titanic.full, Title = ifelse(str_detect(titanic.full$Name, "Mr."), "Mr.") elseif(str_detect(titanic.full$Name, "Mrs."), "Mrs."), "Other")
Any guidance would be very appreciated.
Using a regular expression match seems easier here. .*? matches all characters up to the first occurrence of what follows. (Mr|Mrs|Miss|$) matches any of the options with $ meaning end of line (in order to capture any lines that have none of the prior values). Finally .* matches whatever is left. "\\1" refers to the characters that match the portion of the pattern within parentheses.
titanic.full %>% mutate(Title = sub(".*?(Mr|Mrs|Miss|$).*", "\\1", Name))
Note: Since the input was not provided reproducibly in the question we provide it here:
u <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv"
titanic.full <- read.csv(u)
If you want the tidyverse solution you can do the following:
library(tidyverse)
df <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv"
df <- read.csv(df, stringsAsFactors = FALSE)
df <- as_tibble(df)
df
df %>%
extract(Name,
"Title",
"(Mr|Mrs|Miss) ([^ ]+)",
remove = FALSE) %>%
select(Name, Title)
Which returns:
# A tibble: 1,313 x 2
Name Title
* <chr> <chr>
1 Allen, Miss Elisabeth Walton Miss
2 Allison, Miss Helen Loraine Miss
3 Allison, Mr Hudson Joshua Creighton Mr
4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) Mrs
5 Allison, Master Hudson Trevor <NA>
6 Anderson, Mr Harry Mr
7 Andrews, Miss Kornelia Theodosia Miss
8 Andrews, Mr Thomas, jr Mr
9 Appleton, Mrs Edward Dale (Charlotte Lamson) Mrs
10 Artagaveytia, Mr Ramon Mr
# ... with 1,303 more rows
Thanks to G. Grothendieck for providing the data.
I have the task of searching through text, replacing peoples names and nicknames with a generic character string.
Here is the structure of my data frame of names and corresponding nicknames:
names <- c("Thomas","Thomas","Abigail","Abigail","Abigail")
nicknames <- c("Tom","Tommy","Abi","Abby","Abbey")
df_name_nick <- data.frame(names,nicknames)
Here is the structure of my data frame containing text
text_names <- c("Abigail","Thomas","Abigail","Thomas","Colin")
text_comment <- c("Tommy sits next to Abbey","As a footballer Tommy is very good","Abby is a mature young lady","Tom is a handsome man","Tom is friends with Colin and Abi")
df_name_comment <- data.frame(text_names,text_comment)
Giving these dataframes
df_name_nick:
names nicknames
1 Thomas Tom
2 Thomas Tommy
3 Abigail Abi
4 Abigail Abby
5 Abigail Abbey
df_name_comment:
text_names text_comment
1 Abigail Tommy sits next to Abbey
2 Thomas As a footballer Tommy is very good
3 Abigail Abby is a mature young lady
4 Thomas Tom is a handsome man
5 Colin Tom is friends with Colin and Abi
I am looking for a routine that will search through each row of df_name_comment and use the df_name_comment$text_names to look up the corresponding nickname from df_name_nick and replace it with XXX.
Note for each person's name there can be several nicknames.
Note that in each text comment, only the appropriate name for that row is replaced, so that we would get this as output:
Abigail "Tommy sits next to XXX"
Thomas "As a footballer, XXX is very good"
Abigail "XXX is a mature young lady"
Thomas "XXX is a handsome man"
Colin "Tom is friends with Colin and Abi"
I’m thinking this will require a cunning combination of gsubs, matches and apply functions (either mapply, sapply, etc)
I've searched on Stack Overflow for something similar to this request and can only find very specific regex solutions based on data frames with unique row elements, and not something that I think will work with generic text lookups and gsubs via multiple nicknames.
Can anyone please help me solve my predicament?
With thanks
Nevil
(newbie R programmer since Jan 2017)
Here is an idea via base R. We basically paste the nicknames together for each name, collapsed by | so as to pass it as regex in gsub and replace the matched words of each comment with XXX. We use mapply to do that after we merge our aggregated nicknames with df_name_comment.
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = '|')
d2 <- merge(df_name_comment, d1, by.x = 'text_names', by.y = 'names', all = TRUE)
d2$nicknames[is.na(d2$nicknames)] <- 0
d2$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y), d2$nicknames, d2$text_comment)
d2$nicknames <- NULL
d2
Which gives,
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Abigail XXX is a mature young lady
3 Colin Tom is friends with Colin and Abi
4 Thomas As a footballer XXX is very good
5 Thomas XXX is a handsome man
Note1: Replacing NA in nicknames with 0 is due to the fact that NA (which is the default fill in merge for unmatched elements) would convert the comment string to NA as well when passed in gsub
Note2 The order is also changed due to merge, but you can sort as you wish as per usual.
Note3 Is better to have your variables as characters rather than factors. So you either read the data frames with stringsAsFactors = FALSE or convert via,
df_name_comment[] <- lapply(df_name_comment, as.character)
df_name_nick[] <- lapply(df_name_nick, as.character)
EDIT
Based on your comment, we can simply match the comments' names with our aggregated data set, save that in a vector and use mapply directly on the original data frame, without having to merge and then drop variables, i.e.
#d1 as created above
v1 <- d1$nicknames[match(df_name_comment$text_names, d1$names)]
v1[is.na(v1)] <- 0
df_name_comment$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y),
v1, df_name_comment$text_comment)
Hope this helps!
l <- apply(df_name_comment, 1, function(x)
ifelse(length(df_name_nick[df_name_nick$names==x["text_names"], "nicknames"]) > 0,
gsub(paste(df_name_nick[df_name_nick$names==x["text_names"], "nicknames"], collapse="|"),'XXX', x["text_comment"]),
x["text_comment"]))
df_name_comment$text_comment <- as.list.data.frame(l)
Don't forget to let us know if it solved your problem :)
Data
df_name_nick <- data.frame(names,nicknames,stringsAsFactors = F)
df_name_comment <- data.frame(text_names,text_comment,stringsAsFactors = F)
Solution 2
EDIT: In this initial solution I manually checked with grepl if the nickname was present, and then gsubbed with one of the matching ID's. I knew the '|' operator worked with grepl, but not with gsub. So credits to Sotos for that idea.
df = df_name_comment
for(i in 1:nrow(df))
{
matching_nicknames = df_name_nick$nicknames[df_name_nick$names==df$text_names[i]]
if(length(matching_nicknames)>0)
{
df$text_comment[i] = mapply(sub, pattern=paste(paste0("\\b",matching_nicknames,"\\b"),collapse="|"), "XXX", df$text_comment[i])
}
}
Output
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Thomas As a footballer XXX is very good
3 Abigail XXX is a mature young lady
4 Thomas XXX is a handsome man
5 Colin Tom is friends with Colin and Abi
Hope this helps!
I have been trying to duplicate a move that I've used a lot with SQL but can't seem to find an equivalent in R. I've been searching high and low on the list and other sources for a solution but can't find what I'm looking to do.
I have a data frame with a variable of full names, for example "Doe, John". I have been able to split these names using the following code:
# creates a split name matrix for each record
namesplit <- strsplit(crm$DEF_NAME, ',')
# takes the first/left part of matrix, after the comma
crm$LAST_NAME <- trimws(sapply(namesplit, function(x) x[1]))
# takes the last/right part of the matrix, after the comma
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[length(x)]))
But some of the names have "." instead of "," splitting the names. For example, "Doe. John". In other cases I have two ".", i.e. "Doe. John T.". Here's an example:
> test$LAST_NAME
[1] "DEWITT. B" "TAOY. PETER" "ZULLO. JASON"
[4] "LAWLOR. JOSEPH" "CRAWFORD. ADAM" "HILL. ROBERT W."
[7] "TAGERT. CHRISTOPHER" "ROSEBERY. SCOTT W." "PAYNE. ALBERT"
[10] "BUNTZ. BRIAN JOHN" "COLON. PERFECTO GAUD" "DIAZ. JOSE CANO"
[13] "COLON. ERIK D." "COLON. ERIK D." "MARTINEZ. DAVID C."
[16] "DRISKELL. JASON" "JOHNSON. ALEXANDER" "JACKSON. RONNIE WAYNE"
[19] "SIPE. DAVID J." "FRANCO. BRANDT" "FRANCO. BRANDT"
For these cases, I'm trying to find the position of the first "." so that I can use user-defined functions to split the name. Here are those functions.
left = function (string,char){
substr(string,1,char)}
right = function (string, char){
substr(string,nchar(string)-(char-1),nchar(string))}
I've had some success with the following, but it takes the position of the first record only, so for example it'll grab position 6 for all the records rather than changing for each row.
test$LAST_NAME2 <- left(test$LAST_NAME,
which(strsplit(test$LAST_NAME, '')[[1]]=='.')-1)
I've played around with apply and sapply, but I'm obviously missing something because they don't seem to work.
My plan was to use an ifelse function to apply the "." parsing to the records that have this issue.
I fear the answer is simple. But I'm stuck. Thanks so much for your help.
I would just modify your original function namesplit to this:
namesplit <- strsplit(crm$DEF_NAME, ',|\\.')
which will split on , or ..
Also, maybe change your first name function to
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[2:length(x)]))
to catch any instances where there is a comma or period that is not in the last position.
With tidyr,
library(tidyr)
test %>% separate(LAST_NAME, into = c('LAST_NAME', 'FIRST_NAME'), extra = 'merge')
## LAST_NAME FIRST_NAME
## 1 DEWITT B
## 2 LAWLOR JOSEPH
## 3 TAGERT CHRISTOPHER
## 4 BUNTZ BRIAN JOHN
## 5 COLON ERIK D.
## 6 DRISKELL JASON
## 7 SIPE DAVID J.
## 8 TAOY PETER
## 9 CRAWFORD ADAM
## 10 ROSEBERY SCOTT W.
## 11 COLON PERFECTO GAUD
## 12 COLON ERIK D.
## 13 JOHNSON ALEXANDER
## 14 FRANCO BRANDT
## 15 ZULLO JASON
## 16 HILL ROBERT W.
## 17 PAYNE ALBERT
## 18 DIAZ JOSE CANO
## 19 MARTINEZ DAVID C.
## 20 JACKSON RONNIE WAYNE
## 21 FRANCO BRANDT
Data
test <- structure(list(LAST_NAME = c("DEWITT. B", "LAWLOR. JOSEPH", "TAGERT. CHRISTOPHER",
"BUNTZ. BRIAN JOHN", "COLON. ERIK D.", "DRISKELL. JASON", "SIPE. DAVID J.",
"TAOY. PETER", "CRAWFORD. ADAM", "ROSEBERY. SCOTT W.", "COLON. PERFECTO GAUD",
"COLON. ERIK D.", "JOHNSON. ALEXANDER", "FRANCO. BRANDT", "ZULLO. JASON",
"HILL. ROBERT W.", "PAYNE. ALBERT", "DIAZ. JOSE CANO", "MARTINEZ. DAVID C.",
"JACKSON. RONNIE WAYNE", "FRANCO. BRANDT")), row.names = c(NA,
-21L), class = "data.frame", .Names = "LAST_NAME")