I have a data frame of tweets for a sentiment analysis I am working on. I want to remove references to some proper names (for example, "Jeff Smith"). Is there a way to remove all or partial references to a name in the same command? Right now I am doing it the long way:
library(stringr)
str_detect(text, c('(Jeff Smith) | (Jeff) | (Smith)' ))
But that obviously gets cumbersome as I add more names. Ideally there'd be some way to feed just "Jeff Smith" and then be able to match all or some of it. Does anybody have any ideas?
Some sample code if you would like to play with it:
tweets = data.frame(text = c('Smith said he’s not counting on Monday being a makeup day.',
"Williams says that Steve Austin will miss the rest of the week",
"Weird times: Jeff Smith just got thrown out attempting to steal home",
"Rest day for Austin today",
"Jeff says he expects to bat leadoff", "Jeff", "No reference to either name"))
name = c("Jeff Smith", "Steve Austin")
Based on the data shown, all of them except the last row should be TRUE:
library(dplyr)
library(stringr)
pat <- str_c(gsub(" ", "\\b|\\b", str_c("\\b", name, "\\b"),
fixed = TRUE), collapse="|")
tweets %>%
mutate(ind = str_detect(text, pat))
-output
# text ind
#1 Smith said he’s not counting on Monday being a makeup day. TRUE
#2 Williams says that Steve Austin will miss the rest of the week TRUE
#3 Weird times: Jeff Smith just got thrown out attempting to steal home TRUE
#4 Rest day for Austin today TRUE
#5 Jeff says he expects to bat leadoff TRUE
#6 Jeff TRUE
#7 No reference to either name FALSE
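The pattern above only flags the matching tweets. If the goal is to strip the names out of the text, the same idea can be handed to gsub(). This is a minimal base R sketch, assuming the names contain no regex metacharacters; tweet_text, name_parts and cleaned are made-up names for illustration:

```r
# Hypothetical example data; `name` is the vector from the question.
name <- c("Jeff Smith", "Steve Austin")
tweet_text <- c("Weird times: Jeff Smith just got thrown out",
                "Rest day for Austin today",
                "No reference to either name")

# Split every full name into its words and match any of them as a whole word.
name_parts <- unique(unlist(strsplit(name, " ", fixed = TRUE)))
pat <- paste0("\\b(", paste(name_parts, collapse = "|"), ")\\b")

# Remove the matches, then squeeze the whitespace left behind.
cleaned <- gsub(pat, "", tweet_text, perl = TRUE)
cleaned <- trimws(gsub("[[:space:]]+", " ", cleaned))
cleaned
```

Because every word of every name is removed on its own, a full name such as "Jeff Smith" disappears as two separate matches, so no extra alternative for the full name is needed.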
Not a beauty, but it works.
#example data
namelist <- c('Jeff Smith', 'Kevin Arnold')
namelist_spreaded <- strsplit(namelist, split = ' ')
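# Build '(full name)|(first)|(last)' for one name.
# No spaces around '|': inside a regex they would be matched literally.
f <- function(x) {
  paste0('(',
         paste(x, collapse = ' '),
         ')|(',
         paste(x, collapse = ')|('),
         ')')
}
lapply(namelist_spreaded, f)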
Hi I have a column with names. It has names in the format of:
c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner").
I just want to extract the first name, but I am not exactly sure how to do it in an easy way because of the prefixes on the names. Please let me know if you have any suggestions.
This is an approach:
library(magrittr) # for %>%
dirty_names <- c(
"Tom",
"Tom Turner",
"Dr. Tom Turner",
"R. Tom Turner",
"J Tom Turner",
"Jr. Tom Turner"
)
dirty_names %>%
# Remove first word if it ends with . e.g. Dr., Jr., R.
sub("^\\w+\\.", "", .) %>%
trimws() %>%
# Remove first word if it is one letter e.g. J
sub("^[A-Za-z] ", "", .) %>%
# Delete everything after first word
sub("(\\w+).*", "\\1", .)
# [1] "Tom" "Tom" "Tom" "Tom" "Tom" "Tom"
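The three sub() steps can also be folded into one call. This is only a sketch under the same assumptions as the pipeline above (a prefix is either letters followed by a period, or a single letter, and is followed by whitespace); first_names is a made-up variable name:

```r
dirty_names <- c("Tom", "Tom Turner", "Dr. Tom Turner",
                 "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner")

# Optional prefix: letters followed by a period, or a single letter,
# then whitespace. The first word after it is captured and kept.
first_names <- sub("^(?:[A-Za-z]+\\.\\s+|[A-Za-z]\\s+)?(\\w+).*$", "\\1",
                   trimws(dirty_names), perl = TRUE)
first_names
```

As with the pipeline, a genuine one-letter first name would be treated as a prefix, so check this against your own data.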
Solution
Here is a solution in the tidyverse, which uses regular expressions ("regex") to extract every component of interest:
Optional prefix: either a single letter (J), or several letters followed by a period (Jr.); separated from the ensuing name by whitespace ( ).
Required first_name: a "streak" of characters before the next whitespace.
Optional last_name: a "streak" of characters after that next whitespace.
# Load useful functions.
library(tidyverse)
# ...
# Code to generate a 'dirty_data' table with a 'dirty_name' column.
# ...
# Define the regex for extracting the name components, each within a (capture group).
dirty_regex <-
# Capture groups of interest (str_match() column = group number + 1):
#   group 1: optional prefix plus the whitespace that follows it
#   group 5: first name
#   group 7: whitespace after the first name; group 8: last name
"^((([[:alpha:]])|([[:alpha:]]+\\.))[[:blank:]]+)?([^[:blank:]]+)(([[:blank:]]*)(.*))?$"
# Clean the 'dirty_data' and store it in a fresh table: 'clean_data'.
clean_data <- dirty_data %>%
mutate(
# Remove external whitespace for easier analysis.
clean_full_name = str_trim(dirty_name),
# Break the dirty names (using regex) into a matrix of their components.
name_components = str_match(dirty_name, dirty_regex),
# Extract each component.
clean_prefix = name_components[, 2],
clean_first_name = name_components[, 6],
clean_last_name = name_components[, 9],
# Remove the matrix.
name_components = NULL,
# Trim any external whitespace in the (new) components.
across(starts_with("clean_") & !clean_full_name, str_trim),
# Replace any empty strings ("") with blanks (NAs).
across(starts_with("clean_"), ~ na_if(.x, ""))
)
# Print and inspect our result.
clean_data
Result
Given data like your dirty_data below
# The dirty names.
dirty_names_vec <- c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner")
# A table with a column for the dirty names.
dirty_data <- tibble(dirty_name = dirty_names_vec)
this workflow should yield the following result for clean_data:
# A tibble: 6 × 5
  dirty_name     clean_full_name clean_prefix clean_first_name clean_last_name
  <chr>          <chr>           <chr>        <chr>            <chr>
1 Tom            Tom             NA           Tom              NA
2 Tom Turner     Tom Turner      NA           Tom              Turner
3 Dr. Tom Turner Dr. Tom Turner  Dr.          Tom              Turner
4 R. Tom Turner  R. Tom Turner   R.           Tom              Turner
5 J Tom Turner   J Tom Turner    J            Tom              Turner
6 Jr. Tom Turner Jr. Tom Turner  Jr.          Tom              Turner
Note
If other "dirty" names are in different formats, you must modify your dirty_regex accordingly. You should likewise adjust the index i of each capture group, used to extract the components via clean_* = name_components[, i].
See str_match() from the stringr package, for extracting components in "capture groups". For further information on defining those groups, see regular expressions with stringr.
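To see how the column indices line up with the capture groups, the same regex can be inspected in base R. This is a sketch rather than part of the workflow above: regexec() with regmatches() is used here as a stand-in for str_match(), and the variable names are made up for illustration:

```r
dirty_regex <- "^((([[:alpha:]])|([[:alpha:]]+\\.))[[:blank:]]+)?([^[:blank:]]+)(([[:blank:]]*)(.*))?$"

# regexec() + regmatches() return the full match followed by one element
# per capture group, so element i + 1 holds group i.
parts <- regmatches("Dr. Tom Turner",
                    regexec(dirty_regex, "Dr. Tom Turner", perl = TRUE))[[1]]

prefix     <- trimws(parts[2])  # group 1: prefix plus trailing whitespace
first_name <- parts[6]          # group 5
last_name  <- parts[9]          # group 8
```

If you add or remove a pair of parentheses in the regex, every later group shifts by one, which is why the indices have to be adjusted together with the pattern.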
I have a requirement to identify specific words and combinations of specific words within a free text description column. My dataset contains two columns - a reference number and description. The data relates to repairs. I need to be able to determine which room the repair took place in for each reference number. This could include “kitchen”, “bathroom”, “dining room” amongst others.
The dataset looks like this
|reference|description            |
|---------|-----------------------|
|123456   |repair light in kitchen|
The output I require is something like this:
|reference|Room   |
|---------|-------|
|123456   |kitchen|
Any help very much appreciated.
This will pull the first match from room_vector in each description.
room_vector = c("kitchen", "bathroom", "dining room")
library(stringr)
your_data$room = str_extract(your_data$description, paste(room_vector, collapse = "|"))
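If you would rather not depend on stringr, or the descriptions mix upper and lower case, roughly the same thing can be done in base R. This is a sketch with made-up example rows; only room_vector and the column names come from the question and the answer above:

```r
# Hypothetical example data in the shape described in the question.
your_data <- data.frame(
  reference   = c(123456, 123457, 123458),
  description = c("repair light in kitchen",
                  "Fix tap in Bathroom",
                  "paint hallway"),
  stringsAsFactors = FALSE
)
room_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(room_vector, collapse = "|")

# regexpr() finds the first match per row (-1 when there is none).
m <- regexpr(pattern, your_data$description, ignore.case = TRUE)
your_data$room <- NA_character_
your_data$room[m > 0] <- tolower(regmatches(your_data$description, m))
your_data$room
```

Rows without any room keep NA, which mirrors what str_extract() returns when nothing matches.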
This version takes into account the combination with the word repair:
library(dplyr)
library(stringr)
my_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(my_vector, collapse = "|")
df %>%
mutate(Room = case_when(
str_detect(description, "repair") &
str_detect(description, pattern) ~ str_extract(description, pattern)))
If you apply the code to this dataframe:
reference description
1 123456 live in light in kitchen
you will get:
reference description Room
1 123456 live in light in kitchen <NA>
The first version does not take the combination with the word repair into account.
It is similar to Gregor Thomas's solution:
library(dplyr)
library(stringr)
my_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(my_vector, collapse = "|")
df %>%
mutate(Room = case_when(
str_detect(description, "repair") |
str_detect(description, pattern) ~ str_extract(description, pattern)))
reference description Room
1 123456 repair light in kitchen kitchen
Using Base R:
rooms <- c("kitchen", "bathroom", "dining room")
pat <- sprintf('.*repair.*(%s).*|.*', paste0(rooms, collapse = '|'))
transform(df, room = sub(pat, '\\1', reference))
                  reference        room
1           repair bathroom    bathroom
2             live bathroom
3  repair lights in kitchen     kitchen
4           food in kitchen
5         tv in dining room
6 table repair dining room  dining room
Data:
df <- structure(list(reference = c("repair bathroom", "live bathroom",
"repair lights in kitchen", "food in kitchen", "tv in dining room",
"table repair dining room ")), class = "data.frame", row.names = c(NA,
-6L))
Consider the following data.frame:
df <- data.frame(ID = 1:3, Name = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Chayes, Michael Jordan, John DeNero, Ani Adhikari, Jordan, Mia Scher", "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood") , stringsAsFactors = FALSE)
I would like to remove a duplicated first name or last name if the full name is available in the string. No changes should be made to the string if there is no match. The result should look like the data frame below:
df <- data.frame(ID = 1:3, Name = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Chayes, Michael Jordan, John DeNero, Ani Adhikari, Jordan, Mia Scher", "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood"), UniqueName = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Michael Jordan, John DeNero, Ani Adhikari, Mia Scher", "Nenshad Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood") , stringsAsFactors = FALSE)
Any input would be much appreciated.
Answer
Use grepl to find strings that [1] do not contain a space, and [2] are present in other names.
Code
df$UniqueName <- sapply(df$Name, function(x) {
sn <- unlist(strsplit(x, split = ", ", fixed = TRUE))
sn2 <- sn[!(!grepl(" ", sn) & sapply(sn, function(y) sum(grepl(y, sn)) > 1))]
paste(sn2, collapse = ", ")
})
Rationale
We use sapply since each entry needs a lot of work. We essentially perform 3 steps: [1] split the string with strsplit, [2] subset to keep only those that you want, [3] paste the string back together with paste.
The reasoning here is that single first or last names do not contain a space, and if they are present in other names then you want to remove them. Hence, we find those that do not have a space (!grepl(" ", sn)) and that are a substring of another entry (sapply(sn, function(y) sum(grepl(y, sn)) > 1)). Then, we remove those using [!( )].
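One thing to keep in mind is that grepl(y, sn) treats each name as a regular expression, so a name containing characters such as an unbalanced "(" could fail or match more than intended. The variant below is a sketch of the same three steps with fixed = TRUE (literal matching); drop_partial_names is a made-up helper name, not part of the answer above:

```r
drop_partial_names <- function(x) {
  # [1] split the string into individual names
  parts <- unlist(strsplit(x, ", ", fixed = TRUE))
  # [2] a single first/last name has no space ...
  is_single <- !grepl(" ", parts, fixed = TRUE)
  # ... and is dropped only if it occurs literally inside another entry
  in_other <- vapply(seq_along(parts), function(i) {
    any(grepl(parts[i], parts[-i], fixed = TRUE))
  }, logical(1))
  # [3] paste the remaining names back together
  paste(parts[!(is_single & in_other)], collapse = ", ")
}

drop_partial_names("Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie")
```

It can be applied to the whole column with df$UniqueName <- vapply(df$Name, drop_partial_names, character(1), USE.NAMES = FALSE).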
I have the task of searching through text and replacing people's names and nicknames with a generic character string.
Here is the structure of my data frame of names and corresponding nicknames:
names <- c("Thomas","Thomas","Abigail","Abigail","Abigail")
nicknames <- c("Tom","Tommy","Abi","Abby","Abbey")
df_name_nick <- data.frame(names,nicknames)
Here is the structure of my data frame containing text
text_names <- c("Abigail","Thomas","Abigail","Thomas","Colin")
text_comment <- c("Tommy sits next to Abbey","As a footballer Tommy is very good","Abby is a mature young lady","Tom is a handsome man","Tom is friends with Colin and Abi")
df_name_comment <- data.frame(text_names,text_comment)
Giving these dataframes
df_name_nick:
names nicknames
1 Thomas Tom
2 Thomas Tommy
3 Abigail Abi
4 Abigail Abby
5 Abigail Abbey
df_name_comment:
text_names text_comment
1 Abigail Tommy sits next to Abbey
2 Thomas As a footballer Tommy is very good
3 Abigail Abby is a mature young lady
4 Thomas Tom is a handsome man
5 Colin Tom is friends with Colin and Abi
I am looking for a routine that will search through each row of df_name_comment and use the df_name_comment$text_names to look up the corresponding nickname from df_name_nick and replace it with XXX.
Note for each person's name there can be several nicknames.
Note that in each text comment, only the appropriate name for that row is replaced, so that we would get this as output:
Abigail "Tommy sits next to XXX"
Thomas "As a footballer, XXX is very good"
Abigail "XXX is a mature young lady"
Thomas "XXX is a handsome man"
Colin "Tom is friends with Colin and Abi"
I’m thinking this will require a cunning combination of gsubs, matches and apply functions (either mapply, sapply, etc)
I've searched on Stack Overflow for something similar to this request and can only find very specific regex solutions based on data frames with unique row elements, and not something that I think will work with generic text lookups and gsubs via multiple nicknames.
Can anyone please help me solve my predicament?
With thanks
Nevil
(newbie R programmer since Jan 2017)
Here is an idea via base R. We basically paste the nicknames together for each name, collapsed by | so as to pass it as regex in gsub and replace the matched words of each comment with XXX. We use mapply to do that after we merge our aggregated nicknames with df_name_comment.
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = '|')
d2 <- merge(df_name_comment, d1, by.x = 'text_names', by.y = 'names', all = TRUE)
d2$nicknames[is.na(d2$nicknames)] <- 0
d2$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y), d2$nicknames, d2$text_comment)
d2$nicknames <- NULL
d2
Which gives,
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Abigail XXX is a mature young lady
3 Colin Tom is friends with Colin and Abi
4 Thomas As a footballer XXX is very good
5 Thomas XXX is a handsome man
Note 1: NA in nicknames is replaced with 0 because NA (the default fill in merge for unmatched elements) would turn the comment string into NA as well when passed to gsub.
Note 2: The order also changes because of merge, but you can sort it afterwards as usual.
Note 3: It is better to have your variables as characters rather than factors, so either read the data frames with stringsAsFactors = FALSE or convert via
df_name_comment[] <- lapply(df_name_comment, as.character)
df_name_nick[] <- lapply(df_name_nick, as.character)
EDIT
Based on your comment, we can simply match the comments' names with our aggregated data set, save that in a vector and use mapply directly on the original data frame, without having to merge and then drop variables, i.e.
#d1 as created above
v1 <- d1$nicknames[match(df_name_comment$text_names, d1$names)]
v1[is.na(v1)] <- 0
df_name_comment$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y),
v1, df_name_comment$text_comment)
Hope this helps!
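Two details of the approach above are worth guarding against: a nickname such as "Abi" also matches inside a longer word like "Abigail", and the placeholder pattern 0 would replace any literal 0 in a comment. The sketch below wraps the alternatives in word boundaries and simply skips rows without nicknames. The data frames nick and comments and the other variable names are made up for illustration:

```r
nick <- data.frame(names = c("Thomas", "Thomas", "Abigail", "Abigail", "Abigail"),
                   nicknames = c("Tom", "Tommy", "Abi", "Abby", "Abbey"),
                   stringsAsFactors = FALSE)
comments <- data.frame(text_names = c("Abigail", "Thomas", "Colin"),
                       text_comment = c("Tommy sits next to Abbey",
                                        "Tommy and Tom play",
                                        "Tom is friends with Colin and Abi"),
                       stringsAsFactors = FALSE)

# One whole-word pattern per name, e.g. "\\b(Tom|Tommy)\\b".
pats <- vapply(split(nick$nicknames, nick$names),
               function(n) paste0("\\b(", paste(n, collapse = "|"), ")\\b"),
               character(1))

# Look up the pattern for each row; NA means the name has no nicknames.
row_pat <- pats[match(comments$text_names, names(pats))]
has_pat <- !is.na(row_pat)

# Replace only in rows that actually have a pattern.
comments$text_comment[has_pat] <- mapply(
  function(p, txt) gsub(p, "XXX", txt, perl = TRUE),
  row_pat[has_pat], comments$text_comment[has_pat], USE.NAMES = FALSE)
comments
```

Because no merge is involved, the rows stay in their original order.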
l <- apply(df_name_comment, 1, function(x) {
  # nicknames registered for this row's name (may be empty)
  nicks <- df_name_nick[df_name_nick$names == x["text_names"], "nicknames"]
  if (length(nicks) == 0) return(x["text_comment"])
  gsub(paste(nicks, collapse = "|"), 'XXX', x["text_comment"])
})
df_name_comment$text_comment <- unname(l)
Don't forget to let us know if it solved your problem :)
Data
df_name_nick <- data.frame(names,nicknames,stringsAsFactors = F)
df_name_comment <- data.frame(text_names,text_comment,stringsAsFactors = F)
Solution 2
EDIT: In my initial solution I manually checked with grepl whether a nickname was present, and then gsubbed with one of the matching IDs. I knew the '|' operator worked with grepl, but did not know it also worked with gsub, so credit to Sotos for that idea.
df = df_name_comment
for(i in 1:nrow(df))
{
matching_nicknames = df_name_nick$nicknames[df_name_nick$names==df$text_names[i]]
if(length(matching_nicknames)>0)
{
df$text_comment[i] = mapply(sub, pattern=paste(paste0("\\b",matching_nicknames,"\\b"),collapse="|"), "XXX", df$text_comment[i])
}
}
Output
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Thomas As a footballer XXX is very good
3 Abigail XXX is a mature young lady
4 Thomas XXX is a handsome man
5 Colin Tom is friends with Colin and Abi
Hope this helps!
I have a dataset in R that lists out a bunch of company names and want to remove words like "Inc", "Company", "LLC", etc. for part of a clean-up effort. I have the following sample data:
sampleData
      Location             Company
1 New York, NY         XYZ Company
2  Chicago, IL Consulting Firm LLC
3    Miami, FL         Smith & Co.
Words I do not want to include in my output:
stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")
I built the following function to break out each word, remove the stopwords, and then bring the words back together, but it is not iterating through each row of the dataset.
removeWords <- function(str, stopwords) {
x <- unlist(strsplit(str, " "))
paste(x[!x %in% stopwords], collapse = " ")
}
removeWords(sampleData$Company,stopwords)
The output for the above function looks like this:
[1] "XYZ Company Consulting Firm Smith"
The output should be:
      Location         Company
1 New York, NY     XYZ Company
2  Chicago, IL Consulting Firm
3    Miami, FL           Smith
Any help would be appreciated.
We can use the 'tm' package:
library(tm)
stopwords = readLines('stopwords.txt')     #Your stop words file
x = df$Company                             #Company column data (R is case-sensitive)
x = removeWords(x,stopwords)               #Remove stopwords
df$company_new <- x                        #Add the result as a new column and check
With a little care over the stopwords (escaping the "." in "Co." with "\\" so it is not treated as a regex wildcard, and adding spaces where needed), gsub also works. The previous answer should be preferred if you don't want to keep an eye on the stopwords:
stopwords = c("Inc","inc","co ","Co ","Inc."," Co\\.","LLC","Corporation","Corp","&")
gsub(paste0(stopwords,collapse = "|"),"", df$Company)
[1] "XYZ Company" "Consulting Firm " "Smith "
df$Company <- gsub(paste0(stopwords,collapse = "|"),"", df$Company)
# df
# Location Company
#1 New York, NY XYZ Company
#2 Chicago, IL Consulting Firm
#3 Miami, FL Smith
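As for why the function in the question collapses everything into one string: strsplit() returns a list with one element per row, and unlist() flattens that list before the stopwords are removed. A minimal sketch of a row-wise version follows, using the made-up name remove_stopwords to avoid clashing with tm::removeWords:

```r
# Keep the list returned by strsplit() and handle one row at a time.
remove_stopwords <- function(x, stopwords) {
  vapply(strsplit(x, " ", fixed = TRUE),
         function(words) paste(words[!words %in% stopwords], collapse = " "),
         character(1))
}

company <- c("XYZ Company", "Consulting Firm LLC", "Smith & Co.")
stopwords <- c("Inc", "inc", "co", "Co", "Inc.", "Co.", "LLC",
               "Corporation", "Corp", "&")

remove_stopwords(company, stopwords)
```

Since whole words are compared with %in%, nothing is treated as a regular expression here, so "Co." and "&" need no escaping and "Company" is left alone.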