Remove certain words in string from column in dataframe in R

I have a dataset in R that lists a bunch of company names, and I want to remove words like "Inc", "Company", "LLC", etc. as part of a clean-up effort. I have the following sample data:
sampleData
      Location             Company
1 New York, NY         XYZ Company
2  Chicago, IL Consulting Firm LLC
3    Miami, FL         Smith & Co.
Words I do not want to include in my output:
stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")
I built the following function to break out each word, remove the stopwords, and then bring the words back together, but it is not iterating through each row of the dataset.
removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}
removeWords(sampleData$Company, stopwords)
The output for the above function looks like this:
[1] "XYZ Company Consulting Firm Smith"
The output should be:
      Location         Company
1 New York, NY     XYZ Company
2  Chicago, IL Consulting Firm
3    Miami, FL           Smith
Any help would be appreciated.
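A minimal sketch of why the output collapses (the function body itself is fine; strsplit() over the whole vector merges every row's pieces after unlist(), so it has to be applied per element, e.g. via sapply):
sapply(as.character(sampleData$Company), removeWords, stopwords = stopwords, USE.NAMES = FALSE)
# [1] "XYZ Company"     "Consulting Firm" "Smith"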

We can use the 'tm' package:
library(tm)
stopwords = readLines('stopwords.txt') #Your stop words file
x = df$company #Company column data
x = removeWords(x,stopwords) #Remove stopwords
df$company_new <- x #Add the list as new column and check
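A minimal usage sketch with the question's sample data (assuming the stopword vector from the question rather than a file; note that tm::removeWords() matches on word boundaries, so punctuation-heavy tokens like "&" or "Co." may need separate handling):
stopwords <- c("Inc", "inc", "co", "Co", "Inc.", "Co.", "LLC", "Corporation", "Corp", "&")
sampleData$Company_new <- tm::removeWords(sampleData$Company, stopwords)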

With a little adjustment to the stopwords (a "\\" inserted in "Co." to escape the regex dot, plus some spaces to limit partial matches), gsub can do it in one pass. (But the previous answer should be preferred if you don't want to keep an eye on the stopwords yourself.)
stopwords = c("Inc","inc","co ","Co ","Inc."," Co\\.","LLC","Corporation","Corp","&")
gsub(paste0(stopwords, collapse = "|"), "", df$Company)
[1] "XYZ Company"      "Consulting Firm " "Smith "
df$Company <- gsub(paste0(stopwords, collapse = "|"), "", df$Company)
# df
# Location Company
#1 New York, NY XYZ Company
#2 Chicago, IL Consulting Firm
#3 Miami, FL Smith
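Either way, the substitution can leave stray spaces behind (note "Consulting Firm " and "Smith " above); a small follow-up using base R's trimws() to squeeze them out:
df$Company <- trimws(gsub("\\s+", " ", df$Company))
# [1] "XYZ Company"     "Consulting Firm" "Smith"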

Related

Extract the longest matching string R

I would like to extract from a column the names matching the names contained in a large vector of characters. In some cases, the extracted string is not complete because of whitespaces.
Below is a reproducible example:
library(stringr)
library(dplyr)
library(tidyr)
library(stringi)
data <- data.frame(address = c("to New York street", "to New cafe", "to Paris avenue", "to London hostel"))
search_string <- c("London", "Paris", "New", "New York") %>% paste(collapse = " |to ")
data %>% dplyr::mutate(temp_com = str_extract_all(paste(address), search_string))
This is the results :
address temp_com
1 to New York street to New
2 to New cafe to New
3 to Paris avenue to Paris
4 to London hostel London
And this is what I would like:
address temp_com
1 to New York street to New York
2 to New cafe to New
3 to Paris avenue to Paris
4 to London hostel London
Thank you very much for your help.
Change the order of your search strings to longest-to-shortest. (Also, I'm inferring that you intend to have "to " before your first search string; it's being omitted in your current example.)
search_string <- c("London","Paris", "New", "New York")
search_string <- paste(paste("to", search_string[order(-nchar(search_string))]), collapse = "|")
search_string
# [1] "to New York|to London|to Paris|to New"
data %>%
  dplyr::mutate(temp_com = str_extract_all(paste(address), search_string))
# address temp_com
# 1 to New York street to New York
# 2 to New cafe to New
# 3 to Paris avenue to Paris
# 4 to London hostel to London

A more elegant way to remove duplicated names (phrases) in the elements of a character string

I have a vector of organization names in a dataframe. Some of them are just fine; others have the name repeated twice in the same element. Also, when the name is repeated, there is no separating space, so the name has a camelCase appearance.
For example (id column added for general dataframe referencing):
id org
1 Alpha Company
2 Bravo InstituteBravo Institute
3 Charlie Group
4 Delta IncorporatedDelta Incorporated
but it should look like:
id org
1 Alpha Company
2 Bravo Institute
3 Charlie Group
4 Delta Incorporated
I have a solution that gets the result I need--reproducible example code below. However, it seems a bit lengthy and not very elegant.
Does anyone have a better approach for the same results?
Bonus question: If organizations have 'types' included, such as Alpha Company, LLC, then my gsub() line to fix the camelCase does not work as well. Any suggestions on how to adjust the camelCase fix to account for the ", LLC" and still work with the rest of the solution?
Thanks in advance!
(Thanks to the OP & those who helped on the previous SO post about splitting camelCase strings in R)
# packages
library(stringr)
# toy data
df <- data.frame(id=1:4, org=c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
# split up & clean camelCase words
df$org_fix <- gsub("([A-Z])", " \\1", df$org)
df$org_fix <- str_trim(str_squish(df$org_fix))
# temp vector with half the org names
df$org_half <- word(df$org_fix, start=1, end=(sapply(strsplit(df$org_fix, " "), length)/2)) # stringr::word
# double the temp vector
df$org_dbl <- paste(df$org_half, df$org_half)
# flag TRUE for orgs that contain duplicates in name
df$org_dup <- df$org_fix == df$org_dbl
# corrected the org names
df$org_fix <- ifelse(df$org_dup, df$org_half, df$org_fix)
# drop excess columns
df <- df[,c("id", "org_fix")]
# toy data for the bonus question
df2 <- data.frame(id=1:4, org=c("Alpha Company, LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
Another approach is to compare the first half of the string with the second half of the string. If equal, pick the first half. It also works if there are numbers, underscores or any other characters present in the company name.
org <- c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated", "WD40WD40", "3M3M")
ifelse(substring(org, 1, nchar(org) / 2) == substring(org, nchar(org) / 2 + 1, nchar(org)),
       substring(org, 1, nchar(org) / 2),
       org)
# [1] "Alpha Company" "Bravo Institute" "Charlie Group" "Delta Incorporated" "WD40" "3M"
You can use a regex, as in the line below (note that this extracts the first two Title Case words, so it assumes two-word names):
my_df$org <- str_extract(string = my_df$org, pattern = "([A-Z][a-z]+ [A-Z][a-z]+){1}")
If all individual words start with a capital letter (not followed by another capital letter), then you can use that to split on. Only keep the unique elements, then paste + collapse. This will also work on the bonus LLC option:
org <- c("Alpha CompanyCompany , LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated")
sapply(
  lapply(
    strsplit(gsub("[^A-Za-z0-9]", "", org),
             "(?<=[^A-Z])(?=[A-Z])",
             perl = TRUE),
    unique),
  paste0, collapse = " ")
[1] "Alpha Company LLC"   "Bravo Institute"     "Charlie Group"       "Delta Incorporated"

R - All or Partial String Matching?

I have a data frame of tweets for a sentiment analysis I am working on. I want to remove references to some proper names (for example, "Jeff Smith"). Is there a way to remove all or partial references to a name in the same command? Right now I am doing it the long way:
library(stringr)
str_detect(text, c('(Jeff Smith) | (Jeff) | (Smith)' ))
But that obviously gets cumbersome as I add more names. Ideally there'd be some way to feed just "Jeff Smith" and then be able to match all or some of it. Does anybody have any ideas?
Some sample code if you would like to play with it:
tweets = data.frame(text = c('Smith said he’s not counting on Monday being a makeup day.',
                             "Williams says that Steve Austin will miss the rest of the week",
                             "Weird times: Jeff Smith just got thrown out attempting to steal home",
                             "Rest day for Austin today",
                             "Jeff says he expects to bat leadoff",
                             "Jeff",
                             "No reference to either name"))
name = c("Jeff Smith", "Steve Austin")
Based on the data shown, every row that references either name should be TRUE:
library(dplyr)
library(stringr)
pat <- str_c(gsub(" ", "\\b|\\b", str_c("\\b", name, "\\b"), fixed = TRUE),
             collapse = "|")
tweets %>%
  mutate(ind = str_detect(text, pat))
Output:
# text ind
#1 Smith said he’s not counting on Monday being a makeup day. TRUE
#2 Williams says that Steve Austin will miss the rest of the week TRUE
#3 Weird times: Jeff Smith just got thrown out attempting to steal home TRUE
#4 Rest day for Austin today TRUE
#5 Jeff says he expects to bat leadoff TRUE
#6 Jeff TRUE
#7 No reference to either name FALSE
Not a beauty, but it works.
#example data
namelist <- c('Jeff Smith', 'Kevin Arnold')
namelist_spreaded <- strsplit(namelist, split = ' ')
f <- function(x) {
  paste0('(',
         paste(x, collapse = ' '),
         ') | (',
         paste(x, collapse = ') | ('),
         ')')
}
lapply(namelist_spreaded, f)
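A usage sketch, assuming the generated patterns are meant to feed str_detect() as in the question (the literal spaces around "|" are carried over from the question's original pattern):
library(stringr)
patterns <- sapply(namelist_spreaded, f)
patterns
# [1] "(Jeff Smith) | (Jeff) | (Smith)"     "(Kevin Arnold) | (Kevin) | (Arnold)"
str_detect("Weird times: Jeff Smith just got thrown out", patterns[1])
# [1] TRUE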

How to extract matching values from a column in a dataframe when semicolons are present in R?

I have a large dataframe of published articles from which I would like to extract all articles relating to a few authors specified in a separate list. The authors in the dataframe are grouped together in one column, separated by a ";". Not all authors need to match; I would like to extract any article which has at least one author matched to the list. An example is below.
Title <- c("A", "B", "C")
AU <- c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke")
df <- cbind(Title, AU)
authors <- as.character(c("Mark", "John", "Luke"))
df[sapply(strsplit((as.character(df$AU)), "; "), function(x) any(authors %in% x)),]
I would expect it to return:
Title AU
A Mark; John
C Sarah; Luke
However, with my large dataframe this command does not work to return all the AU; it only returns rows which have a single AU, not multiple ones.
Here is a dput from my larger dataframe of 5 rows
structure(list(AU = c("FOOKES PG;DEARMAN WR;FRANKLIN JA", "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS",
"TURNER BR", "BUTLER J;MARSH H;GOODARZI F", "OVERTON M"), TI = c("SOME ENGINEERING ASPECTS OF ROCK WEATHERING WITH FIELD EXAMPLES FROM DARTMOOR AND ELSEWHERE",
"RESPIRATORY SYNCYTIAL VIRUS INFECTION IN NORTH-EAST ENGLAND",
"TECTONIC AND CLIMATIC CONTROLS ON CONTINENTAL DEPOSITIONAL FACIES IN THE KAROO BASIN OF NORTHERN NATAL, SOUTH AFRICA",
"WORLD COALS: GENESIS OF THE WORLD'S MAJOR COALFIELDS IN RELATION TO PLATE TECTONICS",
"WEATHER AND AGRICULTURAL CHANGE IN ENGLAND, 1660-1739"), SO = c("QUARTERLY JOURNAL OF ENGINEERING GEOLOGY",
"BRITISH MEDICAL JOURNAL", "SEDIMENTARY GEOLOGY", "FUEL", "AGRICULTURAL HISTORY"
), JI = c("Q. J. ENG. GEOL.", "BRIT. MED. J.", "SEDIMENT. GEOL.",
"FUEL", "AGRICULTURAL HISTORY")
An option with str_extract:
library(dplyr)
library(stringr)
df %>%
  mutate(Names = str_extract_all(Names, str_c(authors, collapse = "|"))) %>%
  filter(lengths(Names) > 0)
# Title Names
#1 A Mark, John
#2 C Luke
data
df <- data.frame(Title, Names = AU)
In base R you can do it like so:
df[sapply(strsplit(as.character(df$Names), "; "), function(x) any(authors %in% x)), ]
Title Names
1 A Mark; John; Paul
3 C Sarah; Luke
This can be accomplished by subsetting on those Names that match the pattern specified in the first argument to the function grepl:
df[grepl(paste0(authors, collapse = "|"), df[,2]),]
Title Names
[1,] "A" "Mark; John; Paul"
[2,] "C" "Sarah; Luke"

Deleting rows that have more than a certain number of columns in a comma delimited file

I have rows/observations in a comma delimited file that ideally should have 55 columns. But there are fields, such as addresses, that have an extra comma within them. For example, "Manhattan, New York" should be one field, but when I read the file I get two fields, "Manhattan" and "New York", which increases the number of columns.
Is there any way I can delete such observations using R, or a tool such as Delimit or Excel?
I would eventually like to load this file into R for analysis.
I agree my question is similar to Delete lines or rows in a tab-delimited file, by number of cells in that lines or rows but I am looking for a solution in R.
Input
Name, Address, DOB
John, Manhattan, New York, 2/8/1990
Jacob, Arizona, 9/10/2012
Smith, New Jersey, 8/10/2016
Expected Output
Name, Address, DOB
Jacob, Arizona, 9/10/2012
Smith, New Jersey, 8/10/2016
In general, I do not advocate throwing away records. Nonetheless, if this is what you want to do, you could do it as follows.
Assuming your data is stored as text in a file called foo, you can use the count.fields function to count the fields in each line, as defined by the presence of sep. Then just drop the offending lines from the readLines output.
text <-
"Name, Address, DOB
John, Manhattan, New York, 2/8/1990
Jacob, Arizona, 9/10/2012
Smith, New Jersey, 8/10/2016
"
cat(text, file = "foo", sep = ",")
fields <- count.fields("foo", sep = ",")
readLines("foo")[fields == 3]
One option would be to read the file with readLines, wrap quotes around the multi-word address fields with sub, and then read the dataset with read.table:
lines1 <- gsub(",", " ", lines)
lines1[-1] <- sub("^(\\S+)\\s+([^0-9]+\\b)\\s+(\\d+.*)", "\\1 '\\2' \\3",
                  lines1[-1])
read.table(text = lines1, stringsAsFactors = FALSE, header = TRUE)
# Name Address DOB
#1 John Manhattan New York 2/8/1990
#2 Jacob Arizona 9/10/2012
#3 Smith New Jersey 8/10/2016
data
lines <- readLines("yourfile.txt")
We can count the number of commas in each line and subset the line vector for only those lines that have the expected number of commas:
## read in raw file lines using readLines()
lines1 <- readLines(textConnection('Name, Address, DOB\nJohn, Manhattan, New York, 2/8/1990\nJacob, Arizona, 9/10/2012\nSmith, New Jersey, 8/10/2016\n'));
## subset for lines with the expected number of commas
lines2 <- lines1[2L==sapply(lines1,function(s) nchar(s)-nchar(gsub(',','',s)))];
## result
lines1;
## [1] "Name, Address, DOB"
## [2] "John, Manhattan, New York, 2/8/1990"
## [3] "Jacob, Arizona, 9/10/2012"
## [4] "Smith, New Jersey, 8/10/2016"
## [5] ""
lines2;
## [1] "Name, Address, DOB"
## [2] "Jacob, Arizona, 9/10/2012"
## [3] "Smith, New Jersey, 8/10/2016"
