Recognizing synonyms in left_join in R

I have several quite large data tables containing character strings, which I would like to join with the entries in my database. The spelling is often not quite right, so an exact join fails.
I know there is no way around creating a synonym table to replace some misspelled strings. But is there a way to automatically handle certain anomalies (see example below)?
My data tables look similar to this:
data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))
The characters in my database are similar to this:
characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))
Currently, if I perform a left_join, only "Apple" is matched:
data <- data %>%
left_join(characters.database, by = c('products'))
Result:
products        ID
potatoe Chips   NA
potato Chips    NA
potato chips    NA
Potato-chips    NA
apple           NA
Apple           3
Appl            NA
Apple Gala      NA
Is it possible to make left_join automatically ignore letter case, spaces (" "), hyphens ("-"), and an "e" at the end of a word?
This is the table I would like:
products        ID
potatoe Chips   1
potato Chips    1
potato chips    1
Potato-chips    1
apple           3
Apple           3
Appl            3
Apple Gala      NA
Any ideas?

If I were you, I'd do a few things:
I'd strip all special characters, lower case all characters, remove spaces, etc. That'd help a bunch (i.e. potato chips, Potato Chips, and Potato-chips all go to "potatochips" which you can then join on).
There's a package called fuzzyjoin that will let you join on regular expressions, by edit distance, etc. That'll help with Apple vs Apple Gala and misspellings, etc.
You can strip special characters (only keep letters) + lowercase with something like:
library(stringr)
library(magrittr)
string %>%
str_remove_all("[^A-Za-z]+") %>%
tolower()
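The rules from the question (ignore case, spaces, hyphens and an "e" at the end of a word) can be sketched as a normalised join key. This is only a minimal base R sketch under stated assumptions: normalise_key and db_name are hypothetical names, not part of any package, and match() stands in for left_join to keep the example free of dependencies.

```r
# Hypothetical helper: builds a normalised key to join on.
normalise_key <- function(x) {
  x <- tolower(x)                        # ignore letter case
  x <- gsub("e\\b", "", x, perl = TRUE)  # drop an "e" at the end of a word
  gsub("[ -]", "", x)                    # drop spaces and hyphens
}

data <- data.frame(
  products = c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips",
               "apple", "Apple", "Appl", "Apple Gala"),
  stringsAsFactors = FALSE
)
db <- data.frame(
  products = c("Potato Chips", "Potato Chips Paprika", "Apple"),
  ID = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# Look up each normalised product in the normalised database names.
idx <- match(normalise_key(data$products), normalise_key(db$products))
data$ID <- db$ID[idx]             # NA where nothing matched
data$db_name <- db$products[idx]  # correct spelling from the database
data
```

The trailing "e" is removed before spaces are stripped, because the word boundaries disappear once the words are glued together. "Apple Gala" stays unmatched, as requested in the question.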

Thanks, Matt Kaye, for your suggestion; I ended up doing something similar.
Because I need the correct spelling from the database, and some of my strings contain symbols and numbers that are relevant, I did the following:
library(data.table)
library(dplyr)
# data
data <- data.table(products=c("potatoe Chips", "potato Chips", "potato chips", "Potato-chips", "apple", "Apple", "Appl", "Apple Gala"))
characters.database <- data.table(products=c("Potato Chips", "Potato Chips Paprika", "Apple"), ID=c("1", "2", "3"))
#remove spaces and capital letters in data
data <- data %>%
mutate(products= tolower(products)) %>%
mutate(products= gsub(" ", "", products))
#add ID to database
characters.database <- characters.database %>%
dplyr::mutate(ID = row_number())
#remove spaces and capital letters in database product names
characters.database_syn <- characters.database %>%
mutate(products= tolower(products)) %>%
mutate(products= gsub(" ", "", products))
#join and add correct spelling from database
data <- data %>%
left_join(characters.database_syn, by = c('products')) %>%
select(product_syn=products, 'ID') %>%
left_join(characters.database, by = c('ID'))
#other synonyms have to be corrected manually or with the help of a synonym table (in MY data, special characters are relevant!)
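For the leftover misspellings, the edit-distance idea mentioned in the first answer can be tried without extra packages using adist() from base R. The following is a rough sketch, not the fuzzyjoin approach itself; max_dist = 2 is an arbitrary assumption that would need tuning on real data, and the variable names are illustrative.

```r
db <- c("Potato Chips", "Potato Chips Paprika", "Apple")
query <- c("potatoe Chips", "Appl", "Apple Gala")

# Edit distance between every query (rows) and every database entry (columns).
d <- adist(tolower(query), tolower(db))

best <- apply(d, 1, which.min)                 # closest database entry per query
best_dist <- d[cbind(seq_along(query), best)]  # distance to that entry

max_dist <- 2  # assumed threshold: accept at most two edits
matched <- ifelse(best_dist <= max_dist, db[best], NA_character_)
matched
```

A small threshold keeps "Apple Gala" from being merged into "Apple", at the cost of missing heavier misspellings, so the candidates are probably best reviewed by hand before they go into a synonym table.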

Related

Is there a way to show the matching element of a specific case using the grepl function in R?

I checked whether the brands of the data frame "df1"
brands
1 Nike
2 Adidas
3 D&G
are to be found in the elements of the following column of the data frame "df2"
statements
1 I love Nike
2 I don't like Adidas
3 I hate Puma
For this I use the code:
subset_df2 <- df2[grepl(paste(df1$brands, collapse="|"), ignore.case=TRUE, df2$statements), ]
The code works and I get a subset of df2 containing only the lines with the desired brands:
statements
1 I love Nike
2 I don't like Adidas
Is there also a way to display which element of the cells from df2$statements exactly matches with df1$brands? For instance, a vector like [Nike, Adidas]. So, I only want to get the Nike and Adidas elements as my output and not the whole statement.
Many thanks in advance!
brands <- c("nike", "adidas", "d&g") # lower-case here
text <- c("I love Nike", "I love Adidas")
ptns <- paste(brands, collapse = "|")
ptns
# [1] "nike|adidas|d&g"
text2 <- text[NA]
hit <- grepl(ptns, text, ignore.case = TRUE)  # which elements contain a brand
text2[hit] <- gsub(paste0(".*(", ptns, ").*"), "\\1", text[hit], ignore.case = TRUE)
text2
# [1] "Nike" "Adidas"
text2 is pre-filled with NA because gsub returns its input unchanged when the pattern is not found, so only the matching elements should be overwritten. I'm using text[NA], but we could also use rep(NA_character_, length(text)); the effect is the same.
If you need multiple matches per text, then perhaps
brands <- c("Nike", "Adidas", "d&g")
text <- c("I love nike", "I love Adidas and Nike")
ptns <- paste(brands, collapse = "|")
gre <- gregexpr(ptns, text, ignore.case = TRUE)
sapply(regmatches(text, gre), paste, collapse = ";")
# [1] "nike" "Adidas;Nike"
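Applied to the df1/df2 layout from the question, including a statement without any brand, the same idea might look like the sketch below. It uses regexpr() with regmatches() instead of gsub(); the column name brand is an assumption added for illustration.

```r
df1 <- data.frame(brands = c("Nike", "Adidas", "D&G"), stringsAsFactors = FALSE)
df2 <- data.frame(
  statements = c("I love Nike", "I don't like Adidas", "I hate Puma"),
  stringsAsFactors = FALSE
)

ptns <- paste(df1$brands, collapse = "|")

# regexpr() returns -1 for statements without a match.
m <- regexpr(ptns, df2$statements, ignore.case = TRUE)

# regmatches() returns only the matched substrings, so assign them to the hits.
df2$brand <- NA_character_
df2$brand[m > 0] <- regmatches(df2$statements, m)
df2
```

Rows without a brand keep NA, so the original statements and their first matched brand stay side by side.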

R, stringr, mutate (I think) - multiple partial string replacements in multiple strings

I am new to text mining, R and the tidy approach and am looking for kind advice to overcome a hurdle with pre-processing text strings read in from pdf files. The specific problem is with a multiple string replacement over multiple strings.
I have data from 2 sources:
PDF reports: I have used map and pdf_text functions to read a directory of pdf reports into a data frame which creates a tibble with 3 columns: page_string, filename and pagenumber. There are 1,191 entries, and page_string holds a string being one page of pdf text.
CSV file of professional words and replacements: I have used the read_csv function to import this. The resultant df has 2 columns with 77 entries: target_vocab (e.g. social worker) and replace_token (e.g. social_worker).
My aim is to amend the current character strings in my main data frame, replacing strings which match the professional words in target_vocab with the associated compound token in replace_token prior to tokenization.
String example - before and after string substitution:
"Social workers and early help staff work with multi-agency partners to produce child in need plans led by the allocated social worker".
"Social_workers and early_help staff work with multi_agency partners to produce CIN plans led by the allocated social_worker".
It is hopefully clear that I want "social workers", "early help", "multi-agency", "child in need" and "social worker" replaced with compound tokens.
My code:
#a bank of pdf reports and "professional_words.csv" in current working directory
library(tidyverse)
library(pdftools)
#> Using poppler version 0.73.0
library(tidytext)
library(stringr)
pdf_filenames <- list.files(pattern = "pdf$")
words_df <- read_csv("professional_words.csv", skip = 1, col_names = c("target_vocab", "replace_token"))
pattern_vector <- words_df$target_vocab
replace_vector <- words_df$replace_token
pdf_pages_df <- map_df(pdf_filenames, ~ tibble(page_string = pdf_text(.x)) %>%
mutate(filename = .x, pagenumber = row_number()) %>%
mutate(page_string = str_replace_all(page_string,pattern_vector,replace_vector)))
The bit that doesn't work within the map function is:
mutate(page_string = str_replace_all(page_string,pattern_vector,replace_vector)))
I have tried all sorts of variations, including gsub, breaking it away from the pipe to a separate map function etc. but with my limited knowledge I am not fixing it.
I have consistently had the warning:
In stri_replace_all_regex(string, pattern,
fix_replacement(replacement), : longer object length is not a
multiple of shorter object length
With this variation of code I am also getting the error:
Problem with mutate() input page_string. x Input
page_string can't be recycled to size 10. ℹ Input page_string is
str_replace_all(page_string, pattern = pattern_vector, replacement = replace_vector). ℹ Input page_string must be size 10 or 1, not 77.
My sense is that map or list functions will help me but I seem to be going round in circles and I haven't yet found a Stack Overflow response that has helped me fix the problem.
There is a way to do what you want with str_replace_all from stringr. Instead of providing a pattern and a replacement separately, pass a named vector to pattern, where the names are the patterns and the values are the replacements. Something like pattern = c("social worker" = "social_worker", "early help" = "early_help", "multi agency" = "multi_agency"). I'll start with a simple example, and then show you how to have R build that named vector from your words_df.
# Simple example
library(stringr)
string <- "The quick brown fox"
str_replace_all(string, pattern = c("brown" = "green", "fox" = "badger"))
[1] "The quick green badger"
Here is how you do it with some fake data that looks like yours, having R build the named replacement vector.
# Making the fake data
words_df <- data.frame(target = c("fox", "brown", "quick"),
replacement = c("badger", "green", "versatile"))
strings_df <- data.frame(page_string = c("The quick brown fox",
"The sad yellow fox",
"The quick old dog",
"The lazy brown dog",
"The quick happy fox"))
# Making the named replacement vector from words_df
replacements <- c(words_df$replacement)
names(replacements) <- c(words_df$target)
# Doing the replacement
library(dplyr)
strings_df %>%
mutate(new_string = str_replace_all(page_string,
pattern = replacements))
# The output
page_string new_string
1 The quick brown fox The versatile green badger
2 The sad yellow fox The sad yellow badger
3 The quick old dog The versatile old dog
4 The lazy brown dog The lazy green dog
5 The quick happy fox The versatile happy badger
str_replace_all does not work like that. If you provide vectors for pattern and replacement, the first pattern/replacement is applied to the first element of string and so on. See the following example:
library(stringr)
fruits <- c("one apple two", "two pears", "three bananas")
pattern_v <- c("one", "two", "three")
replace_v <- c("1", "2", "3")
str_replace_all(fruits, pattern_v, replace_v)
#> [1] "1 apple two" "2 pears" "3 bananas"
Created on 2020-08-25 by the reprex package (v0.3.0)
Note that "two" gets only replaced with "2" in the second element of string. Therefore, it doesn't work if the pattern/replacement vectors are not of the same length (or a multiple) of string:
pattern_v <- c("one", "two")
replace_v <- c("1", "2")
str_replace_all(fruits, pattern_v, replace_v)
[1] "1 apple two" "2 pears" "three bananas"
warning:
In stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
longer object length is not a multiple of shorter object length
To circumvent this problem, you can pass a named vector for pattern:
str_replace_all(fruits, c("one" = "1", "two" = "2", "three" = "3"))
[1] "1 apple 2" "2 pears" "3 bananas"
Ben's answer shows an easy way to build that named vector. The names are the patterns and the values are the replacements:
pattern_new <- c("1", "2", "3")
names(pattern_new) <- c("one", "two", "three")
str_replace_all(fruits, pattern_new)
[1] "1 apple 2" "2 pears" "3 bananas"
Problem solved thanks to the speedy responses. Here is the working code that resolved my question, for anyone who may be struggling with this in the future:
professional_terms <- c(words_df$replace_token)
names(professional_terms) <- c(words_df$target_vocab)
pdf_pages_df <- map_df(pdf_filenames, ~ tibble(page_string = pdf_text(.x)) %>%
mutate(filename = .x, pagenumber = row_number(), page_string = str_replace_all(page_string,pattern = professional_terms)))
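What a named replacement vector does, every pattern applied to every string, one after the other, can also be sketched in base R with a loop over gsub(). This is only an illustration of the semantics, not the stringr implementation; replace_terms and the sample pages are hypothetical, and fixed = TRUE is an assumption that the vocabulary should be matched literally rather than as regular expressions.

```r
words_df <- data.frame(
  target_vocab  = c("social worker", "early help", "multi-agency", "child in need"),
  replace_token = c("social_worker", "early_help", "multi_agency", "CIN"),
  stringsAsFactors = FALSE
)
pages <- c(
  "social workers and early help staff work with multi-agency partners",
  "child in need plans led by the allocated social worker"
)

# Hypothetical helper: apply each target/token pair to all strings in turn.
replace_terms <- function(text, targets, tokens) {
  for (i in seq_along(targets)) {
    text <- gsub(targets[i], tokens[i], text, fixed = TRUE)
  }
  text
}

cleaned <- replace_terms(pages, words_df$target_vocab, words_df$replace_token)
cleaned
```

Because the pairs are applied in order, a short term that is contained in a longer one may need to come later in the table. Matching here is case-sensitive, so a capitalised "Social workers" at the start of a sentence would need its own entry or a case-insensitive match.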

Exact match from list of words from a text in R

I have a list of words and want to check whether any of them occur in a text column.
With my code, the last column is always "Found", because the words are matched as patterns anywhere inside longer words. I am looking for exact matches of the listed words, not partial matches. The first three records should be "Not Found".
Please advise where I am going wrong.
library(stringr)
col_1 <- c(1,2,3,4,5)
col_2 <- c("work instruction change",
"technology npi inspections",
" functional locations",
"Construction has started",
" there is going to be constn coon")
df <- as.data.frame(cbind(col_1,col_2))
df$col_2 <- tolower(df$col_2)
words <- c("const","constn","constrction","construc",
"construct","construction","constructs","consttntype","constypes","ct","ct#",
"ct2"
)
pattern_words <- paste(words, collapse = "|")
df$result<- ifelse(str_detect(df$col_2, regex(pattern_words)),"Found","Not Found")
Use word boundaries around the words.
library(stringr)
pattern_words <- paste0('\\b', words, '\\b', collapse = "|")
df$result <- c('Not Found', 'Found')[str_detect(df$col_2, pattern_words) + 1]
#OR with `ifelse`
#df$result <- ifelse(str_detect(df$col_2, pattern_words), "Found", "Not Found")
df
# col_1 col_2 result
#1 1 work instruction change Not Found
#2 2 technology npi inspections Not Found
#3 3 functional locations Not Found
#4 4 construction has started Found
#5 5 there is going to be constn coon Found
You can also use grepl here to keep it in base R :
grepl(pattern_words, df$col_2)
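To see why the unanchored pattern reports "Found" everywhere, it may help to compare three variants on a reduced word list. The sketch below is base R only; loose, strict and by_token are illustrative names, and the whitespace-token comparison is an alternative I am assuming fits the "exact match" requirement, not something taken from the answer.

```r
words <- c("const", "constn", "construction", "ct")
x <- c("work instruction change",
       "construction has started",
       "there is going to be constn coon")

# Unanchored: "ct" is found inside "instruction".
loose <- grepl(paste(words, collapse = "|"), x)

# Word boundaries: each word must stand on its own.
strict <- grepl(paste0("\\b", words, "\\b", collapse = "|"), x, perl = TRUE)

# Alternative: split on whitespace and compare whole tokens literally.
tokens <- strsplit(trimws(x), "\\s+")
by_token <- vapply(tokens, function(tok) any(tok %in% words), logical(1))

data.frame(x, loose, strict, by_token)
```

The token comparison treats entries such as "ct#" literally, whereas \\b regards "#" as a non-word character, so the two approaches can disagree for words that contain punctuation.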

Returning Specific String found in text [duplicate]

I have the following column in a df
c("I love bananas and apples.",
"I hate apples and pears.",
"I love to eat food.",
"I hate lettuce and bananas")
and I have a vector of fruit
fruit <- c("apples", "bananas", "pears")
I know using str_detect can return TRUE or FALSE per observation using
str_detect(df$text, paste(fruit, collapse='|'))
but what I would like is a column that has the variables that matched up, like the following
"I love bananas and apples." "bananas","apples"
"I hate apples and pears." "apples","pears"
"I love to eat food."
"I hate lettuce and bananas." "bananas"
Is there a way to accomplish this, or is it outside what str_detect can do?
sapply(df$text, function(s){
  # collect every fruit that occurs in the string s
  toString(unlist(lapply(fruit, function(f){
    if(grepl(f, s)) f
  })))
},
USE.NAMES = FALSE)
#[1] "apples, bananas" "apples, pears" "" "bananas"
We can use str_extract_all to extract all the 'fruit' elements from the 'text' column in a list, loop through the list with map and paste (toString) them together to create the 'newtext' column
library(stringr)
library(dplyr)
library(purrr)
df %>%
  mutate(newtext = map_chr(str_extract_all(text,
    str_c(fruit, collapse='|')), ~toString(unique(.x))))
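For completeness, a self-contained base R sketch of the same extraction is given below, with the data frame from the question spelled out. The column name newtext follows the answer above; gregexpr() and regmatches() are used here in place of str_extract_all(), which is a substitution on my part.

```r
df <- data.frame(
  text = c("I love bananas and apples.",
           "I hate apples and pears.",
           "I love to eat food.",
           "I hate lettuce and bananas"),
  stringsAsFactors = FALSE
)
fruit <- c("apples", "bananas", "pears")

# All matches per row, in the order they appear in the text.
hits <- regmatches(df$text, gregexpr(paste(fruit, collapse = "|"), df$text))

# Collapse each row's unique matches; rows without a match become "".
df$newtext <- vapply(hits, function(m) toString(unique(m)), character(1))
df
```

Unlike the sapply() version, the matches come out in the order they occur in the sentence rather than in the order of the fruit vector.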

Gsub apostrophe in data frame R

I need to remove all apostrophes from my data frame, but as soon as I use
textDataL <- gsub("'","",textDataL)
the data frame gets ruined: the result only contains values and NAs. I only want to remove apostrophes from any text in it. Am I missing something obvious about apostrophes and data frames?
To keep the structure intact:
dat1 <- data.frame(Col1= c("a woman's hat", "the boss's wife", "Mrs. Chang's house", "Mr Cool"),
Col2= c("the class's hours", "Mr. Jones' golf clubs", "the canvas's size", "Texas' weather"),
stringsAsFactors=FALSE)
I would use
dat1[] <- lapply(dat1, gsub, pattern="'", replacement="")
or
library(stringr)
dat1[] <- lapply(dat1, str_replace_all, "'","")
dat1
# Col1 Col2
# 1 a womans hat the classs hours
# 2 the bosss wife Mr. Jones golf clubs
# 3 Mrs. Changs house the canvass size
# 4 Mr Cool Texas weather
You don't want to apply gsub directly to a data frame, but column-wise instead. Note that apply returns a character matrix rather than a data frame, e.g.
apply(textDataL, 2, gsub, pattern = "'", replacement = "")
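The reason the direct call breaks is that gsub() coerces its input with as.character(), which turns each column into a single string, so the table structure is lost. If the data frame mixes text with other column types, one option is to touch only the character columns. The sketch below assumes a small made-up data frame; dat and is_chr are illustrative names.

```r
dat <- data.frame(
  id  = 1:2,
  txt = c("a woman's hat", "Texas' weather"),
  stringsAsFactors = FALSE
)

# Find the character columns and rewrite only those, keeping the data frame.
is_chr <- vapply(dat, is.character, logical(1))
dat[is_chr] <- lapply(dat[is_chr], gsub, pattern = "'", replacement = "", fixed = TRUE)
dat
```

Numeric columns such as id keep their type this way, which they would not with apply(), since that converts everything to character first.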
