String replacements: how to deal with similar strings and spaces - r

Context: translate a table from French to English using a table containing corresponding replacements.
Problem: some character strings are very similar, and when white space is involved str_replace_all() does not seem to consider the whole string.
Reproducible example:
library(stringr) #needed for the str_replace_all() function
#datasets
# test is the table indicating corresponding strings
test = data.frame(fr = as.character(c("Autre", "Autres", "Autre encore")),
en = as.character(c("Other", "Others", "Other again")),
stringsAsFactors = FALSE)
# test1 is the table I want to translate
test1 = data.frame(totrans = as.character(c("Autre", "Autres", "Autre encore")),
stringsAsFactors = FALSE)
# here is a function to translate
test2 = str_replace_all(test1$totrans, setNames(test$en, test$fr))
Output:
I get
> test2
[1] "Other" "Others" "Other encore"
Expected result:
> testexpected
[1] "Other" "Others" "Other again"
As you can see, if the strings start the same way but contain no whitespace, the replacement succeeds (see Other and Others), but when there is whitespace it fails ("Autre encore" is replaced by "Other encore" and not by "Other again").
I feel the answer is very obvious but I just can't find out how to solve it... Any suggestion is welcome.

I think you just need word boundaries (i.e. "\\b") around your look ups. It is straightforward to add these with a paste0 call inside str_replace_all.
Note you don't need to include the whole tidyverse for this; the str_replace_all function is part of the stringr package, which is just one of several packages loaded when you call library(tidyverse):
library(stringr)
test = data.frame(fr = as.character(c("Autre", "Autres", "Autre encore")),
en = as.character(c("Other", "Others", "Other again")),
stringsAsFactors = FALSE)
test1 = data.frame(totrans = as.character(c("Autre", "Autres", "Autre encore")),
stringsAsFactors = FALSE)
str_replace_all(test1$totrans, paste0("\\b", test$fr, "\\b"), test$en)
#> [1] "Other" "Others" "Other again"
Created on 2020-05-14 by the reprex package (v0.3.0)
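One caveat, offered as a sketch rather than as part of the answer above: when pattern and replacement are plain vectors of the same length as the input, str_replace_all pairs them element by element, so the call above relies on test1 having the same rows in the same order as test. If each cell is meant to be translated as a whole string, a base R lookup with match() avoids regular expressions altogether. The totrans values below, including the untranslated "Inconnu", are made up for illustration.

```r
# Lookup table: French strings and their English translations
fr <- c("Autre", "Autres", "Autre encore")
en <- c("Other", "Others", "Other again")

# Values to translate, deliberately in a different order than the lookup
# table and with one entry ("Inconnu") that has no translation
totrans <- c("Autre encore", "Autres", "Autre", "Inconnu")

# match() compares whole strings, so "Autre" cannot match inside "Autre encore"
idx <- match(totrans, fr)

# Keep the original value where no translation exists
translated <- ifelse(is.na(idx), totrans, en[idx])
print(translated)
```

Because the lookup is by exact value, the order of the rows in the two tables no longer matters.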

Related

How to perform "find and replace" with multiple patterns to be found in a string in R

I am trying to switch genders of words in a string in R. For example, if I have the sentence "My gf has a mother who talks to my father and his bf", I want it to read "My bf has a father who talks to my mother and her gf".
I have a key-value pair list which contains a list of gender pairs -- right now it is just a dataframe which looks something like the below. Then my naive way of solving it was just to do a string replace where I iterate through the list and replace the key with the value. The obvious problem with this is that it just ends up swapping everything in the sentence, and then swapping it all back. You can see this is the example code below.
library(stringr)
key_vals = data.frame(first_word = c("bf", "gf", "mother", "father", "his", "her"), second_word = c("gf", "bf", "father", "mother", "her", "his"))
ex = "My gf has a mother who talks to my father and his bf"
for(i in 1:nrow(key_vals)){
ex = str_replace_all(ex, key_vals$first_word[i], key_vals$second_word[i])
}
My other idea was making two lists, one which had all male keys and all female values, and one which was the opposite. Then if I split up the sentence into individual words, for each word I could do an if statement like "if a male string is present, replace it with a female string, elif a female string is present, replace it with a male string, else do nothing". However, I can't figure out how to get just the words alone in a way I can then easily recombine into a working sentence. String split based on regex etc. just deletes the words, so I'm really struggling.
Another problem is that if, for example, there is something like "mother", it might get replaced to be "mothis", since I'm using a stupid way of matching strings which doesn't first identify the words, so it seems like I need to split it into words in any case.
This feels like it should be much more straightforward than it has been for me! Any help would be very appreciated.
We may use gsubfn
library(gsubfn)
gsubfn("(\\w+)", setNames(as.list(key_vals[[2]]), key_vals[[1]]), ex)
[1] "My bf has a father who talks to my mother and her gf"
Change for loop part to this:
plyr::mapvalues(str_split(ex, ' ')[[1]], key_vals$first_word, key_vals$second_word) %>%
str_flatten(' ')
The following `from` values were not present in `x`: her
[1] "My bf has a father who talks to my mother and her gf"
ex
[1] "My gf has a mother who talks to my father and his bf"
I think the warning can be ignored as it is just complaining that her is not in the sentence that ex contains.
The code first splits the character into a vector, then replaces the individual words and then pastes them back together again.
Rather than relying on a data frame of replacements, you could use a named vector, which is similar to a dictionary of values:
replacements <- key_vals$second_word
names(replacements) <- key_vals$first_word
bf gf mother father his her
"gf" "bf" "father" "mother" "her" "his"
ex_split <- str_split(ex, ' ')[[1]]
swapped <- replacements[ex_split]
final <- paste0(ifelse(!is.na(swapped), swapped, ex_split), collapse = ' ')
"My bf has a father who talks to my mother and her gf"
After creating ex_split, you could also substitute and glue everything together with Reduce:
Reduce(function(x, y) paste(x, ifelse(!is.na(replacements[y]), replacements[y], y)), ex_split)
Here is a base R option using strsplit + match like below
with(
key_vals,
{
v <- unlist(strsplit(ex, "(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)", perl = TRUE))
p <- second_word[match(v, first_word)]
paste0(ifelse(is.na(p), v, p), collapse = "")
}
)
and it yields
[1] "My bf has a father who talks to my mother and her gf"
This does what you need.
library(stringr)
# I've updated the columns names, for clarity
key_vals <- data.frame(words = c("bf", "gf", "mother", "father", "his", "her"), swapped_words = c("gf", "bf", "father", "mother", "her", "his"))
# used str_split to break the sentence into multiple words
ex <- "My gf has a mother who talks to my father and his bf"
words <- stringr::str_split(ex, " ")[[1]] #break into words
# do a left join between the two tables (all.x = TRUE keeps every word of the sentence)
dict <- merge(data.frame(words=words), key_vals, by = "words", all.x = TRUE, incomparables = NA)
# now we basically apply the dictionary to the string, using an apply function
# we also use paste(..., collapse = " ") to make them into one sentence again
words <- paste(sapply(words, function(x) {
if (!x %in% key_vals$words)
return (x)
return(dict$swapped_words[dict$words == x])
}), collapse=" ")
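The split-and-paste answers above assume words are separated by single spaces. As a hedged alternative, the sketch below swaps words in place with gregexpr() and regmatches<-() from base R, so punctuation and spacing are left untouched. The sentence with a comma and a full stop is my own variation on the example, and swap is just the key-value table written as a named vector.

```r
# Named vector used as a dictionary: names are the keys, values the swaps
swap <- c(bf = "gf", gf = "bf", mother = "father",
          father = "mother", his = "her", her = "his")

ex <- "My gf has a mother, who talks to my father and his bf."

# Locate every run of word characters in the sentence
m <- gregexpr("\\w+", ex)
words <- regmatches(ex, m)[[1]]

# Swap only the words that appear as keys; each word is looked up once,
# so nothing gets swapped and then swapped back
hit <- words %in% names(swap)
words[hit] <- swap[words[hit]]

# Write the words back into their original positions
regmatches(ex, m) <- list(words)
print(ex)
```

Since whole words are looked up, "mother" can never turn into "mothis".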

Cleaning strings in sparklyr using regex

I'm trying to clean strings in a table in sparklyr using regexp_replace. I need to remove both multiple spaces between words and specific whole words.
Establish Spark Connection
pharms <- spark_read_parquet(sc, 'pharms', 's3/path/to/pharms', infer_schema = TRUE, memory = FALSE)
Vector to clean
The df vector I want to clean looks like this, but it is within a table in the sparklyr connection:
drug_strings <- c("tablomiacin sodium tab mg", "nsaid caps mg")
The desired output once the regex processes the data would be something like this:
Desired Outcomes
[1] "tablomiacin sodium", "nsaid"
Attempts
I've tried various combinations used in regex such as:
pharms_cln <- pharms %>%
distinct(drug_strings)%>%
mutate(new_strings=regexp_replace(drug_strings, "\\b(caps|mg|tab)\\b", ""))
pharms_cln <- pharms %>%
distinct(drug_strings)%>%
mutate(new_strings=regexp_replace(drug_strings, "\\s+", ""))
But they all either replace all letters or substrings rather than just the individual words, or print an error related to Hive. Similarly, my attempts to remove blank spaces just seem to remove the letter 's'.
If the rule for the sought replacement is "keep anything preceding caps|mg|tab", then this may work:
Data:
drug_strings <- c("tablomiacin sodium tab mg", "nsaid caps mg")
Solution:
trimws(gsub("\\b(tab|mg|caps)\\b", "", drug_strings))
[1] "tablomiacin sodium" "nsaid"
If for some reason you need to use str_extract, you can do this:
str_extract(gsub("\\s{2,}", " ", drug_strings), "\\b\\w+\\b(\\s\\b\\w+\\b)*(?=\\s\\b(tab|mg|caps)\\b)")
This first reduces all multiple white space characters to just one such char, and then does the extraction.
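The two requirements from the question (dropping whole words and collapsing repeated spaces) can be combined in one small helper. This is only a sketch on a local character vector: clean_drug and its drop argument are names I made up, and the third test string is invented to show the whitespace handling.

```r
drug_strings <- c("tablomiacin sodium tab mg", "nsaid caps mg",
                  "tab  nsaid   mg extra")

# Hypothetical helper: remove whole words, collapse runs of whitespace, trim
clean_drug <- function(x, drop = c("tab", "mg", "caps")) {
  # \\b on both sides keeps "tab" from matching inside "tablomiacin"
  pattern <- paste0("\\b(", paste(drop, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  # Collapse any run of whitespace into a single space
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

print(clean_drug(drug_strings))
```

Inside a sparklyr mutate() the pattern is handed to Spark SQL, which may strip one level of backslashes. If "\\s+" appears to remove the letter s, writing "\\\\s+" is worth trying, though whether that is needed depends on the Spark configuration.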
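Someone who knows regex could certainly streamline this code, but the following works for these two strings, using the str_remove function from the stringr package. Note that str_remove only removes the first match, hence the repeated calls.
library(dplyr)
library(stringr)
drug_strings <- c("tablomiacin tab mg", "nsaid caps mg")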
drug_strings <- data.frame(drug_strings)
drug_strings <- drug_strings %>%
mutate(new_strings=str_remove(drug_strings, "\\b(caps|mg|tab)\\b")) %>%
mutate(new_strings=str_remove(new_strings, "\\s+")) %>%
mutate(new_strings = str_remove(new_strings, "mg"))

Subset strings in R

One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
The rest of the data in the vector follows the same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\n\\((.*)\\)', '\\1', df$location1)
I was able to make it work by looking at other codes online but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full I have been unable to use it to subset other parts of this same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subsetting using regex, using the scan() function and many other variations. Now I am thoroughly confused. Thx
We can also use the separate function from the tidyr package (part of the tidyverse package).
library(tidyverse)
# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
stringsAsFactors = FALSE)
dat
# Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)
# Separate the Data column
dat2 <- dat %>%
separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
sep = ", |\\n\\(|\\)|[[:space:]]")
dat2
# City State Zip Latitude Longitude
# 1 Potomac MD 20854 39.038266 -77.203413
You can try strsplit or data.table::tstrsplit (strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to separate state and zip, and city names may contain spaces, you can try a two-step approach:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
Here is an option using base R
read.table(text= trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
header = FALSE, col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
stringsAsFactors = FALSE)
# City State Zip Latitude Longitude
#1 Potomac MD 20854 39.03827 -77.20341
So this process might be a little longer, but for me it makes things clear. As opposed to using breaks, below I identify values by using a specific regex for each value I want. I make a vector of regex to extract each value, a vector for the variable names, then use a loop to extract and create the dataframe from those vectors.
library(stringi)
library(dplyr)
library(purrr)
rgexVec <- c("[\\w\\s-]+(?=,)",
"[A-Z]{2}",
"\\d+(?=\\n)",
"[\\d-\\.]+(?=,)",
"[\\d-\\.]+(?=\\))")
varNames <- c("city",
"state",
"zip",
"lat",
"long")
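# value is the string to parse; it is not defined in the original snippet, so it is assumed here
value <- "Potomac, MD 20854\n(39.038266, -77.203413)"
map2_dfc(varNames, rgexVec, function(vn, rg) {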
extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
names(extractedVal) <- vn
extractedVal %>% as_tibble()
})
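\\1 is a backreference. In the replacement string it stands for whatever the first parenthesised group in the pattern matched; it is not a wildcard. In your gsub call the pattern matches the entire string and the group (.*) captures the text between the parentheses, so replacing the whole match with \\1 leaves only the coordinates. A second group would be available as \\2, and so on.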
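To make the backreference concrete, here is a small sketch that reuses a single pattern with five capture groups and picks out one group at a time with \\1 to \\5. The pattern is my own and assumes every entry looks exactly like the example string (two-letter state, five-digit zip).

```r
x <- "Potomac, MD 20854\n(39.038266, -77.203413)"

# Five capture groups: city, state, zip, latitude, longitude
pattern <- "^(.*), ([A-Z]{2}) (\\d{5})\\n\\((.*), (.*)\\)$"

# The pattern matches the whole string, so replacing the match with \\k
# leaves only what group k captured
parts <- sapply(1:5, function(k) sub(pattern, paste0("\\", k), x, perl = TRUE))
names(parts) <- c("city", "state", "zip", "lat", "long")
print(parts)
```

Strings that do not fit the pattern are returned unchanged by sub(), so it is worth checking for those before trusting the result.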

R Replace stopwords in a column made of lists

Considering this data frame
test = data.frame(language=c("german", "english"), text=I(list(c("und das Beil", "wichtige Thematik der"), c("some useful information", "the most unuseful product"))))
I need to delete the stopwords in each vector of the column "text" according to the language of the row. For now I only need to distinguish between German and English, so I thought of using apply in combination with ifelse like this:
test[2] = apply(test, 1, function(x) ifelse(x[1] == "german", lapply(x[2], function(y)removeWords(y, stopwords("de"))), lapply(x[2], function(y)removeWords(y, stopwords("en")))))
But this doesn't work.
Maybe there is an even more elegant way to solve this?
As a first step you could do:
library(tm)
apply(test, 1, function(x) removeWords(x[["text"]], stopwords(x[["language"]])))
Which gives you as a result:
[,1] [,2]
[1,] " Beil" " useful information"
[2,] "wichtige Thematik " " unuseful product"
I don't know what the desired output is, though.
Here is a tidy solution that can easily be extended to multiple languages:
library(tidyverse)
library(tm) # provides removeWords() and stopwords()
test <- tibble(
language = c("german", "english"),
text = I(list(c("und das Beil", "wichtige Thematik der"),
c("some useful information", "the most unuseful product")))
)
test %>%
mutate(lang_abr = recode(language, "german" = "de", "english" = "en")) %>%
mutate(text = map2(text, lang_abr, ~ removeWords(.x, stopwords(.y))))
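For readers without tm installed, the same row-by-row idea can be sketched in base R with Map(). The stop word lists here are deliberately tiny and hypothetical (they are not the lists tm ships), and remove_words is a made-up helper name.

```r
# Hypothetical, deliberately tiny stop word lists keyed by language name;
# in practice these would come from a package such as tm
stop_lists <- list(
  german  = c("und", "das", "der"),
  english = c("some", "the", "most")
)

languages <- c("german", "english")
texts <- list(
  c("und das Beil", "wichtige Thematik der"),
  c("some useful information", "the most unuseful product")
)

# Remove whole words only, then tidy up the whitespace left behind
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  trimws(gsub("\\s+", " ", gsub(pattern, "", x, perl = TRUE), perl = TRUE))
}

# Map() walks over texts and languages in parallel, one row at a time
cleaned <- Map(function(txt, lang) remove_words(txt, stop_lists[[lang]]),
               texts, languages)
print(cleaned)
```

Adding a language then only means adding one more entry to stop_lists.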

Efficient way to remove all proper names from corpus

Working in R, I'm trying to find an efficient way to search through a file of texts and remove or replace all instances of proper names (e.g., Thomas). I assume there is something available to do this but have been unable to locate it.
So, in this example the words "Susan" and "Bob" would be removed. This is a simplified example; in reality I would want this to apply to hundreds of documents and therefore to a fairly large list of names.
texts <- as.data.frame (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
names(texts) [1] <- "text"
Here's one approach based upon a data set of first names:
install.packages("gender")
library(gender)
install_genderdata_package()
sets <- data(package = "genderdata")$results[,"Item"]
data(list = sets, package = "genderdata")
stopwords <- unique(kantrowitz$name)
texts <- as.data.frame (rbind (
'This text stuff if quite interesting',
'Where are all the names said Susan',
'Bob wondered what happened to all the proper nouns'
))
removeWords <- function(txt, words, n = 30000L) {
l <- cumsum(nchar(words)+c(0, rep(1, length(words)-1)))
groups <- cut(l, breaks = seq(1,ceiling(tail(l, 1)/n)*n+1, by = n))
regexes <- sapply(split(words, groups), function(words) sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), collapse = "|")))
for (regex in regexes) txt <- gsub(regex, "", txt, perl = TRUE, ignore.case = TRUE)
return(txt)
}
removeWords(texts[,1], stopwords)
# [1] "This text stuff if quite interesting"
# [2] "Where are all the names said "
# [3] " wondered what happened to all the proper nouns"
It may need some tuning for your specific data set.
Another approach could be based upon part-of-speech tagging.
