Detect part of a string in R (not exact match) - r

Consider the following dataset :
a <- c("my house", "green", "the cat is", "a girl")
b <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
c <- c("T", "T", "T", "F")
df <- data.frame(string1=a, string2=b, returns=c)
I m trying to detect string1 in string2 BUT my goal is to not only detect exact matching. I m looking for a way to detect the presence of string1 words in string2, whatever the order words appear. As an example, the string "my beautiful house is cool" should return TRUE when searching for "my house".
I have tried to illustrate the expected behaviour of the script in the "return" column of above the example dataset.
I have tried grepl() and str_detect() functions but it only works with exact match. Can you please help ? Thanks in advance

The trick here is to not use str_detect as is but to first split the search_words into individual words. This is done in strsplit() below. We then pass this into str_detect to check if all words are matched.
library(stringr)
search_words <- c("my house", "green", "the cat is", "a girl")
words <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
patterns <- strsplit(search_words," ")
mapply(function(word,string) all(str_detect(word,string)),words,patterns)

One base R option without the involvement of split could be:
n_words <- lengths(regmatches(df[, 1], gregexpr(" ", df[, 1], fixed = TRUE))) + 1
n_matches <- mapply(FUN = function(x, y) lengths(regmatches(x, gregexpr(y, x))),
df[, 2],
gsub(" ", "|", df[, 1], fixed = TRUE),
USE.NAMES = FALSE)
n_matches == n_words
[1] TRUE TRUE TRUE FALSE
It, however, makes the assumption that there is at least one word per row in string1

Related

Exact match from list of words from a text in R

I have list of words and I am looking for words that are there in the text.
The result is that in the last column is always found as it is searching for patterns. I am looking for exact match that is there in words. Not the combinations. For the first three records it should be not found.
Please guide where I am going wrong.
col_1 <- c(1,2,3,4,5)
col_2 <- c("work instruction change",
"technology npi inspections",
" functional locations",
"Construction has started",
" there is going to be constn coon")
df <- as.data.frame(cbind(col_1,col_2))
df$col_2 <- tolower(df$col_2)
words <- c("const","constn","constrction","construc",
"construct","construction","constructs","consttntype","constypes","ct","ct#",
"ct2"
)
pattern_words <- paste(words, collapse = "|")
df$result<- ifelse(str_detect(df$col_2, regex(pattern_words)),"Found","Not Found")
Use word boundaries around the words.
library(stringr)
pattern_words <- paste0('\\b', words, '\\b', collapse = "|")
df$result <- c('Not Found', 'Found')[str_detect(df$col_2, pattern_words) + 1]
#OR with `ifelse`
#df$result <- ifelse(str_detect(df$col_2, pattern_words), "Found", "Not Found")
df
# col_1 col_2 result
#1 1 work instruction change Not Found
#2 2 technology npi inspections Not Found
#3 3 functional locations Not Found
#4 4 construction has started Found
#5 5 there is going to be constn coon Found
You can also use grepl here to keep it in base R :
grepl(pattern_words, df$col_2)

Collapsing rows using two vectors as indicators

This is my first time posting; please let me know if I'm doing any beginner mistakes. In my specific case I have a vector of strings, and I want to collapse some adjacent rows. I have one vector indicating the starting position and one indicating the last element. How can I do this?
Here is some sample code and my approach that does not work:
text <- c("cat", "dog", "house", "mouse", "street")
x <- c(1,3)
y <- c(2,5)
result <- as.data.frame(paste(text[x:y],sep = " ",collapse = ""))
In case it's not clear, the result I want is a data frame consisting of two strings: "cat dog" and "house mouse street".
Not sure this is the best option, but it does the job,
sapply(mapply(seq, x, y), function(i)paste(text[i], collapse = ' '))
#[1] "cat dog" "house mouse street"
Either use base R with
mapply(function(.x,.y) paste(text[.x:.y],collapse = " "), x, y)
or use the purrr package as
map2_chr(x,y, ~ paste(text[.x:.y],collapse = " "))
Both yield
# [1] "cat dog" "house mouse street"
The output as a data frame depends on the structure you want: rows or columns
I think you want
result <- data.frame(combined = c(paste(text[x[1]:y[1]], collapse = " "),
paste(text[x[2]:y[2]], collapse = " ")))
Which gives you
result
#> combined
#> 1 cat dog
#> 2 house mouse street
Another base R solution, using parse + eval
result <- data.frame(new = sapply(paste0(x,":",y),function(v) paste0(text[eval(parse(text = v))],collapse = " ")),
row.names = NULL)
such that
> result
new
1 cat dog
2 house mouse street

Replace second occurrence of a string in one column based on value in other column in R

Here is a sample dataframe:
a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse")
df <- data.frame(a,b)
I'd like to be able to remove the second occurrence of the value in col a in col b.
Here is my desired output:
a b
cat my cat is a tabby and is a friendly cat
dog walk the dog
mouse the mouse is scared of the other
I've tried different combinations of gsub and some stringr functions, but I haven't even gotten close to being able to remove the second (and only the second) occurrence of the string in col a in col b. I think I'm asking something similar to this one, but I'm not familiar with Perl and couldn't translate it to R.
Thanks!
It takes a little work to build the right Regex.
P1 = paste(a, collapse="|")
PAT = paste0("((", P1, ").*?)(\\2)")
sub(PAT, "\\1", b, perl=TRUE)
[1] "my cat is a tabby and is a friendly cat"
[2] "walk the dog"
[3] "the mouse is scared of the other "
I've actually found another solution that, though longer, may be clearer for other regex beginners:
library(stringr)
# Replace first instance of col a in col b with "INTERIM"
df$b <- str_replace(b, a, "INTERIM")
# Now that the original first instance of col a is re-labeled to "INTERIM", I can again replace the first instance of col a in col b, this time with an empty string
df$b <- str_replace(df$b, a, "")
# And I can re-replace the re-labeled "INTERIM" to the original string in col a
df$b <- str_replace(df$b, "INTERIM", a)
# Trim "double" whitespace
df$b <- str_replace(gsub("\\s+", " ", str_trim(df$b)), "B", "b")
df
a b
cat my cat is a tabby and is a friendly cat
dog walk the dog
mouse the mouse is scared of the other
You could do this...
library(stringr)
df$b <- str_replace(df$b,
paste0("(.*?",df$a,".*?) ",df$a),
"\\1")
df
a b
1 cat my cat is a tabby and is a friendly cat
2 dog walk the dog
3 mouse the mouse is scared of the other
The regex finds the first string of characters with df$a somewhere in it, followed by a space and another df$a. The capture group is the text up to the space before the second occurrence (indicated by the (...)), and the whole text (including the second occurrence) is replaced by the capture group \\1 (which has the effect of deleting the second df$a and its preceding space). Anything after the second df$a is not affected.
Base R, split-apply-combine solution:
# Split-apply-combine:
data.frame(do.call("rbind", lapply(split(df, df$a), function(x){
b <- paste(unique(unlist(strsplit(x$b, "\\s+"))), collapse = " ")
return(data.frame(a = x$a, b = b))
}
)
),
stringsAsFactors = FALSE, row.names = NULL
)
Data:
df <- data.frame(a = c("cat", "dog", "mouse"),
b = c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse"),
stringsAsFactors = FALSE)

Remove specific words conditionnally in R

I am trying to remove a list of words in sentences according to specific conditions.
Let's say we have this dataframe :
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)
> df
questions responses
[1,] "The highest mountain in the world" "The Himalaya"
[2,] "A cold war serie from 2013" "The Americans"
[3,] "A kiwi which is not a fruit" "A bird"
[4,] "Widest liquid area on earth" "The Pacific ocean"
And the following list of specific words:
articles <- c("The","A")
geowords <- c("mountain","liquid area")
I would like to do 2 things:
Remove the articles in first position in the responses column when adjacent to a word starting by a lower case letter
Remove the articles in first position in the responses column when (adjacent to a word starting by an upper case letter) AND IF (a geoword is in the corresponding question)
The expected result should be:
questions responses
[1,] "The highest mountain in the world" "Himalaya"
[2,] "A cold war serie from 2013" "The Americans"
[3,] "A kiwi which is not a fruit" "bird"
[4,] "Widest liquid area on earth" "Pacific ocean"
I'll try gsub without success as I'm not familiar at all with regex...
I have searched in Stackoverflow without finding really similar problem. If a R and regex all star could help me, I would be very thankfull!
The same as you mentioned has been written as two logical columns and ifelse is used to validate and gsub:
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- data.frame(cbind(questions,responses), stringsAsFactors = F)
df
articles <- c("The ","A ")
geowords <- c("mountain","liquid area")
df$f_caps <- unlist(lapply(df$responses, function(x) {grepl('[A-Z]',str_split(str_split(x,' ', simplify = T)[2],'',simplify = T)[1])}))
df$geoword_flag <- grepl(paste(geowords,collapse='|'),df[,1])
df$new_responses <- ifelse((df$f_caps & df$geoword_flag) | !df$f_caps,
{gsub(paste(articles,collapse='|'),'', df$responses ) },
df$responses)
df$new_responses
> df$new_responses
[1] "Himalaya" "The Americans" "bird" "Pacific ocean"
I taught myself some R today. I used a function to get the same result.
#!/usr/bin/env Rscript
# References
# https://stackoverflow.com/questions/1699046/for-each-row-in-an-r-dataframe
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)
articles <- c("The","A")
geowords <- c("mountain","liquid area")
common_pattern <- paste( "(?:", paste(articles, "", collapse = "|"), ")", sep = "")
pattern1 <- paste(common_pattern, "([a-z])", sep = "")
pattern2 <- paste(common_pattern, "([A-Z])", sep = "")
geo_pattern <- paste(geowords, collapse = "|")
f <- function (x){
q <- x[1]
r <- x[2]
a1 <- gsub (pattern1, "\\1", r)
if ( grepl(geo_pattern, q)){
a1 <- gsub (pattern2, "\\1", a1)
}
x[1] <- q
x[2] <- a1
}
apply (df, 1, f)
running;
Rscript stacko.R
[1] "Himalaya" "The Americans" "bird" "Pacific ocean"
You may choose to use simple regex with , grepl and gsub as below:
df <- data.frame(cbind(questions,responses), stringsAsFactors = F) #Changing to data frame, since cbind gives a matrix, stringsAsFactors will prevent to not change the columns to factors
regx <- paste0(geowords, collapse="|") # The "or" condition between the geowords
articlegrep <- paste0(articles, collapse="|") # The "or" condition between the articles
df$responses <- ifelse(grepl(regx, df$questions)|grepl(paste0("(",articlegrep,")","\\s[a-z]"), df$responses),
gsub("\\w+ (.*)","\\1",df$responses),df$responses) #The if condition for which replacement has to happen
> print(df)
questions responses
#1 The highest mountain in the world Himalaya
#2 A cold war serie from 2013 The Americans
#3 A kiwi which is not a fruit bird
#4 Widest liquid area on earth Pacific ocean
For the fun, here's a tidyverse solution:
df2 <-
df %>%
as.tibble() %>%
mutate(responses =
#
if_else(str_detect(questions, geowords),
#
str_replace(string = responses,
pattern = regex("\\w+\\b\\s(?=[A-Z])"),
replacement = ""),
#
str_replace(string = responses,
pattern = regex("\\w+\\b\\s(?=[a-z])"),
replacement = ""))
)
Edit: without the "first word" regex, with inspiration from #Calvin Taylor
# Define articles
articles <- c("The", "A")
# Make it a regex alternation
art_or <- paste0(articles, collapse = "|")
# Before a lowercase / uppercase
art_upper <- paste0("(?:", art_or, ")", "\\s", "(?=[A-Z])")
art_lower <- paste0("(?:", art_or, ")", "\\s", "(?=[a-z])")
# Work on df
df4 <-
df %>%
as.tibble() %>%
mutate(responses =
if_else(str_detect(questions, geowords),
str_replace_all(string = responses,
pattern = regex(art_upper),
replacement = ""),
str_replace_all(string = responses,
pattern = regex(art_lower),
replacement = "")
)
)

replacement of words in strings

I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
How can I search a string, a word that matches and replace it?
The expected result is the following example:
a1<- c(" the classroom is ful ")
a2<- c(" full")
In this case I would be replacing ful for full in a1
Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
So even in your example, would you want to replace ful with full, or many of the other options here?
The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.
a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3,
paste("\\b",
hunspell(a3)[[1]],
"\\b",
collapse = "", sep = ""),
hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
Update
Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Update 2
Addressing your comment, with your new example the issue is back to having words showing up in other words. The solutions is to use \\b. This represents a word boundary. Using pattern "thin" it will match to "thin", "think", "thinking", etc. But if you bracket with \\b it anchors the pattern to a word boundary. \\bthin\\b will only match "thin".
Your example:
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
The solution is to modify badwords.corpus
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create the vect.corpus as I describe in the previous update, and use in str_replace_all.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
I think the function you are looking for is gsub():
gsub (pattern = "ful", replacement = a2, x = a1)
Create a list of the corrections then replace them using gsubfn which is a generalization of gsub that can also take list, function and proto object replacement objects. The regular expression matches a word boundary, one or more word characters and another word boundary. Each time it finds a match it looks up the match in the list names and if found replaces it with the corresponding list value.
library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
For a kind of ordered replacement, you can try this
a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")
qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
For unordered replacement you can use an approximate string matching (see stringdist::amatch). Here is an example
a1 <- c("the classroome is ful")
a1
[1] "the classroome is ful"
library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
patt <- paste0('\\b', badword, '\\b')
repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
final.word <- ifelse(is.na(repl), badword, repl)
a1 <- gsub(patt, final.word, a1)
}
a1
[1] "the classroom is full"

Resources