Simple Comparing of two texts in R - r

I want to compare two texts to similarity, therefore i need a simple function to list clearly and chronologically the words and phrases occurring in both texts. these words/sentences should be highlighted or underlined for better visualization)
on the base of #joris Meys ideas, i added an array to divide text into sentences and subordinate sentences.
this is how it looks like:
textparts <- function (text){
textparts <- c("\\,", "\\.")
i <- 1
while(i<=length(textparts)){
text <- unlist(strsplit(text, textparts[i]))
i <- i+1
}
return (text)
}
textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")
commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
textparts1 <- gsub(x, "\\1*", textparts1,ignore.case=TRUE)
textparts2 <- gsub(x, "\\1*", textparts2,ignore.case=TRUE)
}
return(list(textparts1,textparts2))
However, sometimes it works, sometimes it doesn't.
I WOULD like to have results like these:
> return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence" " whereas this is a dependent clause*" " This thing works*"
[[2]]
[1] "This could be a sentence" " whereas this is a dependent clause*" " Plagiarism is not cool" " This thing works*"
whereas i get none results.

There are some problems with the answer of #Chase :
differences in capitalization are not taken into account
interpunction can mess up results
if there is more than one word similar, then you get a lot of warnings due to the gsub call.
Based on his idea, there is the following solution that makes use of tolower() and some nice functionalities of regular expressions :
compareSentences <- function(sentence1, sentence2) {
# split everything on "not a word" and put all to lowercase
x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
x2 <- tolower(unlist(strsplit(sentence2, "\\W")))
commonWords <- intersect(x1, x2)
#add word beginning and ending and put words between ()
# to allow for match referencing in gsub
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
# replace the match by the match with star added
sentence1 <- gsub(x, "\\1*", sentence1,ignore.case=TRUE)
sentence2 <- gsub(x, "\\1*", sentence2,ignore.case=TRUE)
}
return(list(sentence1,sentence2))
}
This gives following result :
text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "
compareSentences(text1,text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"
[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "

I am sure that there are far more robust functions on the natural language processing page, but here's one solution using intersect() to find the common words. The approach is to read in the two sentences, identify the common words and gsub() them with a combination of the word and a moniker of our choice. Here I chose to use *, but you could easily change that, or add something else.
sent1 <- "I shot the sheriff."
sent2 <- "Dick Cheney shot a man."
compareSentences <- function(sentence1, sentence2) {
sentence1 <- unlist(strsplit(sentence1, " "))
sentence2 <- unlist(strsplit(sentence2, " "))
commonWords <- intersect(sentence1, sentence2)
return(list(
sentence1 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence1), collapse = " ")
, sentence2 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence2), collapse = " ")
))
}
> compareSentences(sent1, sent2)
$sentence1
[1] "I shot* the sheriff."
$sentence2
[1] "Dick Cheney shot* a man."

Related

how to split the words in R

I have a list of words in a file. For example they are NUT, CHANNEL, DIA, CARBON, STEEL , integrated, packaging, solutions
Now I have a sentence that says NUTCHANNELDIA 16U NCCARBONSTEEL. Now I need to split this output like below
a= NUTCHANNELDIA 16U NCCARBONSTEEL, integratedpackagingsolutions
a= split words(NUTCHANNELDIA 16U NCCARBONSTEEL,
integratedpackagingsolutions)
a= NUT CHANNEL DIA 16U NC CARBON STEEL
Is there any method for that
Here is a base R option using strsplit. We can try splitting on the following pattern:
(?<=NUT|CHANNEL|DIA|CARBON|STEEL)|(?<=.)(?=NUT|CHANNEL|DIA|CARBON|STEEL)
This will split if, at any point in the string, what either precedes or follows is one of your keywords. Note that the (?<=.) term is necessary due to the way positive lookaheads in strsplit behave.
terms <- c("NUT", "CHANNEL", "DIA", "CARBON", "STEEL")
regex <- paste(terms, collapse="|")
a <- "NUTCHANNELDIA 16U NCCARBONSTEEL"
strsplit(a, paste0("(?<=", regex, ")|(?<=.)(?=", regex, ")"), perl=TRUE)
[[1]]
[1] "NUT" "CHANNEL" "DIA" " 16U NC" "CARBON" "STEEL"
Demo
The 16U NC term has some leading whitespace which I didn't attempt to remove. If this be a concern of yours, you could either trim whitespace on each term as you consume it, or we could try to modify the pattern to do that.
This is a very simple approach which might work for you:
word.list <- c("NUT", "CHANNEL", "DIA", "CARBON", "STEEL")
a <- "NUTCHANNELDIA 16U NCCARBONSTEEL"
for (word in word.list) {
a <- gsub(word, paste0(word, " "), a)
}
print(a)
[1] "NUT CHANNEL DIA 16U NCCARBON STEEL "
It is unclear to me, if you just want the string to be more readable, or to have it actually split up into a vector. In any case, the above should be fairly simple to modify.

replacement of words in strings

I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
How can I search a string, a word that matches and replace it?
The expected result is the following example:
a1<- c(" the classroom is ful ")
a2<- c(" full")
In this case I would be replacing ful for full in a1
Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
So even in your example, would you want to replace ful with full, or many of the other options here?
The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.
a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3,
paste("\\b",
hunspell(a3)[[1]],
"\\b",
collapse = "", sep = ""),
hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
Update
Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Update 2
Addressing your comment, with your new example the issue is back to having words showing up in other words. The solutions is to use \\b. This represents a word boundary. Using pattern "thin" it will match to "thin", "think", "thinking", etc. But if you bracket with \\b it anchors the pattern to a word boundary. \\bthin\\b will only match "thin".
Your example:
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
The solution is to modify badwords.corpus
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create the vect.corpus as I describe in the previous update, and use in str_replace_all.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
I think the function you are looking for is gsub():
gsub (pattern = "ful", replacement = a2, x = a1)
Create a list of the corrections then replace them using gsubfn which is a generalization of gsub that can also take list, function and proto object replacement objects. The regular expression matches a word boundary, one or more word characters and another word boundary. Each time it finds a match it looks up the match in the list names and if found replaces it with the corresponding list value.
library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
For a kind of ordered replacement, you can try this
a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")
qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
For unordered replacement you can use an approximate string matching (see stringdist::amatch). Here is an example
a1 <- c("the classroome is ful")
a1
[1] "the classroome is ful"
library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
patt <- paste0('\\b', badword, '\\b')
repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
final.word <- ifelse(is.na(repl), badword, repl)
a1 <- gsub(patt, final.word, a1)
}
a1
[1] "the classroom is full"

Replace string in R with patterns and replacements both vectors

Let's say I have two vectors like so:
a <- c("this", "is", "test")
b <- c("that", "was", "boy")
I also have a string variable like so:
string <- "this is a story about a test"
I want to replace values in string so that it becomes the following:
string <- "that was a story about a boy"
I could do this using a for loop but I want this to be vectorized. How should I do this?
If you're open to using a non-base package, stringi will work really well here:
stringi::stri_replace_all_fixed(string, a, b, vectorize_all = FALSE)
#[1] "that was a story about a boy"
Note that this also works the same way for input strings of length > 1.
To be on the safe side, you can adapt this - similar to RUser's answer - to check for word boundaries before replacing:
stri_replace_all_regex(string, paste0("\\b", a, "\\b"), b, vectorize_all = FALSE)
This would ensure that you don't accidentally replace his with hwas, for example.
Here are some solutions. They each will work even if string is a character vector of strings in which case substitutions will be done on each component of it.
1) Reduce This uses no packages.
Reduce(function(x, i) gsub(paste0("\\b", a[i], "\\b"), b[i], x), seq_along(a), string)
## [1] "that was a story about a boy"
2) gsubfn gsubfn is like gsub but the replacement argument can be a list of substitutions (or certain other objects).
library(gsubfn)
gsubfn("\\w+", setNames(as.list(b), a), string)
## [1] "that was a story about a boy"
3) loop This isn't vectorized but have added for comparison. No packages are used.
out <- string
for(i in seq_along(a)) out <- gsub(paste0("\\b", a[i], "\\b"), b[i], out)
out
## [1] "that was a story about a boy"
Note: There is some question of whether cycles are possible. For example, if
a <- c("a", "A")
b <- rev(a)
do we want
"a" to be replaced with "A" and then back to "a" again, or
"a" and "A" to be swapped.
All the solutions shown above assume the first case. If we wanted the second case then perform the operation twice. We will illustrate with (2) because it is the shortest but the same idea applies to them all:
# swap "a" and "A"
a <- c("a", "A")
b <- rev(a)
tmp <- gsubfn("\\w+", setNames(as.list(seq_along(a)), a), string)
gsubfn("\\w+", setNames(as.list(b), seq_along(a)), tmp)
## [1] "this is A story about A test"
> library(stringi)
> stri_replace_all_regex(string, "\\b" %s+% a %s+% "\\b", b, vectorize_all=FALSE)
#[1] "that was a story about a boy"
Chipping in as well with a little function that relies only on R base:
repWords <- function(string,toRep,Rep,sep='\\s'){
wrds <- unlist(strsplit(string,sep))
ix <- match(toRep,wrds)
wrds[ix] <- Rep
return(paste0(wrds,collapse = ' '))
}
a <- c("this", "is", "test")
b <- c("that", "was", "boy")
string <- "this is a story about a test"
> repWords(string,a,b)
[1] "that was a story about a boy"
Note:
This assumes you have a matching number of replacements. You can define the separator with sep.
Talking of external packages, here's another one:
a <- c("this", "is", "test")
b <- c("that", "was", "boy")
x <- "this is a story about a test"
library(qdap)
mgsub(a,b,x)
which gives:
"that was a story about a boy"

how to remove duplicate words in a certain pattern from a string in R

I aim to remove duplicate words only in parentheses from string sets.
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
What I want to get is just like this
a
[1]'I (have|has) certain (words|word|worded) certain'
[2]'(You|Youre) (can|cans) do this (works|worked)'
[3]'I (am|are) pretty (sure|surely) you know (what|when) (you|her) should (do|)'
In order to get the result, I used a code like this
a = gsub('\\|', " | ", a)
a = gsub('\\(', "( ", a)
a = gsub('\\)', " )", a)
a = vapply(strsplit(a, " "), function(x) paste(unique(x), collapse = " "), character(1L))
However, it resulted in undesirable outputs.
a
[1] "I ( have | has ) certain words word worded"
[2] "( You | Youre ) can cans do this works worked"
[3] "I ( am | are ) sure surely you know what when her should do"
Why did my code remove parentheses located in the latter part of strings?
What should I do for the result I want?
We can use gsubfn. Here, the idea is to select the characters inside the brackets by matching the opening bracket (\\( have to escape the bracket as it is a metacharacter) followed by one or more characters that are not a closing bracket ([^)]+), capture it as a group within the brackets. In the replacement, we split the group of characters (x) with strsplit, unlist the list output, get the unique elements and paste it together
library(gsubfn)
gsubfn("\\(([^)]+)", ~paste0("(", paste(unique(unlist(strsplit(x,
"[|]"))), collapse="|")), a)
#[1] "I (have|has) certain (words|word|worded) certain"
#[2] "(You|Youre) (can|cans) do this (works|worked)"
#[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
Take the answer above. This is more straightforward, but you can also try:
library(stringi)
library(stringr)
a_new <- gsub("[|]","-",a) # replace this | due to some issus during the replacement later
a1 <- str_extract_all(a_new,"[(](.*?)[)]") # extract the "units"
# some magic using stringi::stri_extract_all_words()
a2 <- unlist(lapply(a1,function(x) unlist(lapply(stri_extract_all_words(x), function(y) paste(unique(y),collapse = "|")))))
# prepare replacement
names(a2) <- unlist(a1)
# replacement and finalization
str_replace_all(a_new, a2)
[1] "I (have|has) certain (words|word|worded) certain"
[2] "(You|Youre) (can|cans) do this (works|worked)"
[3] "I (am|are) (sure|surely) you know (what|when) (you|her) should (do)"
The idea is to extract the words within the brackets as unit. Then remove the duplicates and replace the old unit with the updated.
a longer but more elaborate try
a = c( 'I (have|has|have) certain (words|word|worded|word) certain',
'(You|You|Youre) (can|cans|can) do this (works|works|worked)',
'I (am|are|am) (sure|sure|surely) you know (what|when|what) (you|her|you) should (do|do)' )
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# blank output
new_a <- c()
for (sentence in 1:length(a)) {
split <- trim(unlist(strsplit(a[sentence],"[( )]")))
newsentence <- c()
for (i in split) {
j1 <- as.character(unique(trim(unlist(strsplit(gsub('\\|'," ",i)," ")))))
if( length(j1)==0) {
next
} else {
ifelse(length(j1)>1,
newsentence <- c(newsentence,paste("(",paste(j1,collapse="|"),")",sep="")),
newsentence <- c(newsentence,j1[1]))
}
}
newsentence <- paste(newsentence,collapse=" ")
print(newsentence)
new_a <- c(new_a,newsentence)}
# [1] "I (have|has) certain (words|word|worded) certain"
# [2] "(You|Youre) (can|cans) do this (works|worked)"
# [3] "I (am|are) (sure|surely) you know (what|when) (you|her) should do"

Replacing strings in R

I am trying to replace strings in R in a large number of texts.
Essentially, this reproduces the format of the data from which I try to delete the '\n' parts.
document <- as.list(c("This is \\na try-out", "And it \\nfails"))
I can do this with a loop and gsub but it takes forever. I looked at this post for a solution. So I tried: temp <- apply(document, 2, function(x) gsub("\\n", " ", fixed=TRUE)). I also used lapply, but it also gives an error message. I can't figure this out, help!
use lapply if you want to return a list
document <- as.list(c("This is \\na try-out", "And it \\nfails"))
temp <- lapply(document, function(x) gsub("\\n", " ", x, fixed=TRUE))
##[[1]]
##[1] "This is a try-out"
##[[2]]
##[1] "And it fails"

Resources