How to split the words in R

I have a list of words in a file, for example: NUT, CHANNEL, DIA, CARBON, STEEL, integrated, packaging, solutions.
I also have a string such as NUTCHANNELDIA 16U NCCARBONSTEEL, and I need to split it on those words, like below:
a= NUTCHANNELDIA 16U NCCARBONSTEEL, integratedpackagingsolutions
a= split words(NUTCHANNELDIA 16U NCCARBONSTEEL,
integratedpackagingsolutions)
a= NUT CHANNEL DIA 16U NC CARBON STEEL
Is there any method for that?

Here is a base R option using strsplit. We can try splitting on the following pattern:
(?<=NUT|CHANNEL|DIA|CARBON|STEEL)|(?<=.)(?=NUT|CHANNEL|DIA|CARBON|STEEL)
This will split if, at any point in the string, what either precedes or follows is one of your keywords. Note that the (?<=.) term is necessary due to the way positive lookaheads in strsplit behave.
terms <- c("NUT", "CHANNEL", "DIA", "CARBON", "STEEL")
regex <- paste(terms, collapse="|")
a <- "NUTCHANNELDIA 16U NCCARBONSTEEL"
strsplit(a, paste0("(?<=", regex, ")|(?<=.)(?=", regex, ")"), perl=TRUE)
[[1]]
[1] "NUT" "CHANNEL" "DIA" " 16U NC" "CARBON" "STEEL"
The 16U NC term has some leading whitespace, which I didn't attempt to remove. If this is a concern, you could either trim whitespace from each term as you consume it, or modify the pattern to do that.
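As a small sketch of the first option, the vector printed above can be cleaned up after the split with trimws(). The variable names here are only illustrative, and the optional second step assumes you also want 16U and NC as separate elements.

```r
# The pieces as returned by the strsplit() call above
parts <- c("NUT", "CHANNEL", "DIA", " 16U NC", "CARBON", "STEEL")

# Trim leading/trailing whitespace from each piece
trimmed <- trimws(parts)

# Optionally split the remaining multi-word pieces on whitespace as well
tokens <- unlist(strsplit(trimmed, "\\s+"))
tokens
# [1] "NUT"     "CHANNEL" "DIA"     "16U"     "NC"      "CARBON"  "STEEL"
```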

This is a very simple approach which might work for you:
word.list <- c("NUT", "CHANNEL", "DIA", "CARBON", "STEEL")
a <- "NUTCHANNELDIA 16U NCCARBONSTEEL"
for (word in word.list) {
a <- gsub(word, paste0(word, " "), a)
}
print(a)
[1] "NUT CHANNEL DIA  16U NCCARBON STEEL "
It is unclear to me whether you just want the string to be more readable or want it actually split up into a vector. In any case, the above should be fairly simple to modify.
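One possible modification, sketched here under the assumption that a keyword should be separated on both sides: pad every keyword with spaces in a single gsub() call, squeeze the extra whitespace, and then split if a vector is wanted. The names regex, spaced, readable and pieces are illustrative only.

```r
word.list <- c("NUT", "CHANNEL", "DIA", "CARBON", "STEEL")
a <- "NUTCHANNELDIA 16U NCCARBONSTEEL"

# Put a space on both sides of every keyword
regex  <- paste(word.list, collapse = "|")
spaced <- gsub(paste0("(", regex, ")"), " \\1 ", a)

# Squeeze runs of whitespace and drop the padding at the ends
readable <- gsub("\\s+", " ", trimws(spaced))
readable
# [1] "NUT CHANNEL DIA 16U NC CARBON STEEL"

# Split into a character vector if that is what is needed
pieces <- strsplit(readable, " ")[[1]]
```

Note that this, like the loop above, also matches a keyword that happens to appear inside a longer word.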

Related

Match and replace misspelled words in a string in R

I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Everything works perfectly until a pattern also matches part of a similar, already correct word, which then gets replaced as well.
For example, if I have a pattern like the following:
"plea", whose correct form is "please", then when I run the code an existing "please" is matched too and ends up as "pleased".
What I am looking for is that a string that is already correct is not modified again when a similar pattern is found.
Perhaps you need to perform progressive replacement, i.e. keep multiple sets of badwords and goodwords. Replace the bad words with more letters first, so that the shorter patterns no longer find a match inside them, and then go on to the shorter ones.
From the list provided by you, I would create 2 sets as:
goodwords1<-c( "three", "teasing")
badwords1<- c("thre", "teeasing")
goodwords2<-c("tree", "testing")
badwords2<- c("tre", "tesing")
First replace with 1st set and then with 2nd set. You can create many such sets.
str_replace_all takes a regex as the pattern, so you can paste0 word boundaries \\b around each bad word so that a replacement is only made if the whole word is matched:
library(stringr)
string <- c("tre", "tree", "teeasing", "tesing")
goodwords <- c("tree", "three", "teasing", "testing")
badwords <- c("tre", "thre", "teeasing", "tesing")
# Paste word boundaries around badwords
badwords <- paste0("\\b", badwords, "\\b")
vect.corpus <- goodwords
names(vect.corpus) <- badwords
str_replace_all(string, vect.corpus)
[1] "tree" "tree" "teasing" "testing"
The advantage of this is that you don't have to keep track of which strings are the longer strings.
This is what badwords looks like after pasting:
> badwords
[1] "\\btre\\b" "\\bthre\\b" "\\bteeasing\\b" "\\btesing\\b"
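For comparison, the same whole-word replacement can be sketched in base R with a loop over gsub(). This is only an illustration of the \\b idea, not a replacement for the stringr call, and the variable name fixed is made up for the example.

```r
string    <- c("tre", "tree", "teeasing", "tesing")
goodwords <- c("tree", "three", "teasing", "testing")
badwords  <- c("tre", "thre", "teeasing", "tesing")

fixed <- string
for (i in seq_along(badwords)) {
  # \\b on both sides means only whole words are replaced
  pattern <- paste0("\\b", badwords[i], "\\b")
  fixed <- gsub(pattern, goodwords[i], fixed, perl = TRUE)
}
fixed
# [1] "tree"    "tree"    "teasing" "testing"
```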

replacement of words in strings

I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
How can I search a string for a matching word and replace it?
The expected result is the following example:
a1<- c(" the classroom is ful ")
a2<- c(" full")
In this case I would be replacing ful with full in a1.
Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
So even in your example, would you want to replace ful with full, or many of the other options here?
The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.
a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3,
paste("\\b",
hunspell(a3)[[1]],
"\\b",
collapse = "", sep = ""),
hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
Update
Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Update 2
Addressing your comment: with your new example, the issue is back to words showing up inside other words. The solution is to use \\b, which represents a word boundary. The pattern "thin" on its own matches "thin", "think", "thinking", etc., but bracketing it with \\b anchors it to word boundaries, so \\bthin\\b only matches the whole word "thin".
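A quick way to see the difference is to test both patterns with grepl() on a few sample words (the words vector is just an illustration):

```r
words <- c("thin", "think", "thinking")

# Without boundaries the pattern matches inside longer words too
plain <- grepl("thin", words)
plain
# [1] TRUE TRUE TRUE

# With \\b on both sides only the whole word matches
bounded <- grepl("\\bthin\\b", words, perl = TRUE)
bounded
# [1]  TRUE FALSE FALSE
```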
Your example:
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
The solution is to modify badwords.corpus
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create the vect.corpus as I describe in the previous update, and use in str_replace_all.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
I think the function you are looking for is gsub():
gsub (pattern = "ful", replacement = a2, x = a1)
Create a list of the corrections then replace them using gsubfn which is a generalization of gsub that can also take list, function and proto object replacement objects. The regular expression matches a word boundary, one or more word characters and another word boundary. Each time it finds a match it looks up the match in the list names and if found replaces it with the corresponding list value.
library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
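If you would rather avoid the extra dependency, the same lookup idea can be sketched in base R with gregexpr() and regmatches<-. This is an approximation of what the gsubfn call does, not its actual implementation, and the names lookup, m, words and hit are made up for the example.

```r
a1 <- " the classroom is ful "
lookup <- c(ful = "full")  # named vector: names are bad words, values are fixes

# Locate every word in the string
m <- gregexpr("\\b\\w+\\b", a1, perl = TRUE)
words <- regmatches(a1, m)[[1]]

# Swap in the correction wherever a word appears among the lookup names
hit <- words %in% names(lookup)
words[hit] <- unname(lookup[words[hit]])

# Write the (possibly corrected) words back into their original positions
regmatches(a1, m) <- list(words)
a1
# [1] " the classroom is full "
```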
For a kind of ordered replacement, you can try this
a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")
qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
For unordered replacement you can use approximate string matching (see stringdist::amatch). Here is an example:
a1 <- c("the classroome is ful")
a1
[1] "the classroome is ful"
library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
patt <- paste0('\\b', badword, '\\b')
repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance see ?amatch
final.word <- ifelse(is.na(repl), badword, repl)
a1 <- gsub(patt, final.word, a1)
}
a1
[1] "the classroom is full"

Removing all words except for words in a vector

It's common to remove stopwords from a text or character vector. I use the function removeWords from the tm package.
However, I'm trying to remove all the words except for stopwords. I have a list of words I made called x. When I use
removeWords(text, x)
I get this error:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), PCRE pattern compilation error 'regular expression is too large'
I've also tried using grep:
grep(x, text)
But that won't work, because x is a vector and not a single character string.
So, how can I remove all the words that aren't in that vector? Or alternatively, how can I select only the words in the vector?
If you want x as a regex pattern for grep, just use x <- paste(x, collapse = "|"), which will allow you to look for those words in text. But keep in mind that the regex might still be too large. If you want to remove any word that is not a stopword(), you can create your own function:
library(tm) # provides stopwords()
keep_stopwords <- function(text) {
stop_regex <- paste(stopwords(), collapse = "\\b|\\b")
stop_regex <- paste("\\b", stop_regex, "\\b", sep = "")
tmp <- strsplit(text, " ")[[1]]
idx <- grepl(stop_regex, tmp)
txt <- paste(tmp[idx], collapse = " ")
return(txt)
}
text = "How much wood would a woodchuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_stopwords(text)
# [1] "would a if a could than most would if could but than other"
Basically, we just set up the stopwords() as a regex that will look for any of those words. But we have to be careful about partial matches, so we wrap each stop word in \\b to ensure it's a full match. Then we split the string so that we match each word individually and create an index of the words that are stop words. Finally, we paste those words together again and return the result as a single string.
Edit
Here's another approach, which is simpler and easier to understand. It also doesn't rely on regular expressions, which can be expensive in large documents.
keep_words <- function(text, keep) {
words <- strsplit(text, " ")[[1]]
txt <- paste(words[words %in% keep], collapse = " ")
return(txt)
}
x <- "How much wood would a woodchuck chuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_words(x, stopwords())
# [1] "would a if a could than most would if could but than other"
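Both versions above split on single spaces and compare case-sensitively, so tokens such as "How" or "wood," never match an entry in the keep list. A possible variant, assuming you are happy to lowercase the text and treat anything other than a-z and apostrophes as a separator, could look like this. keep_words_clean is a hypothetical name, and the keep vector is a tiny stand-in for stopwords().

```r
keep_words_clean <- function(text, keep) {
  # Lowercase, then split on any run of characters that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # Drop the empty string produced when the text starts with a separator
  words <- words[nzchar(words)]
  paste(words[words %in% keep], collapse = " ")
}

kept <- keep_words_clean("How much wood, if any?", keep = c("how", "if", "any"))
kept
# [1] "how if any"
```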

Consecutive string matching in a sentence using R

I have the names of 7 countries stored in a vector like this:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Now I have to find out, using R, whether a given sentence contains these words.
Sometimes the name of a country is hidden in consecutive letters spanning several words of a sentence.
For example:
You all must pay it bac**k, or ea**ch of you will be in trouble.
If this sentence is passed it should return "korea"
I have tried:
grep('You|all|must|pay|it|back|or|each|of|you|will|be|in|trouble',Random, value = TRUE,ignore.case=TRUE,
fixed = FALSE)
It should return korea,
but it's not working. Perhaps I should not use partial matching, but I don't have much knowledge of it.
Any help is appreciated.
You can use the handy stringr library for this. First, remove all the punctuation and spaces from your sentence that we want to match.
> library(stringr)
> txt <- "You all must pay it back, or each of you will be in trouble."
> g <- gsub("[^a-z]", "", tolower(txt))
> g
# [1] "youallmustpayitbackoreachofyouwillbeintrouble"
Then we can use str_detect to find the matches.
> Random[str_detect(g, Random)]
# [1] "korea"
Basically you're just looking for a sub-string within a sentence, so collapsing the sentence first seems like a good way to go. Alternatively, you could use str_locate with str_sub to find the relevant sub-strings.
> no <- na.omit(str_locate(g, Random))
> str_sub(g, no[,1], no[,2])
# [1] "korea"
Edit: Here's one more I came up with:
> Random[Vectorize(grepl)(Random, g)]
# [1] "korea"
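If it also matters where the hidden name sits, a base R sketch along the same lines can report the start position of each name in the collapsed string. The names pos and found are illustrative, and the positions refer to the collapsed string g, not to the original sentence.

```r
Random <- c('norway', 'india', 'china', 'korea', 'france', 'japan', 'iran')
txt <- "You all must pay it back, or each of you will be in trouble."

# Collapse the sentence to lowercase letters only
g <- gsub("[^a-z]", "", tolower(txt))

# Start position of each name in g, or -1 when the name does not occur
pos <- vapply(Random, function(w) as.integer(regexpr(w, g, fixed = TRUE)), integer(1))

# Keep only the names that were found
found <- pos[pos > 0]
found
# korea
#    19
```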
Using base functions only:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Random2=paste(Random,collapse="|") #creating pattern for match
text="bac**k, or ea**ch of you will be in trouble."
text2=gsub("[[:punct:][:space:]]","",text,perl=T) #removing punctuations and space characters
regmatches(text2,gregexpr(Random2,text2))
[[1]]
[1] "korea"
You could use stringi which is faster for these operations
library(stringi)
Random[stri_detect_regex(gsub("[^A-Za-z]", "", txt), Random)]
#[1] "korea"
#data
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
txt <- "You all must pay it back, or each of you will be in trouble."
Try:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
txt <- "You all must pay it back, or each of you will be in trouble."
tt <- gsub("[[:punct:]]|\\s+", "", txt)
unlist(sapply(Random, function(r) grep(r, tt)))
korea
1

Simple Comparing of two texts in R

I want to compare two texts for similarity, so I need a simple function that lists, clearly and in order of appearance, the words and phrases occurring in both texts. These words/sentences should be highlighted or underlined for better visualization.
Building on @Joris Meys' ideas, I added a vector of delimiters to divide the text into sentences and subordinate clauses.
This is what it looks like:
textparts <- function (text){
textparts <- c("\\,", "\\.")
i <- 1
while(i<=length(textparts)){
text <- unlist(strsplit(text, textparts[i]))
i <- i+1
}
return (text)
}
textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")
commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
textparts1 <- gsub(x, "\\1*", textparts1,ignore.case=TRUE)
textparts2 <- gsub(x, "\\1*", textparts2,ignore.case=TRUE)
}
return(list(textparts1,textparts2))
However, sometimes it works, sometimes it doesn't.
I WOULD like to have results like these:
> return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence" " whereas this is a dependent clause*" " This thing works*"
[[2]]
[1] "This could be a sentence" " whereas this is a dependent clause*" " Plagiarism is not cool" " This thing works*"
whereas I get no results.
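A minimal sketch that produces output in the shape shown above, assuming that a "common" part means an exact match (including leading whitespace and capitalization) and that parts are delimited by commas and periods. The function names split_parts and mark_common_parts are hypothetical. Comparing whole parts with %in% avoids building regular expressions out of the text itself.

```r
# Split a text into parts at every comma or period
split_parts <- function(text) unlist(strsplit(text, "[,.]"))

mark_common_parts <- function(text1, text2, marker = "*") {
  parts1 <- split_parts(text1)
  parts2 <- split_parts(text2)
  common <- intersect(parts1, parts2)
  # Append the marker to every part that occurs in both texts
  mark <- function(p) ifelse(p %in% common, paste0(p, marker), p)
  list(mark(parts1), mark(parts2))
}

res <- mark_common_parts(
  "This is a complete sentence, whereas this is a dependent clause. This thing works.",
  "This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works."
)
res
```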
There are some problems with the answer of @Chase:
differences in capitalization are not taken into account
punctuation can mess up the results
if more than one word is shared, you get a lot of warnings from the gsub call.
Based on his idea, the following solution makes use of tolower() and some nice features of regular expressions:
compareSentences <- function(sentence1, sentence2) {
# split everything on "not a word" and put all to lowercase
x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
x2 <- tolower(unlist(strsplit(sentence2, "\\W")))
commonWords <- intersect(x1, x2)
#add word beginning and ending and put words between ()
# to allow for match referencing in gsub
commonWords <- paste("\\<(",commonWords,")\\>",sep="")
for(x in commonWords){
# replace the match by the match with star added
sentence1 <- gsub(x, "\\1*", sentence1,ignore.case=TRUE)
sentence2 <- gsub(x, "\\1*", sentence2,ignore.case=TRUE)
}
return(list(sentence1,sentence2))
}
This gives following result :
text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "
compareSentences(text1,text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"
[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
I am sure that there are far more robust functions on the natural language processing page, but here's one solution using intersect() to find the common words. The approach is to read in the two sentences, identify the common words and gsub() them with a combination of the word and a moniker of our choice. Here I chose to use *, but you could easily change that, or add something else.
sent1 <- "I shot the sheriff."
sent2 <- "Dick Cheney shot a man."
compareSentences <- function(sentence1, sentence2) {
sentence1 <- unlist(strsplit(sentence1, " "))
sentence2 <- unlist(strsplit(sentence2, " "))
commonWords <- intersect(sentence1, sentence2)
return(list(
sentence1 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence1), collapse = " ")
, sentence2 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence2), collapse = " ")
))
}
> compareSentences(sent1, sent2)
$sentence1
[1] "I shot* the sheriff."
$sentence2
[1] "Dick Cheney shot* a man."
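When the two sentences share more than one word, gsub() receives a vector as its pattern and only uses the first element (with a warning). One way around that, sketched here with a hypothetical function name markCommonWords, is to test each word against the set of common words with %in% instead. Words are still split on single spaces, so "sheriff." and "sheriff" would not count as the same word.

```r
markCommonWords <- function(sentence1, sentence2, marker = "*") {
  words1 <- unlist(strsplit(sentence1, " "))
  words2 <- unlist(strsplit(sentence2, " "))
  common <- intersect(words1, words2)
  # Append the marker to each word that appears in both sentences
  tag <- function(w) paste(ifelse(w %in% common, paste0(w, marker), w), collapse = " ")
  list(sentence1 = tag(words1), sentence2 = tag(words2))
}

marked <- markCommonWords("I shot the sheriff.", "Dick Cheney shot a man.")
marked
# $sentence1
# [1] "I shot* the sheriff."
# $sentence2
# [1] "Dick Cheney shot* a man."
```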
