Consecutive string matching in a sentence using R

Consecutive string matching in a sentence using R - r

I have names of some 7 countries which is stored somewhere like:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Now, I have to find out using r if a given sentence has these words.
Sometimes the name of a country is hiding in the consecutive letters within a sentence.
for ex:
You all must pay it bac**k, or ea**ch of you will be in trouble.
If this sentence is passed it should return "korea"
I have tried:
grep('You|all|must|pay|it|back|or|each|of|you|will|be|in|trouble',Random, value = TRUE,ignore.case=TRUE,
fixed = FALSE)
it should return korea
but it's not working. Perhaps I should not use Partial Matching, but i dont have much knowledge regarding it.
Any help is appreciated.

You can use the handy stringr library for this. First, remove all the punctuation and spaces from your sentence that we want to match.
> library(stringr)
> txt <- "You all must pay it back, or each of you will be in trouble."
> g <- gsub("[^a-z]", "", tolower(txt))
# [1] "Youallmustpayitbackoreachofyouwillbeintrouble"
Then we can use str_detect to find the matches.
> Random[str_detect(g, Random)]
# [1] "korea"
Basically you're just looking for a sub-string within a sentence, so collapsing the sentence first seems like a good way to go. Alternatively, you could use str_locate with str_sub to find the relevant sub-strings.
> no <- na.omit(str_locate(g, Random))
> str_sub(g, no[,1], no[,2])
# [1] "korea"
Edit Here's one more I came up with
> Random[Vectorize(grepl)(Random, g)]
# [1] "korea"

Using base functions only:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Random2=paste(Random,collapse="|") #creating pattern for match
text="bac**k, or ea**ch of you will be in trouble."
text2=gsub("[[:punct:][:space:]]","",text,perl=T) #removing punctuations and space characters
regmatches(text2,gregexpr(Random2,text2))
[[1]]
[1] "korea"

You could use stringi which is faster for these operations
library(stringi)
Random[stri_detect_regex(gsub("[^A-Za-z]", "", txt), Random)]
#[1] "korea"
#data
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
txt <- "You all must pay it back, or each of you will be in trouble."

Try:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
txt <- "You all must pay it back, or each of you will be in trouble."
tt <- gsub("[[:punct:]]|\\s+", "", txt)
unlist(sapply(Random, function(r) grep(r, tt)))
korea
1

Related

How to randomly reshuffle letters in words

I am trying to make a word scrambler in R. So i have put some words in a collection and tried to use strsplit() to split the letters of each word in the collection.
But I don't understand how to jumble the letters of a word and merge them to one word in R Tool. Does anyone know how can I solve this?
This is what I have done
enter image description here

Once you've split the words, you can use sample() to rescramble the letters, and then paste0() with collapse="", to concatenate back into a 'word'
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))

You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"

Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlistto get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")

You can use the sample function for this.
here is an example of doing it for a single word. You can use this within your for-loop:
yourword <- "hello"
# split: Split will return a list with one char vector in it.
# We only want to interact with the vector not the list, so we extract the first
# (and only) element with "[[1]]"
jumble <- strsplit(yourword,"")[[1]]
jumble <- sample(jumble, # sample random element from jumble
size = length(jumble), # as many times as the length of jumble
# ergo all Letters
replace = FALSE # do not sample an element multiple times
)
restored <- paste0(jumble,
collapse = "" # bas
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps the understanding of what R is actually doing here.

R count the number of words starts with given letter in a phrase

i would like to get the count times that in a given string a word start with the letter given.
For example, in that phrase: "that pattern is great but pigs likes milk"
if i want to find the number of words starting with "g" there is only 1 "great", but right now i get 2 "great" and "pigs".
this is the code i use:
x <- "that pattern is great but pogintless"
sapply(regmatches(x, gregexpr("g", x)), length)

We need either a space or word boundary to avoid th letter from matching to characters other than the start of the word. In addition, it may be better to use ignore.case = TRUE as some words may begin with uppercase
lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))
The above can be wrapped as a function
fLength <- function(str1, pat){
lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}
fLength(x, "g")
#[1] 1

You can also do it with stringr library
library(stringr)
str_count(str_split(x," "),"\\bg")

Match and replace misspelled words in a string in R

I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
everything works perfectly, until it finds a similar string, and replaces it with another word
if I have a pattern like the following:
"plea", the correct one is "please", but when I execute it removes it and replaces it with "pleased".
What I am looking for is that if a string is already correct, it is no longer modified, in case it finds a similar pattern.

Perhaps you need to perform progressive replace. e.g. you should have multiple set of badwords and goodwords. First replace with badwords having more letters so that matching pattern is not found and then got go for smaller ones.
From the list provided by you, I would create 2 sets as:
goodwords1<-c( "three", "teasing")
badwords1<- c("thre", "teeasing")
goodwords2<-c("tree", "testing")
badwords2<- c("tre", "tesing")
First replace with 1st set and then with 2nd set. You can create many such sets.

str_replace_all takes regex as the pattern, so you can paste0 word boundaries \\b around each badwords so that a replacement will only be made if the whole word is matched:
library(stringr)
string <- c("tre", "tree", "teeasing", "tesing")
goodwords <- c("tree", "three", "teasing", "testing")
badwords <- c("tre", "thre", "teeasing", "tesing")
# Paste word boundaries around badwords
badwords <- paste0("\\b", badwords, "\\b")
vect.corpus <- goodwords
names(vect.corpus) <- badwords
str_replace_all(string, vect.corpus)
[1] "tree" "tree" "teasing" "testing"
The advantage of this is that you don't have to keep track of which strings are the longer strings.
This is what badwords looks like after pasting:
> badwords
[1] "\\btre\\b" "\\bthre\\b" "\\bteeasing\\b" "\\btesing\\b"

How to return the index of a vector that contains at least a string in another vector in R

I have a list containing verbs. I have another list containing sentences. How do I return the index of the sentence list that contains at least a verb in the verb list for that sentence?
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
I want it to return indexes 1, 3, and 4

Using no additional packages, we can sort of "or" different search terms together using | as follows:
Original question:
verbList <- list("punching, kicking, jumping, hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
v <- gsub(", ", "|", verbList)
grep(v, sentenceList)
New question:
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
v <- paste(verbList, collapse = "|")
grep(v, sentenceList)

A solution from stringr and rebus. We can first split the string, and then use str_which to check if the pattern is in the vector to return the index.
library(stringr)
library(rebus)
# Check the index
result <- str_which(sentenceList, or1(verbList))
result
# [1] 1 3 4

Extracting Headers from a list [duplicate]

I have a character string and what to extract the information inside of multiple parentheses. Currently I can extract the information from the last parenthesis with the code below. How would I do it so it extracts multiple parentheses and returns as a vector?
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
sub("\\).*", "", sub(".*\\(", "", j))
Current output is:
[1] "Laugh"
Desired output is:
[1] "wonder" "groan" "Laugh"

Here is an example:
> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan" "Laugh"
I think this should work well:
> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)" "(Laugh)"
but the results includes parenthesis... why?
This works:
regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=T))[[1]]
Thanks #MartinMorgan for the comment.

Using the stringr package we can reduce this a little bit.
library(stringr)
# Get the parenthesis and what is inside
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove parenthesis
k <- substring(k, 2, nchar(k)-1)
#kohske uses regmatches but I'm currently using 2.13 so don't have access to that function at the moment. This adds the dependency on stringr but I think it is a little easier to work with and the code is a little clearer (well... as clear as using regular expressions can be...)
Edit: We could also try something like this -
re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])
This one works by defining a marked subexpression inside the regular expression. It extracts everything that matches the regex and then gsub extracts only the portion inside the subexpression.

I think there are basically three easy ways of extracting multiple capture groups in R (without using substitution); str_match_all, str_extract_all, and regmatches/gregexpr combo.
I like #kohske's regex, which looks behind for an open parenthesis ?<=\\(, looks ahead for a closing parenthesis ?=\\), and grabs everything in the middle (lazily) .+?, in other words (?<=\\().+?(?=\\))
Using the same regex:
str_match_all returns the answer as a matrix.
str_match_all(j, "(?<=\\().+?(?=\\))")
[,1]
[1,] "wonder"
[2,] "groan"
[3,] "Laugh"
# Subset the matrix like this....
str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan" "Laugh"
str_extract_all returns the answer as a list.
str_extract_all(j, "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
str_extract_all(j, "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan" "Laugh"
regmatches/gregexpr also returns the answer as a list. Since this is a base R option, some people prefer it. Note the recommended perl = TRUE.
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan" "Laugh"
Hopefully, the SO community will correct/edit this answer if I've mischaracterized the most popular options.

Using rex may make this type of task a little simpler.
matches <- re_matches(j,
rex(
"(",
capture(name = "text", except_any_of(")")),
")"),
global = TRUE)
matches[[1]]$text
#>[1] "wonder" "groan" "Laugh"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Consecutive string matching in a sentence using R - r

You could use stringi which is faster for these operations library(stringi) Random[stri_detect_regex(gsub("[^A-Za-z]", "", txt), Random)] #[1] "korea" #data Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran') txt <- "You all must pay it back, or each of you will be in trouble."

Try: Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran') txt <- "You all must pay it back, or each of you will be in trouble." tt <- gsub("[[:punct:]]|\\s+", "", txt) unlist(sapply(Random, function(r) grep(r, tt))) korea 1

Related

How to randomly reshuffle letters in words

R count the number of words starts with given letter in a phrase

Match and replace misspelled words in a string in R

How to return the index of a vector that contains at least a string in another vector in R

Extracting Headers from a list [duplicate]

Categories

Resources