Extract First string in the sentence - r

I have a sentence from which I need to extract the first even word, that is, the first word with an even number of characters. For example:
df <- ("This is not the sentence")
For the above sentence, I need "This" to be extracted because it is the first even word
Another example is
df <- ("She is not going anywhere")
For the above sentence, I need "is" to be extracted because it is the first even word

We can write a function to do this. We split the string on whitespace, count the number of characters in each word, and return the first word whose length is even.
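extract_first_even_word <- function(text) {
  # Split the single input string on runs of whitespace
  all_words <- strsplit(text, "\\s+")[[1]]
  # which() lists the even-length positions; [1] is NA when there are none,
  # whereas which.max() would wrongly fall back to the first word
  all_words[which(nchar(all_words) %% 2 == 0)[1]]
}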
extract_first_even_word("This is not the sentence")
#[1] "This"
extract_first_even_word("She is not going anywhere")
#[1] "is"
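The function above handles one string at a time. As a minimal sketch, assuming the input is a character vector and that NA is an acceptable result when no word has an even length, the same idea can be applied element by element (the name first_even_word is hypothetical):

```r
# Hypothetical helper: first even-length word of each element of `text`
first_even_word <- function(text) {
  vapply(strsplit(text, "\\s+"), function(words) {
    # Drop the empty piece that leading whitespace would produce
    words <- words[nzchar(words)]
    even <- words[nchar(words) %% 2 == 0]
    # NA when no word has an even number of characters
    if (length(even) > 0) even[1] else NA_character_
  }, character(1))
}

first_even_word(c("This is not the sentence",
                  "She is not going anywhere",
                  "one two"))
# [1] "This" "is"   NA
```

vapply with character(1) guarantees exactly one result per input string, which makes the function safe to use on a data frame column.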

Related

How to extract first 2 words from a string in R?

I need to extract first 2 words from a string. If the string contains more than 2 words, it should return the first 2 words else if the string contains less than 2 words it should return the string as it is.
I've tried using 'word' function from stringr package but it's not giving the desired output for cases where len(string) < 2.
word(dt$var_containing_strings, 1,2, sep=" ")
Example:
Input String: Auto Loan (Personal)
Output: Auto Loan
Input String: Others
Output: Others
If you want to use stringr::word(), you can do:
ifelse(is.na(word(x, 1, 2)), x, word(x, 1, 2))
[1] "Auto Loan" "Others"
Sample data:
x <- c("Auto Loan (Personal)", "Others")
Something like this?
a <- "this is a character string"
unlist(strsplit(a, " "))[1:2]
[1] "this" "is"
EDIT:
To add the part where the original string is returned if the number of words is less than 2, a simple if-else can be used:
a <- "this is a character string"
words <- unlist(strsplit(a, " "))
if (length(words) > 2) {
words[1:2]
} else {
a
}
You could use regex in base R using sub
sub("(\\w+\\s+\\w+).*", "\\1", "Auto Loan (Personal)")
#[1] "Auto Loan"
This also works when the text has only one word: the pattern does not match, so sub returns the input unchanged.
sub("(\\w+\\s+\\w+).*", "\\1", "Auto")
#[1] "Auto"
Explanation:
The pattern inside the round brackets, (\\w+\\s+\\w+), is a capturing group:
\\w+ matches one word, \\s+ the whitespace after it, and the second \\w+ another word, so the group captures two words in total. The trailing .* matches the rest of the string, and the backreference \\1 in sub replaces the whole match with just the captured group.
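For a whole vector of strings, the splitting approach can be wrapped in a small helper. This is only a sketch: first_n_words is a hypothetical name, and it assumes the strings contain no NA values and no leading whitespace.

```r
# Hypothetical helper: keep at most the first n whitespace-separated words
first_n_words <- function(x, n = 2) {
  vapply(strsplit(x, "\\s+"), function(words) {
    # seq_len(min(...)) avoids NA padding when there are fewer than n words
    keep <- words[seq_len(min(n, length(words)))]
    paste(keep, collapse = " ")
  }, character(1))
}

first_n_words(c("Auto Loan (Personal)", "Others"))
# [1] "Auto Loan" "Others"
```

Unlike the \\w+ regex, this treats any run of non-whitespace characters as a word, so punctuation inside the first two words is kept.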

R count the number of words starts with given letter in a phrase

I would like to count how many words in a given string start with a given letter.
For example, in the phrase "that pattern is great but pigs likes milk",
if I want the number of words starting with "g", there is only 1 ("great"), but right now I get 2 ("great" and "pigs").
This is the code I use:
x <- "that pattern is great but pogintless"
sapply(regmatches(x, gregexpr("g", x)), length)
We need either a space or a word boundary to keep the letter from matching anywhere other than the start of a word. In addition, it may be better to use ignore.case = TRUE, as some words may begin with an uppercase letter.
lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))
The above can be wrapped as a function
fLength <- function(str1, pat){
lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}
fLength(x, "g")
#[1] 1
You can also do it with the stringr library. str_count works on the string directly, so there is no need to split it first:
library(stringr)
str_count(x, "\\bg")
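If the letter comes from user input, pasting it into a regex can misbehave for metacharacters such as "." or "(". A minimal base R sketch that avoids regex for the comparison is shown here; count_words_starting is a hypothetical name, and the matching is assumed to be case-insensitive.

```r
# Hypothetical helper: count the words that begin with `letter`
count_words_starting <- function(phrase, letter) {
  # Lower-case both sides so "Great" and "great" are treated alike
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  # startsWith() compares literally, so no regex escaping is needed
  sum(startsWith(words, tolower(letter)))
}

count_words_starting("that pattern is great but pigs likes milk", "g")
# [1] 1
```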

Match and replace misspelled words in a string in R

I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Everything works perfectly until it finds a similar string and replaces it with another word.
Suppose I have a pattern like the following:
"plea", for which the correct word is "please". When I run the code, an already correct "please" is matched as well and turned into "pleased".
What I am looking for is that a string that is already correct is not modified when a similar pattern is found.
Perhaps you need to perform a progressive replace, i.e. you should have multiple sets of badwords and goodwords. First replace using the badwords with more letters, so that the shorter patterns no longer find a match, and then go for the smaller ones.
From the list provided by you, I would create 2 sets as:
goodwords1<-c( "three", "teasing")
badwords1<- c("thre", "teeasing")
goodwords2<-c("tree", "testing")
badwords2<- c("tre", "tesing")
First replace with 1st set and then with 2nd set. You can create many such sets.
str_replace_all takes regex as the pattern, so you can paste0 word boundaries \\b around each badwords so that a replacement will only be made if the whole word is matched:
library(stringr)
string <- c("tre", "tree", "teeasing", "tesing")
goodwords <- c("tree", "three", "teasing", "testing")
badwords <- c("tre", "thre", "teeasing", "tesing")
# Paste word boundaries around badwords
badwords <- paste0("\\b", badwords, "\\b")
vect.corpus <- goodwords
names(vect.corpus) <- badwords
str_replace_all(string, vect.corpus)
[1] "tree" "tree" "teasing" "testing"
The advantage of this is that you don't have to keep track of which strings are the longer strings.
This is what badwords looks like after pasting:
> badwords
[1] "\\btre\\b" "\\bthre\\b" "\\bteeasing\\b" "\\btesing\\b"
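The same whole-word idea can be sketched in base R with gsub. The helper name fix_words is hypothetical, and the sketch assumes the bad words contain no regex metacharacters and that no corrected word is itself listed as a bad word.

```r
# Hypothetical helper: replace each bad word only when it is a whole word
fix_words <- function(x, bad, good) {
  for (i in seq_along(bad)) {
    # \\b on both sides stops "plea" from matching inside "please"
    x <- gsub(paste0("\\b", bad[i], "\\b"), good[i], x)
  }
  x
}

a4 <- "I would like a cheseburger and friees please"
fix_words(a4,
          bad  = c("cheseburger", "friees", "plea"),
          good = c("cheeseburger", "fries", "please"))
# [1] "I would like a cheeseburger and fries please"

# Without the boundaries, the substring "tre" inside "tree" is replaced too
gsub("tre", "tree", "tre and tree")
# [1] "tree and treee"
```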

How to return the index of a vector that contains at least a string in another vector in R

I have a list containing verbs. I have another list containing sentences. How do I return the index of the sentence list that contains at least a verb in the verb list for that sentence?
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
I want it to return indexes 1, 3, and 4
Using no additional packages, we can sort of "or" different search terms together using | as follows:
Original question:
verbList <- list("punching, kicking, jumping, hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
v <- gsub(", ", "|", verbList)
grep(v, sentenceList)
New question:
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
v <- paste(verbList, collapse = "|")
grep(v, sentenceList)
A solution using stringr and rebus. or1 combines the verbs into a single alternation pattern, and str_which returns the indices of the sentences that match it.
library(stringr)
library(rebus)
# Check the index
result <- str_which(sentenceList, or1(verbList))
result
# [1] 1 3 4
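If the search terms might contain regex metacharacters, collapsing them with | is fragile. A minimal base R sketch that matches each term literally and combines the results, using the vectors from the question:

```r
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples",
                  "I am hopping", "I am kicking and jumping")

# One logical vector per verb; fixed = TRUE matches the verb literally
per_verb <- lapply(verbList, function(v) grepl(v, sentenceList, fixed = TRUE))

# A sentence qualifies if any verb matched it
result <- which(Reduce(`|`, per_verb))
result
# [1] 1 3 4
```

Like the grep solutions above, this matches substrings, so a verb would also be found inside a longer word.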

How to delete all strings except some specific letters in R?

after researching for a while, I didn't find exactly what I would like.
What I'd like to do is to keep an exact pattern in a string.
So this is my example:
text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS")
How can I get exactly "THIS" from all strings, like this:
res=c("THIS","THIS","THIS","","")
I tried gsub in R, but I don't know how to match the characters.
For example I tried:
gsub("(THIS).*", "\\1", text) # This deletes everything after "THIS".
gsub(".*(THIS)", "\\1", text) # This deletes everything before "THIS".
To extract THIS or THAT as whole words, you may use the following regex:
\b(THIS|THAT)\b
where \b is a word boundary and (...|...) is a capturing group with | alternation operator (that can appear more than once, more alternatives can be added).
Since regmatches with gregexpr return a list of vectors with some empty entries whenever no match is found, you need to convert them into NA first, then unlist, and then turn to "".
Here is some base R code:
> text=c("hello, please keep THIS","THIS is important","all THIS should be done","not exactly This","not THHIS", "THAT is something I need, too")
> matches <- regmatches(text, gregexpr("\\b(THIS|THAT)\\b", text))
> res <- lapply(matches, function(x) if (length(x) == 0) NA else x)
> res[is.na(res)] <- ""
> unlist(res)
[1] "THIS" "THIS" "THIS" "" "" "THAT"
We can use str_extract
library(stringr)
str_extract(text, "THIS")
#[1] "THIS" "THIS" "THIS" NA NA
It is better to have NA rather than ""
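This first drops the elements that do not contain THIS and then follows your original idea, storing the intermediate result in a variable. It seems that you want empty strings for elements that do not match, and the last line adds them. Note that the empty strings are appended at the end, so positions are preserved only when the non-matching elements come last, as in the example.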
tmp <- text[grepl("THIS", text)]
gsub("(THIS).*", "\\1", tmp) -> tmp
gsub(".*(THIS)", "\\1", tmp) -> tmp
c(tmp, rep("", length(text) - length(tmp)))
gsub("[^THIS]","",text) seems to do the trick? Edit: see comment, this doesn't work as expected. "[^THIS]" is a negated character class: it matches any single character other than T, H, I or S, not everything except the word THIS, so "not THHIS" becomes "THHIS" and "not exactly This" becomes "T".
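As a sketch of a base R alternative that keeps each result in its original position, regexpr finds the first whole-word match per element and regmatches extracts only the matched ones. It assumes an empty string is wanted where nothing matches, as in the question.

```r
text <- c("hello, please keep THIS", "THIS is important",
          "all THIS should be done", "not exactly This", "not THHIS")

# regexpr() returns -1 for elements without a match
m <- regexpr("\\bTHIS\\b", text)

# Start with empty strings, then fill in the matches at their positions
res <- rep("", length(text))
res[m != -1] <- regmatches(text, m)
res
# [1] "THIS" "THIS" "THIS" ""     ""
```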
