Just to help someone who voluntarily removed their question after being asked to show the code they had tried: let's assume they tried something like this:
str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')
and wanted to learn a better way. So what is the best way to remove a duplicate word from the string?
If you are still interested in alternative solutions, you can use unique(), which slightly simplifies your code:
paste(unique(d), collapse = ' ')
As per the comment by Thomas, you probably also want to remove punctuation. R's gsub supports POSIX character classes such as [[:punct:]], so you don't have to spell out the characters yourself; of course, you can always write a more specific regex if you need finer control.
d <- gsub("[[:punct:]]", "", d)
No additional packages are needed.
str <- c("How do I best try and try and try and find a way to to improve this code?",
"And and here's a second one one and not a third One.")
Atomic function:
rem_dup.one <- function(x){
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", fixed = FALSE, perl = TRUE))
  paste(unique(tolower(trimws(words))), collapse = " ")
}
rem_dup.one("And and here's a second one one and not a third One.")
Vectorize
rem_dup.vector <- Vectorize(rem_dup.one,USE.NAMES = F)
rem_dup.vector(str)
Result:
"how do i best try and find a way to improve this code" "and here's a second one not third"
To remove duplicate words while leaving special characters alone, use this function:
rem_dup_word <- function(x){
  x <- tolower(x)
  paste(unique(trimws(unlist(strsplit(x, split = " ", fixed = FALSE, perl = TRUE)))), collapse = " ")
}
Input data:
duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg
(Silver)"
rem_dup_word(duptest)
Output: "samsung wa80e5lec top loading with diamond drum, 6 kg (silver)"
Note that it treats "Samsung" and "SAMSUNG" as duplicates, since everything is lower-cased first.
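If you would rather keep the original casing of the first occurrence while still comparing case-insensitively, a small sketch (rem_dup_keepcase is a made-up name; it reuses duptest from above):
rem_dup_keepcase <- function(x){
  words <- trimws(unlist(strsplit(x, split = " ", fixed = TRUE)))
  # compare in lower case, but keep the words as originally written
  paste(words[!duplicated(tolower(words))], collapse = " ")
}
rem_dup_keepcase(duptest)
# [1] "Samsung WA80E5LEC Top Loading with Diamond Drum, 6 kg (Silver)"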
I'm not sure if string case is a concern. This solution uses qdap with the add-on qdapRegex package to make sure that punctuation and the leading capital don't interfere with the removal but are maintained:
str <- c("How do I best try and try and try and find a way to to improve this code?",
"And and here's a second one one and not a third One.")
library(qdap)
library(dplyr) # so that the pipe operator (%>%) works
str %>%
    tolower() %>%
    word_split() %>%
    sapply(., function(x) unbag(unique(x))) %>%
    rm_white_endmark() %>%
    rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
    unname()
## [1] "How do i best try and find a way to improve this code?"
## [2] "And here's a second one not third."
Related
I am trying to remove the pattern 'SO' (and everything after it) from the end of a character vector. The issue with the code below is that it matches 'SO' case-insensitively and removes the whole string rather than just the last occurrence of the pattern. One workaround I considered was to clean the data manually, forcing everything to lower case except the final 'SO', and then match case-sensitively.
library(dplyr); library(stringi)
x <- data.frame(y = "Solutions are welcomed, please SO # 12345")
x <- x %>% mutate(y = stri_replace_last_regex(y, "SO.*", "", case_insensitive = TRUE)) # This removes the string entirely - I'm not really sure why.
The desired output is:
'Solutions are welcomed, please'
I have tried regexes along the lines of \\b\\SO{2}\\b and \\b\\D{2}*\\b|[[:punct:]] - I believe the answer may lie in setting word boundaries, but I am not sure. The second one gets rid of the 'SO', but I suspect any other two-letter sequence elsewhere in the string would get removed as well. I just need the last occurrence of 'SO' and everything after it, including punctuation, to be removed.
Any guidance would be much appreciated.
You can use gsub to remove the pattern you don't want.
gsub("\\sSO.+$", "", x$y)
[1] "Solutions are welcomed, please"
Use [[:upper:]]{2} if you want to generalise to any two consecutive upper case letters.
gsub("\\s[[:upper:]]{2}.+$", "", x$y)
[1] "Solutions are welcomed, please"
UPDATE: the above code might not be accurate if you have more than one "SO" in the string
To demonstrate, I have created another string containing multiple "SO". Here we capture everything from the start of the string (^) up to, but not including, the last occurrence of "SO"; that part is stored in the first capture group, (.*). We then use gsub to replace the entire string with the first capture group (\\1), which discards the last "SO" and everything after it.
x <- data.frame(y = "Solutions are SO welcomed, SO please SO # 12345")
gsub('^(.*)SO.+$', '\\1', x$y)
[1] "Solutions are SO welcomed, SO please "
library(dplyr)
library(stringr)
x %>%
mutate(y = str_replace_all(y, 'SO.*', ''))
or
library(dplyr)
library(stringr)
x %>%
mutate(y = str_replace_all(y, 'SO\\s\\#\\s\\d*', ''))
output:
y
1 Solutions are welcomed, please
I have a Text column with thousands of rows of paragraphs, and I want to extract the values of "Capacity > x%". The operator can be >, <, =, ~, and so on. I basically need the operator and integer value (e.g. <40%) placed in a new column next to it, in the same row. I have tried removing text before/after, gsub, grep, grepl, str_extract, etc., with no good results. I am not sure whether the percentage sign is throwing it off or I am just not getting the code structure right. I'd appreciate your assistance.
Here are some codes I have tried (aa is the df, TEXT is col name):
str_extract(string =aa$TEXT, pattern = perl("(?<=LVEF).*(?=%)"))
gsub(".*[Capacity]([^.]+)[%].*", "\\1", aa$TEXT)
genXtract(aa$TEXT, "Capacity", "%")
gsub("%.*$", "%", aa$TEXT)
grep("^Capacity.*%$",aa$TEXT)
Since you did not provide a reproducible example, I created one myself and used it here.
We can use sub to extract everything after "Capacity" until a number and % sign.
sub(".*Capacity(.*\\d+%).*", "\\1", aa$TEXT)
#[1] " > 10%" " < 40%" " ~ 230%"
Or with str_extract
stringr::str_extract(aa$TEXT, "(?<=Capacity).*\\d+%")
data
aa <- data.frame(TEXT = c("This is a temp text, Capacity > 10%",
"This is a temp text, Capacity < 40%",
"Capacity ~ 230% more text ahead"), stringsAsFactors = FALSE)
gsub solution
I think your gsub solution was pretty close, but it didn't bring along the percentage sign, since the % sits outside the capture group. So something like this should work (the result is assigned to the capacity column):
aa$capacity <- gsub(".*[Capacity]([^.]+%).*", "\\1", aa$TEXT)
Alternative method
The gsub approach will match the whole string when there is no operator match. To avoid this, we can use the stringr package with a more specific regular expression:
library(magrittr)
library(dplyr)
library(stringr)
aa %>%
  mutate(capacity = str_extract(TEXT, "(?<=Capacity\\s)\\W\\s?\\d+\\s?%")) %>%
  mutate(capacity = str_squish(capacity)) # remove excess white space
This code will give NA when there is no match, which I believe is your desired behaviour.
I want to remove multiple patterns from multiple character vectors. Currently I am going:
a.vector <- gsub("#\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]], "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither mapply nor mgsub is working. I made these vectors:
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.
a.vector looks like this:
[4951] "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing" `
I know this answer is late on the scene, but it stems from my dislike of having to manually list the removal patterns inside the gsub calls (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them together (when needed) using the regex separator "|":
library(stringr)
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.
Try combining your subpatterns using |. For example
>s<-"#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("#\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
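To illustrate the second point: a single gsub pass does not rescan its own output, so removing a pattern can leave behind a fresh match for it (or for another pattern in the set):
# removing the middle "ab" from "aabb" creates a new "ab" that this pass misses
gsub("ab", "", "aabb")
# [1] "ab"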
Consider creating your remove vector as you suggested, then applying it in a loop
> s1 <- s
> remove<-c("#\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants, as sketched below.
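For example, a minimal sketch of that idea (the tweets vector is made up for illustration):
# wrap the sequential removal in a function; gsub() is itself vectorised over
# its x argument, so the function already handles a whole character vector
remove_patterns <- function(s, patterns) {
  for (p in patterns) s <- gsub(p, "", s)
  s
}
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
tweets <- c("first tweet, with punctuation! http123",
            "second tweet #tag and more http456")
remove_patterns(tweets, remove)
# or, one element at a time, via an apply variant:
sapply(tweets, remove_patterns, patterns = remove, USE.NAMES = FALSE)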
In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.
For the example you provided, you can try:
removePat <- "(#\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)
I had a vector with the statement "my final score" and I wanted to keep only the word "final" and remove the rest. This is what worked for me, based on Marian's suggestion:
str_remove_all("my final score", "my |score")
note: "my final score" is just an example. I was dealing with a vector.
I have a data set like this:
Quest_main=c("quest2,","quest5,","quest4,","quest12,","quest4,","quest5,quest7")
And I would like to remove the trailing comma from, for example, "quest2," so that it becomes "quest2", but not the comma inside "quest5,quest7". I think I have to use substr or ifelse, but I am not sure. The final result, when I print Quest_main, should be:
"quest2" "quest5" "quest4" "quest12" "quest4" "quest5,quest7"
Thanks!
All you need is
gsub(",$","",Quest_main)
The $ signifies the end of a string: for full explanation, see the (long and complicated) ?regexp, or a more general introduction to regular expressions, or search for the tags [r] [regex] on Stack Overflow.
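To see why the anchor matters, compare with the unanchored pattern:
# without the $ every comma is removed, e.g. "quest5,quest7" becomes "quest5quest7"
gsub(",", "", Quest_main)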
If you insist on doing it with substr() and ifelse(), you can:
nc <- nchar(Quest_main)
lastchar <- substr(Quest_main, nc, nc)
ifelse(lastchar == ",", substr(Quest_main, 1, nc - 1), Quest_main)
With substring and ifelse:
ifelse(substring(Quest_main, nchar(Quest_main)) == ",",
       substring(Quest_main, 1, nchar(Quest_main) - 1),
       Quest_main)
Here's an alternative approach (just for general knowledge) using negative lookahead
gsub("(,)(?!\\w)", "", Quest_main, perl = TRUE)
## [1] "quest2" "quest5" "quest4" "quest12" "quest4" "quest5,quest7"
This approach is more general: it lets you delete commas not only at the end of a word but under other conditions as well, as in the sketch below.
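For illustration, here is a made-up variation: keep a comma only when the literal word "quest" follows it.
# negative lookahead on "quest" instead of \w: drop a comma unless "quest" follows
gsub(",(?!quest)", "", Quest_main, perl = TRUE)
# same result as the lookahead above for this particular data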
A more general solution would be using stringi's stri_trim_right, which will work in cases where Ben's or Jealie's solutions fail, for example when there are several commas at the end of the string that you want to get rid of:
Quest_main <- c("quest2,,,," ,"quest5,quest7,,,,")
Quest_main
#[1] "quest2,,,," "quest5,quest7,,,,"
library(stringi)
stri_trim_right(Quest_main, pattern = "[^,]")
#[1] "quest2" "quest5,quest7"
I have a text, e.g.
text<- "i am happy today :):)"
I want to extract :) from the text vector and report its frequency.
Here's one idea, which would be easy to generalize:
text<- c("i was happy yesterday :):)",
"i am happy today :)",
"will i be happy tomorrow?")
(nchar(text) - nchar(gsub(":)", "", text))) / 2
# [1] 2 1 0
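The division by 2 is just the number of characters in ":)"; a small sketch of the generalisation this hints at (count_pattern is a made-up helper; it reuses the text vector above):
# count non-overlapping occurrences of any fixed pattern via the length difference
count_pattern <- function(x, pattern) {
  (nchar(x) - nchar(gsub(pattern, "", x, fixed = TRUE))) / nchar(pattern)
}
count_pattern(text, ":)")
# [1] 2 1 0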
I assume you only want the count, or do you also want to remove :) from the string?
For the count you can do:
length(gregexpr(":)",text)[[1]])
which gives 2. A more generalized solution for a vector of strings is:
sapply(gregexpr(":)",text),length)
Edit:
Josh O'Brien pointed out that this also returns 1 if there is no :), since gregexpr returns -1 in that case. To fix this you can use:
sapply(gregexpr(":)",text),function(x)sum(x>0))
Which does become slightly less pretty.
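For reference, stringr's str_count handles the no-match case directly; a small sketch of the same idea, wrapping the pattern in fixed() so the ")" is not treated as regex syntax:
library(stringr)
text <- c("i was happy yesterday :):)",
          "i am happy today :)",
          "will i be happy tomorrow?")
str_count(text, fixed(":)"))
# [1] 2 1 0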
This does the trick but might not be the most direct way:
mytext<- "i am happy today :):)"
# The following line inserts semicolons to split on
myTextSub<-gsub(":)", ";:);", mytext)
# Then split and unlist
myTextSplit <- unlist(strsplit(myTextSub, ";"))
# Then see how many times the smiley turns up
length(grep(":)", myTextSplit))
EDIT
To handle vectors of text with length > 1, don't unlist:
mytext<- rep("i am happy today :):)",2)
myTextSub<-gsub(":\\)", ";:\\);", mytext)
myTextSplit <- strsplit(myTextSub, ";")
sapply(myTextSplit,function(x){
length(grep(":)", x))
})
But I like the other answers better.