remove words based on first letter using stringr - r

I want to remove all words that start with "a" in a string.
Input:
string <- "This is a sentence about nothing."
My attempt:
stringr::str_remove_all(string,"a*\\b")
output I got:
[1] "This is sentence about nothing."
output I want:
[1] "This is sentence nothing."
I am not sure how to detect based on one letter but perform action(e.g., remove, replace) on the whole word. Any input is appreciated!

The a*\b pattern matches zero or more a chars followed with end of string or a word char. It does not match a word unless it is an a word.
You can use
stringr::str_remove_all(string,"\\ba\\w*")
stringr::str_replace_all(string,"\\ba\\w*", "")
gsub("\\ba\\w*", "", string, perl=TRUE) ## ASCII only letters/digits
where \ba\w* matches a word boundary, a, and then zero or more word chars.
If you also want to remove any whitespaces before the word, add \s* at the start:
stringr::str_remove_all(string,"\\s*\\ba\\w*")
stringr::str_replace_all(string,"\\s*\\ba\\w*", "")
gsub("\\s*\\ba\\w*", "", string, perl=TRUE) ## ASCII only letters/digits/whitespaces
If you need to make sure you only remove natural langugage words consisting only of letters, then you can replace \w with \p{L}:
stringr::str_remove_all(string,"\\s*\\ba\\p{L}*")
stringr::str_replace_all(string,"\\s*\\ba\\p{L}*", "")
gsub("(*UCP)\\s*\\ba\\p{L}*", "", string, perl=TRUE) ## any Uncicode letters/digits/whitespaces

Related

Replace a whole word containing a pattern - gsub and R

I am trying to clean some garbage out of some text. While doing this, I am assuming that any word that has a letter (any letter) repeated three or more times is garbage - and I want to remove it.
I've come up with this:
gsub(pattern = "[a-zA-Z]\\1\\1", replacement = "", string)
in which string is the character vector, but this doesn't work. Everything else I've tried might find the pattern, but it just removes the pattern, leaving a mess. I'm trying to remove the whole word with the pattern in it.
Any ideas?
You need
gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
stringr::str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")
See an R demo:
string <- "This is a baaaad unnnnecessary short word"
gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
library(stringr)
str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")
All yielding [1] "This is a short word".
See the regex demo. Regex details:
\s* - zero or more whitespaces
\p{L}* / [[:alpha:]]* - zero or more letters
(\p{L}) - Capturing group 1: any single letter
\1{2} - two occurrences of the same value as in Group 1
\p{L}* / [[:alpha:]]* - zero or more letters.
You need to assign a "capture group" to the [.] class by wrapping it in parens, since the \\1 needs something to reference:
gsub("([a-zA-Z])\\1\\1", "", "aabbbccdddee")
# [1] "aaccee"
Updated on OP comment:
Try this:
gsub("([A-Z]&|[a-z])\\1{2, }", "", "AAA")
[1] "AAA"
gsub("([A-Z]&|[a-z])\\1{2, }", "", "aabbbccdddee")
[1] "aaccee"
r2evans example with different regex:
gsub("(\\w)\\1{2, }", "", "aabbbccdddee")
[1] "aaccee"

Remove all words in string containing punctuation (R)

How (in R) would I remove any word in a string containing punctuation, keeping words without?
test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"
desired <- "I a see works not"
Here is an approach using sub which seems to work:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", test.string)
[1] "I a see works not"
This approach is to use the following regex pattern:
[A-Za-z]* match a leading letter zero or more times
[^A-Za-z ] then match a symbol once (not a space character or a letter)
\\S* followed by any other non whitespace character
\\s* followed by any amount of whitespace
Then, we just replace with empty string, to remove the words having one or more symbols in them.
You can use this regex
(?<=\\s|^)[a-z0-9]+(?=\\s|$)
(?<=\\s|^) - positive lookbehind, match should be preceded by space or start of string.
[a-z0-9]+ - Match alphabets and digits one or more time,
(?=\\s|$) - Match must be followed by space or end of string
Demo
Tim's edit:
This answer uses a whitelist approach, namely identify all words which the OP does want to retain in his output. We can try matching using the regex pattern given above, and then connect the vector of matches using paste:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\\r not"
result <- regmatches(test.string,gregexpr("(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)",test.string, perl=TRUE))[[1]]
paste(result, collapse=" ")
[1] "I a see works not"
Here are couple of more approaches
First approach:
str_split(test.string, " ", n=Inf) %>% # spliting the line into words
unlist %>%
.[!str_detect(., "\\W|\r")] %>% # detect words without punctuation or \r
paste(.,collapse=" ") # collapse the words to get the line
Second approach:
str_extract_all(test.string, "^\\w+|\\s\\w+\\s|\\w+$") %>%
unlist %>%
trimws() %>%
paste(., collapse=" ")
^\\w+ - words having only [a-zA-Z0-9_] and also is start of the string
\\s\\w+\\s - words with [a-zA-Z0-9_] and having space before and after the word
\\w+$ - words having [a-zA-Z0-9_] and also is the end of the string

Remove a list of whole words that may contain special chars from a character vector without matching parts of words

I have a list of words in R as shown below:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
And I want to remove the words which are found in the above list from the text as below:
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
After removing the unwanted myList words, the myText should look like:
This is at Sample Text, which is better and cleaned, where is not equal to. This is messy text.
I was using :
stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")
But this is not helping me. What I should do??
You may use a PCRE regex with a gsub base R function (it will also work with ICU regex in str_replace_all):
\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)
See the regex demo.
Details
\s* - 0 or more whitespaces
(?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items inside the character vector with the words you need to remove
(?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
NOTE: We cannot use \b word boundary here because the items in the myList character vector may start/end with non-word characters while \b meaning is context-dependent.
See an R demo online:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, collapse="\n")
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
Details
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creats a |-separated alternative list from the search term vector.
gsub(paste0(myList, collapse = "|"), "", myText)
gives:
[1] "This is Sample Text, which is better and cleaned , where is not equal to . This is messy text ."

A regex to remove all words which contains number in R

I want to write a regex in R to remove all words of a string containing numbers.
For example:
first_text = "a2c if3 clean 001mn10 string asw21"
second_text = "clean string
Try with gsub
trimws(gsub("\\w*[0-9]+\\w*\\s*", "", first_text))
#[1] "clean string"
It is easier to select words with no numbers than to select and delete words with numbers:
> library(stringr)
> str1 <- "a2c if3 clean 001mn10 string asw21"
> paste(unlist(str_extract_all(str1, "(\\b[^\\s\\d]+\\b)")), collapse = " ")
[1] "clean string"
Note:
Backslashes have to be escaped in R to work properly, hence double backslashes
\b is word boundary
\s is white space
\d is digit character
a caret (^) inside square brackets is a negater: find characters that do not match ...
"+" after the character group inside [] means "1 or more" occurrences of those (non white space and non digit) characters
Just another alternative using gsub
trimws(gsub("[^\\s]*[0-9][^\\s]*", "", first_text, perl=T))
#[1] "clean string"
A bit longer than some of the answers but very tractable is to first convert the string to a vector of words, then check word by word if there are any numbers and use standard R subsetting.
first_text_vec <- strsplit(first_text, " ")[[1]]
first_text_vec
[1] "a2c" "if3" "clean" "001mn10" "string" "asw21"
paste(first_text_vec[!grepl("[0-9]", first_text_vec)], collapse = " ")
[1] "clean string"

In R: grab all alnum characters before the first punctuation

I have a vector s of strings (or NAs), and would like to get a vector of same length of everything before first occurrence of punctionation (.).
s <- c("ABC1.2", "22A.2", NA)
I would like a result like:
[1] "ABC1" "22A" NA
You can remove all symbols (incl. a newline) from the first dot with the following Perl-like regex:
s <- c("ABC1.2", "22A.2", NA)
gsub("[.][\\s\\S]*$", "", s, perl=T)
## => [1] "ABC1" "22A" NA
See IDEONE demo
The regex matches
[.] - a literal dot
[\\s\\S]* - any symbols incl. a newline
$ - end of string.
All matched strings are removed from the input with "". As the regex engine analyzes the string from left to right, the first dot is matched with \\., and the greedy * quantifier with [\\s\\S] will match all up to the end of string.
If there are no newlines, a simpler regex will do: [.].*$:
gsub("[.].*$", "", s)
See another demo

Resources