I am trying to clean some garbage out of some text. While doing this, I am assuming that any word that has a letter (any letter) repeated three or more times is garbage - and I want to remove it.
I've come up with this:
gsub(pattern = "[a-zA-Z]\\1\\1", replacement = "", string)
in which string is the character vector, but this doesn't work. Everything else I've tried might find the pattern, but it just removes the pattern, leaving a mess. I'm trying to remove the whole word with the pattern in it.
Any ideas?
You need
gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
stringr::str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")
See an R demo:
string <- "This is a baaaad unnnnecessary short word"
gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
library(stringr)
str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")
All yielding [1] "This is a short word".
See the regex demo. Regex details:
\s* - zero or more whitespaces
\p{L}* / [[:alpha:]]* - zero or more letters
(\p{L}) - Capturing group 1: any single letter
\1{2} - two occurrences of the same value as in Group 1
\p{L}* / [[:alpha:]]* - zero or more letters.
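To make the pattern above reusable, here is a minimal sketch that wraps it in a helper. The function name remove_repeated_letter_words and its min_run argument are hypothetical, not part of the answer. It uses the [[:alpha:]] variant with perl = TRUE and trims the whitespace left over when the first word of a string is removed.

```r
# Hypothetical helper: drop every word that contains a letter repeated
# at least `min_run` times in a row (3 by default, as in the question).
remove_repeated_letter_words <- function(x, min_run = 3) {
  # Build the same pattern as above, with the repeat count filled in
  pattern <- sprintf(
    "\\s*[[:alpha:]]*([[:alpha:]])\\1{%d}[[:alpha:]]*",
    as.integer(min_run - 1)
  )
  # trimws() cleans up the space left behind when the first word is removed
  trimws(gsub(pattern, "", x, perl = TRUE))
}

remove_repeated_letter_words(c(
  "This is a baaaad unnnnecessary short word",
  "Soooo good"
))
# [1] "This is a short word" "good"
```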
You need to turn the [...] character class into a "capture group" by wrapping it in parentheses, since \\1 needs a group to refer back to:
gsub("([a-zA-Z])\\1\\1", "", "aabbbccdddee")
# [1] "aaccee"
Updated on OP comment:
Try this:
gsub("([A-Z]&|[a-z])\\1{2, }", "", "AAA")
[1] "AAA"
gsub("([A-Z]&|[a-z])\\1{2, }", "", "aabbbccdddee")
[1] "aaccee"
r2evans example with different regex:
gsub("(\\w)\\1{2, }", "", "aabbbccdddee")
[1] "aaccee"
I want to remove all words that start with "a" in a string.
Input:
string <- "This is a sentence about nothing."
My attempt:
stringr::str_remove_all(string,"a*\\b")
output I got:
[1] "This is sentence about nothing."
output I want:
[1] "This is sentence nothing."
I am not sure how to detect based on one letter but perform action(e.g., remove, replace) on the whole word. Any input is appreciated!
The a*\b pattern matches zero or more a characters followed by a word boundary. It therefore only removes a characters at the end of a word, and it matches a whole word only when that word consists solely of a characters.
You can use
stringr::str_remove_all(string,"\\ba\\w*")
stringr::str_replace_all(string,"\\ba\\w*", "")
gsub("\\ba\\w*", "", string, perl=TRUE) ## ASCII only letters/digits
where \ba\w* matches a word boundary, a, and then zero or more word chars.
If you also want to remove any whitespaces before the word, add \s* at the start:
stringr::str_remove_all(string,"\\s*\\ba\\w*")
stringr::str_replace_all(string,"\\s*\\ba\\w*", "")
gsub("\\s*\\ba\\w*", "", string, perl=TRUE) ## ASCII only letters/digits/whitespaces
If you need to make sure you only remove natural language words consisting only of letters, then you can replace \w with \p{L}:
stringr::str_remove_all(string,"\\s*\\ba\\p{L}*")
stringr::str_replace_all(string,"\\s*\\ba\\p{L}*", "")
gsub("(*UCP)\\s*\\ba\\p{L}*", "", string, perl=TRUE) ## any Unicode letters/digits/whitespaces
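As a rough sketch, the \s*\ba\w* pattern can be generalized to any starting letter. The helper name remove_words_starting_with and its arguments are made up for illustration. It assumes letter is a single plain letter with no regex metacharacters, and it relies on the ignore.case argument of gsub for case-insensitive matching.

```r
# Hypothetical helper: remove every word that starts with `letter`,
# together with the whitespace in front of it.
remove_words_starting_with <- function(x, letter, ignore_case = FALSE) {
  # \s* eats preceding whitespace, \b anchors at the start of a word
  pattern <- paste0("\\s*\\b", letter, "\\w*")
  # trimws() handles the case where the first word was removed
  trimws(gsub(pattern, "", x, perl = TRUE, ignore.case = ignore_case))
}

remove_words_starting_with("This is a sentence about nothing.", "a")
# [1] "This is sentence nothing."

remove_words_starting_with("Apples are a fruit", "a", ignore_case = TRUE)
# [1] "fruit"
```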
I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could put \s and the underscore in a single character class and use a quantifier to repeat it one or more times.
^[\s_]+
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or, as @Tim Biegeleisen points out in the comments, since only the first occurrence is being replaced, you could use sub instead (keep the ^ anchor so that only leading characters are removed):
txt <- sub("^[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class:
txt <- sub("^[[:space:]_]+", "", txt)
More info about perl=TRUE and regular expressions used in R
R demo
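A short sketch of why the ^ anchor matters when only leading characters should go. The variable names are illustrative and the strings are adapted from the question. Without the anchor, sub removes the first run of whitespace or underscores wherever it occurs.

```r
# Leading whitespace and underscores are stripped; the trailing space stays
txt <- " _9PM 8-Oct-2014_0.335kwh "
sub("^[[:space:]_]+", "", txt)
# [1] "9PM 8-Oct-2014_0.335kwh "

# A string with nothing to strip at the start
clean <- "9PM 8-Oct-2014_0.335kwh"

# Unanchored: the first internal space is removed instead
sub("[[:space:]_]+", "", clean)
# [1] "9PM8-Oct-2014_0.335kwh"

# Anchored: the string is left untouched
sub("^[[:space:]_]+", "", clean)
# [1] "9PM 8-Oct-2014_0.335kwh"
```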
The stringr package offers some task-specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove the first run of whitespace and underscores (here, the one at the start).
str_remove(s, "[\\s_]+")
# [1] "blah_ "
# Remove every run of whitespace and underscores (here, at the start and end).
str_remove_all(s, "[\\s_]+")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (spaces, tabs, newlines) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# [1] "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see Thefourthbird's answer).
You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or trimws in base R:
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
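Note that str_trim and trimws as called above only strip whitespace, not underscores. If underscores at either end should go too, one possible base R sketch is below. The helper name trim_ws_underscores is hypothetical, and the pattern uses two anchored alternatives so that characters in the middle of the string are left alone.

```r
# Hypothetical helper: strip whitespace and underscores from both ends only
trim_ws_underscores <- function(x) {
  gsub("^[[:space:]_]+|[[:space:]_]+$", "", x)
}

trim_ws_underscores(" \t_blah_ ")
# [1] "blah"

# Underscores and spaces inside the string are preserved
trim_ws_underscores("__a_b c__")
# [1] "a_b c"
```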
I have this string:
cd/etc/init[BKSP][BKSP]it.d[ENTER]
I want the end result to be like this :
cd/etc/init.d[ENTER]
It should remove every [BKSP] substring along with the character immediately before it.
I have this sub function:
sub("(.?\\[BKSP\\]+)+", "", string, perl = TRUE)
But I'm getting cd/etc/iniit.d[ENTER] instead.
Any help would be greatly appreciated! Thanks!
You may use
gsub("(?s).(?R)?\\[BKSP]", "", string, perl=TRUE)
See the regex demo
Details
(?s) - turns on the DOTALL modifier
. - matches any char
(?R)? - matches 1 or 0 occurrences of the whole pattern (recurses the whole pattern)
\\[BKSP] - a literal substring [BKSP].
R demo:
string <- c("cd/etc/init[BKSP][BKSP]it.d[ENTER]", "abcd[BKSP]e")
gsub("(?s).(?R)?\\[BKSP]", "", string, perl=TRUE)
## => [1] "cd/etc/init.d[ENTER]" "abce"
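If the recursive pattern feels opaque, a loop is an alternative way to get the same effect. This is a different technique from the one in the answer, shown only as a sketch: the helper name apply_backspaces is hypothetical, and it assumes every [BKSP] has an ordinary character in front of it to erase. Each pass deletes one character together with the [BKSP] that follows it, until no such pair is left.

```r
# Hypothetical helper: apply [BKSP] markers one at a time
apply_backspaces <- function(x) {
  # One character (including a newline, via (?s)) followed by a literal [BKSP]
  pattern <- "(?s).\\[BKSP\\]"
  # Each iteration removes the leftmost pair in every element that has one,
  # so the strings get shorter and the loop terminates
  while (any(grepl(pattern, x, perl = TRUE))) {
    x <- sub(pattern, "", x, perl = TRUE)
  }
  x
}

apply_backspaces(c("cd/etc/init[BKSP][BKSP]it.d[ENTER]", "abcd[BKSP]e"))
# [1] "cd/etc/init.d[ENTER]" "abce"
```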
You could use
test <- "cd/etc/init[BKSP][BKSP]it.d[ENTER]"
pattern <- "\\[BKSP\\]\\w*"
gsub(pattern, "", test)
Which yields
[1] "cd/etc/init.d[ENTER]"
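One caveat, sketched below: this pattern deletes the word characters after each [BKSP] rather than the character before it. It gives the expected result for the question's string only because the retyped characters happen to equal the erased ones. The variable name keys is illustrative; the second test string is the one from the R demo in the previous answer.

```r
keys <- c("cd/etc/init[BKSP][BKSP]it.d[ENTER]", "abcd[BKSP]e")

# Forward-looking pattern: removes what follows each [BKSP]
gsub("\\[BKSP\\]\\w*", "", keys)
# [1] "cd/etc/init.d[ENTER]" "abcd"

# Recursive pattern: removes the character before each [BKSP]
gsub("(?s).(?R)?\\[BKSP]", "", keys, perl = TRUE)
# [1] "cd/etc/init.d[ENTER]" "abce"
```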
I want to write a regex in R to remove all words of a string containing numbers.
For example:
first_text = "a2c if3 clean 001mn10 string asw21"
second_text = "clean string"
Try with gsub
trimws(gsub("\\w*[0-9]+\\w*\\s*", "", first_text))
#[1] "clean string"
It is easier to select words with no numbers than to select and delete words with numbers:
library(stringr)
str1 <- "a2c if3 clean 001mn10 string asw21"
paste(unlist(str_extract_all(str1, "(\\b[^\\s\\d]+\\b)")), collapse = " ")
[1] "clean string"
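The same extraction can be sketched in base R without stringr, assuming perl = TRUE so that \s and \d are available inside the bracket expression. The variable name matches is illustrative.

```r
str1 <- "a2c if3 clean 001mn10 string asw21"

# gregexpr() locates every digit-free word; regmatches() pulls them out
matches <- regmatches(str1, gregexpr("\\b[^\\s\\d]+\\b", str1, perl = TRUE))

# regmatches() returns a list with one element per input string
paste(matches[[1]], collapse = " ")
# [1] "clean string"
```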
Note:
Backslashes have to be escaped in R to work properly, hence double backslashes
\b is word boundary
\s is white space
\d is digit character
a caret (^) inside square brackets is a negater: find characters that do not match ...
"+" after the character group inside [] means "1 or more" occurrences of those (non white space and non digit) characters
Just another alternative using gsub
trimws(gsub("[^\\s]*[0-9][^\\s]*\\s*", "", first_text, perl = TRUE))
#[1] "clean string"
A bit longer than some of the answers but very tractable is to first convert the string to a vector of words, then check word by word if there are any numbers and use standard R subsetting.
first_text_vec <- strsplit(first_text, " ")[[1]]
first_text_vec
[1] "a2c" "if3" "clean" "001mn10" "string" "asw21"
paste(first_text_vec[!grepl("[0-9]", first_text_vec)], collapse = " ")
[1] "clean string"
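The split-and-subset idea above handles one string at a time. A possible sketch for a whole character vector follows. The function name drop_words_with_digits is hypothetical, and it splits on any run of whitespace rather than on a single space.

```r
# Hypothetical helper: apply the split-and-subset approach to each string
drop_words_with_digits <- function(x) {
  vapply(strsplit(x, "[[:space:]]+"), function(words) {
    # Keep only the words that contain no digit, then glue them back together
    paste(words[!grepl("[0-9]", words)], collapse = " ")
  }, character(1))
}

drop_words_with_digits(c(
  "a2c if3 clean 001mn10 string asw21",
  "no digits here",
  "123"
))
# [1] "clean string"   "no digits here" ""
```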
I would like to remove everything after a space in a string.
For example:
"my string is sad"
should return
"my"
I've been trying to figure out how to do this using sub/gsub but have been unsuccessful so far.
You may use a regex like
sub(" .*", "", x)
See the regex demo.
Here, sub performs a single search-and-replace operation. The regex engine searches the string from left to right, so the space in the pattern matches the first space, and .* then matches any zero or more characters, as many as possible, up to the end of the string. In the default TRE regex flavor, . also matches line break characters; with perl=TRUE it does not, unless you add (?s).
Some variations:
sub("[[:space:]].*", "", x) # \s or [[:space:]] will match more whitespace chars
sub("(*UCP)(?s)\\s.*", "", x, perl=TRUE) # PCRE Unicode-aware regex
stringr::str_replace(x, "(?s) .*", "") # (?s) will force . to match any chars
See the online R demo.
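The line break caveat is easiest to see on a string that contains a newline. This is a small sketch; the input string is made up for illustration.

```r
x <- "my string\nis sad"

# Default TRE engine: . also matches the newline, so everything is removed
sub(" .*", "", x)
# [1] "my"

# PCRE: . stops at the newline, so the second line survives
sub(" .*", "", x, perl = TRUE)
# [1] "my\nis sad"

# PCRE with (?s): . matches the newline again
sub("(?s) .*", "", x, perl = TRUE)
# [1] "my"
```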
strsplit("my string is sad"," ")[[1]][1]
or replace everything after the first space with nothing:
gsub(' [A-Za-z ]*', '' , 'my string is sad')
And with numbers:
gsub('([0-9]+) .*', '\\1', c('c123123123 0320.1'))
If you want to do it with a regex:
gsub('([A-Za-z]+) .*', '\\1', 'my string is sad')
Stringr is your friend.
library(stringr)
word("my string is sad", 1)
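The answers in this thread differ on edge cases such as leading whitespace or strings with no space at all. Here is a small base R sketch that handles both. The function name first_word is hypothetical, and trimming first is a design choice that the original answers do not make.

```r
# Hypothetical helper: keep only the first whitespace-delimited word
first_word <- function(x) {
  # Trim first so that leading whitespace does not produce an empty result
  sub("[[:space:]].*", "", trimws(x))
}

first_word(c("my string is sad", "single", "  leading spaces", ""))
# [1] "my"      "single"  "leading" ""

# Without trimming, the leading space is itself the "first space"
sub(" .*", "", "  leading spaces")
# [1] ""
```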