Strip single forward slash from text only in R - r

I am trying to remove only the / from any text using R. I have tried different approaches and I got mixed results.
This is the text I am dealing with s/p Left IOLI 3/9/04.
I am trying to produce an output like this sp Left IOLI 3/9/04.
Only strip the / in text and not numbers.
I have tried these four
gsub("\", "", str, fixed=T)
gsub("/", ".", str, fixed=T)
gsub("[^A-Za-z]", ".", str, perl =T)
str_replace( str, "/", "")
So far only gsub("[^A-Za-z]", ".", str, perl =T) worked. sucker stripped the / off of everything text numbers and everything. I just need the / from text be gone. Any help is much appreciated folks.

We can use regex lookarounds to remove the forward slash that are not between numbers.
gsub('(?<![0-9])/(?![0-9])', '', str, perl=TRUE)
#[1] "sp Left IOLI 3/9/04."
If we also need to remove / when either the left or right side contain non-numeric characters,
gsub('(?<![0-9])/|/(?![0-9])', '', str1, perl=TRUE)
#[1] "sp Left IOLI 3/9/04." "s12 45p sp Left"
data
str <- 's/p Left IOLI 3/9/04.'
str1 <- c(str, 's/12 45/p s/p Left')

An alternative way is to run multiple regexes. Demonstrated here using str_replace_all of package stringr, but obviously will work using base functions as well.
#First correct for / between 2 alphabets like s/p
mystring <- str_replace_all(mystring, "([a-zA-Z])/([a-zA-Z])", "\\1\\2")
#Next, correct for / between 1 alphabet and 1 number like s/12 or 45/p
mystring <- str_replace_all(mystring, "([a-zA-Z])/([\\d])", "\\1\\2")
mystring <- str_replace_all(mystring, "([\\d])/([a-zA-Z])", "\\1\\2")

Related

Regex expression to remove left over ascii hex code

I am analysing some tweets and I have written an basic emoji to text dictionary. I use the following to convert emoji's to r-encoded unicode;
df$text <- iconv(df$text, from = "latin1", to = "ascii", sub = "byte")
After that I swap the unicode to a text string that describes the emoji, for example <c2><ae> becomes 'copyright'
Problem is I have a lot of emoji's that aren't in the dictionary and I need to remove the strings that represent them. I can remove the <> symbols with "[[:punct:]]", "", but I need to get rid of the alpha numeric characters inside the <>'s too.
I was thinking something like
gsub("^<", "")
but i'm honestly stumped on how to find the < > symbols and remove anything found between them, or how to make a regex that finds < then removes it and the next 3 characters.
Appreciate any help
example
text <- ("have a <ed><a0><bd><ed><b8><80> day")
gsub("[[:punct:]]", "", text)
gives "have a eda0bdedb880 day"
but I want "have a day"
We can use a regex to match the < followed by characters that are not space ([^ ]+), ending in > and replace with blank ("")
gsub("\\<[^ ]+\\>\\s*", "", text, perl = TRUE)
#[1] "have a day"

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be an elegant solution
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could bundle the \s and the underscore only in a character class and use quantifier to repeat that 1+ times.
^[\s_]+
Regex demo
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or as #Tim Biegeleisen points out in the comment, if only the first occurrence is being replaced you could use sub instead:
txt <- sub("[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class
txt <- sub("[[:space:]_]+", "", txt)
More info about perl=TRUE and regular expressions used in R
R demo
The stringr packages offers some task specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start.
str_remove(s, "[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores at the start and end.
str_remove_all(s, "[\\s_]+")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (\s and \t) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see Thefourthbird`s answer).
You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or the trimws in Base R
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"

Remove a list of whole words that may contain special chars from a character vector without matching parts of words

I have a list of words in R as shown below:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
And I want to remove the words which are found in the above list from the text as below:
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
After removing the unwanted myList words, the myText should look like:
This is at Sample Text, which is better and cleaned, where is not equal to. This is messy text.
I was using :
stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")
But this is not helping me. What I should do??
You may use a PCRE regex with a gsub base R function (it will also work with ICU regex in str_replace_all):
\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)
See the regex demo.
Details
\s* - 0 or more whitespaces
(?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items inside the character vector with the words you need to remove
(?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
NOTE: We cannot use \b word boundary here because the items in the myList character vector may start/end with non-word characters while \b meaning is context-dependent.
See an R demo online:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, collapse="\n")
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
Details
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creats a |-separated alternative list from the search term vector.
gsub(paste0(myList, collapse = "|"), "", myText)
gives:
[1] "This is Sample Text, which is better and cleaned , where is not equal to . This is messy text ."

Remove punctuation but keep hyphenated phrases in R text cleaning

Is there any effective way to remove punctuation in text but keeping hyphenated expressions, such as "accident-prone"?
I used the following function to clean my text
clean.text = function(x)
{
# remove rt
x = gsub("rt ", "", x)
# remove at
x = gsub("#\\w+", "", x)
x = gsub("[[:punct:]]", "", x)
x = gsub("[[:digit:]]", "", x)
# remove http
x = gsub("http\\w+", "", x)
x = gsub("[ |\t]{2,}", "", x)
x = gsub("^ ", "", x)
x = gsub(" $", "", x)
x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
#return(x)
}
and apply it on hyphenated expressions that returned
my_text <- "accident-prone"
new_text <- clean.text(text)
new_text
[1] "accidentprone"
while my desired output is
"accident-prone"
I have referenced this thread but didn't find it worked on my situation. There must be some regex things that I haven't figured out. It will be really appreciated if someone could enlighten me on this.
Putting my two cents in, you could use (*SKIP)(*FAIL) with perl = TRUE and remove any non-word characters:
data <- c("my-test of #$%^&*", "accident-prone")
(gsub("(?<![^\\w])[- ](?=\\w)(*SKIP)(*FAIL)|\\W+", "", data, perl = TRUE))
Resulting in
[1] "my-test of" "accident-prone"
See a demo on regex101.com.
Here the idea is to match what you want to keep
(?<![^\\w])[- ](?=\\w)
# a whitespace or a dash between two word characters
# or at the very beginning of the string
let these fail with (*SKIP)(*FAIL) and put what you want to be removed on the right side of the alternation, in this case
\W+
effectively removing any non-word-characters not between word characters.
You'd need to provide more examples for testing though.
The :punct: set of characters includes the dash and you are removing them. You could make an alternate character class that omits the dash. You do need to pay special attention to the square-brackets placements and escape the double quote and the backslash:
(test <- gsub("[]!\"#$%&'()*+,./:;<=>?#[\\^_`{|}~]", "", "my-test of #$%^&*") )
[1] "my-test of "
The ?regex (help page) advises against using ranges. I investigated whether there might be any simplification using my local ASCII sequence of punctuation, but it quickly became obvious that was not the way to go for other reasons. There were 5 separate ranges, and the "]" was in the middle of one of them so there would have been 7 ranges to handle in addition to the "]" which needs to come first.

Remove everything after space in string

I would like to remove everything after a space in a string.
For example:
"my string is sad"
should return
"my"
I've been trying to figure out how to do this using sub/gsub but have been unsuccessful so far.
You may use a regex like
sub(" .*", "", x)
See the regex demo.
Here, sub will only perform a single search and replace operation, the .* pattern will find the first space (since the regex engine is searching strings from left to right) and .* matches any zero or more characters (in TRE regex flavor, even including line break chars, beware when using perl=TRUE, then it is not the case) as many as possible, up to the string end.
Some variations:
sub("[[:space:]].*", "", x) # \s or [[:space:]] will match more whitespace chars
sub("(*UCP)(?s)\\s.*", "", x, perl=TRUE) # PCRE Unicode-aware regex
stringr::str_replace(x, "(?s) .*", "") # (?s) will force . to match any chars
See the online R demo.
strsplit("my string is sad"," ")[[1]][1]
or, substitute everything behind the first space to nothing:
gsub(' [A-z ]*', '' , 'my string is sad')
And with numbers:
gsub('([0-9]+) .*', '\\1', c('c123123123 0320.1'))
If you want to do it with a regex:
gsub('([A-z]+) .*', '\\1', 'my string is sad')
Stringr is your friend.
library(stringr)
word("my string is sad", 1)

Resources