How to remove ascii characters from a string in r

Perhaps I don't understand the nuances of ASCII, but I am failing to remove encodings from a string.
The input string is:
mystring<-"complications: noneco-morbidity:nil \\x0c\\\\xd6\\p__"
My desired output is:
"complications: noneco-morbidity:nil __"
My attempt is:
iconv(mystring, "latin1", "ASCII", sub = "")
but nothing is removed

Use the following pattern as a regular expression with gsub:
"[\\x00-\\x7F]+"
This class is intended to catch non-ASCII characters, but as written it spans the ASCII range (0x00-0x7F); a true non-ASCII match would need the negated form "[^\\x00-\\x7F]+". gsub removes whatever the pattern matches (replacement = ""), so the result below is not exactly the desired string.
Example:
gsub(pattern = "[\\x00-\\x7F]+", replacement = "", "complications: noneco-morbidity:nil \\x0c\\\\xd6\\p__")
[1] "complications noneco-morbiditynil cdp__"

Below is not a clean solution, but it still might be useful.
gsub("x0c|xd6|\\\\p|\\\\", "", mystring)

Related

How to use `regexp` to remove all characters not in Chinese or English

There is ori_string; how can I use a regexp to remove all the characters that are not Chinese or English? Thanks!
ori_string<-"没a w t _ 中/国.sz"
The desired result is:
"没awt中国sz"
I have coded it in Python, since you didn't specify a language. The idea is as follows:
import re

def remove_non_english_chinese(text):
    # Match any character that is not an ASCII letter, a digit,
    # or a CJK Unified Ideograph (U+4E00-U+9FFF).
    pattern = r'[^a-zA-Z0-9\u4e00-\u9fff]'
    # Replace all non-English and non-Chinese characters with an empty string
    return re.sub(pattern, '', text)
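Since the question is about R, the same character class can be used directly with gsub; a minimal sketch, assuming a PCRE-capable engine via perl = TRUE:
ori_string <- "没a w t _ 中/国.sz"
# Keep ASCII letters, digits and CJK Unified Ideographs (U+4E00-U+9FFF); drop everything else.
gsub("[^a-zA-Z0-9\u4e00-\u9fff]", "", ori_string, perl = TRUE)
# [1] "没awt中国sz"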
Seems you want to remove punctuation and spaces:
> regex <- '[[:punct:][:space:]]+'
> gsub(regex, '', ori_string)
[1] "没awt中国sz"

Regex expression to remove leftover ascii hex code

I am analysing some tweets and I have written a basic emoji-to-text dictionary. I use the following to convert emojis to R-encoded Unicode:
df$text <- iconv(df$text, from = "latin1", to = "ascii", sub = "byte")
After that I swap the Unicode for a text string that describes the emoji; for example <c2><ae> becomes 'copyright'.
The problem is that I have a lot of emojis that aren't in the dictionary and I need to remove the strings that represent them. I can remove the <> symbols with "[[:punct:]]", "", but I need to get rid of the alphanumeric characters inside the <>'s too.
I was thinking something like
gsub("^<", "")
but I'm honestly stumped on how to find the < > symbols and remove anything found between them, or how to make a regex that finds < then removes it and the next 3 characters.
Appreciate any help
example
text <- ("have a <ed><a0><bd><ed><b8><80> day")
gsub("[[:punct:]]", "", text)
gives "have a eda0bdedb880 day"
but I want "have a day"
We can use a regex to match the <, followed by characters that are not a space ([^ ]+), ending in >, and replace the match (plus any trailing space) with blank ("").
gsub("\\<[^ ]+\\>\\s*", "", text, perl = TRUE)
#[1] "have a day"

How to throw out spaces and underscores only from the beginning of the string?

I want to ignore the spaces and underscores in the beginning of a string in R.
I can write something like
txt <- gsub("^\\s+", "", txt)
txt <- gsub("^\\_+", "", txt)
But I think there could be a more elegant solution:
txt <- " 9PM 8-Oct-2014_0.335kwh "
txt <- gsub("^[\\s+|\\_+]", "", txt)
txt
The output should be "9PM 8-Oct-2014_0.335kwh ". But my code gives " 9PM 8-Oct-2014_0.335kwh ".
How can I fix it?
You could bundle the \s and the underscore in a single character class and use a quantifier to repeat that 1+ times:
^[\s_]+
For example:
txt <- gsub("^[\\s_]+", "", txt, perl=TRUE)
Or, as @Tim Biegeleisen points out in the comments, since only the first occurrence is being replaced you could use sub instead:
txt <- sub("^[\\s_]+", "", txt, perl=TRUE)
Or using a POSIX character class:
txt <- sub("^[[:space:]_]+", "", txt)
The stringr package offers some task-specific functions with helpful names. In your original question you say you would like to remove whitespace and underscores from the start of your string, but in a comment you imply that you also wish to remove the same characters from the end of the same string. To that end, I'll include a few different options.
Given string s <- " \t_blah_ ", which contains whitespace (spaces and tabs) and underscores:
library(stringr)
# Remove whitespace and underscores at the start.
str_remove(s, "[\\s_]+")
# [1] "blah_ "
# Remove whitespace and underscores at the start and end.
str_remove_all(s, "[\\s_]+")
# [1] "blah"
In case you're looking to remove whitespace only – there are, after all, no underscores at the start or end of your example string – there are a couple of stringr functions that will help you keep things simple:
# `str_trim` trims whitespace (spaces, tabs, newlines) from either or both sides.
str_trim(s, side = "left")
# [1] "_blah_ "
str_trim(s, side = "right")
# [1] " \t_blah_"
str_trim(s, side = "both") # This is the default.
# [1] "_blah_"
# `str_squish` reduces repeated whitespace anywhere in string.
s <- " \t_blah blah_ "
str_squish(s)
# "_blah blah_"
The same pattern [\\s_]+ will also work in base R's sub or gsub, with some minor modifications, if that's your jam (see Thefourthbird's answer).
You can use stringr as:
txt <- " 9PM 8-Oct-2014_0.335kwh "
library(stringr)
str_trim(txt)
[1] "9PM 8-Oct-2014_0.335kwh"
Or the trimws in Base R
trimws(txt)
[1] "9PM 8-Oct-2014_0.335kwh"

R how to remove VERY special characters in strings?

I'm trying to remove some VERY special characters in my strings.
I've read other posts like:
Remove all special characters from a string in R?
How to remove special characters from a string?
but these are not what I'm looking for.
Let's say my string is the following:
s = "who are í ½í¸€ bringing?"
I've tried the following:
test = tm_map(s, function(x) iconv(enc2utf8(x), sub = "byte"))
test = iconv(s, 'UTF-8', 'ASCII')
None of the above worked.
Edit:
I am looking for a GENERAL solution!
I cannot (and prefer not to) manually identify all the special characters.
Also, these VERY special characters MAY (not 100% sure) be the result of emoticons.
Please help or guide me to the right posts.
Thank you!
So, I'm going to go ahead and make an answer, because I believe this is what you're looking for:
> s = "who are í ½í¸€ bringing?"
> rmSpec <- "í|½|€" # The "|" designates a logical OR in regular expressions.
> s.rem <- gsub(rmSpec, "", s) # gsub replaces any matches of rmSpec with "".
> s.rem
[1] "who are ¸ bringing?"
Now, this does have the caveat that you have to manually define the special characters in the rmSpec variable. Not sure if you know which special characters to remove or if you're looking for a more general solution.
EDIT:
So it appears you almost had it with iconv, you were just missing the sub argument. See below:
> s
[1] "who are í ½í¸€ bringing?"
> s2 <- iconv(s, "UTF-8", "ASCII", sub = "")
> s2
[1] "who are bringing?"

Remove parenthesis from a character string

I am trying to remove a parenthesis from a string in R and run into the following error:
string <- "log(M)"
gsub("log", "", string) # Works just fine
gsub("log(", "", string) #breaks
# Error in gsub("log(", "", test) :
# invalid regular expression 'log(', reason 'Missing ')''
Escape the parenthesis with a double-backslash:
gsub("log\\(", "", string)
(Obligatory: http://xkcd.com/234/)
Ben's answer gives you the generally applicable way of doing this.
Alternatively, in your situation you could use the fixed=TRUE argument, like this:
gsub("log(", "", string, fixed=TRUE)
# [1] "M)"
It is appropriate whenever the pattern argument to gsub() is a character string containing the literal sequence of characters you are searching for. Then, it's nice because it allows you to type the exact pattern that you are searching for, without escapes etc.
If you are not a regex specialist (many of us are not!), I find it more straightforward to separate the removal of the unneeded text and the parens, provided your query supports that.
The question seems to indicate only wanting to remove parens, so you could use:
gsub(paste(c("[(]", "[)]"), collapse = "|"), "", string)
This results in the string without parens: "logM"
If you also want to remove the "M":
gsub(paste(c("M", "[(]", "[)]"), collapse = "|"), "", string)
This results in "log".
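A slightly more compact way to write the same thing is a single bracket expression, since parentheses need no escaping inside [ ]:
string <- "log(M)"
gsub("[()]", "", string)
# [1] "logM"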
