How to remove VERY special characters from strings in R?

I'm trying to remove some VERY special characters from my strings.
I've read other posts, such as:
Remove all special characters from a string in R?
How to remove special characters from a string?
but these are not what I'm looking for.
Let's say my string is as follows:
s = "who are í ½í¸€ bringing?"
I've tried the following:
test = tm_map(s, function(x) iconv(enc2utf8(x), sub = "byte"))
test = iconv(s, 'UTF-8', 'ASCII')
Neither of the above worked.
Edit:
I am looking for a GENERAL solution!
I cannot (and would prefer not to) manually identify all the special characters.
Also, these VERY special characters MAY (not 100% sure) be the result of emoticons.
Please help, or point me to the right posts.
Thank you!

So, I'm going to go ahead and make an answer, because I believe this is what you're looking for:
> s = "who are í ½í¸€ bringing?"
> rmSpec <- "í|½|€" # The "|" designates a logical OR in regular expressions.
> s.rem <- gsub(rmSpec, "", s) # gsub replaces every match of rmSpec with "".
> s.rem
[1] "who are ¸ bringing?"
Now, this does have the caveat that you have to manually define the special characters in the rmSpec variable. I'm not sure whether you know which special characters to remove or whether you're looking for a more general solution.
EDIT:
So it appears you almost had it with iconv; you were just missing the sub argument, which says what to substitute for characters that cannot be converted. See below:
> s
[1] "who are í ½í¸€ bringing?"
> s2 <- iconv(s, "UTF-8", "ASCII", sub = "")
> s2
[1] "who are bringing?"
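If it helps, here is a minimal sketch of two general ways to drop everything outside ASCII. It assumes the odd characters come from an emoji, so the input is built with a \U escape (my own stand-in, not the exact bytes from the question) and the example does not depend on the file's encoding:

```r
# Stand-in input: an emoji written as a \U escape (assumption, see above).
s <- "who are \U0001F600 bringing?"

# Option 1: convert to ASCII and drop whatever cannot be represented.
ascii_only <- iconv(s, from = "UTF-8", to = "ASCII", sub = "")

# Option 2: delete every character outside the ASCII range with a regex.
regex_only <- gsub("[^\\x01-\\x7F]", "", s, perl = TRUE)

# Both leave a double space where the emoji was; collapse it if that matters.
cleaned <- gsub(" +", " ", ascii_only)
print(cleaned)
```

Note that both options also strip accented letters, so they only fit when plain ASCII output is acceptable.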

Related

Remove all special characters from string except Turkish ones

There are tons of similar questions, but I couldn't find the exact answer.
I have a text like this:
str <- "NG*-#+ÜÇ12 NET GROUPنت ياترم "
I want to remove all special and non-Turkish characters and keep the others. The desired output is:
"NGÜÇ12 NET GROUP"
I really appreciate your help.
Please try
library(stringr)
str <- "NG*-#+ÜÇ12 NET GROUPنت ياترم "
str_replace_all(str, '[^[\\da-zA-Z ÜüİıÇçŞşĞğ]]', '')
Using base gsub:
gsub("[^0-9a-zA-Z ÜüİıÇçŞşĞğ]", "", str)
# [1] "NGÜÇ12 NET GROUP "
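One thing to watch: the Turkish alphabet also contains Ö and ö, which the character class above does not list. A small sketch of the same whitelist idea, with the Turkish letters written as \u escapes so the pattern survives copy-and-paste and encoding changes (the variable names are mine):

```r
# Turkish-specific letters: Ç ç Ğ ğ İ ı Ö ö Ş ş Ü ü
turkish <- "\u00c7\u00e7\u011e\u011f\u0130\u0131\u00d6\u00f6\u015e\u015f\u00dc\u00fc"

# Whitelist: digits, ASCII letters, space, and the Turkish letters above.
pattern <- paste0("[^0-9A-Za-z ", turkish, "]")

# Same text as in the question, with non-ASCII characters escaped.
x <- "NG*-#+\u00dc\u00c712 NET GROUP\u0646\u062a \u064a\u0627\u062a\u0631\u0645 "

# Remove everything else, then trim the leftover trailing spaces.
out <- trimws(gsub(pattern, "", x))

# A word containing Ö is kept intact as well.
out2 <- gsub(pattern, "", "G\u00d6L!")
```

trimws() is only there because the spaces around the removed Arabic text are kept by the whitelist.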

How to use `regexp` to remove all characters that are not Chinese or English

There is ori_string. How can I use regexp to remove all characters that are not Chinese or English? Thanks!
ori_string<-"没a w t _ 中/国.sz"
The desired result is
"没awt中国sz"
I have coded it in Python, as you didn't specify a language. The idea is here.
import re
def remove_non_english_chinese(text):
    # Match any character that is not an ASCII letter, a digit,
    # or a CJK ideograph in the range U+4E00-U+9FFF
    pattern = r'[^a-zA-Z0-9\u4e00-\u9fff]'
    # Replace every matched character with an empty string
    return re.sub(pattern, '', text)
Seems you want to remove punctuation and spaces:
> regex <- '[[:punct:][:space:]]+'
> gsub(regex, '', ori_string)
[1] "没awt中国sz"
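For a whitelist version in R, the Python pattern above can be carried over more or less directly with perl = TRUE. This is only a sketch: it assumes that "Chinese" means the U+4E00 to U+9FFF block used in the Python answer, and the input is written with \u escapes to avoid encoding trouble:

```r
# Same string as ori_string in the question, with the Chinese characters escaped.
ori_string <- "\u6ca1a w t _ \u4e2d/\u56fd.sz"

# Keep ASCII letters, digits and U+4E00-U+9FFF; remove everything else.
keep_only <- gsub("[^A-Za-z0-9\\x{4e00}-\\x{9fff}]", "", ori_string, perl = TRUE)
```

Unlike the [[:punct:][:space:]] pattern, this also removes symbols and letters from other scripts, which may or may not be what you want.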

substitute string when there is a dot + number + ':'

I have strings that look like these:
> ABCD.1:f_HJK
> ABFD.1:f_HTK
> CJD:f_HRK
> QQYP.2:f_HDP
So basically, there is always a string in the first part, optionally followed by a '.' and a number, and after that there is always a ':' and another string.
I would like to remove the '.' + number part when it is present in the string, using R.
I know that regular expressions could be useful here, but I have no idea how to apply them in this context. I know that I can replace the '.' with gsub, but I don't know how to include the number and the ':' in the pattern.
Thank you for your help.
Does this work:
v <- c('ABCD.1:f_HJK','ABFD.1:f_HTK','CJD:f_HRK','QQYP.2:f_HDP')
v
[1] "ABCD.1:f_HJK" "ABFD.1:f_HTK" "CJD:f_HRK" "QQYP.2:f_HDP"
gsub('([A-Z]{,4})(\\.\\d)?(:.*)','\\1\\3',v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
You could also use any of the following, depending on the structure of your strings.
If there is no other period followed by digits in the string:
sub("\\.\\d+", "", v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
If you only want to match the pattern at the start of the string (letters, then '.', digits and ':'):
sub("^([A-Z]+)\\.\\d+:", "\\1:", v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
Same as above, but using perl = TRUE with \K, so no capture groups are needed:
sub("^[A-Z]+\\K\\.\\d+", "", v, perl = TRUE)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
If I understood your explanation correctly, this should do the trick:
gsub("(\\.\\d+)", "", string)
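Since the question asks how to bring the ':' into the pattern, here is one more sketch using a lookahead, so that '.' + digits is removed only when a ':' follows directly. The last element of v is a made-up extra case to show the difference from the plain "\\.\\d+" pattern:

```r
# The four strings from the question plus one hypothetical extra case.
v <- c("ABCD.1:f_HJK", "ABFD.1:f_HTK", "CJD:f_HRK", "QQYP.2:f_HDP", "CJD:f_H.2K")

# (?=:) requires a ':' right after the digits without consuming it.
with_lookahead <- sub("\\.\\d+(?=:)", "", v, perl = TRUE)

# For comparison: the plain pattern also strips the ".2" in the extra case.
plain <- sub("\\.\\d+", "", v)
```

Here with_lookahead leaves "CJD:f_H.2K" untouched, while plain turns it into "CJD:f_HK".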

extract two hashtags next to each other in r

I'm analyzing twitter data and would like to extract all the hashtags in tweets. I used to extract hashtags like this:
tweet = 'I like #apple #orange'
str_extract_all(tweet,"#\\S+")
This works in most situations. But sometimes two hashtags are next to each other.
tweet = 'I like #apple#orange'
str_extract_all(tweet,"#\\S+")
what I got is this:
[[1]]
[1] "#apple#orange"
Does anyone know how I can properly extract hashtags when they are either separated or next to each other?
You are overmatching with \S because it matches any non-whitespace character, including #.
You could use a negated character class that matches neither a whitespace character nor a #:
#[^#\\s]+
Your code might look like
tweet = 'I like #apple#orange'
str_extract_all(tweet,"#[^#\\s]+")
Result
[[1]]
[1] "#apple" "#orange"
My guess is that this simple expression might work:
#([^#\s]+)
which excludes spaces and #s after the first #.
Another (arguably less concise) base possibility:
gsub("([a-z](?=#))(#\\w)","\\1 \\2",
strsplit(tweet," (?=#+)",perl = TRUE)[[1]][2], perl=TRUE)
[1] "#apple #orange"
If you need them separated:
strsplit(gsub("([a-z](?=#))(#\\w)","\\1 \\2",
strsplit(tweet," (?=#+)",perl = TRUE)[[1]][2], perl=TRUE),
" ")
[[1]]
[1] "#apple" "#orange"
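If you would rather stay in base R, the same negated character class can be used with gregexpr() and regmatches(). A minimal sketch; tweets is just my name for a vector holding both example tweets:

```r
# Both tweets from the question: hashtags separated and hashtags glued together.
tweets <- c("I like #apple #orange", "I like #apple#orange")

# A '#' followed by one or more characters that are neither '#' nor whitespace.
tags <- regmatches(tweets, gregexpr("#[^#[:space:]]+", tweets))
print(tags)
```

The result is a list with one character vector of hashtags per tweet, like the output of str_extract_all().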

How to Convert "space" into "%20" with R

As the title says, I'm trying to figure out how to convert the spaces between words to %20.
For example,
> y <- "I Love You"
How can I make y equal to I%20Love%20You?
> y
[1] "I%20Love%20You"
Thanks a lot.
Another option would be URLencode():
y <- "I love you"
URLencode(y)
[1] "I%20love%20you"
gsub() is one option:
R> gsub(pattern = " ", replacement = "%20", x = y)
[1] "I%20Love%20You"
The function curlEscape() from the package RCurl gets the job done.
library('RCurl')
y <- "I love you"
curlEscape(urls=y)
[1] "I%20love%20you"
I like URLencode(), but be aware that it may not do what you expect if your URL already contains a %20 together with a real space: by default it leaves a string that already looks encoded untouched, and the repeated option re-encodes the existing % as well, so neither gives you what you want.
In my case, I needed to run both URLencode() and gsub consecutively to get exactly what I needed, like so:
a = "encoded%20space/real space.csv"
URLencode(a)
#returns: "encoded%20space/real space.csv"
#note the spaces that are not transformed
URLencode(a, repeated=TRUE)
#returns: "encoded%2520space/real%20space.csv"
#note the %2520 in the first part
gsub(" ", "%20", URLencode(a))
#returns: "encoded%20space/real%20space.csv"
In this particular example, gsub() alone would have been enough, but URLencode() is of course doing more than just replacing spaces.
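Another way to handle a half-encoded URL, as a sketch only: decode it first with URLdecode() and then encode the result, so existing escapes and real spaces end up treated the same way. The variable names are mine, and this assumes the input contains no literal % that is meant to stay as it is:

```r
# A URL that mixes an existing %20 with a real space.
u <- "encoded%20space/real space.csv"

# Decode first, then encode: every space becomes %20 exactly once.
normalized <- URLencode(URLdecode(u))

# reserved = TRUE additionally escapes reserved characters such as "/".
strict <- URLencode("a b/c", reserved = TRUE)
```

With reserved = TRUE the slash is escaped too, which is what you usually want for a single query value but not for a whole path.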
