I have an R tibble with a UTF-8 character column. When I print the contents of this column for a certain problematic record, everything looks fine: one two three. There are, however, problems when I try to use this string in an RDBMS query which I construct in R and send to the database.
If I copy this string to Notepad++ and convert the encoding to ANSI, I can see that the string actually contains some additional characters that cause the problem: one â€two‬ three.
A partial solution is conversion to ASCII:
iconv(my_string, "UTF-8", "ASCII", sub = "")
but this also loses all legitimate non-ASCII characters.
Conversion from UTF-8 to UTF-8 doesn't solve my problem either, presumably because the problem characters are valid UTF-8:
iconv(my_string, "UTF-8", "UTF-8", sub = "")
Is it possible to remove all invisible characters like the ones above without losing the UTF-8 encoding?
That is, how can I convert my string to the form that I see when I print it in R, without the hidden parts?
Not sure I completely understand what you are trying to do, but you can use stringi or stringr to explicitly specify which characters you want to retain. For your example, it could look something like this; you may have to expand the set of retained characters, but this approach is one option:
library(stringr)
my_string <- "one â€two‬ three"
# Keep only upper- and lowercase letters, numbers,
# punctuation, and whitespace.
str_remove_all(my_string, "[^A-Za-z0-9[:punct:]\\s]")
[1] "one two three"
# Just checking
stringi::stri_enc_isutf8(str_remove_all(my_string, "[^A-Za-z0-9[:punct:]\\s]"))
[1] TRUE
EDIT: I do want to note that you should check and see how robust this approach is. I have not dealt with invisible characters often so this may not be the best way to go about removing them.
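An alternative worth testing, and an assumption on my part rather than something from the answers here: strip only the characters in the Unicode "format" (Cf) and "control" (Cc) categories, which cover zero-width spaces, directional marks, and the like, while leaving all printable non-ASCII text intact. The \u200B and \u202C below are assumed stand-ins for the actual hidden characters:
my_string <- "one \u200Btwo\u202C three"
gsub("[\\p{Cf}\\p{Cc}]", "", my_string, perl = TRUE)
[1] "one two three"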
You haven't given us a way to construct your bad string so I can't test this on your data, but it works on this example.
badString <- "one \u200Btwo\u200B three"
chars <- strsplit(badString, "")[[1]] # Assume badString has one entry; if not, add a loop
chars <- chars[nchar(chars, type = "width") > 0]
goodString <- paste(chars, collapse = "")
Both badString and goodString look the same when printed:
> badString
[1] "one two three"
> goodString
[1] "one two three"
but they have different numbers of characters:
> nchar(badString)
[1] 15
> nchar(goodString)
[1] 13
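If you need this for a whole character vector rather than a single string, the same idea can be wrapped in a function; a sketch, under the same assumption that zero-display-width characters are the culprits:
stripInvisible <- function(x) {
  vapply(strsplit(x, ""), function(chars) {
    # keep only characters that occupy visible width when printed
    paste(chars[nchar(chars, type = "width") > 0], collapse = "")
  }, character(1))
}
stripInvisible(c(badString, "four\u200Bfive"))
[1] "one two three" "fourfive"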
I have a string that looks like:
str <- "a\f\r"
I'm trying to remove the backslashes but nothing works:
gsub("\","",str, fixed=TRUE)
gsub("\\","",str)
gsub("(\)","",str)
gsub("([\])","",str)
...basically all the variations you can imagine. I have even tried the str_replace_all function from stringr. Any help?
I'm using R version 3.1.1 on Mac OS X 10.7; the dput for a single string in my vector of strings gives:
dput(line)
"ud83d\ude21\ud83d\udd2b"
I imported the file using readLines from a standard .txt file. The content of the file looks something like:
got an engineer booked for this afternoon \ud83d\udc4d all now hopefully sorted\ud83d\ude0a I m going to go insane ud83d\ude21\ud83d\udd2b in utf8towcs …
Thanks.
One quite universal solution is
gsub("\\\\", "", str)
Thanks to the comment above.
When inputting backslashes from the keyboard, always escape them.
str <-"this\\is\\my\\string" # note doubled backslashes -> 'this\is\my\string'
gsub("\\", "", str, fixed=TRUE) # ditto
str2 <- "a\\f\\r" # ditto -> 'a\f\r'
gsub("\\", "", str2, fixed=TRUE)# ditto
Note that if you do
str <- "a\f\r"
then str contains no backslashes. It consists of the 3 characters a, \f (a form feed, which is not normally printable except as \f), and \r (a carriage return, likewise).
And just to head off a possible question: if your data was read from a file, the file doesn't need to contain doubled backslashes. For example, if you have a file test.txt containing
a\b\c\d\e\f
and you do
str <- readLines("test.txt")
then str will contain the string a\b\c\d\e\f as you'd expect: 6 letters separated by 5 single backslashes. But you still have to type doubled backslashes if you want to work with it.
str <- gsub("\\", "", str, fixed=TRUE) # now contains abcdef
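For a fully reproducible version of that example, you can create the file from R as well (the doubled backslashes in the writeLines call are typed escapes; the file itself gets single backslashes):
writeLines("a\\b\\c\\d\\e\\f", "test.txt")  # file on disk contains a\b\c\d\e\f
str <- readLines("test.txt")
nchar(str)                                  # 11: six letters and five backslashes
gsub("\\", "", str, fixed = TRUE)           # "abcdef"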
From the dput, it looks like what you've got there is UTF-16 encoded text, which probably came from a Windows machine. According to
https://en.wikipedia.org/wiki/Unicode#Character_General_Category
https://en.wikipedia.org/wiki/UTF-16
it encodes glyphs in the Supplementary Multilingual Plane, which is pretty obscure. I'll guess that you need to supply the argument encoding="UTF-16" to readLines when you read in the file.
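One caveat to that guess: readLines()'s encoding= argument only marks how already-read strings should be interpreted; to actually re-encode a UTF-16 file on the way in, open it through a connection, e.g. (file name assumed):
con <- file("tweets.txt", encoding = "UTF-16")  # "UTF-16LE" may be needed for BOM-less Windows files
line <- readLines(con)
close(con)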
Since there isn't any direct way of dealing with single backslashes, here's the closest solution to the problem, as provided by David Arenburg in the comments section:
gsub("[^A-Za-z0-9]", "", str)  # remove everything except letters and numbers
This might be helpful :)
require(stringi)
stri_escape_unicode("ala\\ma\\kota")
## [1] "ala\\\\ma\\\\kota"
stri_unescape_unicode("ala\\ ma\\ kota")
## [1] "ala ma kota"
As of R 4.0.0, you can use raw strings to avoid confusion with backslashes; just use the following syntax: r"(your_raw_expression)" (parentheses included):
str<-r"(ud83d\ude21\ud83d\udd2b)" #Equivalent of "ud83d\\ude21\\ud83d\\udd2b"
gsub(r"(\\)", "", str)
# [1] "ud83dude21ud83dudd2b"
I have a data set called event_table that has a column titled "c.Comments" which contains strings mostly in English, but with some Arabic in some of the comment entries. I want to filter out rows in which the comments entry contains Arabic characters.
I read the data into R from an xlsx file, and the Arabic characters show as UTF-8 escapes such as "<U+4903><U+483D>".
I've tried using regular expressions to achieve what I want, but the strings I'm trying to match refuse to be filtered out. I've tried all kinds of different regular expressions, but none of them seem to do the trick. I've tried filtering out the literal "<U+" prefix:
event_table <- event_table %>%
  filter(!grepl("<U+", c.Comments, fixed = TRUE))
event_table <- event_table %>%
  filter(!grepl("<U\\+", c.Comments))
"\x", "\d\d\d\d", and all sorts of other combinations have done nothing for me
I'm starting to suspect that my method of filtering may be the issue rather than the regular expression, so any suggestions would be greatly appreciated.
Arabic chars can be detected with grep/grepl using a PCRE regex like \p{Arabic}:
> df <- data.frame(x=c("123", "abc", "ﺏ"))
> df
x
1 123
2 abc
3 <U+FE8F>
> grepl("\\p{Arabic}", df$x, perl=TRUE)
[1] FALSE FALSE TRUE
In your case, the code will look like
event_table <- event_table %>%
filter(!grepl("\\p{Arabic}", c.Comments, perl=TRUE))
Look at the ?Quotes help page. The Unicode character associated with a display like <U+4903> may vary with the assumed codepage. On my machine the R character would be created with the string "\u4903", but it prints as a Chinese glyph. The regex engine in R (as documented in the ?regex help page, which you should refer to now) is PCRE.
The pattern in this grepl expression does not match non-ASCII printing characters such as these, so filtering on it will drop them:
grepl("[[:alnum:]]|[[:punct:]]", "\u4903")
[1] FALSE
And I don't think you should be negating that grepl result:
dplyr::filter(data.frame("\u4903"), grepl("[[:alnum:]]|[[:punct:]]", "\u4903"))
[1] X.䤃.
<0 rows> (or 0-length row.names)
dplyr::filter(data.frame("\u4903"), !grepl("[[:alnum:]]|[[:punct:]]", "\u4903"))
X.䤃.
1 䤃
I realize this is a rather simple question and I have searched throughout this site, but I just can't seem to get my syntax right for the following regex challenges. I'm looking to do two things. First, I want the regex to pick up the first three characters and stop at a semicolon. For example, my string might look as follows:
Apt;House;Condo;Apts;
which I'd like to become
Apartment;House;Condo;Apartment
I'd also like to create a regex to substitute a word in between delimiters, while keeping others unchanged. For example, I'd like to go from this:
feline;labrador;bird;labrador retriever;labrador dog; lab dog;
To this:
feline;dog;bird;dog;dog;dog;
Below is the regex I'm working with. I know ^ denotes the beginning of the string and $ the end. I've tried many variations, and am making substitutions, but am not achieving my desired output. I'm also guessing one regex could work for both? Thanks for your help, everyone.
df$variable <- gsub("^apt$;", "Apartment;", df$variable, ignore.case = TRUE)
Here is an approach that uses look behind (so you need perl=TRUE):
> tmp <- c("feline;labrador;bird;labrador retriever;labrador dog; lab dog;",
+ "lab;feline;labrador;bird;labrador retriever;labrador dog; lab dog")
> gsub( "(?<=;|^) *lab[^;]*", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The (?<=;|^) is the look-behind; it says that any match must be preceded by either a semi-colon or the beginning of the string, but what it matches is not included in the part to be replaced. The * matches 0 or more spaces (since your example string had one case with a space between the semi-colon and the lab). The pattern then matches a literal lab followed by 0 or more characters other than a semi-colon. Since * is greedy by default, this will match everything up to, but not including, the next semi-colon or the end of the string. You could also include a positive look-ahead (?=;|$) to make sure the match goes all the way to the next semi-colon or end of string, but in this case the greediness of * will take care of that.
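For completeness, the greedy pattern combined with that explicit look-ahead gives the same results:
> gsub("(?<=;|^) *lab[^;]*(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"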
You could also use the non-greedy modifier, then force to match to end of string or semi-colon:
> gsub( "(?<=;|^) *lab.*?(?=;|$)", "dog", tmp, perl=TRUE)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
The .*? will match 0 or more characters, but as few as it can get away with, stretching just until the next semi-colon or end of line.
You can skip the look behind (and perl=TRUE) if you match the delimiter, then include it in the replacement:
> gsub("(;|^) *lab[^;]*", "\\1dog", tmp)
[1] "feline;dog;bird;dog;dog;dog;"
[2] "dog;feline;dog;bird;dog;dog;dog"
With this method you need to be careful to match the delimiter on only one side (the first, in my example), since the match consumes the delimiter (unlike the look-ahead and look-behind, which do not). If you consume both delimiters, then the next field's leading delimiter has already been used up, so that field will be skipped and only every other field will be considered for replacement.
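A quick illustration of that failure mode: the pattern below consumes the delimiters on both sides, so "labrador dog", whose leading semi-colon was eaten by the previous match, is skipped:
> gsub("(;|^) *lab[^;]*(;|$)", "\\1dog\\2", tmp[1])
[1] "feline;dog;bird;dog;labrador dog;dog;"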
I'd recommend doing this in two steps (plus an optional third):
1. Split the string by the delimiters.
2. Do the replacements.
3. (Optional, if that's what you gotta do) Smash the strings back together.
To split the string, I'd use the stringr library. But you can use base R too:
myString <- "Apt;House;Condo;Apts;"
# base R
splitString <- unlist(strsplit(myString, ";", fixed = TRUE))
# with stringr
library(stringr)
splitString <- as.vector(str_split(myString, ";", simplify = TRUE))
Once you've done that, THEN you can do the text substitution:
# base R
fixedApts <- gsub("^Apt$|^Apts$", "Apartment", splitString)
# with stringr
fixedApts <- str_replace(splitString, "^Apt$|^Apts$", "Apartment")
# then do the rest of your replacements
There's probably a better way to do the replacements than regular expressions (using switch(), maybe?).
Use paste0(fixedApts, collapse = ";") to collapse the vector back into a single delimited string at the end if that's what you need to do.
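Putting the pieces together on the first example (a sketch; here Apt and Apts are collapsed into the single pattern ^Apts?$):
myString <- "Apt;House;Condo;Apts;"
parts <- unlist(strsplit(myString, ";", fixed = TRUE))
parts <- gsub("^Apts?$", "Apartment", parts)  # matches exactly Apt or Apts
paste0(parts, collapse = ";")
[1] "Apartment;House;Condo;Apartment"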
How to selectively remove characters from a string following a pattern?
I wish to remove the 7 figures and the preceding colon.
For example:
"((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
should become
"((Northern_b,Tropical_b)N19"
x <- "((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950"
gsub("[:]\\d{1}[.]\\d{6}", "", x)
The gsub function does string replacement and replaces all matches found in the string (see ?gsub). An alternative, if you want something with a friendlier name, is str_replace_all from the stringr package.
The regular expression makes use of the \\d{n} search, which looks for digits. The integer indicates the number of digits to look for. So \\d{1} looks for a set of digits with length 1. \\d{6} looks for a set of digits of length 6.
gsub('[:][0-9.]+','',x)
[1] "((Northern_b,Tropical_b)N19"
Another approach to solving this problem:
library(stringr)
str1 <- c("((Northern_b:0.005926,Tropical_b:0.000000)N19:0.002950")
str_replace_all(str1, regex("(:\\d{1,}\\.\\d{1,})"), "")
#[1] "((Northern_b,Tropical_b)N19"
GNU R 3.0.2
> bib <- "\cite"
Error: '\c' is an unrecognized escape in character string starting ""\c"
> bib <- "\\cite"
> print(bib)
[1] "\\cite"
> sprintf(bib)
[1] "\\cite"
How can I print out the string variable bib with just one "\"?
(I've tried everything conceivable, and discovered that R treats the "\\" as one character.)
I see that in many cases this is not a problem, since this is usually handled internally by R, say, if the string were to be used as text for a plot.
But I need to send it to LaTeX. So I really have to remove it.
I see cat does the trick. If cat could only be made to send its result to a string.
You should use cat.
bib <- "\\cite"
cat(bib)
# \cite
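A side note, not from the original answers: the string already holds a single backslash, and print() merely displays the escape. Writing it to a file for LaTeX, or capturing cat()'s output into a string, both give the single-backslash form (the file name here is hypothetical):
bib <- "\\cite"
writeLines(bib, "snippet.tex")  # the file contains \cite
s <- capture.output(cat(bib))   # cat()'s printed output, captured as a string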
You can remove the ## and [1] by setting a few options in knitr. Here is an example chunk:
<<newChunk,echo=FALSE,comment=NA,background=NA>>=
bib <- "\\cite"
cat(bib)
@
which gets you \cite. Note as well that you can set these options globally.
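For example, using knitr's documented opts_chunk mechanism in an early chunk:
knitr::opts_chunk$set(echo = FALSE, comment = NA, background = NA)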
There is no backslash in the character element "\cite": the backslash is interpreted as starting an escape sequence, and the two-character sequence "\c" would be a control character, except that it is not a recognized escape, hence the error. See ?Quotes. The second version has only one backslash followed by 4 alpha characters. Count the characters to see this:
nchar("\\cite")
[1] 5
OK,
<<echo=FALSE, results='asis'>>=
result <- cat(bib)
@
does the trick (without the result <- bit, [1] is added). It just feels kludgy.