Converting Unicode characters from a string column in R

I imported a bunch of CSVs, and one of the columns has what I think are Unicode characters,
something like:
PEÃ<U+0083>â<U+0080><U+0098>A
SOPEÃ<U+0083>â<U+0080><U+0098>A
Not in all rows, just some. I've tried to convert them to "human readable" characters, but to no avail.
I tested this solution from SO but with no success so far: unicode characters conversion in R,
and this brute-force substitution didn't work either:
gsub('Ã<U+0083>â<U+0080><U+0098>', 'Ñ', 'PEÃ<U+0083>â<U+0080><U+0098>A')
[1] "Ã<U+0083>â<U+0080><U+0098>"

Related

How to generate all possible unicode characters?

If we type letters in R we get all the lowercase letters of the English alphabet. However, there are many more possible characters, like ä, é and so on, and there are symbols like $ or (, too. I found this table of Unicode characters, which is exactly what I need. Of course I do not want to copy and paste hundreds of Unicode characters into one vector.
What I've tried so far: The table gives the decimals for (some of) the unicode characters. For example, see the following small table:
Glyph Decimal Unicode Usage in R
! 33 U+0021 "\U0021"
So if we type "\U0021" we get a !. Further, paste0("U", format(as.hexmode(33), width = 4, flag = "0")) returns "U0021", which is quite close to what I need, but adding \ results in an error:
paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"
I am stuck. And I am afraid that even if I figure out how to transform numbers to characters using as.hexmode(), there is still the problem that decimals are not listed for all Unicode characters (see the table; the decimals end at 591).
Any idea how to generate a vector with all the unicode characters listed in the table linked?
(The question started with a real-world problem, but now I am mostly just eager to know how to do this.)
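As an aside on the error above: \U escapes are resolved by the R parser, so they cannot be assembled at run time with paste0(). A short sketch of the usual workaround, intToUtf8(), which converts code points to characters directly:
intToUtf8(33)                       # "!" -- decimal code point
intToUtf8(0x0021)                   # same character, hex input
intToUtf8(65:70)                    # "ABCDEF" -- collapsed into one string
intToUtf8(65:70, multiple = TRUE)   # one character per element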
There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First we can get a list of Unicode scripts and their character ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
Next we can expand the ranges into their full sequences of code points:
expand_uranges <- lapply(uranges, as.u_char_seq)
To get a single vector of all code points we can unlist it. This won't be especially easy to work with, so in practice it may be better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So that seems to be all of them, and the page needs updating. The values are stored as integers, so to print them (assuming the glyphs are supported) we can use intToUtf8(); for example, printing Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"

Encoding problem: Convert bytes to Chinese characters in R

I read an HTML file into R that contains Chinese characters, but it shows something like:
" <td class=\"forumCell\">\xbbָ\xb4</td>"
It is the "\x" strings that I need to extract. How can I convert them into readable Chinese characters?
By the way, simply copying and pasting the above \x strings somehow does not reproduce the problem.
Are you sure they are all Chinese characters? What is the HTML page's encoding? The strings you pasted look like a mix of hex bytes (\xc4\xe3) and Unicode escapes (\u0237).
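A hedged sketch of one way forward, assuming the page uses a Chinese byte encoding such as GB2312/GBK (common for Chinese forum pages); the example bytes and file name are illustrative, not taken from the original page.
# Convert raw GBK/GB2312 bytes to UTF-8 with iconv().
x <- "\xbb\xd8\xb8\xb4"               # assumed GBK bytes
iconv(x, from = "GBK", to = "UTF-8")  # some platforms need "GB2312" or "CP936"
# Or declare the encoding when reading the file so conversion happens up front:
con <- file("page.html", encoding = "GBK")   # hypothetical file name
lines <- readLines(con)
close(con)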

R programming - How to remove special characters from a data set?

I have a data set that contains strings, and special characters like the one below can be found in it.
[image of a special character from the original post]
How do I remove special characters like this from my data set?
Use regular expressions to remove unwanted characters, for example:
dataset$textcolumn <- gsub("[^\\w\\s]", "", dataset$textcolumn, perl=TRUE)
to remove everything except word characters and spaces. To do more complex replacements look into the help topic ?regexp.
Also look into the encoding (Encoding() and iconv() are helpful here); maybe the text is correct but the wrong encoding is assumed.
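A hedged alternative sketch, in case the "special characters" are really an encoding artifact rather than genuine symbols; the column name is hypothetical and //TRANSLIT support depends on the platform's iconv:
# Transliterate what can be mapped to ASCII and silently drop the rest.
dataset$textcolumn <- iconv(dataset$textcolumn, from = "", to = "ASCII//TRANSLIT", sub = "")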

How to handle blank items when converting dates in R

I have a csv download of data from a Management Information system. There are some variables which are dates and are written in the csv as strings of the format "2012/11/16 00:00:00".
After reading in the csv file, I convert the date variables into a date using the function as.Date(). This works fine for all variables that do not contain any blank items.
For those which do contain blank items I get the following error message:
"character string is not in a standard unambiguous format"
How can I get R to replace blank items with something like "0000/00/00 00:00:00" so that the as.Date() function does not break? Are there other approaches you might recommend?
If they're strings, does something as simple as
mystr <- c("2012/11/16 00:00:00"," ","")
mystr[grepl("^ *$",mystr)] <- NA
as.Date(mystr)
work? (The regular expression "^ *$" looks for strings consisting of the start of the string (^), zero or more spaces ( *), followed by the end of the string ($). More generally, I think you could use "^[[:space:]]*$" to capture other kinds of whitespace, tabs etc.)
Even better, have the NAs correctly inserted when you read in the CSV:
read.csv(..., na.strings='')
or specify a vector of all the values which should be read as NA:
read.csv(..., na.strings=c('', ' ', '  '))
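Putting the two ideas together in a short sketch (the file and column names here are hypothetical):
# Blanks become NA at read time; an explicit format keeps as.Date() from
# guessing, and NA values pass through untouched.
dat <- read.csv("mis_export.csv", na.strings = c("", " "), stringsAsFactors = FALSE)
dat$start_date <- as.Date(dat$start_date, format = "%Y/%m/%d")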

Working with Unicode in R

I read text from a MySQL table into an R data frame (using RODBC, sqlFetch). I have two questions:
How do I figure out whether R has read it in as UTF-8? It's of character type, but what's the function to show the encoding?
How do I compute the number of characters in a Unicode string in R? The length function does not work for this and always returns 1, I think.
You should be able to read the encoding (assuming it is specified) with:
Encoding(x)
The number of characters can be determined with:
nchar(x)
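A quick hedged sketch of both answers on a small example (whether the string gets marked as UTF-8 depends on your locale and on how the text was read in):
x <- "Peña"
Encoding(x)               # "UTF-8" if marked as such, otherwise "unknown" or "latin1"
nchar(x)                  # 4 -- counts characters
nchar(x, type = "bytes")  # 5 when stored as UTF-8, since ñ takes two bytes
length(x)                 # 1 -- length() counts vector elements, not characters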
