I'm working on reading transcripts of dialogue into R. However, I run into a bump with special characters like curly quotes, en and em dashes, etc. Typically I replace these special characters in a Microsoft product first with find-and-replace. Usually I replace them with plain-text equivalents, but on some occasions I want to replace them with other characters (e.g., I replace “ ” with { }). This is tedious and not always thorough. If I could read the transcripts into R as is and then use Encoding to switch them to a recognizable Unicode format, I could gsub them out and replace them with plain-text versions. However, the file is read in in a way I don't understand.
Here's an xlsx of what my data may look like:
http://dl.dropbox.com/u/61803503/test.xlsx
This is what is in the .xlsx file:
text num
“ ” curly quotes 1
en dash (–) and the em dash (—) 2
‘ ’ curly apostrophe-ugg 3
… ellipsis are uck in R 4
This can be read into R with:
URL <- "http://dl.dropbox.com/u/61803503/test.xlsx"
library(gdata)
z <- read.xls(URL, stringsAsFactors = FALSE)
The result is:
text num
1 “ †curly quotes 1
2 en dash (–) and the em dash (—) 2
3 ‘ ’ curly apostrophe-ugg 3
4 … ellipsis are uck in R 4
So I tried to use iconv to convert to UTF-8:
iconv(z[, 1], "latin1", "UTF-8")
This gives:
[1] "â\u0080\u009c â\u0080\u009d curly quotes" "en dash (â\u0080\u0093) and the em dash (â\u0080\u0094)"
[3] "â\u0080\u0098 â\u0080\u0099 curly apostrophe-ugg" "â\u0080¦ ellipsis are uck in R"
Which makes gsubing less useful.
What can I do to convert these special characters to distinguishable Unicode so I can gsub them out appropriately? To be more explicit, I was hoping to have z[1, 1] read:
\u201C \u201D curly quotes
To make my desired outcome even clearer: I will web-scrape the tables from a page like Wikipedia's http://en.wikipedia.org/wiki/Quotation_mark_glyphs and use the Unicode reference chart to replace characters appropriately. So I need the characters to be in Unicode or some other standard format that I can systematically go through to replace the characters. Maybe they already are and I'm missing it.
PS: I don't save the files as .csv or plain text because the special characters get replaced with ?, hence the use of read.xls. I'm not attached to any particular method of reading in the file (i.e., read.xls) if you've got a better alternative.
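To illustrate the replacement step I'm after, here is a sketch of the goal, assuming the column were already clean UTF-8:
x <- c("\u201C \u201D curly quotes", "en dash (\u2013) and the em dash (\u2014)")
x <- gsub("\u201C", "{", x, fixed = TRUE)  # left double quote  -> {
x <- gsub("\u201D", "}", x, fixed = TRUE)  # right double quote -> }
x <- gsub("\u2013|\u2014", "-", x)         # en/em dash -> plain hyphen
x
# [1] "{ } curly quotes"                 "en dash (-) and the em dash (-)"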
Maybe this will help (I'll have access to a Windows machine tomorrow and can probably play with it more at that point if SO doesn't get you the answer first).
On my Linux system, when I do the following:
iconv(z$text, "", "cp1252")
I get:
[1] "\x93 \x94 curly quotes" "en dash (\x96) and the em dash (\x97)"
[3] "\x91 \x92 curly apostrophe-ugg" "\x85 ellipsis are uck in R"
This is not UTF-8, but (I believe) the cp1252 byte values written as hex escapes. Still, if you are able to get to this point as well, then you should be able to use gsub the way you intend to.
See this page (reserved section in particular) for conversions.
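If you can get that far, here is a hedged sketch of the replacement step (useBytes = TRUE keeps gsub from complaining about the non-UTF-8 bytes in a UTF-8 locale; the { } substitution mirrors the question):
w <- iconv(z$text, "", "cp1252")
w <- gsub("\x93", "{", w, fixed = TRUE, useBytes = TRUE)  # cp1252 0x93 = left double quote
w <- gsub("\x94", "}", w, fixed = TRUE, useBytes = TRUE)  # cp1252 0x94 = right double quote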
Update
You can also try converting to an encoding that doesn't have those characters, like ASCII, and setting sub to "byte". On my machine, that gives me:
iconv(z$text, "", "ASCII", "byte")
# [1] "<e2><80><9c> <e2><80><9d> curly quotes"
# [2] "en dash (<e2><80><93>) and the em dash (<e2><80><94>)"
# [3] "<e2><80><98> <e2><80><99> curly apostrophe-ugg"
# [4] "<e2><80><a6> ellipsis are uck in R"
It's ugly, but UTF-8 (e2, 80, 9c) is a left curly quote (each character, I believe, is a set of three values in angle brackets). You can find conversions at this site, where you can search by punctuation mark name.
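Once the bytes are spelled out as plain ASCII tokens like that, ordinary fixed-string gsub calls can do the replacement; a sketch along the lines of the { } example in the question:
zz <- iconv(z$text, "", "ASCII", "byte")
zz <- gsub("<e2><80><9c>", "{", zz, fixed = TRUE)  # U+201C left double quote
zz <- gsub("<e2><80><9d>", "}", zz, fixed = TRUE)  # U+201D right double quote
zz[1]
# [1] "{ } curly quotes"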
Try
> iconv(z, "UTF-8", "UTF-8")
[1] "c(\"“—” curly quotes\", \"en dash (–) and the em dash (—)\", \"‘—’ curly apostrophe-ugg\", \"… ellipsis are uck in R\")"
[2] "c(1, 2, 3, 4)"
Windows is very problematic with encodings. Maybe you can look at http://www.vmware.com/products/player/ and run Linux.
This works on my Windows box. The initial input was as you had it. You may have a different experience.
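One note on the call above: iconv() works on character vectors, so passing the whole data frame runs it through as.character(), which produces the deparsed c(...) strings shown. Applying it to the text column keeps the rows separate; a sketch with the same z:
iconv(z$text, "UTF-8", "UTF-8")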
Related
If we type letters in R we get all the lowercase letters of the English alphabet. However, there are many more possible characters, like ä, é and so on, and there are symbols like $ or (, too. I found this table of Unicode characters, which is exactly what I need. Of course I do not want to copy and paste hundreds of possible Unicode characters into one vector.
What I've tried so far: the table gives the decimals for (some of) the Unicode characters. For example, see the following small table:
Glyph Decimal Unicode Usage in R
! 33 U+0021 "\U0021"
So if we type "\U0021" we get a !. Further, paste0("U", format(as.hexmode(33), width = 4, flag = "0")) returns "U0021", which is quite close to what I need, but adding \ results in an error:
paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"
I am stuck. And I am afraid that even if I figure out how to transform numbers into characters using as.hexmode(), there is still the problem that decimals are not listed for all Unicode characters (see the table; the decimals end at 591).
Any idea how to generate a vector with all the Unicode characters listed in the linked table?
(The question started with a real-world problem, but now I am mostly just eager to know how to do this.)
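As a quick aside on the error above: for the narrow sub-problem of turning a decimal code point into its character, intToUtf8() sidesteps the \U escape parsing entirely:
intToUtf8(33)
# [1] "!"
intToUtf8(c(33, 228))  # 228 is the decimal code point of "ä"
# [1] "!ä"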
There may be easier ways to do this, but here goes. The Unicode package contains everything you need.
First we can get a list of Unicode scripts and their block ranges:
library(Unicode)
uranges <- u_scripts()
Check what we've got:
head(uranges, 3)
$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B U+1E950..U+1E959 U+1E95E..U+1E95F
$Ahom
[1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726 U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F
[11] U+11740..U+11746
$Anatolian_Hieroglyphs
[1] U+14400..U+14646
Next we can expand the ranges into sequences of individual code points:
expand_uranges <- lapply(uranges, as.u_char_seq)
To get a single vector of all characters we can unlist it. This won't be easy to work with so really it would be better to keep them as a list:
all_unicode_chars <- unlist(expand_uranges)
# The Wikipedia page linked states there are 144,697 characters
length(all_unicode_chars)
[1] 144762
So that seems to be all of them, and the page needs updating. They are stored as integers, so to print them (assuming the glyph is supported) we can do, for example, the Japanese katakana:
intToUtf8(expand_uranges$Katakana[[1]])
[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"
I have to query an API with URL encoding according to RFC 3986, knowing that I have accented characters in my query.
For instance, this argument :
quel écrivain ?
should be encoded like this:
quel%20%C3%A9crivain%20%3F%0D%0A
Unfortunately, when I use URLencode, encoding, url_encode, or curlEscape, I get the following encoding:
URLencode("quel écrivain ?")
[1] "quel%20%E9crivain%20?"
The problem is with accented letters: for instance, "é" is converted into "%E9" instead of "%C3%A9"...
I have struggled with this URL encoding without finding any solution... As I don't have control over the API, I don't know how it handles the encoding.
A weird thing is that using POST instead of GET leads to a response in which words with accents are cut into two parts:
"1\tquel\tquel\tDET\tDET\tGender=Masc|Number=Sing\t5\tdet\t0\t_\n4\t<U+FFFD>\t<U+FFFD>\tSYM\tSYM\t_\t5\tcompound\t0\t_\n5\tcrivain\tcrivain\
As you can see, "écrivain" is split into "<U+FFFD>" (the Unicode replacement character, standing where the "é" should be) and "crivain".
This encoding problem is driving me mad; if a brilliant mind could help me, I would be very grateful!
Set reserved = TRUE
i.e.
your_string <- "quel écrivain ?"
URLencode(your_string, reserved = TRUE)
# [1] "quel%20%C3%A9crivain%20%3F"
I do not think I am a brilliant mind, but I still have a possible solution for you. After using URLencode(), it seems that your accented characters are converted into the trailing part of their Unicode representation, preceded by a %. To make them readable again, you might turn them into "real Unicode" escapes and use the stringi package to unescape them. For your single string, the solution worked on my machine at least. I hope it also works for you.
Please note that I have introduced a % character at the end of your string to demonstrate that the gsub command below should work in any case.
You might have to adapt the replacement pattern \\u00 to also cover Unicode code points where more than the last two positions are filled with something other than 0, if that is relevant in your case.
library(stringi)
str <- "quel écrivain ?"
str <- URLencode(str)
#"quel%20%E9crivain%20?"
# replacing % with a single backslash would give the \u.. escape form directly,
# but a lone backslash is an escape character in R source, hence the "\\"
str <- gsub("%", paste0("\\", "u00"), str , fixed = T)
#[1] "quel\\u0020\\u00E9crivain\\u0020?"
# the backslash-u sequences are still literal text, so use stringi's
# stri_unescape_unicode() to convert them into the actual characters
str <- stri_unescape_unicode(str)
#[1] "quel écrivain ?"
Say I want to print degrees Celsius in R, I could use unicode like this:
print("\U00B0 C")
[1] "° C"
Note, however, the space. I don't want it there, so I remove it:
print("\U00B0C")
[1] "ଌ"
Clearly, 00B0C is Unicode for a very different character! Presumably, any hex digit that follows the escape is, understandably, interpreted as part of the code point. I could use paste or something similar, like this:
print(paste("\U00B0","C", sep = ""))
[1] "°C"
but is there a more concise way to indicate that the Unicode escape is finished and that I'm now just using regular letters?
Use lower case u:
print("\u00B0C")
I am trying to remove specific multi-byte characters in R.
Multibyte <- "Sungpil_한성필_韓盛弼_Han"
The linguistic structure of Multibyte is "English_Korean_Chinese_English". What I want to remove is the Korean word only or the Chinese word only (not both).
A desired result is either:
Sungpil_한성필__Han # Chinese characters were removed.
or
Sungpil__韓盛弼_Han # Korean characters were removed.
Is there a simple way to do it by using gsub? I am only aware of a method to get English-only characters.
gsub("[^A-Za-z_]", "", Multibyte)
[1] "Sungpil___Han"
Answering the question itself, yes, you may do it with a mere gsub using a PCRE regex and Unicode property classes \p{Hangul} for matching Korean chars, and \p{Han} to match Chinese chars:
> Multibyte <- "Sungpil_한성필_韓盛弼_Han"
> gsub("\\p{Hangul}+", "",Multibyte, perl=TRUE)
[1] "Sungpil__韓盛弼_Han"
> gsub("\\p{Han}+", "",Multibyte, perl=TRUE)
[1] "Sungpil_한성필__Han"
See R online demo.
However, if you have a specific structure of the input text, use the other solution.
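For completeness, the same script classes are available through ICU regular expressions in stringi (an alternative sketch, not needed for the gsub approach above):
library(stringi)
stri_replace_all_regex(Multibyte, "\\p{Hangul}+", "")
# [1] "Sungpil__韓盛弼_Han"
stri_replace_all_regex(Multibyte, "\\p{Han}+", "")
# [1] "Sungpil_한성필__Han"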
We can try with sub
sub("[^_]+_([A-Za-z]+)$", "_\\1", Multibyte)
#[1] "Sungpil_한성필__Han"
I want a character variable in R that takes the value of, let's say, a, and adds " \%", to create a %-sign later in LaTeX.
Usually I'd do something like:
a <- 5
paste(a,"\%")
but this fails.
Error: '\%' is an unrecognized escape in character string starting "\%"
Any ideas? A workaround would be to define another command giving the %-sign in LaTeX, but I'd prefer a solution within R.
As in many other languages, certain characters in strings have a different meaning when they're escaped. One example is \n, which means newline instead of n. When you write \%, R tries to interpret % as a special character and fails to do so. You should escape the backslash itself, so that a literal backslash ends up in the string:
paste(a, "\\%")
You can read about escape sequences here.
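One thing to keep in mind: the doubled backslash in the printed result is only R's representation of a single backslash; cat() shows the string that will actually reach the LaTeX file:
a <- 5
out <- paste(a, "\\%")
print(out)
# [1] "5 \\%"
cat(out)
# 5 \%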
You can also look at the latexTranslate function from the Hmisc package, which will escape special characters in strings to make them LaTeX-compatible:
R> latexTranslate("You want to give me 100$ ? I agree 100% !")
[1] "You want to give me 100\\$ ? I agree 100\\% !"