First time caller.
I just want to change string encoding from UTF-8 to LATIN1. I use Xpath to retrieve the data from the web:
>library(RCurl)
>library(rvest)
>library(XML)
>library(httr)
>library(reshape2)
>library(reshape)
>response <- GET(paste0("http://www.visalietuva.lt/imone/jogminda-uab-telsiai-muziejaus-g-35"))
>doc <- content(response,type="text/html")
>base <- xpathSApply(doc, "//ul//li//span",xmlValue)[5]
As as result I get the following:
>base
[1] "El. paštas"
When I check the encoding I have UTF-8:
>Encoding(base)
[1] "UTF-8"
I suspect I need LATIN1 encoding. So that the result would be "El. paštas", instead of "El. paÅ¡tas".
Although when I specifie the LATIN1 encoding I get the following:
>latin <- iconv(base, from = "UTF-8", to = "LATIN1")
[1] "El. paštas"
i.e. the same result as with UTF-8. Changing the encoding does not help to get "El. paštas".
Moreover I need the correct LATIN1 encoding of the string while saving data to .csv file. I tried to save the data to .csv:
write.table(latin,file = "test.csv")
and get the same strange characters as mentioned above: "El. paštas".
Any advice on how to change the encoding would be more than welcome. Thank you.
Try
doc <- content(response,type="text/html", encoding = "UTF-8")
Related
I'm writing a package function where I want to check if some text contains any of the following characters:
äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ
Problem is that devtools::check() returns a warning:
W checking R files for non-ASCII characters ... Found the
following file with non-ASCII characters:
gb_data_prepare.R Portable packages must use only ASCII characters in their R code, except perhaps in comments. Use
\uxxxx escapes for other characters.
So I tried to convert these characters into unicode, but actually don't really know how.
stringi::stri_encode("äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ", to = "Unicode")
Error in stringi::stri_encode(x, to = "Unicode") :
embedded nul in string: '\\xff\\xfe\\xe4'
doesn't work. Same with
iconv("äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ", from = "UTF-8", to = "Unicode")
Error in iconv(x, from = "UTF-8", to = "Unicode") :
unsupported conversion from 'UTF-8' to 'Unicode' in codepage 1252
Any ideas what I can do?
Note: weird thing also is that if I do:
x <- "äöüßâçèéêîôûąęćśłńóżźžšůěřáúëïùÄÖÜSSÂÇÈÉÊÎÔÛĄĘĆŚŁŃÓŻŹŽŠŮĚŘÁÚËÏÙ"
x now returns "äöüßâçèéêîôûaecslnózzžšueráúëïùÄÖÜSSÂÇÈÉÊÎÔÛAECSLNÓZZŽŠUERÁÚËÏÙ" which is wrong. So I guess it also has something to do with my general R encoding?
I'm trying to save data extracted with RSelenium from https://www.magna.isa.gov.il/Details.aspx?l=he, but although R succeeds printing Hebrew character to the console, it does not when exporting TXT, CSV or in other simple R functions, like data.frame(), readHTMLTable(), etc.
Here goes an example.
> head(lines)
[1] "גלובל פיננס ג'י.אר. 2 בע\"מ נתונים כספיים באלפי דולר ארה\"ב"
[2] "513435404"
[3] ""
[4] ""
[5] ""
[6] "4,481"
First line changes to weird characters (below) when using data.frame()
> head(as.data.frame(lines))
[1] <U+05D2><U+05DC><U+05D5><U+05D1><U+05DC> <U+05E4><U+05D9><U+05E0><U+05E0><U+05E1> <U+05D2>'<U+05D9>.<U+05D0><U+05E8>. 2 <U+05D1><U+05E2>"<U+05DE> <U+05E0><U+05EA><U+05D5><U+05E0><U+05D9><U+05DD> <U+05DB><U+05E1><U+05E4><U+05D9><U+05D9><U+05DD> <U+05D1><U+05D0><U+05DC><U+05E4><U+05D9> <U+05D3><U+05D5><U+05DC><U+05E8> <U+05D0><U+05E8><U+05D4>"<U+05D1>
The same happens when exporting .TXT or .CSV by write.table or write.csv:
write.csv(lines,"lines.csv",row.names=FALSE)
I tried to change the encoding to "UTF-8", like suggested in several alike questions, yet, the issue remains in a different format:
iconv(lines, to = "UTF-8")
1 ׳’׳׳•׳‘׳ ׳₪׳™׳ ׳ ׳¡ ׳’'׳™.׳׳¨. 2 ׳‘׳¢"׳ ׳ ׳×׳•׳ ׳™׳ ׳›׳¡׳₪׳™׳™׳ ׳‘׳׳׳₪׳™ ׳“׳•׳׳¨ ׳׳¨׳”"׳‘
Same for Hebrew ISO-8859-8:
iconv(lines, to = "ISO-8859-8")
1 ×'×o×.×'×o ×₪×T× × ×! ×''×T.×ר. 2 ×'×¢"×z × ×a×.× ×T× ×>×!×₪×T×T× ×'××o×₪×T ×"×.×oר ×ר×""×'
I don't understand why the console prints Hebrew characters well while write.table(), write.csv() and data.frame() presents encoding issues.
Anyone to help me exporting it?
That was answered by Ken, exporting text with writeLines() worked well:
f = file("lines.txt", open = "wt", encoding = "UTF-8")
writeLines(lines, "lines.txt", useBytes = TRUE)
close(f)
Yet, the main issue R has with Hebrew encoding is while dealing with tables, in the form of as.data.frame(), write.table() and write.csv(). Any thoughts?
Some machine info:
Sys.info()
sysname release version
"Windows" "7 x64" "build 7601, Service Pack 1"
nodename machine login
"TALIS-TP" "x86"
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Many many people have similar problems working with UTF-8 text on platforms that have 8-bit system encodings (Windows). Encoding in R can be tricky, because different methods handle encoding and conversions differently, and what appears to work fine on one platform (OS X or Linux) works poorly on another.
The problem has to do with your output connection and how Windows handles encodings and text connections. I've tried to replicate the problem using some Hebrew texts in both UTF-8 and an 8-bit encoding. We'll walk through the file reading issues as well, since there could be some snags there too.
For Tests
Created a short Hebrew language text file, encoded as UTF-8: hebrew-utf8.txt
Created a short Hebrew language text file, encoded as ISO-8859-8: hebrew-iso-8859-8.txt. (Note: You might need to tell your browser about the encoding in order to view this one properly - that's the case for Safari for instance.)
Ways to read the files
Now let's experiment. I am using Windows 7 for these tests (it actually works in OS X, my usual OS).
lines <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt")
lines
## [1] "העברי ×”×•× ×—×‘×¨ בקבוצה ×”×›× ×¢× ×™×ª של שפות שמיות."
## [2] "זו היתה ×©×¤×ª× ×©×œ ×”×™×”×•×“×™× ×ž×•×§×“×, ×בל מן 586 ×œ×¤× ×”\"ס ×–×” התחיל להיות מוחלף על ידי ב×רמית."
That failed because it assumed the encoding was your system encoding, Windows-1252. But because no conversion occurred when you read the files, you can fix this just by setting the Encoding bit to UTF-8:
# this sets the bit for UTF-8
Encoding(lines) <- "UTF-8"
lines
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
## [2] "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה\"ס זה התחיל להיות מוחלף על ידי בארמית."
But better to do this when you read the file:
# this does it in one pass
lines2 <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt", encoding = "UTF-8")
lines2[1]
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
Encoding(lines2)
## [1] "UTF-8" "UTF-8"
Now look at what happens if we try to read the same text, but encoded as the 8-bit ISO Hebrew code page.
lines3 <- readLines("http://kenbenoit.net/files/hebrew-iso-8859-8.txt")
lines3[1]
## [1] "äòáøé äåà çáø á÷áåöä äëðòðéú ùì ùôåú ùîéåú."
Setting the Encoding bit is of no help here, because what was read does not map to the Unicode code points for Hebrew, and Encoding() does no actual encoding conversion, it merely sets an extra bit that can be used to tell R one of a few possible encoding values. We could have solved this by adding encoding = "ISO-8859-8" to the readLines() call. We can also convert the text after loading, using iconv():
# this will not fix things
Encoding(lines3) <- "UTF-8"
lines3[1]
## [1] "\xe4\xf2\xe1\xf8\xe9 \xe4\xe5\xe0 \xe7\xe1\xf8 \xe1\xf7\xe1\xe5\xf6\xe4 \xe4\xeb\xf0\xf2\xf0\xe9\xfa \xf9\xec \xf9\xf4\xe5\xfa \xf9\xee\xe9\xe5\xfa."
# but this will
iconv(lines3, "ISO-8859-8", "UTF-8")[1]
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
Overall I think the method used above for lines2 is the best approach.
How to output the files, preserving encoding
Now to your question about how to write this: The safest way is to control your connection at a low level, where you can specify the encoding. Otherwise, the default is for R/Windows to choose your system encoding, which will lose the UTF-8. I thought this would work, which does work absolutely fine in OS X - and on OS X also works fine calling writeLines() just naming a text file without the textConnection.
## to write lines, use the encoding option of a connection object
f <- file("hebrew-output-UTF-8.txt", open = "wt", encoding = "UTF-8")
writeLines(lines2, f)
close(f)
But it does not work on Windows. You can see the Windows 7 results here: hebrew-output-UTF-8-file_encoding.txt.
So, here is how to do it in Windows: Once you are sure your text is encoded as UTF-8, just write it as raw bytes, without using any encoding, like this:
writeLines(lines2, "hebrew-output-UTF-8-useBytesTRUE.txt", useBytes = TRUE)
You can see the results at hebrew-output-UTF-8-useBytesTRUE.txt, which is now UTF-8 and looks correct.
Added for write.csv
Note that the only reason you would want to do this is to make the .csv file available for import into other software, such as Excel. (And good luck working with UTF-8 in Excel/Windows...) Otherwise, you should just write the data.table as binary using write(myDataFrame, file = "myDataFrame.RData"). But if you really need to output .csv, then:
How to write UTF-8 .csv files from a data.table in Windows
The problem with writing UTF-8 files using write.table() and write.csv() is that these open text connections, and Windows has limitations about encodings and text connections with respect to UTF-8. (This post offers a helpful explanation.) Following from an SO answer posted here, we can override this to write our own function to output UTF-8 .csv files.
This assumes that you have already set the Encoding() for any character elements to "UTF-8" (which happens upon import above for lines2).
df <- data.frame(int = 1:2, text = lines2, stringsAsFactors = FALSE)
write_utf8_csv <- function(df, file) {
firstline <- paste('"', names(df), '"', sep = "", collapse = " , ")
data <- apply(df, 1, function(x) {paste('"', x, '"', sep = "", collapse = " , ")})
writeLines(c(firstline, data), file , useBytes = TRUE)
}
write_utf8_csv(df, "df_csv.txt")
When we now look at that file in non-Unicode-challenged OS, it now looks fine:
KBsMBP15-2:Desktop kbenoit$ cat df_csv.txt
"int" , "text"
"1" , "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
"2" , "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה"ס זה התחיל להיות מוחלף על ידי בארמית."
KBsMBP15-2:Desktop kbenoit$ file df_csv.txt
df_csv.txt: UTF-8 Unicode text, with CRLF line terminators
I was wonderiing if there exist a short way to get accents of a character string in R. For instance when I the "accentued" string is "Université d'Aix-Marseille" with my script I get "Universit%C3A9 d%27Aix-Marseille". Is there any function or algorithm to get the former one directly ?
I precise tht the file from where a get all my character string is encoded is UTF-8.
Sincerely yours.
You can get and set the encoding of a character vector like this:
s <- "Université d'Aix-Marseille"
Encoding(s)
# set encoding to utf-8
Encoding(s) <- "UTF-8"
s
If that fixes it, you could change your default encoding to UTF-8.
I want to write to a file with UTF-8 encoding containing the character
10001100 which is Œ the Latin capital ligature OE in extended ASCII table,
zz <- file("c:/testbin", "wb")
writeBin("10001100",zz)
close(zz)
When I open the file with office(encoding=utf-8), I can see Œ what I can not read is with readBin?
zz <- file("c:/testbin", "rb")
readBin(zz,raw())->x
x
[1] c5
readBin(zz,character())->x
Warning message:
In readBin(zz, character()) :
incomplete string at end of file has been discarded
x
character(0)
There are multiple difficulties here.
Firstly, there are actually several "Extended ASCII" tables. Since you are on Windows you are probably using CP1252 which is one of them, also called Windows-1252 or ANSI, and the Win default "latin" encoding. However the code for Œ varies within this family of tables. In CP1252, "Œ" is represented by 10001100 or "\x8c", as you wrote. However it does not exist in ISO-8859-1. And in UTF-8 it corresponds to "\xc5\x92" or "\u0152", as rlegendi indicated.
So, to write UTF-8 from CP1252-as-binary-as-string, you have to convert your string into it a "raw" number (the R class for bytes) and then a character, change its "encoding" from CP1252 to UTF-8 (in fact convert its byte value to the corresponding one for the same character in UTF-8), after that you can re-convert it to raw, and finally write to the file:
char_bin_str <- '10001100'
char_u <- iconv(rawToChar(as.raw(strtoi(char_bin_str, base=2))),
# "\x8c" 8c 140 '10001100'
from="CP1252",
to="UTF-8")
test.file <- "~/test-unicode-bytes.txt"
zz <- file(test.file, 'wb')
writeBin(charToRaw(char_u), zz)
close(zz)
Secondly, when you readBin(), do not forget to give a number of bytes to read which is big enough (n=file.info(test.file)$size here), otherwise it reads only the first byte (see below):
zz <- file(test.file, 'rb')
x <- readBin(zz, 'raw', n=file.info(test.file)$size)
close(zz)
x
[1] c5 92
Thirdly, if in the end you want to turn it back into a character, correctly understood and displayed by R, you have first to convert it into a string with rawToChar(). Now, the way it will be displayed depends on your default encoding, see Sys.getlocale() to see what it is (probably something ending with 1252 on Windows). The best is probably to specify that your character should be read as UTF-8 – otherwise it will be understood with your default encoding.
xx <- rawToChar(x)
Encoding(xx) <- "UTF-8"
xx
[1] "Œ"
This should keep things under control, write the correct bytes in UTF-8, and be the same on every OS. Hope it helps.
PS: I am not exactly sure why in your code x returned c5, and I guess it would have returned c5 92 if you had set n=2 (or more) as a parameter to readBin(). On my machine (Mac OS X 10.7, R 3.0.2 and Win XP, R 2.15) it returns 31, the hex ASCII representation of '1' (the first char in '10001100', which makes sense), with your code. Maybe you opened your file in Office as CP1252 and saved it as UTF-8 there, before coming back to R?
Try this instead (I replaced the binary value with the UTF encoding because I think it is better when you want such an output):
writeBin(charToRaw("\u0152"), zz)
I am doing some web scraping of names into a dataframe
For a name such as "Tomáš Rosický, I get a result "Tomáš Rosický"
I tried
Encoding("Tomáš Rosický") # with latin1 response
but was not sure where to go from there to get the original name with accents back. Played around with iconv without success
I would be satisfied (and might even prefer) an output of "Tomas Rosicky"
You've read in a page encoded in UTF-8. if x is your column of names, use Encoding(x) <- "UTF-8".
You should use this:
df$colname <- iconv(df$colname, from="UTF-8", to="LATIN1")
To do a correct read of the file use the scan function:
namb <- scan(file='g:/testcodering.txt', fileEncoding='UTF-8',
what=character(), sep='\n', allowEscapes=T)
cat(namb)
This also works:
namc <- readLines(con <- file('g:/testcodering.txt', "r",
encoding='UTF-8')); close(con)
cat(namc)
This will read the file with the correct accents
A way to export accents correctly:
enc2utf8(as(dataframe$columnname, "character"))