character encoding error not resolved by specifying encoding - r

I am trying to extract text from a Spanish-language source in R, and running into a character encoding problem which is not resolved by explicitly specifying the encoding within htmlParse, as recommended here.
library(XML)
library(httr)
url <- "http://www3.hcdn.gov.ar//folio-cgi-bin/om_isapi.dll?E1=&E11=&E12=&E13=&E14=&E15=&E16=&E17=&E18=&E2=&E3=&E5=ley&E6=&E7=&E9=&headingswithhits=on&infobase=proy.nfo&querytemplate=Consulta%20de%20Proyectos%20Parlamentarios&record={4EBB}&recordswithhits=on&softpage=Document42&submit=ejecutar%20"
doc <- htmlParse(rawToChar(GET(url)$content), encoding = "windows-1252")
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
text[77]
The 77th element, which includes an accented i, has the offending characters. The fourth line of the code has some additional hoops I have to jump through to read this source. The document itself claims to be encoded in "windows-1252". Specifying "latin1" and several other encodings I have tried is no better. In my actual application, I have already downloaded many of these files and am reading them locally using readLines, and I can tell that the error is not present after reading the file into R, so the problem must be in htmlParse. Also, just accepting the encoding error and correcting it ex post does not seem to be an option, as R does not even recognize the characters it is spitting out if I try to copy and paste them back into a script.

Here is a quick fix that may work after you bring the file into R
Encoding(text) <- "UTF-8"
Changing the coding to "UTF-8" makes Spanish files a lot more usable.
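If you have already downloaded the files and are reading them locally with readLines, another option is to convert the text to UTF-8 explicitly before parsing. A minimal sketch, assuming a locally saved copy of one of these pages (the file name here is hypothetical):
# read the saved page, convert its windows-1252 bytes to UTF-8, then parse
raw_lines  <- readLines("proyecto_4EBB.html", warn = FALSE)
utf8_lines <- iconv(raw_lines, from = "windows-1252", to = "UTF-8")
doc  <- htmlParse(paste(utf8_lines, collapse = "\n"), asText = TRUE, encoding = "UTF-8")
text <- xpathSApply(doc, "//text()[not(ancestor::script)]", xmlValue)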

Related

R version 4.2.0 and Swedish letters (ä ö å) not working in newest R release. Anyone found a solution?

I have updated to the latest R release (R version 4.2.0), but I am now facing the problem that none of the Swedish special letters can be read anymore. I am working with a database that has many Swedish letters in its factor labels, and even if I read them in as strings R doesn't recognise them, with the consequence that all summary tables based on these factors as groups are no longer calculated correctly. The code worked fine under the previous release (but I had issues with knitting R Markdown files, hence the need to update).
I have set the encoding to ISO-8859-4 (which covers northern European languages) after UTF-8 did not work. Is there anything else I could try? Or has anyone found a way to fix this, other than renaming all labels before reading in the .csv files? (I would really like to avoid that fix, since I often work with similar data.)
I have used read.csv(), and it produces cryptic output, replacing the special letters with, for example, <d6> instead of ö and <c4> instead of ä.
I hope that someone has an idea for a fix. Thanks.
Edit: I use Windows.
Sys.getlocale("LC_CTYPE")
[1] "Swedish_Sweden.utf8"
Use the encoding parameter
I have been able to detect failed loads by attempting to apply toupper to strings, which gives me errors such as
Error in toupper(dataset$column) :
invalid multibyte string 999751
This is resolved and expected outcomes obtained by using
read.csv(..., encoding = 'latin1')
or
data.table::fread(..., encoding = 'Latin-1')
I believe this solution should apply to Swedish characters as they are also covered by the Latin-1 encoding.
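To make this concrete, a minimal sketch (the file and column names are hypothetical):
# read with an explicit encoding, then use toupper() as a quick sanity check
dataset <- read.csv("swedish_data.csv", encoding = "latin1")
head(toupper(dataset$column))  # an error here usually means the encoding is still wrong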
I had the same problem; what worked for me was the same approach as the answer above, but with the encoding ISO-8859-1 instead. It works both for reading from and writing to file for the Swedish characters å, ä, ö, Å, Ä, Ö, i.e.:
read.csv("~/test.csv", fileEncoding = "ISO-8859-1")
and
write.csv2(x, file="test.csv", row.names = FALSE, na = "", fileEncoding = "ISO-8859-1")
It's tedious, but it works for now. Another tip, if you use RStudio, is to go to Global Options -> Code -> Saving, set your default text encoding to ISO-8859-1, and restart RStudio. If I understand correctly, it will then save and read your scripts in that encoding by default. I had the problem that when I opened my scripts with Swedish characters, they would display the wrong characters; this solution fixed that.

Attempts to parse bencode / torrent file in R

I would like to parse torrent files automatically in R. I tried to use the R-bencode package:
library('bencode')
test_torrent <- readLines('/home/user/Downloads/some_file.torrent', encoding = "UTF-8")
decoded_torrent <- bencode::bdecode(test_torrent)
but ran into this error:
Error in bencode::bdecode(test_torrent) :
input string terminated unexpectedly
In addition, if I try to parse just part of this file with bdecode('\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf'), I get
Error in bdecode("\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf") :
Wrong encoding '�'. Allowed values are i, l, d or a digit.
Maybe there is another way to do this in R? Or perhaps I could call code in another language from my R script?
Thanks in advance!
It might be that the torrent file is somehow corrupted.
A bencode value must begin with the character i (for integers), l (for lists), d (for dictionaries) or a number (for the length of a string).
The example string ('\xe7\xc9...') doesn't start with any of those characters, and hence it can't be decoded.
See this for more info on the bencode format.
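To see what well-formed input looks like, here is a small sketch, assuming bdecode() accepts a plain character string as it does in the question:
library(bencode)
bdecode("i42e")          # an integer: 42
bdecode("4:spam")        # a 4-byte string: "spam"
bdecode("l4:spami42ee")  # a list containing "spam" and 42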
There seem to be several issues here.
Firstly, your code should not treat torrent files as text files in UTF-8 encoding. The data a torrent describes is split into equally-sized pieces (except, possibly, the last piece), and the torrent file contains a concatenation of the SHA1 hashes of those pieces. SHA1 hashes are unlikely to be valid UTF-8 strings.
So, you should not read the file into memory using readLines, because that is for text files. Instead, you should use a connection:
test_torrent <- file("/home/user/Downloads/some_file.torrent")
open(test_torrent, "rb")
bencode::bdecode(test_torrent)
Secondly, it seems that the library itself suffers from a similar issue, as the readChar call it relies on also assumes that it's dealing with text.
This might be due to recent changes in R, seeing as the library is over six years old. I was able to apply a quick hack and get it working by passing useBytes=TRUE to readChar.
https://github.com/UkuLoskit/R-bencode/commit/b97091638ee6839befc5d188d47c02567499ce96
You can install my version as follows:
install.packages("devtools")
library(devtools)
devtools::install_github("UkuLoskit/R-bencode")
Caveat lector! I'm not an R programmer :).
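As a quick sanity check that the file is not corrupted (a sketch only; the path is the one from the question): a .torrent file is a bencoded dictionary, so its very first byte should be the letter "d".
path  <- "/home/user/Downloads/some_file.torrent"
bytes <- readBin(path, what = "raw", n = file.size(path))
rawToChar(bytes[1])  # should print "d" for an intact torrent file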

Reading diacritics in R

I have imported several .txt files (texts written in Spanish) to RStudio using the following code:
content = readLines(paste("my_texts", "text1",sep = "/"))
However, when I read the texts in RStudio, they contain codes instead of diacritics. For example, I see the code <97> instead of an "ó" or the code <96> instead of an "ñ".
I have also realized that if the .txt file was originally written using a computer configured in Spanish, I don't see the codes but the actual diacritics. And if the texts were written using a computer configured in English, then I do get the codes (even though when opening the .txt file in TextEdit I see the diacritics).
I don't know why R displays those symbols and what I can do to retain the diacritics I see in the original .txt files.
I read I could possibly solve this by changing the encoding to UTF-8, so I tried this:
content = readLines(paste("my_texts", "text1",sep = "/"), encoding = "UTF-8")
But that didn't work. Any ideas what those codes are and how to keep my diacritics?
As you figured out, you need to set the correct encoding. Unfortunately the text file was written using a legacy encoding rather than UTF-8 — namely, MacRoman. Ideally the application producing the file would not use this encoding, and Apple products by default no longer produce it.
But since this is what you’ve got, we have to deal with it, and we can. Unfortunately we need to take a detour, because the encoding argument of readLines is a bit useless here. Instead, we need to manually open a file connection:
con = file(file.path("my_texts", "text1"), encoding = "macintosh")
on.exit(close(con)) # Always make sure to close connections!
contents = readLines(con)
Do note that the encoding name “macintosh” is strictly speaking not portable, so this might not work on all platforms.
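If the connection-based approach gives you trouble on some platform, an alternative sketch is to read the lines as-is and convert them afterwards with iconv(); the encoding name iconv accepts can likewise vary by platform:
raw_contents <- readLines(file.path("my_texts", "text1"), warn = FALSE)
contents <- iconv(raw_contents, from = "macintosh", to = "UTF-8")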

Cannot get Character Encodings right with rtweet and RMeCab tokenizer

I am trying to text mine Japanese tweets and am running into seemingly unsolvable issues with character encodings.
After mining the tweets and setting the locale with Sys.setlocale("LC_ALL", "Japanese_Japan.932"), I get a dataframe that looks as expected.
I want to run these tweets through a tokenizer for the Japanese language, namely RMeCab, based on MeCab (all available here). MeCab can be compiled in UTF-8, SHIFT-JIS and a few others, but recompiling in another encoding doesn't make my problem go away or even change the end result.
So, after compiling MeCab and installing RMeCab, I extract the first tweet and try to tokenize it with
tweet1 <- trump_ja[1,5]
x <- RMeCabC(str = tweet1)
This yields output that unfortunately does not contain the correct Japanese characters. I have tried the following switches between SHIFT-JIS and UTF-8 encoding (and all combinations of these changes) to overcome the problem:
Opening the R script with a different encoding (this makes a difference, but just shows different garbled characters, so I'm assuming the garbling happens within RMeCab)
Switching the locale between Sys.setlocale("LC_ALL", "English_United States.1252") and Sys.setlocale("LC_ALL", "Japanese_Japan.932")
Recompiling MeCab in a different encoding
I am now at the end of the line and would like to ask for help.
EDIT:
I have now figured out that running an iconv(result, from = "UTF8", to = "UTF-8") conversion on a tokenized (and garbled) character string shows me the correct Japanese characters for the tokens. This doesn't seem to make a lot of sense, but it does the trick. However, I'd like to avoid this extra step, as the conversion only works on character strings, and not on lists or vectors.
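One way to fold that conversion into a single step (a sketch only, reusing the exact iconv() call from the edit above): RMeCabC() returns a list of character vectors, so the conversion can be applied element-wise with lapply():
x <- lapply(RMeCabC(str = tweet1),
            function(tok) iconv(tok, from = "UTF8", to = "UTF-8"))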

R changing names when there is ä ü ö

OK, this is an extremely annoying problem and I was not able to find a solution on the internet, so I come to you.
When importing data sets that contain German names with umlauts (ä, ö, ü), R modifies the names. Something like Möhlin -> M<f6>hlin.
When writing code, words containing umlauts cause no problem until I save the script. After reloading a saved script, all my beloved umlauts are modified, as are all the names of my plots, the names of my variables, etc.
Can anyone help me, please?
Try setting the locale:
Sys.setlocale(category = "LC_ALL", locale = "German")
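You can check whether the change took effect (the exact string reported depends on your system):
Sys.getlocale("LC_CTYPE")  # e.g. "German_Germany.1252" on Windows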
Try changing default codepage to UTF-8 in RStudio via:
Tools - Global Options - Code - Saving - Default Text Encoding - UTF-8
then restart RStudio and save and reopen your script with umlauts.
I'd just try to make sure all your files are UTF-8 encoded, i.e. know their umlauts.
Thus, when writing and reading files, try to always explicitly set the file encoding to "UTF-8".
For instance, when writing a data frame tt to file,
write.csv(tt, "output.csv", fileEncoding = "UTF-8")
The same logic applies to read.csv(), etc.
Note that opening files that way will only work properly when you saved them as UTF-8 in the first place.
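The corresponding read call, assuming the file was written as UTF-8 as above:
tt2 <- read.csv("output.csv", fileEncoding = "UTF-8")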
I know that some people like to use stringr for string manipulation in general when working with non-English text, but I have never used it.
