Cannot get Character Encodings right with rtweet and RMeCab tokenizer

I am trying to text mine Japanese tweets and am running into seemingly unsolvable issues with character encodings.
After mining the tweets and setting the locale with Sys.setlocale("LC_ALL", "Japanese_Japan.932") I get a data frame that looks as expected:
I want to run these tweets through a tokenizer for Japanese, namely RMeCab, which is based on MeCab (all available here). MeCab can be compiled in UTF-8, SHIFT-JIS and a few other encodings, but recompiling in another encoding doesn't make my problem go away or even change the end result.
So, after compiling MeCab and installing RMeCab, I extract the first tweet and try to tokenize it with
tweet1 <- trump_ja[1,5]
x <- RMeCabC(str = tweet1)
This yields the following output:
These are unfortunately not the correct Japanese characters. I have tried the following switches between SHIFT-JIS and UTF-8 encoding (and all combinations of these changes) to overcome this problem:
Open the R Script with different encoding (makes a difference, but just shows different garbled characters, so I'm assuming the garbling happens within RMeCab)
Switch locale between Sys.setlocale("LC_ALL", "English_United States.1252") and Sys.setlocale("LC_ALL", "Japanese_Japan.932")
Recompile MeCab in a different encoding
I am now at the end of the line and would like to ask for help.
EDIT:
I have now figured out that running an iconv(result, from = "UTF8", to = "UTF-8") conversion on a tokenized (and garbled) character string shows me the correct Japanese characters for the tokens. This doesn't seem to make a lot of sense, but it does the trick. However, I'd like to avoid this extra step, as the conversion only works on character strings, and not on lists or vectors.
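For what it's worth, one way to avoid converting string by string is to apply the fix to the whole token list at once. The sketch below is only an illustration: it assumes the tokens hold valid UTF-8 bytes that simply lack an encoding mark (which would explain why the iconv call above appears to help), it uses a hand-built list as a stand-in for the RMeCabC() output, and mark_utf8 is a hypothetical helper name.

```r
# Hypothetical helper: declare every token in a list of character vectors as UTF-8.
mark_utf8 <- function(tokens) {
  lapply(tokens, function(tok) {
    Encoding(tok) <- "UTF-8"  # relabels the existing bytes; does not convert them
    tok
  })
}

# Stand-in for tokenizer output (assumption): the UTF-8 bytes of U+65E5 U+672C,
# built from raw bytes so the string carries no encoding mark.
raw_token <- rawToChar(as.raw(c(0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac)))
tokens <- list(c(noun = raw_token))

fixed <- mark_utf8(tokens)
print(Encoding(fixed[[1]]))
```

If the bytes are in fact in some other encoding (for example SHIFT-JIS), relabelling will not help and a real conversion with iconv(tok, from = ..., to = "UTF-8") inside the same lapply() would be needed instead.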

Related

R version 4.2.0 and Swedish letters (ä ö å) not working in newest R release. Anyone found a solution?

I have updated to the latest R release (R version 4.2.0), but I am now facing the problem that the Swedish special letters cannot be read anymore. I am working with a database that has many Swedish letters in its factor labels, and even if I read them in as strings R doesn't recognise them, with the consequence that all summary tables that use these factors as groups are no longer calculated correctly. The code worked fine under the previous release (but I had issues with knitting R Markdown files, hence the need to update).
I have set the encoding to ISO-8859-4 (which covers northern European languages) after UTF-8 did not work. Is there anything else I could try? Or has anyone found a way to fix this other than renaming all labels before reading in the .csv files? (I would really like to avoid this fix, since I often work with similar data.)
I have used read.csv() and it produces cryptic outputs replacing the special letters with for example <d6> instead of ö and <c4> instead of ä.
I hope that someone has an idea for a fix. Thanks.
Edit: I use Windows.
Sys.getlocale("LC_CTYPE")
[1] "Swedish_Sweden.utf8"

Use the encoding parameter
I have been able to detect failed loads by attempting to apply toupper to strings, which gives me errors such as
Error in toupper(dataset$column) :
invalid multibyte string 999751
This is resolved and expected outcomes obtained by using
read.csv(..., encoding = 'latin1')
or
data.table::fread(..., encoding = 'Latin-1')
I believe this solution should apply to Swedish characters as they are also covered by the Latin-1 encoding.
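As a rough illustration of the same idea without needing a file, the sketch below checks a string for invalid UTF-8 and then converts it from Latin-1. Note that it swaps in base R's validUTF8() for the toupper() check described above, and the sample bytes are made up for the example.

```r
# The bytes of "Malm" followed by 0xF6, which is "ö" in Latin-1
# but is not a valid sequence in UTF-8.
latin1_text <- rawToChar(as.raw(c(0x4d, 0x61, 0x6c, 0x6d, 0xf6)))

# Detect the failed load: FALSE means the bytes are not valid UTF-8.
is_valid_before <- validUTF8(latin1_text)

# Convert from Latin-1 to UTF-8; the result is marked as UTF-8.
fixed <- iconv(latin1_text, from = "latin1", to = "UTF-8")
is_valid_after <- validUTF8(fixed)

print(c(before = is_valid_before, after = is_valid_after))
```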

I have the same problem. What worked for me was what the answer above suggests, but with the encoding ISO-8859-1 instead. It works for both reading from and saving to file for the Swedish characters å, ä, ö, Å, Ä, Ö, e.g.:
read.csv("~/test.csv", fileEncoding = "ISO-8859-1")
and
write.csv2(x, file="test.csv", row.names = FALSE, na = "", fileEncoding = "ISO-8859-1")
It's tedious but it works right now. Another tip, if you use RStudio, is to go to Global Options -> Code -> Saving, set your default text encoding to ISO-8859-1 and restart RStudio. If I understand correctly, it will then save and read your scripts in that encoding by default. I had the problem that scripts containing Swedish characters displayed the wrong characters when I opened them. This solution fixed that.

R changing names when there is ä ü ö

OK, this is an extremely annoying problem and I was not able to find a solution on the internet, so I come to you.
When importing data sets that contain German names with umlauts (ä, ö, ü), R modifies the names, something like Möhlin -> M<f6>hlin.
Writing code with words containing umlauts causes no problem until I save the script. After reloading a saved script, all my beloved umlauts are modified, i.e. all the names of my plots, the names of the variables, etc.
Can anyone help me, please?

Try setting the locale:
Sys.setlocale(category = "LC_ALL", locale = "German")

Try changing default codepage to UTF-8 in RStudio via:
Tools - Global Options - Code - Saving - Default Text Encoding - UTF-8
then restart RStudio and save and reopen your script with umlauts.

I'd just try to make sure all your files are UTF-8 encoded, so that the umlauts are stored correctly.
Thus, when writing and reading files, try to always explicitly set the file encoding to "UTF-8".
For instance, when writing df to a file:
write.csv(df, "output.csv", fileEncoding = "UTF-8")
The same logic applies to read.csv(), etc.
Note that opening files that way will only work properly when you saved them as UTF-8 in the first place.
I know that some people like to use stringr for string manipulation in general when working with non-English text, but I have never used it.
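To make the read.csv() side concrete, here is a small round-trip sketch. The data frame and the temporary file are invented for the example, and whether the round trip is lossless depends on the session's locale being able to represent the characters, so treat it as an illustration rather than a guarantee.

```r
# Made-up data with an umlaut, written with an escape so the script itself stays ASCII.
df <- data.frame(name = "M\u00f6hlin", stringsAsFactors = FALSE)

# Write and read back with the file encoding stated explicitly on both sides.
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE, fileEncoding = "UTF-8")
back <- read.csv(path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)

# Compare what came back with what was written.
print(identical(back$name, df$name))
```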

character encoding error not resolved by specifying encoding

I am trying to extract text from a Spanish-language source in R, and running into a character encoding problem which is not resolved by explicitly specifying the encoding within htmlParse, as recommended here.
library(XML)
library(httr)
url <- "http://www3.hcdn.gov.ar//folio-cgi-bin/om_isapi.dll?E1=&E11=&E12=&E13=&E14=&E15=&E16=&E17=&E18=&E2=&E3=&E5=ley&E6=&E7=&E9=&headingswithhits=on&infobase=proy.nfo&querytemplate=Consulta%20de%20Proyectos%20Parlamentarios&record={4EBB}&recordswithhits=on&softpage=Document42&submit=ejecutar%20"
doc <- htmlParse(rawToChar(GET(url)$content),encoding="windows-1252")
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
text[77]
The 77th element, which includes an accented i, has the offending characters. The fourth line has some additional hoops I have to jump through to read this source. The document itself claims to be encoded in "windows-1252." Specifying "latin1" and several other encodings I have tried are no better. In my actual application, I have already downloaded many of these files and am reading them locally using readLines...and I can tell that the error is not present after reading the file into R, so the problem must be in htmlParse. Also, just accepting the encoding error and correcting it ex post does not seem to be an option, as R does not even recognize the characters it is spitting out if I try to copy and paste them back into a script.

Here is a quick fix that may work after you bring the file into R
Encoding(text) <- "UTF-8"
Changing the encoding to "UTF-8" makes Spanish files a lot more usable.
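A small sketch of why this can work: Encoding<- only changes the label on the bytes, it does not convert them. The example assumes the text really contains UTF-8 bytes that were labelled as something else; the two bytes used here are just an illustration.

```r
# Two bytes that spell the accented i (U+00ED) in UTF-8.
x <- rawToChar(as.raw(c(0xc3, 0xad)))

# Wrong label: read as Latin-1, the same two bytes count as two characters.
Encoding(x) <- "latin1"
n_wrong <- nchar(enc2utf8(x))

# Right label: no bytes change, only the declared encoding.
Encoding(x) <- "UTF-8"
n_right <- nchar(x)

print(c(wrong_label = n_wrong, right_label = n_right))
```

If the bytes are genuinely windows-1252, relabelling them as UTF-8 would not be correct, and a conversion with iconv() would be the thing to try instead.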

Reading CSV files with Chinese Characters

There are a number of StackOverflow posts about opening CSV files containing (UTF-8 encoded) Chinese characters into R, in Windows. None of the answers I've found seem to work completely.
If I read.csv with encoding="UTF-8", then the Chinese characters are shown encoded (<U+XXXX>, which I've manually verified are at least correct). However, if I interrogate the data frame to get just one row or a specific cell from a row, then it's printed properly.
One post suggested this is due to strings being typed as factors. However, setting stringsAsFactors=FALSE had no effect.
Other posts say the locale must be set correctly. My system locale is apparently English_United Kingdom.1252; a Windows code page looks decidedly non-Unicode friendly! If I try to change it to any of en.UTF-8, en_GB.UTF-8 or en_US.UTF-8 (or even UTF-8 or Unicode), I get an error saying that my OS cannot honour the request.
If I try Sys.setlocale(category="LC_ALL", locale="Chinese"), the locale does change (albeit to another Windows code page; still no Unicode) but then the CSV files can't be parsed. That said, if I read the files in the English locale and then switch to Chinese afterwards, the data frame is printed out correctly in the console. However, this is kludgy and, regardless, View(myData) now shows mojibake rather than the encoded Unicode code points.
Is there any way to just make it all work? That is, correct Chinese characters are echoed from the data frame to the console and View works, without having to perform secret handshakes when reading the data?
My gut feeling is that the problem is the locale: It should be set to a UTF-8 locale and then everything should [might] just work. However, I don't know how to do that...

The <U+XXXX> notation is good: it means your characters were read in properly. The issue is on R's side with printing to the console, which shouldn't be a big problem unless you are copying and pasting output. Writing out is a bit tricky: you want to open a UTF-8 file connection, then write to that file.
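One possible way to do the writing step is sketched below. Note that this variant writes the UTF-8 bytes directly through a binary connection with useBytes = TRUE rather than relying on a connection-level encoding; the sample string and temporary file are invented for the example.

```r
# Two Chinese characters written with escapes so the script itself stays ASCII.
x <- "\u4e2d\u6587"

# Open a binary connection and write the UTF-8 bytes unchanged.
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)  # useBytes = TRUE skips re-encoding
close(con)

# Read back, declaring the input as UTF-8.
back <- readLines(path, encoding = "UTF-8")
print(identical(back, x))
```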

Displaying UTF-8 encoded Chinese characters in R

I try to open a UTF-8 encoded .csv file that contains (traditional) Chinese characters in R. For some reason, R displays the information sometimes as Chinese characters, sometimes as unicode characters.
For instance:
data <-read.csv("mydata.csv", encoding="UTF-8")
data
will produce unicode characters, while:
data <-read.csv("mydata.csv", encoding="UTF-8")
data[,1]
will actually display Chinese characters.
If I turn it into a matrix, it will also display Chinese characters, but if I try to look at the data (command View(data) or fix(data)) it is in unicode again.
I've asked for advice from people who use a Mac (I'm using a PC, Windows 7), and some of them got Chinese characters throughout, others didn't. I tried to save the original data as a table instead and read it into R this way - same result. I tried running the script in RStudio, Revolution R, and RGui. I tried to adjust the locale (e.g. to chinese), but either R didn't let me change it or else the result was gibberish instead of unicode characters.
My current locale is:
"LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"
Any help to get R to consistently display Chinese characters would be greatly appreciated...

Not a bug, more a misunderstanding of the underlying type conversions (the character type and the factor type) when constructing a data.frame.
You could start with data <- read.csv("mydata.csv", encoding="UTF-8", stringsAsFactors=FALSE), which makes your Chinese characters of the character type, so when you print them you should see what you are expecting.
@nograpes: similarly x=c('中華民族');x; y <- data.frame(x, stringsAsFactors=FALSE) and everything should be OK.
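A minimal sketch of the type difference being described, using the same string written with Unicode escapes. It only shows which column type each setting produces; whether that changes how the console prints the data frame on a given system is not something the sketch can confirm. Since R 4.0.0 the default is stringsAsFactors = FALSE.

```r
# The string from the comment above, written with escapes so the script stays ASCII.
x <- "\u4e2d\u83ef\u6c11\u65cf"

# Same input, two settings: one keeps a character column, the other converts to factor.
as_chr <- data.frame(x, stringsAsFactors = FALSE)
as_fct <- data.frame(x, stringsAsFactors = TRUE)

print(class(as_chr$x))
print(class(as_fct$x))
```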

In my case, the UTF-8 encoding does not work in my R, but the GB* encodings work. UTF-8 works in Ubuntu. First you need to figure out the default encoding of your OS and encode the file accordingly. Excel cannot encode it as UTF-8 properly even though it claims to save as UTF-8.
(1) Download the 'Open Sheet' software.
(2) Open the file properly. You can scroll through the encoding methods until you see the Chinese characters displayed in the preview window.
(3) Save it as UTF-8 (if you want UTF-8). (UTF-8 is not the solution to every problem; you HAVE TO know the default encoding of your system first.)
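The "scroll through encodings until it looks right" step can also be roughed out in R. The sketch below is only an assumption-laden illustration: try_encodings is a hypothetical helper, the candidate list is arbitrary, and which encoding names iconv() accepts depends on the platform (iconvlist() shows what is available). A clean conversion does not prove the encoding is right, since Latin-1 accepts any byte, so the decoded text still has to be inspected by eye.

```r
# Hypothetical helper: try each candidate encoding and keep those that convert cleanly.
try_encodings <- function(bytes, candidates) {
  decoded <- lapply(candidates, function(enc) {
    tryCatch(
      iconv(bytes, from = enc, to = "UTF-8"),   # NA if the bytes are invalid in 'enc'
      error = function(e) NA_character_         # encoding name unknown to this iconv build
    )
  })
  names(decoded) <- candidates
  Filter(function(s) !is.na(s), decoded)
}

# Bytes D6 D0 encode the character U+4E2D in GBK, but are not valid UTF-8.
gbk_bytes <- rawToChar(as.raw(c(0xd6, 0xd0)))
result <- try_encodings(gbk_bytes, c("UTF-8", "GBK", "latin1"))
print(names(result))
```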
