Reading diacritics in R

Reading diacritics in R - r

I have imported several .txt files (texts written in Spanish) to RStudio using the following code:
content = readLines(paste("my_texts", "text1",sep = "/"))
However, when I read the texts in RStudio, they contain codes instead of diacritics. For example, I see the code <97> instead of an "ó" or the code <96> instead of an "ñ".
I have realized also that if the .txt file was originally written using a computer configured in Spanish, I don't see the codes but the actual diacritics. And if the texts were written using a a computer configured in English, then I do get the codes (even though when opening the .txt file on TextEdit I see the diacritics).
I don't know why R displays those symbols and what I can do to retain the diacritics I see in the original .txt files.
I read I could possibly solve this by changing the encoding to UTF-8, so I tried this:
content = readLines(paste("my_texts", "text1",sep = "/"), encoding = "UTF-8")
But that didn't work. Any ideas what those codes are and how to keep my diacritics?

As you figured out, you need to set the correct encoding. Unfortunately the text file was written using a legacy encoding rather than UTF-8 — namely, MacRoman. Ideally the application producing the file would not use this encoding, and Apple products by default no longer produce it.
But since this is what you’ve got, we have to deal with it, and we can. But unfortunately we need to go a detour because the encoding argument of readLines is a bit useless. Instead, we need to manually open a file connection:
con = file(file.path("my_texts", "text1"), encoding = "macintosh")
on.exit(close(con)) # Always make sure to close connections!
contents = readLines(con)
Do note that the encoding name “macintosh” is strictly speaking not portable, so this might not work on all platforms.

Related

R version 4.2.0 and Swedish letters (ä ö å) not working in newest R release. Anyone found a solution?

I have updated to the latest R release (R version 4.2.0), but I am now facing the problem that all the Swedish special letters cannot be read anymore. I am working with a database that has many Swedish letters in its factor lables and even if I am reading them in as strings R doesn't recognise them, with the consequence that all summary tables that are based on these factors as groups are not calculated correctly anymore. The code has been working fine under the previous release (but I had issues with knitting Rmarkdown files, therefore the need for updating).
I have set the encoding to iso-5889-4 (which is nothern languages) after UTF-8 has not worked. Is there anything else I could try? Or has anyone come to a solution on how to fix this, other than to rename all lables before reading in the .csv files? (I would really like to avoid this fix, since I am often working with similar data)
I have used read.csv() and it produces cryptic outputs replacing the special letters with for example <d6> instead of ö and <c4> instead of ä.
I hope that someone has an idea for a fix. Thanks.
edit: I use windows.
Sys.getlocale("LC_CTYPE")
[1] "Swedish_Sweden.utf8"

Use the encoding parameter
I have been able to detect failed loads by attempting to apply toupper to strings, which gives me errors such as
Error in toupper(dataset$column) :
invalid multibyte string 999751
This is resolved and expected outcomes obtained by using
read.csv(..., encoding = 'latin1')
or
data.table::fread(..., encoding = 'Latin-1')
I believe this solution should apply to Swedish characters as they are also covered by the Latin-1 encoding.

I have the same problem, what worked for me was like the answer above said but I used encoding ISO-8859-1 instead. It works for both reading from file and saving to file for Swedish characters å,ä,ö,Å,Ä,Ä, i.e:
read.csv("~/test.csv", fileEncoding = "ISO-8859-1")
and
write.csv2(x, file="test.csv", row.names = FALSE, na = "", fileEncoding = "ISO-8859-1")
It's tedious but it works right now. Another tip is if you use Rstudio is to go to Global options -> Code -> Saving and set your default text encoding to ISO-8859-1 and restart Rstudio. It will save and read your scripts in that encoding as default if I understand correctly. I had the problem when I opened my scripts with Swedish characters, they would display wrong characters. This solution fixed that.

Is there any way to use Å letter in R?

Is there any way to deal with this letter in R -Å?
In some configuration I'm able to read this letter from SQL by RODBC, but I didn't found any solution to save this letter to csv or txt. It's always getting converted to normal A or Ĺ.
Also, how to read this letter correctly from Excel file?

I understand from you question that the letter displays properly inside R but you have problems writing it to files.
R's writing functions usually have an encoding parameter (for example, for write.csv and write.table it's called fileEncoding).
When you don't set it explicitly, the function will encode the file using your OS's (or R-installations) native encoding, which can sometimes cause problems with special characters. What exactly goes wrong and how to fix it depends heavily on your system setup - especially if you're also interacting with databases, as you describe.
But very often, an easy fix is writing files in UTF-8 encoding, i.e.
write.csv(your_df, your_path, fileEncoding='UTF-8')
as most external programs (such as Excel) are able to automatically detect and properly read UTF-8 encoded files.

Set the fileEncoding argument on write.table to fit your needs (e.g., if your text is encoded as UTF-8, try write.table(my_tab, file = "my_tab.txt", fileEncoding = "UTF8")).

R changing names when there is ä ü ö

OK, this is an extremly annoying problem and I was not able to find a solution on the internet, therefore I come to you.
When importing data sets that contain German names with Umlaut (ä, ö, ü), R modifies the names. Somethin like Möhlin -> M<f6>hlin.
When writing code word containing Umlaut cause no problem, until saving the script. After reloading a save script all my beloved Umlaut are modified. Aka all the names of my plots, the name of the variables, etc etc ...
Please, anyone can help me ?

Try setting the locale:
Sys.setlocale(category = "LC_ALL", locale = "German")

Try changing default codepage to UTF-8 in RStudio via:
Tools - Global Options - Code - Saving - Default Text Encoding - UTF-8
then restart RStudio and save and reopen your script with umlauts.

I'd just try to make sure all your files are UTF-8 encoded, ie. know their Umlauts.
Thus, when writing and reading files, try to always explicitly set the file encoding to "UTF-8".
For instance, when writing df to file,
write.csv(tt, "output.csv", fileEncoding = "UTF-8")
The same logic applies to read.csv(), etc.
Note that opening files that way will only work properly when you saved them as UTF-8 in the first place.
I know that some people like to use stringr for string manipulation in general when working with non-English text, but I have never used it.

Importing data with special characters in R

The following pic shows how the data is before i import it(notepad) in R and after importing.
I use the following command to import it in R:
Data <- read.csv('data.csv',stringsAsFactors = FALSE,header = TRUE,quote = "")
It can be seen that the special characters such as the ae is replaced with something like A| (line 19 on the left,line 18 or the right). Is there a way to import the CSV file as it is? (Using R)

Your problem is an encoding issue. There are two aspects to this: First, what is saved by Notepad++ may not correspond to the encoding that you are expecting in the saved text file, and second, R may be reading the file in using read.csv() based on a different encoding, which is especially possible since if you are using Notepad++ then this suggests you are using Windows, and therefore you may be unable to have UTF-8 as your system locale for R.
So taking each issue in turn:
Getting Notepad++ to save your file in a specific encoding. Here you can set your encoding for the new file based using these instructions. I always use UTF-8 but here since your texts are Danish, Latin-1 should work too.
To verify the encoding of your texts, you may wish to use the file utility supplied with RTools. This will tell you something about the probable encoding of your file from the command line, although it is not perfect. (OS X and Linux users already have this without needing to install additional utilities.)
Setting encoding when importing the .csv file into R. When you import the file using read.csv(), specify encoding = "UTF-8" or encoding = "Latin-1". You might also want to check though what your system encoding is, and match that. You can do this with Sys.getlocale() (and set it with Sys.setlocale().) On my system for instance:
> Sys.getlocale()
[1] "en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8"
You could of course set this to Windows-1252 but you might have trouble then with portability if using this on other platforms. UTF-8 is the best solution to this.

In my case, I use only the parameter [encoding = "Latin-1"] and it worked. Thanks.
read.csv(paste(src,sprintf("%s.csv",x), sep = "/"), header = TRUE,
stringsAsFactors = FALSE, encoding = "Latin-1")

How do I read in a CSV file into R?

I'm having some technical issues with loading a CSV file into R. When I inspect the csv file in RStudio's Source pane, all the characters are surrounded by weird red circles or dots. When I inspect another self-made CSV file, the characters appear perfectly fine, without any of the red circles.
What is this issue/symptom, and what would be the best way to fixing this for about 40 similar CSV files?
When I try to run readfile <- read.csv("filename.csv", sep="", collapse=NULL) I get the following error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
empty beginning of file

My guess would be that you ran into some encoding issue.
Especially on Windows you can run into all sorts of problems with that.
Try opening the csv file with a text editor that has the capability of saving files with various Encodings (e.g. Notpead++) then change that to e.g. UTF-8 (which is the preferred Encoding of RStudio most other Editors and R itself), save the file and try to run the import again.
Just make sure that you don't loose characters - especially special characters tend to get lost during Encoding changes.
Greetings ...

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Reading diacritics in R - r

Related

R version 4.2.0 and Swedish letters (ä ö å) not working in newest R release. Anyone found a solution?

Is there any way to use Å letter in R?

R changing names when there is ä ü ö

Importing data with special characters in R

How do I read in a CSV file into R?

Categories

Resources