Umlauts in R not correctly displayed

I am trying to use multiple sources of German/Swiss data with umlauts in it. When trying to merge, I realized that the umlauts do not display correctly in R and the same names were rendered differently in different files.
library(rgdal)
map <- readOGR("/path/to/data.gdb", layer = "layer")
map@data$name
# [1] L\303\266rrach
# [2] Karlsruhe
# [3] ...
Along with several other posts, I read Encoding of German umlauts when using readOGR because one of my data sources is a shp file I read in with readOGR.
Appending use_iconv = TRUE, encoding = "UTF-8" to the readOGR call did not help, and the problem exists outside the use of readOGR as well. I saw that using Sys.setlocale() with a locale that supports UTF-8 worked for that poster, but I don't know what that means even after looking at the ?Sys.setlocale information.
How do I correctly read in German data in R on a Mac using English? Sys.getlocale() reports "C".
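For reference, a minimal sketch of what that Sys.setlocale() suggestion means: it switches the session out of the "C" locale into one that supports UTF-8 (the exact locale name is an assumption; "en_US.UTF-8" is the usual choice on a Mac).
Sys.setlocale("LC_ALL", "en_US.UTF-8")
Sys.getlocale()  # should now report en_US.UTF-8 instead of C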

Could you somehow include an example .gdb file?
What happens if you try encoding = "latin1"?
Maybe the gdb data was saved in the wrong encoding? Are you creating it yourself, or did you download it from somewhere?
You could also check the information of the gdb-file with this command:
ogrinfo -al "/path/to/data.gdb"

Related

Reading diacritics in R

I have imported several .txt files (texts written in Spanish) to RStudio using the following code:
content = readLines(paste("my_texts", "text1", sep = "/"))
However, when I read the texts in RStudio, they contain codes instead of diacritics. For example, I see the code <97> instead of an "ó" or the code <96> instead of an "ñ".
I have also realized that if the .txt file was originally written using a computer configured in Spanish, I don't see the codes but the actual diacritics. And if the texts were written using a computer configured in English, then I do get the codes (even though when opening the .txt file in TextEdit I see the diacritics).
I don't know why R displays those symbols and what I can do to retain the diacritics I see in the original .txt files.
I read I could possibly solve this by changing the encoding to UTF-8, so I tried this:
content = readLines(paste("my_texts", "text1",sep = "/"), encoding = "UTF-8")
But that didn't work. Any ideas what those codes are and how to keep my diacritics?
As you figured out, you need to set the correct encoding. Unfortunately the text file was written using a legacy encoding rather than UTF-8 — namely, MacRoman. Ideally the application producing the file would not use this encoding, and Apple products by default no longer produce it.
But since this is what you’ve got, we have to deal with it, and we can. Unfortunately we need to take a detour, because the encoding argument of readLines is a bit useless. Instead, we need to manually open a file connection:
con = file(file.path("my_texts", "text1"), encoding = "macintosh")
on.exit(close(con)) # Always close connections! (on.exit only fires inside a function; in a top-level script, call close(con) after reading instead.)
contents = readLines(con)
Do note that the encoding name “macintosh” is strictly speaking not portable, so this might not work on all platforms.
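If the connection route feels clumsy, an alternative sketch uses the readr package (this assumes your iconv build recognises the "macintosh" encoding name, which, as noted above, is not guaranteed on every platform):
library(readr)
# read_lines() re-encodes from the declared encoding to UTF-8 as it reads
contents <- read_lines(file.path("my_texts", "text1"),
                       locale = locale(encoding = "macintosh"))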

How to read csv file with unknown formatting and unknown encoding in R Program? (example file provided)

I have tried my best to read a CSV file in R but failed. I have provided a sample of the file at the following Gdrive link.
Data
I found that it is a tab-delimited file by opening it in a text editor. The file opens in Excel without issues, but when I try to read it in R using the "readr" package or the base R functions, it fails, and I am not sure why. I have tried different encodings like UTF-8, UTF-16, and UTF-16LE. Could you please help me write the correct script to read this file? Currently, I am converting the file to comma-delimited format in Excel in order to read it in R, but I am sure there must be something that I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how Excel reads the file without any parameters provided. Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading in files with unusual encodings into R. In your case this seems to do the trick:
con <- file("temp.csv", encoding = "UCS-2LE")
dat <- read.delim(con)
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
As for your second question - could we build the same logic in R to guess the encoding and delimiter and read in many types of file without us explicitly specifying them? Yes, this would certainly be possible. Whether it is desirable I'm not sure.
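For what it's worth, a rough sketch of that guessing logic using readr (the file name is the one from the answer above; the encoding passed to read_tsv() is an assumption you would confirm against the guess_encoding() output):
library(readr)
# Rank the likely encodings of the file's raw bytes
guess_encoding("temp.csv")
# Then re-encode while reading; read_tsv() assumes tab-separated values
dat <- read_tsv("temp.csv", locale = locale(encoding = "UTF-16LE"))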

Is there a way to set character encoding when reading SAS files into Spark, or when pulling the data into the R session?

So, I have sas7bdat files that are huge, and I would like to read them into Spark, process them, and then collect the results into an R session. I'm reading them into Spark using the spark.sas7bdat package and the spark_read_sas function. So far so good. The problem is that the character encoding of the sas7bdat files is ISO-8859-1, but to show the content correctly in R it would need to be UTF-8. When I pull the results into R, my data looks like this. (Let's first create an example that has the same raw bytes that my results have.)
mydf <- data.frame(myvar = rawToChar(as.raw(c(0xef, 0xbf, 0xbd, 0x62, 0x63))))
head(mydf$myvar,1) # should get äbc if it was originally read correctly
> �bc
Changing the encoding afterwards doesn't work for some reason.
iconv(head(mydf$myvar,1), from = 'iso-8859-1', to = 'UTF-8')
> �bc
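A quick check shows why iconv() cannot help here: the stored bytes are already ef bf bd, the UTF-8 encoding of U+FFFD (the Unicode replacement character), so the original "ä" was destroyed before the data ever reached R. (This is a diagnostic of the constructed example above, not a fix.)
charToRaw(as.character(head(mydf$myvar, 1)))
# [1] ef bf bd 62 63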
If I use the haven package and read_sas('myfile.sas7bdat', encoding = 'iso-8859-1') to read the file directly into my R session, everything works as expected.
head(mydf$myvar,1)
> äbc
I would be very grateful for a solution that enables me to do the processing in Spark and then collect only the results into the R session, because the files are so big. I guess this could potentially be solved either a) when reading the file into Spark (but I did not find an option that would work) or b) by correcting the encoding in R (I could not get that to work and do not understand why; maybe it has something to do with special character encoding in the sas7bdat file).

Importing data with special characters in R

The following pic shows the data before I import it (in Notepad) and after importing it into R.
I use the following command to import it in R:
Data <- read.csv('data.csv', stringsAsFactors = FALSE, header = TRUE, quote = "")
It can be seen that special characters such as ae are replaced with something like A| (line 19 on the left, line 18 on the right). Is there a way to import the CSV file as it is (using R)?
Your problem is an encoding issue. There are two aspects to this. First, what is saved by Notepad++ may not be in the encoding you are expecting in the saved text file. Second, read.csv() may be reading the file in using a different encoding. This is especially likely because Notepad++ suggests you are on Windows, where you may be unable to set UTF-8 as your system locale for R.
So taking each issue in turn:
1. Getting Notepad++ to save your file in a specific encoding. You can set the encoding for the new file using these instructions. I always use UTF-8, but since your texts are Danish, Latin-1 should work too.
To verify the encoding of your texts, you may wish to use the file utility supplied with RTools. This will tell you something about the probable encoding of your file from the command line, although it is not perfect. (OS X and Linux users already have this without needing to install additional utilities.)
2. Setting the encoding when importing the .csv file into R. When you import the file using read.csv(), specify encoding = "UTF-8" or encoding = "Latin-1". You might also want to check what your system encoding is and match it. You can do this with Sys.getlocale() (and set it with Sys.setlocale()). On my system, for instance:
> Sys.getlocale()
[1] "en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8"
You could of course set this to Windows-1252 but you might have trouble then with portability if using this on other platforms. UTF-8 is the best solution to this.
In my case, I used only the parameter encoding = "Latin-1" and it worked. Thanks.
read.csv(paste(src, sprintf("%s.csv", x), sep = "/"), header = TRUE,
         stringsAsFactors = FALSE, encoding = "Latin-1")
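If encoding = alone does not work on your setup, a variant worth trying (assuming the file really is Latin-1) is read.csv()'s fileEncoding argument, which converts the input as it is read rather than merely declaring how the strings are already encoded:
Data <- read.csv("data.csv", fileEncoding = "latin1",
                 header = TRUE, stringsAsFactors = FALSE, quote = "")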

read an MSWord file into R

Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.
I am using the line:
my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')
to try to read an MSWord file containing the following text:
A 20 1000 AA
B 30 1001 BB
C 10 1500 CC
I get a warning message that says:
Warning message:
In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'
and my.data appears to be gibberish:
# [1] "PK\003\004\024" "¤l" "ÈFÃË‹Átí"
I know with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and then scanned into pdf documents later. The age of the original paper documents, and perhaps imperfections in the paper, typing, and/or scanning process, have resulted in some letters and numbers not being very clear. So far, converting the pdf files to MSWord seems to be the most successful way of correctly translating the tables. Converting the MSWord files to Excel or rich text, etc., has not been very successful. Even after conversion to MSWord the resulting files are very complex and contain numerous errors. I thought that if I could read the MSWord files into R, that might be the most efficient way to edit and correct them.
I am aware of the tm package, which I guess can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.
Thank you for any suggestions.
First, readLines() is not the correct solution, since a Word file is not a text file (that is, plain ASCII text).
The Word-related function in the tm package is called readDOC(), but both it and the required third-party tool (Antiword) are for older Word files (up to Word 2003) and will not work with newer .docx files.
The best I can suggest is that you try readPDF(), also found in the tm package. Note: it requires that the tool pdftotext is installed on your system. Easy for Linux, no idea about Windows. Alternatively, find a Windows tool which converts PDF to plain, ASCII text files (not Word files) - they should open and display correctly using Notepad on Windows - then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.
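A minimal sketch of that readPDF() route (assumptions: the xpdf pdftotext tool is installed, and "scan.pdf" stands in for one of your files; note that readPDF() returns a reader function rather than reading the file itself):
library(tm)
reader <- readPDF(engine = "xpdf", control = list(text = "-layout"))  # -layout preserves column positions
doc <- reader(elem = list(uri = "scan.pdf"), language = "en", id = "scan1")
head(content(doc))  # one character string per line of extracted text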
Finally: I realise that you did not make the original decision in this instance, but for anybody else - Word and PDF are not appropriate formats for storing data that you want to parse.
In case it helps anyone else: per https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html, it appears there's a new package, readtext, dedicated specifically to reading in text data, including Word files (even the newer .docx format).
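A minimal sketch of that readtext route (assuming the package is installed; the path is the one from the question):
library(readtext)
# readtext() returns a data frame with doc_id and text columns
doc <- readtext("c:/users/mark w miller/simple R programs/test_for_r.docx")
cat(doc$text)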
I have not figured out how to read the MSWord file into R, but I have gotten the contents into a format that R can read.
1. I converted the pdf to MSWord with Acrobat X Pro.
2. The original tables had solid vertical lines separating columns. It turns out these vertical lines were disrupting the format of the data when I converted an MSWord file to a text file, but I was able to delete the lines from the MSWord file before creating the text file.
3. I converted the MSWord file to a text file after deleting the vertical lines in step 2.
The resulting text files still require extensive editing, but at least the data are largely present in a format R can read, and I will not have to re-enter all the data in the pdfs by hand, saving many hours of work.
You can do this with RDCOMClient very easily.
That said, some characters will not be read in correctly.
require(RDCOMClient)
# Create the connection
wordApp <- COMCreate("Word.Application")
# Let's set visible to true so you can see it run
wordApp[["Visible"]] <- TRUE
# Define the file we want to open
wordFileName <- "c:/path/to/word/doc.docx"
# Open the file
doc <- wordApp[["Documents"]]$Open(wordFileName)
# Print the text
print(doc$range()$text())
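One caveat worth adding (these cleanup calls are standard Word COM methods, not part of the original snippet): close the document and quit Word when you are done, or the instance keeps running in the background.
# Close the document without saving changes, then quit the Word instance
doc$Close(FALSE)
wordApp$Quit()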
