download.file returns an error when using special letters - r

I'm trying to download several zip files from a webpage. The problem is that when I try to use the download.file function, it returns the error message "cannot open URL" if the URL includes the special Danish letters æ, ø or å. I have the following piece of code:
link <- "http://web.econ.ku.dk/polit/studerende/eksamen/opgrv/filer/rv%20Øk%20B%20X2015S_takehome_answers.zip"
download.file(link, getwd())
Can someone explain to me how I can use download.file() with special letters?
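No answer is shown above, so here is a hedged sketch of the usual fix (assuming the server accepts percent-encoded UTF-8): start from the un-encoded URL, percent-encode it with URLencode(), and pass a real file name as destfile, since download.file() expects a file path rather than a directory like getwd():

```r
# Un-encoded URL, with the spaces and the Danish "Ø" written out
link <- "http://web.econ.ku.dk/polit/studerende/eksamen/opgrv/filer/rv Øk B X2015S_takehome_answers.zip"

# URLencode() turns each space into %20 and "Ø" into its percent-escaped
# UTF-8 bytes (%C3%98), producing a URL that download.file can open
encoded <- URLencode(link)

# mode = "wb" avoids corrupting the zip on Windows; wrapped in try() only
# because the original link may no longer be live
res <- try(download.file(encoded, destfile = "answers.zip", mode = "wb"),
           silent = TRUE)
```

Note that URLencode() skips strings that already contain %XX escapes (unless repeated = TRUE), so it is safest to start from the raw, un-encoded URL as above.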

Related

UTF-8 problems in RStudio

I am passing on my work with some R files to my colleague at the moment, and we are having a lot of trouble getting the files to work on his computer. The script as well as the data contains the Nordic letters, so to prevent this from being an issue, we have made sure to save the R files with UTF-8 encoding.
Still, there are two problems. A solution to either one would be much appreciated:
Problem 1: when loading the standard CSV data file (semicolon-separated, which works on my computer), my colleague gets the following error:
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string 3
But then we have instead tried to make it work, both with a CSV file that he saved in UTF-8 format and with an Excel (xlsx) file. Both of these files he can load fine (with read.csv2 and read_excel from the readxl package, respectively), and in both cases, when he opens the data in R, it looks fine to him too ("æ", "ø" and "å" are included).
The problem first appears when he tries to run the plots that actually have to grab and display the values from the data columns where "æ", "ø" and "å" are included in the values. Here, he gets the following error message:
in grid.call(c_textBounds, as.graphicAnnot(x$label), x$x, x$y, : invalid input 'value with æ/ø/å' in 'utf8towcs'
When I try to run the R script with the UTF-8 CSV data file (comma-separated) and open the data in a tab in RStudio, I can see that æ, ø and å are not written correctly (they appear as a bunch of strange characters). This is odd, considering that this file type should work best; instead I'm having problems with it and not with the standard (non-UTF-8, semicolon-separated) CSV file.
When I try to run the script with the xlsx file, it works completely fine for me. I get to the plot that has to display the data values with æ, ø and å, it displays correctly, and I do not get the same error message.
Why does my colleague get these errors?
(We have also made sure that he has installed the Danish version of R from the CRAN website.)
We have tried all of the above.
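No answer appears above, but a sketch of the usual fix for the "invalid multibyte string" error is to state the file's encoding explicitly when reading, assuming the original semicolon-separated file was saved as Windows-1252/Latin-1 (the common default on Danish Windows systems). The file below is constructed purely for illustration:

```r
# Write a small Latin-1 encoded semicolon-separated file to demonstrate
tmp <- tempfile(fileext = ".csv")
lines <- c("navn;antal", "bl\u00e5b\u00e6rgr\u00f8d;3")
writeLines(iconv(lines, from = "UTF-8", to = "latin1"), tmp, useBytes = TRUE)

# Without fileEncoding this can fail with "invalid multibyte string" in a
# UTF-8 locale; declaring it lets R convert the bytes on any machine
df <- read.csv2(tmp, fileEncoding = "latin1")
df$navn  # "blåbærgrød", with the Danish letters intact
```

The same fileEncoding argument works for read.csv and write.csv2 when going the other way.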

R version 4.2.0 and Swedish letters (ä ö å) not working in the newest R release. Has anyone found a solution?

I have updated to the latest R release (version 4.2.0), but I am now facing the problem that the Swedish special letters can no longer be read. I am working with a database that has many Swedish letters in its factor labels, and even if I read them in as strings, R doesn't recognise them, with the consequence that all summary tables based on these factors as groups are no longer calculated correctly. The code worked fine under the previous release (but I had issues with knitting R Markdown files, hence the need to update).
I have set the encoding to ISO-8859-4 (which covers Northern European languages) after UTF-8 did not work. Is there anything else I could try? Or has anyone found a fix, other than renaming all labels before reading in the .csv files? (I would really like to avoid that workaround, since I often work with similar data.)
I have used read.csv() and it produces cryptic output, replacing the special letters with, for example, <d6> instead of ö and <c4> instead of ä.
I hope that someone has an idea for a fix. Thanks.
Edit: I use Windows.
Sys.getlocale("LC_CTYPE")
[1] "Swedish_Sweden.utf8"
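Not part of the original thread, but an illustration of what the <d6> output means: the file contains the single Latin-1 byte 0xD6 for Ö, which R in a UTF-8 locale cannot interpret, so it prints the raw byte instead. Converting the bytes with iconv (or reading with a Latin-1 fileEncoding, as the answers below suggest) recovers the letter:

```r
x <- rawToChar(as.raw(0xd6))             # one Latin-1 byte, displayed as "<d6>"
iconv(x, from = "latin1", to = "UTF-8")  # "Ö"
```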
Use the encoding parameter
I have been able to detect failed loads by attempting to apply toupper to strings, which gives me errors such as
Error in toupper(dataset$column) :
invalid multibyte string 999751
This is resolved and expected outcomes obtained by using
read.csv(..., encoding = 'latin1')
or
data.table::fread(..., encoding = 'Latin-1')
I believe this solution should apply to Swedish characters as they are also covered by the Latin-1 encoding.
I have the same problem; what worked for me was like the answer above, except that I used the encoding ISO-8859-1 instead. It works both for reading from file and for saving to file with the Swedish characters å, ä, ö, Å, Ä, Ö, i.e.:
read.csv("~/test.csv", fileEncoding = "ISO-8859-1")
and
write.csv2(x, file="test.csv", row.names = FALSE, na = "", fileEncoding = "ISO-8859-1")
It's tedious, but it works for now. Another tip, if you use RStudio, is to go to Global Options -> Code -> Saving, set your default text encoding to ISO-8859-1 and restart RStudio. If I understand correctly, it will then save and read your scripts in that encoding by default. I had the problem that when I opened my scripts with Swedish characters, they would display the wrong characters; this solution fixed that.

Reading diacritics in R

I have imported several .txt files (texts written in Spanish) to RStudio using the following code:
content = readLines(paste("my_texts", "text1",sep = "/"))
However, when I read the texts in RStudio, they contain codes instead of diacritics. For example, I see the code <97> instead of an "ó" or the code <96> instead of an "ñ".
I have also realized that if the .txt file was originally written on a computer configured in Spanish, I don't see the codes but the actual diacritics. And if the texts were written on a computer configured in English, then I do get the codes (even though when opening the .txt file in TextEdit I see the diacritics).
I don't know why R displays those symbols and what I can do to retain the diacritics I see in the original .txt files.
I read I could possibly solve this by changing the encoding to UTF-8, so I tried this:
content = readLines(paste("my_texts", "text1",sep = "/"), encoding = "UTF-8")
But that didn't work. Any ideas what those codes are and how to keep my diacritics?
As you figured out, you need to set the correct encoding. Unfortunately the text file was written using a legacy encoding rather than UTF-8, namely MacRoman. Ideally the application producing the file would not use this encoding, and Apple products no longer produce it by default.
But since this is what you've got, we have to deal with it, and we can. Unfortunately we need to take a detour, because the encoding argument of readLines is a bit useless here. Instead, we need to open a file connection manually:
con = file(file.path("my_texts", "text1"), encoding = "macintosh")
on.exit(close(con)) # Always close connections! (on.exit assumes this runs inside a function; at top level, call close(con) after reading instead)
contents = readLines(con)
Do note that the encoding name “macintosh” is strictly speaking not portable, so this might not work on all platforms.
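As a quick check that MacRoman really is the culprit (assuming your iconv build accepts the "macintosh" name), the <96> and <97> codes from the question decode to exactly the missing letters:

```r
# MacRoman byte 0x96 is "ñ" and 0x97 is "ó", the two codes seen in the question
x <- rawToChar(as.raw(c(0x96, 0x97)))
iconv(x, from = "macintosh", to = "UTF-8")  # "ñó"
```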

How to use knitr when there are some accented characters in path of directory?

I am writing a function in R that produces an HTML report for any list in R memory. The function relies on knitr.
The function is available here:
the ez.html function
The function works quite well, except when the path of the working directory contains special characters (e.g., accented characters).
In other words, if path is :
C:\Users\Nicolas
everything is ok. However, if path is :
C:\Users\Véro
knitr is not able to change the directory.
I found that the author of knitr advises against using special characters in the path. However, as I would like to share the function, I cannot ensure that other people do not use non-ASCII characters.
I tried to avoid the problem by testing whether the path contains non-ASCII characters and creating a new directory when there is at least one.
library(stringr)  # for str_split and str_flatten

wd <- getwd()
if (grepl("[^[:alnum:]]", wd)) {
  wd.decomp <- unlist(str_split(wd, "/"))
  # Index of the first path component after the drive (e.g. "C:") that
  # contains a special character
  special.chr <- which(grepl("[^[:alnum:]]", wd.decomp))[2]
  # Keep only the components before the offending one
  new.wd <- wd.decomp[seq_len(special.chr - 1)]
  new.wd. <- paste0(str_flatten(new.wd, "/"), "/res.easieR")
  dir.create(new.wd., showWarnings = FALSE)
  test <- try(setwd(new.wd.))
  if (inherits(test, "try-error")) {
    # Fall back to backslash separators (Windows)
    new.wd. <- paste0(str_flatten(new.wd, "\\"), "\\res.easieR")
    dir.create(new.wd., showWarnings = FALSE)
    setwd(new.wd.)
  }
}
Once again, this code chunk works quite well, except when administrator rights are required to create a directory.
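A small refinement worth considering: the `[^[:alnum:]]` pattern also matches the ":" in "C:" and any dot, which is why the code above needs the `which(...)[2]` workaround. A helper that flags only genuinely non-ASCII characters is more direct (`has_non_ascii` is my name for it, not part of the original function):

```r
# Flag paths containing any character outside the ASCII range (code > 127)
has_non_ascii <- function(path) any(utf8ToInt(path) > 127L)

has_non_ascii("C:/Users/Nicolas")    # FALSE
has_non_ascii("C:/Users/V\u00e9ro")  # TRUE ("é" is non-ASCII)
```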
Thus, if the path is
C:\Users\Nicolas\ça.a.marché
The function creates the following directory:
C:\Users\Nicolas\res.easieR
However, for
C:\Users\Véro
The function fails because administrator rights are required for creating
C:\Users\res.easieR
Does anyone have an idea either how to make knitr accept a path with accented characters, or how to create a directory without special characters?
Thanks all.

character encoding error not resolved by specifying encoding

I am trying to extract text from a Spanish-language source in R and am running into a character encoding problem that is not resolved by explicitly specifying the encoding within htmlParse, as recommended here.
library(XML)
library(httr)
url <- "http://www3.hcdn.gov.ar//folio-cgi-bin/om_isapi.dll?E1=&E11=&E12=&E13=&E14=&E15=&E16=&E17=&E18=&E2=&E3=&E5=ley&E6=&E7=&E9=&headingswithhits=on&infobase=proy.nfo&querytemplate=Consulta%20de%20Proyectos%20Parlamentarios&record={4EBB}&recordswithhits=on&softpage=Document42&submit=ejecutar%20"
doc <- htmlParse(rawToChar(GET(url)$content),encoding="windows-1252")
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
text[77]
The 77th element, which includes an accented i, has the offending characters. The fourth line has some additional hoops I have to jump through to read this source. The document itself claims to be encoded in "windows-1252." Specifying "latin1" and several other encodings I have tried are no better. In my actual application, I have already downloaded many of these files and am reading them locally using readLines...and I can tell that the error is not present after reading the file into R, so the problem must be in htmlParse. Also, just accepting the encoding error and correcting it ex post does not seem to be an option, as R does not even recognize the characters it is spitting out if I try to copy and paste them back into a script.
Here is a quick fix that may work after you bring the file into R
Encoding(text) <- "UTF-8"
Changing the coding to "UTF-8" makes Spanish files a lot more usable.
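A minimal illustration of what `Encoding<-` does may help: it declares how the string's existing bytes should be interpreted, without converting them (use iconv when an actual conversion is needed):

```r
x <- rawToChar(as.raw(c(0xc3, 0xad)))  # the two UTF-8 bytes of "í"
Encoding(x)             # "unknown": R does not yet know how to read the bytes
Encoding(x) <- "UTF-8"  # declare (not convert) the encoding
x                       # now prints as "í"
```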
