R produces "unsupported URL scheme" error when getting data from https sites

R version 3.0.1 (2013-05-16) for Windows 8, knitr version 1.5, RStudio 0.97.551
I am using knitr to do the markdown of my R code.
As part of my analysis I download various data sets from the web. knitr is fine with getting data from http sites, but for https sites it generates an "unsupported URL scheme" message.
I know that when using the download.file function on a Mac, the method parameter has to be set to curl to get data from an https site; however, this doesn't help when using knitr.
What do I need to do so that knitr will gather data from https websites?
Edit:
Here is the code chunk that returns an error in knitr but works without error when run directly in R.
```{r}
fileurl <- "https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv"
download.file(fileurl, destfile = "C:/Users/xxx/yyy")
```

You can use https with the download.file() function by passing "curl" as the method:
download.file(url,destination,method="curl")
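Which method is actually available depends on the R build and the platform. A minimal sketch, assuming a helper named `pick_download_method` (hypothetical, not part of R), that prefers R's built-in libcurl support and falls back to an external curl program:

```r
# Hypothetical helper: choose a download.file() method that can handle https.
pick_download_method <- function() {
  # capabilities("libcurl") is TRUE when this R build includes libcurl support
  if (isTRUE(unname(capabilities("libcurl")))) {
    return("libcurl")
  }
  # method = "curl" calls an external curl executable, so it must be on the PATH
  if (nzchar(Sys.which("curl"))) {
    return("curl")
  }
  # otherwise let R pick a method itself
  "auto"
}

method <- pick_download_method()
# download.file(url, destination, method = method)
```

The download call is left commented out because it needs network access; `url` and `destination` are the placeholders from the answer above.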

Edit (May 2016): As of R 3.3.0, download.file() should handle SSL websites automatically on all platforms, making the rest of this answer moot.
You want something like this:
library(RCurl)
data <- getURL("https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv",
ssl.verifypeer=0L, followlocation=1L)
That reads the data into memory as a single string. You'll still have to parse it into a dataset in some way. One strategy is:
writeLines(data,'temp.csv')
read.csv('temp.csv')
You can also separate out the data directly without writing to file:
read.csv(text=data)
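The `text =` route can be tried without any network access. A small sketch, using a made-up CSV string in place of the result of getURL():

```r
# A small CSV held in memory as one string, standing in for the output of getURL()
data <- "id,value\n1,10\n2,20\n"

# text = parses the string directly, so no temporary file is needed
df <- read.csv(text = data)
str(df)
```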
Edit: A much easier option is actually to use the rio package:
library("rio")
import("https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv")
This will read directly from the HTTPS URL and return a data.frame.

Use setInternet2(use = TRUE) before using the download.file() function. It works on Windows 7.
setInternet2(use = TRUE)
download.file(url, destfile = "test.csv")

I am sure you have already found a solution to your problem by now.
I was working on an assignment and ended up getting the same error. I tried some of the tricks, but they did not work for me, maybe because I am working on a Windows machine.
Anyhow, I changed the link to http: rather than https: and that did the trick.
Following is a chunk of my code:
if (!file.exists("./PeerAssessment2")) {dir.create("./PeerAssessment2")}
fileURL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, dest = "./PeerAssessment2/Data.zip")
install.packages("R.utils")
library(R.utils)
if (!file.exists("./PeerAssessment2/Data")) {
bunzip2 ("./PeerAssessment2/Data.zip", destname = "./PeerAssessment2/Data")
}
list.files("./PeerAssessment2")
noaaData <- read.csv ('./PeerAssessment2/Data')
Hope this helps.

I had the same issue with knitr and download.file() with a https url, on Windows 8.
You could try setInternet2(TRUE) before using the download.file() function. However I'm not sure that this fix works on Unix-like systems.
setInternet2(TRUE) # set the R_WIN_INTERNET2 to TRUE
fileurl <- "https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv"
download.file(fileurl, destfile = "C:/Users/xxx/yyy") # now it should work
Source: R documentation (?download.file):
Note that https:// URLs are only supported if --internet2 or environment variable R_WIN_INTERNET2 was set or setInternet2(TRUE) was used (to make use of Internet Explorer internals), and then only if the certificate is considered to be valid.

I had the same problem with an https URL: the following code runs perfectly in R but gives "unsupported URL scheme" when knitting to HTML:
temp = tempfile()
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip", temp)
data = read.csv(unz(temp, "activity.csv"), colClasses = c("numeric", "Date", "numeric"))
I tried all the solutions posted here and nothing worked. In desperation I just removed the "s" from "https" in the URL, and everything worked.
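The scheme swap can also be done in code rather than by editing the string by hand. This is only a sketch: it assumes the server offers the same file over plain http, and it gives up the encryption that https provides.

```r
url_https <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"

# Replace only the scheme at the very start of the string
url_http <- sub("^https://", "http://", url_https)
url_http
```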

The R download package takes care of the quirky details typically associated with file downloads. For your example, all you would have needed to do is:
```{r}
library(download)
fileurl <- "https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv"
download(fileurl, destfile = "C:/Users/xxx/yyy")
```

Related

How to make a GET request in base R?

I see several methods of making a GET in R.
Using httr
myurl <- "https://stackoverflow.com"
library(httr)
httr::GET(myurl)
Using rvest/xml2
library(rvest)
xml2::read_html(myurl)
Using curl
I have tested using curl and can confirm the following works on a standard MacBook and on a Windows 10 device:
command <- paste("curl", myurl)
system(command)
Question
The third method above (using curl) seems to work fairly universally.
Is there any better way of making a GET request (or similarly a HEAD, POST etc) than the method above using curl?
'Better' in this case means working universally across operating systems, with minimal coding and minimal external programs/libraries being installed. Or is using curl (through a call to system) the best way?
From base R there is url with method = "libcurl"
con <- url("https://www.stackoverflow.com", method = "libcurl")
tmp <- readLines(con)
Also, this is not strictly base R, but from utils there is url.show
utils::url.show("https://stackoverflow.com", method = "curl")
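A connection opened with url() should be closed again once it has been read. A minimal sketch, assuming a wrapper named `fetch_lines` (hypothetical), demonstrated offline with a file:// URL so that it runs without network access:

```r
# Hypothetical helper: read all lines from a URL through a base R connection.
fetch_lines <- function(u) {
  con <- url(u)
  on.exit(close(con), add = TRUE)  # always release the connection
  readLines(con, warn = FALSE)
}

# Offline demonstration with a file:// URL pointing at a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
lines <- fetch_lines(paste0("file://", normalizePath(tmp, winslash = "/")))
lines
```

For an https address, the same wrapper could pass `method = "libcurl"` to url() as in the answer above.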

Installing pdftotext on Windows (for use with R, 'tm' package)

I am having trouble using R, 'tm' package, to read in .pdf files.
Specifically, I try to run the following code:
library(tm)
filename = "myfile.pdf"
tmp1 <- readPDF(PdftotextOptions="-layout")
doc <- tmp1(elem=list(uri=filename),language="en",id="id1")
doc[1:15]
...which gives me the error:
Error in readPDF(PdftotextOptions = "-layout") :
unused argument (PdftotextOptions = "-layout")
I assume this is due to the fact that the pdftotext program (part of xpdf, http://www.foolabs.com/xpdf/download.html) has not been installed correctly on my machine, so that R cannot access it.
What are the steps to install xpdf/pdftotext correctly such that the above R code can be executed? (I am aware of similar questions already posted, however they don't address the same issue)
PdftotextOptions is not a parameter of readPDF. readPDF has a control parameter, which expects a list. So the correct use would be:
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
tmp1 <- readPDF(control = list(text = "-layout"))
doc <- tmp1(elem=list(uri=filename),language="en",id="id1")
}
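Whether R can see the xpdf tools at all can be checked separately from tm. A small sketch, assuming a helper named `tools_available` (hypothetical):

```r
# Hypothetical helper: TRUE only if every named program is found on the PATH.
tools_available <- function(tools) {
  # Sys.which() returns "" for programs it cannot find
  all(nzchar(Sys.which(tools)))
}

if (!tools_available(c("pdfinfo", "pdftotext"))) {
  message("pdfinfo/pdftotext not found on the PATH")
}
```

If the message appears, the directory containing the executables probably needs to be added to the PATH before starting R.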
Set
setwd('C:/xpdf/bin64')
It works for me.

TreeTagger in R

I have downloaded TreeTaggerv3.2 for Windows and have configured it per the install.txt. I am trying to use it in R with koRpus package. I have set the kRp.env as -
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en",
preset="en", treetagger="manual", format="file",
TT.tknz=TRUE, encoding="UTF-8" )
My data to be tagged is in a file, and I am trying to tag it with treetag("myfile.txt"), but it throws the error:
Error in matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE, :
'data' must be of a vector type, was 'NULL'
In addition: Warning message:
running command 'C:\windows\system32\cmd.exe /c C:\TreeTagger\bin\tag-english.bat
C:\Users\vivsingh\Desktop\NLP\tree_tag_ex.txt' had status 255
The standalone TreeTagger works on my Windows machine. Any idea how to make this work from R?
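The error text shows the parsing step that fails. A sketch of that step, assuming TreeTagger returns one tab-separated line per token with three fields (the sample lines below are made up): when the external command returns nothing, presumably because it exited with status 255, `unlist()` yields NULL and `matrix()` stops with the message quoted above.

```r
# Made-up sample of three-column, tab-separated tagger output
tagged.text <- c("The\tDT\tthe", "dogs\tNNS\tdog", "bark\tVBP\tbark")
parsed <- matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE)
parsed

# With no output at all, unlist() returns NULL and matrix() raises an error
empty <- character(0)
result <- tryCatch(
  matrix(unlist(strsplit(empty, "\t")), ncol = 3, byrow = TRUE),
  error = function(e) conditionMessage(e)
)
result
```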
I had the exact same error and warning while trying lemmatization on an R word vector, following the Bernhard Learns blog, using Windows 7 and R 3.4.1 (x64). The issue also appeared with the textstem package, but TreeTagger ran properly in a cmd window.
I combined several answers I found on this post; here are the steps and code that run properly for me:
Go into the R win-library (~\Documents\R\win-library\3.4\rJava\jri\x64\jri.dll) and copy jri.dll (thanks kravi!) into the parent folder, replacing the copy there.
Close and restart R.
library(koRpus)
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en", preset="en", treetagger="manual", format="file", TT.tknz=TRUE, encoding="UTF-8")
lemma_tagged <- treetag(lemma_unique$word_clean, treetagger="manual", format="obj", TT.tknz=FALSE , lang="en", TT.options=list(path="c:/TreeTagger", preset="en"))
lemma_tagged_tbl <- tbl_df(lemma_tagged@TT.res)
Hope it helps.
I am posting this answer to keep a record. I faced the same issue because of an incorrectly specified location of jri.dll on a 64-bit processor with Windows 8.1. If we call
set.kRp.env(TT.cmd="manual", lang="en", TT.options=list(path="/path/to/tree-tagger-windows-x.x/TreeTagger", preset="en")) and follow either of the following two steps, we can resolve this error:
While installing R, install only the 64-bit version of R and specify the proper paths for these variables:
LD_LIBRARY_PATH = /path/to/rJava/jri
JAVA_HOME = /path/to/jdk1.x.x
java.library.path = /path/to/rJava/jri/jri.dll
CLASSPATH = /path/to/rJava/jri
If we have already installed both the 32-bit and 64-bit versions of R on the computer, then just copy jri.dll from /path/to/rJava/jri/x64/jri.dll and replace the one at path/to/rJava/jri/jri.dll. Further, we need to set the paths of the four variables mentioned above.
I ran into this issue (or a very similar one, I guess) and posted a query on GitHub:
https://github.com/unDocUMeantIt/koRpus/issues/7
The working solution for me was easier than I expected: just downgrade the koRpus package. This may change with time, but this version should remain appropriate.
library("devtools")
install_github("unDocUMeantIt/koRpus", ref="0.06-5")
According to them, this package is not Java related.
You can face the same error while setting up the koRpus environment and getting the result from TreeTagger. For example, when you use:
tagged.text <- treetag(
"C:/temp/sample_text.txt",
treetagger = "manual",
lang = "en",
TT.options = list(
path = "c:/Treetagger",
preset = "en"
),
doc_id = "sample"
)
You would receive a similar error
Error: Awww, this should not happen: TreeTagger didn't return any useful data.
This can happen if the local TreeTagger setup is incomplete or different from what presets expected.
You should re-run your command with the option 'debug=TRUE'. That will print all relevant configuration.
Look for a line starting with 'sys.tt.call:' and try to execute the full command following it in a command line terminal. Do not close this R session in the meantime, as 'debug=TRUE' will keep temporary files that might be needed.
If running the command after 'sys.tt.call:' does fail, you'll need to fix the TreeTagger setup.
If it does not fail but produce a table with proper results, please contact the author!
Here you need to change the value of treetagger, from
treetagger = "manual"
to
treetagger = "kRp.env"
However, before that, remember to set the kRp.env as @Xochitl C. suggested in their answer:
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en", preset="en", treetagger="manual", format="file", TT.tknz=TRUE, encoding="UTF-8")
Once you do this, you'll get the desired result.

R Importing excel file directly from web

I need to import an Excel file directly from the NYSE website. The spreadsheet URL is https://quotespeed.morningstar.com/exportChartDataToExcel.jsp?tickers=AAPL&symbols=126.1.AAPL&st=1980-12-1&ed=2015-6-8&f=m&dty=1&types=1&ver=1.6.0&qs_wsid=E43474CC03753FE0E777D89877788ECB . I tried using the gdata package and changing https to http, but it still doesn't work. Does anybody know a solution to this issue?
EDIT: Has to be imported to R directly from website (project requirement)
Without information about why the gdata package does not work for you, I have to guess. Make sure you have Perl installed; you can download it at http://www.activestate.com/activeperl
This works for me:
library('gdata')
## URL broken into multiple lines for readability
url <- paste("https://quotespeed.morningstar.com/exportChartDataToExcel.",
"jsp?tickers=AAPL&symbols=126.1.AAPL&st=1980-12-1&ed=2015-",
"6-8&f=m&dty=1&types=1&ver=1.6.0&qs_wsid=E43474CC03753FE0E",
"777D89877788ECB", sep = "")
url <- gsub("https", "http",url)
data <- read.xls(url, perl = "C:/Perl64/bin/perl.exe")
Without perl = "path_to_perl.exe" I got the error
Error in findPerl(verbose = verbose) :
perl executable not found. Use perl= argument to specify the correct path.
Error in file.exists(tfn) : invalid 'file' argument
Use the RCurl package to download the file and the readxl package by Hadley to read the Excel file.
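A rough sketch of the download-then-read approach. Note that it swaps in base R's download.file() for RCurl to keep the example short, and it assumes the readxl package is installed; the helper name `read_remote_excel` and its `fileext` argument are hypothetical.

```r
# Hypothetical helper: download an Excel file to a temporary path, then read it.
# Assumes readxl is installed; fileext should match the real file format.
read_remote_excel <- function(url, fileext = ".xls") {
  tmp <- tempfile(fileext = fileext)
  on.exit(unlink(tmp), add = TRUE)  # remove the temporary file afterwards
  # mode = "wb" keeps the binary file intact on Windows
  download.file(url, destfile = tmp, mode = "wb")
  readxl::read_excel(tmp)
}

# data <- read_remote_excel(url)
```

The call is left commented out because it needs network access and a `url` that still serves the spreadsheet.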

Problems in R for OSX with the download.file function

OK, so this is an R problem specific to OSX.
I'm trying to download XML data through an API. The following code works just fine on a PC, but not on a Mac. I have rotated through all the "methods" (curl, etc.) to no avail. Any thoughts?
tempx <- "temp.xml"
url <- "http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1"
download.file(url, tempx, method="auto")
ETA: Here's my error:
trying URL 'http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1'
Error in download.file(url, tempx, method = "auto") :
cannot open URL 'http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1'
This works fine with httr:
library(httr)
url <- "http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1"
GET(url)
because it automatically handles the redirects:
GET(url)$url
# [1] "http://usaspending.gov/api/fpds_api_basic.php?fiscal_year=2012&maj_contracting_agency=97%2A&Contracts=c&sortby=SIGNED_DATE%2Basc&records_from=0&max_records=10&sortby=SIGNED_DATE+asc"
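To get the same effect as download.file(), the response body can be written straight to disk. A minimal sketch, assuming the httr package is installed and using a hypothetical wrapper named `save_url`:

```r
# Hypothetical helper: fetch a URL with httr (which follows redirects)
# and save the response body to a file.
save_url <- function(url, path) {
  resp <- httr::GET(url, httr::write_disk(path, overwrite = TRUE))
  httr::stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx responses
  invisible(path)
}

# save_url(url, "temp.xml")
```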
This is not so much an answer as a comment with formatting. I'm also an OSX user and have the same problem with your code, as well as problems with my own attempts at a solution:
library(RCurl)
library(XML)
gotten <- getURL("http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1")
> gotten
[1] "\n"
> gotten2 <- getURLContent("http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1")
>
> gotten2
[1] "\n"
attr(,"Content-Type")
"text/xml"
So I think some sort of response occurs but that the initial response is very short and the code is not ready to accept what comes afterwards.
