Problems in R for OSX with the download.file function

OK, so this is an R problem specific to OSX.
I'm trying to download XML data through an API. The following code works just fine on a PC, but not on a Mac. I have rotated through all the "methods" (curl, etc.) to no avail. Any thoughts?
tempx <- "temp.xml"
url <- "http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1"
download.file(url, tempx, method="auto")
ETA: Here's my error:
trying URL 'http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1'
Error in download.file(url, tempx, method = "auto") :
cannot open URL 'http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1'

This works fine with httr:
library(httr)
url <- "http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1"
GET(url)
because it automatically handles the redirects:
GET(url)$url
# [1] "http://usaspending.gov/api/fpds_api_basic.php?fiscal_year=2012&maj_contracting_agency=97%2A&Contracts=c&sortby=SIGNED_DATE%2Basc&records_from=0&max_records=10&sortby=SIGNED_DATE+asc"

This is not so much an answer as a comment with formatting. I'm also an OSX user and have the same problem with your code, as well as problems with my own efforts at a solution:
library(RCurl)
library(XML)
gotten <- getURL("http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1")
> gotten
[1] "\n"
> gotten2 <- getURLContent("http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1")
>
> gotten2
[1] "\n"
attr(,"Content-Type")
"text/xml"
So I think some sort of response occurs, but the initial response is very short and the code is not ready to accept what comes afterwards.
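Given the redirect explanation in the httr answer above, a speculative follow-up sketch (not verified against the server) is to tell libcurl to follow the "Location:" header:
library(RCurl)
gotten3 <- getURL(
  "http://usaspending.gov/fpds/fpds.php?detail=b&fiscal_year=2012&maj_agency_cat=97&max_records=10&sortby=d&records_from=1",
  followlocation = TRUE)  # follow the redirect instead of stopping at "\n"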

Related

Webscraping, read_html() - Error in open.connection(x, "rb") : SSL certificate problem: certificate has expired

I am currently trying to build a small webscraper.
I am using the following code to scrape a website:
webpage <- "https://www.whisky.de/shop/Schottland/Single-Malt/Macallan-Triple-Cask-15-Jahre.html"
content <- read_html(webpage)
However, when I run the second line with the read_html command, I get the following error message:
Error in open.connection(x, "rb") :
SSL certificate problem: certificate has expired
Does anyone know where this is coming from? When I used it a few days ago, I did not have any trouble with it.
I am using Mac OS X 10.15.5 and RStudio 1.2.5033.
I have also installed the "rvest" library.
Many thanks for your help in advance!
I was getting the same problem for another website, but the other answer did not solve it for me. I'm posting what worked for me in case it is useful to someone else.
library(tidyverse)
library(rvest)
webpage <- "https://www.whisky.de/shop/Schottland/Single-Malt/Macallan-Triple-Cask-15-Jahre.html"
content <- webpage %>%
httr::GET(config = httr::config(ssl_verifypeer = FALSE)) %>%
read_html()
See here for a discussion about this solution.
Try using the GET function.
library(httr)
library(rvest)
webpage <- "https://www.whisky.de/shop/Schottland/Single-Malt/Macallan-Triple-Cask-15-Jahre.html"
content <- read_html(GET(webpage))
To be clear, the GET function is part of the httr package. Make sure you use GET and not get.
I had the same problem. I fixed it by changing the ssl settings in R. Just add the following line to the beginning of your code (at least before you call read_html()):
httr::set_config(httr::config(ssl_verifypeer = FALSE, ssl_verifyhost = FALSE))
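As a usage sketch (my own addition, combining this answer with the question's code), the config is set once per session and read_html() is then called as usual:
library(httr)
library(rvest)
httr::set_config(httr::config(ssl_verifypeer = FALSE, ssl_verifyhost = FALSE))
webpage <- "https://www.whisky.de/shop/Schottland/Single-Malt/Macallan-Triple-Cask-15-Jahre.html"
content <- read_html(webpage)  # SSL verification is now skipped for this session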

How to use proxies with urls in R?

I am trying to use proxies with my request URLs in R, but it changes my requested URL from "www.abc.com/games" to "www.abc.com/unsupportedbrowser".
The proxies are working, since I tested them in Python; however, I would like to implement this in R.
I tried using the "httr" and "crul" libraries in R:
# Using the httr library
library(httr)
r <- GET(url, use_proxy("proxy_url", port = 12345, username = "abc", password = "xyz"))
text <- content(r, "text")
# Using crul
library(crul)
res <- HttpClient$new(
  url,
  proxies = proxy("proxy_url:12345", "abc", "xyz")
)
out <- res$get()
text <- out$parse("UTF-8")
Is there some other way to implement the above using proxies or how can I avoid getting the request url changing from "www.abc.com/games" to "www.abc.com/unsupportedbrowser"
I also tried using "requestsR" package
However when I try something like this:
library(dplyr)
library(reticulate)
library(jsonlite)
library(requestsR)
library(rvest)
library(listviewer)
proxies <-
  "{'http': 'http://abc:xyz@proxy_url:12345',
    'https': 'https://abc:xyz@proxy_url:12345'}" %>%
  convert_dictionary_to_list()
res <- Get(url, proxy = proxies)
It gives an error: "r$get : $ operator is invalid for atomic vectors"
I don't understand why it raises such an error. Please let me know if this can be resolved.
Thanks!
I was able to solve the above problem by adding the "user_agent" argument to my GET() call.
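For illustration, a rough sketch of what that looks like (the proxy details and user-agent string here are placeholders, not the ones actually used):
library(httr)
r <- GET("https://www.abc.com/games",
         use_proxy("proxy_url", port = 12345, username = "abc", password = "xyz"),
         user_agent("Mozilla/5.0"))  # present a browser-like user agent
text <- content(r, "text")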

R. RCurl 400 Bad Request

I am trying to send a request to an API using the RCurl library.
My code:
start = "2018-07-30"
end = "2018-08-15"
api_request <- paste("https://api-metrika.yandex.ru/stat/v1/data.csv?id=34904255&date1=",
start,
"&date2=",
end,
"&dimensions=ym:s:searchEngine&metrics=ym:s:visits&dimensions=ym:s:<attribution>SearchPhrase&filters=ym:s:<attribution>SearchPhrase!~'some|phrases|here'&limit=100000&oauth_token=OAuth_token_here", sep="")
s <- getURL(api_request)
Every time I try it I get an "Error 400" response, or "Bad Request" if I use getURLContent instead. When I just open this URL in my browser, it works correctly.
I still couldn't find any solution to this problem, so if somebody knows something about it, please help me =)
There are several approaches you can use, assuming the URL is right. First, you can add the following option to the getURL call. Setting followlocation to TRUE makes curl follow any "Location:" header that the server returns as part of the HTTP headers; processing is recursive, so any further "Location:" headers are followed as well.
> s <- getURL(api_request, .opts = curlOptions(followlocation = TRUE))
If this is not working, an alternative is to use the XML package, calling the htmlParse function instead of getURL:
> library(XML)
> s <- htmlParse(api_request)
Another approach would be to use the httr package and call the GET function:
> library(httr)
> s <- GET(api_request)
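As a small diagnostic sketch (not part of the original answer), inspecting the status and body of the httr response can show what the API is actually complaining about:
library(httr)
s <- GET(api_request)
status_code(s)                               # e.g. 400
cat(content(s, "text", encoding = "UTF-8"))  # the error message returned by the API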

Overcoming "Error: unexpected input" in RCurl

I am trying and failing to use RCurl to automate the process of fetching a spreadsheet from a web site, China Labour Bulletin's Strike Map.
Here is the URL for the spreadsheet with the options set as I'd like them:
http://strikemap.clb.org.hk/strikes/api.v4/export?FromYear=2011&FromMonth=1&ToYear=2015&ToMonth=6&_lang=en
Here is the code I'm using:
library(RCurl)
temp <- tempfile()
temp <- getForm("http://strikemap.clb.org.hk/strikes/api.v4/export",
FromYear="2011", FromMonth="1",
ToYear="2015", ToMonth="6",
_lang="en")
And here is the error message I get in response:
Error: unexpected input in:
" ToYear=2015, ToMonth=6,
_"
Any suggestions on how to get this to work?
Try enclosing _lang in backticks.
temp <- getForm("http://strikemap.clb.org.hk/strikes/api.v4/export",
FromYear="2011",
FromMonth="1",
ToYear="2015",
ToMonth="6",
`_lang`="en")
I think R has trouble with an argument name that starts with an underscore. This seems to have worked for me.
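An alternative sketch (my own, untested against this site) is to pass the fields through getForm's .params argument, so the non-syntactic name only appears as a quoted list name:
library(RCurl)
temp <- getForm("http://strikemap.clb.org.hk/strikes/api.v4/export",
                .params = list(FromYear = "2011", FromMonth = "1",
                               ToYear = "2015", ToMonth = "6",
                               "_lang" = "en"))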

R produces "unsupported URL scheme" error when getting data from https sites

R version 3.0.1 (2013-05-16) for Windows 8, knitr version 1.5, RStudio 0.97.551
I am using knitr to do the markdown of my R code.
As part of my analysis I downloaded various data sets from the web. knitr is totally fine with getting data from http sites, but with https ones it generates an "unsupported URL scheme" message.
I know that when using the download.file function on a Mac, the method parameter has to be set to "curl" to get data from an https URL; however, this doesn't help when using knitr.
What do I need to do so that knitr will gather data from https websites?
Edit:
Here is the code chunk that returns an error in Knitr but when run through R works without error.
```{r}
fileurl <- "https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv"
download.file(fileurl, destfile = "C:/Users/xxx/yyy")
```
You can download from https URLs with the download.file() function by passing "curl" as the method:
download.file(url,destination,method="curl")
Edit (May 2016): As of R 3.3.0, download.file() should handle SSL websites automatically on all platforms, making the rest of this answer moot.
You want something like this:
library(RCurl)
data <- getURL("https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv",
ssl.verifypeer=0L, followlocation=1L)
That reads the data into memory as a single string. You'll still have to parse it into a dataset in some way. One strategy is:
writeLines(data,'temp.csv')
read.csv('temp.csv')
You can also separate out the data directly without writing to file:
read.csv(text=data)
Edit: A much easier option is actually to use the rio package:
library("rio")
import("https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv")
This will read directly from the HTTPS URL and return a data.frame.
Use setInternet2(use = TRUE) before using the download.file() function. It works on Windows 7.
setInternet2(use = TRUE)
download.file(url, destfile = "test.csv")
I am sure you have already found a solution to your problem by now.
I was working on an assignment just now and ended up getting the same error. I tried some of the tricks, but they did not work for me, maybe because I am working on a Windows machine.
Anyhow, I changed the link to http: rather than https: and that did the trick.
Following is chunk of my code:
if (!file.exists("./PeerAssessment2")) { dir.create("./PeerAssessment2") }
fileURL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, destfile = "./PeerAssessment2/Data.zip")
install.packages("R.utils")
library(R.utils)
if (!file.exists("./PeerAssessment2/Data")) {
  bunzip2("./PeerAssessment2/Data.zip", destname = "./PeerAssessment2/Data")
}
list.files("./PeerAssessment2")
noaaData <- read.csv("./PeerAssessment2/Data")
Hope this helps.
I had the same issue with knitr and download.file() with an https URL, on Windows 8.
You could try setInternet2(TRUE) before using the download.file() function. However I'm not sure that this fix works on Unix-like systems.
setInternet2(TRUE) # set the R_WIN_INTERNET2 to TRUE
fileurl <- "https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv"
download.file(fileurl, destfile = "C:/Users/xxx/yyy") # now it should work
Source: R documentation (?download.file):
Note that https:// URLs are only supported if --internet2 or environment variable R_WIN_INTERNET2 was set or setInternet2(TRUE) was used (to make use of Internet Explorer internals), and then only if the certificate is considered to be valid.
I had the same problem with an https URL: the following code ran perfectly in R but produced "unsupported URL scheme" when knitting to HTML:
temp = tempfile()
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip", temp)
data = read.csv(unz(temp, "activity.csv"), colClasses = c("numeric", "Date", "numeric"))
I tried all the solutions posted here and nothing worked; in my absolute desperation I just eliminated the "s" in the "https" in the URL and everything worked fine...
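A sketch of that workaround applied to the same chunk (assuming the CloudFront URL still resolves over plain http):
temp <- tempfile()
url_http <- sub("^https://", "http://",
                "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip")
download.file(url_http, temp)  # plain http avoids the unsupported URL scheme error
data <- read.csv(unz(temp, "activity.csv"),
                 colClasses = c("numeric", "Date", "numeric"))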
Using the R download package takes care of the quirky details typically associated with file downloads. For your example, all you would have needed is:
```{r}
library(download)
fileurl <- "https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv"
download(fileurl, destfile = "C:/Users/xxx/yyy")
```
