Can't download .csv from Dropbox - R

require(data.table)
require(httr)
url = "http://www.dropbox.com/s/0brabdf53lc37i/data.csv?dl=1"
request <- GET(url)
Loading required package: data.table
Loading required package: httr
Error in curl::curl_fetch_memory(url, handle = handle) :
Couldn't resolve host name
Calls: GET ... request_fetch -> request_fetch.write_memory -> -> .Call
Execution halted
What gives? The URL works fine in my browser and others have had success downloading dropbox files this way...

I have an answer to your question, after a lot of searching around. When I noticed that R was changing the protocol of your Dropbox URL from http to https, I became suspicious that you might have a certificate problem. As this SO post mentions, that seems to be precisely the case. Try using this code:
require(data.table)
require(httr)
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
url = "http://www.dropbox.com/s/0brabdf53lc37i/data.csv?dl=1"
request <- GET(url, config(cainfo = cafile))
What is happening here:
The cert file cacert.pem contains a list of trusted certificates issued by CAs (Certificate Authorities). When Dropbox sends R its public SSL certificate, R searches this list of trusted certificates to verify it. If it can, it allows the SSL handshake to complete, and your data is downloaded.
The reason you are having this problem while many readers of your question are not is most likely that you never configured curl's certificate settings in your R installation.
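For completeness, here is a rough sketch (my own addition, not part of the original answer) of one way to turn the successful response into a data.table once the certificate check passes; the temp-file step is just one option for handing the body to fread():
request <- GET(url, config(cainfo = cafile))
stop_for_status(request)                      # stop on anything other than a 2xx response
tmp <- tempfile(fileext = ".csv")
writeBin(content(request, as = "raw"), tmp)   # write the raw body to a temporary file
dat <- fread(tmp)                             # read it into a data.table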

The "can't resolve" message is occurring because dropbox is turning your http request into an https request (not all services can follow redirects) and (most likely) because your download protocol can't handle secure http and because the url doesn't request the raw data...
All problems are best fixed by moving to the format the host (dropbox) will direct you too, ie., https, and then switching your code (if necessary) to use the new protocol, and correcting the url to tell dropbox to serve the raw file (not the ?dl=1 suffix you're using, but ?raw=1
So:
Switch the URL to secure:
"https://www.dropbox.com/s/0brabdf53lc37i/data.csv?dl=1"
Switch the request to raw:
"http://www.dropbox.com/s/0brabdf53lc37i/data.csv?raw=1"
Test that in a browser - it would work if this weren't a bad link.
Open it using R functions such as url() that can handle secure transport (an example of this is in this answer).
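Putting the two changes together, a minimal sketch of reading the file directly in R might look like this (untested, since the link above appears to be dead):
library(data.table)
# https scheme plus the ?raw=1 suffix so Dropbox serves the file contents directly
secure_raw_url <- "https://www.dropbox.com/s/0brabdf53lc37i/data.csv?raw=1"
dt <- fread(secure_raw_url)          # fread() can fetch URLs itself
# or, using base R's url() connection, which also handles secure transport:
df <- read.csv(url(secure_raw_url))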

Related

Reddit Scraping Error HTTP status was '403 Forbidden'

I am trying to scrape Reddit using RedditExtractoR
My query
df = get_reddit(search_terms = "blockchain", page_threshold = 2)
shows me errors like:
cannot open URL 'http://www.reddit.com/r/CryptoCurrency/comments/7vga1y/i_will_tell_you_exactly_what_is_going_on_here/.json?limit=500': HTTP status was '403 Forbidden'
cannot open URL 'http://www.reddit.com/r/IAmA/comments/blssl3/my_name_is_benjamin_zhang_and_im_a_transportation/.json?limit=500': HTTP status was '403 Forbidden'
How can I resolve it?
The usual cause of a 403 Forbidden is either:
a) issues with the server
b) getting blocked
I wrote a small R script that uses RedditExtractoR's get_reddit() to test whether you could get blocked by using it:
library(RedditExtractoR)
blockchain <- get_reddit(
  search_terms = "blockchain",
  page_threshold = 2
)
And it worked like a charm no matter how I ran it.
Fortunately, RedditExtractoR has built-in rate limiting to prevent issues.
As per the RedditExtractoR package documentation:
Question: All functions in this library appear to be a little slow, why is that?
Answer: The Reddit API allows users to make 60 requests per minute (1 request per second), which is why URL parsers used in this library intentionally limit requests to conform to the API requirements
But a 403 error due to getting blocked is still a real possibility.
If you start getting 403s, either:
wait a few days to make sure it's not the API that has issues
use a scraping service or a VPN
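If you want to see for yourself whether the API is rejecting you rather than rely on RedditExtractoR's internals, a rough sketch along these lines lets you watch the status codes directly (the search URL, user-agent string, and sleep intervals here are illustrative assumptions, not anything from the package):
library(httr)
test_url <- "https://www.reddit.com/r/CryptoCurrency/search.json?q=blockchain&limit=5"
for (i in 1:3) {
  resp <- GET(test_url, user_agent("my-r-script/0.1"))   # Reddit tends to reject default user agents
  message("attempt ", i, ": HTTP ", status_code(resp))
  if (status_code(resp) == 403) {
    message("blocked or rate-limited; backing off")
    Sys.sleep(60)
  } else {
    Sys.sleep(1)   # stay under the ~1 request/second API limit
  }
}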

Webpage works in browser, but not from R: SSL certificate problem: certificate has expired

This url works in the browser, providing some JSON data.
It worked from R until very recently, it now returns:
library(jsonlite)
fromJSON("https://api.worldbank.org/v2/country?format=json")
# Error in open.connection(con, "rb") :
# SSL certificate problem: certificate has expired
library(rvest)
read_html("https://api.worldbank.org/v2/country?format=json")
# Error in open.connection(con, "rb") :
# SSL certificate problem: certificate has expired
What I know so far
I am not sure if this is an issue on the API side or somewhere in R.
There seems to be an analogous solution here, although any solution I use must not rely on browser automation (Selenium) and must instead use either jsonlite or rvest.
For anyone else who is having a similar issue
The Cause
The website owner had an expired SSL certificate.
I was able to confirm this with an online SSL certificate checker.
(imperfect) Solution
Since I have no control over the url's SSL certificate, I simply changed all the urls I was using from https to http.
For example:
"https://api.worldbank.org/v2/country?format=json"
changes to
"http://api.worldbank.org/v2/country?format=json"
I actually have this issue too... Either way I cannot access it. I get the following error message (WDIcache() does not, of course, work either)
Error in file(con, "r") : cannot open the connection to 'http://api.worldbank.org/indicators?per_page=25000&format=json'
You'll have to set the SSL settings in R with
httr::set_config(config(ssl_verifypeer = FALSE, ssl_verifyhost = FALSE))
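Note that httr::set_config() only affects requests made through httr, not the connection jsonlite::fromJSON() opens on its own, so one sketch of combining the two is to fetch the response with httr and hand the text to jsonlite:
library(httr)
library(jsonlite)
httr::set_config(config(ssl_verifypeer = FALSE, ssl_verifyhost = FALSE))
resp <- GET("https://api.worldbank.org/v2/country?format=json")
stop_for_status(resp)                       # fail on any non-2xx status
countries <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))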

How to format request in HTTP.jl to include cert file

I would like to know how to include a cert file when sending requests in HTTP.jl.
In Python, using Requests it would look like this,
requests.get(url, params=payload, verify=cert_file)
The documentation mentions SSL certs, but is unclear.
It really is poorly documented, and in similar cases I've had to look at the source code of MbedTLS (https://tls.mbed.org/) itself, which is the library HTTP.jl calls for certificate handling.
MbedTLS in turn looks for the system's installed certificates, so if you install the certificate for your user, HTTP.jl should use it for https. I realize this may not help your specific need, which may require something like this (untested):
using HTTP, MbedTLS
conf = MbedTLS.SSLConfig(cert_file, key_file)
resp = HTTP.get("https://httpbin.org/ip", sslconfig=conf)
println(resp)
If you have to go back to the MbedTLS source as I did, I suggest you look at the example at https://github.com/JuliaLang/MbedTLS.jl and the source at
https://github.com/JuliaLang/MbedTLS.jl/blob/master/src/MbedTLS.jl,
especially the function SSLConfig(cert_file, key_file) on line 103.

Check the existence of remote directory with R

I would like to check whether a given directory exists in a remote server, given the URL pointing to that directory. For example:
url <- "http://plasmodb.org/common/downloads/release-24"
How can this be accomplished in R? I have considered using url.show, which downloads and shows the URL if it exists but throws an error for a non-existent directory. I am not sure what the best approach would be, ideally one that avoids downloading the whole resource when the directory does exist.
This will be highly dependent on the server/resource in question, since it has more to do with HTTP status codes than with R capability. Provided a remote server is configured to respond properly to directory index requests, you can use HEAD from httr for this:
library(httr)
status <- HEAD("http://plasmodb.org/common/downloads/release-24/")
status$status_code
## [1] 200
status <- HEAD("http://plasmodb.org/common/downloads/release-100/")
status$status_code
## [1] 404
Here's a nicely formatted list of status codes (http://httpstatus.es) and here's the salient RFC (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html), whose other sections are also worth perusing. Finally, here's a Wikipedia article discussing the "directory index" (http://en.wikipedia.org/wiki/Webserver_directory_index). It shows that you may also get a 403 rather than a 200 or 404 depending on the configuration (and the possibilities aren't limited to those, depending on the web server).
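Wrapping this in a small helper is one way to treat unresolvable hosts or timeouts as "does not exist" rather than as errors; the function name and the 2xx/3xx test below are my own choices, not anything built into httr:
library(httr)
url_exists <- function(url) {
  res <- tryCatch(HEAD(url, timeout(10)), error = function(e) NULL)
  if (is.null(res)) return(FALSE)   # DNS failure, timeout, etc.
  status_code(res) < 400            # treat 2xx/3xx as "exists"
}
url_exists("http://plasmodb.org/common/downloads/release-24/")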

ResponseRedirect fails when source url path contains http://

As part of an image-processing module I accept URLs in the following format in order to process and cache externally hosted images.
http://localhost:56639/remote.axd/http://ipcache.blob.core.windows.net/source/IMG_0671.JPG?width=400&filter=comic
After processing the file, if I use Response.Redirect(url, false) to redirect the server to a valid external cache url, the server returns a 404 error response citing the StaticFileHandler as the source of the error.
If the file comes from a local source, something like:
http://localhost:56639/IMG_0671.JPG?width=400&filter=comic
The server redirects to the external url without issue. Can someone explain why and provide a solution?
Note: remote.axd does nothing other than allow the local server to intercept the external url. I use the .axd extension as it isn't mapped to route by default in MVC.
I've noticed that when looking at the request path the http:// segment is replaced with http:/. I don't know whether that causes an issue.
So the reference to StaticFileHandler is the clue.
After my HttpModule runs, the handler attempts to process the request. When a locally cached file is used, it finds the file and all is OK. Since I am redirecting to a remote URL and have a remote source, the handler finds nothing and throws a 404 exception.
Further processing of the request has to be halted after the rewrite by calling the following method:
HttpApplication.CompleteRequest
