Non-blocking download using curl in R

I am writing some code where I download many pages of data from a web API, do some processing, and combine them into a data frame. The API takes ~30 seconds to respond to each request, so it would be convenient to send the request for the next page while doing the processing for the current page. I can do this using, e.g., mcparallel, but that seems like overkill. The curl package claims that it can make non-blocking connections, but this does not seem to work for me.
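For reference, the kind of overlap described above can also be sketched with curl's multi interface, which queues requests in a pool and drives them with multi_run(). The sketch below is only an illustration of that idea, not part of the original question: page_url(), process_page(), n_pages and the example.com URL are hypothetical placeholders.
library(curl)

n_pages <- 10                                   # hypothetical number of pages
results <- vector("list", n_pages)
pool <- new_pool()

# Hypothetical helpers, for illustration only.
page_url <- function(i) paste0("https://api.example.com/data?page=", i)
process_page <- function(raw_bytes) rawToChar(raw_bytes)  # stand-in for real processing

start_request <- function(i) {
  force(i)  # fix the value of i captured by the callback
  curl_fetch_multi(page_url(i),
                   done = function(res) results[[i]] <<- res,
                   pool = pool)
}

start_request(1)
for (i in seq_len(n_pages)) {
  if (i < n_pages) start_request(i + 1)    # queue the next page up front
  while (is.null(results[[i]]))            # wait only for the current page;
    multi_run(timeout = 0.1, pool = pool)  # queued requests progress as well
  processed <- process_page(results[[i]]$content)
}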
From vignette("intro", "curl"):
As of version 2.3 it is also possible to open connections in
non-blocking mode. In this case readBin and readLines will return
immediately with data that is available without waiting. For
non-blocking connections we use isIncomplete to check if the download
has completed yet.
library(curl)

con <- curl("https://httpbin.org/drip?duration=1&numbytes=50")
open(con, "rb", blocking = FALSE)
while (isIncomplete(con)) {
  buf <- readBin(con, raw(), 1024)
  if (length(buf))
    cat("received: ", rawToChar(buf), "\n")
}
close(con)
The expected result is that open() returns immediately, and then 50 asterisks are progressively printed over 1 second as the results come in. For me, open() blocks for about a second, and then the asterisks are all printed at once.
Is there something else I need to do? Does this work for anyone else?
I am using R version 3.3.2, curl package version 3.1, and libcurl3 version 7.47.0 on Ubuntu 16.04 LTS. I have tried in RStudio and the command line R console, with the same results.

Related

Access locally served files within an R session

Context
In order to test the web capabilities of an R package I am writing, I'm attempting to serve a file locally using the httpuv package so that I can run tests using an offline copy of the page.
Issue
However, curl doesn't seem to want to play nice with httpuv - specifically, when trying to read the hosted file using curl (for example, with curl::curl() or curl::curl_fetch_memory()), the request hangs, and eventually times out if not manually interrupted.
Minimal example
# Serve a small page
server <- httpuv::startServer("0.0.0.0", port = 9359, app = list(
  call = function(req) {
    list(
      status = 200L,
      headers = list("Content-Type" = "text/html"),
      body = "Some content..."
    )
  }
))
# Attempt to retrieve content (this hangs)
page <- curl::curl_fetch_memory(url = "http://127.0.0.1:9359")
httpuv::stopServer(server)
Current progress
Once the server has been started, running curl -v 127.0.0.1:9359 at the terminal returns content as expected. Additionally, if I open a new instance of RStudio and try to curl::curl_fetch_memory() in that new R session (while the old one is still open), it works perfectly.
Encouraged by that, I've been playing around with callr for a while, thinking maybe it's possible to launch the server in some background process, and then continue as usual. Unfortunately I haven't had any success so far with this approach.
Any insight or suggestions very much appreciated!
Isn't it a great feeling when you can come back and answer a question you asked!
From the httpuv::startServer() documentation:
startServer binds the specified port and listens for connections on a thread running in the background. This background thread handles the I/O, and when it receives an HTTP request, it will schedule a call to the user-defined R functions in app to handle the request. This scheduling is done with later(). When the R call stack is empty – in other words, when an interactive R session is sitting idle at the command prompt – R will automatically run the scheduled calls. However, if the call stack is not empty – if R is evaluating other R code – then the callbacks will not execute until either the call stack is empty, or the run_now() function is called. This function tells R to execute any callbacks that have been scheduled by later(). The service() function is essentially a wrapper for run_now().
In other words, if we want to respond to requests as soon as they are received, we have to explicitly do so using httpuv::service(). Something like the following does the trick!
s <- callr::r_session$new()
on.exit(s$close())

s$call(function() {
  httpuv::startServer("0.0.0.0", port = 9359, app = list(
    call = function(req) {
      list(
        status = 200L,
        headers = list("Content-Type" = "text/html"),
        body = "Some content..."
      )
    }
  ))
  while (TRUE) httpuv::service()
})

# Give the server a chance to start
Sys.sleep(3)

page <- curl::curl_fetch_memory(url = "http://127.0.0.1:9359")
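A side note on the design: the while (TRUE) httpuv::service() loop is what keeps the background session handling requests; without it the server would bind the port but never run the app callback. httpuv also provides runServer(), which combines starting the server with a service loop, so the background call could arguably be reduced to the following (untested sketch):
s$call(function() {
  # runServer() starts the server and then services requests in a loop
  # until interrupted, which is exactly what the background session needs.
  httpuv::runServer("0.0.0.0", port = 9359, app = list(
    call = function(req) {
      list(
        status = 200L,
        headers = list("Content-Type" = "text/html"),
        body = "Some content..."
      )
    }
  ))
})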

R / Shiny promises and futures not working with httr

I am working in a Shiny app that connects to Comscore using their API. Any attempt at executing POST commands inside future/promises fails with the cryptic error:
Warning: Error in curl::curl_fetch_memory: Bulk data encryption algorithm failed in selected cipher suite.
This happens with any POST attempt, not only when/if I try to call Comscore's servers. As an example of a simple, harmless and uncomplicated POST request that fails, here is one:
library(httr)
library(future)

rubbish <- future(POST('https://appsilon.com/an-example-of-how-to-use-the-new-r-promises-package/'))
print(value(rubbish))
But everything works fine if I do not use futures/promises.
The problem I want to solve is that currently we have an app that works fine in a single-user environment, but it must be upgraded to a multiuser scenario, served by a dedicated Shiny Server machine. The app makes several such calls in a row (from a few dozen to some hundreds), taking from 5 to 15 minutes.
The code runs inside an observeEvent block, triggered by a button the user clicks after configuring the request to be submitted.
My actual code is longer, and there are other lines both before and after the POST command in order to prepare the request and then process the received answer.
I have verified that all lines before the POST command get executed, so the problem seems to be there, in the attempt to do a POST connecting to the outside world from inside a promise.
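To make the setup concrete, the structure in question looks roughly like the sketch below. It is schematic only: the plan, the URL and the way the result is displayed are placeholders, not the actual Comscore code.
library(shiny)
library(httr)
library(future)
library(promises)

plan(multisession)  # assumed; the question does not say which plan is in use

ui <- fluidPage(
  actionButton("submit", "Submit request"),
  verbatimTextOutput("status")
)

server <- function(input, output, session) {
  result <- reactiveVal(NULL)
  output$status <- renderText(result())

  observeEvent(input$submit, {
    # ... request preparation happens here and runs fine ...
    p <- future({
      # The failure reportedly appears at this point: any POST inside the
      # future errors with "Bulk data encryption algorithm failed in
      # selected cipher suite."
      POST("https://appsilon.com/an-example-of-how-to-use-the-new-r-promises-package/")
    })
    then(p, onFulfilled = function(resp) result(status_code(resp)))
    # ... processing of the received answer would follow here ...
  })
}

shinyApp(ui, server)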
I am using RStudio Server 1.1.453 along with R 3.5.0 on a RHEL server.
Package versions are:
shiny: 1.1.0
httr: 1.3.1
future: 1.9.0
promises: 1.0.1
Thanks in advance,

R not responding request to interrupt stop process

Every now and then I have to run a function that takes a lot of time and I need to interrupt the processing before it's complete. To do so, I click the red "stop" sign at the top of the console in RStudio, which quite often returns the message below:
R is not responding to your request to interrupt processing so to stop the current operation you may need to terminate R entirely.
Terminating R will cause your R session to immediately abort. Active computations will be interrupted and unsaved source file changes and workspace objects will be discarded.
Do you want to terminate R now?
The problem is that I click "No" and then RStudio seems to freeze completely. I would like to know if others face a similar issue and if there is any way to get around this.
Is there a way to stop a process in RStudio quickly without losing the objects in the workspace?
Unfortunately, RStudio is currently not able to interrupt R in a couple of situations:
R is executing an external program (e.g. you cannot interrupt system("sleep 10")),
R is executing (for example) a C / C++ library call that doesn't provide R an opportunity to check for interrupts.
In such a case, the only option is to forcefully kill the R process -- hopefully this is something that could change in a future iteration of RStudio.
EDIT: RStudio v1.2 should now better handle interrupts in many of these contexts.
This can also happen when R is not executing R code but is instead inside an external library call. The only option then is to close the project window. Fortunately, unsaved changes, including workspace objects, are retained on opening RStudio again.
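One workaround worth mentioning (an addition, not part of the answers above): when the long-running step is an external program, launching it through the processx package instead of system() should keep R responsive to interrupts, since processx::run() polls for them and terminates the child process. A minimal sketch, assuming processx is installed:
library(processx)

# Equivalent of system("sleep 10"), but pressing Stop in RStudio
# interrupts it: run() checks for interrupts and kills the child process.
processx::run("sleep", "10")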

streamR, filterStream ends instantly

I am new to R and apologize in advance if this is a very simple issue. I want to make use of the package streamR, however I receive the following output when I execute the filterStream function:
Capturing tweets...
Connection to Twitter stream was closed after 0 seconds with up to 1 tweets downloaded.
I am wondering if I am missing a step during authentication. I am able to successfully use the twitteR package and obtain tweets through the searchTwitter function. Is there something more that I need in order to gain access to the streaming api?
library("ROAuth")
library("twitteR")
library("streamR")
library("RCurl")
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
cred <- OAuthFactory$new(consumerKey="xxxxxyyyyyzzzzzz",
consumerSecret="xxxxxxyyyyyzzzzzzz111111222222"',
requestURL='https://api.twitter.com/oauth/request_token',
accessURL='https://api.twitter.com/oauth/access_token',
authURL='https://api.twitter.com/oauth/authorize')
cred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl") )
save(cred, file="twitter authentication.Rdata")
registerTwitterOAuth(cred)
scoring<- searchTwitter("Landon Donovan", n=100, cainfo="cacert.pem")
filterStream( file.name="tweets_rstats.json",track="Landon Donovan", tweets=10, oauth=cred)
Your credentials look fine. There are a few possible answers I've encountered while working within R:
1. If you are trying to run this on more than one computer at a time (or someone else is using your API keys), it won't work, because you are only allowed one connection per API key. The same goes for attempting to run it in two different projects.
2. It's possible that the internet connection won't allow access to the API to do the scraping. On certain days or in certain locations, my code won't run even after establishing a connection; I'm not sure if this has to do with certain controls on the connection, but some days it'll run on client sites and others it won't.
3. It seems as though if you run a pull and then run another one immediately after, the "0 seconds / 1 tweets" message appears. I don't think you're abusing the rate limit, but there may be a wait period before you can stream again.
4. A final suggestion would be to make sure you have "Read, Write, and Access Direct Messages" allowed in your Application Preferences. This may not make the difference, but then you aren't limiting any connections.
Other than these, there aren't any quick fixes I've encountered; for me it's usually just waiting around, so when I run a pull I make sure to capture a large number of tweets or capture for a few hours at a time.
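Following up on that last point, capturing for a few hours at a time is usually done by giving filterStream() a timeout in seconds rather than a fixed tweet count (see ?filterStream), for example:
# Stream matching tweets for one hour (3600 seconds) instead of
# stopping after a fixed number of tweets.
filterStream(file.name = "tweets_rstats.json",
             track = "Landon Donovan",
             timeout = 3600,
             oauth = cred)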

Async server or quickly loading state in R

I’m writing a webserver that sometimes has to pass data through an R script.
Unfortunately, startup is slow, since I have to load some libraries, which in turn load other libraries, etc.
Is there a way to either
load libraries, save the interpreter state to a file, and quickly load that state the next time the script is invoked, or
maintain a background R process that can be sent messages (not just low-level data streams), which are delegated to asynchronous workers (i.e. sending a new message before the previous one is parsed shouldn’t block)?
R-Websockets is unfortunately synchronous.
Rserve and RSclient are an easy way to create and use an asynchronous server.
Open two R sessions.
In the first one, type:
require(Rserve)
run.Rserve(port = 6311L)
In the second one, type:
require(RSclient)
rsc <- RS.connect(port = 6311L)

# start with a synchronous call
RS.eval(rsc, {x <<- vector(mode = "integer")}, wait = TRUE)

# continue with an asynchronous call
RS.eval(rsc, {
  cat("begin")
  for (i in 1:100000) x[[i]] <- i
  cat("end")
  TRUE
},
wait = FALSE)

# call collect until the result is available
RS.collect(rsc, timeout = 1)
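To flesh out that last step: RS.collect() is typically called in a loop until the asynchronous result arrives, after which the connection can be closed. A small sketch, assuming RS.collect() returns NULL while the computation is still running (check ?RS.collect for the exact semantics in your RSclient version):
# Poll once per second until the asynchronous evaluation finishes.
res <- RS.collect(rsc, timeout = 1)
while (is.null(res)) {
  res <- RS.collect(rsc, timeout = 1)
}
print(res)      # TRUE, the value of the asynchronous block
RS.close(rsc)   # close the connection when done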
