Can I halt ongoing operations without terminating my R session?

I sent API requests to look up the address for each of a list of approximately 80,000 coordinate pairs.
I used the Google Maps API, and the requests themselves appear to have finished, since the last one went out about 6 hours ago.
library(ggmap)

locations <- vector(mode = "list", length = nrow(data))
for (i in 1:nrow(data)) {
  gc <- c(data[i, 2], data[i, 1])               # coordinate pair for row i (revgeocode expects lon/lat)
  locations[[i]] <- revgeocode(as.numeric(gc))  # store the returned address
}
It is still printing out the messages received from each API request and I'd like to use my R session to work on other problems. I can't terminate the session since the data only exists in my local environment and I haven't yet written it to a CSV or JSON file.
Here's what I've tried: I pressed stop a number of times.
Here's what I'd like to try: I could turn off my RStudio Server instance or I could turn off my EC2 instance that is hosting RStudio Server.
My concern is that either of those will terminate the environment and I'll lose the variables.
Do I have to just wait it out?
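(Not part of the original question, but for a future run:) a variant of the loop that quiets the per-request output and periodically checkpoints results to disk makes it safe to stop the session at any point. A rough sketch, assuming revgeocode()'s chatter comes through message() and using a made-up checkpoint file name:

library(ggmap)

locations <- vector(mode = "list", length = nrow(data))
for (i in 1:nrow(data)) {
  gc <- c(data[i, 2], data[i, 1])
  # quiet the per-request console output
  locations[[i]] <- suppressMessages(revgeocode(as.numeric(gc)))
  # checkpoint every 1,000 rows so an interrupted run loses very little
  if (i %% 1000 == 0) saveRDS(locations, "locations_checkpoint.rds")
}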

Related

Access locally served files within an R session

Context
In order to test the web capabilities of an R package I am writing, I'm attempting to serve a file locally using the httpuv package so that I can run tests against an offline copy of the page.
Issue
However, curl doesn't seem to want to play nice with httpuv - specifically, when trying to read the hosted file using curl (for example, with curl::curl() or curl::curl_fetch_memory()), the request hangs, and eventually times out if not manually interrupted.
Minimal example
# Serve a small page
server <- httpuv::startServer("0.0.0.0", port = 9359, app = list(
  call = function(req) {
    list(
      status = 200L,
      headers = list("Content-Type" = "text/html"),
      body = "Some content..."
    )
  }
))
# Attempt to retrieve content (this hangs)
page <- curl::curl_fetch_memory(url = "http://127.0.0.1:9359")
httpuv::stopServer(server)
Current progress
Once the server has been started, running curl -v 127.0.0.1:9359 at the terminal returns the content as expected. Additionally, if I open a new instance of RStudio and run curl::curl_fetch_memory() in that new R session (while the old one is still open), it works perfectly.
Encouraged by that, I've been playing around with callr for a while, thinking maybe it's possible to launch the server in some background process, and then continue as usual. Unfortunately I haven't had any success so far with this approach.
Any insight or suggestions very much appreciated!
Isn't it a great feeling when you can come back and answer a question you asked!
From the httpuv::startServer() documentation:
startServer binds the specified port and listens for connections on a thread running in the background. This background thread handles the I/O, and when it receives an HTTP request, it will schedule a call to the user-defined R functions in app to handle the request. This scheduling is done with later(). When the R call stack is empty – in other words, when an interactive R session is sitting idle at the command prompt – R will automatically run the scheduled calls. However, if the call stack is not empty – if R is evaluating other R code – then the callbacks will not execute until either the call stack is empty, or the run_now() function is called. This function tells R to execute any callbacks that have been scheduled by later(). The service() function is essentially a wrapper for run_now().
In other words, if we want to respond to requests as soon as they are received, we have to explicitly do so using httpuv::service(). Something like the following does the trick!
s <- callr::r_session$new()
on.exit(s$close())

s$call(function() {
  httpuv::startServer("0.0.0.0", port = 9359, app = list(
    call = function(req) {
      list(
        status = 200L,
        headers = list("Content-Type" = "text/html"),
        body = "Some content..."
      )
    }
  ))
  while (TRUE) httpuv::service()
})
# Give the server a chance to start
Sys.sleep(3)
page <- curl::curl_fetch_memory(url = "http://127.0.0.1:9359")
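Rather than sleeping for a fixed 3 seconds, you could poll until the background server actually answers. A sketch building on the session s above (the 1-second handle timeout and the retry count are arbitrary choices):

# Poll until the server in the background session responds (up to ~5 seconds)
ready <- FALSE
for (i in 1:50) {
  ready <- tryCatch({
    curl::curl_fetch_memory("http://127.0.0.1:9359",
                            handle = curl::new_handle(timeout = 1))
    TRUE
  }, error = function(e) FALSE)
  if (ready) break
  Sys.sleep(0.1)
}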

Mongolite Error: Failed to read 4 bytes: socket error or timeout

I was trying to query a Mongo database for all of the IDs it contains so I could compare the list to a separate data frame. However, when I attempt to find all sample_id fields I'm presented with:
Error: Failed to read 4 bytes: socket error or timeout
An example of the find query:
library(mongolite)
mongo <- mongo(collection, url = paste0("mongodb://", user, ":", pass, "@", mongo_host, ":", port, "/", db))
mongo$find(fields = '{"sample_id":1,"_id":0}')
# Error: Failed to read 4 bytes: socket error or timeout
As the error indicates, this is probably an internal socket timeout caused by the large amount of data being returned. However, the MongoDB documentation says the default is to never time out:
socketTimeoutMS:
The time in milliseconds to attempt a send or receive on a socket before the attempt times out. The default is never to timeout, though different drivers might vary. See the driver documentation.
So my question was: why does this error occur when using mongolite? I think I've solved it, but I'd welcome any additional information or input.
The simple answer is that, as the quote from the MongoDB documentation above indicates, "different drivers might vary". In this case the default for mongolite is 5 minutes, as noted in this GitHub issue; I'm guessing it comes from the underlying C driver.
The default socket timeout for connections is 5 minutes. This means
that if your MongoDB server dies or becomes unavailable it will take 5
minutes to detect this. You can change this by providing
sockettimeoutms= in your connection URI.
Also noted in the GitHub issue is the solution, which is to increase sockettimeoutms in the connection URI. Append ?sockettimeoutms=1200000 to the end of the URI to raise the time allowed before a socket timeout (20 minutes in this case). Modifying the original example code:
library(mongolite)
mongo <- mongo(collection, url = paste0("mongodb://", user, ":", pass, "@", mongo_host, ":", port, "/", db, "?sockettimeoutms=1200000"))
mongo$find(fields = '{"sample_id":1,"_id":0}')
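Separately from the timeout, streaming the result in batches through mongolite's iterator interface can avoid holding one very long socket read. A sketch (not from the original answer; it assumes the iterator's batch() method returns NULL once the cursor is exhausted, and the batch size of 1,000 is arbitrary):

it <- mongo$iterate(fields = '{"sample_id":1,"_id":0}')
sample_ids <- character(0)
while (!is.null(batch <- it$batch(1000))) {
  # each batch is a list of records; pull out the sample_id field
  sample_ids <- c(sample_ids, vapply(batch, function(x) as.character(x$sample_id), character(1)))
}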
For Laravel users: add 'sockettimeoutms' => '1200000' to your MongoDB connection config in database.php and enjoy the ride.

Non-blocking download using curl in R

I am writing some code where I download many pages from a web API, do some processing, and combine them into a data frame. The API takes ~30 seconds to respond to each request, so it would be convenient to send the request for the next page while doing the processing for the current page. I can do this using, e.g., mcparallel, but that seems like overkill. The curl package claims that it can make non-blocking connections, but this does not seem to work for me.
From vignette("intro", "curl"):
As of version 2.3 it is also possible to open connections in non-blocking mode. In this case readBin and readLines will return immediately with data that is available without waiting. For non-blocking connections we use isIncomplete to check if the download has completed yet.
library(curl)

con <- curl("https://httpbin.org/drip?duration=1&numbytes=50")
open(con, "rb", blocking = FALSE)
while (isIncomplete(con)) {
  buf <- readBin(con, raw(), 1024)
  if (length(buf))
    cat("received: ", rawToChar(buf), "\n")
}
close(con)
The expected result is that the open should return immediately, and then 50 asterisks should be progressively printed over 1 second as the results come in. For me, the open blocks for about a second, and then the asterisks are printed all at once.
Is there something else I need to do? Does this work for anyone else?
I am using R version 3.3.2, curl package version 3.1, and libcurl3 version 7.47.0 on Ubuntu 16.04 LTS. I have tried in RStudio and the command line R console, with the same results.
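Not an answer from the original thread, but the same curl package also exposes a multi interface that queues requests and only performs them when you drive the pool, which is another way to overlap downloads with processing. A sketch (the httpbin URL is the one from the vignette example above):

library(curl)

results <- new.env()

# Queue the request; nothing is transferred until the pool is run
h <- new_handle(url = "https://httpbin.org/drip?duration=1&numbytes=50")
multi_add(h, done = function(res) {
  results$body <- rawToChar(res$content)   # res$content is the raw response body
})

# ... do other processing here ...

# Drive the queued transfers to completion; multi_run(timeout = 0) instead does
# as much work as is ready without blocking, so it can be called between steps
multi_run()
results$body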

tm crashing R when converting VCorpus to corpus

I am using Windows 7 with a 32-bit operating system and 4 GB of RAM, of which only 3 GB is accessible due to 32-bit limitations. I shut everything else down and can see that I have about 1 GB cached and 1 GB available before starting. The "free" memory varies but is sometimes 0.
Using tm, I am successfully creating a 517 MB VCorpus from three .txt documents in a SwiftKey dataset. When I attempt the next step, converting it to a "corpus" using the tm::Corpus() command, I get an error. Code and output follow:
cname <- file.path("./final/en_US/")
docs <- Corpus(DirSource(cname))
myCorpus <- tm::Corpus(docs)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
...and R terminates. Any ideas how to prevent this?
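Not an answer from the thread, but since the binding constraint here is 32-bit address space, a common workaround with this dataset is to sample a fraction of each file before building the corpus. A rough sketch (the directory layout and the 10% rate are assumptions):

set.seed(1)
files <- list.files("./final/en_US", full.names = TRUE)
sampled <- unlist(lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, ceiling(length(lines) * 0.1))  # keep roughly 10% of each file
}))
docs <- tm::VCorpus(tm::VectorSource(sampled))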

Async server or quickly loading state in R

I'm writing a webserver that sometimes has to pass data through an R script.
Unfortunately startup is slow, since I have to load some libraries, which load other libraries, etc.
Is there a way to either
load libraries, save the interpreter state to a file, and load that state fast when invoked next time? Or
maintain a background R process that can be sent messages (not just low-level data streams), which are delegated to asynchronous workers (i.e. sending a new message before the previous one has been parsed shouldn't block)?
R-Websockets is unfortunately synchronous.
Rserve and RSclient are an easy way to create and use an async server.
Open two R sessions.
In the first one, type:
require(Rserve)
run.Rserve(port=6311L)
In the second one, type:
require(RSclient)
rsc = RS.connect(port=6311L)
# start with a synchronous call
RS.eval(rsc, {x <<- vector(mode="integer")}, wait=TRUE)
# continue with an asynchronous call
RS.eval(rsc, {
  cat("begin")
  for (i in 1:100000) x[[i]] <- i
  cat("end")
  TRUE
}, wait = FALSE)
# call collect until the result is available
RS.collect(rsc, timeout=1)
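To actually retrieve the asynchronous result, keep calling RS.collect() until it hands something back. A sketch, assuming RS.collect() returns NULL when the timeout elapses before the evaluation has finished:

res <- NULL
while (is.null(res)) {
  res <- RS.collect(rsc, timeout = 1)
}
res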
