Web scraping with splashr fails with curl error after many successes

I am scraping a few dozen URLs using splashr, which drives Splash in a Docker container as documented here.
The code runs and completes fine when run directly from RStudio Server on my DigitalOcean Droplet. However, when it runs from a cron job it always fails when reading the 24th URL, with this error:
Error in curl::curl_fetch_memory(url, handle = handle) : Recv failure: Connection reset by peer
Even when the code works when run directly from RStudio, I see this error for the first 14 scrapes:
QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.
But it completes OK.
Is there some memory management or garbage collection that I'm supposed to be doing between scrapes? What would account for the success of a direct run and the failure of the same script being run by a cron job? In short, how do I avoid the curl error mentioned above?
library("tidyverse")
library("splashr")
library("rvest")
# Launch SplashR
# system2("docker", args = c("pull scrapinghub/splash:latest"))
# system2("docker", args = c("run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash:latest"), wait = FALSE)
# splash_active()
pause_after_html_read <- 5
pause_after_html_text <- 3
for (idx in 1:28) {
  pg <- splash(host = "localhost", port = 8050L) %>%
    splash_response_body(FALSE) %>%
    splash_go(url = url_df$web_page[idx]) %>%
    splash_wait(pause_after_html_read) %>%
    splash_html() %>%
    html_text()
  Sys.sleep(pause_after_html_text)
}
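For what it's worth, one defensive pattern I have been considering is wrapping each scrape in tryCatch, backing off and retrying on failure, and calling gc() between iterations. This is only a sketch; the retry count and pauses are arbitrary:
safe_scrape <- function(page_url, pause = pause_after_html_read, retries = 2) {
  for (attempt in seq_len(retries)) {
    pg <- tryCatch(
      splash(host = "localhost", port = 8050L) %>%
        splash_response_body(FALSE) %>%
        splash_go(url = page_url) %>%
        splash_wait(pause) %>%
        splash_html() %>%
        html_text(),
      error = function(e) e
    )
    if (!inherits(pg, "error")) return(pg)
    Sys.sleep(10)  # back off before retrying
    gc()           # release anything left over from the failed attempt
  }
  pg
}
pages <- lapply(url_df$web_page[1:28], safe_scrape)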

Reading this post pointed me to Aquarium. It uses a little more memory than before, but it doesn't crash.
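For anyone following the same route: Aquarium is brought up with docker-compose rather than a single docker run. Here is a rough sketch, assuming the project has already been generated from the Aquarium cookiecutter template and the working directory is that generated project (8050 is the default port of the load balancer; check your generated docker-compose.yml):
# bring up the Aquarium stack (Splash instances behind a load balancer)
system2("docker-compose", args = c("up", "-d"), wait = FALSE)
# splashr then connects exactly as before
splash_active()
sp <- splash(host = "localhost", port = 8050L)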

Related

How to check if Shiny app is running on a specific IP address and port?

I have a shiny app which runs on a specific ip address and port.
shiny_start_script.R
runApp(host = myhost, port = myport)
I would like to have another R script which frequently checks whether that Shiny app is live.
That way, if it is terminated for some reason, the script can run the runApp command above.
How can I accomplish that?
I am using a very similar script to keep my Shiny app running at all times with the help of the Windows Task Scheduler.
Here is the script you can run:
library(httr)
library(shiny)
url <- "http://....." # your url or ip adress
out <- tryCatch(GET(url), error = function(e) e) # check whether the URL responds
logic <- any(class(out) == "error") # TRUE/FALSE depending on whether an error occurred
# The code below re-runs the main app script if there was an error.
if (logic) {
  Sys.sleep(2)
  wdir <- "your_working_directory"
  setwd(wdir)
  require(shiny)
  x <- system("ipconfig", intern = TRUE)
  z <- x[grep("IPv4", x)]
  ip <- gsub(".*? ([[:digit:]])", "\\1", z)
  source("./global.R")
  runApp(wdir, launch.browser = FALSE, port = myport, host = ip)
}
After saving the script, say as "control.R", the next thing to do is set up a Windows Task Scheduler task that runs it. You can find information about that in this question.
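If you would rather keep the check inside a single long-running R process instead of re-running it from the Task Scheduler, here is a minimal watchdog sketch (the URL, the start script path and the check interval are assumptions you would adapt):
library(httr)
url <- "http://....." # your url or ip address, as above
repeat {
  out <- tryCatch(GET(url), error = function(e) e)
  if (inherits(out, "error")) {
    # the app did not respond: start it again in a background R session
    system("Rscript shiny_start_script.R", wait = FALSE)
  }
  Sys.sleep(60) # check once a minute
}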

What is the best and correct way of hosting an endpoint running R code?

I think it must be a relatively common use case to load a model and invoke an endpoint to call R's predict(object, newdata, ...) function. I wanted to do this with a custom AWS Sagemaker container, using plumber on the R side. This example gives all the details, I think, and this bit of documentation also explains how the container should be built and react.
I followed the steps of these documents, but I get
The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.
in the Sagemaker console after a couple of long minutes, and the endpoint creation fails.
This is my container:
# --- Dockerfile
FROM rocker/r-base
RUN apt-get -y update && apt-get install -y libsodium-dev libcurl4-openssl-dev
RUN apt-get install -y \
ca-certificates
RUN R -e "install.packages(c('lme4', 'plumber'))"
ADD ./plumber.R /
ENTRYPOINT ["R", "-e", "plumber::pr_run(plumber::pr('plumber.R'), port=8080)", \
"--no-save"]
# --- plumber.R
library(plumber)
library(lme4)
prefix <- '/opt/ml'
print(dir('/opt/ml', recursive = TRUE))
model <- readRDS(file.path(prefix, 'model', 'model.RDS'))
#* @apiTitle Guess the likelihood of something

#' Ping to show server is there
#' @get /ping
function() {
  print(paste('successfully pinged at', Sys.time()))
  return('')
}

#' Parse input and return prediction from model
#' @param req The http request sent
#' @post /invocations
function(req) {
  print(paste('invocation triggered at', Sys.time()))
  conn <- textConnection(gsub('\\\\n', '\n', req$postBody))
  data <- read.csv(conn)
  close(conn)
  print(data)
  predict(model, data,
          allow.new.levels = TRUE,
          type = 'response')
}
And then the endpoint is created using this code:
# run_on_sagemaker.py
# [...]
create_model_response = sm.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        'Image': image_uri,
        'ModelDataUrl': s3_model_location
    }
)
create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': instance_type,
        'InitialInstanceCount': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic'}])
print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

print('Endpoint Response:')
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

try:
    sm.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
finally:
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Arn: " + resp['EndpointArn'])
    print("Status: " + status)
    if status != 'InService':
        raise Exception('Endpoint creation did not succeed')

print(create_model_response['ModelArn'])
Most of the code is copied from the above-mentioned example; the most significant difference I note is that in my container the model is loaded right away, while in the example the model object is loaded every time an invocation is made (which must slow responses down, so I wonder why?).
The CloudWatch logs match the output of the container when it is run locally and indicate no failure. Locally I can query the container with
curl -d "data\nin\ncsv\nformat" -i localhost:8080/invocations and it works fine, giving back a prediction for every row in the POST data. Also, curl localhost:8080/ping returns [""], as it should, I think. And it shows no signs of being slow; the model object is only 4.4 MiB in size (although this will grow considerably once this simple version runs).
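For completeness, the same two local checks can be reproduced from R with httr; a small sketch (the CSV column and values are placeholders):
library(httr)
# health check: expect HTTP 200 and an (almost) empty body
status_code(GET("http://localhost:8080/ping"))
# invocation: post a tiny CSV payload to the container
resp <- POST("http://localhost:8080/invocations",
             body = "x\n1\n2\n",
             content_type("text/csv"))
content(resp)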
The error on the terminal is
Traceback (most recent call last):
  File "run_on_sagemaker.py", line 57, in <module>
    sm.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
  File "[...]/lib/python3.8/site-packages/botocore/waiter.py", line 53, in wait
    Waiter.wait(self, **kwargs)
  File "[...]/lib/python3.8/site-packages/botocore/waiter.py", line 320, in wait
    raise WaiterError(
botocore.exceptions.WaiterError: Waiter EndpointInService failed: Waiter encountered a terminal failure state

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_on_sagemaker.py", line 64, in <module>
    raise Exception('Endpoint creation did not succeed')
So, why is this failing in the SageMaker console? Is this a good way, are there better ways, and how can I do further diagnostics? More generally, I also could not get the AWS example (see above) for bringing your own R container running, so I wonder what the best way is to run R predictions from a SageMaker model.

rsDriver error when executed through company's network

I am facing an issue while running the rsDriver() function to open up the chrome browser.
Code:
library("RSelenium")
library("wdman")
mybrowser <- rsDriver(browser=c("chrome"), chromever="80.0.3987.16",port = 443L)
remDr <- mybrowser$client
remDr$navigate("https://google.co.in/")
Sys.sleep(2)
When I run this code on my machine while connected to my home network, it works as expected. But when I run it from my office network, the rsDriver(browser=c("chrome"), chromever="80.0.3987.16", port = 443L) call gives me the error below and I am stuck at this point.
checking Selenium Server versions:
BEGIN: PREDOWNLOAD
Error in open.connection(con, "rb") :
Timeout was reached: [www.googleapis.com] Operation timed out after 10000 milliseconds with 0 out of
0 bytes received
I tried connecting through the company's proxy with the code below, but still no luck. I also tried ports 4444, 4445 and 4567, but got the same error.
cprof <- list(chromeOptions = list(args = list("--proxy-server= gproxy.go.company.org:8080")))
mybrowser <- rsDriver(browser=c("chrome"), chromever="80.0.3987.16", port = 443L,extraCapabilities = cprof)
It would be very helpful if someone could help me understand the issue and suggest a solution. Am I missing something in the code? Any help would be highly appreciated.
Also, do let me know if any additional information is required.
To me this looks like a proxy issue. Are you able to retrieve an arbitrary website? E.g. using httr::GET("www.google.com"). If not, this would also point to a problem with the proxy.
Have you tried to configure it in .Renviron? Like so:
file.edit('~/.Renviron')
Add this line to the file and restart RStudio:
http_proxy=USER:PASSWORD@PROXY:PORT
Another option: setting proxy with httr/curl:
library(httr)
set_config(use_proxy(url = "proxy.com",
                     port = 8080,
                     username = "foo",
                     password = "bar"))
I worked around this by switching networks: first connect to my local network, and once the browser opens up, switch to the company's network.

multi-computer makePSOCKcluster on Windows: Building a step-by-step guide

I've been trying to build a cluster using multiple computers for three days now and have failed spectacularly. So now I'm going to try to suck a bunch of you into solving my problem for me. If all goes well, I would hope we can generate a step-by-step guide to use as a reference to do this in the future, because as of yet, I haven't managed to find a decent reference for setting this up (perhaps it's too specific a task?)
In my case, let's assume Windows 7, with PuTTY as the SSH client, and 'localhost' is going to serve as the master.
Furthermore, let's assume only two computers on the same network for now. I imagine the process will generalize easily enough that if I can get it to work on two computers, I can get it to work on three. So we'll work on localhost and remote-computer.
Here's what I've gathered so far (with references linked at the bottom)
Install PuTTY on localhost.
Install PuTTY on remote-computer
Install an SSH server on remote-computer
Assign it a port to listen on? (I'm not sure about this step)
Install R on localhost
Install the same version of R on remote-computer
Add R to the PATH environment variable on both localhost and remote-computer
Run the R code below from localhost
code:
library(parallel)
cl <- makePSOCKcluster(c(rep("localhost", 2),
rep("remote-computer", 2)))
So far, I've done steps 1-3, not sure if I need to do 4, done 5-7, and the code for step 8 just hangs indefinitely.
When I check my SSH server logs, it doesn't appear that I'm hitting the SSH server from localhost. So it appears that my first problem is configuring the SSH correctly. Has anyone succeeded in doing this and would you be willing to share your expertise?
EDIT
Oops: references
http://www.milanor.net/blog/wp-content/uploads/2013/10/03.FirstStepinParallelComputing.pdf
R Parallel - connecting to remote cores
https://stat.ethz.ch/pipermail/r-sig-hpc/2010-October/000780.html
At best, this is a partial answer. I'm still not establishing a cluster, but the steps described here are a pretty good record of how I've gotten to this point.
CONFIGURATIONS:
Install PuTTY on 'remote-computer'
Install SSH server on 'remote-computer'
Install R on 'remote-computer' (Use the same version of R as on 'localhost')
Add R to the PATH
Install PuTTY on 'localhost'
Install R on 'localhost'
Add R to the PATH
TESTING THE CONNECTION: PHASE I
From the command line, run
C:\PuTTYPath\plink.exe -pw [password] [username]@[remote_ip_address] Rscript -e rnorm(100)
(Confirm return of 100 normal random variates)
From the command line, run
C:\PuTTYPath\plink.exe -pw [password] [username]@[remote_ip_address] RScript -e parallel:::.slaveRSOCK() MASTER=[local_ip_address] PORT=100501 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
(Confirm that a session is started on the SSH server logs on 'remote-computer')
TESTING THE CONNECTION: PHASE II
From an R Session, run
system(paste0("C:/PuTTYPath/plink.exe -pw [password] ",
"[username]#[remote_ip_address] ",
"RScript -e rnorm(100)"))
(Confirm return of 100 normal random variates)
From an R session, run
system(paste0("C:/PuTTY/plink.exe ",
"-pw [password] ",
"[username]#[remote_ip_address] ",
"RScript -e parallel:::.slaveRSOCK() ",
"MASTER=[local_ip_address] ",
"PORT=100501 ",
"OUT=/dev/null ",
"TIMEOUT=2592000 ",
"METHODS=TRUE ",
"XDR=TRUE"))
(Confirm that a session is started and maintained on the SSH server logs on 'remote-computer')
ESTABLISH A CLUSTER
From an R Session, run
library(snow)
cl <- makeCluster(spec = c("localhost", "[remote_ip_address]"),
                  rshcmd = "C:/PuTTY/plink.exe -pw [password]",
                  host = "[local_ip_address]")
(A session should be started and maintained on the SSH server logs on 'remote-computer'.
Ideally, the function will complete and 'cl' will be assigned.)
Establishing the cluster is the point at which I'm failing. I run makeCluster and watch my SSH server logs. It shows a connection is made and then immediately closed. makeCluster never finishes running, cl is not assigned, and I'm stuck on how to go on. I'm not even sure if this is an R problem or a configuration problem at this point.
EDIT AND RESOLUTION:
For no good reason, I tried running this with the snow package, as shown in the "Establish a Cluster" section above. When I used the snow package, the cluster is built and runs stably. Not sure why I couldn't get this to work with the parallel package, but at least I've got something functional.
For those who are looking to establish clusters across several computers on Windows, @Benjamin's answer is almost correct: you need to follow his instructions up to the last step, ESTABLISH A CLUSTER, and make sure the previous steps all work on your computer. My solution is based on the package 'parallel' instead of 'snow', which is essentially the same.
Solution
Code template:
machineAddresses <- list(list(host = '[Server address]',
                              user = '[user name]',
                              rscript = "[The Rscript file in the server]",
                              rshcmd = "plink -pw [Your password]"))
cl <- makePSOCKcluster(machineAddresses, manual = F)
You have to fill in all the [] in your code. On my computer, it is:
machineAddresses <- list(list(host = '192.168.1.220',
                              user = 'jeff',
                              rscript = "C:/Program Files/R/R-3.3.2/bin/Rscript",
                              rshcmd = "plink -pw qwer"))
cl <- makePSOCKcluster(machineAddresses, manual = F)
Reason
Running a cluster on Windows is very tricky; the function makePSOCKcluster usually does not work as expected. The easiest way to make it work is to change manual=F to manual=T and create the workers manually (a sketch follows the link below). Here is a related post which talks about why makePSOCKcluster will hang forever; I think the two posts are basically stuck at the same place. I also posted an answer to that question discussing how to make it work.
R Parallel - connecting to remote cores
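If the automatic start keeps hanging, here is a sketch of the manual route (all bracketed values are placeholders, as elsewhere in this thread):
library(parallel)
machineAddresses <- list(list(host = '[remote_ip_address]',
                              user = '[user name]',
                              rscript = "[path to Rscript on the worker]"))
# manual = TRUE prints the command each worker must run and then waits;
# run that command yourself on the remote machine (e.g. via plink) and
# makePSOCKcluster returns once the worker connects back.
cl <- makePSOCKcluster(machineAddresses, manual = TRUE)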
As I do not have the reputation to post a comment on Jeff's answer, I will post this as an answer:
The reason I have found that automatic start of cluster nodes using makePSOCKcluster does not work on Windows is that the arg and the outfile arguments in the internal parallel function newPSOCKnode are wrapped in the shQuote function. This causes the combination of cmd.exe and Rscript.exe to return an error, which leads to makePSOCKcluster hanging forever.
The following two function definitions enable the automatic starting of the cluster nodes using makePSOCKcluster, assuming a proper configuration of ssh or putty/plink for key-based password-less login:
makePSOCKcluster <- function (names, ...)
{
    if (is.numeric(names)) {
        names <- as.integer(names[1L])
        if (is.na(names) || names < 1L)
            stop("numeric 'names' must be >= 1")
        names <- rep("localhost", names)
    }
    parallel:::.check_ncores(length(names))
    options <- parallel:::addClusterOptions(parallel:::defaultClusterOptions, list(...))
    cl <- vector("list", length(names))
    for (i in seq_along(cl)) cl[[i]] <- newPSOCKnode(names[[i]],
        options = options, rank = i)
    class(cl) <- c("SOCKcluster", "cluster")
    cl
}
newPSOCKnode <- function (machine = "localhost", ..., options = parallel:::defaultClusterOptions,
    rank)
{
    options <- parallel:::addClusterOptions(options, list(...))
    if (is.list(machine)) {
        options <- parallel:::addClusterOptions(options, machine)
        machine <- machine$host
    }
    outfile <- parallel:::getClusterOption("outfile", options)
    master <- if (machine == "localhost")
        "localhost"
    else parallel:::getClusterOption("master", options)
    port <- parallel:::getClusterOption("port", options)
    setup_timeout <- parallel:::getClusterOption("setup_timeout", options)
    manual <- parallel:::getClusterOption("manual", options)
    timeout <- parallel:::getClusterOption("timeout", options)
    methods <- parallel:::getClusterOption("methods", options)
    useXDR <- parallel:::getClusterOption("useXDR", options)
    env <- paste0("MASTER=", master, " PORT=", port, " OUT=",
        #shQuote(outfile), " SETUPTIMEOUT=", setup_timeout, " TIMEOUT=",
        (outfile), " SETUPTIMEOUT=", setup_timeout, " TIMEOUT=",
        timeout, " XDR=", useXDR)
    arg <- "parallel:::.slaveRSOCK()"
    rscript <- if (parallel:::getClusterOption("homogeneous", options)) {
        shQuote(parallel:::getClusterOption("rscript", options))
    }
    else "Rscript"
    rscript_args <- parallel:::getClusterOption("rscript_args", options)
    if (methods)
        rscript_args <- c("--default-packages=datasets,utils,grDevices,graphics,stats,methods",
            rscript_args)
    cmd <- if (length(rscript_args))
        paste(rscript, paste(rscript_args, collapse = " "), "-e",
            #shQuote(arg), env)
            arg, env)
    #else paste(rscript, "-e", shQuote(arg), env)
    else paste(rscript, "-e", arg, env)
    renice <- parallel:::getClusterOption("renice", options)
    if (!is.na(renice) && renice)
        cmd <- sprintf("nice +%d %s", as.integer(renice), cmd)
    if (manual) {
        cat("Manually start worker on", machine, "with\n ",
            cmd, "\n")
        utils::flush.console()
    }
    else {
        if (machine != "localhost") {
            rshcmd <- parallel:::getClusterOption("rshcmd", options)
            user <- parallel:::getClusterOption("user", options)
            cmd <- shQuote(cmd)
            cmd <- paste(rshcmd, "-l", user, machine, cmd)
        }
        if (.Platform$OS.type == "windows") {
            system(cmd, wait = FALSE, input = "")
        }
        else system(cmd, wait = FALSE)
    }
    con <- socketConnection("localhost", port = port, server = TRUE,
        blocking = TRUE, open = "a+b", timeout = timeout)
    structure(list(con = con, host = machine, rank = rank), class = if (useXDR)
        "SOCKnode"
    else "SOCK0node")
}
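With these redefinitions sourced into the session, here is a usage sketch mirroring Jeff's template above (all bracketed values are placeholders):
machineAddresses <- list(list(host = "[remote_ip_address]",
                              user = "[user name]",
                              rscript = "[path to Rscript on the worker]",
                              rshcmd = "plink -pw [password]"))
cl <- makePSOCKcluster(machineAddresses)  # uses the redefined functions above
parallel::clusterCall(cl, function() Sys.info()[["nodename"]])  # quick sanity check
parallel::stopCluster(cl)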
I plan to update this response with more complete setup instructions when I have the chance.

Run R/Rook as a web server on startup

I have created a server using Rook in R - http://cran.r-project.org/web/packages/Rook
Code is as follows:
#!/usr/bin/Rscript
library(Rook)
s <- Rhttpd$new()
s$add(
  name="pingpong",
  app=Rook::URLMap$new(
    '/ping' = function(env){
      req <- Rook::Request$new(env)
      res <- Rook::Response$new()
      res$write(sprintf('<h1><a href="%s">Pong</a></h1>', req$to_url("/pong")))
      res$finish()
    },
    '/pong' = function(env){
      req <- Rook::Request$new(env)
      res <- Rook::Response$new()
      res$write(sprintf('<h1><a href="%s">Ping</a></h1>', req$to_url("/ping")))
      res$finish()
    },
    '/?' = function(env){
      req <- Rook::Request$new(env)
      res <- Rook::Response$new()
      res$redirect(req$to_url('/pong'))
      res$finish()
    }
  )
)
## Not run:
s$start(port=9000)
$ ./Rook.r
Loading required package: tools
Loading required package: methods
Loading required package: brew
starting httpd help server ... done
Server started on host 127.0.0.1 and port 9000 . App urls are:
http://127.0.0.1:9000/custom/pingpong
Server started on 127.0.0.1:9000
[1] pingpong http://127.0.0.1:9000/custom/pingpong
Call browse() with an index number or name to run an application.
$
And the process ends here.
It runs fine in the R shell, but I want to run it as a server on system startup.
So once start is called, R should not exit but should wait for requests on the port.
How will I convince R to simply wait or sleep rather than exiting?
I can use the wait or sleep function in R to wait some N seconds, but that doesn't fit the bill perfectly.
Here is one suggestion:
First split the example you gave into (at least) two files: One file contains the definition of the application, which in your example is the value of the app parameter to the Rhttpd$add() function. The other file is the RScript that starts the application defined in the first file.
For example, if your application function is named pingpong and is defined in a file named Rook.R, then the Rscript might look something like:
#!/usr/bin/Rscript --default-packages=methods,utils,stats,Rook
# This script takes as a single argument the port number on which to listen.
args <- commandArgs(trailingOnly=TRUE)
if (length(args) < 1) {
  cat(paste("Usage:",
            substring(grep("^--file=", commandArgs(), value=T), 8),
            "<port-number>\n"))
  quit(save="no", status=1)
} else if (length(args) > 1)
  cat("Warning: extra arguments ignored\n")
s <- Rhttpd$new()
app <- RhttpdApp$new(name='pingpong', app='Rook.R')
s$add(app)
s$start(port=args[1], quiet=F)
suspend_console()
As you can see, this script takes one argument that specifies the listening port. Now you can create a shell script that will invoke this Rscript multiple times to start multiple instances of your server listening on different ports in order to enable some concurrency in responding to HTTP requests.
For example, if the Rscript above is in a file named start.r then such a shell script might look something like:
#!/bin/sh
if [ $# -lt 2 ]; then
  echo "Usage: $0 <start-port> <instance-count>"
  exit 1
fi
start_port=$1
instance_count=$2
end_port=$((start_port + instance_count - 1))
fifo=/tmp/`basename $0`$$
exit_command="echo $(basename $0) exiting; rm $fifo; kill \$(jobs -p)"
mkfifo $fifo
trap "$exit_command" INT TERM
cd `dirname $0`
for port in $(seq $start_port $end_port)
do ./start.r $port &
done
# block until interrupted
read < $fifo
The above shell script takes two arguments: (1) the lowest port-number to listen on and (2) the number of instances to start. For example, if the shell script is in an executable file named start.sh then
./start.sh 9000 3
will start three instances of your Rook application listening on ports 9000, 9001 and 9002, respectively.
You can see that the last line of the shell script reads from the fifo, which prevents the script from exiting until a received signal causes it to. When one of the specified signals is trapped, the shell script kills all the Rook server processes that it started and then exits.
Now you can configure a reverse proxy to forward incoming requests to any of the server instances. For example, if you are using Nginx, your configuration might look something like:
upstream rookapp {
    server localhost:9000;
    server localhost:9001;
    server localhost:9002;
}
server {
    listen your.ip.number.here:443;
    location /pingpong/ {
        proxy_pass http://rookapp/custom/pingpong/;
    }
}
Then your service can be available on the public Internet.
The final step is to create a control script with options such as start (to invoke the above shell script) and stop (to send it a TERM signal to stop your servers). Such a script will handle things such as causing the shell script to run as a daemon and keeping track of its process id number. Install this control script in the appropriate location and it will start your Rook application servers when the machine boots. How to do that will depend on your operating system, the identity of which is missing from your question.
Notes
For an example of how the fifo in the shell script can be used to take different actions based on received signals, see this stack overflow question.
Jeffrey Horner has provided an example of a complete Rook server application.
You will see that the example shell script above traps only INT and TERM signals. I chose those because INT results from typing control-C at the terminal and TERM is the signal used by control scripts on my operating system to stop services. You might want to adjust the choice of signals to trap depending on your circumstances.
Have you tried this?
while (TRUE) {
  Sys.sleep(0.5);
}
