make concurrent RCurl GET requests for set of URLs

I wrote a function that uses RCurl to obtain the effective URL for a list of shortened URL redirects (bit.ly, t.co, etc.) and to handle errors when the effective URL points to a document (PDFs tend to throw "Error in curlPerform... embedded nul in string").
I would like to make this function more efficient if possible (while keeping it in R). As written, the run time is prohibitively long for un-shortening a thousand or more URLs.
?getURI tells us that by default, getURI/getURL goes asynchronous when the length of the url vector is >1. But my performance seems totally linear, presumably because sapply calls the function once per URL, so each getURI call only ever sees a single URL and the concurrency is lost.
Is there any way I can speed up these requests? Extra credit for fixing the "embedded nul" issue.
require(RCurl)
options(RCurlOptions = list(verbose = FALSE, followlocation = TRUE,
                            timeout = 500, autoreferer = TRUE, nosignal = TRUE,
                            useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)"))
# find successful location (or error msg) after any redirects
getEffectiveUrl <- function(url){
  c = getCurlHandle()
  h = basicHeaderGatherer()
  curlSetOpt(.opts = list(header = TRUE, verbose = FALSE), curl = c, .encoding = "CE_LATIN1")
  possibleError <- tryCatch(getURI(url, curl = c, followlocation = TRUE,
                                   headerfunction = h$update, async = TRUE),
                            error = function(e) e)
  if(inherits(possibleError, "error")){
    effectiveUrl <- "ERROR_IN_PAGE" # fails on linked documents (PDFs etc.)
  } else {
    headers <- h$value()
    names(headers) <- tolower(names(headers)) # sometimes cases change on header names?
    statusPrefix <- substr(headers[["status"]], 1, 1) # 1st digit of http status
    if(statusPrefix == "2"){ # status = success
      effectiveUrl <- getCurlInfo(c)[["effective.url"]]
    } else {
      effectiveUrl <- paste(headers[["status"]], headers[["statusmessage"]])
    }
  }
  effectiveUrl
}
testUrls <- c("http://t.co/eivRJJaV4j", "http://t.co/eFfVESXE2j", "http://t.co/dLI6Q0EMb0",
              "http://www.google.com", "http://1.uni.vi/01mvL", "http://t.co/05Mz00DHLD",
              "http://t.co/30aM6L4FhH", "http://www.amazon.com", "http://bit.ly/1fwWZLK",
              "http://t.co/cHglxQkz6Z") # 10th URL redirects to content w/ embedded nul
system.time(
  effectiveUrls <- sapply(X = testUrls, FUN = getEffectiveUrl, USE.NAMES = FALSE)
) # takes 7-10 secs on my laptop
# does Vectorize help?
vGetEffectiveUrl <- Vectorize(getEffectiveUrl, vectorize.args = "url")
system.time(
  effectiveUrls2 <- vGetEffectiveUrl(testUrls)
) # nope, makes it worse

I had a bad experience with RCurl and asynchronous requests. R would completely freeze (with no error message, and no spike in CPU or RAM) with only 20 concurrent requests.
I recommend switching to the curl package and using its curl_fetch_multi() function. In my case it could easily handle 50000 JSON requests in one pool (with some division into subpools under the hood).
https://cran.r-project.org/web/packages/curl/vignettes/intro.html#async_requests
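As a rough illustration of that recommendation, here is a minimal sketch of un-shortening a vector of URLs with curl_fetch_multi(). The function name unshorten_urls and its timeout argument are made up for this example, and the error strings are just one possible convention. Because curl keeps the response body as a raw vector and never converts it to a string here, the "embedded nul" error should not come up for PDFs.

```r
# Hypothetical helper (not part of curl): resolve many short URLs concurrently.
unshorten_urls <- function(urls, timeout = 30) {
  if (length(urls) == 0L) return(character(0))
  results <- rep(NA_character_, length(urls))
  pool <- curl::new_pool()

  add_request <- function(idx) {
    force(idx)  # make sure each callback remembers its own index
    handle <- curl::new_handle(followlocation = TRUE, timeout = timeout)
    curl::curl_fetch_multi(
      urls[idx],
      done = function(res) {
        # res$url is the final URL after redirects; res$content stays raw
        results[idx] <<- if (res$status_code %/% 100 == 2) {
          res$url
        } else {
          paste("HTTP", res$status_code)
        }
      },
      fail = function(msg) {
        results[idx] <<- paste("ERROR:", msg)
      },
      pool = pool,
      handle = handle
    )
  }

  for (i in seq_along(urls)) add_request(i)
  curl::multi_run(pool = pool)  # performs all queued requests concurrently
  results
}
```

Calling unshorten_urls(testUrls) with the vector from the question would return one entry per input URL, in the original order, regardless of the order in which the requests finish.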

Related

I am able to ping an IP through cmd, but not with the pingr ping function

I am able to ping an IP with pingr with the following code:
ping("10.0.4.101")
But I am not able to ping another IP with pingr with the following code:
ping("10.151.2.101")
Even though I am able to ping 10.151.2.101 through cmd.
Any advice would be appreciated, thanks!

A bit of debugging shows that pingr won't work if the response time is too short, or if the locale isn't English.
ping <- function (destination, continuous = FALSE, verbose = continuous,
                  count = 3L, timeout = 1)
{
  if (!continuous && verbose) {
    stop("'!continuous' && 'verbose' does not work currently")
  }
  os <- ping_os(destination, continuous, count, timeout)
  status <- run(os$cmd[1], os$cmd[-1], error_on_status = FALSE)
  output <- strsplit(status$stdout, "\r?\n")[[1]]
  if (!continuous) {
    timings <- grep(os$regex, output, value = TRUE, perl = TRUE)
    times <- sub(os$regex, "\\1", timings, perl = TRUE)
    res <- as.numeric(times)
    length(res) <- count
    res
  }
  else {
    invisible()
  }
}
The timings are extracted with the following regex:
os$regex
[1] "^.*time=(.+)[ ]?ms.*$"
Obviously, a reply reported as time<1ms won't match, because the regex looks for time=. I have filed an issue.
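To make the failure concrete, here is a small sketch that applies the regex shown above to two sample reply lines. The sample lines are made up to resemble Windows ping output, and the relaxed pattern is only one possible workaround, not necessarily what pingr ended up using.

```r
# Regex as printed by os$regex above
regex <- "^.*time=(.+)[ ]?ms.*$"

# Illustrative (made-up) reply lines in the style of Windows ping output
output <- c("Reply from 10.0.4.101: bytes=32 time=12ms TTL=64",
            "Reply from 10.151.2.101: bytes=32 time<1ms TTL=64")

# The second line is skipped because it contains "time<" rather than "time="
matches <- grepl(regex, output, perl = TRUE)

# A relaxed pattern (an assumption, not pingr's fix) that accepts "=" or "<"
relaxed <- "^.*time[=<](.+?)[ ]?ms.*$"
times <- as.numeric(sub(relaxed, "\\1", output, perl = TRUE))
```

With the relaxed pattern a sub-millisecond reply is reported as 1, which is an upper bound rather than a measured value.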

If pingr is not delivering, I would just send the command directly to the shell:
system("ping 10.151.2.101")

Test if socket is empty (was: Reading data from a raw socket)

At the start, I thought that the bad performance of my driver was caused by the way in which I read data from a socket.
This was the original function I used:
socket_char_reader = function(in_sock) {
  string_read <- raw(0)
  while((rd <- readBin(in_sock, what = "raw", n = 1)) > 0) {
    string_read <- c(string_read, rd)
  }
  return(string_read %>% strip_CR_NUL() %>% rawToChar())
}
The results from 3 consecutive calls to this function give the expected result. It takes 0.004 seconds to read 29 bytes in total.
My second try reads the socket until it is empty. Another function splits the resulting raw array in 3 separate parts.
get_response = function() {
  in_sock <- self$get_socket()
  BUF_SIZE <- 8
  chars_read <- raw(0)
  while (length(rd <- readBin(in_sock, what = "raw", n = BUF_SIZE)) > 0) {
    chars_read <- c(chars_read, rd)
  }
  return(chars_read)
},
Reading from the socket now takes 2.049 seconds!
Can somebody explain to me what could be the cause for this difference (or what is the best method for reading the socket until it is empty)?
In the meantime I'll return to my original solution and continue looking for the cause of the bad performance.
Ben
I believe I have found the cause (but not the solution).
While debugging, I noticed that the delay is caused by the last call to readBin().
In socket_char_reader(), the stop condition for the while loop is based on the value of rd; if that value equals 0, the loop stops.
In get_response() the stop condition is based on the number of bytes in the buffer. Before that number can be determined, readBin() first waits to see whether any more bytes will be sent to the socket.
The timeout period is set in the socketConnection() call.
private$conn <- socketConnection(host = "localhost", port,
                                 open = "w+b", server = FALSE, blocking = TRUE,
                                 encoding = "UTF-8", timeout = 1)
The timeout has to be given a value > 0, otherwise it will take days before the loop stops.
Is it possible to check whether there are still any bytes in the socket without actually reading?
Ben
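One possible direction for the question above is base R's socketSelect(), which reports whether a socket connection is ready for reading without consuming any data. The sketch below is untested against this driver. The helper name read_available and its arguments are made up, and it assumes the connection was opened with blocking = FALSE so that readBin() returns whatever is available instead of waiting for a full buffer or the timeout.

```r
# Hypothetical helper: read only what is currently available on the socket.
# Assumes `con` was created by socketConnection(..., blocking = FALSE).
read_available <- function(con, buf_size = 1024L, wait = 0) {
  out <- raw(0)
  # socketSelect() returns TRUE when the socket is readable within `wait` seconds
  while (socketSelect(list(con), write = FALSE, timeout = wait)) {
    rd <- readBin(con, what = "raw", n = buf_size)
    if (length(rd) == 0L) break  # readable but empty: peer closed the connection
    out <- c(out, rd)
  }
  out
}
```

With wait = 0 the loop polls and stops as soon as nothing is pending, so the one-second timeout of the connection is never hit. A small positive wait may be needed if the server sends its reply in several pieces.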

Using httr to place orders through BitMex API

I'm trying to use the httr R package to place orders on BitMex through their API.
I found some guidance over here, and after storing my API key and secret in the objects K and S respectively, I've tried the following:
verb <- 'POST'
expires <- floor(as.numeric(Sys.time() + 10000))
path <- '/api/v1/order'
data <- '{"symbol":"XBTUSD","price":4500,"orderQty":10}'
body <- paste0(verb, path, expires, data)
signature <- hmac(S, body, algo = 'sha256')
body_l <- list(verb = verb, expires = expires, path = path, data = data)
And then both:
msg <- POST('https://www.bitmex.com/api/v1/order', encode = 'json', body = body_l, add_headers('api-key' = K, 'api-signature' = signature, 'api-expires' = expires))
and:
msg <- POST('https://www.bitmex.com/api/v1/order', body = body, add_headers('api-key' = K, 'api-signature' = signature, 'api-expires' = expires))
Give me the same error message when checked:
rawToChar(msg$content)
[1] "{\"error\":{\"message\":\"Signature not valid.\",\"name\":\"HTTPError\"}}"
I've tried to set it up according to how BitMex explains to use their API, but I appear to be missing something. They list a couple of issues that might underlie my invalid signature problem, but those don't seem to help me. When following their example I get the exact same hashes, so that seems to be in order.

A bit late to the party here, but hopefully this helps!
Your POST call just needs some minor changes:
add content_type_json()
include .headers = c('the headers') in add_headers(). See example below:
library(httr)
library(digest)
S <- "your api secret"
K <- "your api key"
verb <- 'POST'
expires <- floor(as.numeric(Sys.time() + 10))
path <- '/api/v1/order'
data <- '{"symbol":"XBTUSD","price":4500,"orderQty":10}'
body <- paste0(verb, path, expires, data)
signature <- hmac(S, body, algo = 'sha256')
msg <- POST('https://www.bitmex.com/api/v1/order',
            encode = 'json',
            body = data,
            content_type_json(),
            add_headers(.headers = c('api-key' = K,
                                     'api-signature' = signature,
                                     'api-expires' = expires)))
content(msg, "text")
I have a package on CRAN - bitmexr - that provides a wrapper around the majority of BitMEX's API endpoints that you might be interested in. Still quite a "young" package so I would welcome any feedback!
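The signing step used in the code above can be isolated into two small helpers, sketched here. The names bitmex_prehash and bitmex_sign are made up for illustration and belong neither to httr nor to bitmexr. The point of the sketch is that the string that gets signed ends with the exact request body, so the body sent in the POST has to be the same string as data. That is one plausible reason the list-based body in the question produced "Signature not valid".

```r
# Hypothetical helper: build the string that gets signed (verb + path + expires + data)
bitmex_prehash <- function(verb, path, expires, data = "") {
  paste0(toupper(verb), path, format(expires, scientific = FALSE), data)
}

# Hypothetical helper: hex-encoded HMAC-SHA256 of that string, via digest::hmac()
bitmex_sign <- function(secret, verb, path, expires, data = "") {
  digest::hmac(secret, bitmex_prehash(verb, path, expires, data), algo = "sha256")
}

# Example of the signed string for the order used in this thread
prehash <- bitmex_prehash("post", "/api/v1/order", 1518064236,
                          '{"symbol":"XBTUSD","price":4500,"orderQty":10}')
```

Whatever string is passed as data here should also be passed unchanged as body = data in the POST call.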

How to improve formatting of slack messages using slackr?

I'm using slackr to send alert messages to a Slack channel. It works great except the message format is not great and I want to improve it.
install_github("hrbrmstr/slackr")
library(slackr)
slackr_setup(channel = "#alerts", username = "Mark Davis",
             incoming_webhook_url = "https://hooks.slack.com/services/T31P8UDAB/BCH4HKQSC/*********",
             api_token = "*********", echo = FALSE)
alert <- "On Monday, 2018-09-03 # 2pm Pacific..."
slackr(alert)
Here is an example of how a message from slackr looks in Slack:
Here is an example of how I'd like it to look:
slackr doesn't seem to have many options in the way of formatting. I was thinking of building an image and inserting that, but I'm having trouble building an image out of a text file using R.
Perhaps there is another api I could call that could take my text and format it for slack?
I'm open to any suggestions.
Addendum:
Slackr has an option to upload files, so my latest attempt is to create an image from the text message and upload that object.
I am able to create a png file from the text message using the magick library. I created an image with a colored background, and I simply add the message text to the image:
library(magick)
alert_picture <- image_read('alert_480x150_dark_red.png')
alert_picture <- image_annotate(alert_picture, DreamCloud_Alert, size = 20, gravity = "southwest",
                                color = "white", location = "+10+10")
image_write(alert_picture, path = "alert_picture.png", format = "png")
The image looks pretty good (although there doesn't seem to be an easy way to bold or underline specific words in the message), but the obstacle now is that I can't get the upload command to work.
slackr_upload(filename = "alert_picture.png")
I don't get any error messages but nothing is uploaded to slack.

I got around this issue by using the httr package to post the image to Slack directly.
Thanks to Adil B. for providing the solution:
Post Image to Slack Using HTTR package in R

I am not sure this is what you meant, but I got regular Slack message formatting to work by altering the slackr_bot() function: I removed the two sets of three back-ticks that wrap the text field near the end of the code. Save the result under a new name such as slackr_bot1(), and you can then post formatted messages. This is the function after removing the back-ticks:
slackr_bot1 <- function(...,
                        channel=Sys.getenv("SLACK_CHANNEL"),
                        username=Sys.getenv("SLACK_USERNAME"),
                        icon_emoji=Sys.getenv("SLACK_ICON_EMOJI"),
                        incoming_webhook_url=Sys.getenv("SLACK_INCOMING_URL_PREFIX")) {
  if (incoming_webhook_url == "") {
    stop("No incoming webhook URL specified. Did you forget to call slackr_setup()?", call. = FALSE)
  }
  if (icon_emoji != "") { icon_emoji <- sprintf(', "icon_emoji": "%s"', icon_emoji) }
  resp_ret <- ""
  if (!missing(...)) {
    # mimics capture.output
    # get the arglist
    args <- substitute(list(...))[-1L]
    # setup in-memory sink
    rval <- NULL
    fil <- textConnection("rval", "w", local = TRUE)
    sink(fil)
    on.exit({
      sink()
      close(fil)
    })
    # where we'll need to eval expressions
    pf <- parent.frame()
    # how we'll eval expressions
    evalVis <- function(expr) withVisible(eval(expr, pf))
    # for each expression
    for (i in seq_along(args)) {
      expr <- args[[i]]
      # do something, note all the newlines...Slack ``` needs them
      tmp <- switch(mode(expr),
                    # if it's actually an expression, iterate over it
                    expression = {
                      cat(sprintf("> %s\n", deparse(expr)))
                      lapply(expr, evalVis)
                    },
                    # if it's a call or a name, eval, printing run output as if in console
                    call = ,
                    name = {
                      cat(sprintf("> %s\n", deparse(expr)))
                      list(evalVis(expr))
                    },
                    # if pretty much anything else (i.e. a bare value) just output it
                    integer = ,
                    double = ,
                    complex = ,
                    raw = ,
                    logical = ,
                    numeric = cat(sprintf("%s\n\n", as.character(expr))),
                    character = cat(sprintf("%s\n\n", expr)),
                    stop("mode of argument not handled at present by slackr"))
      for (item in tmp) if (item$visible) { print(item$value, quote = FALSE); cat("\n") }
    }
    on.exit()
    sink()
    close(fil)
    # combine all of them (rval is a character vector)
    output <- paste0(rval, collapse="\n")
    loc <- Sys.getlocale('LC_CTYPE')
    Sys.setlocale('LC_CTYPE','C')
    on.exit(Sys.setlocale("LC_CTYPE", loc))
    resp <- POST(url = incoming_webhook_url, encode = "form",
                 add_headers(`Content-Type` = "application/x-www-form-urlencoded",
                             Accept = "*/*"),
                 body = URLencode(sprintf("payload={\"channel\": \"%s\", \"username\": \"%s\", \"text\": \"%s\"%s}",
                                          channel, username, output, icon_emoji)))
    warn_for_status(resp)
  }
  return(invisible())
}
slackr_bot1("*test* on time")
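If all that is needed is a formatted text message, a lighter alternative to patching slackr_bot() may be to post to the incoming webhook directly with httr, since Slack renders markup such as *bold* and > quotes in the text field of a webhook payload. The sketch below is not part of slackr: the helper names slack_text and post_to_slack are made up, and the webhook URL has to be supplied by the caller.

```r
# Hypothetical helper: build a message with a bold title and quoted detail lines
slack_text <- function(title, lines) {
  paste0("*", title, "*\n", paste0("> ", lines, collapse = "\n"))
}

# Hypothetical helper: send the text to a Slack incoming webhook as JSON
post_to_slack <- function(webhook_url, text) {
  httr::POST(webhook_url, body = list(text = text), encode = "json")
}

# Build (but do not send) an example message
msg <- slack_text("Alert", c("Job `nightly_load` failed", "Started at 2pm Pacific"))
```

Sending it would then be post_to_slack(incoming_webhook_url, msg), using the same webhook URL that was passed to slackr_setup().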

Retain information about requested URL when using curl::curl_fetch_multi

I'm using the following code to perform multiple simultaneous requests.
urls <- c("https://httpbin.org/status/301", "https://httpbin.org/status/302", "https://httpbin.org/status/200")
result <- list()
p <- curl::new_pool(total_con = 10, host_con = 5, multiplex = TRUE)
cb <- function(res) {
  result <<- append(result, list(res))
  cat("requested URL: ", url, "last URL: ", res$url, "\n\n")
}
for (url in urls) {
  curl::curl_fetch_multi(url, done = cb,
                         handle = curl::new_handle(failonerror = FALSE, nobody = FALSE,
                                                   followlocation = TRUE, ssl_verifypeer = 0),
                         pool = p)
}
curl::multi_run(pool = p)
As you can see, I would like to print to the console both the requested URL and the URL that finally answered with 200 OK.
The following is printed to the console:
requested URL: https://httpbin.org/status/200 last URL: https://httpbin.org/status/200
requested URL: https://httpbin.org/status/200 last URL: https://httpbin.org/get
requested URL: https://httpbin.org/status/200 last URL: https://httpbin.org/get
The requested URL in the console output is always https://httpbin.org/status/200, because the callbacks only run inside multi_run(), after the for loop has finished, and by then url holds the last value it was assigned. So that is the wrong way to do it.
How can I retain information about the initially requested URL when using curl_fetch_multi, so that I can use it after multi_run has returned? Ideally the requested URL would be added to the res list, so that I could query it with something like cat("requested URL: ", res$requested_url, "last URL: ", res$url, "\n\n").

I had a similar issue where I wanted to do asynchronous POST requests using curl_fetch_multi and check which requests succeeded and which failed. However, due to the structure of the POST statement (all fields are in the request body) there is no identifying information whatsoever in the response object. My solution was to generate custom callback functions which carried an identifier.
urls <- c("https://httpbin.org/status/301", "https://httpbin.org/status/302", "https://httpbin.org/status/200")
result <- list()
# create an identifier for each url
url.ids = paste0("request_", seq_along(urls))
# custom callback function generator: generate a unique function for each url
cb = function(id){
  function(res){
    result[[id]] <<- res
  }
}
# create the list of callback functions
cbfuns = lapply(url.ids, cb)
p <- curl::new_pool(total_con = 10, host_con = 5, multiplex = TRUE)
for (i in seq_along(urls)) {
  curl::curl_fetch_multi(urls[i], done = cbfuns[[i]],
                         handle = curl::new_handle(failonerror = FALSE, nobody = FALSE,
                                                   followlocation = TRUE, ssl_verifypeer = 0),
                         pool = p)
}
curl::multi_run(pool = p)
In this example, the custom callback functions are simply used to name the elements of result:
names(result)
## [1] "request_3" "request_1" "request_2"
which can then be used to tie each response back to the original request.
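The same closure idea can also attach the requested URL to each response, which is what the question asked for with res$requested_url. This is a minimal sketch along those lines. The names make_callback, fetch_all and the store environment are made up for the example, and the handle options are reduced to followlocation for brevity.

```r
# Hypothetical callback factory: each callback remembers the URL it was created for
make_callback <- function(requested_url, store) {
  force(requested_url)  # capture the value now, not when the callback runs
  force(store)
  function(res) {
    res$requested_url <- requested_url
    store$results[[length(store$results) + 1L]] <- res
    invisible(NULL)
  }
}

# Hypothetical wrapper: fetch all URLs concurrently and return the annotated responses
fetch_all <- function(urls) {
  store <- new.env()
  store$results <- list()
  pool <- curl::new_pool()
  for (u in urls) {
    curl::curl_fetch_multi(u, done = make_callback(u, store),
                           handle = curl::new_handle(followlocation = TRUE),
                           pool = pool)
  }
  curl::multi_run(pool = pool)
  store$results
}
```

Each element of the list returned by fetch_all(urls) then carries both requested_url and the final url, so the two can be printed side by side after multi_run has returned.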
