GET in for loop needs to be faster (httr) in R - r

I need to write a R script which sends at the end a huge amount of get requests to a server. Each row of my data frame contains several information. The final column "url" builds http requests in each row - for example: https://logsxxx.xxx.com/xx.xx?.....
It may happen that I must send 300.000 - 1.000.000 get request with the script.
The good thing is that my script works and the requests reaches the server.
The bad thing is that the loop costs a lot of time until all rows are send. It takes about 9 hours for 300.000 rows.
I've tested if its possible with ifelse or apply but I failed...
system.time(
for (i in 1:300000){
try(
{
GET(mydata$url[i], timeout(3600))
print(paste("row",i,"sent at",Sys.time()))
}
, silent=FALSE)
}
)
Another bad thing is that the script may fails to send 100% if the internet connection breaks for any reason. Then I can see the following error:
[1] "row 18 sent at 2019-01-18 14:22:05"
[1] "row 19 sent at 2019-01-18 14:22:06"
[1] "row 20 sent at 2019-01-18 14:22:06"
[1] "row 21 sent at 2019-01-18 14:22:06"
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: Connection timed out after 10000 milliseconds
[1] "row 23 sent at 2019-01-18 14:22:16"
[1] "row 24 sent at 2019-01-18 14:22:16"
[1] "row 25 sent at 2019-01-18 14:22:16"
At least the script doesn't break completely and goes on with the next row. The problem is here that the longer the internet connection fails the more rows would not send.
I would be very grateful if:
someone could show me a faster way to send the requests - maybe without the nested for loop
and show me how I can do something like this with the code: "if the get request fails because of internet connection, retry 3 times before going to the next get request. Do that until all elements of i are send"
Kind regards,

Related

Adding a column to MongoDB from R via mongolite gives persistent error

I want to add a column to a MongoDB collection via R. The collection has tabular format and is already relatively big (14000000 entries, 140 columns).
The function I am currently using is
function (collection, name, value)
{
mongolite::mongo(collection)$update("{}", paste0("{\"$set\":{\"",
name, "\": ", value, "}}"), multiple = TRUE)
invisible(NULL)
}
It does work so far. (It takes about 5-10 Minutes, which is ok. Although, it would be nice if the speed could be improved somehow).
However, it also gives me persistently the following error that interrupts the execution of the rest of the script.
The error message reads:
Error: Failed to send "update" command with database "test": Failed to read 4 bytes: socket error or timeout
Any help on resolving this error would be appreciated. (If there are ways to improve the performance of the update itself I'd also be more than happy for any advices.)
the default socket timeout is 5 minutes.
You can override the default by setting sockettimeoutms directly in your connection URI:
mongoURI <- paste0("mongodb://", user,":",pass, "#", mongoHost, ":", mongoPort,"/",db,"?sockettimeoutms=<something large enough in milliseconds>")
mcon <- mongo(mongoCollection, url=mongoURI)
mcon$update(...)

SIM5320E - POST request with large data is slow

I have built a prototype using a raspberry and a sim5320E module. The goal is to send a large amount of data (~100Kb) through HTTP using this 3G module.
I have followed the instructions specified in section 16.5 (HTTPS) of the AT Command set for the SIM5320:
https://cdn-shop.adafruit.com/datasheets/SIMCOM_SIM5320_ATC_EN_V2.02.pdf
And it worked fine, except that it is slow.
From what I understand from the documentation (and seen from my tests), the data to be sent must be divided in chunks of max 4096 bytes.
Every chunk must be sent to what is called the "sending buffer" using the command AT+CHTTPSSEND.
Every now and then, we must check that the sending buffer does not have too much data in cache using the AT+CHTTPSSEND? command.
The last AT+CHTTPSSEND command commits all sending data.
My problem is that every AT+CHTTPSSEND takes around 10 seconds to complete, which means that my HTTP request will take around 250 seconds to complete.
Anybody knows what might cause this slowness?
Here is some code to illustrate the issue:
def send_chunk(self, chunk):
# Send chunk
self._send('CHTTPSSEND={}'.format(len(chunk)), wait_for=">")
self._send_raw(chunk.encode())
# Check how much data is left in the sending buffer
# Wait for this data to be under 3Kb
data_left = 3001
while data_left > 3000:
response = self._send('CHTTPSSEND?', wait_for="+CHTTPSSEND:")
data_left = int(response.strip().split(" ")[1])
time.sleep(2)
And here are the logs I get:
>> AT+CHTTPSSEND=4096 -> This commands takes ~10 seconds
<< >
>> Sending chunk of data
<< OK
>> AT+CHTTPSSEND?
<< +CHTTPSSEND: 0

Write and read from a serial port

I am using the following python script to write AT+CSQ on serial port ttyUSB1.
But I cannot read anything.
However, when I fire AT+CSQ on minicom, I get the required results.
What may be the issue with this script?
Logs:
Manual Script
root#imx6slzbha:~# python se.py
Serial is open
Serial is open in try block also
write data: AT+CSQ
read data:
read data:
read data:
read data:
Logs:
Minicom console
1. ate
OK
2. at+csq
+CSQ: 20,99
3. at+csq=?
OKSQ: (0-31,99),(99)
How can I receive these results in the following python script?
import serial, time
#initialization and open the port
#possible timeout values:
# 1. None: wait forever, block call
# 2. 0: non-blocking mode, return immediately
# 3. x, x is bigger than 0, float allowed, timeout block call
ser = serial.Serial()
ser.port = "/dev/ttyUSB1"
ser.baudrate = 115200
ser.bytesize = serial.EIGHTBITS #number of bits per bytes
ser.parity = serial.PARITY_NONE #set parity check: no parity
ser.stopbits = serial.STOPBITS_ONE #number of stop bits
ser.timeout = None #block read
#ser.timeout = 0 #non-block read
ser.timeout = 3 #timeout block read
ser.xonxoff = False #disable software flow control
ser.rtscts = False #disable hardware (RTS/CTS) flow control
ser.dsrdtr = False #disable hardware (DSR/DTR) flow control
ser.writeTimeout = 2 #timeout for write
try:
ser.open()
print("Serial is open")
except Exception, e:
print "error open serial port: " + str(e)
exit()
if ser.isOpen():
try:
print("Serial is open in try block also")
ser.flushInput() #flush input buffer, discarding all its contents
ser.flushOutput()#flush output buffer, aborting current output
#and discard all that is in buffer
#write data
ser.write("AT+CSQ")
time.sleep(1)
# ser.write("AT+CSQ=?x0D")
print("write data: AT+CSQ")
# print("write data: AT+CSQ=?x0D")
time.sleep(2) #give the serial port sometime to receive the data
numOfLines = 1
while True:
response = ser.readline()
print("read data: " + response)
numOfLines = numOfLines + 1
if (numOfLines >= 5):
break
ser.close()
except Exception, e1:
print "error communicating...: " + str(e1)
else:
print "cannot open serial port "
You have two very fundamental flaws in your AT command handling:
time.sleep(1)
and
if (numOfLines >= 5):
How bad are they? Nothing will ever work until you fix those, and by that I mean completely change the way you send and receive command and responses.
Sending AT commands to a modem is a communication protocol like any other protocols, where certain parts and behaviours are required and not optional. Just like you would not write a HTTP client that completely ignores the responses it gets back from the HTTP server, you must never write a program that sends AT commands to a modem and completely ignores the responses the modem sends back.
AT commands are a link layer protocol, with with a window size of 1 - one. Therefore after sending a command line, the sender MUST wait until has received a response from the modem that it is finished with processing the command line, and that kind of response is called Final result code.
If the modem uses 70ms before it responds with a final result code you have to wait at least 70ms before continuing, if it uses 4 seconds you have to wait at least 4 seconds before continuing, if it uses several minutes (and yes, there exists AT commands that can take minutes to complete) you have to wait for several minutes. If the modem has not responded in an hour, your only options are 1) continue waiting, 2) just give up or 3) disconnect, reconnect and start all over again.
This is why sleep is such a horrible approach that in the very best case is a time wasting ticking bomb. It is as useful as kicking dogs that stand in your way in order to get them to move. Yes it might actually work some times, but at some point you will be sorry for taking that approach...
And regarding numOfLines there is no way anyone in advance can know exactly how many lines a modem will respond with. What if your modem just responds with a single line with the ERROR final result code? The code will deadlock.
So this line number counting has to go completely away, and instead your code should be sending a command line and then wait for the final result code by reading and parsing the response lines from the modem.
But before diving too deep into that answer, start by reading the V.250 specification, at least all of chapter 5. This is the standard that defines the basics of AT command, and will for instance teach you the difference between a command and a command line. And how to correctly terminate a command line which you are not doing, so the modem will never start processing the commands you send.

Python socket sendall randomly fails

I have a simple script which transfers an image file from one machine to another,
does image processing and returns a result as "dice count".
The problem is that randomly the image received is partially grey. There seems no
reason as to when the transfer is incomplete and I see no other issues in the code.
The client gives no error or any indication that sendall failed or was incomplete.
Thanks guys.
server code:
def reciveImage():
#create local buffer file
fo = open("C:/foo.jpg", "wb")
print "inside reciveImage"
while True:
print "inside loop"
data = client.recv(4096)
print "data length: ", len(data)
fo.write(data)
print "data written"
if (len(data) < 4096):
break
print "break"
fo.close()
print "Image received"
And the (simplifyed) client code:
data = open("/home/nao/recordings/cameras/image1.jpg", "rb")
# prepare to send data
binary_data = data.read()
data.close()
sock.sendall(binary_data)
Normal server output:
Client Command: findBlob. Requesting image...
inside reciveImage
inside loop
data length: 4096
data written
#... This happens a bunch of times....
inside loop
data length: 4096
data written
inside loop
data length: 1861
data written
Image received
dice count: 0
Waiting for a connection
But randomly it will only loop a few times or less, like:
Client Command: findBlob. Requesting image...
inside reciveImage
inside loop
data length: 1448
data written
Image received
dice count: 0
Waiting for a connection
recv does not block until all data are received, it only blocks until some data are received and returns them. E.g. if the client send first 512 bytes and then a second later another 512 byte your recv will return the first 512 bytes, even if you asked for 4096. So you should only break if recv returns, that no more data are available (connection closed).

Implementation of simple polling of results file

For one of my dissertation's data collection modules, I have implemented a simple polling mechanism. This is needed, because I make each data collection request (one of many) as SQL query, submitted via Web form, which is simulated by RCurl code. The server processes each request and generates a text file with results at a specific URL (RESULTS_URL in code below). Regardless of the request, URL and file name are the same (I cannot change that). Since processing time for different data requests, obviously, is different and some requests may take significant amount of time, my R code needs to "know", when the results are ready (file is re-generated), so that it can retrieve them. The following is my solution for this problem.
POLL_TIME <- 5 # polling timeout in seconds
In function srdaRequestData(), before making data request:
# check and save 'last modified' date and time of the results file
# before submitting data request, to compare with the same after one
# for simple polling of results file in srdaGetData() function
beforeDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
beforeDate <<- strptime(beforeDate, "%a, %d %b %Y %X", tz="GMT")
<making data request is here>
In function srdaGetData(), called after srdaRequestData()
# simple polling of the results file
repeat {
if (DEBUG) message("Waiting for results ...", appendLF = FALSE)
afterDate <- url.exists(RESULTS_URL, .header=TRUE)["Last-Modified"]
afterDate <- strptime(afterDate, "%a, %d %b %Y %X", tz="GMT")
delta <- difftime(afterDate, beforeDate, units = "secs")
if (as.numeric(delta) != 0) { # file modified, results are ready
if (DEBUG) message(" Ready!")
break
}
else { # no results yet, wait the timeout and check again
if (DEBUG) message(".", appendLF = FALSE)
Sys.sleep(POLL_TIME)
}
}
<retrieving request's results is here>
The module's main flow/sequence of events is linear, as follows:
Read/update configuration file
Authenticate with the system
Loop through data requests, specified in configuration file (via lapply()),
where for each request perform the following:
{
...
Make request: srdaRequestData()
...
Retrieve results: srdaGetData()
...
}
The issue with the code above is that it doesn't seem to be working as expected: upon making data request, the code should print "Waiting for results ..." and then, periodically checking the results file for being modified (re-generated), print progress dots until the results are ready, when it prints confirmation. However, the actual behavior is that the code waits long time (I intentionally made one request a long-running), not printing anything, but then, apparently retrieves results and prints both "Waiting for results ..." and " Ready" at the same time.
It seems to me that it's some kind of synchronization issue, but I can't figure out what exactly. Or, maybe it's something else and I'm somehow missing it. Your advice and help will be much appreciated!
In a comment to the question, I believe MrFlick solved the issue: the polling logic appears to be functional, but the problem is that the progress messages are out of synch with current events on the system.
By default, the R console output is buffered. This is by design: to speed things up and avoid the distracting flicker that may be associated with frequent messages etc. We tend to forget this fact, particularly after we've been using R in a very interactive fashion, running various ad-hoc statement at the console (the console buffer is automatically flushed just before returning the > prompt).
It is however possible to get message() and more generally console output in "real time" by either explicitly flushing the console after each critical output statement, using the flush.console() function, or by disabling buffering at the level of the R GUI (right-click when on the console, see Buffered output Ctrl W item. This is also available in the Misc menu)
Here's a toy example of the explicit use of flush.console. Note the use of cat() rather than message() as the former doesn't automatically add a CR/LF to the output. The latter however is useful however because its messages can be suppressed with suppressMessages() and the like. Also as shown in the comment you can cat the "\b" (backspace) character to make the number overwrite one another.
CountDown <- function() {
for (i in 9:1){
cat(i)
# alternatively to cat(i) use: message(i)
flush.console() # <<<<<<< immediate ouput to console.
Sys.sleep(1)
cat(" ") # also try cat("\b") instead ;-)
}
cat("... Blast-off\n")
}
The output is the following, what is of course not evident in this print-out is that it took 10 seconds overall with one number printed every second, before the final "Blast off"; do remove the flush.console() statement and the output will come at once, after 10 seconds, i.e. when the function terminates (unless console is not buffered at the level of the GUI).
CountDown()
9 8 7 6 5 4 3 2 1 ... Blast-off

Resources