Deal with download interruptions in Tcl - http

I'm downloading a list of files using tcl's http package and was wondering what the best way to handle interruptions are. Right now the outline to my download procedure looks like this:
proc downloadFiles {url_list folder_location} {
foreach {i} $url_list {
regsub {.*/(.*)$} $i {\1} $name
set out [open $folder_location/$name w+] //not worried about errors here
if {[catch {set token [http::geturl $i -channel $out]}]} {
puts "Could not download $i"
} else {
http::cleanup $token
puts "Downloaded $i"
}
close $out
}
}
The line I'm having problems is the catch statement:
catch {set token [http::geturl $i -channel $out]}
Apparently despite me cutting off my internet and stopping a download half way the catch statement still returns a 0 for errors. Is there someway to catch this?

Detecting that the network has temporarily vanished is difficult (deliberately so; that's how TCP/IP works and HTTP just sits on top of that). What you could do is try getting the expected length of the data from the HTTP headers and comparing that to the length of data actually received (you'll want to force binary mode for this, but you're probably dealing with binary data anyway).
To get the expected data length and currently-downloaded length, you need a bit of magic with upvar (the internal alias name, state, is arbitrary):
upvar #0 $token state
puts "Content-Length was $state(totalsize), downloaded $state(currentsize)"
Note however that many pages don't supply a content length so the totalsize field is zero. The http package only knows in those cases that it's got the end when it gets to the end.
When restarting a download, you'll want to send a Range header. That's not an explicitly supported one so you'll need to give it by the -headers option to geturl.
http::geturl $url -headers [list Range bytes=$whereYouGotTo-]
Yes, the format is indeed that funky.

Related

How to send GET request longer than 65535 symbols from rust?

I am rewriting part of my API from python to rust. In particular, I am trying to make an HTTP request to OSRM server to get a big distance matrix. This kind of request can have quite large URLs. In python everything works fine, but in rust I get an error:
thread 'tokio-runtime-worker' panicked at 'a parsed Url should always be a valid Uri: InvalidUri(TooLong)'
I have tried to use several HTTP client libraries: reqwest, surf, isahc, awc. But it turns out that constraining logic is located at the URL processing library https://github.com/hyperium/http and most HTTP clients depend on this library. So they behave the same. I could not use some libs, for example with awc I got compile-time errors with my async code.
Is there any way to send a large GET request from rust, preferably asynchronously?
As freakish pointed out in the comments already, having such a long URL is a bad idea, anything longer than 2,000 characters won't work in most browsers.
That being said: In the comments, you stated that an external API wants those crazily long URIs, so you don't really have an alternative. Therefore, let's give this problem a shot.
It looks like the limitation to 65.534 bytes is because the http library stores the position of the query string as a u16 (and uses 65,535 if there is no query part). The following patch seems to make the code use u32 instead, thereby raising the number of characters to 4,294,967,294 (if you've got longer URIs than that, you might be able to use u64 instead, but that would be an URI of a length greater than 4 GB – I doubt you need this):
--- a/src/uri/mod.rs
+++ b/src/uri/mod.rs
## -141,7 +141,7 ## enum ErrorKind {
}
// u16::MAX is reserved for None
-const MAX_LEN: usize = (u16::MAX - 1) as usize;
+const MAX_LEN: usize = (u32::MAX - 1) as usize;
// URI_CHARS is a table of valid characters in a URI. An entry in the table is
// 0 for invalid characters. For valid characters the entry is itself (i.e.
diff --git a/src/uri/path.rs b/src/uri/path.rs
index be2cb65..9abec4c 100644
--- a/src/uri/path.rs
+++ b/src/uri/path.rs
## -11,10 +11,10 ## use crate::byte_str::ByteStr;
#[derive(Clone)]
pub struct PathAndQuery {
pub(super) data: ByteStr,
- pub(super) query: u16,
+ pub(super) query: u32,
}
-const NONE: u16 = ::std::u16::MAX;
+const NONE: u32 = ::std::u32::MAX;
impl PathAndQuery {
// Not public while `bytes` is unstable.
## -32,7 +32,7 ## impl PathAndQuery {
match b {
b'?' => {
debug_assert_eq!(query, NONE);
- query = i as u16;
+ query = i as u32;
break;
}
b'#' => {
You could try to get this merged, however the issue covering this problem sounds like a pull request might not be accepted. Depending on your use case, you could fork the repository, commit the fix and then use the Cargo features for overriding dependencies to make Cargo use your patched version instead of the version in the repositories. The following addition to your Cargo.toml might get you started:
[patch.crates-io]
http = { git = 'https://github.com/your/repository' }
Note however that this only overrides the current version of the Uri crate – as soon as a new version of the original crate is published, it will probably be chosen by Cargo until you update your fork.

Tcl fileevent reads what was sent previously

I'm a fresh tcl user and I've been working on recently on a script, that uses serial port. My greatest problem so far is that I can't effectively read my serial port.
I'm using fileevent to read, but it catches what was send previously. However if I dont send anything before, it waits for external data and catches it. My code:
global forever
proc read_serial {serial} {
set msg [read $serial]
set listOfLetters [split $msg {} ]
set serialIPinHex ""
foreach iChar $listOfLetters { ;#conversion to hex
scan $iChar "%c" decimalValue
set hexValue [format "%02X" $decimalValue ]
set hexValue "0x$hexValue"
set serialIPinHex "$serialIPinHex $hexValue"
}
puts "MCU RESPONSE: $serialIPinHex"
global forever
set forever 1 ;# end event loop
}
set serial [open com1 r+]
fconfigure $serial -mode "19200,n,8,1"
fconfigure $serial -blocking 0 -buffering none
fconfigure $serial -translation binary -encoding binary
fileevent $serial readable [list read_serial $serial ]
global forever
puts -nonewline $serial \x31\xce
flush $serial
delay 10
vwait forever
flush $serial
close $serial
The effect is "MCU RESPONSE: x31 xce", while it should await for sth to come by serial (in my understanding). I'm sure there is no echo function on the other end. Thanks in advance for your help. I hope my bug is not embarassing, I spend last few hours looking for it...
The main thing to note about serial ports is that they are slow. Really slow. A modern computer can run round the block three times before a serial port has passed one character over (well, metaphorically). That means you're going to get characters coming one at a time and you've got to deal with that.
You're advised to use read $serial 1 and to append that to a buffer you are keeping, or write your code to handle a single byte at a time. If you were using line-oriented messaging, you could take advantage of gets's good behaviour in non-blocking mode (the fblocked command is designed to support this), but read isn't quite so friendly (since it knows nothing about record separators).
Don't worry about missing a byte if you use read $serial 1; if there's one left, your callback will get called again immediately to deal with it.

IncompleteRead error when submitting neo4j batch from remote server; malformed HTTP response

I've set up neo4j on server A, and I have an app running on server B which is to connect to it.
If I clone the app on server A and run the unit tests, it works fine. But running them on server B, the setup runs for 30 seconds and fails with an IncompleteRead:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/suite.py", line 208, in run
self.setUp()
File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/suite.py", line 291, in setUp
self.setupContext(ancestor)
File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/suite.py", line 314, in setupContext
try_run(context, names)
File "/usr/local/lib/python2.7/site-packages/nose-1.3.1-py2.7.egg/nose/util.py", line 469, in try_run
return func()
File "/comps/comps/webapp/tests/__init__.py", line 19, in setup
create_graph.import_films(films)
File "/comps/comps/create_graph.py", line 49, in import_films
batch.submit()
File "/usr/local/lib/python2.7/site-packages/py2neo-1.6.3-py2.7-linux-x86_64.egg/py2neo/neo4j.py", line 2643, in submit
return [BatchResponse(rs).hydrated for rs in responses.json]
File "/usr/local/lib/python2.7/site-packages/py2neo-1.6.3-py2.7-linux-x86_64.egg/py2neo/packages/httpstream/http.py", line 563, in json
return json.loads(self.read().decode(self.encoding))
File "/usr/local/lib/python2.7/site-packages/py2neo-1.6.3-py2.7-linux-x86_64.egg/py2neo/packages/httpstream/http.py", line 634, in read
data = self._response.read()
File "/usr/local/lib/python2.7/httplib.py", line 532, in read
return self._read_chunked(amt)
File "/usr/local/lib/python2.7/httplib.py", line 575, in _read_chunked
raise IncompleteRead(''.join(value))
IncompleteRead: IncompleteRead(131072 bytes read)
-------------------- >> begin captured logging << --------------------
py2neo.neo4j.batch: INFO: Executing batch with 2 requests
py2neo.neo4j.batch: INFO: Executing batch with 1800 requests
--------------------- >> end captured logging << ---------------------
The exception happens when I submit a sufficiently large batch. If I reduce the size of the data set, it goes away. It seems to be related to request size rather than the number of requests (if I add properties to the nodes I'm creating, I can have fewer requests).
If I use batch.run() instead of .submit(), I don't get an error, but the tests fail; it seems that the batch is rejected silently. If I use .stream() and don't iterate over the results, the same thing happens as .run(); if I do iterate over them, I get the same error as .submit() (except that it's "0 bytes read").
Looking at httplib.py suggests that we'll get this error when an HTTP response has Transfer-Encoding: Chunked and doesn't contain a chunk size where one is expected. So I ran tcpdump over the tests, and indeed, that seems to be what's happening. The final chunk has length 0x8000, and its final bytes are
"http://10.210.\r\n
0\r\n
\r\n
(Linebreaks added after \n for clarity.) This looks like correct chunking, but the 0x8000th byte is the first "/", rather than the second ".". Eight bytes early. It also isn't a complete response, being invalid JSON.
Interestingly, within this chunk we get the following data:
"all_relatio\r\n
1280\r\n
nships":
That is, it looks like the start of a new chunk, but embedded within the old one. This new chunk would finish in the correct location (the second "." of above), if we noticed it starting. And if the chunk header wasn't there, the old chunk would finish in the correct location (eight bytes later).
I then extracted the POST request of the batch, and ran it using cat batch-request.txt | nc $SERVER_A 7474. The response to that was a valid chunked HTTP response, containing a complete valid JSON object.
I thought maybe netcat was sending the request faster than py2neo, so I introduced some slowdown
cat batch-request.txt | perl -ne 'BEGIN { $| = 1 } for (split //) { select(undef, undef, undef, 0.1) unless int(rand(50)); print }' | nc $SERVER_A 7474
But it continued to work, despite being much slower now.
I also tried doing tcpdump on server A, but requests to localhost don't go over tcp.
I still have a few avenues that I haven't explored: I haven't worked out how reliably the request fails or under precisely which conditions (I once saw it succeed with a batch that usually fails, but I haven't explored the boundaries). And I haven't tried making the request from python directly, without going through py2neo. But I don't particularly expect either of these to be be very informative. And I haven't looked closely at the TCP dump except for using wireshark's 'follow TCP stream' to extract the HTTP conversation; I don't really know what I'd be looking for there. There's a large section that wireshark highlights in black in the failed dump, and only isolated lines black in the successful dump, maybe that's relevant?
So for now: does anyone know what might be going on? Anything else I should try to diagnose the problem?
The TCP dumps are here: failed and successful.
EDIT: I'm starting to understand the failed TCP dump. The whole conversation takes ~30 seconds, and there's a ~28-second gap in which both servers are sending ZeroWindow TCP frames - these are the black lines I mentioned.
First, py2neo fills up neo4j's window; neo4j sends a frame saying "my window is full", and then another frame which fills up py2neo's window. Then we spend ~28 seconds with each of them just saying "yup, my window is still full". Eventually neo4j opens its window again, py2neo sends a bit more data, and then py2neo opens its window. Both of them send a bit more data, then py2neo finishes sending its request, and neo4j sends more data before also finishing.
So I'm thinking that maybe the problem is something like, both of them are refusing to process more data until they've sent some more, and neither can send some more until the other processes some. Eventually neo4j enters a "something's gone wrong" loop, which py2neo interprets as "go ahead and send more data".
It's interesting, but I'm not sure what it means, that the penultimate TCP frame sent from neo4j to py2neo starts \r\n1280\r\n - the beginning of the fake-chunk. The \r\n8000\r\n that starts the actual chunk, just appears part-way through an unremarkable TCP frame. (It was the third frame sent after py2neo finished sending its post request.)
EDIT 2: I checked to see precisely where python was hanging. Unsurprisingly, it was while sending the request - so BatchRequestList._execute() doesn't return until after neo4j gives up, which is why neither .run() or .stream() did any better than .submit().
It appears that a workaround is to set the header X-Stream: true;format=pretty. (By default it's just true; it used to be pretty, but that was removed due to this bug (which looks like it's actually a neo4j bug, and still seems to be open, but isn't currently an issue for me).
It looks like, by setting format=pretty, we cause neo4j to not send any data until it's processed the whole of the input. So it doesn't try to send data, doesn't block while sending, and doesn't refuse to read until it's sent something.
Removing the X-Stream header entirely, or setting it to false, seems to have the same effect as setting format=pretty (as in, making neo4j send a response which is chunked, pretty-printed, doesn't contain status codes, and doesn't get sent until the whole request has been processed), which is kinda weird.
You can set the header for an individual batch with
batch._batch._headers['X-Stream'] = 'true;format=pretty'
Or set the global headers with
neo4j._add_header('X-Stream', 'true;format=pretty')

Modifying HTML content on the fly => SLOW

We are working on a PROXY based protection software. It catches the user http request, do the proxy stuffs, and catch the http response, modify its content and send it back to the original user.
We had 2 tries:
SQUID proxy and a PHP rind out of SQUID.
It was promising, but at PHP stream we did not know about the length of response data we were expected, so it was timeouting every time => SLOW
Now, we wrote a .net application. It does everything we need, and its pretty fast even does not modify the content. If we need to GZIP/GUNZIP, or just modify the content, it becomes very slow.
Could you help us?
We are working on this project for almost a year in our University in Hungary. We wrote a automatic, self learning full semantical analizer engine, which can analyze and interpret in all language, and can detect and screen the target content. We also built an image recognition software, which can detect the target object in 90% confidence in all image.
So everything is ready, but our proxy application is stucked.
We also could pay for this job, if anybody would write it.
I spend a lot of time programming in PHP - yes, as an interpreted language it can be slow - and there is a huge amount of badly written code available - but even before you start to touch the code, tuning the environment can reduce execution time by a factor of 5-10. Then changing the code can make it go faster still; the biggest wins come from good choices for architecture and data structures (which is true of any language - not just PHP).
I don't know where you're starting from but find it surprising that you are not able to process the stream relative to the amount of time taken to generate the content and send it across the network. For it to be timing out something is very wrong. (you're not trying to parse the HTML using the one of the XML parsers are you?). The length of the content shoul have little impact on the performance of the script unless you are trying to map it all into PHP's address space at the same time.
However AFAIK, it's not possible to implement a content filter directly into Squid using PHP (if you did, I'd love to know how you did it, also if you've implemented ICAP, that's very interesting). I'm guessing that you are using a URL redirector to route the requests via a proxy script written in PHP.
It is possible to write an ECAP module in C/C++.
Image recognition and natural language processing are not trivial exercises in programming - so you must have some good programmers working on your team. Really addressing your problem goes rather beyond the scope of a stack overflow answer, and touting for contractors is definitely off topic.
Thanks for your reply!
First of all: our PHP is pretty fast, the fsockopen is slow, because it cannot know when to close the response connection from SQUID.
Here is our code:
$buffer = socket_read($client, 4096);
if ( !($handle = fsockopen(HOST, SQUIDPROXYPORT, $errno, $error, 1)) ) {
Log::write($this->log, 'Errno: ' . $errno . ' Error: ' . $error . "\n" . $buffer);
exit('Nem sikerült csatlakozni! ' . $errno . ':' . $error);
}
stream_set_timeout($handle, 0, 100000);
fwrite($handle, $buffer);
$result = '';
do {
$tmp = fgets($handle, 1024);
if ( $tmp ) {
$result .= $tmp;
}
} while ( !feof($handle) && $tmp != false );
socket_write($client, $result, strlen($result));
fclose($handle);
socket_close($client);
Again, how it works:
Client send HTTP request to us
Our PHP get the request, and send its header to SQUID proxy
Squid does its stuff, and send the response data back to our PHP
Our PHP gets by fsockopen the response data from squid
We analyze the response data, or modify it
We send it back to client
BUT:
While we are waiting for the response data, we receive it, but we cannot know, at what time to close the connection between our PHP and SQUID. This results a slow work, and timeout at almost every time.
If you have any idea, plesa share with us!

QNetworkAccessManager: post http multipart from serial QIODevice

I'm trying to use QNetworkAccessManager to upload http multiparts to a dedicated server.
The multipart consists of a JSON part describing the data being uploaded.
The data is read from a serial QIODevice, which encrypts the data.
This is the code that creates the multipart request:
QHttpMultiPart *multiPart = new QHttpMultiPart(QHttpMultiPart::FormDataType);
QHttpPart metaPart;
metaPart.setHeader(QNetworkRequest::ContentTypeHeader, "application/json");
metaPart.setHeader(QNetworkRequest::ContentDispositionHeader, QVariant("form-data; name=\"metadata\""));
metaPart.setBody(meta.toJson());
multiPart->append(metaPart);
QHttpPart filePart;
filePart.setHeader(QNetworkRequest::ContentTypeHeader, QVariant(fileFormat));
filePart.setHeader(QNetworkRequest::ContentDispositionHeader, QVariant("form-data; name=\"file\""));
filePart.setBodyDevice(p_encDevice);
p_encDevice->setParent(multiPart); // we cannot delete the file now, so delete it with the multiPart
multiPart->append(filePart);
QNetworkAccessManager netMgr;
QScopedPointer<QNetworkReply> reply( netMgr.post(request, multiPart) );
multiPart->setParent(reply.data()); // delete the multiPart with the reply
If the p_encDevice is an instance of QFile, that file gets uploaded just fine.
If the specialised encrypting QIODevice is used (serial device) then all of the data is read from my custom device. however QNetworkAccessManager::post() doesn't complete (hangs).
I read in the documentation of QHttpPart that:
if device is sequential (e.g. sockets, but not files),
QNetworkAccessManager::post() should be called after device has
emitted finished().
Unfortunately I don't know how do that.
Please advise.
EDIT:
QIODevice doesn't have finished() slot at all. What's more, reading from my custom IODevice doesn't happen at all if QNetworkAccessManager::post() is not called and therefore the device wouldn't be able to emit such an event. (Catch 22?)
EDIT 2:
It seems that QNAM does not work with sequential devices at all. See discussion on qt-project.
EDIT 3:
I managed to "fool" QNAM to make it think that it is reading from non-sequential devices, but seek and reset functions prevent seeking. This will work until QNAM will actually try to seek.
bool AesDevice::isSequential() const
{
return false;
}
bool AesDevice::reset()
{
if (this->pos() != 0) {
return false;
}
return QIODevice::reset();
}
bool AesDevice::seek(qint64 pos)
{
if (this->pos() != pos) {
return false;
}
return QIODevice::seek(pos);
}
You'll need to refactor your code quite a lot so that the variables you pass to post are available outside that function you've posted, then you'll need a new slot defined with the code for doing the post inside the implementation. Lastly you need to do connect(p_encDevice, SIGNAL(finished()), this, SLOT(yourSlot()) to glue it all together.
You're mostly there, you just need to refactor it out and add a new slot you can tie to the QIODevice::finished() signal.
I've had more success creating the http post data manually than with using QHttpPart and QHttpMultiPart. I know it's probably not what you want to hear, and it's a little messy, but it definitely works. In this example I am reading from a QFile, but you can call readAll() on any QIODevice. It also is worth noting, QIODevice::size() will help you check if all the data has been read.
QByteArray postData;
QFile *file=new QFile("/tmp/image.jpg");
if(!(file->open(QIODevice::ReadOnly))){
qDebug() << "Could not open file for reading: "<< file->fileName();
return;
}
//create a header that the server can recognize
postData.insert(0,"--AaB03x\r\nContent-Disposition: form-data; name=\"attachment\"; filename=\"image.jpg\"\r\nContent-Type: image/jpeg\r\n\r\n");
postData.append(file->readAll());
postData.append("\r\n--AaB03x--\r\n");
//here you can add additional parameters that your server may need to parse the data at the end of the url
QString check(QString(POST_URL)+"?fn="+fn+"&md="+md);
QNetworkRequest req(QUrl(check.toLocal8Bit()));
req.setHeader(QNetworkRequest::ContentTypeHeader,"multipart/form-data; boundary=AaB03x");
QVariant l=postData.length();
req.setHeader(QNetworkRequest::ContentLengthHeader,l.toString());
file->close();
//free up memory
delete(file);
//post the data
reply=manager->post(req,postData);
//connect the reply object so we can track the progress of the upload
connect(reply,SIGNAL(uploadProgress(qint64,qint64)),this,SLOT(updateProgress(qint64,qint64)));
Then the server can access the data like this:
<?php
$filename=$_REQUEST['fn'];
$makedir=$_REQUEST['md'];
if($_FILES["attachment"]["type"]=="image/jpeg"){
if(!move_uploaded_file($_FILES["attachment"]["tmp_name"], "/directory/" . $filename)){
echo "File Error";
error_log("Uploaded File Error");
exit();
};
}else{
print("no file");
error_log("No File");
exit();
}
echo "Success.";
?>
I hope some of this code can help you.
I think the catch is that QNetworkAccessManager does not support chunked transfer encoding when uploading (POST, PUT) data. This means that QNAM must know in advance the length of the data it's going to upload, in order to send the Content-Length header. This implies:
either the data does not come from sequential devices, but from random-access devices, which would correctly report their total size through size();
or the data comes from a sequential device, but the device has already buffered all of it (this is the meaning of the note about finished()), and will report it (through bytesAvailable(), I suppose);
or the data comes from a sequential device which has not buffered all the data, which in turn means
either QNAM reads and buffers itself all the data coming from the device (by reading until EOF)
or the user manually set the Content-Length header for the request.
(About the last two points, see the docs for the QNetworkRequest::DoNotBufferUploadDataAttribute.)
So, QHttpMultiPart somehow shares these limitations, and it's likely that it's choking on case 3. Supposing that you cannot possibly buffer in memory all the data from your "encoder" QIODevice, is there any chance you might know the size of the encoded data in advance and set the content-length on the QHttpPart?
(As a last note, you shouldn't be using QScopedPointer. That will delete the QNR when the smart pointer falls out of scope, but you don't want to do that. You want to delete the QNR when it emits finished()).
From a separate discussion in qt-project and by inspecting the source code it seems that QNAM doesn't work with sequential at all. Both the documentation and code are wrong.

Resources