gawk to read last bit of binary data over a pipe without timeout?

I have a program already written in gawk that downloads a lot of small bits of info from the internet. (A media scanner and indexer)
At present it launches wget to get the information. This is fine, but I'd like to simply reuse the connection between invocations. It's possible a single run of the program might make between 200 and 2000 calls to the same API service.
I've just discovered that gawk can do networking and found geturl
However, the advice at the bottom of that page is worth heeding: I can't find an easy way to read the last line of a response and still keep the connection open.
As I'm mostly reading JSON data, I can set RS="}" and exit when the body length reaches the expected Content-Length. That might break on trailing whitespace, though, and I'd like a more robust approach. Does anyone have a nicer way to implement sporadic HTTP requests in awk that keep the connection open? Currently I have the following structure...
con="/inet/tcp/0/host/80";
send_http_request(con);
RS="\r\n";
read_headers();
# now read the body - but do not close the connection...
RS="}"; # for JSON
while ( (con |& getline bytes) > 0 ) {
    body = body bytes RS;
    if (length(body) >= content_length) break;
    print length(body);
}
# Do not close con here - keep open
It's a shame this one little thing seems to be spoiling all the potential here. Also, in case anyone asks :) ...
awk was originally chosen for historical reasons - there were not many other language options on this embedded platform at the time.
Gathering up all of the URLs in advance and passing to wget will not be easy.
re-implementing in perl/python etc is not a quick solution.
I've looked at trying to pipe URLs to a named pipe and into wget -i -, but that doesn't work: the data gets buffered, unbuffer isn't available, and I also think wget gathers up all the URLs until EOF before processing.
The data is small so lack of compression is not an issue.

The problem with the connection reuse comes from the HTTP 1.0 standard, not gawk. To reuse the connection you must either use HTTP 1.1 or try some other non-standard solutions for HTTP 1.0. Don't forget to add the Host: header in your HTTP/1.1 request, as it is mandatory.
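For illustration, a request on a reused connection might look like the following (the host and path here are placeholders, not from the question); with HTTP/1.1, persistent connections are the default unless one side sends Connection: close:
GET /api/item/42 HTTP/1.1\r\n
Host: api.example.com\r\n
Connection: keep-alive\r\n
\r\n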
You're right about the lack of robustness when reading the response body. For line-oriented protocols this is not an issue. Moreover, even when using HTTP 1.1, if your script blocks waiting for more data when it shouldn't, the server will, again, close the connection due to inactivity.
As a last resort, you could write your own HTTP retriever in whatever language you like, one that reuses connections (all to the same remote host, I presume) and also inserts a special record separator for you. Then you could control it from the awk script.
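For example, here is a minimal sketch of that idea in Python (the host name, the one-path-per-line stdin protocol, and the ASCII record-separator byte 0x1e are all assumptions for illustration, not anything from the question):
#!/usr/bin/env python3
# Sketch of an external retriever: keep one HTTP/1.1 connection open,
# fetch each path read from stdin, and write the body followed by an
# ASCII record separator (0x1e) so the awk side can split on it.
import sys
from http.client import HTTPConnection

HOST = "api.example.com"      # hypothetical API host
conn = HTTPConnection(HOST)   # one TCP connection, reused for every request

for line in sys.stdin:        # one URL path per line, e.g. /v1/item/42
    path = line.strip()
    if not path:
        continue
    conn.request("GET", path)
    resp = conn.getresponse()
    body = resp.read()        # reads exactly Content-Length bytes
    sys.stdout.buffer.write(body + b"\x1e")
    sys.stdout.buffer.flush()

conn.close()
The awk script would then start this as a coprocess, print one path per request to it, set RS to the separator byte, and read one complete JSON body per getline.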

Related

Optimizing file synchronization over HTTP

I'm attempting to synchronize a set of files over HTTP.
For the moment, I'm using HTTP PUT, and sending files that have been altered. However, this is very inefficient when synchronizing large files where the delta is very small.
I'd like to do something closer to what rsync does to transmit the deltas, but I'm wondering what the best approach to do this would be.
I know I could use an rsync library on both ends, and wrap their communication over HTTP, but this sounds more like an antipattern; tunneling a standalone protocol over HTTP. I'd like to do something that's more in line with how HTTP works, and not wrap binary data (except my files, duh) in an HTTP request/response.
I've also failed to find any relevant/useful functionality already implemented in WebDAV.
I have total control over the client and server implementation, since this is a desktop-ish application (meaning "I don't need to worry about browser compatibility").
The HTTP PATCH recommended in a comment requires the client to keep track of local changes. You may not be able to do that due to the size of the file.
Alternatively you could treat "chunks" of the huge file as resources: depending on the nature of the changes and the content of the file it could be by bytes, chapters, whatever.
The client could query the hash of all chunks, calculate the same for the local version, and PUT only the changed ones.
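A rough sketch of the client side of that in Python, assuming fixed-size chunks and a server that can report per-chunk hashes (the chunk size and the hash-list endpoint are assumptions, not anything HTTP itself defines):
import hashlib

CHUNK_SIZE = 1 << 20   # 1 MiB chunks; tune to how your files actually change

def chunk_hashes(path):
    """Return the SHA-256 hex digest of every fixed-size chunk in the file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes

# The client would GET the server's hash list, compare it with
# chunk_hashes("local.bin"), and PUT only the indices that differ.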

Prevent FIFO from closing / reuse closed FIFO

Consider the following scenario:
a FIFO named test is created. In one terminal window (A) I run cat <test and in another (B) cat >test. It is now possible to write in window B and get the output in window A. It is also possible to terminate process A, relaunch it, and still use this setup as expected. However, if you terminate the process in window B, B will (as far as I know) send an EOF through the FIFO to process A and terminate it as well.
In fact, if you run a process that does not terminate on EOF, you still won't be able to reuse the FIFO you redirected to that process. I think this is because the FIFO is considered closed.
Is there any way to work around this problem?
The reason I ran into this problem is that I'd like to send commands to my minecraft server running in a screen session, for example: echo "command" >FIFO_to_server. This is probably possible with screen by itself, but I'm not very comfortable with screen, and I think a solution using only pipes would be simpler and cleaner.
A is reading from a file. When it reaches the end of the file, it stops reading. This is normal behavior, even if the file happens to be a fifo. You now have four approaches.
Change the code of the reader to make it keep reading after the end of the file. That's saying the input file is infinite, and reaching the end of the file is just an illusion. Not practical for you, because you'd have to change the minecraft server code.
Apply the Unix philosophy. You have a writer and a reader who don't agree on a protocol, so you interpose a tool that connects them. As it happens, there is such a tool in the Unix toolbox: tail -f. tail -f keeps reading from its input file even after it sees the end of the file. Make all your clients talk to the pipe, and connect tail -f to the minecraft server:
tail -n +1 -f client_pipe | minecraft_server &
As mentioned by jilles, use a trick: pipes support multiple writers, and only become closed when the last writer goes away. So make sure there's a client that never goes away.
while true; do sleep 999999999; done >client_pipe &
The problem is that the server is fundamentally designed to handle a single client. To handle multiple clients, you should change to using a socket. Think of sockets as “meta-pipes”: connecting to a socket creates a pipe, and once the client disconnects, that particular pipe is closed, but the server can accept more connections. This is the clean approach, because it also ensures that you won't have mixed up data if two clients happen to connect at the same time (using pipes, their commands could be interspersed). However, it requires changing the minecraft server.
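For a sense of what that looks like, here is a bare-bones sketch of such a socket server in Python (the socket path is made up, and it just prints commands instead of executing them; the real change would have to live inside the minecraft server):
import os
import socket

SOCK_PATH = "/tmp/minecraft.sock"    # hypothetical control socket
if os.path.exists(SOCK_PATH):
    os.unlink(SOCK_PATH)

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(SOCK_PATH)
server.listen(1)

while True:                          # keep accepting new clients forever
    conn, _ = server.accept()
    with conn, conn.makefile("r") as lines:
        for command in lines:        # EOF here ends only this client
            print("got command:", command.strip())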
Start a process that keeps the fifo open for writing and keeps running indefinitely. This will prevent readers from seeing an end-of-file condition.
From this answer -
On some systems like Linux, <> on a named pipe (FIFO) opens the named pipe without blocking (without waiting for some other process to open the other end), and ensures the pipe structure is left alive.
So you could do:
cat <>up_stream >down_stream
# the cat pipeline keeps running
echo 1 > up_stream
echo 2 > up_stream
echo 3 > up_stream
However, I can't find documentation about this behavior, so it could be an implementation detail specific to some systems. I tried the above on macOS and it works.
You can also feed multiple inputs into a pipe created with mkfifo yourpipe by grouping the commands you need in parentheses, separated by semicolons:
(cat file1; cat file2; ls -l;) > yourpipe

Writing a Proxy/Caching server using Lua!

I'm still starting out with Lua, and would like to write a (relatively) simple proxy using it.
This is what I would like to get to:
Listen on a port.
Accept connection.
Since this is a proxy, I'm expecting HTTP (GET/POST, etc.)/HTTPS/FTP/whatever requests from my browser.
Inspect the request (Just to extract the host and port information?)
Create a new socket and connect to the host specified in the request.
Relay the exact request as it was received, with POST data and all.
Receive the response (header/body/anything else..) and respond to the initial request.
Close Connections? I suppose Keep-Alive shouldn't be respected?
I realize it's not supposed to be trivial, but I'm having a lot of trouble setting this up using LuaSocket or Copas --- how do I receive the entire request? Keep receiving until I see \r\n\r\n? Then how do I pull out the POST data? And the body? Or accept a "download" file? I read about the "sink" concept, but admittedly didn't understand most of what that meant, so maybe I should read up more on that?
In case it matters, I'm working on a windows machine, using LuaForWindows and am still rather new to Lua. Loving it so far though, tables are simply amazing :)
I discovered lua-http, but it seems to have been merged into Xavante (and I didn't find any version for Lua 5.1 and LuaForWindows); not sure if it makes my life easier?
Thanks in advance for any tips, pointers, libraries/source I should be looking at etc :)
Not as easy as you may think. Requests to proxies and requests to servers are different. In RFC 2616 you can see that, when querying a proxy, a client includes the absolute URL of the requested document instead of the usual relative one.
So, as a proxy, you have to parse incoming requests, modify them, query the appropriate servers, and return the responses.
Parsing incoming requests is quite complex, as the body length depends on various parameters (method, content encoding, etc.).
You may try to use lua-http-parser.
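To make the \r\n\r\n part concrete, here is the general shape of that parsing, sketched in Python rather than Lua (it only covers the simple Content-Length case; chunked bodies and the other cases the answer mentions need more work):
def read_request(sock):
    # Keep receiving until the blank line that ends the header block.
    buf = b""
    while b"\r\n\r\n" not in buf:
        data = sock.recv(4096)
        if not data:
            raise ConnectionError("client closed before end of headers")
        buf += data
    head, _, body = buf.partition(b"\r\n\r\n")
    # Find the announced body length, if any.
    length = 0
    for line in head.split(b"\r\n")[1:]:      # skip the request line itself
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value)
    # Read the rest of the body (POST data, uploads, ...).
    while len(body) < length:
        data = sock.recv(4096)
        if not data:
            break
        body += data
    return head, body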

How does the HTTP protocol work exactly?

I thought I had this all figured out, but now that I'm writing a webserver, something is not quite working right.
The app listens on a port for incoming requests, and when it receives one, it reads everything up to the sequence "\r\n\r\n". (Because that signifies the end of the headers - yes, I am ignoring possible POST data.)
Now, after it reads that far, it writes to the socket the response:
HTTP/1.1 200 OK\r\n
Host: 127.0.0.1\r\n
Content-type: text/html\r\n
Content-length: 6\r\n
\r\n
Hello!
However, when Firefox or Chrome tries to view the page, it won't display. Chrome informs me:
Error 324 (net::ERR_EMPTY_RESPONSE): Unknown error.
What am I doing wrong?
Here is some of the code:
QTcpSocket * pSocket = m_server->nextPendingConnection();
// Loop thru the request until \r\n\r\n is found
while (pSocket->waitForReadyRead())
{
    QByteArray data = pSocket->readAll();
    if (data.contains("\r\n\r\n"))
        break;
}
pSocket->write("HTTP/1.0 200 OK\r\n");
QString error_str = "Hello world!";
pSocket->write("Host: localhost:8081\r\n");
pSocket->write("Content-Type: text/html\r\n");
pSocket->write(tr("Content-Length: %1\r\n").arg(error_str.length()).toUtf8());
pSocket->write("\r\n");
pSocket->write(error_str.toUtf8());
delete pSocket;
I figured it out!
After writing the data to the socket, I have to call:
pSocket->waitForBytesWritten();
...or the buffered data does not actually get sent.
Could the problem be that you're not flushing and closing the socket before deleting it?
EDIT: George Edison answered his own question, but was kind enough to accept my answer. Here is the code that worked for him:
pSocket->waitForBytesWritten();
What you’ve shown looks okay, so it must be that you’re actually sending something different. (I presume that you're entering "http://127.0.0.1" in your browser.)
Download the crippleware version of this product and see what it reports:
http://www.charlesproxy.com/
Try it with telnet or nc (netcat) to debug it. Also, is it possible you're sending double newlines? I'm not sure what language you're using, so if your print appends a newline, switch to:
HTTP/1.1 200 OK\r
It's also handy to have a small logging/forwarding program for debugging socket protocols. I don't know any offhand, but I've had one I've used for many years and it's a lifesaver when I need it.
[Edit]
Try using wget to fetch the file; use the -S and -O <file> options to save the output to a file and see what's wrong that way.
[#2]
pSocket->write(error_str.toUtf8());
delete pSocket;
I haven't used Qt for a long time, but do you need to flush or at least close the socket before deleting?

Is there anything in the FTP protocol like the HTTP Range header?

Suppose I want to transfer just a portion of a file over FTP - is it possible using a standard FTP protocol?
In HTTP I could use a Range header in the request to specify the data range of the remote resource. If it's a 1mb file, I could ask for the bytes from 600k to 700k.
Is there anything like that in FTP? I'm reading the FTP RFC and don't see anything, but I want to make sure I'm not missing something.
There's a Restart command in FTP - would that work?
Addendum
After getting Brian Bondy's answer below, I wrote a read-only Stream class that wraps FTP. It supports Seek() and Read() operations on a resource that is read via FTP, based on the REST verb.
Find it at http://cheeso.members.winisp.net/srcview.aspx?dir=streams&file=FtpReadStream.cs
It's pretty slow to Seek(), because setting up the data socket takes a long time. Best results come when you wrap that stream in a BufferedStream.
Yes you can use the REST command.
REST sets the point at which a subsequent file transfer should start. It is usually used for restarting interrupted transfers. The command must come right before a RETR or STOR, and so comes after a PORT or PASV.
From FTP's RFC 959:
RESTART (REST): The argument field represents the server marker at which file transfer is to be restarted. This command does not cause file transfer but skips over the file to the specified data checkpoint. This command shall be immediately followed by the appropriate FTP service command which shall cause file transfer to resume.
Read more: http://www.faqs.org/rfcs/rfc959.html
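For what it's worth, Python's ftplib exposes this directly: the rest argument makes it send a REST before the RETR. A small sketch (server and file name are placeholders):
from ftplib import FTP

ftp = FTP("ftp.example.com")          # hypothetical server
ftp.login()                           # anonymous login
chunks = []
# rest=614400 sends "REST 614400" before the RETR, so the transfer
# starts 600 KiB into the file.
ftp.retrbinary("RETR bigfile.bin", chunks.append, rest=614400)
ftp.quit()
data = b"".join(chunks)
Note that REST only sets the start point; to stop after a bounded range you still have to close the data connection yourself and open a new one for the next range, which matches the addendum's observation that setting up the data socket is the slow part.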
You should check out how GridFTP does parallel transfers. It uses the sort of techniques you want (and it might actually be code that is better to borrow than to reimplement from scratch yourself).
