Is there a provably optimal block (piece) size for torrents (and individual files)? - rsync

The BitTorrent protocol doesn't specify block (piece) size. This is left to the user. (I've seen different torrents for the same content with 3 or more different choices.)
I'm thinking of filing a BitTorrent Enhancement Proposal which needs to make a specific block size mandatory — both for the whole torrent, and also for individual files (for which BTv2 (BEP 52) specifies bs=16KiB).
The only thing I've found that's close is the rsync block size algorithm in Tridgell & Mackerras' technical paper. Their bs=300-1100 B (# bytes aren't powers of 2).
Torrents, however, usually use bs=64kB–16MB (# bytes are powers of 2, and much larger than rsync's) for the whole torrent (and, for BTv2, 16KiB for files).
The specified block size doesn't need to be a constant. It could be a function of thing-hashed size, of course (like it is in rsync). It could also be a function of file type; e.g. there might be some block sizes which are better for making partial video/archive/etc files more usable.
See also this analysis of BitTorrent as a block-aligned file system.
So…
What are optimal block sizes for a torrent, generic file, or partial usefulness of specific file types?
Where did the 16KiB bs in BEP 52 come from?

Block and piece size are not the same thing.
A piece is the unit that is hashed into the pieces string in v1 torrents, one hash per piece.
A block is a part of a piece that is requested via request (ID 6) and delivered via piece (ID 7) messages. These messages basically consist of (piece number, offset, length) tuple where the length is the block size. In this sense blocks are very ephemeral constructs in v1 torrents but they are still important since downloading clients have to keep a lot of state about them in memory. Since the downloading client is in control of the request size they customarily use fixed 16KiB blocks, even though they could do this more flexibly. For an uploading client it does not really matter complexity-wise as they have to simply serve the bytes covered by (piece,offset,length) and keep no further state.
Since clients generally implement an upper message size limit to avoid DoS attacks 16KiB is also the recommended upper bound. Specialized implementations could use larger blocks, but for public torrents that doesn't really happen.
For v2 torrents the picture changes a bit. There now are three concepts
the ephemeral blocks sent via messages
the pieces (now representing some layer in the merkle tree), needed for v1 compatibility in hybrid torrents and also stored as piece layers outside the info dictionary to allow partial file resume
the leaf blocks of the merkle tree
The first type is essentially unchanged compared to v1 torrents but the incentive to use 16KiB-sized blocks is much stronger now because that is also the leaf hash size.
The piece size must now be a power of two and multiple of 16KiB, this constraint did not exist in v1 torrents.
The leaf block size is fixed to 16KiB, it is relevant when constructing the merkle tree and exchanging message IDs 21 (hash request) and 22 (hashes)
What are optimal block sizes for a torrent, generic file, or partial usefulness of specific file types?
For a v1 torrent the piece size combined with the file sizes determines a lower bound of the metadata (aka .torrent file) size. Each piece must be stored as a 20byte hash in pieces, thus larger pieces result in fewer hashes and smaller .torrent files. For terabyte-scale torrents a 16KiB piece size result in a ~1GB torrent file, which is unacceptable for most use-cases.
For a v2 torrent it would result in a similarly sized piece layers in the root dictionary. Or if a client does not have the piece layers data available (e.g. because they started a download via infohash) they will have to retrieve the data via hash request messages instead, ultimately resulting in the same overhead, albeit more spread out over the course of the download.
Where did the 16KiB bs in BEP 52 come from?
16KiB was already the de-facto block size for most clients. Since a merkle-tree must be calculated from some leaf hashes a fixed block size for those leaves had to be defined. Hence the established messaging block size was also chosen for the merkle tree blocks.
The only thing I've found that's close is the rsync block size algorithm in Tridgell & Mackerras' technical paper. Their bs=300-1100 B (# bytes aren't powers of 2).
rsync uses a rolling hash for content-aware chunking rather than fixed-size blocks and that is the primary driver for their chunk-size choices. So rsync considerations do not apply to bittorrent.

Related

Storage cost / supportability / performance tradeoffs using compact attributes in DynamoDB

I'm working on large scale component that generates unique/opaque tokens representing business entities. Over time there will be many billions of these records, but for the first year we're not expecting growth to exceed more than 2 billion individual items (probably less than 500 million).
The system itself is horizontally scaled but needs token generation to be idempotent; data integrity is maintained by using a contained but reasonably complex combination of transactional writes with embedded condition expressions AND standalone condition check write items.
The tokens themselves are UUIDs, and 'being efficient' are persisted as Binary attribute values (16 bytes) rather than the string representation (36 bytes), however the downside is that the data doesn't visualise nicely in query consoles making support hard if we encounter any bugs and/or broken data. Note there is no extra code complexity since we implement attributevalue.Marshaler interface to bind UUID (language) types to DynamoDB Binary attributes, and similarly do the same for any composite attributes.
My question relates to (mostly) data size/saving. Since the tokens are the partition keys, and some mapping columns are [token] -> [other token composite attributes], for example two UUIDs concatenated together into 32 bytes.
I wanted to keep really tight control over storage costs knowing that, over time, we will be spending ~$0.25/GB per month for this. My question is really three parts:
Are the PK/SK index size 'reserved' (i.e. padded) so it would make no difference at all to storage cost if we compress the overall field sizes down to the minimum possible size? (... I read somewhere that 100 bytes is typically reserved.
If they ARE padded, the cost savings for the data would be reasonably high, because each (tree) index node will be nearly as big as the data being mapped. (I assume a tree index is used once hashed PK has routed the query to the right server node/disk etc.)
Is there any observable query time performance benefit to compacting 36 bytes into 16 (beyond saving a few bytes across the network)? i.e. if Dynamo has to read fewer pages it'll work faster, but in practice are we talking microseconds at best?
This is a secondary concern, but is worth considering if there is a lot of concurrent access to the data. UUIDs will distribute partitions but inevitably sometimes we will have some more active partitions than others.
Are there any tools that can parse bytes back into human-readable UUIDs (or that we customise to inject behaviour to do this)?
This is concern, because making things small and efficient is ok, but supporting and resolving data issues will be difficult without significant tooling investment, and (unsurprisingly) the DynamoDB console, DynamoDB IntelliJ plugin and AWS NoSQL Workbench all garble the binary into unreadable characters.
No, the PK/SK types are not padded. There's 100 bytes of overhead per item stored.
Sending less data certainly won't hurt your performance. Don't expect a noticeable improvement though. If shorter values can keep your items at 1,024 bytes instead of 1,025 bytes then you save yourself a Write Unit during the save.
For the "garbled" binary values I assume you're looking at the base64 encoded values, which is a standard binary encoding standard which can be reversed by lots of tooling (now that you know the name of it).

pkcs11 - questions regarding `C_EncryptUpdate/C_EncryptFinal`

I'm using C_EncryptUpdate/C_EncryptFinal and I don't really understand what is supposed to do C_EncryptFinal.
My assumption is that C_EncryptFinal is used to add the padding and last block encryption when the size of the buffer to encrypt is not a multiple of the block size.
Am I correct? Shall C_EncryptFinal always be called after a C_EncryptUpdate loop ?
If you want to be sure you have all the ciphertext you should call C_EncryptFinal.
After calling C_EncryptInit, the application can either call C_Encrypt
to encrypt data in a single part; or call C_EncryptUpdate zero or
more times, followed by C_EncryptFinal, to encrypt data in multiple
parts. The encryption operation is active until the application uses
a call to C_Encrypt or C_EncryptFinal to actually obtain the final
piece of ciphertext. To process additional data (in single or
multiple parts), the application must call C_EncryptInit again.
You can replace calls to C_EncryptUpdate and C_EncryptFinal (aka multiple-part operation) with single C_Encrypt if you have all your plaintext ready in a single buffer (aka single-part operation).
(Beware that some mechanisms might support only single-part operation, e.g. CKM_RSA_X_509)
EDIT:
The C_EncryptFinal does not necessarily need to return any data (i.e. the returned encrypted data part length in pulEncryptedPartLen can be zero).
As you say, the CKM_AES_CBC encryption which was fed with block aligned data (via C_EncryptUpdate) will probably return no encrypted data part after C_EncryptFinal for most of the implementations (as they would return the corresponding ciphertext immediately in the C_EncryptUpdate).
But there might exist an implementation, which internally buffers this block aligned data without encrypting it (thus returning zero length output data part in C_EncryptUpdate) and which then encrypts all the buffered data at once during the C_EncryptFinal -- an example might be an implementation backed by a smart card (or a remote host), where it might be a good idea to send data in larger chunks (even if the cryptoki itself receives data in a block sized chunks).
PKCS#11 API allows that and you have to handle it correctly (i.e. check returned lengths, shift your destination pointers/update the space available accordingly).
Think of it as of a universal API which needs to support any imaginable mechanism.

Transfer files using checksums only?

Would it be possible to transfer large files using only a system of checksums, and then reconstruct the original file by calculations?
Say that you transfer the MD5 checksum of a file and the size of the file. By making a "virtual file" and calculating it's checksum, trying every single bit combination, you should eventually "reach" the original file. But on the way you would also get a lot of "collisions" where the checksum also match.
So we change the first byte of the original file to some specified value, calculate the checksum again, and send this too. If we make the same substitution in the virtual file we can test each "collision" to see if it still matches. This should narrow it down a bit, and we can do this several times.
Of course, the computing power to do this would be enormous. But is it theoretically possible, and how many checksums would you need to transfer something (say 1mb)? Or would perhaps the amount of data needed to transfer the checksums almost as large as the file, making it pointless?
The amount of data you need to transfer would most certainly be the same size as the file. Consider: If you could communicate a n byte file with n-1 bytes of data, that means you've got 256^(n-1) possible patterns of data you may have sent, but are selecting from a space of size 256^n. This means that one out of every 256 files won't be expressible using this method - this is often referred to as the pidegonhole principle.
Now, even if that wasn't a problem, there's no guarentee that you won't have a collision after any given amount of checksumming. Checksum algorithms are designed to avoid collisions, but for most checksum/hash algorithms there's no strong proof that after X hashes you can guarantee no collisions in a N-byte space.
Finally, hash algorithms, at least, are designed to be hard to reverse, so even if it were possible it would take an impossible huge amount of CPU power to do so.
That said, for a similar approach, you might be interested in reading about Forward Error Correction codes - they're not at all hash algorithms, but I think you may find them interesting.
What you have here is a problem of information. A checksum is not necessarily unique to a particular set of data, in fact to be so it would effectively need to have a many bits of information as the source. What it can indicate is that the data received is not the exact data that the checksum was generated from but in most cases it can't prove it.
In short "no".
To take a hypothetical example, consider a 24 bpp photo with 6 pixels -- there are 2^(24 * 6) (2^144) possible combinations of intensities for each colour channel on those six pixels, so you can gaurantee that if you were to evaluate every possibility, you are guaranteed an MD5 collision (as MD5 is a 128 bit number).
Short answer: Not in any meaningfull form.
Long answer:
Let us assume an arbitrary file file.bin with a 1000-byte size. There are 2^(8*1000) different combinations that could be its actual contents. By sending e.g. a 1000-bit checksum,
you still have about 2^(7*1000) colliding alternatives.
By sending a single additional bit, you might be able cut those down by half... and you still have 2^6999 collisions. By the time you eliminate the colisions, you will have sent at least 8000 bits i.e. an amount equal or greater to the file size.
The only way for this to be theoretically possible (Note: I did not say "feasible", let alone "practical") would be if the file did not really contain random data and you could use that knowledge to prune alternatives. In that case you'd be better off using compression, ayway. Content-aware compression algorithms (e.g. FLAC for audio) use a-priori knowledge on the properties of the input data to improve the compression ratio.
I think what you are thinking of is in fact an interesting topic, but you haven't hit upon the right method. If I can try and rephrase your question, you are asking if there is a way to apply a function to some data, transmit the result of the function, and then reconstruct the original data from the terser function result. For a single MD5 checksum the answer is no, but with other functions, provided you are willingly to send several function results, it is possible. In general this area of research is called compressed sensing. Sometimes exact reconstruction is possible, but more often it is used as a lossy compression scheme for images and other visual or sound data.

What is the best compression library for very small amounts of data (3-4 kib?)

I am working on a game engine which is loosely descended from Quake 2, adding some things like scripted effects (allowing the server to specify special effects in detail to a client, instead of having only a limited number of hardcoded effects which the client is capable of.) This is a tradeoff of network efficiency for flexibility.
I've hit an interesting barrier. See, the maximum packet size is 2800 bytes, and only one can go out per client per frame.
Here is the script to do a "sparks" effect (could be good for bullet impact sparks, electrical shocks, etc.)
http://pastebin.com/m7acdf519 (If you don't understand it, don't sweat it; it's a custom syntax I made and not relevant to the question I am asking.)
I have done everything possible to shrink the size of that script. I've even reduced the variable names to single letters. But the result is exactly 405 bytes. Meaning you can fit at most 6 of these per frame. I also have in mind a few server-side changes which could shave it down another 12, and a protocol change which might save another 6. Although the savings would vary depending on what script you are working with.
However, of those 387 bytes, I estimate that only 41 would be unique between multiple usages of the effect. In other words, this is a prime candidate for compression.
It just so happens that R1Q2 (a backward-compatible Quake 2 engine with an extended network protocol) has Zlib compression code. I could lift this code, or at least follow it closely as a reference.
But is Zlib necessarily the best choice here? I can think of at least one alternative, LZMA, and there could easily be more.
The requirements:
Must be very fast (must have very small performance hit if run over 100 times a second.)
Must cram as much data as possible into 2800 bytes
Small metadata footprint
GPL compatible
Zlib is looking good, but is there anything better? Keep in mind, none of this code is being merged yet, so there's plenty of room for experimentation.
Thanks,
-Max
EDIT: Thanks to those who have suggested compiling the scripts into bytecode. I should have made this clear-- yes, I am doing this. If you like you can browse the relevant source code on my website, although it's still not "prettied up."
This is the server-side code:
Lua component: http://meliaserlow.dyndns.tv:8000/alienarena/lua_source/lua/scriptedfx.lua
C component: http://meliaserlow.dyndns.tv:8000/alienarena/lua_source/game/g_scriptedfx.c
For the specific example script I posted, this gets a 1172 byte source down to 405 bytes-- still not small enough. (Keep in mind I want to fit as many of these as possible into 2800 bytes!)
EDIT2: There is no guarantee that any given packet will arrive. Each packet is supposed to contain "the state of the world," without relying on info communicated in previous packets. Generally, these scripts will be used to communicate "eye candy." If there's no room for one, it gets dropped from the packet and that's no big deal. But if too many get dropped, things start to look strange visually and this is undesirable.
LZO might be a good candidate for this.
FINAL UPDATE: The two libraries seem about equivalent. Zlib gives about 20% better compression, while LZO's decoding speed is about twice as fast, but the performance hit for either is very small, nearly negligible. That is my final answer. Thanks for all other answers and comments!
UPDATE: After implementing LZO compression and seeing only sightly better performance, it is clear that my own code is to blame for the performance hit (massively increased number of scripted effects possible per packet, thus my effect "interpreter" is getting exercised a lot more.) I would like to humbly apologize for scrambling to shift blame, and I hope there are no hard feelings. I will do some profiling and then maybe I will be able to get some numbers which will be more useful to someone else.
ORIGINAL POST:
OK, I finally got around to writing some code for this. I started out with Zlib, here are the first of my findings.
Zlib's compression is insanely great. It is reliably reducing packets of, say, 8.5 kib down to, say, 750 bytes or less, even when compressing with Z_BEST_SPEED (instead of Z_DEFAULT_COMPRESSION.) The compression time is also pretty good.
However, I had no idea the decompression speed of anything could even possibly be this bad. I don't have actual numbers, but it must be taking 1/8 second per packet at least! (Core2Duo T550 # 1.83 Ghz.) Totally unacceptable.
From what I've heard, LZMA is a tradeoff of worse performance vs. better compression when compared to Zlib. Since Zlib's compression is already overkill and its performance is already incredibly bad, LZMA is off the table sight unseen for now.
If LZO's decompression time is as good as it's claimed to be, then that is what I will be using. I think in the end the server will still be able to send Zlib packets in extreme cases, but clients can be configured to ignore them and that will be the default.
zlib might be a good candidate - license is very good, works fast and its authors say it has very little overhead and overhead is the thing that makes use for small amounts of data problematic.
you should look at OpenTNL and adapt some of the techniques they use there, like the concept of Network Strings
I would be inclinded to use the most significant bit of each character, which is currently wasted, by shifting groups of 9 bytes leftwards, you will fit into 8 bytes.
You could go further and map the characters into a small space - can you get them down to 6 bits (i.e. only having 64 valid characters) by, for example, not allowing capital letters and subtracting 0x20 from each character ( so that space becomes value 0 )
You could go further by mapping the frequency of each character and make a Huffman type compression to reduce the avarage number bits of each character.
I suspect that there are no algorithms that will save data any better that, in the general case, as there is essentially no redundancy in the message after the changes that you have alrady made.
How about sending a binary representation of your script?
So I'm thinking in the lines of a Abstract Syntax Tree with each procedure having a identifier.
This means preformance gain on the clients due to the one time parsing, and decrease of size due to removing the method names.

Tibco Rendezvous - size constraints

I am attempting to put a potentially large string into a rendezvous message and was curious about size constraints. I understand there is a physical limit (64mb?) to the message as a whole, but I'm curious about how some other variables could affect it. Specifically:
How big the keys are?
How the string is stored (in one field vs. multiple fields)
Any advice on any of the above topics or anything else that could be relevant would be greatly appreciated.
Note: I would like to keep the message as a raw string (as opposed to bytecode, etc).
From the Tibco docs on Very Large Messages:
Rendezvous software can transport very
large messages; it divides them into
small packets, and places them on the
network as quickly as the network can
accept them. In some situations, this
behavior can overwhelm network
capacity; applications can achieve
higher throughput by dividing large
messages into smaller chunks and
regulating the rate at which it sends
those chunks. You can use the
performance tool to evaluate chunk
sizes and send rates for optimal
throughput.
This example, sends one message
consisting of ten million bytes.
Rendezvous software automatically
divides the message into packets and
sends them. However, this burst of
packets might exceed network capacity,
resulting in poor throughput:
sender> rvperfm -size 10000000 -messages 1
In this second example, the
application divides the ten million
bytes into one thousand smaller
messages of ten thousand bytes each,
and automatically determines the batch
size and interval to regulate the flow
for optimal throughput:
sender> rvperfm -size 10000 -messages 1000 -auto
By varying the -messages and -size
parameters, you can determine the
optimal message size for your
applications in a specific network.
Application developers can use this
information to regulate sending rates
for improved performance.
As to actual limits the Add string function takes a C style ansi string so is theoretically unbounded but, given the signature of the AddOpaque
tibrv_status tibrvMsg_AddOpaque(
tibrvMsg message,
const char* fieldName,
const void* value,
tibrv_u32 size);
which takes a u32 it would seem sensible to state that the limit is likely to be 4GB rather than 64MB.
That said using Tib to transfer such large packets is likely to be a serious performance bottleneck as it may have to buffer significant amounts of traffic as it tries to get these sorts of messages to all consumers. By default the rvd buffer is only 60 seconds so you may find yourself suffering message loss if this is a significant amount of your traffic.
Message overhead within tibco is largely as simple as:
the fixed cost associated with each message (the header)
All the fields (type info and the field id)
Plus the cost of all variable length aspects including:
the send and receive subjects (effectively limited to 256 bytes each)
the field names. I can find no limit to the length of the field names in the docs but the smaller they are the better, better still don't use them at all and use the numerical identifiers
the array/string/opaque/user defined variable length fields in the message
Note: If you use nested messages simply recurse the above.
In your case the payload overhead will be so vast in comparison to the names (so long as they are reasonable and simple) there is little point attempting to optimize these at all.
You may find you can considerable efficiency on the wire/buffered if you transmit the strings in a compressed form, either through the use of an rvrd with compression enabled or by changing your producer/consumer to use something fast but effective like deflate (or if you're feeling esoteric things like QuickLZ,FastLZ,LZO,etc. Especially ones with fixed memory footprint compress/decompress engines)
You don't say which platform api you are targeting (.net/java/C++/C for example) and this will colour things a little. On the wire all string data will be in 1 byte per character regardless of java/.net using UTF-16 by default however you will incur a significant translation cost placing these into/reading them out of the message because the underlying buffer cannot be reused in those cases and a copy (and compaction/expansion respectively) must be performed.
If you stick to opaque byte sequences you will still have the copy overhead in the naieve implementations possible through the managed wrapper apis but this will at least be less overhead if you have no need to work with the data as a native string.
The overall maximum size of a message is 64MB as was speculated in the OP. From the "Tibco Rendezvous Concepts" document:
Although the ability to exchange large data buffers is a feature of Rendezvous
software, it is best not to make messages too large. For example, to exchange data
up to 10,000 bytes, a single message is efficient. But to send files that could be
many megabytes in length, we recommend using multiple send calls, perhaps one
for each record, block or track. Empirically determine the most efficient size for
the prevailing network conditions. (The actual size limit is 64 MB, which is rarely
an appropriate size.)

Resources