Multiple compressors for a single array - zarr

Is it possible to have different compressors, e.g. lossy and lossless, for individual chunks?
Consider a scenario where you have a mask of importance: you want to keep the signal inside the mask with lossless compression (or even no compression at all), but compress the rest of the signal lossily for efficiency and space.
For example we have:
import zarr
z = zarr.zeros((32, 32), chunks=(4, 4))
The important region we want to keep is the slice z[4:11, 4:11], where we want to go lossless, e.g. zlib; for the rest we use Quantize from numcodecs for lossy compression. That way we would have high precision for the interesting part within the mask, lossy compression outside the mask, and two different compressors for different parts of a single array at the chunk level.

This is not currently possible. The compressor interface would have to receive coordinates for encode(). Then you could implement a compressor that would decide to lose information on encoding depending on coordinates. Since compressors operate on chunks, you'd have to choose the chunking so it aligns with the boundaries where you want to change fidelity.
Overall I think you'll have an easier time just writing a wrapper that combines several zarr stores for the different fidelities and overlays them on access and write.
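A minimal sketch of that wrapper idea, assuming the zarr v2-style creation API and the Zlib and Quantize codecs from numcodecs (the mask and the combine logic here are only illustrative, not a definitive implementation):

import numpy as np
import zarr
from numcodecs import Zlib, Quantize

shape, chunks = (32, 32), (4, 4)

# Lossless store for the important region (zlib-compressed, full precision).
z_lossless = zarr.zeros(shape, chunks=chunks, dtype="f8",
                        compressor=Zlib(level=9))

# Lossy store for everything else (Quantize filter drops precision).
z_lossy = zarr.zeros(shape, chunks=chunks, dtype="f8",
                     filters=[Quantize(digits=2, dtype="f8")],
                     compressor=Zlib(level=1))

mask = np.zeros(shape, dtype=bool)
mask[4:11, 4:11] = True  # region we want to keep at full fidelity

def write(data):
    # Route each element to the store matching its fidelity.
    z_lossless[:] = np.where(mask, data, 0.0)
    z_lossy[:] = np.where(mask, 0.0, data)

def read():
    # Overlay: lossless values inside the mask, lossy values outside.
    return np.where(mask, z_lossless[:], z_lossy[:])

write(np.random.random(shape))
roundtrip = read()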

Related

Is there a provably optimal block (piece) size for torrents (and individual files)?

The BitTorrent protocol doesn't specify block (piece) size. This is left to the user. (I've seen different torrents for the same content with 3 or more different choices.)
I'm thinking of filing a BitTorrent Enhancement Proposal which needs to make a specific block size mandatory — both for the whole torrent, and also for individual files (for which BTv2 (BEP 52) specifies bs=16KiB).
The only thing I've found that's close is the rsync block size algorithm in Tridgell & Mackerras' technical paper. Their bs=300-1100 B (# bytes aren't powers of 2).
Torrents, however, usually use bs=64kB–16MB (# bytes are powers of 2, and much larger than rsync's) for the whole torrent (and, for BTv2, 16KiB for files).
The specified block size doesn't need to be a constant. It could be a function of thing-hashed size, of course (like it is in rsync). It could also be a function of file type; e.g. there might be some block sizes which are better for making partial video/archive/etc files more usable.
See also this analysis of BitTorrent as a block-aligned file system.
So…
What are optimal block sizes for a torrent, generic file, or partial usefulness of specific file types?
Where did the 16KiB bs in BEP 52 come from?
Block and piece size are not the same thing.
A piece is the unit that is hashed into the pieces string in v1 torrents, one hash per piece.
A block is a part of a piece that is requested via request (ID 6) and delivered via piece (ID 7) messages. These messages basically consist of a (piece number, offset, length) tuple, where the length is the block size. In this sense blocks are very ephemeral constructs in v1 torrents, but they are still important since downloading clients have to keep a lot of state about them in memory. Since the downloading client is in control of the request size, clients customarily use fixed 16KiB blocks, even though they could do this more flexibly. For an uploading client it does not really matter complexity-wise, as it simply has to serve the bytes covered by (piece, offset, length) and keep no further state.
Since clients generally implement an upper message size limit to avoid DoS attacks, 16KiB is also the recommended upper bound. Specialized implementations could use larger blocks, but for public torrents that doesn't really happen.
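For illustration, a v1 request message for one 16KiB block is just the (piece, offset, length) triple behind a length prefix and the message ID; the piece index and offset below are arbitrary:

import struct

BLOCK_SIZE = 16 * 1024  # the customary 16 KiB request size

def build_request(piece_index: int, offset: int, length: int = BLOCK_SIZE) -> bytes:
    # <len=0013><id=6><index><begin><length>, all integers big-endian
    return struct.pack(">IBIII", 13, 6, piece_index, offset, length)

msg = build_request(piece_index=7, offset=2 * BLOCK_SIZE)
print(msg.hex())  # 0000000d 06 00000007 00008000 00004000 (grouped for readability)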
For v2 torrents the picture changes a bit. There are now three concepts:
the ephemeral blocks sent via messages
the pieces (now representing some layer in the merkle tree), needed for v1 compatibility in hybrid torrents and also stored as piece layers outside the info dictionary to allow partial file resume
the leaf blocks of the merkle tree
The first type is essentially unchanged compared to v1 torrents, but the incentive to use 16KiB-sized blocks is much stronger now because that is also the amount of data covered by each leaf hash.
The piece size must now be a power of two and a multiple of 16KiB; this constraint did not exist in v1 torrents.
The leaf block size is fixed at 16KiB; it is relevant when constructing the merkle tree and when exchanging message IDs 21 (hash request) and 22 (hashes).
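As a rough sketch (simplified; it does not follow the exact BEP 52 padding rules), this is how 16KiB leaf blocks feed into a per-file merkle root using SHA-256:

import hashlib

LEAF_SIZE = 16 * 1024  # fixed 16 KiB leaf block size

def merkle_root(data: bytes) -> bytes:
    # Hash each 16 KiB leaf block.
    leaves = [hashlib.sha256(data[i:i + LEAF_SIZE]).digest()
              for i in range(0, max(len(data), 1), LEAF_SIZE)]
    # Pad the leaf layer to a power of two with zero hashes (simplified).
    while len(leaves) & (len(leaves) - 1):
        leaves.append(bytes(32))
    # Combine pairwise until a single root remains.
    while len(leaves) > 1:
        leaves = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0]

print(merkle_root(b"x" * (40 * 1024)).hex())  # 40 KiB file -> 3 leaves, padded to 4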
What are optimal block sizes for a torrent, generic file, or partial usefulness of specific file types?
For a v1 torrent the piece size combined with the file sizes determines a lower bound on the metadata (aka .torrent file) size. Each piece must be stored as a 20-byte hash in pieces, thus larger pieces result in fewer hashes and smaller .torrent files. For terabyte-scale torrents a 16KiB piece size results in a roughly 1GB torrent file, which is unacceptable for most use-cases.
For a v2 torrent it would result in a similarly sized piece layers entry in the root dictionary. Or, if a client does not have the piece layers data available (e.g. because it started a download via infohash), it will have to retrieve the data via hash request messages instead, ultimately resulting in the same overhead, albeit more spread out over the course of the download.
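A quick back-of-the-envelope check of that metadata overhead for a 1 TiB torrent:

TiB = 2**40
piece_hash = 20  # bytes per SHA-1 piece hash in the v1 pieces string

for piece_size in (16 * 2**10, 1 * 2**20, 16 * 2**20):
    n_pieces = TiB // piece_size
    print(f"piece size {piece_size >> 10:6d} KiB -> "
          f"{n_pieces:10d} pieces, ~{n_pieces * piece_hash / 2**20:8.1f} MiB of hashes")
# 16 KiB pieces -> ~1280 MiB of hashes; 16 MiB pieces -> ~1.25 MiB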
Where did the 16KiB bs in BEP 52 come from?
16KiB was already the de-facto block size for most clients. Since a merkle tree must be calculated from some leaf hashes, a fixed block size for those leaves had to be defined. Hence the established messaging block size was also chosen for the merkle tree blocks.
The only thing I've found that's close is the rsync block size algorithm in Tridgell & Mackerras' technical paper. Their bs=300-1100 B (# bytes aren't powers of 2).
rsync uses a rolling hash so that it can match fixed-size blocks of the old file at arbitrary byte offsets in the new file, and that matching strategy is the primary driver for its block-size choices. So rsync's considerations do not apply to BitTorrent.
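For reference, a sketch of the weak rolling checksum described in the rsync paper (simplified; the real implementation also uses a stronger hash to confirm candidate matches):

M = 1 << 16

def weak_checksum(block: bytes) -> tuple[int, int]:
    # a = plain sum of bytes, b = positionally weighted sum (both mod 2^16)
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, block_len: int) -> tuple[int, int]:
    # Slide the window one byte: drop out_byte, add in_byte, in O(1).
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b

data = bytes(range(256)) * 8
n = 700  # rsync-style block size of a few hundred bytes
a, b = weak_checksum(data[0:n])
a, b = roll(a, b, data[0], data[n], n)
assert (a, b) == weak_checksum(data[1:n + 1])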

Why would VkImageView format differ from the underlying VkImage format?

VkImageCreateInfo has the following member:
VkFormat format;
And VkImageViewCreateInfo has the same member.
What I don't understand is why you would ever have a different format in the VkImageView from the VkImage needed to create it.
I understand some formats are compatible with one another, but I don't know why you would use one of the alternate formats.
The canonical use case and primary original motivation (in D3D10, where this idea originated) is using a single image as either R8G8B8A8_UNORM or R8G8B8A8_SRGB -- either because it holds different content at different times, or because sometimes you want to operate in sRGB-space without linearization.
More generally, it's useful sometimes to have different "types" of content in an image object at different times -- this gives engines a limited form of memory aliasing, and was introduced to graphics APIs several years before full-featured memory aliasing was a thing.
Like a lot of Vulkan, the API is designed to expose what the hardware can do. Memory layout (image) and the interpretation of that memory as data (image view) are different concepts in the hardware, and so the API exposes that. The API exposes it simply because that's how the hardware works and Vulkan is designed to be a thin abstraction; just because the API can do it doesn't mean you need to use it ;)
As you say, in most cases it's not really that useful ...
I think there are some cases where it could be more efficient. For example, getting a compute shader to generate integer data for some types of image processing can be more energy efficient than either float computation or manually normalizing integer data to create unorm data. Using aliasing, the compute shader can directly write e.g. uint8 integers and a fragment shader can read the same data as unorm8 data.

What is the advantage of using a 1d image over a 1d buffer?

I understand that in 2d, images are cached in the x and y directions. But in 1d, why would you want to use an image? Is the memory used for images faster than the memory used for buffers?
A 1D image is still an image, so it has all the advantages that an image has over a buffer. That is:
Image IO operations are usually well-cached.
Samplers can be used, which gives benefits like computationally cheap interpolation, hardware-resolved out-of-bounds access, etc.
Though you should remember that an image has some constraints in comparison to a regular buffer:
A single image can be used either for reading or for writing within one kernel.
You can't use the vloadN / vstoreN operations, which can handle up to 16 values per call. Your best option is the read_imageX & write_imageX functions, which can load / store up to 4 values per call. That can be a serious issue on a GPU with a vector architecture.
If you are not using a 4-component format, you usually lose part of the performance, as many functions process samples from all colour planes simultaneously, so the effective payload decreases.
If we talk about GPUs, different parts of the hardware are involved in processing images and buffers, so it's difficult to say in general which one is better than the other. Careful benchmarking & algorithm optimization are needed.

Transfer files using checksums only?

Would it be possible to transfer large files using only a system of checksums, and then reconstruct the original file by calculations?
Say that you transfer the MD5 checksum of a file and the size of the file. By making a "virtual file" and calculating its checksum, trying every single bit combination, you should eventually "reach" the original file. But on the way you would also get a lot of "collisions" where the checksum also matches.
So we change the first byte of the original file to some specified value, calculate the checksum again, and send this too. If we make the same substitution in the virtual file we can test each "collision" to see if it still matches. This should narrow it down a bit, and we can do this several times.
Of course, the computing power to do this would be enormous. But is it theoretically possible, and how many checksums would you need to transfer something (say 1MB)? Or would the amount of data needed to transfer the checksums perhaps be almost as large as the file itself, making it pointless?
The amount of data you need to transfer would most certainly be the same size as the file. Consider: if you could communicate an n-byte file with n-1 bytes of data, that means you've got 256^(n-1) possible patterns of data you may have sent, but are selecting from a space of size 256^n. This means that at most one out of every 256 possible files could be expressed using this method - this is the pigeonhole principle.
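The counting argument in concrete (tiny) numbers:

n = 4  # a 4-byte "file", small enough to count exactly
files = 256 ** n           # every possible n-byte file
messages = 256 ** (n - 1)  # every possible (n-1)-byte transmission
print(files, messages, files / messages)  # 4294967296 16777216 256.0
# There are 256x more possible files than shorter messages, so no
# encoding into n-1 bytes can distinguish them all.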
Now, even if that wasn't a problem, there's no guarantee that you won't have a collision after any given amount of checksumming. Checksum algorithms are designed to avoid collisions, but for most checksum/hash algorithms there's no strong proof that after X hashes you can guarantee no collisions in an N-byte space.
Finally, hash algorithms, at least, are designed to be hard to reverse, so even if it were possible it would take an impossibly huge amount of CPU power to do so.
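To get a feel for why brute-forcing a checksum back into a file is hopeless, here is a toy preimage search restricted to 3-byte "files"; every additional byte multiplies the work by 256:

import hashlib
from itertools import product

target = hashlib.md5(b"abc").hexdigest()  # pretend this is all we received

# Exhaustively try every 3-byte file: 256**3 = 16,777,216 candidates.
for candidate in product(range(256), repeat=3):
    data = bytes(candidate)
    if hashlib.md5(data).hexdigest() == target:
        print("found:", data)
        break
# A 1 MB file would require 256**1_000_000 candidates; astronomically
# far beyond any conceivable amount of computation.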
That said, for a similar approach, you might be interested in reading about Forward Error Correction codes - they're not at all hash algorithms, but I think you may find them interesting.
What you have here is a problem of information. A checksum is not necessarily unique to a particular set of data; in fact, to be so it would effectively need to have as many bits of information as the source. What it can indicate is that the received data is not the exact data that the checksum was generated from, but a match cannot in most cases prove that it is.
In short "no".
To take a hypothetical example, consider a 24 bpp photo with 6 pixels -- there are 2^(24 * 6) (2^144) possible combinations of intensities for each colour channel on those six pixels, so if you were to evaluate every possibility you are guaranteed an MD5 collision (as MD5 is a 128-bit number).
Short answer: not in any meaningful form.
Long answer:
Let us assume an arbitrary file file.bin with a size of 1000 bytes. There are 2^(8*1000) different combinations that could be its actual contents. By sending e.g. a 1000-bit checksum, you still have about 2^(7*1000) colliding alternatives.
By sending a single additional bit, you might be able to cut those down by half... and you still have 2^6999 collisions. By the time you eliminate the collisions, you will have sent at least 8000 bits, i.e. an amount equal to or greater than the file size.
The only way for this to be theoretically possible (note: I did not say "feasible", let alone "practical") would be if the file did not really contain random data and you could use that knowledge to prune alternatives. In that case you'd be better off using compression anyway. Content-aware compression algorithms (e.g. FLAC for audio) use a priori knowledge of the properties of the input data to improve the compression ratio.
I think what you are thinking of is in fact an interesting topic, but you haven't hit upon the right method. If I can try to rephrase your question, you are asking if there is a way to apply a function to some data, transmit the result of the function, and then reconstruct the original data from the terser function result. For a single MD5 checksum the answer is no, but with other functions, provided you are willing to send several function results, it is possible. In general this area of research is called compressed sensing. Sometimes exact reconstruction is possible, but more often it is used as a lossy compression scheme for images and other visual or sound data.

Does AES (128 or 256) encryption expand the data? If so, by how much?

I would like to add AES encryption to a software product, but am concerned about increasing the size of the data. I am guessing that the data does increase in size, and that I'll then have to add a compression algorithm to compensate.
AES does not expand data. Moreover, the output will not generally be compressible; if you intend to compress your data, do so before encrypting it.
However, note that AES encryption is usually combined with padding, which will increase the size of the data (though only by a few bytes).
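As a rough illustration, here is a sketch using the pyca/cryptography package (assumed to be installed and recent enough that the default backend is implicit), showing how PKCS7 padding rounds each message up to the next 16-byte boundary:

import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key, iv = os.urandom(32), os.urandom(16)  # AES-256 key, CBC IV

def encrypt(plaintext: bytes) -> bytes:
    padder = padding.PKCS7(128).padder()  # pad to the 128-bit block size
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return enc.update(padded) + enc.finalize()

for n in (5, 15, 16, 100):
    print(n, "->", len(encrypt(b"x" * n)))
# 5 -> 16, 15 -> 16, 16 -> 32, 100 -> 112: always the next multiple of 16,
# plus the IV if you transmit it alongside the ciphertext.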
AES does not expand the data, except for a few bytes of padding at the end of the last block.
The resulting data are not compressible, at any rate, because they are basically random - no dictionary-based algorithm is able to effectively compress them. A best practice is to compress the data first, then encrypt them.
It is common to compress data before encrypting. Compressing it afterwards doesn't work, because AES encrypted data appears random (as for any good cipher, apart from any headers and whatnot).
However, compression can introduce side-channel attacks in some contexts, so you must analyse your own use. Such attacks have recently been reported against encrypted VoIP: the gist is that different syllables create characteristic variations in bitrate when compressed with VBR, because some sounds compress better than others. Some (or all) syllables may therefore be recoverable with sufficient analysis, since the data is transmitted at the rate it is generated. The fix is either to use (less efficient) CBR compression, or to use a buffer to transmit at a constant rate regardless of the data rate coming out of the encoder (increasing latency).
AES turns 16 byte input blocks into 16 byte output blocks. The only expansion is to round the data up to a whole number of blocks.
I am fairly sure AES encryption adds nothing to the data being encrypted, since that would give away information about the state variables, and that is a Bad Thing when it comes to cryptography.
If you want to mix compression and encryption, do them in that order. The reason is that encrypted data (ideally) looks like totally random data, and compression algorithms will end up making the data bigger, due to their inability to actually compress any of it and the bookkeeping overhead that comes with any compressed file format.
If compression is necessary, do it before you encrypt.
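A quick way to see why the order matters: zlib shrinks compressible plaintext, but slightly grows random-looking data such as ciphertext (os.urandom stands in for encrypted output here):

import os
import zlib

plaintext = b"the quick brown fox jumps over the lazy dog " * 500
ciphertext_like = os.urandom(len(plaintext))  # stand-in for AES output

print(len(plaintext))                       # 22000
print(len(zlib.compress(plaintext)))        # a few hundred bytes
print(len(zlib.compress(ciphertext_like)))  # slightly larger than 22000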
No. The only change will be a small amount of padding to align the data to the size of a block.
However, if you are compressing the content note that you should do this before encrypting. Encrypted data should generally be indistinguishable from random data, which means that it will not compress.
#freespace and others: one of the things I remember from my cryptography classes is that you should not compress your data before encryption, because some repeatable chunks of the compressed stream (like section headers, for example) may make it easier to crack your encryption.
