Decoding a Huffman-encoded byte array fast

I have a Huffman-encoded byte array of size 400 MB, where the Huffman tokens are all possible 4-bit values (0-15). I have to decode it within 1 minute on a Windows system with 16 GB of RAM and a 2.8 GHz processor, so I need an efficient way to decode. Can I make it in 1 minute?

It took about eight seconds on my four-year-old 2 GHz i7 processor, using zlib's inflate decompressor on input that was Huffman-encoded only and compressed 4:1 down to 400 MB.
So yes, you should be able to do much better than a minute.
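The usual way to hit that speed is a table-driven decoder: precompute a lookup table indexed by the next few bits of the stream, so each symbol costs one table access instead of a bit-by-bit tree walk (zlib's inflate uses a multi-level variant of this). Here is a minimal sketch in C++, using a hypothetical toy code rather than your actual 16-symbol table:

#include <cstdint>
#include <cstdio>
#include <vector>

// Toy prefix code (hypothetical, not your table): symbol -> {code, length},
// codes written MSB-first. A real decoder builds this from the stored table.
struct Entry { uint8_t symbol; uint8_t length; };

constexpr int MAXBITS = 3;                       // longest code in this toy code
constexpr uint8_t CODES[4]   = {0b0, 0b10, 0b110, 0b111};
constexpr uint8_t LENGTHS[4] = {1, 2, 3, 3};

int main() {
    // Flat lookup table: every possible MAXBITS-bit window maps directly
    // to (symbol, code length), so no per-bit branching is needed.
    std::vector<Entry> table(1u << MAXBITS);
    for (uint8_t s = 0; s < 4; ++s)
        for (uint32_t fill = 0; fill < (1u << (MAXBITS - LENGTHS[s])); ++fill)
            table[(CODES[s] << (MAXBITS - LENGTHS[s])) | fill] = {s, LENGTHS[s]};

    // Encoded stream for symbols 2,0,1,3: bits 110 0 10 111, then padding.
    const uint8_t input[] = {0b11001011, 0b10000000};
    uint64_t bitbuf = 0; int bits = 0; size_t pos = 0;
    for (int n = 0; n < 4; ++n) {
        while (bits < MAXBITS && pos < sizeof input) {   // refill, MSB-first
            bitbuf = (bitbuf << 8) | input[pos++]; bits += 8;
        }
        Entry e = table[(bitbuf >> (bits - MAXBITS)) & ((1u << MAXBITS) - 1)];
        bits -= e.length;                                // consume only the code
        std::printf("%d ", e.symbol);                    // prints: 2 0 1 3
    }
}

For 16 symbols the longest possible code is 15 bits, so the real table has at most 2^15 entries and fits comfortably in cache; decoding 400 MB this way is essentially memory-bandwidth bound.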

Related

Is there a way to load a vector equal in size to the global memory of a GPU in OpenCL?

My GPU has 12 GB of global memory (CL_DEVICE_GLOBAL_MEM_SIZE), but only 3 GB of memory that it can allocate at once (CL_DEVICE_MAX_MEM_ALLOC_SIZE). When I try to load a vector whose size exceeds 3 GB, the program crashes. The question is: is it possible to load a bigger vector into GPU memory to utilize it completely, and if so, how?
By default, CL_DEVICE_MAX_MEM_ALLOC_SIZE reports 1/4 of CL_DEVICE_GLOBAL_MEM_SIZE, meaning that on a 12GB GPU a single buffer may be at most 3GB; to use all the memory you would have to allocate four of them.
However, Nvidia GPUs allow you to allocate their full memory capacity in a single buffer, even though they also report the 1/4 limit.
Some AMD GPUs set the limit higher; for example, the Radeon VII lets you use 14 of its 16GB for a single buffer.
The only devices I have ever seen that really enforce the 1/4 limit are the Intel HD 4600 and 5500, i.e. older Intel integrated GPUs. If you go above 1/4 in buffer size there, the cl::Buffer constructor throws error -61 (CL_INVALID_BUFFER_SIZE).
In case you are stuck with the 1/4 memory limit on your device, split your large 12GB buffer into 4 smaller 3GB buffers (for example, one vector each for the x, y, z, w components). If you use Windows, note that you might only be able to use ~11.5GB in total, as some VRAM is reserved for the operating system.
I think your issue might not be CL_DEVICE_MAX_MEM_ALLOC_SIZE though, but rather 32-bit integer overflow of the array size above 4GB. Use a 64-bit type such as uint64_t for the array size instead.
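A minimal sketch of both fixes (context setup trimmed; the 12GB figure and the chunk count of 4 are just this example, and cl.hpp is the legacy C++ binding):

#include <CL/cl.hpp>   // legacy OpenCL C++ bindings; newer SDKs ship <CL/opencl.hpp>
#include <cstdint>
#include <vector>

int main() {
    cl::Context context(CL_DEVICE_TYPE_GPU);

    // 64-bit element count: n*sizeof(float) must not be computed in
    // 32 bits, or it overflows for arrays above 4GB.
    const uint64_t n = 3000000000ull;          // ~12GB of floats in total
    const uint64_t totalBytes = n * sizeof(float);

    // Split one huge logical array into 4 buffers that each stay under
    // CL_DEVICE_MAX_MEM_ALLOC_SIZE; kernels then take 4 buffer arguments.
    std::vector<cl::Buffer> chunks;
    for (int i = 0; i < 4; ++i)
        chunks.emplace_back(context, CL_MEM_READ_WRITE, totalBytes / 4);
}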
You might also be interested in this lightweight OpenCL-Wrapper for C++. There, vector lengths are always 64-bit integers, and it automatically keeps track of how much memory you use in total on each device, telling you if you allocate too much. It also catches the -61 error on Intel iGPUs and then tells you the maximum allowed buffer size.

ffdf object consumes extra RAM (in GB)

I have decided to test the key advantage of the ff package - minimal RAM allocation (PC specs: i5, 8GB RAM, Win7 64-bit, RStudio).
According to the package description, we can manipulate physical objects (files) like virtual ones, as if they were allocated in RAM. Thus, actual RAM usage is reduced greatly (from GB to kB). The code I used is as follows:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")
x<-cbind(rnorm(1:100000000),rnorm(1:100000000),1:100000000)
system.time(write.csv2(x,"test.csv",row.names=FALSE))
system.time(x <- read.csv2.ffdf(file="test.csv", header=TRUE, first.rows=100000, next.rows=100000000,levels=NULL))
print(object.size(x)/1024/1024)
print(class(x))
The actual file size is 4.5GB, and the actual RAM used varies like this (per Task Manager): 2.92GB -> upper limit (~8GB) -> 5.25GB.
The object size (by object.size()) is about 12 kB.
My concern is the extra RAM allocation (~2.3GB). According to the package description it should have increased by only about 12 kB. I don't use any character columns.
Maybe I have missed something about the ff package.
Well, I have found a solution that eliminates the extra RAM usage.
First of all, pay attention to the 'first.rows' and 'next.rows' arguments of 'read.table.ffdf' in the ff package.
The first argument ('first.rows') stipulates the initial chunk size in rows and thus the initial memory allocation. I used the default value (1000 rows).
The extra memory allocation is governed by the second argument ('next.rows'). If you want an ffdf object without extra RAM allocations (in my case, gigabytes), you need to pick a row count for the next chunk such that the chunk size does not exceed getOption("ffbatchbytes").
In my case I used 'first.rows=1000' and 'next.rows=1000', and the total RAM allocation varied by no more than ~1MB in Task Manager.
Increasing 'next.rows' to 10000 caused RAM growth of 8-9MB.
So these arguments are worth experimenting with to find the best proportion.
Besides, keep in mind that increasing 'next.rows' affects the time needed to build the ffdf object (measured over several runs):
'first.rows=1000' and 'next.rows=1000': around 1500 sec. (RAM ~ 1MB)
'first.rows=1000' and 'next.rows=10000': around 230 sec. (RAM ~ 9MB)

Sampling / Quantization / PCM networks

Suppose an analog audio signal is sampled 16,000 times per second, and each sample is quantized into one of 1024 levels. What would be the resulting bit rate of the PCM digital audio signal?
So that's a question from the Top-Down Approach book; I answered it but just want to make sure it is correct.
My answer is:
1024 = 2^10
so PCM bit rate = 10 * 16000 = 160,000 bps
Is that correct?
Software often makes a trade-off between time and space. Your answer is correct; however, software typically reads/writes data in storage units of bytes (8 bits). Since your answer calls for 10 bits, your code would use two bytes (16 bits) per sample. So the file consumption rate would be 16 * 16000 == 256000 bits per second (32000 bytes per second). This is mono, so stereo would double it. Writing software to actually store 10 bits per sample instead of 16 would shift this time/space trade-off in the direction of increased computational time (and code complexity) to save storage space.
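To make the arithmetic concrete, here is a small sketch that computes both figures (the byte-aligned rate assumes the 2-byte-per-sample storage described above):

#include <cmath>
#include <cstdio>

int main() {
    const int levels = 1024, sampleRate = 16000;

    // Bits per sample: the smallest b with 2^b >= levels.
    const int bits = (int)std::ceil(std::log2((double)levels));        // 10

    std::printf("raw PCM bit rate:  %d bps\n", bits * sampleRate);     // 160000

    // Storage usually rounds each sample up to whole bytes (10 -> 16 bits).
    const int storedBits = ((bits + 7) / 8) * 8;                       // 16
    std::printf("byte-aligned rate: %d bps\n", storedBits * sampleRate); // 256000
}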

Code profiling to improve performance: see CPU cycles inside mscorlib.dll?

I made a small test benchmark comparing .NET's System.Security.Cryptography AES implementation vs BouncyCastle.Org's AES.
Link to GitHub code: https://github.com/sidshetye/BouncyBench
I'm particularly interested in AES-GCM since it's a 'better' crypto algorithm and .NET is missing it. What I noticed was that while the AES implementations are very comparable between .NET and BouncyCastle, the GCM performance is quite poor (see extra background below for more). I suspect it's due to many buffer copies or something. To look deeper, I tried profiling the code (VS2012 => Analyze menu option => Launch Performance Wizard) and noticed that there was a LOT of CPU burn inside mscorlib.dll.
Question: How can I figure out what's eating most of the CPU in such a case? Right now all I know is "some lines/calls in Init() burn 47% of CPU inside mscorlib.ni.dll" - but without knowing what specific lines, I don't know where to (try and) optimize. Any clues?
Extra background:
Based on the "The Galois/Counter Mode of Operation (GCM)" paper by David A. McGrew, I read "Multiplication in a binary field can use a variety of time-memory tradeoffs. It can be implemented with no key-dependent memory, in which case it will generally run several times slower than AES. Implementations that are willing to sacrifice modest amounts of memory can easily realize speeds greater than that of AES."
If you look at the results, the basic AES-CBC engine performances are very comparable. AES-GCM adds GCM on top and reuses the AES engine beneath it in CTR mode (faster than CBC). However, GCM also adds multiplication in the GF(2^128) field on top of the CTR mode, so that could be another source of slowdown. Anyway, that's why I tried profiling the code.
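For a sense of where that multiplication cost comes from, here is the no-table baseline the paper alludes to, following Algorithm 1 of NIST SP 800-38D (a sketch for illustration, not BouncyCastle's actual implementation): 128 shift-and-XOR steps per 16-byte block.

#include <cstdint>

// One GF(2^128) multiplication as used by GHASH. x.hi holds the first
// 64 bits of the block and x.lo the last 64; GCM numbers bits from the
// MSB of hi. No key-dependent tables: compact but slow.
struct Block { uint64_t hi, lo; };

Block gf128_mul(Block x, Block y) {
    Block z = {0, 0}, v = y;
    for (int i = 0; i < 128; ++i) {
        // Test bit i of x (bit 0 = MSB of x.hi, per GCM's bit ordering).
        uint64_t bit = (i < 64) ? (x.hi >> (63 - i)) & 1
                                : (x.lo >> (127 - i)) & 1;
        if (bit) { z.hi ^= v.hi; z.lo ^= v.lo; }
        // V = V >> 1, reduced by R = 0xE1 || 0^120 when a bit falls off.
        uint64_t carry = v.lo & 1;
        v.lo = (v.lo >> 1) | (v.hi << 63);
        v.hi >>= 1;
        if (carry) v.hi ^= 0xE100000000000000ull;
    }
    return z;
}

Table-driven variants precompute multiples of the hash key H to trade memory for speed, which is exactly the trade-off McGrew describes.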
For the interested, here is my quick performance benchmark. It runs inside a Windows 8 VM and YMMV. The test is configurable, but currently it simulates the crypto overhead of encrypting many cells of a database (=> many but small plaintext inputs):
Creating initial random bytes ...
Benchmark test is : Encrypt=>Decrypt 10 bytes 100 times
Name time (ms) plain(bytes) encrypted(bytes) byte overhead
.NET ciphers
AES128 1.5969 10 32 220 %
AES256 1.4131 10 32 220 %
AES128-HMACSHA256 2.5834 10 64 540 %
AES256-HMACSHA256 2.6029 10 64 540 %
BouncyCastle Ciphers
AES128/CBC 1.3691 10 32 220 %
AES256/CBC 1.5798 10 32 220 %
AES128-GCM 26.5225 10 42 320 %
AES256-GCM 26.3741 10 42 320 %
R - Rerun tests
C - Change size(10) and iterations(100)
Q - Quit
This is a rather lame move from Microsoft: they broke a feature that worked well before Windows 8, as explained in this MSDN blog post:
On Windows 8 the profiler uses a different underlying technology than what it does on previous versions of Windows, which is why the behavior is different on Windows 8. With the new technology, the profiler needs the symbol file (PDB) to know what function is currently executing inside NGEN’d images.
(...)
It is however on our backlog to implement in the next version of Visual Studio.
The post gives directions to generate the PDB files yourself (thanks!).
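For reference, the tool in question is .NET 4.5's ngen, which can emit a PDB for a native image; the path below is illustrative, as the NativeImages folder name varies per machine:

ngen createPdb "C:\Windows\assembly\NativeImages_v4.0.30319_64\mscorlib\<hash>\mscorlib.ni.dll" C:\Symbols

With the profiler's symbol path pointed at that directory, samples inside mscorlib.ni.dll should resolve to actual function names.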

Work-items, Work-groups and Command Queues organization and memory limit in OpenCL

Okay, I have already been through most of the ATI and Nvidia guides to OpenCL; there is some stuff I just want to be sure of, and some that needs clarification. Nothing in the documentation gives a clear-cut answer.
I have a Radeon 4650, and on querying my device I got:
CL_DEVICE_MAX_COMPUTE_UNITS: 8
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 128 / 128 / 128
CL_DEVICE_MAX_WORK_GROUP_SIZE: 128
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 256 MByte
OK, first: my card has 1GB of memory, so why am I allowed only 256MB?
Second, I don't understand the work-item dimension part: does that mean I can have up to 128*3 or 128^3 work-items?
When I calculated this before running the query, I got 8 cores * 16 stream processors * 4 work-items = 512. Why is this wrong?
Also, I got the same 3-dimension work-item values for my Intel Core 2 Duo CPU; do the same calculations apply?
As for the command queues: when I tried accessing my Core 2 Duo CPU as a device using OpenCL, everything got processed on one core only. I tried creating multiple queues and enqueueing several entries, but it still got processed on one core only. I used a global_work_size of 128*128*128*8 for a simple write program where each work-item writes its own global id to the buffer, and I got only zeros.
And what about Nvidia cards? On an Nvidia 9500 GT with 32 CUDA cores, are the work-items calculated similarly?
Thanks a lot, I've been really all over the place trying to find answers.
OK, first: my card has 1GB of memory, so why am I allowed only 256MB?
This is an ATI driver bug/limitation AFAIK. I'll check on my 5850 if I can repro.
http://devforums.amd.com/devforum/messageview.cfm?catid=390&threadid=124142&messid=1069111&parentid=0&FTVAR_FORUMVIEWTMP=Branch
Second, I don't understand the work-item dimension part: does that mean I can have up to 128*3 or 128^3 work-items?
No. That means you can have at most 128 in each dimension, since CL_DEVICE_MAX_WORK_ITEM_SIZES is 128 / 128 / 128. And since CL_DEVICE_MAX_WORK_GROUP_SIZE is 128, you can have, e.g., work_group_size(128, 1, 1), work_group_size(1, 128, 1), work_group_size(64, 1, 2), or work_group_size(8, 4, 4), etc.; as long as the product of all dimensions is <= 128, it will be fine.
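In other words, the per-dimension limits and the total limit are separate constraints. Here is a tiny sketch of the check, with the numbers hardcoded from the query output above:

#include <cstddef>
#include <cstdio>

// Valid local work-group sizes for a device reporting
// CL_DEVICE_MAX_WORK_ITEM_SIZES = 128/128/128 and CL_DEVICE_MAX_WORK_GROUP_SIZE = 128.
bool validLocalSize(std::size_t x, std::size_t y, std::size_t z) {
    const std::size_t maxPerDim = 128, maxTotal = 128;
    return x <= maxPerDim && y <= maxPerDim && z <= maxPerDim
        && x * y * z <= maxTotal;
}

int main() {
    std::printf("%d\n", validLocalSize(128, 1, 1)); // 1: fits exactly
    std::printf("%d\n", validLocalSize(8, 4, 4));   // 1: 8*4*4 = 128
    std::printf("%d\n", validLocalSize(64, 2, 2));  // 0: 64*2*2 = 256 > 128
}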
When I calculated this before running the query, I got 8 cores * 16 stream processors * 4 work-items = 512. Why is this wrong?
Also, I got the same 3-dimension work-item values for my Intel Core 2 Duo CPU; do the same calculations apply?
I don't understand what you are trying to compute here.
