How to understand havex malware encoder with some code? - encryption

I'm analyzing Havex, a piece of malware. It generates a DLL file, which I'm reversing with IDA. I found this code and can't understand it:
while ( *(_DWORD *)(a1 + 32) < 8 )
{
v300 = *(_DWORD **)a1;
if ( !*(_DWORD *)(*(_DWORD *)a1 + 4) )
goto LABEL_367;
v301 = *(unsigned __int8 *)*v300;
v302 = *(_DWORD *)(a1 + 28);
*(_DWORD *)(a1 + 32) += 8;
*(_DWORD *)(a1 + 28) = v301 | (v302 << 8);
++*v300;
--*(_DWORD *)(*(_DWORD *)a1 + 4);
if ( !++*(_DWORD *)(*(_DWORD *)a1 + 8) )
++*(_DWORD *)(*(_DWORD *)a1 + 12);
}
v303 = *(_DWORD *)(a1 + 32);
v304 = *(_DWORD *)(a1 + 28) >> (v303 - 8);
*(_DWORD *)(a1 + 32) = v303 - 8;
*(_DWORD *)(a1 + 3164) = (unsigned __int8)v304 | (*(_DWORD *)(a1 + 3164) << 8);
*(_DWORD *)(a1 + 4) = 1;
v305 = 4;
goto LABEL_34;
I think it's some kind of check code, but I'm not sure. Is it decryption or some other kind of check?

Look, this is simply not enough data to properly reverse this. You should do some work on your code first:
Change a1 to a structure (or an array, but that's unlikely)
Change the types of variables to remove the excessive typecasting
Analyse the code's behaviour dynamically, not only statically
That said, this is probably not decryption. Encryption/decryption is built mostly on XOR, and the only data-changing operations here are addition, bit shifts and bitwise OR. If it is decryption, then that's one weird encryption scheme.
I have a feeling that this might be some kind of hashing operation (you can certainly implement one with OR and bit shifts), but it's too much work to analyze Interactive DisAssembler code by looking at non-interactive text ;)
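For what it's worth, the loop also reads like an MSB-first bit-buffer refill, the kind of routine found in decompressors and bitstream parsers rather than ciphers: the field at a1+28 behaves like an accumulator and a1+32 like a bit count. Below is a minimal C sketch of the same logic; the struct layout and all names are invented here for illustration, not recovered from the sample.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stream state mirroring what the decompiler shows:
   a1+28 looks like a bit accumulator, a1+32 like a bit count. */
typedef struct {
    const uint8_t *src;   /* input byte cursor (the *v300 pointer)   */
    size_t avail;         /* remaining input bytes                   */
    uint32_t bitbuf;      /* accumulator, the field at a1+28         */
    uint32_t bitcount;    /* buffered bit count, the field at a1+32  */
} bitstream;

/* MSB-first refill-and-extract, equivalent to the decompiled loop:
   top up the accumulator one byte at a time until at least n bits
   are buffered, then peel the oldest n bits off the top. */
static int get_bits(bitstream *bs, unsigned n, uint32_t *out)
{
    while (bs->bitcount < n) {
        if (bs->avail == 0)
            return -1;                       /* the LABEL_367 exit path */
        bs->bitbuf = (bs->bitbuf << 8) | *bs->src++;
        bs->avail--;
        bs->bitcount += 8;
    }
    bs->bitcount -= n;
    *out = (bs->bitbuf >> bs->bitcount) & ((1u << n) - 1u);
    return 0;
}
```

In the decompiled snippet the extracted byte is then folded into another shift register at a1+3164, which is consistent with a decompressor or parser accumulating a multi-byte field, not with a cipher round.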

Related

What does MPI_File_open do?

I can't understand what MPI_File_open and the related MPI_File_seek and MPI_File_read do. Is there someone who can help me?
With MPI_File_open, is the file read simultaneously by all processes when I run the program? For example, if I specify mpirun -n 4, is the file read by all four processes?
This is the code:
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, image, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
for (i = 1 ; i <= rows ; i++) {
MPI_File_seek(fh, 3*(start_row + i-1) * width + 3*start_col, MPI_SEEK_SET);
tmpbuf = offset(src, i, 3, cols*3+6);
MPI_File_read(fh, tmpbuf, cols*3, MPI_BYTE, &status);
}
MPI_File_close(&fh);
Is there a way I can turn this into OpenMP or optimize it? I tried to modify the code like this:
#pragma omp parallel for num_threads(2)
for (i = 1 ; i <= rows ; i++) {
MPI_File_seek(fh, 3*(start_row + i-1) * width + 3*start_col, MPI_SEEK_SET);
tmpbuf = offset(src, i, 3, cols*3+6);
MPI_File_read(fh, tmpbuf, cols*3, MPI_BYTE, &status);
}
Running the program with this modification, I don't get any speedup; it runs in the same time as the MPI-only code. What am I doing wrong?
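One thing worth checking independently of the MPI runtime is the offset arithmetic itself. The sketch below replays the seek computation in plain C, using the names from the question and assuming a row-major image with 3 bytes per pixel (which is what the 3* factors suggest), so it can be unit-tested without MPI installed.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of the first wanted pixel of iteration i, matching
   the expression passed to MPI_File_seek in the question:
   3*(start_row + i-1)*width + 3*start_col, assuming 3 bytes/pixel
   and row-major layout. */
static size_t row_offset(int start_row, int i, int start_col, int width)
{
    return (size_t)3 * (start_row + i - 1) * width
         + (size_t)3 * start_col;
}
```

Separately, note that MPI_File_seek followed by MPI_File_read updates a single file pointer shared by all threads touching fh, so running that pair under #pragma omp parallel for is a race. MPI_File_read_at takes the offset explicitly, e.g. MPI_File_read_at(fh, (MPI_Offset)row_offset(start_row, i, start_col, width), tmpbuf, cols*3, MPI_BYTE, &status), and avoids the shared file pointer entirely; that is one commonly suggested fix, though it alone won't guarantee a speedup if the reads are I/O-bound.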

Writing to Global Memory Causing Crash in OpenCL in For Loop

One of my OpenCL helper functions writes to global memory. In one place the write runs just fine and the kernel executes normally, but when it runs directly after the following line, it freezes/crashes the kernel and my program can't function.
The values in this function vary (different values across an NDRange of 2^16), so the loop bounds vary as well, and not all threads execute the same code because of the conditionals.
Why exactly is this an issue? Am I missing some kind of memory barrier or something?
void add_world_seeds(yada yada yada...., const uint global_id, __global long* world_seeds)
for (; indexer < (1 << 16); indexer += increment) {
long k = (indexer << 16) + c;
long target2 = (k ^ e) >> 16;
long second_addend = get_partial_addend(k, x, z) & MASK_16;
if (ctz(target2 - second_addend) < mult_trailing_zeroes) { continue; }
long a = (((first_mult_inv * (target2 - second_addend)) >> mult_trailing_zeroes) ^ (J1_MUL >> 32)) & mask;
for (; a < (1 << 16); a += increment) {
world_seeds[global_id] = (a << 32) + k; //WORKS HERE
if (get_population_seed((a << 32) + k, x, z) != population_seed_state) { continue; }
world_seeds[global_id] = (a << 32) + k; //DOES NOT WORK HERE
}
}
There was in fact a bug causing the undefined behavior in the code: the main reversal kernel took an argument called "increment", and in that same kernel I declared another variable called increment. It compiled fine but led to completely wrong results and memory crashes.
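The shadowing bug described above can be reproduced in a few lines of plain C. This contrived sketch (the function and values are made up for illustration) shows why it compiles cleanly yet silently ignores the argument:

```c
#include <assert.h>

/* Buggy version: the inner declaration shadows the parameter,
   so the loop always steps by 1 regardless of what was passed. */
static int count_steps_buggy(int limit, int increment)
{
    int steps = 0;
    for (int i = 0; i < limit; ) {
        int increment = 1;   /* shadows the parameter! compiles fine */
        i += increment;
        steps++;
    }
    return steps;
}

/* Fixed version: only the parameter named increment exists. */
static int count_steps_fixed(int limit, int increment)
{
    int steps = 0;
    for (int i = 0; i < limit; i += increment)
        steps++;
    return steps;
}
```

Compiling with -Wshadow (GCC/Clang) flags exactly this pattern, which is worth enabling for OpenCL host code and, where the compiler supports it, for kernels too.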

Two pointers pointing to the same memory and realloc failing

void container_row_change(struct brick_win_size *win, int character)
{
row_container *container = &(win->container[win->current_row]);
/*int offset = win->current_column;
char *data = win->container[win->current_row]. data;
if(offset < 0 || offset >= win->col)
offset = container->size;
data = realloc(data, win->container[win->current_row].size + 2 );
memmove(&data[offset + 1], &data[offset], (win->container[win->current_row].size - offset + 1));
data[offset] = character;
win->current_column++;
container->size ++;
*/
int offset = win->current_column;
if(offset < 0 || offset >= win->col)
offset = container->size;
win->container[win->current_row].data = realloc(win->container[win->current_row]. data, win->container[win->current_row]. size + 2);
memmove(&(win->container[win->current_row].data[offset + 1]), &(win->container[win->current_row].data[offset]), win->container[win->current_row].size - offset + 1);
win->container[win->current_row].data[offset] =character;
win->current_column++;
win->container[win->current_row].size ++;
}
Can anybody tell me why the commented-out code fails but the other version doesn't, although both are the same?
I'm wondering whether there are any errors in the way I assign the pointers and reallocate.
In the commented-out code, you realloc the local pointer data and never update the .data pointer field in the data structure, so it continues to point at the old (now freed) memory, leading to corruption when you try to use it.
Add the line win->container[win->current_row].data = data; to the commented-out code and it will actually be equivalent to the later code.
Note that in either case, if realloc fails, you'll crash -- you should be checking for failure and doing something appropriate (probably printing an error message and exiting gracefully).
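Both points from the answer (write the new pointer back, and check for failure before using it) can be combined into one idiomatic pattern. A minimal sketch, using a simplified stand-in for one row of the container rather than the full struct brick_win_size from the question:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for one row of the window container. */
typedef struct {
    char *data;
    int size;
} row;

/* Insert one character at `offset`, growing the row's buffer.
   The realloc result goes into a temporary first, so on failure
   the old buffer is still valid and owned by the struct. */
static int row_insert(row *r, int offset, int character)
{
    char *grown = realloc(r->data, r->size + 2);  /* temp, not r->data */
    if (grown == NULL)
        return -1;          /* r->data untouched; caller handles the error */
    r->data = grown;        /* the store the commented-out code was missing */
    memmove(&grown[offset + 1], &grown[offset], r->size - offset + 1);
    grown[offset] = (char)character;
    r->size++;
    return 0;
}
```

Assigning realloc's result directly to the only pointer you hold (as in data = realloc(data, ...)) leaks the old block on failure; the temporary avoids both that leak and the stale-pointer corruption described above.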

OpenCL CL_OUT_OF_RESOURCES error

I'm trying to convert code written in CUDA to OpenCL and have run into some trouble. My final goal is to implement the code on an Odroid XU3 board with a Mali T628 GPU.
In order to simplify the transition and save time debugging OpenCL kernels, I've taken the following steps:
Implement the code in CUDA and test it on an Nvidia GeForce 760
Implement the code in OpenCL and test it on an Nvidia GeForce 760
Test the OpenCL code on an Odroid XU3 board with a Mali T628 GPU
I know that different architectures may have different optimizations, but that isn't my main concern for now. I managed to run the OpenCL code on my Nvidia GPU with no apparent issues, but I keep getting strange errors when trying to run it on the Odroid board. I know that different architectures handle exceptions etc. differently, but I'm not sure how to solve that.
Since the OpenCL code works on my Nvidia GPU, I assume I managed to do the correct translation between threads/blocks -> work-items/work-groups etc.
I have already fixed several issues related to CL_DEVICE_MAX_WORK_GROUP_SIZE, so that can't be the cause.
When running the code I get a CL_OUT_OF_RESOURCES error. I've narrowed the cause down to two lines in the code, but I'm not sure how to fix them.
The error is caused by the following lines:
lowestDist[pixelNum] = partialDiffSumTemp; Both variables are private to the kernel, and therefore I don't see any potential issue.
d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 0] = bestDisparity[0]; Here I guess the cause is an out-of-bounds access, but I'm not sure how to debug it, since the original code doesn't have this issue.
My kernel code is:
#define ALIGN_IMAGE_WIDTH 64
#define NUM_PIXEL_PER_THREAD 4
#define MIN_DISPARITY 0
#define MAX_DISPARITY 55
#define WINDOW_SIZE 19
#define WINDOW_RADIUS (WINDOW_SIZE / 2)
#define TILE_SHARED_MEM_WIDTH 96
#define TILE_SHARED_MEM_HEIGHT 32
#define TILE_BOUNDARY_WIDTH 64
#define TILE_BOUNDARY_HEIGHT (2 * WINDOW_RADIUS)
#define BLOCK_WIDTH (TILE_SHARED_MEM_WIDTH - TILE_BOUNDARY_WIDTH)
#define BLOCK_HEIGHT (TILE_SHARED_MEM_HEIGHT - TILE_BOUNDARY_HEIGHT)
#define THREAD_NUM_WIDTH 8
#define THREADS_NUM_HEIGHT TILE_SHARED_MEM_HEIGHT
//TODO fix input arguments
__kernel void hello_kernel( __global unsigned char* d_leftImage,
__global unsigned char* d_rightImage,
__global float* d_disparityLeft) {
int blockX = get_group_id(0);
int blockY = get_group_id(1);
int threadX = get_local_id(0);
int threadY = get_local_id(1);
__local unsigned char leftImage [TILE_SHARED_MEM_WIDTH * TILE_SHARED_MEM_HEIGHT];
__local unsigned char rightImage [TILE_SHARED_MEM_WIDTH * TILE_SHARED_MEM_HEIGHT];
__local unsigned int partialDiffSum [BLOCK_WIDTH * TILE_SHARED_MEM_HEIGHT];
int alignedImageWidth = 640;
int partialDiffSumTemp;
float bestDisparity[4] = {0,0,0,0};
int lowestDist[4];
lowestDist[0] = 214748364;
lowestDist[1] = 214748364;
lowestDist[2] = 214748364;
lowestDist[3] = 214748364;
// Read image blocks into shared memory. read is done at 32bit integers on a uchar array. each thread reads 3 integers(12byte) 96/12=8threads
int sharedMemIdx = threadY * TILE_SHARED_MEM_WIDTH + 4 * threadX;
int globalMemIdx = (blockY * BLOCK_HEIGHT + threadY) * alignedImageWidth + blockX * BLOCK_WIDTH + 4 * threadX;
for (int i = 0; i < 4; i++) {
leftImage [sharedMemIdx + i ] = d_leftImage [globalMemIdx + i];
leftImage [sharedMemIdx + 4 * THREAD_NUM_WIDTH + i ] = d_leftImage [globalMemIdx + 4 * THREAD_NUM_WIDTH + i];
leftImage [sharedMemIdx + 8 * THREAD_NUM_WIDTH + i ] = d_leftImage [globalMemIdx + 8 * THREAD_NUM_WIDTH + i];
rightImage[sharedMemIdx + i ] = d_rightImage[globalMemIdx + i];
rightImage[sharedMemIdx + 4 * THREAD_NUM_WIDTH + i ] = d_rightImage[globalMemIdx + 4 * THREAD_NUM_WIDTH + i];
rightImage[sharedMemIdx + 8 * THREAD_NUM_WIDTH + i ] = d_rightImage[globalMemIdx + 8 * THREAD_NUM_WIDTH + i];
}
barrier(CLK_LOCAL_MEM_FENCE);
int imageIdx = sharedMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS;
int partialSumIdx = threadY * BLOCK_WIDTH + 4 * threadX;
for(int dispLevel = MIN_DISPARITY; dispLevel <= MAX_DISPARITY; dispLevel++) {
// horizontal partial sum
partialDiffSumTemp = 0;
#pragma unroll
for(int i = imageIdx - WINDOW_RADIUS; i <= imageIdx + WINDOW_RADIUS; i++) {
//partialDiffSumTemp += calcDiff(leftImage [i], rightImage[i - dispLevel]);
partialDiffSumTemp += abs(leftImage[i] - rightImage[i - dispLevel]);
}
partialDiffSum[partialSumIdx] = partialDiffSumTemp;
barrier(CLK_LOCAL_MEM_FENCE);
for (int pixelNum = 1, i = imageIdx - WINDOW_RADIUS; pixelNum < NUM_PIXEL_PER_THREAD; pixelNum++, i++) {
partialDiffSum[partialSumIdx + pixelNum] = partialDiffSum[partialSumIdx + pixelNum - 1] +
abs(leftImage[i + WINDOW_SIZE] - rightImage[i - dispLevel + WINDOW_SIZE]) -
abs(leftImage[i] - rightImage[i - dispLevel]);
}
barrier(CLK_LOCAL_MEM_FENCE);
// vertical sum
if(threadY >= WINDOW_RADIUS && threadY < TILE_SHARED_MEM_HEIGHT - WINDOW_RADIUS) {
for (int pixelNum = 0; pixelNum < NUM_PIXEL_PER_THREAD; pixelNum++) {
int rowIdx = partialSumIdx - WINDOW_RADIUS * BLOCK_WIDTH;
partialDiffSumTemp = 0;
for(int i = -WINDOW_RADIUS; i <= WINDOW_RADIUS; i++,rowIdx += BLOCK_WIDTH) {
partialDiffSumTemp += partialDiffSum[rowIdx + pixelNum];
}
if (partialDiffSumTemp < lowestDist[pixelNum]) {
lowestDist[pixelNum] = partialDiffSumTemp;
bestDisparity[pixelNum] = dispLevel - 1;
}
}
}
}
if (threadY >= WINDOW_RADIUS && threadY < TILE_SHARED_MEM_HEIGHT - WINDOW_RADIUS && blockY < 32) {
d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 0] = bestDisparity[0];
d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 1] = bestDisparity[1];
d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 2] = bestDisparity[2];
d_disparityLeft[globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + 3] = bestDisparity[3];
}
}
Thanks for all the help
Yuval
In my experience, Nvidia GPUs don't always crash on out-of-bounds accesses, and the kernel often still returns the expected results.
Use printf to check the indexes. If you have an Nvidia OpenCL 1.2 driver installed, printf should be available as a core function. As far as I know, the Mali-T628 supports OpenCL 1.1, so check whether printf is available there as a vendor extension. You can also run your kernel on an AMD/Intel CPU, where printf is available (OpenCL 1.2 / 2.0).
An alternative way of checking indexes is to pass a __global int* debug array, store the indexes in it from the kernel, and then check them on the host. Make sure to allocate it big enough that an out-of-bounds index will be recorded.
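Since the suspicion is an out-of-bounds store, a third low-tech option is to replay the final write's index arithmetic on the host, where an ordinary debugger or assertion works. The sketch below copies the relevant #defines straight from the kernel (alignedImageWidth is hard-coded to 640 there); the launch geometry and the allocated size of d_disparityLeft are assumptions you'd substitute from your own host code.

```c
#include <assert.h>

/* Constants copied verbatim from the kernel above. */
#define WINDOW_SIZE 19
#define WINDOW_RADIUS (WINDOW_SIZE / 2)
#define TILE_SHARED_MEM_WIDTH 96
#define TILE_SHARED_MEM_HEIGHT 32
#define TILE_BOUNDARY_WIDTH 64
#define TILE_BOUNDARY_HEIGHT (2 * WINDOW_RADIUS)
#define BLOCK_WIDTH (TILE_SHARED_MEM_WIDTH - TILE_BOUNDARY_WIDTH)
#define BLOCK_HEIGHT (TILE_SHARED_MEM_HEIGHT - TILE_BOUNDARY_HEIGHT)
#define THREAD_NUM_WIDTH 8
#define ALIGNED_IMAGE_WIDTH 640   /* alignedImageWidth in the kernel */

/* Flat index produced by the kernel's final d_disparityLeft writes
   for a given work-item and pixel number 0..3. */
static int disparity_index(int blockX, int blockY,
                           int threadX, int threadY, int pixel)
{
    int globalMemIdx = (blockY * BLOCK_HEIGHT + threadY) * ALIGNED_IMAGE_WIDTH
                     + blockX * BLOCK_WIDTH + 4 * threadX;
    return globalMemIdx + TILE_BOUNDARY_WIDTH - WINDOW_RADIUS + pixel;
}
```

Looping this over your actual grid and asserting index < bufferSize pinpoints the offending work-item. Note that TILE_BOUNDARY_WIDTH - WINDOW_RADIUS adds 55 to every write: for the rightmost work-group of a 640-wide image (blockX = 19), the column component reaches 19*32 + 4*7 + 55 + 3 = 694, past a 640-element row, so unless d_disparityLeft is padded accordingly those writes land outside the intended row, which fits the out-of-bounds guess.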

IP camera H.264 streaming using UDP multicast with RTSP

I have video conference software that works with H.264 multicast streams; now I need to make it work with an IP camera that provides an RTSP control interface.
I requested the H.264 stream and it's coming via UDP multicast; the packets are in RTP format.
So, according to some research, I need to remove the RTP header from the UDP payload to get the data I want, and I need to reconstruct the I-frames because they may be fragmented.
I'm using Qt and the QUdpSocket class:
QByteArray IDR;
while(socket->hasPendingDatagrams())
{
int pendingDataSize = socket->pendingDatagramSize();
char * data = (char *) malloc(pendingDataSize);
socket->readDatagram(data, pendingDataSize);
int fragment_type = data[12] & 0x1F;
int nal_type = data[13] & 0x1F;
int start_bit = data[13] & 0x80;
int end_bit = data[13] & 0x40;
//If it is an I Frame
if (((fragment_type == 28) || (fragment_type == 29)) && (nal_type == 5))
{
if(start_bit == 128 && end_bit == 64)
{
char nalByte = (data[12] & 0xE0) | (data[13] & 0x1F);
data[13] = nalByte;
char * dataWithoutHeader = data + 13;
uint8_t* datagramToQueue = (uint8_t*) queue_malloc(pendingDataSize - 13);
memcpy(datagramToQueue, dataWithoutHeader, pendingDataSize - 13);
f << "\nI Begin + I End\n";
}
if(start_bit == 128)
{
f << "\nI Begin\n";
char nalByte = (data[12] & 0xE0) | (data[13] & 0x1F);
data[13] = nalByte;
char * dataWithoutHeader = data + 13;
IDR.append(dataWithoutHeader, pendingDataSize - 13);
}
if(end_bit == 64)
{
f << "\nI End\n";
char* dataWithoutHeader = data + 13;
IDR.append(dataWithoutHeader, pendingDataSize - 13);
datagramToQueue = (uint8_t*) queue_malloc(IDR.size());
memcpy(datagramToQueue, IDR.data(), IDR.size());
queue_enqueue(this->encodedQueue, datagramToQueue, IDR.size(),0, NULL);
IDR.clear();
}
if(start_bit != 128 && end_bit != 64)
{
char* dataWithoutHeader = data + 13;
IDR.append(dataWithoutHeader, pendingDataSize - 13);
}
f << "\nI\n";
continue;
}
f << "\nP\n";
uint8_t* datagramToQueue = (uint8_t*) queue_malloc(pendingDataSize - 12);
char* datacharMid = data + 13;
memcpy(datagramToQueue, datacharMid, pendingDataSize - 12);
queue_enqueue(this->encodedQueue, datagramToQueue, pendingDataSize - 12,0, NULL);
}
To decode the stream, my software has two implementations: the first uses FFmpeg and the second the Intel Media SDK.
Both identify the parameters of the video. FFmpeg shows as a result a little strip of video where I can make out some of what is in front of the camera, while the rest of the image is a solid mass with the colors of my scene. The Intel Media SDK produces a pink screen with some gray lines moving around.
So, can someone tell me if there is a mistake in my packet parser? The order in which fragment_type, start_bit and end_bit arrive just doesn't make much sense to me.
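For comparison, here is a small sketch of the FU-A field extraction as RFC 6184 defines it, assuming (as the code above does) a fixed 12-byte RTP header with no CSRC list or extension; the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RTP_HEADER_LEN 12   /* assumes no CSRC entries and no extension */

/* Fields of an RTP/H.264 FU-A packet (RFC 6184, section 5.8). */
typedef struct {
    int fu_type;        /* FU indicator low 5 bits: 28 = FU-A, 29 = FU-B */
    int nal_type;       /* FU header low 5 bits: 5 = IDR slice           */
    int start;          /* FU header bit 7: first fragment of the NALU   */
    int end;            /* FU header bit 6: last fragment of the NALU    */
    uint8_t nal_header; /* reconstructed NAL header for the first fragment */
} fua_info;

static fua_info parse_fua(const uint8_t *pkt)
{
    uint8_t indicator = pkt[RTP_HEADER_LEN];      /* byte 12: FU indicator */
    uint8_t fu_header = pkt[RTP_HEADER_LEN + 1];  /* byte 13: FU header    */
    fua_info info;
    info.fu_type  = indicator & 0x1F;
    info.nal_type = fu_header & 0x1F;
    info.start    = (fu_header & 0x80) != 0;
    info.end      = (fu_header & 0x40) != 0;
    /* NAL header = F/NRI bits from the indicator + type from the FU header */
    info.nal_header = (uint8_t)((indicator & 0xE0) | (fu_header & 0x1F));
    return info;
}
```

Two observations against the parser above, offered tentatively: decoders fed Annex B streams (FFmpeg included) generally expect a 00 00 00 01 start code before each reassembled NAL unit, which the code never prepends, and that alone can produce the mostly-garbled picture described. Also, a complete NALU is normally sent as a single NAL unit packet without any FU header, so the combined start_bit && end_bit branch should rarely match real traffic, and the non-fragmented "P" path that copies pendingDataSize - 12 bytes starting at data + 13 reads one byte past the datagram.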
