Frama-C slice: choosing an entry to get pragma ctrl

I'm having a problem getting a CTRL slice.
I'm trying to analyze OpenSSL. The relevant code is below:
int dtls1_process_heartbeat(SSL *s)
{
unsigned char *p = &s->s3->rrec.data[0], *pl;
unsigned short hbtype;
unsigned int payload;
unsigned int padding = 16; /* Use minimum padding */
/* Read type and payload length first */
hbtype = *p++;
n2s(p, payload);
pl = p;
if (s->msg_callback)
s->msg_callback(0, s->version, TLS1_RT_HEARTBEAT,
&s->s3->rrec.data[0], s->s3->rrec.length,
s, s->msg_callback_arg);
if (hbtype == TLS1_HB_REQUEST)
{
unsigned char *buffer, *bp;
int r;
/* Allocate memory for the response, size is 1 byte
* message type, plus 2 bytes payload length, plus
* payload, plus padding
*/
buffer = OPENSSL_malloc(1 + 2 + payload + padding);
bp = buffer;
/* Enter response type, length and copy payload */
*bp++ = TLS1_HB_RESPONSE;
s2n(payload, bp);
/*# slice pragma stmt; */
memcpy(bp, pl, payload);
bp += payload;
/* Random padding */
RAND_pseudo_bytes(bp, padding);
r = dtls1_write_bytes(s, TLS1_RT_HEARTBEAT, buffer, 3 + payload + padding);
if (r >= 0 && s->msg_callback)
s->msg_callback(1, s->version, TLS1_RT_HEARTBEAT,
buffer, 3 + payload + padding,
s, s->msg_callback_arg);
OPENSSL_free(buffer);
if (r < 0)
return r;
}
else if (hbtype == TLS1_HB_RESPONSE)
{
unsigned int seq;
/* We only send sequence numbers (2 bytes unsigned int),
* and 16 random bytes, so we just try to read the
* sequence number */
n2s(pl, seq);
if (payload == 18 && seq == s->tlsext_hb_seq)
{
dtls1_stop_timer(s);
s->tlsext_hb_seq++;
s->tlsext_hb_pending = 0;
}
}
return 0;
}
I ran this command:
frama-c ./ssl/d1_both.c -main dtls1_process_heartbeat -slice-calls memcpy -cpp-command "gcc -C -E -I ./include/ -I ./" -then-on 'Slicing export' -print
That produced nothing, so I then tried this, hoping to get a backward slice:
frama-c ./ssl/d1_both.c -main dtls1_process_heartbeat -slice-pragma dtls1_process_heartbeat -cpp-command "gcc -C -E -I ./include/ -I ./" -then-on 'Slicing export' -print
But I still get nothing useful, only this:
void dtls1_process_heartbeat(void);
void dtls1_process_heartbeat(void)
{
return;
}
How can I get a slice like this?
function A (){
…
memcpy()
...
}
function B (){
…
…
...
}
function C (){
…
memcpy()
...
}
I want to capture everything to do with memcpy(), so I want to keep A and C, but not B.
How should I choose an entry point? How do I choose the pragma?
I hope I've stated my question clearly; it's had me confused for days.

First, note that Frama-C Fluorine is an obsolete version. It was released more than 3 years ago, and some slicing-related bugs have been fixed in the meantime. Please upgrade to a newer version, preferably Aluminium.
Second, the documentation for option -slicing-value is
select the result of left-values v1,...,vn at the
end of the function given as entry point (addresses are
evaluated at the beginning of the function given as entry
point)
It is unlikely to do what you want. Did you try option -slice-calls, more precisely -slice-calls memcpy?
Also, keep in mind that B will be kept in the slice if it computes a value that is later used within a call to memcpy.
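To make the last point concrete, here is a minimal, hypothetical C sketch (the functions compute_len and copy_payload are invented for illustration and are not from OpenSSL). compute_len plays the role of B: it never calls memcpy itself, but its result flows into the memcpy call in copy_payload, so a slice on calls to memcpy would be expected to keep it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Plays the role of "B": no call to memcpy here, but the returned length
 * is later used as the size argument of a memcpy call. */
size_t compute_len(const unsigned char *msg, size_t msg_size)
{
    if (msg_size == 0)
        return 0;
    /* Hypothetical wire format: first byte = claimed payload length. */
    size_t len = msg[0];
    size_t avail = msg_size - 1;
    return len < avail ? len : avail;   /* never trust the claimed length */
}

/* Plays the role of "A": contains the memcpy call itself. */
size_t copy_payload(unsigned char *dst, size_t dst_size,
                    const unsigned char *msg, size_t msg_size)
{
    assert(dst != NULL && msg != NULL);
    size_t len = compute_len(msg, msg_size);   /* data dependency on "B" */
    if (len > dst_size)
        len = dst_size;
    if (len > 0)
        memcpy(dst, msg + 1, len);
    return len;
}
```

Because the third argument of memcpy depends on compute_len, removing compute_len would change what the call does. How much of it survives in practice depends on the Frama-C version and the precision of the analysis, which this sketch does not check.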


arduino, setup ethernet & network using data from SD config file

I'm trying to add to my sketch a dynamic way to set up the Ethernet info (MAC, IP, gateway, subnet) from a configuration file (config.txt). The sketch runs a web server and serves htm files from the SD card, so the user can go to a settings page, fill in a form with this info, and when the form is posted, the web server parses it and saves (updates) the config.txt file. After that the system restarts in order to start with the new settings (by reading the config.txt file).
I have successfully created all the parts (SD, Ethernet, web server, web client, creating the config file from posted form data) except getting the parameters by reading the config.txt file.
I can read the config line by line, I can split each line into parameter and value, and now I need to fill some byte variables with the data read. After a month of Google searching, I can read IPs (decimal values) into a byte array. I'm stuck on reading the MAC address (hex) into a byte array. The config file contains:
mac=8f:2c:2b:19:e0:b7;
ip=192.168.1.200;
netmask=255.255.255.0;
gateway=192.168.1.254;
dns=8.8.8.8;
posturl=192.168.1.157;
postport=8080;
postscript=/itherm/update.php;
interval=60000;
and the code that I use to read is:
byte myMAC[6];
byte myIP[4];
File fset;
fset = SD.open("config.txt");
if (fset){
char ff[40];
while (fset.available()>1){
bool eol=false;
for (int i=0; !eol;i++){
ff[i]=fset.read();
if (ff[i]=='\n'){
eol=true;
}
}
String par="";
bool DONE=false;
for (int i=0; !DONE;i++){
par+=ff[i];
if (ff[i]== '='){DONE=true;}
}
String pval="";
DONE=false;
//------------------------
if (par=="ip=" ){
int x=0;
while(!DONE){
for(int i=3;i<=i+21;i++){
if(ff[i]=='.'){
myIP[x]=pval.toInt();
x++;
i++;
pval="";
}
else if(ff[i]==';' || i>20){
myIP[x]=pval.toInt();
DONE=true;
break;
}
pval+=ff[i];
}
}
}
} //while (fset.available()>1)
} //if (fset)
I would appreciate any help. Please, no answers that simply use Serial.print(). I have found hundreds of suggestions, but none that work properly to read all the parameters (decimal, hex, strings). After a month of effort and searching, I wonder why something so necessary and useful does not exist in the community as a complete, working example!
Best regards
Okay, so here is a complete set of routines to do what you want. I think you misunderstood the concept of char arrays vs. a single char[0]. The routines are documented and self-explanatory. I recommend not terminating lines with ; but with '\n', which is there in your example anyway (you just cannot see the newline terminator). To get the MAC address I need three lines:
if (strncmp(cfgLine, "mac=", 4) == 0) {
strcpy (macAddr, cfgLine + 4);
}
Line one compares the first 4 characters; strncmp returns 0 when they match.
Line two copies everything from the fifth character to the end of the line buffer into the target array, which can then be passed as a parameter to functions.
The file should be structured without ;, as otherwise you would have to parse both ; and \n:
mac=8f:2c:2b:19:e0:b7
ip=192.168.1.200
....
postport=8080
To convert a char array to e.g. an int we use atoi(); to convert a single digit character (one element such as numBuffer[0]) to a number we subtract 48, the ASCII code of '0': int singleDigit = numBuffer[0] - 48;
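The prefix match and the conversions described above can be sketched in plain C as follows. This is only an illustration under assumptions: cfg_get and digit_value are hypothetical helper names, not part of the routines in this answer. Compared with a bare strcpy, the sketch bounds the copy and drops a trailing ';', '\r' or '\n', so it also tolerates the ';'-terminated file from the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* If `line` starts with `key` (e.g. "ip="), copy the value that follows
 * into `out` (at most out_size - 1 characters), stopping at the first
 * ';', '\r' or '\n'. Returns 1 on a match, 0 otherwise. */
int cfg_get(const char *line, const char *key, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    size_t key_len = strlen(key);
    if (strncmp(line, key, key_len) != 0)
        return 0;                        /* different parameter */
    const char *val = line + key_len;
    size_t n = strcspn(val, ";\r\n");    /* length up to first terminator */
    if (n >= out_size)
        n = out_size - 1;                /* truncate instead of overflowing */
    memcpy(out, val, n);
    out[n] = '\0';
    return 1;
}

/* Convert one digit character to its numeric value ('0' is ASCII 48). */
int digit_value(char c)
{
    return c - '0';
}
```

A call such as cfg_get(cfgLine, "postport=", numBuffer, sizeof numBuffer) followed by atoi(numBuffer) would then yield the port number as an int.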
const char configurationFilePath [] = "/someconfig.txt";
char cfgLine[128] = {'\0'}; // global temp char array to hold the line just read (length = chars in longest line + 1)
char numBuffer[16] = {'\0'}; // global temp char array to help convert chars to a number
char macAddr [18] = {'\0'}; // global char array to hold the MAC address
char ipAddr [16] = {'\0'}; // global char array to hold the IP address - max xxx.xxx.xxx.xxx
int postport=0;
// .... you can easily implement this for all other data you want to store/retrieve
// Counts the lines of a file
uint16_t countLines() {
uint16_t currentLineCount = 0;
File cfgFile = SD.open(configurationFilePath, "r");
if (!cfgFile) {
Serial.println(F("Config file open failed on read"));
} else {
while (cfgFile.available()) {
/** Lets read line by line from the file */
if (cfgFile.read() == '\n') currentLineCount ++; // Lines are delimited by '\n'
}
cfgFile.close();
}
return currentLineCount;
}
//Load the config file from SD/SPIFFS/LittleFS
bool loadConfigFile() {
uint16_t lineCounter = countLines();
if (lineCounter <= 0) {
Serial.print(F("No config data stored in file ")); Serial.println(configurationFilePath);
return false;
}
else {
File cfgFile = SD.open(configurationFilePath, "r");
while (cfgFile.available()) {
strcpy (cfgLine, (cfgFile.readStringUntil('\n').c_str())); // normaly you use new line, we copy one line at a time
// Serial.println(cfgLine); /** Printing for debuging purpose */
loadSingleCfgLine(); // parse the line just read
cfgLine[0] = '\0'; // mark the buffer as processed before reading the next line
}
cfgFile.close();
Serial.println(F("[Success] Loaded config !"));
return true;
}
}
//Load the data of a single line into a char array
void loadSingleCfgLine() {
if (strncmp(cfgLine, "mac=", 4) == 0) {
strcpy (macAddr, cfgLine + 4);
}
if (strncmp(cfgLine, "ip=", 3) == 0) {
strcpy (ipAddr, cfgLine + 3);
}
if (strncmp(cfgLine, "postport=", 9) == 0) {
strcpy (numBuffer, cfgLine + 9);
postport = atoi(numBuffer); // One extra step to convert to int
}
// ... easy to implement for all other data
}
I divided the routines into small independent functions, so they are easy to adapt for different uses. I'm sorry for not digging into your code, as it is hard to follow and it is unclear what you want to do. As an added bonus, we do not use the String class. Strings tend to fragment the heap, causing resets/crashes, while the global char arrays are statically allocated and don't show this behavior.
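The routines above leave the MAC and IP addresses as text, while the question asks for byte arrays. A minimal sketch of that last step, assuming the colon-separated hex and dot-separated decimal formats from the config file, could look like this. parse_bytes is a hypothetical helper name; strtoul is the standard C library function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse `count` numbers written in `base` and separated by `sep` into out[].
 * Anything after the last number (e.g. a trailing ';') is ignored.
 * Returns 1 on success, 0 on malformed input or a value above 255. */
int parse_bytes(const char *s, char sep, int base, uint8_t *out, int count)
{
    assert(s != NULL && out != NULL);
    for (int i = 0; i < count; i++) {
        char *end;
        unsigned long v = strtoul(s, &end, base);
        if (end == s || v > 255)
            return 0;               /* no digits, or does not fit a byte */
        out[i] = (uint8_t)v;
        if (i < count - 1) {
            if (*end != sep)
                return 0;           /* separator missing: too few fields */
            s = end + 1;            /* continue after the separator */
        }
    }
    return 1;
}
```

With the globals from this answer and the question, parse_bytes(macAddr, ':', 16, myMAC, 6) and parse_bytes(ipAddr, '.', 10, myIP, 4) would fill the two byte arrays. The same helper covers netmask, gateway and DNS, since they share the dotted-decimal format.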

Change and wrap keyword integers without loop in C

I'm writing a program that accepts a string at the command prompt, then converts each character of the string to the corresponding 0-25 position in the alphabet. Each number is then used to encipher a character of another string that the user enters after being prompted by the program. Each alphabetic character of the second string should be matched, in order, with the string of integers, and the string of integers should wrap around if the second string is longer. The goal of the program is to use the first string as a key to shift each character of a message (the second string).
Example (desired output):
User runs program and enters keyword: bad
User is prompted to enter string of alphabetical characters and punctuation only: Dr. Oz
Program converts keyword 'bad' into 1,0,3
Program enciphers message into Er. Ra
What I actually get is:
… T.B.S. …
I've tried many things but unfortunately I can't seem to figure out how to loop and wrap the key without looping the second message. If you run the program you will see my problem.
#include <cs50.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int shift(char key1);
int main(int argc, string argv[]) // user enter number at cmd prompt
{
if (argv[1] == '\0')
{
printf("Usage: ./vigenere keyword\n");
return 1;
}
string key = argv[1]; // declare second arg as string
for (int i = 0, n = strlen(key); i < n; i++)
if (isdigit(key[i]) != 0 || argc != 2)
{
printf("Usage: ./vigenere keyword\n");
return 1;
}
string text = get_string("plaintext: ");
printf("ciphertext: ");
int k;
char t;
for (int j = 0, o = strlen(text); j < o; j++)
{
t = text[j];
for (int i = 0, n = strlen(key); i < n; i++)
{
k = shift(key[i]);
if (isupper(t))
{
t += k;
if (t > 'Z')
{
t -= 26;
}
}
if (islower(t))
{
t += k;
if (t > 'z')
{
t -= 26;
}
}
printf("%c", t);
}
}
printf("\n");
}
int shift(char key1)
{
int k1 = key1;
if (islower(key1))
{
k1 %= 97;
}
if (isupper(key1))
{
k1 %= 65;
}
return k1;
}
I appreciate any help and suggestions but please keep in mind the solution should match the level of coding my program suggests. There may be many advanced ways to write this program but unfortunately we are still in the beginning of this course so showing new methods (which I will definitely try to understand) may go over my head.
Here's a modified version of your code, with changes based on my comments:
#include <cs50.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int shift(char key1);
int main(int argc, string argv[]) // user enter number at cmd prompt
{
if (argc != 2 || argv[1][0] == '\0')
{
fprintf(stderr, "Usage: ./vigenere keyword\n");
return 1;
}
string key = argv[1]; // declare second arg as string
for (int i = 0, n = strlen(key); i < n; i++)
{
if (!isalpha(key[i]))
{
fprintf(stderr, "Usage: ./vigenere keyword\n");
return 1;
}
}
string text = get_string("plain text: ");
printf("ciphertext: ");
int keylen = strlen(key);
int keyidx = 0;
for (int j = 0, o = strlen(text); j < o; j++)
{
int t = text[j];
if (isupper(t))
{
int k = shift(key[keyidx++ % keylen]);
t += k;
if (t > 'Z')
t -= 26;
}
else if (islower(t))
{
int k = shift(key[keyidx++ % keylen]);
t += k;
if (t > 'z')
t -= 26;
}
printf("%c", t);
}
printf("\n");
}
int shift(char key1)
{
if (islower(key1))
key1 -= 'a';
if (isupper(key1))
key1 -= 'A';
return key1;
}
The tests for exactly two arguments and for a non-empty key are moved to the top. This is slightly different from what was suggested in the comments.
The error messages are printed to standard error, not standard output. I'd probably replace the second 'usage' message with a more specific error (the key may only contain alphabetic characters, or thereabouts). And the errors should include argv[0] as the program name rather than hard-coding the name.
The key validation loop checks that the key is all alphabetic, rather than checking that the characters are not digits; there are more character classes than digits and letters.
The code uses keyidx and keylen to track the position in the key and the length of the key. I use single-letter variable names, but usually only for loop indexes or simple pointers (usually pointers into strings); otherwise I use short semi-mnemonic names.
There are two calls to shift() so that keyidx is only incremented when the input character is a letter. There are other ways that this could be coded.
One very important change not foretold in the comments is the change of type for t, from char to int. When it is a char, if you encrypt letter z with a letter late in the alphabet (e.g. y), the value 'z' + 24 does not fit in the (signed) char type prevalent on Intel machines; converting it back to char most typically gives a negative value (formally, the result is implementation-defined). That leads to bogus outputs. Changing to int fixes that problem. Since the value of t is promoted to int anyway when passed to printf(), there is no harm done in the printing. I used the prompt plain text: with a space so that the input and output align on the page.
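One of the "other ways" this could be coded sidesteps the range problem altogether by reducing each letter to 0-25 before adding the key. The sketch below is an alternative formulation, not the code above: vigenere_encipher is a hypothetical function name, it works in place on a buffer instead of printing, and it assumes the key has already been validated as non-empty and alphabetic.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Encipher `text` in place with `key`. Only letters are shifted, and only
 * letters consume a key position, so the key wraps independently of
 * spaces and punctuation in the message. */
void vigenere_encipher(char *text, const char *key)
{
    size_t keylen = strlen(key);
    size_t keyidx = 0;
    assert(keylen > 0);
    for (size_t j = 0; text[j] != '\0'; j++) {
        unsigned char c = (unsigned char)text[j];
        if (!isalpha(c))
            continue;                              /* leave as is */
        int k = tolower((unsigned char)key[keyidx++ % keylen]) - 'a';
        int base = isupper(c) ? 'A' : 'a';
        /* (c - base + k) is at most 25 + 25, so int arithmetic cannot
         * overflow, and % 26 performs the wrap-around. */
        text[j] = (char)(base + (c - base + k) % 26);
    }
}
```

For example, calling it on a buffer holding "Dr. Oz" with the key "bad" leaves "Er. Ra" in the buffer, matching the desired output in the question.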
I decided not to use the extra local variable k1 in shift(). I also used subtraction instead of modulus as noted in the comments.
Given the program cc59 created from cc59.c, a sample run is:
$ cc59 bad
plain text: Dr. Oz
ciphertext: Er. Ra
$ cc59 zax
plain text: Er. Ra
ciphertext: Dr. Oz
$ cc59 ablewasiereisawelba
plain text: The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs. The five boxing wizards jump quickly. How vexingly quick daft zebras jump. Bright vixens jump; dozy fowl quack.
ciphertext: Tip uqius fisef fkb uvmpt zzar lpi cehq dkk. Abck nj fkx oqxy jqne zskfn ljbykr bckj. Xpw fezp coxjyk sirivuw rmml ufjckmj. Lkw nmbzrody mytdk dbqx vetzej ncep. Xvthht wtbank rydt; lgzu jzxl qvlgg.
$ cc59 azpweaiswjwsiaewpza
plain text: Tip uqius fisef fkb uvmpt zzar lpi cehq dkk. Abck nj fkx oqxy jqne zskfn ljbykr bckj. Xpw fezp coxjyk sirivuw rmml ufjckmj. Lkw nmbzrody mytdk dbqx vetzej ncep. Xvthht wtbank rydt; lgzu jzxl qvlgg.
ciphertext: The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs. The five boxing wizards jump quickly. How vexingly quick daft zebras jump. Bright vixens jump; dozy fowl quack.
$
The decrypting keys were derived by matching the 'encrypting' letters in row 1 with the decrypting letters in row 2 of the data:
abcdefghijklmnopqrstuvwxyz
azyxwvutsrqponmlkjihgfedcb
With encryption and decryption, the most basic acid test for the code is that the program can decrypt its own encrypted output given the correct decrypting key and the cipher text.
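The two-row table above amounts to replacing a key letter with shift k by the letter with shift (26 - k) mod 26. As a small sketch (inverse_key is a hypothetical helper, not part of the program above), the decrypting key can be derived like this, assuming an all-alphabetic key:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Write the decrypting key for `key` into `out` as lowercase letters.
 * `out` must have room for strlen(key) + 1 characters. */
void inverse_key(const char *key, char *out, size_t out_size)
{
    size_t n = strlen(key);
    assert(out != NULL && out_size > n);
    for (size_t i = 0; i < n; i++) {
        int k = tolower((unsigned char)key[i]) - 'a';   /* shift 0..25 */
        out[i] = (char)('a' + (26 - k) % 26);           /* a->a, b->z, c->y */
    }
    out[n] = '\0';
}
```

Applied to the keys used in the sample runs, this gives zax for bad and azpweaiswjwsiaewpza for ablewasiereisawelba.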

CUDA streams are blocking despite Async

I'm working on a video stream in real time that I try to process with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)
Each frame has to be captured and lightly blurred, and whenever I can, I need to do some hard-work calculations on the 10 latest frames.
So I need to capture ALL the frames at 30 fps, and I expect to get the hard-work result at 5 fps.
My problem is that I cannot keep the capture running at the right pace: it seems that the hard-work calculation slows down the capture of frames, either at CPU level or at GPU level. I miss some frames...
I tried many solutions. None worked:
I tried to set up jobs on 2 streams (image below):
The host gets a frame
First stream (called Stream2): cudaMemcpyAsync copies the frame to the device. Then, a first kernel does the basic blurring calculations. (In the attached image, blurring appears as a short slot at 3.07 s and 3.085 s. And then nothing... until the big part has finished.)
The host checks whether the second stream is "available" by means of a CUDA event, and launches it if possible. In practice, the stream is available on about half of the tries.
Second stream (called Stream4): starts the hard-work calculations in a kernel (kernelCalcul_W2), outputs the result, and records an event.
(Image: NSight capture)
Practically, I wrote :
cudaStream_t sHigh, sLow;
cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&sLow, cudaStreamNonBlocking, priority_low);
cudaEvent_t event_1;
cudaEventCreate(&event_1);
if (frame has arrived)
{
cudaMemcpyAsync(..., sHigh); // HtoD, to upload images in the GPU
blur_Image <<<... , sHigh>>> (...)
if (cudaEventQuery(event_1) == cudaSuccess) hard_work(sLow);
else printf("Event 2 not ready\n");
}
void hard_work( cudaStream_t sLow_)
{
kernelCalcul_W2<<<... , sLow_>>> (...);
cudaMemcpyAsync(... the result..., sLow_); //DtoH
cudaEventRecord(event_1, sLow_);
}
I tried to use only one stream. It's the same code as above, but with one parameter changed when launching hard_work.
The host gets a frame
Stream: cudaMemcpyAsync copies the frame to the device. Then, the kernel does the basic blurring calculations. Then, if the CUDA event Event_1 is OK, I launch the hard work, and I record Event_1 again to get the status on the next round.
In practice, the stream is ALWAYS available: I never fall into the "else" part.
This way, while the hard work is running, I expected to "buffer" all the frames to copy, and not to lose any. But I do lose some: it turns out that each time I get a frame and copy it, Event_1 seems OK, so I launch the hard work, and only get the next frame very late.
I tried to put the two streams in two different threads (in C). Not better (even worse).
So the question is: how to ensure that the first stream captures ALL frames?
I really have the feeling that the different streams block the CPU.
I display the images with OpenGL. Would it interfere?
Any idea of ways to improve this?
Thanks a lot!
EDIT:
As requested, I put here a MCVE.
There is a parameter you can tune (#define ADJUST) to see what's happening. Basically, the main procedure sends CUDA requests in Async mode, but it seems to block the main thread. As you will see in the image, I have "memory access" (i.e. images captured ) every 30 ms except when the hard-work is running (then, I just don't get images).
Last detail: I'm using CUDA 7.5 to run this. I tried to install 8.0 but apparently the compiler is still 7.5
#define _USE_MATH_DEFINES 1
#define _CRT_SECURE_NO_WARNINGS 1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <Windows.h>
#define ADJUST 400
// Adjusting this parameter may make the problem occur.
// Too high => the watchdog will probably stop the kernel
// Too low => the kernel will probably run smoothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result;
long a, b, n, n0;
float input;
b = 3;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
else printf("running ok\n");
// copy the result back to Host
//printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
// storing the images in the Host memory. I know I could optimize with cudaHostAllocate.
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
free(images_as_Output_in_Host);
free(camera);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("\nDealing with image %d\n", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0., sizeof(float) * imagePixelSize, s1);
// blur image
blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the blurring, and tries to do the hard work
Sleep(30); // to simulate Camera
}
releaseMapFile();
getchar();
}
The primary issue here is that cudaMemcpyAsync is only a properly non-blocking async operation if the host memory involved is pinned, i.e. allocated using cudaHostAlloc. This characteristic is covered in several places, including the API documentation and the relevant programming guide section.
The following modification to your code (to run on linux, which I prefer) demonstrates the behavioral difference:
$ cat t33.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#define ADJUST 400
// Adjusting this parameter may make the problem occur.
// Too high => the watchdog will probably stop the kernel
// Too low => the kernel will probably run smoothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result = 0.f; // initialize the accumulator before the += below
long a, n, n0;
float input;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
#ifndef QUICK
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
else printf("running ok\n");
// copy the result back to Host
//printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize/2, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
#endif
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
#ifdef USE_HOST_ALLOC
cudaHostAlloc(&camera, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
cudaHostAlloc(&images_as_Unsigned_in_Host, imagePixelSize*sizeof(unsigned short)*totNbOfImages, cudaHostAllocDefault);
cudaHostAlloc(&images_as_Output_in_Host, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
#else
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
#endif
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
cudaEventRecord(event_2, s2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("\nDealing with image %d\n", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) * imagePixelSize, s1);
// blur image
blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the blurring, and tries to do the hard work
usleep(30000); // to simulate Camera
}
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("some CUDA error: %s\n", cudaGetErrorString(err));
releaseMapFile();
}
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard work is launched after image 1 is captured ==> running ok
Dealing with image 2
Hard work is launched after image 2 is captured ==> running ok
Dealing with image 3
Hard work is launched after image 3 is captured ==> running ok
Dealing with image 4
Hard work is launched after image 4 is captured ==> running ok
Dealing with image 5
Hard work is launched after image 5 is captured ==> running ok
Dealing with image 6
Hard work is launched after image 6 is captured ==> running ok
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard work is launched after image 8 is captured ==> running ok
Dealing with image 9
Hard work is launched after image 9 is captured ==> running ok
real 0m2.790s
user 0m0.688s
sys 0m0.966s
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu -DUSE_HOST_ALLOC
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard_work still running, so unable to process after image 1
Dealing with image 2
Hard_work still running, so unable to process after image 2
Dealing with image 3
Hard_work still running, so unable to process after image 3
Dealing with image 4
Hard_work still running, so unable to process after image 4
Dealing with image 5
Hard_work still running, so unable to process after image 5
Dealing with image 6
Hard_work still running, so unable to process after image 6
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard_work still running, so unable to process after image 8
Dealing with image 9
Hard_work still running, so unable to process after image 9
real 0m1.721s
user 0m0.028s
sys 0m0.629s
$
In the USE_HOST_ALLOC case above, the launch pattern for the low-priority kernel is intermittent, as expected, and the overall run time is considerably shorter.
In short, if you want the expected behavior out of cudaMemcpyAsync, make sure any participating host allocations are page-locked.
A pictorial (profiler) example of the effect that pinning can have on multi-stream behavior can be seen in this answer.

Recover a GZIP file of which first 361 bytes are truncated

I have a gzip file of size 325 MB. I just figured out that it is truncated by 361 bytes from the beginning.
Please advise how I can recover the compressed files from it.
You need to find the next deflate block boundary. Such a boundary can occur at any bit location. You will need to attempt decompression starting at every bit until you get successful decoding for at least a few deflate blocks.
You can use zlib's inflatePrime() to feed less than a byte to inflate(). You can use inflateSetDictionary() to provide a faux 32K dictionary to precede the data being inflated, in order to avoid distance-too-far-back errors.
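As a rough illustration of what "starting at every bit" means, here is a minimal sketch in plain C (no zlib calls) of how a candidate start position maps to the leftover bits of the first byte. Deflate consumes the bits of each byte from the least significant bit upward, so starting at bit `bit` leaves `8 - bit` bits whose value is the byte shifted right by `bit`. The helper names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bits still unread in a byte when decoding starts at bit `bit`
   (bit 0 is the least significant bit, so bit == 0 uses the whole byte). */
int leftover_count(int bit)
{
    assert(bit >= 0 && bit <= 7);
    return 8 - bit;
}

/* Value of those unread bits, right-aligned, in the order deflate would
   consume them (least significant bit first). */
int leftover_value(unsigned char byte, int bit)
{
    assert(bit >= 0 && bit <= 7);
    return byte >> bit;
}

/* Number of candidate start positions in a buffer of len bytes:
   every bit of every byte. */
size_t candidate_count(size_t len)
{
    return len * 8;
}
```

For a nonzero `bit`, these two values are what the full program below passes to inflatePrime() before handing the remaining whole bytes to inflate().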
Once you find a block boundary, you have solved half the problem. The next half is to find where in the deflate stream there is no longer a dependence on the unknown uncompressed data derived from that missing 361 bytes of compressed data. It is possible for such a dependency to be very long-lasting. For example, if the word " the " appears in that missing section, then it can be referred to after that as a missing string. However, you don't know that it is " the ". All you know is that there is a reference to a five-byte string in the missing data. Then where that five-byte string is copied to can itself be referenced by a later match. This could, in principle, propagate through the entire 325 MB, making the whole thing completely unrecoverable.
However that is unlikely. It is more likely that at some point the propagation of strings from the first 361 bytes stops. From there on, you can recover the uncompressed data.
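The propagation described above can be sketched with a toy LZ77 model. This is not deflate: the token format, `mark_known()` and `first_reliable()` are assumptions made up for the illustration. Each output byte is flagged as known or unknown, and a copy inherits the flag of its source, so anything that traces back to the missing history stays unknown.

```c
#include <assert.h>
#include <stddef.h>

/* One toy LZ77 token: a literal byte when len == 0, otherwise a copy of
   len bytes starting dist bytes back in the output. */
struct token {
    size_t dist;
    size_t len;
    unsigned char lit;
};

/* Walk the tokens and record in known[] whether each output byte is
   recoverable. Everything before the start of the output is treated as
   missing. Returns the number of output bytes, or 0 if cap is too small. */
size_t mark_known(const struct token *tok, size_t ntok,
                  unsigned char *known, size_t cap)
{
    size_t out = 0;
    for (size_t i = 0; i < ntok; i++) {
        if (tok[i].len == 0) {
            if (out == cap)
                return 0;
            known[out++] = 1;           /* literals are always known */
            continue;
        }
        assert(tok[i].dist >= 1);
        for (size_t k = 0; k < tok[i].len; k++) {
            if (out == cap)
                return 0;
            /* a copy inherits the status of its source byte; a source
               before the start of the output is unknown */
            known[out] = tok[i].dist <= out ? known[out - tok[i].dist] : 0;
            out++;
        }
    }
    return out;
}

/* Index from which every remaining byte is known. */
size_t first_reliable(const unsigned char *known, size_t n)
{
    while (n > 0 && known[n - 1])
        n--;
    return n;
}
```

With the tokens literal 'a', copy(dist 5, len 3), literal 'b', copy(dist 2, len 2), literal 'c', the second copy picks up one unknown byte from the first copy, so the unknown region extends past the point where the missing history was last referenced directly.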
In order to tell whether you are still seeing propagation or not, do the decompression twice. Once with an initial faux dictionary of all 0's, and once with an initial faux dictionary of all 1's. Where the decompressed data is the same for both decompressions, you have successfully recovered that data.
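A minimal sketch of that two-pass comparison, using a toy LZ77 decoder rather than zlib. The token format and the names `decode()` and `reliable_from()` are hypothetical. A copy that reaches back past the start of the output reads a filler byte, standing in for the faux dictionary. Decoding once with filler 0 and once with filler 1 makes every byte that depends on the missing history differ between the two outputs.

```c
#include <assert.h>
#include <stddef.h>

/* One toy LZ77 token: a literal byte when len == 0, otherwise a copy of
   len bytes starting dist bytes back in the output. */
struct token {
    size_t dist;
    size_t len;
    unsigned char lit;
};

/* Decode tokens into out[]. References that land before the start of the
   output read `fill` instead. Returns the number of bytes produced, or 0
   if cap is too small. */
size_t decode(const struct token *tok, size_t ntok, unsigned char fill,
              unsigned char *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < ntok; i++) {
        if (tok[i].len == 0) {
            if (n == cap)
                return 0;
            out[n++] = tok[i].lit;
            continue;
        }
        assert(tok[i].dist >= 1);
        for (size_t k = 0; k < tok[i].len; k++) {
            if (n == cap)
                return 0;
            out[n] = tok[i].dist <= n ? out[n - tok[i].dist] : fill;
            n++;
        }
    }
    return n;
}

/* Start of the longest common suffix of a[0..n-1] and b[0..n-1]. Bytes
   from this index onward were identical in both decodes. */
size_t reliable_from(const unsigned char *a, const unsigned char *b, size_t n)
{
    while (n > 0 && a[n - 1] == b[n - 1])
        n--;
    return n;
}
```

Searching backwards from the end, as `reliable_from()` does, keeps only the trailing run that agrees in both passes. Isolated matching bytes earlier in the output are discarded, which is the conservative choice the full program below also makes.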
Then you will need to go up to the next level of structure in that data, and see if you can somehow make use of what you have recovered.
Good luck. And don't cut off the first 361 bytes next time.
Below is example code that does what is described above.
/* salvage -- recover data from a corrupted deflate stream
* Copyright (C) 2015 Mark Adler
* Version 1.0 28 June 2015 Mark Adler
*/
/*
This software is provided 'as-is', without any express or implied
warranty. In no event will the author be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Mark Adler
madler#alumni.caltech.edu
*/
/* Attempt to recover deflate data from a corrupted stream. The corrupted data
is read on stdin, and any reliably decompressed data is written to stdout. A
deflate stream is deemed to have been found successfully if there are eight
or fewer bytes of compressed data unused when done. This can be changed
with the MAXLEFT macro below, or the conditional that currently uses
MAXLEFT. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <assert.h>
#include "zlib.h"
/* Get the size of an allocated piece of memory (usable size -- not necessarily
the requested size). */
#if defined(__APPLE__) && defined(__MACH__)
# include <malloc/malloc.h>
# define memsize(p) malloc_size(p)
#elif defined (__linux__)
# include <malloc.h>
# define memsize(p) malloc_usable_size(p)
#elif defined (_WIN32)
# include <malloc.h>
# define memsize(p) _msize(p)
#else
# error You need to find an allocated memory size function
#endif
#define local static
/* Load an entire file into a memory buffer. load() returns 0 on success, in
which case it puts all of the file data in (*data)[0..*len - 1]. That is,
unless *len is zero, in which case *data is NULL. *data is allocated memory
which should be freed when done with it. On error, load() returns with
*data == NULL and *len == 0. The error values are -1 for read error or
1 for out of memory. To guard against bogging down the system with
extremely large allocations, if limit is not zero then load() will return an
out of memory error if the input is larger than limit. */
local int load(FILE *in, unsigned char **data, size_t *len, size_t limit)
{
size_t size = 1048576, have = 0, was;
unsigned char *buf = NULL, *mem;
*data = NULL;
*len = 0;
if (limit == 0)
limit--;
if (size >= limit)
size = limit - 1;
do {
/* if we already saturated the size_t type or reached the limit, then
out of memory */
if (size == limit) {
free(buf);
return 1;
}
/* double size, saturating to the maximum size_t value */
was = size;
size <<= 1;
if (size < was || size > limit)
size = limit;
/* reallocate buf to the new size */
mem = realloc(buf, size);
if (mem == NULL) {
free(buf);
return 1;
}
buf = mem;
/* read as much as is available into the newly allocated space */
have += fread(buf + have, 1, size - have, in);
/* if we filled the space, make more space and try again until we don't
fill the space, indicating end of file */
} while (have == size);
/* if there was an error reading, discard the data and return an error */
if (ferror(in)) {
free(buf);
return -1;
}
/* if a zero-length file is read, return NULL for the data pointer */
if (have == 0) {
free(buf);
return 0;
}
/* resize the buffer to be just big enough to hold the data */
mem = realloc(buf, have);
if (mem != NULL)
buf = mem;
/* return the data */
*data = buf;
*len = have;
return 0;
}
#define DICTSIZE 32768
#if UINT_MAX <= 0xffff
# define BUFSIZE 32768
#else
# define BUFSIZE 1048576
#endif
/* Inflate the provided buffer starting at a specified bit offset. Use an
already-initialized inflate stream structure for rapid repeated attempts.
The structure needs to have been initialized using inflateInit2(strm, -15).
Inflation begins at data[off], starting at bit bit in that byte, going from
that bit to the more significant bits in that byte, and then on to the next
byte. bit must be in the range 0..7. bit == 0 uses the entire byte at
data[off]. bit == 7 uses only the most significant bit of the byte at
data[off]. Before inflation, the dictionary is initialized to
dict[0..DICTSIZE-1] so that references before the start of the uncompressed
data do not stop inflation. Inflation continues as long as possible, until
either an error is encountered, the end of the deflate stream is reached, or
data[len-1] is processed. On entry *recoup is a pointer to allocated memory
or NULL, and on return *recoup points to allocated memory with the
decompressed data. *got is set to the number of bytes of decompressed data
returned at *recoup.
inflate_at() returns Z_DATA_ERROR if an error was detected in the alleged
deflate data, Z_STREAM_END if the end of a valid deflate stream was reached,
or Z_OK if the end of the provided compressed data was reached without
encountering an error or the end of the stream. */
local int inflate_at(z_stream *strm, unsigned char *data, size_t len,
size_t off, int bit, size_t *unused, unsigned char *dict,
unsigned char **recoup, size_t *got)
{
int ret;
size_t left, size;
/* check input */
assert(data != NULL && off < len && bit >= 0 && bit <= 7);
assert(dict != NULL && recoup != NULL);
/* set up inflate engine, feeding first few bits if necessary */
ret = inflateReset(strm);
assert(ret == Z_OK);
ret = inflateSetDictionary(strm, dict, DICTSIZE);
assert(ret == Z_OK);
if (bit) {
ret = inflatePrime(strm, 8 - bit, data[off] >> bit);
assert(ret == Z_OK);
off++;
}
/* inflate as much as possible */
strm->next_in = data + off;
left = len - off;
*got = 0;
do {
strm->avail_in = left > UINT_MAX ? UINT_MAX : left;
left -= strm->avail_in;
do {
/* assure at least BUFSIZE available in recoup */
size = memsize(*recoup);
if (*got + BUFSIZE > size) {
size = size ? size << 1 : BUFSIZE;
assert(size != 0);
/* realloc() rather than the BSD-only reallocf(); failure aborts below */
*recoup = realloc(*recoup, size);
assert(*recoup != NULL);
}
/* inflate into recoup */
strm->next_out = *recoup + *got;
strm->avail_out = BUFSIZE;
ret = inflate(strm, Z_NO_FLUSH);
assert(ret != Z_STREAM_ERROR && ret != Z_MEM_ERROR);
/* set the number of compressed bytes unused so far, in case we
return */
if (unused != NULL)
*unused = left + strm->avail_in;
/* update the number of uncompressed bytes generated */
*got += BUFSIZE - strm->avail_out;
/* if we cannot continue to decompress, then return the reason */
if (ret == Z_DATA_ERROR || ret == Z_STREAM_END)
return ret;
/* continue with provided input data until all output generated */
} while (strm->avail_out == 0);
assert(strm->avail_in == 0);
/* provide more input data, if any */
} while (left);
/* ran through all compressed data with no errors or end of stream */
return Z_OK;
}
/* The criteria for success is the completion of inflate with no more than this
many bytes unused. (8 is the length of a gzip trailer.) */
#define MAXLEFT 8
/* Read a corrupted (or not) deflate stream from stdin and write the reliably
recovered data to stdout. */
int main(void)
{
int ret, bit;
unsigned char *data = NULL, *recoup = NULL, *comp = NULL;
size_t len, off, unused, got;
z_stream strm;
unsigned char dict[DICTSIZE] = {0};
/* read input into memory */
ret = load(stdin, &data, &len, 0);
if (ret < 0)
fprintf(stderr, "file error reading input\n");
if (ret > 0)
fprintf(stderr, "ran out of memory reading input\n");
assert(ret == 0);
fprintf(stderr, "read %zu bytes\n", len);
/* initialize inflate structure */
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.opaque = Z_NULL;
strm.next_in = Z_NULL;
strm.avail_in = 0;
ret = inflateInit2(&strm, -15);
assert(ret == Z_OK);
/* scan for an acceptable starting point for inflate */
for (off = 0; off < len; off++)
for (bit = 0; bit < 8; bit++) {
ret = inflate_at(&strm, data, len, off, bit, &unused, dict,
&recoup, &got);
if ((ret == Z_STREAM_END || ret == Z_OK) && unused <= MAXLEFT)
goto done;
}
done:
/* if met the criteria, show result and write out reliable data */
if (bit != 8 && (ret == Z_STREAM_END || ret == Z_OK)) {
fprintf(stderr,
"decoded %zu bytes (%zu unused) at offset %zu, bit %d\n",
len - off - unused, unused, off, bit);
/* decompress again with a different dictionary to detect unreliable
data */
memset(dict, 1, DICTSIZE);
inflate_at(&strm, data, len, off, bit, NULL, dict, &comp, &got);
{
unsigned char *p, *q;
/* search backwards from the end for the first unreliable byte */
p = recoup + got;
q = comp + got;
while (q > comp)
if (*--p != *--q) {
p++;
q++;
break;
}
/* write out the reliable data */
fwrite(q, 1, got - (q - comp), stdout);
fprintf(stderr,
"%zu bytes of reliable uncompressed data recovered\n",
got - (size_t)(q - comp));
fprintf(stderr,
"(out of %zu total uncompressed bytes recovered)\n", got);
}
}
/* otherwise declare failure */
else
fprintf(stderr, "no deflate stream found that met criteria\n");
/* clean up */
free(comp);
free(recoup);
inflateEnd(&strm);
free(data);
return 0;
}

GCC pointer cast warning

I am wondering why GCC is giving me this warning:
test.h: In function TestRegister:
test.h:12577: warning: cast to pointer from integer of different size
Code:
#define Address 0x1234
int TestRegister(unsigned int BaseAddress)
{
unsigned int RegisterValue = 0;
RegisterValue = *((unsigned int *)(BaseAddress + Address)) ;
if((RegisterValue & 0xffffffff) != (0x0 << 0))
{
return(0);
}
else
{
return(1);
}
}
Probably because you're on a 64-bit platform, where pointers are 64-bit but ints are 32-bit.
Rule-of-thumb: Don't try to use integers to store addresses.
If you include <stdint.h> and compile for the C99 standard (for example with gcc -Wall -std=c99), you can cast to and from intptr_t, which is an integer type wide enough to hold a pointer.
RegisterValue = *((unsigned int *)((intptr_t)(BaseAddress + Address))) ;
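A minimal sketch of the same idea with the address carried in a pointer-sized integer from the start, so nothing is truncated before the cast. It uses uintptr_t, the unsigned counterpart of intptr_t from <stdint.h>. The function names are hypothetical, and converting an arbitrary integer to a pointer remains implementation-defined, so this assumes a platform with a flat address space.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 32-bit register located byte_offset bytes past base. Both values
   are pointer-sized integers, so the sum is not truncated on platforms
   where pointers are wider than unsigned int. */
uint32_t read_reg(uintptr_t base, uintptr_t byte_offset)
{
    const volatile uint32_t *reg =
        (const volatile uint32_t *)(base + byte_offset);
    return *reg;
}

/* Nonzero when unsigned int cannot hold every pointer value, which is the
   situation the GCC warning is reporting. */
int uint_narrower_than_pointer(void)
{
    return sizeof(unsigned int) < sizeof(void *);
}
```

On a typical 64-bit Linux build `uint_narrower_than_pointer()` would be expected to return nonzero, which matches the warning in the question.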
Among other things, you're assuming that a pointer will fit into an unsigned int, where C gives no such guarantee… there are a number of platforms in use today where this is untrue, apparently including yours.
A pointer to data can be stored in a (void*) or (type*) safely. An integer offset such as a size_t can be added to a pointer, and subtracting two pointers yields a ptrdiff_t. There's no guaranteed relationship between sizeof(int), sizeof(size_t), sizeof(ssize_t), and sizeof(void*) or sizeof(type*)…
(Also, in this case, there's no real point in initializing the var and overwriting it on the next line…)
Also unrelated, but you realise that != (0x0 << 0) is just != 0 and can be omitted, since if (x) is equivalent to if (x != 0)? Perhaps that's because this is cut down from a larger sample, but that entire routine could be presented as
int TestRegister (unsigned int* BaseAddress)
{ return ( (0xffffffff & *(BaseAddress + Address)) ? 0 : 1 ); }
(Edited: changed to unsigned int* as it seems far more likely he wants to skip through at int-sized offsets?)
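Note that the pointer version changes what the offset means: adding an integer to an `unsigned int *` advances by that many elements, not bytes. A small sketch of the difference, with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes skipped when n is added to an unsigned int pointer. */
size_t element_step_bytes(size_t n)
{
    return n * sizeof(unsigned int);
}

/* The same distance measured directly, by comparing the two addresses as
   char pointers. p + n must stay within (or one past) the array p points
   into. Pointer subtraction yields a ptrdiff_t. */
ptrdiff_t measured_step_bytes(unsigned int *p, size_t n)
{
    return (char *)(p + n) - (char *)p;
}
```

So with `Address` defined as 0x1234, the pointer version reads from `0x1234 * sizeof(unsigned int)` bytes past the base, whereas the original integer arithmetic added 0x1234 bytes.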

Resources