MSVC2022 17.2 address sanitizer trigger false positive on vector::push_back - address-sanitizer

I turned on Asan on my project, which is a exe project and based on webRTC.lib(release72).
Because webRTC release72 do not support Asan on windows, so the webRTC.lib can not turn on Asan, BTW, the webRTC.lib is generated by MSVC2017 while my project is generated by MSVC2022(17.2 preview).
Before I turned on Asan on my project, it works normally. then I turned on Asan, my project trigger a Asan report in webRTC h264 EncoderQueue thread:
std::vector<uint8_t> ParseRbsp(const uint8_t* data, size_t length) {
std::vector<uint8_t> out;
for (size_t i = 0; i < length;) {
// Be careful about over/underflow here. byte_length_ - 3 can underflow, and
// i + 3 can overflow, but byte_length_ - i can't, because i < byte_length_
// above, and that expression will produce the number of bytes left in
// the stream including the byte at i.
if (length - i >= 3 && !data[i] && !data[i + 1] && data[i + 2] == 3) {
// Two rbsp bytes.
out.push_back(data[i++]);
out.push_back(data[i++]);
// Skip the emulation byte.
i++;
} else {
// Single rbsp byte.
out.push_back(data[i++]);<======crash here everytime
}
}
return out;
}

Related

arduino, setup ethernet & network using data from SD config file

Im try to add to my sketch a dynamic way to setup the ethernet info (mac, ip, gateway, subnet) from a configuration file (config.txt). So running a webserver and serving htm files from sd card, user can go to setting page, fill a form with these info and when posted , the webserver parse the http form and save (update) the config.txt file. After that system do a restart, in order to start with the new settings (by read the config.txt file)
I have create succesfully all the parts (sd, ethernet, webserver, webclient, create the config file from posted form data) except the get params by reading the config.txt file.
I can read line by line the config, I can split the line to param & value, and now I need to fill some byte variables with the readed data. I can (after a month of google searching) to read IPs (decimal values) to byte array. Im stack to read the MAC ADDRESS hex into byte array. The config file contains the:
mac=8f:2c:2b:19:e0:b7;
ip=192.168.1.200;
netmask=255.255.255.0;
gateway=192.168.1.254;
dns=8.8.8.8;
posturl=192.168.1.157;
postport=8080;
postscript=/itherm/update.php;
interval=60000;
and the code that I use to read is:
byte myMAC[6];
byte myIP[4];
File fset;
fset = SD.open("config.txt");
if (fset){
char ff[40];
while (fset.available()>1){
bool eol=false;
for (int i=0; !eol;i++){
ff[i]=fset.read();
if (ff[i]=='\n'){
eol=true;
}
}
String par="";
bool DONE=false;
for (int i=0; !DONE;i++){
par+=ff[i];
if (ff[i]== '='){DONE=true;}
}
String pval="";
DONE=false;
//------------------------
if (par=="ip=" ){
int x=0;
while(!DONE){
for(int i=3;i<=i+21;i++){
if(ff[i]=='.'){
myIP[x]=pval.toInt();
x++;
i++;
pval="";
}
else if(ff[i]==';' || i>20){
myIP[x]=pval.toInt();
DONE=true;
break;
}
pval+=ff[i];
}
}
}
} //while (fset.available()>1)
} //if (fset)
I will appreciate any help. Please no answers with simple use of Serial.print(). I have found hundreds of suggestions but none, that work properly to read all the parameters (dec, hex, strings). After a month of effort & searching, I wonder why something so necessary and useful does not exist as an example in the community, completely functional !!
Best regards
Okay so here is a complete set of routines to do what you want -I think you misunderstood the concept of char arrays vs a single char[0] The routines are documented and self explanatory. I recomend not to finish lines with ; but with '\n' which in your example is there anyway (also you can not see the new line terminator) To get the mac address I need three lines:
if (strncmp(cfgLine, "mac=", 4) == 0) {
strcpy (macAddr, cfgLine + 4);
}
line one compares the first 4 characters and if it is 0 (meaning its a fit)
line two copies the chars from the fifth to the last char from the lineBuffer to the target array, which can actually be used as param for functions.
The file structure should be with no ; as you would have to parse ; and \n
mac=8f:2c:2b:19:e0:b7
ip=192.168.1.200
....
postport=8080
To convert a char array to eg int we use atoi(), to convert a single char[0] to a single number we use int singleDigit = char[0]-48;
const char configurationFilePath [] = "/someconfig.txt";
char cfgLine[128] = {'\0'}; // this is a global temp char array to hold the read lines (lenght= chars longest line +1)
char numBuffer[16] = {'\0'}; // this is a global temo char array to help to convert char to number
char macAddr [18] = {'\0'}; // this is a global char array to hold the mac address
char ipAddr [16] = {'\0'}; // this is a global char array to hold the IP address - max xxx.xxx.xxx.xxx
int postport=0;
// .... you can easyly implement for all other data you want to store/retrieve
// Counts the lines of a file
uint16_t countLines() {
uint16_t currentLineCount = 0;
File cfgFile = SD.open(configurationFilePath, "r");
if (!cfgFile) {
Serial.println(F("Config file open failed on read"));
} else {
while (cfgFile.available()) {
/** Lets read line by line from the file */
if (cfgFile.read() == '\n') currentLineCount ++; // Lines are delimited by '\n'
}
cfgFile.close();
}
return currentLineCount;
}
//Load the config file from SD/SPIFFS/LittleFS
bool loadConfigFile() {
uint16_t lineCounter = countLines();
if (lineCounter <= 0) {
Serial.print(F("No config data stored in file ")); Serial.println(configurationFilePath);
return false;
}
else {
File cfgFile = SD.open(configurationFilePath, "r");
while (cfgFile.available()) {
strcpy (cfgLine, (cfgFile.readStringUntil('\n').c_str())); // normaly you use new line, we copy one line at a time
// Serial.println(cfgLine); /** Printing for debuging purpose */
while (cfgLine[0] != '\0') { /* Block refilling of cfgLine till processed */
loadSingleCfgLine();
}
}
cfgFile.close();
Serial.println(F("[Success] Loaded config !"));
return true;
}
}
//Load the data of a single line into a char array
void loadSingleCfgLine() {
if (strncmp(cfgLine, "mac=", 4) == 0) {
strcpy (macAddr, cfgLine + 4);
}
if (strncmp(cfgLine, "ip=", 3) == 0) {
strcpy (ipAddr, cfgLine + 3);
}
if (strncmp(cfgLine, "postport=", 9) == 0) {
strcpy (numBuffer, cfgLine + 9);
postport = atoi(numBuffer); // One extra step to convert to int
}
// ... easy to implement for all other data
}
I divided the routines into small independend functions, so its easy adaptable for different uses. I'm sorry for not digging into your code as it is hard to follow and unclear what you want todo.As an added bonus we do not use the String class. These Strings tend to fragment heap - causing resets/crashes while the global char arrays are compiled to flash and don't show this behavior.

CUDA streams are blocking despite Async

I'm working on a video stream in real time that I try to process with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)
Each frame has to be captured, lightly blured, and whenever I can, I need to do some hard-work calculations on the 10 latest frames.
So I need to capture ALL the frames at 30 fps, and I expect to get the hard-work result at 5 fps.
My problems is that I cannot keep the capture running at the right pace : it seems that the hard-work calculation slows down the capture of frames, either at CPU level or at GPU level. I miss some frames...
I tried many solutions. None worked:
I tried to set-up jobs on 2 streams (image below):
the host gets a frame
First stream (called Stream2) : cudaMemcpyAsync copies the frame on the Device. Then, a first kernel does the basic bluring calculations. (In the attached image, bluring appears as a short slot at 3.07 s and 3.085 s. And then nothing... until the big part has finished)
the host checks if the second stream is "available" thanks to a CudaEvent, and lauches it if possible. Practically, the stream is available 1/2 of tries.
Second stream (called Stream4) : starts hard-work calculations in a kernel ( kernelCalcul_W2), outputs the result, and records an Event.
NSight capture
Practically, I wrote :
cudaStream_t sHigh, sLow;
cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&sLow, cudaStreamNonBlocking, priority_low);
cudaEvent_t event_1;
cudaEventCreate(&event_1);
if (frame has arrived)
{
cudaMemcpyAsync(..., sHigh); // HtoD, to upload images in the GPU
blur_Image <<<... , sHigh>>> (...)
if (cudaEventQuery(event_1)==cudaSuccess)) hard_work(sLow);
else printf("Event 2 not ready\n");
}
void hard_work( cudaStream_t sLow_)
{
kernelCalcul_W2<<<... , sLow_>>> (...);
cudaMemcpyAsync(... the result..., sLow_); //DtoH
cudaEventRecord(event_1, sLow_);
}
I tried to use only one stream. It's the same code as above, but change 1 parameter while launching hard_work.
host gets a frame
Stream: cudaMemcpyAsync copies the frame on the Device. Then, the kernel does the basic bluring calculations. Then, if the CudaEvent Event_1 is ok, I lauch the hard-work, and I add an Event_1 to get the status on next round.
Practically, the stream is ALWAYS available: I never fall in the "else" part.
This way, while the hard-work is running, I expected to "buffer" all the frames to copy, and not to lose any. But I do lose some: it turns out that each time I get a frame and I copy it, Event_1 seems ok so I launch the hard-work, and only get the the next frame very late.
I tried to put the two streams in two different threads (in C). Not better (even worse).
So the question is: how to ensure that the first stream captures ALL frames?
I really have the feeling that the different streams block the CPU.
I display the images with OpenGL. Would it interfere?
Any idea of ways to improve this?
Thanks a lot!
EDIT:
As requested, I put here a MCVE.
There is a parameter you can tune (#define ADJUST) to see what's happening. Basically, the main procedure sends CUDA requests in Async mode, but it seems to block the main thread. As you will see in the image, I have "memory access" (i.e. images captured ) every 30 ms except when the hard-work is running (then, I just don't get images).
Last detail: I'm using CUDA 7.5 to run this. I tried to install 8.0 but apparently the compiler is still 7.5
#define _USE_MATH_DEFINES 1
#define _CRT_SECURE_NO_WARNINGS 1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <Windows.h>
#define ADJUST 400
// adjusting this paramter may make the problem occur.
// Too high => probably watchdog will stop the kernel
// too low => probably the kernel will run smothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result;
long a, b, n, n0;
float input;
b = 3;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
else printf("running ok\n");
// copy the result back to Host
//printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
// storing the images in the Host memory. I know I could optimize with cudaHostAllocate.
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
free(images_as_Output_in_Host);
free(camera);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("\nDealing with image %d\n", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0., sizeof(float) * imagePixelSize, s1);
// blur image
blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the bluring, and tries to do the hard-work
Sleep(30); // to simulate Camera
}
releaseMapFile();
getchar();
}
The primary issue here is that cudaMemcpyAsync is only a properly non-blocking async operation if the host memory involved is pinned, i.e. allocated using cudaHostAlloc. This characteristic is covered in several places, including the API documentation and the relevant programming guide section.
The following modification to your code (to run on linux, which I prefer) demonstrates the behavioral difference:
$ cat t33.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#define ADJUST 400
// adjusting this paramter may make the problem occur.
// Too high => probably watchdog will stop the kernel
// too low => probably the kernel will run smothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result;
long a, n, n0;
float input;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
#ifndef QUICK
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
else printf("running ok\n");
// copy the result back to Host
//printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize/2, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
#endif
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
#ifdef USE_HOST_ALLOC
cudaHostAlloc(&camera, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
cudaHostAlloc(&images_as_Unsigned_in_Host, imagePixelSize*sizeof(unsigned short)*totNbOfImages, cudaHostAllocDefault);
cudaHostAlloc(&images_as_Output_in_Host, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
#else
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
#endif
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
cudaEventRecord(event_2, s2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("\nDealing with image %d\n", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) * imagePixelSize, s1);
// blur image
blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the bluring, and tries to do the hard-work
usleep(30000); // to simulate Camera
}
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("some CUDA error: %s\n", cudaGetErrorString(err));
releaseMapFile();
}
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard work is launched after image 1 is captured ==> running ok
Dealing with image 2
Hard work is launched after image 2 is captured ==> running ok
Dealing with image 3
Hard work is launched after image 3 is captured ==> running ok
Dealing with image 4
Hard work is launched after image 4 is captured ==> running ok
Dealing with image 5
Hard work is launched after image 5 is captured ==> running ok
Dealing with image 6
Hard work is launched after image 6 is captured ==> running ok
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard work is launched after image 8 is captured ==> running ok
Dealing with image 9
Hard work is launched after image 9 is captured ==> running ok
real 0m2.790s
user 0m0.688s
sys 0m0.966s
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu -DUSE_HOST_ALLOC
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard_work still running, so unable to process after image 1
Dealing with image 2
Hard_work still running, so unable to process after image 2
Dealing with image 3
Hard_work still running, so unable to process after image 3
Dealing with image 4
Hard_work still running, so unable to process after image 4
Dealing with image 5
Hard_work still running, so unable to process after image 5
Dealing with image 6
Hard_work still running, so unable to process after image 6
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard_work still running, so unable to process after image 8
Dealing with image 9
Hard_work still running, so unable to process after image 9
real 0m1.721s
user 0m0.028s
sys 0m0.629s
$
In the USE_HOST_ALLOC case above, the launch pattern for the low-priority kernel is intermittent, as expected, and the overall run time is considerably shorter.
In short, if you want the expected behavior out of cudaMemcpyAsync, make sure any participating host allocations are page-locked.
A pictorial (profiler) example of the effect that pinning can have on multi-stream behavior can be seen in this answer.

Recover a GZIP file of which first 361 bytes are truncated

I have a gzip file of size 325 MB. I just figured it that it is truncated by 361 bytes from the beginning.
Please advise how can I recover the compressed files from it.
You need to find the next deflate block boundary. Such a boundary can occur at any bit location. You will need to attempt decompression starting at every bit until you get successful decoding for at least a few deflate blocks.
You can use zlib's inflatePrime() to feed less than a byte to inflate(). You can use inflateSetDictionary() to provide a faux 32K dictionary to precede the data being inflated, in order to avoid distance-too-far-back errors.
Once you find a block boundary, you have solved half the problem. The next half is to find where in the deflate stream there is no longer a dependence on the unknown uncompressed data derived from that missing 361 bytes of compressed data. It is possible for such a dependency to very long lasting. For example, if the word " the " appears in that missing section, then it can be referred to after that as a missing string. However, you don't know that it is " the ". All you know is that there is a reference to a five-byte string in the missing data. Then where that five-byte string is copied to can itself be referenced by a later match. This could, in principle, propagate through the entire 325 MB, making the whole thing completely unrecoverable.
However that is unlikely. It is more likely that at some point the propagation of strings from the first 361 bytes stops. From there on, you can recover the uncompressed data.
In order to tell whether you are still seeing propagation or not, do the decompression twice. Once with an initial faux dictionary of all 0's, and once with an initial faux dictionary of all 1's. Where the decompressed data is the same for both decompressions, you have successfully recovered that data.
Then you will need to go up to the next level of structure in that data, and see if you can somehow make use of what you have recovered.
Good luck. And don't cut off the first 361 bytes next time.
Below is example code that does what is described above.
/* salvage -- recover data from a corrupted deflate stream
* Copyright (C) 2015 Mark Adler
* Version 1.0 28 June 2015 Mark Adler
*/
/*
This software is provided 'as-is', without any express or implied
warranty. In no event will the author be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Mark Adler
madler#alumni.caltech.edu
*/
/* Attempt to recover deflate data from a corrupted stream. The corrupted data
is read on stdin, and any reliably decompressed data is written to stdout. A
deflate stream is deemed to have been found successfully if there are eight
or fewer bytes of compressed data unused when done. This can be changed
with the MAXLEFT macro below, or the conditional that currently uses
MAXLEFT. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <assert.h>
#include "zlib.h"
/* Get the size of an allocated piece of memory (usable size -- not necessarily
the requested size). */
#if defined(__APPLE__) && defined(__MACH__)
# include <malloc/malloc.h>
# define memsize(p) malloc_size(p)
#elif defined (__linux__)
# include <malloc.h>
# define memsize(p) malloc_usable_size(p)
#elif defined (_WIN32)
# include <malloc.h>
# define memsize(p) _msize(p)
#else
# error You need to find an allocated memory size function
#endif
#define local static
/* Load an entire file into a memory buffer. load() returns 0 on success, in
which case it puts all of the file data in *dat[0..*len - 1]. That is,
unless *len is zero, in which case *dat is NULL. *data is allocated memory
which should be freed when done with it. load() returns zero on success,
with *data == NULL and *len == 0. The error values are -1 for read error or
1 for out of memory. To guard against bogging down the system with
extremely large allocations, if limit is not zero then load() will return an
out of memory error if the input is larger than limit. */
local int load(FILE *in, unsigned char **data, size_t *len, size_t limit)
{
size_t size = 1048576, have = 0, was;
unsigned char *buf = NULL, *mem;
*data = NULL;
*len = 0;
if (limit == 0)
limit--;
if (size >= limit)
size = limit - 1;
do {
/* if we already saturated the size_t type or reached the limit, then
out of memory */
if (size == limit) {
free(buf);
return 1;
}
/* double size, saturating to the maximum size_t value */
was = size;
size <<= 1;
if (size < was || size > limit)
size = limit;
/* reallocate buf to the new size */
mem = realloc(buf, size);
if (mem == NULL) {
free(buf);
return 1;
}
buf = mem;
/* read as much as is available into the newly allocated space */
have += fread(buf + have, 1, size - have, in);
/* if we filled the space, make more space and try again until we don't
fill the space, indicating end of file */
} while (have == size);
/* if there was an error reading, discard the data and return an error */
if (ferror(in)) {
free(buf);
return -1;
}
/* if a zero-length file is read, return NULL for the data pointer */
if (have == 0) {
free(buf);
return 0;
}
/* resize the buffer to be just big enough to hold the data */
mem = realloc(buf, have);
if (mem != NULL)
buf = mem;
/* return the data */
*data = buf;
*len = have;
return 0;
}
#define DICTSIZE 32768
#if UINT_MAX <= 0xffff
# define BUFSIZE 32768
#else
# define BUFSIZE 1048576
#endif
/* Inflate the provided buffer starting at a specified bit offset. Use an
already-initialized inflate stream structure for rapid repeated attempts.
The structure needs to have been initialized using inflateInit2(strm, -15).
Inflation begins at data[off], starting at bit bit in that byte, going from
that bit to the more significant bits in that byte, and then on to the next
byte. bit must be in the range 0..7. bit == 0 uses the entire byte at
data[off]. bit == 7 uses only the most significant bit of the byte at
data[off]. Before inflation, the dictionary is initialized to
dict[0..DICTSIZE-1] so that references before the start of the uncompressed
data do not stop inflation. Inflation continues as long as possible, until
either an error is encountered, the end of the deflate stream is reached, or
data[len-1] is processed. On entry *recoup is a pointer to allocated memory
or NULL, and on return *recoup points to allocated memory with the
decompressed data. *got is set to the number of bytes of decompressed data
returned at *recoup.
inflate_at() returns Z_DATA_ERROR if an error was detected in the alleged
deflate data, Z_STREAM_END if the end of a valid deflate stream was reached,
or Z_OK if the end of the provided compressed data was reached without
encountering an erorr or the end of the stream. */
local int inflate_at(z_stream *strm, unsigned char *data, size_t len,
size_t off, int bit, size_t *unused, unsigned char *dict,
unsigned char **recoup, size_t *got)
{
int ret;
size_t left, size;
/* check input */
assert(data != NULL && off < len && bit >= 0 && bit <= 7);
assert(dict != NULL && recoup != NULL);
/* set up inflate engine, feeding first few bits if necessary */
ret = inflateReset(strm);
assert(ret == Z_OK);
ret = inflateSetDictionary(strm, dict, DICTSIZE);
assert(ret == Z_OK);
if (bit) {
ret = inflatePrime(strm, 8 - bit, data[off] >> bit);
assert(ret == Z_OK);
off++;
}
/* inflate as much as possible */
strm->next_in = data + off;
left = len - off;
*got = 0;
do {
strm->avail_in = left > UINT_MAX ? UINT_MAX : left;
left -= strm->avail_in;
do {
/* assure at least BUFSIZE available in recoup */
size = memsize(*recoup);
if (*got + BUFSIZE > size) {
size = size ? size << 1 : BUFSIZE;
assert(size != 0);
*recoup = reallocf(*recoup, size);
assert(*recoup != NULL);
}
/* inflate into recoup */
strm->next_out = *recoup + *got;
strm->avail_out = BUFSIZE;
ret = inflate(strm, Z_NO_FLUSH);
assert(ret != Z_STREAM_ERROR && ret != Z_MEM_ERROR);
/* set the number of compressed bytes unused so far, in case we
return */
if (unused != NULL)
*unused = left + strm->avail_in;
/* update the number of uncompressed bytes generated */
*got += BUFSIZE - strm->avail_out;
/* if we cannot continue to decompress, then return the reason */
if (ret == Z_DATA_ERROR || ret == Z_STREAM_END)
return ret;
/* continue with provided input data until all output generated */
} while (strm->avail_out == 0);
assert(strm->avail_in == 0);
/* provide more input data, if any */
} while (left);
/* ran through all compressed data with no errors or end of stream */
return Z_OK;
}
/* The criteria for success is the completion of inflate with no more than this
many bytes unused. (8 is the length of a gzip trailer.) */
#define MAXLEFT 8
/* Read a corrupted (or not) deflate stream from stdin and write the reliably
recovered data to stdout. */
int main(void)
{
int ret, bit;
unsigned char *data = NULL, *recoup = NULL, *comp = NULL;
size_t len, off, unused, got;
z_stream strm;
unsigned char dict[DICTSIZE] = {0};
/* read input into memory */
ret = load(stdin, &data, &len, 0);
if (ret < 0)
fprintf(stderr, "file error reading input\n");
if (ret > 0)
fprintf(stderr, "ran out of memory reading input\n");
assert(ret == 0);
fprintf(stderr, "read %lu bytes\n", len);
/* initialize inflate structure */
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.opaque = Z_NULL;
strm.next_in = Z_NULL;
strm.avail_in = 0;
ret = inflateInit2(&strm, -15);
assert(ret == Z_OK);
/* scan for an acceptable starting point for inflate */
for (off = 0; off < len; off++)
for (bit = 0; bit < 8; bit++) {
ret = inflate_at(&strm, data, len, off, bit, &unused, dict,
&recoup, &got);
if ((ret == Z_STREAM_END || ret == Z_OK) && unused <= MAXLEFT)
goto done;
}
done:
/* if met the criteria, show result and write out reliable data */
if (bit != 8 && (ret == Z_STREAM_END || ret == Z_OK)) {
fprintf(stderr,
"decoded %lu bytes (%lu unused) at offset %lu, bit %d\n",
len - off - unused, unused, off, bit);
/* decompress again with a different dictionary to detect unreliable
data */
memset(dict, 1, DICTSIZE);
inflate_at(&strm, data, len, off, bit, NULL, dict, &comp, &got);
{
unsigned char *p, *q;
/* search backwards from the end for the first unreliable byte */
p = recoup + got;
q = comp + got;
while (q > comp)
if (*--p != *--q) {
p++;
q++;
break;
}
/* write out the reliable data */
fwrite(q, 1, got - (q - comp), stdout);
fprintf(stderr,
"%lu bytes of reliable uncompressed data recovered\n",
got - (q - comp));
fprintf(stderr,
"(out of %lu total uncompressed bytes recovered)\n", got);
}
}
/* otherwise declare failure */
else
fprintf(stderr, "no deflate stream found that met criteria\n");
/* clean up */
free(comp);
free(recoup);
inflateEnd(&strm);
free(data);
return 0;
}

keil c51 weird masking behaviour

I am trying to read two pins from a c8051f040 controller.
Reading the port directly works, but saving the same port value to a variable does not work even though the debugger shows the correct value.
// This works
if((P1 & 0xF0) == 0xa0)
{
YEL_LED = 1; //Turn on
}
else
{
YEL_LED = 0; //Turn off
}
// This does not work even though the debugger
// shows the correct value 0xa0 for the var
ORange = (P1 & 0xF0);
if(ORange == 0xa0)
{
YEL_LED = 1; //Turn on
}
else
{
YEL_LED = 0; //Turn off
}
Is this a KEIL c51 bug or is something being optimized away.
The variable was declared as char which is signed. It should be unsigned.
I got fooled by the debugger which was showing the the watch variable as unsigned.

How to remove EXIF data without recompressing the JPEG?

I want to remove the EXIF information (including thumbnail, metadata, camera info... everything!) from JPEG files, but I don't want to recompress it, as recompressing the JPEG will degrade the quality, as well as usually increasing the file size.
I'm looking for a Unix/Linux solution, even better if using the command-line. If possible, using ImageMagick (convert tool). If that's not possible, a small Python, Perl, PHP (or other common language on Linux) script would be ok.
There is a similar question, but related to .NET.
exiftool does the job for me, it's written in perl so should work for you on any o/s
https://exiftool.org/
usage :
exiftool -all= image.jpg
UPDATED - as PeterCo explains below this will remove ALL of the tags. if you just want to remove the EXIF tags then you should use
exiftool -EXIF= image.jpg
With imagemagick:
convert <input file> -strip <output file>
ImageMagick has the -strip parameter, but it recompresses the image before saving. Thus, this parameter is useless for my need.
This topic from ImageMagick forum explains that there is no support for JPEG lossless operations in ImageMagick (whenever this changes, please post a comment with a link!), and suggests using jpegtran (from libjpeg):
jpegtran -copy none -progressive image.jpg > newimage.jpg
jpegtran -copy none -progressive -outfile newimage.jpg image.jpg
(If you are unsure about me answering my own question, read this and this and this)
You might also want to look into Exiv2 -- it's really fast (C++ and no recompression), it's command line, and it also provides a library for EXIF manipulation you can link against. I don't know how many Linux distros make it available, but in CentOS it's currently available in the base repo.
Usage:
exiv2 rm image.jpg
I'd propose jhead:
man jhead
jhead -purejpg image.jpg
Only 123Kb on debian/ubuntu, it's fast, and it only touches EXIF keeping the image itself intact. Note that you need to create a copy if you want to preserve the original file with EXIF in it.
I recently undertook this project in C. The code below does the following:
1) Gets the current orientation of the image.
2) Removes all data contained in APP1 (Exif data) and APP2 (Flashpix data) by blanking.
3) Recreates the APP1 orientation marker and sets it to the original value.
4) Finds the first EOI marker (End of Image) and truncates the file if nessasary.
Some things to note first are:
1) This program is used for my Nikon camera. Nikon's JPEG format adds somthing to the very end of each file it creates. They encode this data on to the end of the image file by creating a second EOI marker. Normally image programs read up to the first EOI marker found. Nikon has information after this which my program truncates.
2) Because this is for Nikon format, it assumes big endian byte order. If your image file uses little endian, some adjustments need to be made.
3) When trying to use ImageMagick to strip exif data, I noticed that I ended up with a larger file than what I started with. This leads me to believe that Imagemagick is encoding the data you want stripped away, and is storing it somewhere else in the file. Call me old fashioned, but when I remove something from a file, I want a file size the be smaller if not the same size. Any other results suggest data mining.
And here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <libgen.h>
#include <string.h>
#include <errno.h>
// Declare constants.
#define COMMAND_SIZE 500
#define RETURN_SUCCESS 1
#define RETURN_FAILURE 0
#define WORD_SIZE 15
int check_file_jpg (void);
int check_file_path (char *file);
int get_marker (void);
char * ltoa (long num);
void process_image (char *file);
// Declare global variables.
FILE *fp;
int orientation;
char *program_name;
int main (int argc, char *argv[])
{
// Set program name for error reporting.
program_name = basename(argv[0]);
// Check for at least one argument.
if(argc < 2)
{
fprintf(stderr, "usage: %s IMAGE_FILE...\n", program_name);
exit(EXIT_FAILURE);
}
// Process all arguments.
for(int x = 1; x < argc; x++)
process_image(argv[x]);
exit(EXIT_SUCCESS);
}
void process_image (char *file)
{
char command[COMMAND_SIZE + 1];
// Check that file exists.
if(check_file_path(file) == RETURN_FAILURE)
return;
// Check that file is an actual JPEG file.
if(check_file_jpg() == RETURN_FAILURE)
{
fclose(fp);
return;
}
// Jump to orientation marker and store value.
fseek(fp, 55, SEEK_SET);
orientation = fgetc(fp);
// Recreate the APP1 marker with just the orientation tag listed.
fseek(fp, 21, SEEK_SET);
fputc(1, fp);
fputc(1, fp);
fputc(18, fp);
fputc(0, fp);
fputc(3, fp);
fputc(0, fp);
fputc(0, fp);
fputc(0, fp);
fputc(1, fp);
fputc(0, fp);
fputc(orientation, fp);
// Blank the rest of the APP1 marker with '\0'.
for(int x = 0; x < 65506; x++)
fputc(0, fp);
// Blank the second APP1 marker with '\0'.
fseek(fp, 4, SEEK_CUR);
for(int x = 0; x < 2044; x++)
fputc(0, fp);
// Blank the APP2 marker with '\0'.
fseek(fp, 4, SEEK_CUR);
for(int x = 0; x < 4092; x++)
fputc(0, fp);
// Jump the the SOS marker.
fseek(fp, 72255, SEEK_SET);
while(1)
{
// Truncate the file once the first EOI marker is found.
if(fgetc(fp) == 255 && fgetc(fp) == 217)
{
strcpy(command, "truncate -s ");
strcat(command, ltoa(ftell(fp)));
strcat(command, " ");
strcat(command, file);
fclose(fp);
system(command);
break;
}
}
}
int get_marker (void)
{
int c;
// Check to make sure marker starts with 0xFF.
if((c = fgetc(fp)) != 0xFF)
{
fprintf(stderr, "%s: get_marker: invalid marker start (should be FF, is %2X)\n", program_name, c);
return(RETURN_FAILURE);
}
// Return the next character.
return(fgetc(fp));
}
int check_file_jpg (void)
{
// Check if marker is 0xD8.
if(get_marker() != 0xD8)
{
fprintf(stderr, "%s: check_file_jpg: not a valid jpeg image\n", program_name);
return(RETURN_FAILURE);
}
return(RETURN_SUCCESS);
}
int check_file_path (char *file)
{
// Open file.
if((fp = fopen(file, "rb+")) == NULL)
{
fprintf(stderr, "%s: check_file_path: fopen failed (%s) (%s)\n", program_name, strerror(errno), file);
return(RETURN_FAILURE);
}
return(RETURN_SUCCESS);
}
char * ltoa (long num)
{
// Declare variables.
int ret;
int x = 1;
int y = 0;
static char temp[WORD_SIZE + 1];
static char word[WORD_SIZE + 1];
// Stop buffer overflow.
temp[0] = '\0';
// Keep processing until value is zero.
while(num > 0)
{
ret = num % 10;
temp[x++] = 48 + ret;
num /= 10;
}
// Reverse the word.
while(y < x)
{
word[y] = temp[x - y - 1];
y++;
}
return word;
}
Hope this helps someone!
We used this to remove latitude data from TIFF file:
exiv2 mo -M"del Exif.GPSInfo.GPSLatitude" IMG.TIF
where you can use exiv2 -pa IMG.TIF to list all metadata.
Hint for convenience: If you are on Windows, you can apply a REG file to the registry, to install an entry in the context menu, so you can easily remove metadata by right-clicking the file and selecting the command.
For example (remember to edit the paths to point to where the executables are installed on your computer):
For JPEG,JPG,JPE,JFIF files: command "Remove metadata"
(using ExifTool, preserves original file as backup)
exiftool -all= image.jpg
JPG-RemoveExif.reg
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Software\Classes\jpegfile\shell\RemoveMetadata]
#="Remove metadata"
[HKEY_CURRENT_USER\Software\Classes\jpegfile\shell\RemoveMetadata\command]
#="\"C:\\Path to\\exiftool.exe\" -all= \"%1\""
[HKEY_CURRENT_USER\Software\Classes\jpegfile\shell\RemoveMetadata]
"Icon"="C:\\Path to\\exiftool.exe,0"
For PNG files: command "Convert to minified PNG"
(using ImageMagick, changes data overwriting original file)
convert -background none -strip -set filename:n "%t" image.png "%[filename:n].png"
PNG-Minify.reg
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Software\Classes\pngfile\shell\ConvertToMinifiedPNG]
#="Convert to minified PNG"
[HKEY_CURRENT_USER\Software\Classes\pngfile\shell\ConvertToMinifiedPNG\command]
#="\"C:\\Path to\\convert.exe\" -background none -strip -set filename:n \"%%t\" \"%1\" \"%%[filename:n].png\""
[HKEY_CURRENT_USER\Software\Classes\pngfile\shell\ConvertToMinifiedPNG]
"Icon"="C:\\Path to\\convert.exe,0"
Related: convert PNGs to ICO in context menu.
For lossless EXIF strip you can use libexif, which is available with cygwin. Remove both EXIF and thumbnail to anonymize an image:
$ exif --remove --tag=0 --remove-thumbnail exif.jpg -o anonymized.jpg
Drag-n-drop .bat file for use with cygwin:
#ECHO OFF
exif --remove --tag=0 --remove-thumbnail %~1
If you already use jpegoptim you can use it to remove the exif, too.
jpegoptim -s *

Resources