Related
I'm using the web audio API to apply effects and export a file. It works great, but I would like to export the file at 32 bit instead of 16.
How could I alter the below settings to achieve this?
setUint32(0x46464952); // "RIFF"
setUint32(length - 8); // file length - 8
setUint32(0x45564157); // "WAVE"
setUint32(0x20746d66); // "fmt " chunk
setUint32(16); // length = 16
setUint16(1); // PCM (uncompressed)
setUint16(numOfChan);
setUint32(abuffer.sampleRate);
setUint32(abuffer.sampleRate * 2 * numOfChan); // avg. bytes/sec
setUint16(numOfChan * 2); // block-align
setUint16(16);
setUint32(0x61746164); // "data" - chunk
setUint32(length - pos - 4); // chunk length
// write interleaved data
for (i = 0; i < abuffer.numberOfChannels; i++)
channels.push(abuffer.getChannelData(i));
while (pos < length) {
for (i = 0; i < numOfChan; i++) {
// interleave channels
sample = Math.max(-1, Math.min(1, channels[i][offset])); // clamp
sample = (0.5 + sample < 0 ? sample * 32768: sample * 32767) | 0;
view.setInt16(pos, sample, true); // update data chunk
pos += 2;
}
offset++; // next source sample
}
The basic idea is simple. Convert the float to an Uint32 by using an array view. Then write out the uint32 values instead of int16 values. No clipping is needed. And be sure to output the correct wav header for the changed length and format type.
I know Chromium has some code to do this for WebAudio testing. But you'll have to abide by Chromium licenses to use it.
I have been trying to use NDI SDK 4.5, in a Objective-C iOS-13 app, to broadcast camera capture from iPhone device.
My sample code is in public Github repo: https://github.com/bharatbiswal/CameraExampleObjectiveC
Following is how I send CMSampleBufferRef sampleBuffer:
CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
NDIlib_video_frame_v2_t video_frame;
video_frame.xres = VIDEO_CAPTURE_WIDTH;
video_frame.yres = VIDEO_CAPTURE_HEIGHT;
video_frame.FourCC = NDIlib_FourCC_type_UYVY; // kCVPixelFormatType_420YpCbCr8BiPlanarFullRange
video_frame.line_stride_in_bytes = VIDEO_CAPTURE_WIDTH * VIDEO_CAPTURE_PIXEL_SIZE;
video_frame.p_data = CVPixelBufferGetBaseAddress(pixelBuffer);
NDIlib_send_send_video_v2(self.my_ndi_send, &video_frame);
I have been using "NewTek NDI Video Monitor" to receive the video from network. However, even though it shows as source, the video does not play.
Has anyone used NDI SDK in iOS to build broadcast sender or receiver functionalities? Please help.
You should use kCVPixelFormatType_32BGRA in video settings. And NDIlib_FourCC_type_BGRA as FourCC in NDIlib_video_frame_v2_t.
Are you sure about your VIDEO_CAPTURE_PIXEL_SIZE ?
When I worked with NDI on macos I had the same black screen problem and it was due to a wrong line stride.
Maybe this can help : https://developer.apple.com/documentation/corevideo/1456964-cvpixelbuffergetbytesperrow?language=objc ?
Also it seems the pixel formats from core video and NDI don't match.
On the core video side you are using Bi-Planar Y'CbCr 8-bit 4:2:0, and on the NDI side you are using NDIlib_FourCC_type_UYVY which is Y'CbCr 4:2:2.
I cannot find any Bi-Planar Y'CbCr 8-bit 4:2:0 pixel format on the NDI side.
You may have more luck using the following combination:
core video: https://developer.apple.com/documentation/corevideo/1563591-pixel_format_identifiers/kcvpixelformattype_420ypcbcr8planarfullrange?language=objc
NDI: NDIlib_FourCC_type_YV12
Hope this helps!
In my experience, you have two mistake. To use CVPixelBuffer's CVPixelBufferGetBaseAddress, the CVPixelBufferLockBaseAddress method must be called first. Otherwise, it returns a null pointer.
https://developer.apple.com/documentation/corevideo/1457128-cvpixelbufferlockbaseaddress?language=objc
Secondly, NDI does not support YUV420 biplanar. (The default format for iOS cameras.) More precisely, NDI only accepts one data pointer. In other words, you have to merge the biplanar memory areas into one, and then pass it in NV12 format. See the NDI document for details.
So your code should look like this: And if sending asynchronously instead of NDIlib_send_send_video_v2, a strong reference to the transferred memory area must be maintained until the transfer operation by the NDI library is completed.
CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
CVPixelBufferLockBaseAddress(pixelBuffer, 0);
int width = (int)CVPixelBufferGetWidth(pixelBuffer);
int height = (int)CVPixelBufferGetHeight(pixelBuffer);
OSType pixelFormat = CVPixelBufferGetPixelFormatType(pixelBuffer);
NDIlib_FourCC_video_type_e ndiVideoFormat;
uint8_t* pixelData;
int stride;
if (pixelFormat == kCVPixelFormatType_32BGRA) {
ndiVideoFormat = NDIlib_FourCC_type_BGRA;
pixelData = (uint8_t*)CVPixelBufferGetBaseAddress(pixelBuffer); // Or copy for asynchronous transmit.
stride = width * 4;
} else if (pixelFormat == kCVPixelFormatType_420YpCbCr8BiPlanarFullRange) {
ndiVideoFormat = NDIlib_FourCC_type_NV12;
pixelData = (uint8_t*)malloc(width * height * 1.5);
uint8_t* yPlane = (uint8_t*)CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, 0);
int yPlaneBytesPerRow = (int)CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, 0);
int ySize = yPlaneBytesPerRow * height;
uint8_t* uvPlane = (uint8_t*)CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, 1);
int uvPlaneBytesPerRow = (int)CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, 1);
int uvSize = uvPlaneBytesPerRow * height;
stride = yPlaneBytesPerRow;
memcpy(pixelData, yPlane, ySize);
memcpy(pixelData + ySize, uvPlane, uvSize);
} else {
return;
}
NDIlib_video_frame_v2_t video_frame;
video_frame.xres = width;
video_frame.yres = height;
video_frame.FourCC = ndiVideoFormat;
video_frame.line_stride_in_bytes = stride;
video_frame.p_data = pixelData;
NDIlib_send_send_video_v2(self.my_ndi_send, &video_frame); // synchronous sending.
CVPixelBufferUnlockBaseAddress(pixelBuffer, 0);
// For synchrnous sending case. Free data or use pre-allocated memory.
if (pixelFormat == kCVPixelFormatType_420YpCbCr8BiPlanarFullRange) {
free(pixelData);
}
I'm working on a video stream in real time that I try to process with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)
Each frame has to be captured, lightly blured, and whenever I can, I need to do some hard-work calculations on the 10 latest frames.
So I need to capture ALL the frames at 30 fps, and I expect to get the hard-work result at 5 fps.
My problems is that I cannot keep the capture running at the right pace : it seems that the hard-work calculation slows down the capture of frames, either at CPU level or at GPU level. I miss some frames...
I tried many solutions. None worked:
I tried to set-up jobs on 2 streams (image below):
the host gets a frame
First stream (called Stream2) : cudaMemcpyAsync copies the frame on the Device. Then, a first kernel does the basic bluring calculations. (In the attached image, bluring appears as a short slot at 3.07 s and 3.085 s. And then nothing... until the big part has finished)
the host checks if the second stream is "available" thanks to a CudaEvent, and lauches it if possible. Practically, the stream is available 1/2 of tries.
Second stream (called Stream4) : starts hard-work calculations in a kernel ( kernelCalcul_W2), outputs the result, and records an Event.
NSight capture
Practically, I wrote :
cudaStream_t sHigh, sLow;
cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&sLow, cudaStreamNonBlocking, priority_low);
cudaEvent_t event_1;
cudaEventCreate(&event_1);
if (frame has arrived)
{
cudaMemcpyAsync(..., sHigh); // HtoD, to upload images in the GPU
blur_Image <<<... , sHigh>>> (...)
if (cudaEventQuery(event_1)==cudaSuccess)) hard_work(sLow);
else printf("Event 2 not ready\n");
}
void hard_work( cudaStream_t sLow_)
{
kernelCalcul_W2<<<... , sLow_>>> (...);
cudaMemcpyAsync(... the result..., sLow_); //DtoH
cudaEventRecord(event_1, sLow_);
}
I tried to use only one stream. It's the same code as above, but change 1 parameter while launching hard_work.
host gets a frame
Stream: cudaMemcpyAsync copies the frame on the Device. Then, the kernel does the basic bluring calculations. Then, if the CudaEvent Event_1 is ok, I lauch the hard-work, and I add an Event_1 to get the status on next round.
Practically, the stream is ALWAYS available: I never fall in the "else" part.
This way, while the hard-work is running, I expected to "buffer" all the frames to copy, and not to lose any. But I do lose some: it turns out that each time I get a frame and I copy it, Event_1 seems ok so I launch the hard-work, and only get the the next frame very late.
I tried to put the two streams in two different threads (in C). Not better (even worse).
So the question is: how to ensure that the first stream captures ALL frames?
I really have the feeling that the different streams block the CPU.
I display the images with OpenGL. Would it interfere?
Any idea of ways to improve this?
Thanks a lot!
EDIT:
As requested, I put here a MCVE.
There is a parameter you can tune (#define ADJUST) to see what's happening. Basically, the main procedure sends CUDA requests in Async mode, but it seems to block the main thread. As you will see in the image, I have "memory access" (i.e. images captured ) every 30 ms except when the hard-work is running (then, I just don't get images).
Last detail: I'm using CUDA 7.5 to run this. I tried to install 8.0 but apparently the compiler is still 7.5
#define _USE_MATH_DEFINES 1
#define _CRT_SECURE_NO_WARNINGS 1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <Windows.h>
#define ADJUST 400
// adjusting this paramter may make the problem occur.
// Too high => probably watchdog will stop the kernel
// too low => probably the kernel will run smothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result;
long a, b, n, n0;
float input;
b = 3;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
else printf("running ok\n");
// copy the result back to Host
//printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
// storing the images in the Host memory. I know I could optimize with cudaHostAllocate.
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
free(images_as_Output_in_Host);
free(camera);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("\nDealing with image %d\n", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0., sizeof(float) * imagePixelSize, s1);
// blur image
blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the bluring, and tries to do the hard-work
Sleep(30); // to simulate Camera
}
releaseMapFile();
getchar();
}
The primary issue here is that cudaMemcpyAsync is only a properly non-blocking async operation if the host memory involved is pinned, i.e. allocated using cudaHostAlloc. This characteristic is covered in several places, including the API documentation and the relevant programming guide section.
The following modification to your code (to run on linux, which I prefer) demonstrates the behavioral difference:
$ cat t33.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#define ADJUST 400
// adjusting this paramter may make the problem occur.
// Too high => probably watchdog will stop the kernel
// too low => probably the kernel will run smothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result;
long a, n, n0;
float input;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
#ifndef QUICK
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
else printf("running ok\n");
// copy the result back to Host
//printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize/2, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
#endif
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
#ifdef USE_HOST_ALLOC
cudaHostAlloc(&camera, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
cudaHostAlloc(&images_as_Unsigned_in_Host, imagePixelSize*sizeof(unsigned short)*totNbOfImages, cudaHostAllocDefault);
cudaHostAlloc(&images_as_Output_in_Host, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
#else
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
#endif
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
cudaEventRecord(event_2, s2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("\nDealing with image %d\n", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) * imagePixelSize, s1);
// blur image
blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the bluring, and tries to do the hard-work
usleep(30000); // to simulate Camera
}
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("some CUDA error: %s\n", cudaGetErrorString(err));
releaseMapFile();
}
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard work is launched after image 1 is captured ==> running ok
Dealing with image 2
Hard work is launched after image 2 is captured ==> running ok
Dealing with image 3
Hard work is launched after image 3 is captured ==> running ok
Dealing with image 4
Hard work is launched after image 4 is captured ==> running ok
Dealing with image 5
Hard work is launched after image 5 is captured ==> running ok
Dealing with image 6
Hard work is launched after image 6 is captured ==> running ok
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard work is launched after image 8 is captured ==> running ok
Dealing with image 9
Hard work is launched after image 9 is captured ==> running ok
real 0m2.790s
user 0m0.688s
sys 0m0.966s
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu -DUSE_HOST_ALLOC
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard_work still running, so unable to process after image 1
Dealing with image 2
Hard_work still running, so unable to process after image 2
Dealing with image 3
Hard_work still running, so unable to process after image 3
Dealing with image 4
Hard_work still running, so unable to process after image 4
Dealing with image 5
Hard_work still running, so unable to process after image 5
Dealing with image 6
Hard_work still running, so unable to process after image 6
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard_work still running, so unable to process after image 8
Dealing with image 9
Hard_work still running, so unable to process after image 9
real 0m1.721s
user 0m0.028s
sys 0m0.629s
$
In the USE_HOST_ALLOC case above, the launch pattern for the low-priority kernel is intermittent, as expected, and the overall run time is considerably shorter.
In short, if you want the expected behavior out of cudaMemcpyAsync, make sure any participating host allocations are page-locked.
A pictorial (profiler) example of the effect that pinning can have on multi-stream behavior can be seen in this answer.
I need to extract frames from a video in my Qt based application. Using ffmpeg libraries I am able to fetch frames as AVFrames which I need to convert to QImage to use in other parts of my application. This conversion needs to be efficient. So far it seems sws_scale() is the right function to use but I am not sure what source and destination pixel formats are to be specified.
Came up with the following 2-step process that first converts a decoded AVFame to another AVFrame in RGB colorspace and then to QImage. It works and is reasonably fast.
src_frame = get_decoded_frame();
AVFrame *pFrameRGB = avcodec_alloc_frame(); // intermediate pframe
if(pFrameRGB==NULL) {
;// Handle error
}
int numBytes= avpicture_get_size(PIX_FMT_RGB24,
is->video_st->codec->width, is->video_st->codec->height);
uint8_t *buffer = (uint8_t*)malloc(numBytes);
avpicture_fill((AVPicture*)pFrameRGB, buffer, PIX_FMT_RGB24,
is->video_st->codec->width, is->video_st->codec->height);
int dst_fmt = PIX_FMT_RGB24;
int dst_w = is->video_st->codec->width;
int dst_h = is->video_st->codec->height;
// TODO: cache following conversion context for speedup,
// and recalculate only on dimension changes
SwsContext *img_convert_ctx_temp;
img_convert_ctx_temp = sws_getContext(
is->video_st->codec->width, is->video_st->codec->height,
is->video_st->codec->pix_fmt,
dst_w, dst_h, (PixelFormat)dst_fmt,
SWS_BICUBIC, NULL, NULL, NULL);
QImage *myImage = new QImage(dst_w, dst_h, QImage::Format_RGB32);
sws_scale(img_convert_ctx_temp,
src_frame->data, src_frame->linesize, 0, is->video_st->codec->height,
pFrameRGB->data,
pFrameRGB->linesize);
uint8_t *src = (uint8_t *)(pFrameRGB->data[0]);
for (int y = 0; y < dst_h; y++)
{
QRgb *scanLine = (QRgb *) myImage->scanLine(y);
for (int x = 0; x < dst_w; x=x+1)
{
scanLine[x] = qRgb(src[3*x], src[3*x+1], src[3*x+2]);
}
src += pFrameRGB->linesize[0];
}
If you find a more efficient approach, let me know in the comments
I know, it's too late, but maybe someone will find it useful. From here I got the clue of doing the same conversion, which looks a bit shorter.
So I created QImage which is reused for every decoded frame:
QImage img( width, height, QImage::Format_RGB888 );
Created frameRGB:
frameRGB = av_frame_alloc();
//Allocate memory for the pixels of a picture and setup the AVPicture fields for it.
avpicture_alloc( ( AVPicture *) frameRGB, AV_PIX_FMT_RGB24, width, height);
After the the first frame is decoded I create conversion context SwsContext this way (it will be used for all the next frames):
mImgConvertCtx = sws_getContext( codecContext->width, codecContext->height, codecContext->pix_fmt, width, height, AV_PIX_FMT_RGB24, SWS_BICUBIC, NULL, NULL, NULL);
And finally for every decoded frame conversion is performed:
if( 1 == framesFinished && nullptr != imgConvertCtx )
{
//conversion frame to frameRGB
sws_scale(imgConvertCtx, frame->data, frame->linesize, 0, codecContext->height, frameRGB->data, frameRGB->linesize);
//setting QImage from frameRGB
for( int y = 0; y < height; ++y )
memcpy( img.scanLine(y), frameRGB->data[0]+y * frameRGB->linesize[0], mWidth * 3 );
}
See the link for the specifics.
A simpler approach, I think:
void takeSnapshot(AVCodecContext* dec_ctx, AVFrame* frame)
{
SwsContext* img_convert_ctx;
img_convert_ctx = sws_getContext(dec_ctx->width,
dec_ctx->height,
dec_ctx->pix_fmt,
dec_ctx->width,
dec_ctx->height,
AV_PIX_FMT_RGB24,
SWS_BICUBIC, NULL, NULL, NULL);
AVFrame* frameRGB = av_frame_alloc();
avpicture_alloc((AVPicture*)frameRGB,
AV_PIX_FMT_RGB24,
dec_ctx->width,
dec_ctx->height);
sws_scale(img_convert_ctx,
frame->data,
frame->linesize, 0,
dec_ctx->height,
frameRGB->data,
frameRGB->linesize);
QImage image(frameRGB->data[0],
dec_ctx->width,
dec_ctx->height,
frameRGB->linesize[0],
QImage::Format_RGB888);
image.save("capture.png");
}
Today, I have tested directly pass the image->bit() to swscale and finally it works, so it doesn't need to copy to memory. For example:
/* 1. Get frame and QImage to show */
struct my_frame *frame = get_frame(source);
QImage *myImage = new QImage(dst_w, dst_h, QImage::Format_RGBA8888);
/* 2. Convert and write into image buffer */
uint8_t *dst[] = {myImage->bits()};
int linesizes[4];
av_image_fill_linesizes(linesizes, AV_PIX_FMT_RGBA, frame->width);
sws_scale(myswscontext, frame->data, (const int*)frame->linesize,
0, frame->height, dst, linesizes);
I just discovered that scanLine is just seeking thru the buffer.. all you need is use AV_PIX_FMT_RGB32 for the AVFrame and QImage::FORMAT_RGB32 for the QImage.
Then after decoding just do a memcpy
memcpy(img.scanLine(0), pFrameRGB->data[0], pFrameRGB->linesize[0] * pFrameRGB->height());
I had problems with the other proposed solutions as :
They did not mention freeing either AVFrame, SwsContext or the allocated buffers, which caused massive memory leaks (I had thousands of frames to handle). These problems couldn't all be solved easily as QImage relies on the underlying data, and does not copy it. If freeing the buffer directly, the QImage points to freed data and breaks. This could be solved by using QImage's cleanupFunction to free the buffer once the image is no longer needed, but with other problems it wasn't good anyways.
In some cases one of the suggestions, of passing QImage.bits directly to sws_scale, would not work as QImage are minimum 32 bit aligned. Therefore for certain dimensions it would not match the expected width by sws_scale and output each line shifted a little bit.
A third problem is that they used deprecated AVPicture elements.
I listed the problem in another question Converting an AVFrame to QImage with conversion of pixel format and in the end found a solution using a temporary buffer, which could be copied to the QImage, and then safely freed.
So see my answer for a fully working, efficient, and with no deprecated function calls, implementation : https://stackoverflow.com/a/68212609/7360943
I am trying to program a MSP430 with a simple "FIR filter" program, that looks like the following:
#include "msp430x22x4.h"
#include "legacymsp430.h"
#define FILTER_LENGTH 4
#define TimerA_counter_value 12000 // 12000 counts/s -> 12000 counts ~ 1 Hz
int i;
double x[FILTER_LENGTH+1] = {0,0,0,0,0};
double y = 0;
double b[FILTER_LENGTH+1] = {0.0338, 0.2401, 0.4521, 0.2401, 0.0338};
signed char floor_and_convert(double y);
void setup(void)
{
WDTCTL = WDTPW + WDTHOLD; // Stop WDT
BCSCTL1 = CALBC1_8MHZ; // Set DCO
DCOCTL = CALDCO_8MHZ;
/* Setup Port 3 */
P3SEL |= BIT4 + BIT5; // P3.4,5 = USART0 TXD/RXD
P3DIR |= BIT4; // P3.4 output direction
/* UART */
UCA0CTL1 = UCSSEL_2; // SMCLK
UCA0BR0 = 0x41; // 9600 baud from 8Mhz
UCA0BR1 = 0x3;
UCA0MCTL = UCBRS_2;
UCA0CTL1 &= ~UCSWRST; // **Initialize USCI state machine**
IE2 |= UCA0RXIE; // Enable USCI_A0 RX interrupt
/* Setup TimerA */
BCSCTL3 |= LFXT1S_2; // LFXT1S_2: Mode 2 for LFXT1 = VLO
// VLO provides a typical frequency of 12kHz
TACCTL0 = CCIE; // TACCR0 Capture/compare interrupt enable
TACCR0 = TimerA_counter_value; // Timer A Capture/Compare 0: -> 25 Hz
TACTL = TASSEL_1; // TASSEL_1: Timer A clock source select: 1 - ACLK
TACTL |= MC_1; // Start Timer_A in up mode
__enable_interrupt();
}
void main(void) // Beginning of program
{
setup(); // Call Function setup (see above)
_BIS_SR(LPM3_bits); // Enter LPM0
}
/* USCIA interrupt service routine */
/*#pragma vector=USCIAB0RX_VECTOR;*/
/*__interrupt void USCI0RX_ISR(void)*/
interrupt (USCIAB0RX_VECTOR) USCI0RX_ISR(void)
{
TACTL |= MC_1; // Start Timer_A in up mode
x[0] = (double)((signed char)UCA0RXBUF); // Read received sample and perform type casts
y = 0;
for(i = 0;i <= FILTER_LENGTH;i++) // Run FIR filter for each received sample
{
y += b[i]*x[i];
}
for(i = FILTER_LENGTH-1;i >= 0;i--) // Roll x array in order to hold old sample inputs
{
x[i+1] = x[i];
}
while (!(IFG2&UCA0TXIFG)); // Wait until USART0 TX buffer is ready?
UCA0TXBUF = (signed char) y;
TACTL |= TACLR; // Clear TimerA (prevent interrupt during receive)
}
/* Timer A interrupt service routine */
/*#pragma vector=TIMERA0_VECTOR;*/
/*__interrupt void TimerA_ISR (void)*/
interrupt (TIMERA0_VECTOR) TimerA_ISR(void)
{
for(i = 0;i <= FILTER_LENGTH;i++) // Clear x array if no data has arrived after 1 sec
{
x[i] = 0;
}
TACTL &= ~MC_1; // Stops TimerA
}
The program interacts with a MatLab code, that sends 200 doubles to the MSP, for processing in the FIR filter. My problem is, that the MSP is not able to deal with the doubles.
I am using the MSPGCC to compile the code. When I send a int to the MSP it will respond be sending a int back again.
Your problem looks like it is in the way that the data is being sent to the MSP.
The communications from MATLAB is, according to your code, a sequence of 4 binary byte values that you then take from the serial port and cast it straight to a double. The value coming in will have a range -128 to +127.
If your source data is any other data size then your program will be broken. If your data source is providing binary "double" data then each value may be 4 or 8 bytes long depending upon its internal data representation. Sending one of these values over the serial port will be interpreted by the MSP as a full set of 4 input samples, resulting in absolute garbage for a set of answers.
The really big question is WHY ON EARTH ARE YOU DOING THIS IN FLOATING POINT - on a 16 bit integer processor that (many versions) have integer multiplier hardware.
As Ian said, You're taking an 8bit value (UCA0RXBUF is only 8 bits wide anyway) and expecting to get a 32bit or 64 bit value out of it.
In order to get a proper sample you would need to read UCA0RXBUF multiple times and then concatenate each 8 bit value into 32/64 bits which you then would cast to a double.
Like Ian I would also question the wisdom of doing floating point math in a Low power embedded microcontroller. This type of task is much better suited to a DSP.
At least you should use fixed point math, seewikipedia (even in a DSP you would use fixed point arithmetic).
Hmm. Actually the code is made of my teacher, I'm just trying to make it work on my Mac, and not in AIR :-)
MATLAB code is like this:
function FilterTest(comport)
Fs = 100; % Sampling Frequency
Ts = 1/Fs; % Sampling Periode
L = 200; % Number of samples
N = 4; % Filter order
Fcut = 5; % Cut-off frequency
B = fir1(N,Fcut/(Fs/2)) % Filter coefficients in length N+1 vector B
t = [0:L-1]*Ts; % time array
A_m = 80; % Amplitude of main component
F_m = 5; % Frequency of main component
P_m = 80; % Phase of main component
y_m = A_m*sin(2*pi*F_m*t - P_m*(pi/180));
A_s = 40; % Amplitude of secondary component
F_s = 40; % Frequency of secondary component
P_s = 20; % Phase of secondary component
y_s = A_s*sin(2*pi*F_s*t - P_s*(pi/180));
y = round(y_m + y_s); % sum of main and secondary components (rounded to integers)
y_filt = round(filter(B,1,y)); % filtered data (rounded to integers)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Serial_port_object = serial(comport); % create Serial port object
set(Serial_port_object,'InputBufferSize',L) % set InputBufferSize to length of data
set(Serial_port_object,'OutputBufferSize',L) % set OutputBufferSize to length of data
fopen(Serial_port_object) % open Com Port
fwrite(Serial_port_object,y,'int8'); % send out data
data = fread(Serial_port_object,L,'int8'); % read back data
fclose(Serial_port_object) % close Com Port
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
subplot(2,1,1)
hold off
plot(t,y)
hold on
plot(t,y_filt,'r')
plot(t,y_filt,'ro')
plot(t,data,'k.')
ylabel('Amplitude')
legend('y','y filt (PC)','y filt (PC)','y filt (muP)')
subplot(2,1,2)
hold off
plot(t,data'-y_filt)
hold on
xlabel('time')
ylabel('muP - PC')
figure(1)
It is also not advised to keep interrupt routines doing long processing routines, because you will impact on interrupt latency. Bytes comming from the PC can get easily lost, because of buffer overrun on the serial port.
The best is to build a FIFO buffer holding a resonable number of input values. The USCI routine fills the FIFO while the main program keeps looking for data inside it and process them as they are available.
This way, while the data is being processed, the USCI can interrupt to handle new incomming bytes.
When the FIFO is empty, you can put the main process in a suitable LPM mode to conserve power (and this is the best MSP430 feature). The USCI routine will wake the CPU up when a data is ready (just put the WAKEUP attribute in the USCI handler if you are using MSPGCC).
In such a scenario be sure to declare volatile every variable that are shared between interrupt routines and the main process.