I want to display an image that I receive from a server as a short[] of pixels.
The server (C++) writes the image as an unsigned short[] of pixels (12-bit depth).
My Java application gets the image through a CORBA call to this server.
Since Java does not have an unsigned short type, the pixels are stored as a (signed) short[].
This is the code I'm using to obtain an image from the array:
private WritableImage loadImage(short[] pixels, int width, int height) {
    int[] intPixels = new int[pixels.length];
    for (int i = 0; i < pixels.length; i++) {
        intPixels[i] = (int) pixels[i];
    }
    BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
    WritableRaster raster = (WritableRaster) image.getData();
    raster.setPixels(0, 0, width, height, intPixels);
    return SwingFXUtils.toFXImage(image, null);
}
And later:
WritableImage orgImage = convertShortArrayToImage2(image.data, image.size_x, image.size_y);
//load it into the widget
Platform.runLater(() -> {
    imgViewer.setImage(orgImage);
});
I've checked that width=1280 and height=1024 and that the pixels array has 1280x1024 elements, which matches the raster's width and height.
However, I'm getting an array index out of bounds error on this line:
    raster.setPixels(0, 0, width, height, intPixels);
I have tried ALL image types, and all of them produce the same error except for:
TYPE_USHORT_GRAY: which I thought would be the right one, but it shows an all-black image
TYPE_BYTE_GRAY: which shows the image in negative(!) and with a lot of grain(?)
TYPE_BYTE_INDEXED: which looks like the above, but colorized in a funny way
I have also tried masking the bits when converting from short to int, without any difference:
    intPixels[i] = (int) pixels[i] & 0xffff;
So... I'm quite frustrated after days of searching the internet for a solution. Any help is very welcome.
Edit: the following is an example of the images received, converted to JPG on the server side. I'm not sure it is useful, since I think it was produced with pixel rescaling (sqrt):
Well, I finally solved it.
It is probably not the best solution, but it works and could help someone out there.
Since the image is grayscale with 12-bit depth, I used a BufferedImage of type TYPE_BYTE_GRAY, but I had to downsample to 8 bits by scaling the pixel array from 0-4095 to 0-255.
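As a rough illustration of the downsampling step described above, the following Java sketch maps 12-bit samples stored in a signed short[] to the 0-255 range with a fixed linear scale. The class and method names are made up for this example, and the fixed 0-4095 input range is an assumption; it does not address the choice of better display limits.

```java
/** Hypothetical helper for illustration; not part of the original post. */
final class TwelveBitScaler {
    /** Largest value a 12-bit sample can hold (assumed input range 0..4095). */
    static final int MAX_12_BIT = 4095;

    /** Converts one raw sample to an 8-bit gray level. */
    static int toEightBit(short raw) {
        // Java shorts are signed: mask to read the bits as an unsigned value.
        int unsigned = raw & 0xFFFF;
        // Guard against values outside the assumed 12-bit range.
        int clamped = Math.min(unsigned, MAX_12_BIT);
        // Integer linear scale: 0 maps to 0, 4095 maps to 255.
        return clamped * 255 / MAX_12_BIT;
    }

    /** Converts a whole image buffer, leaving the input untouched. */
    static int[] toEightBit(short[] pixels) {
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = toEightBit(pixels[i]);
        }
        return out;
    }
}
```

A fixed scale like this tends to give a dim, low-contrast picture when the data only occupies a small part of the 12-bit range, which is why the choice of display limits matters. It is probably also why TYPE_USHORT_GRAY looked all black in the question: 12-bit values use at most about 6% of the 16-bit range.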
I had trouble establishing the upper and lower limits of the scale. I tested with the average of the n highest/lowest values, which worked reasonably well, until someone sent me a link to a Java program implementing the zscale algorithm (used in the DS9 tool, for example) to obtain the limits of the range of greyscale values to display.
You can find it at the URL in the first comment line of the code below.
From that point I modified the previous code and it worked like a charm:
//https://github.com/Caltech-IPAC/firefly/blob/dev/src/firefly/java/edu/caltech/ipac/visualize/plot/Zscale.java
Zscale.ZscaleRetval retval = Zscale.cdl_zscale(pixels, width, height,
        bitsVal, contrastVal, opt_sizeVal, len_stdlineVal, blankValueVal);
double Z1 = retval.getZ1();
double Z2 = retval.getZ2();
try {
    int[] ints = new int[pixels.length];
    for (int i = 0; i < pixels.length; i++) {
        if (pixels[i] < Z1) {
            pixels[i] = (short) Z1;
        } else if (pixels[i] > Z2) {
            pixels[i] = (short) Z2;
        }
        ints[i] = ((int) ((pixels[i] - Z1) * 255 / (Z2 - Z1)));
    }
    BufferedImage bImg
            = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
    bImg.getRaster().setPixels(0, 0, width, height, ints);
    return SwingFXUtils.toFXImage(bImg, null);
} catch (Exception ex) {
    System.out.println(ex.getMessage());
}
return null;
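For readers who want to try the clamp-and-stretch step without the Zscale class or JavaFX, here is a minimal self-contained sketch. The class name GrayWindow and the parameters low and high are hypothetical stand-ins for Z1 and Z2; how those limits are obtained is left out. Unlike the snippet above, it does not modify the input array. It writes through getRaster() because BufferedImage.getData() returns a copy of the raster, so pixels set on that copy never reach the image.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper for illustration; not part of the original post. */
final class GrayWindow {
    /**
     * Clamps each sample to [low, high] and stretches that window to 0..255.
     * low and high play the role of Z1 and Z2; they are plain parameters here.
     */
    static BufferedImage toGrayImage(short[] pixels, int width, int height,
                                     double low, double high) {
        if (pixels.length != width * height) {
            throw new IllegalArgumentException("pixel count does not match width * height");
        }
        if (!(high > low)) {
            throw new IllegalArgumentException("high must be greater than low");
        }
        int[] samples = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            double value = pixels[i] & 0xFFFF;                    // read as unsigned
            double clamped = Math.max(low, Math.min(high, value));
            samples[i] = (int) Math.round((clamped - low) * 255.0 / (high - low));
        }
        // TYPE_BYTE_GRAY has one band, so setPixels expects width * height samples.
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        image.getRaster().setPixels(0, 0, width, height, samples);
        return image;
    }
}
```

The one-band layout is also the likely reason TYPE_INT_RGB failed in the question: a three-band raster expects three samples per pixel from setPixels, so an array of width * height values is too short.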
I have a requirement that the bullets in a specific implementation of a scatterplot need to have labels next to them. However, it is known that many of the data points in the set are identical or very close to one another, so if I were to place the labels at a fixed offset relative to the bullet, they would stack on top of each other and not be readable.
I want to implement this so that the labels give way to each other, moving around so that they don't overlap. I think this is a common enough idea that some approach already exists, but I have no idea what to search for. Does this concept have a name?
I would of course appreciate an implementation example, but that is not the most important thing. I am sure I can solve it myself, but I'd rather not reinvent something that someone else has already done better.
The image above displays examples of bullets on top of and close to each other
I ended up finding inspiration in Simulated Annealing.
My solution looks like this
// Assumed shape of Coordinate; the original post does not show its definition.
interface Coordinate {
    x: number;
    y: number;
}

/**
 * Implements an algorithm for placing labels on a chart in a way so that they
 * do not overlap as much.
 * The approach is inspired by Simulated Annealing
 * (https://en.wikipedia.org/wiki/Simulated_annealing)
 */
export class Placer {
    private knownPositions: Coordinate[];
    private START_RADIUS = 20;
    private RUNS = 15;
    private ORIGIN_WEIGHT = 2;

    constructor() {
        this.knownPositions = [];
    }

    /**
     * Get a good spot to place the object.
     *
     * Given a start coordinate, this method tries to find the best place
     * that is close to that point but not too close to other known points.
     *
     * @param {Coordinate} coordinate
     * @returns {Coordinate}
     */
    getPlacement(coordinate: Coordinate): Coordinate {
        let radius = this.START_RADIUS;
        let lastPosition = coordinate;
        let lastScore = 0;
        while (radius > 0) {
            const newPosition = this.getRandomPosition(coordinate, radius);
            const newScore = this.getScore(newPosition, coordinate);
            if (newScore > lastScore) {
                lastPosition = newPosition;
                lastScore = newScore;
            }
            radius -= this.START_RADIUS / this.RUNS;
        }
        this.knownPositions.push(lastPosition);
        return lastPosition;
    }

    /**
     * Return a random point on the radius around the position
     *
     * @param {Coordinate} position Center point
     * @param {number} radius Distance from `position` to find a point
     * @returns {Coordinate} A random point `radius` distance away from
     * `position`
     */
    private getRandomPosition(position: Coordinate, radius: number): Coordinate {
        // Random angle in radians; the original called a `radians(degrees)`
        // helper that is not shown in the post.
        const randomRotation = Math.random() * 2 * Math.PI;
        const xOffset = Math.cos(randomRotation) * radius;
        const yOffset = Math.sin(randomRotation) * radius;
        return {
            x: position.x + xOffset,
            y: position.y + yOffset,
        };
    }

    /**
     * Returns a number score of a position. The further away it is from any
     * other known point, the better the score (bigger number), however, it
     * suffers a subtraction in score the further away it gets from its origin
     * point.
     *
     * @param {Coordinate} position The position to score
     * @param {Coordinate} origin The initial position before looking for
     * better ones
     * @returns {number} The representation of the score
     */
    private getScore(position: Coordinate, origin: Coordinate): number {
        let closest: number | null = null;
        for (const knownPosition of this.knownPositions) {
            const distance = Math.sqrt(
                Math.pow(knownPosition.x - position.x, 2) +
                Math.pow(knownPosition.y - position.y, 2)
            );
            if (closest === null || distance < closest) {
                closest = distance;
            }
        }
        const distanceToOrigin = Math.sqrt(
            Math.pow(origin.x - position.x, 2) +
            Math.pow(origin.y - position.y, 2)
        );
        // With no known positions yet, the nearest distance counts as 0,
        // which is what `null - x` evaluated to in the original JavaScript.
        return (closest === null ? 0 : closest) - (distanceToOrigin / this.ORIGIN_WEIGHT);
    }
}
There is room for improvement in the getScore method, but the results are good enough for my case.
Basically, each point tries to move to a random position at a given radius from its origin and checks whether that position is "better" than the best one found so far. The algorithm keeps doing that with a smaller and smaller radius until the radius reaches 0.
The class keeps track of all known points, so that when you try to place point number two, the scoring can account for the presence of point number one.
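The same search can be sketched in Java to make its behaviour easier to test. This is a hypothetical port, not the original TypeScript class: the names LabelPlacer, place and score are made up, the random generator takes a seed so runs are repeatable, and the radius is derived from an integer run counter instead of repeated floating-point subtraction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical Java sketch of the shrinking-radius search described above. */
final class LabelPlacer {
    private static final double START_RADIUS = 20.0;
    private static final int RUNS = 15;
    private static final double ORIGIN_WEIGHT = 2.0;

    private final List<double[]> known = new ArrayList<>();
    private final Random random;

    LabelPlacer(long seed) {
        this.random = new Random(seed);
    }

    /** Nearest known-point distance minus a penalty for drifting from the origin. */
    double score(double x, double y, double originX, double originY) {
        double closest = 0.0;      // stays 0 while no points are known
        boolean first = true;
        for (double[] p : known) {
            double d = Math.hypot(p[0] - x, p[1] - y);
            if (first || d < closest) {
                closest = d;
                first = false;
            }
        }
        return closest - Math.hypot(originX - x, originY - y) / ORIGIN_WEIGHT;
    }

    /** Tries one random candidate per run on a shrinking circle and keeps the best. */
    double[] place(double originX, double originY) {
        double[] best = {originX, originY};
        double bestScore = 0.0;
        for (int run = 0; run < RUNS; run++) {
            double radius = START_RADIUS * (RUNS - run) / RUNS;
            double angle = random.nextDouble() * 2.0 * Math.PI;
            double x = originX + Math.cos(angle) * radius;
            double y = originY + Math.sin(angle) * radius;
            double candidateScore = score(x, y, originX, originY);
            if (candidateScore > bestScore) {
                best = new double[] {x, y};
                bestScore = candidateScore;
            }
        }
        known.add(best);
        return best;
    }
}
```

With this scoring, the first label stays at its origin, and a second label with the same origin ends up a full START_RADIUS away, because its score (distance to the first label minus half the distance to the origin) grows with the radius. That may be one of the places where the score could be refined.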
I'm working on a video stream in real time that I try to process with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)
Each frame has to be captured and lightly blurred, and whenever I can, I need to do some heavy calculations on the 10 latest frames.
So I need to capture ALL the frames at 30 fps, and I expect to get the hard-work result at 5 fps.
My problem is that I cannot keep the capture running at the right pace: it seems that the hard-work calculation slows down the capture of frames, either at the CPU level or at the GPU level. I miss some frames...
I tried many solutions. None worked:
I tried to set up jobs on 2 streams (image below):
The host gets a frame.
First stream (called Stream2): cudaMemcpyAsync copies the frame to the device. Then a first kernel does the basic blurring calculations. (In the attached image, blurring appears as a short slot at 3.07 s and 3.085 s. And then nothing... until the big part has finished.)
The host checks whether the second stream is "available" by querying a CUDA event, and launches it if possible. In practice, the stream is available on about half of the tries.
Second stream (called Stream4): starts the hard-work calculations in a kernel (kernelCalcul_W2), outputs the result, and records an event.
NSight capture
In practice, I wrote:
cudaStream_t sHigh, sLow;
cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&sLow, cudaStreamNonBlocking, priority_low);
cudaEvent_t event_1;
cudaEventCreate(&event_1);
if (frame has arrived)
{
    cudaMemcpyAsync(..., sHigh); // HtoD, to upload images in the GPU
    blur_Image <<<... , sHigh>>> (...);
    if (cudaEventQuery(event_1) == cudaSuccess) hard_work(sLow);
    else printf("Event 2 not ready\n");
}
void hard_work( cudaStream_t sLow_)
{
    kernelCalcul_W2<<<... , sLow_>>> (...);
    cudaMemcpyAsync(... the result..., sLow_); //DtoH
    cudaEventRecord(event_1, sLow_);
}
I tried to use only one stream. It's the same code as above, but with one parameter changed when launching hard_work.
The host gets a frame.
Stream: cudaMemcpyAsync copies the frame to the device. Then the kernel does the basic blurring calculations. Then, if the CUDA event Event_1 is ok, I launch the hard work, and I record Event_1 to get the status on the next round.
In practice, the stream is ALWAYS available: I never fall into the "else" part.
This way, while the hard work is running, I expected to "buffer" all the frames to copy and not lose any. But I do lose some: it turns out that each time I get a frame and copy it, Event_1 seems ok, so I launch the hard work and only get the next frame very late.
I tried to put the two streams in two different threads (in C). Not better (even worse).
So the question is: how to ensure that the first stream captures ALL frames?
I really have the feeling that the different streams block the CPU.
I display the images with OpenGL. Would it interfere?
Any idea of ways to improve this?
Thanks a lot!
EDIT:
As requested, here is an MCVE.
There is a parameter you can tune (#define ADJUST) to see what's happening. Basically, the main procedure sends CUDA requests in async mode, but they seem to block the main thread. As you will see in the image, I have "memory access" (i.e. images captured) every 30 ms, except when the hard work is running (then I just don't get images).
Last detail: I'm using CUDA 7.5 to run this. I tried to install 8.0 but apparently the compiler is still 7.5
#define _USE_MATH_DEFINES 1
#define _CRT_SECURE_NO_WARNINGS 1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <Windows.h>
#define ADJUST 400
// adjusting this parameter may make the problem occur.
// Too high => probably watchdog will stop the kernel
// too low => probably the kernel will run smoothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result = 0.f; // start the accumulation below from a defined value
long a, b, n, n0;
float input;
b = 3;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
else printf("running ok\n");
// copy the result back to Host
//printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
// storing the images in the Host memory. I know I could optimize with cudaHostAlloc.
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
free(images_as_Output_in_Host);
free(camera);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("\nDealing with image %d\n", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) * imagePixelSize, s1);
// blur image
blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the blurring, and tries to do the hard-work
Sleep(30); // to simulate Camera
}
releaseMapFile();
getchar();
}
The primary issue here is that cudaMemcpyAsync is only a properly non-blocking async operation if the host memory involved is pinned, i.e. allocated using cudaHostAlloc. This characteristic is covered in several places, including the API documentation and the relevant programming guide section.
The following modification to your code (to run on Linux, which I prefer) demonstrates the behavioral difference:
$ cat t33.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#define ADJUST 400
// adjusting this parameter may make the problem occur.
// Too high => probably watchdog will stop the kernel
// too low => probably the kernel will run smoothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result = 0.f; // start the accumulation below from a defined value
long a, n, n0;
float input;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
#ifndef QUICK
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
else printf("running ok\n");
// copy the result back to Host
//printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize/2, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
#endif
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
#ifdef USE_HOST_ALLOC
cudaHostAlloc(&camera, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
cudaHostAlloc(&images_as_Unsigned_in_Host, imagePixelSize*sizeof(unsigned short)*totNbOfImages, cudaHostAllocDefault);
cudaHostAlloc(&images_as_Output_in_Host, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
#else
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
#endif
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
cudaEventRecord(event_2, s2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("\nDealing with image %d\n", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) * imagePixelSize, s1);
// blur image
blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the blurring, and tries to do the hard-work
usleep(30000); // to simulate Camera
}
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("some CUDA error: %s\n", cudaGetErrorString(err));
releaseMapFile();
}
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard work is launched after image 1 is captured ==> running ok
Dealing with image 2
Hard work is launched after image 2 is captured ==> running ok
Dealing with image 3
Hard work is launched after image 3 is captured ==> running ok
Dealing with image 4
Hard work is launched after image 4 is captured ==> running ok
Dealing with image 5
Hard work is launched after image 5 is captured ==> running ok
Dealing with image 6
Hard work is launched after image 6 is captured ==> running ok
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard work is launched after image 8 is captured ==> running ok
Dealing with image 9
Hard work is launched after image 9 is captured ==> running ok
real 0m2.790s
user 0m0.688s
sys 0m0.966s
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu -DUSE_HOST_ALLOC
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard_work still running, so unable to process after image 1
Dealing with image 2
Hard_work still running, so unable to process after image 2
Dealing with image 3
Hard_work still running, so unable to process after image 3
Dealing with image 4
Hard_work still running, so unable to process after image 4
Dealing with image 5
Hard_work still running, so unable to process after image 5
Dealing with image 6
Hard_work still running, so unable to process after image 6
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard_work still running, so unable to process after image 8
Dealing with image 9
Hard_work still running, so unable to process after image 9
real 0m1.721s
user 0m0.028s
sys 0m0.629s
$
In the USE_HOST_ALLOC case above, the launch pattern for the low-priority kernel is intermittent, as expected, and the overall run time is considerably shorter.
In short, if you want the expected behavior out of cudaMemcpyAsync, make sure any participating host allocations are page-locked.
A pictorial (profiler) example of the effect that pinning can have on multi-stream behavior can be seen in this answer.
For my current project I need to compute the screen coordinates of a given point in world space in Unity.
I used this tutorial to write a method to do so.
After some debugging, the x and y screen coordinates are correct, but my z coordinate looks wrong, and I have some more questions:
static public Vector3 convertWorldToScreenCoordinates (Vector3 point, PhotoData photoData)
{
    // get the camera
    Camera camera = GameObject.Find (photoData.cameraName).camera;
    /*
     * 1 convert P_world to P_camera
     */
    Vector4 pointInCameraCoodinates = convertWorldToCameraCoordinates (point, photoData);
    /*
     * 2 convert P_camera to P_clipped
     */
    Vector4 pointInClipCoordinates = camera.projectionMatrix * pointInCameraCoodinates;
    /*
     * 3 convert P_clipped to P_ndc
     * Normalized Device Coordinates
     */
    Vector3 pointInNdc = pointInClipCoordinates / pointInClipCoordinates.w;
    /*
     * 4 convert P_ndc to P_screen
     */
    Vector3 pointInScreenCoordinates;
    pointInScreenCoordinates.x = camera.pixelWidth / 2.0f * (pointInNdc.x + 1);
    pointInScreenCoordinates.y = camera.pixelHeight / 2.0f * (pointInNdc.y + 1);
    pointInScreenCoordinates.z = ((camera.farClipPlane - camera.nearClipPlane) * pointInNdc.z + (camera.farClipPlane + camera.nearClipPlane)) / 2.0f;
    // return screencoordinates
    return pointInScreenCoordinates;
}
PhotoData is a class that contains some information about the camera. The important part here is that I can access the camera.
static public Vector4 convertWorldToCameraCoordinates (Vector3 point, PhotoData photoData)
{
    // translate the point by the negative camera-offset
    //and convert to Vector4
    Vector4 translatedPoint = point - photoData.cameraPosition;
    // by default translatedPoint.w is 0
    translatedPoint.w = 1.0f;
    // create transformation matrix
    Matrix4x4 transformationMatrix = Matrix4x4.identity;
    transformationMatrix.SetRow (0, photoData.camRight);
    transformationMatrix.SetRow (1, photoData.camUp);
    transformationMatrix.SetRow (2, - photoData.camForward);
    Vector4 transformedPoint = transformationMatrix * translatedPoint;
    return transformedPoint;
}
First of all, the tutorial mentions that after computing the NDC values, "the range of values is now normalized from -1 to 1 in all 3 axes". This is not true in my case, and I do not see what I am doing wrong.
My second question: does pointInClipCoordinates.z < 0 mean the world point is "behind" my camera?
And the last question so far: why do I have to use - photoData.camForward?
// edit: updated code + questions
For editor scripts, use HandleUtility.WorldToGUIPoint (available since Unity3D 4.12) to convert a world space point to a 2D GUI position.
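To make the clip-space and NDC questions above more concrete, here is a small Java sketch (not Unity code) of an OpenGL-style perspective projection. It assumes the convention in which the camera looks along the negative z axis in camera space and NDC z runs from -1 at the near plane to 1 at the far plane. The class name, method names and sample numbers are made up for illustration.

```java
/** Hypothetical illustration of an OpenGL-style perspective projection. */
final class ProjectionSketch {
    /**
     * Projects a camera-space point to clip space and returns {x, y, z, w}.
     * Assumes the camera looks along -z, so visible points have negative z.
     */
    static double[] toClip(double x, double y, double z,
                           double fovYDegrees, double aspect,
                           double near, double far) {
        double f = 1.0 / Math.tan(Math.toRadians(fovYDegrees) / 2.0);
        return new double[] {
            f / aspect * x,
            f * y,
            -(far + near) / (far - near) * z - 2.0 * far * near / (far - near),
            -z   // clip-space w is the distance in front of the camera
        };
    }

    /** Perspective divide: clip space to normalized device coordinates. */
    static double[] toNdc(double[] clip) {
        return new double[] {clip[0] / clip[3], clip[1] / clip[3], clip[2] / clip[3]};
    }

    public static void main(String[] args) {
        double near = 1.0, far = 100.0;
        double[] onNear = toNdc(toClip(0, 0, -near, 60, 1.5, near, far));
        double[] onFar = toNdc(toClip(0, 0, -far, 60, 1.5, near, far));
        System.out.println("NDC z on near plane: " + onNear[2]);
        System.out.println("NDC z on far plane:  " + onFar[2]);
    }
}
```

Under these assumptions, the sign that tells you a point is behind the camera is the clip-space w (it equals -z in camera space), not the clip-space z: a point just beyond the near plane has a negative clip z while still being in front of the camera. NDC values only fall inside [-1, 1] for points inside the view frustum, so values outside that range do not by themselves indicate a mistake. The camera looking along -z in this convention is also the usual reason for negating the forward vector when building the view matrix.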