JavaFX ScatterChart not freeing memory when points are cleared - javafx

When I call scatterChart.getData().clear(), Java is not freeing up memory. For example if I run the following code:
for (int i = 0; i < 20000; i++) {
    double x = Math.random() * 100.0;
    double y = Math.random() * 500.0;
    ScatterChart.Data<Number, Number> dataPoint = new ScatterChart.Data<Number, Number>(x, y);
    series.getData().add(dataPoint);
}
scatterChart.getData().add(series);
If I then call scatterChart.getData().clear(), no memory is freed. Calling System.gc() explicitly doesn't free any memory either. Any idea what is going on? I don't hold references to these points anywhere else in my code.

Related

How is the length determined in sendcount and recvcount in MPI.COMM_WORLD.Gather

After I Bcast the data (clusters[10][5], a 2D array) to every other process and each one calculates its new local values, I want to send them back to process 0.
But some of the data is sometimes missing (depending on the number of clusters), or the data does not match what I have in the sending processes.
I don't know why, but the maximum value of sendcount and recvcount needs to be divided by size or by some other factor; it can't be the array size (10, or 10*5, the number of elements).
If I pass the full size, for instance clusters.length (10), I get an index-out-of-bounds error at index 19. If I run with more processes (mpjrun.bat -np 11 name), the index in the error gets higher, and it always goes up or down by 2 as the number of processes goes up or down (for example, with 5 processes I get out of bounds at 9, and with 6 I get 11).
Can someone explain why Gather's count is connected to the number of processes, or why it can't accept the array size?
Also, the program doesn't end after the data is calculated correctly. It only ends if I use 1 process; otherwise it leaves the loop, prints something to the terminal, and reaches MPI.Finalize, but then nothing happens and I have to press Ctrl+C to terminate the batch job so I can use the terminal again.
The clusterget variable is sized to the number of clusters times the number of processes, so that it can store all the new clusters from the other processes for use in the first process. So the problem shouldn't be in clusterget, or is it? There isn't really anything documented about sending a 2D array of floats (yes, I need to use MPI.OBJECT, because Java doesn't like float: if I use float it says Float can't be cast to Float).
MPI.COMM_WORLD.Bcast(clusters, 0, clusters.length, MPI.OBJECT, 0);
// calculate and then send back to 0
MPI.COMM_WORLD.Gather(clusters, 0, clusters.length / size, MPI.OBJECT, clusterget, 0, clusters.length / size, MPI.OBJECT, 0);
if (me == 0) {
    for (int j = 0; j < clusters.length; j++) { // adds clusters from each other process to the first ones
        for (int i = 0; i < size - 1; i++) {
            System.out.println(clusterget[j + i * cluster][4] + " tock " + clusters[j][4]);
            clusters[j][2] += clusterget[j + i * cluster][2]; // add
            clusters[j][3] += clusterget[j + i * cluster][3];
            clusters[j][4] += clusterget[j + i * cluster][4];
        }
    }
}
In summary:
The data from each process isn't the same as the data collected after the Gather, and I can't pass the full size of the 2D float array to Gather.
I've changed the Gather to Send and Recv, and it works once I add a Barrier so that the data is in sync before sending. But this only works for 2 processes.
MPI.COMM_WORLD.Barrier();
if (me != 0) {
    MPI.COMM_WORLD.Send(clusters, 0, clusters.length, MPI.OBJECT, 0, MPI.ANY_TAG);
}
if (me == 0) {
    for (int i = 1; i < size; i++) {
        MPI.COMM_WORLD.Recv(clusterget, 0, clusters.length, MPI.OBJECT, i, MPI.ANY_TAG);
        for (int j = 0; j < clusters.length; j++) {
            clusters[j][2] += clusterget[j][2];
            clusters[j][3] += clusterget[j][3];
            clusters[j][4] += clusterget[j][4];
        }
    }
}
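On the count question: in MPI's Gather, sendcount and recvcount are the number of elements contributed by each process, not the total, so the root's receive buffer needs room for size * sendcount elements. The plain-Java sketch below involves no MPI at all; it only mimics where each rank's rows would land in the root buffer. The class name `GatherLayout`, the method `simulateGather`, and the array shapes are made up for this illustration.

```java
import java.util.Arrays;

public class GatherLayout {
    // Simulates the layout of a Gather on one machine: every rank contributes
    // `sendCount` rows, and the root stores rank r's rows starting at
    // offset r * sendCount in the receive buffer.
    public static float[][] simulateGather(float[][][] perRank, int sendCount) {
        int size = perRank.length; // number of "processes"
        float[][] recv = new float[size * sendCount][];
        for (int rank = 0; rank < size; rank++) {
            for (int j = 0; j < sendCount; j++) {
                recv[rank * sendCount + j] = perRank[rank][j].clone();
            }
        }
        return recv;
    }

    public static void main(String[] args) {
        int size = 3, clusters = 10, fields = 5; // hypothetical sizes
        float[][][] perRank = new float[size][clusters][fields];
        for (int r = 0; r < size; r++) {
            for (float[] row : perRank[r]) {
                Arrays.fill(row, r + 1); // mark each row with its rank
            }
        }
        float[][] gathered = simulateGather(perRank, clusters);
        System.out.println("rows at root: " + gathered.length);
        System.out.println("first row from rank 1: " + Arrays.toString(gathered[clusters]));
    }
}
```

With this layout, row j from rank i sits at index j + i * sendCount, which is the same indexing pattern the summation loop in the question uses.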

OpenCL Atomic add for vector types?

I'm updating a single element in a buffer from two lanes and need an atomic for float4 types. (More specifically, I launch twice as many threads as there are buffer elements, and each successive pair of threads updates the same element.)
For instance (this pseudocode does nothing useful, but hopefully illustrates my issue):
int idx = get_global_id(0);
int mapIdx = floor (idx / 2.0);
float4 toAdd;
// ...
if (idx % 2)
{
    toAdd = (float4)(0,1,0,1);
}
else
{
    toAdd = (float4)(1,0,1,0);
}
// avoid race condition here?
// I'd like to atomic_add(map[mapIdx],toAdd);
map[mapIdx] += toAdd;
In this example, map[0] should be incremented by (1,1,1,1): (1,0,1,0) from thread 0 and (0,1,0,1) from thread 1.
Suggestions? I haven't found any reference to vector atomics in the CL documents. I suppose I could do this on each individual vector component separately:
atomic_add(map[mapIdx].x, toAdd.x);
atomic_add(map[mapIdx].y, toAdd.y);
atomic_add(map[mapIdx].z, toAdd.z);
atomic_add(map[mapIdx].w, toAdd.w);
... but that just feels like a bad idea. (And it requires a cmpxchg hack, since there are no float atomics.)
Suggestions?
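For illustration, the compare-and-exchange workaround mentioned above can be sketched in Java rather than OpenCL C. The idea carries over: reinterpret the float as its 32-bit integer pattern and retry a compare-and-set until no other thread has interfered (in a kernel, the analogous loop would be built on atomic_cmpxchg). The names `AtomicFloatAdd` and `addFloat` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicFloatAdd {
    // Atomically adds `delta` to the float whose bit pattern is stored in `cell`,
    // using a compare-and-set retry loop. Returns the updated value.
    public static float addFloat(AtomicInteger cell, float delta) {
        while (true) {
            int oldBits = cell.get();
            float updated = Float.intBitsToFloat(oldBits) + delta;
            int newBits = Float.floatToIntBits(updated);
            if (cell.compareAndSet(oldBits, newBits)) {
                return updated; // nobody changed the cell in between
            }
            // Another thread won the race; reload and try again.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // One cell per vector component (x, y, z, w), all starting at 0.
        AtomicInteger[] map = new AtomicInteger[4];
        for (int c = 0; c < 4; c++) {
            map[c] = new AtomicInteger(Float.floatToIntBits(0f));
        }
        float[][] toAdd = { {1, 0, 1, 0}, {0, 1, 0, 1} };
        Thread[] threads = new Thread[2];
        for (int t = 0; t < 2; t++) {
            final float[] v = toAdd[t];
            threads[t] = new Thread(() -> {
                for (int c = 0; c < 4; c++) {
                    addFloat(map[c], v[c]);
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            th.join();
        }
        for (AtomicInteger cell : map) {
            System.out.println(Float.intBitsToFloat(cell.get()));
        }
    }
}
```

Note that this makes each component update atomic on its own; the four-component vector as a whole is still not updated in one atomic step.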
Alternatively, you could try using local memory like this:
__local float4 local_map[LOCAL_SIZE/2];
// Splitting the work-items into two halves, rather than taking every second one (idx%2),
// is preferable because neighbouring work-items run together in a warp/wavefront anyway;
// alternating may hurt performance.
if (idx < LOCAL_SIZE/2)
    local_map[mapIdx] = toAdd;
barrier(CLK_LOCAL_MEM_FENCE);
if (idx >= LOCAL_SIZE/2)
    local_map[mapIdx - LOCAL_SIZE/2] += toAdd;
barrier(CLK_LOCAL_MEM_FENCE);
Which will be faster, atomics or local memory, and whether local memory is even possible (the required size may be too big), depends on the actual kernel, so you will need to benchmark and choose the right solution.
Update:
Answering your question from the comments: to write back to the global buffer later, do:
if (idx < LOCAL_SIZE/2)
    map[mapIdx] = local_map[mapIdx];
Or you can try it without introducing a local buffer and write directly into the global buffer:
if (idx < LOCAL_SIZE/2)
    map[mapIdx] = toAdd;
barrier(CLK_GLOBAL_MEM_FENCE); // <- notice that now we use the barrier flag for global memory
if (idx >= LOCAL_SIZE/2)
    map[mapIdx - LOCAL_SIZE/2] += toAdd;
barrier(CLK_GLOBAL_MEM_FENCE);
Aside from that, I now see a problem with the indexes. To use the code from my answer, the earlier code should look like:
if (idx < LOCAL_SIZE/2)
{
    toAdd = (float4)(0,1,0,1);
}
else
{
    toAdd = (float4)(1,0,1,0);
}
If you need to use idx % 2, though, then all of the code must follow that scheme, or you will have to do some index arithmetic so that the values go into the right places in map.
If I understand the issue correctly, I would do the following.
Get rid of the ifs by making an array with the offsets
float4 offsets[2] = {(float4)(1,0,1,0), (float4)(0,1,0,1)};
and use idx % 2 as the index into it.
Move map into local memory and use mem_fence(CLK_LOCAL_MEM_FENCE) to make sure all threads in the group are synced.

OpenCL MultiGPU slower than single GPU

I am developing an application which performs some processing on video frame data. To accelerate it I use 2 graphic cards and process the data with OpenCL. My idea is to send one frame to the first card and another one to the second card. The devices use the same context, but different command queues, kernels and memory objects.
However, it seems to me that the computations are not executed in parallel, because the time required by the 2 cards is almost the same as the time required by only one graphic card.
Does anyone have a good example of using multiple devices on independent pieces of data simultaneously?
Thanks in advance.
EDIT:
Here is the resulting code after switching to 2 separate contexts. However, the execution time with 2 graphic cards still remains the same as with 1 graphic card.
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
    // Copy the input data to the device
    commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_TRUE, 0, imageSize*sizeof(float), wt[i].data);
    // Set kernel arguments
    kernel[i].setArg(0, inputDataBuffer[i]);
    kernel[i].setArg(1, modulusBuffer[i]);
    kernel[i].setArg(2, imagewidth);
}
for (int i = 0; i < numDevices; i++){
    // Run kernel
    commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
for (int i = 0; i < numDevices; i++){
    // Read the modulus back to the host
    float* modulus = new float[imageSize/4];
    commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_TRUE, 0, imageSize/4*sizeof(float), modulus);
    // Do something with the modulus;
}
Your main problem is that you are using blocking calls. It doesn't matter how many devices you have if you operate them that way: you start an operation and wait for it to finish, so there is no parallelization at all (or very little). You are doing this at the moment:
Wr:-Copy1--Copy2--------------------
G1:---------------RUN1--------------
G2:---------------RUN2--------------
Re:-------------------Read1--Read2--
You should change your code to do it like this at least:
Wr:-Copy1-Copy2-----------
G1:------RUN1-------------
G2:------------RUN2-------
Re:----------Read1-Read2--
With this code:
cl::NDRange globalws(imageSize);
cl::NDRange localws;
for (int i = 0; i < numDevices; i++){
    // Set kernel arguments (you should do this at the init stage; it is slow to do it in a loop)
    kernel[i].setArg(0, inputDataBuffer[i]);
    kernel[i].setArg(1, modulusBuffer[i]);
    kernel[i].setArg(2, imagewidth);
    // Copy the input data to the device (non-blocking)
    commandQueues[i].enqueueWriteBuffer(inputDataBuffer[i], CL_FALSE, 0, imageSize*sizeof(float), wt[i].data);
}
for (int i = 0; i < numDevices; i++){
    // Run kernel
    commandQueues[i].enqueueNDRangeKernel(kernel[i], cl::NullRange, globalws, localws);
}
float* modulus[numDevices];
for (int i = 0; i < numDevices; i++){
    // Read the modulus back to the host (non-blocking)
    modulus[i] = new float[imageSize/4];
    commandQueues[i].enqueueReadBuffer(modulusBuffer[i], CL_FALSE, 0, imageSize/4*sizeof(float), modulus[i]);
}
for (int i = 0; i < numDevices; i++){
    // Wait for each queue to finish (a finish call always applies to one specific queue)
    commandQueues[i].finish();
}
// Do something with the modulus;
As for the comments about using multiple contexts, it depends on whether the two GPUs will ever need to communicate. As long as each GPU only uses its own memory, there will be no copy overhead. But if you constantly set and unset kernel args, that will trigger copies to the other GPU, so be careful with that.
The safer approach when the GPUs do not communicate is to use different contexts.
I suspect your main problem is the memory copy and not the kernel execution. It is highly likely that 1 GPU will fulfil your needs if you hide the memory latency:
Wr:-Copy1-Copy2-Copy3----------
G1:------RUN1--RUN2--RUN3------
Re:----------Read1-Read2-Read3-
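The enqueue-everything-then-wait pattern from this answer can be illustrated without OpenCL. The sketch below is only an analogy in plain Java: each single-threaded executor stands in for one device's command queue, `submit` plays the role of a non-blocking enqueue, and `process` is a hypothetical stand-in for the kernel. None of these names come from the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class NonBlockingQueues {
    // Hypothetical stand-in for a kernel: sums the squares of one frame.
    static double process(float[] frame) {
        double sum = 0;
        for (float f : frame) {
            sum += f * f;
        }
        return sum;
    }

    // Submits one frame per "device" without waiting, then collects all results.
    public static double[] runAll(float[][] frames) {
        List<ExecutorService> queues = new ArrayList<>();
        List<Future<Double>> pending = new ArrayList<>();
        for (float[] frame : frames) {
            ExecutorService queue = Executors.newSingleThreadExecutor();
            queues.add(queue);
            pending.add(queue.submit(() -> process(frame))); // returns immediately
        }
        double[] results = new double[frames.length];
        try {
            for (int i = 0; i < results.length; i++) {
                results[i] = pending.get(i).get(); // the only place we block
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            for (ExecutorService queue : queues) {
                queue.shutdown();
            }
        }
        return results;
    }

    public static void main(String[] args) {
        float[][] frames = { {1f, 2f}, {3f, 4f} };
        for (double r : runAll(frames)) {
            System.out.println(r);
        }
    }
}
```

Waiting inside the first loop instead (calling `get()` right after `submit`) would serialize the work in the same way the blocking enqueue calls do in the question.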

OpenCL - Global Memory reads performing better than local

I have a kernel running on an NVIDIA GTX 680 whose execution time increased when I switched from using global memory to local memory.
My kernel, which is part of a finite element ray tracer, now loads each element into local memory before processing. The data for each element is stored in a struct fastTriangle, which has the following definition:
typedef struct fastTriangle {
float cx, cy, cz, cw;
float nx, ny, nz, nd;
float ux, uy, uz, ud;
float vx, vy, vz, vd;
} fastTriangle;
I pass an array of these objects to the kernel, which is written as follows (I have removed the irrelevant code for brevity):
__kernel void testGPU(int n_samples, const int n_objects, global const fastTriangle *objects, __local int *x_res, __global int *hits) {
// Get gid, lid, and lsize
// Set up random number generator and thread variables
// Local storage for the two triangles being processed
__local fastTriangle triangles[2];
for(int i = 0; i < n_objects; i++) { // Fire ray from each object
event_t evt = async_work_group_copy((local float*)&triangles[0], (global float*)&objects[i],sizeof(fastTriangle)/sizeof(float),0);
//Initialise local memory x_res to 0's
barrier(CLK_LOCAL_MEM_FENCE);
wait_group_events(1, &evt);
Vector wsNormal = { triangles[0].cw*triangles[0].nx, triangles[0].cw*triangles[0].ny, triangles[0].cw*triangles[0].nz};
for(int j = 0; j < n_samples; j+= 4) {
// generate a float4 of random numbers here (rands)
for(int v = 0; v < 4; v++) { // For each ray in ray packet
//load the first object to be intersected
evt = async_work_group_copy((local float*)&triangles[1], (global float*)&objects[0],sizeof(fastTriangle)/sizeof(float),0);
// Some initialising code and calculate ray here
// Should have ray fully specified at this point;
for(int w = 0; w < n_objects; w++) { // Check for intersection against each ray
wait_group_events(1, &evt);
// Check for intersection against object w
float det = wsDir.x*triangles[1].nx + wsDir.y*triangles[1].ny + wsDir.z*triangles[1].nz;
float dett = triangles[1].nd - (triangles[0].cx*triangles[1].nx + triangles[0].cy*triangles[1].ny + triangles[0].cz*triangles[1].nz);
float detpx = det*triangles[0].cx + dett*wsDir.x;
float detpy = det*triangles[0].cy + dett*wsDir.y;
float detpz = det*triangles[0].cz + dett*wsDir.z;
float detu = detpx*triangles[1].ux + detpy*triangles[1].uy + detpz*triangles[1].uz + det*triangles[1].ud;
float detv = detpx*triangles[1].vx + detpy*triangles[1].vy + detpz*triangles[1].vz + det*triangles[1].vd;
// Interleaving the copy of the next triangle
evt = async_work_group_copy((local float*)&triangles[1], (global float*)&objects[w+1],sizeof(fastTriangle)/sizeof(float),0);
// Complete intersection calculations
} // end for each object intersected
if(objectNo != -1) atomic_inc(&x_res[objectNo]);
} // end for sub rays
} // end for each ray
barrier(CLK_LOCAL_MEM_FENCE);
// Add all the local x_res to global array hits
barrier(CLK_GLOBAL_MEM_FENCE);
} // end for each object
}
When I first wrote this kernel, I did not buffer each object in local memory and instead just accessed it from global memory, i.e. instead of triangles[0].cx I would use objects[i].cx.
When I set out to optimise, I switched to using local memory as listed above, but then observed an increase in execution time of around 25%.
Why would performance be worse when using local memory to buffer the objects instead of accessing them directly in global memory?
Whether local memory helps your program run faster really depends on the program. There are two things to consider when using local memory:
You have additional work copying the data from global to local memory and from local to global again.
I see that you call barrier(...) three times; these barriers are performance killers. Every work-item in a work-group has to wait at the barrier for all the others, which hinders parallelism because the work-items no longer run independently.
Local memory is great when you read the same data many times in your computation. But the faster reads and writes have to gain you more than the copying and synchronizing costs.

OutOfMemoryError during heuristic search

I'm writing a program to solve an 8-tile sliding puzzle for an AI class. In theory this is pretty easy, but the number of node states generated is pretty large (an estimated 180,000 or so). We're comparing different heuristic functions in class, so my code has to be able to handle even some very inefficient functions. I'm getting "OutOfMemoryError: Java heap space" when using Java's PriorityQueue class. Here's the relevant code within my solver function (the error is on the openList.add(temp); line):
public void solve(char[] init,int searchOrder)
{
State initial = new State(init,searchOrder); //create initial state
openList = new PriorityQueue<State>(); //create open list
closedList = new LinkedList<State>(); // create closed list
generated = new HashSet<State>(); //Keeps track of all nodes generated to cut down search time
openList.add(initial); //add initial state to the open list
State expanded,temp = null,solution = null; //State currently being expanded
int nodesStored = 0, nodesExpanded = 0;
boolean same; //used for checking for state redundancy
TreeGeneration:
while(openList.size() > 0)
{
expanded = openList.poll();
closedList.addLast(expanded);
for (int k = 0; k < 4; k++)
{
if (k == 0)
{
temp = expanded.moveLeft();
}
else if (k == 1)
{
temp = expanded.moveRight();
}
else if (k == 2)
{
temp = expanded.moveAbove();
}
else
{
temp = expanded.moveBelow();
}
if(temp.isSolution())
{
solution = temp;
nodesStored = openList.size() + closedList.size();
nodesExpanded = closedList.size();
break TreeGeneration;
}
if(!generated.contains(temp))
{
// System.out.println(temp.toString());
openList.add(temp); // error here
generated.add(temp);
}
// System.out.println(openList.toString());
}
}
}
Am I doing something wrong here, or should I be using something else to handle this quantity of data? Thanks.
The JVM's default maximum heap size depends on the JVM version and the machine (older client JVMs defaulted to 64 MB). You can raise it by passing a parameter like this:
java -Xmx1024m YOUR_CLASS
This gives the JVM up to 1024 MB of heap space; adjust the amount as needed.
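To check which limit is actually in effect, the running JVM can report it through the standard Runtime API. A minimal sketch (the class name `HeapInfo` and the helper `maxHeapMiB` are made up for this example):

```java
public class HeapInfo {
    // Returns the maximum heap the JVM will attempt to use, in MiB.
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Heap currently in use = allocated heap minus the free part of it.
        long usedMiB = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedMiB + " MiB");
    }
}
```

Running it once with and once without -Xmx1024m shows whether the option is reaching the JVM that executes the solver.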
If you are using NetBeans, it does not scale the heap space automatically; you can set it with the following steps:
1- Right click on your project
2- Navigate to Set Configuration -> Customize
3- Add -Xmx256m to VM Options, then click OK
Now you can run your project with a custom heap size.
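Raising the heap limit only helps if duplicate states are actually being filtered out. The generated.contains(temp) check in the question relies on State overriding equals and hashCode; the question does not show that class, so the sketch below is purely an assumption about how such a class might look (`PuzzleState` and its `tiles` field are hypothetical). Without these overrides, a HashSet compares objects by identity, and every regenerated board counts as new.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PuzzleState {
    // Board layout, e.g. "12345678_" (hypothetical representation).
    private final char[] tiles;

    public PuzzleState(char[] tiles) {
        this.tiles = tiles.clone();
    }

    // Two states are equal when their boards have the same layout.
    @Override
    public boolean equals(Object other) {
        return other instanceof PuzzleState
                && Arrays.equals(tiles, ((PuzzleState) other).tiles);
    }

    // Must be consistent with equals so that hash-based sets work.
    @Override
    public int hashCode() {
        return Arrays.hashCode(tiles);
    }

    public static void main(String[] args) {
        Set<PuzzleState> generated = new HashSet<>();
        generated.add(new PuzzleState("12345678_".toCharArray()));
        // A second object with the same layout is recognised as a duplicate.
        System.out.println(generated.contains(new PuzzleState("12345678_".toCharArray())));
    }
}
```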
