I'm getting this alarm on Cloudera, is there any way to increase the swap space capacity?
While you ask how to increase the swap space capacity, I think it is safe to assume that what you are really looking for is a way to solve the problem of full swap space.
Increasing the swap space is only one way of dealing with the issue - the other is simply to use less swap space. Cloudera recommends using minimal to no swap space, because swapping degrades performance substantially. This is controlled by setting the kernel's 'swappiness' to 1 instead of the default of 60. See the documentation for instructions and more rationale.
If the swappiness is already set to 1, then you can try clearing the swap by toggling swap off, then on:
swapoff -a
swapon -a
Before toggling swap you should make sure that
the amount of swap space in use is less than the amount of free memory (as the contents of swap may be shifted to memory).
currently running processes are not actively using swap (running vmstat produces output with columns labeled 'si' and 'so', showing the amount of memory swapped in and out per second; if these are both 0, then you should be safe).
I initially have work units with a size of 11*11*6779. For the sake of simplicity I don't want to translate this into a 1D global work size. When I changed it to 21*21*6779, the performance became 5-6x slower than before. As far as I know, the code does nothing that depends on the number of threads being run.
The amount of data transferred is only 4x bigger, which I don't think is the reason the program runs slower, because I tested the memory allocation process.
Note that my device has max work item sizes of 256*256*256, meaning I would be using half of all available work items, and this is not a dedicated device (it is also used for display).
I wonder if setting the work size to 21*21*6779 uses too many of my work items, or whether the dimensions are simply inconvenient for OpenCL to handle?
If your max work item sizes are 256x256x256, then why are you using 21x21x6779 (where 6779 is greater than 256)? Note that if the work group size is not specified, the runtime will try to pick one that divides your global work size evenly. If your dimensions are not easily divisible, it might pick bad work group sizes. That could explain why the performance changes with the global work size. I recommend you specify the work group size and make the global work size a multiple of it (if necessary, pass the real size in as a parameter and have each work item check whether it is in range; this is a typical pattern you will see a lot in OpenCL).
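As a rough sketch of that pattern (the kernel name, arguments and the per-element work below are hypothetical, not taken from the question):

__kernel void process(__global float *data,
                      const uint sizeX,   // real extents of the problem,
                      const uint sizeY,   // passed in by the host
                      const uint sizeZ)
{
    uint x = get_global_id(0);
    uint y = get_global_id(1);
    uint z = get_global_id(2);
    // The global work size was rounded up to a multiple of the work group
    // size, so some work items fall outside the real problem - skip them.
    if (x >= sizeX || y >= sizeY || z >= sizeZ)
        return;
    uint idx = (z * sizeY + y) * sizeX + x;
    data[idx] *= 2.0f;   // placeholder computation
}

On the host you would then round each dimension of the global work size up to the next multiple of the local size you chose.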
Often it is advised to keep the global_work_size the same as the logical number of "elements" you must process. My application doesn't have such a number, though. If I have N elements that need to be processed, then, after a single kernel pass, I will have M elements - a completely different number that doesn't depend on N.
In order to deal with this situation, I could write a loop such as:
while (elementsToBeProcessed)
read "elementsToBeProcessed" variable from device
enqueue ND range kernel with global_work_size = elementsToBeProcessed
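A minimal host-side sketch of that loop in C (countBuf and initialCount are hypothetical names; error checking omitted):

#include <CL/cl.h>

/* Assumes queue, kernel and countBuf (a buffer holding a cl_uint that the
   kernel updates with the number of elements left) already exist. */
cl_uint elementsToBeProcessed = initialCount;   /* starting N, known on the host */
while (elementsToBeProcessed > 0) {
    size_t globalSize = elementsToBeProcessed;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
    /* Blocking read: one host<->device round trip per pass. */
    clEnqueueReadBuffer(queue, countBuf, CL_TRUE, 0,
                        sizeof(elementsToBeProcessed), &elementsToBeProcessed,
                        0, NULL, NULL);
}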
But that requires one read per pass. An alternative would be to keep everything inside the GPU, by calling enqueueNDRangeKernel only once, with a fixed global_work_size and local_work_size matching the GPU layout and then use a master thread to synchronize the computation within.
My question is simple: is my intuition correct that the second option is better, or is there any reason to go with the first?
That is a tricky problem, and which way to take depends on the global size values you are going to have and how much they change over time.
A read per pass: (better for highly changing values)
Fitted global size, all the work items will do useful work
Unfitted local size for the HW, if the work size is small
Blocking behavior in the queue, bad device utilization
Easy to understand and debug
Fixed kernel launch size: (better for stable but changing values)
Unfitted global size, may waste some time running null work items
Fitted local size to the device
Non-blocking behavior, 100% device usage
Complex to debug
As some answers already say, OpenCL 2.0 is the solution, by using pipes. But it is also possible to use another OpenCL 2.0 feature: launching kernels from inside kernels (device-side enqueue). That way your kernels can launch the next batch of kernels without CPU intervention.
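A minimal sketch of device-side enqueue in OpenCL 2.0 C, assuming the host created an on-device default queue (CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT) and a count buffer; the names and the per-element work are hypothetical:

__kernel void launcher(__global float *data, __global int *count)
{
    if (get_global_id(0) == 0 && *count > 0) {
        // One work item enqueues the next pass itself, sized from the count
        // the previous pass left in global memory - no host round trip.
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange_1D((size_t)(*count)),
                       ^{
                           size_t i = get_global_id(0);
                           data[i] *= 2.0f;   /* placeholder per-element work */
                       });
    }
}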
It is always good if you can avoid transferring data between host and device, even if it means a little more work on the device. In many applications the data transfer is the slowest part.
To find out the better solution for your system configuration, you need to test both of them. If you are targeting multiple platforms, then the second one should be faster in general. But there are a lot of things that can make it slower. For example, the code for it might be harder for the compilers to optimize, or the data access pattern might lead to more cache misses.
If you are targeting OpenCL 2.0, pipes might be something you want to look at for this kind of varying number of elements. (Before I get down votes because some platforms don't support 2.0: AMD has promised 2.0 drivers to come this year.) With pipes, you can make a producer kernel and a consumer kernel. The consumer kernel can start work as soon as it has enough items to work on. This might lead to better utilization of all resources.
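For illustration, a minimal producer/consumer pipe sketch in OpenCL 2.0 C (kernel and argument names are hypothetical; the pipe itself is created on the host with clCreatePipe):

__kernel void producer(__global const int *src, uint n, __write_only pipe int out)
{
    uint i = get_global_id(0);
    if (i < n) {
        int v = src[i];
        // write_pipe returns 0 on success; a full pipe would need retry logic.
        write_pipe(out, &v);
    }
}

__kernel void consumer(__read_only pipe int in, __global int *dst)
{
    int v;
    // Each work item consumes one packet if one is available.
    if (read_pipe(in, &v) == 0)
        dst[get_global_id(0)] = v;
}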
The tradeoff: The performance hit for doing the readback is that the GPU will be idle waiting for work, whereas if you just enqueue a bunch of kernels it will stay busy.
Simple: So I think the answer depends on how much elementsToBeProcessed will vary. If a sequence of runs might be (for example) 20000, 19760, 15789, 19345 then I'd always run 20000 and have a few idle work items. On the other hand, if a typical pattern is 20000, 4236, 1234, 9000 then I'd read back elementsToBeProcessed and enqueue the kernel for only what is needed.
Advanced: If your pattern is monotonically decreasing you could interleave the readback with the kernel enqueue, so that you're always keeping the GPU busy but you're also making them smaller as you go. Between every kernel enqueue start an async double-buffered readback of a copy of the elementsToBeProcessed and use it for the kernel after the one you enqueue next.
Like this:
1. elementsToBeProcessedA = starting value
2. elementsToBeProcessedB = starting value
3. eventA = NULL
4. eventB = NULL
5. Enqueue kernel with NDRange of elementsToBeProcessedA
6. non-blocking clEnqueueReadBuffer for elementsToBeProcessedA, taking eventA
7. if non-null, wait on eventB, release event
8. Enqueue kernel with NDRange of elementsToBeProcessedB
9. non-blocking clEnqueueReadBuffer for elementsToBeProcessedB, taking eventB
10. if non-null, wait on eventA, release event
11. goto 5
This will keep the GPU fully saturated and yet will use smaller elementsToBeProcessed values as it goes. It will not handle the case where elementsToBeProcessed increases, so don't do it this way if that is the case.
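A host-side sketch of those numbered steps in C, assuming a countBuf buffer that the kernel updates with the next pass's element count (names are hypothetical, error checking omitted):

#include <CL/cl.h>

cl_uint countA = startingValue, countB = startingValue;   /* steps 1-2 */
cl_event eventA = NULL, eventB = NULL;                    /* steps 3-4 */

while (countA > 0 || countB > 0) {
    if (countA > 0) {                                     /* steps 5-6 */
        size_t globalA = countA;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalA, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, countBuf, CL_FALSE, 0,
                            sizeof(countA), &countA, 0, NULL, &eventA);
    }
    if (eventB) {                                         /* step 7 */
        clWaitForEvents(1, &eventB);
        clReleaseEvent(eventB);
        eventB = NULL;
    }
    if (countB > 0) {                                     /* steps 8-9 */
        size_t globalB = countB;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalB, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, countBuf, CL_FALSE, 0,
                            sizeof(countB), &countB, 0, NULL, &eventB);
    }
    if (eventA) {                                         /* step 10 */
        clWaitForEvents(1, &eventA);
        clReleaseEvent(eventA);
        eventA = NULL;
    }
}                                                         /* step 11: loop */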
An alternate solution: Always run a fixed number of global work items, enough to fill the GPU but not more. Each work item should then look at the total number of items to be done for this pass (elementsToBeProcessed) and do its portion of the total.
uint elementsToBeProcessed = <read from global memory>
uint step = get_global_size(0);
for (uint i = get_global_id(0); i < elementsToBeProcessed; i += step)
{
<process item "i">
}
A simplified example: global work size of 5 (artificially small for the example), elementsToBeProcessed = 19: in the first pass through the loop, elements 0-4 are processed; second pass, 5-9; third pass, 10-14; fourth pass, 15-18.
You'd want to tune the fixed global work size to exactly match your hardware (compute units * max work group size or some division of that).
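One way to derive that fixed size on the host, as a sketch assuming an existing cl_device_id device and cl_kernel kernel (error checking and the <CL/cl.h> include omitted):

cl_uint computeUnits;
size_t maxWorkGroupSize;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(computeUnits), &computeUnits, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
/* Enough work items to give every compute unit a full work group. */
size_t globalSize = (size_t)computeUnits * maxWorkGroupSize;
size_t localSize  = maxWorkGroupSize;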
This is not unlike the algorithm for how work items cooperate to copy data into shared local memory regardless of work group size.
The global work size doesn't have to be fixed. E.g. you have 128 stream processors, so you make a kernel with a local size of 128 too. Your global work size can then be any number which is a multiple of that value - 256, 4096, etc.
Though, the local group size is usually determined by the hardware specs. In case you have more data to process, just increase the number of local groups involved.
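For illustration, rounding the global size up to a multiple of the local size on the host could look like this (hypothetical names, no error checking):

size_t localSize   = 128;       /* matches the example above */
size_t numElements = 1000;      /* real element count (hypothetical) */
size_t globalSize  = ((numElements + localSize - 1) / localSize) * localSize;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                       0, NULL, NULL);
/* Work items with get_global_id(0) >= numElements should return early. */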
I'm looking for a script to check the size of a particular LVM volume on CentOS 6.5 and when it reaches a certain threshold, have it automatically extend the partition and online re-size the file system.
I have this particular machine monitored, and could do it manually, but I saw a script once to do just this.
I have plenty of disk space on the physical volumes but, since it's easier to expand when needed than reduce later, I'd rather expand my logical partitions only when they start to fill up. There are several logical volumes on this machine, but only one that regularly grows.
Any tips are appreciated; and if the overall best thing to do is just to expand the volume manually when the time comes, that advice is welcome as well!
I am using pyopencl to find a certain pixel in a 512 x 512 (262,144 pixel) image. When I run my program I launch a (512, 512) global size and compare each pixel's neighbors to a known group of neighbors. I am doing image synthesis. I don't want to wait around for the remaining kernels to run if I find my group of pixels within one of the kernels. Is there a way to terminate the rest of the running kernels from within a kernel program?
Thanks
Tim
When you queue a kernel with many work items, it gets divided up into work groups and threads which keep the GPU busy. Really large global sizes start as many threads as they can and issue new ones when the old ones finish. So you could find the smallest global size that still performs well, and queue many of those (instead of one large one), but also be checking on the results of the previous ones you queued (use events to know when they are done, and read back memory to get their results). When you get the correct answer, stop queueing kernels.
so instead of this:
queue entire job (say, 4096 x 4096)
do:
do
{
queue some work (say, 32 x 32)
check if any of the prior work queued is done and check if it got the answer
}
while (work remains AND answer not found)
You'll need to figure out the right tradeoff between the size of the smaller jobs and the overhead of checking their results versus extra work done.
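As a rough illustration of that loop on the host in C (pyopencl mirrors the same calls), assuming a resultBuf buffer that the kernel sets when the pixel group is found; the names are hypothetical and error checking is omitted:

size_t chunk[2] = {32, 32};
int found = 0;
for (size_t y = 0; y < 512 && !found; y += chunk[1]) {
    for (size_t x = 0; x < 512 && !found; x += chunk[0]) {
        size_t offset[2] = {x, y};
        clEnqueueNDRangeKernel(queue, kernel, 2, offset, chunk, NULL, 0, NULL, NULL);
        /* A blocking read keeps the sketch simple; events and non-blocking
           reads would overlap the check with the next chunk, as described above. */
        clEnqueueReadBuffer(queue, resultBuf, CL_TRUE, 0,
                            sizeof(found), &found, 0, NULL, NULL);
    }
}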
Your question touches a big issue and problem of parallelism.
What do you do when one of your parallel threads already has the answer to the problem?
OpenCL does not allow you to control kernel execution - not even at the host level. And this is a big problem. However, it is how it has to be, since if the work items did not run freely, detached from one another, then it would not be fully parallel.
The only solution is to split the computation into small parts and check the completion of each of them. But, sometimes the parts are already very small (like in your case 512x512 is quite small).
In your specific case I would process everything (512x512), and after that use another kernel to get the final result out of the 512x512 set.
A first thought is to have some sort of global memory flag that each kernel can read and set. This approach requires atomicity, so make sure to use the atomic_ functions.
__kernel void t(__global int *Data,
                volatile __global int *Flag){
    // Only do the work if no other work item has set the flag yet.
    // atomic_max with 0 reads the flag atomically without changing it.
    if(atomic_max(Flag, 0) == 0){
        //perform calc on Data
        if(PixelsFound){
            //Set the flag to +1
            atomic_inc(Flag);
        }
    }
}
Community, feel free to comment if this is known not to work!
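On the host side, a hypothetical setup for such a flag could look like this (assumes an existing context and queue, and the <CL/cl.h> include; error checking omitted):

cl_int zero = 0;
cl_mem flagBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_int), &zero, NULL);
/* ... set kernel args and enqueue the kernel ... */
cl_int flag = 0;
clEnqueueReadBuffer(queue, flagBuf, CL_TRUE, 0, sizeof(flag), &flag, 0, NULL, NULL);
if (flag > 0) { /* the pixel group was found */ }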
It is my understanding that it is the erases that wear out SSDs, not the writes themselves. Therefore, optimizing away the need for erases would be hugely beneficial from the point of view of drive manufacturers. Can I take it as a given that they do this?
I'd like you to assume that I'm writing directly to the disk and that there isn't a filesystem to mess things up.
If there are empty pages in a block and the SSD wants to write to those pages, it will not erase the block first. An SSD will only erase a block before a write if it cannot find any empty pages to write to, because doing a full read-erase-write is very slow.
Besides, the wear from writing and erasing is about the same. Both involve pulling electrons through the oxide layer, just in different directions.
Also, the erased state for NAND is all 1s. Writing changes 1s to 0s; you need to erase a 0 to get it back to a 1.
Unless I'm reading your question wrong I think you misunderstand how SSDs work.
SSDs are made up of large blocks (usually 512k), which are much larger than we are used to in a filesystem (usually 4k).
The Erase pass is necessary before anything can be written to the block unless the block is already empty.
So the problem with erases wearing out the disk is that if 4k of your 512k block is in use, you must erase the whole 512k block and write back the original 4k plus anything else you are adding. This creates excessive wear and slows things down, since instead of one "write" you need a "read-wipe-write" (known as "write amplification"): rewriting a single 4k page this way moves the full 512k block, roughly 128 times more data than you actually changed.
This is simplifying it a bit as the drive firmware does a lot of clever things to try and make sure the blocks are optimally filled e.g. it tries to keep lots of empty blocks to avoid slow writes.
Hope that helps/didn't confuse things further!
In the case that the SSD uses read-erase-write, it first reads the content of the block, then erases it and writes the new values. So it does erase the entire block before writing, but it has saved the content for the write that follows. Some SSDs do not do read-erase-write: when you write, they write to a new page (possibly in the same block) and invalidate the previous page. In this case the block is erased only after every page in it is either invalid or has been copied somewhere else.