Does Webots use at most one core of the CPU?

I found that when I open multiple Webots instances, the maximum CPU utilization of each instance in FAST mode is about 100%. I want to know whether each Webots instance uses at most one CPU core.

By default, Webots uses a single core. In the WorldInfo node there is an optimalThreadCount field which you can increase to use multiple cores. However, please read the field's description carefully, as increasing it will not necessarily speed up the simulation.
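For example, to let the physics engine use up to four threads, you can set the field on the WorldInfo node in your world file (a minimal sketch; the exact effect depends on your world, as noted above):

```
WorldInfo {
  optimalThreadCount 4    # allow the physics engine to use up to 4 threads
}
```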

Related

MPI - No performance gain when using every available core on the machine

I have a C program (acoustic wave solver) that is parallelized with MPI. However, I've been testing the speed up on various numbers of cores and I've noticed something strange. If I use N processes where N is the number of available cores in the machine, then I do not see a performance improvement over the next step down.
So on my 8 core machine then I see speedup from 1 process to 2 processes to 4 processes, but not from 4 to 8. Similarly on my 4 core laptop I see speedup from 1 to 2, but not from 2 to 4.
Any idea what could be causing this?
Many modern (Intel) CPUs run two hardware threads ("hyper-threads") on a single physical core. The core count you are referencing is actually the number of hardware threads available, not the number of physical execution units.
As long as you use a number of processes less than or equal to the number of physical cores, the processes will (or at least should) be distributed across all of the available cores. But as soon as all physical cores are taken, additional processes will share a physical core with another process.
It is not possible to give a definitive answer on whether using all hardware threads will increase your performance at all, or by how much. That strongly depends on the code you are running. A very nice answer to a similar question is given on superuser.com. Essentially, if your processes are memory-bound or use different parts of the CPU (integer/floating-point arithmetic, video encoding, vector processing, ...) and communication overhead is small, you might even get perfect scaling. Code that is CPU-bound and only does one type of computation might see no improvement, or might even run slower due to communication overhead.

OpenCL how to control number of processors to use

I want to control the number of GPU cores used in order to measure the speedup. How can I do that in OpenCL? I realize I can control the work-group size to control synchronization, but I'm confused, since a work-group can contain more than a hundred work items, which is far larger than the number of GPU cores.
What you are looking for is called device fission. It is an extension in OpenCL 1.1 and in the core specification from OpenCL 1.2 onwards.
To give you a starting point, you will need to use clCreateSubDevices. For example, to restrict your kernel to run on only one compute unit, you may pass properties as such:
cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_BY_COUNTS,
    1,  // use only one compute unit
    CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
    0   // terminates the properties list
};
This tells the driver to create one sub-device composed of one compute unit. You may then run your kernel on that sub-device, which will be scheduled on one compute unit only.

Local memory and registers scale linearly with work group size - how to choose optimal size?

My kernel's local memory and register usage scale linearly with the work-group size. Besides trial and error, are there guidelines for choosing the optimal work-group size? I am targeting AMD hardware, where the maximum work-group size is 256; should I try to maximize the number of work items in a group, or does this risk reducing occupancy and causing register spilling?
You should do both: try to maximize occupancy while avoiding register spilling at all costs, i.e. get the most out of the resources available on your platform.
If you are using nvcc, you can get the number of registers a single thread needs to execute your kernel (for example by compiling with --ptxas-options=-v). Combining that with the local memory your kernel needs (that's your input), you can use the CUDA occupancy calculator to see the impact on occupancy. That does not replace good old trial and error, though.
EDIT: You are using AMD. I don't know how you can map NVIDIA compute capability to AMD devices though.

Translating C code to OpenCL

I am trying to translate a small program written in C into OpenCL. I am supposed to transfer some input data to the GPU and then perform ALL calculations on the device using successive kernel calls.
However, I am facing difficulties with parts of the code that are not suitable for parallelization since I must avoid transferring data back and forth between CPU and GPU because of the amount of data used.
Is there a way to execute some kernels without the parallel processing so I can replace these parts of code with them? Is this achieved by setting global work size to 1?
Yes, you can execute code serially on OpenCL devices. Write your kernel code the same way you would write it in C and enqueue it with the clEnqueueTask() function, which is equivalent to clEnqueueNDRangeKernel() with a work dimension of 1 and a global and local work size of 1. (Setting the global work size to 1 yourself, as you suggest, achieves the same thing; clEnqueueTask() was deprecated in OpenCL 2.0 in favor of exactly that call.)
You could manage two devices:
the GPU for highly parallelized code
the CPU for sequential code
This is a bit more complex, as you must manage one command queue per device to schedule each kernel on the appropriate device.
If the devices belong to the same platform (typically AMD), you can share a single context; otherwise you will have to create a second context for the CPU.
Moreover, if you want more fine-grained task parallelism on the CPU, you can use device fission, if your CPU supports it.

How many tasks can be executed simultaneously on GPU device?

I'm using OpenCL and have ATI 4850 card. It has:
CL_DEVICE_MAX_COMPUTE_UNITS: 10
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_SIZES:(256, 256, 256)
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_NAME: ATI RV770
How many tasks can it execute simultaneously?
Is it CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_ITEM_SIZES = 2560?
To be more specific: a single-core processor can execute only one task at any moment, a dual-core can execute two tasks... How many tasks can my GPU execute at one moment? Or, rephrased: how many processors does my GPU have?
The RV770 has 10 SIMD cores, each consisting of 16 shader cores, each consisting of 5 ALUs (VLIW5 architecture), for a total of 800 ALUs that can compute in parallel. I don't think there's a way to get all of these numbers out of OpenCL. I'm also not sure what you would equate to a CPU core. Perhaps a shader core? You can read about VLIW on Wikipedia; it's an interesting design.
If you say a CPU core is only executing one "task" at any given time, even though it has multiple ALUs working in parallel, then I guess you can say the RV770 would be working on 160 tasks. But with the differences in how different chips work, I think "core" and "task" can become difficult to define. A CPU with hyperthreading can even execute two sets of code at the same time. With OpenCL I don't believe it is possible yet to execute more than one kernel at any given time - unless recent driver updates have changed that.
Anyway, I think it is more important to present your work to the GPU in a way that gives the best performance. Unfortunately there's no way to find the best work group size other than experimenting. At least not that I know of. One help is that if the drivers support OpenCL 1.1 you can query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and set your work size to a multiple of that. Otherwise, going for a multiple of 64 is probably a safe bet.
GPU work ultimately ends up executing as wavefronts (AMD) or warps (NVIDIA).
Using a GPU for both UI and compute effectively means running many programs on it without being aware of it: many work items for GUI drawing, plus whatever compute kernels you are executing. Fast OpenCL clients are asynchronous and overlap multiple instances of work so they won't be latency-bound. You are expected to use multiple kernels in parallel.
There doesn't seem to be a "hard" limit other than memory limiting the number of buffers you can use. When using the same GPU for UI and for compute, you must throttle your work. In my experience, issuing too much work will starve the GUI and/or your compute kernels. There doesn't seem to be any mechanism ensuring that you won't have starvation (long delays before a work item actually begins executing). Some work items may sit for a very long time (tens of seconds or more in bad cases) while the GPU works on other items. I speculate that items are dispatched to pipelines based on data availability, and that little or nothing prevents starvation of work items.
Limiting how far ahead work is enqueued greatly improves GUI responsiveness: it lets the GPU drain its queue almost (and sometimes completely) to empty, reducing the starvation delays for GUI drawing work items.