Haxe - Monitor CPU usage of a particular process/executable - cpu-usage

It is required of my Haxe program to calculate the average CPU usage of a particular executable/process over a specified period of time.
I understand that the solution is platform specific and would like to achieve it for both Windows and Linux.

Related

Checking if gpu is integrated or not

I couldn't find any query command about device being integrated/embedded in cpu or using system ram or its own dedicated gddr memory? I can benchmark mapping/unmapping versus reading/writing to get a conclusion but that device can be under load at that time and behave wrong and it would add complexity to already complex load balancing algorithm that I'm using.
Is there a simple way to check if a gpu is using same memory with cpu so I can choose directly mapping/unmapping instead of reading/writing?
Edit: there is CL_DEVICE_LOCAL_MEM_TYPE
CL_GLOBAL or CL_LOCAL
is this an indication of integratedness?
OpenCL 1.x has the device query CL_DEVICE_HOST_UNIFIED_MEMORY:
Is CL_TRUE if the device and the host have a unified memory subsystem
and is CL_FALSE otherwise.
This query is deprecated as of OpenCL 2.0, but should probably still work on OpenCL 2.x platforms for now. Otherwise, you may be able to produce a heuristic from the result of CL_DEVICE_SVM_CAPABILITIES instead.

Difference between CPU Usage and CPU Utilization?

I was wondering if there is a scientific differentiation in terminology when speaking of CPU Usage and CPU Utilization. I have the feeling that both words are used as synonyms. They both describe the relation between CPU Time and CPU Capacity. Wikipedia calls it CPU Usage. Microsoft uses CPU Utilization. But I also found an article where Microsoft uses the term CPU Usage. Now VMware defines to use CPU Utilization in the context of physical CPUs and CPU Usage in the context of logical CPUs. Also, there is no tag for cpu_utilization in stackoverflow.
Does anyone know a scientific differentiation?
Usage
CPU usage as a percentage during the interval.
o VM - Amount of actively used virtual CPU, as a percentage of total available CPU. This is the host's view of the CPU usage, not the guest operating system view. It is the average CPU utilization over all available virtual CPUs in the virtual machine. For example, if a virtual machine with one virtual CPU is running on a host that has four physical CPUs and the CPU usage is 100%, the virtual machine is using one physical CPU completely.
virtual CPU usage = usagemhz / (# of virtual CPUs x core frequency)
o Host - Actively used CPU of the host, as a percentage of the total available CPU. Active CPU is approximately equal to the ratio of the used CPU to the available CPU.
available CPU = # of physical CPUs x clock rate
100% represents all CPUs on the host. For example, if a four-CPU host is running a virtual machine with two CPUs, and the usage is 50%, the host is using two CPUs completely.
o Cluster - Sum of actively used CPU of all virtual machines in the cluster, as a percentage of the total available CPU.
CPU Usage = CPU usagemhz / effectivecpu
CPU usage, as measured in megahertz, during the interval.
o VM - Amount of actively used virtual CPU. This is the host's view of the CPU usage, not the guest operating system view.
o Host - Sum of the actively used CPU of all powered on virtual machines on a host. The maximum possible value is the frequency of the processors multiplied by the number of processors. For example, if you have a host with four 2GHz CPUs running a virtual machine that is using 4000MHz, the host is using two CPUs completely.
4000 / (4 x 2000) = 0.50
Used:
Time accounted to the virtual machine. If a system service runs on behalf of this virtual machine, the time spent by that service (represented by cpu.system) should be charged to this virtual machine. If not, the time spent (represented by cpu.overlap) should not be charged against this virtual machine.
Reference:http://pubs.vmware.com/vsphere-51/index.jsp?topic=%2Fcom.vmware.wssdk.apiref.doc%2Fcpu_counters.html
Very doubtful. You will probably find exact definitions in some academic text books but I bet they'll be inconsistent between text books. I've seen definitions in manpages that are inconsistent with the actual implementation within the code. This is a case where everyone assumes the definitions are so obvious they never check to see if theirs is consistent with others.
My suggestion is to fully definite your use and go with that. Others can then have a reference (your formula/algorithm) and can translate between yours and theirs.
By the way, figuring out utilization, usage, etc. is very complicated and fraught with traps. OSs move tasks around, logical CPUs move between cores, turbo modes temporarily bump clock rates, work is offloaded to internal coprocessors, processors go to sleep or drop in frequency, hyperthreading where multiple logical CPUs contend for shared resources, etc. What's worse is that it is a moving target. Exact and well-defined metrics today will start to get out of date quickly as hardware and software architectures continue to evolve per Moore's law and any SW equivalent.
Within a single context (paper, book, web article, etc.), there may be a difference, but there are not, as far as I know, consistent universally accepted standard definitions for these terms.
Within one authors writings, however, they might be used to describe different things. For example (not an exhaustive list):
How much of a single CPUs computing capacity is being used over a specific sample period
How much of a single CPUs computing capacity is being used by a specific schedulable entity (thread, process, light-weight process, kernel, interrupt routine, etc.) over a specific sample period
Either of the above, but taking all CPUs in the system into account
Any of the above, but with a difference in perspective between real CPUs and virtual CPUs (whether hyperthreading or CPUs actually being emulated by VMware, KVM/QEMU, Xen, Virtualbox or the like)
A comparative measure of how much CPU capacity is being used in one algorithm over another
Probably several other possibilities as well....

Limit CPU & Memory for *nix Process

Is it possible to limit CPU & Memory for the *nix Process?
The CPU limit may look like "use no more than 10% of one core".
The memory limit may look like "use no more than 100Mb", the OS may limit it or kill the process if it try to exceed the limit, both ways are fine.
Any *nix that could do that would be fine.
It seems it is possible to implement it with virtual machines, but it is not acceptable because the overhead is too huge.
If you happen to use Solaris, the ability to limit resource usage is a native feature.
Memory (RAM) usage can be capped per process using the rcap.max-rss setting while CPU usage can be limited per project using the project.cpu-caps.
Note that Solaris also allows OS level virtualization (a.k.a. zones) which have no significant overhead, if any, compared to a bare metal OS instance.
Resource capping is part of Solaris zones configuration.
Try CPULimit
cpulimit is a simple program which attempts to limit the cpu usage of a process (expressed in percentage, not in cpu time). This is useful to control batch jobs, when you don't want them to eat too much cpu. It does not act on the nice value or other scheduling priority stuff, but on the real cpu usage. Also, it is able to adapt itself to the overall system load, dynamically and quickly.

Confusion about the performance counter

I am working on a program written in OpenCL and running on Fusion APU (CPU+GPU on one die). I wan to get some performance counters such as instructions number, branch number and so on. I have two tools on hand: AMD APP Profiler and CodeAnalyst. When I use the APP Profiler, I found that it seems can only provide instructions counter for GPU, cannot for CPU. Then I use CodeAnalyst, but then three confusions occurred.
On App Profiler, it can give the number of ALUInsts (i.e. the number of executed ALU instructions per work-item) is about 70000. The whole thread space on GPU has 8192 threads, so I intuitively think there are 70000 * 8192 instructions executed by GPU. Is that right?
When I use CodeAnalyst to measure the instructions for the same program on CPU part, it just gave "Ret inst", "Ret branch" such kind of counters, but I am not sure about one thing: this program runs on both CPU and GPU at the same time, what are these counters for? For CPU only, for GPU only? or the sum?
No matter what these counters for, I found that the value of Ret Inst (i.e. retired instructions) is about 40000, it seems too small for the whole program, I guess the instructions for a program should be at order of billions, how it could be only 4w? The attached pic shows the results.
Is there any people can help me resolve these confusions, I am just a tyro here, wish kind help from all of you. Thanks!

OpenCL- waste of host computing power

I am new to OpenCL, please tell me that the host cpu can be used only for allocating memory to the device, or we can use it can as an openCL device. (Because after the allocation is done, the host cpu will be idle).
You can use a cpu as a compute device. Opencl even allows multicore/processor systems to segment cores into separate compute units. I like to use this feature to divide the cpus on my system into groups based on NUMA nodes. It is possible to divide a cpu into compute devices which all share the same level of cache memory (L1, L2, L3 or L4).
You need a platform that supports it, such as AMD's SDK. I know there are ways to have Nvidia and AMD platforms on the same machine, but I have never had to do so myself.
Also, the opencl event/callback system allows you to use your cpu as you normally would while the gpu kernels are executing. In this way, you can use openmp or any other code on the host while you wait for the gpu kernel to finish.
There's no reason the CPU has to be idle, but it needs a separate job to do. Once you've submitted work to OpenCL you can:
Get on with something else, like preparing the next set of work, or performing calculation on something completely different.
Have the CPU set up as another compute device, and so submit a piece of work to it.
Personally I tend to find myself needing the first case more often as it's rare I find myself with two tasks that are independent and lend themselves to OpenCL style. The trick is keeping things balanced so you're not waiting a long time for the GPU task to finish, or having the GPU idle while the CPU is getting on with other work.
It's the same problem OpenGL coders had to conquer. Avoiding being CPU or GPU bound, and balancing between the two for best performance.

Resources