Calculating % of CPU time consumed by a process on a Tiva 5 series ARM Cortex-M4F microcontroller

I am designing an RTOS for the Tiva 5 series microcontroller and I am supporting basic Linux-style commands like ps. I need to calculate the percentage of CPU time consumed by every process in the RTOS, since I display that in the output of the ps command.
What is the best way forward?

One way is to sample the program counter value at regular intervals, in real time.
The more frequent the samples, the better the results.
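For illustration, here is a minimal sketch of tick-based accounting, assuming a periodic timer interrupt (e.g. SysTick) and a scheduler that tracks the currently running task; every name below (task_table, current_task, NUM_TASKS) is hypothetical, not part of any particular RTOS:

```c
#include <stdint.h>

#define NUM_TASKS 8

typedef struct {
    uint32_t run_ticks;               /* ticks observed while this task was running */
} task_stats_t;

static task_stats_t task_table[NUM_TASKS];
static volatile uint32_t total_ticks;
static volatile int current_task;     /* updated by the scheduler on each context switch */

/* Call from the periodic timer ISR: attribute this tick to the running task. */
void cpu_usage_tick(void)
{
    task_table[current_task].run_ticks++;
    total_ticks++;
}

/* CPU percentage for one task, e.g. for the ps output. */
uint32_t cpu_usage_percent(int task_id)
{
    if (total_ticks == 0)
        return 0;
    return (uint32_t)(((uint64_t)task_table[task_id].run_ticks * 100u) / total_ticks);
}
```

The faster the tick relative to the scheduling quantum, the closer the counts track reality; periodically halving or resetting the counters would give a recent-usage figure rather than a since-boot average.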

Related

What kind of code/instructions can make an MCU run at its maximum power consumption

I am trying to evaluate the maximum power consumption of an MCU (Renesas RX72N or RX651). It's not battery powered and it never runs in sleep mode. I am thinking I can write a piece of benchmark code, in C of course, that does a lot of complex calculations. While the MCU is executing the calculations, I can measure the current it draws and deduce its maximum power consumption. Is my understanding correct? If so, what kind of code do you think I should write, or is there already some open-source code I can use for the purpose?
Thanks in advance.
-woody
In general you just have to experiment. There is no fixed answer. First, check the datasheet; you certainly want flash on and all the peripherals out of reset. The fastest clock is probably good, but understand that running from flash does not necessarily scale up; it depends on the part (does it have wait states specified in a table relative to processor clock speed?).
Changing states burns power. So you want to try to flip as many as you can. But it could be that you want to flip gpio or other external pins rather than try to get the processor core to pull more power.
You will want a nice meter for this one that can measure milliamps or milliwatts.
Things like multiply and divide in theory take a good percentage of chip space if they are implemented in one clock, but some of these MCUs naturally won't do that; the instruction will be multi-clock, or at times the chip vendor can choose (if they buy an ARM core, for example). You may also need to play with data patterns to get more consumption. For core activity you probably want to do as much register-based work as possible rather than reading/writing memory.
You probably want to write the test code in assembly language. Start with an infinite loop, a branch to self, and measure power/current. Then complicate the loop with additional memory cycles or ALU operations and see if you can detect a power difference (you may find that you cannot make much difference with the core alone). Depending on the MCU design you may get better execution performance running from RAM, but it depends; RAM tends to be faster than flash in MCUs. Then do things like flipping GPIO pins, leaving them high, etc. If you have LEDs, turn them all on, naturally, and so on.
There is no one answer for this, so no single benchmark or solution pushes every chip the hardest. Assume that, if it is possible to see a noticeable difference for a particular chip, the test will be specific to that chip and not necessarily the same solution for other chips from the same company, much less chips from other companies.
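If you prototype in C before dropping to assembly, the kind of loop described above might look like the sketch below; the bit patterns and the GPIO register address are placeholders, not real Renesas RX register definitions, so treat it only as a starting point to measure against:

```c
#include <stdint.h>

/* Placeholder for a memory-mapped GPIO port data register (hypothetical address). */
#define GPIO_PORT_DATA (*(volatile uint32_t *)0x40000000u)

volatile uint32_t sink;   /* volatile sink keeps the compiler from deleting the work */

void power_benchmark(void)
{
    uint32_t a = 0xAAAAAAAAu;   /* alternating bit patterns to maximize toggling */
    uint32_t b = 0x55555555u;

    for (;;) {
        a = a * b + 0x12345678u;    /* multiply and add for ALU activity */
        b ^= (a >> 3);
        sink = a + b;               /* one memory write per iteration */
        GPIO_PORT_DATA ^= 0xFFu;    /* flip external pins, per the suggestion above */
    }
}
```

Measure the current with the plain branch-to-self loop first, then with variants of this loop, and keep whichever change actually moves the meter.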

Arduino measuring power to run code

I have C code running on bare metal (no OS). The code takes in some sensor data, performs a computation, forms a packet, and transmits it. The board is battery powered.
I'm interested in knowing the energy consumed by each operation, in joules. Is this possible? How would one go about doing it?
The number of joules used per instruction depends upon the processor you are using and which instruction you are looking at. I believe the ARM and the Atmel AVR processors have no real hardware power management which makes things simpler.
How much energy an instruction uses has to do with how much and what type of on-silicon circuitry it uses. This means that trying to theoretically compute the number of joules will be complicated since it is not simply related to the number of cycles the instruction uses.
So you’ll have to do it experimentally. Here’s what I’d do.
1. Remove all compiler optimizations.
2. Do a frequency analysis to find your hot operations and pick out the most used. (I'm assuming we aren't talking ASM instructions but 'C' instructions.)
3. Write a loop that repeats the instruction, say, 20 times, and have the loop run for several seconds at a minimum (a sketch of such a loop follows this list).
4. Replace your battery with a power supply.
5. Use the series resistor to measure power as mentioned in the comment, but (of course) scale the voltage appropriately.
6. Run the looping program and get a statistical sampling of the power.
7. Do this for all of your hot operations.
8. Compute the power usage for your program.
9. Validate the power usage against real power measurements of the execution of your program.
10. Adjust (i.e. normalize) your computations as appropriate.
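As a rough illustration of step 3, here is what such a measurement loop might look like in C, assuming the hot operation turned out to be an integer multiply (the names and values are illustrative; substitute whatever operation your frequency analysis identified):

```c
#include <stdint.h>

/* volatile inputs and output defeat constant folding, so the multiply
 * actually executes every iteration even with optimizations removed */
volatile int32_t in_a = 1234, in_b = 567;
volatile int32_t out;

void measure_multiply(void)
{
    for (;;) {                    /* let this spin for several seconds while sampling current */
        out = in_a * in_b;
        out = in_a * in_b;
        out = in_a * in_b;
        out = in_a * in_b;
        out = in_a * in_b;        /* ...repeated ~20 times so loop overhead stays small */
    }
}
```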
You’ll also have to take into account the memory hierarchy. Accessing off chip memory takes energy. When operations or data are cached, it’s going to change your energy equation.
I figure this should work but don’t know. Good luck.

MPI + GPU : how to mix the two techniques

My program is well-suited for MPI. Each CPU does its own, specific (sophisticated) job, produces a single double, and then I use an MPI_Reduce to multiply the result from every CPU.
But I repeat this many, many times (> 100,000). Thus, it occurred to me that a GPU would dramatically speed things up.
I have Googled around, but can't find anything concrete. How do you go about mixing MPI with GPUs? Is there a way for the program to query and verify "oh, this rank is the GPU, all others are CPUs"? Is there a recommended tutorial or something?
Importantly, I don't want or need a full set of GPUs. I really just need a lot of CPUs, and then a single GPU to speed up the frequently-used MPI_Reduce operation.
Here is a schematic example of what I'm talking about:
Suppose I have 500 CPUs. Each CPU somehow produces, say, 50 doubles. I need to multiply all 250,000 of these doubles together. Then I repeat this between 10,000 and 1 million times. If I could have one GPU (in addition to the 500 CPUs), this could be really efficient. Each CPU would compute its 50 doubles for all ~1 million "states". Then, all 500 CPUs would send their doubles to the GPU. The GPU would then multiply the 250,000 doubles together for each of the 1 million "states", producing 1 million doubles.
These numbers are not exact. The compute is indeed very large. I'm just trying to convey the general problem.
This isn't the way to think about these things.
I like to say that MPI and GPGPU stuff are orthogonal(*). You use MPI between tasks (for which think nodes, although you can have multiple tasks per node), and each task may or may not use an accelerator like a GPU to accelerate the computation within task. There is no MPI rank on a GPU.
Regardless, Talonmies is right; this particular example doesn't sound like it would benefit much from a GPU. And it won't be helped by having tens of thousands of doubles per task; if you're only doing one or a few FLOPs per double, the cost of sending the data to the GPU will exceed the benefit of having all those cores operate on them.
(*) This used to be more clearly true; now with, for instance, GPUDirect being able to copy memory to remote GPUs over infiniband, the distinction is fuzzier. However, I maintain that this is still the most useful way to think about things, with such things as RDMA to GPUs being an important optimization but conceptually a minor tweak.
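For reference, a minimal sketch of the CPU-only baseline the question describes: each rank multiplies its own handful of doubles locally, and a single MPI_Reduce with MPI_PROD combines the per-rank products (the per-rank work here is just a stand-in):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank reduces its own ~50 doubles locally... */
    double local = 1.0;
    for (int i = 0; i < 50; i++)
        local *= 1.0 + 1e-6 * (rank + i);   /* stand-in for the real per-rank computation */

    /* ...so the global reduction only has to move one double per rank. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_PROD, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("product = %g\n", global);

    MPI_Finalize();
    return 0;
}
```

Because the reduction is so cheap once the local products are folded in, shipping the raw doubles to a GPU mostly adds transfer cost, which is the point made above.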
Here is some material I found on the topic:
"MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel programming approaches of MPI and CUDA. A common reason is to enable solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Another reason is to accelerate an existing MPI application with GPUs or to enable an existing single-node multi-GPU application to scale across multiple nodes. With CUDA-aware MPI these goals can be achieved easily and efficiently. In this post I will explain how CUDA-aware MPI works, why it is efficient, and how you can use it."

How many tasks can be executed simultaneously on GPU device?

I'm using OpenCL and have ATI 4850 card. It has:
CL_DEVICE_MAX_COMPUTE_UNITS: 10
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_SIZES:(256, 256, 256)
CL_DEVICE_AVAILABLE: 1
CL_DEVICE_NAME: ATI RV770
How many tasks can it execute simultaneously?
Is it CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_ITEM_SIZES = 2560?
To be more specific: a single-core processor can execute only one task at any one moment, a dual-core can execute 2 tasks... How many tasks can my GPU execute at one moment? Or rephrased: how many processors does my GPU have?
The RV770 has 10 SIMD cores, each consisting of 16 shader cores, each consisting of 5 ALUs (VLIW5 architecture). A total of 800 ALUs that can do parallel computations. I don't think there's a way to get all these numbers out of OpenCL. I'm also not sure what you would equate to a CPU core. Perhaps a shader core? You can read about VLIW at Wikipedia. It's an interesting design.
If you say a CPU core is only executing one "task" at any given time, even though it has multiple ALUs working in parallel, then I guess you can say the RV770 would be working on 160 tasks. But with the differences in how different chips work, I think "core" and "task" can become difficult to define. A CPU with hyperthreading can even execute two sets of code at the same time. With OpenCL I don't believe it is possible yet to execute more than one kernel at any given time - unless recent driver updates have changed that.
Anyway, I think it is more important to present your work to the GPU in a way that gives the best performance. Unfortunately there's no way to find the best work group size other than experimenting. At least not that I know of. One help is that if the drivers support OpenCL 1.1 you can query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and set your work size to a multiple of that. Otherwise, going for a multiple of 64 is probably a safe bet.
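For example, the OpenCL 1.1 query mentioned above might look like this on the host side, assuming you already have a built cl_kernel and the cl_device_id it will run on:

```c
#include <CL/cl.h>
#include <stdio.h>

/* Returns the preferred work-group size multiple for this kernel/device,
 * falling back to 64 (the "safe bet" above) if the query fails. */
size_t preferred_wg_multiple(cl_kernel kernel, cl_device_id device)
{
    size_t multiple = 64;
    cl_int err = clGetKernelWorkGroupInfo(kernel, device,
            CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
            sizeof(multiple), &multiple, NULL);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clGetKernelWorkGroupInfo failed (%d), using 64\n", (int)err);
    return multiple;
}
```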
GPU work ends up becoming wavefronts/warps.
Using a GPU for UI and compute is effectively using it for many programs without being aware of it: many of them for the GUI drawing, plus whatever compute kernels you are executing. Fast OpenCL clients are asynchronous and overlap multiple instances of work so they won't be latency-bound. It is expected that you'll use multiple kernels in parallel.
There doesn't seem to be a "hard" limit other than memory limiting the number of buffers you can use. When using the same GPU for UI and for compute, you must throttle your work. In my experience, issuing too much work will cause starvation of the GUI and/or your compute kernels. There doesn't seem to be anything in the way of ensuring that you won't have starvation (long delays before a work item begins actually executing). Some work item(s) may sit for a very long time (tens of seconds or more in bad cases) while the GPU does other work items. I speculate that items are dispatched to pipelines based on data availability and that little or nothing is there to prevent starvation of work items.
Limiting how far ahead work is enqueued greatly improves GUI responsiveness by letting the GPU drain its work queue almost (and sometimes fully) to empty, reducing the starvation delays for GUI drawing work items.
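One way to implement that throttling, sketched under the assumption that the queue, kernel and work sizes come from your existing setup (the two-deep in-flight limit is arbitrary and worth tuning):

```c
#include <CL/cl.h>

/* Hypothetical helper: enqueue num_batches launches of one kernel while keeping
 * at most two of them in flight, so the queue can drain between batches. */
void enqueue_throttled(cl_command_queue queue, cl_kernel kernel,
                       size_t global_size, size_t local_size, int num_batches)
{
    cl_event in_flight[2] = {0, 0};
    int slot = 0;

    for (int batch = 0; batch < num_batches; batch++) {
        if (in_flight[slot]) {                      /* wait on the launch issued two batches ago */
            clWaitForEvents(1, &in_flight[slot]);
            clReleaseEvent(in_flight[slot]);
        }
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                               0, NULL, &in_flight[slot]);
        clFlush(queue);                             /* submit without blocking */
        slot ^= 1;
    }

    clFinish(queue);                                /* drain whatever is still outstanding */
    for (int i = 0; i < 2; i++)
        if (in_flight[i])
            clReleaseEvent(in_flight[i]);
}
```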

What is the relation between MCPS (million cycles per second) and power consumed?

I have been working on an MP3 decoder on an ARM Cortex-A8 board.
I have a requirement saying the MP3 decoder solution I am building should consume 50 milliwatts of power. This raised a few questions in my mind:
1.) I recall that there is some relation between the core voltage applied (V), the clock frequency (f) of a processor, and the power consumed (P), something like P being directly proportional to the voltage and the frequency squared. But what is the exact relation? Given the operating clock frequency and voltage of a processor, how can we calculate the power it consumes?
2.) Now if I get the power consumed from step 1.) at some clock frequency, and I am told that the decoder solution I am providing can consume only 50 milliwatts, how can I get the maximum limit on MCPS, which will be the upper bound on the MCPS of my decoder solution running on that hardware board?
Can I deduce that if the power obtained in step 1.), say P, is consumed at frequency F, then the clock frequency corresponding to 50 milliwatts can be calculated accordingly, and call that frequency my code's MHz (MCPS) upper bound?
Basically, how does one map (is there any equation?) the power consumed by a piece of software to the MCPS it consumes?
I hope this is relevant here, or should it go to superuser?
Thank you.
-AD.
It really depends on the architecture.
From ARM's own page for the Cortex-A8:
Core area, frequency range and power consumption are dependent on process, libraries and optimizations.
Power with cache (mW/MHz): <0.59 to <0.45, depending on process
Basically, it states that you can't accurately calculate the power consumption, so your best bet is to do some measurements yourself. Run an application that keeps the CPU fully loaded and measure the power consumption. That will give you an idea of the maximum-load figure, which is a good starting point for knowing how much you need to optimize your code and where to insert idle points.
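As a rough worked example tying the two parts of the question together: the usual rule of thumb for dynamic power is P ≈ α·C·V²·f (activity factor times switched capacitance times voltage squared times frequency), but if you simply take the per-MHz figure quoted above at face value and assume the whole 50 mW budget goes to the core (both are simplifications), the budget maps to a clock/MCPS bound like this:

```c
#include <stdio.h>

int main(void)
{
    const double budget_mw       = 50.0;  /* decoder power budget from the question */
    const double core_mw_per_mhz = 0.59;  /* "power with cache" figure quoted above */

    /* 50 mW / 0.59 mW/MHz is roughly 85 MHz of core clock, i.e. roughly
     * 85 MCPS available to the decoder under these assumptions. */
    double max_mhz = budget_mw / core_mw_per_mhz;
    printf("Upper bound: ~%.0f MHz, i.e. ~%.0f MCPS\n", max_mhz, max_mhz);
    return 0;
}
```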
