I am trying to understand task scheduler functions. For example, I am working on a 32-bit Infineon AURIX TriCore controller whose task schedulers are designed for 5 ms. If I decide to run my application on a 10 ms task scheduler function instead of the 5 ms one, what should I take into account?
For example, the impact on CPU run time, CPU load analysis, etc.?
And how would changing the task scheduler in the low-level code affect code execution?
In short, the smaller the task time slice, the smoother the multitasking will appear to the user. On the other hand, more task switches increase the time spent switching tasks instead of running them.
Longer time slices with many tasks mean a long interval before the same task is revisited (e.g., more jerky behavior).
(Note: I normally use 1ms task switches on very low end MCUs with very good results with around 5-10 total tasks.)
If you are changing the entire scheduler to run at 10 ms instead of 5 ms, consider whether the SW will still detect changes in your system in time. For example, with an environment temperature sensor, a meaningful change in value within 5 ms is extremely unlikely, but if you are detecting something like wheel speed, 10 ms might be too slow. Similarly, consider whether you can still control the actuators fast enough to create the desired response in the system. If these two things are taken care of, and the task execution time (not the frequency) is within limits, the CPU load will not be a problem.
Note: If your application has code that calculates delays under the assumption that the task runs every 5 ms, that code needs to be changed.
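To illustrate the note above, here is a minimal sketch of deriving delay counts from one period constant, so that a change from 5 ms to 10 ms only touches a single line. The names `TASK_PERIOD_MS`, `MS_TO_TICKS`, `sw_delay_t`, `delay_start` and `delay_expired` are hypothetical and do not come from any AURIX library.

```c
#include <stdint.h>

/* Hypothetical constant: the period of the task that services the delays. */
#define TASK_PERIOD_MS 10u   /* was 5u before the scheduler change */

/* Round up so that a delay is never shorter than requested. */
#define MS_TO_TICKS(ms) (((ms) + TASK_PERIOD_MS - 1u) / TASK_PERIOD_MS)

typedef struct {
    uint32_t ticks_left;     /* task activations until the delay expires */
} sw_delay_t;

void delay_start(sw_delay_t *d, uint32_t ms)
{
    d->ticks_left = MS_TO_TICKS(ms);
}

/* Call once per task activation; returns 1 once the delay has expired. */
int delay_expired(sw_delay_t *d)
{
    if (d->ticks_left > 0u) {
        d->ticks_left--;
    }
    return d->ticks_left == 0u;
}
```

Code that instead hard-codes something like "20 calls = 100 ms" would silently become 200 ms after the change, which is the kind of spot to search for.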
On the other hand, if you are adding a 10 ms task in addition to a 5 ms task, then you should consider the following:
1. Adding a new time slice adds an additional context switch operation, which adds a delay.
2. Depending on the tasks' priorities and preemptive/co-operative behavior, one task can block the other for some duration, which can potentially create lag in the functionality or cause it to malfunction. This is only a possibility and need not happen.
3. A context switch also means you now need more stack area than before; depending on your project's stack utilization, this may cause an issue.
You need to analyse your SW to come to a conclusion on points 2 and 3. Here are some examples that could help in the analysis:
Ex1: If the execution time (not the frequency) of your 10 ms task is negligible compared to the execution time of your 5 ms task, then the better option is to schedule this functionality in the 5 ms task as well, as this saves you the context switch time as well as stack size.
Ex2: If the execution time of your 10 ms task is comparatively long and it has lower priority than the 5 ms task (and is preemptible by it), it is better to keep the functionality in the 10 ms task, as this helps the 5 ms task finish its execution before its next slice.
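For the CPU load part of the analysis, a rough first estimate is to sum execution time divided by period over all periodic tasks. The sketch below assumes you have measured a worst-case execution time per task; the struct, the function name and the numbers in the usage note are made up for illustration, and context switch overhead is ignored.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t period_us;  /* activation period of the task */
    uint32_t wcet_us;    /* measured worst-case execution time */
} task_budget_t;

/* Returns the summed load in permille (1000 = 100 % CPU).
 * Assumes wcet_us is small enough that wcet_us * 1000 fits in 32 bits. */
uint32_t cpu_load_permille(const task_budget_t *tasks, size_t count)
{
    uint32_t load = 0u;
    for (size_t i = 0; i < count; i++) {
        load += (tasks[i].wcet_us * 1000u) / tasks[i].period_us;
    }
    return load;
}
```

With a 5 ms task that runs for 1.5 ms and a 10 ms task that runs for 2 ms this gives 300 + 200 = 500, i.e. about 50 % load. Running both pieces of work every 5 ms would give 700, which is the trade-off behind Ex1 and Ex2.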
...aside from the benefit in separate performance monitoring and logging.
For logging, I am confident I can get granularity through manually adding the name of the "routine" to each call. This is how it is now with several discrete Functions for different parts of the system:
There are multiple automatic logs: start and finish of the routine, for example. It would be more challenging to find out how expensive certain routines are, but it would not be impossible.
The reason I want the entire logic of the application handled by a single handle function is because of reducing cold starts: one function means only one container that can be persistently kept alive when there are very few users of the app.
If a month is ~2.6m seconds and we assume the system uses 1 GB RAM and 1 GHz CPU frequency at all times, that's:
2600000 * 0.0000025 + 2600000 * 0.000001042 = USD$9.21 a month
...for one minimum instance.
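The same arithmetic as a small sketch, in case you want to try other memory or CPU sizes. The two per-second prices are copied from the calculation above; scaling them linearly by GB and GHz is my assumption, and the function name is made up.

```c
/* Per-second prices taken from the calculation above (USD). */
#define PRICE_PER_GB_SECOND   0.0000025
#define PRICE_PER_GHZ_SECOND  0.000001042

/* Cost of keeping one instance alive for 'seconds' seconds.
 * Linear scaling with gb and ghz is an assumption of this sketch. */
double min_instance_cost(double seconds, double gb, double ghz)
{
    return seconds * gb * PRICE_PER_GB_SECOND
         + seconds * ghz * PRICE_PER_GHZ_SECOND;
}
```

`min_instance_cost(2600000.0, 1.0, 1.0)` reproduces the roughly USD 9.21 figure: 6.50 for memory plus about 2.71 for CPU.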
I should also state that all of my functions have the bare minimum amount of global scope code; it just sets up Firebase assets (RTDB and Firestore).
From a billing, performance (based on user wait time), and user/developer experience perspective, is there any reason why it would be smart to keep all my functions discrete?
I'd also accept an answer saying "one single function for all logic is reasonable" as long as there's a reason for it.
Thanks!
If you have a very small app with ~5 endpoints and very low traffic, sure, you could do something like this. But here are reasons not to:
billing and performance
The important thing to realize is that each function instance handles one request at a time, so concurrent requests each get their own instance of your function. That means there could be tens of them running at the same time.
If you would like to have just 1 instance handling all the traffic, you should explore GCP Cloud Run, where you have 1 container handling multiple requests and scaling only when that is not sufficient.
Imagine you have several endpoints and each of them has different performance requirements.
One may need only 128 MB of RAM
Another may need 1 GB of RAM
(FYI: You can control the CPU MHz of the function via the RAM settings too - which can speed up execution in some cases)
If you had only 1 function with 1 GB of RAM, every request would allocate such a function, and in some cases most of the memory would go to waste.
But if you split it into multiple functions, some requests will require far fewer resources, which can save you money once you reach a larger number of executions per month (tens of thousands or more).
Let's imagine function, 3 second execution, 10k executions/month:
128MB would cost you $0.0693
1024MB would cost you $0.495
As you can see, with a small app the difference is negligible. But once you scale, it matters. (*The cost can vary based on the datacenter.)
As for the logging, I don't think it matters. In bigger systems there are usually messages traveling through several functions, so you have to deal with that anyway.
As for the cold start, you just need a good UI to handle it. At first I was worried about it in our apps, but later on you just get used to the fact that some actions can take ~2 s to execute (cold start). And you should have a "loading" state in the UI regardless, because you don't know whether the function will take ~100 ms or 3 s due to a bad connection.
Recently I got the task of optimizing a fairly large PL/SQL script which, prior to my changes, took about 1 hour +/- 10 minutes.
So I moved some methods around and replaced big views with simpler subqueries or WITH statements. I noticed that if I ran the scheduled job manually (right-click, then execute job), I would in most cases see the run duration improve. But if I enabled the job and let it run on its schedule, it takes the original hour no matter what changes I make to it.
Now my question here is: Is there any way to monitor the RAM or CPU usage of the session/job or is there a difference in general how many resources are allocated to background processes? Because my suspicion here is the "manual" run job somehow gets some priorities the scheduler doesn't get or doesn't take.
Either way for troubleshooting purposes you can't take a few hours a work day just to wait for results.
Assume you have some functions that must be called at different points in time but continuously (periodic tasks, e.g. every 250 ms, every 2 s, every 5 min).
Is it better to use 4-5 timers, each one dedicated to a task, or is it better to code everything in the shortest-period task and then use a counter variable to run the other functions?
e.g.
static unsigned int counter = 0;

// callback every 250 ms
void task_250ms(void) {
    counter++;
    // do 250 ms stuff
    if (counter % 8 != 0) { // 250 ms * 8 = 2 s
        return;
    }
    // do 2 s stuff
    if (counter != 4800) { // 250 ms * 4800 = 20 min
        return;
    }
    // do 20 min stuff
    counter = 0;
}
Assume also that you want to avoid/be bulletproof to situations like this:
before doing 2 secs stuff you MUST be sure that the 8th 250ms task is computed.
before doing 20 min stuff you MUST be sure that the 4800th 250ms and the 600th 2s task is computed.
The question is related to best practice and performance.
Moreover is it better to perform those calculations in the callback or use the callback to modify flags and perform the calculations in the main loop ?
I assume you are using STM32 since you tagged STM32.
Unless your application is so time critical that you need preemptive, asynchronous timer interrupts (for example, the 5 min task is so important that it must run even while a separate 250 ms callback task is running), using multiple timer interrupts is just a waste of timers, and IMHO you should use as few interrupts as possible. A counter variable is not costly, so it is fine to use one.
The real consideration is the length of the tasks. ISRs should be as short as possible, so if the timer callback tasks are quite long, you should set flags in the ISR and poll them in the main loop. Polling flags is preferable especially when you run multiple callbacks from a single timer ISR. Imagine the moment when the 250 ms, 2 s, and 20 min callbacks are all due in the same ISR: that ISR will take about 3 times longer than usual.
By the way, if you decide to use a single timer, why not use SysTick? The SysTick timer is provided in every Cortex-M MCU and its operation is the same across the MCU families. You can easily configure it as a 1 ms interrupt timer. As long as you use polling in the main loop, a 1 ms interrupt should be fine. There are many tutorials on SysTick (for example, part1 and part2)
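A minimal sketch of the flag approach, assuming a timer interrupt that fires every 250 ms. All names here are hypothetical, and the `runs_*` counters only stand in for the real work.

```c
#include <stdint.h>

/* Flags written by the ISR and cleared by the main loop. */
static volatile uint8_t flag_250ms, flag_2s, flag_20min;
static uint32_t isr_ticks;              /* only touched inside the ISR */

/* Stand-ins for the real work, so the sketch can be observed. */
static uint32_t runs_250ms, runs_2s, runs_20min;

/* Assumed to be called by the timer interrupt every 250 ms. */
void timer_isr(void)
{
    isr_ticks++;
    flag_250ms = 1u;
    if (isr_ticks % 8u == 0u) {          /* 250 ms * 8 = 2 s */
        flag_2s = 1u;
    }
    if (isr_ticks % 4800u == 0u) {       /* 250 ms * 4800 = 20 min */
        flag_20min = 1u;
        isr_ticks = 0u;
    }
}

/* One pass of the main loop: fastest task first, slowest last. */
void main_loop_poll(void)
{
    if (flag_250ms) { flag_250ms = 0u; runs_250ms++; }
    if (flag_2s)    { flag_2s = 0u;    runs_2s++; }
    if (flag_20min) { flag_20min = 0u; runs_20min++; }
}
```

Because the flags are polled in a fixed order, the 250 ms work of a given tick runs before the 2 s work, which runs before the 20 min work. If one pass of the main loop can take longer than 250 ms, flags get overwritten and activations are lost, so a counter instead of a flag would be needed in that case.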
The standard way to do this for tasks that aren't very time critical, is to implement a single timer, which triggers once every millisecond.
That timer then goes through a list of registered "software timers" and checks if it is time for them to be executed. If so, the timer then calls a function pointer which contains the timer-specific code. That is, a callback function called upon by the timer driver.
If these functions are kept minimal, for example just setting a flag, you can execute them from the main timer ISR.
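A minimal sketch of such a software timer list. The names (`sw_timer_register`, `sw_timer_tick_1ms`, the two example callbacks) are made up for illustration and are not taken from any vendor library.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*sw_timer_cb_t)(void);

typedef struct {
    uint32_t period_ms;
    uint32_t elapsed_ms;
    sw_timer_cb_t callback;
} sw_timer_t;

#define MAX_SW_TIMERS 8u
static sw_timer_t timers[MAX_SW_TIMERS];
static size_t timer_count;

/* Returns 0 on success, -1 if the table is full or the arguments are invalid. */
int sw_timer_register(uint32_t period_ms, sw_timer_cb_t cb)
{
    if (timer_count >= MAX_SW_TIMERS || cb == NULL || period_ms == 0u) {
        return -1;
    }
    timers[timer_count].period_ms = period_ms;
    timers[timer_count].elapsed_ms = 0u;
    timers[timer_count].callback = cb;
    timer_count++;
    return 0;
}

/* Assumed to be called from the 1 ms hardware timer ISR. */
void sw_timer_tick_1ms(void)
{
    for (size_t i = 0; i < timer_count; i++) {
        if (++timers[i].elapsed_ms >= timers[i].period_ms) {
            timers[i].elapsed_ms = 0u;
            timers[i].callback();
        }
    }
}

/* Example callbacks kept minimal: they only count activations. */
static volatile uint32_t fired_250ms, fired_2s;
void on_250ms(void) { fired_250ms++; }
void on_2s(void)    { fired_2s++; }
```

Usage would be `sw_timer_register(250u, on_250ms);` and `sw_timer_register(2000u, on_2s);` during init, after which the ISR does the rest.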
You can make various arguments regarding power consumption and real-time requirements. It really depends on your application. But these questions can produce insightful answers for beginners, and even for more experienced developers. The keyword here is scheduling.
The typical setup I prefer, bare metal real-time:
Main runs all low priority and idle tasks. Main bases these timings on the systick timer that ticks every 1 ms: if( (now - then) > delay ){ then = now; foo(); }
These tasks can be interrupted by everything, except in a critical zone (when using ISR threadspace data).
Low priority tasks are blinking LED's and handling communications.
There are peripheral interrupts and timers that set IRQ pending bits to signal real-time work is ready to be done. Eg: read uart or adc register before overrun.
The interrupt priorities and timers are setup in a way that the work is done in the correct order at the correct time. Eg: when processing ADC samples, and the hardware alarm IRQ arrives, this is handled immediately.
This way I have the DMA signal that samples are ready to be processed, whilst a synchronized timer at a lower frequency sets the IRQ-pending bit for the process loop. The process loop must run after the samples are in, and thus has lower priority in the NVIC.
Advantage: Real time performance is not impeded when the communication channel is overflowed with data.
Disadvantage: The cpu never sleeps long.
The ISR's of the real time tasks may not exceed their time window. This is where Windowed Watchdog Timers are useful. Also, idle tasks will only run when there is time to spare. They might be late.
A similar option here is to use a real time operating system. Like ChibiOS.
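The `(now - then)` check used by main in the setup above can be sketched as below. The names are hypothetical, and `now_ms` is assumed to come from a free-running 1 ms SysTick counter. I use `>=` here so that the period is exactly `period_ms` ticks; with `>` as written above it would be one tick longer.

```c
#include <stdint.h>

typedef struct {
    uint32_t last_run_ms;   /* tick value of the last activation */
    uint32_t period_ms;
} periodic_t;

/* Returns 1 if the task is due and records the activation time.
 * Unsigned subtraction stays correct when the tick counter wraps. */
int periodic_due(periodic_t *p, uint32_t now_ms)
{
    if ((uint32_t)(now_ms - p->last_run_ms) >= p->period_ms) {
        p->last_run_ms = now_ms;
        return 1;
    }
    return 0;
}
```

In the main loop this becomes `if (periodic_due(&blink, ticks)) { foo(); }` for each low-priority task, where `blink` and `ticks` are whatever your project calls them.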
However, in a battery-powered application you don't want the MCU to wake up every second. You want the MCU to wake up only when work has to be done. You can do this in two ways.
Multiple hardware timers signal the wake-up event.
This requires multiple timers to keep running and might still use too much energy.
Tickless operation. You use one timer, the chip wakes up and does work when the time is reached. Then it reloads the timer compare with the time of the next deadline. If your intervals are long enough apart you can use the RTC for this to get ultra low power consumption.
Advantage: chip is allowed to go to sleep for longer period depending on workload.
Disadvantage: the design is a bit more complicated to implement and debug.
Similar option here is to use a tickless operating system.
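A minimal sketch of the tickless bookkeeping: after each wake-up, run whatever is due, then work out how long the chip may sleep until the next deadline. Programming the timer compare register and entering sleep are hardware specific and left out; all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t next_due;   /* absolute deadline in timer ticks */
    uint32_t period;     /* ticks between activations */
} tl_task_t;

/* Ticks until the earliest deadline (0 if something is already due).
 * The signed cast handles counter wrap on two's complement targets. */
uint32_t ticks_until_next_deadline(const tl_task_t *tasks, size_t count, uint32_t now)
{
    uint32_t best = UINT32_MAX;
    for (size_t i = 0; i < count; i++) {
        int32_t delta = (int32_t)(tasks[i].next_due - now);
        uint32_t wait = (delta > 0) ? (uint32_t)delta : 0u;
        if (wait < best) {
            best = wait;
        }
    }
    return best;
}

/* Advances the deadline of every due task; returns how many were due.
 * The real work of each task would be called where the comment is. */
size_t run_due_tasks(tl_task_t *tasks, size_t count, uint32_t now)
{
    size_t ran = 0;
    for (size_t i = 0; i < count; i++) {
        if ((int32_t)(tasks[i].next_due - now) <= 0) {
            /* task work goes here */
            tasks[i].next_due += tasks[i].period;
            ran++;
        }
    }
    return ran;
}
```

With a 250-tick task and a 2000-tick task, the chip wakes 8 times in the first 2000 ticks instead of 2000 times with a fixed 1-tick interrupt.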
Assuming you're not using a real-time OS, I'd use a timer interrupt for the time-critical stuff (if it can be handled in a few clock cycles) and for the long-period counters, and do the non-time-critical stuff and the longer-period work in the main loop (with or without a watchdog timer/sleep).
The interrupts will interrupt the main loop stuff so you can be sure the time critical stuff happens when it needs to, the less time critical stuff happens whenever it can.
You could use a state machine in the main loop to do the logic stuff to make sure everything is done in the right order, things are checked, loaded, sensors read etc.
There is no right answer here. Best practice is to implement the design that meets the requirements, and since requirements vary from project to project, there is no single right answer. One common solution will fail to work right for a wide array of products, as would another common solution. You could force one solution, but that can add a lot of hacked-up band-aids that simply add risk to the project and possibly lead to failure, recalls, or unnecessary field upgrades that make the product and the company look bad. Do your system engineering and most of the time the correct solution will simply present itself; don't do your system engineering and the failures will simply present themselves.
Imagine I have M independent jobs, and each job has N steps. Jobs are independent of each other, but the steps of each job must run serially. In other words, J(i,j) should be started only after J(i,j-1) is finished (i indicates the job index and j indicates the step). This is isomorphic to building a wall with a width of M and a height of N blocks.
Each block of a job should be executed only once. The time it takes to do one block of work using one CPU is different for different blocks (though of the same order) and is not known in advance.
The simple way of doing this using MPI is to assign blocks of work to processors and wait until all of them finish their blocks before the next assignment. This way we can ensure that the ordering is enforced, but there will be a lot of waiting time.
Is there a more efficient way of doing this? I mean, when a processor finishes its block, could it decide which block to do next using some kind of environment variables or shared memory, without waiting for the other processors to finish their blocks and without making a collective decision through communication?
You have M jobs with N steps each. You also have a set of worker processes of size W, somewhere between 2 and M.
If W is close to M, the best you can do is simply assign them 1:1. If one worker finishes early that's fine.
If W is much smaller than M, and N is also fairly large, here is an idea:
Estimate some average or typical time for one step to complete. Call this T. You can adjust this estimate as you go in case you have a very poor estimator at the outset.
Divide your M jobs evenly in number among the workers, and start them. Tell the workers to run as many steps of their assigned jobs as possible before a timeout, say T*N/K. Overrunning the timeout slightly to finish the current step is allowed to ensure forward progress.
Have the workers communicate to each other which steps they completed.
Repeat, dividing the jobs evenly again taking into account how complete each one is (e.g. two 50% complete jobs count the same as one 0% complete job).
The idea is to give all the workers enough time to complete roughly 1/K of the total work each time. If no job takes much more than K*T, this will be quite efficient.
It's up to you to find a reasonable K. Maybe try 10.
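A rough sketch of the re-division step, weighting each job by its remaining steps. The greedy "give the next job to the least-loaded worker" rule is my own choice for the sketch, not something prescribed above, and all names are hypothetical. The MPI communication between rounds is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Timeout for one round: T * N / K. */
uint32_t round_timeout(uint32_t step_time, uint32_t steps_per_job, uint32_t k)
{
    return (step_time * steps_per_job) / k;
}

/* Assigns every job to a worker so that the remaining steps are spread
 * roughly evenly. owner[j] receives the worker index for job j, and
 * load[w] the number of remaining steps given to worker w. */
void rebalance(const uint32_t *steps_done, size_t jobs, uint32_t steps_per_job,
               size_t workers, size_t *owner, uint32_t *load)
{
    for (size_t w = 0; w < workers; w++) {
        load[w] = 0u;
    }
    for (size_t j = 0; j < jobs; j++) {
        uint32_t remaining = steps_per_job - steps_done[j];
        size_t best = 0;
        for (size_t w = 1; w < workers; w++) {
            if (load[w] < load[best]) {
                best = w;
            }
        }
        owner[j] = best;
        load[best] += remaining;
    }
}
```

With N = 10 and progress {0, 5, 5, 10} on two workers, one worker gets the untouched job and the other gets the two half-finished ones, so both end up with 10 remaining steps.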
Here's an idea, IDK if it's good:
Maintain one shared variable: n = the progress of the farthest-behind task. i.e. the lowest step-number that any of the M tasks has completed. It starts out at 0, because all tasks start at the first step. It stays at 0 until all tasks have completed at least 1 step each.
When a processor finishes a step of a job, check the progress of the step it's currently working on against n. If n < current_job_step - 4, switch tasks because the one we're working on is too far ahead of the farthest-behind one.
I picked 4 to give a balance between too much switching vs. having too much serial work in only a couple tasks. Adjust as necessary, and maybe make it adaptive as you near the end.
Switching tasks without having two threads both grab the same work unit is non-trivial unless you have a scheduler thread that makes all the decisions. If this is on a single shared-memory machine, you could use locking to protect a priority queue.
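A sketch of the decision logic only. It is not thread-safe as written: on a shared-memory machine every call would have to happen while holding the lock mentioned above. The `claimed` array and the function names are assumptions made for the sketch.

```c
#include <stddef.h>

#define LOOKAHEAD 4   /* the "4" from the rule above; tune as needed */

/* n: the lowest number of completed steps over all jobs. */
int min_progress(const int *steps_done, size_t jobs)
{
    int n = steps_done[0];
    for (size_t i = 1; i < jobs; i++) {
        if (steps_done[i] < n) {
            n = steps_done[i];
        }
    }
    return n;
}

/* The rule from the text: switch when n < current_job_step - LOOKAHEAD. */
int should_switch(int n, int current_job_step)
{
    return n < current_job_step - LOOKAHEAD;
}

/* Index of the farthest-behind unfinished job that nobody is working on,
 * or -1 if there is none. claimed[i] != 0 means job i is taken. */
int pick_next_job(const int *steps_done, const int *claimed, size_t jobs, int total_steps)
{
    int best = -1;
    for (size_t i = 0; i < jobs; i++) {
        if (claimed[i] || steps_done[i] >= total_steps) {
            continue;
        }
        if (best < 0 || steps_done[i] < steps_done[best]) {
            best = (int)i;
        }
    }
    return best;
}
```

A worker that has just finished a step would call `should_switch(min_progress(...), its_step)` and, if that returns 1, release its job and claim the one returned by `pick_next_job`.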
I'm running a kernel on a big array. When I profile the clEnqueueNDRangeKernel command, the execution time (end - start) is 0.001 ms, but the time between submit and start (start - submit) is around 120 ms and varies with the size of the input data. What happens between the moment a command is submitted and the moment it starts to execute? Is it reasonable to get such a large time?
OpenCL operates asynchronously. That is to say, when you ask for a piece of work to be done, it may not happen at that time. It will happen at some time in the future. This is a little weird, especially when you start profiling things, but it works like this so that the CPU can queue up lots of work for the OpenCL device, and then go do something else while the work is done.
For example:
clEnqueueWriteBuffer(blah);
clEnqueueNDRangeKernel(blah);
clEnqueueReadBuffer(blah, but blocking_read = CL_TRUE);
Here, the writeBuffer and the NDRange will probably appear to take very small amounts of time. All they'll do is record what needs to be done. The blocking readBuffer will take a long time, because it has to wait for the results of the read. For that read to complete, the write, and the kernel execution have to complete, before the read can even start.
Now the read might be very small, but because it's waiting for everything before it to finish the time it appears to take is dependent on the amount of work in the commands before it.
I don't quite understand what you're measuring from your question, but I expect what you're seeing is this effect. The time for work is being charged to other functions because they have to wait for previous work to finish.
Knowing which functions cause the CPU to wait on the GPU is one of the big tricks when it comes to writing high-performance code. Any time you introduce a wait like this, the CPU stops doing any useful work, and the GPU is likely to go idle whilst the CPU prepares the next lump of work. Sometimes there is no alternative, and you just have to wait.
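To see where the time goes for a single command, it can help to look at all four profiling timestamps rather than only start and end. The sketch below is plain C with no OpenCL calls, so it runs anywhere. It assumes you have already read the four nanosecond values with clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END) on a queue created with CL_QUEUE_PROFILING_ENABLE. The struct and function names are made up.

```c
#include <stdint.h>

/* The four event timestamps in nanoseconds, as reported by the device. */
typedef struct {
    uint64_t queued_ns;
    uint64_t submit_ns;
    uint64_t start_ns;
    uint64_t end_ns;
} cmd_times_t;

typedef struct {
    double queue_ms;   /* queued -> submit: time spent in the host queue */
    double wait_ms;    /* submit -> start: waiting on the device */
    double exec_ms;    /* start  -> end: actual execution */
} cmd_breakdown_t;

cmd_breakdown_t cmd_breakdown(const cmd_times_t *t)
{
    cmd_breakdown_t b;
    b.queue_ms = (double)(t->submit_ns - t->queued_ns) / 1.0e6;
    b.wait_ms  = (double)(t->start_ns  - t->submit_ns) / 1.0e6;
    b.exec_ms  = (double)(t->end_ns    - t->start_ns)  / 1.0e6;
    return b;
}
```

A large `wait_ms` next to a tiny `exec_ms`, as in the question, suggests the command spent its time waiting behind earlier work in the queue (for example the buffer write) rather than running.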