I have N separate OpenCL kernels that are run synchronously, one after another. The second kernel uses results from the first, the third uses results from the second, and so on.
I measured the total "execution" time for the kernels in two different ways:
1) I associate each kernel with an individual event and measure the time for each kernel separately by subtracting the event's start time from its end time. The total time is the sum of the times of all the kernels' events. In this case, the event wait call was made after each kernel call.
2) As in 1), I associate each kernel with an event (N kernels, N events). Then I make all the event wait calls at the end, after all the kernels have been enqueued (i.e., the series of kernel calls is followed by all the event wait calls). The total execution time is obtained by subtracting the start time of the first event from the end time of the last event.
The difference between the total times from 1) and 2) was quite significant: about 20% of the total execution time. Where is this 20% lost? Is it associated with enqueuing the multiple kernels?
How would I go about reducing this overhead? Would restructuring the code so that all the kernels are merged into one large kernel reduce it?
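For reference, here is a minimal sketch of the second timing method, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE; queue, kernels[] and global_size stand in for my real setup:

    #include <CL/cl.h>

    /* Sketch only: 'queue' must have been created with CL_QUEUE_PROFILING_ENABLE;
       'kernels', 'global_size' and N are placeholders for the real setup. */
    double time_kernel_chain(cl_command_queue queue, cl_kernel *kernels,
                             int N, size_t global_size)
    {
        cl_event evt[N];
        for (int i = 0; i < N; ++i)
            clEnqueueNDRangeKernel(queue, kernels[i], 1, NULL, &global_size,
                                   NULL, 0, NULL, &evt[i]);
        clWaitForEvents(N, evt);                 /* wait once, at the very end */

        cl_ulong first_start, last_end;
        clGetEventProfilingInfo(evt[0], CL_PROFILING_COMMAND_START,
                                sizeof(first_start), &first_start, NULL);
        clGetEventProfilingInfo(evt[N - 1], CL_PROFILING_COMMAND_END,
                                sizeof(last_end), &last_end, NULL);
        return (double)(last_end - first_start); /* total time for method 2), in ns */
    }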
Thanks!
Related
Imagine I have M independent jobs, each with N steps. The jobs are independent of each other, but the steps of each job must be serial: in other words, J(i,j) may start only after J(i,j-1) is finished (i is the job index and j the step). This is isomorphic to building a wall M blocks wide and N blocks high.
Each block of work should be executed exactly once. The time it takes to do one block on one CPU varies from block to block (though it is of the same order) and is not known in advance.
The simple way of doing this with MPI is to assign blocks of work to processors and wait until all of them have finished their blocks before the next assignment. This way we can ensure that the ordering is enforced, but there will be a lot of waiting time.
Is there a more efficient way of doing this? I mean, when a processor finishes its block, could it decide on its own, via some kind of environment variable or shared memory, which block to do next, without waiting for the other processors to finish and make a collective decision through communication?
You have M jobs with N steps each. You also have a set of worker processes of size W, somewhere between 2 and M.
If W is close to M, the best you can do is simply assign them 1:1. If one worker finishes early that's fine.
If W is much smaller than M, and N is also fairly large, here is an idea:
Estimate some average or typical time for one step to complete. Call this T. You can adjust this estimate as you go in case you have a very poor estimator at the outset.
Divide your M jobs evenly in number among the workers, and start them. Tell the workers to run as many steps of their assigned jobs as possible before a timeout, say T*N/K. Overrunning the timeout slightly to finish the current step is allowed, to ensure forward progress.
Have the workers communicate to each other which steps they completed.
Repeat, dividing the jobs evenly again taking into account how complete each one is (e.g. two 50% complete jobs count the same as one 0% complete job).
The idea is to give all the workers enough time to complete roughly 1/K of the total work each time. If no job takes much more than K*T, this will be quite efficient.
It's up to you to find a reasonable K. Maybe try 10.
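A minimal MPI sketch of one round of this scheme, just to make the shape concrete; run_step(), the progress array, and the per-round job assignment are placeholders I made up, not anything from your code:

    #include <mpi.h>
    #include <stdlib.h>

    extern void run_step(int job, int step);  /* hypothetical: does one block of work */

    /* One round: every rank works on its assigned jobs until the deadline,
       then all ranks exchange how far each job has progressed. */
    void run_round(int M, int N, int *progress,        /* steps done per job     */
                   const int *my_jobs, int my_njobs,   /* this rank's assignment */
                   double T, int K)                    /* step estimate, tuning K */
    {
        int *done_now = calloc(M, sizeof(int));
        double deadline = MPI_Wtime() + T * (double)N / K;

        for (int j = 0; j < my_njobs && MPI_Wtime() < deadline; ++j) {
            int m = my_jobs[j];
            while (progress[m] + done_now[m] < N && MPI_Wtime() < deadline) {
                run_step(m, progress[m] + done_now[m]);
                done_now[m]++;
            }
        }

        /* Everyone learns everyone else's progress; the caller then re-divides
           the remaining work evenly for the next round. */
        MPI_Allreduce(MPI_IN_PLACE, done_now, M, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        for (int m = 0; m < M; ++m)
            progress[m] += done_now[m];
        free(done_now);
    }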
Here's an idea, IDK if it's good:
Maintain one shared variable: n = the progress of the farthest-behind task. i.e. the lowest step-number that any of the M tasks has completed. It starts out at 0, because all tasks start at the first step. It stays at 0 until all tasks have completed at least 1 step each.
When a processor finishes a step of a job, check the progress of the step it's currently working on against n. If n < current_job_step - 4, switch tasks because the one we're working on is too far ahead of the farthest-behind one.
I picked 4 to give a balance between too much switching vs. having too much serial work in only a couple tasks. Adjust as necessary, and maybe make it adaptive as you near the end.
Switching tasks without having two threads both grab the same work unit is non-trivial unless you have a scheduler thread that makes all the decisions. If this is on a single shared-memory machine, you could use locking to protect a priority queue.
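If it is indeed shared memory, the check could look roughly like this (a sketch using a plain array scan instead of a priority queue; pick_farthest_behind_job() is a made-up helper that would also mark the chosen job as taken):

    #include <pthread.h>

    #define M_JOBS   16   /* example value for M */
    #define MAX_LEAD 4    /* the "4" from above */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int progress[M_JOBS];            /* steps completed per job */

    int pick_farthest_behind_job(void);     /* hypothetical scheduler helper */

    /* Called by a worker after it finishes a step of 'current_job'.
       Returns the job it should work on next. */
    int maybe_switch(int current_job)
    {
        pthread_mutex_lock(&lock);
        int n = progress[0];                /* n = farthest-behind job's progress */
        for (int m = 1; m < M_JOBS; ++m)
            if (progress[m] < n)
                n = progress[m];

        int next = current_job;
        if (progress[current_job] - n > MAX_LEAD)
            next = pick_farthest_behind_job();   /* too far ahead: switch */
        pthread_mutex_unlock(&lock);
        return next;
    }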
So, in theory, how many times per second will the loop function be executed?
Can this be calculated from the clock speed in megahertz? The Arduino runs at 16 MHz.
As Gerald said, you cannot simply take the clock frequency and calculate the loop function's iterations per second: some functions take a long time, some don't, and what about delay(1000) and conditionals?
If your program needs to know how many times the loop runs per second, you can use the millis() and micros() functions. For example, you can count how many times the loop has run and record that count once per second by monitoring millis() or micros().
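A minimal sketch of that idea (printing over Serial is just one way to get the number out):

    unsigned long loopCount = 0;
    unsigned long lastReport = 0;

    void setup() {
      Serial.begin(9600);
    }

    void loop() {
      loopCount++;
      unsigned long now = millis();
      if (now - lastReport >= 1000) {   // one second has passed
        Serial.println(loopCount);      // iterations during that second
        loopCount = 0;
        lastReport = now;
      }
    }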
I am trying to use clGetEventProfilingInfo for timing my kernels.
Is there any facility to specify a number of iterations over which the start and end times are reported?
If the kernel is run only once then, of course, it has a lot of overhead associated with it. So to get the best timing we should run the kernel several times and take the average time.
Do we have such a parameter when profiling with the API? (We do have such parameters when we use third-party profiling tools.)
The clGetEventProfilingInfo function will return profiling information for a single event, which corresponds to a single enqueued command. There is no built-in mechanism to automatically report information across a number of calls; you'll have to code that yourself.
It's pretty straightforward to do - just query the start and end times for each event you care about and add them up. If you are only running a single kernel (in a loop), then you could just use a wall-clock timer (with clFinish before you start and stop timing), or take the difference between the time the first event started and the last event finished.
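For example, a minimal sketch of averaging over several launches of one kernel (assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE; the surrounding setup is omitted):

    #include <CL/cl.h>

    /* Returns the average device execution time of 'kernel' over 'iters'
       launches, in nanoseconds. */
    double average_kernel_ns(cl_command_queue queue, cl_kernel kernel,
                             size_t global_size, int iters)
    {
        cl_ulong total = 0;
        for (int i = 0; i < iters; ++i) {
            cl_event evt;
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                                   NULL, 0, NULL, &evt);
            clWaitForEvents(1, &evt);

            cl_ulong start, end;
            clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                    sizeof(start), &start, NULL);
            clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                    sizeof(end), &end, NULL);
            total += end - start;
            clReleaseEvent(evt);
        }
        return (double)total / iters;
    }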
I understand that single-cycle implementations are not very efficient. One reason is that not all instructions need the same amount of time, yet in a single-cycle implementation every instruction is completed in the same length of time (one clock cycle).
With pipelining, throughput is increased, which means that once the pipeline is full, the time between one output and the next is shorter than in a single-cycle implementation. But then can you say that instructions in a pipelined approach take the same amount of time (going from IF/Instruction Fetch to WB/Writeback)? Or is that the wrong conclusion?
In a single-cycle, non-pipelined design, the instructions do not all inherently need the same amount of time; rather, the next instruction cannot start until the next clock cycle, and the cycle length is set by the longest instruction. The current instruction may therefore finish before its cycle ends; e.g., a register add completes before a load does on a RISC machine.
A pipelined processor, on the other hand, is split into multiple stages with pipeline registers to store and propagate the processor's state between them. We save time by overlapping the sub-stages of consecutive instructions, so even though each individual instruction's latency increases, the overall time decreases. Note also that not every instruction needs every stage in the same way (again, compare a load with an add).
So the overall latency of each instruction spans all the stages, even though its actual execution could have been done in fewer cycles.
In other words, you can say that the latency of each instruction is the same, but not its execution time or the cycles it actually consumes.
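As a rough worked example (purely illustrative numbers): take a classic 5-stage pipeline in which every stage takes 200 ps. Each instruction's latency is then 5 × 200 ps = 1 ns, but once the pipeline is full an instruction completes every 200 ps. A single-cycle design running the same instruction mix would need a clock period at least as long as its slowest instruction, say 800 ps for a load, so it finishes one instruction every 800 ps even though a register add might have needed only 600 ps of that cycle.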
I am new to XV6, so be patient with me :D.
I want to make a new scheduler that is a mix of two schedulers: the multi-level feedback queue (MLFQ) and the lottery scheduler.
The basic idea is simple: build a two-level scheduler which first places jobs into the high-priority queue. When a job uses up its time slice on the first queue, move it to the lower-priority queue; jobs on the lower-priority queue should run for two time slices before relinquishing the CPU. When there is more than one job on a queue, each should run in proportion to the number of tickets it has; the more tickets a process has, the more it runs. Each time slice, a randomized lottery determines the winner; that winning process is the one that runs for that time slice.
I need a couple of new system calls to implement this scheduler.
The first is int settickets(int num), which sets the number of tickets of the calling process. By default, each process should get one ticket; calling this routine lets a process raise the number of tickets it receives, and thus receive a higher proportion of CPU cycles. This routine should return 0 if successful, and -1 otherwise (if, for example, the user passes in a number less than one).
The second is int getpinfo(struct pstat *). This routine returns some basic information about each running process, including its process ID, how many times it has been chosen to run, and which queue it is on (high or low).
Any help? Any links would be appreciated.
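For concreteness, here is roughly what I have in mind for the pstat struct and the per-queue lottery pick. The field names and the rand_up_to() helper are only placeholders: xv6 has no rand(), so a simple PRNG would have to be added, and struct proc would need a tickets field.

    // Sketch only: assumes proc.h is included and a 'tickets' field has been
    // added to struct proc.
    #define NPROC 64          // same as xv6's param.h

    struct pstat {
      int inuse[NPROC];       // whether this process-table slot is in use
      int pid[NPROC];         // process ID
      int hticks[NPROC];      // times scheduled from the high-priority queue
      int lticks[NPROC];      // times scheduled from the low-priority queue
      int tickets[NPROC];     // tickets set via settickets()
    };

    int rand_up_to(int n);    // placeholder PRNG returning a value in [0, n)

    // Weighted lottery over the runnable processes on one queue.
    static struct proc*
    lottery_pick(struct proc *queue[], int nqueue)
    {
      int total = 0;
      for (int i = 0; i < nqueue; i++)
        total += queue[i]->tickets;
      if (total == 0)
        return 0;

      int winner = rand_up_to(total);
      int count = 0;
      for (int i = 0; i < nqueue; i++) {
        count += queue[i]->tickets;
        if (winner < count)
          return queue[i];
      }
      return 0;   // not reached
    }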