Trying to use clGetEventProfilingInfo for timing my kernels.
Is there any facility to specify a number of iterations over which the start time and end time are reported?
If the kernel is run only once then, of course, it has a lot of overhead associated with it. So to get the best timing we should run the kernel several times and take the average time.
Do we have such a parameter when profiling using the API? (We do have such parameters when we use third-party software tools for profiling.)
The clGetEventProfilingInfo function will return profiling information for a single event, which corresponds to a single enqueued command. There is no built-in mechanism to automatically report information across a number of calls; you'll have to code that yourself.
It's pretty straightforward to do - just query the start and end times for each event you care about and add them up. If you are only running a single kernel (in a loop), then you could just use a wall-clock timer (with clFinish before you start and stop timing), or take the difference between the time the first event started and the last event finished.
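For example, a rough sketch of what that loop could look like (the function name and parameters are just for illustration; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and error checking is omitted):

    #include <CL/cl.h>

    // Average device execution time of `kernel` in milliseconds over
    // `iterations` runs. Assumes `queue` was created with
    // CL_QUEUE_PROFILING_ENABLE; error checking omitted for brevity.
    double averageKernelTimeMs(cl_command_queue queue, cl_kernel kernel,
                               size_t globalSize, int iterations)
    {
        cl_ulong totalNs = 0;
        for (int i = 0; i < iterations; ++i) {
            cl_event evt;
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
                                   0, NULL, &evt);
            clWaitForEvents(1, &evt);            // wait until the command completes

            cl_ulong start = 0, end = 0;
            clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                    sizeof(start), &start, NULL);
            clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                    sizeof(end), &end, NULL);
            totalNs += end - start;              // profiling times are in nanoseconds
            clReleaseEvent(evt);
        }
        return (totalNs / (double)iterations) * 1e-6;
    }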
Related
Recently I got the task to optimize a quite large PL/SQL script which, prior to my changes, took about 1 hour +/- 10 mins.
So I did some reallocation of methods and generally just replaced big views with simpler subqueries or WITH statements. I noticed that if I ran the scheduled job by right-clicking it and choosing "execute job", I would in most cases see the run duration change (in a positive way). But if I enable the job and let it run on its schedule, it takes the original hour no matter what changes I make to it.
Now my question here is: Is there any way to monitor the RAM or CPU usage of the session/job, or is there a difference in general in how many resources are allocated to background processes? Because my suspicion is that the "manual" run somehow gets some priority the scheduler doesn't get or doesn't take.
Either way, for troubleshooting purposes you can't spend a few hours of a work day just waiting for results.
I'm trying to count how many times a BBL is executed over the whole program run, but apparently TRACE_AddInstrumentFunction skips traces that have already been executed once. Does anyone have any ideas?
Pin instrumentation works in two phases. The instrumentation phase is called when new code is encountered, and allows you to insert analysis callbacks. Analysis callbacks are called every time the code is encountered.
I strongly recommend reading the first bit of the pin manual to understand the difference between instrumentation and analysis functions.
The instrumentation call allows you to insert the callbacks. In simpler terms, it has you place function calls before each unit of instrumentation, which you can define as either an Instruction, a Trace, or a Routine. Now, specific to your question: Pin follows a different definition of BBL than the textbook one, but finding the number of times a BBL (per Pin's definition) is executed is easy. You can simply register a Trace instrumentation call, increment a counter in the analysis call for every BBL, and you will get the BBL count.
If you want to go by the textbook definition of a BBL (one entry, one exit), which implies a BBL breaks at every branch-or-call instruction, insert a call using the IsBranchOrCall API and increment the BBL counter in the callback function.
I recommend trying both of them and figuring out the difference between the two definitions.
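For reference, here is a minimal sketch of the first approach (counting executions of Pin-style BBLs via trace instrumentation), modelled on the standard bblcount example from the Pin kit. Note the counter is not thread-safe, so for multithreaded programs you would want per-thread counters or a lock:

    #include "pin.H"
    #include <iostream>

    static UINT64 bblExecutions = 0;

    // Analysis routine: runs every time the instrumented BBL executes.
    VOID CountBbl() { bblExecutions++; }

    // Instrumentation routine: runs once when Pin first sees a trace.
    VOID InstrumentTrace(TRACE trace, VOID *v)
    {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
            BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)CountBbl, IARG_END);
    }

    VOID Fini(INT32 code, VOID *v)
    {
        std::cerr << "BBL executions: " << bblExecutions << std::endl;
    }

    int main(int argc, char *argv[])
    {
        if (PIN_Init(argc, argv)) return 1;
        TRACE_AddInstrumentFunction(InstrumentTrace, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();    // never returns
        return 0;
    }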
I understand that single-cycle implementations are not very efficient. One reason is that instructions do not all take equally long to execute, yet in a single-cycle implementation every instruction is given the same amount of time to complete.
With pipelining, throughput is increased, which means the time between one output and the next will be shorter than in a single-cycle implementation once you pass a certain point. But then can you say that instructions in a pipelined approach take the same amount of time (going from IF/Instruction Fetch to WB/Writeback)? Or is this the wrong conclusion?
All instructions in a single-cycle, non-pipelined implementation do not necessarily take the same amount of time; rather, the next instruction cannot start until the next clock cycle, and the current instruction may finish before its cycle ends, because the cycle length is determined by the longest instruction (e.g. a register add completes before a load on a RISC machine).
In a pipelined implementation, the processor is split into multiple stages, with registers between them to store and propagate the state of the processor. On a pipelined processor we save time by overlapping the sub-stages of consecutive instructions; even though the latency of an individual instruction increases, the overall time for a sequence of instructions is reduced. Note also that not every instruction has to go through all the stages (again, compare a load with an add).
So the overall latency allotted to each instruction covers all the stages, even though its actual execution may have taken fewer cycles.
So you can say that the latency of each instruction is the same, but not the execution time or the number of cycles consumed.
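To put rough numbers on it: with five 1 ns stages, a single instruction still needs 5 ns from IF to WB, but once the pipeline is full one instruction completes every 1 ns, so 1000 instructions finish in roughly 5 + 999 = 1004 ns instead of the ~5000 ns a strictly sequential execution would take. The per-instruction latency stays the same (or gets slightly worse once pipeline registers are added), while throughput improves almost five-fold.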
I need a simple way to run a program using digitalWrite for a certain number of seconds.
I am driving two DC motors. I already have my setup complete, and have driven the motors using pause() and digitalWrite(). I will be making time measurements in milliseconds.
I need an adjustable runtime, and would preferably have non-blocking code.
You could use a timer-driven interrupt that triggers code to handle the output (decrementing the required time value and eventually switching the output off), or use threads.
I would suggest using threads.
Your requirement is similar to a "blinking diodes" case I described in a different thread.
If you replace the defines that set the time intervals with variables, you could use that code to drive your outputs, or simplify the whole thing by using only one thread working the same way the aforementioned timer interrupt would.
If you would like to try the timer interrupt-driven approach, this post gives a good overview and examples (but you have to change OCR1A to about 16 to get an overflow every 1 ms).
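If a full thread or interrupt setup is more than you need, a minimal non-blocking alternative is to poll millis() in loop(); the pin number and runtime below are placeholders you would adapt to your motor setup:

    const int motorPin = 9;                  // placeholder pin driving one motor
    const unsigned long runTimeMs = 5000;    // adjustable runtime in milliseconds

    unsigned long startTime = 0;
    bool running = false;

    void setup() {
      pinMode(motorPin, OUTPUT);
      digitalWrite(motorPin, HIGH);          // start the motor
      startTime = millis();
      running = true;
    }

    void loop() {
      // Non-blocking check: switch the output off once the runtime has
      // elapsed, while loop() stays free to do other work.
      if (running && (millis() - startTime >= runTimeMs)) {
        digitalWrite(motorPin, LOW);
        running = false;
      }
      // ... other non-blocking work can go here ...
    }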
I've been wondering whether there are any particular reasons why one should use Wtime instead of other time measurement methods? Is it more accurate or reliable?
The only reason I see is platform independence.
Since MPI_Wtime() guarantees that the beginning time at all ranks is the same, it can be used not only for calculating the time between any two points at the same rank, but also to conveniently compare the time taken by different ranks to reach a certain point.
There can be other applications for this globally synced clock too, but right now this is the only one I can think of.
MPI_Wtime() does not guarantee global synchronization among processes on different nodes. It does provide a synchronized clock for processes on the same node, but gettimeofday() provides the same.
According to the manual for MPI_Wtime (Open MPI 4.0.0):
On POSIX platforms, this function may utilize a timer that is cheaper to invoke than the gettimeofday() system call, but will fall back to gettimeofday() if a cheap high-resolution timer is not available. The ompi_info command can be consulted to see if Open MPI supports a native high-resolution timer on your platform; see the value for "MPI_WTIME support" (or "options:mpi-wtime" when viewing the parsable output). If this value is "native", a method that is likely to be cheaper than gettimeofday() will be used to obtain the time when MPI_Wtime is invoked.
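For completeness, a minimal usage sketch; the barrier before the first reading is only there to make the per-rank timings roughly comparable and is not required for timing a single rank:

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);   // line the ranks up before timing (optional)
        double t0 = MPI_Wtime();

        // ... work to be timed goes here ...

        double t1 = MPI_Wtime();
        printf("rank %d took %f seconds\n", rank, t1 - t0);

        MPI_Finalize();
        return 0;
    }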