Thrust transform performance numbers - asynchronous

Can anybody tell me whether Thrust routines are blocking or non-blocking?
I want to time them; here are the code snippets:
Code snippet 1:
clock_t start, end;
start = clock();
thrust::transform(a.begin(), a.end(), b.begin(), thrust::negate<int>());
end = clock();
Code snippet 2:
clock_t start, end;
start = clock();
thrust::transform(a.begin(), a.end(), b.begin(), thrust::negate<int>());
cudaThreadSynchronize();
end = clock();
Code snippet 1 takes much less time than code snippet 2.
Why is this happening? And which is the right way to time Thrust routines so that I can compare them against my parallel code?

I don't believe Thrust formally defines which APIs are blocking and which are non-blocking anywhere in the documentation. However, a transform call like your example should be executed in a single back-end closure operation (which translates into a single kernel call without host-device data copies) and should be asynchronous.
Your second code snippet is closer to the correct way to time a Thrust operation, but note that clock() is generally implemented using a low-resolution time source and is probably not suitable for timing these kinds of operations. You should find a higher-resolution host timer or, better still, use the CUDA events API to time your code. You can see an example of how to use these APIs in this question-answer pair.
cudaThreadSynchronize is a deprecated API as of the CUDA 4.0 release. You should use cudaDeviceSynchronize instead.
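For illustration, here is a minimal sketch of timing the transform with CUDA events (assuming a and b are device vectors as in your snippets; the event variable names are mine):
cudaEvent_t start_ev, stop_ev;
cudaEventCreate(&start_ev);
cudaEventCreate(&stop_ev);
cudaEventRecord(start_ev, 0);  // record on the default stream, before the launch
thrust::transform(a.begin(), a.end(), b.begin(), thrust::negate<int>());
cudaEventRecord(stop_ev, 0);   // record after the kernel launch
cudaEventSynchronize(stop_ev); // block the host until stop_ev has completed
float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start_ev, stop_ev); // elapsed time in milliseconds
cudaEventDestroy(start_ev);
cudaEventDestroy(stop_ev);
Because the events are recorded on the same stream that runs the kernel, the measured interval covers the device execution itself and does not depend on the resolution of a host timer.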

Related

Efficiently connecting an asynchronous IMFSourceReader to a synchronous IMFTransform

Given an asynchronous IMFSourceReader connected to a synchronous-only IMFTransform.
For the IMFSourceReaderCallback::OnReadSample() callback, is it a good idea not to call IMFTransform::ProcessInput directly within OnReadSample, but instead push the produced sample onto another queue for another thread to call the transform's ProcessInput on?
Or would I just be replicating identical work that source readers typically do internally? Or, put another way, does work within OnReadSample run the risk of blocking further decoding work within the source reader that could otherwise have happened more asynchronously?
So I am suggesting something like:
WorkQueue transformInputs;
...
// Called back asynchronously by the source reader
HRESULT OnReadSampleCallback(... IMFSample* sample)
{
    // Push the sample and return immediately
    Push(transformInputs, sample);
    return S_OK;
}
// Different worker thread, awoken for transformInputs queue samples
void OnTransformInputWork()
{
    // Transform object is not async capable
    transform->ProcessInput(0, Pop(transformInputs), 0);
    ...
}
This is touched on, but not elaborated on, under 'Implementing the Callback Interface' here:
https://learn.microsoft.com/en-us/windows/win32/medfound/using-the-source-reader-in-asynchronous-mode
Or is it completely dependent on whatever the source reader sets up internally and not easily determined?
It is not a good idea to perform a long blocking operation in IMFSourceReaderCallback::OnReadSample. Nothing is going to be fatal or serious, but this is not the intended usage.
Taking into consideration your previous question about audio format conversion, though, audio sample data conversion is fast enough to happen in such a callback.
Also, while it is not clearly documented (it depends on the actual implementation), ProcessInput is often nearly instant and only references the input data; ProcessOutput is where the computationally expensive work happens in this case. If you don't call ProcessOutput right there in the same callback, you might run into a situation where the MFT is no longer accepting input, and so you'd have to implement a queue anyway.
With all this in mind, you would just do the processing in the callback and accept the performance impact, assuming your processing is not too heavy; otherwise you would implement the queue.
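As a rough sketch of that approach (transform, outputSample and reader stand for objects you set up elsewhere; error handling is elided), both input and output are driven from the callback:
HRESULT OnReadSample(HRESULT hrStatus, DWORD dwStreamIndex,
                     DWORD dwStreamFlags, LONGLONG llTimestamp,
                     IMFSample* sample)
{
    if (sample)
    {
        // ProcessInput is typically cheap and only references the sample
        transform->ProcessInput(0, sample, 0);
        // Drain output immediately so the MFT keeps accepting input;
        // ProcessOutput fails with MF_E_TRANSFORM_NEED_MORE_INPUT once drained
        MFT_OUTPUT_DATA_BUFFER output = {};
        output.pSample = outputSample; // pre-allocated IMFSample (assumed)
        DWORD status = 0;
        while (SUCCEEDED(transform->ProcessOutput(0, 1, &output, &status)))
        {
            // ... consume output.pSample ...
        }
    }
    // Request the next sample from the source reader
    reader->ReadSample(dwStreamIndex, 0, NULL, NULL, NULL, NULL);
    return S_OK;
}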

Why does the first execution of an Ada procedure take longer than other executions?

I am trying to write a delay procedure for an FE310 microcontroller. I need to write this procedure because I use a zero footprint runtime (ZFP) that doesn't provide the native Ada delays.
The procedure relies on a 64-bit hardware timer. The timer is incremented 32768 times per second. The procedure reads the timer, calculates the final value by adding a value to the one read, and then reads the timer until it reaches that final value.
I toggle a pin before and after the execution and check the delay with a logic analyzer. The delays are quite accurate except for the first execution, where they are 400 µs to 600 µs longer than requested.
Here is my procedure:
procedure Delay_Ms (Ms : Positive)
is
   Start_Time : Machine_Time_Value;
   End_Time   : Machine_Time_Value;
begin
   Start_Time := Machine_Time;
   End_Time   := Start_Time
     + (Machine_Time_Value (Ms) * Machine_Time_Value (LF_Clock_Frequency)) / 1_000;
   loop
      exit when Machine_Time >= End_Time;
   end loop;
end Delay_Ms;
Machine_Time is a function reading the hardware timer.
Machine_Time_Value is a 64-bit unsigned integer.
I am sure the hardware aspect is correct because I wrote the same algorithm in C and it behaves exactly as expected.
I think that GNAT is adding some code that is only executed the first time. I searched the web for mentions of a similar behavior but didn't find anything relevant. I found some information about elaboration code and how it can be removed, but after some research I realized that elaboration code is executed before the main program and shouldn't be the cause of my problem.
Do you know why the first execution of a procedure like mine could take longer? Is it possible to avoid this kind of behavior?
As Simon Wright suggested, the different first execution time is because the MCU reads the code from the SPI flash on first execution but reads it from the instruction cache on subsequent executions.
By default, the FE310 SPI clock is the processor core clock divided by 8. When I set the SPI clock divider to 2, the difference in execution time is divided by 4.
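For reference, here is a hedged C sketch of that divider change (the QSPI0 base address, register offset and divide formula are taken from the FE310-G000 manual; verify them against your exact part before use):
#include <stdint.h>
/* QSPI0 is the flash controller on the FE310; its sckdiv register sits at
   offset 0x00. f_sck = f_in / (2 * (div + 1)), so div = 0 gives core/2,
   while the reset value div = 3 gives the default divide-by-8. */
#define QSPI0_SCKDIV (*(volatile uint32_t *)(0x10014000u + 0x00u))
static void set_flash_spi_div2(void)
{
    QSPI0_SCKDIV = 0;
}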

Counting cycles on Cortex M0+

I have a Cortex M0+ (SAML21) board that I'm using for performance testing. I'd like to measure how many cycles a given piece of code takes. I tried using DWT (DWT_CONTROL), but it never produced a result; it returned 0 cycles regardless of what code ran.
// enable the DWT by setting the TRCENA bit in DEMCR
*DEMCR = *DEMCR | 0x01000000;
// reset the cycle counter
*DWT_CYCCNT = 0;
// enable the cycle counter
*DWT_CONTROL = *DWT_CONTROL | 1;
// some code here
// .....
// read the number of elapsed cycles
count = *DWT_CYCCNT;
Is there a way to count cycles (perhaps with an interrupt and a counter?) much like I can query for milliseconds (e.g. millis() on Arduino)?
I cannot find any mention of the cycle counter register in the ARMv6-M Architecture Reference Manual.
So I'd say this is not possible with an internal counter as it is on the bigger siblings such as the M3, M4 and so on.
This is also stated in this knowledge base article:
This article was written for Cortex-M3 and Cortex-M4, but the same points apply to Cortex-M7, Cortex-M33 and Cortex-M55. Newer Cortex-M processors at the higher end of performance, such as Cortex-M55, may include an extended Performance Monitoring Unit that provides additional performance measuring capabilities, but these are outside the scope of this article. The smaller Cortex-M processors such as Cortex-M0, Cortex-M0+ and Cortex-M23 do not include the DWT capabilities described here, and, other than the Cortex-M23, do not include ETM instruction trace, but all Cortex-M processors provide the "tarmac" capability for the chip designers.
(Emphasis mine)
So other means have to be used:
some debuggers can measure the time between hitting two breakpoints (or between two stops); the accuracy of this is usually limited by interaction with the OS, so the error can easily be on the order of 20 ms
use an internal timer with high enough clock frequency to give reasonable results and start / stop it before and after the interesting region
toggle a pin and measure the time with a logic analyzer / oscilloscope
According to the CMSIS header file for the M0+ (core_cm0plus.h), the Core Debug Registers are only accessible over the Debug Access Port and not by the processor. I can only suggest using some free-running timer (maybe SysTick), or perhaps your debugger can be of some help in getting access to the required registers.
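For example, a minimal sketch using the CMSIS SysTick definitions from your device header (the measured region must take fewer than 2^24 core cycles, because SysTick is a 24-bit down-counter):
SysTick->LOAD = 0x00FFFFFF;                 // maximum 24-bit reload value
SysTick->VAL  = 0;                          // clear the current value
SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk  // clock from the core clock
              | SysTick_CTRL_ENABLE_Msk;    // start counting
uint32_t start = SysTick->VAL;
// ... code under test ...
uint32_t stop = SysTick->VAL;
// SysTick counts down, so the elapsed count is start - stop,
// masked to 24 bits to survive a single wrap-around
uint32_t cycles = (start - stop) & 0x00FFFFFF;
Subtract the measurement overhead (read the counter twice with nothing in between) if you need cycle-accurate figures.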

Passing data on to packet for transmission using readstream

I am using the ReadStream interface to sample at 100 Hz and have been able to integrate the interface into the Oscilloscope application. I am just unsure about the way I pass the buffer values on to the packet to be transmitted. Currently this is how I am doing it:
uint8_t i = 0;
event void ReadStream.bufferDone(error_t result, uint16_t* buffer, uint16_t count)
{
    if (reading < count)
        i++;
    local.readings[reading++] = buffer[i];
}
I have defined a buffer size of 50, but I am not sure this is the right way to do it, as I am seeing just one sample per packet even though I have set Nreadings = 2.
Also, the sampling rate does not seem to be 100 samples/second when I check. I am not doing something right in the way I pass data to the packet to be transmitted.
I think I need to clarify a few things according to your questions and comments.
Reading a single sample from an accelerometer on micaZ motes works as follows:
Turn on the accelerometer.
Wait 17 milliseconds. According to the ADXL202E (the accelerometer) datasheet, startup time is 16.3 ms. This is because this particular hardware is capable of providing first reading not immediately after being powered on, but with some delay. If you decrease this delay, you will likely get a wrong reading, however, the behavior is undefined, so you may sometimes get a correct reading or the result may depend on environment conditions, such as ambient temperature. Changing this 17-ms delay to a lower value is certainly a bad idea.
Read values (in two axes) from the Analog to Digital Converter (ADC), which is an MCU component that converts the analog output voltage of the accelerometer to a digital value (an integer). The speed at which the ADC can sample is independent of the parameters of the accelerometer: it is another piece of hardware.
Turn off the accelerometer.
This is what happens when you call Read.read() in your code. You can see that the maximum rate at which you can sample is once every 17 ms, that is, about 58 samples per second. It may be even a bit lower because of some overhead in the MCU or inaccuracy of the timers. This is true whether you sample by calling Read.read() in a loop or at a fixed interval, because the call itself lasts no less than 17 ms (meaning the delay between the command and the event).
What you may want to do is:
Turn on the accelerometer.
Wait 17 ms.
Perform series of reads.
Turn off the accelerometer.
If you do so, you have one 17-ms delay for a whole set of samples instead of such a delay for each sample. Importantly, these steps have nothing to do with the interface you use for performing the readings. You may call Read.read() multiple times in your application; however, it cannot be the same implementation of the read command that is already provided for this accelerometer, because the existing implementation is responsible for turning the accelerometer on and off, and it waits 17 ms before reading each sample. For convenience, you may implement the ReadStream interface instead and call it once in your application.
Moreover, you wrote that ReadStream uses a microsecond timer and is independent of the 17-ms settling time of the ADC. That sentence is completely wrong. First of all, you cannot say that an interface uses or does not use a timer. The interface is just a set of commands and events without their definitions. A particular implementation of the interface may use timers. The Read and ReadStream interfaces may be implemented multiple times on different platforms by various hardware components, such as accelerometers, thermometers, hygrometers, magnetometers, and so on. Secondly, the 17-ms settling time refers to the accelerometer, not the ADC. And no matter which interface you use, Read or ReadStream, and which timers a driver uses, milli- or microsecond, the 17-ms delay is always required after powering on the accelerometer. As I mentioned, you probably want to incur this delay once per series of reads instead of once per single read.
It seems that the TinyOS source code already contains an implementation of the accelerometer driver providing the ReadStream interface which allows you to sample continuously. Look at the AccelXStreamC and AccelYStreamC components (in tos/sensorboards/mts300/).
The ReadStream interface consists of two commands. postBuffer(val_t *buf, uint16_t count) is called to provide a buffer for samples. In the accelerometer driver, val_t is defined as uint16_t. You may post multiple buffers, one by one. This command does not yet start sampling and filling buffers. For that purpose, there is a read(uint32_t usPeriod) command, which directs the device to start filling the buffers by sampling with the specified period (in microseconds). When a buffer is full, you get a bufferDone(error_t result, val_t *buf, uint16_t count) event and the component starts filling the next buffer, if any. If there are no buffers left, you additionally get a readDone(error_t result, uint32_t usActualPeriod) event, whose usActualPeriod parameter indicates the actual sampling period, which may differ from (in particular, be higher than) the period you requested when calling read, due to hardware constraints.
So the solution is to use the ReadStream interface provided by AccelXStreamC and AccelYStreamC (or maybe some higher-level components that use them) and pass the expected period in microseconds to the read command. If the actual period is higher than the one you requested, this means that sampling at the higher rate is impossible, either due to hardware constraints or because it was not implemented in the ADC driver. In the second case, you may try to fix the driver, although that requires good knowledge of low-level programming. The ADC driver source code for this platform is located in tos/chips/atm128/adc.

Need an Arduino sketch that uses digitalWrite for a certain number of seconds

I need a simple way to run a program using digitalWrite() for a certain number of seconds.
I am driving two DC motors. I already have my setup complete, and have driven the motors using pause() and digitalWrite(). I will be making time measurements in milliseconds.
I need an adjustable runtime and would preferably have non-blocking code.
You could use a timer-driven interrupt that triggers code to handle the output (decrementing the required time value and eventually switching off the output), or use threads.
I would suggest using threads.
Your requirement is similar to a "blinking diodes" case I described in a different thread.
If you replace the defines that set the time intervals with variables, you could use that code to drive your outputs, or simplify the whole thing by using only one thread that works the same way the aforementioned timer interrupt would.
If you would like to try the timer interrupt-driven approach, this post gives a good overview and examples (but you have to change OCR1A to about 16 to get an overflow every 1 ms).
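If threads or timer interrupts feel like overkill, a plain millis()-polling state machine is also non-blocking. A minimal sketch (the pin number and runtime are placeholders for your setup):
const uint8_t MOTOR_PIN = 9;      // assumed wiring; change to your pin
unsigned long runTimeMs = 5000;   // adjustable runtime in milliseconds
unsigned long startMs = 0;
bool running = false;
void setup() {
  pinMode(MOTOR_PIN, OUTPUT);
  digitalWrite(MOTOR_PIN, HIGH);  // start the motor
  startMs = millis();
  running = true;
}
void loop() {
  // non-blocking: other work can run on every pass through loop()
  if (running && millis() - startMs >= runTimeMs) {
    digitalWrite(MOTOR_PIN, LOW); // switch off after the requested time
    running = false;
  }
}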
