I'm implementing an OpenCL design for an Intel Cyclone V FPGA. It is based on a modified version of the Terasic DE10-Standard OpenCL BSP.
The modification adds a connection to an external A/D converter card attached to the FPGA board. For this, a custom VHDL-based Qsys block was implemented that streams samples at a rate of approximately 16 MHz into an Avalon dual-clock FIFO. The output side of the FIFO runs in the kernel clock domain, and its Avalon Streaming output is exported to the OpenCL kernel as a channel, as described in the Intel FPGA SDK for OpenCL Standard Edition Programming Guide under topic 5.4.5.4 (Implementing I/O Channels Using the io Channels Attribute).
Currently, the CL kernel simply fetches blocks of contiguous data from the channel and writes them to a global memory buffer; the host then appends the samples to a file. This works reliably for A/D converter sample rates up to 1 MHz, but higher sample rates produce lots of dropouts.
Enabling profiling and using the Intel Dynamic Profiler for OpenCL revealed the reason: the average kernel clock is as low as 1.3 MHz. However, since an OpenCL system is compiled not through the Quartus IDE but on the command line via aoc, there is not much information about what was synthesized and what causes such a low clock frequency. So how do I track down the bottleneck of my design with the tools provided by Intel?
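For reference, the design is compiled and instrumented from the command line roughly like this (file, output and board names are placeholders for my setup):

aoc device/thdbADARxTx.cl -o bin/thdbADARxTx.aocx -board=de10_standard -profile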
Here is a screenshot of the profiling results:
Edit:
Here is the relevant part of the CL kernel and the Qsys design. Note that the TX path present in both is currently unused.
#pragma OPENCL EXTENSION cl_intel_channels : enable

struct TwoChannelSample
{
    short2 chanA;
    short2 chanB;
};

#define FIFO_DEPTH 32768

channel struct TwoChannelSample rxSamps __attribute__((depth(0))) __attribute__((io("THDB_ADA_rxSamples")));
channel struct TwoChannelSample txSamps __attribute__((depth(0))) __attribute__((io("THDB_ADA_txSamples")));
channel ushort stateChan __attribute__((depth(0))) __attribute__((io("THDB_ADA_state")));

kernel void thdbADARxTxCallback (global const float2* restrict txSamples,
                                 global float2* restrict rxSamples,
                                 global ushort* restrict interfaceState)
{
    // get state from interface
    *interfaceState = read_channel_intel (stateChan);

    // Process sample-wise
    for (int i = 0; i < FIFO_DEPTH; ++i)
    {
        struct TwoChannelSample rxSample = read_channel_intel (rxSamps);
        rxSamples[i].x = (float)rxSample.chanA.x;
        rxSamples[i].y = (float)rxSample.chanA.y;
        rxSamples[i + FIFO_DEPTH].x = (float)rxSample.chanB.x;
        rxSamples[i + FIFO_DEPTH].y = (float)rxSample.chanB.y;
    }
}
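For completeness, the host side essentially does the following in a loop (heavily simplified; setup, kernel argument binding and error checking omitted, names illustrative):

// Fetch one block of FIFO_DEPTH samples per kernel invocation,
// then append the chanA/chanB halves of the buffer to a file.
while (keepRunning) {
    clEnqueueTask(queue, kernel, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, rxBuf, CL_TRUE, 0,
                        2 * FIFO_DEPTH * sizeof(cl_float2),
                        hostSamples, 0, NULL, NULL); // blocking read
    fwrite(hostSamples, sizeof(cl_float2), 2 * FIFO_DEPTH, outFile);
}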
Related
I'm using the Bluetooth Serial Port Profile to communicate with an Arduino. The Bluetooth module (HC-06) is connected to my digital pins 10 and 11 (RX, TX). The module is working properly, but I need an interrupt on data receive. I can't periodically check for incoming data, as the Arduino is working on a time-sensitive task (playing music through a passive buzzer) and I need control signals to interrupt immediately on receive. I've looked through many documents, including Arduino's own site, and they all explain how to establish regular communication by periodically checking serialPort.available(). I've found one SO question, Arduino Serial Interrupts, but that's too complicated for my level. Any suggestions on reading real-time input through serial?
Note that the current version of SoftwareSerial actually uses PCINT to detect the individual bits. Hence, defining it again in the main loop would conflict with SoftwareSerial's own detection of bits.
I am reluctant to suggest this, as it means modifying a core library, which is hard to avoid when sharing interrupts. But if desperate, you could modify that routine for your need,
within \arduino-1.5.7\hardware\arduino\avr\libraries\SoftwareSerial\SoftwareSerial.cpp:
//
// The receive routine called by the interrupt handler
//
void SoftwareSerial::recv()
{
    ...
    // if buffer full, set the overflow flag and return
    if ((_receive_buffer_tail + 1) % _SS_MAX_RX_BUFF != _receive_buffer_head)
    {
        // save new data in buffer: tail points to where byte goes
        _receive_buffer[_receive_buffer_tail] = d; // save new byte
        _receive_buffer_tail = (_receive_buffer_tail + 1) % _SS_MAX_RX_BUFF;
#ifdef YOUR_THING_ENABLE
        // Quickly check if it is what you want and DO YOUR THING HERE!
#endif
    }
    ...
}
But beware: you are still in an ISR, all interrupts are off, and you are blocking everything. One should not lollygag nor dilly-dally here. Do your thing quickly and get out.
I have a PIC24-based system equipped with a 24-bit, 8-channel ADC (google "MCP3914 Evaluation Board" for more details...).
I have got the board to sample all 8 channels, store the data in a 512x8 buffer, and transmit the data to a PC using a USB module when the buffer is full (this is done by different interrupts).
The only problem is that while the MCU is transmitting data (the UART transmission interrupt has higher priority than the ADC reading interrupt), the ADC is not sampling, so there is data loss (the sample rate is around 500 samples/sec).
Is there any way to prevent this data loss? Maybe some multitasking?
Simply transmit the information to the UART register without using interrupts, by polling the TXIF bit:
while (PIR1.TXIF == 0);  // wait until the transmit register is free
TXREG = data;            // the byte you want to send
The same applies to the ADC conversion: if you were using interrupts to start/stop a conversion, simply poll the required bits (ADON) and that's it.
The TX bits and AD bits may vary depending on your PIC.
That prevents the MCU from entering an interrupt service routine and losing 3-4 samples.
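The same polling pattern for the ADC could look roughly like this (PIC16/18-style register and bit names, purely illustrative; on a PIC24 the corresponding registers are AD1CON1 and friends):

ADCON0bits.GO = 1;                                   // start a conversion
while (ADCON0bits.GO);                               // hardware clears GO when the result is ready
unsigned result = ((unsigned)ADRESH << 8) | ADRESL;  // assemble the 10-bit result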
In the PIC24 an interrupt can be assigned one of 8 priority levels. Take a look at the corresponding section in the Family Reference Manual: http://ww1.microchip.com/downloads/en/DeviceDoc/70000600d.pdf
Alternatively you can use DMA channels, which are very handy. You can configure your ADC to use DMA, so sampling and filling the buffer won't use any CPU time; the same goes for the UART, I believe.
http://ww1.microchip.com/downloads/en/DeviceDoc/39742A.pdf
http://esca.atomki.hu/PIC24/code_examples/docs/manuallyCreated/Appendix_H_ADC_with_DMA.pdf
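For illustration, a minimal ADC-to-DMA setup on a PIC24/dsPIC33 might look roughly like the sketch below. Register names, the IRQ select value and the buffer attributes are device-specific, so treat this only as a starting point and check the manuals linked above:

#include <xc.h>

static int adcBuf[512] __attribute__((space(dma))); // buffer in DMA RAM

void dma0_init(void)
{
    DMA0CONbits.AMODE = 0;                  // register indirect with post-increment
    DMA0CONbits.MODE  = 0;                  // continuous transfer, no ping-pong
    DMA0PAD = (int)&ADC1BUF0;               // peripheral address: ADC result register
    DMA0CNT = 511;                          // 512 transfers per block
    DMA0REQ = 13;                           // IRQ select "ADC1 convert done" (device-specific!)
    DMA0STA = __builtin_dmaoffset(adcBuf);  // DMA RAM start offset
    IFS0bits.DMA0IF = 0;                    // clear any pending flag
    IEC0bits.DMA0IE = 1;                    // one interrupt per filled buffer
    DMA0CONbits.CHEN = 1;                   // enable the channel
}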
We are using two USARTs running at 115200 baud on an STM32F2, one to communicate with a radio module and one for serial from a PC. The clock speed is 120 MHz.
When receiving data from both USARTs simultaneously, overrun errors can occur on one USART or the other. Some quick back-of-the-envelope calculations suggest there should be enough time to process both, as the interrupts simply copy each byte to a circular buffer.
Both in theory and from measurement, the interrupt code that pushes a byte to the buffer should/does run on the order of 2-4 µs, and at 115200 baud we have around 70 µs to process each character.
Why are we seeing occasional OREs on one or the other USART?
Update - additional information:
No other ISRs in our code are firing at this time.
We are running Keil RTX with the SysTick interrupt configured to fire every 10 ms.
We are not disabling any interrupts at this time.
According to this book (The Designer's Guide to the Cortex-M Processor Family), the interrupt latency is around 12 cycles (not really deadly).
Given all the above, 70 µs is at least a factor of 10 more than the time we take to clear the interrupts, so I'm not sure it is so easy to explain. Should I conclude that there must be some other factor I am overlooking?
MDK-ARM is version 4.70
The SysTick interrupt is used by the RTOS, so I cannot time it; the other ISRs take 2-3 µs per byte each.
I ran into a similar problem to yours a few months ago on a Cortex-M4 (SAM4S). I have a function that gets called at 100 Hz based on a timer interrupt.
In the meantime I had a UART configured to interrupt on character reception. The expected data over the UART was 64-byte packets, and interrupting on every character caused so much latency that my 100 Hz update function was running at about 20 Hz. 100 Hz is relatively slow on this particular 120 MHz processor, but interrupting on every character was causing massive delays.
I decided to configure the UART to use PDC (Peripheral DMA controller) and my problems disappeared instantly.
DMA allows the UART to store data in memory WITHOUT interrupting the processor until the buffer is full, saving lots of overhead.
In my case, I told the PDC to store UART data in a buffer (byte array) and specified the length. When the UART, via the PDC, filled the buffer, the PDC issued an interrupt.
In the PDC ISR (sketched below):
- Give the PDC a new empty buffer
- Restart the UART PDC (so it can collect data while we do other stuff in the ISR)
- memcpy the full buffer into the ring buffer
- Exit the ISR
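A rough sketch of that ISR on the SAM4S (UART register and bit names are from the SAM4S CMSIS headers; the double buffer and the ringbuffer_write() helper are illustrative):

#define DMA_BUF_LEN 64

static uint8_t dma_buf[2][DMA_BUF_LEN]; // one buffer filling, one spare
static volatile int active = 0;

void UART0_Handler(void)
{
    // RXBUFF fires when the PDC receive counter reaches zero (buffer full);
    // it must have been enabled with UART_IER_RXBUFF during init.
    if (UART0->UART_SR & UART_SR_RXBUFF) {
        uint8_t *full = dma_buf[active];
        active ^= 1;                                 // swap to the spare buffer
        UART0->UART_RPR = (uint32_t)dma_buf[active]; // give the PDC the empty buffer
        UART0->UART_RCR = DMA_BUF_LEN;               // restart reception right away
        ringbuffer_write(full, DMA_BUF_LEN);         // then copy at leisure
    }
}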
As swineone recommended above, implement DMA and you'll love life.
Had a similar problem. Short solution: change the oversampling to 8x, which makes the USART baud rate more precise. And choose your MCU clock wisely!
huart1.Init.OverSampling = UART_OVERSAMPLING_8;
Furthermore, add a USART error handler and a mechanism to check that your data is valid, such as CRC16. Here is an example for the STM32F0xx series; I assume it should be pretty similar across the series.
void UART_flush(void) {
    // Flush the UART RX buffer if RXNE is set
    if (READ_BIT(huart1.Instance->ISR, USART_ISR_RXNE)) {
        SET_BIT(huart1.Instance->RQR, UART_RXDATA_FLUSH_REQUEST);
    }
    // Not available on F030xx devices!
    // SET_BIT(huart1.Instance->RQR, UART_TXDATA_FLUSH_REQUEST);
    // Clear all errors (if needed)
    if (READ_BIT(huart1.Instance->ISR, USART_ISR_ORE | USART_ISR_FE | USART_ISR_NE)) {
        SET_BIT(huart1.Instance->ICR, USART_ICR_ORECF | USART_ICR_FECF | USART_ICR_NCF);
    }
}
// USART error handler
void HAL_UART_ErrorCallback(UART_HandleTypeDef *huart) {
    if (huart->Instance == USART1) {
        // See if we have any errors
        if (READ_BIT(huart1.Instance->ISR, USART_ISR_ORE | USART_ISR_FE | USART_ISR_NE | USART_ISR_RXNE)) {
            // Flush errors
            UART_flush();
            // Raise the error handler
            _Error_Handler(__FILE__, __LINE__);
        }
    }
}
DMA might help as well. My problem was related to USART clock tolerances, which can cause overrun errors even with DMA implemented, since it is a USART hardware problem. Anyway, hope this helps someone out there! Cheers!
I ran into this problem recently, so I implemented the HAL_UART_ErrorCallback function, which had not been implemented yet (just the __weak version).
It looks like this:
void HAL_UART_ErrorCallback(UART_HandleTypeDef *huart)
{
    if (huart == &huart1)
    {
        HAL_UART_DeInit(&huart1);
        MX_USART1_UART_Init(); // my initialization code
        ...
And this solved the overrun issue.
Is there a way to restrict the number of GPUs used by AMD OpenCL platforms? For NVIDIA platforms, one can simply set the environment variable CUDA_VISIBLE_DEVICES to limit the set of GPUs available to OpenCL.
EDIT: I know that I can create a context with a reduced set of devices. However, I am looking for a way to control the number of devices for the OpenCL platform from "outside".
AMD has the GPU_DEVICE_ORDINAL environment variable for both Windows and Linux. It lets you specify the indices of the GPUs that should be visible to your OpenCL application. For example:
jprice@nowai:~/benchmark$ python benchmark.py -clinfo
Platform 0: AMD Accelerated Parallel Processing
 -> Device 0: Tahiti
 -> Device 1: Tahiti
 -> Device 2: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
jprice@nowai:~/benchmark$ export GPU_DEVICE_ORDINAL=0
jprice@nowai:~/benchmark$ python benchmark.py -clinfo
Platform 0: AMD Accelerated Parallel Processing
 -> Device 0: Tahiti
 -> Device 1: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
A more detailed description can be found in the AMD APP OpenCL Programming Guide (currently in section 2.4.3 "Masking Visible Devices"):
http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf
The OpenCL host API allows you to specify the number of devices when you retrieve the device ID list:
cl_int clGetDeviceIDs(
    cl_platform_id platform,
    cl_device_type device_type,
    cl_uint num_entries,   // caps how many device IDs are returned
    cl_device_id *devices,
    cl_uint *num_devices)
The device ID array *devices can then be used to create a context with a specific number of devices.
Here is what the spec says:
num_entries is the number of cl_device entries that can be added to
devices. If devices is not NULL, the num_entries must be greater than
zero. devices returns a list of OpenCL devices found. The cl_device_id
values returned in devices can be used to identify a specific OpenCL
device. If devices argument is NULL, this argument is ignored. The
number of OpenCL devices returned is the minimum of the value
specified by num_entries or the number of OpenCL devices whose type
matches device_type. num_devices returns the number of OpenCL devices
available that match device_type. If num_devices is NULL, this
argument is ignored
cl_context clCreateContext(
    const cl_context_properties *properties,
    cl_uint num_devices,   // number of devices
    const cl_device_id *devices,
    void (CL_CALLBACK *pfn_notify)(
        const char *errinfo,
        const void *private_info, size_t cb,
        void *user_data),
    void *user_data,
    cl_int *errcode_ret)
Each device is then addressed through its own device queue.
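For example, to run on at most two GPUs of a platform (minimal sketch; error checking omitted):

cl_device_id devices[2];
cl_uint available = 0;
cl_int err;

// Ask for at most 2 GPU devices; 'available' reports how many matched in total
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, devices, &available);

// Build the context only on the devices actually returned
cl_uint used = (available < 2) ? available : 2;
cl_context context = clCreateContext(NULL, used, devices, NULL, NULL, &err);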
There is not a portable solution defined by the OpenCL specification.
NVIDIA has the solution you mentioned. I don't think AMD has an equivalent standard; your OpenCL programs will have to come up with their own way to share the available devices.
Note that AMD does have OpenCL extensions (some of which became official in OpenCL 1.2) for "device fission", which is used for splitting up a single device among multiple programs (but that is different from what you are asking).
I'm trying to talk serial with an SDI-12 device, which requires inverted logic, seven data bits, even parity, and one stop bit (7E1) at 1200 baud.
From the datasheet:
SDI-12 communication sends characters at 1200 bits per second. Each character has 1 start bit, 7 data bits (LSB first), 1 even parity bit, and 1 stop bit (Active low or inverted logic levels):
All SDI-12 commands and response must adhere to the following format on the data line. Both the command and response are preceded by an address and terminated by a carriage return line feed combination.
Is this possible with the Serial or SoftwareSerial libraries? I am trying to avoid additional hardware (beyond a level shifter to 3.3 V), but I will add it if that is the only way.
I have seen that SoftwareSerial can do inverted and Serial can do 7E1, but I can't find whether either can do both.
I have access to an Arduino Mega (R2) and an Arduino Uno (R3).
Here is the device I want to communicate with: http://www.decagon.com/products/sensors/soil-moisture-sensors/gs3-soil-moisture-temperature-and-ec/ and here is the document explaining the protocol: http://www.decagon.com/assets/Uploads/GS3-Integrators-Guide.pdf. Page 6 talks about its implementation of SDI-12.
I'm not familiar with the Arduino, but the SDI-12 physical layer is inverted from standard TTL levels, probably for two reasons:
Since the idle voltage is 0 V, this results in lower standby power (due to the nominal pull-down resistors in a typical SDI-12 sensor).
It facilitates simple bus 'sniffing' using a standard RS-232 serial port.
Short of bit-banging a 5 V I/O pin: yes, if using a standard microcontroller UART you will need an external inverter (or two) and a 3-state buffer, possibly with level shifting, depending on your hardware.
Thumbs down to the Wikipedia entry: SDI-12 uses entirely standard UART bit timings (very much like RS-232), just different signal levels (0-5 V); see point #2. However, there are specific break sequences and strict timing requirements, which make firmware development more difficult.
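For instance, the break that precedes every command is simply a long spacing period; a bit-banged sketch (pin name illustrative, timings from the SDI-12 spec) would be:

digitalWrite(SDI_PIN, HIGH);   // spacing (5 V) held for at least 12 ms = break
delay(13);
digitalWrite(SDI_PIN, LOW);    // then marking (0 V) for at least 8.33 ms
delayMicroseconds(8400);
// ... now clock out the command characters, LSB first, inverted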
If you are serious about SDI-12 firmware development, you may want to invest in an SDI-12 Verifier. A thorough study of the specification is essential.
A little late... but better late than never
I have actually just written a library for exactly that, including these exact sensors, so it should work with the included examples.
https://github.com/joranbeasley/SDISerial (Arduino Library)
#include <SDISerial.h> // https://github.com/joranbeasley/SDISerial (Arduino Library)
#include <string.h>

#define DATA_PIN 2

SDISerial connection(DATA_PIN);

char output_buffer[125]; // just for uart prints
char tmp_buffer[4];
char* sensor_info;

// initialize variables
void setup(){
    connection.begin();
    Serial.begin(9600); // so we can print to the standard uart
    // small delay to let the sensor do its startup stuff
    delay(3000); // 3 seconds should be more than enough
    sensor_info = connection.sdi_query("0I!", 1000); // get sensor info for address 0
}
// main loop
void loop(){
    // print to uart
    Serial.println("Begin Command: ?M!");
    // send a measurement query (M) to the first device on our bus
    char* resp = connection.service_request("0M!", "0D0!"); // get measurement from address 0
    sprintf(output_buffer, "RECV: %s", resp ? resp : "No Response Received!!");
    Serial.println(output_buffer);
    delay(10000); // sleep for 10 seconds before the next read
}