PCIe Bandwidth on ATI FirePro - opencl

I am trying to measure PCIe bandwidth on an ATI FirePro 8750. The AMD APP sample PCIeBandwidth in the SDK measures the bandwidth of transfers from:
Host to device, using clEnqueueWriteBuffer().
Device to host, using clEnqueueReadBuffer().
On my system (Windows 7, Intel Core 2 Duo, 32-bit) the output looks like this:
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : ATI RV770
Host to device : 0.412435 GB/s
Device to host : 0.792844 GB/s
This particular card has 2 GB of DRAM and a maximum clock frequency of 750 MHz.
1- Why is the bandwidth different in each direction?
2- Why is the bandwidth so small?
Also, I understand that this communication takes place through DMA, so the bandwidth should not be limited by the CPU.

This paper from Microsoft Research gives some inkling of why PCIe data transfer bandwidth between GPU and CPU is asymmetric. The paper reports performance metrics for FPGA - GPU data transfer bandwidth over PCIe, and also includes metrics for CPU - GPU data transfer bandwidth over PCIe.
To quote the relevant section:
'It should also be noted that the GPU-CPU transfers themselves also show some degree of asymmetric behavior. In the case of a GPU to CPU transfer, where the GPU is initiating bus master writes, the GPU reaches a maximum of 6.18 GByte/Sec. In the opposite direction from CPU to GPU, the GPU is initiating bus master reads and the resulting bandwidth falls to 5.61 GByte/Sec. In our observations it is typically the case that bus master writes are more efficient than bus master reads for any PCIe implementation due to protocol overhead and the relative complexity of implementation. While a possible solution to this asymmetry would be to handle the CPU to GPU direction by using CPU initiated bus master writes, that hardware facility is not available in the PC architecture in general.'
The answer to the second question, on why the bandwidth is so small, likely comes down to the data transfer size.
See figures 2, 3, 4 and 5 in the paper. I have also seen graphs like this at the 1st AMD Fusion Conference. The explanation is that PCIe data transfers carry overheads from the protocol and from device latency. The overheads are more significant for small transfer sizes and become less significant for larger ones.
What levers do you have to control or improve performance?
Getting the right combination of chipset/motherboard and GPU is the hardware lever. Chipsets with the maximum number of PCIe lanes are better, as is a higher-spec PCIe protocol: PCIe 3.0 is better than PCIe 2.0, and all components need to support the higher standard.
As a programmer, controlling the data transfer size is a very important lever.
Transfer sizes of 128 KB - 256 KB get approximately 50% of the maximum bandwidth; transfers of 1 MB - 2 MB get over 90% of the maximum bandwidth.
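These figures are consistent with a simple fixed-overhead-plus-streaming model of a transfer. The overhead and peak-bandwidth numbers below are illustrative assumptions, not measurements:

```python
# Simple PCIe transfer model: time = fixed_overhead + size / peak_bandwidth.
# The overhead and peak values are illustrative assumptions only.

FIXED_OVERHEAD_S = 20e-6   # assumed per-transfer overhead (driver + protocol + latency)
PEAK_BANDWIDTH = 8e9       # assumed peak link bandwidth in bytes/s

def effective_bandwidth(size_bytes):
    """Effective bandwidth in bytes/s for a single transfer of size_bytes."""
    t = FIXED_OVERHEAD_S + size_bytes / PEAK_BANDWIDTH
    return size_bytes / t

for size in (4 * 1024, 128 * 1024, 1024 * 1024, 16 * 1024 * 1024):
    eff = effective_bandwidth(size)
    print(f"{size // 1024:8d} KiB: {eff / PEAK_BANDWIDTH:5.1%} of peak")
```

Under these assumed numbers, a 128 KiB transfer lands around 45% of peak and a 1 MiB transfer around 87%, which matches the shape of the curves in the paper: the fixed overhead dominates small transfers and amortizes away for large ones.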

Related

What factors maximise BLE throughput using an L2CAP channel in iOS?

I am trying to understand the bottleneck in my setup, if one exists.
From an iPhone 11 app (central) I am writing to an L2CAP channel made available by a Raspberry Pi 4B (peripheral using BlueZ). (Using the CoreBluetooth framework on the iOS side)
From the btmon log I see that the iPhone suggests an MTU and MPS of 1498, while the Pi suggests an MTU of 65535 and an MPS of 247.
Throughput gets better with a higher MTU, but only up to a point: there is no difference between specifying an MTU of 5000 or 65535 on the peripheral side.
On the peripheral I have set both the minimum and maximum connection interval to 15 ms (and also tried 30 ms). Higher intervals result in slower throughput, as expected.
The throughput does not seem to go above 137 kbit/s, which is far lower than the 394 kbit/s shown by Apple at WWDC 2017.
Data length extension is available on both devices.
From the btmon logs I see that the majority of packets are of size 247 (243 bytes of payload), with a few of size 146 (142 bytes of payload). This could account for some slowness, but I doubt it causes the throughput to drop by a factor of 3.
Am I missing something or is this the limit for my setup?
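A rough upper bound for the numbers above follows from the connection interval and the per-packet payload seen in btmon. The packets-per-connection-event count is an assumption here, since it depends on the controller:

```python
# Rough BLE L2CAP throughput bound: payload delivered per connection event,
# divided by the connection interval. Payload size is taken from the btmon
# logs (243 bytes); packets per event is an assumed variable.

CONN_INTERVAL_S = 0.015   # 15 ms connection interval, as configured
PAYLOAD_BYTES = 243       # observed payload per data packet

def throughput_kbit(packets_per_event):
    """Approximate throughput in kbit/s for a given packets-per-event count."""
    return packets_per_event * PAYLOAD_BYTES * 8 / CONN_INTERVAL_S / 1000

for n in (1, 2, 3, 4):
    print(f"{n} packet(s)/event: {throughput_kbit(n):6.1f} kbit/s")
```

Under these assumptions, the observed ~137 kbit/s is consistent with roughly one full packet per connection event, while ~394 kbit/s would need about three, so how many packets the controllers exchange per event is a plausible place to look for the bottleneck.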

How can Intel DSA (Data Streaming Accelerator) store the descriptors in its local SRAM?

I would like to ask a question regarding Intel DSA (Data Streaming Accelerator).
According to the documentation, Intel DSA can support up to 256 queues and 65536 entries (the corresponding register fields are 8 and 16 bits wide, respectively). Are all descriptors stored locally in SRAM, or does the hardware write them out to DDR?
I think such a large number of descriptors and queues would take up a lot of chip area, and the hardware circuit timing and congestion would be difficult to control.
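As a back-of-envelope check on that suspicion, the worst-case descriptor footprint can be computed from the documented limits. The 64-byte figure matches DSA's descriptor format; the rest is an upper-bound estimate, not a claim about the actual silicon:

```python
# Worst-case SRAM footprint if every possible work-queue entry were held
# on-chip. DESCRIPTOR_BYTES matches the DSA descriptor format; MAX_ENTRIES
# is the maximum the 16-bit total-entries field can express.

DESCRIPTOR_BYTES = 64   # DSA descriptors are 64 bytes each
MAX_ENTRIES = 65536     # upper bound implied by the 16-bit register field

footprint_mib = DESCRIPTOR_BYTES * MAX_ENTRIES / (1024 * 1024)
print(f"Worst-case on-chip descriptor storage: {footprint_mib:.0f} MiB")
```

Four MiB of SRAM for descriptors alone would indeed be a lot of area, which supports the suspicion that the 256-queue / 65536-entry limits describe configurable maxima in register space rather than fully resident on-chip storage.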

What is the lowest latency communication method between a computer and a microcontroller?

I have a project in which I need the lowest latency possible (ideally in the 1-100 microsecond range) for communication between a computer (Windows, Linux, and macOS) and a microcontroller (Arduino, STM32, or anything else).
I stress that it not only has to be fast, but must have low latency: for example, a high-throughput link to the Moon would still have very high latency.
For the moment the methods I have tried are serial over USB and HID packets over USB. I get results a little under a millisecond. My measurement method is a round-trip communication, divided by two. This is OK, but I would be much happier with something faster.
EDIT:
The question seems to be quite hard to answer. The best workaround I found is to synchronize the clocks of the computer and the microcontroller. Synchronization requires communication, of course. With the process below, dt is half a round trip and sync is the difference between the clocks.
t = time()
write(ACK)
remote_t = read()
dt = (time() - t) / 2
sync = time() - remote_t - dt
Note that the imprecision of this synchronization is at most dt. The importance of the fastest possible communication channel stands, but now I have an estimate of the precision.
Also note the technicalities related to timestamp differences between systems (µs/ms since the epoch on Linux, ms/µs since the MCU booted on Arduino).
Pay attention to clock drift on the Arduino. It is safer to synchronize often (on every measurement, in my case).
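The procedure above can be sanity-checked with a small simulation; the link delay and clock offset below are made-up values:

```python
# Simulate the sync procedure with a fake remote clock to check the claim
# that the synchronization error is bounded by dt (half the round trip).
# one_way_delay and true_offset are made-up values for the simulation.

one_way_delay = 0.0004   # assumed symmetric link delay: 400 us each way
true_offset = 1.25       # remote clock reads local clock + true_offset

t = 0.0                                      # local time when the request is sent
remote_t = t + one_way_delay + true_offset   # remote timestamp taken on arrival
now = t + 2 * one_way_delay                  # local time when the reply arrives

dt = (now - t) / 2                           # estimated one-way delay
sync = now - remote_t - dt                   # estimated offset (local minus remote)

error = abs(sync + true_offset)              # recovered offset vs the real one
print(f"dt = {dt * 1e6:.0f} us, offset error = {error * 1e6:.2f} us")
```

With a perfectly symmetric link the recovered offset is exact; any asymmetry between the two directions shifts the result by at most dt, which matches the stated bound.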
USB raw HID with a hacked 8 kHz poll rate (125 µs poll interval), combined with a Teensy 3.2 (or above). Mouse overclockers have achieved an 8 kHz poll rate with low USB jitter, and the Teensy 3.2 (an Arduino-compatible board) can do an 8 kHz poll rate with a slightly modified USB driver on the PC side.
Barring this, if you need even better, you're looking at PCI Express parallel-port cards, doing lower-latency signalling via digital pins directly on the parallel port. They must be true parallel ports, not ones behind a USB layer. DOS apps on gigahertz-class PCs (a 1.4 GHz Pentium 4) have been tested to achieve sub-1 µs parallel-port pin signalling, and if you write a virtual device driver you can probably get sub-100 µs within Windows.
Use raised priorities and critical sections out the wazoo, preferably a non-garbage-collected language, a minimum of background apps, and essentially consume 100% of a CPU core on your critical loop, and you can reliably achieve <100 µs. Not 100% of the time, but certainly in five-nines territory (and probably far better than that), if you can tolerate such aberrations.
To answer the question, there are two low latency methods:
Serial or parallel port. It is possible to get latency down to the millisecond scale, although performance may vary by manufacturer. One good brand is Brainboxes, although their cards can cost over $100!
Write your own driver. It should be possible to achieve latencies on the order of a few hundred microseconds, although the kernel can obviously interrupt your process mid-way if it needs to serve something with a higher priority. This is how a lot of scientific equipment actually works (and a lot of the people telling you a PC can't be made to meet short deadlines are wrong).
For info, I just ran some tests on a Windows 10 PC fitted with two dedicated PCIe parallel port cards.
Sending TTL (square wave) pulses out using Python code (actually using PsychoPy Builder and PsychoPy Coder), a two-channel oscilloscope showed very consistent offsets between the two pulses of 4 µs to 8 µs.
This was when the Python code was run at 'above normal' priority.
When run at normal priority it was mostly the same, apart from a very occasional 30 µs gap, presumably when task switching took place.
In short, PCs aren't set up to handle deadlines that short. Even using a bare-metal RTOS on an Intel Core series processor, you end up with interrupt latency (how fast the processor can respond to interrupts) in the 2-3 µs range (see http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/industrial-solutions-real-time-performance-white-paper.pdf).
That's ignoring any sort of communication link like USB or ethernet (or other) that requires packetizing data, handshaking, buffering to avoid data loss, etc.
USB stacks are going to have latency, regardless of how fast the link is, because of buffering to avoid data loss. Same with ethernet. Really, any modern stack driver on a full blown OS isn't going to be capable of low latency because of what else is going on in the system and the need for robustness in the protocols.
If you have deadlines that are in the single digit of microseconds (or even in the millisecond range), you really need to do your real time processing on a microcontroller and have the slower control loop/visualization be handled by the host.
You have no guarantees about latency to userland without a real-time operating system. You're at the mercy of the kernel, its time slices, and its preemption rules, which could add up to more than your 100 µs maximum.
In order for a workstation to respond to a hardware event you have to use interrupts and a device driver.
Your options are limited to interfaces that offer an IRQ:
Hardware serial/parallel port.
PCI
Some interface bridge on PCI.
Or, if you're into abusing I/O, the sound card.
USB is not one of them; it has a 1 kHz polling rate.
Maybe Thunderbolt qualifies, but I'm not sure about that.
Ethernet
Look for a board that has a gigabit Ethernet port directly connected to the microcontroller, and connect it to the PC directly with a crossover cable.

Reserve GPU for one exclusive OpenCL host programme

Is there a way to exclusively reserve the GPU for one OpenCL host programme?
No other process should have access to this device via OpenCL or OpenGL.
The background is that my software calculates real-time data on the GPU, so performance suffers if the GPU is doing other work as well.

Mobile network generations and bandwidth

I have a couple of questions related to mobile network generations:
When comparing mobile network generations, people mention that 1G and 2G have to use the maximum bandwidth. What is bandwidth in this context, and why do they use the maximum bandwidth?
1G and 2G are narrowband networks. What does narrowband mean here?
3G and 4G are wideband networks. Don't they use the maximum bandwidth?
They use the maximum bandwidth to transmit analogue voice information. Since the available bandwidth is limited, and there is a lower limit on the bandwidth needed to keep the voice recognizable on the other side, they can't go below this fixed bandwidth.
It refers to the frequency span of the carrier. If it is narrowband, then usually it's just a few kilohertz, which makes the available bandwidth low (this also depends on the kind of modulation used).
They don't need the maximum available bandwidth for voice communications, since acceptable voice quality only requires transmitting a few kilohertz of analogue information.
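The "few kilohertz" figure for voice can be made concrete with the classic telephony numbers; this is the standard G.711 PCM arithmetic, used here as a worked example:

```python
# Classic narrow-band telephony arithmetic: roughly 4 kHz of analogue voice
# bandwidth, sampled at the Nyquist rate with 8-bit companded samples, gives
# the 64 kbit/s rate of a G.711 PCM voice channel in digital networks.

VOICE_BANDWIDTH_HZ = 4000                  # analogue bandwidth kept for intelligible speech
SAMPLE_RATE_HZ = 2 * VOICE_BANDWIDTH_HZ    # Nyquist rate: sample at twice the bandwidth
BITS_PER_SAMPLE = 8                        # companded (A-law / mu-law) sample width

bitrate_kbit = SAMPLE_RATE_HZ * BITS_PER_SAMPLE / 1000
print(f"Digital voice channel: {bitrate_kbit:.0f} kbit/s")
```

So a single digitized voice channel needs only 64 kbit/s, which is why voice never requires the full bandwidth that wideband 3G/4G networks make available.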

Resources