Vivado: setting timing constraints for input and output delay, simulation mismatch and wrong clock behavior - constraints

I'm implementing a hashing algorithm in Verilog using Vivado 2019.2.1. Everything (including synthesis and implementation) worked quite well, but I recently noticed that the results of the behavioral simulation (correct hash digest) differ from those of the post-synthesis/post-implementation functional and timing simulations, i.e. I get three different values for the same circuit design/code.
My base configuration contained a testbench using the default `timescale 1ns / 1ps and a #1 delay for toggling the clock register. I further constrained the clock to a frequency of 10 MHz using an XDC file. During synthesis, no errors (or even warnings, except some "parameter XYZ is used before its declaration") are shown, and blocking and non-blocking assignments are not mixed anywhere in my code. Nevertheless, I noticed that the post-* simulations (no matter whether functional or timing) need more clock cycles (e.g. 58 instead of 50 until a specific register toggles) to reach the same state of the circuit. My design is entirely synchronous and driven by one clock.
This brought me to the Timing Report, where I noticed that 10 input and 10 output delays are not constrained. In addition, the Design Timing Summary shows a worst negative slack for setup that is very close to one clock period. I tried some combinations of input and output delays following the Vivado documentation and tutorial videos, but I'm not sure how to determine which values are suitable. The total slack (TNS, THS and TPWS) is zero.
Furthermore, I tried to reduce the clock frequency because the propagation delay of some signals that control logic in the FSM (= top) module might be too large. The strange thing that happened then is that the simulation never reached the $finish in my testbench, and nothing except the clock register changed its value in the waveform. In the behavioral simulation everything works as expected, but that simulation does not seem to be influenced by constraints or even timing. Monitoring the o_round_done wire (driven by an LFSR in a separate submodule) in my testbench, I noticed that in the behavioral simulation the value of this wire changes with the clock, whereas in the post-* simulations the value changes with a small delay:
Behavioral Simulation
clock cycles: 481, round_done: 0
clock cycles: 482, round_done: 1
clock cycles: 483, round_done: 0
total of 1866 clock cycles
Post-Implementation Functional Simulation
clock cycles: 482, round_done: 0
clock cycles: 482, round_done: 1
clock cycles: 483, round_done: 1
clock cycles: 483, round_done: 0
total of 1914 clock cycles
Post-Implementation Timing Simulation
WARNING: "C:\Xilinx\Vivado\2019.2\data/verilog/src/unisims/BUFG.v" Line 57: Timing violation in scope /tb/fsm/i_clk_IBUF_BUFG_inst/TChk57_10300 at time 997845 ps $period (posedge I,(0:0:0),notifier)
WARNING: "C:\Xilinx\Vivado\2019.2\data/verilog/src/unisims/BUFG.v" Line 56: Timing violation in scope /tb/fsm/i_clk_IBUF_BUFG_inst/TChk56_10299 at time 998845 ps $period (negedge I,(0:0:0),notifier)
simulation never stops (probably because round_done is never 1)
Do you know what I'm doing wrong here? I'm wondering why the circuit does not behave correctly at very low clock frequencies (e.g. 500 kHz), as, to my knowledge, this should provide enough time for each signal to "travel" to the correct destination.
Another thing I noticed is that one wire that is assigned from a register in a submodule is 8'bXX in the behavioral simulation until the connected register is "filled", but in the post-* simulations it is 8'b00 from the beginning. Any idea here?
Moreover, what actually defines the clock frequency for the simulations: the values in the testbench (timescale and #delay) or the constraint in the XDC file?

I found an explanation for why the post-* simulations behave differently from the behavioral simulation w.r.t. clock cycles etc. in the Xilinx Vivado Design Suite User Guide for Logic Simulation (UG900).
The "latency" before the actual computation of the design can start is caused by the Global Set and Reset (GSR), which takes 100 ns:
The glbl.v file declares the global GSR and GTS signals and automatically pulses GSR for 100 ns. (p. 217)
Consequently, I solved the issue by letting the testbench wait for the control logic (= finite-state machine) to be ready, i.e. to have left the RESET state.

Related

Detect Telosb emission using USRP Source

I am using an N210 USRP to observe the RF spectrum around the 2.4 GHz range.
I have programmed two TelosB nodes, and they are using RadioCoundLed to send and receive signals.
I have set the TelosB nodes to the highest power level following the datasheet.
I have also fixed them on a channel (26) around 2.48 GHz.
I can see the TelosB nodes communicating and the LEDs are blinking.
Now I should observe this in the USRP RF spectrum. However, I am observing nothing in the Scope Sink. I have fixed the center frequency in the 2.48 GHz range.
The RX gain is set to 0.
The sampling rate is 2M.
Is it even possible to observe it?
I guess I solved the problem. I was using the wrong daughterboard. Now I am using the SBX board, which supports the 2.5 GHz range.

Performance of the MPI_win_lock

I am having a hard time justifying the performance of the following snippet of my code, which uses the Intel MPI library:
double time = 0.0;
time = time - MPI_Wtime();
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win_global_scheduling_step); /* lock rank 0's window */
MPI_Win_unlock(0, win_global_scheduling_step);
time = time + MPI_Wtime();   /* time spent in the lock/unlock pair */
if (id == 0)
    sleep(10);               /* only rank 0 sleeps */
printf("%d sync time %f\n", id, time);
The output depends on how long rank 0 sleeps, as in the following:
0 sync time 0.000305
1 sync time 10.00045
2 sync time 10.00015
If I change rank 0's sleep to 5 seconds instead of 10 seconds, then the sync time at the other ranks is on the same scale of 5 seconds.
The actual data associated with the window "win_global_step" is owned by rank 0.
Any discussion or thoughts about the code would be very helpful.
If rank 0 owns win_global_step, and rank 0 goes to sleep, cranks away on a computation kernel, or otherwise does not make MPI calls, many MPI implementations will not be able to service requests from the other ranks.
There is an environment variable (MPICH_ASYNC_PROGRESS) you might try setting. It introduces some big performance tradeoffs, but in some instances it lets RMA operations make progress without explicit calls to MPI routines.
Despite the "MPICH" in the name, it might work for you, as Intel MPI is based on the MPICH implementation.
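For reference, a minimal way to try it is to pass the variable at launch time; this sketch assumes a Hydra-based launcher and a placeholder program name (./a.out), so check your Intel MPI documentation for the exact syntax:
mpirun -n 3 -genv MPICH_ASYNC_PROGRESS 1 ./a.out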

OpenCL: Confused by CL_DEVICE_MAX_COMPUTE_UNITS

I'm confused by CL_DEVICE_MAX_COMPUTE_UNITS. For instance, for the Intel GPU on my Mac this number is 48. Does this mean the maximum number of parallel tasks running at the same time is 48, or a multiple of 48, maybe 96, 144, ...? (I know each compute unit is composed of one or more processing elements and each processing element is actually in charge of a "thread". What if each of the 48 compute units is composed of more than one processing element?) In other words, for my Mac, is the "ideal" speedup, although impossible in reality, 48 times faster than a single CPU core (assuming the single-"core" computation speed of the CPU and GPU is the same), or a multiple of 48, maybe 96, 144, ...?
Summary: your speedup is a little complicated, but your machine's (Intel GPU, probably GEN8 or GEN9) fp32 throughput is 768 FLOPs per (GPU) clock, and 1536 for fp16. Let's assume fp32, so something less than 768x (maybe a third of that depending on CPU speed). See below for the reasoning and some very important caveats.
A Quick Aside on CL_DEVICE_MAX_COMPUTE_UNITS:
Intel does something wonky with CL_DEVICE_MAX_COMPUTE_UNITS in its GPU driver.
The clGetDeviceInfo documentation (OpenCL 2.0) says of CL_DEVICE_MAX_COMPUTE_UNITS:
The number of parallel compute units on the OpenCL device. A
work-group executes on a single compute unit. The minimum value is 1.
However, the Intel graphics driver does not actually follow this definition and instead returns the number of EUs (Execution Units). An EU is a grouping of the SIMD ALUs plus slots for 7 different SIMD threads (registers and whatnot). Each SIMD thread represents 8, 16, or 32 work items depending on what the compiler picks (we want higher, but register pressure can force us lower).
A workgroup is actually limited to a "Slice" (see the figure in section 5.5 "Slice Architecture"), which happens to be 24 EUs in recent hardware. Pick the GEN8 or GEN9 documents. Each slice has its own SLM, barriers, and L3. Given that your Mac is reporting 48 EUs, I'd say you have two slices.
Maximum Speedup:
Let's ignore this major annoyance and work with the EU number (and with those arch docs above). For "speedup" I'm comparing against a single-threaded fp32 calculation on the CPU. With good parallelization etc. on the CPU, the speedup would be less, of course.
Each of the 48 EUs can issue two SIMD4 operations per clock in ideal circumstances. Assuming those are fused multiply-adds (so really two ops each), that gives us:
48 EUs * 2 SIMD4 ops per EU * 2 (if the op is a fused multiply-add)
= 192 SIMD4 ops per clock
= 768 FLOPs per clock (192 * 4 lanes) for single-precision floating point
So your ideal speedup factor is actually ~768. But there's a bunch of things that chip away at this ideal number:
Setup and teardown time. Let's ignore this (assume the workload time dominates the runtime).
The GPU clock maxes out around a gigahertz while the CPU runs faster, so factor that ratio in (crudely 1/3 maybe: 3 GHz on the CPU vs 1 GHz on the GPU).
If the computation is not heavily multiply-add ("mad") based, divide by 2, since I doubled above. Many important workloads are "mad"-dominated, though.
The execution must be mostly non-divergent. If a SIMD thread branches into an if-then-else, the entire SIMD thread (8, 16, or 32 work items) has to execute that code.
Register bank collision delays can reduce EU ALU throughput. Typically the compiler does a great job avoiding this, but it can theoretically chew into your performance a bit (usually a few percent, depending on register pressure).
Buffer address calculation can chew off a few percent too (the EU must spend time doing integer compute to read and write addresses).
If one uses too much SLM or too many barriers, the GPU must leave some of the EUs idle so that there's enough SLM for each work item on the machine. (You can tweak your algorithm to fix this.)
We must keep the workload compute bound. If we blow out any cache in the data access hierarchy, we run into scenarios where no SIMD thread is ready to run on an EU and must stall. Assume we avoid this.
And I'm probably forgetting other things that can go wrong.
We call the percentage of the theoretical peak that we reach the efficiency. So if our workload runs at ~530 FLOPs per clock, then we are about 69% efficient relative to the theoretical 768. I've seen very carefully tuned workloads exceed 90% efficiency, but it definitely can take some work.
The ideal speedup you can get is the total number of processing elements, which in your case is 48 * the number of processing elements per compute unit. I do not know of a way to get the number of processing elements through OpenCL (that does not mean it is not possible), but you can just google it for your GPU.
To my knowledge, a compute unit consists of one or multiple processing elements (for GPUs, usually a lot), a register file, and some local memory. The threads of a compute unit are executed in a SIMD (single instruction, multiple data) fashion, which means they all execute the same operation but on different data.
Also, the speedup you get depends on how you launch the kernel. Since a single work-group cannot run on multiple compute units, you need a sufficient number of work-groups to fully utilize all of the compute units. In addition, the work-group size should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
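If it helps, here is a minimal sketch in C of how those two values can be queried at runtime; it assumes a device and kernel object that were created elsewhere, and on macOS the header is <OpenCL/opencl.h> rather than <CL/cl.h>.
#include <stdio.h>
#include <CL/cl.h>   /* <OpenCL/opencl.h> on macOS */

/* Query launch-related hints for an already-created device and kernel. */
void print_launch_hints(cl_device_id device, cl_kernel kernel)
{
    cl_uint compute_units = 0;
    size_t preferred_multiple = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred_multiple), &preferred_multiple, NULL);
    printf("compute units: %u, preferred work-group size multiple: %zu\n",
           compute_units, preferred_multiple);
}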

Balancing quadcopter using Arduino

I am doing a project on a self-balancing quadcopter with autonomous control. I am using an Arduino Mega 2560 and an MPU6050. I have obtained the roll and pitch angles from the MPU6050 without the help of the DMP and applied a complementary filter to suppress the noise due to vibration.
I have also configured the BLDC motors and can run them with a FlySky transmitter and receiver with the help of Arduino interrupts. For balancing, I am currently focusing on only one axis (i.e. roll). I have also constructed a balancing stand that lets the motors move the roll axis freely.
For the control part, I am implementing a PID algorithm. I tried using only the Kp term so that I could somehow get it to balance and then move on to the Ki and Kd terms. But unfortunately, even with Kp alone, the quadcopter undergoes aggressive oscillation and does not settle at all.
Some of my queries are:
Is a single PID loop enough, or do we have to add another one?
What tuning method can I use to find Kp, Ki and Kd, other than trial and error?
I programmed my ESCs for 1000 to 2000 microseconds. My PID input angles will be within the range +/-180. Can I directly set the PID output limits to the range -1000 to 1000, or -180 to 180, or some other value?
The code can be read at https://github.com/antonkewin/quadcopter/blob/master/quadpid.ino
Since it's not provided, I am assuming that:
the loop time is around 4 ms or less (the shorter the better);
the sensor noise has been reduced to an acceptable level;
the MPU-6050 gyro and accelerometer data are combined to get the angles in degrees.
If the above points are not taken care of, it will not balance itself.
Initially, you can get away without tuning Ki, so let's focus on Kp and Kd (see the sketch after this list):
Keep increasing Kp until the quad starts to oscillate quickly, then set Kp to half of that value.
With Kp set, start experimenting with Kd values, as Kd will try to dampen the overshoots caused by Kp.
Fiddle with these two values and tune them for perfection.
Note: the more accurate your gyro data is, the higher you can set your Kp.
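Here is that sketch: a minimal single-axis PD step in plain C with hypothetical names (it is not the code from the linked repository). It also addresses the output-limit question: the correction is clamped to a fraction of the 1000-2000 us ESC range and applied around a base throttle, rather than spanning the full -1000 to 1000.
/* Single-axis PD step: angle in degrees, gyro rate in deg/s. */
float pd_update(float setpoint_deg, float angle_deg, float gyro_rate_dps,
                float kp, float kd)
{
    float error = setpoint_deg - angle_deg;          /* stays within +/-180 */
    float output = kp * error - kd * gyro_rate_dps;  /* derivative taken from the gyro */
    /* Clamp to a fraction of the ESC range, e.g. +/-400 us of correction. */
    if (output >  400.0f) output =  400.0f;
    if (output < -400.0f) output = -400.0f;
    return output;  /* added to one motor's base pulse and subtracted from the other */
}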

nanosleep - need low resolution

We are running both SLES10 (2.6.16.60-0.54.5-smp) and SLES11 (2.6.32.12-0.7-default).
After 2.6.16, nanosleep was changed to make use of high-resolution timers.
Our code must run with similar characteristics on both SLES10 and SLES11. Currently, because the SLES11 kernel is configured with high-resolution timers (which we may not change), we find that CPU usage is much higher than on SLES10. A simple looped nanosleep shows up in "top" on SLES11 but not on SLES10.
We can change the calls to nanosleep in the code, but we don't know what to change to make them behave equivalently on both platforms.
More info:
on SLES11, kernel timer interrupt frequency is approx. 4016 Hz or higher
on SLES10, kernel timer interrupt frequency is approx. 250 Hz
What value should be used in the timespec's tv_nsec to decrease CPU usage on the SLES11 platform?
The previous behaviour (sleeping for a nanosecond with a 250 Hz timer interrupt) would, on average, sleep for 1/500th of a second, since the process only wakes at the next timer tick, which is half of a 4 ms tick away on average.
If you want approximately the same behaviour as before, you can simply sleep for 1/500th of a second, which is 2,000,000 nanoseconds.
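A minimal sketch of the suggested change, using the 2 ms value above:
#include <time.h>

void low_res_pause(void)
{
    /* 2,000,000 ns = 2 ms, i.e. roughly the old average sleep under a 250 Hz tick */
    struct timespec ts = { 0, 2000000L };
    nanosleep(&ts, NULL);
}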
