I have so far not found any way to do anything similar to Xilinx' RLOC constraints for Altera FPGAs.
Does anyone know a way to do this?
For example, place two FFs in the same or adjacent LABs.
So, to answer my own question: after consulting the Altera manuals and some trial and error, I found that the following pretty much does what I want.
module synchronizer (input  wire dat_i,
                     input  wire out_clk,
                     output wire dat_o);

    // Mark the register pair so Quartus treats it as a 2-stage synchronizer
    // chain and TimeQuest analyses these paths for metastability.
    (* altera_attribute = "-name SYNCHRONIZATION_REGISTER_CHAIN_LENGTH 2; -name SYNCHRONIZER_IDENTIFICATION \"FORCED IF ASYNCHRONOUS\"" *)
    logic [1:0] out_sync_reg;

    always_ff @(posedge out_clk) begin
        out_sync_reg <= {out_sync_reg[0], dat_i};
    end

    assign dat_o = out_sync_reg[1];

endmodule
I tested this by setting global synchronizer detection to off and observed that TimeQuest found and analysed the correct paths for metastability.
This works well even when dat_i is latched by clk_a and out_clk is driven by clk_b, with the two clocks constrained as:
set_clock_groups -asynchronous -group {clk_a}
set_clock_groups -asynchronous -group {clk_b}
This creates false paths on all connections from registers clocked by clk_a to registers clocked by clk_b.
set_max_delay/set_min_delay won't work, since (as stated by Altera) they are ignored if the two clocks are in different asynchronous clock groups.
Altera do not support RLOC-style constraints. Apparently this is to do with the underlying physical architecture: I believe they over-provision ALMs and fuse out columns during chip test to improve yield, so relative location constraints would not translate as expected to a given physical device.
If you are worried about synchroniser chain placement, you can enable synchroniser chain detection using the SYNCHRONIZATION_REGISTER_CHAIN_LENGTH and SYNCHRONIZER_IDENTIFICATION QSF settings (see also this answer).
If you just want to ensure particular timing properties then use set_max_delay and set_min_delay timing constraints on your path.
I have read several pieces of code that do layer initialization using PyTorch's nn.init.kaiming_normal_(). Some use the fan_in mode, which is the default. Of the many examples, one can be found here and is shown below.
init.kaiming_normal(m.weight.data, a=0, mode='fan_in')
However, sometimes I see people using the fan_out mode, as seen here and shown below.
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
Can someone give me some guidelines or tips to help me decide which mode to select? Further, I am working on image super-resolution and denoising tasks in PyTorch; which mode would be more beneficial for those?
According to the documentation:
Choosing 'fan_in' preserves the magnitude of the variance of the
weights in the forward pass. Choosing 'fan_out' preserves the
magnitudes in the backwards pass.
and according to Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - He, K. et al. (2015):
We note that it is sufficient to use either Eqn.(14) or
Eqn.(10)
where Eqn.(10) and Eqn.(14) correspond to fan_in and fan_out respectively. Furthermore:
This means that if the initialization properly scales the backward
signal, then this is also the case for the forward signal; and vice
versa. For all models in this paper, both forms can make them converge
So, all in all, it doesn't matter much; it's more about what you are after. I assume that if you suspect your backward pass might be more "chaotic" (greater variance), it is worth changing the mode to fan_out. This might happen when the loss oscillates a lot (e.g. very easy examples followed by very hard ones).
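(For reference, the two conditions being referred to are, in the paper's notation with n_l the fan-in and n̂_l the fan-out of layer l: ½·n_l·Var[w_l] = 1 for Eqn.(10) and ½·n̂_l·Var[w_l] = 1 for Eqn.(14), i.e. a weight standard deviation of sqrt(2/fan_in) versus sqrt(2/fan_out), which, as far as I can tell, is what kaiming_normal_ computes for nonlinearity='relu' with the corresponding mode.)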
A correct choice of nonlinearity is more important, where nonlinearity is the activation you are using after the layer you are currently initializing. The current defaults set it to leaky_relu with a=0, which is effectively the same as relu. If you are using leaky_relu you should change a to its negative slope.
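For example, here is a minimal sketch (the 0.2 slope and the layer shapes are placeholders of mine) of matching the initialization to a LeakyReLU that follows each convolution:

import torch.nn as nn

negative_slope = 0.2  # slope of the LeakyReLU that follows each conv layer

def init_weights(m):
    if isinstance(m, nn.Conv2d):
        # Pass the activation's slope as `a` so the gain is computed correctly.
        nn.init.kaiming_normal_(m.weight, a=negative_slope,
                                mode='fan_in', nonlinearity='leaky_relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.LeakyReLU(negative_slope),
    nn.Conv2d(64, 3, 3, padding=1),
)
model.apply(init_weights)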
Newbie to OpenCL here. I'm trying to convert a numerical method I've written to OpenCL for acceleration. I'm using the PyOpenCL package as I've written this once in Python already and as far as I can tell there's no compelling reason to use the C version. I'm all ears if I'm wrong on this, though.
I've managed to translate most of the functionality I need into OpenCL kernels. My question is how to (properly) tell OpenCL to ignore my boundary/ghost cells. The reason I need to do this is that my method, for example, for point i accesses cells at [i-2:i+2], so if i=1, I'll run off the end of the array. So I add some extra points that serve to prevent this, and then just tell my algorithm to only run on points [2:nPts-2]. It's easy to see how to do this with a for loop, but I'm a little less clear on the 'right' way to do this for a kernel.
Is it sufficient to do, for example (pseudocode)
__kernel void myMethod(...) {
    gid = get_global_id(0);
    if (gid < nGhostCells || gid > nPts-nGhostCells) {
        retVal[gid] = 0;
    }
    // Otherwise perform my calculations
}
or is there another/more appropriate way to enforce this constraint?
It looks sufficient.
The branch goes the same way for nPts - nGhostCells*2 of the points, and it is predictable if nPts and nGhostCells are compile-time constants. Even if it is not predictable, with nPts sufficiently large compared to nGhostCells (1024 vs 3, say) it should not be noticeably slower than a branch-free version, apart from the latency of the "or" operation, and even that latency should be hidden behind the array access latency thanks to thread-level parallelism.
At the "break" points, typically only 16 or 32 threads (one SIMD group) lose some performance, and only for several clock cycles, because of the lock-step execution of SIMD-like architectures.
If you do end up with chaotic branching, such as data-driven code paths, then you should split the work into different kernels (one per region) or sort the data before the kernel so that branch divergence between neighboring threads is minimized.
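For reference, here is a minimal, self-contained PyOpenCL sketch of the guarded-kernel approach from the question (the averaging stencil, buffer names and sizes are placeholders of mine, not your actual method; note the >= so that the last nGhostCells points are excluded as well):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n_pts, n_ghost = 1024, 2
src = np.arange(n_pts, dtype=np.float32)
ret = np.empty_like(src)

mf = cl.mem_flags
src_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=src)
ret_g = cl.Buffer(ctx, mf.WRITE_ONLY, ret.nbytes)

prg = cl.Program(ctx, """
__kernel void myMethod(__global const float *src,
                       __global float *retVal,
                       const int nPts,
                       const int nGhostCells)
{
    int gid = get_global_id(0);
    // Ghost/boundary cells: write a known value and return early.
    if (gid < nGhostCells || gid >= nPts - nGhostCells) {
        retVal[gid] = 0.0f;
        return;
    }
    // Interior cells: the [gid-2, gid+2] stencil is now safe.
    retVal[gid] = 0.2f * (src[gid-2] + src[gid-1] + src[gid]
                          + src[gid+1] + src[gid+2]);
}
""").build()

prg.myMethod(queue, (n_pts,), None, src_g, ret_g,
             np.int32(n_pts), np.int32(n_ghost))
cl.enqueue_copy(queue, ret, ret_g)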
I have a component in OpenMDAO without outputs that serves to provide inputs to the rest of the group. apply_linear in that component is being called despite the fact that the output of it is not connected. Shouldn't the relevance reduction algorithm in OpenMDAO 1.x figure out that apply_linear for this method never needs to be called?
As it turns out, relevance reduction on a per-variable basis isn't turned on by default. You can turn it on with:
from openmdao.api import LinearGaussSeidel  # OpenMDAO 1.x

prob.root.ln_solver = LinearGaussSeidel()
prob.root.ln_solver.options['single_voi_relevance_reduction'] = True
This option is set to False by default because it uses more memory, allocating separate vectors for each quantity of interest (each vector is smaller because it only contains relevant variables, but the total size may be larger). Also, relevance reduction is only applicable when using Linear Gauss-Seidel as the top linear solver.
My reputation isn't high enough yet to leave comments, so I'm just adding another answer instead. I just wanted to mention that if you're not running under MPI, activating single_voi_relevance_reduction is essentially free. The real increase in memory use isn't due to the vectors themselves, but instead it's due to the index arrays that we store in order to transfer the data from source arrays to target arrays. We're forced to use index arrays under MPI, because PETSc requires it, but when we're not using MPI we use python slice objects to do our data transfer. Slice objects require very little memory.
I would like to run the full 6502 test suite by Klaus Dormann to test my Kansas Lava 6502 implementation. However, the code uses self-modification (see all uses of range_adr), which, while trivial to implement in an emulator, doesn't bode well for a hardware implementation: the program image needs to be stored in ROM, so the write-backs will be black-holed by whatever logic routes writes to the ROM- or RAM-backed parts of the address space.
The same problem, of course, applies both to synthesizing it to a real FPGA and to running it in a simulator (either the low-level VHDL one or the high-level Kansas Lava one).
Is there a way to run the test suite without doing a lengthy (in terms of cycles) dance of suspending the CPU, copying the program from some unaddressable ROM into an all-RAM memory byte-by-byte, and then initializing the CPU and letting it run? I'd prefer not doing this because simulating these extra cycles at startup will slow down running the test considerably.
Knee-jerk observations:
Despite coming as a 64 kB image, the test is just 14,093 bytes of actual content, from $0000 up to $370d, followed by a padded fill of $ffs up to the three vectors at $fffa–$ffff. So you'd need to copy at most 14,099 bytes rather than the prima facie 65,536.
Having set up that very test suite in an emulator I wrote yesterday (no, really), the full set of touched addresses (using [x, y] to denote the closed range, i.e. including both x and y) is:
[000a, 0012], [0100, 0101], [01f9, 01ff] (i.e. the stack and zero pages);
0200;
[0203, 0207];
04a8;
2cbb;
2cdc;
2eb1;
2ed2;
30a7;
30c8;
33f2;
3409;
353b; and
3552.
From the .lst version of the program, that means all you need to move are the variables with labels:
test_case;
ada2;
sba2;
range_adr;
... and either move or remove the routines that:
test AND immediate from 2cac down to 2cec;
test EOR immediate from 2ea2 to 2ee2;
test ORA immediate from 3098 to 30d8;
test decimal ADC/SBC immediate from 33e7 down to 3414 (specifically to include chkdadi and chksbi);
test binary ADC/SBC immediate from 3531 down to 355d.
All the immediate tests self-modify their operand. If you're happy leaving that one addressing mode untested then it shouldn't be too troublesome.
So, I guess, edit those tests out of the original file, and you can safely relocate range_adr to the middle of the stack page if my simulation is accurate.
Okay, so I am trying to drive a 7-segment-based display in order to show temperature in degrees Celsius. So, I have two displays, plus one extra LED to indicate positive and negative numbers.
My problem lies in the software. I have to find some way of driving these displays, which means converting a given integer into the relevant voltages on the pins, which means that for each of the two displays I need to know the number of tens and number of 1s in the integer.
So far, what I have come up with will not be very nice for an Arduino, as it relies on division.
tens = numberToDisplay / 10;
ones = numberToDisplay % 10;
I have admittedly not tested this yet, but I think I can assume that for a microcontroller with limited division capabilities this is not an optimal solution.
I have wracked my brain and looked around for a solution using addition/subtraction/bitwise but I cannot think of one at all. This division is the only one I can see.
For this application it's fine. You don't need to worry about performance in a simple thermometer.
If, however, you do need something quicker than division and modulo, then bitwise operations can help. Basically, you would use the bitwise & operator to compare the value you want to show against bit patterns describing the digits to be displayed; a rough sketch of this idea follows the link below.
See the project here for example: http://fritzing.org/projects/2-digit-7-segment-0-99-counting-with-arduino/
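As one possible reading of that suggestion, here is a rough sketch of the lookup/masking idea in plain Python rather than Arduino C (the segment ordering and the bit patterns are my own illustration, not taken from the linked project):

# One 7-bit pattern per digit 0-9; bit 0 = segment 'a' ... bit 6 = segment 'g'.
SEGMENTS = "abcdefg"
DIGIT_PATTERNS = [
    0b0111111,  # 0
    0b0000110,  # 1
    0b1011011,  # 2
    0b1001111,  # 3
    0b1100110,  # 4
    0b1101101,  # 5
    0b1111101,  # 6
    0b0000111,  # 7
    0b1111111,  # 8
    0b1101111,  # 9
]

def segments_for(digit):
    # Use bitwise & to test each bit of the pattern and decide which
    # segment pins to drive.
    pattern = DIGIT_PATTERNS[digit]
    return [seg for bit, seg in enumerate(SEGMENTS) if pattern & (1 << bit)]

print(segments_for(4))  # ['b', 'c', 'f', 'g']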
You might also try using a 7-seg display driver chip to simplify your output and save pins. The MC14511BCP (a "4511") is a good one. It'll translate binary coded decimal (BCD) to the appropriate 7-seg configuration. Spec sheets are available here and they can be commonly found at electronics parts stores online.