I am confused about finding RAW dependencies: do we have to look only at adjacent instructions, or at non-adjacent ones as well?
Consider the following assembly code:
I1: ADD R1 , R2, R2;
I2: ADD R3, R2, R1;
I3: SUB R4, R1 , R5;
I4: ADD R3, R3, R4;
Find the number of read-after-write (RAW) dependencies in the above code.
Assume ADD x, y, z means x <- y + z.
I am getting 2 dependencies: I2-I1 and I4-I3.
Say that once an instruction enters the pipeline, it takes x stages before any register write by that instruction becomes visible to the instructions that follow.
Then you have to take care of the RAW dependencies within every set of x consecutive instructions. In the worst case, you can take x to be the maximum number of stages in the pipeline.
Now, the case in the question looks like a homework problem, and since the pipeline structure is not defined, you have to look at the RAW dependencies across all the instructions, which in this case are:
I2 and I1 over R1
I3 and I1 over R1
I4 and I2 over R3
I4 and I3 over R4
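The same count can be checked mechanically. The following is only a minimal Python sketch, not part of the original exercise: the function name raw_dependencies, the tuple layout (name, destination, sources), and the optional window parameter are all assumptions made for illustration. It pairs each source register with the most recent earlier instruction that wrote it, and window limits the search to that many consecutive instructions.

```python
def raw_dependencies(program, window=None):
    """Return (reader, writer, register) triples for RAW dependencies.

    program: list of (name, dest, sources) tuples (hypothetical layout).
    window:  if given, only report pairs that fall inside a run of
             `window` consecutive instructions (assumed pipeline depth).
    """
    last_writer = {}  # register -> (index, name) of the latest writer
    found = []
    for index, (name, dest, sources) in enumerate(program):
        # dict.fromkeys drops duplicate sources such as "R2, R2"
        for reg in dict.fromkeys(sources):
            if reg in last_writer:
                writer_index, writer_name = last_writer[reg]
                if window is None or index - writer_index < window:
                    found.append((name, writer_name, reg))
        last_writer[dest] = (index, name)
    return found


# The four instructions from the question, with ADD x, y, z = x <- y + z
program = [
    ("I1", "R1", ("R2", "R2")),
    ("I2", "R3", ("R2", "R1")),
    ("I3", "R4", ("R1", "R5")),
    ("I4", "R3", ("R3", "R4")),
]

all_deps = raw_dependencies(program)
adjacent_only = raw_dependencies(program, window=2)
print(all_deps)
print(adjacent_only)
```

With no window the sketch reports the four pairs listed above; with window=2 (adjacent instructions only) it reports just I2-I1 and I4-I3, which is the count of 2 from the question.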
I'm currently reading Programming in Lua Fourth Edition and I'm already stuck on the first exercise of "Chapter 2. Interlude: The Eight-Queen Puzzle."
The example code is as follows:
N = 8 -- board size
-- check whether position (n, c) is free from attacks
function isplaceok (a, n ,c)
for i = 1, n - 1 do -- for each queen already placed
if (a[i] == c) or -- same column?
(a[i] - i == c - n) or -- same diagonal?
(a[i] + i == c + n) then -- same diagonal?
return false -- place can be attacked
end
end
return true -- no attacks; place is OK
end
-- print a board
function printsolution (a)
for i = 1, N do -- for each row
for j = 1, N do -- and for each column
-- write "X" or "-" plus a space
io.write(a[i] == j and "X" or "-", " ")
end
io.write("\n")
end
io.write("\n")
end
-- add to board 'a' all queens from 'n' to 'N'
function addqueen (a, n)
if n > N then -- all queens have been placed?
printsolution(a)
else -- try to place n-th queen
for c = 1, N do
if isplaceok(a, n, c) then
a[n] = c -- place n-th queen at column 'c'
addqueen(a, n + 1)
end
end
end
end
-- run the program
addqueen({}, 1)
The code is well commented and the book is quite explicit, but I can't answer the first question:
Exercise 2.1: Modify the eight-queen program so that it stops after
printing the first solution.
At the end of this program, a contains all possible solutions; I can't figure out whether addqueen (a, n) should be modified so that a contains only one possible solution, or whether printsolution (a) should be modified so that it only prints the first possible solution.
Even though I'm not sure I fully understand backtracking, I tried to implement both hypotheses without success, so any help would be much appreciated.
"At the end of this program, a contains all possible solutions"
As far as I understand the solution, a never contains all possible solutions; at any moment it holds either one complete solution or one partial placement that the algorithm is still working on. The algorithm simply enumerates candidate placements, discarding those that produce a conflict as early as possible (for example, if the first and second queens attack each other, the second queen is moved without trying any positions for the remaining queens, as they could not lead to a solution anyway).
So, to stop after printing the first solution, you can simply add os.exit() after the printsolution(a) line.
Listing 1 is an alternative way to implement the requirement. The three lines commented with (1), (2), and (3) are the modifications to the original implementation in the book, as listed in the question. With these modifications, if the function returns true, a solution was found and a contains that solution.
-- Listing 1
function addqueen (a, n)
if n > N then -- all queens have been placed?
return true -- (1)
else -- try to place n-th queen
for c = 1, N do
if isplaceok(a, n, c) then
a[n] = c -- place n-th queen at column 'c'
if addqueen(a, n + 1) then return true end -- (2)
end
end
return false -- (3)
end
end
-- run the program
a = {1}
if not addqueen(a, 2) then print("failed") end
printsolution(a)
a = {1, 4}
if not addqueen(a, 3) then print("failed") end
printsolution(a)
Let me start from Exercise 2.2 in the book, which, based on my past experience explaining "backtracking" algorithms to other people, may help to better understand the original implementation and my modifications.
Exercise 2.2 asks you to generate all possible permutations first. A straightforward and intuitive solution is in Listing 2, which uses nested for-loops to generate all permutations and validates them one by one in the innermost loop. Although it fulfills the requirement of Exercise 2.2, the code does look awkward. It is also hard-coded to solve an 8x8 board.
-- Listing 2
local function allsolutions (a)
-- generate all possible permutations
for c1 = 1, N do
a[1] = c1
for c2 = 1, N do
a[2] = c2
for c3 = 1, N do
a[3] = c3
for c4 = 1, N do
a[4] = c4
for c5 = 1, N do
a[5] = c5
for c6 = 1, N do
a[6] = c6
for c7 = 1, N do
a[7] = c7
for c8 = 1, N do
a[8] = c8
-- validate the permutation
local valid
for r = 2, N do -- start from 2nd row
valid = isplaceok(a, r, a[r])
if not valid then break end
end
if valid then printsolution(a) end
end
end
end
end
end
end
end
end
end
-- run the program
allsolutions({})
Listing 3 is equivalent to Listing 2 when N = 8. The for-loop in the else-end block does what the whole set of nested for-loops in Listing 2 does. Using a recursive call makes the code not only compact but also flexible, i.e., it is capable of solving an NxN board and a board with pre-set rows. However, recursive calls can sometimes cause confusion. I hope the code in Listing 2 helps.
-- Listing 3
local function addqueen (a, n)
n = n or 1
if n > N then
-- verify the permutation
local valid
for r = 2, N do -- start from 2nd row
valid = isplaceok(a, r, a[r])
if not valid then break end
end
if valid then printsolution(a) end
else
-- generate all possible permutations
for c = 1, N do
a[n] = c
addqueen(a, n + 1)
end
end
end
-- run the program
addqueen({}) -- empty board, equivalent allsolutions({})
addqueen({1}, 2) -- a queen in 1st row and 1st column
Comparing the code in Listing 3 with the original implementation, the difference is that Listing 3 validates only after all eight queens are placed on the board, while the original implementation validates every time a queen is added and does not proceed to the next row if the newly added queen causes a conflict. This is what "backtracking" is all about: it is still a "brute-force" search, but it abandons a search branch as soon as it finds a node that cannot lead to a solution, and it must reach a leaf of the search tree before it can conclude that it has a valid solution.
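To make the comparison concrete, here is a small Python sketch (Python rather than Lua only so that it can be checked easily; the names is_place_ok, brute_force and backtracking are my own, and rows and columns are 0-based here). It counts solutions both ways on a 6x6 board, which is an arbitrary size chosen to keep the brute-force run short.

```python
from itertools import product


def is_place_ok(a, n, c):
    """True if a queen at row n, column c is not attacked by rows 0..n-1."""
    for i in range(n):
        if a[i] == c or a[i] - i == c - n or a[i] + i == c + n:
            return False
    return True


def brute_force(size):
    """Listing 3 style: build every full board, validate at the end."""
    count = 0
    for a in product(range(size), repeat=size):
        if all(is_place_ok(a, r, a[r]) for r in range(1, size)):
            count += 1
    return count


def backtracking(size):
    """Original style: validate each queen as it is placed."""
    a = [None] * size

    def place(n):
        if n == size:
            return 1  # reached a leaf: one complete, valid board
        total = 0
        for c in range(size):
            if is_place_ok(a, n, c):
                a[n] = c
                total += place(n + 1)
        return total

    return place(0)


print(brute_force(6), backtracking(6))
```

Both functions find the same solutions; the backtracking version simply never descends below a row whose queen is already attacked.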
Back to the modifications in Listing 1.
(1) When the function hits this point, it has reached a leaf of the search tree and a valid solution has been found, so let it return true to signal success.
(2) This is the point where the function stops searching further. In the original implementation, the for-loop continues regardless of what happened in the recursive call. With modification (1) in place, the recursive call returns true if a solution was found; in that case the function needs to stop and propagate the success signal back. Otherwise, it continues the for-loop, searching for other possible solutions.
(3) This is the point where the function returns after finishing the for-loop. With modifications (1) and (2) in place, reaching this point means that it failed to find a solution, so let it explicitly return false to signal failure.
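For readers who prefer to see the three modifications outside Lua, here is a rough Python transcription of Listing 1. It is a sketch only: the snake_case names and the 0-based rows and columns are my own choices, not the book's.

```python
N = 8  # board size


def is_place_ok(a, n, c):
    """True if (row n, column c) is free from attacks by rows 0..n-1."""
    for i in range(n):
        if a[i] == c or a[i] - i == c - n or a[i] + i == c + n:
            return False
    return True


def add_queen(a, n=0):
    """Place queens from row n onward; True as soon as one solution exists."""
    if n >= N:
        return True  # (1) leaf reached: a holds a complete solution
    for c in range(N):
        if is_place_ok(a, n, c):
            a[n] = c
            if add_queen(a, n + 1):
                return True  # (2) stop searching, propagate success
    return False  # (3) every column failed for this row


board = [None] * N
found = add_queen(board)
if found:
    for row in range(N):
        print(" ".join("X" if board[row] == col else "-" for col in range(N)))
else:
    print("failed")
```

Because success is passed back through the return value, the caller decides what to do with the solution, and the program does not need to be terminated from inside the recursion.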
We have a big graph database built with Neo4j that has two types of relationships, "E" and "I".
We would like to extract two graphs from it, starting from a node called n0.
The first graph, Gxi, is based on the "I" relationship and must be obtained randomly.
The following query is not valid Cypher, but it shows the idea we want to implement: at each step, 10 neighbors are chosen at random for each node reached in the previous step.
MATCH r1:(n0)-[:I]-(n1)
WITH random(n1) LIMIT 10
MATCH r2:(n1)-[:I]-(n2)
WITH random(n2) LIMIT 10*10
MATCH r3:(n2)-[:I]-(n3)
WITH random(n3) LIMIT 10*10*10
MATCH r4:(n3)-[:I]-(n4)
WITH random(n4) LIMIT 10*10*10*10
RETURN r1+r2+r3+r4
Then we would like to create the second graph Gxe based on the relationships "E" and the nodes of Gxi.
Thank you for your help.
APOC Procedures may be able to help here. There are collection functions that can be used to choose random items from a collection, and you can get slices of the collection rather than having to use LIMIT.
The trickier part will actually be collecting the subpaths along the way.
// assume already matched to start node n
MATCH r = (n)-[:I]-()
WITH apoc.coll.randomItems(collect(r), 10) as r1
UNWIND r1 as r
WITH r1, last(nodes(r)) as n
MATCH r = (n)-[:I]-()
WITH r1, apoc.coll.randomItems(collect(r), 10) as r2
UNWIND r2 as r
WITH r1, r2, last(nodes(r)) as n
MATCH r = (n)-[:I]-()
WITH r1, r2, apoc.coll.randomItems(collect(r), 10) as r3
UNWIND r3 as r
WITH r1, r2, r3, last(nodes(r)) as n
MATCH r = (n)-[:I]-()
WITH r1, r2, r3, apoc.coll.randomItems(collect(r), 10) as r4
RETURN r1 + r2 + r3 + r4
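Independently of Cypher, the sampling idea described in the question (up to 10 random neighbors for each node reached in the previous step, repeated for 4 steps) can be sketched in plain Python. This is only an illustration on an in-memory adjacency dictionary: the function name sample_levels, its parameters, and the toy graph are assumptions, and nothing here talks to Neo4j or APOC.

```python
import random


def sample_levels(adjacency, start, per_node=10, depth=4, rng=None):
    """Randomly expand from `start`, keeping at most `per_node`
    neighbours of every node reached in the previous step.

    adjacency: dict mapping a node to the list of its "I" neighbours
               (hypothetical in-memory stand-in for the database).
    Returns the list of (source, target) edges that were kept.
    """
    rng = rng or random.Random()
    frontier = [start]
    edges = []
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            neighbours = adjacency.get(node, [])
            chosen = rng.sample(neighbours, min(per_node, len(neighbours)))
            edges.extend((node, target) for target in chosen)
            next_frontier.extend(chosen)
        frontier = next_frontier
    return edges


# Toy undirected graph, purely for demonstration
adjacency = {
    "n0": ["a", "b", "c"],
    "a": ["n0", "d"],
    "b": ["n0", "d", "e"],
    "c": ["n0", "e"],
    "d": ["a", "b"],
    "e": ["b", "c"],
}
edges = sample_levels(adjacency, "n0", per_node=2, depth=2, rng=random.Random(0))
print(edges)
```

As with the undirected pattern in the query, a step may walk back to a node that was already visited; filtering those out would be an extra design decision.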
Hello I am very new to writing assembly and have a question regarding my attempt at writing a recursive function to compute the factorial of n.
Here is my attempt at writing the factorial function:
.global main
main:
MOV r1, #3
fact:
SUB sp, sp, #8
STR lr, [sp, #0]
STR r1, [sp,#4]
CMP r1, #1
BGT Else
ADD sp, sp, #8
MOV pc, lr
Else:
SUB r1, r1, #1
BL fact
MOV r2, r1
LDR r1, [sp, #4]
LDR lr, [sp, #0]
ADD sp, sp, #8
MUL r1, r2, r1
MOV pc, lr
MOV r0, #1
SWI 0x6b
SWI 0x11
The issue is this: I can successfully compute that 3 factorial is 6, and the result is stored in r1 at the end of the program; however, I can never get past the last "MOV pc, lr" statement on its third execution, and I cannot understand why.
When I reach MOV pc, lr for the third time, I get an error stating "PC out of valid memory range", but I am not sure why this is the case. Any pointers in the right direction would be greatly appreciated, because I am an absolute beginner and cannot understand why this error is occurring. Thank you for your time!
For example, let's say register 4 (R4) has a value 0001110010101111. How could you change bit 5 (0001110010 >1< 01111) to 0 (even if it was already 0) without moving or changing the other bits in a single hex instruction?
So 0001110010101111 -> 0001110010001111
You'll want to AND it. Since the immediate value for AND is 5 bits and it uses sign extension, you can only clear a bit if it's one of the four least significant bits. Otherwise, you will need to perform another instruction to load the mask into a register. I'll do an example of both.
In the case of the 5th bit, the number that will mask the bit is 0b1111111111011111. In decimal, this is #65503 or #-33. Since this is too big to fit in an immediate instruction, you won't be able to do it in a single instruction. You will need to declare it in the data segment of your program and load the mask into a register. Then, you can AND it with R4.
; assuming R4 = 0001110010101111
LD R5, MASK_5 ; load the mask into R5
AND R4, R4, R5 ; set R4 = R4 AND R5
; R4 will now have the bit cleared
; data segment
MASK_5 .FILL #65503
In the case of the 3rd bit, the number that will mask the bit is 0b1111111111110111. In decimal, this is #65527 or #-9. This will fit in the immediate value of AND, so you can perform it in a single instruction:
; assuming R4 = 0001110010101111
AND R4, R4, #-9 ; set R4 = R4 AND #-9
; R4 will now have the bit cleared
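The mask values above can be double-checked with a few lines of Python. This is just a verification sketch: the helper names are made up, and it assumes 16-bit words and a 5-bit sign-extended immediate (range -16 to 15) as described in this answer.

```python
WORD_MASK = 0xFFFF  # 16-bit machine word


def clear_mask(bit):
    """16-bit mask with every bit set except `bit`."""
    return ~(1 << bit) & WORD_MASK


def as_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement number."""
    return value - 0x10000 if value & 0x8000 else value


def fits_imm5(bit):
    """True if the mask fits a 5-bit sign-extended immediate (-16..15)."""
    return -16 <= as_signed16(clear_mask(bit)) <= 15


def clear_bit(word, bit):
    """Equivalent of AND word, word, mask."""
    return word & clear_mask(bit)


r4 = 0b0001110010101111
print(format(clear_bit(r4, 5), "016b"))            # bit 5 cleared
print(clear_mask(5), as_signed16(clear_mask(5)))   # unsigned and signed mask
print([bit for bit in range(16) if fits_imm5(bit)])  # bits clearable in one AND
```

The last line lists exactly the bit positions whose mask fits in the immediate, which is why bit 3 works in a single instruction and bit 5 does not.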
I'm rather new to assembly, and although the ARM Information Center is often helpful, the instruction descriptions can be a little confusing to a newbie. Basically, what I need to do is sum the 4 float values in a quadword register and store the result in a single-precision register. I think the VPADD instruction can do what I need, but I'm not quite sure.
You might try this (it's not in ASM, but you should be able to convert it easily):
float32x2_t r = vadd_f32(vget_high_f32(m_type), vget_low_f32(m_type));
return vget_lane_f32(vpadd_f32(r, r), 0);
In ASM it would probably be just VADD and VPADD.
I'm not sure whether this is the only (or the most efficient) way to do this, but I haven't found a better one...
PS. I'm new to NEON too
It seems that you want to sum an array of some length, not just four float values.
In that case, your code will work, but it is far from optimized:
many pipeline interlocks
an unnecessary 32-bit addition per iteration
Assuming the length of the array is a multiple of 8 and at least 16:
vldmia pSrc!, {q0-q1}
sub count, count, #8
loop:
pld [pSrc, #32]
vldmia pSrc!, {q3-q4}
subs count, count, #8
vadd.f32 q0, q0, q3
vadd.f32 q1, q1, q4
bgt loop
vadd.f32 q0, q0, q1
vpadd.f32 d0, d0, d1
vadd.f32 s0, s0, s1
pld, although it is an ARM instruction and not a NEON one, is crucial for performance: it prefetches the data that is about to be loaded, which can substantially increase the cache hit rate.
I hope the rest of the code above is self-explanatory.
You should find that this version is considerably faster than your initial one.
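If the register shuffling is hard to follow, the data flow of the loop above can be modeled in Python. This is only a model of which lanes get added to which, not a performance claim: the function name is made up, Python lists stand in for the q registers, and it assumes the same precondition (length a multiple of 8 and at least 16).

```python
def sum_two_accumulators(values):
    """Model of the loop above: two 4-lane accumulators, then a
    horizontal reduction. `values` must have a length that is a
    multiple of 8 and at least 16."""
    assert len(values) % 8 == 0 and len(values) >= 16
    q0 = list(values[0:4])  # first vldmia fills q0 and q1
    q1 = list(values[4:8])
    for base in range(8, len(values), 8):
        q3 = values[base:base + 4]       # vldmia inside the loop
        q4 = values[base + 4:base + 8]
        q0 = [x + y for x, y in zip(q0, q3)]  # vadd.f32 q0, q0, q3
        q1 = [x + y for x, y in zip(q1, q4)]  # vadd.f32 q1, q1, q4
    q0 = [x + y for x, y in zip(q0, q1)]      # vadd.f32 q0, q0, q1
    d0 = [q0[0] + q0[1], q0[2] + q0[3]]       # vpadd.f32 d0, d0, d1
    return d0[0] + d0[1]                      # vadd.f32 s0, s0, s1


print(sum_two_accumulators([float(i) for i in range(16)]))
```

Keeping two independent accumulators is what lets consecutive additions proceed without waiting on each other. Note that the lanes are summed in a different order than a simple left-to-right loop, so with real floating-point data the last bits of the result may differ.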
Here is the code in ASM:
vpadd.f32 d1,d6,d7 @ q3 is the register that needs all of its contents summed
vadd.f32 s1,s2,s3 @ now we add the two halves of d1 together (the sum)
vadd.f32 s0,s0,s1 @ sum += s1;
I may have forgotten to mention that in C the code would look like this:
float sum = 1.0f;
sum += number1 * number2;
I have omitted the multiplication from this little piece of asm code.