I'm working on the Verilog arithmetic project and I got stuck on the sign extension part (assuming this is the problem). I have 4-bit inputs A and B and should have an 8-bit output. For some of the operations (sum, sub, ...) I need to sign-extend to make the 8-bit output. This is the body of the arithmetic module. It's only half of the code; I didn't include the other half because it's just long.
module arithmetic(A, B, AN0, DP, sum, sub, mult, div, comp, shiftLeft,
shiftRight, signExtend);
input signed [3:0] A, B;
output [7:0] sum, sub, mult, div, comp, shiftLeft, shiftRight,
signExtend;
output AN0, DP;
//sum
reg [4:0] qsum;
always @(A, B)
qsum = A+B;
assign sum = {{3{qsum[4]}},qsum};
//sub
reg [4:0] qsub;
always @(A, B)
qsub = A-B;
assign sub = {{3{qsub[4]}},qsub};
//mult
reg [7:0] qmult;
always @(A, B)
qmult = A * B;
assign mult = qmult;
When I checked my simulation, it doesn't show any values except Zs and Xs. It doesn't even show any input values. Why is that happening?? Thank you.
This is my testbench code. There are 8 operations (sum, subtract, multiply, divide, compare, shift left, shift right, sign extension).
module lap3_top_tb();
reg signed [3:0] A, B;
reg [2:0] Operation;
wire [7:0] Result;
wire DP, AN0;
lab3_top ulap3_top(
.A(A),
.B(B),
.Operation(Operation),
.Result(Result),
.DP(DP),
.AN0(AN0)
);
initial begin
A = 6; B = 7; Operation = 0;
#20;
A = -6; B = -7; Operation = 0;
#20;
A = 6; B = 7; Operation = 1;
#20;
A = -6; B = -7; Operation = 1;
#20;
A = 6; B = 7; Operation = 2;
#20;
A = -6; B = 7; Operation = 2;
#20;
A = 7; B = 4; Operation = 3;
#20;
A = 7; B = 0; Operation = 3;
#20;
A = 6; B = 7; Operation = 4;
#20;
A = -6; B = -7; Operation = 4;
#20;
A = 1; B = 6; Operation = 5;
#20;
A = 1; B = -6; Operation = 5;
#20;
A = 1; B = 6; Operation = 6;
#20;
A = 1; B = -6; Operation = 6;
#20;
A = 6; B = 0; Operation = 7;
#20;
A = -5; B = 0; Operation = 7;
#20;
end
endmodule
The lap3_top file is here. (mux_8_1 picks one of the outputs and routes it out through Result. If you need that code, let me know, but I think the mux works fine.)
module lap3_top(A, B, Operation, Result, AN0, DP);
input signed [3:0] A, B;
input [2:0] Operation;
output AN0, DP;
output [7:0] Result;
wire a, b, c, d, e, f, g, h;
arithmetic uarithmetic(
.A(A),
.B(B),
.AN0(AN0),
.DP(DP),
.sum(a),
.sub(b),
.mult(c),
.div(d),
.comp(e),
.shiftLeft(f),
.shiftRight(g),
.signExtend(h)
);
mux_8_1 umux8_1(
.A(a),
.B(b),
.C(c),
.D(d),
.E(e),
.F(f),
.G(g),
.H(h),
.Operation(Operation),
.Result(Result)
);
endmodule
thank you so much guys!
I tried to simulate your code and found the following mistake: when you instantiate the top module in the testbench you write lab3_top ulap3_top(...); — that is, the module name lab3_top — but the module you actually want to instantiate is declared as module lap3_top(...), i.e. lap3_top.
I changed the name and everything works well (on the waveform you can still see a ZZ state, because I don't have your mux_8_1 module and a few operations have no description).
P.S. By the way, I suppose you use Vivado, since you added that tag. If so, here is a hint for catching errors like this (a mismatch between the module name and the instantiation, or a module with errors that couldn't be compiled into the library): if you expand all your modules in the hierarchy, you will see a ? sign on the module that contains the error.
Related
Code a method called calcSeries that calculates and returns the value of y in the following series:
y = 1 + Σ_{i=1..n} i^2/(i*x)
where n and x are two input integers and y is the returned double value.
Example:
Input: x=8, n=4
So, the series is y = 1 + 1^2/(1*8) + 2^2/(2*8) + 3^2/(3*8) + 4^2/(4*8)
Output: 2.25
I am having trouble figuring out how to design the code. So far I have:
public double calcSeries(int n, int x){
double y = 0.0;
int a = 1;
double b = Math.pow(n,2)/(n*x);
for (int i = 1; i < (n + 1); i++) {
}
return y;
}
You've already written most of it. Just move the declaration of b inside the loop, since its value depends on the current i, and start y at 1.0 to account for the leading 1 in the series.
public double calcSeries(int n, int x){
double y = 1.0;
for (int i = 1; i < (n + 1); i++) {
double b = Math.pow(i,2)/(i*x);
y += b;
}
return y;
}
#include <stdio.h>

int main() {
const int i = 1;
const int* p = &i;
int j = 2;
const int* q = &j;
j = 3;
printf("%d", *p + *q);
return 0;
}
I have this code, and I am trying to understand how it compiles. p and q are pointers to constant integers, but j isn't declared as constant. Moreover, j changes to 3.
How does it work?
Thanks!
On the 5th line, you're assigning the address of the variable j to q. This does not enforce any constraint on j, just on the pointer q: through q the compiler won't allow you to change the pointed-to value, but j itself remains writable, so the line j = 3; is legal.
See What is the difference between const int*, const int * const, and int const *?
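To make that concrete, here is a small self-contained sketch (the variable names just mirror the snippet above): the const applies to accesses made through the pointer, not to the object it points at.

#include <stdio.h>

int main(void) {
    int j = 2;
    const int *q = &j;   /* "pointer to const int": no writes allowed through q */

    j = 3;               /* fine: j itself is not const */
    /* *q = 4; */        /* compile error if uncommented: read-only through q */

    printf("%d\n", *q);  /* prints 3: q still sees the updated value of j */
    return 0;
}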
I am trying to run this fast Fourier transform implementation. It compiles fine but gives the error below at runtime. I have no idea what the error means. Can anyone help me out?
I compiled and ran the program with:
mpicc -o exec test.c
./exec
CODE:
This is code that I found on GitHub. It's a parallel version of the fast Fourier algorithm.
#include <stdio.h>
#include <mpi.h> //To use MPI
#include <complex.h> //to use complex numbers
#include <math.h> //for cos() and sin()
#include "timer.h" //to use timer
#define PI 3.14159265
#define bigN 16384 //Problem Size
#define howmanytimesavg 3
int main()
{
int my_rank,comm_sz;
MPI_Init(NULL,NULL); //start MPI
MPI_Comm_size(MPI_COMM_WORLD,&comm_sz); //how many processes are we using?
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank); //which process is this?
double start,finish;
double avgtime = 0;
FILE *outfile;
int h;
if(my_rank == 0) //if process 0 open outfile
{
outfile = fopen("ParallelVersionOutput.txt", "w"); //open from current directory
}
for(h = 0; h < howmanytimesavg; h++) //loop to run multiple times for AVG time
{
if(my_rank == 0) //If it's process 0 starts timer
{
start = MPI_Wtime();
}
int i,k,n,j; //Basic loop variables
double complex evenpart[(bigN / comm_sz / 2)]; //array to save the data for EVENHALF
double complex oddpart[(bigN / comm_sz / 2)]; //array to save the data for ODDHALF
double complex evenpartmaster[ (bigN / comm_sz / 2) * comm_sz]; //array to save the data for EVENHALF
double complex oddpartmaster[ (bigN / comm_sz / 2) * comm_sz]; //array to save the data for ODDHALF
double storeKsumreal[bigN]; //store the K real variable so we can abuse symmetry
double storeKsumimag[bigN]; //store the K imaginary variable so we can abuse symmetry
double subtable[(bigN / comm_sz)][3]; //Each process owns a subtable from the table below
double table[bigN][3] = //TABLE of numbers to use
{
0,3.6,2.6, //n, Real,Imaginary CREATES TABLE
1,2.9,6.3,
2,5.6,4.0,
3,4.8,9.1,
4,3.3,0.4,
5,5.9,4.8,
6,5.0,2.6,
7,4.3,4.1,
};
if(bigN > 8) //Everything after row 8 is all 0's
{
for(i = 8; i < bigN; i++)
{
table[i][0] = i;
for(j = 1; j < 3;j++)
{
table[i][j] = 0.0; //set to 0.0
}
}
}
int sendandrecvct = (bigN / comm_sz) * 3; //how much to send and receive?
MPI_Scatter(table,sendandrecvct,MPI_DOUBLE,subtable,sendandrecvct,MPI_DOUBLE,0,MPI_COMM_WORLD); //scatter the table to subtables
for (k = 0; k < bigN / 2; k++) //K coefficient loop
{
/* Variables used for the computation */
double sumrealeven = 0.0; //sum of real numbers for even
double sumimageven = 0.0; //sum of imaginary numbers for even
double sumrealodd = 0.0; //sum of real numbers for odd
double sumimagodd = 0.0; //sum of imaginary numbers for odd
for(i = 0; i < (bigN/comm_sz)/2; i++) //Sigma loop EVEN and ODD
{
double factoreven , factorodd = 0.0;
int shiftevenonnonzeroP = my_rank * subtable[2*i][0]; //used to shift index numbers for correct results for EVEN.
int shiftoddonnonzeroP = my_rank * subtable[2*i + 1][0]; //used to shift index numbers for correct results for ODD.
/* -------- EVEN PART -------- */
double realeven = subtable[2*i][1]; //Access table for real number at spot 2i
double complex imaginaryeven = subtable[2*i][2]; //Access table for imaginary number at spot 2i
double complex componeeven = (realeven + imaginaryeven * I); //Create the first component from table
if(my_rank == 0) //if proc 0, dont use shiftevenonnonzeroP
{
factoreven = ((2*PI)*((2*i)*k))/bigN; //Calculates the even factor for Cos() and Sin()
// *********Reduces computational time*********
}
else //use shiftevenonnonzeroP
{
factoreven = ((2*PI)*((shiftevenonnonzeroP)*k))/bigN; //Calculates the even factor for Cos() and Sin()
// *********Reduces computational time*********
}
double complex comptwoeven = (cos(factoreven) - (sin(factoreven)*I)); //Create the second component
evenpart[i] = (componeeven * comptwoeven); //store in the evenpart array
/* -------- ODD PART -------- */
double realodd = subtable[2*i + 1][1]; //Access table for real number at spot 2i+1
double complex imaginaryodd = subtable[2*i + 1][2]; //Access table for imaginary number at spot 2i+1
double complex componeodd = (realodd + imaginaryodd * I); //Create the first component from table
if (my_rank == 0)//if proc 0, dont use shiftoddonnonzeroP
{
factorodd = ((2*PI)*((2*i+1)*k))/bigN;//Calculates the odd factor for Cos() and Sin()
// *********Reduces computational time*********
}
else //use shiftoddonnonzeroP
{
factorodd = ((2*PI)*((shiftoddonnonzeroP)*k))/bigN;//Calculates the odd factor for Cos() and Sin()
// *********Reduces computational time*********
}
double complex comptwoodd = (cos(factorodd) - (sin(factorodd)*I));//Create the second component
oddpart[i] = (componeodd * comptwoodd); //store in the oddpart array
}
/*Process ZERO gathers the even and odd part arrays and creates a evenpartmaster and oddpartmaster array*/
MPI_Gather(evenpart,(bigN / comm_sz / 2),MPI_DOUBLE_COMPLEX,evenpartmaster,(bigN / comm_sz / 2), MPI_DOUBLE_COMPLEX,0,MPI_COMM_WORLD);
MPI_Gather(oddpart,(bigN / comm_sz / 2),MPI_DOUBLE_COMPLEX,oddpartmaster,(bigN / comm_sz / 2), MPI_DOUBLE_COMPLEX,0,MPI_COMM_WORLD);
if(my_rank == 0)
{
for(i = 0; i < (bigN / comm_sz / 2) * comm_sz; i++) //loop to sum the EVEN and ODD parts
{
sumrealeven += creal(evenpartmaster[i]); //sums the realpart of the even half
sumimageven += cimag(evenpartmaster[i]); //sums the imaginarypart of the even half
sumrealodd += creal(oddpartmaster[i]); //sums the realpart of the odd half
sumimagodd += cimag(oddpartmaster[i]); //sums the imaginary part of the odd half
}
storeKsumreal[k] = sumrealeven + sumrealodd; //add the calculated reals from even and odd
storeKsumimag[k] = sumimageven + sumimagodd; //add the calculated imaginary from even and odd
storeKsumreal[k + bigN/2] = sumrealeven - sumrealodd; //ABUSE symmetry Xkreal + N/2 = Evenk - OddK
storeKsumimag[k + bigN/2] = sumimageven - sumimagodd; //ABUSE symmetry Xkimag + N/2 = Evenk - OddK
if(k <= 10) //Do the first 10 K's
{
if(k == 0)
{
fprintf(outfile," \n\n TOTAL PROCESSED SAMPLES : %d\n",bigN);
}
fprintf(outfile,"================================\n");
fprintf(outfile,"XR[%d]: %.4f XI[%d]: %.4f \n",k,storeKsumreal[k],k,storeKsumimag[k]);
fprintf(outfile,"================================\n");
}
}
}
if(my_rank == 0)
{
GET_TIME(finish); //stop timer
double timeElapsed = finish-start; //Time for that iteration
avgtime = avgtime + timeElapsed; //AVG the time
fprintf(outfile,"Time Elaspsed on Iteration %d: %f Seconds\n", (h+1),timeElapsed);
}
}
if(my_rank == 0)
{
avgtime = avgtime / howmanytimesavg; //get avg time
fprintf(outfile,"\nAverage Time Elaspsed: %f Seconds", avgtime);
fclose(outfile); //CLOSE file ONLY proc 0 can.
}
MPI_Barrier(MPI_COMM_WORLD); //wait to all proccesses to catch up before finalize
MPI_Finalize(); //End MPI
return 0;
}
ERROR:
Fatal error in PMPI_Gather: Invalid datatype, error stack:
PMPI_Gather(904): MPI_Gather(sbuf=0x7fffb62799a0, scount=8192,
MPI_DATATYPE_NULL, rbuf=0x7fffb6239980, rcount=8192, MPI_DATATYPE_NULL,
root=0, MPI_COMM_WORLD) failed
PMPI_Gather(815): Datatype for argument sendtype is a null datatype
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=537490947
:
system msg for write_line failure : Bad file descriptor
There is no MPI_DATATYPE_NULL in your code; you only use MPI_DOUBLE_COMPLEX. Note that the latter is a Fortran datatype, and strictly speaking, using it in C is not correct.
My guess is that MPI_DOUBLE_COMPLEX is causing the issue (the type is not defined, or not initialized, because you invoked the C version of MPI_Init()).
You can obviously rewrite your code in Fortran, or use your own derived datatype for a C double complex number.
Meanwhile, I suggest you write simple C and Fortran hello-world programs that use MPI_DOUBLE_COMPLEX (an MPI_Bcast() of one element, for example) to confirm whether the issue is with MPI_DOUBLE_COMPLEX and whether it is restricted to C.
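As a rough sketch of the derived-datatype route (untested against your MPI installation; the name dbl_complex is mine, and if your library implements MPI-2.2 the predefined MPI_C_DOUBLE_COMPLEX may also work directly): a C double complex is laid out as two consecutive doubles, so a contiguous type of two MPI_DOUBLEs matches it.

#include <mpi.h>
#include <complex.h>

int main(void)
{
    MPI_Init(NULL, NULL);

    /* a C double complex is two consecutive doubles (real, imaginary),
       so a contiguous type of two MPI_DOUBLEs matches its layout */
    MPI_Datatype dbl_complex;
    MPI_Type_contiguous(2, MPI_DOUBLE, &dbl_complex);
    MPI_Type_commit(&dbl_complex);

    /* one-element broadcast as a smoke test, as suggested above */
    double complex z = 1.0 + 2.0 * I;
    MPI_Bcast(&z, 1, dbl_complex, 0, MPI_COMM_WORLD);

    /* in your program you would pass dbl_complex to MPI_Gather()
       in place of MPI_DOUBLE_COMPLEX */

    MPI_Type_free(&dbl_complex);
    MPI_Finalize();
    return 0;
}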
I am looking for a simple example where using vectorization and parallelization on Xeon Phi gives better performance than a Xeon alone. Could you help me, please?
I am trying with the following example. I comment out lines 14, 18 and 19 to run on the Xeon only and uncomment them for the Xeon Phi, but the Xeon alone still has better performance than the Xeon Phi.
1.void main(){
2.double *a, *b, *c;
3.int i,j,k, ok, n=100;
4.int nPadded = ( n%8 == 0 ? n : n + (8-n%8) );
5.ok = posix_memalign((void**)&a, 64, n*nPadded*sizeof(double));
6.ok = posix_memalign((void**)&b, 64, n*nPadded*sizeof(double));
7.ok = posix_memalign((void**)&c, 64, n*nPadded*sizeof(double));
8.for(i=0; i<n; i++)
9.{
10. a[i] = (int) rand();
11. b[i] = (int) rand();
12. c[i] = 0.0;
13.}
14.#pragma offload target(mic) in(a,b:length(n*nPadded)) inout(c:length(n*nPadded))
15.#pragma omp parallel for
16.for( i = 0; i < n; i++ )
17. for( k = 0; k < n; k++ )
18. #pragma vector aligned
19. #pragma ivdep
20. for( j = 0; j < n; j++ ){
21. c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];
22.}
First, a couple of words about autovectorization. The advantage of autovectorization is simplicity: you set some keywords, magic happens, and the compiler makes fast code for you. If you want to go this way, try this manual.
The disadvantage of this approach is that there is no easy way to understand how the compiler did its work. In the vectorization report you will only see "LOOP WAS VECTORIZED" or "LOOP WAS NOT VECTORIZED". If you want to truly understand how your code works, the only way is to look at your program's assembly. Getting the assembly is not a problem: compile the program with -fcode-asm. But I think that if you need to read assembly just to check how the "simple autovectorization" method worked, it is not so simple.
The alternative to autovectorization is intrinsics (actually, it is not the only alternative). Think of intrinsics as assembly wrapped in C functions; many intrinsics wrap a single assembly instruction.
I recommend using this intrinsics guide.
So my simple approach, step by step:
Make a single-threaded reference implementation. You will use it to check the correctness of the intrinsics version.
Implement an SSE intrinsics version. SSE intrinsics are much simpler and can be tested on the Xeon.
Implement an AVX-512 version for the Xeon Phi.
Measure your speed.
Let's do it with your program.
There are several differences from your program:
I use float instead of double.
I use _mm_malloc instead of posix_memalign.
I assume n is divisible by 16 with no remainder (16 floats fit in an AVX-512 vector register); I don't deal with loop peeling in this example.
I use native mode instead of offload mode. KNL is bootable, so it is no longer necessary to use offload mode.
Also, I think your program is not correct, because it modifies the c array from several threads at the same time. But let's say that is not important and we just need some computational work.
My code's run times:
Intel Xeon 5680
reference calc time: 97.677505 seconds
Intrinsics calc time: 6.189296 seconds
Intel Xeon Phi (KNC) SE10X
reference calc time: 199.0 seconds
Intrinsics calc time: 2.78 seconds
Code:
#include <stdio.h>
#include <omp.h>
#include <math.h>
#include "immintrin.h"
#include <assert.h>
#define F_E_Q(X,Y,N) (round((X) * pow(10, N)-(Y) * pow(10, N)) == 0)
void reference(float* a, float* b, float* c, int n, int nPadded);
void intrinsics(float* a, float* b, float* c, int n, int nPadded);
char *test(){
int n=4800;
int nPadded = n;
assert(n%16 == 0);
float* a = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
float* b = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
float* cRef = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
float* c = (float*) _mm_malloc(sizeof(float)*n*nPadded, 64);
assert(a != NULL);
assert(b != NULL);
assert(cRef != NULL);
assert(c != NULL);
for(int i=0, max = n*nPadded; i<max; i++){
a[i] = (int) rand() / 1804289408.0;
b[i] = (int) rand() / 1804289408.0;
cRef[i] = 0.0;
c[i] = 0.0;
}
debug_arr("a", "%f", a, 0, 9, 1);
debug_arr("b", "%f", b, 0, 9, 1);
debug_arr("cRef", "%f", cRef, 0, 9, 1);
debug_arr("c", "%f", c, 0, 9, 1);
double t1 = omp_get_wtime();
reference(a, b, cRef, n, nPadded);
double t2 = omp_get_wtime();
debug("reference calc time: %f", t2-t1);
t1 = omp_get_wtime();
intrinsics(a, b, c, n, nPadded);
t2 = omp_get_wtime();
debug("Intrinsics calc time: %f", t2-t1);
debug_arr("cRef", "%f", cRef, 0, 9, 1);
debug_arr("c", "%f", c, 0, 9, 1);
for(int i=0, max = n*nPadded; i<max; i++){
assert(F_E_Q(cRef[i], c[i], 2));
}
_mm_free(a);
_mm_free(b);
_mm_free(cRef);
_mm_free(c);
return NULL;
}
void reference(float* a, float* b, float* c, int n, int nPadded){
for(int i = 0; i < n; i++ )
for(int k = 0; k < n; k++ )
for(int j = 0; j < n; j++ )
c[i*nPadded+j] = c[i*nPadded+j] + a[i*nPadded+k]*b[k*nPadded+j];
}
#if __MIC__
void intrinsics(float* a, float* b, float* c, int n, int nPadded){
#pragma omp parallel for
for(int i = 0; i < n; i++ )
for(int k = 0; k < n; k++ )
for(int j = 0; j < n; j+=16 ){
__m512 aPart = _mm512_extload_ps(a + i*nPadded+k, _MM_UPCONV_PS_NONE, _MM_BROADCAST_1X16, _MM_HINT_NONE);
__m512 bPart = _mm512_load_ps(b + k*nPadded+j);
__m512 cPart = _mm512_load_ps(c + i*nPadded+j);
cPart = _mm512_add_ps(cPart, _mm512_mul_ps(aPart, bPart));
_mm512_store_ps(c + i*nPadded+j, cPart);
}
}
#else
void intrinsics(float* a, float* b, float* c, int n, int nPadded){
#pragma omp parallel for
for(int i = 0; i < n; i++ )
for(int k = 0; k < n; k++ )
for(int j = 0; j < n; j+=4 ){
__m128 aPart = _mm_load_ps1(a + i*nPadded+k);
__m128 bPart = _mm_load_ps(b + k*nPadded+j);
__m128 cPart = _mm_load_ps(c + i*nPadded+j);
cPart = _mm_add_ps(cPart, _mm_mul_ps(aPart, bPart));
_mm_store_ps(c + i*nPadded+j, cPart);
}
}
#endif
I have int A, B, C. A is in the range 0-9999, B is 0-99, and C is 0-99.
Because the function must return only one double, I am thinking of packing them all into one number; otherwise I would need to call the function three times.
But I cannot write efficient code to do this. It will be called millions of times, so it should be quite efficient, but no ASM.
I need a function double pack3int_to_double(int A, int B, int C) {}
Couldn't you just store A + 10000*B + 1000000*C?
For example, if you wanted to store A = 1234, B = 6, and C = 89, you'd just store
89061234
CCBBAAAA
You can then extract the numbers by casting the double to an int and using standard integer division and modulus tricks to recover the individual values.
Hope this helps!
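A minimal sketch of that decimal scheme, using the function name asked for in the question (unpack3int_from_double is a name I made up for the reverse direction); the round trip through double is exact because values of this size are far below 2^53.

#include <stdio.h>

double pack3int_to_double(int A, int B, int C) {
    /* A occupies the low 4 decimal digits, B the next 2, C the top 2 */
    return (double)(A + 10000 * B + 1000000 * C);
}

void unpack3int_from_double(double packed, int *A, int *B, int *C) {
    int v = (int)packed;
    *A = v % 10000;
    *B = (v / 10000) % 100;
    *C = v / 1000000;
}

int main(void) {
    int A, B, C;
    double p = pack3int_to_double(1234, 6, 89);   /* 89061234.0 */
    unpack3int_from_double(p, &A, &B, &C);
    printf("%d %d %d\n", A, B, C);                /* 1234 6 89 */
    return 0;
}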
If A < 10,000 and B and C < 100, A can be expressed with 14 bits, and B and C with 8 bits each. Thus you need 30 bits in total.
You could therefore pack/unpack the integers by shifting each one into place (note that << binds less tightly than +, so the shifts must be parenthesized; OR-ing the fields together is the natural way to combine them):
int packed = A | (B << 14) | (C << 22);
A = packed & 0x3FFF; B = (packed >> 14) & 0xFF; C = (packed >> 22) & 0xFF;
Bit shifting is of course MUCH faster than multiply/divide, and you can cast the int to a double and vice versa.
This is technically not legal C code, so you would use this at your own risk:
typedef union {
double x;
struct {
unsigned a : 14;
unsigned b : 7;
unsigned c : 7;
} y;
} result_t;
The C standard doesn't allow using a union member to write a value and a different one to read it out, but I am not aware of a compiler that does the static analysis to diagnose such a problem (it doesn't mean one won't do so in the future). Also, using certain int values may result in a trap representation for a double. But, if you know your system will not generate any trap representations, you can consider using this.
double pack3int_to_double(int A, int B, int C) {
result_t r;
r.y.a = A;
r.y.b = B;
r.y.c = C;
return r.x;
}
void unpack3int_from_double (double X, int *A, int *B, int *C) {
result_t r = { X };
*A = r.y.a;
*B = r.y.b;
*C = r.y.c;
}
You can use out parameters in the function call and retrieve all 3 int variables in one call, as in the sketch below.
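A minimal sketch of the out-parameter idea (get3int and its values are made up for illustration): nothing is packed, the function simply fills three ints through pointers.

#include <stdio.h>

/* hypothetical replacement for pack3int_to_double: no packing needed */
void get3int(int *A, int *B, int *C) {
    *A = 1234;   /* whatever values would otherwise have been packed */
    *B = 6;
    *C = 89;
}

int main(void) {
    int A, B, C;
    get3int(&A, &B, &C);
    printf("%d %d %d\n", A, B, C);   /* 1234 6 89 */
    return 0;
}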
You could return a NaN double with the data stored in the mantissa. That gives you about 51 usable payload bits. Should be plenty.
http://en.m.wikipedia.org/wiki/NaN
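A rough sketch of that idea, assuming IEEE-754 doubles and a memcpy-based type pun (the helper names and the reuse of the 14/8/8 bit layout from the shifting answer are my choices, not the poster's). Note that some platforms do not preserve NaN payloads once the value passes through floating-point arithmetic, so this only suits pure storage and retrieval.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* store a payload (at most 51 bits here) in the mantissa of a quiet NaN */
static double pack_nan(uint64_t payload) {
    uint64_t bits = 0x7FF8000000000000ULL            /* exponent all ones + quiet bit */
                  | (payload & 0x0007FFFFFFFFFFFFULL);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

static uint64_t unpack_nan(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits & 0x0007FFFFFFFFFFFFULL;
}

int main(void) {
    int A = 1234, B = 6, C = 89;
    double d = pack_nan((uint64_t)A | ((uint64_t)B << 14) | ((uint64_t)C << 22));

    uint64_t v = unpack_nan(d);
    printf("%d %d %d\n", (int)(v & 0x3FFF), (int)((v >> 14) & 0xFF), (int)((v >> 22) & 0xFF));
    return 0;
}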
Inspired by your answers, this is what I have come up with so far. It should be quite efficient, and only 32 bits are used, so the exponent of the double is not touched.
#include <stdio.h>

struct pack_abc {
unsigned short a;
unsigned char b, c;
int safety;
};
double pack3int_to_double(int A, int B, int C) {
struct pack_abc R = {A, B, C, 0}; // or 0 could be replaced with something smarter, like NaN?
return *(double*)&R;
}
int main() {
int w = 1234, a = 56, d = 78;
int W, A, D, i;
double p = pack3int_to_double(w, a, d);
// we got the data packed into 'p', now let's unpack it
struct pack_abc *R = (struct pack_abc*) & p;
printf("%i %i %i\n", (int)R->a, (int)R->b, (int)R->c);
}