Multiply multiple __m128 with single entry of __m256 - intel

I have 8 __m128 registers, and each one needs to be multiplied by a single element of a __m256 register.
One solution that comes to mind is:
INPUT: __m128 a[8]; __m256 b;
__m128 tmp = _mm256_extractf128_ps(b,0);
a[0] = _mm_mul_ps(a[0],_mm_shuffle_ps(tmp,tmp,0));
a[1] = _mm_mul_ps(a[1],_mm_shuffle_ps(tmp,tmp,0x55));
a[2] = _mm_mul_ps(a[2],_mm_shuffle_ps(tmp,tmp,0xAA));
a[3] = _mm_mul_ps(a[3],_mm_shuffle_ps(tmp,tmp,0xFF));
tmp = _mm256_extractf128_ps(b,1);
a[4] = _mm_mul_ps(a[4],_mm_shuffle_ps(tmp,tmp,0));
a[5] = _mm_mul_ps(a[5],_mm_shuffle_ps(tmp,tmp,0x55));
a[6] = _mm_mul_ps(a[6],_mm_shuffle_ps(tmp,tmp,0xAA));
a[7] = _mm_mul_ps(a[7],_mm_shuffle_ps(tmp,tmp,0xFF));
What would be the best way to achieve this? Thank you.

I think your solution is about as good as it's going to get, except that I would use explicit variables rather than an array, so that everything stays in registers as far as possible:
__m128 a0, a1, a2, a3, a4, a5, a6, a7;
__m256 b;
__m128 tmp = _mm256_extractf128_ps(b,0);
a0 = _mm_mul_ps(a0, _mm_shuffle_ps(tmp,tmp,0));
a1 = _mm_mul_ps(a1, _mm_shuffle_ps(tmp,tmp,0x55));
a2 = _mm_mul_ps(a2, _mm_shuffle_ps(tmp,tmp,0xAA));
a3 = _mm_mul_ps(a3, _mm_shuffle_ps(tmp,tmp,0xFF));
tmp = _mm256_extractf128_ps(b,1);
a4 = _mm_mul_ps(a4, _mm_shuffle_ps(tmp,tmp,0));
a5 = _mm_mul_ps(a5, _mm_shuffle_ps(tmp,tmp,0x55));
a6 = _mm_mul_ps(a6, _mm_shuffle_ps(tmp,tmp,0xAA));
a7 = _mm_mul_ps(a7, _mm_shuffle_ps(tmp,tmp,0xFF));


What are the different versions of arithmetic swap and why do they work?

I think we should all be familiar with the arithmetic swap algorithm, which swaps two variables without using a third variable. I have now found out that there are two variations of the arithmetic swap. Please consider the following:
Variation 1.
int a = 2;
int b = 3;
a = a + b;
b = a - b;
a = a - b;
Variation 2.
int a = 2;
int b = 3;
b = b - a;
a = a + b;
b = a - b;
I want to know why there are two distinct variations of the arithmetic swap and why they work. Are there other variations that achieve the same result? How are they related? Is there an elegant mathematical formula that justifies why the arithmetic swap works, for all variations? Is there some underlying truth that connects these two variations?
Break each variable out as what it represents.
Variation 1:
a = 2
b = 3
a1 = a + b
b1 = a1 - b = (a + b) - b = a
a2 = a1 - b1 = (a + b) - a = b
Variation 2:
a = 2
b = 3
b1 = b - a
a1 = a + b1 = a + (b - a) = b
b2 = a1 - b1 = b - (b - a) = a
There's no underlying truth other than the fact that the math works out. Remember that each time you do an assignment, it's effectively a new "variable" from the math side.
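The two traces above can be written as a minimal C sketch. The function names swap_add_first and swap_sub_first are made up for illustration, and the sketch assumes unsigned operands so that any intermediate wraparound is well defined (with signed int, a + b could overflow).

```c
/* Variation 1: add first, then subtract twice. */
void swap_add_first(unsigned *a, unsigned *b)
{
    *a = *a + *b;   /* a1 = a + b              */
    *b = *a - *b;   /* b1 = a1 - b  = a        */
    *a = *a - *b;   /* a2 = a1 - b1 = b        */
}

/* Variation 2: subtract first, then add, then subtract. */
void swap_sub_first(unsigned *a, unsigned *b)
{
    *b = *b - *a;   /* b1 = b - a              */
    *a = *a + *b;   /* a1 = a + b1  = b        */
    *b = *a - *b;   /* b2 = a1 - b1 = a        */
}
```

Both functions assume a and b point to different objects; if they point to the same one, the value ends up as 0 instead of being left unchanged.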

convert an R script to IDL: Array manipulation

I am an R and IDL beginner. I'm trying to convert an R script to IDL.
R can do the array manipulation with t1 (array[100000]), but IDL cannot.
ERROR: Array subscript for CZ must have same size as source expression
s1= 100000.
c1 = array[200000]
n1 = s1*2+2
t1 = array[100000]
————————————————————————————————
(function)
f03, c1, s1, n1
cz = fltarr(n1,3)
cz[0:((2*s1)-1),0] = c1
cz[1:(2*s1),1] = c1
cz[2:((2*s1)+1),2] = c1
cr = cz[0:(n1-1),1] - cz[0:(n1-1),2]
cl = cz[0:(n1-1),1] - cz[0:(n1-1),0]
p1 = where(cr GE 0.0 AND cl GE 0.0 AND (cz[0:(n1-1),1]) GE 1.4)
n2 = n_elements(p1)
ct = fltarr(n2+1,2)
ct[0:n2-1,0] = p1
ct[1:n2,1] = p1
c2 = ct[*,0] - ct[*,1]
ip = where(c2 GT 2.)
ch = p1[ip]
return, ch
————————————————————————————————
p1 = f03(c1,s1,n1) ;;;;; function works here
f1 = f03(t1,s1,n1) ;;;;; error on array size
I used MATRIX and AS.MATRIX in R (for f03). Does FLTARR cause this error?
Your lines:
cz = fltarr(n1, 3)
cz[0:((2 * s1) - 1), 0] = c1
make sense when cz has a first dimension of size 200002 and c1 has 200000 elements: the sizes on the left and right of the = sign match, i.e., 0:199999 is 200000 elements, the same as the size of c1.
But in the second call, the array passed as c1 (t1) has only 100000 elements, while the left side still asks for 200000 elements.
Also, define a function like:
function f03, c1, s1, n1
; a bunch of code goes here
return, ch
end

Issue with substitution

I have a list L = [a13 == a10, a14 == a11, a15 == a12, a16 == a7, a17 == a8, a18 == a9]
I then have a running through a loop, taking values such as
a = 1
a = 2*a15*a16 + 2*a13*a17 + 2*a13*a18 +1849
etc
I have
print(a)
a.subs(L)
print(a)
and it indicates no change, but I would have expected the substitution to have taken place. Maybe I am being an idiot, but please tell me where.
Thanks.
Edit: Example code
I will write out some of my code + outputs:
print L
while k <= i[0].degree(t):
a = i[0].coefficient({t:k})
print a
b = a.subs(L)
print b
Don't understand why there is an extra box, but hopefully this makes sense.
An example of Outputs:
[a13 == a10, a13 == a11, a15 == a12, a16 == a7, a17 == a8, a18 == a9]
1
1
1
1
2*a15*16 + 2*a14*a17+2*a13*a13 + 1849
2*a15*16 + 2*a14*a17+2*a13*a13 + 1849
Hope this helps
I think what you are missing is that a.subs(L) does not modify a; it returns a new expression, presumably so that one can substitute many times from the same starting expression. Assign the result and print that instead:
b = a.subs(L)
print b

Determining whether there is a descending pattern between two sampled numbers

I have two numbers that are samples of two different quantities (it doesn't really matter what they are). Both fluctuate with time. I have samples of these values from two different points in time; call them a0, a1, b0, b1. I can use the differences (a1-a0, b1-b0), and the difference and sum of those differences, (a1-a0)-(b1-b0) and (a1-a0)+(b1-b0).
My question is how to determine when both of them are descending in a fashion that doesn't hard-code any constants. Let me explain.
I want to detect when both of these quantities have decreased by a certain amount, but that amount may change if I change the quantities I'm sampling, so I can't hard-code a constant.
I'm sorry if this is vague but that's really all the information I have. I was just wondering if this is even solvable.
if (a1 - a0 < 0)
    if (b1 - b0 < 0) {
        //... descending
    }
or:
if (a1 - a0 + b1 - b0 < a1 - a0)      // b1 - b0 is negative
    if (a1 - a0 + b1 - b0 < b1 - b0) { // a1 - a0 is negative
        //... descending
    }
To add a threshold is simple:
if (a1 - a0 < -K)
    if (b1 - b0 < -K) {
        //... descending, more than K
    }
or:
if (a1 - a0 + b1 - b0 < a1 - a0 - K)      // b1 - b0 is less than -K
    if (a1 - a0 + b1 - b0 < b1 - b0 - K) { // a1 - a0 is less than -K
        //... descending more than K
    }
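One way to avoid baking K into the code is to pass the threshold in as a parameter, so it can be chosen per pair of sampled quantities. The sketch below is only an illustration: the function name both_descending and the separate thresholds ka and kb (one per quantity) are assumptions, not part of the question.

```c
#include <stdbool.h>

/* Returns true when both quantities dropped by more than their threshold
   between the two samples. ka and kb are hypothetical per-quantity
   thresholds; pass 0 for both to get the plain "both negative" test. */
bool both_descending(double a0, double a1, double b0, double b1,
                     double ka, double kb)
{
    return (a1 - a0 < -ka) && (b1 - b0 < -kb);
}
```

How ka and kb are chosen (a fixed value per sensor, a fraction of the earlier sample, and so on) is left to the caller, which keeps the comparison itself free of constants.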

Performance degrade while using alternative for Intel intrinsics SSSE3

I am developing a performance-critical application that has to be ported to an Intel Atom processor, which supports only MMX, SSE, SSE2 and SSE3. My previous application used SSSE3 as well as AVX; now I want to downgrade it to the Intel Atom processor (MMX, SSE, SSE2, SSE3).
There is a serious performance degradation when I replace SSSE3 instructions, particularly _mm_hadd_epi16, with this code:
RegTemp1 = _mm_setr_epi16(RegtempRes1.m128i_i16[0], RegtempRes1.m128i_i16[2],
RegtempRes1.m128i_i16[4], RegtempRes1.m128i_i16[6],
Regfilter.m128i_i16[0], Regfilter.m128i_i16[2],
Regfilter.m128i_i16[4], Regfilter.m128i_i16[6]);
RegTemp2 = _mm_setr_epi16(RegtempRes1.m128i_i16[1], RegtempRes1.m128i_i16[3],
RegtempRes1.m128i_i16[5], RegtempRes1.m128i_i16[7],
Regfilter.m128i_i16[1], Regfilter.m128i_i16[3],
Regfilter.m128i_i16[5], Regfilter.m128i_i16[7]);
RegtempRes1 = _mm_add_epi16(RegTemp1, RegTemp2);
This is the best conversion I was able to come up with for this particular instruction. But this change has seriously affected the performance of the entire program.
Can anyone please suggest a more efficient alternative to the _mm_hadd_epi16 instruction that uses only MMX, SSE, SSE2 and SSE3 instructions? Thanks in advance.
_mm_hadd_epi16(a, b) can be simulated with the following code:
/* (b3, a3, b2, a2, b1, a1, b0, a0) */
__m128i ab0 = _mm_unpacklo_epi16(a, b);
/* (b7, a7, b6, a6, b5, a5, b4, a4) */
__m128i ba0 = _mm_unpackhi_epi16(a, b);
/* (b5, b1, a5, a1, b4, b0, a4, a0) */
__m128i ab1 = _mm_unpacklo_epi16(ab0, ba0);
/* (b7, b3, a7, a3, b6, b2, a6, a2) */
__m128i ba1 = _mm_unpackhi_epi16(ab0, ba0);
/* (b6, b4, b2, b0, a6, a4, a2, a0) */
__m128i ab2 = _mm_unpacklo_epi16(ab1, ba1);
/* (b7, b5, b3, b1, a7, a5, a3, a1) */
__m128i ba2 = _mm_unpackhi_epi16(ab1, ba1);
/* (b6+b7, b4+b5, b2+b3, b0+b1, a6+a7, a4+a5, a2+a3, a0+a1) */
__m128i c = _mm_add_epi16(ab2, ba2);
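As a sanity check, the unpack sequence above can be wrapped in a function and compared against a scalar reference. This is a sketch for an x86 target with SSE2; the names hadd_epi16_sse2, hadd_epi16_ref and hadd_matches_reference are made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* SSE2-only stand-in for _mm_hadd_epi16(a, b): three rounds of 16-bit
   unpacks separate even- and odd-indexed lanes, then one add pairs them. */
__m128i hadd_epi16_sse2(__m128i a, __m128i b)
{
    __m128i ab0  = _mm_unpacklo_epi16(a, b);
    __m128i ba0  = _mm_unpackhi_epi16(a, b);
    __m128i ab1  = _mm_unpacklo_epi16(ab0, ba0);
    __m128i ba1  = _mm_unpackhi_epi16(ab0, ba0);
    __m128i even = _mm_unpacklo_epi16(ab1, ba1); /* a0,a2,a4,a6,b0,b2,b4,b6 */
    __m128i odd  = _mm_unpackhi_epi16(ab1, ba1); /* a1,a3,a5,a7,b1,b3,b5,b7 */
    return _mm_add_epi16(even, odd);
}

/* Scalar reference: adjacent pairs of a fill lanes 0..3, pairs of b fill
   lanes 4..7. Sums are truncated to 16 bits, like the vector add. */
void hadd_epi16_ref(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    for (int i = 0; i < 4; ++i) {
        out[i]     = (int16_t)(a[2 * i] + a[2 * i + 1]);
        out[i + 4] = (int16_t)(b[2 * i] + b[2 * i + 1]);
    }
}

/* Returns 1 when the SSE2 version and the scalar reference agree. */
int hadd_matches_reference(const int16_t a[8], const int16_t b[8])
{
    int16_t expected[8], actual[8];
    hadd_epi16_ref(a, b, expected);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)actual, hadd_epi16_sse2(va, vb));
    for (int i = 0; i < 8; ++i)
        if (expected[i] != actual[i])
            return 0;
    return 1;
}
```

Whether this is actually faster than the _mm_setr_epi16 version on a given Atom core would still need to be measured; the sketch only checks that the results match.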
If your goal is to take the horizontal sum of 8 16-bit values you can do this with SSE2 like this:
__m128i sum1 = _mm_shuffle_epi32(a,0x0E); // 4 high elements
__m128i sum2 = _mm_add_epi16(a,sum1); // 4 sums
__m128i sum3 = _mm_shuffle_epi32(sum2,0x01); // 2 high elements
__m128i sum4 = _mm_add_epi16(sum2,sum3); // 2 sums
__m128i sum5 = _mm_shufflelo_epi16(sum4,0x01); // 1 high element
__m128i sum6 = _mm_add_epi16(sum4,sum5); // 1 sum
int16_t sum7 = _mm_cvtsi128_si32(sum6); // 16 bit sum
