Saturday, 15 February 2014

c++ - Performance of AVX/SSE assembly vs. intrinsics


I'm just trying to find the best approach to optimizing some basic routines. In this case I tried a very simple example: multiplying two float vectors together.

  void Mul(float *src1, float *src2, float *dst)
  {
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
  }

The plain C implementation is very slow. I wrote some external ASM using AVX and also tried using intrinsics. These are the test results (time; smaller is better):

  ASM:        0.110
  IPP:        0.125
  Intrinsics: 0.18
  Plain C++:  4.0

(Compiled using MSVC 2013 with SSE2; I also tried the Intel Compiler, and the results were much the same.)

As you can see, my ASM code beat even Intel Performance Primitives (probably because I added lots of branches to ensure I can use the aligned AVX instructions). But I would personally prefer the intrinsics approach: it is simply easier to manage, and I thought the compiler should do the best job of optimizing all the branches and so on (my ASM code is poor in that respect, in my opinion, yet it is faster). So here's the code using intrinsics:

  int i;
  for (i = 0; (MINTEGER)(dst + i) % 32 != 0 && i < cnt; i++) dst[i] = src1[i] * src2[i];

  if ((MINTEGER)(src1 + i) % 32 == 0)
  {
    if ((MINTEGER)(src2 + i) % 32 == 0)
    {
      for (; i

The problem is that at the beginning and at the end, the C++ implementation does not use AVX unless I enable AVX in the compiler, which I do not want to do: this should be just an AVX specialization, and the software should still work on a platform where AVX is not available. Sadly, there seem to be no intrinsics for instructions such as vmovss, so there is probably a penalty for mixing AVX code with the SSE code that the compiler generates. However, even when I enabled AVX in the compiler, the time still did not get below 0.14.
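For reference, one way to keep AVX confined to a single function while the rest of the build stays on SSE2 is per-function code generation plus a runtime CPU check. The following is only a sketch, and it assumes GCC or Clang on x86 rather than the MSVC setup described above; on any other toolchain it falls back to the scalar loop. The names mul_dispatch, mul_avx_path and mul_scalar_ref are hypothetical, cnt is passed as a parameter, and unaligned loads are used instead of the alignment branches to keep it short.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) && !defined(_MSC_VER) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define MUL_HAVE_AVX_PATH 1

// Only this function is compiled with AVX enabled (GCC/Clang attribute),
// so its scalar tail loop is also generated for an AVX-capable target.
__attribute__((target("avx")))
void mul_avx_path(const float* src1, const float* src2, float* dst, std::size_t cnt) {
    std::size_t i = 0;
    for (; i + 8 <= cnt; i += 8) {
        __m256 a = _mm256_loadu_ps(src1 + i);   // unaligned loads: no alignment prologue
        __m256 b = _mm256_loadu_ps(src2 + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(a, b));
    }
    for (; i < cnt; ++i) dst[i] = src1[i] * src2[i];  // remaining 0..7 elements
}
#endif

// Baseline path, built with the default compiler settings.
void mul_scalar_ref(const float* src1, const float* src2, float* dst, std::size_t cnt) {
    for (std::size_t i = 0; i < cnt; ++i) dst[i] = src1[i] * src2[i];
}

// Hypothetical entry point: picks the AVX path only when the CPU reports AVX.
void mul_dispatch(const float* src1, const float* src2, float* dst, std::size_t cnt) {
    assert(cnt == 0 || (src1 && src2 && dst));
#ifdef MUL_HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        mul_avx_path(src1, src2, dst, cnt);
        return;
    }
#endif
    mul_scalar_ref(src1, src2, dst, cnt);
}
```

Whether this closes the gap to the hand-written ASM is not something the sketch can promise; it only shows how the AVX code can be isolated so that the binary still runs on machines without AVX.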

Any ideas on how to optimize the intrinsics version so that it reaches the speed of the ASM code?

Your implementation with intrinsics is not the same function as your implementation in plain C: e.g. what if your function was called with the arguments Mul(p, p, p + 1)? You will get different results. The pure C version is slow because the compiler is making sure that the code does exactly what you said.
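To make the overlap case concrete, here is a small sketch. It does not use real SIMD; mul_block_of_4 is a hypothetical stand-in that, like a vectorized loop, computes four products before storing any of them, and cnt is passed as a parameter for the sake of the example.

```cpp
#include <cassert>

// Plain loop: a store to dst[i] is visible to every later load.
void mul_in_order(const float* src1, const float* src2, float* dst, int cnt) {
    assert(cnt >= 0);
    for (int i = 0; i < cnt; ++i) dst[i] = src1[i] * src2[i];
}

// Stand-in for a 4-wide SIMD loop: compute four products first,
// then write all four results.
void mul_block_of_4(const float* src1, const float* src2, float* dst, int cnt) {
    assert(cnt >= 0);
    int i = 0;
    for (; i + 4 <= cnt; i += 4) {
        float tmp[4];
        for (int k = 0; k < 4; ++k) tmp[k] = src1[i + k] * src2[i + k];
        for (int k = 0; k < 4; ++k) dst[i + k] = tmp[k];
    }
    for (; i < cnt; ++i) dst[i] = src1[i] * src2[i];  // scalar tail
}
```

With p = {2, 1, 1, 1, 1} and cnt = 4, calling mul_in_order(p, p, p + 1, 4) leaves {2, 4, 16, 256, 65536} in p, because each product feeds the next iteration, while mul_block_of_4(p, p, p + 1, 4) leaves {2, 4, 1, 1, 1}. The compiler has to preserve the first behaviour unless it is told the arrays cannot overlap.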

If you want the compiler to make optimizations based on the assumption that the three arrays do not overlap, you need to make that explicit:

  void Mul(float *src1, float *src2, float *__restrict__ dst)

or even better

  void Mul(const float *src1, const float *src2, float *__restrict__ dst)

(I think it is enough to have __restrict__ on just the output pointer, although it would not hurt to add it to the input pointers too.)
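Put together as a complete function, this might look like the sketch below. The keyword is a compiler extension in C++: GCC and Clang accept __restrict__, while MSVC spells it __restrict, so the hypothetical MUL_RESTRICT macro picks one. Passing cnt as a parameter is also an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>

// `restrict` is not standard C++; pick the spelling the compiler understands.
#if defined(_MSC_VER)
#define MUL_RESTRICT __restrict
#else
#define MUL_RESTRICT __restrict__   // GCC and Clang
#endif

// dst is promised not to overlap src1 or src2, so the compiler is free
// to load several elements before storing any results.
void Mul(const float* src1, const float* src2, float* MUL_RESTRICT dst, std::size_t cnt) {
    assert(cnt == 0 || (src1 && src2 && dst));
    for (std::size_t i = 0; i < cnt; ++i) dst[i] = src1[i] * src2[i];
}
```

The promise cuts both ways: once dst is marked this way, a call such as Mul(p, p, p + 1, n) has undefined behaviour, so the qualifier should only be added if callers never pass overlapping arrays.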

