DHara: c++ - Performance AVX/SSE assembly vs. intrinsics

Saturday, 15 February 2014


I'm just trying to check the optimal approach to optimizing a few basic routines. In this case, a very simple example of multiplying 2 float vectors together:

  void Mul(float *src1, float *src2, float *dst)
  {
      for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
  };

The plain C++ implementation is very slow. I wrote an external ASM routine using AVX and also tried intrinsics. These are the test results (time; smaller is better):

  ASM:        0.110
  IPP:        0.125
  Intrinsics: 0.18
  Plain C++:  4.0

(Compiled with MSVC 2013, SSE2. I also tried the Intel Compiler; the results were much the same.)

As you can see, my ASM code beats even Intel Performance Primitives (IPP), possibly because I added branches to ensure that I can use the aligned AVX instructions. But I would prefer the intrinsics approach: it is easier to maintain, and I expected the compiler to do the best job of optimizing all the branches and the rest (my ASM code is poor in that respect, yet it is still faster). So here is the code using intrinsics:

  int i;
  for (i = 0; (MINTEGER)(dst + i) % 32 != 0 && i < cnt; i++) dst[i] = src1[i] * src2[i];
  if ((MINTEGER)(src1 + i) % 32 == 0)
  {
    if ((MINTEGER)(src2 + i) % 32 == 0)
    {
      for (; i …

One problem is that the scalar C++ loops at the beginning and at the end do not use AVX unless I enable AVX in the compiler, which I do not want: this should be just an AVX specialization, and the software must still work on platforms where AVX is not available. Sadly, there seems to be no intrinsic for instructions such as vmovss, so there is probably a penalty for mixing AVX code with the SSE code that the compiler generates. However, even with AVX enabled in the compiler, the time still did not get below 0.14.

How can I optimize the intrinsics version to reach the speed of the ASM code?
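One way to keep AVX confined to a single specialization, without enabling it for the whole build, is per-function targeting plus a runtime CPU check. The following is only a minimal sketch under stated assumptions: it relies on the GCC/Clang extensions `__attribute__((target("avx")))` and `__builtin_cpu_supports`, which are not what MSVC offers; the names `MulAvx`, `MulScalar` and `MulDispatch` are hypothetical; `cnt` is passed as a parameter instead of coming from the enclosing scope; and it uses unaligned loads rather than the alignment branches from the question. It has not been benchmarked against the numbers above.

```cpp
#include <cassert>
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define MUL_HAVE_AVX_PATH 1

// AVX code generation is enabled for this one function only (GCC/Clang
// extension), so the rest of the program stays runnable on non-AVX CPUs.
__attribute__((target("avx")))
static void MulAvx(const float* src1, const float* src2, float* dst, std::size_t cnt) {
    std::size_t i = 0;
    for (; i + 8 <= cnt; i += 8) {
        __m256 a = _mm256_loadu_ps(src1 + i);  // unaligned 8-float loads
        __m256 b = _mm256_loadu_ps(src2 + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(a, b));
    }
    // Tail elements: compiled inside the AVX-targeted function, so the
    // compiler can use VEX-encoded scalar instructions here as well.
    for (; i < cnt; i++) dst[i] = src1[i] * src2[i];
    // Clear the upper halves of the YMM registers before returning to code
    // that may use legacy SSE encodings.
    _mm256_zeroupper();
}
#endif

// Portable fallback, same loop as the plain C++ version in the question.
static void MulScalar(const float* src1, const float* src2, float* dst, std::size_t cnt) {
    for (std::size_t i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}

// Hypothetical entry point: picks the AVX path only when the CPU reports it.
void MulDispatch(const float* src1, const float* src2, float* dst, std::size_t cnt) {
    assert(cnt == 0 || (src1 && src2 && dst));
#ifdef MUL_HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        MulAvx(src1, src2, dst, cnt);
        return;
    }
#endif
    MulScalar(src1, src2, dst, cnt);
}
```

Because the scalar tail lives inside the AVX-targeted function, it does not have to mix SSE-encoded and AVX-encoded instructions, which is the penalty the question worries about. Whether this closes the gap to the hand-written ASM would have to be measured.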
 
Your implementation with intrinsics is not the same function as your implementation in plain C: for example, what if your function were called with the arguments Mul(p, p, p+1)? You would get different results. The pure C version is slow because the compiler is making sure that the code does exactly what you said.

If you want the compiler to optimize based on the assumption that the three arrays do not overlap, you have to make that explicit:

  void Mul(float *src1, float *src2, float *__restrict__ dst)

or even

  void Mul(const float *src1, const float *src2, float *__restrict__ dst)

(I think it is enough to put __restrict__ on the output pointer only, though it would not hurt to add it to the input pointers too.)
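To make the Mul(p, p, p+1) case concrete, here is a small sketch, not taken from the original post. `MulBlocked` is a hypothetical stand-in for a SIMD version: it reads a whole block of inputs before writing any outputs, which is what a vector load followed by a vector store does. `MulRestrict` shows the no-alias promise; it uses the `__restrict` spelling on the assumption that MSVC, GCC and Clang all accept it. `cnt` is passed as a parameter here, and `RunOverlapped` is just a helper for the demonstration.

```cpp
#include <cassert>
#include <vector>

// Element-by-element version, as in the question: every store is visible
// to the loads of later iterations.
void MulScalar(const float* src1, const float* src2, float* dst, int cnt) {
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}

// Hypothetical stand-in for a SIMD version: each block of 4 inputs is fully
// read before any of the 4 outputs is written.
void MulBlocked(const float* src1, const float* src2, float* dst, int cnt) {
    int i = 0;
    for (; i + 4 <= cnt; i += 4) {
        float tmp[4];
        for (int k = 0; k < 4; k++) tmp[k] = src1[i + k] * src2[i + k];
        for (int k = 0; k < 4; k++) dst[i + k] = tmp[k];
    }
    for (; i < cnt; i++) dst[i] = src1[i] * src2[i];
}

// Same loop with a no-alias promise on the output pointer. __restrict is a
// compiler extension; calling this with overlapping arrays is undefined
// behaviour, so it is the caller's job to keep the promise.
void MulRestrict(const float* src1, const float* src2, float* __restrict dst, int cnt) {
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}

// Runs Mul(p, p, p + 1) on a private copy of p with the given implementation.
template <typename Fn>
std::vector<float> RunOverlapped(Fn mul, std::vector<float> p) {
    assert(p.size() >= 2);
    mul(p.data(), p.data(), p.data() + 1, static_cast<int>(p.size()) - 1);
    return p;
}
```

With p = {2, 1, 1, 1, 1}, `RunOverlapped(MulScalar, p)` yields {2, 4, 16, 256, 65536}, because each product feeds the next iteration, while `RunOverlapped(MulBlocked, p)` yields {2, 4, 1, 1, 1}. A compiler that cannot rule out this overlap has to preserve the first behaviour, which limits how freely it can vectorize the plain loop.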




Posted by Unknown at 02:22










