I have this very simple parallel code that I'm using to learn OpenMP; it is embarrassingly parallel. However, I don't see the expected superlinear, or at least linear, performance increase.
#pragma omp parallel num_threads(cores)
{
    int id = omp_get_thread_num();
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                Rows, Columns, Columns,
                1.0, MatrixA1[id], Columns,
                MatrixB[id], Columns,
                0.0, MatrixC[id], Columns);
}
Using the Intel C++ Compiler XE 15.0 in Visual Studio and computing a 288-by-288 sgemm (matrix multiplication), I get 350 microseconds for cores = 1 and 1177 microseconds for cores = 4, which just looks like sequential code. I have set the Intel MKL property to parallel (also tested with sequential) and the language setting to generate parallel code (/Qopenmp). Any way to improve this? I'm running a quad-core Haswell processor.
If the computation on your input size takes only a few microseconds, as you say, there is no way 4 threads can bring it below that. Basically, your input data is too small to parallelize profitably, because there is overhead in creating the threads.
Try increasing the input size so that the computation takes a good few seconds, and repeat the experiment.
You could also be suffering from false sharing, for example, but at this input size that is not the thing to worry about.
What you could otherwise do to improve performance is vectorize the code (but in this case you cannot, because you are calling a library; you would have to write your own function).