
matrix multiplication benchmark



Hi all,

 

As documented here, I did some minor experiments with small heatsinks on an Orange Pi One SBC (Allwinner H3).

 

Among other things, I used a small matrix multiplication program which multiplies two 1000x1000 single-precision floating-point matrices.

main.cpp

The command to build is c++ -O2 -std=gnu++11 -pthread -o mat main.cpp

Then run ./mat

 

The last documented result on an Orange Pi One (H3) running at 1.2 GHz is 2389.49 MFLOPS, i.e. ~2.39 GFLOPS.
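
(For reference, a square matrix multiply takes 2·N³ floating-point operations, i.e. 2×10⁹ for N = 1000, so 2.39 GFLOPS works out to roughly 0.84 s per multiplication. This assumes main.cpp uses that standard operation count when reporting MFLOPS.)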

The attached source is 'optimised': the optimisation is unrolling of the innermost loop, to take advantage of platforms that can execute four single-precision floating-point operations simultaneously.

I think NEON-enabled platforms should be able to do that. The other optimisation is that it spawns off 4 threads, so the matrix multiplication can run on 4 hardware threads concurrently.
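
To give an idea of the structure, here is a minimal sketch of the approach described (this is not the attached main.cpp; the layout, the global arrays, and names like multiply_band are my own illustration):

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    #define MATRIX_SIZE 1000
    #define THREADS_NUMBER 4

    static float a[MATRIX_SIZE][MATRIX_SIZE];   // m1
    static float b[MATRIX_SIZE][MATRIX_SIZE];   // m2
    static float c[MATRIX_SIZE][MATRIX_SIZE];   // result

    // each worker multiplies a contiguous band of rows
    void multiply_band(int first_row, int last_row) {
      for (int row = first_row; row < last_row; ++row)
        for (int col = 0; col < MATRIX_SIZE; ++col) {
          // 4 independent accumulators, matching the 4-way unrolled loop
          float r1 = 0, r2 = 0, r3 = 0, r4 = 0;
          for (int i = 0; i < MATRIX_SIZE; i += 4) {
            r1 += a[row][i]     * b[i][col];
            r2 += a[row][i + 1] * b[i + 1][col];
            r3 += a[row][i + 2] * b[i + 2][col];
            r4 += a[row][i + 3] * b[i + 3][col];
          }
          c[row][col] = r1 + r2 + r3 + r4;
        }
    }

    int main() {
      for (int i = 0; i < MATRIX_SIZE; ++i)
        for (int j = 0; j < MATRIX_SIZE; ++j) {
          a[i][j] = 1.0f;
          b[i][j] = 2.0f;
        }

      auto t0 = std::chrono::steady_clock::now();
      std::vector<std::thread> workers;
      const int band = MATRIX_SIZE / THREADS_NUMBER;
      for (int t = 0; t < THREADS_NUMBER; ++t) {
        int first = t * band;
        int last = (t == THREADS_NUMBER - 1) ? MATRIX_SIZE : first + band;
        workers.emplace_back(multiply_band, first, last);
      }
      for (auto &w : workers) w.join();
      double secs = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t0).count();

      // 2*N^3 floating-point operations for a square matrix multiply
      double mflops = 2.0 * MATRIX_SIZE * MATRIX_SIZE * MATRIX_SIZE / secs / 1e6;
      std::printf("%.2f Mflops (%.3f s)\n", mflops, secs);
      return 0;
    }

It builds with the same command given above; THREADS_NUMBER is the tuning knob mentioned below.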

 

Could you try running this and let us know what 'MFLOPS' you get on your SBC? And what is the SBC, the SoC, and the frequency it is running at?

Note that this is not necessarily the most optimised version for your SBC / SoC.

 

The parameters you could try to tune include THREADS_NUMBER, the number of concurrent threads, in the source code.

In addition, those who have a more powerful SoC could try to unroll the loops further, e.g. compile with additional options like -funroll-loops or even -funroll-all-loops.
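
For example (this flag combination is just a suggested starting point, not something I have benchmarked):

c++ -O2 -funroll-loops -std=gnu++11 -pthread -o mat main.cpp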

You could also manually update the code, e.g. double the set of manually unrolled statements in the loop body so that it becomes 8 computations per iteration instead of 4. You would then need to review the code accordingly: step the loop by 8 (i += 8), declare the extra accumulator variables, and update the summation after the loop (result.elements[row][col] = r1+r2+r3+r4;) to add the other r variables you unrolled into. For reference, the current 4-way unrolled inner loop:


    // iterate over the full row/column, four elements per step
    for (int i = 0; i < MATRIX_SIZE; i += 4) {
      const float e1 = m1.elements[row][i];
      const float e2 = m2.elements[i][col];
      const float e3 = m1.elements[row][i+1];
      const float e4 = m2.elements[i+1][col];
      const float e5 = m1.elements[row][i+2];
      const float e6 = m2.elements[i+2][col];
      const float e7 = m1.elements[row][i+3];
      const float e8 = m2.elements[i+3][col];
      // four independent accumulators keep the multiplies independent
      r1 += e1 * e2;
      r2 += e3 * e4;
      r3 += e5 * e6;
      r4 += e7 * e8;
    }

    //result.elements[row][col] = r;   // the original, un-unrolled sum
    result.elements[row][col] = r1+r2+r3+r4;
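
For illustration, the 8-way unrolled version described above might look like this (r1 .. r8 would all need to be declared and zeroed before the loop; MATRIX_SIZE = 1000 divides evenly by 8):

    for (int i = 0; i < MATRIX_SIZE; i += 8) {
      r1 += m1.elements[row][i]   * m2.elements[i][col];
      r2 += m1.elements[row][i+1] * m2.elements[i+1][col];
      r3 += m1.elements[row][i+2] * m2.elements[i+2][col];
      r4 += m1.elements[row][i+3] * m2.elements[i+3][col];
      r5 += m1.elements[row][i+4] * m2.elements[i+4][col];
      r6 += m1.elements[row][i+5] * m2.elements[i+5][col];
      r7 += m1.elements[row][i+6] * m2.elements[i+6][col];
      r8 += m1.elements[row][i+7] * m2.elements[i+7][col];
    }
    result.elements[row][col] = r1+r2+r3+r4+r5+r6+r7+r8;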

 

The original 'unoptimised' code can be found in the references in the original post.

Strictly speaking, this is not really a good test of computational prowess, unlike LINPACK etc.: LINPACK actually solves a system of linear equations, while this is purely a square matrix multiply.

In addition, this does not explicitly use NEON etc.; such usage depends on the compiler optimisations (I think gcc / g++ has a built-in auto-vectoriser, enabled at -O3, hence you may like to experiment with the options).
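
For example, on a 32-bit armhf toolchain something like this should turn on the auto-vectoriser and NEON (an untested suggestion: -O3 enables gcc's vectoriser, -mfpu=neon-vfpv4 suits the Cortex-A7 cores in the H3, and gcc generally also wants -ffast-math before it will vectorise floating-point accumulations; on aarch64 NEON is on by default):

c++ -O3 -ffast-math -mfpu=neon-vfpv4 -mfloat-abi=hard -std=gnu++11 -pthread -o mat main.cpp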

But nevertheless, seeing MFLOPS and GFLOPS is fun. MFLOPS/GFLOPS is also normally a function of the frequency the core executes at, hence you could try to overclock your SoC to get more GFLOPS :D

 


But of course, high-end desktops these days deliver much higher, staggering GFLOPS compared to simple ARM chips:

https://www.pugetsystems.com/labs/hpc/Skylake-X-7800X-vs-Coffee-Lake-8700K-for-compute-AVX512-vs-AVX2-Linpack-benchmark-1068/

Nevertheless, the ARM chips on SBCs these days can easily rival early Pentium 3, Pentium 4, and single-core AMD64 CPUs :D
