Tutorial matrix multiplication benchmark

ag123 · September 25, 2018

hi all,

as documented here, i did some minor experiments with small heatsinks on an orange pi one sbc (allwinner H3)

among other things i used a small matrix multiplication program which multiplies two 1000x1000 single precision floating point matrices

main.cpp

the command to build is c++ -O2 -std=gnu++11 -pthread -o mat main.cpp

then run ./mat

the last documented result on an orange pi one H3 running at 1.2ghz is 2389.49 Mflops ~ 2.39 Gflops
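
for reference, a minimal sketch of how an mflops figure like this can be derived, assuming the conventional count of 2*N^3 floating point operations for an NxN multiply (one multiply and one add per inner step). the function names are made up for illustration and are not taken from main.cpp, which may count or time things differently.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// hypothetical helper: Mflops for an n x n multiply that took `seconds`,
// assuming 2*n^3 floating point operations (n^3 multiplies + n^3 adds)
double mflops(std::size_t n, double seconds) {
    const double flops = 2.0 * static_cast<double>(n) * n * n;
    return flops / seconds / 1e6;
}

// naive single precision multiply c = a * b on row-major n x n matrices
void multiply(const std::vector<float>& a, const std::vector<float>& b,
              std::vector<float>& c, std::size_t n) {
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            float r = 0.0f;
            for (std::size_t i = 0; i < n; ++i)
                r += a[row * n + i] * b[i * n + col];
            c[row * n + col] = r;
        }
    }
}

// times one multiply with a monotonic clock and returns the Mflops figure
double benchmark(std::size_t n) {
    std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f), c(n * n, 0.0f);
    const auto start = std::chrono::steady_clock::now();
    multiply(a, b, c, n);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    volatile float sink = c[0];  // keep the result observable to the compiler
    (void)sink;
    return mflops(n, elapsed.count());
}
```

under that assumption a 1000x1000 multiply is 2*10^9 operations, so 2389.49 Mflops would correspond to roughly 0.84 seconds per multiply.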

the attached source is 'optimised': the innermost loop is unrolled to take advantage of platforms that can execute 4 single precision floating point operations simultaneously

i think neon enabled platforms should be able to do that. the other optimization is that it spawns 4 threads, so that the matrix multiplication can run on 4 threads concurrently
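
a minimal sketch of the threading idea, assuming the rows of the result are split into contiguous bands, one per thread. THREADS_NUMBER is the name used in the source; the Matrix alias and the function names here are illustrative, and main.cpp may divide the work differently.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

constexpr std::size_t THREADS_NUMBER = 4;  // tune to the number of cores

using Matrix = std::vector<std::vector<float>>;  // illustrative layout

// computes rows [first, last) of result = m1 * m2 (m2 must be non-empty)
void multiply_rows(const Matrix& m1, const Matrix& m2, Matrix& result,
                   std::size_t first, std::size_t last) {
    const std::size_t inner = m2.size();
    const std::size_t cols = m2[0].size();
    for (std::size_t row = first; row < last; ++row) {
        for (std::size_t col = 0; col < cols; ++col) {
            float r = 0.0f;
            for (std::size_t i = 0; i < inner; ++i)
                r += m1[row][i] * m2[i][col];
            result[row][col] = r;
        }
    }
}

// each thread writes a disjoint band of rows, so no locking is needed
void multiply_threaded(const Matrix& m1, const Matrix& m2, Matrix& result) {
    const std::size_t rows = m1.size();
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < THREADS_NUMBER; ++t) {
        const std::size_t first = rows * t / THREADS_NUMBER;
        const std::size_t last = rows * (t + 1) / THREADS_NUMBER;
        workers.emplace_back(multiply_rows, std::cref(m1), std::cref(m2),
                             std::ref(result), first, last);
    }
    for (auto& w : workers) w.join();
}
```

as with the build command above, this needs -pthread when compiling with g++.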

could you try to run this and let us know what 'mflops' you get on your sbc? what is the sbc, the soc and the frequency it is running at?

note that this is not necessarily the most optimised for your sbc / soc

the parameters you could try to tune include THREADS_NUMBER, the number of concurrent threads, in the source code

in addition, those who have a more powerful soc could try to unroll the loops further, e.g. try to compile with additional options like -funroll-loops or even -funroll-all-loops

you could also manually update the code, e.g. double the set of manually unrolled statements in the loop so that it becomes 8 sets of computations instead of 4. you would then need to review the loop header, such as using MATRIX_SIZE/8; i+= 8 if you unroll the loop into 8 sets of variables. you'd need to update the summation after the loop as well, result.elements[row][col] = r1+r2+r3+r4; so that it adds the other r variables that you unrolled into

    // note: with this bound and i+=4, i stops before MATRIX_SIZE/4, so only
    // the first quarter of the row/column is summed; covering the whole
    // row/column would need i < MATRIX_SIZE
    for (int i = 0; i < MATRIX_SIZE/4; i+=4) {
      const float e1 = m1.elements[row][i];
      const float e2 = m2.elements[i][col];
      const float e3 = m1.elements[row][i+1];
      const float e4 = m2.elements[i+1][col];
      const float e5 = m1.elements[row][i+2];
      const float e6 = m2.elements[i+2][col];
      const float e7 = m1.elements[row][i+3];
      const float e8 = m2.elements[i+3][col];
      // 4 independent accumulators
      r1 += e1 * e2;
      r2 += e3 * e4;
      r3 += e5 * e6;
      r4 += e7 * e8;
    }

    //result.elements[row][col] = r;
    result.elements[row][col] = r1+r2+r3+r4;

the original 'unoptimised' code can be found in the references in the original post.

strictly speaking this is not really a good test of computational prowess, unlike linpack etc. linpack actually solves a system of linear equations, while this is purely a square matrix multiply.

in addition, this does not explicitly use neon etc, and whether neon gets used depends on the compiler optimization (i think gcc / g++ has a built-in vectorizer, hence you may like to experiment with the options, e.g. -O3 or -ftree-vectorize)

but nevertheless seeing mflops and gflops is fun. mflops / gflops are also normally a function of the frequency the core executes at, hence you could try to overclock your soc to get more gflops