ag123 Posted September 25, 2018

hi all, as documented here, i did some minor experiments with small heatsinks on an Orange Pi One SBC (Allwinner H3). among other things i used a small matrix multiplication program which multiplies two 1000x1000 single-precision floating-point matrices: main.cpp

the command to build is

c++ -O2 -std=gnu++11 -pthread -o mat main.cpp

then run ./mat

the last documented result on an Orange Pi One (H3 running at 1.2GHz) is 2389.49 Mflops ~ 2.39 Gflops.

the attached source is 'optimised'. the means of optimisation is unrolling of the innermost loop, to take advantage of platforms that can execute four single-precision floating-point operations simultaneously; i think NEON-enabled platforms should be able to do that. the other optimisation is that it spawns off 4 threads, so that 4 cores can work on the matrix multiplication concurrently (a sketch of one possible thread split is at the end of this post).

could you try running this and let us know what 'mflops' you get on your SBC? what is the SBC, the SoC, and the frequency it is running at?

note that this is not necessarily the most optimised for your SBC / SoC. the parameters you could try to tune include THREADS_NUMBER, the number of concurrent threads, in the source code.

in addition, those who have a more powerful SoC could try to unroll the loops further, e.g. compile with additional options like -funroll-loops or even -funroll-all-loops. you could also manually double the set of unrolled computations in the loop so that it becomes 8 sets instead of 4, but you would then need to review the code, e.g. stepping the loop with i += 8. and if you unroll the loop into 8 sets of variables, you'd need to update the summation after the loop as well, result.elements[row][col] = r1+r2+r3+r4;, to add the other r variables that you unrolled into (a sketch of the 8-way version is at the end of this post).

the unrolled inner loop looks like this:

for (int i = 0; i < MATRIX_SIZE; i += 4) {  // MATRIX_SIZE (1000) is divisible by 4
    const float e1 = m1.elements[row][i];
    const float e2 = m2.elements[i][col];
    const float e3 = m1.elements[row][i+1];
    const float e4 = m2.elements[i+1][col];
    const float e5 = m1.elements[row][i+2];
    const float e6 = m2.elements[i+2][col];
    const float e7 = m1.elements[row][i+3];
    const float e8 = m2.elements[i+3][col];
    // four independent accumulators give the compiler room to
    // schedule the multiply-adds in parallel (e.g. with NEON)
    r1 += e1 * e2;
    r2 += e3 * e4;
    r3 += e5 * e6;
    r4 += e7 * e8;
}
//result.elements[row][col] = r;
result.elements[row][col] = r1 + r2 + r3 + r4;

the original 'unoptimised' code can be found in the references in the original post.

strictly speaking this is not really a good test of computational prowess, unlike linpack etc.: linpack actually solves a system of equations, while this is purely a square matrix multiply. in addition, this does not explicitly use NEON; that depends on the compiler optimisation (i think gcc / g++ has a built-in vectorizer, hence you may like to experiment with the options). but nevertheless, seeing mflops and gflops is fun. mflops / gflops is also normally a function of the frequency the core executes at, hence you could try to overclock your SoC to get more gflops.
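to make the 8-way change concrete, here is a minimal sketch of what the modified inner loop could look like. this is my illustration of the edit described above, not code from the attachment, and it assumes r5..r8 are declared and initialised to 0.0f alongside r1..r4, and that MATRIX_SIZE is divisible by 8:

for (int i = 0; i < MATRIX_SIZE; i += 8) {
    r1 += m1.elements[row][i]   * m2.elements[i][col];
    r2 += m1.elements[row][i+1] * m2.elements[i+1][col];
    r3 += m1.elements[row][i+2] * m2.elements[i+2][col];
    r4 += m1.elements[row][i+3] * m2.elements[i+3][col];
    r5 += m1.elements[row][i+4] * m2.elements[i+4][col];
    r6 += m1.elements[row][i+5] * m2.elements[i+5][col];
    r7 += m1.elements[row][i+6] * m2.elements[i+6][col];
    r8 += m1.elements[row][i+7] * m2.elements[i+7][col];
}
// the summation now has to include all eight accumulators
result.elements[row][col] = r1+r2+r3+r4+r5+r6+r7+r8;

whether this helps depends on how many registers and execution units your core has; on a small in-order core the extra accumulators may not buy you anything.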
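and since main.cpp is only attached rather than quoted, here is a minimal sketch of how the 4-thread split could be organised. the names (Matrix, worker, multiply) and the row-wise partitioning are my assumptions, not necessarily what the attached code does:

#include <thread>
#include <vector>
#include <functional>  // std::ref / std::cref

#define MATRIX_SIZE 1000
#define THREADS_NUMBER 4

// ~4MB per matrix, so allocate instances statically or on the
// heap, not on the stack
struct Matrix { float elements[MATRIX_SIZE][MATRIX_SIZE]; };

// each worker computes a contiguous band of rows of the result
void worker(const Matrix &m1, const Matrix &m2, Matrix &result,
            int row_begin, int row_end) {
    for (int row = row_begin; row < row_end; ++row)
        for (int col = 0; col < MATRIX_SIZE; ++col) {
            float r = 0.0f;
            for (int i = 0; i < MATRIX_SIZE; ++i)
                r += m1.elements[row][i] * m2.elements[i][col];
            result.elements[row][col] = r;
        }
}

void multiply(const Matrix &m1, const Matrix &m2, Matrix &result) {
    std::vector<std::thread> threads;
    const int band = MATRIX_SIZE / THREADS_NUMBER;
    for (int t = 0; t < THREADS_NUMBER; ++t) {
        const int begin = t * band;
        // last thread picks up any leftover rows
        const int end = (t == THREADS_NUMBER - 1) ? MATRIX_SIZE
                                                  : begin + band;
        threads.emplace_back(worker, std::cref(m1), std::cref(m2),
                             std::ref(result), begin, end);
    }
    for (auto &th : threads)
        th.join();
}

splitting by rows keeps each thread writing to its own part of the result, so no locking is needed.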
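as a footnote on the numbers: assuming the program counts the usual 2*N^3 floating-point operations for an NxN multiply (one multiply and one add per inner-loop step), the figure is mflops = 2 * 1000^3 / (elapsed_seconds * 1e6), i.e. 2e9 flops in total. the 2389.49 Mflops above therefore corresponds to roughly 2e9 / 2389.49e6 ~ 0.84 seconds per multiplication.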
ag123 Posted September 25, 2018

but of course, high-end desktops these days deliver staggeringly higher gflops compared to simple ARM chips: https://www.pugetsystems.com/labs/hpc/Skylake-X-7800X-vs-Coffee-Lake-8700K-for-compute-AVX512-vs-AVX2-Linpack-benchmark-1068/ nevertheless, ARM chips on SBCs these days can easily rival early P3, P4 and Amd64 single-core CPUs.
Sergei Steshenko Posted October 26, 2018

Have you tried to build and use 'Atlas': http://math-atlas.sourceforge.net/ ? For the suite to be built and to work properly you'll need to lock the CPU frequency. It's all described in the documentation.
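(For example, on most Linux images the governor can be pinned with something like echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor, or an explicit frequency set with cpufreq-set -f from cpufrequtils; the exact sysfs paths and available governors vary by board and kernel, so check what your image provides.)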