I checked the previous kernel as instructed: there is no difference, so it's not a regression. Setting affinity does raise throughput to 380Mbit/s.
However, I also tried running dd from /dev/zero to /dev/null at the same time as iperf3, and with dd hogging one CPU, iperf3 loses half of its throughput. Doesn't that point to the problem being in CPU scheduling or something similar, rather than in the interface/driver code?
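For anyone who wants to repeat the dd experiment, here is a rough sketch of how it could be scripted. The helper name with_cpu_hog is made up for illustration, and the bs=1M block size is my own choice, not something taken from the test above. It starts an unpinned dd in the background, runs whatever command it is given, then stops dd again.

```shell
# Hypothetical helper: run a command while a dd process burns one CPU.
# dd is NOT pinned here, so the scheduler decides where it lands.
with_cpu_hog() {
    # Busy loop: copy zeros to /dev/null until killed.
    dd if=/dev/zero of=/dev/null bs=1M 2>/dev/null &
    hog_pid=$!
    status=0
    # Run the command under test and remember its exit status.
    "$@" || status=$?
    # Stop the hog and reap it so no dd is left behind.
    kill "$hog_pid" 2>/dev/null || true
    wait "$hog_pid" 2>/dev/null || true
    return "$status"
}
```

It would be called as something like: with_cpu_hog iperf3 -c "$SERVER" -t 10, where SERVER is a placeholder for the iperf3 server address. Comparing that against a plain iperf3 run should show the drop, if it reproduces.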
Anyway, it's good enough for now. I can reach 800Mbit/s, but only sometimes, and only if I run iperf3 under taskset on CPUs 1,2,3. Even that doesn't work every time: sometimes it has to be 4,5,6, and even with 1,2,3 the good runs only show up on retries (so it goes 600Mbit/s, then 250, then 600 again!?).
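To put numbers on that run-to-run variation, the same command can be repeated a few times back to back. This is only a sketch: repeat_runs is a made-up helper name, and it simply labels each run and keeps going even if one run fails.

```shell
# Hypothetical helper: run the same command COUNT times, labelling each run.
repeat_runs() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        echo "--- run $i ---"
        # Keep going even if one run fails, so later runs are still recorded.
        "$@" || true
        i=$((i + 1))
    done
}
```

For example: repeat_runs 5 taskset -c 1,2,3 iperf3 -c "$SERVER" -t 10. This assumes the CPU-list form of taskset (-c), and SERVER is again a placeholder.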
After many more tests, I noticed a pattern in when I get 800Mbit/s and when I get 120Mbit/s. With the performance governor set, I can reproduce 800Mbit/s more often using iperf3 -A 5 or -A 4, but not -A 3 or -A 2. So only the fast CPUs give good results; the others, even with the performance governor and running at 1.42GHz, still only manage 260Mbit/s.
Also, only with iperf3 -A 5 (affinity) do I get 0 packet retries. With the other options, I guess the retries happen when the process is moved from CPU 0 to 1, or, with the ondemand governor on the slow CPUs, when the frequency switches?
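A per-CPU sweep makes the comparison between fast and slow cores easier to read side by side. The sketch below assumes the usual Linux cpufreq sysfs files are present (it prints "unknown" if they are not). The name sweep_cpus and the 10 second duration are my own choices. For each CPU it prints the current governor and frequency, then the iperf3 summary lines; for TCP on Linux the sender line includes the Retr column, so retries show up next to the throughput.

```shell
# Hypothetical helper: sweep_cpus SERVER CPU...
# Runs one iperf3 client test pinned to each listed CPU via -A.
sweep_cpus() {
    server=$1
    shift
    for cpu in "$@"; do
        cpufreq_dir=/sys/devices/system/cpu/cpu$cpu/cpufreq
        # Read governor and current frequency (kHz); fall back if unreadable.
        gov=$(cat "$cpufreq_dir/scaling_governor" 2>/dev/null || echo unknown)
        freq=$(cat "$cpufreq_dir/scaling_cur_freq" 2>/dev/null || echo unknown)
        echo "== cpu$cpu governor=$gov cur_freq_khz=$freq =="
        # Keep only the final summary lines (throughput, and Retr on the sender line).
        iperf3 -c "$server" -A "$cpu" -t 10 | grep -E 'sender|receiver' || true
    done
}
```

Usage would be something like: sweep_cpus "$SERVER" 0 1 2 3 4 5, run once with the ondemand governor and once with performance, to see whether the retries follow the CPU, the governor, or both.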