While testing something else, I stumbled across a behaviour that I haven't managed to figure out yet.

Let's look at this snippet:

    #include <iostream>
    #include <chrono>

    int main() {
        int i = 0;
        using namespace std::chrono_literals;

        auto const end = std::chrono::system_clock::now() + 5s;
        while (std::chrono::system_clock::now() < end) {
            ++i;
        }
        std::cout << i;
    }

I've noticed that the counts heavily depend on the machine I execute it on.

I've compiled with gcc 7.3, gcc 8.2, and clang 6.0, using `-std=c++17 -O3`.

On i7-4790 (4.17.14-arch1-1-ARCH kernel): ~`3e8`

but on a Xeon E5-2630 v4 (3.10.0-514.el7.x86_64): ~`8e6`

This is a difference that I would like to understand, so I checked with `perf stat -d`.

On the i7:

       4999.419546      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               120      page-faults:u             #    0.024 K/sec
    19,605,598,394      cycles:u                  #    3.922 GHz                      (49.94%)
    33,601,884,120      instructions:u            #    1.71  insn per cycle           (62.48%)
     7,397,994,820      branches:u                # 1479.771 M/sec                    (62.53%)
            34,788      branch-misses:u           #    0.00% of all branches          (62.58%)
    10,809,601,166      L1-dcache-loads:u         # 2162.171 M/sec                    (62.41%)
            13,632      L1-dcache-load-misses:u   #    0.00% of all L1-dcache hits    (24.95%)
             3,944      LLC-loads:u               #    0.789 K/sec                    (24.95%)
             1,034      LLC-load-misses:u         #   26.22% of all LL-cache hits     (37.42%)

       5.003180401 seconds time elapsed

       4.969048000 seconds user
       0.016557000 seconds sys

Xeon:

       5001.000000      task-clock (msec)         #    0.999 CPUs utilized
                42      context-switches          #    0.008 K/sec
                 2      cpu-migrations            #    0.000 K/sec
               412      page-faults               #    0.082 K/sec
    15,100,238,798      cycles                    #    3.019 GHz                      (50.01%)
       794,184,899      instructions              #    0.05  insn per cycle           (62.51%)
       188,083,219      branches                  #   37.609 M/sec                    (62.49%)
            85,924      branch-misses             #    0.05% of all branches          (62.51%)
       269,848,346      L1-dcache-loads           #   53.959 M/sec                    (62.49%)
           246,532      L1-dcache-load-misses     #    0.09% of all L1-dcache hits    (62.51%)
            13,327      LLC-loads                 #    0.003 M/sec                    (49.99%)
             7,417      LLC-load-misses           #   55.65% of all LL-cache hits     (50.02%)

       5.006139971 seconds time elapsed

What stands out is the low number of instructions per cycle on the Xeon, as well as the nonzero context switches, which I don't understand. However, I wasn't able to use these diagnostics to come up with an explanation.

And to add a bit more weirdness to the problem, when trying to debug I've also compiled statically on one machine and executed on the other.

On the Xeon, the statically linked executable gives a ~10% lower count, with no difference between compiling on the Xeon or the i7.

Doing the same thing on the i7, the counter actually drops from `3e8` to ~`2e7`.

So in the end I'm now left with two questions:

- Why do I see such a significant difference between the two machines?
- Why does a statically linked executable perform worse, when I would expect the opposite?

**Edit:** After updating the kernel on the CentOS 7 machine to 4.18, we actually see an additional drop from ~`8e6` to `5e6`.

perf interestingly shows different numbers though:

       5002.000000      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               119      page-faults:u             #    0.024 K/sec
       409,723,790      cycles:u                  #    0.082 GHz                      (50.00%)
       392,228,592      instructions:u            #    0.96  insn per cycle           (62.51%)
       115,475,503      branches:u                #   23.086 M/sec                    (62.51%)
            26,355      branch-misses:u           #    0.02% of all branches          (62.53%)
       115,799,571      L1-dcache-loads:u         #   23.151 M/sec                    (62.51%)
            42,327      L1-dcache-load-misses:u   #    0.04% of all L1-dcache hits    (62.50%)
                88      LLC-loads:u               #    0.018 K/sec                    (49.96%)
                 2      LLC-load-misses:u         #    2.27% of all LL-cache hits     (49.98%)

       5.005940327 seconds time elapsed

       0.533000000 seconds user
       4.469000000 seconds sys

It's interesting that there are no more context switches and that instructions per cycle went up significantly, but the cycle count, and therefore the reported clock rate, is super low!
