Significant performance difference of std clock between different machines

Significant performance difference of std clock between different machines

Testing something else I stumbled across something that I haven't managed to figure out yet.

Testing something else I stumbled across something that I haven't managed to figure out yet.

Let's look at this snippet:

#include <iostream>
#include <chrono>

int main () { int i = 0; using namespace std::chrono_literals;

auto const end = std::chrono::system_clock::now() + 5s; while (std::chrono::system_clock::now() < end) { ++i; } std::cout << i; }

I've noticed that the counts heavily depend on the machine I execute it on.

I've compiled with gcc 7.3,8.2, and clang 6.0 with std=c++17 -O3.

On i7-4790 (4.17.14-arch1-1-ARCH kernel): ~3e8

but on a Xeon E5-2630 v4 (3.10.0-514.el7.x86_64): ~8e6

Now this is a difference that I would like to understand so I've checked with perf stat -d

on the i7:

       4999.419546      task-clock:u (msec)       #    0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
120 page-faults:u # 0.024 K/sec
19,605,598,394 cycles:u # 3.922 GHz (49.94%) 33,601,884,120 instructions:u # 1.71 insn per cycle (62.48%) 7,397,994,820 branches:u # 1479.771 M/sec (62.53%) 34,788 branch-misses:u # 0.00% of all branches (62.58%) 10,809,601,166 L1-dcache-loads:u # 2162.171 M/sec (62.41%) 13,632 L1-dcache-load-misses:u # 0.00% of all L1-dcache hits (24.95%) 3,944 LLC-loads:u # 0.789 K/sec (24.95%) 1,034 LLC-load-misses:u # 26.22% of all LL-cache hits (37.42%)

   5.003180401 seconds time elapsed

   4.969048000 seconds user
   0.016557000 seconds sys

Xeon:

       5001.000000      task-clock (msec)         #    0.999 CPUs utilized
42 context-switches # 0.008 K/sec
2 cpu-migrations # 0.000 K/sec
412 page-faults # 0.082 K/sec
15,100,238,798 cycles # 3.019 GHz (50.01%) 794,184,899 instructions # 0.05 insn per cycle (62.51%) 188,083,219 branches # 37.609 M/sec (62.49%) 85,924 branch-misses # 0.05% of all branches (62.51%) 269,848,346 L1-dcache-loads # 53.959 M/sec (62.49%) 246,532 L1-dcache-load-misses # 0.09% of all L1-dcache hits (62.51%) 13,327 LLC-loads # 0.003 M/sec (49.99%) 7,417 LLC-load-misses # 55.65% of all LL-cache hits (50.02%)

   5.006139971 seconds time elapsed

What pops out is the low amount of instructions per cycle on the Xeon as well as the nonzero context-switches that I don't understand. However, I wasn't able to use these diagnostics to come up with an explanation.

And to add a bit more weirdness to the problem, when trying to debug I've also compiled statically on one machine and executed on the other.

On the Xeon the statically compiled executable gives a ~10% lower output with no difference between compiling on xeon or i7.

Doing the same thing on the i7 both the counter actually drops from 3e8 to ~2e7

So in the end I'm now left with two questions:

  • Why do I see such a significant difference between the two machines.
  • Why does a statically linked exectuable perform worse while I would expect the oposite?

Edit: after updating the kernel on the centos 7 machine to 4.18 we actually see an additional drop from ~ 8e6 to 5e6.

perf interestingly shows different numbers though:

   5002.000000      task-clock:u (msec)       #    0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
119 page-faults:u # 0.024 K/sec
409,723,790 cycles:u # 0.082 GHz (50.00%) 392,228,592 instructions:u # 0.96 insn per cycle (62.51%) 115,475,503 branches:u # 23.086 M/sec (62.51%) 26,355 branch-misses:u # 0.02% of all branches (62.53%) 115,799,571 L1-dcache-loads:u # 23.151 M/sec (62.51%) 42,327 L1-dcache-load-misses:u # 0.04% of all L1-dcache hits (62.50%) 88 LLC-loads:u # 0.018 K/sec (49.96%) 2 LLC-load-misses:u # 2.27% of all LL-cache hits (49.98%)

5.005940327 seconds time elapsed

0.533000000 seconds user 4.469000000 seconds sys

It's interesting that there are no more context switches and istructions per cycle went up significantly but the cycles and therefore colck are super low!

Angular 9 Tutorial: Learn to Build a CRUD Angular App Quickly

What's new in Bootstrap 5 and when Bootstrap 5 release date?

Brave, Chrome, Firefox, Opera or Edge: Which is Better and Faster?

How to Build Progressive Web Apps (PWA) using Angular 9

What is new features in Javascript ES2020 ECMAScript 2020

C/C++ vs. Rust: A developer’s perspective

In this post, you'll see the difference between Rust and C/C++ in a developer’s perspective

Variable Introduction in C#[Bangla]

LIKE | COMMENT | SHARE | SUBSCRIBE A variable is nothing but a name given to a storage area that our programs can manipulate. Each variable in C# has a speci...