Significant performance difference of std clock between different machines

Significant performance difference of std clock between different machines

Testing something else I stumbled across something that I haven't managed to figure out yet.

Testing something else I stumbled across something that I haven't managed to figure out yet.

Let's look at this snippet:

#include <iostream>
#include <chrono>

int main () { int i = 0; using namespace std::chrono_literals;

auto const end = std::chrono::system_clock::now() + 5s; while (std::chrono::system_clock::now() < end) { ++i; } std::cout << i; }

I've noticed that the counts heavily depend on the machine I execute it on.

I've compiled with gcc 7.3,8.2, and clang 6.0 with std=c++17 -O3.

On i7-4790 (4.17.14-arch1-1-ARCH kernel): ~3e8

but on a Xeon E5-2630 v4 (3.10.0-514.el7.x86_64): ~8e6

Now this is a difference that I would like to understand so I've checked with perf stat -d

on the i7:

       4999.419546      task-clock:u (msec)       #    0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
120 page-faults:u # 0.024 K/sec
19,605,598,394 cycles:u # 3.922 GHz (49.94%) 33,601,884,120 instructions:u # 1.71 insn per cycle (62.48%) 7,397,994,820 branches:u # 1479.771 M/sec (62.53%) 34,788 branch-misses:u # 0.00% of all branches (62.58%) 10,809,601,166 L1-dcache-loads:u # 2162.171 M/sec (62.41%) 13,632 L1-dcache-load-misses:u # 0.00% of all L1-dcache hits (24.95%) 3,944 LLC-loads:u # 0.789 K/sec (24.95%) 1,034 LLC-load-misses:u # 26.22% of all LL-cache hits (37.42%)

   5.003180401 seconds time elapsed

   4.969048000 seconds user
   0.016557000 seconds sys

Xeon:

       5001.000000      task-clock (msec)         #    0.999 CPUs utilized
42 context-switches # 0.008 K/sec
2 cpu-migrations # 0.000 K/sec
412 page-faults # 0.082 K/sec
15,100,238,798 cycles # 3.019 GHz (50.01%) 794,184,899 instructions # 0.05 insn per cycle (62.51%) 188,083,219 branches # 37.609 M/sec (62.49%) 85,924 branch-misses # 0.05% of all branches (62.51%) 269,848,346 L1-dcache-loads # 53.959 M/sec (62.49%) 246,532 L1-dcache-load-misses # 0.09% of all L1-dcache hits (62.51%) 13,327 LLC-loads # 0.003 M/sec (49.99%) 7,417 LLC-load-misses # 55.65% of all LL-cache hits (50.02%)

   5.006139971 seconds time elapsed

What pops out is the low amount of instructions per cycle on the Xeon as well as the nonzero context-switches that I don't understand. However, I wasn't able to use these diagnostics to come up with an explanation.

And to add a bit more weirdness to the problem, when trying to debug I've also compiled statically on one machine and executed on the other.

On the Xeon the statically compiled executable gives a ~10% lower output with no difference between compiling on xeon or i7.

Doing the same thing on the i7 both the counter actually drops from 3e8 to ~2e7

So in the end I'm now left with two questions:

  • Why do I see such a significant difference between the two machines.
  • Why does a statically linked exectuable perform worse while I would expect the oposite?

Edit: after updating the kernel on the centos 7 machine to 4.18 we actually see an additional drop from ~ 8e6 to 5e6.

perf interestingly shows different numbers though:

   5002.000000      task-clock:u (msec)       #    0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
119 page-faults:u # 0.024 K/sec
409,723,790 cycles:u # 0.082 GHz (50.00%) 392,228,592 instructions:u # 0.96 insn per cycle (62.51%) 115,475,503 branches:u # 23.086 M/sec (62.51%) 26,355 branch-misses:u # 0.02% of all branches (62.53%) 115,799,571 L1-dcache-loads:u # 23.151 M/sec (62.51%) 42,327 L1-dcache-load-misses:u # 0.04% of all L1-dcache hits (62.50%) 88 LLC-loads:u # 0.018 K/sec (49.96%) 2 LLC-load-misses:u # 2.27% of all LL-cache hits (49.98%)

5.005940327 seconds time elapsed

0.533000000 seconds user 4.469000000 seconds sys

It's interesting that there are no more context switches and istructions per cycle went up significantly but the cycles and therefore colck are super low!

c++

Bootstrap 5 Complete Course with Examples

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Building a simple Applications with Vue 3

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Dicey Issues in C/C++

C/C++ problems. If you are familiar with C/C++then you must have come across some unusual things and if you haven’t, then you are about to. The below codes are checked twice before adding, so feel free to share this article with your friends.

Loops in C++ | For, While, and Do While Loops in C++

In this Video We are going to see how to use Loops in C++. We will see How to use For, While, and Do While Loops in C++.

Using isdigit() in C/C++

In this article, we'll take a look at using the isdigit() function in C/C++. This is a very simple way to check if any value is a digit or not. Let's look

Object Oriented Programming in C++ | C++ OOPs Concepts | Learn Object Oriented C++

C++ is general purpose, compiled, object-oriented programming language and its concepts served as the basis for several other languages such as Java, Python, Ruby, Perl etc.

A Complete Guide to fread() in C/C++

In this article, we’ll take a look at using fread() in C/C++.