Top-down analysis with the perf tool¶
Note: the description here is for Linux 6.1. Earlier versions of the perf tool also supported top-down analysis, but because of naming inconsistencies in those versions, this page focuses on Linux 6.1.
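To check which version of the perf tool is installed:
$ perf version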
What is top-down analysis?¶
Top-down analysis is an approach for identifying software performance bottlenecks. It is described in A Top-Down Method for Performance Analysis and Counters Architecture, and is perhaps most often encountered in Appendix B of Intel's Optimization Reference Manual in their software development manuals. The approach first gathers a top-level group of metrics, from which the most problematic metric is identified. The group of metrics associated with that problematic metric is then measured, drilling down level by level until the specific metric/issue causing the performance problem is identified.
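The metrics are organized into named groups. To see which metric groups perf knows about on a given machine (the available groups vary by CPU and perf version), the metric groups can be listed:
$ perf list metricgroup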
More recent Intel processors feature improvements for top-down analysis, the most recent being Timed Processor Event Based Sampling (TPEBS). A description from Intel of using perf with these features is given here.
Starting at the top¶
In this tutorial perf bench mem memcpy will be used as the <benchmark>. First, gather the level 1 metrics:
$ perf stat -M TopdownL1 <benchmark>
The output will look like this:
$ perf stat -M TopdownL1 perf bench mem memcpy
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...
5.677689 GB/sec
# function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
5.425347 GB/sec
# function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
6.141903 GB/sec
# function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
7.454676 GB/sec
Performance counter stats for 'perf bench mem memcpy':
72,813,380 TOPDOWN.SLOTS # 25.5 % tma_retiring
# 44.3 % tma_backend_bound
# 24.3 % tma_frontend_bound
# 6.0 % tma_bad_speculation
18,560,273 topdown-retiring # 25.5% Retiring
17,703,645 topdown-fe-bound # 24.3% Frontend Bound
43,171 INT_MISC.UOP_DROPPING
31,980,778 topdown-be-bound # 43.9% Backend Bound
55,052 cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/
4,568,682 topdown-bad-spec # 6.3% Bad Speculation
0.012820482 seconds time elapsed
0.004290000 seconds user
0.008581000 seconds sys
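The raw counter values make this output fairly busy. When only the percentages are of interest, perf stat's --metric-only option prints just the computed metrics; the exact layout varies by perf version:
$ perf stat --metric-only -M TopdownL1 perf bench mem memcpy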
If no events or metrics are specified, perf stat computes the TopdownL1 metrics by default on CPUs that support them:
$ perf stat perf bench mem memcpy
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...
16.837284 GB/sec
# function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
16.551907 GB/sec
# function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
19.148284 GB/sec
Performance counter stats for 'perf bench mem memcpy':
21.38 msec task-clock # 0.927 CPUs utilized
7 context-switches # 327.406 /sec
0 cpu-migrations # 0.000 /sec
6,247 page-faults # 292.187 K/sec
94,684,914 cycles # 4.429 GHz
117,788,965 instructions # 1.24 insn per cycle
25,949,650 branches # 1.214 G/sec
252,622 branch-misses # 0.97% of all branches
TopdownL1 # 45.8 % tma_backend_bound
# 11.2 % tma_bad_speculation
# 18.2 % tma_frontend_bound
# 24.8 % tma_retiring
0.023069998 seconds time elapsed
0.000000000 seconds user
0.023455000 seconds sys
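The same metrics can also be gathered system-wide over a fixed interval rather than for a single command; here sleep 1 merely provides a one second measurement window:
$ perf stat -a -M TopdownL1 sleep 1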
In the perf stat output, the metrics appear to the right of the counter values. For TopdownL1 the metrics are tma_retiring, tma_backend_bound, tma_frontend_bound and tma_bad_speculation. For this benchmark tma_backend_bound is the largest, so we drill down into it by adding the suffix _group to the metric name:
$ perf stat -M tma_backend_bound_group perf bench mem memcpy
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...
6.781684 GB/sec
# function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
5.033827 GB/sec
# function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
6.688784 GB/sec
# function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
6.829108 GB/sec
Performance counter stats for 'perf bench mem memcpy':
69,746,420 TOPDOWN.SLOTS # 24.9 % tma_core_bound
# 17.5 % tma_memory_bound
18,599,045 topdown-retiring # 26.7% Retiring
1,253,770 EXE_ACTIVITY.BOUND_ON_STORES
17,504,983 topdown-fe-bound # 25.1% Frontend Bound
1,769,443 EXE_ACTIVITY.1_PORTS_UTIL
29,266,144 topdown-be-bound # 42.0% Backend Bound
55,050 cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/
2,934,845 CYCLE_ACTIVITY.STALLS_MEM_ANY
6,667,954 CYCLE_ACTIVITY.STALLS_TOTAL
1,775,168 EXE_ACTIVITY.2_PORTS_UTIL
4,376,245 topdown-bad-spec # 6.3% Bad Speculation
0.012655913 seconds time elapsed
0.008451000 seconds user
0.004225000 seconds sys
This time tma_core_bound is the largest TMA metric, so we drill down into it:
$ perf stat -M tma_core_bound_group perf bench mem memcpy
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...
6.510417 GB/sec
# function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
6.065606 GB/sec
# function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
7.512019 GB/sec
# function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
6.781684 GB/sec
Performance counter stats for 'perf bench mem memcpy':
70,285,910 TOPDOWN.SLOTS # 31.2 % tma_ports_utilization
18,467,278 topdown-retiring # 26.3% Retiring
2,165,618 cpu/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/
17,364,754 topdown-fe-bound # 24.7% Frontend Bound
1,754,496 EXE_ACTIVITY.1_PORTS_UTIL
30,043,781 topdown-be-bound # 42.7% Backend Bound
14,057,182 CPU_CLK_UNHALTED.THREAD # 0.1 % tma_divider
3,054,356 CYCLE_ACTIVITY.STALLS_MEM_ANY
6,685,779 CYCLE_ACTIVITY.STALLS_TOTAL
1,767,046 EXE_ACTIVITY.2_PORTS_UTIL
4,410,096 topdown-bad-spec # 6.3% Bad Speculation
9,354 ARITH.DIVIDER_ACTIVE
0.011282941 seconds time elapsed
0.000000000 seconds user
0.011349000 seconds sys
Within tma_core_bound_group, tma_ports_utilization (31.2%) dwarfs tma_divider (0.1%), so next we drill into tma_ports_utilization:
$ perf stat -M tma_ports_utilization_group perf bench mem memcpy
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...
6.554111 GB/sec
# function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
5.710892 GB/sec
# function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
6.467301 GB/sec
# function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
6.300403 GB/sec
Performance counter stats for 'perf bench mem memcpy':
1,812,959 RESOURCE_STALLS.SCOREBOARD # 16.6 % tma_ports_utilized_0 (34.73%)
1,991,726 cpu/EXE_ACTIVITY.3_PORTS_UTIL,umask=0x80/ (34.73%)
14,159,441 CPU_CLK_UNHALTED.THREAD (34.73%)
6,689,757 CYCLE_ACTIVITY.STALLS_TOTAL (34.73%)
3,838,402 CYCLE_ACTIVITY.STALLS_MEM_ANY (34.73%)
3,282,823 UOPS_EXECUTED.CYCLES_GE_3 # 22.9 % tma_ports_utilized_3m (52.98%)
14,324,185 CPU_CLK_UNHALTED.THREAD (52.98%)
14,599,955 CPU_CLK_UNHALTED.THREAD # 12.5 % tma_ports_utilized_2 (65.27%)
1,823,495 EXE_ACTIVITY.2_PORTS_UTIL (65.27%)
1,819,926 EXE_ACTIVITY.1_PORTS_UTIL # 12.5 % tma_ports_utilized_1 (79.65%)
14,591,940 CPU_CLK_UNHALTED.THREAD (79.65%)
0.012931961 seconds time elapsed
0.008647000 seconds user
0.004323000 seconds sys
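Note that -M accepts a comma-separated list, so several metrics or metric groups can be measured in a single run, though the more events perf has to program at once, the more likely counter multiplexing becomes:
$ perf stat -M tma_frontend_bound,tma_backend_bound perf bench mem memcpy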
In the tma_ports_utilization_group output, numbers like (34.73%) indicate that there were not enough hardware performance counters to gather all of the events at once, so the counters had to be multiplexed during the benchmark run. Multiplexing lowers accuracy; it can be worked around by measuring a metric on its own:
$ perf stat -M tma_ports_utilized_0 perf bench mem memcpy
...
2,268,815 RESOURCE_STALLS.SCOREBOARD # 19.8 % tma_ports_utilized_0
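Multiplexing error also shrinks as a workload runs for longer. For a short benchmark such as this one, repeating the run and letting perf stat average the results can make the numbers more stable; the repeat count of 5 below is an arbitrary choice:
$ perf stat -r 5 -M tma_ports_utilization_group perf bench mem memcpy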
Finally we see tma_ports_utilized_3m as the largest metric. Looking at perf list (or perf list -v for a longer description) we can see the metric's meaning:
tma_ports_utilized_3m
[This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3]
The 'Sample with' event can be used with perf record to identify where in the benchmark the performance bottleneck is:
$ perf record -e UOPS_EXECUTED.CYCLES_GE_3 perf bench mem memcpy; perf report
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...
23.251488 GB/sec
# function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
23.251488 GB/sec
# function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
28.722426 GB/sec
# function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
27.901786 GB/sec
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.034 MB perf.data (1 samples) ]
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1 of event 'UOPS_EXECUTED.CYCLES_GE_3/period=1000000/'
# Event count (approx.): 2000003
#
# Overhead Command Shared Object Symbol
# ........ .......... ............. ...............
#
100.00% mem-memcpy perf [.] memcpy_orig
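All samples land in memcpy_orig, but only one sample was captured on this short run. Lowering the sample period with perf record's -c option (the period is in event counts; 100,000 below is an arbitrary choice) gathers more samples, and perf annotate can then show which instructions within the function the samples fall on:
$ perf record -e UOPS_EXECUTED.CYCLES_GE_3 -c 100000 perf bench mem memcpy
$ perf annotate memcpy_orig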