CMN-600 perf example on Neoverse N1 SDP

The goal of this document is to give a short introduction to CMN-600 performance analysis on the N1SDP, including driver load verification and Linux perf usage examples.

The examples also cover system-level cache accesses and traffic to and from PCIe devices, as seen from the interconnect.

Support in Arm’s Neoverse N1 SDP software release

The software support for CMN-600 performance analysis can be divided into three components:

  • The user space Linux perf tool

  • The Linux kernel arm-cmn driver

  • EDK2 (DSDT table entry)

The default build of the supplied N1SDP software stack will include all necessary changes and patches to test and explore CMN-600 performance analysis.

CMN-600 Topology and NodeIDs on Neoverse N1 SDP

The PMUs in CMN-600 are distributed to the nodes of the mesh interconnect. NodeType specific events are configured per node. Event counting is done by local counters in the XP attached to the node. Global counters are in the Debug Trace Controller (DTC). The arm-cmn driver uses local/global register pairing to provide 64-bit event counters (see “Counter Allocation” section below).

All the nodes are referenced by NodeID and NodeType. PMU events must specify the NodeID of the node on which they are to be counted, using the nodeid= parameter. A summary of NodeIDs can be found in the table below. For more details contact support (support-subsystem-enterprise@arm.com).

Purpose                          | Node Type | NodeID                                         | Event Name
---------------------------------|-----------|------------------------------------------------|-------------------
System-Level Cache slices (SLC)  | HN-F      | 0x24, 0x28, 0x44, 0x48                         | arm_cmn/hnf
PCI_CCIX (Expansion slot 4)      | RN-D      | 0x08                                           | arm_cmn/rnid
PCI_0 (All other PCI-E)          | RN-D      | 0x0c                                           | arm_cmn/rnid
Mesh interconnections            | XP        | 0x00, 0x08, 0x20, 0x28, 0x40, 0x48, 0x60, 0x68 | arm_cmn/mxp
Debug Trace Controller           | DTC       | 0x68                                           | arm_cmn/dtc_cycles
ACE-lite slave                   | SBSX      | 0x64                                           | arm_cmn/sbsx

For details on what is connected to PCI_0 check the N1SDP TRM (Figure 2-9 PCI Express and CCIX system).

Software components

Linux perf tool

No modifications of the perf source are needed. The user can either use any perf binary compatible with the built kernel, or use the included script build-scripts/build-perf.sh to build a statically linked binary from the included kernel source (the binary is created as output/n1sdp/perf/perf).

ACPI DSDT modification

The Linux driver expects a DSDT entry that describes the location of the CMN-600 configuration space. This entry is included in the supplied N1SDP software stack.
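A quick way to confirm that firmware actually exposed this entry is to look for the corresponding ACPI device and any driver messages from Linux. This is only a sketch: it assumes the entry uses the ACPI _HID ARMHC600 that the upstream arm-cmn driver matches on, which may differ in the snapshot driver shipped with the stack.

$ ls /sys/bus/acpi/devices/ | grep -i ARMHC     # expect an entry such as ARMHC600:00 (HID is an assumption)
$ dmesg | grep -i cmn                           # any arm-cmn probe or error messages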

Linux perf driver (arm-cmn)

The included arm-cmn driver is a work in progress. A snapshot of this driver is included in the supplied N1SDP software stack. The driver is controlled by CONFIG_ARM_CMN (enabled in the default software stack build).

Counter Allocation/Limitation

The arm-cmn driver provides 64-bit event counts for any given event. It accomplishes this using a combination of combined-pair local counters (in a DTM/XP) and uncombined global counters (in the DTC):

  • DTM/XP

    Can provide up to two 32-bit local counters (built from paired 16-bit counters por_dtm_pmevcnt0+1, and 2+3) for events from itself and/or up to two devices that are connected to its ports.

    Overflows from these counters are sent to the DTC’s global counters. This means only up to 2 events from the devices connected to an XP can be counted at the same time without multiplexing.

  • DTC

    Each DTC can provide up to 8 global counters (por_dt_pmevcntA .. H). This means only up to 8 events in a DTC domain can be counted at the same time without multiplexing.

For example, the N1SDP’s two PCI-Express root complexes (PCI_CCIX on RND3 at NodeID 0x8 and PCI0 on RND4 at NodeID 0xC) hang off the same XP (0,1). Only up to 2 RND events from the two PCI-E domains combined can be measured simultaneously without multiplexing; 3 or more will be multiplexed.

In the following example, we try to measure 4 RND events, but perf can only give each event roughly 50% of the measurement time (shown in the rightmost column) because the events have to share the local counters in the XP.

$ perf stat -a \
    -e arm_cmn/rnid_txdat_flits,nodeid=8/ \
    -e arm_cmn/rnid_txdat_flits,nodeid=12/ \
    -e arm_cmn/rnid_rxdat_flits,nodeid=8/ \
    -e arm_cmn/rnid_rxdat_flits,nodeid=12/ \
    -I 1000
#   time       counts                unit events
1.000089438       0      arm_cmn/rnid_txdat_flits,nodeid=8/     (50.00%)
1.000089438       0      arm_cmn/rnid_txdat_flits,nodeid=12/    (50.00%)
1.000089438       0      arm_cmn/rnid_rxdat_flits,nodeid=8/     (50.00%)
1.000089438       0      arm_cmn/rnid_rxdat_flits,nodeid=12/    (50.00%)
2.000231897      79      arm_cmn/rnid_txdat_flits,nodeid=8/     (50.01%)
2.000231897       0      arm_cmn/rnid_txdat_flits,nodeid=12/    (50.01%)
2.000231897       0      arm_cmn/rnid_rxdat_flits,nodeid=8/     (49.99%)

PMU Events

perf list shows the perfmon events for the node types that are detected by the arm-cmn driver. If a node type is not detected, perf list will not show the events for that node type.

# perf list | grep arm_cmn_0/hnf
arm_cmn_0/hnf_brd_snoops_sent/                     [Kernel PMU event]
arm_cmn_0/hnf_cache_fill/                          [Kernel PMU event]
arm_cmn_0/hnf_cache_miss/                          [Kernel PMU event]
arm_cmn_0/hnf_cmp_adq_full/                        [Kernel PMU event]
arm_cmn_0/hnf_dir_snoops_sent/                     [Kernel PMU event]
arm_cmn_0/hnf_intv_dirty/                          [Kernel PMU event]
arm_cmn_0/hnf_ld_st_swp_adq_full/                  [Kernel PMU event]
arm_cmn_0/hnf_mc_reqs/                             [Kernel PMU event]
arm_cmn_0/hnf_mc_retries/                          [Kernel PMU event]
[...]

The perfmon events are described in the CMN-600 TRM in the register description section for each node type’s perf event selection register (at offset 0x2000 of each node that has a PMU).

The register summary in the CMN-600 TRM links to all of the node types and their register offsets.

Specifying NodeID to events in perf

To program the CMN-600’s PMUs, the NodeIDs of the components need to be specified for each event using a nodeid= parameter. Example:

$ perf stat -a -I 1000 -e arm_cmn/hnf_mc_reqs,nodeid=0x24/

Multiple nodes can be specified for an event as shown below:

$ perf stat -a -I 1000 \
    -e arm_cmn/hnf_mc_reqs,nodeid=0x24/ \
    -e arm_cmn/hnf_mc_reqs,nodeid=0x28/ \
    -e arm_cmn/hnf_mc_reqs,nodeid=0x44/ \
    -e arm_cmn/hnf_mc_reqs,nodeid=0x48/

Separate events on the same nodes can be specified as shown below:

$ perf stat -a -I 1000 \
    -e arm_cmn/hnf_mc_reqs,nodeid=0x24/ \
    -e arm_cmn/hnf_mc_reqs,nodeid=0x28/ \
    -e arm_cmn/hnf_mc_reqs,nodeid=0x44/ \
    -e arm_cmn/hnf_mc_reqs,nodeid=0x48/ \
    -e arm_cmn/hnf_mc_retries,nodeid=0x24/ \
    -e arm_cmn/hnf_mc_retries,nodeid=0x28/ \
    -e arm_cmn/hnf_mc_retries,nodeid=0x44/ \
    -e arm_cmn/hnf_mc_retries,nodeid=0x48/

Driver verification

There are several ways to verify that the arm-cmn driver has loaded successfully:

  • Check if any arm_cmn entries are available
    $ perf list | grep arm_cmn_0
    arm_cmn_0/dn_rxreq_dvmop/                          [Kernel PMU event]
    arm_cmn_0/dn_rxreq_dvmop_vmid_filtered/            [Kernel PMU event]
    arm_cmn_0/dn_rxreq_dvmsync/                        [Kernel PMU event]
    arm_cmn_0/dn_rxreq_retried/                        [Kernel PMU event]
    arm_cmn_0/dn_rxreq_trk_occupancy_all/              [Kernel PMU event]
    arm_cmn_0/dn_rxreq_trk_occupancy_dvmop/            [Kernel PMU event]
    [...]
    
  • Check the sysfs entries (the format directory is examined below)
    $ ls -x /sys/bus/event_source/devices/arm_cmn_0/
    cpumask
    dtc_domain_0
    events
    format
    perf_event_mux_interval_ms
    power
    subsystem
    type
    uevent
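The format directory above lists the fields that can be supplied in an event string, such as the nodeid= and bynodeid= parameters used throughout this document. Inspecting it shows which parameters the installed snapshot of the driver accepts; the nodeid file below is assumed to exist because the nodeid= parameter is accepted, but field names can change between driver versions.

$ ls /sys/bus/event_source/devices/arm_cmn_0/format/
$ cat /sys/bus/event_source/devices/arm_cmn_0/format/nodeid    # shows which perf config bits nodeid= maps to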
    

Example

HN-F PMU

Make sure to generate some memory load in parallel, for example by running memtester, while executing the following perf example.
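One possible way to generate such load, assuming memtester is installed (the allocation size and number of copies below are arbitrary):

$ for i in $(seq 4); do memtester 1G > /dev/null & done    # four concurrent memory exercisers; they run until killed
$ pkill memtester                                          # stop them once the perf measurement is done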

Memory Bandwidth using hnf_mc_reqs

Measure memory bandwidth using hnf_mc_reqs; this assumes that all memory traffic is the result of SLC misses.

   $ perf stat -a -I 1000 \
       -e arm_cmn/hnf_mc_reqs,nodeid=0x24/ \
       -e arm_cmn/hnf_mc_reqs,nodeid=0x28/ \
       -e arm_cmn/hnf_mc_reqs,nodeid=0x44/ \
       -e arm_cmn/hnf_mc_reqs,nodeid=0x48/
2.000394365        121,713,206      arm_cmn/hnf_mc_reqs,nodeid=0x24/
2.000394365        121,715,680      arm_cmn/hnf_mc_reqs,nodeid=0x28/
2.000394365        121,712,781      arm_cmn/hnf_mc_reqs,nodeid=0x44/
2.000394365        121,715,432      arm_cmn/hnf_mc_reqs,nodeid=0x48/
3.000644408        121,683,890      arm_cmn/hnf_mc_reqs,nodeid=0x24/
3.000644408        121,685,839      arm_cmn/hnf_mc_reqs,nodeid=0x28/
3.000644408        121,682,684      arm_cmn/hnf_mc_reqs,nodeid=0x44/
3.000644408        121,685,669      arm_cmn/hnf_mc_reqs,nodeid=0x48/

Generic bandwidth formula:

hnf_mc_reqs/second (summed over all HN-F nodes) * 64 bytes = bandwidth in bytes/sec

Substituting the data from the perf output:

(121,713,206 + 121,715,680 + 121,712,781 + 121,715,432) * 64 bytes ≈ 29,715 MiB/sec
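The same substitution as a shell one-liner, using the counts from the 2-second interval above:

$ echo $(( (121713206 + 121715680 + 121712781 + 121715432) * 64 / 1024 / 1024 ))    # MiB/sec
29715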

PCI-E RX/TX bandwidth

The RN-I/RN-D events are defined from the perspective of the bridge into the interconnect: “rdata” events count data flowing out to the PCI-E device (the device reading from memory), while “wdata” events count data flowing in from the PCI-E device (the device writing to memory, as happens when the host reads from an SSD).

Measure RND (PCI-E) bandwidth to/from NVMe SSD when running fio

For the test, the NVMe SSD (Optane SSD 900P Series) is on PCI-E Root Complex 0 (PCI0, the Gen3 slot, behind the PCI-E switch).

Run fio in one terminal to read from the NVMe SSD using a 64 KB block size:

$ fio \
    --ioengine=libaio --randrepeat=1 --direct=1 --gtod_reduce=1 \
    --time_based --readwrite=read --bs=64k --iodepth=64k --name=r0 \
    --filename=/dev/nvme0n1p5 --numjobs=1 --runtime=10000
r0: (g=0): rw=read, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=65536
fio-3.1
Starting 1 process
^Cbs: 1 (f=1): [R(1)][0.5%][r=2586MiB/s,w=0KiB/s][r=41.4k,w=0 IOPS][eta 16m:35s]
fio: terminating on signal 2

r0: (groupid=0, jobs=1): err= 0: pid=1443: Thu Dec 19 12:12:10 2019
   read: IOPS=41.3k, BW=2581MiB/s (2706MB/s)(12.3GiB/4894msec) <------------------------------- read bandwidth = 2706 MB/sec
   bw (  MiB/s): min= 2276, max= 2587, per=98.10%, avg=2532.02, stdev=125.43, samples=6
   iops        : min=36418, max=41392, avg=40512.33, stdev=2006.90, samples=6
  cpu          : usr=3.15%, sys=35.15%, ctx=16686, majf=0, minf=1049353
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwt: total=202101,0,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=65536

Run status group 0 (all jobs):
   READ: bw=2581MiB/s (2706MB/s), 2581MiB/s-2581MiB/s (2706MB/s-2706MB/s), io=12.3GiB (13.2GB), run=4894-4894msec

Disk stats (read/write):
  nvme0n1: ios=202009/2, merge=0/19, ticks=4874362/51, in_queue=3934760, util=98.06%

Measure with perf in another terminal: count rdata/wdata beats, where each beat is 32 bytes.

$ perf stat -earm_cmn/rnid_s0_{r,w}data_beats,nodeid=0xc,bynodeid=1/ -I 1000 -a
#    time                 counts                   unit events
     3.000383383            248,145      arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
     3.000383383         84,728,162      arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
     4.000522271            248,199      arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
     4.000522271         84,743,908      arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
     5.000680779            248,209      arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
     5.000680779         84,746,976      arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
     6.000835927            247,899      arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
     6.000835927         84,417,098      arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/

Calculate read bandwidth from the perf measurement:

84.74e6 wdata beats/second * 32 bytes/beat ≈ 2711 MB/sec

Measure RND (PCI-E) bandwidth from Ethernet NIC

netperf is executed on the N1SDP to generate network traffic.

Run netperf in one terminal:

$ netperf -D 10 -H <remote server> -t TCP_MAERTS -l 0
Interim result:  941.52 10^6bits/s over 10.000 seconds ending at 1576269135.608
Interim result:  941.52 10^6bits/s over 10.000 seconds ending at 1576269145.608
Interim result:  941.52 10^6bits/s over 10.000 seconds ending at 1576269155.608
Interim result:  941.52 10^6bits/s over 10.000 seconds ending at 1576269165.608

…and perf in another terminal at the same time.

$ perf stat -earm_cmn/rnid_s0_{r,w}data_beats,nodeid=0xc,bynodeid=1/ -I 1000 -a

#    time                 counts                  unit events
    12.001904404            308,803      arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
    12.001904404          4,024,328      arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
    13.002047284            308,994      arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
    13.002047284          4,024,287      arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
    14.002233364            309,035      arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
    14.002233364          4,024,470      arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
    15.002390125            309,162      arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
    15.002390125          4,024,376      arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/

Calculate bandwidth from the perf measurement:

4.024e6 wdata beats/second * 32 bytes/beat * 8 bits/byte = 1030e6 bits/second

Copyright (c) 2019-2022, Arm Limited. All rights reserved.