CMN-600 perf example on Neoverse N1 SDP
The goal of this document is to give a short introduction to CMN-600 performance analysis on the N1SDP, including driver load verification and Linux perf usage examples.
The examples cover system-level cache accesses and traffic to and from PCIe devices, as seen from the interconnect.
Support in Arm’s Neoverse N1 SDP software release
The software support for CMN-600 performance analysis can be divided into three components:
The user space Linux perf tool
The Linux kernel arm-cmn driver
EDK2 (DSDT table entry)
The default build of the supplied N1SDP software stack will include all necessary changes and patches to test and explore CMN-600 performance analysis.
CMN-600 Topology and NodeIDs on Neoverse N1 SDP
The PMUs in CMN-600 are distributed to the nodes of the mesh interconnect. NodeType specific events are configured per node. Event counting is done by local counters in the XP attached to the node. Global counters are in the Debug Trace Controller (DTC). The arm-cmn driver uses local/global register pairing to provide 64-bit event counters (see “Counter Allocation” section below).
All the nodes are referenced by NodeID and NodeType. PMU events must specify the NodeID of the node on which they are to be counted, using the nodeid= parameter. A summary of NodeIDs can be found in the table below. For more details contact support (support-subsystem-enterprise@arm.com).
Purpose                         | Node Type | NodeID                                         | Event Name
--------------------------------|-----------|------------------------------------------------|-------------------
System-Level Cache slices (SLC) | HN-F      | 0x24, 0x28, 0x44, 0x48                         | arm_cmn/hnf
PCI_CCIX (Expansion slot 4)     | RN-D      | 0x08                                           | arm_cmn/rnid
PCI_0 (All other PCI-E)         | RN-D      | 0x0c                                           | arm_cmn/rnid
Mesh interconnections           | XP        | 0x00, 0x08, 0x20, 0x28, 0x40, 0x48, 0x60, 0x68 | arm_cmn/mxp
Debug Trace Controller          | DTC       | 0x68                                           | arm_cmn/dtc_cycles
ACE-Lite slave                  | SBSX      | 0x64                                           | arm_cmn/sbsx
For details on what is connected to PCI_0 check the N1SDP TRM (Figure 2-9 PCI Express and CCIX system).
Software components
Linux perf tool
No modifications to the perf source are needed. The user can use any perf binary compatible with the built kernel, or use the included script build-scripts/build-perf.sh to build a statically linked binary from the included kernel source (the binary is created as output/n1sdp/perf/perf).
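For example, a minimal sanity check of the produced binary (paths as above, relative to the workspace root of the software stack; exact workflow depends on how the stack was built and deployed):
# On the build host: check the produced binary (expect a statically linked AArch64 ELF)
$ file output/n1sdp/perf/perf
# On the N1SDP target, after copying the binary over:
$ ./perf --version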
ACPI DSDT modification
The Linux driver expects a DSDT entry that describes the location of the CMN-600 configuration space. This is included in the supplied N1SDP software stack.
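The presence of this entry can be checked from Linux on the target; a minimal sketch, assuming the DSDT entry uses the ARMHC600 _HID that the arm-cmn driver matches:
# Look for the CMN-600 ACPI device node (HID assumed to be ARMHC600)
$ ls /sys/bus/acpi/devices/ | grep -i armhc600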
Linux perf driver (arm-cmn)
The included arm-cmn driver is a work in progress; a snapshot of it is included in the supplied N1SDP software stack. The driver is controlled by the CONFIG_ARM_CMN kernel option, which is enabled in the default software stack build.
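To confirm that the running kernel was built with this option, a standard kernel-config check can be used, for example:
# Either of these, depending on what the kernel/distribution provides
$ grep ARM_CMN /boot/config-$(uname -r)
$ zcat /proc/config.gz | grep ARM_CMN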
Counter Allocation/Limitation
The arm-cmn driver provides 64-bit event counts for any given event. It accomplishes this using a combination of combined-pair local counters (in a DTM/XP) and uncombined global counters (in the DTC):
- DTM/XP
Can provide up to two 32-bit local counters (each built from a pair of 16-bit counters, por_dtm_pmevcnt0+1 and por_dtm_pmevcnt2+3) for events from the XP itself and/or up to two devices connected to its ports.
Overflows from these counters are sent to the XP's DTC global counters. This means that only up to 2 events from the devices connected to an XP can be counted at the same time without sampling.
- DTC
Each DTC can provide up to 8 global counters (por_dt_pmevcntA .. H). This means only up to 8 events in a DTC domain can be counted at the same time without sampling.
For example, the RN-Ds of the N1SDP's two PCI-Express root complexes (PCI_CCIX on RND3 at NodeID 0x8 and PCI0 on RND4 at NodeID 0xC) hang off the same XP (0,1). Only up to 2 RND events across the two PCI-E domains can be measured simultaneously without sampling; 3 or more require sampling.
In the following example, we try to measure 4 RND events, but perf only gives each count 50% of the measurement time because the events have to share the local counters in the XP.
$ perf stat -a \
-e arm_cmn/rnid_txdat_flits,nodeid=8/ \
-e arm_cmn/rnid_txdat_flits,nodeid=12/ \
-e arm_cmn/rnid_rxdat_flits,nodeid=8/ \
-e arm_cmn/rnid_rxdat_flits,nodeid=12/ \
-I 1000
# time counts unit events
1.000089438 0 arm_cmn/rnid_txdat_flits,nodeid=8/ (50.00%)
1.000089438 0 arm_cmn/rnid_txdat_flits,nodeid=12/ (50.00%)
1.000089438 0 arm_cmn/rnid_rxdat_flits,nodeid=8/ (50.00%)
1.000089438 0 arm_cmn/rnid_rxdat_flits,nodeid=12/ (50.00%)
2.000231897 79 arm_cmn/rnid_txdat_flits,nodeid=8/ (50.01%)
2.000231897 0 arm_cmn/rnid_txdat_flits,nodeid=12/ (50.01%)
2.000231897 0 arm_cmn/rnid_rxdat_flits,nodeid=8/ (49.99%)
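To avoid the multiplexing shown above, stay within the limits described earlier, for example by measuring at most two RND events at a time:
$ perf stat -a -I 1000 \
    -e arm_cmn/rnid_txdat_flits,nodeid=8/ \
    -e arm_cmn/rnid_rxdat_flits,nodeid=8/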
PMU Events
perf list shows the perfmon events for the node types that are detected by the arm-cmn driver. If a node type is not detected, perf list will not show the events for that node type.
# perf list | grep arm_cmn_0/hnf
arm_cmn_0/hnf_brd_snoops_sent/ [Kernel PMU event]
arm_cmn_0/hnf_cache_fill/ [Kernel PMU event]
arm_cmn_0/hnf_cache_miss/ [Kernel PMU event]
arm_cmn_0/hnf_cmp_adq_full/ [Kernel PMU event]
arm_cmn_0/hnf_dir_snoops_sent/ [Kernel PMU event]
arm_cmn_0/hnf_intv_dirty/ [Kernel PMU event]
arm_cmn_0/hnf_ld_st_swp_adq_full/ [Kernel PMU event]
arm_cmn_0/hnf_mc_reqs/ [Kernel PMU event]
arm_cmn_0/hnf_mc_retries/ [Kernel PMU event]
[...]
The perfmon events are described in the CMN-600 TRM in the register description section for each node type’s perf event selection register (at offset 0x2000 of each node that has a PMU).
The register summary in the CMN-600 TRM links to the register descriptions and offsets for all node types.
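The encoding that perf programs for a given event name can also be read back from sysfs, which is useful when cross-referencing against the TRM; the exact fields shown depend on the driver snapshot:
$ cat /sys/bus/event_source/devices/arm_cmn_0/events/hnf_mc_reqs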
Specifying NodeID to events in perf
To program the CMN-600’s PMUs, the NodeIDs of the components need to be specified for each event using a nodeid= parameter. Example:
$ perf stat -a -I 1000 -e arm_cmn/hnf_mc_reqs,nodeid=0x24/
Multiple nodes can be specified for an event as shown below:
$ perf stat -a -I 1000 \
-e arm_cmn/hnf_mc_reqs,nodeid=0x24/ \
-e arm_cmn/hnf_mc_reqs,nodeid=0x28/ \
-e arm_cmn/hnf_mc_reqs,nodeid=0x44/ \
-e arm_cmn/hnf_mc_reqs,nodeid=0x48/
Separate events on the same nodes can be specified as shown below:
$ perf stat -a -I 1000 \
-e arm_cmn/hnf_mc_reqs,nodeid=0x24/ \
-e arm_cmn/hnf_mc_reqs,nodeid=0x28/ \
-e arm_cmn/hnf_mc_reqs,nodeid=0x44/ \
-e arm_cmn/hnf_mc_reqs,nodeid=0x48/ \
-e arm_cmn/hnf_mc_retries,nodeid=0x24/ \
-e arm_cmn/hnf_mc_retries,nodeid=0x28/ \
-e arm_cmn/hnf_mc_retries,nodeid=0x44/ \
-e arm_cmn/hnf_mc_retries,nodeid=0x48/
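Since each event is an ordinary shell word, bash brace expansion can be used as a shorthand for such lists (each expansion becomes its own -e argument, so the line below is equivalent to the eight -e options above); the PCI-E examples later in this document use the same trick. This assumes a bash-like shell:
$ perf stat -a -I 1000 \
    -earm_cmn/hnf_mc_{reqs,retries},nodeid=0x{24,28,44,48}/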
Driver verification
There are different ways to verify that the arm-cmn driver has loaded successfully:
- Check whether any arm_cmn entries are available:
$ perf list | grep arm_cmn_0
  arm_cmn_0/dn_rxreq_dvmop/                          [Kernel PMU event]
  arm_cmn_0/dn_rxreq_dvmop_vmid_filtered/            [Kernel PMU event]
  arm_cmn_0/dn_rxreq_dvmsync/                        [Kernel PMU event]
  arm_cmn_0/dn_rxreq_retried/                        [Kernel PMU event]
  arm_cmn_0/dn_rxreq_trk_occupancy_all/              [Kernel PMU event]
  arm_cmn_0/dn_rxreq_trk_occupancy_dvmop/            [Kernel PMU event]
  [...]
- Check the sysfs entries (the format directory is inspected further below):
$ ls -x /sys/bus/event_source/devices/arm_cmn_0/
cpumask  dtc_domain_0  events  format  perf_event_mux_interval_ms
power  subsystem  type  uevent
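The format directory shows which parameters this driver snapshot accepts for an event (for example nodeid and bynodeid) and which config bits they map to; the file name below is assumed from the parameter name:
$ ls /sys/bus/event_source/devices/arm_cmn_0/format/
$ cat /sys/bus/event_source/devices/arm_cmn_0/format/nodeid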
Example
HN-F PMU
Make sure to issue some memory load operations in parallel, for example with memtester, while executing the following perf examples.
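For example, a simple background memtester run (the size and iteration count here are arbitrary; locking the memory may require root):
$ memtester 2G 100 > /dev/null &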
Memory Bandwidth using hnf_mc_reqs
Measure memory bandwidth using hnf_mc_reqs; this assumes that the memory traffic is generated by SLC misses.
$ perf stat -a -I 1000 \
-e arm_cmn/hnf_mc_reqs,nodeid=0x24/ \
-e arm_cmn/hnf_mc_reqs,nodeid=0x28/ \
-e arm_cmn/hnf_mc_reqs,nodeid=0x44/ \
-e arm_cmn/hnf_mc_reqs,nodeid=0x48/
2.000394365 121,713,206 arm_cmn/hnf_mc_reqs,nodeid=0x24/
2.000394365 121,715,680 arm_cmn/hnf_mc_reqs,nodeid=0x28/
2.000394365 121,712,781 arm_cmn/hnf_mc_reqs,nodeid=0x44/
2.000394365 121,715,432 arm_cmn/hnf_mc_reqs,nodeid=0x48/
3.000644408 121,683,890 arm_cmn/hnf_mc_reqs,nodeid=0x24/
3.000644408 121,685,839 arm_cmn/hnf_mc_reqs,nodeid=0x28/
3.000644408 121,682,684 arm_cmn/hnf_mc_reqs,nodeid=0x44/
3.000644408 121,685,669 arm_cmn/hnf_mc_reqs,nodeid=0x48/
Generic bandwidth formula:
(sum of hnf_mc_reqs/second over all HN-F nodes) * 64 bytes = memory bandwidth
Substitute with data from the perf output:
(121713206 + 121715680 + 121712781 + 121715432) * 64 bytes ≈ 29715 MiB/sec
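The arithmetic can be scripted; a small helper using the per-node counts from the sample above (1 MiB = 2^20 bytes):
$ echo 121713206 121715680 121712781 121715432 | \
  awk '{ for (i = 1; i <= NF; i++) sum += $i } END { printf "%d MiB/sec\n", sum * 64 / (1024 * 1024) }'
29715 MiB/sec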
PCI-E RX/TX bandwidth
The RN-I/RN-D events are defined from the perspective of the bridge into the interconnect: the "rdata" events correspond to outbound traffic (data written out to the PCI-E device) and the "wdata" events correspond to inbound traffic (data read in from the PCI-E device).
Measure RND (PCI-E) bandwidth to/from NVMe SSD when running fio
For the test, the NVMe SSD (Optane SSD 900P Series) is on PCI-E Root Complex 0 (PCI0, the Gen3 slot, behind the PCI-E switch).
Run fio in one terminal to read from the NVMe SSD using a 64 KB block size:
$ fio \
--ioengine=libaio --randrepeat=1 --direct=1 --gtod_reduce=1 \
--time_based --readwrite=read --bs=64k --iodepth=64k --name=r0 \
--filename=/dev/nvme0n1p5 --numjobs=1 --runtime=10000
r0: (g=0): rw=read, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=65536
fio-3.1
Starting 1 process
^Cbs: 1 (f=1): [R(1)][0.5%][r=2586MiB/s,w=0KiB/s][r=41.4k,w=0 IOPS][eta 16m:35s]
fio: terminating on signal 2
r0: (groupid=0, jobs=1): err= 0: pid=1443: Thu Dec 19 12:12:10 2019
read: IOPS=41.3k, BW=2581MiB/s (2706MB/s)(12.3GiB/4894msec) <------------------------------- read bandwidth = 2706 MB/sec
bw ( MiB/s): min= 2276, max= 2587, per=98.10%, avg=2532.02, stdev=125.43, samples=6
iops : min=36418, max=41392, avg=40512.33, stdev=2006.90, samples=6
cpu : usr=3.15%, sys=35.15%, ctx=16686, majf=0, minf=1049353
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwt: total=202101,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=65536
Run status group 0 (all jobs):
READ: bw=2581MiB/s (2706MB/s), 2581MiB/s-2581MiB/s (2706MB/s-2706MB/s), io=12.3GiB (13.2GB), run=4894-4894msec
Disk stats (read/write):
nvme0n1: ios=202009/2, merge=0/19, ticks=4874362/51, in_queue=3934760, util=98.06%
Measure with perf in another terminal. Measure rdata/wdata beats; each beat is 32 bytes.
$ perf stat -earm_cmn/rnid_s0_{r,w}data_beats,nodeid=0xc,bynodeid=1/ -I 1000 -a
# time counts unit events
3.000383383 248,145 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
3.000383383 84,728,162 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
4.000522271 248,199 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
4.000522271 84,743,908 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
5.000680779 248,209 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
5.000680779 84,746,976 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
6.000835927 247,899 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
6.000835927 84,417,098 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
Calculate read bandwidth from perf measurement:
84.74e6 wdata beats/second * 32 bytes/beat ≈ 2711 MB/sec
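The conversion can also be applied to live output; a sketch using perf's CSV mode (-x,), where the first field is the interval timestamp and the second the raw count (perf stat prints to stderr):
$ perf stat -x, -a -I 1000 \
    -earm_cmn/rnid_s0_wdata_beats,nodeid=0xc,bynodeid=1/ 2>&1 | \
  awk -F, '$2 ~ /^[0-9]+$/ { printf "%s  %.0f MB/sec\n", $1, $2 * 32 / 1e6 }'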
Measure RND (PCI-E) bandwidth from Ethernet NIC
netperf is executed on the N1SDP to generate network traffic. Run netperf in one terminal:
$ netperf -D 10 -H <remote server> -t TCP_MAERTS -l 0
Interim result: 941.52 10^6bits/s over 10.000 seconds ending at 1576269135.608
Interim result: 941.52 10^6bits/s over 10.000 seconds ending at 1576269145.608
Interim result: 941.52 10^6bits/s over 10.000 seconds ending at 1576269155.608
Interim result: 941.52 10^6bits/s over 10.000 seconds ending at 1576269165.608
…and run perf in another terminal at the same time:
$ perf stat -earm_cmn/rnid_s0_{r,w}data_beats,nodeid=0xc,bynodeid=1/ -I 1000 -a
# time counts unit events
12.001904404 308,803 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
12.001904404 4,024,328 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
13.002047284 308,994 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
13.002047284 4,024,287 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
14.002233364 309,035 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
14.002233364 4,024,470 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
15.002390125 309,162 arm_cmn/rnid_s0_rdata_beats,nodeid=0xc/
15.002390125 4,024,376 arm_cmn/rnid_s0_wdata_beats,nodeid=0xc/
Calculate bandwidth from perf measurement:
4.024e6 wdata beats/second * 32 bytes/beat * 8 bits/byte = 1030e6 bits/second
Copyright (c) 2019-2022, Arm Limited. All rights reserved.