Difference between perf events for Intel processors
What is the difference between the following perf events for Intel processors:
- UNC_CHA_DIR_UPDATE.HA: Counts only multi-socket cacheline Directory state updates memory writes issued from the HA pipe. This does not include memory write requests which are for I (Invalid) or E (Exclusive) cachelines.
- UNC_CHA_DIR_UPDATE.TOR: Counts only multi-socket cacheline Directory state updates due to memory writes issued from the TOR pipe which are the result of remote transaction hitting the SF/LLC and returning data Core2Core. This does not include memory write requests which are for I (Invalid) or E (Exclusive) cachelines.
- UNC_M2M_DIRECTORY_UPDATE.ANY: Counts when the M2M (Mesh to Memory) updates the multi-socket cacheline Directory to a new state.
The above description about perf events is taken from here.
In particular, if a directory update is caused by a memory write request coming from a remote socket, which perf event, if any, accounts for it?
As I understand it, the CHA handles requests arriving from remote sockets over UPI, so directory updates caused by remote requests should show up in UNC_CHA_DIR_UPDATE.HA or UNC_CHA_DIR_UPDATE.TOR. But when I run a program (described below), the UNC_M2M_DIRECTORY_UPDATE.ANY count is much larger (more than 34M), whereas the other two events have counts on the order of a few thousand. Since the only writes taking place are those coming from the remote socket, it seems that UNC_M2M_DIRECTORY_UPDATE.ANY, and not the other two events, measures the directory updates caused by remote writes.
Description of the system:
- Intel Xeon Gold 6242 CPU (Intel Cascade Lake architecture)
- 4 sockets with each socket having PMEM attached to it
- part of the PMEM is configured to be used as a system RAM on sockets 2 and 3
- OS: Linux (kernel 5.4.0-72-generic)
Description of the program:
Note: numactl is used to bind the process to node 2, which is a DRAM node.
- Allocate two buffers of size 1GB each
- Initialize these buffers
- Move the second buffer to the PMEM attached to socket 3
- Perform a data copy from the first buffer to the second buffer
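For anyone trying to reproduce the measurement, here is a minimal sketch of how the three counters could be collected side by side for a run like the one described above. It only builds the command line and does not execute it. Several things in it are assumptions and not part of my original setup: the binary name `./copy_test` is hypothetical, the lowercase event names assume the installed perf exposes these uncore events under those aliases (check with `perf list`), and binding both CPUs and memory to node 2 is one possible reading of "bind the process to node 2".

```python
import shlex

# The three uncore events being compared. The lowercase spelling is an
# assumption about how the local perf build names them; verify with `perf list`.
EVENTS = [
    "unc_cha_dir_update.ha",
    "unc_cha_dir_update.tor",
    "unc_m2m_directory_update.any",
]


def build_perf_command(program, node=2):
    """Return the argv list for a system-wide `perf stat` run of `program`.

    `program` is a list such as ["./copy_test"] (hypothetical binary name).
    `node` is the NUMA node the process is bound to via numactl.
    """
    return [
        "perf", "stat",
        "-a",                     # uncore counters are not per-task, so count system-wide
        "-e", ",".join(EVENTS),   # all three events in a single run
        "--",
        "numactl",
        f"--cpunodebind={node}",  # assumption: run on the CPUs of this node
        f"--membind={node}",      # assumption: initial allocations come from this node
        *program,
    ]


if __name__ == "__main__":
    cmd = build_perf_command(["./copy_test"])
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(cmd))
```

Counting all three events in the same run keeps the numbers comparable, since they then cover exactly the same copy.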
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow