In preparation for a research project and potential UKOUG conference papers I am researching the effect of NUMA on x86 systems.
NUMA is one of the key features to understand in modern computer organisation, and I recommend reading “Computer Architecture, Fifth Edition: A Quantitative Approach” by Hennessy and Patterson (make sure you grab the 5th edition). Read the chapter about cache optimisation and also the appendix about the memory hierarchy!
Now why should you know NUMA? First of all there is an increasing number of multi-socket systems. AMD has pioneered the move to a lot of cores, but Intel is not far behind. Although AMD is currently leading in the number of cores (“modules”) on a die, Intel doesn’t need to: the Sandy-Bridge EP processors are way more powerful on a one-to-one comparison than anything AMD has at the moment.
In the example, I am using a blade system with Opteron 61xx processors. The processor has 12 cores according to the AMD hardware reference. The output of /proc/cpuinfo lists 48 “processors”, so it should be fair to say that there are 48/12 = 4 sockets in the system. An AWR report on the machine lists it as 4 sockets, 24 cores and 48 processors. I didn’t think the processor was using SMT; when I find out why AWR reports 24c48t I’ll update the post.
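The socket arithmetic above can be cross-checked against the kernel's own view. The following is a minimal sketch, assuming the usual x86 cpuinfo fields "processor", "physical id" and "cpu cores"; the function name cpu_topology is made up for this example.

```shell
# Count logical CPUs, physical packages and cores per package from a
# cpuinfo-style file. The optional file argument only exists so the function
# can be tried against a saved copy; it defaults to the live /proc/cpuinfo.
cpu_topology() {
    awk -F': *' '
        /^processor/   { logical++ }          # one stanza per logical CPU
        /^physical id/ { sockets[$2] = 1 }    # distinct package ids
        /^cpu cores/   { cores = $2 }         # cores per package as reported
        END {
            n = 0
            for (s in sockets) n++
            printf "logical=%d sockets=%d cores_per_socket=%d\n", logical, n, cores
        }
    ' "${1:-/proc/cpuinfo}"
}
```

Calling cpu_topology without arguments reads /proc/cpuinfo on the running system; what the kernel reports for "cpu cores" on a given processor is something to check rather than assume.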
Anyway, I ensured that the kernel command line (/proc/cmdline) didn’t include numa=off, which the oracle-validated RPM sets. Then after a reboot here’s the result:
$ numactl --hardware
available: 8 nodes (0-7)
node 0 size: 4016 MB
node 0 free: 378 MB
node 1 size: 4040 MB
node 1 free: 213 MB
node 2 size: 4040 MB
node 2 free: 833 MB
node 3 size: 4040 MB
node 3 free: 819 MB
node 4 size: 4040 MB
node 4 free: 847 MB
node 5 size: 4040 MB
node 5 free: 834 MB
node 6 size: 4040 MB
node 6 free: 851 MB
node 7 size: 4040 MB
node 7 free: 749 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10
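To avoid adding up the node sizes by hand, the "size" lines of this output can be summed with a little awk. This is only a sketch; the function name sum_node_sizes is hypothetical, and it assumes the "node N size: X MB" line format shown above.

```shell
# Sum the "node N size: X MB" lines of `numactl --hardware` output read from
# stdin and print the number of nodes and the total size in MB.
sum_node_sizes() {
    awk '$1 == "node" && $3 == "size:" { total += $4; nodes++ }
         END { printf "%d nodes, %d MB\n", nodes, total }'
}
```

Used as numactl --hardware | sum_node_sizes. With the figures above (4016 MB plus seven times 4040 MB) the total comes to 32296 MB.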
Right, I have 8 NUMA nodes, numbered 0 to 7, and the total RAM on the machine is 32 GB. Huge pages are allocated for another database to allow for a 24 GB SGA. A lot of information about NUMA can be found in sysfs, which is now mounted by default on RHEL and Oracle Linux. Have a look at /sys/devices/system/node:
$ ls
node0 node1 node2 node3 node4 node5 node6 node7
$ ls node0
cpu0 cpu12 cpu16 cpu20 cpu4 cpu8 cpumap distance meminfo numastat
For each NUMA node shown in the output of numactl --hardware there is a subdirectory nodeN, which also shows the processors that form the node. Oracle Linux 6.x offers a file called cpulist; previous releases with the RHEL-compatible kernel should have subdirectories cpuX. Interestingly, the file meminfo holds memory information local to the NUMA node, and the file distance holds that node’s row of the distance matrix reported by numactl --hardware. So far I have only seen distances of 10 or 20. If anyone knows where these numbers come from or has seen other figures please let me know!
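The per-node directories can be walked in one go. Here is a minimal sketch that prints the CPUs and the distance row for each node, using cpulist where it exists and falling back to the cpuX entries otherwise. The function name show_numa_nodes and the optional directory argument are assumptions made for this example.

```shell
# Print the CPUs and the distance vector of every NUMA node found under a
# sysfs-style directory. The directory argument is only there so the function
# can be tried on a copy; it defaults to /sys/devices/system/node.
show_numa_nodes() {
    base=${1:-/sys/devices/system/node}
    for dir in "$base"/node[0-9]*; do
        [ -d "$dir" ] || continue
        if [ -r "$dir/cpulist" ]; then
            cpus=$(cat "$dir/cpulist")
        else
            # No cpulist file: derive the list from the cpuN entries instead.
            cpus=$(ls "$dir" | sed -n 's/^cpu\([0-9][0-9]*\)$/\1/p' | sort -n | paste -sd, -)
        fi
        printf '%s: cpus=%s distance=%s\n' "${dir##*/}" "$cpus" "$(cat "$dir/distance" 2>/dev/null)"
    done
}
```

Run without arguments it reads the live sysfs tree and prints one line per node.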
Another useful tool to know is numastat, which presents per-node memory statistics, including cross-node memory requests:
$ numastat
node0 node1 node2 node3
numa_hit 3048548 25344114 14523218 13498057
numa_miss 0 0 0 0
numa_foreign 0 0 0 0
interleave_hit 8196 390371 415719 458362
local_node 2415628 24965781 14059618 12907752
other_node 632920 378333 463600 590305
node4 node5 node6 node7
numa_hit 9295098 4072364 3730878 3659625
numa_miss 0 0 0 0
numa_foreign 0 0 0 0
interleave_hit 512399 451099 417627 390960
local_node 8637176 3483582 3152133 3159090
other_node 657922 588782 578745 500535
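The raw counters are easier to compare as a ratio. According to the numastat man page, local_node counts memory allocated on a node while the process was running on it, and other_node counts memory allocated on that node while the process was running elsewhere. The sketch below computes the other_node share per node; the function name remote_share is hypothetical, and it assumes the default numastat layout shown above, including its split into several column blocks.

```shell
# Read default `numastat` output on stdin and print, for each node, the share
# of allocations on that node made on behalf of processes running elsewhere:
# other_node / (local_node + other_node).
remote_share() {
    awk '
        # Header line of a column block: remember which column is which node.
        $1 ~ /^node[0-9]+$/ { for (i = 1; i <= NF; i++) name[i] = $i; next }
        $1 == "local_node"  { for (i = 2; i <= NF; i++) loc[name[i-1]] = $i }
        $1 == "other_node"  {
            for (i = 2; i <= NF; i++) {
                n = name[i-1]
                total = loc[n] + $i
                printf "%s %.1f%%\n", n, (total ? 100 * $i / total : 0)
            }
        }
    '
}
```

Used as numastat | remote_share. It relies on local_node appearing before other_node within each block, as it does in the output above.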
Oracle and NUMA
Oracle has an if-then-else approach to NUMA, as a post from Kevin Closson has already explained. I’m on 11.2.0.3 and need to use “_enable_numa_support” to enable NUMA support in the database. Before that, however, I thought I’d give the numactl command a chance and bind the instance to node 7 (both for processor and memory).
This is easily done:
[oracle@server1 ~]> numactl --membind=7 --cpunodebind=7 sqlplus / as sysdba <
Have a look at the numactl man page if you want to learn more about the options.
Now how can you check if it respected your settings? Simple enough, the tool is called “taskset”. Despite what the name may suggest, it can not only set a task’s CPU affinity but also report it. A simple one-liner does that for my database SLOB:
$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 1434's current affinity list: 3,7,11,15,19,23
pid 1436's current affinity list: 3,7,11,15,19,23
pid 1438's current affinity list: 3,7,11,15,19,23
pid 1442's current affinity list: 3,7,11,15,19,23
pid 1444's current affinity list: 3,7,11,15,19,23
pid 1446's current affinity list: 3,7,11,15,19,23
pid 1448's current affinity list: 3,7,11,15,19,23
pid 1450's current affinity list: 3,7,11,15,19,23
pid 1452's current affinity list: 3,7,11,15,19,23
pid 1454's current affinity list: 3,7,11,15,19,23
pid 1456's current affinity list: 3,7,11,15,19,23
pid 1458's current affinity list: 3,7,11,15,19,23
pid 1460's current affinity list: 3,7,11,15,19,23
pid 1462's current affinity list: 3,7,11,15,19,23
pid 1464's current affinity list: 3,7,11,15,19,23
pid 1466's current affinity list: 3,7,11,15,19,23
pid 1470's current affinity list: 3,7,11,15,19,23
pid 1472's current affinity list: 3,7,11,15,19,23
pid 1489's current affinity list: 3,7,11,15,19,23
pid 1694's current affinity list: 3,7,11,15,19,23
pid 1696's current affinity list: 3,7,11,15,19,23
pid 5041's current affinity list: 3,7,11,15,19,23
pid 13374's current affinity list: 3,7,11,15,19,23
Is that really node7? Checking the cpus in node7:
$ ls node7
cpu11 cpu15 cpu19 cpu23 cpu3 cpu7
That’s us! Ok that worked.
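Comparing the two lists by eye works for a handful of processes, but the check can also be scripted. This is a rough sketch with made-up helper names (cpu_names_to_list, affinity_from_line, same_cpus). It compares plain strings, so it assumes taskset prints single CPUs as in the listing above rather than ranges such as 0-5.

```shell
# Turn sysfs entry names such as "cpu11 cpu3 cpu7 cpumap" (read from stdin)
# into a sorted, comma-separated CPU list; names that are not cpuN are ignored.
cpu_names_to_list() {
    tr ' ' '\n' | sed -n 's/^cpu\([0-9][0-9]*\)$/\1/p' | sort -n | paste -sd, -
}

# Extract the CPU list from one line of `taskset -c -p` output.
affinity_from_line() {
    printf '%s\n' "$1" | sed 's/.*: //'
}

# Succeed when the affinity in a taskset line ($1) equals a CPU list ($2).
# Plain string comparison: CPU ranges are not expanded.
same_cpus() {
    [ "$(affinity_from_line "$1")" = "$2" ]
}
```

For example: same_cpus "$(taskset -c -p 1434)" "$(ls /sys/devices/system/node/node7 | cpu_names_to_list)" && echo "bound to node 7".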
_enable_NUMA_support
The next test I did was to see how Oracle handles NUMA in the database. There was a bit of an enable/don’t enable/enable/don’t enable from 10.2 to 11.2. If the MOS notes are correct then NUMA support is now turned off by default, and the underscore parameter _enable_NUMA_support turns it on again. At least on my 11.2.0.3.2 system on Linux no relinking of the oracle binary was necessary.
But to my surprise I saw this after starting the database with NUMA support enabled:
$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 17513's current affinity list: 26,30,34,38,42,46
pid 17515's current affinity list: 26,30,34,38,42,46
pid 17517's current affinity list: 26,30,34,38,42,46
pid 17521's current affinity list: 26,30,34,38,42,46
pid 17523's current affinity list: 26,30,34,38,42,46
pid 17525's current affinity list: 26,30,34,38,42,46
pid 17527's current affinity list: 26,30,34,38,42,46
pid 17529's current affinity list: 26,30,34,38,42,46
pid 17531's current affinity list: 0,4,8,12,16,20
pid 17533's current affinity list: 24,28,32,36,40,44
pid 17535's current affinity list: 1,5,9,13,17,21
pid 17537's current affinity list: 25,29,33,37,41,45
pid 17539's current affinity list: 2,6,10,14,18,22
pid 17541's current affinity list: 26,30,34,38,42,46
pid 17543's current affinity list: 27,31,35,39,43,47
pid 17545's current affinity list: 3,7,11,15,19,23
pid 17547's current affinity list: 24,28,32,36,40,44
pid 17549's current affinity list: 26,30,34,38,42,46
pid 17551's current affinity list: 26,30,34,38,42,46
pid 17553's current affinity list: 26,30,34,38,42,46
pid 17555's current affinity list: 26,30,34,38,42,46
pid 17557's current affinity list: 26,30,34,38,42,46
pid 17559's current affinity list: 26,30,34,38,42,46
pid 17563's current affinity list: 26,30,34,38,42,46
pid 17565's current affinity list: 26,30,34,38,42,46
pid 17568's current affinity list: 0,4,8,12,16,20
pid 17577's current affinity list: 0,4,8,12,16,20
pid 17584's current affinity list: 0,4,8,12,16,20
pid 17597's current affinity list: 0,4,8,12,16,20
pid 17599's current affinity list: 24,28,32,36,40,44
Interesting: the database, with an otherwise identical pfile (and a SLOB PIO SGA of 270 MB), is now distributed across lots of NUMA nodes… watch out for that interleaved memory transfer!
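With this many processes it helps to group the taskset output by affinity list instead of reading it line by line. A small sketch, with summarise_affinity as a made-up name:

```shell
# Read `taskset -c -p` output lines on stdin and print how many processes
# share each affinity list, most common first. Lines that do not report an
# affinity list are ignored.
summarise_affinity() {
    awk -F': ' '/current affinity list/ { count[$2]++ }
                END { for (list in count) printf "%d %s\n", count[list], list }' |
        sort -k1,1nr -k2,2
}
```

Piping the loop shown earlier into summarise_affinity does the job. For the listing above that means 17 processes on 26,30,34,38,42,46, 5 on 0,4,8,12,16,20, 3 on 24,28,32,36,40,44 and one each on the remaining five lists, so all eight nodes are in use.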
It doesn’t help to use numactl to force the creation of processes on a node. Oracle now seems to use NUMA API calls internally and overrides your command:
$ numactl --membind=7 --cpunodebind=7 sqlplus / as sysdba <<EOF
> startup
> EOF
...
$ for i in `ps -ef | awk '/SLOB/ {print $2}'`; do taskset -c -p $i; done
pid 20155's current affinity list: 3,7,11,15,19,23
pid 20157's current affinity list: 3,7,11,15,19,23
pid 20160's current affinity list: 3,7,11,15,19,23
pid 20164's current affinity list: 3,7,11,15,19,23
pid 20166's current affinity list: 3,7,11,15,19,23
pid 20168's current affinity list: 3,7,11,15,19,23
pid 20170's current affinity list: 3,7,11,15,19,23
pid 20172's current affinity list: 3,7,11,15,19,23
pid 20174's current affinity list: 0,4,8,12,16,20
pid 20176's current affinity list: 24,28,32,36,40,44
pid 20178's current affinity list: 1,5,9,13,17,21
pid 20180's current affinity list: 25,29,33,37,41,45
pid 20182's current affinity list: 2,6,10,14,18,22
pid 20184's current affinity list: 26,30,34,38,42,46
pid 20186's current affinity list: 27,31,35,39,43,47
pid 20188's current affinity list: 3,7,11,15,19,23
pid 20190's current affinity list: 24,28,32,36,40,44
pid 20192's current affinity list: 3,7,11,15,19,23
pid 20194's current affinity list: 3,7,11,15,19,23
pid 20196's current affinity list: 3,7,11,15,19,23
pid 20198's current affinity list: 3,7,11,15,19,23
pid 20200's current affinity list: 3,7,11,15,19,23
pid 20202's current affinity list: 3,7,11,15,19,23
pid 20206's current affinity list: 3,7,11,15,19,23
pid 20208's current affinity list: 3,7,11,15,19,23
pid 20211's current affinity list: 0,4,8,12,16,20
pid 20240's current affinity list: 0,4,8,12,16,20
pid 20363's current affinity list: 0,4,8,12,16,20
sched_getaffinity: No such process
failed to get pid 20403's affinity
Little things I didn’t know! So next time I benchmark I will keep that in mind!