NUMA

Section: Linux Programmer's Manual (3)
Updated: May 2004
 

NAME

numa - NUMA policy library  

SYNOPSIS

#include <numa.h>

cc ... -lnuma

int numa_available(void)

int numa_max_node(void)
int numa_preferred(void)
long numa_node_size(int node, long *freep)

nodemask_t numa_all_nodes
nodemask_t numa_no_nodes
int numa_node_to_cpus(int node, unsigned long *buffer, int bufferlen)

void numa_set_interleave_mask(nodemask_t *nodemask)
nodemask_t numa_get_interleave_mask(void)
void numa_bind(nodemask_t *nodemask)
void numa_set_preferred(int node)
void numa_set_localalloc(int flag)
void numa_set_membind(nodemask_t *nodemask)
nodemask_t numa_get_membind(void)

void *numa_alloc_interleaved_subset(size_t size, nodemask_t *nodemask)
void *numa_alloc_interleaved(size_t size)
void *numa_alloc_onnode(size_t size, int node)
void *numa_alloc_local(size_t size)
void *numa_alloc(size_t size)
void numa_free(void *start, size_t size)

int numa_run_on_node_mask(nodemask_t *nodemask)
int numa_run_on_node(int node)
nodemask_t numa_get_run_node_mask(void)

void numa_interleave_memory(void *start, size_t size, nodemask_t *nodemask)
void numa_tonode_memory(void *start, size_t size, int node)
void numa_tonodemask_memory(void *start, size_t size, nodemask_t *nodemask)
void numa_setlocal_memory(void *start, size_t size)
void numa_police_memory(void *start, size_t size)
void numa_set_bind_policy(int strict)
void numa_set_strict(int strict)
void numa_error(char *where)
extern int numa_exit_on_error  

DESCRIPTION

libnuma offers a simple programming interface to the NUMA policy supported by the Linux kernel. On a NUMA (Non Uniform Memory Access) architecture some memory areas have different latency or bandwidth than others. Available policies are page interleaving, preferred node allocation, local allocation, and allocation only on specific nodes. The library also allows binding threads to specific nodes. All policy exists per thread, but is inherited by children. For setting global policy per process it is easiest to run the process using the numactl(8) utility. For more fine-grained policy inside an application this library can be used.

NUMA memory allocation policy takes effect only when a page is actually faulted into the address space of a process by accessing it. The numa_alloc_* functions take care of this automatically.

A node is defined as an area where all memory has the same speed as seen from a particular CPU. Caches are ignored for this definition.

The mapping of nodes to CPUs depends on the architecture. On the AMD64 architecture each CPU is its own node. This library is only concerned with nodes.

Before any other calls in this library can be used, numa_available must be called. When it returns a negative value, the behavior of all other functions in this library is undefined.

numa_max_node returns the highest node number available on the current system. When a node number or a node mask with a bit set above the value returned by this function is passed to a libnuma function, the result is undefined. The numa_node_size function returns the memory size of a node. When the argument freep is not NULL, the free memory of the node is written to it. On error it returns -1.

Some of these functions accept or return a nodemask. A nodemask has type nodemask_t, which is an abstract bitmap type containing a bit set of nodes. The maximum node number depends on the architecture, but is not bigger than NUMA_MAX_NODE. What happens in libnuma calls when bits above numa_max_node are passed is undefined. A nodemask_t should only be manipulated with the nodemask_zero, nodemask_clr, nodemask_isset, and nodemask_set functions. nodemask_zero clears a nodemask_t, nodemask_isset returns true when node is set in the passed nodemask, nodemask_clr clears node in nodemask, and nodemask_set sets node in nodemask. The predefined variable numa_all_nodes has all available nodes set; numa_no_nodes is the empty set. nodemask_equal returns non zero when the two nodemasks are equal.
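The bit set semantics described above can be illustrated with a small stand-in type. This is only a sketch: demo_mask_t, DEMO_MAX_NODE and the demo_mask_* functions are hypothetical names invented for this example and are not the libnuma definitions. Real programs should use nodemask_t and the nodemask_* functions from numa.h.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical stand-in for nodemask_t; NOT the libnuma definition. */
#define DEMO_MAX_NODE 127
#define DEMO_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS ((DEMO_MAX_NODE + DEMO_BITS_PER_WORD) / DEMO_BITS_PER_WORD)

typedef struct {
    unsigned long bits[DEMO_WORDS];
} demo_mask_t;

/* Clear every node in the mask (models nodemask_zero). */
void demo_mask_zero(demo_mask_t *m)
{
    memset(m, 0, sizeof(*m));
}

/* Add node to the mask (models nodemask_set). */
void demo_mask_set(demo_mask_t *m, int node)
{
    assert(node >= 0 && node <= DEMO_MAX_NODE);
    m->bits[node / DEMO_BITS_PER_WORD] |= 1UL << (node % DEMO_BITS_PER_WORD);
}

/* Remove node from the mask (models nodemask_clr). */
void demo_mask_clr(demo_mask_t *m, int node)
{
    assert(node >= 0 && node <= DEMO_MAX_NODE);
    m->bits[node / DEMO_BITS_PER_WORD] &= ~(1UL << (node % DEMO_BITS_PER_WORD));
}

/* Return non zero when node is in the mask (models nodemask_isset). */
int demo_mask_isset(const demo_mask_t *m, int node)
{
    assert(node >= 0 && node <= DEMO_MAX_NODE);
    return (m->bits[node / DEMO_BITS_PER_WORD] >> (node % DEMO_BITS_PER_WORD)) & 1UL;
}

/* Return non zero when both masks contain the same nodes. */
int demo_mask_equal(const demo_mask_t *a, const demo_mask_t *b)
{
    return memcmp(a->bits, b->bits, sizeof(a->bits)) == 0;
}
```

A mask must be zeroed before nodes are set in it, since an automatic variable of this type starts with indeterminate contents.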

numa_preferred returns the preferred node of the current thread. It is the node the kernel preferably allocates memory on, unless some other policy overrides this.

numa_set_interleave_mask sets the memory interleave mask for the current thread to nodemask. All new memory allocations are page interleaved over all nodes in the interleave mask. Interleaving can be turned off again by passing a zero mask. The page interleaving only occurs on the actual page fault that puts a new page into the current address space. It is also only a hint: the kernel will fall back to other nodes if no memory is available on the interleave target. This is a low level function; it may be more convenient to use the higher level functions like numa_alloc_interleaved or numa_alloc_interleaved_subset.
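As a rough mental model of page interleaving, successive pages can be thought of as being handed out round robin over the nodes set in the mask. The sketch below models only that idea. demo_interleave_node is a hypothetical helper, the mask is simplified to a single unsigned long, and the kernel's real placement (including the fallback to other nodes when a target is full) is not modeled.

```c
#include <limits.h>

/*
 * Idealized model, not kernel behavior: return the node that page number
 * page_index would land on if pages were distributed round robin over the
 * nodes whose bits are set in mask. Returns -1 for an empty mask.
 */
int demo_interleave_node(unsigned long mask, unsigned long page_index)
{
    int nbits = (int)(sizeof(mask) * CHAR_BIT);
    unsigned long count = 0;
    unsigned long target;
    int n;

    /* Count the nodes taking part in the interleave. */
    for (n = 0; n < nbits; n++)
        if (mask & (1UL << n))
            count++;
    if (count == 0)
        return -1;

    /* Pick the (page_index mod count)-th set bit, counting from node 0. */
    target = page_index % count;
    for (n = 0; n < nbits; n++) {
        if (mask & (1UL << n)) {
            if (target == 0)
                return n;
            target--;
        }
    }
    return -1; /* not reached */
}
```

With nodes 0, 1 and 3 in the mask, pages 0, 1, 2, 3 map to nodes 0, 1, 3, 0 in this model, which is why interleaving only shows an effect on areas spanning many pages.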

numa_get_interleave_mask returns the current interleave mask.

numa_bind binds the current thread and its children to the nodes specified in nodemask. They will only run on the CPUs of the specified nodes and will only be able to allocate memory from them. This function is equivalent to calling numa_run_on_node_mask and numa_set_membind with the same argument.

numa_set_preferred sets the preferred node for the current thread to node. The preferred node is the node memory is preferably allocated from before falling back to other nodes. The default is to use the node the process is currently running on (local policy). Passing -1 as the argument is equivalent to numa_set_localalloc.

numa_set_localalloc sets a local memory allocation policy for the current thread. Memory is preferably allocated from the current node.

numa_set_membind sets the memory allocation mask. The thread will only allocate memory from the nodes set in nodemask. Passing an argument of numa_no_nodes or numa_all_nodes turns off memory binding to specific nodes.

numa_get_membind returns the current node mask from which memory can be allocated. numa_no_nodes or numa_all_nodes means all nodes are available for memory allocation.

numa_alloc_interleaved allocates size bytes of memory page interleaved on all nodes. This function is relatively slow and should only be used for large areas consisting of multiple pages. The interleaving works on page level and will only show an effect when the area is large. It must be freed with numa_free. On errors NULL is returned.

numa_alloc_interleaved_subset is like numa_alloc_interleaved except that it also accepts a mask of the nodes to interleave on. On errors NULL is returned.

numa_alloc_onnode allocates memory on a specific node. This function is relatively slow and allocations are rounded to pagesize. The memory must be freed with numa_free. On errors NULL is returned.

numa_alloc_local allocates size bytes of memory on the local node. This function is relatively slow and allocations are rounded to pagesize. The memory must be freed with numa_free. On errors NULL is returned.

numa_alloc allocates size bytes of memory with the current NUMA policy. This function is relatively slow and allocations are rounded to pagesize. The memory must be freed with numa_free. On errors NULL is returned.

numa_free frees size bytes of memory starting at start, allocated by the numa_alloc_* functions above.
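Because the numa_alloc_* functions round allocations to pagesize and numa_free needs the size again, it can help to see how much memory a request really occupies. The following is a minimal sketch of that arithmetic only. demo_round_to_pages and demo_system_page_size are hypothetical helpers, not part of libnuma, and the 4096 fallback is an assumption used only if sysconf fails.

```c
#include <stddef.h>
#include <unistd.h>

/* Round size up to the next multiple of page_size (page_size must be > 0). */
size_t demo_round_to_pages(size_t size, size_t page_size)
{
    return (size + page_size - 1) / page_size * page_size;
}

/* Query the system page size; fall back to an assumed 4096 on failure. */
size_t demo_system_page_size(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    return ps > 0 ? (size_t)ps : (size_t)4096;
}
```

With a 4096 byte page, a request for 4097 bytes occupies 8192 bytes, so many small allocations through these functions waste memory; the caller must also keep the original size around to pass to numa_free.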

numa_run_on_node runs the current thread and its children on a specific node. They will not migrate to CPUs of other nodes until the node affinity is reset with a new call to numa_run_on_node_mask. Passing -1 allows scheduling on all nodes again. Returns a negative value and sets errno on error, or returns 0 on success.

numa_run_on_node_mask runs the current thread and its children only on nodes specified in nodemask. They will not migrate to CPUs of other nodes until the node affinity is reset with a new call to numa_run_on_node_mask. Passing numa_all_nodes allows scheduling on all nodes again. Returns a negative value and sets errno on error, or returns 0 on success.

numa_get_run_node_mask returns the mask of nodes that the current thread is allowed to run on.

numa_interleave_memory page interleaves size bytes of memory starting at start across the nodes in nodemask. This is a lower level function to interleave memory that is allocated but not yet faulted in. Not yet faulted in means the memory is allocated using mmap(2) or shmat(2), but has not been accessed by the current process yet. The memory is page interleaved to all nodes specified in nodemask. Normally numa_alloc_interleaved should be used for private memory instead, but this function is useful to handle shared memory areas. To be useful the memory area should be significantly larger than a page. When the numa_set_strict flag is true, the operation will cause a numa_error if there were already pages in the mapping that do not follow the policy.
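The difference between allocated and faulted in memory can be sketched with plain mmap(2). In this example demo_map_unfaulted and demo_touch_pages are hypothetical helpers, not libnuma functions. No libnuma call is made here; a comment marks the point where a policy call such as numa_interleave_memory would have to happen to take effect.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on strict compilers */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve size bytes of anonymous memory without touching it.
 * Returns NULL on failure. The pages are allocated in the sense of this
 * manual page, but not yet faulted in.
 */
void *demo_map_unfaulted(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Write one byte per page so that every page is faulted in.
 * Any memory policy for the range must be set BEFORE this runs, for
 * example with numa_interleave_memory(start, size, &mask).
 * Returns the number of pages touched.
 */
size_t demo_touch_pages(void *start, size_t size, size_t page_size)
{
    volatile char *p = start;
    size_t off, touched = 0;

    for (off = 0; off < size; off += page_size) {
        p[off] = 0;
        touched++;
    }
    return touched;
}
```

The intended order is map, set policy, then touch. Setting the policy after demo_touch_pages has run is the case where pages may already exist that do not follow the policy.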

numa_tonode_memory puts memory on a specific node. The constraints described for numa_interleave_memory apply here too.

numa_tonodemask_memory puts memory on a specific set of nodes. The constraints described for numa_interleave_memory apply here too.

numa_setlocal_memory locates memory on the current node. The constraints described for numa_interleave_memory apply here too.

numa_police_memory locates memory with the current NUMA policy. The constraints described for numa_interleave_memory apply here too.

numa_node_to_cpus converts a node number to a bitmask of CPUs. The user must pass a long enough buffer. When the buffer is not long enough, errno will be set to ERANGE and -1 returned. On success 0 is returned.
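Sizing and reading a CPU bitmask buffer can be sketched as follows. The helpers demo_cpu_buffer_longs and demo_cpu_isset are hypothetical, and the sketch assumes that bit i of the buffer stands for CPU i, counting from the least significant bit of the first word. This page does not state whether bufferlen is counted in bytes or in words, so check numa.h before passing a length.

```c
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned long words needed to hold one bit per CPU. */
size_t demo_cpu_buffer_longs(unsigned int ncpus)
{
    return (ncpus + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG;
}

/*
 * Return 1 when cpu is set in a bitmask of nlongs words, else 0.
 * Assumes bit i corresponds to CPU i (an assumption of this sketch).
 */
int demo_cpu_isset(const unsigned long *buffer, size_t nlongs, unsigned int cpu)
{
    size_t word = cpu / DEMO_BITS_PER_LONG;

    if (word >= nlongs)
        return 0; /* CPU number lies outside the buffer */
    return (int)((buffer[word] >> (cpu % DEMO_BITS_PER_LONG)) & 1UL);
}
```

A caller would allocate demo_cpu_buffer_longs(ncpus) words, pass the buffer to numa_node_to_cpus, and retry with a larger buffer if the call fails with ERANGE.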

numa_set_bind_policy specifies whether calls that bind memory to a specific node should use the preferred policy or a strict policy. Preferred allows allocating memory on other nodes when there isn't enough free memory on the target node. Strict will fail the allocation in that case. Setting the argument to 1 specifies strict, 0 preferred. Note that specifying more than one node in non strict mode may only use the first node in some kernel versions.

numa_set_strict sets a flag that says whether the functions allocating on specific nodes should use a strict policy. Strict means the allocation will fail if the memory cannot be allocated on the target node. Default operation is to fall back to other nodes. This doesn't apply to interleave and default.
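The difference between strict and preferred placement can be shown with a toy model. demo_place_page is a hypothetical function that decides where one page would go given a table of free pages per node. It is a sketch of the semantics described above, not of the kernel allocator; in particular, the fallback order here is simply ascending node number, which is an assumption of this example.

```c
/*
 * Toy model: choose a node for one page.
 * free_pages[n] is the number of free pages on node n, nnodes the table
 * length, target the requested node, strict the policy flag.
 * Returns the chosen node, or -1 when the allocation fails.
 */
int demo_place_page(const long *free_pages, int nnodes, int target, int strict)
{
    int n;

    if (target < 0 || target >= nnodes)
        return -1;
    if (free_pages[target] > 0)
        return target;      /* target node has room */
    if (strict)
        return -1;          /* strict: fail rather than fall back */

    /* Preferred: fall back to any other node with free memory. */
    for (n = 0; n < nnodes; n++)
        if (free_pages[n] > 0)
            return n;
    return -1;              /* no memory anywhere */
}
```

With free pages {5, 0, 2} and target node 1, the preferred policy places the page on node 0 in this model, while the strict policy fails.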

numa_error is a weak internal libnuma function that can be overridden by the user program. This allows specifying a different error handling strategy when a NUMA system call fails. It does not affect numa_available. The default action is to print an error to stderr and exit the program when numa_exit_on_error is set to a non zero value. Default is zero.

 

THREAD SAFETY

numa_set_bind_policy and numa_exit_on_error are process global. The other calls are thread safe. Memory policy for a specific memory area, when changed, affects the whole process and possibly other processes mapping the same memory.

 

COPYRIGHT

Copyright 2002,2004 Andi Kleen, SuSE Labs. libnuma is under the GNU Lesser General Public License, v2.1.

 

SEE ALSO

getpagesize(2) mmap(2) shmat(2) numactl(8)


 
