Huge Page usage in PHP 7

Recall on memory paging

Memory paging is a way Operating Systems manage userland process memory.
Each process memory access is virtual, and the OS together with the hardware MMU must translate that address into a physical address used to access the data in main memory (RAM).

Paging memory is dividing memory in chunks of fixed size, called pages. Under Linux, page size is usually 4Kb for many platforms, including X86/64.
Each memory page address translation is stored into a table attached to each process, this is what's called the Page Table Entry. To prevent the OS from accessing that table (stored into memory) at each memory access - leading to having two memory accesses to manage for each memory access demand - the hardware helps the OS in the use of a little cache for translations : the Translation-lookaside Buffer (TLB). This is stored into the MMU and is extremely fast and efficient, as soon as the translation is found into the TLB : a TLB hit. If not, then we happen having a TLB miss, and must access slow memory to find the translation an update the TLB (the OS Kernel does this job) to finally load our data from known memory address.

All those concepts are explained deeper in my segmentation fault article , and I highly recommand you to read it if not familiar to virtual memory management in modern Operating Systems.

Why use huge pages ?

The concept is easy. If we make the OS Kernel use bigger page sizes, that means that more data can be accessed into one single page. That also means that we'll suffer from less TLB miss, once the page translation is stored into the TLB, because one translation will now be valid for more data.
The Linux Kernel used to use default 4Kb page size on X86/64 architectures, but back in ~2.6.20, the Kernel evolved and has been added the huge page concept. A huge page is typically 512 times larger than a traditionnal page : it fits 2Mb. You can read many articles (1) about (2) huge pages (3) management in the linux Kernel, you'll then learn that usually the Kernel performs what's called transparent huge page mappings (Kernel 2.6.38) : it will map virtual memory onto traditionnal 4Kb pages, but will try sometimes (you can have hand on this) to merge contiguous pages into huge pages. This is especially done for the heap segment of processes, that can represent a very huge address space, but care should be taken as this memory may be given back to the OS (freed) by small chunks, hence wasting huge page space and probably forcing the Kernel to undo its job and shrink back a huge page into 512 smaller 4Kb pages.

However, the user process may itself ask for huge page based mappings specifically. If you are sure you're gonna fill-in the huge page space, then it is better to ask the Kernel to give you a huge page for such virtual memory page. Huge page means less pages to manage, that means smaller lists to browse by the Kernel in the process page table entry, and that means as well less entries in the crucial TLB cache. The system then becomes more efficient, more responsive and faster.

OPCache coming in help in PHP 7

In PHP7, we worked very hard in improving memory efficiency. That means that we reworked crucial PHP internal structures, to make them fit into CPU caches more efficiently. We worked hard on data spacial locality, that means that more contiguous data fit into CPU caches, and then less memory accesses in main engine parts.
This is the basic concept behind PHP 7 overall performances, and those are pretty easier to get under C than C++ (because C++ is a larger language, more complex and harder to master, together with its compiler).

In addition to that, we pushed the concepts of better memory efficiency even further by also experiencing Kernel huge pages.
In PHP 7, the OPCache extension I already wrote about for PHP 5 has been added features to play with Kernel huge pages. Let's see that together.

Asking for huge pages

Under many Unix flavors, you are povided with two APIs to play with virtual memory mappings. The so-well-called mmap() function (prefered, because it really plays with memory mappings and can really allocate huge pages if asked for), or madvise() (that just hints (adivses) the Kernel about a memory zone, f.e that the current page should be turned to huge page, with no guaranty it will be done).
First, you need to make sure your OS has huge page support and some free huge pages; If not, allocation will fail. Tune vm.nr_hugepages using sysctl.
Then, cat /proc/meminfo to confirm some huge pages are available.

> cat /proc/meminfo
HugePages_Total:      20
HugePages_Free:       20
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Here, 20 huge pages are available, each huge page is 2Mb size (Linux can use huge pages of up to 1Gb on X86/64 platforms, thus I dont recommand such a size for PHP, but RDBMS will benefit from such page sizes).

Then we may use the API. Of course, you can't map a memory segment on a huge page if this latter is not aligned on huge page boundary, so first, you must align your address (what must anyway be done for overall CPU efficiency), then ask for the Kernel to map it using a huge page. We will align it using the C language as in the example, the to-be-aligned buffer comes from the heap, though some alignment functions exist (like posix_memalign()), but for crossplatform compatibility, we won't make use of this latter.

Here is the simple little example :

	#include <stdio.h>
	#include <sys/mman.h>
	#include <string.h>
	#include <stdlib.h>

	#define ALIGN 1024*1024*2 /* We assume huge pages are 2Mb */
	#define SIZE 1024*1024*32 /* Let's allocate 32Mb */

	int main(int argc, char *argv[])
	{
		void *addr;
		void *buf = NULL;
		void *aligned_buf;
	
		/* As we're gonna align on 2Mb, we need to allocate 34Mb if
		we want to be sure we can use huge pages on 32Mb total */
		buf = malloc(SIZE + ALIGN);
	
		if (!buf) {
			perror("Could not allocate memory");
			exit(1);
		}
	
		printf("buf is at: %p\n", buf);

		*/ Align on ALIGN boundary */
		aligned_buf = (void *) ( ((unsigned long)buf + ALIGN - 1) & ~(ALIGN -1) );
	
		printf("aligned buf: %p\n", aligned_buf);

		/* Turn the address to huge page backed address, using MAP_HUGETLB */
		addr = mmap(aligned_buf, SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB | MAP_FIXED, -1, 0);

		if (addr == MAP_FAILED) {
			printf("failed mapping, check address or huge page support\n");
			exit(0);
		}

		printf("mmapped: %p with huge page usage\n", addr);

		return 0;
	}

There should be nothing special to explain here, if you are used to C programming. Memory is not freed explicitely because the program will anyway end and is dedicated to explain you the concepts.

If we look at memory information when this process has mapped memory and is about to exit, we can see that reserved huge page have been asked by the kernel :

HugePages_Total:      20
HugePages_Free:       20
HugePages_Rsvd:       16
HugePages_Surp:        0
Hugepagesize:       2048 kB

Reserved, because you remember how virtual memory works : the kernel will not pagein the pages until you write to them. Here, 16 pages are marked reserved, and 16 * 2MB effectively equals the 32Mb segment we mapped using mmap().

So far so good, this API works like we just demonstrated.

PHP 7 code segment mapped on huge pages

If you look at your PHP 7 code segment, you'll see that this latter is pretty big. On my LP64 X86/64 traditionnal machine, it is about 9Mb large (for a debug build).
Let's see it :

> cat /proc/8435/maps
00400000-00db8000 r-xp 00000000 08:01 4196579                            /home/julien.pauli/php70/nzts/bin/php /* text segment */
00fb8000-01056000 rw-p 009b8000 08:01 4196579                            /home/julien.pauli/php70/nzts/bin/php
01056000-01073000 rw-p 00000000 00:00 0 
02bd0000-02ce8000 rw-p 00000000 00:00 0                                  [heap]
... ... ...

The memory text segment is starting at virtual memory 00400000 and ending at 00db8000 (in the above example), it is then about 9Mb large.
That means that the PHP binary machine instructions total 9Mb. Yes, PHP is growing more and more as time goes. It evolves and is added more and more features, meaning more and more C code, translated into more and more CPU low level instructions.

PHP text segment is 9Mb large so far on LP64. That means that classical compiled (debug) PHP machine instructions actually weigh 9Mb.

If we look further at this memory segment, we will see that it is mapped by the Kernel using traditional 4Kb large pages :

> cat /proc/8435/smaps
00400000-00db8000 r-xp 00000000 08:01 4196579  /home/julien.pauli/php70/nzts/bin/php
Size:               9952 kB /* VM size */
Rss:                1276 kB /* PM busy load */
Pss:                1276 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:      1276 kB
Private_Dirty:         0 kB
Referenced:         1276 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB /* page size is 4Kb */
MMUPageSize:           4 kB
Locked:                0 kB

Like you can see, the Kernel did not perform transparent huge page usage for this segment. Perhaps it will do it in the future when I further use my PHP process (which pid id 8435 here) ? We don't know, as we are not Kernel programmers and don't want to dive into Kernel code managing huge pages (for this article purpose).

However what we can do, is remap this segment onto a huge page managed segment.
This is what OPCache does.

We may do it, because the code segment does not grow nor shrink while the process leaves. If it could, then a huge page could not be that good. But here, we know the segment will not move, and those 9952Kb of code could be mapped using 4 huge pages (huge page are 2Mb size) and the rest could use traditionnal 4Kb page size.

Mapping the code segment on huge pages

If you use PHP 7, your system supports huge pages, if you give opcache.huge_code_pages INI setting the value 1 (default 0), and if PHP is not a webserver module (and thus has its own memory image), then OPCache will try to map your code segment onto huge pages as soon as it starts. This is done in the accel_move_code_to_huge_pages() function.

Let's see that :

	static void accel_move_code_to_huge_pages(void)
	{
		FILE *f;
		long unsigned int huge_page_size = 2 * 1024 * 1024;

		f = fopen("/proc/self/maps", "r");
		if (f) {
			long unsigned int  start, end, offset, inode;
			char perm[5], dev[6], name[MAXPATHLEN];
			int ret;

			ret = fscanf(f, "%lx-%lx %4s %lx %5s %ld %s\n", &start, &end, perm, &offset, dev, &inode, name);
			if (ret == 7 && perm[0] == 'r' && perm[1] == '-' && perm[2] == 'x' && name[0] == '/') {
				long unsigned int  seg_start = ZEND_MM_ALIGNED_SIZE_EX(start, huge_page_size);
				long unsigned int  seg_end = (end & ~(huge_page_size-1L));

				if (seg_end > seg_start) {
					zend_accel_error(ACCEL_LOG_DEBUG, "remap to huge page %lx-%lx %s \n", seg_start, seg_end, name);
					accel_remap_huge_pages((void*)seg_start, seg_end - seg_start, name, offset + seg_start - start);
				}
			}
			fclose(f);
		}
	}

What we see here, is that OPCache opens itself /proc/self/maps to find the code memory segment. It has no other way to do it because accessing such information can't be done without the explicit usage of kernel dependencies, and we don't want to rely on such.
The procfs is nowadays present on every Unix based system, I don't know any serious system not publishing the procfs : so many user tools rely on it today (ps, free, uptime, dmesg, sysctl etc...)

We scan the file to find the code segment, then we align its bounds : start and end, on a huge page boundary address.
Then accel_remap_huge_pages() is called with the aligned bounds.

	# if defined(MAP_HUGETLB) || defined(MADV_HUGEPAGE)
	static int accel_remap_huge_pages(void *start, size_t size, const char *name, size_t offset)
	{
		void *ret = MAP_FAILED;
		void *mem;

		mem = mmap(NULL, size,
			PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS,
			-1, 0);
		if (mem == MAP_FAILED) {
			return -1;
		}
		memcpy(mem, start, size);

	#  ifdef MAP_HUGETLB
		ret = mmap(start, size,
			PROT_READ | PROT_WRITE | PROT_EXEC,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_HUGETLB,
			-1, 0);
	#  endif
	#  ifdef MADV_HUGEPAGE
		if (ret == MAP_FAILED) {
			ret = mmap(start, size,
				PROT_READ | PROT_WRITE | PROT_EXEC,
				MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
				-1, 0);
			if (-1 == madvise(start, size, MADV_HUGEPAGE)) {
				munmap(mem, size);
				return -1;
			}
		}
	#  endif
		if (ret == start) {
			memcpy(start, mem, size);
			mprotect(start, size, PROT_READ | PROT_EXEC);
		}
		munmap(mem, size);

		return (ret == start) ? 0 : -1;
	}
	#endif

Easy enough, we create a new temporary buffer (mem) to copy the data into it, then we try to mmap() the aligned buffer onto huge pages.
If that fails, as a second chance, we try to hint the kernel using madvise() version.
Once the segment is mapped, we simply copy back the original data into it, then we return.

Here are the detailed segments :

00400000-00c00000 r-xp 00000000 00:0b 1008956  /anon_hugepage
Size:               8192 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB
Locked:                0 kB
00c00000-00db8000 r-xp 00800000 08:01 4196579 /home/julien.pauli/php70/nzts/bin/php
Size:               1760 kB
Rss:                 224 kB
Pss:                 224 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:       224 kB
Private_Dirty:         0 kB
Referenced:          224 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB

8Mb mapped on huge pages, and the rest of the text segment (1760Kb) is mapped with traditionnal 4Kb pages.
Done !

We have just turned our code segment to huge pages. Zend provided me the 3% overall performance improvement under high load.
I myself did not perform tests, but the results are what we expect.

Using huge pages, we require 512 times pages less than with default 4Kb pages.
The TLB cache is then more often used, much more, leading to less memory accesses. And we are talking about the code segment, that means every PHP process machine instruction load the CPU has to perfom, is optimized. Millions of them per seconds, that may lead to interesting overall results, right ?

Conclusion

We've seen that PHP 7 OPCache extension has evolved and now allows the Linux user to get more performance by using a now-widely-deployed Kernel memory management technic known as "huge pages". Should you know that big heap applications like RDBMS already performed such an improvement few years ago, like Oracle or PostgreSQL.

Now PHP does so, with PHP 7 and through the usage of OPCache extension.