Arthur Mongodin 12 min

In our previous article Yet another bug into Netfilter, I presented a vulnerability found within the netfilter subsystem of the Linux kernel. During my investigation, I found a weird comparison that does not fully protect a copy within a buffer. It led to a heap buffer overflow that was exploited to obtain root privileges on Ubuntu 22.04.

A small jump in the past

In the last episode, we reached an out-of-bound within the nft_set structure (/include/net/netfilter/nf_tables.h).

struct nft_set {
    struct list_head        list;
    struct list_head        bindings;
    struct nft_table        *table;
    possible_net_t          net;
    char                *name;
    u64             handle;
    u32             ktype;
    u32             dtype;
    u32             objtype;
    u32             size;
    u8              field_len[NFT_REG32_COUNT];
    u8              field_count;
    u32             use;
    atomic_t            nelems;
    u32             ndeact;
    u64             timeout;
    u32             gc_int;
    u16             policy;
    u16             udlen;
    unsigned char           *udata;
    /* runtime data below here */
    const struct nft_set_ops    *ops ____cacheline_aligned;
    u16             flags:14,
    u8              klen;
    u8              dlen;
    u8              num_exprs;
    struct nft_expr         *exprs[NFT_SET_EXPR_MAX];
    struct list_head        catchall_list;
    unsigned char           data[]

As nft_set contains lots of data, some other fields of this structure could be used to get a better write primitive. I decided to search around the length fields (udlen, klen and dlen) because it could be helpful to perform some overflows.

Research for Code and Anomalies

Exploring the different accesses to the field dlen, a call to the memcpy function (1) in nft_set_elem_init (/net/netfilter/nf_tables_api.c) hold my attention.

void *nft_set_elem_init(const struct nft_set *set,
            const struct nft_set_ext_tmpl *tmpl,
            const u32 *key, const u32 *key_end,
            const u32 *data, u64 timeout, u64 expiration, gfp_t gfp)
    struct nft_set_ext *ext;
    void *elem;

    elem = kzalloc(set->ops->elemsize + tmpl->len, gfp);                <===== (0)
    if (elem == NULL)
        return NULL;

    ext = nft_set_elem_ext(set, elem);
    nft_set_ext_init(ext, tmpl);

    if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY))
        memcpy(nft_set_ext_key(ext), key, set->klen);
    if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END))
        memcpy(nft_set_ext_key_end(ext), key_end, set->klen);
    if (nft_set_ext_exists(ext, NFT_SET_EXT_DATA))
        memcpy(nft_set_ext_data(ext), data, set->dlen);                 <===== (1)


    return elem;

This call is suspicious because two different objects are used. The destination buffer is stored within an nft_set_ext object, ext, whereas the size of copy is extracted from an nft_set object. The object ext is dynamically allocated at (0) with elem and the size reserved for it is tmpl->len. I wanted to check that the value stored in set->dlen is used to compute the value stored in tmpl->len.

The wrong place

nft_set_elem_init is called (5) within the function nft_add_set_elem (/net/netfilter/nf_tables_api.c), which is responsible to add an element to a netfilter set.

static int nft_add_set_elem(struct nft_ctx *ctx, struct nft_set *set,

                const struct nlattr *attr, u32 nlmsg_flags)
    struct nlattr *nla[NFTA_SET_ELEM_MAX + 1];
    struct nft_set_ext_tmpl tmpl;
    struct nft_set_elem elem;                                                   <===== (2)
    struct nft_data_desc desc;
    if (nla[NFTA_SET_ELEM_DATA] != NULL) {
        err = nft_setelem_parse_data(ctx, set, &desc, &,           <===== (3)
        if (err < 0)
            goto err_parse_key_end;


        nft_set_ext_add_length(&tmpl, NFT_SET_EXT_DATA, desc.len);              <===== (4)
    err = -ENOMEM;
    elem.priv = nft_set_elem_init(set, &tmpl,,                <===== (5)
                      timeout, expiration, GFP_KERNEL);
    if (elem.priv == NULL)
        goto err_parse_data;

As you can observe, set->dlen is not used to reserve the space for data associated with the id NFT_SET_EXT_DATA, instead it is desc.len (5). desc is initialized within the function nft_setelem_parse_data (/net/netfilter/nf_tables_api.c) called at (3)

static int nft_setelem_parse_data(struct nft_ctx *ctx, struct nft_set *set,
                  struct nft_data_desc *desc,
                  struct nft_data *data,
                  struct nlattr *attr)
    int err;

    err = nft_data_init(ctx, data, NFT_DATA_VALUE_MAXLEN, desc, attr);          <===== (6)
    if (err < 0)
        return err;

    if (desc->type != NFT_DATA_VERDICT && desc->len != set->dlen) {             <===== (7)
        nft_data_release(data, desc->type);
        return -EINVAL;

    return 0;

First of all, data and desc are filled in nft_data_init (/net/netfilter/nf_tables_api.c) according to data provided by a user (6). The critical part is the check done between desc->len and set->dlen at (7), it only occurs when data associated to the added element has a type different of NFT_DATA_VERDICT.

However, set->dlen is controlled by the user when a new set is created. The only restriction is that set->dlen should be lower than 64 bytes and the data type should be different of NFT_DATA_VERDICT. Moreover, when desc->type is equal to NFT_DATA_VERDICT, desc->len is equal to 16 bytes.

Adding an element of type NFT_DATA_VERDICT to a set with data type NFT_DATA_VALUE usually lead to desc->len being different from set->dlen. Therefore, it is possible to perform a heap buffer overflow in nft_set_elem_init at (1). This buffer overflow can be extended up to 48 bytes long.

Stay here ! I will be back soon !

Nevertheless, this is not a standard buffer overflow, when a user can directly control the overflowing data. In this case, random data will be copied out of the allocated buffer.

If we check the call to nft_set_elem_init (5), one can observe that copied data are extracted from a local variable elem, which is a nft_set_elem object.

    struct nft_set_elem elem;                                                   <===== (2)
    elem.priv = nft_set_elem_init(set, &tmpl,,                <===== (5)
                      timeout, expiration, GFP_KERNEL);

nft_set_elem (/net/netfilter/nf_tables.h) are used to store information about new elements during their creation.


struct nft_verdict {
    u32             code;
    struct nft_chain        *chain;

struct nft_data {
    union {
        u32         data[4];
        struct nft_verdict  verdict;
} __attribute__((aligned(__alignof__(u64))));

struct nft_set_elem {
    union {
        u32     buf[NFT_DATA_VALUE_MAXLEN / sizeof(u32)];
        struct nft_data val;
    } key;
    union {
        u32     buf[NFT_DATA_VALUE_MAXLEN / sizeof(u32)];
        struct nft_data val;
    } key_end;
    union {
        u32     buf[NFT_DATA_VALUE_MAXLEN / sizeof(u32)];
        struct nft_data val;
    } data;
    void            *priv;

As one can see, 64 bytes are reserved to temporarily store data associated to the new element. However, only 16 bytes at most are written into when the buffer overflow is triggered. Therefore, random bytes are used in the overflow.

Finally, not so random

At this point, I found a buffer overflow without control on data used for the corruption. Building an exploit with an uncontrolled buffer overflow is a real challenge. used in the overflow is not initialized (2). It could be used to control the overflow.

Let’s have a look to the caller, maybe the previously called function can help to control data used for the overflow. nft_add_set_elem is called in nf_tables_newsetelem (/net/netfilter/nf_tables_api.c) for each elements that the user wants to add to the set.

static int nf_tables_newsetelem(struct sk_buff *skb,
                const struct nfnl_info *info,
                const struct nlattr * const nla[])
    nla_for_each_nested(attr, nla[NFTA_SET_ELEM_LIST_ELEMENTS], rem) {
        err = nft_add_set_elem(&ctx, set, attr, info->nlh->nlmsg_flags);
        if (err < 0) {
            NL_SET_BAD_ATTR(extack, attr);
            return err;

nla_for_each_nested is used to iterate over attributes sent by the user, so the user is able to control the number of iteration that will be done. And nla_for_each_nested is only using macros and inline functions, so a call to nft_add_set_elem can be directly followed by another call to nft_add_set_elem. It is very useful because it allows to use data of a previous element in the overflow, as is not initialized. Moreover, one can ignore the randomization of the stack layout. Therefore, the way to control the overflow will be independent of the kernel compilation.

The following schema summarizes the different stages of within the stack to produce a controlled overflow.

Different steps of the during the exploitation

Random data are stored within the stack, adding a new element with NFT_DATA_VALUE data leads to user-controlled data within the stack. Finally, adding a second element with NFT_DATA_VERDICT data will trigger the buffer overflow and the data residue of the last element will be copied during the overflow.

Cache selection

The last thing we didn’t discuss before developing my exploitation strategy is the cache where the overflow happens. elem, allocated at (0), is dependent from the different options selected by the user, as shown in the previous extract of the function nft_add_set_elem, its size may vary. There are several options that can be used to increase it, such as NFT_SET_ELEM_KEY and NFT_SET_ELEM_KEY_END. They allow to reserve two buffers with a length up to 64 bytes in elem. So this overflow can clearly happen in several caches. elem is allocated on Ubuntu 22.04 with the flag GFP_KERNEL. Therefore, the concerned caches are kmalloc-{64,96,128,192}.

Now, the only thing that remains is to align elem on the cache object size to perform the best overflow. The next schema represents the construction of elem to align it on 64 bytes.

Elem build

We used the following construction in order to target the kmalloc-64 cache:

  • 20 bytes for the object header
  • 28 bytes of padding via NFT_SET_ELEM_KEY
  • 16 bytes to store the element data of type NFT_DATA_VERDICT.

Gimme a leak

Now that it is possible to control the overflowing data, the next step is to find a way to retrieve the KASLR base. As the overflow only occurred in kmalloc-x caches, the classical msg_msg objects cannot be used to perform an information leak, because they are allocated within the kmalloc-cg-x caches.

We looked at user_key_payload (/include/keys/user-type.h) objects, normally used to store sensitive user information in kernel land, represent a good alternative. They are similar to msg_msg objects in their structure: a header with the object size, then a buffer with user data.

struct user_key_payload {
    struct rcu_head rcu;        /* RCU destructor */
    unsigned short  datalen;    /* length of this data */
    char        data[] __aligned(__alignof__(u64)); /* actual data */

These objects are allocated within the function user_preparse (/security/keys/user_defined.c)

int user_preparse(struct key_preparsed_payload *prep)
    struct user_key_payload *upayload;
    size_t datalen = prep->datalen;

    if (datalen <= 0 || datalen > 32767 || !prep->data)
        return -EINVAL;

    upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);                <===== (6)
    if (!upayload)
        return -ENOMEM;

    /* attach the data */
    prep->quotalen = datalen;
    prep->[0] = upayload;
    upayload->datalen = datalen;
    memcpy(upayload->data, prep->data, datalen);                                <===== (7)
    return 0;

The allocation done at (6) is taking into consideration the length of the data provided by the user. Then, data is stored just after the header with the call to memcpy at (7) The header of user_key_payload objects is 24 bytes long, consequently they can be used to spray several caches, kmalloc-32 to kmalloc-8k.

As with msg_msg objects, the goal is to overwrite the field datalen with a bigger value than the initial one. When retrieving the information stored, the corrupted object will return more data than initially provided by the user.

However, there is an important drawback with this spray. The number of allocated objects is restricted. The sysctl variable kernel.keys.maxkeys defines the limit on the number of allowed keys within a user keyring. Moreover, kernel.keys.maxbytes restricts the number of stored bytes within a keyring. The default values of these variables are very low. They are shown below for Ubuntu 22.04

kernel.keys.maxbytes = 20000
kernel.keys.maxkeys = 200

Leaking is good, leaking useful is better

Since I found a way to leak information, the next step is to find interesting information. Working within the kmalloc-64 cache seemed the best one, this is the cache with the smallest objects where the overflow can occur. Consequently, a higher number of objects can leak.

percpu_ref_data (/include/linux/percpu-refcount.h) objects are also allocated in this cache. They are interessant targets because they contain two useful kinds of pointers.

struct percpu_ref_data {
    atomic_long_t       count;
    percpu_ref_func_t   *release;
    percpu_ref_func_t   *confirm_switch;
    bool            force_atomic:1;
    bool            allow_reinit:1;
    struct rcu_head     rcu;
    struct percpu_ref   *ref;

These objects store pointers to functions (fields release and confirm_switch) that can be used to compute the KASLR base or module bases when they are leaked and also a pointer to a dynamically allocated object (field ref) useful to compute the physmap base.

Such objects are allocated during a call to percpu_ref_init (/lib/percpu-refcount.c).

int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
            unsigned int flags, gfp_t gfp)
    struct percpu_ref_data *data;
    data = kzalloc(sizeof(*ref->data), gfp);
    data->release = release;
    data->confirm_switch = NULL;
    data->ref = ref;
    ref->data = data;
    return 0;

The simplest way to allocate percpu_ref_data objects is to use the io_uring_setup syscall (/fs/io_uring.c). And in order to program the release of such an object, a simple call to the close syscall is enough.

The allocation of a percpu_ref_data object is done during the initialization of an io_ring_ctx object (/fs/io_uring.c) within the function io_ring_ctx_alloc (/fs/io_uring.c).

static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
    struct io_ring_ctx *ctx;
    if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free,
        goto err;


As io_uring is integrated to the Linux core, the leak of io_ring_ctx_ref_free (/fs/io_uring.c) allows to compute the KASLR base.

During my investigation, some unexpected percpu_ref_data objects were in the leak but with the address of the function io_rsrc_node_ref_zero (/fs/io_uring.c) within the field release. After analyzing the origin of these objects, I understood that they also come from the io_uring_setup syscall. This good side effect of the io_uring_setup syscall allowed to improve the leak within my exploit.

I am (G)root

Now, that it is possible to obtain a useful information leak, a good write primitive is needed to perform a privilege escalation.

A few weeks ago, Lam Jun Rong from Starlabs released an article that describes a new way to exploit the CVE-2021-41073. He is presenting a new write primitive, the unlinking attack. It is based on the list_del operation. After corrupting a list_head object with two addresses, one address gets stored at the other.

As in the article of L.J. Rong, the target for the list_head corruption in my exploit is a simple_xattr object.

struct simple_xattr {
    struct list_head list;
    char *name;
    size_t size;
    char value[];

To work, this technique requires to know which object has been corrupted. In the other case, removing random items from the list leads surly to a fault in the list traversal. The items in the list are identified with their names.

To identify the corrupted object, I perform a trick on the name field: allocating name with a length high enough to have 256 bytes reserved, the least significant byte of the returned address is null. Little endian architectures, such as x86_64, allow us to just erase the least significant byte of name after the two pointers in the list_head. Consequently, it is possible to prepare the list field for the write primitive and at the same time to identify the corrupted object truncating its name. The only requirement is that all the names have the same end.

The following schema summarizes the construction of the name for a spray with simple_xattr objects.

Schema of the xattr names construction

Using this write primitive, it is possible to edit the modprobe_path with a path in the /tmp/ folder. It allows to execute any program with root privileges, and enjoy a root shell !

Running poc


This exploitation method is based on the hypothesis that a specific address is mapped in kernel land which is not always the case. So the exploit is not fully reliable but it still has a good success rate. The second drawback of the unlinking attack is the kernel panic that comes when the exploit is finished. This could be avoided by finding objects that can stay in the kernel memory at the end of the exploit process.

The end of the story

This vulnerability has been reported to the Linux security team and CVE-2022-34918 has been assigned. They proposed a patch that I tested and reviewed, and it has been released in the upstream tree within the commit 7e6bc1f6cabcd30aba0b11219d8e01b952eacbb6.


To sum up, I found a heap buffer overflow within the Netfilter subsystem of the Linux kernel. This vulnerability could be exploited to get a privilege escalation on Ubuntu 22.04. The source code of the exploit is available on our GitHub.

I would like to acknowledge RandoriSec for the opportunity they gave me to perform vulnerability research inside the Linux kernel during my internship and also my research team for their advice.