Yet another bug in Netfilter

This year, I did my end-of-studies internship at Randorisec. Its original purpose was to study different ways of doing variant analysis on open-source projects as well as on closed-source software. The Linux kernel is an attractive target, so it was chosen as the target of my project. I started scouting for publicly disclosed bugs which could lead to interesting variant patterns. Since the beginning of the year, several bugs have been found and fixed within the netfilter kernel component, such as CVE-2022-1015. In this article, I am going to talk about two bugs I found inside the kernel part of netfilter. One of them leads to an information leak, and potentially to a fully functional local privilege escalation (LPE) exploit.

A brief introduction to netfilter

Netfilter is an open-source project that is used to perform packet filtering, aka the Linux firewall.

This project is often associated with iptables, the userland application used to configure the firewall. In 2014, a new subsystem called nf_tables was added to netfilter; it is configured through the nftables userland application.

Schema of the netfilter organization

The core logic of netfilter is implemented within the kernel (/net/netfilter), and the userland tools communicate with it via netlink messages.

A VERY basic usage

The central object of netfilter is the table. It contains all the information about a filter. The following nftables command creates a new table named my-table that can be used as a filter on the IP protocol.

nft> add table ip my-table

A table can contain different objects, such as sets, which are used to store data. The following command creates a new set named my-set, associated with the table my-table, that stores IPv4 addresses.

nft> add set ip my-table my-set { type ipv4_addr; }

Finally, we can create chains of rules that are applied to received packets.

Starting point

In April 2022, David Bouman released an article about vulnerabilities within the Linux kernel. I focused on CVE-2022-1015 as the target of my variant analysis. This vulnerability is very interesting because it belongs to a class of out-of-bounds bugs caused by incorrectly implemented checks on user input. Moreover, this kind of bug can be found both in the Linux kernel and in userland applications. The underlying bug is an integer overflow located in the function nft_validate_register_load (/net/netfilter/nf_tables_api.c) at (0).

static int nft_validate_register_load(enum nft_registers reg, unsigned int len)
{
    if (reg < NFT_REG_1 * NFT_REG_SIZE / NFT_REG32_SIZE)
        return -EINVAL;
	
    if (len == 0)
        return -EINVAL;
	
    if (reg * NFT_REG32_SIZE + len > sizeof_field(struct nft_regs, data))           <===== (0)
        return -ERANGE;

    return 0;
}

After reading and understanding how this vulnerability works, I decided to write a CodeQL query able to find this kind of vulnerability. Analyzing the results of the query led to several candidate variants of CVE-2022-1015.

CVE-2022-1015 variant

The bug

One of these candidates caught my attention. It is located inside the function nft_set_desc_concat_parse (/net/netfilter/nf_tables_api.c).

static int nft_set_desc_concat_parse(const struct nlattr *attr,
                     struct nft_set_desc *desc)
{
    struct nlattr *tb[NFTA_SET_FIELD_MAX + 1];
    u32 len;
    int err;

    err = nla_parse_nested_deprecated(tb, NFTA_SET_FIELD_MAX, attr,
                      nft_concat_policy, NULL);
    if (err < 0)
        return err;

    if (!tb[NFTA_SET_FIELD_LEN])
        return -EINVAL;

    len = ntohl(nla_get_be32(tb[NFTA_SET_FIELD_LEN]));

    if (len * BITS_PER_BYTE / 32 > NFT_REG32_COUNT)             <===== (1)
        return -E2BIG;

    desc->field_len[desc->field_count++] = len;                 <===== (2)

    return 0;
}

At (1), we discern the pattern which matched the query.

Let’s focus on the left part of the comparison, len * BITS_PER_BYTE / 32. len is a 32-bit integer extracted from a netlink message with nla_get_be32, so we can assume that it is user-controllable. The macro BITS_PER_BYTE is defined in /include/linux/bits.h.

#define BITS_PER_BYTE       8

The multiplication len * BITS_PER_BYTE is equivalent to a left shift by three bits. The next arithmetic operation, the division by 32, is equivalent to a right shift by five bits. As a result, five bits of len are ignored: the three most significant and the two least significant.

Now, let’s have a look at the right side of the comparison. The macro NFT_REG32_COUNT is defined in the file /include/uapi/linux/netfilter/nf_tables.h.

enum nft_registers {

    ...

    NFT_REG32_00    = 8,
    NFT_REG32_01,
    NFT_REG32_02,
    NFT_REG32_03,
    NFT_REG32_04,
    NFT_REG32_05,
    NFT_REG32_06,
    NFT_REG32_07,
    NFT_REG32_08,
    NFT_REG32_09,
    NFT_REG32_10,
    NFT_REG32_11,
    NFT_REG32_12,
    NFT_REG32_13,
    NFT_REG32_14,
    NFT_REG32_15,
};
...
#define NFT_REG32_COUNT (NFT_REG32_15 - NFT_REG32_00 + 1)

We have NFT_REG32_COUNT equal to 23 - 8 + 1 = 16.

If we invert the condition at (1) in order to get a condition on len, we get len <= 64 as the intended bound. Nevertheless, five bits are ignored in this comparison (the truncating division alone already lets values up to 67 through), so these five bits could be used to store an unwanted value at (2).

The next step is to find a way for a random unprivileged user to reach this buggy code.

Unprivileged access

The kernel netfilter module can be accessed through a netlink socket set up in /net/netfilter/nfnetlink.c.

static int __net_init nfnetlink_net_init(struct net *net)
{
    struct nfnl_net *nfnlnet = nfnl_pernet(net);
    struct netlink_kernel_cfg cfg = {
        .groups = NFNLGRP_MAX,
        .input  = nfnetlink_rcv,    <===== (3)
#ifdef CONFIG_MODULES
        .bind   = nfnetlink_bind,
#endif
    };

    nfnlnet->nfnl = netlink_kernel_create(net, NETLINK_NETFILTER, &cfg);
    if (!nfnlnet->nfnl)
        return -ENOMEM;
    return 0;
}

The input field of the structure netlink_kernel_cfg at (3) gives us the function that will be called when a message is received via the netfilter netlink. Let’s have a look at this handler.

static void nfnetlink_rcv(struct sk_buff *skb)
{
    struct nlmsghdr *nlh = nlmsg_hdr(skb);

    if (skb->len < NLMSG_HDRLEN ||
        nlh->nlmsg_len < NLMSG_HDRLEN ||
        skb->len < nlh->nlmsg_len)
        return;

    if (!netlink_net_capable(skb, CAP_NET_ADMIN)) {                         <===== (4)
        netlink_ack(skb, nlh, -EPERM, NULL);
        return;
    }

    if (nlh->nlmsg_type == NFNL_MSG_BATCH_BEGIN)
        nfnetlink_rcv_skb_batch(skb, nlh);
    else
        netlink_rcv_skb(skb, nfnetlink_rcv_msg);
}

We see a capability check at (4), done by the function netlink_net_capable: the user must hold the CAP_NET_ADMIN capability to use netfilter. Luckily for us, this capability can be acquired with user namespaces (CONFIG_USER_NS). User namespaces have been a game changer in Linux kernel exploitation over the past few years, as they have opened up a new set of attack surfaces. This is kind of funny, because they are also used to sandbox applications: a good example of a double-edged sword.

Let’s get back to our topic. We can map the call path from nfnetlink_rcv to nft_set_desc_concat_parse as follows.

nfnetlink_rcv
    nfnetlink_rcv_skb_batch
        nfnetlink_rcv_batch
            nf_tables_newset
                nf_tables_set_desc_parse
                    nft_set_desc_concat
                        nft_set_desc_concat_parse

There is nothing relevant within the intermediate functions, so we are going to ignore them for the moment.

If the kernel is compiled with CONFIG_USER_NS enabled, we will be able to reach our vulnerable code path.

Integer overflow verification in GDB

Well, what can we do?

Sadly, it seems that this bug does not have any impact on Linux security.

Indeed, the assignment at (2) hides a cast from a 32-bit unsigned integer to an 8-bit one, because the field field_len of the nft_set_desc structure is of type u8.

struct nft_set_desc {
    unsigned int        klen;
    unsigned int        dlen;
    unsigned int        size;
    u8          field_len[NFT_REG32_COUNT];
    u8          field_count;
    bool            expr;
};

This greatly reduces the power of the overflow: the cast discards the three most significant bits, so only the two least significant bits remain usable and the stored value cannot exceed 67.

A strange neighbor

However, this is not the end of the adventure.

When I was trying to exploit the previous bug, something caught my eye: there was no check on the index field_count before the assignment at (2).

The parent function of nft_set_desc_concat_parse is nft_set_desc_concat (/net/netfilter/nf_tables_api.c).

static int nft_set_desc_concat(struct nft_set_desc *desc,
                   const struct nlattr *nla)
{
    struct nlattr *attr;
    int rem, err;

    nla_for_each_nested(attr, nla, rem) {                                   <===== (5)
        if (nla_type(attr) != NFTA_LIST_ELEM)
            return -EINVAL;

        err = nft_set_desc_concat_parse(attr, desc);                        <===== (6)
        if (err < 0)
            return err;
    }

    return 0;
}

The call to nft_set_desc_concat_parse is done at (6), but the most interesting thing is that the call is located in a loop that starts at (5), and nothing in the loop body checks the value of desc->field_count. nla_for_each_nested loops over all the netlink attributes provided by the user, so if the user provides more than NFT_REG32_COUNT of them, a buffer overflow happens.

Gimme gimme gimme an infoleak …

We will see how to turn this buffer overflow into a kernel infoleak. The first byte to be overwritten is the field field_count, which holds the number of elements stored in the field field_len. We will use this field to reach our goal.

Netfilter sets

We will use the netfilter sets, represented by the structure nft_set (/include/net/netfilter/nf_tables.h).

struct nft_set {
    struct list_head        list;
    struct list_head        bindings;
    struct nft_table        *table;
    possible_net_t          net;
    char                *name;
    u64             handle;
    u32             ktype;
    u32             dtype;
    u32             objtype;
    u32             size;
    u8              field_len[NFT_REG32_COUNT];
    u8              field_count;
    u32             use;
    atomic_t            nelems;
    u32             ndeact;
    u64             timeout;
    u32             gc_int;
    u16             policy;
    u16             udlen;
    unsigned char           *udata;
    /* runtime data below here */
    const struct nft_set_ops    *ops ____cacheline_aligned;
    u16             flags:14,
                    genmask:2;
    u8              klen;
    u8              dlen;
    u8              num_exprs;
    struct nft_expr         *exprs[NFT_SET_EXPR_MAX];
    struct list_head        catchall_list;
    unsigned char           data[]
        __attribute__((aligned(__alignof__(u64))));
};

These sets are allocated in the function nf_tables_newset (/net/netfilter/nf_tables_api.c), which is a parent of the vulnerable function nft_set_desc_concat_parse.

static int nf_tables_newset(struct sk_buff *skb, const struct nfnl_info *info,
                const struct nlattr * const nla[])
{
    const struct nft_set_ops *ops;
    struct nft_set_desc desc;
    struct nft_set *set;
    size_t alloc_size;
    u64 size;
    ...

    memset(&desc, 0, sizeof(desc));             <===== (7)
    
    ...
    
    if (nla[NFTA_SET_DESC] != NULL) {
        err = nf_tables_set_desc_parse(&desc, nla[NFTA_SET_DESC]);              <===== (8)
        if (err < 0)
            return err;
    }
    
    ...
    
    size = 0;
    if (ops->privsize != NULL)
        size = ops->privsize(nla, &desc);
    alloc_size = sizeof(*set) + size + udlen;
    if (alloc_size < size || alloc_size > INT_MAX)
        return -ENOMEM;
    set = kvzalloc(alloc_size, GFP_KERNEL);                                     <===== (9)
    if (!set)
        return -ENOMEM;

    ...

    set->field_count = desc.field_count;
    for (i = 0; i < desc.field_count; i++)                                      <===== (10)
        set->field_len[i] = desc.field_len[i];
        
    ...
    
}

At (8), we find the call that leads to the vulnerable function. Its first argument is the address of the local variable desc, which is initialized at (7). This is the object that is modified in the function nft_set_desc_concat_parse. We can see that its fields field_count and field_len are used at (10) and stored into the freshly allocated nft_set (9). So, if the out-of-bounds write has been used to overwrite solely the field field_count of desc with a value bigger than 20 (end of the structure + alignment), some data from the stack will be copied into the nft_set structure at (10). In this case, it will also produce an out-of-bounds write inside the nft_set structure, but that won’t be useful for our leak.

Back to the userland

The next step is to find a way to send this information back to userland. In order to do so, we will use the function nf_tables_getset (/net/netfilter/nf_tables_api.c), which can be reached in a similar way to nf_tables_newset. nf_tables_getset can be used to retrieve information about a registered set.

static int nf_tables_getset(struct sk_buff *skb, const struct nfnl_info *info,
                const struct nlattr * const nla[])
{
    const struct nft_set *set;
    struct sk_buff *skb2;
    struct nft_ctx ctx;
    int err;

    ...
    
    skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
    if (skb2 == NULL)
        return -ENOMEM;

    err = nf_tables_fill_set(skb2, &ctx, set, NFT_MSG_NEWSET, 0);           <===== (11)
    if (err < 0)
        goto err_fill_set_info;

    return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid);            <===== (12)

}

This function sends data to userland through the netfilter netlink socket thanks to nfnetlink_unicast at (12), and this data is set up in the call to nf_tables_fill_set (/net/netfilter/nf_tables_api.c) at (11).

static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx,
                  const struct nft_set *set, u16 event, u16 flags)
{
    struct nlmsghdr *nlh;
    u32 portid = ctx->portid;
    struct nlattr *nest;
    u32 seq = ctx->seq;
    int i;

    event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
    nlh = nfnl_msg_put(skb, portid, seq, event, flags, ctx->family,
               NFNETLINK_V0, nft_base_seq(ctx->net));
               
    ...
 
    if (set->udata &&
        nla_put(skb, NFTA_SET_USERDATA, set->udlen, set->udata))
        goto nla_put_failure;
    
    ...
 
    if (set->field_count > 1 &&
        nf_tables_fill_set_concat(skb, set))                <===== (13)
        goto nla_put_failure;
        
    ...
    
    nlmsg_end(skb, nlh);
    return 0;
    
nla_put_failure:
    nlmsg_trim(skb, nlh);
    return -1;
}

At (13), another auxiliary function is used to handle the fields field_len and field_count of the nft_set structure.

static int nf_tables_fill_set_concat(struct sk_buff *skb,
                     const struct nft_set *set)
{
    struct nlattr *concat, *field;
    int i;

    concat = nla_nest_start_noflag(skb, NFTA_SET_DESC_CONCAT);
    if (!concat)
        return -ENOMEM;

    for (i = 0; i < set->field_count; i++) {                <===== (14)
        field = nla_nest_start_noflag(skb, NFTA_LIST_ELEM);
        if (!field)
            return -ENOMEM;

        if (nla_put_be32(skb, NFTA_SET_FIELD_LEN,           <===== (15)
                 htonl(set->field_len[i])))
            return -ENOMEM;

        nla_nest_end(skb, field);
    }

    nla_nest_end(skb, concat);

    return 0;
}

As we can see at (14), we iterate set->field_count times, and at each iteration, one element of the set->field_len buffer is added to the socket buffer at (15). So if the off-by-one has been triggered at (2) as explained previously, information stored on the stack is copied to the heap at (10) and, finally, the user can read it back through the netlink socket.

A leak example

This leak is the result of my PoC. We can observe that, in these specific circumstances, it leaks a stack canary and an address of the data section.

Limitations

We have seen that the out-of-bounds write can lead to an infoleak. However, this leak is not so powerful, because it is restricted to 28 bytes due to the layout of the nft_set structure. Indeed, a bigger leak requires overwriting the least significant bytes of the field udata, which can lead to a segmentation fault in the nf_tables_fill_set function. Moreover, the leak depends on the stack layout, which itself depends on the compilation, so this leak can provide interesting information on a specific kernel but could be useless on the next version.

Remediation

I didn’t manage to turn this out-of-bounds write into a fully functional LPE exploit before it was publicly disclosed and fixed. It was fixed on May 31, 2022 by the commit fecf31ee395b0295f2d7260aa29946b7605f7c85. This commit fixes both the integer overflow and the out-of-bounds write. The patch for the integer overflow allows a length of up to 255, which would have facilitated the exploitation of the vulnerability.

This vulnerability has been officially disclosed on oss-security, and CVE-2022-1972 has been assigned to it.

Say LPE again! I dare you!

The nft_set structure stores several pointers, making it a good place to start. The field udata stores the address of the end of the structure, that is, the address of the field data. If some data is stored in the field data, the number of stored bytes is kept in the udlen field. The write primitive allows overwriting this field, which could lead to a heap infoleak. From this leak, we can imagine being able to get some pointers from different caches.

We can also try to get an arbitrary free primitive with an overflow that reaches the field num_exprs of the structure nft_set and sets a value bigger than 5 (NFT_SET_EXPR_MAX = 2). As a result, a part of the field data will also be considered as part of the field exprs, and we obtain full control over the stored exprs. We then just have to find a way to call nft_expr_destroy. To do that, we may try to produce an error inside the management of the expression, just after the out-of-bounds write inside the nft_set structure.

By combining these techniques, we should be able to get an arbitrary free primitive and try to produce a use-after-free.

All of this is a hypothesis as it has not been tested, so it may or may not work.

Conclusion

To sum up, I found two bugs within the kernel portion of netfilter which lead to a 28-byte infoleak. These bugs could not be turned into a privilege escalation due to a lack of time and a weak write primitive (written bytes cannot exceed 67). All the sources of my PoC can be found on our GitHub.

Arthur Mongodin, June 13, 2022