nftables - nft command line tool

	Commit message	Author	Age	Files	Lines
*	expression: remove elem_flags from EXPR_SET_ELEM to shrink struct expr size	Pablo Neira Ayuso	2025-01-02	1	-1/+1
	Move NFTNL_SET_ELEM_F_INTERVAL_OPEN flag to the existing flags field in struct expr. This saves 4 bytes in struct expr, shrinking it to 128 bytes according to pahole.
	This reworks: 6089630f54ce ("segtree: Introduce flag for half-open range elements")
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	src: allow binop expressions with variable right-hand operands	Jeremy Sowden	2024-12-04	1	-3/+16
	Hitherto, the kernel has required constant values for the `xor` and `mask` attributes of boolean bitwise expressions. This has meant that the right-hand operand of a boolean binop must be constant. Now the kernel has support for AND, OR and XOR operations with right-hand operands passed via registers, we can relax this restriction. Allow non-constant right-hand operands if the left-hand operand is not constant, e.g.:
	    ct mark & 0xffff0000 | meta mark & 0xffff
	The kernel now supports performing AND, OR and XOR operations directly, on one register and an immediate value or on two registers, so we need to be able to generate and parse bitwise boolean expressions of this form.
	If a boolean operation has a constant RHS, we continue to send a mask-and-xor expression to the kernel.
	Add tests for {ct,meta} mark with variable RHS operands. JSON support is also included.
	This requires Linux kernel >= 6.13-rc.
	[ Originally posted as patch 1/8 and 6/8 which has been collapsed and simplified to focus on initial {ct,meta} mark support. Tests have been extracted from 8/8 including a tests/py fix to payload output due to incorrect output in original patchset. JSON support has been extracted from patch 7/8 --pablo]
	Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
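The mask-and-xor form mentioned above can be illustrated with a small Python sketch. It assumes the usual reduction `(x & mask) ^ xor` and a 32-bit register width; the function names are hypothetical and are not taken from the nftables sources.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit register width for this sketch


def mask_xor_for(op, const):
    """Return (mask, xor) so that (x & mask) ^ xor equals `x op const`."""
    if op == "and":
        return const & MASK32, 0
    if op == "or":
        # Clear the bits of const first, then set them with the xor.
        return ~const & MASK32, const & MASK32
    if op == "xor":
        return MASK32, const & MASK32
    raise ValueError(f"unsupported boolean op: {op}")


def apply_mask_xor(x, mask, xor):
    """Evaluate the mask-and-xor form on a 32-bit value."""
    return ((x & mask) ^ xor) & MASK32
```

With a variable right-hand operand no such constant pair exists, which is why the commit needs the kernel's register-based AND, OR and XOR operations.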
*	src: allow to map key to nfqueue number	Florian Westphal	2024-11-11	1	-1/+3
	Allow to specify a numeric queue id as part of a map.
	The parser side is easy, but the reverse direction (listing) is not. 'queue' is a statement, it doesn't have an expression.
	Add a generic 'queue_type' datatype as a shim to the real basetype with constant expressions, this is used only for udata build/parse, it stores the "key" (the parser token, here "queue") as udata in kernel and can then restore the original key.
	Add a dumpfile to validate parser & output.
	JSON support is missing because JSON has allowed typeof only quite recently.
	Joint work with Pablo.
	Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1455
	Signed-off-by: Florian Westphal <fw@strlen.de>
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	datatype: remove unused flags field	Pablo Neira Ayuso	2024-11-11	1	-2/+0
	Leftover unused struct datatype field, remove it.
	Fixes: e35aabd511c4 ("datatype: replace DTYPE_F_ALLOC by bitfield")
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	monitor: Recognize flowtable add/del events	Phil Sutter	2024-11-06	3	-0/+12
	These were entirely ignored before, add the necessary code analogous to e.g. objects.
	Signed-off-by: Phil Sutter <phil@nwl.cc>
*	src: fix extended netlink error reporting with large set elements	Pablo Neira Ayuso	2024-10-28	1	-1/+3
	Large sets can expand into several netlink messages, use sequence number and attribute offset to correlate the set element and the location.
	When a set element command expands into several netlink messages, increment the sequence number for each netlink message. Update struct cmd to store the range of netlink messages that result from this command.
	struct nlerr_loc remains the same size on x86_64.
	    # nft -f set-65535.nft
	    set-65535.nft:65029:22-32: Error: Could not process rule: File exists
	    create element x y { 1.1.254.253 }
	                         ^^^^^^^^^^^
	Fixes: f8aec603aa7e ("src: initial extended netlink error reporting")
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
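A minimal Python sketch of the correlation idea: each command records the range of sequence numbers of the netlink messages it produced, and an error is mapped back to the command whose range contains the failing message. The `Cmd` class and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cmd:
    text: str
    seqnum_min: int  # first netlink message generated by this command
    seqnum_max: int  # last netlink message generated by this command


def find_cmd(cmds, err_seqnum):
    """Return the command whose message range contains err_seqnum, or None."""
    for cmd in cmds:
        if cmd.seqnum_min <= err_seqnum <= cmd.seqnum_max:
            return cmd
    return None
```

The attribute offset reported by the kernel would then narrow the location down to one element inside that command; that second step is not modelled here.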
*	rule: netlink attribute offset is uint32_t for struct nlerr_loc	Pablo Neira Ayuso	2024-10-28	1	-1/+1
	The maximum netlink message length (nlh->nlmsg_len) is uint32_t, so the offset to the netlink attribute that struct nlerr_loc stores must be uint32_t, not uint16_t.
	While at it, remove the check for a zero netlink attribute offset in nft_cmd_error(), which should never happen; this check was likely there to guard against the uint16_t offset overflow.
	Fixes: f8aec603aa7e ("src: initial extended netlink error reporting")
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
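To see why a 16-bit field is too narrow, the following Python sketch mimics storing an offset in an unsigned field of a given width. The helper name is hypothetical; the arithmetic is plain modular truncation.

```python
def store_offset(offset, bits):
    """Mimic storing an attribute offset in an unsigned field of `bits` width."""
    return offset & ((1 << bits) - 1)


# An attribute located 70000 bytes into a large message no longer fits in
# 16 bits and wraps around, but is preserved in a 32-bit field.
wrapped = store_offset(70000, 16)   # 70000 - 65536 = 4464
intact = store_offset(70000, 32)    # 70000
```

An offset that is an exact multiple of 65536 would wrap to zero, which may be what the removed zero-offset check was guarding against.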
*	mnl: update cmd_add_loc() to take struct nlmsghdr	Pablo Neira Ayuso	2024-10-28	1	-1/+1
	To prepare for a fix for very large sets. No functional change is intended.
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	mnl: rename to mnl_seqnum_alloc() to mnl_seqnum_inc()	Pablo Neira Ayuso	2024-10-28	1	-1/+1
	Rename mnl_seqnum_alloc() to mnl_seqnum_inc(). No functional change is intended.
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	src: collapse set element commands from parser	Pablo Neira Ayuso	2024-10-28	4	-5/+15
	498a5f0c219d ("rule: collapse set element commands") does not help to reduce memory consumption in the case of large sets defined by one element per line:
	    add element ip x y { 1.1.1.1 }
	    add element ip x y { 1.1.1.2 }
	    ...
	This patch reduces memory consumption by ~75%, set elements are collapsed into an existing cmd object wherever possible to reduce the number of cmd objects.
	This patch also adds a special case for variables for sets similar to:
	    be055af5c58d ("cmd: skip variable set elements when collapsing commands")
	This patch requires this small kernel fix:
	    commit b53c116642502b0c85ecef78bff4f826a7dd4145
	    Author: Pablo Neira Ayuso <pablo@netfilter.org>
	    Date: Fri May 20 00:02:06 2022 +0200
	        netfilter: nf_tables: set element extended ACK reporting support
	which is already included in recent -stable kernels:
	    # cat ruleset.nft
	    add table ip x
	    add chain ip x y
	    add set ip x y { type ipv4_addr; }
	    create element ip x y { 1.1.1.1 }
	    create element ip x y { 1.1.1.1 }
	    # nft -f ruleset.nft
	    ruleset.nft:5:25-31: Error: Could not process rule: File exists
	    create element ip x y { 1.1.1.1 }
	                            ^^^^^^^
	Since there is no need to relate commands via sequence number anymore, this also allows removing the uncollapse step.
	Fixes: 498a5f0c219d ("rule: collapse set element commands")
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
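The collapsing idea can be sketched in a few lines of Python. This is a simplified model, not the parser code: commands are plain tuples, and only adjacent commands with the same operation and the same set are merged, which is an assumption made for this illustration.

```python
def collapse(commands):
    """Merge runs of element commands that target the same set.

    commands: list of (op, set_name, elements) tuples in file order.
    Returns a new list with adjacent compatible commands merged into one.
    """
    out = []
    for op, set_name, elements in commands:
        if out and out[-1][0] == op and out[-1][1] == set_name:
            # Reuse the existing command object instead of creating a new one.
            out[-1][2].extend(elements)
        else:
            out.append((op, set_name, list(elements)))
    return out
```

For a file with one `add element` line per element, this leaves a single command object holding all elements, which is where the memory saving comes from.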
*	src: support for timeout never in elements	Pablo Neira Ayuso	2024-09-17	1	-0/+3
	Allow to specify elements that never expire in sets with global timeout.
	    set x {
	        typeof ip saddr
	        timeout 1m
	        elements = { 1.1.1.1 timeout never, 2.2.2.2, 3.3.3.3 timeout 2m }
	    }
	In the example above:
	    - 1.1.1.1 is a permanent element
	    - 2.2.2.2 expires after 1 minute (uses default set timeout)
	    - 3.3.3.3 expires after 2 minutes (uses specified timeout override)
	Use internal NFT_NEVER_TIMEOUT marker as UINT64_MAX to differentiate between "use default set timeout" and "timeout never" if "timeout N" is used in the set declaration. Maximum supported timeout in milliseconds which is conveyed within a netlink attribute is 0x10c6f7a0b5ec.
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
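The three cases in the example can be modelled with a small Python sketch. The sentinel name and value come from the commit message; the helper function and the use of `None` for "no per-element timeout given" are assumptions made for illustration.

```python
NFT_NEVER_TIMEOUT = 2**64 - 1  # UINT64_MAX, as named in the commit message


def effective_timeout_ms(element_timeout_ms, set_timeout_ms):
    """Return the expiry in milliseconds, or None if the element never expires.

    element_timeout_ms is None when the element gives no timeout of its own.
    """
    if element_timeout_ms == NFT_NEVER_TIMEOUT:
        return None                   # "timeout never": permanent element
    if element_timeout_ms is None:
        return set_timeout_ms         # fall back to the set's default timeout
    return element_timeout_ms         # explicit per-element override
```

Applied to the set above with a default of 60000 ms, 1.1.1.1 never expires, 2.2.2.2 expires after 60000 ms and 3.3.3.3 after 120000 ms.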
*	cache: consolidate reset command	Pablo Neira Ayuso	2024-08-26	2	-9/+6
	Reset command does not utilize the cache infrastructure.
	This implicitly fixes a crash with anonymous sets because elements are not fetched. I initially tried to fix it by toggling the missing cache flags, but then ASAN reports memleaks.
	To address these issues, rely on Phil's list filtering infrastructure, which is expanded to accommodate the filtering requirements of the reset commands, such as 'reset table ip' where only the family is sent to the kernel.
	After this update, tests/shell reports a few inconsistencies between reset and list commands:
	    - reset rules chain t c2
	      displays sets, but it should only list the given chain.
	    - reset rules table t
	      reset rules ip
	      do not list elements in the set. In both cases, these are fully listing a given table and family, elements should be included.
	The consolidation also ensures list and reset will not differ.
	A few more notes:
	    - CMD_OBJ_TABLE is used for: rules family table from the parser, due to the lack of a better enum, same applies to CMD_OBJ_CHAIN.
	    - CMD_OBJ_ELEMENTS still does not use the cache, but same occurs in the CMD_GET command case which needs to be consolidated.
	Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1763
	Fixes: 83e0f4402fb7 ("Implement 'reset {set,map,element}' commands")
	Fixes: 1694df2de79f ("Implement 'reset rule' and 'reset rules' commands")
	Tested-by: Eric Garver <eric@garver.life>
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	cache: add filtering support for objects	Pablo Neira Ayuso	2024-08-26	1	-0/+2
	Currently, full ruleset flag is set on to fetch objects.
	Follow a similar approach to these patches from Phil:
	    de961b930660 ("cache: Filter set list on server side") and
	    cb4b07d0b628 ("cache: Support filtering for a specific flowtable")
	in preparation to update the reset command to use the cache infrastructure.
	Tested-by: Eric Garver <eric@garver.life>
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	datatype: replace DTYPE_F_ALLOC by bitfield	Pablo Neira Ayuso	2024-08-21	1	-11/+3
	Only user of the datatype flags field is DTYPE_F_ALLOC, replace it by bitfield, squash byteorder to 8 bits which is sufficient.
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	src: remove DTYPE_F_PREFIX	Pablo Neira Ayuso	2024-08-21	1	-2/+1
	Only the ipv4 and ipv6 datatypes support this, add a datatype_prefix_notation() helper function to report that the datatype prefers prefix notation, if possible.
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	src: mnl: always dump all netdev hooks if no interface name was given	Florian Westphal	2024-08-21	1	-0/+2
	Instead of not returning any results for
	    nft list hooks netdev
	iterate all interfaces and then query all of them.
	Signed-off-by: Florian Westphal <fw@strlen.de>
*	cache: populate chains on demand from error path	Pablo Neira Ayuso	2024-08-19	1	-1/+0
	Updates on verdict maps that require many non-base chains are slowed down due to fetching existing non-base chains into the cache.
	Chains are only required for error reporting hints if kernel reports ENOENT. Populate the cache from this error path only.
	Similar approach already exists from rule ENOENT error path since:
	    deb7c5927fad ("cmd: add misspelling suggestions for rule commands")
	however, NFT_CACHE_CHAIN was toggled unconditionally for rule commands, rendering this on-demand cache population useless.
	Before this patch, running Neels' nft_slew benchmark (peak values):
	    created idx 4992 in 52587950 ns (128 in 7122 ms)
	    ...
	    deleted idx 128 in 43542500 ns (127 in 6187 ms)
	After this patch:
	    created idx 4992 in 11361299 ns (128 in 1612 ms)
	    ...
	    deleted idx 1664 in 5239633 ns (128 in 733 ms)
	Tested-by: Eric Garver <eric@garver.life>
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
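The on-demand approach can be sketched in Python as a cache that is filled only when an ENOENT reply makes a hint necessary. The `ChainCache` class, the `run()` helper and the `send`/`fetch` callables are hypothetical stand-ins; they are not the nftables cache API.

```python
import errno


class ChainCache:
    """Chain list that is fetched lazily, on first use."""

    def __init__(self, fetch):
        self._fetch = fetch    # hypothetical callable that dumps chain names
        self._chains = None    # not populated until first needed
        self.fetches = 0       # number of dumps performed, for illustration

    def chains(self):
        if self._chains is None:
            self._chains = list(self._fetch())
            self.fetches += 1
        return self._chains


def run(cmd, send, cache):
    """send(cmd) returns 0 or a negative errno, standing in for the kernel reply."""
    err = send(cmd)
    if err == -errno.ENOENT:
        # Only now pay for the chain dump, to build a misspelling hint.
        return err, cache.chains()
    return err, None
```

On the success path the chain dump is never requested, which is the cost the benchmark numbers above reflect.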
*	src: drop obsolete hook argument form hook dump functions	Florian Westphal	2024-08-19	1	-1/+1
	Since commit b98fee20bfe2 ("mnl: revisit hook listing"), handle.chain is never set in this path, so 'hook' is always set to -1, so the hook arg can be dropped.
	Signed-off-by: Florian Westphal <fw@strlen.de>
*	src: remove decnet support	Florian Westphal	2024-07-30	1	-72/+0
	DECnet support was removed from the kernel two years ago with v6.1, ditch this from the hook list code as well.
	Signed-off-by: Florian Westphal <fw@strlen.de>
*	src: add string preprocessor and use it for log prefix string	Pablo Neira Ayuso	2024-06-25	3	-3/+5
	Add a string preprocessor to identify and replace variables in a string. Rework existing support for variables in log prefix strings to use it.
	Fixes: e76bb3794018 ("src: allow for variables in the log prefix string")
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
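As a rough illustration of what such a preprocessor does, the Python sketch below replaces `$name` references in a string with values from a table of defined variables. The identifier pattern and the error handling are assumptions for this example; the actual implementation lives in the nftables C sources and may differ.

```python
import re

# Assumed identifier syntax for this sketch: a letter or underscore,
# followed by letters, digits or underscores.
VAR = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def expand(text, variables):
    """Replace every $name in text with variables[name]."""
    def repl(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"unknown variable: {name}")
        return str(variables[name])
    return VAR.sub(repl, text)
```

For example, `expand("dropped by $chain: ", {"chain": "input"})` yields a log prefix with the variable substituted.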
*	Add support for table's persist flag	Phil Sutter	2024-04-19	1	-1/+3
	Bison parser lacked support for passing multiple flags, JSON parser did not support table flags at all. Also document the 'owner' flag (and describe their relationship) in nft.8.
	Signed-off-by: Phil Sutter <phil@nwl.cc>
*	src: disentangle ICMP code types	Pablo Neira Ayuso	2024-04-04	1	-0/+5
	Currently, ICMP{v4,v6,inet} code datatypes only describe those that are supported by the reject statement, but they can also be used for icmp code matching. Moreover, ICMP code types go hand in hand with ICMP types, that is, ICMP code symbols depend on the ICMP type.
	Thus, the output of:
	    nft describe icmp_code
	looks confusing because it only displays the values that are supported by the reject statement.
	Disentangle this by adding internal datatypes for the reject statement to handle the ICMP code symbol conversion to value as well as ruleset listing.
	The existing icmp_code, icmpv6_code and icmpx_code remain in place. For backward compatibility, a parser function is defined in case an existing ruleset relies on these symbols.
	As for the manpage, move existing ICMP code tables from the DATA TYPES section to the REJECT STATEMENT section, where they really belong. But the icmp_code and icmpv6_code table stubs remain in the DATA TYPES section because they describe that this is an 8-bit integer field.
	After this patch:
	    # nft describe icmp_code
	    datatype icmp_code (icmp code) (basetype integer), 8 bits
	    # nft describe icmpv6_code
	    datatype icmpv6_code (icmpv6 code) (basetype integer), 8 bits
	    # nft describe icmpx_code
	    datatype icmpx_code (icmpx code) (basetype integer), 8 bits
	do not display the symbol table of the reject statement anymore.
	icmpx_code_type is not used anymore, but keep it in place for backward compatibility reasons.
	And update tests/shell accordingly.
	Fixes: 5fdd0b6a0600 ("nft: complete reject support")
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	netlink_delinearize: reverse cross-day meta hour range	Pablo Neira Ayuso	2024-03-20	2	-0/+3
	f8f32deda31d ("meta: Introduce new conditions 'time', 'day' and 'hour'") reverses the hour range in case that a cross-day range is used, e.g.
	    meta hour "03:00"-"14:00" counter accept
	which results in (Sydney, Australia AEDT time):
	    meta hour != "14:00"-"03:00" counter accept
	The kernel handles time in UTC, therefore a cross-day range may not be obvious according to local time.
	The ruleset listing above is not very intuitive to the reader depending on their timezone, therefore complete the netlink delinearize path to reverse the cross-day meta range.
	Update manpage to recommend to use a range expression when matching meta hour range. Recommend range expression for meta time and meta day too.
	Extend testcases/listing/meta_time to cover for this scenario.
	Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1737
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
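The round trip described above can be worked through in Python. This is a simplified model under stated assumptions: times are seconds since midnight, the timezone is a fixed offset (AEDT taken as UTC+11), endpoint inclusivity is ignored, and both function names are hypothetical.

```python
DAY = 24 * 3600


def to_utc_range(lo_local, hi_local, utc_offset):
    """Convert a local hour range to UTC.

    lo/hi are seconds since local midnight; utc_offset is seconds east of UTC.
    Returns (op, lo_utc, hi_utc) with op '==' or '!='.
    """
    lo = (lo_local - utc_offset) % DAY
    hi = (hi_local - utc_offset) % DAY
    if lo <= hi:
        return "==", lo, hi
    # Range crosses midnight in UTC: express it as the negated complement.
    return "!=", hi, lo


def to_local_range(op, lo_utc, hi_utc, utc_offset):
    """Inverse step for listing: undo the reversal if it crosses local midnight."""
    lo = (lo_utc + utc_offset) % DAY
    hi = (hi_utc + utc_offset) % DAY
    if lo <= hi:
        return op, lo, hi
    # Crosses midnight locally: reverse it back and flip the operator.
    return ("==" if op == "!=" else "!="), hi, lo
```

With an offset of 11 hours, the local range 03:00-14:00 becomes `!= 03:00-16:00` in UTC, and the inverse step turns that back into `== 03:00-14:00` instead of listing `!= 14:00-03:00`.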
*	src: do not merge a set with a erroneous one	Florian Westphal	2024-03-20	1	-0/+2
	The included sample causes a crash because we attempt to range-merge a prefix expression with a symbolic expression.
	The first set is evaluated, the symbol expression evaluation fails and nft queues an error message ("Could not resolve hostname"). However, nft continues evaluation.
	nft then encounters the same set definition again and merges the new content with the preceding one. But the first set structure is dodgy, it still contains the unresolved symbolic expression.
	That then makes nft crash (assert) in the set internals.
	There are various different incarnations of this issue, but the low level set processing code does not allow for any partially transformed expressions to still remain.
	Before:
	    nft --check -f tests/shell/testcases/bogons/nft-f/invalid_range_expr_type_binop
	    BUG: invalid range expression type binop
	    nft: src/expression.c:1479: range_expr_value_low: Assertion `0' failed.
	After:
	    nft --check -f tests/shell/testcases/bogons/nft-f/invalid_range_expr_type_binop
	    invalid_range_expr_type_binop:4:18-25: Error: Could not resolve hostname: Name or service not known
	    elements = { 1&.141.0.1 - 192.168.0.2}
	                 ^^^^^^^^
	Signed-off-by: Florian Westphal <fw@strlen.de>
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	evaluate: translate meter into dynamic set	Pablo Neira Ayuso	2024-03-12	1	-0/+5
	129f9d153279 ("nft: migrate man page examples with `meter` directive to sets") already replaced meters by dynamic sets.
	This patch removes NFT_SET_ANONYMOUS flag from the implicit set that is instantiated via meter, so the listing shows a dynamic set instead which is the recommended approach these days.
	Therefore, a batch like this:
	    add table t
	    add chain t c
	    add rule t c tcp dport 80 meter m size 128 { ip saddr timeout 1s limit rate 10/second }
	gets translated to a dynamic set:
	    table ip t {
	        set m {
	            type ipv4_addr
	            size 128
	            flags dynamic,timeout
	        }
	        chain c {
	            tcp dport 80 update @m { ip saddr timeout 1s limit rate 10/second burst 5 packets }
	        }
	    }
	Check for NFT_SET_ANONYMOUS flag is also relaxed for list and flush meter commands:
	    # nft list meter ip t m
	    table ip t {
	        set m {
	            type ipv4_addr
	            size 128
	            flags dynamic,timeout
	        }
	    }
	    # nft flush meter ip t m
	As a side effect the legacy 'list meter' and 'flush meter' commands allow to flush a dynamic set to retain backward compatibility.
	This patch updates testcases/sets/0022type_selective_flush_0 and testcases/sets/0038meter_list_0 as well as the json output which now uses the dynamic set representation.
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	evaluate: permit use of host-endian constant values in set lookup keys	Pablo Neira Ayuso	2024-02-13	1	-0/+1
	AFL found following crash:
	    table ip filter {
	        map ipsec_in {
	            typeof ipsec in reqid . iif : verdict
	            flags interval
	        }
	        chain INPUT {
	            type filter hook input priority filter; policy drop;
	            ipsec in reqid . 100 @ipsec_in
	        }
	    }
	Which yields:
	    nft: evaluate.c:1213: expr_evaluate_unary: Assertion `!expr_is_constant(arg)' failed.
	All existing test cases with constant values use big endian values, but "iif" expects host endian values.
	As raw values were not supported before, concat byteorder conversion doesn't handle constants.
	Fix this:
	    1. Add constant handling so that the number is converted in-place, without unary expression.
	    2. Add the inverse handling on delinearization for non-interval set types. When dissecting the concat data soup, watch for integer constants where the datatype indicates host endian integer.
	Last, extend an existing test case with the afl input to cover in/output. A new test case is added to test linearization, delinearization and matching.
	Based on original patch from Florian Westphal, patch subject and description written by him.
	Fixes: b422b07ab2f9 ("src: permit use of constant values in set lookup keys")
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	datatype: Describe rt symbol tables	Phil Sutter	2024-01-02	1	-0/+3
	Implement a symbol_table_print() wrapper for the run-time populated rt_symbol_tables which formats output similar to expr_describe() and includes the data source.
	Since these tables reside in struct output_ctx there is no implicit connection between data type and therefore providing callbacks for relevant data types which feed the data into said wrapper is a simpler solution than extending expr_describe() itself.
	Signed-off-by: Phil Sutter <phil@nwl.cc>
*	src: do not allow to chain more than 16 binops	Florian Westphal	2023-12-22	2	-1/+6
	netlink_linearize.c has never supported more than 16 chained binops. Adding more is possible but overwrites the stack in netlink_gen_bitwise().
	Add a recursion counter to catch this at eval stage.
	It's not enough to just abort once the counter hits NFT_MAX_EXPR_RECURSION. This is because there are valid test cases that exceed this. For example, evaluation of 1 | 2 will merge the constants, so even if there are a dozen recursive eval calls this will not end up with large binop chain post-evaluation.
	v2: allow more than 16 binops iff the evaluation function did constant-merging.
	Signed-off-by: Florian Westphal <fw@strlen.de>
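The interaction between the recursion counter and constant merging can be sketched in Python. This is an illustrative model only: expressions are tuples, the limit value of 16 is assumed from the "16 chained binops" wording, and `evaluate()` is a hypothetical function, not the nftables evaluator.

```python
NFT_MAX_EXPR_RECURSION = 16  # assumed value, taken from the "16 binops" above


def evaluate(expr, depth=0):
    """Evaluate a binop tree, merging constants and limiting chain depth.

    expr is an int constant, a str naming a non-constant operand, or a
    tuple (op, left, right) with op in '|', '&', '^'.
    """
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left = evaluate(left, depth + 1)
    right = evaluate(right, depth + 1)
    if isinstance(left, int) and isinstance(right, int):
        # Constant merging: this binop disappears, so it cannot add to the chain.
        return {"|": left | right, "&": left & right, "^": left ^ right}[op]
    if depth >= NFT_MAX_EXPR_RECURSION:
        raise ValueError("too many chained binops")
    return (op, left, right)
```

A chain of 40 constant ORs collapses to a single constant and is accepted, while a chain of 17 binops on a non-constant operand is rejected.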
*	intervals: set_to_range can be static	Florian Westphal	2023-12-16	1	-1/+0
	Signed-off-by: Florian Westphal <fw@strlen.de>
*	src: reject large raw payload and concat expressions	Florian Westphal	2023-12-15	1	-0/+3
	The kernel will reject this too, but unfortunately nft may try to cram the data into the underlying libnftnl expr.
	This causes heap corruption or
	    BUG: nld buffer overflow: want to copy 132, max 64
	After:
	    Error: Concatenation of size 544 exceeds maximum size of 512
	    udp length . @th,0,512 . @th,512,512 { 47-63 . 0xe373135363130 . 0x33131303735353203 }
	                 ^^^^^^^^^
	resp. same warning for an over-sized raw expression.
	Signed-off-by: Florian Westphal <fw@strlen.de>
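A size check of this kind can be sketched in Python. The 512-bit limit comes from the error message above; padding each field to a 32-bit register and checking after every field are assumptions of this sketch, chosen because they reproduce the reported size of 544 (32 + 512) for the example.

```python
MAX_CONCAT_BITS = 512   # from the error message
REG_BITS = 32           # assumption: each field is padded to 32-bit registers


def concat_size(field_bits):
    """Return the padded size of a concatenation, rejecting oversized ones."""
    total = 0
    for bits in field_bits:
        total += -(-bits // REG_BITS) * REG_BITS  # round up to a whole register
        if total > MAX_CONCAT_BITS:
            raise ValueError(
                f"Concatenation of size {total} exceeds maximum size of {MAX_CONCAT_BITS}"
            )
    return total
```

For the fields `udp length` (16 bits), `@th,0,512` and `@th,512,512`, the check fails at the second field.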
*	netlink: add and use nft_data_memcpy helper	Florian Westphal	2023-12-12	1	-1/+1
	There is a stack overflow somewhere in this code, we end up memcpy'ing a way too large expr into a fixed-size on-stack buffer.
	This is hard to diagnose, most of this code gets inlined so the crash happens later on return from alloc_nftnl_setelem.
	Condense the memcpy into a helper and add a BUG so we can catch the overflow before it occurs.
	->value is too small (4, should be 16), but for normal cases (well-formed data must fit into max reg space, i.e. 64 byte) the chain buffer that comes after value in the structure provides a cushion.
	In order to have the new BUG() not trigger on valid data, bump value to the correct size, this is userspace so the additional 60 bytes of stack usage is no concern.
	Signed-off-by: Florian Westphal <fw@strlen.de>
*	evaluate: reset statement length context before evaluating statement	Pablo Neira Ayuso	2023-12-08	1	-0/+1
	This patch consolidates ctx->stmt_len reset in stmt_evaluate() to avoid this problem. Note that stmt_evaluate_meta() and stmt_evaluate_ct() already reset it after the statement evaluation.
	Moreover, statement dependency can be generated while evaluating a meta and ct statement. Payload statement dependency already manually stashes this before calling stmt_evaluate(). Add a new stmt_dependency_evaluate() function to stash statement length context when evaluating a new statement dependency and use it for all of the existing statement dependencies.
	Florian also says:
	'meta mark set vlan id map { 1 : 0x00000001, 4095 : 0x00004095 }' will crash. Reason is that the l2 dependency generated here is erroneously expanded to a 32bit-one, so the evaluation path won't recognize this as a L2 dependency. Therefore, pctx->stacked_ll_count is 0 and __expr_evaluate_payload() crashes with a null deref when dereferencing pctx->stacked_ll[0].
	nft-test.py gains a fugly hack to tolerate '!map typeof vlan id : meta mark'. For more generic support we should find something more acceptable, e.g.
	    !map typeof( everything here is a key or data ) timeout ...
	tests/py update and assert(pctx->stacked_ll_count) by Florian Westphal.
	Fixes: edecd58755a8 ("evaluate: support shifts larger than the width of the left operand")
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
	Signed-off-by: Florian Westphal <fw@strlen.de>
*	src: remove xfree() and use plain free()	Thomas Haller	2023-11-09	1	-1/+0
	xmalloc() (and similar x-functions) are used for allocation. They wrap malloc()/realloc() but will abort the program on ENOMEM. The meaning of xmalloc() is that it wraps malloc() but aborts on failure. I don't think x-functions should have the notion, that this were potentially a different memory allocator that must be paired with a particular xfree().
	Even if the original intent was that the allocator is abstracted (and possibly not backed by standard malloc()/free()), then that doesn't seem a good idea. Nowadays libc allocators are pretty good, and we would need a very special use case to switch to something else. In other words, it will never happen that xmalloc() is not backed by malloc().
	Also there were a few places, where a xmalloc() was already "wrongly" paired with free() (for example, iface_cache_release(), exit_cookie(), nft_run_cmd_from_buffer()).
	Or note how pid2name() returns an allocated string from fscanf(), which needs to be freed with free() (and not xfree()). This requirement bubbles up the callers portid2name() and name_by_portid(). This case was actually handled correctly and the buffer was freed with free(). But it shows that mixing different allocators is cumbersome to get right. Of course, we don't actually have different allocators and whether to use free() or xfree() makes no difference. The point is that xfree() serves no actual purpose except raising irrelevant questions about whether x-functions are correctly paired with xfree().
	Note that xfree() also used to accept const pointers. It is bad to cast away constness unconditionally for all deallocations. Instead prefer to use plain free(). To free a const pointer use free_const() which obviously wraps free, as indicated by the name.
	Signed-off-by: Thomas Haller <thaller@redhat.com>
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	src: add free_const() and use it instead of xfree()	Thomas Haller	2023-11-09	1	-0/+6
	Almost everywhere xmalloc() and friends are used instead of malloc(). This is almost everywhere paired with xfree().
	xfree() has two problems. First, it brings the wrong notion that xmalloc() should be paired with xfree(), as if xmalloc() would not use the plain malloc() allocator. In practice, xfree() just wraps free(), and it wouldn't make sense any other way. xfree() should go away. This will be addressed in the next commit.
	The problem addressed by this commit is that xfree() accepts a const pointer. Paired with the practice of almost always using xfree() instead of free(), all our calls to xfree() cast away constness of the pointer, regardless whether that is necessary. Declaring a pointer as const should help us to catch wrong uses. If the xfree() function always casts away const, the compiler doesn't help.
	There are many places that rightly cast away const during free. But not all of them. Add a free_const() macro, which is like free(), but accepts const pointers. We should always make an intentional choice whether to use free() or free_const(). Having a free_const() macro makes this very common choice clearer, instead of adding a (void*) cast at many places.
	Note that we now pair xmalloc() allocations with a free() call (instead of xfree()). That inconsistency will be resolved in the next commit.
	Signed-off-by: Thomas Haller <thaller@redhat.com>
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	gmputil: add nft_gmp_free() to free strings from mpz_get_str()	Thomas Haller	2023-11-09	1	-0/+2
	mpz_get_str() (with NULL as first argument) will allocate a buffer using the allocator functions (mp_set_memory_functions()). We should free those buffers with the corresponding free function.
	Add nft_gmp_free() for that and use it.
	The name nft_gmp_free() is chosen because "mini-gmp.c" already has an internal define called gmp_free(). There wouldn't be a direct conflict, but using the same name is confusing. And maybe our own defines should have a clear nft prefix.
	Signed-off-by: Thomas Haller <thaller@redhat.com>
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	build: no recursive-make for "include/**/Makefile.am"	Thomas Haller	2023-11-02	8	-70/+0
	Switch from recursive-make to a single top-level Makefile. This is the first step, the following patches will continue this.
	Unlike meson's subdir() or C's #include, automake's SUBDIRS= does not include a Makefile. Instead, it calls `make -C $dir`.
	    https://www.gnu.org/software/make/manual/html_node/Recursion.html
	    https://www.gnu.org/software/automake/manual/html_node/Subdirectories.html
	See also, "Recursive Make Considered Harmful".
	    https://accu.org/journals/overload/14/71/miller_2004/
	This has several problems, which we can avoid with a single Makefile:
	- recursive-make is harder to maintain and understand as a whole. Recursive-make makes sense, when there are truly independent sub-projects. Which is not the case here. The project needs to be considered as a whole and not one directory at a time. When we add unit tests (which we should), those would reside in separate directories but have dependencies between directories. With a single Makefile, we see all at once. The build setup has an inherent complexity, and that complexity is not necessarily reduced by splitting it into more files. On the contrary it helps to have it all in one place, provided that it's sensibly structured, named and organized.
	- typing `make` prints irrelevant "Entering directory" messages. So much so, that at the end of the build, the terminal is filled with such messages and we have to scroll to see what even happened.
	- with recursive-make, during build we see:
	    make[3]: Entering directory '.../nftables/src'
	      CC       meta.lo
	    meta.c:13:2: error: #warning hello test [-Werror=cpp]
	       13 | #warning hello test
	          |  ^~~~~~~
	  With a single Makefile we get
	      CC       src/meta.lo
	    src/meta.c:13:2: error: #warning hello test [-Werror=cpp]
	       13 | #warning hello test
	          |  ^~~~~~~
	  This shows the full filename -- assuming that the developer works from the top level directory. The full name is useful, for example to copy+paste into the terminal.
	- single Makefile is also faster:
	    $ make && perf stat -r 200 -B make -j
	  I measure 35msec vs. 80msec.
	- recursive-make limits parallel make. You have to craft the SUBDIRS= in the correct order. The dependencies between directories are limited, as make only sees "LDADD = $(top_builddir)/src/libnftables.la" and not the deeper dependencies for the library.
	- I presume, some people like recursive-make because of `make -C $subdir` to only rebuild one directory. Rebuilding the entire tree is already very fast, so this feature seems not relevant. Also, as dependency handling is limited, we might wrongly not rebuild a target. For example,
	    make check
	    touch src/meta.c
	    make -C examples check
	  does not rebuild "examples/nft-json-file". What we now can do with single Makefile (and better than before), is `make examples/nft-json-file`, which works as desired and rebuilds all dependencies.
	Signed-off-by: Thomas Haller <thaller@redhat.com>
	Signed-off-by: Florian Westphal <fw@strlen.de>
*	icmpv6: Allow matching target address in NS/NA, redirect and MLD	Nicolas Cavallari	2023-10-06	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It was previously not possible to match the target address of a neighbor solicitation or neighbor advertisement against a dynamic set, unlike in IPv4. Since there are many ICMPv6 messages with an address at the same offset, allow filtering on the target address for all icmp types that have one. While at it, also allow matching the destination address of an ICMPv6 redirect. Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr> Signed-off-by: Florian Westphal <fw@strlen.de>
*	json: add missing map statement stub	Pablo Neira Ayuso	2023-09-28	1	-0/+1
\| \| \| \| \| \| \|	Add map statement stub to restore compilation without json support. Fixes: 27a2da23d508 ("netlink_linearize: skip set element expression in map statement key") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	include: include <string.h> in <nft.h>	Thomas Haller	2023-09-28	3	-3/+1
\| \| \| \| \| \| \| \|	<string.h> provides strcmp(), as such it's very basic and used everywhere. Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	netlink_linearize: skip set element expression in map statement key	Pablo Neira Ayuso	2023-09-27	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This fix is similar to 22d201010919 ("netlink_linearize: skip set element expression in set statement key") to fix the map statement. netlink_gen_map_stmt() relies on the map key, that is expressed as a set element. Use the set element key instead to skip the set element wrap, otherwise get_register() aborts execution: nft: netlink_linearize.c:650: netlink_gen_expr: Assertion `dreg < ctx->reg_low' failed. This includes JSON support to make this feature complete and it updates tests/shell to cover for this support. Reported-by: Luci Stanescu <luci@cnix.ro> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	expression: cleanup expr_ops_by_type() and handle u32 input	Thomas Haller	2023-09-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make fewer assumptions about the underlying integer type of the enum. Instead, be clear about where we have an untrusted uint32_t from netlink and an enum. Rename expr_ops_by_type() to expr_ops_by_type_u32() to make this clearer. Later we might make the enum packed, when this starts to matter more. Also, only the code path expr_ops() wants strict validation and assert against valid enum values. Move the assertion out of __expr_ops_by_type(). Then expr_ops_by_type_u32() does not need to duplicate the handling of EXPR_INVALID. We still need to duplicate the check against EXPR_MAX, to ensure that the uint32_t value can be cast to an enum value. [ Remove cast on EXPR_MAX. --pablo ] Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
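The split described in this message, between an untrusted uint32_t coming from netlink and a trusted enum value, can be illustrated with a minimal sketch. Everything below is hypothetical: the `sketch_` names, the cut-down enum and the two ops entries are assumptions made for illustration and are not the nftables definitions. The point is only that the range check happens before the cast, and that the lenient lookup returns NULL instead of asserting.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down enum; the real expression type list is longer. */
enum sketch_expr_type {
	SKETCH_EXPR_INVALID,
	SKETCH_EXPR_VALUE,
	SKETCH_EXPR_SET_ELEM,
	SKETCH_EXPR_COUNT
};
#define SKETCH_EXPR_MAX (SKETCH_EXPR_COUNT - 1)

struct sketch_expr_ops {
	const char *name;
};

static const struct sketch_expr_ops sketch_value_ops = { .name = "value" };
static const struct sketch_expr_ops sketch_set_elem_ops = { .name = "set element" };

/* Lenient lookup on a trusted enum: NULL for INVALID or unknown values. */
static const struct sketch_expr_ops *
sketch_ops_by_type(enum sketch_expr_type etype)
{
	switch (etype) {
	case SKETCH_EXPR_VALUE:
		return &sketch_value_ops;
	case SKETCH_EXPR_SET_ELEM:
		return &sketch_set_elem_ops;
	default:
		return NULL;
	}
}

/* Untrusted input: check the range first, only then cast to the enum. */
static const struct sketch_expr_ops *
sketch_ops_by_type_u32(uint32_t value)
{
	if (value > SKETCH_EXPR_MAX)
		return NULL;
	return sketch_ops_by_type((enum sketch_expr_type)value);
}
```

A strict variant for internal callers would wrap `sketch_ops_by_type()` and assert on a NULL result, which keeps the assertion out of the path that handles netlink input.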
*	datatype: return const pointer from datatype_get()	Thomas Haller	2023-09-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	"struct datatype" is for the most part immutable, and most callers deal with const pointers. That's why datatype_get() accepts a const pointer to increase the reference count (mutating the refcnt field). It should also return a const pointer. In fact, all callers are fine with that already. Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	datatype: use "enum byteorder" instead of int in set_datatype_alloc()	Thomas Haller	2023-09-20	1	-1/+1
\| \| \| \| \| \| \|	Use the enum types as we have them. Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	include: fix missing definitions in <cache.h>/<headers.h>	Thomas Haller	2023-09-20	2	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \|	The headers should be self-contained so they can be included in any order. With exception of <nft.h>, which any internal header can rely on. Some fixes for <cache.h>/<headers.h>. In case of <cache.h>, forward declare some of the structs instead of including the headers. <headers.h> uses struct in6_addr. Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	datatype: initialize TYPE_CT_EVENTBIT slot in datatype array	Pablo Neira Ayuso	2023-09-20	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Matching on ct event makes no sense since this is mostly used as a statement to globally filter out ctnetlink events, but do not crash if it is used from concatenations. Add the missing slot in the datatype array so this does not crash. Fixes: 2595b9ad6840 ("ct: add conntrack event mask support") Reported-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	datatype: initialize TYPE_CT_LABEL slot in datatype array	Pablo Neira Ayuso	2023-09-20	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Otherwise, ct label with concatenations such as: table ip x { chain y { ct label . ct mark { 0x1 . 0x1 } } } crashes: ../include/datatype.h:196:11: runtime error: member access within null pointer of type 'const struct datatype' AddressSanitizer:DEADLYSIGNAL ================================================================= ==640948==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7fc970d3199b bp 0x7fffd1f20560 sp 0x7fffd1f20540 T0) ==640948==The signal is caused by a READ memory access. ==640948==Hint: address points to the zero page. sudo #0 0x7fc970d3199b in datatype_equal ../include/datatype.h:196 Fixes: 2fcce8b0677b ("ct: connlabel matching support") Reported-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
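The two "initialize slot in datatype array" fixes above share one failure mode: a table indexed by an enum, filled with designated initializers, silently leaves NULL in any slot that is not listed. A minimal sketch of that pattern follows. All names are hypothetical (`sketch_` prefix) and the table is an assumption for illustration, not the nftables datatype array; one slot is left out on purpose to show what a lookup returns.

```c
#include <stddef.h>

/* Hypothetical type identifiers; not the nftables enum. */
enum sketch_type {
	SKETCH_TYPE_INVALID,
	SKETCH_TYPE_INTEGER,
	SKETCH_TYPE_CT_LABEL,
	SKETCH_TYPE_CT_EVENTBIT,
	SKETCH_TYPE_COUNT
};

struct sketch_datatype {
	enum sketch_type type;
	const char *name;
};

static const struct sketch_datatype sketch_integer = {
	.type = SKETCH_TYPE_INTEGER, .name = "integer",
};
static const struct sketch_datatype sketch_ct_label = {
	.type = SKETCH_TYPE_CT_LABEL, .name = "ct_label",
};

/* Slots without an initializer are NULL. SKETCH_TYPE_CT_EVENTBIT is
 * deliberately missing here to reproduce the kind of gap being fixed. */
static const struct sketch_datatype *sketch_datatypes[SKETCH_TYPE_COUNT] = {
	[SKETCH_TYPE_INTEGER]  = &sketch_integer,
	[SKETCH_TYPE_CT_LABEL] = &sketch_ct_label,
};

/* Bounds-checked lookup; callers must still handle a NULL slot. */
static const struct sketch_datatype *sketch_datatype_lookup(enum sketch_type t)
{
	if ((unsigned int)t >= SKETCH_TYPE_COUNT)
		return NULL;
	return sketch_datatypes[t];
}

/* Count valid type ids that have no entry in the table. */
static size_t sketch_count_missing_slots(void)
{
	size_t missing = 0;

	for (int t = SKETCH_TYPE_INVALID + 1; t < SKETCH_TYPE_COUNT; t++) {
		if (sketch_datatypes[t] == NULL)
			missing++;
	}
	return missing;
}
```

A check like `sketch_count_missing_slots()` could run in a unit test so that a newly added type id without a table entry is caught before a caller dereferences the NULL slot.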
*	libnftables: drop gmp_init() and mp_set_memory_functions()	Thomas Haller	2023-09-19	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Setting global handles for libgmp via mp_set_memory_functions() is very ugly. When we don't use mini-gmp, then potentially there are other users of the library in the same process, and every process fighting about the allocation functions is not gonna work. It also means, we must not reset the allocation functions after somebody already allocated GMP data with them. Which we cannot ensure, as we don't know what other parts of the process are doing. It's also unnecessary. The default allocation functions for gmp and mini-gmp already abort the process on allocation failure ([1], [2]), just like our xmalloc(). Just don't do this. [1] https://gmplib.org/repo/gmp/file/8225bdfc499f/memory.c#l37 [2] https://git.netfilter.org/nftables/tree/src/mini-gmp.c?id=6d19a902c1d77cb51b940b1ce65f31b1cad38b74#n286 Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	datatype: fix leak and cleanup reference counting for struct datatype	Thomas Haller	2023-09-14	2	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Test `./tests/shell/run-tests.sh -V tests/shell/testcases/maps/nat_addr_port` fails: ==118== 195 (112 direct, 83 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 3 ==118== at 0x484682C: calloc (vg_replace_malloc.c:1554) ==118== by 0x48A39DD: xmalloc (utils.c:37) ==118== by 0x48A39DD: xzalloc (utils.c:76) ==118== by 0x487BDFD: datatype_alloc (datatype.c:1205) ==118== by 0x487BDFD: concat_type_alloc (datatype.c:1288) ==118== by 0x488229D: stmt_evaluate_nat_map (evaluate.c:3786) ==118== by 0x488229D: stmt_evaluate_nat (evaluate.c:3892) ==118== by 0x488229D: stmt_evaluate (evaluate.c:4450) ==118== by 0x488328E: rule_evaluate (evaluate.c:4956) ==118== by 0x48ADC71: nft_evaluate (libnftables.c:552) ==118== by 0x48AEC29: nft_run_cmd_from_buffer (libnftables.c:595) ==118== by 0x402983: main (main.c:534) I think the reference handling for datatype is wrong. It was introduced by commit 01a13882bb59 ('src: add reference counter for dynamic datatypes'). We don't notice it most of the time, because instances are statically allocated, where datatype_get()/datatype_free() is a NOP. Fix and rework. - Commit 01a13882bb59 comments "The reference counter of any newly allocated datatype is set to zero". That seems not workable. Previously, functions like datatype_clone() would have returned the refcnt set to zero. Some callers would then set the refcnt to one, but some wouldn't (set_datatype_alloc()). Calling datatype_free() with a refcnt of zero will overflow to UINT_MAX and leak: if (--dtype->refcnt > 0) return; While there could be schemes with such asymmetric counting that juggle the appropriate number of datatype_get() and datatype_free() calls, this is confusing and error prone. The common pattern is that every alloc/clone/get/ref is paired with exactly one unref/free. 
Let datatype_clone() return references with refcnt set to 1 and in general be always clear about where we transfer ownership (take a reference) and where we need to release it. - set_datatype_alloc() needs to consistently return ownership of the reference. Previously, some code paths would and others wouldn't. - Replace datatype_set(key, set_datatype_alloc(dtype, key->byteorder)) with a __datatype_set() which takes ownership. Fixes: 01a13882bb59 ('src: add reference counter for dynamic datatypes') Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
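The counting problem in this message can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the nftables code: the `sketch_` names and the live-object counter are hypothetical and exist only to make the behaviour observable. It shows the symmetric scheme the message argues for (allocation returns a reference with refcnt 1, every get is paired with one free) and, separately, what `--refcnt > 0` evaluates to when an unsigned counter starts at zero.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical refcounted object; not struct datatype. */
struct sketch_dtype {
	unsigned int refcnt;
};

/* Number of live objects, kept only so the sketch can be checked. */
static unsigned int sketch_live;

/* Allocation hands the caller one reference (refcnt == 1). */
static struct sketch_dtype *sketch_dtype_alloc(void)
{
	struct sketch_dtype *d = calloc(1, sizeof(*d));

	if (d == NULL)
		abort();
	d->refcnt = 1;
	sketch_live++;
	return d;
}

/* Take an additional reference. */
static struct sketch_dtype *sketch_dtype_get(struct sketch_dtype *d)
{
	d->refcnt++;
	return d;
}

/* Drop one reference; release the object when the last one goes away. */
static void sketch_dtype_free(struct sketch_dtype *d)
{
	if (d == NULL)
		return;
	if (--d->refcnt > 0)
		return;
	free(d);
	sketch_live--;
}

/* With a counter starting at 0, the same decrement wraps to UINT_MAX,
 * so the "> 0" test passes and the object would never be released. */
static unsigned int sketch_decrement_from_zero(void)
{
	unsigned int refcnt = 0;

	return --refcnt;
}
```

With allocation starting at 1, the number of `sketch_dtype_free()` calls needed is always the number of alloc and get calls combined, which is the pairing rule stated in the message.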
*	include: include <stdlib.h> in <nft.h>	Thomas Haller	2023-09-11	2	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	It provides malloc()/free(), which is so basic that we need it everywhere. Include via <nft.h>. The ultimate purpose is to define more things in <nft.h>. While it has no corresponding C sources, <nft.h> can contain macros and static inline functions, and is a good place for things that we shall have everywhere. Since <stdlib.h> provides malloc()/free() and size_t, that is a very basic dependency, that will be needed for that. Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
*	datatype: rename "dtype_clone()" to datatype_clone()	Thomas Haller	2023-09-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	The struct is called "datatype" and related functions have the fitting "datatype_" prefix. Rename. Also rename the internal "dtype_alloc()" to "datatype_alloc()". This is a follow up to commit 01a13882bb59 ('src: add reference counter for dynamic datatypes'), which started adding "datatype_*()" functions. Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>