Diffstat (limited to 'doc')
-rw-r--r-- doc/Makefile.am 30
-rw-r--r-- doc/data-types.txt 113
-rw-r--r-- doc/libnftables-json.adoc 82
-rw-r--r-- doc/libnftables.adoc 50
-rw-r--r-- doc/nft.txt 227
-rw-r--r-- doc/payload-expression.txt 319
-rw-r--r-- doc/primary-expression.txt 56
-rw-r--r-- doc/stateful-objects.txt 77
-rw-r--r-- doc/statements.txt 306
9 files changed, 952 insertions, 308 deletions
diff --git a/doc/Makefile.am b/doc/Makefile.am
deleted file mode 100644
index 21482320..00000000
--- a/doc/Makefile.am
+++ /dev/null
@@ -1,30 +0,0 @@
-if BUILD_MAN
-man_MANS = nft.8 libnftables-json.5 libnftables.3
-
-A2X_OPTS_MANPAGE = -L --doctype manpage --format manpage -D ${builddir}
-
-ASCIIDOC_MAIN = nft.txt
-ASCIIDOC_INCLUDES = \
- data-types.txt \
- payload-expression.txt \
- primary-expression.txt \
- stateful-objects.txt \
- statements.txt
-ASCIIDOCS = ${ASCIIDOC_MAIN} ${ASCIIDOC_INCLUDES}
-
-EXTRA_DIST = ${ASCIIDOCS} ${man_MANS} libnftables-json.adoc libnftables.adoc
-
-CLEANFILES = \
- *~
-
-nft.8: ${ASCIIDOCS}
- ${AM_V_GEN}${A2X} ${A2X_OPTS_MANPAGE} $<
-
-.adoc.3:
- ${AM_V_GEN}${A2X} ${A2X_OPTS_MANPAGE} $<
-
-.adoc.5:
- ${AM_V_GEN}${A2X} ${A2X_OPTS_MANPAGE} $<
-
-CLEANFILES += ${man_MANS}
-endif
diff --git a/doc/data-types.txt b/doc/data-types.txt
index a42a55fa..6c0e2f94 100644
--- a/doc/data-types.txt
+++ b/doc/data-types.txt
@@ -242,35 +242,13 @@ integer
The ICMP Code type is used to conveniently specify the ICMP header's code field.
-.Keywords may be used when specifying the ICMP code
-[options="header"]
-|==================
-|Keyword | Value
-|net-unreachable |
-0
-|host-unreachable |
-1
-|prot-unreachable|
-2
-|port-unreachable|
-3
-|frag-needed|
-4
-|net-prohibited|
-9
-|host-prohibited|
-10
-|admin-prohibited|
-13
-|===================
-
ICMPV6 TYPE TYPE
~~~~~~~~~~~~~~~~
[options="header"]
|==================
|Name | Keyword | Size | Base type
|ICMPv6 Type |
-icmpx_code |
+icmpv6_type |
8 bit |
integer
|===================
@@ -340,52 +318,6 @@ integer
The ICMPv6 Code type is used to conveniently specify the ICMPv6 header's code field.
-.keywords may be used when specifying the ICMPv6 code
-[options="header"]
-|==================
-|Keyword |Value
-|no-route|
-0
-|admin-prohibited|
-1
-|addr-unreachable|
-3
-|port-unreachable|
-4
-|policy-fail|
-5
-|reject-route|
-6
-|==================
-
-ICMPVX CODE TYPE
-~~~~~~~~~~~~~~~~
-[options="header"]
-|==================
-|Name | Keyword | Size | Base type
-|ICMPvX Code |
-icmpv6_type |
-8 bit |
-integer
-|===================
-
-The ICMPvX Code type abstraction is a set of values which overlap between ICMP
-and ICMPv6 Code types to be used from the inet family.
-
-.keywords may be used when specifying the ICMPvX code
-[options="header"]
-|==================
-|Keyword |Value
-|no-route|
-0
-|port-unreachable|
-1
-|host-unreachable|
-2
-|admin-prohibited|
-3
-|=================
-
CONNTRACK TYPES
~~~~~~~~~~~~~~~
@@ -492,3 +424,46 @@ For each of the types above, keywords are available for convenience:
|==================
Possible keywords for conntrack label type (ct_label) are read at runtime from /etc/connlabel.conf.
+
+DCCP PKTTYPE TYPE
+~~~~~~~~~~~~~~~~
+[options="header"]
+|==================
+|Name | Keyword | Size | Base type
+|DCCP packet type |
+dccp_pkttype |
+4 bit |
+integer
+|===================
+
+The DCCP packet type abstracts the different legal values of the respective
+four bit field in the DCCP header, as stated by RFC4340. Note that possible
+values 10-15 are considered reserved and therefore not allowed to be used. In
+iptables' *dccp* match, these values are aliased 'INVALID'. With nftables, one
+may simply match on the numeric value range, i.e. *10-15*.
+
+.keywords may be used when specifying the DCCP packet type
+[options="header"]
+|==================
+|Keyword |Value
+|request|
+0
+|response|
+1
+|data|
+2
+|ack|
+3
+|dataack|
+4
+|closereq|
+5
+|close|
+6
+|reset|
+7
+|sync|
+8
+|syncack|
+9
+|=================
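The keyword table added in the hunk above can be read as a simple lookup. The following is a minimal Python sketch of that mapping, not nftables code: the helper name `dccp_pkttype_name` and the label `"reserved"` are hypothetical and only illustrate that values 10-15 have no keyword and must be matched numerically.

```python
# Keyword/value pairs copied from the dccp_pkttype table in the hunk above.
DCCP_PKTTYPES = {
    "request": 0, "response": 1, "data": 2, "ack": 3, "dataack": 4,
    "closereq": 5, "close": 6, "reset": 7, "sync": 8, "syncack": 9,
}

# Reverse mapping: numeric value -> keyword.
VALUE_TO_KEYWORD = {value: keyword for keyword, value in DCCP_PKTTYPES.items()}


def dccp_pkttype_name(value: int) -> str:
    """Return the keyword for a 4-bit DCCP packet type value.

    Values 10-15 are reserved and have no keyword; this hypothetical
    helper reports them as "reserved".
    """
    if not 0 <= value <= 15:
        raise ValueError("DCCP packet type is a 4-bit field (0-15)")
    return VALUE_TO_KEYWORD.get(value, "reserved")
```

With this model, any value in the range 10-15 maps to `"reserved"`, which corresponds to the range that iptables' *dccp* match aliases as 'INVALID'.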
diff --git a/doc/libnftables-json.adoc b/doc/libnftables-json.adoc
index 858abbf7..e3b24cc4 100644
--- a/doc/libnftables-json.adoc
+++ b/doc/libnftables-json.adoc
@@ -91,14 +91,15 @@ translates into JSON as such:
{ "add": { "chain": {
"family": "inet",
"table": "mytable",
- "chain": "mychain"
- }}}
+ "name": "mychain"
+ }}},
{ "add": { "rule": {
"family": "inet",
"table": "mytable",
"chain": "mychain",
"expr": [
{ "match": {
+ "op": "==",
"left": { "payload": {
"protocol": "tcp",
"field": "dport"
@@ -174,7 +175,7 @@ kind, optionally filtered by *family* and for some, also *table*.
____
*{ "reset":* 'RESET_OBJECT' *}*
-'RESET_OBJECT' := 'COUNTER' | 'COUNTERS' | 'QUOTA' | 'QUOTAS'
+'RESET_OBJECT' := 'COUNTER' | 'COUNTERS' | 'QUOTA' | 'QUOTAS' | 'RULE' | 'RULES' | 'SET' | 'MAP' | 'ELEMENT'
____
Reset state in suitable objects, i.e. zero their internal counter.
@@ -311,7 +312,8 @@ ____
"elem":* 'SET_ELEMENTS'*,
"timeout":* 'NUMBER'*,
"gc-interval":* 'NUMBER'*,
- "size":* 'NUMBER'
+ "size":* 'NUMBER'*,
+ "auto-merge":* 'BOOLEAN'
*}}*
*{ "map": {
@@ -326,7 +328,8 @@ ____
"elem":* 'SET_ELEMENTS'*,
"timeout":* 'NUMBER'*,
"gc-interval":* 'NUMBER'*,
- "size":* 'NUMBER'
+ "size":* 'NUMBER'*,
+ "auto-merge":* 'BOOLEAN'
*}}*
'SET_TYPE' := 'STRING' | *[* 'SET_TYPE_LIST' *]*
@@ -365,6 +368,8 @@ that they translate a unique key to a value.
Garbage collector interval in seconds.
*size*::
Maximum number of elements supported.
+*auto-merge*::
+ Automatic merging of adjacent/overlapping set elements in interval sets.
==== TYPE
The set type might be a string, such as *"ipv4_addr"* or an array
@@ -681,11 +686,6 @@ processing continues with the next rule in the same chain.
==== OPERATORS
[horizontal]
-*&*:: Binary AND
-*|*:: Binary OR
-*^*:: Binary XOR
-*<<*:: Left shift
-*>>*:: Right shift
*==*:: Equal
*!=*:: Not equal
*<*:: Less than
@@ -904,7 +904,7 @@ Reject the packet and send the given error reply.
*type*::
Type of reject, either *"tcp reset"*, *"icmpx"*, *"icmp"* or *"icmpv6"*.
*expr*::
- ICMP type to reject with.
+ ICMP code to reject with.
All properties are optional.
@@ -1058,10 +1058,22 @@ Assign connection tracking expectation.
=== XT
[verse]
-*{ "xt": null }*
+____
+*{ "xt": {
+ "type":* 'TYPENAME'*,
+ "name":* 'STRING'
+*}}*
+
+'TYPENAME' := *match* | *target* | *watcher*
+____
-This represents an xt statement from xtables compat interface. Sadly, at this
-point, it is not possible to provide any further information about its content.
+This represents an xt statement from xtables compat interface. It is a
+fallback if translation is not available or not complete.
+
+Seeing this means the ruleset (or parts of it) was created by *iptables-nft*
+and one should use that to manage it.
+
+*BEWARE:* nftables won't restore these statements.
== EXPRESSIONS
Expressions are the building blocks of (most) statements. In their most basic
@@ -1171,7 +1183,7 @@ point (*base*). The following *base* values are accepted:
*"th"*::
The offset is relative to Transport Layer header start offset.
-The second form allows to reference a field by name (*field*) in a named packet
+The second form allows one to reference a field by name (*field*) in a named packet
header (*protocol*).
=== EXTHDR
@@ -1200,6 +1212,30 @@ Create a reference to a field (*field*) of a TCP option header (*name*).
If the *field* property is not given, the expression is to be used as a TCP option
existence check in a *match* statement with a boolean on the right hand side.
+=== SCTP CHUNK
+[verse]
+*{ "sctp chunk": {
+ "name":* 'STRING'*,
+ "field":* 'STRING'
+*}}*
+
+Create a reference to a field (*field*) of an SCTP chunk (*name*).
+
+If the *field* property is not given, the expression is to be used as an SCTP
+chunk existence check in a *match* statement with a boolean on the right hand
+side.
+
+=== DCCP OPTION
+[verse]
+*{ "dccp option": {
+ "type":* 'NUMBER'*
+*}}*
+
+Create a reference to a DCCP option (*type*).
+
+The expression is to be used as a DCCP option existence check in a *match*
+statement with a boolean on the right hand side.
+
=== META
[verse]
____
@@ -1307,15 +1343,17 @@ Perform kernel Forwarding Information Base lookups.
=== BINARY OPERATION
[verse]
-*{ "|": [* 'EXPRESSION'*,* 'EXPRESSION' *] }*
-*{ "^": [* 'EXPRESSION'*,* 'EXPRESSION' *] }*
-*{ "&": [* 'EXPRESSION'*,* 'EXPRESSION' *] }*
-*{ "+<<+": [* 'EXPRESSION'*,* 'EXPRESSION' *] }*
-*{ ">>": [* 'EXPRESSION'*,* 'EXPRESSION' *] }*
+*{ "|": [* 'EXPRESSION'*,* 'EXPRESSIONS' *] }*
+*{ "^": [* 'EXPRESSION'*,* 'EXPRESSIONS' *] }*
+*{ "&": [* 'EXPRESSION'*,* 'EXPRESSIONS' *] }*
+*{ "+<<+": [* 'EXPRESSION'*,* 'EXPRESSIONS' *] }*
+*{ ">>": [* 'EXPRESSION'*,* 'EXPRESSIONS' *] }*
+'EXPRESSIONS' := 'EXPRESSION' | 'EXPRESSION'*,* 'EXPRESSIONS'
-All binary operations expect an array of exactly two expressions, of which the
+All binary operations expect an array of at least two expressions, of which the
first element denotes the left hand side and the second one the right hand
-side.
+side. Extra elements are accepted in the given array and appended to the term
+accordingly.
=== VERDICT
[verse]
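The first hunk of this file changes the chain object to use a *name* property and adds an explicit *op* to the match statement. A minimal Python sketch of building such command objects follows. The helper names (`add_chain`, `add_rule`, `match`) are hypothetical, and the right-hand side value `22` is an assumption, since the hunk is cut off before the *right* property.

```python
import json


def add_chain(family, table, name):
    # The chain is identified by "name", as in the corrected example above.
    return {"add": {"chain": {"family": family, "table": table, "name": name}}}


def add_rule(family, table, chain, expr):
    # A rule refers to its chain by the "chain" property.
    return {"add": {"rule": {"family": family, "table": table,
                             "chain": chain, "expr": expr}}}


def match(left, right, op="=="):
    # The match statement carries an explicit operator in "op".
    return {"match": {"op": op, "left": left, "right": right}}


tcp_dport = {"payload": {"protocol": "tcp", "field": "dport"}}

commands = [
    add_chain("inet", "mytable", "mychain"),
    # 22 is an assumed right-hand side; the hunk does not show it.
    add_rule("inet", "mytable", "mychain", [match(tcp_dport, 22)]),
]

print(json.dumps(commands, indent=2))
```

Building the objects with small helpers keeps the *name* versus *chain* distinction in one place, which is the detail the hunk corrects.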
diff --git a/doc/libnftables.adoc b/doc/libnftables.adoc
index ce4a361b..2cf78d7a 100644
--- a/doc/libnftables.adoc
+++ b/doc/libnftables.adoc
@@ -18,6 +18,9 @@ void nft_ctx_free(struct nft_ctx* '\*ctx'*);
bool nft_ctx_get_dry_run(struct nft_ctx* '\*ctx'*);
void nft_ctx_set_dry_run(struct nft_ctx* '\*ctx'*, bool* 'dry'*);
+unsigned int nft_ctx_input_get_flags(struct nft_ctx* '\*ctx'*);
+unsigned int nft_ctx_input_set_flags(struct nft_ctx* '\*ctx'*, unsigned int* 'flags'*);
+
unsigned int nft_ctx_output_get_flags(struct nft_ctx* '\*ctx'*);
void nft_ctx_output_set_flags(struct nft_ctx* '\*ctx'*, unsigned int* 'flags'*);
@@ -37,6 +40,9 @@ const char *nft_ctx_get_error_buffer(struct nft_ctx* '\*ctx'*);
int nft_ctx_add_include_path(struct nft_ctx* '\*ctx'*, const char* '\*path'*);
void nft_ctx_clear_include_paths(struct nft_ctx* '\*ctx'*);
+int nft_ctx_add_var(struct nft_ctx* '\*ctx'*, const char* '\*var'*);
+void nft_ctx_clear_vars(struct nft_ctx* '\*ctx'*);
+
int nft_run_cmd_from_buffer(struct nft_ctx* '\*nft'*, const char* '\*buf'*);
int nft_run_cmd_from_filename(struct nft_ctx* '\*nft'*,
const char* '\*filename'*);*
@@ -68,13 +74,37 @@ The *nft_ctx_free*() function frees the context object pointed to by 'ctx', incl
=== nft_ctx_get_dry_run() and nft_ctx_set_dry_run()
Dry-run setting controls whether ruleset changes are actually committed on kernel side or not.
-It allows to check whether a given operation would succeed without making actual changes to the ruleset.
+It allows one to check whether a given operation would succeed without making actual changes to the ruleset.
The default setting is *false*.
The *nft_ctx_get_dry_run*() function returns the dry-run setting's value contained in 'ctx'.
The *nft_ctx_set_dry_run*() function sets the dry-run setting in 'ctx' to the value of 'dry'.
+=== nft_ctx_input_get_flags() and nft_ctx_input_set_flags()
+The flags setting controls the input format.
+
+----
+enum {
+ NFT_CTX_INPUT_NO_DNS = (1 << 0),
+ NFT_CTX_INPUT_JSON = (1 << 1),
+};
+----
+
+NFT_CTX_INPUT_NO_DNS::
+ Avoid resolving IP addresses with blocking getaddrinfo(). In that case,
+ only plain IP addresses are accepted.
+
+NFT_CTX_INPUT_JSON::
+ When parsing the input, first try to interpret the input as JSON before
+ falling back to the nftables format. This behavior is implied when setting
+ the NFT_CTX_OUTPUT_JSON flag.
+
+The *nft_ctx_input_get_flags*() function returns the input flags setting's value in 'ctx'.
+
+The *nft_ctx_input_set_flags*() function sets the input flags setting in 'ctx' to the value of 'flags'
+and returns the previous flags.
+
=== nft_ctx_output_get_flags() and nft_ctx_output_set_flags()
The flags setting controls the output format.
@@ -93,6 +123,7 @@ enum {
NFT_CTX_OUTPUT_NUMERIC_TIME = (1 << 10),
NFT_CTX_OUTPUT_NUMERIC_ALL = (NFT_CTX_OUTPUT_NUMERIC_PROTO |
NFT_CTX_OUTPUT_NUMERIC_PRIO |
+ NFT_CTX_OUTPUT_NUMERIC_SYMBOL |
NFT_CTX_OUTPUT_NUMERIC_TIME),
NFT_CTX_OUTPUT_TERSE = (1 << 11),
};
@@ -114,10 +145,11 @@ NFT_CTX_OUTPUT_HANDLE::
NFT_CTX_OUTPUT_JSON::
If enabled at compile-time, libnftables accepts input in JSON format and is able to print output in JSON format as well.
See *libnftables-json*(5) for a description of the supported schema.
- This flag controls JSON output format, input is auto-detected.
+ This flag enables JSON output format. If the flag is set, the input will first be tried as JSON format,
+ before falling back to nftables format. This flag implies NFT_CTX_INPUT_JSON.
NFT_CTX_OUTPUT_ECHO::
The echo setting makes libnftables print the changes once they are committed to the kernel, just like a running instance of *nft monitor* would.
- Amongst other things, this allows to retrieve an added rule's handle atomically.
+ Amongst other things, this allows one to retrieve an added rule's handle atomically.
NFT_CTX_OUTPUT_GUID::
Display UID and GID as described in the /etc/passwd and /etc/group files.
NFT_CTX_OUTPUT_NUMERIC_PROTO::
@@ -195,9 +227,9 @@ On failure, the functions return non-zero which may only happen if buffering was
The *nft_ctx_get_output_buffer*() and *nft_ctx_get_error_buffer*() functions return a pointer to the buffered output (which may be empty).
=== nft_ctx_add_include_path() and nft_ctx_clear_include_path()
-The *include* command in nftables rulesets allows to outsource parts of the ruleset into a different file.
+The *include* command in nftables rulesets allows one to outsource parts of the ruleset into a different file.
The include path defines where these files are searched for.
-Libnftables allows to have a list of those paths which are searched in order.
+Libnftables allows one to have a list of those paths which are searched in order.
The default include path list contains a single compile-time defined entry (typically '/etc/').
The *nft_ctx_add_include_path*() function extends the list of include paths in 'ctx' by the one given in 'path'.
@@ -205,6 +237,14 @@ The function returns zero on success or non-zero if memory allocation failed.
The *nft_ctx_clear_include_paths*() function removes all include paths, even the built-in default one.
+=== nft_ctx_add_var() and nft_ctx_clear_vars()
+The *define* command in nftables rulesets allows one to define variables.
+
+The *nft_ctx_add_var*() function extends the list of variables in 'ctx'. The variable must be given in the format 'key=value'.
+The function returns zero on success or non-zero if the variable is malformed.
+
+The *nft_ctx_clear_vars*() function removes all variables.
+
=== nft_run_cmd_from_buffer() and nft_run_cmd_from_filename()
These functions perform the actual work of parsing user input into nftables commands and executing them.
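The input flag semantics documented in this file's hunks (a bitmask, with the setter returning the previous value) can be modelled in a few lines. The sketch below is a pure Python model, not the libnftables binding: the class names `InputFlags` and `ModelCtx` are hypothetical, and the initial flags value of 0 is an assumption. Only the two bit values are taken from the enum shown above.

```python
from enum import IntFlag


class InputFlags(IntFlag):
    # Bit values copied from the enum in the hunk above.
    NO_DNS = 1 << 0
    JSON = 1 << 1


class ModelCtx:
    """Toy stand-in for struct nft_ctx, holding only the input flags."""

    def __init__(self):
        # Assumption: a fresh context starts with no input flags set.
        self._input_flags = InputFlags(0)

    def input_get_flags(self):
        return self._input_flags

    def input_set_flags(self, flags):
        # Mirrors the documented behaviour: set new flags, return the old ones.
        previous = self._input_flags
        self._input_flags = InputFlags(flags)
        return previous


ctx = ModelCtx()
saved = ctx.input_set_flags(InputFlags.NO_DNS | InputFlags.JSON)
# ... parse input with both flags active ...
ctx.input_set_flags(saved)  # restore whatever was set before
```

Returning the previous flags lets a caller change the setting temporarily and restore it afterwards without a separate getter call.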
diff --git a/doc/nft.txt b/doc/nft.txt
index 5326de16..248b29af 100644
--- a/doc/nft.txt
+++ b/doc/nft.txt
@@ -9,7 +9,7 @@ nft - Administration tool of the nftables framework for packet filtering and cla
SYNOPSIS
--------
[verse]
-*nft* [ *-nNscaeSupyjt* ] [ *-I* 'directory' ] [ *-f* 'filename' | *-i* | 'cmd' ...]
+*nft* [ *-nNscaeSupyjtT* ] [ *-I* 'directory' ] [ *-f* 'filename' | *-i* | 'cmd' ...]
*nft* *-h*
*nft* *-v*
@@ -44,6 +44,10 @@ understanding of their meaning. You can get information about options by running
*--file 'filename'*::
Read input from 'filename'. If 'filename' is -, read from stdin.
+*-D*::
+*--define 'name=value'*::
+ Define a variable. You can only combine this option with '-f'.
+
*-i*::
*--interactive*::
Read input from an interactive readline CLI. You can use quit to exit, or use the EOF marker,
@@ -58,6 +62,11 @@ understanding of their meaning. You can get information about options by running
*--check*::
Check commands validity without actually applying the changes.
+*-o*::
+*--optimize*::
+ Optimize your ruleset. You can combine this option with '-c' to inspect
+ the proposed optimizations.
+
.Ruleset list output formatting that modify the output of the list ruleset command:
*-a*::
@@ -130,7 +139,7 @@ semicolon (;). +
A hash sign (#) begins a comment. All following characters on the same line are
ignored. +
-Identifiers begin with an alphabetic character (a-z,A-Z), followed zero or more
+Identifiers begin with an alphabetic character (a-z,A-Z), followed by zero or more
alphanumeric characters (a-z,A-Z,0-9) and the characters slash (/), backslash
(\), underscore (_) and dot (.). Identifiers using different characters or
clashing with a keyword need to be enclosed in double quotes (").
@@ -148,12 +157,12 @@ relative path) or / for file location expressed as an absolute path. +
If *-I*/*--includepath* is not specified, then nft relies on the default
directory that is specified at compile time. You can retrieve this default
-directory via *-h*/*--help* option. +
+directory via the *-h*/*--help* option. +
-Include statements support the usual shell wildcard symbols (\*,?,[]). Having no
+Include statements support the usual shell wildcard symbols (*,?,[]). Having no
matches for an include statement is not an error, if wildcard symbols are used
in the include statement. This allows having potentially empty include
-directories for statements like **include "/etc/firewall/rules/"**. The wildcard
+directories for statements like **include "/etc/firewall/rules/*"**. The wildcard
matches are loaded in alphabetical order. Files beginning with dot (.) are not
matched by include statements.
@@ -161,17 +170,23 @@ SYMBOLIC VARIABLES
~~~~~~~~~~~~~~~~~~
[verse]
*define* 'variable' *=* 'expr'
+*undefine* 'variable'
+*redefine* 'variable' *=* 'expr'
*$variable*
Symbolic variables can be defined using the *define* statement. Variable
-references are expressions and can be used initialize other variables. The scope
+references are expressions and can be used to initialize other variables. The scope
of a definition is the current block and all blocks contained within.
+Symbolic variables can be undefined using the *undefine* statement, and modified
+using the *redefine* statement.
.Using symbolic variables
---------------------------------------
define int_if1 = eth0
define int_if2 = eth1
define int_ifs = { $int_if1, $int_if2 }
+redefine int_if2 = wlan0
+undefine int_if2
filter input iif $int_ifs accept
---------------------------------------
@@ -189,7 +204,7 @@ packet processing paths, which invoke nftables if rules for these hooks exist.
*inet*:: Internet (IPv4/IPv6) address family.
*arp*:: ARP address family, handling IPv4 ARP packets.
*bridge*:: Bridge address family, handling packets which traverse a bridge device.
-*netdev*:: Netdev address family, handling packets from ingress.
+*netdev*:: Netdev address family, handling packets on ingress and egress.
All nftables objects exist in address family specific namespaces, therefore all
identifiers include an address family. If an identifier is specified without an
@@ -217,6 +232,11 @@ Packets forwarded to a different host are processed by the forward hook.
Packets sent by local processes are processed by the output hook.
|postrouting |
All packets leaving the system are processed by the postrouting hook.
+|ingress |
+All packets entering the system are processed by this hook. It is invoked before
+layer 3 protocol handlers, hence before the prerouting hook, and it can be used
+for filtering and policing. Ingress is only available for Inet family (since
+Linux kernel 5.10).
|===================
ARP ADDRESS FAMILY
@@ -242,17 +262,39 @@ The list of supported hooks is identical to IPv4/IPv6/Inet address families abov
NETDEV ADDRESS FAMILY
~~~~~~~~~~~~~~~~~~~~
-The Netdev address family handles packets from ingress.
+The Netdev address family handles packets from the device ingress and egress
+path. This family allows you to filter packets of any ethertype such as ARP,
+VLAN 802.1q, VLAN 802.1ad (Q-in-Q) as well as IPv4 and IPv6 packets.
.Netdev address family hooks
[options="header"]
|=================
|Hook | Description
|ingress |
-All packets entering the system are processed by this hook. It is invoked before
-layer 3 protocol handlers and it can be used for early filtering and policing.
+All packets entering the system are processed by this hook. It is invoked after
+the network taps (e.g. *tcpdump*), right after *tc* ingress and before layer 3
+protocol handlers. It can be used for early filtering and policing.
+|egress |
+All packets leaving the system are processed by this hook. It is invoked after
+layer 3 protocol handlers and before *tc* egress. It can be used for late
+filtering and policing.
|=================
+Tunneled packets (such as *vxlan*) are processed by netdev family hooks both
+in decapsulated and encapsulated (tunneled) form. So a packet can be filtered
+on the overlay network as well as on the underlying network.
+
+Note that the order of netfilter and *tc* is mirrored on ingress versus egress.
+This ensures symmetry for NAT and other packet mangling.
+
+Ingress packets which are redirected out some other interface are only
+processed by netfilter on egress if they have passed through netfilter ingress
+processing before. Thus, ingress packets which are redirected by *tc* are not
+subjected to netfilter. But they are if they are redirected by *netfilter* on
+ingress. Conceptually, tc and netfilter can be thought of as layers, with
+netfilter layered above tc: If the packet hasn't been passed up from the
+tc layer to the netfilter layer, it's not subjected to netfilter on egress.
+
RULESET
-------
[verse]
@@ -279,10 +321,11 @@ Effectively, this is the nft-equivalent of *iptables-save* and
TABLES
------
[verse]
-{*add* | *create*} *table* ['family'] 'table' [*{ flags* 'flags' *; }*]
-{*delete* | *list* | *flush*} *table* ['family'] 'table'
+{*add* | *create*} *table* ['family'] 'table' [*{* [*comment* 'comment' *;*] [*flags* 'flags' *;*] *}*]
+{*delete* | *destroy* | *list* | *flush*} *table* ['family'] 'table'
*list tables* ['family']
*delete table* ['family'] *handle* 'handle'
+*destroy table* ['family'] *handle* 'handle'
Tables are containers for chains, sets and stateful objects. They are identified
by their address family and their name. The address family must be one of *ip*,
@@ -311,7 +354,7 @@ nft --interactive
create table inet mytable
# add a new base chain: get input packets
-add chain inet mytable myin { type filter hook input priority 0; }
+add chain inet mytable myin { type filter hook input priority filter; }
# add a single counter to the chain
add rule inet mytable myin counter
@@ -326,16 +369,18 @@ add table inet mytable
[horizontal]
*add*:: Add a new table for the given family with the given name.
*delete*:: Delete the specified table.
+*destroy*:: Delete the specified table; it does not fail if the table does not exist.
*list*:: List all chains and rules of the specified table.
*flush*:: Flush all chains and rules of the specified table.
CHAINS
------
[verse]
-{*add* | *create*} *chain* ['family'] 'table' 'chain' [*{ type* 'type' *hook* 'hook' [*device* 'device'] *priority* 'priority' *;* [*policy* 'policy' *;*] *}*]
-{*delete* | *list* | *flush*} *chain* ['family'] 'table' 'chain'
+{*add* | *create*} *chain* ['family'] 'table' 'chain' [*{ type* 'type' *hook* 'hook' [*device* 'device'] *priority* 'priority' *;* [*policy* 'policy' *;*] [*comment* 'comment' *;*] *}*]
+{*delete* | *destroy* | *list* | *flush*} *chain* ['family'] 'table' 'chain'
*list chains* ['family']
*delete chain* ['family'] 'table' *handle* 'handle'
+*destroy chain* ['family'] 'table' *handle* 'handle'
*rename chain* ['family'] 'table' 'chain' 'newname'
Chains are containers for rules. They exist in two kinds, base chains and
@@ -348,6 +393,7 @@ organization.
are specified, the chain is created as a base chain and hooked up to the networking stack.
*create*:: Similar to the *add* command, but returns an error if the chain already exists.
*delete*:: Delete the specified chain. The chain must not contain any rules or be used as jump target.
+*destroy*:: Delete the specified chain; it does not fail if the chain does not exist. The chain must not contain any rules or be used as jump target.
*rename*:: Rename the specified chain.
*list*:: List all rules of the specified chain.
*flush*:: Flush all rules of the specified chain.
@@ -369,24 +415,38 @@ statements for instance).
|route | ip, ip6 | output |
If a packet has traversed a chain of this type and is about to be accepted, a
new route lookup is performed if relevant parts of the IP header have changed.
-This allows to e.g. implement policy routing selectors in nftables.
+This allows one to e.g. implement policy routing selectors in nftables.
|=================
Apart from the special cases illustrated above (e.g. *nat* type not supporting
-*forward* hook or *route* type only supporting *output* hook), there are two
+*forward* hook or *route* type only supporting *output* hook), there are three
further quirks worth noticing:
-* The netdev family supports merely a single combination, namely *filter* type and
- *ingress* hook. Base chains in this family also require the *device* parameter
- to be present since they exist per incoming interface only.
+* The netdev family supports merely two combinations, namely *filter* type with
+ *ingress* hook and *filter* type with *egress* hook. Base chains in this
+ family also require the *device* parameter to be present since they exist per
+ interface only.
* The arp family supports only the *input* and *output* hooks, both in chains of type
*filter*.
+* The inet family also supports the *ingress* hook (since Linux kernel 5.10),
+ to filter IPv4 and IPv6 packets at the same location as the netdev *ingress*
+ hook. This inet hook allows you to share sets and maps between the usual
+ *prerouting*, *input*, *forward*, *output*, *postrouting* and this *ingress*
+ hook.
+
+The *device* parameter accepts a network interface name as a string, and is
+required when adding a base chain that filters traffic on the ingress or
+egress hooks. Any ingress or egress chains will only filter traffic from the
+interface specified in the *device* parameter.
The *priority* parameter accepts a signed integer value or a standard priority
-name which specifies the order in which chains with same *hook* value are
+name which specifies the order in which chains with the same *hook* value are
traversed. The ordering is ascending, i.e. lower priority values have precedence
over higher ones.
+With *nat* type chains, *priority* values have an exclusive lower limit of -200,
+because conntrack hooks at this priority and NAT requires it.
+
Standard priority values can be replaced with easily memorizable names. Not all
names make sense in every family with every hook (see the compatibility matrices
below) but their numerical value can still be used for prioritizing chains.
@@ -422,9 +482,9 @@ the others. See the following tables that describe the values and compatibility.
Basic arithmetic expressions (addition and subtraction) can also be achieved
with these standard names to ease relative prioritizing, e.g. *mangle - 5* stands
for *-155*. Values will also be printed like this until the value is not
-further than 10 form the standard value.
+further than 10 from the standard value.
-Base chains also allow to set the chain's *policy*, i.e. what happens to
+Base chains also allow one to set the chain's *policy*, i.e. what happens to
packets not explicitly accepted or refused in contained rules. Supported policy
values are *accept* (which is the default) or *drop*.
@@ -433,7 +493,9 @@ RULES
[verse]
{*add* | *insert*} *rule* ['family'] 'table' 'chain' [*handle* 'handle' | *index* 'index'] 'statement' ... [*comment* 'comment']
*replace rule* ['family'] 'table' 'chain' *handle* 'handle' 'statement' ... [*comment* 'comment']
-*delete rule* ['family'] 'table' 'chain' *handle* 'handle'
+{*delete* | *reset*} *rule* ['family'] 'table' 'chain' *handle* 'handle'
+*destroy rule* ['family'] 'table' 'chain' *handle* 'handle'
+*reset rules* ['family'] ['table' ['chain']]
Rules are added to chains in the given table. If the family is not specified, the
ip family is used. Rules are constructed from two kinds of components according
@@ -461,8 +523,10 @@ case the rule is inserted after the specified rule.
beginning of the chain or before the specified rule.
*replace*:: Similar to *add*, but the rule replaces the specified rule.
*delete*:: Delete the specified rule.
+*destroy*:: Delete the specified rule; it does not fail if the rule does not exist.
+*reset*:: Reset rule-contained state, e.g. counter and quota statement values.
-.*add a rule to ip table input chain*
+.*add a rule to ip table output chain*
-------------
nft add rule filter output ip daddr 192.168.0.0/24 accept # 'ip filter' is assumed
# same command, slightly more verbose
@@ -474,12 +538,12 @@ nft add rule ip filter output ip daddr 192.168.0.0/24 accept
# nft -a list ruleset
table inet filter {
chain input {
- type filter hook input priority 0; policy accept;
+ type filter hook input priority filter; policy accept;
ct state established,related accept # handle 4
ip saddr 10.1.1.1 tcp dport ssh accept # handle 5
...
# delete the rule with handle 5
-# nft delete rule inet filter input handle 5
+nft delete rule inet filter input handle 5
-------------------------
SETS
@@ -510,21 +574,23 @@ The sets allowed_hosts and allowed_ports need to be created first. The next
section describes nft set syntax in more detail.
[verse]
-*add set* ['family'] 'table' 'set' *{ type* 'type' | *typeof* 'expression' *;* [*flags* 'flags' *;*] [*timeout* 'timeout' *;*] [*gc-interval* 'gc-interval' *;*] [*elements = {* 'element'[*,* ...] *} ;*] [*size* 'size' *;*] [*policy* 'policy' *;*] [*auto-merge ;*] *}*
-{*delete* | *list* | *flush*} *set* ['family'] 'table' 'set'
+*add set* ['family'] 'table' 'set' *{ type* 'type' | *typeof* 'expression' *;* [*flags* 'flags' *;*] [*timeout* 'timeout' *;*] [*gc-interval* 'gc-interval' *;*] [*elements = {* 'element'[*,* ...] *} ;*] [*size* 'size' *;*] [*comment* 'comment' *;*] [*policy* 'policy' *;*] [*auto-merge ;*] *}*
+{*delete* | *destroy* | *list* | *flush* | *reset* } *set* ['family'] 'table' 'set'
*list sets* ['family']
*delete set* ['family'] 'table' *handle* 'handle'
-{*add* | *delete*} *element* ['family'] 'table' 'set' *{* 'element'[*,* ...] *}*
+{*add* | *delete* | *destroy* } *element* ['family'] 'table' 'set' *{* 'element'[*,* ...] *}*
Sets are element containers of a user-defined data type, they are uniquely
identified by a user-defined name and attached to tables. Their behaviour can
be tuned with the flags that can be specified at set creation time.
[horizontal]
-*add*:: Add a new set in the specified table. See the Set specification table below for more information about how to specify a sets properties.
+*add*:: Add a new set in the specified table. See the Set specification table below for more information about how to specify properties of a set.
*delete*:: Delete the specified set.
+*destroy*:: Delete the specified set; it does not fail if the set does not exist.
*list*:: Display the elements in the specified set.
*flush*:: Remove all elements from the specified set.
+*reset*:: Reset state in all contained elements, e.g. counter and quota statement values.
.Set specifications
[options="header"]
@@ -537,10 +603,9 @@ string: ipv4_addr, ipv6_addr, ether_addr, inet_proto, inet_service, mark
data type of set element |
expression to derive the data type from
|flags |
-set flags |
-string: constant, dynamic, interval, timeout
+set flags | string: constant, dynamic, interval, timeout. Used to describe the set's properties.
|timeout |
-time an element stays in the set, mandatory if set is added to from the packet path (ruleset).|
+time an element stays in the set, mandatory if set is added to from the packet path (ruleset)|
string, decimal followed by unit. Units are: d, h, m, s
|gc-interval |
garbage collection interval, only available when timeout or flag timeout are
@@ -550,7 +615,7 @@ string, decimal followed by unit. Units are: d, h, m, s
elements contained by the set |
set data type
|size |
-maximum number of elements in the set, mandatory if set is added to from the packet path (ruleset).|
+maximum number of elements in the set, mandatory if set is added to from the packet path (ruleset)|
unsigned integer (64 bit)
|policy |
set policy |
@@ -563,8 +628,8 @@ automatic merge of adjacent/overlapping set elements (only for interval sets) |
MAPS
-----
[verse]
-*add map* ['family'] 'table' 'map' *{ type* 'type' | *typeof* 'expression' [*flags* 'flags' *;*] [*elements = {* 'element'[*,* ...] *} ;*] [*size* 'size' *;*] [*policy* 'policy' *;*] *}*
-{*delete* | *list* | *flush*} *map* ['family'] 'table' 'map'
+*add map* ['family'] 'table' 'map' *{ type* 'type' | *typeof* 'expression' [*flags* 'flags' *;*] [*elements = {* 'element'[*,* ...] *} ;*] [*size* 'size' *;*] [*comment* 'comment' *;*] [*policy* 'policy' *;*] *}*
+{*delete* | *destroy* | *list* | *flush* | *reset* } *map* ['family'] 'table' 'map'
*list maps* ['family']
Maps store data based on some specific key used as input. They are uniquely identified by a user-defined name and attached to tables.
@@ -572,10 +637,10 @@ Maps store data based on some specific key used as input. They are uniquely iden
[horizontal]
*add*:: Add a new map in the specified table.
*delete*:: Delete the specified map.
+*destroy*:: Delete the specified map; unlike *delete*, this does not fail if the map does not exist.
*list*:: Display the elements in the specified map.
*flush*:: Remove all elements from the specified map.
-*add element*:: Comma-separated list of elements to add into the specified map.
-*delete element*:: Comma-separated list of element keys to delete from the specified map.
+*reset*:: Reset state in all contained elements, e.g. counter and quota statement values.
.Map specifications
[options="header"]
@@ -589,7 +654,7 @@ data type of set element |
expression to derive the data type from
|flags |
map flags |
-string: constant, interval
+string, same as set flags
|elements |
elements contained by the map |
map data type
@@ -601,21 +666,37 @@ map policy |
string: performance [default], memory
|=================
+Users can specify the properties/features that the set/map must support.
+This allows the kernel to pick an optimal internal representation.
+If a required flag is missing, the ruleset might still work, as
+nftables will auto-enable features if it can infer this from the ruleset.
+This may not work for all cases, however, so it is recommended to
+specify all required features in the set/map definition manually.
+
+.Set and Map flags
+[options="header"]
+|=================
+|Flag | Description
+|constant | Set contents will never change after creation
+|dynamic | Set must support updates from the packet path with the *add*, *update* or *delete* keywords.
+|interval | Set must be able to store intervals (ranges)
+|timeout | Set must support element timeouts (auto-removal of elements once they expire).
+|=================
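
As an illustration of the flags above, the following sketch declares a set that is updated from the packet path. It therefore lists *dynamic* and *timeout* explicitly and also sets *timeout* and *size*, which the set specification table marks as mandatory in that case. The table, set and chain names are hypothetical.

```nft
# Hypothetical names: table "filter", set "ssh_seen", chain "input".
table inet filter {
    set ssh_seen {
        type ipv4_addr
        # declare every feature the ruleset relies on
        flags dynamic, timeout
        # entries expire 10 minutes after they were added
        timeout 10m
        # upper bound on the number of elements
        size 65535
    }

    chain input {
        type filter hook input priority filter; policy accept;
        # packet path update: remember sources of new SSH connections
        tcp dport 22 ct state new add @ssh_seen { ip saddr }
    }
}
```

Saved to a file, a fragment like this can be loaded with *nft -f*.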
ELEMENTS
--------
[verse]
____
-{*add* | *create* | *delete* | *get* } *element* ['family'] 'table' 'set' *{* 'ELEMENT'[*,* ...] *}*
+{*add* | *create* | *delete* | *destroy* | *get* | *reset* } *element* ['family'] 'table' 'set' *{* 'ELEMENT'[*,* ...] *}*
'ELEMENT' := 'key_expression' 'OPTIONS' [*:* 'value_expression']
'OPTIONS' := [*timeout* 'TIMESPEC'] [*expires* 'TIMESPEC'] [*comment* 'string']
'TIMESPEC' := ['num'*d*]['num'*h*]['num'*m*]['num'[*s*]]
____
-Element-related commands allow to change contents of named sets and maps.
+Element-related commands allow one to change contents of named sets and maps.
'key_expression' is typically a value matching the set type.
'value_expression' is not allowed in sets but mandatory when adding to maps, where it
-matches the data part in it's type definition. When deleting from maps, it may
+matches the data part in its type definition. When deleting from maps, it may
be specified but is optional as 'key_expression' uniquely identifies the
element.
@@ -626,6 +707,9 @@ listed elements may already exist.
be non-trivial in very large and/or interval sets. In the latter case, the
containing interval is returned instead of just the element itself.
+*reset* command resets state attached to the given element(s), e.g. counter and
+quota statement values.
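
The element grammar above can be exercised with a short script, sketched here under the assumption that the table, set and map shown do not clash with an existing ruleset. All names and addresses are hypothetical.

```nft
# Hypothetical objects used only for this illustration.
add table inet filter
add set inet filter blocklist { type ipv4_addr; flags timeout; }
add map inet filter portmap { type inet_service : inet_service; }

# Set element: key plus OPTIONS (timeout, comment), no value part.
add element inet filter blocklist { 192.0.2.10 timeout 1h30m comment "hypothetical entry" }

# Map elements: the value expression after the colon is mandatory.
add element inet filter portmap { 80 : 8080, 443 : 8443 }

# Deleting from a map: the key alone identifies the element.
delete element inet filter portmap { 80 }
```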
+
.Element options
[options="header"]
|=================
@@ -644,7 +728,7 @@ FLOWTABLES
[verse]
{*add* | *create*} *flowtable* ['family'] 'table' 'flowtable' *{ hook* 'hook' *priority* 'priority' *; devices = {* 'device'[*,* ...] *} ; }*
*list flowtables* ['family']
-{*delete* | *list*} *flowtable* ['family'] 'table' 'flowtable'
+{*delete* | *destroy* | *list*} *flowtable* ['family'] 'table' 'flowtable'
*delete* *flowtable* ['family'] 'table' *handle* 'handle'
Flowtables allow you to accelerate packet forwarding in software. Flowtables
@@ -668,24 +752,48 @@ and subtraction can be used to set relative priority, e.g. filter + 5 equals to
[horizontal]
*add*:: Add a new flowtable for the given family with the given name.
*delete*:: Delete the specified flowtable.
+*destroy*:: Delete the specified flowtable; unlike *delete*, this does not fail if the flowtable does not exist.
*list*:: List all flowtables.
+LISTING
+-------
+[verse]
+*list { secmarks | synproxys | flow tables | meters | hooks }* ['family']
+*list { secmarks | synproxys | flow tables | meters | hooks } table* ['family'] 'table'
+*list ct { timeout | expectation | helper | helpers } table* ['family'] 'table'
+
+Inspect configured objects.
+*list hooks* shows the full hook pipeline, including hooks registered by
+kernel modules, such as nf_conntrack.
STATEFUL OBJECTS
----------------
[verse]
-{*add* | *delete* | *list* | *reset*} 'type' ['family'] 'table' 'object'
-*delete* 'type' ['family'] 'table' *handle* 'handle'
+{*add* | *delete* | *destroy* | *list* | *reset*} *counter* ['family'] 'table' 'object'
+{*add* | *delete* | *destroy* | *list* | *reset*} *quota* ['family'] 'table' 'object'
+{*add* | *delete* | *destroy* | *list*} *limit* ['family'] 'table' 'object'
+*delete* 'counter' ['family'] 'table' *handle* 'handle'
+*delete* 'quota' ['family'] 'table' *handle* 'handle'
+*delete* 'limit' ['family'] 'table' *handle* 'handle'
+*destroy* 'counter' ['family'] 'table' *handle* 'handle'
+*destroy* 'quota' ['family'] 'table' *handle* 'handle'
+*destroy* 'limit' ['family'] 'table' *handle* 'handle'
*list counters* ['family']
*list quotas* ['family']
+*list limits* ['family']
+*reset counters* ['family']
+*reset quotas* ['family']
+*reset counters* ['family'] 'table'
+*reset quotas* ['family'] 'table'
-Stateful objects are attached to tables and are identified by an unique name.
+Stateful objects are attached to tables and are identified by a unique name.
They group stateful information from rules, to reference them in rules the
keywords "type name" are used e.g. "counter name".
[horizontal]
*add*:: Add a new stateful object in the specified table.
*delete*:: Delete the specified object.
+*destroy*:: Delete the specified object; unlike *delete*, this does not fail if the object does not exist.
*list*:: Display stateful information the object holds.
*reset*:: List-and-reset stateful object.
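
A minimal sketch of a named counter and a rule referencing it with the "counter name" form described above. The table, chain and object names are hypothetical.

```nft
# Hypothetical names: table "filter", counter "https_in".
table inet filter {
    counter https_in {
        packets 0 bytes 0
    }

    chain input {
        type filter hook input priority filter; policy accept;
        # reference the stateful object by type and name
        tcp dport 443 counter name "https_in"
    }
}
```

With this loaded, *nft list counter inet filter https_in* displays the current values and *nft reset counter inet filter https_in* lists and resets them.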
@@ -792,13 +900,26 @@ These are some additional commands included in nft.
MONITOR
~~~~~~~~
The monitor command allows you to listen to Netlink events produced by the
-nf_tables subsystem, related to creation and deletion of objects. When they
+nf_tables subsystem. These are either related to creation and deletion of
+objects or to packets for which *meta nftrace* was enabled. When they
occur, nft will print to stdout the monitored events in either JSON or
native nft format. +
-To filter events related to a concrete object, use one of the keywords 'tables', 'chains', 'sets', 'rules', 'elements', 'ruleset'. +
+[verse]
+____
+*monitor* [*new* | *destroy*] 'MONITOR_OBJECT'
+*monitor* *trace*
+
+'MONITOR_OBJECT' := *tables* | *chains* | *sets* | *rules* | *elements* | *ruleset*
+____
+
+To filter events related to a concrete object, use one of the keywords in
+'MONITOR_OBJECT'.
+
+To filter events related to a concrete action, use keyword *new* or *destroy*.
-To filter events related to a concrete action, use keyword 'new' or 'destroy'.
+The second form of invocation takes no further options and exclusively prints
+events generated for packets with *nftrace* enabled.
Hit ^C to finish the monitor operation.
@@ -822,6 +943,12 @@ Hit ^C to finish the monitor operation.
% nft monitor ruleset
---------------------
+.Trace incoming packets from host 10.0.0.1
+------------------------------------------
+% nft add rule filter input ip saddr 10.0.0.1 meta nftrace set 1
+% nft monitor trace
+------------------------------------------
+
ERROR REPORTING
---------------
When an error is detected, nft shows the line(s) containing the error, the
diff --git a/doc/payload-expression.txt b/doc/payload-expression.txt
index e6f108b1..c7c267da 100644
--- a/doc/payload-expression.txt
+++ b/doc/payload-expression.txt
@@ -21,7 +21,15 @@ ether_type
VLAN HEADER EXPRESSION
~~~~~~~~~~~~~~~~~~~~~~
[verse]
-*vlan* {*id* | *cfi* | *pcp* | *type*}
+*vlan* {*id* | *dei* | *pcp* | *type*}
+
+The vlan expression is used to match on the vlan header fields.
+This expression will not work in the *ip*, *ip6* and *inet* families,
+unless the vlan interface is configured with the *reorder_hdr off* setting.
+The default is *reorder_hdr on* which will automatically remove the vlan tag
+from the packet. See ip-link(8) for more information.
+For these families it is easier to match the vlan interface name
+instead, using the *meta iif* or *meta iifname* expression.
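
The two approaches can be sketched as follows, assuming a hypothetical device eth0 with a vlan interface eth0.100 for VLAN ID 100. The first chain matches the tag itself in the *netdev* family, the second matches the vlan interface name in the *inet* family.

```nft
# Assumption: "eth0" and "eth0.100" exist on the system.
table netdev filter {
    chain ingress {
        type filter hook ingress device "eth0" priority filter; policy accept;
        # the vlan header is matched directly at ingress
        vlan id 100 counter
    }
}

table inet filter {
    chain input {
        type filter hook input priority filter; policy accept;
        # match the vlan interface name instead of the tag
        meta iifname "eth0.100" counter
    }
}
```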
.VLAN header expression
[options="header"]
@@ -30,8 +38,8 @@ VLAN HEADER EXPRESSION
|id|
VLAN ID (VID) |
integer (12 bit)
-|cfi|
-Canonical Format Indicator|
+|dei|
+Drop Eligible Indicator|
integer (1 bit)
|pcp|
Priority code point|
@@ -126,6 +134,14 @@ Destination address |
ipv4_addr
|======================
+Careful with matching on *ip length*: If GRO/GSO is enabled, then the Linux
+kernel might aggregate several packets into one big packet that is larger than
+MTU. Moreover, if GRO/GSO maximum size is larger than 65535 (see ip-link(8),
+specifically gro_ipv4_max_size and gso_ipv4_max_size), then *ip length* might
+be 0 for such jumbo packets. *meta length* allows you to match on the packet
+length including the IP header size. If you want to perform heuristics on the
+*ip length* field, then disable GRO/GSO.
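
A minimal sketch of the *meta length* alternative mentioned above. The threshold of 1500 bytes and the table and chain names are arbitrary choices for illustration.

```nft
# Hypothetical table and chain; 1500 is an arbitrary threshold.
table inet filter {
    chain input {
        type filter hook input priority filter; policy accept;
        # packet length as seen by the kernel, including the IP header,
        # so aggregated GRO/GSO packets are counted with their real size
        meta length > 1500 counter
    }
}
```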
+
ICMP HEADER EXPRESSION
~~~~~~~~~~~~~~~~~~~~~~
[verse]
@@ -236,6 +252,14 @@ Destination address |
ipv6_addr
|=======================
+Careful with matching on *ip6 length*: If GRO/GSO is enabled, then the Linux
+kernel might aggregate several packets into one big packet that is larger than
+MTU. Moreover, if GRO/GSO maximum size is larger than 65535 (see ip-link(8),
+specifically gro_ipv6_max_size and gso_ipv6_max_size), then *ip6 length* might
+be 0 for such jumbo packets. *meta length* allows you to match on the packet
+length including the IP header size. If you want to perform heuristics on the
+*ip6 length* field, then disable GRO/GSO.
+
.Using ip6 header expressions
-----------------------------
# matching if first extension header indicates a fragment
@@ -245,7 +269,7 @@ ip6 nexthdr ipv6-frag
ICMPV6 HEADER EXPRESSION
~~~~~~~~~~~~~~~~~~~~~~~~
[verse]
-*icmpv6* {*type* | *code* | *checksum* | *parameter-problem* | *packet-too-big* | *id* | *sequence* | *max-delay*}
+*icmpv6* {*type* | *code* | *checksum* | *parameter-problem* | *packet-too-big* | *id* | *sequence* | *max-delay* | *taddr* | *daddr*}
This expression refers to ICMPv6 header fields. When using it in *inet*,
*bridge* or *netdev* families, it will cause an implicit dependency on IPv6 to
@@ -280,6 +304,12 @@ integer (16 bit)
|max-delay|
maximum response delay of MLD queries|
integer (16 bit)
+|taddr|
+target address of neighbor solicit/advert, redirect or MLD|
+ipv6_addr
+|daddr|
+destination address of redirect|
+ipv6_addr
|==============================
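
As a sketch of the *taddr* field, the rule below counts neighbor solicitations for one target address. The address is taken from the 2001:db8::/32 documentation prefix and the table and chain names are hypothetical.

```nft
# Hypothetical table and chain; 2001:db8::1 is a documentation address.
table ip6 filter {
    chain input {
        type filter hook input priority filter; policy accept;
        # neighbor solicitations asking for this target address
        icmpv6 type nd-neighbor-solicit icmpv6 taddr 2001:db8::1 counter
    }
}
```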
TCP HEADER EXPRESSION
@@ -369,7 +399,33 @@ integer (16 bit)
SCTP HEADER EXPRESSION
~~~~~~~~~~~~~~~~~~~~~~~
[verse]
+____
*sctp* {*sport* | *dport* | *vtag* | *checksum*}
+*sctp chunk* 'CHUNK' [ 'FIELD' ]
+
+'CHUNK' := *data* | *init* | *init-ack* | *sack* | *heartbeat* |
+ *heartbeat-ack* | *abort* | *shutdown* | *shutdown-ack* | *error* |
+ *cookie-echo* | *cookie-ack* | *ecne* | *cwr* | *shutdown-complete*
+ | *asconf-ack* | *forward-tsn* | *asconf*
+
+'FIELD' := 'COMMON_FIELD' | 'DATA_FIELD' | 'INIT_FIELD' | 'INIT_ACK_FIELD' |
+ 'SACK_FIELD' | 'SHUTDOWN_FIELD' | 'ECNE_FIELD' | 'CWR_FIELD' |
+ 'ASCONF_ACK_FIELD' | 'FORWARD_TSN_FIELD' | 'ASCONF_FIELD'
+
+'COMMON_FIELD' := *type* | *flags* | *length*
+'DATA_FIELD' := *tsn* | *stream* | *ssn* | *ppid*
+'INIT_FIELD' := *init-tag* | *a-rwnd* | *num-outbound-streams* |
+ *num-inbound-streams* | *initial-tsn*
+'INIT_ACK_FIELD' := 'INIT_FIELD'
+'SACK_FIELD' := *cum-tsn-ack* | *a-rwnd* | *num-gap-ack-blocks* |
+ *num-dup-tsns*
+'SHUTDOWN_FIELD' := *cum-tsn-ack*
+'ECNE_FIELD' := *lowest-tsn*
+'CWR_FIELD' := *lowest-tsn*
+'ASCONF_ACK_FIELD' := *seqno*
+'FORWARD_TSN_FIELD' := *new-cum-tsn*
+'ASCONF_FIELD' := *seqno*
+____
.SCTP header expression
[options="header"]
@@ -387,12 +443,39 @@ integer (32 bit)
|checksum|
Checksum|
integer (32 bit)
+|chunk|
+Search chunk in packet|
+without 'FIELD', boolean indicating existence
|================
+.SCTP chunk fields
+[options="header"]
+|==================
+|Name| Width in bits | Chunk | Notes
+|type| 8 | all | not useful, defined by chunk type
+|flags| 8 | all | semantics defined on per-chunk basis
+|length| 16 | all | length of this chunk in bytes excluding padding
+|tsn| 32 | data | transmission sequence number
+|stream| 16 | data | stream identifier
+|ssn| 16 | data | stream sequence number
+|ppid| 32 | data | payload protocol identifier
+|init-tag| 32 | init, init-ack | initiate tag
+|a-rwnd| 32 | init, init-ack, sack | advertised receiver window credit
+|num-outbound-streams| 16 | init, init-ack | number of outbound streams
+|num-inbound-streams| 16 | init, init-ack | number of inbound streams
+|initial-tsn| 32 | init, init-ack | initial transmit sequence number
+|cum-tsn-ack| 32 | sack, shutdown | cumulative transmission sequence number acknowledged
+|num-gap-ack-blocks| 16 | sack | number of Gap Ack Blocks included
+|num-dup-tsns| 16 | sack | number of duplicate transmission sequence numbers received
+|lowest-tsn| 32 | ecne, cwr | lowest transmission sequence number
+|seqno| 32 | asconf-ack, asconf | sequence number
+|new-cum-tsn| 32 | forward-tsn | new cumulative transmission sequence number
+|==================
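
The chunk syntax can be sketched as follows. Without a 'FIELD' the expression only tests whether the chunk is present, with a 'FIELD' it compares the value of that field. The values and the table and chain names are arbitrary illustrations.

```nft
# Hypothetical table and chain; the compared values are arbitrary.
table inet filter {
    chain input {
        type filter hook input priority filter; policy accept;
        # existence check: packet carries an INIT chunk
        sctp chunk init exists counter
        # field match within an INIT-ACK chunk
        sctp chunk init-ack num-outbound-streams 3 counter
        # field match within a SACK chunk
        sctp chunk sack a-rwnd 1 counter
    }
}
```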
+
DCCP HEADER EXPRESSION
~~~~~~~~~~~~~~~~~~~~~~
[verse]
-*dccp* {*sport* | *dport*}
+*dccp* {*sport* | *dport* | *type*}
.DCCP header expression
[options="header"]
@@ -404,6 +487,9 @@ inet_service
|dport|
Destination port|
inet_service
+|type|
+Packet type|
+dccp_pkttype
|========================
AUTHENTICATION HEADER EXPRESSION
@@ -468,6 +554,160 @@ compression Parameter Index |
integer (16 bit)
|============================
+GRE HEADER EXPRESSION
+~~~~~~~~~~~~~~~~~~~~~
+[verse]
+*gre* {*flags* | *version* | *protocol*}
+*gre* *ip* {*version* | *hdrlength* | *dscp* | *ecn* | *length* | *id* | *frag-off* | *ttl* | *protocol* | *checksum* | *saddr* | *daddr* }
+*gre* *ip6* {*version* | *dscp* | *ecn* | *flowlabel* | *length* | *nexthdr* | *hoplimit* | *saddr* | *daddr*}
+
+The gre expression is used to match on the gre header fields. This expression
+also allows matching on the IPv4 or IPv6 packet within the gre header.
+
+.GRE header expression
+[options="header"]
+|==================
+|Keyword| Description| Type
+|flags|
+checksum, routing, key, sequence and strict source route flags|
+integer (5 bit)
+|version|
+gre version field, 0 for GRE and 1 for PPTP|
+integer (3 bit)
+|protocol|
+EtherType of encapsulated packet|
+integer (16 bit)
+|==================
+
+.Matching inner IPv4 destination address encapsulated in gre
+------------------------------------------------------------
+netdev filter ingress gre ip daddr 9.9.9.9 counter
+------------------------------------------------------------
+
+GENEVE HEADER EXPRESSION
+~~~~~~~~~~~~~~~~~~~~~~~~
+[verse]
+*geneve* {*vni* | *flags*}
+*geneve* *ether* {*daddr* | *saddr* | *type*}
+*geneve* *vlan* {*id* | *dei* | *pcp* | *type*}
+*geneve* *ip* {*version* | *hdrlength* | *dscp* | *ecn* | *length* | *id* | *frag-off* | *ttl* | *protocol* | *checksum* | *saddr* | *daddr* }
+*geneve* *ip6* {*version* | *dscp* | *ecn* | *flowlabel* | *length* | *nexthdr* | *hoplimit* | *saddr* | *daddr*}
+*geneve* *tcp* {*sport* | *dport* | *sequence* | *ackseq* | *doff* | *reserved* | *flags* | *window* | *checksum* | *urgptr*}
+*geneve* *udp* {*sport* | *dport* | *length* | *checksum*}
+
+The geneve expression is used to match on the geneve header fields. The geneve
+header encapsulates an Ethernet frame within a *udp* packet. This expression
+requires that you restrict the matching to *udp* packets (usually at
+port 6081 according to IANA-assigned ports).
+
+.GENEVE header expression
+[options="header"]
+|==================
+|Keyword| Description| Type
+|protocol|
+EtherType of encapsulated packet|
+integer (16 bit)
+|vni|
+Virtual Network ID (VNI)|
+integer (24 bit)
+|==================
+
+.Matching inner TCP destination port encapsulated in geneve
+----------------------------------------------------------
+netdev filter ingress udp dport 6081 geneve tcp dport 80 counter
+----------------------------------------------------------
+
+GRETAP HEADER EXPRESSION
+~~~~~~~~~~~~~~~~~~~~~~~~
+[verse]
+*gretap* {*vni* | *flags*}
+*gretap* *ether* {*daddr* | *saddr* | *type*}
+*gretap* *vlan* {*id* | *dei* | *pcp* | *type*}
+*gretap* *ip* {*version* | *hdrlength* | *dscp* | *ecn* | *length* | *id* | *frag-off* | *ttl* | *protocol* | *checksum* | *saddr* | *daddr* }
+*gretap* *ip6* {*version* | *dscp* | *ecn* | *flowlabel* | *length* | *nexthdr* | *hoplimit* | *saddr* | *daddr*}
+*gretap* *tcp* {*sport* | *dport* | *sequence* | *ackseq* | *doff* | *reserved* | *flags* | *window* | *checksum* | *urgptr*}
+*gretap* *udp* {*sport* | *dport* | *length* | *checksum*}
+
+The gretap expression is used to match on the encapsulated ethernet frame
+within the gre header. Use the *gre* expression to match on the *gre* header
+fields.
+
+.Matching inner TCP destination port encapsulated in gretap
+----------------------------------------------------------
+netdev filter ingress gretap tcp dport 80 counter
+----------------------------------------------------------
+
+VXLAN HEADER EXPRESSION
+~~~~~~~~~~~~~~~~~~~~~~~
+[verse]
+*vxlan* {*vni* | *flags*}
+*vxlan* *ether* {*daddr* | *saddr* | *type*}
+*vxlan* *vlan* {*id* | *dei* | *pcp* | *type*}
+*vxlan* *ip* {*version* | *hdrlength* | *dscp* | *ecn* | *length* | *id* | *frag-off* | *ttl* | *protocol* | *checksum* | *saddr* | *daddr* }
+*vxlan* *ip6* {*version* | *dscp* | *ecn* | *flowlabel* | *length* | *nexthdr* | *hoplimit* | *saddr* | *daddr*}
+*vxlan* *tcp* {*sport* | *dport* | *sequence* | *ackseq* | *doff* | *reserved* | *flags* | *window* | *checksum* | *urgptr*}
+*vxlan* *udp* {*sport* | *dport* | *length* | *checksum*}
+
+The vxlan expression is used to match on the vxlan header fields. The vxlan
+header encapsulates an Ethernet frame within a *udp* packet. This expression
+requires that you restrict the matching to *udp* packets (usually at
+port 4789 according to IANA-assigned ports).
+
+.VXLAN header expression
+[options="header"]
+|==================
+|Keyword| Description| Type
+|flags|
+vxlan flags|
+integer (8 bit)
+|vni|
+Virtual Network ID (VNI)|
+integer (24 bit)
+|==================
+
+.Matching inner TCP destination port encapsulated in vxlan
+----------------------------------------------------------
+netdev filter ingress udp dport 4789 vxlan tcp dport 80 counter
+----------------------------------------------------------
+
+ARP HEADER EXPRESSION
+~~~~~~~~~~~~~~~~~~~~~
+[verse]
+*arp* {*htype* | *ptype* | *hlen* | *plen* | *operation* | *saddr* { *ip* | *ether* } | *daddr* { *ip* | *ether* }}
+
+.ARP header expression
+[options="header"]
+|==================
+|Keyword| Description| Type
+|htype|
+ARP hardware type|
+integer (16 bit)
+|ptype|
+EtherType|
+ether_type
+|hlen|
+Hardware address len|
+integer (8 bit)
+|plen|
+Protocol address len |
+integer (8 bit)
+|operation|
+Operation |
+arp_op
+|saddr ether|
+Ethernet sender address|
+ether_addr
+|daddr ether|
+Ethernet target address|
+ether_addr
+|saddr ip|
+IPv4 sender address|
+ipv4_addr
+|daddr ip|
+IPv4 target address|
+ipv4_addr
+|======================
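
A minimal sketch of the arp expression in an *arp* family chain, counting requests for one IPv4 target address. The address comes from the 192.0.2.0/24 documentation range and the table and chain names are hypothetical.

```nft
# Hypothetical table and chain; 192.0.2.1 is a documentation address.
table arp filter {
    chain input {
        type filter hook input priority filter; policy accept;
        # ARP requests asking who has 192.0.2.1
        arp operation request arp daddr ip 192.0.2.1 counter
    }
}
```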
+
RAW PAYLOAD EXPRESSION
~~~~~~~~~~~~~~~~~~~~~~
[verse]
@@ -492,6 +732,8 @@ Link layer, for example the Ethernet header
Network header, for example IPv4 or IPv6
|th|
Transport Header, for example TCP
+|ih|
+Inner Header / Payload, i.e. after the L4 transport level header
|==============================
.Matching destination port of both UDP and TCP
@@ -525,14 +767,15 @@ nftables currently supports matching (finding) a given ipv6 extension header, TC
*dst* {*nexthdr* | *hdrlength*}
*mh* {*nexthdr* | *hdrlength* | *checksum* | *type*}
*srh* {*flags* | *tag* | *sid* | *seg-left*}
-*tcp option* {*eol* | *noop* | *maxseg* | *window* | *sack-permitted* | *sack* | *sack0* | *sack1* | *sack2* | *sack3* | *timestamp*} 'tcp_option_field'
+*tcp option* {*eol* | *nop* | *maxseg* | *window* | *sack-perm* | *sack* | *sack0* | *sack1* | *sack2* | *sack3* | *timestamp*} 'tcp_option_field'
*ip option* { lsrr | ra | rr | ssrr } 'ip_option_field'
The following syntaxes are valid only in a relational expression with boolean type on right-hand side for checking header existence only:
[verse]
*exthdr* {*hbh* | *frag* | *rt* | *dst* | *mh*}
-*tcp option* {*eol* | *noop* | *maxseg* | *window* | *sack-permitted* | *sack* | *sack0* | *sack1* | *sack2* | *sack3* | *timestamp*}
+*tcp option* {*eol* | *nop* | *maxseg* | *window* | *sack-perm* | *sack* | *sack0* | *sack1* | *sack2* | *sack3* | *timestamp*}
*ip option* { lsrr | ra | rr | ssrr }
+*dccp option* 'dccp_option_type'
.IPv6 extension headers
[options="header"]
@@ -558,39 +801,45 @@ Segment Routing Header
|Keyword| Description | TCP option fields
|eol|
End if option list|
-kind
-|noop|
-1 Byte TCP No-op options |
-kind
+-
+|nop|
+1 Byte TCP Nop padding option |
+-
|maxseg|
TCP Maximum Segment Size|
-kind, length, size
+length, size
|window|
TCP Window Scaling |
-kind, length, count
-|sack-permitted|
+length, count
+|sack-perm |
TCP SACK permitted |
-kind, length
+length
|sack|
TCP Selective Acknowledgement (alias of block 0) |
-kind, length, left, right
+length, left, right
|sack0|
TCP Selective Acknowledgement (block 0) |
-kind, length, left, right
+length, left, right
|sack1|
TCP Selective Acknowledgement (block 1) |
-kind, length, left, right
+length, left, right
|sack2|
TCP Selective Acknowledgement (block 2) |
-kind, length, left, right
+length, left, right
|sack3|
TCP Selective Acknowledgement (block 3) |
-kind, length, left, right
+length, left, right
|timestamp|
TCP Timestamps |
-kind, length, tsval, tsecr
+length, tsval, tsecr
|============================
+TCP option matching also supports raw expression syntax to access arbitrary options:
+[verse]
+*tcp option*
+[verse]
+*tcp option* *@*'number'*,*'offset'*,*'length'
+
.IP Options
[options="header"]
|==================
@@ -611,7 +860,12 @@ type, length, ptr, addr
.finding TCP options
--------------------
-filter input tcp option sack-permitted kind 1 counter
+filter input tcp option sack-perm exists counter
+--------------------
+
+.matching TCP options
+--------------------
+filter input tcp option maxseg size lt 536
--------------------
.matching IPv6 exthdr
@@ -624,6 +878,11 @@ ip6 filter input frag more-fragments 1 counter
filter input ip option lsrr exists counter
---------------------------------------
+.finding DCCP option
+------------------
+filter input dccp option 40 exists counter
+------------------
+
CONNTRACK EXPRESSIONS
~~~~~~~~~~~~~~~~~~~~~
Conntrack expressions refer to meta data of the connection tracking entry associated with a packet. +
@@ -637,11 +896,13 @@ is true for the *zone*, if a direction is given, the zone is only matched if the
zone id is tied to the given direction. +
[verse]
-*ct* {*state* | *direction* | *status* | *mark* | *expiration* | *helper* | *label*}
-*ct* [*original* | *reply*] {*l3proto* | *protocol* | *bytes* | *packets* | *avgpkt* | *zone* | *id*}
+*ct* {*state* | *direction* | *status* | *mark* | *expiration* | *helper* | *label* | *count* | *id*}
+*ct* [*original* | *reply*] {*l3proto* | *protocol* | *bytes* | *packets* | *avgpkt* | *zone*}
*ct* {*original* | *reply*} {*proto-src* | *proto-dst*}
*ct* {*original* | *reply*} {*ip* | *ip6*} {*saddr* | *daddr*}
+The conntrack-specific types in this table are described in the sub-section CONNTRACK TYPES above.
+
.Conntrack expressions
[options="header"]
|==================
@@ -698,15 +959,15 @@ integer (64 bit)
conntrack zone |
integer (16 bit)
|count|
-count number of connections
+number of current connections|
integer (32 bit)
|id|
-Connection id
-ct_id
+Connection id|
+ct_id
|==========================================
-A description of conntrack-specific types listed above can be found sub-section CONNTRACK TYPES above.
.restrict the number of parallel connections to a server
--------------------
-filter input tcp dport 22 meter test { ip saddr ct count over 2 } reject
+nft add set filter ssh_flood '{ type ipv4_addr; flags dynamic; }'
+nft add rule filter input tcp dport 22 add @ssh_flood '{ ip saddr ct count over 2 }' reject
--------------------
diff --git a/doc/primary-expression.txt b/doc/primary-expression.txt
index a9c39cbb..782494bd 100644
--- a/doc/primary-expression.txt
+++ b/doc/primary-expression.txt
@@ -168,15 +168,18 @@ Either an integer or a date in ISO format. For example: "2019-06-06 17:00".
Hour and seconds are optional and can be omitted if desired. If omitted,
midnight will be assumed.
The following three would be equivalent: "2019-06-06", "2019-06-06 00:00"
-and "2019-06-06 00:00:00".
+and "2019-06-06 00:00:00". Use a range expression such as
+"2019-06-06 10:00"-"2019-06-10 14:00" for matching a time range.
When an integer is given, it is assumed to be a UNIX timestamp.
|day|
Either a day of week ("Monday", "Tuesday", etc.), or an integer between 0 and 6.
Strings are matched case-insensitively, and a full match is not expected (e.g. "Mon" would match "Monday").
-When an integer is given, 0 is Sunday and 6 is Saturday.
+When an integer is given, 0 is Sunday and 6 is Saturday. Use a range expression
+such as "Monday"-"Wednesday" for matching a week day range.
|hour|
A string representing an hour in 24-hour format. Seconds can optionally be specified.
-For example, 17:00 and 17:00:00 would be equivalent.
+For example, 17:00 and 17:00:00 would be equivalent. Use a range expression such
+as "17:00"-"19:00" for matching a time range.
|=============================
.Using meta expressions
@@ -190,16 +193,23 @@ filter output oif eth0
# incoming packet was subject to ipsec processing
raw prerouting meta ipsec exists accept
+
+# match incoming packet from 03:00 to 14:00 local time
+raw prerouting meta hour "03:00"-"14:00" counter accept
-----------------------
SOCKET EXPRESSION
~~~~~~~~~~~~~~~~~
[verse]
-*socket* {*transparent* | *mark*}
+*socket* {*transparent* | *mark* | *wildcard*}
+*socket* *cgroupv2* *level* 'NUM'
Socket expression can be used to search for an existing open TCP/UDP socket and
its attributes that can be associated with a packet. It looks for an established
-or non-zero bound listening socket (possibly with a non-local address).
+or non-zero bound listening socket (possibly with a non-local address). You can
+also use it to match on the socket cgroupv2 at a given ancestor level, e.g. if
+the socket belongs to cgroupv2 'a/b', ancestor level 1 checks for a match on
+cgroup 'a' and ancestor level 2 checks for a match on cgroup 'b'.
.Available socket attributes
[options="header"]
@@ -209,22 +219,30 @@ or non-zero bound listening socket (possibly with a non-local address).
Value of the IP_TRANSPARENT socket option in the found socket. It can be 0 or 1.|
boolean (1 bit)
|mark| Value of the socket mark (SOL_SOCKET, SO_MARK). | mark
+|wildcard|
+Indicates whether the socket is wildcard-bound (e.g. 0.0.0.0 or ::0). |
+boolean (1 bit)
+|cgroupv2|
+cgroup version 2 for this socket (path from /sys/fs/cgroup)|
+cgroupv2
|==================
.Using socket expression
------------------------
-# Mark packets that correspond to a transparent socket
+# Mark packets that correspond to a transparent socket. "socket wildcard 0"
+# means that zero-bound listener sockets are NOT matched (which is usually
+# exactly what you want).
table inet x {
chain y {
- type filter hook prerouting priority -150; policy accept;
- socket transparent 1 mark set 0x00000001 accept
+ type filter hook prerouting priority mangle; policy accept;
+ socket transparent 1 socket wildcard 0 mark set 0x00000001 accept
}
}
# Trace packets that corresponds to a socket with a mark value of 15
table inet x {
chain y {
- type filter hook prerouting priority -150; policy accept;
+ type filter hook prerouting priority mangle; policy accept;
socket mark 0x0000000f nftrace set 1
}
}
@@ -232,10 +250,18 @@ table inet x {
# Set packet mark to socket mark
table inet x {
chain y {
- type filter hook prerouting priority -150; policy accept;
+ type filter hook prerouting priority mangle; policy accept;
tcp dport 8080 mark set socket mark
}
}
+
+# Count packets for cgroupv2 "user.slice" at level 1
+table inet x {
+ chain y {
+ type filter hook input priority filter; policy accept;
+ socket cgroupv2 level 1 "user.slice" counter
+ }
+}
----------------------
OSF EXPRESSION
@@ -275,7 +301,7 @@ If no TTL attribute is passed, make a true IP header and fingerprint TTL true co
# Accept packets that match the "Linux" OS genre signature without comparing TTL.
table inet x {
chain y {
- type filter hook input priority 0; policy accept;
+ type filter hook input priority filter; policy accept;
osf ttl skip name "Linux"
}
}
@@ -408,6 +434,10 @@ Destination address of the tunnel|
ipv4_addr/ipv6_addr
|=================================
+*Note:* When using xfrm_interface, this expression is not usable in the output
+hook, as the plain packet does not traverse it with IPsec info attached. Use a
+chain in the postrouting hook instead.
+
NUMGEN EXPRESSION
~~~~~~~~~~~~~~~~~
@@ -418,7 +448,7 @@ Create a number generator. The *inc* or *random* keywords control its
operation mode: In *inc* mode, the last returned value is simply incremented.
In *random* mode, a new random number is returned. The value after *mod*
keyword specifies an upper boundary (read: modulus) which is not reached by
-returned numbers. The optional *offset* allows to increment the returned value
+returned numbers. The optional *offset* allows one to increment the returned value
by a fixed offset.
A typical use-case for *numgen* is load-balancing:
@@ -448,7 +478,7 @@ header to apply the hashing, concatenations are possible as well. The value
after *mod* keyword specifies an upper boundary (read: modulus) which is
not reached by returned numbers. The optional *seed* is used to specify an
init value used as seed in the hashing function. The optional *offset*
-allows to increment the returned value by a fixed offset.
+allows one to increment the returned value by a fixed offset.
A typical use-case for *jhash* and *symhash* is load-balancing:
diff --git a/doc/stateful-objects.txt b/doc/stateful-objects.txt
index 32a3a5c8..00d3c5f1 100644
--- a/doc/stateful-objects.txt
+++ b/doc/stateful-objects.txt
@@ -1,7 +1,9 @@
CT HELPER
~~~~~~~~~
[verse]
-*ct helper* 'helper' *{ type* 'type' *protocol* 'protocol' *;* [*l3proto* 'family' *;*] *}*
+*add* *ct helper* ['family'] 'table' 'name' *{ type* 'type' *protocol* 'protocol' *;* [*l3proto* 'family' *;*] *}*
+*delete* *ct helper* ['family'] 'table' 'name'
+*list* *ct helpers*
Ct helper is used to define connection tracking helpers that can then be used in
combination with the *ct helper set* statement. 'type' and 'protocol' are
@@ -22,6 +24,9 @@ string (e.g. ip)
|l3proto |
layer 3 protocol of the helper |
address family (e.g. ip)
+|comment |
+per ct helper comment field |
+string
|=================
.defining and assigning ftp helper
@@ -34,7 +39,7 @@ table inet myhelpers {
type "ftp" protocol tcp
}
chain prerouting {
- type filter hook prerouting priority 0;
+ type filter hook prerouting priority filter;
tcp dport 21 ct helper set "ftp-standard"
}
}
@@ -43,7 +48,9 @@ table inet myhelpers {
CT TIMEOUT
~~~~~~~~~~
[verse]
-*ct timeout* 'name' *{ protocol* 'protocol' *; policy = {* 'state'*:* 'value' [*,* ...] *} ;* [*l3proto* 'family' *;*] *}*
+*add* *ct timeout* ['family'] 'table' 'name' *{ protocol* 'protocol' *; policy = {* 'state'*:* 'value' [*,* ...] *} ;* [*l3proto* 'family' *;*] *}*
+*delete* *ct timeout* ['family'] 'table' 'name'
+*list* *ct timeouts*
Ct timeout is used to update connection tracking timeout values. Timeout policies are assigned
with the *ct timeout set* statement. 'protocol' and 'policy' are
@@ -65,15 +72,29 @@ unsigned integer
|l3proto |
layer 3 protocol of the timeout object |
address family (e.g. ip)
+|comment |
+per ct timeout comment field |
+string
|=================
+TCP connection state names that can have a specific timeout value are:
+
+'close', 'close_wait', 'established', 'fin_wait', 'last_ack', 'retrans', 'syn_recv', 'syn_sent', 'time_wait' and 'unack'.
+
+You can use 'sysctl -a | grep net.netfilter.nf_conntrack_tcp_timeout_' to view the system-wide defaults, and 'sysctl -w' to change them.
+'ct timeout' allows for flow-specific settings, without changing the global timeouts.
+
+For example, TCP port 53 could have much lower settings than other traffic.
+
+UDP state names that can have a specific timeout value are 'replied' and 'unreplied'.
+
.defining and assigning ct timeout policy
----------------------------------
table ip filter {
ct timeout customtimeout {
protocol tcp;
l3proto ip
- policy = { established: 120, close: 20 }
+ policy = { established: 2m, close: 20s }
}
chain output {
@@ -98,7 +119,9 @@ sport=41360 dport=22
CT EXPECTATION
~~~~~~~~~~~~~~
[verse]
-*ct expectation* 'name' *{ protocol* 'protocol' *; dport* 'dport' *; timeout* 'timeout' *; size* 'size' *; [*l3proto* 'family' *;*] *}*
+*add* *ct expectation* ['family'] 'table' 'name' *{ protocol* 'protocol' *; dport* 'dport' *; timeout* 'timeout' *; size* 'size' *;* [*l3proto* 'family' *;*] *}*
+*delete* *ct expectation* ['family'] 'table' 'name'
+*list* *ct expectations*
Ct expectation is used to create connection expectations. Expectations are
assigned with the *ct expectation set* statement. 'protocol', 'dport',
@@ -124,6 +147,9 @@ unsigned integer
|l3proto |
layer 3 protocol of the expectation object |
address family (e.g. ip)
+|comment |
+per ct expectation comment field |
+string
|=================
.defining and assigning ct expectation policy
@@ -147,7 +173,9 @@ table ip filter {
COUNTER
~~~~~~~
[verse]
-*counter* ['packets bytes']
+*add* *counter* ['family'] 'table' 'name' [*{* [ *packets* 'packets' *bytes* 'bytes' ';' ] [ *comment* 'comment' ';' ] *}*]
+*delete* *counter* ['family'] 'table' 'name'
+*list* *counters*
.Counter specifications
[options="header"]
@@ -159,12 +187,31 @@ unsigned integer (64 bit)
|bytes |
initial count of bytes |
unsigned integer (64 bit)
+|comment |
+per counter comment field |
+string
|=================
+.*Using named counters*
+------------------
+nft add counter filter http
+nft add rule filter input tcp dport 80 counter name \"http\"
+------------------
+
+.*Using named counters with maps*
+------------------
+nft add counter filter http
+nft add counter filter https
+nft add rule filter input counter name tcp dport map { 80 : \"http\", 443 : \"https\" }
+------------------
+
QUOTA
~~~~~
[verse]
-*quota* [*over* | *until*] ['used']
+*add* *quota* ['family'] 'table' 'name' *{* [*over*|*until*] 'bytes' 'BYTE_UNIT' [ *used* 'bytes' 'BYTE_UNIT' ] ';' [ *comment* 'comment' ';' ] *}*
+BYTE_UNIT := bytes | kbytes | mbytes
+*delete* *quota* ['family'] 'table' 'name'
+*list* *quotas*
.Quota specifications
[options="header"]
@@ -177,4 +224,20 @@ Two arguments, unsigned integer (64 bit) and string: bytes, kbytes, mbytes.
|used |
initial value of used quota |
Two arguments, unsigned integer (64 bit) and string: bytes, kbytes, mbytes
+|comment |
+per quota comment field |
+string
|=================
+
+.*Using named quotas*
+------------------
+nft add quota filter user123 { over 20 mbytes }
+nft add rule filter input ip saddr 192.168.10.123 quota name \"user123\"
+------------------
+
+.*Using named quotas with maps*
+------------------
+nft add quota filter user123 { over 20 mbytes }
+nft add quota filter user124 { over 20 mbytes }
+nft add rule filter input quota name ip saddr map { 192.168.10.123 : \"user123\", 192.168.10.124 : \"user124\" }
+------------------
diff --git a/doc/statements.txt b/doc/statements.txt
index 9155f286..39b31fd2 100644
--- a/doc/statements.txt
+++ b/doc/statements.txt
@@ -11,7 +11,7 @@ The verdict statement alters control flow in the ruleset and issues policy decis
[horizontal]
*accept*:: Terminate ruleset evaluation and accept the packet.
The packet can still be dropped later by another hook, for instance accept
-in the forward hook still allows to drop the packet later in the postrouting hook,
+in the forward hook still allows one to drop the packet later in the postrouting hook,
or another forward base chain that has a higher priority number and is evaluated
afterwards in the processing pipeline.
*drop*:: Terminate ruleset evaluation and drop the packet.
@@ -71,7 +71,7 @@ EXTENSION HEADER STATEMENT
The extension header statement alters packet content in variable-sized headers.
This can currently be used to alter the TCP Maximum segment size of packets,
-similar to TCPMSS.
+similar to the TCPMSS target in iptables.
.change tcp mss
---------------
@@ -80,6 +80,13 @@ tcp flags syn tcp option maxseg size set 1360
tcp flags syn tcp option maxseg size set rt mtu
---------------
+You can also remove TCP options via the *reset* keyword.
+
+.remove tcp option
+---------------
+tcp flags syn reset tcp option sack-perm
+---------------
+
LOG STATEMENT
~~~~~~~~~~~~~
[verse]
@@ -93,10 +100,11 @@ packets, such as header fields, via the kernel log (where it can be read with
dmesg(1) or read in the syslog).
In the second form of invocation (if 'nflog_group' is specified), the Linux
-kernel will pass the packet to nfnetlink_log which will multicast the packet
-through a netlink socket to the specified multicast group. One or more userspace
-processes may subscribe to the group to receive the packets, see
-libnetfilter_queue documentation for details.
+kernel will pass the packet to nfnetlink_log which will send the log through a
+netlink socket to the specified group. One userspace process may subscribe to
+the group to receive the logs. See ulogd(8) for the Netfilter userspace log
+daemon, and the libnetfilter_log documentation if you would like to develop a
+custom program to digest your logs.
In the third form of invocation (if level audit is specified), the Linux
kernel writes a message into the audit buffer suitably formatted for reading
@@ -163,37 +171,77 @@ REJECT STATEMENT
____
*reject* [ *with* 'REJECT_WITH' ]
-'REJECT_WITH' := *icmp type* 'icmp_code' |
- *icmpv6 type* 'icmpv6_code' |
- *icmpx type* 'icmpx_code' |
+'REJECT_WITH' := *icmp* 'icmp_reject_code' |
+ *icmpv6* 'icmpv6_reject_code' |
+ *icmpx* 'icmpx_reject_code' |
*tcp reset*
____
A reject statement is used to send back an error packet in response to the
matched packet; otherwise it is equivalent to drop, so it is a terminating
statement, ending rule traversal. This statement is only valid in base chains
-using the *input*,
+using the *prerouting*, *input*,
*forward* or *output* hooks, and user-defined chains which are only called from
those chains.
-.different ICMP reject variants are meant for use in different table families
+.Keywords may be used to reject when specifying the ICMP code
[options="header"]
|==================
-|Variant |Family | Type
-|icmp|
-ip|
-icmp_code
-|icmpv6|
-ip6|
-icmpv6_code
-|icmpx|
-inet|
-icmpx_code
+|Keyword | Value
+|net-unreachable |
+0
+|host-unreachable |
+1
+|prot-unreachable|
+2
+|port-unreachable|
+3
+|frag-needed|
+4
+|net-prohibited|
+9
+|host-prohibited|
+10
+|admin-prohibited|
+13
+|===================
+
+.Keywords may be used to reject when specifying the ICMPv6 code
+[options="header"]
|==================
+|Keyword |Value
+|no-route|
+0
+|admin-prohibited|
+1
+|addr-unreachable|
+3
+|port-unreachable|
+4
+|policy-fail|
+5
+|reject-route|
+6
+|==================
+
+The ICMPvX Code type abstraction is a set of values which overlap between ICMP
+and ICMPv6 Code types to be used from the inet family.
+
+.Keywords may be used when specifying the ICMPvX code
+[options="header"]
+|==================
+|Keyword |Value
+|no-route|
+0
+|port-unreachable|
+1
+|host-unreachable|
+2
+|admin-prohibited|
+3
+|=================
-For a description of the different types and a list of supported keywords refer
-to DATA TYPES section above. The common default reject value is
-*port-unreachable*. +
+The common default ICMP code to reject is *port-unreachable*.
Note that in bridge family, reject statement is only allowed in base chains
which hook into input or prerouting.
@@ -216,7 +264,7 @@ The conntrack statement can be used to set the conntrack mark and conntrack labe
The ct statement sets meta data associated with a connection. The zone id
has to be assigned before a conntrack lookup takes place, i.e. this has to be
done in prerouting and possibly output (if locally generated packets need to be
-placed in a distinct zone), with a hook priority of -300.
+placed in a distinct zone), with a hook priority of *raw* (-300).
Unlike iptables, where the helper assignment happens in the raw table,
the helper needs to be assigned after a conntrack entry has been
@@ -253,11 +301,11 @@ ct mark set meta mark
------------------------------
table inet raw {
chain prerouting {
- type filter hook prerouting priority -300;
+ type filter hook prerouting priority raw;
ct zone set iif map { "eth1" : 1, "veth1" : 2 }
}
chain output {
- type filter hook output priority -300;
+ type filter hook output priority raw;
ct zone set oif map { "eth1" : 1, "veth1" : 2 }
}
}
@@ -270,7 +318,7 @@ ct event set new,related,destroy
NOTRACK STATEMENT
~~~~~~~~~~~~~~~~~
-The notrack statement allows to disable connection tracking for certain
+The notrack statement allows one to disable connection tracking for certain
packets.
[verse]
@@ -278,7 +326,7 @@ packets.
Note that for this statement to be effective, it has to be applied to packets
before a conntrack lookup happens. Therefore, it needs to sit in a chain with
-either prerouting or output hook and a hook priority of -300 or less.
+either prerouting or output hook and a hook priority of -300 (*raw*) or less.
See SYNPROXY STATEMENT for an example usage.
@@ -288,7 +336,7 @@ A meta statement sets the value of a meta expression. The existing meta fields
are: priority, mark, pkttype, nftrace. +
[verse]
-*meta* {*mark* | *priority* | *pkttype* | *nftrace*} *set* 'value'
+*meta* {*mark* | *priority* | *pkttype* | *nftrace* | *broute*} *set* 'value'
A meta statement sets meta data associated with a packet. +
@@ -308,6 +356,9 @@ pkt_type
|nftrace |
ruleset packet tracing on/off. Use *monitor trace* command to watch traces|
0, 1
+|broute |
+broute on/off. Packets are routed instead of being bridged|
+0, 1
|==========================
LIMIT STATEMENT
@@ -326,6 +377,12 @@ using this statement will match until this limit is reached. It can be used in
combination with the log statement to give limited logging. The optional
*over* keyword makes it match over the specified rate.
+The *burst* value influences the bucket size, i.e. jitter tolerance. With
+packet-based *limit*, the bucket holds exactly *burst* packets, by default
+five. If you specify packet *burst*, it must be a non-zero value. With
+byte-based *limit*, the bucket's minimum size is the given rate's byte value
+and the *burst* value adds to that, by default zero bytes.
+
.limit statement values
[options="header"]
|==================
@@ -342,21 +399,16 @@ NAT STATEMENTS
~~~~~~~~~~~~~~
[verse]
____
-*snat to* 'address' [*:*'port'] ['PRF_FLAGS']
-*snat to* 'address' *-* 'address' [*:*'port' *-* 'port'] ['PRF_FLAGS']
-*snat* { *ip* | *ip6* } *to* 'address' *-* 'address' [*:*'port' *-* 'port'] ['PR_FLAGS']
-*dnat to* 'address' [*:*'port'] ['PRF_FLAGS']
-*dnat to* 'address' [*:*'port' *-* 'port'] ['PR_FLAGS']
-*dnat* { *ip* | *ip6* } *to* 'address' [*:*'port' *-* 'port'] ['PR_FLAGS']
-*masquerade to* [*:*'port'] ['PRF_FLAGS']
-*masquerade to* [*:*'port' *-* 'port'] ['PRF_FLAGS']
-*redirect to* [*:*'port'] ['PRF_FLAGS']
-*redirect to* [*:*'port' *-* 'port'] ['PRF_FLAGS']
-
-'PRF_FLAGS' := 'PRF_FLAG' [*,* 'PRF_FLAGS']
-'PR_FLAGS' := 'PR_FLAG' [*,* 'PR_FLAGS']
-'PRF_FLAG' := 'PR_FLAG' | *fully-random*
-'PR_FLAG' := *persistent* | *random*
+*snat* [[*ip* | *ip6*] [ *prefix* ] *to*] 'ADDR_SPEC' [*:*'PORT_SPEC'] ['FLAGS']
+*dnat* [[*ip* | *ip6*] [ *prefix* ] *to*] 'ADDR_SPEC' [*:*'PORT_SPEC'] ['FLAGS']
+*masquerade* [*to :*'PORT_SPEC'] ['FLAGS']
+*redirect* [*to :*'PORT_SPEC'] ['FLAGS']
+
+'ADDR_SPEC' := 'address' | 'address' *-* 'address'
+'PORT_SPEC' := 'port' | 'port' *-* 'port'
+
+'FLAGS' := 'FLAG' [*,* 'FLAGS']
+'FLAG' := *persistent* | *random* | *fully-random*
____
The nat statements are only valid from nat chain types. +
@@ -386,6 +438,9 @@ Before kernel 4.18 nat statements require both prerouting and postrouting base c
to be present since otherwise packets on the return path won't be seen by
netfilter and therefore no reverse translation will take place.
+The optional *prefix* keyword allows one to map *n* source addresses to *n*
+destination addresses. See 'Advanced NAT examples' below.
+
.NAT statement values
[options="header"]
|==================
@@ -396,7 +451,7 @@ You may specify a mapping to relate a list of tuples composed of arbitrary
expression key with address value. |
ipv4_addr, ipv6_addr, e.g. abcd::1234, or you can use a mapping, e.g. meta mark map { 10 : 192.168.1.2, 20 : 192.168.1.3 }
|port|
-Specifies that the source/destination address of the packet should be modified. |
+Specifies that the source/destination port of the packet should be modified. |
port number (16 bit)
|===============================
@@ -419,8 +474,8 @@ If used then port mapping is generated based on a 32-bit pseudo-random algorithm
---------------------
# create a suitable table/chain setup for all further examples
add table nat
-add chain nat prerouting { type nat hook prerouting priority 0; }
-add chain nat postrouting { type nat hook postrouting priority 100; }
+add chain nat prerouting { type nat hook prerouting priority dstnat; }
+add chain nat postrouting { type nat hook postrouting priority srcnat; }
# translate source addresses of all packets leaving via eth0 to address 1.2.3.4
add rule nat postrouting oif eth0 snat to 1.2.3.4
@@ -445,6 +500,52 @@ add rule inet nat postrouting meta oif ppp0 masquerade
------------------------
+.Advanced NAT examples
+----------------------
+
+# map prefixes in one network to that of another, e.g. 10.141.11.4 is mangled to 192.168.2.4,
+# 10.141.11.5 is mangled to 192.168.2.5 and so on.
+add rule nat postrouting snat ip prefix to ip saddr map { 10.141.11.0/24 : 192.168.2.0/24 }
+
+# map a source address, destination port combination to a pool of destination addresses and ports:
+add rule nat postrouting dnat to ip saddr . tcp dport map { 192.168.1.2 . 80 : 10.141.10.2-10.141.10.5 . 8888-8999 }
+
+# The above example generates the following NAT expression:
+#
+# [ nat dnat ip addr_min reg 1 addr_max reg 10 proto_min reg 9 proto_max reg 11 ]
+#
+# which expects to obtain the following tuple:
+# IP address (min), port (min), IP address (max), port (max)
+# to be obtained from the map. The given addresses and ports are inclusive.
+
+# This also works with named maps and in combination with both concatenations and ranges:
+table ip nat {
+ map ipportmap {
+ typeof ip saddr : interval ip daddr . tcp dport
+ flags interval
+ elements = { 192.168.1.2 : 10.141.10.1-10.141.10.3 . 8888-8999, 192.168.2.0/24 : 10.141.11.5-10.141.11.20 . 8888-8999 }
+ }
+
+ chain prerouting {
+ type nat hook prerouting priority dstnat; policy accept;
+ ip protocol tcp dnat ip to ip saddr map @ipportmap
+ }
+}
+
+@ipportmap maps network prefixes to a range of hosts and ports.
+The new destination is taken from the range provided by the map element.
+Same for the destination port.
+
+Note the use of the "interval" keyword in the typeof description.
+This is required so nftables knows that it has to ask for twice the
+amount of storage for each key-value pair in the map.
+
+": ipv4_addr . inet_service" would allow associating one address and one port
+with each key. But in this case, two addresses and two ports (the minimum and
+maximum values for both) have to be stored for each key.
+
+------------------------
+
TPROXY STATEMENT
~~~~~~~~~~~~~~~~
Tproxy redirects the packet to a local socket without changing the packet header
@@ -481,21 +582,21 @@ this case the rule will match for both families.
-------------------------------------
table ip x {
chain y {
- type filter hook prerouting priority -150; policy accept;
+ type filter hook prerouting priority mangle; policy accept;
tcp dport ntp tproxy to 1.1.1.1
udp dport ssh tproxy to :2222
}
}
table ip6 x {
chain y {
- type filter hook prerouting priority -150; policy accept;
+ type filter hook prerouting priority mangle; policy accept;
tcp dport ntp tproxy to [dead::beef]
udp dport ssh tproxy to :2222
}
}
table inet x {
chain y {
- type filter hook prerouting priority -150; policy accept;
+ type filter hook prerouting priority mangle; policy accept;
tcp dport 321 tproxy to :ssh
tcp dport 99 tproxy ip to 1.1.1.1:999
udp dport 155 tproxy ip6 to [dead::beef]:smux
@@ -566,28 +667,13 @@ drop incorrect cookies. Flags combinations not expected during 3WHS will not
match and continue (e.g. SYN+FIN, SYN+ACK). Finally, drop invalid packets, this
will be out-of-flow packets that were not matched by SYNPROXY.
- table ip foo {
+ table ip x {
chain z {
type filter hook input priority filter; policy accept;
- ct state { invalid, untracked } synproxy mss 1460 wscale 9 timestamp sack-perm
+ ct state invalid, untracked synproxy mss 1460 wscale 9 timestamp sack-perm
ct state invalid drop
}
}
-
-The outcome ruleset of the steps above should be similar to the one below.
-
- table ip x {
- chain y {
- type filter hook prerouting priority raw; policy accept;
- tcp flags syn notrack
- }
-
- chain z {
- type filter hook input priority filter; policy accept;
- ct state { invalid, untracked } synproxy mss 1460 wscale 9 timestamp sack-perm
- ct state invalid drop
- }
- }
---------------------------------------
FLOW STATEMENT
@@ -608,13 +694,19 @@ for details.
[verse]
____
-*queue* [*num* 'queue_number'] [*bypass*]
-*queue* [*num* 'queue_number_from' - 'queue_number_to'] ['QUEUE_FLAGS']
+*queue* [*flags* 'QUEUE_FLAGS'] [*to* 'queue_number']
+*queue* [*flags* 'QUEUE_FLAGS'] [*to* 'queue_number_from' - 'queue_number_to']
+*queue* [*flags* 'QUEUE_FLAGS'] [*to* 'QUEUE_EXPRESSION' ]
'QUEUE_FLAGS' := 'QUEUE_FLAG' [*,* 'QUEUE_FLAGS']
'QUEUE_FLAG' := *bypass* | *fanout*
+'QUEUE_EXPRESSION' := *numgen* | *hash* | *symhash* | *MAP STATEMENT*
____
+QUEUE_EXPRESSION can be used to compute a queue number
+at run-time with the hash or numgen expressions. It also
+allows one to use the map statement to assign fixed queue numbers
+based on external inputs such as the source ip address or interface names.
.queue statement values
[options="header"]
@@ -670,7 +762,7 @@ string
ip filter forward dup to 10.2.3.4 device "eth0"
# copy raw frame to another interface
-netdetv ingress dup to "eth0"
+netdev ingress dup to "eth0"
dup to "eth0"
# combine with map dst addr to gateways
@@ -680,10 +772,27 @@ dup to ip daddr map { 192.168.7.1 : "eth0", 192.168.7.2 : "eth1" }
FWD STATEMENT
~~~~~~~~~~~~~
The fwd statement is used to redirect a raw packet to another interface. It is
-only available in the netdev family ingress hook. It is similar to the dup
-statement except that no copy is made.
+only available in the netdev family ingress and egress hooks. It is similar to
+the dup statement except that no copy is made.
+You can also specify the address of the next hop and the device to forward the
+packet to. This updates the source and destination MAC address of the packet by
+transmitting it through the neighboring layer. This also decrements the ttl
+field of the IP packet. This provides a way to effectively bypass the classical
+forwarding path, thus skipping the fib (forwarding information base) lookup.
+
+[verse]
*fwd to* 'device'
+*fwd* [*ip* | *ip6*] *to* 'address' *device* 'device'
+
+.Using the fwd statement
+------------------------
+# redirect raw packet to device
+netdev ingress fwd to "eth0"
+
+# forward packet to next hop 192.168.200.1 via eth0 device
+netdev ingress fwd ip to 192.168.200.1 device "eth0"
+------------------------
SET STATEMENT
~~~~~~~~~~~~~
@@ -699,13 +808,26 @@ will not grow indefinitely) either from the set definition or from the statement
that adds or updates them. The set statement can be used to e.g. create dynamic
blacklists.
+Dynamic updates are also supported with maps. In this case, the *add* or
+*update* rule needs to provide both the key and the data element (value),
+separated via ':'.
+
[verse]
{*add* | *update*} *@*'setname' *{* 'expression' [*timeout* 'timeout'] [*comment* 'string'] *}*
.Example for simple blacklist
-----------------------------
-# declare a set, bound to table "filter", in family "ip". Timeout and size are mandatory because we will add elements from packet path.
-nft add set ip filter blackhole "{ type ipv4_addr; flags timeout; size 65536; }"
+# declare a set, bound to table "filter", in family "ip".
+# Timeout and size are mandatory because we will add elements from packet path.
+# Entries will time out after one minute, after which they might be
+# re-added if the limit condition persists.
+nft add set ip filter blackhole \
+ "{ type ipv4_addr; flags dynamic; timeout 1m; size 65536; }"
+
+# declare a set to store the limit per saddr.
+# This must be separate from blackhole since the timeout is different
+nft add set ip filter flood \
+ "{ type ipv4_addr; flags dynamic; timeout 10s; size 128000; }"
# whitelist internal interface.
nft add rule ip filter input meta iifname "internal" accept
@@ -713,17 +835,18 @@ nft add rule ip filter input meta iifname "internal" accept
# drop packets coming from blacklisted ip addresses.
nft add rule ip filter input ip saddr @blackhole counter drop
-# add source ip addresses to the blacklist if more than 10 tcp connection requests occurred per second and ip address.
-# entries will timeout after one minute, after which they might be re-added if limit condition persists.
-nft add rule ip filter input tcp flags syn tcp dport ssh meter flood size 128000 { ip saddr timeout 10s limit rate over 10/second} add @blackhole { ip saddr timeout 1m } drop
+# add source ip addresses to the blacklist if more than 10 tcp connection
+# requests occurred per second and ip address.
+nft add rule ip filter input tcp flags syn tcp dport ssh \
+ add @flood { ip saddr limit rate over 10/second } \
+ add @blackhole { ip saddr } \
+ drop
-# inspect state of the rate limit meter:
-nft list meter ip filter flood
-
-# inspect content of blackhole:
+# inspect state of the sets.
+nft list set ip filter flood
nft list set ip filter blackhole
-# manually add two addresses to the set:
+# manually add two addresses to the blackhole.
nft add element filter blackhole { 10.2.3.4, 10.23.1.42 }
-----------------------------------------------
@@ -773,3 +896,20 @@ ____
# jump to different chains depending on layer 4 protocol type:
nft add rule ip filter input ip protocol vmap { tcp : jump tcp-chain, udp : jump udp-chain , icmp : jump icmp-chain }
------------------------
+
+XT STATEMENT
+~~~~~~~~~~~~
+This represents an xt statement from the xtables compat interface. It is a
+fallback if translation is not available or not complete.
+
+[verse]
+____
+*xt* 'TYPE' 'NAME'
+
+'TYPE' := *match* | *target* | *watcher*
+____
+
+Seeing this means the ruleset (or parts of it) was created by *iptables-nft*,
+and that tool should be used to manage it.
+
+*BEWARE:* nftables won't restore these statements.