summaryrefslogtreecommitdiffstats
path: root/doc/primary-expression.txt
blob: 40aca3d3fcf65d2b10dd76d92b211e94c9af6f55 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
META EXPRESSIONS
~~~~~~~~~~~~~~~~
[verse]
*meta* {*length* | *nfproto* | *l4proto* | *protocol* | *priority*}
[*meta*] {*mark* | *iif* | *iifname* | *iiftype* | *oif* | *oifname* | *oiftype* | *skuid* | *skgid* | *nftrace* | *rtclassid* | *ibrname* | *obrname* | *pkttype* | *cpu* | *iifgroup* | *oifgroup* | *cgroup* | *random* | *ipsec* | *iifkind* | *oifkind* | *time* | *hour* | *day* }

A meta expression refers to meta data associated with a packet.

There are two types of meta expressions: unqualified and qualified meta
expressions. Qualified meta expressions require the meta keyword before the meta
key, unqualified meta expressions can be specified by using the meta key
directly or as qualified meta expressions. Meta l4proto is useful to match a
particular transport protocol that is part of either an IPv4 or IPv6 packet. It
will also skip any IPv6 extension headers present in an IPv6 packet.

meta iif, oif, iifname and oifname are used to match the interface a packet
arrived on or is about to be sent out on.

iif and oif are used to match on the interface index, whereas iifname and
oifname are used to match on the interface name.
This is not the same -- assuming the rule

  filter input meta iif "foo"

Then this rule can only be added if the interface "foo" exists.
Also, the rule will continue to match even if the
interface "foo" is renamed to "bar".

This is because internally the interface index is used.
In case of dynamically created interfaces, such as tun/tap or dialup
interfaces (ppp for example), it might be better to use iifname or oifname
instead.

In these cases, the name is used so the interface doesn't have to exist to
add such a rule, it will stop matching if the interface gets renamed and it
will match again in case interface gets deleted and later a new interface
with the same name is created.

Like with iptables, wildcard matching on interface name prefixes is available for
*iifname* and *oifname* matches by appending an asterisk (*) character. Note
however that unlike iptables, nftables does not accept interface names
consisting of the wildcard character only - users are supposed to just skip
those always matching expressions. In order to match on literal asterisk
character, one may escape it using backslash (\).

.Meta expression types
[options="header"]
|==================
|Keyword | Description | Type
|length|
Length of the packet in bytes|
integer (32-bit)
|nfproto|
real hook protocol family, useful only in inet table|
integer (32 bit)
|l4proto|
layer 4 protocol, skips ipv6 extension headers|
integer (8 bit)
|protocol|
EtherType protocol value|
ether_type
|priority|
TC packet priority|
tc_handle
|mark|
Packet mark |
mark
|iif|
Input interface index |
iface_index
|iifname|
Input interface name |
ifname
|iiftype|
Input interface type|
iface_type
|oif|
Output interface index|
iface_index
|oifname|
Output interface name|
ifname
|oiftype|
Output interface hardware type|
iface_type
|sdif|
Slave device input interface index |
iface_index
|sdifname|
Slave device interface name|
ifname
|skuid|
UID associated with originating socket|
uid
|skgid|
GID associated with originating socket|
gid
|rtclassid|
Routing realm|
realm
|ibrname|
Input bridge interface name|
ifname
|obrname|
Output bridge interface name|
ifname
|pkttype|
packet type|
pkt_type
|cpu|
cpu number processing the packet|
integer (32 bit)
|iifgroup|
incoming device group|
devgroup
|oifgroup|
outgoing device group|
devgroup
|cgroup|
control group id |
integer (32 bit)
|random|
pseudo-random number|
integer (32 bit)
|ipsec|
true if packet was ipsec encrypted |
boolean (1 bit)
|iifkind|
Input interface kind |
|oifkind|
Output interface kind|
|time|
Absolute time of packet reception|
Integer (32 bit) or string
|day|
Day of week|
Integer (8 bit) or string
|hour|
Hour of day|
String
|====================

.Meta expression specific types
[options="header"]
|==================
|Type | Description
|iface_index |
Interface index (32 bit number). Can be specified numerically or as name of an existing interface.
|ifname|
Interface name (16 byte string). Does not have to exist.
|iface_type|
Interface type (16 bit number).
|uid|
User ID (32 bit number). Can be specified numerically or as user name.
|gid|
Group ID (32 bit number). Can be specified numerically or as group name.
|realm|
Routing Realm (32 bit number). Can be specified numerically or as symbolic name defined in /etc/iproute2/rt_realms.
|devgroup_type|
Device group (32 bit number). Can be specified numerically or as symbolic name defined in /etc/iproute2/group.
|pkt_type|
Packet type: *host* (addressed to local host), *broadcast* (to all),
*multicast* (to group), *other* (addressed to another host).
|ifkind|
Interface kind (16 byte string). See TYPES in ip-link(8) for a list.
|time|
Either an integer or a date in ISO format. For example: "2019-06-06 17:00".
Hour and seconds are optional and can be omitted if desired. If omitted,
midnight will be assumed.
The following three would be equivalent: "2019-06-06", "2019-06-06 00:00"
and "2019-06-06 00:00:00". Use a range expression such as
"2019-06-06 10:00"-"2019-06-10 14:00" for matching a time range.
When an integer is given, it is assumed to be a UNIX timestamp.
|day|
Either a day of week ("Monday", "Tuesday", etc.), or an integer between 0 and 6.
Strings are matched case-insensitively, and a full match is not expected (e.g. "Mon" would match "Monday").
When an integer is given, 0 is Sunday and 6 is Saturday. Use a range expression
such as "Monday"-"Wednesday" for matching a week day range.
|hour|
A string representing an hour in 24-hour format. Seconds can optionally be specified.
For example, 17:00 and 17:00:00 would be equivalent. Use a range expression such
as "17:00"-"19:00" for matching a time range.
|=============================

.Using meta expressions
-----------------------
# qualified meta expression
filter output meta oif eth0
filter forward meta iifkind { "tun", "veth" }

# unqualified meta expression
filter output oif eth0

# incoming packet was subject to ipsec processing
raw prerouting meta ipsec exists accept

# match incoming packet from 03:00 to 14:00 local time
raw prerouting meta hour "03:00"-"14:00" counter accept
-----------------------

SOCKET EXPRESSION
~~~~~~~~~~~~~~~~~
[verse]
*socket* {*transparent* | *mark* | *wildcard*}
*socket* *cgroupv2* *level* 'NUM'

Socket expression can be used to search for an existing open TCP/UDP socket and
its attributes that can be associated with a packet. It looks for an established
or non-zero bound listening socket (possibly with a non-local address). You can
also use it to match on the socket cgroupv2 at a given ancestor level, e.g. if
the socket belongs to cgroupv2 'a/b', ancestor level 1 checks for a matching on
cgroup 'a' and ancestor level 2 checks for a matching on cgroup 'b'.

.Available socket attributes
[options="header"]
|==================
|Name |Description| Type
|transparent|
Value of the IP_TRANSPARENT socket option in the found socket. It can be 0 or 1.|
boolean (1 bit)
|mark| Value of the socket mark (SOL_SOCKET, SO_MARK). | mark
|wildcard|
Indicates whether the socket is wildcard-bound (e.g. 0.0.0.0 or ::0). |
boolean (1 bit)
|cgroupv2|
cgroup version 2 for this socket (path from /sys/fs/cgroup)|
cgroupv2
|==================

.Using socket expression
------------------------
# Mark packets that correspond to a transparent socket. "socket wildcard 0"
# means that zero-bound listener sockets are NOT matched (which is usually
# exactly what you want).
table inet x {
    chain y {
        type filter hook prerouting priority mangle; policy accept;
        socket transparent 1 socket wildcard 0 mark set 0x00000001 accept
    }
}

# Trace packets that corresponds to a socket with a mark value of 15
table inet x {
    chain y {
        type filter hook prerouting priority mangle; policy accept;
        socket mark 0x0000000f nftrace set 1
    }
}

# Set packet mark to socket mark
table inet x {
    chain y {
        type filter hook prerouting priority mangle; policy accept;
        tcp dport 8080 mark set socket mark
    }
}

# Count packets for cgroupv2 "user.slice" at level 1
table inet x {
    chain y {
        type filter hook input priority filter; policy accept;
        socket cgroupv2 level 1 "user.slice" counter
    }
}
----------------------

OSF EXPRESSION
~~~~~~~~~~~~~~
[verse]
*osf* [*ttl* {*loose* | *skip*}] {*name* | *version*}

The osf expression does passive operating system fingerprinting. This
expression compares some data (Window Size, MSS, options and their order, DF,
and others) from packets with the SYN bit set.

.Available osf attributes
[options="header"]
|==================
|Name |Description| Type
|ttl|
Do TTL checks on the packet to determine the operating system.|
string
|version|
Do OS version checks on the packet.|
|name|
Name of the OS signature to match. All signatures can be found at pf.os file.
Use "unknown" for OS signatures that the expression could not detect.|
string
|==================

.Available ttl values
---------------------
If no TTL attribute is passed, make a true IP header and fingerprint TTL true comparison. This generally works for LANs.

* loose: Check if the IP header's TTL is less than the fingerprint one. Works for globally-routable addresses.
* skip: Do not compare the TTL at all.
---------------------

.Using osf expression
---------------------
# Accept packets that match the "Linux" OS genre signature without comparing TTL.
table inet x {
    chain y {
        type filter hook input priority filter; policy accept;
        osf ttl skip name "Linux"
    }
}
-----------------------

FIB EXPRESSIONS
~~~~~~~~~~~~~~~
[verse]
*fib* 'FIB_TUPLE' 'FIB_RESULT'
'FIB_TUPLE' := { *saddr* | *daddr*} [ *.* { *iif* | *oif* } *.* *mark* ]
'FIB_RESULT'  := { *oif* | *oifname* | *type* }


A fib expression queries the fib (forwarding information base) to obtain information
such as the output interface index.

The first arguments to the *fib* expression are the input keys to be passed to the fib lookup function.
One of *saddr* or *daddr* is mandatory, they are also mutually exclusive.

*mark*, *iif* and *oif* keywords are optional modifiers to influence the search result, see
the *FIB_TUPLE* keyword table below for a description.
The *iif* and *oif* tuple keywords are also mutually exclusive.

The last argument to the *fib* expression is the desired result type.

*oif* asks to obtain the interface index that would be used to send packets to the packets source
(*saddr* key) or destination (*daddr* key).  If no routing entry is found, the returned interface
index is 0.

*oifname* is like *oif*, but it fills the interface name instead.  This is useful to check dynamic
interfaces such as ppp devices.  If no entry is found, an empty interface name is returned.

*type* returns the address type such as unicast or multicast.  A complete list of supported
address types can be shown with *nft* *describe* *fib_addrtype*.

.FIB_TUPLE keywords
[options="header"]
|==================
|flag| Description
|daddr| Perform a normal route lookup: search fib for route to the *destination address* of the packet.
|saddr| Perform a reverse route lookup: search the fib for route to the *source address* of the packet.
|mark | consider the packet mark (nfmark) when querying the fib.
|iif  | if fib lookups provides a route then check its output interface is identical to the packets *input* interface.
|oif  | if fib lookups provides a route then check its output interface is identical to the packets *output* interface.  This flag can only be used with the *type* result.
|=======================

.FIB_RESULT keywords
[options="header"]
|==================
|Keyword| Description| Result Type
|oif|
Output interface index|
iface_index
|oifname|
Output interface name|
ifname
|type|
Address type |
fib_addrtype (see *nft* *describe* *fib_addrtype* for a list)
|=======================

The *oif* and *oifname* result is only valid in the *prerouting*, *input* and *forward* hooks.
The *type* can be queried from any one of *prerouting*, *input*, *forward* *output* and *postrouting*.

For *type*, the presence of the *iif* keyword in the 'FIB_TUPLE' modifiers restrict the available
hooks to those where the packet is associated with an incoming interface, i.e. *prerouting*, *input* and *forward*.
Likewise, the *oif* keyword in the 'FIB_TUPLE' modifier list will limit the available hooks to
*forward*, *output* and *postrouting*.

.Using fib expressions
----------------------
# drop packets without a reverse path
filter prerouting fib saddr . iif oif missing drop

In this example, 'saddr . iif' looks up a route to the *source address* of the packet and restricts matching
results to the interface that the packet arrived on, then stores the output interface index from the obtained
fib route result.

If no route was found for the source address/input interface combination, the output interface index is zero.
Hence, this rule will drop all packets that do not have a strict reverse path (hypothetical reply packet
would be sent via the interface the tested packet arrived on).

If only 'saddr oif' is used as the input key, then this rule would only drop packets where the fib cannot
find a route. In most setups this will never drop packets because the default route is returned.

# drop packets if the destination ip address is not configured on the incoming interface
filter prerouting fib daddr . iif type != { local, broadcast, multicast } drop

This queries the fib based on the current packets' destination address and the incoming interface.

If the packet is sent to a unicast address that is configured on a different interface, then the packet
will be dropped as such an address would be classified as 'unicast' type.
Without the 'iif' modifier, any address configured on the local machine is 'local', and unicast addresses
not configured on any interface would return the type 'unicast'.

# perform lookup in a specific 'blackhole' table (0xdead, needs ip appropriate ip rule)
filter prerouting meta mark set 0xdead fib daddr . mark type vmap { blackhole : drop, prohibit : jump prohibited, unreachable : drop }
----------------------

ROUTING EXPRESSIONS
~~~~~~~~~~~~~~~~~~~
[verse]
*rt* [*ip* | *ip6*] {*classid* | *nexthop* | *mtu* | *ipsec*}

A routing expression refers to routing data associated with a packet.

.Routing expression types
[options="header"]
|=======================
|Keyword| Description| Type
|classid|
Routing realm|
realm
|nexthop|
Routing nexthop|
ipv4_addr/ipv6_addr
|mtu|
TCP maximum segment size of route |
integer (16 bit)
|ipsec|
route via ipsec tunnel or transport |
boolean
|=================================

.Routing expression specific types
[options="header"]
|=======================
|Type| Description
|realm|
Routing Realm (32 bit number). Can be specified numerically or as symbolic name defined in /etc/iproute2/rt_realms.
|========================

.Using routing expressions
--------------------------
# IP family independent rt expression
filter output rt classid 10

# IP family dependent rt expressions
ip filter output rt nexthop 192.168.0.1
ip6 filter output rt nexthop fd00::1
inet filter output rt ip nexthop 192.168.0.1
inet filter output rt ip6 nexthop fd00::1

# outgoing packet will be encapsulated/encrypted by ipsec
filter output rt ipsec exists
-------------------------- 

IPSEC EXPRESSIONS
~~~~~~~~~~~~~~~~~

[verse]
*ipsec* {*in* | *out*} [ *spnum* 'NUM' ]  {*reqid* | *spi*}
*ipsec* {*in* | *out*} [ *spnum* 'NUM' ]  {*ip* | *ip6*} {*saddr* | *daddr*}

An ipsec expression refers to ipsec data associated with a packet.

The 'in' or 'out' keyword needs to be used to specify if the expression should
examine inbound or outbound policies. The 'in' keyword can be used in the
prerouting, input and forward hooks.  The 'out' keyword applies to forward,
output and postrouting hooks.
The optional keyword spnum can be used to match a specific state in a chain,
it defaults to 0.

.Ipsec expression types
[options="header"]
|=======================
|Keyword| Description| Type
|reqid|
Request ID|
integer (32 bit)
|spi|
Security Parameter Index|
integer (32 bit)
|saddr|
Source address of the tunnel|
ipv4_addr/ipv6_addr
|daddr|
Destination address of the tunnel|
ipv4_addr/ipv6_addr
|=================================

*Note:* When using xfrm_interface, this expression is not useable in output
hook as the plain packet does not traverse it with IPsec info attached - use a
chain in postrouting hook instead.

NUMGEN EXPRESSION
~~~~~~~~~~~~~~~~~

[verse]
*numgen* {*inc* | *random*} *mod* 'NUM' [ *offset* 'NUM' ]

Create a number generator. The *inc* or *random* keywords control its
operation mode: In *inc* mode, the last returned value is simply incremented.
In *random* mode, a new random number is returned. The value after *mod*
keyword specifies an upper boundary (read: modulus) which is not reached by
returned numbers. The optional *offset* allows one to increment the returned value
by a fixed offset.

A typical use-case for *numgen* is load-balancing:

.Using numgen expression
------------------------
# round-robin between 192.168.10.100 and 192.168.20.200:
add rule nat prerouting dnat to numgen inc mod 2 map \
	{ 0 : 192.168.10.100, 1 : 192.168.20.200 }

# probability-based with odd bias using intervals:
add rule nat prerouting dnat to numgen random mod 10 map \
        { 0-2 : 192.168.10.100, 3-9 : 192.168.20.200 }
------------------------

HASH EXPRESSIONS
~~~~~~~~~~~~~~~~

[verse]
*jhash* {*ip saddr* | *ip6 daddr* | *tcp dport* | *udp sport* | *ether saddr*} [*.* ...] *mod* 'NUM' [ *seed* 'NUM' ] [ *offset* 'NUM' ]
*symhash* *mod* 'NUM' [ *offset* 'NUM' ]

Use a hashing function to generate a number. The functions available are
*jhash*, known as Jenkins Hash, and *symhash*, for Symmetric Hash. The
*jhash* requires an expression to determine the parameters of the packet
header to apply the hashing, concatenations are possible as well. The value
after *mod* keyword specifies an upper boundary (read: modulus) which is
not reached by returned numbers. The optional *seed* is used to specify an
init value used as seed in the hashing function. The optional *offset*
allows one to increment the returned value by a fixed offset.

A typical use-case for *jhash* and *symhash* is load-balancing:

.Using hash expressions
------------------------
# load balance based on source ip between 2 ip addresses:
add rule nat prerouting dnat to jhash ip saddr mod 2 map \
	{ 0 : 192.168.10.100, 1 : 192.168.20.200 }

# symmetric load balancing between 2 ip addresses:
add rule nat prerouting dnat to symhash mod 2 map \
        { 0 : 192.168.10.100, 1 : 192.168.20.200 }
------------------------