Feature Expressions¶
A feature expression specifies how to extract a feature. At run time, if the feature is needed, the expression is applied and the result is the feature. The feature can be any of the types
Type |
Description |
---|---|
String |
Text. |
NULL |
The null value, no value. |
Integer |
A signed integral value. |
Boolean |
A value that is |
IP Address |
An IP address. |
Duration |
A time span. |
Tuple |
A list of features. |
Expressions vary from simple to very complex, but ultimately produce a feature of one of these types.
Basic Expressions¶
Basic feature expressions are very similar to Python format strings which contains a mixture of literal strings and extractors. Extractors are referenced by what Python calls a "replacement field" which are set off by braces. The underlying format structure is
**{** [ *name* ] [ **:** [ *format* ] [ **:** *extension* ] ] **}**
name
specifies the extractor, format
makes it possible to control the output of the
extractor (such as width), and extension
is used to pass extra data to the extractor.
In general the result of a quoted string is a string feature. All extractors can generate text output and so can be used in a feature expression.. As a special case, if the format expression consists of a single extractor and no literal text then the result type is the result type of the extractor.
A feature expression can be preceded by the YAML type tag !literal
in which case the string is
treated as a literal string, no extractors are invoked. Doubling braces will escape the braces and
yield a single one, the same way as for Python.
And finally, as unquoted string is parsed differently. It is checked if it can be parsed as one of a particular set of literals and if so is treated as such.
The string "NULL", which yields the no value feature.
integer
boolean
IP address
If none of these, it is treated as an extractor, e.g. as if it had been enclosed by "{}"
. This
is by far the most common case and so it is useful and clearer to skip the quotes and braces.
A feature expression can include modifiers. These modify or transform an extracted feature. The base expression is used to extract a feature, then the modifiers are applied to the feature sequentially to yield the feature. TxnBox uses a data flow style where most feature manipulations are expected to be done by modifiers on expressions.
Some examples -
"Delain concert"
The literal string "Delain concert".
"Forwarded for {inbound-addr-remote} to {ua-req-host}"
This extracts a string that could be "Forwarded for 192.23.34.99 to apache.org". It invokes the
extractors inbound-addr-remote
and ua-req-host
which extract the client IP address and
the host for the client request respectively.
!literal "Forwarded for {inbound-addr-remote} to {ua-req-host}"
The literal string "Forwarded for {inbound-addr-remote} to {ua-req-host}".
"Forwarded for {{inbound-addr-remote}} to {{ua-req-host}}"
The literal string "Forwarded for {inbound-addr-remote} to {ua-req-host}".
56
The integer constant 56.
NULL
The null feature.
"NULL"
The literal string "NULL".
inbound-addr-remote
The extractor inbound-addr-remote
which extracts an IP address feature with the value of the
client source address. This extracts a feature of type IP_Address
. This identical to
"{inbound-addr-remote}"
Formatting¶
The second part of an extractor supports controlling the format of the output. This is not generally
required but in some cases it is quite useful. A good example is the extractor
is-internal
. This returns a true or false value, which is in the C style mapped to 1
and 0. However, it can be changed to "true" and "false" by specifying the output as a string.
proxy-req-field<Carp-Internal>: is-internal:s
Formatting is most commonly useful when setting values, such as field values. The extracted strings can be justified, limited in width, and in particular IP addresses can be formatted in a variety of ways.
Scoped Extractors¶
There are sets of extractors that extract data from a scope. A scope becomes active at a specific point in a configuration and remains active for all nested configuration unless overridden by a scope of the same type. Multiple scopes of the same type nest in the standard scope semantics, where when a scope becomes inactive the most recently active scope becomes active again. Any scope extractors extract data from the most recent / innermost scope.
Regular Expression Scope¶
If a regular expression comparison is successful matched, or explicitly applied, this creates a
scope where the capture groups are accessible via index extractors.These are numbered in the
standard way, with 0
meaning the entire matched string, 1
the first capture group, 2
the
second capture group, and so on. It is an error to use an index that is larger than the available
capture groups, or outside an active scope. For example if a header named "mail-check" should be set
if the host contains the domain "mail", it could be done as
with: ua-req-host
select:
- rxp: "^(?:(.*?)[.])?mail(?:[.](.*?))?$"
do:
- proxy-req-field<mail-check>: "You've got mail from {2}!"
This is a case where a feature string must be used instead of an unquoted extractor. That is
proxy-req-field<mail-check>: 2
sets the field to the integer value 2, which fails. Similarly
proxy-req-field<mail-check>: "2"
sets the field to the string "2". The correct usage is
proxy-req-field<mail-check>: "{2}"
sets the field to the second capture group.