metrics.config¶
This configuration file is used to define dynamic metrics on Traffic Server activity. Metrics defined here are available through all normal means of metrics reporting, including traffic_line and Stats Over HTTP Plugin.
Format¶
The configuration file itself is a Lua script. As in any Lua code, comments begin with --, and you may declare your own functions and define global variables.
Metric Definitions¶
Metrics are defined by calling the supplied metric generator functions. There is one for each supported type, and their parameters are identical:
<typefn> '<name>' [[
<metric generating function body>
]]
In practice, this will look like:
float 'proxy.node.useful_metric' [[
  return math.random()
]]
A real metric generator would have something more useful in its body. The string containing the metric generating function’s body (everything between [[ and ]], which is a multiline literal string in Lua) is stored and then evaluated as an anonymous function, which receives a single argument: the name of the metric (in the example above, proxy.node.useful_metric). If necessary, you can capture this argument using the ... expression, which yields the arguments passed to the enclosing function.
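The following is a minimal sketch of how a definition of this shape can work in plain Lua. It is not the Traffic Server implementation: the float function here is a hypothetical stand-in, and the registry table and evaluate function are names invented for illustration. It relies only on two standard Lua behaviours: a call such as f 'a' [[b]] is parsed as f('a')([[b]]), and a chunk compiled from a string receives its arguments through the ... expression.

```lua
-- Hypothetical stand-ins for illustration only; not Traffic Server code.
local compile = loadstring or load   -- loadstring in Lua 5.1, load in 5.2+
local registry = {}                  -- metric name -> compiled generator

-- float 'name' [[body]] parses as float('name')([[body]]).
function float(name)
  return function(body)
    registry[name] = assert(compile(body))
  end
end

-- Evaluate a metric by calling its generator with the metric name.
function evaluate(name)
  return registry[name](name)
end

float 'proxy.node.example' [[
  local self = ...        -- captures the metric name
  return #self * 0.5      -- contrived value: half the length of the name
]]

local value = evaluate('proxy.node.example')
```

Here value is half the length of the 18-character metric name, which shows that the body really did receive the name as its only argument.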
Metric Types¶
float¶
A gauge style metric which will return floating point numbers. Floating point gauge metrics are appropriate for values which may increase or decrease arbitrarily (e.g. disk usage, cache hit ratios, average document sizes, and so on).
integer¶
A gauge style metric which will return integers. Integer gauge metrics are appropriate for values which may increase or decrease arbitrarily, and do not need any decimal components.
counter¶
A metric which supplies integer-only values and is used almost exclusively to report the number of events of some kind that have occurred. Frequent uses are the number of requests served, responses by specific HTTP status codes, the number of failed DNS lookups, and so on.
Metric Scopes¶
All dynamic metrics, like their built-in counterparts, exist within a scope which determines whether they reflect the state of the current Traffic Server node, or the state of the entire Traffic Server cluster of which the current node is a member.
The scope of a metric is derived from its name. Dynamic metric names begin with proxy. followed by either node. or cluster. to select the scope.
Thus, proxy.node.active_origin_connections might be used for the number of open connections to origin servers on just the current node, whereas proxy.cluster.active_origin_connections would be the counterpart for the total open connections to origin servers from all Traffic Server nodes in the cluster, including the current node. (Note that these names are contrived. You are advised to pick as clear and detailed a metric name as possible and to ensure there is no conflict with existing metric names.)
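The naming rule above can be checked mechanically. The sketch below is only an illustration: scope_of is a hypothetical helper, not something provided by Traffic Server, and it uses nothing beyond standard Lua string patterns.

```lua
-- Hypothetical helper: returns 'node' or 'cluster' for a dynamic metric
-- name, or nil when the name does not follow the proxy.<scope>. convention.
function scope_of(name)
  local scope = name:match('^proxy%.(%a+)%.')
  if scope == 'node' or scope == 'cluster' then
    return scope
  end
  return nil
end

local a = scope_of('proxy.node.active_origin_connections')
local b = scope_of('proxy.cluster.active_origin_connections')
```

With the contrived names used above, a is 'node' and b is 'cluster'.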
Support Functions¶
Several supporting functions are defined in the default configuration file. The dynamic metrics shipped in the default metrics.config make extensive use of these functions, and your own custom metrics may use them as well.
cluster(name)¶
Returns the sum of metric name for the entire cluster of which the current node is a member. Memoization is used to avoid additional cost from calling this function multiple times within a single metrics pass. The name must be a metric within the node scope.
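To illustrate the memoization described here, the following sketch caches a sum for the duration of one pass. Everything in it is an assumption made for illustration: cluster_sketch, end_of_pass, sums_performed, and the fixed per-node values are hypothetical, and the real cluster function obtains its values from the cluster rather than from a local table.

```lua
-- Hypothetical illustration of per-pass memoization; not the shipped code.
local node_values = { 12, 30, 7 }   -- made-up readings from three nodes
local pass_cache = {}               -- metric name -> sum for this pass
local sum_count = 0                 -- how many times a sum was computed

local function sum_across_nodes(name)
  sum_count = sum_count + 1
  local total = 0
  for _, v in ipairs(node_values) do
    total = total + v
  end
  return total
end

function cluster_sketch(name)
  local cached = pass_cache[name]
  if cached == nil then
    cached = sum_across_nodes(name)
    pass_cache[name] = cached
  end
  return cached
end

-- Called once a metrics pass is finished, so the next pass recomputes.
function end_of_pass()
  pass_cache = {}
end

function sums_performed()
  return sum_count
end

local first = cluster_sketch('proxy.node.example')
local second = cluster_sketch('proxy.node.example')   -- served from the cache
```

Both calls return the same sum, but the summation itself runs only once until end_of_pass clears the cache.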
mbits(bytes)¶
Converts and returns bytes as megabits (bytes * 8 / 1000000).
mbytes(bytes)¶
Converts and returns bytes as mebibytes (bytes / (1024 * 1024)).
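The two conversions can be written directly from the formulas given above. This is a sketch based only on those formulas and is not copied from the shipped configuration file.

```lua
-- Sketches of the two conversions, written from the documented formulas.
function mbits(bytes)
  return bytes * 8 / 1000000        -- decimal megabits
end

function mbytes(bytes)
  return bytes / (1024 * 1024)      -- binary mebibytes
end

local one_mebibyte = 1024 * 1024
local as_mbits = mbits(one_mebibyte)     -- 8.388608
local as_mbytes = mbytes(one_mebibyte)   -- 1
```

Note that the two functions use different bases: mbits divides by a power of ten, while mbytes divides by a power of two, so the same byte count does not convert between them by a simple factor of 8.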
now()¶
Returns the current node’s time in milliseconds since the epoch.
rate_of(msec, key, fn)¶
Returns the rate of change over a period of msec milliseconds for the metric value of key (obtained by invoking the function fn).
This is accomplished by effectively snapshotting the value of the metric at the beginning and end of the given period expressed by msec, multiplying their difference by 1,000 and dividing that by msec.
rate_of_10s(key, fn)¶
Returns the rate of change for the past 10 seconds for the metric key, as calculated by function fn. This function simply wraps rate_of and supplies an msec value of 10 * 1000.
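The arithmetic behind rate_of and its 10 second wrapper can be sketched as follows. This is a simplified illustration under stated assumptions, not the shipped implementation: the samples table and set_clock are hypothetical, and now is replaced by a controllable stand-in so that the example is deterministic.

```lua
-- Simplified sketch; the shipped functions may differ in detail.
local clock_ms = 0
local function now() return clock_ms end       -- stand-in for the real now()
function set_clock(ms) clock_ms = ms end       -- hypothetical test hook

local samples = {}   -- key -> { start_time, start_value, rate }

function rate_of(msec, key, fn)
  local value, t = fn(), now()
  local s = samples[key]
  if s == nil then
    -- First call: take the opening snapshot; no rate is known yet.
    samples[key] = { start_time = t, start_value = value, rate = 0 }
    return 0
  end
  if t - s.start_time >= msec then
    -- Period elapsed: difference * 1000 / msec gives a per-second rate.
    s.rate = (value - s.start_value) * 1000 / msec
    s.start_time, s.start_value = t, value
  end
  return s.rate
end

function rate_of_10s(key, fn)
  return rate_of(10 * 1000, key, fn)
end

-- Usage: 500 lookups occur during one 10 second period.
local lookups = 1000
local function read_lookups() return lookups end
local before = rate_of_10s('example.key', read_lookups)   -- 0, first snapshot
set_clock(10000)
lookups = 1500
local after = rate_of_10s('example.key', read_lookups)    -- 500 * 1000 / 10000
```

In the usage above, after works out to 50 lookups per second. Between period boundaries the sketch keeps returning the most recently computed rate.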
Definition Examples¶
For illustrative purposes, a few of the dynamic metric definitions you may find in your Traffic Server installation’s default metrics.config are explained here. The actual file will contain many more definitions, and of course you may add your own, as well.
Returning a single value¶
The simplest example is a dynamic node metric which does nothing but return the current value for an underlying process metric:
counter 'proxy.node.http.user_agents_total_documents_served' [[
  return proxy.process.http.incoming_requests
]]
This uses the supplied metric generator function counter, which is given two things: the name of the dynamic metric to create, followed by the body of the function used to calculate the value. In this case, the function body is just a return of the named, underlying process statistic. No calculations, aggregates, or other processing are performed.
Returning a rate-of-change¶
Slightly more involved than just returning a point-in-time value from a given statistic is calculating the rate of change:
integer 'proxy.node.dns.lookups_per_second' [[
  local self = ...
  return rate_of_10s(self,
    function() return proxy.process.dns.total_dns_lookups end
  )
]]
Similar to the previous example, we are returning another metric’s value, but in this case we do so within a function that we’re passing into rate_of_10s. This function, explained earlier, wraps rate_of, which tracks the given metric over a specific interval and returns the average per-second rate of change, obtaining the values it uses to calculate this rate by invoking the function passed to it.
Calculating a rate-of-change’s delta¶
A more complicated example involves calculating how much the rate of change of an underlying statistic varies over a given period of time. This is not an average of a statistic, nor is it just the raw delta between two samplings of that statistic. It is not suited to telling you how many times an event has occurred, but it is useful for telling you how erratic or unstable the frequency of that event is.
In other words, a large absolute value indicates a deviation from the usual pattern of activity. For example, if your Traffic Server cache (using the example dynamic metric function below) sees between 10,000 and 10,250 HostDB hits every 10 seconds, the value returned by this metric will remain fairly small. If 50,000 hits suddenly reach HostDB within that same averaging interval, the value will increase significantly. This could then be used to trigger alerts that something may be wrong with HostDB lookups on the Traffic Server cluster.
integer 'proxy.node.hostdb.total_hits_avg_10s' [[
  local self = ...
  return interval_delta_of_10s(self,
    function() return proxy.process.hostdb.total_hits end
  )
]]
The catch is that if the dramatic increase is actually the new norm, the metric will return to emitting small absolute values again, even though the statistic underneath is now consistently and significantly higher or lower than it used to be. If what you are trying to measure is the stability of a metric, though, that is a good thing in the long term.
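The behaviour described above can be worked through with plain arithmetic. The sketch below is an assumption about the general idea only: interval_deltas is a hypothetical helper and is not the interval_delta_of_10s function used in the definition. It takes cumulative counter readings sampled at equal intervals and reports how much each interval’s increment differs from the previous one.

```lua
-- Hypothetical helper for illustration; not interval_delta_of_10s itself.
function interval_deltas(samples)
  local deltas = {}
  local previous_increment = nil
  for i = 2, #samples do
    local increment = samples[i] - samples[i - 1]
    if previous_increment ~= nil then
      deltas[#deltas + 1] = increment - previous_increment
    end
    previous_increment = increment
  end
  return deltas
end

-- Cumulative HostDB hits sampled every 10 seconds (made-up numbers):
-- increments per interval are 10000, 10250, 50000, 50000.
local deltas = interval_deltas({ 0, 10000, 20250, 70250, 120250 })
-- deltas holds 250, 39750, 0
```

The jump to 50,000 hits produces a large value once, and the value falls back to 0 when 50,000 hits per interval becomes the steady state, which is the catch noted above.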
Converting a metric to a ratio¶
Using a very simplified version of the Traffic Server cache hit reporting, we can demonstrate taking a metric which expresses the occurrence of one type of event within a set of possibilities and converting its absolute value into a ratio of that set’s total.
In this example, we assume we have three cache hit states (misses, hits, and revalidates) and they are tracked in the metrics proxy.node.cache.<state>. These are not the real metric names in Traffic Server, and there are much finer grained reporting states available, but we’ll use these for brevity.
float 'proxy.node.cache.hits_ratio' [[
  return
    proxy.node.cache.hits /
    ( proxy.node.cache.hits +
      proxy.node.cache.misses +
      proxy.node.cache.revalidates
    )
]]
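One thing to keep in mind with a ratio like this is the case where every term is zero, for example just after startup, since dividing zero by zero in Lua does not produce a usable number. A guarded version might look like the sketch below. The ratio_of helper is hypothetical and is not among the support functions described earlier, and returning 0 for an empty total is a design choice made here, not documented behaviour.

```lua
-- Hypothetical helper: part / total, with a guard for an empty total.
function ratio_of(part, total)
  if total == 0 then
    return 0
  end
  return part / total
end

-- Made-up readings for the three contrived cache states.
local hits, misses, revalidates = 50, 120, 30
local hits_ratio = ratio_of(hits, hits + misses + revalidates)   -- 50 / 200
```

With these readings hits_ratio is 0.25, and with all three readings at zero the helper returns 0 instead of an undefined value.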
Summing across a cluster¶
When running a Traffic Server cluster of multiple nodes, there are many metrics which are useful to see at both the node and cluster level. Dynamic metrics make it very easy to collect the metric’s value for every node in the cluster and return the sum, as seen here with cache connections:
counter 'proxy.cluster.http.cache_current_connections_count' [[
  return cluster('proxy.node.http.cache_current_connections_count')
]]
Further Reading¶
The following resources may be useful when writing dynamic metrics: