metrics.config

This configuration file defines dynamic metrics on Traffic Server activity. Metrics defined here are available through all the usual means of metrics reporting, including traffic_line and the Stats Over HTTP plugin.

Format

The configuration file itself is a Lua script. As with normal Lua code, comments begin with --, you may declare your own functions, and you may define global variables.

Metric Definitions

Metrics are defined by calling the supplied metric generator functions. There is one for each supported type, and their parameters are identical:

<typefn> '<name>' [[
  <metric generating function body>
]]

In practice, this will look like:

float 'proxy.node.useful_metric' [[
    return math.random()
]]

A real definition would, of course, have something more useful in the body of the metric generator. The string containing the metric generating function’s body (everything between [[ and ]], which is a multiline literal string in Lua) is stored and then evaluated as an anonymous function, which receives a single argument: the name of the metric (in the example above: proxy.node.useful_metric). If necessary, you can capture this argument using the ... vararg expression, which evaluates to the arguments passed to the enclosing function.
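The mechanism described above can be sketched in standalone Lua. This is only an illustration under stated assumptions: float_sketch and registry are hypothetical stand-ins, not Traffic Server's implementation. It relies on two facts about Lua itself: a call written as f 'a' [[b]] parses as f('a')([[b]]), and a chunk compiled from a string is a vararg function, so ... inside it holds the arguments passed when it is called.

```lua
-- Lua 5.1/LuaJIT compile strings with loadstring; Lua 5.2+ use load.
local compile = loadstring or load

-- Hypothetical registry of compiled metric bodies, keyed by metric name.
local registry = {}

-- Hypothetical stand-in for a generator function such as `float`.
-- It takes the name and returns a function that takes the body string.
local function float_sketch(name)
  return function(body)
    registry[name] = assert(compile(body))
  end
end

float_sketch 'proxy.node.useful_metric' [[
  local self = ...   -- the metric name arrives as the only argument
  return self
]]

-- Evaluate the stored body, passing the metric's own name as the argument.
local value = registry['proxy.node.useful_metric']('proxy.node.useful_metric')
print(value)  --> proxy.node.useful_metric
```

Capturing the name as a local called self is the same idiom used by the shipped definitions shown later on this page.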

Metric Types

float

A gauge style metric which will return floating point numbers. Floating point gauge metrics are appropriate for values which may increase or decrease arbitrarily (e.g. disk usage, cache hit ratios, average document sizes, and so on).

integer

A gauge style metric which will return integers. Integer gauge metrics are appropriate for values which may increase or decrease arbitrarily, and do not need any decimal components.

counter

A metric which supplies integer-only values, used almost exclusively to report the number of events of some kind that have occurred. Frequent uses are the number of requests served, responses by specific HTTP status codes, the number of failed DNS lookups, and so on.

Metric Scopes

All dynamic metrics, like their built-in counterparts, exist within a scope which determines whether they reflect the state of the current Traffic Server node, or the state of the entire Traffic Server cluster of which the current node is a member.

The scope of a metric is derived from its name. All metric names begin with proxy. followed by a scope component, which is either node. or cluster. as appropriate.

Thus, proxy.node.active_origin_connections might be used for the number of open connections to origin servers on just the current node, whereas proxy.cluster.active_origin_connections would be the counterpart for the total open connections to origin servers from all Traffic Server nodes in the cluster, including the current node. (Note that these names are contrived. You are advised both to pick as clear and detailed a metric name as possible and to ensure there is no conflict with existing metric names.)
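As a small illustration of the naming rule, the following standalone Lua sketch extracts the scope from a metric name. The helper scope_of is hypothetical and exists only for this example; it is not a function provided by metrics.config.

```lua
-- Hypothetical helper: returns 'node' or 'cluster' for scoped dynamic
-- metric names, and nil for anything else.
local function scope_of(name)
  -- Capture the component that follows the leading "proxy." prefix.
  local scope = name:match('^proxy%.(%a+)%.')
  if scope == 'node' or scope == 'cluster' then
    return scope
  end
  return nil
end

print(scope_of('proxy.node.active_origin_connections'))     --> node
print(scope_of('proxy.cluster.active_origin_connections'))  --> cluster
print(scope_of('proxy.process.http.incoming_requests'))     --> nil
```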

Support Functions

Several supporting functions are defined in the default configuration file. Existing dynamic metrics shipped with metrics.config make extensive use of these functions, and your own custom metrics may use them as well.

cluster(name)

Returns the sum of metric name for the entire cluster of which the current node is a member. Memoization is used to avoid additional cost from calling this function multiple times within a single metrics pass. The name must be a metric within the node scope.
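The memoization idea can be sketched in standalone Lua as follows. Everything here is hypothetical: node_values stands in for readings that the real function would gather from the cluster, and cluster_sketch and begin_pass are illustrative names, not the shipped implementation.

```lua
-- Hypothetical per-node readings for one node-scoped metric.
local node_values = {
  ['proxy.node.http.cache_current_connections_count'] = { 12, 7, 30 },
}

local fetches = 0  -- counts how often the sum is actually computed
local memo = {}    -- per-pass cache of sums, keyed by metric name

-- Start a new metrics pass by discarding the cached sums.
local function begin_pass()
  memo = {}
end

-- Sum a node-scoped metric across all nodes, at most once per pass.
local function cluster_sketch(name)
  local cached = memo[name]
  if cached ~= nil then
    return cached
  end
  fetches = fetches + 1
  local sum = 0
  for _, value in ipairs(node_values[name] or {}) do
    sum = sum + value
  end
  memo[name] = sum
  return sum
end

local name = 'proxy.node.http.cache_current_connections_count'
local first = cluster_sketch(name)
local second = cluster_sketch(name)  -- served from the cache
print(first, second, fetches)        -- 49, 49, and 1 fetch

begin_pass()
cluster_sketch(name)                 -- recomputed in the new pass
print(fetches)                       -- 2 fetches
```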

mbits(bytes)

Converts and returns bytes as megabits (bytes * 8 / 1000000).

mbytes(bytes)

Converts and returns bytes as mebibytes (bytes / (1024 * 1024)).
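The two formulas can be checked with a short standalone Lua sketch. The function names carry a _sketch suffix to make clear that they are illustrations of the stated arithmetic, not the definitions shipped in metrics.config. Note that, per the formulas above, the megabit conversion is decimal (10^6) while the mebibyte conversion is binary (2^20).

```lua
-- Megabits: eight bits per byte, one million bits per megabit.
local function mbits_sketch(bytes)
  return bytes * 8 / 1000000
end

-- Mebibytes: 1024 * 1024 bytes per mebibyte.
local function mbytes_sketch(bytes)
  return bytes / (1024 * 1024)
end

print(mbits_sketch(1000000))   -- 8 megabits
print(mbytes_sketch(1048576))  -- 1 mebibyte
```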

now()

Returns the current node’s time in milliseconds since the epoch.

rate_of(msec, key, fn)

Returns the rate of change over a period of msec milliseconds for the metric value of key (obtained by invoking the function fn).

This is accomplished by effectively snapshotting the value of the metric at the beginning and end of the given period expressed by msec, multiplying their difference by 1,000 and dividing that by msec.
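The arithmetic in the preceding paragraph can be worked through in standalone Lua. The helper per_second_rate is hypothetical and covers only the final calculation; the snapshotting that rate_of performs is not reproduced here.

```lua
-- Per-second rate of change between two snapshots taken msec apart.
local function per_second_rate(start_value, end_value, msec)
  return (end_value - start_value) * 1000 / msec
end

-- A counter that moves from 1200 to 1700 over 10 seconds (10000 ms)
-- changed by 500, which is 50 per second.
print(per_second_rate(1200, 1700, 10000))
```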

rate_of_10s(key, fn)

Returns the rate of change for the past 10 seconds for the metric key, as calculated by function fn. This function simply wraps rate_of and supplies an msec value of 10 * 1000.
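When a period other than 10 seconds is wanted, rate_of can presumably be called directly with its documented signature. The fragment below is a sketch for metrics.config: the metric name proxy.node.dns.lookups_per_second_60s is invented for this example, and the 60 second period is an arbitrary choice.

```lua
-- Hypothetical metric name; uses rate_of(msec, key, fn) as documented above.
integer 'proxy.node.dns.lookups_per_second_60s' [[
  local self = ...

  return rate_of(60 * 1000, self,
    function() return proxy.process.dns.total_dns_lookups end
  )
]]
```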

Definition Examples

For illustrative purposes, a few of the dynamic metric definitions you may find in your Traffic Server installation’s default metrics.config are explained here. The actual file will contain many more definitions, and of course you may add your own, as well.

Returning a single value

The simplest example is a dynamic node metric which does nothing but return the current value for an underlying process metric:

counter 'proxy.node.http.user_agents_total_documents_served' [[
  return proxy.process.http.incoming_requests
]]

This uses the built-in function counter, which takes two parameters: the name of the dynamic metric to create followed by the function used to calculate the value. In this case, the function body is just a return of the named, underlying process statistic. No calculations, aggregates, or other processing are performed.

Returning a rate-of-change

Slightly more involved than just returning a point-in-time value from a given statistic is calculating the rate of change:

integer 'proxy.node.dns.lookups_per_second' [[
  local self = ...

  return rate_of_10s(self,
    function() return proxy.process.dns.total_dns_lookups end
  )
]]

As in the previous example, we are returning another metric’s value, but here we do so inside a function that we pass to rate_of_10s. As explained earlier, rate_of_10s wraps rate_of, which tracks the given metric over a specific interval and returns the average per-second rate of change. It obtains the values used to calculate this rate by invoking the function passed to it.

Calculating a rate-of-change’s delta

A more complicated example involves calculating the variance in the rate of change of an underlying statistic over a given period of time. This is neither an average of the statistic nor the raw delta between two samplings of it. It is not suitable for finding out how many times an event has occurred, but it is useful for finding out how erratic or unstable the frequency of that event is.

In other words, a large absolute value indicates a deviation from the usual pattern of behavior or activity. For example, if your Traffic Server cache (using the example dynamic metric function below) sees between 10,000 and 10,250 HostDB hits every 10 seconds, the value returned by this metric will remain fairly small. If 50,000 hits suddenly reach HostDB within that same averaging interval, the value will increase significantly. This could then be used to trigger alerts that something might be wrong with HostDB lookups on the Traffic Server cluster.

integer 'proxy.node.hostdb.total_hits_avg_10s' [[
  local self = ...

  return interval_delta_of_10s(self,
    function() return proxy.process.hostdb.total_hits end
  )
]]

The catch is that if the dramatic increase is actually the new norm, the metric will return to emitting small absolute values again, even though the statistic underneath is now consistently and significantly higher or lower than it used to be. If what you are trying to measure is the stability of a metric, though, that is a good thing in the long term.
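The behavior described above can be illustrated with a standalone Lua sketch. It assumes, as one reading of the text, that the reported value is the difference between the counts seen in consecutive intervals; the internals of interval_delta_of_10s are not described on this page, and the sample numbers and the interval_deltas helper are hypothetical.

```lua
-- Hypothetical HostDB hit counts observed in successive 10-second intervals.
local hits_per_interval = { 10000, 10250, 10100, 50000, 50000, 50200 }

-- Difference between each interval's count and the one before it.
local function interval_deltas(samples)
  local deltas = {}
  for i = 2, #samples do
    deltas[#deltas + 1] = samples[i] - samples[i - 1]
  end
  return deltas
end

local deltas = interval_deltas(hits_per_interval)
print(table.concat(deltas, ', '))  --> 250, -150, 39900, 0, 200
```

The spike appears once, as 39900, and the value falls back to small numbers as soon as 50,000 hits per interval becomes the norm.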

Converting a metric to a ratio

Using a very simplified version of the Traffic Server cache hit reporting, we can demonstrate taking a metric which expresses the occurrence of one type of event within a set of possibilities and converting its absolute value into a ratio of that set’s total.

In this example, we assume we have three cache hit states (misses, hits, and revalidates) and that they are tracked in the metrics proxy.node.cache.<state>. These are not the real metric names in Traffic Server, and there are much finer-grained reporting states available, but we’ll use these for brevity.

float 'proxy.node.cache.hits_ratio' [[
  return
    proxy.node.cache.hits /
    ( proxy.node.cache.hits +
      proxy.node.cache.misses +
      proxy.node.cache.revalidates
    )
]]
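One case the simplified definition above does not handle is a zero total, for example before any cache lookups have been counted. In Lua, 0 / 0 evaluates to NaN rather than raising an error, so a guard may be worthwhile. The helper safe_ratio below is hypothetical, and returning 0 for an empty total is just one possible choice.

```lua
-- Hypothetical helper, declared as a global so that code compiled from a
-- string (as metric bodies are) could see it; locals would not be visible.
function safe_ratio(part, total)
  if total == 0 then
    return 0  -- nothing counted yet; avoid 0 / 0, which is NaN in Lua
  end
  return part / total
end

print(safe_ratio(75, 100))  -- 0.75
print(safe_ratio(0, 0))     -- 0
```

A metric body could then pass the hit count and the sum of the three states to such a helper instead of dividing directly, assuming the helper is declared in the same configuration file.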

Summing across a cluster

When running a Traffic Server cluster of multiple nodes, there are many metrics which are useful to see at both the node and cluster level. Dynamic metrics make it very easy to collect the metric’s value for every node in the cluster and return the sum, as seen here with cache connections:

counter 'proxy.cluster.http.cache_current_connections_count' [[
  return cluster('proxy.node.http.cache_current_connections_count')
]]

Further Reading

The following resources may be useful when writing dynamic metrics: