Metrics 2.0: a proposal

  • Graphite's metrics are strings composed of dot-separated nodes which, due to their ordering, can be represented as a tree. Many other places use a similar format (stats in /proc, etc).
  • OpenTSDB's metrics are shorter, because they move some of the dimensions (server, etc) into key-value tags.
I think we can do better...
I think our metrics format is restrictive and we do ourselves a disservice by using it:
  • the metrics are not fully self-describing (they need additional information to become useful for interpretation, e.g. a unit, and information about what they mean)
  • they impose needless limitations (no dots allowed, strict ordering)
  • aggregators have no way to systematically and consistently express the operation they performed
  • we can't include additional information because that would create a new metric ID
  • we run into scale/display/correctness problems because storage rollups, rendering consolidation, and aggregation API functions all need to know how the metric is to be aggregated, and this information is often lacking or incorrect.
  • there's no consistency in what information goes where or what it's called (which field is the one that has the hostname?)
We can solve all of this. In an elegant way, even!

Over the past year, I've been working a lot on a Graphite dashboard (Graph-Explorer). It's fairly unique in the sense that it leverages what I call "structured metrics" to facilitate a powerful way to build graphs dynamically (see also "a few common graphite problems and how they are already solved" and the Graph-Explorer homepage). I implemented two solutions (structured_metrics and carbon-tagger) for creating, transporting and using these metrics, on which dashboards like Graph-Explorer can rely. The former converts Graphite metrics into a set of key-value pairs (this happens "after the fact", so it's a little annoying). The latter uses a prototype extended carbon (Graphite) protocol (called "proto2") to maintain an Elasticsearch tags database on the fly, but the format proved too restrictive.

Now for Graphite-NG I wanted to rethink the metric, taking the simplicity of Graphite, the ideas of structured_metrics and carbon-tagger (but less annoying) and OpenTSDB (but more powerful), and propose a new way to identify and use metrics and organize tags around them.

The proposed ingestion (carbon etc) protocol looks like this:

  • intrinsic_tags and extrinsic_tags are space-separated lists of strings. Each string is either a regular value or a key=value pair, and describes one dimension (aspect) of the metric.
    • an intrinsic tag contributes to the identity of the metric. If this section changes, we get a new metric.
    • an extrinsic tag provides additional information about the metric. Changes in this set don't change the metric identity.
    Internally, the metric ID is nothing more than the string of intrinsic tags you provided (similar to Graphite style). When defining a new metric, write down the intrinsic tags/nodes (like you would do with Graphite except you can use any order), and then keep 'em in that order (to keep the same ID). The ordering does not affect your ability to work with the metrics in any way. The backend indexes the tags and you'll usually rely on those to work with the metrics, and rarely with the ID itself.
  • A metric in its basic form (without extrinsic tags) would look like:
    graphite: stats.gauges.db15.mysql.bytes_received
    opentsdb: mysql.bytes_received host=db15
    proposal: service=mysql server=db15 direction=in unit=B
  • key=val tags are the most useful: the key names the aspect, which lets you filter on it and express "I want to see my metrics averaged/summed/.. by this aspect, in different graphs based on another aspect, etc" (e.g. GEQL statements). So I highly recommend taking some time to come up with a good key for every tag. However, sometimes it can be hard to come up with a concise but descriptive key for a tag, hence keys are not mandatory. For regular words without a key, the backend will assign dummy keys ('n1', 'n2', etc) to facilitate those features without hassle. Note the two spaces between intrinsic and extrinsic. With extrinsic tags the example could look like:
    service=mysql server=db15 direction=in unit=B  src=diamond processed_by_statsd env=prod
    to mean that this metric came from diamond, went through statsd, and that the machine is currently in the prod environment
  • Tags can contain any character except whitespace and null. Specifically: dots (great for metrics that contain an ip, histogram bin, a percentile, etc) and slashes (unit can be 'B/s' too)
  • the unit tag is mandatory. It allows dashboards to show the proper label on the Y-axis, and to do conversions (for example in Graph-Explorer, if your metric is an amount of B used on a disk drive and you request the increase in GB per day, it will automatically convert (and derive) the data). We should aim for standardization of units. I maintain a table of standardized units & prefixes which uses SI and IEC as a starting point, and extends it with units commonly used in IT.
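
To make the rules above concrete, here is a minimal sketch in Python of how a backend might take the tag portion of a metric apart. It is not the Graphite-NG implementation. The function name parse_tags is hypothetical, and so are two choices the text leaves open: that dummy keys are numbered across both tag sets, and that the mandatory unit tag may sit in either set.

```python
def parse_tags(metric):
    """Return (metric_id, intrinsic_tags, extrinsic_tags) for one metric string."""
    # Two consecutive spaces separate the intrinsic from the extrinsic tags.
    intrinsic_str, _, extrinsic_str = metric.strip().partition("  ")
    counter = 0  # numbers the dummy keys n1, n2, ...

    def to_dict(tag_str):
        nonlocal counter
        tags = {}
        for tag in tag_str.split():
            key, sep, value = tag.partition("=")
            if not sep:
                # Keyless word: give it a dummy key (assumed numbering scheme).
                counter += 1
                key, value = f"n{counter}", tag
            tags[key] = value
        return tags

    intrinsic, extrinsic = to_dict(intrinsic_str), to_dict(extrinsic_str)
    # Assumption: the unit tag is accepted in either set.
    if "unit" not in intrinsic and "unit" not in extrinsic:
        raise ValueError("the unit tag is mandatory")
    # The metric ID is simply the intrinsic tag string as it was written.
    return intrinsic_str, intrinsic, extrinsic


metric_id, intrinsic, extrinsic = parse_tags(
    "service=mysql server=db15 direction=in unit=B  src=diamond processed_by_statsd env=prod"
)
print(metric_id)   # service=mysql server=db15 direction=in unit=B
print(extrinsic)   # {'src': 'diamond', 'n1': 'processed_by_statsd', 'env': 'prod'}
```

Because the ID is the intrinsic string itself, adding or dropping extrinsic tags leaves it untouched, while reordering the intrinsic tags would produce a different ID.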

Further thoughts/comparisons


  • The concept of extrinsic tags is something I haven't seen before, but I think it makes a lot of sense because we often want to pass extra information but can't, since it would create a new metric. It also makes the metrics more self-describing.
  • Sometimes it makes sense for tags to apply to a metric only in a certain time frame. For intrinsic tags this is already the case by design; for extrinsic tags the database could maintain a from/to timestamp based on when it saw the tag for the given metric.
  • metric finding: besides the obvious left-to-right auto-complete and pattern matching (which allows searching for substrings in any order), we can now also build an interface that uses facet searches to suggest/auto-complete tags, and filter by toggling on/off tags.
  • Daemons that sit in the pipeline and yield aggregated/computed metrics can do this in a much more useful way. For example a statsd daemon that computes a rate per second for metrics with 'unit=B' can yield a metric with 'unit=B/s'.
  • We can standardize tag keys and values, other than just the unit tag. Beyond the obvious compatibility benefits between tools, imagine querying for:
    • 'unit=(Err|Warn)' and getting all errors and warnings across the entire infrastructure (no matter which tool generated the metric), and grouping/drilling down by tags
    • '$hostname direction=in' and seeing everything coming in to the server, network traffic on the NIC, as well as files being uploaded.
    Also, metrics that are summary statistics (e.g. those produced by statsd) will get intrinsic tags like 'stat=median' or 'stat=upper_90'. This has three fantastic consequences:
    • aggregation (rollups from higher to lower resolution) knows what to do without configuring aggregation functions, because it can be deduced from the metric itself
    • renderers that have to render more than one datapoint per pixel will produce more accurate, relevant graphs because they can deduce what the metric is meant to represent
    • API functions such as "cumulative", "summarize" and "smartSummarize" don't need to be configured with an explicit aggregation function
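
As a rough illustration of that deduction, the sketch below picks a rollup function from a metric's 'stat' tag. The mapping table and the names ROLLUP_BY_STAT, rollup_function and rollup are my own assumptions for the sake of the example. The proposal doesn't prescribe which function belongs to which tag value, and falling back to the mean is just one plausible default.

```python
from statistics import mean

# Assumed mapping from the family of a 'stat' tag to a rollup function.
# Illustrative only; the proposal does not define this table.
ROLLUP_BY_STAT = {
    "min": min, "lower": min,
    "max": max, "upper": max,
    "sum": sum, "count": sum,
    "mean": mean,
}


def rollup_function(tags):
    # 'upper_90' belongs to the 'upper' family, 'mean_90' to 'mean', etc.
    family = tags.get("stat", "").split("_")[0]
    # Unknown or missing stat tag: fall back to averaging (an assumption).
    return ROLLUP_BY_STAT.get(family, mean)


def rollup(tags, points):
    """Consolidate several datapoints into one, guided only by the tags."""
    return rollup_function(tags)(points)


print(rollup({"stat": "upper_90", "unit": "ms"}, [12, 48, 30]))  # 48
```

The same lookup could serve storage rollups, the renderer and the API functions, so the choice is made once, from the metric itself, instead of in three separate configurations.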

From the Graphite perspective, specifically

  • dots and slashes are allowed
  • The tree model is slow to navigate, sometimes makes it hard to find your metrics, and makes it really hard to write queries (target statements) that need metrics in different locations of the tree (because the only way to retrieve multiple metrics in a target is wildcards)
  • The tree model causes people to obsess over node ordering to find the optimal tree organization, but no ordering allows all query use cases anyway, so you'll be limited no matter how much time you spend organising metrics.
  • We do away with the tree entirely. A multi-dimensional tag database is way more powerful and allows for great "metric finding" (see above)
  • Graphite has no tags support
  • you don't need to configure aggregation functions anymore, less room for errors ("help, my scale is wrong when i zoom out"), better rendering
  • when using statsd, you don't need prefixes like "stats.". In fact that whole prefix/postfix/namespacing thing becomes moot
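
To give a feel for what querying a tag database instead of a tree could look like, here is a toy in-memory version in Python. Everything in it is hypothetical: the INDEX contents, the find_metrics helper, and the choice to treat each query value as a regular expression that must match the whole tag value. A real backend would use something like Elasticsearch rather than a dict.

```python
import re

# Hypothetical tag index: metric ID -> its tags (made-up example data).
INDEX = {
    "service=mysql server=db15 direction=in unit=B": {
        "service": "mysql", "server": "db15", "direction": "in", "unit": "B",
    },
    "service=apache server=web1 unit=Err": {
        "service": "apache", "server": "web1", "unit": "Err",
    },
    "service=cron server=db15 unit=Warn": {
        "service": "cron", "server": "db15", "unit": "Warn",
    },
}


def find_metrics(index, **patterns):
    """Return the IDs of metrics whose tags match every key=regex pattern."""
    return [
        metric_id
        for metric_id, tags in index.items()
        if all(
            key in tags and re.fullmatch(pattern, tags[key])
            for key, pattern in patterns.items()
        )
    ]


# All errors and warnings, no matter which service or server produced them.
print(find_metrics(INDEX, unit="(Err|Warn)"))
# Everything coming in to db15.
print(find_metrics(INDEX, server="db15", direction="in"))
```

Note that neither query depends on where a tag sits in the ID, which is exactly what the tree model can't offer.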

From the OpenTSDB perspective, specifically

  • allow dots anywhere
  • 'http.hits' becomes 'http unit=Req' (or 'unit=Req http', as long as you pick one and stick with it)
  • probably more, I'm not very familiar with it

From the structured_metrics/carbon-tagger perspective, specifically

  • not every tag requires a key (on metric input), but you still get the same benefits
  • You're not forced to order the tags in any way
  • sometimes relying on the 'plugin' tag is convenient but it's not really intrinsic to the metric; now we can use it as an extrinsic tag

Backwards compatibility

  • Sometimes you merely want the ability to copy a "metric id" from the app source code, and paste it in a "/render/?target=" url to see that particular metric. You can still do this: copy the intrinsic tags string and you have the id.
  • if you wanted to, you could easily stay compatible with the official Graphite protocol: for incoming metrics, add a 'unit=Unknown' tag and optionally turn the dots into spaces (so that every node becomes a tag). That way you can mix old-style and new-style metrics in the same system.
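
That conversion is small enough to sketch in a few lines of Python. The function name from_graphite and its split_nodes flag are made up for this example; only the 'unit=Unknown' tag and the dots-to-spaces step come from the description above.

```python
def from_graphite(name, split_nodes=True):
    """Turn an old-style Graphite metric name into a new-style intrinsic tag string."""
    # Optionally turn every dot-separated node into its own (keyless) tag.
    tags = name.replace(".", " ") if split_nodes else name
    # Old-style metrics carry no unit, so mark it as unknown.
    return f"{tags} unit=Unknown"


print(from_graphite("stats.gauges.db15.mysql.bytes_received"))
# stats gauges db15 mysql bytes_received unit=Unknown
print(from_graphite("stats.gauges.db15.mysql.bytes_received", split_nodes=False))
# stats.gauges.db15.mysql.bytes_received unit=Unknown
```

With split_nodes on, the backend would then hand out dummy keys to the keyless nodes, so even legacy metrics become searchable by tag.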