Metrics 2.0: a proposal

  • Graphite's metrics are strings comprised of dot-separated nodes which, due to their ordering, can be represented as a tree. Many other places use a similar format (stats in /proc etc).
  • OpenTSDB's metrics are shorter, because they move some of the dimensions (server, etc) into key-value tags.
I think we can do better...
I think our metrics format is restrictive and we do ourselves a disservice by using it:
  • the metrics are not fully self-describing (they need additional information to become useful for interpretation, e.g. a unit, and information about what they mean)
  • they impose needless limitations (no dots allowed, strict ordering)
  • aggregators have no way to systematically and consistently express the operation they performed
  • we can't include additional information because that would create a new metric ID
  • we run into scale/display/correctness problems because storage rollups, rendering consolidation, and aggregation API functions all need to know how the metric is to be aggregated, and this information is often lacking or incorrect.
  • there's no consistency in what information goes where or what it's called (which field is the one that has the hostname?)
We can solve all of this. In an elegant way, even!

Over the past year, I've been working a lot on a Graphite dashboard (Graph-Explorer). It's fairly unique in that it leverages what I call "structured metrics" to build graphs dynamically (see also "a few common graphite problems and how they are already solved" and the Graph-Explorer homepage). I implemented two solutions (structured_metrics and carbon-tagger) for creating, transporting and using these metrics, on which dashboards like Graph-Explorer can rely. The former converts Graphite metrics into a set of key-value pairs (this happens "after the fact", so it's a little annoying). The latter uses a prototype extended carbon (Graphite) protocol (called "proto2") to maintain an Elasticsearch tags database on the fly, but the format proved too restrictive.

Now for Graphite-NG I wanted to rethink the metric, taking the simplicity of Graphite, the ideas of structured_metrics and carbon-tagger (but less annoying) and OpenTSDB (but more powerful), and propose a new way to identify and use metrics and organize tags around them.

The proposed ingestion (carbon etc) protocol looks like this:

<intrinsic_tags>  <extrinsic_tags> <value> <timestamp>
  • intrinsic_tags and extrinsic_tags are space-separated strings. Each string is either a regular value or a key=value pair, and describes one dimension (aspect) of the metric.
    • an intrinsic tag contributes to the identity of the metric. If this section changes, we get a new metric
    • an extrinsic tag provides additional information about the metric. Changes in this set don't change the metric identity
    Internally, the metric ID is nothing more than the string of intrinsic tags you provided (similar to Graphite style). When defining a new metric, write down the intrinsic tags/nodes (like you would do with Graphite except you can use any order), and then keep 'em in that order (to keep the same ID). The ordering does not affect your ability to work with the metrics in any way. The backend indexes the tags and you'll usually rely on those to work with the metrics, and rarely with the ID itself.
  • A metric in its basic form (without extrinsic tags) would look like:
    graphite: stats.gauges.db15.mysql.bytes_received
    opentsdb: mysql.bytes_received host=db15
    proposal: service=mysql server=db15 direction=in unit=B
    
  • key=val tags are the most useful: the key names the aspect, which lets you express "I want to see my metrics averaged/summed/.. by this aspect, in different graphs based on another aspect, etc" (e.g. GEQL statements) and lets you filter on it. I highly recommend taking some time to come up with a good key for every tag. However, it can sometimes be hard to come up with a concise but descriptive key, hence keys are not mandatory. For regular words without a key, the backend will assign dummy keys ('n1', 'n2', etc) to facilitate those features without hassle. Note the two spaces between intrinsic and extrinsic. With extrinsic tags the example could look like:
    service=mysql server=db15 direction=in unit=B  src=diamond processed_by_statsd env=prod
    
  • Tags can contain any character except whitespace and null. In particular, dots (great for metrics that contain an IP, a histogram bin, a percentile, etc) and slashes (the unit can be 'B/s' too) are allowed.
  • The unit tag is mandatory. It allows dashboards to show the proper label on the Y-axis and to do conversions (for example, in Graph-Explorer, if your metric is an amount of B used on a disk drive and you request the increase in GB per day, it will automatically convert (and derive) the data). We should aim for standardization of units. I maintain a table of standardized units & prefixes which uses SI and IEC as a starting point, and extends it with units commonly used in IT.
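
To make the format above concrete, here is a minimal parsing sketch in Python. It is not part of any existing carbon implementation. The function names (parse_tags, parse_line) are made up for illustration, and two details are my assumptions rather than settled parts of the proposal: dummy keys are numbered separately within each section, and the mandatory unit tag is looked for among the intrinsic tags (as in the examples).

```python
def parse_tags(section):
    """Turn a space-separated tag string into a dict.

    Keyless tags get dummy keys n1, n2, ... (numbered per section: an assumption).
    """
    tags = {}
    keyless = 0
    for tag in section.split():
        key, sep, val = tag.partition("=")
        if not sep:
            # a regular word without a key
            keyless += 1
            key, val = "n%d" % keyless, tag
        tags[key] = val
    return tags


def parse_line(line):
    """Parse '<intrinsic_tags>  <extrinsic_tags> <value> <timestamp>'."""
    # value and timestamp are the last two whitespace-separated fields
    tags_part, value, timestamp = line.rsplit(None, 2)
    # two consecutive spaces separate intrinsic from extrinsic tags
    intrinsic_str, _, extrinsic_str = tags_part.partition("  ")
    intrinsic = parse_tags(intrinsic_str)
    if "unit" not in intrinsic:
        raise ValueError("the unit tag is mandatory")
    return {
        "id": intrinsic_str,  # the metric ID is the intrinsic string as given
        "intrinsic": intrinsic,
        "extrinsic": parse_tags(extrinsic_str),
        "value": float(value),
        "timestamp": int(timestamp),
    }
```

Feeding it the example line with extrinsic tags, followed by a value and a timestamp, yields the intrinsic string as the ID and 'processed_by_statsd' stored under the dummy key 'n1'.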

Further thoughts/comparisons

General

  • The concept of extrinsic tags is something I haven't seen before, but I think it makes a lot of sense because we often want to pass extra information but we couldn't because it would create a new metric. It also makes the metrics more self-describing.
  • Sometimes it makes sense for tags only to apply to a metric in a certain time frame. For intrinsic tags this is already the case by design, for extrinsic tags the database could maintain a from/to timestamp based on when it saw the tag for the given metric
  • metric finding: besides the obvious left-to-right auto-complete and pattern matching (which allows searching for substrings in any order), we can now also build an interface that uses facet searches to suggest/auto-complete tags, and filter by toggling on/off tags.
  • Daemons that sit in the pipeline and yield aggregated/computed metrics can do this in a much more useful way. For example a statsd daemon that computes a rate per second for metrics with 'unit=B' can yield a metric with 'unit=B/s'.
  • We can standardize tag keys and values, other than just the unit tag. Beyond the obvious compatibility benefits between tools, imagine querying for:
    • 'unit=(Err|Warn)' and getting all errors and warnings across the entire infrastructure (no matter which tool generated the metric), and grouping/drilling down by tags
    • '$hostname direction=in' and seeing everything coming in to the server, network traffic on the NIC, as well as files being uploaded.
    Also, metrics that are summary statistics (i.e. with statsd) will get intrinsic tags like 'stat=median' or 'stat=upper_90'. This has three fantastic consequences:
    • aggregation (rollups from higher to lower resolution) knows what to do without configuring aggregation functions, because it can be deduced from the metric itself
    • renderers that have to render >1 datapoints per pixel, will produce more accurate, relevant graphs because they can deduce what the metric is meant to represent
    • API functions such as "cumulative", "summarize" and "smartSummarize" don't need to be configured with an explicit aggregation function
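
As a rough illustration of how a rollup step could deduce its function from the tags, consider the Python sketch below. The rule table is entirely hypothetical (the proposal does not define which 'stat' values map to which function), and falling back to an average is just a placeholder default.

```python
# Hypothetical rules: prefix of the 'stat' tag value -> rollup function name.
STAT_RULES = [
    ("upper", "max"), ("max", "max"),
    ("lower", "min"), ("min", "min"),
    ("sum", "sum"), ("count", "sum"),
]

FUNCS = {
    "max": max,
    "min": min,
    "sum": sum,
    "avg": lambda points: sum(points) / len(points),
}


def consolidation_for(tags):
    """Pick a rollup function name from a metric's intrinsic tags."""
    stat = tags.get("stat", "")
    for prefix, func_name in STAT_RULES:
        if stat.startswith(prefix):
            return func_name
    return "avg"  # placeholder default when nothing more specific is known


def rollup(tags, points):
    """Consolidate several datapoints into one, based on the tags alone."""
    return FUNCS[consolidation_for(tags)](points)
```

With rules like these, a metric tagged 'stat=upper_90' keeps its peaks when rolled up to a lower resolution, without anyone having to configure that per metric.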

From the Graphite perspective, specifically

  • dots and slashes are allowed
  • The tree model is slow to navigate, makes it hard to find your metrics at times, and makes it really hard to write queries (target statements) that need metrics in different locations of the tree (because the only way to retrieve multiple metrics in a target is wildcards)
  • The tree model causes people to obsess over node ordering to find the optimal tree organization, but no ordering allows all query use cases anyway, so you'll be limited no matter how much time you spend organising metrics.
  • We do away with the tree entirely. A multi-dimensional tag database is way more powerful and allows for great "metric finding" (see above)
  • Graphite has no tags support
  • you don't need to configure aggregation functions anymore, less room for errors ("help, my scale is wrong when i zoom out"), better rendering
  • when using statsd, you don't need prefixes like "stats.". In fact that whole prefix/postfix/namespacing thing becomes moot
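
The tag database that replaces the tree can be pictured as an inverted index from tags to metric IDs. The Python sketch below is a toy in-memory version, assuming tags have already been split into key/value pairs. The class name TagIndex and its methods are invented for illustration, and a real backend (for example a search engine) would look quite different.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: (key, value) -> set of metric IDs."""

    def __init__(self):
        self._by_tag = defaultdict(set)  # (key, value) -> metric IDs
        self._by_key = defaultdict(set)  # key -> values seen for that key

    def add(self, metric_id, tags):
        for key, val in tags.items():
            self._by_tag[(key, val)].add(metric_id)
            self._by_key[key].add(val)

    def find(self, **wanted):
        """Return the IDs of metrics that carry all the requested tags."""
        sets = [self._by_tag.get((k, v), set()) for k, v in wanted.items()]
        return set.intersection(*sets) if sets else set()

    def facet(self, key):
        """List the values seen for a key, e.g. to suggest or toggle filters."""
        return sorted(self._by_key.get(key, ()))
```

A lookup such as find(service="mysql", direction="in") returns matching metrics regardless of the order in which their tags were written, which is exactly what the tree cannot do.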

From the OpenTSDB perspective, specifically

  • allow dots anywhere
  • 'http.hits' becomes 'http unit=Req' (or 'unit=Req http', as long as you pick one and stick with it)
  • probably more, I'm not very familiar with it

From the structured_metrics/carbon-tagger perspective, specifically

  • not every tag requires a key (on metric input), but you still get the same benefits
  • You're not forced to order the tags in any way
  • sometimes relying on the 'plugin' tag is convenient but it's not really intrinsic to the metric, now we can use it as extrinsic tag

Backwards compatibility

  • Sometimes you merely want the ability to copy a "metric id" from the app source code, and paste it in a "/render/?target=" url to see that particular metric. You can still do this: copy the intrinsic tags string and you have the id.
  • if you wanted to, you could easily stay compatible with the official graphite protocol: for incoming metrics, add a 'unit=Unknown' tag and optionally turn the dots into spaces (so that every node becomes a tag), so you can mix old-style and new style metrics in the same system.
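
The compatibility shim described in the last bullet could look roughly like this in Python. It assumes the incoming line follows Graphite's plaintext protocol ('<metric path> <value> <timestamp>'). The function name from_graphite and the split_nodes flag are hypothetical.

```python
def from_graphite(line, split_nodes=True):
    """Convert '<dotted.name> <value> <timestamp>' into the proposed format.

    A 'unit=Unknown' tag is added because the old format carries no unit.
    With split_nodes=True every dot-separated node becomes a keyless tag.
    """
    name, value, timestamp = line.split()
    tags = name.replace(".", " ") if split_nodes else name
    return "%s unit=Unknown %s %s" % (tags, value, timestamp)
```

Leaving split_nodes off keeps the whole dotted name as a single keyless tag, which is allowed because tags may contain dots.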

Comments

Well thought. I agree those concepts would greatly improve over what
we already have. The kv tags concept (and intrinsic/extrinsic distinction)
would magnify graphite-like aggregation capabilities by es-like faceted
search and discoverability, and I can imagine how awesome the result would be.
The combined power of Kibana with Graph-Explorer :).

Like the "unit" tag, a property I have often found missing -while it
should really be mandatory- from tools in the carbon ecosystem is the metric
"type" (yes, this name sucks) as specified by RRD:
  http://oss.oetiker.ch/rrdtool/doc/rrdcreate.en.html#IGAUGE

"gauge" is for absolute values that can increase/decrease (like
temperatures, current speed, etc), "counter" for values that increase
until reset, "derive" for values that never overflow...

Even though this property may influence rendering decisions, I think it's
very "core" (or "intrinsic") to the metric, in the same way the "unit" is.
Say, two "memory_usage" metrics are very different beasts if one is a gauge
representing the percentage of system memory used, and the other is a derive
(ie. representing an absolute value). Or a request metric, vs. a request per
second metric. And that's not so much redundant with the "unit" property;
for instance, kernel snmp counters (as seen in interfaces bytes in/out)
are 32-bit integers and reset when they overflow (hence the
"counter" vs. "derive"). Also, when a server reboots and its kernel stats are
zeroed, this shouldn't mean "we un-sent 42GB of previously accounted data" ;).

Is using other separators (like tabs, or \n) considered? Disallowing spaces
in vals would make metrics names -and even more so description strings- a lot
less pretty ;). The "extrinsic" distinction may encourage sending x and y
labels and metric descriptions (maybe), but without spaces...

"Tags can contain any character except whitespace and null." : I would also
forbid "=" signs, to prevent ambiguities.

This somehow raises the questions: how to keep the metric storage compact
(as with column-oriented dbs: avoiding duplicating the "column names" at
every value added)? And how to express the added search capabilities using
tags, while keeping the consistent graphite-like urls ?

Out of interest :
https://github.com/ganglia/monitor-core/blob/master/gmond/dtd.h
https://collectd.org/wiki/index.php/Value_list
http://munin-monitoring.org/wiki/fieldname
http://www.cacti.net/downloads/docs/html/data_input_methods.html
http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN200

Hi Ben,
that's a great comment, you cover a lot of the same things I've been pondering before/when writing the proposal down.

As you may know, Graph-Explorer, structured_metrics etc. currently require a 'target_type' tag, which is the metric type (see https://github.com/vimeo/graph-explorer/wiki/Consistent-tag-keys-and-values).  However, over the last year I've been realizing that metric types describe multiple characteristics of a metric and partially overlap with other tags (such as unit, e.g. unit=foo/s means it's a rate).  I have my own notes about "should I use a metric type tag or not?" with pros and cons, and different use cases to find out how to solve them with and without tags.  I will upload those at some point.  I'm still working on this, but I expect to reach a solution soon.

X labels for timeseries metrics should IMHO always be "time (in seconds)", and I believe Y labels can be generated from the information in other tags, and should not be dictated by the metric, but rather by the graphing context.  For example, usually I would show unit and type on the Y label, but if other graphs on the same page show metrics of the same unit or the same type, that becomes redundant.  Or maybe you have two graphs, one for unit=B direction=in and one for unit=B direction=out; in that case it makes sense to make the page title "stuff in bytes", and the Y-labels "in" and "out".
Graph-Explorer automatically does all these things by looking at what graphs have in common, what separates them, what the metrics of the same graph have in common, etc.

I did (briefly) think about metric descriptions, and I agree that for those a tab separator would make sense so that they can contain spaces.  However, I couldn't come up with an example where having spaces in the actual tags would be useful (note that it would also get annoying when querying, in the same way that doing commandline stuff on files with spaces in their names is annoying).  So I'm thinking of keeping spaces in the format as it is now, and when we want to add a description, we can use a tab to separate it from the rest.

Good point about the '=', that's something I should add.

As for expressing the new search system over the graph API, this could be done by using a query rather than a wildcard string (i.e. instead of "sum(stats.gauges.mysql.*.bytes_received)" we can do "sum(service=mysql unit=B direction=in)"), although I'm not convinced yet this is a good idea.  Graph-Explorer builds the list of matching targets for a query, and then queries Graphite with a target string that contains all those metrics.  It may be a good idea to keep the Graphite API low level and query for all targets explicitly, because in the dashboard UI you, as a user, want to see which metrics are included; otherwise that would be hidden behind the Graphite-NG API.

As for metrics storage, remember the database only contains the metric "names" (IDs) and their intrinsic and extrinsic tags, not a new entry for every new datapoint.  I think even with millions of metrics this won't really be a problem (e.g. say a metric consumes 200B and you have a million, that's still only 200MB of space).  I would store them all in Elasticsearch, which I expect has some internal optimisations.  All solutions I wrote (structured_metrics and Graph-Explorer) store their stuff in ES right now; at Vimeo we have about 300k metrics and it handles this very well.  Queries respond very quickly etc.

What are your thoughts on using the extrinsic tags to provide a relationship from metrics to objects?  I have been looking for the best practice for tying metrics to the objects that are responsible for generating them.

e.g.  Tracking response time of queries that occurred in a database

In graphite you could have some tree such as
database.queries.<id>.responseTime

But now you have bloated the number of metrics being generated as the ID is dynamically generated.  Also 90% of the time you would wildcard out the ID because you want to see the aggregation.  The same object could also have contributed to dozens of metric entries.  If I had the IDs available to me that contributed to a graph, I could then go query the audit system and find all the details of those queries so I can see if there are patterns in the columns / tables used.  I have seen examples also of people putting things like the column / table in the metric path, but again I feel that is just bloating the metric definitions.  Plus I would never be able to keep something such as the SQL statement for the object in the definition of the metric.  Doesn't feel right to push that information into the metric definition.

What I really want is the ability to log a metric, provide some information with it (such as the extrinsic tags here) and then have that information be usable for lookups on detail pages.  Drilling down from a metrics graph to an object viewer.

I may be way off here; I'm new to Graphite and to doing metrics this way.  In the past I did metrics in a data warehouse and would have kept all this information in one place, one very structured place :) but that was a different environment that dictated that need.

Hi Michael

* any information that makes a metric more clear or helps in debugging should be welcomed, it's just a matter of balancing that with the resources available. (but the balance is way off -- too little information -- in the systems i've seen so far)

* I'm not sure exactly what you mean with objects and whether the metric identity should remain the same for different objects, though it looks like the meaning of a metric really changes for different objects, and so is more of an intrinsic property that is part of the metric identity.
The main problem in this case is just that in Graphite the cost of every single metric is so high (it allocates an entire whisper file). With a proper metrics database (even with the ceres storage format, I think) this becomes a non-issue, because you only consume resources for what actually contains information (i.e. you will have many metrics, but they only consume space for the points in time for which they have datapoints).

