In web operations, mostly when troubleshooting but also for capacity planning,
I often find myself having very specific information needs from my time-series, and these information needs vary a lot over time.
This usually means I need to correlate or compare things that no one anticipated. Things that relate to specific machines, specific services across machines,
or a few specific metrics whose relationships only the ops team knows and which cross various scopes (application, network, system, etc.).
I should have an easy way to filter metrics by any information in the metric's name or values.
I should be able to group metrics into graphs the way I want. (Example: when viewing filesystem usage of servers, I should be able to group by server, with one graph per server listing its filesystems, but also by mountpoint, to compare servers on one graph.)
I should be able, with minimal effort, to view metrics by their gauge/count value, but also by their rate of change and, where appropriate, as a percentage of a maximum (like disk space used).
It should be trivial to manipulate the graph interactively (toggling things on/off, switching between lines/stacked mode, inspecting datapoints, zooming, ...).
It should show me all events, color-coded by type, with text descriptions, and be interactive so that the descriptions can contain hyperlinks.
And most of all, the code should be as simple as possible and it should be easy to get running.
Dashboards which show specific predefined KPIs (this covers most graphite dashboards) are clearly unsuitable for this use case. Template-based "metric exploration" dashboards like cacti and ganglia are in my experience way too limited. Graph composing dashboards (like the stock graphite one, or graphiti) require a lot of manual work to get the graph you want. I couldn't find anything even close to what I wanted, so I started Graph-Explorer.
The approach I'm taking is using plugins which add metadata to metrics (tags for server, service, mountpoint, interface name, ...), having them define how to render as a count, as a rate, or as a percent of some max allowed value (or a metric containing the max), and providing a query language which allows you to match/filter metrics, group them into graphs by tag, and render them how you want them. The plugins promote standardized metric naming and reuse across organisations, not least because most correspond to plugins for the Diamond monitoring agent.
Furthermore, because it uses my graphitejs plugin (which now btw supports flot as a backend for fast canvas-based graphs and annotated events from anthracite) the manual interactions mentioned earlier are supported or at least on the roadmap.
Graph Explorer is not yet where I want it, but it's already a very useful tool at Vimeo.
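To make the match/filter and group-by-tag idea concrete, here is a minimal sketch in Python. It is not Graph-Explorer's actual code or query language: the metric ids, the tag names and the `matches` and `group_by` helpers are all made up for illustration, assuming a plugin has already attached tags to each metric.

```python
from collections import defaultdict

# Hypothetical metrics, as if a plugin had already tagged them.
# The ids and tag names are invented for this example.
METRICS = [
    {"id": "servers.web1.diskspace.root.byte_used",
     "tags": {"server": "web1", "mountpoint": "/", "what": "bytes"}},
    {"id": "servers.web1.diskspace._var.byte_used",
     "tags": {"server": "web1", "mountpoint": "/var", "what": "bytes"}},
    {"id": "servers.web2.diskspace.root.byte_used",
     "tags": {"server": "web2", "mountpoint": "/", "what": "bytes"}},
    {"id": "servers.web2.diskspace._var.byte_used",
     "tags": {"server": "web2", "mountpoint": "/var", "what": "bytes"}},
    {"id": "servers.web1.mysql.queries",
     "tags": {"server": "web1", "service": "mysql", "what": "queries"}},
]


def matches(metric, filters):
    """True if every filter is a substring of the metric id or of a tag value."""
    haystack = [metric["id"], *metric["tags"].values()]
    return all(any(f in h for h in haystack) for f in filters)


def group_by(metrics, tag):
    """Build one graph per distinct value of `tag`; each graph lists metric ids."""
    graphs = defaultdict(list)
    for metric in metrics:
        graphs[metric["tags"].get(tag, "")].append(metric["id"])
    return dict(graphs)


disk = [m for m in METRICS if matches(m, ["diskspace"])]
per_server = group_by(disk, "server")          # one graph per server
per_mountpoint = group_by(disk, "mountpoint")  # compare servers per mountpoint
```

The same filtered set yields either one graph per server or one graph per mountpoint, which is the filesystem usage example from earlier: the grouping is a parameter of the query, not something baked into a dashboard template.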
Being active as both a developer and ops person in my professional life, and both an open source developer and packager in my spare time, I noticed some common ground between both worlds, and I think the open source community can learn from the Devops movement which is solving problems in the professional tech world.
For the sake of getting a point across, I'll simplify some things.
what: what is being measured? bytes, queries, timeouts, jobs, etc.
target_type: must be one of the existing clearly defined target_types (count, rate, counter, gauge)
stats.serverdb123.mysql.queries.selects 895 1234567890
to something more along these lines:
host=serverdb123 service=mysql type=select what=queries target_type=rate 895 1234567890
host=serverdb123 service=mysql type=select unit=Queries/s 895 1234567890
h=serverdb123 s=mysql t=select queries r 895 1234567890
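As a rough illustration of why the tagged form is easier to work with, the following Python sketch parses such a line into a tag dictionary, a value and a timestamp. The function name `parse_tagged_metric` is made up, and it assumes that every token before the last two is a `key=value` pair, so it handles the first two variants above but deliberately rejects the abbreviated third one with its bare `queries r` tokens.

```python
def parse_tagged_metric(line):
    """Parse 'k1=v1 k2=v2 ... <value> <unix_timestamp>' into (tags, value, timestamp).

    Assumption for this sketch: tags are whitespace-separated key=value
    tokens, and the last two tokens are the value and the timestamp.
    """
    # Raises ValueError if the line has fewer than two tokens.
    *tag_tokens, value, timestamp = line.split()
    tags = {}
    for token in tag_tokens:
        key, sep, val = token.partition("=")
        if not sep or not key or not val:
            raise ValueError(f"not a key=value tag: {token!r}")
        tags[key] = val
    return tags, float(value), int(timestamp)


tags, value, timestamp = parse_tagged_metric(
    "host=serverdb123 service=mysql type=select what=queries target_type=rate "
    "895 1234567890"
)
```

With the dotted name, a consumer has to know that the second field is a host and the third a service; here that knowledge travels with the metric itself.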
Graphite can show events such as code deploys and
puppet changes as vertical markers on your graph.
With the advent of new graphite dashboards and interfaces where we can have popups and annotations to show metadata for each event (by means of client-side rendering),
it's time we have a database to track all events along with categorisation and text descriptions (which can include rich text and hyperlinks).
Graphite is meant for time series (metrics over time); Anthracite aims to be the companion for annotated events.
More precisely, Anthracite aims to be a database of "relevant events" (see further down), for the purpose of enriching monitoring dashboards, as well as allowing visual and numerical analysis of events that have a business impact (for the latter, see "Thoughts on incident nomenclature, severity levels and incident analysis" below).
It has a TCP receiver, a database (sqlite3), an HTTP interface to deliver event data in many formats and a simple web frontend for humans.
I recommend you submit any event that has or might have a relevant effect on:
The TCP receiver listens for lines in this format:
<unix_timestamp> <type> <description>
There are no restrictions for type and description, other than that they must be non-empty strings.
I do have some suggestions, which I'll demonstrate through fictitious examples
(but note that there's room for improvement; see the section below):
# a deploy_* type for each project
ts deploy_vimeo.com "deploy e8e5e4 initiated by Nicolas -- github.com/Vimeo/main/compare/foobar..e8e5e4"
ts puppet "all nodes of class web_cluster_1: modified apache.conf; restart service apache"
ts incident_sev2_start "mysql2 crashed, site degraded"
ts incident_sev2_resolved "replaced db server"
ts incident "hurricane Sandy, systems unaffected but power outages among users, expect lower site usage"
# in those exceptional cases of manual production changes, try to not forget adding your event
ts manual_dieter "i have to try this firewall thing on the LB"
ts backup "backup from database slave vimeomysql22"
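For anyone who wants to submit events from a script, here is a minimal Python sketch of building and sending a line in the `<unix_timestamp> <type> <description>` format. The helper names are made up, and a few things are assumptions on my part rather than documented behaviour: that the type contains no whitespace (it has to, if it is to be separated from the description by a space), that one event is one newline-terminated line, and whatever host and port you pass to `send_event`.

```python
import socket
import time


def format_event(event_type, description, timestamp=None):
    """Build one '<unix_timestamp> <type> <description>' line."""
    if not event_type or not description:
        raise ValueError("type and description must be non-empty strings")
    # Assumption: the type is a single token, the description is a single line.
    if any(c.isspace() for c in event_type):
        raise ValueError("type must not contain whitespace")
    if "\n" in description:
        raise ValueError("description must fit on one line")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{ts} {event_type} {description}\n"


def parse_event(line):
    """Split an event line back into (timestamp, type, description)."""
    ts, event_type, description = line.rstrip("\n").split(" ", 2)
    if not event_type or not description:
        raise ValueError("type and description must be non-empty strings")
    return int(ts), event_type, description


def send_event(line, host, port):
    """Send one event line over TCP. host and port are whatever your receiver uses."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))


line = format_event("deploy_vimeo.com", "deploy e8e5e4 initiated by Nicolas",
                    timestamp=1234567890)
```

Calling `send_event(line, host, port)` from a deploy script or a puppet hook is then a one-liner, which matters: events only get recorded consistently if submitting them costs almost nothing.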
Client-side rendering of charts, as opposed to using graphite's server-side generated PNGs, allows various interactivity features, such as:
There are many graphite dashboards with a different focus, but as far as plotting graphs goes, what they need is usually very similar: a plot to draw multiple graphite targets, a legend, an x-axis, one or two y-axes, lines or stacked bands. So I think there's a lot of value in having many dashboard projects share the same code for the actual charts, and I hope we can work together on this.