Being active as both a developer and an ops person in my professional life, and as both an open source developer and a packager in my spare time, I have noticed some common ground between the two worlds, and I think the open source community can learn from the DevOps movement, which is solving problems in the professional tech world.
For the sake of getting a point across, I'll simplify some things.
what: what is being measured? bytes, queries, timeouts, jobs, etc.
target_type: must be one of the existing, clearly defined target_types (count, rate, counter, gauge)
stats.serverdb123.mysql.queries.selects 895 1234567890
to something more along these lines:
host=serverdb123 service=mysql type=select what=queries target_type=rate 895 1234567890
host=serverdb123 service=mysql type=select unit=Queries/s 895 1234567890
h=serverdb123 s=mysql t=select queries r 895 1234567890
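To make the first of these variants concrete, here is a minimal Python sketch of how such a line could be parsed. The function name `parse_structured_metric` is made up for this illustration and is not part of any existing tool. It only handles the fully spelled-out key=value form and assumes the last two fields are always the value and the timestamp.

```python
# Allowed target_types, as listed above.
VALID_TARGET_TYPES = {"count", "rate", "counter", "gauge"}


def parse_structured_metric(line):
    """Split 'k=v k=v ... <value> <timestamp>' into (tags, value, timestamp).

    Hypothetical helper: assumes the last two fields are value and timestamp
    and that every other field is a key=value pair.
    """
    *tag_tokens, value, timestamp = line.split()
    tags = {}
    for token in tag_tokens:
        key, sep, val = token.partition("=")
        if not sep or not key or not val:
            raise ValueError(f"expected key=value, got {token!r}")
        tags[key] = val
    # Reject anything that is not one of the clearly defined target_types.
    if tags.get("target_type") not in VALID_TARGET_TYPES:
        raise ValueError(f"unknown target_type: {tags.get('target_type')!r}")
    return tags, float(value), int(timestamp)


tags, value, ts = parse_structured_metric(
    "host=serverdb123 service=mysql type=select what=queries target_type=rate 895 1234567890"
)
print(tags["what"], tags["target_type"], value, ts)
```

The abbreviated third variant (h=, s=, bare words) would need extra rules that are not covered here.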
I've been a graphite contributor for a while (and still am). It's a great tool for timeseries metrics. Two weeks ago I started working on Graphite-ng: it's somewhere between an early clone/rewrite, a redesign, and an experiment playground, written in Golang. The focus of my work so far is the API web server, which is a functioning prototype; it answers requests like
I.e. it lets you retrieve your timeseries, processed by function pipelines which are set up on the fly based on a spec in your HTTP/REST arguments. Currently it only fetches metrics from text files, but I'm working on decent metrics storage as well.
There are a few reasons why I decided to start a new project from scratch:
The API server I developed sets up a processing pipeline as directed by your query: every processing function runs in a goroutine
for concurrency and the metrics flow through using Go channels. It literally compiles a program and executes it. You can add your own functions
to collect, process, and return metrics by writing simple plugins.
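To give a feel for the pipeline idea, here is a rough Python analogue, not the Graphite-ng code itself. It swaps in one thread per function where Graphite-ng uses goroutines, and `queue.Queue` where it uses Go channels. The names `stage`, `build_pipeline`, `drain`, `scale` and `offset` are invented for this sketch, and the "compiles a program" step is not reproduced.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream


def stage(func, inbox, outbox):
    """Start a worker that applies func to each datapoint from inbox."""
    def run():
        while True:
            item = inbox.get()
            if item is DONE:
                outbox.put(DONE)  # pass end-of-stream downstream
                return
            outbox.put(func(item))
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def build_pipeline(source, funcs):
    """Chain one worker per function and return the final output queue."""
    first = queue.Queue()
    inbox = first
    for func in funcs:
        outbox = queue.Queue()
        stage(func, inbox, outbox)
        inbox = outbox
    for point in source:
        first.put(point)
    first.put(DONE)
    return inbox


def drain(q):
    """Collect everything from a queue until the end-of-stream sentinel."""
    out = []
    while True:
        item = q.get()
        if item is DONE:
            return out
        out.append(item)


# Hypothetical processing functions working on (timestamp, value) datapoints.
def scale(factor):
    return lambda p: (p[0], p[1] * factor)


def offset(amount):
    return lambda p: (p[0], p[1] + amount)


points = [(0, 1.0), (60, 2.0), (120, 3.0)]
result = drain(build_pipeline(points, [scale(10), offset(1)]))
print(result)
```

Each stage only knows its input and output queue, which is what makes it cheap to assemble a different chain for every query.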
As for timeseries storage, for now it uses simple text files, but I'm experimenting and thinking about which metric store(s) would work best from small scale (personal netbook install) to large scale ("I have millions of metrics that need to be distributed across nodes, the system should be HA and self-healing in failure scenarios, easily maintainable, and highly performant") while still being easy to deploy, configure and run. Candidates are whisper-go, kairosdb, my own elasticsearch experiment, etc.
I won't implement rendering images, because I think client-side rendering using something like timeserieswidget is superior. I can also leave out events because anthracite already does that. There's a ton of dashboards out there (graph-explorer, descartes, etc) so that can be left out as well.
Over the past year, I've been working a lot on a graphite dashboard (Graph-Explorer). It's fairly unique in the sense that it leverages what I call "structured metrics" to facilitate a powerful way to build graphs dynamically (see also "a few common graphite problems and how they are already solved" and the Graph-Explorer homepage). I implemented two solutions (structured_metrics and carbon-tagger) for creating, transporting and using these metrics, on which dashboards like Graph-Explorer can rely. The former converts graphite metrics into a set of key-value pairs (this is "after the fact", so a little annoying). The latter, carbon-tagger, uses a prototype extended carbon (graphite) protocol (called "proto2") to maintain an elasticsearch tags database on the fly, but the format proved too restrictive.
Now for Graphite-ng I wanted to rethink the metric, taking the simplicity of Graphite, the ideas of structured_metrics and carbon-tagger (but less annoying) and OpenTSDB (but more powerful), and propose a new way to identify and use metrics and organize tags around them.
The proposed ingestion (carbon etc) protocol looks like this:
<intrinsic_tags> <extrinsic_tags> <value> <timestamp>
graphite: stats.gauges.db15.mysql.bytes_received
opentsdb: mysql.bytes_received host=db15
proposal: service=mysql server=db15 direction=in unit=B
service=mysql server=db15 direction=in unit=B src=diamond processed_by_statsd env=prod
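As a small illustration of how such tag strings could be handled, here is a Python sketch. Several things in it are my assumptions rather than part of the proposal as quoted here: the intrinsic and extrinsic parts are passed in as two separate strings (the delimiter between them is not visible above), the metric's identity is derived from the intrinsic tags only (as the names suggest), and tags are sorted so that their order does not matter. `parse_tags` and `metric_id` are hypothetical names.

```python
def parse_tags(text):
    """Return (pairs, bare) for a space-separated tag string.

    key=value tokens go into the pairs dict, anything else is a bare tag.
    """
    pairs, bare = {}, []
    for token in text.split():
        key, sep, val = token.partition("=")
        if sep:
            pairs[key] = val
        else:
            bare.append(token)
    return pairs, bare


def metric_id(intrinsic):
    """Build an order-independent identifier from the intrinsic tags.

    Assumption: only intrinsic tags identify a metric; extrinsic tags are
    treated as extra metadata and do not affect the identifier.
    """
    pairs, bare = parse_tags(intrinsic)
    parts = [f"{k}={v}" for k, v in pairs.items()] + bare
    return " ".join(sorted(parts))


intrinsic = "service=mysql server=db15 direction=in unit=B"
extrinsic = "src=diamond processed_by_statsd env=prod"

print(metric_id(intrinsic))
print(parse_tags(extrinsic))
```

With this reading, adding or changing src, env or processed_by_statsd leaves the metric's identity untouched.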
Graphite can show events such as code deploys and
puppet changes as vertical markers on your graph.
With the advent of new graphite dashboards and interfaces where we can have popups and annotations to show metadata for each event (by means of client-side rendering),
it's time we have a database to track all events along with categorisation and text descriptions (which can include rich text and hyperlinks).
Graphite is meant for time series (metrics over time); Anthracite aims to be the companion for annotated events.
More precisely, Anthracite aims to be a database of "relevant events" (see further down), for the purpose of enriching monitoring dashboards, as well as allowing visual and numerical analysis of events that have a business impact (for the latter, see "Thoughts on incident nomenclature, severity levels and incident analysis" below).
It has a TCP receiver, a database (sqlite3), an HTTP interface to deliver event data in many formats, and a simple web frontend for humans.
I recommend you submit any event that has or might have a relevant effect on:
The TCP receiver listens for lines in this format:
<unix_timestamp> <type> <description>
There are no restrictions for type and description, other than that they must be non-empty strings.
I do have some suggestions, which I'll demonstrate through fictional examples
(but note that there's room for improvement; see the section below):
# a deploy_* type for each project
ts deploy_vimeo.com "deploy e8e5e4 initiated by Nicolas -- github.com/Vimeo/main/compare/foobar..e8e5e4"
ts puppet "all nodes of class web_cluster_1: modified apache.conf; restart service apache"
ts incident_sev2_start "mysql2 crashed, site degraded"
ts incident_sev2_resolved "replaced db server"
ts incident "hurricane Sandy, systems unaffected but power outages among users, expect lower site usage"
# in those exceptional cases of manual production changes, try to not forget adding your event
ts manual_dieter "i have to try this firewall thing on the LB"
ts backup "backup from database slave vimeomysql22"
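For scripts that submit events, the line format above can be produced and checked with a few lines of Python. This is only a sketch: `format_event` and `parse_event` are names I made up, not Anthracite functions. I assume that the type contains no whitespace (since the fields are space-separated) and that the description is simply the rest of the line. Whether the quotes in the examples above belong on the wire is not something this sketch settles, so it passes the description through unchanged.

```python
import time


def format_event(timestamp, event_type, description):
    """Build one '<unix_timestamp> <type> <description>' line."""
    if not event_type or not description:
        raise ValueError("type and description must be non-empty strings")
    # Assumption: a type with whitespace could not be told apart from the description.
    if any(c.isspace() for c in event_type):
        raise ValueError("type must not contain whitespace")
    return f"{int(timestamp)} {event_type} {description}"


def parse_event(line):
    """Split a line back into (timestamp, type, description)."""
    parts = line.strip().split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"expected '<unix_timestamp> <type> <description>', got {line!r}")
    ts, event_type, description = parts
    return int(ts), event_type, description


line = format_event(1234567890, "incident_sev2_start", "mysql2 crashed, site degraded")
print(line)
print(parse_event(line))
print(format_event(time.time(), "backup", "backup from database slave vimeomysql22"))
```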
In web operations, mostly when troubleshooting but also for capacity planning,
I often find myself having very specific information needs from my time-series, and these information needs vary a lot over time.
This usually means I need to correlate or compare things that no one anticipated. Things that relate to specific machines, specific services across machines,
or a few specific metrics that cross various scopes (application, network, system, etc.) and whose relationships only the ops team knows.
I should have an easy way to filter metrics by any information in the metric's name or values.
I should be able to group metrics into graphs the way I want. (Example: when viewing filesystem usage of servers, I should be able to group by server (one graph per server, listing the filesystems), but also by mountpoint (to compare servers on one graph).)
I should be able, with minimal effort, to view metrics by their gauge/count value, but also by their rate of change and, where appropriate, as a percentage of a maximum (like diskspace used).
It should be trivial to manipulate the graph interactively (toggling things on/off, switching between lines/stacked mode, inspecting datapoints, zooming, ...).
It should show me all events, color-coded by type, with a text description, and interactive so that the descriptions can contain hyperlinks.
And most of all, the code should be as simple as possible and it should be easy to get running.
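The different views of a metric mentioned in this list (raw value, rate of change, percentage of a maximum) boil down to simple arithmetic on the datapoints. Here is a minimal Python sketch with made-up function names and numbers. It assumes datapoints are (timestamp, value) pairs with strictly increasing timestamps, and it ignores counter resets.

```python
def as_rate(points):
    """Per-second rate of change between consecutive (timestamp, value) points."""
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        rates.append((t1, (v1 - v0) / (t1 - t0)))
    return rates


def as_percent(points, maximum):
    """Express each value as a percentage of a known maximum."""
    return [(t, 100.0 * v / maximum) for t, v in points]


# Hypothetical disk usage samples, one per minute, on a volume of size 800.
used = [(0, 100), (60, 160), (120, 400)]

print(as_rate(used))          # growth per second
print(as_percent(used, 800))  # share of the volume in use
```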
Dashboards which show specific predefined KPIs (this covers most graphite dashboards) are clearly unsuitable for this use case. Template-based "metric exploration" dashboards like cacti and ganglia are, in my experience, way too limited. Graph composing dashboards (like the stock graphite one, or graphiti) require a lot of manual work to get the graph you want. I couldn't find anything even close to what I wanted, so I started Graph-Explorer.
The approach I'm taking is using plugins which add metadata to metrics (tags for server, service, mountpoint, interface name, ...), having them define how to render as a count, as a rate, or as a percentage of some maximum allowed value (or of a metric containing the maximum), and providing a query language which allows you to match/filter metrics, group them into graphs by tag, and render them how you want them. The plugins promote standardized metric naming and reuse across organisations, not least because most correspond to plugins for the Diamond monitoring agent.
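Once metrics carry tags, matching and grouping become plain data manipulation. The following Python sketch shows the idea using the filesystem example; it is not Graph-Explorer's query language or its code, and the metric list, `match` and `group_by` are invented for the illustration.

```python
from collections import defaultdict

# Hypothetical tagged metrics, as plugins might produce them.
metrics = [
    {"server": "web1", "mountpoint": "/", "what": "bytes_used"},
    {"server": "web1", "mountpoint": "/var", "what": "bytes_used"},
    {"server": "web2", "mountpoint": "/", "what": "bytes_used"},
    {"server": "web2", "mountpoint": "/var", "what": "bytes_used"},
    {"server": "web1", "interface": "eth0", "what": "bytes_received"},
]


def match(metrics, **wanted):
    """Keep the metrics whose tags equal all the given key=value pairs."""
    return [m for m in metrics if all(m.get(k) == v for k, v in wanted.items())]


def group_by(metrics, tag):
    """Group metrics into graphs: one graph per distinct value of the tag."""
    graphs = defaultdict(list)
    for m in metrics:
        graphs[m.get(tag)].append(m)
    return dict(graphs)


disk = match(metrics, what="bytes_used")
per_server = group_by(disk, "server")          # one graph per server, listing its filesystems
per_mountpoint = group_by(disk, "mountpoint")  # one graph per mountpoint, comparing servers

print(sorted(per_server))
print(sorted(per_mountpoint))
```

Switching between the two views is just a matter of changing the tag you group by.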
Furthermore, because it uses my graphitejs plugin (which, by the way, now supports flot as a backend for fast canvas-based graphs and annotated events from anthracite), the manual interactions mentioned earlier are supported or at least on the roadmap. Graph-Explorer is not yet where I want it, but it's already a very useful tool at Vimeo.
Client-side rendering of charts, as opposed to using graphite's server-side generated PNGs, allows various interactivity features, such as:
There are many graphite dashboards, each with a different focus, but as far as plotting graphs goes, what they need is usually very similar: a plot to draw multiple graphite targets, a legend, an x-axis, one or two y-axes, and lines or stacked bands. So I think there's a lot of value in having many dashboard projects share the same code for the actual charts, and I hope we can work together on this.