Graphite can show events such as code deploys and
puppet changes as vertical markers on your graph.
With the advent of new graphite dashboards and interfaces that use client-side rendering, we can have popups and annotations showing metadata for each event. So it's time we had a database to track all events, with categorisation and text descriptions (which can include rich text and hyperlinks).
Graphite is meant for time series (metrics over time); Anthracite aims to be the companion for annotated events.
More precisely, Anthracite aims to be a database of "relevant events" (see further down), for the purpose of enriching monitoring dashboards, as well as allowing visual and numerical analysis of events that have a business impact (for the latter, see "Thoughts on incident nomenclature, severity levels and incident analysis" below)
It has a TCP receiver, a database (sqlite3), an HTTP interface to deliver event data in many formats, and a simple web frontend for humans.
I recommend you submit any event that has or might have a relevant effect on:
The TCP receiver listens for lines in this format:
<unix_timestamp> <type> <description>
There are no restrictions for type and description, other than that they must be non-empty strings.
I do have some suggestions, which I'll demonstrate through fictive examples (though note there's room for improvement; see the section below):
# a deploy_* type for each project
ts deploy_vimeo.com "deploy e8e5e4 initiated by Nicolas -- github.com/Vimeo/main/compare/foobar..e8e5e4"
ts puppet "all nodes of class web_cluster_1: modified apache.conf; restart service apache"
ts incident_sev2_start "mysql2 crashed, site degraded"
ts incident_sev2_resolved "replaced db server"
ts incident "hurricane Sandy, systems unaffected but power outages among users, expect lower site usage"
# in those exceptional cases of manual production changes, try to not forget adding your event
ts manual_dieter "i have to try this firewall thing on the LB"
ts backup "backup from database slave vimeomysql22"
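For completeness, here's a minimal sketch of what submitting such an event over TCP could look like in Go. The host and port are assumptions (point it at wherever your receiver actually listens); only the line format comes from the spec above.

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        // address is an assumption; use your Anthracite TCP receiver's address
        conn, err := net.Dial("tcp", "localhost:2005")
        if err != nil {
            panic(err)
        }
        defer conn.Close()
        // one event per line: <unix_timestamp> <type> <description>
        fmt.Fprintf(conn, "%d deploy_vimeo.com %s\n",
            time.Now().Unix(), `"deploy e8e5e4 initiated by Nicolas"`)
    }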
from the website:
We have pretty good storage of timeseries data, collection agents, and dashboards. But the idea of giving timeseries a "name" or a
"key" is profoundly limiting us. Especially when they're not standardized and missing information.
Metrics 2.0 aims for self-describing, standardized metrics using orthogonal tags for every dimension. "metrics" being the pieces of information that point to, and describe timeseries of data.
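To make that concrete, compare how the same measurement looks as a graphite-style name versus as a metrics 2.0 set of tags (this reuses the mysql example from the protocol proposal further down):

    # graphite-style key: every dimension crammed into one unstandardized string
    stats.gauges.db15.mysql.bytes_received
    # metrics 2.0: self-describing, orthogonal tags
    service=mysql server=db15 direction=in unit=B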
Client-side rendering of charts, as opposed to using graphite's server-side generated PNGs, allows various interactivity features, such as:
There are many graphite dashboards, each with a different focus, but as far as plotting goes, what they need is usually very similar: a plot that draws multiple graphite targets, a legend, an x-axis, 1 or 2 y-axes, lines or stacked bands. So I think there's a lot of value in having the various dashboard projects share the same code for the actual charts, and I hope we can work together on this.
Earlier this month we had another iteration of the Monitorama conference, this time in Portland, Oregon.
(photo by obfuscurity)
I think the conference was better than the first one in Boston; there was much more to learn. This edition was also quite focused on telemetry (timeseries metrics processing): lots of talks on timeseries analytics, not so much about things like sensu or nagios.
Adrian Cockroft's keynote brought some interesting ideas to the table, like building a feedback loop into the telemetry to drive infrastructure changes (something we do at Vimeo; I briefly give an example in the intro of my talk) and shortening the time from fault to alert (which I'm excited to start working on soon).
My other favorite was Noah Kantrowitz's talk about applying audio DSP techniques to timeseries, I always loved audio processing and production. Combining these two interests hadn't occurred to me so now I'm very excited about the applications.
The opposite idea, just as interesting (conveying information about system state as an audio stream), came up in puppetlabs' monitorama recap, and that seems to make a lot of sense as well. There's a lot of information in a stream of sound; it is much denser than text, icons, and perhaps even graph plots. Listening to an audio stream crafted to represent various information might be a better way to get insights into your system.
I'm happy to see the idea reinforced that telemetry is a key part of modern monitoring. For me personally, telemetry (the tech and the process) is the most fascinating part of modern technical operations, and I'm glad to be part of the movement pushing this forward. There's also a bunch of startups in the space (many stealthy ones), validating the market. I'm curious to see how this will play out.
I had the privilege to present metrics 2.0 and the tooling around it.
As usual the slides are on slideshare and the footage on the better video sharing platform ;-) .
I'm happy with all the positive feedback, although I'm not aware yet of other tools and applications adopting metrics 2.0, and I'm looking forward to seeing more of that, because ultimately that's what will show whether my ideas are any good.
In web operations, mostly when troubleshooting but also for capacity planning,
I often find myself having very specific information needs from my time-series, and these information needs vary a lot over time.
This usually means I need to correlate or compare things that no one anticipated: things that relate to specific machines, or specific services across machines,
or a few specific metrics whose relationship only the ops team knows about, and which cross various scopes (application, network, system, etc).
I should have an easy way to filter metrics by any information in the metric's name or values.
I should be able to group metrics into graphs the way I want. (Example: when viewing filesystem usage of servers, I should be able to group by server (one graph per server, listing its filesystems), but also by mountpoint (comparing servers on one graph).)
I should be able, with minimal effort, to view metrics by their gauge/count value, but also by their rate of change and, where appropriate, as a percentage of a maximum (like diskspace used).
It should be trivial to manipulate the graph interactively (toggling things on/off, switching between lines/stacked mode, inspecting datapoints, zooming, ...).
It should show me all events, color-coded by type, with text descriptions, and rendered interactively, so that hyperlinks can be followed.
And most of all, the code should be as simple as possible and it should be easy to get running.
Dashboards which show specific predefined KPI's (this covers most graphite dashboards) are clearly unsuitable for this use case. Template-based "metric exploration" dashboards like cacti and ganglia are in my experience way too limited. Graph composing dashboards (like the stock graphite one, or graphiti) require much manual work to get the graph you want. I couldn't find anything even close to what I wanted, so I started Graph-Explorer.
The approach I'm taking is to use plugins that add metadata to metrics (tags for server, service, mountpoint, interface name, ...), have them define how to render a metric as a count, as a rate, and as a percentage of some maximum (either a fixed allowed value or a metric containing the max), and provide a query language that lets you match/filter metrics, group them into graphs by tag, and render them how you want. The plugins promote standardized metric naming and reuse across organisations, not least because most correspond to plugins for the Diamond monitoring agent.
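Graph-Explorer's plugins are actually Python, but the core idea (derive tags from a dotted metric name) is simple enough to sketch. This Go version, with made-up metric naming and tag choices, is purely illustrative:

    package main

    import (
        "fmt"
        "regexp"
    )

    // a made-up rule recognizing diskspace metrics shaped like
    // servers.<server>.df.<mountpoint>.bytes_used
    var dfRule = regexp.MustCompile(`^servers\.([^.]+)\.df\.([^.]+)\.bytes_used$`)

    // tagsFor returns metadata tags for metric names the rule recognizes.
    func tagsFor(metric string) map[string]string {
        if m := dfRule.FindStringSubmatch(metric); m != nil {
            return map[string]string{
                "server":     m[1],
                "mountpoint": m[2],
                "what":       "bytes_used",
                "unit":       "B",
            }
        }
        return nil // not recognized by this plugin
    }

    func main() {
        fmt.Println(tagsFor("servers.web12.df.root.bytes_used"))
        // map[mountpoint:root server:web12 unit:B what:bytes_used]
    }

Once metrics carry tags like these, a query can filter on any of them and group graphs by any tag (group by server, or group by mountpoint, as in the filesystem example above).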
Furthermore, because it uses my graphitejs plugin (which now, by the way, supports flot as a backend for fast canvas-based graphs, as well as annotated events from anthracite), the manual interactions mentioned earlier are supported, or at least on the roadmap. Graph-Explorer is not yet where I want it to be, but it's already a very useful tool at Vimeo.
Being active as both a developer and an ops person in my professional life, and as both an open source developer and a packager in my spare time, I've noticed some common ground between the two worlds, and I think the open source community can learn from the Devops movement, which is solving problems in the professional tech world.
For the sake of getting a point across, I'll simplify some things.
The talk also briefly covers native metrics 2.0 support throughout your metrics pipeline, using statsdaemon and carbon-tagger. I'm psyched that, by formatting metrics a little better at the source and having an aggregation daemon that expresses the operations it performs by updating the metric tags, all the foundations are in place for some truly next-gen UIs and applications. One of them is already being implemented: graph-explorer can pretty much generate every graph I need once I phrase an information need as a proper query.
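To illustrate what "expressing the performed operations by updating the metric tags" means: when an aggregator turns a count of bytes into a per-second rate, the tags should change to say so. The exact tags statsdaemon emits may differ; this before/after is only a sketch of the idea:

    # as submitted by the application: a count of bytes
    service=mysql server=db15 direction=in unit=B 2466 1234567890
    # as emitted by the aggregation daemon after deriving a rate
    service=mysql server=db15 direction=in unit=B/s 41.1 1234567890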
I would love to do another talk that allows me to dive into more of the underlying ideas, the benefits of metrics2.0 for things like metric storage systems, graph renderers, anomaly detection, dashboards, etc.
Hope you like it!
I've been a graphite contributor for a while (and still am). It's a great tool for timeseries metrics. Two weeks ago I started working on Graphite-ng: it's somewhere between an early clone/rewrite, a redesign, and an experiment playground, written in Golang. The focus of my work so far is the API web server, which is a functioning prototype, it answers requests like
I.e. it lets you retrieve your timeseries, processed by function pipelines which are set up on the fly based on a spec in your HTTP/REST arguments. Currently it only fetches metrics from text files, but I'm working on decent metrics storage as well.
There's a few reasons why I decided to start a new project from scratch:
The API server I developed sets up a processing pipeline as directed by your query: every processing function runs in a goroutine for concurrency, and the metrics flow through using Go channels. It literally compiles a program and executes it. You can add your own functions to collect, process, and return metrics by writing simple plugins.
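Graphite-ng's real plugin API isn't spelled out here, so the following is only a minimal sketch of the channel-based pipeline idea described above, with all names invented: each processing function runs in its own goroutine, and stages compose by passing channels along.

    package main

    import "fmt"

    // Datapoint is one timestamped value of a series.
    type Datapoint struct {
        Ts    int64
        Value float64
    }

    // scale multiplies every datapoint flowing through by factor.
    // It reads from `in` and returns its output channel, so stages chain.
    func scale(in <-chan Datapoint, factor float64) <-chan Datapoint {
        out := make(chan Datapoint)
        go func() {
            defer close(out)
            for p := range in {
                p.Value *= factor
                out <- p
            }
        }()
        return out
    }

    func main() {
        // stand-in for a fetch stage that would read from storage
        src := make(chan Datapoint)
        go func() {
            defer close(src)
            for i := int64(0); i < 3; i++ {
                src <- Datapoint{Ts: 1234567890 + 60*i, Value: float64(i)}
            }
        }()
        // pipeline corresponding to a query like scale(some.metric, 10)
        for p := range scale(src, 10) {
            fmt.Println(p.Ts, p.Value)
        }
    }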
As for timeseries storage: for now it uses simple text files, but I'm experimenting and thinking about what the best metric store(s) would be, something that works from small scale (a personal netbook install) to large scale ("I have millions of metrics that need to be distributed across nodes; the system should be HA and self-healing in failure scenarios, easily maintainable, and highly performant") while still being easy to deploy, configure, and run. Candidates are whisper-go, kairosdb, my own elasticsearch experiment, etc.
I won't implement rendering images, because I think client-side rendering using something like timeserieswidget is superior. I can also leave out events because anthracite already does that. There's a ton of dashboards out there (graph-explorer, descartes, etc) so that can be left out as well.
Over the past year, I've been working a lot on a graphite dashboard, Graph-Explorer. It's fairly unique in the sense that it leverages what I call "structured metrics" to facilitate a powerful way to build graphs dynamically (see also "a few common graphite problems and how they are already solved" and the Graph-Explorer homepage). I implemented two solutions (structured_metrics and carbon-tagger) for creating, transporting and using these metrics, on which dashboards like Graph-Explorer can rely. The former converts graphite metrics into a set of key-value pairs (this happens "after the fact", so it's a little annoying); carbon-tagger uses a prototype extended carbon (graphite) protocol (called "proto2") to maintain an elasticsearch tags database on the fly, but that format proved too restrictive.
Now, for Graphite-ng, I wanted to rethink the metric: take the simplicity of Graphite and the ideas of structured_metrics and carbon-tagger (but less annoying) and of OpenTSDB (but more powerful), and propose a new way to identify and use metrics and organize tags around them.
The proposed ingestion (carbon etc) protocol looks like this:
<intrinsic_tags> <extrinsic_tags> <value> <timestamp>
graphite: stats.gauges.db15.mysql.bytes_received
opentsdb: mysql.bytes_received host=db15
proposal: service=mysql server=db15 direction=in unit=B

With the extrinsic tags included, the full tag set becomes:

service=mysql server=db15 direction=in unit=B src=diamond processed_by_statsd env=prod
These might be reasonable solutions based on the circumstances (often based on short-term local gains), but I believe as a community we should solve the problem at its root, so that everyone can reap the long term benefits.
what: what is being measured? (bytes, queries, timeouts, jobs, etc)
target_type: must be one of the existing clearly defined target_types (count, rate, counter, gauge)
stats.serverdb123.mysql.queries.selects 895 1234567890

to something more along these lines:

host=serverdb123 service=mysql type=select what=queries target_type=rate 895 1234567890
host=serverdb123 service=mysql type=select unit=Queries/s 895 1234567890
h=serverdb123 s=mysql t=select queries r 895 1234567890