Mon, 12 Nov 2012

Anthracite, an event database to enrich monitoring dashboards and to allow visual and numerical analysis of events that have a business impact

Introduction

Graphite can show events such as code deploys and puppet changes as vertical markers on your graph. With the advent of new graphite dashboards and interfaces, where popups and annotations can show metadata for each event (by means of client-side rendering), it's time we had a database to track all events, along with categorisation and text descriptions (which can include rich text and hyperlinks). Graphite is meant for time series (metrics over time); Anthracite aims to be its companion for annotated events.
More precisely, Anthracite aims to be a database of "relevant events" (see further down), for the purpose of enriching monitoring dashboards, as well as allowing visual and numerical analysis of events that have a business impact (for the latter, see "Thoughts on incident nomenclature, severity levels and incident analysis" below).
It has a TCP receiver, a database (sqlite3), an HTTP interface to deliver event data in many formats and a simple web frontend for humans.

design goals:

  • do one thing and do it well. aim for integration.
  • take inspiration from graphite:
    • simple TCP protocol
    • automatically create new event types as they are used
    • run on port 2005 by default (carbon is 2003,2004)
    • deliver events in various formats (html, raw, json, sqlite,...)
    • stay out of the way
  • super easy to install and run: install dependencies, clone repo. the app is ready to run
I have a working prototype on github.com/Dieterbe/anthracite
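
One way to get the "automatically create new event types as they are used" behavior on top of sqlite3 is to never declare types up front: store the type as a plain column and derive the list of known types from the stored events. The sketch below only illustrates that idea. The table layout and the names `open_db`, `add_event` and `event_types` are hypothetical and are not taken from the prototype.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a database with a single, hypothetical events table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY,"
        " ts INTEGER NOT NULL,"
        " type TEXT NOT NULL,"
        " description TEXT NOT NULL)"
    )
    return conn


def add_event(conn, ts, event_type, description):
    """Store one event; an unseen type needs no registration step."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO events (ts, type, description) VALUES (?, ?, ?)",
            (ts, event_type, description),
        )


def event_types(conn):
    """Return every type that has been used so far, sorted by name."""
    rows = conn.execute("SELECT DISTINCT type FROM events ORDER BY type")
    return [row[0] for row in rows]
```

With this layout, submitting an event with a brand new type is enough to make that type show up in `event_types`.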

About "relevant events"

I recommend you submit any event that has or might have a relevant effect on:

  • your application behavior
  • monitoring itself (for example you fixed a bug in metrics reporting. it shouldn't look like the app behavior changed)
  • the business (press coverage, viral videos, etc), because this also affects your app usage and metrics.

Formats and conventions

The TCP receiver listens for lines in this format:

<unix_timestamp> <type> <description>

There are no restrictions for type and description, other than that they must be non-empty strings.
I do have some suggestions, which I'll demonstrate through fictive examples
(but note that there's room for improvement, see the section below):

# a deploy_* type for each project
ts deploy_vimeo.com "deploy e8e5e4 initiated by Nicolas -- github.com/Vimeo/main/compare/foobar..e8e5e4"
ts puppet "all nodes of class web_cluster_1: modified apache.conf; restart service apache"
ts incident_sev2_start "mysql2 crashed, site degraded"
ts incident_sev2_resolved "replaced db server"
ts incident "hurricane Sandy, systems unaffected but power outages among users, expect lower site usage"
# in those exceptional cases of manual production changes, try not to forget to add your event
ts manual_dieter "i have to try this firewall thing on the LB"
ts backup "backup from database slave vimeomysql22"
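
As a rough illustration of the line format, here is a minimal Python sketch of a client that builds, parses and submits such lines. The function names are made up for this example. It assumes the type is a single token without whitespace (the fields are space-separated), that each event is terminated by a newline, and that the description is sent without surrounding quotes; none of these details are confirmed by the format description above. The default port 2005 comes from the design goals.

```python
import socket
import time


def format_event(timestamp, event_type, description):
    """Build one '<unix_timestamp> <type> <description>' line."""
    if not event_type or not description:
        raise ValueError("type and description must be non-empty")
    if any(c.isspace() for c in event_type):
        # Assumption: the type is one token, since fields are space-separated.
        raise ValueError("type must not contain whitespace")
    if "\n" in description:
        # Assumption: one event per line, so no embedded newlines.
        raise ValueError("description must fit on one line")
    return "%d %s %s\n" % (int(timestamp), event_type, description)


def parse_event(line):
    """Split a line back into (timestamp, type, description)."""
    parts = line.strip().split(" ", 2)
    if len(parts) != 3 or not parts[1] or not parts[2]:
        raise ValueError("expected '<unix_timestamp> <type> <description>'")
    return int(parts[0]), parts[1], parts[2]


def send_event(event_type, description, host="localhost", port=2005,
               timestamp=None):
    """Send a single event line to the TCP receiver."""
    if timestamp is None:
        timestamp = time.time()
    line = format_event(timestamp, event_type, description)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))
```

A deploy script could then call something like `send_event("deploy_vimeo.com", "deploy e8e5e4 initiated by Nicolas")` as its last step.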

Thoughts on incident nomenclature, severity levels and incident analysis

Because there are so many unique and often subtle pieces of information pertaining to each individual incident, it's often hard to map an incident to a simple severity level or keyword. When displaying events as popups on graphs, I think no severity levels are needed: the graphs and event descriptions convey much more than any severity level could.
However, I do think these levels are very useful for reporting and numerical analysis.
On slide 53 of the metametrics slidedeck Allspaw mentions severity levels, which can be paraphrased in terms of service degradation for the end user: 1 (full), 2 (major), 3 (minor), 4 (no).
I would like to extend this notion in the opposite direction and have similar levels on a positive scale, so that they represent positive incidents (like viral videos, press mentions, ...) as opposed to problematic ones (outages).
For incident analysis we need a rich nomenclature and schema: incidents can (presumably) have a positive or negative impact, can be self-induced or not, and can be categorized with severity levels; they can also be planned for (maintenance, release announcements) or not. While we're at it, how about events to mark the points in time where the cause was detected, as well as resolved, so we can calculate TTD (time to detect) and TTR (time to resolve) (see the Allspaw slide deck)?
Since basically any event can have a positive or negative impact, an option is to leave out the type 'incident' and give a severity field for every event type.
I'm thinking of a good nomenclature and a schema to express all this, as well as UI features to support this analysis. (By the way, notice how in common ops literature the word incident is usually associated with outages and bad things, while an incident can just as well be a positive event.)
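
To make the discussion a bit more concrete, here is one possible shape for such a record, written as a small Python sketch. It is only an illustration of the dimensions listed above, not a proposed final schema. All field names are hypothetical, and measuring both TTD and TTR from the start of the impact is an assumption; the slide deck may define them differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    description: str
    started: int                    # unix timestamp when the impact began
    detected: Optional[int] = None  # when the cause was detected
    resolved: Optional[int] = None  # when the incident was resolved
    severity: int = 4               # 1 (full), 2 (major), 3 (minor), 4 (no)
    positive: bool = False          # True for viral videos, press mentions, ...
    self_induced: bool = False
    planned: bool = False           # maintenance, release announcements, ...

    def ttd(self):
        """Seconds from start to detection, or None if not detected yet."""
        if self.detected is None:
            return None
        return self.detected - self.started

    def ttr(self):
        """Seconds from start to resolution, or None if still open."""
        if self.resolved is None:
            return None
        return self.resolved - self.started
```

Keeping the positive/negative flag separate from the severity level means the same 1 to 4 scale can be reused on both sides.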

I need your help

The TCP receiver works, the backend works, I have a crude (but functional) web app, and a simple HTTP API to retrieve the events in all kinds of formats. Next up are:
  • a monitoring dashboard for graphite that gathers events from anthracite, can show metadata, and can mark a timeframe between start and stop events
  • plugins for puppet and chef to automatically submit their relevant events
  • a better web UI that actually provides features for statistics and analysis on events, such as TTD and TTR, with colors for severity levels, etc.
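
For the "timeframe between start and stop events" item, a dashboard would need to pair up events from the flat event stream. A minimal Python sketch of that pairing follows. It assumes the naming convention from the fictive examples earlier (`incident_sev2_start` and `incident_sev2_resolved`); the function name and the suffix parameters are made up for this illustration.

```python
def pair_timeframes(events, start_suffix="_start", stop_suffix="_resolved"):
    """Match each start event with the next stop event of the same base type.

    events: iterable of (timestamp, type, description) tuples, in any order.
    Returns a list of (base_type, start_ts, stop_ts) tuples; stop_ts is None
    for timeframes that have started but not stopped yet.
    """
    open_starts = {}  # base type -> start timestamp still waiting for a stop
    frames = []
    for ts, event_type, _description in sorted(events):
        if event_type.endswith(start_suffix):
            base = event_type[: -len(start_suffix)]
            # Keep the earliest start if the same type is started twice.
            open_starts.setdefault(base, ts)
        elif event_type.endswith(stop_suffix):
            base = event_type[: -len(stop_suffix)]
            if base in open_starts:
                frames.append((base, open_starts.pop(base), ts))
    # Whatever is left has no matching stop event yet.
    frames.extend((base, ts, None) for base, ts in sorted(open_starts.items()))
    return frames
```

Events whose type ends in neither suffix (deploys, puppet runs) are simply ignored here and would still be drawn as single markers.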

Comments

Hi,
Have you heard of Shinken? It's a monitoring solution, a legacy of Nagios. It has a graphite module that can send metrics to it, and it can monitor stuff, send alerts, and do dashboards as well.

I don't understand the position of your future solution relative to Zabbix, Nagios, and other monitoring solutions. They exist and have been working pretty well for more than 10 years now. Why reinvent something that already works, and is free and open source?

Zabbix already has the nice computing feature. Shinken does not yet.

http://www.shinken-monitoring.org/
http://www.zabbix.com/

Best regards, Frédéric.
I see the point you are making ...

Being able to add stuff next to your metrics like
- Enabled X
- Killed Y
- Added more memory

Tinkering if you could hack something together with

https://github.com/ripienaar/graphite-graph-dsl

And have annotations in your gdash board..
Hi Dieter,

Adding Anthracite to our infrastructure now. Thanks a lot!

Would be nice to have both subject and description.

Have you thought about the ability to add an (optional) end date to events, and (also optional) times to the start and end dates? If you are going to use annotations on graphs in Kibana or Graphite, the time can be important to place them correctly.

End date/time can be used for outages, marketing campaigns, bug detected/fixed etc.

It would also be handy to be able to attach files to events, for example screenshots for A/B testing. So it becomes a tool for the company to track events.

Another thing is the ability to enter an event and get a curl query to add it programmatically on deploy to ElasticSearch. So you use Anthracite to enter all the data (date/time, subject, description, tags) and hit a button to generate a curl command to copy-paste into your deployment, similar to https://docs.newrelic.com/docs/insights/inserting-events

Last thing is "http://metrics20.org/": it would probably be nice to be able to create your metrics and store their descriptions in Anthracite too. Maybe even have an option to link events to metrics (optional), so you know which events were specifically set to be important for a particular metric. If we release inline images in comments (event) on 15 June 2014, and I have a metric 'Average comments posted per day' and I am adding a metric 'Average comments posted per day with inline images added', I would like to specifically say that this event is linked to these metrics.

Just interested in your thoughts about it :)

Thanks again.

