Anthracite, an event database to enrich monitoring dashboards and to allow visual and numerical analysis of events that have a business impact

Introduction

Graphite can show events such as code deploys and puppet changes as vertical markers on your graph. With the advent of new Graphite dashboards and interfaces where we can have popups and annotations to show metadata for each event (by means of client-side rendering), it's time we had a database to track all events along with categorisation and text descriptions (which can include rich text and hyperlinks). Graphite is meant for time series (metrics over time); Anthracite aims to be its companion for annotated events.
More precisely, Anthracite aims to be a database of "relevant events" (see further down), for the purpose of enriching monitoring dashboards, as well as allowing visual and numerical analysis of events that have a business impact (for the latter, see "Thoughts on incident nomenclature, severity levels and incident analysis" below).
It has a TCP receiver, a database (sqlite3), an HTTP interface to deliver event data in many formats, and a simple web frontend for humans.

design goals:

do one thing and do it well. aim for integration.
take inspiration from graphite:
- simple TCP protocol
- automatically create new event types as they are used
- run on port 2005 by default (carbon is 2003,2004)
- deliver events in various formats (html, raw, json, sqlite,...)
- stay out of the way
super easy to install and run: install dependencies, clone repo. the app is ready to run

I have a working prototype on github.com/Dieterbe/anthracite

About "relevant events"

I recommend you submit any event that has or might have a relevant effect on:

your application behavior
monitoring itself (for example, you fixed a bug in metrics reporting; it shouldn't look like the app behavior changed)
the business (press coverage, viral videos, etc), because this also affects your app usage and metrics.

Formats and conventions

The TCP receiver listens for lines in this format:

unix_timestamp type description

There are no restrictions on type and description, other than that they must be non-empty strings.
I do have some suggestions, which I'll demonstrate through fictitious examples (but note that there's room for improvement, see the section below):

# a deploy_* type for each project
ts deploy_vimeo.com "deploy e8e5e4 initiated by Nicolas -- github.com/Vimeo/main/compare/foobar..e8e5e4"
ts puppet "all nodes of class web_cluster_1: modified apache.conf; restart service apache"
ts incident_sev2_start "mysql2 crashed, site degraded"
ts incident_sev2_resolved "replaced db server"
ts incident "hurricane Sandy, systems unaffected but power outages among users, expect lower site usage"
# in those exceptional cases of manual production changes, try not to forget to add your event
ts manual_dieter "i have to try this firewall thing on the LB"
ts backup "backup from database slave vimeomysql22"
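
To make the line format concrete, here is a minimal Python sketch of a client. It is not taken from the Anthracite code: the function names format_event and send_event are made up for illustration, localhost is an assumed host, and port 2005 is the default mentioned in the design goals. Two more assumptions: the type must not contain whitespace (otherwise the space-separated line would be ambiguous), and the description is sent unquoted, since it isn't stated whether the quotes in the examples above are part of the wire format.

```python
import socket
import time


def format_event(event_type, description, timestamp=None):
    """Build one 'unix_timestamp type description' line (hypothetical helper)."""
    if not event_type or any(c.isspace() for c in event_type):
        # Assumption: the type is a single whitespace-free token.
        raise ValueError("type must be a non-empty string without whitespace")
    # A newline inside the description would split the event in two,
    # so fold multi-line descriptions onto one line.
    description = " ".join(description.splitlines()).strip()
    if not description:
        raise ValueError("description must be a non-empty string")
    if timestamp is None:
        timestamp = int(time.time())
    return f"{int(timestamp)} {event_type} {description}\n"


def send_event(line, host="localhost", port=2005):
    """Send one formatted line to the TCP receiver (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))
```

With a receiver running, a deploy script could then end with something like send_event(format_event("deploy_vimeo.com", "deploy e8e5e4 initiated by Nicolas")).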

Thoughts on incident nomenclature, severity levels and incident analysis

Because there are so many unique and often subtle pieces of information pertaining to each individual incident, it's often hard to map an incident to a simple severity level or keyword. When displaying events as popups on graphs, I think no severity levels are needed: the graphs and event descriptions convey much more than any severity level could.
However, I do think these levels are very useful for reporting and numerical analysis.
On slide 53 of the metametrics slidedeck Allspaw mentions severity levels, which can be paraphrased in terms of service degradation for the end user: 1 (full), 2 (major), 3 (minor), 4 (no).
I would like to extend this notion in the opposite direction and have similar levels on the positive scale, so that they represent positive incidents (like viral videos, press mentions, ...) as opposed to problematic ones (outages).
For incident analysis we need a rich nomenclature and schema: incidents can (presumably) have a positive or negative impact, can be self-induced or not, and can be categorized with severity levels; they can also be planned for (maintenance, release announcements) or not. While we're at it, how about events to mark the points in time when the cause was detected and when it was resolved, so we can calculate TTD (time to detect) and TTR (time to resolve) (see Allspaw slidedeck)?
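
As a rough illustration of that last idea, the Python sketch below derives TTD and TTR from a list of events. It rests on several assumptions: the _start and _resolved suffixes follow the fictitious examples earlier, the _detected suffix is invented here, both durations are measured from the start event, and only one incident per type prefix is open at a time. The rows stand in for whatever the event database would return.

```python
# Hypothetical (timestamp, type, description) rows.
events = [
    (1000, "incident_sev2_start", "mysql2 crashed, site degraded"),
    (1120, "incident_sev2_detected", "alert fired on query latency"),
    (2800, "incident_sev2_resolved", "replaced db server"),
    (3000, "deploy_vimeo.com", "unrelated event, ignored below"),
]


def incident_durations(events):
    """Pair each *_start event with the following *_detected / *_resolved events."""
    open_incidents = {}  # type prefix -> {"start": ts, "detected": ts}
    results = []
    for ts, event_type, _description in sorted(events):
        prefix, _, phase = event_type.rpartition("_")
        if phase == "start":
            open_incidents[prefix] = {"start": ts}
        elif phase == "detected" and prefix in open_incidents:
            # Keep the first detection only.
            open_incidents[prefix].setdefault("detected", ts)
        elif phase == "resolved" and prefix in open_incidents:
            incident = open_incidents.pop(prefix)
            detected = incident.get("detected")
            results.append({
                "incident": prefix,
                # TTD is unknown if no detection event was submitted.
                "ttd": None if detected is None else detected - incident["start"],
                "ttr": ts - incident["start"],
            })
    return results
```

For the rows above, incident_durations(events) yields a single entry for incident_sev2 with a TTD of 120 seconds and a TTR of 1800 seconds.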
Since basically any event can have a positive or negative impact, an option is to leave out the type 'incident' and give a severity field for every event type.
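
One possible shape for such a record is sketched below in Python. It is purely illustrative and not Anthracite's actual schema: the field names and the Impact enum are invented, and the severity numbers reuse the 1 (full) to 4 (no) levels paraphrased above, applied to either direction of impact.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Impact(Enum):
    NEGATIVE = "negative"  # outages, degradation
    POSITIVE = "positive"  # viral videos, press mentions


@dataclass(frozen=True)
class Event:
    timestamp: int
    type: str
    description: str
    impact: Optional[Impact] = None  # None: no notable impact recorded
    severity: Optional[int] = None   # 1 (full) .. 4 (no), in the direction of `impact`
    planned: bool = False            # maintenance, release announcements
    self_induced: bool = False

    def __post_init__(self):
        if self.severity is not None and self.severity not in (1, 2, 3, 4):
            raise ValueError("severity must be 1, 2, 3 or 4")
        if (self.impact is None) != (self.severity is None):
            raise ValueError("impact and severity must be set together")


# Usage with made-up events: any type can carry a severity, no 'incident' type needed.
outage = Event(1352000000, "db", "mysql2 crashed, site degraded", Impact.NEGATIVE, 2)
press = Event(1352100000, "press", "featured on a major news site", Impact.POSITIVE, 3)
```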
I'm thinking about a good nomenclature and a schema to express all this, as well as UI features to support this analysis. (By the way, notice how in common ops literature the word "incident" is usually associated with outages and bad things, while an incident can just as well be a positive event.)

I need your help

The TCP receiver works, the backend works, I have a crude (but functional) web app, and a simple HTTP API to retrieve the events in all kinds of formats. Next up are:

monitoring dashboard for graphite that gathers events from anthracite, can show metadata, and can mark a timeframe between start and stop events
plugins for puppet and chef to automatically submit their relevant events
a better web UI that actually provides features for statistics and analysis on events, such as TTD and TTR, with colors for severity levels, etc.
