Mon, 21 Jan 2013

Profiling and behavior testing of processes and daemons, and Devopsdays NYC

Profiling a process run

I wanted the ability to run a given process and get
a plot of key metrics (cpu usage, memory usage, disk i/o) throughout the duration of the process run.
Something light-weight with minimal dependencies so I can easily install it on a server for a one-time need.
Couldn't find a tool for it, so I wrote profile-process
which does exactly that in <100 lines of python.
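
To illustrate the idea (a minimal sketch, not the actual profile-process code), the following samples the resident memory of a child process at a fixed interval. It assumes a Linux /proc filesystem; the names read_rss_kb and profile are made up for this example, and CPU and disk I/O sampling as well as plotting are left out.

```python
import subprocess
import sys
import time


def read_rss_kb(pid):
    """Return resident memory in kB from /proc/<pid>/status, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        # Process already gone, or no /proc on this platform.
        pass
    return None


def profile(command, interval=0.1):
    """Run command and return (exit code, [(seconds_since_start, rss_kb), ...])."""
    samples = []
    start = time.monotonic()
    proc = subprocess.Popen(command)
    while proc.poll() is None:
        rss = read_rss_kb(proc.pid)
        if rss is not None:
            samples.append((round(time.monotonic() - start, 3), rss))
        time.sleep(interval)
    return proc.returncode, samples


if __name__ == "__main__":
    code, samples = profile([sys.executable, "-c", "import time; time.sleep(0.5)"])
    for elapsed, rss_kb in samples:
        print(elapsed, rss_kb)
```

The samples list is what you would hand to a plotting tool; other metrics could be collected in the same loop.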

black-box behavior testing processes/daemons

I wrote simple-black-box to do this.
It runs the subject(s) in a crafted sandbox, sends input (http requests, commands, ...)
and lets you make assertions on http/statsd requests/responses, network listening state, processes running, log entries,
file existence/checksums in the VFS/swift clusters, etc.
Each test-case is a scenario.
It can also use logstash to give a centralized "distributed stack trace" when you need to debug a failure involving multiple processes that interact and act upon received messages, or to compare behavior across different scenario runs.
You can integrate this with profile-process to compare runtime behaviors across testcases/scenarios.
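
As an example of the kind of assertion meant here, this is a small sketch of a "network listening state" check. It is not taken from simple-black-box; the function name is_listening is hypothetical and the check only covers TCP.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Start a throwaway listener on a free port to stand in for the daemon under test.
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    assert is_listening("127.0.0.1", port)
    server.close()
    assert not is_listening("127.0.0.1", port)
```

A scenario would call such a check after starting the subject, and again after stopping it.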

Simple-black-box talk @ Devopsdays NYC

I gave a quick 5-minute talk; despite some display/timing issues it was well received. (In particular I got some really positive feedback from one person, and I still wonder whether that was a recruiter attempting to hire me but being shy about it. It was quite awkward.)
slides
raw uncut video. Go to 'New York, January 18th, 2013' from 02:36:25 to 02:41:15

More random thoughts about Devopsdays NYC

  • I'm getting tired of people on stage making a big revelation out of adding an index to a database column. This happens too often at web-ops/devops conferences, it's embarrassing. But at least it's not like the "how we made our site 1000x faster"-style Velocity talks that should have been named "caching and query optimization for newbies"
  • Paperless post confirms again they got their act together and keeps us up to date with their great work. Follow them.
  • Knights of the Provisioning Round Table - Bare Metal Provisioning was mostly (to my surprise) 4 individuals presenting their solution instead of a real round-table, but (to my surprise again) they were not as similar/repetitive as I expected, and the pros/cons of all solutions were compared in more depth than I dared to hope. I covered Dell Crowbar before and like it, though I wonder when this thing is actually going to be reliable.
  • Dave Zwieback and John Willis gave hilarious talks
  • Tried to start an open space discussion around collaboration patterns and anti-patterns, which I think is a very interesting subject, because how individuals in a team collaborate is crucial to success, yet very little is written about it (that I could find). I would hope we can distill people's years of aggregate experience into concise patterns and anti-patterns, and document how well (or not) they work for different development styles (such as agile/scrum), team sizes, company structures/hierarchies, roadmap/technical debt pressure, etc. Especially in light of any of these things changing, because I've found people can be very change-resistant.
  • DevOps At Obama for America & The Democratic National Committee was good. I thought it would be a rehash of what was said at Coding Forward New York City: Meet the Developers Behind the Obama Campaign, but there were a bunch of interesting insights about state-of-the-art technology in there (mostly Amazon stuff)
  • A bunch of talks where the same could have been said in half the time, or less
Random thoughts about some sponsors:
  • Librato is quite cool. It's basically what my open source tool graph-explorer would look like after finishing a bunch of TODOs, combining it with graphite, polishing it all up, and offering it as a hosted solution. I'm curious whether this can be a successful business with such a limited scope
  • Even cooler is datadog. It goes beyond just metrics and doesn't just provide hosted graphing, it provides a solution for a philosophy that aims for a centralized insight of all your operational data, related collaboration and prioritized alerts that are to the point. They get a lot of things right, the open source world has some catching up to do
Interesting that both use Cassandra and free-form tags for flexibility, validating the approach I'm taking with graph-explorer. Now Graphite could use a distributed metrics storage backend over which one can do map-reduce style jobs to gather intelligence from metrics archives (maybe based on Cassandra too?), but that's another story.

Anyway, living in NYC with its vibrant ecosystem of devops people and companies organizing plenty of meet-ups and talks on their own makes it less pressing to have an event like Devopsdays, though it was certainly a good event, thanks to the sponsors and the volunteers.

Wed, 09 Jan 2013

Graph-Explorer: A graphite dashboard unlike any other

The above sounds like a marketing phrase and I'm just as skeptical of those as you are, but I feel it's appropriate here. Not because GE is necessarily better, but it's certainly different.

In web operations, mostly when troubleshooting but also for capacity planning, I often find myself having very specific information needs from my time-series, and these information needs vary a lot over time. This usually means I need to correlate or compare things that no one anticipated. Things that relate to specific machines, specific services across machines, or a few specific metrics of which only the ops team knows how they are related and cross various scopes (application, network, system, etc).
I should have an easy way to filter metrics by any information in the metric's name or values.
I should be able to group metrics into graphs the way I want. (Example: when viewing filesystem usage of servers, I should be able to group by server (one graph per server listing the filesystems), but also by mountpoint (to compare servers on one graph).)
I should be able -with minimal effort- to view metrics by their gauge/count value, but also by their rate of change and where appropriate, as a percentage of a maximum (like diskspace used).
It should be trivial to manipulate the graph interactively (toggling things on/off, switching between lines/stacked mode, inspecting datapoints, zooming, ...).
It should show me all events, colorcoded by type, with text description, and interactive so that it can use hyperlinks.
And most of all, the code should be as simple as possible and it should be easy to get running.

Dashboards which show specific predefined KPI's (this covers most graphite dashboards) are clearly unsuitable for this use case. Template-based "metric exploration" dashboards like cacti and ganglia are in my experience way too limited. Graph composing dashboards (like the stock graphite one, or graphiti) require much manual work to get the graph you want. I couldn't find anything even close to what I wanted, so I started Graph-Explorer.

The approach I'm taking is using plugins which add metadata to metrics (tags for server, service, mountpoint, interface name, ...), having them define how to render as a count, as a rate, as a percent of some max allowed value (or a metric containing the max), and providing a query language which allows you to match/filter metrics, group them into graphs by tag, and render them how you want them. The plugins promote standardized metric naming and reuse across organisations, not least because most correspond to plugins for the Diamond monitoring agent.

Furthermore, because it uses my graphitejs plugin (which now btw supports flot as a backend for fast canvas-based graphs and annotated events from anthracite) the manual interactions mentioned earlier are supported or at least on the roadmap.

Graph Explorer is not yet where I want it, but it's already a very useful tool at Vimeo.

Fri, 03 Sep 2010

What the open source community can learn from Devops

Being active as both a developer and ops person in my professional life, and both an open source developer and packager in my spare time, I noticed some common ground between both worlds, and I think the open source community can learn from the Devops movement, which is solving problems in the professional tech world.

For the sake of getting a point across, I'll simplify some things.

First, a crash course on Devops...




Thu, 04 Apr 2013

A few common graphite problems and how they are already solved.

metrics often seem to lack details, such as units and metric types

looking at a metric name, it's often hard to know
  • the unit a metric is measured in (bits, queries per second, jiffies, etc)
  • the "type" (a rate, an ever increasing counter, gauge, etc)
  • the scale/prefix (absolute, relative, percentage, mega, milli, etc)
structured_metrics solves this by adding these tags to graphite metrics:
  • what
    what is being measured?: bytes, queries, timeouts, jobs, etc
  • target_type
    must be one of the existing clearly defined target_types (count, rate, counter, gauge)
    These match statsd metric types (i.e. rate is per second, count is per flushInterval)
In Graph-Explorer these tags are mandatory, so that it can show the unit along with the prefix (e.g. 'Gb/s') on the axis.
This will also allow you to request graphs in a different unit, and the dashboard will know how to convert (say, Mbps to GB/day).
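
To make the Mbps to GB/day example concrete, here is a small sketch of such a conversion. It is not Graph-Explorer's code: the function name convert_rate, the (prefix, unit, period) triples and the use of decimal (SI) prefixes are all assumptions for illustration.

```python
# Decimal (SI) prefixes are an assumption of this sketch.
PREFIXES = {"": 1, "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}
BITS_PER_UNIT = {"b": 1, "B": 8}  # b = bit, B = byte
SECONDS_PER_PERIOD = {"s": 1, "min": 60, "h": 3600, "d": 86400}


def convert_rate(value, src, dst):
    """Convert a rate between (prefix, unit, period) triples, e.g. ("M", "b", "s")."""
    src_prefix, src_unit, src_period = src
    dst_prefix, dst_unit, dst_period = dst
    # Normalize to bits per second first, then scale to the requested unit.
    bits_per_second = (
        value * PREFIXES[src_prefix] * BITS_PER_UNIT[src_unit]
        / SECONDS_PER_PERIOD[src_period]
    )
    return (
        bits_per_second * SECONDS_PER_PERIOD[dst_period]
        / (PREFIXES[dst_prefix] * BITS_PER_UNIT[dst_unit])
    )


if __name__ == "__main__":
    # 100 Mb/s expressed in GB/day
    print(convert_rate(100, ("M", "b", "s"), ("G", "B", "d")))
```

This kind of conversion is only possible because the unit and target type are known for every metric.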

tree navigation/querying is cumbersome, metrics search is broken. How do I organize the tree anyway?

The tree is a simplistic model. There is simply too much dimensionality that can't be expressed in a flat tree. There's no way you can organize it so that it will satisfy all later needs. A tag space like structured_metrics makes it obsolete. With Graph-Explorer you can do (full-text) search on metric names, by any of their tags, and/or by added metadata. So practically you can filter by things like server, service, unit (e.g. anything expressed in bits/bytes per second, or anything denoting errors). All this irrespective of the source of a metric or the "location in the tree".

no interactivity with graphs

timeserieswidget allows you to easily add interactive graphite graph objects to html pages. You get modern features like togglable/reorderable metrics, realtime switching between lines/stacked, information popups on hover, highlighting, smoothing, and (WIP) realtime zooming. It has a canvas (flot) and svg (rickshaw/d3) backend. So it basically provides a simpler api to use these libraries specifically with graphite.
There's a bunch of different graphite dashboards with different takes on graph composition/configuration and workflow, but the actual rendering of graphs usually comes down to plotting some graphite targets with a legend. timeserieswidget aims to be a drop-in plugin that brings all modern features so that different dashboards can benefit from a common, shared codebase, because static PNGs are a thing of the past.

screenshot:

events lack text annotations, they are simplistic and badly supported

Graphite is a great system for time series metrics. Not for events. Metrics and events are very different things across the board. drawAsInfinite() is a bit of a hack.
  • anthracite is designed specifically to manage events.
    It brings extra features such as different submission scripts, outage annotations, various ways to see events and reports with uptime/MTTR/etc metrics.
  • timeserieswidget displays your events on graphs along with their metadata (which can be just some text or even html code).
    this is where client side rendering shines

screenshots:

cumbersome to compose graphs

There's basically two approaches:
  • interactive composing: with the graphite composer, you navigate through the tree and apply functions. This is painful; dashboards like descartes and graphiti can make this easier
  • use a dashboard that uses predefined templates (gdash and others). They often impose a strict navigation path to reach pages which may or may not give you the information you need (usually less or way more)
With both approaches, you usually end up with an ever growing pile of graphs that you created and then keep for reference.
This becomes unwieldy but is useful for various use cases and needs.
However, neither approach is convenient for changing information needs.
Especially when troubleshooting, one day you might want to compare the rate of increase of open file handles on a set of specific servers to the traffic on given network switches, the next day it's something completely different.
With Graph-Explorer:
  • GE gives you a query interface on top of structured_metrics' tag space. This enables a bunch of things (see above)
  • you can yield arbitrary targets for each metric, to look at the same thing from a different angle (e.g. as a rate with `derivative()` or as a daily summary), and you can of course filter by angle
  • You can group metrics into graphs by arbitrary tags (e.g. you can see bytes used of all filesystems on a graph per server, or compare servers on a graph per filesystem). This feature gets a "wow that's really cool" every time I show it
  • GE includes 'what' and 'target_type' in the group_by tags by default so basically, if things are in a different unit (B/s vs B vs b etc) it'll put them in separate graphs (controllable in query)
  • GE automatically generates the graph title and vertical title (always showing the 'what' and the unit), and shows all metrics' extra tags. This also gives you a lot of inspiration to modify or extend your query
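
The grouping described in the list above can be sketched in a few lines. This is not Graph-Explorer's implementation; the metric dictionaries, the id strings and the function name group_by are hypothetical, and only the idea of always including 'what' and 'target_type' in the grouping key comes from the text.

```python
from collections import defaultdict


def group_by(metrics, tags, defaults=("what", "target_type")):
    """Group metric ids into graphs keyed by the values of the given tags."""
    graphs = defaultdict(list)
    for metric in metrics:
        # Default tags come first so that different units never share a graph.
        key = tuple(metric["tags"].get(tag) for tag in (*defaults, *tags))
        graphs[key].append(metric["id"])
    return dict(graphs)


# Made-up filesystem usage metrics for two servers.
metrics = [
    {"id": "web1:/", "tags": {"server": "web1", "mountpoint": "/", "what": "bytes", "target_type": "gauge"}},
    {"id": "web1:/var", "tags": {"server": "web1", "mountpoint": "/var", "what": "bytes", "target_type": "gauge"}},
    {"id": "web2:/", "tags": {"server": "web2", "mountpoint": "/", "what": "bytes", "target_type": "gauge"}},
    {"id": "web2:/var", "tags": {"server": "web2", "mountpoint": "/var", "what": "bytes", "target_type": "gauge"}},
]

if __name__ == "__main__":
    # One graph per server, listing its filesystems.
    print(group_by(metrics, ["server"]))
    # One graph per mountpoint, comparing servers.
    print(group_by(metrics, ["mountpoint"]))
```

Switching between the two views is just a matter of changing the tag list, which is the point of working in a tag space rather than a tree.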

limited options to request a specific time range

GE's query language supports freeform `from` and `to` clauses.

Referenced projects

  • anthracite:
    event/change logging/management with a bunch of ingestion scripts and outage reports
  • timeserieswidget:
    jquery plugin to easily get highly interactive graphite graphs onto html pages (dashboards)
  • structured_metrics:
    python library to convert graphite metrics tree into a tag space with clearly defined units and target types, and arbitrary metadata.
  • graph-explorer:
    dashboard that provides a query language so you can easily compose graphs on the fly to satisfy varying information needs.
All tools are designed for integration with other tools and each other. Timeserieswidget gets data from anthracite, graphite and elasticsearch. Graph-Explorer uses structured_metrics and timeserieswidget.

Future work

There's a whole lot going on in the monitoring space, but I'd like to highlight a few things I personally want to work more on:
  • I spoke with Michael Leinartas at Monitorama (and there's also a launchpad thread). We agreed that native tags in graphite are the way forward. This will address some of the pain points I'm already fixing with structured_metrics but in a more native way. I envision submitting metrics would move from:
    stats.serverdb123.mysql.queries.selects 895 1234567890
    
    to something more along these lines:
    host=serverdb123 service=mysql type=select what=queries target_type=rate 895 1234567890
    host=serverdb123 service=mysql type=select unit=Queries/s 895 1234567890
    h=serverdb123 s=mysql t=select queries r 895 1234567890
    
  • switch Anthracite backend to ElasticSearch for native integration with logstash data (and allow you to use kibana)
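
As a rough illustration of the tagged submission format proposed in the first item above, here is a sketch of a parser for the key=value variants. Nothing like this exists in graphite as far as this post goes; the function name parse_tagged_metric is hypothetical, and the abbreviated third variant with bare words is deliberately not handled.

```python
def parse_tagged_metric(line):
    """Split 'k1=v1 k2=v2 ... <value> <timestamp>' into (tags, value, timestamp)."""
    *tag_parts, value, timestamp = line.split()
    # Each tag must be key=value; a bare word raises ValueError here.
    tags = dict(part.split("=", 1) for part in tag_parts)
    return tags, float(value), int(timestamp)


if __name__ == "__main__":
    line = "host=serverdb123 service=mysql type=select what=queries target_type=rate 895 1234567890"
    tags, value, timestamp = parse_tagged_metric(line)
    print(tags["host"], tags["what"], tags["target_type"], value, timestamp)
```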

Mon, 12 Nov 2012

Anthracite, an event database to enrich monitoring dashboards and to allow visual and numerical analysis of events that have a business impact

Introduction

Graphite can show events such as code deploys and puppet changes as vertical markers on your graph. With the advent of new graphite dashboards and interfaces where we can have popups and annotations to show metadata for each event (by means of client-side rendering), it's time we have a database to track all events along with categorisation and text descriptions (which can include rich text and hyperlinks). Graphite is meant for time series (metrics over time), Anthracite aims to be the companion for annotated events.
More precisely, Anthracite aims to be a database of "relevant events" (see further down), for the purpose of enriching monitoring dashboards, as well as allowing visual and numerical analysis of events that have a business impact (for the latter, see "Thoughts on incident nomenclature, severity levels and incident analysis" below)
It has a TCP receiver, a database (sqlite3), a http interface to deliver event data in many formats and a simple web frontend for humans.

design goals:

  • do one thing and do it well. aim for integration.
  • take inspiration from graphite:
    • simple TCP protocol
    • automatically create new event types as they are used
    • run on port 2005 by default (carbon is 2003,2004)
    • deliver events in various formats (html, raw, json, sqlite,...)
    • stay out of the way
  • super easy to install and run: install dependencies, clone repo. the app is ready to run
I have a working prototype on github.com/Dieterbe/anthracite

About "relevant events"

I recommend you submit any event that has or might have a relevant effect on:

  • your application behavior
  • monitoring itself (for example you fixed a bug in metrics reporting. it shouldn't look like the app behavior changed)
  • the business (press coverage, viral videos, etc), because this also affects your app usage and metrics.

Formats and conventions

The TCP receiver listens for lines in this format:

<unix_timestamp> <type> <description>

There are no restrictions for type and description, other than that they must be non-empty strings.
I do have some suggestions, which I'll demonstrate through fictitious examples
(but note that there's room for improvement; see the section below):

# a deploy_* type for each project
ts deploy_vimeo.com "deploy e8e5e4 initiated by Nicolas -- github.com/Vimeo/main/compare/foobar..e8e5e4"
ts puppet "all nodes of class web_cluster_1: modified apache.conf; restart service apache"
ts incident_sev2_start "mysql2 crashed, site degraded"
ts incident_sev2_resolved "replaced db server"
ts incident "hurricane Sandy, systems unaffected but power outages among users, expect lower site usage"
# in those exceptional cases of manual production changes, try to not forget adding your event
ts manual_dieter "i have to try this firewall thing on the LB"
ts backup "backup from database slave vimeomysql22"
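
For submitting events from a script, a minimal sketch could look like the following. It assumes the receiver accepts newline-terminated lines on port 2005 of localhost; the function names format_event and send_event are made up for this example and are not part of Anthracite.

```python
import socket
import time


def format_event(event_type, description, timestamp=None):
    """Build one '<unix_timestamp> <type> <description>' line."""
    if not event_type or not description:
        raise ValueError("type and description must be non-empty strings")
    if timestamp is None:
        timestamp = int(time.time())
    return f"{int(timestamp)} {event_type} {description}"


def send_event(line, host="localhost", port=2005):
    """Send one event line over TCP (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall((line + "\n").encode("utf-8"))


if __name__ == "__main__":
    line = format_event("puppet", "modified apache.conf; restart service apache")
    print(line)
    # send_event(line)  # uncomment when an Anthracite receiver is running
```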

Thoughts on incident nomenclature, severity levels and incident analysis

Because there are so many unique and often subtle pieces of information pertaining to each individual incident, it's often hard to map an incident to a simple severity level or keyword. When displaying events as popups on graphs I think no severity levels are needed, the graphs and event descriptions are much more clear than any severity level could convey.
However, I do think these levels are very useful for reporting and numerical analysis.
On slide 53 of the metametrics slidedeck Allspaw mentions severity levels, which can be paraphrased in terms of service degradation for the end user: 1 (full), 2 (major), 3 (minor), 4 (no).
I would like to extend this notion into the opposite spectrum, and have similar levels on the positive scale, so that they represent positive incidents (like viral videos, press mentions, ...) as opposed to problematic ones (outages).
For incident analysis we need a rich nomenclature and schema: incidents can (presumably) have a positive or negative impact, can be self-induced or not, and can be categorized with severity levels; they can also be planned for (maintenance, release announcements) or not. While we're at it, how about events to mark the points in time where the cause was detected, as well as resolved, so we can calculate TTD and TTR (see Allspaw slidedeck)?
Since basically any event can have a positive or negative impact, an option is to leave out the type 'incident' and give a severity field for every event type.
I'm thinking of a good nomenclature and a schema to express all this, as well as UI features to support this analysis. (By the way, notice how in common ops literature the word incident is usually associated with outages and bad things, while an incident can just as well be a positive event.)
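
To show how start/detected/resolved marker events would feed such an analysis, here is a small sketch. The event type names and the function incident_times are hypothetical, and the definitions used (TTD = detection minus start, TTR = resolution minus detection) are an assumption, since definitions vary.

```python
def incident_times(events):
    """Compute (TTD, TTR) in seconds from (unix_timestamp, type) tuples of one incident."""
    when = {}
    for timestamp, event_type in events:
        # Keep the first occurrence of each marker.
        when.setdefault(event_type, timestamp)
    ttd = when["incident_detected"] - when["incident_start"]
    ttr = when["incident_resolved"] - when["incident_detected"]
    return ttd, ttr


if __name__ == "__main__":
    events = [
        (1352700000, "incident_start"),
        (1352700300, "incident_detected"),
        (1352703900, "incident_resolved"),
    ]
    ttd, ttr = incident_times(events)
    print(f"TTD: {ttd}s, TTR: {ttr}s")
```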

I need your help

The tcp receiver works, the backend works, I have a crude (but functional) web app, and a simple http api to retrieve the events in all kinds of formats. Next up are:
  • monitoring dashboard for graphite that gathers events from anthracite, can show metadata, and can mark a timeframe between start and stop events
  • plugins for puppet, chef to automatically submit their relevant events
  • a better web UI and actually provide features to do statistics on events and analysis such as TTD, TTR, with colors for severity levels etc

Wed, 14 Nov 2012

Client-side rendered graphite charts for all

Client-side rendering of charts as opposed to using graphite's server side generated png's allows various interactivity features, such as:

  • interactive realtime zooming and panning of the graph, timeline sliders
  • realtime switching between various rendering modes (lines, stacked, etc)
  • toggling certain targets on/off, reordering them, highlighting their plot when hovering over the legend, etc
  • basic data manipulation, such as smoothing to see how averages compare (akin to movingAverage, but now interactive)
  • popups detailing the exact metrics when hovering over the chart's datapoints
  • popups for annotated events. (a good use for anthracite).

Those are all features of charting libraries such as flot and rickshaw, the only remaining work is creating a library that unleashes the power of such a framework, integrates it with the graphite api datasource, and makes it available over a simple but powerful api.

That's what I'm trying to achieve with github.com/Dieterbe/graphitejs.
It's based on rickshaw.
It gives you a JavaScript api to which you specify your graphite targets and some options and it'll give you your graph with extra interactivity sauce. Note that graphite has a good and rich api, one that is widely known and understood, that's why I decided to keep it exposed and not abstract any more than needed.

There are many graphite dashboards with a different focus, but as far as plotting graphs goes, what they need is usually very similar: a plot to draw multiple graphite targets, a legend, an x-axis, 1 or 2 y-axes, lines or stacked bands. So I think there's a lot of value in having many dashboard projects share the same code for the actual charts, and I hope we can work together on this.

Wed, 02 May 2012

Dell crowbar openstack swift

Learned about Dell Crowbar the other day. It seems to be (becoming) a tool I've wanted for quite a while, because it takes automating physical infrastructure to a new level, and is also convenient on virtual.


Sun, 24 Mar 2013

Hi Planet Devops and Infratalk

This blog just got added to planet devops and infra-talk, so for my new readers: you might know me as Dieterbe on irc, github or twitter. Since my move from Belgium to NYC (to do backend stuff at Vimeo) I've started writing more about devops-y topics (whereas I used to write more about general hacking and arch linux release engineering and (automated) installations). I'll mention some earlier posts you might be interested in: FWIW, I'm attending Monitorama next weekend in Boston.