Practical fault detection: redux. Next-generation alerting now as presentation

This summer I had the opportunity to present my practical fault detection concepts and hands-on approach as conference presentations.

First at Velocity and then at SRECon16 Europe. The latter page also contains the recorded video.


If you’re interested at all in tackling non-trivial timeseries alerting use cases (e.g. working with seasonal or trending data) this video should be useful to you.

It’s basically me trying to convey in a concrete way why I think the big-data and math-centered algorithmic approaches come with a variety of problems making them unrealistic and unfit, whereas the real breakthroughs happen when tools recognize the symbiotic relationship between operators and software, and focus on supporting a collaborative, iterative process to managing alerting over time. There should be a harmonious relationship between operator and monitoring tool, leveraging the strengths of both sides, with minimal factors harming the interaction. From what I can tell, bosun is pioneering this concept of a modern alerting IDE and is far ahead of other alerting tools in terms of providing high alignment between alerting configuration, the infrastructure being monitored, and individual team members, which are all moving targets, often even fast moving. In my experience this results in high signal/noise alerts and a happy team. (according to Kyle, the bosun project leader, my take is a useful one)

That said, figuring out the tool and using it properly has been, and remains, rather hard. I know many who rather not fight the learning curve. Recently the bosun team has been making strides at making it easier for newcomers - e.g. reloadable configuration and Grafana integration - but there is lots more to do. Part of the reason is that some of the UI tabs aren’t implemented for non-opentsdb databases and integrating Graphite for example into the tag-focused system that is bosun, is bound to be a bit weird. (that’s on me)

For an interesting juxtaposition, we released Grafana v4 with alerting functionality which approaches the problem from the complete other side: simplicity and a unified dashboard/alerting workflow first, more advanced alerting methods later. I’m doing what I can to make the ideas of both projects converge, or at least make the projects take inspiration from each other and combine the good parts. (just as I hope to bring the ideas behind graph-explorer into Grafana, eventually…)

Note: One thing that somebody correctly pointed out to me, is that I’ve been inaccurate with my terminology. Basically, machine learning and anomaly detection can be as simple or complex as you want to make it. In particular, what we’re doing with our alerting software (e.g. bosun) can rightfully also be considered machine learning, since we construct models that learn from data and make predictions. It may not be what we think of at first, and indeed, even a simple linear regression is a machine learning model. So most of my critique was more about the big data approach to machine learning, rather than machine learning itself. As it turns out then the key to applying machine learning successfully is tooling that assists the human operator in every possible way, which is what IDE’s like bosun do and how I should have phrased it, rather than presenting it as an alternative to machine learning.