Graphite, and the storage Achilles heel
Graphite is a neat timeseries metrics storage system that comes with a powerful querying api, mainly due to the whole bunch of available processing functions
For medium to large setups, the storage aspect quickly becomes a pain point. Whisper, the default graphite storage format, is a simple storage format, using one file per metric (timeseries).
- It can't keep all file descriptors in memory so there's a lot of overhead in constantly opening, seeking, and closing files, especially since usually one write comes in for all metrics at the same time.
- Using the rollups feature (different data resolutions based on age) causes a lot of extra IO.
- The format is also simply not optimized for writes. Carbon, the storage agent that sits in front of whisper has a feature to batch up writes to files to make them more sequential but this doesn't seem to help much.
- Worse, due to various implementation details the carbon agent is surprisingly inefficient and even cpu-bound. People often run into cpu limitations before they hit the io bottleneck. Once the writeback queue hits a certain size, carbon will blow up.
Common recommendations are to run multiple carbon agents
running graphite on SSD drives
If you want to scale out across multiple systems, you can get carbon to shard metrics across multiple nodes, but the complexity can get out of hand and manually maintaining a cluster where nodes get added, fail, get phased out, need recovery, etc involves a lot of manual labor even though carbonate
makes this easier. This is a path I simply don't want to go down.
These might be reasonable solutions based on the circumstances (often based on short-term local gains), but I believe as a community we should solve the problem at its root, so that everyone can reap the long term benefits.
In particular, running Ceres instead of whisper
, is only a slight improvement, that suffers from most of the same problems. I don't see any good reason to keep working on Ceres, other than perhaps that it's a fun exercise. This probably explains the slow pace of development.
However, many mistakenly believe Ceres is "the future".
Switching to LevelDB
seems much more sensible but IMHO still doesn't cut it as a general purpose, scalable solution.
The ideal backend
I believe we can build a backend for graphite that
- can easily scale from a few metrics on my laptop in power-save mode to millions of metrics on a highly loaded cluster
- supports nodes joining and leaving at runtime and automatically balancing the load across them
- assures high availability and heals itself in case of disk or node failures
- is simple to deploy. think: just run an executable that knows which directories it can use for storage, elasticsearch-style automatic clustering, etc.
- has the right read/write optimizations. I've never seen a graphite system that is not write-focused, so something like LSM trees seems to make a lot of sense.
- can leverage cpu resources (e.g. for compression)
- provides a more natural model for phasing out data. Optional, runtime-changeable rollups. And an age limit (possibly, but not necessarily round robin)
While we're at it. pub-sub for realtime analytics would be nice too. Especially when it allows to use the same functions as the query api.
And getting rid of the metric name restrictions such as inability to use dots or slashes. Efficient sparse series support would be nice too.
There's a lot of databases that you could hook up to graphite.
riak, hdfs based (opentsdb), Cassandra based (kairosdb, blueflood, cyanite), etc.
Some of these are solid and production ready, and would make sense depending on what you already have and have experience with.
I'm personally very interested in playing with Riak, but decided to choose InfluxDB as my first victim.
InfluxDB is a young project that will need time to build maturity, but is on track to meet all my goals very well.
In particular, installing it is a breeze (no dependencies), it's specifically built for timeseries (not based on a general purpose database),
which allows them to do a bunch of simplifications and optimizations, is write-optimized, and should meet my goals for scalability, performance, and availability well.
And they're in NYC so meeting up for lunch has proven to be pretty fruitful for both parties. I'm pretty confident that these guys can pull off something big.
Technically, InfluxDB is a "timeseries, metrics, and analytics" databases with use cases well beyond graphite and even technical operations.
Like the alternative databases, graphite-like behaviors such as rollups management and automatically picking the series in the most appropriate resolutions, is something to be implemented on top of it. Although you never know, it might end up being natively supported.
Graphite + InfluxDB
InfluxDB developers plan to implement a whole bunch of processing functions (akin to graphite, except they can do locality optimizations) and add a dashboard that talks to InfluxDB natively (or use Grafana
), which means at some point you could completely swap graphite for InfluxDB.
However, I think for quite a while, the ability to use the Graphite api, combine backends, and use various graphite dashboards is still very useful.
So here's how my setup currently works:
carbon-relay-ng is a carbon relay in Go.
It's a pretty nifty program to partition and manage carbon metrics streams. I use it in front of our traditional graphite system, and have it stream - in realtime - a copy of a subset of our metrics into InfluxDB. This way I basically have our unaltered Graphite system, and in parallel to it, InfluxDB, containing a subset of the same data.
With a bit more work it will be a high performance alternative to the python carbon relay, allowing you to manage your streams on the fly.
It doesn't support consistent hashing, because CH should be part of a strategy of a highly available storage system (see requirements above), using CH in the relay still results in a poor storage system, so there's no need for it.
- I contributed the code to InfluxDB to make it listen on the carbon protocol. So basically, for the purpose of ingestion, InfluxDB can look and act just like a graphite server. Anything that can write to graphite, can now write to InfluxDB. (assuming the plain-text protocol, it doesn't support the pickle protocol, which I think is a thing to avoid anyway because almost nothing supports it and you can't debug what's going on)
- graphite-api is a fork/clone of graphite-web, stripped of needless dependencies, stripped of the composer. It's conceived for many of the same reasons behind graphite-ng (graphite technical debt, slow development pace, etc) though it doesn't go to such extreme lengths and for now focuses on being a robust alternative for the graphite server, api-compatible, trivial to install and with a faster pace of development.
That's where graphite-influxdb comes in. It hooks InfluxDB into graphite-api, so that you can query the graphite api, but using data in InfluxDB.
It should also work with the regular graphite, though I've never tried. (I have no incentive to bother with that, because I don't use the composer. And I think it makes more sense to move the composer into a separate project anyway).
With all these parts in place, I can run our dashboards next to each other - one running on graphite with whisper, one on graphite-api with InfluxDB - and simply look whether the returned data matches up, and which dashboards loads graphs faster.
Later i might do more extensive benchmarking and acceptance testing.
If all goes well, I can make carbon-relay-ng fully mirror all data, make graphite-api/InfluxDB the primary, and turn our old graphite box into a live "backup".
We'll need to come up with something for rollups and deletions of old data (although it looks like by itself influx is already more storage efficient than whisper too), and I'm really looking forward to the InfluxDB team building out the function api, having the same function api available for historical querying as well as realtime pub-sub. (my goal used to be implementing this in graphite-ng and/or carbon-relay-ng, but if they do this well, I might just abandon graphite-ng)
To be continued..