InfluxDB as a graphite backend, part 2



Updated Oct 1, 2014 with a new Disk space efficiency section, which fixes some mistakes and adds more clarity.

The Graphite + InfluxDB series continues.

  • In part 1, "On Graphite, Whisper and InfluxDB" I described the problems of Graphite's whisper and ceres, why I disagree with the common Graphite clustering advice as the right path forward, what a great timeseries storage system would mean to me, and why InfluxDB - despite being the youngest project - is my main interest right now. I also introduced my approach for combining both and leveraging their respective strengths: InfluxDB as an ingestion and storage backend (and at some point, realtime processing and pub-sub) and Graphite for its renowned data processing-on-retrieval functionality. Furthermore, I introduced some tooling: carbon-relay-ng to easily route streams of carbon data (metric datapoints) to storage backends, which allows me to send production data to carbon+whisper and InfluxDB in parallel, and graphite-api, the simpler Graphite API server, with graphite-influxdb to fetch data from InfluxDB.
  • Not Graphite related, but I wrote influx-cli, which I introduced here. It makes it easy to interface with InfluxDB and to measure the duration of operations, which will become useful for this article.
  • In the Graphite & Influxdb intermezzo I shared a script to import whisper data into InfluxDB and noted some write performance issues I was seeing, but the better part of the article described the various improvements done to carbon-relay-ng, which is becoming an increasingly versatile and useful tool.
  • In part 2, which you are reading now, I'm going to describe recent progress, share more info about my setup and testing results, and cover the state of affairs and ideas for future work.

Progress made

  • InfluxDB saw two major releases:
    • 0.7 (and follow-ups), which was mostly about some needed features and bug fixes
    • 0.8 was all about getting some major refactorings into the hands of early adopters/testers: support for multiple storage engines, configurable shard spaces, rollups and retention schemes. There was some other useful stuff like speed and robustness improvements to the graphite input plugin (by yours truly) - a sample plugin config is sketched below - and various things like regex filtering for 'list series'. Note that a bunch of older bugs remained open throughout this release (most notably the broken derivative aggregator), and a bunch of new ones appeared; maybe that's why the release was kept rather quiet. For our setup this is not a big deal, because we let graphite-api do all the processing, but if you want to query InfluxDB directly you might hit some roadblocks.
    • An older fix, but worth mentioning: series names can now contain any character, which means you can easily use metrics2.0 identifiers. This is a welcome relief after having struggled with Graphite's restrictions on metric keys.
  • graphite-api received various bug fixes and support for templating, statsd instrumentation and caching.
    Much of this was driven by graphite-influxdb: the caching allows us to cache metadata, and the statsd integration gives us insight into the performance of the steps it goes through when building a graph (getting metadata from InfluxDB, querying InfluxDB, interacting with the cache, post-processing data, etc).
  • The progress on InfluxDB and graphite-api in turn enabled graphite-influxdb to become faster and simpler (note: graphite-influxdb requires InfluxDB 0.8). Furthermore, you can now configure series resolutions (different retentions per series are on the roadmap, see State of affairs and what's coming), and of course it also got a bunch of bug fixes.
Because of all these improvements, all involved components are now ready for serious use.
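
For reference, enabling the graphite input plugin mentioned above happens in InfluxDB's config.toml. This is only a minimal sketch - option names are from memory, so double-check against the sample config that ships with your InfluxDB build:

    # graphite input plugin section of InfluxDB's config.toml (0.8-era)
    # (names/values approximate; verify against your build's sample config)
    [input_plugins]

      [input_plugins.graphite]
      enabled = true
      port = 2003            # listen for the carbon line protocol here
      database = "graphite"  # incoming metrics are written to this database
      udp_enabled = true     # optionally accept datapoints over UDP as well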

Putting it all together, with docker

Docker probably needs no introduction: it's a nifty tool to build an environment with given software installed, and it makes it easy to deploy and run that environment in isolation. graphite-api-influxdb-docker is a very creatively named project that generates the - also very creatively named - docker image graphite-api-influxdb, which contains graphite-api and graphite-influxdb, making it easy to hook in a customized configuration and get it all up and running quickly. This is the recommended way to set this up, and it is what we run in production.
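
To give an idea of the customized configuration you hook in: graphite-api reads a graphite-api.yaml that points it at the InfluxDB finder. A minimal sketch follows - key names are approximate and from memory, the graphite-influxdb and graphite-api-influxdb-docker READMEs have the authoritative examples:

    # minimal graphite-api.yaml sketch (key names approximate; see the
    # graphite-influxdb README for the real thing)
    finders:
      - graphite_influxdb.InfluxdbFinder
    influxdb:
      host: localhost
      port: 8086
      user: graphite
      pass: graphite
      db: graphite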

The setup

  • a server running InfluxDB and graphite-api with graphite-influxdb via the docker approach described above:
    dell PowerEdge R610
    24 x Intel(R) Xeon(R) X5660  @ 2.80GHz
    96GB RAM
    perc raid h700
    6x600GB seagate 10k rpm drives in raid10 = 1.6 TB, Adaptive Read Ahead, Write Back, 64 kB blocks, no read caching
    no sharding/shard spaces, compiled from git just before 0.8, using LevelDB (not rocksdb, which is now the default)
    LevelDB max-open-files = 10000 (lsof shows about 30k open files total for the InfluxDB process), LRU cache 4096m, everything else is default, I think (a config sketch follows at the end of this section)
    
  • a server running graphite-web, carbon, and whisper:
    dell PowerEdge R710
    16 x Intel(R) Xeon(R) E5640  @ 2.67GHz
    96GB RAM
    perc raid h700
    8x150GB seagate 15k rpm drives in raid5 = 952 GB, Read Ahead, Write Back, 64 kB blocks, no read caching
    MAX_UPDATES_PER_SECOND = 1000  # to sequentialize writes
    
  • a relay server running carbon-relay-ng that sends the same production load into both. (about 2500 metrics/s, or 150k minutely)
As you can tell, on both machines RAM is vastly overprovisioned and there is plenty of cpu available (the difference in cores should be negligible), but the difference in RAID level is important to note: RAID 5 comes with a write penalty. Even though the whisper machine has more, and faster, disks, it probably has a disadvantage for writes. Maybe. I haven't done RAID stuff in a long time, and I haven't measured it.
Clearly you'll need to take the results with a grain of salt, as unfortunately I do not have two systems available with the same configuration, and their baseline (raw) performance is unknown.
Note: no InfluxDB clustering, see State of affairs and what's coming.
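
For completeness, the LevelDB settings mentioned in the setup above live in InfluxDB's config.toml. In the 0.7-era config they looked roughly like the sketch below; 0.8's multi-engine storage configuration may organize these differently, so treat the section and key names as approximate:

    # rough sketch of the LevelDB tuning on the InfluxDB box
    # (section/key names as in the 0.7-era sample config; 0.8's multi-engine
    # storage config may lay these out differently)
    [leveldb]
    max-open-files = 10000    # lsof shows ~30k open files for the InfluxDB process
    lru-cache-size = "4096m"  # LRU block cache for uncompressed sstable contents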

The empirical validation & migration

Once everything was set up and I could confidently send 100% of traffic to InfluxDB via carbon-relay-ng, it was trivial to run our dashboards with a flag deciding which server to query. This way I have literally been running our graphite dashboards side by side, allowing us to compare both stacks on:
  • visual differences: after a bunch of work and bug fixing, we got to a point where both dashboards looked almost exactly the same. (Note that graphite-api's implementation of certain functions can behave slightly differently; see for example this divideSeries bug.)
  • speed differences, by simply refreshing both pages and watching the PNGs load, with some assistance from firebug's network requests profiler. The difference here was big: graphs served by graphite-api + InfluxDB loaded considerably faster. A page with 40 or so graphs would load in a few seconds instead of 20-30 seconds (on both first and subsequent hits). This is for our default 6-hour timeframe views. When cranking the timeframe up to a couple of weeks, graphite-api + InfluxDB was still faster.
Soon enough my colleagues started asking to make graphite-api + InfluxDB the default, as it was much faster in all common cases. I flipped the switch and everybody has been happy.

When loading a page with many dashboards, the InfluxDB machine will occasionally spike up to 500% cpu, though I rarely get to see any iowait (!), even after syncing the block cache (although I just realized that sync doesn't evict the cache, so reads probably still come from memory).
The carbon/whisper machine, on the other hand, is always fighting iowait, which could be caused by the RAID 5 write amplification, though the random I/O due to the whisper format probably has more to do with it. Via MAX_UPDATES_PER_SECOND I've tried to linearize writes, with mixed success, but I've never gone too deep into it. Comparing write performance would basically be unfair in these circumstances, so I am only comparing reads in these tests. Despite the different storage setups, the Linux block cache should make things fair for reads: whisper's iowait would handicap the reads, but I always did successive runs with fully loaded PNGs to make sure the block cache was warm for reads.

A "slightly more professional" benchmark

I could have stopped here, but the validation above was not very scientific. I wanted to do a somewhat more formal benchmark to measure read speeds (though I did not have much time, so it had to be quick and easy).
I wanted to compare InfluxDB vs whisper, and specifically how performance scales as you play with parameters such as the number of series, points per series, and time range fetched (i.e. amount of points). I posted the benchmark on the InfluxDB mailing list; look there for all the details. I just want to reiterate the conclusion here: I was surprised. Because of the results above, I had assumed that InfluxDB would perform reads noticeably quicker than whisper, but this is not the case (maybe because whisper reads are nicely sequential - it's mostly writes that suffer from the whisper format).
This very much contrasts with my earlier findings, where the graphite-api + InfluxDB powered dashboards clearly took the lead. I have yet to figure out why. Maybe it's something to do with the performance of graphite-web vs graphite-api itself, gunicorn vs apache, worker configuration, or maybe InfluxDB only starts outperforming whisper as concurrency increases. Some more investigation is definitely needed!
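
For the curious, the gist of a single-series read timing like this is easy to reproduce. The sketch below substitutes the whisper python module and InfluxDB 0.8's HTTP API for whisper-fetch and influx-cli; the file path, series name, database and credentials are placeholders for your own setup:

    # rough single-series read timing: whisper file vs InfluxDB 0.8 HTTP API
    # (paths, db name and credentials are placeholders; adjust to your setup)
    import json
    import time
    import urllib.parse
    import urllib.request

    import whisper  # the whisper python module, installed alongside carbon


    def time_whisper_read(wsp_path, seconds=6 * 3600):
        """Time fetching the last `seconds` of data from a .wsp file."""
        now = int(time.time())
        start = time.time()
        timeinfo, values = whisper.fetch(wsp_path, now - seconds, now)
        return time.time() - start, len(values)


    def time_influx_read(series, seconds=6 * 3600, base="http://localhost:8086",
                         db="graphite", user="root", password="root"):
        """Time fetching the last `seconds` of a series via the 0.8 /db/:db/series endpoint."""
        q = 'select value from "%s" where time > now() - %ds' % (series, seconds)
        params = urllib.parse.urlencode({"u": user, "p": password, "q": q})
        start = time.time()
        data = json.load(urllib.request.urlopen("%s/db/%s/series?%s" % (base, db, params)))
        points = data[0]["points"] if data else []
        return time.time() - start, len(points)


    if __name__ == "__main__":
        print(time_whisper_read("/var/lib/carbon/whisper/servers/host1/cpu/user.wsp"))
        print(time_influx_read("servers.host1.cpu.user"))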

Future benchmarks

The benchmark above was very easy to execute, as it only requires influx-cli and whisper-fetch (so you can easily check for yourself), but clearly there is a need to test more realistic scenarios with concurrent reads, and doing some write benchmarks would be nice too.
We should also look into cpu and memory usage. I have had the luxury of being able to completely ignore memory usage, but others seem to notice excessive InfluxDB memory usage.
Conclusion: many tests and benchmarks should happen, but I don't really have time to conduct them. Hopefully other people in the community will take this on.
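
As a starting point for the concurrent-read scenario, something as simple as firing a batch of simultaneous /render requests at each stack and timing them already tells you a lot. A sketch of that idea (the endpoint URLs and target names are placeholders):

    # rough concurrent-read sketch: fire N simultaneous /render requests at a
    # graphite-web or graphite-api endpoint and report total wall-clock time
    # (endpoint URLs and target patterns are placeholders)
    import time
    import urllib.parse
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor


    def render(base_url, target, frm="-6hours"):
        params = urllib.parse.urlencode({"target": target, "from": frm, "format": "json"})
        with urllib.request.urlopen("%s/render?%s" % (base_url, params)) as resp:
            return len(resp.read())


    def bench(base_url, targets, workers=20):
        start = time.time()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            sizes = list(pool.map(lambda t: render(base_url, t), targets))
        return time.time() - start, sum(sizes)


    if __name__ == "__main__":
        targets = ["servers.host%d.cpu.user" % i for i in range(1, 41)]
        print("graphite-web:", bench("http://graphite-web-box", targets))
        print("graphite-api:", bench("http://graphite-api-box:8000", targets))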

Disk space efficiency

Last time I checked, using LevelDB I was pretty close to 24 B per record, which makes sense because time, sequence number and value are all 64-bit values, and each record has those 3 fields. (This was with snappy compression enabled, so compression didn't seem to give much benefit.)
Whisper seems to consume 12 bytes per record - a 32-bit timestamp and a 64-bit float value - making it considerably more storage efficient than InfluxDB/LevelDB for now.
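
To put those per-record numbers in perspective against the roughly 2500 datapoints/s from the setup above, a quick back-of-the-envelope calculation:

    # back-of-the-envelope storage growth at the ~2500 datapoints/s mentioned above
    points_per_day = 2500 * 86400                  # ~216 million records per day
    influx_gb_day  = points_per_day * 24 / 1e9     # ~5.2 GB/day at ~24 B/record
    whisper_gb_day = points_per_day * 12 / 1e9     # ~2.6 GB/day at 12 B/record
    # caveat: whisper preallocates every archive slot up front, so its actual disk
    # usage is fixed by the retention config rather than by the ingest rate
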
Some notes on this though:
  • whisper explicitly encodes None values, whereas with InfluxDB those are implied (and require no space). We have some clusters of metrics with very sparse data, so whisper gives us a lot of overhead here, but this is different for everyone. (Note: Ceres should also be better at handling sparse data.)
  • Whisper and InfluxDB both explicitly encode the timestamp for every record. InfluxDB uses 64 bits so you can do very high resolution (up to microseconds), whereas whisper is limited to per-second data. Ceres AFAIK doesn't explicitly encode the timestamp at every record, which should also give it a space advantage.
  • I've been using a data format in InfluxDB where every record is timestamp-sequence_number-value. It currently works best overall, and so that's how the graphite ingestion plugin stores it and the graphite-influxdb plugin queries for it. But it exacerbates the overhead of the timestamp and sequence number.
    We could technically use a richer row format that stores more variables as columns within a single series instead of as separate series, which would amortize this overhead (but that currently comes with a big tradeoff in performance characteristics - see the column indexes ticket).
    Another thing: we could technically come up with a storage format for InfluxDB that is optimized for evenly-spaced metrics; it wouldn't need sequence numbers, and timestamps could be implicit instead of explicit, saving a lot of space. We could go even further and introduce types (int, etc.) for values, which would consume even less space.

It would be great if somebody with more Ceres experience could chip in here, as - in the context of space efficiency - it looks like a neat little format. Also, I'm probably not making proper use of the compression features that InfluxDB's storage engines support. This also requires some more looking into.

State of affairs and what's coming

  • InfluxDB typically performs pretty well, but not in all cases, and more validation is needed. It wouldn't surprise me at this point if systems like HBase, Cassandra or Riak clearly outperform InfluxDB, but keep in mind that InfluxDB is a young project; a year or two from now it'll probably perform much better. (And then again, it's not all about raw performance: InfluxDB has other strengths.)
  • A long-time goal that is now a reality: you can use any Graphite dashboard on top of InfluxDB, as long as the data is stored in a graphite-compatible format. Again, the easiest way to get running is via graphite-api-influxdb-docker. There are two issues to be mentioned, though:
    • With the 0.8 release out the door, the shard spaces/rollups/retention intervals feature will start stabilizing, so we can start supporting multiple retention intervals per metric.
    • Because InfluxDB clustering is undergoing major changes, and because clustering is not a high priority for me, I haven't needed to worry about it. I'll probably only start looking at clustering somewhere in 2015, because I have more pressing issues.
  • Once the new clustering system and the storage subsystem have matured (sounds like a v1.0 ~ v1.2 to me) we'll get more speed improvements and robustness. Most of the integration work is done; it's just a matter of making smaller improvements, fixing bugs, and waiting for InfluxDB to become better. Maintaining this stack aside, I personally will start focusing more on:
    • per-second resolution in our data feeds, and potentially storage
    • realtime (but basic) anomaly detection, realtime graphs for some key timeseries. Adrian Cockcroft had an inspirational piece in his Monitorama keynote about how alerts from timeseries should trigger within seconds.
    • Mozilla's awesome heka project (this heka video is great), which should help a lot with the above. I'm also looking at Etsy's kale stack for anomaly detection.
    • metrics 2.0 and making sure metrics 2.0 works well with InfluxDB. So far I find the series/columns data model too limiting and arbitrary; it could be so much more powerful, and ditto for the query language.
  • Can we do anything else to make InfluxDB (+graphite) faster? Yes!
    • Long term, of course, InfluxDB should have powerful enough processing functions and query syntax, so that we don't even need a graphite layer anymore.
    • A storage engine optimized for fixed intervals would probably help: timestamps and sequence numbers currently consume 2/3 of the record, and there's no reason to explicitly store either one in this use case. (I've rarely seen people make use of the sequence number in any other InfluxDB use case either.) See all the remarks in the Disk space efficiency section above. Finally, we could have InfluxDB fill in None values without doing a "group by" (timeframe consolidation), which would shave off runtime overhead.
    • Then of course, there are projects to replace graphite-web/graphite-api with a Go codebase: graphite-ng and carbonapi. The latter is more production ready, but depends on some custom tooling and I/O using protobufs. It does, however, perform an order of magnitude better than the python API server! I haven't touched graphite-ng in a while, but hopefully at some point I can take it up again.
  • Another thing to keep in mind when switching to graphite-api + InfluxDB: you lose the graphite composer. I have a few people relying on this, so I can either patch it to talk to graphite-api (meh), separate it out (meh), or replace it with a nicer dashboard like tessera, grafana or descartes (or Graph-Explorer, but that can be a bit too much of a paradigm shift).
  • some more InfluxDB stuff I'm looking forward to:
    • a binary protocol and result streaming (faster communication and responses!) - though the latter might not get implemented
    • "list series" speed improvements (if metadata querying gets fast enough, we won't need ES anymore for metrics2.0 index)
    • InfluxDB instrumentation, so we actually start getting an idea of what's going on in the system; a lot of the testing and troubleshooting still happens in the dark.
  • Tracking exceptions in graphite-api is much harder than it should be. Currently there's no way to display exceptions to the user (in the http response) or even to log them, so sometimes you'll get http 500 responses and not know why. You can use the sentry integration, which works all right but is clunky. Hopefully this will be addressed soon.

Conclusion

The graphite-influxdb stack works and is ready for general consumption. It's easy to install and operate, and it performs well. I expect InfluxDB to mature over time and ultimately meet all my requirements for the ideal backend, but it definitely has a long way to go, and more benchmarks and tests are needed. Keep in mind that we're not doing large volumes of metrics: for small/medium shops this solution should work well, but on larger scales you will definitely run into issues, and you might conclude that InfluxDB is not for you (yet) - there are alternative projects, after all.

Finally, a closing thought:
Having graphs and dashboards that look nice and load fast is a good thing, but keep in mind that graphs and dashboards should be a last resort: a solution for when all else fails. The fewer graphs you need, the better you're doing.
How can you avoid needing graphs? Automatic alerting on your data.

I see graphs as a temporary measure: they provide headroom while you develop an understanding of the operational behavior of your infrastructure, conceive a model of it, and implement the alerting you need to do troubleshooting and capacity planning. Of course, this process consumes more resources (time and otherwise), and these expenses are not always justifiable, but I think this is the ideal case we should be working towards.


Either way, good luck and have fun!
