The discipline of collecting, aggregating, storing, visualizing and alerting on infrastructure and application performance metrics goes by many names: telemetry, insights engineering, operational visibility. I've seen a bunch of people present their work on advancing the state of the art in this domain:
from Anton Lebedevich's statistics for monitoring series, Toufic Boubez' talks on anomaly detection and Twitter's work on detecting mean shifts, to projects such as flapjack (which aims to offload the alerting responsibility from your monitoring apps), the metrics 2.0 standardization effort, and Etsy's Kale stack, which tries to bring interesting changes in timeseries to your attention with minimal configuration.
Much of this work is being shared via conference talks and blog posts, especially around anomaly and fault detection, but I couldn't find a place for collaboration, quicker feedback, and discussion of more abstract (algorithmic/mathematical) topics or topics that cross project boundaries. So I created the IT-telemetry Google group. If I missed something that already exists, let me know; I can shut this down and point to it instead. Either way, I hope this kind of avenue proves useful to people working on these kinds of problems.