Grafana dashboard best practices for large-scale monitoring

We run Spark clusters with 100-200 nodes and plot several executor and driver metrics for them.

We are not sure of the best way to build a dashboard at this scale. Plotting all 100-200 nodes and every executor stat doesn't surface problems because there is a lot of noise, and it also slows the dashboard down tremendously.

What are some good practices for Grafana dashboards in this situation? Some ideas and open questions:

  1. Should we visualize only the top K nodes or executors per metric?
  2. Should we plot only anomalies? If so, how do we detect them?
  3. How can we reduce noise?
  4. How can we make the dashboard more performant?

We use Prometheus as the backend.
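
To make the top K and anomaly ideas concrete, here is a minimal Python sketch of the filtering we have in mind. It is only an illustration under stated assumptions: the node names and CPU values are made up, `top_k` and `mad_outliers` are hypothetical helpers, and the modified z-score with a 3.5 cutoff is just one common outlier heuristic, not something we already run. In practice the top K selection would presumably happen on the query side (for example with PromQL's `topk` aggregation) rather than in application code.

```python
from statistics import median


def top_k(values, k):
    """Return the k (node, value) pairs with the largest values."""
    return sorted(values.items(), key=lambda kv: kv[1], reverse=True)[:k]


def mad_outliers(values, threshold=3.5):
    """Return nodes whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a few extreme nodes do not distort the baseline.
    """
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # All nodes are (nearly) identical; nothing stands out.
        return []
    return sorted(
        node
        for node, v in values.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical per-node CPU usage (%) for a 100-node cluster.
samples = {f"node-{i:03d}": 40.0 + (i % 5) for i in range(100)}
samples["node-017"] = 97.0  # two made-up misbehaving nodes
samples["node-042"] = 91.0

top3 = top_k(samples, 3)
outliers = mad_outliers(samples)

print("top 3:", top3)
print("outliers:", outliers)
```

With this made-up data, only the two misbehaving nodes are flagged as outliers, so a panel would show 2 series instead of 100.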



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
