Taming your Prometheus
Anyone who has run a busy Kubernetes cluster has faced the issue of the Prometheus deployment quickly becoming the largest component in the cluster.
To keep things simple, we usually run two pods of Prometheus Server, each with its own storage, for high availability. As the number of metrics grows and clusters become busier and busier, the Prometheus servers require more and more memory to process metrics before persisting them to disk.
Start times for Prometheus gradually increase, and running two instances doubles the CPU/RAM used. Eventually, you hit the limits of an individual worker node and have an even bigger problem.
How to tame the lion
There are several things that can be done to bring your Prometheus under control besides increasing the spec:
- Move Prometheus to its own dedicated worker node, sized so it can hold the expanded requirements
- Reduce the number of scrape sources. Do you really need everything? Take node-exporter, for example – is it possible that those metrics are already saved somewhere in your conventional monitoring or in your cloud provider's monitoring?
- Reduce the number of individual metrics. Do you really need all those Istio metrics? And every metric from your Go applications? Some of them can be dropped before they are even ingested.
- Find the biggest offenders – the metrics and labels that take up the most space in Prometheus storage. Prometheus stores data in TSDB; use `tsdb analyze /prometheus/data` (or `promtool tsdb analyze` in newer Prometheus releases) to find the offenders and patterns.
- If none of the above helps and you definitely need all those metrics, you have two options – sharding (splitting your scraping and storage across multiple Prometheus instances) or deploying a more scalable backend for Prometheus, such as Thanos
- Switch to a SaaS service 🙂
- Set up monitoring of the disk space used by Prometheus and be ready to extend it when it gets close to the threshold
- Set up monitoring for the memory used by Prometheus and make sure you scale it before it can no longer recover
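To make the metric-dropping idea concrete, here is a minimal sketch of a scrape job that drops some Go runtime and Istio histogram metrics at scrape time. The job name, target, and regex patterns are illustrative assumptions – pick the patterns your own `tsdb analyze` output points at:

```yaml
scrape_configs:
  - job_name: my-app               # hypothetical job name
    static_configs:
      - targets: ['my-app:8080']   # hypothetical target
    metric_relabel_configs:
      # Drop Go runtime metrics nobody looks at.
      - source_labels: [__name__]
        regex: 'go_(gc|memstats)_.*'
        action: drop
      # Drop verbose Istio latency histogram buckets.
      - source_labels: [__name__]
        regex: 'istio_request_duration_milliseconds_bucket'
        action: drop
```

Note that `metric_relabel_configs` runs after the scrape but before storage, so the dropped samples never reach TSDB – which is exactly where the memory and disk savings come from.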
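The sharding option can be sketched with Prometheus's built-in `hashmod` relabel action: each of N identical Prometheus instances hashes every discovered target and keeps only the targets belonging to its own shard. The shard count of 2 and the shard number 0 below are assumptions – each instance gets its own number:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash each target's address into one of 2 buckets...
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      # ...and keep only the targets that belong to this shard (shard 0 here).
      - source_labels: [__tmp_shard]
        regex: '0'
        action: keep
```

Each shard then scrapes and stores a fraction of the targets; queries that need a global view have to fan out across shards, which is one of the problems a backend like Thanos solves for you.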
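The last two bullets can be sketched as alerting rules. This assumes node-exporter filesystem metrics with the Prometheus data volume mounted at `/prometheus`, and cAdvisor container metrics with pods named `prometheus-*` – adjust both to your setup:

```yaml
groups:
  - name: prometheus-capacity
    rules:
      # Fire if the data volume is predicted to fill up within ~4 hours,
      # based on the last 6 hours of growth.
      - alert: PrometheusDiskAlmostFull
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/prometheus"}[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
      # Fire if Prometheus is using more than 90% of its memory limit.
      - alert: PrometheusHighMemory
        expr: |
          container_memory_working_set_bytes{pod=~"prometheus-.*", container="prometheus"}
            / container_spec_memory_limit_bytes{pod=~"prometheus-.*", container="prometheus"} > 0.9
        for: 15m
        labels:
          severity: warning
```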
Have fun taming!