Dagster Monitoring with Prometheus: From System Metrics to Custom Assets (Part 1)

The Monitoring Challenge

Maintaining a self-hosted Dagster instance comes with its own set of challenges. Like any distributed application, Dagster needs comprehensive monitoring to surface bottlenecks, record performance metrics, track execution times, monitor materialization durations, and capture other vital operational metrics alongside business-specific parameters.

When running applications on Kubernetes, Prometheus has become practically the default choice for monitoring.

Prometheus is an open-source monitoring and alerting toolkit that operates on a pull-based model, actively scraping metrics from /metrics endpoints at regular intervals and storing them in a time-series database.


System-Level Monitoring: The Easy Part

Monitoring system components of Dagster like dagster-webserver, dagster-daemon, or even the PostgreSQL database is relatively straightforward when running in Kubernetes. The Kubernetes API automatically records and provides comprehensive system-level metrics for these services through tools like kubelet, cAdvisor, and kube-state-metrics. This functionality comes out of the box without any additional configuration.
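
For example, once a Prometheus stack is scraping your cluster, standard cAdvisor metrics let you chart the resource usage of the Dagster pods with plain PromQL. Two illustrative queries follow (the dagster namespace label is an assumption about your deployment):

# CPU usage per Dagster pod, averaged over 5 minutes
sum(rate(container_cpu_usage_seconds_total{namespace="dagster"}[5m])) by (pod)

# Memory working set per Dagster pod
sum(container_memory_working_set_bytes{namespace="dagster"}) by (pod)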


The Real Challenge: Business Logic Monitoring

The real challenge emerges when we need to monitor Dagster jobs and assets that sit close to the business logic of your applications. This is where traditional Prometheus scraping patterns break down, due to the fundamental nature of data processing workloads.

By nature, Dagster jobs can be short-lived, sporadic, dynamic, and distributed. The conventional Prometheus approach of scraping metrics endpoints at regular intervals (typically 15-60 seconds) simply doesn't work for ephemeral processes that may start and complete between scrape intervals.

Without proper monitoring of job-level metrics, you lose visibility into asset materialization times, job success/failure rates, resource consumption during data processing, data quality metrics, and business KPIs tied to your data pipelines.


Push-Based Metrics with Prometheus

Despite these challenges, Prometheus remains effective for monitoring Dagster jobs when metrics are pushed through a specialized gateway. Because Dagster assets are ordinary Python, you can use the prometheus_client library to instrument your code and generate Prometheus-ready metrics directly within your application logic.

The workflow is simple: Dagster jobs push metrics to a gateway, and Prometheus scrapes the gateway on its regular schedule, storing everything in its time-series database for analysis.


Let's explore how to implement this approach with a concrete example that demonstrates instrumenting a Dagster asset with Prometheus metrics.


Setting Up Dagster Metrics with Prometheus

First, we need to set up a push gateway that receives metrics from our Dagster jobs and exposes them for Prometheus to scrape. Use an aggregation gateway like prom-aggregation-gateway rather than the official Prometheus Pushgateway: the official version doesn't aggregate metrics and may overwrite data from concurrent jobs, while an aggregation gateway properly combines metrics from multiple sources, preventing data loss when Dagster runs jobs in parallel.

docker-compose.yaml

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./config/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./data/prometheus:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./data/grafana:/var/lib/grafana
      - ./config/grafana:/etc/grafana/provisioning/dashboards
    ports:
      - "8081:3000"
    depends_on:
      - prometheus
    restart: unless-stopped

  aggregation-gateway:
    image: ghcr.io/zapier/prom-aggregation-gateway:latest
    container_name: aggregation-gateway
    ports:
      - "80:80"
    restart: unless-stopped
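
The compose file above mounts a Prometheus configuration from ./config/prometheus/prometheus.yml. A minimal sketch of that file follows, assuming the gateway keeps the service name from the compose file; honor_labels: true keeps the labels pushed by Dagster jobs instead of overwriting them with the scrape job's own:

prometheus.yml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'aggregation-gateway'
    # Preserve the job/instance labels pushed by Dagster runs
    honor_labels: true
    static_configs:
      - targets: ['aggregation-gateway:80']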


Instrumenting Dagster Assets with Prometheus Metrics

Now that our Prometheus setup is ready, let's instrument the code using the prometheus_client library to send valuable metrics from our Dagster assets. This involves adding metric collection directly into your asset definitions to capture key performance indicators like execution time, data quality metrics, and business-specific measurements.


from dagster import asset
from prometheus_client import Counter, Histogram, REGISTRY, generate_latest
import requests

# Define metrics
my_asset_runs_total = Counter("my_asset_runs_total", "Total runs of my asset")
my_asset_success_total = Counter("my_asset_success_total", "Total successful runs")
my_asset_failure_total = Counter("my_asset_failure_total", "Total failed runs")
my_asset_duration_seconds = Histogram("my_asset_duration_seconds", "Duration of asset runs in seconds")

PUSHGATEWAY_URL = "http://aggregation-gateway/metrics/job/my_asset"

@asset
def my_asset(context):
    my_asset_runs_total.inc()
    try:
        # time() observes the duration when the block exits,
        # on success and on failure alike
        with my_asset_duration_seconds.time():
            ...  # your asset logic
        my_asset_success_total.inc()
    except Exception:
        my_asset_failure_total.inc()
        raise
    finally:
        # Push the whole registry to the gateway so the metrics survive
        # after this ephemeral run process exits
        metrics_payload = generate_latest(REGISTRY)
        requests.post(
            PUSHGATEWAY_URL,
            data=metrics_payload,
            headers={"Content-Type": "text/plain"},
            timeout=5,
        )

Materialize the asset to make sure everything is working. Once you trigger the materialization, you should see the metrics being pushed to the gateway.
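
For a quick sanity check, you can inspect the gateway's own /metrics endpoint, which the compose file publishes on host port 80:

curl -s http://localhost/metrics | grep my_asset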

From the Prometheus side, we can see the same metrics by checking our aggregation gateway target. Navigate to your Prometheus web interface and verify that the gateway is being scraped successfully and the metrics are available for querying.

Now that we've confirmed the data is flowing correctly, we can start building our monitoring dashboard in Grafana. The metric types we defined in our instrumentation code - counters and histograms - allow us to create diverse visualisations, from simple time series graphs tracking asset execution times to heatmaps showing performance distributions across different jobs and time periods.
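
As a starting point, assuming the metric names defined earlier, PromQL queries like these can back Grafana panels (adjust the ranges to your scrape interval):

# Materialization rate over the last 5 minutes
rate(my_asset_runs_total[5m])

# Success ratio over the last hour
increase(my_asset_success_total[1h]) / increase(my_asset_runs_total[1h])

# 95th-percentile materialization duration
histogram_quantile(0.95, sum(rate(my_asset_duration_seconds_bucket[5m])) by (le))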



What's Next: Monitoring Your dbt Assets

This covers the foundation of monitoring your Dagster infrastructure and custom assets with Prometheus. But we're only halfway there - the most critical piece is still missing.

Around 50% of Dagster users rely on dbt for their data transformations, yet monitoring these SQL-based assets presents unique challenges. You can't simply add prometheus_client calls to your dbt models the way you can with Python assets. So how do you capture execution times, row counts, and performance metrics from your dbt transformations?

In our next post, we'll dive deep into monitoring dbt assets with Prometheus, exploring how to extract valuable metrics from dbt's artifacts, instrument your SQL transformations, and create comprehensive dashboards that give you visibility into your entire data pipeline - from Python assets to dbt models.

Stay tuned to complete your Dagster observability stack!

Monitoring isn't just about knowing when things break - it's about understanding your system well enough to prevent failures before they happen. Prometheus gives you that visibility into Dagster.


Abhivan Chekuri

© MetaOps 2024
