Operator Metrics

Overview

The Dynamo Operator exposes Prometheus metrics for monitoring its own health and performance. These metrics are separate from application metrics (frontend/worker) and provide visibility into:

  • Controller reconciliation: how efficiently the controllers process DynamoGraphDeployments, DynamoComponentDeployments, and DynamoModels
  • Webhook validation: performance and outcomes of admission webhook requests
  • Resource inventory: current count of managed resources by state and namespace

Prerequisites

Operator metrics require the same monitoring infrastructure as application metrics. For detailed setup instructions, see the Kubernetes Metrics Guide.

Quick checklist:

  • ✅ kube-prometheus-stack is installed (for ServiceMonitor support)
  • ✅ Prometheus and Grafana are running
  • ✅ Dynamo Operator is installed via Helm

Metrics Collection

ServiceMonitor

Operator metrics are collected automatically through a ServiceMonitor, which the Helm chart creates when metricsService.enabled: true (the default).

Unlike application metrics (which use a PodMonitor), the operator uses a ServiceMonitor and requires no manual RBAC configuration. The operator's metrics endpoint uses controller-runtime's built-in WithAuthenticationAndAuthorization filter for secure exposure.

To verify that the ServiceMonitor was created:

kubectl get servicemonitor -n dynamo-system

Disabling Metrics Collection

To disable operator metrics collection:

helm upgrade dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace dynamo-system \
--set dynamo-operator.metricsService.enabled=false

Available Metrics

All metrics use the dynamo_operator namespace prefix.

Reconciliation Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| dynamo_operator_reconcile_duration_seconds | Histogram | resource_type, namespace, result | Duration of reconciliation loops |
| dynamo_operator_reconcile_total | Counter | resource_type, namespace, result | Total number of reconciliations |
| dynamo_operator_reconcile_errors_total | Counter | resource_type, namespace, error_type | Total reconciliation errors by type |

Labels:

  • resource_type: DynamoGraphDeployment, DynamoComponentDeployment, DynamoModel, DynamoGraphDeploymentRequest, DynamoGraphDeploymentScalingAdapter
  • namespace: Target namespace of the resource
  • result: success, error, requeue
  • error_type: not_found, already_exists, conflict, validation, bad_request, unauthorized, forbidden, timeout, server_timeout, unavailable, rate_limited, internal

Webhook Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| dynamo_operator_webhook_duration_seconds | Histogram | resource_type, operation | Duration of webhook validation requests |
| dynamo_operator_webhook_requests_total | Counter | resource_type, operation, result | Total webhook admission requests |
| dynamo_operator_webhook_denials_total | Counter | resource_type, operation, reason | Total webhook denials with reasons |

Labels:

  • resource_type: Same as reconciliation metrics
  • operation: CREATE, UPDATE, DELETE
  • result: allowed, denied
  • reason: Validation failure reason (e.g., immutable_field_changed, invalid_config)

Resource Inventory Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| dynamo_operator_resources_total | Gauge | resource_type, namespace, status | Current count of resources by state |

Labels:

  • resource_type: DynamoGraphDeployment, DynamoComponentDeployment, DynamoModel, DynamoGraphDeploymentRequest, DynamoGraphDeploymentScalingAdapter
  • namespace: Resource namespace
  • status: Resource state derived from each CRD's status. Common values:
    • "ready" - Resource is healthy and operational (DCD, DM, DGDSA)
    • "not_ready" - Resource exists but is not operational (DCD, DM, DGDSA)
    • "unknown" - State cannot be determined (default for empty status)
    • DGD uses: "pending", "successful", "failed" from .status.state
    • DGDR uses: "Pending", "Profiling", "Ready", "Deploying", "Deployed", "Failed" from .status.phase
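Because the status values differ in case between CRDs ("failed" for DGD, "Failed" for DGDR), a client that aggregates across resource types may want to fold case first. The following is a minimal Python sketch of that idea, not part of the operator; the function name `count_by_status` and the `(labels, value)` input shape are assumptions made for illustration.

```python
from collections import defaultdict


def count_by_status(samples):
    """Sum resource counts per status, folding case so "Failed" and "failed" merge.

    `samples` is a hypothetical list of (labels, value) pairs, where `labels`
    holds the label values of one dynamo_operator_resources_total series.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        # A missing or empty status falls back to "unknown", the documented default.
        status = (labels.get("status") or "unknown").lower()
        totals[status] += value
    return dict(totals)
```

Note that folding case also merges "ready" (DCD, DM, DGDSA) with the DGDR phase "Ready". Whether those two states should be counted together depends on your use case, so treat the merged totals as a coarse summary only.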

Example Queries

Reconciliation Performance

# P95 reconciliation duration by resource type
histogram_quantile(0.95,
sum by (resource_type, le) (
rate(dynamo_operator_reconcile_duration_seconds_bucket[5m])
)
)

# Reconciliation rate by result
sum by (resource_type, result) (
rate(dynamo_operator_reconcile_total[5m])
)

# Error rate by type
sum by (resource_type, error_type) (
rate(dynamo_operator_reconcile_errors_total[5m])
)
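The P95 query above relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation within the bucket that contains the target rank. The Python sketch below illustrates that interpolation in simplified form. It is not Prometheus's implementation, it ignores rate() and native histograms, and the bucket bounds in the usage note are made-up values rather than the operator's actual buckets.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from classic cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the +Inf bucket (math.inf).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # The lowest bucket is assumed to start at 0.
    lower_bound, lower_count = 0.0, 0.0
    for index, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank lands in the +Inf bucket: report the highest finite bound.
                return buckets[index - 1][0] if index > 0 else math.nan
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan
```

For example, with buckets [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)] and q = 0.95, the target rank is 95, which falls halfway through the 0.5 to 1.0 bucket, so the estimate is approximately 0.75 seconds. The accuracy of a P95 value is therefore limited by how finely the buckets are spaced.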

Webhook Performance

# Webhook P95 latency
histogram_quantile(0.95,
sum by (resource_type, le) (
rate(dynamo_operator_webhook_duration_seconds_bucket[5m])
)
)

# Webhook denial rate
sum by (resource_type, operation, reason) (
rate(dynamo_operator_webhook_denials_total[5m])
)

Resource Inventory

# Total resources by type and state
sum by (resource_type, status) (
dynamo_operator_resources_total
)

# DynamoGraphDeployments by state
sum by (status) (
dynamo_operator_resources_total{resource_type="DynamoGraphDeployment"}
)

# All resources by namespace and state
sum by (resource_type, namespace, status) (
dynamo_operator_resources_total
)

Grafana Dashboard

A pre-built Grafana dashboard is available for visualizing operator metrics.

Dashboard Sections

  1. Reconciliation Metrics (3 panels)

    • Reconciliation rate by resource type and result
    • P95 reconciliation duration
    • Reconciliation errors by type
  2. Webhook Metrics (3 panels)

    • Webhook request rate by operation
    • P95 webhook duration
    • Webhook denials by reason
  3. Resource Inventory (2 panels)

    • Resource inventory timeline by state and namespace (filterable by resource type)
    • Current resource count by state (filterable by resource type)
  4. Operational Health (2 panels)

    • Reconciliation success rate gauges
    • Webhook admission success rate gauges
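A success rate panel can be expressed as the ratio of successful events to all events over a time window. The Python sketch below builds such a PromQL expression from the metric and label names documented above. The exact expressions used by the pre-built dashboard are not shown in this guide, so this is an assumption about one reasonable formulation, and the helper name `success_ratio_query` is hypothetical.

```python
def success_ratio_query(metric, success_value, window="5m"):
    """Build a PromQL expression for the share of events whose result label equals success_value."""
    succeeded = f'sum(rate({metric}{{result="{success_value}"}}[{window}]))'
    total = f"sum(rate({metric}[{window}]))"
    return f"{succeeded} / {total}"


# Reconciliation success rate: result="success" over all reconciliations.
RECONCILE_SUCCESS = success_ratio_query("dynamo_operator_reconcile_total", "success")

# Webhook admission success rate: result="allowed" over all admission requests.
WEBHOOK_SUCCESS = success_ratio_query("dynamo_operator_webhook_requests_total", "allowed")
```

The ratio returns no data while the denominator has no samples in the window, for example on an idle cluster, so a panel built this way may show "No data" rather than 100%.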

Deploying the Dashboard

kubectl apply -f deploy/observability/grafana-operator-dashboard-configmap.yaml

The dashboard will automatically appear in Grafana (assuming you have the Grafana dashboard sidecar configured, which is included in kube-prometheus-stack).

Finding the Dashboard

  1. Port-forward to Grafana (if needed):

    kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
  2. Log in to Grafana at http://localhost:3000

  3. Navigate to Dashboards → Search for "Dynamo Operator"

Dashboard Filters

The dashboard includes two filter variables:

  • Namespace: View metrics across all namespaces or filter by specific ones (multi-select)
  • Resource Type: Filter all panels by resource type or select "All" to see aggregated metrics across all CRDs (single select)

When "All" is selected for Resource Type, all panels will show data for all five managed CRDs with resource_type labels for differentiation.

Direct Metrics Access

For instructions on accessing Prometheus and Grafana, see the Kubernetes Metrics Guide.

Once you have access to Prometheus, you can query operator metrics directly:

# Port-forward to Prometheus
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring

# Visit http://localhost:9090 and try queries like:
# - dynamo_operator_reconcile_total
# - dynamo_operator_webhook_requests_total
# - dynamo_operator_resources_total
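The same queries can also be run programmatically through the Prometheus HTTP API instead of the web UI. The following is a minimal Python sketch that uses only the standard library. It assumes the port-forward above is active on http://localhost:9090 and that the Prometheus instance requires no authentication; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable through the port-forward shown above.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def instant_query(promql, base_url=PROMETHEUS_URL):
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(promql, base_url), timeout=10) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    return payload["data"]["result"]


# Example usage (requires a reachable Prometheus):
#   for series in instant_query("dynamo_operator_resources_total"):
#       print(series["metric"], series["value"][1])
```

Each returned series carries its labels under "metric" and a [timestamp, value] pair under "value", with the value encoded as a string.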

Troubleshooting

Metrics Not Appearing in Prometheus

  1. Check ServiceMonitor exists:

    kubectl get servicemonitor -n dynamo-system | grep operator
  2. Check ServiceMonitor is discovered by Prometheus:

    • Go to Prometheus UI → Status → Targets
    • Look for serviceMonitor/dynamo-system/dynamo-platform-dynamo-operator-operator
    • The target should show state: UP
  3. Check Prometheus selector configuration:

    kubectl get prometheus -n monitoring -o yaml | grep serviceMonitorSelector

    Ensure serviceMonitorSelectorNilUsesHelmValues: false was set during kube-prometheus-stack installation.
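The target check in step 2 can also be scripted against the JSON returned by the Prometheus /api/v1/targets endpoint. The Python sketch below filters an already-parsed response for scrape pools whose name contains "dynamo-operator". That substring and the helper name are assumptions based on the target name shown above, so adjust them if your Helm release is named differently.

```python
def operator_target_health(targets_response, pool_substring="dynamo-operator"):
    """Summarize health for active targets whose scrape pool matches the operator.

    `targets_response` is the parsed JSON body of GET /api/v1/targets.
    """
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        {
            "scrapePool": target.get("scrapePool", ""),
            "health": target.get("health", "unknown"),
            "lastError": target.get("lastError", ""),
        }
        for target in active
        if pool_substring in target.get("scrapePool", "")
    ]
```

An empty result suggests that Prometheus has not discovered the ServiceMonitor at all, which points to the selector configuration in step 3. A target with health "down" usually has a more specific cause recorded in lastError.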

Dashboard Not Appearing in Grafana

  1. Check ConfigMap is created:

    kubectl get configmap -n monitoring grafana-operator-dashboard
  2. Check ConfigMap has the label:

    kubectl get configmap -n monitoring grafana-operator-dashboard -o jsonpath='{.metadata.labels.grafana_dashboard}'

    The command should return "1".

  3. Check Grafana dashboard sidecar configuration:

    kubectl get deployment -n monitoring prometheus-grafana -o yaml | grep -A 5 sidecar

    The sidecar should be configured to watch for grafana_dashboard: "1" label.

  4. Restart Grafana pod to force dashboard refresh:

    kubectl rollout restart deployment/prometheus-grafana -n monitoring

Related Documentation