Monitoring, Reliability & Shepherd

At FarmGPU, we view monitoring not as a set of dashboards, but as the operational backbone of the AI data path. Our reliability model is built specifically to exceed the ClusterMax™ Evaluation Criteria, ensuring that “random” hardware events never dictate the success of your training run.

1. Shepherd: The “Autonomous SRE”

Shepherd is FarmGPU’s proprietary reliability engine. It continuously audits the state of the cluster against the ClusterMax™ Evaluation Criteria, automating the transition from detection to remediation.

The Unified Telemetry Fabric

  • GPU Failure Detection: Real-time capture of NVIDIA XID/SXID events, SM occupancy, memory pressure, and PCIe AER rates.
  • Interconnect (RoCEv2): Monitoring of link flaps, retry counters, and fabric-wide NCCL collective timing.
  • Data Plane: BlueField-3 telemetry for NVMe queue depths, IOPS/throughput histograms, and DPU-offload health.
  • Physical: PDU-level power draw, fan speeds, and multi-sensor thermal mapping (CPU, GPU, RAM, NIC, and transceivers).
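
For illustration, the sketch below shows the kind of per-GPU sampling loop a telemetry agent might run, using the NVML Python bindings (pynvml / nvidia-ml-py). Shepherd's actual collector is proprietary; the metrics chosen and the 10-second interval are assumptions for the example.

```python
# Illustrative sketch only: per-GPU sampling via the NVML Python bindings.
# The metric selection and 10 s interval are assumptions, not Shepherd's agent.
import time
import pynvml

pynvml.nvmlInit()
try:
    while True:
        for i in range(pynvml.nvmlDeviceGetCount()):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(h)      # SM / memory utilization (%)
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000   # milliwatts -> watts
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)
            print(f"gpu{i} sm={util.gpu}% mem_used={mem.used >> 20}MiB "
                  f"temp={temp}C power={power_w:.0f}W")
        time.sleep(10)  # assumed scrape interval
finally:
    pynvml.nvmlShutdown()
```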

2. Passive Health Checks (The Safety Net)

Aligned with ClusterMax™ Monitoring Standards, Shepherd runs continuous passive checks to identify “failing” nodes before they become “failed” nodes.
  • GPU/Memory ECC Monitoring: Real-time tracking of single-bit and double-bit ECC errors (via DCGM and EDAC), with automated node cordoning when error thresholds are crossed.
  • Link Flap Detection: Sub-millisecond monitoring of fabric stability; Shepherd automatically drains nodes exhibiting link degradation to prevent NCCL collective timeouts.
  • Thermal Drift Alerts: Proactive detection of thermal anomalies in transceivers or GPUs, triggering automated power-capping or workload migration before throttling occurs.
  • “GPU Off the Bus” Detection: Immediate capture of NVML_ERROR_GPU_IS_LOST or XID 79 events, with automated node isolation.
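
As a simplified illustration of the "GPU off the bus" check above, the sketch below listens for NVML XID critical-error events and cordons the local Kubernetes node when one is seen. The NVML event API and `kubectl cordon` are standard; the cordon-on-any-critical-XID policy and the hostname-equals-node-name assumption are simplifications for the example, not Shepherd's actual remediation logic.

```python
# Illustrative sketch only: cordon a node when an XID critical error
# (e.g. XID 79, "GPU has fallen off the bus") is reported by NVML.
import socket
import subprocess
import pynvml

pynvml.nvmlInit()
event_set = pynvml.nvmlEventSetCreate()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    pynvml.nvmlDeviceRegisterEvents(handle, pynvml.nvmlEventTypeXidCriticalError, event_set)

node = socket.gethostname()  # assumption: hostname matches the Kubernetes node name
while True:
    try:
        event = pynvml.nvmlEventSetWait(event_set, 5000)  # wait up to 5 s for an event
    except pynvml.NVMLError:
        continue  # timeout: no critical event in this window
    xid = event.eventData
    print(f"critical XID {xid} observed on {node}, cordoning node")
    # Mark the node unschedulable so no new jobs land on it; drain, RMA, and
    # replacement would be handled by a separate remediation workflow.
    subprocess.run(["kubectl", "cordon", node], check=False)
```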

3. Active Health Checks

Idle nodes in the Farm undergo a regular “Active Diagnostic” suite (a sample invocation is sketched after the list):
  • Compute Validation: NVIDIA DCGM Diag levels 3 and 4.
  • Stress Testing: An in-house GPU-Burn suite validates thermal stability under maximum synthetic load.
  • Interconnect Validation: Pairwise ib_write_bw and multi-node NCCL All-Reduce tests to verify that every node achieves >90% of theoretical fabric bandwidth.
  • Storage Throughput: Validation of sustained GB/s-per-GPU from the Silo storage tiers to GPU memory.
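
The sketch below shows one way an orchestrator might invoke the DCGM diagnostic from this suite and flag a node on failure. `dcgmi diag -r 3` is the standard DCGM command; the pass/fail parsing, the choice to stop at level 3, and the surrounding script are assumptions for the example.

```python
# Illustrative sketch only: run DCGM's built-in diagnostic on an idle node
# and report pass/fail. Parsing and policy here are assumptions.
import subprocess

def run_dcgm_diag(level: int = 3) -> bool:
    """Run `dcgmi diag` at the given run level; return True if the node passed."""
    proc = subprocess.run(
        ["dcgmi", "diag", "-r", str(level)],
        capture_output=True, text=True,
    )
    passed = proc.returncode == 0 and "Fail" not in proc.stdout
    if not passed:
        print(f"DCGM diag level {level} failed:\n{proc.stdout}")
    return passed

if __name__ == "__main__":
    healthy = run_dcgm_diag(3)
    print("node healthy" if healthy else "node flagged for remediation")
```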

4. Managed Observability (Grafana)

Every FarmGPU bare-metal customer receives a dedicated Grafana monitoring stack at a custom URL, with dashboards at three levels (job-level correlation is sketched after the list):
  • Cluster-Level: Node availability, pending pod counts, and fabric-wide goodput.
  • Node-Level: Granular GPU SM utilization, power draw, and temperature heatmaps.
  • Job-Level: Integration with Slurm (sacct) and Kubernetes to correlate infrastructure metrics directly with training loss curves.
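
For illustration, job-level correlation can be approximated by pulling a job's time window from Slurm's sacct and querying Prometheus over exactly that range. The sketch below assumes the standard DCGM exporter metric DCGM_FI_DEV_POWER_USAGE, a hypothetical Prometheus endpoint, and a 30-second step; the managed dashboards wire this correlation up for you.

```python
# Illustrative sketch only: correlate a Slurm job's window with GPU metrics
# in Prometheus. PROM_URL is a hypothetical endpoint for this example.
import subprocess
from datetime import datetime

import requests

PROM_URL = "http://prometheus.example.internal:9090"  # assumed endpoint

def job_window(job_id: str) -> tuple[float, float]:
    """Return the (start, end) Unix timestamps of a Slurm job via sacct."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "-X", "--noheader", "--parsable2",
         "--format=Start,End"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    start, end = out.splitlines()[0].split("|")
    # sacct prints local time without a zone, e.g. 2025-05-01T12:00:00
    return datetime.fromisoformat(start).timestamp(), datetime.fromisoformat(end).timestamp()

def gpu_power_over_job(job_id: str) -> list:
    """Query per-GPU power draw (DCGM exporter metric) over the job's window."""
    start, end = job_window(job_id)
    resp = requests.get(f"{PROM_URL}/api/v1/query_range", params={
        "query": "DCGM_FI_DEV_POWER_USAGE",
        "start": start, "end": end, "step": "30s",  # assumed 30 s resolution
    })
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    for series in gpu_power_over_job("123456"):
        print(series["metric"].get("gpu"), len(series["values"]), "samples")
```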