# Software Stack

> TractorOS, control plane, and FarmGPU operational software

## Software Stack & Control Plane

FarmGPU's software stack is designed to **operate AI infrastructure at scale with minimal human overhead**, while preserving performance, transparency, and control. Rather than relying on proprietary cloud control planes, FarmGPU builds on **open-source, Linux-native primitives**, extended with purpose-built systems for GPU operations.

Together, these components form a **unified control plane** for provisioning, operating, observing, and optimizing AI clusters.

***

## TractorOS — The NeoCloud Operating System

**TractorOS** is FarmGPU's minimal, immutable, container-native operating system optimized specifically for AI compute.

<img src="https://mintcdn.com/farmgpu/jwn3M3s5i0_vYQ6f/images/tractoros-architecture.png?fit=max&auto=format&n=jwn3M3s5i0_vYQ6f&q=85&s=31f6443cb16cb9a60590134ecc9b4ee7" alt="TractorOS Architecture" width="1632" height="1064" data-path="images/tractoros-architecture.png" />

### What TractorOS Does

* Provides a hardened, reproducible OS image for GPU and storage nodes
* Ships with **built-in GPU, networking, and storage drivers**
* Enables **zero-touch provisioning and fleet-wide upgrades**
* Eliminates configuration drift across clusters

### Key Characteristics

* Linux-based (RHEL-derived)
* Immutable image-based updates
* Container-first runtime model
* Designed for bare-metal AI infrastructure, not VMs

TractorOS allows FarmGPU to deploy and scale GPU fleets rapidly while maintaining **predictable behavior across thousands of nodes**.
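
To illustrate why image-based updates make configuration drift easy to detect, the sketch below compares the image digest each node reports against the expected release. It is a minimal example under stated assumptions: the inventory format, node names, digest values, and function names are hypothetical, and nothing here describes how TractorOS actually reports its image version.

```python
from collections import Counter


def find_drifted_nodes(reported, expected_digest):
    """Return sorted names of nodes whose image digest differs from the expected one."""
    return sorted(
        node for node, digest in reported.items() if digest != expected_digest
    )


def rollout_summary(reported):
    """Count how many nodes are running each image digest."""
    return Counter(reported.values())


# Hypothetical fleet inventory: node name -> reported image digest.
fleet = {
    "gpu-node-01": "sha256:aaa111",
    "gpu-node-02": "sha256:aaa111",
    "gpu-node-03": "sha256:bbb222",
}

drifted = find_drifted_nodes(fleet, "sha256:aaa111")
print(drifted)
```

Because an immutable image is identified by a single digest, one comparison per node can stand in for auditing individual packages and configuration files.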

***

## Homestead — Bare-Metal Provisioning & Lifecycle Management

**Homestead** is the entry point for every FarmGPU server.

### What Homestead Does

* Bare-metal provisioning via Redfish and Ansible
* Hardware discovery and inventory
* Initial OS deployment and configuration
* Support for KVM VMs and LXC containers where needed

Homestead ensures that servers move from **rack to production** with minimal manual intervention, forming the foundation for repeatable cluster deployments.
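
As a rough illustration of Redfish-based hardware discovery, the sketch below flattens a Redfish `ComputerSystem` resource into an inventory record. The property names (`Manufacturer`, `Model`, `SerialNumber`, `PowerState`, `ProcessorSummary`, `MemorySummary`) come from the DMTF Redfish schema. The sample payload, its values, and the `summarize_system` helper are hypothetical and are not taken from Homestead itself.

```python
import json


def summarize_system(system):
    """Extract a flat inventory record from a Redfish ComputerSystem resource."""
    processors = system.get("ProcessorSummary") or {}
    memory = system.get("MemorySummary") or {}
    return {
        "id": system.get("Id"),
        "manufacturer": system.get("Manufacturer"),
        "model": system.get("Model"),
        "serial": system.get("SerialNumber"),
        "power_state": system.get("PowerState"),
        "cpu_count": processors.get("Count"),
        "memory_gib": memory.get("TotalSystemMemoryGiB"),
    }


# Hypothetical payload, shaped like the body of GET /redfish/v1/Systems/<id>.
sample = json.loads("""
{
  "Id": "1",
  "Manufacturer": "ExampleVendor",
  "Model": "GPU-Server-8X",
  "SerialNumber": "SN0001",
  "PowerState": "On",
  "ProcessorSummary": {"Count": 2},
  "MemorySummary": {"TotalSystemMemoryGiB": 2048}
}
""")

record = summarize_system(sample)
print(record)
```

In practice the payload would be fetched from each server's BMC over HTTPS. The parsing step is kept separate here so it can run without network access.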

***

## Shepherd — System Health, Reliability & Diagnostics

**Shepherd** is FarmGPU's system health evaluation platform, designed to make AI infrastructure **observable, debuggable, and predictable**.

### What Shepherd Monitors

* GPU health and performance (via NVIDIA DCGM)
* Network health and packet loss
* Storage latency and throughput
* Host-level metrics and anomalies

### Advanced Capabilities

* Predictive failure detection
* Agentic diagnostics and root-cause analysis
* Automated incident summaries and reports
* Integration with observability tools (Grafana, Prometheus, Loki, Alloy)

Shepherd enables FarmGPU to move from **reactive monitoring** to **proactive reliability**, reducing downtime and improving SLA adherence.
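
A simple threshold check over DCGM metrics gives a feel for the kind of GPU health signal involved. The sketch below parses Prometheus text-format samples of the sort exposed by NVIDIA's dcgm-exporter and flags GPUs that exceed a limit. The metric names `DCGM_FI_DEV_GPU_TEMP` and `DCGM_FI_DEV_XID_ERRORS` are real dcgm-exporter fields. The sample values, thresholds, hostnames, and helper functions are assumptions for illustration and do not describe Shepherd's actual logic. The parser is deliberately simplified and ignores lines without labels or with timestamps.

```python
import re

# Matches lines such as: metric_name{label="value",...} 42
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse labeled Prometheus text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # unsupported line shape in this simplified parser
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def flag_gpus(samples, limits):
    """Return sorted (gpu, metric, value) tuples where value exceeds the metric's limit."""
    alerts = []
    for name, labels, value in samples:
        limit = limits.get(name)
        if limit is not None and value > limit:
            alerts.append((labels.get("gpu", "?"), name, value))
    return sorted(alerts)


# Hypothetical scrape output and illustrative thresholds.
scrape = """
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 61
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a"} 92
DCGM_FI_DEV_XID_ERRORS{gpu="0",Hostname="node-a"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",Hostname="node-a"} 79
"""
limits = {"DCGM_FI_DEV_GPU_TEMP": 85.0, "DCGM_FI_DEV_XID_ERRORS": 0.0}

alerts = flag_gpus(parse_metrics(scrape), limits)
print(alerts)
```

Static thresholds like these are reactive by nature. Predictive detection would more plausibly look at trends across many scrapes, which is beyond this sketch.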

***

## Silo — Storage Infrastructure & Benchmarking

**Silo** is FarmGPU's storage evaluation and optimization platform.

### What Silo Does

* Deploys and benchmarks block, file, and object storage systems
* Evaluates storage performance under AI workloads
* Performs hardware discovery and tuning
* Validates storage configurations before production rollout

Silo is used internally to benchmark platforms such as MinIO, VAST Data, Weka, and Ceph, ensuring that storage performance is **measured, repeatable, and workload-aware**.
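
One way to make "repeatable" concrete is to run the same benchmark several times and check how much the results vary. The sketch below uses the coefficient of variation (standard deviation divided by mean) as a repeatability check. The throughput figures, the 5% cutoff, and the function names are illustrative assumptions and are not results or criteria from Silo.

```python
import statistics


def summarize_runs(throughputs_gbps):
    """Summarize repeated benchmark runs (needs at least two samples)."""
    mean = statistics.mean(throughputs_gbps)
    stdev = statistics.stdev(throughputs_gbps)  # sample standard deviation
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


def is_repeatable(throughputs_gbps, max_cv=0.05):
    """Treat a benchmark as repeatable if its coefficient of variation is small."""
    return summarize_runs(throughputs_gbps)["cv"] <= max_cv


# Hypothetical throughput samples in GB/s from five runs of the same test.
stable_runs = [10.0, 10.2, 9.8, 10.1, 9.9]
noisy_runs = [10.0, 6.0, 12.0, 8.0, 14.0]

print(is_repeatable(stable_runs))
print(is_repeatable(noisy_runs))
```

Both sample sets have the same mean, so a single averaged number would hide the difference between them. Reporting the spread alongside the mean is what makes the comparison meaningful.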

***

## Haystack — AI Workload Profiling & Optimization

**Haystack** provides workload-level visibility into how AI jobs consume infrastructure.

### What Haystack Measures

* GPU utilization by workload
* Storage I/O patterns
* Network usage and contention
* Differences between training, fine-tuning, and inference jobs

This enables:

* Per-workload optimization
* Capacity planning based on real usage
* Improved scheduling and cost efficiency
* Better matching of workloads to hardware profiles

Haystack helps FarmGPU optimize **goodput**, not just raw utilization.
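
The difference between goodput and raw utilization can be shown with a small calculation. The sketch below assumes one common definition: utilization counts all GPU time spent busy, while goodput counts only busy time that produced retained progress, excluding work that had to be redone (for example, after a failure forces a restart from an earlier checkpoint). The `JobRecord` fields and the sample numbers are hypothetical, and Haystack's actual formula is not specified here.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical accounting record for one job, in GPU-hours."""
    allocated_gpu_hours: float  # time the GPUs were reserved for the job
    busy_gpu_hours: float       # time the GPUs were actually computing
    wasted_gpu_hours: float     # busy time whose results were discarded or redone


def utilization(job):
    """Fraction of allocated GPU time spent computing, useful or not."""
    return job.busy_gpu_hours / job.allocated_gpu_hours


def goodput(job):
    """Fraction of allocated GPU time that produced retained progress."""
    return (job.busy_gpu_hours - job.wasted_gpu_hours) / job.allocated_gpu_hours


job = JobRecord(allocated_gpu_hours=100.0, busy_gpu_hours=90.0, wasted_gpu_hours=20.0)
print(f"utilization={utilization(job):.2f} goodput={goodput(job):.2f}")
```

In this example the job looks healthy at 90% utilization, yet only 70% of the allocated time contributed to the final result. That gap is what a goodput-oriented view is meant to expose.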

***

## Automation, Observability & Drivers (Underlying Stack)

Across all systems, FarmGPU standardizes on proven, open tooling:

### Automation

* **Ansible** for configuration and orchestration

### Observability

* **Grafana**
* **Prometheus**
* **Loki**
* **Alloy**
* **Node Exporter**
* **NVIDIA DCGM**
* **cAdvisor**

### Drivers & Acceleration

* NVIDIA GPU drivers
* SPDK for NVMe
* NVIDIA DOCA / OFED for networking and DPU acceleration

This approach keeps the stack **transparent, auditable, and extensible**, while avoiding vendor lock-in.

***

## Why This Software Stack Matters

FarmGPU's software stack is not a SaaS product suite—it is **operational leverage**.

It enables:

* Faster cluster bring-up
* Higher GPU utilization
* Lower operational overhead
* Predictable performance at scale
* Rapid adoption of new hardware generations

By building a Linux-native, hardware-aware control plane, FarmGPU achieves **hyperscaler-level operational maturity without hyperscaler complexity or cost**.