Multi-platform compute

CUDA · Metal · ROCm.
One codebase. No compromises.

Our training infrastructure targets three GPU compute platforms natively. Rather than wrapping a single vendor's API, each backend is implemented at the appropriate level of abstraction — giving us full access to hardware capabilities without performance penalties from generic middleware.
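A minimal C++ sketch of what this per-backend split could look like: one narrow interface per platform, each implemented directly against the vendor API instead of a shared middleware layer. The class names, the factory, and the host-side allocator stand-in are all hypothetical, not our actual API.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical: one small interface per backend, implemented directly
// on the vendor API (cudaMalloc, MTLBuffer, hipMalloc) with no generic
// middleware in between.
class ComputeBackend {
public:
    virtual ~ComputeBackend() = default;
    virtual std::string name() const = 0;
    // Each backend owns its allocation strategy; there is no shared
    // allocator that every platform must fit into.
    virtual void* allocate(std::size_t bytes) = 0;
    virtual void free(void* ptr) = 0;
};

// Host-side stand-in so the sketch compiles without a GPU; a real CUDA
// backend would call cudaMalloc/cudaFree here.
class CudaBackend : public ComputeBackend {
public:
    std::string name() const override { return "cuda"; }
    void* allocate(std::size_t bytes) override { return ::operator new(bytes); }
    void free(void* ptr) override { ::operator delete(ptr); }
};

std::unique_ptr<ComputeBackend> make_backend(const std::string& platform) {
    if (platform == "cuda") return std::make_unique<CudaBackend>();
    // "metal" and "rocm" backends would be selected the same way.
    return nullptr;
}
```

The point of the narrow interface is that nothing above it assumes a common feature set, so each backend can expose platform-specific fast paths without being limited to the lowest common denominator.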

Capability matrix by platform

| Component | CUDA · NVIDIA | Metal · Apple | ROCm · AMD |
| --- | --- | --- | --- |
| Custom kernels (hand-written compute shaders) | Production | Production | In progress |
| Attention layers (flash-style fused ops) | Production | In progress | In progress |
| Optimizer kernels (custom int-weight method) | Production | In progress | Evaluating |
| Pipeline inference (distributed node routing) | Production | Production | In progress |
| Data preprocessing (tokenizer + quality pipeline) | Production | Production | Production |
Platform notes: backend-specific detail
Note: We do not use PyTorch, TensorFlow, or JAX as training backends. Our C++ training stack interfaces with vendor libraries directly, giving us full control over memory layout, kernel scheduling, and gradient flow. Framework abstractions are used only at the evaluation and tooling layer.
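As an illustration of "full control over memory layout and gradient flow", here is a hedged C++ sketch: parameters and gradients live in explicitly laid-out contiguous buffers, and the update step is a plain loop the scheduler can place wherever it likes, with no autograd tape in between. The struct and its method are illustrative, not our production code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: parameters and gradients share one explicit,
// contiguous layout with identical indexing, so there is no hidden
// framework-managed storage between the backward pass and the update.
struct ParamBlock {
    std::vector<float> params;  // contiguous, backend-owned layout
    std::vector<float> grads;   // same layout, same indexing

    explicit ParamBlock(std::size_t n) : params(n, 0.0f), grads(n, 0.0f) {}

    // Plain SGD step for clarity; a real stack would dispatch this to a
    // custom optimizer kernel on the active backend.
    void apply_gradients(float lr) {
        for (std::size_t i = 0; i < params.size(); ++i) {
            params[i] -= lr * grads[i];
            grads[i] = 0.0f;  // explicit gradient reset, no autograd tape
        }
    }
};
```

Because the update is ordinary code over an explicit layout, it can be fused, tiled, or rescheduled per platform without fighting a framework's execution model.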