CUDA · Metal · ROCm. One codebase. No compromises.
Our training infrastructure targets three GPU compute platforms natively. Rather than wrapping a single vendor's API, each backend is implemented at the appropriate level of abstraction — giving us full access to hardware capabilities without the performance penalties of generic middleware.
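One common way to keep a single codebase over several native backends is a small virtual interface that each platform implements against its own API. The sketch below is illustrative only — `Backend`, `HostBackend`, and `saxpy` are assumed names for this example, not the actual internal API — with a host implementation standing in for a real CUDA/Metal/ROCm backend:

```cpp
#include <cstddef>

// Hypothetical backend interface: each platform (CUDA, Metal, ROCm)
// would implement these entry points against its native API.
struct Backend {
    virtual ~Backend() = default;
    virtual const char* name() const = 0;
    virtual void* alloc(std::size_t bytes) = 0;
    virtual void free(void* p) = 0;
    // One representative compute op: y = a * x + y
    virtual void saxpy(float a, const float* x, float* y, std::size_t n) = 0;
};

// Host-side reference implementation standing in for a device backend;
// a CUDA version would dispatch a kernel here instead of looping.
struct HostBackend final : Backend {
    const char* name() const override { return "host"; }
    void* alloc(std::size_t bytes) override { return ::operator new(bytes); }
    void free(void* p) override { ::operator delete(p); }
    void saxpy(float a, const float* x, float* y, std::size_t n) override {
        for (std::size_t i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
    }
};
```

Because the interface is narrow and the implementations are native, each backend can still use platform-specific tricks (streams, command buffers, HIP queues) behind the same calls.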
Capability matrix · by platform
| Component | CUDA · NVIDIA | Metal · Apple | ROCm · AMD |
| --- | --- | --- | --- |
| Custom kernels (hand-written compute shaders) | Production | Production | In progress |
| Attention layers (flash-style fused ops) | Production | In progress | In progress |
| Optimizer kernels (custom int-weight method) | Production | In progress | Evaluating |
| Pipeline inference (distributed node routing) | Production | Production | In progress |
| Data preprocessing (tokenizer + quality pipeline) | Production | Production | Production |
Platform notes · backend-specific detail
CUDA / NVIDIA
- Primary research and training platform
- PTX-level kernel optimisation where needed
- RTX 5090 + 5080 in production fleet
- cuBLAS / cuDNN used selectively
- Custom memory allocator for training runs
- Target: RTX PRO 6000 Blackwell (96 GB)
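The custom memory allocator is not described in detail here; a common design for training runs is a bump (arena) allocator that carves a step's activations out of one pre-reserved slab and releases them all at once, keeping per-tensor `cudaMalloc`/`cudaFree` off the hot path. A hedged sketch — the `Arena` class and its parameters are assumptions for illustration, not the production allocator:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal bump (arena) allocator sketch. One slab is reserved up front;
// allocations only advance an offset, and reset() frees everything at once.
class Arena {
public:
    explicit Arena(std::size_t bytes) : slab_(bytes), offset_(0) {}

    // align must be a power of two; 256 B suits most GPU tensor layouts.
    void* alloc(std::size_t bytes, std::size_t align = 256) {
        std::size_t p = (offset_ + align - 1) & ~(align - 1);
        if (p + bytes > slab_.size()) return nullptr;  // arena exhausted
        offset_ = p + bytes;
        return slab_.data() + p;
    }

    // Release the whole step's allocations in O(1).
    void reset() { offset_ = 0; }
    std::size_t used() const { return offset_; }

private:
    std::vector<std::uint8_t> slab_;  // device memory in a real backend
    std::size_t offset_;
};
```

In a real backend the slab would be a single device allocation; the bookkeeping on the host stays identical.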
Metal / Apple Silicon
- Inference and evaluation target
- Metal Performance Shaders for data pipeline
- M5 MacBook Pro · M4 Mac Mini in fleet
- Unified memory exploited for large contexts
- MLX compatibility layer under evaluation
- Strong mobile / edge deployment story
ROCm / AMD
- Secondary compute node, active development
- RX 9070 XT in current fleet
- HIP used for cross-compile portability
- rocBLAS integration in progress
- Evaluated as cost-effective training node
- Full kernel parity is the target milestone
NOTE /
We do not use PyTorch, TensorFlow, or JAX as training backends. Our C++ training stack interfaces with vendor libraries directly, giving us full control over memory layout, kernel scheduling, and gradient flow. Framework abstractions are used only at the evaluation and tooling layer.
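To illustrate what owning gradient flow without a framework can look like, here is a toy reverse-mode tape in C++. It is a generic teaching sketch — the `Tape` type and its ops are hypothetical, not this stack's machinery — showing the core idea: record each op's backward rule while computing forward, then replay the rules in reverse.

```cpp
#include <functional>
#include <vector>

// Toy reverse-mode autodiff tape. Each op appends its backward rule as a
// closure; backward() seeds the output gradient and replays rules in reverse.
struct Tape {
    std::vector<double> val, grad;
    std::vector<std::function<void()>> backward_ops;

    // Register a value with zero gradient; returns its index on the tape.
    int leaf(double v) {
        val.push_back(v);
        grad.push_back(0.0);
        return static_cast<int>(val.size()) - 1;
    }

    int mul(int a, int b) {
        int out = leaf(val[a] * val[b]);
        backward_ops.push_back([this, a, b, out] {
            grad[a] += grad[out] * val[b];  // d(ab)/da = b
            grad[b] += grad[out] * val[a];  // d(ab)/db = a
        });
        return out;
    }

    int add(int a, int b) {
        int out = leaf(val[a] + val[b]);
        backward_ops.push_back([this, a, b, out] {
            grad[a] += grad[out];
            grad[b] += grad[out];
        });
        return out;
    }

    void backward(int out) {
        grad[out] = 1.0;
        for (auto it = backward_ops.rbegin(); it != backward_ops.rend(); ++it)
            (*it)();
    }
};
```

For z = x·y + x with x = 3, y = 4, the tape yields z = 15, ∂z/∂x = y + 1 = 5, and ∂z/∂y = x = 3. Owning this machinery is what makes decisions like kernel-level gradient fusion possible.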