CUDA · Metal · ROCm. One codebase. No compromises.
Our training infrastructure targets three GPU compute platforms natively. Rather than wrapping a single vendor's API, each backend is implemented at the appropriate level of abstraction — giving us full access to hardware capabilities without the performance penalties of generic middleware.
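One common way to keep a single codebase over several native backends is a small virtual interface that each platform implements against its own API. The sketch below is illustrative only — `Backend`, `HostBackend`, and `saxpy` are assumed names for this example, not the actual internal API — with a host implementation standing in for a real CUDA/Metal/ROCm backend:

```cpp
#include <cstddef>

// Hypothetical backend interface: each platform (CUDA, Metal, ROCm)
// would implement these entry points against its native API.
struct Backend {
    virtual ~Backend() = default;
    virtual const char* name() const = 0;
    virtual void* alloc(std::size_t bytes) = 0;
    virtual void free(void* p) = 0;
    // One representative compute op: y = a * x + y
    virtual void saxpy(float a, const float* x, float* y, std::size_t n) = 0;
};

// Host-side reference implementation standing in for a device backend;
// a CUDA version would dispatch a kernel here instead of looping.
struct HostBackend final : Backend {
    const char* name() const override { return "host"; }
    void* alloc(std::size_t bytes) override { return ::operator new(bytes); }
    void free(void* p) override { ::operator delete(p); }
    void saxpy(float a, const float* x, float* y, std::size_t n) override {
        for (std::size_t i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
    }
};
```

Because the interface is narrow and the implementations are native, each backend can still use platform-specific tricks (streams, command buffers, HIP queues) behind the same calls.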
Capability matrix · by platform
| Component | CUDA · NVIDIA | Metal · Apple | ROCm · AMD |
| --- | --- | --- | --- |
| Custom kernels (hand-written compute shaders) | Production | Production | In progress |
| Attention layers (flash-style fused ops) | Production | In progress | In progress |
| Optimizer kernels (custom int-weight method) | Production | In progress | Evaluating |
| Pipeline inference (distributed node routing) | Production | Production | In progress |
| Data preprocessing (tokenizer + quality pipeline) | Production | Production | Production |
Platform notes · backend-specific detail
CUDA / NVIDIA
- Primary research and training platform
- PTX-level kernel optimisation where needed
- RTX 5090 + 5080 in production fleet
- cuBLAS / cuDNN used selectively
- Custom memory allocator for training runs
- Target: RTX PRO 6000 Blackwell (96 GB)
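The custom memory allocator is not described in detail here; a common design for training runs is a bump (arena) allocator that carves a step's activations out of one pre-reserved slab and releases them all at once, keeping per-tensor `cudaMalloc`/`cudaFree` off the hot path. A hedged sketch — the `Arena` class and its parameters are assumptions for illustration, not the production allocator:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal bump (arena) allocator sketch. One slab is reserved up front;
// allocations only advance an offset, and reset() frees everything at once.
class Arena {
public:
    explicit Arena(std::size_t bytes) : slab_(bytes), offset_(0) {}

    // align must be a power of two; 256 B suits most GPU tensor layouts.
    void* alloc(std::size_t bytes, std::size_t align = 256) {
        std::size_t p = (offset_ + align - 1) & ~(align - 1);
        if (p + bytes > slab_.size()) return nullptr;  // arena exhausted
        offset_ = p + bytes;
        return slab_.data() + p;
    }

    // Release the whole step's allocations in O(1).
    void reset() { offset_ = 0; }
    std::size_t used() const { return offset_; }

private:
    std::vector<std::uint8_t> slab_;  // device memory in a real backend
    std::size_t offset_;
};
```

In a real backend the slab would be a single device allocation; the bookkeeping on the host stays identical.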
Metal / Apple Silicon
- Inference and evaluation target
- Metal Performance Shaders for data pipeline
- M5 MacBook Pro · M4 Mac Mini in fleet
- Unified memory exploited for large contexts
- MLX compatibility layer under evaluation
- Strong mobile / edge deployment story
ROCm / AMD
- Secondary compute node, active development
- RX 9070 XT in current fleet
- HIP used for cross-compile portability
- rocBLAS integration in progress
- Evaluated as cost-effective training node
- Full kernel parity is the target milestone
NOTE /
We do not use PyTorch, TensorFlow, or JAX as training backends. Our C++ training stack interfaces with vendor libraries directly, giving us full control over memory layout, kernel scheduling, and gradient flow. Framework abstractions are used only at the evaluation and tooling layer.
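To illustrate what owning gradient flow without a framework can look like, here is a toy reverse-mode tape in C++. It is a generic teaching sketch — the `Tape` type and its ops are hypothetical, not this stack's machinery — showing the core idea: record each op's backward rule while computing forward, then replay the rules in reverse.

```cpp
#include <functional>
#include <vector>

// Toy reverse-mode autodiff tape. Each op appends its backward rule as a
// closure; backward() seeds the output gradient and replays rules in reverse.
struct Tape {
    std::vector<double> val, grad;
    std::vector<std::function<void()>> backward_ops;

    // Register a value with zero gradient; returns its index on the tape.
    int leaf(double v) {
        val.push_back(v);
        grad.push_back(0.0);
        return static_cast<int>(val.size()) - 1;
    }

    int mul(int a, int b) {
        int out = leaf(val[a] * val[b]);
        backward_ops.push_back([this, a, b, out] {
            grad[a] += grad[out] * val[b];  // d(ab)/da = b
            grad[b] += grad[out] * val[a];  // d(ab)/db = a
        });
        return out;
    }

    int add(int a, int b) {
        int out = leaf(val[a] + val[b]);
        backward_ops.push_back([this, a, b, out] {
            grad[a] += grad[out];
            grad[b] += grad[out];
        });
        return out;
    }

    void backward(int out) {
        grad[out] = 1.0;
        for (auto it = backward_ops.rbegin(); it != backward_ops.rend(); ++it)
            (*it)();
    }
};
```

For z = x·y + x with x = 3, y = 4, the tape yields z = 15, ∂z/∂x = y + 1 = 5, and ∂z/∂y = x = 3. Owning this machinery is what makes decisions like kernel-level gradient fusion possible.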