First to support fine-tuning for Llama 3.1 405B
We've built a training platform for non-NVIDIA accelerators, starting with TPUs. It delivers the same performance as an NVIDIA H100 at 30% lower cost and is optimized for large models.
Backed by
Key Features of Felafax
One-click large training cluster
Effortlessly spin up TPU/non-NVIDIA accelerator clusters from 8 to 1024 chips. Our framework seamlessly handles training orchestration on clusters of any size.
Unbeatable performance at lower cost
We built a custom training platform on a non-CUDA, XLA-based architecture. You get the same performance as an H100 at 30% lower cost.
Customization at your fingertips
Drop into a Jupyter notebook and tailor your training run. Full control, zero compromises.
We handle the heavy lifting
We provide optimized model partitioning for Llama 3.1 405B and handle distributed checkpointing and multi-controller training orchestration. Focus on your innovation, not your infrastructure.
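For a sense of what model partitioning means in practice, here is a minimal sketch using plain JAX sharding APIs. The shapes and partition spec are illustrative only and are not Felafax's actual 405B layout.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange all available chips along a single "model" axis.
devices = mesh_utils.create_device_mesh((jax.device_count(),))
mesh = Mesh(devices, axis_names=("model",))

# Shard a large weight matrix column-wise across the chips,
# so each device holds only its own slice of the parameters.
weights = jnp.zeros((4096, 4096))
weights = jax.device_put(weights, NamedSharding(mesh, P(None, "model")))

# jit-compiled computations on sharded arrays are partitioned automatically;
# XLA inserts the cross-chip collectives for you.
@jax.jit
def project(x, w):
    return x @ w
```

At 405B-parameter scale, the same idea is applied across a multi-dimensional mesh of chips, which is the partitioning work the platform takes off your plate.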
Out-of-the-box templates
Choose between PyTorch XLA and JAX. Hit the ground running with pre-configured environments that have all the necessary dependencies installed.
JAX implementation of Llama 3.1 (coming soon!)
With JAX, you get 25% faster training and 20% higher GPU utilization. Make good use of the costly compute you've paid for.
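As a rough illustration of why JAX on XLA is efficient, here is a toy training step that compiles to a single fused program; the loss, parameter names, and shapes are hypothetical stand-ins for the real Llama 3.1 forward pass.

```python
import jax
import jax.numpy as jnp

LEARNING_RATE = 1e-4

# Hypothetical loss over a dict of parameters; a real run would use the
# Llama 3.1 forward pass and a tokenized batch.
def loss_fn(params, batch):
    preds = batch["x"] @ params["w"]
    return jnp.mean((preds - batch["y"]) ** 2)

@jax.jit  # the whole step compiles to one fused XLA program
def train_step(params, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    params = jax.tree_util.tree_map(lambda p, g: p - LEARNING_RATE * g, params, grads)
    return params, loss

# Toy usage
params = {"w": jnp.zeros((128, 16))}
batch = {"x": jnp.ones((32, 128)), "y": jnp.zeros((32, 16))}
params, loss = train_step(params, batch)
```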
Want to fine-tune Llama 405B on your enterprise data?
Please reach out to us, and we'll work with you to get you set up. 🙂
Meet our team
Built by engineers with experience at
Let’s connect
We’re here to help and answer any questions you might have. We look forward to hearing from you.
Email: [email protected]
Meeting: cal.com