First to support fine-tuning for Llama 3.1 405B
We've built a training platform for non-NVIDIA accelerators, starting with TPUs. It delivers the same performance as an NVIDIA H100 at 30% lower cost and is optimized for large models.
Backed by
Key Features of Felafax
One-click large training cluster
Effortlessly spin up TPU/non-NVIDIA accelerator clusters from 8 to 1024 chips. Our framework seamlessly handles training orchestration on clusters of any size.
Unbeatable performance at lower cost
We built a custom training platform on a non-CUDA, XLA-based architecture. You get the same performance as an H100 at 30% lower cost.
Customization at your fingertips
Drop into a Jupyter notebook and tailor your training run. Full control, zero compromises.
We handle the heavy lifting
We provide optimized model partitioning for Llama 3.1 405B and handle distributed checkpointing and multi-controller training orchestration. Focus on your innovation, not your infrastructure.
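For a sense of what model partitioning means in practice, here is a minimal sketch using plain JAX sharding APIs. The shapes and partition spec are illustrative only and are not Felafax's actual 405B layout.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange all available chips along a single "model" axis.
devices = mesh_utils.create_device_mesh((jax.device_count(),))
mesh = Mesh(devices, axis_names=("model",))

# Shard a large weight matrix column-wise across the chips,
# so each device holds only its own slice of the parameters.
weights = jnp.zeros((4096, 4096))
weights = jax.device_put(weights, NamedSharding(mesh, P(None, "model")))

# jit-compiled computations on sharded arrays are partitioned automatically;
# XLA inserts the cross-chip collectives for you.
@jax.jit
def project(x, w):
    return x @ w
```

At 405B-parameter scale, the same idea is applied across a multi-dimensional mesh of chips, which is the partitioning work the platform takes off your plate.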
Out-of-the-box templates
Choose between PyTorch XLA and JAX. Hit the ground running with pre-configured environments that have all the necessary dependencies installed.
JAX implementation of Llama 3.1 (coming soon!)
With JAX, you get 25% faster training and 20% higher GPU utilization. Make good use of the costly compute you've paid for.
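As a rough illustration of why JAX on XLA is efficient, here is a toy training step that compiles to a single fused program; the loss, parameter names, and shapes are hypothetical stand-ins for the real Llama 3.1 forward pass.

```python
import jax
import jax.numpy as jnp

LEARNING_RATE = 1e-4

# Hypothetical loss over a dict of parameters; a real run would use the
# Llama 3.1 forward pass and a tokenized batch.
def loss_fn(params, batch):
    preds = batch["x"] @ params["w"]
    return jnp.mean((preds - batch["y"]) ** 2)

@jax.jit  # the whole step compiles to one fused XLA program
def train_step(params, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    params = jax.tree_util.tree_map(lambda p, g: p - LEARNING_RATE * g, params, grads)
    return params, loss

# Toy usage
params = {"w": jnp.zeros((128, 16))}
batch = {"x": jnp.ones((32, 128)), "y": jnp.zeros((32, 16))}
params, loss = train_step(params, batch)
```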
Want to fine-tune Llama 405B on your enterprise data?
Please reach out to us, and we'll work with you to get you set up. 🙂
Meet our team
Built by engineers with experience at
Let’s connect
We’re here to help and answer any questions you might have. We look forward to hearing from you.
Email: [email protected]
Meeting: cal.com