Toolmaker. Software creator, optimizer and harmonizer.
Makes things work and fly at Contextual.AI
Training LLM/RAG/Generative AI/Machine Learning/Scalability
A combined effort from the IBM + PyTorch teams achieved incredible training performance with ZeRO/FSDP, on par with 3D parallelism on H100s, despite having just an 800 Gbps inter-node connection.
This is because they achieved almost full overlap between comms and compute, and introduced a novel selective activation recomputation method that recomputes only activations that are large but inexpensive to recalculate.
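
To make the "large but inexpensive" idea concrete, here is a toy sketch of how such a selection could work. It is not the IBM + PyTorch implementation: the `Activation` class, the `select_for_recompute` helper, the greedy bytes-per-FLOP ranking and all the numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Activation:
    # Hypothetical description of one saved tensor in the forward pass.
    name: str
    size_bytes: int       # memory needed to keep it until the backward pass
    recompute_flops: int  # cost of rebuilding it from its inputs


def select_for_recompute(activations, memory_to_free):
    """Greedily pick the activations that free the most bytes per recompute FLOP.

    Assumed heuristic: rank by size_bytes / recompute_flops and stop once
    at least `memory_to_free` bytes would be released.
    """
    ranked = sorted(
        activations,
        key=lambda a: a.size_bytes / max(a.recompute_flops, 1),
        reverse=True,
    )
    chosen, freed = [], 0
    for act in ranked:
        if freed >= memory_to_free:
            break
        chosen.append(act.name)
        freed += act.size_bytes
    return chosen, freed


# Made-up sizes and costs, only to show the shape of the trade-off.
acts = [
    Activation("attention_scores", 512_000_000, 2_000_000_000),
    Activation("gelu_output", 256_000_000, 50_000_000),
    Activation("layernorm_output", 64_000_000, 20_000_000),
    Activation("mlp_matmul_output", 256_000_000, 40_000_000_000),
]

chosen, freed = select_for_recompute(acts, memory_to_free=300_000_000)
print(chosen, freed)
```

In this toy setup the cheap elementwise outputs get dropped and recomputed, while the expensive matmul output stays in memory. In a real PyTorch training loop the recomputation itself would be handled by an activation checkpointing mechanism rather than a standalone planner like this one.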