Are you passionate about scaling cutting-edge AI across distributed systems? Together with our partner, a prominent online fashion and beauty retailer in Europe, we’re looking for an experienced ML Engineer - Distributed Training Specialist to develop and optimize large language models (LLMs) tailored to the fashion industry.
Working with massive-scale data, we’re creating LLMs designed to entertain and inspire customers, shaping the future of AI in fashion. Join us and make an impact!
Implement and optimize distributed training pipelines for large-scale multimodal models
Set up and maintain training infrastructure across multiple nodes/GPUs
Develop and optimize data loading pipelines for multimodal inputs
Monitor and improve training efficiency and resource utilization
Implement checkpointing and fault tolerance mechanisms
Bachelor's or Master's degree in Computer Science, Engineering, or a related field
5+ years of experience in ML engineering
Experience with LLMs and/or image processing
Proven track record with large-scale model training and optimization
Experience with multimodal data processing and training
Proficiency in Python and PyTorch
Proficiency with cloud technologies
Preferred but not required: strong experience with distributed training frameworks (DeepSpeed, FSDP, Megatron)