
About Amazon
The everything store and cloud computing leader
Key Highlights
- Headquartered in South Lake Union, Seattle, WA
- Over 1.5 million employees worldwide
- Leading cloud services through Amazon Web Services (AWS)
- Acquired Whole Foods, Twitch, and Ring
Amazon, headquartered in South Lake Union, Seattle, WA, is the world's largest online retailer and a leader in cloud computing through Amazon Web Services (AWS). With over 1.5 million employees globally, Amazon operates in various sectors, including AI with its Alexa devices and a vast marketplace k...
🎁 Benefits
Amazon offers competitive salaries, stock options, generous PTO policies, and comprehensive health benefits. Employees also have access to a learning ...
🌟 Culture
Amazon's culture is driven by customer obsession and a focus on innovation. The company encourages employees to think big and move fast, fostering an ...
Overview
Amazon is hiring a Senior Machine Learning Engineer for the AWS Neuron Distributed Training team. You'll develop distributed training solutions for large-scale ML models and tune their performance on AWS Trainium. This position requires experience with PyTorch and distributed training strategies.
Job Description
Who you are
You have 5+ years of experience in machine learning engineering, particularly with distributed training of large models. Your expertise includes working with frameworks like PyTorch and TensorFlow, and you understand the intricacies of training massive-scale models such as Large Language Models (LLMs). You are familiar with distributed training strategies, including Fully Sharded Data Parallel (FSDP) and other parallelization techniques, which are essential for optimizing performance in cloud environments.
Your background includes collaboration with chip architects and compiler engineers, allowing you to bridge the gap between hardware and software for optimal ML performance. You have a strong understanding of the AWS ecosystem, particularly AWS Neuron, and how it integrates with machine learning workflows. You are detail-oriented and have a knack for performance tuning, ensuring that models run efficiently on cloud-scale infrastructure.
Desirable
Experience with additional distributed training libraries such as torchtitan, torchtune, and HF RL would be a plus. Familiarity with post-training strategies like DPO/PPO, and with tooling such as HF torch-tune, will further strengthen your application. You are passionate about using technology to solve complex challenges and eager to contribute to solutions that impact customers globally.
What you'll do
In this role, you will be responsible for the development and enablement of distributed training solutions for a variety of machine learning models. You will work closely with a team of engineers to create, build, and tune these solutions, ensuring they are optimized for AWS Trainium instances. Your work will directly contribute to the performance and scalability of machine learning applications in the cloud.
You will engage in hands-on development, focusing on the implementation of distributed training strategies that enhance model performance. Collaborating with cross-functional teams, you will help design and refine the software stack that supports AWS Neuron, ensuring it meets the needs of diverse ML workloads. Your contributions will help shape the future of machine learning at Amazon, enabling customers to tackle challenges that were previously unimaginable.
What we offer
Amazon provides a comprehensive benefits package, including competitive compensation, equity options, and a full range of medical and financial benefits. You will have the opportunity to work in a dynamic environment that fosters innovation and collaboration. We encourage you to apply even if your experience doesn't match every requirement, as we value diverse perspectives and backgrounds in our teams.