
About Amazon
The everything store and cloud computing leader
Key Highlights
- Headquartered in South Lake Union, Seattle, WA
- Over 1.5 million employees worldwide
- Leading cloud services through Amazon Web Services (AWS)
- Acquired Whole Foods, Twitch, and Ring
Amazon, headquartered in South Lake Union, Seattle, WA, is the world's largest online retailer and a leader in cloud computing through Amazon Web Services (AWS). With over 1.5 million employees globally, Amazon operates in various sectors, including AI with its Alexa devices and a vast marketplace k...
🎁 Benefits
Amazon offers competitive salaries, stock options, generous PTO policies, and comprehensive health benefits. Employees also have access to a learning ...
🌟 Culture
Amazon's culture is driven by customer obsession and a focus on innovation. The company encourages employees to think big and move fast, fostering an ...
Overview
Amazon is hiring a Senior Machine Learning Engineer to develop and optimize distributed training solutions for AWS Neuron. You'll work with technologies like Python, PyTorch, and AWS to enhance performance for large-scale ML models. This position requires experience in training large models and distributed systems.
Job Description
Who you are
You have 5+ years of experience in machine learning engineering, particularly in developing and optimizing distributed training solutions. Your expertise includes working with large-scale models such as GPT and Llama, and you understand the intricacies of training these models effectively. You are proficient in Python and have hands-on experience with distributed training libraries like DeepSpeed and NeMo.
You thrive in collaborative environments, working alongside chip architects and compiler engineers to create innovative solutions. Your strong analytical skills enable you to optimize models for peak performance on AWS custom silicon, ensuring efficiency and effectiveness in your work.
You are familiar with the AWS ecosystem and have experience with AWS Neuron, which is crucial for this role. Your background includes a solid understanding of machine learning frameworks, and you are comfortable leading efforts to integrate distributed training support into frameworks like PyTorch and JAX.
Desirable
Experience with FSDP (Fully Sharded Data Parallel) is a plus, as is familiarity with the Neuron compiler and runtime stacks. You have a passion for pushing the boundaries of what is possible in machine learning and are eager to tackle complex challenges.
What you'll do
In this role, you will lead the development of distributed training support for AWS Neuron, focusing on enhancing the performance of various ML model families. You will collaborate closely with cross-functional teams to build and tune distributed training solutions that leverage AWS Trainium instances. Your responsibilities will include optimizing models to achieve peak performance and maximizing efficiency on custom silicon.
You will be responsible for integrating distributed training capabilities into PyTorch and JAX, utilizing XLA and the Neuron compiler. Your work will directly impact the performance of large-scale machine learning applications, enabling customers to solve complex challenges with innovative cloud solutions.
You will also engage in performance tuning and enablement of a wide variety of ML models, ensuring that they run efficiently on AWS infrastructure. Your contributions will help shape the future of machine learning at Amazon, making a significant impact on how customers utilize cloud solutions.
What we offer
Amazon provides a competitive salary range of $193,300.00 - $261,500.00 USD annually, along with comprehensive benefits including dental, vision, and mental health support. You will have access to a 401(k) matching program, paid time off, and parental leave, ensuring a supportive work-life balance. Join us at Amazon to be part of a team that is at the forefront of innovation in machine learning and cloud technology.