
About OpenAI
Empowering humanity through safe AI innovation
Key Highlights
- Headquartered in San Francisco, CA with 1,001+ employees
- $68.9 billion raised in funding from top investors
- Launched ChatGPT, gaining 1 million users in 5 days
- 20-week paid parental leave and unlimited PTO policy
OpenAI is a leading AI research and development company headquartered in the Mission District of San Francisco, CA. With more than 1,000 employees, OpenAI has raised $68.9 billion in funding and is known for its groundbreaking products like ChatGPT, which gained over 1 million users within just five day...
🎁 Benefits
OpenAI offers flexible work hours and encourages unlimited paid time off, promoting at least 4 weeks of vacation per year. Employees enjoy comprehensi...
🌟 Culture
OpenAI's culture is centered around its mission to ensure that AGI benefits all of humanity. The company values transparency and ethical consideration...
Overview
OpenAI is hiring a Reliability/DFX Engineer to oversee the architecture and implementation of reliable AI accelerator systems. You'll work closely with the chip design and platform design teams, applying your expertise in machine learning and hardware engineering. This role requires a strong background in making ML systems reliable at scale.
Job Description
Who you are
You have a strong background in hardware engineering and machine learning, with hands-on experience in making ML systems reliable at scale. Your expertise in DFX architecture allows you to oversee the implementation and execution of reliability features in silicon, ensuring high-performance AI hardware meets the demands of advanced workloads. You are skilled in building system-level reliability models grounded in empirical data, guiding the development of innovative solutions.
You thrive in collaborative environments, working closely with chip design and platform design teams to architect and deploy next-generation AI accelerator systems. Your ability to identify high-ROI opportunities for improving reliability and availability across the stack sets you apart. You are detail-oriented and have a strategic mindset, translating complex technical challenges into actionable solutions.
What you'll do
In this role, you will oversee the DFX architecture from concept to high-volume deployment, proposing features that enhance reliability and fault tolerance in AI hardware. You will collaborate with cross-functional teams to evaluate system and chip architecture holistically, ensuring that the hardware is optimized for AI workloads. Your responsibilities will include building and refining reliability models and guiding the development process with empirical data.
You will play a critical role in shaping the future of AI technology at OpenAI, contributing to the development of custom design tools and methodologies that accelerate innovation. Your work will directly impact the performance and reliability of AI systems, making a significant contribution to the company's mission of advancing artificial intelligence for the benefit of humanity.
What we offer
At OpenAI, we are committed to fostering an inclusive and supportive work environment. We offer competitive compensation and benefits, along with opportunities for professional growth and development. Join us in shaping the future of technology and making a positive impact on the world through AI.