OpenAI

About OpenAI

Empowering humanity through safe AI innovation

🏢 Tech | 👥 1001+ employees | 📅 Founded 2015 | 📍 Mission District, San Francisco, CA | 💰 $68.9b | 4.2
B2C | B2B | Artificial Intelligence | Enterprise | SaaS | API | DevOps

Key Highlights

  • Headquartered in San Francisco, CA with 1,001+ employees
  • $68.9 billion raised in funding from top investors
  • Launched ChatGPT, gaining 1 million users in 5 days
  • 20-week paid parental leave and unlimited PTO policy

OpenAI is a leading AI research and development platform headquartered in the Mission District of San Francisco, CA. With more than 1,000 employees, OpenAI has raised $68.9 billion in funding and is known for groundbreaking products like ChatGPT, which gained over 1 million users within just five day...

🎁 Benefits

OpenAI offers flexible work hours and encourages unlimited paid time off, promoting at least 4 weeks of vacation per year. Employees enjoy comprehensi...

🌟 Culture

OpenAI's culture is centered around its mission to ensure that AGI benefits all of humanity. The company values transparency and ethical consideration...

Overview

OpenAI is hiring a Reliability/DFX Engineer to oversee the architecture and implementation of reliable AI accelerator systems. You'll work closely with the chip design and platform design teams, drawing on your expertise in machine learning and hardware engineering. This role requires a strong background in making ML systems reliable at scale.

Job Description

Who you are

You have a strong background in hardware engineering and machine learning, with hands-on experience in making ML systems reliable at scale. Your expertise in DFX architecture allows you to oversee the implementation and execution of reliability features in silicon, ensuring high-performance AI hardware meets the demands of advanced workloads. You are skilled in building system-level reliability models grounded in empirical data, guiding the development of innovative solutions.

You thrive in collaborative environments, working closely with chip design and platform design teams to architect and deploy next-generation AI accelerator systems. Your ability to identify high-ROI opportunities for improving reliability and availability across the stack sets you apart. You are detail-oriented and have a strategic mindset, translating complex technical challenges into actionable solutions.

What you'll do

In this role, you will oversee the DFX architecture from concept to high-volume deployment, proposing features that enhance reliability and fault tolerance in AI hardware. You will collaborate with cross-functional teams to evaluate system and chip architecture holistically, ensuring that the hardware is optimized for AI workloads. Your responsibilities will include building and refining reliability models and guiding the development process with empirical data.

You will play a critical role in shaping the future of AI technology at OpenAI, contributing to the development of custom design tools and methodologies that accelerate innovation. Your work will directly impact the performance and reliability of AI systems, making a significant contribution to the company's mission of advancing artificial intelligence for the benefit of humanity.

What we offer

At OpenAI, we are committed to fostering an inclusive and supportive work environment. We offer competitive compensation and benefits, along with opportunities for professional growth and development. Join us in shaping the future of technology and making a positive impact on the world through AI.
