About OpenAI

Empowering humanity through safe AI innovation

🏢 Tech • 👥 1001+ employees • 📅 Founded 2015 • 📍 Mission District, San Francisco, CA • 💰 $68.9b • ⭐ 4.2

B2C • B2B • Artificial Intelligence • Enterprise • SaaS • API • DevOps

Key Highlights

Headquartered in San Francisco, CA with 1,001+ employees
$68.9 billion raised in funding from top investors
Launched ChatGPT, gaining 1 million users in 5 days
20-week paid parental leave and unlimited PTO policy

OpenAI is a leading AI research and development platform headquartered in the Mission District of San Francisco, CA. With over 1,001 employees, OpenAI has raised $68.9 billion in funding and is known for its groundbreaking products like ChatGPT, which gained over 1 million users within just five day...

🎁 Benefits

OpenAI offers flexible work hours and encourages unlimited paid time off, promoting at least 4 weeks of vacation per year. Employees enjoy comprehensi...

🌟 Culture

OpenAI's culture is centered around its mission to ensure that AGI benefits all of humanity. The company values transparency and ethical consideration...

Site Reliability Engineer

OpenAI • San Francisco - On-Site

Posted 3 months ago • 🏛️ On-Site • Site Reliability Engineer • 📍 San Francisco

Skills & Technologies

kubernetes linux automation

Overview

OpenAI is hiring a Site Reliability Engineer to operate and scale the next generation of compute clusters for frontier research. You'll work with Kubernetes and automation to ensure the reliability of large-scale supercomputers. This role requires experience in distributed systems and infrastructure management.

Job Description

Who you are

You have a strong background in distributed systems engineering and hands-on experience with infrastructure management — you've successfully operated large-scale compute clusters and understand the complexities involved in maintaining their reliability. Your expertise in Kubernetes is complemented by a solid understanding of Linux systems, allowing you to effectively manage and scale clusters in a hyperscale environment.

You thrive in fast-paced environments where quick problem-solving is essential — when issues arise, you can diagnose and resolve them efficiently, ensuring minimal downtime. Your experience with automation tools has enabled you to streamline processes, improve operational metrics, and enhance the overall efficiency of system operations.

What you'll do

In this role, you will be responsible for spinning up and scaling large Kubernetes clusters, focusing on automation for provisioning, bootstrapping, and cluster lifecycle management. You will build software abstractions that unify multiple clusters, presenting a seamless interface to training workloads. Your responsibilities will also include owning the node bring-up process from bare metal through firmware upgrades, ensuring fast and repeatable deployment at massive scale.

You will continuously work to improve operational metrics, aiming to reduce downtime and enhance system reliability. Collaborating with cross-functional teams, you will contribute to the design and implementation of systems that support OpenAI's cutting-edge model training initiatives. Your role will be pivotal in maintaining the efficiency and reliability of the infrastructure that powers groundbreaking AI research.

What we offer

At OpenAI, you will be part of a mission-driven team that believes in the potential of artificial intelligence to solve global challenges. We offer a collaborative work environment where your contributions will have a direct impact on the future of technology. Join us in shaping the future of AI and enjoy the opportunity to work with some of the brightest minds in the field.

Similar Jobs You Might Like

Based on your interests and this role

Site Reliability Engineer

Together AI • 📍 San Francisco

Together AI is hiring a Site Reliability Engineer to ensure the reliability and performance of user-facing services and production systems. You'll work with Ansible, Terraform, and Kubernetes to build and manage infrastructure. This role requires 2+ years of experience in SRE or a related field.

Mid-Level

2w ago

Software Engineering

OpenAI • 📍 San Francisco - On-Site

OpenAI is hiring a Software Engineer for the Frontier Clusters Infrastructure team to operate next-generation compute clusters. You'll work with Kubernetes and automation to support large-scale model training. This position requires experience in distributed systems and infrastructure.

🏛️ On-Site • Mid-Level

1 year ago

Software Engineering

OpenAI • 📍 San Francisco - On-Site

OpenAI is hiring a Senior Software Engineer for the Frontier Systems team to build critical infrastructure for supercomputers. You'll work with Python, SQL, and automation tools to ensure reliable model training. This position requires 7+ years of experience.

🏛️ On-Site • Senior

9 months ago

Site Reliability Engineer

WorkOS • 📍 San Francisco - Remote

WorkOS is hiring a Site Reliability Engineer to ensure the platform remains fast, reliable, and resilient at scale. You'll work with AWS, Docker, and Kubernetes to build systems that handle hundreds of millions of requests. This role requires a strong understanding of complex systems and incident response.

🏠 Remote

8 months ago

Site Reliability Engineer

Mercor • 📍 San Francisco - On-Site

Mercor is seeking a Site Reliability Engineer to own production reliability across critical systems. You'll work with AWS, Kubernetes, and Terraform to build and improve high-availability systems in San Francisco.

🏛️ On-Site • Mid-Level

1 month ago
