
About OpenAI
Empowering humanity through safe AI innovation
Key Highlights
- Headquartered in San Francisco, CA with 1,001+ employees
- $68.9 billion raised in funding from top investors
- Launched ChatGPT, gaining 1 million users in 5 days
- 20-week paid parental leave and unlimited PTO policy
OpenAI is a leading AI research and development platform headquartered in the Mission District of San Francisco, CA. With over 1,001 employees, OpenAI has raised $68.9 billion in funding and is known for its groundbreaking products like ChatGPT, which gained over 1 million users within just five day...
🎁 Benefits
OpenAI offers flexible work hours and encourages unlimited paid time off, promoting at least 4 weeks of vacation per year. Employees enjoy comprehensi...
🌟 Culture
OpenAI's culture is centered around its mission to ensure that AGI benefits all of humanity. The company values transparency and ethical consideration...
Overview
OpenAI is hiring a Software Engineer for the Frontier Clusters Infrastructure team to operate next-generation compute clusters. You'll work with Kubernetes and automation to support large-scale model training. This position requires experience in distributed systems and infrastructure.
Job Description
Who you are
You have a strong background in distributed systems engineering and hands-on experience with infrastructure work, particularly in large-scale data centers. Your expertise includes scaling Kubernetes clusters and automating processes for provisioning and lifecycle management. You thrive in fast-paced environments where reliability and speed are critical, and you have a knack for quickly diagnosing and resolving issues.
You are familiar with managing bare-metal deployments and have experience with node bring-up processes, ensuring that deployments are fast and repeatable at massive scale. Your understanding of software abstractions allows you to unify multiple clusters, presenting a seamless interface to training workloads. You are committed to improving operational metrics and enhancing system reliability.
What you'll do
In this role, you will spin up and scale large Kubernetes clusters, with a focus on automating provisioning, bootstrapping, and cluster lifecycle management. You will build software abstractions that hide the complexity of managing multiple clusters across data centers. You will own the node bring-up process from bare metal through firmware upgrades, ensuring that deployments are efficient and reliable.
You will collaborate closely with cross-functional teams to enhance the performance of OpenAI's frontier research infrastructure. Your contributions will directly impact the efficiency and reliability of the supercomputers that support cutting-edge model training. You will continuously seek to improve operational metrics, aiming to reduce downtime and enhance system performance.
What we offer
At OpenAI, you will be part of a mission-driven team that believes in the potential of artificial intelligence to solve global challenges. We offer a collaborative work environment where your contributions will have a significant impact on the future of technology. You will have opportunities for professional growth and development as you work alongside some of the brightest minds in the field.
We encourage you to apply even if your experience doesn't match every requirement. Join us in shaping the future of AI and technology.