OpenAI

About OpenAI

Empowering humanity through safe AI innovation

🏢 Tech👥 1001+ employees📅 Founded 2015📍 Mission District, San Francisco, CA💰 $68.9b4.2
B2CB2BArtificial IntelligenceEnterpriseSaaSAPIDevOps

Key Highlights

  • Headquartered in San Francisco, CA with 1,001+ employees
  • $68.9 billion raised in funding from top investors
  • Launched ChatGPT, gaining 1 million users in 5 days
  • 20-week paid parental leave and unlimited PTO policy

OpenAI is a leading AI research and development platform headquartered in the Mission District of San Francisco, CA. With over 1,001 employees, OpenAI has raised $68.9 billion in funding and is known for its groundbreaking products like ChatGPT, which gained over 1 million users within just five day...

🎁 Benefits

OpenAI offers flexible work hours and encourages unlimited paid time off, promoting at least 4 weeks of vacation per year. Employees enjoy comprehensi...

🌟 Culture

OpenAI's culture is centered around its mission to ensure that AGI benefits all of humanity. The company values transparency and ethical consideration...

Overview

OpenAI is hiring a Software Engineer for the Frontier Clusters Infrastructure team to operate next-generation compute clusters. You'll work with Kubernetes and automation to support large-scale model training. This position requires experience in distributed systems and infrastructure.

Job Description

Who you are

You have a strong background in distributed systems engineering and hands-on experience with infrastructure work, particularly in large-scale data centers. Your expertise includes scaling Kubernetes clusters and automating processes for provisioning and lifecycle management. You thrive in fast-paced environments where reliability and speed are critical, and you have a knack for quickly diagnosing and resolving issues.

You are familiar with managing bare-metal deployments and have experience with node bring-up processes, ensuring that deployments are fast and repeatable at massive scale. Your understanding of software abstractions allows you to unify multiple clusters, presenting a seamless interface to training workloads. You are committed to improving operational metrics and enhancing system reliability.

What you'll do

In this role, you will be responsible for spinning up and scaling large Kubernetes clusters, focusing on automation for provisioning, bootstrapping, and cluster lifecycle management. You will build software abstractions that simplify the complexity of managing multiple clusters across various data centers. Your work will involve owning the node bring-up process from bare metal through firmware upgrades, ensuring that deployments are efficient and reliable.

You will collaborate closely with cross-functional teams to enhance the performance of OpenAI's frontier research infrastructure. Your contributions will directly impact the efficiency and reliability of the supercomputers that support cutting-edge model training. You will continuously seek to improve operational metrics, aiming to reduce downtime and enhance system performance.

What we offer

At OpenAI, you will be part of a mission-driven team that believes in the potential of artificial intelligence to solve global challenges. We offer a collaborative work environment where your contributions will have a significant impact on the future of technology. You will have opportunities for professional growth and development as you work alongside some of the brightest minds in the field.

We encourage you to apply even if your experience doesn't match every requirement. Join us in shaping the future of AI and technology.

Interested in this role?

Apply now or save it for later. Get alerts for similar jobs at OpenAI.

Similar Jobs You Might Like

Based on your interests and this role

OpenAI

Software Engineering

OpenAI📍 San Francisco - On-Site

OpenAI is hiring a Senior Software Engineer for the Frontier Systems team to build critical infrastructure for supercomputers. You'll work with Python, SQL, and automation tools to ensure reliable model training. This position requires 7+ years of experience.

🏛️ On-SiteSenior
9 months ago
OpenAI

Site Reliability Engineer

OpenAI📍 San Francisco - On-Site

OpenAI is hiring a Site Reliability Engineer to operate and scale the next generation of compute clusters for frontier research. You'll work with Kubernetes and automation to ensure the reliability of large-scale supercomputers. This role requires experience in distributed systems and infrastructure management.

🏛️ On-Site
3 months ago
OpenAI

Software Engineering

OpenAI📍 San Francisco - On-Site

OpenAI is hiring a Software Engineer for the Frontier Systems team to optimize power management in large-scale supercomputers. You'll work with Python and automation tools to ensure efficient operations. This position requires a strong focus on system-level investigations and experience in power management.

🏛️ On-SiteMid-Level
1 year ago
OpenAI

Software Engineering

OpenAI📍 San Francisco - Hybrid

OpenAI is hiring a Software Engineer for their Fleet Infrastructure team to design and operate systems for model deployment and training on a large GPU fleet. You'll work with technologies like Kubernetes and Docker, and this position requires experience in infrastructure systems.

🏢 HybridMid-Level
1 year ago
OpenAI

Full Stack Engineer

OpenAI📍 San Francisco - Hybrid

OpenAI is hiring a Full-Stack Software Engineer to create innovative ways for people to interact with AI. You'll work with technologies like JavaScript, Python, and React in San Francisco. This position requires a creative mindset and experience in software development.

🏢 HybridMid-Level
4 months ago