
About Baseten
Simplifying machine learning for every organization
Key Highlights
- Headquartered in Union Square, San Francisco, CA
- $285 million raised in Series C funding
- Team growth of 3x over the last five years
- Unlimited PTO with a company-wide holiday break
Baseten is a machine learning application builder headquartered in Union Square, San Francisco, CA. With $285 million in funding from investors like Coatue Management and Founders Fund, Baseten simplifies AI integration for businesses, enabling data scientists to deploy ML models without needing spe...
🎁 Benefits
Baseten offers a remote-first work environment with a $1,000 stipend for home office setup, unlimited PTO with a company-wide break during the holiday...
🌟 Culture
Baseten's culture emphasizes simplifying complex AI technologies for businesses, fostering a collaborative environment where team members can connect ...
Overview
Baseten is hiring a Site Reliability Engineer to build and maintain scalable infrastructure for deploying machine learning models. You'll work with technologies like AWS, Docker, and Kubernetes. This position requires experience in managing CI/CD pipelines and optimizing performance.
Job Description
Who you are
You have a strong background in site reliability engineering, with experience in building and maintaining scalable infrastructure for deploying machine learning models. You understand the importance of reliability and performance, and you have a knack for automating processes to improve efficiency. Your experience with cloud platforms, particularly AWS, has equipped you with the skills to manage multi-cloud environments effectively.
You are proficient in using containerization technologies like Docker and orchestration tools such as Kubernetes. Your programming skills in Python allow you to write scripts and automate tasks, making you a valuable asset to any engineering team. You thrive in collaborative environments and enjoy learning from users to enhance operational processes.
What you'll do
As a Site Reliability Engineer at Baseten, you will be responsible for envisioning and building robust systems that ensure our infrastructure is scalable, reliable, and efficient. You will work on automating deployments and monitoring systems, optimizing performance, and managing incidents. Your role will involve establishing standards and best practices for reliability across the infrastructure, ensuring that our systems can handle the demands of machine learning operations.
You will collaborate closely with the infrastructure team on projects such as multi-cloud capacity management and optimizing GPU usage for model serving. Your contributions will directly impact the efficiency and reliability of our platform, enabling engineers to ship AI products seamlessly. You will also be involved in incident management, ensuring that any issues are resolved quickly and effectively to minimize downtime.
What we offer
At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants. Join us in building a platform that empowers AI companies to bring cutting-edge models into production. We encourage you to apply even if your experience doesn't match every requirement, as we value diverse perspectives and backgrounds.
We offer competitive compensation and benefits, along with opportunities for professional growth and development. As part of a rapidly growing company, you will have the chance to work on innovative projects that shape the future of AI infrastructure. Our team is collaborative and forward-thinking, and we are excited to welcome new members who share our vision.
Similar Jobs You Might Like

Site Reliability Engineer
ConductorOne is hiring a Site Reliability Engineer to design and operate highly reliable infrastructure across cloud environments. You'll work with AWS, GCP, and Azure while building automation and tooling to enhance system reliability. This position requires 3+ years of experience in SRE or DevOps.

Site Reliability Engineer
Alembic is seeking a Senior Site Reliability Engineer to enhance the reliability and performance of their platform. You'll work with technologies like Docker and Kubernetes to build and maintain scalable infrastructure. This role requires 8+ years of experience in SRE or DevOps.

Site Reliability Engineer
Together AI is hiring a Site Reliability Engineer to ensure the reliability and performance of user-facing services and production systems. You'll work with Ansible, Terraform, and Kubernetes to build and manage infrastructure. This role requires 2+ years of experience in SRE or a related field.

Site Reliability Engineer
Mercor is seeking a Site Reliability Engineer to own production reliability across critical systems. You'll work with AWS, Kubernetes, and Terraform to build and improve high-availability systems. This role is based in San Francisco.

Site Reliability Engineer
WorkOS is hiring a Site Reliability Engineer to ensure the platform remains fast, reliable, and resilient at scale. You'll work with AWS, Docker, and Kubernetes to build systems that handle hundreds of millions of requests. This role requires a strong understanding of complex systems and incident response.