About Crusoe

Sustainable AI cloud solutions for a greener future

🏢 Tech • 👥 501-1000 • 📅 Founded 2018 • 📍 Denver, Colorado, United States

Key Highlights

Headquartered in Denver, Colorado
501-1000 employees focused on AI and renewable energy
First vertically integrated AI cloud platform
Committed to sustainable computing practices

Crusoe is a pioneering AI cloud platform headquartered in Denver, Colorado, that utilizes clean, renewable energy to power its operations. The company focuses on providing scalable computing resources for AI and machine learning applications, serving a diverse range of clients across various industr...

🎁 Benefits

Crusoe offers competitive salaries, equity options, generous PTO, and a flexible remote work policy to support work-life balance....

🌟 Culture

Crusoe fosters a culture centered on sustainability and innovation, encouraging employees to contribute to environmentally friendly computing solution...

Site Reliability Engineer

Crusoe • San Francisco - On-Site

Posted 4w ago • 🏛️ On-Site • Site Reliability Engineer • 📍 San Francisco • 💰 $204,000 - $247,000 / year

Skills & Technologies

Distributed systems, Automation, Cloud infrastructure, AI services, Telemetry, Performance tuning

Overview

Crusoe is hiring a Site Reliability Engineer to ensure the reliability and scalability of its AI-optimized cloud platform. You'll build and operate managed AI services at scale, focusing on distributed systems and large language models.

Job Description

Who you are

You have a strong background in distributed systems and hands-on experience with large language models, and you understand the intricacies of building and operating reliable managed AI services at scale. Your expertise in automation and reliability tooling allows you to support distributed AI pipelines effectively.

You are skilled in defining, measuring, and improving SLIs/SLOs across AI workloads so that performance and reliability targets are consistently met. You value collaboration and work closely with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters.

You have experience automating observability by building telemetry and performance tuning strategies for latency-sensitive AI services, and you thrive on investigating and resolving reliability issues in distributed AI systems.

What you'll do

As a Site Reliability Engineer at Crusoe, you will design and operate reliable managed AI services with a focus on serving and scaling LLM workloads. You will build automation and reliability tooling to support distributed AI pipelines and inference services, ensuring that our infrastructure can handle compute-intensive workloads efficiently.

You will define, measure, and improve SLIs/SLOs across AI workloads, collaborating with various teams to optimize performance and reliability. Your role will involve automating observability processes, allowing for better monitoring and performance tuning of our AI services.
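For readers less familiar with the terms, an SLI (service level indicator) is a measured ratio such as the fraction of successful inference requests, and an SLO (service level objective) is the target set for that ratio. The Python sketch below illustrates the general idea only. It is not Crusoe's tooling, and the names `SloReport` and `evaluate_slo`, the request counts, and the 99.9% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float                     # measured fraction of good events
    error_budget_remaining: float  # 1.0 = untouched, 0.0 = spent, < 0 = overspent
    slo_met: bool


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compare a measured SLI against an SLO target (hypothetical helper)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events
    # The error budget is the number of bad events the SLO tolerates.
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli=sli, error_budget_remaining=remaining, slo_met=sli >= slo_target)


# Illustrative numbers: 100,000 requests, 50 failures, 99.9% availability target.
report = evaluate_slo(good_events=99_950, total_events=100_000, slo_target=0.999)
print(report)
```

With these made-up numbers the service meets its target and has used about half of its error budget for the window.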

In addition, you will investigate and resolve reliability issues in distributed AI systems, contributing to the overall success of Crusoe's mission to accelerate the abundance of energy and intelligence through sustainable technology.

What we offer

At Crusoe, you will be part of a team that is setting the pace for responsible, transformative cloud infrastructure. We encourage you to apply even if your experience doesn't match every requirement, as we value diverse perspectives and backgrounds. Join us in driving meaningful innovation and making a tangible impact in the AI revolution.

Similar Jobs You Might Like

Based on your interests and this role

Site Reliability Engineer

Crusoe • 📍 San Francisco - On-Site

Crusoe is hiring a Senior Site Reliability Engineer to ensure the reliability and scalability of its AI-optimized cloud platform. You'll work with distributed systems and large language models to build and operate managed AI services. This role requires strong experience in automation and cloud infrastructure.

🏛️ On-Site • Senior

2w ago

Site Reliability Engineer

Crusoe • 📍 San Francisco - On-Site

Crusoe is seeking a Senior Site Reliability Engineer to enhance the stability and performance of its GPU cloud platform. You'll collaborate with cross-functional teams and use AWS, Docker, and Kubernetes. This role requires a strong background in operational excellence and incident management.

🏛️ On-Site • Senior

2 months ago

Site Reliability Engineer

Together AI • 📍 San Francisco

Together AI is hiring a Site Reliability Engineer to ensure the reliability and performance of user-facing services and production systems. You'll work with Ansible, Terraform, and Kubernetes to build and manage infrastructure. This role requires 2+ years of experience in SRE or a related field.

Mid-Level

2w ago

Site Reliability Engineer

Apple • 📍 San Francisco - On-Site

Apple is seeking a Site Reliability Engineer to join its Services Engineering team. You'll be responsible for building secure, end-to-end solutions and managing the full infrastructure stack. This role requires expertise in solving complex problems at scale.

🏛️ On-Site

1 month ago

AI Engineer

Postman • 📍 San Francisco - On-Site

Postman is hiring a Lead AI Engineer to develop and manage reliability metrics for AI-driven API services. You'll work with technologies like Python and AWS to ensure the performance and scalability of AI systems. This position requires significant experience in reliability engineering.

🏛️ On-Site • Lead

2w ago
