
About Crusoe
Sustainable AI cloud solutions for a greener future
Key Highlights
- Headquartered in Denver, Colorado
- 501-1000 employees focused on AI and renewable energy
- First vertically integrated AI cloud platform
- Committed to sustainable computing practices
Crusoe is a pioneering AI cloud platform headquartered in Denver, Colorado, that utilizes clean, renewable energy to power its operations. The company focuses on providing scalable computing resources for AI and machine learning applications, serving a diverse range of clients across various industr...
🎁 Benefits
Crusoe offers competitive salaries, equity options, generous PTO, and a flexible remote work policy to support work-life balance....
🌟 Culture
Crusoe fosters a culture centered on sustainability and innovation, encouraging employees to contribute to environmentally friendly computing solution...
Overview
Crusoe is hiring a Senior Site Reliability Engineer to ensure the reliability and scalability of its AI-optimized cloud platform. You'll work with distributed systems and large language models to build and operate managed AI services. This role requires strong experience in automation and cloud infrastructure.
Job Description
Who you are
You have a strong background in distributed systems and hands-on experience with large language models. You have designed and operated reliable managed AI services that scale effectively. Your expertise in automation allows you to build reliability tooling that supports distributed AI pipelines and inference services, ensuring high performance and availability.
You are skilled in defining, measuring, and improving service level indicators and objectives (SLIs/SLOs) across AI workloads. You understand the importance of meeting performance and reliability targets, and you have a track record of collaborating with cross-functional teams to optimize large-scale training and inference clusters. Your ability to automate observability through telemetry and to apply performance tuning strategies is a key asset in this role.
What you'll do
In this role, you will design and operate reliable managed AI services with a focus on serving and scaling LLM workloads. Your work will directly affect the performance of compute-intensive, latency-sensitive workloads for customers. You will build automation and reliability tooling that supports distributed AI pipelines, ensuring that services are both efficient and resilient.
You will collaborate closely with AI, platform, and infrastructure teams to optimize the performance of large-scale training and inference clusters, and your insights will help drive improvements in system reliability and efficiency. Investigating and resolving reliability issues will be a key part of your responsibilities as you work to maintain the high standards expected of Crusoe's AI infrastructure.
What we offer
At Crusoe, you will be part of a mission-driven team at the forefront of the AI revolution. We are committed to creating sustainable technology that empowers ambitious creativity. You will have the opportunity to drive meaningful innovation and make a tangible impact in the field of cloud infrastructure. We encourage you to apply even if your experience doesn't match every requirement, as we value diverse perspectives and backgrounds.