About Crusoe

Sustainable AI cloud solutions for a greener future

🏢 Tech • 👥 501-1000 • 📅 Founded 2018 • 📍 Denver, Colorado, United States

Key Highlights

Headquartered in Denver, Colorado
501-1000 employees focused on AI and renewable energy
First vertically integrated AI cloud platform
Committed to sustainable computing practices

Crusoe is a pioneering AI cloud platform headquartered in Denver, Colorado, that utilizes clean, renewable energy to power its operations. The company focuses on providing scalable computing resources for AI and machine learning applications, serving a diverse range of clients across various industr...

🎁 Benefits

Crusoe offers competitive salaries, equity options, generous PTO, and a flexible remote work policy to support work-life balance.

🌟 Culture

Crusoe fosters a culture centered on sustainability and innovation, encouraging employees to contribute to environmentally friendly computing solution...

Site Reliability Engineer • Staff

Crusoe • San Francisco - On-Site

Posted 3 months ago • 🏛️ On-Site • Staff Site Reliability Engineer • 📍 San Francisco • 💰 $204,000 - $247,000 / year

Skills & Technologies

aws docker kubernetes linux python terraform

Overview

Crusoe is hiring a Staff Site Reliability Engineer focused on Storage to ensure the performance and reliability of its AI-optimized cloud infrastructure. You'll work with technologies like AWS, Docker, and Kubernetes to build and optimize distributed storage systems. This position requires 5+ years of experience in site reliability engineering.

Job Description

Who you are

You have 5+ years of experience in site reliability engineering, particularly with cloud infrastructure — you've successfully maintained and optimized large-scale distributed systems, ensuring high availability and performance. Your expertise in storage solutions, including block, file, and object storage systems, allows you to tackle complex challenges in data management and reliability.

You possess strong programming skills in Python and experience with automation tools — you've built self-healing systems and monitoring solutions that enhance operational efficiency. Your familiarity with containerization technologies like Docker and orchestration tools such as Kubernetes enables you to deploy and manage applications seamlessly in cloud environments.

Your background includes working with AWS and other cloud platforms — you understand the intricacies of cloud storage and compute services, and you can implement best practices for data replication, encryption, and backup strategies. You thrive in collaborative environments, working closely with storage engineers and other teams to drive reliability initiatives.

You are a proactive problem solver who enjoys optimizing systems for performance and scalability — your analytical mindset helps you identify potential issues before they impact users. You are committed to continuous learning and staying updated with the latest trends in site reliability and cloud technologies.

Desirable

Experience with infrastructure-as-code tools like Terraform is a plus: you appreciate the importance of automating infrastructure management to reduce manual errors and improve deployment speed. Familiarity with monitoring and logging tools such as Datadog or Prometheus will help you track system health and performance.

What you'll do

In this role, you will be responsible for ensuring the availability and reliability of Crusoe's cloud storage products — you will build automation tools to monitor and maintain distributed storage infrastructure, focusing on performance and fault tolerance. Your work will directly support compute-intensive workloads for AI and high-performance computing (HPC) use cases.

You will collaborate with cross-functional teams to implement and maintain high-performance NVMe- and SSD-backed volumes — your contributions will enhance the capabilities of large-scale AI compute clusters, enabling them to operate efficiently and effectively. You will drive reliability initiatives that include data replication, encryption, and robust failover mechanisms, ensuring that data is always accessible and secure.

You will also participate in incident response and post-mortem analysis — your insights will help improve system resilience and inform future design decisions. By leveraging your expertise, you will contribute to the development of a sustainable cloud platform that aligns with Crusoe's mission to accelerate the abundance of energy and intelligence.

What we offer

At Crusoe, you will be part of a team that is at the forefront of the AI revolution — we are committed to crafting sustainable technology that empowers creativity without compromising on scale or speed. You will have the opportunity to drive meaningful innovation and make a tangible impact in the cloud infrastructure space.

We offer a competitive salary and benefits package, along with opportunities for professional growth and development — you will work in a collaborative environment that values your contributions and encourages you to take ownership of your projects. Join us in our mission to create a responsible and transformative cloud infrastructure that supports the future of AI.

Similar Jobs You Might Like

Based on your interests and this role

Site Reliability Engineer

Crusoe • 📍 San Francisco

Crusoe is hiring a Senior Site Reliability Engineer focused on Storage to maintain and optimize their AI-optimized cloud infrastructure. You'll work with technologies like AWS, Docker, and Kubernetes to ensure the reliability and performance of cloud storage systems.

Senior

3 months ago

Site Reliability Engineer

Crusoe • 📍 San Francisco - On-Site

Crusoe is seeking a Senior Site Reliability Engineer to enhance the stability and performance of their GPU cloud platform. You'll collaborate with cross-functional teams and utilize skills in AWS, Docker, and Kubernetes. This role requires a strong background in operational excellence and incident management.

🏛️ On-Site • Senior

2 months ago

Site Reliability Engineer

GoDaddy • 📍 United Kingdom - Remote

GoDaddy is seeking a Site Reliability Engineer to automate and maintain their storage infrastructure with a focus on Ceph. You'll ensure the reliability and performance of systems while working remotely from the United Kingdom.

🏠 Remote • Mid-Level

7h ago

Site Reliability Engineer

Apple • 📍 San Francisco - On-Site

Apple is seeking a Site Reliability Engineer to join their Services Engineering team. You'll be responsible for building secure, end-to-end solutions and managing the full infrastructure stack. This role requires expertise in solving complex problems at scale.

🏛️ On-Site

1 month ago

Site Reliability Engineer

GoDaddy • 📍 Canada - Remote

🏠 Remote • Mid-Level

7h ago
