About Crusoe

Sustainable AI cloud solutions for a greener future

🏢 Tech • 👥 501-1000 • 📅 Founded 2018 • 📍 Denver, Colorado, United States

Key Highlights

Headquartered in Denver, Colorado
501-1000 employees focused on AI and renewable energy
First vertically integrated AI cloud platform
Committed to sustainable computing practices

Crusoe is a pioneering AI cloud platform headquartered in Denver, Colorado, that utilizes clean, renewable energy to power its operations. The company focuses on providing scalable computing resources for AI and machine learning applications, serving a diverse range of clients across various industr...

🎁 Benefits

Crusoe offers competitive salaries, equity options, generous PTO, and a flexible remote work policy to support work-life balance....

🌟 Culture

Crusoe fosters a culture centered on sustainability and innovation, encouraging employees to contribute to environmentally friendly computing solution...

Site Reliability Engineer • Senior

Crusoe • San Francisco - On-Site

Posted 2 months ago • 🏛️ On-Site • Senior • Site Reliability Engineer • 📍 San Francisco • 💰 $172,000 - $209,000 / year

Skills & Technologies

AWS, Docker, Kubernetes, Linux, Prometheus

Overview

Crusoe is seeking a Senior Site Reliability Engineer to enhance the stability and performance of its GPU cloud platform. You'll collaborate with cross-functional teams and apply your skills in AWS, Docker, and Kubernetes. This role requires a strong background in operational excellence and incident management.

Job Description

Who you are

You have 5+ years of experience in site reliability engineering or a related field, with a strong focus on operational excellence and incident management. You thrive in environments where you can solve complex operational problems and improve system reliability. Your expertise in cloud infrastructure, particularly with AWS, allows you to effectively manage and optimize resources for performance and efficiency. You are comfortable working with containerization technologies like Docker and orchestration tools such as Kubernetes, which you use to streamline deployment processes and enhance system resilience. Your strong understanding of Linux systems enables you to troubleshoot and optimize server performance effectively. You are familiar with monitoring tools like Prometheus, which you leverage to track system health and performance metrics, ensuring that service level objectives are met.

Desirable

Experience with incident response processes and root cause analysis documentation is a plus. Familiarity with defining and refining availability metrics, including SLIs and SLOs, will help you excel in this role. You are a proactive communicator who enjoys collaborating with cross-functional teams to drive improvements in system reliability and operational efficiency.

What you'll do

In this role, you will be responsible for ensuring the stability and performance of Crusoe’s GPU cloud platform. You will collaborate with senior SREs and infrastructure engineers to define and refine availability metrics, establishing and tracking SLIs and SLOs to enhance service reliability. You will assist in incident response by identifying, diagnosing, and resolving service disruptions, ensuring that post-incident processes are documented through root cause analysis. Your contributions will directly impact the operational excellence of the cloud platform, as you work to reduce operational toil and improve incident management practices. You will also engage in continuous improvement initiatives, identifying areas for optimization and implementing solutions that enhance system performance and reliability.

What we offer

At Crusoe, you will be part of a mission-driven team focused on accelerating the abundance of energy and intelligence through sustainable technology. We offer a collaborative work environment where innovation is encouraged, and your contributions will have a tangible impact on the future of cloud infrastructure. You will have opportunities for professional growth and development, working alongside talented engineers who are passionate about their work. We provide competitive compensation and benefits, ensuring that you are rewarded for your expertise and dedication to operational excellence.

Similar Jobs You Might Like

Based on your interests and this role

Site Reliability Engineer

Braze • 📍 San Francisco - On-Site

Braze is hiring a Senior Site Reliability Engineer to ensure the uptime of internal-facing services and platforms. You'll work with Linux, distributed systems, and automation to maintain high service availability. This position requires a strong background in system administration and software engineering.

🏛️ On-Site • Senior

1w ago

Site Reliability Engineer

Together AI • 📍 San Francisco

Together AI is hiring a Site Reliability Engineer to ensure the reliability and performance of user-facing services and production systems. You'll work with Ansible, Terraform, and Kubernetes to build and manage infrastructure. This role requires 2+ years of experience in SRE or a related field.

Mid-Level

2w ago

Site Reliability Engineer

Stellar Development Foundation • 📍 San Francisco - On-Site

Stellar Development Foundation is hiring a Senior Site Reliability Engineer to enhance the reliability and scalability of their systems. You'll work with AWS, GCP, and Kubernetes to support the Stellar blockchain ecosystem. This role requires strong experience in infrastructure management and automation.

🏛️ On-Site • Senior

3w ago

Site Reliability Engineer

Mercor • 📍 San Francisco - On-Site

Mercor is seeking a Site Reliability Engineer to own production reliability across critical systems. You'll work with AWS, Kubernetes, and Terraform to build and improve high-availability systems in San Francisco.

🏛️ On-Site • Mid-Level

1 month ago

Site Reliability Engineer

WorkOS • 📍 San Francisco - Remote

WorkOS is hiring a Site Reliability Engineer to ensure the platform remains fast, reliable, and resilient at scale. You'll work with AWS, Docker, and Kubernetes to build systems that handle hundreds of millions of requests. This role requires a strong understanding of complex systems and incident response.

🏠 Remote

8 months ago
