
About Confluent
The leading data streaming platform for enterprises
Key Highlights
- Founded in 2014, built on Apache Kafka technology
- Raised $455.9 million in Series D funding
- Headquartered in Mountain View, California
- 1,001+ employees dedicated to data streaming solutions
Confluent, founded in 2014 and headquartered in Mountain View, California, is a leading data streaming platform built on the open-source Apache Kafka project. With over 1,001 employees, Confluent has raised $455.9 million in funding and serves a diverse range of enterprise customers, providing real-...
Benefits
Confluent offers unlimited holiday, life insurance, income protection insurance, and delicious catered food twice a week. Employees also benefit from ...
Culture
Confluent fosters a culture that values open-source collaboration and innovation, emphasizing the importance of real-time data solutions. The company ...
Overview
Confluent is hiring a Staff Site Reliability Engineer to enhance incident management and reliability for its cloud platform. You'll work with AWS, GCP, and Azure to build automation and improve tooling. This position requires expertise in proactive reliability improvements and incident response.
Job Description
Who you are
You have extensive experience in site reliability engineering, particularly in incident management and reliability within multi-cloud environments. Your background includes hands-on technical work and strategic program ownership, allowing you to drive proactive reliability improvements that prevent incidents before they occur. You thrive in collaborative settings, coaching teams through post-mortems and evolving incident response practices.
You possess deep systems thinking skills, enabling you to analyze systemic failure patterns and design effective reliability improvements. Your expertise spans cloud platforms such as AWS, GCP, and Azure, and you are adept at building automation and improving tooling to enhance operational efficiency. You are a strong communicator, capable of teaching and coordinating with teams to ensure sustainable operations.
What you'll do
In this role, you will spend approximately 75% of your time on engineering tasks, focusing on building automation and improving tooling for incident management. You will analyze failure patterns and design reliability improvements that enhance the overall performance of Confluent Cloud, which processes millions of events per second. The remaining 25% of your time will be dedicated to teaching and coordination, where you will coach teams through post-mortems and train incident commanders to ensure effective incident response practices.
You will be part of a global team that provides follow-the-sun coverage, ensuring clean handoffs and sustainable working hours for all team members. Your contributions will directly impact the reliability and performance of the data streaming platform, allowing companies to react faster and build smarter solutions. You will collaborate with cross-functional teams to foster a culture of continuous improvement and innovation in incident management.
What we offer
Confluent offers a dynamic work environment where you can lead, grow, and challenge what's possible in the field of data streaming. You will have the opportunity to work with cutting-edge technologies and be part of a team that values diverse perspectives and collaboration. We encourage you to apply even if your experience doesn't match every requirement, as we believe in the potential of every individual to contribute to our mission.