ThoughtWorks

About ThoughtWorks

Transforming businesses through technology and innovation

Tech | 5K-10K employees | Founded 1993 | Chicago, Illinois, United States

Key Highlights

  • Headquartered in Chicago, Illinois, with 43 global offices
  • Approximately 7,000 employees worldwide
  • Serves clients including BMW, BBC, and the UN
  • Focus on software development and digital transformation

ThoughtWorks is a global technology consultancy headquartered in Chicago, Illinois, with over 43 offices across 14 countries. The company specializes in software development, digital transformation, and agile consulting, serving clients like BMW, the BBC, and the United Nations. With a workforce of ...

🎁 Benefits

ThoughtWorks offers competitive salaries, equity options, a generous PTO policy, and flexible remote work arrangements. Employees also benefit from a ...

🌟 Culture

ThoughtWorks fosters a culture of continuous learning and innovation, emphasizing agile methodologies and collaborative problem-solving. The company v...

Overview

ThoughtWorks is hiring a Senior Site Reliability Engineer to ensure technical excellence and operational efficiency within the infrastructure domain. You'll specialize in reliability, resilience, and system performance while utilizing automation and monitoring tools. This role requires expertise in SRE principles and a commitment to continuous improvement.

Job Description

Who you are

You have a strong background in Site Reliability Engineering, with a focus on reliability, resilience, and system performance. Your experience includes conducting SRE and Disaster Recovery maturity assessments, and you are adept at engineering automation solutions using tools like Ansible to replace manual workflows. You understand the importance of shared responsibility and are committed to fostering a collaborative culture that meets and exceeds reliability and business objectives.

You have a proven track record of improving site reliability through mechanisms and architectures that enhance fault tolerance and reduce Mean Time to Recovery (MTTR) and Mean Time to Detection (MTTD). You are skilled in integrating observability automation into CI/CD pipelines and have experience handling production incidents, leading client communication, and writing root cause analysis documentation. Your ability to monitor the performance of production systems and improve scaling to meet Service Level Agreements (SLAs) and Service Level Objectives (SLOs) is a key asset.

You thrive on proactive improvements rather than reactive fixes, and you are at the forefront of cost optimization, automation, and scalable solutions. Your expertise will play a crucial role in streamlining operations, boosting efficiency, and ensuring systems grow with clients’ needs. You are excited to join a team that values curiosity, innovation, and purpose.

Desirable

Experience with additional automation tools and monitoring solutions will be beneficial. Familiarity with cloud infrastructure and a strong understanding of incident response processes will enhance your ability to succeed in this role.

What you'll do

As a Senior Site Reliability Engineer at ThoughtWorks, you will take a lead role in championing the principles of Site Reliability Engineering within the DAMO service line. You will conduct SRE and Disaster Recovery maturity assessments to identify areas for improvement and implement strategies to enhance operational efficiency. Your responsibilities will include engineering automation solutions using Ansible to streamline workflows and improve site reliability.

You will own and manage the current manual Disaster Recovery process and pipeline, ensuring that it meets the evolving needs of the organization. Your role will involve driving the integration of observability automation into the CI/CD pipeline, which is essential for maintaining high performance and reliability in production systems.

Handling production incidents will be a critical part of your job, where you will lead client communication and create thorough root cause analysis documentation. You will monitor the performance of production systems, working closely with application development teams to improve scaling and meet SLA and SLO targets.

Your focus will be on proactive improvements, allowing you to contribute to the evolution of operations from traditional methods to a more customer-focused and agile approach. You will cultivate a collaborative culture within the team, emphasizing shared responsibility and a commitment to continuous improvement.

What we offer

At ThoughtWorks, you will be part of a dynamic team that thrives on curiosity and innovation. We offer a supportive environment where you can grow your skills and make a significant impact on our clients' success. You will have the opportunity to work on cutting-edge projects that challenge you and allow you to develop your expertise further. We encourage you to apply even if your experience doesn't match every requirement, as we value diverse perspectives and backgrounds.
