About Nebius AI

Empowering AI with robust infrastructure solutions

🏢 Tech · 👥 51-250 · 📅 Founded 2022 · 📍 Amsterdam, North Holland, Netherlands

Key Highlights

Publicly traded on Nasdaq and expanding in the AI infrastructure market
Headquartered in Amsterdam with hubs in the US, Europe, and Israel
Team of around 400 skilled engineers focused on AI/ML
Specializes in large-scale GPU clusters and cloud platforms

Nebius is a Nasdaq-listed company headquartered in Amsterdam, specializing in AI infrastructure solutions. With a team of around 400 engineers, Nebius provides large-scale GPU clusters and cloud platforms designed to support the rapid growth of the AI industry. The company has established R&D and co...

🎁 Benefits

Nebius offers competitive equity packages, a flexible PTO policy, and opportunities for remote work. Employees also benefit from a learning budget to ...

🌟 Culture

Nebius fosters a culture centered around engineering excellence and innovation in AI infrastructure. The company values collaboration across its globa...

Site Reliability Engineer • Senior

Nebius AI • Amsterdam - Remote

Posted 15h ago · 🏠 Remote · Senior · Site Reliability Engineer · 📍 Amsterdam 📍 Berlin 📍 London 📍 Prague

Skills & Technologies

Kubernetes Python Docker Prometheus Grafana

Overview

Nebius AI is seeking a Senior Site Reliability Engineer for its Token Factory team to ensure the reliability and performance of its inference platform. You'll work with technologies like Kubernetes and Docker to manage large-scale AI workloads. This role requires strong expertise in observability and performance tuning.

Job Description

Who you are

You have 5+ years of experience in site reliability engineering, with a strong focus on maintaining high availability and performance in cloud environments. Your expertise in Kubernetes allows you to manage container orchestration effectively, ensuring that applications run smoothly under varying loads. You are proficient in Python, which you use to automate tasks and improve system reliability. Your experience with observability tools like Prometheus and Grafana enables you to monitor system performance and troubleshoot issues proactively. You understand the importance of telemetry pipelines and can design them to provide actionable insights from large volumes of data. You thrive in collaborative environments and enjoy working with cross-functional teams to enhance system performance and reliability.

Desirable

Experience with AI and machine learning platforms is a plus, as it will help you understand the unique challenges of deploying AI workloads at scale. Familiarity with infrastructure as code tools will also be beneficial in automating deployment processes and managing configurations efficiently.

What you'll do

In this role, you will own the reliability and performance of the inference stack, ensuring that it can handle extreme loads without compromising on service quality. You will design and refine telemetry pipelines to capture metrics, logs, and traces, transforming terabytes of data into clear insights that drive operational improvements. Your responsibilities will include tuning Kubernetes autoscalers to optimize resource usage and enhance application performance. You will collaborate closely with engineering teams to implement best practices for incident management and recovery, ensuring that the platform can recover gracefully from unexpected failures. You will also participate in capacity planning and performance testing to ensure that the system can scale effectively as demand grows. Your contributions will directly impact the success of Nebius AI's cloud offerings, helping to deliver reliable AI solutions to customers worldwide.

What we offer

Nebius AI provides a competitive salary and a comprehensive benefits package, including opportunities for professional growth within the company. We value initiative and innovation, fostering a collaborative work environment where your contributions are recognized and rewarded. Flexible working arrangements are available to support work-life balance, and you will have the chance to work alongside some of the most experienced leaders and engineers in the AI cloud infrastructure space. Join us in shaping the future of cloud computing and making a real impact in the AI economy.
