
About Together AI
Empowering corporate mentorship for effective learning
Key Highlights
- Founded in 2018, headquartered in Toronto, ON
- Raised $1.7 million in seed funding
- Partnerships with Heineken, Reddit, and 7-Eleven
- 4 weeks paid vacation and competitive equity packages
Together is a corporate mentorship management platform founded in 2018, headquartered in CityPlace, Toronto, ON. The platform streamlines the mentorship lifecycle, facilitating connections among employees at companies like Heineken, Reddit, and 7-Eleven. With $1.7 million in seed funding, Together a...
Benefits
Together offers competitive salaries and equity packages, 4 weeks of paid vacation, and a comprehensive health, dental, and vision plan through Honeyb...
Culture
Together fosters a culture of autonomy and impact, allowing employees to take on significant responsibilities without bureaucratic constraints. The fo...

Staff Engineer • Senior
Together AI • San Francisco - On-Site
Skills & Technologies
Overview
Together AI is hiring a Staff Engineer to design and deliver multi-petabyte storage systems for AI workloads. You'll work with technologies like Kubernetes, Ceph, and Lustre to optimize high-performance storage solutions. This role requires expertise in distributed systems and storage architecture.
Job Description
Who you are
You have extensive experience in designing and delivering large-scale storage systems, particularly for AI and machine learning workloads. Your background includes architecting high-performance parallel filesystems and object stores, and you are adept at integrating cutting-edge technologies such as WekaFS, Ceph, and Lustre. You have a strong understanding of cost optimization strategies, routinely achieving significant savings through intelligent tiering and lifecycle policies. Your technical skills extend to building Kubernetes-native storage operators and self-service platforms that ensure automated provisioning and performance isolation at scale.
You are familiar with optimizing end-to-end data paths for high throughput, capable of delivering 10-50 GB/s per node. Your expertise includes designing multi-tier caching architectures and implementing intelligent prefetching and model-weight distribution. You thrive in environments where you can troubleshoot bottlenecks and optimize TCP/IP for storage, ensuring maximum efficiency and minimal latency.
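The multi-tier caching and intelligent prefetching mentioned above can be sketched minimally. This is an illustrative toy, not Together AI's actual design: the two tiers (a small "hot" tier standing in for local NVMe, a large "cold" tier standing in for an object store), the LRU promotion policy, and the `TwoTierCache` name are all assumptions.

```python
from collections import OrderedDict

class TwoTierCache:
    """Toy two-tier cache: a bounded hot tier (think local NVMe) in front of an
    unbounded cold tier (think object store). Reads promote objects to the hot
    tier; the least recently used hot entry is evicted when capacity is exceeded."""

    def __init__(self, hot_capacity: int):
        self.hot = OrderedDict()          # key -> bytes, maintained in LRU order
        self.cold = {}                    # unbounded backing store
        self.hot_capacity = hot_capacity

    def put(self, key: str, blob: bytes) -> None:
        self.cold[key] = blob             # writes land in the cold tier

    def get(self, key: str) -> bytes:
        if key in self.hot:               # hot hit: refresh LRU position
            self.hot.move_to_end(key)
            return self.hot[key]
        blob = self.cold[key]             # cold hit: promote to hot tier
        self.hot[key] = blob
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)  # evict least recently used entry
        return blob

    def prefetch(self, keys) -> None:
        """Warm the hot tier ahead of a known access pattern
        (e.g. the next epoch's dataset shards or a model's weight files)."""
        for key in keys:
            self.get(key)
```

In a real system the promotion and eviction policies would be informed by access traces and object sizes rather than a plain LRU, but the tiering structure is the same.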
Desirable
Experience with RDMA and InfiniBand networks is a plus, as is familiarity with NVMe-oF and iSCSI protocols. You are comfortable working with large datasets and have a keen eye for capacity forecasting and right-sizing storage solutions. Your collaborative spirit allows you to work effectively with cross-functional teams, driving projects from conception through to successful implementation.
What you'll do
In this role, you will be responsible for designing and delivering multi-petabyte AI/ML storage systems that meet the demands of the world's largest AI training and inference workloads. You will lead the integration of advanced storage technologies, ensuring that systems are optimized for performance and cost. Your day-to-day tasks will include architecting high-performance parallel filesystems and object stores, as well as evaluating new technologies to enhance system capabilities.
You will also focus on capacity planning and cost optimization, routinely achieving 30-50% savings through intelligent tiering and lifecycle policies. Your role will involve designing and optimizing RDMA and InfiniBand networks, tuning them for maximum throughput and minimum latency. You will implement NVMe-oF and iSCSI protocols, troubleshooting any bottlenecks that arise to maintain system integrity and performance.
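The intelligent tiering behind the savings figure above can be sketched as a simple recency-based lifecycle policy. The tier names, day thresholds, and per-GB prices here are illustrative assumptions, not any provider's actual pricing.

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-GB monthly prices; real numbers depend on the provider.
TIER_COST = {"hot": 0.10, "warm": 0.03, "cold": 0.004}

def choose_tier(last_access: datetime, now: datetime) -> str:
    """Assign an object to a storage tier by access recency
    (the 7-day and 60-day thresholds are assumptions)."""
    age = now - last_access
    if age < timedelta(days=7):
        return "hot"
    if age < timedelta(days=60):
        return "warm"
    return "cold"

def monthly_cost(objects, now: datetime):
    """objects: iterable of (size_gb, last_access).
    Returns (cost under the tiering policy, cost if everything stayed hot)."""
    tiered = sum(size * TIER_COST[choose_tier(ts, now)] for size, ts in objects)
    all_hot = sum(size * TIER_COST["hot"] for size, _ in objects)
    return tiered, all_hot
```

With even a modest share of data going cold, comparing the two totals shows how lifecycle policies reach savings in the 30-50% range.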
Building Kubernetes storage operators and controllers will be a key part of your responsibilities, enabling automated provisioning and creating self-service abstractions for users. You will ensure strict multi-tenancy and quota enforcement at cluster scale, allowing for efficient resource management across multiple users and applications.
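The quota enforcement described above can be sketched as the admission check a storage controller might run before provisioning a volume for a tenant. This is a minimal in-memory sketch; the `QuotaManager` class, tenant names, and GiB units are illustrative assumptions, not a real operator's API.

```python
class QuotaManager:
    """Toy per-tenant capacity quota, of the kind a Kubernetes-style storage
    controller would consult before admitting a volume-provisioning request."""

    def __init__(self):
        self.limits = {}   # tenant -> quota in GiB
        self.used = {}     # tenant -> GiB already provisioned

    def set_quota(self, tenant: str, limit_gib: int) -> None:
        self.limits[tenant] = limit_gib
        self.used.setdefault(tenant, 0)

    def provision(self, tenant: str, size_gib: int) -> bool:
        """Admit the request only if it fits under the tenant's quota;
        reject (rather than over-commit) when it does not."""
        if self.used.get(tenant, 0) + size_gib > self.limits.get(tenant, 0):
            return False
        self.used[tenant] = self.used.get(tenant, 0) + size_gib
        return True
```

In a real cluster this state would live in the API server (e.g. as ResourceQuota-style objects) rather than in process memory, but the admit-or-reject decision is the core of multi-tenant isolation.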
Your work will directly impact the performance of AI workloads, as you will be tasked with delivering 10-50 GB/s per GPU node. You will optimize caching strategies for weights, datasets, and checkpoints, ensuring that data paths are efficient and effective. Your ability to troubleshoot and optimize parallel filesystems will be crucial in maintaining high performance for AI applications.
What we offer
Together AI offers a collaborative and innovative work environment where your contributions will have a significant impact on the future of AI infrastructure. You will have the opportunity to work with cutting-edge technologies and be part of a team that is dedicated to pushing the boundaries of what is possible in AI and machine learning. We encourage you to apply even if your experience doesn't match every requirement, as we value diverse perspectives and backgrounds.
We provide competitive compensation and benefits, along with opportunities for professional growth and development. Join us in shaping the future of AI infrastructure and making a difference in the world of technology.
Similar Jobs You Might Like

Staff Engineer
Together AI is hiring a Staff Engineer to design and deliver multi-petabyte storage systems for AI workloads. You'll work with technologies like WekaFS, Ceph, and Kubernetes to optimize high-performance storage solutions. This position requires extensive experience in distributed storage and HPC infrastructure.

Software Engineering
OpenAI is hiring a Software Engineer for their Storage Infrastructure team to design and operate exascale systems for data management. You'll work with distributed systems and cloud technologies, particularly Azure. This role requires a deep understanding of scalable storage solutions.

Software Engineering
Patreon is hiring a Senior Software Engineer to design and implement scalable storage systems for their creator platform. You'll work with Java and REST APIs to build high-performance services. This position requires 5+ years of experience in backend engineering.

AI Research Engineer
Anthropic is hiring a Research Engineer to work on building advanced AI systems. You'll focus on large-scale infrastructure for AI training and evaluation, utilizing skills in Python, Docker, and Kubernetes. This position requires familiarity with machine learning and distributed systems.

Software Engineering
Crusoe is hiring a Senior Software Engineer to design and build next-generation cloud storage products. You'll work with technologies like Java and Python to create scalable distributed storage systems. This position requires deep expertise in building storage systems.