• Last Published: Feb. 3, 2025
• Sector: AI/ML

About Cartesia

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.

We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team pairs deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.

We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors, and 90+ angels across many industries, including the world's foremost experts in AI.

The Role

• Conduct pioneering research at the intersection of audio signal processing, machine learning, and generative modeling to push the boundaries of voice AI systems.

• Develop cutting-edge algorithms for tasks such as speech enhancement, echo cancellation, denoising, and voice activity detection, leveraging generative approaches like diffusion models, VAEs, or autoregressive frameworks.

• Design novel methods for end-to-end modeling of audio signals, exploring advancements in neural audio synthesis, speech representation learning, and self-supervised training paradigms.

• Lead the development of robust evaluation pipelines to analyze performance, validate real-world effectiveness, and identify future research directions.

What We’re Looking For

• Deep expertise in audio signal processing, generative modeling, and machine learning, with a proven track record of publishing impactful research in top-tier conferences (e.g., NeurIPS, ICASSP, ICLR).

• Proficiency in frameworks such as PyTorch, TensorFlow, or specialized tools for audio processing like torchaudio or librosa.

• Strong understanding of state-of-the-art generative techniques, including diffusion models, autoregressive models, and flow-based models.

• Passion for solving complex problems in speech and audio, with a focus on creating innovative and practical solutions for noisy, multi-modal, or real-time environments.

• Excellent collaboration and communication skills, with the ability to work effectively in research-driven and cross-functional teams.

Nice-to-Haves

• Experience building audio models that have been used in production at scale.

• Background in hardware-aware optimization for deploying real-time audio models.

• Early-stage startup experience or experience working in fast-paced R&D environments.

Our culture

🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.

🚢 We ship fast. All of our work is novel and cutting-edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality or design along the way.

🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.

Our perks

🍽 Lunch, dinner and snacks at the office.

🏥 Fully covered medical, dental, and vision insurance for employees.

🏦 401(k).

✈️ Relocation and immigration support.

🦖 Your own personal Yoshi.