- Location: San Francisco
- Last Published: Feb. 15, 2025
- Sector: AI/ML
About Cartesia
Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text—1B text tokens, 10B audio tokens and 1T video tokens—let alone do this on-device.
We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team pairs deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.
We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors and 90+ angels across many industries, including the world's foremost experts in AI.
The Role
• Conduct cutting-edge research at the intersection of machine learning, multimodal data, and generative modeling to advance the state of AI across audio, text, vision, and other modalities.
• Develop novel algorithms for multimodal understanding and generation, leveraging new architectures, training algorithms, datasets, and inference techniques.
• Design and build models that enable seamless integration of modalities for multimodal reasoning on streaming data.
• Lead the creation of robust evaluation frameworks to benchmark model performance on multimodal datasets and tasks.
• Collaborate closely with cross-functional teams to translate research breakthroughs into impactful products and applications.
What We’re Looking For
• Expertise in machine learning, multimodal learning, and generative modeling, with a strong research track record in top-tier conferences (e.g., CVPR, ICML, NeurIPS, ICCV).
• Proficiency in deep learning frameworks such as PyTorch or TensorFlow, with experience in handling diverse data modalities (e.g., audio, video, text).
• Strong understanding of state-of-the-art techniques for multimodal modeling, such as autoregressive and diffusion modeling, and deep understanding of architectural tradeoffs.
• Passion for exploring the interplay between modalities to solve complex problems and create groundbreaking applications.
• Excellent problem-solving skills, with the ability to independently tackle research challenges and collaborate effectively with multidisciplinary teams.
Nice-to-Haves
• Experience working with multimodal datasets, such as audio-visual datasets, video-captioning datasets, or large-scale cross-modal corpora.
• Background in designing or deploying real-time multimodal systems in resource-constrained environments.
• Early-stage startup experience or experience working in fast-paced R&D environments.
Our culture
🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.
🚢 We ship fast. All of our work is novel and cutting-edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality or design along the way.
🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.
Our perks
🍽 Lunch, dinner and snacks at the office.
🏥 Fully covered medical, dental, and vision insurance for employees.
🏦 401(k).
✈️ Relocation and immigration support.
🦖 Your own personal Yoshi.