AI Models

We integrate leading open-source speech AI models. All models are available through the API and the web interface.

Text-to-Speech Models

Fish Speech v1.5

Recommended

State-of-the-art multilingual TTS model with the lowest word error rate. Supports zero-shot voice cloning from 10-30 second samples.

Languages: 13
Parameters: 4B
VRAM: ~6 GB
Latency: ~300 ms
Features: Zero-shot Cloning, Multilingual, Streaming

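Since Fish Speech v1.5 is described as cloning from 10-30 second samples, it can help to check a reference clip's length before submitting it. The following is a minimal sketch using only Python's standard `wave` module. It assumes the clip is an uncompressed PCM WAV file, and the function names are hypothetical helpers, not part of any Fish Speech API.

```python
import wave

MIN_SECONDS = 10.0  # lower bound of the sample length quoted for Fish Speech v1.5
MAX_SECONDS = 30.0  # upper bound of the sample length quoted for Fish Speech v1.5


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str) -> bool:
    """Return True if the clip length falls inside the 10-30 second window."""
    return MIN_SECONDS <= clip_duration_seconds(path) <= MAX_SECONDS
```

Clips in other formats (MP3, FLAC) would need to be converted to WAV first, or measured with an audio library of your choice.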

Orpheus TTS 3B

Llama-based expressive TTS with emotion control. Produces human-like speech with natural intonation and accepts emotion tags in the input text.

Languages: English
Parameters: 3B
VRAM: ~8 GB
Latency: ~100 ms
Features: Emotion Control, Real-time Streaming, Voice Cloning

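Emotion tags for Orpheus TTS are written inline in the text to be spoken. The sketch below shows one way to catch misspelled tags before sending a prompt. The tag names in `KNOWN_TAGS` are an assumption based on the tags the Orpheus project documents and should be verified against its current README. The `unknown_tags` helper is hypothetical.

```python
import re

# Assumed tag set; verify against the Orpheus TTS README before relying on it.
KNOWN_TAGS = {"laugh", "chuckle", "sigh", "cough", "sniffle", "groan", "yawn", "gasp"}

# Matches inline tags of the form <word>.
TAG_PATTERN = re.compile(r"<([a-z]+)>")


def unknown_tags(text: str) -> list[str]:
    """Return the tags found in the text that are not in KNOWN_TAGS, in order."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in KNOWN_TAGS]


prompt = "I can't believe it worked <laugh> after all that. <sigh>"
print(unknown_tags(prompt))        # []
print(unknown_tags("Hi <shout>"))  # ['shout']
```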

OpenVoice v2

Lightweight voice cloning with tone and style control. Great for voice conversion and quick cloning with minimal resources.

Languages: 4
Parameters: ~500M
VRAM: ~3 GB
Latency: ~150 ms
Features: Fast, Voice Conversion, Style Control


XTTS v2

Coqui's versatile multilingual TTS. Supports 17 languages with voice cloning and fine-tuning capabilities.

Languages: 17
Parameters: ~1.5B
VRAM: ~4 GB
Latency: ~400 ms
Features: Multilingual, Fine-tunable, Long-form

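The VRAM and latency figures listed for the four text-to-speech models can be used to shortlist a model for a given GPU. The following sketch copies the approximate values from the listings above into a small catalog and filters by a VRAM budget. The `TTSModel` class and `fits` function are illustrative names, not part of the platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSModel:
    name: str
    vram_gb: float   # approximate VRAM requirement from the listing
    latency_ms: int  # approximate latency from the listing


TTS_MODELS = [
    TTSModel("Fish Speech v1.5", 6, 300),
    TTSModel("Orpheus TTS 3B", 8, 100),
    TTSModel("OpenVoice v2", 3, 150),
    TTSModel("XTTS v2", 4, 400),
]


def fits(models, vram_budget_gb):
    """Return the models whose approximate VRAM fits the budget, fastest first."""
    candidates = [m for m in models if m.vram_gb <= vram_budget_gb]
    return sorted(candidates, key=lambda m: m.latency_ms)


for model in fits(TTS_MODELS, vram_budget_gb=6):
    print(f"{model.name}: ~{model.vram_gb} GB, ~{model.latency_ms} ms")
```

With a 6 GB budget this leaves OpenVoice v2, Fish Speech v1.5, and XTTS v2, in that order. The listed figures are approximate, so real headroom depends on batch size and precision.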

Speech-to-Text Models

Whisper Large v3

Most Used

OpenAI's industry-standard speech recognition. Excellent accuracy across 100+ languages with robust noise handling.

Languages: 100+
Parameters: 1.5B
VRAM: ~10 GB
WER (English): ~4%
Features: High Accuracy, Multilingual, Timestamps

Canary Qwen 2.5B

NVIDIA's speech-augmented language model. Currently tops the Open ASR Leaderboard with the lowest average WER.

Languages: English focus
Parameters: 2.5B
VRAM: ~8 GB
WER (English): ~5.6%
Features: Lowest WER, LLM-powered
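The WER figures quoted for the speech-to-text models are word error rates: the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. A WER of ~4% therefore means roughly 4 errors per 100 reference words. The sketch below shows the standard word-level edit distance calculation. It splits on whitespace only, while real benchmarks usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```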

Conversational Models

PersonaPlex 7B

Featured

NVIDIA's full-duplex conversational AI. Listens and speaks simultaneously with natural interruptions and backchannels.

Parameters: 7B
Base Model: Moshi
Audio Rate: 24 kHz
Latency: <200 ms
Features: Full-Duplex, Custom Personas, Voice Conditioning


Moshi (Kyutai)

The original full-duplex speech model that PersonaPlex is built upon. Real-time voice conversations with natural turn-taking.

Parameters: 7B
LLM: Helium
Codec: Mimi
License: CC-BY-4.0
Features: Full-Duplex, Foundation Model

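To put the conversational figures in perspective, the short calculation below relates the 24 kHz audio rate and the sub-200 ms latency listed for PersonaPlex to the Mimi codec listed for Moshi. The 12.5 Hz frame rate is an assumption taken from Kyutai's published description of Mimi, not from this page.

```python
SAMPLE_RATE_HZ = 24_000  # audio rate from the PersonaPlex listing
LATENCY_BUDGET_MS = 200  # upper latency bound from the PersonaPlex listing

# Assumption: Mimi emits 12.5 frames per second, as reported by Kyutai.
CODEC_FRAME_RATE_HZ = 12.5

# Audio samples that pass by within the latency budget.
samples_in_budget = SAMPLE_RATE_HZ * LATENCY_BUDGET_MS // 1000

# Duration and size of a single codec frame.
frame_duration_ms = 1000 / CODEC_FRAME_RATE_HZ
samples_per_frame = int(SAMPLE_RATE_HZ / CODEC_FRAME_RATE_HZ)

# Number of codec frames that fit inside the latency budget.
frames_in_budget = LATENCY_BUDGET_MS / frame_duration_ms

print(samples_in_budget)  # 4800
print(frame_duration_ms)  # 80.0
print(samples_per_frame)  # 1920
print(frames_in_budget)   # 2.5
```

Under that assumption, a 200 ms budget covers 4,800 audio samples, or two and a half 80 ms codec frames, which leaves room for only a couple of model steps per response.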