Audio

SambaNova’s first speech reasoning model on SambaNova Cloud will extend our multimodal AI capabilities beyond vision to include advanced audio processing and understanding. This model offers OpenAI compatible endpoints that enable real-time reasoning, transcriptions and translations.

The Whisper-Large-v3 model

Model: Whisper-Large-v3
Description: State-of-the-art automatic speech recognition (ASR) and translation model. Developed by OpenAI and trained on 5M+ hours of labeled audio. Excels in multilingual and zero-shot speech tasks across diverse domains.
Model ID: whisper-large-v3
Supported languages: Multilingual

Core capabilities

Transcribes and translates extended audio inputs (up to 25 MB).
Demonstrates high accuracy in speech recognition and translation tasks.
Provides OpenAI-compatible endpoints for transcriptions and translations.

Request parameters

Parameter	Type	Default	Description	Endpoints
`model`	String	Required	The ID of the model to use.	`transcriptions`, `translations`
`prompt`	String	Optional	Prompt provided to influence transcription style or vocabulary. Example: “Please transcribe carefully, including pauses and hesitations.”	`transcriptions`, `translations`
`temperature`	Number	0	Sampling temperature between 0 and 1. Higher values increase randomness; lower values produce more focused output.	`transcriptions`, `translations`
`file`	File	Required	Audio file in FLAC, MP3, MP4, MPEG, MPGA, M4A, Ogg, WAV, or WebM format. File size limit: 25 MB	`transcriptions`, `translations`
`response format`	String	JSON	Output format: either JSON or text.	`transcriptions`, `translations`
`language`	String	Optional	The language of the input audio. Supplying the input language in ISO-639-1 (e.g., en) format will improve accuracy and latency.	`transcriptions`

The Qwen2-Audio Instruct model

Model: Qwen2-Audio Instruct
Description: Instruction-tuned large audio language model. Built on Qwen-7B with Whisper-large-v3 audio encoder (8.2B parameters).
Model ID: qwen2-audio-7b-instruct
Supported languages: Multilingual

This model is currently being provided as a beta model.

Core capabilities

Transform audio into Intelligence: Allows you to build GPT-4-like voice applications quickly.
Provides direct question-answering for any audio input.
Comprehensive audio processing that includes real-time conversation, transcription, translation, and analysis through a single unified model.

Customization and control

System-level prompts: Use Assistant Prompt in the request to customize model behavior for specific requirements. See the message parameter in the Request parameters section for more details.
- Brand-specific formatting (e.g., BrandName vs brandname).
- Domain-specific terminology.
- Response style and tone control.

View the Audio reasoning, Translation, and Transcription API endpoint documents for more details.

Audio processing

Silence detection: Intelligent identification of meaningful pauses and gaps in speech.
Noise cancellation: Advanced noise filtering and clean audio processing.
Multilingual processing: Support for multiple languages with automatic language detection.

Analysis capabilities

Sentiment analysis: Detects and analyzes emotional content in speech.
Multi-speaker handling: Processes conversations with multiple participants.
Mixed audio understanding: Comprehends speech, music, and environmental sounds.

Speech recognition performance numbers

Metrics taken from published Qwen2-Audio paper benchmarks.
WER%, lower is better

Language	Dataset	Qwen2-Audio	Whisper-large-v3	Improvement
English	Common Voice 15	8.6%	9.3%	+7.5%
Chinese	Common Voice 15	6.9%	12.8%	+46.1%

Request parameters

Parameter	Type	Default	Description	Endpoints
`model`	String	Required	The ID of the model to use. Only Qwen2-Audio-7B-Instruct is currently available.	All
`messages`	Message	Required	A list of messages containing role (user/system/assistant), type (text/audio_content), and audio_content (base64 audio content).	All
`response_format`	String	JSON	The output format, either JSON or text.	All
`temperature`	Number	0	Sampling temperature between 0 and 1. Higher values (e.g., 0.8) increase randomness, while lower values (e.g., 0.2) make output more focused.	All
`max_tokens`	Number	1000	The maximum number of tokens to generate.	All
`file`	File	Required	Audio file in FLAC, MP3, MP4, MPEG, MPGA, M4A, Ogg, WAV, or WebM format. Each single file must not exceed 30 seconds in duration.	All
`language`	String	Optional	The target language for transcription or translation.	Transcription, Translation
`stream`	Boolean	False	Enables streaming responses.	All
`stream_options`	Object	Optional	Additional streaming configuration (e.g., {“include_usage”: true}).	All

Get started

Capabilities

Build with SambaNova

Integrations

Examples

Resources

The Whisper-Large-v3 model

Core capabilities

Request parameters

The Qwen2-Audio Instruct model

Core capabilities

Customization and control

Audio processing

Analysis capabilities

Speech recognition performance numbers

Request parameters

Get started

Capabilities

Build with SambaNova

Integrations

Examples

Resources

​The Whisper-Large-v3 model

​Core capabilities

​Request parameters

​The Qwen2-Audio Instruct model

​Core capabilities

​Customization and control

​Audio processing

​Analysis capabilities

​Speech recognition performance numbers

​Request parameters

The Whisper-Large-v3 model

Core capabilities

Request parameters

The Qwen2-Audio Instruct model

Core capabilities

Customization and control

Audio processing

Analysis capabilities

Speech recognition performance numbers

Request parameters