whisper-large-v3
Parameter | Type | Default | Description | Endpoints |
---|---|---|---|---|
model | String | Required | The ID of the model to use. | transcriptions , translations |
prompt | String | Optional | Prompt provided to influence transcription style or vocabulary. Example: “Please transcribe carefully, including pauses and hesitations.” | transcriptions , translations |
temperature | Number | 0 | Sampling temperature between 0 and 1. Higher values increase randomness; lower values produce more focused output. | transcriptions , translations |
file | File | Required | Audio file in FLAC, MP3, MP4, MPEG, MPGA, M4A, Ogg, WAV, or WebM format. File size limit: 25 MB | transcriptions , translations |
response format | String | JSON | Output format: either JSON or text. | transcriptions , translations |
language | String | Optional | The language of the input audio. Supplying the input language in ISO-639-1 (e.g., en) format will improve accuracy and latency. | transcriptions |
qwen2-audio-7b-instruct
message
parameter in the Request parameters section for more details.
Language | Dataset | Qwen2-Audio | Whisper-large-v3 | Improvement |
---|---|---|---|---|
English | Common Voice 15 | 8.6% | 9.3% | +7.5% |
Chinese | Common Voice 15 | 6.9% | 12.8% | +46.1% |
Parameter | Type | Default | Description | Endpoints |
---|---|---|---|---|
model | String | Required | The ID of the model to use. Only Qwen2-Audio-7B-Instruct is currently available. | All |
messages | Message | Required | A list of messages containing role (user/system/assistant), type (text/audio_content), and audio_content (base64 audio content). | All |
response_format | String | JSON | The output format, either JSON or text. | All |
temperature | Number | 0 | Sampling temperature between 0 and 1. Higher values (e.g., 0.8) increase randomness, while lower values (e.g., 0.2) make output more focused. | All |
max_tokens | Number | 1000 | The maximum number of tokens to generate. | All |
file | File | Required | Audio file in FLAC, MP3, MP4, MPEG, MPGA, M4A, Ogg, WAV, or WebM format. Each single file must not exceed 30 seconds in duration. | All |
language | String | Optional | The target language for transcription or translation. | Transcription, Translation |
stream | Boolean | False | Enables streaming responses. | All |
stream_options | Object | Optional | Additional streaming configuration (e.g., {“include_usage”: true}). | All |