Gradium Launches stt-translate and s2s-translate, Real-Time Speech Translation Models Beating gpt-realtime-translate on Accuracy and Latency

Gradium today released two real-time speech translation models: stt-translate and s2s-translate. Both run across five languages and stream results live in the browser.

Gradium claims a better accuracy-latency tradeoff than gpt-realtime-translate and gemini-3.5-live-translate. It also adds output voice control, including cloning, that gpt-realtime-translate lacks.

TL;DR

  • Gradium launched two real-time speech translation models: stt-translate (speech → text) and s2s-translate (speech → speech).
  • They cover five languages (EN, FR, DE, ES, PT) and 20 pairs, collapsing the usual 3-model cascade into 2.
  • Accuracy leads gemini-3.5-live-translate on BLEU and MetricX, and beats gpt-realtime-translate on BLEU (comparable on MetricX).
  • Latency averages 3.0s: ahead of gpt-realtime-translate (3.6s), just behind gemini-3.5-live-translate (2.9s).
  • Unlike gpt-realtime-translate, you pick the output voice or clone your own, all over one duplex WebSocket.

stt-translate

stt-translate takes speech in one language and returns text in another. It supports English (EN), French (FR), German (DE), Spanish (ES), and Portuguese (PT).

Any source maps to any target across that set. That is 20 language pairs in total, in every direction.
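
The count of 20 follows from treating each direction as its own pair: five languages, each translated into the four others. A minimal sketch of that arithmetic is below; the variable names are illustrative and not part of any Gradium API.

```python
from itertools import permutations

# The five launch languages named in the article.
LANGUAGES = ["EN", "FR", "DE", "ES", "PT"]

# Ordered (source, target) pairs: direction matters, and a language
# is never paired with itself, so there are 5 * 4 of them.
PAIRS = list(permutations(LANGUAGES, 2))

print(len(PAIRS))             # 20
print(("FR", "EN") in PAIRS)  # True: FR→EN is counted separately from EN→FR
```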

The key design choice is collapsing two steps into one. Transcription and translation happen in a single pass, inside the speech model. There is no intermediate transcript to wait on and no handoff between systems.

According to Gradium, the approach draws on the Hibiki-Zero framework: the model is trained with reinforcement learning to optimize for low latency and high accuracy jointly. The single-pass design also means fewer moving parts in the pipeline.

s2s-translate

s2s-translate turns spoken audio in one language into spoken audio in another, end to end. It builds on stt-translate and pairs it with a Gradium TTS model in one service.

You stream audio in over a WebSocket. You receive both the synthesized output audio and the translated transcript as they are produced.

That removes integration work. You do not wire STT and TTS together yourself or manage two connections. The server runs the pipeline and streams results back.

Input audio is PCM at 24 kHz, 16-bit signed mono. Output audio is PCM at 48 kHz, 16-bit signed mono. WAV, Opus, mu-law, and A-law are also supported.
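
For raw PCM, those formats translate directly into byte counts, which matters when you slice audio into fixed-duration chunks for streaming. The helper below is a hypothetical convenience function, not part of the Gradium SDK; it only applies the stated sample rates and the 16-bit mono layout.

```python
def pcm_chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM needed to hold chunk_ms milliseconds of audio."""
    return sample_rate_hz * bytes_per_sample * channels * chunk_ms // 1000

# 16-bit signed mono means 2 bytes per sample and 1 channel.
INPUT_CHUNK = pcm_chunk_bytes(24_000, 40)    # 40 ms of 24 kHz input
OUTPUT_CHUNK = pcm_chunk_bytes(48_000, 40)   # 40 ms of 48 kHz output

print(INPUT_CHUNK)   # 1920
print(OUTPUT_CHUNK)  # 3840
```

At the same duration, the 48 kHz output carries twice as many bytes as the 24 kHz input.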

How Gradium Measures Quality: BLEU and MetricX

Translation quality is not one number, so Gradium reports two complementary metrics:

BLEU (Bilingual Evaluation Understudy) is the long-standing machine translation standard (Papineni et al.). It measures n-gram overlap between model output and human reference translations. It runs from 0 to 100, where higher is better.

BLEU is fast, reproducible, and comparable across systems. Its limit is that it rewards surface word matching. A correct translation using different wording can be penalized.
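
The surface-matching limit is easy to see with clipped n-gram precision, the core ingredient of BLEU. The sketch below is a simplified illustration only: it omits the brevity penalty, the geometric mean over n-gram orders, and corpus-level aggregation, and it is not Gradium's evaluation code. The example sentences are made up for the demonstration.

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-token windows in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram's matches are clipped to its count in the reference,
    so repeating a correct word does not inflate the score.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "the weather is beautiful today"
literal = "the weather is beautiful today"
paraphrase = "it is a lovely day today"   # same meaning, different wording

print(clipped_precision(literal, reference, 1))               # 1.0
print(round(clipped_precision(paraphrase, reference, 1), 2))  # 0.33
print(clipped_precision(paraphrase, reference, 2))            # 0.0
```

The paraphrase is an acceptable translation, yet it shares only two words and no two-word sequences with the reference, so overlap-based scoring rates it poorly.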

MetricX is a learned, neural quality metric developed by Google (Juraska et al.). It predicts how a human would rate a translation. It is an error score, so lower is better, and it tracks human judgment more closely than BLEU.

The two catch different failures. BLEU checks lexical fidelity; MetricX checks semantic adequacy.

Benchmark

Gradium benchmarks on a proprietary dataset of conversational speech. The data reflects everyday topics like work, travel, and weather, rather than scripted text.

Against gemini-3.5-live-translate, Gradium leads on both BLEU and MetricX. Against gpt-realtime-translate, Gradium leads on BLEU and is comparable on MetricX.

| Capability | Gradium | gpt-realtime-translate | gemini-3.5-live-translate |
|---|---|---|---|
| Average latency (all pairs) | 3.0s | 3.6s | 2.9s |
| BLEU (higher is better) | Leads both | Lower than Gradium | Lower than Gradium |
| MetricX (lower error is better) | Comparable to GPT; leads Gemini | Comparable to Gradium | Higher error than Gradium |
| Choose output voice | Yes (catalogue) | No | Not stated |
| Clone your own voice | Yes | No | Not stated |
| Languages | 5 languages, 20 pairs | Not stated | Not stated |

Accuracy (BLEU and MetricX) is measured on stt-translate’s translation; latency is for the full s2s-translate pipeline. Read it as a tradeoff, not a clean sweep. Gemini is fractionally faster; Gradium is more accurate and adds voice control.

Why Two Models Beat Three

The standard speech-to-speech stack uses three models: Speech-To-Text, then Text-To-Text translation, then Text-To-Speech. Each stage is a separate inference call. Each adds processing time and a handoff.

Gradium uses two. stt-translate performs transcription and translation in a single pass. The dedicated Text-To-Text stage disappears entirely.

That removes one full model from the critical path, along with its latency and handoff. The end-to-end path is shorter than a three-model cascade at equivalent quality.

The reported latency is consistent with that design. s2s-translate averages 3.0s across all language pairs. That beats gpt-realtime-translate at 3.6s and sits near gemini-3.5-live-translate at 2.9s.
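
To put those averages in proportion, the short calculation below expresses the gaps in seconds and as a percentage of each competitor's latency. It uses only the three figures Gradium reports; the function name is illustrative.

```python
# Average end-to-end latency in seconds, as reported by Gradium.
LATENCY_S = {
    "Gradium s2s-translate": 3.0,
    "gpt-realtime-translate": 3.6,
    "gemini-3.5-live-translate": 2.9,
}

def gap(baseline: str, other: str) -> tuple[float, float]:
    """Return (gap in seconds, gap as % of `other`).

    Negative values mean `baseline` is faster than `other`.
    """
    diff = LATENCY_S[baseline] - LATENCY_S[other]
    return round(diff, 2), round(100 * diff / LATENCY_S[other], 1)

print(gap("Gradium s2s-translate", "gpt-realtime-translate"))     # (-0.6, -16.7)
print(gap("Gradium s2s-translate", "gemini-3.5-live-translate"))  # (0.1, 3.4)
```

By these figures, Gradium is about 17% faster than gpt-realtime-translate and about 3% slower than gemini-3.5-live-translate.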

Use Cases With Examples

  • Live dubbing and localization: Clone a presenter’s voice once. Translate a French keynote into Spanish that still sounds like the original speaker.
  • Multilingual voice agents: Route a support call through s2s-translate. An English agent hears a German caller in English, and replies stream back in German.
  • Real-time meetings: Pipe microphone audio in over the WebSocket. Each participant receives translated speech and transcript in their own language.
  • Accessibility and captioning: Use stt-translate alone when you only need text. Render live translated captions without generating audio.

Translate in a Few Lines of Code

The Python SDK streams audio through the Speech-To-Speech endpoint and returns translated audio plus transcript.

import asyncio
import numpy as np
from gradium import client as gradium_client

grc = gradium_client.GradiumClient()  # reads GRADIUM_API_KEY from the environment

setup = {
    "model_name": "s2s-translate",
    "input_format": "pcm_24000",        # 24 kHz, 16-bit signed mono input
    "output_format": "pcm_48000",       # 48 kHz, 16-bit signed mono output
    "voice_id": "cLONiZ4hQ8VpQ4Sz",     # must be a voice in the target language
    "stt_model_name": "stt-translate",
    "tts_model_name": "default",
    "target_language": "en",
}

# Raw 24 kHz, 16-bit mono PCM bytes (from a file, buffer, or microphone).
with open("input_24k_mono.pcm", "rb") as f:
    pcm = f.read()

async def main() -> np.ndarray:
    audio_out: list[bytes] = []
    async with grc.s2s_realtime(wait_for_ready_on_start=True, **setup) as s2s:
        async def send_loop():
            for i in range(0, len(pcm), 1920):       # 1920 bytes = 40 ms at 24 kHz
                await s2s.send_audio(pcm[i : i + 1920])
            await s2s.send_eos()                     # signal end of input

        async def recv_loop():
            async for msg in s2s:
                if msg["type"] == "audio":
                    audio_out.append(msg["audio"])           # translated speech (bytes)
                elif msg["type"] == "text":
                    print(msg["text"], end=" ", flush=True)  # translated transcript
                elif msg["type"] == "end_of_stream":
                    break

        async with asyncio.TaskGroup() as tg:
            tg.create_task(send_loop())
            tg.create_task(recv_loop())

    return np.frombuffer(b"".join(audio_out), dtype=np.int16)  # 48 kHz mono PCM

translated_pcm = asyncio.run(main())
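
The example above leaves the translated speech as raw 48 kHz, 16-bit mono PCM, which most players will not open directly. One way to listen to it, sketched here with Python's standard wave module, is to wrap the bytes in a WAV container. The helper names and the demo file name are illustrative and not part of the Gradium SDK; the demo writes one second of silence so the block runs without an API key.

```python
import os
import tempfile
import wave

def save_pcm_as_wav(path: str, pcm: bytes, sample_rate_hz: int = 48_000) -> None:
    """Wrap raw 16-bit signed mono PCM bytes in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)              # mono
        wav.setsampwidth(2)              # 16-bit samples = 2 bytes each
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)

def wav_info(path: str) -> tuple[int, int, int, int]:
    """Return (channels, bytes per sample, sample rate, frame count)."""
    with wave.open(path, "rb") as wav:
        return (wav.getnchannels(), wav.getsampwidth(),
                wav.getframerate(), wav.getnframes())

# Demo: one second of silence at 48 kHz (48,000 two-byte samples).
demo_path = os.path.join(tempfile.gettempdir(), "gradium_demo_48k.wav")
save_pcm_as_wav(demo_path, b"\x00\x00" * 48_000)
print(wav_info(demo_path))  # (1, 2, 48000, 48000)
```

With the array returned by main() above, the call would be save_pcm_as_wav("translated_48k.wav", translated_pcm.tobytes()).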

The SDK exposes three ways to drive S2S. Use s2s_realtime for live sources, s2s_stream for finite iterables, and s2s for buffered files. All three talk to wss://api.gradium.ai/api/speech/s2s.

Strengths and Weaknesses

Strengths

  • Single-pass stt-translate removes one model from the latency path
  • Leads gemini-3.5-live-translate on both BLEU and MetricX
  • Output voice choice and cloning, which gpt-realtime-translate lacks
  • One duplex WebSocket replaces a hand-wired STT-plus-TTS pipeline

Weaknesses

  • Five languages at launch, with 20 pairs only across that set
  • gemini-3.5-live-translate is fractionally lower latency at 2.9s
  • MetricX is only comparable to, not ahead of, gpt-realtime-translate
  • Benchmarks use a proprietary dataset, so external replication is limited

You can test real-time translation in the browser at gradium.ai/translate, with integration details in the API docs.

The post Gradium Launches stt-translate and s2s-translate, Real-Time Speech Translation Models Beating gpt-realtime-translate on Accuracy and Latency appeared first on MarkTechPost.