Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion ASR Model Transcribing Six Languages via DiffusionGemma’s Parallel Denoising Decoder
Interfaze, a young YC-backed startup, has open-sourced a new speech recognition model called diffusion-gemma-asr-small. The model transcribes audio through a diffusion decoder, not an autoregressive one. Interfaze describes it as the first multilingual audio diffusion ASR model. One adapter handles six languages. The research team trained only about 42M parameters on top of a frozen 26B backbone. That is roughly 0.16% of the model’s weights.
Two terms matter up front. Autoregressive models generate text one token at a time. Diffusion models refine all tokens in parallel. This model uses the diffusion approach for speech-to-text.
TL;DR
- Interfaze claims it is the first open-source multilingual diffusion ASR model: six languages from a single ~42M-parameter adapter.
- Transcribes via DiffusionGemma’s diffusion decoder using uniform, random-token diffusion, not the absorbing <mask> scheme.
- Transcription cost scales with denoising steps, not transcript length.
- Leads diffusion peers on LibriSpeech (6.6% WER vs Whisfusion’s 8.3%) but trails autoregressive Whisper.
- The adapter ships under Apache-2.0; DiffusionGemma (Gemma terms) and whisper-small (MIT) load separately.
What is diffusion-gemma-asr-small?
diffusion-gemma-asr-small is an audio-native ASR model. It converts speech to text using a discrete diffusion decoder. That decoder belongs to DiffusionGemma, Google’s 26B mixture-of-experts model. DiffusionGemma activates 4B parameters, using 128 experts with top-8 routing. It generates text by discrete diffusion instead of autoregression.
The diffusion detail is specific. Most diffusion LLMs use an absorbing <mask> scheme. DiffusionGemma uses uniform, random-token diffusion instead. It fills a fixed-length canvas with random vocabulary tokens. Each step keeps confident predictions and re-randomizes the rest. After a few steps the noise anneals into text.
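The loop described above can be sketched with a toy example. This is not DiffusionGemma's sampler: `toy_predict` is a hypothetical stand-in that already knows the answer and reports a random confidence, and the linearly loosening threshold is an assumption made only so the example terminates. It shows the shape of the procedure: start from random vocabulary tokens, keep confident predictions, re-randomize the rest.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "a", "in", "hat"]
TARGET = ["the", "cat", "sat", "on", "the", "mat"]  # what a real denoiser would have to infer

def toy_predict(canvas, rng):
    """Hypothetical stand-in for the denoiser: one (token, confidence) pair per slot."""
    return [(TARGET[i], rng.random()) for i in range(len(canvas))]

def denoise(steps, seed=0):
    rng = random.Random(seed)
    n = len(TARGET)
    # Start from pure noise: every slot holds a random vocabulary token.
    canvas = [rng.choice(VOCAB) for _ in range(n)]
    locked = [False] * n
    locked_per_step = []
    for step in range(1, steps + 1):
        preds = toy_predict(canvas, rng)
        # Assumed schedule: the confidence bar drops linearly and hits 0 on the last step.
        threshold = 1.0 - step / steps
        for i, (token, confidence) in enumerate(preds):
            if locked[i]:
                continue
            if confidence >= threshold:
                canvas[i] = token              # keep the confident prediction
                locked[i] = True
            else:
                canvas[i] = rng.choice(VOCAB)  # re-randomize the rest
        locked_per_step.append(sum(locked))
    return canvas, locked_per_step

final_canvas, progress = denoise(steps=8)
print(" ".join(final_canvas))
print(progress)
```

Every slot is updated in the same pass, which is why the number of passes depends on the step count and not on how many tokens the transcript contains.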
Interfaze added audio to this text-only model. Out of the box, DiffusionGemma takes text, images, and video. It does not take audio. The repo ships only the trained adapter, about 42M parameters. The frozen backbones download separately from their own repos.
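The size figures quoted in this article can be cross-checked with a few lines of arithmetic. The 42M and 26B numbers come from the text, the ~19M projector size from the model-card pipeline, and the ~170 MB download from the quickstart comment. Two things below are assumptions, not statements from Interfaze: that the parameters left after the projector are the LoRA weights, and that the checkpoint stores float32 values.

```python
TRAINED_PARAMS = 42e6     # adapter size quoted in the article
BACKBONE_PARAMS = 26e9    # DiffusionGemma total parameters
PROJECTOR_PARAMS = 19e6   # projector size from the model-card pipeline

# Share of the backbone that was actually trained.
trained_fraction = TRAINED_PARAMS / BACKBONE_PARAMS

# Assumption: whatever is not projector is LoRA.
lora_params = TRAINED_PARAMS - PROJECTOR_PARAMS

# Assumption: 4 bytes per parameter (float32 checkpoint).
fp32_megabytes = TRAINED_PARAMS * 4 / 1e6

print(f"trained fraction: {trained_fraction:.2%}")
print(f"approx. LoRA parameters: {lora_params / 1e6:.0f}M")
print(f"approx. checkpoint size: {fp32_megabytes:.0f} MB")
```

Under those assumptions the checkpoint works out to about 168 MB, which lines up with the ~170 MB adapter download.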
How it works
The model does not feed raw waveforms to the LLM. An early attempt tried exactly that and failed. A frozen LLM has never seen a spectrogram. The embedding space has no notion of formants or phonemes. The model learned to ignore audio and hallucinate fluent nonsense.
The working design uses a frozen whisper-small encoder. It acts only as a feature extractor, not a decoder. Whisper turns 30 seconds of audio into 1500 frames. Each frame holds 768-dimensional acoustic features. A small trainable projector then compresses these frames. It uses conv layers that subsample 8× plus a linear map. The output is 188 “audio tokens” at 2816 dimensions. These tokens scatter into the prompt’s reserved <|audio|> slots. LoRA adapters let the backbone attend to this new modality. The decoder then denoises a 192-token transcript canvas. It runs bidirectionally over roughly 16 steps.
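The frame counts above can be checked with the standard 1-D convolution length formula. The layer layout here is an assumption, since the article only says the conv layers subsample 8×: three stride-2 convolutions with kernel 3 and padding 1 is one common way to get there, and it happens to land on exactly 188.

```python
def conv1d_out_len(length, kernel=3, stride=2, padding=1):
    """Output length of a 1-D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1

FRAMES = 1500  # whisper-small encoder frames for 30 s of audio (from the article)

lengths = [FRAMES]
for _ in range(3):  # assumed: three stride-2 convs give the 8x subsampling
    lengths.append(conv1d_out_len(lengths[-1]))

ms_per_frame = 30_000 / FRAMES          # 20 ms of audio per Whisper frame
ms_per_audio_token = ms_per_frame * 8   # 160 ms of audio per projected token

print(lengths)             # [1500, 750, 375, 188]
print(ms_per_audio_token)  # 160.0
```

Each of the 188 audio tokens therefore summarizes roughly 160 ms of speech before the linear map lifts it from 768 to 2816 dimensions.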
The pipeline, from the model card, is compact:
raw audio ─► whisper-small encoder (frozen) ─► projector (trained, ~19M)
─► scatter into <audio> token slots of DiffusionGemma's encoder
─► DiffusionGemma decoder denoises a 192-token canvas (bidirectional, cross-attends audio)
─► transcript
The training unlock
The first training runs stalled. Loss flatlined near 8. The failure was circular. The projector started random, so its output was noise. Attention then learned to ignore it. Almost no gradient reached the projector. The model never learned.
The fix supervised the projector directly. The research team ran the 188 audio tokens through DiffusionGemma’s frozen lm_head. They applied a CTC loss against the transcript. CTC means Connectionist Temporal Classification. It aligns audio features to text without needing attention.
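To make the CTC objective concrete, here is a minimal pure-Python sketch of the CTC forward algorithm on a toy input. It is not the team's training code. In the setup described above, the per-frame distributions would come from applying lm_head to the 188 audio tokens. Real training would normally use a library implementation in log space (for example `torch.nn.functional.ctc_loss`), so working in plain probabilities here is a simplification that only suits tiny inputs.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """probs: per-frame probability rows over the vocabulary.
    target: label ids without blanks. Returns -log P(target | probs)."""
    # Interleave blanks: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1][s]                # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1][s - 1]       # advance by one
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1][s - 2]
            alpha[t][s] = total * probs[t][ext[s]]
    likelihood = alpha[T - 1][S - 1]
    if S > 1:
        likelihood += alpha[T - 1][S - 2]
    return -math.log(likelihood)

# Two frames, vocabulary {0: blank, 1: "a"}, uniform predictions.
# Valid alignments for target "a": (a, a), (a, blank), (blank, a) -> 3 * 0.25.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(round(ctc_neg_log_likelihood(uniform, [1]), 4))  # 0.2877
```

Because the loss sums over every valid alignment, gradients reach the audio features directly, without depending on attention having already learned to look at them.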
This sidesteps the standoff. The audio embeddings became linearly predictive of the right words. CTC loss then dropped from 24 to 8.6 in 300 steps. On LibriSpeech test-clean, English WER fell 90% → 52% → 14.6% → 6.6% over ten epochs.
Performance and benchmarks
WER means Word Error Rate, where lower is better. CER means Character Error Rate. The model trained on FLEURS, LibriSpeech, and VoxPopuli. All scores below use the Whisper text normalizer at 16 diffusion steps.
| benchmark | metric | score |
|---|---|---|
| LibriSpeech test-clean (en) | WER | 6.6% |
| FLEURS English | WER | 15.7% |
| VoxPopuli English | WER | 18.5% |
| FLEURS Hindi | CER | 15.8% |
| FLEURS Mandarin | CER | 29.6% |
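Both metrics in the table above reduce to an edit distance divided by the reference length, counted over words for WER and over characters for CER. The sketch below is a plain implementation for illustration. It skips the Whisper text normalizer that the published scores use, so it will not reproduce those numbers exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
# One wrong character out of five.
print(cer("扩散解码器", "扩散译码器"))  # 0.2
```

CER is the usual choice for Mandarin because the text has no whitespace word boundaries to count against.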
Against other diffusion or non-autoregressive ASR models, it leads on LibriSpeech test-clean.
| model | approach | LibriSpeech test-clean WER |
|---|---|---|
| TransFusion (2022) | multinomial diffusion | ~6–7% (proof-of-concept) |
| Whisfusion (Aug 2025) | Whisper-large-v3 + masked diffusion | 8.3% |
| diffusion-gemma-asr-small (2026) | Whisper-small + DiffusionGemma | 6.6% |
Against autoregressive Whisper, it trails. The team frames this gap as data, not architecture.
| benchmark (WER) | diffusion-gemma-asr-small | Whisper-small | Whisper-large-v3 |
|---|---|---|---|
| LibriSpeech clean | 6.6% | ~3.4% | ~2.0% |
| FLEURS-en | 15.7% | ~9–10% | ~4–5% |
| VoxPopuli-en | 18.5% | ~9–11% | ~7–10% |
The denoising-step sweep shows a nearly flat curve.
| steps | FLEURS-en WER | speed |
|---|---|---|
| 8 | 15.7% | 14.9× real-time |
| 16 | 15.6% | 10.3× |
| 32 | 15.2% | 6.5× |
| 48 | 15.6% | 4.7× |
Going from 8 to 48 steps buys about 0.1 WER point. It costs roughly 3× the latency. The model converges in about 8 parallel passes. That is around 0.7–1.5s of model time for a 10-second clip.
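The trade-off can be recomputed from the sweep table. The sketch assumes that "N× real-time" means N seconds of audio processed per second of model time, which is the usual reading but is not spelled out in the source.

```python
CLIP_SECONDS = 10.0

# steps -> (FLEURS-en WER in %, real-time factor), copied from the table above
SWEEP = {8: (15.7, 14.9), 16: (15.6, 10.3), 32: (15.2, 6.5), 48: (15.6, 4.7)}

def model_seconds(clip_seconds, real_time_factor):
    """Model time for a clip, assuming factor = audio seconds per compute second."""
    return clip_seconds / real_time_factor

for steps, (wer_pct, rtf) in SWEEP.items():
    seconds = model_seconds(CLIP_SECONDS, rtf)
    print(f"{steps:>2} steps: WER {wer_pct}%, about {seconds:.2f}s for a 10 s clip")

slowdown = SWEEP[8][1] / SWEEP[48][1]   # how much slower 48 steps is than 8
wer_gain = SWEEP[8][0] - SWEEP[48][0]   # WER points gained by the extra steps

print(f"48 vs 8 steps: {slowdown:.1f}x slower for {wer_gain:.1f} WER point")
```

Under that reading, 8 steps takes about 0.67 s and 32 steps about 1.54 s for a 10-second clip, which matches the 0.7–1.5s range quoted above.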
Use cases with examples
- Batch transcription pipelines benefit from parallel decoding. Cost is set by denoising steps, not clip length. A 10-second clip needs roughly the same passes as a shorter one.
- Multilingual transcription runs from a single adapter. It covers English, German, French, Spanish, Hindi, and Mandarin. Teams avoid loading a separate model per language.
- Non-autoregressive ASR research gains a reproducible baseline. The recipe grounds a frozen LLM with a small adapter. Researchers can extend it with more audio or a larger encoder.
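The first use case rests on a difference in how decoder passes are counted, which a small sketch can illustrate. The token counts below are invented for illustration, and a pass is not a fixed unit of cost: each diffusion pass processes the whole 192-token canvas, while an autoregressive pass produces one token. Treat this as a comparison of scaling behavior, not a benchmark.

```python
def diffusion_passes(num_clips, steps=8):
    """One full-canvas forward pass per denoising step, per clip."""
    return num_clips * steps

def autoregressive_passes(transcript_token_counts):
    """One decoder pass per generated token (standard step-by-step decoding)."""
    return sum(transcript_token_counts)

# Hypothetical batch of four clips with very different transcript lengths.
token_counts = [12, 40, 95, 150]

print(diffusion_passes(len(token_counts)))   # 32
print(autoregressive_passes(token_counts))   # 297
```

The diffusion count stays at 32 however long the transcripts are, while the autoregressive count grows with every extra token.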
How to get started
The model lives on the Hugging Face Hub. It ships the adapter, model.py, audio.py, and a runnable inference.py. DiffusionGemma support requires transformers installed from the main branch.
pip install torch peft soundfile librosa huggingface_hub \
  "transformers @ git+https://github.com/huggingface/transformers.git"
Then transcribe in Python:
import sys, soundfile as sf
from huggingface_hub import snapshot_download
repo = snapshot_download("interfaze-ai/diffusion-gemma-asr-small") # adapter, ~170 MB
sys.path.insert(0, repo)
from inference import load, transcribe
# Loads frozen DiffusionGemma-26B + whisper-small + this adapter.
model, tok, fe = load(f"{repo}/diffusion_asr_small.pt", device="cuda")
wav, sr = sf.read("audio.wav") # 16 kHz mono float32
print(transcribe(wav, model, tok, fe, max_steps=16))
A command-line path also works from inside the downloaded repo:
python inference.py audio.wav
The max_steps argument trades speed for accuracy. The team notes 8 is near-best and fastest. The default is 16. The base models load under their own licenses: DiffusionGemma under Gemma terms, whisper-small under MIT.
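The Python snippet above notes that the waveform should be 16 kHz mono float32. Whether inference.py converts other inputs on its own is not stated here, so a defensive helper can normalize audio first. `to_16k_mono` is a hypothetical name for this sketch and is not part of the repo. It relies on `soundfile.read` returning arrays shaped (frames, channels) for multichannel files and on `librosa.resample` with its `orig_sr` and `target_sr` keyword arguments.

```python
import numpy as np

def to_16k_mono(wav, sr, target_sr=16000):
    """Downmix to mono, resample to target_sr if needed, and cast to float32."""
    wav = np.asarray(wav, dtype=np.float32)
    if wav.ndim == 2:
        # soundfile returns (frames, channels); average the channels.
        wav = wav.mean(axis=1)
    if sr != target_sr:
        # Imported lazily so the helper still works without librosa
        # when the input is already at the target rate.
        import librosa
        wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    return wav.astype(np.float32)
```

With this in place, the earlier call becomes `transcribe(to_16k_mono(wav, sr), model, tok, fe, max_steps=8)`, using 8 steps as the near-best, fastest setting the team reports.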
The post Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion ASR Model Transcribing Six Languages via DiffusionGemma’s Parallel Denoising Decoder appeared first on MarkTechPost.
