<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>stt &#8211; Gig City Geek</title>
	<atom:link href="https://gigcitygeek.com/tag/stt/feed/" rel="self" type="application/rss+xml" />
	<link>https://gigcitygeek.com</link>
	<description></description>
	<lastBuildDate>Fri, 17 Apr 2026 14:06:07 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://gigcitygeek.com/wp-content/uploads/2026/01/cropped-GigCityGeek_Logo-32x32.png</url>
	<title>stt &#8211; Gig City Geek</title>
	<link>https://gigcitygeek.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Goodbye Whisper Server: Seamless Speech with Llama 4 and Gemma</title>
		<link>https://gigcitygeek.com/2026/04/17/llama-4-gemma-integrated-speech-input/</link>
					<comments>https://gigcitygeek.com/2026/04/17/llama-4-gemma-integrated-speech-input/#respond</comments>
		
		<dc:creator><![CDATA[Laronski]]></dc:creator>
		<pubDate>Fri, 17 Apr 2026 13:00:00 +0000</pubDate>
				<category><![CDATA[AI Service]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[AI Development]]></category>
		<category><![CDATA[ai-agents]]></category>
		<category><![CDATA[gemma]]></category>
		<category><![CDATA[gpu-memory]]></category>
		<category><![CDATA[llama-4]]></category>
		<category><![CDATA[local AI]]></category>
		<category><![CDATA[speech-to-text]]></category>
		<category><![CDATA[stt]]></category>
		<category><![CDATA[voice-interface]]></category>
		<category><![CDATA[whisper]]></category>
		<guid isPermaLink="false">https://gigcitygeek.com/?p=3636</guid>

					<description><![CDATA[Eliminate the Whisper server! Llama 4 and Gemma 4 bring direct speech input, simplifying local AI agents and removing dependencies. A huge step for frictionl...]]></description>
										<content:encoded><![CDATA[<p>Last week I was tweaking my local setup after everyone went to bed, and I realized something odd: my entire speech pipeline still depended on a separate <a title="" href="https://openai.com/research/whisper" target="_blank" rel="noopener">Whisper</a> server that crashed every time my GPU memory got tight. It worked, mostly, but it felt like dragging a trailer behind a sports car.</p>
<p>Seeing audio support land directly in <a title="" href="https://www.meta.com/blog/llama-2/" target="_blank" rel="noopener">llama-server</a> with <a title="" href="https://ai.google.dev/gemma" target="_blank" rel="noopener">Gemma 4 models</a> feels like that moment when you finally take the trailer off and see what the car can really do.</p>
<h4>A Quietly Important Upgrade</h4>
<p>What excites me here is not just “yet another <a title="" href="https://en.wikipedia.org/wiki/Speech_recognition" target="_blank" rel="noopener">STT</a> option,” but the fact that speech input now lives in the same process as your main model. No more bolting on a Whisper container, juggling ports, or translating from one API style to another. For people running fully local agents, that makes the whole stack less fragile and easier to reason about.</p>
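<p>To make the “same process” point concrete, here is a minimal sketch of what a client request could look like, assuming llama-server keeps its OpenAI-compatible <code>/v1/chat/completions</code> endpoint and accepts base64-encoded WAV through an <code>input_audio</code> content part; the endpoint shape and the model name are my assumptions, not documented API:</p>

```python
import base64

def build_audio_chat_payload(wav_bytes: bytes, prompt: str,
                             model: str = "gemma-4") -> dict:
    """Assemble an OpenAI-style chat payload with inline audio.

    Hypothetical sketch: the "input_audio" content part and the model
    name are assumptions about the local llama-server build, not a
    confirmed interface.
    """
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "input_audio",
                 "input_audio": {
                     "data": base64.b64encode(wav_bytes).decode("ascii"),
                     "format": "wav",
                 }},
            ],
        }],
    }

# POST this as JSON to http://localhost:8080/v1/chat/completions;
# no second server, no second port, no second API style.
payload = build_audio_chat_payload(b"RIFF...", "Transcribe this clip verbatim.")
```

<p>The point is less the exact schema than the shape of the workflow: one endpoint, one process, the audio riding along in the same request as the text prompt.</p>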
<p>My son already talks to his games more than he types; giving a local agent that kind of frictionless voice interface without cloud calls or extra services is a big step toward setups normal people could actually use.</p>
<h4>Some Rough Edges To Watch</h4>
<p>That said, the reality on the ground is messy, and pretending otherwise helps nobody. Early testers are already running into context limit issues, crashes on longer clips, odd looping in transcripts, and very specific prompting requirements just to get stable output. You can feel the difference when someone switches to <a title="" href="https://github.com/readout/voxtral" target="_blank" rel="noopener">Voxtral</a> or <a title="" href="https://github.com/versatile-ai/parakeet" target="_blank" rel="noopener">Parakeet</a> for anything over a couple of minutes, especially for longer-form speech or noisy environments.</p>
<p>That is the catch with “native” audio support right now: yes, it is integrated, but you still need to babysit it with careful prompts, tuned <a title="" href="https://en.wikipedia.org/wiki/Mini-batch" target="_blank" rel="noopener">microbatch settings</a>, and sometimes a separate <a title="" href="https://en.wikipedia.org/wiki/Voice_activity_detection" target="_blank" rel="noopener">VAD</a> or noise gate on the front.</p>
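<p>For the noise-gate part, even something as crude as a frame-energy gate in front of the model helps; this is a toy stand-in for a real VAD such as Silero or WebRTC VAD, and the frame length and threshold below are illustrative numbers, not tuned values:</p>

```python
def energy_gate(samples, frame_len=400, threshold=0.01):
    """Keep only frames whose mean squared amplitude clears `threshold`.

    `samples` is a list of floats in [-1.0, 1.0]; at 16 kHz, 400 samples
    is a 25 ms frame. A crude stand-in for a proper VAD, not one.
    """
    kept = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy >= threshold:
            kept.extend(frame)  # speech-ish frame: pass it through
    return kept
```

<p>Feeding the model only the frames that survive the gate is exactly the kind of babysitting I mean: it works, but it is one more thing you have to get right yourself.</p>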
<h4>Looking Beyond English And Easy Demos</h4>
<p>There is also the language question, which matters more than benchmarks for many of us. Some folks are getting great Spanish results and claiming clear wins over Whisper, but others point out the weaker coverage for certain languages and specialized phrases. Whisper still has an edge for a lot of Asian languages, while <a title="" href="https://github.com/Qwen/QwenASR" target="_blank" rel="noopener">Qwen</a> <a title="" href="https://en.wikipedia.org/wiki/Automatic_speech_recognition" target="_blank" rel="noopener">ASR</a> and Canary bring their own tradeoffs in speed, latency, and language selection.</p>
<p>If my wife wants to dictate in another language while cooking, I cannot hand her something that silently drops quality the moment she switches tongues. For this to be more than a cool English demo, the multilingual story has to be as strong as the integration story.</p>
<h4>Where This Actually Leaves Us</h4>
<p>So is this good or bad for people who build and run local agents and tools on their own machines? Taken as a whole, it is clearly a net positive for that audience, but it is not “uninstall Whisper and call it a day” territory yet. What we have now is a promising first step: an integrated STT path, decent performance on short clips, and a route to fully local “talk to your model” experiences without spinning up extra services.</p>
<p>The next stretch is going to be all about stability on longer audio, better handling of silence and noise, more robust multilingual behavior, and honest benchmarks that include <a title="" href="https://en.wikipedia.org/wiki/Video_RAM" target="_blank" rel="noopener">VRAM pressure</a> and <a title="" href="https://en.wikipedia.org/wiki/Latency_(computing)" target="_blank" rel="noopener">CPU latency</a>, not just accuracy.</p>
<p>If that work happens, the separate STT server will start feeling like a historical curiosity rather than a necessary evil.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://gigcitygeek.com/2026/04/17/llama-4-gemma-integrated-speech-input/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
