<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Decoding &#8211; Gig City Geek</title>
	<atom:link href="https://gigcitygeek.com/tag/decoding/feed/" rel="self" type="application/rss+xml" />
	<link>https://gigcitygeek.com</link>
	<description></description>
	<lastBuildDate>Wed, 06 May 2026 15:25:29 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://gigcitygeek.com/wp-content/uploads/2026/01/cropped-GigCityGeek_Logo-32x32.png</url>
	<title>Decoding &#8211; Gig City Geek</title>
	<link>https://gigcitygeek.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>LLM Speed: Domestic Chaos and Hardware Bottlenecks</title>
		<link>https://gigcitygeek.com/2026/05/08/llm-speed-gpu-bottlenecks-mtp-decoding/</link>
					<comments>https://gigcitygeek.com/2026/05/08/llm-speed-gpu-bottlenecks-mtp-decoding/#respond</comments>
		
		<dc:creator><![CDATA[Laronski]]></dc:creator>
		<pubDate>Fri, 08 May 2026 13:00:00 +0000</pubDate>
				<category><![CDATA[AI Service]]></category>
		<category><![CDATA[Hardware]]></category>
		<category><![CDATA[Autoregressive Decoding]]></category>
		<category><![CDATA[Bandwidth]]></category>
		<category><![CDATA[Decoding]]></category>
		<category><![CDATA[gpu]]></category>
		<category><![CDATA[Large Language Models]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[MTP]]></category>
		<category><![CDATA[Multi Token Prediction]]></category>
		<category><![CDATA[streaming]]></category>
		<guid isPermaLink="false">https://gigcitygeek.com/?p=3767</guid>

					<description><![CDATA[Discover how LLM processing impacts home networks! Explore the challenges of GPU bandwidth, streaming interruptions, and the promise of speculative decoding ...]]></description>
										<content:encoded><![CDATA[<p>I was sitting at my desk last night, watching <a href="https://www.gemma.no/" target="_blank" rel="noopener">Gemma</a> 4 31B chew through a reply at about 10 <a href="https://en.wikipedia.org/wiki/Tokenization" target="_blank" rel="noopener noreferrer">tokens</a> per second, when my son walked in to ask if he could queue another download on Steam. My <a href="https://en.wikipedia.org/wiki/Graphics_processing_unit" target="_blank" rel="noopener noreferrer">GPU</a> fans were already loud enough that my wife yelled from the living room asking if “the spaceship” was about to take off again. That was the moment I realized how much of our house now orbits one weird thing: how fast an <a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener noreferrer">LLM</a> can finish a sentence.</p>
<h4>Multi Token Prediction feels like cheating on that problem.</h4>
<p>In my house, LLMs have real domestic consequences. If I am running a big model on the GPU, my son’s game pings go to trash, and my wife’s streaming apps start buffering. Traditional <a href="https://www.assemblyai.com/blog/autoregressive-decoding-explained/" target="_blank" rel="noopener noreferrer">autoregressive decoding</a> is part of the reason. The model predicts one token, waits on memory, predicts the next, waits again. Modern hardware has a ton of compute, but memory bandwidth plays goalie and slows everything down. The hardware sits around like that student who finishes the homework early, staring out the window, waiting for the next assignment.</p>
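<p>To make that serial dependency concrete, here is a toy sketch of the one-token-at-a-time loop. The “model” is just a stand-in function, not any real API; the point is that step N+1 cannot start until step N has finished, and in a real model every step drags the full set of weights through memory again.</p>
<pre><code># Toy sketch of autoregressive decoding: one token per full pass,
# strictly serial. "next_token" stands in for a real forward pass.
def next_token(context):
    # Pretend forward pass: deterministic "next" token id.
    return (sum(context) * 31 + 7) % 262_144

def decode(prompt_ids, steps=20):
    ids = list(prompt_ids)
    for _ in range(steps):           # step N+1 waits for step N
        ids.append(next_token(ids))  # real models re-read all weights here
    return ids

print(decode([1, 2, 3], steps=5))
</code></pre>
<p>That loop is the bored student from above: plenty of compute on hand, nothing to do between assignments.</p>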
<p><a href="https://www.assemblyai.com/blog/speculative-decoding-explained/" target="_blank" rel="noopener noreferrer">speculative decoding</a> and MTP basically hand that student a stack of “probably next” homework pages so they do not get bored.</p>
<h4>Why Multi Token Prediction Actually Matters</h4>
<p>Here is how I think about it when I am at my desk trying to squeeze one more model into <a href="https://www.techtarget.com/whatis/definition/VRAM" target="_blank" rel="noopener noreferrer">VRAM</a>. With standard speculative decoding, you run a small draft model a few tokens ahead, then let the big model verify those guesses in parallel. If the guesses line up with what the main model would have said anyway, you keep them and jump forward. If not, you toss the bad guesses and fall back to normal decoding for that step. Same quality, less wasted idle time.</p>
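<p>Here is a toy version of that accept-or-toss loop, using the greedy-verification flavor where draft guesses are kept only while they match what the main model would have produced. Both “models” below are stand-in functions; real stacks verify all the draft positions in one batched forward pass and, when sampling, use a probabilistic accept/reject rule instead of exact matching.</p>
<pre><code># Toy greedy-verification sketch: the draft proposes k tokens, the target
# checks them, and we keep the longest matching prefix. Both "models"
# here are stand-ins, not real APIs.
def target_next(context):
    return (sum(context) * 31 + 7) % 1000

def draft_next(context):
    # A cheap drafter that agrees with the target most of the time.
    guess = (sum(context) * 31 + 7) % 1000
    return guess if sum(context) % 5 else (guess + 1) % 1000

def speculative_step(ids, k=4):
    # 1) Draft k tokens serially (cheap model, fast).
    draft = []
    for _ in range(k):
        draft.append(draft_next(ids + draft))
    # 2) Verify the draft positions with the big model (one batched pass
    #    in a real stack; a plain loop here for clarity).
    accepted = []
    for tok in draft:
        if target_next(ids + accepted) == tok:
            accepted.append(tok)   # guess matches: keep it, jump ahead
        else:
            break                  # first mismatch: discard the rest
    # 3) Always emit one token from the target so progress is guaranteed.
    accepted.append(target_next(ids + accepted))
    return ids + accepted

ids = [1, 2, 3]
for _ in range(5):
    ids = speculative_step(ids)
print(ids)
</code></pre>
<p>The committed text is exactly what plain greedy decoding with the big model would have produced; the only thing that changes is how many big-model passes it takes to get there.</p>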
<p>Gemma 4’s MTP drafters are built exactly for that pattern. Google shipped tiny specialist models, like that 78M draft for the E2B variant, that sit alongside the main Gemma 4 checkpoints. When wired into a speculative decoding pipeline, they can almost double decoding speed while keeping output identical to “vanilla” generation. For me that is a net positive, because it improves latency without turning my prompts into some lossy “turbo” mode.</p>
<p>The cool twist is how Gemma leans on its tokenizer.</p>
<h4>Why Tiny Draft Models Can Punch Above Their Weight</h4>
<p>A lot of people in our scene still assume you need hundreds of millions of parameters just to get anything useful. The Gemma 4 MTP release quietly argues the opposite. Google invested in a huge, well trained tokenizer: 262k vocabulary, compared to 32k in Llama 2 and 128k in Llama 3. That vocabulary means each token carries more semantic weight, so both the main model and the tiny draft model spend their parameters more efficiently.</p>
<p>So when people on Reddit get excited about a 78M draft being “cute,” they are not wrong. That small safetensor is leaning on a tokenizer that is doing the heavy lifting. Some folks even estimate that the tokenizer stack itself behaves like it has billions of “effective” parameters in how it carves up text. In practice, what I care about is simple: fewer tokens, more meaning, less time waiting for the bar to crawl across the screen.</p>
<p>That is exactly what matters on a phone with 6 GB of RAM or a cramped desktop where the GPU already has to share space with games.</p>
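<p>To put rough numbers on “fewer tokens, more meaning”: assume a 300 word reply and two made-up tokens-per-word rates, one for a small vocabulary and one for a large one, at the 10 tokens per second I was getting at my desk. The rates are assumptions for illustration, not measurements of any specific tokenizer.</p>
<pre><code># Back-of-the-envelope only: tokens-per-word rates are assumed values,
# not measured numbers for any particular tokenizer.
reply_words     = 300   # a longish chat reply
decode_rate_tps = 10    # tokens per second, from the setup above
tok_per_word    = {"small-vocab tokenizer": 1.6, "large-vocab tokenizer": 1.2}

for name, rate in tok_per_word.items():
    tokens  = reply_words * rate
    seconds = tokens / decode_rate_tps
    print(f"{name}: {tokens:.0f} tokens, about {seconds:.0f} s of decoding")
</code></pre>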
<h4>The Real Tradeoffs Hiding Behind The Hype</h4>
<p>Of course there is a catch, and I feel it every time my wife asks why the PC fans spin up when I “just open a chat.” Drafting spends more compute to win back time. You run two models, or at least two heads, which means more memory and more power draw. Some of that compute is wasted when draft tokens get rejected. If I cared more about energy efficiency or packing maximum concurrency into a server, I might skip speculative decoding entirely and just batch requests.</p>
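<p>You can put a rough number on that wasted work with the usual back-of-the-envelope for speculative decoding: if the main model accepts each draft token with probability alpha and the drafter proposes k tokens per verification pass, the expected number of committed tokens per big-model pass is (1 - alpha^(k+1)) / (1 - alpha). The alpha and k below are assumed values, not measurements from my setup.</p>
<pre><code># Rough model of the tradeoff: alpha (acceptance rate) and k (draft
# length) are assumptions, not measured numbers.
alpha = 0.8   # assumed per-token acceptance rate
k     = 4     # draft tokens proposed per verification pass

# Expected tokens committed per big-model pass (includes the bonus token).
expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)

# Draft work thrown away when guesses get rejected.
wasted_draft_tokens = k - (expected_tokens - 1)

print(f"tokens per big-model pass: {expected_tokens:.2f}")
print(f"draft tokens wasted per pass: {wasted_draft_tokens:.2f}")
</code></pre>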
<p>At home, though, I am usually running a single context. No batching, no clients, just me grilling the model while my son tries not to lag out. In that setup, moving from memory bound to compute bound is exactly what I want.</p>
<h4>Why This Feels Like A Turning Point</h4>
<p>What makes Gemma 4 MTP interesting is not only the speedup. It is that these drafters are being wired into real stacks: transformers, vLLM, Ollama, MLX, and soon llama.cpp through that pending pull request. Once MTP is baked directly into a single GGUF, with shared KV and smart offloading, the friction goes away. At that point I can drop one file into my models folder and suddenly my “old” hardware feels new again.</p>
<p>For my house, that means fewer complaints from my wife, fewer dropped frames for my son, and faster replies for me when I am hacking prompts late at night.</p>
<p>In other words, Gemma 4’s MTP setup is a clear net positive for anyone living on the edge of their hardware limits.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://gigcitygeek.com/2026/05/08/llm-speed-gpu-bottlenecks-mtp-decoding/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
