<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>llama.cpp &#8211; Gig City Geek</title>
	<atom:link href="https://gigcitygeek.com/tag/llama-cpp/feed/" rel="self" type="application/rss+xml" />
	<link>https://gigcitygeek.com</link>
	<description></description>
	<lastBuildDate>Mon, 25 May 2026 18:55:49 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://gigcitygeek.com/wp-content/uploads/2026/01/cropped-GigCityGeek_Logo-32x32.png</url>
	<title>llama.cpp &#8211; Gig City Geek</title>
	<link>https://gigcitygeek.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Ditch the AI Bloat: How to Easily Set Up Llama.cpp for Raw Speed</title>
		<link>https://gigcitygeek.com/2026/06/04/how-to-setup-llama-cpp-for-local-ai/</link>
					<comments>https://gigcitygeek.com/2026/06/04/how-to-setup-llama-cpp-for-local-ai/#respond</comments>
		
		<dc:creator><![CDATA[Laronski]]></dc:creator>
		<pubDate>Thu, 04 Jun 2026 13:00:00 +0000</pubDate>
				<category><![CDATA[Privacy]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[AI Models]]></category>
		<category><![CDATA[command line]]></category>
		<category><![CDATA[llama.cpp]]></category>
		<category><![CDATA[LLM performance]]></category>
		<category><![CDATA[lm studio]]></category>
		<category><![CDATA[local AI]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[self-hosting]]></category>
		<category><![CDATA[tech tutorial]]></category>
		<category><![CDATA[web ui]]></category>
		<guid isPermaLink="false">https://gigcitygeek.com/?p=3949</guid>

					<description><![CDATA[Stop letting bloated wrappers gatekeep your local LLM performance. Learn how to easily set up llama.cpp for raw speed and ultimate privacy.]]></description>
										<content:encoded><![CDATA[<p>Fellow vibe coders, we’ve all been there: staring at a local AI setup guide, wanting the ultimate privacy of <a href="https://gigcitygeek.com/2026/05/07/plex-server-unauthorized-streaming-security/" target="_blank" rel="noopener noreferrer">self-hosting</a>, but backing away the moment we see a <a href="https://www.codecademy.com/article/command-line-commands" target="_blank" rel="noopener noreferrer">command-line instruction</a>. For months, I clung to <a href="https://en.wikipedia.org/wiki/Le_Studio" target="_blank" rel="noopener noreferrer">LM Studio</a> and <a href="https://ollama.com/">Ollama</a> like a safety blanket because they had pretty interfaces and didn&#8217;t make me feel like an imposter. My wife, who expects our home tech to just work like the toaster, already looks at me like I&#8217;m the <a href="https://en.wikipedia.org/wiki/Unabomber" target="_blank" rel="noopener noreferrer">Unabomber</a> whenever I open a terminal.</p>
<p>But those bloated wrappers we use to keep things simple are quietly gatekeeping the best features of modern <a href="https://en.wikipedia.org/wiki/LLMs" target="_blank" rel="noopener noreferrer">LLMs</a>. If you want to stop leaving performance on the table, it&#8217;s time to look under the hood.</p>
<h4>Bypassing the Gatekeepers</h4>
<p>I always assumed <a href="https://en.wikipedia.org/wiki/Llama.cpp" target="_blank" rel="noopener noreferrer">llama.cpp</a> was reserved for the elite developers who <a href="https://www.reddit.com/r/explainlikeimfive/comments/233dq5/eli5_what_does_it_mean_to_compile_code/" target="_blank" rel="noopener noreferrer">compile code</a> in their sleep. It turns out that’s just a myth we tell ourselves to stay comfortable. You don’t need a <a href="https://en.wikipedia.org/wiki/Computer_science" target="_blank" rel="noopener noreferrer">computer science</a> degree; you just need to download a prebuilt file, unzip it, and run a single command.</p>
<p>Suddenly, you have a clean <a href="https://en.wikipedia.org/wiki/Web_UI" target="_blank" rel="noopener noreferrer">web UI</a> running directly in your <a href="https://en.wikipedia.org/wiki/Browser" target="_blank" rel="noopener noreferrer">browser</a> without the <a href="https://en.wikipedia.org/wiki/Middleman" target="_blank" rel="noopener noreferrer">middleman</a>. And let&#8217;s be honest, cutting out the middleman is a <a href="https://en.wikipedia.org/wiki/Project_manager" target="_blank" rel="noopener noreferrer">project manager</a>&#8217;s dream.</p>
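<p>For anyone who would rather script that single command than type it, here is a minimal Python sketch. It assumes the unzipped release contains a <code>llama-server</code> binary and that port 8080 is free on the local machine; the model filename is a hypothetical placeholder, so check <code>llama-server --help</code> for the options your build actually supports.</p>
<pre><code class="language-python">from typing import List


def server_command(model_path: str, port: int = 8080) -> List[str]:
    """Build the argument list for starting llama-server on this machine only."""
    return [
        "llama-server",          # assumed name of the binary in the unzipped folder
        "-m", model_path,        # path to a GGUF model file you downloaded
        "--host", "127.0.0.1",   # listen on localhost only
        "--port", str(port),
    ]


def web_ui_url(port: int = 8080) -> str:
    """Address to open in the browser once the server is running."""
    return f"http://127.0.0.1:{port}/"


# Hypothetical model filename, for illustration only.
command = server_command("models/example-model.gguf")
print(" ".join(command))
print(web_ui_url())
</code></pre>
<p>Passing the list to <code>subprocess.run(command)</code> would start the server; the sketch only prints it so nothing launches by accident.</p>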
<h4>Raw Speed and Less Bloat</h4>
<p>Because llama.cpp is the actual engine powering those flashy desktop apps, running it directly cuts out massive <a href="https://en.wikipedia.org/wiki/Overhead" target="_blank" rel="noopener noreferrer">overhead</a>. On my <a href="https://en.wikipedia.org/wiki/Ryzen" target="_blank" rel="noopener noreferrer">Ryzen 9</a> mini PC, the speed increase was immediately noticeable, clocking in at about 15% faster <a href="https://it-tools.tech/token-generator" target="_blank" rel="noopener noreferrer">token generation</a>.</p>
<p>My son, who measures his self-worth in gaming <a href="https://en.wikipedia.org/wiki/Frame_rate" target="_blank" rel="noopener noreferrer">frame rates</a>, tried to lecture me on <a href="https://en.wikipedia.org/wiki/VRAM" target="_blank" rel="noopener noreferrer">VRAM</a> allocation, but I was too busy enjoying the instant responses.</p>
<p>You get the absolute latest model updates instantly because you aren&#8217;t waiting on a third-party app developer to package them. This means less lag and more actual <a href="https://en.wikipedia.org/wiki/Productivity" target="_blank" rel="noopener noreferrer">productivity</a>.</p>
<h4>Unlocking the Real Power</h4>
<p>The breaking point for me was trying to run <a href="https://en.wikipedia.org/wiki/Gemma" target="_blank" rel="noopener noreferrer">Gemma</a> 4 E4B to test its native <a href="https://learn.microsoft.com/en-us/windows/win32/directshow/audio-capabilities" target="_blank" rel="noopener noreferrer">audio capabilities</a>, only to find LM Studio completely ignored the feature. With llama.cpp, not only did the audio analysis work flawlessly, but the switch also fixed the annoying &#8220;<a href="https://www.studocu.com/en-us/document/keiser-university/basic-adult-health-care/gi-bleed-hypovolemic-shock-rapid-reasoning-keith-rn/10497195" target="_blank" rel="noopener noreferrer">reasoning bleed</a>&#8221; bug where models mix their <a href="https://en.wikipedia.org/wiki/Thinking_process" target="_blank" rel="noopener noreferrer">thinking process</a> with the final answer.</p>
<p>Now, the thoughts are tucked away in a neat, collapsible box.</p>
<p>It’s like finally driving a sports car out of <a href="https://www.gatsbyvalet.com/what-is-valet-mode-what-it-means-for-vehicle-security/" target="_blank" rel="noopener noreferrer">valet mode</a>. <a href="https://en.wikipedia.org/wiki/Image_analysis" target="_blank" rel="noopener noreferrer">Image analysis</a> and <a href="https://en.wikipedia.org/wiki/System_prompt" target="_blank" rel="noopener noreferrer">system prompts</a> work seamlessly, giving you a highly customized workspace.</p>
<h4>Dialing in the Perfect Flow</h4>
<p>Beyond the speed, the granular controls are where this transition really pays off for daily productivity. Features like <a href="https://en.wikipedia.org/wiki/DRY" target="_blank" rel="noopener noreferrer">DRY</a> (Don&#8217;t Repeat Yourself) prevent the model from getting stuck in repetitive loops without making its vocabulary sound unnatural.</p>
<p>It makes the local <a href="https://en.wikipedia.org/wiki/Chatbot" target="_blank" rel="noopener noreferrer">chatbot</a> feel less like a robotic script and more like a sharp assistant. If you’re ready to stop compromising on your local AI <a href="https://en.wikipedia.org/wiki/Workflow" target="_blank" rel="noopener noreferrer">workflow</a>, make the jump.</p>
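<p>For the curious, the idea behind DRY can be sketched in a few lines of Python. This is a simplified illustration and not llama.cpp&#8217;s actual code: it ignores sequence breakers, works on plain integer token IDs, and the default values for <code>multiplier</code>, <code>base</code>, and <code>allowed_length</code> are assumptions chosen for the example. The penalty for a candidate token grows exponentially with the length of the repeated run it would extend.</p>
<pre><code class="language-python">from typing import Dict, List


def dry_penalties(
    context: List[int],
    multiplier: float = 0.8,
    base: float = 1.75,
    allowed_length: int = 2,
) -> Dict[int, float]:
    """Map each candidate next token to the amount to subtract from its logit."""
    penalties: Dict[int, float] = {}
    n = len(context)
    for i in range(n):
        candidate = context[i]  # this token already appeared, at position i
        # Count how many tokens just before position i match the end of the context.
        match_len = 0
        for k in range(i):
            if context[i - 1 - k] != context[n - 1 - k]:
                break
            match_len += 1
        # Short overlaps are allowed; longer ones are penalized exponentially.
        if match_len >= allowed_length:
            penalty = multiplier * base ** (match_len - allowed_length)
            penalties[candidate] = max(penalties.get(candidate, 0.0), penalty)
    return penalties


# The context ends in 1, 2, 3, which was followed by 4 the first time around,
# so 4 is the only token that would continue a repeat.
print(dry_penalties([1, 2, 3, 4, 1, 2, 3]))
</code></pre>
<p>With these example values, the three-token repeat exceeds the allowed length by one, so token 4 gets a penalty of about 1.4 while every other token is left alone.</p>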
]]></content:encoded>
					
					<wfw:commentRss>https://gigcitygeek.com/2026/06/04/how-to-setup-llama-cpp-for-local-ai/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Local AI Models: A Shift in Workflow</title>
		<link>https://gigcitygeek.com/2026/04/16/gemma-4-local-ai-performance/</link>
					<comments>https://gigcitygeek.com/2026/04/16/gemma-4-local-ai-performance/#respond</comments>
		
		<dc:creator><![CDATA[Laronski]]></dc:creator>
		<pubDate>Thu, 16 Apr 2026 13:00:00 +0000</pubDate>
				<category><![CDATA[AI Service]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[ai]]></category>
		<category><![CDATA[AI assistant]]></category>
		<category><![CDATA[gemma]]></category>
		<category><![CDATA[Gemma 4]]></category>
		<category><![CDATA[Large Language Model]]></category>
		<category><![CDATA[llama.cpp]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[lm studio]]></category>
		<category><![CDATA[local AI]]></category>
		<category><![CDATA[performance]]></category>
		<guid isPermaLink="false">https://gigcitygeek.com/?p=3630</guid>

					<description><![CDATA[Discover a surprising shift in AI workflows! Moving from cloud models to Gemma 4 on local hardware reveals remarkable responsiveness and speed. Experience a ...]]></description>
										<content:encoded><![CDATA[<p>The moment I realized something had shifted was when I caught myself reaching for my <a href="https://en.wikipedia.org/wiki/Large_language_model" title="" target="_blank" rel="noopener">local model</a> instead of a cloud tab, almost on reflex. I was on the couch with my laptop, my son in the next room yelling at a game, and I had one of those annoying “this needs code, context, and a web search” problems from work. Normally that is a straight trip to <a href="https://www.anthropic.com/product/claude" title="" target="_blank" rel="noopener">Claude</a> or <a href="https://www.google.com/gemini/" title="" target="_blank" rel="noopener">Gemini</a>.</p>
<p>This time I pointed my editor at Gemma 4 on my modest box and just waited to see if it fell over. It did not. It acted like a real assistant instead of a fun toy.</p>
<p>What struck me was not raw tokens per second but how little time it wasted thinking. I had <a href="https://huggingface.co/Qwen" title="" target="_blank" rel="noopener">Qwen</a> 3.5 27B and 35B set up before, and while the quality is excellent, you can feel it grind through long chains of thought on fairly simple prompts.</p>
<p>Gemma 4, especially the 26B A4B variants people are running through <a href="https://github.com/ggerganov/llama.cpp" title="" target="_blank" rel="noopener">llama.cpp</a> and <a href="https://lmstudio.ai/" title="" target="_blank" rel="noopener">LM Studio</a>, feels like a high‑strung lawyer who reads fast, decides fast, and just answers. On mid‑range consumer hardware, having that kind of responsiveness from a local agent is a net positive for anyone trying to get real work done without renting someone else’s <a href="https://en.wikipedia.org/wiki/Graphics_processing_unit" title="" target="_blank" rel="noopener">GPU</a>.</p>
<h4>Mixed Signals, Real Tradeoffs</h4>
<p>Of course, the picture is not clean. If you read through enough user reports, you see two parallel realities: on one side, people on M1/M4 or tuned <a href="https://developer.nvidia.com/cuda-zone" title="" target="_blank" rel="noopener">CUDA</a> setups talking about blazing speeds, solid tool use, and 128k‑context coding sessions; on the other, folks stuck in endless tool‑call loops, bad argument schemas, and memory leaks that eat 100 gigabytes for breakfast. That is the price of living at the intersection of new <a href="https://en.wikipedia.org/wiki/Mixture_of_Experts" title="" target="_blank" rel="noopener">MoE</a> architectures, half‑baked frontends, and ever‑shifting chat templates.</p>
<p>Gemma 4 clearly has some temperament when it comes to <a href="https://www.promptingguide.ai/tools/tool-use" title="" target="_blank" rel="noopener">tool calling</a>; Qwen 3.5 often feels more stable and predictable there, especially with complex editing workflows in Zed or <a href="https://github.com/microsoft/copilot" title="" target="_blank" rel="noopener">Copilot</a>-style harnesses.</p>
<p>Where Gemma 4 shines is the “good enough across everything” band. People are using it for <a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation" title="" target="_blank" rel="noopener">GDPR</a> adversarial letters, translation, light coding, MCP tools, even life organization and email triage. It can roleplay, it can chat naturally, it can do basic vision tasks, and it respects instructions more often than not.</p>
<p>Qwen is still the heavyweight for deep context and large multi‑file refactors, but Gemma gives you something a lot closer to a generalist colleague living entirely on your desk.</p>
<h4>Tools, Templates, And The Human In The Loop</h4>
<p>What has become obvious to me is that half of the “Gemma is broken” versus “Gemma changed my life” divide comes down to scaffolding. People who keep llama.cpp or <a href="https://github.com/vllm/vllm" title="" target="_blank" rel="noopener">vLLM</a> up to date, use the current Google or Unsloth chat templates, and accept a slightly slower, more conservative <a href="https://en.wikipedia.org/wiki/Sampling_(statistics)" title="" target="_blank" rel="noopener">sampling config</a> tend to report stable behavior.</p>
<p>Those who jam it into old runtimes or mismatch templates with aggressive tool‑calling setups get stuck in loops and think the model is dumb. That is not unique to Gemma, but it is amplified by how strongly it leans on <a href="https://en.wikipedia.org/wiki/Prompt_(computing)" title="" target="_blank" rel="noopener">system prompts</a> and tool schemas to decide when to think and when to act.</p>
<p>At home, that distinction is obvious even outside of work. My wife uses a small 1B helper model wired into the same stack just for naming chats, summarizing web search, and cleaning up emails, while I wake the “big” Gemma only when the task actually needs it. She does not care about MoE routing or <a href="https://en.wikipedia.org/wiki/Quantization_(signal_processing)" title="" target="_blank" rel="noopener">Q4 quantization</a>; she just notices that the assistant answers fast and does not freeze her machine.</p>
<p>That is the line local models have to cross to matter: they stop being a hobby and start being invisible infrastructure.</p>
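<p>That split between a small helper and a big model can be sketched as a simple routing rule. The task labels and model names below are hypothetical placeholders, not taken from any real configuration; the point is only that cheap, frequent chores go to the small model by default.</p>
<pre><code class="language-python"># Chores that a small helper model can handle; labels are illustrative.
LIGHT_TASKS = {"name_chat", "summarize_search", "clean_email"}

# Hypothetical model identifiers; substitute whatever your stack exposes.
SMALL_MODEL = "helper-1b"
BIG_MODEL = "gemma-4-26b-a4b"


def pick_model(task: str) -> str:
    """Send light chores to the small model and everything else to the big one."""
    return SMALL_MODEL if task in LIGHT_TASKS else BIG_MODEL


for task in ("name_chat", "clean_email", "refactor_code"):
    print(task, "->", pick_model(task))
</code></pre>
<p>Keeping the rule this blunt is deliberate: the big model only wakes up for work that is not on the short list of chores.</p>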
<h4>Where This Actually Leaves Us</h4>
<p>If I step back and look at the whole thread of experiences, I would still classify Gemma 4 as a net positive for the local‑LLM crowd. It is not strictly better than Qwen 3.5 on quality, especially for vision and huge codebases, and some of the tool‑calling behavior genuinely needs work. But for many people running 3060‑class GPUs, M‑series Macs, or small Strix Halo boxes, Gemma 4 is the first time “local only” feels like a reasonable default instead of a compromise you make out of principle.</p>
<p>The most interesting part is not that it wins any single benchmark, but that it narrows the comfort gap with cloud models to the point that you can realistically mix and match: Gemma 4 locally for everyday coding, writing, and search, Qwen or a cloud model for the rare monster task.</p>
<p>If you care about privacy, latency, or just owning your own tools, that quiet shift might be the biggest story hiding in all those Reddit comments.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://gigcitygeek.com/2026/04/16/gemma-4-local-ai-performance/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
