If you've spent any time building, fine-tuning, or deploying Large Language Models (LLMs) lately, you know that the "local development" dream is constantly fighting a war against VRAM. We quantize weights, prune layers, and lean on techniques like FlashAttention just to run decent-sized models on consumer hardware or budget-friendly cloud instances.
But what if the very foundational architecture of the Transformer—the sacred Query-Key-Value (QKV) split that has defined deep learning since 2017—is actually doing more work than it needs to?
A fascinating new research paper, "Do Transformers Need Three Projections? A Systematic Study of QKV Variants," has been making waves on Hacker News. It asks a deceptively simple question: Do we actually need three separate linear projections ($W_q$, $W_k$, $W_v$) to compute attention? Or can we get the same performance with less parameter bloat, fewer matrix multiplications, and faster inference times? Let's dive deep into how attention works under the hood, what this research changes, and what it means for the future of developer tools and local LLMs.
The Status Quo: The Heavy Toll of QKV
To understand why this paper is a big deal, we need to do a quick code-and-math refresher on standard Scaled Dot-Product Attention. In a classic Transformer layer, our input token embeddings ($X$) are multiplied by three distinct weight matrices to produce Queries ($Q$), Keys ($K$), and Values ($V$):
# Classic QKV Projection in PyTorch-like pseudo-code
import torch
import torch.nn as nn

class StandardAttentionProjection(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        # Three separate weight matrices
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        # x shape: [batch_size, seq_len, d_model]
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        return Q, K, V
Once we have $Q$, $K$, and $V$, we compute attention as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
This formulation is elegant, but it is resource-intensive. During inference, and especially during auto-regressive generation, the system has to keep the Key and Value states for every previous token in memory (the KV cache). As context windows scale to 32k, 128k, or even a million tokens, the growing KV cache becomes one of the biggest bottlenecks in LLM serving.
While techniques like Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) have helped compress the $K$ and $V$ heads, they still rely on the fundamental assumption that Query, Key, and Value space must be projected separately from the input. But is that mathematically necessary?
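To put rough numbers on the KV cache, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 heads of dimension 128, fp16) is a hypothetical example chosen for round numbers, not a figure from the paper, and `kv_cache_bytes` is an illustrative helper name.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    One Key and one Value vector (hence the factor 2) is stored
    per token, per KV head, per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape, for illustration only.
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128
SEQ_LEN = 32_768  # 32k-token context

mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)  # every head has its own K/V
gqa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN)        # 8 shared KV heads
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)        # a single shared KV head

for name, size in [("MHA", mha), ("GQA (8 KV heads)", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.2f} GiB")
# MHA: 16.00 GiB
# GQA (8 KV heads): 4.00 GiB
# MQA: 0.50 GiB
```

Note that nothing in this formula depends on how the Queries are produced, which is why GQA and MQA target the number of KV heads.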
The Challenger: Merging and Eliminating Projections
The core premise of the systematic study is to test whether we can share, tie, or outright eliminate some of these projection matrices without degrading the model's perplexity (loss) or downstream zero-shot task performance.
Imagine if we could set $Q = X$ (no query projection at all) or tie $Q$ and $K$ to use the exact same projection matrix. The paper explores several architectures, but two of the most promising variants are:
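- The Dual-Projection Variant (KV-Only or QV-Only): Eliminating one projection outright. In the KV-only case the Query projection ($W_q$) is dropped and the raw input representation $X$ is used directly as the Query. This cuts a third of the Q/K/V projection parameters (a quarter of the attention weights once you count the output projection $W_o$).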
- Shared/Tied Projections: Using the same weight matrix for both Queries and Keys ($W_{qk}$), forcing them to map to a shared semantic space.
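The tied variant from the list above can be sketched in a few lines of PyTorch. This is a minimal single-head illustration without a causal mask; the class name `TiedQKAttention` and the choice to return the raw scores are assumptions made for this example, not the paper's implementation.

```python
import torch
import torch.nn as nn


class TiedQKAttention(nn.Module):
    """Single-head attention where Queries and Keys share one projection.

    Hypothetical sketch: no causal mask, no multi-head split.
    """

    def __init__(self, d_model):
        super().__init__()
        self.W_qk = nn.Linear(d_model, d_model, bias=False)  # shared by Q and K
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = 1.0 / (d_model ** 0.5)

    def forward(self, x):
        # x: [batch, seq_len, d_model]
        QK = self.W_qk(x)  # used as both Q and K
        V = self.W_v(x)
        scores = torch.matmul(QK, QK.transpose(-2, -1)) * self.scale
        probs = torch.softmax(scores, dim=-1)
        return torch.matmul(probs, V), scores


torch.manual_seed(0)
attn = TiedQKAttention(d_model=16)
x = torch.randn(2, 5, 16)
out, scores = attn(x)
print(out.shape)  # torch.Size([2, 5, 16])
```

One consequence worth knowing: with tied projections the pre-softmax score matrix is symmetric, so token i scores token j exactly as j scores i. The row-wise softmax (and a causal mask, if one is applied) breaks that symmetry in the final attention weights.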
Let's look at how we might implement a "No-Query Projection" (NQP) attention block in PyTorch to see how much simpler the code becomes:
class NoQueryAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        # We completely eliminate W_q!
        # We only project Keys and Values.
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        # Single head, so d_k == d_model
        self.scale = 1.0 / (d_model ** 0.5)

    def forward(self, x):
        # x is used directly as the Query (Q)
        Q = x
        # Project K and V as usual
        K = self.W_k(x)
        V = self.W_v(x)
        # Compute attention scores (simplified: no causal mask)
        # Q: [batch, seq_len, d_model]
        # K: [batch, seq_len, d_model]
        attn_weights = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
        attn_probs = torch.softmax(attn_weights, dim=-1)
        # Output representation
        output = torch.matmul(attn_probs, V)
        return output
This is highly intuitive. By treating the residual stream ($X$) directly as the Query, we bypass an entire set of matrix multiplications. But does it actually work in practice?
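For something closer to what a decoder would use, the same idea extends to multiple heads with a causal mask. The sketch below is an assumption-laden illustration rather than the paper's code: the class name `MultiHeadNoQueryAttention` is made up, the output projection `W_o` is kept as in a standard block, and it relies on `torch.nn.functional.scaled_dot_product_attention`, which requires PyTorch 2.0 or newer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadNoQueryAttention(nn.Module):
    """Multi-head causal attention with no W_q (hypothetical sketch)."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)  # assumption: W_o is kept

    def _split_heads(self, t):
        # [batch, seq_len, d_model] -> [batch, n_heads, seq_len, head_dim]
        b, s, _ = t.shape
        return t.reshape(b, s, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        b, s, d = x.shape
        q = self._split_heads(x)            # the input itself, sliced into heads
        k = self._split_heads(self.W_k(x))
        v = self._split_heads(self.W_v(x))
        # Scales by 1/sqrt(head_dim) and applies a causal mask internally
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, s, d)  # merge heads back
        return self.W_o(out)


torch.manual_seed(0)
mh_attn = MultiHeadNoQueryAttention(d_model=32, n_heads=4)
x_in = torch.randn(1, 6, 32)
y_out = mh_attn(x_in)
print(y_out.shape)  # torch.Size([1, 6, 32])
```

Each head simply reads its own slice of the input as its Query, so no extra weights are needed to make the multi-head split work.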
What the Systematic Study Revealed
The researchers didn't just write some toy code; they ran rigorous, controlled training sweeps across different model sizes (from small 110M parameter models up to larger, production-grade scales) and evaluated them on standard NLP benchmarks.
Here are the key takeaways from their findings:
1. You Don't Need Three Projections
The study found that models with modified QKV configurations (specifically those that tied $Q$ and $K$ projections, or eliminated the $Q$ projection entirely) performed almost identically to standard Transformers on downstream tasks. The loss curves were virtually indistinguishable during training.
2. Memory Savings are Real
By dropping the Query projection, we immediately cut the parameter count. In a standard Transformer, the attention projections represent a significant chunk of the model's non-feedforward parameters. Removing $W_q$ allows developers to either:
- Shrink the model's disk and VRAM footprint by up to ~10% (closer to 8% of block weights in a standard layout with a 4x feed-forward network), with little to no accuracy loss according to the study.
- Allocate those "saved" parameters to deeper feed-forward networks (FFN), which have been shown to store factual knowledge more efficiently.
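The savings are easy to check with a little arithmetic. The sketch below assumes a textbook block layout (four $d \times d$ attention matrices including the output projection $W_o$, plus a two-matrix feed-forward network with a 4x expansion) and ignores biases, norms, and embeddings. The width `d = 4096` and the helper `block_params` are hypothetical.

```python
def block_params(d_model, ffn_mult=4, n_attn_proj=4):
    """Weights in one Transformer block, ignoring biases, norms and embeddings.

    n_attn_proj counts the d_model x d_model attention matrices:
    4 for W_q, W_k, W_v, W_o; 3 once W_q is removed.
    """
    attn = n_attn_proj * d_model * d_model
    ffn = 2 * d_model * (ffn_mult * d_model)  # up- and down-projection
    return attn + ffn


d = 4096  # hypothetical model width
standard = block_params(d)
no_query = block_params(d, n_attn_proj=3)
saved = standard - no_query  # exactly one d x d matrix

print(f"share of Q/K/V weights:      {saved / (3 * d * d):.1%}")  # 33.3%
print(f"share of attention weights:  {saved / (4 * d * d):.1%}")  # 25.0%
print(f"share of whole-block weights: {saved / standard:.1%}")    # 8.3%
```

The exact share shifts with the architecture. For example, models that already use GQA have smaller $W_k$ and $W_v$ matrices, so $W_q$ accounts for a larger slice of what remains.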
3. Hardware Efficiency and Throughput
Fewer projections mean fewer matrix multiplication kernels launched on the GPU. During auto-regressive decoding, latency is heavily dominated by memory bandwidth (getting weights from HBM/VRAM to the GPU registers). Reducing the number of weight matrices that need to be loaded at every single step speeds up token generation directly.
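A rough way to reason about this is a bandwidth-only lower bound: if every weight has to be read from memory once per generated token, per-token latency cannot be lower than bytes read divided by memory bandwidth. The sketch below uses made-up numbers (32 layers, width 4096, fp16 weights, 1000 GB/s of bandwidth) and ignores KV cache reads, embeddings, and compute, so treat it as an illustration of the reasoning rather than a benchmark.

```python
def min_decode_ms(n_weight_params, bytes_per_param, bandwidth_gb_s):
    """Lower bound on per-token latency (ms) if every weight is read once per step."""
    bytes_read = n_weight_params * bytes_per_param
    return bytes_read / (bandwidth_gb_s * 1e9) * 1e3


# Hypothetical model and hardware, for illustration only.
D_MODEL, N_LAYERS = 4096, 32
BANDWIDTH_GB_S = 1000

# 12 d^2 weights per standard block (4 attention + 8 FFN), 11 d^2 without W_q
standard_params = N_LAYERS * 12 * D_MODEL**2
no_query_params = N_LAYERS * 11 * D_MODEL**2

t_std = min_decode_ms(standard_params, 2, BANDWIDTH_GB_S)
t_nq = min_decode_ms(no_query_params, 2, BANDWIDTH_GB_S)

print(f"standard: {t_std:.2f} ms/token")         # standard: 12.88 ms/token
print(f"no-query: {t_nq:.2f} ms/token")          # no-query: 11.81 ms/token
print(f"speedup bound: {t_std / t_nq:.3f}x")     # speedup bound: 1.091x
```

Under these assumptions the ceiling on the speedup from dropping $W_q$ alone is about 9%; real-world gains depend on kernel fusion, batch size, and how much time goes to reading the KV cache.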
Why Software Engineers and DevOps Should Care
As software engineers, it's easy to look at academic AI papers and think, "Cool math, but I'll just wait for Hugging Face to wrap it." However, understanding this shift gives us a massive head start on where the ecosystem is heading.
Here is why this matters to developers building real-world applications today:
Better Edge and Mobile Deployment
If you are trying to run LLMs on device (iOS, Android, or local desktops via WebGPU or llama.cpp), every megabyte of model weights matters. Dropping a quarter of the attention weights, or up to ~10% of the model overall, could be the difference between a model fitting comfortably in a phone's unified memory and getting killed by the operating system's out-of-memory (OOM) killer.
Designing Custom Architectures
If you are pre-training niche, domain-specific models (e.g., for log analysis, code generation, or genomic sequencing), adopting a two-projection attention mechanism can trim your AWS/GCP training bill. If the study's results hold at your scale, you get quality comparable to a traditional Transformer with fewer parameters and fewer matrix multiplications per layer.
The Optimization of the KV Cache
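One caveat: dropping $W_q$ does not shrink the KV cache by itself, because Keys and Values are still stored for every token. The cache savings come from techniques like Multi-Query Attention (MQA) and GQA. Combined with them, stripped-down QKV projections point toward leaner long-context serving. We are moving away from the brute-force architecture of the original Transformer toward more carefully optimized attention mechanisms.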
Looking Ahead: The Shrinking Transformer
The history of deep learning is a history of simplification. We started with complex, hand-crafted feature engineering, which was replaced by deep networks. In deep networks, we are now realizing that many of our "hard requirements"—like having three distinct projection matrices for attention—were just intuitive design choices rather than mathematical laws.
As we look to the next generation of open-source models (perhaps Llama-4 or future Mistral architectures), don't be surprised if the classic QKV layout becomes an artifact of the past, replaced by more efficient dual-projection (KV-only or tied-QK) or even single-projection designs.
What are your thoughts? Are you running into VRAM limitations with your current local LLM setups? Would you sacrifice a tiny fraction of perplexity for a 15% speedup in inference? Let me know in the comments below, or jump over to the Sysseder Discord to chat about it!
Keep coding, keep optimizing, and I'll see you in the next post!