The Concrete Will Take Longer
A small paper about random rotations, a large paper about concrete in data centers, and the quiet argument between them.
There is a poured concrete slab in Loudoun County, Virginia, that is going to eat your electricity bill.
I do not mean this metaphorically, or at least not only metaphorically. The slab is real. It sits behind a chain-link fence, next to a plywood sign with a construction company's logo on it. There is no building yet. There are no transformers, no switchgear, no servers. The press release said the facility would open this year. The grid connection from Dominion Energy is not scheduled until 2029. You can do the math yourself, or you can trust me that it doesn't work out. The slab is one of roughly 140 such projects in the United States right now that announced 2026 opening dates and, as of this month, haven't actually started being buildings.1
Meanwhile, in a different genre of document, a paper called TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate was posted to arXiv last April and quietly presented at the International Conference on Learning Representations last month in Brazil.2 Seventeen pages. Four pages of method, the rest proofs and experiments. A hundred-odd lines of implementable idea, written in the formal mathematical register that academic computer science uses when it wants to be taken seriously by journals and by itself. The paper's whole argument is that you can compress the memory used during AI workloads by a factor of four or five without losing anything anyone can measure. You multiply a vector by a random rotation matrix, then look up some precomputed numbers, then add a sign bit. That's it. That's the paper.
These two documents (the slab and the preprint) are, I want to argue, in a conversation with each other that almost nobody is conducting on their behalf. It is a quiet conversation, and an expensive one, and a substantial percentage of the readers of this blog are silent participants in it whether or not they know it.
Let me try to explain why.
I. The thing in the middle of everything
Modern large language models work by generating one token at a time, and each token has to look at every previous token in the conversation. To avoid recomputing the same attention scores over and over, the model stashes the intermediate calculations in a thing called the KV cache. "KV" stands for "key and value," two of the three main objects in the attention layers of a transformer, the architecture underneath large language models like ChatGPT and Claude.3 The cache is exactly what it sounds like: the model's working memory for "what have we said so far."
The frontier models of the moment, Claude Opus 4.7 and Gemini 3.1 Pro, both advertise context windows of one million tokens. That is roughly 1,500 pages of text. At that scale, a single long-context request produces a KV cache measured in hundreds of gigabytes of high-bandwidth memory (HBM).4 If you're doing the math: a single H100 (Nvidia's workhorse data-center GPU) has 80 gigabytes of it. One user, one long prompt, several GPUs ganged together, before the model weights have even loaded. Other concurrent users still need memory. The model itself still needs memory. The cache just keeps growing, linearly with context length, inexorably, forever, until the GPUs run out of room and the request is either queued or dropped or silently truncated in the middle of your document, which is the part of the product experience you've probably noticed mid-conversation and the marketing materials do not emphasize.
This is the defining constraint of modern inference, and it is a memory problem.5 The GPUs have plenty of floating-point compute sitting idle. Every token you generate requires reading the entire KV cache from HBM into the compute units, and HBM bandwidth is finite. Inference is memory-bound; the bottleneck is bandwidth, and the thing being moved through the pipe is mostly the KV cache. The entire economic structure of modern AI (the price per token, the cost of a long-context query, the reason Claude and GPT cost what they cost) is a function of how expensive it is to shuffle KV caches around.
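You can put a rough number on that ceiling. Below is the back-of-the-envelope in Python, using the illustrative million-token cache from footnote 4 and the H100's published memory bandwidth; real deployments batch, shard, and overlap, so treat it as the shape of the constraint, not anyone's production number:

```python
# If every generated token re-reads the whole KV cache from HBM,
# memory bandwidth sets a hard ceiling on tokens per second.
cache_gb = 260          # illustrative 1M-token cache (see footnote 4)
hbm_tb_per_s = 3.35     # H100 SXM HBM3 bandwidth

seconds_per_token = cache_gb / (hbm_tb_per_s * 1000)
print(f"{seconds_per_token * 1000:.0f} ms/token, "
      f"ceiling of ~{1 / seconds_per_token:.0f} tokens/s")
# -> ~78 ms/token, ~13 tokens/s, with the arithmetic units mostly
# idle. (A cache this size doesn't even fit on one card, which is
# the other half of the problem.) Shrink the cache 4x and the
# ceiling rises 4x. Nothing else about the hardware has to change.
```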
So when you read that Microsoft is spending $145 billion on capex this year, or that Amazon is planning $200 billion, or that the five largest hyperscalers are collectively guiding to something like $750 billion in 2026 infrastructure spending (up from about $450 billion in 2025, which was itself up about 60% from the year before), a meaningful fraction of that money is, functionally, a vote on the proposition that KV caches are going to stay large.6
So, training is expensive. Inference of non-cached workloads is expensive. Model size is expensive. But the shape of modern inference (which has already surpassed training as the dominant share of AI compute, and is expected to keep expanding its lead through the end of the decade) is dominated by attention over long contexts, and attention over long contexts is dominated by KV cache memory pressure.7 The concrete is being poured against that number.
And that number is the thing TurboQuant just cut by a factor of four.
II. A brief and incomplete taxonomy of things that are happening
It helps to see all the numbers in one place, because no single number captures it and the specific vertigo of the moment is in the aggregation.
The spending. Microsoft, Google, Amazon, Meta, and Oracle: roughly $660–$750 billion in combined 2026 capex, ~75% of it explicitly AI-related. That's about $500 billion of hardware, buildings, and grid upgrades in a single calendar year, from five companies.8 For context, the entire US electric utility industry (generation, transmission, distribution) invested about $160 billion in 2024. The hyperscalers are now outspending the utilities that power them by roughly three to one.
The chips. Nvidia is projected to produce something like 6.89 million Blackwell-300-equivalent units in 2026. Blackwell production is sold out through the middle of the year, with a reported backlog of 3.6 million units.9 Taiwan Semiconductor Manufacturing Company (TSMC) has booked over half of its 2026–27 Chip-on-Wafer-on-Substrate (CoWoS) packaging capacity for Nvidia alone.
The company. Nvidia trades, as I write this, at about $200 per share, with a market capitalization just under $5 trillion. This is more than any other company in the history of public markets. It is, as of today, roughly double the market cap of the entire energy sector of the S&P 500. In January 1999, Nvidia was worth $562 million.
The grid. US data centers used about 183 terawatt-hours in 2024, roughly 4% of national consumption, roughly equivalent to all of Pakistan. The International Energy Agency's global forecast puts data center consumption near 945 TWh by 2030, which would make data centers, if they were a country, a top-five national energy consumer, sitting between Japan and Russia.10 For the past two decades US electricity consumption was basically flat. AI changed that.11
The local texture. 26% of Virginia's electricity now goes to data centers. 79% in Dublin, where the government has, since 2023, quietly stopped issuing new grid connections for data centers in the region. Ireland is the first country in the Organisation for Economic Co-operation and Development to explicitly turn down AI infrastructure on capacity grounds.12 The PJM Interconnection capacity market, which covers the electric grid from Illinois to North Carolina, priced in a $9.3 billion increase attributed to data-center load for the 2025–26 year, which comes out to about $18 extra per month on a typical residential bill in western Maryland and $16 in Ohio.
The water. Large hyperscale data centers consume up to 5 million gallons per day for cooling. The projected growth rate of data-center cooling-water usage is 870% over the next decade.13 Each ChatGPT query, by OpenAI's own accounting, evaporates about a third of a milliliter of water. ChatGPT handles roughly 2.5 billion queries a day. Do the multiplication: 800,000 liters, or about 210,000 gallons, drawn through cooling towers every day to run one chatbot, a substantial chunk of it in places like Microsoft's hyperscale cluster in West Des Moines, Iowa, where the corn is. The supply chain is wet all the way down: producing a single microchip also requires about two and a half gallons of ultra-purified water just to rinse the wafers.
The market. The Magnificent Seven (Apple, Amazon, Alphabet, Meta, Microsoft, Nvidia, Tesla) currently make up about 34% of the S&P 500. Nvidia alone is 7% of the index.14 If you hold a cap-weighted S&P 500 index fund (which is the default in most 401(k) plans, in most target-date funds, in most of the ways Americans are told to save for retirement), about a third of every monthly contribution flows directly into these seven companies. The bond side of the portfolio is no safer: PIMCO, which manages one of the most widely held bond funds in American retirement accounts, just anchored an $18 billion debt package for Meta's Hyperion data center in Louisiana and is in talks for $14 billion more on a new Oracle facility in Michigan.15
What I am trying to establish with this catalogue is something like the texture of the bet. The bet sprawls across physical infrastructure, semiconductor supply chains, power grids, municipal water tables, household utility bills, and the retirement savings of roughly every American with a job. It is large, it is correlated, and the thing that would most dramatically change its value is a small research paper that makes KV caches smaller.
That paper, the one from the opening, is TurboQuant.
III. What TurboQuant actually does
I am going to try to explain this without too much math, because the beauty of the paper is not really in the math, it is in the fact that the math is so modest.
The core problem with compressing a vector is that it might be lopsided. Imagine an orchestra where the trumpet is much louder than every other instrument. If you want to compress the recording with a single volume rule, say rounding each instrument's level to one of sixteen steps, you are stuck. Calibrate the steps to the trumpet and the violin rounds to zero. Calibrate them to the violin and the trumpet clips at the top. One instrument behaves unlike the others, and no uniform rule fits.
The TurboQuant trick is to run the signal through a random mixer before compressing it. The mixer is a random rotation: a mathematical operation that preserves every note and every phrase but redistributes the sound across channels. In high dimensions, this has a remarkable property. Whatever single instrument dominated before, afterward every channel carries a moderate, roughly equal portion of the total energy. The trumpet's loudness is smeared across the whole ensemble, and no channel is special anymore.
Once every channel carries roughly the same loudness, one compression rule handles all of them. No training data. No calibration. Just the mixer and its inverse.
This is, in my opinion, one of those tricks that looks like nothing but is pretty neat. Random rotations have been sloshing around applied mathematics since the 1980s, doing similar work in a dozen other contexts (Johnson-Lindenstrauss sketches, locality-sensitive hashing, stochastic gradient descent).16 What the authors of this paper did was notice, with a kind of clerical precision, that random rotation is exactly the move the KV cache problem has been waiting for. The KV cache is high-dimensional. The KV cache is generated online, one vector at a time, with no time for training. The KV cache needs to preserve inner products for attention to work. Random rotation plus scalar quantization plus a one-bit sign-based correction on the residual: inner products preserved, unbiased, within a factor of 2.7 of the Shannon information-theoretic lower bound.
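If you would rather see the trick than take my word for it, here is a minimal sketch in Python. It is the shape of the idea, not the paper's algorithm: TurboQuant uses fast structured rotations rather than a dense matrix, chooses its grid near-optimally, and adds the one-bit residual correction, all of which I've left out, and the dimension and bit-width below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Orthonormalize a Gaussian matrix: a uniformly random rotation.
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

def quantize_roundtrip(x, bits=4):
    # One shared uniform grid for the whole vector: encode, then decode.
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / step) * step + lo

d = 1024
k = rng.normal(size=d)   # a key vector...
k[7] += 100.0            # ...with one trumpet-loud coordinate

R = random_rotation(d)
k_naive = quantize_roundtrip(k)            # grid stretched by the outlier
k_mixed = R.T @ quantize_roundtrip(R @ k)  # mix, quantize, unmix

# Attention runs on inner products, so that is what we score,
# against a batch of random stand-in queries.
Q = rng.normal(size=(256, d))
exact = Q @ k
for name, approx in [("naive", Q @ k_naive), ("rotated", Q @ k_mixed)]:
    rel = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"{name:8s} relative inner-product error: {rel:.1%}")
```

Run it and the naive estimate should come out as roughly pure noise (relative error on the order of 100%), while the rotated estimate lands around an order of magnitude better, at the same four bits. The residual correction and the optimal grid are what close most of the remaining gap, and what make the estimator unbiased.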
The benchmark is called needle-in-a-haystack. A model gets a document tens of thousands of tokens long with one sentence hidden inside, and has to find it. The compressed version finds it. The uncompressed version also finds it. They score identically to three decimal places, which is the kind of tie that would get a sports broadcaster demoted. The cache is about a quarter the size, and on an H100 the attention step runs up to eight times faster.17
And the method is data-oblivious, meaning no calibration, no training, no fine-tuning. You can drop it into an existing inference stack and it works immediately, on any model, for any user.
It would be irresponsible to extrapolate wildly from a single paper. The ML literature is littered with compression methods that looked great on benchmarks and fell apart in production. But TurboQuant comes with actual distortion bounds, not just empirical wins, and the bounds say it is approximately as good as any compression method is ever going to be for this problem. Not "good enough"; approximately optimal. The remaining slack between TurboQuant and the Shannon limit is a factor of about 2.7, and at low bit-widths it's closer to 1.45. Future work will tighten that factor, maybe to 2, maybe to 1.8. It is not going to discover that the current method was secretly 16x worse than possible, because the floor is a mathematical certainty and TurboQuant is already quite close to it.
Which means the KV cache memory problem is, in a specific and narrow sense, approximately solved.
IV. The gap between the slab and the paper
Here is the part that gets strange.
If the KV cache memory problem is approximately solved, then some amount of the compute infrastructure being built against that problem is, retrospectively, over-built. Not all of it, certainly, and perhaps not even most of it: the buildout is hedging against many things (training compute, reasoning compute, longer contexts, more concurrent users, image and video models that have their own memory problems, the general expansion of demand to fill any available supply). KV cache compression does not zero out demand for GPUs. It just makes each GPU more productive.
But "more productive" is the entire thing. The reason Meta is spending $120 billion is that current GPUs are not productive enough for current demand. If you make them 4x more productive for the memory-bound part of the workload, which is the dominant part of the workload, you have done something to the ratio of supply to demand that has to show up somewhere. Maybe as faster responses. Maybe as longer contexts. Maybe as more concurrent users per GPU. Maybe as fewer GPUs needed for a given workload. Maybe as some data centers not getting built at all.
The direction of the effect is not ambiguous. The magnitude is.
And here is where the piece starts to feel less like a tech essay and more like a genre I don't quite have a name for, because the numbers are so large and the mechanism is so small. Call it the arbitrage between concrete and code. Concrete takes years and billions. Code takes weeks and nothing. The two are being priced against each other in real time, by people who are sort of aware of the arbitrage (every hyperscaler has quantization research teams, every model-serving framework is racing to integrate the latest compression papers) but whose financial commitments lag their research by a decade.
Microsoft is not going to cancel a data center it spent three years permitting because a random rotation trick made inference cheaper. The data center will get built; the workloads will expand to fill it; the original forecast will be quietly revised. But there is a version of the same situation, five or six compounding algorithmic wins from now, where the relationship between forecasted compute demand and forecasted infrastructure supply starts to look unpleasant, and the infrastructure is the side that's already committed, because it's made of concrete, and you can't unpour concrete.18 Concrete shares this property with tattoos and with the decision to get a dog.
I want to be clear that I am not predicting a crash. People have been predicting an AI infrastructure crash approximately continuously since 2023, and they have been approximately continuously wrong, because demand has repeatedly outrun the most aggressive forecasts. It's entirely possible that algorithmic efficiency wins are just adding fuel to a demand curve that was going to outpace them anyway. This is the Jevons paradox (the nineteenth-century observation that making steam engines more efficient didn't reduce coal consumption, because cheaper steam enabled new uses of steam, and new uses of steam ate through all the saved coal and then some). Efficiency gains in compute have, historically, grown the compute footprint, not shrunk it.
But the Jevons story assumes elastic demand. It assumes that cheaper inference will find new uses. Some of it will; a lot of it will. There are probably LLM applications we haven't imagined yet that will only become economical at a quarter of current inference cost. The paperclip-optimizer version of this is that the released compute gets absorbed by whatever new workload shows up.
The other possibility, which is less discussed because it is less dramatic but maybe more realistic, is that some of the efficiency gain gets absorbed by new demand and some of it leaves physical infrastructure slightly stranded, in the same way that my old DVD player is slightly stranded but not gone. The capacity exists; it will get used; it just won't get used as intensely as the forecasts required. Utilization drops from 95% to 80%. Revenue per GPU drops by a corresponding amount. The payback period on the $70 billion data-center campus in Louisiana stretches from seven years to ten. Call it something well short of a crash. A number in a discounted cash flow (DCF) model moving the wrong way, at a scale that compounds.
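The nonlinearity in that last step is easier to see in a toy model than in prose. Every number below is invented for illustration, loosely calibrated to the $70 billion campus above; nobody's actual economics are this simple:

```python
def payback_years(capex, full_util_revenue, opex, utilization):
    # Simple payback: upfront cost over annual net cash flow.
    # Revenue scales with utilization; operating cost mostly doesn't.
    return capex / (utilization * full_util_revenue - opex)

capex, revenue, opex = 70.0, 20.0, 9.0   # $B and $B/year, invented
for u in (0.95, 0.80):
    years = payback_years(capex, revenue, opex, u)
    print(f"utilization {u:.0%}: payback {years:.0f} years")
# -> 95%: 7 years. 80%: 10 years. The 15-point utilization drop cuts
# revenue by about 16% but net cash flow by 30%, because the fixed
# costs don't drop with it. That is the compounding.
```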
When the number in the DCF model moves the wrong way for Microsoft, and Amazon, and Google, and Meta, and Oracle, simultaneously, a nontrivial chunk of the S&P 500 reprices. Which brings us, finally, back to Ohio: a woman in Columbus whose electricity bill is up $16 a month, and whose target-date fund is up 12% year-over-year, neither of which she had any particular say in.
V. What this is actually about
The capex, the electricity bill, and the retirement account all trace back to the same technical choice, repeated trillions of times a day: compression.
Every act of compression is a choice about what matters. You look at the raw data the model generated about the meaning of a conversation and you ask which bits to keep and which to throw away. TurboQuant's answer is that you keep the inner products, which are how attention actually works, and you throw away almost everything else, which turns out to be about three-quarters of what's there. The default 16-bit floating-point representation is a hardware inheritance; 16 bits is what the chips supported, and nobody had a reason to use less.
When you bother to ask how little precision you actually need, the answer is three or four bits. The other twelve were decorative. The paper is nominally about saving memory; the more interesting result, maybe by accident, is what it tells us about what the model was doing with memory in the first place, which is: not as much as we thought.
Here is where the observation goes if you let it run. Every bit a model keeps costs energy to store, water to cool, a fraction of someone's retirement account to finance, a fraction of a farmer's grandchild's electricity bill to underwrite. Three-quarters of those bits turned out to be redundant. Four researchers at Google and NYU just wrote a paper showing how to stop paying for them. An extraordinary amount of energy, water, and money, in principle reclaimable in a few pages of math, once the paper reaches production stacks.
From the inside, each of these decisions looks like a random rotation matrix and a lookup table. In aggregate, across every major AI lab and integrated over the next decade, they determine how much natural gas gets burned in Louisiana, how much water gets drawn from the Ogallala aquifer, and how much the average teacher's retirement nest egg is worth in 2045. These are consequential choices, being made, for the most part, by people who think of themselves as making graphs go up.
VI. The small paper and the big one
There is a kind of essay about AI that I have come to distrust, which treats every new technical result as either salvation or apocalypse. The actual story is neither, and the actual story is duller than either, and the actual story is that there are two papers in conversation: a small paper about random rotations, and a much larger paper, unwritten, being inscribed in concrete across rural Virginia and West Texas and central Louisiana. The small paper is quietly rewriting the assumptions of the large one.
This is fine. This is, in fact, how infrastructure always gets built. The railroads of the 1870s were overbuilt against demand forecasts that assumed steam locomotives would remain inefficient, and then the locomotives got better and the rail companies went bankrupt and the rails stayed in the ground and eventually Warren Buffett bought them, more or less. The fiber optic buildout of the late 1990s was overbuilt against bandwidth forecasts that did not anticipate wavelength-division multiplexing, which let a single strand carry dozens of signals at once, and by 2002 the dark fiber was sitting in the ground earning nothing and by 2010 it was worth a fortune because YouTube existed. The physical capital of a civilization is almost always built against forecasts that the civilization's own software ends up invalidating, and then the software catches up, and then the physical capital finds a use, and the capital providers either get rich or get wiped out depending on whether they were early or late.
The current AI infrastructure buildout will, on historical odds, get some of this wrong. The weird thing about 2026 is that whatever goes wrong happens to most of us at once, because we are all, via our 401(k)s, both the capital providers and the electricity-bill-payers and the workforce and the end consumers, and the accounting lines between those roles have become extremely blurry.
I have a 401(k). It has an S&P 500 index fund in it.
Somewhere, probably tonight, somebody is going to hit "Enter" on a Python script that, via the TurboQuant paper, saves a hyperscaler forty million dollars in HBM provisioning over the next fiscal year. Somebody else is going to sign off on pouring another concrete slab in Loudoun County. Somewhere in between these two people, all of the rest of us are holding the bag and also the winnings, in the same hand, in approximately equal measure, and no one has a reliable way of telling us which is which.
The slab we started with is still a slab. It will, eventually, become a building. Whether that building will be as valuable as its forecast promised is a question being decided, continuously, by papers like TurboQuant, at a speed the concrete cannot match.
The authors of TurboQuant list their affiliations at the top of the paper: Google Research, NYU, Google DeepMind. They thank no one in particular. The paper is seventeen pages long. The method is four pages. The rest is proofs and experiments. You can read it in an afternoon.
The concrete will take longer.
Footnotes
1. The specific slab I'm picturing is based on reporting by Sightline Climate, which found that of 140 US data-center projects announcing 2026 opening dates (totaling 16 gigawatts of capacity), only about 5 GW is actually under construction. The remaining 11 GW is stuck at the "announced" phase with no visible construction, despite typical build times of 12–18 months. The bottleneck is not money; the bottleneck is 30-month transformer lead times and switchgear sold out through 2028. One consequence is that US imports of medium-voltage electrical gear from China jumped from 1,500 units in 2022 to over 8,000 in the first ten months of 2025. This is the part of the AI buildout that doesn't get its own Axios article: that the copper and the steel are, right now, mostly coming from the country that our export-control policy is explicitly trying to box out of the AI race.
2. Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Google Research published an accompanying blog post on March 24, 2026, which is also worth reading, though it frames the result in the optimistic register appropriate to a company blog. The paper itself is more cautious.
3. The third is Q, for "query." Q, K, and V are three different projections of the input embedding; attention computes, roughly, how much each token's query matches each other token's key, then uses those match scores to weight the values. The K and V projections get cached because they're the same across every new token; Q gets recomputed each step. If you want a more detailed explanation and don't mind equations, the original "Attention Is All You Need" paper (Vaswani et al., 2017) is the locus classicus.
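If code is clearer than prose, here is a toy single-head decode step in Python. The dimensions and weights are invented, and real implementations batch heads, layers, and users, but the caching logic is this:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # head dimension (invented)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

K_cache, V_cache = [], []                 # grow by one row per token

def decode_step(x):
    # x: embedding of the newest token. Its K and V join the cache
    # for good; its Q is used once, against everything cached so far,
    # and thrown away.
    K_cache.append(x @ Wk)
    V_cache.append(x @ Wv)
    q = x @ Wq
    K, V = np.array(K_cache), np.array(V_cache)
    scores = q @ K.T / np.sqrt(d)         # match query against all keys
    w = np.exp(scores - scores.max())     # softmax over past tokens
    return (w / w.sum()) @ V              # weighted sum of cached values

for _ in range(100):                      # each step re-reads the cache
    out = decode_step(rng.normal(size=d))
```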
4. Back-of-the-envelope: a large production model, on the order of 64 transformer layers with 8 KV heads after grouped-query attention and head dimension 128, at 16-bit floating-point precision (FP16), works out to roughly 260 kilobytes of KV cache per token. At one million tokens that's about 260 GB, well past a single 80 GB H100. Frontier models do not publish their exact architectures, so this is indicative rather than precise, and aggressive KV-sharing techniques can bring the number down substantially. But the shape of the problem is robust: double the context, double the cache. Model-serving frameworks like vLLM used to waste 60–80% of allocated KV cache memory through fragmentation, before PagedAttention came along in 2023 and brought that number below 4%. The 2–4x throughput improvement from PagedAttention is the entire reason commercial long-context inference became economically viable. It was a software fix. It required no new hardware. The analogous situation, with TurboQuant, might be unfolding now.
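The same arithmetic in Python, with the same illustrative numbers (again: not any lab's published architecture):

```python
layers, kv_heads, head_dim = 64, 8, 128   # illustrative, as above
bytes_per_scalar = 2                      # FP16
tokens = 1_000_000

per_token = 2 * layers * kv_heads * head_dim * bytes_per_scalar  # K and V
print(f"{per_token / 1024:.0f} KiB per token, "
      f"{per_token * tokens / 1e9:.0f} GB at 1M tokens")
# -> 256 KiB per token, 262 GB at 1M tokens. At 4 bits instead of
# 16, the same cache is ~66 GB: it suddenly fits on a single H100.
```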
5. This is worth repeating because it violates the intuition most people have about "AI is expensive." People picture GPUs straining under the weight of massive matrix multiplications. In practice the GPUs are mostly bored, waiting for data to arrive from memory. Inference is a traffic problem. It is approximately the same problem that makes your phone slow when Chrome has too many tabs open, except with infinitely more money at stake.
6. Hyperscaler capex grew at an annualized rate of about 72% from Q2 2023 through Q4 2025, according to Epoch AI's analysis of public filings. If this rate continued, 2026 capex would be about $770 billion. Actual guidance is coming in slightly below that, around $750 billion, which industry analysts are interpreting as "slowing down."
7. Deloitte estimates that inference accounted for roughly half of AI compute in 2025, rising to two-thirds in 2026; Brookfield's longer projection has inference at 75% of AI compute by 2030. Training is the loud, dramatic, one-time cost; inference is the quiet, recurring, dominant one. The framing of AI as primarily a training problem is a holdover from 2022, when it actually was. It is no longer.
8. For comparison: US investment in tech equipment and software reached 4.4% of GDP in 2025, which is nearly the dotcom-era peak. The hyperscalers' planned AI asset additions through 2030 imply an annual depreciation expense of roughly $400 billion, which is more than their combined 2025 profits. Free cash flow has gone negative at multiple hyperscalers, and the gap is being funded by debt markets: projections suggest $1.5 trillion in new tech-sector debt issuance over the coming years. This is, to put it mildly, a departure from how these companies have historically financed themselves.
9. Measured in B300-equivalents. An H100, the previous generation, is about 0.26 B300-equivalents in performance terms. So 6.89 million B300s is something like 26 million H100-equivalents. There are, at current estimates, fewer than 30 million H100-equivalent AI chips in the world. The plan is to roughly double that, in one year, in one country.
10. The IEA projections have wide uncertainty bands, and the bands widen with the forecast horizon. Their high-growth ("Lift-Off") scenario has data centers consuming something like 1,500 TWh globally by 2030. Their low-growth ("Headwinds") scenario has them around 800. The central estimate of 945 TWh is, essentially, the midpoint between those, which is a legitimate way of handling uncertainty but should not be mistaken for precision.
11. Sources for the consumption numbers: historical US data-center electricity use from Lawrence Berkeley National Laboratory's 2024 Data Center Energy Usage Report; the 2028 projection range (325–580 TWh) is also LBNL 2024, cross-referenced with the IEA's Energy and AI report (2025) for the global picture. LBNL's historical series prior to 2018 is reconstructed from aggregate utility filings and carries wider uncertainty than the post-2018 numbers.
12. Dublin's 79% figure stands out because it's not a US number and therefore doesn't fit the American narrative of the crisis. Ireland became a "tech hub" in the 1990s via aggressive corporate tax policy, which made it attractive to Microsoft, Google, Amazon, and Meta, which collectively built a lot of data centers there, which now use 79% of Dublin's electricity and 21% of Ireland's. The Irish government has, as of 2023, quietly stopped issuing new grid connections for data centers in the Dublin region. This is functionally a cap. It has not made international news because nobody particularly cares about Irish electricity policy, but Ireland is the first OECD country to explicitly turn down AI infrastructure demand on environmental and grid-capacity grounds, and it will not be the last.
13. 870% is a projection, not a measurement, and is from Brookings' recent review of data-center water usage. Projections of this kind are notoriously bad. I include it for scale rather than precision. The direction is not in serious dispute; the magnitude is.
14. If you find this startling, the comparison that should really startle you is that Nvidia's market cap is currently larger than the combined GDP of every African country, and also larger than the total value of all banks in the eurozone.
15. PIMCO's Income Fund manages roughly $225 billion and sits on nearly every major 401(k) platform in the United States. Americans held about $8 trillion in 401(k) accounts as of 2023, with over 40% in target-date funds (which hold significant bond allocations that grow as the investor nears retirement). The 144A bonds that finance data-center debt are technically restricted to institutional buyers, which means retail investors cannot buy them directly, but they end up holding them anyway through the bond funds in their retirement accounts. This is the kind of sentence that sounds like a conspiracy theory but is just the plumbing of modern retirement finance, rendered visible.
16. Johnson-Lindenstrauss (1984): random matrix, works. Locality-sensitive hashing (1998): random hyperplanes, works. Stochastic gradient descent (Robbins and Monro, 1951, rediscovered for neural networks at the turn of the century): random sampling, works. Every time someone looks hard at high-dimensional geometry or optimization, the tractable solution has randomness in it somewhere, and every time the discovery gets received as a minor embarrassment, as if using randomness were admitting a gap in the theory. Eventually the theorists came around. In spaces with enough dimensions, a random direction is nearly as informative as any hand-picked one, and asking for determinism is asking for a guarantee you don't actually need.
17. Per Google Research's blog post, which I am taking at their word pending community replication. The underlying paper does not benchmark end-to-end attention throughput, only compression quality and indexing time. The 8× number includes both the memory-bandwidth savings from the smaller cache and the arithmetic savings from doing inner products in lower precision.
18. "Unpour concrete" is not a phrase anyone uses because nobody has ever tried.