The Concrete Will Take Longer
A small paper about random rotations, a large paper about concrete in data centers, and the quiet argument between them.
There is a poured concrete slab in Loudoun County, Virginia, that is going to eat your electricity bill.
I do not mean this metaphorically, or at least not only metaphorically. The slab is real. It sits behind a chain-link fence, next to a plywood sign with a construction company's logo on it. There is no building yet. There are no transformers, no switchgear, no servers. The press release said the facility would open this year. The grid connection from Dominion Energy is not scheduled until 2029. You can do the math yourself, or you can trust me that it doesn't work out. The slab is one of roughly 140 such projects in the United States right now that announced 2026 opening dates and, as of this month, haven't actually started being buildings.1
Meanwhile, in a different genre of document, a paper called TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate was posted to arXiv last April and quietly presented at the International Conference on Learning Representations last month in Brazil.2 Seventeen pages. Four pages of method, the rest proofs and experiments. A hundred-odd lines of implementable idea, written in the formal mathematical register that academic computer science uses when it wants to be taken seriously by journals and by itself. The paper's whole argument is that you can compress the memory used during AI workloads by a factor of four or five without losing anything anyone can measure. You multiply a vector by a random rotation matrix, then look up some precomputed numbers, then add a sign bit. That's it. That's the paper.
These two documents (the slab and the preprint) are, I want to argue, in a conversation with each other that almost nobody is conducting on their behalf. It is a quiet conversation, and an expensive one, and a substantial percentage of the readers of this blog are silent participants in it whether or not they know it.
Let me try to explain why.
I. The thing in the middle of everything
Modern large language models work by generating one token at a time, and each token has to look at every previous token in the conversation. To avoid recomputing the same attention scores over and over, the model stashes the intermediate calculations in a thing called the KV cache. "KV" stands for "key and value," two of the three main objects in the attention layers of a transformer, the architecture underneath large language models like ChatGPT and Claude.3 The cache is exactly what it sounds like: the model's working memory for "what have we said so far."
The frontier models of the moment, Claude Opus 4.7 and Gemini 3.1 Pro, both advertise context windows of one million tokens. That is roughly 1,500 pages of text. At that scale, a single long-context request produces a KV cache measured in hundreds of gigabytes of high-bandwidth memory (HBM).4 If you're doing the math: a single H100 (Nvidia's workhorse data-center GPU) has 80 gigabytes of it. One user, one long prompt, several GPUs ganged together, before the model weights have even loaded. Other concurrent users still need memory. The model itself still needs memory. The cache just keeps growing, linearly with context length, inexorably, forever, until the GPUs run out of room and the request is either queued or dropped or silently truncated in the middle of your document, which is the part of the product experience you've probably noticed mid-conversation and the marketing materials do not emphasize.
This is the defining constraint of modern inference, and it is a memory problem.5 The GPUs have plenty of floating-point compute sitting idle. Every token you generate requires reading the entire KV cache from HBM into the compute units, and HBM bandwidth is finite. Inference is memory-bound; the bottleneck is bandwidth, and the thing being moved through the pipe is mostly the KV cache. The entire economic structure of modern AI (the price per token, the cost of a long-context query, the reason Claude and GPT cost what they cost) is a function of how expensive it is to shuffle KV caches around.
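You can put a rough number on that ceiling. Below is the back-of-the-envelope in Python, using the illustrative million-token cache from footnote 4 and the H100's published memory bandwidth; real deployments batch, shard, and overlap, so treat it as the shape of the constraint, not anyone's production number:

```python
# If every generated token re-reads the whole KV cache from HBM,
# memory bandwidth sets a hard ceiling on tokens per second.
cache_gb = 260          # illustrative 1M-token cache (see footnote 4)
hbm_tb_per_s = 3.35     # H100 SXM HBM3 bandwidth

seconds_per_token = cache_gb / (hbm_tb_per_s * 1000)
print(f"{seconds_per_token * 1000:.0f} ms/token, "
      f"ceiling of ~{1 / seconds_per_token:.0f} tokens/s")
# -> ~78 ms/token, ~13 tokens/s, with the arithmetic units mostly
# idle. (A cache this size doesn't even fit on one card, which is
# the other half of the problem.) Shrink the cache 4x and the
# ceiling rises 4x. Nothing else about the hardware has to change.
```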
So when you read that Microsoft is spending $145 billion on capex this year, or that Amazon is planning $200 billion, or that the five largest hyperscalers are collectively guiding to something like $750 billion in 2026 infrastructure spending (up from about $450 billion in 2025, which was itself up about 60% from the year before), a meaningful fraction of that money is, functionally, a vote on the proposition that KV caches are going to stay large.6
So, training is expensive. Inference of non-cached workloads is expensive. Model size is expensive. But the shape of modern inference (which has already surpassed training as the dominant share of AI compute, and is expected to keep expanding its lead through the end of the decade) is dominated by attention over long contexts, and attention over long contexts is dominated by KV cache memory pressure.7 The concrete is being poured against that number.
And that number is the thing TurboQuant just cut by a factor of four.
II. A brief and incomplete taxonomy of things that are happening
It helps to see all the numbers in one place, because no single number captures it and the specific vertigo of the moment is in the aggregation.
The spending. Microsoft, Google, Amazon, Meta, and Oracle: roughly $660–$750 billion in combined 2026 capex, ~75% of it explicitly AI-related. That's about $500 billion of hardware, buildings, and grid upgrades in a single calendar year, from five companies.8 For context, the entire US electric utility industry (generation, transmission, distribution) invested about $160 billion in 2024. The hyperscalers are now outspending the utilities that power them by roughly three to one.
The chips. Nvidia is projected to produce something like 6.89 million Blackwell-300-equivalent units in 2026. Blackwell production is sold out through the middle of the year, with a reported backlog of 3.6 million units.9 Taiwan Semiconductor Manufacturing Company (TSMC) has booked over half of its 2026–27 Chip-on-Wafer-on-Substrate (CoWoS) packaging capacity for Nvidia alone.
The company. Nvidia trades, as I write this, at about $200 per share, with a market capitalization just under $5 trillion. This is more than any other company in the history of public markets. It is, as of today, roughly double the market cap of the entire energy sector of the S&P 500. In January 1999, Nvidia was worth $562 million.
The grid. US data centers used about 183 terawatt-hours in 2024, roughly 4% of national consumption, roughly equivalent to all of Pakistan. The International Energy Agency's global forecast puts data center consumption near 945 TWh by 2030, which would make data centers, if they were a country, a top-five national energy consumer, sitting between Japan and Russia.10 For the past two decades US electricity consumption was basically flat. AI changed that.11
The local texture. 26% of Virginia's electricity now goes to data centers. 79% in Dublin, where the government has, since 2023, quietly stopped issuing new grid connections for data centers in the region. Ireland is the first country in the Organisation for Economic Co-operation and Development to explicitly turn down AI infrastructure on capacity grounds.12 The PJM Interconnection capacity market, which covers the electric grid from Illinois to North Carolina, priced in a $9.3 billion increase attributed to data-center load for the 2025–26 year, which comes out to about $18 extra per month on a typical residential bill in western Maryland and $16 in Ohio.
The water. Large hyperscale data centers consume up to 5 million gallons per day for cooling. The projected growth rate of data-center cooling-water usage is 870% over the next decade.13 Each ChatGPT query, by OpenAI's own accounting, evaporates about a third of a milliliter of water. ChatGPT handles roughly 2.5 billion queries a day. Do the multiplication: 800,000 liters, or about 210,000 gallons, drawn through cooling towers every day to run one chatbot, a substantial chunk of it in places like Microsoft's hyperscale cluster in West Des Moines, Iowa, where the corn is. The supply chain is wet all the way down: producing a single microchip also requires about two and a half gallons of ultra-purified water just to rinse the wafers.
The market. The Magnificent Seven (Apple, Amazon, Alphabet, Meta, Microsoft, Nvidia, Tesla) currently make up about 34% of the S&P 500. Nvidia alone is 7% of the index.14 If you hold a cap-weighted S&P 500 index fund (which is the default in most 401(k) plans, in most target-date funds, in most of the ways Americans are told to save for retirement), about a third of every monthly contribution flows directly into these seven companies. The bond side of the portfolio is no safer: PIMCO, which manages one of the most widely held bond funds in American retirement accounts, just anchored an $18 billion debt package for Meta's Hyperion data center in Louisiana and is in talks for $14 billion more on a new Oracle facility in Michigan.15
What I am trying to establish with this catalogue is something like the texture of the bet. The bet sprawls across physical infrastructure, semiconductor supply chains, power grids, municipal water tables, household utility bills, and the retirement savings of roughly every American with a job. It is large, it is correlated, and the thing that would most dramatically change its value is a small research paper that makes KV caches smaller.
That paper, the one from the opening, is TurboQuant.
III. What TurboQuant actually does
I am going to try to explain this without too much math, because the beauty of the paper is not really in the math, it is in the fact that the math is so modest.
The core problem with compressing a vector is that it might be lopsided. Imagine an orchestra where the trumpet is much louder than every other instrument. If you want to compress the recording with a single volume rule, say rounding each instrument's level to one of sixteen steps, you are stuck. Calibrate the steps to the trumpet and the violin rounds to zero. Calibrate them to the violin and the trumpet clips at the top. One instrument behaves unlike the others, and no uniform rule fits.
The TurboQuant trick is to run the signal through a random mixer before compressing it. The mixer is a random rotation: a mathematical operation that preserves every note and every phrase but redistributes the sound across channels. In high dimensions, this has a remarkable property. Whatever single instrument dominated before, afterward every channel carries a moderate, roughly equal portion of the total energy. The trumpet's loudness is smeared across the whole ensemble, and no channel is special anymore.
Once every channel carries roughly the same loudness, one compression rule handles all of them. No training data. No calibration. Just the mixer and its inverse.
This is, in my opinion, one of those tricks that looks like nothing but is pretty neat. Random rotations have been sloshing around applied mathematics since the 1980s, doing similar work in a dozen other contexts (Johnson-Lindenstrauss sketches, locality-sensitive hashing, stochastic gradient descent).16 What the authors of this paper did was notice, with a kind of clerical precision, that random rotation is exactly the move the KV cache problem has been waiting for. The KV cache is high-dimensional. The KV cache is generated online, one vector at a time, with no time for training. The KV cache needs to preserve inner products for attention to work. Random rotation plus scalar quantization plus a one-bit sign-based correction on the residual: inner products preserved, unbiased, within a factor of 2.7 of the Shannon information-theoretic lower bound.
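If you would rather see the trick than take my word for it, here is a minimal sketch in Python. It is the shape of the idea, not the paper's algorithm: TurboQuant uses fast structured rotations rather than a dense matrix, chooses its grid near-optimally, and adds the one-bit residual correction, all of which I've left out, and the dimension and bit-width below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Orthonormalize a Gaussian matrix: a uniformly random rotation.
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

def quantize_roundtrip(x, bits=4):
    # One shared uniform grid for the whole vector: encode, then decode.
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / step) * step + lo

d = 1024
k = rng.normal(size=d)   # a key vector...
k[7] += 100.0            # ...with one trumpet-loud coordinate

R = random_rotation(d)
k_naive = quantize_roundtrip(k)            # grid stretched by the outlier
k_mixed = R.T @ quantize_roundtrip(R @ k)  # mix, quantize, unmix

# Attention runs on inner products, so that is what we score,
# against a batch of random stand-in queries.
Q = rng.normal(size=(256, d))
exact = Q @ k
for name, approx in [("naive", Q @ k_naive), ("rotated", Q @ k_mixed)]:
    rel = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"{name:8s} relative inner-product error: {rel:.1%}")
```

Run it and the naive estimate should come out as roughly pure noise (relative error on the order of 100%), while the rotated estimate lands around an order of magnitude better, at the same four bits. The residual correction and the optimal grid are what close most of the remaining gap, and what make the estimator unbiased.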
The benchmark is called needle-in-a-haystack. A model gets a document tens of thousands of tokens long with one sentence hidden inside, and has to find it. The compressed version finds it. The uncompressed version also finds it. They score identically to three decimal places, which is the kind of tie that would get a sports broadcaster demoted. The cache is about a quarter the size, and on an H100 the attention step runs up to eight times faster.17
And the method is data-oblivious, meaning no calibration, no training, no fine-tuning. You can drop it into an existing inference stack and it works immediately, on any model, for any user.
It would be irresponsible to extrapolate wildly from a single paper. The ML literature is littered with compression methods that looked great on benchmarks and fell apart in production. But TurboQuant comes with actual distortion bounds, not just empirical wins, and the bounds say it is approximately as good as any compression method is ever going to be for this problem. Not "good enough"; approximately optimal. The remaining slack between TurboQuant and the Shannon limit is a factor of about 2.7, and at low bit-widths it's closer to 1.45. Future work will tighten that factor, maybe to 2, maybe to 1.8. It is not going to discover that the current method was secretly 16x worse than possible, because the floor is a mathematical certainty and TurboQuant is already quite close to it.
Which means the KV cache memory problem is, in a specific and narrow sense, approximately solved.
IV. The gap between the slab and the paper
Here is the part that gets strange.
If the KV cache memory problem is approximately solved, then some amount of the compute infrastructure being built against that problem is, retrospectively, over-built. Not all of it, certainly, and perhaps not even most of it: the buildout is hedging against many things (training compute, reasoning compute, longer contexts, more concurrent users, image and video models that have their own memory problems, the general expansion of demand to fill any available supply). KV cache compression does not zero out demand for GPUs. It just makes each GPU more productive.
But "more productive" is the entire thing. The reason Meta is spending $120 billion is that current GPUs are not productive enough for current demand. If you make them 4x more productive for the memory-bound part of the workload, which is the dominant part of the workload, you have done something to the ratio of supply to demand that has to show up somewhere. Maybe as faster responses. Maybe as longer contexts. Maybe as more concurrent users per GPU. Maybe as fewer GPUs needed for a given workload. Maybe as some data centers not getting built at all.
The direction of the effect is not ambiguous. The magnitude is.
And here is where the piece starts to feel less like a tech essay and more like a genre I don't quite have a name for, because the numbers are so large and the mechanism is so small. Call it the arbitrage between concrete and code. Concrete takes years and billions. Code takes weeks and nothing. The two are being priced against each other in real time, by people who are sort of aware of the arbitrage (every hyperscaler has quantization research teams, every model-serving framework is racing to integrate the latest compression papers) but whose financial commitments lag their research by a decade.
Microsoft is not going to cancel a data center it spent three years permitting because a random rotation trick made inference cheaper. The data center will get built; the workloads will expand to fill it; the original forecast will be quietly revised. But there is a version of the same situation, five or six compounding algorithmic wins from now, where the relationship between forecasted compute demand and forecasted infrastructure supply starts to look unpleasant, and the infrastructure is the side that's already committed, because it's made of concrete, and you can't unpour concrete.18 Concrete shares this property with tattoos and with the decision to get a dog.
I want to be clear that I am not predicting a crash. People have been predicting an AI infrastructure crash approximately continuously since 2023, and they have been approximately continuously wrong, because demand has repeatedly outrun the most aggressive forecasts. It's entirely possible that algorithmic efficiency wins are just adding fuel to a demand curve that was going to outpace them anyway. This is the Jevons paradox (the nineteenth-century observation that making steam engines more efficient didn't reduce coal consumption, because cheaper steam enabled new uses of steam, and new uses of steam ate through all the saved coal and then some). Efficiency gains in compute have, historically, grown the compute footprint, not shrunk it.
But the Jevons story assumes elastic demand. It assumes that cheaper inference will find new uses. Some of it will; a lot of it will. There are probably LLM applications we haven't imagined yet that will only become economical at a quarter of current inference cost. The paperclip-optimizer version of this is that the released compute gets absorbed by whatever new workload shows up.
The other possibility, which is less discussed because it is less dramatic but maybe more realistic, is that some of the efficiency gain gets absorbed by new demand and some of it leaves physical infrastructure slightly stranded, in the same way that my old DVD player is slightly stranded but not gone. The capacity exists; it will get used; it just won't get used as intensely as the forecasts required. Utilization drops from 95% to 80%. Revenue per GPU drops by a corresponding amount. The payback period on the $70 billion data-center campus in Louisiana stretches from seven years to ten. Call it something well short of a crash. A number in a discounted cash flow (DCF) model moving the wrong way, at a scale that compounds.
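The nonlinearity in that last step is easier to see in a toy model than in prose. Every number below is invented for illustration, loosely calibrated to the $70 billion campus above; nobody's actual economics are this simple:

```python
def payback_years(capex, full_util_revenue, opex, utilization):
    # Simple payback: upfront cost over annual net cash flow.
    # Revenue scales with utilization; operating cost mostly doesn't.
    return capex / (utilization * full_util_revenue - opex)

capex, revenue, opex = 70.0, 20.0, 9.0   # $B and $B/year, invented
for u in (0.95, 0.80):
    years = payback_years(capex, revenue, opex, u)
    print(f"utilization {u:.0%}: payback {years:.0f} years")
# -> 95%: 7 years. 80%: 10 years. The 15-point utilization drop cuts
# revenue by about 16% but net cash flow by 30%, because the fixed
# costs don't drop with it. That is the compounding.
```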
When the number in the DCF model moves the wrong way for Microsoft, and Amazon, and Google, and Meta, and Oracle, simultaneously, a nontrivial chunk of the S&P 500 reprices. Which brings us, finally, back to Ohio: a woman in Columbus whose electricity bill is up $16 a month, and whose target-date fund is up 12% year-over-year, neither of which she had any particular say in.
V. What this is actually about
The capex, the electricity bill, and the retirement account all trace back to the same technical choice, repeated trillions of times a day: compression.
Every act of compression is a choice about what matters. You look at the raw data the model generated about the meaning of a conversation and you ask which bits to keep and which to throw away. TurboQuant's answer is that you keep the inner products, which are how attention actually works, and you throw away almost everything else, which turns out to be about three-quarters of what's there. The default 16-bit floating-point representation is a hardware inheritance; 16 bits is what the chips supported, and nobody had a reason to use less.
When you bother to ask how little precision you actually need, the answer is three or four bits. The other twelve were decorative. The paper is nominally about saving memory; the more interesting result, maybe by accident, is what it tells us about what the model was doing with memory in the first place, which is: not as much as we thought.
Here is where the observation goes if you let it run. Every bit a model keeps costs energy to store, water to cool, a fraction of someone's retirement account to finance, a fraction of a farmer's grandchild's electricity bill to underwrite. Three-quarters of those bits turned out to be redundant. Four researchers at Google and NYU just wrote a paper showing how to stop paying for them. An extraordinary amount of energy, water, and money, in principle reclaimable in a few pages of math, once the paper reaches production stacks.
From the inside, each of these decisions looks like a random rotation matrix and a lookup table. In aggregate, across every major AI lab and integrated over the next decade, they determine how much natural gas gets burned in Louisiana, how much water gets drawn from the Ogallala aquifer, and how much the average teacher's retirement nest egg is worth in 2045. These are consequential choices, being made, for the most part, by people who think of themselves as making graphs go up.
VI. The small paper and the big one
There is a kind of essay about AI that I have come to distrust, which treats every new technical result as either salvation or apocalypse. The actual story is neither, and the actual story is duller than either, and the actual story is that there are two papers in conversation: a small paper about random rotations, and a much larger paper, unwritten, being inscribed in concrete across rural Virginia and West Texas and central Louisiana. The small paper is quietly rewriting the assumptions of the large one.
This is fine. This is, in fact, how infrastructure always gets built. The railroads of the 1870s were overbuilt against demand forecasts that assumed steam locomotives would remain inefficient, and then the locomotives got better and the rail companies went bankrupt and the rails stayed in the ground and eventually Warren Buffett bought them, more or less. The fiber optic buildout of the late 1990s was overbuilt against bandwidth forecasts that did not anticipate wavelength-division multiplexing, which let a single strand carry dozens of signals at once, and by 2002 the dark fiber was sitting in the ground earning nothing and by 2010 it was worth a fortune because YouTube existed. The physical capital of a civilization is almost always built against forecasts that the civilization's own software ends up invalidating, and then the software catches up, and then the physical capital finds a use, and the capital providers either get rich or get wiped out depending on whether they were early or late.
The current AI infrastructure buildout will, on historical odds, get some of this wrong. The weird thing about 2026 is that whatever goes wrong happens to most of us at once, because we are all, via our 401(k)s, both the capital providers and the electricity-bill-payers and the workforce and the end consumers, and the accounting lines between those roles have become extremely blurry.
I have a 401(k). It has an S&P 500 index fund in it.
Somewhere, probably tonight, somebody is going to hit "Enter" on a Python script that, via the TurboQuant paper, saves a hyperscaler forty million dollars in HBM provisioning over the next fiscal year. Somebody else is going to sign off on pouring another concrete slab in Loudoun County. Somewhere in between these two people, all of the rest of us are holding the bag and also the winnings, in the same hand, in approximately equal measure, and no one has a reliable way of telling us which is which.
The slab we started with is still a slab. It will, eventually, become a building. Whether that building will be as valuable as its forecast promised is a question being decided, continuously, by papers like TurboQuant, at a speed the concrete cannot match.
The authors of TurboQuant list their affiliations at the top of the paper: Google Research, NYU, Google DeepMind. They thank no one in particular. The paper is seventeen pages long. The method is four pages. The rest is proofs and experiments. You can read it in an afternoon.
The concrete will take longer.
Footnotes
1. The specific slab I'm picturing is based on reporting by Sightline Climate, which found that of 140 US data-center projects announcing 2026 opening dates (totaling 16 gigawatts of capacity), only about 5 GW is actually under construction. The remaining 11 GW is stuck at the "announced" phase with no visible construction, despite typical build times of 12–18 months. The bottleneck is not money; the bottleneck is 30-month transformer lead times and switchgear sold out through 2028. One consequence is that US imports of medium-voltage electrical gear from China jumped from 1,500 units in 2022 to over 8,000 in the first ten months of 2025. This is the part of the AI buildout that doesn't get its own Axios article: that the copper and the steel are, right now, mostly coming from the country that our export-control policy is explicitly trying to box out of the AI race.
2. Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Google Research published an accompanying blog post on March 24, 2026, which is also worth reading, though it frames the result in the optimistic register appropriate to a company blog. The paper itself is more cautious.
3. The third is Q, for "query." Q, K, and V are three different projections of the input embedding; attention computes, roughly, how much each token's query matches each other token's key, then uses those match scores to weight the values. The K and V projections get cached because they're the same across every new token; Q gets recomputed each step. If you want a more detailed explanation and don't mind equations, the original "Attention Is All You Need" paper (Vaswani et al., 2017) is the locus classicus.
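If code is clearer than prose, here is a toy single-head decode step in Python. The dimensions and weights are invented, and real implementations batch heads, layers, and users, but the caching logic is this:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # head dimension (invented)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

K_cache, V_cache = [], []                 # grow by one row per token

def decode_step(x):
    # x: embedding of the newest token. Its K and V join the cache
    # for good; its Q is used once, against everything cached so far,
    # and thrown away.
    K_cache.append(x @ Wk)
    V_cache.append(x @ Wv)
    q = x @ Wq
    K, V = np.array(K_cache), np.array(V_cache)
    scores = q @ K.T / np.sqrt(d)         # match query against all keys
    w = np.exp(scores - scores.max())     # softmax over past tokens
    return (w / w.sum()) @ V              # weighted sum of cached values

for _ in range(100):                      # each step re-reads the cache
    out = decode_step(rng.normal(size=d))
```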
4. Back-of-the-envelope: a large production model, on the order of 64 transformer layers with 8 KV heads after grouped-query attention and head dimension 128, at 16-bit floating-point precision (FP16), works out to roughly 260 kilobytes of KV cache per token. At one million tokens that's about 260 GB, well past a single 80 GB H100. Frontier models do not publish their exact architectures, so this is indicative rather than precise, and aggressive KV-sharing techniques can bring the number down substantially. But the shape of the problem is robust: double the context, double the cache. Model-serving frameworks like vLLM used to waste 60–80% of allocated KV cache memory through fragmentation, before PagedAttention came along in 2023 and brought that number below 4%. The 2–4x throughput improvement from PagedAttention is the entire reason commercial long-context inference became economically viable. It was a software fix. It required no new hardware. The analogous situation, with TurboQuant, might be unfolding now.
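The same arithmetic in Python, with the same illustrative numbers (again: not any lab's published architecture):

```python
layers, kv_heads, head_dim = 64, 8, 128   # illustrative, as above
bytes_per_scalar = 2                      # FP16
tokens = 1_000_000

per_token = 2 * layers * kv_heads * head_dim * bytes_per_scalar  # K and V
print(f"{per_token / 1024:.0f} KiB per token, "
      f"{per_token * tokens / 1e9:.0f} GB at 1M tokens")
# -> 256 KiB per token, 262 GB at 1M tokens. At 4 bits instead of
# 16, the same cache is ~66 GB: it suddenly fits on a single H100.
```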
5. This is worth repeating because it violates the intuition most people have about "AI is expensive." People picture GPUs straining under the weight of massive matrix multiplications. In practice the GPUs are mostly bored, waiting for data to arrive from memory. Inference is a traffic problem. It is approximately the same problem that makes your phone slow when Chrome has too many tabs open, except with infinitely more money at stake.
6. Hyperscaler capex grew at an annualized rate of about 72% from Q2 2023 through Q4 2025, according to Epoch AI's analysis of public filings. If this rate continued, 2026 capex would be about $770 billion. Actual guidance is coming in slightly below that, around $750 billion, which industry analysts are interpreting as "slowing down."
7. Deloitte estimates that inference accounted for roughly half of AI compute in 2025, rising to two-thirds in 2026; Brookfield's longer projection has inference at 75% of AI compute by 2030. Training is the loud, dramatic, one-time cost; inference is the quiet, recurring, dominant one. The framing of AI as primarily a training problem is a holdover from 2022, when it actually was. It is no longer.
8. For comparison: US investment in tech equipment and software reached 4.4% of GDP in 2025, which is nearly the dotcom-era peak. The hyperscalers' planned AI asset additions through 2030 imply an annual depreciation expense of roughly $400 billion, which is more than their combined 2025 profits. Free cash flow has gone negative at multiple hyperscalers, and the gap is being funded by debt markets: projections suggest $1.5 trillion in new tech-sector debt issuance over the coming years. This is, to put it mildly, a departure from how these companies have historically financed themselves.
9. Measured in B300-equivalents. An H100, the previous generation, is about 0.26 B300-equivalents in performance terms. So 6.89 million B300s is something like 26 million H100-equivalents. There are, at current estimates, fewer than 30 million H100-equivalent AI chips in the world. The plan is to roughly double that, in one year, in one country.
10. The IEA projections have wide uncertainty bands, and the bands widen with the forecast horizon. Their high-growth ("Lift-Off") scenario has data centers consuming something like 1,500 TWh globally by 2030. Their low-growth ("Headwinds") scenario has them around 800. The central estimate of 945 TWh is, essentially, the midpoint between those, which is a legitimate way of handling uncertainty but should not be mistaken for precision.
11. Sources for the consumption numbers: historical US data-center electricity use from Lawrence Berkeley National Laboratory's 2024 Data Center Energy Usage Report; the 2028 projection range (325–580 TWh) is also LBNL 2024, cross-referenced with the IEA's Energy and AI report (2025) for the global picture. LBNL's historical series prior to 2018 is reconstructed from aggregate utility filings and carries wider uncertainty than the post-2018 numbers.
12. Dublin's 79% figure stands out because it's not a US number and therefore doesn't fit the American narrative of the crisis. Ireland became a "tech hub" in the 1990s via aggressive corporate tax policy, which made it attractive to Microsoft, Google, Amazon, and Meta, which collectively built a lot of data centers there, which now use 79% of Dublin's electricity and 21% of Ireland's. The Irish government has, as of 2023, quietly stopped issuing new grid connections for data centers in the Dublin region. This is functionally a cap. It has not made international news because nobody particularly cares about Irish electricity policy, but Ireland is the first OECD country to explicitly turn down AI infrastructure demand on environmental and grid-capacity grounds, and it will not be the last.
13. 870% is a projection, not a measurement, and is from Brookings' recent review of data-center water usage. Projections of this kind are notoriously bad. I include it for scale rather than precision. The direction is not in serious dispute; the magnitude is.
14. If you find this startling, the comparison that should really startle you is that Nvidia's market cap is currently larger than the combined GDP of every African country, and also larger than the total value of all banks in the eurozone.
15. PIMCO's Income Fund manages roughly $225 billion and sits on nearly every major 401(k) platform in the United States. Americans held about $8 trillion in 401(k) accounts as of 2023, with over 40% in target-date funds (which hold significant bond allocations that grow as the investor nears retirement). The 144A bonds that finance data-center debt are technically restricted to institutional buyers, which means retail investors cannot buy them directly, but they end up holding them anyway through the bond funds in their retirement accounts. This is the kind of sentence that sounds like a conspiracy theory but is just the plumbing of modern retirement finance, rendered visible.
16. Johnson-Lindenstrauss (1984): random matrix, works. Locality-sensitive hashing (1998): random hyperplanes, works. Stochastic gradient descent (Robbins and Monro, 1951, rediscovered for neural networks at the turn of the century): random sampling, works. Every time someone looks hard at high-dimensional geometry or optimization, the tractable solution has randomness in it somewhere, and every time the discovery gets received as a minor embarrassment, as if using randomness were admitting a gap in the theory. Eventually the theorists came around. In spaces with enough dimensions, a random direction is nearly as informative as any hand-picked one, and asking for determinism is asking for a guarantee you don't actually need.
17. Per Google Research's blog post, which I am taking at their word pending community replication. The underlying paper does not benchmark end-to-end attention throughput, only compression quality and indexing time. The 8× number includes both the memory-bandwidth savings from the smaller cache and the arithmetic savings from doing inner products in lower precision.
18. "Unpour concrete" is not a phrase anyone uses because nobody has ever tried.