April 1, 2026

The Plumbing

Or: a technical appendix that accidentally became an argument about what 'public' means.

I. Seven Cabinets

Go to fec.gov. Pull up any member of Congress. You can see every contribution they received above $200: the donor's name, their employer, their occupation, the amount, the date. It's all there. It's public. Pick a name from the list, any name. SMITH, PATRICIA A, PFIZER INC, VP REGULATORY AFFAIRS, $3,300, 03/15/2024. Now go to congress.gov. Pull up the same member's voting record. You can see every roll-call vote they've cast: bill number, date, yea or nay. It's all there. Public. Free. Now try to answer a simple question: did the member who took $3,300 from a Pfizer VP vote on any bills that affected the pharmaceutical industry?

You will find that you cannot answer this question. Not because the information is classified, or behind a paywall, or legally restricted. You cannot answer it because the contribution lives in one database and the vote lives in another and the two databases do not know about each other. They don't share an ID system. They don't share a vocabulary. They don't link together.

You can, if you're patient, parse this out manually: cross-referencing names, dates, bill subjects, committee jurisdictions, lobbying filings from a third database that uses yet another ID system. Assume maybe ten minutes per contribution, if you know what you're doing. The member you're looking at received 1,200 contributions last cycle. That's 200 hours. And that's one member. There are 534 more.

Or you can build a machine.

Because the information is scattered across seven federal databases maintained by seven different organizations using seven incompatible identification systems, and the act of connecting a dollar to a vote, which is the thing the disclosure laws were presumably designed to let citizens do, turns out to require roughly 34 gigabytes of data processing, an entity resolution pipeline that maps four different ID systems onto a single human being, a donor deduplication engine running string similarity algorithms at thresholds chosen (I'll be honest here) partly by feel, and about 36 million joins.1

The companion post describes what the machine found: 21.3%, which is the share of the model's vote predictions attributable to financial features. This post describes the machine itself. I thought, when I started writing it, that I was writing a technical appendix. I am no longer sure that's what this is. Because what the technical details keep revealing, once you line them up and look at them truthfully, is something about the architecture of American campaign finance disclosure that I think is more interesting than the 21.3% itself:

The system does not hide the information. It hides the legibility. And the distance between those two things is where influence lives.

II. The Mirror and the Map

I want to start with the Federal Election Commission, because the FEC is where the architectural principle becomes most visible, and because the FEC's design philosophy, once you understand it, explains almost everything else.

The FEC does not assign stable identifiers to individual donors. When you give money to a congressional candidate, the FEC records your name, city, state, employer, and occupation, exactly as reported on the form. This means the same person can appear as JOHNSON, ROBERT, BOEING CO, ENGINEER in one filing, BOB JOHNSON, BOEING COMPANY, AEROSPACE ENGR in another, and JOHNSON R, BOEING, ENGR in a third. The FEC does not consider these to be the same person. Its matching logic is string identity: are the characters the same? No? Different people.
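The matching logic fits in three lines. A toy sketch (not the FEC's actual code) of what string-identity matching does to one donor:

```python
# Toy sketch (not FEC code): string-identity matching treats every
# spelling variant as a brand-new donor.
records = [
    ("JOHNSON, ROBERT", "BOEING CO", "ENGINEER"),
    ("BOB JOHNSON", "BOEING COMPANY", "AEROSPACE ENGR"),
    ("JOHNSON R", "BOEING", "ENGR"),
]

# The FEC's question: are the characters the same? No? Different people.
distinct_donors = {name for name, employer, occupation in records}
print(len(distinct_donors))  # 3 -- one human, three "donors"
```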

The instinct here is annoyance. This seems broken. But if you sit with it (which I have, at hours suggesting poor judgment or genuine obsession, and possibly both), what emerges is not a broken system but a system that has made a very specific philosophical choice about its own purpose. The FEC is a filing agency. Its mandate, per the Federal Election Campaign Act, is to receive and make available reports and statements filed with the Commission. Not to analyze them. Not to connect them. Not to tell you what they mean. To receive them and make them available. The form says JOHNSON, ROBERT. The database says JOHNSON, ROBERT. The database has done its job.2

This distinction (between recording what was filed and asserting what is true about the world) is, I've come to believe, the load-bearing wall of the entire disclosure system, and I want to name it clearly because everything downstream depends on it. Every database is one of two things. It is either a mirror (it reflects what someone reported, no more, no less) or a map (it models what someone believes about reality). The FEC's contribution database is a mirror. It is a very good mirror. It is also, if what you want to know is "which industries are funding which legislators and how does that correlate with their votes," a mirror pointed at the wrong wall.

Congress.gov is also a mirror. It records votes. It does not record why anyone voted that way. VoteView (UCLA) is closer to a map: it computes ideology scores from voting patterns, which is an interpretive act. The Senate Lobbying Disclosure database is a mirror. It records what lobbyists filed. It does not record whether the lobbying worked. OpenSecrets is a map: it classifies donors into industry categories, which requires judgment. The congress-legislators YAML file on GitHub is a Rosetta Stone maintained by volunteers, without which the other six databases are, in a meaningful sense, written in mutually unintelligible languages.3

Seven databases. Each one does its job. Each one's job is narrower than what you need it to do if you want to answer the question "does money predict votes." And the collective result is a disclosure system that is simultaneously transparent (every record is public, as the law requires) and opaque (no record connects to any other record, because the law does not require that).

I do not think this was designed. I think it congealed. The FEC was established in 1975. Congress.gov's API launched decades later. The Senate lobbying database was built to a different spec by different people in a different era. Nobody sat in a room and said "let's make the data public but unconstructable." It happened the way a city without sidewalks happens: each road was built for a purpose, and the purposes didn't include pedestrians, and by the time anyone noticed the pedestrians, the concrete had settled.

But the effect is the same whether it was designed or accreted. A citizen who wants to know whether their representative's votes correlate with their representative's donors faces a choice:

A. Spending several hundred hours manually reviewing filings across multiple federal websites.

B. Building a data pipeline.

Option A is theoretically available to every American. Option B requires programming skills that only a small fraction of the population possesses, and that fewer still would want to spend this way. The information is public. The comprehension is gated. And the gate is not a paywall or a classification level or a legal restriction. The gate is plumbing.

I did the plumbing. Here's what's in it.4

III. The Identity Problem

Ritchie Torres, the congressman from the South Bronx discussed in the companion post, is H4NY15024 to the FEC, ICPSR 21500 to VoteView, Bioguide T000474 to Congress.gov, and CID N00036154 to OpenSecrets. Four strings of characters that mean "Ritchie Torres" in four different institutional languages. The chain that connects a dollar donated to Torres's campaign to a vote Torres cast on the House floor runs: contribution → FEC committee ID → FEC candidate ID → Bioguide (via YAML crosswalk) → vote record. Four joins. Each one can fail. The committee master file has committees with no candidate. The candidate file has candidates with no Bioguide mapping. The YAML crosswalk has legislators from 1789 who, for reasons that will be obvious upon reflection, lack FEC identifiers.
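The four-join chain can be sketched as a pair of lookups. Everything below is a hypothetical miniature (the dictionary shapes, the committee ID "C-EXAMPLE", and the function are mine); only the Torres identifiers are the real ones cited above:

```python
# Hypothetical miniature of the contribution -> vote join chain.
crosswalk = {  # distilled from congress-legislators.yaml
    "H4NY15024": {"bioguide": "T000474", "icpsr": 21500, "opensecrets": "N00036154"},
}
committee_to_candidate = {"C-EXAMPLE": "H4NY15024"}  # stand-in for the FEC committee master file

def contribution_to_bioguide(committee_id):
    """Walk contribution -> committee -> candidate -> Bioguide.
    Any hop can come up empty; that's a dropped record, not an exception."""
    candidate = committee_to_candidate.get(committee_id)
    if candidate is None:
        return None  # committee with no candidate
    entry = crosswalk.get(candidate)
    return entry["bioguide"] if entry else None  # candidate with no mapping

print(contribution_to_bioguide("C-EXAMPLE"))  # T000474
```

The function returns None rather than raising, because in the real pipeline a failed hop is routine: the record is logged and dropped, and the join count quietly shrinks.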

[Diagram: one person, four IDs. FEC H4NY15024 (contributions); VoteView ICPSR 21500 (ideology scores); Congress.gov T000474 (votes); OpenSecrets N00036154 (industries). All four resolve to Ritchie Torres via congress-legislators.yaml.]

The legislator mapping is, comparatively, clean. There are 12,000 of them, each a public figure with a paper trail. The donor mapping is where the disclosure system's architectural choice becomes most consequential.

Because the FEC doesn't assign donor IDs, my pipeline has to decide whether ROBERT JOHNSON and BOB JOHNSON are the same person. I use Jaro-Winkler similarity at a 0.92 threshold with 70/30 name-to-employer weighting. At 0.92, they merge. At 0.93, they don't. The threshold is a knob. The knob creates a reality. Every number downstream, including the 21.3%, lives in that reality.
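A minimal sketch of that dedup decision, using Python's stdlib difflib as a stand-in for Jaro-Winkler (so the scores, and therefore the sensible threshold, differ from the pipeline's; the 70/30 weighting and the 0.92 knob are the real parameters from the text):

```python
from difflib import SequenceMatcher

def sim(a, b):
    # Stand-in similarity; the pipeline uses Jaro-Winkler, so real scores
    # (and the sensible threshold) differ from what difflib produces.
    return SequenceMatcher(None, a, b).ratio()

def donor_score(name_a, emp_a, name_b, emp_b, w_name=0.7, w_emp=0.3):
    # 70/30 name-to-employer weighting, as in the pipeline.
    return w_name * sim(name_a, name_b) + w_emp * sim(emp_a, emp_b)

score = donor_score("ROBERT JOHNSON", "BOEING CO", "BOB JOHNSON", "BOEING COMPANY")
same_person = score >= 0.92  # the knob; move it and the merged reality moves
```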

Here is the thing I keep coming back to: the FEC has 36 million contribution records. If those records had stable donor IDs (the way, say, the Social Security Administration assigns stable IDs to earners),5 the entire donor-industry analysis would be trivial. You could look up a donor, see their employer, classify their industry, and connect them to their contributions in a single query. The reason it requires a deduplication engine running fuzzy string matching at thresholds chosen by a person staring at pairs of records at his desk is that the FEC designed a system for filing, not for understanding, and the difference between those two design goals is the difference between JOHNSON, ROBERT appearing three times as three different people and JOHNSON, ROBERT appearing once as one person with three contributions.

Three contributions, incidentally, totaling $6,200, to the same legislator, in the same cycle, from the same human being who happens to have a nickname and a job change. The mirror reflects three donations from three strangers. The map, if you build one, shows one person with a pattern.

Multiply this by 36 million and you have the donor identity problem. Multiply the donor identity problem by the legislator identity problem by the bill identity problem6 by the lobbying identity problem, and you have the reason that connecting money to votes requires a pipeline instead of a query, which is itself the reason that the connection between money and votes is, for the vast majority of Americans, functionally invisible despite being technically public.

IV. What the Machine Sees

So you build the pipeline. You resolve the identities. You connect the databases. And now you have to decide what to show the model, which is a decision that is often presented as engineering but is actually theory.

A feature, in machine learning, is a thing (number, category, other) you believe contains information relevant to the prediction you're trying to make. "Believe" is doing real work there. By including a measurement, you assert it matters. By excluding one, you assert (or at least imply) it doesn't. The model will tell you whether you were right, which is nice, but the initial decision about what to measure is a hypothesis wearing an engineering hat.

The Congressional Yield Index uses 239 features per vote. Each vote is a (legislator, bill) pair. The features decompose into three categories:

The legislator (~80 features): party, DW-NOMINATE ideology scores,7 district demographics, committee assignments, seniority, caucus memberships, interest group ratings, predecessor voting patterns, attendance rate, defection trend, and 49 features from structured political profiles generated by submitting all 535 current legislators to an LLM.8

The bill (~50 features): subject tags, sponsor party, vote type (passage, amendment, procedural, nomination), and 50 PCA-reduced dimensions of bill text embeddings from a local model that converts 200 pages of legislative language into 50 numbers that capture semantic similarity. What those 50 numbers lose in the conversion is, roughly, everything that makes legislation legislation rather than a vector.

The financial relationship (62 features): For each of the top 50 CRP industry categories, what percentage of this legislator's total funding came from that industry. The concentration index of that vector.9 PAC-to-individual ratio. In-state vs. out-of-state ratio. Independent expenditure totals. Dark money exposure. Donor cluster exposure (from a Gaussian Mixture Model on the co-funding graph, which finds emergent clusters of donors who fund the same legislators; the clusters are not "the pharmaceutical lobby" or "Wall Street" but statistical structures the algorithm discovered, real in the sense that they're predictive and unreal in the sense that nobody in Washington would recognize them).10 Financial conflict scores (the dot product of the legislator's donor-industry vector and the bill's stakeholder-industry vector, weighted by direction and magnitude). Lobbying pressure. Employer concentration.
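The financial conflict score described above is, mechanically, a signed dot product. A schematic sketch in which the CRP-style industry codes and every number are invented:

```python
# Schematic of the financial conflict score: a signed dot product of the
# legislator's donor-industry shares and the bill's stakeholder weights.
donor_shares = {"F1000": 0.18, "H3000": 0.07, "E1100": 0.02}  # share of total funding
bill_stakes = {"F1000": 1.0, "E1100": -0.5}  # sign encodes direction of interest

def conflict_score(shares, stakes):
    return sum(shares.get(industry, 0.0) * w for industry, w in stakes.items())

score = conflict_score(donor_shares, bill_stakes)  # 0.18*1.0 + 0.02*(-0.5)
```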

These 62 features are the ones that produce the 21.3%. They are also the ones that were hardest to build, because they required connecting the databases that were designed not to connect. The industry concentration vector requires linking contributions (FEC) to industry codes (OpenSecrets) via donor names (fuzzy-matched) and committee IDs (crosswalked). The lobbying pressure feature requires a triple join: contributions (FEC) → organizations (built from scratch, because no shared org ID exists across the databases) → lobbying filings (Senate LDA) → issue codes → bill subjects (Congress.gov). Every one of these joins crosses an institutional boundary. Every boundary crossing is a potential failure point, a place where identities don't match, formats don't align, or the mirror on one side is reflecting something that the mirror on the other side doesn't recognize.

The 62 financial features are, in other words, the features that the disclosure system's architecture made maximally difficult to compute. The non-financial features (party, ideology, district, committee) were comparatively easy, because they come from databases that were designed to be legible. Make of this what you will.

V. The Machine That Plays Twenty Questions

The model is XGBoost, which plays 20 Questions.

I mean this almost literally. In 1948, Claude Shannon published "A Mathematical Theory of Communication," in which he proved that information is the resolution of uncertainty and that the optimal way to identify an unknown is to ask the sequence of binary questions that eliminates the most possibilities per question. A decision tree does exactly this. It takes 2.3 million votes, each described by 239 features and a binary target (did this legislator defect from their party?), and learns a sequence of yes/no splits. Is this a Democrat? Is their DW-NOMINATE above -0.3? Does more than 10% of their funding come from financial services? Each split is chosen to maximize Shannon's information gain: the biggest reduction in uncertainty about the outcome. Six layers of questions, and you've carved the space into regions where the answer is mostly yes or mostly no.11
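The split criterion is computable in a dozen lines. A sketch of Shannon entropy and the information gain of one yes/no question, with invented leaf counts and defection rates:

```python
from math import log2

def entropy(p):
    # Shannon entropy (bits) of a binary outcome with defection rate p.
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def information_gain(n_left, p_left, n_right, p_right):
    # Entropy reduction achieved by one yes/no split into two child nodes.
    n = n_left + n_right
    p_parent = (n_left * p_left + n_right * p_right) / n
    children = (n_left / n) * entropy(p_left) + (n_right / n) * entropy(p_right)
    return entropy(p_parent) - children

# Invented numbers: a question that isolates a high-defection pocket.
gain = information_gain(1000, 0.40, 9000, 0.08)
```

The tree greedily asks whichever question maximizes this quantity at each node, which is exactly the optimal-question strategy from 20 Questions.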

The model uses 500 such trees, each one trained to correct the errors of the previous ensemble (gradient boosting), with hyperparameters selected by Bayesian optimization over 100 trials.12

[Diagram: one tree from the ensemble, five splits deep. party_R ≤ 0.5 ("Is this a Democrat?"), dw_nom_1 > -0.312 ("Moderate for a Democrat?"), cmte_finance = 1 ("On a financial services committee?"), ind_F1000 > 0.103 (">10% of funding from finance?"), lobby_pressure > 4.21 ("High lobbying pressure?"). Leaves range from strong party-line (defection probability 0.08, n=12,450; 0.11, n=18,700; 0.15, n=3,200) through leans party-line (0.31, n=4,210) and notable defection risk (0.44, n=890) to likely defects (0.73, n=1,842).]

The thing I want you to notice, though, is what the tree doesn't know. It doesn't know what a Republican is. It doesn't know what the pharmaceutical industry is. It doesn't know what it means for a legislator to defy their party on a vote about banking regulation while their top five donors are banks. It knows that certain conjunctions of feature values, in the training data, correlated with defection at certain rates. It has answered every question. It has understood nothing.

Shannon himself recognized this. His theory deliberately excluded meaning. "Frequently the messages have meaning," he wrote. "These semantic aspects of communication are irrelevant to the engineering problem." The 21.3% lives in the same exclusion zone. It is a statistical statement. Whether it means what you think it means when you read it is, as Shannon would say, a different department.

VI. The Counterfactual

The model has two stages. The architecture is standard.

Stage 1 sees 177 features: everything except money. Party. Ideology. Seniority. Committees. Bill subjects. Bill embeddings. Interest group ratings. District demographics. Caucus memberships. Vote type. Predecessor patterns. Stage 1 is a what-if machine. It asks: if you could not see this legislator's donors, how would you predict their vote?

Stage 2 sees all 239 features, including the 62 financial ones. Same algorithm. Same data.

For each vote, the pipeline computes P₁ (Stage 1's defection probability) and P₂ (Stage 2's). The financial delta is P₂ − P₁.
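The delta itself is one subtraction per vote. A schematic with invented probabilities:

```python
# The two-stage delta: same algorithm, two feature sets, one subtraction
# per vote. The probabilities below are invented.
votes = [
    {"p1": 0.12, "p2": 0.31},  # money-visible model sees more defection risk
    {"p1": 0.08, "p2": 0.06},  # financial features can also push downward
]
deltas = [v["p2"] - v["p1"] for v in votes]  # the "financial delta"
```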

And here is where I need to slow down, because the counterfactual is the foundation of the entire analysis, and the foundation has a crack in it that I think is important enough to show you rather than plaster over.

Stage 1 does not simulate a world without campaign money. It simulates a world in which the model cannot see campaign money. These are different things. In a world without contributions, the legislator might behave differently because the incentive structure has changed. Industries that currently shape agendas through financial support might still shape them through expertise, information asymmetry, revolving-door employment, or the simple fact that the pharmaceutical industry employs 30,000 people in a particular district and those people vote. The legislator who votes with pharma might vote with pharma even in a moneyless world, because they genuinely believe the industry's position is correct, or because their committee staff drafted the language and the staff came from industry, or because their district would lose jobs if the bill passed.

In Stage 1's world, all of those mechanisms still exist. The model just can't see the financial features. The delta between P₁ and P₂ is not "the effect of money on this vote." It is "the additional predictive power that financial features provide after everything else the model can observe has been accounted for." The first claim is causal. The second is statistical. The gap between them cannot be closed with observational data. To close it, you would need to randomly assign campaign contributions to legislators and observe the result, which would require an experimental design that violates several federal laws and all known principles of political science.13

So the 21.3% lives in the gap. It is a measurement of something, but the something is ambiguous: it could be money's influence on votes, or money's correlation with the part of voting behavior that ideology and party don't already explain, or some combination that varies by member and by vote in proportions we cannot determine.

I want to be transparent about this ambiguity because I think being honest about it is a prerequisite for saying what I'm about to say, which is that despite the ambiguity, I believe the number is pointing at something real.

Here is why. The confounders that Stage 1 controls for are not weak. DW-NOMINATE is the gold standard of ideology measurement. Party membership is the single strongest predictor of votes. District demographics capture constituent pressure. Committee assignments capture institutional position. Interest group ratings capture revealed policy preferences. Bill embeddings capture semantic content. If the financial signal were merely proxying for ideology (if the money followed preferences rather than did anything to them), Stage 1 should absorb it, and Stage 2 should show minimal improvement. Instead, Stage 2 adds predictive power on top of all of that. Consistently. And in patterns that are geographically, temporally, and institutionally coherent: the tenure curve that peaks mid-career, the committee effects that track jurisdictional oversight, the lobbying-defection correlation, the party asymmetries that mirror structurally different donor ecosystems.

Noise doesn't do this. Noise is uniform. Confounders are monotone. The signal in the financial features is neither. It clusters where you'd predict it would cluster if money mattered, appears where you'd predict it would appear if money mattered, and doesn't appear where you'd predict it wouldn't appear if money mattered. This is not proof. It is the kind of evidence that makes a careful person update their beliefs.

VII. Decomposition

The 21.3% is computed using Shapley values, a concept from cooperative game theory that Lloyd Shapley published in 1953 and that won him a Nobel Prize sixty years later. The Shapley value decomposes a prediction into additive contributions from each feature. For 239 features, the brute-force computation requires 2²³⁹ subsets (a number with 72 digits, roughly eight orders of magnitude shy of the number of atoms in the observable universe), but TreeSHAP computes exact values in polynomial time by exploiting the structure of decision trees.
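For a toy model with three features, the brute-force sum is tractable, and it shows why the decomposition is trustworthy: for an additive model, the Shapley values recover each feature's effect exactly. The feature names and effects below are invented:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values by brute force over all coalitions.
    Feasible only because n is tiny; at n = 239 the coalition count is
    the 72-digit number that makes TreeSHAP necessary."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

# Toy additive "model": the prediction is a sum of per-feature effects,
# so the Shapley values should recover each effect exactly.
effects = {"party": -0.15, "ideology": 0.09, "finance": 0.08}

def value(coalition):
    return sum(effects[f] for f in coalition)

phi = shapley_values(list(effects), value)
```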

For each vote, the decomposition produces 239 numbers. Feature ind_H3000 contributed +0.02 to defection probability. Feature dw_nominate_dim1 contributed -0.15. The 21.3% is: absolute SHAP values for all 62 financial features, divided by total absolute SHAP values across all 239, averaged over the test set.
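Schematically, the aggregation looks like this (all SHAP magnitudes invented; the real computation runs over 239 features and the full test set):

```python
# Per vote: |SHAP| mass on the financial features over |SHAP| mass on
# all features, then averaged across votes.
votes_shap = [
    # (|SHAP| of the financial features, |SHAP| of all features)
    ([0.09, 0.08, 0.06], [0.15, 0.09, 0.08, 0.08, 0.06, 0.05, 0.04]),
    ([0.02, 0.01], [0.20, 0.11, 0.02, 0.01, 0.01]),
]
per_vote_share = [sum(fin) / sum(total) for fin, total in votes_shap]
financial_share = sum(per_vote_share) / len(per_vote_share)
```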

[Figure: SHAP waterfall for one vote (moderate Democrat, financial regulation bill, 119th Congress). Base rate 0.12; dw_nominate_dim1 -0.15; ind_F1000 +0.09; party_R -0.08; financial_conflict +0.08; lobby_pressure_log +0.06; cmte_finance +0.05; pac_ratio +0.04; caucus_progressive -0.03; donor_cluster_7 +0.03; bill_embed_pc3 +0.03; igr_aclu -0.02; ie_total_log +0.02; seniority +0.02; ~225 others +0.05. Final prediction: 0.31, with financial features carrying 42.7% of the attribution.]

I want to flag one thing about this number and then move on. The SHAP values decompose a prediction; they do not explain a phenomenon. The difference is the difference between "60% of the repair cost was the transmission" and "the transmission caused the breakdown." One is a partition. The other is a story. The 21.3% is a partition. The companion post sometimes uses language that implies it's a story. I chose that language because the accurate version ("financial features account for 21.3% of the mean absolute SHAP attribution across a gradient-boosted tree ensemble trained on 2.3 million votes with a defection target, evaluated on a time-separated test set") would lose every reader I'd like to keep. Whether the imprecision is forgivable depends on how much you trust the rest of the machinery to justify it.

VIII. The Target and the Test

The model predicts defection from party. Not yea or nay (which is trivially predictable at 95% accuracy by guessing each party's usual position), but the interesting question: when does a legislator break? The defection rate is 10-15%. The trivial baseline achieves 85-90%. The model achieves 90.8% with a 0.765 AUC-ROC.14

The evaluation has three layers. Standard metrics on a time-separated test set. Baselines (NeverDefect, party-only, logistic regression) that the model must beat to justify its existence; it beats all three, though the margin over logistic regression is modest enough that I want to flag it rather than crow about it. And the CYI Evaluation Set: a curated subset of votes where money plausibly matters most (high financial conflict, close margins, high defection rates, bills with multiple identified stakeholder industries; a vote must meet at least 2 of 4 criteria). If the financial signal is real, the model should perform better on this subset. It does. If the signal were an artifact of bill contentiousness, you'd expect the performance to be uniform. It is not.
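The arithmetic behind "the model must beat the trivial baseline" is worth making explicit. The defection rate below is invented within the 10-15% band stated above; 0.908 is the accuracy reported in the text:

```python
# A NeverDefect baseline always predicts the party-line position, so its
# accuracy is simply one minus the defection rate.
defection_rate = 0.12
never_defect_accuracy = 1 - defection_rate  # 85-90% for free
model_accuracy = 0.908
lift = model_accuracy - never_defect_accuracy  # the part worth having
```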

IX. What the Plumbing Reveals

This was supposed to be a technical appendix. I want to acknowledge that, because I think the gap between what I thought I was writing and what I ended up writing is itself part of the finding.

I sat down to describe a data pipeline: sources, joins, features, model, evaluation. A document for the technically curious. And what kept emerging, section after section, was not how the pipeline works but what the pipeline's existence implies. Because every hard part of building this system was hard for the same reason: the data was not designed to connect to anything else. The identities don't match. The formats don't align. The FEC records dollars, Congress.gov records votes, the Senate records lobbying, and none of them record the relationship between any of these things, because none of them were asked to.

Congress passed laws requiring that every campaign contribution above $200 be disclosed. Each disclosure mandate was fulfilled. Each mandate, fulfilled in isolation, produces a filing cabinet of extraordinary completeness that is, in isolation, of almost no use for answering the question the disclosure was presumably designed to let you ask. The result is a transparency regime that is, in the strictest legal sense, transparent.

And in every practical sense these disclosures might as well be written on the dark side of the moon. The information is not hidden. The legibility is.

The effect, whether it was intended or coalesced, is that connecting money to votes requires building a machine. The machine I built uses gradient-boosted trees, SHAP decomposition, donor clustering, and bill embeddings. It ingests 34GB of data. It makes 36 million joins. And after all of that, it produces a number (21.3%) that is, at best, a noisy estimate of the financial system's statistical footprint on legislative behavior, hedged by every caveat I've described in this post: the deduplication threshold that creates its own reality, the counterfactual that imagines a world that doesn't exist, the SHAP values that partition without explaining, the emergent clusters that are real but not interpretable.

All of this was necessary to see a thing that was, legally speaking, never hidden.

 

The Federal Election Campaign Act of 1971 begins with a statement of purpose. The purpose, per the legislative history, is to promote "public disclosure of the sources and uses of funds in Federal election campaigns." The word "disclosure" appears 70 times15 in the original act. The word "comprehension" appears zero times. And I keep coming back to this, because I think the gap between those two words is where the entire story lives.

Disclosure means making information available. Comprehension means making information usable. These are different things in the way that putting all the ingredients on the counter is different from cooking dinner. The FECA put every ingredient on the counter. It did not, at any point, in any section, require that the ingredients be combinable into a meal. The result, fifty years later, is a countertop so covered with disconnected ingredients that the only people who can cook with them are the ones who build their own kitchens from scratch, which is a group that includes, as far as I can determine: some political scientists, some investigative journalists, some well-funded advocacy organizations, and a dude writing a blog post sitting next to a cat litterbox.16

I want to be careful with what I'm saying here, because there is a version of this argument that sounds like a conspiracy theory, and that is not what I mean. Nobody designed this. The opacity is emergent. But emergent opacity and engineered opacity produce the same result for the citizen standing outside the system trying to look in, which is: nothing.

And here's where I arrive at a question I'm not qualified to answer and that I don't think the unanswerability excuses me from asking: is a disclosure system that discloses everything and connects nothing actually fulfilling the purpose for which it was created? When James Madison wrote in Federalist No. 52 that the House of Representatives should have "an immediate dependence on, and an intimate sympathy with, the people," he was articulating a principle that assumes citizens can see what their representatives are doing and why. The "why" is the part that requires connection. And the connection is the part that requires plumbing. And the plumbing is the part that the law neither provides nor requires, which means the "intimate sympathy with the people" part depends, structurally, on whether someone felt like building a data pipeline that week.17

The 21.3% is interesting. I believe it points at something real. But I think the deeper finding is the one the pipeline's existence reveals: that the system is working exactly as designed, that the design achieves exactly what it specifies, and that what it specifies is not what most people assume it specifies, and that the distance between the specification and the assumption is wide enough to hold $14 billion and 350,000 votes and the entire relationship between the two, in plain sight, in public, invisible.

Footnotes

  1. The seven sources: FEC bulk data (fec.gov), Congress.gov API, VoteView (voteview.com), congress-legislators GitHub repo (unitedstates/congress-legislators), Senate LDA API (lda.senate.gov), OpenSecrets/CRP bulk data, and ProPublica Nonprofit Explorer. Total raw data: 36 million contributions, 2.4 million votes, 460,000 lobbying filings, 81,000 bills, 12,000 legislators. If you want to reproduce this, bring coffee and a high tolerance for government data formats.

  2. The Federal Election Campaign Act of 1971, as amended, requires the Commission to "receive and make available" reports and statements. The word "analyze" does not appear in the mandate. This is not an accident and not a trivial distinction. An agency that receives and makes available is a filing cabinet. An agency that analyzes is a regulator. The FEC is, by design, the former, which is one of several reasons why the FEC has been, for roughly its entire existence, the subject of intense debate about whether it is structurally capable of fulfilling the purpose most people assume it exists to fulfill. The Commission has an even number of members (six), no more than three from any party, and a history of partisan deadlocks so consistent they have become, in the academic literature, an object of study in their own right.

  3. The congress-legislators YAML is maintained by volunteers at the unitedstates GitHub organization. It maps Bioguide to FEC to OpenSecrets to VoteView ICPSR to GovTrack IDs. Its memory footprint, loaded into a bidirectional lookup index, is about 15MB. The infrastructure connecting 40% of American political data projects weighs less than a single high-resolution photo of a legislator.

  4. Ew.

  5. In 1938, a wallet manufacturer in Lockport, New York, included a sample Social Security card in its wallets bearing the real SSN of a company secretary named Hilda Schrader Whitcher. 40,000 Americans adopted Hilda's number as their own. At the peak in 1943, 5,755 people were simultaneously using it. The SSN was designed for payroll tracking and printed NOT FOR IDENTIFICATION, which turned out to be one of the least enforceable directives in the history of American bureaucracy. The IRS adopted it in 1962. The military in 1969. The credit agencies eventually. Each adoption was pragmatic. The cumulative effect was a de facto national ID that nobody voted for. The FEC, notably, did not adopt it as a donor identifier, which is one of many choices that is easy to criticize and hard to fault.

  6. The bill identity problem is its own special nightmare. VoteView identifies bills by chamber-specific codes (HR21, S567). Congress.gov uses a different format. The pipeline reconciles them, but VoteView frequently assigns placeholder IDs (prefixed vv-) for roll-call votes whose associated bill can't be matched to Congress.gov metadata, which means a non-trivial number of votes float in the database attached to bills that the pipeline knows exist but can't fully describe. These phantom bills are excluded from analysis but their existence is a reminder that even the vote database, which is the simplest part of the pipeline, contains lacunae.

  7. DW-NOMINATE stands for Dynamic Weighted NOMINAl Three-step Estimation, an acronym that could only have been produced by political scientists. The first dimension captures the liberal-conservative spectrum. The second dimension captures something that political scientists have been arguing about for decades.

  8. Structured political profiles for 535 members of Congress, generated via the Anthropic Batch API: $9. The profiles include interest group ratings, district demographics, electoral vulnerability, caucus memberships, biographical details, and donor tension analysis. This is opposition research at 99.997% below market rate, which is either a marvel of technology or a sign that the opposition research industry should update its pricing model.

  9. The Herfindahl-Hirschman Index was developed for antitrust economics, where it measures market concentration. Applied to campaign donors, it measures how close a legislator's funding base is to a monoculture. A legislator whose funding is a monopoly of one industry is, at minimum, in a different structural position than one whose funding is a competitive market. Whether "different structural position" translates to "different voting behavior" is what the model is trying to determine.
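  The index itself is one line of arithmetic: square each group's share of the total, then sum. A minimal sketch with made-up dollar figures:

  ```python
  # HHI over a donor base: 1.0 is a monoculture, values near 0 mean a
  # highly diversified funding base. Amounts below are illustrative.
  def hhi(amounts_by_group):
      total = sum(amounts_by_group.values())
      return sum((amt / total) ** 2 for amt in amounts_by_group.values())

  monoculture = hhi({"pharma": 900_000})
  diversified = hhi({"pharma": 250_000, "tech": 250_000,
                     "labor": 250_000, "retail": 250_000})

  assert monoculture == 1.0            # one industry, all the money
  assert abs(diversified - 0.25) < 1e-9  # four equal shares: 4 * 0.25**2
  ```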

  10. The donor clusters are computed via soft Gaussian Mixture Model clustering on the co-funding graph: donors who give to the same legislators end up in the same clusters, regardless of what industry they work in or where they live. The clusters are emergent, which means they were not defined by any prior theory about how donor networks work. Whether they correspond to actual coordination among donors or merely to shared preferences is, like most interesting questions in this project, one the model raises and cannot answer.
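  A toy version of the soft clustering, assuming each donor is represented by the vector of legislators they fund — the real pipeline's feature construction, dimensionality, and number of components aren't specified in the text, so everything below the imports is illustrative:

  ```python
  import numpy as np
  from sklearn.mixture import GaussianMixture

  # 12 donors x 6 legislators: two blocs that fund disjoint halves of the
  # chamber, with random contribution amounts. Synthetic data.
  rng = np.random.default_rng(0)
  bloc_a = rng.random((6, 6)) * [1, 1, 1, 0, 0, 0]
  bloc_b = rng.random((6, 6)) * [0, 0, 0, 1, 1, 1]
  X = np.vstack([bloc_a, bloc_b])

  gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
  gmm.fit(X)
  soft = gmm.predict_proba(X)  # soft memberships: each row sums to 1

  assert soft.shape == (12, 2)
  assert np.allclose(soft.sum(axis=1), 1.0)
  ```

  "Soft" is the operative word: unlike k-means, every donor gets a probability of belonging to every cluster, which is the honest representation when a donor co-funds across blocs.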

  11. Shannon's information gain, which measures the reduction in entropy from a binary split, is literally the criterion that decision tree algorithms use to choose each split. The 20 Questions analogy is not a metaphor. Decision trees are Shannon's information theory applied to classification. The optimal question is the one that resolves the most bits of uncertainty about the target variable. The tree asks: what single yes/no question, right now, would most reduce my confusion about whether this legislator is about to defect?
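  The criterion in miniature — this is the textbook calculation, not anything specific to this project's model, with labels coded 1 for "defected" and 0 for "held":

  ```python
  from math import log2

  def entropy(labels):
      """Bits of uncertainty in a binary label set."""
      n = len(labels)
      if n == 0:
          return 0.0
      p = sum(labels) / n
      return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

  def information_gain(labels, answers):
      """Bits of uncertainty resolved by one yes/no question."""
      yes = [l for l, a in zip(labels, answers) if a]
      no = [l for l, a in zip(labels, answers) if not a]
      weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
      return entropy(labels) - weighted

  # A question that perfectly separates defectors from loyalists resolves
  # the full 1.0 bit of uncertainty in a 50/50 label set.
  labels  = [1, 1, 0, 0]
  answers = [True, True, False, False]
  assert abs(information_gain(labels, answers) - 1.0) < 1e-9
  ```

  The tree's greedy choice at each node is simply the question with the highest value of `information_gain`.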

  12. 500 trees, max depth 6, learning rate 0.05, 80% row and column subsampling. Hyperparameters via Optuna with time-series cross-validation (train on earlier congresses, validate on later ones, so the model never sees the future during training).
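  The validation scheme is the interesting part, and it can be sketched in pure Python. The hyperparameter dict restates the footnote's tuned values; the fold generator is my own illustration of "train on earlier congresses, validate on later ones" (the actual XGBoost and Optuna calls are omitted):

  ```python
  # Tuned values from the footnote, in XGBoost's parameter names.
  PARAMS = {"n_estimators": 500, "max_depth": 6, "learning_rate": 0.05,
            "subsample": 0.8, "colsample_bytree": 0.8}

  def time_series_folds(congresses, n_folds):
      """Yield (train_congresses, validation_congress), oldest fold first.

      Every fold trains strictly on the past: no validation congress ever
      appears in, or precedes, its own training set.
      """
      ordered = sorted(congresses)
      for i in range(len(ordered) - n_folds, len(ordered)):
          yield ordered[:i], ordered[i]

  folds = list(time_series_folds([113, 114, 115, 116, 117, 118], n_folds=3))
  assert folds[0] == ([113, 114, 115], 116)
  assert all(val > max(train) for train, val in folds)
  ```

  The ordinary shuffled k-fold split would leak the future into training, which for a forecasting model is the difference between an honest score and a flattering one.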

  13. The technical term for this in the political science literature is the "selection vs. influence" problem, and it has been the subject of approximately ten thousand published papers, none of which have resolved it. The best summary I've encountered is from a researcher who said the field's consensus is "probably both, in proportions that vary by member and by vote, in ways we cannot measure." This is a very expensive way of saying "we don't know."

  14. Weather forecasting achieves 80-90% accuracy at the 24-hour horizon, on a problem with much more regular underlying physics. Whether human political behavior should be more or less predictable than a thunderstorm is a question I find myself thinking about at odd moments.

  15. I dare you to count to see if I actually did it or not.

  16. I recognize that this list is both incomplete and self-aggrandizing, and I want to acknowledge that the self-aggrandizement bothers me more than the incompleteness, which probably tells you something about my priorities that I'd rather it didn't.

  17. Madison, of course, was writing in a context where the total federal budget was approximately $4 million and the concept of a Political Action Committee would not exist for another 165 years. Whether his principles about democratic visibility scale to a system processing $14 billion in campaign contributions across 535 members of Congress is a question that I suspect he would find fascinating and I find genuinely unanswerable.

campaign finance · technical · machine learning · XGBoost · SHAP