Key Takeaways
1. Early Statistics Served State Power and Social Order
Statistics was originally knowledge of the state and its resources, without any particularly quantitative bent or aspirations at insights, predictive or otherwise.
State needs drove counting. The origins of statistics in the 18th century were tied to statecraft, driven by the needs of rulers for information on population, land, and resources to manage taxes and wage wars. This early "statistics" was more descriptive than analytical, a qualitative account of the state.
Numbers became political. An "avalanche of numbers" began in the late 18th century, as states increasingly turned to quantification to understand and govern their populations. The US Constitution, for instance, enshrined the census, demonstrating the political nature of numbers from the outset.
Social physics sought order. Figures like Adolphe Quetelet aimed to create a "social physics" using data and mathematical tools from astronomy to find regularities in human behavior like crime and mortality. His concept of the "average man" sought to understand society as a whole, not just individuals, and aimed for gradual, non-revolutionary social reform.
2. Quantification Became a Tool for Ranking and Eugenics
A race would be characterized by its measurements of physical and moral qualities, summed up in the average man of that race.
From average to deviation. Building on Quetelet, Francis Galton used the normal curve not just to describe the average but to understand variation within a group, seeking to rank individuals and "races" based on quantifiable traits. This shift laid the groundwork for measuring and classifying human differences.
Eugenics aimed to "improve" humanity. Galton coined "eugenics" to describe the conscious effort to improve human hereditary qualities, particularly within national "races." He believed talent and character were inherited and could be bred, applying statistical concepts like "regression" and "correlation" to human populations.
Statistics justified social hierarchy. Figures like Karl Pearson institutionalized these ideas, using statistics to argue for the inherited nature of intelligence and moral traits. Their work provided a "scientific" veneer for existing class and racial biases, influencing policies like immigration restrictions and contributing to the idea that social problems stemmed from poor "breeding stock."
3. Statistical Methods Applied to Policy Raised Questions of Causality
It is not in the conditions of life, but in race and heredity that we find the explanation of the fact to be observed in all parts of the globe, in all times and among all peoples, namely, the superiority of one race over another, and of the Aryan race over all.
Data used to justify discrimination. Frederick Hoffman's work for Prudential Insurance used statistics to argue for the inherent inferiority of Black Americans, justifying discriminatory practices like charging higher insurance rates. His claims, though statistically flawed, were influential in providing a "scientific" basis for Jim Crow laws and practices.
Correlation vs. causation in policy. Statisticians like Udny Yule applied new tools like multiple regression to social problems, seeking to identify the "causes" of poverty. Yule's work, though technically advanced, highlighted the difficulty of inferring causality from observational data, a challenge still central to data analysis today.
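To make the correlation-versus-causation trap concrete, here is a minimal sketch with made-up data (not Yule's actual pauperism analysis): a hidden confounder drives both variables, so a simple regression of one on the other shows a large, misleading coefficient, and only a multiple regression that includes the confounder reveals the direct effect to be essentially zero.

```python
# Illustrative only: synthetic data, not Yule's poverty analysis.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

confounder = rng.normal(size=n)            # hidden common cause
x = 2.0 * confounder + rng.normal(size=n)  # "treatment", driven by the confounder
y = 3.0 * confounder + rng.normal(size=n)  # outcome, driven by the confounder, not by x

# Simple regression of y on x: a large, misleading slope.
X_simple = np.column_stack([np.ones(n), x])
coef_simple, *_ = np.linalg.lstsq(X_simple, y, rcond=None)

# Multiple regression controlling for the confounder: the slope on x collapses toward 0.
X_full = np.column_stack([np.ones(n), x, confounder])
coef_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print("y ~ x alone:        slope on x =", round(coef_simple[1], 2))
print("y ~ x + confounder: slope on x =", round(coef_full[1], 2))
```

On real observational data the confounder is usually unmeasured, which is exactly why regression coefficients alone could not settle debates about the causes of poverty.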
Proxies can obscure reality. The debate over poverty statistics revealed how administrative categories (like "pauperism") used as proxies for complex social phenomena (like "poverty") can flatten reality and lead to misleading conclusions. This "reification" of abstractions remains a danger in using data for policy and social understanding.
4. World War II Fueled Data-Intensive Computation and Secret Methods
Bletchley Park in 1944 was not the hutted, collegiate, informal organisation of popular myth, but rather [one designed] to scale up and industrialise the techniques developed by the master codebreakers, and to create systems allowing their methods to be applied to thousands of items of data, at speed, by staff without an Oxbridge level of education.
Codebreaking demanded scale and speed. World War II cryptanalysis, particularly at Bletchley Park and the NSA's predecessors, required processing massive volumes of data in real time to break enemy codes. This practical, martial need drove the development of specialized computing hardware and statistical methods, often outside traditional academic statistics.
Bayesian methods proved practical. Despite being disdained by many academic statisticians, Bayesian methods, which update probabilities based on new evidence and prior beliefs, were found to be highly effective for evaluating potential decryptions at Bletchley Park. This pragmatic, data-driven approach contrasted with the more theoretical focus of academic statistics.
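The flavor of that Bayesian scoring can be shown with a toy sketch (invented numbers, not the actual Banburismus procedure): start from prior odds that a candidate machine setting is correct, add a log-likelihood ratio for each observed clue, and convert the accumulated log-odds back into a posterior probability.

```python
# Toy Bayesian scoring of one hypothesis, loosely in the spirit of wartime
# log-odds accumulation. All numbers are invented for illustration.
import math

prior_prob = 1 / 1000            # prior: 1-in-1000 chance the candidate setting is right
prior_log_odds = math.log10(prior_prob / (1 - prior_prob))

# Each clue contributes log10(P(clue | hypothesis true) / P(clue | hypothesis false)).
clue_likelihood_ratios = [8.0, 5.0, 0.5, 12.0]   # >1 supports the hypothesis, <1 counts against it
log_odds = prior_log_odds + sum(math.log10(lr) for lr in clue_likelihood_ratios)

posterior_odds = 10 ** log_odds
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"posterior probability the setting is correct: {posterior_prob:.3f}")
```

The practical appeal is that evidence simply adds up: each clue contributes a score, so many weak clues can be as decisive as one strong one, and the scoring can be done mechanically, at scale.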
Intelligence agencies pioneered data infrastructure. The NSA, founded in 1952, faced unprecedented data processing needs, sponsoring intense work on larger storage mechanisms and faster processing. This secret work on handling massive data streams laid groundwork for future commercial data technologies, though its influence remained hidden for decades.
5. Early Artificial Intelligence Focused on Logic, Not Data
You might expect the analysis of data to be central to this project. It wasn’t.
AI's symbolic origins. The initial vision for Artificial Intelligence, notably at the 1956 Dartmouth Workshop, focused on emulating human intelligence through logic, symbols, and rules programmed into computers. This approach largely de-emphasized learning from data, viewing it as less central to complex cognitive tasks like reasoning and problem-solving.
Against data and measurement. Some influential figures in postwar social sciences and early AI advocated for abstract, axiomatic approaches over data accumulation and quantification. They believed true understanding and intelligence lay in formal theories and symbolic manipulation, not in processing messy empirical data.
Expert systems faced bottlenecks. Later AI efforts shifted to replicating specialized human expertise in "expert systems" by encoding expert knowledge into rules. However, converting tacit human knowledge into explicit rules proved difficult ("knowledge acquisition bottleneck"), highlighting the limitations of purely symbolic approaches without robust data learning.
6. Postwar Data Collection Scaled Dramatically, Challenging Privacy
The change in the variety and concentration of institutional relationships with individuals is that record keeping about individuals now covers almost everyone and influences everyone’s life.
Electronic data processing expanded. Following WWII, digital computers enabled the large-scale collection, processing, and storage of data about citizens and consumers, moving beyond older systems like punch cards. This transition, often driven by military and corporate needs, required significant investment in infrastructure and changes in organizational practices.
Databases raised privacy alarms. By the 1970s, the proliferation of government and corporate databases led to growing concerns about privacy and the potential for misuse of personal information. Legislators like Goldwater and Ervin proposed ambitious bills to give citizens control over their data, recognizing the dangers of combining information across different systems.
Privacy protections were limited. Despite early alarms, efforts to enact broad federal privacy laws covering the private sector largely failed in the US. This left a patchwork of sectoral regulations, allowing the free collection and exchange of personal data to become a de facto norm, setting the stage for the data-intensive business models of the internet era.
7. Machine Learning Prioritized Prediction Over Interpretability
Machine learning seemed far more ambitious when he described the same field more than a quarter century earlier, in 1984, separating the narrow goals of “pattern recognition” from the “symbolic” approach of AI.
From AI to prediction. As symbolic AI faced challenges, a related field, machine learning, emerged, drawing methods from pattern recognition, statistics, and neural networks. This field increasingly focused on the practical task of making predictions and classifications from data, often abandoning the broader goals of emulating human cognition or providing interpretable rules.
Neural nets revived by data and computation. Despite early setbacks and skepticism, neural networks saw a resurgence, particularly with the development of training algorithms like backpropagation and increasing computational power. These "deep learning" models proved powerful predictors but were often "black boxes," lacking human-understandable explanations for their decisions.
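As a rough illustration of what backpropagation does, the sketch below (a toy two-layer network on the XOR problem, written from scratch rather than with any framework) pushes the prediction error backward through the network, layer by layer, and updates the weights by gradient descent.

```python
# Minimal backpropagation sketch: a tiny two-layer network learning XOR.
# Illustrative only; real deep learning stacks many such layers and lets
# a framework compute these gradients automatically.
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W1 = rng.normal(size=(2, 8))   # input -> hidden weights
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))   # hidden -> output weights
b2 = np.zeros(1)
learning_rate = 1.0

for _ in range(10_000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)          # hidden activations, shape (4, 8)
    p = sigmoid(h @ W2 + b2)          # predictions, shape (4, 1)

    # Backward pass: error signals from the chain rule, output layer first
    d_p = (p - y) * p * (1 - p)       # gradient at the output
    d_h = (d_p @ W2.T) * h * (1 - h)  # gradient propagated back to the hidden layer

    # Gradient-descent weight updates
    W2 -= learning_rate * (h.T @ d_p)
    b2 -= learning_rate * d_p.sum(axis=0)
    W1 -= learning_rate * (X.T @ d_h)
    b1 -= learning_rate * d_h.sum(axis=0)

print(np.round(p.ravel(), 2))  # approaches [0, 1, 1, 0]
```

Even in this tiny case the learned weights carry no obvious human-readable meaning, which hints at why far larger networks are described as black boxes.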
Optimization became the key value. The success of machine learning, especially in competitions like the Netflix Prize, cemented prediction accuracy as a primary metric of success. This focus on optimizing quantitative performance, often at the expense of interpretability or theoretical understanding, became a defining characteristic of the field.
8. "Data Science" Emerged from Industry's Need for Scalable Data Analysis
At Facebook we felt like different job titles like research scientist, business analyst didn’t quite cut it for the diversity of things that you might do in my group.
Industry faced data overload. Companies like Facebook and Google accumulated data at unprecedented rates, overwhelming existing analytical tools and infrastructure. This challenge required new technologies like Hadoop and MapReduce to store and process massive, messy datasets across distributed systems.
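The MapReduce idea itself is simple enough to mimic on a single machine, as in this toy word-count sketch (the real Hadoop/MapReduce systems distribute the same map, shuffle, and reduce phases across clusters and handle machine failures):

```python
# Single-process illustration of the MapReduce pattern; real systems run the
# map and reduce phases in parallel across many machines.
from collections import defaultdict

documents = [
    "data about people became an asset",
    "people generate data and data generates value",
]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group; here, sum the counts per word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'data': 3, 'people': 2, ...}
```

The point of the pattern is that each phase parallelizes naturally: different machines can map different documents and reduce different words, which is what made messy, web-scale datasets tractable.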
A new role for data wrangling. The term "data scientist" emerged in industry to describe professionals with a blend of skills: statistical and machine learning knowledge, software engineering, and the practical ability to clean, organize, and analyze large, real-world data. This role prioritized getting value from data at scale for business needs.
Academic statistics lagged behind. While some statisticians like John Tukey and William Cleveland advocated for a more data-focused "data analysis" or "data science," academic statistics largely remained focused on mathematical theory and smaller, cleaner datasets. The rise of data science in industry highlighted a gap between academic training and the demands of the data-rich world.
9. The Pursuit of Data Ethics Is a Contested Space, Often Co-opted
What I always see in the AI literature these days is “ethics.” I want to strangle ethics.
Ethical failures spurred principles. Historical events like the Tuskegee Syphilis Study led to the Belmont Report, establishing principles (respect for persons, beneficence, justice) and institutional review boards (IRBs) for ethical human subjects research. This framework provided a model for thinking about applied ethics in research.
Tech companies adopted ethics, selectively. In response to public scrutiny over studies like Facebook's emotional contagion experiment, tech companies began adopting ethical language and creating internal review processes, sometimes framed as "evolving the IRB." This move was often seen as a form of self-regulation to preempt government intervention.
Ethics faces challenges in practice. Integrating ethical principles into corporate decision-making is difficult, especially when they conflict with profit motives. Critics point to "ethics washing" or "ethics theater," where companies prioritize the appearance of ethics over meaningful constraints. Technical solutions to ethical problems like fairness or privacy, while valuable, often address symptoms rather than underlying power structures.
10. The Attention Economy and Venture Capital Drive Data-Driven Persuasion
In an information-rich world, the wealth of information means . . . a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients.
Attention became a valuable commodity. As information became abundant and cheap to distribute, human attention became the scarce resource, creating an "attention economy." This met the existing advertising industry, which sought new ways to capture and monetize attention, moving from broadcast media to the internet.
Web platforms optimized for engagement. The rise of Web 2.0 and user-generated content created vast amounts of material, which platforms like Google and Facebook organized and presented using algorithms. These algorithms were increasingly optimized for metrics like "watch time" or "engagement," as these correlated with advertising revenue.
Venture capital accelerated disruption. Venture capital funding enabled internet companies to grow rapidly and acquire massive user bases before establishing profitable revenue models. This "blitzscaling" allowed data-rich companies to dominate markets quickly, often relying on network effects where more users generate more data, leading to better products and further growth.
11. The Future of Data Is an Unstable Contest of Powers
Civil society can also play an important role [in addition to] the other two rails of society— government and business.
Corporate power dominates data. Large tech companies hold immense power over data, controlling vast datasets and the infrastructure for analysis. Their business models, often based on surveillance advertising and optimization for engagement, drive the direction of data use and technology development.
State power seeks to regulate. Governments are increasingly attempting to check corporate power through regulation, though often facing challenges from outdated laws and corporate lobbying. Efforts include antitrust actions, privacy regulations (like GDPR and CCPA), and reevaluating legal protections like Section 230.
People power provides friction. Individuals and collective action within companies (employee activism, unionization) and externally (data boycotts, advocacy groups) provide a crucial check on corporate and state power. These efforts, though often incremental, can introduce friction and push for changes in norms, laws, and corporate practices to align data use with justice and democratic values.
Review Summary
How Data Happened receives mixed reviews, with an average rating of 3.57/5. Readers appreciate the book's ambitious scope and historical insights but criticize its organization and writing style. Many find it repetitive and dense, making it challenging to digest. The audiobook version is noted to be particularly difficult due to mathematical content. Positive aspects include fascinating historical points, ethical discussions, and well-researched content. Some readers recommend the print version over audio and suggest using the footnotes for additional resources.