Seventh workshop on the philosophy of information
William Wong (Middlesex University): Liberty and Security: VALCRI - Because we can … should we?
In this talk I will introduce VALCRI - Visual Analytics for sense-making in CRIminal intelligence analysis. This is an FP7 integrating project designed to develop a system that would facilitate human reasoning and analytic discourse, tightly coupled with a semi-automated, human-mediated semantic knowledge extraction engine. It is intended to assist police in analysing large data sets to discover crime patterns and other criminal intelligence that can be used to support investigations into specific crimes. I hope to discuss a number of philosophical issues associated with the design and development of such a joint cognitive system, intended to support law enforcement agencies in ensuring our rights to liberty and security. I hope to discuss some of the issues and trade-offs that we face in making design decisions, in particular how we steer between the boundaries of intelligent computation while encouraging human imagination, enabling insight, ensuring transparency, and engaging with fluidity and rigour.
Emma Tobin (UCL): Data in Protein Classification
Judith Simon (IT University of Copenhagen): Approaching Big Data - Relating Epistemology, Ethics & Politics
Big Data is a contested term and topic. While some emphasize its promises for economic prosperity, technological and societal advances, others have highlighted the ethical and societal dangers of big data practices. Still others consider big data to be merely a new buzzword. While the processing of large amounts of data is hardly new, what cannot be denied is that the types of data being processed and combined have changed, as have the implications of these data practices. Of particular concern to my analyses are big data processes related to humans and how they are likely to aggravate what Hacking (1995) has labeled the looping effect. My methodological claim is that many issues related to big data practices can only be understood by combining epistemological with ethical analyses, because numerous ethical problems, e.g. related to anonymity and privacy, can only be targeted if their epistemological premises, e.g. the possibilities of re-identification, are understood and taken into account. Moreover, formerly rather clear-cut divisions between data types that require more or less protection become fuzzy when the processing and combination of data can make seemingly innocent data highly predictive of sensitive realms of life. In addition to the epistemological and ethical lenses, big data practices must also be assessed from a political perspective for two reasons. On the one hand, big data practices are political because they are increasingly used to provide rationales and justification for political action, i.e. big data is considered to support or guide governance (e.g. White House 2014, Pentland 2014). On the other hand, big data practices must themselves be governed for various ethical and epistemological reasons. While the former link between big data and politics is clearly rooted in the historical and more general linkage between statistics and governance (e.g.
Porter 1995, Desrosières 1998), the latter requirements for the governance of big data pose some novel challenges due to a) the complexity and speed of development, b) the lack of skills and competencies related to data analytics and c) even more profoundly the lack of access to data in both politics and academia. I will end my talk with some considerations on how a potential governance of big data may deal with these challenges.
Desrosières, A. (1998). The Politics of Large Numbers: A History of Statistical Reasoning. Harvard University Press.
Hacking, I. (1995). The looping effects of human kinds. In D. Sperber, D. Premack & A. J. Premack (Eds.), Causal Cognition: A Multidisciplinary Debate. Oxford, UK: Oxford University Press, 351–383.
Pentland, A. (2014). Social Physics: How Good Ideas Spread - the Lessons from a New Science. New York, Penguin Press.
Porter, T. M. (1995). Trust in Numbers: The Pursuit of Objectivity in Science and Public Life. Princeton, Princeton University Press.
White House (2014). Big Data: Seizing Opportunities, Preserving Values. http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_repo...nt.pdf
Rob Kitchin (Maynooth University, Ireland): Big Data, new epistemologies and paradigm shifts
This paper examines how the availability of Big Data, coupled with new data analytics, challenges established epistemologies across the sciences, social sciences and humanities, and assesses the extent to which they are engendering paradigm shifts across multiple disciplines. In particular, it critically explores new forms of empiricism that declare ‘the end of theory’, the creation of data-driven rather than knowledge-driven science, and the development of digital humanities and computational social sciences that propose radically different ways to make sense of culture, history, economy and society. It is argued that:
(1) Big Data and new data analytics are disruptive innovations which are reconfiguring in many instances how research is conducted;
(2) there is an urgent need for wider critical reflection within the academy on the epistemological implications of the unfolding data revolution, a task that has barely begun to be tackled despite the rapid changes in research practices presently taking place.
After critically reviewing emerging epistemological positions, it is contended that a potentially fruitful approach would be the development of a situated, reflexive and contextually nuanced epistemology.
Louise Bezuidenhout (University of Exeter): Data Creation and Research Environments: Implications for the Re-Use of Open Data
The object of this talk is to examine the relationship between data curation and data re-use – more specifically, how scientists’ perceptions of the conditions under which the data are created influence the likelihood of re-use. This talk is informed by recent fieldwork done in a number of low/middle-income countries (LMICs) in which scientists expressed concern that their data would not be used if placed online. These scientists discussed a variety of reasons for these concerns, relating to the use of older equipment and methodologies, the low visibility of their research institutions and the limitations associated with lack of funding (such as being unable to run expensive additional experiments). As more and more attention is focused on how scientific data can best be shared so as to benefit humanity, Open Science (OS) has taken up the challenge of ensuring that the maximum amount of data are available to scientists for re-use. As a result, OS discussions have largely focused on expanding the availability of data online – with this availability being taken to signify the utility of the data accumulation and its eventual re-use. The concerns of LMIC scientists thus strike to the heart of the OS movement and force us to re-examine how far the likelihood of re-use can be assumed in OS discussions. As the amount, types and sources of data become increasingly available for re-use it becomes apparent that not all data will be usefully re-used. It becomes important to ask how scientists judge which data to re-use, and why some data are selected preferentially. Past research in the social studies of science has suggested that scientists rely heavily on the use of specific indicators to evaluate data, notably judgements about how the data were collected and analysed.
Thus, data evaluations are likely to be related to opinions about the equipment and methodologies used; the laboratory environment in which the research is conducted; the prestige, visibility and funding of the research group and so forth. Scientists, in deciding to trust data, therefore can be said to rely – at least in part – on constructed perceptions of the intellectual and physical laboratory environment in which the research has been conducted. In effect, how they evaluate the “trace” that the creator and context leave on the data becomes an important factor in the balance of trust. Scientists thus develop complicated strategies through which they select and discard online data. These strategies are undoubtedly influenced by the life assemblage in which they live, and depend on personal, cultural and political issues as well as scientific knowledge and experience. Determinations of reliability and re-usability of data are thus intimately connected to the personal perceptions of the individual creator and their environment as well as to their methodological integrity and reproducibility. The concerns of the LMIC scientists, mentioned above, stem from the recognition of these issues. If these scientists generate data using older equipment and methodologies as well as limited resources, in laboratories with low visibility and little international prestige, it is possible that their data will be dismissed as not-usable without a firm scientific basis. This highlights a tension between ensuring that data are of good enough quality and marginalizing good-quality data through unduly stringent concerns – a tension that is an important area for further discussion and one that requires considerable investigation. The recognition of the concerns of these LMICs raises two important problems for discussions on data sharing and Open Data.
Firstly, it is possible that the increasing demands for extended metadata may create situations in which data arising from LMICs are unfairly prejudged due to perceptions of research environments. Secondly, some scientists in LMICs may not be participating in data sharing activities due to the perception that their actions may be futile and their data not re-used. Highlighting these issues raises concerns that extend beyond practical data sharing discussions. Indeed, the need to understand how individual strategies for judging data are developed by scientists, and the implications of applying them, has extensive implications for ontological discussions that consider data accumulation and the direction of scientific research.
Reuben Binns (University of Southampton): Big data, data protection, and epistemic standards
In recent years, some computer scientists and technologists have claimed that the tools and techniques of 'big data' mean that the traditional approach to science is no longer relevant. They argue that traditional processes of theory-driven hypothesis-generation and testing are inefficient, and can be replaced by automatically crunching vast quantities of data (Hey, 2012; Anderson, 2008). Over the same period, big data has come to be used extensively in many industries, particularly those involving the collection of personal data. In a typical scenario, machine learning will be applied to predict some output or target variable from a large pool of input parameters. For instance, an advertising network might use their amassed online browsing data to predict an individual consumer's purchasing intentions, or a mortgage broker might employ logistic regression on historical demographic data to identify high and low risk loan applicants. This analysis can proceed in the absence of any prior hypotheses about possible causal relationships between the input parameters and the target variable. Instead, so the proponents of this approach argue, these relationships are surfaced through the process of analysis. In addition, under this approach, the collection of data does not need to be justified by reference to a specific experiment. The ease of storing and sharing data means that it can always be re-used and re-purposed in other contexts, for other purposes. Whether or not this approach is justified is clearly an important question for the future of scientific endeavour (Pietsch, 2014). But it is also an important consideration for the governments, organisations, and businesses who adopt this approach. Public concerns about surveillance and the uses of personal data may increasingly force these entities to justify their activities on epistemic as well as moral and political grounds.
Aside from public debates, the existing regulations which apply to these activities – data protection and privacy laws – must also evolve to meet these new practices. However, as they stand, these regulations seem designed for organisations whose use of data approximates the scientific method, however imperfectly. Consider the recommendations around purpose specification, collection limitation and use limitation common in many data protection laws and other privacy frameworks (OECD, 1980). The principle of purpose specification states that the purposes for which personal data is collected should be specified at the point of collection, and future uses be constrained by those same purposes. In its guidance issued on big data, the UK regulator makes clear that, just as with any form of data processing, organisations using big data techniques still need to be clear about the purposes of processing and collect only the minimum amount of data necessary for those purposes. This seems to contradict the approach of collecting and sharing data in case it turns out to be useful, if the new purposes are incompatible with the original purpose. But the problem goes further than restrictions on re-use. In addition to specifying a purpose, data controllers should only collect data that is relevant to that purpose. This requirement conflicts with what many proponents of big data have said is its greatest strength. In a popular magazine article which popularised the idea, Chris Anderson characterised the differences between traditional science and data science along the following lines. The traditional scientific method involves formulating a hypothesis, deciding what data is relevant to testing it, and running an experiment to gather the data, analyse it, and confirm or falsify the hypothesis. This method lends itself well to purpose specification and data minimisation; the data to be gathered is specifiable and justifiable in advance, by reference to the scientist's purpose.
But according to Anderson, the big data scientist doesn't operate that way. She doesn't start out with a hypothesis; she simply crunches all the available data in the hope of finding correlations. Preconceived ideas about which variables might correlate with each other have no place in the big data approach. So even if the purpose of data collection is known in advance, the range of data required to serve that purpose is potentially infinite. The principle of data minimisation states that organisations should only hold the minimum amount of personal data they need to properly fulfil their purpose. The problem is that, according to the proponents of big data, relevance and excess can only be established after the analysis has been done. At any time, predictive models may need to be tweaked in light of new correlations that are found. So what's relevant, and what can be considered 'excessive', is constantly changing and cannot be specified from the outset. It may not be the place of regulators to adjudicate the epistemic standards of private organisations. However, in designing regulations, they cannot help but assume certain conceptions about how data science works. They would do well, therefore, to engage with the methodology of big data and the evaluation of epistemic standards.
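The hypothesis-free approach described above can be caricatured in a few lines of code: score every available variable against the target and let the strongest relationship surface through the analysis itself. The variable names and data below are invented purely for illustration; real big data pipelines operate at vastly larger scale.

```python
# Hypothetical sketch of "crunch all the data" feature screening.
# All column names and values are invented for this illustration.
import random

random.seed(0)

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Simulated pool of input parameters: in this toy world only
# 'page_views' actually drives the target, but the analyst does
# not know this in advance and collects everything available.
n = 200
data = {
    "page_views": [random.random() for _ in range(n)],
    "session_length": [random.random() for _ in range(n)],
    "device_age": [random.random() for _ in range(n)],
}
target = [0.8 * v + 0.2 * random.random() for v in data["page_views"]]

# No prior hypothesis: score every variable against the target and
# let the relationships surface through the process of analysis.
ranked = sorted(data, key=lambda k: abs(pearson(data[k], target)),
                reverse=True)
print(ranked[0])  # the variable the data itself singles out
```

Note that nothing in the procedure explains *why* the top-ranked variable matters; relevance is established only after the fact, which is precisely the tension with purpose specification and data minimisation discussed above.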
Anderson, C. (2008). The End of Theory. Wired Magazine, 16–07.
Hey, T. (2012). The Fourth Paradigm: Data-Intensive Scientific Discovery. In E-Science and Information Management. Berlin, Heidelberg: Springer.
OECD (1980). OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data. Retrieved from http://www.oecd.org/document/18/0,3343,en_2649_34255_1815186_1_1_1_1,00....
Pietsch, W. (2014). Aspects of theory-ladenness in data-intensive science. Retrieved from http://philsci-archive.pitt.edu/10777/
Sabina Leonelli (University of Exeter): Data Journeys: Openness and Shadows
Current discourse around data, and particularly scientific and ‘big’ data, is infused with the importance of the available, the pre-existing, the present. Data are quite literally “givens”, things that are and thus can be used as evidence; they are also tangible goods, the result of investments and labour, which thus arguably need to be spread and used to improve human life and understanding. The very definition of openness, as phrased by the Open Knowledge Foundation, focuses on making data accessible and useable: “Open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness)” (http://opendefinition.org/). In this talk, I discuss the notion of Open Data, its current manifestation in data sharing policies, and its relation to the epistemology of data dissemination and use within science. I note how making data travel across sites involves a lot of labour and infrastructural/material/human resources, in the absence of which making data ‘open’ proves to be a much more controversial and multi-faceted process than typically envisaged within Open Data movements. I also discuss how considering data shadows — data which is not present, not readily available, not usable to prove claims or foster discoveries — helps to confront questions concerning what constitutes knowledge in the first place, and particularly about what is tacit, ignored, denied, forbidden, private, inaccessible, unknown and/or unexplored.
Oliver Marsh (UCL): Le Geek, C’est Clique: Understanding and Representing Science Enthusiasm
The idea of a ‘science enthusiast’ identity, often tied to such terms as ‘geek’ or ‘nerd’, is a familiar one in general parlance. However, the visibility of such an identity has increased in recent years with the rise of online science enthusiast groups. Examples include the Facebook page ‘I Fucking Love Science’ (IFLS), which has over 18m ‘likes’ and regularly tops Facebook’s user engagement statistics, or Reddit threads such as r/science or r/AskScience, which offer their thousands of users the chance to informally chat with Nature writers and other leading scientists. Such sites frequently ascribe their popularity to their distinctive combination of entertainment and information – in the words of Elise Andrew, founder of IFLS, “[users] can come and laugh but still know that everything they see is accurate”. However, such features provoke conversations within and outside these groups over the role of entertainment and enthusiasm in these groups; detractors of IFLS claim that it spreads poorly-informed science under the guise of humour and social solidarity, while conversely many of the largest Reddit science threads are criticised for disallowing jokes and informal banter. Through discourse analysis of conversations within and about these groups, and by combining perspectives from Science and Technology Studies (STS) with work from scholars of online fan communities, my research aims to illustrate how familiar STS concepts of drawing boundaries between the scientific and non-scientific operate in settings dominated by less familiar (to STS scholars) concepts of ‘phatic’ interaction and pseudonymous identities. The overall aim of this project is to consider questions around public engagement with science not in the more frequently studied situations of crisis and political controversy, but rather in everyday leisure-time settings.
In line with the conference theme, in this paper I will consider the difficulties of understanding and representing such problematic concepts as ‘enthusiasm’ in the form of usable information. This raises three main questions. Firstly, the problem of scaling – how do such personal concepts as enthusiasm change when expressed (and analysed) at the scale of thousands to millions of users? The fact that millions of people like IFLS and the fact that an r/science user may seek out large quantities of peer-reviewed literature for a single comment both seem to indicate ‘science enthusiasm’, but of a substantively different nature. Secondly, the role of specific online platforms. How can we access ‘thick’ qualitative information hidden in the quantitative data provided by online platforms, particularly those such as Facebook with a rhetorical bias towards positive sentiments of ‘liking’ and ‘friending’ – where a ‘like’ may commonly be a prosaic means for activating notifications rather than an expression of positive sentiment? Finally, what opportunities do such concepts open for constructive and creative approaches to representing information, most notably expressed in infographic work such as the Information is Beautiful project? These collected issues are familiar across numerous fields, from data science to digital humanities. However, I suggest that the specificity of ‘science enthusiasm’, sitting at the unusual intersection of STS and fan studies, and of the personal/phatic and the political/professional, provides a novel lens for considering these questions.
Julian Newman (Glasgow Caledonian University): Data Quality Implications of Scientific Software Complexity
Scientific findings based on computer simulation evoke sceptical responses because data generated by simulation models does not appear to have an objective status comparable with data captured by observation or experiment. Counter to this, philosophers sympathetic to computational science, such as Winsberg (2010) and Humphreys (2004), emphasise parallels between experiment and simulation. Winsberg compares the strategies of simulationists with those of experimenters in augmenting our “reasonable belief” in their results. He extends to simulation Hacking’s notion that experimental practice is self-vindicating. Practices such as parameterisation and tricks to overcome computational difficulties “carry with them their own credentials”. Lenhard & Winsberg (2010) argue that complex simulation models face epistemological challenges associated with a novel kind of “confirmation holism”. Whereas Duhem argued that no single hypothesis can be tested in isolation, Lenhard & Winsberg claim that it is impossible to locate the sources of the failure of any complex simulation to match known data, because of three characteristics which they regard as intrinsic to the practice of complex systems modelling. These are identified as “fuzzy modularity”, “kludging” and “generative entrenchment”. Modularity should provide a way of managing complexity; but here the accumulation of more and more interactive sub-models does not break down a complex system into separately manageable pieces, so that it must stand or fall as a whole. Three related issues are posed by this work: is confirmation holism essential to and unavoidable in complex systems modelling and/or embedded in the specific disciplinary practices of climate science and/or an in-principle remediable failure to observe, recognise and apply available and well-established sound software engineering practices when developing simulation software? Let us consider the three alleged sources of this novel confirmation holism. 
“Fuzzy modularity” speaks for itself: the different modules simulate different parts of the climate system, and these are in continual interaction, such that the behaviour of the sub-models is strongly influenced by the higher level models and, through them, possibly by one another. Thus it has proved difficult to define clean interfaces between the components of the model. In some respects this problem most readily evokes sympathy in the critic: accepting fuzziness may be seen to enhance the realism of the overall model. But as fuzzy modularity impairs the testability of the software, it can be argued that the price is paid in poor data reliability. Kludges and Generative Entrenchment are intimately interrelated. A kludge is an inelegant, ‘botched together’ piece of program, very complex, unprincipled in its design, ill-understood, hard to prove complete or sound and therefore having unknown limitations, and hard to maintain or extend. Generative Entrenchment refers to the historical inheritance of hard-to-alter features from predecessor models: “… complex simulation models in general, and climate models in particular, are – due to fuzzy modularity, kludging and generative entrenchment – the products of their contingent respective histories … As such, climate models are analytically impenetrable in the sense that we have been unable, and are likely to continue to be unable, to attribute the various sources of their successes and failures to their internal modelling assumptions.” (Lenhard & Winsberg p 261). The problem of emergent “Architectural Defects” in large software systems has attracted significant research efforts in Software Engineering Science. A software architecture captures basic design decisions which address qualities central to the system’s success, such as performance, reliability, security, maintainability and interoperation.
As many as 20% of the defects in a large system can be attributed to architectural decisions, and these can involve twice as much effort to correct as defects arising from mistakes in requirements specification or in the implementation of software components. Li et al (2011) point out that architectural decisions typically affect multiple interacting software components, and as a result architectural defects typically span more than one component: they therefore concentrated on the problems of finding and correcting “multiple-component defects” (MCDs). To this end, they conducted a case study based on the defect records of a large commercial software system which had gone through six releases over a period of 17 years. Compared to single component defects they found that MCDs required more than 20 times as many changes to correct, and that an MCD was 6 to 8 times more likely to persist from one release to another. They identified “architectural hotspots” consisting of 20% of software components in which 80% of MCDs were concentrated, and these architectural hotspots tended to persist over multiple system releases. This research provides an excellent example of the part played by the “Engineering Science” of Software Engineering in using a large corpus of data to develop a relevant body of theory upon which design and development practice in Software Engineering can build. It exemplifies practices contrary to Humphreys’ acceptance of opacity as a sign of the obsolescence of human centred epistemology, or Lenhard & Winsberg’s acceptance of confirmation holism with respect to complex system models. In the practice of software engineering, defects are to be expected, because software development is a human activity which is error-prone, and the more complex the software product the more prone to defects it will be and the harder the defects will be to eliminate. Thus the results of simulation and computational science should never be taken “on trust”.
Similarly, the fact that a kludge may reflect the model’s historical development is in no sense a credential of the model. Kludging in the early stages of a software project creates “Technical Debt” on which interest will accrue in the form of error and maintenance through the lifecycle (Brown et al, 2010; Kruchten et al, 2012). There are growing movements to apply to software the criteria and techniques of rigour that have developed in other sciences, and there is at least some research to suggest that not all climate models are as analytically impenetrable as Lenhard & Winsberg suggest (Pipitone & Easterbrook, 2012). Thus the argument from the essential epistemic opacity of computational science to a non-anthropocentric epistemology runs counter to best practice in software engineering and to empirical results of software engineering science.
Brown, N., Cai, Y., Guo, Y., Kazman, R., Kim, M., Kruchten, P et al (2010). Managing technical debt in software-reliant systems. In Proceedings of the FSE/SDP workshop on Future of software engineering research (pp. 47-52). ACM.
Kruchten, P., Nord, R. L., & Ozkaya, I. (2012). Technical debt: from metaphor to theory and practice. IEEE Software, 29(6), 18-21.
Lenhard, J & Winsberg, E (2010) “Holism, entrenchment and the future of climate model pluralism.” Studies in the History and Philosophy of Modern Physics, 41, 253-263.
Li, Z., Madhavji, N. H., Murtaza, S. S., Gittens, M., Miranskyy, A. V., Godwin, D., & Cialini, E. (2011). Characteristics of multiple-component defects and architectural hotspots: a large system case study. Empirical Software Engineering, 16, 667-702.
Pipitone, J., & Easterbrook, S. (2012). Assessing climate model software quality: a defect density analysis of three models. Geoscientific Model Development Discussions, 5(1), 347-382.
Winsberg, E. (2010). Science in the Age of Computer Simulation. University of Chicago Press.
Raj Patel (University of Pennsylvania): Why privacy is not enough: Big Data and predictive analytics
Big Data (BD) is generating interest from diverse groups of people. Governments, private organizations, and scientists are interested in the enormous amounts of information about people that can now be collected, curated, and analyzed (Hey, Tansley & Tolle, 2009). These massive quantities of information are sometimes composed of digitized medical, educational, and financial records, as well as the digital trails left behind through Internet usage and technologies such as mobile phones. Access to this information can be useful. For example, governments can use large datasets about benefit allocation to prevent fraud and benefit overpayment (Bryant, Katz & Lazowska 2008, p. 15). Private companies can use data about purchasing habits to infer preferences and determine which products to show consumers and at what price (Bollier & Firestone 2010, p. 3 – 6). Medical professionals can discover previously unknown adverse drug interactions from large datasets of medical information (Tene & Polonetsky 2013, p. 245). Notwithstanding the promises of BD, some commentators have adopted a more cautious attitude. Much of this caution is grounded in privacy concerns. This work argues that BD has the potential to violate key informational privacy principles (such as control of personal data) set out by law. Usually, the policy implications that flow from these criticisms advocate greater control over personal information. Consider the following representative claims: “Some are concerned that OBA [online behavioural advertising] is manipulative and discriminatory, but the dominant concern is its implications for privacy” (Toubiana et al. 2010, p. 1). “Our personal information is used to feed all kinds of [BD] business models. In fact, we are the fuel for what has become a multi-billion dollar economic engine ... Our advice: your best bet ... is to control how much information you put out there[.]” (Craig and Ludloff 2011, p. 63). 
“Increasing volume and variety of datasets within an enterprise also highlight the need for control of access to such information” (Chaudhuri 2012, p. 1). I will refer to this as the ‘privacy approach’ to BD. The privacy approach has a relative monopoly on the policy discussions surrounding BD. But BD raises significant ethical challenges outside of traditional privacy concerns. These include:
1. The accentuation of information asymmetries between large organizations and individuals (Schermer 2011, p. 47).
2. The prospect of opaque decision-making (Tene & Polonetsky 2013, p. 252).
3. The ‘chilling effect’ on particular behaviours by people who know their data is being collected.
4. Discriminatory outcomes (Gandy 2010).
These challenges are often ignored in the policy debates surrounding the ethical issues raised by BD use because of the monopoly enjoyed by the privacy approach. This paper argues that the privacy approach is an inadequate framework for conceptualizing the harms posed by the use of BD. This is because, in practice, the privacy approach focuses on the procedural issues surrounding the collection of data, as opposed to seriously engaging with the ethical harms that flow from the use of data and the outcomes of those uses. The current monopoly the privacy approach enjoys in policy debates surrounding ethical issues raised by BD is unwarranted. To make my case, I highlight a specific harm (discriminatory outcomes) that is ignored by the privacy approach. The case study in the paper is about credit score rating systems used to determine creditworthiness. To illustrate the conceptual issues at stake, I introduce the notion of non-interpretable preemptive predictions (NIPP). Preemptive predictions are predictions that attempt to limit the future range of a person’s actions (e.g., no-fly lists to stop suspected terrorists from boarding flights). A non-interpretable process is one that is hard to represent in human language. Non-interpretability might be the result of algorithmic software that makes decisions on many variables, some of which may have been learned from the initial set of data. Software that employs particular types of machine-learning algorithms – i.e., an algorithm that ‘learns’ as it analyzes data, without being explicitly programmed – seems more likely to lead to non-interpretable processes (van Otterlo, p. 57). Non-interpretability also implies that the process is more complex. Thus a NIPP process is one that is hard to represent in human language and that also produces preemptive predictions. NIPP processes in credit scoring systems run the risk of systematized discrimination in morally relevant ways.
Moral relevance is important here, as the point of a credit scoring system is to categorize individuals into groups based on creditworthiness. But presumably these categories should not be based on irrelevant and immutable characteristics, such as race and sex. The credit scoring systems case I outline in the paper highlights two important issues that flow from the NIPP process: redlining and hidden system-level biases. Redlining is the practice of denying loans in particular areas regardless of the creditworthiness of the individual loan applicant (Holmes and Horvitz 1994, p. 81). Redlining is usually associated with racial discrimination. Hidden system-level biases create systematized discrimination that is ‘built’ into systems through, e.g., code, datasets, evaluative judgments used to determine risk factors, etc. We might not be able to locate exactly where systematized discrimination enters into systems, yet outcomes might still be relevantly discriminatory. These are significant worries that are largely ignored in current policy debates. Kafkaesque decision-procedures that have real weight over the quality of people’s lives ought to be subject to scrutiny in a liberal society. Our policy discussions must be concerned with murky decision-procedures involving BD processes toward this end. If the arguments in this paper are right, then the current policy focus that rests exclusively on issues of procedural privacy is misguided and ought to be changed.
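The worry about hidden, proxy-mediated bias can be made concrete with a toy sketch (our own construction, not the paper’s case study; all names and numbers are invented):

```python
# Toy illustration of proxy discrimination ("redlining"): the scorer never
# sees the protected attribute, yet outcomes differ systematically by group.

# Invented data: postcode "A" happens to correlate perfectly with group g1.
applicants = [
    {"group": "g1", "postcode": "A", "income": 50},
    {"group": "g1", "postcode": "A", "income": 55},
    {"group": "g2", "postcode": "B", "income": 50},
    {"group": "g2", "postcode": "B", "income": 55},
]

def credit_score(app):
    # Only "neutral" features are used; the postcode penalty stands in for a
    # weight that a learning algorithm might pick up from historical data.
    score = app["income"]
    if app["postcode"] == "A":
        score -= 20  # hidden system-level bias entering through a proxy
    return score

approved = [a for a in applicants if credit_score(a) >= 50]

def approval_rate(group):
    total = sum(a["group"] == group for a in applicants)
    return sum(a["group"] == group for a in approved) / total

print(approval_rate("g1"), approval_rate("g2"))  # 0.0 1.0
```

Although the group attribute never enters the scoring rule, the two groups receive entirely different outcomes, and nothing in the rule’s inputs flags this as discrimination.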
Wolfgang Pietsch (Technische Universität München): Difference-making as a notion of causation for data-intensive science
I propose a specific way to understand causation in at least some areas of data-intensive science, namely in terms of a difference-making approach relying on eliminative induction in the tradition of Mill’s methods. I broadly outline the account, address its advantages in comparison with other approaches to causation, and proceed to show how it can serve in the analysis of several widely used algorithms. The challenge that data-intensive science poses for scientific method is bigger than many people realize. For most of the promises of data-intensive science actually depend on there being only few and very general modeling assumptions. Thus, statisticians increasingly realize that data-intensive science calls for novel statistical tools and a rigorous treatment of those methods that are currently being used in machine learning in mostly heuristic ways (e.g. van der Laan & Rose 2010). Causation is one crucial concept on which a novel perspective is required within a model-lean data-intensive science, since the relevant causal structure can no longer be derived from the model but must be justified from the data. The concept of causation that I propose as particularly useful for data-intensive science is a difference-making account based on eliminative induction in the tradition of Mill’s methods (for relatively recent treatments, see for example Baumgartner & Graßhoff 2004, Pietsch 2014). This account construes causal (ir-)relevance as a three-place relation: a condition C is (ir-)relevant to a phenomenon A with respect to a certain background B of further conditions that remain constant if causally relevant or that are allowed to vary if causally irrelevant. There are two principal methods.
The method of difference determines causal relevance: if in one instance a specific condition CX and the phenomenon A are present and in another instance both CX and A are absent, while all other potentially relevant conditions remain unchanged, then CX is causally relevant to A. The strict method of agreement establishes causal irrelevance in a complementary manner: when the change in CX has no influence on A. Eliminative induction can deal with functional dependencies, and an extension to statistical relationships is straightforward. The restriction to a context B of background conditions is required because there is no guarantee that in a different context B*, the causal relation between C and A will continue to hold. Since the relevant conditions in the background can only rarely be made explicit, causal laws established by eliminative induction have a distinct ceteris-paribus character. I will argue that the difference-making account has several advantages over other interpretations of causation that render it particularly useful for the analysis of large, high-dimensional data sets. For example, it should be distinguished from regularity theories since it emphasizes evidence in terms of the variation of circumstances and not in terms of constant conjunction. Obviously, this corresponds well to the kind of evidence that is used in data-intensive science. A further crucial difference with respect to regularity accounts concerns the premises that need to be fulfilled for reliable inferences. As emphasized in Pietsch (2014, Sec. 3f), there is a distinct problem of induction for the difference-making approach in comparison with regularity theories. The latter rely on an essentially indefensible principle of the uniformity of nature. By contrast, the difference-making account presupposes a set of premises that seem much more reasonable and realistic, including: (i) determinism; (ii) constancy of background conditions; (iii) adequate causal language.
As argued in Pietsch (forthcoming), these conditions provide a reasonable guideline for determining under which premises data-intensive methods yield reliable results. The difference-making account is closely related to the interventionist approach and, for example, can make sense of the criteria for causation depicted in Woodward’s account. However, the term intervention conveys the misleading impression that observational data cannot suffice for the identification of causal relationships, but that experimentation or manipulation is required. This is formally integrated in Woodward’s account by the idea that interventions act as switches for interfering variables. Clearly, stressing the interventionist character radically reduces the usefulness of any account of causation for an analysis of data-intensive science, since most of the data there is of an observational nature. By contrast, according to the difference-making account, there is no qualitative difference but only a pragmatic advantage of interventionist over observational data. Like the counterfactual approach, the difference-making account aims to establish the validity of counterfactual statements. But it does not evaluate counterfactuals in terms of possible worlds, as for example David Lewis suggests, but by referring to actual situations that differ only in irrelevant circumstances. While Lewis’s account relies on the notoriously difficult concept of similarity between worlds, the difference-making approach works with a notion of causal irrelevance of boundary conditions as established by the strict method of agreement. This corresponds well to the way predictions are established in data-intensive science, namely in terms of difference-making with respect to a constant background, while similarity between worlds is not invoked. In the final part, I will briefly draw a connection between the sketched difference-making account of causation and specific algorithms from data-intensive science.
I will chiefly consider two, classificatory trees and non-parametric regression, and will sketch under which conditions they are able to identify causal relationships.
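The two principal methods described in the abstract can be rendered as a rough sketch over tables of boolean instances (our own toy code, not the author’s; condition names are illustrative):

```python
# Minimal sketch of eliminative induction in the tradition of Mill's methods.
# An "instance" is a dict of boolean conditions plus the phenomenon A; the
# background B is everything other than the condition under test.

def method_of_difference(inst1, inst2, condition, phenomenon):
    """C is causally relevant to A if C and A are present in one instance and
    both absent in the other, while all other conditions stay unchanged."""
    background_constant = all(
        inst1[k] == inst2[k]
        for k in inst1
        if k not in (condition, phenomenon)
    )
    flipped_together = (
        inst1[condition] != inst2[condition]
        and inst1[phenomenon] != inst2[phenomenon]
        and inst1[condition] == inst1[phenomenon]
    )
    return background_constant and flipped_together

def strict_method_of_agreement(inst1, inst2, condition, phenomenon):
    """C is causally irrelevant to A if changing C alone leaves A unchanged."""
    background_constant = all(
        inst1[k] == inst2[k]
        for k in inst1
        if k not in (condition, phenomenon)
    )
    return (background_constant
            and inst1[condition] != inst2[condition]
            and inst1[phenomenon] == inst2[phenomenon])

# Two observations differing only in condition "c1":
a = {"c1": True,  "c2": False, "A": True}
b = {"c1": False, "c2": False, "A": False}
print(method_of_difference(a, b, "c1", "A"))        # True: c1 relevant to A

c = {"c1": True,  "c2": True, "A": True}
d = {"c1": False, "c2": True, "A": True}
print(strict_method_of_agreement(c, d, "c1", "A"))  # True: c1 irrelevant here
```

Note how the background check makes the three-place character of the relation explicit: relevance is always judged with respect to the conditions held constant, which is exactly the source of the ceteris-paribus character noted above.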
Baumgartner, M.; Graßhoff, G. (2004). Kausalität und kausales Schließen. Bern: Bern Studies in the History and Philosophy of Science.
Pietsch, W. (2014). “The Nature of Causal Evidence Based on Eliminative Induction.” In P. Illari & F. Russo (eds.), special issue on “Evidence and Causality in the Sciences” of Topoi 33 (2): 421-435.
Pietsch, W. (forthcoming). “Aspects of theory-ladenness in data-intensive science”, Philosophy of Science (proceedings PSA2014), http://philsci-archive.pitt.edu/10777/
van der Laan, M.; Rose, S. (2010). “Statistics ready for a revolution: Next generation of statisticians must build tools for massive data sets.” Amstat News 399:38-39.
Teresa Scantamburlo (University of Ca' Foscari, Venice): Big Data: the empiricist approach and its philosophical underpinnings
Recent studies on the epistemological implications of the data revolution have raised some troublesome questions about the nature and the explanatory capacity of correlations. Indeed, it seems that Big Data provides substantial support to the idea that statistical dependencies can easily supersede more elusive notions of causality, paving the way to a reductionist approach to research activities. Starting from the assumption that Big Data marked the “triumph of correlations” (Mayer-Schönberger and Cukier, 2013), we would like to consider the critical reaction to the new wave of empiricism, which has gained credence especially outside the academy (Kitchin, 2014). A radical version of this new empiricist perspective asserts that the data-driven approach can give us a full resolution of the world since data alone, without theory, can speak for themselves (Anderson, 2008). At a general level the empiricist approach pushes data-driven sciences, like pattern recognition and machine learning, to value predictions more than explanations or, in other words, to prefer knowing whether something will happen rather than knowing why it happened (Mayer-Schönberger and Cukier, 2013; Cristianini, 2014). Theoretically, the shift from explanation to prediction reveals a dismissal of causal relationships in research activity or, better, their assimilation within the realm of patterns and correlations. But against this tendency, which is in fact more widespread within business circles, several arguments have been raised (boyd and Crawford, 2012; Kitchin, 2014). Among these, many warnings regard the alleged neutrality of numbers and the unavoidable bias which normally accompanies data collection and analytics. Other critics argue that Big Data is not free from theory at all, the discovery of meaningful patterns often being preceded and refined by previous findings and reasoning.
In this respect it is interesting to note that almost a decade ago a study on the use of analogy in data analysis models (D.M. Bailer-Jones and C.A.L. Bailer-Jones, 2002) revealed the limited utility of such models for the study of empirical phenomena. Indeed, the authors, considering the wide applicability of data analysis techniques, pointed out that “beyond the goal of accurate prediction, the scientific insight that computational data models give in a specific case may be limited” (D.M. Bailer-Jones and C.A.L. Bailer-Jones, 2002, p. 14). Nowadays the concern to make data analytics more informative and adequate has increased awareness of the limits of a flat application of data analysis techniques and encourages a more attentive approach to the outcomes of correlations. We think that the course of critical reflections on Big Data might be read in the light of a renewed interest in combining data with theory: the so-called “matter of facts”, made of measures and correlations, with the sphere of reason (e.g. theoretical models, explanations, qualitative evaluations). This emerging perspective has also given rise to a new mode of conducting research, alternative to the empiricist standpoint and allowing cross-fertilization among complementary methodologies, such as induction, deduction and abduction (Kitchin, 2014). Note that according to others, if applied to specific domains like biology, the data-driven approach, rather than establishing a novel epistemology, revitalizes several aspects that have long featured in the practice of science, including speculation as well as extant experimental practices (Leonelli 2014).
In this paper we would like to approach data science, and in particular pattern recognition, as an exemplary case of the contemporary efforts in the business of Big Data, in the light of its main conceptual roots, which can thereby justify the return of empiricism (characterized by the “end of theory” and the “triumph of correlations”) and at the same time the development of an alternative approach attempting to reconsider the role of data within the practice of science. Specifically, we would like to undertake an in-depth examination of the critical analyses which seek to reconcile a purely data-driven approach with more theoretical aspects, and to get to the bottom of the empiricist drift. Our main point is that we could better understand the conceptual divide between theory and data if we considered Hume’s naturalist account of inductive reasoning. We think, indeed, that Hume’s perspective could provide a suitable basis for accounting for the trajectory that would have led to “the end of theory” and “the triumph of correlations”. This would also contribute to a better understanding of the rationale underlying the great respect accorded to “the unreasonable effectiveness of data” (Halevy, Norvig and Pereira, 2009). In conclusion, our overall discussion will suggest that the contemporary debate on the limits and the potential benefits of Big Data should be traced back to a more fundamental level that could help us discern Hume’s profound impact on the evolution of data science. This will also allow us to reconsider the role of induction within pattern recognition research and to compare its main characterization with some alternative perspectives (e.g., Biondi and Groarke, 2014).
Anderson C., The end of theory: The data deluge makes the scientific method obsolete, Wired, 23 June 2008, Available at: http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
Bailer-Jones D. M. and Bailer-Jones C. A. L., Modelling data: Analogies in neural networks, simulated annealing and genetic algorithms, in Magnani L. and Nersessian N. (eds.), Model-Based Reasoning: Scientific Discovery, Technology, Values, New York, Plenum Publishers, p. 147-165, 2002
Biondi P.C. and Groarke L.F (eds.) Shifting the Paradigm: Alternative Perspectives on Induction, Walter de Gruyter, Berlin/Boston, 2014
boyd d. and Crawford K., Critical questions for Big Data: provocations for a cultural, technological, and scholarly phenomenon, in Information, Communication & Society, 15(5): 662–679, 2012
Cristianini N., On the current paradigm in artificial intelligence, AI Commun., 27 (1): 37-43, 2014
Halevy A., Norvig P. and Pereira F., The Unreasonable Effectiveness of Data, in IEEE Intelligent Systems, 24(2): 8-12, 2009
Hume D., A Treatise of Human Nature, Oxford University Press, 2000
Kitchin, R., Big data, new epistemologies and paradigm shifts, in Big Data and Society, 1:1 – 12, 2014
Leonelli S., What Difference Does Quantity Make? On The Epistemology of Big Data in Biology, in Big Data & Society, 1:1 – 11, 2014
Mayer- Schönberger V. and Cukier K., Big Data: A Revolution that Will Change How We Live, Work and Think, London, John Murray, 2013
Orlin Vakarelov (Duke University): Scientist on the anthill
Complex, multi-modal “databases” have become essential for scientific research, especially in biology. As a result, they have been subject to analysis in the philosophy of science in recent years.1 In these discussions, they have been treated as repositories of “data” – raw, or processed and annotated. “Data” has been treated, along the lines of traditional models of scientific theorizing, as the input to scientific theories. Much of the discussion has been about the added value of such databases for knowledge formation and discovery and about the underappreciated contribution of activities such as data curation. Here I will argue for a more radical interpretation of such “databases” (or expected near-future developments of such “databases”) in scientific practice. I will argue that such “databases”2 should be given the status of theories. Such an argument depends on a reconception of the notion of theory. As a consequence, the role of the scientific agent in scientific practice must be reconceived as well. I propose a new model of the relationship between scientists and large “database” theories – the anthill model. The strategy is not to argue that a new technological development has brought (or will bring) a revolution in scientific practice. Just the opposite: the strategy is to argue that by reconceiving what the information-carriers of scientific practice are (and have always been), we can maintain better continuity of scientific practice. To support the reconception of the notion of theory we must examine three questions: (1) What is the structure of the information-carriers of science? (2) What is the normative framework for updating the information-carriers? And (3) what is the role of the scientific agent? One central dogma of the epistemological framework of science, with ancient roots, is that knowledge is something that serves the mind. It is the job of science to make the world comprehensible to human experience.
To know is for the mind to enter into some relation with the world. Post-Fregean analytic philosophy tried to eliminate “psychologism” from epistemology by focusing on formal languages as mind-free information-carriers. Languages, however, are the ultimate mental interfaces to the world. The result was that the dogma was implicitly preserved, while its explicit principles were swept under the rug, where they are not subject to reflection. Why does this matter? Because insisting that the information-carriers – the embodiment of scientific knowledge – must be comprehensible by a human mind places a strong negative constraint on the complexity of the systems that may be represented. Simply put, some of the systems we would like to investigate, manipulate and control are too complex to be subject to mental comprehension. We have been through this before, when we realized that the world is too complex (small, large, fast, dim, or undetectable) for our senses. Lenses took us only so far. Much of modern science bypasses human perception entirely, moving from detectors straight to numbers, with colorful visualizations attached at the end for comfort. The reason that biology has benefited most from large “databases” is not surprising. It is a science whose subject matter demands representational resources in the realm of the mindboggling. Biology has the task (not the only task) of understanding the interactions of billions of organisms (grouped in complex structures of types) with millions of relevant environmental parameters, where each organism (itself an ecosystem) is a complex biochemical and bioelectric network of control relations. No doubt, part of the understanding will involve micro-understandings of local regularities or global aggregate statistics. Such micro-understandings are comprehensible by minds, but they (qua understandings) vastly underdetermine the system that is the target of investigation.
The system of micro-understandings is, of course, the all too familiar nomological model of theories – the idea that the primary goal of a theory is to capture the natural laws. The problem with the nomological model as applied to the study of real biological systems is that too much of the behavior of such systems depends on idiosyncratic local conditions. To restate this in the more familiar terms of laws and auxiliary facts (or initial conditions): too much of the information needed to control the system is “encoded” not in the laws but in the auxiliary facts. This takes us to the first question. By a theory of X we mean the vehicles that contain the information about X. More specifically, an information vehicle about X is an information medium (or an information medium network) that has been given a semantic interpretation with X as its target. Both syntactic and semantic views of theories are subsumed under this characterization. The characterization, however, is more general. It allows broader classes of data structures, including “databases”, to be theories. Within the assumptions of the central dogma, the normative principles of theory construction and evaluation come from the requirement that the theory provide knowledge or understanding (or explanation) to the mind. When the theory is not accessible to a mind, such derived normative significance is not available. But an alternative source of normative evaluation exists. It can be derived from pragmatics. The theory must be able to modulate an agent’s interaction with the target system. (This is the answer to the second question.) Within the central dogma, it is possible, at least superficially, to separate knowledge from its application. Such separation cannot happen when the information vehicles are not comprehensible. In such a case, the theory itself must contain the recommendations for action.
To put this somewhat metaphorically (but quite literally, in the case of some “databases”), the theory must include a system for answering queries. The concept of theory presented here involves semantic information. Where does the semantics come from? This leads us to the third question. While it is possible to have an information medium with semantic information (which is never intrinsic to the medium) when it fits in a sufficiently complex goal-directed system (and when by semantic information we mean a particular technical concept), we should not assume such a case in our analysis. First, we have human agents to be our semantic providers, and second, the theories do not need the kinds of machinery that might provide them with other sources of meaning (if that is possible). Scientists interacting with the information vehicles and using them to control the intended target provide all the meaning that, well, scientists need of the theory. Scientists are not only semantic engines and users of the theories. They are also their creators and curators. This aspect of the model suggested here is of particular interest, because the role of scientists in their integration with the theories/databases is different when the theory is not fully comprehensible. The difference is not dictated by issues in social epistemology, which treats theory construction and testing as a distributed activity. The problem of social epistemology fits squarely with the central dogma. While the exploration of the theoretical search space may be distributed, the theories produced are assumed to be comprehensible. The new problem is: how do you build, modify and manage a theory that cannot be comprehended? Here I propose an anthill (proto-)model of the relationship of scientists and the theory. Imagine a large data structure, a database+ (with a big +), encoding the information in an ecosystem. This is our theory. No single scientist can comprehend the entire system.
Let us assume, in fact, that no scientist can comprehend more than 1% (or 0.001%, or 10⁻¹²%) of the information in the system. Each scientist can, at best, comprehend, update and modify only a small part of the system. This can happen in fairly traditional ways: by controlled experiments, observation, imagination and mathematical insight. The scientist, according to the model, is like an ant building the anthill of knowledge embodied in the giant data structure. Is such a giant anthill theory possible? It had better be, if the considerations about the complexity of biological systems presented above are of any significance. This is not an exercise in wild speculation about the future of science 200 years from now. In small ways, the process of creation of such theories by scientist-ants has already started. To reiterate the point stated earlier, the goal, in reconceptualizing the role of such “databases” and treating them as full-rank theories, is to establish continuity with the scientific practice of theory building. By doing this, it is possible to appreciate the new problems that emerge when a theory is incomprehensible, and to initiate an investigation of the methodological principles (and new job descriptions) needed for constructing such anthill theories.
Billy Wheeler: Causation and Information: What is Transferred?
The conserved quantities theory of causation (Dowe, 2000; Salmon, 1994) identifies a causal process as the preservation of a physical quantity among a series of events. It goes on to define a causal interaction as the exchange of a physical quantity among two or more separate causal processes. Despite capturing many intuitions regarding our everyday notion of causation, the conserved quantities theory faces two well-known objections (Illari, 2011). The first, which we might call ‘the problem of applicability’, is that very few physical quantities are known to be conserved fully in a closed system. Typical examples like ‘mass-energy’, ‘momentum’ and ‘charge’ work relatively well for processes on the atomic scale, but are less applicable to the special sciences, which are interested in causal claims about quite different properties. The second problem, which we might call ‘the problem of absences’, concerns the regular practice of conferring causal efficacy on ‘gaps’, ‘absences’ or ‘voids’, as in the example ‘His failure to turn up caused a storm at the party conference’. Yet if causation really is the transfer of a physical quantity, it is hard to imagine how this is possible in cases where an absence is cited as the cause. It has recently been argued that these problems for the conserved quantities view can be overcome if the quantity exchanged between two or more processes is not a physical property but rather a quantity of information. There is good evidence to suggest that this view holds the hope of unifying causal claims among the sciences as well as assisting in the search for causes in predominantly data-driven sciences (Illari, 2011; Illari & Russo, 2014). As it stands, however, this new conception of causation is unlikely to prove illuminating or helpful in these practices unless we first define in clear terms the nature of the information that ‘gets transferred’.
Famously, the concept of ‘information’ has proven difficult to pin down, and there exist several different (although not necessarily competing) formal definitions. In this talk I will assess the suitability of a number of these definitions as a foundation for causation. In particular I will focus on the following three: (1) information as ‘entropy’ as defined by Shannon Information Theory (Shannon & Weaver, 1949), (2) information as ‘minimum description length’ as defined by Kolmogorov Complexity (Solomonoff, 1997; Chaitin, 1987; Li & Vitanyi, 1997), and (3) information as ‘knowledge-update’ as defined by various advocates of Epistemic Logic (Hintikka, 1962; van Benthem, 2006). I argue that only the second of these has a chance of helping overcome the problems of applicability and absences – and even then only when it is given a radical metaphysical interpretation. It will be shown that unless the information-transfer theory signs up to ‘digital realism’ (also known as ‘digital metaphysics’ (Steinhart, 1998) or the ‘it-from-bit hypothesis’ (Wheeler, 1990)), the two problems merely reappear at the level of the ground of the information. This problem is inherent in John Collier’s information-transfer theory (1999; 2010), where he identifies the ground as ‘matter or whatever other “stuff” is involved’ (1999, p. 2). Whilst common conserved physical properties such as mass-energy do nicely as the information carriers for physics, it is hard to see what category of “stuff” would do for biology, psychology or economics, where most data-driven science occurs. A much simpler solution is to interpret physical processes as emerging out of basic cellular automata (Toffoli, 1984; Burks, 1970; Wolfram, 1986), which form the metaphysical ground for causal processes and causal interactions.
This particular view has radical consequences for the fabric of reality that will not be to everyone’s taste; but these are perhaps outweighed by the improved understanding it affords us of causality and the search for causal links among complex data.
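The first two formal notions of information contrasted above can be illustrated with a short sketch (our own, not the author’s). Shannon entropy is directly computable from a source distribution, whereas Kolmogorov complexity (minimum description length) is uncomputable; compressed size is a standard computable upper-bound proxy for it:

```python
import math
import random
import zlib

def shannon_entropy(probs):
    """Average information of a source, H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # 1.0 (a fair coin)
print(shannon_entropy([0.25] * 4))  # 2.0 (four equiprobable outcomes)

# Description length: a highly regular string compresses far better than a
# (pseudo-)random string of the same length, reflecting its much lower
# algorithmic complexity.
regular = b"ab" * 5000
random.seed(0)
noisy = bytes(random.randrange(256) for _ in range(10000))
print(len(zlib.compress(regular)) < len(zlib.compress(noisy)))  # True
```

The gap between the two compressed sizes is the kind of quantity a minimum-description-length reading of ‘information transferred’ would trade in; the uncomputability of the exact Kolmogorov measure is one reason the metaphysical interpretation carries so much weight in the argument above.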
van Benthem, J. (2006) “Epistemic Logic and Epistemology: the State of their Affairs”, Philosophical Studies 128, pp. 49-76.
Burks, A., ed. (1970) Essays on Cellular Automata. Urbana, IL: University of Illinois Press.
Chaitin, G. (1987) Algorithmic Information Theory. New York: Cambridge University Press.
Collier, J. (1999) “Causation is the Transfer of Information” in H. Sankey ed., Causation, Natural Laws and Explanation. Dordrecht: Kluwer.
Collier, J. (2010) “Information, Causation and Computation” in G. Dodig-Crnkovic & M. Burgin eds., Information and Computation: Essays on Scientific and Philosophical Understanding of Foundations of Information and Computing. Singapore: World Scientific.
Dowe, P. (2000) Physical Causation. Cambridge: Cambridge University Press.
Hintikka, J. (1962) Knowledge and Belief. Ithaca: Cornell University Press.
Illari, P. (2011) “Why Theories of Causality need Production: an Information-Transmission Account”, Philosophy & Technology 24:2 , pp. 95-114.
Illari, P & Russo, F. (2014) Causality: Philosophical Theory Meets Scientific Practice. Oxford: Clarendon Press.
Li, M. & Vitanyi, P. (1997) An Introduction to Kolmogorov Complexity and its Applications. New York: Springer-Verlag.
Salmon, W. (1994) “Causality Without Counterfactuals” Philosophy of Science 61, pp. 297-312.
Shannon, C. & Weaver, W. (1949) The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press.
Solomonoff, R. (1997) “The Discovery of Algorithmic Probability”, Journal of Computer and System Sciences 55(1), pp. 73-88.
Steinhart, E. (1998) “Digital Metaphysics” in T. W. Bynum & J. H. Moor ed., The Digital Phoenix: How Computers are Changing Philosophy. Oxford: Blackwell.
Toffoli, T. (1984) “Cellular Automata as an Alternative (Rather than an Approximation of) Differential Equations,” Physica D, 10: 117-127.
Wheeler, J. A. (1990) “Information, Physics, Quantum: The Search for Links” in W. H. Zurek ed., Complexity, Entropy and the Physics of Information. SFI Studies in the Sciences of Complexity, vol. 8. Reading, MA: Addison-Wesley.
Wolfram, S. (1986) Theory and Applications of Cellular Automata. Singapore: World Scientific.