Research Interests

Data science reference framework: Defining a new field of inquiry

Data science is not a science. It is a knowledge discovery and research paradigm of unfathomed scope, scale, complexity, and power, enabling knowledge discovery that is not otherwise possible and that can lie beyond human reasoning. It is changing our world practically and profoundly: it is already deployed in tens of thousands of applications in every discipline, in an AI arms race that, due to its inscrutability, can lead to unfathomed benefits and risks.

This research proposes the development of a coherent, unified definition of the data science[1] knowledge discovery paradigm by the data science community over the next decade. Emerging slowly since the early 1960s and rapidly since 2000, data science, while in its infancy, is a new knowledge discovery paradigm that is one of the most active, powerful, and rapidly evolving innovations of the 21st century[2]. Due to its value, power, and breadth of applicability, it is being applied in most disciplines, hundreds of research areas, and tens of thousands of applications. ChatGPT and other AI products have captured the world’s attention for AI’s potential benefits and risks. Yet we are just beginning to understand it. 106 data science publications contain myriad definitions of data science artifacts and problem solving. Developed largely independently, these definitions can be discipline- or application-specific, or mutually incomplete, redundant, or inconsistent. Hence, so is data science as a field of inquiry. While adequate for their source contexts, these definitions may not be adequate for all data science disciplines or for better understanding data science. The resulting challenges and opportunities constitute the data science multiple definitions challenge, one of many definitional challenges addressed in this research.

This research is developing the data science reference framework with which the data science community can understand, define, and develop data science as a field of inquiry. The framework consists of the philosophy of data science, the data science problem-solving paradigm (the data science method), and the data science problem-solving workflow, which, in turn, are defined in terms of the components of the data science reference framework (axiology, ontology, epistemology, methodology, methods, technology, community). It offers a much-called-for [10, 13] unifying framework to understand, unify, and define data science. A comprehensive data science definition may exceed 20,000 data science artifacts. While many data science artifacts are well understood, many others, e.g., epistemology or theory, are currently inscrutable.

To be of value, a definition of data science must represent a consensus of the data science community. Hence, this research proposes an online data science journal system (DSJS), to be operated by and for the data science community as a means for defining, unifying, and evolving data science artifact definitions and their relationships. This research presents a comprehensive solution [1]-[9], with 1,000+ proof-of-concept candidate definitions and a prototype DSJS, for the data science community to modify to meet its requirements.

Visionaries [1, 2, 3] herald data science as our Promethean Moment, one that changes everything and requires reimagining and restructuring our world. Unlike past moments based on single inventions, e.g., the printing press, this AI moment is based on a meta-technology[3] augmented by data for the whole world, in what Friedman [NYT March 21, 2023] calls The Age of Acceleration, Amplification, and Democratization, proclaiming impacts far beyond knowledge discovery.

[1] John Tukey was the originator of data analysis [11, 12], now called data science.

[2] 500,000 AI articles were published in 2021, more than in any other discipline, dominated by pattern recognition, machine learning, and computer vision. There are 200+ data science journals.

[3] A meta-technology produces new knowledge and hence is applicable to every human endeavor.

The data science Reference Framework is defined in nine documents.

  1. Brodie, M.L., Defining data science: a new field of inquiry, Harvard University, July 2023, arXiv preprint https://doi.org/10.48550/arXiv.2306.16177
  2. Brodie, M.L., Data science reference framework: Part 0: Introduction and Development Guidelines, Harvard University, in development.
  3. Brodie, M.L., A data science axiology: the nature, value, and risks of data science, Harvard University, July 2023, arXiv preprint http://arxiv.org/abs/2307.10460
  4. Brodie, M.L., Data science reference framework Part II: data science ontology: Informal definitions, Harvard University, in development.
  5. Brodie, M.L., Data science reference framework Part III: data science epistemology: data science reasoning, Harvard University, in development.
  6. Brodie, M.L., Data science reference framework Part IV: data science methodology: The data science method and workflow, Harvard University, in development.
  7. Brodie, M.L., Data science reference framework Part V: data science methods: Computational methods, solutions and results, Harvard University, in development.
  8. Brodie, M.L., Data science reference framework Part VI: data science technology: data science engineering, Harvard University, in development.
  9. Brodie, M.L., Data science reference framework Part VII: Data science Journal: A proof of concept system, Harvard University, in development.
  10. Donoho, David. 50 Years of Data Science. Journal of Computational and Graphical Statistics 26, no. 4 (2017): 745-766 (republished with comments).
  11. Tukey, John W. The Future of Data Analysis. The Annals of Mathematical Statistics 33, no. 1 (1962): 1-67. http://www.jstor.org/stable/2237638
  12. Tukey, John W. Exploratory Data Analysis. Addison-Wesley, 1977. ISBN 978-0-201-07616-5. OCLC 3058187.
  13. Stodden, Victoria. The Data Science Life Cycle: A Disciplined Approach to Advancing Data Science as a Science. Communications of the ACM 63, no. 7 (July 2020): 58-66. https://doi.org/10.1145/3360646

For more information on the Data Science Reference Framework project contact Michael L. Brodie (mlbrodie@seas.harvard.edu), DASLab, School of Engineering and Applied Sciences, Harvard University.

For a motivation for and introduction to The Data Science Reference Framework, see