The Data Heterogeneity problem

by Trinity College Dublin

Learning Outcomes

By the end of this section, you should be able to:

  • Explain what we mean by Data Heterogeneity
  • Explain why Data Heterogeneity is problematic for data management
  • Explain what causes data to become diverse
  • Describe some of the solutions to Data Heterogeneity

What is Data Heterogeneity?

Rapid advances in information technology enable the production of ever-greater volumes of data. However, rather than increasing access to information and the ability to understand it, this growth often results in mutually incomprehensible data silos. This can happen when institutions adopt different standards and curate their data in incompatible ways. As a result, researchers following particular lines of investigation must master many different information management systems in order to access the different pools of data and resources locked into these particular formats. A library, for example, may have a collection of research monographs by a particular historian, an archive of this same historian’s papers, and museum objects collected in the course of this historian’s active work. It is natural for a researcher working on this historian to want to connect and compare these information streams, since they all refer to the same person and the same real-world objects of research.

Different institutions, by adopting different standards for different ends, can end up blocking researchers in their investigations. The lack of a comprehensive way of representing this data, regardless of its location and the type of object documented, means that the researcher a) might entirely miss a stream of information because they could not find it; and b) will necessarily struggle to join and compare the information because of the different information management decisions taken at different institutions. What is hidden by these incompatible data silos may include not just relevant information from the researcher’s own field of research, but also data arising from other fields that would be relevant to the research in question. The information management challenge here lies in mitigating such data incompatibility and creating broader spaces of information compatibility, in order to support the greatest possible awareness of relevant resources and information for any given research question.

Data heterogeneity is the name we give to differences in data, and in how it is managed and represented in a knowledge representation system, that lead to incompatible data that cannot be compared.
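
To make this definition concrete, here is a minimal sketch in Python. The two records, their field names and their placeholder values are hypothetical, invented purely for illustration; they loosely echo the library and archive example above and are not drawn from any real catalogue.

```python
# Hypothetical records describing the same historian, as two institutions
# might hold them. All field names and values are invented for illustration.
library_record = {
    "author": "Example, Anna",       # "Surname, Forename" convention
    "title": "A Research Monograph",
    "year": 1950,                    # a single integer year
}
archive_record = {
    "creator_name": "Anna Example",  # "Forename Surname" convention
    "collection": "Personal papers",
    "date_range": "1940-1960",       # a free-text date span
}


def shared_fields(a, b):
    """Return the field names that both records have in common."""
    return sorted(set(a) & set(b))


def same_person_naive(a, b):
    """Naive check: succeeds only if both records use the 'author' field
    and spell the name identically."""
    return a.get("author") is not None and a.get("author") == b.get("author")


print(shared_fields(library_record, archive_record))
print(same_person_naive(library_record, archive_record))
```

Although both records concern the same person, they share no field names and record the name in different forms, so a direct comparison finds no connection between them.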

What causes this?

We can look at several factors that drive the heterogeneous production of data and the resulting problems of access.

Effective Cause (Disciplinary focus and research particularity)

  • flexibility of the medium – you can build and save data any way you like; there are no traditions and few practical limitations
  • rapid pace of change – the creation of information is dependent on change of software which is related to wider industry developments and generally outside the control of researchers
  • difference of tools –  researchers will deploy different tools for different ends. These tools and software will generate different forms of information in different formats which will not necessarily be compatible.
  • difference of actors – different kinds of actors, from individual researchers to institutions to large consortia, will have access to different means and be motivated by different goals, which ultimately affects data modelling choices and works against homogeneity of data formats
  • difference of means –  actors have access to different funds in order to be able to create and curate data leading to different decisions
  • differences of questions and traditions –  research programmes in digital humanities are often grounded in research questions and traditions that predate digital humanities techniques. These different modes of questioning focus on different forms of information and have different requirements for data formatting

Data is never as neat and easy to categorise as researchers would like, despite careful preparation. It is almost impossible to reduce this heterogeneous nature and, moreover, particularly within the Humanities, it is the variation that raises the most interesting questions or insights. Therefore, it is fundamentally necessary to have an information strategy that accepts and allows for these variants but supports their reconciliation at a more general level.

Material (Schematic) Cause

  • different schemas for data representation –  data must be saved in some form or structure. This is called a schema. It is the model by which you represent data and interlink it.
  • different formats for data representation –  data must ultimately be stored in a file format. This file format will have its own particular demands and limitations affecting the ultimate compatibility of data.

Representing information in a symbolic form, which is the fundamental task of information management, necessarily entails a reduction to a given format. The diverse factors driving alternative modes of data production and representation necessarily result in a wide set of existing standards and formats serving different needs that cannot and will not be eliminated. There will never be one standard format for all data. Rather, we must find means to translate between them.
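
One way to picture such a translation is a small Python sketch in which each source schema gets its own mapping into a shared model. Everything here is an assumption made for illustration: the records, the field names, the shared fields `person` and `label`, and the helper functions are hypothetical and do not represent any particular standard or tool.

```python
# Hypothetical source records held under two different schemas.
library_record = {"author": "Example, Anna", "title": "A Research Monograph"}
archive_record = {"creator_name": "Anna Example", "collection": "Personal papers"}

# One mapping per source schema: source field -> field in the shared model.
LIBRARY_MAPPING = {"author": "person", "title": "label"}
ARCHIVE_MAPPING = {"creator_name": "person", "collection": "label"}


def normalise_name(name):
    """Rewrite 'Surname, Forename' as 'Forename Surname'; leave other forms alone."""
    if "," in name:
        surname, forename = [part.strip() for part in name.split(",", 1)]
        return f"{forename} {surname}"
    return name.strip()


def translate(record, mapping, source):
    """Translate a record into the shared model, noting where it came from."""
    common = {"source": source}
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            continue  # fields without a mapping are skipped in this sketch
        common[target] = normalise_name(value) if target == "person" else value
    return common


library_common = translate(library_record, LIBRARY_MAPPING, "library")
archive_common = translate(archive_record, ARCHIVE_MAPPING, "archive")

# After translation, both records can be compared on the same field.
print(library_common["person"] == archive_common["person"])
```

The source records are left untouched in their own schemas; only the translated copies share a structure. This reflects the point above that the aim is to translate between formats rather than to replace them with a single one.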

Final (Apathetic) Cause

  • lack of interest in the data problem as such – researchers are naturally interested in what their data allows them to do, namely to analyse questions in their area of research. The question of data sustainability is a question about data itself. It therefore seems to fall outside the purview of humanities researchers and is considered an ‘IT’ problem. Since, however, long-term sustainable research depends on solving this problem, such a perspective is short-sighted.
  • lack of serious policies regarding data – declarations around data sustainability and requirements for data plans indicate good intentions, but without serious institutional commitments at all levels, these remain intentions at a hypothetical level

While data heterogeneity is a serious problem for long-term research, it is not an issue that tends to trouble the average humanities, arts or social sciences researcher on a daily basis. Digital humanities, of course, places a stronger focus on conducting humanities research by digital means. Computer scientists, on the other hand, are on the whole interested in data structures and their use, but from a theoretical perspective, and not normally in reference to the content that these structures represent.

The data heterogeneity problem occurs at the junction between these two domains. In order to get around this problem, we need both the humanist’s perspective, which makes one aware of the semantic intent of data structures, and the computer scientist’s perspective, which provides technically sophisticated solutions for structuring data. It is also perhaps necessary to take a philosophical and epistemic perspective on how first to conceive of and then to structure information.

 Watch!

Dr. George Bruseker discusses issues around using multiple datasets and the complexities that can arise through using different types of data.

Why does it matter?

Digital humanities is based on the premise that in a digital age, the study of arts and humanities requires new tools and new methods that harness the possibilities opened up by the new digital environment and digital tools. That is to say, digital humanities does not constitute a sub-discipline of humanities as such, but should be considered as a continuation of the humanities project, the critical investigation of human being, by other means. It thus extends and critically challenges the notion of humanities by bringing new tools and new perspectives from the world of computer science into the field of humanities, testing the limits and scope of this field.

In this on-going process, the question of how to approach data as information that can be used to generate new insights is crucial. There are many epistemic and philosophic issues at play here, as well as practical methodological issues. The question of data heterogeneity and how to manage it is central among these issues, and digital humanities must pose it both theoretically and with a view to reaching practical solutions.

At the disciplinary scale, data heterogeneity is a fundamental problem that affects the long-term viability and sustainability of digital humanities. It can also affect the ability to re-use datasets and to combine datasets, either within one discipline or across disciplines.

Already at the scale of individual projects, a lack of means to tackle this problem threatens the coverage and completeness of one’s research. Having a solid methodology for addressing the coverage of your digital corpus is crucial. At a more general level, if research is carried out in a digital environment and its research outputs are in digital formats, the importance of finding ways to express this data in a commonly accessible format over the long-term is pressing.

The tradition of scholarship, regardless of the tools for its expression, relies on the ability to preserve and test knowledge over time. Leaving the data heterogeneity problem unsolved and even unaddressed threatens the viability and credibility of digital research.

