Data Quality Assessment

by Trinity College Dublin

By the end of this section, you should be able to:

  • Understand what data quality means
  • Explain why data quality is important
  • Understand and explain the different approaches to assessing data quality
  • Understand the CoreTrustSeal

What is Data Quality?

Virtually everyone who owns data has had experience with data quality, albeit often not consciously. Maybe there is that holiday picture which was spoiled by lens flare; maybe you lost it because you forgot what you named the file; or maybe it vanished when the external hard drive on which it was stored broke down and could not be recovered.

All of these examples illustrate a lack of data quality and its possible consequences. They show why it is vital to handle your data with care and to be wary of the risks of quality loss.

Why is Data Quality important?

When using data for research, it is vital that the source can be both understood and trusted. This starts at the level of the data itself. When a digitized image is distorted, it can be harder to determine where the photo was taken or to recognize people's faces. This is why organizations may set technical image criteria, covering aspects such as color accuracy, bit depth, white balance and gain modulation.

At the metadata level, a variety of considerations are also important when deciding on metadata management. High-quality metadata greatly enhance the findability, accessibility (including restrictions where they are due), interoperability and reusability of the data they describe. The importance of these four features, and ways to make sure that your metadata live up to them, can be found in the section on the FAIR principles.

Lastly, the quality of the repository is important for the durability of your data. Quality here is mainly concerned with data management. This involves technical trustworthiness (“are the authenticity and integrity of the data preserved in a secure way?”) and legal coverage (“can the data be accessed under clear rights and licenses?”). To prove that both are taken care of, a data repository can be certified, e.g., under the Data Seal of Approval.

The following video from the WePreserve project (now ended) explains the importance of data quality:

https://www.youtube.com/watch?v=pbBa6Oam7-w&feature=youtu.be

When carefully planning data quality, it is important to involve all relevant stakeholders. This could include a wide range of groups, such as research communities, Research Infrastructures, data repositories, and Cultural Heritage Institutions.

The concept of Data Quality applies both to the (meta)data and to the repository in which these data are stored.

How do you assess Data Quality?

To make sure that the quality of your data is up to standard, you can investigate their structure. To assist this process, sets of guidelines have been designed. As stated earlier, the FAIR principles can be a useful tool when examining the level of data quality, because the findability, accessibility, interoperability and reusability of data are important factors in determining whether the full potential of the data is unlocked.

An important principle is that research data should be understandable to other researchers. For the assessment, this means that data formats and metadata need to be examined from the perspective of a user who has not worked with the data before. Can they find what they are looking for, can they gather the data they need, can they open the data, and can they understand the content? These are crucial questions when determining whether data can be reused.

Metadata need to be as complete and as transparent as possible. If codes or variables are used, the explanation of those codes and variables needs to be directly available. Additionally, users need to be able to determine which files contain what kind of data. They also need to be able to open the files, if possible without needing specific software or hardware.
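As an illustration of the point about codes and variables, the following minimal Python sketch checks whether every code used in a data column is explained in an accompanying codebook. The function name, the column values and the codebook entries are hypothetical and chosen only for this example.

```python
def undocumented_codes(values, codebook):
    """Return, in sorted order, the codes that occur in the data
    but have no explanation in the codebook."""
    return sorted({value for value in values if value not in codebook})


# Hypothetical "material" column from a collection inventory.
materials = ["WD", "BR", "WD", "CL", "IV"]

# Hypothetical codebook: each code mapped to its explanation.
material_codebook = {
    "WD": "wood",
    "BR": "bronze",
    "CL": "clay",
}

# "IV" is used in the data but never explained, so a new user
# could not interpret it.
print(undocumented_codes(materials, material_codebook))  # ['IV']
```

An empty result would suggest that a user who has not worked with the data before can look up every code they encounter.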

Generally, when applying the FAIR principles to data for their assessment, the following questions could serve as a point of departure:

  • Does the dataset have a persistent identifier?
  • Are metadata or documentation available? Are the metadata sufficient for fully understanding the data content?
  • Are the metadata accessible?
  • Does the dataset have a user licence with clear conditions of reuse? Do user restrictions apply?
  • Are the data files in a proprietary format, a well-supported ‘acceptable’ proprietary format, or a preferred/open format?
  • Do the data use a standardized coding scheme?
  • Are the data linked to other data, and if so, how?
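The questions above can be turned into a simple self-assessment checklist. The Python sketch below is one possible way to do so, not an official FAIR assessment tool: the record structure, its field names, the list of open formats and the example identifier are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    # Hypothetical description of a dataset; the field names are
    # illustrative and not taken from any metadata standard.
    persistent_identifier: str = ""
    documentation: str = ""
    metadata_accessible: bool = False
    licence: str = ""
    file_formats: list = field(default_factory=list)
    coding_scheme: str = ""
    linked_datasets: list = field(default_factory=list)


# Illustrative, non-exhaustive set of formats treated as open here.
OPEN_FORMATS = {"csv", "txt", "xml", "json", "tiff", "png"}


def assess(record):
    """Answer each checklist question with True or False."""
    formats = [name.lower() for name in record.file_formats]
    return {
        "has a persistent identifier": bool(record.persistent_identifier),
        "has metadata or documentation": bool(record.documentation),
        "metadata are accessible": record.metadata_accessible,
        "has a user licence": bool(record.licence),
        "all files in an open format": bool(formats)
        and all(name in OPEN_FORMATS for name in formats),
        "uses a standardized coding scheme": bool(record.coding_scheme),
        "is linked to other data": bool(record.linked_datasets),
    }


def summarise(results):
    """Summarise how many checks were passed."""
    return f"{sum(results.values())}/{len(results)} checks passed"


# Hypothetical dataset: well documented and licensed, but one file is
# in a proprietary format, and no coding scheme or links are recorded.
example = DatasetRecord(
    persistent_identifier="doi:10.1234/example",  # made-up identifier
    documentation="README.txt",
    metadata_accessible=True,
    licence="CC BY 4.0",
    file_formats=["CSV", "xlsx"],
)
print(summarise(assess(example)))  # 4/7 checks passed
```

A yes/no checklist like this only offers a point of departure. Questions such as whether the metadata are sufficient for understanding the content still require human judgement.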

How do you assess Data Repository Quality?

Even if the data adhere to almost all the FAIR principles, the data repository is an important factor in determining their long-term sustainability. For that reason, it is vital to choose a repository which adheres to legal and technical checks and balances, making it a trustworthy location where the authenticity, integrity and security of data are guaranteed.
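One widely used technique for monitoring integrity is to record a checksum when a file is deposited and to recompute it later. If the two values differ, the file has changed. The Python sketch below illustrates the idea with the standard library's hashlib module. The function names and the example file are hypothetical, and the sketch does not describe how any particular repository implements such checks.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks
    so that large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, recorded_checksum):
    """Return True if the file still matches the recorded checksum."""
    return sha256_of(path) == recorded_checksum.lower()


# Demonstration with a temporary file standing in for a deposited file.
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "example.csv"
    data_file.write_bytes(b"id,material\n1,WD\n")

    recorded = sha256_of(data_file)        # checksum stored at deposit time
    print(is_intact(data_file, recorded))  # True: the file is unchanged

    data_file.write_bytes(b"id,material\n1,BR\n")
    print(is_intact(data_file, recorded))  # False: the content has changed
```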

When you choose a data repository which carries a certification, the quality assessment has already been done for you.

Certification is available at three levels, in order of increasing trustworthiness:

  1. Core Certification is granted to repositories which obtain the CoreTrustSeal certification (see the next page in this module).
  2. Extended Certification is granted to Core Certification repositories which in addition perform a structured, externally reviewed and publicly available self-audit based on DIN 31644 (the nestor Seal).
  3. Formal Certification is granted to repositories which in addition to Core Certification obtain a full external audit and certification based on ISO 16363 (the ISO certification).

The first two levels are based on self-assessment combined with external review. The third and highest level, Formal Certification, is based on a full external audit.

For more information, see the Data Quality Repositories webinar by DANS (approximately one hour):

https://www.youtube.com/watch?v=VFLTJ7D2y5s&t=5m35s