Data lifecycle and curation

by Trinity College Dublin

The ‘data lifecycle’ is a commonly accepted model for describing the stages in the “life” of research data, from their creation to their re-use. Data are created for research projects, but their lifespan is usually longer than the duration of those projects. As a result, the ‘data lifecycle’ is often used to encourage long-term thinking and planning in the gathering and use of research materials. This perspective can also be useful in understanding the challenges research infrastructures face.

The UK Data Archive defines the research data lifecycle as having six stages, each of which involves a specific set of processes that infrastructures will seek either to support internally or to facilitate via external links. A brief synopsis of the phases of the data lifecycle is given below, along with some of the mechanisms existing research infrastructures provide to facilitate them.

Stages in the Data Lifecycle

This is only a brief overview of a very large field of expertise and practice, but one which research infrastructures can help researchers to manage more effectively in their own work. Given their depth of experience, these infrastructures can also often advise on practices such as creating data management plans and choosing formats that facilitate long-term storage and access for re-use.



Creating Data
At the outset, the research must be designed, data management planned (formats, storage, etc.) and consent for sharing arranged. This is followed by locating existing data, collecting data, and finally capturing and creating metadata.
RI support: This is a phase that RIs engage in very actively. Projects such as the European Holocaust Research Infrastructure (EHRI) find and federate records from thousands of different institutions for researchers to search and browse.

Processing Data
Processing data includes data entry, digitisation, transcription and translation, followed by checking, validating and cleaning the data. Data should also be anonymised where necessary, and described, managed and stored.
RI support: The CENDARI Note Taking Environment allows users to upload scanned images from their research in order to translate and transcribe them, take notes, and register significant entities found in them.

Analysing Data
In the analysis stage the researcher interprets data, derives new data and produces research outputs. The researcher also prepares data for preservation.
RI support: CLARIN WebLicht provides access to many of the tools most commonly used by computational linguists to interrogate data, such as tokenizers, part-of-speech taggers and parsers.

Preserving Data
Preserving data involves migrating the data to the best format and to a suitable medium, backing them up and storing them. Metadata and documentation are then created, and finally the data are archived.
RI support: ARIADNE has made a significant contribution to the preservation of archaeological data through the development and promotion of the Archaeological Reference Model, an ontology enabling a standardised approach to archaeological metadata.

Giving Access to Data
Once data have been preserved, they are ready for distribution and sharing. First, however, access must be controlled and copyright established. Promotion of the data comes next.
RI support: RIs give access to data in a number of ways. For example, the DARIAH HAL repository and the Hypotheses blog platform provide trusted, stable locations for research pre-prints and scholarly blogging, respectively.

Re-Using Data
In the final stage, follow-up research takes place. New research may be planned, with the researcher undertaking research reviews and scrutinising findings. Lastly, the data are used for teaching and learning.
RI support: The IPERION project is working to unify digital tools and protocols for storing, re-using and sharing multi-format scientific cultural heritage data. This long-standing network is an excellent platform for sharing research results for further development by partners.