Strategies for structuring and documenting data so that you, and others, can find and use it into the future.
Defining documentation and metadata
Purpose of documentation and metadata
Decisions about documentation and metadata
Types of documentation and metadata standards
Controlled vocabularies
Researchers must ensure that sufficient documentation or metadata (i.e. information about the data) is created and maintained to enable research data to be found, used and managed throughout its lifecycle.
Documentation and metadata requirements will differ depending on the discipline and the nature of the research. They should be identified during data planning and adopted for use by all the researchers working on the project.
Data documentation provides provenance or context for the data and ensures that the data can be understood in the long term. It may include information such as:
In the context of research data, metadata can be considered as a subset of your overall data documentation. Metadata is usually structured using standards or schemas. Common types of metadata include:
Documentation and metadata assist with many aspects of research, scholarly publishing and research data management, including the following:
Data that has been poorly documented will be difficult (or impossible) to find. Even if the data can be found, its value will be diminished if it is hard to interpret the contents, and to judge the quality or validity of the data. If it is not possible to determine when, where, how, and by whom the data was originally produced, there is also the risk that the data could be exploited inappropriately, or even accidentally destroyed.
The Australian Code for the Responsible Conduct of Research requires all researchers to maintain a list of their research data assets.
The UK Data Audit Framework Methodology proposes the following as the absolute minimum set of elements for a data register or inventory:
Some common descriptive standards work for many different kinds of material and across disciplines. The most widely used of these is Dublin Core.
This simple and general metadata standard facilitates the finding, sharing and management of data. It includes elements such as Title, Creator, Subject, Date and Type.
Dublin Core (or DC) is not specific to the research environment, to certain disciplines or to particular technologies. It can be used to describe many different types of data (not just digital), and is widely used as the metadata standard in institutional repositories, including the Monash University Research Repository.
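As an illustration only, a simple Dublin Core record could be produced along the following lines. This is a minimal sketch rather than a repository's actual export format: the `record` wrapper element, the `build_dc_record` function and the sample values are all assumptions made for the example. Only the element names (Title, Creator, Subject, Date, Type) and the Dublin Core element namespace come from the standard itself.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Return a Dublin Core description as an XML string.

    `fields` maps a DC element name (e.g. "title") to one value
    or to a list of values, since DC elements are repeatable.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # hypothetical wrapper element
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(record, encoding="unicode")


# Sample values are invented for illustration
example = build_dc_record({
    "title": "Soil moisture readings, site A",
    "creator": "A. Researcher",
    "subject": ["soil moisture", "field survey"],
    "date": "2015-03-01",
    "type": "Dataset",
})
print(example)
```

In practice a repository will usually generate this kind of record for you from a deposit form; the sketch simply shows how little information is needed to produce a usable description.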
In many disciplines, existing standards or best practices will be available that are specifically designed for describing and sharing data within a particular discipline or cluster.
Some examples include:
Discipline-specific standards abound. When choosing documentation and metadata standards for your research data, consider what is commonly used in your discipline as part of your data planning.
An identifier is a reference number or name for a data object and forms a key part of your documentation and metadata. To be useful over the long term, identifiers need to be:
Some common kinds of identifiers are:
Wherever possible, you should use an existing controlled vocabulary. Even if you need to adapt or customise an existing standard, this is likely to be preferable to creating something from scratch. Agreeing on a controlled vocabulary and applying it consistently makes your documentation and metadata more valuable: it improves searchability, provides context for your data in the future, and enables the data to be shared with other researchers in the same discipline. Keywords and tags can be easier to apply, but if researchers do not agree on their terminology, the data may be harder to find and use in future.
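One way to apply an agreed vocabulary consistently is to map free-text keywords onto preferred terms before they are recorded. The sketch below assumes a small, invented vocabulary (`VOCABULARY`) and hypothetical function names; a real project would load the terms from the published vocabulary used in its discipline.

```python
# Hypothetical agreed vocabulary: preferred term -> accepted variants
VOCABULARY = {
    "soil moisture": {"soil water content", "soil-moisture"},
    "rainfall": {"precipitation", "rain"},
}


def build_lookup(vocabulary):
    """Map every preferred term and variant (lower-cased) to its preferred term."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


def normalise_keywords(keywords, vocabulary):
    """Split keywords into (preferred terms, unrecognised keywords)."""
    lookup = build_lookup(vocabulary)
    accepted, rejected = [], []
    for keyword in keywords:
        key = keyword.strip().lower()
        if key in lookup:
            term = lookup[key]
            if term not in accepted:  # avoid recording a term twice
                accepted.append(term)
        else:
            rejected.append(keyword)
    return accepted, rejected


accepted, rejected = normalise_keywords(
    ["Precipitation", "soil-moisture", "rain", "mud"], VOCABULARY
)
print(accepted)  # ['rainfall', 'soil moisture']
print(rejected)  # ['mud']
```

Unrecognised keywords are returned rather than silently dropped, so the project team can decide whether to add them to the vocabulary or replace them with an existing term.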
Digital file names can be important for identifying and finding digital files. You should develop file naming conventions early in a research project, and agree on these with colleagues and collaborators before data is created.
Conventions will differ depending on the nature and size of a research project. In all cases, filenames should be unique, persistent and consistently applied, if they are to be useful for finding and retrieving data.
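A naming convention is easier to apply consistently when filenames are generated and checked by a script. The following is a sketch under stated assumptions: the pattern `<project>_<description>_<YYYYMMDD>_v<NN>.<ext>` is a hypothetical convention chosen for the example, not a recommended standard, and the function names are invented.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<description>_<YYYYMMDD>_v<NN>.<ext>
FILENAME_PATTERN = re.compile(
    r"^[a-z0-9]+_[a-z0-9-]+_\d{8}_v\d{2}\.[a-z0-9]+$"
)


def make_filename(project, description, created, version, extension):
    """Build a filename that follows the assumed convention."""
    # Lower-case the description and replace runs of other characters with "-"
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return (
        f"{project.lower()}_{slug}_{created:%Y%m%d}"
        f"_v{version:02d}.{extension.lower()}"
    )


def check_filenames(names):
    """Return (names that break the convention, names that occur more than once)."""
    invalid = [n for n in names if not FILENAME_PATTERN.match(n)]
    seen, duplicates = set(), []
    for n in names:
        if n in seen and n not in duplicates:
            duplicates.append(n)
        seen.add(n)
    return invalid, duplicates


name = make_filename("soilsurvey", "Site A readings", date(2015, 3, 1), 2, "csv")
print(name)  # soilsurvey_site-a-readings_20150301_v02.csv

invalid, duplicates = check_filenames([name, "Final data (2).xlsx", name])
print(invalid)     # ['Final data (2).xlsx']
print(duplicates)  # ['soilsurvey_site-a-readings_20150301_v02.csv']
```

Putting the date in year-month-day order and zero-padding the version number means that files sort chronologically in an ordinary directory listing.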
In deciding on digital file naming conventions, you should consider:
Many software programs allow structured metadata to be added in the form of "Properties". Common pieces of metadata that can be added include title, author, organisation, subjects and keywords, and additional comments.
Researchers can also ensure that digital files are well-structured internally. By simply adding document titles, authors and their contact details, dates, version control information, and column and row labels for tables and spreadsheets, you greatly increase the ability of your research data to be found, managed and interpreted over time.
A data dictionary, data definition file or schema describes the attributes of data fields, and may include any rules relating to how data is entered. This information can be stored internally (e.g. as a table in a database) or externally (e.g. as a separate document). External documentation should be retained with the data in the long term, as it will provide valuable context to the data over time.
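To make the idea concrete, a data dictionary can be written in a form that is readable by both people and software, and then used to check records as they are entered. The field names, rules and `validate_row` function below are hypothetical and serve only to illustrate the approach.

```python
# Hypothetical data dictionary: one entry per data field
DATA_DICTIONARY = {
    "site_id": {
        "type": str,
        "description": "Site code",
        "allowed": {"A", "B", "C"},
    },
    "reading_date": {
        "type": str,
        "description": "Date of reading, YYYY-MM-DD",
    },
    "moisture_pct": {
        "type": float,
        "description": "Volumetric soil moisture (%)",
        "min": 0.0,
        "max": 100.0,
    },
}


def validate_row(row, dictionary):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, rules in dictionary.items():
        if field not in row:
            problems.append(f"{field}: missing")
            continue
        value = row[field]
        if not isinstance(value, rules["type"]):
            problems.append(f"{field}: expected {rules['type'].__name__}")
            continue
        if "allowed" in rules and value not in rules["allowed"]:
            problems.append(f"{field}: {value!r} not an allowed value")
        if "min" in rules and value < rules["min"]:
            problems.append(f"{field}: below minimum {rules['min']}")
        if "max" in rules and value > rules["max"]:
            problems.append(f"{field}: above maximum {rules['max']}")
    # Fields that the dictionary does not describe are also reported
    for field in row:
        if field not in dictionary:
            problems.append(f"{field}: not defined in the data dictionary")
    return problems


good = {"site_id": "A", "reading_date": "2015-03-01", "moisture_pct": 23.5}
bad = {"site_id": "Z", "reading_date": "2015-03-01", "moisture_pct": 120.0}
print(validate_row(good, DATA_DICTIONARY))  # []
print(validate_row(bad, DATA_DICTIONARY))
```

Keeping the descriptions alongside the rules means the same file can double as the external documentation that is retained with the data.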
Subject headings, thesauri, taxonomies and ontologies are all examples of controlled vocabularies, i.e. lists of words or phrases used to provide consistent classification (or tagging). Like metadata standards, these vocabularies range from the very generic (e.g. Library of Congress Subject Headings and Dewey Classification) through to very discipline-specific lists created and maintained by experts in that field.
Storage and backup are just as essential for your documentation and metadata as they are for your research data, and the same guidelines apply.