b. Implementing an International Data Network
3. Working Group Discussions
3.2. Theme 2: Data Documentation and Publication
The development of digital data resources for marine geoscience data, along with new technologies for data visualization and analysis, is changing the way marine geoscience research is conducted. More and more scientists make use of digital data collections as primary resources for data in an area of interest, for conducting global syntheses, and to facilitate new multi-disciplinary studies. The utility of digital data resources fundamentally depends on the comprehensiveness and the quality of the data they provide and therefore requires that data are (a) openly and fully accessible, and (b) documented properly at all stages of the data life cycle, from initial acquisition, through processing, to primary and later secondary publication to ensure evaluation of data quality.
These requirements deeply impact the scientific data culture, imposing new obligations on scientists such as metadata compilation and full disclosure of data, and changing the way data is referenced and cited. This Theme focused on issues of Data Documentation and Publication.
3.2.1. Session I. Data Documentation
The breakout group on Standards for Data Documentation addressed the following questions:
Review current practices for different sub-domains.
How can we achieve standardized data documentation during acquisition in the field/at sea? For derived data?
How do we ensure the highest level of data quality? What metadata requirements are necessary?
What roles can and should agencies, ship operating institutions, and publishers play?
Working group discussions focused primarily on field data acquisition during marine surveys.
Current practices for data acquisition and documentation at sea are highly heterogeneous across the global marine geoscience community. In many cases, data documentation is the exclusive domain of the science party. While scientists must ensure that adequate documentation of their data of interest is obtained for their own use, this documentation is typically recorded only in the scientists' own workbooks or spreadsheets and is seldom captured for later incorporation into data systems. In addition, the documentation that scientists provide for their own data reduction purposes is often insufficient to facilitate later use of the data by others. The ROSCOP cruise form, widely used to report cruises within the European community, captures only minimal documentation of cruise operations. Furthermore, many modern expeditions routinely collect data types other than those of primary interest to the science party, and these can go largely undocumented. The challenge, then, is to move toward more thorough and complete data documentation for all marine programs carried out within the international research community.
The consensus is that, although the collection of cruise metadata is often incomplete and this is a global issue, improving data documentation at sea is a tractable problem. The needed information is already collected in some form during a field program. The problems are to find relatively easy ways to move this information out of the notebook or personalized electronic file of the scientist or technician and into a standardized format, and to formalize the transfer of these records to the relevant database system.
Procedures for capturing this information need to be of obvious benefit to the scientists themselves and must minimally impact their existing responsibilities. The current bureaucratic overhead of research for scientists is high and it is important to design documentation procedures that add minimum extra burden to their responsibilities.
To facilitate more complete documentation of data acquisition at sea, standardized metadata forms have been developed within some communities (e.g. the MGDS forms developed for the US MARGINS and Ridge2000 programs; www.marine-geo.org/metadata_forms.html). IFREMER has established a data quality plan that outlines procedures for standard data acquisition aboard their ships. The sample registry SESAR provides unique identifiers (the International Geo Sample Number) for samples to ensure that all sample analyses can ultimately be tied to a unique sample. The existing standardized forms of the MGDS were examined during breakout group discussions as possible working models for basic data documentation at sea. The information requested is generic and should not be considered an extra burden for scientists to provide. Marine expeditions involve a wide array of data collection activities in addition to the standard underway geophysical data streams (e.g. multibeam, gravity, magnetics), and all of these must be documented (e.g. cores and dredges, biology samples from dives, OBS deployments). Ideally, standard forms should be designed so that they can replace scientists' personal records. An “open format”, in which scientists can add columns to the standardized format according to their requirements, would be needed.
Recommendations
T2-R1. The ultimate responsibility for ensuring adequate documentation of a field program lies with scientists. On many ships and for many data types, the shipboard science support staff will produce the needed data documentation as part of their routine operations. But the shipboard support staff is unlikely to have access to all information on the full suite of data acquired during a program. Scientists bring their own sensors on board, and are typically in charge of station operations associated with sampling or instrument deployment. As the primary interest and responsibility for the scientific data acquired during an expedition reside with the science party, the ultimate responsibility for ensuring comprehensive documentation for all data should also lie with the scientists. On some ships (e.g. within the UK), a data/metadata specialist who is responsible for generating complete documentation of survey operations sails on each cruise. As an alternative, standard practice should include the identification of a “data liaison” from within the science party, who works with the ship’s support staff to ensure capture of all needed information.
T2-R2. Routine use of standardized data documentation procedures should be adopted by ship operators and scientists. Comprehensive and standardized data documentation at sea is a tractable goal. The standardized electronic metadata forms provided by the MGDS, the data quality plan of IFREMER, and assignment of IGSNs to samples are steps in the right direction and provide models for wider adoption. While ships are operated by different agencies in different countries, each with its own procedures and requirements for survey operations, the concept of standard metadata forms should be generally applicable. Metadata forms need to be developed in close collaboration with users, with easy mechanisms for users to customize forms for specialized use. Required basic cruise-level information should include a listing of the science party with roles and affiliations, an inventory of all projects associated with the expedition, and an inventory of all data types collected.
Minimum required metadata for all kinds of data acquisition are date, time, latitude-longitude, and depth. All rock and sediment samples should be assigned International Geo Sample Numbers (IGSNs). For sensor data, other required basic information includes:
Information on all ship sensors operated during the program, including manufacturer, make, model, and if possible serial number.
Basic sensor information for any sensors brought on board by the science party.
Calibration information for all equipment.
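The minimum metadata listed above could be checked automatically at the time of entry. The following Python sketch is illustrative only: the record structure and field names (`AcquisitionRecord`, `depth_m`, `is_physical_sample`) are assumptions made for this example and do not reproduce the MGDS forms or any existing system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class AcquisitionRecord:
    """One data acquisition event (hypothetical structure for illustration)."""
    timestamp: Optional[datetime]     # date and time of acquisition
    latitude: Optional[float]         # decimal degrees, north positive
    longitude: Optional[float]        # decimal degrees, east positive
    depth_m: Optional[float]          # depth in meters
    is_physical_sample: bool = False  # True for rock or sediment samples
    igsn: Optional[str] = None        # International Geo Sample Number


def missing_fields(rec: AcquisitionRecord) -> List[str]:
    """Return the names of required fields that are absent or out of range."""
    problems = []
    if rec.timestamp is None:
        problems.append("timestamp")
    if rec.latitude is None or not -90.0 <= rec.latitude <= 90.0:
        problems.append("latitude")
    if rec.longitude is None or not -180.0 <= rec.longitude <= 180.0:
        problems.append("longitude")
    if rec.depth_m is None:
        problems.append("depth_m")
    # Physical samples additionally require an IGSN.
    if rec.is_physical_sample and not rec.igsn:
        problems.append("igsn")
    return problems
```

A form or logging tool could call `missing_fields` before accepting an entry and prompt the user for whatever is still missing.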
T2-R3. Automated tools for metadata creation at sea are needed. Creating metadata suitable to support long-term preservation of data is time consuming for scientists, who lack sufficient incentive to do it. New automated methods to tag data with required metadata at the time of data acquisition are needed. The long-term vision for supporting marine geoscience data acquisition is a web-based shipboard event logging system that pulls in the required information, such as navigation, person, sampling event or operation, and sample type. The shipboard event logging system should include pull-down menus of controlled vocabularies to describe operations. A comprehensive shipboard data acquisition system is in use for IODP cruises and is a model for wider application.
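The core of such an event logging system can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of the IODP system or any existing tool: the vocabulary in `OPERATIONS`, the `EventLog` class, and the `nav_source` callback (standing in for the ship's navigation feed) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional, Tuple

# Hypothetical controlled vocabulary; a real system would load an agreed list.
OPERATIONS = {"multibeam", "gravity", "magnetics", "core", "dredge",
              "CTD cast", "OBS deployment"}


@dataclass
class Event:
    timestamp: str   # ISO 8601 time of the event
    latitude: float  # pulled from navigation, not typed by the user
    longitude: float
    person: str
    operation: str   # must be a term from OPERATIONS
    note: str = ""


class EventLog:
    def __init__(self, nav_source: Callable[[], Tuple[float, float]]):
        # nav_source returns the current (latitude, longitude) of the ship.
        self._nav = nav_source
        self.events: List[Event] = []

    def log(self, person: str, operation: str, note: str = "",
            when: Optional[datetime] = None) -> Event:
        """Record an event, tagging it with time and position automatically."""
        if operation not in OPERATIONS:
            raise ValueError(f"unknown operation: {operation!r}")
        when = when or datetime.now(timezone.utc)
        lat, lon = self._nav()
        event = Event(when.isoformat(), lat, lon, person, operation, note)
        self.events.append(event)
        return event
```

Because time and position are attached by the system, the person logging an event only selects an operation from the controlled list and optionally adds a note, which keeps the extra burden on the science party small.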
T2-R4. Funding agencies must be involved in enforcing standard practices for data documentation and submission to data centers. Requirements for the standard documentation and submission of data acquired during all field programs will need to be enforceable through funding agency actions.
3.2.2. Session II: Data Publication
Discussions in this working group were concerned with issues relating to policies and procedures for data publication:
What data need to be accessible (raw vs. derived, published vs. unpublished)?
How should data be identified? (Use & granularity of unique identifiers for data)
How can new requirements for data publication be implemented, and what are the special disciplinary issues?
Issues concerning data publication are a key concern to both individual scientists and to data system providers. Scientists publish the data that they acquire through analytical, experimental, or computational procedures as a major product of their research, ‘marketing’ them to gain credit and reputation that ultimately form the currency of their careers (Edwards et al. 2007). In many scientific cultures, data have traditionally been treated as private intellectual property and have typically been shielded carefully, often even after publication. Journal articles frequently contain only fragments of a ‘published’ dataset (tables with ‘representative analyses’). Publication of raw data has been a rare exception and data documentation in general is poor and quite heterogeneous.
Edwards et al. (2007) state that the “private-ownership practice has led to a plethora of data collection practices and data formats, many of them idiosyncratic, as well as an absence of the metadata needed by other scientists to understand how the data was originally produced.”
While many scientists now recognize the benefits of digital data collections and support their existence, they are rightfully concerned that access to their data via digital data resources will circumvent the original journal publication of the data, leaving them without proper citation and credit for their data. Policies and procedures for data publication, as well as the design of a global data network, need to address these concerns. The appropriate use of globally unique identifiers can contribute to a satisfactory solution: such identifiers allow datasets to be identified and cited independently of a journal publication, and also allow data in digital data collections to be linked to the original publication in the scientific literature.²
Scientific data come in many different types. The main differences relate to their origin (e.g. sensors, observation, experiment, modeling), their nature (digital data, physical specimens, numerical, images, video), and the level of processing (raw data, corrected, reduced, or ‘derived’ data). Data related to oceanic expeditions range from geophysical to geochemical to biological data. Data acquired shipboard range from raw to processed data, among them underway geophysical data streams (e.g. multibeam, gravity, magnetics), CTD casts, and rock, fluid, or biological samples. ‘Derived’ data are mostly generated on-shore in laboratories, with application of a wide range of processing procedures to raw geophysical data or by analyzing samples collected during a cruise.
Guidelines are necessary to define criteria for identifying data that should be preserved, data that should be published, and data that should be ‘discarded’ after use. An example of such guidelines is the "Rules of Good Scientific Practice" adopted by the Max-Planck-Society, which take a general perspective on the data preservation issue:
“Scientific examinations, experiments and numerical calculations can only be reproduced or reconstructed if all the important steps are comprehensible. For this reason, full and adequate reports are necessary, and these reports must be kept for a minimum period of ten years, not least as a source of reference, should the published results be called into question by others.”
² For example, the German project “Publication and Citation of Scientific Primary Data” (http://www.std-doi.de) has implemented a prototype system for the publication of scientific data, which is open to the scientific community in any scientific field. This project uses persistent identifiers (DOI, handle.net and URN) to identify datasets available in a digital format.
A large part of the discussion concerned who should submit the data to the archive (database), revealing differences in culture between countries in how ships are operated. It also brought to the forefront that data submission requires standard data input, such as cruise name, dates, and participants, that is already available in some form to the ship operator. These standard data should be pre-loaded, or be easily available without re-entry.
Recommendations
T2-R5: All data necessary to reproduce published scientific results need to be published and archived in an accepted data archive. Raw data from sensors should be archived along with the appropriate metadata that allow processing and interpretation of the data. In addition, standard (routine) corrections should be applied to the “raw” data to make the data more easily usable to a larger community. These corrected data should be archived as well. Physical samples are considered ‘raw’ data for analytical data such as geochemical measurements and should be archived to ensure that analytical data are reproducible and can be complemented by new measurements. Sample repositories barely exist for samples from ocean going expeditions, and are virtually absent for land-based expeditions. It is critical that samples carry globally unique identifiers to ensure unambiguous identification and allow tracking their analytical history.
During a cruise, some data types may be processed. Files with processed data should be submitted to the relevant databases, accompanied by adequate documentation about the processing method. For post-cruise processed data, the situation can be very different. While it is unclear how to proceed, there was consensus that PIs should notify collecting institution database groups when they submit processed data to relevant data banks.
T2-R6: Data submission should be streamlined and standardized. Procedures are needed to seamlessly integrate data into databases, and make the process of data submission as easy as possible for scientists, while ensuring comprehensive and consistent data documentation. Data submission requires standard data input, like cruise name, dates, participants, etc. that are already available in some form to the ship operator. This standard data should be easily available so that researchers submitting their data do not have to re-enter this information.
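One way to avoid re-entry is to attach the operator's cruise record to each submission automatically. The Python sketch below illustrates the idea; the function name `prefill_submission` and the field names (`cruise_name`, `start_date`, `end_date`, `participants`) are assumptions chosen for this example, not part of any existing submission system.

```python
from typing import Any, Dict


def prefill_submission(cruise: Dict[str, Any],
                       dataset: Dict[str, Any]) -> Dict[str, Any]:
    """Attach operator-held cruise metadata to a dataset submission.

    The scientist supplies only dataset-level fields; cruise-level fields are
    copied from the operator's record so they are not typed in again.
    """
    required = ("cruise_name", "start_date", "end_date", "participants")
    missing = [key for key in required if key not in cruise]
    if missing:
        raise KeyError(f"cruise record lacks: {', '.join(missing)}")
    return {
        "cruise": {key: cruise[key] for key in required},
        "dataset": dict(dataset),
    }
```

With this arrangement the cruise-level fields are entered once by the ship operator and reused for every dataset from that cruise, which also keeps them consistent across submissions.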
Data types such as geochemical measurements need a standard set of parameters (sample and analytical metadata) at the time of publication to accompany the sample information before a paper is accepted. Editors need to link acceptance of a manuscript to the submission of the data and accompanying metadata to a public “accepted” archive. Whenever possible, published derived data should be in a re-usable format (e.g. electronic data table).
T2-R7: Unique identifiers for data should be used at the level of a study or publication. The working group reached consensus that unique identifiers for data should be applied at the level of a “study” or “publication”, and not at finer granularity such as a single analysis. This recommendation pertains to raw data as well as peer-reviewed published data, which is often derived data. Modern publications already have unique identifiers (DOI). Older publications might not, and incorporation of that data in databases might require “new” unique identifiers.
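Study-level identification can be illustrated with a short Python sketch. It assumes a simple scheme invented for this example: a publication's DOI is reused where one exists, and a locally generated placeholder (prefixed `local:`) stands in for the “new” identifier of an older publication. How such identifiers would actually be minted and registered is not specified here.

```python
import uuid
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def study_identifier(doi: Optional[str]) -> str:
    """Return a study-level identifier: the DOI if present, else a placeholder."""
    if doi and doi.strip():
        return "doi:" + doi.strip()
    # Hypothetical local scheme for publications that have no DOI.
    return "local:" + uuid.uuid4().hex


def group_by_study(
    analyses: Iterable[Tuple[str, dict]]
) -> Dict[str, List[dict]]:
    """Group individual analyses under the identifier of their study."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for study_id, analysis in analyses:
        grouped[study_id].append(analysis)
    return dict(grouped)
```

Individual analyses carry no identifier of their own in this sketch; they are found and cited through the study they belong to, in line with the granularity recommended above.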
T2-R8: Scientific societies should take on an active role in formulating best practice