Tuesday, June 10, 2008

Keeping research data safe : JISC

Author: Neil Beagrie; Julia Chruszcz; and Brian Lavoie

Publication date: 12 May 2008

Publication Type(s): Reports

JISC theme(s): e-Research, Information environment

This study has investigated the medium to long-term costs to Higher Education Institutions (HEIs) of the preservation of research data and developed guidance to HEFCE and institutions on these issues. It has provided an essential methodological foundation on research data costs for the forthcoming HEFCE-sponsored feasibility study for a UK Research Data Service. It will also assist HEIs and funding bodies wishing to establish strategies and TRAC costings for long-term data management and archiving.

Executive summary

The rising tide of digital research data raises issues relating to access, curation and preservation for HEIs and within the UK a growing number of research funders are now implementing policies requiring researchers to submit data management, preservation or data sharing plans with their funding applications. This study provides:

  • Brief overviews of the potential benefits to HEIs of preservation of research data; issues that HEIs will need to consider when determining the medium to long-term costs of data preservation; and different service models
  • A framework and guidance for determining costs consisting of:
      • A list of key cost variables and potential units of record
      • An activity model divided into pre-archive, archive, and support services
      • A resources template including major cost categories in TRAC, divided into the major phases from our activity model and by duration of activity
  • A series of case studies from Cambridge University, King’s College London, Southampton University, and the Archaeology Data Service at York University, illustrating different aspects of costs for research data within HEIs
  • Recommendations for future work and use/adaptation of software costing tools to assist implementation

Overall our approach has focused on developing a framework for determining costs and the major deliverable from the study has been the costing framework.

In addition our case studies and specific work on costs provide valuable examples of research data costs. Given the emerging nature of the field, the limited time for the study, and sample size of case studies and interviews these must be regarded as illustrative examples of costs. However there are a number of emerging findings from them which are potentially very significant and which we have recommended should be explored and tested further in future work:

Institutional data repositories

Our case studies suggest that the service requirements for data collections and the best structure for organising relevant services locally will be more complex than many have thought previously. Both Cambridge and KCL are developing central repositories to work with departmental facilities and discussing federated local data repositories for research data preservation combining services and skills from central and departmental repositories. Costs for the central data repository component at Cambridge and KCL are an order of magnitude greater than that suggested for a typical institutional repository focused on e-publications alone. These costs are discussed in greater detail in Chapter 10 of the full report and briefly summarised below:

Annual recurrent costs

Institutional Repository (e-publications)
  • Staff: 1 FTE
  • Equipment (capital depreciated over 3 years): £1,300 pa

Federated Institutional Repository (data), Cambridge
  • Staff: 4 FTE
  • Equipment (capital depreciated over 3 years): £58,764 pa

Federated Institutional Repository (data), KCL
  • Staff: 2.5 FTE
  • Equipment (capital depreciated over 3 years): £27,546 pa

Long-term digital preservation costs

The profile of costs across functions within the national data centres we interviewed appears to be very consistent. It was notable that they all believed their accessioning and ingest costs were higher than ongoing long-term preservation and archiving costs. For example the following approximate division of costs across high-level archive functions of our activity model were suggested for the UK Data Archive:

  • Acquisition and Ingest: c. 42%
  • Archival Storage & Preservation: c. 23%
  • Access: c. 35%
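
To make the division above concrete, the short Python sketch below applies the approximate UK Data Archive shares to a budget. The function name split_budget and the 100,000 total are illustrative assumptions, not figures or tools from the study.

```python
# Approximate shares suggested for the UK Data Archive in the summary above.
ARCHIVE_COST_SHARES = {
    "Acquisition and Ingest": 0.42,
    "Archival Storage & Preservation": 0.23,
    "Access": 0.35,
}


def split_budget(total, shares=ARCHIVE_COST_SHARES):
    """Allocate a total budget across functions in proportion to their shares."""
    share_sum = sum(shares.values())
    return {name: total * share / share_sum for name, share in shares.items()}


if __name__ == "__main__":
    # ASSUMPTION: 100_000 is a hypothetical total, not a cost reported by the study.
    for function, amount in split_budget(100_000).items():
        print(f"{function}: {amount:,.0f}")
```

On these shares, acquisition and ingest alone would absorb more of any budget than archival storage and preservation, which is the pattern the data centres reported.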

The implications of this for the cumulative long-term costs of archiving research data are particularly interesting and perhaps point to potentially effective management strategies (addressing issues early during acquisition and ingest) for managing longer-term costs. In a similar vein, the Archaeology Data Service (ADS) has been in operation for 10 years and provided an interesting projection of its long-term preservation costs for research data based on its costs to date and ongoing trends. This shows relatively high costs in the early years after accessioning but costs declining to a minimal level over 20 years as follows:

[Diagram: 5-yearly and cumulative refreshment cost (ADS)]

The ADS projection is a complex mix of underlying trends such as long-term declining data storage costs, costs for ongoing actions such as preservation interventions (file format migrations), and assumptions of archive growth which provide economies of scale. However, the implications of these factors and of the projection for the sustainability of data archives, e.g. via archive charges to project budgets, are notable and worthy of more extensive study and testing.

Archive economics

We have observed and documented a number of significant issues for archives and preservation costs including:

  • Timing: Our activity model allows for consideration of relative costs arising from when activities are undertaken. We provide examples such as that from the Digitale Bewaring Project, which estimated costs of c. 333 euros for the creation of a batch of 1000 records in the pre-archive phase. In contrast, once 10 years have passed and material has been transferred to an archive, it may cost 10,000 euros to ‘repair’ a batch of 1000 records with badly created metadata.
  • Efficiency curve effects: Our case studies illustrate a number of efficiency curve effects. The start-up phases of repositories reflect both the ramping-up of activities, e.g. recruitment of staff, and specific start-up activities such as developing new policies and procedures for the archive. The start-up costs, particularly in terms of staff time, can be substantial. The operational phases reflect increasing productivity and efficiency as procedures become established, tested and refined and the volume of users and deposits increases.
  • Economy of scale effects: We identify the importance of economies of scale and the impact this has on unit costs for digital preservation. As an example, the University of London Computer Centre (ULCC), which runs the National Digital Archive of Datasets, provided us with costs for accession rates of 10 or 60 data collections: a 600% increase in accessions only increases costs by 325% as a result of economy of scale effects.
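
The timing and economy of scale figures above can be restated as simple ratios. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes the ULCC percentages are read as multipliers (60 accessions is 6 times 10, and costs at the higher rate are 3.25 times costs at the lower rate). A different reading of "increases costs by 325%" would change the second result.

```python
def per_record_cost(batch_cost, batch_size=1000):
    """Cost per record for a batch of records."""
    return batch_cost / batch_size


def relative_unit_cost(volume_multiplier, cost_multiplier):
    """Unit cost at the higher volume as a fraction of unit cost at the lower volume."""
    return cost_multiplier / volume_multiplier


if __name__ == "__main__":
    # Timing: euros per record, from the Digitale Bewaring Project example.
    create = per_record_cost(333)      # metadata created in the pre-archive phase
    repair = per_record_cost(10_000)   # metadata repaired after 10 years
    print(f"Repair cost relative to creation cost: {repair / create:.1f}x")

    # Economy of scale, from the ULCC example.
    # ASSUMPTION: 6.0 and 3.25 are multipliers (see the paragraph above).
    ratio = relative_unit_cost(volume_multiplier=6.0, cost_multiplier=3.25)
    print(f"Unit cost at 60 accessions relative to 10: {ratio:.2f}")
```

On these figures, repairing metadata late costs roughly 30 times as much as creating it properly early, and, under the stated reading, the cost per data collection at the higher accession rate is a little over half of that at the lower rate.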

'First-mover innovation' costs

Within our activity model we have identified digital preservation costs attributable to the traditional areas of archive storage, data management and preservation planning. However, in addition we have identified activities and costs relating to the category of 'First-Mover Innovation' costs. Where preservation functions and file formats are evolving, a high degree of R&D expenditure might be required in implementation phases and in developing the first tools, standards and best practices. Many of the disciplines and archives covered in this study have made considerable investments as communities in evolving shared standards, practices, and tools and we believe this could be making a significant impact on their long-term digital preservation costs.

The cost framework

Our case study sites found the cost framework approach of value to their institutions and it will benefit from wider adoption, testing and evolution in other HEIs. Its particular strengths are:

  • It is based on Full Economic Costs (FEC), which are absent from or only partial in other models. We believe absence of FEC (a) can distort business cases and under-estimate cost benefits, e.g. for automation, and (b) means HEIs cannot accurately compare in-house and out-sourced costs
  • It can cost for in-house archive, full or partial shared service(s), or archive charges to projects and is implementation and technology-neutral. It is applicable in most digital preservation contexts, regardless of choices involving system architecture, preservation strategy, or service delivery
  • It is tailored for research data by allowing for different data collection levels and preservation aims, and data-specific activities such as generating products from data

Summary of recommendations

This has been an intensive study over a period of 4 months focusing on the issue of the preservation costs of research data for UK HEIs. Our recommendations for future work to develop and implement outcomes from the study are discussed in detail in Chapter 11 of the full report and summarised below:

Recommendation 1
The outcomes of this study should be considered and utilised by the forthcoming JISC Data Audit Framework study.

Recommendation 2
Departments and Central Services within HEIs should utilise recurrent data audits to inform both their initial appraisal and development of data policies and future capacity planning for services.

Recommendation 3
HEIs should consider utilising the US National Science Board (the governing body for the National Science Foundation) long-lived data collection levels to aid understanding and categorisation of user requirements and costs over time.

Recommendation 4
HEIs should consider federated structures for local data storage within their institution comprising data stores at the departmental level and additional storage and services at the institutional level. These should be mixed with external shared services or national provision as required. HEIs should work with and utilise national and international disciplinary data archives where these exist. The hierarchy of data stores should reflect the detailed nature of the content, services required, and the changing nature of its importance over time.

Recommendation 5
We recommend that HEIs, research funders, and service providers consider this study and further work on the development and implementation of relevant cost models and tools.

Recommendation 6
JISC should produce a short briefing paper or summary of this report and its findings aimed at senior managers including university academics, administrators and research support services.

Recommendation 7
JISC should consider developing project costing tools to build on and implement work within this study. These tools may be valuable for some of JISC’s own projects and may also be of interest to other research funders and have potential for joint funding and development.

Recommendation 8
JISC should consider undertaking additional work to examine how the cost components and variables defined in our framework can be further quantified, and what additional data and data collection mechanisms are needed to support them.

Recommendation 9
JISC should consider further detailed study of longitudinal data for digital preservation costs and cost variables to extend the work of this study. This could possibly be part of a UK-based taskforce to feed into its joint international work on digital preservation costs.

Recommendation 10
JISC and/or other funders should consider funding further work on quantifying the benefits of research data preservation.

Download the full report below

Keeping research data safe : JISC
