Data Caring: Why Manage Your Research Data?


Julia Barrett UCD James Joyce Library

An Leabharlann UCD

Outline •Research data: definition •Data management drivers •Data management benefits •Data management components •Documentation practices •File management •Storage •Access management

•Data sharing •Data repositories •Help @ UCD

What is research data? “The data, records, files or other evidence, irrespective of their content or form (e.g. in print, digital, physical or other forms), that comprise research observations, findings or outcomes, including primary materials and analysed data.” – Australian National Data Service

Examples: •Statistics and measurements •Results of experiments or simulations •Laboratory notebooks •Observations e.g. fieldwork •Survey results – print or online •Interview recordings and transcripts •Images, from cameras and scientific equipment

Five Top Reasons to Protect Your Data and Practise Safe Science

https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/

1. Data output is growing rapidly


• 90% of all the data in the world has been generated over the last 2 years. Scientific data output is currently increasing at an annual rate of 30%.

2. Despite significant investment, data is not being managed effectively

• $1.5 trillion is the current estimated total global spend on research and development, which could be at risk. Much of the data generated is lost – in one study, the odds of sourcing datasets declined by 17% each year, with 80% of datasets over 20 years old not available.

80% of datasets over 20 years old not available

www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

3. Much of the data remains unverifiable

• 54% of the resources used across 238 published studies could not be identified, making verification impossible.

4. Time and money is wasted, impacting on science and society

• Since 2000, over 80,000 patients have taken part in clinical trials based on research that was later retracted because of error or fraud. The number of retractions due to error has grown over fivefold since 1990.

5. Funders increasingly require data management and sharing policies

• Key funding bodies such as the NIH, MRC and Wellcome Trust now request data management plans be part of applications.

Why manage data? • “I want to be able to find my data two years from now” • “My colleague left 6 months ago and I can’t make any sense of his data” • “On a recent train journey my shampoo leaked into my laptop and I lost my files….” • “I dropped my laptop, lost about 2 months of work, yeah I know, should make backups more often, but I could never think of this happening.”

Lost data •http://blog.allusb.com/2011/03/rising-trend-in-lostusb-flash-drives/

At more than 500 laundromats and dry cleaners in the UK, 17,000 USB flash drives were left behind between December 2010 and January 2011. According to the study’s researchers at Credant Technologies, that’s a 400 percent increase in lost devices compared to the year before.

• http://www.connachttribune.ie/galwaynews/item/1372blaze-that-destroyedgalway-hse-unitsuspicious-say-gardai

Benefits of Managing Data • Saves time – being able to find things • Reduces possibility of data loss through managed back-ups, storage and security processes • Reduces errors e.g. badly described data, confusion between file versions • Enables you and others to find and understand what you have done through the provision of descriptions, metadata, file management etc.

• Provides evidence of work undertaken • Provides evidence of validity of work undertaken • Verifies – provides evidence of logical processes and methods • Ensures retraceability and reproducibility

Data Management Plan (DMP) A data management plan is a formal and practical document developed at the start of a research project which outlines all aspects of the data, including: • The nature of your data • How it is organised and described • How it is shared with others • How it will be stored in the long-term – https://www.admin.ox.ac.uk/rdm/dmp/plans/

Developing a data management plan helps to ensure the research data are accurate, complete, reliable, and secure both during and after completion of the research. Funding bodies increasingly require that grant applications include data management plans. For example, current NSF (National Science Foundation - American funding agency) policy states that as of January 18, 2011 all NSF proposals must have a supplementary document of no more than two pages labelled data management plan. http://www.youtube.com/watch?v=Lc82pxxRkMo

Data management components Documentation practices • Project documentation; process documentation; data documentation

File management • File organisation; File naming; File formats

Storage • Backup strategy; Security

Access management • Data sharing; publishing; archiving

Data Management Checklist http://www.ucd.ie/t4cms/Guide121.pdf

Documentation practices

CONTEXT • Principal investigator • Researchers/other project members • Main contact details • Collaborators/Partner Institutions • Roles and responsibilities • Funding source(s) and requirements • Budget

PROJECT DESCRIPTION • Project title • The aim/purpose of the research • Project duration

INVENTORIES • Servers, directories, data, lab equipment etc.

Documentation practices • Data Capture – How will data be created? – Any special hardware / software requirements? – How will metadata be captured, created and managed?

• Processes – Sometimes individual effort, sometimes collaborative – Protocols, code commentary – Workflow descriptions/diagrams

Data documentation • “A crucial part of making data user-friendly, shareable and with long-lasting usability is to ensure they can be understood and interpreted by any user. This requires clear data description, annotation, contextual information and documentation. • Data documentation explains how data were created or digitised, what data mean, what their content and structure are, and any manipulations that may have taken place. It ensures that data can be understood during research projects, that researchers continue to understand data in the longer term and that re-users of data are able to interpret the data. Good documentation is also vital for successful data preservation.” (UK Data Archive).

• Good documentation ensures your data can be: – Searched for and retrieved – Understood now and in the future – Properly interpreted, as relevant context is available.

Metadata Definition Metadata is...data about data “Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource”. It enables: • resource discovery and retrieval • data sharing and reuse – allows data to be interpreted or analysed by others • management of resources – records aspects of the production and curation process, rights information, location and access information

Metadata: Data about Data Three broad categories of metadata are: Descriptive - common fields such as title, author, abstract, keywords which help users to discover online sources through searching and browsing. Administrative - preservation, rights management, and technical metadata about formats. Structural - how different components of a set of associated data relate to one another, such as a schema describing relations between tables in a database.

Metadata answers basic questions about data • Who created and maintains the data? Who can access it? • Why was the data created? • What is the content and structure of the data? What changes have been made to it? • When collected? When published?

• Where is the geographic location? – Where is the data held?

• How was the data produced?

“What information would I need to understand and use this data in twenty years?”
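The who/why/what/when/where/how questions above can be captured as a small structured record kept alongside the data. The sketch below is only an illustration: the field names and every value in it are hypothetical placeholders, and it does not follow any particular metadata standard.

```python
import json

# Hypothetical record: field names and values are placeholders for illustration
# and are not taken from any specific metadata standard.
metadata = {
    "who": {"creator": "A. Researcher", "maintainer": "Example Research Group",
            "access": "project members only"},
    "why": "Collected to study seasonal change in river water quality",
    "what": {"content": "Weekly pH and temperature readings",
             "structure": "One CSV file per site",
             "changes": "Outliers flagged, none removed"},
    "when": {"collected": "2012-01 to 2012-12", "published": "2013-06"},
    "where": {"location": "Example River, Site A",
              "held": "Institutional network storage"},
    "how": "Handheld probe, calibrated before each visit",
}

QUESTIONS = ["who", "why", "what", "when", "where", "how"]


def unanswered(record):
    """List the basic questions that a metadata record leaves empty or omits."""
    return [question for question in QUESTIONS if not record.get(question)]


# Store the record as JSON next to the data files it describes.
print(json.dumps(metadata, indent=2))
print(unanswered(metadata))
print(unanswered({"who": "A. Researcher"}))
```

A check like `unanswered` is one possible way to spot gaps before data are archived; a discipline-specific standard, where one exists, would normally define the actual fields.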

Considerations for choosing a standard • The discipline, domain • The format of the data • Repository or funder requirements • Recognition and/or certification of standard • Controlled vocabularies, thesauri, and authorities

• Available metadata tools • Skills required and time available

Metadata creation tools www.earthchem.org/data/templates


Metadata creation tools ISSDA Data Deposit Form

Metadata creation tools Knowledge Network for Biocomplexity

Morpho User Guide https://knb.ecoinformatics.org/software/morpho/MorphoUserGuide.pdf

Organising Data & File Formats

• File structure • Folder structure • File and folder naming conventions • Versioning • File formats – Choose platform and vendor-independent file formats to ensure the best chance for future compatibility • File transformation

Organising Data: Good File Management • Research data files and folders need to be labelled and organised in a systematic way so that they are both identifiable and accessible for current and future users. • The benefits of consistent data file labelling are: – Data files are distinguishable from each other within their containing folder – Data file naming prevents confusion when multiple people are working on shared files – Data files are easier to locate and browse – Data files can be retrieved not only by the creator but by other users – Data files can be sorted in logical sequence – Data files are not accidentally overwritten or deleted – Different versions of data files can be identified – If data files are moved to another storage platform their names will retain useful context

• Use a combination of different types of information to make the context and content of a file clear, e.g. – Data source – Measured variable – Experiment – Date

“AHRC_TechnicalApp_Response20120925.docx” rather than: “what we got back from funders about the data stuff.docx”

http://www.youtube.com/watch?v=Z_ysxiAGKC8&feature=player_embedded
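A naming convention of this kind is easier to apply consistently if file names are generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical pattern of Source_Variable_Experiment_YYYYMMDD; the function name, the separator and the example values are illustrative choices, not a prescribed standard.

```python
import re
from datetime import date


def build_file_name(source, variable, experiment, when, extension):
    """Combine data source, measured variable, experiment and date into one name."""
    # Drop spaces and punctuation so the name is safe on most platforms.
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "", part)
               for part in (source, variable, experiment)]
    # A year-month-day stamp makes names sort in date order.
    stamp = when.strftime("%Y%m%d")
    return "_".join(cleaned + [stamp]) + "." + extension.lstrip(".")


print(build_file_name("River Liffey", "pH", "Trial 2", date(2012, 9, 25), "csv"))
# RiverLiffey_pH_Trial2_20120925.csv
```

Putting the date last in YYYYMMDD form keeps files for the same source and variable together and sorted chronologically within a folder.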

Data storage and backup • Estimated size of data ; growth rate • Where (physically) will you store the data? Server, pc/laptop, external storage device…geographically distributed… • On what media will you store the data?

• Whose responsibility is the storage of the data? • How will you transmit the data, if required? • How is your data backed up? • How often is your data backed up? • Who is responsible for this?

• Avoid single points of failure – Use managed networked storage whenever possible – Move data off portable media – Make multiple backups: Lots of Copies Keeps Stuff Safe (LOCKSS) – Be wary of software lifespans
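Making multiple backups is only useful if each copy is known to be intact. The sketch below copies a file to several backup locations and compares a checksum of each copy with the original. The checksum step is an added assumption (the slides do not mention it), and the function names and directories are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dirs):
    """Copy source into each backup directory and verify every copy.

    Returns the list of verified copies; raises IOError if a copy
    does not match the checksum of the original.
    """
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if sha256_of(target) != expected:
            raise IOError(f"Backup at {target} does not match {source}")
        copies.append(target)
    return copies
```

In practice the backup directories would sit on different disks or in different locations, for example a network share and an external drive; calling `back_up("results.csv", ["/mnt/share/backup", "/media/external/backup"])` with such (hypothetical) paths would create and check both copies.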

Data security • How will you ensure the security of your data? – How will data be shared during the project? – How will you organise access to sensitive data? – How will you enforce permissions, restrictions and embargoes? – Other security issues e.g. damage, theft • Information Security Risk Assessment for UCD Research Groups online survey – https://docs.google.com/a/ucd.ie/spreadsheet/viewform?formkey=dGV3QUF4UGxTaTkweDFJWlhiU1g2VVE6MA#gid=0

Access management: Ethics and IP • Are there any ethical or privacy issues that may prohibit the sharing of some or all of the dataset/s? • If so what possible ways might there be to resolve these? (E.g. referral to UCD’s Ethics Committee; anonymisation of data; formal consent agreements; different levels of access to data, e.g. research purposes only, no commercial) • Who owns the copyright and other intellectual property?

Data sharing: drivers and benefits

• Facilitating research and discovery • Scientific integrity • Funders and government • Journal publishers • Recognition and impact

• Collaboration • Funding application advantage

Facilitating research and discovery • Cucumbers, E-coli and open data: The 2011 outbreak of E. coli poisoning in Germany illustrated the changes in attitudes to sharing scientific research and data; within weeks of the outbreak, the genome of the bacteria was identified, and given the seriousness of the outbreak, the results were published on the Internet as soon as they were available.

Facilitating research and discovery

• “As research becomes more data intensive, research datasets increase in number and size. Re-using (combinations of) research datasets produced by researchers in the same discipline or from different disciplines brings about novel approaches, such as data exploration, simulation and modelling, system level science, and transdisciplinary research”. • Van der Graaf, M. and Waaijers, L. (2011). A Surfboard for Riding the Wave. Towards a four country action programme on research data. A Knowledge Exchange Report, available from www.knowledge-exchange.info/surfboard

Scientific integrity • Publishing research data and citing its location in published research papers allows others to replicate, validate or build upon your results thus improving the scientific record by encouraging scientific enquiry and debate.

• Openly sharing research data also encourages the improvement and validation of research methods and minimises the need for data recollection. • Verify results; uncover errors

• Contains errors and excludes some data that significantly undermined the results.

• The results were published in a prestigious journal, the American Economic Review, that failed to enforce its own data availability policy

Funders and Government • NERC (Natural Environment Research Council) expects everyone that it funds to manage the data they produce in an effective manner for the lifetime of their project, and for these data to be made available for others to use with as few restrictions as possible, and in a timely manner. • To protect the research process NERC will allow those who undertake NERC-funded work a period of time to work exclusively on, and publish the results of, the data they have collected. This period will normally be a maximum of two years from the end of data collection. • UK Funders’ Data Policies: www.dcc.ac.uk/resources/policy-and-legal/funders-data-policies

Funders and Government

• Research data in general should be deposited whenever this is possible, and linked to associated publications where this is appropriate. It should be made openly accessible, in keeping with best practice for reproducibility of scientific results. – European and national data protection rules must be taken into account in relation to research data, as well as concerns regarding trade secrets and intellectual property rights, confidentiality, or national security. – At a minimum, metadata describing research data and its location and access rights should be deposited. – It is recognized that managing access to research data may be a new approach for many research organisations. This policy is intended to encourage the improvement of discoverability and development of open access to research data over time.

National Advisory Committee on Drugs and Alcohol: NACDA Research Data Management Policy • Specifies: – Copyright (owned by NACDA) – Data quality – Provision of supporting material – Data security – Confidentiality & protection of personal data by data anonymisation

– Data sharing – Deposit to a data archive / repository – Informed consent to not preclude data sharing beyond the original research – Data management plan requirement

Funders and Government

• Minister Howlin said “The public service needs to share data to deliver services, and more data-sharing will be necessary to deliver the joined-up services we aspire to. At the same time data protection and privacy are concerns for all of us. Today we agreed that a new legal framework is required to enable the public service deliver the next generation of services both effectively and securely.”

Journal Publishers • Increasing number of journal publishers require the sharing of associated data • DRYAD (www.datadryad.org/ ) – an international repository that manages the research data underpinning peer-reviewed articles in the biosciences. – Landscape of public data archives is patchy; Dryad fills a gap http://www.youtube.com/watch?v=RP33cl8tL28

• Figshare (http://figshare.com/ ) - repository where users can make all of their research outputs available in a citable, shareable and discoverable manner. Partnering with Taylor & Francis to host the supplemental data to T&F published papers.

Recognition and Impact • Others who re-use data and cite it in their own research help to spread the word about the research and increase its impact • Increased citation rates • Piwowar H, Vision TJ. (2013) Data reuse and the open data citation advantage. PeerJ PrePrints 1:e1v1 http://dx.doi.org/10.7287/peerj.preprints.1v1

Collaboration • Data sharing may lead to new collaborations between data users and data creators. Sharing data can often lead to improvements such as corrections in the documentation, or combination or comparison of datasets leading to new information. • Collaboration drives more research

Collaboration

• NASA Landsat satellite imagery of the Earth’s surface environment, collected over the last 40 years, was sold through the US Geological Survey for US$600 per scene until 2008, when it became freely available from the Survey over the internet. Usage leapt from sales of 19,000 scenes per year to transmission of 2,100,000 scenes per year. Google Earth now uses the images.

• There has been great scientific benefit, not least to the Geological Survey, which has seen a huge increase in its influence and its involvement in international collaboration.

Funding application advantage • Professor Derek Offord, Department of Russian, was awarded £800,000 from the Arts and Humanities Research Council to conduct the first large-scale history of the French language in Russia. However, upon initial submission of the funding proposal, peer reviewers criticised the data sharing plans and suggested that the application’s Technical Appendix should be rewritten. • While the intellectual excellence of the proposal was not in doubt, the application would not have been successful without resubmission with appropriate changes to the data sharing plans. – Data management and data sharing plans are important: competitive advantage • http://data.bris.ac.uk/files/2013/06/data-bris-benefitsreport-V2.pdf

Barriers to sharing • A huge amount of data ends up unpublished, unshared and essentially wasted – another form of data loss, particularly for datasets that have clear scope for wider research use, decision-making and policy making, and that hold significant long-term value • Tension between the pressure to make data more open earlier on and the real fear researchers have that, if they do so, others will reap the benefits of the hard work they’ve done • Culture of “my” data

Why might public access to your research data be restricted? • “We intend to make a patent application, and must avoid prior disclosure.” • “Don’t want to make locations of members of endangered species available to poachers.” • “The research data are confidential because of the arrangement my research group has made with the commercial partner sponsoring our research.” • “My data form part of a long-term study upon which my research group is entirely reliant for its on-going research publications and academic reputation. We only share this with trusted colleagues.”

Examples of Repositories • Earthchem www.earthchem.org/ – geosciences, with particular emphasis on geochemical, geochronological, and petrological data

• KNB http://knb.ecoinformatics.org – biosciences, ecology, evolutionary biology

• GenBank http://www.ncbi.nlm.nih.gov/genbank/ – DNA sequence data

• Machine Learning Data Repository http://archive.ics.uci.edu/ml/datasets.html – a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms

• Irish Social Science Data Archive www.ucd.ie/issda

Advantages of a repository • Provides a metadata structure / metadata form for you to fill in • Publishes the data for you by giving your dataset a unique identifier, e.g. DOI • Serves as a backup vehicle for your data

• May preserve your data for the future • Makes sharing your data easy • Others may cite your research more

Locating relevant datasets using “portals” • Databib – http://databib.org/

• Registry of Research Data Repositories – http://www.re3data.org/

• CalPoly’s LibGuide – http://libguides.calpoly.edu/content.php?pid=277668&sid=2288020

Using Google to locate data • Astronomy dataset OR "data archive" OR "data portal" • hydrogeology OR groundwater dataset OR "data archive" OR "data portal" • migration dataset OR "data archive" OR "data portal"
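The search strings above all follow one pattern: one or more topic terms joined with OR, followed by the same three data-related alternatives. As a small sketch (the function name is hypothetical), the pattern can be generated for any topic:

```python
def build_data_query(*topic_terms):
    """Build a search string: topic terms, then dataset / archive / portal alternatives."""
    topic = " OR ".join(topic_terms)
    return f'{topic} dataset OR "data archive" OR "data portal"'


print(build_data_query("migration"))
# migration dataset OR "data archive" OR "data portal"
print(build_data_query("hydrogeology", "groundwater"))
# hydrogeology OR groundwater dataset OR "data archive" OR "data portal"
```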

Help @ UCD

• Data storage, backup and security – UCD’s Research IT support team is available to discuss the options available to you regarding data storage or any of your IT requirements. Contact [email protected] or [email protected] – www.ucd.ie/itservices/researchit/

• Intellectual property – For queries regarding intellectual property and support for researchers interested in commercialisation, please contact Caroline Gill, Innovation Education Manager [email protected] – www.ucd.ie/innovation/researchers/

• Research ethics – One-to-one consultations are available for researchers who are about to submit for either a full review or exemption. Contact Jan Stokes, Research Ethics Administrator [email protected] – www.ucd.ie/researchethics/

• Research data management checklist – Some assistance can be given by UCD Library. Contact Julia Barrett, Research Services Manager, UCD Library [email protected] – www.ucd.ie/library/supporting_you/research_support/data_management/

Finally….take away with you…. • STORE – Three copies on Three Disks in Three Locations • ORGANISE – If you make a plan, you just might follow it • DOCUMENT – What would my colleagues need to know to understand this data? • SHARE – Data makes an impact (to you, to your research group, to society)