Evolving ImzML and mzML to an authentic open data format
Open Formats in Mass Spectrometry
Alfons Hester, Bernhard Spengler
Institute of Inorganic and Analytical Chemistry, Justus Liebig University, Giessen
[email protected]

Overview

mzML and imzML are common data formats for storing MS and MS imaging data.
mzML and imzML documents can be digitally signed.
Such documents comply with requirements such as those of GLP (OECD) (cf. "GLP Statement of OECD").
Authentic open data formats make data reliable and trustworthy.
Depending on the quality of the certificates used, the highest security level defined in GLP (OECD) is possible.

GLP Statement of OECD [1]

"Where computerised systems are used to capture, process, report or store raw data electronically, system design should always provide for the retention of full audit trails to show all changes to the data without obscuring the original data. It should be possible to associate all changes to data with the persons making those changes by use of timed and dated (electronic) signatures. Reasons for change should be given."
OECD Series on Principles of Good Laboratory Practice and Compliance Monitoring, Number 10, Page 10, Section 5, "Data"

Introduction

Rapid progress in IT leads to rapid obsolescence of data formats, IT equipment and media (Fig. 1). While technical improvement determines media and equipment, data formats are independent of hardware.

Data in mass spectrometry is mostly stored in homemade data formats (company, institute or private formats), which are often poorly documented or treated as company secrets. Rapid changes in the IT business make data formats obsolete and unsupported. This is a severe risk for the accessibility of existing data.

Supporting many different formats for the same subject leads to considerable economic cost (Fig. 2).

Networking creates a demand for suitable data formats to exchange data automatically.

Results in science should be open, i.e. available without obstacles to everyone who is interested in them.

Additional requirements for an open data format are reliability and trustworthiness. Originators, authorities and legislators require quality assurance of studies, research assignments and surveys. Whenever data is created, stored, processed, dispatched, received, translated or converted, there is a risk that raw data is altered. Such accidental or wilful changes must be made detectable. Accidental changes can be detected using checksums (hash values) or error-correcting codes. Because an open format can be written by anyone, however, the data stored in it can be changed and rewritten in valid form, and a receiver is not able to detect this "betrayal". Using asymmetric cryptography and embedding it into the data format can protect the data.

The primary goal of a data format is to represent data. Therefore the authenticity part should be optional. Finally, an authentic open data format ought to be self-consistent, which is the reason for implementing authenticity within the definition of the format.

Figure 1: The rapid progress in IT during the last five decades, demonstrated for removable media. Examples of removable media used within the last five decades:
punch card, capacity 80 Byte, 1932-1975
floppy disk, 80 kByte-2880 kByte, 1971-1999
compact disc, 10-870 MByte, since 1981
Blu-ray Disc, 7.8-50 GByte, since 2006

Quadratic growth of the number of required converters N_c with the number of different formats n:

$N_c = \frac{n(n-1)}{2}$

Format | Checksum | Authenticity | Purpose
mzML 1.1 | x | - | MS data (HUPO-PSI)
ConsensusXML 1.3 | - | - | OpenMS
FeatureXML 1.3 | - | - | OpenMS
IdXML 1.2 | - | - | OpenMS
ParamXML 1.3 | - | - | OpenMS
TrafoXML 1.0 | - | - | OpenMS
GelML | - | - | Gel data
mzData | - | - | former MS data
analysisXML | - | - | MS analysis (HUPO-PSI-PI)

Table 1: Security features implemented in some open formats currently used in the MS community.
If a format defines a checksum in order to detect accidental changes, it is marked „x” in the second column, otherwise „-”. If a format has authenticity features, it is marked „x” in the third column, otherwise „-”.

Figure 2: Each vertex represents a single data format, and each edge represents a converter able to convert the two connected formats into each other. Panels: n=5, Nc=10; n=7, Nc=21; n=11, Nc=55.

Methods

The software for reading and writing the new authentic open data format was developed in Object Pascal (Delphi 2009) and runs on MS Windows.

The open data format imzML was used as the basis for implementing authenticity. imzML was developed within the COMPUTIS project of the European Union in order to store MS imaging data. imzML itself is an extension of mzML, an XML-based open data format for mass spectrometric data that will probably become the leading MS data format.

Unlike mzML, imzML consists of two files: an XML file derived from mzML and a binary file. Both files are linked by a UUID (universally unique identifier). The binary file contains the measured spectra, whereas the XML file contains the remaining data necessary to describe an experiment. The parameters for reading the binary data are also stored in the XML part (cf. poster PMM: 83 „Imaging mzML (imzML) – a common format for the comparison and exchange of imaging mass spectrometry data”).

The component responsible for reading and writing the binary file automatically calculates a checksum for each spectrum. This checksum is stored in the XML part. When all binary data has been written, an overall checksum of the binary file is calculated and stored in the XML file.

Several checksum algorithms are implemented (cf. Table 2) and tested (cf. Figure 3). The asymmetric cryptosystem RSA is used to calculate a ciphered version of each checksum stored in the XML file. An originator (human or machine) must therefore have two corresponding keys: a private one and a public one. Such key pairs can be created with various software or obtained from a certificate authority (CA).

hash algorithm | block size [bit] | secure (25.08.2009) | developed in | originator | remark
Haval | 256 | X | 1992 | Josef Pieprzyk, Jennifer Seberry, Yuliang Zheng | http://labs.calyptix.com/haval.php
MD4 | 128 | - | 1990 | Ronald Rivest | RFC 1320, successful collision attack
MD5 | 128 | - | 1991 | Ronald Rivest | RFC 1321, successful collision attack in December 2008
RipeMD128 | 128 | - | 1996 | Antoon Bosselaers, Hans Dobbertin, Bart Preneel | successful collision attack in August 2004
RipeMD160 | 160 | X | 1996 | Antoon Bosselaers, Hans Dobbertin, Bart Preneel |
SHA1 | 160 | - | 1995 | NSA | security weakness discovered in 2005
SHA256 | 256 | X | 2001 | NSA |
SHA384 | 384 | X | 2001 | NSA |
SHA512 | 512 | X | 2001 | NSA |
Tiger | 192 | X | 1995 | Ross Anderson, Eli Biham | http://www.cs.technion.ac.il/~biham/Reports/T

Table 2: All implemented hash functions with their „year of birth” and their current security status. Some algorithms have remained unbroken for nearly two decades.

Figure 3: Screenshots of the developed software.

Results

Performance of the implemented hash functions:

Figure 4: Throughput in MByte per second (axis range 0 to 120) for Haval 256, MD4, MD5, RipeMD-128, RipeMD-160, SHA1, SHA256, SHA384, SHA512 and Tiger. The computer used to determine these values was an AMD Athlon 2500+ with 2 GB of memory.

Implemented requirements of the authentic open data format: scalable, future-proof, open, modular, economic, powerful, recognize modification, trustworthy, application-oriented, eligible for QA, platform independent, manufacturer independent, device independent.

Figure 5: Features of mzML/imzML plus new and optional security features (the security features are labelled "new!" and "optional" on the poster).

A software module was developed that can add authenticity to an already existing mzML/imzML file as well as during initial file creation. The performance is adequate, and the underlying mzML/imzML format is now usable, e.g. for QA, monitored studies and surveys.

Conclusion

An authentic open data format based on mzML and imzML was defined.
This format allows the user to sign mzML/imzML documents.
Data is readable with or without the authenticity extension.
Authenticity is part of this open format.
It is planned to integrate these developments into the mzML/imzML standard.

Acknowledgement

The authors gratefully acknowledge financial support by the European Union (STREP project LSHG-CT-2005-518194).

References

[1] OECD Series on Principles of Good Laboratory Practice and Compliance Monitoring, http://www.oecd.org/document/63/0,3343,en_2649_34381_2346175_1_1_1_1,00.html
[2] PKCS #1 v2.1: RSA Cryptography Standard, RSA Laboratories, ftp://ftp.rsasecurity.com/pub/pkcs/pkcs-1/pkcs-1v2-1.pdf
[3] PSI-MS: Mass Spectrometry Standards Working Group, http://www.psidev.info/index.php?q=node/80
[4] OpenPGP Message Format, RFC 4880, http://tools.ietf.org/html/rfc4880
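
Illustrative sketch

The scheme described under Methods (a checksum per spectrum, ciphered with the originator's private RSA key and checked with the public key) can be sketched as follows. This is a minimal illustration in Python, not the authors' Delphi implementation. The function names (checksum, sign, verify, sign_spectra, check_spectra) are hypothetical, SHA-256 is picked arbitrarily from the hashes in Table 2, and the key is a toy key built from two publicly known Mersenne primes. Textbook RSA without padding is used only to keep the example short; a real signature would need PKCS #1 padding [2], a properly generated key and a vetted cryptography library.

```python
import hashlib

# ASSUMPTION: toy RSA key for illustration only. Both primes are publicly
# known Mersenne primes, so this modulus is trivial to factor.
P = 2**521 - 1
Q = 2**607 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (needs Python 3.8+)


def checksum(spectrum: bytes) -> bytes:
    """Hash value of the binary data of one spectrum."""
    return hashlib.sha256(spectrum).digest()


def sign(digest: bytes) -> int:
    """Cipher a checksum with the private key (textbook RSA, no padding)."""
    return pow(int.from_bytes(digest, "big"), D, N)


def verify(digest: bytes, signature: int) -> bool:
    """Decipher the signature with the public key and compare it to the checksum."""
    return pow(signature, E, N) == int.from_bytes(digest, "big")


def sign_spectra(spectra):
    """One (checksum as hex, signature) record per spectrum, as it might be kept in the XML part."""
    return [(checksum(s).hex(), sign(checksum(s))) for s in spectra]


def check_spectra(spectra, records):
    """Indices of spectra whose data, checksum or signature do not match."""
    bad = []
    for i, (s, (hex_digest, signature)) in enumerate(zip(spectra, records)):
        digest = checksum(s)
        if digest.hex() != hex_digest or not verify(digest, signature):
            bad.append(i)
    return bad


if __name__ == "__main__":
    spectra = [b"\x00\x01\x02\x03", b"\x10\x20\x30\x40"]
    records = sign_spectra(spectra)
    print(check_spectra(spectra, records))    # [] : nothing was changed

    tampered = [spectra[0], b"\x10\x20\x30\x41"]
    print(check_spectra(tampered, records))   # [1] : changed spectrum detected

    # Recomputing the plain checksum does not help a forger, because the
    # stored signature still belongs to the original checksum.
    forged = [records[0], (checksum(tampered[1]).hex(), records[1][1])]
    print(check_spectra(tampered, forged))    # [1]
```

The last case shows why a checksum alone is not sufficient in an open format: anyone can recompute a plain checksum after changing the data, but producing a matching signature requires the private key.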