TY - JOUR
T1 - mz5
T2 - Space- and time-efficient storage of mass spectrometry data sets
AU - Wilhelm, Mathias
AU - Kirchner, Marc
AU - Steen, Judith A.J.
AU - Steen, Hanno
PY - 2012/1
Y1 - 2012/1
N2 - Across a host of MS-driven -omics fields, researchers witness the acquisition of ever-increasing amounts of high-throughput MS data and face the need for their compact yet efficiently accessible storage. Addressing the need for an open data exchange format, the Proteomics Standards Initiative and the Seattle Proteome Center at the Institute for Systems Biology independently developed the mzData and mzXML formats, respectively. In a subsequent joint effort, they defined an ontology and associated controlled vocabulary that specifies the contents of MS data files, implemented as the newer mzML format. All three formats are based on XML and are thus not particularly efficient in either storage space requirements or read/write speed. This contribution introduces mz5, a complete reimplementation of the mzML ontology that is based on the efficient, industrial-strength storage backend HDF5. Compared with the current mzML standard, this strategy yields an average file size reduction to ∼54% and increases linear read and write speeds ∼3-4-fold. The format is implemented as part of the ProteoWizard project and is available under a permissive Apache license. Additional information and download links are available from http://software.steenlab.org/mz5.
AB - Across a host of MS-driven -omics fields, researchers witness the acquisition of ever-increasing amounts of high-throughput MS data and face the need for their compact yet efficiently accessible storage. Addressing the need for an open data exchange format, the Proteomics Standards Initiative and the Seattle Proteome Center at the Institute for Systems Biology independently developed the mzData and mzXML formats, respectively. In a subsequent joint effort, they defined an ontology and associated controlled vocabulary that specifies the contents of MS data files, implemented as the newer mzML format. All three formats are based on XML and are thus not particularly efficient in either storage space requirements or read/write speed. This contribution introduces mz5, a complete reimplementation of the mzML ontology that is based on the efficient, industrial-strength storage backend HDF5. Compared with the current mzML standard, this strategy yields an average file size reduction to ∼54% and increases linear read and write speeds ∼3-4-fold. The format is implemented as part of the ProteoWizard project and is available under a permissive Apache license. Additional information and download links are available from http://software.steenlab.org/mz5.
UR - http://www.scopus.com/inward/record.url?scp=84856069798&partnerID=8YFLogxK
U2 - 10.1074/mcp.O111.011379
DO - 10.1074/mcp.O111.011379
M3 - Article
C2 - 21960719
AN - SCOPUS:84856069798
SN - 1535-9476
VL - 11
JO - Molecular and Cellular Proteomics
JF - Molecular and Cellular Proteomics
IS - 1
ER -