Slides - eSI Wiki
Provenance Tracking in Climate Science Data Processing Systems
Curt Tilmes
NASA Goddard Space Flight Center
[email protected]
Workshop on Principles of Provenance (PrOPr)
November 19-20, 2007

MODIS Adaptive Processing System (MODAPS) Level 1 and Atmosphere Archive and Distribution System (LAADS)
• Located: Goddard Space Flight Center
• MODIS Level 1 and atmosphere products
• Archive size (approx): 600 TB
• Ingest rate (approx): 100 GB/Day
• Distributes (approx): 5 TB/Day
• Provide access to MODIS Level 1 and Atmosphere products for 17,303 unique users since September 2006
• Subsetting, subsampling, mosaicing, masking, reprojection and format conversion options enable users to transform MODIS standard products
• http://ladsweb.nascom.nasa.gov/
* Courtesy Ed Masuoka, MODAPS Lead

MODIS Adaptive Processing System (MODAPS) Level 1 and Atmosphere Archive and Distribution System (LAADS)
• All MODIS Atmosphere products are available on disk
    - Enables rapid staging of orders, <5 seconds for 2,000 files
    - All data on over 20 servers organized in one directory tree by Reprocessing Collection, Product Name and Date
• Daily distribution in October 2007
    - 140,000 files to the public
    - 140,000 files to science team
    - 230,000 files to DAACs
• Remote Sensing Information Gateway server for subsets, aggregation, visualization and format conversion of MODIS and air quality data used by EPA
• Level 0 5 minute files for Ocean processing within 7 hours of acquisition [OCDPS]
• Level 1 for CERES production [LaRC]
• Atmosphere Level 3 for MOVAS [GES DISC]
• Custom products for carbon modelers delivered through LAADS with product generation in MODAPS
* Courtesy Ed Masuoka, MODAPS Lead

MODIS Adaptive Processing System (MODAPS) Level 1 and Atmosphere Archive and Distribution System (LAADS)
• 100TB of MODIS Level 1, Atmosphere and Land products will be shipped to JAXA starting April 2008 from online archive, data pool and processing
• Online archive enables innovative services
    - SOAR [UMBC] web Services for On-demand gridded multi-satellite Atmospheric Radiances (MODIS [LAADS], AIRS [GES DISC])
• AVHRR 5km data products from 1981-2000 processed in MODAPS, archived in LAADS
• VIIRS data sets produced from MODIS for testing algorithms
• Comparisons with MODIS used to assess quality of VIIRS SDRs and EDRs for NASA science
* Courtesy Ed Masuoka, MODAPS Lead

MODIS Data Flow
MODIS production flow: (figure not reproduced in this transcription)

Provenance in an SDPS
Just as a laboratory experimenter must control and capture everything about the experiment environment, so should a science data processing system…
• All ingested data, with the source
• Algorithm Theoretical Basis Documents (ATBD)
• Software Source Code, version
• Software Build Environment, version
    - Static libraries, versions, Compiler versions
• Execution Environment
    - Specific hardware, OS version, Dynamic libraries versions
• Execution Instance
    - Runtime parameters, Input files and versions
Very rigorous Configuration Management practices required

Data Processing and Archiving
Earth Science Data Archive volumes growing steadily
Over time, the systems evolve:
• Spacecraft, sensors, data processing frameworks
• Science algorithms for transforming and analyzing data
Tracking data provenance through processing systems and archives is a very complicated problem
• Across organizations / agencies this just gets worse
Science data is being used in new ways not planned by originators
Value Added Services release their own processed data from independent archives

Data Processing and Archiving
Previous versions of data are often discarded in favor of newer ones
• Provenance information stored as metadata along with data is usually removed along with the data itself
Provenance information is incomplete, and represented in nonstandard forms that are difficult to follow
• Imagine a phone call to a researcher “where did you get this data, and what did you do to it?”
Even if provenance is
captured, some systems can’t (or won’t) reproduce older datasets
• Rely on an error prone, manual process to attempt to reproduce data previously released

Provenance Roadblocks
Proprietary information
• Hardware and software designs provide a competitive advantage, why share them?
US International Traffic in Arms Regulations (ITAR)
• Broadly applied, default is to restrict
Cost
• Capturing/distributing provenance isn't a priority
• A project that proposes comprehensive provenance is at a competitive disadvantage to one that doesn't.
Competition
• Why should I share my system for reproducing my data which would give my competitor a leg up?

Provenance Objectives
Capturing complete and accurate provenance during data ingest and primary data processing
Archiving provenance such that it can be easily retrieved and searched, even if the data are deleted
Representing provenance to human users and providing tools for navigating graph to search and explore data provenance
Representing provenance semantically to other systems at cooperating institutions with standard ontologies
• Semantic Web for Earth and Environmental Terminology (SWEET)
Allow agents to traverse inter-system provenance graphs and answer provenance questions
Allow independent systems to mechanically reproduce data processing using the provenance information

Reprocessing
Forward processing is easy.
• Have a whole day to process each data day (1X)
Science keeps marching forward
• MODIS had an average of one new science algorithm version update delivered per day for its first year!
Do you start processing with the new software immediately each time you find a bug?
• Sometimes it is better to keep a dataset consistent with known problems than inconsistent.
Periodically need to correct old data to make a new “baseline”
At 1X reprocessing, 7 years of MODIS data would take 7 years: way too long. Even at 10X, it takes over 8 months.
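The reprocessing figures on this slide follow from simple arithmetic: wall-clock time is the span of data divided by the processing rate. The sketch below works that out; the function name `reprocessing_months` and the 12-months-per-year conversion are illustrative assumptions, and it ignores new data arriving while reprocessing runs.

```python
def reprocessing_months(data_years: float, speedup: float) -> float:
    """Wall-clock months needed to reprocess `data_years` of data.

    `speedup` is the processing rate as a multiple of real time
    (1X means one data day is processed per calendar day).
    Hypothetical helper, not part of MODAPS.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return data_years * 12.0 / speedup


# Compare a few rates for the 7-year MODIS record mentioned on the slide.
for rate in (1, 10, 50):
    months = reprocessing_months(7, rate)
    print(f"{rate}X: {months:.1f} months")
```

At 10X the result is 8.4 months, which matches the "over 8 months" quoted on the slide.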

Process On Demand
For valid science and complete “scientific reproducibility”, you must capture sufficient information to trace back the provenance of each product.
Given such provenance and the ability to use it, do you still need the files in the archive at all?
“Extreme Compression”
• Instead of storing the data product, just store the provenance.
• When someone needs the file, just recreate it.
• Given periodic reprocessing, many files are never needed again anyway.
Allows much larger “virtual archives”
• We make choices about which products to create, archive and distribute; intermediate products not always kept anyway

Process On Demand Challenges
Can I prove that the new file is the same as the old? What if it is “almost” the same...?
• Partial checksumming? Delta differences?
• “black box” the problem
Do I still have the same hardware/os/compiler/etc. 10 years from now, much less 100 or 1000?
• Maintain software validity (with formal science validation) on newer hardware
• We are already imposing System Engineering rigor onto a science programming world with much resistance... (CM, Regression Testing, etc.)
• We just reprocessed over 30 years of ozone data.
Do I have the right input products online, or do I have to remake them too? (and their inputs…)
• Cascading problem, make the “archive” or “process-on-demand” decision at every level.
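The cascading "archive or process-on-demand" decision can be illustrated with a small in-memory sketch. Everything here is hypothetical and not taken from MODAPS: the `Provenance` record, the `VirtualArchive` class, the toy algorithms, and the use of a full SHA-256 checksum as the test for "same file". It only checks bit-for-bit equality, so it does not address the "almost the same" case the slide raises.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Provenance:
    """Hypothetical provenance record for one product."""
    product_id: str
    algorithm: str        # versioned algorithm name, or "ingest" for raw data
    input_ids: List[str]  # products consumed, in order
    sha256: str           # checksum recorded when the product was first made


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class VirtualArchive:
    """Keeps provenance for every product, but file contents only for some."""

    def __init__(self, algorithms: Dict[str, Callable[[List[bytes]], bytes]]):
        self.algorithms = algorithms
        self.provenance: Dict[str, Provenance] = {}
        self.files: Dict[str, bytes] = {}

    def ingest(self, product_id: str, data: bytes) -> None:
        # Ingested data has no upstream, so it is always kept.
        self.provenance[product_id] = Provenance(
            product_id, "ingest", [], checksum(data))
        self.files[product_id] = data

    def produce(self, product_id: str, algorithm: str,
                input_ids: List[str], keep: bool) -> bytes:
        inputs = [self.fetch(i) for i in input_ids]
        data = self.algorithms[algorithm](inputs)
        self.provenance[product_id] = Provenance(
            product_id, algorithm, list(input_ids), checksum(data))
        if keep:  # the "archive" choice; otherwise only provenance is stored
            self.files[product_id] = data
        return data

    def fetch(self, product_id: str) -> bytes:
        if product_id in self.files:
            return self.files[product_id]
        # Process on demand: recreate the inputs first (the cascade),
        # rerun the recorded algorithm, then verify against the checksum.
        record = self.provenance[product_id]
        inputs = [self.fetch(i) for i in record.input_ids]
        data = self.algorithms[record.algorithm](inputs)
        if checksum(data) != record.sha256:
            raise RuntimeError(f"recreated {product_id} differs from original")
        return data


# Toy stand-ins for science algorithms.
archive = VirtualArchive({
    "calibrate_v1": lambda inputs: inputs[0].upper(),
    "grid_v1": lambda inputs: b"grid(" + b",".join(inputs) + b")",
})
archive.ingest("L0", b"raw counts")
archive.produce("L1", "calibrate_v1", ["L0"], keep=False)
archive.produce("L3", "grid_v1", ["L1"], keep=False)
print(archive.fetch("L3"))  # recreates L1, then L3, from provenance alone
```

In this sketch neither L1 nor L3 is stored, so fetching L3 forces L1 to be remade first. Setting `keep=True` for L1 would trade storage for a shorter cascade, which is the per-level decision the slide describes.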