Distributed data access: THREDDS
Distributed data access: THREDDS, OAI, CDP
Presented by: Michael Burek
Acknowledgments: CDP staff: Dave Brown, Luca Cinquini, Don Middleton, Rob Markel, Scott Nixon, Nate Wilhelmi
NCAR Scientific Computing Division
Supercomputing • Communications • Data

Outline
• Community Data Portal (CDP)
• THREDDS in the CDP, introduction
• THREDDS in detail
• THREDDS applied in the CDP, some details
• OAI -- Open Archives Initiative
• Demo
• Thoughts about future developments

Introduction to the CDP
Community Data Portal (CDP) Project
UCAR-wide, uniform community resource for discovery (search and browse) across the organization
Search/browse:
  o Supports free or structured queries to find data
  o Boolean combinations
  o Keyword, controlled vocabularies
      – Creator, Publisher, Science Keyword (GCMD), Variable name (CF)
      – Data Format, Data Type, Data Delivery Service
  o Geographic, Time, Altitude
Data delivery services:
  o Aggregation, subsetting, FTP, HTTP, Mass Store, LAS/FERRET, OPeNDAP

Introduction to the CDP, cont.
The CDP serves a diverse range of data providers:
  o Project-based archives -- small, often limited resources
  o Multi-institutional teams -- geographically separated
  o Multiple data types within a project: measurements, models, images
  o The CDP cooperates with NCAR's existing data organizations
  o A few unusual datasets -- HAO division
  o Model software. Visualizations.
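
Sketch: structured search
The Search/browse bullets in the CDP introduction describe structured queries that combine keyword, geographic, and time constraints. The Python below is a minimal sketch of that idea only. It is not the CDP implementation (the deck states that the CDP uses Lucene for search); the Record class, its field names, and the two sample records are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical discovery record; field names are illustrative only."""
    title: str
    keywords: frozenset
    bbox: tuple  # (west, south, east, north) in degrees
    start: date
    end: date


def bbox_overlaps(a, b):
    """True when two (west, south, east, north) boxes intersect (no dateline wrap)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(records, keyword=None, bbox=None, start=None, end=None):
    """AND together whichever constraints are given; omitted ones match everything."""
    hits = []
    for r in records:
        if keyword is not None and keyword.lower() not in r.keywords:
            continue
        if bbox is not None and not bbox_overlaps(r.bbox, bbox):
            continue
        if start is not None and r.end < start:
            continue
        if end is not None and r.start > end:
            continue
        hits.append(r)
    return hits


# Made-up sample records, not real CDP holdings.
RECORDS = [
    Record("Arctic ozone flight data", frozenset({"ozone", "aircraft"}),
           (-30.0, 60.0, 40.0, 90.0), date(1999, 11, 1), date(2000, 3, 31)),
    Record("Tropical model output", frozenset({"precipitation", "model"}),
           (-180.0, -30.0, 180.0, 30.0), date(1990, 1, 1), date(1999, 12, 31)),
]

hits = search(RECORDS, keyword="ozone", bbox=(0.0, 50.0, 20.0, 70.0),
              start=date(2000, 1, 1))
print([r.title for r in hits])
```

Each constraint is optional, so the same function covers keyword-only, space-only, and time-only queries. OR and NOT combinations would need a small query tree on top of this.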

CDP, Technologies
  o The CDP was begun in 2001
  o Uses THREDDS* catalogs to describe data content and structure
  o Uses Lucene as the search/discovery back end
  o Uses the Open Archives Initiative (OAI) to share metadata
  o Uses SRM to access deep archive data, share data externally (ESG project)
  o Experimental use of SRB to share intra-institution
  o Sister site, Earth System Grid (ESG), uses grid technology to share data
  o Uses DODS/OPeNDAP for aggregation and subsetting of data sets
  o Uses a distributed model for accessing data and metadata
https://cdp.ucar.edu/
*Thematic Realtime Environmental Distributed Data Services

Introduction, THREDDS in the CDP
• THREDDS is a schema used for DATA DELIVERY
  Can also be used for geoscience data search and discovery
THREDDS catalogs:
• Are ingested into Lucene and GEO extent searching tools for search and discovery
• Are used to supply data for search results and browse pages
• Specify data access mechanisms: http, http restricted, OPeNDAP, MSS, TDS, LAS, GDS, CDP/agg
• Point to and use non-THREDDS metadata: ESG, DC, NcML, NcML, GML, DIF
  Can interoperate with WMO metadata when available

Introduction, THREDDS in the CDP, cont.
• The CDP federates directly with other sites that use THREDDS catalogs: NCAR DSS, NCAR EOL, UCAR UNIDATA
• THREDDS catalogs are used inside DODS/OPeNDAP, GDS, and the forthcoming THREDDS Data Server
• THREDDS will support a data access control system, locally and distributed

THREDDS Background
• THREDDS v0.6
  o Support for describing the hierarchical structure of datasets
  o Support for describing data delivery services
  o Some very basic descriptive metadata
  o Support for extensible and distributed catalogs
  o Support for “inheritance” of metadata and services
  o Allows other descriptive schemas to be part of the catalog
  o Emphasizes the hierarchical relationships between data items, containing datasets and groups of datasets

THREDDS V1.0
• THREDDS v1.0
  o Added descriptive “minimal” metadata tuned for Earth Science search/discovery
  o “Minimal” defined -- metadata sized for search/discovery
  o Again, metadata can be inherited within the hierarchy
  o Design goal was to interoperate with core elements of DIF, ISO 19115, DC metadata
  o UNIDATA looking at incorporating THREDDS metadata in NetCDF* and forthcoming TDS**
  o Exploring possibly interoperating with BADC model extensions
  o V1.0x will have access control elements
URL: http://my.unidata.ucar.edu/content/projects/THREDDS/index.htm
*NetCDF: UNIDATA-defined binary data format for gridded and other geoscience data. Includes metadata that describes the data in the file header
**TDS: THREDDS Data Server -- will handle GRIB and NetCDF, will have WCS

THREDDS -- CDP
• CDP THREDDS design choices
  o Use THREDDS descriptive metadata for search/discovery
  o Use GCMD DIF controlled vocabularies for science keyword hierarchies, creator, publisher, project
  o Use Climate and Forecast (CF) conventions for variable names when applicable
  o Mandate use of a unique identifier to identify data
  o Use forthcoming THREDDS elements for data access control
  o Use OAI to import DIF records from BADC and GCMD, transform these records into equivalent THREDDS for use in the CDP
  o Import ESG (CCSM) records (THREDDS, ESG), extract a subset of descriptive metadata for search and discovery

THREDDS, the details
General structure of a simple THREDDS catalog

    <catalog>
      <service name="httpService" type="HTTP" base="http://dataportal.ucar.edu/data/abcData/"/>
      <service name="mssService" type="MSS" base="/mssRoot/abcData/"/>
      <dataset name="abc" ID="ucar.scd.cdp.datasetName">     <!-- container dataset -->
        <metadata inherit="true">                            <!-- descriptive metadata -->
          <description type="summary">...</description>
          <creator>...</creator>
          <geospatialCoverage>...</geospatialCoverage>       <!-- geographic location -->
          <!-- ... other metadata (13 total) -->
        </metadata>
        <dataset ID="ucar.scd.cdp.datasetName.item1">        <!-- describes a data item -->
          <dataSize units="Kbytes">123</dataSize>
          <access serviceName="httpService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
          <access serviceName="mssService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
        </dataset>
        <!-- more dataset items -->
      </dataset>                                             <!-- close enclosing dataset -->
    </catalog>

Dataset URL = service base + access urlPath; points to a local server or local service

THREDDS, simple catalog
[Diagram: local data access / local MSS service]
  catalog
    service: HTTP data service
    service: MSS data service
    dataset (container)
      metadata: description, creator, geospatialCoverage, other elements
      dataset (data item): access, size, extent
      dataset: access, size, extent
      dataset: access, size, extent
      dataset: access, size, extent

THREDDS, distributed catalogs example
[Diagram: remote data services]
  dataset.thredds.xml
    catalog
      dataset (container)
        metadata link -> separate metadata file: description, creator, geospatialCoverage, other elements
        catalogRef, ACCESS CONTROL -> remote server, catalog (remote): ACCESS CONTROL, service, metadata, description, …, datasets
        catalogRef -> catalog (remote): service, metadata, description, …, datasets
  1. Descriptive metadata is in a separate file, could be on another server.
  2. Dataset contains references to remote catalogs.
  3. Catalog-level access control elements.

THREDDS, database application example
[Diagram: external data hosting]
  External server: arbitrary metadata database
  Database-to-THREDDS catalog builder (web service)
  Virtual catalog
    service: external HTTP data service
    metadata
    dataset (data item): access, size, extent
    dataset: access, size, extent
    dataset: access, size, extent
    dataset: access, size, extent

THREDDS, distributed data example
[Diagram: external data hosting]
  catalog
    service: external HTTP data service
    service: MSS data service
    dataset (container)
      metadata external reference -> metadata: description, creator, geospatialCoverage, other elements
      metadata external reference -> external server: ISO-19115 elements
      dataset (data item): access, size, extent
      dataset: access, size, extent
      dataset: access, size, extent
      dataset: access, size, extent
  1. Data is not on CDP, service is external, service can implement access control if required
  2. Descriptive metadata is in a separate file, does not have to be THREDDS

CDP - distributed datasets, overview
[Diagram: the Community Data Portal top-level THREDDS catalog linked to distributed catalogs and archives. Nodes: NCAR Data Support Section; NCAR MSS Mass Store; LANL, ORNL, LBNL (via SRM); SRM (ESG); NCAR EOL section (metadata database); SRB; NCAR Atmospheric Chemistry; Boston University; CDP data storage: WACCM, ACD, CME, CGD, ….; BADC OAI server -> OAI client -> XSLT (DIF records).]
Legend as transcribed: A = Access control, T = THREDDS catalog, SRM = Data Archive, M = MSS deep archive

THREDDS review/summary
• THREDDS is a schema used for DATA DELIVERY
• Contains basic geoscience discovery data
• Is designed to work with distributed data, distributed metadata
• Contains elements for data access restriction
• Can work with real time data
• Can be a container for non-THREDDS descriptive metadata
• Defines the hierarchical relationships of datasets
• Defines data delivery services
• Supports a hierarchical view of metadata
• Integrated with many data delivery and visualization services

Distributed Descriptive Metadata with OAI
• Metadata is immediately “distributed” if metadata is contained in or is pointed to by THREDDS catalogs
• Metadata can also be shared using OAI technology
  o OAI -- Open Archives Initiative, from the Digital Library (DL) community
  o OAI is a web service definition for sharing metadata
  o OAI uses six verbs to define the service
  o OAI uses Dublin Core (DC) as the baseline schema
  o OAI can specify other XML schemas -- we use this capability
  o OAI can be used as a gateway to send information to an established DL community -- THREDDS -> DC => DL community via OAI
• OAI disadvantage -- hierarchical relationships are lost

Distributed Metadata with OAI -- CDP
• THREDDS records are “flattened” (hierarchy collapsed): one record -> one dataset
• Flattened records are shared using OAI
• For a test, the THREDDS records were transformed into DIF using XSLT
• DIF records were ingested from BADC, transformed into THREDDS catalogs, and ingested into CDP search and browse

CDP metadata architecture
[Diagram. Components: external metadata; Web Interface/Web Service; Metadata Conversion; Catalog Parsing (THREDDS catalog parse); Metadata Processing; Metadata repository (THREDDS records, DIF metadata, DC metadata); Metadata DB (Lucene); XML viewer web application; THREDDS catalogs browser; Web UI free-text search query UI; structured, geospatial, temporal query UI; OAI client (import) and OAI server (export) to a remote Data Center or Digital Library. Arrow labels: invokes, write, read, index into, XML results passed to.]

Data publication on the CDP
[Diagram. Components: Dataset (disk, HTTP, database, …); Metadata authoring tool; Catalog crawler; THREDDS descriptive metadata; THREDDS application hierarchy metadata; Access control; Metadata indexing application; Lucene index; XSLT rendering; HTTP; browser; CDP catalog presentation. Arrow labels: starts, creates, allows edits, is ingested, creates link.]

Demo
• Data searching: controlled vocabularies, GEO searching
• Data browsing: access control
• BADC shared metadata directory
• Metadata editing
• IDV Bundle showing integrated data source

Experimental Topology to share data? GISC -> CDP
[Diagram. Nodes: WMO GISC (THREDDS catalog DB, THREDDS catalog, XSLT, NetCDF, GRIB, …, HTTP); WMO DCPC; CDP (crawler, XSLT, CDP Search); OAI links between them. Legend: W = WMO metadata, T = THREDDS metadata.]
  1. OAI transfers of WMO records
  2. CDP crawls data hierarchy -- no metadata
  3. GISC creates Web interface to produce virtual THREDDS catalogs (embedded WMO descriptive metadata)

Experimental Topology to share data, CDP -> GISC
[Diagram. Nodes: CDP (THREDDS catalog, NetCDF, GRIB, …, XSLT, CDP Search); WMO GISC (WMO Search); OAI links between them. Legend: W = WMO metadata, T = THREDDS metadata.]

Questions?
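
Supplementary sketch: resolving a simple catalog
The "THREDDS, the details" slide states that a dataset URL is the service base plus the access path, and that metadata can be inherited down the dataset hierarchy. The Python below is a minimal sketch of those two rules, assuming a catalog shaped like the schematic on that slide: no XML namespaces, and the slide's element and attribute names rather than a checked THREDDS schema. The function name resolve and the "Example Creator" value are placeholders introduced here.

```python
import xml.etree.ElementTree as ET

# Catalog modelled on the slide's schematic; "Example Creator" is a placeholder value.
CATALOG = """
<catalog>
  <service name="httpService" type="HTTP" base="http://dataportal.ucar.edu/data/abcData/"/>
  <service name="mssService" type="MSS" base="/mssRoot/abcData/"/>
  <dataset name="abc" ID="ucar.scd.cdp.datasetName">
    <metadata inherit="true">
      <creator>Example Creator</creator>
    </metadata>
    <dataset ID="ucar.scd.cdp.datasetName.item1">
      <dataSize units="Kbytes">123</dataSize>
      <access serviceName="httpService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
      <access serviceName="mssService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
    </dataset>
  </dataset>
</catalog>
"""


def resolve(catalog_xml):
    """Return one flat record per dataset that has at least one access element."""
    root = ET.fromstring(catalog_xml)
    # Map each service name to its base URL or path.
    bases = {s.get("name"): s.get("base") for s in root.iter("service")}
    results = []

    def walk(dataset, inherited):
        # Start from the parent's metadata, then add this dataset's own
        # inheritable metadata (simplification: only inherit="true" blocks are read).
        meta = dict(inherited)
        for block in dataset.findall("metadata"):
            if block.get("inherit") == "true":
                for child in block:
                    meta[child.tag] = (child.text or "").strip()
        # Dataset URL = service base + access urlPath.
        urls = [bases[a.get("serviceName")] + a.get("urlPath")
                for a in dataset.findall("access")]
        if urls:
            results.append({"id": dataset.get("ID"), "urls": urls, "metadata": meta})
        for child in dataset.findall("dataset"):
            walk(child, meta)

    for top in root.findall("dataset"):
        walk(top, {})
    return results


records = resolve(CATALOG)
for rec in records:
    print(rec["id"], rec["urls"], rec["metadata"])
```

The result is one flat record per data item with the container's metadata copied in. That is roughly the "flattening" (one record -> one dataset) the OAI slides describe, and it shows why the hierarchy is lost once records are shared that way.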