Distributed data access: THREDDS

Distributed data access: THREDDS, OAI, CDP
Presented By:
Michael Burek
Acknowledgments:
CDP staff: Dave Brown, Luca Cinquini, Don Middleton,
Rob Markel, Scott Nixon, Nate Wilhelmi
NCAR Scientific Computing Division
Supercomputing • Communications • Data
Outline
• Community Data Portal (CDP)
• THREDDS in the CDP introduction
• THREDDS in detail
• THREDDS applied in the CDP, some details
• OAI -- Open Archives Initiative
• Demo
• Thoughts about future developments
Introduction to the CDP
Community Data Portal (CDP) Project
• UCAR-wide, uniform community resource for discovery (search and browse) across the organization
• Search/browse:
  o Supports free or structured queries to find data
  o Boolean combinations
  o Keyword, controlled vocabularies
    – Creator, Publisher, Science Keyword (GCMD), Variable name (CF)
    – Data Format, Data Type, Data Delivery Service
  o Geographic, Time, Altitude
• Data delivery services
  o Aggregation, subsetting, FTP, HTTP, Mass Store, LAS/FERRET, OPeNDAP
Introduction to the CDP, cont.
• The CDP serves a diverse range of data providers:
  o Project-based archives -- small, often limited resources
  o Multi-institutional teams -- geographically separated
  o Multiple data types within a project: measurements, models, images
  o The CDP cooperates with NCAR's existing data organizations
  o A few unusual datasets -- HAO division
  o Model software. Visualizations.
CDP, Technologies
o The CDP began in 2001
o Uses THREDDS* catalogs to describe data content and structure
o Uses Lucene as the search/discovery back end
o Uses the Open Archives Initiative (OAI) to share metadata
o Uses SRM to access deep archive data and to share data externally (ESG project)
o Experimental use of SRB to share data within the institution
o Sister site, Earth System Grid (ESG), uses grid technology to share data
o Uses DODS/OPeNDAP for aggregation and subsetting of data sets
o Uses a distributed model for accessing data and metadata
https://cdp.ucar.edu/
*Thematic Realtime Environmental Distributed Data Services
Introduction, THREDDS in the CDP
• THREDDS is a schema used for DATA DELIVERY
  o Can also be used for geoscience data search and discovery
THREDDS catalogs:
• Are ingested into Lucene and GEO extent searching tools for search and discovery
• Are used to supply data for search results and browse pages
• Specify data access mechanisms
  o http, http restricted, OPeNDAP, MSS, TDS, LAS, GDS, CDP/agg
• Point to and use non-THREDDS metadata
  o ESG, DC, NcML, GML, DIF
  o Can interoperate with WMO metadata when available
Introduction, THREDDS in the CDP, cont
• The CDP federates directly with other sites that use THREDDS catalogs
  o NCAR DSS, NCAR EOL, UCAR UNIDATA
• THREDDS catalogs are used inside DODS/OPeNDAP, GDS, and the forthcoming THREDDS Data Server
• THREDDS will support a data access control system, both local and distributed
THREDDS Background
• THREDDS v0.6
  o Support for describing the hierarchical structure of datasets
  o Support for describing data delivery services
  o Some very basic descriptive metadata
  o Support for extensible and distributed catalogs
  o Support for “inheritance” of metadata and services
  o Allows other descriptive schemas to be part of the catalog
  o Emphasizes the hierarchical relationships between data items, containing datasets and groups of datasets
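The “inheritance” idea above can be illustrated with a short sketch. This is not the THREDDS library's implementation: the dictionary layout, the `inherited` and `local` keys, and the helper name `effective_metadata` are all assumptions made for this example. Only the idea that a container's inheritable metadata also applies to the datasets nested under it comes from the slides.

```python
def effective_metadata(dataset, from_parents=None):
    """Yield (dataset name, resolved metadata) for every dataset in the tree."""
    # Metadata that applies to this dataset AND to everything nested below it.
    passed_down = {**(from_parents or {}), **dataset.get("inherited", {})}
    # Local metadata applies to this dataset only and wins over inherited values.
    resolved = {**passed_down, **dataset.get("local", {})}
    yield dataset["name"], resolved
    for child in dataset.get("datasets", []):
        yield from effective_metadata(child, passed_down)


# Hypothetical container dataset with two nested data items.
tree = {
    "name": "abc",
    "inherited": {"creator": "NCAR", "keyword": "ozone"},
    "local": {"summary": "container dataset"},
    "datasets": [
        {"name": "item1", "local": {"dataSize": "123 Kbytes"}},
        {"name": "item2", "inherited": {"creator": "HAO"}},
    ],
}

resolved = dict(effective_metadata(tree))
for name, metadata in resolved.items():
    print(name, metadata)
```

In this sketch a child can override an inherited value (item2 replaces the creator), while the container's local summary is not passed down.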
THREDDS V1.0
• THREDDS v1.0
  o Added descriptive “minimal” metadata tuned for Earth Science search/discovery
  o “Minimal” defined -- metadata sized for search/discovery
  o Again, metadata can be inherited within the hierarchy
  o Design goal was to interoperate with core elements of DIF, ISO 19115, DC metadata
  o UNIDATA looking at incorporating THREDDS metadata in NetCDF* and forthcoming TDS**
  o Exploring possibly interoperating with BADC model extensions
  o V1.0x will have access control elements
URL: http://my.unidata.ucar.edu/content/projects/THREDDS/index.htm
*NetCDF: UNIDATA-defined binary data format for gridded and other geoscience data. Includes metadata that describes the data in the file header
**TDS: THREDDS Data Server -- will handle GRIB and NetCDF, will have WCS
THREDDS -- CDP
• CDP THREDDS design choices
  o Use THREDDS descriptive metadata for search/discovery
  o Use GCMD DIF controlled vocabularies for science keyword hierarchies, creator, publisher, project
  o Use Climate and Forecasting (CF) conventions for variable names when applicable
  o Mandate use of a unique identifier to identify data
  o Use forthcoming THREDDS elements for data access control
  o Use OAI to import DIF records from BADC and GCMD, transform these records into equivalent THREDDS for use in the CDP
  o Import ESG (CCSM) records (THREDDS, ESG), extract a subset of descriptive metadata for search and discovery
THREDDS, the details
General Structure of a simple THREDDS catalog
<catalog>
  <service name="httpService" type="HTTP" base="http://dataportal.ucar.edu/data/abcData/"/>
  <service name="mssService" type="MSS" base="/mssRoot/abcData/"/>
  <dataset name="abc" ID="ucar.scd.cdp.datasetName">  <!-- container dataset -->
    <metadata inherit="true">  <!-- descriptive metadata -->
      <description type="summary"/>
      <creator/>
      <geospatialCoverage/>  <!-- geographic location -->
      <!-- … other metadata (13 total) -->
    </metadata>
    <dataset ID="ucar.scd.cdp.datasetName.item1">  <!-- describes a data item -->
      <dataSize units="Kbytes">123</dataSize>
      <access serviceName="httpService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
      <access serviceName="mssService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
    </dataset>
    <!-- more dataset items -->
  </dataset>  <!-- close enclosing dataset -->
</catalog>
Dataset URL = service base + access urlPath (points to a local server or local service)
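The "base + access" rule can be shown with a short sketch that parses a cut-down copy of the slide's catalog and joins each service base with the access urlPath. The element and attribute names follow the simplified slide example and may differ from the published THREDDS schema. The function name `dataset_urls` is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the simplified catalog from the slide.
CATALOG = """<catalog>
  <service name="httpService" type="HTTP" base="http://dataportal.ucar.edu/data/abcData/"/>
  <service name="mssService" type="MSS" base="/mssRoot/abcData/"/>
  <dataset name="abc" ID="ucar.scd.cdp.datasetName">
    <dataset ID="ucar.scd.cdp.datasetName.item1">
      <access serviceName="httpService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
      <access serviceName="mssService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
    </dataset>
  </dataset>
</catalog>"""


def dataset_urls(catalog_xml):
    """Map each dataset ID to its access URLs (service base + urlPath)."""
    root = ET.fromstring(catalog_xml)
    # Index the services by name so each access element can look up its base.
    bases = {s.get("name"): s.get("base") for s in root.iter("service")}
    urls = {}
    for dataset in root.iter("dataset"):
        # Only direct <access> children belong to this dataset.
        for access in dataset.findall("access"):
            base = bases[access.get("serviceName")]
            urls.setdefault(dataset.get("ID"), []).append(base + access.get("urlPath"))
    return urls


urls = dataset_urls(CATALOG)
for dataset_id, locations in urls.items():
    print(dataset_id, locations)
```

The container dataset has no access element of its own, so only the data item appears in the result, once per service.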
THREDDS, simple catalog
[Diagram: a catalog holds two service entries (HTTP data service, MSS data service) and a container dataset. The container dataset holds a metadata block (description, creator, geospatialCoverage, other elements) and four data-item datasets, each with access, size, and extent. The services point to local data access and a local MSS service.]
THREDDS, distributed catalogs example
[Diagram: dataset.thredds.xml holds a catalog with a container dataset, a metadata link, ACCESS CONTROL, and two catalogRef elements. The metadata link points to a separate metadata block (description, creator, geospatialCoverage, other elements). Each catalogRef points to a remote catalog on a remote server. Each remote catalog has its own service, metadata (description, …), and datasets, and one also carries ACCESS CONTROL. The remote catalogs point to remote data services.]
1. Descriptive metadata is in a separate file and could be on another server.
2. Dataset contains references to remote catalogs.
3. Catalog-level access control elements.
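Following catalogRef elements from one catalog to the next can be sketched as a small recursive crawl. Everything here is illustrative: the URLs are made up, the catalogs live in an in-memory dictionary rather than on remote servers, the plain `href` attribute is a simplification, and `crawl` is a hypothetical helper rather than part of any THREDDS tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogs keyed by made-up URLs. A real crawler would fetch them over HTTP.
CATALOGS = {
    "http://cdp.example/top.xml": """<catalog>
      <dataset name="top">
        <catalogRef href="http://remote-a.example/catalog.xml"/>
        <catalogRef href="http://remote-b.example/catalog.xml"/>
      </dataset>
    </catalog>""",
    "http://remote-a.example/catalog.xml": """<catalog>
      <dataset name="a1"/>
      <dataset name="a2"/>
    </catalog>""",
    "http://remote-b.example/catalog.xml": """<catalog>
      <dataset name="b1">
        <catalogRef href="http://cdp.example/top.xml"/>
      </dataset>
    </catalog>""",
}


def crawl(url, fetch=CATALOGS.get, seen=None):
    """Return dataset names reachable from url, following catalogRef links once each."""
    seen = set() if seen is None else seen
    if url in seen:
        return []  # already visited: guards against reference cycles
    seen.add(url)
    root = ET.fromstring(fetch(url))
    names = [dataset.get("name") for dataset in root.iter("dataset")]
    for ref in root.iter("catalogRef"):
        names += crawl(ref.get("href"), fetch, seen)
    return names


names = crawl("http://cdp.example/top.xml")
print(names)
```

The `seen` set matters in a distributed setting, because nothing prevents a remote catalog from pointing back at the catalog that referenced it.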
THREDDS, database application example
[Diagram: on an external server, an arbitrary metadata database feeds a database-to-THREDDS catalog builder (web service), which produces a virtual catalog. The virtual catalog has a service (external HTTP data service), metadata, and four data-item datasets, each with access, size, and extent. The data sits on external data hosting.]
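A database-to-catalog builder of this kind can be sketched as a function that turns rows into catalog XML on request. The row fields, the example URL, and the function name `build_catalog` are assumptions made for illustration, and the element names follow the simplified catalog shown earlier rather than the full THREDDS schema.

```python
import xml.etree.ElementTree as ET


def build_catalog(rows, service_name, service_base):
    """Build a simplified catalog element from database-style rows."""
    catalog = ET.Element("catalog")
    ET.SubElement(catalog, "service", name=service_name, type="HTTP", base=service_base)
    for row in rows:
        dataset = ET.SubElement(catalog, "dataset", ID=row["id"], name=row["name"])
        size = ET.SubElement(dataset, "dataSize", units="Kbytes")
        size.text = str(row["size_kb"])
        ET.SubElement(dataset, "access", serviceName=service_name, urlPath=row["path"])
    return catalog


# Hypothetical rows, standing in for the result of a metadata database query.
ROWS = [
    {"id": "example.item1", "name": "item 1", "size_kb": 123, "path": "data/item1.nc"},
    {"id": "example.item2", "name": "item 2", "size_kb": 456, "path": "data/item2.nc"},
]

catalog = build_catalog(ROWS, "externalHttp", "http://data.example.org/")
xml_text = ET.tostring(catalog, encoding="unicode")
print(xml_text)
```

Because the catalog is generated per request, nothing has to be stored as a catalog file, which is what makes it "virtual".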
THREDDS, distributed data example
[Diagram: a catalog holds two services (external HTTP data service, MSS data service) and a container dataset. The container dataset has two external metadata references and four data-item datasets, each with access, size, and extent. The two references point to a metadata block (description, creator, geospatialCoverage, other elements) and to ISO-19115 elements on an external server. The data sits on external data hosting.]
1. Data is not on the CDP; the service is external and can implement access control if required.
2. Descriptive metadata is in a separate file and does not have to be THREDDS.
CDP - distributed datasets, overview
[Diagram: the Community Data Portal's top THREDDS catalog links to THREDDS catalogs, data archives, and access control points at participating sites. Labeled nodes: NCAR Data Support Section; NCAR MSS (Mass Store); LANL, ORNL, LBNL (SRM, ESG); NCAR EOL section (metadata database); NCAR Atmospheric Chemistry; Boston University; SRB; CDP data storage (WACCM, ACD, CME, CGD, ….); and a BADC OAI server whose DIF records pass through an OAI client and XSLT.]
Legend: A = Access control, T = THREDDS catalog, D = Data Archive, M = MSS deep archive
THREDDS review/summary
• THREDDS is a schema used for DATA DELIVERY
• Contains basic geoscience discovery data
• Is designed to work with distributed data, distributed metadata
• Contains elements for data access restriction
• Can work with real time data
• Can be a container for non-THREDDS descriptive metadata
• Defines the hierarchical relationships of datasets
• Defines data delivery services
• Supports a hierarchical view of metadata
• Integrated with many data delivery and visualization services
Distributed Descriptive Metadata with OAI
• Metadata is immediately “distributed” if it is contained in or pointed to by THREDDS catalogs
• Metadata can also be shared using OAI technology
  o OAI -- Open Archives Initiative from the Digital Library (DL) community
  o OAI is a web service definition for sharing metadata
  o OAI uses six verbs to define the service
  o OAI uses Dublin Core (DC) as the baseline schema
  o OAI can specify other XML schemas -- we use this capability
  o OAI can be used as a gateway to send information to an established DL community -- THREDDS -> DC => DL community via OAI
• OAI disadvantage -- hierarchical relationships are lost
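The six verbs mentioned above are those of the OAI protocol for metadata harvesting (OAI-PMH): Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and GetRecord. A request is an HTTP query with a `verb` parameter plus verb-specific arguments. The sketch below only builds request URLs. The base URL and the helper name `oai_request` are hypothetical; `oai_dc` is the protocol's standard prefix for Dublin Core.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def oai_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for one of the six verbs."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    return base_url + "?" + urlencode({"verb": verb, **arguments})


# Hypothetical repository address, used only for illustration.
BASE = "http://oai.example.org/provider"

identify_url = oai_request(BASE, "Identify")
records_url = oai_request(BASE, "ListRecords", metadataPrefix="oai_dc")
print(identify_url)
print(records_url)
```

Asking for a schema other than Dublin Core, as the slide describes, would mean passing a different `metadataPrefix` that the repository advertises through ListMetadataFormats.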
Distributed Metadata with OAI -- CDP
• THREDDS records are “flattened” (hierarchy collapsed): one record -> one dataset
• Flattened records are shared using OAI
• For a test, the THREDDS records were transformed into DIF using XSLT
• DIF records were ingested from BADC, transformed into THREDDS catalogs, and ingested into CDP search and browse
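Flattening can be sketched as a tree walk that emits one standalone record per dataset. The record fields and the name `flatten` are assumptions made for this example, not the CDP's actual record layout. The sketch keeps the path of ancestor names in each title as a breadcrumb, which is one possible way to retain a trace of the hierarchy that the flat records otherwise lose.

```python
def flatten(dataset, ancestors=()):
    """Yield one flat record per dataset; nesting survives only as a title path."""
    path = ancestors + (dataset["name"],)
    yield {"id": dataset["id"], "title": " / ".join(path)}
    for child in dataset.get("datasets", []):
        yield from flatten(child, path)


# Hypothetical nested datasets, shaped like a small catalog hierarchy.
tree = {
    "id": "example.abc",
    "name": "abc",
    "datasets": [
        {"id": "example.abc.item1", "name": "item1"},
        {"id": "example.abc.item2", "name": "item2"},
    ],
}

records = list(flatten(tree))
for record in records:
    print(record)
```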
CDP metadata architecture
[Diagram: components are a Web Interface/Web Service for external metadata; Catalog Parsing, Metadata Conversion, and Metadata Processing stages; a metadata repository holding THREDDS records, DIF metadata, and DC metadata; a metadata DB (Lucene); an XML viewer web application; a THREDDS catalogs browser; a Web UI with a free-text search query UI and a structured (geospatial, temporal) query UI; and an OAI client and OAI server that import from and export to a remote data center or digital library. Arrow labels: parse, invokes, write, read, index into, XML results passed to, import, export.]
Data publication on the CDP
[Diagram: components are a dataset (disk, HTTP, database, …); a catalog crawler application; THREDDS hierarchy metadata; a metadata authoring tool; THREDDS descriptive metadata; a metadata indexing application; a Lucene index; access control; XSLT rendering; and the CDP catalog presentation viewed in a browser over HTTP. Arrow labels: is ingested, creates, allows edits, starts, creates link.]
Demo
• Data searching: controlled vocabularies, GEO searching
• Data browsing: access control
• BADC shared metadata directory
• Metadata editing
• IDV Bundle showing integrated data source
Experimental Topology to share data? GISC -> CDP
[Diagram: a WMO GISC and a WMO DCPC on one side and the CDP on the other. Labeled elements: THREDDS catalog, DB, XSLT, NetCDF, GRIB, …, HTTP, crawler, OAI links, CDP Search.]
Legend: W = WMO metadata, T = THREDDS metadata
1. OAI transfers of WMO records
2. CDP crawls data hierarchy -- no metadata
3. GISC creates Web interface to produce virtual THREDDS catalogs (embedded WMO descriptive metadata)
Experimental Topology to share data, CDP -> GISC
[Diagram: the CDP (THREDDS catalog; NetCDF, GRIB, …; XSLT; CDP Search) and the WMO GISC (WMO Search), linked by OAI.]
Legend: W = WMO metadata, T = THREDDS metadata
Questions?