(Almost) Hands-Off
Information Integration for the
Life Sciences
Ulf Leser, Felix Naumann
Humboldt-Universität zu Berlin
Aladin
• Basic idea
- Urgent need for data integration in the life sciences
- Life science databases have certain characteristics
- Life science database users have certain intentions
- These can be exploited to automate integration
• ALmost Automatic Data INtegration
for the Life Sciences
- Minimize manual effort
- Keep quality of integrated data as high as possible
- Use domain-specific heuristics
Leser, Naumann, Hands-Off Information Integration, CIDR 2005
Integration?
• Database integration
- Schema level
• Data integration
- Data level
[Figure: layered schema architecture of a federated database. Data sources, local schemas, component schemas, export schemas, federated schemas]
Two Cultures of Integration
• Schema-driven (computer scientists)
- Schemas are much smaller than data, with (hopefully) well-defined elements
- Resolve redundancy and heterogeneity at the schema level
- High degree of automation once the system is set up
- Focus on methods - you rarely publish a “data paper”
• Data-driven (biologists)
- Value is in the data, abstraction is a result of analysis
- Don't bother with schemas
• Abstraction is volatile and depends on experimental technique
- Manual integration at data level, constant high effort
- You rarely publish a (database) “method paper”
Two Cultures: TAMBIS & SWISS-PROT
TAMBIS
• Semantic middleware
• 6 sources, 1200 concepts
• Ever adopted in any other project?
- Integrated schema difficult to understand
- No agreement on “global” concepts
- Data provenance
SWISS-PROT
• Database of protein sequences
• Papers, personal communications, external databases, …
• Large effort: 30+ data curators
- Gold standard database
• Mostly perceived and used as a book
Linking Associated Objects
• Schema-driven
- Too abstract; tends to blur data provenance
• Data-driven
- Costly and time-consuming; inadequate use of DB technology
• Alternative: Concentrate on object links
• Example: SRS
- Maps a flat file into a semi-structured, “one class” representation
- Never mixes data from different sources
- Uses cross-references for navigation and joins
Cross-References
Aladin’s Scenario
• Assumptions
- Integration of many, many biological databases
- As few manual interventions as possible
- Do not merge data from different databases
• Challenges
- Push automation as far as possible without lowering quality of integrated data too much
- Systematically evaluate quality of automatic integration
• Why will it work?
- Integrate by generating / finding links between objects
- Exploit characteristics of life science databases
Properties – and how to use them
• Data sources have only one “type” of object
• Objects have nested, semi-structured annotations
→ Detect hierarchical structure
• Objects have stable, unique accession numbers
• Databases heavily cross-reference each other
→ Detect objects
→ Detect existing cross-references
• Objects have rich annotations (often free text, sequences)
→ Detect further associations based on “similarity”
A Biological Database
Columba: Multidimensional Integration
• Interdisciplinary project
• Integrates 15 sources annotating protein structures
• Sources are dimensions for PDB entries
• Neither data nor schema integration - links
• Advantages
- Users recognize their sources
- Intuitive query concept
- “Relatively” easy to maintain/extend
[Figure: PDB entries (PDB_ID, Compounds, Chains, Ligands) linked to the sources SCOP (Class, Fold, Superfamily), CATH (Class, Architecture, Topology, Homologous superfamily), DSSP (Secondary structure elements), SwissProt (Description, Domains, Feature), GeneOntology (Terms, TermRelations, Ontologies), and KEGG (Pathway, Enzyme, EC Number)]
Columba Experiences
• = Aladin’s assumptions
- Relational approach feasible: sources are downloadable, parsers exist
- Each database is a collection of objects of one type
- Hierarchical structure, only 1:n relationships
- Objects have unique accession numbers
- Importance of and lack of cross-references
• Lessons learned
- Schema reengineering is extremely time-consuming
• Although only a small part is used in the end
- There is more demand than resources
• Why not be less specific about which data to integrate, but much faster?
Materialized Integration
[Figure: a data warehouse with the sources BIND, Brenda, PDB, OMIM, Genbank, SWISSPROT, PubMed, and KEGG]
Materialized Integration
[Figure: Aladin with the sources BIND, Brenda, PDB, OMIM, Genbank, SWISSPROT, PubMed, and KEGG]
Five Steps to Integration
Source-specific
1. Download source, parse, import into RDBMS
2. Guess primary objects
3. Guess (hierarchically structured) annotation
Across data sources
4. Guess cross-references
• Objects sharing some piece of information
5. Guess duplicates
• Highly similar objects
Overview – Steps 1-3
Step 1
• Parse and import
• Arbitrary target schema
• With or without FK constraints
Steps 2 and 3
• Guess primary objects
• Guess accession number
• Guess / find FK constraints
Overview – Steps 4+5
Step 4
• Guess existing cross-refs
• Compute new cross-refs
Step 5
• Guess duplicates
• Different degrees of “duplicateness”
1. Download, parse, import
• Q: Is that possible in an automatic way?
• Q: What is the target schema?
• Answers
- Here, some manual work is involved, but …
- Parsers are almost always available (BioXXX)
- Aladin doesn't mind the target schema
- Target schemas are completely source-specific
- … may or may not contain FK constraints (MySQL is …!)
- But: Universal relation won’t work
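A minimal sketch of what step 1 might look like for a single source, assuming a toy line-coded flat file. The format, the table names `entry` and `annotation`, and the function `import_flatfile` are illustrative assumptions, not Aladin's actual import code. The point is only that the target schema is source-specific and carries no FK constraints.

```python
import sqlite3

# Made-up sample in a toy line-coded format (assumption): a two-letter code,
# a value, and "//" closing each entry.
TOY_FLATFILE = """\
AC   X0001
DE   Example protein one
DR   PDB; 1ABC
//
AC   X0002
DE   Example protein two
//
"""

def import_flatfile(text, conn):
    """Load entries into a source-specific schema without FK constraints."""
    conn.execute("CREATE TABLE entry (entry_id INTEGER PRIMARY KEY, acc TEXT)")
    conn.execute("CREATE TABLE annotation (entry_id INTEGER, code TEXT, value TEXT)")
    entry_id, fields = 0, []
    for line in text.splitlines():
        if line.strip() == "//":            # end of the current entry
            entry_id += 1
            acc = next((v for c, v in fields if c == "AC"), None)
            conn.execute("INSERT INTO entry VALUES (?, ?)", (entry_id, acc))
            conn.executemany(
                "INSERT INTO annotation VALUES (?, ?, ?)",
                [(entry_id, c, v) for c, v in fields if c != "AC"],
            )
            fields = []
        elif line.strip():                  # a "code value" line
            fields.append((line[:2], line[2:].strip()))
    conn.commit()

conn = sqlite3.connect(":memory:")
import_flatfile(TOY_FLATFILE, conn)
```

Steps 2 and 3 would then have to rediscover from the data that `entry.acc` identifies the primary objects and that `annotation.entry_id` depends on `entry`.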
2. Guess Primary Objects
• Q: What’s a primary object?
• Q: How do you find them?
• Answers
- A database is a collection of objects of one type
• Many biological databases started as books
- These primary objects have stable accession numbers
- Accession numbers look very much the same
• P0496, DXS231, 1DXX, …
• Analyze length, composition, variation, uniqueness, NOT NULL
- But: Databases may have more than one primary type
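The column analysis named above could be sketched roughly as follows. The function `accession_score`, its equal weighting, and its cut-offs (mean length at most 12, length deviation at most 1) are assumptions made for illustration. The slides do not specify how Aladin combines these signals.

```python
from statistics import pstdev

def accession_score(values):
    """Score a column of strings as an accession-number candidate in [0, 1].

    Hypothetical heuristic: the checks mirror the slide (NOT NULL, uniqueness,
    length, composition, variation); weights and cut-offs are assumptions.
    """
    if not values or any(v is None or v == "" for v in values):
        return 0.0                          # NOT NULL is required
    if len(set(values)) != len(values):
        return 0.0                          # uniqueness is required
    lengths = [len(v) for v in values]
    mean_len = sum(lengths) / len(lengths)
    short = 1.0 if mean_len <= 12 else 0.0                # length
    stable_length = 1.0 if pstdev(lengths) <= 1.0 else 0.0  # variation
    # Composition: share of values made of letters and digits only.
    alnum = sum(v.isalnum() for v in values) / len(values)
    return (short + stable_length + alnum) / 3

# Made-up sample columns of one imported table.
columns = {
    "acc": ["P0496", "P1234", "Q9876"],
    "description": ["Example protein one", "Another, much longer example description", "Short text"],
    "organism": ["human", "human", "mouse"],
}
best = max(columns, key=lambda name: accession_score(columns[name]))
```

Here `best` is `"acc"`: the description column fails on length and composition, and the organism column fails on uniqueness.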
3. Guess Dependent Annotation
• Q: Can we detect dependency from data?
• Q: What about complex relationships?
• Answers
- Hierarchical annotation means 1:1 or 1:n relationships
• Annotations don’t reference each other
• No m:n - especially flat-file parsers don’t generate m:n
- Guess or use primary keys and foreign key constraints
• Unique and not null; subset relationship; surrogate keys; …
- A lot of previous work, e.g. [KL92], [MLP02], …
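A small sketch of guessing foreign keys from data alone, using the two signals named above: key candidates are unique and not null, and a dependent column's values form a subset of a key candidate's values. The in-memory table layout and the function names are assumptions; the cited work describes far more efficient and robust approaches.

```python
def is_key_candidate(values):
    """A column can serve as a key if it is NOT NULL and unique."""
    return all(v is not None for v in values) and len(set(values)) == len(values)

def guess_foreign_keys(tables):
    """Return (child_table, child_col, parent_table, parent_col) guesses.

    `tables` maps table name -> {column name -> list of values}.
    A guess is emitted when every non-null child value occurs in a
    key-candidate column of another table (subset relationship).
    """
    guesses = []
    for p_name, p_cols in tables.items():
        for p_col, p_vals in p_cols.items():
            if not is_key_candidate(p_vals):
                continue
            parent = set(p_vals)
            for c_name, c_cols in tables.items():
                if c_name == p_name:
                    continue
                for c_col, c_vals in c_cols.items():
                    child = {v for v in c_vals if v is not None}
                    if child and child <= parent:
                        guesses.append((c_name, c_col, p_name, p_col))
    return guesses

# Made-up sample: one primary table and one dependent annotation table.
tables = {
    "entry": {"acc": ["P1", "P2", "P3"], "name": ["a", "b", "c"]},
    "feature": {"entry_acc": ["P1", "P1", "P3"], "kind": ["helix", "strand", "helix"]},
}
guesses = guess_foreign_keys(tables)
```

With integer surrogate keys the subset test produces many spurious guesses, because small integers are contained in almost any key column, so a real system would need additional filtering.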
4. Guess Associations between Objects
• Q: How can we find existing cross-refs?
• Q: How can we generate new cross-refs?
• Answers
- An existing cross-reference is essentially a pair of identical accession numbers in two different data sources
• Same characteristics as accession number (minus uniqueness)
- Guess new cross-refs based on similarity of attribute values
• Similarity of text fields (text mining), sequences, …
- Note: cross-refs are on the object level – need to be stored
- A lot of previous work, e.g. [NHT+02], [HBP+05], [AMS+97]
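Finding existing cross-references might be sketched as follows: look for an attribute in one source whose values frequently equal accession numbers of another source, then store the matches as object-level links. The data layout, the function `find_cross_references`, and the `min_hit_rate` cut-off are assumptions. Computing new cross-references from text or sequence similarity is not covered here.

```python
def find_cross_references(accessions, other_objects, min_hit_rate=0.5):
    """Guess which attribute of another source holds cross-references.

    accessions: set of accession numbers of the target source.
    other_objects: {object accession -> {attribute -> list of values}}.
    Returns (attribute, links), where links are (source_acc, target_acc)
    pairs, or (None, []) if no attribute reaches min_hit_rate.
    """
    hits, totals, links = {}, {}, {}
    for obj_acc, attrs in other_objects.items():
        for attr, values in attrs.items():
            for v in values:
                totals[attr] = totals.get(attr, 0) + 1
                if v in accessions:
                    hits[attr] = hits.get(attr, 0) + 1
                    links.setdefault(attr, []).append((obj_acc, v))
    best = max(hits, key=lambda a: hits[a] / totals[a], default=None)
    if best is None or hits[best] / totals[best] < min_hit_rate:
        return None, []
    return best, links[best]

# Made-up sample: structure accessions and protein objects that mention them.
structure_accessions = {"1ABC", "2XYZ"}
proteins = {
    "P1": {"dbxref": ["1ABC"], "keyword": ["kinase"]},
    "P2": {"dbxref": ["2XYZ", "9ZZZ"], "keyword": ["membrane"]},
}
attribute, links = find_cross_references(structure_accessions, proteins)
```

The returned pairs are the object-level links that the slide says need to be stored; the dangling value `9ZZZ` is simply not linked.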
5. Guess Duplicates
• Q: If we don’t even know classes – what’s a duplicate?
• Answer
- Most difficult part, but there are many kind-of duplicates
• Are sequence-identical genes in different species the same?
- Need for varying degrees of “duplicateness”
• Data level (overlap in attribute values)
• Schema-level (schema matching)
- Note: No removal or merging of duplicates
- A lot of previous work, e.g. [MGR+02], [BN05], [MLF04], …
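One simple way to express a data-level degree of “duplicateness” is the Jaccard overlap of attribute values. Jaccard is used here as a stand-in; the slides do not name a specific measure, and the function name and object layout are assumptions. Attribute names are ignored on purpose, since the two sources are assumed to have different schemas.

```python
def duplicateness(obj_a, obj_b):
    """Jaccard overlap of attribute values, in [0, 1].

    Each object is {attribute -> list of values}. Values are compared
    case-insensitively and independently of the attribute they come from.
    """
    a = {str(v).lower() for vals in obj_a.values() for v in vals}
    b = {str(v).lower() for vals in obj_b.values() for v in vals}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Made-up sample: the same gene described in two differently structured sources.
gene_a = {"name": ["ABC1"], "organism": ["human"], "seq": ["MEEPQ"]}
gene_b = {"gene": ["abc1"], "species": ["mouse"], "sequence": ["MEEPQ"]}
score = duplicateness(gene_a, gene_b)   # 2 shared values out of 4 distinct ones
```

The score is kept as an annotation on the pair of objects; in line with the note above, nothing is removed or merged.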
Caveats
• Not meant for high-throughput data
- Proteomics profiling, gene expression databases
- Targets “knowledge-rich” databases
• Resulting warehouse will contain errors
- Wrong cross-refs, misinterpreted structure, missing links
- Requirement: Measure quality of Aladin’s methods
• Use existing integrated databases as gold standard
• Precision/recall measures can be derived for all steps
• Intended for human usage, not for automatic further processing
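The quality measurement mentioned above could be computed as follows for any step whose output is a set of items, for example guessed cross-reference links compared with the links of an existing integrated database. The function name and the sample links are illustrative assumptions.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted items against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Made-up sample: guessed links vs. links found in a curated database.
guessed = {("P1", "1ABC"), ("P2", "2XYZ"), ("P3", "3QQQ")}
curated = {("P1", "1ABC"), ("P2", "2XYZ"), ("P4", "4RRR"), ("P5", "5SSS")}
precision, recall = precision_recall(guessed, curated)
```

In this sample two of three guessed links are correct (precision 2/3) and two of four curated links are found (recall 1/2).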
Summary
• Five-step (almost) automatic integration procedure
- Depends on domain characteristics
- Guesses primary objects, annotations, cross-references, duplicates
- Neither schema integration nor data fusion – links
• What quality does Aladin achieve?
- We don’t know yet – needs to be evaluated
• Issue: Scalability
- Needs many, many comparisons of tables, tuples, values
- But: Incremental integration, sampling, pruning
• Issue: Searching and result presentation
- Full text search, browsing
- But: Queries across sources possible for advanced users
Acknowledgements Columba
• Humboldt University
- Silke Trissl
- Heiko Müller
- Raphael Bauer
• Charité
- Kristian Rother
- Stefan Günther
- Robert Preissner
- Cornelius Frömmel
• Konrad-Zuse Center
- Rene Heek
- Thomas Steinke
• Technische Fachhochschule
- Patrick May
- Ina Koch
• Funding: BMBF
