Document 6502708

Transcription

Document 6502708
20
010
0
How
w to
o Buiild an
a Efffecttive
Data
D a Mo
odel
© Daniel LLinstedt, 200
08-2010
Dan Linsteedt, LLC
1/1/2010
0
How to Build an Effective
Data Vault Model
Page 2 of 152
How to Build an Effective
Data Vault Model
Data Vault Modeling How To Guide
Copyright © Dan Linstedt, 2008-2010
President, Empowered Holdings, Inc
http://EmpoweredHoldings.com – Company Home
http://DanLinstedt.com – Data Vault Home
All rights reserved.
All images are the property of Dan Linstedt, unless an image source is otherwise noted.
No part of this book may be reproduced in any form or by any electronic or mechanical means
including information storage and retrieval systems, without permission in writing from the
author. The only exception is by a reviewer, who may quote short excerpts in a review.
Printed in the United States of America
First Printing: September, 2010
Co-Editors: Kent Graziano
Abstract:
The purpose of this book is to present and discuss the technical components of the Data Vault Data Model.
The examples in this book provide a strong foundation for how to build, and design structures in using the
Data Vault modeling technique. This book is a second in the series of books surrounding the Data Vault
model and methodology (approach). The target audience is anyone wishing to implement a Data Vault model
for integration purposes whether it be an Enterprise Data Warehouse, Operational Data Warehouse, or
Dynamic Data Integration Store.
Front Cover Image References:
http://www.becomehealthynow.com/article/bodynervousadvanced/817
http://brain0.com/erobot.html
© Dan Linstedt 2010, all rights reserved
http://danLinstedt.com
How to Build an Effective
Data Vault Model
Page 3 of 152
Table of Contents
1.0
Introduction and Terminology ...................................................................................................... 11
1.1
Do I need to be a Data Modeler to Read this Book? .............................................................. 11
1.2
Review of Basic Terminology .................................................................................................... 11
1.3
Notations used in this text ........................................................................................................ 15
1.4
Data Models as Ontology’s ....................................................................................................... 15
1.5
Data Model Naming Conventions and Abbreviations ............................................................. 17
1.6
Introduction to Hubs, Links, and Satellites ............................................................................. 19
1.7
Flexibility of the Data Vault Model ........................................................................................... 21
1.8
Data Vault Basis of Commutative Properties and Set Based Math ....................................... 23
1.9
Data Vault and Parallel Processing Mathematics ................................................................... 26
1.10 Loading Processes: Batch Versus Real Time .......................................................................... 31
2.0
Architectural Definitions ............................................................................................................... 32
2.1
Staging Area .............................................................................................................................. 32
2.2
EDW – Data Vault...................................................................................................................... 33
2.3
Metrics Vault.............................................................................................................................. 34
2.4
Meta Vault ................................................................................................................................. 34
2.5
Report Collections ..................................................................................................................... 35
2.6
Data Marts ................................................................................................................................. 35
2.7
Business Data Vault .................................................................................................................. 35
2.8
Operational Data Vault ............................................................................................................. 36
2.9
Dynamic Data Vault .................................................................................................................. 37
3.0
Common Attributes ....................................................................................................................... 38
3.1
Sequence Numbers .................................................................................................................. 40
3.2
Sub Sequence Numbers (Item Numbering) ............................................................................ 41
3.3
Load Dates ................................................................................................................................ 41
3.4
Load End Dates ......................................................................................................................... 43
© Dan Linstedt 2010, all rights reserved
http://danLinstedt.com
How to Build an Effective
Data Vault Model
Page 4 of 152
3.5
Last Seen Dates ........................................................................................................................ 44
3.6
Extract Dates ............................................................................................................................. 48
3.7
Record Creation Dates.............................................................................................................. 49
3.8
Record Sources ......................................................................................................................... 49
3.9
Process ID’s ............................................................................................................................... 50
4.0
Hub Entities ................................................................................................................................... 51
4.1
Hub Definition and Purpose ..................................................................................................... 53
4.2
What is a Business Key? .......................................................................................................... 54
4.3
Where do we find Business Keys? ........................................................................................... 55
4.4
Why are Business Keys Important? ......................................................................................... 56
4.5
Why not Surrogate Keys as “Master Keys”? ........................................................................... 58
4.6
Hub Smart Keys, Intelligent Keys ............................................................................................. 58
4.7
Hub Composite Business Keys ................................................................................................ 59
4.8
Hub Entity Structure .................................................................................................................. 60
4.9
Hub Examples ........................................................................................................................... 61
4.10 Dependent and Non-dependent Child Keys ............................................................................ 63
4.11 Mining patterns in the Hub Entity ............................................................................................ 65
4.12 Process of Building a Hub Table .............................................................................................. 67
4.13 Modeling Rules and Standards for Hub Tables ...................................................................... 69
4.14 What Happens when the Hub Standards Are Broken............................................................. 70
5.0
Link Entities ................................................................................................................................... 72
5.1
Link Definition and Purpose ..................................................................................................... 72
5.2
Reasons for Many To Many Relationships .............................................................................. 72
5.3
Flexibility .................................................................................................................................... 76
5.4
Granularity ................................................................................................................................. 79
5.5
Dynamic Adaptability ................................................................................................................ 82
5.6
Scalability................................................................................................................................... 83
© Dan Linstedt 2010, all rights reserved
http://danLinstedt.com
How to Build an Effective
Data Vault Model
Page 5 of 152
5.7
Link Entity Structure.................................................................................................................. 86
5.8
Link Driving Key ......................................................................................................................... 87
5.9
Link Examples ........................................................................................................................... 89
5.10 Degenerate Fields In Links ....................................................................................................... 90
5.11 Multi-Temporal Date Structures ............................................................................................... 91
5.12 Link-To-Link (Parent/Child Relationships) ............................................................................... 92
5.13 Link Applications ....................................................................................................................... 95
5.14 Hierarchical Links...................................................................................................................... 96
5.15 Same-As Links ........................................................................................................................... 98
5.16 Begin and End Dating Links ..................................................................................................... 99
5.17 Low Value Links....................................................................................................................... 102
5.18 Transactional Links ................................................................................................................. 103
5.19 Computed Aggregate Links..................................................................................................... 105
5.20 Strength and Confidence Ratings in Links ............................................................................ 106
5.21 Vector Links (Directional)........................................................................................................ 107
5.22 Exploration Links ..................................................................................................................... 108
5.23 Capturing Changes to Source Systems Over Time................................................................ 108
6.0
Satellite Entities .......................................................................................................................... 110
6.1
Satellite Definition and Purpose ............................................................................................ 110
6.2
Satellite Entity Structure ......................................................................................................... 111
6.3
Satellite Examples................................................................................................................... 112
6.4
Importance of Keeping History ............................................................................................... 113
6.5
Classification or Type of Data Examples................................................................................ 114
6.6
Rate of Change Examples ...................................................................................................... 116
6.7
Satellites Arranged by Source System ................................................................................... 118
6.8
Overloaded Satellites (The Flip-Flop Effect) .......................................................................... 120
6.9
Satellite Applications: ............................................................................................................. 122
© Dan Linstedt 2010, all rights reserved
http://danLinstedt.com
How to Build an Effective
Data Vault Model
Page 6 of 152
6.10 Effectivity Satellites ................................................................................................................. 122
6.11 Record Tracking Satellites ...................................................................................................... 123
6.12 Status Tracking Satellites ....................................................................................................... 126
6.13 Computed Satellites (Quality Generated) .............................................................................. 127
6.14 Multiple Active Satellite Rows ................................................................................................ 128
6.15 Splitting Satellites ................................................................................................................... 130
6.16 Consolidating Satellites .......................................................................................................... 134
7.0
Query Assistant Tables ............................................................................................................... 138
7.1
Point in Time Tables ................................................................................................................ 138
7.2
Bridge Tables ........................................................................................................................... 141
8.0
Reference Tables ........................................................................................................................ 144
8.1
No-History Reference Tables .................................................................................................. 145
8.2
History Based Reference Tables ............................................................................................ 145
8.3
Code and Descriptions............................................................................................................ 145
8.4
National Drug Codes ............................................................................................................... 145
8.5
ICD9 Diagnosis Codes ............................................................................................................ 145
8.6
Calendars (Financial and Gregorian) ..................................................................................... 145
9.0
Ontologies, Metadata, and Enterprise Data Warehousing ....................................................... 146
9.1
Introduction to an Ontology .................................................................................................... 146
9.2
Ontological Importance in an EDW ........................................................................................ 147
9.3
Maximizing Unstructured Data with a Data Vault ................................................................. 148
9.4
Building Dynamic BI Releases ................................................................................................ 148
9.5
Maintaining, Managing and Governing Ontological Metadata ............................................. 148
9.6
Data Vault Hierarchies, Modeling and Managing ................................................................. 148
10.0 Additional Data Vault Thoughts .................................................................................................. 149
10.1 Introduction to a Business Based Data Vault ....................................................................... 149
10.2 Metadata and the Data Vault Model ..................................................................................... 151
© Dan Linstedt 2010, all rights reserved
http://danLinstedt.com
How to Build an Effective
Data Vault Model
Page 7 of 152
10.3 Master Data and the Data Vault Model ................................................................................. 152
10.4 Introduction to Load Metrics and the Data Vault Model ...................................................... 152
10.5 Growth Patterns and the Architecture ................................................................................... 152
10.6 Future Look: Ontology and Dynamic BI Solutions ................................................................. 152
10.7 Dynamic Data Warehousing, an Introduction ....................................................................... 152
© Dan Linstedt 2010, all rights reserved
http://danLinstedt.com
How to Build an Effective
Data Vault Model
Page 8 of 152
Table of Figures
Figure 1-1: Example E-R Diagram (Elmasri/Navathe) ............................................................................ 13
Figure 1-2: Crows Foot and Arrow Notation Example ............................................................................ 15
Figure 1-3: Small Example: Ontology for Vehicle.................................................................................... 16
Figure 1-4: Example Abbreviations and Naming Conventions .............................................................. 18
Figure 1-5: Example Data Vault ............................................................................................................... 20
Figure 1-6: Flexibility of Adapting to Change .......................................................................................... 22
Figure 1-7: 3rd Normal Form Product and Supplier Example ................................................................ 23
Figure 1-8: Applied Set Theory for the Data Vault .................................................................................. 25
Figure 1-9: Parallel Computing Simplified .............................................................................................. 27
Figure 1-10: Logical Data Vault Hyper Cube........................................................................................... 28
Figure 1-11: Physical Data Vault Layout (Starting point) ....................................................................... 29
Figure 1-12: Physical Data Vault Layout (Partitioned) ........................................................................... 29
Figure 2-1: Enterprise BI Architectural Components ............................................................................. 32
Figure 3-1: Time Series Batch Loaded Data ........................................................................................... 38
Figure 3-2 Real-Time Arrival, Data Geology ............................................................................................ 39
Figure 3-3: Load Date Time Stamp and Record Source ........................................................................ 42
Figure 3-4: Example Load Date Time Stamp Data ................................................................................. 42
Figure 3-5: Load End Date Computations, Descriptive Data Life Cycle ................................................ 44
Figure 3-6: Structures containing Last Seen Dates ............................................................................... 45
Figure 3-7: Scan all data in EDW............................................................................................................. 46
Figure 3-8: Reduced Scan Set after Applying Last Seen Date .............................................................. 48
Figure 4-1: Business Key Changing Across Line of Business ................................................................ 52
Figure 4-2: Hub Example Images ............................................................................................................ 53
Figure 4-3: Hub Example Data ................................................................................................................ 54
Figure 4-4: Smart Key Example ............................................................................................................... 58
Figure 4-5: Composite Business Key Hub Example ............................................................................... 60
Figure 4-6: Example Hub Entity Structure .............................................................................................. 61
Figure 4-7: Example Hubs from Adventure Works 2008 ....................................................................... 62
Figure 4-8: Example of National Drug Code Data Vault ......................................................................... 63
Figure 4-9: Dependent Child Relationship Modeling ............................................................................. 64
Figure 4-10: Typical Hub Row Sizing ....................................................................................................... 70
Figure 5-1: Relationship Changes Over Time ......................................................................................... 74
Figure 5-2: Link Table Structure Housing Multiple Relationships ......................................................... 75
© Dan Linstedt 2010, all rights reserved
http://danLinstedt.com
How to Build an Effective
Data Vault Model
Page 9 of 152
Figure 5-3: Starting Model Before Changes ........................................................................................... 76
Figure 5-4: Data Vault After Modification ............................................................................................... 77
Figure 5-5: Additional Data Vault Model - More Changes...................................................................... 78
Figure 5-6: Global Data Vault Linking ..................................................................................................... 79
Figure 5-7: Uncovering Fact Table Grain ................................................................................................ 80
Figure 5-8: Data Vault Grain, Representing Star Schema ..................................................................... 80
Figure 5-9: Traditional Data Vault Storage Layout ................................................................................. 83
Figure 5-10: Performance Physical Split Version 1 ................................................................................ 84
Figure 5-11: Performance Physical Split Version 2 ................................................................................ 85
Figure 5-12: Performance Physical Split Version 3 ................................................................................ 85
Figure 5-13: Sample Link Structure ........................................................................................................ 86
Figure 5-14: Example Driving Key for Link ............................................................................................. 87
Figure 5-15: Example of Link Satellite with Driving Key ........................................................................ 87
Figure 5-16: Insert to Link/Sat Based on Driving Key ........................................................................... 88
Figure 5-17: Link Driving Key/Satellite End Dated ................................................................................ 88
Figure 5-18: Example of Link Tables From Adventure Works 2008 Data Vault................................... 89
Figure 5-19: Example of Link To Link Relationships .............................................................................. 92
Figure 5-20: Step 1, Flattening Link-To-Link Hierarchy ......................................................................... 94
Figure 5-21: Step 2, Flattening Link-To-Link Hierarchy ......................................................................... 94
Figure 5-22: Example Organization Structure ........................................................................................ 96
Figure 5-23: Hierarchical Link for Offices ............................................................................................... 97
Figure 5-24: Example Hierarchical Link of Employees .......................................................................... 97
Figure 5-25: Same-As Link Example, Business Data ............................................................................. 98
Figure 5-26: Same-As Link Data Vault Model......................................................................................... 99
Figure 5-27: Incorrect Link with Begin/End Date ................................................................................... 99
Figure 5-28: Begin & End Dates in Links .............................................................................................. 101
Figure 5-29: Example of Poorly Constructed Link ................................................................................ 101
Figure 5-30: Satellite Effectivity on a Link ............................................................................................ 102
Figure 5-31: Transactional Link Example ............................................................................................. 103
Figure 5-32: Transactional Link, No Satellite ....................................................................................... 104
Figure 5-33: Example of Computed Aggregate Link ............................................................................ 105
Figure 6-1: Example Satellite Entity ...................................................................................................... 112
Figure 6-2: Example Satellite Entities ................................................................................................... 113
Figure 6-3: Satellites Split by Type Of Data Option 1 ........................................................................... 115
Figure 6-4: Satellite Data Rate of Change Example ............................................................................. 116
Figure 6-5: Satellite Split by Rate Of Change ....................................................................................... 117
© Dan Linstedt 2010, all rights reserved
http://danLinstedt.com
How to Build an Effective
Data Vault Model
Page 10 of 152
Figure 6-6: Customer Satellites Split by Source System ..................................................................... 119
Figure 6-7: Satellite Overload from Many Sources .............................................................................. 121
Figure 6-8: Satellite Effectivity ............................................................................................................... 122
Figure 6-9: Denormalized Record Source Tracking Satellite ............................................................... 124
Figure 6-10:Normalized Record Source Tracking Satellite.................................................................. 125
Figure 6-11: Status Tracking Satellite .................................................................................................. 127
Figure 6-12: Multi-Active Satellite Rows ............................................................................................... 129
Figure 6-13: Multi-Active Satellite Row Data ........................................................................................ 130
Figure 6-14: Multi-Active Satellite with Business Sub-Sequence........................................................ 130
Figure 6-15: Step 1: Identify Satellite Split Columns ........................................................................... 131
Figure 6-16: Step 2: Split Satellite Columns, Design New tables ....................................................... 132
Figure 6-17: Step 3: Copy Data From Original to New Satellites ........................................................ 132
Figure 6-18: Step 4: Eliminate Duplicates ............................................................................................ 133
Figure 6-19: Step 4: Alternate Elimination of Duplicates .................................................................... 133
Figure 6-20: Step 5: End Dates Adjusted After Satellite Split ............................................................. 134
Figure 6-21: Consolidating Satellite Data ............................................................................................. 135
Figure 6-22: Load End Dates Calculated in Consolidated Satellite .................................................... 137
Figure 7-1: Structure of PIT Table ......................................................................................................... 138
Figure 7-2: PIT Table Architecture Overview ......................................................................................... 139
Figure 7-3: Example PIT Table with Snapshot Dates ........................................................................... 140
Figure 7-4: Bridge Table Structure ........................................................................................................ 141
Figure 7-5: Bridge Table Architectural Overview .................................................................................. 142
Figure 7-6: Bridge Table Example Data ................................................................................................ 143
© Dan Linstedt 2010, all rights reserved
http://danLinstedt.com