Copyright © 2014 Mosaic Data Science. All rights reserved.
Transcription
Copyright © 2014 Mosaic Data Science. All rights reserved.
Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER Why Enterprise Data Warehouse Projects Fail, and What to do About it Mosaic Data Science, January 2014 Copyright © 2014 Mosaic Data Science. All rights reserved. Why Enterprise Data Warehouse Projects Fail, and What to do About it Introduction Mosaic Data Science, January 2014 Everyone knows data warehouses are risky. It is an IT truism that enterprise data warehouse (EDW) projects are unusually risky. One paper on the subject begins, “D Introduction warehouse projects are notoriously difficult to manage, and many of them end in failure.”1 A book on EDW project management reports that “the most experienced Everyone knows data warehouses are risky. It is an with IT truism enterprise databecause “estimating on project managers [struggle] EDWthat projects,” in part warehouse (EDW) projectswarehouse are unusually risky.isOne on the subject “Data projects verypaper difficult [since] eachbegins, data warehouse project can be so 2 warehouse projects are notoriously difficult manage, and EDW many of in Manytoarticles about riskthem cite end the laundry lists of EDW failure and r different.” project reports failure.”1 A book on EDWtypes in management Chapter 4 “Risks” ofthat the “the samemost book,experienced arguing that EDW project riskiness deri project managers [struggle]from withthe EDW projects,” in part because “estimating multitude of risk factors inherent in EDWon projects. Likewise, experts such a warehouse projects is very Ralph difficult [since]offer each numerous data warehouse project be so EDW project risk.3 Kimball guidelines forcan decreasing different.”2 Many articles about EDW risk cite the laundry lists of EDW failure and risk types in Chapter 4 “Risks” One of therisk same book,stands arguing thatWe EDW project riskiness derives factor out. agree that EDW projects are complex, and have ma from the multitude of risk factors inherent EDW projects. experts such as are more difficult to risk factors. Weinalso agree that inLikewise, general, complex projects 3 Ralph Kimball offer numerous guidelines forsaid decreasing EDWdecades project of risk. manage. Having that, several experience with EDW projects lead us focus on one particular pattern of EDW project failure that we believe is largely One risk factor stands out. We agreefor thatEDW EDWprojects’ projectsreputation are complex, responsible for and highhave risk.many This white paper describes th risk factors. We also agreefailure that in pattern, general,explains complexthe projects are more difficult to risk factor behind the pattern, and suggests how to avoid manage. Having said that, risk several decades of experience with EDW projectsthroughout lead us to the project lifespan. while delivering value as quickly as possible focus on one particular pattern of EDW project failure that we believe is largely responsible for EDW projects’ reputation for high risk. This white paper describes the failure pattern, explains theThe risk Failure factor behind the pattern, and suggests how to avoid the Pattern risk while delivering value as quickly as possible throughout the project lifespan. The pattern proper. The failure pattern is surprisingly mundane: The Failure Pattern 1. The EDW project’s business sponsors agree to fund the project. The pattern proper. The failure is surprisingly mundane: 2. pattern In exchange for sponsorship, the sponsors pressure the project team to deliver substantial business value early—typically within three months of project ons 1. The EDW project’s business sponsors agreethe to fund the project. More generally, sponsors require a project roadmap that promises frequen delivery of major new increments of functionality. 2. In exchange for sponsorship, the sponsors pressure the project team to deliver substantial business value within threethe months project onset. 3. early—typically Eager for a chance to build EDW,ofthe project team negotiates and commit More generally, the sponsors require adelivery project schedule, roadmap that promises frequent of some high-visibility dat a frequent starting with delivery delivery of major new increments reports of at functionality. the end of the first project phase. 3. Eager for a chance to build the EDW, the project team negotiates and commits to 1 Robert M. starting Bruckner,with Beatedelivery List, and of Josef Schiefer, “Risk Management a frequent delivery schedule, some high-visibility data orfor Data Warehouse System Lecture Notes in Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). reports at the end of2 the first project phase. Sid Adelman and Larissa T. Moss, “Introduction to Data Warehousing,” Data Warehouse Project Management (Pearson, 2000), pp. 1-26. 3 http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing 1 Robert M. Bruckner, Beate List,visited and Josef Schiefer, “Risk Management for Data Warehouse Systems.” January 30, 2014. Lecture Notes in Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). 2 Sid Adelman and Larissa T. Moss, “Introduction to Data Warehousing,” Data Warehouse Project Management (Pearson, 2000), pp. 1-26. 3 http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing/, visited January 30, 2014. 1 1 Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER Why Enterprise Data Warehouse Projects Fail, and What to do About it Mosaic Data Science, January 2014 Copyright © 2014 Mosaic Data Science. All rights reserved. 4. The project team works very hard through the first project phase, but still does not Introduction deliver its initial commitments on time. Everyone knows data warehouses are risky. It is an IT truism that enterprise data 5. As time progresses the inevitable frequent scopeare changes (thatrisky. partlyOne make EDW warehouse (EDW) projects unusually paper on the subject begins, “D projects infamous) accrue. These divert the project team’s attention from its warehouse projects are notoriously difficult to manage, and many of them end in original commitments. The 1team mayon renegotiate the delivery schedule, butthat “the most experienced A book EDW project management reports failure.” nevertheless continues to fall ever more behind. project managers [struggle] with EDW projects,” in part because “estimating on warehouse projects is very difficult [since] each data warehouse project can be so 2 6. Eventually the project’s schedule dilation ruinsabout the project team’s Many articles EDW risk citecredibility. the laundry lists of EDW failure and r different.” Management replaces oneinteam of EDW developers management types Chapter 4 “Risks” of thewith sameanother, book, arguing that EDW project riskiness deri itself gets replaced, from or thethe project is scrapped. multitude of risk factors inherent in EDW projects. Likewise, experts such a Ralph Kimball offer numerous guidelines for decreasing EDW project risk.3 Here’s the remarkable thing about this pattern: even experienced EDW technologists well able to anticipate scopeOne changes find themselves in the The scope risk factor stands out. Wesame agreebind. that EDW projects are complex, and have ma changes per se are not the root of the problem. Nor is the problem just a failure to risk factors. We also agree that in general, complex projects are more difficult to renegotiate schedule commitments; these renegotiations are a typical part of the pattern. manage. Having said that, several decades of experience with EDW projects lead us Rather, the problem lies with the on project team’s failure to estimate requirements focus one particular pattern of EDW time project failure that we believe is largely accurately—even time requirements for re-negotiated priorities—especially earlyThis white paper describes th responsible for EDW projects’ reputation for highinrisk. project phases. The key question is why this occurs. To answer this question, we must failure pattern, explains the risk factor behind the pattern, and suggests how to avoid detour into data architecturerisk enough grasp a few of its concepts, notably the whiletodelivering value as basic quickly as possible throughout the project lifespan. entity type. Data Architecture 101. An entity typePattern is a class of things that a database represents. The Failure Common business examples of entity types include customer, product, and address. A typical business database represents on the order of 100 entitypattern types. is surprisingly mundane: The pattern proper. The failure Data architects organize entity 1. types into dataproject’s models. business A logicalsponsors data model The EDW agreerepresents to fund the project. entity types in the abstract. A physical data model realizes a logical model in a given types of model account forthe thesponsors logical relations physical database.4 In principle, 2. both In exchange for sponsorship, pressure the project team to deliver among entity types.5 However, thesubstantial data architect must decide which relations a data three months of project ons business value early—typically within model actually captures. A good logical model is fairly exhaustive abouta project these relations, More generally, the sponsors require roadmap that promises frequen but a physical model typically represents a proper subset those in the logical model.6 delivery of major new of increments of functionality. 3.as Eager for aphysical chancemodels to build the one EDW, project team negotiates and commit Logical models are often represented if they were having of thethe normal forms a frequent delivery schedule, starting with delivery of some high-visibility dat initially defined by the database researchers Codd and Boyce in the 1979s. See http://en.wikipedia.org/wiki/Database_normalization#Normal_forms the project list of normal forms. reports at the end of thefor first phase. 5 Data models are similar in principle to class hierarchies in object-oriented programming. Data models differ from class hierarchies in that they do not emphasize the inheritance relation. Object-relational programming attempts to reconcile the relational and object-oriented either through a middle for Data Warehouse System 1 Robert M. Bruckner, Beate List, and paradigms, Josef Schiefer, “Risk Management tier between application code andLecture database that supposedly eliminates the “impedance mismatch” between Notes in Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). 2 them, or through an object-relational database management system that has both relational and objectSid Adelman and Larissa T. Moss, “Introduction to Data Warehousing,” Data Warehouse Project oriented features. See for example http://www.princeton.edu/~achaney/tmve/wiki100k/docs/ObjectManagement (Pearson, 2000), pp. 1-26. 3 relational_database.html. http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing 6 In a physical model, most relations among entity visited January 30,types 2014.appear as foreign keys. Here’s the short course on database keys. A primary key is a unique identifier for an entity, that is, an instance of an entity type. A natural primary key is a unique identifier that comes from outside of a database. (Someone enters it manually, or a data-integration process imports it.) A surrogate primary key is a primary key generated by the database, usually an integer. A foreign key is a reference in one table to a surrogate primary key in another table. 4 2 2 Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER WhyScience. Enterprise Copyright © 2014 Mosaic Data AllData rightsWarehouse reserved. Projects Fail, and What to do About it Mosaic Data Science, January 2014 Precedence relations in logical models. Now, suppose that within an EDW’s logical Introduction model, two entity types have some relation. For example, the logical model might represent the customer and Everyone purchase-transaction entity types, with customer knows data warehouses areeach risky. It is an IT truism that enterprise data 7,8 optionally making several purchases. In such cases one of the entity types mustpaper on the subject begins, “D warehouse (EDW) projects are unusually risky. One precede the other in two senses, both important to the EDW development process: warehouse projects are notoriously difficult to manage, and many of them end in failure.”1 A book on EDW project management reports that “the most experienced Physical-implementation One entity project precedence: managers [struggle] withtype’s EDW physical projects,”model in partmust because “estimating on be implemented first. Here, the customer (dimension) table can be implemented warehouse projects is very difficult [since] each data warehouse project can be so 2 physically, regardless of the existence of a purchase But purchase Many articles about (fact) EDW table. risk cite thethe laundry lists of EDW failure and r different.” table must have a foreign key referencing the customer (or customer group) types in Chapter 4 “Risks” of the same book, arguing that EDW project riskiness deri dimension table’s surrogate key to implement theprojects. physical Likewise, experts such a from theprimary multitude ofcolumn, risk factors inherent infully EDW 9 model of a purchase.Ralph Kimball offer numerous guidelines for decreasing EDW project risk.3 ETL-implementation One entity extract-transform-load Oneprecedence: risk factor stands out. type’s We agree that EDW projects are complex, and have ma (ETL) process mustrisk be implemented first. In our example one loadprojects a factors. We also agree that in general, could complex are more difficult to customer without having loaded any particular purchase, but one cannot load a manage. Having said that, several decades of experience with EDW projects lead us specific purchase without first having loaded the customer (or customer group) focus on one particular pattern of EDW project failure that we believe is largely 10 making the purchase. responsible for EDW projects’ reputation for high risk. This white paper describes th failure pattern, explains the risk factor behind the pattern, and suggests how to avoid These two precedence relations always run in thevalue sameasdirection, so possible logicallythroughout speaking the project lifespan. risk while delivering quickly as they can combine into a single precedence relation. A natural visual representation the entity-type Theof Failure Pattern precedence relations in a logical model is a directed graph, where the arcs (arrows) represent precedence. For example, Figure 1 below represents customer The preceding purchase: pattern proper. The failure pattern is surprisingly mundane: 1. The EDW project’s business sponsors agree to fund the project. 2. In exchange for sponsorship, the sponsors pressure the project team to deliver substantial business value early—typically within three months of project ons 7 If several customers can make one purchase together, the relation between customer purchase is many that promises frequen More generally, the sponsors require and a project roadmap (customers) to many (purchases); if not, it is one (customer) to many (purchases). Likewise, if a customer delivery of major new increments of functionality. need not have a purchase, the relation is optional on the customer side; otherwise it is required. These aspects of a logical relation are termed the relation’s cardinality. 3. Eager forbea achance to build EDW, the project team negotiates and commit In the present example a customer would probably dimension, and athe purchase a fact, in the EDW’s physical model. The analysis applies regardless, but isdelivery more subtle, in other cases—for example, if bothof some high-visibility dat a frequent schedule, starting with delivery entity types are dimensions, or if one is a reports dimension other attribute of thephase. dimension. For atand thethe end of is theanfirst project brevity’s sake we gloss this distinction. 9 Partial representations can skirt the problem at the cost of creating substantial technical debt. Hard experience has taught us to implement an entire entity type’s physical model all at once (and likewise an 1 M. aBruckner, Beate Josef Schiefer, for Data Warehouse System entity type’s entire ETL process, atRobert least for given source or List, set ofand sources). We have“Risk neverManagement found Lecture Notes inthe Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). implementing partial physical entity types worth cost. 2 10 Sid Adelman and Larissa Moss,using “Introduction Data termed Warehousing,” It is possible to load missing dimension data after a fact isT.loaded, a design to pattern a late- Data Warehouse Project Management (Pearson, 2000), pp. 1-26. arriving dimension. See http://www.kimballgroup.com/data-warehouse-business-intelligence3 http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing resources/kimball-techniques/dimensional-modeling-techniques/late-arriving-dimension/. Late-arriving visited 2014. dimensions don’t really avoid the issue, January because 30, even when one uses late-arriving dimensions, one still 8 must implement ETL to load the dimension foreign key for the fact whenever the foreign key’s value is available. In a strong sense, late-arriving dimensions are a workaround for a data-quality or datagovernance problem. They are not the dimension’s happy-path ETL process. 3 3 Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER Why Enterprise Data Warehouse Projects Fail, and What to do About it Mosaic Data Science, January 2014 Copyright © 2014 Mosaic Data Science. All rights reserved. Introduction Customer Purchase Everyone knows data warehouses are risky. It is an IT truism that enterprise data warehouse (EDW) projects are unusually risky. One paper on the subject begins, “D warehouse projectsPrecedence are notoriously difficult to manage, and many of them end in Figure 1: Example Graph failure.”1 A book on EDW project management reports that “the most experienced project managers [struggle] with EDW projects,” Let’s call this sort of diagram the logical model’s precedence digraph (PDG).in11part because “estimating on warehouse projects is very difficult [since] each data warehouse project can be so 2 A PDG has three interestingdifferent.” properties. Many articles about EDW risk cite the laundry lists of EDW failure and r types in Chapter 4 “Risks” of the same book, arguing that EDW project riskiness deri theismultitude of risk connected, factors inherent in EDW PDGs are acyclic: from A PDG not necessarily but each of its projects. connectedLikewise, experts such a 3 Ralph Kimball numerous guidelines for decreasing components is necessarily acyclic. offer A cycle would imply that some entity typeEDW project risk. 12 precedes itself. One risk factor stands out. We agree that EDW projects are complex, and have ma agree acyclic that in general, complex projects PDGs have stages: risk As factors. is true forWe anyalso directed graph, the nodes can be are more difficult to said that, several decades of experience with arranged into stages.manage. A node Having is in stage 0 if it has no predecessors. It is in stage n EDW projects lead us focus on one particular pattern of EDW project failure that we (for n > 0) if all of its predecessors are lower-numbered stages, and at least one of believe is largely reputation for 10 high risk. 13This white paper describes th its predecessors is inresponsible stage n – 1.forAEDW typicalprojects’ EDW PDG has about stages. failure pattern, explains the risk factor behind the pattern, and suggests how to avoid risk iswhile delivering value as stages: quickly as throughout Most of the good stuff in the high-numbered Thepossible entity types most the project lifespan. useful to a business generally appear in a PDG’s late (high numbered) stages. For example, analyzing early-stage entity types such as dates, locations, and products The Failure Pattern value per se. In contrast, purchases does not typically provide much business may involve products, customers, contracts, discounts, locations, dates and times, pattern proper. failure is purchases surprisingly salespeople, etc. SoThe all of these other entityThe types mustpattern precede in mundane: the PDG, forcing the purchase entity type to appear in a late stage. And analyzing purchase patterns is a high-value EDWproject’s activity. business sponsors agree to fund the project. 1. The EDW Entity-type precedence and EDW scheduling. We canthe now explainpressure the failure 2. Inproject exchange for sponsorship, sponsors the project team to deliver pattern we describe at the start of this paper, inbusiness terms ofvalue PDGs.early—typically Here are the key insights: substantial within three months of project ons More generally, the sponsors require a project roadmap that promises frequen 1. EDW project sponsors wantdelivery to reportofon and analyze late-stageofentity types, early major new increments functionality. in the project. 3. Eager for a chance to build the EDW, the project team negotiates and commit 2. When EDW project teams promise delivery of late-stage entity types, a frequent delivery schedule, starting with they delivery of some high-visibility dat habitually fail to recognize that they are also committing to deliver all of reports at the end of the first project phase. the sofar unimplemented entity types required by the promised late-stage entity types. Early in an EDW project, most of a late-stage entity type’s PDG predecessors are Robert Beate List, and Josef Schiefer, “Risk Management for Data Warehouse System unimplemented. So1the riskM.ofBruckner, over-commitment is greatest at project onset. Lecture Notes in Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). 2 Sid Adelman and Larissa T. Moss, “Introduction to Data Warehousing,” Data Warehouse Project Management (Pearson, 2000), pp. 1-26. 11 You don’t need to generate the3graph manually. Instead, use the open-source utility graphviz to translate http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing a text file listing the graphs nodesvisited and arcs into a30, nicely formatted graph that you can print in any size. January 2014. 12 Graphing a logical model’s PDG and checking it for cycles is a great way to discover structural problems in the logical model. The Unix/Linux tsort (topological sort) utility makes checking for cycles trivial. 13 Batch ETL can be architected according to these stages. The data is loaded one stage at a time, and all entity types in a given stage can be loaded in parallel. This approach provably maximizes the parallelism in the ETL process, which is another good reason to construct and study the PDG of an EDW’s logical model. 4 4 Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER Why Enterprise Data Warehouse Projects Fail, and What to do About it Mosaic Data Science, January 2014 Copyright © 2014 Mosaic Data Science. All rights reserved. Notice that the risk is likelyIntroduction to be present each time a project team re-negotiates a project schedule. This is true because the re-negotiation is usually a response to shifting sponsor priorities, not to the projectEveryone team’s discovering the warehouses extent of its over-commitment. knows data are risky. It is an IT truism that enterprise data warehouse (EDW) projects are unusually risky. One paper on the subject begins, “D Outside of our own work, we have never seen EDW project teams consider formally how warehouse projects are notoriously difficult to manage, and many of them end in 1 entity type precedence relations bear on project Inmanagement particular, wereports have never A book onscheduling. EDW project that “the most experienced failure.” seen an EDW team construct a PDG. Nor to [struggle] our knowledge has anyone developed project managers with EDW projects,” in partand because “estimating on published an EDW project-scheduling methodology based on the PDG. warehouse projects is very difficult [since] each data warehouse project can be so different.”2 Many articles about EDW risk cite the laundry lists of EDW failure and r types in Chapter 4 “Risks” of the same book, arguing that EDW project riskiness deri The Remedy: EDW Project-Schedule Optimization from the multitude of risk factors inherent in EDW projects. Likewise, experts such a Ralph Kimball offer numerous guidelines for decreasing EDW project risk.3 In our experience the best way to avoid EDW project over-commitment is to maintain a PDG for the EDW’s logicalOne model, to use the PDG inform riskand factor stands out. toWe agreethe thatproject’s EDW projects are complex, and have ma implementation schedule. In what follows we enlarge this recommendation intoprojects a formalare more difficult to risk factors. We also agree that in general, complex model of optimal EDW project scheduling. Some of the model’s elements will be manage. Having said that, several decades of experience with EDW projects lead us familiar to anyone with EDW experience. focus on one particular pattern of EDW project failure that we believe is largely responsible for EDW projects’ reputation for high risk. This white paper describes th Enterprise ontology. The failure scheduling model requires enterprise ontology, whichand is a suggests how to avoid pattern, explains thean risk factor behind the pattern, logical model that lists eachrisk entity type’s data sources (databases, OLTP applications, while delivering value as quickly as possible throughout the project lifespan. etc.) with their volumetrics, and that records other metadata relevant to EDW implementation. For technical reasons the ontology should limit itself to entity types that will be represented by a single in the physical model. For example, a person entity Thetable Failure Pattern type is too abstract, because an EDW will generally have several tables representing different kinds of people: employees, supplier contacts, customer contacts, etc. The pattern proper. The failure pattern is surprisingly mundane: Functional areas. A functional a collection entity types that satisfy thefund the project. 1. area The is EDW project’sofbusiness sponsors agree to reporting requirements of a given business function. (One entity type may appear in many functional areas.) For example, a product-definition functional area might require 2. In exchange for sponsorship, the sponsors pressure the project team to deliver four entity types: manufacturer product, supplier product, internally developed product, substantial business value early—typically within three months of project ons and unit of measure. This functional areagenerally, might serve product-management More the the sponsors require a project roadmap that promises frequen organization. A typical EDW project will have on the order of 100 functional areas. delivery of major new increments of functionality. Each functional area should list3.theEager reports corresponding for required a chanceby to the build the EDW, theorganization. project team negotiates and commit (Several organizations may requireathe same report.) reportstarting shouldwith be mapped frequent delivery Each schedule, deliverytoof some high-visibility dat the entity types it queries. Each report should also be assigned a utility weight, reports at the end of the first project phase. that is, a dimensionless positive number representing the report’s relative importance to the business. We suggest these weights be integers between one and 10. 1 Robert M. Bruckner, Beate List, and Josef Schiefer, “Risk Management for Data Warehouse System Lecture Notes in Science Vol. 2114, pp. 219-229 Verlag, 2001). Project schedule. A project schedule isComputer a sequence of project stages. Each(Springer functional 2 14 Larissa T. Moss, “Introduction to Data Warehousing,” Data Warehouse Project Sid Adelman and area belongs to exactly one project stage. Consistency with the PDG is a necessary Management (Pearson, 2000), pp. 1-26. http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing visited January 30, 2014. 14 We could define a project schedule more directly as a sequence of entity types and reports. Our definition in terms of project stages means we search a subset of the set of all possible sequences of entity types and reports. As a result, an optimal project schedule may not optimize over this larger set. However, the optimization algorithm is free to optimize the ordering of previously unimplemented entity types and reports within a given stage’s functional areas. So depending on the load-leveling constraint, our definition 3 5 5 Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER Why Enterprise Data Warehouse Projects Fail, and What to do About it Mosaic Data Science, January 2014 Copyright © 2014 Mosaic Data Science. All rights reserved. Introduction condition for feasibility. More formally, a schedule is infeasible if the following conditions hold: Everyone knows data warehouses are risky. It is an IT truism that enterprise data warehouse (EDW) area projects 1. Functional area A precedes functional B inare theunusually schedule. risky. One paper on the subject begins, “D warehouse arex notoriously 2. There exist entity types x and yprojects such that is in A and ydifficult is in B.to manage, and many of them end in on PDG. EDW project management reports that “the most experienced failure.”1 A 3. Entity type y is an antecedent of book x in the project managers [struggle] with EDW projects,” in part because “estimating on warehouse projects is very [since] each data warehouse A feasible schedule thus traverses the PDG (visits eachdifficult node once) while visiting a given project can be so 2 Many articles about EDW risk citearea the in laundry different.” node only after visiting all of its predecessors. Thus placing a functional a givenlists of EDW failure and r types inimplement Chapter 4 “Risks” the same book, arguing that EDW project riskiness deri project stage means the team must all of theofso-far unimplemented entity types required by the functional includingofallrisk of factors the antecedents the from area, the multitude inherent required in EDW by projects. Likewise, experts such a functional area’s so-far unimplemented reports, in that project stage. for decreasing EDW project risk.3 Ralph Kimball offer numerous guidelines It is usually desirable to have therisk project team’s sizeout. remain over the projects project are complex, and have ma One factor stands We constant agree that EDW lifespan. One can express this by allowing each to require at most some are more difficult to risk naturally factors. We also agree thatstage in general, complex projects fixed number of person hours. If a project such decades a load-leveling constraint, manage. Havingteam saidimposes that, several of experience with EDW projects lead us it becomes another necessary condition schedule feasibility. focus on onefor particular pattern of EDW project failure that we believe is largely responsible for EDW projects’ reputation for high risk. This white paper describes th Workload model. The project team should construct workload model attributing failure pattern, explains thearisk factor behind the pattern,aand suggests how to avoid number of person hours to implementing each entity physical model and ETL, andthe project lifespan. risk while delivering valuetype’s as quickly as possible throughout to implementing each report. For example, it is common to categorize each data source as having low, medium, or high complexity, and to assume that all source tables having a given complexity level require same Pattern amount of time (same goes for reports).15 The Thethe Failure workload model should also include one-time activities such as hardware and software configuration. The workload model mustproper. let the EDW team compute project’s total The pattern The failure pattern isthe surprisingly mundane: and remaining person-hour requirements, given the current project schedule. 1. The EDW project’s business sponsors agree to fund the project. Scheduling algorithm. Finally, the model should include a scheduling algorithm. The algorithm must output a feasible schedule optimizes the remaining project utility. 2. project In exchange forthat sponsorship, sponsors pressure the project team to deliver The optimization may be stepwise substantial optimal (greedy) or globally optimal. Ordinarily business value early—typically within three months of project ons project priorities (represented by the reports’ utilitythe weights) are require likely toa change More generally, sponsors project roadmap that promises frequen frequently over the project lifespan,delivery so the optimization be greedy, to ensure the of major newshould increments of functionality. project realizes known high utilities in the short term. Global optimization is only appropriate when all stakeholders unusually confident that the project priorities will team not negotiates and commit 3. are Eager for a chance to build EDW, the project change, which typically implies that all requirements known starting and wellwith understood. a frequent deliveryare schedule, delivery of some high-visibility dat reports at the end of the first project phase. It is a practical impossibility for a human to construct a (feasible) project schedule that is near optimal. There are simply too many possible schedules. For example, if a project 1 87 List, and JoseftoSchiefer, “Risk includes 63 functional areas,Robert there M. areBruckner, 63! ≈ 10Beate permutations consider (forManagement feasibilityfor Data Warehouse System Lecture Notes in Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). 2 Sid Adelman and Larissa T. Moss, “Introduction to Data Warehousing,” Data Warehouse Project may not strongly constrain the search. We choose the definition based Management (Pearson, 2000), pp. 1-26.on project stages and functional areas because the business world3 generally requires that EDW projects be structured in this fashion, with a http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing stage typically lasting between three weeks and three months. visited January 30, 2014. 15 For source tables, we suggest using as a complexity metric the product of column count and base-10 log of row count. The assumption is that replication effort is linear in the number of columns, while datacleansing effort increases very sub-linearly with record quantity. Then one can categorize a source table as low, medium, or high complexity according to the metric’s value. We likewise suggest a report-complexity metric linear in the number of entity types the report queries. 6 6 Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER Why Enterprise Data Warehouse Projects Fail, and What to do About it Mosaic Data Science, January 2014 Copyright © 2014 Mosaic Data Science. All rights reserved. and optimality). Fortunately, there is a well-established class of algorithms for solving Introduction 16 this kind of problem. Everyone knows data warehouses are risky. It is an IT truism that enterprise data Sample results. We have used a greedy algorithm for are an EDW project having warehouse (EDW) projects unusually risky. Onethe paper on the subject begins, “D following volumetrics: warehouse projects are notoriously difficult to manage, and many of them end in failure.”1 A book on EDW project management reports that “the most experienced • 116 entity types project managers [struggle] with EDW projects,” in part because “estimating on • 372 entity-type precedence relations warehouse projects is very difficult [since] each data warehouse project can be so • 674 source tables different.”2 Many articles about EDW risk cite the laundry lists of EDW failure and r types in Chapter “Risks” of the same book, arguing that EDW project riskiness deri • 1,267 source-table requirements (for 4entity tables) from the multitude of risk factors inherent in EDW projects. Likewise, experts such a • 325 reports Ralph Kimball numerous guidelines for decreasing EDW project risk.3 • 1,380 entity-type requirements (foroffer reports) • 63 functional areas. One risk factor stands out. We agree that EDW projects are complex, and have ma risk factors. We also agree in general, complex projects are more difficult to Setting the work-leveling ceiling to 5,000 person hoursthat yielded a project schedule with 19 manage. Having said that, several decades of experience with stages. Table 1 below lists the stage metrics. The original utilities are between one and EDW projects lead us one particular failure that we believe is largely 10, but the utility-per-hour focus figuresonare scaled up bypattern 10,000of forEDW ease project of comparison: responsible for EDW projects’ reputation for high risk. This white paper describes th failure pattern, the risk factorUtility behind the pattern, and suggests how to avoid Stage Functional TotalexplainsSource risk while delivering value as quickly as possible throughout the project lifespan. Area Count Hours Tables per Hour 1 6 4,488 134 20 2 1 2,068 62 10 The Pattern 3 1 Failure 2,706 74 7 4 2 3,762 89 27 The pattern proper. The failure pattern is surprisingly mundane: 5 1 1,782 46 28 6 1 4,400 85 16 1. The EDW project’s business sponsors agree to fund the project. 7 18 4,664 117 259 8 2 22 1 6818 2. In exchange for sponsorship, the sponsors pressure the project team to deliver 9 1 22 business value 1 early—typically 4546 substantial within three months of project ons 10 10 418 12 1507 a project roadmap that promises frequen More generally, the sponsors require 11 3 220 of major new 9increments 1046 delivery of functionality. 12 1 110 2 909 13 3 3. Eager176 2 568 the project team negotiates and commit for a chance to build the EDW, 14 1 154 delivery schedule, 4 455 with delivery of some high-visibility dat a frequent starting 15 6 616 16 552phase. reports at the end of the first project 16 2 66 2 1515 17 1 176 8 398 1 18 1Robert M. Bruckner, 88 Beate List, and 4 Josef Schiefer, 341 “Risk Management for Data Warehouse System Lecture Notes in Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). 19 396 6 328 22 Sid Adelman and Larissa T. Moss, “Introduction to Data Warehousing,” Data Warehouse Project Management (Pearson, 2000), pp. 1-26. 3 Table 1: Sample Greedy Schedule http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing visited January 30, 2014. 16 That is, a network optimization problem with path-dependent utilities and costs. See for example Dimitri P. Bertsekas, Network Optimization: Continuous and Discrete Models (Athena Scientific, 1998). 7 7 Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER Why Enterprise Data Warehouse Projects Fail, and What to do About it Mosaic Data Science, January 2014 Copyright © 2014 Mosaic Data Science. All rights reserved. If we consolidate stages 8-19 into a single final stage, preserving functional-area order Introduction within the final stage, we end up with an eight-stage schedule. Assuming an eight-person team working 180 person hours per month, longest stage requires aboutItthree Everyone knowsthe data warehouses are risky. is ancalendar IT truism that enterprise data months; the shortest, about warehouse five weeks.(EDW) See Table 2: projects are unusually risky. One paper on the subject begins, “D warehouse projects are notoriously difficult to manage, and many of them end in on EDW project failure.”1 A book Stage Functional Total Sourcemanagement Utilityreports that “the most experienced Area Hours Tables per Hour in part because “estimating on project managers [struggle] with EDW projects,” warehouse projects is very difficult [since] each data warehouse project can be so Counts articles about134 EDW risk cite20 the laundry lists of EDW failure and r different.” 1 6 2 Many 4,488 types in Chapter 4 “Risks” of the same book, arguing that EDW project riskiness deri 2 1 2,068 62 10 from the of risk factors 74 inherent in EDW 3 1 multitude 2,706 7 projects. Likewise, experts such a Ralph 2Kimball offer for decreasing EDW project risk.3 4 3,762numerous guidelines 89 27 5 1 1,782 46 28 One risk stands out. We 85 agree that EDW 6 1 factor4,400 16 projects are complex, and have ma risk factors. We4,664 also agree that in general, complex 7 18 117 259 projects are more difficult to manage. Having said that, several decades of experience with EDW projects lead us 8 33 2,464 67 832 focus on one particular pattern of EDW project failure that we believe is largely for EDW projects’ reputation for high risk. This white paper describes th Table 2:responsible Consolidated Sample Greedy Schedule failure pattern, explains the risk factor behind the pattern, and suggests how to avoid while delivering value as quickly possible throughout the project lifespan. The business analyst for therisk same project attempted repeatedly toas order the functional areas intuitively. He settled on a schedule that yielded the schedule structure in Table 3: The Failure Pattern Stage Source Tables The pattern proper. The failure 1 252 pattern is surprisingly mundane: 2 102 1. The EDW project’s business sponsors agree to fund the project. 3 55 4 70 2. In exchange for sponsorship, the sponsors pressure the project team to deliver 5 60 substantial business value early—typically within three months of project ons 6 41 More generally, the sponsors require a project roadmap that promises frequen 7 delivery of major new33 increments of functionality. 8 8 9 13 3. Eager for a chance to build the EDW, the project team negotiates and commit 10 delivery schedule, 17 a frequent starting with delivery of some high-visibility dat 11 6 reports at the end of the first project phase. 12 9 1 Table 3: Manually Schedule Robert M. Bruckner,Generated Beate List, and Josef Schiefer, “Risk Management for Data Warehouse System Lecture Notes in Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). 2 Sid eight Adelman and Larissa T. Moss, “Introduction to Data Warehousing,” Data Warehouse Project Consolidating the tail to obtain stages as before yields the schedule in Table 4: Management (Pearson, 2000), pp. 1-26. http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing visited January 30, 2014. 3 8 8 Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER Why Enterprise Data Warehouse Projects Fail, and What to do About it Mosaic Data Science, January 2014 Copyright © 2014 Mosaic Data Science. All rights reserved. Introduction Stage Source Tables Everyone knows data warehouses are risky. It is an IT truism that enterprise data 1 252 warehouse (EDW) are unusually risky. One paper on the subject begins, “D 2 projects 102 warehouse projects are notoriously difficult to manage, and many of them end in 3 55 on EDW project management reports that “the most experienced failure.”1 A book 4 70 project managers 5 [struggle] with 60 EDW projects,” in part because “estimating on warehouse projects is very difficult [since] each data warehouse project can be so 6 41 articles about EDW risk cite the laundry lists of EDW failure and r different.”2 Many 7 33 types in Chapter8 4 “Risks” of53 the same book, arguing that EDW project riskiness deri from the multitude of risk factors inherent in EDW projects. Likewise, experts such a Ralph KimballGenerated offer numerous guidelinesSchedule for decreasing EDW project risk.3 Table 4: Sample Manually Consolidated One risk factorisstands out. that EDW projects are complex, and have ma The steady drop in source tables per stage striking, as isWe theagree difference in absolute risk factors. We also agreeAfter that consolidation, in general, complex projects are more difficult to deviations between consecutive source-table counts. the largest manage. Having said that, several decades of experience with EDW projects lead us stage of the greedy algorithm's schedule contains 2.9 times as many source tables as the focus on one particular pattern of EDW project failure that smallest; but the largest manual stage requires 7.6 times as many source tables as the we believe is largely responsible for EDW projects’ reputation for high risk. This white paper describes th smallest. See Figure 2 below: failure pattern, explains the risk factor behind the pattern, and suggests how to avoid risk while delivering value as quickly as possible throughout the project lifespan. Source Tables per Stage 300 The Failure Pattern Source Tables 250 The pattern proper. The failure pattern is surprisingly mundane: 200 150 1. The EDW project’s business sponsors agree to 100 project. 2. In exchange for sponsorship, the sponsors pressure the project team to deliver substantial business value early—typically within three months of project ons More generally, the sponsors require a8project roadmap that promises frequen 3 4 5 6 7 delivery of major new increments of functionality. Schedule Stage 50 0 1 Greedy Manual fund the Ideal 2 3. Eager for a chance to build the EDW, the project team negotiates and commit frequent schedule, starting with delivery of some high-visibility dat Figure 2: SourceaTables vs.delivery Stage for Both Schedules reports at the end of the first project phase. Figure 3 below presents both schedules' source-table burn rates against an ideal (constant) rate: 1 Robert M. Bruckner, Beate List, and Josef Schiefer, “Risk Management for Data Warehouse System Lecture Notes in Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). 2 Sid Adelman and Larissa T. Moss, “Introduction to Data Warehousing,” Data Warehouse Project Management (Pearson, 2000), pp. 1-26. 3 http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing visited January 30, 2014. 9 9 Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER Why Enterprise Data Warehouse Projects Fail, and What to do About it Mosaic Data Science, January 2014 Copyright © 2014 Mosaic Data Science. All rights reserved. IntroductionBurn Rates 800 Source Tables 700 600 500 400 300 200 100 0 1 2 Everyone knows data warehouses are risky. It is an IT truism that enterprise data warehouse (EDW) projects are unusually risky. One paper on the subject begins, “D warehouse projects are notoriously difficult to manage, and many of them end in that “the most experienced failure.”1 A book on EDW project management reports Greedy project managers [struggle] with EDW projects,” in partManual because “estimating on Ideal warehouse projects is very difficult [since] each data warehouse project can be so different.”2 Many articles about EDW risk cite the laundry lists of EDW failure and r types in Chapter 4 “Risks” of the same book, arguing that EDW project riskiness deri from the multitude of risk factors inherent in EDW projects. Likewise, experts such a 3 4 6 guidelines 7 8 Ralph Kimball offer5numerous for decreasing EDW project risk.3 Schedule Stage One risk factor stands out. We agree that EDW projects are complex, and have ma risk factors. We also agree that in general, complex projects are more difficult to Figure 3: Burn Rates (Greedy vs. Manual vs. Ideal Schedule) manage. Having said that, several decades of experience with EDW projects lead us focus on one particular pattern of EDW project failure that we believe is largely The greedy algorithm achieves far greater load leveling, while choosing at each stage the responsible for EDW projects’ reputation for high risk. This white paper describes th functional areas that will deliver the most utility possible in an acceptable amount of failure pattern, explains the risk factor behind the pattern, and suggests how to avoid time. risk while delivering value as quickly as possible throughout the project lifespan. Using the PDG and scheduling algorithm to negotiate project schedules. The key benefit of using the PDG and scheduling algorithm is not to optimize the project The Failure Pattern schedule. After all, once the project is complete, the business will probably continue to benefit from the EDW’s data and reports for years, while the project may only last for The pattern proper. The failure pattern is surprisingly mundane: between six and eighteen months. 1. The EDW project’s business sponsors agree to fund the project. Rather, the main benefits are twofold: • • 2. In exchange for sponsorship, the sponsors pressure the project team to deliver The project team can identify all of thebusiness time requirements implicit in a within proposed substantial value early—typically three months of project ons project schedule. This complete understanding of time requirements, together More generally, the sponsors require a project roadmap that promises frequen with a realistic load-leveling constraint, lets the project team avoid early overdelivery of major new increments of functionality. commitment and the resulting loss of credibility and risk of project failure. 3. Eager for a chance to build the EDW, the project team negotiates and commit The project team can use the PDG to help the business team a frequent delivery schedule,sponsors starting and withproject delivery of some high-visibility dat develop a shared understanding of the scheduling constraints that the EDW’s reports at the end of the first project phase. subject matter imposes on the project. This shared understanding can help the business sponsors accept realistic schedule requirements, recognizing that they 1 cannot have everything theyM.want threeBeate months the Schiefer, project starts. Robert Bruckner, List, after and Josef “Risk Management for Data Warehouse System Lecture Notes in Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). 2 Sid Adelman Larissa T. Moss, to Data We encourage the project team to print aand large copy of the“Introduction PDG, and use it inWarehousing,” schedule Data Warehouse Project Management (Pearson, 2000), pp. 1-26. negotiations with project sponsors to illustrate scheduling realities by 3 http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing visited January 30, 2014. • circling all of the entity types required by a popular functional area, • marking the entity types required by some high-utility reports (to illustrate that they tend to lie in the PDG’s late stages), and 10 10 Copyright © 2014 Mosaic Data Science. All rights reserved. A MOSAIC DATA SCIENCE WHITE PAPER WhyScience. Enterprise Copyright © 2014 Mosaic Data AllData rightsWarehouse reserved. Projects Fail, and What to do About it Mosaic Data Science, January 2014 challenging businessIntroduction sponsors to construct a project schedule consistent with the PDG and load-leveling constraint that delivers total utility over the project warehouses are risky. It is an truism that enterprise data lifespan comparableEveryone to the totalknows utilitydata delivered by the project schedule thatITthe warehouse optimization algorithm outputs.(EDW) projects are unusually risky. One paper on the subject begins, “D warehouse projects are notoriously difficult to manage, and many of them end in 1 A book on EDW project management reports that “the most experienced failure.” In our experience, this approach quickly persuades business sponsors that they can only managers [struggle] with EDW projects,” partto because “estimating on get so much so fast. In fact,project the approach frees the sponsors and project teaminalike is verystage, difficult [since] each datatowarehouse project can be so change the schedule freely warehouse at the end ofprojects each project using the algorithm 2 Many articles about EDW risk cite the laundry lists of EDW failure and r different.” recompute an optimal schedule after changing report utilities to reflect changing business types in Chapter 4 “Risks” of the same book, arguing that EDW project riskiness deri priorities. Paradoxically, this more formal approach to project scheduling ends up from multitude of risk factors inherent in EDW projects. helping the project be as agile asthe possible, as well as minimizing the risk that the project Likewise, experts such a 3 will miss its schedule.17 Ralph Kimball offer numerous guidelines for decreasing EDW project risk. • One risk factor stands out. We agree that EDW projects are complex, and have ma risk factors. We also agree that in general, complex projects are more difficult to manage. Having said that, several decades of experience with EDW projects lead us focus on one particular pattern of EDW project failure that we believe is largely responsible for EDW projects’ reputation for high risk. This white paper describes th failure pattern, explains the risk factor behind the pattern, and suggests how to avoid risk while delivering value as quickly as possible throughout the project lifespan. The Failure Pattern The pattern proper. The failure pattern is surprisingly mundane: 1. The EDW project’s business sponsors agree to fund the project. 2. In exchange for sponsorship, the sponsors pressure the project team to deliver substantial business value early—typically within three months of project ons More generally, the sponsors require a project roadmap that promises frequen delivery of major new increments of functionality. 3. Eager for a chance to build the EDW, the project team negotiates and commit a frequent delivery schedule, starting with delivery of some high-visibility dat reports at the end of the first project phase. 1 Robert M. Bruckner, Beate List, and Josef Schiefer, “Risk Management for Data Warehouse System Lecture Notes in Computer Science Vol. 2114, pp. 219-229 (Springer Verlag, 2001). 2 Sid Adelman and Larissa T. Moss, “Introduction to Data Warehousing,” Data Warehouse Project Management (Pearson, 2000), pp. 1-26. 3 http://www.kimballgroup.com/2009/04/26/eight-guidelines-for-low-risk-enterprise-data-warehousing visited January 30, 2014. 17 In particular, it becomes natural to treat a project phase as an agile/scrum iteration. 11 11