Why Enterprises Need Trustworthy Data TDWI
June 2014
TDWI E-Book
Why Enterprises Need Trustworthy Data

Q&A: Building Trust in Your Data
The Ramifications of Trusted Data
Fostering Confidence in Data
About IBM

Sponsored by: tdwi.org

Q&A: Building Trust in Your Data

Without accurate and trustworthy data, the value of enterprise analytics will be diminished. In this Q&A, we discuss governance, data quality, and the role of the chief data officer.

Enterprises are increasingly recognizing the importance of having accurate, trustworthy data in their data warehouses. Data governance can help ensure data quality, but it need not be heavy-handed and inflexible.

We spoke to Paula Wiles Sigmon, program director of product marketing for the InfoSphere Information Integration and Governance portfolio at IBM Corporation’s Information Management division, about what enterprises are doing to improve the quality of their data, who should be responsible for data that goes in and comes out of an enterprise data warehouse, and the emerging role of the chief data officer.

TDWI: How have your recent conversations with clients about data warehousing been different from conversations in the past?

Paula Wiles Sigmon: In recent years, most organizations have recognized the importance of a data warehouse and have spoken with us primarily about either how to get started with a new one or how to improve the warehouses already in place. Today, the focus has shifted to the concept of a modernized warehouse, and in particular one that meets the needs of an organization that is starting to take in lots more data from new sources, in more and different forms.

What are the new issues and concerns?

The one we’re hearing most often is a concern about how to take advantage of what’s best in a modern data warehouse as well as what’s best in Hadoop.
After a brief period a year or so ago, when organizations wondered if data warehouses were even needed in a world where Hadoop was an option, most enterprises today realize that the warehouse and the Hadoop environment can and should coexist and complement each other. What organizations are trying to do is figure out how to get it right: how to move the right data from Hadoop to the warehouse, move the right data out of the warehouse when it no longer has value, and make the warehouse a home for good data rather than bad data or questionable data, so that deep analytics can be based on the best available information.

You are focused on “information integration and governance,” not the first thoughts many people have when they start to plan or modernize a data warehouse. Can you help explain the relationship?

It’s certainly true that people considering a new or modernized warehouse tend to focus first on the warehouse itself and not the data flowing into it. Sometimes very early in the process and sometimes a bit later, they realize that the whole point of the warehouse is to provide a foundation for reporting and analytics. If those reports and analytics are not based on the best available information, then the entire warehouse/analytics ecosystem has diminished value to the organization.

We’ve all seen the scenario where business people simply don’t trust the reports they receive. They can base their decisions on them anyway; they can ignore them and make decisions that don’t even pretend to be fact based; they can strike out on their own and try to create their own systems to meet their specific needs, creating more data silos. The options aren’t pretty.

People start asking key questions, such as: How can I create a warehouse that instills confidence among the business users who receive the output?
Tracing back from that question are questions about the factors that tend to build confidence. For example:

• How can I create and manage data quality?
• How can I provide transparency into the lineage of the data (where it originated, who or what has changed it, when it was last updated)?
• How can I make sure I’m keeping the information that’s needed for compliance or for business operations but not the data that has passed its useful life and become a liability rather than an asset?
• How can I provide the best 360-degree view of customers, products, and other key entities despite conflicting information flowing toward the warehouse from multiple sources?
• How can I protect the information in the warehouse from both accidental leaks and intentional breaches?

All these questions are critical to the design of a modern data warehouse ecosystem, and they all point to the importance of information integration and governance.

Some people raise concerns about governance as a heavyweight undertaking that can slow down projects or increase costs. What is your view?

We certainly have heard that concern. It tends to result from the belief that governance is a “one size fits all” undertaking. If that were the case, the concern would probably be valid. If we needed to define our governance practices for the critical data from internal systems that feed our financial reporting, and then apply these practices to comments gleaned from social media (and intended to provide a sense of the market or some additional insights into a particular customer’s interests), we would end up with a heavy-handed approach to the second set of data that is neither practical nor reasonable.

However, that isn’t the case. One size doesn’t fit all. What organizations need to do instead is agree that all data brought into the organization needs some level of governance, but understand that the levels vary based on both the data source and the intended use of the data.
Then we end up with a situation where we have appropriate controls in place, where users can have confidence in the information at their disposal, but where there is plenty of agility to adapt to new data sources without a governance sledgehammer.

We’ve heard customers describe their own approaches to rightsizing data governance. For example, some use a time-boxed approach, accepting data from a new source into a special test zone for a short period of time while it is evaluated in isolation. Then, if it is determined to have longer-term value, it must move to another zone where more controls are in place. Another approach is to classify data according to intended uses, and set up governance zones according to those classifications.

So many conversations today focus on what comes out of the data warehouse, in particular the analytics that organizations need to drive the business. What is your perspective?

I believe that’s an important focus. If we put data into a warehouse and never produced any output, it would not serve the business. Today’s organizations are moving more and more to a data-driven approach to decision making. All eyes are on the analytics, and that’s appropriate.

It’s our view, though, that any close examination of analytics must ultimately lead back to the underlying data. If it’s outdated, inconsistent, or derived from questionable sources, the organization’s best efforts at data-driven decisions can lead to some disastrous results.

Should business people, in particular, be concerned about what goes into the warehouse?

Yes. No business person should be comfortable making decisions based on questionable data. For those who want to take a look under the covers, the facts about what goes into the warehouse should be transparent.
For others who don’t really want to take a deep dive into data lineage as an ongoing practice, there still should be clear answers to questions about where and when the data originated, how it is secured, and so on. In fact, we have found that coming to a clear understanding of information, its history, and its meaning is an important area of collaboration between IT and the business. That collaboration is essential to good governance and also confidence in data.

Are you starting to deal with individuals in any roles that previously were not involved in data warehousing?

As CMOs grow in their importance as consumers of information and analytics and as investors in technology, they clearly care more and more about the data coming out of the warehouse, if not in the warehouse itself.

The significant new players we’re seeing in data warehouse conversations are the data scientists. These folks are popping up everywhere as they look to apply varied techniques to derive meaning from data, whether it is in a data warehouse or elsewhere. Their role seems to be taking on new importance in a big data world, where there is more data to which they can apply their skills.

How are chief data officers (CDOs) getting involved with functions such as data warehousing and analytics?

Although most organizations don’t yet have CDOs, the CDO population is growing rapidly, especially in industries such as government and financial services. The CDO job description varies from one organization to the next, but often the CDO has either direct or dotted-line responsibility for analytics. The CDO’s role is all about governance (a control-oriented objective) combined with a creative search for ways to drive value from information. Because much of the information in question resides in the data warehouse, the CDO naturally takes an interest in that data, how it’s managed, and the value it contains.
Is there any connection between emerging roles such as the CDO and the issue of confidence (or lack of confidence) in data?

There isn’t enough data yet to define a causal relationship between the presence of a CDO and an increase in user confidence in data, but we do see a correlation between the two, and that makes sense. The very presence of a chief data officer means that the organization cares enough about data and its business value that it has created a new role to focus on data. The CDO is a C-level executive who thinks every day about ways to make data better and to make it work better to support the goals of the organization. If the CDO is a good communicator (and every CDO should be), then it just makes sense that people in the organization will understand more about their data and trust it more because it is getting top-level attention.

What does IBM bring to the table for organizations planning a new or modernized data warehouse?

IBM offers a complete portfolio of tools and solutions ranging from data warehouse appliances to data exploration tools to a Hadoop distribution that’s perfect for landing big data or offloading data from a warehouse within a zone architecture.

For information integration and governance (so important to data warehouses that inspire confidence in business users), the IBM InfoSphere product family, part of Watson Foundations, enables organizations to integrate data from diverse sources at high speed, establish and maintain data quality, foster business/IT collaboration, manage master data, manage data across its life cycle, and enhance data security and privacy.

Whether organizations are just getting started with a new data warehouse or expanding and enriching an existing warehouse environment, IBM has not only the tools but also the expertise to help accelerate success.
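The rightsizing approaches described earlier in this Q&A (a time-boxed test zone for new sources, and governance zones keyed to intended use) can be made concrete with a small sketch. This is only an illustration under stated assumptions, not anything the interview or IBM prescribes: the zone names, the 30-day evaluation window, the "expired" outcome, and the intended-use labels are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Zone(Enum):
    TEST = "test"          # isolated, time-boxed evaluation area
    GOVERNED = "governed"  # longer-term home where more controls apply
    EXPIRED = "expired"    # hypothetical: window lapsed without promotion


# Hypothetical window; the interview only says "a short period of time".
EVALUATION_WINDOW = timedelta(days=30)

# Hypothetical mapping from intended use to governance level, echoing the
# contrast between financial reporting data and social media comments.
GOVERNANCE_BY_USE = {
    "financial_reporting": "strict",
    "market_sentiment": "light",
}


@dataclass
class DataSource:
    name: str
    landed_on: date                    # day the source entered the test zone
    has_long_term_value: bool = False  # set by whoever evaluates the source


def assign_zone(source: DataSource, today: date) -> Zone:
    """Place a source in a zone from its evaluation result and its age."""
    if source.has_long_term_value:
        return Zone.GOVERNED
    if today - source.landed_on <= EVALUATION_WINDOW:
        return Zone.TEST
    return Zone.EXPIRED


def governance_level(intended_use: str) -> str:
    """Every use gets some governance; unknown uses default to strict."""
    return GOVERNANCE_BY_USE.get(intended_use, "strict")
```

Defaulting unknown uses to the strict level is a design choice made here for safety; it reflects the point that all data needs some level of governance, while leaving room to relax controls once a use has been classified.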
The Ramifications of Trusted Data

By Philip Russom

What is “trusted data” and how do we achieve it?

The term trusted data has been bandied about a lot lately, and everyone seems to have a different definition. It’s an important concept, so in this column I will define the term and consider the ramifications for business intelligence and other data-driven business processes.

For some, trusted data is an emotional matter. In the context of business intelligence (BI), they want to feel confident that data presented in reports, cubes, dashboards, and other BI products is in the best condition possible because data in good condition makes their jobs easier and their actions more accurate, timely, effective, and compliant. People who consume data through operational applications have similar concerns.

Emotions aside, the condition of trusted data is easily quantified. Namely, condition is a technical measurement of data’s completeness, quality, age, schema, profile, and documentation. The assumption is that trusted data should come from carefully selected sources, be transformed in accordance with data’s intended use, and be delivered in formats and time frames that are appropriate to specific consumers of reports and other manifestations of data.

Hence, the trustworthiness of data is quantified mostly by the technical properties that define its condition. You still need to be mindful of the emotional impact that data’s condition has on the perceptions of people who consume the data for BI.

Why We Need Trusted Data

A lack of trusted data leads to a number of poor practices. For example, if BI data isn’t trustworthy because it’s not in good condition, then poor decisions are based on poor data.
Whether data is in good condition or not, the mere perception that the data’s not trusted can lead users to ignore supplied BI data and instead build their own BI data stores, such as rogue data marts, spreadsheets, and personal productivity databases.

Less-than-trustworthy data creates problems in operational processes, too. When data is (or is perceived to be) of poor quality or incomplete, users and managers base tactical and operational decisions on guesses.

If you can turn these problems around, then trusted data has benefits. Whether in BI or operations, people will use data they trust, which in turn leads to greater consistency, compliance, and accuracy in business processes based on the data.

How to Get Trusted Data

Achieving trustworthiness for data is a multi-step process:

• Select sources that are appropriate, certified, and diverse
• Process data to transform and aggregate it for the intended use, improve its quality, and enhance its meta- and master data
• Deliver data in the right time frame, in forms suited to its intended use
• Collaborate cross-functionally and govern data for compliance

The process and its best practices involve a mix of:

• Organizational effort: Business and technical people and teams, in collaboration
• Technical automation: Information management tools and techniques for integration, quality, databases, applications, metadata and master data management, and so on

A Final Word

Giving business and technical users data they trust is key to good business intelligence. If users don’t have confidence in the data of a data warehouse and other BI data stores, they may argue over the data’s accuracy, refuse to use the reports and analyses fed from the data, or build their own data stores. Non-trusted data likewise hinders operational excellence, as people sidestep prescribed processes and misuse applications. All these paths are nonproductive and lead to faulty decision making and operational actions.

To learn more, replay Philip Russom’s TDWI Webinar on Trusted Data for BI.

Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 500 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at [email protected], @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.

Fostering Confidence in Data

Analytics has little value if users don’t have confidence in their data. Governance practices combined with data management best practices can enhance confidence in data and give decision makers confidence in their analysis.

Now as ever, the data warehouse (DW) is a good idea. Ask a team of experts to design a data architecture capable of delivering accurate, trustworthy information and timely analytic insights and the finished product would closely resemble the DW. Even though trends such as big data, advanced analytics, and the cloud are massive forces for disruption, they don’t, singly or in combination, obviate or nullify the data warehouse.
The DW design balances the business decision makers’ need for qualified and consistent information (a need for history: a six-month, one-year, or even five-year perspective on what’s going on with the business) with their need for access to timely and, above all, trustworthy insights and information.

Trust can best be won by ensuring the consistency, cleanliness, correctness, lineage, and security of data, argues Praveenkumar Hosangadi, product marketing manager with IBM’s Information Management team. He says research shows that one in three business leaders doesn’t trust the information used to make important decisions.

“The tsunami of data that we are seeing now will only add to the data uncertainty. One of the critical success factors for a modern data architecture is the implementation of information governance. Governance practices, including data quality, data life cycle management, master data management, and data security and privacy, when blended with best practices in data management, can enhance confidence in data and enable confident decision making.

“Confidence in data is very important to the adoption of big data and analytics, as questionable data can lead to questionable insights, with no value as a basis for enterprise decisions or operations,” he argues. “Data confidence is especially vital for business users who make high-impact decisions based on insights into data. If users lack confidence in their data, they will lack confidence in the results. Data confidence is all the more important in the big data world because so much of the data growth is coming in forms and from sources whose reliability is questionable.”

Mutual Complementarity

Hosangadi doesn’t question the importance of Hadoop and other big data technologies, however. What’s intriguing, he suggests, is just how neatly the DW and big data complement one another.
“The traditional data warehouse ecosystem needs to adapt to big data scale and support a wider variety of data types. This does not mean ripping and replacing infrastructure. Instead, what’s needed is the right strategy and the right combination of technologies,” he says. “As more big data use cases are emerging and more organizations are gaining experience with big data, it is clear that the data warehouse and other big data technologies such as Hadoop complement each other. The analytic performance, the capability sets, the value per byte of data stored, and the costs involved are totally different.”

The keyword, again, is complementarity: big data technologies support workloads and use cases that the DW itself cannot cost-effectively perform. These include advanced analytics (particularly in conjunction with multi-structured data types) and the use of the Hadoop platform as a landing zone or even as a persistent store for staging and transforming data. (MapReduce, Hadoop’s built-in parallel processing engine, is a big help in this regard.)

“It is important to view a data warehouse as an ecosystem that fosters analytic and BI systems. The more insights organizations seek, the more crucial the modernization step is,” Hosangadi says. “To adapt to the big data world, organizations need to consider a few types of modernization, including preprocessing of data in a ‘landing zone’ to determine what should be moved to the warehouse, facilities for offloading infrequently accessed data from the warehouse, and exploration of big data to discover new, high-value information and free up the warehouse for deeper analytics.”

Complementarity is, by definition, a quid-pro-quo proposition. Just as big data technologies largely complement the data warehouse, Hosangadi suggests, so, too, does the DW fill a complementary role vis-à-vis Hadoop and other big data technologies.
“[O]rganizations need to make sure that their capabilities in areas such as data quality and transformation of data that’s moved to the data warehouse are up to the task, ready to handle high volumes at high speed, and [able to] handle the diversity of data types that [an] organization [today] needs to be able to handle.”

Complementarity also means using the right platform for the right workloads. The Hadoop platform bundles a distributed file system (HDFS) with a built-in massively parallel processing (MPP) compute layer, the MapReduce engine. For this reason, some vendors now position Hadoop as a one-stop platform for all things, including the decision support and analytic workloads traditionally performed by MPP data warehouse systems.

This is a huge mistake, argues Hosangadi, inasmuch as it miscasts Hadoop for a role that it cannot possibly fulfill. “Hadoop as a big data platform is not a substitute for an MPP data integration engine because data integration capabilities have not yet matured on the Hadoop stack. Features such as data cleansing, data profiling, and the capture of changed data (capabilities of today’s powerful integration systems) are not available in Hadoop,” he contends. “Organizations need to take advantage of Hadoop for providing functions such as storage of big data in a landing zone while continuing to exploit the rich, production-proven capabilities of data integration systems for getting trustworthy data into the enterprise data warehouse.”

Increasing Confidence

There’s another, often-overlooked aspect of trustworthiness, Hosangadi points out: security. Organizations must not only tend to the consistency and quality of the data that they load into a DW, but must also protect that data against breaches from within or without.
Moreover, as it develops and “productionizes” big data technologies, an organization must also determine how to integrate information with its data warehouse in such a way as to substantively address data consistency and quality issues, as well as governance requirements. After all, the data warehouse (or a specialized, usually MPP, analytic database) is the logical destination for the analytic insights that are identified and refined in a big data platform such as Hadoop, as well as for the smaller, conformed data sets that are to be prepared and exported from Hadoop when it is used as a landing zone or persistent store for data of all kinds.

The key issue is that business decision makers and knowledge workers must be able to trust this information. “In next-generation data warehouses, information integration and governance capabilities enable organizations to manage data growth with a scalable, high-performance platform and deliver information that is worthy of knowledge workers’ trust,” Hosangadi observes, citing a number of issues that he says “are keys to confidence in information.”

These include:

• System integrity: Data must be consistent across different systems
• Data governance: Governance policies must be in place, and, more important, must be enforced
• Data completeness: Records must be complete, with a common view of master data records
• Data correctness: Data must be validated, verified, and standardized
• Data currency: Data must be up to date
• Data lineage: The source or lineage of data must be known and qualified
• Data security and protection: Data must be safeguarded against breach and/or data loss

When dealing with data at big data scale, there’s a temptation to throw in the towel on some core data management tenets, such as data quality and governance. This isn’t just a mistake, Hosangadi argues, it’s unnecessary. It might sound heretical, but it’s frankly OK to relax quality or governance rules for certain kinds of data.

“Data abundance makes data quality and governance much more relevant today than ever before. Participants in a recent Twitter chat mentioned that they spend anywhere from 40 to 80 percent of their time searching for the right information. The more time spent locating the right information, the less time available for analytics and innovation,” he notes.

“Different types of data require different levels of governance,” Hosangadi continues. “For example, customer data, product data, and data to be used for financial planning, budgeting, and forecasting require maximum control and governance, whereas social network data and unstructured external data, when used to assess high-level market trends, typically need much less governance. So while data of all types should be governed in some way, organizations can be smart about their governance implementations and invest in the areas of greatest need.”

About IBM

www.ibm.com

IBM offerings for the data warehouse environment include data warehouse appliances, data exploration tools, a Hadoop distribution, and an information integration and governance portfolio that is critical to delivering the best available data to the warehouse.

As a critical element of Watson™ Foundations, the IBM big data and analytics platform, InfoSphere Information Integration and Governance (IIG) provides market-leading functionality to handle the challenges of big data. It provides optimal scalability and performance for massive data volumes, agile and right-sized integration and governance for the increasing velocity of data, and support and protection for a wide variety of data types and big data systems. It enables organizations to have a clear understanding of information, its history, and its meaning, facilitating collaboration between IT and business.

InfoSphere capabilities include metadata, business glossary, and policy management; data integration; data quality; master data management (MDM); data life cycle management; and data security and privacy. Together, these capabilities help make data warehousing, big data, and analytics projects successful by delivering business users the confidence to act on insight. Details are available at ibm.com/software/data/information-integration-governance.

• Trusted Information
• Information Integration and Governance

tdwi.org

TDWI, a division of 1105 Media, Inc., is the premier provider of in-depth, high-quality education and research in the business intelligence, data warehousing, and analytics industry. TDWI is dedicated to educating business and information technology professionals about the best practices, strategies, techniques, and tools required to successfully design, build, maintain, and enhance business intelligence, data warehousing, and analytics solutions. TDWI also fosters the advancement of business intelligence, data warehousing, and analytics research and contributes to knowledge transfer and the professional development of its members. TDWI offers a worldwide membership program, five major educational conferences, topical educational seminars, role-based training, on-site courses, certification, solution provider partnerships, an awards program for best practices, live Webinars, resourceful publications, an in-depth research program, and a comprehensive website, tdwi.org.

© 2014 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to [email protected]. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.