You can model and predict ... integrating data across call centers, ... Abstract
Transcription
You can model and predict ... integrating data across call centers, ... Abstract
4th International Conference on Humanities, Geography and Economics (ICHGE'2013) Oct. 6-7, 2013 Dubai (UAE) A New Data Management Challenge: How to Handle Big Data Zeljko Panian You can model and predict customer behaviors by integrating data across call centers, blogs, forums, and social media platforms into deeper analytics. You can deploy better call-center metrics for customer management or even create an effective micro-targeting and micro-segmentation model for new customer acquisition with better response rates of acceptance. If an enterprise deals with products and/or services, with Big Data analytics you can create powerful models for trends, behaviors, and markets, and you can solve research and ideation issues by leveraging “crowd-sourcing” models and embedding analytical results from your work [2]. If you work in the utility industry, you can create predictive models of consumer markets by deploying technologies such as a smart grid. This would create revenue opportunities in advisory services and provide better models for rate management. Another popular area where Big Data analytics is being used is health care. For example, service providers can leverage Big Data to deploy Body Area Networks, helping lower patient costs while providing "patient-centric" services. Lowering costs and enabling efficiencies are critical goals for hospitals, nursing homes, and caregivers. Another application of Big Data is to optimize clinical trials to prevent errors, reduce costs, and ensure compliance and ensure you're meeting regulatory requirements consistently. Although these analytics are partially fulfilled today, their expansion will enable proactive approaches rather than reactive ones. There is considerable focus these days on new technology and it’s possible to integrate survey, social media feedback, and participation information into traditional platforms. These data points can be represented in analytic and reporting visualizations, helping hospitals and care providers (for example) to improve their quality of service [3]. Abstract—Humans and their machines are cranking out digital information at a mind-boggling rate. Therefore, scientists coined a special term – “Big Data” – to convey the extraordinary scale of the data collections now being amassed inside public and private organizations and out on the Web. Big Data has been of concern to organizations working in select fields for some time, such as the physical sciences (meteorology, physics), life sciences (genomics, biomedical research), government (defense, treasury), finance and banking (transaction processing, trade analytics), communications (call records, network traffic data), and, of course, the Internet (search engine indexation, social networks). While some research organizations may rely on supercomputers to meet their Big Data needs, we propose a toolbox to be offered like Software-as-a-Service (SaaS) with tools accessible to organizations of all sizes and types. Keywords— Big Data, Business Intelligence, Data Management, Software-as-a-Service. I. INTRODUCTION F UELED by the pervasiveness of the Internet, unprecedented computing power, ubiquitous sensors and meters, addictive consumer gadgets, inexpensive storage and (to-date) highly elastic network capacity, we humans and our machines are cranking out digital information at a mindboggling rate. No wonder then scientists coined a special term – “Big Data” – to convey the extraordinary scale of the data collections now being amassed inside public and private organizations and out on the Web [1]. Big Data provides access to data that was always available but not consumable. The expansion of Big Data into Business Intelligence (BI) platforms enables BI professionals to leverage expanded analytics and create 360° perspectives. Let us provide a few sample areas that can benefit from BI and Big Data analytics. For customer relationship management systems, you can create powerful visualizations of customer sentiments, wish lists, and actual customer response data from campaigns to measure true campaign effectiveness. II. BIG FATA CHARACTERISTICS A. What Exactly Is Big Data Big Data is more a concept than a precise term. Some apply the ”Big Data” label only to petabyte-scale data collections (> one million GB) [4]. For others, a Big Data collection may house ‘only’ a few dozen terabytes of data [5]. More often, however, Big Data is defined depending on situation rather than by size. Specifically, a data collection is considered “Big Data” when it is so large an organization cannot effectively or affordably manage or exploit it using conventional data Zeljko Panian, Ph.D. is professor at the Faculty of Economics and Business, University of Zagreb, Croatia (phone: 385-1-238-3217; fax: 385-1233-5633; e-mail: [email protected]). 47 4th International Conference on Humanities, Geography and Economics (ICHGE'2013) Oct. 6-7, 2013 Dubai (UAE) management tools. Relationship Management (CRM) system. Or, you can extend a digital product catalog with Web content (like, product descriptions, photos, specifications, and supplier information). You can even use such content to improve the quality of your organization’s master data management, using the Web to verify details or fill in missing attributes. B. Who Is Affected By Big Data? Big Data has been of concern to organizations working in select fields for some time, such as the physical sciences (meteorology, physics), life sciences (genomics, biomedical research), government (defense, treasury), finance and banking (transaction processing, trade analytics), communications (call records, network traffic data), and, of course, the Internet (search engine indexation, social networks). Now, however, due to our digital fecundity, Big Data is becoming an issue for organizations of all sizes and types. In fact, in 2008 businesses were already managing on average 100TB or more of digital content. Big Data has even become a concern of individuals as awareness grows of the depth and breadth of personal information being amassed in Big Data collections (in contrast, some, like LifeLoggers, broadcast their day-to-day lives in a Big Data stream of their own making) [6]. V. DIFFICULTIES WITH BIG DATA ADOPTION Adoption of Big Data is slower than expected in traditional enterprises. There are several reasons for this. At the top of the list is that current business models and goals do not require Big Data integration. Furthermore, there is no perceived additional value offered by Big Data as to the organization -there is no clear business case articulated, and thus no business value calculated. There are other inhibiting factors. Executives lack an understanding of (and thus sponsorship of) Big Data, which also brings processing complexities that create additional stress on IT teams (in terms of maintenance) and business teams (in terms of adoption and usage). In these times of tight budgets, IT teams simply do not have the necessary bandwidth to implement yet another new system or technology. C. Bringing All Components Together For some, Big Data simply means Big Headaches, raising difficult issues of information system cost, scaling and performance, as well as data security, privacy and ownership. But Big Data also carries the potential for breakthrough insights and innovation in business, science, medicine and government—if we can bring humans, machines and data together to reveal the natural information intelligence locked inside our mountains of Big Data. VI. RECOMMENDATIONS FOR BIG DATA ADOPTION IMPROVEMENT In order to implement a Big Data project, here are a few recommendations. First, it would be helpful to create a powerful team that can set up a platform to explore Big Data. This team will work with business and data analysts to create the road map for further execution. Critical success factors include [8]: Availability of IT resources to build and configure the selected Big Data platform; Availability of business SMEs with data and domain expertise; Availability of resources with BI expertise and deep statistical knowledge; Implementation of a technology center of excellence to provide Big Data infrastructure support; Extension of other BI best practices, including data governance, Master Data Management, metadata management, analytics definition, and visualization, to include Big Data; Adequate training for users to understand the new data and its integration into the analytical and reporting platforms. When it comes to people, having a combination of individuals mentioned above will create a team that can leverage each other's skills and create a unified vision for exploring Big Data. While some research organizations may rely on supercomputers to meet their Big Data needs, we propose a toolbox to be offered like Software-as-a-Service (SaaS) with tools accessible to organizations of all sizes and types. These III. BIG DATA OPPORTUNITIES Innovative public and private organizations are already demonstrating that transforming raw Big Data collections into actionable wisdom is possible. They are showing in particular that tremendous value can be extracted from the ”grey” data that makes up the bulk of Big Data, that is to say data that is unused (or under-used) because it has historically been [7]: Too voluminous, unstructured and/or raw (i.e., minimally structured) to be exploited by conventional information systems, or In the case of highly structured data, too costly or complex to integrate and exploit (e.g., trying to gather and align data from dozens of databases worldwide). These organizations are also opening new frontiers in operational and exploratory analytics using structured data (like database content), semi-structured data (such as log files or XML files) and unstructured content (like text documents or Web pages). IV. DATABASE ENRICHMENT Once you can collect, analyze and organize unstructured Big Data, you can use it to enhance and contextualize existing structured data resources like databases and data warehouses. For instance, you can use information extracted from highvolume sources like email, chat, website logs and social networks to enrich customer profiles in a Customer 48 4th International Conference on Humanities, Geography and Economics (ICHGE'2013) Oct. 6-7, 2013 Dubai (UAE) tools include [9]: Data Capture & Preprocessing: ETL (Extract, Transform and Load) Tools; APIs (Application Programming Interfaces)/Connectors; Crawlers; Messaging System. Data Processing & Interaction: NoSQL Systems; NewSQL Systems; Search Engine. Auxiliary Tools: Cloud Services; Visualization tools. Each has a different role to play in capturing, processing, accessing or analyzing Big Data. Let’s look first at data capture and preprocessing tools. With Big Data loads, APIs can cause bottlenecks due to poor design or insufficient computing or network resources, but they have generally proven to be flexible and capable tools for exchanging large-volume data and services. In fact, one could argue the proliferation of public and private APIs has played an important role in creating today’s Big Data world. Nonetheless, you can still sometimes achieve better performance with an embedded ETL tool than an API, or, in the case of streaming data, with a messaging architecture (see Messaging Systems below). Moreover, APIs are generally not the best choice for collecting data from the Web. A crawler is a better tool for that task. There are three main drawbacks to APIs in the Web context: In spite of their proliferation, only a tiny percentage of online data sources are currently accessible via an API. APIs usually offer access to only a limited portion of a site’s data. Formats and access methods are at the owner’s discretion, and may change at any time. Because of this variability and changeability, it can take a significant amount of time to establish and maintain individual API links, an effort that can become completely unmanageable in Big Data environments. Crawlers are software programs that connect to a data source, methodically extract the metadata and content it contains, and send the extracted content back to a host system for indexation [11]. One type of crawler is a file system crawler. This kind of crawler works its way recursively through computer directories, subdirectories and files to gather file content and metadata (like file path, name, size, and last modified date). File system crawlers are used to collect unstructured content like text documents, semi-structured content like logs, and structured content like XML files. Another type of crawler is a Web (HTTP/HTTPS) crawler. This type of crawler accesses a Web site, captures and transmits the page content it contains along with available metadata (page titles, content labels, etc.), then follows links (or a set visitation list) to proceed to the next site. Typically a search engine is used to process, store and access the content captured by crawlers, but crawlers can be used with other types of data management systems (DMS). As with other data collection tools, one needs to configure crawls so as not to place an undue load on the source system – or the crawler. The quality of the crawler determines the extent to which loads can be properly managed. It should also be kept in mind that crawlers recognize only a limited number of document formats (e.g., HTML, XML, text, PDF, etc.).If you want to use a crawler to gather non-supported document formats, data need to be converted into an ingestible format using tools like API connectors (standard with most commercial search engines), source-system export tools, ETL platforms or messaging systems. MOM (Message-Oriented Middleware) systems provide an A. Data Capture & Preprocessing ETL (Extract, Transform and Load) tools are used to map and move large volumes of data from one system to another. They are most frequently used as data integration aids. More specifically, they are commonly used to consolidate data from multiple databases into a central data warehouse through bulk data transfers. ETL platforms usually include mechanisms for ”normalizing” source data before transferring it, that is to say, for performing at least the minimal processing needed to align incoming data with the target system’s data model and specifications, and removing duplicate or anomalous data [10]. In Big Data environments, the Extract process can sometimes place an unacceptable burden on source systems, and the Transform stage can be a bottleneck if the data is minimally structured or very raw (most ETL platforms require an external or add-on module to handle unstructured data). The Load process can also be quite slow even when the code is optimized for large volumes. This is why ETL transfers, which are widely used to feed data warehouses, tend to be executed during off-hours – usually overnight – resulting in unacceptable data latency in some situations. It should be noted, however, that many ETL vendors are developing – or have already developed – special editions of their software to address these limitations. Application Programming Interfaces (APIs) are softwareto-software interfaces for exchanging almost every type of service or data you can conceive, though we focus here on the use of APIs as tools for data exchange or consolidation. In this context, an API may enable a host system to receive (ingest) data from other systems (a “push” API), or enable others to extract data from it (a publishing or a ”pull” API). APIs typically employ standard programming languages and protocols to facilitate exchanges (e.g., HTTP/REST, Java, XML). Specific instances of packaged APIs on a system are often referred to as ”connectors,“ and may be general in nature, like the Java Database Connectivity (JDBC) API for connecting to most common RDBMSs, or vendor/platform specific, like a connector for IBM Lotus Notes. 49 4th International Conference on Humanities, Geography and Economics (ICHGE'2013) Oct. 6-7, 2013 Dubai (UAE) enabling backbone for enterprise application integration. Often deployed within service-oriented architectures (SOA), MOM solutions loosely couple systems and applications through a bridge known as a message bus. Messages (data packets) managed by the bus may be configured for point-to-point delivery (message queue messaging) or broadcast to multiple subscribers (publish-subscribe messaging). They vary in their level of support for message security, integrity and durability. Exchanges between disparate systems are possible because all connected systems (“peers”) share a common message schema, set of command messages and infrastructure (often dedicated). Data from source systems is transformed to the degree necessary to enable other systems to consume it, for example, binary values may need to be converted to their textual (ASCII) equivalents, or session IDs and IP addresses may be extracted from log files and encoded as XML records. APIs for managing this data processing may be embedded in individual systems connected to the bus, or they may be embedded in the MOM platform. Messaging systems were specifically designed to meet the high-volume, high-velocity data needs of industries like finance, banking and telecommunications. Big Data volumes can nonetheless overload some MOM systems, particularly if the MOM is performing extensive data processing—filtering, aggregation, transformation, etc. – at the message bus level. In such situations, performance can be improved by offloading processing tasks to either source or destination systems. integrate NoSQL or Search components to address the 4V challenges of Big Data). Their primary drawback is cost and rigidity (most are integrated hardware/software appliances). Unlike NoSQL solutions, NewSQL systems tend to be commercial rather than open source (though they may incorporate open source components), with their MPP capacity usually achieved through symmetric processing across a large number of processors embedded within a single high-end computer (usually proprietary), or a small cluster of such computers. Search platforms as a complete search engine system that can serve as a multi-purpose, information aggregation, access and analysis platform in addition to meeting classic enterprise or Web search needs. Such a search platform, also referred to as a “unified information access” (UIA) platform, encompasses all core data management functions, though with an NLP (Natural Language Processing) or indexing twist. These functions include: Data capture (crawlers, connectors & APIs); Data storage (cached copies of source content and the index itself); Data processing (NLP and index construction and maintenance); Data access (human and machine information retrieval, faceted navigation and dashboard analytics). A search system is therefore a DMS like its NoSQL and NewSQL counterparts, and it achieves massive scalability in much the same way, i.e., through distributed architectures, parallel processing, column-oriented data models, etc. [13]. However, it is the semantic capabilities and high usability of search-based DMS that make them ideal complements to (and in some cases, alternatives to) NoSQL and NewSQL systems. Tools like natural language search, faceted navigation and data visualization provide users of all skill levels with an instantly familiar way of exploring and analyzing Big Data. That is to say, they allow users to launch any search or analytical task the same way they launch a search on the Web – by entering a phrase or a few keywords in a text box. They also enable users to conduct iterative exploratory analytics simply by clicking on (traversing) dynamic data clusters (represented as text menus or in visual forms like charts or graphs). Search platforms are responsive because they are optimized for fast query processing against large volumes of data (e.g. read operations), and because most of the calculations they use to produce dashboard analytics and ad hoc drilling are automatically executed as part of routine indexing processes: the results are there waiting to be exploited with no processing overhead. For all these reasons, search platforms serve as perfect complements to NoSQL and NewSQL systems, and, in some contexts, provide a pragmatic alternative to them. B. Data Processing & Interaction Today, classic Relational Database Management Systems (RDBMSs) are complemented by a rich set of alternative DMSs (Database Management Systems) specifically designed to handle the volume, variety, velocity and variability of Big Data collections (the so-called “4Vs” of Big Data). These DMS include NoSQL, NewSQL and Search-based systems. All can ingest data supplied by any of the capture and preprocessing tools discussed in the last section (ETLs, APIs, crawlers or messaging systems). NoSQL systems are distributed, non-relational databases designed for large-scale data storage and for massively-parallel data crunching across a large number of commodity servers [12]. They can support multiple activities, including exploratory and predictive analytics, ETL-style data transformation, and non-mission-critical OLTP (Online Transaction Processing), like, e.g., managing long-duration or inter-organization transactions. Their primary drawbacks are their unfamiliarity, and, for the youngest of these largely opensource solutions, their instability. NewSQL systems are relational databases designed to provide ACID (Analysis Console for Intrusion Databases)compliant, real-time OLTP and conventional SQL-based OLAP in Big Data environments. These systems break through conventional RDBMS performance limits by employing NoSQL-style features such as column-oriented data storage and distributed architectures, or by employing technologies like in-memory processing, SMP or MPP (some go further and C. Auxiliary Tools To meet data management needs, Cloud services should be used, including: 50 4th International Conference on Humanities, Geography and Economics (ICHGE'2013) Oct. 6-7, 2013 Dubai (UAE) Data acquisition services; Data processing (computational) services; Data storage services. In terms of data acquisition, commercial firms are offering a growing range of data sets that companies can crossreference with their own Big Data, or with each other, to yield new views and insights. These largely commercial offerings are complemented by an ever-expanding number of public data sets being published on the Internet by government, educational and scientific organizations. On the processing front, Amazon, Google, Microsoft and others are further enabling companies to use their massive MapReduce-based computing infrastructures to process or analyze these collections. In the case of data storage, many specialty providers of storage/back-up/recovery solutions as well as general Web services companies like Amazon and Microsoft now offer NoSQL-based solutions to help companies affordably store massive collections of semi-structured or unstructured data for archiving or analysis – data companies might otherwise not even have retained. All these diverse Cloud services are helping organizations of all sizes and types work around the technical and financial barriers to exploiting Big Data. Visualization implies representing Big Data in visual form helps make it comprehensible to human beings. It is such an effective aid that most science, engineering and Business Intelligence (BI) software features embedded 2D and 3D data visualization and navigational tools. On the part of major BI vendors – including SAP Business Objects, IBM Cognos, MicroStrategy, SAS Institute, and Information Builders – visualization capabilities include, for example, interactive bar charts, dashboard gauges, pie charts and geographic mapping [14]. SBA engines like CloudView offer this capability as well, generating navigable representations like heat maps, scatter plots, charts, bullet graphs, relationship graphs, tag clouds, sliders, wheels and geospatial maps. Visualization is a terrific tool for summarizing large sets, and for discovering and exploring unexpected relationships and trends, but it is important to select the right type of representation for a given data set and analytical end. Otherwise, you might wind up with a representation that is misleading, confusing or just plain unreadable. To mitigate this risk, and to make the best use of visualization in general, make sure the tool you use produces fully interactive representations. actionable wisdom is possible. They are showing in particular that tremendous value can be extracted from the “grey” data that makes up the bulk of Big Data. It can be used for many valuable purposes, but it also creates some new challenges as far as data management methodology is concerned. In the paper, we have explained a proposition of a new type of toolbox for Big Data management with tools accessible to organizations of all sizes and types. REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] VII. CONCLUSION Big Data, as the extraordinary scale of the data collections now being amassed inside public and private organizations and out on the Web, occupies attention of professionals, managers, and businessmen. Innovative public and private organizations are already demonstrating that transforming raw Big Data collections into 51 J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies. Hoboken, NJ: John Wiley & Sons, Inc., 2013, p 16 G. Cokins, „Big Data versus Big Value,“ Information Management News, Apr. 24, 2013, http://www.information-management. com/news/ big-data-versus-big-value-100243241.html?ET=informationmgmt:e9152:148552a:&st=email&utm_source= editorial&utm_medium=email&utm_campaign=IM_Daily_042413 M. Walker, „Big Data Analytics Infrastructure,“ personal blog, http://www.datasciencecentral.com/profiles/blogs/big-data-analyticsinfrastructure, Dec. 26, 2012. C. Clark. “Case in Brief: Avoiding Intrusiveness in the Age of Big Data.” 1 to 1 Magazine, Feb 08, 2012, http://www.1 to1media. com/ view.aspx?docid=33448&utm_source=1to1weekly&utm_medium=H&u tm_campaign=02272012 J. Kolb and J. Kolb, Secrets of the Big Data Revolution, Plainfield, IL: Applied Data Labs, 2013, p 33 G. Gruman, “Tapping into the power of Big Data,” in Technology Forecast, Issue 3, PriceWaterhouseCoopers, NY: New York, 2010. J.-P. Dijcks, “Q&A: Driving Business Value from Big Data,” TDWI BI News, Nov. 9, 2011, http://tdwi.org/articles/2011/11/09/lesson-drivingbusiness-value-from-big-data.aspx J. Ericson, Information Management Blogs, Aug. 3, 2010, http://www.information-management.com/blogs/big_data_hype_reality100184701.html?ET=informationmgmt:e1653:148552a:&st=email&utm_source= editorial&utm_medium=email&utm_campaign=IM_Blogs_080410 Z. Panian, Alati za upravljanje tzv. velikim podatcima. Zagreb, Croatia: Ekonomski fakultet, to be published. M. Golfarelli and S. Rizzi, Data Warehouse Design: Modern Principles and Methodologies, Bologna, Italy: The McGraw-Hill Companies, S.r.l.-Publishing Group Italia, 2009, pp. 279-274. J. Grappone and G. Couzin, Search Engine Optimization: An Hour a Day, Third Edition, Indianapolis, IN: John Wiley Publishing, Inc., 2011, ch 4. R. Sasirekha, “NoSQL, the database for the Cloud,“ Tata Consultancy Services, http://www.tcs.com/consulting, 2011. S. Ghandeharizadeh, NewSQL Database Management Systems, Los Angeles, CA: The University of Southern California, 2012, pp 44-53. G. Travinsky, “Big Data Visualization Tools for SMBs,“ Planet Soho, May 6, 2013.